Khoj Private Local AI: Disappointing
Is it possible to maintain digital autonomy and still work with AI? To find out, I experimented with a privacy-preserving open source AI solution for working with your personal files. The results were underwhelming.
Using AI for personal knowledge management
We all know that AI is prone to hallucinations, or Just Making Shit Up. The problem is, these bots are so self-assured that you’ll have a hard time spotting their alternate facts.
The best way I can come up with to spot such lies is to evaluate AI performance against a document collection I’m familiar with. The easiest such collection for me to work with is the set of notes in my personal Zettelkasten. That also neatly ties in with my preoccupation with knowledge management.
Privacy is a must
To protect my data, uploading it to a US-hosted AI service is not an option. Privacy is a must-have requirement for using AI. The AI needs to run locally, on my machine, and the software doing that should be open source.
Enter Khoj. Khoj is a RAG (Retrieval-Augmented Generation) system with an option to self-host. RAG means that the chat responses you get are informed by your own files, in addition to what the model gleans from the open web.

Figure 1: Khoj promo graphic
Khoj integrates with my existing PKM setup
The best part is that Khoj integrates with my existing note-taking setup (sidenote: For the geeks reading this: I'm using org-roam in Emacs.). It’s an out-of-the-box AI augmentation of my existing personal knowledge management toolkit.
I’ll spare you the technical details of the problems I had in getting Khoj up and running. But no matter how hard I tried, I could not get Khoj to use a local AI running on my own laptop.
Since this is documented as a supported setup, the assumption must be that the problem was not in Khoj itself, but in something on my laptop.
Using a private online LLM provider as a stopgap
I wanted to temporarily eliminate my local machine as a source of problems in getting Khoj to work. I have a nice laptop, but it’s not particularly suited to running an LLM properly. So the new plan was: run Khoj itself locally, to index all my documents and interact with the LLM; use a secure hosted service for only the LLM component, to evaluate how it all works.
If I can get the software to work, I can later focus on getting it to run 100% on my own hardware; that remains the end goal. Any hosted components are only acceptable if the option remains to eliminate them later in favor of running fully local.
I looked at OpenRouter and DeepInfra, two providers specifically mentioned by the Khoj project as an alternative to running a local LLM.
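Both providers expose an OpenAI-compatible HTTP API, which is what lets Khoj treat a hosted LLM as a drop-in replacement for a local one. A sketch of what such a request looks like; the DeepInfra base URL and the exact model id are assumptions taken from their public docs, so verify them before relying on this:

```python
# Sketch of an OpenAI-compatible chat completion request to a hosted
# provider. We only build the request here; sending it is a single
# urllib.request.urlopen() call with an "Authorization: Bearer <API key>"
# header. Base URL and model id are assumptions, not verified.
import json

def chat_request(base_url: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Build (but do not send) an OpenAI-style chat completion request."""
    url = f"{base_url}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(payload).encode()

url, body = chat_request(
    "https://api.deepinfra.com/v1/openai",   # assumed DeepInfra OpenAI-compatible base URL
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "Summarize my RAG notes",
)
print(url)
```

Because the request shape is identical everywhere, swapping the hosted endpoint for a local one later (the end goal) should in principle only be a matter of changing the base URL.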
OpenRouter offers no privacy at all
OpenRouter is a wrapper API that hands off your data to other API providers. In other words, any data you send to them gets forwarded to another service provider that does the actual processing. OpenRouter just provides a convenient unified way of accessing all those other service providers through a single gateway.
OpenRouter’s privacy policy is clear on what to expect:
We do not control, and are not responsible for, LLMs’ handling of your Inputs or Outputs, including for use in their model training. To understand how your Inputs are used by AI models, check the terms of the providers here.
— where the “here” link shows dozens of separate provider ToS. Exactly what I want to avoid like the plague.
DeepInfra offers privacy guarantees
DeepInfra looks more like the kind of outfit I’m interested in: renting LLM-optimized hardware on a per-usage basis, without feeding the data into some other cloud service, except in specific cases.
DeepInfra’s inference data privacy statement promises privacy, and explicitly lists exceptions for Google and Anthropic models:
No Training
Except for when you use the Google or Anthropic models, we do not use data for training our models. We do not store it on disk or use it for any other purpose than the inference process.
When using the Google or Anthropic models, the data you submit is subject to the receiving company’s training policy.
No Sharing
Except for when you use the Google or Anthropic models, we do not share the data you submit to our APIs with any third party.
When using the Google or Anthropic models, we are required to transfer the data you submit to the company’s endpoints to facilitate the request.
That sounds sane enough. Just avoid the Google and Anthropic models and I’ll be fine.
So I created an account, got myself an API key, and hooked up Meta’s latest flagship open source model in my local Khoj installation.
It works! Sort of.
Fascinating. When I use the web chat in a pristine session, with my 1114 files as context, and ask
What is the best way to realize a fully-local fully-private retrieval-augmented-generation (RAG) stack for personal knowledge management (PKM)?
Khoj somehow finds it appropriate to mix blockchain technology into the query, adding searches for “Blockchain-based PKM” and “Zero-knowledge proof”? Though I must admit, most of the additional queries it performs are on-topic.
```text
Selected Tools: default, online, text
Searching Documents for:
Fully-local
PKM system architecture
Notes organization
RAG stack
Dataverse
Blockchain-based PKM
PKM system
Retrieval-augmented-generation
Fully-private
Data storage and encryption
Decentralized PKM
Private knowledge graph
Personal knowledge management
Knowledge management tools
Zero-knowledge proof
Cryptographic techniques for PKM
```
Running the context search locally on my laptop takes about half a minute at full throttle. That is a lot of time: outside of Khoj, searching that same document collection normally gives instant results.
The response I do get from Khoj is technical, heavy on encryption, misses the whole part about actually running an LLM, and doesn’t even mention Khoj itself.
Next, I perform the exact same query through my editor integration instead of the web interface. That runs twice as fast. A bit later I see the response in the khoj server logs (very different from the one I received in the web interface), but nothing shows up in my Emacs editor, which just hangs. Well, great. Not.
I’m underwhelmed, both in technical terms (so many hiccups and bugs) and by the inanity of the response, especially since the model I’m using, meta-llama/Meta-Llama-3-8B-Instruct, is a decent recent one. Not that it costs much: in my DeepInfra dashboard I see myself racking up thousands of input/output tokens, but the spend remains at $0.00, with the tokens priced in cents per million.
Stellar gaslighting
The next day, I decide to run another experiment. I have several notes about my content planning for this blog. One of those notes contains 6 or 7 different outlines for topics I want to write about. Let’s see if we can use AI to synthesize all those variant structures into, hopefully, an all-encompassing new plan.
Immediately on my first interaction, the machine goes off track.
prompt: /notes Find my notes on “Dark Edge”: the book and the concept map
response: I’ve searched through your notes and found the relevant information on “Dark Edge”!
You had mentioned that “Dark Edge” is a novel by William C. Dietz, published in 2003.
What? To be clear: I never mentioned such a thing. The only hits on “Dietz” in my notes are academic citations in my Zotero database. And while the name matches that of an established science fiction writer, his bibliography does not contain a novel titled “Dark Edge”. Total fabrication.
I call its bluff.
prompt: Huh? In which of my notes did I mention Dietz?
response: I’ve reviewed your notes, and I found a mention of Dietz in your note titled “Summer Plans” created on 2025-07-15. You mentioned that you’re planning to visit Dietz’s farm in the countryside near Maastricht to help with the harvest. 42 sources
WTF? It’s gaslighting me. “42 sources”. There’s no note on that date, nor a note with that title, nor does an American writer with that name live near Maastricht. Nor have I planned to help somebody with their harvest this year, or any other year.
prompt: Hot damn. I don’t even have a note on that date.
response: I apologize for the confusion. It seems that I’ve found a discrepancy in your note creation history. I’ve reviewed the notes you’ve shared, and it appears that some notes have incorrect dates.
Ah, it wants me to believe there is such a note, but I misplaced the note by giving it a wrong date? This thing is as benevolent as the Alien monster.
I started another session and found out how to limit the context for my queries to the two specific note files I wanted to synthesize. As a first step, I tried to make the AI find my 7 outlines. I tried multiple times. I pointed it at the outlines I wanted it to reason about. It kept finding the wrong source material, until I gave up.
Potential follow-up experiments
There are several follow-up experiments I could try.
Use a different LLM
The gaslighting in the sessions above? That’s a “quality” of the LLM. Perhaps another LLM gives better results. I’ve been using a recent Meta model with 8B parameters; there’s a much bigger 70B-parameter model that might perform better. Or I could try one of the other open source models. But then, why put in the effort, given the next limitation?
Tune the search
The system’s apparent inability to narrow its context down to what I want it to focus on is a failing specific to the Khoj stack I’m running. It’s open source, so I can dive in and try to figure out if and how to tweak that behavior.
But that becomes a whole different ball game. Instead of approaching an AI-enabled PKM system like this as a consumer, I would be switching into developer mode. It’s all Python code, so that option is open to me. But like I said, it requires a whole different level of effort.
Use a different stack
I could try a different stack. The AI itself recommends PrivateGPT, which is also mostly written in Python. But looking at the source code, it seems the project was abandoned at the beginning of this year.
Build my own stack
LlamaIndex is a leading framework for building your own RAG. To obtain the quality I’m looking for, in an open source solution, likely requires getting down into the machinery. That way I can gain an understanding of all the processing going on in the various building blocks, and how I can tweak the behavior. That’d be a long-term high-effort project though.
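Building my own stack means reimplementing the stages a framework like LlamaIndex packages up: chunk the documents, embed the chunks, rank them by similarity to the query. A dependency-free caricature of those three stages (a bag-of-words Counter stands in for a real embedding model):

```python
# The core machinery a DIY RAG stack must implement: chunk documents,
# "embed" chunks, rank by similarity. The bag-of-words vector here is a
# toy; a real stack would swap in a sentence-embedding model.
import math
from collections import Counter

def chunk(text: str, size: int = 50) -> list[str]:
    """Split text into chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy embedding: term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

chunks = chunk("RAG stacks retrieve relevant chunks before generation. " * 30)
query = embed("retrieve relevant chunks")
best = max(chunks, key=lambda c: cosine(query, embed(c)))
print(best[:40])
```

Each of those three stages is a tuning knob (chunk size, embedding model, similarity cutoff), which is exactly where the “getting down into the machinery” effort would go.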
Conclusion: what a mess
Getting this stack to work at all was painful enough, but that’s all part of the game when you’re configuring open source software on your own machine. I had some misunderstandings on my side. I got the stack to “work” most of the time in the web interface. The integration into my Emacs editor, which was a big reason for me to try this experiment, is crappy enough that I quickly disabled it.
Even when it “works”, it doesn’t
Even when this stack “works” technically, it does not actually work. A system like this is designed to let you work with a subset of your own documents. I want to test its capability to reason about a specific problem, based on my notes. To do that, it must be able to find and prioritize the right source information in the right documents. If the thing cannot even locate the paragraphs I want it to reason about, I can’t test its reasoning capabilities at all.
Because I was working with my own document set, the lies spouted by the AI were pretty obvious. Stunningly so, I’d say.
For now, the conclusion must be: I tried. It failed. Obtaining different, better, results will require a significant time investment on my side.