Building a RAG System with TF-IDF

I wanted to be able to ask questions about a codebase and get accurate answers — not hallucinated ones. The simplest way to do that is RAG: retrieve the relevant pieces first, then let the LLM answer from those.

The Problem

LLMs don’t have access to your code. If you ask “how does the auth middleware work?”, the model either hallucinates an answer or you have to paste the entire codebase into the prompt. Neither is great.

The root cause is that LLMs generate answers from what they were trained on, not from your files. Without grounding, they’ll confidently describe a function that doesn’t exist.

Why It Matters

Sending the whole codebase as context is sometimes fine for small repos, but it has real costs: token limits, slow responses, and API charges that add up. More importantly, a bloated context degrades answer quality — the model has to sift through irrelevant code to find what it needs.

Hallucinations are worse. If you’re using an LLM to understand an unfamiliar codebase, a wrong answer is worse than no answer. It sends you down the wrong path.

Alternatives

The standard approach is semantic embeddings — convert every chunk into a dense vector using a model like text-embedding-3-small, store those in a vector database (Pinecone, pgvector, ChromaDB), and retrieve by cosine similarity. This handles synonyms and semantic meaning well. “Token validation” and “JWT check” will match even if the words don’t overlap.

The tradeoff: embeddings cost money per token to generate, require an API key, and need infrastructure to store and query. For understanding a single codebase, that’s a lot of overhead.

There’s also full-context injection — stuff everything into the prompt and let the model figure it out. This works until it doesn’t: context windows have limits, costs scale linearly, and retrieval quality degrades with noise.

TF-IDF sits in the middle. It’s keyword-based, entirely local, and fast enough for tens of thousands of chunks without any API calls or external services. It won’t match synonyms, but for source code — where function names and variable names repeat consistently — it works well.

How RAG Works

The core idea is to separate retrieval from generation. Instead of asking the LLM to know your code, you fetch the relevant pieces first and hand them over as context.

Four steps:

Ingest    → chunk source files, compute TF-IDF vectors, save index
Retrieve  → convert question to TF-IDF vector, find top matching chunks
Augment   → build a prompt: "answer only from this context" + chunks + question
Generate  → pipe to LLM, get a grounded answer

The LLM only sees the 4 most relevant chunks per query, not the full codebase.

Ingest

Run once per codebase. The script walks the repo, splits each file into logical chunks — functions, classes, markdown sections — and computes a TF-IDF vector for each.

Chunking is file-type aware:

  • .js/.ts: split on function, class, export, and section headers
  • .py: split on def, class, and section headers
  • .md: split on # headings

Each chunk gets a file path prefix for attribution, and anything under 50 characters is discarded as noise.
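A minimal sketch of the chunker in Python. The split keywords and the 50-character floor come from the description above; the regexes and the function name `chunk_text` are my assumptions, not the repo's actual code:

```python
import re

# Zero-width lookahead splits: each chunk begins at a top-level
# def/class/function/export/heading. (Assumed patterns; the real
# ingest.py may split differently, e.g. on section headers too.)
SPLITTERS = {
    ".py": re.compile(r"^(?=def |class )", re.MULTILINE),
    ".js": re.compile(r"^(?=function |class |export )", re.MULTILINE),
    ".ts": re.compile(r"^(?=function |class |export )", re.MULTILINE),
    ".md": re.compile(r"^(?=#)", re.MULTILINE),
}

def chunk_text(text: str, suffix: str, path: str = "") -> list[str]:
    pattern = SPLITTERS.get(suffix)
    pieces = pattern.split(text) if pattern else [text]
    # Anything under 50 characters is discarded as noise
    kept = [p.strip() for p in pieces if len(p.strip()) >= 50]
    # Prefix each chunk with its source path for attribution
    if path:
        kept = [f"// File: {path}\n{p}" for p in kept]
    return kept
```

Splitting on a lookahead rather than the keyword itself keeps the `def`/`class` line inside its chunk instead of discarding it.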

Tokenization matters here. The same tokenizer must run at both index time and query time. If they diverge, query tokens won’t match index tokens and retrieval silently breaks. The implementation uses a custom three-step process: camelCase splitting, lowercasing, and suffix stemming ("calculating" → "calculat").
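A sketch of the three steps, assuming a small fixed suffix list and a minimum-stem-length guard (both my additions, the post only names the steps):

```python
import re

# Assumed suffix list; a real stemmer handles many more forms.
SUFFIXES = ("ing", "tion", "ed", "s")

def tokenize(text: str) -> list[str]:
    # 1. Split on camelCase and non-alphanumeric boundaries:
    #    "validateToken" -> ["validate", "Token"]
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)
    tokens = []
    for word in words:
        # 2. Lowercase
        token = word.lower()
        # 3. Strip one common suffix: "calculating" -> "calculat"
        #    (guard keeps very short tokens intact)
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        tokens.append(token)
    return tokens
```

Note that this stems "calculating" to "calculat" but leaves "calculate" untouched, which is exactly the kind of near-miss a naive stemmer produces.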

TF-IDF scoring:

TF(term)    = count in chunk / total terms in chunk
IDF(term)   = log((N + 1) / (df + 1)) + 1
TF-IDF      = TF × IDF

In the IDF formula, N is the total number of chunks in the index, and df (document frequency) is the number of chunks that contain the term at least once. The + 1 adjustments are smoothing — they prevent division by zero and stop terms that appear in every chunk from collapsing to zero. IDF rewards rare terms: a term in 2 of 200 chunks scores about 5.2; one in 100 chunks scores about 1.7. Common words like “the” or “return” end up with near-zero weight.

After scoring, each chunk’s vector is L2-normalized to unit length. This makes retrieval simpler: cosine similarity between two unit vectors is just their dot product, no square roots needed.
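The scoring and normalization above can be sketched like this; function and variable names are mine, and `chunks` is assumed to be the tokenizer's output, one token list per chunk:

```python
import math
from collections import Counter

def build_index(chunks: list) -> list:
    n = len(chunks)
    # df: number of chunks containing each term at least once
    df = Counter()
    for tokens in chunks:
        df.update(set(tokens))

    vectors = []
    for tokens in chunks:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log((n + 1) / (df[term] + 1)) + 1  # smoothed IDF
            vec[term] = tf * idf
        # L2-normalize so cosine similarity reduces to a dot product
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors
```

With the smoothed IDF every weight stays positive, so the norm is never zero and the division is safe.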

The output is an index.json with every chunk’s text, source path, and TF-IDF vector.

Retrieve

At query time, the question goes through the same tokenizer, gets scored using the IDF table from the index, and is L2-normalized. Then the system scores every chunk with a dot product and returns the top 4.

This is a linear scan — O(N) over all chunks — but it’s entirely in-memory with no network calls. On a 200-chunk index it runs in under a millisecond.
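The query side fits in a few lines. This sketch uses a stand-in whitespace tokenizer for self-containment; the one non-negotiable requirement, as noted above, is that it be identical to the ingest-time tokenizer. The `idf` table and `index` shape are my assumptions about what index.json holds:

```python
import math
from collections import Counter

def tokenize(text: str) -> list:
    # stand-in; must match the ingest-time tokenizer exactly
    return text.lower().split()

def retrieve(question: str, idf: dict, index: list, top_k: int = 4) -> list:
    tokens = tokenize(question)
    counts = Counter(tokens)
    # Score the query with the IDF table built at index time,
    # defaulting unseen terms to 1.0 (an assumption)
    vec = {t: (c / len(tokens)) * idf.get(t, 1.0) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    vec = {t: w / norm for t, w in vec.items()}

    # O(N) linear scan: one dot product per chunk
    scored = []
    for chunk_text, chunk_vec in index:
        score = sum(w * chunk_vec.get(t, 0.0) for t, w in vec.items())
        scored.append((score, chunk_text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]
```

Because both sides are unit vectors, the dot product here is the cosine similarity.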

Augment and Generate

The top 4 chunks get joined into a context block. The prompt tells the model to answer only from what’s provided, not from its training data. That’s the grounding part — it’s what prevents hallucinations. Here’s the prompt sent to the LLM:

Answer only using the context below. If you can't find the answer, say so.

CONTEXT:
// File: src/auth/auth.js
function validateToken(payload) { ... }
---
// File: src/middleware/index.js
...

QUESTION: How does token validation work?

This gets piped to claude -p, and the answer comes back grounded in actual source code.
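A sketch of that last step. The `claude -p` invocation comes from the post; the prompt assembly and subprocess plumbing are my assumptions about how query.py wires it together:

```python
import subprocess

def build_prompt(question: str, chunks: list) -> str:
    # Chunks are separated with "---" as in the example prompt above
    context = "\n---\n".join(chunks)
    return (
        "Answer only using the context below. "
        "If you can't find the answer, say so.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    )

def answer(question: str, chunks: list) -> str:
    # `claude -p` runs one prompt non-interactively and prints the reply
    result = subprocess.run(
        ["claude", "-p", build_prompt(question, chunks)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Passing the prompt as an argv element (rather than through a shell) sidesteps any quoting issues with code in the context.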

What TF-IDF Can’t Do

Keyword matching fails on synonyms. If the code says validate and the question asks about verify, they might not match. Semantic embeddings handle this; TF-IDF doesn’t.

The custom stemmer also has blind spots. It strips common suffixes (-ing, -tion, -ed, -s) but misses irregular forms that a more rigorous algorithm such as Porter stemming would handle. Some tokens won’t normalize the same way at query and index time.

And it won’t scale past ~10k chunks without slowing down noticeably. Retrieval is a linear scan — for every query, it computes a dot product against every chunk in the index, one by one. At 10k chunks, that’s 10,000 dot products per query, which starts to add up. A vector database with approximate nearest-neighbor search avoids this by indexing vectors spatially, so it can find the top matches without checking every single one.

For a single codebase where you control the vocabulary — source code, docs, a knowledge base — TF-IDF is good enough and costs nothing to run.


The full implementation is at github.com/andrewou/RAG: ingest.py builds the index, query.py handles retrieval and generation. Both scripts have detailed inline comments on the math if you want to go deeper.
