r2: Chunking

In r1 every document was a one-line blurb, so one vector per document was plenty. Real documents are not one line. Embed a whole multi-paragraph guide as a single vector and that vector becomes a blurry average of everything in it: the temples, the trams, the food, the weather, all smeared together. Ask about one of those topics and it barely matches.

The fix is chunking. Split each guide into passages, embed the passages, and retrieve the paragraph that answers the question. The retrieval code is identical to r1; the only change is what you embed.

In a hurry? These three steps are the whole challenge. Everything below is the why and the how.

  1. Run npm run r2 and read the retrieval block: for “what local dishes should I try?” the three guides come back at about 0.25, scored low and close, and the snippets are the opening paragraph of each, not the food.
  2. Edit start/agent.ts, the body of chunk: split the guide on blank lines into paragraphs (TODO 1). indexGuides and retrieve already do the rest.
  3. Done when the same question returns the three food passages at roughly 0.39 to 0.47, each tagged with its city, and TripMate answers citing them.

One vector cannot represent four different topics well. A guide’s single embedding sits near the average of its paragraphs, which is near none of them in particular. Chunk it and each paragraph gets its own vector, so “what should I eat” lands right on the food paragraph.

one guide  --embed-->  [ 1 blurry vector ]            "food?" matches weakly (0.25)

chunk it:
para 1 --\
para 2 ---\--embed-->  [ 4 sharp vectors ]            "food?" matches the food para (0.47)
para 3 ---/
para 4 --/

Same query, same embedder, same ranking. Smaller units, sharper matches.
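
To make the blurry average concrete, here is a minimal sketch with hand-made vectors. It assumes a toy 4-dimensional space with one axis per topic, and it assumes the whole-guide vector is literally the mean of its paragraph vectors. A real embedder does not work exactly like that, so treat the numbers as an illustration of the direction of the effect, not of the scores you will see. The helpers dot, cosine and mean are hypothetical names, not part of start/agent.ts.

```typescript
// Toy "embeddings": one axis per topic (temples, trams, food, weather).
// Hand-made for illustration; NOT real embedder output.
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function cosine(a: Vec, b: Vec): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Component-wise mean: a stand-in for "embed the whole guide as one vector".
function mean(vectors: Vec[]): Vec {
  return vectors[0].map(
    (_, i) => vectors.reduce((sum, v) => sum + v[i], 0) / vectors.length,
  );
}

const paragraphs: Vec[] = [
  [1, 0, 0, 0], // temples
  [0, 1, 0, 0], // trams
  [0, 0, 1, 0], // food
  [0, 0, 0, 1], // weather
];
const foodQuery: Vec = [0, 0, 1, 0];

const wholeGuide = mean(paragraphs);
const wholeGuideScore = cosine(foodQuery, wholeGuide);
const bestChunkScore = Math.max(...paragraphs.map((p) => cosine(foodQuery, p)));

console.log({ wholeGuideScore, bestChunkScore });
```

With these toy vectors the whole guide scores 0.5 against the food query and the food paragraph scores 1. The real scores in this challenge are lower on both sides, but the gap opens for the same reason.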

Forget the guides. Say you have a long markdown manual. Embedded whole, it is one blurry vector. Split it on the boundary the author already gave you, the headings, and each section becomes its own vector; the retrieval from r1 is otherwise unchanged:

function chunk(doc: string): string[] {
  return doc
    .split(/\n(?=## )/)        // start a new chunk at each "## " heading
    .map((s) => s.trim())      // clean surrounding whitespace
    .filter(Boolean);          // drop empty pieces
}

Split, clean, drop empties: three array moves, no embeddings. The only judgement is where to cut, and that is the boundary the writer already put in. TripMate’s guides separate paragraphs with a blank line rather than headings, so yours splits on those instead. Same move, different boundary.
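
As a quick check of the heading splitter, the sketch below runs it on a small manual. The manual text is made up for illustration. One detail worth seeing: the lookahead requires "## " followed by a space, so a "### " sub-heading does not start a new chunk and stays inside its parent section.

```typescript
// Same heading-based splitter as above, repeated so this block runs on its own.
function chunk(doc: string): string[] {
  return doc
    .split(/\n(?=## )/) // cut just before each line that starts with "## "
    .map((s) => s.trim())
    .filter(Boolean);
}

// Hypothetical manual, invented for this example.
const manual = [
  "## Install",
  "Run the installer.",
  "",
  "## Configure",
  "Edit the settings file.",
  "### Advanced",
  "Set the cache size.",
  "",
  "## Troubleshoot",
  "Check the logs first.",
].join("\n");

const sections = chunk(manual);
console.log(sections.length); // one chunk per "## " heading
```

If you wanted sub-sections as their own chunks, you would widen the pattern; whether that helps depends on how much context each sub-section needs from its parent.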

Open start/agent.ts. indexGuides and retrieve are done and read exactly like r1; each chunk keeps the city it came from, so a retrieved passage can be cited. The blank is chunk, which right now returns the whole guide as one piece. Splitting it into paragraphs is TODO 1.

npm run r2

The retrieval block prints three near-identical low scores and, tellingly, the snippets it shows are each guide’s first paragraph (the general “what this place is like” one), not the food. That is the blunt-average problem in the open. Fix chunk and the food passages jump to the top. Stuck on low scores after that? See TROUBLESHOOTING.md.

  1. Run it and read the muddle. Run npm run r2. For “what local dishes should I try?” you get roughly 0.27 Reykjavik, 0.25 Lisbon, 0.25 Kyoto, low, bunched, and the wrong paragraph. The index holds three vectors, one per guide.

  2. Split into paragraphs (TODO 1). Make chunk return one passage per paragraph. Paragraphs are separated by a blank line, so split the text on those blank lines, run the same whitespace cleanup the placeholder already does on each piece, and drop any empties. Three array moves, no embeddings, and indexGuides and retrieve already do the rest.

  3. Read what that unlocks. You write nothing in indexGuides or retrieve. Look at them: indexGuides now produces twelve passages instead of three guides, each carrying its city, and retrieve ranks passages instead of guides. The startup line should now read Indexed 3 guides as 12 passages.

  4. Run it again. The same question now returns the three food passages, Kyoto, Reykjavik, Lisbon, at roughly 0.47, 0.46, 0.39, well above the old 0.25. TripMate answers with real dishes and cites the city for each, because the tool handed it the food paragraphs and nothing else.

  5. Poke the granularity. In chunk, split into sentences instead (text.split(". ")). Watch the passages get so small they lose context: a sentence that says “Try the lamb soup” no longer carries that it is about Reykjavik. Then go the other way (whole guide) and you are back to the muddle. Good chunking is finding the size that holds one idea and enough context to use it.

  6. Check you’ve got it. You should be able to explain why the food question scored about 0.25 before and up to 0.47 after, when the embedder and the query never changed. The only thing that changed was the size of the thing you embedded.
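
The sentence-splitting experiment from step 5 can be sketched without any embeddings. The paragraph below is invented for illustration and is not text from the TripMate guides. The point is only that, after splitting on ". ", most pieces no longer mention the city in the text that would be embedded.

```typescript
// Hypothetical paragraph, made up for this example.
const paragraph =
  "Reykjavik food is hearty and built for cold weather. Try the lamb soup. Finish with skyr and berries.";

// Naive sentence split, as in step 5. Note it also eats the full stop it splits on.
const sentences = paragraph
  .split(". ")
  .map((s) => s.trim())
  .filter(Boolean);

// Which sentences still name the city in their own text?
const namesCity = sentences.map((s) => s.indexOf("Reykjavik") !== -1);

sentences.forEach((s, i) => console.log(namesCity[i] ? "has city:" : "no city: ", s));
```

Only the first sentence still says where it is. The other two are fine advice with nothing in the text to tie them to a place, which is the lost context step 5 asks you to notice.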

Stuck? finish/agent.ts is the canonical version. Read it after you’ve had a real go.

  • Chunks too big. Back to the blunt average: if a passage covers three topics, a query about one of them matches weakly.
  • Chunks too small. A lone sentence loses the context that made it useful, and the thread between sentences. “It is closed on Mondays.” What is closed?
  • Re-embedding the query with a different model. Same trap as r1, and fatal here too. Index and query must share the embedder.

Why a blank line is a good boundary

You can chunk by tokens, by sentences, by headings, or with a model that finds topic shifts. For prose with paragraphs, a blank line is a cheap and strong boundary, because a writer already broke the text where the topic changed. Start with the structure the document gives you; reach for cleverer splitters only when that fails.
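
For reference, here is one way a blank-line splitter could look. It is a sketch under assumptions, not the contents of finish/agent.ts: the name splitParagraphs is hypothetical, and the whitespace cleanup (collapsing runs of whitespace to a single space) is a guess at what a reasonable cleanup does, not necessarily what the placeholder in start/agent.ts does. If you have not attempted TODO 1 yet, try it before reading this.

```typescript
// Hypothetical helper: split prose into paragraphs on blank lines.
function splitParagraphs(text: string): string[] {
  return text
    .split(/\n\s*\n/) // a blank line, even one holding stray spaces or \r
    .map((p) => p.replace(/\s+/g, " ").trim()) // assumed cleanup: collapse whitespace
    .filter(Boolean); // drop empty pieces
}

// Made-up input: a wrapped paragraph, a whitespace-only "blank" line, a trailing newline.
const sample = "A one.\nstill A.\n\nB two.\n \n\nC three.\n";
const pieces = splitParagraphs(sample);
console.log(pieces);
```

Matching `\n\s*\n` rather than a literal "\n\n" is a design choice: it tolerates blank lines that contain spaces and Windows line endings, at the cost of treating several blank lines in a row as one boundary.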

Overlap, and why people add it

A fact can straddle a boundary: the sentence before the blank line sets up the sentence after it. To avoid cutting that thread, real chunkers often overlap: each chunk repeats the last sentence or two of the previous one. It costs a few duplicate tokens and saves you from a passage that starts mid-thought. Our paragraphs are self-contained enough not to need it, but it is the first knob you reach for when retrieval returns a passage that “starts too late”.
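
A minimal sketch of that idea, assuming you already have paragraphs and want each one to start with the tail of the one before. The function name addOverlap and the sample text are hypothetical, and the sentence detection is deliberately naive (it splits on ".", "!" and "?"), so abbreviations and decimals would trip it up.

```typescript
// Hypothetical helper: prefix each paragraph with the last N sentences of the previous one.
function addOverlap(paragraphs: string[], overlapSentences = 1): string[] {
  return paragraphs.map((para, i) => {
    if (i === 0 || overlapSentences <= 0) return para; // nothing to borrow from
    const previous = paragraphs[i - 1];
    // Naive sentence split; fall back to the whole paragraph if nothing matches.
    const prevSentences = previous.match(/[^.!?]+[.!?]+/g) ?? [previous];
    const tail = prevSentences
      .slice(-overlapSentences)
      .map((s) => s.trim())
      .join(" ");
    return `${tail} ${para}`;
  });
}

// Made-up paragraphs where the second one depends on the first.
const original = [
  "The museum is near the harbour. It is closed on Mondays.",
  "On those days, try the market instead.",
];
const overlapped = addOverlap(original);
console.log(overlapped[1]);
```

The second passage now carries "It is closed on Mondays." in front of "On those days", so it can be read on its own. The first passage is unchanged, and the duplicated sentence is the token cost mentioned above.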

Keep the metadata on the chunk

Each chunk carries its city. That is what lets the agent cite “in Kyoto…” instead of stating a fact with no source. When you chunk, keep a link back to the document: the title, the URL, the section heading. Retrieval that cannot tell you where a passage came from is much harder to trust and to debug.
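
One possible shape for that, as a sketch: the Guide and Passage types and the toPassages function are hypothetical names chosen for this example and are not taken from start/agent.ts. The splitter is passed in, so the same code works for blank-line or heading chunking.

```typescript
// Hypothetical types: a source document and a chunk that remembers where it came from.
interface Guide {
  city: string;
  text: string;
}

interface Passage {
  city: string; // link back to the source document
  index: number; // position of the passage within its guide
  text: string;
}

// Chunk every guide and stamp each passage with its origin.
function toPassages(guides: Guide[], chunk: (text: string) => string[]): Passage[] {
  const passages: Passage[] = [];
  for (const guide of guides) {
    chunk(guide.text).forEach((text, index) => {
      passages.push({ city: guide.city, index, text });
    });
  }
  return passages;
}

// Made-up guides, two paragraphs each, separated by a blank line.
const guides: Guide[] = [
  { city: "Kyoto", text: "Temples everywhere.\n\nTry the tofu." },
  { city: "Lisbon", text: "Hills and trams.\n\nTry the custard tarts." },
];

const byBlankLine = (text: string): string[] =>
  text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);

const passages = toPassages(guides, byBlankLine);
console.log(passages.map((p) => `[${p.city} #${p.index}] ${p.text}`));
```

Because the city travels with the passage, whatever retrieves it can print or cite the source without a second lookup. For documents with more structure you would carry more fields, such as a title, URL or section heading.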

That closes the RAG track. You can now ground a model in a catalogue of short records (r1) and in long-form documents (r2), which is the shape of most real retrieval. From here, the Full-Stack track puts a grounded agent behind a real web UI, and the Patterns track shows how retrieval slots in next to chaining, routing and the rest.