r1: Retrieval

You start the RAG track here. Until now the model has answered from what it absorbed in training, which is fine for “what is the capital of Portugal” and useless for “which of our destinations suits this traveller”. Retrieval-Augmented Generation fixes that. You embed your own documents, embed the question, rank your documents by how close they are, and hand the closest few to the model. The model then answers from data you chose.

The new idea is the ranking. Everything around it you have already met: the blurbs are data, and you expose the search with @agent.tool_plain, the same shape as f4.

Quick path

In a hurry? These three steps are the whole challenge. Everything below is the why and the how.

Run make r1 and read the retrieval block: it returns Bali, Cancún and the Swiss Alps for “warm beach on a budget”, because search ignores the query and returns the first three.
Edit the body of search: embed the query (TODO 1), score every destination by cosine similarity (TODO 2), sort and take the top k (TODO 3).
Done when the retrieval block lists three real warm-beach destinations in descending score order, and TripMate recommends one of them.

A destination’s blurb becomes a vector. The query becomes a vector in the same space. “Close together” means “about the same thing”, and cosine similarity is how you measure it. Rank by that, keep the top few, and you have retrieval.

The top-K cutoff is the part people underestimate. The model can only recommend what you put in front of it. Retrieve the wrong three and a confident, well-written, wrong answer is exactly what you get.

Mental model

query --embed--> [ vector ]
                     |  cosine similarity vs every doc vector
                     v
docs --embed--> [ vectors ] --rank--> top K --> tool result --> model answers

Each blurb is embedded once at startup. The query is embedded per call. Similarity ranks them; the top K is all the model ever sees.
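Here is a minimal sketch of the ranking step in the diagram, with the embedder left out. The three-number vectors and their axes (warm, beach, snow) are hand-made assumptions. A real embedding has hundreds of unlabelled dimensions. The names come from the catalogue, but the numbers do not.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    lengths = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / lengths if lengths else 0.0

# Hand-made "embeddings" (assumption for illustration): axes are (warm, beach, snow).
DOC_VECTORS = {
    "Bali": [0.9, 0.8, 0.0],
    "Cancún": [0.9, 0.9, 0.0],
    "Swiss Alps": [0.1, 0.0, 0.9],
    "Lisbon": [0.7, 0.4, 0.0],
}

def retrieve(query_vec: list[float], k: int) -> list[tuple[str, float]]:
    """Score every doc against the query, sort highest-first, keep the top k."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in DOC_VECTORS.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Stand-in for an embedded "warm beach" query: all warm, all beach, no snow.
warm_beach = [1.0, 1.0, 0.0]
for name, score in retrieve(warm_beach, k=3):
    print(f"{score:.3f} {name}")  # Cancún, then Bali, then Lisbon
```

The Swiss Alps scores about 0.08 and misses the cut at k = 3. At k = 4 it would be handed over anyway as the least bad option.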

The mechanic, in another domain

Forget the catalogue. Say you want the FAQ entry that best answers a question. Embed the answers once, embed the question, score each by cosine similarity, and keep the top few:

# index once: embed every FAQ answer
docs = (await embedder.embed_documents([f["answer"] for f in FAQS])).embeddings

# per query: embed it, score every doc, take the top k
q = (await embedder.embed_query(question)).embeddings[0]
top = sorted(
    ({**f, "score": cosine(q, docs[i])} for i, f in enumerate(FAQS)),
    key=lambda f: f["score"],
    reverse=True,
)[:k]

embed_documents embeds the answers in one batch, embed_query embeds a single query, and cosine (the helper at the top of the file) scores two vectors. The query and the docs must use the same embedder, or the scores are noise. That embed → score → sort → slice is retrieval; below you write it for TripMate’s catalogue.

The setup

Open start/agent.py. The DESTINATIONS catalogue, index_destinations (it embeds every blurb once at startup into doc_embeddings), and the search_destinations tool that wraps search are all provided. The blank is the body of search: it returns the first three and ignores the query. The three ranking lines (TODO 1, 2, 3) are yours.

Run it

make r1

You get two blocks. The retrieval block prints the “top” three for “warm beach on a budget”, and the Swiss Alps is in there, because nothing compares the query to the blurbs yet. The agent block then recommends from that broken set. Fix search and both blocks come good at once.

The catalogue’s embeddings are pre-generated and committed (embeddings.embeddinggemma.json in this folder), so the first run is instant. Delete that file to regenerate them (a few seconds, once, while Ollama loads embeddinggemma). Switching to a different embedder writes a separate cache file instead of reusing this one. See TROUBLESHOOTING.md.
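The one-file-per-embedder rule can be sketched as follows. This is a minimal sketch that assumes the cache is a JSON list of vectors named `embeddings.<model>.json`. `load_or_build` and `fake_embed` are hypothetical names, not what index_destinations actually calls. Because the model name is part of the filename, vectors from two models are never mixed.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

Embed = Callable[[list[str]], list[list[float]]]

def load_or_build(model_name: str, blurbs: list[str], embed_all: Embed, folder: Path) -> list[list[float]]:
    """Return one vector per blurb, reusing a per-model cache file when it exists."""
    # The model name is part of the filename, so switching embedders
    # never reads vectors produced by a different model.
    path = folder / f"embeddings.{model_name}.json"
    if path.exists():
        return json.loads(path.read_text())
    vectors = embed_all(blurbs)  # slow path: runs only when the file is missing
    path.write_text(json.dumps(vectors))
    return vectors

# Usage with a fake embedder that records how many texts it was asked to embed.
calls: list[int] = []

def fake_embed(texts: list[str]) -> list[list[float]]:
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build("toy-model", ["beach", "ski"], fake_embed, Path(tmp))
    again = load_or_build("toy-model", ["beach", "ski"], fake_embed, Path(tmp))  # read from cache
    other = load_or_build("other-model", ["beach", "ski"], fake_embed, Path(tmp))  # gets its own file
    files = sorted(p.name for p in Path(tmp).iterdir())
```

In this sketch the cache is keyed by model name only, not by content. If you edit a blurb, the stale vectors stay until you delete the file.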

Build it

Run it and read the gap. Run make r1. The scores are all 0.000 and the order is catalogue order, so the Swiss Alps lands in the top three for “warm beach on a budget”. That is what “ignores the query” looks like.
Embed the query (TODO 1). Put the query into the same vector space as the blurbs with one embedder.embed_query(query) call, then read the embedding off the result. Same embedder as the blurbs. Embeddings are only comparable within one model.
Score every destination (TODO 2). Build a list, one entry per destination, each carrying its similarity to the query. doc_embeddings[i] is the vector for DESTINATIONS[i], and cosine(...) is already waiting for you at the top of the file.
Rank and cut (TODO 3). Sort highest-first and return the top k. The rest never reaches the model.
Run it again. The retrieval block should now read something like 0.55 Bali, 0.48 Cancún, 0.48 Lisbon, real warm-beach options in descending order, and TripMate should recommend one of them and say why. The Swiss Alps drops out because skiing and fondue are not close to “warm beach”.
Poke the cutoff. Change search(query, 3) to 1, then to 6. At k = 1 the model sees a single option and has no choice. At k = 6 it sees three-quarters of the eight-destination catalogue, including destinations that do not fit, and the recommendation gets vaguer. Retrieval quality is mostly about giving the model enough and no more.
Check you’ve got it. You should be able to say, in one sentence, why the same query returned the Swiss Alps before and Bali after, and point at the line that made the difference. Look at the trace too: you will see embedding spans for the query alongside the agent run and the search_destinations tool call.
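The cut in TODO 3 is sort-then-slice. The sketch below uses made-up scores, not real output, and places that cut next to the standard library’s heapq.nlargest, which returns the same list without sorting everything. Slicing past the end is safe, so a k larger than the catalogue returns all of it.

```python
import heapq

# Made-up (name, score) pairs, as if each had already been scored against one query.
scored = [("Swiss Alps", 0.12), ("Bali", 0.55), ("Lisbon", 0.47), ("Cancún", 0.49)]

def top_k_sorted(pairs: list[tuple[str, float]], k: int) -> list[tuple[str, float]]:
    # Sort the whole list highest-first, then keep the first k: O(n log n).
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]

def top_k_heap(pairs: list[tuple[str, float]], k: int) -> list[tuple[str, float]]:
    # Same result via a bounded heap: O(n log k), which only matters once n is large.
    return heapq.nlargest(k, pairs, key=lambda p: p[1])

for k in (1, 3, 6):
    print(k, [name for name, _ in top_k_sorted(scored, k)])
```

With eight destinations the speed difference does not matter, and `sorted(...)[:k]` is the clearer choice.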

Stuck? finish/agent.py is the canonical version. Read it after you’ve had a real go.

Traps

Mixing embedding models. Vectors are only comparable if they came from the same model. Embed your docs with one and your query with another and the scores are noise.
Trusting the top result. The closest match is still only the closest of what you have. If nothing fits, retrieval hands over the least bad option and the model presents it confidently. Watch for that when you poke the query.
Embedding too much at once. One blurb per vector is fine here. When a document is long, a single vector becomes a blurry average and specific questions stop matching. That is exactly what r2 is about.
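One common mitigation for the “trusting the top result” trap, which this challenge does not ask for, is a minimum score. Results below a floor are dropped, so “nothing fits” arrives as an empty result instead of a least-bad option. In this sketch, `retrieve_with_floor` and the 0.3 floor are assumptions. Cosine scores are not calibrated across models, so choose a real floor by looking at your own embedder’s scores.

```python
def retrieve_with_floor(
    scored: list[tuple[str, float]], k: int, min_score: float
) -> list[tuple[str, float]]:
    """Top k by score, minus anything scoring below min_score (hypothetical helper)."""
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)
    return [(name, score) for name, score in ranked[:k] if score >= min_score]

FLOOR = 0.3  # assumed value; tune it per embedder

# Made-up scores for a query that nothing in the catalogue fits.
poor_fit = [("Bali", 0.21), ("Swiss Alps", 0.18), ("Lisbon", 0.15)]
# Made-up scores for a query the catalogue fits well.
good_fit = [("Bali", 0.55), ("Swiss Alps", 0.12), ("Cancún", 0.49)]

print(retrieve_with_floor(poor_fit, 3, FLOOR))  # []
print(retrieve_with_floor(good_fit, 3, FLOOR))  # [('Bali', 0.55), ('Cancún', 0.49)]
```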

A couple of things worth knowing

Embeddings, in one paragraph

An embedding model turns text into a list of numbers (a vector) so that text about similar things lands in similar places. “Warm beach on a budget” and “Caribbean white-sand beaches, great-value resorts” end up close; “snow-sure skiing, fondue” ends up far. Cosine similarity measures the angle between two vectors, so it ignores length and asks “do these point the same way”. You never read the numbers yourself; you only ever compare them.
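Here is a worked example of “ignores length” in plain Python. The `normalize` helper is illustrative and is not the `cosine` in start/agent.py. Scaling each vector to length 1 and then taking the dot product gives exactly cosine similarity, so a vector and a copy ten times longer score 1.0.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to length 1 (a zero vector has no direction)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    # Dot product of the two unit vectors: only direction matters, never magnitude.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

same_direction = cosine([1.0, 2.0], [10.0, 20.0])  # 10x longer, same direction
unrelated = cosine([1.0, 0.0], [0.0, 1.0])         # perpendicular
opposite = cosine([1.0, 0.0], [-3.0, 0.0])         # pointing the other way

print(round(same_direction, 3), unrelated, opposite)  # 1.0 0.0 -1.0
```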

Why retrieval is a tool

Nothing new happened at the agent level. You wrote a @agent.tool_plain function with a clear docstring (the model reads that docstring as the tool’s description) and it returned some data. The model decided to call it. The only difference from f4 is that the tool does a similarity search instead of returning a mock. That is the whole trick: RAG is retrieval inside the tool loop. If the model never calls the tool, the problem is almost always the docstring or the instructions, not the embeddings.

embed_query vs embed_documents

Pydantic AI’s Embedder gives you both. They can produce slightly different vectors for the same text, because some embedding models are trained to put a question and the passage that answers it close together, even though they are worded differently. Use embed_documents for the things you store and embed_query for the thing you search with. Here the difference is small, but using the right one is a free habit worth keeping.
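One way a model can treat queries and passages differently is through task prefixes. For instance, nomic-embed-text expects queries to start with `search_query: ` and stored passages with `search_document: `, so the same words are embedded from two different starting points. The sketch below only illustrates that idea. It makes no claim about whether or how Pydantic AI or embeddinggemma applies a prefix, and `prepare` is a hypothetical helper.

```python
# Prefixes documented for nomic-embed-text; other models use different ones, or none.
PREFIXES = {"query": "search_query: ", "document": "search_document: "}

def prepare(texts: list[str], role: str) -> list[str]:
    """Prefix each text for its role before it is sent to the embedding model."""
    if role not in PREFIXES:
        raise ValueError(f"role must be 'query' or 'document', got {role!r}")
    return [PREFIXES[role] + text for text in texts]

stored = prepare(["Caribbean white-sand beaches, great-value resorts"], "document")
asked = prepare(["warm beach on a budget"], "query")
print(stored[0])
print(asked[0])
```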

Index once, query many

index_destinations runs once at startup and embeds all eight blurbs. search only embeds the query. That split matters: embedding documents is the slow, expensive part, so you do it ahead of time and cache the vectors. Here that cache is a JSON file in the challenge folder, pre-generated and committed so the first run skips the embed; in production it is typically a vector database. The shape is identical, only the store changes.
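A counter on the embedder makes the index-once split visible. Everything in this sketch is hypothetical. `ToyEmbedder` counts words from a six-word vocabulary instead of calling a model, and `Index` stands in for whatever holds the vectors. It also normalises each stored vector once, so a query costs one embedding plus one dot product per document.

```python
import math

VOCAB = ["warm", "beach", "budget", "snow", "ski", "city"]

class ToyEmbedder:
    """Hypothetical stand-in for a real embedder: word counts over VOCAB, plus a call counter."""

    def __init__(self) -> None:
        self.texts_embedded = 0

    def embed(self, texts: list[str]) -> list[list[float]]:
        self.texts_embedded += len(texts)
        return [[float(t.lower().split().count(w)) for w in VOCAB] for t in texts]

def unit(v: list[float]) -> list[float]:
    """Scale to length 1 (left as-is if all zeros), so cosine becomes a plain dot product."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else v

class Index:
    def __init__(self, embedder: ToyEmbedder, docs: list[str]) -> None:
        self.embedder = embedder
        self.docs = docs
        # Expensive step, done once: embed and normalise every document up front.
        self.vectors = [unit(v) for v in embedder.embed(docs)]

    def search(self, query: str, k: int) -> list[str]:
        # Cheap step, done per query: one embedding, then one dot product per doc.
        q = unit(self.embedder.embed([query])[0])
        scores = [sum(a * b for a, b in zip(q, d)) for d in self.vectors]
        best = sorted(range(len(self.docs)), key=lambda i: scores[i], reverse=True)
        return [self.docs[i] for i in best[:k]]

embedder = ToyEmbedder()
index = Index(embedder, ["warm beach resorts on a budget", "snow ski chalets", "city breaks"])
beach = index.search("warm beach", k=1)
ski = index.search("ski", k=1)
print(beach, ski, embedder.texts_embedded)  # 3 docs once + 1 per query = 5
```

The three documents were embedded once, and each query added one more embedding. Moving to a vector database would keep the same split; only where `self.vectors` lives would change.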

Next up is r2, where the documents get long. One embedding for a whole multi-paragraph guide is too blunt: you chunk the guide into passages, embed those, and retrieve the paragraph that answers the question.