f7: Testing the agent
In f5 you put a guardrail in front of TripMate. It looked right when you ran it, but “looked right once” is not the same as “I can prove it still works tomorrow.” How do you actually know it blocks an off-topic question and lets a trip question through?
The obvious idea, call the real model and check, is a bad test. It is slow, it costs money on a hosted model, and the model will not reliably say “block” on command, so the test passes one run and fails the next. A flaky test is worse than no test.
The fix is to stop using the real model. agent.override(model=TestModel()) runs your agent with a stand-in that makes no network call, and TestModel(custom_output_text="0") makes that stand-in return exactly the verdict you want. Now you can drive the guardrail to “block” or “allow” on demand and assert what your code does with each. Same idea as a unit test: deterministic, fast, and yours to control.
Quick path
In a hurry? These three steps are the whole challenge. Everything below is the why and the how.
- Run make f7 and watch both checks fail (✗, 0/2 green): the two test bodies are unwritten.
- Edit start/agent.py: TODO 1 (force the guardrail to “0”, assert handle() returns the refusal), TODO 2 (force it to “1”, stub TripMate, assert the reply is not the refusal).
- Done when both checks print a ✓ and the run finishes without a single real model call.
Mental model
Swap the model for one you control, and each branch of the gate becomes a check you can trust.
The mechanic, in another domain
Forget travel. Say you have a comment box with a moderation gate: a checker votes 1 to keep a comment or 0 to remove it.
override swaps the model only inside its with block, and custom_output_text is the verdict you put in its mouth. You are not testing whether the model moderates well; that is a separate problem. You are testing that your code does the right thing once a verdict comes back. Below you write the same two moves for TripMate’s gate.
The setup
Open start/agent.py. The f5 gate is already here, pulled into one testable function handle(query): the guardrail agent votes, and only an allowed query reaches tripmate. Two test functions are stubbed out for you to fill in. At the top, ALLOW_MODEL_REQUESTS = False is set as a backstop: if a test forgets to override an agent, the real call raises instead of quietly passing.
Run it
Unlike every challenge before it, this one needs no model server. Every test swaps in a fake model, so f7 runs offline and instantly, even with Ollama stopped.
Build it
- Test the block branch (TODO 1). Write test_blocks_off_topic. Override the guardrail with TestModel(custom_output_text="0"), call handle(...) inside the with block, and assert the reply equals REFUSAL. Because a blocked query never reaches TripMate, you do not need to touch tripmate at all. Mirror the shape of the moderator test above; don’t copy it verbatim, because the agents and the assertion are different.
- Test the allow branch (TODO 2). Write test_allows_trip. This time the guardrail votes "1", so handle() does call TripMate, which would be a real model request. Override both agents, the guardrail with custom_output_text="1" and tripmate with a plain TestModel() (it returns generated text, no real call), then assert the reply is a real answer (non-empty) and not REFUSAL.
- Run it. make f7 should now print 2/2 green with a ✓ on each line. Nothing waited on Ollama; the whole thing is instant. That speed is the point: you can run this on every commit.
- Prove the backstop works. Comment out the tripmate.override(...) in TODO 2 and run again. That test now fails instantly with a ✗ and a model-request error, instead of hanging on a slow call, because ALLOW_MODEL_REQUESTS = False caught the un-stubbed agent. Put it back.
Stuck? finish/agent.py is the canonical version. Read it after you’ve had a real go.
- The allow test hangs or errors on a real call. You overrode the guardrail but not TripMate. An allowed query reaches TripMate, so stub it too. The backstop turns this into an immediate raise instead of a slow hang.
- custom_output_text ignored. It only applies to the agent you wrapped in override. Wrap the guardrail, not TripMate, when you want to control the verdict.
- Asserting the model’s quality. A TestModel says whatever you tell it, so a test on it proves nothing about how well the real model moderates. These tests cover your wiring and control flow, the part you actually wrote. Judging the model’s answers is evals, a different tool.
A couple of things worth knowing
What does TestModel do by default?
With no arguments, TestModel() calls every tool the agent has (with made-up arguments) and then returns generated text as the output. That is enough to prove the plumbing works: tools are registered, an output_type validates, your code runs end to end. When you need a specific answer instead, custom_output_text sets the text and custom_output_args sets the fields of a structured output. After a run, the_model.last_model_request_parameters.function_tools tells you which tools were on offer.
Why ALLOW_MODEL_REQUESTS = False?
It is a global switch that makes any real model request raise. TestModel and friends are exempt, so your overridden runs work fine, but a run you forgot to stub turns into a loud error instead of a slow, costly, flaky call to a live model. It is cheap insurance against a test that secretly depends on the network. Many teams set it once in their test setup.
These are real tests; we just run them with a script
To keep every challenge to one runnable file, main() calls the two tests and prints a ✓ for each. In a real project they would live in a file like test_gate.py as async def test_* functions, with no main(): pytest (plus pytest-asyncio or anyio) discovers them, awaits each one, and reports a failed assert as a failing test. The bodies are identical; only the runner changes.
This is the last of the foundations. You have an agent that takes instructions, returns structured output, calls tools, gates its input, routes on descriptions, and now proves it behaves.
Now the path forks. Choose a track (do one, several, or stop here; the Discussion closes the workshop wherever you end up):
- Patterns (p1–p7): compose several model calls in shapes you design (chaining, routing, parallelization, evaluator-optimizer), then hand control to the model with agentic, delegation, and conversation.
- RAG (r1–r2): give the agent your own documents to retrieve and rank by similarity, then chunk the long ones.
- Full-Stack: put this agent behind a streaming web chat UI.
If you’re carrying on with patterns, p1 is chaining: draft, check it with a gate, then fix only what failed.