
f7: Testing the agent

In f5 you put a guardrail in front of TripMate. It looked right when you ran it, but “looked right once” is not the same as “I can prove it still works tomorrow.” How do you actually know it blocks an off-topic question and lets a trip question through?

The obvious idea, call the real model and check, is a bad test. It is slow, it costs money on a hosted model, and the model will not reliably say “block” on command, so the test passes one run and fails the next. A flaky test is worse than no test.

The fix is to stop using the real model. If handle takes its agents as an argument, a test can pass in a fake one whose reply it decides, with no network call. Now you can drive the guardrail to “block” or “allow” on demand and assert what your code does with each. Same idea as a unit test: deterministic, fast, and yours to control.

In a hurry? These three steps are the whole challenge. Everything below is the why and the how.

  1. Run npm run f7 and watch both checks fail (0/2 green): the two test bodies are unwritten.
  2. Edit start/agent.ts: TODO 1 (a fakeAgent("0") guardrail, assert handle() returns the refusal), TODO 2 (a fakeAgent("1") guardrail plus a fake TripMate, assert the reply is not the refusal).
  3. Done when both checks print a ✓ and the run finishes without a single real model call.

real model in a test     ->  slow, costs money, won't "block" on command (flaky)
a fake agent you inject  ->  instant, deterministic; tests YOUR branch logic

Pass in an agent whose answer you control, and each branch of the gate becomes a check you can trust.

Forget travel. Say you have a comment box with a moderation gate: a checker votes 1 to keep a comment or 0 to remove it.

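import assert from "node:assert/strict";

// Anything with a generate() that resolves to { text } can act as the moderator.
type Responder = { generate(o: { prompt: string }): Promise<{ text: string }> };

// The gate: keep the comment on a "1" verdict, remove it on anything else.
async function post(text: string, moderator: Responder): Promise<string> {
  const verdict = await moderator.generate({ prompt: text });
  return verdict.text.trim().startsWith("1") ? text : "[removed]";
}

// A fake moderator that always answers `text`, with no network call.
const fake = (text: string): Responder => ({ generate: async () => ({ text }) });

async function testRemovesABadComment() {
  // Make the moderator vote "0" -- no real model, no guessing.
  assert.equal(await post("garbage", fake("0")), "[removed]");
}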

post takes the moderator as an argument, so the test hands it a fake whose verdict it decides. You are not testing whether the model moderates well; that is a separate problem. You are testing that your code does the right thing once a verdict comes back. Below you write the same two moves for TripMate’s gate.

Open start/agent.ts. The f5 gate is already here, pulled into one testable function handle(query, agents?): the guardrail votes, and only an allowed query reaches TripMate. By default it uses the real agents; pass agents to swap in fakes. fakeAgent(text) (an agent that always answers text, no real call) and two stubbed test functions are provided. You fill the tests in.
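To make the shape of that function concrete, here is a minimal sketch of what an injectable gate like handle could look like. It is not the contents of start/agent.ts: the Agent and Agents types, the REFUSAL wording, the startsWith("1") check, and the realAgent stand-in (which throws instead of calling a model, so the sketch stays offline) are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real start/agent.ts may differ.
type Agent = { generate(o: { prompt: string }): Promise<{ text: string }> };
type Agents = { guardrail: Agent; tripmate: Agent };

// Placeholder wording; the workshop defines its own REFUSAL.
const REFUSAL = "Sorry, I can only help with trip planning.";

// An agent that always answers `text` and never touches the network.
const fakeAgent = (text: string): Agent => ({
  generate: async () => ({ text }),
});

// Stand-in for a model-backed agent. It throws so that this sketch
// never makes a real call; a real default would call the model here.
const realAgent = (name: string): Agent => ({
  generate: async () => {
    throw new Error(`${name}: this would be a real model call`);
  },
});

const defaults: Agents = {
  guardrail: realAgent("guardrail"),
  tripmate: realAgent("tripmate"),
};

async function handle(query: string, agents: Partial<Agents> = {}): Promise<string> {
  // Any agent the caller leaves out falls back to the default one.
  const { guardrail, tripmate } = { ...defaults, ...agents };

  // The guardrail votes first: "1" allows the query, anything else blocks it.
  const verdict = await guardrail.generate({ prompt: query });
  if (!verdict.text.trim().startsWith("1")) return REFUSAL;

  // Only an allowed query reaches TripMate.
  const reply = await tripmate.generate({ prompt: query });
  return reply.text;
}
```

The design choice to notice is the optional second parameter: production code calls handle(query) and gets the defaults, while a test overrides only the agents it wants to control.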

npm run f7

Unlike every challenge before it, this one needs no model server. Every test injects a fake agent, so f7 runs offline and instantly, even with Ollama stopped.

  1. Test the block branch (TODO 1). Write testBlocksOffTopic. Call handle(...) with { guardrail: fakeAgent("0") } and assert the reply equals REFUSAL. Because a blocked query never reaches TripMate, you do not need to pass a tripmate. Mirror the shape of the moderator test above, but don’t copy it verbatim: the function and the assertion are different.

  2. Test the allow branch (TODO 2). Write testAllowsTrip. This time the guardrail votes "1", so handle() does reach TripMate, which would be a real model call. Pass both fakes, { guardrail: fakeAgent("1"), tripmate: fakeAgent("...") }, then assert the reply is a real answer (non-empty) and not REFUSAL.

  3. Run it. npm run f7 should now print 2/2 green with a ✓ on each line. Nothing waited on Ollama; the whole thing is instant. That speed is the point: you can run this on every commit.

Stuck? finish/agent.ts is the canonical version. Read it after you’ve had a real go.

  • The allow test makes a real, slow call. You passed a fake guardrail but no tripmate, so an allowed query falls through to the real one. Pass both fakes. (The AI SDK has no global “block real requests” switch, so injecting every agent is the discipline that keeps tests offline.)
  • The verdict is ignored. fakeAgent controls whichever agent you actually pass. Pass it as the guardrail when you want to control the allow/block decision.
  • Asserting the model’s quality. A fake says whatever you tell it, so a test on it proves nothing about how well the real model moderates. These tests cover your wiring and control flow, the part you actually wrote. Judging the model’s answers is evals, a different tool.
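One way to catch the first pitfall early is a stand-in that records or rejects calls, so a test can assert that an agent was never reached instead of trusting that it was not. The sketch below is illustrative only: spy, tripwire, and the compact gate function (a stand-in for handle, with an assumed "1"/"0" verdict check) are hypothetical helpers, not part of the workshop files.

```typescript
type Agent = { generate(o: { prompt: string }): Promise<{ text: string }> };

// A fake that always answers `text`.
const fake = (text: string): Agent => ({ generate: async () => ({ text }) });

// Wraps any agent and counts how often it is called.
function spy(inner: Agent): Agent & { calls: number } {
  const wrapped = {
    calls: 0,
    async generate(o: { prompt: string }) {
      wrapped.calls += 1;
      return inner.generate(o);
    },
  };
  return wrapped;
}

// An agent that fails loudly if anything reaches it.
const tripwire = (name: string): Agent => ({
  generate: async () => {
    throw new Error(`${name} was called, but this test expected it not to be`);
  },
});

const REFUSED = "[refused]"; // placeholder for the real refusal text

// Compact stand-in for handle: the guardrail votes, then TripMate answers.
async function gate(query: string, a: { guardrail: Agent; tripmate: Agent }): Promise<string> {
  const verdict = await a.guardrail.generate({ prompt: query });
  if (!verdict.text.trim().startsWith("1")) return REFUSED;
  return (await a.tripmate.generate({ prompt: query })).text;
}
```

Passing tripwire("tripmate") in a block-branch test turns a leaked call into an immediate failure, and wrapping a fake in spy lets an allow-branch test check that TripMate was called exactly once.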

Why inject the agents instead of mocking the model?

handle only ever reads .text off whatever it calls, so the test stand-in needs nothing more than a generate() that returns { text }, which fits on one line. The AI SDK also ships MockLanguageModelV3 (from ai/test) for mocking at the model level, when you need to assert on tool calls, streaming, or finish reasons. Here that is more machinery than the job needs: pass a fake agent and you are done. Either way the lesson is the same: control the answer, assert your branch.

These are real tests; we just run them with a script

To keep every challenge to one runnable file, main() calls the two tests and prints a ✓ for each. In a real project they would live in a file like gate.test.ts and a runner like Vitest would discover them, with expect(reply).toBe(REFUSAL) in place of assert. The bodies are identical; only the runner changes.
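For a sense of how little a script-style runner needs, here is a rough sketch of one. It is an assumption about the general shape, not the workshop's actual main(): the Test type, runAll, and the two placeholder tests are invented for illustration, and the output format is only modeled on the ✓ lines and the "2/2 green" tally described above.

```typescript
import assert from "node:assert/strict";

type Test = { name: string; run: () => Promise<void> };

// Runs each test in order, logs one mark per test plus a tally,
// and returns how many passed.
async function runAll(
  tests: Test[],
  log: (line: string) => void = console.log,
): Promise<number> {
  let passed = 0;
  for (const t of tests) {
    try {
      await t.run();
      passed += 1;
      log(`✓ ${t.name}`);
    } catch (err) {
      // A thrown assertion marks the test as failed; the run continues.
      const reason = err instanceof Error ? err.message : String(err);
      log(`✗ ${t.name}: ${reason}`);
    }
  }
  log(`${passed}/${tests.length} green`);
  return passed;
}

// Placeholder tests, standing in for testBlocksOffTopic and testAllowsTrip.
const demo: Test[] = [
  { name: "passes", run: async () => assert.equal(1 + 1, 2) },
  { name: "fails", run: async () => assert.equal(1 + 1, 3) },
];
```

Calling runAll(demo) logs one line per test and then the tally. A runner like Vitest replaces this loop with discovery, reporting, and watch mode, which is why the test bodies can move over unchanged.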

This is the last of the foundations. You have an agent that takes instructions, returns structured output, calls tools, gates its input, routes on descriptions, and now proves it behaves.

Now the path forks. Pick a track (do one, several, or stop here; the Discussion closes the workshop wherever you end up):

  • Patterns (p1–p7): compose several model calls in shapes you design, then hand control to the model.
  • RAG (r1–r2): give the agent your own documents to retrieve and rank by similarity, then chunk the long ones.
  • Full-Stack: put this agent behind a streaming web chat UI.

If you carry on with patterns, p1 is chaining: draft, check it with a gate, then fix only what failed.