f5: Guardrails
Once an agent is out in the world, people will ask it things you never intended: off-topic questions, and the occasional unsafe one. A guardrail is how you handle that. It is a cheap check that runs before your real agent and decides whether to let the request through.
Quick path
In a hurry? These three steps are the whole challenge. Everything below is the why and the how.
- Run make f5 (ships with QUERY_TO_RUN = 2, off-topic) and watch TripMate answer it because nothing gates the request.
- Edit start/agent.py: do TODO 1 (run guardrail on the query first and read its 1/0 verdict), then TODO 2 (if not allowed, print a refusal and return early).
- Done when QUERY_TO_RUN = 1 passes through to TripMate while QUERY_TO_RUN = 2 and QUERY_TO_RUN = 3 get the refusal and TripMate never runs.
The shape is simple: one small classifier call in front of the expensive one. It looks at the user’s message and returns a verdict, and only allowed messages reach TripMate.
Mental model
The guardrail is a cheap call in front; a blocked verdict returns before the real agent ever runs.
Both calls here use await agent.run(...): the gate consumes a one-character verdict, and
there is nothing to stream about a 1 or a 0.
The mechanic, on a throwaway bot
The gate is two calls and a branch. Take a billing-support bot that has nothing to do with
travel: a cheap classifier decides, and only an allowed message reaches the real bot. The
classifier is the small piece: one character out, 1 to allow and 0 to block, because
cheapness is the point.
From there the gate is three moves, and they are yours to write:
- Run the checker first, before the real bot, on the user’s message.
- Turn its reply into a boolean, failing closed: only a clear 1 passes, so strip the reply and check it starts with 1, and treat anything else as a block.
- Return early with a refusal when it’s blocked, so the real bot never runs.
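For comparison once you have had a go, the three moves above might be sketched like this. It is a minimal, library-free sketch and not the course's worked example: classify and billing_bot are hypothetical stand-ins for the two model calls (in the real thing each would be an await agent.run(...) call), the keyword rule inside classify only fakes a model's verdict so the sketch runs offline, and REFUSAL is invented wording.

```python
import asyncio

REFUSAL = "Sorry, I can only help with billing questions."  # invented wording


async def classify(message: str) -> str:
    """Hypothetical stand-in for the cheap classifier call.

    A real classifier would be a model call that replies "1" or "0";
    a keyword rule fakes that reply here so the sketch runs offline.
    """
    billing_words = ("invoice", "refund", "charge", "billing")
    on_topic = any(word in message.lower() for word in billing_words)
    return "1\n" if on_topic else "0\n"


async def billing_bot(message: str) -> str:
    """Hypothetical stand-in for the expensive call to the real bot."""
    return f"(billing bot answers: {message})"


def is_allowed(reply: str) -> bool:
    """Move 2: fail closed. Only a stripped reply that starts with "1" passes."""
    return reply.strip().startswith("1")


async def handle(message: str) -> str:
    reply = await classify(message)   # move 1: run the checker first
    if not is_allowed(reply):         # move 2: turn the reply into a boolean
        return REFUSAL                # move 3: return early, the bot never runs
    return await billing_bot(message)


if __name__ == "__main__":
    print(asyncio.run(handle("Why was my invoice charged twice?")))
    print(asyncio.run(handle("Write me a poem about the sea.")))
```

Note what failing closed buys you: an empty reply, a chatty reply such as "Sure! 1", or anything else that does not begin with 1 is treated as a block rather than waved through.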
You did the typed-output and instructions work in f1 to f3; this is the same toolkit wired
into a gate. Carry the three moves over to TripMate’s guardrail below.
Open start/agent.py. There are two agents already wired: a guardrail
and the plain tripmate from f1. The GUARDRAIL_SYSTEM brief is provided: it allows travel
questions, blocks everything else and anything unsafe, and answers with a single 1 or 0.
What’s missing is the gate itself: you write it, applying the three moves above.
Run it with make f5.
As it ships there is no guardrail wired in, so TripMate answers whatever you send it.
QUERY_TO_RUN is set to an off-topic question, and you’ll watch TripMate answer it. That is
the gap: an assistant with no gate has no idea what it should refuse.
Build it
- Run it and watch the gap. Run make f5 with QUERY_TO_RUN = 2 (an off-topic question). TripMate answers it, because nothing is checking the request first. Try QUERY_TO_RUN = 3 (an unsafe question) and watch it engage with that too.
- Run the guardrail first (TODO 1). Before TripMate runs, send the query to the guardrail and turn its one-character reply into a boolean: the first two moves from the worked example, now on TripMate’s guardrail. Fail closed: only a clear allow passes; treat anything else as a block. Replace the allowed = True placeholder with the real verdict.
- Refuse blocked queries (TODO 2). If the verdict was a block, print a fixed refusal and return before tripmate runs: the third move. Then re-run all three: QUERY_TO_RUN = 1 (a real trip question) passes through to TripMate; QUERY_TO_RUN = 2 and QUERY_TO_RUN = 3 get the refusal and TripMate never sees them. Stuck? finish/agent.py is the canonical version; read it after you’ve had a real go.
- Check you’ve got it. You should be able to point at the two-call shape: the guardrail runs first, and only an allowed verdict lets TripMate run. Scroll up to the trace: a blocked query shows the guardrail run and then nothing; an allowed one shows the guardrail run and then the TripMate run.
A couple of things worth knowing
Why a separate call instead of one clever prompt?
You could tell TripMate “refuse anything off-topic” in its own instructions, and that helps, but it is the same model that wants to be helpful deciding to refuse itself, on every turn, mixed in with the real work. A separate guardrail is one job, judged on its own, and you can make it cheap and strict without touching how TripMate answers. It is also the seam where you would later log refusals, swap in a faster model, or tighten the policy.
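One way to picture that seam is a small wrapper that takes the checker as a parameter. This is a sketch under assumptions, not how start/agent.py is laid out: make_gate, travel_only, block_everything and fake_tripmate are all hypothetical names, and the checkers are plain functions standing in for model calls.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("guardrail")

# Any async function from message to reply text can fill either role.
Check = Callable[[str], Awaitable[str]]
Answer = Callable[[str], Awaitable[str]]


def make_gate(check: Check, answer: Answer, refusal: str) -> Answer:
    """Wrap `answer` so `check` runs first and every blocked query is logged."""

    async def gated(message: str) -> str:
        verdict = (await check(message)).strip()
        if not verdict.startswith("1"):  # fail closed
            logger.warning("blocked %r (raw verdict %r)", message, verdict)
            return refusal
        return await answer(message)

    return gated


async def travel_only(message: str) -> str:
    """Hypothetical checker; a keyword rule stands in for the model call."""
    return "1" if "trip" in message.lower() else "0"


async def block_everything(message: str) -> str:
    """A stricter hypothetical policy, swapped in without touching the answerer."""
    return "0"


async def fake_tripmate(message: str) -> str:
    """Hypothetical stand-in for the real agent."""
    return f"(TripMate answers: {message})"


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    gate = make_gate(travel_only, fake_tripmate, "Sorry, travel questions only.")
    print(asyncio.run(gate("Plan a trip to Lisbon")))
    print(asyncio.run(gate("Write my essay")))
```

Swapping travel_only for block_everything changes the policy without a single edit to fake_tripmate, which is the separation the paragraph above is describing.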
Keep the guardrail cheap
A guardrail runs on every request, so it should be the smallest call you can make. This one
returns a single character, 1 or 0, so the model writes almost nothing: no schema to
fill, no reason string to compose, just one token. You could give it an output_type with a
typed allowed/reason instead, and the reason is handy while you are learning why a verdict
went the way it did, but every extra token is one that every request pays for. When in doubt,
keep the gate cheap and log the blocked queries somewhere else. On a small local model the
call is not instant either way; the point is the pattern, a light check in front, not a
stopwatch number.
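To make the trade-off concrete, here is a rough sketch of the two reply shapes side by side. Verdict and its fields mirror the allowed/reason idea above but are otherwise hypothetical, the JSON reply is an assumed wire format rather than what any particular library produces, and character counts are only a crude proxy for tokens.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Verdict:
    """Hypothetical typed verdict: richer, but the model has to write all of it."""
    allowed: bool
    reason: str


def parse_char(reply: str) -> bool:
    """Single-character gate: fail closed, only a leading "1" passes."""
    return reply.strip().startswith("1")


def parse_typed(reply: str) -> Verdict:
    """Typed gate: fail closed when the reply is not the JSON object we expect."""
    try:
        data = json.loads(reply)
        return Verdict(allowed=data["allowed"] is True, reason=str(data["reason"]))
    except (ValueError, KeyError, TypeError):
        return Verdict(allowed=False, reason="unparseable verdict")


# The same "block" decision in both shapes.
char_reply = "0"
typed_reply = json.dumps(asdict(Verdict(False, "Not a travel question.")))

print(parse_char(char_reply), len(char_reply))
print(parse_typed(typed_reply), len(typed_reply))
```

Both parsers reach the same decision; the typed one simply asks the model to write more on every request, in exchange for a reason you can read.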
Scope is a guardrail too
Guardrails are not only about unsafe content. The most common use is scope: keeping the assistant on the job it was built for. A travel bot that will write your code or your essays is a support headache and a bigger surface to test. “Only answer travel questions” is a guardrail, and it is the one you will reach for most.
That is the gate in front of the agent. Next up is f6, where the model picks among several tools from their descriptions alone.