f5: Guardrails

Once an agent is out in the world, people will ask it things you never intended: off-topic questions, and the occasional unsafe one. A guardrail is how you handle that. It is a cheap check that runs before your real agent and decides whether to let the request through.

Quick path

In a hurry? These three steps are the whole challenge. Everything below is the why and the how.

Run make f5 (ships with QUERY_TO_RUN = 2, off-topic) and watch TripMate answer it because nothing gates the request.
Edit start/agent.py: do TODO 1 (run the guardrail on the query first and read its 1/0 verdict) and TODO 2 (if the query is not allowed, print a refusal and return early).
Done when QUERY_TO_RUN = 1 passes through to TripMate while QUERY_TO_RUN = 2 and QUERY_TO_RUN = 3 get the refusal and TripMate never runs.

The shape is simple: one small classifier call in front of the expensive one. It looks at the user’s message and returns a verdict, and only allowed messages reach TripMate.

Mental model

query  ->  [ guardrail check ]  ->  allowed?  --yes-->  TripMate  ->  answer
              (cheap; runs first)      \---no--->  refuse (TripMate never runs)

The guardrail is a cheap call in front; on a blocked verdict the function returns before the real agent ever runs. Both calls here use a plain await agent.run(...) rather than streaming: the gate only needs a one-character verdict, and there is nothing to stream about a 1 or a 0.
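To make the two-call shape concrete without touching the exercise, here is a minimal sketch that uses stand-ins instead of real agents. The names fake_guardrail, fake_tripmate, handle, and REFUSAL are hypothetical, and the keyword test stands in for a model call. The calls list records what ran, so the "TripMate never runs" branch is visible.

```python
import asyncio

REFUSAL = "Sorry, I can only help with travel questions."  # hypothetical wording
calls = []  # records which stand-in ran, in order


async def fake_guardrail(query: str) -> bool:
    # Stand-in for the cheap check: a keyword test, not a model call.
    calls.append("guardrail")
    return "trip" in query.lower()


async def fake_tripmate(query: str) -> str:
    # Stand-in for the expensive agent.
    calls.append("tripmate")
    return f"(TripMate's answer to: {query})"


async def handle(query: str) -> str:
    allowed = await fake_guardrail(query)  # the cheap call always runs first
    if not allowed:
        return REFUSAL  # blocked: return before the real agent runs
    return await fake_tripmate(query)


if __name__ == "__main__":
    print(asyncio.run(handle("Plan a weekend trip to Lisbon")))
    print(asyncio.run(handle("Write my essay for me")))
    print(calls)
```

In the lesson itself both stand-ins are real agent calls and the verdict arrives as a 1 or a 0 rather than a boolean, but the branch has the same shape.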

The mechanic, on a throwaway bot

The gate is two calls and a branch. Take a billing-support bot that has nothing to do with travel: a cheap classifier decides, and only an allowed message reaches the real bot. The classifier is the small piece. It replies with one character, 1 to allow and 0 to block, because cheapness is the point:

# Agent and model are assumed to be set up as in the earlier lessons.
checker = Agent(
    model,
    instructions=(
        "You gate a billing-support bot. Reply with ONE character: 1 if the message is "
        "about billing or payments, 0 for anything else. Nothing else."
    ),
)

From there the gate is three moves, and they are yours to write:

Run the checker first, before the real bot, on the user’s message.
Turn its reply into a boolean, failing closed: only a clear 1 passes, so strip the reply and check it starts with 1, and treat anything else as a block.
Return early with a refusal when it’s blocked, so the real bot never runs.
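The second move is the one that is easiest to get subtly wrong, so here is a sketch of only the parsing step, on plain strings with no agent involved. The helper name is_allowed is hypothetical; the rule is the one stated above: strip the reply, pass it only if it starts with 1, and block everything else.

```python
def is_allowed(reply: str) -> bool:
    """Fail closed: pass only when the stripped reply starts with 1."""
    return reply.strip().startswith("1")


if __name__ == "__main__":
    # A clean allow, an allow with stray whitespace, a block,
    # an empty reply, and a chatty reply that buries the verdict.
    for sample in ["1", " 1\n", "0", "", "Sure, 1"]:
        print(repr(sample), is_allowed(sample))
```

Only the first two samples pass. The empty reply and the chatty one are treated as blocks, which is what failing closed means: when the checker misbehaves, the request is refused rather than waved through.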

You did the typed-output and instructions work in f1 to f3; this is the same toolkit wired into a gate. Carry the three moves over to TripMate’s guardrail below.

Open start/agent.py. Two agents are already defined: a guardrail and the plain tripmate from f1. The GUARDRAIL_SYSTEM brief is provided: it allows travel questions, blocks everything else and anything unsafe, and answers with a single 1 or 0. What’s missing is the gate itself. You write it, applying the three moves above.

Run it:

make f5

As it ships there is no guardrail wired in, so TripMate answers whatever you send it. QUERY_TO_RUN is set to an off-topic question, and you’ll watch TripMate answer it. That is the gap: an assistant with no gate has no idea what it should refuse.

Build it

Run it and watch the gap. Run make f5 with QUERY_TO_RUN = 2 (an off-topic question). TripMate answers it, because nothing is checking the request first. Try QUERY_TO_RUN = 3 (an unsafe question) and watch it engage with that too.
Run the guardrail first (TODO 1). Before TripMate runs, send the query to the guardrail and turn its one-character reply into a boolean. These are the first two moves from the worked example, now applied to TripMate’s guardrail. Fail closed: only a clear allow passes; treat anything else as a block. Replace the allowed = True placeholder with the real verdict.
Refuse blocked queries (TODO 2). If the verdict was a block, print a fixed refusal and return before tripmate runs. This is the third move. Then re-run all three: QUERY_TO_RUN = 1 (a real trip question) passes through to TripMate; QUERY_TO_RUN = 2 and QUERY_TO_RUN = 3 get the refusal and TripMate never sees them.

Stuck? finish/agent.py is the canonical version. Read it after you’ve had a real go.
Check you’ve got it. You should be able to point at the two-call shape: the guardrail runs first, and only an allowed verdict lets TripMate run. Scroll up to the trace: a blocked query shows the guardrail run and then nothing, an allowed one shows the guardrail run and then the TripMate run.

A couple of things worth knowing

Why a separate call instead of one clever prompt?

You could tell TripMate “refuse anything off-topic” in its own instructions, and that helps. But then the same model that wants to be helpful is also deciding when to refuse, on every turn, mixed in with the real work. A separate guardrail has one job and is judged on its own, and you can make it cheap and strict without touching how TripMate answers. It is also the seam where you would later log refusals, swap in a faster model, or tighten the policy.

Keep the guardrail cheap

A guardrail runs on every request, so it should be the smallest call you can make. This one returns a single character, 1 or 0, so the model writes almost nothing: no schema to fill, no reason string to compose, just one token. You could give it an output_type with a typed allowed/reason instead, and the reason is handy while you are learning why a verdict went the way it did, but every extra token is one that every request pays for. When in doubt, keep the gate cheap and log the blocked queries somewhere else. On a small local model the call is not instant either way; the point is the pattern, a light check in front, not a stopwatch number.
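One way to "log the blocked queries somewhere else" is to record them after the verdict, outside the model call, so the gate itself still emits a single character. The sketch below is an assumption about how that could look, not part of the lesson's code: record_block is a hypothetical helper, and the in-memory sink stands in for a real log file.

```python
import io
import json


def record_block(query: str, raw_reply: str, sink) -> None:
    """Append one JSON line describing a blocked query to a file-like sink."""
    entry = {"query": query, "raw_reply": raw_reply.strip()}
    sink.write(json.dumps(entry) + "\n")


sink = io.StringIO()  # stand-in for an open log file
record_block("Write my essay for me", "0\n", sink)
print(sink.getvalue(), end="")
```

Keeping the raw reply in the log is a deliberate choice: if the checker ever answers with something other than a clean 0, the log shows what it actually said, and no request pays extra tokens for that.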

Scope is a guardrail too

Guardrails are not only about unsafe content. The most common use is scope: keeping the assistant on the job it was built for. A travel bot that will write your code or your essays is a support headache and a bigger surface to test. “Only answer travel questions” is a guardrail, and it is the one you will reach for most.

That is the gate in front of the agent. Next up is f6, where the model picks among several tools from their descriptions alone.