f5: Guardrails
Once an agent is out in the world, people will ask it things you never intended: off-topic questions, and the occasional unsafe one. A guardrail is how you handle that. It is a cheap check that runs before your real agent and decides whether to let the request through.
Quick path
In a hurry? These three steps are the whole challenge. Everything below is the why and the how.
- Run npm run f5 (RUN = 2) and watch TripMate answer an off-topic question because nothing checks it first.
- Read the throwaway worked example below to learn the two-call gate, then edit start/agent.ts: TODO 1 (run guardrail first and turn its 1/0 verdict into a boolean), TODO 2 (if not allowed, print a refusal and return early).
- Done when RUN 1 passes through to TripMate while RUN 2 and RUN 3 get the refusal and TripMate never runs.
One small classifier call sits in front of the expensive one. It looks at the user’s message and returns a verdict, and only allowed messages reach TripMate.
Mental model
The guardrail is a cheap call in front; a blocked verdict returns before the real agent ever runs.
Both calls here use .generate(): the gate consumes a one-character verdict, and there is
nothing to stream about a 1 or a 0.
The mechanic, on a throwaway bot
The gate is two calls and a branch. Take a billing-support bot that has nothing to do with
travel: a cheap classifier decides, and only an allowed message reaches the real bot. The
classifier is the small piece: one character out, 1 to allow and 0 to block, because
cheapness is the point.
From there the gate is three moves, and they are yours to write:
- Run the checker first, before the real bot, on the user’s message.
- Turn its reply into a boolean, failing closed: only a clear 1 passes, so trim the reply and check it starts with 1, and treat anything else as a block.
- Return early with a refusal when it’s blocked, so the real bot never runs.
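If it helps to see the three moves once before writing them for TripMate, here they are on the throwaway billing bot. This is a minimal sketch, not the course's code: the TextAgent interface, the two stub agents (billingClassifier and billingBot), the REFUSAL text, and the keyword rule inside the stub classifier are all assumptions made so the example runs without a model. In the exercise both agents are model-backed and called with .generate().

```typescript
// Hypothetical minimal shape of an agent: one prompt in, one text reply out.
// It stands in for the real agents, which come from the SDK.
interface TextAgent {
  generate(input: { prompt: string }): Promise<{ text: string }>;
}

// Records which agents ran, so the two-call shape is visible.
const calls: string[] = [];

// Stub classifier (assumption): a real one is a model with a strict brief.
// It replies with one character, "1" to allow and "0" to block.
const billingClassifier: TextAgent = {
  async generate({ prompt }) {
    calls.push("classifier");
    const onTopic = /invoice|refund|charge|billing/i.test(prompt);
    return { text: onTopic ? "1\n" : "0" };
  },
};

// Stub for the expensive bot (assumption): only reached on an allowed verdict.
const billingBot: TextAgent = {
  async generate({ prompt }) {
    calls.push("bot");
    return { text: `Billing answer for: ${prompt}` };
  },
};

const REFUSAL = "Sorry, I can only help with billing questions.";

// Move 2: fail closed. Only a reply that starts with "1" after trimming passes.
function isAllowed(reply: string): boolean {
  return reply.trim().startsWith("1");
}

async function answer(query: string): Promise<string> {
  // Move 1: run the checker first, on the user's message.
  const verdict = await billingClassifier.generate({ prompt: query });
  // Move 3: return early with a refusal, so the real bot never runs.
  if (!isAllowed(verdict.text)) return REFUSAL;
  const reply = await billingBot.generate({ prompt: query });
  return reply.text;
}
```

With these stubs, answer("Why was I charged twice?") runs the classifier and then the bot, while answer("Write me a poem") runs only the classifier and returns the refusal. An empty reply, or a chatty one such as "yes", also counts as a block, which is what failing closed means here.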
You did the typed-output and instructions work in f1 to f3; this is the same toolkit wired
into a gate. Carry the three moves over to TripMate’s guardrail below.
Open start/agent.ts. There are two agents already wired: a guardrail
and the plain tripmate from f1. The GUARDRAIL_SYSTEM brief is provided: it allows travel
questions, blocks everything else and anything unsafe, and answers with a single 1 or 0.
What’s missing is the gate itself: you write it, applying the three moves above.
Run it with npm run f5.
As it ships there is no guardrail wired in, so TripMate answers whatever you send it. RUN
is set to an off-topic question, and you’ll watch TripMate answer it. That is the
gap: an assistant with no gate has no idea what it should refuse.
Build it
- Run it and watch the gap. Run npm run f5 with RUN = 2 (an off-topic question). TripMate answers it, because nothing is checking the request first. Try RUN = 3 (an unsafe question) and watch it engage with that too.
- Run the guardrail first (TODO 1). Before TripMate runs, send the query to the guardrail and turn its one-character reply into a boolean, the first two moves from the worked example, now on TripMate’s guardrail. Fail closed: only a clear allow passes; treat anything else as a block. Replace the const allowed = true; placeholder with the real verdict.
- Refuse blocked queries (TODO 2). If the verdict was a block, print a fixed refusal and return before tripmate runs, the third move. Then re-run all three: RUN = 1 (a real trip question) passes through to TripMate; RUN = 2 and RUN = 3 get the refusal and TripMate never sees them.
  Stuck? finish/agent.ts is the canonical version. Read it after you’ve had a real go.
- Check you’ve got it. You should be able to point at the two-call shape: the guardrail runs first, and only an allowed verdict lets TripMate run. Scroll up to the trace: a blocked query shows the guardrail call and then nothing, an allowed one shows the guardrail call and then the TripMate call.
A couple of things worth knowing
Why a separate call instead of one clever prompt?
You could tell TripMate “refuse anything off-topic” in its own instructions, and that helps, but then the same model that wants to be helpful is also deciding when to refuse, on every turn, mixed in with the real work. A separate guardrail has one job, judged on its own, and you can make it cheap and strict without touching how TripMate answers. It is also the seam where you would later log refusals, swap in a faster model, or tighten the policy.
Keep the guardrail cheap
A guardrail runs on every request, so it should be the smallest call you can make. This one
returns a single character, 1 or 0, so the model writes almost nothing: no JSON to
assemble, no reason string to compose, just one token. You could ask it for a typed
{ allowed, reason } instead, and the reason is handy while you are learning why a verdict
went the way it did, but every extra token is one that every request pays for. When in doubt,
keep the gate cheap and log the blocked queries somewhere else. On a small local model the
call is not instant either way; the point is the pattern, a light check in front, not a
stopwatch number.
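For comparison, the typed { allowed, reason } variant could be parsed along these lines. This is a sketch under assumptions: the classifier is taken to reply with a JSON string, and the Verdict type and parseVerdict helper are hypothetical names, not part of the course code. It keeps the same fail-closed rule as the one-character gate.

```typescript
// Hypothetical typed verdict: handier while learning, more tokens per request.
interface Verdict {
  allowed: boolean;
  reason: string;
}

// Parse a JSON reply, failing closed: malformed JSON, a missing field, or a
// non-boolean "allowed" all become a block.
function parseVerdict(reply: string): Verdict {
  try {
    const data: unknown = JSON.parse(reply);
    if (typeof data === "object" && data !== null) {
      const { allowed, reason } = data as Record<string, unknown>;
      if (typeof allowed === "boolean") {
        return { allowed, reason: typeof reason === "string" ? reason : "" };
      }
    }
  } catch {
    // Not valid JSON: fall through to the block below.
  }
  return { allowed: false, reason: "unparseable verdict" };
}
```

The extra parsing is the visible cost on your side; the less visible one is that the model has to write the whole object on every request, which is the trade the one-character verdict avoids.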
Scope is a guardrail too
Guardrails are not only about unsafe content. The most common use is scope: keeping the assistant on the job it was built for. A travel bot that will write your code or your essays is a support headache and a bigger surface to test. “Only answer travel questions” is a guardrail, and it is the one you will reach for most.
Input gates vs. output guardrails (and middleware)
f5 is an input guardrail: it checks the request before the model runs and can refuse it.
The other half is output guardrails, which inspect what the model generated and
clean it, for example redacting personal data before the reply leaves your system. The AI
SDK’s idiomatic home for those is middleware: wrapLanguageModel({ model, middleware })
wraps a model so a wrapGenerate hook can rewrite the result, and the wrapped model drops
into any agent unchanged. Middleware is also where reusable, model-agnostic concerns like
logging and caching live. The self-serve track guardrails-middleware builds one. A real
system often runs both: an input gate in front, an output filter behind.
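The output half can be sketched without the SDK. The redaction below is a plain function. The wrapper around it is only shaped like a wrapGenerate hook and uses a simplified local GenerateResult type with a single text field, which is an assumption for illustration. The SDK's real middleware types and result shape depend on the version, so check its reference before passing anything like this to wrapLanguageModel. The two regular expressions are deliberately crude and are not a complete personal-data detector.

```typescript
// Crude patterns (assumption): enough for a demo, not a complete PII detector.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

// Output guardrail: clean the generated text before it leaves the system.
function redactPersonalData(text: string): string {
  return text.replace(EMAIL, "[email]").replace(PHONE, "[phone]");
}

// Simplified stand-in for a generate result (assumption, not the SDK type).
interface GenerateResult {
  text: string;
}

// Shaped like a wrapGenerate hook: call the wrapped model, then rewrite its result.
async function redactingWrapGenerate(args: {
  doGenerate: () => Promise<GenerateResult>;
}): Promise<GenerateResult> {
  const result = await args.doGenerate();
  return { ...result, text: redactPersonalData(result.text) };
}
```

The input gate and this filter do not know about each other, which is why a system can run both: the gate decides whether the model runs at all, and the filter only ever sees replies that were allowed through.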
That is the gate in front of the agent. Next up is f6, where the model picks among several tools from their descriptions alone.