Your AI Guardrails Fail Up to 100% of the Time. Build the Wrong Answer Out Instead.
Six popular AI guardrails were bypassed at 12–100% success. In law, where base hallucination is already 17–33%, 'mostly works' is malpractice. Governance by construction makes the unsafe action impossible — there's nothing to evade.
The guardrail is a lock you keep picking
Here's how almost every AI guardrail works today. You give the agent a general power — answer any legal question — and then you put a checker in front of it: a system prompt that says 'only these jurisdictions,' a classifier that screens the output, a rule that fires after the model has already chosen what to say. Call it governance by runtime rejection: the agent can do the wrong thing, and you're betting the check catches it first.
That bet is losing, and now we have numbers. In the 2025 study Bypassing LLM Guardrails (arXiv 2504.11168), researchers ran character-injection and adversarial-ML evasion against six popular guardrail systems. Attack success rates:
Emoji smuggling. Unicode tag smuggling. Confidently phrased context — 'the governing-law clause says Delaware, so apply Delaware law.' The pattern is always the same: you left the dangerous action in the agent's hands and you're trying to talk it out of using it. You patch one phrasing; someone finds the next. It's a lock you keep picking because you keep leaving the door.
Why 'mostly works' is malpractice in law
A 13% bypass rate is an interesting security problem in most products. In legal AI it's a liability event, because the base rates are already bad. The Stanford RegLab study found that even RAG-grounded, purpose-built tools — Westlaw AI-Assisted Research and Lexis+ AI — hallucinated on ~33% and >17% of queries respectively. That's the floor, before anyone is actively trying to push the model out of bounds. Add a guardrail that fails 13–65% of the time when someone does try, and you have a system that will, predictably and at scale, give a confident answer under a law that doesn't govern.
We know the downstream cost because it's now its own dataset: 1,458 court cases with AI-fabricated citations and counting. Every one of those is a lawyer who trusted an output a guardrail was supposed to catch. 'Mostly works' is exactly the failure profile that ends in a sanctions order.
The flip: construct, don't reject
Governance by construction starts from a different question. Not 'how do we stop the agent from doing the wrong thing?' but 'what if the wrong thing was never one of its options?'
The agent's action space is built from a policy, once, at construction time. The policy lists what's allowed — for us, the jurisdictions the agent may reason about. The builder creates one action per allowed jurisdiction, plus a single decline action. It never creates an action for anything outside the policy.
Ask the constructed agent about Delaware law and it doesn't refuse in the moral sense — it has no Delaware action to invoke. The only thing it can do with an out-of-scope request is decline. The unsafe behavior isn't forbidden; it's absent. There is no emoji-smuggled, multi-turn, confidently-phrased prompt clever enough to call a function that does not exist. The 13–100% evasion surface collapses to zero, because there's no checker to evade — the capability simply isn't there.
A note on how this was built, because we believe in showing the work: govcon came out of a morning research brief, was sketched before lunch, and had green tests by the afternoon. The core is ~200 lines of TypeScript with zero runtime dependencies. The idea is small — that's the tell. The best safety primitives remove a category of failure instead of adding a category of check.
What it looks like for a legal agent
Our reference agent, govcon, is scoped to six MENA jurisdictions: UAE, DIFC, Saudi Arabia, Lebanon, Egypt and Qatar. For each, the constructor builds an answer action grounded in that jurisdiction's primary instruments — so a DIFC employment question comes back citing DIFC Employment Law No. 2 of 2019, not boilerplate. For anything else, no action exists:
We then ran it through HAQQ-LAB, our open civil-law benchmark, against an ungoverned baseline across 16 tasks including four out-of-jurisdiction traps:
Not because it was prompted better. Because the action to do otherwise was never constructed.
The world-model connection
This is the software echo of a point Yann LeCun keeps making: a large language model predicts tokens, not consequences. It has no internal model of what its output does in the legal, financial or physical world. Asking such a system to reliably stay in bounds by choosing to is asking the wrong thing of it — and the 13–100% bypass numbers are what 'asking it nicely' buys you.
So don't ask. Take the out-of-bounds action out of the world the agent can act in. Governance by construction is one concrete, boring, shippable way to honor LeCun's critique without waiting for a new model architecture. It's the same instinct behind capability-based security and the principle of least privilege, applied to an agent's action space instead of a process's file handles.
This is defense in depth, not a silver bullet
- It eliminates a category: out-of-policy actions. It does not make in-policy answers correct. A govcon agent scoped to DIFC can still be wrong about DIFC law — that's what the benchmark's Substance dimension (and a good reasoning model) is for.
- It's a structural floor, not the whole stack. You still want grounding, citation verification, human review on high-stakes output, and audit logging. Construction is the earliest and hardest-to-bypass layer, not the only one.
- The policy itself is now the thing to get right. You've moved the trust from 'did the model behave?' to 'is the policy correct and complete?' — a much smaller, reviewable, version-controlled surface. That's the trade we want.
The point isn't that construction replaces guardrails everywhere. It's that for the failures you cannot afford — giving advice under a law that doesn't govern — you should make the action impossible, and reserve runtime checks for the failures you can survive.
HAQQ's take: the missing primitive
There's a gap in legal AI nobody has filled. On one side, 'the AI drafts a contract.' On the other, what clients and regulators actually want: 'the AI cannot draft a contract that violates the jurisdiction's mandatory rules.' Between them sits a missing primitive — rules compiled into the agent's action space rather than checked after the fact.
Governance by construction is a down payment on that primitive. The same shape that makes out-of-jurisdiction advice impossible can make a mandatory clause non-optional or a forbidden clause unwritable — compliance as a property of the action space, not a hope about the output. In a market where the leaders are worth a combined $16.6B and still competing on who hallucinates less, 'structurally cannot do the unsafe thing' is a different kind of claim — one you prove, not promise, and one no incumbent can copy by tightening a system prompt.
govcon is AGPL-3.0 on GitHub. It's ~200 lines. Read it in five minutes; the idea is the whole product.
Key takeaways
- Runtime guardrails are bypassed 13–100% of the time (arXiv 2504.11168); in law, where base hallucination is already 17–33%, 'mostly works' is a liability event.
- Governance by construction builds the action space from a policy, so the unsafe action never exists — there's nothing to evade.
- govcon, our jurisdiction-scoped legal agent, scored 100% out-of-jurisdiction trap defense vs 0% for an ungoverned baseline in HAQQ-LAB.
- It's the software answer to LeCun's critique and a least-privilege model for agents: remove the action, don't trust the model to avoid it.
- It's defense in depth, not a silver bullet — it eliminates a category of failure and moves trust to a reviewable policy.