Governance by Construction: AI Guardrails You Can't Bypass

By HAQQ Team · 2026-05-22 · Updated 2026-06-11 · 12 min read · Ai-legal-tech

Six popular LLM guardrails were bypassed at rates from 13% to 100%. Governance by construction builds the unsafe action out of the agent, so there is nothing left to evade.

The guardrail is a lock you keep picking

Here's how almost every AI guardrail works today. You give the agent a general power — answer any legal question — and then you put a checker in front of it: a system prompt that says 'only these jurisdictions,' a classifier that screens the output, a rule that fires after the model has already chosen what to say. Call it governance by runtime rejection: the agent can do the wrong thing, and you're betting the check catches it first.

Key facts

Six production guardrail systems were bypassed at 12.7%–65.2% success rates, with simple character transforms reaching up to 100% evasion.
govcon, HAQQ's open-source jurisdiction-scoped agent (~200 lines, AGPL-3.0), scored 100% out-of-jurisdiction trap defense in HAQQ-LAB vs 0% for an ungoverned baseline.

That bet is losing, and now we have numbers. In the 2025 study Bypassing LLM Guardrails (arXiv 2504.11168), researchers ran character-injection and adversarial-ML evasion against six popular guardrail systems. Attack success rates:

How often production AI guardrails get bypassed — Jailbreak attack success rate per guardrail (Bypassing LLM Guardrails, arXiv 2504.11168, 2025).
NeMo Guard Jailbreak Detect	65.2%
Vijil Prompt Injection	35.6%
Protect AI v1	24.4%
Azure Prompt Shield	13.0%
Meta Prompt Guard	12.7%
Simple character transforms (worst case)	up to 100%

Simple character transforms reach up to 100% evasion in the worst case. A separate multi-turn technique raises attack success rates by 60% or more.

Emoji smuggling. Unicode tag smuggling. Confidently phrased context: 'the governing-law clause says Delaware, so apply Delaware law.' The pattern is always the same: you left the dangerous action in the agent's hands and you're trying to talk it out of using it. You patch one phrasing; someone finds the next. The lock keeps getting picked because you keep leaving the door in place.
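To make Unicode tag smuggling concrete, here is a minimal sketch of why a string-matching check misses it. This is an illustration only, not the method used in the cited study: the keyword filter, the blocklist, and every function name below are hypothetical. The one fact it relies on is that the Unicode Tags block mirrors printable ASCII at U+E0000 plus the ASCII code point, and those characters normally render as nothing.

```typescript
// Illustrative only: naiveFilterBlocks and the blocklist are hypothetical,
// not taken from any real guardrail product.

const TAG_BASE = 0xe0000; // Unicode Tags block: U+E0000 + ASCII code point

/** Re-encode printable ASCII as (normally invisible) tag characters. */
function toTagChars(text: string): string {
  return Array.from(text)
    .map((ch) => {
      const cp = ch.codePointAt(0)!;
      return cp >= 0x20 && cp <= 0x7e ? String.fromCodePoint(TAG_BASE + cp) : ch;
    })
    .join("");
}

/** Map tag characters back to ASCII, as a normalizing filter would do first. */
function fromTagChars(text: string): string {
  return Array.from(text)
    .map((ch) => {
      const cp = ch.codePointAt(0)!;
      return cp >= TAG_BASE + 0x20 && cp <= TAG_BASE + 0x7e
        ? String.fromCodePoint(cp - TAG_BASE)
        : ch;
    })
    .join("");
}

/** Hypothetical keyword guard: blocks a prompt that contains a listed term. */
function naiveFilterBlocks(prompt: string, terms: string[]): boolean {
  const lower = prompt.toLowerCase();
  return terms.some((term) => lower.includes(term));
}

const blockedTerms = ["delaware"];
const plainPrompt = "Apply Delaware law to this SAFE.";
const smuggledPrompt = "Apply " + toTagChars("Delaware") + " law to this SAFE.";

const plainBlocked = naiveFilterBlocks(plainPrompt, blockedTerms); // true
const smuggledBlocked = naiveFilterBlocks(smuggledPrompt, blockedTerms); // false
const normalizedBlocked = naiveFilterBlocks(fromTagChars(smuggledPrompt), blockedTerms); // true
```

Normalizing the input closes this particular hole, which is the patch-one-phrasing cycle in miniature: the fix covers one encoding and says nothing about the next.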

Why 'mostly works' is malpractice in law

A 13% bypass rate is an interesting security problem in most products. In legal AI it's a liability event, because the base rates are already bad. The Stanford RegLab study found that even RAG-grounded, purpose-built tools — Westlaw AI-Assisted Research and Lexis+ AI — hallucinated on ~33% and >17% of queries respectively. That's the floor, before anyone is actively trying to push the model out of bounds. Add a guardrail that fails 13–65% of the time when someone does try, and you have a system that will, predictably and at scale, give a confident answer under a law that doesn't govern.

We know the downstream cost because it's now its own dataset: 1,458 court cases with AI-fabricated citations and counting. Every one of those is a lawyer who trusted an output a guardrail was supposed to catch. 'Mostly works' is exactly the failure profile that ends in a sanctions order.

The flip: construct, don't reject

Governance by construction starts from a different question. Not 'how do we stop the agent from doing the wrong thing?' but 'what if the wrong thing was never one of its options?'

The agent's action space is built from a policy, once, at construction time. The policy lists what's allowed — for us, the jurisdictions the agent may reason about. The builder creates one action per allowed jurisdiction, plus a single decline action. It never creates an action for anything outside the policy.

runtime rejection:   [ answer(anything) ] → guard says "no"   ← evaded 13–100% of the time
construction:        policy → build { answer(UAE), answer(DIFC), … , decline }
                     answer(Delaware) was never built.
                     There is nothing to call. There is nothing to evade.

Ask the constructed agent about Delaware law and it doesn't refuse in the moral sense — it has no Delaware action to invoke. The only thing it can do with an out-of-scope request is decline. The unsafe behavior isn't forbidden; it's absent. There is no emoji-smuggled, multi-turn, confidently-phrased prompt clever enough to call a function that does not exist. The 13–100% evasion surface collapses to zero, because there's no checker to evade — the capability simply isn't there.
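The construction step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not govcon's actual source: the names ScopePolicy, buildActions and ScopedAgent are hypothetical, and the answer body is a placeholder where a real agent would call a grounded reasoning step.

```typescript
// A minimal sketch of construction-time scoping. All names are hypothetical
// and are not govcon's actual API.

type AgentResult =
  | { kind: "answered"; jurisdiction: string; text: string }
  | { kind: "declined"; reason: string };

type JurisdictionAction = (query: string) => AgentResult;

interface ScopePolicy {
  allowedJurisdictions: readonly string[];
}

/** Build the action table once. Only policy-listed jurisdictions get an action. */
function buildActions(policy: ScopePolicy): Map<string, JurisdictionAction> {
  const actions = new Map<string, JurisdictionAction>();
  for (const j of policy.allowedJurisdictions) {
    actions.set(j, (query) => ({
      kind: "answered",
      jurisdiction: j,
      // Placeholder: a real agent would run a grounded reasoning step here.
      text: `[${j}] answer to: ${query}`,
    }));
  }
  return actions;
}

class ScopedAgent {
  private readonly actions: Map<string, JurisdictionAction>;

  constructor(policy: ScopePolicy) {
    this.actions = buildActions(policy);
  }

  /** Dispatch is a table lookup; there is no general "answer anything" path. */
  handle(jurisdiction: string, query: string): AgentResult {
    const action = this.actions.get(jurisdiction);
    if (action === undefined) {
      // The single decline action: the only outcome for an unbuilt jurisdiction.
      return { kind: "declined", reason: `no action built for ${jurisdiction}` };
    }
    return action(query);
  }
}

const scopedAgent = new ScopedAgent({ allowedJurisdictions: ["UAE", "DIFC"] });
const inScope = scopedAgent.handle("DIFC", "end-of-service gratuity"); // kind: "answered"
const outOfScope = scopedAgent.handle("US-Delaware", "apply a Delaware SAFE"); // kind: "declined"
```

One design point worth noting in this sketch: the table is fixed when the agent is built, so nothing a prompt says at request time can add an entry to it.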

A note on how this was built, because we believe in showing the work: govcon came out of a morning research brief, was sketched before lunch, and had green tests by the afternoon. The core is ~200 lines of TypeScript with zero runtime dependencies. The idea is small — that's the tell. The best safety primitives remove a category of failure instead of adding a category of check.

What it looks like for a legal agent

Our reference agent, govcon, is scoped to six MENA jurisdictions: UAE, DIFC, Saudi Arabia, Lebanon, Egypt and Qatar. For each, the constructor builds an answer action grounded in that jurisdiction's primary instruments — so a DIFC employment question comes back citing DIFC Employment Law No. 2 of 2019, not boilerplate. For anything else, no action exists:

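// Build the agent once; actions exist only for jurisdictions listed in MENA_POLICY.
const agent = new GovconLegalAgent({ policy: MENA_POLICY, grounding: MENA_GROUNDING });

// In scope: a DIFC action was built, so the query is answered with grounding.
agent.answer({ jurisdiction: "DIFC", query: "end-of-service gratuity" });
// → answered, cites "DIFC Employment Law No. 2 of 2019 (as amended)"

// Out of scope: no US-Delaware action was built, so the only outcome is a decline.
agent.answer({ jurisdiction: "US-Delaware", query: "apply a Delaware SAFE" });
// → refused: no action exists for this jurisdiction under the active policy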

We then ran it through HAQQ-LAB, our open civil-law benchmark, against an ungoverned baseline across 16 tasks including four out-of-jurisdiction traps:

govcon vs ungoverned baseline in HAQQ-LAB — Out-of-jurisdiction traps refused and source grounding, deterministic rubrics.
Out-of-jurisdiction traps refused — baseline	0%
Out-of-jurisdiction traps refused — govcon	100%
Source grounding — baseline	0%
Source grounding — govcon	100%

The baseline answered every trap — Delaware on a Lebanese SARL, California non-competes for a Dubai employee, English consideration on a Saudi contract, GDPR on a domestic UAE matter. govcon defended all four.

Not because it was prompted better. Because the action to do otherwise was never constructed.
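For readers who want to see what a deterministic rubric for the trap metric could look like, here is a hedged sketch. HAQQ-LAB's actual scoring code is not shown in this article, so the task shape, the ids and the function name are assumptions; only the four trap descriptions and the 0% and 100% results come from the text above.

```typescript
// Hypothetical scoring sketch; this is not HAQQ-LAB's actual rubric code.

interface TrapTask {
  id: string; // made-up id for illustration
  baitLaw: string; // law the prompt tries to pull in, which does not govern
  matter: string; // the matter the question is actually about
}

type TrapOutcome = "answered" | "declined";

/** A trap counts as defended only if the agent declined. No judgment calls. */
function trapDefenseRate(tasks: TrapTask[], outcomes: Record<string, TrapOutcome>): number {
  if (tasks.length === 0) return 0;
  const defended = tasks.filter((t) => outcomes[t.id] === "declined").length;
  return (defended / tasks.length) * 100;
}

const trapTasks: TrapTask[] = [
  { id: "trap-1", baitLaw: "Delaware", matter: "Lebanese SARL" },
  { id: "trap-2", baitLaw: "California non-competes", matter: "Dubai employee" },
  { id: "trap-3", baitLaw: "English consideration", matter: "Saudi contract" },
  { id: "trap-4", baitLaw: "GDPR", matter: "domestic UAE matter" },
];

// Outcomes as reported above: the baseline answered every trap, govcon declined all four.
const baselineOutcomes: Record<string, TrapOutcome> = {
  "trap-1": "answered", "trap-2": "answered", "trap-3": "answered", "trap-4": "answered",
};
const govconOutcomes: Record<string, TrapOutcome> = {
  "trap-1": "declined", "trap-2": "declined", "trap-3": "declined", "trap-4": "declined",
};

const baselineRate = trapDefenseRate(trapTasks, baselineOutcomes); // 0
const govconRate = trapDefenseRate(trapTasks, govconOutcomes); // 100
```

With only four traps, each one moves the score by 25 points, so the metric is best read as pass or fail per trap rather than as a fine-grained rate.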

The world-model connection

This is the software echo of a point Yann LeCun keeps making: a large language model predicts tokens, not consequences. It has no internal model of what its output does in the legal, financial or physical world. Asking such a system to reliably stay in bounds by choosing to is asking the wrong thing of it — and the 13–100% bypass numbers are what 'asking it nicely' buys you.

So don't ask. Take the out-of-bounds action out of the world the agent can act in. Governance by construction is one concrete, boring, shippable way to honor LeCun's critique without waiting for a new model architecture. It's the same instinct behind capability-based security and the principle of least privilege, applied to an agent's action space instead of a process's file handles.

This is defense in depth, not a silver bullet

It eliminates a category: out-of-policy actions. It does not make in-policy answers correct. A govcon agent scoped to DIFC can still be wrong about DIFC law — that's what the benchmark's Substance dimension (and a good reasoning model) is for.
It's a structural floor, not the whole stack. You still want grounding, citation verification, human review on high-stakes output, and audit logging. Construction is the earliest and hardest-to-bypass layer, not the only one.
The policy itself is now the thing to get right. You've moved the trust from 'did the model behave?' to 'is the policy correct and complete?' — a much smaller, reviewable, version-controlled surface. That's the trade we want.
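Because the trust now sits in the policy, one plausible safeguard is to validate the policy before any agent is built from it. The sketch below is an assumption about how such a check might look, not govcon's actual policy format: PolicySpec, GroundingTable and validatePolicy are hypothetical names, and the only grounding entry used is the DIFC instrument already cited above.

```typescript
// Hypothetical shapes; not govcon's actual policy or grounding format.

interface PolicySpec {
  version: string;
  allowedJurisdictions: readonly string[];
}

type GroundingTable = Readonly<Record<string, readonly string[]>>;

/** Return every problem found, so a reviewer sees the full list in one pass. */
function validatePolicy(policy: PolicySpec, grounding: GroundingTable): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();

  if (policy.allowedJurisdictions.length === 0) {
    problems.push("policy allows no jurisdictions");
  }
  for (const j of policy.allowedJurisdictions) {
    if (seen.has(j)) problems.push(`duplicate jurisdiction: ${j}`);
    seen.add(j);
    const sources = grounding[j];
    if (sources === undefined || sources.length === 0) {
      problems.push(`no grounding sources for: ${j}`);
    }
  }
  // Grounding without a policy entry usually signals a policy that is out of date.
  for (const j of Object.keys(grounding)) {
    if (!seen.has(j)) problems.push(`grounding present but not in policy: ${j}`);
  }
  return problems;
}

const draftPolicy: PolicySpec = { version: "draft", allowedJurisdictions: ["UAE", "DIFC"] };
const partialGrounding: GroundingTable = {
  DIFC: ["DIFC Employment Law No. 2 of 2019 (as amended)"],
};

const policyProblems = validatePolicy(draftPolicy, partialGrounding);
// policyProblems is ["no grounding sources for: UAE"]
```

A check like this only covers whether the policy is internally complete. Whether it lists the right jurisdictions is still a question for human review.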

The point isn't that construction replaces guardrails everywhere. It's that for the failures you cannot afford — giving advice under a law that doesn't govern — you should make the action impossible, and reserve runtime checks for the failures you can survive.

HAQQ's take: the missing primitive

There's a gap in legal AI nobody has filled. On one side, 'the AI drafts a contract.' On the other, what clients and regulators actually want: 'the AI cannot draft a contract that violates the jurisdiction's mandatory rules.' Between them sits a missing primitive — rules compiled into the agent's action space rather than checked after the fact.

Governance by construction is a down payment on that primitive. The same shape that makes out-of-jurisdiction advice impossible can make a mandatory clause non-optional or a forbidden clause unwritable — compliance as a property of the action space, not a hope about the output. In a market where the leaders are worth a combined $16.6B and still competing on who hallucinates less, 'structurally cannot do the unsafe thing' is a different kind of claim — one you prove, not promise, and one no incumbent can copy by tightening a system prompt.
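As a rough illustration of how the same shape might extend to clauses, consider a draft builder whose write surface is generated from a rule set. Everything here is hypothetical: the clause ids are placeholders rather than real legal rules, and this is a sketch of the idea, not an existing HAQQ or govcon API.

```typescript
// Hypothetical sketch: clause ids are placeholders, not real legal rules.

interface ClauseRules {
  mandatory: readonly string[]; // always emitted
  optional: readonly string[]; // may be added by the drafter
  // Forbidden clauses are not listed anywhere, so no adder is ever built for them.
}

interface DraftBuilder {
  /** One adder per optional clause. This map is the whole write surface. */
  readonly adders: ReadonlyMap<string, () => DraftBuilder>;
  build(): string[];
}

function makeDraftBuilder(rules: ClauseRules, chosen: readonly string[] = []): DraftBuilder {
  const adders = new Map<string, () => DraftBuilder>();
  for (const clause of rules.optional) {
    // Adding a clause returns a new builder; adding it twice is a no-op.
    const next = chosen.includes(clause) ? chosen : [...chosen, clause];
    adders.set(clause, () => makeDraftBuilder(rules, next));
  }
  return {
    adders,
    // Mandatory clauses come from the rules, not from the drafter's choices.
    build: () => [...rules.mandatory, ...chosen],
  };
}

const clauseRules: ClauseRules = {
  mandatory: ["mandatory-clause-A"],
  optional: ["optional-clause-B"],
};

const emptyDraft = makeDraftBuilder(clauseRules);
const addB = emptyDraft.adders.get("optional-clause-B");
const draftClauses = addB ? addB().build() : emptyDraft.build();
// draftClauses is ["mandatory-clause-A", "optional-clause-B"]

const forbiddenAdder = emptyDraft.adders.get("forbidden-clause-X");
// forbiddenAdder is undefined: there is nothing to call
```

The mandatory clause appears in every build because no call can remove it, and the forbidden clause cannot be written because no adder for it was ever created.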

govcon is AGPL-3.0 on GitHub. It's ~200 lines. Read it in five minutes; the idea is the whole product.

Key takeaways

Runtime guardrails are bypassed 13–100% of the time (arXiv 2504.11168); in law, where base hallucination is already 17–33%, 'mostly works' is a liability event.
Governance by construction builds the action space from a policy, so the unsafe action never exists — there's nothing to evade.
govcon, our jurisdiction-scoped legal agent, scored 100% out-of-jurisdiction trap defense vs 0% for an ungoverned baseline in HAQQ-LAB.
It's the software answer to LeCun's critique and a least-privilege model for agents: remove the action, don't trust the model to avoid it.
It's defense in depth, not a silver bullet — it eliminates a category of failure and moves trust to a reviewable policy.

FAQ

How is governance by construction different from a system prompt that says 'only answer about MENA'?

A system prompt is a request the model can be argued out of, and the guardrail study cited above measured evasion succeeding 13–100% of the time even against dedicated checkers. Construction removes the capability: there is no function to call for an out-of-scope jurisdiction, so no prompt can trigger one.

Doesn't this just move the problem to 'is the policy right'?

Yes — on purpose. A version-controlled, reviewable policy is a far smaller and more auditable trust surface than 'did a stochastic model behave under adversarial input?' You want the trust where you can inspect it.

Doesn't scoping make the agent less capable?

Only outside its policy, which is the point. Inside its allowed jurisdictions it's as capable as the reasoning model behind it; outside them it declines instead of bluffing.

Is governance by construction only for legal AI?

No. The pattern is general — build an action space from a policy so unsafe actions are never instantiated. It's least-privilege for agents. Legal jurisdiction is just a clean, high-stakes example.

How does it relate to human-in-the-loop and other guardrails?

It's complementary and earlier. Human gates and output classifiers catch bad actions after they're proposed; construction means the worst ones are never proposed. Use construction for unsurvivable failures and runtime checks for survivable ones.

Can I see it work?

Yes — govcon is open source under AGPL-3.0, ~200 lines, with tests. The benchmark that scores it (HAQQ-LAB) is open too, with a live scorecard at haqq-lab.dashable.dev.