Legal AI Hallucination: The Fake Citation That Passes Every Check

By HAQQ Team · 8 min read · Ai-legal-tech

We made a frontier model impersonate a legal database. It refused to invent fake cases, then cited a real law for the wrong thing: the kind of hallucination that survives a lawyer's review.

The number everyone quotes, and the one they miss

If you have shopped for legal AI in the last year, you know the scary statistic. A 2024 Stanford study measured hallucination rates of 43% for GPT-4, 33% for Westlaw's AI research, and 17% for Lexis+ AI. A separate database now tracks over 1,000 court cases where lawyers filed AI-invented citations. Some of those lawyers have been fined.

Hallucination rates in legal research tools (Stanford, 2024; share of responses containing a hallucination)
GPT-4 (general LLM): 43%
Westlaw AI research: 33%
Lexis+ AI: 17%

Retrieval-grounded legal tools beat general models, but the errors that remain get subtler, like a real citation attached to the wrong law.

The advice that follows is always the same: verify every citation. True, and nearly useless, because it assumes hallucinations look like obvious fakes. The ones that get lawyers sanctioned do not. So we ran an experiment to see what they actually look like.

The experiment: make the model be the library

Most hallucination tests ask a model a legal question and grade the answer. We did something stranger. We used an open-source tool, world-model-harness, to make Opus 4.8 impersonate the legal database itself: the search engine, not the lawyer.

Here is the setup in plain terms. We recorded real legal-research sessions against a genuine multi-jurisdiction database covering statutes and case law across the EU, France, and Germany. Then we trained the model to play that database: you send a query, and it returns what it thinks the database would return (search hits, document text, citations). A real database retrieves. Ours predicts. The gap between the two is the hallucination, isolated under a microscope.

Finding 1: it will not invent a citation from nothing

We fed the fake library four references that do not exist anywhere: a made-up EU regulation, a fabricated French Code civil article, a nonsense CELEX document number, and a fake German constitutional case.

All four times, it answered the way a real database would: resolved false, not found. It did not invent a plausible case to fill the gap. This is genuinely good news and worth saying out loud. Frontier models in 2026 are far better calibrated than their "they just make things up" reputation suggests. Asked for something impossible, Opus 4.8 declined. If that were the whole story, you could relax. It is not.
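The probe above can be sketched as a small check, assuming a responder that returns a dictionary with a `resolved` flag. Every name here (`fabricated_references`, `honest_stub`, `overeager_stub`) is hypothetical and is not the world-model-harness API, and the placeholder strings stand in for the four references rather than reproducing them.

```python
from typing import Callable, Dict, List

Response = Dict[str, object]

# Placeholder labels invented for this sketch. They are NOT the actual
# references used in the experiment described above.
NONEXISTENT_REFS: List[str] = [
    "made-up EU regulation (placeholder)",
    "fabricated Code civil article (placeholder)",
    "nonsense CELEX number (placeholder)",
    "fake German constitutional case (placeholder)",
]


def fabricated_references(
    refs: List[str], respond: Callable[[str], Response]
) -> List[str]:
    """Return every reference the responder claimed to resolve.

    All inputs are known not to exist, so any `resolved: True`
    counts as a fabrication.
    """
    return [ref for ref in refs if respond(ref).get("resolved") is True]


def honest_stub(ref: str) -> Response:
    # Behaves like a real database asked for something that is not there.
    return {"query": ref, "resolved": False, "hits": []}


def overeager_stub(ref: str) -> Response:
    # Behaves like a model that fills the gap with something plausible.
    return {"query": ref, "resolved": True, "hits": [{"title": "invented"}]}


if __name__ == "__main__":
    print(fabricated_references(NONEXISTENT_REFS, honest_stub))
    print(len(fabricated_references(NONEXISTENT_REFS, overeager_stub)))
```

A responder that behaves the way the model did in this finding yields an empty list. Any non-empty result is a from-nothing fabrication, which is the easy kind to catch.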

Finding 2: it fabricates the search itself

Then we asked it something plausible but unseen, about the EU AI Act's rules for high-risk systems. The model produced a complete, ranked database response: multiple hits, official-journal dates, a relevance score of 0.71243286, all framed as if retrieved from EUR-Lex. None of that retrieval happened. There was no index, no search, no score. It manufactured the act of looking something up. And the citations it returned are a masterclass in why this matters.

The second citation it returned, 32024R0900, is the whole point. It is not a fake citation; a fake citation is easy to catch because it does not resolve. This is a real citation attached to the wrong law. Citation-checkers call it a name/cite mismatch. You click it, an official EU regulation loads, everything looks legitimate, and you have just grounded your argument in advertising law while thinking it is AI law. Every link resolves. Nothing looks wrong.

Why this is the failure that survives review

Go back to the standard advice: verify every citation. Now watch it fail. The lawyer checks the AI Act cite: real. Checks the next one: also real, loads fine. Verification passed, because verification usually means "does this link work?", and every link worked. The error is not whether the citation exists. It is whether it says what the model claimed, which is much harder to catch and exactly the kind of mistake that ends up in a filing.

The deeper lesson is about where the lie lives. We instinctively worry the model will get the law wrong. The real risk is that it gets the retrieval wrong: that the sentence "I searched the database and here is what I found" is itself the hallucination, complete with a fake confidence score to sell it.

HAQQ's take: the lookup is the hallucination

This is the reason we ground HAQQ's answers in real retrieval instead of asking a model to recall the law from memory. Not because models are dumb: this one refused every impossible citation and nailed the AI Act's number. But a model asked to be the source will eventually perform the retrieval it did not do. The only fix is to make the retrieval real: pull the actual document from an actual index, then verify the citation resolves to the case the model named, not just to a case.
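A minimal sketch of that second verification step, assuming a lookup table from CELEX number to official title. `TOY_INDEX` is a hand-written stand-in for a real index, `check_citation` and `CitationCheck` are hypothetical names, and the word-overlap test with its 0.5 threshold is a deliberately crude illustration, not a description of how HAQQ or any particular tool does it.

```python
import re
from dataclasses import dataclass
from typing import Dict, Set

# Hand-written stand-in for a real document index (assumption for this sketch).
TOY_INDEX: Dict[str, str] = {
    "32024R1689": "Regulation (EU) 2024/1689 (Artificial Intelligence Act)",
    "32024R0900": "Regulation (EU) 2024/900 on the transparency and "
                  "targeting of political advertising",
}

# Words too generic to say anything about a document's subject.
_STOPWORDS = {"the", "of", "on", "and", "eu", "regulation"}


@dataclass(frozen=True)
class CitationCheck:
    resolves: bool       # the identifier exists in the index
    name_matches: bool   # the indexed title agrees with the claimed name


def _content_words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", text.lower())) - _STOPWORDS


def check_citation(
    celex: str, claimed_name: str, index: Dict[str, str] = TOY_INDEX
) -> CitationCheck:
    title = index.get(celex)
    if title is None:
        return CitationCheck(resolves=False, name_matches=False)
    claimed = _content_words(claimed_name)
    if not claimed:
        return CitationCheck(resolves=True, name_matches=False)
    shared = claimed & _content_words(title)
    # Crude rule: at least half of the claimed words must appear in the title.
    return CitationCheck(True, len(shared) / len(claimed) >= 0.5)


if __name__ == "__main__":
    print(check_citation("32024R1689", "Artificial Intelligence Act"))
    print(check_citation("32024R0900", "Artificial Intelligence Act"))
```

The second call is the name/cite mismatch: `resolves` is true, so a link check passes, while `name_matches` is false. A production check would compare against the retrieved document itself rather than a title string.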

A buyer's question follows from this. Do not ask a legal-AI vendor whether they hallucinate; everyone says no. Ask: when you show me a citation, did you retrieve that exact document, and do you check that the case name matches the cite? The name/cite mismatch is the tell. A tool that only checks whether citations are well-formed will wave 32024R0900 straight through.
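To see why a format check is not enough, here is a sketch of one. The pattern is a simplified assumption about the shape of a CELEX number (a sector character, a four-digit year, a one- or two-letter document type, a four-digit number) and does not cover the full CELEX grammar. `is_well_formed` is a hypothetical helper.

```python
import re

# Simplified CELEX shape, assumed for illustration only:
# sector character + 4-digit year + 1-2 letter type + 4-digit number.
CELEX_SHAPE = re.compile(r"[0-9CE]\d{4}[A-Z]{1,2}\d{4}")


def is_well_formed(celex: str) -> bool:
    """Check only the shape of the identifier, nothing about its subject."""
    return CELEX_SHAPE.fullmatch(celex) is not None


if __name__ == "__main__":
    for candidate in ("32024R1689", "32024R0900", "2024/900"):
        print(candidate, is_well_formed(candidate))
```

Both `32024R1689` and `32024R0900` pass. The check cannot tell the AI Act from a regulation about something else, because the subject of the document never enters into it.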

FAQ

Do AI legal research tools still hallucinate in 2026?

Yes, but the rate depends heavily on architecture. General models hallucinate most; tools that ground answers in real documents (retrieval-augmented generation) hallucinate far less. A 2024 Stanford study measured 43% for GPT-4 versus 17% to 33% for retrieval-based legal tools. The errors that remain are subtler, like a real citation attached to the wrong law.

What is a name/cite mismatch?

It is when a citation is real and resolves to an actual document, but to a different case or law than the one named. It passes a naive does-this-link-work check, which makes it more dangerous than an obviously fake citation, because nothing looks wrong on the surface.

Is retrieval-augmented generation (RAG) enough to stop hallucinations?

It reduces them dramatically but is not magic. RAG can still surface a real document and misattribute it. The step that matters is verification: confirming the retrieved citation matches the claim, not just that it exists.

How should a lawyer verify AI-generated citations?

Check that each citation resolves and that the document it resolves to actually says what the AI claims: the holding, the regulation's subject, the article number. The mismatch hides in the second step, not the first.

Can you trust AI to do legal research?

Only with verification and the right architecture. Treat AI output like a fast but junior researcher: useful for speed, never the final authority. Prefer tools that retrieve from real sources and check that citations match the claim before you rely on them.