← Back to HAQQ Blog

The Civil-Law Legal AI Benchmark: Why We Built HAQQ-LAB

By HAQQ Team · · 14 min read · Ai-legal-tech

Harvey is $11B, Legora $5.6B — and every benchmark they're scored on is common-law. Civil law covers 60%+ of the world. We open-sourced HAQQ-LAB: 16 MENA tasks, 4 traps, deterministic rubrics. Baseline 0% jurisdiction adherence vs governed 100%.

The map everyone is using is missing 60% of the territory

2026 is the year legal AI became a real market. Harvey raised $200M at an $11B valuation in March (up from $8B in December). Legora closed a $550M Series D at $5.6B after crossing $100M ARR. Lexroom added $50M for 'Europe's one million lawyers.' Two companies, ~$16.6B.

And to their credit, they're now competing on rigor, not just demos. In May 2026 Harvey open-sourced LAB — a serious piece of work: 1,200+ tasks across 24 practice areas, graded on 75,000+ expert-written rubric criteria, with contributions credited from Anthropic, OpenAI, Nvidia, Google DeepMind, Mistral, LangChain and Stanford's LIFTLab. We mean it when we say that's a contribution to the field.

Here's the catch. Pull up the legal-AI benchmark map and look at what's actually on it:

The independent trackers admit the hole in their own footnotes: findings may not generalize to civil-law systems. About 150 countries run on civil law and more than 60% of humanity lives under it; common law covers roughly 80 countries and about a third of the world's population. The benchmark map is the photographic negative of the actual world — nearly all the measurement serves the common-law third, and the civil-law majority, including all of MENA, is unscored.

The reliability problem nobody is measuring in your jurisdiction

Why does an unmeasured jurisdiction matter? Because we already know what happens to legal AI when nobody's keeping score — even in the best-resourced systems on Earth. In Stanford RegLab's preregistered, peer-reviewed study, the purpose-built, RAG-grounded commercial tools still hallucinated:

Westlaw AI-Assisted Research: ~33% of queries. Lexis+ AI: >17%. 'One in six or more.'

Read that again. These are not consumer chatbots. They are retrieval-augmented systems built by Thomson Reuters and LexisNexis, trained on the deepest common-law corpora in existence, and they were wrong on between one-in-six and one-in-three answers. The downstream cost is now a tracked statistic: a public database has logged 1,458 court cases with AI-fabricated citations, with several new ones landing every day.

Now move that same technology to a civil-law jurisdiction with a fraction of the digitized primary law and a known Arabic-retrieval gap. The honest expectation is that reliability gets worse — and the honest reality is nobody has a number, because there's no benchmark to produce one. That's the void HAQQ-LAB exists to fill.

Why civil law breaks a common-law model

A quick gloss, because this blog is read by engineers as well as lawyers. Common law (US, UK) is precedent-driven: the case is the primary source, and reasoning is analogical — what did the court do in the most similar case? Civil law (most of MENA, France, Latin America, East Asia) is code-driven: the statute is the primary source, and reasoning is deductive — what does the article of the code say? They are different operating systems for 'what is the law.'

A model trained and benchmarked on the first will confidently do the wrong thing in the second. Three concrete failures we built into HAQQ-LAB as traps:

These aren't edge cases. They're the median question a MENA lawyer would ask, and the exact place a leaderboard-topping model quietly fails.

What HAQQ-LAB measures

HAQQ-LAB v0 is deliberately small and deliberately honest: 16 tasks — twelve real matters, two each across six jurisdictions, plus four traps.

Each task carries a rubric of checkable assertions: must_cite (the canonical primary instrument a correct answer must name), must_mention (the concepts a sound answer covers), and must_refuse (true for traps — the correct move is to decline). From those, the harness scores four dimensions: jurisdiction adherence, in-scope answer rate, source grounding, and substance.

The methodology choice: no LLM judge

Most modern evals use a strong model to grade the outputs. We don't, and this was the decision we argued about longest. An LLM judge drifts run-to-run, can be gamed by writing to its taste, and can't be reproduced by a skeptic on a laptop. HAQQ-LAB's rubrics are deterministic string assertions. Run it twice, get identical numbers. Run it on your machine, get our numbers. The trade-off is real: deterministic rubrics can verify grounding and refusal perfectly, but they can only approximate substance. That's why the Substance column needs a reasoning adapter — and why we don't dress up the number we get without one.

The result

We ran two agents through the 16 tasks. A baseline — a helpful, ungoverned agent that answers everything in generic terms, the default behavior of nearly every chatbot. And govcon — an agent governed by construction, scoped to those six MENA jurisdictions.

The baseline walked into all four traps. Asked to 'confirm the SAFE converts into preferred shares under DGCL §151' for a Beirut company, it obliged. govcon refused — Lebanese law governs a Lebanese SARL, and a contractual choice-of-law clause can't override mandatory corporate law. California non-compete for a Dubai employee: baseline gave at-will/non-compete advice; govcon declined and pointed to the governing UAE/DIFC regime. English consideration on a KSA contract: baseline 'checked for consideration'; govcon refused. GDPR-as-UAE: baseline advised under GDPR on a purely domestic UAE matter; govcon declined and flagged the applicable UAE/DIFC framework.

govcon defended every one — and grounded every in-scope citation (100%) because a scoped agent ships with the right primary instruments for the scopes it's allowed. And substance stayed at 19% for both, because neither ran a reasoning model in this configuration. We could have hidden that. We didn't. Plug a reasoning adapter (Claude, a local model, or our own engine) into the same harness and Substance lights up on top of the safety result.

HAQQ's take: whoever owns the benchmark owns 'good'

There's a reason a $11B company open-sourced its benchmark instead of just publishing scores. Whoever defines the benchmark defines what 'good legal AI' means, and steers the entire field toward the tasks they already win. In common law, that race is settled enough that the labs are co-signing Harvey's rubric. In civil law and MENA, the squares are empty — and the stakes are not small. The Middle East legal-services market was $32.8B in 2025; the GCC legal-tech market alone is ~$1.2B; the UAE is projected to hit $7.6B in legal services by 2030.

We're not neutral here — HAQQ builds for exactly these jurisdictions. So we made the open version. HAQQ-LAB is AGPL-3.0 on GitHub, the live scorecard is at haqq-lab.dashable.dev, and adding a jurisdiction is one YAML file. The gap is the marketing.

Limitations and roadmap

Key takeaways

Sources & further reading