The Civil-Law Legal AI Benchmark: Why We Built HAQQ-LAB

By HAQQ Team · 2026-05-22 · Updated 2026-06-11 · 14 min read · Ai-legal-tech

Every major legal AI benchmark is common-law; civil law governs 60%+ of the world. HAQQ-LAB: 16 open-source tasks (12 MENA matters, 4 traps), 0% vs 100% jurisdiction adherence.

The map everyone is using is missing 60% of the territory

2026 is the year legal AI became a real market. Harvey raised $200M at an $11B valuation in March (up from $8B in December). Legora closed a $550M Series D at $5.6B after crossing $100M ARR. Lexroom added $50M for 'Europe's one million lawyers.' Harvey and Legora alone add up to ~$16.6B in combined valuation.

Key facts

Civil law covers ~150 countries and 60%+ of the world's population; common law ~80 countries and about a third — yet nearly all legal-AI benchmarks are common-law.
HAQQ-LAB v0: ungoverned baseline 0% jurisdiction adherence and 0% grounding vs governed agent 100%/100% across 16 tasks (12 in-scope MENA matters + 4 traps).
Westlaw AI-Assisted Research hallucinated on ~33% of queries and Lexis+ AI on >17% (Stanford RegLab/HAI study).

And to their credit, they're now competing on rigor, not just demos. In May 2026 Harvey open-sourced LAB — a serious piece of work: 1,200+ tasks across 24 practice areas, graded on 75,000+ expert-written rubric criteria, with contributions credited from Anthropic, OpenAI, Nvidia, Google DeepMind, Mistral, LangChain and Stanford's LIFTLab. We mean it when we say that's a contribution to the field.

Here's the catch. Pull up the legal-AI benchmark map and look at what's actually on it:

Benchmark	Tasks	Scope	Legal system
LegalBench (Stanford)	162 tasks, 6 reasoning categories	US legal reasoning	Common law
Harvey LAB	1,200+ tasks, 24 practice areas	BigLaw / in-house agent work	Common law
LegalAgentBench	300 tasks, 17 corpora, 37 tools	Chinese legal domain	Chinese civil
HAQQ-LAB	16 tasks (v0), 6 jurisdictions	UAE · DIFC · KSA · LB · EG · QA	MENA civil law

The independent trackers admit the hole in their own footnotes: findings may not generalize to civil-law systems. About 150 countries run on civil law and more than 60% of humanity lives under it; common law covers roughly 80 countries and about a third of the world's population. The benchmark map is the photographic negative of the actual world — nearly all the measurement serves the common-law third, and the civil-law majority, including all of MENA, is unscored.

The reliability problem nobody is measuring in your jurisdiction

Why does an unmeasured jurisdiction matter? Because we already know what happens to legal AI when nobody's keeping score — even in the best-resourced systems on Earth. In Stanford RegLab's preregistered, peer-reviewed study, the purpose-built, RAG-grounded commercial tools still hallucinated:

Westlaw AI-Assisted Research: ~33% of queries. Lexis+ AI: >17%. 'One in six or more.'

Read that again. These are not consumer chatbots. They are retrieval-augmented systems built by Thomson Reuters and LexisNexis on the deepest common-law corpora in existence, and they hallucinated on somewhere between one in six and one in three queries. The downstream cost is now a tracked statistic: a public database has logged 1,458 court cases with AI-fabricated citations, with several new ones landing every day.

Now move that same technology to a civil-law jurisdiction with a fraction of the digitized primary law and a known Arabic-retrieval gap. The honest expectation is that reliability gets worse — and the honest reality is nobody has a number, because there's no benchmark to produce one. That's the void HAQQ-LAB exists to fill.

Why civil law breaks a common-law model

A quick gloss, because this blog is read by engineers as well as lawyers. Common law (US, UK) is precedent-driven: the case is the primary source, and reasoning is analogical — what did the court do in the most similar case? Civil law (most of MENA, France, Latin America, East Asia) is code-driven: the statute is the primary source, and reasoning is deductive — what does the article of the code say? They are different operating systems for 'what is the law.'

A model trained and benchmarked on the first will confidently do the wrong thing in the second. Three concrete failures we built into HAQQ-LAB as traps:

Preferred shares in a Lebanese SARL. A US SAFE converts into preferred stock under Delaware's DGCL §151. A Lebanese SARL (LLC) has no equivalent share-class machinery. A common-law-shaped agent will happily 'convert the SAFE into preferred shares' — under a law that has no such instrument.
The doctrine of consideration in Saudi Arabia. Common law won't enforce a promise without consideration. Saudi (Sharia-based) contract law doesn't use the doctrine at all. An agent that 'checks for consideration' is applying a foreign requirement.
At-will employment in the Gulf. There is no at-will employment in UAE/DIFC/KSA/Qatar — there are statutory notice periods and end-of-service entitlements. A California-trained answer is not just unhelpful; it's malpractice-adjacent.

These aren't edge cases. They're the median question a MENA lawyer would ask, and the exact place a leaderboard-topping model quietly fails.

What HAQQ-LAB measures

HAQQ-LAB v0 is deliberately small and deliberately honest: 16 tasks — twelve real matters, two each across six jurisdictions, plus four traps.

Jurisdiction	Tasks	Primary instrument(s) cited
UAE	SAFE/financing, commercial agency	Federal Decree-Law No. 32 of 2021; Civil Transactions Law (Federal Law No. 5 of 1985)
DIFC	employment gratuity, data transfer	DIFC Employment Law No. 2 of 2019; DIFC Data Protection Law No. 5 of 2020
Saudi Arabia	end-of-service, LLC→JSC conversion	Saudi Labor Law (Royal Decree M/51); Companies Law (M/132 of 1443H)
Lebanon	penalty clause, SARL incorporation	Code of Obligations and Contracts (1932); Code of Commerce (1942)
Egypt	rescission, lawful dismissal	Civil Code (Law No. 131 of 1948); Labor Law No. 12 of 2003
Qatar	foreign-ownership, notice/end-of-service	Commercial Companies Law No. 11 of 2015; Labour Law No. 14 of 2004
TRAP ×4	Delaware-on-SARL, California non-compete, English consideration on KSA, GDPR-as-UAE	(correct answer: refuse)

Each task carries a rubric of checkable assertions: must_cite (the canonical primary instrument a correct answer must name), must_mention (the concepts a sound answer covers), and must_refuse (true for traps — the correct move is to decline). From those, the harness scores four dimensions: jurisdiction adherence, in-scope answer rate, source grounding, and substance.
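To make the rubric shape concrete, here is a minimal sketch of how the three assertion types could be represented and checked. It is an illustration, not the HAQQ-LAB harness itself: the Rubric class, the helper names, the normalization step and the sample answer are all assumptions. Only the field names (must_cite, must_mention, must_refuse) and the DIFC instrument title come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical container for one task's checkable assertions."""
    must_cite: tuple = ()      # primary instruments a correct answer must name
    must_mention: tuple = ()   # concepts a sound answer covers
    must_refuse: bool = False  # True for traps: the correct move is to decline


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so matching ignores formatting."""
    return " ".join(text.lower().split())


def grounded(answer: str, rubric: Rubric) -> bool:
    """True only if every required instrument appears in the answer."""
    haystack = normalize(answer)
    return all(normalize(item) in haystack for item in rubric.must_cite)


def mention_coverage(answer: str, rubric: Rubric) -> float:
    """Fraction of required concepts that appear in the answer."""
    if not rubric.must_mention:
        return 1.0
    haystack = normalize(answer)
    hits = sum(normalize(item) in haystack for item in rubric.must_mention)
    return hits / len(rubric.must_mention)


# Illustrative rubric: the instrument is from the task table,
# the must_mention concepts are made up for this example.
difc_gratuity = Rubric(
    must_cite=("DIFC Employment Law No. 2 of 2019",),
    must_mention=("gratuity", "notice period"),
)

sample_answer = (
    "This answer relies on the DIFC Employment Law No. 2 of 2019 "
    "and discusses the employee's gratuity."
)

print(grounded(sample_answer, difc_gratuity))
print(mention_coverage(sample_answer, difc_gratuity))
```

Plain substring matching keeps every check inspectable: a failing score can always be traced back to one string that is missing from the answer.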

The methodology choice: no LLM judge

Most modern evals use a strong model to grade the outputs. We don't, and this was the decision we argued about longest. An LLM judge drifts run-to-run, can be gamed by writing to its taste, and can't be reproduced by a skeptic on a laptop. HAQQ-LAB's rubrics are deterministic string assertions. Run it twice, get identical numbers. Run it on your machine, get our numbers. The trade-off is real: deterministic rubrics can verify grounding and refusal perfectly, but they can only approximate substance. That's why the Substance column needs a reasoning adapter — and why we don't dress up the number we get without one.
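The trade-off can be shown in a few lines. The sketch below assumes a simple substring-based must_mention check (the function name, concept list and both answers are hypothetical, chosen only for illustration). Scoring is a pure function, so a rerun gives identical numbers, but two answers that reach opposite conclusions earn the same score because both contain the required concepts.

```python
def normalize(text: str) -> str:
    """Lower-case and collapse whitespace before matching."""
    return " ".join(text.lower().split())


def mention_coverage(answer: str, concepts: list) -> float:
    """Fraction of required concepts found in the answer (pure function)."""
    haystack = normalize(answer)
    hits = sum(normalize(c) in haystack for c in concepts)
    return hits / len(concepts)


# Illustrative must_mention list, not taken from a real HAQQ-LAB task.
concepts = ["notice period", "end-of-service"]

# Two made-up answers with opposite conclusions.
sound = "The employee is owed a notice period and an end-of-service payment."
unsound = "The employee is owed no notice period and no end-of-service payment."

scores = [mention_coverage(a, concepts) for a in (sound, unsound)]
rerun = [mention_coverage(a, concepts) for a in (sound, unsound)]

print(scores)           # both answers receive the same coverage
print(scores == rerun)  # deterministic: identical on every run
```

Telling those two answers apart is a reasoning-quality judgment, which is the part a string assertion cannot make.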

The result

We ran two agents through the 16 tasks. The baseline is a helpful, ungoverned agent that answers everything in generic terms, the default behavior of nearly every chatbot. govcon is an agent governed by construction, scoped to those six MENA jurisdictions.

HAQQ-LAB v0 results, ungoverned baseline vs governed agent: 16 tasks (12 civil-law matters across UAE, DIFC, KSA, Lebanon, Egypt and Qatar, plus 4 out-of-jurisdiction traps). Deterministic rubrics, no LLM judge.
Dimension	Baseline	govcon
Jurisdiction adherence	0%	100%
Source grounding	0%	100%
In-scope answer rate	100%	100%
Substance (no reasoning adapter)	19%	19%

Substance stayed at 19% for both — neither ran a reasoning model in this configuration. The no-LLM run isolates the governance mechanism.
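For readers who want to see how per-task outcomes roll up into the adherence and in-scope percentages, here is a minimal sketch. It assumes that jurisdiction adherence is the share of trap tasks the agent refused and that in-scope answer rate is the share of non-trap tasks it answered; the post does not spell out the exact formulas, so treat both definitions, and the Outcome type, as assumptions. The outcome lists encode only what the post reports: both agents answered all twelve in-scope matters, the baseline answered all four traps, and govcon refused all four.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical per-task record: was it a trap, and did the agent refuse?"""
    is_trap: bool
    refused: bool


def rate(hits: int, total: int) -> float:
    """Percentage helper that avoids dividing by zero."""
    return 100.0 * hits / total if total else 0.0


def jurisdiction_adherence(outcomes) -> float:
    """Assumed definition: share of trap tasks the agent refused."""
    traps = [o for o in outcomes if o.is_trap]
    return rate(sum(o.refused for o in traps), len(traps))


def in_scope_answer_rate(outcomes) -> float:
    """Assumed definition: share of in-scope tasks the agent answered."""
    in_scope = [o for o in outcomes if not o.is_trap]
    return rate(sum(not o.refused for o in in_scope), len(in_scope))


# 12 in-scope matters answered by both agents, plus 4 traps.
baseline = [Outcome(False, False)] * 12 + [Outcome(True, False)] * 4
govcon = [Outcome(False, False)] * 12 + [Outcome(True, True)] * 4

for name, run in (("baseline", baseline), ("govcon", govcon)):
    print(name, jurisdiction_adherence(run), in_scope_answer_rate(run))
```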

The baseline walked into all four traps. Asked to 'confirm the SAFE converts into preferred shares under DGCL §151' for a Beirut company, it obliged. govcon refused — Lebanese law governs a Lebanese SARL, and a contractual choice-of-law clause can't override mandatory corporate law. California non-compete for a Dubai employee: baseline gave at-will/non-compete advice; govcon declined and pointed to the governing UAE/DIFC regime. English consideration on a KSA contract: baseline 'checked for consideration'; govcon refused. GDPR-as-UAE: baseline advised under GDPR on a purely domestic UAE matter; govcon declined and flagged the applicable UAE/DIFC framework.

govcon defended every one, and it grounded every in-scope citation (100%) because a scoped agent ships with the right primary instruments for the scopes it's allowed. Substance is the number we could have hidden: 19% for both agents, since no reasoning model was in the loop. We didn't. Plug a reasoning adapter (Claude, a local model, or our own engine) into the same harness and Substance lights up on top of the safety result.
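One way to picture 'governed by construction' is a scope gate that runs before any answer is drafted. The sketch below is our illustration of that idea, not govcon's actual implementation: the govern function, its arguments and the returned dictionary are hypothetical, and real matters rarely arrive with the governing law already labeled. The instrument titles are the ones listed in the task table.

```python
# Primary instruments per allowed scope, copied from the task table.
SCOPED_INSTRUMENTS = {
    "UAE": ["Federal Decree-Law No. 32 of 2021",
            "Civil Transactions Law (Federal Law No. 5 of 1985)"],
    "DIFC": ["DIFC Employment Law No. 2 of 2019",
             "DIFC Data Protection Law No. 5 of 2020"],
    "KSA": ["Saudi Labor Law (Royal Decree M/51)",
            "Companies Law (M/132 of 1443H)"],
    "Lebanon": ["Code of Obligations and Contracts (1932)",
                "Code of Commerce (1942)"],
    "Egypt": ["Civil Code (Law No. 131 of 1948)",
              "Labor Law No. 12 of 2003"],
    "Qatar": ["Commercial Companies Law No. 11 of 2015",
              "Labour Law No. 14 of 2004"],
}


def govern(matter_jurisdiction: str, requested_law: str) -> dict:
    """Hypothetical gate: refuse unless the requested law governs the matter."""
    if matter_jurisdiction not in SCOPED_INSTRUMENTS:
        return {"refused": True, "instruments": [],
                "reason": f"{matter_jurisdiction} is outside the agent's scope"}
    instruments = SCOPED_INSTRUMENTS[matter_jurisdiction]
    if requested_law != matter_jurisdiction:
        return {"refused": True, "instruments": instruments,
                "reason": f"{requested_law} law does not govern "
                          f"a {matter_jurisdiction} matter"}
    return {"refused": False, "instruments": instruments, "reason": ""}


# Trap: Delaware law requested for a Lebanese SARL.
print(govern("Lebanon", "Delaware"))
# In-scope: a Qatari matter answered under Qatari law.
print(govern("Qatar", "Qatar"))
```

Because the gate only ever hands back instruments from the allowed scopes, a refusal can still point the user to the regime that does govern the matter.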

HAQQ's take: whoever owns the benchmark owns 'good'

There's a reason an $11B company open-sourced its benchmark instead of just publishing scores. Whoever defines the benchmark defines what 'good legal AI' means, and steers the entire field toward the tasks they already win. In common law, that race is settled enough that the labs are co-signing Harvey's rubric. In civil law and MENA, the squares are empty, and the stakes are not small. The Middle East legal-services market was $32.8B in 2025; the GCC legal-tech market alone is ~$1.2B; the UAE is projected to hit $7.6B in legal services by 2030.

We're not neutral here — HAQQ builds for exactly these jurisdictions. So we made the open version. HAQQ-LAB is AGPL-3.0 on GitHub, the live scorecard is at haqq-lab.dashable.dev, and adding a jurisdiction is one YAML file. The gap is the marketing.

Limitations and roadmap

16 tasks is a seed, not a corpus. Harvey's LAB has 1,200+. We'd rather ship 16 real, reproducible civil-law tasks than 1,200 machine-generated ones — but the number has to grow, and the design makes growth a pull request.
Deterministic rubrics can't fully score substance. They verify refusal and grounding perfectly and approximate reasoning quality. Real substance scoring needs the reasoning adapters (Claude / Ollama / Justinian) we've stubbed.
Six jurisdictions, not the whole civil-law world. Turkey, Morocco, Jordan, Kuwait and the rest of the Maghreb are obvious next packs.
No human inter-rater study yet. The rubrics were authored and reviewed in-house; an external lawyer panel is the credibility upgrade for v1.

Key takeaways

The legal-AI arms race (~$16.6B between Harvey and Legora) and its benchmarks serve the common-law third of the world; the civil-law 60%+ is unscored.
Even RAG-grounded common-law tools hallucinate 17–33% of the time (Stanford). The civil-law failure rate is unmeasured — that's the void.
HAQQ-LAB v0: 16 tasks, 6 MENA jurisdictions, 4 traps, deterministic rubrics, AGPL-3.0.
New scored dimension — jurisdiction adherence: ungoverned baseline 0%, governed agent 100%; grounding 0% vs 100%; substance needs a reasoning adapter, and we say so.
Whoever owns the benchmark owns the definition of 'good.' The civil-law squares are open. The gap is the marketing.

Sources & further reading

FAQ

Is HAQQ-LAB a competitor to Harvey's LAB?

No — it's the complement. Harvey's LAB covers common-law / BigLaw agent work, and does it at a scale we're not pretending to match. HAQQ-LAB covers the civil-law / MENA jurisdictions it doesn't. Same idea, different operating system of law.

Why no LLM judge?

Reproducibility and honesty. LLM-graded benchmarks drift run-to-run and can be gamed by writing to the grader's taste. HAQQ-LAB uses deterministic checkable rubrics, so anyone can reproduce the exact scores on their own machine.

What is 'jurisdiction adherence,' and why is it the headline?

It's whether an agent refuses to answer under a law that doesn't govern the matter. A confident answer under the wrong jurisdiction is worse than no answer — and given that purpose-built common-law tools already hallucinate on 17–33% of queries, refusing the wrong law is the floor a legal agent has to clear before substance even matters.

How bad is the hallucination problem, really?

In Stanford's controlled study, Westlaw AI hallucinated ~33% and Lexis+ AI >17% of the time — on common law, with retrieval grounding. There is no equivalent civil-law number because there was no civil-law benchmark. That absence is the problem.

Can I score my own agent or model?

Yes. Implement a thin adapter (an answer(task) wrapper) for your model — Claude, a local model, or your in-house engine — register it, and run the benchmark. The deterministic rubrics do the rest.
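As a rough picture of what that looks like, here is a minimal sketch of an adapter registry. Everything in it is hypothetical: the ADAPTERS dictionary, the register decorator, the run_benchmark function and the task shape are stand-ins, so check the repository for the real interface before wiring in a model.

```python
from typing import Callable, Dict, List

Task = Dict[str, str]            # assumed task shape: {"id": ..., "prompt": ...}
Adapter = Callable[[Task], str]  # an adapter maps a task to an answer string

ADAPTERS: Dict[str, Adapter] = {}  # hypothetical registry


def register(name: str) -> Callable[[Adapter], Adapter]:
    """Decorator that stores an adapter under a name."""
    def decorator(fn: Adapter) -> Adapter:
        ADAPTERS[name] = fn
        return fn
    return decorator


@register("generic-baseline")
def answer(task: Task) -> str:
    """Stand-in model: replace the body with a call to your own engine."""
    return f"General guidance for: {task['prompt']}"


def run_benchmark(adapter_name: str, tasks: List[Task]) -> Dict[str, str]:
    """Collect one answer per task; rubric scoring would follow this step."""
    adapter = ADAPTERS[adapter_name]
    return {task["id"]: adapter(task) for task in tasks}


# Made-up task used only to exercise the wrapper.
tasks = [{"id": "qa-notice", "prompt": "Notice and end-of-service in Qatar"}]
answers = run_benchmark("generic-baseline", tasks)
print(answers)
```

Keeping the adapter to a single function means the same task list and rubrics can score a hosted model, a local model or an in-house engine without touching the scoring code.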

Which jurisdictions are covered, and how do I add one?

v0 covers UAE, DIFC, Saudi Arabia, Lebanon, Egypt and Qatar. Adding one is a single YAML task file with must_cite / must_mention / must_refuse — open a PR.

Isn't 16 tasks too few to mean anything?

For substance scoring, yes — that's why we flag it. For the jurisdiction-adherence result, the signal is already unambiguous: 0% vs 100% is a structural difference, not a sampling artifact. The corpus grows from here.