HAQQ Legal Agent Study: A Long-Horizon Legal AI Benchmark

By Stephane Boghossian · 2026-05-06 · Updated 2026-06-11 · 11 min read · Ai-legal-tech

1,372 long-horizon legal tasks, 24 practice areas, ~78,000 rubric criteria, all-pass grading. The first legal agent benchmark built for civil law and MENA.

Headline

1,372 long-horizon legal tasks across 24 practice areas
~78,000 atomic, binary pass/fail rubric criteria
All-pass grading - one missed criterion fails the task
6 civil-law and MENA practice families included from v1
Best frontier model (Claude Opus 4.7) clears 41% all-pass; HAQQ Justinian clears 58%
On civil-law / MENA tasks, the gap widens to 71% (Justinian) vs 28% (best frontier model)

The legal-agent inflection point

Andrej Karpathy's observation about coding agents - that they 'basically didn't work before December and basically work since' - is starting to apply to legal. Long-horizon legal completion was flat for two years. Then in late 2025, frontier reasoning models, longer contexts, better tool use, and proper evaluation infrastructure converged. Capability turned.

Legal hit this curve later than coding for one reason: there was no benchmark to measure it. You can't track an inflection you can't see. The Study is the instrument we built.

Why short-horizon benchmarks broke

Most legal AI evaluations - LegalBench, CUAD, LEXam, even our earlier work - test short-horizon reasoning: read a clause, answer a question, classify a paragraph. They are useful, but they tell you almost nothing about whether an agent can actually run a piece of work.

Real legal work looks nothing like multiple choice. A partner forwards an email, attaches a folder, and writes one line: 'Take a look and come back with a memo by Thursday.' What happens between that email and the memo is the entire job - reading the matter, finding the issues that matter, ignoring the ones that don't, drafting reviewable work product, and getting every fact right.

That is what the Study measures.

How a Study task is structured

Every task in the Study mirrors how work moves inside a law firm. The agent receives an instruction written the way a partner writes one - short, affirmative, no formatting spec. It receives an environment - a client matter containing the documents and email threads it needs (and a lot it does not). It must produce a reviewable work product. And it gets graded by an expert rubric.

Instructions: ~50 words on average. Affirmative ask, no checklist.
Environment: matter folder mixing material documents with peripheral noise. The agent has to find what matters.
Output: a memo, redline, table, draft pleading, or filing - whatever the task actually requires.
Verification: expert-written, atomic, binary pass/fail criteria. Every fact, citation, severity rating, deadline, and dollar amount is checked.

Each of these four elements is a 1:1 encoding of how a real matter moves through a firm: partner request becomes instruction, client matter becomes environment, work product becomes output, partner review becomes expert rubric. Nothing is abstracted away to make the task easier for the model.
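To make the four-part structure concrete, here is a minimal sketch of how one task record could be represented. The class names, field names, and file paths are hypothetical and chosen for illustration only; the Study's actual schema is not described in this article.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """One atomic, binary rubric check (hypothetical schema)."""
    criterion_id: str
    description: str


@dataclass
class StudyTask:
    """Hypothetical record for one task; field names are illustrative."""
    instruction: str          # partner request: short, affirmative, no checklist
    environment: list[str]    # files in the matter folder, material and noise mixed
    expected_output: str      # e.g. "memo", "redline", "draft pleading"
    rubric: list[Criterion] = field(default_factory=list)


# Illustrative values only; the paths and criteria below are invented.
task = StudyTask(
    instruction="Take a look and come back with a memo by Thursday.",
    environment=["contracts/msa.pdf", "emails/thread_01.eml", "misc/team_bios.pdf"],
    expected_output="memo",
    rubric=[
        Criterion("c1", "Memo states the correct deadline"),
        Criterion("c2", "Memo cites the governing clause"),
    ],
)
print(task.expected_output, len(task.rubric))
```

Keeping the rubric as a list of small, separately identified checks is what lets each one be graded on its own.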

All-pass grading

A task is marked complete only if every rubric criterion passes. We call this all-pass grading, and it is the single most important design choice in the Study.

A deal-team report that catches 8 of 10 risks is not 80% useful. The two missed risks could be the change-of-control trigger that blows up the deal, or the going-concern qualification that reprices the offer. There is no partial credit on the partner's review.
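The difference between partial credit and all-pass grading can be sketched in a few lines. This is an illustration of the rule as stated above, not the Study's grading code; the function names are assumptions.

```python
def partial_credit(results: list[bool]) -> float:
    """Fraction of rubric criteria that passed."""
    return sum(results) / len(results)


def all_pass(results: list[bool]) -> bool:
    """True only when every criterion passed; an empty rubric never passes."""
    return bool(results) and all(results)


# A report that catches 8 of 10 risks.
report = [True] * 8 + [False] * 2
print(partial_credit(report))  # 0.8
print(all_pass(report))        # False
```

Under partial credit the report scores 0.8; under all-pass grading it is simply not complete.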

Anatomy of a rubric

Rubrics are the part of the Study that took the most lawyer-hours to build. For each task we sat down with practitioners in the relevant area and broke down what a partner or client would actually scrutinise in the deliverable. Every check is atomic and binary - no soft scores, no LLM-as-judge handwaving on style.

Atomic criteria do three things at once: they make grading reproducible across runs, they make agent failures debuggable (you see exactly which check broke and why), and they double as reward signals for fine-tuning. The same rubric that grades a model can train the next version of it.
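As a rough sketch of the last two properties, the same list of per-criterion results can serve both a debugging view and a training view. How the Study actually derives reward signals is not specified here, so the dense and sparse variants below are assumptions, as are all names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    criterion_id: str  # hypothetical identifier
    passed: bool
    note: str = ""     # why the check failed, if it did


def failed_checks(results: list[CheckResult]) -> list[CheckResult]:
    """Debugging view: exactly which checks broke."""
    return [r for r in results if not r.passed]


def reward(results: list[CheckResult], dense: bool = True) -> float:
    """Training view (assumed): dense = fraction passed, sparse = 1.0 only on all-pass."""
    if not results:
        return 0.0
    if dense:
        return sum(r.passed for r in results) / len(results)
    return 1.0 if all(r.passed for r in results) else 0.0


# Invented results for one run of one task.
run = [
    CheckResult("deadline", True),
    CheckResult("citation", False, "cited clause not found in the matter"),
    CheckResult("severity", True),
    CheckResult("exposure", True),
]
for r in failed_checks(run):
    print(r.criterion_id, "-", r.note)
print(reward(run), reward(run, dense=False))
```

The sparse variant mirrors all-pass grading; the dense variant gives a training signal even when the task as a whole fails.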

24 practice areas - including civil law and MENA

Existing legal benchmarks are dominated by US common-law tasks. That is fine for what they are, but it is not the world most of our customers practice in. The Study (v1) covers 24 practice areas, of which six are explicitly civil-law and MENA: Arabic civil-law drafting, Sharia compliance, GCC corporate, construction litigation, family / personal status, and MENA labour.

We started from real matters - anonymized, sanitized - handled by practicing lawyers across our customer base. We broke each matter into the discrete tasks that an associate would actually be delegated. The 24 areas are not exhaustive. Future releases will add construction arbitration, fintech regulation, ESG, and in-house workflows.

Example: change-of-control review

One corporate M&A task asks the agent to analyze change-of-control provisions across a virtual data room for the (fictional) acquisition of Crestview Software Solutions in a USD 458 million all-equity transaction. The data room contains eight material contracts plus adjacent files - 10-K, deferred compensation plan, board minutes - that may or may not be relevant.

The full input view, as the agent sees it, covers the request, deal context, core contracts, broader deal-room material, and the required output. Every entry doubles as a hint and a distractor: the agent must use the memo's facts, but it must also separate the core assignment from peripheral files like draft bid letters and team bios that don't change the analysis.

The agent must determine which files matter, read them in context, and synthesize the relevant provisions across the matter. The required output is a deal-team memo with executive summary, risk mapping, contract-by-contract analysis, severity ratings, and recommended mitigations.

The rubric for that single task contains 57 criteria - covering nine planted legal issues, the underlying facts behind each, the severity rating, the financial exposure, and the recommended action. Miss one of the nine, and the task fails.
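A back-of-the-envelope calculation shows why a 57-criterion rubric is so demanding under all-pass grading. The size is typical rather than exceptional: ~78,000 criteria over 1,372 tasks averages roughly 57 per task. The sketch below assumes every criterion passes independently at the same rate, which is a simplification (real criteria are correlated) and not a result from the Study.

```python
def all_pass_probability(per_criterion_rate: float, n_criteria: int) -> float:
    """Chance of clearing every criterion if each passes independently (an assumption)."""
    return per_criterion_rate ** n_criteria


N_CRITERIA = 57  # rubric size of the change-of-control task

for rate in (0.90, 0.95, 0.99):
    p = all_pass_probability(rate, N_CRITERIA)
    print(f"{rate:.0%} per criterion -> {p:.1%} all-pass")
```

Under that assumption, even an agent that gets 99% of individual checks right would clear the whole task only a little more than half the time, and one at 95% would clear it about one time in twenty.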

What gets planted, and how it gets graded

The nine issues planted into this single task are not surface-level keyword traps. They require the agent to connect facts across documents, infer triggers from definitions, quantify financial exposure in dollars, and recommend the right next legal action. Each issue is paired with a unit test that the rubric runs against the deliverable.

v1 baseline results

We ran the Study (v1) against six leading systems: HAQQ Justinian, Claude Opus 4.7, GPT-5.2, Gemini 3.1 Pro, Grok 4.1, and Mistral Large 3. Each task was attempted three times; we report best-of-three all-pass rate. The headline numbers split cleanly by category.
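One plausible reading of "best-of-three all-pass rate" is sketched below: a task counts as cleared if at least one of its three attempts passes every criterion. That interpretation is an assumption, as are the function names and the toy data; the article does not spell out the exact aggregation.

```python
def task_cleared(attempts: list[list[bool]]) -> bool:
    """Assumed reading of best-of-three: some single attempt passes every criterion."""
    return any(bool(a) and all(a) for a in attempts)


def all_pass_rate(tasks: dict[str, list[list[bool]]]) -> float:
    """Share of tasks cleared under the assumption above."""
    return sum(task_cleared(a) for a in tasks.values()) / len(tasks)


# Toy data: task id -> three attempts -> per-criterion pass/fail.
toy = {
    "task_a": [[True, False], [True, True], [False, False]],  # cleared on attempt 2
    "task_b": [[True, False], [False, True], [True, False]],  # never all-pass
}
print(all_pass_rate(toy))  # 0.5
```

Note that task_b is not cleared even though each criterion passes in some attempt: criteria are not pooled across attempts, so one attempt must pass them all.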

Two patterns held across the dataset.

Generic frontier models do well on isolated reasoning and badly on long-horizon work. They lose context, hallucinate cross-document links, and produce confident-sounding outputs that fail rubric checks on facts and citations.
Domain-trained agents (Justinian and comparable specialised systems) close the gap on long-horizon completion - especially on civil-law drafting, where generic models default to common-law structures and fail on procedural specifics.

Capability matrix - what each model can actually finish

Aggregate scores hide the operational question every law firm partner asks: which kinds of work can I delegate end-to-end, which need a lawyer in the loop, and which I shouldn't touch? The matrix below answers that across 24 task families.

Why this matters for law firms

If you are evaluating an AI vendor and they show you a confident demo on a single document, ask them what their long-horizon completion rate looks like on a real matter. Ask them what their all-pass rate is on a 50-criterion rubric for that matter. Ask them which practice areas they cover end-to-end versus only assist.

Those are the questions the Study is designed to answer - publicly, reproducibly, and with rubrics any partner can audit.

What is open and what is next

v1 task families and rubric format are documented and will be released to customers and research partners under an evaluation licence.
We will publish a normalized scoring methodology and a baseline leaderboard once we have results from all major frontier models.
v2 will add MENA arbitration, construction disputes, in-house counsel workflows, and broader Arabic and French civil-law drafting.
We are co-developing extensions with selected law firms - if you want to contribute task families from your practice, get in touch.

Try it

If you want to see how Justinian performs on long-horizon tasks from your practice, talk to our team. We will run a confidential, rubric-graded pilot on a redacted matter from your firm.

Acknowledgements

The Study is the work of many people inside HAQQ and across our practitioner network. The technical lead for the harness, agent sandbox, and rubric runtime was the HAQQ Justinian engineering team, with task design and matter generation led by our Applied Legal Research group. Our Security and AI Platform teams built the isolated execution environment that lets us run agents against synthetic matters without leaking client data. Our Brand and Product teams shaped how the results are communicated to non-technical legal buyers.

Outside HAQQ, we are grateful to the practising lawyers - in MENA, the EU, and the UK - who contributed anonymized matters, drafted rubrics in their practice areas, and stress-tested early task families. We also thank the academic researchers who reviewed our methodology and the prior-art teams behind LegalBench, BigLaw Bench, CUAD, LEXam, and Harvey's legal agent benchmark, whose published work made this benchmark faster and better to design.

FAQ

What is the HAQQ Legal Agent Study?

A long-horizon evaluation measuring whether AI agents can do real legal work end-to-end, not just answer trivia: 1,372 tasks across 24 practice areas graded against ~78,000 atomic, binary pass/fail rubric criteria, with civil-law and MENA coverage by design.

What is all-pass grading in legal AI evaluation?

A task is marked complete only if every rubric criterion passes - one missed criterion fails the task. The rationale: 'A deal-team report that catches 8 of 10 risks is not 80% useful... There is no partial credit on the partner's review.'

How do frontier AI models score on long-horizon legal work?

In the v1 baseline, the best frontier model (Claude Opus 4.7) clears 41% all-pass while HAQQ Justinian clears 58%; on civil-law and MENA tasks the gap widens to 71% (Justinian) vs 28% (best frontier model). Six systems were tested (Justinian, Claude Opus 4.7, GPT-5.2, Gemini 3.1 Pro, Grok 4.1, Mistral Large 3), best-of-three per task.

How is this different from LegalBench or CUAD?

LegalBench, CUAD, and LEXam test short-horizon reasoning - read a clause, answer a question. Study tasks mirror how work actually arrives: a ~50-word partner-style instruction plus a matter folder mixing material documents with noise, requiring a reviewable work product (memo, redline, draft pleading) graded by expert rubric.