Introducing the HAQQ Legal Agent Study

By Stephane Boghossian · 11 min read · Ai-legal-tech

1,372 long-horizon legal tasks. 24 practice areas. ~78,000 expert rubric criteria. Civil-law and MENA coverage by design - and graded all-pass: catch every issue, or the task fails.

The legal-agent inflection point

Andrej Karpathy's observation about coding agents - that they 'basically didn't work before December and basically work since' - is starting to apply to legal. Long-horizon legal completion was flat for two years. Then in late 2025, frontier reasoning models, longer contexts, better tool use, and proper evaluation infrastructure converged. Capability turned.

Legal hit this curve later than coding for one reason: there was no benchmark to measure it. You can't track an inflection you can't see. The Study is the instrument we built.

Why short-horizon benchmarks broke

Most legal AI evaluations - LegalBench, CUAD, LEXam, even our earlier work - test short-horizon reasoning: read a clause, answer a question, classify a paragraph. They are useful, but they tell you almost nothing about whether an agent can actually run a piece of work.

Real legal work looks nothing like multiple choice. A partner forwards an email, attaches a folder, and writes one line: 'Take a look and come back with a memo by Thursday.' What happens between that email and the memo is the entire job - reading the matter, finding the issues that matter, ignoring the ones that don't, drafting reviewable work product, and getting every fact right.

That is what the Study measures.

How a Study task is structured

Every task in the Study mirrors how work moves inside a law firm. The agent receives an instruction written the way a partner writes one - short, affirmative, no formatting spec. It receives an environment - a client matter containing the documents and email threads it needs (and a lot it does not). It must produce a reviewable work product. And it gets graded by an expert rubric.

Each part is a 1:1 encoding of how a real matter moves through a firm: the partner request becomes the instruction, the client matter becomes the environment, the work product becomes the output, and the partner review becomes the expert rubric. Nothing is abstracted away to make the task easier for the model.
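To make the four-part structure concrete, here is a minimal sketch of how one task could be represented in Python. The class and field names (StudyTask, Criterion, environment, and so on) and the file paths are illustrative assumptions, not the Study's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One atomic, binary rubric check (hypothetical shape)."""
    criterion_id: str
    description: str


@dataclass(frozen=True)
class StudyTask:
    """Hypothetical container for the four parts of a task."""
    instruction: str               # partner request: short, no formatting spec
    environment: tuple[str, ...]   # client matter: files, relevant or not
    expected_output: str           # kind of work product, e.g. "memo"
    rubric: tuple[Criterion, ...]  # partner review, as atomic checks


# Toy example; the paths and criteria are invented for illustration.
task = StudyTask(
    instruction="Take a look and come back with a memo by Thursday.",
    environment=(
        "contracts/msa.pdf",
        "emails/thread_01.eml",
        "misc/team_bios.pdf",  # a distractor the agent should set aside
    ),
    expected_output="memo",
    rubric=(
        Criterion("c1", "Identifies the change-of-control trigger"),
        Criterion("c2", "States a severity rating for that trigger"),
    ),
)

print(len(task.environment), len(task.rubric))
```

Keeping the instruction free of any formatting spec, as in the toy example, is the point: the agent has to infer what a reviewable memo looks like.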

All-pass grading

A task is marked complete only if every rubric criterion passes. We call this all-pass grading, and it is the single most important design choice in the Study.

A deal-team report that catches 8 of 10 risks is not 80% useful. The two missed could be the change-of-control trigger that blows up the deal, or the going-concern qualification that reprices the offer. There is no partial credit on the partner's review.
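The difference between partial credit and all-pass grading fits in a few lines. The sketch below is illustrative only: the function names are assumptions, and the rule that an empty rubric fails is a choice made here, not something the Study specifies.

```python
def partial_credit(results: list[bool]) -> float:
    """Fraction of criteria passed: the score all-pass grading rejects."""
    return sum(results) / len(results)


def all_pass(results: list[bool]) -> bool:
    """A task completes only if every criterion passes.

    An empty rubric is treated as a failure here (an assumption).
    """
    return len(results) > 0 and all(results)


# The deal-team report from the text: 8 of 10 risks caught.
report = [True] * 8 + [False] * 2

print(partial_credit(report))  # 0.8
print(all_pass(report))        # False
```

The same report scores 0.8 under partial credit and fails under all-pass, which is the gap the paragraph above describes.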

Anatomy of a rubric

Rubrics are the part of the Study that took the most lawyer-hours to build. For each task we sat down with practitioners in the relevant area and broke down what a partner or client would actually scrutinise in the deliverable. Every check is atomic and binary - no soft scores, no LLM-as-judge handwaving on style.

Atomic criteria do three things at once: they make grading reproducible across runs, they make agent failures debuggable (you see exactly which check broke and why), and they double as reward signals for fine-tuning. The same rubric that grades a model can train the next version of it.
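As a rough sketch of those three properties, assume the deliverable has already been parsed into a structured form and that each criterion is a pure predicate over it. The dictionary layout, check names, and reward encoding below are all hypothetical; the Study's rubric runtime may work differently.

```python
from typing import Callable

# Hypothetical checks: each is atomic and returns a plain True/False.
CHECKS: dict[str, Callable[[dict], bool]] = {
    "flags_change_of_control":
        lambda d: "change_of_control" in d.get("issues", []),
    "rates_severity_high":
        lambda d: d.get("severity", {}).get("change_of_control") == "high",
    "recommends_consent":
        lambda d: "obtain consent" in d.get("actions", []),
}


def grade(deliverable: dict) -> dict[str, bool]:
    """Run every check; pure functions give the same result on every run."""
    return {name: bool(check(deliverable)) for name, check in CHECKS.items()}


def failed_checks(results: dict[str, bool]) -> list[str]:
    """Debugging view: exactly which criteria broke."""
    return [name for name, ok in results.items() if not ok]


def reward_vector(results: dict[str, bool]) -> list[float]:
    """One 0/1 signal per criterion, usable as a fine-tuning reward."""
    return [1.0 if ok else 0.0 for ok in results.values()]


# Toy deliverable that under-rates the severity of the trigger.
memo = {
    "issues": ["change_of_control"],
    "severity": {"change_of_control": "medium"},
    "actions": ["obtain consent"],
}

results = grade(memo)
print(failed_checks(results))  # ['rates_severity_high']
print(reward_vector(results))  # [1.0, 0.0, 1.0]
```

Because each check is independent, a failure points at one named criterion rather than at a blended score.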

24 practice areas - including civil law and MENA

Existing legal benchmarks are dominated by US common-law tasks. That is fine for what they are, but it is not the world most of our customers practice in. The Study (v1) covers 24 practice areas, of which six are explicitly civil-law and MENA: Arabic civil-law drafting, Sharia compliance, GCC corporate, construction litigation, family / personal status, and MENA labour.

We started from real matters - anonymized, sanitized - handled by practicing lawyers across our customer base. We broke each matter into the discrete tasks that an associate would actually be delegated. The 24 areas are not exhaustive. Future releases will add construction arbitration, fintech regulation, ESG, and in-house workflows.

Example: change-of-control review

One corporate M&A task asks the agent to analyze change-of-control provisions across a virtual data room for the (fictional) acquisition of Crestview Software Solutions in a USD 458 million all-equity transaction. The data room contains eight material contracts plus adjacent files - 10-K, deferred compensation plan, board minutes - that may or may not be relevant.

The agent sees the full input: the request, deal context, core contracts, broader deal-room material, and the required output. Every entry doubles as a hint and a distractor: the agent must use the memo's facts, but it must also separate the core assignment from peripheral files, such as draft bid letters and team bios, that don't change the analysis.

The agent must determine which files matter, read them in context, and synthesize the relevant provisions across the matter. The required output is a deal-team memo with executive summary, risk mapping, contract-by-contract analysis, severity ratings, and recommended mitigations.

The rubric for that single task contains 57 criteria - covering nine planted legal issues, the underlying facts behind each, the severity rating, the financial exposure, and the recommended action. Miss one of the nine, and the task fails.

What gets planted, and how it gets graded

The nine issues planted into this single task are not surface-level keyword traps. They require the agent to connect facts across documents, infer triggers from definitions, quantify financial exposure in dollars, and recommend the right next legal action. Each issue is paired with what the agent has to figure out and the unit test the rubric runs against the deliverable.

v1 baseline results

We ran the Study (v1) against six leading systems: HAQQ Justinian, Claude Opus 4.7, GPT-5.2, Gemini 3.1 Pro, Grok 4.1, and Mistral Large 3. Each task was attempted three times; we report best-of-three all-pass rate. The headline numbers split cleanly by category.
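One plausible reading of "best-of-three all-pass rate" is that a task counts as completed if at least one of its three attempts passes every criterion, and the rate is the share of tasks completed. The sketch below implements that reading; the function name, data layout, and toy data are assumptions, not the Study's harness or results.

```python
def best_of_n_all_pass_rate(attempts_by_task: dict[str, list[list[bool]]]) -> float:
    """Share of tasks where at least one attempt passes every criterion.

    attempts_by_task maps a task id to its attempts; each attempt is the
    list of per-criterion pass/fail results for that run.
    """
    completed = sum(
        any(len(attempt) > 0 and all(attempt) for attempt in attempts)
        for attempts in attempts_by_task.values()
    )
    return completed / len(attempts_by_task)


# Toy data: four tasks, three attempts each, three criteria per task.
toy_runs = {
    "task_a": [[True, True, False], [True, True, True], [False, True, True]],
    "task_b": [[True, False, True], [True, False, True], [True, False, True]],
    "task_c": [[True, True, True], [True, True, True], [True, True, True]],
    "task_d": [[False, False, True], [True, True, False], [False, True, True]],
}

print(best_of_n_all_pass_rate(toy_runs))  # 0.5
```

Note that task_d passes every criterion at least once across its attempts but never in the same attempt, so it still counts as a failure under this reading.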

Two patterns held across the dataset.

Capability matrix - what each model can actually finish

Aggregate scores hide the operational question every law firm partner asks: which kinds of work can I delegate end-to-end, which need a lawyer in the loop, and which should I not touch at all? The capability matrix answers that across 24 task families.

Why this matters for law firms

If you are evaluating an AI vendor and they show you a confident demo on a single document, ask them what their long-horizon completion rate looks like on a real matter. Ask them what their all-pass rate is on a 50-criterion rubric for that matter. Ask them which practice areas they cover end-to-end versus only assist.

Those are the questions the Study is designed to answer - publicly, reproducibly, and with rubrics any partner can audit.

What is open and what is next

Try it

If you want to see how Justinian performs on long-horizon tasks from your practice, talk to our team. We will run a confidential, rubric-graded pilot on a redacted matter from your firm.

Acknowledgements

The Study is the work of many people inside HAQQ and across our practitioner network. The technical lead for the harness, agent sandbox, and rubric runtime was the HAQQ Justinian engineering team, with task design and matter generation led by our Applied Legal Research group. Our Security and AI Platform teams built the isolated execution environment that lets us run agents against synthetic matters without leaking client data. Our Brand and Product teams shaped how the results are communicated to non-technical legal buyers.

Outside HAQQ, we are grateful to the practising lawyers - in MENA, the EU, and the UK - who contributed anonymized matters, drafted rubrics in their practice areas, and stress-tested early task families. We also thank the academic researchers who reviewed our methodology and the prior-art teams behind LegalBench, BigLaw Bench, CUAD, LEXam, and Harvey's legal agent benchmark, whose published work made this benchmark faster and better to design.