Best AI for Legal Work? We Graded 3,000 Answers From 10 Frontier Models

By HAQQ Research · 2026-06-05 · 14 min read · Ai-legal-tech

300 demanding commercial legal tasks, 10 frontier models, 3,000 graded answers. Claude Opus wins, GPT-5.5 is most accurate, and 24% of all answers cite law that does not back them.

Why we benchmark legal AI in the open

Five billion people can't access legal help. That's the problem HAQQ exists to solve. But underneath every legal-AI demo sits a load-bearing question: *can you actually trust the output in front of a client?* "An AI answered a legal question" and "an AI you can put your name on" are different claims, and the distance between them is the entire product.

So we measure it. This is the third post in our benchmark series: [100 consumer legal questions](/blog/ai-benchmark-100-real-legal-questions) was the common-law / consumer angle. [HAQQ-LAB](/blog/civil-law-legal-ai-benchmark) was the first public civil-law / MENA jurisdiction-adherence benchmark. **This report is the largest yet — commercial and cross-border legal work, the matters that pay a real firm's bills.**

If you only read one section, read **The Citation Gap**. It is the finding that should change how every firm buys legal AI.

How we ran the benchmark

We wrote **300 original, specific legal tasks**: not trivia, but real matters with named parties, dollar amounts, dates and governing statutes. Draft this clause. Redline this provision. Structure this transaction. Analyze this conflict of laws. They span 51 practice areas and 20+ jurisdictions (US federal + Delaware / California / New York / Texas, UK, EU, UAE, DIFC, Saudi Arabia, Lebanon, Egypt, Qatar, Singapore, Australia, Canada, India, Brazil, Nigeria, OHADA, Japan, Germany, France, Switzerland, plus the Hague Conventions and the offshore centers Cayman, BVI and Bermuda).

**Difficulty weighted hard:** 114 tasks at level 5 of 5, 108 at level 4, 22 at level 3. These are multi-jurisdiction problems built to break models. Every task went to all 10 models with an identical system prompt at **temperature 0**, capped at 6,000 output tokens.

Each answer was scored on five dimensions:

**Quality (1–10)** — depth, completeness, actionability.
**Accuracy (1–10)** — legal correctness; −5 for hallucinated case law, −3 for wrong jurisdiction.
**Speed (1–5)** — measured response latency, not judged.
**Style (1–5)** — professional structure.
**Creativity (1–5)** — non-obvious risks, cross-jurisdiction issues spotted.

Quality, Accuracy, Style and Creativity are scored by Claude Sonnet 4.6 against a fixed rubric. Speed is computed from real latency. Judge bias is addressed honestly in the caveats — we don't hide it.
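One way to read the rubric is as a simple sum. The sketch below is an assumption, not the published scoring code: it treats the total as an unweighted sum of the five dimensions (consistent with the /35 totals reported in this post), applies each accuracy penalty at most once per answer, and clamps accuracy to the 1–10 range. The class and field names are hypothetical, and the latency-to-speed mapping is left out because it is not specified here.

```python
from dataclasses import dataclass

# Penalty sizes taken from the rubric text; applying each one at most
# once per answer is an assumption.
HALLUCINATED_CASE_PENALTY = 5
WRONG_JURISDICTION_PENALTY = 3


def clamp(value: float, low: float, high: float) -> float:
    """Keep a score inside its rubric range."""
    return max(low, min(high, value))


@dataclass
class JudgedAnswer:
    quality: float        # 1-10, from the judge
    base_accuracy: float  # 1-10, before penalties
    speed: float          # 1-5, derived from measured latency
    style: float          # 1-5, from the judge
    creativity: float     # 1-5, from the judge
    hallucinated_case: bool = False
    wrong_jurisdiction: bool = False

    def accuracy(self) -> float:
        """Accuracy after penalties, clamped to the 1-10 scale."""
        penalty = 0
        if self.hallucinated_case:
            penalty += HALLUCINATED_CASE_PENALTY
        if self.wrong_jurisdiction:
            penalty += WRONG_JURISDICTION_PENALTY
        return clamp(self.base_accuracy - penalty, 1, 10)

    def total(self) -> float:
        # Assumption: unweighted sum, maximum 10 + 10 + 5 + 5 + 5 = 35.
        return (self.quality + self.accuracy() + self.speed
                + self.style + self.creativity)
```

Under these assumptions a single hallucinated case can cost an otherwise strong answer five of its 35 points, which is why accuracy moves the totals more than style or creativity can.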

The leaderboard: which AI is best for legal work

Claude Opus 4.8 wins — clearly

Opus took first place in **130 of 300 tasks** — nearly double any other model — and finished top-3 in 265 of 300. It posts the highest quality (8.9), top-tier accuracy (8.4), perfect style, and the highest creativity. Its one weakness is speed: at 60.8s it is slow, and at $0.069/task it is among the most expensive. If you want the single best answer and can wait for it, this is it.

Grok 4.3: 98% of the quality at 1/7th the time and 1/20th the cost

The most interesting result in the table. Grok 4.3 lands second (28.98) but does it in **8.8 seconds** at **$0.003 per task** — versus Opus's 60.8s and $0.069. For a client-facing product where latency and unit economics matter, Grok is arguably the better *engineering* choice. It even wins more environmental/ESG, IP and edge-case tasks than anyone.

GPT-5.5: the accuracy champion that rarely "wins"

GPT-5.5 posts the **highest accuracy in the field (8.41)** and the **lowest hallucination rate (3%)** — yet sits fifth on total and won only one task outright. It is rarely wrong and rarely flashy. It is also the slowest (134s) and priciest ($0.082). For legal work, "rarely wrong" may be the dimension that matters most, which is exactly why a single composite score misleads.

o3: the most polarizing model in the test

OpenAI's o3 won **66 tasks outright — third-most of any model — yet ranks eighth overall**, with a 32% hallucination rate and the second-lowest accuracy (5.89). When o3 is good it is brilliant; when it is wrong it is confidently, expensively wrong. That variance is itself a procurement risk.

The floor: Mistral and Llama

Mistral Large hallucinated or misapplied citations in **64% of its answers** (accuracy 4.74). Llama 4 Maverick came last (20.01) — fast and cheap, but quality 4.8 and the thinnest answers. "Cite real law" is not solved at the bottom of the market.

The citation gap: the finding that matters most

Across all 3,000 answers, **24% cited or applied law that doesn't say what the model claimed.** Invented cases. Misapplied statutes. The right doctrine pointed at the wrong jurisdiction. These aren't vague misses — our judge flagged specific, checkable errors.

A sample, one per model:

**Claude Opus 4.8:** *Computer Associates / Gemstar cited as gun-jumping precedents; not accurate characterizations.*
**GPT-5.5:** *Hitz citation appears fabricated; 616 B.R. 374 unverifiable.*
**Gemini 3.1 Pro:** *Halpin v. Riverstone citation unverifiable; likely hallucinated Delaware case.*
**Grok 4.3:** *CBS v. Ziff-Davis cited inaccurately; case concerns warranty reliance, not disclosure.*
**Claude Sonnet 4.6:** *Schrier cite unverifiable; Biotronik case real but misapplied.*
**o3:** *MSA Technology v Antec and GB Gas citations appear fabricated or misrepresented.*
**Qwen3.7 Max:** *Specht v. Netscape misapplied; not a goods/services UCC case.*
**DeepSeek V3.2:** *Crosstown Music citation unverifiable / misapplied.*
**Mistral Large:** *N.Y. Gen. Oblig. Law §5-328 and UCC §2-515 citations are dubious / misapplied.*
**Llama 4 Maverick:** *Cavendish case misapplied; unrelated to pricing clauses.*

Every single model, including the leaders, fabricated or misapplied a citation somewhere in the test. This is not a bottom-of-the-table problem. The most accurate model in the entire field still scored only 8.41 out of 10. The floor is alarming; the ceiling is not safe.

For context, the incumbent tools have the same disease. Independent testing has put **Westlaw's AI-Assisted Research at roughly a one-in-three error rate and Lexis+ AI above one in six.** A bigger model has not fixed this, and on our evidence, will not.

No single model wins every practice area

Break the 300 tasks down by practice area and the leader changes constantly. Across **51 practice areas: Claude Opus 4.8 wins 30, Grok 4.3 wins 13, o3 wins 6, and Gemini 3.1 Pro and Sonnet 4.6 take one each.** Highlights:

**Opus dominates the doctrinal core:** M&A (30.6), Regulatory Compliance (30.5), Securities, Criminal/White Collar, Tax, Banking, Data Privacy, Employment, International Trade.
**Grok owns commercial creativity:** IP/Tech Law (30.2), Environmental/ESG (30.8), Consumer Protection, Compliance/Due Diligence (31.4), edge cases.
**o3 spikes on transactional ops:** Legal Operations (31.1), Securities & Finance (30.9), Government Contracts & Procurement (31.1), cross-border real estate.
**Gemini 3.1 Pro** takes the top single score in Privacy & Data Protection (31.4).

The implication is direct: a legal product that bets everything on one model leaves accuracy on the table in entire practice areas. The right architecture **routes each task to the engine most likely to get it right** — and then verifies the answer before it ships.
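The routing idea can be sketched as a lookup from practice area to the model that led it. This is an illustration only: the table lists just the areas named in the highlights above, the model identifiers are made-up strings rather than real API model names, and falling back to the overall leader is an assumption.

```python
# Leaders per practice area as reported in the highlights above.
# Identifiers are illustrative strings, not real API model names.
PRACTICE_AREA_LEADER = {
    "M&A": "claude-opus-4.8",
    "Regulatory Compliance": "claude-opus-4.8",
    "IP/Tech Law": "grok-4.3",
    "Environmental/ESG": "grok-4.3",
    "Compliance/Due Diligence": "grok-4.3",
    "Legal Operations": "o3",
    "Government Contracts & Procurement": "o3",
    "Privacy & Data Protection": "gemini-3.1-pro",
}

# Assumption: unknown areas fall back to the overall leader.
DEFAULT_MODEL = "claude-opus-4.8"


def route(practice_area: str) -> str:
    """Pick the model that led this practice area, else the default."""
    return PRACTICE_AREA_LEADER.get(practice_area, DEFAULT_MODEL)


print(route("IP/Tech Law"))   # grok-4.3
print(route("Maritime Law"))  # claude-opus-4.8 (fallback)
```

A static table is the simplest possible router. A production system would more likely score candidates per task, but the lookup is enough to show why a one-model product gives up the areas another engine leads.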

The latency-and-cost tax

Quality is not free, and it is not uniform. The spread is enormous: the slowest models, GPT-5.5 (134s) and Claude Sonnet 4.6 (102s), are roughly 12× to 17× slower than the fastest, Grok 4.3 (8.8s) and Llama 4 Maverick (7.7s). Cost per task spans a **90× range**, from DeepSeek V3.2 at **$0.0009** to GPT-5.5 at **$0.082**. Grok 4.3 delivers second place at **$0.003**.
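The ratios can be checked directly from the figures quoted in this post. The snippet below uses only those latency and cost numbers; the dictionary layout and the `spread` helper are just a convenient way to hold them.

```python
# Latency (seconds) and cost (USD per task) as quoted in this post.
latency_s = {
    "GPT-5.5": 134.0,
    "Claude Sonnet 4.6": 102.0,
    "Claude Opus 4.8": 60.8,
    "Grok 4.3": 8.8,
    "Llama 4 Maverick": 7.7,
}
cost_usd = {
    "GPT-5.5": 0.082,
    "Claude Opus 4.8": 0.069,
    "Grok 4.3": 0.003,
    "DeepSeek V3.2": 0.0009,
}


def spread(values: dict) -> float:
    """Ratio between the largest and smallest value."""
    return max(values.values()) / min(values.values())


print(round(spread(latency_s), 1))  # 17.4 (GPT-5.5 vs Llama 4 Maverick)
print(round(spread(cost_usd)))      # 91   (GPT-5.5 vs DeepSeek V3.2)

# Opus versus Grok, the comparison made in the leaderboard section.
print(round(latency_s["Claude Opus 4.8"] / latency_s["Grok 4.3"], 1))  # 6.9
print(round(cost_usd["Claude Opus 4.8"] / cost_usd["Grok 4.3"]))       # 23
```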

For a high-volume legal product, the "best" model on a quality-only leaderboard can be the wrong business decision. The Opus-quality-at-Grok-speed-and-cost target is a routing-and-verification problem, not a single-model choice.

Provider-level view (merged with every prior run)

We also collapsed models to their **provider brand** and merged this run with every prior run of the same commercial benchmark, producing a 0–100 index weighted by number of evaluations.

The top is a cluster, not a coronation. Which provider you pick matters less than what you build around it.

A note on benchmark integrity

Our first pass capped answers at 1,200 tokens. That quietly rigged the result: reasoning models burned the budget "thinking" and returned empty answers, while verbose models got clipped mid-sentence and the judge penalized them. We caught it, threw the run away, and re-ran uniformly at 6,000 tokens. Most models gained 1.5–2.5 points — but **Mistral and Llama got *worse* with more room**, because the extra length exposed more bad citations. Output caps are an invisible thumb on the scale in many public leaderboards. Ours is in the open so you can see exactly where it sat.
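A check along these lines could flag when a cap, rather than the model, shaped an answer. It is a minimal sketch under stated assumptions: the `Response` record and its fields are hypothetical, and it presumes the provider reports output tokens in a way that includes any hidden reasoning tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    model: str
    text: str           # visible answer text
    output_tokens: int  # assumed to include hidden reasoning tokens


def cap_artifact(resp: Response, cap: int) -> Optional[str]:
    """Label answers that were likely shaped by the output cap."""
    hit_cap = resp.output_tokens >= cap
    if hit_cap and not resp.text.strip():
        # Budget spent on reasoning, nothing left for the answer.
        return "empty_after_reasoning"
    if hit_cap:
        # Answer probably clipped mid-sentence.
        return "truncated"
    return None


runs = [
    Response("reasoning-model", "", 1200),
    Response("verbose-model", "Clause 1. The Seller shall", 1200),
    Response("concise-model", "Complete answer.", 300),
]
print([cap_artifact(r, cap=1200) for r in runs])
# ['empty_after_reasoning', 'truncated', None]
```

If a meaningful share of one model's answers carry either label, the cap is grading the model, and the run should be repeated with more room.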

What 3,000 graded answers actually prove

The reasoning layer is largely solved. Frontier models spot issues, structure analysis, and draft like a capable associate. What they cannot yet do is be *trusted*: cite verifiably, refuse out-of-jurisdiction questions, disclaim appropriately, and be right the same way twice.

That gap does not close with a bigger model. It closes with a layer on top:

**Routing** — send each matter to the engine most likely to get it right.
**Citation verification** — no unverifiable reference reaches the client.
**Jurisdiction governance** — wrong-jurisdiction answers are designed out at the source, not filtered out after the fact.
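The citation-verification item in the list above can be sketched as a release gate: an answer ships only if every citation it relies on can be confirmed. This is an illustration under assumptions, not a description of any shipped system. `DraftAnswer`, `unverified_citations` and `can_release` are hypothetical names, citations are assumed to be extracted upstream, and the lookup is a placeholder for a query against a trusted legal database.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DraftAnswer:
    text: str
    # Assumption: citations were already extracted from the text upstream.
    citations: List[str] = field(default_factory=list)


def unverified_citations(answer: DraftAnswer,
                         is_verified: Callable[[str], bool]) -> List[str]:
    """Return every citation the lookup could not confirm."""
    return [c for c in answer.citations if not is_verified(c)]


def can_release(answer: DraftAnswer,
                is_verified: Callable[[str], bool]) -> bool:
    """An answer ships only when no citation is left unverified."""
    return not unverified_citations(answer, is_verified)


# Placeholder lookup with made-up case names; a real gate would query
# a trusted citation source instead of an in-memory set.
KNOWN = {"Alpha Corp v. Beta Ltd"}


def lookup(citation: str) -> bool:
    return citation in KNOWN


draft = DraftAnswer(
    text="...",
    citations=["Alpha Corp v. Beta Ltd", "Gamma Inc v. Delta LLC"],
)
print(unverified_citations(draft, lookup))  # ['Gamma Inc v. Delta LLC']
print(can_release(draft, lookup))           # False
```

Note that this gate only confirms a citation exists. Catching a real case that is misapplied, which several of the flagged examples in this post are, needs a further check of what the source actually holds.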

Caveats — because a benchmark you can't check is marketing

**Single run, temperature 0.** Treat rankings as more reliable than decimals; read Opus and Grok as co-leaders rather than reading meaning into the narrow gap between them.
**The judge is a Claude model (Sonnet 4.6), which is also one of the ten models graded.** Self-preference is a fair concern. Against it: a non-Claude model (Grok 4.3) finishes within about a point of the top, and GPT-5.5 posts the highest accuracy, so the judge is not obviously grading on a home curve. A second-model re-judge is the next hardening step.
**The provider index blends rubrics across runs** and is directional.
Every prompt, every raw answer, and the scoring code are published. Clone it and check us.

FAQ

What is the best AI for legal work in 2026?

On our 300-task commercial benchmark, Claude Opus 4.8 scored highest overall (30.02/35) and won the most individual tasks (130 of 300). GPT-5.5 was the most accurate (8.41/10) with the fewest hallucinated citations (3%). Grok 4.3 offers the best speed-and-cost-to-quality ratio. "Best" depends on whether you weight overall quality, raw accuracy, or unit economics.

Do AI models hallucinate legal citations?

Yes, frequently. Across 3,000 answers, 24% cited or applied law that didn't support the claim. Every model tested, including the leaders, fabricated or misapplied at least one citation. Even the most accurate model only reached 8.41/10. No frontier model is safe for legal work without a verification layer.

Is Claude better than GPT or Gemini for law?

Claude Opus 4.8 led our overall ranking and Anthropic leads the provider index alongside xAI. But GPT-5.5 was the most accurate single model and Gemini 3.1 Pro finished a close third overall. No provider dominates every practice area.

Can I rely on a single AI model for a law practice?

No. No single model won across practice areas (the top model led 30 of 51, leaving 21 to others). The reliable architecture routes between models and verifies citations before output.

How much does AI legal research cost per query?

In our test, cost per task ranged 90× — from $0.0009 (DeepSeek V3.2) to $0.082 (GPT-5.5). Speed ranged from 7.7s to 134s. Quality-only leaderboards hide these operational differences.

Is this benchmark reproducible?

Yes. Same prompts, same rubric, temperature 0, all data and code published.

Key takeaways

**Claude Opus 4.8 is the strongest single model** for commercial legal work; GPT-5.5 is the most accurate; Grok 4.3 is the best value.
**24% of frontier-model legal answers cite law that doesn't back them;** the problem reaches the top of the leaderboard, not just the bottom.
**No model wins every practice area;** routing beats betting on one.
**Cost and latency vary 90× and 17×** — the quality-only "winner" can be the wrong business choice.
**The defensible layer is routing + citation verification + jurisdiction governance**, not the base model.