We Benchmarked 7 LLMs on New York Litigation Strategy

By Stephane Boghossian · 2026-01-14 · Updated 2026-06-11 · 12 min read · AI legal tech

Seven AI models, one $250,000 unpaid-invoice prompt under New York law. Most sounded confident; few got CPLR procedure and collection strategy right.

Humanity has decided that if a machine writes something confidently enough, it must be correct. Lawyers, unfortunately, don't get that luxury. Courts don't care how fluent an argument sounds. They care whether the procedure is right and the law actually applies.

So we ran a simple experiment.

The Prompt

We gave seven leading language models the same prompt, asking for a litigation strategy to recover a $250,000 unpaid invoice under New York law. The models tested were HAQQ, GPT-5.2, Claude Opus 4.6, Gemini 3.1 Pro, Perplexity Sonar, Mistral Large 3, and Grok 4.1.

The goal wasn't to see who wrote the prettiest paragraph. It was to see which system could produce something that actually resembles a real litigation strategy.

Because in legal work, sounding correct and being correct are very different things.

Why This Use Case Matters

Unpaid invoices are among the most common commercial disputes. A $250,000 unpaid invoice sits in an uncomfortable middle ground: the amount is large enough to justify litigation but small enough that efficiency matters.

A competent strategy under New York law typically includes:

Evaluating the contract and evidence
Determining causes of action (breach of contract, account stated)
Identifying procedural shortcuts like CPLR §3213 (summary judgment in lieu of complaint)
Choosing the correct forum
Planning discovery and summary judgment
Designing a collection strategy after judgment

What We Looked For

Instead of judging writing quality, we evaluated outputs using practical legal criteria:

Legal accuracy — Did the model correctly identify the relevant legal framework?
Procedural understanding — Did it reflect how litigation actually works in New York courts?
Strategic thinking — Did it prioritize the fastest path to recovery?
Citations / Authorities — Did it reference the CPLR and New York-specific procedure?
Structure — Was the output organized for practical use?
Client-ready quality — Could this be delivered to a client without rewriting?

These factors determine whether an answer is useful to a lawyer or just an impressive-looking summary.

What the Models Produced

Most systems generated something that looked like a litigation strategy. But once you read closely, important differences appear. Some outputs read like a general explanation of how lawsuits work. Others resembled an internal litigation memo.

Here's the high-level comparison:

LLM Benchmark — Litigation Strategy (New York Law)

Model	Legal Accuracy	Procedural Depth	Strategic Thinking	Citations / Authorities	Structure	Client-Ready Quality	Key Strength	Key Weakness
HAQQ	9.5	9.5	9	9	9.5	9.5	Full litigation memo with enforcement, discovery, attachment, CPLR references	Slightly verbose; some repetition
Claude Opus 4.6	9	8.5	9	9	9	8.5	Strong legal reasoning + case citations	Slightly theoretical; less procedural detail
GPT-5.2	8.5	8.5	9	7	8.5	8.5	Practical litigation playbook with decision tree	Fewer statutory references
Gemini 3.1 Pro	8	8	8	7.5	8	8	Identifies CPLR 3213 fast-track strategy clearly	Shorter analysis; fewer enforcement tactics
Grok 4.1	7.5	7	7	7	7.5	7	Clear high-level overview	Lacks depth in litigation mechanics
Mistral Large 3	7	7	6.5	6	7	6.5	Easy-to-read step process	Surface-level legal analysis
Perplexity Sonar	6	6	6	6	6.5	6	Includes sources	Several legal inaccuracies
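
The table does not report an overall score, so one way to compare models at a glance is to average the six criteria. The sketch below assumes equal weights, which the benchmark itself does not specify. The function name `rank_by_mean` is ours, and the numbers are copied from the table above.

```python
from statistics import mean

# Criteria in table order; equal weighting is an assumption, not the benchmark's method.
CRITERIA = [
    "Legal Accuracy", "Procedural Depth", "Strategic Thinking",
    "Citations / Authorities", "Structure", "Client-Ready Quality",
]

# Scores copied from the comparison table, one list per model, in CRITERIA order.
SCORES = {
    "HAQQ":             [9.5, 9.5, 9.0, 9.0, 9.5, 9.5],
    "Claude Opus 4.6":  [9.0, 8.5, 9.0, 9.0, 9.0, 8.5],
    "GPT-5.2":          [8.5, 8.5, 9.0, 7.0, 8.5, 8.5],
    "Gemini 3.1 Pro":   [8.0, 8.0, 8.0, 7.5, 8.0, 8.0],
    "Grok 4.1":         [7.5, 7.0, 7.0, 7.0, 7.5, 7.0],
    "Mistral Large 3":  [7.0, 7.0, 6.5, 6.0, 7.0, 6.5],
    "Perplexity Sonar": [6.0, 6.0, 6.0, 6.0, 6.5, 6.0],
}


def rank_by_mean(scores):
    """Return (model, unweighted mean rounded to 2 places), highest first."""
    averaged = ((model, round(mean(vals), 2)) for model, vals in scores.items())
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)


for model, avg in rank_by_mean(SCORES):
    print(f"{model:<16} {avg:.2f}")
```

Under equal weighting the ordering matches the table's row order, with HAQQ averaging 9.33 and Perplexity Sonar 6.08. A different weighting, for example one that emphasizes procedural depth, would change the gaps between models.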

The Real Differences

The biggest separation wasn't style. It was procedural awareness.

Strong answers included elements like:

CPLR §3213 summary judgment in lieu of complaint
Breach of contract and account stated claims
Pre-litigation demand strategy
Jurisdiction and venue analysis
Discovery planning
Post-judgment enforcement mechanisms

Many weaker answers stopped at "File a lawsuit and pursue damages," which sounds nice but ignores half the real work.
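
To make the gap concrete, here is a minimal sketch of a keyword screen that flags which of the elements listed above an answer never mentions. It is not how the benchmark was scored: the `ELEMENTS` patterns and the `coverage` function are hypothetical, and a keyword hit says nothing about whether the procedure is described correctly.

```python
import re

# Hypothetical keyword patterns for each element of a strong answer.
# These are illustrative assumptions, not the benchmark's scoring rules.
ELEMENTS = {
    "CPLR 3213 motion": [r"\b3213\b", r"summary judgment in lieu of complaint"],
    "Breach of contract / account stated": [r"breach of contract", r"account stated"],
    "Pre-litigation demand": [r"demand letter", r"pre-litigation demand"],
    "Jurisdiction and venue": [r"jurisdiction", r"venue"],
    "Discovery planning": [r"discovery"],
    "Post-judgment enforcement": [r"restraining notice", r"levy", r"turnover", r"lien"],
}


def coverage(text):
    """Map each element to True if any of its patterns appears in the text."""
    lowered = text.lower()
    return {
        name: any(re.search(pattern, lowered) for pattern in patterns)
        for name, patterns in ELEMENTS.items()
    }


weak_answer = "File a lawsuit and pursue damages."
missing = [name for name, hit in coverage(weak_answer).items() if not hit]
print(f"Missing {len(missing)} of {len(ELEMENTS)} elements: {missing}")
```

Run on the one-line weak answer, the screen reports all six elements as missing. A screen like this can only catch omissions; judging whether a cited procedure actually applies still takes a lawyer.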

The Step Most AI Misses

One pattern was especially clear. Most models focus heavily on filing the case. Few think deeply about collecting the judgment.

But in practice, recovery strategies often involve:

Restraining notices
Bank levies
Turnover orders
Property liens
Post-judgment discovery

A lawyer thinking about litigation from the start is already asking: "If we win, how do we actually collect?" Systems trained primarily on general internet text often overlook that reality.

The Risk Problem

Generic AI models are optimized to generate convincing language. That works well for many tasks. In legal work, however, the failure mode is dangerous. Not because the answer is poorly written. Because it is confidently wrong.

Small procedural mistakes can lead to:

Dismissed claims
Missed deadlines
Unenforceable judgments
Malpractice exposure

Which is why legal professionals care less about creativity and more about calibration.

What This Experiment Shows

Two insights emerged from this simple benchmark.

First, modern language models are already capable of producing useful legal analysis when the problem is clearly defined.

Second, there is a meaningful difference between general AI systems and systems designed specifically for legal workflows.

Legal reasoning requires structured thinking about jurisdiction, procedure, evidence, and enforcement. Those elements rarely appear naturally in general AI responses. They must be intentionally modeled.

The Broader Implication

AI is already becoming a standard tool for lawyers. But the question isn't whether AI can write something that sounds like legal advice. The question is whether it can produce work that satisfies the standards of the profession.

A legal memo isn't judged on tone. It's judged on whether the strategy holds up when challenged by opposing counsel and the court. And that's a much higher bar than generating convincing text.

Final Thought

Winning a case is not the objective. Getting paid is.

FAQ

What should a litigation strategy for an unpaid invoice in New York include?

Per the article: evaluating the contract and evidence, determining causes of action (breach of contract, account stated), identifying procedural shortcuts like CPLR 3213 summary judgment in lieu of complaint, forum selection, discovery and summary judgment planning, and a post-judgment collection strategy. 'Winning a case is not the objective. Getting paid is.'

What do AI models most often miss in litigation strategy?

Collection. Most models focus on filing the case; few think about collecting the judgment — restraining notices, bank levies, turnover orders, property liens, post-judgment discovery. 'A lawyer thinking about litigation from the start is already asking: if we win, how do we actually collect?'

Why is confidently wrong AI dangerous in legal work?

Generic models are optimized to generate convincing language, so the failure mode is being confidently wrong, not badly written. The article lists the consequences of small procedural mistakes: dismissed claims, missed deadlines, unenforceable judgments, malpractice exposure.

Which AI performed best on the New York litigation benchmark?

In HAQQ's own 7-model comparison, HAQQ topped the table (9.5 for legal accuracy and 9.5 for procedural depth, with a full litigation memo covering enforcement and CPLR references), Claude Opus 4.6 followed (9 and 8.5 on the same two criteria), and Perplexity Sonar scored lowest (6 on every criterion except structure, where it scored 6.5) with 'several legal inaccuracies.'
