LLM Benchmark: Litigation Strategy Under New York Law — Who Gets It Right?
We gave 7 leading AI models the same litigation prompt. Most sounded confident. Few were actually correct. Here's how they compared on legal accuracy, procedure, and collection strategy.
Humanity has decided that if a machine writes something confidently enough, it must be correct. Lawyers, unfortunately, don't get that luxury. Courts don't care how fluent an argument sounds. They care whether the procedure is right and the law actually applies.
So we ran a simple experiment.
The Prompt
We gave seven leading language models the same prompt. The models tested: HAQQ, GPT-5.2, Claude Opus 4.6, Gemini 3.1 Pro, Perplexity Sonar, Mistral Large 3, and Grok 4.1.
The goal wasn't to see who wrote the prettiest paragraph. It was to see which system could produce something that actually resembles a real litigation strategy.
Because in legal work, sounding correct and being correct are very different things.
Why This Use Case Matters
Unpaid invoices are one of the most common commercial disputes. A $250,000 unpaid invoice sits in the uncomfortable middle ground where the amount is large enough to justify litigation but small enough that efficiency matters.
A competent strategy under New York law typically includes:
- Evaluating the contract and evidence
- Determining causes of action (breach of contract, account stated)
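- Identifying procedural shortcuts like CPLR §3213 (summary judgment in lieu of complaint), which is available only when the claim rests on an instrument for the payment of money only or on a judgment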
- Choosing the correct forum
- Planning discovery and summary judgment
- Designing a collection strategy after judgment
What We Looked For
Instead of judging writing quality, we evaluated outputs using practical legal criteria:
- Legal accuracy — Did the model correctly identify the relevant legal framework?
- Procedural understanding — Did it reflect how litigation actually works in New York courts?
- Strategic thinking — Did it prioritize the fastest path to recovery?
- Citations / Authorities — Did it reference the CPLR and New York-specific procedure?
- Structure — Was the output organized for practical use?
- Client-ready quality — Could this be delivered to a client without rewriting?
These factors determine whether an answer is useful to a lawyer or just an impressive-looking summary.
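One way to turn these criteria into a comparable number is a simple weighted rubric. The sketch below is illustrative only: the six criterion names mirror the list above, but the 0-5 scale, the equal default weights, and the function name `overall_score` are assumptions, not the scoring method used in this benchmark.

```python
from typing import Dict, Optional

# Criterion names follow the article's list; everything else is assumed.
CRITERIA = (
    "legal_accuracy",
    "procedural_understanding",
    "strategic_thinking",
    "citations",
    "structure",
    "client_ready",
)
MAX_SCORE = 5  # assumed scale: 0 (absent) to 5 (excellent)


def overall_score(
    scores: Dict[str, int],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Return the weighted score as a fraction of the maximum, in [0, 1]."""
    if weights is None:
        # Assumption: all criteria count equally unless told otherwise.
        weights = {c: 1.0 for c in CRITERIA}
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for c in CRITERIA:
        if not 0 <= scores[c] <= MAX_SCORE:
            raise ValueError(f"{c} must be between 0 and {MAX_SCORE}")
    total_weight = sum(weights[c] for c in CRITERIA)
    weighted = sum(scores[c] * weights[c] for c in CRITERIA)
    return weighted / (MAX_SCORE * total_weight)


if __name__ == "__main__":
    # Hypothetical reviewer scores for one model's answer.
    example = dict(zip(CRITERIA, [5, 4, 3, 2, 1, 0]))
    print(overall_score(example))
```

A reviewer who cares more about accuracy than layout could pass a `weights` dictionary that, for example, doubles `legal_accuracy`. The numbers still come from a human reading each answer against the law; the function only aggregates them.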
What the Models Produced
Most systems generated something that looked like a litigation strategy. But once you read closely, important differences appear. Some outputs read like a general explanation of how lawsuits work. Others resembled an internal litigation memo.
Here's the high-level comparison:
[Table: LLM Benchmark, Litigation Strategy (New York Law)]
The Real Differences
The biggest separation wasn't style. It was procedural awareness.
Strong answers included elements like:
- CPLR §3213 summary judgment in lieu of complaint
- Breach of contract and account stated claims
- Pre-litigation demand strategy
- Jurisdiction and venue analysis
- Discovery planning
- Post-judgment enforcement mechanisms
Many weaker answers stopped at "File a lawsuit and pursue damages," which sounds nice but ignores half the real work.
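A first-pass screen for these elements can be automated before a lawyer reads the output. The sketch below is a rough illustration: the element names come from the list above, while the regular expressions and the function names `coverage` and `missing_elements` are assumptions chosen for the example. A keyword hit only shows that an element was mentioned, not that it was applied correctly.

```python
import re
from typing import Dict, List

# Element names follow the article's list; the patterns are illustrative guesses.
CHECKLIST = {
    "cplr_3213": r"CPLR\s*§?\s*3213|summary judgment in lieu of complaint",
    "breach_of_contract": r"breach of contract",
    "account_stated": r"account stated",
    "pre_litigation_demand": r"demand letter|pre-?litigation demand",
    "jurisdiction_and_venue": r"\bjurisdiction\b|\bvenue\b",
    "discovery": r"\bdiscovery\b",
    "post_judgment_enforcement": (
        r"restraining notice|bank levy|turnover order|post-?judgment"
    ),
}


def coverage(text: str) -> Dict[str, bool]:
    """Report which checklist elements are mentioned at least once."""
    return {
        name: bool(re.search(pattern, text, re.IGNORECASE))
        for name, pattern in CHECKLIST.items()
    }


def missing_elements(text: str) -> List[str]:
    """List checklist elements with no mention, in checklist order."""
    return [name for name, hit in coverage(text).items() if not hit]


if __name__ == "__main__":
    weak_answer = "File a lawsuit and pursue damages."
    print(missing_elements(weak_answer))  # every element is missing
```

Run against the weak one-line answer quoted above, every element comes back missing. That gap is what this kind of screen is meant to surface before the substantive review starts.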
The Step Most AI Misses
One pattern was especially clear. Most models focus heavily on filing the case. Few think deeply about collecting the judgment.
But in practice, recovery strategies often involve:
- Restraining notices
- Bank levies
- Turnover orders
- Property liens
- Post-judgment discovery
A lawyer planning litigation asks from the start: "If we win, how do we actually collect?" Systems trained primarily on general internet text often overlook that reality.
The Risk Problem
Generic AI models are optimized to generate convincing language. That works well for many tasks. In legal work, however, the failure mode is dangerous. Not because the answer is poorly written. Because it is confidently wrong.
Small procedural mistakes can lead to:
- Dismissed claims
- Missed deadlines
- Unenforceable judgments
- Malpractice exposure
Which is why legal professionals care less about creativity and more about calibration.
What This Experiment Shows
Two insights emerged from this simple benchmark.
First, modern language models are already capable of producing useful legal analysis when the problem is clearly defined.
Second, there is a meaningful difference between general AI systems and systems designed specifically for legal workflows.
Legal reasoning requires structured thinking about jurisdiction, procedure, evidence, and enforcement. Those elements rarely appear naturally in general AI responses. They must be intentionally modeled.
The Broader Implication
AI is already becoming a standard tool for lawyers. But the question isn't whether AI can write something that sounds like legal advice. The question is whether it can produce work that satisfies the standards of the profession.
A legal memo isn't judged on tone. It's judged on whether the strategy holds up when challenged by opposing counsel and the court. And that's a much higher bar than generating convincing text.