HAQQ

We Benchmarked 7 LLMs on New York Litigation Strategy

By Stephane Boghossian · 12 min read · ai-legal-tech

Seven AI models, one $250,000 unpaid-invoice prompt under New York law. Most sounded confident; few got CPLR procedure and collection strategy right.

Humanity has decided that if a machine writes something confidently enough, it must be correct. Lawyers, unfortunately, don't get that luxury. Courts don't care how fluent an argument sounds. They care whether the procedure is right and the law actually applies.

So we ran a simple experiment.

The Prompt

We gave seven language models the same prompt: a request for a litigation strategy to recover a $250,000 unpaid invoice under New York law. The models tested: HAQQ, GPT-5.2, Claude Opus 4.6, Gemini 3.1 Pro, Perplexity Sonar, Mistral Large 3, and Grok 4.1.

The goal wasn't to see who wrote the prettiest paragraph. It was to see which system could produce something that actually resembles a real litigation strategy.

Because in legal work, sounding correct and being correct are very different things.

Why This Use Case Matters

Unpaid invoices are one of the most common commercial disputes. A $250,000 unpaid invoice sits in the uncomfortable middle ground where the amount is large enough to justify litigation but small enough that efficiency matters.

A competent strategy under New York law typically includes:

  • Evaluating the contract and evidence
  • Determining causes of action (breach of contract, account stated)
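  • Identifying procedural shortcuts like CPLR §3213, which is available when the claim rests on an instrument for the payment of money only or on a judgment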
  • Choosing the correct forum
  • Planning discovery and summary judgment
  • Designing a collection strategy after judgment

What We Looked For

Instead of judging writing quality, we evaluated outputs using practical legal criteria:

  • Legal accuracy — Did the model correctly identify the relevant legal framework?
  • Procedural understanding — Did it reflect how litigation actually works in New York courts?
  • Strategic thinking — Did it prioritize the fastest path to recovery?
  • Citations / Authorities — Did it reference the CPLR and New York-specific procedure?
  • Structure — Was the output organized for practical use?
  • Client-ready quality — Could this be delivered to a client without rewriting?

These factors determine whether an answer is useful to a lawyer or just an impressive-looking summary.

What the Models Produced

Most systems generated something that looked like a litigation strategy. But once you read closely, important differences appear. Some outputs read like a general explanation of how lawsuits work. Others resembled an internal litigation memo.

Here's the high-level comparison:

LLM Benchmark — Litigation Strategy (New York Law)

| Model | Legal Accuracy | Procedural Depth | Strategic Thinking | Citations / Authorities | Structure | Client-Ready Quality | Key Strength | Key Weakness |
|---|---|---|---|---|---|---|---|---|
| HAQQ | 9.5 | 9.5 | 9 | 9 | 9.5 | 9.5 | Full litigation memo with enforcement, discovery, attachment, CPLR references | Slightly verbose; some repetition |
| Claude Opus 4.6 | 9 | 8.5 | 9 | 9 | 9 | 8.5 | Strong legal reasoning + case citations | Slightly theoretical; less procedural detail |
| GPT-5.2 | 8.5 | 8.5 | 9 | 7 | 8.5 | 8.5 | Practical litigation playbook with decision tree | Fewer statutory references |
| Gemini 3.1 Pro | 8 | 8 | 8 | 7.5 | 8 | 8 | Identifies CPLR 3213 fast-track strategy clearly | Shorter analysis; fewer enforcement tactics |
| Grok 4.1 | 7.5 | 7 | 7 | 7 | 7.5 | 7 | Clear high-level overview | Lacks depth in litigation mechanics |
| Mistral Large 3 | 7 | 7 | 6.5 | 6 | 7 | 6.5 | Easy-to-read step process | Surface-level legal analysis |
| Perplexity Sonar | 6 | 6 | 6 | 6 | 6.5 | 6 | Includes sources | Several legal inaccuracies |
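
The table reports six scores per model but no single aggregate. As a rough illustration only, the sketch below averages the six criteria with equal weight and ranks the models. The equal weighting is our assumption, not something the benchmark states, and the names `CRITERIA`, `SCORES`, and `rank_models` are hypothetical.

```python
from statistics import mean

# Criteria in the same order as the table columns.
CRITERIA = [
    "legal_accuracy", "procedural_depth", "strategic_thinking",
    "citations", "structure", "client_ready",
]

# Scores as read from the comparison table, in CRITERIA order.
SCORES = {
    "HAQQ": [9.5, 9.5, 9, 9, 9.5, 9.5],
    "Claude Opus 4.6": [9, 8.5, 9, 9, 9, 8.5],
    "GPT-5.2": [8.5, 8.5, 9, 7, 8.5, 8.5],
    "Gemini 3.1 Pro": [8, 8, 8, 7.5, 8, 8],
    "Grok 4.1": [7.5, 7, 7, 7, 7.5, 7],
    "Mistral Large 3": [7, 7, 6.5, 6, 7, 6.5],
    "Perplexity Sonar": [6, 6, 6, 6, 6.5, 6],
}


def rank_models(scores):
    """Return (model, mean score) pairs, highest mean first.

    Assumption: every criterion carries equal weight.
    """
    averaged = {model: round(mean(vals), 2) for model, vals in scores.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, avg in rank_models(SCORES):
        print(f"{model:<18} {avg:.2f}")
```

Under this equal-weight assumption the ranking matches the row order of the table, from HAQQ at roughly 9.33 down to Perplexity Sonar at roughly 6.08. A different weighting, for example one that emphasizes procedural depth, could narrow or widen those gaps.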

The Real Differences

The biggest separation wasn't style. It was procedural awareness.

Strong answers included elements like:

  • CPLR §3213 motion for summary judgment in lieu of complaint
  • Breach of contract and account stated claims
  • Pre-litigation demand strategy
  • Jurisdiction and venue analysis
  • Discovery planning
  • Post-judgment enforcement mechanisms
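
One crude way to spot the gap between strong and weak answers is to check whether an output mentions each of these elements at all. The sketch below does that with keyword patterns. It is not how the scores in this benchmark were assigned; the `CHECKLIST` patterns and function names are hypothetical, and a keyword hit says nothing about whether the legal analysis behind it is correct.

```python
import re

# Hypothetical checklist: each element maps to regex patterns that suggest
# the element is mentioned. The patterns are illustrative, not exhaustive.
CHECKLIST = {
    "CPLR 3213": [r"CPLR\s*§?\s*3213", r"summary judgment in lieu of complaint"],
    "causes of action": [r"breach of contract", r"account stated"],
    "pre-litigation demand": [r"demand letter", r"pre-?litigation demand"],
    "jurisdiction and venue": [r"\bjurisdiction\b", r"\bvenue\b"],
    "discovery": [r"\bdiscovery\b"],
    "post-judgment enforcement": [
        r"restraining notice", r"turnover", r"\blevy\b", r"enforce",
    ],
}


def coverage(text):
    """Map each checklist element to True if any of its patterns appears."""
    return {
        element: any(re.search(p, text, re.IGNORECASE) for p in patterns)
        for element, patterns in CHECKLIST.items()
    }


def missing_elements(text):
    """List the checklist elements that the text never mentions."""
    return [element for element, found in coverage(text).items() if not found]


if __name__ == "__main__":
    weak = "File a lawsuit and pursue damages."
    print(missing_elements(weak))
```

A check like this only flags omissions. Deciding whether a mentioned step is procedurally sound still takes a lawyer's review.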

Many weaker answers stopped at: "File a lawsuit and pursue damages." Which sounds nice but ignores half the real work.

The Step Most AI Misses

One pattern was especially clear. Most models focus heavily on filing the case. Few think deeply about collecting the judgment.

But in practice, recovery strategies often involve:

  • Restraining notices
  • Bank levies
  • Turnover orders
  • Property liens
  • Post-judgment discovery

A lawyer thinking about litigation from the start is already asking: "If we win, how do we actually collect?" Systems trained primarily on general internet text often overlook that reality.

The Risk Problem

Generic AI models are optimized to generate convincing language. That works well for many tasks. In legal work, however, the failure mode is dangerous. Not because the answer is poorly written. Because it is confidently wrong.

Small procedural mistakes can lead to:

  • Dismissed claims
  • Missed deadlines
  • Unenforceable judgments
  • Malpractice exposure

Which is why legal professionals care less about creativity and more about calibration.

What This Experiment Shows

Two insights emerged from this simple benchmark.

First, modern language models are already capable of producing useful legal analysis when the problem is clearly defined.

Second, there is a meaningful difference between general AI systems and systems designed specifically for legal workflows.

Legal reasoning requires structured thinking about jurisdiction, procedure, evidence, and enforcement. Those elements rarely appear naturally in general AI responses. They must be intentionally modeled.

The Broader Implication

AI is already becoming a standard tool for lawyers. But the question isn't whether AI can write something that sounds like legal advice. The question is whether it can produce work that satisfies the standards of the profession.

A legal memo isn't judged on tone. It's judged on whether the strategy holds up when challenged by opposing counsel and the court. And that's a much higher bar than generating convincing text.

Final Thought

Winning a case is not the objective. Getting paid is.

FAQ

What should a litigation strategy for an unpaid invoice in New York include?

Per the article: evaluating the contract and evidence, determining causes of action (breach of contract, account stated), identifying procedural shortcuts like CPLR 3213 summary judgment in lieu of complaint, forum selection, discovery and summary judgment planning, and a post-judgment collection strategy. 'Winning a case is not the objective. Getting paid is.'

What do AI models most often miss in litigation strategy?

Collection. Most models focus on filing the case; few think about collecting the judgment — restraining notices, bank levies, turnover orders, property liens, post-judgment discovery. 'A lawyer thinking about litigation from the start is already asking: if we win, how do we actually collect?'

Why is confidently wrong AI dangerous in legal work?

Generic models are optimized to generate convincing language, so the failure mode is being confidently wrong, not badly written. The article lists the consequences of small procedural mistakes: dismissed claims, missed deadlines, unenforceable judgments, malpractice exposure.

Which AI performed best on the New York litigation benchmark?

In HAQQ's own 7-model comparison, HAQQ topped the table (9.5 legal accuracy, 9.5 procedural depth — a full litigation memo with enforcement and CPLR references), Claude Opus 4.6 followed (9/8.5), and Perplexity Sonar scored lowest (6) with 'several legal inaccuracies.'
