LLM Benchmark: Litigation Strategy Under New York Law — Who Gets It Right?

By Stephane Boghossian · 12 min read · Ai-legal-tech

We gave 7 leading AI models the same litigation prompt. Most sounded confident. Few were actually correct. Here's how they compared on legal accuracy, procedure, and collection strategy.

Humanity has decided that if a machine writes something confidently enough, it must be correct. Lawyers, unfortunately, don't get that luxury. Courts don't care how fluent an argument sounds. They care whether the procedure is right and the law actually applies.

So we ran a simple experiment.

The Prompt

We gave seven leading language models the same prompt. The models tested were HAQQ, GPT-5.2, Claude Opus 4.6, Gemini 3.1 Pro, Perplexity Sonar, Mistral Large 3, and Grok 4.1.

The goal wasn't to see who wrote the prettiest paragraph. It was to see which system could produce something that actually resembles a real litigation strategy.

Because in legal work, sounding correct and being correct are very different things.

Why This Use Case Matters

Unpaid invoices are one of the most common commercial disputes. A $250,000 unpaid invoice sits in the uncomfortable middle ground where the amount is large enough to justify litigation but small enough that efficiency matters.

A competent strategy under New York law typically includes:

What We Looked For

Instead of judging writing quality, we evaluated outputs using practical legal criteria:

These factors determine whether an answer is useful to a lawyer or just an impressive-looking summary.
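As a rough illustration of what criteria-based scoring can look like, here is a minimal Python sketch. It is not the rubric used in this benchmark. It assumes the three dimensions named in the introduction (legal accuracy, procedure, and collection strategy), an equal weighting, and a 0 to 5 scale per criterion; every name and value in it is hypothetical.

```python
from dataclasses import dataclass

# Assumed criteria, borrowed from the dimensions named in the introduction.
CRITERIA = ("legal_accuracy", "procedure", "collection_strategy")
MAX_SCORE = 5  # assumed 0-5 scale per criterion


@dataclass
class Review:
    """One reviewer's scores for one model output (hypothetical structure)."""
    model: str
    scores: dict  # criterion name -> integer score in 0..MAX_SCORE


def overall(review: Review) -> float:
    """Return the unweighted total as a fraction of the maximum possible."""
    for criterion in CRITERIA:
        if criterion not in review.scores:
            raise ValueError(f"missing criterion: {criterion}")
        if not 0 <= review.scores[criterion] <= MAX_SCORE:
            raise ValueError(f"score out of range for: {criterion}")
    total = sum(review.scores[c] for c in CRITERIA)
    return total / (MAX_SCORE * len(CRITERIA))


# Placeholder values for illustration only; these are not benchmark results.
example = Review(
    model="model_a",
    scores={"legal_accuracy": 4, "procedure": 2, "collection_strategy": 0},
)
print(f"{example.model}: {overall(example):.2f}")
```

Scoring each criterion separately keeps a fluent answer from hiding a gap: an output with no collection plan scores zero on that dimension no matter how well the rest reads.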

What the Models Produced

Most systems generated something that looked like a litigation strategy. But once you read closely, important differences appear. Some outputs read like a general explanation of how lawsuits work. Others resembled an internal litigation memo.

Here's the high-level comparison:

LLM Benchmark — Litigation Strategy (New York Law)

The Real Differences

The biggest separation wasn't style. It was procedural awareness.

Strong answers included elements like:

Many weaker answers stopped at "File a lawsuit and pursue damages," which sounds reasonable but ignores half the real work.

The Step Most AI Misses

One pattern was especially clear. Most models focus heavily on filing the case. Few think deeply about collecting the judgment.

But in practice, recovery strategies often involve:

A lawyer planning litigation asks from the start: "If we win, how do we actually collect?" Systems trained primarily on general internet text often overlook that reality.

The Risk Problem

Generic AI models are optimized to generate convincing language. That works well for many tasks. In legal work, however, the failure mode is dangerous. Not because the answer is poorly written. Because it is confidently wrong.

Small procedural mistakes can lead to:

This is why legal professionals care less about creativity and more about calibration.

What This Experiment Shows

Two insights emerged from this simple benchmark.

First, modern language models are already capable of producing useful legal analysis when the problem is clearly defined.

Second, there is a meaningful difference between general AI systems and systems designed specifically for legal workflows.

Legal reasoning requires structured thinking about jurisdiction, procedure, evidence, and enforcement. Those elements rarely appear naturally in general AI responses. They must be intentionally modeled.

The Broader Implication

AI is already becoming a standard tool for lawyers. But the question isn't whether AI can write something that sounds like legal advice. The question is whether it can produce work that satisfies the standards of the profession.

A legal memo isn't judged on tone. It's judged on whether the strategy holds up when challenged by opposing counsel and the court. And that's a much higher bar than generating convincing text.

Final Thought