AI for Legal Research: We Tested 3 Models on 100 Real Legal Questions
An independent AI for legal research benchmark. Claude Sonnet 4, GPT-4o and Gemini 2.5 Flash scored on 100 real questions from r/legaladvice. Pass rates 78%-88%. Full breakdown inside.
The Setup
We scraped the top 100 posts of all time from r/legaladvice — real questions from real people covering landlord-tenant disputes, employment law, custody battles, criminal defense, personal injury, and everything in between. Average post length: 2,200+ characters of genuine legal complexity.
Each question was run through three frontier models with identical chain-of-thought prompting:
- Claude Sonnet 4 (Anthropic)
- GPT-4o (OpenAI)
- Gemini 2.5 Flash (Google)
Every model received the same system prompt: act as an experienced US attorney, follow a structured reasoning process — identify jurisdiction, spot issues, cite applicable law, analyze, then advise.
Same prompt. Same questions. Three different engines. Let the answers speak.
The Evaluation
We used Claude as a structured evaluator, grading each answer on five dimensions: Legal Accuracy (are the cited laws correct?), Issue Completeness (did it catch all the legal issues?), Reasoning Quality (is the chain of reasoning logical?), Practical Value (would this advice help someone take the right next steps?), and Appropriate Caveats (does it disclaim properly and recommend a real attorney?).
Pass criteria: Average score ≥ 3.5/5 AND no single dimension below 2/5. Yes, using AI to evaluate AI introduces bias. We address that below.
The Results
Claude Sonnet 4 passed 88 of 100 questions (88%), GPT-4o passed 87 (87%), and Gemini 2.5 Flash passed 78 (78%). All three demonstrated structurally sound legal reasoning across diverse real-world scenarios.
Dimension Breakdown
Legal accuracy scores ranged from 3.98 to 4.30 out of 5. Issue Completeness was highest for Gemini (4.82) and Claude (4.58). Practical Value was Claude's strongest dimension at 4.73. But the weakest dimension across all models — Appropriate Caveats — tells the most important story.
What We Learned
1. The Raw Capability Is Here
Every model identified the correct area of law, spotted the key issues, and provided actionable advice in the vast majority of cases. Legal accuracy scores ranged from 3.98 to 4.30 out of 5 — across 100 diverse, real-world questions. This is not a toy demo. This is production-grade legal reasoning.
2. The Achilles' Heel Is Caveats, Not Accuracy
The weakest dimension across all three models was Appropriate Caveats (3.0-3.15). Models would dive into detailed legal analysis — often correctly — without properly disclaiming that they're not providing legal advice, or recommending that the person consult a local attorney.
This is exactly why raw AI models aren't enough. Technically correct advice delivered with inappropriate confidence is dangerous. You need a layer on top — guardrails, disclaimers, escalation paths — that turns a language model into a responsible legal tool. That's what we build at HAQQ.
3. Consistency Beats Peak Performance
Gemini 2.5 Flash had the highest average scores for Legal Accuracy (4.30) and Issue Completeness (4.82), yet the lowest pass rate (78%). Some answers were truncated. Others skipped disclaimers entirely.
For legal work, you can't afford a model that's brilliant 78% of the time and unreliable the rest. Consistency is the product requirement. That's why HAQQ doesn't rely on a single model — we route, validate, and verify across multiple engines to ensure every output meets a quality bar before it reaches the user.
4. Claude and GPT-4o Are Neck and Neck
At 88% vs 87%, the difference isn't statistically significant. Claude edged ahead on Practical Value (4.73 vs 4.21) — its advice included more concrete next steps. GPT-4o was solid across the board but slightly less structured. The takeaway: model selection matters less than what you build around it.
The Self-Evaluation Question
We used Claude as the judge for all three models, including itself. Known limitations: potential home-court advantage (Claude might favor its own reasoning style), style vs substance bias (the evaluator might reward structural patterns it recognizes), and no ground truth (without attorney validation, we're measuring AI consensus, not legal accuracy).
Our next step is attorney validation. But even with self-evaluation, the signal is clear: frontier models have crossed a threshold where their legal reasoning is structurally sound, well-cited, and practically useful in the majority of cases.
Live Validation: 20 Fresh Questions
The top-100 benchmark uses historical posts. To prove this isn't just pattern-matching, we ran the same pipeline on 20 fresh questions posted to r/legaladvice in the last 48 hours. Claude Sonnet 4 scored 95%, GPT-4o hit 90%, and Gemini 2.5 Flash reached 85%. All three models performed even better on fresh questions.
We then took the best answer for each question across all three models, rewrote it in natural language, and posted it as a reply. The substance was there. The format was human.
Why This Matters for HAQQ
Here's what this benchmark actually proves: the AI layer is solved. The models can reason about law. They can spot issues, cite statutes, and give practical advice that holds up to scrutiny 85-95% of the time.
But '85-95% of the time' isn't good enough for legal work. The gap between a capable model and a trustworthy legal product is everything we do at HAQQ: multi-model routing (we pick the best answer across models for each query), guardrails and caveats (every response includes proper disclaimers and escalation to human attorneys), firm-specific context (answers specific to each firm's practice areas and prior work), and verified sources (no hallucinated case citations — every reference is traceable).
The question was never 'can AI do legal reasoning?' The answer is yes — 88 times out of 100 with the right prompting. The real question is: who builds the product that makes it safe, consistent, and useful for the 5 billion people who need it? That's HAQQ.
Methodology
- Source: Top 100 posts from r/legaladvice (all time) + 20 fresh posts, filtered for substantive self-text posts
- Models: Claude Sonnet 4, GPT-4o, Gemini 2.5 Flash — all via OpenRouter with identical system prompts
- Prompting: Chain-of-thought with structured 5-step legal reasoning framework
- Evaluation: Claude Sonnet 4 as single-judge evaluator, 5 dimensions on 1-5 scale
- Pass threshold: Mean ≥ 3.5 AND minimum ≥ 2 across all dimensions
- Reproducibility: All code, questions, answers, and evaluations available in the GitHub repository