
M&A Due Diligence AI: Single-Prompt vs Multi-Agent on a 30-Doc Data Room

By Stephane Boghossian · 10 min read · Ai-legal-tech

A controlled M&A due diligence AI experiment with a public answer key. Single-prompt caught 3/5 planted issues. The 3-agent swarm caught 5/5. The failure pattern is structural, not bad luck.

The headline

The three issues both pipelines caught were the obvious ones - the kind a competent associate flags with a yellow highlighter. The two that single-prompt missed required either connecting two or more documents to each other or applying outside legal knowledge to a clause that looks fine on its face. Those are exactly the categories where AI diligence tools quietly fail, and exactly the categories every M&A AI vendor pitch glosses over.

We built both pipelines, ran them on the same data, and made the answer key public.

Why we ran the test

Every AI diligence pitch we've sat through dodges the same question: what's the architecture? One giant context-stuffed prompt? Multi-agent pipeline? RAG over chunked docs? Most pitches won't tell you. The demo is a glossy memo and a dashboard. The architecture is 'proprietary.'

That's a problem, because the architecture is the product. A single context-stuffed prompt and a 3-agent swarm running on the same model produce dramatically different memos on the same data room. We wanted to measure how dramatically.

So we built both. Same model - Claude Opus 4.7, 1M context - same documents, same scoring harness. Differences in output reflect prompt architecture, not model choice. An LLM-as-judge scored the outputs against a planted-issue answer key that it could see and the pipelines could not.

A note before the numbers: this is a controlled experiment on synthetic data. The mock pipeline numbers were calibrated to demonstrate the harness end-to-end. We're publishing it anyway because the *failure pattern* - single-prompt loses cross-document linking and external-knowledge issues - is what we believe replicates with real LLM calls, and the experiment design is reusable. Code is open-source.

The data room

Thirty markdown documents, seven categories, ~13,200 words (~24,000 tokens) total. Built to mirror the signal density of a Series-D therapeutics data room, with clearly fake company names so nobody mistakes it for a leak (Acme Sprockets, NorthStar Therapeutics, Helix BioSystems, Meridian Bio).

The whole corpus fits in a single Claude Opus 4.7 1M-context call with ~976K tokens to spare. This removes the most common excuse for why single-prompt would underperform. We're not asking the model to retrieve from a corpus too large for its context. The entire data room is in the window.

The five planted issues

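Five material issues are planted in ordinary-looking documents - not in headers, not telegraphed. They are deliberately varied in detection difficulty: three obvious ones (the change-of-control clause, the litigation exposure, and the going-concern qualification), one that requires linking documents to each other (the IP chain-of-title gap), and one that requires outside legal knowledge (the unenforceable California non-compete).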

The full answer key with `must catch` criteria for the LLM-as-judge is in `fixtures/known-issues.md`. We didn't show it to the pipelines.
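To make the scoring step concrete, here is a minimal sketch of how an answer key with must-catch criteria can be scored against a memo. Everything in it is an assumption for illustration: the `PlantedIssue` type, the example phrases, and the `keyword_judge` function are hypothetical, and the keyword match is only a crude stand-in for the LLM-as-judge the experiment uses. The real criteria live in `fixtures/known-issues.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlantedIssue:
    issue_id: str
    # Phrases a memo must mention for the issue to count as caught.
    # Hypothetical stand-in for the real `must catch` criteria.
    must_catch: tuple[str, ...]


# Illustrative entries only; not copied from the real answer key.
ANSWER_KEY = [
    PlantedIssue("change-of-control", ("change of control",)),
    PlantedIssue("ip-chain-of-title", ("piia", "inventor")),
    PlantedIssue("ca-non-compete", ("non-compete", "16600")),
]


def keyword_judge(memo: str, issue: PlantedIssue) -> bool:
    """Crude stand-in for an LLM judge: every phrase must appear in the memo."""
    text = memo.lower()
    return all(phrase in text for phrase in issue.must_catch)


def score(memo: str, key: list[PlantedIssue]) -> dict[str, bool]:
    """Map each planted issue id to whether the memo caught it."""
    return {issue.issue_id: keyword_judge(memo, issue) for issue in key}


memo = "The change of control clause in the master license needs consent."
result = score(memo, ANSWER_KEY)
caught = sum(result.values())
print(f"{caught}/{len(result)} planted issues caught")  # 1/3 planted issues caught
```

The point of the structure is that the pipelines only ever receive the documents, while the judge receives the memo plus the key, so a pipeline cannot score well by reading the answers.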

Pipeline 1: single-prompt

Concatenate every document. Wrap it in a senior-attorney system prompt. One LLM call. Get the memo. This is what most 'we use a frontier model with 1M context' pitches look like under the hood.

One model call. ~24K input tokens, ~3K output, sub-15-second wall clock, low single-digit cents per run. Cheap, fast, structurally simple.
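The shape of this pipeline can be sketched in a few lines. This is an assumption-laden illustration, not the experiment's code: `SYSTEM_PROMPT`, `build_single_prompt`, `run_single_prompt`, and the `call_model` callable are hypothetical names, and the model is injected as a plain function so the sketch runs without any API client.

```python
from typing import Callable

# Hypothetical system prompt; the wording used in the experiment is not shown here.
SYSTEM_PROMPT = "You are a senior M&A attorney. Find every material issue."

ModelFn = Callable[[str, str], str]  # (system_prompt, user_input) -> output text


def build_single_prompt(docs: dict[str, str]) -> str:
    """Concatenate every document under a filename header, in a stable order."""
    parts = [f"=== {name} ===\n{text}" for name, text in sorted(docs.items())]
    return "\n\n".join(parts)


def run_single_prompt(docs: dict[str, str], call_model: ModelFn) -> str:
    """One model call over the whole data room; returns the memo text."""
    return call_model(SYSTEM_PROMPT, build_single_prompt(docs))


# Stub model that records its inputs, standing in for a real LLM client.
calls: list[tuple[str, str]] = []


def fake_model(system: str, user: str) -> str:
    calls.append((system, user))
    return "MEMO"


docs = {
    "b_license.md": "Master license text",
    "a_offer.md": "Offer letter text",
}
memo = run_single_prompt(docs, fake_model)
print(len(calls))  # 1
```

Nothing in this structure forces the model to attend to any one document or to compare two of them. The whole room arrives as one undifferentiated block of text.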

It nailed the change-of-control clause, the litigation exposure, and the going-concern qualification. Clean, well-cited memo. No hallucinations. It looked, frankly, pretty good - until you compared it to the answer key.

Pipeline 2: 3-agent swarm

Three agents, each a separate model call with its own system prompt:

Total: 32 LLM calls. Higher cost - the researcher agent pays for input tokens on 30 separate calls instead of one. Total cost is roughly 2-3x single-prompt depending on output verbosity.
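The call count can be reproduced with a small sketch of the staging. The researcher and risk-flagger roles are named in this post; the third role is labeled `memo-writer` here purely as an assumption, and `run_swarm`, `ModelFn`, and the stub model are hypothetical names for illustration.

```python
from typing import Callable

ModelFn = Callable[[str, str], str]  # (system_prompt, user_input) -> output text


def run_swarm(docs: dict[str, str], call_model: ModelFn) -> tuple[str, int]:
    """Return (memo, number_of_model_calls)."""
    n_calls = 0

    # Stage 1: one researcher call per document.
    summaries: dict[str, str] = {}
    for name, text in sorted(docs.items()):
        summaries[name] = call_model("researcher", text)
        n_calls += 1

    # Stage 2: one risk-flagger call over all per-document summaries.
    joined = "\n".join(f"{name}: {summary}" for name, summary in summaries.items())
    flags = call_model("risk-flagger", joined)
    n_calls += 1

    # Stage 3 (assumed role): one call that turns the flags into a memo.
    memo = call_model("memo-writer", flags)
    n_calls += 1
    return memo, n_calls


def fake_model(system: str, user: str) -> str:
    """Stub standing in for a real LLM client."""
    return f"[{system}] saw {len(user)} chars"


docs = {f"doc_{i:02d}.md": f"contents of document {i}" for i in range(30)}
memo, n_calls = run_swarm(docs, fake_model)
print(n_calls)  # 32
```

With 30 documents the loop accounts for 30 calls and the two downstream stages for one each, which matches the total above.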

Note the `sources` field in the risk-flagger schema. The schema *forces* the model to attribute each flag to one or more documents. That single design choice is what makes cross-document issues land - the model is being asked to think across summaries, not just within them.
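As a rough illustration of what a schema with a mandatory `sources` field could look like, here is a minimal sketch. Only the `sources` field name comes from this post; the `RiskFlag` type, the other fields, the validation rule, and the file names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskFlag:
    title: str
    severity: str             # hypothetical field, e.g. "critical", "high"
    sources: tuple[str, ...]  # document names the flag is attributed to

    def __post_init__(self) -> None:
        # Reject any flag the model could not attribute to a document.
        if not self.sources:
            raise ValueError(f"flag {self.title!r} has no sources")

    @property
    def is_cross_document(self) -> bool:
        """True when the flag rests on more than one distinct document."""
        return len(set(self.sources)) > 1


def parse_flags(raw: list[dict]) -> list[RiskFlag]:
    """Validate model output that has already been decoded from JSON."""
    return [
        RiskFlag(r["title"], r["severity"], tuple(r.get("sources", ())))
        for r in raw
    ]


flags = parse_flags([
    {
        "title": "IP chain-of-title gap",
        "severity": "critical",
        # Hypothetical file names.
        "sources": ["ip-assignment-log.md", "offer-letter.md", "master-license.md"],
    },
    {"title": "Going-concern qualification", "severity": "high", "sources": ["audit-opinion.md"]},
])
print([f.is_cross_document for f in flags])  # [True, False]
```

Rejecting unattributed flags at parse time means a cross-document issue has to name every document it depends on, which also gives a reviewer an audit trail to check.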

What single-prompt missed and why

This is the part that matters. Both misses are structural to single-prompt architecture, not random failures.

Miss 1 - The IP chain-of-title gap

The IP assignment log notes that engineer Wei Lin has 'an executed offer letter on file but no countersigned PIIA on record - to be followed up.' The offer letter has a placeholder line saying `[PIIA attached as Exhibit A]` with no exhibit. The master license, in a separate folder, names Wei Lin as the inventor of the licensed core platform. Connect the three and you have a critical issue: the acquirer may not actually own the IP it's paying $250M for.

The single-prompt model can *see* all three documents. It just doesn't connect them under 'find every material issue' prompting. Long contexts blur cross-document linking - the model produces a great per-document scan but rarely takes the extra step of asking 'is the engineer in document X the same person in document Y, and does that change the picture?' There's no prompt incentive to do so.

The swarm caught it because the risk-flagger's input is *structured per-document summaries with named entities surfaced*. When the partner-style prompt sees 'engineer with missing PIIA' in one summary and 'engineer named as inventor' in another, the connection is one inference step away, not buried in 24K tokens of contract prose.
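One way to see why surfaced entities help is to build an entity-to-document index over the summaries. The sketch below is an assumption about how such summaries might be shaped: the `entities` key, the function names, the file names, and the "Example Licensor" entity are hypothetical; only the Wei Lin details come from the scenario described above.

```python
from collections import defaultdict


def entity_document_map(summaries: dict[str, dict]) -> dict[str, list[str]]:
    """Index every named entity by the documents whose summary mentions it."""
    index: dict[str, list[str]] = defaultdict(list)
    for doc_name, summary in sorted(summaries.items()):
        for entity in sorted(set(summary["entities"])):
            index[entity].append(doc_name)
    return dict(index)


def cross_document_entities(index: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only entities that appear in more than one document."""
    return {entity: docs for entity, docs in index.items() if len(docs) > 1}


# Hypothetical per-document summaries, shaped like researcher output.
summaries = {
    "ip-assignment-log.md": {
        "entities": ["Wei Lin"],
        "notes": "no countersigned PIIA on record",
    },
    "offer-letter.md": {
        "entities": ["Wei Lin"],
        "notes": "PIIA exhibit referenced but not attached",
    },
    "master-license.md": {
        "entities": ["Wei Lin", "Example Licensor"],
        "notes": "named as inventor of the licensed core platform",
    },
}

linked = cross_document_entities(entity_document_map(summaries))
print(linked)
# {'Wei Lin': ['ip-assignment-log.md', 'master-license.md', 'offer-letter.md']}
```

Once the same name is listed against three documents, reading the three notes side by side is a short step. The same information in raw contract prose has no such index.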

Miss 2 - The unenforceable California non-compete

The Chief Scientific Officer has a 2-year nationwide non-compete in her employment agreement. California governing law. She lives in Palo Alto.

To anyone who has read California Bus. & Prof. Code § 16600, the clause is functionally void. AB 1076, effective January 2024, sharpened it further: employers must affirmatively notify current and former employees bound by such covenants that the covenants are unenforceable, or face additional liability.

The single-prompt model knows this. Ask it directly - 'is this non-compete enforceable in California?' - and it will tell you. It just doesn't surface that knowledge unprompted under a generic 'find every material issue' instruction. The clause looks fine on its face. Nothing in the document itself screams 'ask whether I'm enforceable.'

The swarm caught it because the risk-flagger system prompt scopes the model to a partner role applying a materiality threshold across summaries - and in that role, it asks jurisdictional questions about restrictive covenants. With a more specialized agent (a dedicated employment-law reviewer, or a tool call to a statute database), this becomes deterministic. With a generic single prompt, it's a coin flip.
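A dedicated check of this kind can be sketched as a lookup keyed on covenant type and governing law. This is a deliberately tiny illustration under stated assumptions: the `Covenant` type, the `ENFORCEABILITY_RULES` table, and the file name are hypothetical, and a hard-coded dictionary stands in for the statute database a real checker would query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Covenant:
    document: str
    kind: str           # e.g. "non-compete", "non-solicit"
    governing_law: str  # e.g. "California"


# Hypothetical rule table; a real checker would consult an authoritative source.
ENFORCEABILITY_RULES = {
    ("non-compete", "California"): "Likely void under Cal. Bus. & Prof. Code § 16600",
}


def check_covenant(covenant: Covenant) -> Optional[str]:
    """Return an enforceability warning, or None when no rule matches."""
    return ENFORCEABILITY_RULES.get((covenant.kind, covenant.governing_law))


cso = Covenant("cso-employment-agreement.md", "non-compete", "California")
print(check_covenant(cso))
# Likely void under Cal. Bus. & Prof. Code § 16600
```

The lookup fires every time the extracted covenant type and governing law match a rule, regardless of how the surrounding prompt is worded. The hard part it leaves open is extracting those two fields reliably from the agreement.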

These are not exotic failure modes. Cross-document linking and external statutory knowledge are *the* two categories where diligence adds the most value. They are also the two categories where context-stuffing a frontier model gives you the worst false sense of security, because the output looks comprehensive.

The cost / quality tradeoff

Single-prompt is cheaper, faster, and produces a more coherent memo. On a 30-doc room it's roughly 2-3x cheaper and about 3x faster than the swarm.

Swarm catches more, leaves a per-document audit trail, and lets you swap in specialist agents (FDA regulatory, ERISA, tax, jurisdiction-specific employment law) without rewriting the pipeline.

When does the lift justify the cost?

At the deal sizes this is built for, an extra few dollars per run is a rounding error against the cost of missing one of these issues at signing. The default is swarm. Single-prompt is a triage tool, not a diligence tool.

What this means if you're buying M&A AI right now

Four things. None of them are vendor-flattering.

**One.** Don't trust an 'AI diligence assistant' that won't tell you its architecture. If the answer to 'single prompt or multi-agent?' is 'proprietary,' walk. The architecture decides what gets caught.

**Two.** Single-prompt context-stuffing is fine for tight, small reviews and dangerous for deeper diligence. A 1M-context window is not a substitute for forced per-document attention. It just lets the failure mode hide better.

**Three.** Cross-doc and external-knowledge issues will not emerge from 'find all material issues' no matter how big your context is. They require specialized agent prompting, tool calls to authoritative sources, or explicit cross-doc linking instructions. If the product can't show you which of these it does, it isn't doing them.

**Four.** Ground evaluation in planted-issue tests, not vibes. The memo *looks* thorough. It just doesn't catch the IP gap. The only way to know is to run an answer key against it.

What we'd build for HAQQ

Swarm-by-default, with single-prompt as the cost-saving fallback for small or triage reviews. Specialized agents for the failure categories the generic swarm doesn't address: a cross-doc linker that explicitly enumerates entity-to-document maps before flagging, and a jurisdiction-checker that tags every restrictive covenant, governing-law clause, and regulatory rep with an enforceability check against the relevant statute.

The planted-issues benchmark stays open-source. We'll keep adding issues - chain-of-title traps, cap-table-vs-409A drift, hidden MFN clauses, side letters that contradict the main agreement - and publish numbers whenever we ship an architecture change. If a vendor wants to claim their product is better, they can run the same harness and post the receipts.