M&A Due Diligence AI: Single Prompt vs a 3-Agent Swarm

By Stephane Boghossian · 2026-05-11 · Updated 2026-06-11 · 10 min read · Ai-legal-tech

Same model, same 30-doc data room. A single prompt caught 3/5 planted issues; a 3-agent swarm caught 5/5. The misses are structural — here is why.

The headline

The three issues both pipelines caught were the obvious ones - the kind a competent associate flags with a yellow highlighter. The two that single-prompt missed required either connecting two or more documents to each other or applying outside legal knowledge to a clause that looks fine on its face. Those are exactly the categories where AI diligence tools quietly fail, and exactly the categories every M&A AI vendor pitch glosses over.

Key facts

Single-prompt caught 3/5 planted material issues; the 3-agent swarm caught 5/5. Both scored precision 1.0 on the same model (Claude Opus 4.7, 1M context) and the same 30-doc data room. The numbers are mock-calibrated to demonstrate the open-source harness.
The 30-document data room (~13,200 words / 24,000 tokens) fits in one 1M-context call with ~976K tokens to spare — context size was not the constraint.
The swarm ran 32 LLM calls (30 researcher + 1 risk-flagger + 1 summarizer) at roughly 2-3x single-prompt cost.

We built both pipelines, ran them on the same data, and made the answer key public.

Why we ran the test

Every AI diligence pitch we've sat through dodges the same question: what's the architecture? One giant context-stuffed prompt? Multi-agent pipeline? RAG over chunked docs? Most pitches won't tell you. The demo is a glossy memo and a dashboard. The architecture is 'proprietary.'

That's a problem, because the architecture is the product. A single context-stuffed prompt and a 3-agent swarm running on the same model produce dramatically different memos on the same data room. We wanted to measure how dramatically.

So we built both. Same model - Claude Opus 4.7, 1M context - same documents, same scoring harness. Differences in output reflect prompt architecture, not model choice. An LLM-as-judge scored the outputs against a planted-issue answer key that the judge could see and the pipelines could not.
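The scoring step reduces to set arithmetic once the judge has decided which planted issues a memo matched. The sketch below is a minimal illustration of that arithmetic, not the harness itself: the issue ids, the `Score` class, and `score_memo` are hypothetical names, and the judge's matching step is assumed to have already produced a set of ids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        flagged = self.true_positives + self.false_positives
        return self.true_positives / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        planted = self.true_positives + self.false_negatives
        return self.true_positives / planted if planted else 0.0


def score_memo(flagged_ids: set[str], planted_ids: set[str]) -> Score:
    """Compare the issue ids a judge matched in a memo against the answer key."""
    return Score(
        true_positives=len(flagged_ids & planted_ids),
        false_positives=len(flagged_ids - planted_ids),
        false_negatives=len(planted_ids - flagged_ids),
    )


# Hypothetical ids for the five planted issues.
PLANTED = {
    "change_of_control",
    "litigation",
    "going_concern",
    "ip_chain_of_title",
    "ca_non_compete",
}

single_prompt = score_memo({"change_of_control", "litigation", "going_concern"}, PLANTED)
swarm = score_memo(set(PLANTED), PLANTED)
```

With these inputs both pipelines have precision 1.0 (nothing flagged that was not planted), and recall separates them: 3 of 5 against 5 of 5.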

A note before the numbers: this is a controlled experiment on synthetic data. The mock pipeline numbers were calibrated to demonstrate the harness end-to-end. We're publishing it anyway because the failure pattern - single-prompt loses cross-document linking and external-knowledge issues - is what we believe replicates with real LLM calls, and the experiment design is reusable. Code is open-source.

The data room

Thirty markdown documents, seven categories, ~13,200 words / 24,000 tokens total. Built to mirror the signal density of a Series-D therapeutics data room, with clearly fake company names so nobody confuses it for a leak (Acme Sprockets, NorthStar Therapeutics, Helix BioSystems, Meridian Bio).

The whole corpus fits in a single Claude Opus 4.7 1M-context call with ~976K tokens to spare. This removes the most common excuse for why single-prompt would underperform. We're not asking the model to retrieve from a corpus too large for its context. The entire data room is in the window.
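The headroom figure is simple subtraction, shown here as a small check. The two constants come from the figures stated above; the function name is hypothetical, and the sketch ignores the system prompt and the roughly 3K output tokens, which are small next to the window.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # stated 1M-context window
DATA_ROOM_TOKENS = 24_000          # stated size of the 30-document corpus


def headroom(context_window: int, corpus_tokens: int) -> int:
    """Tokens left in the window after the whole corpus is loaded."""
    if corpus_tokens > context_window:
        raise ValueError("corpus does not fit in the context window")
    return context_window - corpus_tokens


spare_tokens = headroom(CONTEXT_WINDOW_TOKENS, DATA_ROOM_TOKENS)
window_used = DATA_ROOM_TOKENS / CONTEXT_WINDOW_TOKENS  # fraction of the window consumed
```

The corpus uses about 2.4% of the window, which is why context size cannot explain the misses.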

The five planted issues

Five material issues are planted in ordinary-looking documents - not in headers, not telegraphed. Deliberately varied in detection difficulty:

Three single-document, on-the-face issues - a change-of-control trigger in a supply agreement, a litigation counterclaim worth more than 10% of the purchase price, and a going-concern qualification in the auditor's report. Checklist items. Fail the checklist, fail diligence.
One cross-document issue - an IP chain-of-title gap visible only when you connect the IP assignment log (which notes one engineer with a missing PIIA), a master license (which names that same engineer as inventor of the licensed platform), and that engineer's offer letter (which says 'PIIA attached' with no exhibit). No single document carries the full signal.
One external-knowledge issue - a 2-year nationwide non-compete on the Chief Scientific Officer, governed by California law. To know it's worthless, you have to know California Business and Professions Code § 16600 voids most employee non-competes, and that AB-1076 (effective January 2024) added a notice obligation on top.

The full answer key with `must catch` criteria for the LLM-as-judge is in `fixtures/known-issues.md`. We didn't show it to the pipelines.
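One way to picture the answer key is as five records tagged by detection difficulty. The sketch below is illustrative only: the class names, issue ids, and `must_catch` strings are assumptions paraphrased from the descriptions above, not the contents of `fixtures/known-issues.md`.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    SINGLE_DOCUMENT = "single-document"
    CROSS_DOCUMENT = "cross-document"
    EXTERNAL_KNOWLEDGE = "external-knowledge"


@dataclass(frozen=True)
class PlantedIssue:
    issue_id: str        # hypothetical identifier
    difficulty: Difficulty
    must_catch: str      # paraphrase of what a memo must state to get credit


ANSWER_KEY = [
    PlantedIssue("change_of_control", Difficulty.SINGLE_DOCUMENT,
                 "Change-of-control trigger in a supply agreement"),
    PlantedIssue("litigation", Difficulty.SINGLE_DOCUMENT,
                 "Counterclaim worth more than 10% of the purchase price"),
    PlantedIssue("going_concern", Difficulty.SINGLE_DOCUMENT,
                 "Going-concern qualification in the auditor's report"),
    PlantedIssue("ip_chain_of_title", Difficulty.CROSS_DOCUMENT,
                 "Inventor of the licensed platform has no executed PIIA"),
    PlantedIssue("ca_non_compete", Difficulty.EXTERNAL_KNOWLEDGE,
                 "Non-compete under California law is likely unenforceable"),
]

# How many planted issues fall in each difficulty bucket.
by_difficulty = Counter(issue.difficulty for issue in ANSWER_KEY)
```

Tagging each issue with a difficulty makes it possible to report recall per category rather than a single aggregate number.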

Pipeline 1: single-prompt

Concatenate every document. Wrap it in a senior-attorney system prompt. One LLM call. Get the memo. This is what most 'we use a frontier model with 1M context' pitches look like under the hood.

One model call. ~24K input tokens, ~3K output, sub-15-second wall clock, low single-digit cents per run. Cheap, fast, structurally simple.
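A minimal sketch of this pipeline, assuming a hypothetical `call_llm(system_prompt, user_prompt)` callable that stands in for whatever model client is used. The system prompt text and the document separator format are placeholders, not the prompts used in the experiment.

```python
from typing import Callable

# Hypothetical client signature: (system_prompt, user_prompt) -> completion text.
LLMCall = Callable[[str, str], str]

# Placeholder wording; the experiment's actual prompt is not reproduced here.
SYSTEM_PROMPT = (
    "You are a senior M&A attorney. Review the data room below "
    "and find every material issue."
)


def build_user_prompt(documents: dict[str, str]) -> str:
    """Concatenate every document, labelled by filename, in a stable order."""
    sections = [f"=== {name} ===\n{text}" for name, text in sorted(documents.items())]
    return "\n\n".join(sections)


def run_single_prompt(documents: dict[str, str], call_llm: LLMCall) -> str:
    """One model call over the whole data room; returns the memo."""
    return call_llm(SYSTEM_PROMPT, build_user_prompt(documents))


# Stub client so the sketch runs without a network call.
calls: list[tuple[str, str]] = []


def fake_llm(system_prompt: str, user_prompt: str) -> str:
    calls.append((system_prompt, user_prompt))
    return "MEMO"


memo = run_single_prompt({"b.md": "Beta text", "a.md": "Alpha text"}, fake_llm)
```

Everything the model learns about document boundaries comes from the separator lines, which is part of why nothing in this design pushes it to link one document to another.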

It nailed the change-of-control clause, the litigation exposure, and the going-concern qualification. Clean, well-cited memo. No hallucinations. It looked, frankly, pretty good - until you compared it to the answer key.

Pipeline 2: 3-agent swarm

Three agents, each a separate model call with its own system prompt:

Researcher (30 parallel calls, one per doc). Per-doc structured summary: counterparties, term, economic terms, change-of-control language, unusual provisions, open questions.
Risk-flagger (one call). Reads all 30 researcher summaries. Returns a JSON list of material issues with severity, rationale, source citations.
Summarizer (one call). Turns the flag list into a deal-team memo.

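Total: 32 LLM calls. Cost is higher because the researcher system prompt is paid 30 times instead of once, each researcher call produces its own output, and the risk-flagger then reads all 30 summaries as input. Total cost is roughly 2-3x single-prompt depending on output verbosity.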
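The three stages can be sketched as a fan-out followed by two sequential calls. This is an illustration under assumptions: `call_llm` is a hypothetical client callable, the three prompt strings are placeholders, and a thread pool is one of several ways to run the researcher calls in parallel.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

LLMCall = Callable[[str, str], str]  # hypothetical: (system_prompt, user_prompt) -> text

# Placeholder prompts, not the experiment's wording.
RESEARCHER_PROMPT = "Summarize this document: counterparties, term, economics, unusual provisions."
RISK_FLAGGER_PROMPT = "Read all summaries and return a JSON list of material issues."
SUMMARIZER_PROMPT = "Turn the flag list into a deal-team memo."


def run_swarm(documents: dict[str, str], call_llm: LLMCall, max_workers: int = 8) -> str:
    names = sorted(documents)
    # Stage 1: one researcher call per document, run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(
            pool.map(lambda name: call_llm(RESEARCHER_PROMPT, f"{name}\n{documents[name]}"), names)
        )
    # Stage 2: a single risk-flagger call over all labelled summaries.
    labelled = "\n\n".join(f"[{name}]\n{summary}" for name, summary in zip(names, summaries))
    flags_json = call_llm(RISK_FLAGGER_PROMPT, labelled)
    # Stage 3: a single summarizer call turns the flags into the memo.
    return call_llm(SUMMARIZER_PROMPT, flags_json)


# Stub client that records which prompt each call used.
call_log: list[str] = []
_lock = threading.Lock()


def fake_llm(system_prompt: str, user_prompt: str) -> str:
    with _lock:
        call_log.append(system_prompt)
    return f"stub output ({len(user_prompt)} chars in)"


docs = {f"doc_{i:02d}.md": f"text of document {i}" for i in range(30)}
memo = run_swarm(docs, fake_llm)
```

On a 30-document room the stub records 30 researcher calls, then one risk-flagger call, then one summarizer call.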

The risk-flagger schema includes a `sources` field that forces the model to attribute each flag to one or more documents. That single design choice is what makes cross-document issues land - the model is being asked to think across summaries, not just within them.
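A sketch of what such a schema and its validation might look like. Only the `sources` field name and the presence of severity, rationale, and citations come from the description above; the `issue` field, the severity scale, the document names, and `parse_flags` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}  # assumed scale


@dataclass(frozen=True)
class RiskFlag:
    issue: str
    severity: str
    rationale: str
    sources: tuple[str, ...]  # documents the flag is attributed to


def parse_flags(raw_json: str, known_documents: set[str]) -> list[RiskFlag]:
    """Parse the risk-flagger's JSON list and reject unattributed flags."""
    flags = []
    for item in json.loads(raw_json):
        sources = tuple(item["sources"])
        if not sources:
            raise ValueError(f"flag {item['issue']!r} cites no source document")
        unknown = set(sources) - known_documents
        if unknown:
            raise ValueError(f"flag cites unknown documents: {sorted(unknown)}")
        if item["severity"] not in ALLOWED_SEVERITIES:
            raise ValueError(f"unexpected severity: {item['severity']!r}")
        flags.append(RiskFlag(item["issue"], item["severity"], item["rationale"], sources))
    return flags


# Hypothetical document names and flag content.
DOCS = {"ip-assignment-log.md", "master-license.md", "offer-letter.md"}
RAW = json.dumps([{
    "issue": "IP chain-of-title gap",
    "severity": "critical",
    "rationale": "Inventor of the licensed platform has no countersigned PIIA on record.",
    "sources": ["ip-assignment-log.md", "master-license.md", "offer-letter.md"],
}])

flags = parse_flags(RAW, DOCS)
cross_document_flags = [flag for flag in flags if len(flag.sources) > 1]
```

Rejecting flags with empty or unknown sources keeps the audit trail honest, and counting flags with more than one source gives a quick read on how much cross-document reasoning a run produced.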

What single-prompt missed and why

This is the part that matters. Both misses are structural to single-prompt architecture, not random failures.

Miss 1 - The IP chain-of-title gap

The IP assignment log notes that engineer Wei Lin has 'an executed offer letter on file but no countersigned PIIA on record - to be followed up.' The offer letter has a placeholder line saying `[PIIA attached as Exhibit A]` with no exhibit. The master license, in a separate folder, names Wei Lin as the inventor of the licensed core platform. Connect the three and you have a critical issue: the acquirer may not actually own the IP it's paying $250M for.

The single-prompt model can see all three documents. It just doesn't connect them under 'find every material issue' prompting. Long contexts blur cross-document linking - the model produces a great per-document scan but rarely takes the extra step of asking 'is the engineer in document X the same person in document Y, and does that change the picture?' There's no prompt incentive to do so.

The swarm caught it because the risk-flagger's input is structured per-document summaries with named entities surfaced. When the partner-style prompt sees 'engineer with missing PIIA' in one summary and 'engineer named as inventor' in another, the connection is one inference step away, not buried in 24K tokens of contract prose.
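The effect of surfacing named entities per document can be illustrated with a small inverted index. This is a sketch, not part of the harness: the filenames and entity lists are hypothetical, and real entity matching would need more than lowercasing (aliases, initials, misspellings).

```python
from collections import defaultdict


def build_entity_index(entities_by_doc: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each normalized entity name to the documents that mention it."""
    index: dict[str, list[str]] = defaultdict(list)
    for doc in sorted(entities_by_doc):
        # Deduplicate within a document so each document is listed once per entity.
        for entity in sorted({name.strip().lower() for name in entities_by_doc[doc]}):
            index[entity].append(doc)
    return dict(index)


def cross_document_entities(index: dict[str, list[str]], min_documents: int = 2) -> dict[str, list[str]]:
    """Keep only the entities that appear in at least min_documents documents."""
    return {entity: docs for entity, docs in index.items() if len(docs) >= min_documents}


# Hypothetical per-document entity lists, as a researcher summary might surface them.
entities_by_doc = {
    "ip-assignment-log.md": ["Wei Lin"],
    "master-license.md": ["Wei Lin"],
    "offer-letter-wei-lin.md": ["Wei Lin", "wei lin"],
    "cso-employment-agreement.md": ["Chief Scientific Officer"],
}

linked = cross_document_entities(build_entity_index(entities_by_doc))
```

Here the only entity spanning multiple documents is the engineer, which is exactly the prompt to ask whether the three mentions change the picture when read together.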

Miss 2 - The unenforceable California non-compete

The Chief Scientific Officer has a 2-year nationwide non-compete in her employment agreement. California governing law. She lives in Palo Alto.

To anyone who has read California Business and Professions Code § 16600, the clause is functionally void. AB-1076, effective January 2024, sharpened it further: employers must affirmatively notify former employees with such covenants that the covenants are unenforceable, or face additional liability.

The single-prompt model knows this. Ask it directly - 'is this non-compete enforceable in California?' - and it will tell you. It just doesn't surface that knowledge unprompted under a generic 'find every material issue' instruction. The clause looks fine on its face. Nothing in the document itself screams 'ask whether I'm enforceable.'

The swarm caught it because the risk-flagger system prompt scopes the model to a partner role applying a materiality threshold across summaries - and in that role, it asks jurisdictional questions about restrictive covenants. With a more specialized agent (a dedicated employment-law reviewer, or a tool call to a statute database), this becomes deterministic. With a generic single prompt, it's a coin flip.
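A deterministic version of that check could be as simple as a lookup keyed on governing law and covenant type. The sketch below is a hypothetical illustration: the `Covenant` record, the rule table, and its single entry are assumptions, and a production checker would query an authoritative statute source rather than a hard-coded dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Covenant:
    document: str
    kind: str           # e.g. "non-compete"
    governing_law: str  # e.g. "California"


# Hypothetical rule table with one entry, taken from the issue described above.
ENFORCEABILITY_RULES = {
    ("california", "non-compete"): (
        "Likely void under Cal. Bus. & Prof. Code § 16600; "
        "check the AB-1076 notice obligation."
    ),
}


def check_covenant(covenant: Covenant) -> Optional[str]:
    """Return a warning if a rule matches; None means no rule on file, not 'enforceable'."""
    key = (covenant.governing_law.strip().lower(), covenant.kind.strip().lower())
    return ENFORCEABILITY_RULES.get(key)


cso = Covenant("cso-employment-agreement.md", "non-compete", "California")
other = Covenant("supply-agreement.md", "exclusivity", "Delaware")

cso_warning = check_covenant(cso)
other_warning = check_covenant(other)
```

The point of the design is that the jurisdictional question gets asked for every restrictive covenant, whether or not the clause looks suspicious on its face.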

These are not exotic failure modes. Cross-document linking and external statutory knowledge are the two categories where diligence adds the most value. They are also the two categories where context-stuffing a frontier model gives you the worst false sense of security, because the output looks comprehensive.

The cost / quality tradeoff

Single-prompt is cheaper, faster, and produces a more coherent memo. On a 30-doc room it's roughly 2-3x cheaper and about 3x faster than the swarm.

Swarm catches more, leaves a per-document audit trail, and lets you swap in specialist agents (FDA regulatory, ERISA, tax, jurisdiction-specific employment law) without rewriting the pipeline.

When does the lift justify the cost?

Under 10 docs, deal value under $25M, time-boxed first read - single-prompt is fine. Cross-doc surface area is small.
30+ docs, deal value over $100M, anything pre-signing - swarm. The cost delta on a $100M deal is irrelevant against one missed material issue.
Regulated industries (life sciences, financial services, defense) - swarm, with at least one specialized agent for the relevant regulator.
Anything you'd lose your job for missing - swarm.

At the deal sizes this is built for, an extra few dollars per run is a rounding error against the cost of missing one of these issues at signing. The default is swarm. Single-prompt is a triage tool, not a diligence tool.
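The rules of thumb above can be written as a small routing function. This is a sketch under assumptions: the `Review` fields and return labels are hypothetical, and it treats the single-prompt conditions as all required at once, which is the conservative reading of the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    doc_count: int
    deal_value_usd: int
    regulated_industry: bool = False
    pre_signing: bool = False
    time_boxed_first_read: bool = False


def choose_architecture(review: Review) -> str:
    """Default to the swarm; allow single-prompt only for small triage reads."""
    if review.regulated_industry:
        # Regulated deals get the swarm plus a regulator-specific agent.
        return "swarm+specialist"
    small = review.doc_count < 10 and review.deal_value_usd < 25_000_000
    if small and review.time_boxed_first_read and not review.pre_signing:
        return "single-prompt"
    return "swarm"


triage = choose_architecture(Review(doc_count=8, deal_value_usd=10_000_000, time_boxed_first_read=True))
large = choose_architecture(Review(doc_count=30, deal_value_usd=250_000_000, pre_signing=True))
biotech = choose_architecture(Review(doc_count=12, deal_value_usd=40_000_000, regulated_industry=True))
middle = choose_architecture(Review(doc_count=15, deal_value_usd=50_000_000))
```

Anything that does not clearly qualify as a small triage read falls through to the swarm, which matches the stated default.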

What this means if you're buying M&A AI right now

Four things. None of them are vendor-flattering.

One. Don't trust an 'AI diligence assistant' that won't tell you its architecture. If the answer to 'single prompt or multi-agent?' is 'proprietary,' walk. The architecture decides what gets caught.

Two. Single-prompt context-stuffing is fine for tight, small reviews and dangerous for deeper diligence. A 1M-context window is not a substitute for forced per-document attention. It just lets the failure mode hide better.

Three. Cross-doc and external-knowledge issues will not emerge from 'find all material issues' no matter how big your context is. They require specialized agent prompting, tool calls to authoritative sources, or explicit cross-doc linking instructions. If the product can't show you which of these it does, it isn't doing them.

Four. Ground evaluation in planted-issue tests, not vibes. The memo looks thorough. It just doesn't catch the IP gap. The only way to know is to run an answer key against it.

What we'd build for HAQQ

Swarm-by-default, with single-prompt as the cost-saving fallback for small or triage reviews. Specialized agents for the failure categories the generic swarm doesn't address: a cross-doc linker that explicitly enumerates entity-to-document maps before flagging, and a jurisdiction-checker that tags every restrictive covenant, governing-law clause, and regulatory rep with an enforceability check against the relevant statute.

The planted-issues benchmark stays open-source. We'll keep adding issues - chain-of-title traps, cap-table-vs-409A drift, hidden MFN clauses, side letters that contradict the main agreement - and publish numbers whenever we ship an architecture change. If a vendor wants to claim their product is better, they can run the same harness and post the receipts.

FAQ

What is M&A due diligence AI?

M&A due diligence AI is the use of large language models and retrieval systems to review the data room - contracts, regulatory filings, IP records, employment agreements - and surface issues that affect deal value or risk. It supplements rather than replaces lawyer review, and its value is measured in recall of material issues.

Is AI accurate enough for M&A due diligence?

Single-prompt AI is not accurate enough for material issue spotting - it consistently misses the structural problems that bankrupt deals. Multi-agent AI architectures with retrieval grounding and specialized reviewers reach much higher recall, and become production-viable when combined with named lawyer approval at every gate.

Single-prompt vs multi-agent AI - what is the difference?

Single-prompt sends the data room to one model with one instruction. Multi-agent decomposes the work: one agent indexes, one specialized agent reviews IP, another reviews employment, another reviews regulatory, and a coordinator reconciles findings. Recall on planted-issue benchmarks is substantially higher with the multi-agent approach.

What does the M&A due diligence AI benchmark show?

On a controlled 30-document data room with 5 planted material issues and a public answer key, single-prompt review caught 3 of 5 issues. A 3-agent swarm caught 5 of 5. The two issues single-prompt missed - an IP chain-of-title break and a California non-compete that is void under § 16600, with an AB-1076 notice obligation on top - are exactly the kind that get associates fired and deals repriced.

Why does single-prompt M&A AI fail on the deal-killer issues?

Because the deal-killer issues are rarely visible on the face of one document. They are either cross-document (a missing assignment in one folder that breaks a chain of title described in another) or dependent on outside legal knowledge (a covenant that reads fine but is void under the governing statute). A single prompt over a long context tends to surface the loudest issues and miss the quietest, most consequential ones. Decomposition is what fixes the failure mode.

How does HAQQ handle M&A due diligence?

HAQQ runs M&A diligence as a multi-agent workflow with specialized reviewers per issue category, retrieval grounding against the data room, structured findings output, and named lawyer approval before anything leaves the workspace. The architecture is built for recall on material issues, not demo-quality summaries.
