Open Source Legal Software in 2026: The Full Landscape and HAQQ's Contributions

By Stephane Boghossian · 2026-05-20 · Updated 2026-06-11 · 16 min read · Ai-legal-tech

From CourtListener to Mike: the full 2026 map of open source legal software, what HAQQ ships back, and why the real bottleneck is data, not models.

Mike is an open source clone of Harvey and Legora. Self-hostable, bring your own API key, no per-seat pricing. The code itself is rough - someone in the comments correctly pointed out it's basically a Supabase auth call and five database tables. But that's not really the point.

Key facts

Free Law Project's CourtListener hosts 250 million pages of US court data, free, and most legal AI startups train on it without credit.
Harvard's Caselaw Access Project digitized 360 years of US case law: 6.9 million cases, fully open since 2024.
HAQQ has ~9,800 paying firms across 80+ countries while keeping non-differentiating infrastructure open source.

The point is the reaction.

Hundreds of comments. Reddit threads. LinkedIn debates. Lawyers asking why they couldn't just have their associate spin up something similar in a weekend. Builders asking why this hadn't happened five years ago.

I read the whole thing twice. And the more I read, the more it felt like a moment. Not because Mike itself is going to disrupt anything - it probably won't. But because legal tech has finally caught the open source bug, and once that starts, you can't put it back.

Why legal was the last vertical to get here

Every other industry got open source years ago. Healthcare has OpenMRS. Fintech has Hyperledger. E-commerce has Magento. Even the boring corners of enterprise have their thing.

Legal had basically nothing. And the reasons were never about technology.

The first reason is that law firms make money by being inefficient. Sorry - I know that sounds harsh. But the billable hour creates a perverse incentive: if you automate a 10-hour task down to 1 hour, you just deleted 9 hours of revenue. So why would any partner contribute to a project that does that?

The second reason is the secret sauce thing. Firms guard their brief banks and templates like trade secrets, because they kind of are. You can't open source your litigation strategy when you might be using it against the firm down the street next month.

The third reason is licensing fear. Bar associations don't move fast. Compliance teams panic at GPL. Most legal counsel reading the words 'open source' picture a teenager in a hoodie stealing client data, not a Linux kernel maintainer.

And the fourth - the one nobody talks about enough - is that Thomson Reuters and LexisNexis built their moats around data, not software. KeyCite and Shepard's are taxonomies that took decades to build. Replicating them costs hundreds of millions. So even if you wanted to ship an open source legal stack, the data layer underneath was locked away.

That's the world we've been working in. It's also the world that's starting to crack.

What's actually being built right now

I've been keeping a running list. Some of these are years old and finally getting attention. Some shipped this month. The space is much bigger than most people realize. Let me try to organize it.

The data layer - the stuff everything else stands on

Free Law Project is the most underrated organization in legal tech. They run CourtListener (250 million pages of US court data, free), RECAP (a browser extension that pulls federal filings out of PACER and into the public domain), eyecite (the de facto US citation parser), and Juriscraper (Python scrapers for hundreds of US courts). 138 repos. Most legal AI startups train on their data and don't credit them. They should.
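To make "citation parser" concrete, here is a minimal sketch of the idea in Python. It is not eyecite's implementation or API: the regex, the `Citation` type, and `find_citations` are hypothetical names for illustration, and the pattern only covers a handful of reporter abbreviations in the plain volume-reporter-page form. Real parsers also handle short forms, "id." references, and hundreds of reporters.

```python
import re
from dataclasses import dataclass

# Hypothetical, deliberately tiny reporter list; real parsers cover hundreds.
REPORTERS = ["U.S.", "S. Ct.", "F.2d", "F.3d", "F. Supp."]

# Matches "<volume> <reporter> <first page>", e.g. "410 U.S. 113".
_PATTERN = re.compile(
    r"\b(?P<volume>\d{1,4})\s+(?P<reporter>"
    + "|".join(re.escape(r) for r in REPORTERS)
    + r")\s+(?P<page>\d{1,5})\b"
)


@dataclass(frozen=True)
class Citation:
    volume: int
    reporter: str
    page: int


def find_citations(text: str) -> list[Citation]:
    """Return every full volume-reporter-page citation found in text."""
    return [
        Citation(int(m["volume"]), m["reporter"], int(m["page"]))
        for m in _PATTERN.finditer(text)
    ]


sample = "See Roe v. Wade, 410 U.S. 113 (1973); cf. 347 U.S. 483."
citations = find_citations(sample)
```

Even this toy version shows why the real thing is hard: every reporter abbreviation, spacing variant, and short-form reference has to be enumerated by someone, which is exactly the work Free Law Project maintains in the open.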

Harvard's Caselaw Access Project digitized 360 years of US case law. 6.9 million cases, fully open since 2024. If you're building anything that needs American legal precedent, that's where you start.

Pile of Law - 256 GB of legal text across 35 sub-corpora, hosted on Hugging Face. The closest thing to 'The Pile' for law. Nearly every open legal LLM trains on a slice of it.

Find Case Law (UK National Archives) - UK judgments published as machine-readable LegalDocML XML, with Atom feeds. This is the gold standard schema. Other countries should copy it.
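As a rough sketch of why an Atom feed matters to builders: a feed is plain XML that the Python standard library can read without any scraping. The sample feed below is invented for illustration (titles, IDs, and URLs are placeholders, not real Find Case Law entries), and `parse_feed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample feed; not real data from any archive.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example judgments feed</title>
  <entry>
    <title>Example v Example</title>
    <id>urn:example:judgment:1</id>
    <updated>2026-01-15T10:00:00Z</updated>
    <link rel="alternate" href="https://example.org/judgments/1"/>
  </entry>
  <entry>
    <title>Sample Ltd v Placeholder plc</title>
    <id>urn:example:judgment:2</id>
    <updated>2026-01-16T09:30:00Z</updated>
    <link rel="alternate" href="https://example.org/judgments/2"/>
  </entry>
</feed>"""


def parse_feed(xml_text: str) -> list[dict[str, str]]:
    """Extract title, id, updated timestamp, and URL from each Atom entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link[@rel='alternate']", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
            "url": link.get("href", "") if link is not None else "",
        })
    return entries


entries = parse_feed(SAMPLE_FEED)
```

The point of the design is that a consumer needs a dozen lines and no jurisdiction-specific scraper, which is what "machine-readable" buys you in practice.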

EUR-Lex / Cellar - All EU legislation and CJEU case law, with a SPARQL endpoint. Probably the most structured open legal corpus on Earth. Underused outside academia.

OpenLegalData is the German equivalent - free German court decisions, normalized across fragmented official portals.

Indian Legal Corpus / InLegalBERT out of IIT Kharagpur covers Indian Supreme and High Court judgments. Most jurisdictions outside the US are critically under-served, and India is one of the few with serious open corpus work.

Brazil has community-built wrappers around the CNJ DataJud API exposing 100M+ case records - community-maintained, fragile, important. Same pattern: technically public, practically unscrapable, until someone open-sources the bridge.

Legal Data Hunter is a small example of the long tail here - a Scrapy + FastAPI project that hunts statutes and gazette publications across government sites and normalizes them. Not a flagship, but emblematic. Legal AI runs on hundreds of solo-maintained scrapers like this. They are the unsexy backbone nobody funds.

NLP libraries and open weights

Blackstone - spaCy pipeline for UK and Commonwealth legal text. Rare non-US legal NER.

LexNLP - Python library for extracting legal entities, citations, durations, money, parties. Pre-LLM but still vendored inside half the commercial tools you've heard of.

Legal-BERT (nlpaueb) - BERT pretrained on EU legislation, ECHR, US contracts. Cited over a thousand times. Foundational.

Saul (Equall.ai) - first open-weights LLM continued-pretrained on legal corpora. Proves the 'domain-pretrain an open base model' recipe works for law.

CUAD (Atticus Project) - 13,000 expert-labeled clauses across 510 contracts, 41 categories, CC BY 4.0. Almost every contract-AI product in the world trains or evaluates on CUAD whether they admit it or not.

Benchmarks

LegalBench - 162 legal reasoning tasks designed by lawyers, out of Stanford and Hazy Research. The benchmark frontier labs report on now. Replacing LexGLUE in practice.

LexGLUE - the older suite (ECtHR, SCOTUS, EUR-Lex, LEDGAR, CaseHOLD, UNFAIR-ToS). Still useful for comparing OSS models honestly.

Legal Benchmarks AI is the practitioner-facing version. Vendor-free. Their contract drafting benchmark put 14 tools and a bunch of human lawyers in the same scoring system, open methodology. You cannot improve what you cannot measure, and until recently nobody was measuring legal AI seriously.

AI applications, agents, and skills

Mike - the Hacker News darling. Rough code, real signal.

Lawvable / awesome-legal-skills is the one I keep coming back to. A curated registry of SKILL.md files written by actual practitioners from Clifford Chance, Baker McKenzie, and others. Drop one into Claude, Codex, Gemini CLI, or any tool that supports the format and you've taught it to do an EU AI Act classification, a GDPR breach assessment, an NDA triage, a référé assignation in French. Forty-plus skills the last time I checked, growing weekly. Closer to how legal knowledge actually moves between humans than any monolithic AI product I've seen.

Disclosure: HAQQ is a co-maintainer of awesome-legal-skills.
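For readers who have not opened a SKILL.md file, here is a minimal sketch of what one looks like and how a tool might read it. It assumes the common layout of a YAML front-matter block with `name` and `description` fields followed by Markdown instructions. The sample skill is invented, not taken from the registry, and `parse_skill` is a hypothetical helper that only handles flat `key: value` lines rather than full YAML.

```python
# Invented example skill; the content is illustrative only.
SAMPLE_SKILL = """---
name: nda-triage
description: Triage an incoming NDA and flag non-standard clauses.
---
# NDA triage

1. Identify the parties and the governing law.
2. Flag any clause that departs from the firm's standard positions.
"""


def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (front-matter fields, Markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing front-matter block")
    try:
        end = lines.index("---", 1)  # closing delimiter of the front matter
    except ValueError:
        raise ValueError("unterminated front-matter block") from None

    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {line!r}")
        meta[key.strip()] = value.strip()

    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


meta, body = parse_skill(SAMPLE_SKILL)
```

The format is just text: a short description a tool can index, plus instructions a practitioner can read, review, and version-control like any other document.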

lawskills-hub (Harvard LIL) is the institutional cousin. A community registry of agent skills for legal workflows, curated by Harvard's Library Innovation Lab. Same pattern, different trust signal. The fact that Harvard is putting its name on a skills registry tells you the format is going to stick.

anthropic-skills - Anthropic's official skill repo. Practitioners are forking subsets for legal work. This is where the SKILL.md standard came from in the first place. Anthropic also shipped a 'Claude for Legal' plugin in April 2026; their own legal team uses skills internally and has been pretty public about it.

Atticus Project - non-profit behind CUAD and a growing library of contract NLP tooling. The closest thing legal AI has to an academic standards body.

ContraxSuite (LexPredict) - open core contract analytics platform. Pre-LLM, AGPL-licensed, but the most complete OSS contract review pipeline ever built. Ages well as a baseline.

Harvard LIL's OLAW - open legal AI workbench for RAG research, integrating AI with legal APIs like CourtListener.

LangChain and LlamaIndex legal cookbooks aren't legal projects, but they're the plumbing every legal RAG demo runs on. Worth knowing where the SEC/EDGAR loaders, contract chunking patterns, and citation grounding helpers live.

LawGPT, ChatLaw, DISC-LawLLM, Lawyer LLaMA and a long tail of academic legal-LLM projects on Hugging Face. Most are research prototypes that don't survive contact with real practice - but a few of the Chinese-language ones (ChatLaw out of Peking University, DISC-LawLLM out of Fudan) are seriously good and underused outside of China.

Casetext-style RAG demos are now a category of their own - there's a small army of solo developers shipping 'Harvey clones in 200 lines' using LlamaIndex or LangChain on top of CourtListener. Most are toys. A few are quietly turning into real products.

GitHub topics like gpt-legal-chatbot, legal-rag, and legal-agent surface dozens of projects every month. Quality is wildly variable. Worth scanning if you're researching prior art before you build.

What HAQQ has shipped to the commons

I should be specific about this part.

Nomos is our open-source agent-native legal interface. Self-hostable. Skills-first. Designed to be the 'Cursor for legal' in the sense that the Legal AI Engine and the lawyer are both first-class users of the same workspace. We dogfood it internally for HAQQ work.

LegalMD is a Markdown dialect for legal documents - four typed primitives (`@party`, `@cite`, `@clause`, `@deadline`) with a TypeScript parser, a resolver that verifies citations against open legal data, two renderers (HTML and JSON), and a VS Code extension. MIT licensed. The thesis is that lawyers should not be writing contracts in DOCX in 2026 any more than developers should be writing code in Word. Early but shipping.
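To give a feel for what "typed primitives" buy you, here is a minimal sketch of extracting them from a document. The actual LegalMD syntax is not shown here, so the inline `@kind(value)` form below is an assumption, as are the names `Primitive` and `extract_primitives`. The sketch is in Python rather than TypeScript purely for brevity and is not the LegalMD parser.

```python
import re
from dataclasses import dataclass

# The four primitive kinds named in the text.
PRIMITIVES = ("party", "cite", "clause", "deadline")

# Assumed syntax: @kind(value). The real LegalMD grammar may differ.
_TOKEN = re.compile(
    r"@(?P<kind>" + "|".join(PRIMITIVES) + r")\((?P<value>[^()]*)\)"
)


@dataclass(frozen=True)
class Primitive:
    kind: str
    value: str
    line: int  # 1-based line number in the source document


def extract_primitives(source: str) -> list[Primitive]:
    """Return every typed primitive in the document, in reading order."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for m in _TOKEN.finditer(line):
            found.append(Primitive(m["kind"], m["value"].strip(), lineno))
    return found


doc = """@party(Acme Ltd) shall deliver the goods under @clause(4.2).
Payment is due by @deadline(2026-09-30), see @cite(Example v Example [2026] EWHC 1)."""
prims = extract_primitives(doc)
```

Once parties, citations, clauses, and deadlines are typed tokens rather than free text, downstream tools can do things DOCX makes painful: list every deadline in a contract, or hand every citation to a resolver for checking.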

Master Claude for Legal is a community skill pack. Five working starter skills (NDA triage, multi-party version diff, meeting brief, citation verifier, status synthesis), reference docs covering privilege architecture and MCP permission hardening, three templates (firm AI policy, client-facing data explainer, vendor security questionnaire). MIT. Built as the long version of Anthropic's Claude for Legal Teams webinar - twenty thousand registrations, fifty-one questions, half left unanswered.

awesome-legaltech is exactly what it sounds like - a curated list of legal tech projects, open source where possible, commercial where worth knowing. Contributions welcome.

awesome-legal-skills is the Lawvable registry mentioned above. We co-maintain.

Legal Data Hunter - our internal scraper layer for hunting statutes, regulations, and gazette publications across government sites in the jurisdictions we operate in. Parts of this are already public. More of it will be open-sourced over the next two quarters as we standardize the schema.

There is more coming. Some of it is in private repos until it's stable enough not to embarrass us. The principle is consistent: anything that is not core product differentiation gets pushed back to the commons. Parsers, schemas, scrapers, evaluation harnesses, skill registries - none of that is HAQQ's moat. Our moat is jurisdiction depth, firm-specific training, and the product layer that makes all of it usable for a working lawyer.

Document automation, A2J, e-discovery

Docassemble has been around forever and powers court self-help systems in multiple US states. If you've ever filled out a free court form online, there's a decent chance Docassemble was underneath it. Quiet, durable, built by a lawyer-programmer who just kept shipping.

Open Decision is the more modern web-native answer out of Berlin. MIT-licensed decision-tree builder for legal self-help. Active.

Suffolk LIT Lab's Document Assembly Line - built on Docassemble, automates court forms across US states. Open source, well-maintained, deeply unsexy, deeply important.

Aleph (OCCRP) - investigative document platform doing OCR, NER, cross-doc search at scale. Not legal-specific, but it's the closest OSS analog to Relativity or Nuix for investigations and litigation-adjacent work. Journalists use it. Plaintiffs' firms should too.

opensource.legal - OpenContracts for annotation, Docxodus for redlining DOCX in WebAssembly, Python-Redlines, CAML for marking up legal articles. Boring infrastructure, hugely valuable.

SALI Alliance LMSS taxonomy - not code, but an open ontology every serious legal-AI project eventually adopts. Worth knowing exists.

GitHub's Open Source Guides chapter on the legal side of open source isn't a product, but if you're shipping in this space and you don't understand the difference between MIT, Apache 2.0, and AGPL, read it before you push anything.

I've left out a dozen more. ROSS-style survivors, smart-contract experiments, niche regional scrapers, half-finished hackathon repos that quietly turned into infrastructure. The ecosystem is real now. Two years ago I could fit the entire OSS legal AI landscape on one slide. I can't anymore.

The thing nobody talks about: data

Here's the part that doesn't fit on the marketing site.

The bottleneck in legal AI is not models. It's not even tooling. It's data.

The work that actually matters to clients - memos, deal docs, discovery productions, settlement letters, internal advice, the redlines a senior partner makes at 2am - is privileged, confidential, or contractually locked inside firms. None of it can be released. None of it can be benchmarked publicly. None of it can be used to train a Legal AI Engine that another firm will see. The Heppner ruling, which I come back to below, made this even more explicit.

What's left in the open - published opinions, statutes, EDGAR filings, CUAD's 510 contracts - is a tiny, non-representative slice. Appellate. Public-company. English-language. US-skewed. That's why benchmarks like LegalBench feel narrow. That's why contract models overfit CUAD. That's why every serious legal AI vendor's real moat is a private data pipeline, not an architecture choice.

This creates a structural ceiling for open source. OSS can ship excellent plumbing (eyecite, Juriscraper, Docassemble, Aleph, Open Decision) and excellent public-data models (Legal-BERT, Saul, InLegalBERT). But it cannot close the loop on the work product that defines actual practice. You can't open source what you don't have access to.

So the interesting OSS frontier is not 'open GPT-for-law.' It's federated evaluation. Synthetic data generation. Privacy-preserving fine-tuning. Skill marketplaces that let firms keep their data private and share the behavior. That's the lane where open source still has real leverage in 2026, and it's the lane I'm most excited about.

The corollary, by the way, is that the public legal data layer matters more than ever. Every government that opens up its statutes and case law in machine-readable form expands the surface area where open source can compete on equal footing. The UK National Archives nailed it with LegalDocML. The EU got most of it right with EUR-Lex. Most other jurisdictions, including the ones HAQQ operates in, have not. We see this every day. Statutes locked in PDFs. Court rulings published once and never indexed. Gazettes that exist only as scanned images. Solo developers building scrapers like Legal Data Hunter to fix it one law at a time.

If you want to know where the real bottleneck is, it's there.

The arguments I keep seeing, and what I actually think

Every time one of these projects hits Hacker News or Reddit, the same arguments come up. Let me run through the ones I find interesting.

It's just a wrapper around an LLM.

Yes. So is Cursor. So is Harvey. So, frankly, is most of what's shipping right now in any AI vertical. The wrapper is where the product lives - citation accuracy, document handling, workflow orchestration, deployment model, security posture. Saying 'it's just a wrapper' is like saying 'your car is just a chassis around an engine.' Technically true. Completely missing the point.

Copyleft will kill enterprise adoption.

This one I actually agree with. Law firms are not going to GPL their internal stack. Anyone shipping open source legal AI should pick Apache 2.0 or MIT, full stop. The teams that get this will win adoption. The teams that don't will get loved on Twitter and ignored in procurement.

Big firms will never use open source for legal work.

They already do. Every law firm in the world runs on Linux, Postgres, and a hundred other open source pieces. The question isn't whether they'll use OSS - they're literally typing this comment into a Chromium browser. The question is whether they'll trust open source for the legal-specific layer. And the answer is: they will, but only when it's something they can self-host, audit, and lock down. Which is exactly what good open source enables.

AI breaks attorney-client privilege.

The Heppner ruling everyone keeps citing was about a public chatbot service. It was not about a firm running its own instance on its own servers with its own keys. Self-hosting is the privilege story. And open source is the only way most firms get to self-host without a Harvey-sized contract.

Where HAQQ actually stands

I should say the obvious thing: HAQQ is not open source. We're a venture-backed company building a commercial product. We have nine thousand eight hundred firms paying us across eighty-something countries. None of that is changing.

But our entire stack runs on open source. We use Postgres, Next.js, Python, Node, every model provider's SDKs, dozens of libraries we never paid for and never could have built. We publish our own AI skills and libraries back. We send PRs upstream. We sponsor projects when we can.

And here's the thing I keep telling our team: more open source in legal tech is good for us, not bad for us. It raises the floor. When CourtListener exists, when LegalBench exists, when Lawvable exists, every builder in this space - including us - can compete on what actually matters. Which is whether the product solves a real problem for a real lawyer.

If Mike or Docassemble or OpenContracts helps a solo practitioner anywhere serve their clients better, that's a win. We're not in the business of preventing other people from building. We're in the business of making sure the five billion people who can't afford legal help actually get some.

The walled garden era of legal tech is ending. It was never going to last.

What I'd watch for next

A few predictions, low confidence on timing, high confidence on direction:

The licensing question gets sorted. Apache 2.0 wins in legal. Anyone shipping AGPL or copyleft will struggle with enterprise procurement until they relicense.
Self-hosting becomes table stakes. The privilege argument is going to push Am Law firms toward private deployments fast. Projects with clean Docker images and sane infra will eat a lot of Harvey's lunch.
Open legal data goes global. The US has CourtListener. Most other jurisdictions have nothing comparable. Whoever opens up court data in Brazil, India, Indonesia, Nigeria - those people are going to shape the next decade. We're working on our piece of this.
Skills become the unit of legal knowledge. Lawvable is early but right. Modular, practitioner-authored, version-controlled AI skills are closer to how lawyers actually share knowledge than fine-tuned monoliths. We're rebuilding parts of our own product around this idea.
The Mike moment will repeat. Every couple of weeks now, somebody will ship a weekend project that gets too much attention and not enough scrutiny. Most won't matter. A few will. Pay attention to which ones the actual lawyers (not just the engineers) keep coming back to.

The thing I keep coming back to is that open source in legal tech is not a threat to lawyers. It's not a threat to good legal AI products either. It's the infrastructure layer that makes both better.

The builders are finally showing up. The data is opening up. The tools are getting real.

If you're building something in this space - open source or not - I'd love to hear about it. Reach me at [email protected].

FAQ

What is open source legal software?

Open source legal software is legal technology released under a permissive or copyleft license, so anyone can inspect, modify and self-host the code. Examples in 2026 include CourtListener (case law search), Docassemble (document automation), Aleph (investigative document analysis), LegalBench (LLM benchmarking) and CUAD (contract understanding dataset).

What are the best open source legal AI projects in 2026?

The most active open source legal AI projects in 2026 include LegalBench and LexGLUE for benchmarking, CUAD and ContractNLI for contract datasets, CourtListener and Caselaw Access Project for case law, Docassemble for document automation, and Aleph for legal investigations. HAQQ contributes to the commons with Nomos, LegalMD and Master Claude for Legal.

Is HAQQ open source?

HAQQ's core legal AI engine is proprietary, but HAQQ has open sourced several supporting tools and datasets under permissive licenses to advance the legal AI commons - including Nomos (an agent-native legal interface), LegalMD (a Markdown dialect for legal documents) and Master Claude for Legal (a community skill pack for legal work).

Open source vs proprietary legal AI - which should a law firm choose?

Open source legal software is the right choice when the firm has engineering capacity, needs full control over data residency and audit, or wants to extend the platform. Proprietary legal AI like HAQQ is the right choice when the firm needs production-grade reliability, security compliance, multi-jurisdictional coverage and ongoing model improvements without maintaining infrastructure.

Can I self-host open source legal AI?

Yes. Projects like Docassemble, Aleph and CourtListener are designed for self-hosting. For LLM-driven legal AI, self-hosting is technically possible with open models but requires significant infrastructure, evaluation and security work - which is why most firms in 2026 pick a managed legal AI platform.