Open Source Legal Software in 2026: The Full Landscape and HAQQ's Contributions
A complete map of open source legal software in 2026 - CourtListener, Docassemble, Aleph, LegalBench, CUAD - plus what HAQQ has shipped to the legal commons with Nomos, LegalMD and Master Claude for Legal.
Mike is an open source clone of Harvey and Legora. Self-hostable, bring your own API key, no per-seat pricing. The code itself is rough - someone in the Hacker News comments correctly pointed out it's basically a Supabase auth call and five database tables. But that's not really the point.
The point is the reaction.
Hundreds of comments. Reddit threads. LinkedIn debates. Lawyers asking why they couldn't just have their associate spin up something similar in a weekend. Builders asking why this hadn't happened five years ago.
I read the whole thing twice. And the more I read, the more it felt like a moment. Not because Mike itself is going to disrupt anything - it probably won't. But because legal tech has finally caught the open source bug, and once that starts, you can't put it back.
Why legal was the last vertical to get here
Every other industry got open source years ago. Healthcare has OpenMRS. Fintech has Hyperledger. E-commerce has Magento. Even the boring corners of enterprise have their thing.
Legal had basically nothing. And the reasons were never about technology.
The first reason is that law firms make money by being inefficient. Sorry - I know that sounds harsh. But the billable hour creates a perverse incentive: if you automate a 10-hour task down to 1 hour, you just deleted 9 hours of revenue. So why would any partner contribute to a project that does that?
The second reason is the secret sauce thing. Firms guard their brief banks and templates like trade secrets, because they kind of are. You can't open source your litigation strategy when you might be using it against the firm down the street next month.
The third reason is licensing fear. Bar associations don't move fast. Compliance teams panic at GPL. Most legal counsel reading the words 'open source' picture a teenager in a hoodie stealing client data, not a Linux kernel maintainer.
And the fourth - the one nobody talks about enough - is that Thomson Reuters and LexisNexis built their moats around data, not software. KeyCite and Shepard's are citators backed by editorial work that took decades to build. Replicating them costs hundreds of millions. So even if you wanted to ship an open source legal stack, the data layer underneath was locked away.
That's the world we've been working in. It's also the world that's starting to crack.
What's actually being built right now
I've been keeping a running list. Some of these are years old and finally getting attention. Some shipped this month. The space is much bigger than most people realize. Let me try to organize it.
The data layer - the stuff everything else stands on
**[Free Law Project](https://free.law)** is the most underrated organization in legal tech. They run CourtListener (250 million pages of US court data, free), RECAP (a browser extension that pulls federal filings out of PACER and into the public domain), eyecite (the de facto US citation parser), and Juriscraper (Python scrapers for hundreds of US courts). 138 repos. Most legal AI startups train on their data and don't credit them. They should.
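To make "citation parser" concrete, here is a deliberately simplified Python sketch of pulling full `volume reporter page` citations out of text. It is not eyecite's API or algorithm: the tiny `REPORTERS` list and the `find_citations` helper are assumptions made up for illustration. A real parser also has to handle short forms, *id.* and *supra* references, and thousands of reporter variants.

```python
import re
from typing import NamedTuple

# Tiny, hypothetical reporter list; real parsers know thousands of variants.
REPORTERS = ["U.S.", "S. Ct.", "F.2d", "F.3d", "F. Supp. 2d"]


class Citation(NamedTuple):
    volume: int
    reporter: str
    page: int


# Longest reporters first so "F. Supp. 2d" is tried before shorter alternatives.
_reporter_alt = "|".join(
    re.escape(r) for r in sorted(REPORTERS, key=len, reverse=True)
)
CITATION_RE = re.compile(rf"\b(\d{{1,4}})\s+({_reporter_alt})\s+(\d{{1,5}})\b")


def find_citations(text: str) -> list[Citation]:
    """Return full 'volume reporter page' citations found in text, in order."""
    return [Citation(int(v), r, int(p)) for v, r, p in CITATION_RE.findall(text)]


sample = "See Brown v. Board of Education, 347 U.S. 483 (1954); cf. 410 U.S. 113."
for citation in find_citations(sample):
    print(citation)
```

Even this toy version shows why the hard part is the reporter table rather than the regex, and that table is exactly what a maintained library provides.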
**[Harvard's Caselaw Access Project](https://case.law)** digitized 360 years of US case law. 6.9 million cases, fully open since 2024. If you're building anything that needs American legal precedent, that's where you start.
**[Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law)** - 256 GB of legal text across 35 sub-corpora, hosted on Hugging Face. The closest thing to 'The Pile' for law. Nearly every open legal LLM trains on a slice of it.
**[Find Case Law (UK National Archives)](https://caselaw.nationalarchives.gov.uk)** - UK judgments published as machine-readable LegalDocML XML, with Atom feeds. This is the gold standard schema. Other countries should copy it.
**[EUR-Lex / Cellar](https://eur-lex.europa.eu)** - All EU legislation and CJEU case law, with a SPARQL endpoint. Probably the most structured open legal corpus on Earth. Underused outside academia.
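For readers who have never touched a SPARQL endpoint, the sketch below only builds a query URL; it makes no network call. The endpoint address, the `cdm:` ontology prefix, the `resource_legal_id_celex` property, and the `format` parameter are assumptions based on how Cellar is commonly queried, so check them against the Publications Office documentation before relying on them. `celex_lookup_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint and ontology prefix; verify both against the Cellar docs.
SPARQL_ENDPOINT = "https://publications.europa.eu/webapi/rdf/sparql"
CDM = "http://publications.europa.eu/ontology/cdm#"


def celex_lookup_url(celex: str) -> str:
    """Build a GET URL asking which works carry the given CELEX identifier."""
    # Keep the sketch safe: only plain alphanumeric CELEX numbers are accepted,
    # so nothing unexpected can be spliced into the query string.
    if not celex.isalnum():
        raise ValueError(f"unsupported CELEX number: {celex!r}")
    query = f"""
    PREFIX cdm: <{CDM}>
    SELECT ?work WHERE {{
      ?work cdm:resource_legal_id_celex ?celex .
      FILTER(STR(?celex) = "{celex}")
    }} LIMIT 5
    """
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{SPARQL_ENDPOINT}?{urlencode(params)}"


url = celex_lookup_url("32016R0679")  # CELEX number of the GDPR
print(url)
```

Fetching that URL with any HTTP client should return JSON bindings if the assumptions hold. The point is that one identifier is enough to address a piece of EU law programmatically.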
**[OpenLegalData](https://openlegaldata.io)** is the German equivalent - free German court decisions, normalized across fragmented official portals.
**[Indian Legal Corpus / InLegalBERT](https://huggingface.co/law-ai/InLegalBERT)** out of IIT Kharagpur covers Indian Supreme and High Court judgments. Most jurisdictions outside the US are critically under-served, and India is one of the few with serious open corpus work.
**Brazil** has community-built wrappers around the CNJ DataJud API exposing 100M+ case records - community-maintained, fragile, important. Same pattern: technically public, practically unscrapable, until someone open-sources the bridge.
**Legal Data Hunter** is a small example of the long tail here - a Scrapy + FastAPI project that hunts statutes and gazette publications across government sites and normalizes them. Not a flagship, but emblematic. Legal AI runs on hundreds of solo-maintained scrapers like this. They are the unsexy backbone nobody funds.
NLP libraries and open weights
**[Blackstone](https://github.com/ICLRandD/Blackstone)** - spaCy pipeline for UK and Commonwealth legal text. Rare non-US legal NER.
**[LexNLP](https://github.com/LexPredict/lexpredict-lexnlp)** - Python library for extracting legal entities, citations, durations, money, parties. Pre-LLM but still vendored inside half the commercial tools you've heard of.
**[Legal-BERT (nlpaueb)](https://huggingface.co/nlpaueb/legal-bert-base-uncased)** - BERT pretrained on EU legislation, ECHR, US contracts. Cited over a thousand times. Foundational.
**[Saul (Equall.ai)](https://huggingface.co/Equall/Saul-7B-Instruct-v1)** - first open-weights LLM built by continued pretraining on legal corpora, starting from Mistral-7B. Proves the 'domain-pretrain an open base model' recipe works for law.
**[CUAD (Atticus Project)](https://www.atticusprojectai.org/cuad)** - 13,000 expert-labeled clauses across 510 contracts, 41 categories, CC BY 4.0. Almost every contract-AI product in the world trains or evaluates on CUAD whether they admit it or not.
Benchmarks
**[LegalBench](https://github.com/HazyResearch/legalbench)** - 162 legal reasoning tasks designed by lawyers, out of Stanford and Hazy Research. The benchmark frontier labs report on now. Replacing LexGLUE in practice.
**[LexGLUE](https://github.com/coastalcph/lex-glue)** - the older suite (ECtHR, SCOTUS, EUR-Lex, LEDGAR, CaseHOLD, UNFAIR-ToS). Still useful for comparing OSS models honestly.
**[Legal Benchmarks AI](https://www.legalbenchmarks.ai)** is the practitioner-facing version. Vendor-free. Their contract drafting benchmark put 14 tools and a bunch of human lawyers in the same scoring system, open methodology. You cannot improve what you cannot measure, and until recently nobody was measuring legal AI seriously.
AI applications, agents, and skills
**[Mike](https://mikeoss.com)** - the Hacker News darling. Rough code, real signal.
**[Lawvable / awesome-legal-skills](https://github.com/lawvable/awesome-legal-skills)** is the one I keep coming back to. A curated registry of SKILL.md files written by actual practitioners from Clifford Chance, Baker McKenzie, and others. Drop one into Claude, Codex, Gemini CLI, or any tool that supports the format and you've taught it to do an EU AI Act classification, a GDPR breach assessment, an NDA triage, a *référé* assignation in French. Forty-plus skills the last time I checked, growing weekly. Closer to how legal knowledge actually moves between humans than any monolithic AI product I've seen.
*Disclosure: HAQQ is a co-maintainer of awesome-legal-skills.*
**[lawskills-hub (Harvard LIL)](https://github.com/harvard-lil/lawskills-hub)** is the institutional cousin. A community registry of agent skills for legal workflows, curated by Harvard's Library Innovation Lab. Same pattern, different trust signal. The fact that Harvard is putting its name on a skills registry tells you the format is going to stick.
**[anthropic-skills](https://github.com/anthropics/skills)** - Anthropic's official skill repo. Practitioners are forking subsets for legal work. This is where the SKILL.md standard came from in the first place. Anthropic also shipped a 'Claude for Legal' plugin in April 2026; their own legal team uses skills internally and has been pretty public about it.
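If you have not opened a SKILL.md file, it is a Markdown document with a short front-matter header (a `name` and a `description`) followed by plain-language instructions. The sketch below splits one into metadata and body using only the standard library. `parse_skill` is a hypothetical helper, it handles only flat `key: value` lines rather than full YAML, and the example skill is invented for illustration, not taken from any registry.

```python
def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md document into (front-matter fields, Markdown body).

    Handles only flat 'key: value' lines, not full YAML.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' front-matter delimiter")
    try:
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        raise ValueError("missing closing '---' front-matter delimiter") from None

    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Hypothetical skill, invented for this example.
EXAMPLE = """---
name: nda-triage
description: Classify an incoming NDA as standard, negotiable, or escalate.
---
# NDA triage

1. Identify the parties and whether the NDA is mutual.
2. Flag any non-standard term or governing law.
"""

meta, body = parse_skill(EXAMPLE)
print(meta["name"], "-", meta["description"])
```

The format is this small on purpose: a practitioner can write and review one without touching code, which is a large part of why registries of them can grow quickly.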
**[Atticus Project](https://www.atticusprojectai.org)** - non-profit behind CUAD and a growing library of contract NLP tooling. The closest thing legal AI has to an academic standards body.
**[ContraxSuite (LexPredict)](https://github.com/LexPredict/lexpredict-contraxsuite)** - open core contract analytics platform. Pre-LLM, AGPL-licensed, but the most complete OSS contract review pipeline ever built. Ages well as a baseline.
**Harvard LIL's OLAW** - open legal AI workbench for RAG research, integrating AI with legal APIs like CourtListener.
**[LangChain](https://github.com/langchain-ai/langchain)** and **[LlamaIndex](https://github.com/run-llama/llama_index)** legal cookbooks aren't legal projects, but they're the plumbing every legal RAG demo runs on. Worth knowing where the SEC/EDGAR loaders, contract chunking patterns, and citation grounding helpers live.
**LawGPT, ChatLaw, DISC-LawLLM, Lawyer LLaMA** and a long tail of academic legal-LLM projects on Hugging Face. Most are research prototypes that don't survive contact with real practice - but a few of the Chinese-language ones (ChatLaw out of Peking University, DISC-LawLLM out of Fudan) are seriously good and underused outside of China.
**Casetext-style RAG demos** are now a category of their own - there's a small army of solo developers shipping 'Harvey clones in 200 lines' using LlamaIndex or LangChain on top of CourtListener. Most are toys. A few are quietly turning into real products.
GitHub topics like **gpt-legal-chatbot**, **legal-rag**, and **legal-agent** surface dozens of projects every month. Quality is wildly variable. Worth scanning if you're researching prior art before you build.
What HAQQ has shipped to the commons
I should be specific about this part.
**[Nomos](https://github.com/sboghossian/nomos)** is our open-source agent-native legal interface. Self-hostable. Skills-first. Designed to be the 'Cursor for legal' in the sense that the AI agent and the lawyer are both first-class users of the same workspace. We dogfood it internally for HAQQ work.
**[LegalMD](https://github.com/sboghossian/legalmd)** is a Markdown dialect for legal documents - four typed primitives (`@party`, `@cite`, `@clause`, `@deadline`) with a TypeScript parser, a resolver that verifies citations against open legal data, two renderers (HTML and JSON), and a VS Code extension. MIT licensed. The thesis is that lawyers should not be writing contracts in DOCX in 2026 any more than developers should be writing code in Word. Early but shipping.
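To show what "typed primitives" buys you, here is an illustrative Python sketch that extracts the four primitives from a document and rejects unknown ones. Two loud assumptions: the inline `@kind{value}` syntax is invented for this example and may not match LegalMD's actual grammar, and the real parser is written in TypeScript, not Python. `Primitive` and `extract_primitives` are hypothetical names.

```python
import re
from dataclasses import dataclass

PRIMITIVES = {"party", "cite", "clause", "deadline"}  # the four named in the text

# Hypothetical inline syntax '@kind{value}'; LegalMD's actual grammar may differ.
TOKEN_RE = re.compile(r"@([a-z]+)\{([^{}]*)\}")


@dataclass(frozen=True)
class Primitive:
    kind: str
    value: str
    line: int  # 1-based line number, useful for editor diagnostics


def extract_primitives(source: str) -> list[Primitive]:
    """Collect typed primitives in document order, rejecting any unknown '@kind'."""
    found: list[Primitive] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for kind, value in TOKEN_RE.findall(line):
            if kind not in PRIMITIVES:
                raise ValueError(f"line {lineno}: unknown primitive @{kind}")
            found.append(Primitive(kind, value.strip(), lineno))
    return found


doc = """\
This agreement is between @party{Acme Ltd} and @party{Birch LLP}.
@clause{Data Processing} follows @cite{Regulation (EU) 2016/679, art. 28}.
Notice must be served by @deadline{2026-09-30}.
"""

for p in extract_primitives(doc):
    print(p.line, p.kind, p.value)
```

Once parties, citations, clauses, and deadlines are structured data rather than styled text, steps like verifying a citation or exporting deadlines to a calendar become ordinary functions over that list.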
**[Master Claude for Legal](https://github.com/sboghossian/master-claude-for-legal)** is a community skill pack. Five working starter skills (NDA triage, multi-party version diff, meeting brief, citation verifier, status synthesis), reference docs covering privilege architecture and MCP permission hardening, three templates (firm AI policy, client-facing data explainer, vendor security questionnaire). MIT. Built as the long version of Anthropic's *Claude for Legal Teams* webinar - twenty thousand registrations, fifty-one questions, half left unanswered.
**[awesome-legaltech](https://github.com/sboghossian/awesome-legaltech)** is exactly what it sounds like - a curated list of legal tech projects, open source where possible, commercial where worth knowing. Contributions welcome.
**[awesome-legal-skills](https://github.com/lawvable/awesome-legal-skills)** is the Lawvable registry mentioned above. We co-maintain.
**Legal Data Hunter** - our internal scraper layer for hunting statutes, regulations, and gazette publications across government sites in the jurisdictions we operate in. Parts of this are already public. More of it will be open-sourced over the next two quarters as we standardize the schema.
There is more coming. Some of it is in private repos until it's stable enough not to embarrass us. The principle is consistent: anything that is not core product differentiation gets pushed back to the commons. Parsers, schemas, scrapers, evaluation harnesses, skill registries - none of that is HAQQ's moat. Our moat is jurisdiction depth, firm-specific training, and the product layer that makes all of it usable for a working lawyer.
Document automation, A2J, e-discovery
**[Docassemble](https://docassemble.org)** has been around forever and powers court self-help systems in multiple US states. If you've ever filled out a free court form online, there's a decent chance Docassemble was underneath it. Quiet, durable, built by a lawyer-programmer who just kept shipping.
**[Open Decision](https://open-decision.org)** is the more modern web-native answer out of Berlin. MIT-licensed decision-tree builder for legal self-help. Active.
**[Suffolk LIT Lab's Document Assembly Line](https://suffolklitlab.org/docassemble-AssemblyLine-documentation/)** - built on Docassemble, automates court forms across US states. Open source, well-maintained, deeply unsexy, deeply important.
**[Aleph (OCCRP)](https://github.com/alephdata/aleph)** - investigative document platform doing OCR, NER, cross-doc search at scale. Not legal-specific, but it's the closest OSS analog to Relativity or Nuix for investigations and litigation-adjacent work. Journalists use it. Plaintiffs' firms should too.
**[opensource.legal](https://opensource.legal)** - OpenContracts for annotation, Docxodus for redlining DOCX in WebAssembly, Python-Redlines, CAML for marking up legal articles. Boring infrastructure, hugely valuable.
**[SALI Alliance LMSS taxonomy](https://www.sali.org)** - not code, but an open ontology every serious legal-AI project eventually adopts. Worth knowing exists.
**[GitHub's Open Source Guide on legal stuff](https://opensource.guide/legal/)** isn't a product, but if you're shipping in this space and you don't understand the difference between MIT, Apache 2.0, and AGPL, read it before you push anything.
I've left out a dozen more. ROSS-style survivors, smart-contract experiments, niche regional scrapers, half-finished hackathon repos that quietly turned into infrastructure. The ecosystem is real now. Two years ago I could fit the entire OSS legal AI landscape on one slide. I can't anymore.
The thing nobody talks about: data
Here's the part that doesn't fit on the marketing site.
The bottleneck in legal AI is not models. It's not even tooling. It's data.
The work that actually matters to clients - memos, deal docs, discovery productions, settlement letters, internal advice, the redlines a senior partner makes at 2am - is privileged, confidential, or contractually locked inside firms. None of it can be released. None of it can be benchmarked publicly. None of it can be used to train a model that another firm will see. The Heppner ruling (more on it below) just made this even more explicit.
What's left in the open - published opinions, statutes, EDGAR filings, CUAD's 510 contracts - is a tiny, non-representative slice. Appellate. Public-company. English-language. US-skewed. That's why benchmarks like LegalBench feel narrow. That's why contract models overfit CUAD. That's why every serious legal AI vendor's real moat is a private data pipeline, not an architecture choice.
This creates a structural ceiling for open source. OSS can ship excellent plumbing (eyecite, Juriscraper, Docassemble, Aleph, Open Decision) and excellent public-data models (Legal-BERT, Saul, InLegalBERT). But it cannot close the loop on the work product that defines actual practice. You can't open source what you don't have access to.
So the interesting OSS frontier is not 'open GPT-for-law.' It's federated evaluation. Synthetic data generation. Privacy-preserving fine-tuning. Skill marketplaces that let firms keep their data private and share the behavior. That's the lane where open source still has real leverage in 2026, and it's the lane I'm most excited about.
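A minimal sketch of what "federated evaluation" could look like in practice: each firm scores a shared model or skill against its own private, labeled examples and reports only aggregate counts, which a coordinator then pools. Everything here is an assumption for illustration (the `LocalReport` shape, the `min_total` threshold, the toy classifier). Sharing counts alone is not a formal privacy guarantee, since small samples can still leak information.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LocalReport:
    """The only thing a firm shares: aggregate counts, never documents."""
    firm: str
    correct: int
    total: int


def evaluate_locally(
    firm: str,
    model: Callable[[str], str],
    private_examples: list[tuple[str, str]],
) -> LocalReport:
    """Runs inside the firm's own environment on its own labeled examples."""
    correct = sum(model(text) == label for text, label in private_examples)
    return LocalReport(firm, correct, len(private_examples))


def pooled_accuracy(reports: list[LocalReport], min_total: int = 3) -> float:
    """Coordinator side: pool counts, skipping reports too small to share safely."""
    usable = [r for r in reports if r.total >= min_total]
    total = sum(r.total for r in usable)
    if total == 0:
        raise ValueError("no usable reports")
    return sum(r.correct for r in usable) / total


# Toy stand-in for a shared skill or model (hypothetical).
def toy_model(text: str) -> str:
    return "nda" if "confidential" in text.lower() else "other"


firm_a = [
    ("Confidential Information means...", "nda"),
    ("Lease of premises", "other"),
    ("Mutual confidentiality", "nda"),
]
firm_b = [
    ("Share purchase agreement", "other"),
    ("CONFIDENTIAL disclosure terms", "nda"),
    ("Non-disclosure undertaking", "nda"),
]

reports = [
    evaluate_locally("A", toy_model, firm_a),
    evaluate_locally("B", toy_model, firm_b),
]
print(f"pooled accuracy: {pooled_accuracy(reports):.2f}")
```

The documents never leave `evaluate_locally`. What travels is the behavior being tested and a pair of integers per firm, which is the split between private data and shared behavior described above.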
The corollary, by the way, is that the public legal data layer matters more than ever. Every government that opens up its statutes and case law in machine-readable form expands the surface area where open source can compete on equal footing. The UK National Archives nailed it with LegalDocML. The EU got most of it right with EUR-Lex. Most other jurisdictions, including the ones HAQQ operates in, have not. We see this every day. Statutes locked in PDFs. Court rulings published once and never indexed. Gazettes that exist only as scanned images. Solo developers building scrapers like Legal Data Hunter to fix it one law at a time.
If you want to know where the real bottleneck is, it's there.
The arguments I keep seeing, and what I actually think
Every time one of these projects hits Hacker News or Reddit, the same arguments come up. Let me run through the ones I find interesting.
It's just a wrapper around an LLM.
Yes. So is Cursor. So is Harvey. So, frankly, is most of what's shipping right now in any AI vertical. The wrapper is where the product lives - citation accuracy, document handling, workflow orchestration, deployment model, security posture. Saying 'it's just a wrapper' is like saying 'your car is just a chassis around an engine.' Technically true. Completely missing the point.
Copyleft will kill enterprise adoption.
This one I actually agree with. Law firms are not going to GPL their internal stack. Anyone shipping open source legal AI should pick Apache 2.0 or MIT, full stop. The teams that get this will win adoption. The teams that don't will get loved on Twitter and ignored in procurement.
Big firms will never use open source for legal work.
They already do. Every law firm in the world runs on Linux, Postgres, and a hundred other open source pieces. The question isn't whether they'll use OSS - whoever posted that comment was probably typing it into a Chromium-based browser. The question is whether they'll trust open source for the legal-specific layer. And the answer is: they will, but only when it's something they can self-host, audit, and lock down. Which is exactly what good open source enables.
AI breaks attorney-client privilege.
The Heppner ruling everyone keeps citing was about a public chatbot service. It was not about a firm running its own instance on its own servers with its own keys. Self-hosting is the privilege story. And open source is the only way most firms get to self-host without a Harvey-sized contract.
Where HAQQ actually stands
I should say the obvious thing: HAQQ is not open source. We're a venture-backed company building a commercial product. We have nine thousand eight hundred firms paying us across eighty-something countries. None of that is changing.
But our entire stack runs on open source. We use Postgres, Next.js, Python, Node, every model provider's SDKs, dozens of libraries we never paid for and never could have built. We publish our own AI skills and libraries back. We send PRs upstream. We sponsor projects when we can.
And here's the thing I keep telling our team: more open source in legal tech is good for us, not bad for us. It raises the floor. When CourtListener exists, when LegalBench exists, when Lawvable exists, every builder in this space - including us - can compete on what actually matters. Which is whether the product solves a real problem for a real lawyer.
If Mike or Docassemble or OpenContracts helps a solo practitioner anywhere serve their clients better, that's a win. We're not in the business of preventing other people from building. We're in the business of making sure the five billion people who can't afford legal help actually get some.
The walled garden era of legal tech is ending. It was never going to last.
What I'd watch for next
A few predictions, low confidence on timing, high confidence on direction:
- The licensing question gets sorted. Apache 2.0 wins in legal. Anyone shipping AGPL or any other copyleft license will struggle with enterprise procurement until they relicense.
- Self-hosting becomes table stakes. The privilege argument is going to push Am Law firms toward private deployments fast. Projects with clean Docker images and sane infra will eat a lot of Harvey's lunch.
- Open legal data goes global. The US has CourtListener. Most other jurisdictions have nothing comparable. Whoever opens up court data in Brazil, India, Indonesia, Nigeria - those people are going to shape the next decade. We're working on our piece of this.
- Skills become the unit of legal knowledge. Lawvable is early but right. Modular, practitioner-authored, version-controlled AI skills are closer to how lawyers actually share knowledge than fine-tuned monoliths. We're rebuilding parts of our own product around this idea.
- The Mike moment will repeat. Every couple of weeks now, somebody will ship a weekend project that gets too much attention and not enough scrutiny. Most won't matter. A few will. Pay attention to which ones the actual lawyers (not just the engineers) keep coming back to.
The thing I keep coming back to is that open source in legal tech is not a threat to lawyers. It's not a threat to good legal AI products either. It's the infrastructure layer that makes both better.
The builders are finally showing up. The data is opening up. The tools are getting real.
If you're building something in this space - open source or not - I'd love to hear about it. Reach me at stephane@haqq.ai.