Docling Python: A Practical Guide to Processing Files in 2026
Parse PDFs, Word and scanned files into clean Markdown with Docling, the open-source Python library. Install steps, chunking and RAG-ready code examples.
Working with documents in different formats is a common challenge when building AI applications. Whether you're processing PDFs, Word documents, or HTML files, extracting clean, structured text can be surprisingly difficult. Docling is a Python library that makes this process straightforward.
This guide walks you through the essentials of using Docling to process documents, with a focus on practical examples and best practices you can apply immediately.
Why Docling?
Docling solves common document processing problems in a unified way. It provides multi-format support that works seamlessly with PDFs, Word documents, PowerPoint presentations, HTML, and more. The library includes OCR capabilities that can extract text even from scanned documents and images, making it versatile for various document types.
What sets Docling apart is its smart chunking feature that breaks documents into meaningful pieces while preserving context, rather than arbitrarily splitting text. The output is clean and structured, whether you need markdown or plain text format. Best of all, Docling offers a simple, intuitive API that's easy to get started with, even for developers new to document processing.
Getting Started
First, install Docling:
pip install docling
Basic Usage
Converting a Document
The simplest way to use Docling is with the DocumentConverter:
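A minimal sketch of that call is shown here. The helper name `convert_to_markdown` and the optional `converter` argument are illustrative additions, not part of Docling; the Docling import is done lazily so the helper can also be exercised with a stand-in object.

```python
def convert_to_markdown(source, converter=None):
    """Convert one document (local path or URL) and return its Markdown text."""
    if converter is None:
        # Imported lazily so this module loads even where Docling is not installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # convert() returns a result object whose .document holds the parsed content.
    result = converter.convert(str(source))
    return result.document.export_to_markdown()
```

A call such as `print(convert_to_markdown("report.pdf"))` would then print the converted text, where `report.pdf` is a placeholder file name.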
That's it! Docling automatically detects the file format and processes it accordingly.
Working with Different File Sources
Docling can process both local files and remote URLs:
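As a rough illustration, the same `convert()` call can be given either a path or a URL. The helpers `is_remote` and `convert_many` below are hypothetical names introduced for this sketch; only `DocumentConverter`, `convert()` and `export_to_markdown()` come from Docling.

```python
from urllib.parse import urlparse


def is_remote(source):
    """Return True when the source looks like an http(s) URL."""
    return urlparse(str(source)).scheme in ("http", "https")


def convert_many(sources, converter=None):
    """Convert a mix of local paths and URLs, reusing a single converter."""
    if converter is None:
        # Lazy import: Docling is only needed when no converter is supplied.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    output = {}
    for source in sources:
        # convert() takes paths and URLs alike, so no branching is needed here.
        result = converter.convert(str(source))
        kind = "url" if is_remote(source) else "file"
        output[str(source)] = (kind, result.document.export_to_markdown())
    return output
```

For example, `convert_many(["docs/report.pdf", "https://arxiv.org/pdf/2408.09869"])` would process one local file and one remote PDF; both values are sample inputs.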
What Formats Are Supported?
Docling works with many common file formats out of the box. It handles PDF files, including scanned documents using OCR technology. Microsoft Office formats like Word (.docx) and PowerPoint (.pptx) are fully supported, as are web formats such as HTML. You can also process Markdown files, plain text documents, and even image files (such as PNG, JPEG and TIFF) using its built-in OCR capabilities.
The DocumentConverter automatically detects the file format and applies the appropriate processing method, so you don't need to worry about specifying the type explicitly.
Chunking Documents
For many AI applications, you need to split documents into smaller pieces ("chunks"). Docling's HybridChunker does this by following the document's structure while keeping each chunk within a token limit.
Basic Chunking
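A minimal chunking sketch might look like the following. The wrapper `chunk_document` is an illustrative name; it assumes `dl_doc` is the `.document` of a conversion result, and that a default `HybridChunker()` is acceptable (its default tokenizer may be downloaded on first use).

```python
def chunk_document(dl_doc, chunker=None):
    """Split a converted Docling document into a list of chunk texts."""
    if chunker is None:
        # Lazy import so the helper can be tested with a stand-in chunker.
        from docling.chunking import HybridChunker

        chunker = HybridChunker()

    # chunk() yields chunk objects; .text holds the chunk's plain text.
    return [chunk.text for chunk in chunker.chunk(dl_doc=dl_doc)]
```

Typical use would be `chunks = chunk_document(result.document)` after a successful conversion.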
Why Use HybridChunker?
The HybridChunker goes beyond splitting by character or word count. It starts from the document's own structure and then adjusts chunk sizes to a token limit, which helps keep related text together and preserves semantic meaning. In short, it offers:
- Preserves natural document structures like paragraphs and sections
- Token-aware chunking that respects embedding model limits
- Configurable chunk sizes based on your specific needs
- Keeps metadata that records where each chunk originated
Working with Metadata
Docling extracts useful metadata from documents:
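One way to collect a few of these fields is sketched below. It assumes the converted document exposes `name`, `origin` (with `filename` and `mimetype`), `pages`, `tables`, `pictures` and `texts`; `getattr` is used defensively because the exact attributes can vary between Docling versions. The function name `summarize_document` is illustrative.

```python
def summarize_document(doc):
    """Collect basic metadata from a converted Docling document."""
    origin = getattr(doc, "origin", None)  # may be None for some inputs
    return {
        "name": getattr(doc, "name", None),
        "filename": getattr(origin, "filename", None),
        "mimetype": getattr(origin, "mimetype", None),
        # Missing or empty collections are counted as zero.
        "pages": len(getattr(doc, "pages", None) or {}),
        "tables": len(getattr(doc, "tables", None) or []),
        "pictures": len(getattr(doc, "pictures", None) or []),
        "text_items": len(getattr(doc, "texts", None) or []),
    }
```

Calling `summarize_document(result.document)` gives a small dictionary that can be stored next to the extracted text.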
Putting It All Together
Here's a complete example that processes a document and prepares it for use in an AI application:
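The sketch below strings conversion and chunking together and returns plain dictionaries that a vector store or prompt builder could consume. The record layout (`id`, `text`, `headings`, `source`) and the name `prepare_for_rag` are assumptions made for illustration, not a Docling convention.

```python
def prepare_for_rag(source, converter=None, chunker=None):
    """Convert a document, chunk it, and return one record per chunk."""
    if converter is None:
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    if chunker is None:
        from docling.chunking import HybridChunker

        chunker = HybridChunker()

    doc = converter.convert(str(source)).document

    records = []
    for index, chunk in enumerate(chunker.chunk(dl_doc=doc)):
        # Headings give the section context of a chunk; they may be absent.
        headings = getattr(chunk.meta, "headings", None) or []
        records.append(
            {
                "id": f"{source}#{index}",
                "text": chunk.text,
                "headings": list(headings),
                "source": str(source),
            }
        )
    return records
```

Each record keeps the source and section headings next to the text, so a retrieved chunk can be traced back to where it came from.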
Practical Tips
Processing Multiple Documents
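For a folder of files, one option is the converter's `convert_all()` method, sketched below. The function name `convert_folder`, the glob patterns, and the choice to treat anything other than a full success as a failure are assumptions of this example.

```python
from pathlib import Path


def convert_folder(input_dir, output_dir, converter=None, patterns=("*.pdf", "*.docx")):
    """Convert every matching file and write one .md file per successful result."""
    if converter is None:
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = sorted(p for pattern in patterns for p in input_dir.glob(pattern))

    written, failed = [], []
    # raises_on_error=False lets the batch continue when a single file fails.
    for result in converter.convert_all(paths, raises_on_error=False):
        source = Path(result.input.file)
        # Only a full SUCCESS is written; partial results count as failed here.
        if getattr(result.status, "name", "") == "SUCCESS":
            target = output_dir / f"{source.stem}.md"
            target.write_text(result.document.export_to_markdown(), encoding="utf-8")
            written.append(target)
        else:
            failed.append(source)
    return written, failed
```

Reusing one converter for the whole batch avoids reloading models for each file.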
Exporting to Different Formats
Docling can export documents to various formats:
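The sketch below maps short format names to export methods on the converted document. The `EXPORTERS` table and `export_document` helper are illustrative; the methods it calls (`export_to_markdown`, `export_to_dict`, `export_to_html`, `export_to_text`) are assumed to be present in the installed Docling version.

```python
import json

# Format name -> function producing a string from a converted document.
EXPORTERS = {
    "md": lambda doc: doc.export_to_markdown(),
    "json": lambda doc: json.dumps(doc.export_to_dict(), indent=2),
    "html": lambda doc: doc.export_to_html(),
    "txt": lambda doc: doc.export_to_text(),
}


def export_document(doc, fmt):
    """Export a converted document to the requested format as a string."""
    try:
        exporter = EXPORTERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return exporter(doc)
```

Markdown tends to suit LLM prompts, while the JSON form keeps the full document structure for later processing.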
Handling Errors Gracefully
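Conversion can fail for missing files, corrupt inputs or unsupported formats. A defensive wrapper, sketched below, returns an error message instead of raising. The name `safe_convert` and the `(markdown, error)` return shape are assumptions of this example, and the broad `except` is a deliberate simplification.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def safe_convert(source, converter=None):
    """Return (markdown, error); exactly one of the two is None."""
    source = str(source)
    is_url = source.startswith(("http://", "https://"))
    # Check local paths up front so the error message is explicit.
    if not is_url and not Path(source).is_file():
        return None, f"file not found: {source}"
    try:
        if converter is None:
            from docling.document_converter import DocumentConverter

            converter = DocumentConverter()
        result = converter.convert(source)
        return result.document.export_to_markdown(), None
    except Exception as exc:  # broad on purpose: one bad file should not stop a batch
        logger.warning("conversion failed for %s: %s", source, exc)
        return None, f"{type(exc).__name__}: {exc}"
```

In a batch job, the caller can collect the error strings and report them at the end rather than stopping at the first bad file.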
Building a Document Search System
One of the most common use cases for Docling is building document search systems powered by AI. By combining Docling's document processing with embedding models, you can create powerful semantic search capabilities.
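The retrieval side can be sketched in plain Python. Here `chunks` is a list of chunk texts produced by Docling, and `embed` is any function that turns text into a vector; both the function names and the in-memory index are assumptions for illustration. A real system would typically plug in an embedding model and a vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(chunks, embed):
    """Embed every chunk once and keep (text, vector) pairs in memory."""
    return [(text, embed(text)) for text in chunks]


def search(index, query, embed, top_k=3):
    """Return the top_k (score, text) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The same `embed` function must be used for indexing and querying, otherwise the similarity scores are not comparable.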
Performance Tips
Choose the Right Chunk Size
Match your chunk size to your embedding model:
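A sketch of this idea follows. `chunk_token_budget` is a hypothetical helper that subtracts any tokens you reserve (for prefixes or special tokens) from the model's input limit. `build_chunker` assumes a recent Docling release in which `HybridChunker` accepts a `HuggingFaceTokenizer` wrapper; older releases configure the tokenizer differently.

```python
def chunk_token_budget(model_limit, reserved_tokens=0):
    """Tokens available for chunk text after reserving room for extras."""
    if model_limit <= 0:
        raise ValueError("model_limit must be positive")
    if not 0 <= reserved_tokens < model_limit:
        raise ValueError("reserved_tokens must be in [0, model_limit)")
    return model_limit - reserved_tokens


def build_chunker(model_id, max_tokens):
    """Build a HybridChunker that counts tokens with the embedding model's tokenizer."""
    # Lazy imports: these packages are only needed when a chunker is built.
    from docling.chunking import HybridChunker
    from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
    from transformers import AutoTokenizer

    tokenizer = HuggingFaceTokenizer(
        tokenizer=AutoTokenizer.from_pretrained(model_id),
        max_tokens=max_tokens,
    )
    # merge_peers=True lets undersized neighbouring chunks be combined.
    return HybridChunker(tokenizer=tokenizer, merge_peers=True)
```

For instance, `sentence-transformers/all-MiniLM-L6-v2` truncates input beyond 256 word pieces, so `build_chunker("sentence-transformers/all-MiniLM-L6-v2", chunk_token_budget(256, reserved_tokens=16))` would cap chunks at 240 tokens.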
Process Files in Parallel
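One possible approach is a small thread pool in which each worker thread keeps its own converter. This is a sketch under assumptions: the helper names are illustrative, threads are used for simplicity, and whether a process pool or Docling's own batch conversion is faster depends on your hardware and Docling version.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


def _get_converter():
    """Create one DocumentConverter per worker thread and reuse it."""
    if not hasattr(_local, "converter"):
        from docling.document_converter import DocumentConverter

        _local.converter = DocumentConverter()
    return _local.converter


def convert_one(path):
    """Convert a single file to Markdown using this thread's converter."""
    return _get_converter().convert(str(path)).document.export_to_markdown()


def convert_parallel(paths, worker=convert_one, max_workers=4):
    """Run worker over all paths concurrently; returns {path: result}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first worker exception.
        results = list(pool.map(worker, paths))
    return dict(zip((str(p) for p in paths), results))
```

Keep `max_workers` modest: each converter loads its own models, so memory use grows with the number of threads.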
Conclusion
Docling makes document processing straightforward by providing a simple API that lets you convert documents in any supported format with just a few lines of code. Its smart chunking capabilities break documents into meaningful pieces that preserve context and structure, making it ideal for AI applications.
The library's combination of ease of use and powerful features makes it an excellent choice for both prototyping and production applications. With multi-format support for PDFs, Word documents, HTML, and more, plus built-in OCR for scanned documents, Docling handles the complexity of document processing so you don't have to.
Resources
You can find the Docling project on GitHub where you'll find the source code and additional documentation. For working with transformer models and tokenizers, check out the Hugging Face Transformers documentation. The Docling documentation provides more detailed information about advanced features and configuration options.
FAQ
What is Docling in Python?
Docling is an open source Python library from IBM Research that parses PDFs, Word documents, PowerPoint, HTML and images into clean, structured Markdown or JSON. It is widely used as a document preprocessing layer for AI and RAG pipelines.
How do I install Docling?
Install Docling with pip: 'pip install docling'. The package ships with default models for layout and OCR, and supports CPU and GPU inference. Python 3.10 or newer is required.
What file formats does Docling support?
Docling supports PDF (text and scanned), DOCX, PPTX, HTML, AsciiDoc, Markdown and common image formats (PNG, JPEG, TIFF). It preserves tables, headings, lists and reading order for downstream LLM consumption.
Docling vs Unstructured vs LlamaParse?
Docling is fully open source, runs locally and offers strong layout and table parsing. Unstructured is also open source with a broader connector ecosystem. LlamaParse is a hosted commercial service with strong table parsing. For private legal documents, Docling is the strongest default because it keeps data on your infrastructure.
How does HAQQ use Docling?
HAQQ uses document parsing technologies in the same family as Docling to ingest client files into private workspaces while preserving structure, tables and citations for legal analysis - without sending document content to public AI services.