Docling Python: A Practical Guide to Processing Files in 2026

By Jad Jabbour · 2026-05-20 · Updated 2026-06-11 · 10 min read · Guides

Parse PDFs, Word and scanned files into clean Markdown with Docling, the open-source Python library. Install steps, chunking and RAG-ready code examples.

Working with documents in different formats is a common challenge when building AI applications. Whether you're processing PDFs, Word documents, or HTML files, extracting clean, structured text can be surprisingly difficult. Docling is a Python library that makes this process straightforward.

This guide walks you through the essentials of using Docling to process documents, with a focus on practical examples and best practices you can apply immediately.

Why Docling?

Docling solves common document processing problems in a unified way. It provides multi-format support that works seamlessly with PDFs, Word documents, PowerPoint presentations, HTML, and more. The library includes OCR capabilities that can extract text even from scanned documents and images, making it versatile for various document types.

What sets Docling apart is its structure-aware chunking, which breaks documents into meaningful pieces while preserving context instead of splitting text arbitrarily. The output is clean and structured, whether you need Markdown or plain text. The API is also small and easy to get started with, even for developers new to document processing.

Getting Started

First, install Docling:

pip install docling

Basic Usage

Converting a Document

The simplest way to use Docling is with the DocumentConverter:
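A minimal sketch of that conversion is shown here. The file name report.pdf and the helper names are placeholders chosen for illustration, and the Docling import sits inside the function because importing the library also loads its model dependencies.

```python
from pathlib import Path


def markdown_path_for(source: str) -> Path:
    """Return the path where the Markdown output will be written."""
    return Path(source).with_suffix(".md")


def convert_to_markdown(source: str) -> str:
    """Convert one document to Markdown with Docling's DocumentConverter."""
    # Imported lazily: importing Docling also loads its ML dependencies.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)  # the format is detected from the file
    return result.document.export_to_markdown()


def convert_and_save(source: str) -> Path:
    """Convert `source` and write the Markdown next to the original file."""
    target = markdown_path_for(source)
    target.write_text(convert_to_markdown(source), encoding="utf-8")
    return target
```

Calling convert_and_save("report.pdf") would write report.md beside the input. Expect the first run to be slower, since Docling typically downloads its models the first time they are needed.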

That's it! Docling automatically detects the file format and processes it accordingly.

Working with Different File Sources

Docling can process both local files and remote URLs:
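The sketch below passes either kind of source to the same convert() call. The is_url helper and the early file check are additions for illustration, not part of Docling, and any URL you pass in is your own.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """True for http(s) URLs, False for anything treated as a local path."""
    return urlparse(source).scheme in ("http", "https")


def convert_source(source: str) -> str:
    """Convert a local file or a remote URL to Markdown."""
    # Fail early with a clear message if a local file is missing.
    if not is_url(source) and not Path(source).is_file():
        raise FileNotFoundError(f"No such file: {source}")

    # Imported lazily: importing Docling also loads its ML dependencies.
    from docling.document_converter import DocumentConverter

    # The same convert() call accepts both local paths and URLs.
    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()
```

With this in place, convert_source("contracts/lease.docx") and convert_source("https://example.com/paper.pdf") follow the same code path; both names are placeholders.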

What Formats Are Supported?

Docling works with many common file formats out of the box. It handles PDF files, including scanned documents, using OCR. Microsoft Office formats like Word (.docx) and PowerPoint (.pptx) are supported, as are web formats such as HTML. You can also process Markdown files, plain text documents, and image files (such as PNG, JPEG, and TIFF) using its built-in OCR capabilities.

The DocumentConverter automatically detects the file format and applies the appropriate processing method, so you don't need to worry about specifying the type explicitly.

Chunking Documents

For many AI applications, you need to split documents into smaller pieces ("chunks"). Docling's HybridChunker does this by following the document's structure while keeping each chunk within a token limit.

Basic Chunking
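Here is a minimal sketch of chunking with the default HybridChunker settings. The preview helper is only there to keep printed output short, and all function names are illustrative.

```python
def preview(text: str, width: int = 80) -> str:
    """Collapse whitespace and shorten `text` for printing."""
    flat = " ".join(text.split())
    return flat if len(flat) <= width else flat[: width - 3] + "..."


def chunk_document(source: str) -> list[str]:
    """Convert `source` and return the text of each chunk."""
    # Imported lazily: importing Docling also loads its ML dependencies.
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    document = DocumentConverter().convert(source).document
    chunker = HybridChunker()  # default tokenizer and token limit
    return [chunk.text for chunk in chunker.chunk(dl_doc=document)]


def print_chunks(source: str) -> None:
    """Print a one-line preview of every chunk in the document."""
    for index, text in enumerate(chunk_document(source)):
        print(f"[{index}] {preview(text)}")
```

Running print_chunks on a converted report gives a quick way to check that chunk boundaries fall where you expect before wiring chunks into anything else.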

Why Use HybridChunker?

The HybridChunker splits documents along their structure instead of by raw character or word counts. This makes chunks far less likely to cut off mid-sentence, which matters for keeping the semantic meaning of your text intact. In particular, it:

Preserves natural document structures like paragraphs and sections
Applies token-aware chunking that respects embedding model limits
Supports configurable chunk sizes based on your specific needs
Keeps metadata that tracks where each chunk originated

Working with Metadata

Docling extracts useful metadata from documents:
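As a rough sketch, the function below collects a few fields from a converted document. The attribute names (name, origin, pages, tables, pictures) reflect the DoclingDocument model as I understand it, so treat them as assumptions and check them against your installed version.

```python
def summarize_document(document) -> dict:
    """Collect basic metadata from a converted Docling document."""
    origin = document.origin  # may be None for some inputs
    return {
        "name": document.name,
        "filename": origin.filename if origin is not None else None,
        "mimetype": origin.mimetype if origin is not None else None,
        "pages": len(document.pages),
        "tables": len(document.tables),
        "pictures": len(document.pictures),
    }


def summarize_file(source: str) -> dict:
    """Convert `source` and return its metadata summary."""
    # Imported lazily: importing Docling also loads its ML dependencies.
    from docling.document_converter import DocumentConverter

    document = DocumentConverter().convert(source).document
    return summarize_document(document)
```

Keeping summarize_document separate from the conversion step means you can reuse it on documents you have already converted.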

Putting It All Together

Here's a complete example that processes a document and prepares it for use in an AI application:
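One way to sketch the full flow: convert, chunk, then turn each chunk into a plain record you can embed or store. The record layout (id, source, headings, text) is a hypothetical schema chosen for this example, not something Docling prescribes.

```python
def build_records(chunks, source: str) -> list[dict]:
    """Turn chunk objects into plain dicts ready for embedding or storage."""
    records = []
    for index, chunk in enumerate(chunks):
        text = chunk.text.strip()
        if not text:
            continue  # skip chunks that contain only whitespace
        records.append(
            {
                "id": f"{source}#{index}",
                "source": source,
                # headings is None when a chunk has no enclosing section
                "headings": list(chunk.meta.headings or []),
                "text": text,
            }
        )
    return records


def prepare_document(source: str) -> list[dict]:
    """Convert and chunk `source`, returning one record per chunk."""
    # Imported lazily: importing Docling also loads its ML dependencies.
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    document = DocumentConverter().convert(source).document
    chunks = HybridChunker().chunk(dl_doc=document)
    return build_records(chunks, source)
```

The records are ordinary dictionaries, so they can go straight into a vector store, a JSON file, or a database table.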

Practical Tips

Processing Multiple Documents
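For a folder of files, a reasonable approach is to reuse one converter and let it work through the batch. The sketch below assumes convert_all with raises_on_error=False and the ConversionStatus enum behave as in recent Docling releases; the suffix list is an illustrative subset, not the full set of supported formats.

```python
from pathlib import Path

# Illustrative subset of suffixes to pick up; adjust to your own files.
DOCUMENT_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".md"}


def find_documents(folder: str, suffixes=DOCUMENT_SUFFIXES) -> list[Path]:
    """Recursively list files in `folder` whose suffix is in `suffixes`."""
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


def convert_folder(folder: str) -> dict[str, str]:
    """Convert every matching file and map its path to Markdown."""
    # Imported lazily: importing Docling also loads its ML dependencies.
    from docling.datamodel.base_models import ConversionStatus
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # one converter reused for all files
    markdown_by_file = {}
    # raises_on_error=False lets the batch continue past a bad file.
    for result in converter.convert_all(find_documents(folder), raises_on_error=False):
        if result.status == ConversionStatus.SUCCESS:
            markdown_by_file[str(result.input.file)] = result.document.export_to_markdown()
        else:
            print(f"Skipped {result.input.file}: {result.status}")
    return markdown_by_file
```

Reusing the converter avoids reloading models for every file, which is usually the most expensive part of a batch run.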

Exporting to Different Formats

Docling can export documents to various formats:
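The following sketch writes Markdown, HTML, and JSON versions of one converted document. It assumes the document object exposes export_to_markdown, export_to_html, and export_to_dict; the output directory and file stem are placeholders.

```python
import json
from pathlib import Path


def export_all(document, out_dir: str, stem: str) -> list[Path]:
    """Write Markdown, HTML and JSON versions of a converted document."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    outputs = {
        ".md": document.export_to_markdown(),
        ".html": document.export_to_html(),
        # export_to_dict() gives a structured view that serializes to JSON
        ".json": json.dumps(document.export_to_dict(), indent=2),
    }

    written = []
    for suffix, content in outputs.items():
        path = target / f"{stem}{suffix}"
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written
```

Markdown tends to suit LLM prompts, while the JSON form keeps the full structure if you need to post-process tables or headings later.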

Handling Errors Gracefully
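A simple pattern is to wrap each conversion so that one bad file is logged and skipped instead of stopping the run. This is a generic sketch: safe_convert is a hypothetical helper, and it catches Exception broadly on purpose, which you may want to narrow once you know which errors your files produce.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def safe_convert(converter, source: str) -> Optional[str]:
    """Return Markdown for `source`, or None if the conversion fails."""
    try:
        result = converter.convert(source)
        return result.document.export_to_markdown()
    except Exception as exc:  # broad on purpose: one bad file should not stop a batch
        logger.warning("Could not convert %s: %s", source, exc)
        return None
```

Pass in a DocumentConverter instance you created once, then filter out the None results before chunking.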

Building a Document Search System

One of the most common use cases for Docling is building document search systems powered by AI. By combining Docling's document processing with embedding models, you can create powerful semantic search capabilities.
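To make the idea concrete, here is a toy sketch of the search step over chunk texts. The toy_embed function is a word-count stand-in, not a real embedding model; in practice you would swap in vectors from the model of your choice and likely a vector database instead of a Python list.

```python
import math
from collections import Counter

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding based on word counts. Replace with a real model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, chunks: list[str], embed=toy_embed, top_k: int = 3):
    """Return the `top_k` (score, chunk) pairs most similar to `query`."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The chunks argument would normally be the chunk texts produced by HybridChunker, with embeddings computed once at indexing time instead of on every query.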

Performance Tips

Choose the Right Chunk Size

Match your chunk size to your embedding model:
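The sketch below derives a per-chunk token budget and passes it to HybridChunker. The tokenizer wrapper and its import path follow recent Docling releases as I understand them (older releases accepted a tokenizer name and max_tokens directly), so treat that part as an assumption. The model ID and token limit are values you supply from your embedding model's documentation, and chunk_budget is a hypothetical helper.

```python
def chunk_budget(model_limit: int, reserved: int = 0) -> int:
    """Tokens available per chunk after reserving room for extras.

    `reserved` covers anything added around the chunk at embedding time,
    such as special tokens or an instruction prefix.
    """
    if model_limit <= 0:
        raise ValueError("model_limit must be positive")
    if not 0 <= reserved < model_limit:
        raise ValueError("reserved must be between 0 and model_limit - 1")
    return model_limit - reserved


def build_chunker(model_id: str, model_limit: int, reserved: int = 0):
    """Create a HybridChunker sized for the given embedding model."""
    # Imported lazily; these import paths are assumptions about recent releases.
    from docling.chunking import HybridChunker
    from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
    from transformers import AutoTokenizer

    tokenizer = HuggingFaceTokenizer(
        tokenizer=AutoTokenizer.from_pretrained(model_id),
        max_tokens=chunk_budget(model_limit, reserved),
    )
    return HybridChunker(tokenizer=tokenizer)
```

Using the same tokenizer as your embedding model matters more than the exact number: token counts from a different tokenizer can be off enough to cause silent truncation.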

Process Files in Parallel
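One possible sketch uses a thread pool with one converter per worker thread. Whether threads actually speed things up depends on your hardware and on how much of the work runs outside Python, so measure before committing; a process pool is a reasonable alternative. The function names are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

_thread_state = threading.local()


def _get_converter():
    """Create one DocumentConverter per worker thread and reuse it."""
    if not hasattr(_thread_state, "converter"):
        # Imported lazily: importing Docling also loads its ML dependencies.
        from docling.document_converter import DocumentConverter

        _thread_state.converter = DocumentConverter()
    return _thread_state.converter


def convert_one(path: str) -> str:
    """Convert a single file to Markdown using this thread's converter."""
    return _get_converter().convert(path).document.export_to_markdown()


def convert_many(paths, worker=convert_one, max_workers: int = 4) -> dict:
    """Run `worker` on each path in parallel.

    Returns a dict mapping each path to its result, or to the exception
    raised while processing it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one file fails
                results[path] = exc
    return results
```

Keep max_workers modest: each converter holds its own models in memory, so more workers means more RAM or GPU memory in use.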

Conclusion

Docling makes document processing straightforward by providing a simple API that lets you convert a supported document with just a few lines of code. Its structure-aware chunking breaks documents into meaningful pieces that preserve context and structure, making it well suited to AI applications.

The library's combination of ease of use and powerful features makes it an excellent choice for both prototyping and production applications. With multi-format support for PDFs, Word documents, HTML, and more, plus built-in OCR for scanned documents, Docling handles the complexity of document processing so you don't have to.

Resources

The Docling project on GitHub hosts the source code and additional documentation. For working with transformer models and tokenizers, check out the Hugging Face Transformers documentation. The Docling documentation provides more detailed information about advanced features and configuration options.

FAQ

What is Docling in Python?

Docling is an open-source Python library from IBM Research that parses PDFs, Word documents, PowerPoint, HTML and images into clean, structured Markdown or JSON. It is widely used as a document preprocessing layer for AI and RAG pipelines.

How do I install Docling?

Install Docling with pip: 'pip install docling'. The package ships with default models for layout and OCR, and supports CPU and GPU inference. Python 3.10 or newer is required.

What file formats does Docling support?

Docling supports PDF (text and scanned), DOCX, PPTX, HTML, AsciiDoc, Markdown and common image formats (PNG, JPEG, TIFF). It preserves tables, headings, lists and reading order for downstream LLM consumption.

Docling vs Unstructured vs LlamaParse?

Docling is fully open source, runs locally and offers strong layout and table parsing. Unstructured is also open source with a broader connector ecosystem. LlamaParse is a hosted commercial service with strong table parsing. For private legal documents, Docling is the strongest default because it keeps data on your infrastructure.

How does HAQQ use Docling?

HAQQ uses document parsing technologies in the same family as Docling to ingest client files into private workspaces while preserving structure, tables and citations for legal analysis - without sending document content to public AI services.
