
Docling Python: A Practical Guide to Processing Files in 2026

By Jad Jabbour · 10 min read · Guides

How to use Docling, the open-source Python library, to parse PDFs, Word, PowerPoint and HTML into clean structured text for AI and RAG pipelines - with installation, code examples and comparisons to Unstructured and LlamaParse.

Working with documents in different formats is a common challenge when building AI applications. Whether you're processing PDFs, Word documents, or HTML files, extracting clean, structured text can be surprisingly difficult. Docling is a Python library that makes this process straightforward.

This guide walks you through the essentials of using Docling to process documents, with a focus on practical examples and best practices you can apply immediately.

Why Docling?

Docling handles several common document processing problems behind a single interface. It supports PDFs, Word documents, PowerPoint presentations, HTML, and more, and it includes OCR that can extract text from scanned documents and images.

What sets Docling apart is its chunking feature, which breaks documents into meaningful pieces while preserving context rather than splitting text at arbitrary positions. The output is clean and structured, whether you need Markdown or plain text. The API is also small enough that developers new to document processing can get started quickly.

Getting Started

First, install Docling:
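
Docling is published on PyPI under the package name `docling`. The snippet below is a small sketch that shows the install command as a comment and then checks from Python whether the package is importable. The helper name `is_installed` is made up for this guide.

```python
# Install from PyPI in a shell first (this is a shell command, not Python):
#     pip install docling
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be imported in this environment."""
    return importlib.util.find_spec(package) is not None


print("docling installed:", is_installed("docling"))
```

If the check prints `False`, the package was probably installed into a different virtual environment than the one running the script.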

Basic Usage

Converting a Document

The simplest way to use Docling is with the DocumentConverter:
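
A minimal sketch of that conversion, wrapped in a helper so it is easy to reuse. `DocumentConverter`, `convert`, and `export_to_markdown` are Docling calls. The function name `convert_to_markdown` and the optional `converter` argument are conveniences added for this guide.

```python
def convert_to_markdown(source, converter=None):
    """Convert one document and return its content as a Markdown string."""
    if converter is None:
        # Imported lazily so this module loads even where Docling is missing.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert(source)  # a ConversionResult
    return result.document.export_to_markdown()
```

Calling `print(convert_to_markdown("report.pdf"))` would print the Markdown for a local file, where `report.pdf` is a placeholder path. Passing an existing converter avoids rebuilding it for every call.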

That's it! Docling automatically detects the file format and processes it accordingly.

Working with Different File Sources

Docling can process both local files and remote URLs:
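
The same `convert` call accepts either kind of source, so a sketch for mixed input can be short. The helper names and the example URL mentioned afterwards are placeholders. `describe_source` is only used for labelling and is not something Docling requires.

```python
from urllib.parse import urlparse


def describe_source(source: str) -> str:
    """Label a source as 'url' or 'local' (for logging only)."""
    return "url" if urlparse(source).scheme in ("http", "https") else "local"


def convert_sources(sources, converter=None):
    """Convert local paths and URLs alike; return Markdown keyed by source."""
    if converter is None:
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    markdown_by_source = {}
    for source in sources:
        print(f"Converting {describe_source(source)} source: {source}")
        result = converter.convert(source)
        markdown_by_source[source] = result.document.export_to_markdown()
    return markdown_by_source
```

For example, `convert_sources(["docs/report.pdf", "https://example.com/paper.pdf"])` would process one local file and one remote file in a single loop.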

What Formats Are Supported?

Docling works with many common file formats out of the box. It handles PDF files, including scanned documents, using OCR. Microsoft Office formats like Word (.docx) and PowerPoint (.pptx) are supported, as are web formats such as HTML. You can also process Markdown files, plain text documents, and image files (such as .png or .jpg) using its built-in OCR capabilities.

The DocumentConverter automatically detects the file format and applies the appropriate processing method, so you don't need to worry about specifying the type explicitly.
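
Even with automatic detection, it can help to screen a folder before handing it to Docling so that unrelated files are skipped early. The sketch below is plain Python and does not call Docling. The extension set is an assumption based on the formats named in this guide, so adjust it to what your installed version actually accepts.

```python
from pathlib import Path

# Assumption: extensions for the formats discussed in this guide.
LIKELY_SUPPORTED = {".pdf", ".docx", ".pptx", ".html", ".md", ".png", ".jpg", ".jpeg"}


def split_by_support(paths):
    """Split paths into (likely supported, skipped) based on file extension."""
    supported, skipped = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in LIKELY_SUPPORTED:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped
```

The extension check is case-insensitive, so `scan.PDF` and `scan.pdf` are treated the same.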

Chunking Documents

For many AI applications, you need to split documents into smaller pieces ("chunks"). Docling's HybridChunker does this by following the structure of the document.

Basic Chunking
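
A minimal sketch of chunking a converted document with the default `HybridChunker` settings. The helper name `chunk_document` is made up here. Note that the default chunker relies on a Hugging Face tokenizer, so the first run may need the `transformers` package and a one-time model download.

```python
def chunk_document(source, converter=None, chunker=None):
    """Convert `source` and return the text of each chunk as a list."""
    if converter is None:
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    if chunker is None:
        from docling.chunking import HybridChunker

        chunker = HybridChunker()  # default tokenizer and token limit

    document = converter.convert(source).document
    # chunk() yields chunk objects; .text holds the chunk's plain text.
    return [chunk.text for chunk in chunker.chunk(dl_doc=document)]
```

A quick way to inspect the result is to loop over the returned list and print the first 80 characters of each chunk.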

Why Use HybridChunker?

The HybridChunker splits documents using more than simple character or word counts. It follows natural document structures like paragraphs and sections, so chunks are far less likely to cut off mid-sentence. This matters for keeping the semantic meaning of each chunk intact.
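
To see why structure matters, compare a fixed-width split with a paragraph-based split. This is a plain Python illustration of the idea only. It is not how HybridChunker is implemented, and both function names are made up.

```python
def fixed_split(text, size):
    """Naive approach: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_split(text):
    """Structure-aware approach: cut only at blank lines between paragraphs."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


sample = "Docling parses files.\n\nChunks keep context."
print(fixed_split(sample, 16))   # pieces end in the middle of words
print(paragraph_split(sample))   # each piece is a complete paragraph
```

The fixed-width version produces fragments that an embedding model has to interpret without their surrounding sentence, which is the problem structure-aware chunking avoids.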

Working with Metadata

Docling extracts useful metadata from documents:
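
The sketch below reads a few metadata fields from a conversion result and from a chunk. The attribute names (`name`, `origin`, `pages`, `meta.headings`) follow the Docling document model as I understand it. Treat them as assumptions and check them against your installed version. The helper names are made up.

```python
def summarize_document(result):
    """Collect basic metadata from a Docling conversion result."""
    document = result.document
    origin = document.origin  # may be None for some inputs
    return {
        "name": document.name,
        "filename": origin.filename if origin else None,
        "mimetype": origin.mimetype if origin else None,
        "num_pages": len(document.pages),
    }


def chunk_headings(chunk):
    """Return the section headings a chunk sits under, or an empty list."""
    return list(chunk.meta.headings or [])
```

Heading metadata is handy for citations: storing it next to each chunk lets a search result point back to the section it came from.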

Putting It All Together

Here's a complete example that processes a document and prepares it for use in an AI application:
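
One way to sketch the full flow is a single function that converts a document, chunks it, and returns one record per chunk, ready to embed or store. The record layout and the name `prepare_for_rag` are choices made for this guide, not part of Docling.

```python
def prepare_for_rag(source, converter=None, chunker=None):
    """Convert `source`, chunk it, and return one dict per chunk."""
    if converter is None:
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    if chunker is None:
        from docling.chunking import HybridChunker

        chunker = HybridChunker()

    document = converter.convert(source).document
    records = []
    for index, chunk in enumerate(chunker.chunk(dl_doc=document)):
        records.append(
            {
                "id": f"{document.name}-{index}",  # stable per-chunk identifier
                "source": str(source),
                "headings": list(chunk.meta.headings or []),
                "text": chunk.text,
            }
        )
    return records
```

Each record carries enough context (source, headings, position) to trace an answer back to the original document later.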

Practical Tips

Processing Multiple Documents
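
For batches, `DocumentConverter.convert_all` takes a list of sources and yields one result per input. The sketch below assumes `raises_on_error=False` makes failures show up in `result.status` instead of raising, and that the status is an enum whose success member is named `SUCCESS`. The helper name `convert_many` is made up.

```python
def convert_many(paths, converter=None):
    """Convert several documents; return (markdown by file, failed files)."""
    if converter is None:
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    converted, failed = {}, []
    # With raises_on_error=False, one bad file should not stop the batch.
    for result in converter.convert_all(paths, raises_on_error=False):
        name = str(result.input.file)
        if result.status.name == "SUCCESS":
            converted[name] = result.document.export_to_markdown()
        else:
            failed.append(name)
    return converted, failed
```

Reusing one converter for the whole batch matters for speed, because building a converter can involve loading models.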

Exporting to Different Formats

Docling can export documents to various formats:
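
A short sketch that produces several export formats from one converted document. The `export_to_*` methods are the ones I know from the Docling document model. Confirm they exist in your installed version. The helper name `export_all` is made up.

```python
import json


def export_all(document):
    """Return the same document in several formats, keyed by format name."""
    return {
        "markdown": document.export_to_markdown(),
        "text": document.export_to_text(),
        "html": document.export_to_html(),
        # export_to_dict() gives a JSON-serializable structure.
        "json": json.dumps(document.export_to_dict()),
    }
```

Markdown tends to be the most useful format for LLM prompts, while the dictionary form keeps the full document structure for later processing.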

Handling Errors Gracefully
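
Conversion can fail for many reasons: a missing file, a corrupt PDF, or a network error on a URL. The sketch below checks for missing local files up front and catches any exception from the conversion itself. Catching the broad `Exception` is a deliberate simplification, and the helper name `safe_convert` is made up.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def safe_convert(source, converter=None):
    """Return Markdown for `source`, or None if it is missing or fails."""
    is_remote = str(source).startswith(("http://", "https://"))
    if not is_remote and not Path(source).is_file():
        logger.error("File not found: %s", source)
        return None

    try:
        if converter is None:
            from docling.document_converter import DocumentConverter

            converter = DocumentConverter()
        return converter.convert(source).document.export_to_markdown()
    except Exception as exc:  # broad on purpose: log and move on
        logger.error("Failed to convert %s: %s", source, exc)
        return None
```

Returning `None` keeps a batch loop simple: callers can skip failed documents and review the log afterwards.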

Building a Document Search System

One of the most common use cases for Docling is building document search systems powered by AI. By combining Docling's document processing with embedding models, you can create powerful semantic search capabilities.
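
To show the shape of such a system without pulling in an embedding library, the sketch below uses a toy bag-of-words "embedding" and cosine similarity over chunk texts. The `embed` function is a stand-in. In a real system you would replace it with calls to your embedding model, and the chunk texts would come from Docling's chunker.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, chunks, top_k=3):
    """Return up to top_k (score, chunk) pairs, best match first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, chunk) for score, chunk in scored[:top_k] if score > 0]
```

The structure stays the same with a real model: embed every chunk once, embed the query at search time, and rank chunks by similarity.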

Performance Tips

Choose the Right Chunk Size

Match your chunk size to your embedding model:
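
One way to do this is to give the chunker the same tokenizer your embedding model uses, plus a token limit. The sketch below follows the pattern from Docling's hybrid chunking documentation as I recall it. The import path for `HuggingFaceTokenizer` and the `merge_peers` parameter may differ between versions, so treat them as assumptions. `chunk_budget` and `build_chunker` are helper names made up for this guide.

```python
def chunk_budget(model_max_tokens, reserved_tokens=0):
    """Tokens available per chunk after reserving room for extra text."""
    return max(1, model_max_tokens - reserved_tokens)


def build_chunker(model_id, max_tokens):
    """Build a HybridChunker that counts tokens with the given model's tokenizer."""
    # Lazy imports: these need docling, docling-core and transformers installed.
    from docling.chunking import HybridChunker
    from docling_core.transforms.chunker.tokenizer.huggingface import (
        HuggingFaceTokenizer,
    )
    from transformers import AutoTokenizer

    tokenizer = HuggingFaceTokenizer(
        tokenizer=AutoTokenizer.from_pretrained(model_id),
        max_tokens=max_tokens,
    )
    return HybridChunker(tokenizer=tokenizer, merge_peers=True)
```

For a model with a hypothetical 512-token limit where you prepend about 32 tokens of heading context, `chunk_budget(512, reserved_tokens=32)` gives a per-chunk limit of 480 to pass as `max_tokens`.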

Process Files in Parallel
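
A sketch of parallel conversion with a thread pool from the standard library. Whether threads or separate processes work better depends on your documents and hardware, so measure before committing to one. Creating one converter per call is an assumption made here to keep workers independent, and it costs extra memory. Both helper names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_one(path):
    """Convert a single file to Markdown with its own converter."""
    from docling.document_converter import DocumentConverter

    # Assumption: a separate converter per call avoids shared state.
    return DocumentConverter().convert(path).document.export_to_markdown()


def convert_in_parallel(paths, convert=convert_one, max_workers=4):
    """Run `convert` over `paths` concurrently; return results keyed by path."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(paths, pool.map(convert, paths)))
```

If any single conversion raises, the exception surfaces when the results are collected, so wrap `convert` in your own error handling if one bad file should not abort the batch.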

Conclusion

Docling makes document processing straightforward by providing a simple API that lets you convert supported documents with just a few lines of code. Its chunking breaks documents into meaningful pieces that preserve context and structure, which suits AI applications well.

This combination of ease of use and capable features makes the library a reasonable choice for both prototyping and production applications. With multi-format support for PDFs, Word documents, HTML, and more, plus built-in OCR for scanned documents, Docling takes on much of the complexity of document processing for you.

Resources

The Docling project on GitHub hosts the source code and additional documentation. For working with transformer models and tokenizers, check out the Hugging Face Transformers documentation. The Docling documentation provides more detailed information about advanced features and configuration options.