PyMuPDF4LLM

PyMuPDF4LLM is an extension of PyMuPDF for extracting content from PDFs in a format optimized for Large Language Models (LLMs).

Key Features

Markdown Output

Converts PDFs to clean, structured Markdown format
Preserves document hierarchy (headers, lists, tables)
Makes PDF content easily digestible for LLMs like Claude, GPT, etc.

Intelligent Structure Detection

Automatically identifies headers, paragraphs, tables, and images
Maintains document layout and reading order
Preserves semantic structure

Image Handling

Extracts images from PDFs
Can save images separately or encode them inline
Useful for multimodal LLMs that can process images

Installation

The Python package pymupdf4llm on PyPI (also available under the alias pdf4llm) converts PDF pages into text strings in GitHub-compatible Markdown. This covers standard text as well as table-based text in a consistent, integrated view, a feature particularly important in RAG settings.

$ pip install -U pymupdf4llm

This command will automatically install PyMuPDF if required.

Then, in your script:

import pymupdf4llm

md_text = pymupdf4llm.to_markdown("input.pdf")

# now work with the markdown text, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())

Instead of the filename string as above, one can also provide a PyMuPDF Document. By default, all pages in the PDF will be processed. If desired, the parameter pages=[...] can be used to provide a list of zero-based page numbers to consider.
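As a minimal sketch of the pages parameter described above: people usually think of pages as 1-based, while pages=[...] expects zero-based numbers, so a small conversion helper can avoid off-by-one mistakes. The names zero_based_pages and convert_page_range are hypothetical helpers for illustration, not part of pymupdf4llm.

```python
def zero_based_pages(first: int, last: int) -> list[int]:
    """Turn an inclusive 1-based page range into zero-based page numbers."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


def convert_page_range(path: str, first: int, last: int) -> str:
    """Convert only pages first..last (1-based, inclusive) to Markdown."""
    # Imported here so the helper above stays usable without the package.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(path, pages=zero_based_pages(first, last))
```

For example, convert_page_range("input.pdf", 1, 3) would process the first three pages of the document.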

Markdown text creation now also processes multi-column pages.

To create small chunks of text - as opposed to generating one large string for the whole document - the new (v0.0.2) option page_chunks=True can be used. The result of .to_markdown("input.pdf", page_chunks=True) will be a list of Python dictionaries, one for each page.
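The per-page result can be post-processed like any list of dictionaries. The sketch below assumes that each dictionary stores the page's Markdown under the key "text"; that key name is not stated above, so check it against the API documentation for your installed version. The function names are hypothetical.

```python
def nonempty_page_texts(chunks: list[dict]) -> list[str]:
    """Collect the Markdown of each page chunk, skipping blank pages.

    Assumption: every chunk keeps its Markdown under the key "text".
    """
    return [chunk["text"] for chunk in chunks if chunk.get("text", "").strip()]


def load_page_texts(path: str) -> list[str]:
    """Convert a document page by page and return the non-empty page texts."""
    # Imported here so the helper above stays usable without the package.
    import pymupdf4llm

    chunks = pymupdf4llm.to_markdown(path, page_chunks=True)
    return nonempty_page_texts(chunks)
```

Calling load_page_texts("input.pdf") then yields one string per non-blank page, which can be passed on to an embedding or indexing step.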

Also new in version 0.0.2 is the optional extraction of images and vector graphics via the parameter write_images=True. This stores PNG images in the document's folder, and the Markdown text refers to them accordingly. The images are named like "input.pdf-page_number-index.png".
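To locate the image files after a conversion, one option is to search the document's folder for names that follow the scheme described above. This is only a sketch: the glob pattern is an assumption derived from that naming scheme, and extracted_images and convert_with_images are hypothetical helper names.

```python
import glob
import pathlib


def extracted_images(pdf_path) -> list[pathlib.Path]:
    """List PNG files in the document's folder named '<file name>-<page>-<index>.png'."""
    pdf = pathlib.Path(pdf_path)
    # Escape the file name so characters like '[' are matched literally.
    pattern = glob.escape(pdf.name) + "-*-*.png"
    return sorted(pdf.parent.glob(pattern))


def convert_with_images(pdf_path: str) -> tuple[str, list[pathlib.Path]]:
    """Convert a document with write_images=True and return text plus image files."""
    # Imported here so the helper above stays usable without the package.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(pdf_path, write_images=True)
    return md_text, extracted_images(pdf_path)
```

If your version stores images somewhere other than the document's folder, adjust the folder that extracted_images searches.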

Documentation and API

Documentation

API

Document Support

While PDF is by far the most important document format worldwide, it is worth mentioning that all examples and helper scripts work in the same way, without change, for all supported file types.

So for an XPS document or an eBook, simply provide the filename, for instance "input.mobi", and everything else works as before.

About PyMuPDF

PyMuPDF adds Python bindings and abstractions to MuPDF, a lightweight PDF, XPS, and eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

PyMuPDF's homepage is located on GitHub.

Community

Join us on Discord here: #pymupdf.

License and Copyright

PyMuPDF is available under open-source AGPL and commercial license agreements. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.
