Commit 6506883

Add PyMuPDF image extraction
1 parent 1ce8729 commit 6506883

3 files changed

+190
-158
lines changed

README.md

Lines changed: 3 additions & 2 deletions
@@ -53,8 +53,9 @@ This benchmark is about reading pure PDF files - not scanned documents and not do
| # | Library | Average | [ 1 ](https://arxiv.org/pdf/2201.00214.pdf) | [ 2 ](https://github.com/py-pdf/sample-files/raw/main/009-pdflatex-geotopo/GeoTopo.pdf) | [ 3 ](https://arxiv.org/pdf/2201.00151.pdf) | [ 4 ](https://arxiv.org/pdf/1707.09725.pdf) | [ 5 ](https://arxiv.org/pdf/2201.00021.pdf) | [ 6 ](https://arxiv.org/pdf/2201.00037.pdf) | [ 7 ](https://arxiv.org/pdf/2201.00069.pdf) | [ 8 ](https://arxiv.org/pdf/2201.00178.pdf) | [ 9 ](https://arxiv.org/pdf/2201.00201.pdf) | [ 10 ](https://arxiv.org/pdf/1602.06541.pdf) | [ 11 ](https://arxiv.org/pdf/2201.00200.pdf) | [ 12 ](https://arxiv.org/pdf/2201.00022.pdf) | [ 13 ](https://arxiv.org/pdf/2201.00029.pdf) | [ 14 ](https://arxiv.org/pdf/1601.03642.pdf) |
| :- | :-------------------------------------------------------- | :------ | :---------------------------------------------- | :------------------------------------------------------------------------------------------ | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- | :---------------------------------------------- |
-| 1 | [PyPDF2 ](https://pypi.org/project/PyPDF2/) | 1.0s | 0.4s | 1.5s | 0.0s | 3.5s | 0.9s | 0.0s | 5.7s | 0.7s | 0.7s | 0.2s | 0.0s | 0.5s | 0.0s | 0.0s |
-| 2 | [pdfminer.six ](https://pypi.org/project/pdfminer.six/) | 9.3s | 47.4s | 21.7s | 13.0s | 30.3s | 1.9s | 3.2s | 1.8s | 1.7s | 1.5s | 2.5s | 1.7s | 1.7s | 1.2s | 0.9s |
+| 1 | [PyMuPDF ](https://pypi.org/project/PyMuPDF/) | 0.6s | 0.4s | 0.9s | 0.0s | 2.0s | 0.6s | 0.0s | 3.2s | 0.4s | 0.4s | 0.3s | 0.0s | 0.3s | 0.2s | 0.0s |
+| 2 | [PyPDF2 ](https://pypi.org/project/PyPDF2/) | 1.0s | 0.4s | 1.5s | 0.0s | 3.6s | 0.9s | 0.0s | 5.7s | 0.7s | 0.7s | 0.2s | 0.0s | 0.5s | 0.0s | 0.0s |
+| 3 | [pdfminer.six ](https://pypi.org/project/pdfminer.six/) | 9.1s | 48.6s | 18.3s | 12.7s | 30.8s | 1.9s | 3.5s | 1.9s | 2.1s | 1.4s | 2.0s | 1.6s | 1.6s | 0.8s | 0.7s |

## Watermarking Speed

benchmark.py

Lines changed: 17 additions & 0 deletions
@@ -182,6 +182,22 @@ def pypdf2_image_extraction(data: bytes) -> List[Tuple[str, bytes]]:
     return images


+def pymupdf_image_extraction(data: bytes) -> List[Tuple[str, bytes]]:
+    images = []
+    with PyMuPDF.open(stream=data, filetype="pdf") as pdf_file:
+        for page_index in range(len(pdf_file)):
+            page = pdf_file[page_index]
+            for image_index, img in enumerate(page.get_images(), start=1):
+                xref = img[0]
+                base_image = pdf_file.extract_image(xref)
+                image_bytes = base_image["image"]
+                image_ext = base_image["ext"]
+                images.append(
+                    (f"image{page_index+1}_{image_index}.{image_ext}", image_bytes)
+                )
+    return images
+
+
 def pdfminer_image_extraction(data: bytes) -> List[Tuple[str, bytes]]:
     from PIL import Image

@@ -577,6 +593,7 @@ def get_text_extraction_score(doc: Document, library_name: str):
 lambda n: pymupdf_get_text(n),
 version=PyMuPDF.version[0],
 watermarking_function=None,
+image_extraction_function=pymupdf_image_extraction,
 dependencies="MuPDF",
 license="GNU AFFERO GPL 3.0 / Commerical",
 last_release_date="2022-08-31",
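
The `pymupdf_image_extraction` function added in this commit returns a list of `(filename, bytes)` tuples, with names of the form `image<page>_<index>.<ext>`. As a rough illustration of how a caller might consume that return value, here is a minimal sketch that writes such tuples to disk. The helper name `save_images` and the output directory are hypothetical and not part of the commit; the benchmark itself may handle the result differently.

```python
from pathlib import Path
from typing import List, Tuple


def save_images(images: List[Tuple[str, bytes]], out_dir: Path) -> List[Path]:
    """Write (filename, data) pairs into out_dir and return the written paths.

    Hypothetical helper, not part of benchmark.py: it only assumes the
    List[Tuple[str, bytes]] shape returned by the *_image_extraction functions.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in images:
        # Use only the final path component so a name cannot escape out_dir.
        target = out_dir / Path(name).name
        target.write_bytes(data)
        written.append(target)
    return written
```

Assuming `pdf_bytes` holds the raw content of a PDF file, a call might look like `save_images(pymupdf_image_extraction(pdf_bytes), Path("extracted"))`.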
