Commit 2417507

Use pypdfium2's new range-based text extractor (#5)
get_text() is boundary-based, which makes it a poor fit for simply extracting all the text of a page. I believe the new get_text_range() function may both yield better results and be faster. This can be merged once pypdfium2 3.3 is released.
1 parent 6506883 commit 2417507

1 file changed, +1 -4 lines changed

benchmark.py

Lines changed: 1 addition & 4 deletions
@@ -149,10 +149,7 @@ def pdfium_get_text(data: bytes) -> str:
     for i in range(len(pdf)):
         page = pdf.get_page(i)
         textpage = page.get_textpage()
-        text += textpage.get_text()
-        text += "\n"
-        [g.close() for g in (textpage, page)]
-    pdf.close()
+        text += textpage.get_text_range() + "\n"
     return text
 
 
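For readers who want to try the range-based call outside the benchmark, the loop in the diff can be sketched as a standalone helper. This is a minimal sketch, not the benchmark's code: the helper name extract_all_text is hypothetical, get_text_range() is assumed to return the whole page's text when called without arguments (as the commit uses it), and the explicit close() calls are kept here as a deliberate choice even though the commit drops them.

```python
from typing import Any


def extract_all_text(pdf: Any) -> str:
    """Concatenate the text of every page, adding one newline per page.

    `pdf` is any opened document exposing the calls used in the diff:
    len(pdf), pdf.get_page(i), page.get_textpage(), textpage.get_text_range().
    (Hypothetical helper; not part of benchmark.py or pypdfium2.)
    """
    parts = []
    for i in range(len(pdf)):
        page = pdf.get_page(i)
        try:
            textpage = page.get_textpage()
            try:
                # Assumption: with no arguments, this returns the full page text.
                parts.append(textpage.get_text_range() + "\n")
            finally:
                textpage.close()  # close the text page before its parent page
        finally:
            page.close()
    return "".join(parts)


def pdfium_get_text(data: bytes) -> str:
    """Open a PDF from bytes with pypdfium2 and extract all of its text."""
    # Imported lazily so extract_all_text() stays usable without pypdfium2.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(data)
    try:
        return extract_all_text(pdf)
    finally:
        pdf.close()
```

Collecting the per-page strings in a list and joining once avoids repeated string concatenation; the returned text is the same as with the `text +=` loop in the diff.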