Use pypdfium2's new range-based text extractor (#5)

mara004 · web-flow · commit 2417507ba648 · 2022-10-11T07:12:13.000+02:00
get_text() is boundary-based, which is not well suited to the use case of simply extracting all the text of a page.
I believe the new get_text_range() function may both yield better results and perform better.

This can be merged once pypdfium2 3.3 is released.
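
For illustration, the range-based extraction loop could be sketched as a standalone helper. This is only a sketch: `extract_all_text` is a hypothetical name that is not part of benchmark.py or pypdfium2, and it assumes the object passed in behaves like a pypdfium2 3.x document (`len(pdf)`, `pdf.get_page(i)`, `page.get_textpage()`, `textpage.get_text_range()`). It also assumes that calling get_text_range() without arguments covers the whole page.

```python
def extract_all_text(pdf) -> str:
    """Concatenate the text of every page, with one newline after each page.

    Hypothetical helper for illustration only. `pdf` is assumed to behave
    like a pypdfium2 3.x document object.
    """
    parts = []
    for i in range(len(pdf)):
        page = pdf.get_page(i)
        textpage = page.get_textpage()
        # Assumption: with no arguments, get_text_range() returns the
        # text of the whole page rather than of a bounded region.
        parts.append(textpage.get_text_range() + "\n")
    return "".join(parts)
```

With pypdfium2 installed, this would presumably be called on a document opened from the benchmark's input bytes. The helper does not close the document or its pages; resource cleanup is left to the caller.
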
diff --git a/benchmark.py b/benchmark.py
@@ -149,10 +149,7 @@ def pdfium_get_text(data: bytes) -> str:
     for i in range(len(pdf)):
         page = pdf.get_page(i)
         textpage = page.get_textpage()
-        text += textpage.get_text()
-        text += "\n"
-        [g.close() for g in (textpage, page)]
-    pdf.close()
+        text += textpage.get_text_range() + "\n"
     return text