Commit 9e494c6

BUG: layout mode text extraction ZeroDivisionError (#2417)
For fonts without an explicitly defined width for the " " character, it's still possible to generate a ZeroDivisionError when compiling TextStateParams objects in _fixed_width_page.recurs_to_target_op() if the font size or the Tz parameter has been set to 0. Discovered during processing of a "pre-OCR'd" image PDF having `{"/BaseFont": "/GlyphLessFont"}`.

DOC: Remove duplicate docstring for layout_mode_strip_rotated
1 parent facd6fd commit 9e494c6
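
The failure mode named in the commit message can be sketched roughly as follows. This is an illustration and not pypdf's actual code: `space_displacement` and `naive_space_count` are hypothetical helpers, and the formula is the general PDF text-space displacement rule, (w0 * Tfs + Tc + Tw) * Th with Th = Tz / 100, assuming character and word spacing default to 0. Under those assumptions, a font size of 0 or a Tz of 0 yields a space width of 0, and the unguarded floor division then raises.

```python
def space_displacement(
    space_width: float,
    font_size: float,
    tz: float = 100.0,
    char_spacing: float = 0.0,
    word_spacing: float = 0.0,
) -> float:
    """Horizontal advance of one space glyph in text space units.

    space_width is a glyph width in thousandths of a text space unit
    (a hypothetical fallback, since the font defines no width for " ").
    """
    th = tz / 100.0  # Tz is expressed as a percentage
    return (space_width / 1000.0 * font_size + char_spacing + word_spacing) * th


def naive_space_count(excess_tx: float, space_tx: float) -> int:
    """Unguarded count of spaces that fit into excess_tx (no zero check)."""
    return int(excess_tx // space_tx)


if __name__ == "__main__":
    for font_size, tz in [(12, 100), (0, 100), (12, 0)]:
        space_tx = space_displacement(250, font_size, tz)
        try:
            print(font_size, tz, naive_space_count(9.0, space_tx))
        except ZeroDivisionError as exc:
            print(font_size, tz, "ZeroDivisionError:", exc)
```

The 250 passed as the space width is an arbitrary placeholder chosen for the sketch, not a value taken from the commit.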

2 files changed: +3 -4 lines changed

pypdf/_page.py

Lines changed: 0 additions & 2 deletions
@@ -2027,8 +2027,6 @@ def extract_text(
 layout_mode_strip_rotated (bool): layout mode does not support rotated text.
 Set to False to include rotated text anyway. If rotated text is discovered,
 layout will be degraded and a warning will result. Defaults to True.
-layout_mode_strip_rotated: Removes text that is rotated w.r.t. to the page from
-layout mode output. Defaults to True.
 layout_mode_debug_path (Path | None): if supplied, must target a directory.
 creates the following files with debug information for layout mode
 functions if supplied:

pypdf/_text_extraction/_layout_mode/_fixed_width_page.py

Lines changed: 3 additions & 2 deletions
@@ -143,8 +143,9 @@ def recurs_to_target_op(
 # multiply by bool (_idx != bt_idx) to ensure spaces aren't double
 # applied to the first tj of a BTGroup in fixed_width_page().
 excess_tx = round(_tj.tx - last_displaced_tx, 3) * (_idx != bt_idx)
-
-new_text = f'{" " * int(excess_tx // _tj.space_tx)}{_tj.txt}'
+# space_tx could be 0 if either Tz or font_size was 0 for this _tj.
+spaces = int(excess_tx // _tj.space_tx) if _tj.space_tx else 0
+new_text = f'{" " * spaces}{_tj.txt}'
 
 last_ty = _tj.ty
 _text = f"{_text}{new_text}"
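
The guard added in this hunk can be tried in isolation with a small sketch. `leading_spaces` and `pad_text` are hypothetical stand-ins for the two changed lines, taking plain numbers instead of the `_tj` object used in pypdf. Only the conditional expression itself is copied from the diff.

```python
def leading_spaces(excess_tx: float, space_tx: float) -> int:
    """Number of spaces to prepend; 0 when space_tx is 0 (same guard as the diff)."""
    return int(excess_tx // space_tx) if space_tx else 0


def pad_text(txt: str, excess_tx: float, space_tx: float) -> str:
    """Prefix txt with the computed number of spaces."""
    return f'{" " * leading_spaces(excess_tx, space_tx)}{txt}'


if __name__ == "__main__":
    print(repr(pad_text("abc", 6.0, 3.0)))  # two spaces fit into the gap
    print(repr(pad_text("abc", 6.0, 0.0)))  # zero-width space: no padding, no error
```

With a zero-width space the text is emitted without padding, so horizontal alignment for that run is lost, but extraction no longer aborts.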
