Feat/vectorize layout merging (Unstructured-IO#3900)

badGarnet · ryannikolaidis · web-flow · commit 85bfb1b77337 · 2025-02-07T20:25:57.000Z
This PR rewrites the logic in `unstructured_inference` that merges the
extracted and inferred layouts, using vectorized operations. The goals
are to:
- vectorize the operation to improve memory and CPU efficiency
- apply the logic equally without order being a factor (the
`unstructured_inference` version uses loops and modifies the content of
the inner loop on the fly, so the order of the outer loop, which is the
order of extracted elements, becomes a factor in determining the merging
results)
- rewrite the loop into clear steps with clear rules
- set the stage for follow-up improvements
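
As an illustration of the vectorized style described above, here is a minimal sketch (not the code in this PR) that tests every extracted box against every inferred box in one broadcasted NumPy pass. The function name `almost_subregion_matrix`, the `(x1, y1, x2, y2)` box layout, the sample boxes, and the `0.5` threshold are all assumptions made for the example.

```python
import numpy as np


def almost_subregion_matrix(boxes1, boxes2, threshold=0.5):
    """Hypothetical helper: entry [i, j] is True when at least `threshold`
    of the area of boxes1[i] lies inside boxes2[j].

    Boxes are rows of (x1, y1, x2, y2).
    """
    boxes1 = np.asarray(boxes1, dtype=float)
    boxes2 = np.asarray(boxes2, dtype=float)
    # Broadcast (N, 1) against (1, M) to get all pairwise intersections.
    x1 = np.maximum(boxes1[:, None, 0], boxes2[None, :, 0])
    y1 = np.maximum(boxes1[:, None, 1], boxes2[None, :, 1])
    x2 = np.minimum(boxes1[:, None, 2], boxes2[None, :, 2])
    y2 = np.minimum(boxes1[:, None, 3], boxes2[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    # Zero-area boxes get a ratio of 0 instead of a division error.
    ratio = np.divide(
        inter,
        area1[:, None],
        out=np.zeros_like(inter),
        where=area1[:, None] > 0,
    )
    return ratio >= threshold


# Made-up coordinates: two extracted boxes sit inside the first inferred box,
# the third lies outside every inferred box.
extracted = [[10, 10, 20, 20], [25, 10, 100, 20], [10, 200, 100, 220]]
inferred = [[8, 8, 102, 22], [0, 100, 200, 150]]

matrix = almost_subregion_matrix(extracted, inferred)
covered = matrix.any(axis=1)  # one flag per extracted box
print(covered)
```

Because every pair is evaluated against the same, unmodified inputs, the result does not depend on the order in which the boxes are listed.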

While this PR aims to reproduce the existing behavior as much as
possible, it is not an exact replica of the looped version. Because order
is no longer a factor, some extracted elements that were previously not
considered part of a larger inferred element (due to a suboptimal
processing order) are now properly merged. This leads to changes in one
ingest test. For example, the change shows that we now properly merge the
section number with the section title into the full title element.
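
The order effect can be shown with a toy model. This is only a sketch under loud assumptions: it uses 1-D intervals instead of bounding boxes, and its merge rule (replace the inferred region with the first match) is invented for illustration. It is not the actual rule set of `unstructured_inference` or of this PR; the helper names and coordinates are hypothetical.

```python
import numpy as np


def contained_fraction(inner, outer):
    """Fraction of interval `inner` that lies inside interval `outer`."""
    overlap = max(0.0, min(inner[1], outer[1]) - max(inner[0], outer[0]))
    return overlap / (inner[1] - inner[0])


def looped_merge(extracted, inferred, threshold=0.5):
    """Toy loop: a match overwrites the inferred region on the fly, so later
    extracted items are compared against a different target."""
    target = inferred
    merged = []
    for text, span in extracted:
        if contained_fraction(span, target) >= threshold:
            merged.append(text)
            target = span  # mutation that makes the outcome order dependent
    return merged


def vectorized_merge(extracted, inferred, threshold=0.5):
    """Toy vectorized version: every item is compared against the original
    inferred region, then matches are joined left to right."""
    texts = np.array([text for text, _ in extracted])
    spans = np.array([span for _, span in extracted], dtype=float)
    overlap = np.clip(
        np.minimum(spans[:, 1], inferred[1]) - np.maximum(spans[:, 0], inferred[0]),
        0,
        None,
    )
    keep = overlap / (spans[:, 1] - spans[:, 0]) >= threshold
    order = np.argsort(spans[keep, 0])
    return " ".join(texts[keep][order].tolist())


inferred_title = (10.0, 40.0)  # made-up inferred "Title" region
title_first = [("Introduction", (14.0, 40.0)), ("1", (10.0, 12.0))]
number_first = list(reversed(title_first))

print(looped_merge(title_first, inferred_title))
print(looped_merge(number_first, inferred_title))
print(vectorized_merge(title_first, inferred_title))
print(vectorized_merge(number_first, inferred_title))
```

In this toy, the looped version keeps a different item depending on input order, while the vectorized version returns the same combined title for both orders.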

## Test:

Since the goal of this refactor is to preserve as much existing behavior
as possible, we rely on existing tests. As mentioned above, the one file
whose output changed during the ingest test is a net positive change.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,7 +1,9 @@
-## 0.16.21-dev1
+## 0.16.21-dev2
 
 ### Enhancements
 
+- **use vectorized logic to merge inferred and extracted layouts**. Using the new `LayoutElements` data structure and numpy library to refactor the layout merging logic to improve compute performance as well as making logic more clear
+
 ### Features
 
 ### Fixes
diff --git a/test_unstructured_ingest/expected-structured-output/google-drive/recalibrating-risk-report.pdf.json b/test_unstructured_ingest/expected-structured-output/google-drive/recalibrating-risk-report.pdf.json
@@ -4773,9 +4773,71 @@
       }
     }
   },
+  {
+    "type": "Image",
+    "element_id": "b0197950e1af5c2aac10f5b67d61524a",
+    "text": "",
+    "metadata": {
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 8,
+      "data_source": {
+        "url": "https://drive.google.com/uc?id=1m1TUgyLv0hHdlsuL7DOWBAKQtvrhWNiV&export=download",
+        "record_locator": {
+          "file_id": "1m1TUgyLv0hHdlsuL7DOWBAKQtvrhWNiV"
+        },
+        "date_created": "1718723636.34",
+        "date_modified": "1676196572.0",
+        "permissions_data": [
+          {
+            "id": "anyoneWithLink",
+            "type": "anyone",
+            "kind": "drive#permission",
+            "role": "reader",
+            "allowFileDiscovery": false
+          },
+          {
+            "id": "18298851591250030956",
+            "displayName": "ingest@unstructured-ingest-test.iam.gserviceaccount.com",
+            "type": "user",
+            "kind": "drive#permission",
+            "photoLink": "https://lh3.googleusercontent.com/a/ACg8ocJok2KRwwYvrEDkeZVCYosHOMoa52GZa2qIIC1jScCRoFLHaQ=s64",
+            "emailAddress": "ingest@unstructured-ingest-test.iam.gserviceaccount.com",
+            "role": "writer",
+            "deleted": false,
+            "pendingOwner": false
+          },
+          {
+            "id": "04774006893477068632",
+            "displayName": "ryan",
+            "type": "user",
+            "kind": "drive#permission",
+            "photoLink": "https://lh3.googleusercontent.com/a-/ALV-UjXeWpu7QcZuYqIl3p1mwqzS8XGFJ4RqA3Xjljfkm1DcFZ9M7A=s64",
+            "emailAddress": "ryan@unstructured.io",
+            "role": "writer",
+            "deleted": false,
+            "pendingOwner": false
+          },
+          {
+            "id": "09147371668407854156",
+            "displayName": "roman",
+            "type": "user",
+            "kind": "drive#permission",
+            "photoLink": "https://lh3.googleusercontent.com/a-/ALV-UjWoGrFCgXcF6CtiBIBLnAfM68qUnQaJOcgvg3qzfQ3W8Ch6dA=s64",
+            "emailAddress": "roman@unstructured.io",
+            "role": "owner",
+            "deleted": false,
+            "pendingOwner": false
+          }
+        ]
+      }
+    }
+  },
   {
     "type": "FigureCaption",
-    "element_id": "7803862f2804d04dfe8c38c4a353001d",
+    "element_id": "83578f6774acfda63a6aeaba61c1338b",
     "text": "Equally, it is well established that living without access to electricity results in illness and death around the world, caused by everything from not having access to modern healthcare to household air pollution. As of today, 770 million people around the world do not have access to electricity, with over 75% of that population living in Sub-Saharan Africa. The world's poorest 4 billion people consume a mere 5% of the energy used in developed economies, and we need to find ways of delivering reliable electricity to the entire human population in a fashion that is sustainable. Household and ambient air pollution causes 8.7 million deaths each year, largely because of the continued use of fossil fuels. Widespread electrification is a key tool for delivering a just energy transition. Investment in nuclear, has become an urgent necessity. Discarding it, based on risk perceptions divorced from science, would be to abandon the moral obligation to ensure affordable, reliable, and sustainable energy for every community around the world.",
     "metadata": {
       "filetype": "application/pdf",
diff --git a/test_unstructured_ingest/expected-structured-output/local-single-file-with-pdf-infer-table-structure/layout-parser-paper.pdf.json b/test_unstructured_ingest/expected-structured-output/local-single-file-with-pdf-infer-table-structure/layout-parser-paper.pdf.json
@@ -360,8 +360,8 @@
   },
   {
     "type": "Title",
-    "element_id": "bcb94891b0d7a997ab7e28d99195ff37",
-    "text": "Introduction",
+    "element_id": "3a170066f972d25cc303a05ddc16d52c",
+    "text": "1 Introduction",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
@@ -1600,9 +1600,9 @@
     }
   },
   {
-    "type": "NarrativeText",
-    "element_id": "2f41c1732a2870b1fecd72dec1b2ff3d",
-    "text": "1 import layoutparser as lp 2 image = cv2 . imread ( \" image_file \" ) # load images 3 model = lp . De t e c tro n2 Lay outM odel ( 4 \" lp :// PubLayNet / f as t er _ r c nn _ R _ 50 _ F P N_ 3 x / config \" ) 5 layout = model . detect ( image )",
+    "type": "ListItem",
+    "element_id": "508a6705bb0bfb693616cc14fec5e1b9",
+    "text": "1 import layoutparser as lp",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
@@ -1622,9 +1622,9 @@
     }
   },
   {
-    "type": "ListItem",
-    "element_id": "53b448c75f1556b1f60b4e3324bd0724",
-    "text": "1 import layoutparser as lp",
+    "type": "NarrativeText",
+    "element_id": "c2af717e76ad68bd6da87a15a69f126a",
+    "text": "2 image = cv2 . imread ( \" image_file \" ) # load images 3 model = lp . De t e c tro n2 Lay outM odel ( 4 \" lp :// PubLayNet / f as t er _ r c nn _ R _ 50 _ F P N_ 3 x / config \" )",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
diff --git a/unstructured/__version__.py b/unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.16.21-dev1"  # pragma: no cover
+__version__ = "0.16.21-dev2"  # pragma: no cover
diff --git a/unstructured/partition/pdf_image/ocr.py b/unstructured/partition/pdf_image/ocr.py
@@ -357,6 +357,13 @@ def merge_out_layout_with_ocr_layout(
     supplemented with the OCR layout.
     """
 
+    if len(out_layout) == 0 or len(ocr_layout) == 0:
+        # what if od model finds nothing but ocr finds something? should we use ocr output at all
+        # currently we require some kind of bounding box, from `out_layout` to aggregate ocr
+        # results. Can we just use ocr bounding boxes (gonna be many but at least we save
+        # information)
+        return out_layout
+
     invalid_text_indices = [i for i, text in enumerate(out_layout.texts) if not valid_text(text)]
     out_layout.texts = out_layout.texts.astype(object)
 
@@ -434,18 +441,24 @@ def supplement_layout_with_ocr_elements(
         build_layout_elements_from_ocr_regions,
     )
 
-    mask = (
-        ~bboxes1_is_almost_subregion_of_bboxes2(
-            ocr_layout.element_coords, layout.element_coords, subregion_threshold
+    if len(layout) == 0:
+        if len(ocr_layout) == 0:
+            return layout
+        else:
+            ocr_regions_to_add = ocr_layout
+    else:
+        mask = (
+            ~bboxes1_is_almost_subregion_of_bboxes2(
+                ocr_layout.element_coords, layout.element_coords, subregion_threshold
+            )
+            .sum(axis=1)
+            .astype(bool)
         )
-        .sum(axis=1)
-        .astype(bool)
-    )
 
-    # add ocr regions that are not covered by layout
-    ocr_regions_to_add = ocr_layout.slice(mask)
+        # add ocr regions that are not covered by layout
+        ocr_regions_to_add = ocr_layout.slice(mask)
 
-    if sum(mask):
+    if len(ocr_regions_to_add):
         ocr_elements_to_add = build_layout_elements_from_ocr_regions(ocr_regions_to_add)
         final_layout = LayoutElements.concatenate([layout, ocr_elements_to_add])
     else:
diff --git a/unstructured/partition/pdf_image/pdfminer_processing.py b/unstructured/partition/pdf_image/pdfminer_processing.py