Commit 85bfb1b

Feat/vectorize layout merging (Unstructured-IO#3900)
This PR rewrites the logic in `unstructured_inference` that merges the extracted layout with the inferred layout, using vectorized operations. The goals are to:
- vectorize the operation to improve memory and CPU efficiency
- apply the logic uniformly so that order is not a factor in the merging results (the `unstructured_inference` version uses loops and modifies the contents of the inner loop on the fly, so the order of the outer loop, which is the order of the extracted elements, affects the results)
- rewrite the loop into clear steps with clear rules
- set the stage for follow-up improvements

While this PR aims to reproduce the existing behavior as closely as possible, it is not an exact replica of the looped version. Because order is no longer a factor, some extracted elements that were previously not considered part of a larger inferred element (because the processing order was not optimal) are now properly merged. This leads to changes in one ingest test. For example, the section number is now properly merged with the section title into the full title element.

## Test:
Since the goal of this refactor is to preserve as much existing behavior as possible, we rely on existing tests. As mentioned above, the one file whose output changed during the ingest test is a net positive change.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
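The order-independence described above can be illustrated with a small NumPy sketch. This is not the code from this PR: the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions made for illustration. The sketch computes, for every extracted box against every inferred box at once, whether the extracted box lies mostly inside the inferred one. Because no box is modified while the comparison runs, reordering the extracted boxes only reorders the rows of the result.

```python
import numpy as np


def coverage_matrix(boxes_a, boxes_b):
    """Fraction of each box in `boxes_a` covered by each box in `boxes_b`.

    Boxes are rows of (x1, y1, x2, y2). Returns an array of shape (n_a, n_b).
    Hypothetical helper, not part of `unstructured` or `unstructured_inference`.
    """
    a = np.asarray(boxes_a, dtype=float)[:, None, :]  # shape (n_a, 1, 4)
    b = np.asarray(boxes_b, dtype=float)[None, :, :]  # shape (1, n_b, 4)
    # Width and height of every pairwise intersection, clipped at zero.
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    # Guard against zero-area boxes when dividing.
    return inter_w * inter_h / np.maximum(area_a, 1e-9)


def extracted_inside_inferred(extracted, inferred, threshold=0.5):
    """Boolean (n_extracted, n_inferred) matrix; the threshold is an assumed value."""
    return coverage_matrix(extracted, inferred) > threshold


extracted = np.array([[10, 10, 20, 20], [12, 30, 60, 40], [200, 200, 210, 210]])
inferred = np.array([[0, 0, 100, 50]])
inside = extracted_inside_inferred(extracted, inferred)

# Shuffling the extracted boxes shuffles the rows, but each row's answer is unchanged.
order = [2, 0, 1]
assert (extracted_inside_inferred(extracted[order], inferred) == inside[order]).all()
```

A loop that mutates the inferred boxes as it merges each extracted box would not, in general, satisfy the final assertion, which is the order dependence the PR description refers to.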
1 parent 3886dd4 commit 85bfb1b

6 files changed: +436 -34 lines changed

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,7 +1,9 @@
-## 0.16.21-dev1
+## 0.16.21-dev2

 ### Enhancements

+- **use vectorized logic to merge inferred and extracted layouts**. Using the new `LayoutElements` data structure and numpy library to refactor the layout merging logic to improve compute performance as well as making logic more clear
+
 ### Features

 ### Fixes

test_unstructured_ingest/expected-structured-output/google-drive/recalibrating-risk-report.pdf.json

Lines changed: 63 additions & 1 deletion
@@ -4773,9 +4773,71 @@
       }
     }
   },
+  {
+    "type": "Image",
+    "element_id": "b0197950e1af5c2aac10f5b67d61524a",
+    "text": "",
+    "metadata": {
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 8,
+      "data_source": {
+        "url": "https://drive.google.com/uc?id=1m1TUgyLv0hHdlsuL7DOWBAKQtvrhWNiV&export=download",
+        "record_locator": {
+          "file_id": "1m1TUgyLv0hHdlsuL7DOWBAKQtvrhWNiV"
+        },
+        "date_created": "1718723636.34",
+        "date_modified": "1676196572.0",
+        "permissions_data": [
+          {
+            "id": "anyoneWithLink",
+            "type": "anyone",
+            "kind": "drive#permission",
+            "role": "reader",
+            "allowFileDiscovery": false
+          },
+          {
+            "id": "18298851591250030956",
+            "displayName": "ingest@unstructured-ingest-test.iam.gserviceaccount.com",
+            "type": "user",
+            "kind": "drive#permission",
+            "photoLink": "https://lh3.googleusercontent.com/a/ACg8ocJok2KRwwYvrEDkeZVCYosHOMoa52GZa2qIIC1jScCRoFLHaQ=s64",
+            "emailAddress": "ingest@unstructured-ingest-test.iam.gserviceaccount.com",
+            "role": "writer",
+            "deleted": false,
+            "pendingOwner": false
+          },
+          {
+            "id": "04774006893477068632",
+            "displayName": "ryan",
+            "type": "user",
+            "kind": "drive#permission",
+            "photoLink": "https://lh3.googleusercontent.com/a-/ALV-UjXeWpu7QcZuYqIl3p1mwqzS8XGFJ4RqA3Xjljfkm1DcFZ9M7A=s64",
+            "emailAddress": "ryan@unstructured.io",
+            "role": "writer",
+            "deleted": false,
+            "pendingOwner": false
+          },
+          {
+            "id": "09147371668407854156",
+            "displayName": "roman",
+            "type": "user",
+            "kind": "drive#permission",
+            "photoLink": "https://lh3.googleusercontent.com/a-/ALV-UjWoGrFCgXcF6CtiBIBLnAfM68qUnQaJOcgvg3qzfQ3W8Ch6dA=s64",
+            "emailAddress": "roman@unstructured.io",
+            "role": "owner",
+            "deleted": false,
+            "pendingOwner": false
+          }
+        ]
+      }
+    }
+  },
   {
     "type": "FigureCaption",
-    "element_id": "7803862f2804d04dfe8c38c4a353001d",
+    "element_id": "83578f6774acfda63a6aeaba61c1338b",
     "text": "Equally, it is well established that living without access to electricity results in illness and death around the world, caused by everything from not having access to modern healthcare to household air pollution. As of today, 770 million people around the world do not have access to electricity, with over 75% of that population living in Sub-Saharan Africa. The world's poorest 4 billion people consume a mere 5% of the energy used in developed economies, and we need to find ways of delivering reliable electricity to the entire human population in a fashion that is sustainable. Household and ambient air pollution causes 8.7 million deaths each year, largely because of the continued use of fossil fuels. Widespread electrification is a key tool for delivering a just energy transition. Investment in nuclear, has become an urgent necessity. Discarding it, based on risk perceptions divorced from science, would be to abandon the moral obligation to ensure affordable, reliable, and sustainable energy for every community around the world.",
     "metadata": {
       "filetype": "application/pdf",

test_unstructured_ingest/expected-structured-output/local-single-file-with-pdf-infer-table-structure/layout-parser-paper.pdf.json

Lines changed: 8 additions & 8 deletions
@@ -360,8 +360,8 @@
   },
   {
     "type": "Title",
-    "element_id": "bcb94891b0d7a997ab7e28d99195ff37",
-    "text": "Introduction",
+    "element_id": "3a170066f972d25cc303a05ddc16d52c",
+    "text": "1 Introduction",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
@@ -1600,9 +1600,9 @@
     }
   },
   {
-    "type": "NarrativeText",
-    "element_id": "2f41c1732a2870b1fecd72dec1b2ff3d",
-    "text": "1 import layoutparser as lp 2 image = cv2 . imread ( \" image_file \" ) # load images 3 model = lp . De t e c tro n2 Lay outM odel ( 4 \" lp :// PubLayNet / f as t er _ r c nn _ R _ 50 _ F P N_ 3 x / config \" ) 5 layout = model . detect ( image )",
+    "type": "ListItem",
+    "element_id": "508a6705bb0bfb693616cc14fec5e1b9",
+    "text": "1 import layoutparser as lp",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
@@ -1622,9 +1622,9 @@
     }
   },
   {
-    "type": "ListItem",
-    "element_id": "53b448c75f1556b1f60b4e3324bd0724",
-    "text": "1 import layoutparser as lp",
+    "type": "NarrativeText",
+    "element_id": "c2af717e76ad68bd6da87a15a69f126a",
+    "text": "2 image = cv2 . imread ( \" image_file \" ) # load images 3 model = lp . De t e c tro n2 Lay outM odel ( 4 \" lp :// PubLayNet / f as t er _ r c nn _ R _ 50 _ F P N_ 3 x / config \" )",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.16.21-dev1" # pragma: no cover
+__version__ = "0.16.21-dev2" # pragma: no cover

unstructured/partition/pdf_image/ocr.py

Lines changed: 22 additions & 9 deletions
@@ -357,6 +357,13 @@ def merge_out_layout_with_ocr_layout(
     supplemented with the OCR layout.
     """

+    if len(out_layout) == 0 or len(ocr_layout) == 0:
+        # what if od model finds nothing but ocr finds something? should we use ocr output at all
+        # currently we require some kind of bounding box, from `out_layout` to aggregate ocr
+        # results. Can we just use ocr bounding boxes (gonna be many but at least we save
+        # information)
+        return out_layout
+
     invalid_text_indices = [i for i, text in enumerate(out_layout.texts) if not valid_text(text)]
     out_layout.texts = out_layout.texts.astype(object)

@@ -434,18 +441,24 @@ def supplement_layout_with_ocr_elements(
         build_layout_elements_from_ocr_regions,
     )

-    mask = (
-        ~bboxes1_is_almost_subregion_of_bboxes2(
-            ocr_layout.element_coords, layout.element_coords, subregion_threshold
+    if len(layout) == 0:
+        if len(ocr_layout) == 0:
+            return layout
+        else:
+            ocr_regions_to_add = ocr_layout
+    else:
+        mask = (
+            ~bboxes1_is_almost_subregion_of_bboxes2(
+                ocr_layout.element_coords, layout.element_coords, subregion_threshold
+            )
+            .sum(axis=1)
+            .astype(bool)
         )
-        .sum(axis=1)
-        .astype(bool)
-    )

-    # add ocr regions that are not covered by layout
-    ocr_regions_to_add = ocr_layout.slice(mask)
+        # add ocr regions that are not covered by layout
+        ocr_regions_to_add = ocr_layout.slice(mask)

-    if sum(mask):
+    if len(ocr_regions_to_add):
         ocr_elements_to_add = build_layout_elements_from_ocr_regions(ocr_regions_to_add)
         final_layout = LayoutElements.concatenate([layout, ocr_elements_to_add])
     else:
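
The control flow of the `supplement_layout_with_ocr_elements` hunk above can be sketched on plain NumPy arrays. This is an illustrative stand-in rather than the library code: `is_almost_subregion` is a hypothetical replacement for `bboxes1_is_almost_subregion_of_bboxes2`, boxes are assumed to be `(x1, y1, x2, y2)` rows, and the default threshold of 0.5 is an assumption. The sketch keeps every OCR box when the layout is empty and otherwise keeps only the OCR boxes that no layout box covers.

```python
import numpy as np


def is_almost_subregion(boxes1, boxes2, threshold=0.5):
    """Hypothetical stand-in: (n1, n2) booleans, True where a box in `boxes1`
    has more than `threshold` of its area inside a box in `boxes2`."""
    b1 = np.asarray(boxes1, dtype=float).reshape(-1, 4)[:, None, :]
    b2 = np.asarray(boxes2, dtype=float).reshape(-1, 4)[None, :, :]
    w = np.clip(np.minimum(b1[..., 2], b2[..., 2]) - np.maximum(b1[..., 0], b2[..., 0]), 0, None)
    h = np.clip(np.minimum(b1[..., 3], b2[..., 3]) - np.maximum(b1[..., 1], b2[..., 1]), 0, None)
    area1 = (b1[..., 2] - b1[..., 0]) * (b1[..., 3] - b1[..., 1])
    return w * h / np.maximum(area1, 1e-9) > threshold


def uncovered_ocr_boxes(ocr_boxes, layout_boxes, threshold=0.5):
    """Return the OCR boxes that should be added to the layout."""
    ocr_boxes = np.asarray(ocr_boxes, dtype=float).reshape(-1, 4)
    layout_boxes = np.asarray(layout_boxes, dtype=float).reshape(-1, 4)
    if len(layout_boxes) == 0:
        # Nothing to compare against: keep every OCR box (possibly none),
        # mirroring the `len(layout) == 0` branch in the diff.
        return ocr_boxes
    # True for OCR boxes that sit inside at least one layout box.
    covered = is_almost_subregion(ocr_boxes, layout_boxes, threshold).any(axis=1)
    return ocr_boxes[~covered]


ocr = [[0, 0, 10, 10], [50, 50, 60, 60]]
layout = [[0, 0, 20, 20]]
to_add = uncovered_ocr_boxes(ocr, layout)  # only the second OCR box is uncovered
```

Here `.any(axis=1)` plays the role of `.sum(axis=1).astype(bool)` in the diff. Checking the length of the selected boxes, rather than `sum(mask)`, works in the empty-layout branch too, where no mask is computed.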
