From 734240be6b4ef721f0073e06052e4aee11389ca5 Mon Sep 17 00:00:00 2001 From: Philippe Prados Date: Tue, 15 Oct 2024 13:50:30 +0200 Subject: [PATCH 1/9] Add password for pdf --- CHANGELOG.md | 62 +++++++------- example-docs/pdf/password.pdf | Bin 0 -> 14179 bytes .../partition/pdf_image/test_pdf.py | 77 +++++++++++++----- unstructured/__version__.py | 2 +- unstructured/partition/image.py | 4 + unstructured/partition/pdf.py | 29 ++++++- unstructured/partition/pdf_image/ocr.py | 4 + .../partition/pdf_image/pdf_image_utils.py | 11 ++- .../pdf_image/pdfminer_processing.py | 6 +- .../partition/pdf_image/pdfminer_utils.py | 5 +- 10 files changed, 140 insertions(+), 60 deletions(-) create mode 100644 example-docs/pdf/password.pdf diff --git a/CHANGELOG.md b/CHANGELOG.md index bd54ff4a40..1fb75b081c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,6 +1,7 @@ -## 0.16.19-dev2 +## 0.16.19-dev3 ### Enhancements +- **Use password** to load PDF with all modes ### Features @@ -504,6 +505,7 @@ ### Features * **Expose conversion functions for tables** Adds public functions to convert tables from HTML to the Deckerd format and back + * **Adds Kafka Source and Destination** New source and destination connector added to all CLI ingest commands to support reading from and writing to Kafka streams. Also supports Confluent Kafka. ### Fixes @@ -550,7 +552,7 @@ * **Move logger error to debug level when PDFminer fails to extract text** which includes error message for Invalid dictionary construct. * **Add support for Pinecone serverless** Adds Pinecone serverless to the connector tests. Pinecone - serverless will work version versions >=0.14.2, but hadn't been tested until now. + serverless will work version versions >=0.14.2, but hadn't been tested until now. ### Features @@ -633,7 +635,6 @@ * **Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR** configuration parameteres to control temporary storage. 
### Features - * **Add form extraction basics (document elements and placeholder code in partition)**. This is to lay the ground work for the future. Form extraction models are not currently available in the library. An attempt to use this functionality will end in a `NotImplementedError`. ### Fixes @@ -811,8 +812,8 @@ ### Enhancements ### Features - * Add `date_from_file_object` parameter to partition. If True and if file is provided via `file` parameter it will cause partition to infer last modified date from `file`'s content. If False, last modified metadata will be `None`. + * **Header and footer detection for fast strategy** `partition_pdf` with `fast` strategy now detects elements that are in the top or bottom 5 percent of the page as headers and footers. * **Add parent_element to overlapping case output** Adds parent_element to the output for `identify_overlapping_or_nesting_case` and `catch_overlapping_and_nested_bboxes` functions. @@ -831,6 +832,7 @@ * **Rename `OpenAiEmbeddingConfig` to `OpenAIEmbeddingConfig`.** * **Fix partition_json() doesn't chunk.** The `@add_chunking_strategy` decorator was missing from `partition_json()` such that pre-partitioned documents serialized to JSON did not chunk when a chunking-strategy was specified. + ## 0.12.4 ### Enhancements @@ -859,6 +861,7 @@ * **Add title to Vectara upload - was not separated out from initial connector ** * **Fix change OpenSearch port to fix potential conflict with Elasticsearch in ingest test ** + ## 0.12.3 ### Enhancements @@ -911,7 +914,6 @@ * **Install Kapa AI chatbot.** Added Kapa.ai website widget on the documentation. ### Features - * **MongoDB Source Connector.** New source connector added to all CLI ingest commands to support downloading/partitioning files from MongoDB. * **Add OpenSearch source and destination connectors.** OpenSearch, a fork of Elasticsearch, is a popular storage solution for various functionality such as search, or providing intermediary caches within data pipelines. 
Feature: Added OpenSearch source connector to support downloading/partitioning files. Added OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch. @@ -1090,8 +1092,8 @@ * **Import tables_agent from inference** so that we don't have to initialize a global table agent in unstructured OCR again * **Fix empty table is identified as bulleted-table.** A table with no text content was mistakenly identified as a bulleted-table and processed by the wrong branch of the initial HTML partitioner. * **Fix partition_html() emits empty (no text) tables.** A table with cells nested below a `` or `` element was emitted as a table element having no text and unparseable HTML in `element.metadata.text_as_html`. Do not emit empty tables to the element stream. -* **Fix HTML `element.metadata.text_as_html` contains spurious `
<br/>` elements in invalid locations.** The HTML generated for the `text_as_html` metadata for HTML tables contained `<br/>` elements in invalid locations like between `<table>` and `<tr>`. Change the HTML generator such that these do not appear.
-* **Fix HTML table cells enclosed in `<thead>` and `<tfoot>` elements are dropped.** HTML table cells nested in a `<thead>` or `<tfoot>` element were not detected and the text in those cells was omitted from the table element text and `.text_as_html`. Detect table rows regardless of the semantic tag they may be nested in.
+* **Fix HTML `element.metadata.text_as_html` contains spurious `<br/>` elements in invalid locations.** The HTML generated for the `text_as_html` metadata for HTML tables contained `<br/>` elements in invalid locations like between `<table>
` and `<tr>`. Change the HTML generator such that these do not appear.
+* **Fix HTML table cells enclosed in `<thead>` and `<tfoot>` elements are dropped.** HTML table cells nested in a `<thead>` or `<tfoot>` element were not detected and the text in those cells was omitted from the table element text and `.text_as_html`. Detect table rows regardless of the semantic tag they may be nested in.
* **Remove whitespace padding from `.text_as_html`.** `tabulate` inserts padding spaces to achieve visual alignment of columns in HTML tables it generates. Add our own HTML generator to do this simple job and omit that padding as well as newlines ("\n") used for human readability.
* **Fix local connector with absolute input path** When passed an absolute filepath for the input document path, the local connector incorrectly writes the output file to the input file directory. This fixes it so that the output in this case is written to `output-dir/input-filename.json`
@@ -1159,8 +1161,8 @@
* **Update `ocr_only` strategy in `partition_pdf()`** Adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the `ocr_only` strategy.

### Fixes
-
* **Fixed SharePoint permissions for the fetching to be opt-in** Problem: SharePoint permissions were being fetched even when no related CLI params were provided, and this gave an error due to values for those keys not existing. Fix: Updated the key lookups to use the .get() method and changed the "skip-check" to check individual CLI params rather than checking the existence of a config object.
+
* **Fixes issue where tables from markdown documents were being treated as text** Problem: Tables from markdown documents were being treated as text, and not being extracted as tables. Solution: Enable the `tables` extension when instantiating the `python-markdown` object. Importance: This will allow users to extract structured data from tables in markdown documents.
* **Fix wrong logger for paddle info** Replace the logger from unstructured-inference with the logger from unstructured for the paddle_ocr.py module.
* **Fix ingest pipeline to be able to use chunking and embedding together** Problem: When the ingest pipeline used chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and forwarded to embeddings. Fix: Added the CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP to be able to process CompositeElements with unstructured.staging.base.isd_to_elements
@@ -1213,7 +1215,7 @@
### Features
* **Table OCR refactor** support Table OCR with pre-computed OCR data to ensure we only do one OCR pass for the entire document. User can specify
- ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the entire document.
+ocr agent tesseract/paddle in environment variable `OCR_AGENT` for OCRing the entire document.
* **Adds accuracy function** The accuracy scoring was originally an option under `calculate_edit_distance`. For easy function call, it is now a wrapper around the original function that calls edit_distance and returns it as "score".
* **Adds HuggingFaceEmbeddingEncoder** The HuggingFace Embedding Encoder uses a local embedding model as opposed to using an API.
* **Add AWS bedrock embedding connector** `unstructured.embed.bedrock` now provides a connector to use AWS bedrock's `titan-embed-text` model to generate embeddings for elements. This feature requires a valid AWS bedrock setup and an internet connection to run.
@@ -1244,7 +1246,7 @@
### Fixes
* **Fix paddle model file not discoverable** Fixes issue where the ocr_models/paddle_ocr.py file is not discoverable on PyPI by adding
- an `__init__.py` file under the folder.
+an `__init__.py` file under the folder.
* **Chipper v2 Fixes** Includes fix for a memory leak and rare last-element bbox fix.
(unstructured-inference==0.7.7)
* **Fix image resizing issue** Includes fix related to resizing images in the tables pipeline. (unstructured-inference==0.7.6)
@@ -1306,13 +1308,12 @@
* **Applies `max_characters=<n>` argument to all element types in `add_chunking_strategy` decorator** Previously this argument was only utilized in chunking Table elements and now applies to all partitioned elements if the `add_chunking_strategy` decorator is utilized, further preparing the elements for downstream processing.
* **Add common retry strategy utilities for unstructured-ingest** Dynamic retry strategy with exponential backoff added to Notion source connector.
*
-
### Features
* **Adds `bag_of_words` and `percent_missing_text` functions** In order to count the word frequencies in two input texts and calculate the percentage of text missing relative to the source document.
* **Adds `edit_distance` calculation metrics** In order to benchmark the cleaned, extracted text with unstructured, `edit_distance` (`Levenshtein distance`) is included.
* **Adds detection_origin field to metadata** Problem: Currently there isn't an easy way to find out how an element was created. With this change that information is added. Importance: With this information the developers and users are now able to know how an element was created to make decisions on how to use it. In order to use this feature
- setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
+setting UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is needed.
* **Adds a function that calculates frequency of the element type and its depth** To capture the accuracy of element type extraction, this function counts the occurrences of each unique element type with its depth for use in element metrics.

### Fixes
@@ -1322,10 +1323,11 @@
* **Fixes category_depth None value for Title elements** Problem: `Title` elements from `chipper` get `category_depth` = None even when `Headline` and/or `Subheadline` elements are present in the same page.
Fix: all `Title` elements with `category_depth` = None should be set to have a depth of 0 instead iff there are `Headline` and/or `Subheadline` element-types present. Importance: `Title` elements should be equivalent to html `H1` when nested headings are present; otherwise, `category_depth` metadata can be ambiguous within elements in a page.
* **Tweak `xy-cut` ordering output to be more column friendly** This results in the order of elements more closely reflecting natural reading order, which benefits downstream applications. While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.
* **Fixes badly initialized Formula** Problem: YoloX contains new types of elements; when loading a document that contains formulas a new element of that class
- should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
- allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.
+should be generated, however the Formula class inherits from Element instead of Text. After this change the element is correctly created with the correct class
+allowing the document to be loaded. Fix: Change parent class for Formula to Text. Importance: Crucial to be able to load documents that contain formulas.
* **Fixes pdf uri error** An error was encountered when a URI of type `GoToR`, which refers to pdf resources outside of its own file, was detected, since no condition catches such a case. The code fixes the issue by initializing URI before any condition check.
+
## 0.10.19

### Enhancements

* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.
* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
-* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length `<n>` characters. This means partitioned Table results are ready for use in downstream applications without any post processing.
+* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length `<n>` characters. This means partitioned Table results are ready for use in downstream applications without any post processing.
* **Expose endpoint url for s3 connectors** By allowing for the endpoint url to be explicitly overwritten, this allows for any non-AWS data providers supporting the s3 protocol to be supported (i.e. minio).
### Features @@ -1402,6 +1404,7 @@ ## 0.10.15 + ### Enhancements * **Support for better element categories from the next-generation image-to-text model ("chipper").** Previously, not all of the classifications from Chipper were being mapped to proper `unstructured` element categories so the consumer of the library would see many `UncategorizedText` elements. This fixes the issue, improving the granularity of the element categories outputs for better downstream processing and chunking. The mapping update is: @@ -1475,6 +1478,7 @@ * Add Jira Connector to be able to pull issues from a Jira organization * Add `clean_ligatures` function to expand ligatures in text + ### Fixes * `partition_html` breaks on `
<br/>` elements.
@@ -1492,12 +1496,14 @@
* Support for yolox_quantized layout detection model (0.5.20)
* YoloX element types added
+
### Features

* Add Salesforce Connector to be able to pull Account, Case, Campaign, EmailMessage, Lead

### Fixes
+
* Bump unstructured-inference
* Avoid divide-by-zero errors with `safe_division` (0.5.21)
@@ -1618,18 +1624,15 @@
* Adds ability to reuse connections per process in unstructured-ingest

### Features
-
* Add delta table connector

### Fixes

## 0.10.4
-
* Pass ocr_mode in partition_pdf and set the default back to individual pages for now
* Add diagrams and descriptions for ingest design in the ingest README

### Features
-
* Supports multipage TIFF image partitioning

### Fixes
@@ -1637,7 +1640,6 @@
## 0.10.2

### Enhancements
-
* Bump unstructured-inference==0.5.13:
- Fix extracted image elements being included in layout merge, addresses the issue where an entire-page image in a PDF was not passed to the layout model when using hi_res.
@@ -1649,7 +1651,6 @@
## 0.10.1

### Enhancements
-
* Bump unstructured-inference==0.5.12:
- fix to avoid trace for certain PDF's (0.5.12)
- better defaults for DPI for hi_res and Chipper (0.5.11)
@@ -1701,6 +1702,7 @@
## 0.9.2
+
### Enhancements

* Update table extraction section in API documentation to sync with change in Prod API
@@ -1747,7 +1749,7 @@
* Skip ingest test on missing Slack token
* Add Dropbox variables to CI environments
* Remove default encoding for ingest
-* Adds new element type `EmailAddress` for recognising email address in the  text
+* Adds new element type `EmailAddress` for recognising email address in the text
* Simplifies `min_partition` logic; makes partitions falling below the `min_partition` less likely.
* Fix bug where ingest test check for number of files fails in smoke test @@ -1879,6 +1881,7 @@ * Adjust encoding recognition threshold value in `detect_file_encoding` * Fix KeyError when `isd_to_elements` doesn't find a type * Fix `_output_filename` for local connector, allowing single files to be written correctly to the disk + * Fix for cases where an invalid encoding is extracted from an email header. ### BREAKING CHANGES @@ -1890,7 +1893,6 @@ ### Enhancements * Adds `include_metadata` kwarg to `partition_doc`, `partition_docx`, `partition_email`, `partition_epub`, `partition_json`, `partition_msg`, `partition_odt`, `partition_org`, `partition_pdf`, `partition_ppt`, `partition_pptx`, `partition_rst`, and `partition_rtf` - ### Features * Add Elasticsearch connector for ingest cli to pull specific fields from all documents in an index. @@ -2125,8 +2127,10 @@ ### Features + ### Fixes + ## 0.6.10 ### Enhancements @@ -2223,6 +2227,7 @@ ### Fixes + ## 0.6.4 ### Enhancements @@ -2259,6 +2264,7 @@ * Added logic to `partition_pdf` for detecting copy protected PDFs and falling back to the hi res strategy when necessary. + ### Features * Add `partition_via_api` for partitioning documents through the hosted API. @@ -2329,8 +2335,8 @@ * Added method to utils to allow date time format validation ### Features - * Add Slack connector to pull messages for a specific channel + * Add --partition-by-api parameter to unstructured-ingest * Added `partition_rtf` for processing rich text files. * `partition` now accepts a `url` kwarg in addition to `file` and `filename`. @@ -2460,7 +2466,7 @@ ### Features * Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting - from `FsspecConnector` +from `FsspecConnector` * Add `partition_epub` for partitioning e-books in EPUB3 format. ### Fixes @@ -2493,16 +2499,16 @@ * Fully move from printing to logging. 
* `unstructured-ingest` now uses a default `--download_dir` of `$HOME/.cache/unstructured/ingest` - rather than a "tmp-ingest-" dir in the working directory. +rather than a "tmp-ingest-" dir in the working directory. ### Features ### Fixes * `setup_ubuntu.sh` no longer fails in some contexts by interpreting - `DEBIAN_FRONTEND=noninteractive` as a command +`DEBIAN_FRONTEND=noninteractive` as a command * `unstructured-ingest` no longer re-downloads files when --preserve-downloads - is used without --download-dir. +is used without --download-dir. * Fixed an issue that was causing text to be skipped in some HTML documents. ## 0.5.1 @@ -2679,7 +2685,7 @@ * Add ability to extract document metadata from `.docx`, `.xlsx`, and `.jpg` files. * Helper functions for identifying and extracting phone numbers * Add new function `extract_attachment_info` that extracts and decodes the attachment - of an email. +of an email. * Staging brick to convert a list of `Element`s to a `pandas` dataframe. * Add plain text functionality to `partition_email` diff --git a/example-docs/pdf/password.pdf b/example-docs/pdf/password.pdf new file mode 100644 index 0000000000000000000000000000000000000000..40f63f2af52c040fa5d4303fe3057843ab4b0137 GIT binary patch literal 14179 zcmeHu2{hF0`?n=x6tb6OB3Z|5X2!@iW6zpBOW6iv8)n8XTXvF&NU|hLi>iPZO_y0ccdCqyy|2*GwhWoz0*L_{reO=dmb3UIlVn*tk5V$M` zB-T{ebg!wpDHntQ!@wj5H;|$tRFB~0LUjcr0f{kGljuPukfE9$_Eds8!I9)dP*w&} zsAPh@C#b1jzPMoH4!UG|kDNRTo$KE8{J#I57k%uuxh~#D<+mF0qrzKbpMpxu@*j>q zczst(`o=!{CFa6;(;>UOssJIj3rgbGD|!)!y7|u&A;a;B=&EditX!P<+_0clQe8x} z$m!AT+Z@tUqMgTl3BwgN0z9#dAvss~?wVhaWngbw(Q5>FRF7L*R7 z%s>P$r|$s*Bif!ow5`wva1@O8c1wo&%ZBCuiw%EI2pkNDA(7}mx4*Qel*=|QHlT|f9Su>OGKNuI8D*=V&s&!%eUR-ujB!G#C6HL+=B~uhkG9@A zJ1|}axz4#+VMmGznPs@0J1R$pN+w@=UM{V$N~AM=V;u}J8`0d)vRk#0-@S)l$wh5@ zRms=|d8?4Znw7@^k;A--piX&*k)By~)Wu-}HOm+wzu&{JFqK#f_GzoKCEOldw zriz8Zk-e~rYx@KYu*Z2wvmn*3e&ds6N5s36M~05FsAfc*xZ@`cw@oMEm8%vWJTVGu zQ9IeTEJQcB+PR$*wmAP# 
z`bdMwA;xAyho@qKo!Fg^<8S+~SPN>Ih3hK!GK0l>Dc8Uo;(lk^l9dwtlRTI%$j2?M zo_wg==_d%v#(mv*QSD&I`q(=TvY*AQ4FzFp{#ynZ~}_r2gj< z){%yF@2-q-><7;nzYc=rT{-ij<5a{D?g>treUUDoSIIQ7=!pdm-~Yw!oMSYK|Wv|m*-AHGyTMrZ_BVP4Fv z5{4^nM@PCV@REzCRP3l%ULFqKwoXzFA<7n*ntZ+wJE50lh8okU_jvcv;(RZ1_Ooh~ z4N0Ny@yF%2i=Xt1n;)2MOS3l3GKU%H%-w>dZK>K<#7+Fi)q zW{;sTL7$g-QazH}py!CNMzvzUNY#8FKPi@~_uV{K?<_4nsu=mKPnpkVp0*pV;!Qap zYnosSEm`d?J5h^~jB}6_9MR#~iHDeItW4yTsJwbcH)-ns<>jDK-To1|3x!$N$I{|* zqfBkGw%Z1?F9&Ukil>fHcFuD=hVQCdao<&vdFPUtm%tGr94={1B>p&-X*So0^|#(t1G$dba-lfKLQH<%tF#hxyAx6cDv@r8&~; z>wMo*ZA_qX)nfQ+-@8<%kru~ycY94zV5FUm5pGiP$raoF*4D3u5y>uKPZ@-1m0KJwO}A9sJn!-KM*8KyuYZyRw+I zP99=xuzF)Y1}k`0mY>rNXyx|hrtZz{HAUEIoBWK*jx4VIK|&z?1;$FpNpQIe zm)SEI1`EoV%G}}iL|rHG<{qXX(7e+Zd$u;Cwn5>s8}sA2Z}_W52hkOJrEdwV+!tfx;z6R!fR9guVd*FT2MT)_+X7L;H8i7FId_teS zSRZY)gJZyFzOOQ`S}cg4Z}#D}C5!cfu92tI>VXBdeg5_>t5@?zL@y{y>IWHfi0b%F z7kOEfpMd!lFpXtJj8>c7ro4u}a5}OaaG7zoaB5)B2xqUkD7{|r0hEn4)AqTp@ z$jU5&4o1#N3(-Yx|LT-d&fdHKXzG3gDYC0;UGmeY9HeBO*{G65BaRT$y(6jIjHNo+ zk|aDgc|VWkn&5~PmqvfYUACS2s%LLjDap|$hC`5Z!k z&AU-|>~7Zqt4)6T1(?utsJ3DT+ zsa(jjPN6%r$_Agod_fveiRdTd9gQ(}}Ft@Gv zkGo*^vES|6MKDW!^NfLiP%cPZqQZNuTj?qDoA8EgJx#Gh6S!4xLm1=nL3r>Q{3t6( z(59~P;3PGT^|jbuyUzTOI|qC(q8jq#iqEqEwtL!aBvRNs+@>ir(1pdq$wT%Py=t5fi4e81z3B*EFm+)3knwwV{U_Oy zsXC2wXOH!i-A-@p*?19GLG|-=o3j6Nt&yh_{?K$OO=-J(^Xh)vS*?sL z8urTBJ!K@NiCM0z%tXBbsl-G~R1;@Gj>E+jPpH^Fo1y&2L&U2{)U!EuxFb7T#?g-S z-ZZPM?2RrxAG={N zB(N2{k9#1Dw-Vt0oOSvl>!rZ`2XxOfRj=18+xZkDq#H|&pIha*UCiEAOjbPhOpC-P zeD4nHZcG)Z@=<7oLU}dXXjr5tH5HM|5&|)G`A~V-HU85vgct`=HCR6Yw;}bwu=EQ) zZg?Aix^d(E*7?}PmaNl-jo@efv2*QVGmgR)w!-WgY&mCB$1u=|IFiJY>-+2Ba}#+| zbmM_mqNlWPIE(UGnz&Y_(!V~Mr^0C_Uz@3d^y6rNXuz3q9?TF4o)ce7_9+(4_7q*+ zq0mc!U!77Np2>1S4jv3Nl)bA^W_S^Cb-Pd3&2oyQm5zn*-u>*{SGIjI80VLQBPQU zcg&||@+(&+Fx@(8Y_z9e=U~enIcjCoKZZE5z{y$7D~|9SxwBZ&TIlg?-f;MEi!tI<3UDhA1yeobD`S?d={B<+*-LgZJQEMyx!zrW3 zCEs6*S}`Eh+U=RDHy&D) zR)xi&WZY&O4KDM4J$)_)sh2%E~3!kdTp zIX8BO;bww~wrA#^r8P2Q7DYN&TOA+F9kEf$Bafe4^AHeZG=FB)0a+QH%Hb}3o_V%} 
z&zoZ|FR?GGm-ph5vu3gfsvHrbAf^47F@|pUnLS4ksgN>>-((UA zpI49HOp6;2U%9;lr!%dgd&2f*&%?gqxbTC#Tp;|}&LZtsmL+GxO2#+b}iBqg!Y%Vn~@}9X2Db(QZvbf3cl1CnM>7-J8t6%Fv z;~l|gJa4s)5A)OMq#S0?{J?t4MKCnzAx?9orpqNAcIJwZt42*zYOwK@lTnuDs~b6y zeFAfQagP{`@NMq(YVN)h!Foeots^S994KEU1h&)c)rpxRd?5*6kD7YdwThO+&xA~f z)R|M(V2dRgLK*S$@e_-xRZ+8S=6Ns!X`6wJ9KrTQUs5F3lO0b;Ao1yCrPu!CVf?^F z+o~1r*9Tz9*yDSbuOm7hyS%*Un^5Z+I*j(%>jnI`HE$Z+{1z9Y@AS%YGZJRx!Db=+ z6?BXtXIV>Yo!+nd=&ThqHN7g9kO7ol?N0cH{DVll35Pcb&Y&A9%c04I zM>#A6dqi~XHKiW}JB^$_e&XUCPPY=w&6G{OKy@m;?t9Od?Tw#dYxeL)RKe4tsh8Uy z4OI0~q5H;a59W10c6Pg{(cG?m;(_z1j>%K~!G|j;Qw}00Vy8}?y3#(jtVfMrH69!D z);Zd-%=@>n!ef&%}6zf zWrciFZC$SPXuNuCchkD)lcv^~wJuiC%SKVVB|KAwq5OSo%8ANOEJE*3PRA?dr7x$Hd(R3boE92O-wlci>g##`J3Cj*B6@z z7`o;-9>3}k_y;r2_0)4~N%gMf+m0pFY)79ld$>zQw#LKlb7a+p-68Mjv!X%hQ!MU- zs@5kH*v}TC_jn%<8s6{Fm`_)ITB*-Nf2PF4D+6M;24;Y|hE@q4Lk!g+c#ciVRkEp~ZSDBZo$ z`^ryQXI|l?vmsvqSD5vK(tuM6L6+|y64xu~6dpXhS2vr~Gj7IG zDwZ@XsAngCqvvgfm1V6&HT!unh;GE`+DP7PWjc|W>+?C|tB{R=mr?oE1@M`qoqKMX z$g{pD)AdhJCLS^{Fsr?Q$}4*zG*0E^tk*ptP+oI&7k=nixQB(5tYBBa*f}Pi28^G7 z-=>qms#8@=mRijD2+6}aB8jm$)c~|p@t!N=Q?UEsKelnhRIlbA+t{N%$8 z;fuv=CI;8+96T9!TBvgaICo3lja= z$y#tB*0a?gyKTSmj-I_K#u7_^8$LD!Zp>Rje* z8{AXr*(Wu@hnY3n?7f4*A$28o2uNeBAJ1M5Syp06jeSb7y}w9=RZxkp=z(EcK>^}-z|3WnNb%`xOtrDY@6;+sD4 z9X?!;)n$N$N2|>Wz4t z*vm$_>#@WcI=>ARF3&hhj1h;CrVQ%hHCtQ8=_y^*nld?%tL-RH%G@LqdgN`P!{@Ff zvcNTW*Yg}Bm%a9uJYJX(kaj$HdEc?u!%aFzSSz$n(HpP~o~k%AOYDbuW}VDHEivS& zgBHs;M+=j($gF$Yi7nH7cDWJa1EQqB08nhp?1zNPrR$fiK{G{m-GTMfy^9NpLV5!N z4m4_l4 zqVUdErL~YbTvf8y>bZ(8MCcy%2#m2!^UHi88+YN_0yEUD*C-m4BbAaA89S^L)AeL$H^zD+-o!Pgf&cL! 
z|5rwZUYU@y+M+eEey*g!c0AEAwIpqOqkbq4^Lq_Y?r_JK8fkgc*4UnX-P_jBDSjEy z3R((cJa|H&rLNfGCU))|r~mUbjn-bjb0W^$Ant9@>Nj@tn{TWYPq|xj#7>^cw4TT^1h7<;)#>RQPPd=n4~KVz6h|ReeE}i}hfI zF}t0-5bj#I&-S@{QNpjeh6{Hsaws+uwmXb*wdiKD%J@kk4#$^yfQ+^QMO~L)wpX-J6UJK6%g30Zr=JT;`qkyK67A zM}SvcsoHK<)UTH3L$Pq^d}EM(eEkw!C@VX9Wk%^7rp&?1LdYoKD%YlHsC_jq)Q)@q z(V_i?MADHy^Bk?U@%~-9lUi198>dKDu4hQz6@T37Q#bm^<@U^oW&N zuAaTlq7cOGxmc8>pD)xp!h2parIevc7DiDVwFHgGzM9_KD{gPBTi4<5awB~H`Kxr9 z87{+Dp7T#Hq9T|a>2CO%h3v{&>0LA8eiE*?FC$1JwT?$ZUb7dk*VulaZ`&K}8M5=f{EVP% zi9Dgn|G+8G*a8}nfQBQGKS%}apQM6mpf>@kN%Eqq6DW>kqBoTUuqeRPz}}Mp#p7s? zv>wrcK(?n6NnR!dGSL}Cqgh-iU^!p_XyWmt0BZfsqGeEuI%VXgnD9ssd0ztshAgG$Xw>E+3;z|Y90WVFc1W$9YJg|gj z9$4|Jk=F-chtt6?r}QM<7{PrS4Ny=Uyz zehYFKle$ZLwV^HA>cUk-eQ2L-E$UV!r=P^|(2}cZ4)1+MhYb37tr{T<=Saq};K}G3 zi0+3;NyyZ5FA0p*vF%-(9r$c= z>iv*X-c5tE9UFrB`+RdChxy!u*2Ig!IgW=1;`A)n?k;KBLnv4zeIRQC%t4Ffa;*1|u+-t&begOb8SVj6oo_<|r8O9{AAYNO|nm7>Pt~{kHwq zCV#dsnAY9KAgC$H%!^240s%toFWvJKkEjVSX0&bo+FpNfaf*uH>;rxM&8!ibm zT0;SH5^72apxXY@mVgrNxB>j+R%VRsT?kMkdor!r(KG{4V*-WbOLins!0>O>r9Q!l zXiqysv?x(nlq_(Hut*FHhLOj}f#v1kvhwm+3>t~SU{MHpTd1kM0|jbAJ0@gbN9s4* z6k1-@fC>P9DB!I~e?BX6zeS}_pxQgxQ|*6L(o$dV4?YzI!=U~o4;57cyzL#`2~@BH z!G-9hBr;Q5F9IeyDT!F1^kMqmY6MrJb`Y6h5@cZN807AVbrMlg1}O$81bBLT0$#uY zo*rHlg#aZHdw|a*C;(%c7%HL+0xOc8oE41KHNWcsGbIsMD%D#73ibE*m-R==lE^Ml zI2MbA!Vpjd0s?42D1ly7`v8a+MU-aZn~gte4_r`SnoWBLk}p+BMC7|+ZzpG8ds@gp z8Mh}AJ-5tny@YyEphTywpa7HK^%cDB$y8;yBJ`Iz%_o%R97lHf;!iyril@VD- zk|)gr6fTSWDF6ka&76E431ns9@+A`tot=q}1TaEQ7O@r2pRa*aqrT{rm z1}O9IT7Sy^U*ZO8`Jdwsu=fVyhW!yYpr@b)Q0~Cr5K|J#L)n5%1W0*B=Huf2jDR6f5EudiLz%*13MiBU93c&ZE5KksE&rzZW3@iXiRc{o-&y^W=AWy-pMn3@ z`k!_Gi%fp&3ZR{V_J{uHe_Ct&&v^db$p3G3{I??iR^%Vbz`w2a4<+_*oBg*Ue>-*m z(87N!@((5UZ=3zMB7Zw||IosJEAkH|_HUd0w<3Q#b^p-9e=G72CH8Nd{hw2jKjAw7 zH&+tzr-5_-0p>;hh7W#&7BxHwo&d}U0FCxuPWEIc5Yz}vLlOaO)yM_>4Kjt|zyQ=~ z2w=K_0OkfY^00R#cmg0K6z@TDbf+PsP+b7n`~@TZ9gY0en>xwSmu7+H4NeQ}7k9eW zTO#nU7Ps_%hF!mzG$j)V#v~HW)D{9un}gw7Ff0gaO!6S;+j|4i|FJ))(f4?_Hu<{? 
z+OqF1417H))*u+z8vHF7TafKfXzNdkKOkTV;O{rw7f1}ziDLaNkHB^S_|_KyiQzw4 zg8%g8k1bPb0A}k!asjsE0^rBrH$s~L02cvjMj;sbQay-X1i+o4H^B?%NaJj*w<-mO zS{}6md--~JeER_dJpP)cAKBKk_i~YNi;!zpl>|Zc?aA&sUd|*SPqZPy2@Jr_v@h2{ z{{WG}@U49P9QPlj6R_wz9;k>%Ya(G-EKUuM*1(|U)HLvN0Kkpaz-hn`fU!T|&hL8i za5Xpv1?XV_J@hv{q^1U1UJe7-z@yc18tNE;FhF8(2)sN7hsVKDI5=Joj>W*?@@jy+ zEw{ktY3QDWjWhbt`htz5CVwBjPJ6aSPlwfg8=Ju~W_8J*qyC8%`~G#JADv+iM*M+~ z!x8dOV=xkGD508&xi<$Xq4dm0aVQF&31Yz7zw2)0)0ZFd$}k&s@s)&{NyPNq9p4q+ zd|Bw?tJOqG%^cltk0j9L5I77W|5G3VWP&r$T?iNm_TvGgQ79w|>Sw0nzoaKLcdX`oHi zfeS+sqYhKYW7RY<7!5TwIXSpI5{^(qqt$RYIhZCAg$B^$|Dy`+wg9e4HCKY8JH^)% z43*bF!%%WC6aojA(?Ds$)YahXXr#Ox5~r>PL&|H=?llS(pfk2w3ywsguplup4MR=P F{{jRef+_$2 literal 0 HcmV?d00001 diff --git a/test_unstructured/partition/pdf_image/test_pdf.py b/test_unstructured/partition/pdf_image/test_pdf.py index 0746eab82b..105aef64ff 100644 --- a/test_unstructured/partition/pdf_image/test_pdf.py +++ b/test_unstructured/partition/pdf_image/test_pdf.py @@ -208,48 +208,34 @@ def test_partition_pdf_local_raises_with_no_filename(): @pytest.mark.parametrize("file_mode", ["filename", "rb", "spool"]) @pytest.mark.parametrize( - ("strategy", "starting_page_number", "expected_page_numbers", "origin"), + "strategy", # fast: can't capture the "intentionally left blank page" page # others: will ignore the actual blank page [ - (PartitionStrategy.FAST, 1, {1, 4}, {"pdfminer"}), - (PartitionStrategy.FAST, 3, {3, 6}, {"pdfminer"}), - (PartitionStrategy.HI_RES, 4, {4, 6, 7}, {"yolox", "pdfminer", "ocr_tesseract"}), - (PartitionStrategy.OCR_ONLY, 1, {1, 3, 4}, {"ocr_tesseract"}), + PartitionStrategy.FAST, + PartitionStrategy.HI_RES, + PartitionStrategy.OCR_ONLY, ], ) def test_partition_pdf_outputs_valid_amount_of_elements_and_metadata_values( file_mode, strategy, - starting_page_number, - expected_page_numbers, - origin, filename=example_doc_path("pdf/layout-parser-paper-with-empty-pages.pdf"), ): # Test that the partition_pdf function can handle filename def _test(result): # 
validate that the result is a non-empty list of elements
        assert len(result) > 10
-        # check that the pdf has multiple different page numbers
-        assert {element.metadata.page_number for element in result} == expected_page_numbers
-        if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
-            print(
-                [
-                    (element.metadata.detection_origin, element.category, element.text)
-                    for element in result
-                ]
-            )
-            assert {element.metadata.detection_origin for element in result} == origin

     if file_mode == "filename":
         result = pdf.partition_pdf(
-            filename=filename, strategy=strategy, starting_page_number=starting_page_number
+            filename=filename, strategy=strategy,
         )
         _test(result)
     elif file_mode == "rb":
         with open(filename, "rb") as f:
             result = pdf.partition_pdf(
-                file=f, strategy=strategy, starting_page_number=starting_page_number
+                file=f, strategy=strategy,
             )
             _test(result)
     else:
@@ -260,9 +246,8 @@ def _test(result):
             result = pdf.partition_pdf(
                 file=spooled_temp_file,
                 strategy=strategy,
-                starting_page_number=starting_page_number,
             )
-            _test(result)
+        _test(result)


@mock.patch.dict(os.environ, {"UNSTRUCTURED_HI_RES_MODEL_NAME": "checkbox"})
@@ -1545,3 +1530,51 @@ def test_document_to_element_list_sets_category_depth_titles():
     assert elements[1].metadata.category_depth == 2
     assert elements[2].metadata.category_depth is None
     assert elements[3].metadata.category_depth == 0
+
+
+@pytest.mark.parametrize("file_mode", ["filename", "rb", "spool"])
+@pytest.mark.parametrize(
+    "strategy",
+    # every strategy should be able to decrypt the file and extract its text
+    [
+        PartitionStrategy.FAST,
+        PartitionStrategy.HI_RES,
+        PartitionStrategy.OCR_ONLY,
+    ],
+)
+def test_partition_pdf_with_password(
+    file_mode,
+    strategy,
+    filename=example_doc_path("pdf/password.pdf"),
+):
+    # Test that partition_pdf can decrypt and partition a password-protected PDF
+    def _test(result):
+        # validate that the single expected element was extracted
+        assert len(result) == 1
+        assert result[0].text == 'File with 
password' + + if file_mode == "filename": + result = pdf.partition_pdf( + filename=filename, strategy=strategy, + password="password" + ) + _test(result) + elif file_mode == "rb": + with open(filename, "rb") as f: + result = pdf.partition_pdf( + file=f, strategy=strategy, + password="password" + ) + _test(result) + else: + with open(filename, "rb") as test_file: + with SpooledTemporaryFile() as spooled_temp_file: + spooled_temp_file.write(test_file.read()) + spooled_temp_file.seek(0) + result = pdf.partition_pdf( + file=spooled_temp_file, + strategy=strategy, + password="password" + ) + _test(result) diff --git a/unstructured/__version__.py b/unstructured/__version__.py index 7421900821..b13919b74b 100644 --- a/unstructured/__version__.py +++ b/unstructured/__version__.py @@ -1 +1 @@ -__version__ = "0.16.19-dev2" # pragma: no cover +__version__ = "0.16.19-dev3" # pragma: no cover diff --git a/unstructured/partition/image.py b/unstructured/partition/image.py index 50ceaa1187..712384e0d5 100644 --- a/unstructured/partition/image.py +++ b/unstructured/partition/image.py @@ -32,6 +32,7 @@ def partition_image( starting_page_number: int = 1, extract_forms: bool = False, form_extraction_skip_tables: bool = True, + password: Optional[str] = None, **kwargs: Any, ) -> list[Element]: """Parses an image into a list of interpreted elements. @@ -91,6 +92,8 @@ def partition_image( (results in adding FormKeysValues elements to output). form_extraction_skip_tables Whether the form extraction logic should ignore regions designated as Tables. + password + The password to decrypt the PDF file. 
""" exactly_one(filename=filename, file=file) @@ -113,5 +116,6 @@ def partition_image( starting_page_number=starting_page_number, extract_forms=extract_forms, form_extraction_skip_tables=form_extraction_skip_tables, + password=password, **kwargs, ) diff --git a/unstructured/partition/pdf.py b/unstructured/partition/pdf.py index 55d3f3c03c..dabdc64c4e 100644 --- a/unstructured/partition/pdf.py +++ b/unstructured/partition/pdf.py @@ -144,6 +144,7 @@ def partition_pdf( starting_page_number: int = 1, extract_forms: bool = False, form_extraction_skip_tables: bool = True, + password: Optional[str] = None, **kwargs: Any, ) -> list[Element]: """Parses a pdf document into a list of interpreted elements. @@ -224,6 +225,7 @@ def partition_pdf( starting_page_number=starting_page_number, extract_forms=extract_forms, form_extraction_skip_tables=form_extraction_skip_tables, + password=password, **kwargs, ) @@ -245,6 +247,7 @@ def partition_pdf_or_image( starting_page_number: int = 1, extract_forms: bool = False, form_extraction_skip_tables: bool = True, + password: Optional[str] = None, **kwargs: Any, ) -> list[Element]: """Parses a pdf or image document into a list of interpreted elements.""" @@ -273,6 +276,7 @@ def partition_pdf_or_image( languages=languages, metadata_last_modified=metadata_last_modified or last_modified, starting_page_number=starting_page_number, + password=password, **kwargs, ) pdf_text_extractable = any( @@ -322,6 +326,7 @@ def partition_pdf_or_image( starting_page_number=starting_page_number, extract_forms=extract_forms, form_extraction_skip_tables=form_extraction_skip_tables, + password=password, **kwargs, ) out_elements = _process_uncategorized_text_elements(elements) @@ -347,6 +352,7 @@ def partition_pdf_or_image( is_image=is_image, metadata_last_modified=metadata_last_modified or last_modified, starting_page_number=starting_page_number, + password=password, **kwargs, ) out_elements = _process_uncategorized_text_elements(elements) @@ -360,6 +366,7 @@ def 
extractable_elements( languages: Optional[list[str]] = None, metadata_last_modified: Optional[str] = None, starting_page_number: int = 1, + password:Optional[str] = None, **kwargs: Any, ) -> list[list[Element]]: if isinstance(file, bytes): @@ -370,6 +377,7 @@ def extractable_elements( languages=languages, metadata_last_modified=metadata_last_modified, starting_page_number=starting_page_number, + password=password, **kwargs, ) @@ -380,6 +388,7 @@ def _partition_pdf_with_pdfminer( languages: list[str], metadata_last_modified: Optional[str], starting_page_number: int = 1, + password:Optional[str] = None, **kwargs: Any, ) -> list[list[Element]]: """Partitions a PDF using PDFMiner instead of using a layoutmodel. Used for faster @@ -403,6 +412,7 @@ def _partition_pdf_with_pdfminer( languages=languages, metadata_last_modified=metadata_last_modified, starting_page_number=starting_page_number, + password=password, **kwargs, ) @@ -413,6 +423,7 @@ def _partition_pdf_with_pdfminer( languages=languages, metadata_last_modified=metadata_last_modified, starting_page_number=starting_page_number, + password=password, **kwargs, ) @@ -427,6 +438,7 @@ def _process_pdfminer_pages( metadata_last_modified: Optional[str], annotation_threshold: Optional[float] = env_config.PDF_ANNOTATION_THRESHOLD, starting_page_number: int = 1, + password: Optional[str] = None, **kwargs, ) -> list[list[Element]]: """Uses PDFMiner to split a document into pages and process them.""" @@ -434,7 +446,8 @@ def _process_pdfminer_pages( elements = [] for page_number, (page, page_layout) in enumerate( - open_pdfminer_pages_generator(fp), start=starting_page_number + open_pdfminer_pages_generator(fp, password=password), + start=starting_page_number, ): width, height = page_layout.width, page_layout.height @@ -556,6 +569,7 @@ def _partition_pdf_or_image_local( extract_forms: bool = False, form_extraction_skip_tables: bool = True, pdf_hi_res_max_pages: Optional[int] = None, + password:Optional[str] = None, **kwargs: 
Any, ) -> list[Element]: """Partition using package installed locally""" @@ -592,10 +606,12 @@ def _partition_pdf_or_image_local( is_image=is_image, model_name=hi_res_model_name, pdf_image_dpi=pdf_image_dpi, + password=password, ) extracted_layout, layouts_links = ( - process_file_with_pdfminer(filename=filename, dpi=pdf_image_dpi) + process_file_with_pdfminer(filename=filename, dpi=pdf_image_dpi, + password=password) if pdf_text_extractable else ([], []) ) @@ -635,6 +651,7 @@ def _partition_pdf_or_image_local( ocr_mode=ocr_mode, pdf_image_dpi=pdf_image_dpi, ocr_layout_dumper=ocr_layout_dumper, + password=password, ) else: inferred_document_layout = process_data_with_model( @@ -642,13 +659,14 @@ def _partition_pdf_or_image_local( is_image=is_image, model_name=hi_res_model_name, pdf_image_dpi=pdf_image_dpi, + password=password, ) if hasattr(file, "seek"): file.seek(0) extracted_layout, layouts_links = ( - process_data_with_pdfminer(file=file, dpi=pdf_image_dpi) + process_data_with_pdfminer(file=file, dpi=pdf_image_dpi, password=password) if pdf_text_extractable else ([], []) ) @@ -690,6 +708,7 @@ def _partition_pdf_or_image_local( ocr_mode=ocr_mode, pdf_image_dpi=pdf_image_dpi, ocr_layout_dumper=ocr_layout_dumper, + password=password, ) # vectorization of the data structure ends here @@ -837,6 +856,7 @@ def _partition_pdf_or_image_with_ocr( is_image: bool = False, metadata_last_modified: Optional[str] = None, starting_page_number: int = 1, + password: Optional[str] = None, **kwargs: Any, ): """Partitions an image or PDF using OCR. 
For PDFs, each page is converted @@ -861,7 +881,8 @@ def _partition_pdf_or_image_with_ocr( elements.extend(page_elements) else: for page_number, image in enumerate( - convert_pdf_to_images(filename, file), start=starting_page_number + convert_pdf_to_images(filename, file, password=password), + start=starting_page_number ): page_elements = _partition_pdf_or_image_with_ocr_from_image( image=image, diff --git a/unstructured/partition/pdf_image/ocr.py b/unstructured/partition/pdf_image/ocr.py index 9e139af523..557199eea8 100644 --- a/unstructured/partition/pdf_image/ocr.py +++ b/unstructured/partition/pdf_image/ocr.py @@ -42,6 +42,7 @@ def process_data_with_ocr( ocr_mode: str = OCRMode.FULL_PAGE.value, pdf_image_dpi: int = 200, ocr_layout_dumper: Optional[OCRLayoutDumper] = None, + password:Optional[str] = None, ) -> "DocumentLayout": """ Process OCR data from a given data and supplement the output DocumentLayout @@ -89,6 +90,7 @@ def process_data_with_ocr( ocr_mode=ocr_mode, pdf_image_dpi=pdf_image_dpi, ocr_layout_dumper=ocr_layout_dumper, + password=password, ) return merged_layouts @@ -105,6 +107,7 @@ def process_file_with_ocr( ocr_mode: str = OCRMode.FULL_PAGE.value, pdf_image_dpi: int = 200, ocr_layout_dumper: Optional[OCRLayoutDumper] = None, + password:Optional[str] = None, ) -> "DocumentLayout": """ Process OCR data from a given file and supplement the output DocumentLayout @@ -165,6 +168,7 @@ def process_file_with_ocr( dpi=pdf_image_dpi, output_folder=temp_dir, paths_only=True, + userpw=password or "" ) image_paths = cast(List[str], _image_paths) for i, image_path in enumerate(image_paths): diff --git a/unstructured/partition/pdf_image/pdf_image_utils.py b/unstructured/partition/pdf_image/pdf_image_utils.py index a809c7f76d..d57af9d532 100644 --- a/unstructured/partition/pdf_image/pdf_image_utils.py +++ b/unstructured/partition/pdf_image/pdf_image_utils.py @@ -58,6 +58,7 @@ def convert_pdf_to_image( dpi: int = 200, output_folder: Optional[Union[str, PurePath]] 
= None, path_only: bool = False, + password: Optional[str] = None, ) -> Union[List[Image.Image], List[str]]: """Get the image renderings of the pdf pages using pdf2image""" @@ -71,6 +72,7 @@ def convert_pdf_to_image( dpi=dpi, output_folder=output_folder, paths_only=path_only, + userpw=password, ) else: images = pdf2image.convert_from_path( @@ -125,6 +127,7 @@ def save_elements( is_image: bool = False, extract_image_block_to_payload: bool = False, output_dir_path: str | None = None, + password: Optional[str] = None, ): """ Saves specific elements from a PDF as images either to a directory or embeds them in the @@ -167,6 +170,7 @@ def save_elements( pdf_image_dpi, output_folder=temp_dir, path_only=True, + password=password, ) image_paths = cast(List[str], _image_paths) @@ -389,15 +393,16 @@ def convert_pdf_to_images( filename: str = "", file: Optional[bytes | IO[bytes]] = None, chunk_size: int = 10, + password: Optional[str] = None, ) -> Iterator[Image.Image]: # Convert a PDF in small chunks of pages at a time (e.g. 1-10, 11-20... 
and so on) exactly_one(filename=filename, file=file) if file is not None: f_bytes = convert_to_bytes(file) - info = pdf2image.pdfinfo_from_bytes(f_bytes) + info = pdf2image.pdfinfo_from_bytes(f_bytes, userpw=password) else: f_bytes = None - info = pdf2image.pdfinfo_from_path(filename) + info = pdf2image.pdfinfo_from_path(filename, userpw=password) total_pages = info["Pages"] for start_page in range(1, total_pages + 1, chunk_size): @@ -407,12 +412,14 @@ def convert_pdf_to_images( f_bytes, first_page=start_page, last_page=end_page, + userpw=password, ) else: chunk_images = pdf2image.convert_from_path( filename, first_page=start_page, last_page=end_page, + userpw=password, ) for image in chunk_images: diff --git a/unstructured/partition/pdf_image/pdfminer_processing.py b/unstructured/partition/pdf_image/pdfminer_processing.py index 14836f1815..f100c4bdb9 100644 --- a/unstructured/partition/pdf_image/pdfminer_processing.py +++ b/unstructured/partition/pdf_image/pdfminer_processing.py @@ -36,12 +36,14 @@ def process_file_with_pdfminer( filename: str = "", dpi: int = 200, + password: Optional[str] = None, ) -> tuple[List[List["TextRegion"]], List[List]]: with open_filename(filename, "rb") as fp: fp = cast(BinaryIO, fp) extracted_layout, layouts_links = process_data_with_pdfminer( file=fp, dpi=dpi, + password=password, ) return extracted_layout, layouts_links @@ -114,6 +116,7 @@ def process_page_layout_from_pdfminer( def process_data_with_pdfminer( file: Optional[Union[bytes, BinaryIO]] = None, dpi: int = 200, + password:Optional[str] = None, ) -> tuple[List[LayoutElements], List[List]]: """Loads the image and word objects from a pdf using pdfplumber and the image renderings of the pdf pages using pdf2image""" @@ -124,7 +127,8 @@ def process_data_with_pdfminer( layouts_links = [] # Coefficient to rescale bounding box to be compatible with images coef = dpi / 72 - for page_number, (page, page_layout) in enumerate(open_pdfminer_pages_generator(file)): + for page_number, 
(page, page_layout) in ( + enumerate(open_pdfminer_pages_generator(file, password=password))): width, height = page_layout.width, page_layout.height annotation_list = [] diff --git a/unstructured/partition/pdf_image/pdfminer_utils.py b/unstructured/partition/pdf_image/pdfminer_utils.py index 929affeaae..d3444890fb 100644 --- a/unstructured/partition/pdf_image/pdfminer_utils.py +++ b/unstructured/partition/pdf_image/pdfminer_utils.py @@ -1,6 +1,6 @@ import os import tempfile -from typing import BinaryIO, List, Tuple +from typing import BinaryIO, List, Tuple, Optional from pdfminer.converter import PDFPageAggregator from pdfminer.layout import LAParams, LTContainer, LTImage, LTItem, LTTextLine @@ -73,6 +73,7 @@ def rect_to_bbox( @requires_dependencies(["pikepdf", "pypdf"]) def open_pdfminer_pages_generator( fp: BinaryIO, + password: Optional[str] = None, ): """Open PDF pages using PDFMiner, handling and repairing invalid dictionary constructs.""" @@ -84,7 +85,7 @@ def open_pdfminer_pages_generator( with tempfile.TemporaryDirectory() as tmp_dir_path: tmp_file_path = os.path.join(tmp_dir_path, "tmp_file") try: - pages = PDFPage.get_pages(fp) + pages = PDFPage.get_pages(fp, password=password or "") # Detect invalid dictionary construct for entire PDF for i, page in enumerate(pages): try: From e61dcc0cbaf9f3870a36c7a88563c13448b1aa6a Mon Sep 17 00:00:00 2001 From: Philippe Prados Date: Thu, 6 Feb 2025 15:46:56 +0100 Subject: [PATCH 2/9] Fix ruff --- unstructured/partition/pdf_image/pdfminer_utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/unstructured/partition/pdf_image/pdfminer_utils.py b/unstructured/partition/pdf_image/pdfminer_utils.py index d3444890fb..305b0f0367 100644 --- a/unstructured/partition/pdf_image/pdfminer_utils.py +++ b/unstructured/partition/pdf_image/pdfminer_utils.py @@ -1,6 +1,6 @@ import os import tempfile -from typing import BinaryIO, List, Tuple, Optional +from typing import BinaryIO, List, Optional, Tuple from 
pdfminer.converter import PDFPageAggregator from pdfminer.layout import LAParams, LTContainer, LTImage, LTItem, LTTextLine From 4e39efce0f9b19e4baa6c5f1444b681e80616d4d Mon Sep 17 00:00:00 2001 From: Philippe Prados Date: Thu, 6 Feb 2025 16:27:25 +0100 Subject: [PATCH 3/9] undo test_partition_pdf_outputs_valid_amount_of_elements_and_metadata_values --- .../partition/pdf_image/test_pdf.py | 27 ++++++++++++++----- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/test_unstructured/partition/pdf_image/test_pdf.py b/test_unstructured/partition/pdf_image/test_pdf.py index 105aef64ff..8dee5b5981 100644 --- a/test_unstructured/partition/pdf_image/test_pdf.py +++ b/test_unstructured/partition/pdf_image/test_pdf.py @@ -208,34 +208,48 @@ def test_partition_pdf_local_raises_with_no_filename(): @pytest.mark.parametrize("file_mode", ["filename", "rb", "spool"]) @pytest.mark.parametrize( - "strategy", + ("strategy", "starting_page_number", "expected_page_numbers", "origin"), # fast: can't capture the "intentionally left blank page" page # others: will ignore the actual blank page [ - PartitionStrategy.FAST, - PartitionStrategy.HI_RES, - PartitionStrategy.OCR_ONLY, + (PartitionStrategy.FAST, 1, {1, 4}, {"pdfminer"}), + (PartitionStrategy.FAST, 3, {3, 6}, {"pdfminer"}), + (PartitionStrategy.HI_RES, 4, {4, 6, 7}, {"yolox", "pdfminer", "ocr_tesseract"}), + (PartitionStrategy.OCR_ONLY, 1, {1, 3, 4}, {"ocr_tesseract"}), ], ) def test_partition_pdf_outputs_valid_amount_of_elements_and_metadata_values( file_mode, strategy, + starting_page_number, + expected_page_numbers, + origin, filename=example_doc_path("pdf/layout-parser-paper-with-empty-pages.pdf"), ): # Test that the partition_pdf function can handle filename def _test(result): # validate that the result is a non-empty list of dicts assert len(result) > 10 + # check that the pdf has multiple different page numbers + assert {element.metadata.page_number for element in result} == expected_page_numbers + if 
UNSTRUCTURED_INCLUDE_DEBUG_METADATA: + print( + [ + (element.metadata.detection_origin, element.category, element.text) + for element in result + ] + ) + assert {element.metadata.detection_origin for element in result} == origin if file_mode == "filename": result = pdf.partition_pdf( - filename=filename, strategy=strategy, + filename=filename, strategy=strategy, starting_page_number=starting_page_number ) _test(result) elif file_mode == "rb": with open(filename, "rb") as f: result = pdf.partition_pdf( - file=f, strategy=strategy, + file=f, strategy=strategy, starting_page_number=starting_page_number ) _test(result) else: @@ -246,6 +260,7 @@ def _test(result): result = pdf.partition_pdf( file=spooled_temp_file, strategy=strategy, + starting_page_number=starting_page_number, ) _test(result) From c669c7b146c25c2d4b6475d0c35d344201952ce4 Mon Sep 17 00:00:00 2001 From: Philippe Prados Date: Sat, 8 Feb 2025 08:22:19 +0100 Subject: [PATCH 4/9] reformat with black --- .../partition/pdf_image/test_pdf.py | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/test_unstructured/partition/pdf_image/test_pdf.py b/test_unstructured/partition/pdf_image/test_pdf.py index 8dee5b5981..8ece563046 100644 --- a/test_unstructured/partition/pdf_image/test_pdf.py +++ b/test_unstructured/partition/pdf_image/test_pdf.py @@ -1567,20 +1567,14 @@ def test_partition_pdf_with_password( def _test(result): # validate that the result is a non-empty list of dicts assert len(result) == 1 - assert result[0].text == 'File with password' + assert result[0].text == "File with password" if file_mode == "filename": - result = pdf.partition_pdf( - filename=filename, strategy=strategy, - password="password" - ) + result = pdf.partition_pdf(filename=filename, strategy=strategy, password="password") _test(result) elif file_mode == "rb": with open(filename, "rb") as f: - result = pdf.partition_pdf( - file=f, strategy=strategy, - password="password" - ) + result = 
pdf.partition_pdf(file=f, strategy=strategy, password="password") _test(result) else: with open(filename, "rb") as test_file: @@ -1588,8 +1582,6 @@ def _test(result): spooled_temp_file.write(test_file.read()) spooled_temp_file.seek(0) result = pdf.partition_pdf( - file=spooled_temp_file, - strategy=strategy, - password="password" + file=spooled_temp_file, strategy=strategy, password="password" ) _test(result) From ea850a646a189b3b6c08c6ed270dc5b6b13cc3a2 Mon Sep 17 00:00:00 2001 From: Philippe Prados Date: Sat, 8 Feb 2025 08:22:19 +0100 Subject: [PATCH 5/9] reformat with black --- .../partition/pdf_image/test_pdf.py | 16 ++++------------ unstructured/partition/pdf.py | 14 ++++++-------- unstructured/partition/pdf_image/ocr.py | 6 +++--- .../partition/pdf_image/pdfminer_processing.py | 7 ++++--- 4 files changed, 17 insertions(+), 26 deletions(-) diff --git a/test_unstructured/partition/pdf_image/test_pdf.py b/test_unstructured/partition/pdf_image/test_pdf.py index 8dee5b5981..8ece563046 100644 --- a/test_unstructured/partition/pdf_image/test_pdf.py +++ b/test_unstructured/partition/pdf_image/test_pdf.py @@ -1567,20 +1567,14 @@ def test_partition_pdf_with_password( def _test(result): # validate that the result is a non-empty list of dicts assert len(result) == 1 - assert result[0].text == 'File with password' + assert result[0].text == "File with password" if file_mode == "filename": - result = pdf.partition_pdf( - filename=filename, strategy=strategy, - password="password" - ) + result = pdf.partition_pdf(filename=filename, strategy=strategy, password="password") _test(result) elif file_mode == "rb": with open(filename, "rb") as f: - result = pdf.partition_pdf( - file=f, strategy=strategy, - password="password" - ) + result = pdf.partition_pdf(file=f, strategy=strategy, password="password") _test(result) else: with open(filename, "rb") as test_file: @@ -1588,8 +1582,6 @@ def _test(result): spooled_temp_file.write(test_file.read()) spooled_temp_file.seek(0) result = 
pdf.partition_pdf( - file=spooled_temp_file, - strategy=strategy, - password="password" + file=spooled_temp_file, strategy=strategy, password="password" ) _test(result) diff --git a/unstructured/partition/pdf.py b/unstructured/partition/pdf.py index dabdc64c4e..9a2efcd650 100644 --- a/unstructured/partition/pdf.py +++ b/unstructured/partition/pdf.py @@ -366,7 +366,7 @@ def extractable_elements( languages: Optional[list[str]] = None, metadata_last_modified: Optional[str] = None, starting_page_number: int = 1, - password:Optional[str] = None, + password: Optional[str] = None, **kwargs: Any, ) -> list[list[Element]]: if isinstance(file, bytes): @@ -388,7 +388,7 @@ def _partition_pdf_with_pdfminer( languages: list[str], metadata_last_modified: Optional[str], starting_page_number: int = 1, - password:Optional[str] = None, + password: Optional[str] = None, **kwargs: Any, ) -> list[list[Element]]: """Partitions a PDF using PDFMiner instead of using a layoutmodel. Used for faster @@ -447,7 +447,7 @@ def _process_pdfminer_pages( for page_number, (page, page_layout) in enumerate( open_pdfminer_pages_generator(fp, password=password), - start=starting_page_number, + start=starting_page_number, ): width, height = page_layout.width, page_layout.height @@ -569,7 +569,7 @@ def _partition_pdf_or_image_local( extract_forms: bool = False, form_extraction_skip_tables: bool = True, pdf_hi_res_max_pages: Optional[int] = None, - password:Optional[str] = None, + password: Optional[str] = None, **kwargs: Any, ) -> list[Element]: """Partition using package installed locally""" @@ -610,8 +610,7 @@ def _partition_pdf_or_image_local( ) extracted_layout, layouts_links = ( - process_file_with_pdfminer(filename=filename, dpi=pdf_image_dpi, - password=password) + process_file_with_pdfminer(filename=filename, dpi=pdf_image_dpi, password=password) if pdf_text_extractable else ([], []) ) @@ -881,8 +880,7 @@ def _partition_pdf_or_image_with_ocr( elements.extend(page_elements) else: for page_number, 
image in enumerate( - convert_pdf_to_images(filename, file, password=password), - start=starting_page_number + convert_pdf_to_images(filename, file, password=password), start=starting_page_number ): page_elements = _partition_pdf_or_image_with_ocr_from_image( image=image, diff --git a/unstructured/partition/pdf_image/ocr.py b/unstructured/partition/pdf_image/ocr.py index 557199eea8..0798caacf3 100644 --- a/unstructured/partition/pdf_image/ocr.py +++ b/unstructured/partition/pdf_image/ocr.py @@ -42,7 +42,7 @@ def process_data_with_ocr( ocr_mode: str = OCRMode.FULL_PAGE.value, pdf_image_dpi: int = 200, ocr_layout_dumper: Optional[OCRLayoutDumper] = None, - password:Optional[str] = None, + password: Optional[str] = None, ) -> "DocumentLayout": """ Process OCR data from a given data and supplement the output DocumentLayout @@ -107,7 +107,7 @@ def process_file_with_ocr( ocr_mode: str = OCRMode.FULL_PAGE.value, pdf_image_dpi: int = 200, ocr_layout_dumper: Optional[OCRLayoutDumper] = None, - password:Optional[str] = None, + password: Optional[str] = None, ) -> "DocumentLayout": """ Process OCR data from a given file and supplement the output DocumentLayout @@ -168,7 +168,7 @@ def process_file_with_ocr( dpi=pdf_image_dpi, output_folder=temp_dir, paths_only=True, - userpw=password or "" + userpw=password or "", ) image_paths = cast(List[str], _image_paths) for i, image_path in enumerate(image_paths): diff --git a/unstructured/partition/pdf_image/pdfminer_processing.py b/unstructured/partition/pdf_image/pdfminer_processing.py index f100c4bdb9..ff1eba2e3b 100644 --- a/unstructured/partition/pdf_image/pdfminer_processing.py +++ b/unstructured/partition/pdf_image/pdfminer_processing.py @@ -116,7 +116,7 @@ def process_page_layout_from_pdfminer( def process_data_with_pdfminer( file: Optional[Union[bytes, BinaryIO]] = None, dpi: int = 200, - password:Optional[str] = None, + password: Optional[str] = None, ) -> tuple[List[LayoutElements], List[List]]: """Loads the image and word 
objects from a pdf using pdfplumber and the image renderings of the pdf pages using pdf2image""" @@ -124,7 +127,8 @@ def process_data_with_pdfminer( layouts_links = [] # Coefficient to rescale bounding box to be compatible with images coef = dpi / 72 - for page_number, (page, page_layout) in ( - enumerate(open_pdfminer_pages_generator(file, password=password))): + for page_number, (page, page_layout) in enumerate( + open_pdfminer_pages_generator(file, password=password) + ): width, height = page_layout.width, page_layout.height annotation_list = [] From 2c0c7d9db24e87c2377f8005af2fae9b0d55b5d2 Mon Sep 17 00:00:00 2001 From: Philippe Prados Date: Sat, 8 Feb 2025 08:29:12 +0100 Subject: [PATCH 6/9] reformat with black --- unstructured/partition/pdf_image/pdfminer_utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/unstructured/partition/pdf_image/pdfminer_utils.py b/unstructured/partition/pdf_image/pdfminer_utils.py index 305b0f0367..3544e26762 100644 --- a/unstructured/partition/pdf_image/pdfminer_utils.py +++ b/unstructured/partition/pdf_image/pdfminer_utils.py @@ -94,7 +94,7 @@ def open_pdfminer_pages_generator( page_layout = device.get_result() except PSSyntaxError: logger.info("Detected invalid dictionary construct for PDFminer") - logger.info(f"Repairing the PDF page {i+1} ...") + logger.info(f"Repairing the PDF page {i + 1} ...") # find the error page from binary data fp error_page_data = get_page_data(fp, page_number=i) # repair the error page with pikepdf From 479b678772fb7b2fc7d1a57d499c7ae587267148 Mon Sep 17 00:00:00 2001 From: Philippe Prados Date: Tue, 11 Feb 2025 13:05:52 +0100 Subject: [PATCH 7/9] Update dependencies --- requirements/extra-pdf-image.in | 2 +- requirements/extra-pdf-image.txt | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/requirements/extra-pdf-image.in b/requirements/extra-pdf-image.in index 99df481053..816388cbe5 100644 --- a/requirements/extra-pdf-image.in +++
b/requirements/extra-pdf-image.in @@ -11,5 +11,5 @@ google-cloud-vision effdet # Do not move to constraints.in, otherwise unstructured-inference will not be upgraded # when unstructured library is. -unstructured-inference>=0.8.6 +unstructured-inference>=0.8.7 unstructured.pytesseract>=0.3.12 diff --git a/requirements/extra-pdf-image.txt b/requirements/extra-pdf-image.txt index 910c2e2797..f30252303d 100644 --- a/requirements/extra-pdf-image.txt +++ b/requirements/extra-pdf-image.txt @@ -263,7 +263,7 @@ typing-extensions==4.12.2 # torch tzdata==2025.1 # via pandas -unstructured-inference==0.8.6 +unstructured-inference==0.8.7 # via -r ./extra-pdf-image.in unstructured-pytesseract==0.3.13 # via -r ./extra-pdf-image.in From af9f4d60be0c4bf510ff81efc66408e70f5c3ace Mon Sep 17 00:00:00 2001 From: Philippe PRADOS Date: Tue, 11 Feb 2025 16:36:37 +0100 Subject: [PATCH 8/9] Update CHANGELOG.md Co-authored-by: John J <43506685+Coniferish@users.noreply.github.com> --- CHANGELOG.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 9d819e43fd..554ef679da 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,4 +1,4 @@ -## 0.16.21-dev3 +## 0.16.21-dev4 ### Enhancements - **Use password** to load PDF with all modes From b539fc0c879ee7acfeec8d31edbdeb61e246e4bd Mon Sep 17 00:00:00 2001 From: Philippe Prados Date: Tue, 11 Feb 2025 16:48:05 +0100 Subject: [PATCH 9/9] Update versions --- unstructured/__version__.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/unstructured/__version__.py b/unstructured/__version__.py index 15608835af..464c141982 100644 --- a/unstructured/__version__.py +++ b/unstructured/__version__.py @@ -1 +1 @@ -__version__ = "0.16.21-dev3" # pragma: no cover +__version__ = "0.16.21-dev4" # pragma: no cover