Have we thought about adding pdfplumber for table detection? #97

@espoirMur

Description

Hi, I have been using this library. I would like to say thank you for the good work.

I have looked at the PDF parsing options the library currently offers, but I can't use any of them for some tasks at work:

  • PyMuPDF: Licensing issues.
  • Deep learning models such as UniTable and Table Transformer: we are restricted from downloading open-source models at work.

I have managed to get pdfplumber working well for table extraction, and OpenParse with pdfminer.six for text extraction. I like how pdfminer.six extracts the text from my pages (it preserves bold text and line breaks).

On the other hand, I like how pdfplumber extracts the tables.

Can I customize the library to use the two together?
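To make the idea concrete, here is a rough sketch of what I have in mind, written outside of this library's internals. It takes tables from pdfplumber and text blocks from pdfminer.six, then drops any text block that overlaps a table so the content is not duplicated. The helper names (`pdfminer_to_top_left`, `overlaps`, `extract_page_content`) are hypothetical and only for illustration. It also assumes the page origin is at (0, 0), which may not hold for cropped pages.

```python
from typing import List, Tuple

# (x0, top, x1, bottom) with a top-left origin, the convention pdfplumber uses.
BBox = Tuple[float, float, float, float]


def pdfminer_to_top_left(bbox: BBox, page_height: float) -> BBox:
    """Convert a pdfminer.six bbox (x0, y0, x1, y1, bottom-left origin)
    into (x0, top, x1, bottom) with a top-left origin."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


def overlaps(a: BBox, b: BBox) -> bool:
    """Return True if two top-left-origin boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def extract_page_content(pdf_path: str, page_number: int = 0):
    """Hypothetical helper: tables via pdfplumber, remaining text via pdfminer.six."""
    # Imported lazily so the pure helpers above work without these packages.
    import pdfplumber
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        found = page.find_tables()
        table_boxes: List[BBox] = [t.bbox for t in found]
        tables = [t.extract() for t in found]

    text_blocks: List[str] = []
    # page_numbers is zero-indexed in pdfminer.six.
    for layout in extract_pages(pdf_path, page_numbers=[page_number]):
        for element in layout:
            if not isinstance(element, LTTextContainer):
                continue
            box = pdfminer_to_top_left(element.bbox, layout.height)
            # Skip text that sits inside a detected table region.
            if not any(overlaps(box, tb) for tb in table_boxes):
                text_blocks.append(element.get_text())

    return tables, text_blocks
```

The main catch I can see is the coordinate systems: pdfminer.six measures from the bottom-left corner and pdfplumber from the top-left, so the boxes have to be converted before they are compared.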

Has anyone tried this? If so, what challenges did you face?

I plan to give it a go this weekend and will hopefully open a PR with my findings.

Cheers
