pull request 20241114 #951

lztiancn · 2024-11-14T01:17:15Z

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
The documentation has been modified accordingly, such as docstrings or example tutorials.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

github-actions · 2024-11-14T01:17:29Z

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.

I have read the CLA Document and I hereby sign the CLA

1 out of 4 committers have signed the CLA.
✅ [moria97](https://github.com/moria97)
❌ @lztiancn
❌ @futuremeng
❌ @futuremeng
FutureMeng seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

nil-andreu · 2024-11-16T09:05:15Z

Dockerfile

    wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py && \
    python3 download_models.py && \
    sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

+# install extents
+COPY requirements-fastapi.txt /minerugw/requirements-fastapi.txt


To avoid increasing the image size by default, this could be made an optional requirement.

- Merge title blocks that are close to each other horizontally
- Adjust line insertion logic for title blocks
- Increase image size and decrease confidence threshold for layout detection
- Update DocLayoutYOLO model weights
- Refactor drawing of bounding boxes for different block types

- Add remove_tilted_line function to filter out lines with angles between 2 and 88 degrees
- Integrate the new function into the text extraction process
- Improve the accuracy of text block processing by removing non-horizontal/vertical lines
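The angle filter described in the commit message above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `(x0, y0, x1, y1)` line representation, the helper names, and the choice to treat the 2 and 88 degree bounds as exclusive are all assumptions.

```python
import math


def line_angle_degrees(x0, y0, x1, y1):
    """Return the line's inclination folded into the range [0, 90] degrees."""
    angle = abs(math.degrees(math.atan2(y1 - y0, x1 - x0)))  # 0..180
    # Fold so that a line and its reverse direction give the same angle.
    return 180 - angle if angle > 90 else angle


def remove_tilted_lines(lines, low=2.0, high=88.0):
    """Keep only near-horizontal or near-vertical lines.

    `lines` is a hypothetical list of (x0, y0, x1, y1) tuples; a line is
    dropped when its folded angle lies strictly between `low` and `high`.
    """
    return [ln for ln in lines if not (low < line_angle_degrees(*ln) < high)]
```

For example, `remove_tilted_lines([(0, 0, 10, 0), (0, 0, 10, 10)])` keeps the horizontal line and drops the 45 degree one.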

- Add key length validation for ONNX model initialization
- Move import statements to the top of the file
- Wrap model initialization in a try-except block for better error handling
- Refactor code to improve readability and maintainability

- Change MinerU homepage link from 'https://mineru.org.cn/home?source=online' to 'https://mineru.net/home?source=online'
- Change MinerU client download link from 'https://mineru.org.cn/client?source=online' to 'https://mineru.net/client?source=online'

…ilities: upgrade to latest doclayout_yolo(2501) and unimernet(2501) models
- Improve performance: optimize resource usage and processing pipeline for faster parsing on high-end devices
- Enhance parsing effects: add new heading classification feature to online demo
- Refactor changelog structure for better readability and organization

lztiancn added 4 commits November 7, 2024 15:52

feat: add fastapi requirement file

cef00a4

feat: add fastapi to docker

0dc406e

feat: modify dockerfile

f1c8144

fix: get fastapi working with docker run

3f24891

lztiancn and others added 7 commits November 14, 2024 09:18

Merge branch 'master' into master

7510a86

fix: start fastapi via docker compose

cb15db1

Merge branch 'master' of https://github.com/lztiancn/lzmineru

6213e8b

fix: add redis deployment

0db2ad7

fix: add redis deployment

51e9c13

feat: add asynchronous processing

932fee1

fix: optimize the processing queue

9eedc23

nil-andreu reviewed Nov 16, 2024

lztiancn and others added 17 commits November 19, 2024 19:36

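fix: add md5 lookup

3963ec4

fix: add exception handling for when no data can be fetched from the url

5326e97

Add fastapi

7c4af87

Add API documentation

2e91a85

Add contributor note for lztiancn

e40d687

.gitignore

2657cbe

Save pdf files and reload the queue after restart

6664add

Restore the initial state when rebuilding the task queue

abef79b

Return the current queue length

The app needs to write images temporarily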

02ea87b

Merge remote-tracking branch 'upstream/master'

4a51cc9

add fork

bcfa326

Merge remote-tracking branch 'upstream/master'

548d4c1

@moria97 has signed the CLA in opendatalab#1578

a98b6ef

Update pdf_parse_union_core_v2.py

39eadff

build(deps): add upper version limit for PyMuPDF

2265379

- Set PyMuPDF version to <= 1.24.14 in all requirements files
- Prevent potential compatibility issues with future versions

refactor(BatchAnalyze): comment out image rotation logic in doclayout…

d05f2fe

…_yolo

moria97 and others added 30 commits February 14, 2025 10:14

Fix ocr utills

64e1204

fix(models): update unimernet_small model path

f58512e

- Update model path from 'unimernet_small' to 'unimernet_small_2501' in multiple scripts and configuration files
- This change affects download_models.py, download_models_hf.py, and model_configs.yaml

perf(model): adjust batch size for layout and formula detection

ad54ff2

- Reduce YOLO_LAYOUT_BASE_BATCH_SIZE from 4 to 1
- Simplify batch ratio calculation for formula detection
- Remove unused conditional logic in batch ratio determination

perf(magic_pdf): optimize batch ratio calculation for GPU

5c70353

- Update GPU memory check and batch ratio calculation logic
- Add support for virtual VRAM size environment variable
- Improve logging for GPU memory and batch ratio

refactor(magic_pdf): adjust VRAM allocation and MFR batch size- Updat…

b1313e2

…e VRAM allocation logic to use 'VIRTUAL_VRAM_SIZE' environment variable
- Reduce MFR (Math Formula Recognition) batch size from 64 to 32
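As a rough illustration of the 'VIRTUAL_VRAM_SIZE' override mentioned above, the sketch below lets an environment variable replace the detected GPU memory. Only the variable name comes from the commit message. The function name, the interpretation of the value as gigabytes, and the fallback to the detected value on a missing or invalid setting are assumptions, not the PR's actual logic.

```python
import os


def effective_vram_gb(detected_gb, env=None):
    """Return the VRAM size (assumed to be in GB) to use for batch sizing.

    `detected_gb` is whatever the caller measured on the device. If the
    VIRTUAL_VRAM_SIZE variable holds a positive number, it takes priority;
    otherwise the detected value is returned unchanged.
    """
    env = os.environ if env is None else env
    raw = env.get("VIRTUAL_VRAM_SIZE")
    if raw is None or raw.strip() == "":
        return detected_gb
    try:
        value = float(raw)
    except ValueError:
        # Unparseable setting: ignore it rather than fail (an assumption).
        return detected_gb
    return value if value > 0 else detected_gb
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.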

perf(magic_pdf): adjust batch ratio calculation for GPU memory

fabbca2

- Reduce batch_ratio by 1 for better performance and stability
- This change ensures more consistent memory usage when processing documents

perf(magic_pdf): optimize batch processing for GPU

e516ec5

- Improve batch ratio calculation based on GPU memory
- Enhance performance for devices with 8GB or more VRAM

fix(magic_pdf): correct batch ratio conditions for GPU memory

c2aa56a

- Update conditions for batch ratio assignment:
  - 8 <= gpu_memory < 10: batch_ratio = 2
  - 10 <= gpu_memory <= 12: batch_ratio = 4
- This fix ensures proper batch ratio selection for GPU memory sizes
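The two conditions quoted in this commit message translate directly into a small lookup. The sketch below covers only those two ranges; the function name and the `default` returned for memory sizes outside them are assumptions, since the commit message does not say what happens there.

```python
def batch_ratio_for(gpu_memory_gb, default=1):
    """Map GPU memory (GB) to a batch ratio for the two documented ranges."""
    if 8 <= gpu_memory_gb < 10:
        return 2
    if 10 <= gpu_memory_gb <= 12:
        return 4
    # Ranges below 8 GB or above 12 GB are not described in the commit
    # message, so the caller supplies the fallback (hypothetical).
    return default
```

Note that the half-open first range and the closed second range leave no gap at exactly 10 GB, which is where the boundaries meet.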

refactor(pdf_parse): uncomment char bbox validation logic

a7a2e00

- Restore commented code for filtering out characters with invalid bounding boxes
- This change may affect the filtering of unnecessary characters in PDF parsing

fix(boxbase): handle cases where bounding box area is zero

a6fe0d2

- Add a check to return 0 when either bbox1_area or bbox2_area is zero
- This prevents division by zero errors when calculating IoU
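A zero-area guard of the kind this commit describes might look like the sketch below. It is an illustrative IoU function rather than the project's own: the `(x0, y0, x1, y1)` box layout and the function name are assumptions, while the names `bbox1_area` and `bbox2_area` follow the commit message.

```python
def calculate_iou(bbox1, bbox2):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    x_left = max(bbox1[0], bbox2[0])
    y_top = max(bbox1[1], bbox2[1])
    x_right = min(bbox1[2], bbox2[2])
    y_bottom = min(bbox1[3], bbox2[3])
    if x_right < x_left or y_bottom < y_top:
        return 0.0  # boxes do not overlap

    intersection = (x_right - x_left) * (y_bottom - y_top)
    bbox1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    bbox2_area = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])

    # Guard: a degenerate box would otherwise allow a zero denominator.
    if bbox1_area == 0 or bbox2_area == 0:
        return 0.0

    return intersection / (bbox1_area + bbox2_area - intersection)
```

Without the guard, two identical degenerate boxes such as `(0, 0, 0, 0)` pass the overlap test and produce a union of zero, which raises `ZeroDivisionError`.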

feat(pdf_parse_union_core_v2): add timing log for LLM aided processes

3e754fe

- Add timing measurement for formula, text, and title optimization using LLM
- Log the execution time for each LLM aided process
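One common way to add this kind of per-step timing log is a small context manager. The sketch below uses only the standard library; the logger name, the `log_elapsed` helper, and the message format are assumptions and are not taken from the PR.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("llm_aided")  # hypothetical logger name


@contextmanager
def log_elapsed(step_name):
    """Log how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s took %.2f s", step_name, elapsed)
```

Each LLM-aided step (formula, text, title) would then be wrapped as `with log_elapsed("formula"): ...`, so every step emits one timing line.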

docs(readme):update readme for 1.1.0

1711181

feat(table-config): add sub_model configuration for rapid_table

64d541a

- Add sub_model configuration option for rapid_table model
- Provide two sub_model options: slanet_plus and unitable

docs(README): update online demo links and enhance documentation read…

32c96d8

…ability
- Update online demo links in both English and Chinese README files

Update version.py with new version

333c5d2

Update python-package.yml

58af7ef

Update README.md

7cbad26

Update README_zh-CN.md

e9a49a5

merge: lztiancn/lzmineru

54dccba

fix: context: ./docker/china

3e555eb

feat: add callback

55e8960

Merge remote-tracking branch 'upstream/master'

7deb790

fix: fix DiskReaderWriter parsing error

deece55

fix: fix image_dir parsing error

22b4aee

feat: send callbacks asynchronously

32ec4bd

fix: code optimization

d7fc67f
