pull request 20241114 #951

Open
wants to merge 77 commits into base: master

Commits (77)
cef00a4
feat: add fastapi requirement file
lztiancn Nov 7, 2024
0dc406e
feat: add fastapi to docker
lztiancn Nov 8, 2024
f1c8144
feat: modify Dockerfile
lztiancn Nov 8, 2024
3f24891
fix: get fastapi working under docker run
lztiancn Nov 13, 2024
7510a86
Merge branch 'master' into master
lztiancn Nov 14, 2024
cb15db1
fix: start fastapi via docker compose
lztiancn Nov 14, 2024
6213e8b
Merge branch 'master' of https://github.com/lztiancn/lzmineru
lztiancn Nov 14, 2024
0db2ad7
fix: add redis deployment
lztiancn Nov 14, 2024
51e9c13
fix: add redis deployment
lztiancn Nov 14, 2024
932fee1
feat: add asynchronous processing
lztiancn Nov 15, 2024
9eedc23
fix: optimize the processing queue
lztiancn Nov 16, 2024
3963ec4
fix: add lookup by md5
lztiancn Nov 19, 2024
5326e97
fix: handle errors when no data can be fetched from the url
lztiancn Nov 19, 2024
7c4af87
add fastapi
Nov 20, 2024
2e91a85
add API documentation notes
Nov 20, 2024
e40d687
add contributor note for lztiancn
Nov 20, 2024
2657cbe
.gitignore
Nov 20, 2024
6664add
save pdf files and reload the queue after restart
Nov 20, 2024
abef79b
restore initial state when rebuilding the task queue
Nov 20, 2024
02ea87b
app needs to write images temporarily
Nov 26, 2024
4a51cc9
Merge remote-tracking branch 'upstream/master'
Dec 27, 2024
bcfa326
add fork
Jan 15, 2025
548d4c1
Merge remote-tracking branch 'upstream/master'
Jan 15, 2025
a98b6ef
@moria97 has signed the CLA in opendatalab/MinerU#1578
github-actions[bot] Jan 20, 2025
39eadff
Update pdf_parse_union_core_v2.py
myhloli Jan 14, 2025
2265379
build(deps): add upper version limit for PyMuPDF
myhloli Jan 14, 2025
316ba84
feat(layout): improve title block handling and layout detection
myhloli Jan 14, 2025
d05f2fe
refactor(BatchAnalyze): comment out image rotation logic in doclayout…
myhloli Jan 14, 2025
c6be246
feat(post_proc): enhance title block processing with average line height
myhloli Jan 14, 2025
af7a28f
refactor(pre_proc): adjust IOU threshold for character overlap detection
myhloli Jan 15, 2025
31219f1
docs(magic_pdf): update llm_aided.py prompt for title list optimization
myhloli Jan 15, 2025
a2e3178
fix(language): remove invalid UTF-16 surrogate pairs from input text
myhloli Jan 15, 2025
85d4c9c
update logo
myhloli Jan 15, 2025
8d75e47
build(docker): update doclayout-yolo dependency
myhloli Jan 15, 2025
c2481cd
feat(model): improve batch analysis logic and support npu
myhloli Jan 15, 2025
1dc0816
refactor(magic_pdf): improve title block merging logic
myhloli Jan 15, 2025
1f2ffdf
fix(magic_pdf): correct end page index and improve error handling
myhloli Jan 16, 2025
f540b42
docs(README): update demo badges
myhloli Jan 16, 2025
1333d67
docs(README): update demo badges
myhloli Jan 16, 2025
613c060
feat(table): upgrade RapidTable to1.0.3 and add sub-model support
myhloli Jan 16, 2025
c3827c1
build(docker): update rapid-table dependency
myhloli Jan 16, 2025
7573af9
refactor(model): update batch analyze logic for rapid table model
myhloli Jan 16, 2025
977a4b8
docs(README): update WeChat group link
myhloli Jan 16, 2025
ec7ced3
refactor(table): add device configuration for Unitable model
myhloli Jan 17, 2025
73efa06
refactor(model): update config version check to 1.1.1
myhloli Jan 17, 2025
f98dd02
fix(magic_pdf): limit batch ratio for GPU memory
myhloli Jan 17, 2025
6b279a6
feat(llm_aided): add reasonability check and fine-tuning guidelines
myhloli Jan 17, 2025
64e1204
Fix ocr utills
moria97 Jan 20, 2025
53e6380
feat(pdf_parse): remove tilted lines for better text extraction
myhloli Jan 20, 2025
fe2d24f
fix(ocr): improve ONNX model initialization and error handling
myhloli Jan 20, 2025
f58512e
fix(models): update unimernet_small model path
myhloli Jan 21, 2025
ad54ff2
perf(model): adjust batch size for layout and formula detection
myhloli Jan 21, 2025
5c70353
perf(magic_pdf): optimize batch ratio calculation for GPU
myhloli Jan 21, 2025
b1313e2
refactor(magic_pdf): adjust VRAM allocation and MFR batch size- Updat…
myhloli Jan 21, 2025
fabbca2
perf(magic_pdf): adjust batch ratio calculation for GPU memory
myhloli Jan 21, 2025
e516ec5
perf(magic_pdf): optimize batch processing for GPU
myhloli Jan 21, 2025
c2aa56a
fix(magic_pdf): correct batch ratio conditions for GPU memory
myhloli Jan 21, 2025
a7a2e00
refactor(pdf_parse): uncomment char bbox validation logic
myhloli Jan 22, 2025
a6fe0d2
fix(boxbase): handle cases where bounding box area is zero
myhloli Jan 22, 2025
3e754fe
feat(pdf_parse_union_core_v2): add timing log for LLM aided processes
myhloli Jan 22, 2025
1711181
docs(readme):update readme for 1.1.0
myhloli Jan 22, 2025
bab73f1
docs(url): update Miners links in header
myhloli Jan 22, 2025
64d541a
feat(table-config): add sub_model configuration for rapid_table
myhloli Jan 23, 2025
e409475
docs(readme): update changelog for v1.1.0 release- Update model capab…
myhloli Jan 23, 2025
32c96d8
docs(README): update online demo links and enhance documentation read…
myhloli Jan 23, 2025
333c5d2
Update version.py with new version
myhloli Jan 23, 2025
58af7ef
Update python-package.yml
myhloli Jan 23, 2025
7cbad26
Update README.md
myhloli Jan 24, 2025
e9a49a5
Update README_zh-CN.md
myhloli Jan 24, 2025
54dccba
merge: lztiancn/lzmineru
futuremeng Feb 14, 2025
3e555eb
fix: context: ./docker/china
futuremeng Feb 14, 2025
55e8960
feat: add callback
lztiancn Feb 14, 2025
7deb790
Merge remote-tracking branch 'upstream/master'
lztiancn Feb 14, 2025
deece55
fix: correct parsing error in DiskReaderWriter
lztiancn Feb 14, 2025
22b4aee
fix: correct parsing error in image_dir
lztiancn Feb 14, 2025
32ec4bd
feat: send callbacks asynchronously
lztiancn Feb 14, 2025
d7fc67f
fix: code cleanup
lztiancn Feb 28, 2025
12 changes: 12 additions & 0 deletions README.md
@@ -67,6 +67,7 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
- 2024/11/22 0.10.0 released. Introducing hybrid OCR text extraction capabilities,
- Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.
- Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.
- 2024/11/20 add fastapi, via: [https://github.com/lztiancn/lzmineru.git](https://github.com/lztiancn/lzmineru.git)
- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
- 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
- 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
@@ -157,6 +158,11 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c

## Quick Start


```
docker-compose up -d
```

If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a>. </br>
If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br>
There are three different ways to experience MinerU:
@@ -347,6 +353,12 @@ You can enable MPS acceleration by setting the `device-mode` parameter to `mps`
[Using MinerU via Python API](https://mineru.readthedocs.io/en/latest/user_guide/usage/api.html)



### FastAPI

API documentation: [http://localhost:8910/docs](http://localhost:8910/docs)
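
A minimal client sketch for the gateway, assuming the `/parse_pdf` endpoint added in this PR takes `encodeUrl`, `md5`, and `parse_method` as query parameters on a POST (as its signature in `services/fastapi/app/main.py` suggests) and that the service is published on port 8910. The helper names `build_parse_request` and `submit` and the example PDF URL are hypothetical. Because the endpoint calls `unquote` on `encodeUrl` after the framework has already decoded the query string, the sketch percent-encodes the PDF URL once before building the query.

```python
import json
import urllib.parse
import urllib.request

GATEWAY = "http://localhost:8910"  # port published in docker-compose.yml


def build_parse_request(pdf_url=None, md5=None, parse_method="auto", base=GATEWAY):
    """Build a POST request for /parse_pdf (hypothetical helper)."""
    params = {"parse_method": parse_method}
    if pdf_url is not None:
        # Pre-quote once; urlencode() below quotes again for transport,
        # so the server-side unquote() recovers the original URL.
        params["encodeUrl"] = urllib.parse.quote(pdf_url, safe="")
    if md5 is not None:
        params["md5"] = md5
    url = f"{base}/parse_pdf?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(url, data=b"", method="POST")


def submit(req, timeout=30):
    """Send the request and decode the JSON state returned by the gateway."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_parse_request(pdf_url="https://example.com/sample.pdf")
    print(req.full_url)
    # print(submit(req))  # needs a running gateway
```

Polling with only `md5` set (the MD5 of the PDF bytes) should return the stored state for a file that was submitted earlier.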


### Deploy Derived Projects

Derived projects include secondary development projects based on MinerU by project developers and community developers,
6 changes: 6 additions & 0 deletions README_zh-CN.md
@@ -66,6 +66,7 @@
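- 2024/11/22 0.10.0 released. Introducing hybrid OCR text extraction capabilities,
- Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.
- Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.
- 2024/11/20 added fastapi (futuremeng and tianlz)
- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
- 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
- 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability: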
@@ -349,6 +350,11 @@ pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com -i
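[Using MinerU via Python API](https://mineru.readthedocs.io/en/latest/user_guide/usage/api.html)



### FastAPI

API documentation: [http://localhost:8910/docs](http://localhost:8910/docs)

### Deploy Derived Projects

Derived projects include secondary development projects based on MinerU by project developers and community developers,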
46 changes: 46 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,46 @@
services:
  gateway:
    build:
      context: ./docker/china
    depends_on:
      - redis
    container_name: mineru-gateway
    ports:
      - "8910:80"
    volumes:
      - ./services/fastapi/app:/gateway/app:rw
      - ./services/fastapi/tmp:/gateway/tmp:rw
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      TZ: Asia/Shanghai
    command: ["/gateway/app/start.sh"]
    restart: always
    networks:
      - default

  redis:
    image: redis:7.2.4
    container_name: mineru-redis
    ports:
      - "6380:6379"
    volumes:
      - ./services/redis/conf/redis.conf:/etc/redis.conf
      - ./services/redis/data/:/data/
    restart: always
    entrypoint: ["redis-server", "/etc/redis.conf"]
    environment:
      TZ: Asia/Shanghai
    networks:
      - default

networks:
  default:
    driver: bridge
    ipam:
      driver: default
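
After `docker-compose up -d`, the containers may need some time before the published ports accept connections. A minimal sketch for waiting on them, assuming the port mappings in this compose file (8910 for the gateway, 6380 for Redis) and a local Docker host; `wait_for_port` is a hypothetical helper, not part of the PR, and it only checks TCP reachability, not that the service behind the port is ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 8910)` could gate a smoke test of the gateway, and `wait_for_port("localhost", 6380)` one of Redis.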
11 changes: 10 additions & 1 deletion docker/china/Dockerfile
@@ -4,6 +4,9 @@ FROM ubuntu:22.04
# Set environment variables to non-interactive to avoid prompts during installation
ENV DEBIAN_FRONTEND=noninteractive

RUN /bin/bash -c "sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list && \
sed -i 's/security.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list"

# Update the package list and install necessary packages
RUN apt-get update && \
apt-get install -y \
@@ -46,5 +49,11 @@ RUN /bin/bash -c "pip3 install modelscope && \
python3 download_models.py && \
sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

# Install extensions (FastAPI gateway dependencies)
COPY requirements-fastapi.txt /mineru-gateway/requirements-fastapi.txt

RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
pip3 install -r /mineru-gateway/requirements-fastapi.txt"

# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
CMD ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
4 changes: 4 additions & 0 deletions docker/china/requirements-fastapi.txt
@@ -0,0 +1,4 @@
fastapi>=0.115.2,<0.115.4
uvicorn>=0.30.0,<0.32.0
redis>=5.2.0
requests
11 changes: 10 additions & 1 deletion docker/global/Dockerfile
@@ -4,6 +4,9 @@ FROM ubuntu:22.04
# Set environment variables to non-interactive to avoid prompts during installation
ENV DEBIAN_FRONTEND=noninteractive

RUN /bin/bash -c "sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list && \
sed -i 's/security.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list"

# Update the package list and install necessary packages
RUN apt-get update && \
apt-get install -y \
@@ -46,5 +49,11 @@ RUN /bin/bash -c "pip3 install huggingface_hub && \
python3 download_models.py && \
sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

# Install extensions (FastAPI gateway dependencies)
COPY requirements-fastapi.txt /gateway/requirements-fastapi.txt

RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
pip3 install -r /gateway/requirements-fastapi.txt"

# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
CMD ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
29 changes: 29 additions & 0 deletions fork-remote.md
@@ -0,0 +1,29 @@
<!--
* @Author: FutureMeng be_loving@163.com
* @Date: 2025-01-15 22:45:17
* @LastEditors: FutureMeng be_loving@163.com
* @LastEditTime: 2025-01-15 22:45:53
* @FilePath: \MinerU\fork-remote.md
* @Description: This is the default setting. Please set `customMade`; open koroFileHeader to view and adjust the configuration: https://github.com/OBKoro1/koro1FileHeader/wiki/%E9%85%8D%E7%BD%AE
-->
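1. Check whether an upstream remote has already been added
```
git remote -v
```
2. Add the upstream remote; this project is forked from opendatalab/MinerU
```
git remote add upstream https://github.com/opendatalab/MinerU.git
```
3. Fetch updates from upstream
```
git fetch upstream
```
4. Merge the upstream branch
```
git merge upstream/master
```

5. Push
```
git push
```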
3 changes: 3 additions & 0 deletions requirements-fastapi.txt
@@ -0,0 +1,3 @@
fastapi>=0.115.2,<0.115.4
uvicorn>=0.30.0,<0.32.0
redis>=5.2.0
Empty file.
18 changes: 18 additions & 0 deletions services/fastapi/app/http_util.py
@@ -0,0 +1,18 @@
import requests
import asyncio
from loguru import logger

def post_result_callback(url, cbkey, md5_value, content_list, md_content):
    # Nothing to do when the caller did not supply a callback URL.
    if url is None:
        return
    # requests.post is a blocking call, so this does not return until the
    # callback request has completed or failed.
    asyncio.run(commit_callback_task(url, cbkey, md5_value, content_list, md_content))

async def commit_callback_task(url, cbkey, md5_value, content_list, md_content):
    # Await the callback so it is guaranteed to finish before the event loop closes.
    await post_callback(url, cbkey, md5_value, content_list, md_content)

async def post_callback(url, cbkey, md5_value, content_list, md_content):
    try:
        json_result = {"cbkey": cbkey, "md5": md5_value, "content_list": content_list, "md_content": md_content}
        # data= sends the dict as form-encoded fields, not as a JSON body.
        requests.post(url, data=json_result)
        logger.info("post_callback success, url:{}", url)
    except Exception as e:
        logger.exception(e)
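
Because the HTTP POST in this module is a blocking call, wrapping it in a coroutine does not by itself move the work off the calling thread. One possible alternative, shown only as a sketch, hands the POST to a daemon thread and sends the payload as JSON. The names `post_json` and `send_callback_in_background`, the injectable `sender` parameter, the 10-second timeout, the use of `urllib.request` in place of `requests`, and the switch to a JSON body are all assumptions rather than part of the PR; a receiver that expects form fields would need a different encoding.

```python
import json
import threading
import urllib.request


def post_json(url, payload, timeout=10):
    """Blocking JSON POST using only the standard library."""
    body = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def send_callback_in_background(url, payload, sender=post_json):
    """Run sender(url, payload) on a daemon thread; return the thread, or None if url is None."""
    if url is None:
        return None

    def worker():
        try:
            sender(url, payload)
        except Exception as exc:  # a failed callback must not crash the caller
            print(f"callback to {url} failed: {exc}")

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The parsing worker can then continue with the next queue item while the callback is in flight; joining the returned thread is only needed when the caller wants to wait for delivery.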
87 changes: 87 additions & 0 deletions services/fastapi/app/magic_pdf_parse_util.py
@@ -0,0 +1,87 @@
import os
import json
import datetime
import shutil

from loguru import logger

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod

from . import redis_util
from . import http_util

def pdf_parse(
    md5_value,
    pdf_bytes: bytes,
    parse_method: str = 'auto',
    cbUrl: str = None,
    cbkey: str = None,
    model_json_path: str = None,
    output_dir: str = None
):
    try:
        # Only parse files that are known and still waiting in the queue.
        file_info = redis_util.get_file_info(md5_value)
        if not file_info:
            return
        if file_info["state"] != "waiting":
            return
        redis_util.set_parse_parsing(md5_value)
        current_script_dir = os.path.dirname(os.path.abspath(__file__))
        foldname = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
        if output_dir:
            output_path = os.path.join(output_dir, foldname)
        else:
            output_path = os.path.join(current_script_dir, foldname)

        output_image_path = os.path.join(output_path, 'images')
        local_md_dir = os.path.join(output_path, 'output')

        # Get the image folder name so that images can be referenced by relative path in the .md and content_list.json files
        image_path_parent = os.path.basename(output_image_path)

        if model_json_path:
            # Read the raw JSON data (a list) of a PDF that has already been analyzed by the model
            with open(model_json_path, "r", encoding="utf-8") as f:
                model_json = json.loads(f.read())
        else:
            model_json = []

        # Run the parsing steps
        image_writer = FileBasedDataWriter(output_image_path)

        ds = PymuDocDataset(pdf_bytes)

        ## inference
        if ds.classify() == SupportedPdfParseMethod.OCR:
            infer_result = ds.apply(doc_analyze, ocr=True)

            ## pipeline
            pipe_result = infer_result.pipe_ocr_mode(image_writer)

        else:
            infer_result = ds.apply(doc_analyze, ocr=False)

            ## pipeline
            pipe_result = infer_result.pipe_txt_mode(image_writer)

        ### draw model result on each page
        infer_result.draw_model(os.path.join(local_md_dir, "model.pdf"))

        ### get model inference result
        model_inference_result = infer_result.get_infer_res()

        ### get markdown content
        md_content = pipe_result.get_markdown(output_image_path)

        ### get content list content
        content_list = pipe_result.get_content_list(output_image_path)

        # Delete the temporary output folder
        shutil.rmtree(output_path)
        redis_util.set_parse_parsed(md5_value, content_list, md_content)
        http_util.post_result_callback(cbUrl, cbkey, md5_value, content_list, md_content)
    except Exception as e:
        redis_util.set_parse_failed(md5_value)
        logger.exception(e)
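
In the function above, `shutil.rmtree(output_path)` runs only on the success path, so a parse that raises leaves its timestamped folder behind. A minimal sketch of one way to scope the scratch directory so that it is removed on both paths, using `tempfile.TemporaryDirectory`; `run_in_scratch_dir` and the `work` callable are hypothetical stand-ins for the parsing steps, not code from the PR.

```python
import os
import tempfile


def run_in_scratch_dir(work, parent=None):
    """Call work(images_dir, output_dir) inside a temporary folder that is always removed."""
    with tempfile.TemporaryDirectory(dir=parent) as scratch:
        images_dir = os.path.join(scratch, "images")
        output_dir = os.path.join(scratch, "output")
        os.makedirs(images_dir)
        os.makedirs(output_dir)
        # The folder is deleted when this block exits, whether work() returns or raises.
        return work(images_dir, output_dir)
```

A unique temporary name also avoids the collision that a second-resolution timestamp could in principle produce.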
101 changes: 101 additions & 0 deletions services/fastapi/app/main.py
@@ -0,0 +1,101 @@
import hashlib
import os
import queue
import threading
from urllib import parse, request

from fastapi import FastAPI
from loguru import logger

from . import magic_pdf_parse_util
from . import redis_util

message_queue = queue.Queue(20)

app = FastAPI()

def calc_md5(byteContent: bytes):
    hash_md5 = hashlib.md5()
    hash_md5.update(byteContent)
    return hash_md5.hexdigest()

def commit_parse_task(md5_value, parse_method, cbUrl, cbkey):
    message_queue.put({"md5": md5_value, "parse_method": parse_method, "cbUrl": cbUrl, "cbkey": cbkey})

def queue_consumer(q):
    while True:
        item = q.get()
        if (item):
            file_path='/gateway/tmp/'+item['md5']+'.pdf'
            if os.access(file_path, os.R_OK):
                byteContent = open(file_path,'rb').read()
                logger.info('start to parse '+item['md5'])
                magic_pdf_parse_util.pdf_parse(item['md5'], byteContent, item['parse_method'], item['cbUrl'], item['cbkey'])
            else:
                logger.error(file_path+' can not read')
        q.task_done()

consumer_thread = threading.Thread(target=queue_consumer, args=(message_queue,))
consumer_thread.start()


initial_pdf_list = redis_util.get_init_list()
logger.info(initial_pdf_list)
for item in initial_pdf_list:
    redis_util.set_parse_init(item['md5'])
    # Requeued tasks carry no callback information, so pass None for cbUrl and cbkey.
    commit_parse_task(item['md5'], 'auto', None, None)

MAX_FILE_SIZE = 1024 * 1024 * 50

@app.post("/parse_pdf")
async def parse_pdf(encodeUrl: str = None, md5: str = None, cbUrl: str = None, cbkey: str = None, parse_method: str = 'auto'):
    if (encodeUrl is None and md5 is None):
        return {"state": "failed", "error": "encodeUrl or md5 is required" }
    if (md5):
        # A known md5 returns the stored state without downloading again.
        file_info = redis_util.get_file_info(md5)
        if (file_info):
            return file_info
        if (encodeUrl is None):
            return {"state": "failed", "error": "encodeUrl is required when md5 is not valid"}

    try:
        # `parse` and `request` are the urllib submodules imported at the top of the file.
        decodeUrl = parse.unquote(encodeUrl)
        with request.urlopen(decodeUrl) as response:
            if response.getheader('Content-Length') and int(response.getheader('Content-Length')) > MAX_FILE_SIZE:
                return {"state": "failed", "error": "File size exceeds the limit."}
            pdf_bytes = bytearray()
            while True:
                chunk = response.read(4096)
                if not chunk:
                    break
                pdf_bytes.extend(chunk)
                if len(pdf_bytes) > MAX_FILE_SIZE:
                    return {"state": "failed", "error": "File size exceeds the limit."}
    except Exception as e:
        logger.error(f"Error downloading PDF: {e}")
        return {"state": "failed", "error": "encodeUrl is not valid."}

    md5_value = calc_md5(pdf_bytes)

    file_path=f'/gateway/tmp/{md5_value}.pdf'

    if not os.path.isfile(file_path):
        tmp_file = open(file_path,'wb')
        tmp_file.write(pdf_bytes)
        tmp_file.close()


    file_info = redis_util.get_file_info(md5_value)
    if file_info:
        file_info['queue'] = message_queue.qsize()
        return file_info
    try:
        commit_parse_task(md5_value, parse_method, cbUrl, cbkey)
    except Exception:
        redis_util.set_parse_deny(md5_value)
        return redis_util.get_file_info(md5_value)

    redis_util.set_parse_init(md5_value)
    file_info = redis_util.get_file_info(md5_value)
    file_info['queue'] = message_queue.qsize()
    return file_info
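
`queue.Queue(20)` bounds the backlog, but `Queue.put` blocks by default when the queue is full, so the `except` branch that marks a task as denied would not be reached in that case. A minimal sketch of a non-blocking enqueue that reports rejection instead, assuming that denying the request is the intended behaviour; `try_enqueue` is a hypothetical helper, not part of the PR.

```python
import queue


def try_enqueue(q, task):
    """Put task on q without blocking; return False if the queue is already full."""
    try:
        q.put_nowait(task)
        return True
    except queue.Full:
        return False


if __name__ == "__main__":
    q = queue.Queue(2)
    for md5 in ("a", "b", "c"):
        accepted = try_enqueue(q, {"md5": md5, "parse_method": "auto"})
        print(md5, "accepted" if accepted else "denied")
```

With a helper like this, the endpoint could call the deny path when `try_enqueue` returns `False` rather than relying on an exception from `put`.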