pull request 20241114 #951
Open
lztiancn wants to merge 77 commits into opendatalab:master from lztiancn:master
Changes from 11 commits
Commits
cef00a4
feat: add fastapi requirement file
lztiancn 0dc406e
feat: add fastapi to docker
lztiancn f1c8144
feat: modify dockerfile
lztiancn 3f24891
fix: get fastapi working under docker run
lztiancn 7510a86
Merge branch 'master' into master
lztiancn cb15db1
fix: start fastapi with docker compose
lztiancn 6213e8b
Merge branch 'master' of https://github.com/lztiancn/lzmineru
lztiancn 0db2ad7
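fix: add redis deployment
lztiancn 51e9c13
fix: add redis deployment
lztiancn 932fee1
feat: add asynchronous processing
lztiancn 9eedc23
fix: optimize the processing queue
lztiancn 3963ec4
fix: add md5 lookup
lztiancn 5326e97
fix: handle the exception raised when the url returns no data
lztiancn 7c4af87
add fastapi
2e91a85
add API documentation
e40d687
add contributor note for lztiancn
2657cbe
.gitignore
6664add
save pdf files and reload the queue after restart
abef79b
restore the initial state when rebuilding the task queue
02ea87b
the app needs to write images temporarily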
4a51cc9
Merge remote-tracking branch 'upstream/master'
bcfa326
add fork
548d4c1
Merge remote-tracking branch 'upstream/master'
a98b6ef
@moria97 has signed the CLA in opendatalab/MinerU#1578
github-actions[bot] 39eadff
Update pdf_parse_union_core_v2.py
myhloli 2265379
build(deps): add upper version limit for PyMuPDF
myhloli 316ba84
feat(layout): improve title block handling and layout detection
myhloli d05f2fe
refactor(BatchAnalyze): comment out image rotation logic in doclayout…
myhloli c6be246
feat(post_proc): enhance title block processing with average line height
myhloli af7a28f
refactor(pre_proc): adjust IOU threshold for character overlap detection
myhloli 31219f1
docs(magic_pdf): update llm_aided.py prompt for title list optimization
myhloli a2e3178
fix(language): remove invalid UTF-16 surrogate pairs from input text
myhloli 85d4c9c
update logo
myhloli 8d75e47
build(docker): update doclayout-yolo dependency
myhloli c2481cd
feat(model): improve batch analysis logic and support npu
myhloli 1dc0816
refactor(magic_pdf): improve title block merging logic
myhloli 1f2ffdf
fix(magic_pdf): correct end page index and improve error handling
myhloli f540b42
docs(README): update demo badges
myhloli 1333d67
docs(README): update demo badges
myhloli 613c060
feat(table): upgrade RapidTable to1.0.3 and add sub-model support
myhloli c3827c1
build(docker): update rapid-table dependency
myhloli 7573af9
refactor(model): update batch analyze logic for rapid table model
myhloli 977a4b8
docs(README): update WeChat group link
myhloli ec7ced3
refactor(table): add device configuration for Unitable model
myhloli 73efa06
refactor(model): update config version check to 1.1.1
myhloli f98dd02
fix(magic_pdf): limit batch ratio for GPU memory
myhloli 6b279a6
feat(llm_aided): add reasonability check and fine-tuning guidelines
myhloli 64e1204
Fix ocr utills
moria97 53e6380
feat(pdf_parse): remove tilted lines for better text extraction
myhloli fe2d24f
fix(ocr): improve ONNX model initialization and error handling
myhloli f58512e
fix(models): update unimernet_small model path
myhloli ad54ff2
perf(model): adjust batch size for layout and formula detection
myhloli 5c70353
perf(magic_pdf): optimize batch ratio calculation for GPU
myhloli b1313e2
refactor(magic_pdf): adjust VRAM allocation and MFR batch size- Updat…
myhloli fabbca2
perf(magic_pdf): adjust batch ratio calculation for GPU memory
myhloli e516ec5
perf(magic_pdf): optimize batch processing for GPU
myhloli c2aa56a
fix(magic_pdf): correct batch ratio conditions for GPU memory
myhloli a7a2e00
refactor(pdf_parse): uncomment char bbox validation logic
myhloli a6fe0d2
fix(boxbase): handle cases where bounding box area is zero
myhloli 3e754fe
feat(pdf_parse_union_core_v2): add timing log for LLM aided processes
myhloli 1711181
docs(readme):update readme for 1.1.0
myhloli bab73f1
docs(url): update Miners links in header
myhloli 64d541a
feat(table-config): add sub_model configuration for rapid_table
myhloli e409475
docs(readme): update changelog for v1.1.0 release- Update model capab…
myhloli 32c96d8
docs(README): update online demo links and enhance documentation read…
myhloli 333c5d2
Update version.py with new version
myhloli 58af7ef
Update python-package.yml
myhloli 7cbad26
Update README.md
myhloli e9a49a5
Update README_zh-CN.md
myhloli 54dccba
merge: lztiancn/lzmineru
futuremeng 3e555eb
fix: context: ./docker/china
futuremeng 55e8960
feat: add callback
lztiancn 7deb790
Merge remote-tracking branch 'upstream/master'
lztiancn deece55
fix: fix the DiskReaderWriter parsing error
lztiancn 22b4aee
fix: fix the image_dir parsing error
lztiancn 32ec4bd
feat: send callbacks asynchronously
lztiancn d7fc67f
fix: code cleanup
lztiancn
@@ -0,0 +1,45 @@
services:
  mineru-gw:
    build:
      context: .
    depends_on:
      - redis
    container_name: mineru-gw
    ports:
      - "8910:80"
    volumes:
      - ./services/fastapi/app:/minerugw/app:rw
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      TZ: Asia/Shanghai
    command: ["/minerugw/app/start.sh"]
    restart: always
    networks:
      - default

  redis:
    image: redis:7.2.4
    container_name: redis
    ports:
      - "6380:6379"
    volumes:
      - ./services/redis/conf/redis.conf:/etc/redis.conf
      - ./services/redis/conf/:/data/
    restart: always
    entrypoint: ["redis-server", "/etc/redis.conf"]
    environment:
      TZ: Asia/Shanghai
    networks:
      - default

networks:
  default:
    driver: bridge
    ipam:
      driver: default
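Both services join the same `default` bridge network, and the compose file publishes the Redis container port 6379 on host port 6380. The `redis_util` module in this PR connects through `host.docker.internal:6380`, that is, by way of the host. The following is a minimal sketch of reading the connection settings from the environment instead, so that the gateway could address the `redis` service by name on port 6379 inside the compose network. The variable names `REDIS_HOST`, `REDIS_PORT`, and `REDIS_DB` and the helper `redis_settings` are assumptions for illustration and are not part of the PR.

```python
import os


def redis_settings(env=None):
    """Build Redis connection keyword arguments from environment variables.

    REDIS_HOST / REDIS_PORT / REDIS_DB are hypothetical names; the defaults
    assume the gateway runs on the same compose network as the `redis` service.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("REDIS_HOST", "redis"),
        "port": int(env.get("REDIS_PORT", "6379")),
        "db": int(env.get("REDIS_DB", "0")),
    }


if __name__ == "__main__":
    # Inside the compose network the defaults apply.
    print(redis_settings({}))
    # Running the app outside compose, against the published port.
    print(redis_settings({"REDIS_HOST": "localhost", "REDIS_PORT": "6380"}))
```

The resulting dictionary could be passed as `redis.Redis(**redis_settings())`, which keeps the host and port out of the source file.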
@@ -0,0 +1,3 @@
fastapi>=0.115.2,<0.115.4
uvicorn>=0.30.0,<0.32.0
redis>=5.2.0
Empty file.
@@ -0,0 +1,94 @@
import os
import json
import datetime
import shutil

from loguru import logger

from magic_pdf.libs.draw_bbox import draw_layout_bbox, draw_span_bbox
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

from . import redis_util

def pdf_parse(
    md5_value,
    pdf_bytes: bytes,
    parse_method: str = 'auto',
    model_json_path: str = None,
    output_dir: str = None
):
    """
    Convert a pdf to json and md, writing the md and json files to the directory that contains the pdf file
    :param parse_method: parsing method, one of auto, ocr, txt; defaults to auto; try ocr if the results are poor
    :param model_json_path: existing model data file; if empty, the built-in model is used; the pdf and model_json must correspond
    :param is_json_md_dump: whether to write the parsed data to .json and .md files, default True; data from different stages is written to different .json files (3 .json files in total) and the md content is saved to a .md file
    :param output_dir: output directory; a folder named after the pdf file is created and all results are saved there
    """
    try:
        file_info = redis_util.get_file_info(md5_value)
        if not file_info:
            return
        if file_info["state"] != "init":
            return
        redis_util.set_parse_parsing(md5_value)
        current_script_dir = os.path.dirname(os.path.abspath(__file__))
        foldname = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
        if output_dir:
            output_path = os.path.join(output_dir, foldname)
        else:
            output_path = os.path.join(current_script_dir, foldname)

        output_image_path = os.path.join(output_path, 'images')

        # Get the parent path of the images so that they are saved as relative paths in the .md and content_list.json files
        image_path_parent = os.path.basename(output_image_path)

        if model_json_path:
            # Read the raw json data of a pdf file that the model has already parsed; list type
            model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
        else:
            model_json = []

        # Run the parsing steps
        image_writer = DiskReaderWriter(output_image_path)

        # Choose the parsing method
        # jso_useful_key = {"_pdf_type": "", "model_list": model_json}
        # pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        if parse_method == "auto":
            jso_useful_key = {"_pdf_type": "", "model_list": model_json}
            pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        elif parse_method == "txt":
            pipe = TXTPipe(pdf_bytes, model_json, image_writer)
        elif parse_method == "ocr":
            pipe = OCRPipe(pdf_bytes, model_json, image_writer)
        else:
            shutil.rmtree(output_path)
            redis_util.set_parse_failed(md5_value)
            logger.error("unknown parse method, only auto, ocr, txt allowed")
            return

        # Run classification
        pipe.pipe_classify()

        # If no model data was passed in, parse with the built-in model
        if not model_json:
            pipe.pipe_analyze()  # analyze

        # Run parsing
        pipe.pipe_parse()

        # Save the results in text and md formats
        content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode="none")
        md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")

        # delete the folder
        shutil.rmtree(output_path)
        redis_util.set_parse_parsed(md5_value, content_list, md_content)

    except Exception as e:
        redis_util.set_parse_failed(md5_value)
        logger.exception(e)
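In the file above, the working folder is named from a timestamp with one-second resolution and is removed with `shutil.rmtree` only on the success path and the unknown-method path; the `except` branch leaves it in place. One possible alternative, sketched here under the assumption that the folder is purely scratch space, is to let `tempfile.TemporaryDirectory` pick a unique name and clean up on every exit path. The helper name `run_in_scratch_dir` and the `work` callback are hypothetical and only stand in for the pipeline calls.

```python
import os
import tempfile


def run_in_scratch_dir(work, base_dir=None):
    """Run work(output_path, image_dir) inside a unique scratch directory.

    The directory is removed when the block exits, whether `work` returns
    normally or raises.
    """
    with tempfile.TemporaryDirectory(dir=base_dir) as output_path:
        image_dir = os.path.join(output_path, "images")
        os.makedirs(image_dir)
        return work(output_path, image_dir)


if __name__ == "__main__":
    def fake_parse(output_path, image_dir):
        # Stand-in for the real pipeline: write one file, return a result.
        with open(os.path.join(image_dir, "page1.txt"), "w") as fh:
            fh.write("placeholder")
        return sorted(os.listdir(image_dir))

    print(run_in_scratch_dir(fake_parse))
```

With this shape, the Redis state updates can stay in the caller's `try`/`except` while directory cleanup no longer depends on which branch was taken.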
@@ -0,0 +1,51 @@
'''
Author: dt_4541218930 abcstorms@163.com
Date: 2024-11-14 17:04:42
LastEditors: dt_4541218930 abcstorms@163.com
LastEditTime: 2024-11-15 22:38:00
FilePath: \lzmineru\services\fastapi\app\main.py
Description: This is the default setting; please set `customMade`, and open koroFileHeader to view and adjust the configuration: https://github.com/OBKoro1/koro1FileHeader/wiki/%E9%85%8D%E7%BD%AE
'''
from fastapi import FastAPI
import urllib.request
import hashlib
import queue
import threading
from . import magic_pdf_parse_util
from . import redis_util

message_queue = queue.Queue(20)

app = FastAPI()

def calc_md5(byteContent: bytes):
    hash_md5 = hashlib.md5()
    hash_md5.update(byteContent)
    return hash_md5.hexdigest()

def commit_parse_task(md5_value, byteContent: bytes, parse_method):
    message_queue.put({"md5": md5_value, "byteContent": byteContent, "parse_method": parse_method})

def queue_consumer(q):
    while True:
        item = q.get()
        if (item):
            magic_pdf_parse_util.pdf_parse(item['md5'], item['byteContent'], item['parse_method'])

consumer_thread = threading.Thread(target=queue_consumer, args=(message_queue,))
consumer_thread.start()

@app.post("/parse_pdf")
async def parse_pdf(imageUrl: str, parse_method: str = 'auto'):
    pdf_bytes = urllib.request.urlopen(imageUrl).read()
    md5_value = calc_md5(pdf_bytes)
    file_info = redis_util.get_file_info(md5_value)
    if file_info:
        return file_info
    try:
        commit_parse_task(md5_value, pdf_bytes, parse_method)
    except Exception:
        redis_util.set_parse_deny(md5_value)
        return redis_util.get_file_info(md5_value)
    redis_util.set_parse_init(md5_value)
    return redis_util.get_file_info(md5_value)
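Two details of the handler above seem worth a note. First, `queue.Queue.put` blocks by default when the queue is full rather than raising, so with `queue.Queue(20)` the `except` branch that calls `set_parse_deny` would not normally be reached. Second, the task is enqueued before `set_parse_init` runs, so the consumer thread may pick it up while no `init` record exists yet, in which case `pdf_parse` returns early. A minimal sketch of one way to address both, assuming it is acceptable to record the state first and reject immediately when the queue is full; `try_enqueue`, `mark_init`, and `mark_deny` are hypothetical names standing in for the PR's helpers.

```python
import queue


def try_enqueue(q, task, mark_init, mark_deny):
    """Record the task as 'init', then enqueue it without blocking.

    Returns True if the task was queued, False if the queue was full
    (in which case the task is marked as denied).
    """
    mark_init(task["md5"])
    try:
        q.put_nowait(task)
    except queue.Full:
        mark_deny(task["md5"])
        return False
    return True


if __name__ == "__main__":
    states = {}  # in-memory stand-in for the Redis records
    q = queue.Queue(2)
    for i in range(3):
        try_enqueue(
            q,
            {"md5": f"md5-{i}"},
            mark_init=lambda key: states.__setitem__(key, "init"),
            mark_deny=lambda key: states.__setitem__(key, "deny"),
        )
    print(states)  # {'md5-0': 'init', 'md5-1': 'init', 'md5-2': 'deny'}
```

Because `put_nowait` never waits, this shape also avoids stalling the `async` handler when the consumer falls behind.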
@@ -0,0 +1,42 @@
import redis
import json
import enum

ParseState = enum.Enum('ParseState', ('deny', 'init', 'start', 'done', 'failed'))

redis_conn = redis.Redis(host='host.docker.internal', port=6380, db=0)

def get_file_info(md5_value):
    json_str = redis_conn.get(md5_value)
    if json_str:
        return json.loads(json_str)

def del_file_info(md5_value):
    redis_conn.delete(md5_value)

def set_file_info_expire(md5_value, expire_seconds):
    redis_conn.expire(md5_value, expire_seconds)

def set_file_info(md5_value, state: ParseState, content_list = "", md_content = ""):
    json_str = json.dumps({"state": state.name, "content_list": content_list, "md_content": md_content})
    redis_conn.set(md5_value, json_str)

def set_parse_deny(md5_value):
    set_file_info(md5_value, ParseState.deny)
    set_file_info_expire(md5_value, 5)

def set_parse_failed(md5_value):
    set_file_info(md5_value, ParseState.failed)
    set_file_info_expire(md5_value, 10)

def set_parse_init(md5_value):
    set_file_info(md5_value, ParseState.init)
    set_file_info_expire(md5_value, 60 * 60)

def set_parse_parsing(md5_value):
    set_file_info(md5_value, ParseState.start)
    set_file_info_expire(md5_value, 60 * 30)

def set_parse_parsed(md5_value, content_list, md_content):
    set_file_info(md5_value, ParseState.done, content_list, md_content)
    set_file_info_expire(md5_value, 60 * 60 * 24)
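The five `set_parse_*` helpers above differ only in the state name and the expiry. As an illustration, the same mapping can be written as a single table. The TTL values are copied from the diff; the names `STATE_TTL` and `build_record` are hypothetical, and the sketch deliberately leaves out the Redis call so that it runs without a server.

```python
import json

# Expiry in seconds per state, as used by the set_parse_* helpers in the diff.
STATE_TTL = {
    "deny": 5,
    "failed": 10,
    "init": 60 * 60,
    "start": 60 * 30,
    "done": 60 * 60 * 24,
}


def build_record(state, content_list="", md_content=""):
    """Return (json_payload, ttl_seconds) for one state transition."""
    if state not in STATE_TTL:
        raise ValueError(f"unknown state: {state}")
    payload = json.dumps(
        {"state": state, "content_list": content_list, "md_content": md_content}
    )
    return payload, STATE_TTL[state]


if __name__ == "__main__":
    payload, ttl = build_record("done", content_list=[], md_content="# title")
    print(ttl)      # 86400
    print(payload)
```

With redis-py, the pair could then be written in one command as `redis_conn.set(md5_value, payload, ex=ttl)`, which avoids a window in which the key exists without an expiry.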
@@ -0,0 +1,5 @@
#!/bin/bash
echo "starting miner server"
source /opt/mineru_venv/bin/activate
cd /minerugw
uvicorn app.main:app --host 0.0.0.0 --port 80
@@ -0,0 +1,28 @@
'''
Author: FutureMeng be_loving@163.com
Date: 2024-11-13 19:05:01
LastEditors: FutureMeng be_loving@163.com
LastEditTime: 2024-11-13 19:06:17
FilePath: \lzmineru\api\test.py
Description: This is the default setting; please set `customMade`, and open koroFileHeader to view and adjust the configuration: https://github.com/OBKoro1/koro1FileHeader/wiki/%E9%85%8D%E7%BD%AE
'''

import urllib.request
import os
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

current_script_dir = os.path.dirname(os.path.abspath(__file__))
local_image_dir = os.path.join(current_script_dir, 'images')
image_dir = str(os.path.basename(local_image_dir))
imageUrl = 'https://one-jiulu.oss-cn-beijing.aliyuncs.com/9250ba5ccbf34249b054d063d32ec8f8.pdf?OSSAccessKeyId=LTAI5tABhdnCgSeVaptuWLfx&Expires=1732100601&Signature=XZqGPO%2BJ76bEJ0ou8GZQUO7vhjs%3D'

pdf_bytes = urllib.request.urlopen(imageUrl).read()
image_writer = DiskReaderWriter(local_image_dir)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
print(md_content)
Binary file not shown.
To avoid increasing the image size by default, this could be made an optional requirement.
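One way to read this suggestion, sketched under the assumption that `fastapi`, `uvicorn`, and `redis` would be installed only when the gateway is wanted: the service entry point could check for the optional packages at startup and report which ones are missing. The helper name `missing_modules` is hypothetical and not part of the PR.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The three packages listed in the PR's requirements file.
    missing = missing_modules(["fastapi", "uvicorn", "redis"])
    if missing:
        print("optional gateway dependencies not installed: " + ", ".join(missing))
    else:
        print("all gateway dependencies are available")
```

The base image would then stay unchanged, and only a gateway-specific build step would need to install the extra requirements file.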