opendatalab · lztiancn · Nov 7, 2024 · Nov 8, 2024 · Nov 8, 2024 · Nov 13, 2024
diff --git a/README.md b/README.md
@@ -67,6 +67,7 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
 - 2024/11/22 0.10.0 released. Introducing hybrid OCR text extraction capabilities,
   - Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.
   - Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.
+- 2024/11/20 Added a FastAPI service. Source: [https://github.com/lztiancn/lzmineru.git](https://github.com/lztiancn/lzmineru.git)
 - 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
 - 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
 - 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
@@ -157,6 +158,11 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
 
 ## Quick Start
 
+
+```
+docker-compose up -d
+```
+
 If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a>. </br>
 If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br>
 There are three different ways to experience MinerU:
@@ -347,6 +353,12 @@ You can enable MPS acceleration by setting the `device-mode` parameter to `mps`
 [Using MinerU via Python API](https://mineru.readthedocs.io/en/latest/user_guide/usage/api.html)
 
 
+
+### FastAPI
+
+Once the gateway container is running, the interactive API documentation is available at [http://localhost:8910/docs](http://localhost:8910/docs)
+
+
 ### Deploy Derived Projects
 
 Derived projects include secondary development projects based on MinerU by project developers and community developers,  
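
Review note: the README points at the interactive docs but does not show a request. The sketch below builds the URL a client might POST to. It assumes the gateway is published on `localhost:8910` (as in `docker-compose.yml`) and that the parameters of `parse_pdf` are read from the query string, which is how FastAPI treats plain `str` parameters. `build_parse_pdf_url` is a hypothetical helper, not part of this PR.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit

# Assumption: the gateway is published on localhost:8910, as in docker-compose.yml.
GATEWAY = "http://localhost:8910"


def build_parse_pdf_url(pdf_url, cb_url=None, cbkey=None, parse_method="auto"):
    """Return the URL to POST to in order to submit `pdf_url` for parsing."""
    # The handler calls unquote() on encodeUrl itself, so quote it once here;
    # urlencode() then adds the transport-level encoding that the framework strips.
    params = {"encodeUrl": quote(pdf_url, safe=""), "parse_method": parse_method}
    if cb_url is not None:
        params["cbUrl"] = cb_url
    if cbkey is not None:
        params["cbkey"] = cbkey
    return f"{GATEWAY}/parse_pdf?{urlencode(params)}"


url = build_parse_pdf_url("https://example.com/a b.pdf", cbkey="job-1")
print(url)

# Round trip: what the handler would recover after the query string is decoded.
received = parse_qs(urlsplit(url).query)["encodeUrl"][0]
print(unquote(received))
```

The extra `quote()` matters only for source URLs that themselves contain percent-encoded characters; without it, the handler's `unquote()` would decode them a second time.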

diff --git a/README_zh-CN.md b/README_zh-CN.md
@@ -66,6 +66,7 @@
 - 2024/11/22 0.10.0发布，通过引入混合OCR文本提取能力，
   - 在公式密集、span区域不规范、部分文本使用图像表现等复杂文本分布场景下获得解析效果的显著提升
   - 同时具备文本模式内容提取准确、速度更快与OCR模式span/line区域识别更准的双重优势
+- 2024/11/20 增加fastapi(futuremeng and tianlz)
 - 2024/11/15 0.9.3发布，为表格识别功能接入了[RapidTable](https://github.com/RapidAI/RapidTable),单表解析速度提升10倍以上，准确率更高，显存占用更低
 - 2024/11/06 0.9.2发布，为表格识别功能接入了[StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B)模型
 - 2024/10/31 0.9.0发布，这是我们进行了大量代码重构的全新版本，解决了众多问题，提升了性能，降低了硬件需求，并提供了更丰富的易用性：
@@ -349,6 +350,11 @@ pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com -i
 [通过Python代码调用MinerU](https://mineru.readthedocs.io/en/latest/user_guide/usage/api.html)
 
 
+
+### FastAPI
+
+接口文档doc [http://localhost:8910/docs](http://localhost:8910/docs)
+
 ### 部署衍生项目
 
 衍生项目包含项目开发者和社群开发者们基于MinerU的二次开发项目，

diff --git a/docker-compose.yml b/docker-compose.yml
@@ -0,0 +1,46 @@
+services:
+  gateway:
+    build:
+      context: ./docker/china
+    depends_on:
+      - redis
+    container_name: mineru-gateway
+    ports:
+      - "8910:80"
+    volumes:
+      - ./services/fastapi/app:/gateway/app:rw
+      - ./services/fastapi/tmp:/gateway/tmp:rw
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    environment:
+      TZ: Asia/Shanghai
+    command: ["/gateway/app/start.sh"]
+    restart: always
+    networks:
+      - default
+
+  redis:
+    image: redis:7.2.4
+    container_name: mineru-redis
+    ports:
+      - "6380:6379"
+    volumes:
+      - ./services/redis/conf/redis.conf:/etc/redis.conf
+      - ./services/redis/data/:/data/
+    restart: always
+    entrypoint: ["redis-server", "/etc/redis.conf"]
+    environment:
+      TZ: Asia/Shanghai
+    networks:
+      - default
+
+networks:
+  default:
+    driver: bridge
+    ipam:
+      driver: default
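
Review note: after `docker-compose up -d` the containers may take a while to accept connections. A minimal sketch of a readiness check follows; `wait_for_port` is a hypothetical helper, and an open port only shows that the process is listening, not that the models have finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the compose file above, `wait_for_port("localhost", 8910)` would check the gateway and `wait_for_port("localhost", 6380)` the Redis container, since those are the published host ports.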
diff --git a/docker/china/Dockerfile b/docker/china/Dockerfile
@@ -4,6 +4,9 @@ FROM ubuntu:22.04
 # Set environment variables to non-interactive to avoid prompts during installation
 ENV DEBIAN_FRONTEND=noninteractive
 
+RUN /bin/bash -c "sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list && \
+                  sed -i 's/security.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list"
+
 # Update the package list and install necessary packages
 RUN apt-get update && \
     apt-get install -y \
@@ -46,5 +49,11 @@ RUN /bin/bash -c "pip3 install modelscope && \
     python3 download_models.py && \
     sed -i 's|cpu|cuda|g' /root/magic-pdf.json"
 
+# Install extra dependencies for the FastAPI gateway
+COPY requirements-fastapi.txt /mineru-gateway/requirements-fastapi.txt
+
+RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
+    pip3 install -r /mineru-gateway/requirements-fastapi.txt"
+
 # Set the entry point to activate the virtual environment and run the command line tool
-ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
+CMD ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
diff --git a/docker/china/requirements-fastapi.txt b/docker/china/requirements-fastapi.txt
@@ -0,0 +1,4 @@
+fastapi>=0.115.2,<0.115.4
+uvicorn>=0.30.0,<0.32.0
+redis>=5.2.0
+requests
diff --git a/docker/global/Dockerfile b/docker/global/Dockerfile
@@ -4,6 +4,9 @@ FROM ubuntu:22.04
 # Set environment variables to non-interactive to avoid prompts during installation
 ENV DEBIAN_FRONTEND=noninteractive
 
+RUN /bin/bash -c "sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list && \
+                  sed -i 's/security.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list"
+
 # Update the package list and install necessary packages
 RUN apt-get update && \
     apt-get install -y \
@@ -46,5 +49,11 @@ RUN /bin/bash -c "pip3 install huggingface_hub && \
     python3 download_models.py && \
     sed -i 's|cpu|cuda|g' /root/magic-pdf.json"
 
+# Install extra dependencies for the FastAPI gateway
+COPY requirements-fastapi.txt /gateway/requirements-fastapi.txt
+
+RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
+    pip3 install -r /gateway/requirements-fastapi.txt"
+
 # Set the entry point to activate the virtual environment and run the command line tool
-ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
+CMD ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
diff --git a/fork-remote.md b/fork-remote.md
@@ -0,0 +1,29 @@
+<!--
+ * @Author: FutureMeng be_loving@163.com
+ * @Date: 2025-01-15 22:45:17
+ * @LastEditors: FutureMeng be_loving@163.com
+ * @LastEditTime: 2025-01-15 22:45:53
+ * @FilePath: \MinerU\fork-remote.md
+ * @Description: 这是默认设置,请设置`customMade`, 打开koroFileHeader查看配置 进行设置: https://github.com/OBKoro1/koro1FileHeader/wiki/%E9%85%8D%E7%BD%AE
+-->
+1. Check whether the upstream remote has already been added
+   ```
+   git remote -v
+   ```
+2. Add the upstream remote (this project is a fork of opendatalab/MinerU)
+   ```
+   git remote add upstream https://github.com/opendatalab/MinerU.git
+   ```
+3. Fetch from upstream
+   ```
+   git fetch upstream
+   ```
+4. Merge the upstream branch
+   ```
+   git merge upstream/master
+   ```
+
+5. Push
+   ```
+   git push
+   ```
diff --git a/requirements-fastapi.txt b/requirements-fastapi.txt
@@ -0,0 +1,3 @@
+fastapi>=0.115.2,<0.115.4
+uvicorn>=0.30.0,<0.32.0
+redis>=5.2.0
diff --git a/services/fastapi/app/__init__.py b/services/fastapi/app/__init__.py
diff --git a/services/fastapi/app/http_util.py b/services/fastapi/app/http_util.py
@@ -0,0 +1,18 @@
+import requests
+import asyncio
+from loguru import logger
+def post_result_callback(url, cbkey, md5_value, content_list, md_content):
+    if (url is None):
+        return
+    asyncio.run(commit_callback_task(url, cbkey, md5_value, content_list, md_content))
+
+async def commit_callback_task(url, cbkey, md5_value, content_list, md_content):
+    await post_callback(url, cbkey, md5_value, content_list, md_content)  # await so the POST finishes before asyncio.run() closes the loop
+
+async def post_callback(url, cbkey, md5_value, content_list, md_content):
+    try:
+        json_result = {"cbkey": cbkey, "md5": md5_value, "content_list": content_list, "md_content": md_content}
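+        requests.post(url, json=json_result, timeout=30)  # send as JSON so the nested content_list survives; give up after 30 s instead of hanging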
+        logger.info("post_callback success, url:{}", url)
+    except Exception as e:
+        logger.exception(e)
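
Review note: `content_list` is a nested structure, so a form-encoded body (`data=`) will not carry it faithfully. The sketch below shows one way to build the callback as a JSON POST. It swaps `requests` for the standard library's `urllib.request` so that it runs without extra dependencies; `build_callback_request` is a hypothetical name and the sample values are placeholders.

```python
import json
import urllib.request


def build_callback_request(url, cbkey, md5_value, content_list, md_content):
    """Build (but do not send) a JSON POST carrying the parse result."""
    payload = {
        "cbkey": cbkey,
        "md5": md5_value,
        "content_list": content_list,
        "md_content": md_content,
    }
    # ensure_ascii=False keeps non-ASCII text readable in the body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


# Placeholder values, for illustration only.
req = build_callback_request(
    "http://example.com/hook", "job-1", "d41d8cd9", [{"type": "text", "text": "你好"}], "# Title"
)
decoded = json.loads(req.data.decode("utf-8"))
print(req.get_method(), decoded["content_list"])
# Sending would be: urllib.request.urlopen(req, timeout=30)
```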
diff --git a/services/fastapi/app/magic_pdf_parse_util.py b/services/fastapi/app/magic_pdf_parse_util.py
@@ -0,0 +1,87 @@
+import os
+import json
+import datetime
+import shutil
+
+from loguru import logger
+
+from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+from magic_pdf.data.dataset import PymuDocDataset
+from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+from magic_pdf.config.enums import SupportedPdfParseMethod
+
+from . import redis_util
+from . import http_util
+
+def pdf_parse(
+        md5_value,
+        pdf_bytes: bytes,
+        parse_method: str = 'auto',
+        cbUrl: str = None,
+        cbkey: str = None,
+        model_json_path: str = None,
+        output_dir: str = None
+):
+    try:
+        file_info = redis_util.get_file_info(md5_value)
+        if not file_info:
+            return
+        if file_info["state"] != "waiting":
+            return
+        redis_util.set_parse_parsing(md5_value)
+        current_script_dir = os.path.dirname(os.path.abspath(__file__))
+        foldname = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
+        if output_dir:
+            output_path = os.path.join(output_dir, foldname)
+        else:
+            output_path = os.path.join(current_script_dir, foldname)
+
+        output_image_path = os.path.join(output_path, 'images')
+        local_md_dir = os.path.join(output_path, 'output')
+
+        # Take the image directory's base name so that image paths are stored as relative paths in the .md and content_list.json files
+        image_path_parent = os.path.basename(output_image_path)
+
+        if model_json_path:
+            # Load the raw JSON data (a list) that the model already produced for this PDF
+            model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
+        else:
+            model_json = []
+
+        # Run the parsing steps
+        image_writer = FileBasedDataWriter(output_image_path)
+
+        ds = PymuDocDataset(pdf_bytes)
+
+        ## inference
+        if ds.classify() == SupportedPdfParseMethod.OCR:
+            infer_result = ds.apply(doc_analyze, ocr=True)
+
+            ## pipeline
+            pipe_result = infer_result.pipe_ocr_mode(image_writer)
+
+        else:
+            infer_result = ds.apply(doc_analyze, ocr=False)
+
+            ## pipeline
+            pipe_result = infer_result.pipe_txt_mode(image_writer)
+
+        ### draw model result on each page
+        infer_result.draw_model(os.path.join(local_md_dir, "model.pdf"))
+
+        ### get model inference result
+        model_inference_result = infer_result.get_infer_res()
+
+        ### get markdown content
+        md_content = pipe_result.get_markdown(output_image_path)
+
+        ### get content list content
+        content_list = pipe_result.get_content_list(output_image_path)
+
+        # Delete the temporary output folder
+        shutil.rmtree(output_path)
+        redis_util.set_parse_parsed(md5_value, content_list, md_content)
+        http_util.post_result_callback(cbUrl, cbkey, md5_value, content_list, md_content)
+    except Exception as e:
+        redis_util.set_parse_failed(md5_value)
+        logger.exception(e)
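
Review note: `pdf_parse` names its scratch folder after a second-resolution timestamp and removes it only on the success path, so a failed parse leaves the folder behind. A minimal sketch of an alternative using `tempfile.TemporaryDirectory` follows; `run_in_scratch_dir` and `fake_parse` are hypothetical stand-ins for the real parsing step.

```python
import os
import tempfile


def run_in_scratch_dir(work, base_dir=None):
    """Call work(images_dir, output_dir) inside a directory that is always removed."""
    with tempfile.TemporaryDirectory(prefix="mineru_", dir=base_dir) as scratch:
        images_dir = os.path.join(scratch, "images")
        output_dir = os.path.join(scratch, "output")
        os.makedirs(images_dir)
        os.makedirs(output_dir)
        # The directory is deleted when the with-block exits, even if work() raises.
        return work(images_dir, output_dir), scratch


def fake_parse(images_dir, output_dir):
    # Stand-in for the real parsing step: write a file, then return its contents.
    path = os.path.join(output_dir, "result.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write("# parsed")
    with open(path, encoding="utf-8") as f:
        return f.read()


result, scratch_path = run_in_scratch_dir(fake_parse)
print(result, os.path.exists(scratch_path))
```

The unique directory name also avoids collisions if two parses ever start within the same second.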
diff --git a/services/fastapi/app/main.py b/services/fastapi/app/main.py
@@ -0,0 +1,101 @@
+import hashlib
+import os
+import queue
+import threading
+from urllib import parse, request
+
+from fastapi import FastAPI
+from loguru import logger
+
+from . import magic_pdf_parse_util
+from . import redis_util
+
+message_queue = queue.Queue(20)
+
+app = FastAPI()
+
+def calc_md5(byteContent: bytes):
+    hash_md5 = hashlib.md5()
+    hash_md5.update(byteContent)
+    return hash_md5.hexdigest()
+
+def commit_parse_task(md5_value, parse_method, cbUrl, cbkey):
+    message_queue.put({"md5": md5_value, "parse_method": parse_method, "cbUrl": cbUrl, "cbkey": cbkey})
+
+def queue_consumer(q):
+    while True:
+        item = q.get()
+        if (item):
+            file_path='/gateway/tmp/'+item['md5']+'.pdf'
+            if os.access(file_path, os.R_OK):
+                byteContent = open(file_path,'rb').read() 
+                logger.info('start to parse '+item['md5'])
+                magic_pdf_parse_util.pdf_parse(item['md5'], byteContent, item['parse_method'], item['cbUrl'], item['cbkey'])
+            else:
+                logger.error(file_path+' can not read')
+            q.task_done()
+
+consumer_thread = threading.Thread(target=queue_consumer, args=(message_queue,))
+consumer_thread.start()
+
+
+initial_pdf_list = redis_util.get_init_list()
+logger.info(initial_pdf_list)
+for item in initial_pdf_list:
+    redis_util.set_parse_init(item['md5'])
+    commit_parse_task(item['md5'], 'auto', None, None)  # no callback for tasks restored at startup
+
+MAX_FILE_SIZE = 1024 * 1024 * 50
+
+@app.post("/parse_pdf")
+async def parse_pdf(encodeUrl: str = None, md5: str = None, cbUrl: str = None, cbkey: str = None, parse_method: str = 'auto'):
+    if (encodeUrl is None and md5 is None):
+        return {"state": "failed", "error": "encodeUrl or md5 is required" }
+    if (md5):
+        file_info = redis_util.get_file_info(md5)
+        if (file_info):
+            return file_info
+    if (encodeUrl is None):
+        return {"state": "failed", "error": "encodeUrl is required when md5 is not valid"}
+
+    try:
+        decodeUrl = parse.unquote(encodeUrl)
+        with request.urlopen(decodeUrl) as response:
+            if response.getheader('Content-Length') and int(response.getheader('Content-Length')) > MAX_FILE_SIZE:
+                return {"state": "failed", "error": "File size exceeds the limit."}
+            pdf_bytes = bytearray()
+            while True:
+                chunk = response.read(4096)
+                if not chunk:
+                    break
+                pdf_bytes.extend(chunk)
+                if len(pdf_bytes) > MAX_FILE_SIZE:
+                    return {"state": "failed", "error": "File size exceeds the limit."}
+    except Exception as e:
+        logger.error(f"Error downloading PDF: {e}")
+        return {"state": "failed", "error": "encodeUrl is not valid."}
+
+    md5_value = calc_md5(pdf_bytes)
+
+    file_path = f'/gateway/tmp/{md5_value}.pdf'
+
+    # Cache the download on disk; the consumer thread reads it back by its MD5.
+    if not os.path.isfile(file_path):
+        with open(file_path, 'wb') as tmp_file:
+            tmp_file.write(pdf_bytes)
+
+
+    file_info = redis_util.get_file_info(md5_value)
+    if file_info:
+        file_info['queue'] = message_queue.qsize()
+        return file_info
+    # Record the initial state before enqueueing: pdf_parse skips any task that has no state in Redis yet.
+    redis_util.set_parse_init(md5_value)
+    try:
+        commit_parse_task(md5_value, parse_method, cbUrl, cbkey)
+    except Exception:
+        redis_util.set_parse_deny(md5_value)
+        return redis_util.get_file_info(md5_value)
+    file_info = redis_util.get_file_info(md5_value)
+    file_info['queue'] = message_queue.qsize()
+    return file_info
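
Review note: `queue.Queue.put` blocks by default when the queue is full, so the `except` branch around `commit_parse_task` would not run on a full queue; the request would simply wait. If the intent is to reject work when the queue is full, `put_nowait` raises `queue.Full` instead. The sketch below illustrates that pattern with a hypothetical `try_enqueue` helper and a deliberately small queue.

```python
import queue
import threading

tasks = queue.Queue(maxsize=2)  # small bound so the example fills up quickly


def try_enqueue(task):
    """Return True if the task was queued, False if the queue is full."""
    try:
        tasks.put_nowait(task)  # raises queue.Full instead of blocking
        return True
    except queue.Full:
        return False


# No consumer is running yet, so the third task is rejected.
accepted = [try_enqueue({"md5": f"file{i}"}) for i in range(3)]
print(accepted)

processed = []


def consumer():
    while True:
        item = tasks.get()
        try:
            if item is None:  # sentinel: stop the worker
                return
            processed.append(item["md5"])
        finally:
            tasks.task_done()  # always mark the item done, even on errors


worker = threading.Thread(target=consumer, daemon=True)
worker.start()
tasks.join()      # wait for the two accepted tasks
tasks.put(None)   # ask the worker to exit
worker.join(timeout=5)
print(processed)
```

Marking the worker as a daemon thread and giving it a sentinel also lets the process shut down cleanly, which a non-daemon `while True` loop does not.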