PaddlePaddle
diff --git a/docs/version3.x/module_usage/doc_vlm.en.md b/docs/version3.x/module_usage/doc_vlm.en.md
Lines changed: 255 additions & 0 deletions
diff --git a/docs/version3.x/module_usage/doc_vlm.md b/docs/version3.x/module_usage/doc_vlm.md
Lines changed: 21 additions & 6 deletions
@@ -0,0 +1,255 @@
+---
+comments: true
+---
+
+# Tutorial on Using the Document Visual Language Model Module
+
+## I. Overview
+
+Document visual language models are multimodal models designed to overcome the limitations of traditional document processing methods. Traditional methods typically handle only specific formats or predefined categories of document information, whereas document visual language models integrate visual and linguistic information to understand and process diverse document content. By combining computer vision and natural language processing, they recognize images, text, and the relationships between them, and can interpret semantic information within complex layouts. This makes document processing more flexible and improves generalization, with applications in office automation, information extraction, and related fields.
+
+## II. Supported Model List
+
+<table>
+<tr>
+<th>Model</th><th>Model Download Link</th>
+<th>Model Storage Size (GB)</th>
+<th>Total Score</th>
+<th>Description</th>
+</tr>
+<tr>
+<td>PP-DocBee-2B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee-2B_infer.tar">Inference Model</a></td>
+<td>4.2</td>
+<td>765</td>
+<td rowspan="2">PP-DocBee is a multimodal large model developed in-house by the PaddlePaddle team. It focuses on document understanding and performs especially well on Chinese document understanding tasks. The model is fine-tuned on nearly 5 million multimodal document understanding samples, covering general VQA, OCR, charts, text-rich documents, mathematics and complex reasoning, synthetic data, and pure text data, mixed at different training ratios. On several authoritative academic benchmarks for English document understanding, PP-DocBee largely achieves SOTA among models of the same parameter scale. On internal metrics for Chinese business scenarios, PP-DocBee also outperforms currently popular open-source and closed-source models.</td>
+</tr>
+<tr>
+<td>PP-DocBee-7B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee-7B_infer.tar">Inference Model</a></td>
+<td>15.8</td>
+<td>-</td>
+</tr>
+<tr>
+<td>PP-DocBee2-3B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee2-3B_infer.tar">Inference Model</a></td>
+<td>7.6</td>
+<td>852</td>
+<td>PP-DocBee2 is a multimodal large model developed in-house by the PaddlePaddle team. It builds on PP-DocBee with a further optimized base model and a new data optimization scheme that improves data quality. With just 470,000 samples generated by an in-house data synthesis strategy, PP-DocBee2 performs better on Chinese document understanding tasks. On internal metrics for Chinese business scenarios, PP-DocBee2 improves on PP-DocBee by about 11.4% and also outperforms currently popular open-source and closed-source models of the same scale.</td>
+</tr>
+</table>
+
+<b>Note: The total scores above were measured on an internal evaluation set of 1196 samples. All images have a resolution (height, width) of (1680, 1204), and the set covers financial reports, laws and regulations, science and engineering papers, manuals, humanities papers, contracts, research reports, and similar scenarios. There are currently no plans to release it publicly.</b>
+
+## III. Quick Start
+
+> ❗ Before getting started, please install the PaddleOCR wheel package. For details, see the [Installation Guide](../ppocr/installation.md).
+
+You can try the module with a single command:
+
+```bash
+paddleocr doc_vlm -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出'}"
+```
+
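Quoting a dict literal inside a shell string is easy to get wrong. As a minimal sketch, the same command can be assembled from Python so that the `-i` payload is generated rather than typed by hand. `build_doc_vlm_command` is a hypothetical helper name, and the sketch assumes the CLI expects exactly the single-quoted dict form shown above, which is what `repr()` produces for plain strings that contain no quote characters.

```python
import ast


def build_doc_vlm_command(image: str, query: str) -> list[str]:
    """Return the argv list for the `paddleocr doc_vlm` command shown above.

    Hypothetical helper: it only formats arguments and does not run anything.
    """
    payload = {"image": image, "query": query}
    # repr() of a dict of plain strings yields the {'image': ..., 'query': ...}
    # literal used in the documented command.
    return ["paddleocr", "doc_vlm", "-i", repr(payload)]


command = build_doc_vlm_command(
    "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png",
    "识别这份表格的内容, 以markdown格式输出",
)
# The payload round-trips as a Python literal, so no quoting was lost.
assert ast.literal_eval(command[-1])["query"] == "识别这份表格的内容, 以markdown格式输出"
```

Passing the list to `subprocess.run(command, check=True)` would bypass the shell, so no additional shell quoting is needed.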
+You can also integrate inference with the open document visual language model module into your own project. Before running the following code, please download the [sample image](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png) to your local machine.
+
+```python
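+from paddleocr import DocVLM
+
+# Instantiate the module with the PP-DocBee2-3B model.
+model = DocVLM(model_name="PP-DocBee2-3B")
+# The query asks: "Recognize the content of this table and output it in Markdown format."
+results = model.predict(
+    input={"image": "medal_table.png", "query": "识别这份表格的内容, 以markdown格式输出"},
+    batch_size=1
+)
+for res in results:
+    res.print()  # Print the result to the terminal.
+    res.save_to_json("./output/res.json")  # Save the result as a JSON file.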
+```
+
+After running, the result is:
+
+```bash
+{'res': {'image': 'medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国（CHN） | 48 | 22 | 30 | 100 |\n| 2 | 美国（USA） | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯（RUS） | 24 | 13 | 23 | 60 |\n| 4 | 英国（GBR） | 19 | 13 | 19 | 51 |\n| 5 | 德国（GER） | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚（AUS） | 14 | 15 | 17 | 46 |\n| 7 | 韩国（KOR） | 13 | 11 | 8 | 32 |\n| 8 | 日本（JPN） | 9 | 8 | 8 | 25 |\n| 9 | 意大利（ITA） | 8 | 9 | 10 | 27 |\n| 10 | 法国（FRA） | 7 | 16 | 20 | 43 |\n| 11 | 荷兰（NED） | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰（UKR） | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚（KEN） | 6 | 4 | 6 | 16 |\n| 14 | 西班牙（ESP） | 5 | 11 | 3 | 19 |\n| 15 | 牙买加（JAM） | 5 | 4 | 2 | 11 |\n'}}
+```
+
+The fields in the result are as follows:
+- `image`: Path of the input image
+- `query`: Input text query
+- `result`: The model's prediction output
+
+Rendered with its line breaks, the `result` string is the following Markdown table:
+
+```bash
+| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |
+| --- | --- | --- | --- | --- | --- |
+| 1 | 中国（CHN） | 48 | 22 | 30 | 100 |
+| 2 | 美国（USA） | 36 | 39 | 37 | 112 |
+| 3 | 俄罗斯（RUS） | 24 | 13 | 23 | 60 |
+| 4 | 英国（GBR） | 19 | 13 | 19 | 51 |
+| 5 | 德国（GER） | 16 | 11 | 14 | 41 |
+| 6 | 澳大利亚（AUS） | 14 | 15 | 17 | 46 |
+| 7 | 韩国（KOR） | 13 | 11 | 8 | 32 |
+| 8 | 日本（JPN） | 9 | 8 | 8 | 25 |
+| 9 | 意大利（ITA） | 8 | 9 | 10 | 27 |
+| 10 | 法国（FRA） | 7 | 16 | 20 | 43 |
+| 11 | 荷兰（NED） | 7 | 5 | 4 | 16 |
+| 12 | 乌克兰（UKR） | 7 | 4 | 11 | 22 |
+| 13 | 肯尼亚（KEN） | 6 | 4 | 6 | 16 |
+| 14 | 西班牙（ESP） | 5 | 11 | 3 | 19 |
+| 15 | 牙买加（JAM） | 5 | 4 | 2 | 11 |
+```
+
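Because `result` is a plain Markdown string, downstream code usually needs to turn it back into structured rows. The following is a minimal sketch of one way to do that; `parse_markdown_table` is a hypothetical helper, not part of PaddleOCR, and it assumes a well-formed table with no escaped pipes inside cells, which model output does not guarantee.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Convert a pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in markdown.splitlines() if line.strip()]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> list[str]:
        # Drop the leading and trailing pipe, then trim each cell.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    # lines[1] is the `| --- | --- |` separator and carries no data.
    return [dict(zip(header, split_row(line))) for line in lines[2:]]


# First rows of the `result` string from the output above.
result = (
    "| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n"
    "| --- | --- | --- | --- | --- | --- |\n"
    "| 1 | 中国（CHN） | 48 | 22 | 30 | 100 |\n"
    "| 2 | 美国（USA） | 36 | 39 | 37 | 112 |\n"
)
rows = parse_markdown_table(result)
print(rows[0]["国家/地区"], rows[0]["金牌"])  # prints: 中国（CHN） 48
```

All cell values stay as strings; converting the medal counts to `int` is left to the caller, since the model may emit non-numeric text.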
+The relevant methods and parameters are described below:
+
+* `DocVLM` instantiates the document visual language model (taking `PP-DocBee2-3B` as an example). Its parameters are as follows:
+<table>
+<thead>
+<tr>
+<th>Parameter</th>
+<th>Description</th>
+<th>Type</th>
+<th>Options</th>
+<th>Default</th>
+</tr>
+</thead>
+<tr>
+<td><code>model_name</code></td>
+<td>Model Name</td>
+<td><code>str</code></td>
+<td>None</td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>model_dir</code></td>
+<td>Model Storage Path</td>
+<td><code>str</code></td>
+<td>None</td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>device</code></td>
+<td>Model Inference Device</td>
+<td><code>str</code></td>
+<td>Supports a specific GPU card such as "gpu:0", a specific card on other hardware such as "npu:0", or the CPU via "cpu".</td>
+<td><code>gpu:0</code></td>
+</tr>
+<tr>
+<td><code>use_hpip</code></td>
+<td>Whether to enable high-performance inference plugin. Currently not supported.</td>
+<td><code>bool</code></td>
+<td>None</td>
+<td><code>False</code></td>
+</tr>
+<tr>
+<td><code>hpi_config</code></td>
+<td>High-performance inference configuration. Currently not supported.</td>
+<td><code>dict</code> | <code>None</code></td>
+<td>None</td>
+<td><code>None</code></td>
+</tr>
+</table>
+
+* `model_name` is required. When only `model_name` is specified, PaddleX's built-in model parameters are used by default. If `model_dir` is also specified, the user-defined model is used instead.
+
+* Call the `predict()` method of the document visual language model to run inference; it returns a list of results. The module also provides a `predict_iter()` method. The two accept the same parameters and return the same results, but `predict_iter()` returns a `generator` that produces prediction results one at a time, which suits large datasets or memory-constrained scenarios. Choose whichever fits your needs. The parameters of `predict()` are `input` and `batch_size`, described below:
+
+<table>
+<thead>
+<tr>
+<th>Parameter</th>
+<th>Description</th>
+<th>Type</th>
+<th>Options</th>
+<th>Default</th>
+</tr>
+</thead>
+<tr>
+<td><code>input</code></td>
+<td>Data to be predicted</td>
+<td><code>dict</code></td>
+<td>
+<code>Dict</code>. Multimodal models have different input requirements, so the format depends on the specific model. Specifically:
+<li>PP-DocBee series input format is <code>{'image': image_path, 'query': query_text}</code></li>
+</td>
+<td>None</td>
+</tr>
+<tr>
+<td><code>batch_size</code></td>
+<td>Batch Size</td>
+<td><code>int</code></td>
+<td>Integer</td>
+<td>1</td>
+</tr>
+</table>
+
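The generator behaviour of `predict_iter()` is useful when asking several questions about one image. The sketch below wraps it in a hypothetical helper, `stream_answers`, that yields each answer as soon as it is ready. It assumes, as stated above, that `predict_iter()` accepts the same `input` and `batch_size` keyword arguments as `predict()`; the helper itself is illustrative and not part of PaddleOCR.

```python
from typing import Any, Iterator


def stream_answers(
    model: Any, image_path: str, queries: list[str]
) -> Iterator[tuple[str, Any]]:
    """Yield (query, result) pairs one at a time instead of building a list.

    `model` is expected to be a DocVLM instance; any object exposing a
    compatible `predict_iter(input=..., batch_size=...)` method also works.
    """
    for query in queries:
        request = {"image": image_path, "query": query}
        # predict_iter() returns a generator, so each result is produced lazily.
        for res in model.predict_iter(input=request, batch_size=1):
            yield query, res


# Intended usage (requires PaddleOCR and the model files, so it is left as a comment):
#   from paddleocr import DocVLM
#   model = DocVLM(model_name="PP-DocBee2-3B")
#   for query, res in stream_answers(model, "medal_table.png", ["识别这份表格的内容"]):
#       res.print()
```

Because results are yielded one by one, only one Result object needs to be held in memory at a time.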
+* Process the prediction results. The prediction result for each sample is a corresponding Result object, which supports printing and saving as a `json` file:
+
+<table>
+<thead>
+<tr>
+<th>Method</th>
+<th>Method Description</th>
+<th>Parameter</th>
+<th>Type</th>
+<th>Parameter Description</th>
+<th>Default</th>
+</tr>
+</thead>
+<tr>
+<td rowspan = "3"><code>print()</code></td>
+<td rowspan = "3">Print results to terminal</td>
+<td><code>format_json</code></td>
+<td><code>bool</code></td>
+<td>Whether to format the output content using <code>JSON</code> indentation</td>
+<td><code>True</code></td>
+</tr>
+<tr>
+<td><code>indent</code></td>
+<td><code>int</code></td>
+<td>Indentation level used to pretty-print the output <code>JSON</code> data for readability. Effective only when <code>format_json</code> is <code>True</code></td>
+<td>4</td>
+</tr>
+<tr>
+<td><code>ensure_ascii</code></td>
+<td><code>bool</code></td>
+<td>Controls whether non-<code>ASCII</code> characters are escaped to <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters are escaped; <code>False</code> keeps the original characters. Effective only when <code>format_json</code> is <code>True</code></td>
+<td><code>False</code></td>
+</tr>
+<tr>
+<td rowspan = "3"><code>save_to_json()</code></td>
+<td rowspan = "3">Save the result as a JSON file</td>
+<td><code>save_path</code></td>
+<td><code>str</code></td>
+<td>Path of the file to save. When this is a directory, the saved file is named consistently with the input file.</td>
+<td>None</td>
+</tr>
+<tr>
+<td><code>indent</code></td>
+<td><code>int</code></td>
+<td>Indentation level used to pretty-print the output <code>JSON</code> data for readability. Effective only when <code>format_json</code> is <code>True</code></td>
+<td>4</td>
+</tr>
+<tr>
+<td><code>ensure_ascii</code></td>
+<td><code>bool</code></td>
+<td>Controls whether non-<code>ASCII</code> characters are escaped to <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters are escaped; <code>False</code> keeps the original characters. Effective only when <code>format_json</code> is <code>True</code></td>
+<td><code>False</code></td>
+</tr>
+</table>
+
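To make the effect of `indent` and `ensure_ascii` concrete, the sketch below uses Python's standard `json` module on a dict shaped like the printed result above. It assumes these parameters behave like the `json.dumps` arguments of the same name; the source does not confirm how the Result object serializes internally.

```python
import json

# A result shaped like the printed output above, shortened for illustration.
result = {
    "res": {
        "image": "medal_table.png",
        "query": "识别这份表格的内容, 以markdown格式输出",
    }
}

# ensure_ascii=False keeps the original characters in the output.
readable = json.dumps(result, indent=4, ensure_ascii=False)
# ensure_ascii=True escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(result, indent=4, ensure_ascii=True)

print(readable)
print(escaped)
```

Both strings decode to the same data; only the on-disk representation differs, which matters mainly when the file is read by people or by tools that do not handle escaped text well.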
+* Prediction results can also be obtained through attributes, as follows:
+
+<table>
+<thead>
+<tr>
+<th>Attribute</th>
+<th>Description</th>
+</tr>
+</thead>
+<tr>
+<td rowspan = "1"><code>json</code></td>
+<td rowspan = "1">Get the prediction result in <code>json</code> format</td>
+</tr>
+</table>
+
+## IV. Secondary Development
+
+This module does not currently support fine-tuning; only inference integration is supported. Fine-tuning support is planned for a future release.
+
+## V. FAQ
@@ -15,19 +15,31 @@ comments: true
 <tr>
 <th>Model</th><th>Model Download Link</th>
 <th>Model Storage Size (GB)</th>
+<th>Total Score</th>
 <th>Description</th>
 </tr>
 <tr>
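 <td>PP-DocBee-2B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee-2B_infer.tar">Inference Model</a></td>
 <td>4.2</td>
+<td>765</td>
 <td rowspan="2">PP-DocBee is a multimodal large model developed in-house by the PaddlePaddle team. It focuses on document understanding and performs especially well on Chinese document understanding tasks. The model is fine-tuned on nearly 5 million multimodal document understanding samples, covering general VQA, OCR, charts, text-rich documents, mathematics and complex reasoning, synthetic data, and pure text data, mixed at different training ratios. On several authoritative academic benchmarks for English document understanding, PP-DocBee largely achieves SOTA among models of the same parameter scale. On internal metrics for Chinese business scenarios, PP-DocBee also outperforms currently popular open-source and closed-source models.</td>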
 </tr>
 <tr>
 <td>PP-DocBee-7B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee-7B_infer.tar">Inference Model</a></td>
 <td>15.8</td>
+<td>-</td>
+</tr>
+<tr>
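+<td>PP-DocBee2-3B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee2-3B_infer.tar">Inference Model</a></td>
+<td>7.6</td>
+<td>852</td>
+<td>PP-DocBee2 is a multimodal large model developed in-house by the PaddlePaddle team that focuses on document understanding. It builds on PP-DocBee with a further optimized base model and a new data optimization scheme that improves data quality. With just 470,000 samples generated by an in-house data synthesis strategy, PP-DocBee2 performs better on Chinese document understanding tasks. On internal metrics for Chinese business scenarios, PP-DocBee2 improves on PP-DocBee by about 11.4% and also outperforms currently popular open-source and closed-source models of the same scale.</td>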
 </tr>
 </table>
 
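+<b>Note: The total scores above were measured on an internal evaluation set of 1196 samples. All images have a resolution (height, width) of (1680, 1204), and the set covers financial reports, laws and regulations, science and engineering papers, manuals, humanities papers, contracts, research reports, and similar scenarios. There are currently no plans to release it publicly.</b>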
+
+
 
 ## III. Quick Start
 
@@ -36,16 +48,16 @@ comments: true
 You can try the module with a single command:
 
 ```bash
-paddleocr doc_vlm -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容'}"
+paddleocr doc_vlm -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出'}"
 ```
 
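 You can also integrate inference with the open document visual language model module into your own project. Before running the following code, please download the [sample image](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png) to your local machine.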
 
 ```python
 from paddleocr import DocVLM
-model = DocVLM(model_name="PP-DocBee-2B")
+model = DocVLM(model_name="PP-DocBee2-3B")
 results = model.predict(
-    input={"image": "medal_table.png", "query": "识别这份表格的内容"},
+    input={"image": "medal_table.png", "query": "识别这份表格的内容, 以markdown格式输出"},
     batch_size=1
 )
 for res in results:
@@ -56,7 +68,7 @@ for res in results:
 After running, the result is:
 
 ```bash
-{'res': {'image': 'medal_table.png', 'query': '识别这份表格的内容', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国（CHN） | 48 | 22 | 30 | 100 |\n| 2 | 美国（USA） | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯（RUS） | 24 | 13 | 23 | 60 |\n| 4 | 英国（GBR） | 19 | 13 | 19 | 51 |\n| 5 | 德国（GER） | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚（AUS） | 14 | 15 | 17 | 46 |\n| 7 | 韩国（KOR） | 13 | 11 | 8 | 32 |\n| 8 | 日本（JPN） | 9 | 8 | 8 | 25 |\n| 9 | 意大利（ITA） | 8 | 9 | 10 | 27 |\n| 10 | 法国（FRA） | 7 | 16 | 20 | 43 |\n| 11 | 荷兰（NED） | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰（UKR） | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚（KEN） | 6 | 4 | 6 | 16 |\n| 14 | 西班牙（ESP） | 5 | 11 | 3 | 19 |\n| 15 | 牙买加（JAM） | 5 | 4 | 2 | 11 |\n'}}
+{'res': {'image': 'medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国（CHN） | 48 | 22 | 30 | 100 |\n| 2 | 美国（USA） | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯（RUS） | 24 | 13 | 23 | 60 |\n| 4 | 英国（GBR） | 19 | 13 | 19 | 51 |\n| 5 | 德国（GER） | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚（AUS） | 14 | 15 | 17 | 46 |\n| 7 | 韩国（KOR） | 13 | 11 | 8 | 32 |\n| 8 | 日本（JPN） | 9 | 8 | 8 | 25 |\n| 9 | 意大利（ITA） | 8 | 9 | 10 | 27 |\n| 10 | 法国（FRA） | 7 | 16 | 20 | 43 |\n| 11 | 荷兰（NED） | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰（UKR） | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚（KEN） | 6 | 4 | 6 | 16 |\n| 14 | 西班牙（ESP） | 5 | 11 | 3 | 19 |\n| 15 | 牙买加（JAM） | 5 | 4 | 2 | 11 |\n'}}
 ```
 The fields in the result are as follows:
 - `image`: Path of the input image to be predicted
@@ -155,15 +167,16 @@ for res in results:
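 <td>Data to be predicted</td>
 <td><code>dict</code></td>
 <td>
-<code>Dict</code>, depends on the specific model; for example, the input for the PP-DocBee series is {'image': image_path, 'query': query_text}
+<code>Dict</code>. Multimodal models have different input requirements, so the format depends on the specific model. Specifically:
+<li>PP-DocBee series input format is <code>{'image': image_path, 'query': query_text}</code></li>
 </td>
 <td>None</td>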
 </tr>
 <tr>
 <td><code>batch_size</code></td>
 <td>Batch Size</td>
 <td><code>int</code></td>
-<td>Integer (currently only 1 is supported)</td>
+<td>Integer</td>
 <td>1</td>
 </tr>
 </table>
@@ -241,3 +254,5 @@ for res in results:
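 ## IV. Secondary Development
 
 This module does not currently support fine-tuning; only inference integration is supported. Fine-tuning support is planned for a future release.
+
+## V. FAQ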