【PPMix No.4】POINTS-Qwen-2-5-7B-Chat inference alignment #1241


Merged
merged 13 commits into from
Apr 28, 2025
104 changes: 104 additions & 0 deletions paddlemix/examples/points_qwen2_5/README.md
@@ -0,0 +1,104 @@
# POINTS-Qwen-2-5-7B-Chat
@lyuwenyu (Collaborator), Apr 27, 2025: Should this name be updated in the docs as well?

Contributor Author: Oops, embarrassing; I didn't notice that.


## 1. Model Introduction

[POINTS-Qwen](https://huggingface.co/WePOINTS/POINTS-Qwen-2-5-7B-Chat) integrates the latest advances in vision-language models with cutting-edge innovations proposed by the WeChat AI team.

- **Strong baseline**: integrates the latest advances in the vision-language field, namely CapFusion, a dual vision encoder, and dynamic high resolution, into POINTS.

- **Pre-training dataset filtering**: proposes using perplexity as a metric to filter the pre-training dataset. This filtering strategy significantly reduces the size of the pre-training dataset while improving model performance.

- **Model Soup**: proposes applying model-soup techniques to models fine-tuned on different visual instruction-tuning datasets, which further improves performance significantly.
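The model-soup idea above amounts to a uniform average of the parameters of several fine-tuned checkpoints. A minimal sketch with toy list-valued parameters (illustrative only; `average_model_soup` and the checkpoint names are hypothetical, not part of the POINTS code):

```python
def average_model_soup(state_dicts):
    """Uniformly average parameter values across checkpoints (a 'uniform soup')."""
    keys = state_dicts[0].keys()
    return {
        k: [sum(vals) / len(vals) for vals in zip(*(sd[k] for sd in state_dicts))]
        for k in keys
    }

# two hypothetical checkpoints of the same architecture, fine-tuned on
# different instruction datasets, with parameters flattened to plain lists
ckpt_a = {"proj.weight": [1.0, 2.0, 3.0]}
ckpt_b = {"proj.weight": [3.0, 0.0, 1.0]}

soup = average_model_soup([ckpt_a, ckpt_b])
print(soup["proj.weight"])  # [2.0, 1.0, 2.0]
```

In practice the same element-wise mean is taken over every tensor in the checkpoints' state dicts.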

**Model weights supported by this repository:**

| Model |
|--------------------|
| WePOINTS/POINTS-Qwen-2-5-7B-Chat |
Collaborator: Have the weights been uploaded?

Contributor Author: How do I upload the weights? Haha, that reference doc is too brief.

Collaborator: Haha, no big deal. Were these weights converted from torch? If so, please also upload the conversion script.

Contributor Author: Modified as requested.



## 2 Environment Setup
1) [Install PaddlePaddle](https://github.com/PaddlePaddle/PaddleMIX?tab=readme-ov-file#3-%EF%B8%8F%E5%AE%89%E8%A3%85paddlepaddle)
- **python >= 3.10**
- **paddlepaddle-gpu must be 3.0.0b2 or the develop version**
```bash
# Three example PaddlePaddle install commands are provided; you can also follow the install guide on the PaddleMIX homepage

# Example: install 3.0.0b2 (CUDA 11.8)
python -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
Collaborator: I'd suggest also trying an upgrade to 3.0.0b4.

@zhaop-l, Apr 24, 2025 (Contributor Author): My local tests used 3.0.0; I haven't seen a 3.0.0b4 release of paddlepaddle. (screenshot omitted)

For paddlenlp, my local tests did use 3.0.0b4.


# Example: install the develop version
python -m pip install paddlepaddle-gpu==0.0.0.post118 -f https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html

# Quick install via the shell script
sh build_paddle_env.sh
```

2) [Install the PaddleMIX dependency packages](https://github.com/PaddlePaddle/PaddleMIX?tab=readme-ov-file#3-%EF%B8%8F%E5%AE%89%E8%A3%85paddlepaddle)
- **paddlenlp >= 3.0.0b3**

```bash
# Two example commands for installing the PaddleMIX dependencies

# pip install example: installs paddlemix, ppdiffusers, the project requirements, and paddlenlp
python -m pip install -e . --user
python -m pip install -e ppdiffusers --user
python -m pip install -r requirements.txt --user
python -m pip install paddlenlp==3.0.0b4 --user

# Quick install via the shell script
sh build_env.sh
```

> Note:
> * Make sure the dependencies above are installed, otherwise the examples will not run. You also need to install the custom ops under paddlemix/external_ops via `python setup.py install`; if the ops still cannot be found afterwards, set PYTHONPATH accordingly.
> * flash_attn is enabled by default and requires an A100/A800 or H20 GPU. On V100, run inference in float16.

## 3 Model Conversion

Use the following command to convert the torch weights to paddle weights.

```bash
# Convert torch weights to paddle format
python paddlemix/examples/points_qwen2_5/convert_torch_to_paddle.py --torch_model_path ./models/POINTS-Qwen-2-5-7B-Chat/ --paddle_model_path ./models/POINTS-Qwen-2-5-7B-Chat_pd
```
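The conversion script mainly renames parameters and transposes 2-D Linear weights, since torch's `nn.Linear` stores its weight as `[out_features, in_features]` while paddle's `nn.Linear` expects `[in_features, out_features]`. A minimal sketch of that transpose step on plain nested lists (illustrative only; `transpose_2d` is not part of the repository's script, which operates on GPU tensors via DLPack):

```python
def transpose_2d(mat):
    """Transpose a 2-D weight stored as a list of rows."""
    return [list(col) for col in zip(*mat)]

# torch-style Linear weight with out_features=3, in_features=2
torch_w = [[1, 2], [3, 4], [5, 6]]   # shape [3, 2]
paddle_w = transpose_2d(torch_w)     # shape [2, 3], paddle layout
print(paddle_w)  # [[1, 3, 5], [2, 4, 6]]
```

Keys that are not 2-D Linear weights (biases, norms, embeddings with matching layout) are copied through with only a rename.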

## 4 Quick Start

### Inference

```bash
# Single-image inference
python paddlemix/examples/points_qwen2_5/image_infer.py --model_path ./models/POINTS-Qwen-2-5-7B-Chat_pd/ --image_file ./paddlemix/demo_images/examples_image2.jpg
```

![](../../demo_images/examples_image2.jpg)

**Prompt:**

>please describe the image in detail

**Result:**

>The image features a giant panda sitting amidst a lush environment. The panda, with its distinctive black and white fur, is holding a bamboo shoot, which is a staple in its diet. The panda's eyes are looking slightly to the side, giving it a contemplative expression. Surrounding the panda are various green plants, including bamboo shoots and other foliage, which contribute to the natural of a natural habitat. The ground is covered with what appears to be a layer of mulch or soil, and the overall setting suggests a well-maintained enclosure, likely within a zoo or conservation area.



### References

```BibTeX
@article{liu2024points,
  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2409.04828},
  year={2024}
}

@article{liu2024rethinking,
  title={Rethinking Overlooked Aspects in Vision-Language Models},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2405.11850},
  year={2024}
}

```
173 changes: 173 additions & 0 deletions paddlemix/examples/points_qwen2_5/convert_torch_to_paddle.py
@@ -0,0 +1,173 @@
# -*- coding: utf-8 -*-

# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# @Time : 2025/4/25 11:33 PM
# @Author : zhaop-l(zhaop-l@glocon.com)
import argparse
import copy
import json
import os
import shutil

import paddle
import torch
from safetensors.numpy import save_file
from safetensors.torch import load_file

from paddlemix.utils.log import logger

need_transpose = {
    # -- language model (CustomLlamaForCausalLM) --
    "attention.query_dense.weight",
    "attention.key_value_dense.weight",
    "attention.dense.weight",
    "mlp.dense_h_to_4h.weight",
    "mlp.dense_4h_to_h.weight",
    "llm.embed_out.weight",
    # -- dual vision encoders (general_vit + ocr_vit) --
    # self_attn
    "self_attn.k_proj.weight",
    "self_attn.v_proj.weight",
    "self_attn.q_proj.weight",
    "self_attn.out_proj.weight",
    # mlp
    "mlp.fc1.weight",
    "mlp.fc2.weight",
    # -- vision_projector resampling / projection layers --
    "vision_projector.0.weight",
    "vision_projector.2.weight",
}

rename_layers = {
    "embeddings.class_embedding": "class_embedding",
    "embeddings.patch_embedding.weight": "conv1.weight",
    "embeddings.position_embedding": "positional_embedding",
    "pre_layrnorm": "ln_pre",
    "vision_model.encoder": "vision_model.transformer",
    "layer_norm1": "norm1",
    "layer_norm2": "norm2",
    "mlp.fc1": "linear1",
    "mlp.fc2": "linear2",
    "post_layernorm": "ln_post",
}


def execute_cmd(cmd, file_path):
    cmd = cmd + " " + file_path
    os.system(cmd)


def check_trans(key, _need_transpose):
    process_list = []
    for x in _need_transpose:
        if x in key:
            process_list.append(x)
    if len(process_list) > 0:
        return True, process_list
    else:
        return False, None


def translate_one_safetensors(file_name: str, dst_path: str, model_path: str):
    tensors = load_file(os.path.join(model_path, file_name))
    for key in list(tensors.keys()):
        dst_key = key
        shape_ = tensors[key].shape
        rename_flag, rename_key = check_trans(key, rename_layers)
        if rename_flag:
            for _r in rename_key:
                dst_key = dst_key.replace(_r, rename_layers[_r])
        t_flag, _ = check_trans(key, need_transpose)
        # move the tensor to GPU; transpose 2-D Linear weights because
        # torch stores them as [out, in] while paddle expects [in, out]
        t = tensors.pop(key).cuda()
        if t_flag and len(shape_) == 2:
            t = t.t().contiguous()
        # hand the tensor from torch to paddle via DLPack without a host copy
        capsule = torch.utils.dlpack.to_dlpack(t)
        t = paddle.utils.dlpack.from_dlpack(capsule)
        tensors[dst_key] = t.numpy()

    save_file(tensors, os.path.join(dst_path, file_name), metadata={"format": "np"})


def main(args):
    model_path = args.torch_model_path
    if args.paddle_model_path is not None:
        dst_path = args.paddle_model_path
    else:
        dst_path = model_path.rstrip("/") + "_pd"
    os.makedirs(dst_path, exist_ok=True)

    logger.info(f"torch model path: {model_path}, paddle model path: {dst_path}")
    logger.info("start convert torch model to paddle model")

    if os.path.exists(os.path.join(model_path, "model.safetensors.index.json")):
        index = json.load(open(os.path.join(model_path, "model.safetensors.index.json")))
        dst_index = copy.deepcopy(index)
        files = set(index["weight_map"].values())

        for key in list(dst_index["weight_map"].keys()):
            rename_flag, rename_key = check_trans(key, rename_layers)
            dst_key = key
            if rename_flag:
                for _r in rename_key:
                    dst_key = dst_key.replace(_r, rename_layers[_r])
            dst_index["weight_map"][dst_key] = dst_index["weight_map"].pop(key)

        for file_name in sorted(os.listdir(model_path)):
            # skip hidden files
            if file_name.startswith("."):
                continue

            if file_name in files:
                # convert safetensors to safetensors (paddle)
                logger.info(f"start convert {file_name}")
                translate_one_safetensors(file_name, dst_path, model_path)
            else:
                # copy config.json and other files
                shutil.copy(os.path.join(model_path, file_name), os.path.join(dst_path, file_name))

        json.dump(dst_index, open(os.path.join(dst_path, "model.safetensors.index.json"), "w"), indent=2)

    else:
        for file_name in sorted(os.listdir(model_path)):
            # skip hidden files
            if file_name.startswith("."):
                continue

            logger.info(file_name)
            if file_name == "model.safetensors":
                # convert safetensors to safetensors (paddle)
                translate_one_safetensors(file_name, dst_path, model_path)
            else:
                # copy config.json and other files
                shutil.copy(os.path.join(model_path, file_name), os.path.join(dst_path, file_name))

    # rewrite config.json: paddle configs use "dtype" rather than "torch_dtype"
    execute_cmd(cmd="sed -i -e 's/torch_dtype/dtype/g' ", file_path=os.path.join(dst_path, "config.json"))

    # drop the transformers_version field, which paddlenlp does not use
    execute_cmd(cmd="sed -i /transformers_version/d ", file_path=os.path.join(dst_path, "config.json"))

    logger.info(f"convert torch model to paddle model success, paddle model path: {dst_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--torch_model_path", type=str, default="POINTS-Qwen-2-5-7B-Chat")
    parser.add_argument("--paddle_model_path", type=str, default=None)
    args = parser.parse_args()
    main(args)
58 changes: 58 additions & 0 deletions paddlemix/examples/points_qwen2_5/image_infer.py
@@ -0,0 +1,58 @@
# -*- coding: utf-8 -*-

# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# @Time : 2025/4/19 8:37 PM
# @Author : zhaop-l(zhaopuzxjc@126.com)
import argparse

from paddlenlp.transformers import CLIPImageProcessor, Qwen2Tokenizer
from PIL import Image

from paddlemix.models.points_qwen2_5 import POINTSChatModel


def main(args):
    model_path = args.model_path

    model = POINTSChatModel.from_pretrained(model_path)
    tokenizer = Qwen2Tokenizer.from_pretrained(model_path)
    image_processor = CLIPImageProcessor.from_pretrained(model_path)

    image_path = args.image_file
    pil_image = Image.open(image_path)
    question = args.question

    generation_config = {
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "top_p": args.top_p,
        "num_beams": 1,
    }
    res = model.chat(pil_image, question, tokenizer, image_processor, True, generation_config)

    print(f"User: {question}\nAssistant: {res}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str, default="./models/POINTS-Qwen-2-5-7B-Chat_pd")
    parser.add_argument("--question", type=str, default="please describe the image in detail")
    parser.add_argument("--image_file", type=str, default="paddlemix/demo_images/examples_image2.jpg")
    parser.add_argument("--top_p", type=float, default=0.0)
    parser.add_argument("--temperature", type=float, default=0.0)
    parser.add_argument("--max_new_tokens", type=int, default=1024)
    args = parser.parse_args()
    main(args)
19 changes: 19 additions & 0 deletions paddlemix/models/points_qwen2_5/__init__.py
@@ -0,0 +1,19 @@
# -*- coding: utf-8 -*-

# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# @Time : 2025/4/19 8:37 PM
# @Author : zhaop-l(zhaopuzxjc@126.com)
from .modeling_points_chat import *