Commit e1c10ec

Merge pull request #159 from 1649759610/minigpt4_inference
[New Feature] Add MiniGPT4 Inference
2 parents 0ffbb1b + 65816ab commit e1c10ec

File tree

5 files changed: +447, -13 lines
@@ -0,0 +1,120 @@
# MiniGPT4 Inference Acceleration

This project provides inference acceleration for MiniGPT4. The basic approach is to convert the MiniGPT4 dynamic graph into a static graph and then accelerate inference with the PaddleInference library.

The figure below shows the overall MiniGPT4 architecture. At a high level, MiniGPT4 consists of a ViT, a QFormer, and a Vicuna model. Since Vicuna is trained on top of Llama and the implementation actually calls the Llama code, we ignore this inessential distinction and simply refer to the language-model part as Llama in the rest of this document.

In this solution, we export MiniGPT4 as two subgraphs: the ViT and QFormer parts are exported as one static subgraph, and the Llama part as another. The two subgraphs are then combined to perform MiniGPT4 inference.

<center><img src="https://github.com/PaddlePaddle/Paddle/assets/35913314/f0306cb6-4837-4f52-8f57-a0e7e35238f6" /></center>
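
Conceptually, inference chains the two exported subgraphs. The following is an illustrative sketch of the data flow only; the function names are placeholders, not the actual PaddleMIX/PaddleNLP API:

```python
# Illustrative sketch only: the names here are placeholders, not real APIs.
def minigpt4_pipeline(image_encoder, language_model, pixel_values, prompt_ids):
    # stage 1: the ViT+QFormer subgraph maps images to language-model
    # input features plus a matching attention mask
    features, features_mask = image_encoder(pixel_values)
    # stage 2: the Llama subgraph consumes the features together with the
    # tokenized prompt and generates the answer token ids
    return language_model(features, prompt_ids, features_mask)
```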
## 1. Environment Setup
### 1.1 Base Environment
This project has been verified in the following base environment:
- CUDA: 11.7
- python: 3.11
- paddle: develop build

The CUDA version must be >= 11.2. A suitable Paddle build can be downloaded [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html) as needed.
### 1.2 Install the Project Libraries
1. This project requires the PaddleMIX and PaddleNLP libraries, both at their latest develop versions:

```shell
git clone https://github.com/PaddlePaddle/PaddleNLP.git
git clone https://github.com/PaddlePaddle/PaddleMIX.git
```
2. Install paddlenlp_ops:
```shell
cd PaddleNLP/csrc
python setup_cuda.py install
```
3. Finally, set the corresponding environment variable (a quick sanity check is sketched after this list):
```shell
export PYTHONPATH=yourpath/PaddleNLP:yourpath/PaddleMIX
```
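
As a quick sanity check (our suggestion, not part of the original setup), the following should run cleanly once the three steps above are done:

```python
import paddlenlp
import paddlemix
from paddlenlp.utils.import_utils import import_module

# both packages should resolve from your cloned develop checkouts
print(paddlenlp.__file__)
print(paddlemix.__file__)

# one of the custom ops installed by setup_cuda.py and used later
# by the static llama graph
import_module("paddlenlp_ops.get_padding_offset")
print("paddlenlp_ops custom ops are importable")
```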
### 1.3 Special Notes
At present, parts of the PaddleNLP and Paddle code need to be patched in order to run accelerated MiniGPT4 inference. These fixes will gradually be merged into PaddleNLP and Paddle, but for now they have to be applied manually.

1. Patch PaddleNLP:
Following this [branch](https://github.com/1649759610/PaddleNLP/tree/bugfix_minigpt4), replace the following files:
- PaddleNLP/paddlenlp/experimental/transformers/generation_utils.py
- PaddleNLP/paddlenlp/experimental/transformers/llama/modeling.py
- PaddleNLP/llm/export_model.py

2. Patch Paddle:
Go to your Paddle installation directory, open the file paddle/static/io.py, and comment out lines 284-287:
```python
if not skip_prune_program:
    copy_program = copy_program._prune_with_input(
        feeded_var_names=feed_var_names, targets=fetch_vars
    )
```
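
If you are unsure where your Paddle installation lives, a snippet like the following (our suggestion) prints the path of the file to edit:

```python
import os
import paddle

# prints the io.py file whose lines 284-287 need commenting out
print(os.path.join(os.path.dirname(paddle.__file__), "static", "io.py"))
```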
## 2. Exporting MiniGPT4 in Two Stages

### 2.1 Export the First Subgraph
Make sure you are in the directory PaddleMIX/paddlemix/examples/minigpt4/inference, then export with:
```
python export_image_encoder.py \
    --minigpt4_13b_path "your minigpt4 dir path" \
    --save_path "./checkpoints/encode_image/encode_image"
```

**Parameters**:
- minigpt4_13b_path: directory holding the MiniGPT4 checkpoint
- save_path: export path and prefix for the first-stage model
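
As a sanity check (our suggestion, assuming the default save path and a 224x224 input resolution, which is not stated in this document), the exported encoder can be loaded back and run on a dummy image:

```python
import paddle

# load the static graph exported by export_image_encoder.py
encoder = paddle.jit.load("./checkpoints/encode_image/encode_image")
# the export InputSpec is [None, 3, None, None] float32; 224x224 is assumed here
pixel_values = paddle.randn([1, 3, 224, 224], dtype="float32")
outputs = encoder(pixel_values)
```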
### 2.2 Export the Second Subgraph
Go to the directory PaddleNLP/llm, then export with:
```
python export_model.py \
    --model_name_or_path "your llama dir path" \
    --output_path "your output path" \
    --dtype float16 \
    --inference_model \
    --model_prefix llama \
    --model_type llama-img2txt
```

**Parameters**:
- model_name_or_path: directory holding the Llama model
- output_path: export path for the language-model part
- dtype: data type of the model weights
- inference_model: marks this as an inference model
- model_prefix: prefix for the exported model files
- model_type: model type

**Note**: Exporting the Llama part currently has to be done manually under PaddleNLP; one-click export from PaddleMIX will be supported later.
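
Before moving on, it may be worth checking (our suggestion, assuming export_model.py writes its artifacts under --output_path using the --model_prefix name) that the two static-graph files the inference script later expects were produced:

```python
import os

# replace with the value you passed as --output_path
prefix = os.path.join("your output path", "llama")
for ext in (".pdmodel", ".pdiparams"):
    assert os.path.exists(prefix + ext), f"missing {prefix + ext}"
```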
## 3. MiniGPT4 Static-Graph Inference
Go to the directory PaddleMIX/paddlemix/examples/minigpt4/inference and run:
```shell
python run_static_predict.py \
    --first_model_path "The dir name of image encoder model" \
    --second_model_path "The dir name of language model" \
    --minigpt4_path "The minigpt4 dir name of saving tokenizer"
```

**Parameters**:
- first_model_path: path prefix of the static-graph model for the first part (ViT and QFormer)
- second_model_path: path prefix of the static-graph model for the second part (the language model)
- minigpt4_path: directory holding the MiniGPT4 tokenizer
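
The predictor can also be driven programmatically. The following hedged sketch mirrors what the __main__ block of run_static_predict.py does; the paths are placeholders for your own exported prefixes and tokenizer directory:

```python
import argparse

from PIL import Image
from run_static_predict import Predictor

# placeholder paths; point these at your exported prefixes and tokenizer dir
args = argparse.Namespace(
    first_model_path="./checkpoints/encode_image/encode_image",
    second_model_path="./checkpoints/llama/llama",
    minigpt4_path="your minigpt4 dir path",
    use_tensorrt=False, precision="fp16", device="gpu",
    cpu_threads=10, enable_mkldnn=False,
)
predictor = Predictor(args)

# same prompt template as the script's __main__ block
prompt = ("Give the following image: <Img>ImageContent</Img>. "
          "You will be able to see the image once I provide it to you. "
          "Please answer my questions."
          "###Human: <Img><ImageHere></Img> <TextHere>###Assistant:")
msg = predictor.predict(Image.open("mugs.png"), "describe this image", prompt)
print(msg)
```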
The following shows the MiniGPT4 static-graph inference output for this image:

<center><img src="https://paddlenlp.bj.bcebos.com/data/images/mugs.png" /></center>

```text
Reference: The image shows two black and white cats sitting next to each other on a blue background. The cats have black fur and white fur with black noses, eyes, and paws. They are both looking at the camera with a curious expression. The mugs are also blue with the same design of the cats on them. There is a small white flower on the left side of the mug. The background is a light blue color.

Outputs: ['The image shows two black and white cats sitting next to each other on a blue background. The cats have black fur and white fur with black noses, eyes, and paws. They are both looking at the camera with a curious expression. The mugs are also blue with the same design of the cats on them. There is a small white flower on the left side of the mug. The background is a light blue color.##']
```
@@ -0,0 +1,43 @@
import argparse
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["FLAGS_use_cuda_managed_memory"] = "true"

import paddle
from paddlemix import MiniGPT4ForConditionalGeneration


def export(args):
    model = MiniGPT4ForConditionalGeneration.from_pretrained(args.minigpt4_13b_path, vit_dtype="float16")
    model.eval()

    # convert to static graph with specific input description
    model = paddle.jit.to_static(
        model.encode_images,
        input_spec=[
            paddle.static.InputSpec(
                shape=[None, 3, None, None], dtype="float32"),  # images
        ])

    # save to static model
    paddle.jit.save(model, args.save_path)
    print(f"static model has been saved to {args.save_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--minigpt4_13b_path",
        default="your minigpt4 dir path",
        type=str,
        help="The dir name of minigpt4 checkpoint.",
    )
    parser.add_argument(
        "--save_path",
        default="./checkpoints/encode_image/encode_image",
        type=str,
        help="The saving path of static minigpt4.",
    )
    args = parser.parse_args()

    export(args)
@@ -0,0 +1,218 @@
import argparse
import os
import sys
import requests
import numpy as np
import datetime

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["FLAGS_use_cuda_managed_memory"] = "true"

import paddle
from paddle import inference
from paddlenlp.transformers import MiniGPT4Processor
from PIL import Image

from utils import load_real_time_tokens

class Predictor(object):
    def __init__(self, args):
        self.args = args
        # build one inference predictor per exported subgraph
        self.first_predictor, self.first_input_handles, self.first_output_handles = self.create_predictor(
            args.first_model_path)
        self.second_predictor, self.second_input_handles, self.second_output_handles = self.create_predictor(
            args.second_model_path)
        print(f"first_model_path: {args.first_model_path}, {self.first_predictor}")
        print(f"second_model_path: {args.second_model_path}, {self.second_predictor}")
        self.processor = MiniGPT4Processor.from_pretrained(args.minigpt4_path)

    def create_predictor(self, model_path):
        # the custom ops from paddlenlp_ops must be loaded before the
        # static llama graph can be deserialized
        from paddlenlp.utils.import_utils import import_module
        import_module("paddlenlp_ops.encode_rotary_qk")
        import_module("paddlenlp_ops.get_padding_offset")
        import_module("paddlenlp_ops.qkv_transpose_split")
        import_module("paddlenlp_ops.rebuild_padding")
        import_module("paddlenlp_ops.transpose_remove_padding")
        import_module("paddlenlp_ops.write_cache_kv")

        model_file = model_path + ".pdmodel"
        params_file = model_path + ".pdiparams"
        if not os.path.exists(model_file):
            raise ValueError("cannot find model file path {}".format(model_file))
        if not os.path.exists(params_file):
            raise ValueError("cannot find params file path {}".format(params_file))
        config = paddle.inference.Config(model_file, params_file)

        shape_range_file = model_file + "shape.txt"
        # On the first run, uncomment the following line to collect
        # dynamic-shape information:
        # config.collect_shape_range_info(shape_range_file)

        config.switch_ir_optim(True)
        # TensorRT is intended for the first (non-llama) model, but it is
        # currently disabled for both models
        self.args.use_tensorrt = False

        if self.args.device == "gpu":
            # set GPU configs accordingly,
            # such as initializing the gpu memory and enabling tensorrt
            config.enable_use_gpu(100, 0)
            precision_mode = inference.PrecisionType.Half
            if self.args.use_tensorrt:
                config.enable_tuned_tensorrt_dynamic_shape(shape_range_file, True)
                config.enable_tensorrt_engine(
                    max_batch_size=-1, min_subgraph_size=30, precision_mode=precision_mode,
                    use_static=True)

        config.switch_use_feed_fetch_ops(False)
        predictor = paddle.inference.create_predictor(config)
        input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()]
        output_handle = [predictor.get_output_handle(name) for name in predictor.get_output_names()]

        return predictor, input_handles, output_handle

    @paddle.no_grad()
    def encode_images(self, pixel_values):
        # pixel_values is already on the GPU
        [language_model_inputs, language_model_attention_mask] = self.first_predictor.run([pixel_values])
        return language_model_inputs, language_model_attention_mask

    @paddle.no_grad()
    def generate_with_image_features(self,
                                     image_features,
                                     first_input_ids,
                                     second_input_ids,
                                     image_attention_mask=None,
                                     first_attention_mask=None,
                                     second_attention_mask=None,
                                     **generate_kwargs, ):
        batch = image_features.shape[0]
        # total prompt length: image features plus both text segments
        seq = image_features.shape[1] + first_input_ids.shape[1] + second_input_ids.shape[1]
        max_len = 204
        dtype = "float16"
        # generation-step mask: only the prompt positions are visible
        tgt_generation_mask = paddle.full([batch, 1, 1, max_len], 0, dtype=dtype)
        tgt_generation_mask[:, 0, 0, :seq] = 1

        # causal attention mask over the prompt
        attention_mask = paddle.full([batch, 1, max_len, max_len], 0, dtype=dtype)
        attention_mask[:, 0, :seq, :seq] = paddle.tril(
            paddle.ones(shape=(seq, seq), dtype=dtype)
        )
        position_ids = paddle.full([batch, seq], 0, dtype="int64")
        for i in range(batch):
            position_ids[i, :] = paddle.arange(seq, dtype="int64")

        inputs = [image_features,
                  first_input_ids,
                  second_input_ids,
                  attention_mask,
                  # image_attention_mask,
                  # first_attention_mask,
                  # second_attention_mask,
                  position_ids,  # position_ids
                  paddle.full([batch, 1], 1.0, dtype="float32"),  # penalty_score
                  paddle.full([batch, 1], 0.0, dtype="float32"),  # frequency_score
                  paddle.full([batch, 1], 0.0, dtype="float32"),  # presence_score
                  paddle.full([batch, 1], 1, dtype="int64"),  # min_length
                  paddle.full([batch, 1], max_len - seq, dtype="int64"),  # max_length
                  paddle.full([batch, 1], 1.0, dtype="float32"),  # temperature
                  paddle.full([batch, 1], 0.0, dtype="float32"),  # top_p
                  paddle.full([1], 2277, dtype="int64"),  # eos_token_id
                  paddle.full([batch, 1], seq, dtype="int32"),  # seq_len_encoder
                  paddle.full([batch, 1], seq, dtype="int32"),  # seq_len_decoder
                  paddle.full([batch, 1], 0, dtype="int64"),  # step_idx
                  paddle.full([batch, 1], False, dtype="bool"),  # stop_flags
                  paddle.full([batch, 1], -123, dtype="int64"),  # tgt_ids, can be initialized arbitrarily
                  paddle.full([batch, 1], seq - 1, dtype="int64"),  # tgt_pos
                  tgt_generation_mask,  # tgt_generation_mask
                  paddle.full([batch, max_len], -100, dtype="int64"),  # pre_ids, can be initialized arbitrarily
                  paddle.full([1], batch, dtype="int64")  # stop_nums, should equal batch
                  ]
        # append a KV-cache buffer for each of the 40 transformer layers,
        # shaped [2, batch, num_heads, max_len, head_dim]
        for i in range(40):
            tmp = paddle.rand(shape=[2, batch, 40, max_len, 128], dtype=dtype)
            inputs.append(tmp)

        self.second_predictor.run(inputs)
        tokens: np.ndarray = load_real_time_tokens()
        generate_ids = tokens.tolist()
        return generate_ids, None

    def pre_processing(self, images, text, prompt=None):
        processed_contents = self.processor(images, text, prompt=prompt)
        return processed_contents

    def post_processing(self, generate_ids):
        msg = self.processor.batch_decode(generate_ids)
        return msg

    def predict(self, images, text, prompt=None):
        processed_contents = self.pre_processing(images, text, prompt=prompt)
        batch = 1
        processed_contents["pixel_values"] = paddle.tile(processed_contents["pixel_values"], repeat_times=[batch, 1, 1, 1])
        image_features, image_attention_mask = self.encode_images(processed_contents["pixel_values"])
        processed_contents["first_input_ids"] = paddle.tile(processed_contents["first_input_ids"], repeat_times=[batch, 1])
        processed_contents["second_input_ids"] = paddle.tile(processed_contents["second_input_ids"], repeat_times=[batch, 1])
        processed_contents["first_attention_mask"] = paddle.tile(processed_contents["first_attention_mask"], repeat_times=[batch, 1])
        processed_contents["second_attention_mask"] = paddle.tile(processed_contents["second_attention_mask"], repeat_times=[batch, 1])
        generate_ids, _ = self.generate_with_image_features(
            image_features,
            processed_contents["first_input_ids"],
            processed_contents["second_input_ids"],
            image_attention_mask,
            processed_contents["first_attention_mask"],
            processed_contents["second_attention_mask"],
        )

        msg = self.post_processing(generate_ids)

        return msg


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--first_model_path", default='The dir name of image encoder model', type=str, help="", )
    parser.add_argument("--second_model_path", default='The dir name of language model', type=str, help="", )
    parser.add_argument("--minigpt4_path", type=str,
                        default="The minigpt4 dir name of saving tokenizer",
                        help="The path of extraction model path that you want to load.")
    parser.add_argument("--use_tensorrt", action='store_true', help="Whether to use the TensorRT inference engine.")
    parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"],
                        help='The tensorrt precision.')
    parser.add_argument("--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference.")
    parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.')
    parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False],
                        help='Enable to use mkldnn to speed up when using cpu.')
    args = parser.parse_args()

    predictor = Predictor(args)

    url = "https://paddlenlp.bj.bcebos.com/data/images/mugs.png"
    image = Image.open(requests.get(url, stream=True).raw)

    text = "describe this image"
    prompt = "Give the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img><ImageHere></Img> <TextHere>###Assistant:"

    # warm up
    warm_up_times = 2
    repeat_times = 10
    for i in range(warm_up_times):
        msg = predictor.predict(image, text, prompt)

    # timed benchmark over repeat_times runs
    starttime = datetime.datetime.now()
    for i in range(repeat_times):
        msg = predictor.predict(image, text, prompt)

    endtime = datetime.datetime.now()
    duringtime = endtime - starttime
    time_ms = duringtime.seconds * 1000 + duringtime.microseconds / 1000.0

    print("Reference: The image shows two black and white cats sitting next to each other on a blue background. The cats have black fur and white fur with black noses, eyes, and paws. They are both looking at the camera with a curious expression. The mugs are also blue with the same design of the cats on them. There is a small white flower on the left side of the mug. The background is a light blue color.")
    print("Outputs: ", msg)
    print("The whole time on average: ", time_ms / repeat_times, "ms")
