Commit e1c10ec

Merge pull request #159 from 1649759610/minigpt4_inference
[New Feature] Add MiniGPT4 Inference
2 parents 0ffbb1b + 65816ab commit e1c10ec

File tree

5 files changed: +447, -13 lines
@@ -0,0 +1,120 @@
# MiniGPT4 Inference Acceleration

This project provides inference acceleration for MiniGPT4. The basic approach is to convert the MiniGPT4 dynamic graph into a static graph and then accelerate inference with the PaddleInference library.

The figure below shows the overall MiniGPT4 architecture. At a high level, MiniGPT4 consists of a ViT, a QFormer, and a Vicuna model. Since Vicuna is trained on top of Llama and the implementation actually calls the Llama code, we ignore this inessential distinction and simply refer to the language-model part as Llama in the rest of this document.

In this solution, we export MiniGPT4 as two subgraphs: the ViT and QFormer parts are exported as one static subgraph, and the Llama part as another. The two subgraphs are then combined to perform MiniGPT4 inference.

<center><img src="https://github.com/PaddlePaddle/Paddle/assets/35913314/f0306cb6-4837-4f52-8f57-a0e7e35238f6" /></center>
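
Conceptually, inference chains the two exported subgraphs. The following is an illustrative sketch of the data flow only; the function names are placeholders, not the actual PaddleMIX/PaddleNLP API:

```python
# Illustrative sketch only: the names here are placeholders, not real APIs.
def minigpt4_pipeline(image_encoder, language_model, pixel_values, prompt_ids):
    # stage 1: the ViT+QFormer subgraph maps images to language-model
    # input features plus a matching attention mask
    features, features_mask = image_encoder(pixel_values)
    # stage 2: the Llama subgraph consumes the features together with the
    # tokenized prompt and generates the answer token ids
    return language_model(features, prompt_ids, features_mask)
```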
## 1. Environment Setup
### 1.1 Base Environment
This project has been verified in the following base environment:
- CUDA: 11.7
- python: 3.11
- paddle: develop build

The CUDA version must be >= 11.2. A suitable Paddle build can be downloaded [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html) as needed.
### 1.2 Install the Project Libraries
1. This project requires the PaddleMIX and PaddleNLP libraries, both at their latest develop versions:

```shell
git clone https://github.com/PaddlePaddle/PaddleNLP.git
git clone https://github.com/PaddlePaddle/PaddleMIX.git
```
2. Install paddlenlp_ops:
```shell
cd PaddleNLP/csrc
python setup_cuda.py install
```
3. Finally, set the corresponding environment variable (a quick sanity check is sketched after this list):
```shell
export PYTHONPATH=yourpath/PaddleNLP:yourpath/PaddleMIX
```
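
As a quick sanity check (our suggestion, not part of the original setup), the following should run cleanly once the three steps above are done:

```python
import paddlenlp
import paddlemix
from paddlenlp.utils.import_utils import import_module

# both packages should resolve from your cloned develop checkouts
print(paddlenlp.__file__)
print(paddlemix.__file__)

# one of the custom ops installed by setup_cuda.py and used later
# by the static llama graph
import_module("paddlenlp_ops.get_padding_offset")
print("paddlenlp_ops custom ops are importable")
```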
### 1.3 Special Notes
At present, parts of the PaddleNLP and Paddle code need to be patched in order to run accelerated MiniGPT4 inference. These fixes will gradually be merged into PaddleNLP and Paddle, but for now they have to be applied manually.

1. Patch PaddleNLP:
Following this [branch](https://github.com/1649759610/PaddleNLP/tree/bugfix_minigpt4), replace the following files:
- PaddleNLP/paddlenlp/experimental/transformers/generation_utils.py
- PaddleNLP/paddlenlp/experimental/transformers/llama/modeling.py
- PaddleNLP/llm/export_model.py

2. Patch Paddle:
Go to your Paddle installation directory, open the file paddle/static/io.py, and comment out lines 284-287:
```python
if not skip_prune_program:
    copy_program = copy_program._prune_with_input(
        feeded_var_names=feed_var_names, targets=fetch_vars
    )
```
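
If you are unsure where your Paddle installation lives, a snippet like the following (our suggestion) prints the path of the file to edit:

```python
import os
import paddle

# prints the io.py file whose lines 284-287 need commenting out
print(os.path.join(os.path.dirname(paddle.__file__), "static", "io.py"))
```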
## 2. Exporting MiniGPT4 in Two Stages

### 2.1 Export the First Subgraph
Make sure you are in the directory PaddleMIX/paddlemix/examples/minigpt4/inference, then export with:
```
python export_image_encoder.py \
    --minigpt4_13b_path "your minigpt4 dir path" \
    --save_path "./checkpoints/encode_image/encode_image"
```

**Parameters**:
- minigpt4_13b_path: directory holding the MiniGPT4 checkpoint
- save_path: export path and prefix for the first-stage model
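
As a sanity check (our suggestion, assuming the default save path and a 224x224 input resolution, which is not stated in this document), the exported encoder can be loaded back and run on a dummy image:

```python
import paddle

# load the static graph exported by export_image_encoder.py
encoder = paddle.jit.load("./checkpoints/encode_image/encode_image")
# the export InputSpec is [None, 3, None, None] float32; 224x224 is assumed here
pixel_values = paddle.randn([1, 3, 224, 224], dtype="float32")
outputs = encoder(pixel_values)
```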
### 2.2 Export the Second Subgraph
Go to the directory PaddleNLP/llm, then export with:
```
python export_model.py \
    --model_name_or_path "your llama dir path" \
    --output_path "your output path" \
    --dtype float16 \
    --inference_model \
    --model_prefix llama \
    --model_type llama-img2txt
```

**Parameters**:
- model_name_or_path: directory holding the Llama model
- output_path: export path for the language-model part
- dtype: data type of the model weights
- inference_model: marks this as an inference model
- model_prefix: prefix for the exported model files
- model_type: model type

**Note**: Exporting the Llama part currently has to be done manually under PaddleNLP; one-click export from PaddleMIX will be supported later.
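
Before moving on, it may be worth checking (our suggestion, assuming export_model.py writes its artifacts under --output_path using the --model_prefix name) that the two static-graph files the inference script later expects were produced:

```python
import os

# replace with the value you passed as --output_path
prefix = os.path.join("your output path", "llama")
for ext in (".pdmodel", ".pdiparams"):
    assert os.path.exists(prefix + ext), f"missing {prefix + ext}"
```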
## 3. MiniGPT4 Static-Graph Inference
Go to the directory PaddleMIX/paddlemix/examples/minigpt4/inference and run:
```shell
python run_static_predict.py \
    --first_model_path "The dir name of image encoder model" \
    --second_model_path "The dir name of language model" \
    --minigpt4_path "The minigpt4 dir name of saving tokenizer"
```

**Parameters**:
- first_model_path: path prefix of the static-graph model for the first part (ViT and QFormer)
- second_model_path: path prefix of the static-graph model for the second part (the language model)
- minigpt4_path: directory holding the MiniGPT4 tokenizer
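
The predictor can also be driven programmatically. The following hedged sketch mirrors what the __main__ block of run_static_predict.py does; the paths are placeholders for your own exported prefixes and tokenizer directory:

```python
import argparse

from PIL import Image
from run_static_predict import Predictor

# placeholder paths; point these at your exported prefixes and tokenizer dir
args = argparse.Namespace(
    first_model_path="./checkpoints/encode_image/encode_image",
    second_model_path="./checkpoints/llama/llama",
    minigpt4_path="your minigpt4 dir path",
    use_tensorrt=False, precision="fp16", device="gpu",
    cpu_threads=10, enable_mkldnn=False,
)
predictor = Predictor(args)

# same prompt template as the script's __main__ block
prompt = ("Give the following image: <Img>ImageContent</Img>. "
          "You will be able to see the image once I provide it to you. "
          "Please answer my questions."
          "###Human: <Img><ImageHere></Img> <TextHere>###Assistant:")
msg = predictor.predict(Image.open("mugs.png"), "describe this image", prompt)
print(msg)
```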
The following shows the MiniGPT4 static-graph inference output for this image:

<center><img src="https://paddlenlp.bj.bcebos.com/data/images/mugs.png" /></center>

```text
Reference: The image shows two black and white cats sitting next to each other on a blue background. The cats have black fur and white fur with black noses, eyes, and paws. They are both looking at the camera with a curious expression. The mugs are also blue with the same design of the cats on them. There is a small white flower on the left side of the mug. The background is a light blue color.

Outputs: ['The image shows two black and white cats sitting next to each other on a blue background. The cats have black fur and white fur with black noses, eyes, and paws. They are both looking at the camera with a curious expression. The mugs are also blue with the same design of the cats on them. There is a small white flower on the left side of the mug. The background is a light blue color.##']
```
@@ -0,0 +1,43 @@
import argparse
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["FLAGS_use_cuda_managed_memory"] = "true"

import paddle
from paddlemix import MiniGPT4ForConditionalGeneration


def export(args):
    model = MiniGPT4ForConditionalGeneration.from_pretrained(args.minigpt4_13b_path, vit_dtype="float16")
    model.eval()

    # convert to static graph with specific input description
    model = paddle.jit.to_static(
        model.encode_images,
        input_spec=[
            paddle.static.InputSpec(
                shape=[None, 3, None, None], dtype="float32"),  # images
        ])

    # save to static model
    paddle.jit.save(model, args.save_path)
    print(f"static model has been saved to {args.save_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--minigpt4_13b_path",
        default="your minigpt4 dir path",
        type=str,
        help="The dir name of minigpt4 checkpoint.",
    )
    parser.add_argument(
        "--save_path",
        default="./checkpoints/encode_image/encode_image",
        type=str,
        help="The saving path of static minigpt4.",
    )
    args = parser.parse_args()

    export(args)
@@ -0,0 +1,218 @@
import argparse
import os
import sys
import requests
import numpy as np
import datetime

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["FLAGS_use_cuda_managed_memory"] = "true"

import paddle
from paddle import inference
from paddlenlp.transformers import MiniGPT4Processor
from PIL import Image

from utils import load_real_time_tokens

class Predictor(object):
    def __init__(self, args):
        self.args = args
        # build one inference predictor per exported subgraph
        self.first_predictor, self.first_input_handles, self.first_output_handles = self.create_predictor(
            args.first_model_path)
        self.second_predictor, self.second_input_handles, self.second_output_handles = self.create_predictor(
            args.second_model_path)
        print(f"first_model_path: {args.first_model_path}, {self.first_predictor}")
        print(f"second_model_path: {args.second_model_path}, {self.second_predictor}")
        self.processor = MiniGPT4Processor.from_pretrained(args.minigpt4_path)

    def create_predictor(self, model_path):
        # the custom ops from paddlenlp_ops must be loaded before the
        # static llama graph can be deserialized
        from paddlenlp.utils.import_utils import import_module
        import_module("paddlenlp_ops.encode_rotary_qk")
        import_module("paddlenlp_ops.get_padding_offset")
        import_module("paddlenlp_ops.qkv_transpose_split")
        import_module("paddlenlp_ops.rebuild_padding")
        import_module("paddlenlp_ops.transpose_remove_padding")
        import_module("paddlenlp_ops.write_cache_kv")

        model_file = model_path + ".pdmodel"
        params_file = model_path + ".pdiparams"
        if not os.path.exists(model_file):
            raise ValueError("cannot find model file path {}".format(model_file))
        if not os.path.exists(params_file):
            raise ValueError("cannot find params file path {}".format(params_file))
        config = paddle.inference.Config(model_file, params_file)

        shape_range_file = model_file + "shape.txt"
        # On the first run, uncomment the following line to collect
        # dynamic-shape information:
        # config.collect_shape_range_info(shape_range_file)

        config.switch_ir_optim(True)
        # TensorRT is intended for the first (non-llama) model, but it is
        # currently disabled for both models
        self.args.use_tensorrt = False

        if self.args.device == "gpu":
            # set GPU configs accordingly,
            # such as initializing the gpu memory and enabling tensorrt
            config.enable_use_gpu(100, 0)
            precision_mode = inference.PrecisionType.Half
            if self.args.use_tensorrt:
                config.enable_tuned_tensorrt_dynamic_shape(shape_range_file, True)
                config.enable_tensorrt_engine(
                    max_batch_size=-1, min_subgraph_size=30, precision_mode=precision_mode,
                    use_static=True)

        config.switch_use_feed_fetch_ops(False)
        predictor = paddle.inference.create_predictor(config)
        input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()]
        output_handle = [predictor.get_output_handle(name) for name in predictor.get_output_names()]

        return predictor, input_handles, output_handle

    @paddle.no_grad()
    def encode_images(self, pixel_values):
        # pixel_values is already on the GPU
        [language_model_inputs, language_model_attention_mask] = self.first_predictor.run([pixel_values])
        return language_model_inputs, language_model_attention_mask

    @paddle.no_grad()
    def generate_with_image_features(self,
                                     image_features,
                                     first_input_ids,
                                     second_input_ids,
                                     image_attention_mask=None,
                                     first_attention_mask=None,
                                     second_attention_mask=None,
                                     **generate_kwargs, ):
        batch = image_features.shape[0]
        # total prompt length: image features plus both text segments
        seq = image_features.shape[1] + first_input_ids.shape[1] + second_input_ids.shape[1]
        max_len = 204
        dtype = "float16"
        # generation-step mask: only the prompt positions are visible
        tgt_generation_mask = paddle.full([batch, 1, 1, max_len], 0, dtype=dtype)
        tgt_generation_mask[:, 0, 0, :seq] = 1

        # causal attention mask over the prompt
        attention_mask = paddle.full([batch, 1, max_len, max_len], 0, dtype=dtype)
        attention_mask[:, 0, :seq, :seq] = paddle.tril(
            paddle.ones(shape=(seq, seq), dtype=dtype)
        )
        position_ids = paddle.full([batch, seq], 0, dtype="int64")
        for i in range(batch):
            position_ids[i, :] = paddle.arange(seq, dtype="int64")

        inputs = [image_features,
                  first_input_ids,
                  second_input_ids,
                  attention_mask,
                  # image_attention_mask,
                  # first_attention_mask,
                  # second_attention_mask,
                  position_ids,  # position_ids
                  paddle.full([batch, 1], 1.0, dtype="float32"),  # penalty_score
                  paddle.full([batch, 1], 0.0, dtype="float32"),  # frequency_score
                  paddle.full([batch, 1], 0.0, dtype="float32"),  # presence_score
                  paddle.full([batch, 1], 1, dtype="int64"),  # min_length
                  paddle.full([batch, 1], max_len - seq, dtype="int64"),  # max_length
                  paddle.full([batch, 1], 1.0, dtype="float32"),  # temperature
                  paddle.full([batch, 1], 0.0, dtype="float32"),  # top_p
                  paddle.full([1], 2277, dtype="int64"),  # eos_token_id
                  paddle.full([batch, 1], seq, dtype="int32"),  # seq_len_encoder
                  paddle.full([batch, 1], seq, dtype="int32"),  # seq_len_decoder
                  paddle.full([batch, 1], 0, dtype="int64"),  # step_idx
                  paddle.full([batch, 1], False, dtype="bool"),  # stop_flags
                  paddle.full([batch, 1], -123, dtype="int64"),  # tgt_ids, can be initialized arbitrarily
                  paddle.full([batch, 1], seq - 1, dtype="int64"),  # tgt_pos
                  tgt_generation_mask,  # tgt_generation_mask
                  paddle.full([batch, max_len], -100, dtype="int64"),  # pre_ids, can be initialized arbitrarily
                  paddle.full([1], batch, dtype="int64")  # stop_nums, should equal batch
                  ]
        # append a KV-cache buffer for each of the 40 transformer layers,
        # shaped [2, batch, num_heads, max_len, head_dim]
        for i in range(40):
            tmp = paddle.rand(shape=[2, batch, 40, max_len, 128], dtype=dtype)
            inputs.append(tmp)

        self.second_predictor.run(inputs)
        tokens: np.ndarray = load_real_time_tokens()
        generate_ids = tokens.tolist()
        return generate_ids, None

    def pre_processing(self, images, text, prompt=None):
        processed_contents = self.processor(images, text, prompt=prompt)
        return processed_contents

    def post_processing(self, generate_ids):
        msg = self.processor.batch_decode(generate_ids)
        return msg

    def predict(self, images, text, prompt=None):
        processed_contents = self.pre_processing(images, text, prompt=prompt)
        batch = 1
        processed_contents["pixel_values"] = paddle.tile(processed_contents["pixel_values"], repeat_times=[batch, 1, 1, 1])
        image_features, image_attention_mask = self.encode_images(processed_contents["pixel_values"])
        processed_contents["first_input_ids"] = paddle.tile(processed_contents["first_input_ids"], repeat_times=[batch, 1])
        processed_contents["second_input_ids"] = paddle.tile(processed_contents["second_input_ids"], repeat_times=[batch, 1])
        processed_contents["first_attention_mask"] = paddle.tile(processed_contents["first_attention_mask"], repeat_times=[batch, 1])
        processed_contents["second_attention_mask"] = paddle.tile(processed_contents["second_attention_mask"], repeat_times=[batch, 1])
        generate_ids, _ = self.generate_with_image_features(
            image_features,
            processed_contents["first_input_ids"],
            processed_contents["second_input_ids"],
            image_attention_mask,
            processed_contents["first_attention_mask"],
            processed_contents["second_attention_mask"],
        )

        msg = self.post_processing(generate_ids)

        return msg


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--first_model_path", default='The dir name of image encoder model', type=str, help="", )
    parser.add_argument("--second_model_path", default='The dir name of language model', type=str, help="", )
    parser.add_argument("--minigpt4_path", type=str,
                        default="The minigpt4 dir name of saving tokenizer",
                        help="The path of extraction model path that you want to load.")
    parser.add_argument("--use_tensorrt", action='store_true', help="Whether to use the TensorRT inference engine.")
    parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"],
                        help='The tensorrt precision.')
    parser.add_argument("--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference.")
    parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.')
    parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False],
                        help='Enable to use mkldnn to speed up when using cpu.')
    args = parser.parse_args()

    predictor = Predictor(args)

    url = "https://paddlenlp.bj.bcebos.com/data/images/mugs.png"
    image = Image.open(requests.get(url, stream=True).raw)

    text = "describe this image"
    prompt = "Give the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img><ImageHere></Img> <TextHere>###Assistant:"

    # warm up
    warm_up_times = 2
    repeat_times = 10
    for i in range(warm_up_times):
        msg = predictor.predict(image, text, prompt)

    # timed benchmark over repeat_times runs
    starttime = datetime.datetime.now()
    for i in range(repeat_times):
        msg = predictor.predict(image, text, prompt)

    endtime = datetime.datetime.now()
    duringtime = endtime - starttime
    time_ms = duringtime.seconds * 1000 + duringtime.microseconds / 1000.0

    print("Reference: The image shows two black and white cats sitting next to each other on a blue background. The cats have black fur and white fur with black noses, eyes, and paws. They are both looking at the camera with a curious expression. The mugs are also blue with the same design of the cats on them. There is a small white flower on the left side of the mug. The background is a light blue color.")
    print("Outputs: ", msg)
    print("The whole time on average: ", time_ms / repeat_times, "ms")
