Commit d37e921

implement AWQ using AutoAWQ (#16)

1 parent 85e18f2 commit d37e921

9 files changed: +228 -21 lines

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -163,3 +163,5 @@ models/
 
 .python-version
 .vscode/
+
+config.yaml

Dockerfile.gpu

Lines changed: 3 additions & 2 deletions
@@ -2,7 +2,7 @@ FROM debian:bullseye-slim as pytorch-install
 
 ARG PYTORCH_VERSION=2.0.1
 ARG PYTHON_VERSION=3.9
-ARG CUDA_VERSION=11.7.1
+ARG CUDA_VERSION=11.8.0
 ARG MAMBA_VERSION=23.1.0-4
 ARG CUDA_CHANNEL=nvidia
 ARG INSTALL_CHANNEL=pytorch
@@ -47,7 +47,7 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-ins
     ninja-build \
     && rm -rf /var/lib/apt/lists/*
 
-RUN /opt/conda/bin/conda install -qy -c "nvidia/label/cuda-11.7.1" cuda==11.7.1 && \
+RUN /opt/conda/bin/conda install -q -c "nvidia/label/cuda-${CUDA_VERSION}" cuda==$CUDA_VERSION && \
     /opt/conda/bin/conda clean -yqa
 
 FROM debian:bullseye-slim as base
@@ -71,6 +71,7 @@ COPY ./requirements.txt /llm-api/requirements.txt
 RUN pip3 install --no-cache-dir --upgrade -r requirements.txt && \
     pip3 install --no-cache-dir accelerate==0.20.3 packaging==23.0 ninja==1.11.1 && \
     pip3 install --no-cache-dir --no-build-isolation flash-attn==v2.3.3 && \
+    pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp39-cp39-linux_x86_64.whl && \
     pip3 cache purge && \
     /opt/conda/bin/conda clean -ya

README.md

Lines changed: 40 additions & 8 deletions
@@ -15,17 +15,18 @@ LLM enthusiasts, developers, researchers, and creators are invited to join this
 ### Tested with
 
 - [x] Different Llama based-models in different versions such as (Llama, Alpaca, Vicuna, Llama 2 ) on CPU using llama.cpp
-- [x] Llama & Llama 2 based models using GPTQ-for-LLaMa
+- [x] Llama & Llama 2 quantized models using GPTQ-for-LLaMa
 - [x] Generic huggingface pipeline e.g. gpt-2, MPT
 - [x] Mistral 7b
+- [x] Several quantized models using AWQ
 - [x] OpenAI-like interface using [llm-api-python](https://github.com/1b5d/llm-api-python)
 - [ ] Support RWKV-LM
 
 # Usage
 
 To run LLM-API on a local machine, you must have a functioning Docker engine. The following steps outline the process for running LLM-API:
 
-1. **Create a Configuration File**: Begin by creating a `config.yaml` file with the configurations as described below.
+1. **Create a Configuration File**: Begin by creating a `config.yaml` file with the configurations as described below (use the examples in `config.yaml.example`).
 
 ```
 models_dir: /models # dir inside the container
@@ -147,15 +148,15 @@ curl --location 'localhost:8000/generate' \
 }'
 ```
 
-If you're looking to accelerate inference using a GPU, the `1b5d/llm-api:x.x.x-gpu` image is designed for this purpose. When running the Docker image using Compose, consider utilizing a dedicated Compose file for GPU support:
+If you're looking to accelerate inference using a GPU, the `1b5d/llm-api:latest-gpu` image is designed for this purpose. When running the Docker image using Compose, consider utilizing a dedicated Compose file for GPU support:
 
 ```
 docker compose -f docker-compose.gpu.yaml up
 ```
 
 **Note**: currently only the `linux/amd64` architecture is supported for GPU images
 
-## Llama on CPU - using llama.cpp
+## Llama models on CPU - using llama.cpp
 
 Utilizing Llama on a CPU is made simple by configuring the model usage in a local `config.yaml` file. Below are the possible configurations:
 
@@ -210,6 +211,36 @@ curl --location 'localhost:8000/generate' \
 }'
 ```
 
+## AWQ quantized models - using AutoAWQ
+
+AWQ quantization is supported using the AutoAWQ implementation; below is an example config:
+
+```
+models_dir: /models
+model_family: autoawq
+setup_params:
+  repo_id: <repo id>
+  tokenizer_repo_id: <repo id>
+  filename: <model file name>
+model_params:
+  trust_remote_code: False
+  fuse_layers: False
+  safetensors: True
+  device_map: "cuda:0"
+```
+
+To run this model, the GPU-enabled Docker image `1b5d/llm-api:latest-gpu` is needed:
+
+```
+docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu
+```
+
+Or you can use the `docker-compose.gpu.yaml` file available in this repo:
+
+```
+docker compose -f docker-compose.gpu.yaml up
+```
+
 ## Llama on GPU - using GPTQ-for-LLaMa
 
 **Important Note**: Before running Llama or Llama 2 on GPU, make sure to install the [NVIDIA Driver](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html) on your host machine. You can verify the NVIDIA environment by executing the following command:
@@ -220,7 +251,7 @@ docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
 
 You should see a table displaying the current NVIDIA driver version and related information, confirming the proper setup.
 
-When running the Llama model with GPTQ-for-LLaMa 4-bit quantization, you can use a specialized Docker image designed for this purpose, `1b5d/llm-api:x.x.x-gpu`, as an alternative to the default image. You can run this mode using a separate Docker Compose file:
+When running the Llama model with GPTQ-for-LLaMa 4-bit quantization, you can use a specialized Docker image designed for this purpose, `1b5d/llm-api:latest-gpu`, as an alternative to the default image. You can run this mode using a separate Docker Compose file:
 
 ```
 docker compose -f docker-compose.gpu.yaml up
@@ -229,10 +260,10 @@ docker compose -f docker-compose.gpu.yaml up
 Or by directly running the container:
 
 ```
-docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:x.x.x-gpu
+docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu
 ```
 
-**Important Note**: The `llm-api:x.x.x-gptq-llama-cuda` and `llm-api:x.x.x-gptq-llama-triton` images have been deprecated. Please switch to the `1b5d/llm-api:x.x.x-gpu` image when GPU support is required
+**Important Note**: The `llm-api:x.x.x-gptq-llama-cuda` and `llm-api:x.x.x-gptq-llama-triton` images have been deprecated. Please switch to the `1b5d/llm-api:latest-gpu` image when GPU support is required.
 
 Example config file:
 
@@ -271,5 +302,6 @@ curl --location 'localhost:8000/generate' \
 
 - [llama.cpp](https://github.com/ggerganov/llama.cpp) for making it possible to run Llama models on CPU.
 - [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for the python bindings lib for `llama.cpp`.
-- [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) for providing a GPTQ implementation for Llama based models.
+- [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) for providing a GPTQ quantization implementation for Llama based models.
+- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) for providing an implementation of AWQ quantization.
 - Huggingface for the great ecosystem of tooling they provide.
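
For a quick end-to-end check of the AWQ setup documented above, the `/generate` endpoint shown in the README's curl examples can also be exercised from Python. This is a sketch only: the request body shape (a `prompt` plus a `params` object) is assumed to follow the existing curl examples and is not defined by this diff.

```
# Hypothetical smoke test against a running llm-api container (field names assumed).
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "What is AWQ quantization?", "params": {"max_new_tokens": 64}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```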

app/llms/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -23,10 +23,17 @@ def _load_hugging_face():
     return HuggingFaceLLM
 
 
+def _load_autoawq():
+    from .autoawq.autoawq import AutoAWQ  # pylint: disable=C0415
+
+    return AutoAWQ
+
+
 model_families = {
     "llama": _load_llama,
     "gptq_llama": _load_gptq_llama,
     "huggingface": _load_hugging_face,
+    "autoawq": _load_autoawq,
 }
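
A minimal sketch of how this lazy-loading registry is typically consumed: the configured family name is looked up in `model_families`, the loader performs the deferred import, and the returned class is constructed with the model parameters. The `settings.model_family` and `settings.model_params` names below follow the config examples elsewhere in this commit; the actual call site is not part of this diff.

```
# Hypothetical consumer of the registry (names assumed, not shown in this diff).
from app.config import settings
from app.llms import model_families

loader = model_families[settings.model_family]  # e.g. "autoawq" -> _load_autoawq
llm_class = loader()                            # deferred import happens here
llm = llm_class(dict(settings.model_params))    # e.g. AutoAWQ({...model_params...})
```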

app/llms/autoawq/__init__.py

Whitespace-only changes.

app/llms/autoawq/autoawq.py

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
"""
AutoAWQ LLM-API implementation
"""
import logging
import os
from typing import AsyncIterator, Dict, List

import huggingface_hub
from awq import AutoAWQForCausalLM
from transformers import AutoConfig, AutoTokenizer, TextIteratorStreamer, pipeline

from app.base import BaseLLM
from app.config import settings

logger = logging.getLogger("llm-api.autoawq")


class AutoAWQ(BaseLLM):
    """LLM-API implementation to support AWQ quantization using AutoAWQ"""

    def _download(self, model_path, model_dir):
        if os.path.exists(model_path):
            logger.info("found an existing model %s", model_path)
            return

        logger.info("downloading model to %s", model_path)

        huggingface_hub.hf_hub_download(
            repo_id=settings.setup_params["repo_id"],
            filename=settings.setup_params["filename"],
            local_dir=model_dir,
            local_dir_use_symlinks=False,
            cache_dir=os.path.join(settings.models_dir, ".cache"),
        )

        huggingface_hub.hf_hub_download(
            repo_id=settings.setup_params["repo_id"],
            filename="config.json",
            local_dir=model_dir,
            local_dir_use_symlinks=False,
            cache_dir=os.path.join(settings.models_dir, ".cache"),
        )

    def __init__(self, params: Dict[str, str]) -> None:
        model_dir = super().get_model_dir(
            settings.models_dir,
            settings.model_family,
            settings.setup_params["filename"],
        )

        model_path = os.path.join(
            model_dir,
            settings.setup_params["filename"],
        )

        self._download(model_path, model_dir)

        self.device = params.get("device_map", "cuda:0")
        del params["device_map"]

        self.config = AutoConfig.from_pretrained(settings.setup_params["repo_id"])

        self.model = AutoAWQForCausalLM.from_quantized(
            model_dir, settings.setup_params["filename"], **params
        )

        self.tokenizer = AutoTokenizer.from_pretrained(
            settings.setup_params["tokenizer_repo_id"], cache_dir=model_dir, **params
        )

        logger.info("setup done successfully for %s", model_path)

    def generate(self, prompt: str, params: Dict[str, str]) -> str:
        """
        Generate text from the model using the input prompt and parameters
        """
        input_ids = self.tokenizer(
            prompt,
            return_tensors="pt",
        ).to(self.device)

        gen_tokens = self.model.generate(**input_ids, **params)

        result = self.tokenizer.batch_decode(
            gen_tokens[:, input_ids["input_ids"].shape[1] :]
        )

        return result[0]

    async def agenerate(
        self, prompt: str, params: Dict[str, str]
    ) -> AsyncIterator[str]:
        """
        Generate text stream from model using the input prompt and parameters
        """
        input_ids = self.tokenizer(
            prompt,
            return_tensors="pt",
        ).to(self.device)
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True)
        self.model.generate(**input_ids, streamer=streamer, **params or None)
        for text in streamer:
            yield text

    def embeddings(self, text: str) -> List[float]:
        """
        Generate embeddings using the input text
        """

        pipe = pipeline(
            "feature-extraction",
            framework="pt",
            model=self.model,
            tokenizer=self.tokenizer,
        )
        return pipe(text)[0][0]
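
For reference, a minimal sketch of driving this wrapper directly, assuming an `autoawq` config like the example in `config.yaml.example` has been loaded into `app.config.settings`. The parameter values mirror that example, `max_new_tokens` is a standard transformers generation argument, and constructing the class outside the API server is purely illustrative.

```
# Hypothetical direct use of the AutoAWQ wrapper (normally constructed by the
# API server via the model_families registry in app/llms/__init__.py).
from app.llms.autoawq.autoawq import AutoAWQ

model_params = {
    "trust_remote_code": False,
    "fuse_layers": False,
    "safetensors": True,
    "device_map": "cuda:0",  # popped in __init__ and used as the target device
}

llm = AutoAWQ(model_params)

# Extra params are forwarded to model.generate(); max_new_tokens is a standard
# transformers generation argument.
print(llm.generate("What is AWQ quantization?", {"max_new_tokens": 64}))
```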

app/llms/gptq_llama/gptq_llama.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
 sys.path.append(os.path.join(os.path.dirname(__file__), "GPTQ-for-LLaMa"))
 
 try:
-    from .GPTQforLLaMa import quant
+    from .GPTQforLLaMa import quant  # type: ignore
     from .GPTQforLLaMa.utils import find_layers
 except ImportError as exp:
     raise ImportError(

config.yaml

Lines changed: 0 additions & 10 deletions
This file was deleted.

config.yaml.example

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
models_dir: /models
model_family: huggingface
setup_params:
  repo_id: <repo_id>
  tokenizer_repo_id: <repo_id>
  trust_remote_code: True
config_params:
  init_device: cuda:0
  attn_config:
    attn_impl: triton
model_params:
  device_map: "cuda:0"
  trust_remote_code: True
  torch_dtype: torch.bfloat16
---
models_dir: /models
model_family: llama
setup_params:
  repo_id: user/repo_id
  filename: ggml-model-q4_0.bin
model_params:
  n_ctx: 512
  n_parts: -1
  n_gpu_layers: 0
  seed: -1
  use_mmap: True
  n_threads: 8
  n_batch: 2048
  last_n_tokens_size: 64
  lora_base: null
  lora_path: null
  low_vram: False
  tensor_split: null
  rope_freq_base: 10000.0
  rope_freq_scale: 1.0
  verbose: True
---
models_dir: /models
model_family: gptq_llama
setup_params:
  repo_id: user/repo_id
  filename: <model.safetensors or model.pt>
model_params:
  group_size: 128
  wbits: 4
  cuda_visible_devices: "0"
  device: "cuda:0"
---
models_dir: /models
model_family: autoawq
setup_params:
  repo_id: <repo id>
  tokenizer_repo_id: <repo id>
  filename: model.safetensors
model_params:
  trust_remote_code: False
  fuse_layers: False
  safetensors: True
  device_map: "cuda:0"

0 commit comments