Commit d37e921

implement AWQ using AutoAWQ (#16)

1 parent 85e18f2 commit d37e921

9 files changed: +228 -21 lines

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -163,3 +163,5 @@ models/
 
 .python-version
 .vscode/
+
+config.yaml

Dockerfile.gpu

Lines changed: 3 additions & 2 deletions
@@ -2,7 +2,7 @@ FROM debian:bullseye-slim as pytorch-install
 
 ARG PYTORCH_VERSION=2.0.1
 ARG PYTHON_VERSION=3.9
-ARG CUDA_VERSION=11.7.1
+ARG CUDA_VERSION=11.8.0
 ARG MAMBA_VERSION=23.1.0-4
 ARG CUDA_CHANNEL=nvidia
 ARG INSTALL_CHANNEL=pytorch
@@ -47,7 +47,7 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-ins
     ninja-build \
     && rm -rf /var/lib/apt/lists/*
 
-RUN /opt/conda/bin/conda install -qy -c "nvidia/label/cuda-11.7.1" cuda==11.7.1 && \
+RUN /opt/conda/bin/conda install -q -c "nvidia/label/cuda-${CUDA_VERSION}" cuda==$CUDA_VERSION && \
     /opt/conda/bin/conda clean -yqa
 
 FROM debian:bullseye-slim as base
@@ -71,6 +71,7 @@ COPY ./requirements.txt /llm-api/requirements.txt
 RUN pip3 install --no-cache-dir --upgrade -r requirements.txt && \
     pip3 install --no-cache-dir accelerate==0.20.3 packaging==23.0 ninja==1.11.1 && \
     pip3 install --no-cache-dir --no-build-isolation flash-attn==v2.3.3 && \
+    pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp39-cp39-linux_x86_64.whl && \
     pip3 cache purge && \
     /opt/conda/bin/conda clean -ya

README.md

Lines changed: 40 additions & 8 deletions
@@ -15,17 +15,18 @@ LLM enthusiasts, developers, researchers, and creators are invited to join this
 ### Tested with
 
 - [x] Different Llama based-models in different versions such as (Llama, Alpaca, Vicuna, Llama 2 ) on CPU using llama.cpp
-- [x] Llama & Llama 2 based models using GPTQ-for-LLaMa
+- [x] Llama & Llama 2 quantized models using GPTQ-for-LLaMa
 - [x] Generic huggingface pipeline e.g. gpt-2, MPT
 - [x] Mistral 7b
+- [x] Several quantized models using AWQ
 - [x] OpenAI-like interface using [llm-api-python](https://github.com/1b5d/llm-api-python)
 - [ ] Support RWKV-LM
 
 # Usage
 
 To run LLM-API on a local machine, you must have a functioning Docker engine. The following steps outline the process for running LLM-API:
 
-1. **Create a Configuration File**: Begin by creating a `config.yaml` file with the configurations as described below.
+1. **Create a Configuration File**: Begin by creating a `config.yaml` file with the configurations as described below (use the examples in `config.yaml.example`).
 
 ```
 models_dir: /models # dir inside the container
@@ -147,15 +148,15 @@ curl --location 'localhost:8000/generate' \
 }'
 ```
 
-If you're looking to accelerate inference using a GPU, the `1b5d/llm-api:x.x.x-gpu` image is designed for this purpose. When running the Docker image using Compose, consider utilizing a dedicated Compose file for GPU support:
+If you're looking to accelerate inference using a GPU, the `1b5d/llm-api:latest-gpu` image is designed for this purpose. When running the Docker image using Compose, consider utilizing a dedicated Compose file for GPU support:
 
 ```
 docker compose -f docker-compose.gpu.yaml up
 ```
 
 **Note**: currently only the `linux/amd64` architecture is supported for GPU images
 
-## Llama on CPU - using llama.cpp
+## Llama models on CPU - using llama.cpp
 
 Utilizing Llama on a CPU is made simple by configuring the model usage in a local `config.yaml` file. Below are the possible configurations:
 
@@ -210,6 +211,36 @@ curl --location 'localhost:8000/generate' \
 }'
 ```
 
+## AWQ quantized models - using AutoAWQ
+
+AWQ quantization is supported using the AutoAWQ implementation; below is an example config:
+
+```
+models_dir: /models
+model_family: autoawq
+setup_params:
+  repo_id: <repo id>
+  tokenizer_repo_id: <repo id>
+  filename: <model file name>
+model_params:
+  trust_remote_code: False
+  fuse_layers: False
+  safetensors: True
+  device_map: "cuda:0"
+```
+
+To run this model, the GPU-enabled Docker image `1b5d/llm-api:latest-gpu` is needed:
+
+```
+docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu
+```
+
+Or you can use the `docker-compose.gpu.yaml` file available in this repo:
+
+```
+docker compose -f docker-compose.gpu.yaml up
+```
+
 ## Llama on GPU - using GPTQ-for-LLaMa
 
 **Important Note**: Before running Llama or Llama 2 on GPU, make sure to install the [NVIDIA Driver](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html) on your host machine. You can verify the NVIDIA environment by executing the following command:
@@ -220,7 +251,7 @@ docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
 
 You should see a table displaying the current NVIDIA driver version and related information, confirming the proper setup.
 
-When running the Llama model with GPTQ-for-LLaMa 4-bit quantization, you can use a specialized Docker image designed for this purpose, `1b5d/llm-api:x.x.x-gpu`, as an alternative to the default image. You can run this mode using a separate Docker Compose file:
+When running the Llama model with GPTQ-for-LLaMa 4-bit quantization, you can use a specialized Docker image designed for this purpose, `1b5d/llm-api:latest-gpu`, as an alternative to the default image. You can run this mode using a separate Docker Compose file:
 
 ```
 docker compose -f docker-compose.gpu.yaml up
@@ -229,10 +260,10 @@ docker compose -f docker-compose.gpu.yaml up
 Or by directly running the container:
 
 ```
-docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:x.x.x-gpu
+docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu
 ```
 
-**Important Note**: The `llm-api:x.x.x-gptq-llama-cuda` and `llm-api:x.x.x-gptq-llama-triton` images have been deprecated. Please switch to the `1b5d/llm-api:x.x.x-gpu` image when GPU support is required
+**Important Note**: The `llm-api:x.x.x-gptq-llama-cuda` and `llm-api:x.x.x-gptq-llama-triton` images have been deprecated. Please switch to the `1b5d/llm-api:latest-gpu` image when GPU support is required.
 
 Example config file:
 
@@ -271,5 +302,6 @@ curl --location 'localhost:8000/generate' \
 
 - [llama.cpp](https://github.com/ggerganov/llama.cpp) for making it possible to run Llama models on CPU.
 - [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for the python bindings lib for `llama.cpp`.
-- [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) for providing a GPTQ implementation for Llama based models.
+- [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) for providing a GPTQ quantization implementation for Llama based models.
+- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) for providing an implementation of AWQ quantization.
 - Huggingface for the great ecosystem of tooling they provide.
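
For a quick end-to-end check of the AWQ setup documented above, the `/generate` endpoint shown in the README's curl examples can also be exercised from Python. This is a sketch only: the request body shape (a `prompt` plus a `params` object) is assumed to follow the existing curl examples and is not defined by this diff.

```
# Hypothetical smoke test against a running llm-api container (field names assumed).
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "What is AWQ quantization?", "params": {"max_new_tokens": 64}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```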

app/llms/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -23,10 +23,17 @@ def _load_hugging_face():
     return HuggingFaceLLM
 
 
+def _load_autoawq():
+    from .autoawq.autoawq import AutoAWQ  # pylint: disable=C0415
+
+    return AutoAWQ
+
+
 model_families = {
     "llama": _load_llama,
     "gptq_llama": _load_gptq_llama,
     "huggingface": _load_hugging_face,
+    "autoawq": _load_autoawq,
 }
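
A minimal sketch of how this lazy-loading registry is typically consumed: the configured family name is looked up in `model_families`, the loader performs the deferred import, and the returned class is constructed with the model parameters. The `settings.model_family` and `settings.model_params` names below follow the config examples elsewhere in this commit; the actual call site is not part of this diff.

```
# Hypothetical consumer of the registry (names assumed, not shown in this diff).
from app.config import settings
from app.llms import model_families

loader = model_families[settings.model_family]  # e.g. "autoawq" -> _load_autoawq
llm_class = loader()                            # deferred import happens here
llm = llm_class(dict(settings.model_params))    # e.g. AutoAWQ({...model_params...})
```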

app/llms/autoawq/__init__.py

Whitespace-only changes.

app/llms/autoawq/autoawq.py

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
"""
AutoAWQ LLM-API implementation
"""
import logging
import os
from typing import AsyncIterator, Dict, List

import huggingface_hub
from awq import AutoAWQForCausalLM
from transformers import AutoConfig, AutoTokenizer, TextIteratorStreamer, pipeline

from app.base import BaseLLM
from app.config import settings

logger = logging.getLogger("llm-api.autoawq")


class AutoAWQ(BaseLLM):
    """LLM-API implementation to support AWQ quantization using AutoAWQ"""

    def _download(self, model_path, model_dir):
        if os.path.exists(model_path):
            logger.info("found an existing model %s", model_path)
            return

        logger.info("downloading model to %s", model_path)

        huggingface_hub.hf_hub_download(
            repo_id=settings.setup_params["repo_id"],
            filename=settings.setup_params["filename"],
            local_dir=model_dir,
            local_dir_use_symlinks=False,
            cache_dir=os.path.join(settings.models_dir, ".cache"),
        )

        huggingface_hub.hf_hub_download(
            repo_id=settings.setup_params["repo_id"],
            filename="config.json",
            local_dir=model_dir,
            local_dir_use_symlinks=False,
            cache_dir=os.path.join(settings.models_dir, ".cache"),
        )

    def __init__(self, params: Dict[str, str]) -> None:
        model_dir = super().get_model_dir(
            settings.models_dir,
            settings.model_family,
            settings.setup_params["filename"],
        )

        model_path = os.path.join(
            model_dir,
            settings.setup_params["filename"],
        )

        self._download(model_path, model_dir)

        self.device = params.get("device_map", "cuda:0")
        del params["device_map"]

        self.config = AutoConfig.from_pretrained(settings.setup_params["repo_id"])

        self.model = AutoAWQForCausalLM.from_quantized(
            model_dir, settings.setup_params["filename"], **params
        )

        self.tokenizer = AutoTokenizer.from_pretrained(
            settings.setup_params["tokenizer_repo_id"], cache_dir=model_dir, **params
        )

        logger.info("setup done successfully for %s", model_path)

    def generate(self, prompt: str, params: Dict[str, str]) -> str:
        """
        Generate text from the model using the input prompt and parameters
        """
        input_ids = self.tokenizer(
            prompt,
            return_tensors="pt",
        ).to(self.device)

        gen_tokens = self.model.generate(**input_ids, **params)

        result = self.tokenizer.batch_decode(
            gen_tokens[:, input_ids["input_ids"].shape[1] :]
        )

        return result[0]

    async def agenerate(
        self, prompt: str, params: Dict[str, str]
    ) -> AsyncIterator[str]:
        """
        Generate text stream from model using the input prompt and parameters
        """
        input_ids = self.tokenizer(
            prompt,
            return_tensors="pt",
        ).to(self.device)
        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True)
        self.model.generate(**input_ids, streamer=streamer, **params or None)
        for text in streamer:
            yield text

    def embeddings(self, text: str) -> List[float]:
        """
        Generate embeddings using the input text
        """

        pipe = pipeline(
            "feature-extraction",
            framework="pt",
            model=self.model,
            tokenizer=self.tokenizer,
        )
        return pipe(text)[0][0]
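
For reference, a minimal sketch of driving this wrapper directly, assuming an `autoawq` config like the example in `config.yaml.example` has been loaded into `app.config.settings`. The parameter values mirror that example, `max_new_tokens` is a standard transformers generation argument, and constructing the class outside the API server is purely illustrative.

```
# Hypothetical direct use of the AutoAWQ wrapper (normally constructed by the
# API server via the model_families registry in app/llms/__init__.py).
from app.llms.autoawq.autoawq import AutoAWQ

model_params = {
    "trust_remote_code": False,
    "fuse_layers": False,
    "safetensors": True,
    "device_map": "cuda:0",  # popped in __init__ and used as the target device
}

llm = AutoAWQ(model_params)

# Extra params are forwarded to model.generate(); max_new_tokens is a standard
# transformers generation argument.
print(llm.generate("What is AWQ quantization?", {"max_new_tokens": 64}))
```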

app/llms/gptq_llama/gptq_llama.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
 sys.path.append(os.path.join(os.path.dirname(__file__), "GPTQ-for-LLaMa"))
 
 try:
-    from .GPTQforLLaMa import quant
+    from .GPTQforLLaMa import quant  # type: ignore
     from .GPTQforLLaMa.utils import find_layers
 except ImportError as exp:
     raise ImportError(

config.yaml

Lines changed: 0 additions & 10 deletions
This file was deleted.

config.yaml.example

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
models_dir: /models
model_family: huggingface
setup_params:
  repo_id: <repo_id>
  tokenizer_repo_id: <repo_id>
  trust_remote_code: True
config_params:
  init_device: cuda:0
  attn_config:
    attn_impl: triton
model_params:
  device_map: "cuda:0"
  trust_remote_code: True
  torch_dtype: torch.bfloat16
---
models_dir: /models
model_family: llama
setup_params:
  repo_id: user/repo_id
  filename: ggml-model-q4_0.bin
model_params:
  n_ctx: 512
  n_parts: -1
  n_gpu_layers: 0
  seed: -1
  use_mmap: True
  n_threads: 8
  n_batch: 2048
  last_n_tokens_size: 64
  lora_base: null
  lora_path: null
  low_vram: False
  tensor_split: null
  rope_freq_base: 10000.0
  rope_freq_scale: 1.0
  verbose: True
---
models_dir: /models
model_family: gptq_llama
setup_params:
  repo_id: user/repo_id
  filename: <model.safetensors or model.pt>
model_params:
  group_size: 128
  wbits: 4
  cuda_visible_devices: "0"
  device: "cuda:0"
---
models_dir: /models
model_family: autoawq
setup_params:
  repo_id: <repo id>
  tokenizer_repo_id: <repo id>
  filename: model.safetensors
model_params:
  trust_remote_code: False
  fuse_layers: False
  safetensors: True
  device_map: "cuda:0"

0 commit comments