### Tested with

- [x] Different Llama-based models in different versions (such as Llama, Alpaca, Vicuna, and Llama 2) on CPU using llama.cpp
- [x] Llama & Llama 2 quantized models using GPTQ-for-LLaMa
- [x] Generic huggingface pipeline e.g. gpt-2, MPT
- [x] Mistral 7b
- [x] Several quantized models using AWQ
- [x] OpenAI-like interface using [llm-api-python](https://github.com/1b5d/llm-api-python)
- [ ] Support RWKV-LM

# Usage
To run LLM-API on a local machine, you must have a functioning Docker engine. The following steps outline the process for running LLM-API:

1. **Create a Configuration File**: Begin by creating a `config.yaml` file with the configurations as described below (use the examples in `config.yaml.example`).

If you're looking to accelerate inference using a GPU, the `1b5d/llm-api:latest-gpu` image is designed for this purpose. When running the Docker image using Compose, consider utilizing a dedicated Compose file for GPU support:

```
docker compose -f docker-compose.gpu.yaml up
```

**Note**: currently only the `linux/amd64` architecture is supported for GPU images.
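
If you are unsure whether your host matches that architecture, a quick generic check (not specific to LLM-API) is:

```
uname -m    # prints x86_64 on linux/amd64 hosts
```
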
## Llama models on CPU - using llama.cpp

Utilizing Llama on a CPU is made simple by configuring the model usage in a local `config.yaml` file. Below are the possible configurations:
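
As a rough, minimal sketch of what such a file might look like (the `model_family` value and the `model_params` keys below are assumptions modeled on the AWQ example further down, not the project's documented llama.cpp options, so adjust them accordingly):

```
models_dir: /models        # where model files are stored/downloaded (assumed, mirrors the AWQ example)
model_family: llama        # assumed family name for llama.cpp-backed models
setup_params:
  repo_id: <repoid>             # Hugging Face repo to pull the model from
  filename: <modelfilename>     # e.g. a quantized model file for llama.cpp
model_params:
  n_ctx: 1024      # assumed llama.cpp-style context size parameter
  n_threads: 8     # assumed llama.cpp-style CPU thread count
```
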
AWQ quantization is supported using the AutoAWQ implementation; below is an example config:

```
models_dir: /models
model_family: autoawq
setup_params:
  repo_id: <repoid>
  tokenizer_repo_id: <repoid>
  filename: <modelfilename>
model_params:
  trust_remote_code: False
  fuse_layers: False
  safetensors: True
  device_map: "cuda:0"
```

To run this model, the GPU-enabled Docker image `1b5d/llm-api:latest-gpu` is needed:

```
docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu
```

Or you can use the `docker-compose.gpu.yaml` file available in this repo:

```
docker compose -f docker-compose.gpu.yaml up
```

## Llama on GPU - using GPTQ-for-LLaMa
**Important Note**: Before running Llama or Llama 2 on GPU, make sure to install the [NVIDIA Driver](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html) on your host machine. You can verify the NVIDIA environment by executing the following command:

```
docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
```

You should see a table displaying the current NVIDIA driver version and related information, confirming the proper setup.

When running the Llama model with GPTQ-for-LLaMa 4-bit quantization, you can use a specialized Docker image designed for this purpose, `1b5d/llm-api:latest-gpu`, as an alternative to the default image. You can run this mode using a separate Docker Compose file:

```
docker compose -f docker-compose.gpu.yaml up
```

Or by directly running the container:

```
docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu
```

**Important Note**: The `llm-api:x.x.x-gptq-llama-cuda` and `llm-api:x.x.x-gptq-llama-triton` images have been deprecated. Please switch to the `1b5d/llm-api:latest-gpu` image when GPU support is required.