# Fast Inference Solutions for BLOOM

This repo provides demos and packages to perform fast inference solutions for BLOOM. Some of the solutions have their own repos, in which case a link to the [corresponding repo](#other-inference-solutions) is provided instead.

# Inference solutions for BLOOM 176B

We support HuggingFace Accelerate and DeepSpeed-Inference for generation.

Install the required packages:

```shell
pip install flask flask_api gunicorn pydantic accelerate "huggingface_hub>=0.9.0" "deepspeed>=0.7.3" deepspeed-mii==0.0.2
```

Alternatively, you can install DeepSpeed from source:
```shell
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
CFLAGS="-I$CONDA_PREFIX/include/" LDFLAGS="-L$CONDA_PREFIX/lib/" TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
```
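
To sanity-check the DeepSpeed installation (whichever route you took), DeepSpeed ships a `ds_report` utility that prints which ops could be built or loaded on your machine. This is optional and not required by the scripts in this repo:
```shell
# prints DeepSpeed's environment report (CUDA/torch versions, which ops are available)
ds_report
```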

All the provided scripts have been tested on 8 A100 80GB GPUs for BLOOM 176B (fp16/bf16) and on 4 A100 80GB GPUs for BLOOM 176B (int8). They might not work for other models or with a different number of GPUs.
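
Before launching, it can help to confirm that the expected GPUs are visible. This is an optional check with standard NVIDIA tooling, not something the scripts require:
```shell
# lists each visible GPU with its name and total memory
nvidia-smi --query-gpu=name,memory.total --format=csv
```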

DS-inference is deployed using logic borrowed from the DeepSpeed MII library.

Note: Sometimes GPU memory is not freed when a DS-inference deployment crashes. You can free this memory by running `killall python` in a terminal.

To use BLOOM quantized, set `dtype` to `int8`. For DeepSpeed-Inference, also change `model_name` to `microsoft/bloom-deepspeed-inference-int8`; for HF Accelerate, no change to `model_name` is needed.

HF Accelerate uses [LLM.int8()](https://arxiv.org/abs/2208.07339) and DS-inference uses [ZeroQuant](https://arxiv.org/abs/2206.01861) for post-training quantization.

## BLOOM inference via command-line

The CLI asks for generate_kwargs every time. Example: generate_kwargs =
```json
{"min_length": 100, "max_new_tokens": 100, "do_sample": false}
```

1. using HF Accelerate
```shell
python -m inference_server.cli --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
```

2. using DS-inference
```shell
python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
```
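
For the int8 variant described earlier, the same CLI flags can be combined with the quantized checkpoint. This is a sketch assembled from the model name and dtype mentioned above, not a separately verified command:
```shell
# assumes the DS-inference int8 checkpoint; mirrors the fp16 example above
python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-int8 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
```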

## BLOOM server deployment

[make <model_name>](../Makefile) can be used to launch a generation server. Please note that the serving method is synchronous, so users have to wait in a queue until the preceding requests have been processed. An example script for firing server requests is given [here](../server_request.py). Alternatively, a [Dockerfile](./Dockerfile) is also provided which launches a generation server on port 5000.
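
For illustration, a request against the Dockerized server on port 5000 might look like the sketch below. The route and payload fields here are assumptions, not confirmed by this README, so check [server_request.py](../server_request.py) for the exact request format:
```shell
# hypothetical endpoint and payload schema; see server_request.py for the real format
curl -X POST http://127.0.0.1:5000/generate/ \
    -H "Content-Type: application/json" \
    -d '{"text": ["DeepSpeed is a machine learning framework"], "max_new_tokens": 100}'
```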

An interactive UI can be launched via the following command to connect to the generation server. The default URL of the UI is `http://127.0.0.1:5001/`.
```shell
python -m ui
```
This command launches the following UI to play with generation. Sorry for the crappy design; unfortunately, my UI skills only go so far. 😅😅😅

## Benchmark system for BLOOM inference

1. using HF Accelerate
```shell
python -m inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5
```

2. using DS-inference
```shell
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
```
Alternatively, to load the model faster:
```shell
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
```

3. using DS-ZeRO
```shell
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
```

# Support

If you run into something not working or have other questions, please open an issue in the corresponding backend repo:

- [Accelerate](https://github.com/huggingface/accelerate/issues)
- [DeepSpeed-Inference](https://github.com/microsoft/DeepSpeed/issues)
- [DeepSpeed-ZeRO](https://github.com/microsoft/DeepSpeed/issues)

If there is a specific issue with one of the scripts rather than with the backend itself, please open an issue here and tag [@mayank31398](https://github.com/mayank31398).

# Other inference solutions

## Client-side solutions

Solutions developed to perform large batch inference locally:

PyTorch:

* [Custom HF Code](https://github.com/huggingface/transformers_bloom_parallel/)

JAX:

* [BLOOM Inference in JAX](https://github.com/huggingface/bloom-jax-inference)

## Server solutions

A solution developed to be used in server mode (i.e. varied batch size, varied request rate) can be found [here](https://github.com/Narsil/bloomserver). It is implemented in Rust.