[WIP]
We want to achieve and demonstrate state-of-the-art inference throughputs and latencies for our models. Here is a list of milestones and tasks. These are not necessarily in order, we can (and should) already look into the later milestones.
Milestone 1: Make a starter implementation of MQA and add it to BigCode transformers. Agreeing on a common implementation will be crucial for the next steps. (bigcode-project/transformers#4)
- Task 1.1: Implement a GPT2 model with MHA and MQA within BigCode transformers. We should keep support for MHA so we can compare with an equally optimized implementation (@bigximik, @jlamypoirier, @mayank31398). A minimal MQA sketch follows this milestone.
- Task 1.2: Add basic profiling support to our benchmarking code (@jlamypoirier). Profiling and misc #10
- Task 1.3: Validate, profile, and add simple optimizations for a ~1B model such as SantaCoder (@jlamypoirier).
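For reference, here is a minimal, illustrative sketch of the MQA idea (not the actual BigCode transformers implementation; module and parameter names are made up): each query head keeps its own projection, while all heads share a single key/value head, which shrinks the K/V projections and the KV cache by a factor of the number of heads. Causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Illustrative MQA sketch: per-head queries, one shared key/value head."""

    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        # One key/value head of size head_dim (MHA would use hidden_size here).
        self.kv_proj = nn.Linear(hidden_size, 2 * self.head_dim)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)  # (b, h, s, d)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)  # (b, s, d) each, shared by all heads
        attn = torch.softmax(q @ k.unsqueeze(1).transpose(-1, -2) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v.unsqueeze(1)).transpose(1, 2).reshape(b, s, -1)  # (b, s, h * d)
        return self.out_proj(out)
```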
Milestone 2: Turn our starter implementation into a strong baseline.
- Task 2.1: Verify our MQA implementation for correctness.
- Task 2.2: Add complete support for SantaCoder models. The released checkpoints use a different version of the code, so some changes will be needed. We will also need to adapt our benchmarking code.
- Task 2.3: Collaborate with the evaluation team to ensure a common codebase.
- Task 2.4: Further optimizations
- Avoid concatenations (@jlamypoirier); see the KV cache sketch at the end of this milestone.
- Reduce the CPU bottleneck with CUDA graphs and/or PyTorch 2.0.
- Fuse more kernels, with custom ones and/or PyTorch 2.0.
- Deal with slow model creation and unnecessary weight initialization (@jlamypoirier?). Hacky prototype here: https://github.com/bigcode-project/bigcode-inference-benchmark/blob/c1efe53ecbacb038347112cbc09c48b48a342ef0/src/utils/fast_init.py
- Get the faster GeLU (the tanh approximation) into HF transformers (@jlamypoirier); see the snippet at the end of this milestone. (Add the pytorch implementation of the OpenAI GeLU approximation, huggingface/transformers#21344)
- Something else?
- Task 2.5: After the other steps, benchmark inference.
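On the "avoid concatenations" item in Task 2.4, here is a hedged sketch of one way to do it (names and shapes are hypothetical, not the benchmark's code): growing the cache with torch.cat reallocates and copies it at every generated token, whereas a pre-allocated buffer is written in place.

```python
import torch

class PreallocatedKVCache:
    """Hypothetical sketch: write new keys/values into a pre-allocated buffer
    instead of concatenating the cache at every generation step."""

    def __init__(self, batch_size, max_seq_len, kv_dim, dtype=torch.float16, device="cuda"):
        self.cache = torch.empty(batch_size, max_seq_len, kv_dim, dtype=dtype, device=device)
        self.length = 0

    def append(self, new_kv):
        # new_kv: (batch, new_tokens, kv_dim); copy in place, no reallocation.
        n = new_kv.shape[1]
        self.cache[:, self.length:self.length + n].copy_(new_kv)
        self.length += n
        return self.cache[:, :self.length]  # view over the valid prefix
```

On the faster GeLU: the OpenAI/GPT-2 activation is the tanh approximation of GeLU, which recent PyTorch releases (1.12+) expose as a built-in, so a custom kernel may not even be needed:

```python
import math
import torch
import torch.nn.functional as F

def gelu_tanh(x):
    # OpenAI/GPT-2 GeLU approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))

x = torch.randn(4)
print(torch.allclose(gelu_tanh(x), F.gelu(x, approximate="tanh"), atol=1e-6))  # True
```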
Milestone 3: Scaling up
- Task 3.1: Look into alternative libraries (semi-optional)
- Try inference with Megatron
- Add MQA support to DeepSpeed
- Other suggestions?
- Task 3.2: Collaborate with the training team to determine our scaling needs and the target model configurations.
- Task 3.3: Add support for tensor model parallelism. This will likely involve an alternative library. It will be needed to reduce latency for bigger models, and possibly for memory, depending on the target model size and hardware (we can fit roughly 40B parameters in fp16 on an A100; see the rough estimate at the end of this milestone).
- Task 3.4: Optimize for bigger models.
- Task 3.5: Benchmark the bigger models.
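A rough sanity check for the ~40B figure in Task 3.3, assuming an 80 GB A100 and counting weights only (KV cache, activations and framework overhead come on top):

```python
params = 40e9        # ~40B parameters
bytes_per_param = 2  # fp16
print(params * bytes_per_param / 1e9, "GB")  # 80.0 GB: weights alone fill an A100-80GB
```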
Milestone 4: Deployment
- Task 4.1: Optimize end-to-end model performance
- Optimize tokenization
- Optimize decoding
- Run them asynchronously whenever possible (i.e., in parallel with GPU ops for other batches); see the overlap sketch at the end of this milestone.
- Task 4.2: Use a fast inference server (HF inference, BigScience inference, DeepSpeed inference, NVIDIA Triton?)
- Task 4.3: Integrate our optimized model into HF transformers. [WIP] Adding GPT2 with Multi Query Attention huggingface/transformers#21253
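For the asynchronous tokenization/decoding in Task 4.1, a minimal illustrative sketch (tokenize, generate and decode are hypothetical stand-ins for the real pipeline functions): tokenize the next batch and decode finished batches on CPU threads while the GPU generates for the current batch.

```python
import concurrent.futures

def run_pipeline(batches, tokenize, generate, decode):
    """Overlap CPU-side tokenization/decoding with GPU-bound generation."""
    decoded = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        next_tokens = pool.submit(tokenize, batches[0])
        for i in range(len(batches)):
            tokens = next_tokens.result()
            if i + 1 < len(batches):
                # Start tokenizing the next batch before the GPU-bound call.
                next_tokens = pool.submit(tokenize, batches[i + 1])
            output_ids = generate(tokens)  # GPU-bound generation
            decoded.append(pool.submit(decode, output_ids))  # decode on a CPU thread
    return [d.result() for d in decoded]
```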