Inference tasks and milestones #11

@jlamypoirier

Description

[WIP]
We want to achieve and demonstrate state-of-the-art inference throughputs and latencies for our models. Here is a list of milestones and tasks. These are not necessarily in order; we can (and should) already look into the later milestones.

Milestone 1: Make a starter implementation of MQA and add it to BigCode transformers. Agreeing on a common implementation will be crucial for the next steps. (bigcode-project/transformers#4)

  • Task 1.1: Implement a GPT2 model with MHA and MQA within BigCode transformers. We should keep support for MHA so we can compare with an equally optimized implementation (@bigximik, @jlamypoirier, @mayank31398).
  • Task 1.2: Add basic profiling support to our benchmarking code (@jlamypoirier). Profiling and misc #10
  • Task 1.3: Validate, profile and add simple optimizations for our model for a ~1B model such as SantaCoder (@jlamypoirier).
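As a rough illustration of what Task 1.1 involves, here is a minimal NumPy sketch of causal attention that covers both variants: with as many K/V heads as query heads it is standard MHA, and with a single shared K/V head it is MQA. All names and shapes here are illustrative, not the actual BigCode implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """q: (n_heads, seq, head_dim); k, v: (kv_heads, seq, head_dim).

    kv_heads == n_heads gives standard MHA; kv_heads == 1 broadcasts the
    single K/V head across all query heads, which is MQA."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    seq = q.shape[-2]
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # causal mask
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_heads, seq, d = 8, 16, 64
q = rng.standard_normal((n_heads, seq, d))
k_mqa = rng.standard_normal((1, seq, d))  # one shared K head
v_mqa = rng.standard_normal((1, seq, d))  # one shared V head
out = causal_attention(q, k_mqa, v_mqa)   # shape (n_heads, seq, d)
```

The inference payoff is the KV cache: it stores kv_heads × seq × head_dim values per layer, so MQA shrinks it by a factor of n_heads relative to MHA, which is what makes large-batch generation cheaper.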

Milestone 2: Turn our starter implementation into a strong baseline.

Milestone 3: Scaling up

  • Task 3.1: Look into alternative libraries (semi-optional)
    • Try inference with Megatron
    • Add MQA support to deepspeed
    • Other suggestions?
  • Task 3.2: Collaborate with the training team to determine our scaling needs and the target model configurations.
  • Task 3.3: Add support for tensor model parallelism. This will likely involve an alternative library. This will be needed to reduce latency for bigger models, and possibly for memory depending on the target model size and hardware (we can fit up to ~40B parameters in fp16 on a single A100).
  • Task 3.4: Optimize for bigger models.
  • Task 3.5: Benchmark the bigger models.
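The ~40B figure in Task 3.3 is back-of-the-envelope arithmetic; a sketch of it, assuming the 80 GB A100 variant and counting weights only (KV cache, activations, and framework overhead are ignored here):

```python
A100_MEM_BYTES = 80e9        # 80 GB A100 (the 40 GB variant halves this)
BYTES_PER_PARAM_FP16 = 2     # fp16/bf16 weight storage

max_params = A100_MEM_BYTES / BYTES_PER_PARAM_FP16
print(f"weights-only fit: ~{max_params / 1e9:.0f}B parameters")
```

In practice the KV cache and activations need headroom on top of the weights, so tensor parallelism becomes necessary well before the weights-only limit.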

Milestone 4: Deployment

  • Task 4.1: Optimize end-to-end model performance
    • Optimize tokenization
    • Optimize decoding
    • Run them asynchronously whenever possible (i.e., in parallel with GPU ops for other batches)
  • Task 4.2: Use a fast inference server (HF inference, Big science inference, Deepspeed inference, Nvidia Triton??)
  • Task 4.3: Integrate our optimized model into HF transformers. [WIP] Adding GPT2 with Multi Query Attention huggingface/transformers#21253
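To illustrate the overlap described in Task 4.1, here is a minimal stdlib sketch that tokenizes the next batch on a worker thread while the current batch's forward pass runs; `tokenize` and `run_model` are placeholders standing in for a real tokenizer and a GPU call, not actual APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tokenize(texts):
    # Placeholder for CPU-side tokenization (real code would use a fast tokenizer).
    return [t.split() for t in texts]

def run_model(token_batch):
    # Placeholder for the GPU forward pass; the sleep stands in for kernel time.
    time.sleep(0.01)
    return len(token_batch)

batches = [["hello world"], ["multi query attention"], ["fast inference"]]

results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    tokens = tokenize(batches[0])
    for i in range(len(batches)):
        # Kick off tokenization of the next batch on a worker thread...
        nxt = pool.submit(tokenize, batches[i + 1]) if i + 1 < len(batches) else None
        # ...while the "GPU" processes the current batch.
        results.append(run_model(tokens))
        if nxt is not None:
            tokens = nxt.result()
```

The same overlap applies to detokenization and decoding: any CPU-side work for batch N+1 can run in parallel with the GPU ops for batch N.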
