
Description
In the Bestla README.md, the supported weight-only quantization configs are documented here: https://github.com/intel/neural-speed/blob/main/bestla/README.md#weight-only
Since Bestla supports the BF16 compute dtype, I quantized the model using quantize.py (https://github.com/intel/neural-speed/blob/main/scripts/quantize.py):
Ex: python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 128 --compute_dtype bf16
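For reference, I then ran inference on the quantized model along these lines (the exact scripts/inference.py flags are written from memory and may differ, so treat this as a sketch rather than the literal command):
Ex: python scripts/inference.py --model_name llama2 -m ne-q4_j.bin -p "Once upon a time" -n 32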
During inference, I noticed that the F32 APIs are triggered for both the FP32 and BF16 compute types:
One scenario is within the QKV fusion path, where BTLAGemmCompF32() is triggered for both F32 and BF16:

```cpp
void BTLAGemmCompF32(const int M, const int N, const int K, const float* A, const int lda,
```
Question 1: Can I use the Bestla/Neural Speed APIs with the BF16 compute dtype on the AVX512 ISA without falling back to F32, and which input dtypes do these APIs support?
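
For context on the ISA part of the question: my understanding is that native BF16 GEMM on AVX512 needs the AVX512_BF16 extension (VDPBF16PS etc.), otherwise a fallback to F32 compute is expected. Below is a minimal, standalone check (not taken from the neural-speed code base; it assumes GCC/Clang and <cpuid.h>) that reports whether the host CPU exposes AVX512_BF16:

```cpp
// Minimal sketch (not from neural-speed): check for AVX512_BF16 support,
// i.e. CPUID.(EAX=7, ECX=1):EAX bit 5. Assumes GCC/Clang for <cpuid.h>.
#include <cpuid.h>
#include <cstdio>

int main() {
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  // Leaf 7, sub-leaf 1: EAX bit 5 is the AVX512_BF16 feature flag.
  bool has_bf16 = __get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx) && (eax & (1u << 5));
  std::printf("AVX512_BF16: %s\n",
              has_bf16 ? "supported" : "not supported (BF16 would need conversion / F32 fallback)");
  return 0;
}
```

If the flag is absent, the F32 path would at least be explainable by the missing ISA rather than by the dispatch logic.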