After post-training predictors on Llama 3 output, we need to evaluate:

- on MMLU and AIME; refer to [further benchmarks](https://huggingface.co/meta-llama/Llama-3.2-1B#instruction-tuned-models)
- under int8 quantization, fp16, and fp32 (see the sketch below)
- with and without explicitly sparsified LLMs such as [ProSparse](https://arxiv.org/abs/2402.13516)
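
A minimal sketch of loading the model under the three precision settings we want to compare, assuming evaluation goes through the Hugging Face `transformers` library with bitsandbytes for int8; the model ID and the evaluation loop placeholder are illustrative assumptions, not fixed choices.

```python
# Sketch: load Llama-3.2-1B-Instruct under int8 / fp16 / fp32 for evaluation.
# Model ID and use of bitsandbytes int8 are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # assumed evaluation target

def load_model(precision: str):
    """Load the model under one of the precision settings to evaluate."""
    if precision == "int8":
        # int8 weight quantization via bitsandbytes (requires a CUDA GPU)
        return AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
    dtype = torch.float16 if precision == "fp16" else torch.float32
    return AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, device_map="auto"
    )

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
for precision in ("int8", "fp16", "fp32"):
    model = load_model(precision)
    # ... run the MMLU / AIME harness on `model` here and record scores ...
```

The same loop would be repeated for the sparsified checkpoints (e.g. a ProSparse-style model) to get the with/without-sparsification comparison.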