Optimizing llama.cpp flags for max performance; tuning with the Optuna ML library (llama-optimus) #14191
BrunoArsioli started this conversation in Ideas
Replies: 1 comment
- why not llama-bench?
Hi everyone, I started experimenting with local AI and wondered whether there was a tool that could help me find the best llama.cpp parameters for maximum tokens/s generation. Tuning flags like -t, --batch-size, --ubatch-size, -ngl, and --flash-attn by hand consumes a lot of time.
From a previous ML project, I knew that the Python library Optuna could be used to tune llama.cpp's runtime flags automatically. I've been experimenting with it because I really want to avoid manually tuning all the CPU/GPU flags by trial and error, especially since I tend to test many models.
For the automated search, I've been developing llama-optimus; it runs from the shell as a small command-line tool.
It helps find the llama.cpp flags that maximize token-generation speed (tg), prompt-processing speed (pp), or the average of the two (mean); the target is selected with --metric <tg|pp|mean>.
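To make the target metric concrete, here is a minimal sketch (not llama-optimus's actual code) of how tg, pp, and mean could be read out of a single llama-bench run. It assumes llama-bench's -o json output mode and the avg_ts / n_prompt / n_gen field names, plus a placeholder model path; these details may differ across llama.cpp versions.

```python
import json
import subprocess

def bench_metric(metric: str = "tg", model: str = "model.gguf") -> float:
    """Run llama-bench once and return tokens/s for the chosen metric.

    metric: "tg" (token generation), "pp" (prompt processing),
            or "mean" (simple average of the two).
    """
    # -o json asks llama-bench to emit one JSON record per test
    # (a prompt-processing test and a token-generation test).
    out = subprocess.run(
        ["llama-bench", "-m", model, "-p", "512", "-n", "128", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    records = json.loads(out)

    # avg_ts is the average tokens/second of a test; n_prompt / n_gen
    # indicate whether a record belongs to a pp or a tg test.
    pp = [r["avg_ts"] for r in records if r.get("n_prompt", 0) > 0]
    tg = [r["avg_ts"] for r in records if r.get("n_gen", 0) > 0]
    pp_ts = sum(pp) / len(pp) if pp else 0.0
    tg_ts = sum(tg) / len(tg) if tg else 0.0
    return {"tg": tg_ts, "pp": pp_ts, "mean": (tg_ts + pp_ts) / 2.0}[metric]
```

The --metric <tg|pp|mean> option then just selects which of these numbers the optimizer maximizes.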
What is already implemented in llama-optimus?
- An -ngl upper limit to avoid crashes
- llama-bench trials, with flags sampled via Optuna's TPESampler (a minimal sketch of this kind of search loop follows the list)
- Target metrics: tg – generation throughput, pp – prompt-processing throughput, mean – simple average of the two
- Works with llama-server / llama-bench
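For context, this is the general shape of the search loop, not llama-optimus itself: Optuna's TPESampler proposes flag combinations, each trial runs a short llama-bench pass, and the study maximizes the reported tokens/s. The model path, flag ranges, the exact llama-bench invocation, and the avg_ts field name are illustrative assumptions; here a crashing configuration (e.g. -ngl set too high for the available VRAM) simply prunes the trial, whereas llama-optimus caps -ngl up front.

```python
import json
import subprocess

import optuna

MODEL = "model.gguf"  # assumption: path to a local GGUF model

def run_llama_bench(extra_flags: list[str]) -> float:
    """Run a short generation-only benchmark and return tokens/s."""
    cmd = ["llama-bench", "-m", MODEL, "-p", "0", "-n", "128", "-o", "json", *extra_flags]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # e.g. -ngl too high for the available VRAM: drop this trial.
        raise optuna.TrialPruned()
    # avg_ts (average tokens/second) is an assumed field name.
    return max(r["avg_ts"] for r in json.loads(proc.stdout))

def objective(trial: optuna.Trial) -> float:
    # Search space loosely mirroring the flags discussed above; ubatch
    # choices stay <= the smallest batch choice so combinations remain valid.
    threads = trial.suggest_int("threads", 2, 16)
    batch = trial.suggest_categorical("batch_size", [512, 1024, 2048])
    ubatch = trial.suggest_categorical("ubatch_size", [64, 128, 256, 512])
    ngl = trial.suggest_int("n_gpu_layers", 0, 99)
    fa = trial.suggest_categorical("flash_attn", [0, 1])
    return run_llama_bench([
        "-t", str(threads), "-b", str(batch), "-ub", str(ubatch),
        "-ngl", str(ngl), "-fa", str(fa),
    ])

if __name__ == "__main__":
    study = optuna.create_study(
        direction="maximize", sampler=optuna.samplers.TPESampler(seed=0)
    )
    study.optimize(objective, n_trials=30)
    print("best tokens/s:", study.best_value)
    print("best flags:", study.best_params)
```

After a handful of trials, TPE starts concentrating on the flag regions that have produced the highest throughput, which is why it usually needs far fewer runs than a full grid over the same flags.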
I wonder how this could be integrated into llama.cpp/tools (or shipped as an official companion script).
Reference GitHub Repo (MIT)
Open to feedback, edge cases, or anything I might have missed (by the way, if someone could test this on a cluster and report the results, that would be great).
thanks!