Optimizing llama.cpp flags for max performance; tuning with the Optuna ML library (llama-optimus) #14191
BrunoArsioli started this conversation in Ideas
Replies: 1 comment
- why not llama-bench?
Hi everyone, I started experimenting with local AI and wondered whether there was a tool that could help me find the best llama.cpp parameters for maximum tokens/s generation. Tuning flags like -t, --batch-size, --ubatch-size, -ngl, and --flash-attn by hand consumes a lot of time.
From a previous ML project, I knew that the Python library Optuna could be used to tune llama.cpp's runtime flags automatically. I've been experimenting with it because I really want to avoid manually tuning all the CPU/GPU flags by trial and error, especially since I tend to test many models.
For the automated search, I've been developing llama-optimus; it runs from the shell as a small command-line tool.
It helps find the llama.cpp flags that maximize token-generation speed (tg), prompt-processing speed (pp), or the average of the two (mean); the target is selected with --metric <tg|pp|mean>.
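To make the target metric concrete, here is a minimal sketch (not llama-optimus's actual code) of how tg, pp, and mean could be read out of a single llama-bench run. It assumes llama-bench's -o json output mode and the avg_ts / n_prompt / n_gen field names, plus a placeholder model path; these details may differ across llama.cpp versions.

```python
import json
import subprocess

def bench_metric(metric: str = "tg", model: str = "model.gguf") -> float:
    """Run llama-bench once and return tokens/s for the chosen metric.

    metric: "tg" (token generation), "pp" (prompt processing),
            or "mean" (simple average of the two).
    """
    # -o json asks llama-bench to emit one JSON record per test
    # (a prompt-processing test and a token-generation test).
    out = subprocess.run(
        ["llama-bench", "-m", model, "-p", "512", "-n", "128", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    records = json.loads(out)

    # avg_ts is the average tokens/second of a test; n_prompt / n_gen
    # indicate whether a record belongs to a pp or a tg test.
    pp = [r["avg_ts"] for r in records if r.get("n_prompt", 0) > 0]
    tg = [r["avg_ts"] for r in records if r.get("n_gen", 0) > 0]
    pp_ts = sum(pp) / len(pp) if pp else 0.0
    tg_ts = sum(tg) / len(tg) if tg else 0.0
    return {"tg": tg_ts, "pp": pp_ts, "mean": (tg_ts + pp_ts) / 2.0}[metric]
```

The --metric <tg|pp|mean> option then just selects which of these numbers the optimizer maximizes.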
What is already implemented in llama-optimus?
- An -ngl upper limit to avoid crashes
- llama-bench trials, with flags sampled via Optuna's TPESampler (a minimal sketch of this kind of search loop follows the list)
- Target metrics: tg – generation throughput, pp – prompt-processing throughput, mean – simple average of the two
- Works with llama-server / llama-bench
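For context, this is the general shape of the search loop, not llama-optimus itself: Optuna's TPESampler proposes flag combinations, each trial runs a short llama-bench pass, and the study maximizes the reported tokens/s. The model path, flag ranges, the exact llama-bench invocation, and the avg_ts field name are illustrative assumptions; here a crashing configuration (e.g. -ngl set too high for the available VRAM) simply prunes the trial, whereas llama-optimus caps -ngl up front.

```python
import json
import subprocess

import optuna

MODEL = "model.gguf"  # assumption: path to a local GGUF model

def run_llama_bench(extra_flags: list[str]) -> float:
    """Run a short generation-only benchmark and return tokens/s."""
    cmd = ["llama-bench", "-m", MODEL, "-p", "0", "-n", "128", "-o", "json", *extra_flags]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # e.g. -ngl too high for the available VRAM: drop this trial.
        raise optuna.TrialPruned()
    # avg_ts (average tokens/second) is an assumed field name.
    return max(r["avg_ts"] for r in json.loads(proc.stdout))

def objective(trial: optuna.Trial) -> float:
    # Search space loosely mirroring the flags discussed above; ubatch
    # choices stay <= the smallest batch choice so combinations remain valid.
    threads = trial.suggest_int("threads", 2, 16)
    batch = trial.suggest_categorical("batch_size", [512, 1024, 2048])
    ubatch = trial.suggest_categorical("ubatch_size", [64, 128, 256, 512])
    ngl = trial.suggest_int("n_gpu_layers", 0, 99)
    fa = trial.suggest_categorical("flash_attn", [0, 1])
    return run_llama_bench([
        "-t", str(threads), "-b", str(batch), "-ub", str(ubatch),
        "-ngl", str(ngl), "-fa", str(fa),
    ])

if __name__ == "__main__":
    study = optuna.create_study(
        direction="maximize", sampler=optuna.samplers.TPESampler(seed=0)
    )
    study.optimize(objective, n_trials=30)
    print("best tokens/s:", study.best_value)
    print("best flags:", study.best_params)
```

After a handful of trials, TPE starts concentrating on the flag regions that have produced the highest throughput, which is why it usually needs far fewer runs than a full grid over the same flags.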
I wonder how this could be integrated into llama.cpp/tools (or shipped as an official companion script).
Reference GitHub Repo (MIT)
Open to feedback, edge cases, or anything I might have missed (by the way, if someone could test this on a cluster and report the results, that would be great).
thanks!