quantize : configurable neutral imatrix prior #15060
base: master
Conversation
I remember this one being horribly broken: https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct

Edit: Here's an imatrix with NaNs: https://huggingface.co/legraphista/Qwen2-57B-A14B-Instruct-IMat-GGUF
I have been having a horrible time with the latest Qwen3 30B. I've tried 10 MB sized diverse datasets, even a custom one I made last time that I found REALLY forced diversity; no dice on any of them. I tried a ton of datasets and all of them failed, even on Q5_K. I'll try to organize them and upload them tomorrow for reference; running another test tonight.
From my understanding, to fix the problems with MoE tensors: it "should" not matter, as the weighting factors are applied per 256-element block (or 32 for the legacy quants), and so long as all rows are a multiple of 256, any experts that get no samples during the imatrix run should simply fall back to the unweighted case.

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root" method explained here:

```c
weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j])
```

so using an equal weight here for the MoE tensors with no samples simplifies to:

```c
weight[j] = 1 * sqrtf(sigma2 + xb[j]*xb[j]) = sqrtf(sigma2 + xb[j]*xb[j])
```

rather than:

```c
weight[j] = xb[j]*xb[j]
```

where:

```c
const float * xbl = x + QK_K*ibl;
float sumx2 = 0;
for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
float sigma2 = 2*sumx2/QK_K;
```

If using …
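For reference, here is a standalone sketch contrasting the three per-element weightings mentioned above (this is not the llama.cpp implementation; `QK_K`, `sigma2` and the formulas mirror the quoted snippet, while the `xb`/`qw` values are made up for illustration):

```cpp
// Sketch: compare imatrix-weighted, "neutral" (qw[j] == 1), and plain
// unweighted x*x per-element weights for one 256-element block.
#include <math.h>
#include <stdio.h>

constexpr int QK_K = 256;

int main() {
    float xb[QK_K]; // one block of model weights (made-up values)
    float qw[QK_K]; // per-element imatrix weights for this block (made-up values)
    for (int i = 0; i < QK_K; ++i) {
        xb[i] = 0.01f*(i % 17) - 0.08f;
        qw[i] = 1.0f + 0.5f*(i % 5);
    }

    // sigma2 exactly as in the quoted snippet: 2 * mean of squared values
    float sumx2 = 0.0f;
    for (int i = 0; i < QK_K; ++i) sumx2 += xb[i]*xb[i];
    const float sigma2 = 2.0f*sumx2/QK_K;

    for (int j = 0; j < 4; ++j) {
        const float w_imatrix    = qw[j]*sqrtf(sigma2 + xb[j]*xb[j]); // weighted case
        const float w_neutral    =       sqrtf(sigma2 + xb[j]*xb[j]); // qw[j] == 1
        const float w_unweighted = xb[j]*xb[j];                       // no-imatrix case
        printf("j=%d  imatrix=%.5f  neutral=%.5f  unweighted=%.5f\n",
               j, w_imatrix, w_neutral, w_unweighted);
    }
    return 0;
}
```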
@jukofyork There is only one unique evaluation count per tensor when using `imatrix.dat`. (Note that in this context, evaluation counts map to tokens, to avoid depending on a chunk size.)

The number of neutral tokens (aka the weight of the neutral prior) makes a difference only when the evaluation count is non-zero but still small enough to have a similar order of magnitude.
Right, this is different. But it's also kind of close to the same. In #12557 I also tried to use that (with sigma2) in the unweighted case, and mostly saw some improvement.

To be clear, this PR doesn't change what happens when the evaluation count is 0; that was already changed in #9400, and has the behavior you're describing. This also only happens with …

What this PR changes is what happens when the evaluation count is small, to make the imatrix weights less impactful when the sample size isn't big enough (which often happens with MoE tensors when there are many experts).

The difference you're describing is still relevant, though, because it does mean the "neutral prior" is not quite like the unweighted case for some types. But it's not too far from that, and the perplexity should be similar, based on what I've seen when working on #12557.
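To illustrate the "small evaluation count" point, here is a minimal sketch under my reading of the PR description (a weighted average between the observed mean of squared activations and the neutral value 1); `blended_weight` is a hypothetical helper and all numbers are made up, this is not the PR's actual code:

```cpp
#include <stdio.h>

// Weighted average of the observed mean (sum_sq/count) and the neutral value 1,
// where `prior` plays the role of --prior-weight (in "neutral tokens").
static float blended_weight(float sum_sq, float count, float prior) {
    return (sum_sq + prior*1.0f) / (count + prior);
}

int main(void) {
    const float mean_sq = 4.0f; // made-up mean of squared activations for one channel
    const float prior   = 1.0f; // the default --prior-weight
    const float counts[] = {0.0f, 2.0f, 64.0f, 100000.0f};
    for (int i = 0; i < 4; ++i) {
        const float n = counts[i];
        printf("count=%8.0f -> weight=%.4f\n", n, blended_weight(mean_sq*n, n, prior));
    }
    // Prints roughly: 1.0000 (purely neutral), 3.0000, 3.9538, 4.0000 --
    // the prior only matters while the evaluation count is comparable to it.
    return 0;
}
```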
Follow-up from #9400 (comment) to allow experimenting with different weights for the neutral prior with GGUF-based imatrix.
This should help avoid some NaNs and/or unstable quantization with MoE models that have some rarely-used experts.
Basically, imatrix weights are per-channel averages of squared activations. The new feature here inserts `1` (the neutral weight value when no imatrix is provided) into that average with a configurable weight (in the sense of a weighted average). The default is `1` (on `master`, this was technically `0`), which means the neutral weight is worth as much as a single token from the calibration dataset.

This only works with GGUF-based imatrix files, because they store per-expert activation counts (unlike the `imatrix.dat` format).

What I don't know is if it would be better to use some different value than `1` token, and so I've made it configurable with `--prior-weight` to make it easier to test different values. ("prior weight" might not be an intuitive name, suggestions welcome. "neutral tokens", maybe?)

Example usage:

```sh
$ ./llama-quantize --imatrix imatrix.gguf --prior-weight 128 model-F16.gguf model-Q4_K_M.gguf q4_k_m
```
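For concreteness, this is how I read that weighted average (illustrative notation, not taken from the code): if $S_j$ is the sum of squared activations seen for channel $j$, $n$ the evaluation count in tokens, and $p$ the value passed to `--prior-weight`, then

$$
w_j = \frac{S_j + p \cdot 1}{n + p}
$$

With made-up values $S_j = 8$ over $n = 2$ tokens (a rarely-used expert), the example above with $p = 128$ gives $w_j = 136/130 \approx 1.05$ (close to neutral), while $p = 1$ gives $w_j = 3$ and $p = 0$ gives $w_j = 4$.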
When `--prior-weight` is not specified, `--prior-weight 1` is implied. To get the same behavior as before this PR, `--prior-weight 0` can be used.

TODO

- [ ] `imatrix.gguf` with problematic MoE models and different prior weights (… `imatrix.gguf` file)