quantize : configurable neutral imatrix prior #15060


Open · compilade wants to merge 2 commits into master

Conversation

compilade (Collaborator) commented on Aug 3, 2025

Follow-up from #9400 (comment) to allow experimenting with different weights for the neutral prior with GGUF-based imatrix files.

This should help avoid some NaNs and/or unstable quantization with MoE models that have rarely-used experts.

Basically, imatrix weights are per-channel averages of squared activations. This new feature inserts 1 (the neutral weight value used when no imatrix is provided) into that average with a configurable weight (in the sense of a weighted average).

The default is 1 (on master, this was technically 0), which means the neutral weight is worth as much as a single token from the calibration dataset.
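
To sketch the idea (a made-up helper, not the actual implementation), the weighted average looks roughly like this:

// hypothetical sketch: sum_sq_act is the accumulated sum of squared activations for
// one channel, n_tokens is the activation count for the tensor (or expert)
static float imatrix_weight(float sum_sq_act, float n_tokens, float prior_weight) {
    // the neutral value 1.0f counts as prior_weight extra tokens in the average;
    // prior_weight == 0 gives the old behavior, prior_weight == 1 is the new default
    return (sum_sq_act + prior_weight*1.0f) / (n_tokens + prior_weight);
}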

This only works with GGUF-based imatrix files, because they store per-expert activation counts (unlike the imatrix.dat format).

What I don't know is whether it would be better to use a value other than 1 token, so I've made it configurable with --prior-weight to make it easier to test different values. ("prior weight" might not be an intuitive name, suggestions welcome; "neutral tokens", maybe?)

Example usage:

$ ./llama-quantize --imatrix imatrix.gguf --prior-weight 128 model-F16.gguf model-Q4_K_M.gguf q4_k_m

When --prior-weight is not specified, --prior-weight 1 is implied.

To get the same behavior as before this PR, --prior-weight 0 can be used.

TODO


Make sure to read the contributing guidelines before submitting a PR

compilade added the generation quality, research 🔬, and need feedback labels on Aug 3, 2025
CISC (Collaborator) commented on Aug 3, 2025

I remember this one being horribly broken: https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct

Edit: Here's an imatrix with NaNs: https://huggingface.co/legraphista/Qwen2-57B-A14B-Instruct-IMat-GGUF

bartowski1182 (Contributor) commented on Aug 3, 2025

I have been having a horrible time with the latest Qwen3 30B. I've tried 10 MB diverse datasets, even a custom one I made last time that I found REALLY forced diversity; no dice on any of them.

I tried a ton of datasets and they all failed, even on Q5_K.

I'll try to organize them and upload tomorrow for reference, running another test tonight

jukofyork (Collaborator) commented on Aug 4, 2025

What I don't know is if it would be better to use some different value than 1 token

From my understanding, to fix the problems with MoE tensors it "should" not matter, as the weighting factors are applied per 256-element block (or 32 for the legacy quants); as long as all rows are a multiple of 256, any that get no samples during imatrix creation "should" end up with equal weights.

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here:

ikawrakow/ik_llama.cpp#140

weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j])

so using an equal weight here for the MoE tensors with no samples simplifies to:

weight[j] = 1 * sqrtf(sigma2 + xb[j]*xb[j]) = sqrtf(sigma2 + xb[j]*xb[j])

rather than:

weight[j] = xb[j]*xb[j]

where:

const float * xbl = x + QK_K*ibl;                       // start of super-block ibl
float sumx2 = 0;
for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
float sigma2 = 2*sumx2/QK_K;                            // twice the mean squared activation of the block

If using qw[j] = 1 is still causing problems for MoE models, then the problem can only lie here.
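
Putting the pieces together, here's a rough self-contained sketch of what I mean (made-up names, not the exact llama.cpp code):

#include <math.h>

#define QK_K 256

// hypothetical: compute per-element quantization weights for super-block ibl of row x
static void block_weights(const float * x, const float * qw, int ibl, float * weight) {
    const float * xbl = x + QK_K*ibl;
    float sumx2 = 0;
    for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
    const float sigma2 = 2*sumx2/QK_K;
    for (int j = 0; j < QK_K; ++j) {
        const float xx = xbl[j]*xbl[j];
        if (qw) {
            // imatrix case: with qw[j] == 1 (e.g. an expert that saw no samples)
            // this reduces to sqrtf(sigma2 + xx), not the plain xx below
            weight[j] = qw[j]*sqrtf(sigma2 + xx);
        } else {
            // no imatrix at all: the plain unweighted case
            weight[j] = xx;
        }
    }
}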

compilade (Collaborator, Author) commented on Aug 4, 2025

What I don't know is if it would be better to use some different value than 1 token

From my understanding, to fix the problems with MoE tensors - it's "should" not matter as the weighting factors are applied per 256 element block (or 32 for the legacy quants),

@jukofyork
If the row sizes are not multiples of the block sizes, then quantization cannot really happen anyway.

There is only one unique evaluation count per tensor when using MUL_MAT. The counts can only be distinct between 2D matrices when MUL_MAT_ID is used.

(Note that in this context, evaluation counts map to tokens, to avoid depending on a chunk size)

The number of neutral tokens (aka the weight of the neutral prior) makes a difference only when the evaluation count is non-zero but still small enough to be of a similar order of magnitude as the prior weight.

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here: ikawrakow/ik_llama.cpp#140

Right, this is different. But it's also kind of close to the same. In #12557 I also tried to use that (with sigma2) in the unweighted case, and mostly saw some improvement.
For some types (e.g. Q3_K), it was actually better (or at least not significantly worse) to use 1.0f instead of xb[i] * xb[i] (or variants) in the unweighted case.

To be clear, this PR doesn't change what happens when the evaluation count is 0; that was already changed in #9400 and has the behavior you're describing. That case also only happens with MUL_MAT_ID (for MoE experts), because the counts only exist after the first collection (and so are at least 1 for MUL_MAT).

What this PR changes is what happens when the evaluation count is small, to make the imatrix weights less impactful when the sample size isn't big enough (which often happens with MoE tensors when there are many experts).
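
For a made-up example: if a channel's average squared activation over 4 tokens is 100, then with --prior-weight 1 the stored weight becomes (4*100 + 1)/(4 + 1) ≈ 80.2, noticeably pulled toward neutral, while with 4096 tokens the same prior gives (4096*100 + 1)/4097 ≈ 99.98, which is practically unchanged.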

The difference you're describing is still relevant, though, because it does mean the "neutral prior" is not quite like the unweighted case for some types. But it's not too far from that and the perplexity should be similar, based on what I've seen when working on #12557.
