quantize : configurable neutral imatrix prior #15060


Open · compilade wants to merge 2 commits into master

Conversation

compilade (Collaborator) commented on Aug 3, 2025

Follow-up from #9400 (comment) to allow experimenting with different weights for the neutral prior with GGUF-based imatrix files.

This should help avoid some NaNs and/or unstable quantization with MoE models that have rarely-used experts.

Basically, imatrix weights are per-channel averages of squared activations. This new feature inserts 1 (the neutral weight value used when no imatrix is provided) into that average with a configurable weight (in the sense of a weighted average).

The default is 1 (on master, this was technically 0), which means the neutral weight is worth as much as a single token from the calibration dataset.
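
To sketch the idea (a made-up helper, not the actual implementation), the weighted average looks roughly like this:

// hypothetical sketch: sum_sq_act is the accumulated sum of squared activations for
// one channel, n_tokens is the activation count for the tensor (or expert)
static float imatrix_weight(float sum_sq_act, float n_tokens, float prior_weight) {
    // the neutral value 1.0f counts as prior_weight extra tokens in the average;
    // prior_weight == 0 gives the old behavior, prior_weight == 1 is the new default
    return (sum_sq_act + prior_weight*1.0f) / (n_tokens + prior_weight);
}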

This only works with GGUF-based imatrix files, because they store per-expert activation counts (unlike the imatrix.dat format).

What I don't know is whether it would be better to use a value other than 1 token, so I've made it configurable with --prior-weight to make it easier to test different values. ("prior weight" might not be an intuitive name, suggestions welcome; "neutral tokens", maybe?)

Example usage:

$ ./llama-quantize --imatrix imatrix.gguf --prior-weight 128 model-F16.gguf model-Q4_K_M.gguf q4_k_m

When --prior-weight is not specified, --prior-weight 1 is implied.

To get the same behavior as before this PR, --prior-weight 0 can be used.

TODO


Make sure to read the contributing guidelines before submitting a PR

compilade added the generation quality, research 🔬, and need feedback labels on Aug 3, 2025
CISC (Collaborator) commented on Aug 3, 2025

I remember this one being horribly broken: https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct

Edit: Here's an imatrix with NaNs: https://huggingface.co/legraphista/Qwen2-57B-A14B-Instruct-IMat-GGUF

bartowski1182 (Contributor) commented on Aug 3, 2025

I have been having a horrible time with the latest Qwen3 30B. I've tried 10 MB diverse datasets, even a custom one I made last time that I found REALLY forced diversity; no dice on any of them.

I tried a ton of datasets and they all failed, even on Q5_K.

I'll try to organize them and upload tomorrow for reference, running another test tonight

jukofyork (Collaborator) commented on Aug 4, 2025

What I don't know is if it would be better to use some different value than 1 token

From my understanding, to fix the problems with MoE tensors it "should" not matter, as the weighting factors are applied per 256-element block (or 32 for the legacy quants); as long as all rows are a multiple of 256, any that get no samples during imatrix creation "should" end up with equal weights.

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here:

ikawrakow/ik_llama.cpp#140

weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j])

so using an equal weight here for the MoE tensors with no samples simplifies to:

weight[j] = 1 * sqrtf(sigma2 + xb[j]*xb[j]) = sqrtf(sigma2 + xb[j]*xb[j])

rather than:

weight[j] = xb[j]*xb[j]

where:

const float * xbl = x + QK_K*ibl;                       // start of super-block ibl
float sumx2 = 0;
for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
float sigma2 = 2*sumx2/QK_K;                            // twice the mean squared activation of the block

If using qw[j] = 1 is still causing problems for MoE models, then the problem can only lie here.
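
Putting the pieces together, here's a rough self-contained sketch of what I mean (made-up names, not the exact llama.cpp code):

#include <math.h>

#define QK_K 256

// hypothetical: compute per-element quantization weights for super-block ibl of row x
static void block_weights(const float * x, const float * qw, int ibl, float * weight) {
    const float * xbl = x + QK_K*ibl;
    float sumx2 = 0;
    for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
    const float sigma2 = 2*sumx2/QK_K;
    for (int j = 0; j < QK_K; ++j) {
        const float xx = xbl[j]*xbl[j];
        if (qw) {
            // imatrix case: with qw[j] == 1 (e.g. an expert that saw no samples)
            // this reduces to sqrtf(sigma2 + xx), not the plain xx below
            weight[j] = qw[j]*sqrtf(sigma2 + xx);
        } else {
            // no imatrix at all: the plain unweighted case
            weight[j] = xx;
        }
    }
}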

compilade (Collaborator, Author) commented on Aug 4, 2025

What I don't know is if it would be better to use some different value than 1 token

From my understanding, to fix the problems with MoE tensors - it's "should" not matter as the weighting factors are applied per 256 element block (or 32 for the legacy quants),

@jukofyork
If the row sizes are not multiples of the block sizes, then quantization cannot really happen anyway.

There is only one unique evaluation count per tensor when using MUL_MAT. The counts can only be distinct between 2D matrices when MUL_MAT_ID is used.

(Note that in this context, evaluation counts map to tokens, to avoid depending on a chunk size)

The number of neutral tokens (aka the weight of the neutral prior) makes a difference only when the evaluation count is non-zero but still small enough to be of a similar order of magnitude as the prior weight.

I say "should" in inverted commas because this assumes that with no samples, the behaviour should revert to the unweighted case... But actually this is using the empirically found "mixture inside a square root method" explained here: ikawrakow/ik_llama.cpp#140

Right, this is different. But it's also kind of close to the same. In #12557 I also tried to use that (with sigma2) in the unweighted case, and mostly saw some improvement.
For some types (e.g. Q3_K), it was actually better (or at least not significantly worse) to use 1.0f instead of xb[i] * xb[i] (or variants) in the unweighted case.

To be clear, this PR doesn't change what happens when the evaluation count is 0; that was already changed in #9400 and has the behavior you're describing. That case also only happens with MUL_MAT_ID (for MoE experts), because the counts only exist after the first collection (and so are at least 1 for MUL_MAT).

What this PR changes is what happens when the evaluation count is small, to make the imatrix weights less impactful when the sample size isn't big enough (which often happens with MoE tensors when there are many experts).
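
For a made-up example: if a channel's average squared activation over 4 tokens is 100, then with --prior-weight 1 the stored weight becomes (4*100 + 1)/(4 + 1) ≈ 80.2, noticeably pulled toward neutral, while with 4096 tokens the same prior gives (4096*100 + 1)/4097 ≈ 99.98, which is practically unchanged.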

The difference you're describing is still relevant, though, because it does mean the "neutral prior" is not quite like the unweighted case for some types. But it's not too far from that and the perplexity should be similar, based on what I've seen when working on #12557.
