MXFP4 quants #2188
base: master
Conversation
Testing has begun. The goal is to find the optimal temperature and sampling values for running MXFP4, so that our users don't have to. We run the model with the same prompt and seed, varying only the temperature, and repeat this for a range of questions and scenarios. To spare readers the tedium, here is a single execution with some manual analysis; imagine this exercise being carried out for many hours. Prompt:
Expected answer:
Raw output from the script:
Converted to markdown table with qwen3-coder.
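The sweep script itself isn't included in the thread; below is a minimal sketch of this kind of harness, assuming the mlx-lm Python API (load, generate, make_sampler) and the CycloneDX/cdx1-pro-mlx-MXFP4 repo linked further down. Keyword names shift a little between mlx-lm releases, so treat it as a sketch rather than the actual script.

```python
# Fixed prompt and seed, temperature varied - a sketch, not the actual script.
import mlx.core as mx
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

MODEL = "CycloneDX/cdx1-pro-mlx-MXFP4"
PROMPT = "..."  # the test question (elided above)
TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

model, tokenizer = load(MODEL)

for temp in TEMPERATURES:
    mx.random.seed(42)                 # same seed for every run
    sampler = make_sampler(temp=temp)  # only the temperature changes
    answer = generate(model, tokenizer, prompt=PROMPT,
                      max_tokens=512, sampler=sampler)
    print(f"temperature={temp}\n{answer}\n")
```

make_sampler also accepts top_p and min_p, so the other sampling values mentioned above can be swept the same way.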
Some results:
Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>
Force-pushed from 18c938b to f3b780e
The quantization mode in our config.json is correctly shown.
GLM-4.5 analysis
Analysis of LLM Configurations
Configuration 1: cdx1-pro-mlx-MXFP4 (Qwen3 MoE Model)
Architecture
MoE Configuration
Context & Positional Encoding
Quantization
Other Features
Configuration 2: gpt-oss-20b
Architecture
MoE Configuration
Context & Positional Encoding
Quantization
Other Features
Key Differences
The cdx1-pro model appears optimized for very long context understanding with its deep architecture and large context window, while gpt-oss-20b focuses on computational efficiency with its selective quantization and sliding window attention.
The results aren't that impressive. The model continues to hallucinate on logic tests even at temperature 0.
Evaluation Results
Model:
Questions with Score 0
If OpenAI can do it, so can we.
cdx1 models are now available in MXFP4 quants in both MLX and GGUF formats.
The quantization attribute in the config file should give an idea of its complexity compared to legacy quants.
Limitations
cdx1 uses unsloth/Qwen2.5-Coder-14B-Instruct as the base model and doesn't benefit from this hybrid quant. We lack the compute for DWQ quants, so we are a bit stuck.
Benchmarks
TBD
Screenshots (cdx1-pro)
GLM-4.5 analysis of the configurations
gpt-oss-20b - https://huggingface.co/openai/gpt-oss-20b/raw/main/config.json
cdx1-pro-mlx-MXFP4 - https://huggingface.co/CycloneDX/cdx1-pro-mlx-MXFP4/raw/main/config.json
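To compare just the quantization blocks side by side, the two configs above can be pulled directly; a minimal sketch using only the standard library (and assuming the files remain publicly readable):

```python
import json
import urllib.request

CONFIGS = {
    "gpt-oss-20b": "https://huggingface.co/openai/gpt-oss-20b/raw/main/config.json",
    "cdx1-pro-mlx-MXFP4": "https://huggingface.co/CycloneDX/cdx1-pro-mlx-MXFP4/raw/main/config.json",
}

for name, url in CONFIGS.items():
    with urllib.request.urlopen(url) as resp:
        config = json.load(resp)
    print(f"== {name} ==")
    # Print any top-level key that mentions quantization, whatever its exact name.
    for key, value in config.items():
        if "quant" in key.lower():
            print(f"{key}: {json.dumps(value, indent=2)}")
```

Matching on any key containing "quant" keeps the script agnostic to whether a file uses a quantization or a quantization_config block.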
Analysis of Two LLM Model Configurations
I'll analyze the differences between these two models and provide insights on their efficiency for different purposes.
Model 1: GptOssForCausalLM
Architecture & Size
Attention Mechanism
Context Handling
Quantization
Model 2: Qwen3MoeForCausalLM
Architecture & Size
Attention Mechanism
Context Handling
Quantization
Key Differences
Model Structure: Model 1 has fewer layers (24) but larger hidden size (2880), while Model 2 has more layers (48) but smaller hidden size (2048).
Attention Mechanism: Model 1 uses sliding attention which can be more efficient for long sequences, while Model 2 uses QK normalization which may improve stability.
MoE Configuration: Model 2 has significantly more experts (128 vs 32) and uses more experts per token (8 vs 4); a quick comparison of the activation ratios follows this list.
Context Length: Model 2 supports a larger context length (262,144 vs 131,072).
Quantization: Different approaches - Model 1 uses MXFP4 with exclusions, while Model 2 uses 4-bit with special gate handling.
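A back-of-the-envelope view of the MoE and context figures above (only the expert counts, routing width, and context lengths quoted in this list are used; all other architecture details are ignored):

```python
# Figures quoted in the key-differences list above.
models = {
    "Model 1 (GptOssForCausalLM)": {"experts": 32, "active": 4, "context": 131_072},
    "Model 2 (Qwen3MoeForCausalLM)": {"experts": 128, "active": 8, "context": 262_144},
}

for name, m in models.items():
    ratio = m["active"] / m["experts"]
    print(f"{name}: {m['active']}/{m['experts']} experts active per token "
          f"({ratio:.2%}), max context {m['context']:,} tokens")
```

Model 1 routes each token through 12.5% of its experts while Model 2 routes through only 6.25%, which is why the larger expert pool does not translate directly into a proportionally larger per-token compute cost.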
Efficiency Analysis
Training Efficiency
Model 2 (Qwen3MoeForCausalLM) is likely more efficient for training because:
Inference Efficiency
Model 1 (GptOssForCausalLM) would likely be more efficient for inference because:
Fine-tuning Efficiency
Model 1 (GptOssForCausalLM) would likely be more efficient for fine-tuning because:
However, if the fine-tuning task requires significant adaptation, Model 2 might be preferable due to its larger number of experts, which could allow for more specialized adaptation.
Conclusion
The choice ultimately depends on your specific use case, available computational resources, and the nature of your tasks.