feat: update cubins for dynamic scale factor interface and one-copy kvcache on trtllm-gen mla #1248

yyihuang · 2025-07-14T08:34:37Z

📌 Description

Update the static scaling factor interface to take:
q_scale, k_scale, v_scale, sm_scale, o_scale.

🔍 Related Issues

TODOs in #1222, last comment.

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing (unittest, etc.).

Reviewer Notes

gemini-code-assist

Summary of Changes

Hello @yyihuang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a more explicit and flexible interface for handling scaling factors within the trtllm-gen mla kernels. By replacing a single scale parameter with a set of five specific scales (for query, key, value, softmax, and output), the changes enable finer-grained control over quantization, particularly in preparation for future FP8 quantization support. The modifications span both the Python API and the underlying C++ CUDA kernels, ensuring consistent behavior and improved extensibility for advanced quantization schemes.

Highlights

API Interface Update: The trtllm_batch_decode_with_kv_cache_mla function in flashinfer/decode.py and its corresponding C++ kernel launcher (trtllm_paged_attention_mla_launcher in csrc/trtllm_mla_kernel_launcher.cu) have been refactored to replace a single generic scale parameter with five distinct, more granular scaling factors: q_scale, k_scale, v_scale, sm_scale, and o_scale.
Dynamic Scale Factor Derivation: The bmm1_scale and bmm2_scale values, which are consumed by the C++ kernel, are now dynamically computed within the Python wrapper (flashinfer/decode.py) based on the newly introduced granular input scales. Specifically, bmm1_scale is derived from q_scale, k_scale, and sm_scale (and the head dimension), while bmm2_scale is derived from v_scale and o_scale.
Internal Kernel Logic Adjustment: In the C++ kernel (csrc/trtllm_mla_kernel_launcher.cu), the runner_params.mScaleQ (previously controlled by the generic scale parameter) is now hardcoded to 1.0. The runner_params.scaleSoftmaxLog2 calculation has been updated to bmm1_scale * std::sqrt((float)(runner_params.mHeadDimQk)), aligning with the new scale derivation logic from the Python frontend.
Test Suite Adaptation: The test cases in tests/test_trtllm_gen_decode.py have been updated to reflect the new function signature and parameter passing for trtllm_batch_decode_with_kv_cache_mla, ensuring the correctness and compatibility of the changes.
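The scale derivation described in the highlights can be sketched in plain Python. This is a minimal illustration, not the code from flashinfer/decode.py: the helper names `derive_bmm_scales` and `launcher_softmax_scale` are hypothetical, and the exact way the five scales are combined (a product normalized by the square root of the head dimension for bmm1_scale, a plain product for bmm2_scale) is an assumption. The summary only states which inputs each value depends on.

```python
import math


def derive_bmm_scales(q_scale, k_scale, v_scale, sm_scale, o_scale, head_dim_qk):
    """Fold five user-facing scales into two kernel-facing scales.

    Hypothetical helper; the formulas below are assumptions for illustration.
    """
    # ASSUMPTION: bmm1_scale folds the Q and K scales and the softmax scale
    # together, normalized by sqrt(head_dim_qk).
    bmm1_scale = q_scale * k_scale * sm_scale / math.sqrt(head_dim_qk)
    # ASSUMPTION: bmm2_scale combines the V and output scales as a product.
    bmm2_scale = v_scale * o_scale
    return bmm1_scale, bmm2_scale


def launcher_softmax_scale(bmm1_scale, head_dim_qk):
    """Mirror the launcher expression quoted in the summary:
    bmm1_scale * std::sqrt((float)(runner_params.mHeadDimQk)).
    """
    return bmm1_scale * math.sqrt(head_dim_qk)


if __name__ == "__main__":
    # Example values chosen only for illustration; 576 is the head dimension
    # suggested by the "MLA_576" commit message.
    bmm1, bmm2 = derive_bmm_scales(
        q_scale=2.0, k_scale=0.5, v_scale=0.25, sm_scale=1.0, o_scale=4.0,
        head_dim_qk=576,
    )
    print(bmm1, bmm2, launcher_softmax_scale(bmm1, 576))
```

Under these assumptions, multiplying bmm1_scale by the square root of the head dimension in the launcher recovers the product q_scale * k_scale * sm_scale, which is one way to read the statement that the launcher expression aligns with the Python-side derivation.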

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request refactors the scaling factor interface for the TRT-LLM MLA kernels, moving from a single scale parameter to a more detailed set of quantization and model scales (q_scale, k_scale, v_scale, sm_scale, o_scale). This improves clarity and flexibility. The changes are mostly in the C++ kernel launcher and the Python interface.

I've found two high-severity issues that should be addressed:

In csrc/trtllm_mla_kernel_launcher.cu, there's a bug in the calculation of the softmax scaling factor, which uses an incorrect head dimension.
In flashinfer/decode.py, there's an incorrect shape and dtype check for the pre-allocated output tensor, which would cause runtime errors.

csrc/trtllm_mla_kernel_launcher.cu

flashinfer/decode.py

## 📌 Description

Enable deepseek MTP with q>1. Recommend small batch size and small MTP = 2 / 3. Update tests to reduce test time. Draft dynamic scale factor interface of mla (part1, part2 - cubin loading & fmha in #1248) --ready to merge now.

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

---------

Co-authored-by: averyhuang <averyh@nvidia.com>

yyihuang · 2025-07-29T00:26:32Z

Closing this PR to avoid rebasing onto the latest refactor. Moved to #1342.

yyihuang and others added 30 commits July 5, 2025 23:38

init

ba37382

upd

8e99626

Merge branch 'main' of github.com:flashinfer-ai/flashinfer into trtll…

0aff87e

…m-mla-cubin

upd params list

d1ce702

Merge branch 'main' of github.com:flashinfer-ai/flashinfer into trtll…

b2aeed0

…m-mla-cubin

upd launcher params

22c1082

add pagesInMemPool calculation

17434e9

disable scale factor

0998b6e

upd jit

cfbd418

upd python interface

1e11ce4

add default scale

1fe129f

upd type dispatch

702193b

upd build, draft test

bb024ce

fix q head_dim

4ec35b4

add cubin hash

d0d4a95

upd MLA_576 meta info

af87094

upd SHA id

de1955a

add bmm1_scale and bmm2_scale

9e4cf50

fix illegal memory access

bec5a45

draft test

1f1c4b4

upd test

ce5a0a2

upd test draft

7ca8bdc

upd kvcache

1662a1b

add draft test

e9cc634

fix

bb410c9

checkpoint: fix pgsz 16, fp8, duplicate cache

08e61f8

upd tests: temp disable pgsz 16

dca7c57

upd todo

1fe4471

upd

f40605c

Merge branch 'main' of github.com:flashinfer-ai/flashinfer into trtll…

f280efc

…m-mla-cubin

gemini-code-assist bot reviewed Jul 14, 2025

csrc/trtllm_mla_kernel_launcher.cu (outdated, resolved)

flashinfer/decode.py (outdated, resolved)

fix out shape

7ed0804

yyihuang mentioned this pull request Jul 14, 2025

feat: add trtllm-gen mla cubin #1222

Merged

5 tasks

averyhNV added 15 commits July 14, 2025 18:30

add dynamic scale factor interface

6c12caa

draft mtp interface

78fa630

minor upd

da981d1

upd maxseqlenQ

80f753a

add bs test

3448b86

init

bc4772e

upd

0af76c4

upd

138ac4a

fix q=1

3498733

add debug v_means (to be removed)

7c98f13

update fmha and mla scale interface

ee62521

remove debug print at q=0

8460205

upd test

e6818e8

Merge branch 'mtp-trtllm-gen-attn' of github.com:yyihuang/flashinfer …

117c64a

…into trtllm-mla-cubin

upd

bd4dd71

yyihuang mentioned this pull request Jul 16, 2025

feat: enable trtllm-gen mla MTP #1258

Merged

5 tasks

yyihuang changed the title ~~feat: update scale factor interface for trtllm-gen mla kernels.~~ feat: update cubins for dynamic scale factor interface and one-copy kvcache on trtllm-gen mla Jul 16, 2025

averyhNV added 3 commits July 16, 2025 13:20

Merge branch 'main' of github.com:flashinfer-ai/flashinfer into trtll…

98647c7

…m-mla-cubin

resume cubin loader

c901a8f

add two cubin loaders draft

a6f2100

yyihuang marked this pull request as draft July 17, 2025 23:34

yyihuang mentioned this pull request Jul 21, 2025

refactor: refactor trtllm-gen attention kernel integration code #1289

Merged

5 tasks

yyihuang mentioned this pull request Jul 29, 2025

feat: support dynamic scale factor interface for trtllm-gen attn #1342

Closed

5 tasks

yyihuang closed this Jul 29, 2025
