[Transform] Attention/Cache transforms #436

kylesayrs · 2025-08-26T23:46:45Z

Purpose

Support attention quantization
Support kv cache quantization within huggingface for lm_eval tests

Prerequisites

Must be merged at the same time as [Quantization] Attention/ KV Cache Refactor vllm-project/llm-compressor#1651

Changes

New Classes

Add hookable attention and kvcache implementations which are registered to the attention module as submodules
- QuantizedAttentionImpl injects itself into the model by registering a new attention implementation called ct_hooked_attention and overriding model.config._attn_implementation with the new implementation name
- QuantizedKVCache injects itself into the model by overriding the past_key_values input kwarg to attention, and wrapping the functionality of the original cache
- Calibration and transform hooks can be added to these modules via the hook functions
  - register_query_hook
  - register_key_hook
  - register_value_hook
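
The wrapping pattern described for the cache can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: it uses plain Python classes and lists instead of torch.nn.Module and tensors, and ToyCache, HookedKVCacheSketch, and the method signatures are hypothetical names chosen for the example. Only the hook names register_key_hook and register_value_hook come from the description above.

```python
from typing import Callable, List, Optional, Tuple

# A hook receives a state and may return a replacement (None keeps the state).
Hook = Callable[[list], Optional[list]]


def _run_hooks(hooks: List[Hook], state: list) -> list:
    """Apply each hook in registration order."""
    for hook in hooks:
        result = hook(state)
        if result is not None:
            state = result
    return state


class ToyCache:
    """Stand-in for the original cache object passed as past_key_values."""

    def __init__(self) -> None:
        self.keys: List[list] = []
        self.values: List[list] = []

    def update(self, key: list, value: list) -> Tuple[list, list]:
        self.keys.append(key)
        self.values.append(value)
        return key, value


class HookedKVCacheSketch:
    """Hypothetical wrapper: runs hooks, then defers to the original cache."""

    def __init__(self, original: ToyCache) -> None:
        self.original = original
        self.key_hooks: List[Hook] = []
        self.value_hooks: List[Hook] = []

    def register_key_hook(self, hook: Hook) -> None:
        self.key_hooks.append(hook)

    def register_value_hook(self, hook: Hook) -> None:
        self.value_hooks.append(hook)

    def update(self, key: list, value: list) -> Tuple[list, list]:
        key = _run_hooks(self.key_hooks, key)
        value = _run_hooks(self.value_hooks, value)
        return self.original.update(key, value)


original = ToyCache()
hooked = HookedKVCacheSketch(original)
hooked.register_key_hook(lambda k: [2 * x for x in k])  # toy "transform" hook
seen: List[list] = []
hooked.register_value_hook(lambda v: seen.append(v))  # toy "calibration" observer
key, value = hooked.update([1, 2], [3, 4])
```

A hook that returns a value replaces the state (the transform case), while a hook that returns None only observes it (the calibration case); the original cache still performs the actual storage.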

Quantization Lifecycle Changes

Apply
- The kv_cache_scheme field of the quantization config is now used to call initialize_hooked_kv_cache
- Attention modules can now be targeted, and are used to call initialize_hooked_attention if attention modules are explicitly targeted (see is_narrow_match)
- Remove logic for "merging" kv cache schemes (this doesn't really make any sense, I'm not sure why it was ever included)
Initialize
- Hooked kv cache and attention modules have their quantization parameters initialized by initialize_module_for_quantization
- The presence of attention or kvcache submodules is what determines whether attention or kv cache only quantization is being applied
Serialization
- QuantizationConfig.from_pretrained was cleaned up with additional comments
- The kv_cache_scheme field is added if there are any attention modules with a quantization_scheme attached

Helpers

is_narrow_match is used to check that attention modules are being specifically targeted (rather than targeting all modules in a layer)
get_head_dim is used to get the attention head_dim from a config
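
As a rough illustration of what these two helpers might do, here is a sketch under stated assumptions. The signatures, the fallback to hidden_size // num_attention_heads, and the treatment of targets as plain regular expressions over module names are all assumptions made for the example; the helpers in this PR may differ.

```python
import re
from types import SimpleNamespace
from typing import List


def get_head_dim_sketch(config) -> int:
    """Return the attention head dimension described by a config object.

    Assumption: prefer an explicit head_dim attribute, otherwise derive it
    from hidden_size and num_attention_heads.
    """
    head_dim = getattr(config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    return config.hidden_size // config.num_attention_heads


def is_narrow_match_sketch(name: str, targets: List[str]) -> bool:
    """Return True if some target matches the module but not its parent.

    Assumption: each target is a regular expression over dotted module names.
    A target that also matches the parent is treated as targeting the whole
    layer rather than the attention module specifically.
    """
    parent = name.rpartition(".")[0]
    return any(
        re.match(target, name) is not None and re.match(target, parent) is None
        for target in targets
    )


explicit = SimpleNamespace(head_dim=64, hidden_size=4096, num_attention_heads=32)
derived = SimpleNamespace(hidden_size=2048, num_attention_heads=16)

attn_name = "model.layers.0.self_attn"
narrow = is_narrow_match_sketch(attn_name, [r".*self_attn$"])
broad = is_narrow_match_sketch(attn_name, [r"model\.layers\.0.*"])
```

In this sketch the first pattern matches only the attention module, so it counts as a narrow match; the second also matches the parent layer, so it does not.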

Testing

Added tests for is_narrow_match
Added tests for added attention and kvcache classes

brian-dellabetta

This looks good, though i have a number of questions and minor suggestions

src/compressed_tensors/modeling/attention.py

brian-dellabetta · 2025-08-27T15:26:16Z

src/compressed_tensors/modeling/attention.py

+            # assumes only one model at a time
+            global _original_impl


😬 i don't want to delay things, but we should briefly consider if there are alternative solutions

I spent 20 minutes exploring this, it requires creating specialized _ct_hooked_attention functions and specialized QuantizedAttentionImpl, which is more complexity than value added imho

can _original_impl be registered on the module level (i.e. each self_attn block) instead of setting a global var?

Sure, but in order to register the _original_impl, it needs to be gotten from somewhere.

The first time, you "get" it from model.config. However on subsequent calls, model.config is overridden. This means that in order to "get" the original implementation, you'd have to go find the last Attention module you registered it to, or else store it in some global store.

You could register it to the model module itself or something like that, but I think that that's less reliable than just a global store. If it's functionality you're after, we can turn it into a hash table or something, keyed by model hash.
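
For reference, the keyed-store alternative mentioned above could look something like the sketch below. It is hypothetical and not part of this PR: ToyModel and remember_original_impl are made-up names, and keying a weakref.WeakKeyDictionary by the model object is one possible choice rather than the proposed design. Only the ct_hooked_attention name and the model.config._attn_implementation field come from the PR description.

```python
import weakref
from types import SimpleNamespace

HOOKED_ATTENTION_NAME = "ct_hooked_attention"


class ToyModel:
    """Stand-in for a model with a config (hashable by identity)."""

    def __init__(self, attn_implementation: str) -> None:
        self.config = SimpleNamespace(_attn_implementation=attn_implementation)


# One entry per live model; entries disappear when the model is collected.
_original_impls = weakref.WeakKeyDictionary()


def remember_original_impl(model: ToyModel) -> str:
    """Record the model's original implementation once, then override it."""
    if model not in _original_impls:
        _original_impls[model] = model.config._attn_implementation
    model.config._attn_implementation = HOOKED_ATTENTION_NAME
    return _original_impls[model]


model_a = ToyModel("sdpa")
model_b = ToyModel("eager")
first = remember_original_impl(model_a)
second = remember_original_impl(model_a)  # config is already overridden here
other = remember_original_impl(model_b)
```

Because the lookup happens before the override, a second call on the same model still returns the original name, and two models with different implementations do not interfere with each other.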

src/compressed_tensors/modeling/kvcache.py

dsikka

If the goal is to use this generally for kv_cache and attn quantize, can we move the initialize_hooked_attention and initialize_hooked_kv_cache to initialize.py?

I understand we haven't hooked them in yet for those workflows but I think these belong there.

src/compressed_tensors/modeling/attention.py

dsikka

do a pass through on any missing docstring, otherwise lgtm.
nice work

src/compressed_tensors/modeling/kvcache.py

The base branch was changed.

brian-dellabetta

Following for the most part. A few clarifications, but this makes sense to me

brian-dellabetta · 2025-10-08T21:03:15Z

src/compressed_tensors/modeling/attention.py

+    QuantizedAttentionImpl module which wraps the functionality of the original
+    attention implementation. Unlike the original attention function, this
+    implementation is a `torch.nn.Module` which can be hooked to trigger
+    transforms and calibration hooks.


Does this wrap every single attention block? If so, global _original_impl will be re-set multiple times, though if the same attention function is used throughout the entire model that's probably ok?

We guard against multiple sets using if model.config._attn_implementation != HOOKED_ATTENTION_NAME:
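
A simplified sketch of how such a guard keeps the global from being overwritten when initialization runs once per attention block. The function name and its return value are assumptions for illustration, and the toy model is a plain namespace; only the guard condition, the global _original_impl, and HOOKED_ATTENTION_NAME come from the thread.

```python
from types import SimpleNamespace

HOOKED_ATTENTION_NAME = "ct_hooked_attention"

# assumes only one model at a time
_original_impl = None


def initialize_hooked_attention_sketch(model) -> str:
    """Override the attention implementation once and return the original."""
    global _original_impl
    if model.config._attn_implementation != HOOKED_ATTENTION_NAME:
        # First call: the config still names the original implementation.
        _original_impl = model.config._attn_implementation
        model.config._attn_implementation = HOOKED_ATTENTION_NAME
    # Later calls skip the branch, so the stored value is not overwritten
    # with the hooked implementation's own name.
    return _original_impl


model = SimpleNamespace(config=SimpleNamespace(_attn_implementation="sdpa"))

# Called once per attention block, e.g. three blocks in this toy model.
results = [initialize_hooked_attention_sketch(model) for _ in range(3)]
```

Without the guard, the second call would read ct_hooked_attention from the config and store it as the "original" implementation.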

brian-dellabetta · 2025-10-08T21:03:44Z

src/compressed_tensors/modeling/attention.py

+            # assumes only one model at a time
+            global _original_impl


can _original_impl be registered on the module level (i.e. each self_attn block) instead of setting a global var?

The base branch was changed.

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

kylesayrs mentioned this pull request Aug 27, 2025

[Transform] Spinquant R3 vllm-project/llm-compressor#1778

Open

brian-dellabetta previously approved these changes Aug 27, 2025

dsikka reviewed Aug 28, 2025

kylesayrs force-pushed the kylesayrs/r3-only branch from 7bf4b57 to 75056bf Compare August 28, 2025 21:09

dsikka previously approved these changes Sep 2, 2025

Base automatically changed from kylesayrs/transform-simplify-key to main September 8, 2025 18:46

kylesayrs mentioned this pull request Oct 7, 2025

[Attention] Attention head quantization strategy #481

Merged

kylesayrs force-pushed the kylesayrs/r3-only branch 2 times, most recently from e224a5d to 05ec17e Compare October 8, 2025 19:20

kylesayrs changed the base branch from main to kylesayrs/add-attn-head-strat October 8, 2025 19:20

brian-dellabetta previously approved these changes Oct 8, 2025

kylesayrs marked this pull request as draft October 8, 2025 21:06

kylesayrs force-pushed the kylesayrs/add-attn-head-strat branch from d084c5e to e3f24d4 Compare October 9, 2025 14:19

kylesayrs mentioned this pull request Oct 9, 2025

[Transform] [Attention] [KV Cache] Support KV-cache integrated attention transform and quantization #428

Closed

kylesayrs changed the base branch from kylesayrs/add-attn-head-strat to main October 9, 2025 18:14

kylesayrs changed the base branch from main to kylesayrs/add-attn-head-strat October 9, 2025 18:15

kylesayrs mentioned this pull request Oct 9, 2025

[Quantization] Attention/ KV Cache Refactor vllm-project/llm-compressor#1651

Draft

kylesayrs force-pushed the kylesayrs/r3-only branch from 145c9aa to 2efe3db Compare October 9, 2025 18:35

Base automatically changed from kylesayrs/add-attn-head-strat to main October 9, 2025 20:11

kylesayrs added 7 commits October 9, 2025 16:13

refactor

fca6664

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

reduce diff

5b0df0a

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

reduce diff

ea53a95

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

add tests

bfec90d

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

remove attn head

14f092e

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

simplify

b8090c7

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

refactor

df792ea

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

kylesayrs added 16 commits October 9, 2025 16:14

reduce diff

dea774a

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

reduce diff

d274ca5

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

fix shapes

ac75f85

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

fix shapes

db894dc

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

revert

0903779

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

add tests

5bff877

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

simplify

d7f8470

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

add kv and attention

5862b1e

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

WIP

ac65b6f

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

remove transform stuff

a2b1684

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

apply

935f0e3

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

do not rely on hash

f46dadf

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

is_narrow_match tests

5229f86

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

add tests

fafccf3

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

fix kv cache apply

2aead0e

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

add more tests

04f716a

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

kylesayrs force-pushed the kylesayrs/r3-only branch from 7c19358 to 04f716a Compare October 9, 2025 20:16

fix style

5eac577

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

kylesayrs mentioned this pull request Oct 12, 2025

[Transforms] Use get_head_dim util vllm-project/llm-compressor#1918

Open
