[FA] Cleanup loading logic #41427
base: main
Conversation
if hasattr(kernel, "flash_attn_varlen_func"):
    if attention_wrapper is None:
        attention_wrapper = flash_attention_forward
    kernel_function = partial(attention_wrapper, implementation=kernel)
This currently didn't have any effect; we only used the forced loading to get the correct implementation.
no this is super important for paged_attention wrappers 😓
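For reference, a minimal, self-contained sketch of the binding pattern this thread is discussing; every name below (FakeKernel, attention_wrapper, kernel_function) is an illustrative stand-in, not the actual transformers internals.

from functools import partial

class FakeKernel:
    """Illustrative stand-in for a hub kernel module exposing flash_attn_varlen_func."""
    @staticmethod
    def flash_attn_varlen_func(q, k, v):
        return "varlen attention output"

def attention_wrapper(q, k, v, implementation=None):
    # The wrapper only chooses which varlen function to call; the actual
    # kernel lives in the bound `implementation` module.
    return implementation.flash_attn_varlen_func(q, k, v)

kernel = FakeKernel()
if hasattr(kernel, "flash_attn_varlen_func"):
    kernel_function = partial(attention_wrapper, implementation=kernel)

print(kernel_function(None, None, None))  # -> "varlen attention output"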
Changed the logic so that CB (continuous batching) can now use all FA versions with lazy loading, i.e. FA2, FA3, and the kernels-hub FAs.
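Roughly, the lazy-loading idea reads like the sketch below. The function name, the implementation strings, and the FA3 module name are assumptions for illustration, not the exact transformers code.

import importlib

_loaded_impl = None
_loaded_fn = None

def lazy_load_flash_varlen(attn_implementation: str):
    """Resolve flash_attn_varlen_func for the requested backend on first use."""
    global _loaded_impl, _loaded_fn
    if _loaded_impl == attn_implementation:
        return _loaded_fn  # already imported, reuse the cached callable

    if attn_implementation == "flash_attention_2":
        # FA2 ships the varlen kernel in the flash_attn package.
        module = importlib.import_module("flash_attn")
    elif attn_implementation == "flash_attention_3":
        # FA3 module name is an assumption here; adjust to your install.
        module = importlib.import_module("flash_attn_interface")
    else:
        # A kernels-hub implementation would be fetched and wrapped here instead.
        raise NotImplementedError(f"unknown implementation: {attn_implementation}")

    _loaded_impl = attn_implementation
    _loaded_fn = module.flash_attn_varlen_func
    return _loaded_fn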
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
if attention_wrapper is None:
    attention_wrapper = flash_attention_forward
kernel_function = partial(attention_wrapper, implementation=kernel)
lazy_import_flash_attention(kernel, force_import=True)
Lazy import only happens at init or at explicit setting time:
transformers/src/transformers/modeling_utils.py, lines 2580 to 2582 in 766ac18:

# preload flash attention here to allow compile with fullgraph
if "flash" in applicable_attn_implementation:
    lazy_import_flash_attention(applicable_attn_implementation)
transformers/src/transformers/generation/continuous_batching/continuous_api.py, lines 614 to 618 in ced85ca:

# lazy loading flash attention including kernel variations
if "flash" in attn_implementation:
    from ...modeling_flash_attention_utils import lazy_import_paged_flash_attention

    lazy_import_paged_flash_attention(attn_implementation)
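For context, the init-time path in modeling_utils.py above is the one taken when a flash implementation is requested at load time, e.g. as in the sketch below (the model id is just an example):

from transformers import AutoModelForCausalLM

# Requesting a flash backend at load time means "flash" is in the
# implementation name, so the preload above fires during init and
# torch.compile with fullgraph does not hit a lazy import later.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    attn_implementation="flash_attention_2",
)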
Otherwise, it happens when users only set the attention implementation within the config, i.e. when the interface detects an implementation name different from the currently loaded one:
transformers/src/transformers/modeling_flash_attention_utils.py, lines 143 to 144 in 766ac18:

if implementation is not None and _loaded_implementation != implementation:
    _loaded_implementation = implementation
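In other words, the guard behaves like this small sketch (names illustrative): the implementation name is passed on every call, but a (re)load only happens when it differs from what is already loaded.

_loaded_implementation = None

def maybe_reload(implementation):
    global _loaded_implementation
    if implementation is not None and _loaded_implementation != implementation:
        _loaded_implementation = implementation
        print(f"(re)loading flash attention backend: {implementation}")

maybe_reload("flash_attention_2")  # first call: loads FA2
maybe_reload("flash_attention_2")  # no-op, name unchanged
maybe_reload("flash_attention_3")  # config changed: loads FA3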
make sure you test the continuous_batching_simple script with fa2!
@ArthurZucker updated the logic a bit to include CB now as well. The loaded attention now properly depends on the config, like in the base FA version, and we can now use FA3 for CB as well.
If we do:
- partial(attention_wrapper, implementation=kernel)
in that case we should remove this arg from all the xxx_paged wrappers.
Seems like the
We currently ignore anything passed through the interface and force the correct loading at attention-setting time. This change makes it more flexible: power users can switch implementations just by setting them in the config.
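As a hedged usage sketch of that flow (the model id is just an example; _attn_implementation is the private attribute transformers configs use to track the selected backend, and whether editing it alone is enough depends on this PR's logic):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    attn_implementation="flash_attention_2",
)

# A power user switches the backend by editing the config; on the next
# forward pass the interface sees a name that differs from the loaded
# one and lazily imports the new implementation.
model.config._attn_implementation = "flash_attention_3"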