Improve Gemma3n model and tests #39764
Changes from all commits
Docstring changes in `Gemma3nTextConfig`:

````diff
@@ -156,12 +156,13 @@ class Gemma3nTextConfig(PretrainedConfig):
             The number of layer that share KV cache values. During the forward pass, the last `num_kv_shared_layers`
             layers in the model "share" the KV values in that each local and global layer in this range uses the KV
             cache values computed for the last local or global layer, respectively, before entering this range. The
-            value should be `num_kv_shared_layers` should be a scalar of `sliding_window_pattern`.
+            value should be a multiple of the attention pattern size (see `layer_types` parameter).
         laurel_rank (int, *optional*, defaults to 64):
             The intermediate size for the linear projections in the Learned Augmented Residual Layer.
-        activation_sparsity_pattern (Sequence[float], *optional*, defaults to `(0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)`):
+        activation_sparsity_pattern (Sequence[float], *optional*):
             The sparsity factor used to extract the top-k activations for a given layer. The provided Sequence must
-            explicitly provide a sparsity value for each layer in the model.
+            explicitly provide a sparsity value for each layer in the model. By default, the first 10 layers are
+            sparse with a sparsity factor of 0.95 and the rest are dense.

     ```python
     >>> from transformers import Gemma3nTextModel, Gemma3nTextConfig
````
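The docstring above states that `num_kv_shared_layers` should be a multiple of the attention pattern size. As a minimal sketch of that constraint, a hypothetical validator (not part of transformers, and the pattern size of 5 below is an illustrative assumption, not taken from the model) could look like this:

```python
def check_num_kv_shared_layers(num_kv_shared_layers: int, attention_pattern_size: int) -> None:
    # Hypothetical helper illustrating the documented constraint: the number of
    # KV-cache-sharing layers should be a multiple of the repeating
    # local/global attention pattern size.
    if attention_pattern_size <= 0:
        raise ValueError("attention_pattern_size must be positive")
    if num_kv_shared_layers % attention_pattern_size != 0:
        raise ValueError(
            f"num_kv_shared_layers ({num_kv_shared_layers}) must be a multiple "
            f"of the attention pattern size ({attention_pattern_size})"
        )


# Illustrative values: the default num_kv_shared_layers is 15; an assumed
# pattern size of 5 divides it evenly, so this passes silently.
check_num_kv_shared_layers(15, 5)
```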
`__init__` signature change:

```diff
@@ -227,7 +228,7 @@ def __init__(
         altup_num_inputs: int = 4,
         num_kv_shared_layers: int = 15,
         laurel_rank: int = 64,
-        activation_sparsity_pattern: Optional[Union[float, Sequence[float]]] = (0.95,) * 10 + (0.0,) * 25,
+        activation_sparsity_pattern: Optional[Union[float, Sequence[float]]] = None,
         **kwargs,
     ):
         super().__init__(
```
Default resolution logic:

```diff
@@ -289,7 +290,10 @@ def __init__(
         self.laurel_rank = laurel_rank

         if activation_sparsity_pattern is None:
-            activation_sparsity_pattern = [0.0] * num_hidden_layers
+            num_sparse_layers = 10 if num_hidden_layers > 10 else 0
+            activation_sparsity_pattern = (0.95,) * num_sparse_layers + (0.0,) * (
+                num_hidden_layers - num_sparse_layers
+            )

         if (len_asp := len(activation_sparsity_pattern)) != num_hidden_layers:
             raise ValueError(
```

Comment on lines 292 to +296:

> Having the number of layers hardcoded is no good; the code crashes when instantiating a model with a different number of layers. The `None` default is therefore used. There is no danger in deleting the previous default, as no model on the Hub relied on it; see the linked discussion.
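The new default resolution and the length check above can be exercised as a standalone sketch, independent of transformers (the function name below is ours, not part of the library):

```python
def resolve_activation_sparsity_pattern(activation_sparsity_pattern, num_hidden_layers):
    # Mirrors the PR's logic: when no pattern is given, the first 10 layers are
    # sparse (0.95) and the rest dense (0.0); models with 10 or fewer layers
    # get an all-dense pattern, avoiding the crash that the hardcoded 35-entry
    # default caused for smaller models.
    if activation_sparsity_pattern is None:
        num_sparse_layers = 10 if num_hidden_layers > 10 else 0
        activation_sparsity_pattern = (0.95,) * num_sparse_layers + (0.0,) * (
            num_hidden_layers - num_sparse_layers
        )
    # Explicit patterns must still cover every layer exactly.
    if (len_asp := len(activation_sparsity_pattern)) != num_hidden_layers:
        raise ValueError(
            f"activation_sparsity_pattern must have {num_hidden_layers} elements, got {len_asp}"
        )
    return activation_sparsity_pattern


# Full 35-layer text model: 10 sparse layers followed by 25 dense ones,
# matching the previous hardcoded default.
print(resolve_activation_sparsity_pattern(None, 35))
# A tiny 4-layer test model no longer crashes: it gets an all-dense pattern.
print(resolve_activation_sparsity_pattern(None, 4))  # → (0.0, 0.0, 0.0, 0.0)
```

The `num_hidden_layers > 10` guard is what makes small test configurations work: rather than truncating the sparse prefix, the PR falls back to a fully dense pattern.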