docs/tuning_guide.md
You may need to tune a kernel before it achieves the best performance.
---
### MoE layer 0 (AllGather + Scatter + GroupGEMM)
To enable tuning for the test demo, you only need to set the `--tune` flag.
```bash
# Illustrative command only: replace the launcher and script path with the
# actual AG+Scatter test demo in your checkout. Only the --tune flag itself
# is prescribed by this guide.
torchrun --nproc_per_node=8 test/test_moe_ag_scatter.py --tune
```
Once tuning completes, the best configuration is emitted as a C++ static registration, abbreviated below (the elided body contains the registration calls produced by the tuner):

```cpp
static int config_ag_scatter_sm90 = []() {
  // ... tuned configuration registration emitted by the tuner ...
}();
```
The search space for tuning is defined in `src/generator`. For the MoE layer 0 kernel, the search space is defined in `src/generator/gen_moe_ag_scatter.cc`; the GEMM tile-size search space, for example, is defined as `cute::make_tuple(Shape<Auto, _256, Auto>{}, Shape<Auto, _128, Auto>{})` at #L88. Modify this code and recompile Flux if you want to enlarge the search space.
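As a rough sketch of what enlarging that tuple could look like (the variable name `space_tile_shape` and the extra `_64` candidate are illustrative, not the actual contents of `gen_moe_ag_scatter.cc`):

```cpp
// Sketch: add one more candidate tile shape to the search space.
// Auto dimensions are enumerated by the generator; fixed values
// (_256, _128, _64) pin that dimension of the GEMM tile.
auto space_tile_shape = cute::make_tuple(
    Shape<Auto, _256, Auto>{},  // existing candidate
    Shape<Auto, _128, Auto>{},  // existing candidate
    Shape<Auto, _64, Auto>{});  // illustrative extra candidate
```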
The configuration space for MoE layer 1 is defined in `src/generator/gen_moe_gather_rs.cc`. You may notice that only three kernels were profiled in the case above; that is because the search space contains only three kernels qualified for the test demo's configuration, as defined at #L90-92 of `src/generator/gen_moe_gather_rs.cc`. The first value passed to `make_gather_rs_hparams` is the number of thread blocks specialized for communication, and the second is the size of the hidden dimension. Make sure at least one hparams entry is registered here for the MoE layer 1 shape you want.
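For example, a new entry might be registered along these lines (a sketch only: the meaning of the two arguments comes from this guide, while the variable name and the values are hypothetical):

```cpp
// Sketch: register an additional gather-RS hparams entry.
// Arg 1: thread blocks specialized for communication (hypothetical value).
// Arg 2: hidden dimension this entry applies to (hypothetical value).
auto extra_hparams = make_gather_rs_hparams(/*comm blocks*/ 8, /*hidden dim*/ 6144);
```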