Commit 9e4902d: Add How to Benchmark a Single KleidiAI Micro-kernel in ExecuTorch
(parent a80bb77; 13 files changed, +1143 −0)
---
title: Environment setup
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### Python Environment Setup

Before building ExecuTorch, it is highly recommended to create an isolated Python environment. This prevents dependency conflicts with your system Python installation and ensures a clean build environment.

```bash
cd $WORKSPACE
python3 -m venv pyenv
source pyenv/bin/activate
```

All subsequent steps should be executed within this Python virtual environment.
### Download the ExecuTorch Source Code

Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched.

```bash
cd $WORKSPACE
git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.git
```

> **Note:**
> The instructions in this guide are based on **ExecuTorch v1.0.0**.
> Commands or configuration options may differ in later releases.
### Build and Install the ExecuTorch Python Components

Next, build the Python bindings and install them into your environment. The following command uses the provided installation script to configure, compile, and install ExecuTorch with developer tools enabled.

```bash
cd $WORKSPACE/executorch
CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh
```

This will build ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector.

After installation completes successfully, you can verify the environment by running:

```bash
python -c "import executorch; print('ExecuTorch built and installed successfully.')"
```
---
title: Cross-Compile ExecuTorch for the AArch64 platform
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

This section describes how to cross-compile ExecuTorch for an AArch64 target platform with XNNPACK and KleidiAI support enabled. All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (e.g., aarch64-linux-gnu-gcc).
### Run CMake Configuration

Use CMake to configure the ExecuTorch build for AArch64. The example below enables key extensions, developer tools, and XNNPACK with KleidiAI acceleration:

```bash
cd $WORKSPACE
mkdir -p build-arm64
cd build-arm64

cmake -GNinja \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_SYSTEM_NAME=Linux \
  -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
  -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
  -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
  -DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
  -DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=BOTH \
  -DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=ONLY \
  -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
  -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
  -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
  -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
  -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
  -DEXECUTORCH_BUILD_XNNPACK=ON \
  -DEXECUTORCH_BUILD_DEVTOOLS=ON \
  -DEXECUTORCH_ENABLE_EVENT_TRACER=ON \
  -DEXECUTORCH_ENABLE_LOGGING=ON \
  -DEXECUTORCH_LOG_LEVEL=debug \
  -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \
  ../executorch
```
#### Key Build Options

| **CMake Option** | **Description** |
| ---------------- | --------------- |
| `EXECUTORCH_BUILD_XNNPACK` | Builds the **XNNPACK backend**, which provides highly optimized CPU operators (GEMM, convolution, etc.) for Arm64 platforms. |
| `EXECUTORCH_XNNPACK_ENABLE_KLEIDI` | Enables **Arm KleidiAI** acceleration for XNNPACK kernels, providing further performance improvements on Armv8.2+ CPUs. |
| `EXECUTORCH_BUILD_DEVTOOLS` | Builds **developer tools** such as the ExecuTorch Inspector and diagnostic utilities for profiling and debugging. |
| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Builds the **Module API** extension, which provides a high-level abstraction for model loading and execution using `Module` objects. |
| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | Builds the **Tensor API** extension, providing convenience functions for creating, manipulating, and managing tensors in the C++ runtime. |
| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Enables building **optimized kernel implementations** for better performance on supported architectures. |
| `EXECUTORCH_ENABLE_EVENT_TRACER` | Enables the **event tracing** feature, which records performance and operator timing information for runtime analysis. |
### Build ExecuTorch

```bash
cmake --build . -j$(nproc)
```

If the build completes successfully, you will find the `executor_runner` binary at:

```bash
build-arm64/executor_runner
```

This binary can be used to run ExecuTorch models on the AArch64 target device using the XNNPACK backend with KleidiAI acceleration.
---
title: KleidiAI micro-kernels support in ExecuTorch
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

ExecuTorch uses XNNPACK as its primary CPU backend for operator execution and performance optimization.

Within this architecture, only a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.

These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models.

When an operator matches one of the supported configurations, ExecuTorch automatically dispatches it through the KleidiAI-optimized path. Operators that are not covered by KleidiAI fall back to the standard XNNPACK implementations during inference, ensuring functional correctness across all models.

In ExecuTorch v1.0.0, the following operator types are implemented through the XNNPACK backend and can potentially benefit from KleidiAI acceleration:
- XNNFullyConnected – Fully connected (dense) layers
- XNNConv2d – Standard 2D convolution layers
- XNNBatchMatrixMultiply – Batched matrix multiplication operations

However, not all instances of these operators are accelerated by KleidiAI. Acceleration eligibility depends on several operator attributes and backend support, including:

- Data types (e.g., float32, int8, int4)
- Quantization schemes (e.g., symmetric/asymmetric, per-tensor/per-channel)
- Tensor memory layout and alignment
- Kernel dimensions and stride settings
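To make the eligibility rules above concrete, the dispatch decision can be pictured as a lookup keyed on these operator attributes. The snippet below is an illustrative Python sketch only, not the actual XNNPACK selection logic; the variant names are taken from the tables that follow.

```python
# Illustrative only: maps (operator, activation dtype, weight scheme) to the
# XNNPACK GEMM variant that KleidiAI can accelerate. Anything not listed
# falls back to the generic XNNPACK implementation.
KLEIDIAI_VARIANTS = {
    ("XNNFullyConnected", "fp32", "fp32"): "pf32_gemm",
    ("XNNFullyConnected", "fp16", "fp16"): "pf16_gemm",
    ("XNNFullyConnected", "int8-asym-per-row", "int8-sym-per-channel"): "qp8_f32_qc8w_gemm",
    ("XNNFullyConnected", "int8-asym-per-row", "int4-blockwise"): "qp8_f32_qb4w_gemm",
    ("XNNBatchMatrixMultiply", "fp32", "fp32"): "pf32_gemm",
}

def dispatch(op, act_dtype, weight_scheme):
    """Return the accelerated variant, or the XNNPACK fallback path."""
    variant = KLEIDIAI_VARIANTS.get((op, act_dtype, weight_scheme))
    return variant if variant else "xnnpack-fallback"

print(dispatch("XNNFullyConnected", "fp32", "fp32"))  # accelerated variant
print(dispatch("XNNConv2d", "fp64", "fp64"))          # unsupported: fallback
```

The important behavior is the unconditional fallback: an unmatched configuration still executes correctly, just without KleidiAI acceleration.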
The following sections provide detailed information on which operator configurations can benefit from KleidiAI acceleration, along with their corresponding data type and quantization support.

### XNNFullyConnected

| XNNPACK GEMM Variant | Activations DataType | Weights DataType | Output DataType |
| -------------------- | -------------------- | ---------------- | --------------- |
| pf16_gemm | FP16 | FP16 | FP16 |
| pf32_gemm | FP32 | FP32 | FP32 |
| qp8_f32_qc8w_gemm | Asymmetric INT8 per-row quantization | Per-channel symmetric INT8 quantization | FP32 |
| pqs8_qc8w_gemm | Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
| qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32 |

### XNNConv2d

| XNNPACK GEMM Variant | Input DataType | Filter DataType | Output DataType |
| -------------------- | -------------- | --------------- | --------------- |
| pf32_gemm | FP32 | FP32, pointwise (1×1) | FP32 |
| pqs8_qc8w_gemm | Asymmetric INT8 quantization (NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization (NHWC) |

### XNNBatchMatrixMultiply

| XNNPACK GEMM Variant | Input A DataType | Input B DataType | Output DataType |
| -------------------- | ---------------- | ---------------- | --------------- |
| pf32_gemm | FP32 | FP32 | FP32 |
| pf16_gemm | FP16 | FP16 | FP16 |
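As a worked illustration of one scheme named in the tables above, the snippet below applies per-channel symmetric INT8 quantization to a small weight matrix, the scheme the `qc8w` variants expect. This is a sketch of the arithmetic only, in plain Python, and says nothing about XNNPACK's packed weight layout.

```python
def quantize_per_channel_symmetric_int8(weights):
    """Quantize each output channel (row) of a weight matrix to INT8.

    Symmetric means the zero-point is 0; each row gets its own scale, so
    one large-magnitude channel cannot degrade the precision of the others.
    """
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
    return quantized, scales

w = [[0.02, -0.01, 0.03],   # channel 0: small magnitudes
     [1.50, -2.00, 0.50]]   # channel 1: large magnitudes
qw, s = quantize_per_channel_symmetric_int8(w)
# Dequantize to check that the round-trip error stays small per channel.
deq = [[q * sc for q in row] for row, sc in zip(qw, s)]
print(qw)
print(deq)
```

Because each channel is scaled independently, both rows use the full INT8 range even though their magnitudes differ by two orders of magnitude; a single per-tensor scale would quantize channel 0 to near-zero values.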