---
title: Environment setup
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---


### Python Environment Setup

Before building ExecuTorch, it is highly recommended to create an isolated Python environment.
This prevents dependency conflicts with your system Python installation and ensures a clean build environment.

```bash
cd $WORKSPACE
python3 -m venv pyenv
source pyenv/bin/activate
```
All subsequent steps should be executed within this Python virtual environment.
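
To confirm that the virtual environment is active, you can run a quick check with Python's standard library (a minimal sketch; it only reports whether the current interpreter comes from a venv):

```python
import sys

# Inside a venv, sys.prefix points at the environment directory while
# sys.base_prefix still points at the system Python installation.
in_venv = sys.prefix != sys.base_prefix
print("virtual environment active" if in_venv else "WARNING: system Python in use")
```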

### Download the ExecuTorch Source Code

Clone the ExecuTorch repository from GitHub. The following command checks out the stable v1.0.0 release and ensures all required submodules are fetched.

```bash
cd $WORKSPACE
git clone -b v1.0.0 --recurse-submodules https://github.com/pytorch/executorch.git
```

> **Note:**
> The instructions in this guide are based on **ExecuTorch v1.0.0**.
> Commands or configuration options may differ in later releases.
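
You can confirm that the expected tag is checked out and that submodules were fetched (a quick sanity check; paths assume the clone above):

```shell
cd $WORKSPACE/executorch
# Should print v1.0.0
git describe --tags
# Submodules that are still uninitialized are prefixed with '-' in this output
git submodule status | head
```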

### Build and Install the ExecuTorch Python Components

Next, build the Python bindings and install them into your environment. The following command uses the provided installation script to configure, compile, and install ExecuTorch with developer tools enabled.

```bash
cd $WORKSPACE/executorch
CMAKE_ARGS="-DEXECUTORCH_BUILD_DEVTOOLS=ON" ./install_executorch.sh
```

This will build ExecuTorch and its dependencies using CMake, enabling optional developer utilities such as ETDump and Inspector.

After installation completes successfully, you can verify the environment by running:

```bash
python -c "import executorch; print('ExecuTorch installed successfully.')"
```
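
If you also want to confirm that the developer tools were included, a hedged check (assuming the `executorch.devtools` module name used by v1.0.0) is:

```python
import importlib.util

# find_spec returns None when the (sub)module is not importable.
have_et = importlib.util.find_spec("executorch") is not None
have_devtools = have_et and importlib.util.find_spec("executorch.devtools") is not None
print("devtools available" if have_devtools else "devtools not found")
```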

---
title: Cross-Compile ExecuTorch for the AArch64 platform
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---


This section describes how to cross-compile ExecuTorch for an AArch64 target platform with XNNPACK and KleidiAI support enabled.
All commands below are intended to be executed on an x86-64 Linux host with an appropriate cross-compilation toolchain installed (for example, `aarch64-linux-gnu-gcc`).


### Run CMake Configuration

Use CMake to configure the ExecuTorch build for AArch64. The example below enables key extensions, developer tools, and XNNPACK with KleidiAI acceleration:

```bash
cd $WORKSPACE
mkdir -p build-arm64
cd build-arm64

cmake -GNinja \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_SYSTEM_NAME=Linux \
  -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
  -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
  -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
  -DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
  -DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=BOTH \
  -DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=ONLY \
  -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
  -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
  -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
  -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
  -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
  -DEXECUTORCH_BUILD_XNNPACK=ON \
  -DEXECUTORCH_BUILD_DEVTOOLS=ON \
  -DEXECUTORCH_ENABLE_EVENT_TRACER=ON \
  -DEXECUTORCH_ENABLE_LOGGING=ON \
  -DEXECUTORCH_LOG_LEVEL=debug \
  -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \
  ../executorch
```

#### Key Build Options

| **CMake Option** | **Description** |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `EXECUTORCH_BUILD_XNNPACK` | Builds the **XNNPACK backend**, which provides highly optimized CPU operators (GEMM, convolution, etc.) for Arm64 platforms. |
| `EXECUTORCH_XNNPACK_ENABLE_KLEIDI` | Enables **Arm KleidiAI** acceleration for XNNPACK kernels, providing further performance improvements on Armv8.2+ CPUs. |
| `EXECUTORCH_BUILD_DEVTOOLS` | Builds **developer tools** such as the ExecuTorch Inspector and diagnostic utilities for profiling and debugging. |
| `EXECUTORCH_BUILD_EXTENSION_MODULE` | Builds the **Module API** extension, which provides a high-level abstraction for model loading and execution using `Module` objects. |
| `EXECUTORCH_BUILD_EXTENSION_TENSOR` | Builds the **Tensor API** extension, providing convenience functions for creating, manipulating, and managing tensors in C++ runtime. |
| `EXECUTORCH_BUILD_KERNELS_OPTIMIZED` | Enables building **optimized kernel implementations** for better performance on supported architectures. |
| `EXECUTORCH_ENABLE_EVENT_TRACER` | Enables the **event tracing** feature, which records performance and operator timing information for runtime analysis. |



### Build ExecuTorch

```bash
cmake --build . -j$(nproc)
```

If the build completes successfully, you should find the `executor_runner` binary at:

```bash
build-arm64/executor_runner
```
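
You can verify that the binary was built for the right architecture with `file`; it should report an AArch64 ELF executable:

```shell
# A native x86-64 binary here would indicate the cross-compiler flags were not picked up
file build-arm64/executor_runner
```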

This binary can be used to run ExecuTorch models on the AArch64 target device using the XNNPACK backend with KleidiAI acceleration.
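
As a sketch of the final step, copy the binary and an exported model to the target and run it there (`model.pte` is a placeholder for your own exported model file):

```shell
# On the AArch64 target device
./executor_runner --model_path model.pte
```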

---
title: KleidiAI micro-kernels support in ExecuTorch
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---
ExecuTorch uses XNNPACK as its primary CPU backend for operator execution and performance optimization.

Within this architecture, only a subset of KleidiAI SME (Scalable Matrix Extension) micro-kernels has been integrated into XNNPACK to provide additional acceleration on supported Arm platforms.

These specialized micro-kernels are designed to accelerate operators with specific data types and quantization configurations in ExecuTorch models.

When an operator matches one of the supported configurations, ExecuTorch automatically dispatches it through the KleidiAI-optimized path.

Operators that are not covered by KleidiAI fall back to the standard XNNPACK implementations during inference, ensuring functional correctness across all models.

In ExecuTorch v1.0.0, the following operator types are implemented through the XNNPACK backend and can potentially benefit from KleidiAI acceleration:
- XNNFullyConnected – Fully connected (dense) layers
- XNNConv2d – Standard 2D convolution layers
- XNNBatchMatrixMultiply – Batched matrix multiplication operations

However, not all instances of these operators are accelerated by KleidiAI.

Acceleration eligibility depends on several operator attributes and backend support, including:
- Data types (e.g., float32, int8, int4)
- Quantization schemes (e.g., symmetric/asymmetric, per-tensor/per-channel)
- Tensor memory layout and alignment
- Kernel dimensions and stride settings

The following section provides detailed information on which operator configurations can benefit from KleidiAI acceleration, along with their corresponding data type and quantization support.
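
The dispatch behavior described above can be sketched as a simple lookup (illustrative only; the real selection logic lives inside XNNPACK and also checks attributes such as layout, alignment, and strides):

```python
# Hypothetical sketch of KleidiAI dispatch eligibility for fully connected
# layers, keyed on (activation dtype, weight dtype). Variant names follow
# the tables in this section; any unmatched configuration falls back to the
# standard XNNPACK kernels.
KLEIDIAI_FC_VARIANTS = {
    ("fp16", "fp16"): "pf16_gemm",
    ("fp32", "fp32"): "pf32_gemm",
    ("int8_per_row_asym", "int8_per_channel_sym"): "qp8_f32_qc8w_gemm",
}

def select_gemm(act_dtype: str, weight_dtype: str) -> str:
    """Return the KleidiAI-backed variant if eligible, else the fallback."""
    return KLEIDIAI_FC_VARIANTS.get((act_dtype, weight_dtype), "xnnpack_default")

print(select_gemm("fp32", "fp32"))        # pf32_gemm
print(select_gemm("fp32", "int16_sym"))   # xnnpack_default
```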


### XNNFullyConnected

| XNNPACK GEMM Variant | Activations DataType| Weights DataType | Output DataType |
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
| pf16_gemm | FP16 | FP16 | FP16 |
| pf32_gemm | FP32 | FP32 | FP32 |
| qp8_f32_qc8w_gemm | Asymmetric INT8 per-row quantization | Per-channel symmetric INT8 quantization | FP32 |
| pqs8_qc8w_gemm | Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
| qp8_f32_qb4w_gemm | Asymmetric INT8 per-row quantization | INT4 (signed), shared blockwise quantization | FP32 |


### XNNConv2d
| XNNPACK GEMM Variant | Input DataType| Filter DataType | Output DataType |
| ------------------ | ---------------------------- | --------------------------------------- | ---------------------------- |
| pf32_gemm | FP32 | FP32, pointwise (1×1) | FP32 |
| pqs8_qc8w_gemm | Asymmetric INT8 quantization (NHWC) | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization(NHWC) |


### XNNBatchMatrixMultiply
| XNNPACK GEMM Variant | Input A DataType| Input B DataType |Output DataType |
| ------------------ | ---------------------------- | --------------------------------------- |--------------------------------------- |
| pf32_gemm | FP32 | FP32 | FP32 |
| pf16_gemm | FP16 | FP16 | FP16 |


