Note: This project is currently under development, and this README will be updated periodically.
Update (Sept 28, 2025): Over the summer, a clever person ported the Speech Diarization model to CoreML. It's neatly wrapped and abstracted in the FluidAudio library. The barrier to entry for that API should be lower than for this project. However, if you're doing more advanced or nuanced work, this project should still be useful, since FluidAudio is built atop Sherpa.
This repository aims to refactor and simplify the SwiftUI example provided by k2-fsa/sherpa-onnx, specifically focusing on Speech Diarization.
I wrote a companion article breaking down how and why I built this project.
Additionally, I recently created an algorithm for Active Speaker Detection using this project as a base.
Before building this project, ensure the required frameworks are in place:
- onnxruntime is too large to be included directly. You must download it manually.
- Sherpa-Onnx.xcframework must also be built and added to your project. See Building from Sherpa Onnx.
Without these, building the project will fail.
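A quick pre-build check can catch a missing framework before Xcode does. The sketch below is only an illustration: `check_frameworks` is a hypothetical helper name, and it assumes both frameworks sit at the top level of the project directory and that the Sherpa framework uses the lowercase name the build script produces (`sherpa-onnx.xcframework`).

```shell
# Hypothetical helper (not part of this repo): verify that both required
# frameworks exist in the given project directory (defaults to ".").
# Assumption: the frameworks are copied to the top level of that directory.
check_frameworks() {
    project_dir=${1:-.}
    missing=0
    for fw in onnxruntime.xcframework sherpa-onnx.xcframework; do
        if [ ! -d "$project_dir/$fw" ]; then
            echo "missing: $project_dir/$fw" >&2
            missing=1
        fi
    done
    # Returns 0 when both frameworks are present, 1 otherwise.
    return "$missing"
}

# Example usage:
#   check_frameworks /path/to/YourXcodeProject && echo "frameworks in place"
```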
Note: After setup, test the app using the File Picker to load an audio file. Alternatively, hardcode a file path in ContentView (line 18) for testing.
Download the onnxruntime framework: onnxruntime.xcframework-1.17.1.tar.bz2
Steps:
- Extract the archive.
- Copy onnxruntime.xcframework into your Xcode project directory.
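The two steps above can also be done from a terminal. This is a minimal sketch, not the project's official setup: `install_onnxruntime` is a hypothetical helper name, and because the layout inside the archive is not documented here, the function searches the extracted files for `onnxruntime.xcframework` rather than assuming a path.

```shell
# Hypothetical helper: extract the downloaded archive and copy
# onnxruntime.xcframework into the Xcode project directory.
install_onnxruntime() {
    archive=$1
    project_dir=$2
    workdir=$(mktemp -d) || return 1
    # tar detects the bzip2 compression on its own when extracting.
    if ! tar -xf "$archive" -C "$workdir"; then
        rm -rf "$workdir"
        return 1
    fi
    # Assumption: the archive layout is unknown, so search for the framework.
    fw=$(find "$workdir" -type d -name onnxruntime.xcframework | head -n 1)
    if [ -z "$fw" ]; then
        echo "onnxruntime.xcframework not found in $archive" >&2
        rm -rf "$workdir"
        return 1
    fi
    mkdir -p "$project_dir" && cp -R "$fw" "$project_dir/"
    status=$?
    rm -rf "$workdir"
    return "$status"
}

# Example usage:
#   install_onnxruntime onnxruntime.xcframework-1.17.1.tar.bz2 /path/to/YourXcodeProject
```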
To build Sherpa-Onnx.xcframework, follow these steps:
Visit this link for more detailed build instructions.
- Clone the repository:
  git clone https://github.com/k2-fsa/sherpa-onnx
- Enter the repo directory:
  cd sherpa-onnx
- Run the iOS build script:
  ./build-ios.sh
- After the script completes, a build-ios folder will be created.
- Copy sherpa-onnx.xcframework from build-ios into your Xcode project.
- You’ll also find onnxruntime.xcframework in:
  ios-onnxruntime/1.17.1/onnxruntime.xcframework
  This is the same xcframework from the previous section.
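The build steps above can be collected into one script. Treat this as a sketch under stated assumptions: both function names are hypothetical, the clone is assumed to land in the current directory, and `build_sherpa` needs network access plus a working Xcode toolchain. The copy step is split out so it can be rerun on its own after a build.

```shell
# Hypothetical helper: copy the built framework from build-ios
# into the Xcode project directory.
copy_sherpa_framework() {
    build_dir=$1
    project_dir=$2
    src="$build_dir/sherpa-onnx.xcframework"
    if [ ! -d "$src" ]; then
        echo "missing: $src (did build-ios.sh finish?)" >&2
        return 1
    fi
    mkdir -p "$project_dir" && cp -R "$src" "$project_dir/"
}

# Hypothetical wrapper for the full flow: clone, build, then copy.
# Assumption: run from a directory where ./sherpa-onnx does not exist yet.
build_sherpa() {
    project_dir=$1
    git clone https://github.com/k2-fsa/sherpa-onnx || return 1
    (cd sherpa-onnx && ./build-ios.sh) || return 1
    copy_sherpa_framework sherpa-onnx/build-ios "$project_dir"
}

# Example usage:
#   build_sherpa /path/to/YourXcodeProject
```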

The app requires you to select an audio or video file via the File Picker. Alternatively, you can change line 18 in ContentView to hardcode a file in your bundle for testing.
It then converts the file to a format that the speech diarization model accepts.
Afterwards, run the model; the results will replace the placeholder text once processing finishes.
Contributions and suggestions are welcome as the project is actively evolving.
Updates and additional documentation will be provided as development progresses.