GitHub - hyperion-ml/vox-profile-release: Vox-Profile Benchmark

Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits

📄 [Preprint Paper] | 🤗 [HuggingFace Models] | 🚀 [Voxlect Project]

In this repo, we present Vox-Profile, one of the first benchmarking efforts that systematically evaluate rich multi-dimensional speaker and speech traits from English-speaking voices. Our benchmark is presented below:

Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computational limits), so you need to trim your audio to a maximum of 15 seconds and provide it as 16 kHz, mono-channel input.
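The input constraints above (mono, 16 kHz, at most 15 seconds, ideally at least 3 seconds) can be sketched as a small preprocessing helper. This is a minimal sketch and not part of the repo: `prepare_waveform` is a hypothetical name, and the linear-interpolation resampling via `np.interp` is a crude stand-in for a dedicated resampler such as `torchaudio.functional.resample`.

```python
import numpy as np

TARGET_SR = 16000   # sample rate expected by the models
MIN_SECONDS = 3     # shorter clips give unreliable predictions
MAX_SECONDS = 15    # longer clips are cut for computational reasons


def prepare_waveform(samples, sample_rate):
    """Return a mono, 16 kHz, float32 array of shape [1, T] with T <= 15 s.

    `samples` is shaped [T] (mono) or [channels, T].
    """
    samples = np.asarray(samples, dtype=np.float32)
    # Mix multi-channel audio down to mono by averaging the channels.
    if samples.ndim == 2:
        samples = samples.mean(axis=0)
    # Resample with linear interpolation (assumption: good enough for a sketch).
    if sample_rate != TARGET_SR:
        duration = samples.shape[0] / sample_rate
        n_out = int(round(duration * TARGET_SR))
        old_t = np.arange(samples.shape[0]) / sample_rate
        new_t = np.arange(n_out) / TARGET_SR
        samples = np.interp(new_t, old_t, samples).astype(np.float32)
    # Truncate to the 15-second maximum.
    samples = samples[: MAX_SECONDS * TARGET_SR]
    if samples.shape[0] < MIN_SECONDS * TARGET_SR:
        print("Warning: audio is shorter than 3 seconds; predictions may be unreliable.")
    # Add the batch dimension used in the quick examples.
    return samples[None, :]
```

The result can be passed to the models in the quick examples with `torch.from_numpy(prepare_waveform(samples, sample_rate)).to(device)`.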

Download Repo

git clone git@github.com:tiantiaf0627/vox-profile-release.git

Installation

conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .

Quick Example 1 - Whisper Large Narrow Accent

# Load libraries
import torch
import torch.nn.functional as F
from src.model.accent.whisper_accent import WhisperWrapper

# Label List
english_accent_list = [
    'East Asia', 'English', 'Germanic', 'Irish', 
    'North America', 'Northern Irish', 'Oceania', 
    'Other', 'Romance', 'Scottish', 'Semitic', 'Slavic', 
    'South African', 'Southeast Asia', 'South Asia', 'Welsh'
]
    
# Find device (use a torch.device in both branches)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model from Huggingface
whisper_model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-narrow-accent").to(device)
whisper_model.eval()

# Load data; one second of zeros is used here as a placeholder
# The training data excludes audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computational limits),
# so trim your audio to at most 15 seconds and provide it as 16 kHz mono
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
whisper_logits, whisper_embeddings = whisper_model(data, return_feature=True)
    
# Probability and output
whisper_prob = F.softmax(whisper_logits, dim=1)
print(english_accent_list[torch.argmax(whisper_prob).detach().cpu().item()])

	Eastern Asia	English	Germanic	Irish	North America	Northern Irish	Oceania	Other	Romance	Scottish	Semitic	Slavic	South African	South-eastern Asia	Southern Asia	Welsh
Eastern Asia	89.67	1.03	0.96	0.11	2.20	0.00	0.75	0.21	2.63	0.07	0.28	0.25	0.04	1.49	0.07	0.25
English	0.16	84.31	0.08	0.08	1.36	0.36	1.24	0.12	2.32	3.23	0.84	0.12	1.32	0.28	0.44	3.75
Germanic	0.11	0.00	65.76	0.00	0.11	0.00	0.00	5.99	11.75	0.00	0.23	16.05	0.00	0.00	0.00	0.00
Irish	0.15	5.61	1.21	47.04	8.50	0.30	2.58	0.30	8.95	23.07	0.46	0.15	1.21	0.00	0.15	0.30
North America	2.11	1.54	1.02	0.90	67.93	0.00	1.98	0.38	9.73	0.38	5.38	0.64	2.43	1.98	3.52	0.06
Northern Irish	0.00	0.00	0.00	0.00	0.00	97.50	0.00	0.00	0.00	2.50	0.00	0.00	0.00	0.00	0.00	0.00
Oceania	0.82	4.09	4.29	0.41	7.36	0.00	64.62	0.20	10.22	0.61	0.41	2.66	3.68	0.61	0.00	0.00
Other	1.40	0.59	1.40	0.16	0.70	0.00	0.05	80.99	3.54	0.00	2.42	7.04	0.48	0.27	0.54	0.43
Romance	2.12	0.00	1.73	0.20	0.43	0.00	0.00	0.70	89.08	0.00	1.29	4.15	0.00	0.23	0.07	0.00
Scottish	0.00	5.18	0.00	10.36	5.18	0.36	1.61	0.00	0.18	75.54	0.18	0.00	0.89	0.00	0.18	0.36
Semitic	31.54	0.00	0.05	0.00	0.00	0.00	0.00	0.11	10.43	0.00	57.35	0.37	0.00	0.11	0.05	0.00
Slavic	0.83	0.00	5.82	0.00	0.00	0.00	0.00	5.54	16.07	0.00	1.66	70.08	0.00	0.00	0.00	0.00
South African	0.00	4.00	2.00	0.00	4.50	0.00	8.50	2.50	2.00	1.50	0.00	0.00	68.00	0.00	4.00	3.00
South-eastern Asia	41.10	0.57	0.08	0.12	2.53	0.00	0.37	0.00	3.54	0.00	0.12	0.04	0.73	49.12	1.67	0.00
Southern Asia	0.62	0.22	1.02	0.04	0.66	0.00	0.09	0.31	6.16	0.00	1.02	0.18	0.62	0.09	88.88	0.09
Welsh	0.00	26.89	0.00	0.20	0.20	0.00	0.20	0.20	0.10	3.58	0.10	0.00	0.82	0.00	0.00	67.69
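Each row of the table above sums to roughly 100, which suggests it is a row-normalized confusion matrix (rows as the true accent, columns as the predicted accent, values in percent). The source does not state this explicitly, so treat that reading as an assumption. Under it, the largest off-diagonal entry in a row names the accent most often confused with that row's accent. The sketch below copies three rows from the table and finds that entry; `most_confused` is a hypothetical helper.

```python
LABELS = [
    "Eastern Asia", "English", "Germanic", "Irish", "North America",
    "Northern Irish", "Oceania", "Other", "Romance", "Scottish", "Semitic",
    "Slavic", "South African", "South-eastern Asia", "Southern Asia", "Welsh",
]

# Three rows copied from the table above (values in percent).
ROWS = {
    "Semitic": [31.54, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.11,
                10.43, 0.00, 57.35, 0.37, 0.00, 0.11, 0.05, 0.00],
    "South-eastern Asia": [41.10, 0.57, 0.08, 0.12, 2.53, 0.00, 0.37, 0.00,
                           3.54, 0.00, 0.12, 0.04, 0.73, 49.12, 1.67, 0.00],
    "Welsh": [0.00, 26.89, 0.00, 0.20, 0.20, 0.00, 0.20, 0.20,
              0.10, 3.58, 0.10, 0.00, 0.82, 0.00, 0.00, 67.69],
}


def most_confused(row_label, row):
    """Return (other_label, percent) for the largest off-diagonal entry."""
    diag = LABELS.index(row_label)
    candidates = [(value, LABELS[i]) for i, value in enumerate(row) if i != diag]
    value, label = max(candidates)
    return label, value


for name, row in ROWS.items():
    other, pct = most_confused(name, row)
    correct = row[LABELS.index(name)]
    print(f"{name}: {correct:.2f}% on the diagonal, most often confused with {other} ({pct:.2f}%)")
```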

Quick Example 2 - WavLM Large Narrow Accent

# Load libraries
import torch
import torch.nn.functional as F
from src.model.accent.wavlm_accent import WavLMWrapper

# Label List
english_accent_list = [
    'East Asia', 'English', 'Germanic', 'Irish', 
    'North America', 'Northern Irish', 'Oceania', 
    'Other', 'Romance', 'Scottish', 'Semitic', 'Slavic', 
    'South African', 'Southeast Asia', 'South Asia', 'Welsh'
]
    
# Find device (use a torch.device in both branches)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model from Huggingface
wavlm_model = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-narrow-accent").to(device)
wavlm_model.eval()

# Load data; one second of zeros is used here as a placeholder
# The training data excludes audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computational limits),
# so trim your audio to at most 15 seconds and provide it as 16 kHz mono
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
wavlm_logits, wavlm_embeddings = wavlm_model(data, return_feature=True)
    
# Probability and output
wavlm_prob = F.softmax(wavlm_logits, dim=1)
print(english_accent_list[torch.argmax(wavlm_prob).detach().cpu().item()])
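Both quick examples print only the single most likely accent. Because the models output a score for every label, it can be useful to inspect the few most likely accents instead. The helper below is a minimal, dependency-free sketch; `top_k_labels` is a hypothetical name, and the logits in the usage lines are placeholders rather than real model output.

```python
import math


def top_k_labels(logits, labels, k=3):
    """Softmax a list of logits and return the k most likely (label, prob) pairs."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder logits for three labels (not real model output).
for label, prob in top_k_labels([2.0, 0.5, 1.0], ["Scottish", "Irish", "English"], k=2):
    print(f"{label}: {prob:.2f}")
```

With the quick examples, and assuming the logits have shape `[1, num_labels]`, the call would look like `top_k_labels(wavlm_logits[0].tolist(), english_accent_list)`.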

Given that the Vox-Profile Benchmark paper is still under peer review, we provide a limited set of models and model weights until the review concludes. The models currently available are listed below.

WavLM-Large Models

Model Name	Data	Pre-trained Model	Use LoRA	LoRA Rank Size	Output	Example Code
wavlm-large-sex-age	CommonVoice+Timit+Voxceleb (age enriched)	wavlm-large	Yes	16	Sex (2-class) / Age (0-1)*100 Years	Example
wavlm-large-broader-accent	See Paper (11 Datasets)	wavlm-large	Yes	16	North American / British / Other (3-class)	Example
wavlm-large-narrow-accent	See Paper (11 Datasets)	wavlm-large	Yes	16	See example	Example
wavlm-large-voice-quality	ParaSpeechCaps	wavlm-large	Yes	16	See example	Example
wavlm-large-influency	SEP28K+FluencyBank	wavlm-large	Yes	16	Fluent/Disfluent (Specified Disfluency Types)	Example
wavlm-large-categorical-emotion	MSP-Podcast	wavlm-large	No	NA	8 Emotions + Other	Example
wavlm-large-dim-emotion	MSP-Podcast	wavlm-large	No	NA	Arousal/Valence/Dominance	Example
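The `wavlm-large-sex-age` row lists the age output as a value in the 0-1 range that is multiplied by 100 to give years. A minimal sketch of that conversion, assuming the model returns one float per utterance; the function name and the clamping step are assumptions, not the repo's API:

```python
def age_output_to_years(age_output):
    """Map a normalized age prediction in [0, 1] to years (0-100)."""
    # Clamp first so small numerical overshoots stay in range (assumption).
    clamped = min(max(age_output, 0.0), 1.0)
    return clamped * 100.0


print(age_output_to_years(0.35))
```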

Whisper-Large V3 Models

Model Name	Data	Pre-trained Model	Use LoRA	LoRA Rank Size	Output	Example Code
whisper-large-v3-broader-accent	See Paper (11 Datasets)	whisper-large v3	Yes	16	North American / British / Other (3-class)	Example
whisper-large-v3-narrow-accent	See Paper (11 Datasets)	whisper-large v3	Yes	16	See example	Example
whisper-large-v3-voice-quality	ParaSpeechCaps	whisper-large v3	Yes	16	See example	Example
whisper-large-v3-influency	SEP28K+FluencyBank	whisper-large v3	Yes	16	Fluent/Disfluent (Specified Disfluency Types)	Example
whisper-large-v3-categorical-emotion	MSP-Podcast	whisper-large v3	Yes	16	8 Emotions + Other	Example
whisper-large-v3-dim-emotion	MSP-Podcast	whisper-large v3	No	NA	Arousal/Valence/Dominance	Example

Labeling Scheme

In Vox-Profile, we experiment with over 15 publicly available datasets to predict static traits (speaker age, speaker sex, speaker accent, and voice quality) and dynamic traits (speech emotion, speech flow, and speech expressiveness) in different recording conditions and elicitation settings (e.g., read, spontaneous, and conversational speech). Our labeling taxonomy is described below:

Enabling Versatile Speech Applications with Vox-Profile

Vox-Profile can serve as a fundamental component that supports a variety of speech applications.

1. Speech Model Performance Analysis

We generate speaker and speech traits for existing datasets and investigate whether these generated labels lead to the same insights as the ground truth trait information when analyzing speech model performance. For example, in the results below, we find that whether utterances are grouped by ground truth traits or by traits inferred by Vox-Profile, the ASR performance trends remain the same across different traits.
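The comparison described above amounts to grouping utterances by a trait label, once with the ground truth label and once with the Vox-Profile prediction, and checking that the per-group ASR error rates rank the same way. The sketch below uses a handful of made-up records purely to show the bookkeeping; the field names and numbers are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def wer_by_group(records, key):
    """Aggregate word error rate per group: total errors / total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for rec in records:
        errors[rec[key]] += rec["errors"]
        words[rec[key]] += rec["ref_words"]
    return {group: errors[group] / words[group] for group in words}


def ranking(wer):
    """Group names ordered from lowest to highest WER."""
    return sorted(wer, key=wer.get)


# Made-up toy records (illustration only, not data from the paper).
records = [
    {"true_accent": "North America", "pred_accent": "North America", "errors": 2, "ref_words": 40},
    {"true_accent": "North America", "pred_accent": "North America", "errors": 3, "ref_words": 60},
    {"true_accent": "Scottish", "pred_accent": "Scottish", "errors": 9, "ref_words": 50},
    {"true_accent": "Scottish", "pred_accent": "Irish", "errors": 6, "ref_words": 50},
    {"true_accent": "Irish", "pred_accent": "Irish", "errors": 5, "ref_words": 50},
]

by_truth = wer_by_group(records, "true_accent")
by_pred = wer_by_group(records, "pred_accent")
print("Same trend:", ranking(by_truth) == ranking(by_pred))
```

Comparing rankings is only one way to check that the trends agree; a rank correlation over many groups would be a natural alternative.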

2. Automated Evaluation Tool for Speech Generation Tasks

We demonstrate the utility of Vox-Profile as an evaluation tool for speech generation tasks by comparing two representative models: FreeVC and VALLE-X. As shown in the table below, the accent prediction scores and the cosine similarity for the synthesized samples from FreeVC suggest greater similarity to the source speaker’s accent than to the reference speaker’s. In contrast, the scores for VALLE-X indicate closer alignment with the reference speaker’s accent in most conditions. These findings are consistent with previous studies, which report that FreeVC has limited capability in replicating the accentual features of the reference speaker compared to VALLE-X.
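The cosine-similarity part of this comparison can be sketched as follows: embed the synthesized, source, and reference utterances (one option is the embeddings returned with `return_feature=True` in the quick examples; whether the paper uses exactly those is an assumption), then check which speaker the synthesized sample sits closer to. The vectors below are placeholders, not real embeddings, and `closer_to` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def closer_to(synth, source, reference):
    """Report whether the synthesized embedding is nearer the source or the reference."""
    sim_source = cosine_similarity(synth, source)
    sim_reference = cosine_similarity(synth, reference)
    winner = "source" if sim_source > sim_reference else "reference"
    return winner, sim_source, sim_reference


# Placeholder 3-dimensional vectors (real embeddings are much larger).
synth = [0.9, 0.1, 0.2]
source = [1.0, 0.0, 0.1]
reference = [0.1, 1.0, 0.3]
winner, sim_source, sim_reference = closer_to(synth, source, reference)
print(f"closer to {winner}: source={sim_source:.3f}, reference={sim_reference:.3f}")
```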

3. Generating Synthetic Speaking Style Prompt

Vox-Profile provides a more extensive and varied set of traits, including speech flow, arousal, valence, and speaker age. Moreover, the computational models of Vox-Profile output probabilistic predictions for each trait, enabling more nuanced and confidence-sensitive descriptions. For example, a Scottish accent prediction with a probability of 0.9 can be described as having a distinct Scottish accent.
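A minimal sketch of turning a probabilistic prediction into a confidence-sensitive phrase. Only the pairing of a 0.9 probability with "a distinct Scottish accent" comes from the text above; the exact cut-offs, the other wordings, and the function name are assumptions for illustration.

```python
def describe_accent(label, prob):
    """Map an accent label and its probability to a short descriptive phrase."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be in [0, 1]")
    if prob >= 0.9:    # cut-off assumed from the 0.9 example in the text
        return f"a distinct {label} accent"
    if prob >= 0.6:    # assumed threshold
        return f"a noticeable {label} accent"
    if prob >= 0.4:    # assumed threshold
        return f"a slight {label} accent"
    return "no clearly identifiable accent"


print(describe_accent("Scottish", 0.9))  # prints "a distinct Scottish accent"
```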

Human-evaluation results comparing synthetic speaking style prompts from Vox-Profile and human-annotated speaking style prompts from ParaSpeechCaps suggest that this group of human raters shows similar preference levels for both synthetic and human-annotated speaking style prompts. Specifically, they favor the emotion, age, and speech flow descriptions generated by Vox-Profile over those from ParaSpeechCaps.

Responsible Use: Users should respect the privacy and consent of the data subjects, and adhere to the relevant laws and regulations in their jurisdictions when using Vox-Profile. We also advise the use of Vox-Profile under the RAIL license.

❌ Exemplar Out-of-Scope Use

Clinical or diagnostic applications
Surveillance
Privacy-invasive applications
Commercial use

If you like our work or use the models in your work, kindly cite the following. We appreciate your recognition!

@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}
