Do Audio LLMs Listen or Read?
Analyzing and Mitigating Paralinguistic Failures with VoxParadox


We evaluate modern Audio LLMs, including the Audio Flamingo series, Qwen2-Audio, and Kimi-Audio, to test whether they truly leverage acoustic cues or over-rely on the semantics of the speech content.

Abstract

Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark of 2,000 verified examples spanning 10 paralinguistic tasks. Each example is created with controlled speech synthesis so that the transcript's claims intentionally mismatch the speaking style, enabling direct measurement of paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on the acoustic ground truth and a strong tendency to follow the language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder-LLM interface, and (ii) even when such cues are available in the audio tokens, the language model frequently ignores them. To address these problems, we propose the Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio-encoder layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve paralinguistic understanding, raising Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox and from 37.74% to 54.78% on the MMSU paralinguistic subset.
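For intuition, here is a minimal PyTorch sketch of the layer-mixing idea behind PCLM, assuming the mixer sees a stack of audio-encoder hidden states and a pooled embedding of the text prompt. The module name, tensor shapes, and pooling choice are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PromptConditionedLayerMixer(nn.Module):
    """Sketch of PCLM: mix hidden states from several audio-encoder layers,
    with mixing weights predicted from the text prompt."""

    def __init__(self, num_layers: int, prompt_dim: int):
        super().__init__()
        # One scalar gate logit per encoder layer, conditioned on the prompt.
        self.gate = nn.Linear(prompt_dim, num_layers)

    def forward(self, layer_states: torch.Tensor, prompt_emb: torch.Tensor):
        # layer_states: (num_layers, batch, time, audio_dim)
        # prompt_emb:   (batch, prompt_dim), e.g. mean-pooled prompt tokens
        weights = torch.softmax(self.gate(prompt_emb), dim=-1)  # (batch, num_layers)
        # Prompt-adaptive weighted sum over the layer axis.
        return torch.einsum("bl,lbtd->btd", weights, layer_states)
```

The DPO stage then applies the standard DPO objective to preference pairs, with the acoustically supported option as the "chosen" response and the language-implied option as the "rejected" one. A sketch over sequence log-probabilities:

```python
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Sequence log-probs under the trained policy and a frozen reference model.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```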

Key Findings

  • Current Audio LLMs often prioritize speech semantics over acoustics.
  • Paralinguistic signals (for example, prosody and delivery style) are frequently underutilized.
  • Performance can degrade when semantic shortcuts are weakened and acoustic grounding is required.
  • VoxParadox identifies failure patterns and supports mitigation strategies for more robust audio reasoning.

Try It Yourself

Play each clip and notice how the audio and the transcript point to different answers. The True Label reflects the audio; the Adversarial Label reflects what the transcript asserts.

Age prediction

What is the most likely age group of the speaker in the audio?

  • A) Young adult
  • B) Middle-aged adult
  • C) Elderly adult
  • D) Child
True Label: Elderly adult
Adversarial Label: Young adult
Gender prediction

What is the speaker's gender?

  • A) Female
  • B) Male
True Label: Male
Adversarial Label: Female
Emotion recognition

How does the speaker feel in the recording?

  • A) neutral
  • B) sadness
  • C) anger
  • D) happiness
True Label: anger
Adversarial Label: neutral
Emotion recognition

How does the speaker feel in the recording?

  • A) neutral
  • B) sadness
  • C) anger
  • D) happiness
True Label: sadness
Adversarial Label: happiness
Intonation perception

What is the intonation of the entire sentence in the audio?

  • A) Rise-fall intonation
  • B) Falling intonation
  • C) Fall-rise intonation
  • D) Rising intonation
True Label: Falling intonation
Adversarial Label: Rising intonation
Total speaker counting

How many different speakers are in the audio?

  • A) 8 people
  • B) 3 people
  • C) 2 people
  • D) 1 person
True Label: 2 people
Adversarial Label: 1 person
Speaker identity recognition

Which speaker clip belongs to the same person as speaker clip 1?

  • A) the second person
  • B) the third person
  • C) the fifth person
  • D) the fourth person
True Label: the fourth person
Adversarial Label: the second person
Pitch comparison

Which pitch pattern best matches the audio?

  • A) low-medium-high
  • B) medium-high-low
  • C) high-low-medium
  • D) medium-low-high
True Label: high-low-medium
Adversarial Label: low-medium-high
Volume comparison

Which volume pattern best matches the audio?

  • A) low-high-medium
  • B) high-medium-low
  • C) medium-low-high
  • D) low-medium-high
True Label: low-high-medium
Adversarial Label: high-medium-low
Speed comparison

Which speed pattern best matches the audio?

  • A) high-medium-low
  • B) high-low-medium
  • C) low-high-medium
  • D) medium-low-high
True Label: high-low-medium
Adversarial Label: low-high-medium
Vocal range comparison

Which vocal range pattern best matches the audio?

  • A) high-medium-low
  • B) low-medium-high
  • C) high-low-medium
  • D) medium-low-high
True Label: medium-low-high
Adversarial Label: high-medium-low, high-low-medium

VoxParadox Leaderboard

Reported model performance on VoxParadox. Higher is better for Avg. VoxParadox and MMSU Para; lower is better for Adv. label.

Rank | Model | Avg. VoxParadox (%) ↑ | Adv. label (%) ↓ | MMSU Para. (%) ↑
1 | Audio Flamingo 3 + PCLM + DPO (Ours) | 65.20 | 22.60 | 54.78
2 | Audio Flamingo 3 + PCLM (Ours) | 60.00 | 26.30 | 54.06
3 | Audio Flamingo 3 + SFT (Ours) | 34.80 | - | 44.76
4 | Audio Flamingo 2 (AF2) | 30.85 | 29.80 | 27.44
5 | Qwen2-Audio | 30.15 | 27.20 | 20.88
6 | Gemini 2.5 Flash | 24.70 | 58.20 | 51.05
7 | MiMo-Audio | 19.60 | 65.42 | 35.64
8 | Kimi-Audio | 19.00 | 60.50 | 41.48
9 | Step-Audio-R1 | 17.45 | 60.17 | 54.51
10 | Audio Flamingo 3 (AF3) | 17.40 | 68.50 | 37.74
11 | GPT-4o Audio | 8.60 | 77.80 | 36.55
12 | Qwen2.5-Omni | 7.95 | 73.67 | 33.45
13 | VITA-Audio | 6.85 | 72.75 | 29.54
14 | SALMONN | 6.10 | 26.50 | 6.84

VoxParadox average is computed over 10 paralinguistic tasks: Age, Gender, Emotion, Pitch, Volume, Speed, Range, Intonation, Speaker ID, and Speaker Counting.
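As a concrete reading of these two metrics, the sketch below scores a list of per-example records: Avg. VoxParadox is accuracy against the acoustic (true) label, and Adv. label is how often the model follows the transcript-implied answer. The field names (pred, true_label, adv_label) are hypothetical, not the released evaluation scripts.

```python
def score(records: list[dict]) -> dict:
    n = len(records)
    acc = sum(r["pred"] == r["true_label"] for r in records) / n  # listened
    adv = sum(r["pred"] == r["adv_label"] for r in records) / n   # read
    return {"avg_voxparadox": 100 * acc, "adv_label": 100 * adv}

# Example: a model that reads the transcript instead of listening.
records = [
    {"pred": "Young adult", "true_label": "Elderly adult", "adv_label": "Young adult"},
    {"pred": "Female", "true_label": "Male", "adv_label": "Female"},
]
print(score(records))  # {'avg_voxparadox': 0.0, 'adv_label': 100.0}
```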

Project Resources

We are preparing the full project release.

  • Preprint: paper PDF and supplementary material.
  • Benchmark: VoxParadox dataset, tasks, and evaluation scripts.
  • Code: baseline implementations and mitigation methods.

Please check back soon for release links.

BibTeX

@inproceedings{pang2026do,
  title={Do Audio {LLM}s Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
  author={Jiacheng Pang and Ashutosh Chaubey and Mohammad Soleymani},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=v7rYbRR9Zw}
}

Contact

For questions or collaboration, please contact:

Jiacheng Pang - pangj@usc.edu

Ashutosh Chaubey - achaubey@usc.edu