Do Audio LLMs Listen or Read?
Analyzing and Mitigating Paralinguistic Failures with VoxParadox


We evaluate modern Audio LLMs, including the Audio Flamingo series, Qwen2-Audio, and Kimi-Audio, to test whether they truly leverage acoustic cues or over-rely on the semantics of the speech content.

Abstract

Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark of 2,000 verified examples spanning 10 paralinguistic tasks. Each example is created with controlled speech synthesis so that the transcript's claims intentionally mismatch the speaking style, enabling direct measurement of paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on the acoustic ground truth and a strong tendency to follow the language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder-LLM interface, and (ii) even when such cues are available in the audio tokens, the language model frequently ignores them. To address these problems, we propose the Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio-encoder layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve paralinguistic understanding, raising Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox and from 37.74% to 54.78% on the MMSU paralinguistic subset.
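For intuition, here is a minimal PyTorch sketch of the layer-mixing idea behind PCLM, assuming the mixer sees a stack of audio-encoder hidden states and a pooled embedding of the text prompt. The module name, tensor shapes, and pooling choice are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PromptConditionedLayerMixer(nn.Module):
    """Sketch of PCLM: mix hidden states from several audio-encoder layers,
    with mixing weights predicted from the text prompt."""

    def __init__(self, num_layers: int, prompt_dim: int):
        super().__init__()
        # One scalar gate logit per encoder layer, conditioned on the prompt.
        self.gate = nn.Linear(prompt_dim, num_layers)

    def forward(self, layer_states: torch.Tensor, prompt_emb: torch.Tensor):
        # layer_states: (num_layers, batch, time, audio_dim)
        # prompt_emb:   (batch, prompt_dim), e.g. mean-pooled prompt tokens
        weights = torch.softmax(self.gate(prompt_emb), dim=-1)  # (batch, num_layers)
        # Prompt-adaptive weighted sum over the layer axis.
        return torch.einsum("bl,lbtd->btd", weights, layer_states)
```

The DPO stage then applies the standard DPO objective to preference pairs, with the acoustically supported option as the "chosen" response and the language-implied option as the "rejected" one. A sketch over sequence log-probabilities:

```python
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Sequence log-probs under the trained policy and a frozen reference model.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```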

Key Findings

  • Current Audio LLMs often prioritize speech semantics over acoustics.
  • Paralinguistic signals (for example, prosody and delivery style) are frequently underutilized.
  • Performance can degrade when semantic shortcuts are weakened and acoustic grounding is required.
  • VoxParadox identifies failure patterns and supports mitigation strategies for more robust audio reasoning.

Try It Yourself

Play each clip and notice how the audio and the transcript point to different answers. The True Label reflects the audio; the Adversarial Label reflects what the transcript asserts.

Age prediction

What is the most likely age group of the speaker in the audio?

  • A) Young adult
  • B) Middle-aged adult
  • C) Elderly adult
  • D) Child
True Label: Elderly adult
Adversarial Label: Young adult
Gender prediction

What is the speaker's gender?

  • A) Female
  • B) Male
True Label: Male
Adversarial Label: Female
Emotion recognition

How does the speaker feel in the recording?

  • A) neutral
  • B) sadness
  • C) anger
  • D) happiness
True Label: anger
Adversarial Label: neutral
Emotion recognition

How does the speaker feel in the recording?

  • A) neutral
  • B) sadness
  • C) anger
  • D) happiness
True Label: sadness
Adversarial Label: happiness
Intonation perception

What is the intonation of the entire sentence in the audio?

  • A) Rise-fall intonation
  • B) Falling intonation
  • C) Fall-rise intonation
  • D) Rising intonation
True Label: Falling intonation
Adversarial Label: Rising intonation
Total speaker counting

How many different speakers are in the audio?

  • A) 8 people
  • B) 3 people
  • C) 2 people
  • D) 1 person
True Label: 2 people
Adversarial Label: 1 person
Speaker identity recognition

Which speaker clip belongs to the same person as speaker clip 1?

  • A) the second person
  • B) the third person
  • C) the fifth person
  • D) the fourth person
True Label: the fourth person
Adversarial Label: the second person
Pitch comparison

Which pitch pattern best matches the audio?

  • A) low-medium-high
  • B) medium-high-low
  • C) high-low-medium
  • D) medium-low-high
True Label: high-low-medium
Adversarial Label: low-medium-high
Volume comparison

Which volume pattern best matches the audio?

  • A) low-high-medium
  • B) high-medium-low
  • C) medium-low-high
  • D) low-medium-high
True Label: low-high-medium
Adversarial Label: high-medium-low
Speed comparison

Which speed pattern best matches the audio?

  • A) high-medium-low
  • B) high-low-medium
  • C) low-high-medium
  • D) medium-low-high
True Label: high-low-medium
Adversarial Label: low-high-medium
Vocal range comparison

Which vocal range pattern best matches the audio?

  • A) high-medium-low
  • B) low-medium-high
  • C) high-low-medium
  • D) medium-low-high
True Label: medium-low-high
Adversarial Label: high-medium-low, high-low-medium

VoxParadox Leaderboard

Reported model performance on VoxParadox. Higher is better for Avg. VoxParadox and MMSU Para; lower is better for Adv. label.

Rank | Model | Avg. VoxParadox (%) ↑ | Adv. label (%) ↓ | MMSU Para. (%) ↑
1 | Audio Flamingo 3 + PCLM + DPO (Ours) | 65.20 | 22.60 | 54.78
2 | Audio Flamingo 3 + PCLM (Ours) | 60.00 | 26.30 | 54.06
3 | Audio Flamingo 3 + SFT (Ours) | 34.80 | - | 44.76
4 | Audio Flamingo 2 (AF2) | 30.85 | 29.80 | 27.44
5 | Qwen2-Audio | 30.15 | 27.20 | 20.88
6 | Gemini 2.5 Flash | 24.70 | 58.20 | 51.05
7 | MiMo-Audio | 19.60 | 65.42 | 35.64
8 | Kimi-Audio | 19.00 | 60.50 | 41.48
9 | Step-Audio-R1 | 17.45 | 60.17 | 54.51
10 | Audio Flamingo 3 (AF3) | 17.40 | 68.50 | 37.74
11 | GPT-4o Audio | 8.60 | 77.80 | 36.55
12 | Qwen2.5-Omni | 7.95 | 73.67 | 33.45
13 | VITA-Audio | 6.85 | 72.75 | 29.54
14 | SALMONN | 6.10 | 26.50 | 6.84

VoxParadox average is computed over 10 paralinguistic tasks: Age, Gender, Emotion, Pitch, Volume, Speed, Range, Intonation, Speaker ID, and Speaker Counting.
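As a concrete reading of these two metrics, the sketch below scores a list of per-example records: Avg. VoxParadox is accuracy against the acoustic (true) label, and Adv. label is how often the model follows the transcript-implied answer. The field names (pred, true_label, adv_label) are hypothetical, not the released evaluation scripts.

```python
def score(records: list[dict]) -> dict:
    n = len(records)
    acc = sum(r["pred"] == r["true_label"] for r in records) / n  # listened
    adv = sum(r["pred"] == r["adv_label"] for r in records) / n   # read
    return {"avg_voxparadox": 100 * acc, "adv_label": 100 * adv}

# Example: a model that reads the transcript instead of listening.
records = [
    {"pred": "Young adult", "true_label": "Elderly adult", "adv_label": "Young adult"},
    {"pred": "Female", "true_label": "Male", "adv_label": "Female"},
]
print(score(records))  # {'avg_voxparadox': 0.0, 'adv_label': 100.0}
```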

Project Resources

We are preparing the full project release.

  • Preprint: paper PDF and supplementary material.
  • Benchmark: VoxParadox dataset, tasks, and evaluation scripts.
  • Code: baseline implementations and mitigation methods.

Please check back soon for release links.

BibTeX

@inproceedings{pang2026do,
  title={Do Audio {LLM}s Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox},
  author={Jiacheng Pang and Ashutosh Chaubey and Mohammad Soleymani},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=v7rYbRR9Zw}
}

Contact

For questions or collaboration, please contact:

Jiacheng Pang - pangj@usc.edu

Ashutosh Chaubey - achaubey@usc.edu