so there's this 2025 paper showing speech emotion models get much better when you force them to explain themselves. not just "the speaker is angry" but "the speaker is angry because they say X / I detected vocal cue Y / here's the evidence."

the method is almost embarrassingly simple: take the transcript and ground-truth emotion label, prompt an LLM to generate an explanation grounding the label in what was actually said, then use THAT as the supervision signal. training on these reasoning-augmented targets instead of bare labels improved emotion recognition by ~20% across IEMOCAP and MELD.

they also test on out-of-domain data (mandarin TV, singlish youtube), and the reasoning model STILL generalizes better than emotion2vec+ large, R1-AQA, and audio-reasoner, even though it was only trained on english dyadic conversations and episodes of the TV show Friends. a classifier memorizes a distribution; a reasoning model learns what emotions actually sound like. intuitive but still lowkey wild.
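the pipeline is simple enough to sketch. here's a rough picture of the data-prep step, assuming a generic chat-LLM call — `ask_llm`, the prompt wording, and both helper functions are my inventions, not the paper's actual setup:

```python
def build_explanation_prompt(transcript: str, label: str) -> str:
    """Ask an LLM to ground the gold emotion label in the transcript.

    (hypothetical prompt wording — the paper's exact template may differ)
    """
    return (
        f'Transcript: "{transcript}"\n'
        f"The speaker's emotion is: {label}.\n"
        "Explain, citing specific words or cues from the transcript, "
        f'why this label fits. End with "Emotion: {label}".'
    )


def make_training_target(explanation: str, label: str) -> str:
    """The supervision signal: reasoning first, then the label.

    This reasoning-augmented target replaces the bare label as the
    training objective for the speech emotion model.
    """
    return f"{explanation.strip()}\nEmotion: {label}"


# one training pair: input stays the audio/transcript,
# target becomes explanation + label instead of just "angry"
prompt = build_explanation_prompt("I can't believe you did this again!", "angry")
explanation = (
    # in practice this string would come from ask_llm(prompt) —
    # a stand-in here so the sketch runs without an API
    "The speaker says 'I can't believe you did this again', an "
    "exasperated phrase signaling repeated frustration."
)
target = make_training_target(explanation, "angry")
```

the key design choice: the LLM sees the ground-truth label at generation time, so it's not guessing the emotion — it's only being asked to produce evidence for a label that's already known. the downstream model then learns to emit evidence + label instead of label alone.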