Speech Emotion Recognition using Self-Supervised Features

Speech emotion recognition (SER) can be used in call center dialogue analysis, mental health applications, or spoken dialogue systems.

Audio recordings can be used for automatic speech emotion recognition. Image credit: Alex Regan via Wikimedia, CC BY 2.0

A recent paper published on arXiv.org formulates the SER problem as a mapping from the continuous speech domain into the discrete domain of categorical emotion labels.
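Stated slightly more formally (a sketch: the four-class label set below reflects the common IEMOCAP setup and is an assumption of this summary, not quoted from the paper), the model learns a function f : X → Y, where X is the space of continuous speech signals and Y = {angry, happy, sad, neutral} is a finite set of categorical emotion labels.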

The researchers use the Upstream + Downstream architecture paradigm to allow easy use and integration of a large collection of self-supervised features. The Upstream model, pre-trained in a self-supervised fashion, is responsible for feature extraction. The Downstream model is a task-dependent network that classifies the features produced by the Upstream model into categorical emotion labels.
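As a rough illustration of this paradigm (a minimal sketch, not the authors' code), the snippet below pairs a pre-trained wav2vec 2.0 model from the Hugging Face transformers library as the Upstream feature extractor with a small Downstream classifier; the specific checkpoint name, the mean-pooling step, and the four emotion classes are assumptions made for the example.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class UpstreamDownstreamSER(nn.Module):
    """Minimal Upstream + Downstream SER sketch (illustrative, not the paper's code)."""

    def __init__(self, upstream_name="facebook/wav2vec2-base", num_emotions=4, freeze_upstream=True):
        super().__init__()
        # Upstream: self-supervised pre-trained feature extractor.
        self.upstream = Wav2Vec2Model.from_pretrained(upstream_name)
        if freeze_upstream:
            for p in self.upstream.parameters():
                p.requires_grad = False
        hidden = self.upstream.config.hidden_size
        # Downstream: task-dependent classifier over utterance-level features.
        self.downstream = nn.Sequential(
            nn.Linear(hidden, 256),
            nn.ReLU(),
            nn.Linear(256, num_emotions),
        )

    def forward(self, waveform):
        # waveform: (batch, samples), 16 kHz mono audio.
        frame_feats = self.upstream(waveform).last_hidden_state   # (batch, frames, hidden)
        utt_feats = frame_feats.mean(dim=1)                       # simple mean pooling over frames
        return self.downstream(utt_feats)                         # emotion logits


model = UpstreamDownstreamSER()
logits = model(torch.randn(2, 16000))   # two 1-second dummy utterances
print(logits.shape)                     # torch.Size([2, 4])
```

The freeze_upstream flag only gestures at the frozen-versus-fine-tuned comparison that the paper explores; the actual fine-tuning setup is described in the paper itself.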

Experimental results show that even when using only the speech modality, the proposed method can reach results comparable to those achieved by multimodal systems, which use both Speech and Text modalities.

Self-supervised pre-trained features have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features and back-end classification networks. The proposed monomodal speech-only based system not only achieves SOTA results, but also brings light to the possibility of powerful and well fine-tuned self-supervised acoustic features that reach results comparable to the results achieved by SOTA multimodal systems using both Speech and Text modalities.
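To make the "aggregation of frame-level features into utterance-level features" step concrete, here is a small attentive-pooling layer (an illustrative sketch; the paper compares several aggregation and back-end choices, and this particular design is not necessarily one of them):

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Aggregate frame-level features (batch, frames, dim) into one utterance-level vector."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one scalar attention score per frame

    def forward(self, frame_feats):
        # frame_feats: (batch, frames, dim)
        weights = torch.softmax(self.score(frame_feats), dim=1)   # (batch, frames, 1)
        return (weights * frame_feats).sum(dim=1)                 # weighted sum -> (batch, dim)


pool = AttentivePooling(dim=768)
utt = pool(torch.randn(2, 49, 768))   # 49 frames of 768-dim upstream features
print(utt.shape)                      # torch.Size([2, 768])
```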

Research paper: Morais, E., Hoory, R., Zhu, W., Gat, I., Damasceno, M., and Aronowitz, H., “Speech Emotion Recognition using Self-Supervised Features”, 2022. Link: https://arxiv.org/abs/2202.03896