Cross-corpus speech emotion recognition with transformers: Leveraging handcrafted features and data augmentation.

Alroobaea, Roobaea

Affiliation

Alroobaea R; Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia. Electronic address: r.robai@tu.edu.sa.

Comput Biol Med ; 179: 108841, 2024 Sep.

Article in English | MEDLINE | ID: mdl-39002317

ABSTRACT

Speech emotion recognition (SER) is a prominent and dynamic research field in data science due to its extensive application in domains such as psychological assessment, mobile services, and computer games. In previous research, numerous studies used manually engineered features for emotion classification and achieved commendable accuracy. However, these features tend to underperform in complex scenarios, leading to reduced classification accuracy. These scenarios include: (1) datasets that contain diverse speech patterns, dialects, accents, or variations in emotional expression; (2) data with background noise; (3) cases where the distribution of emotions varies significantly across datasets; and (4) combined datasets from different sources, which introduce complexity due to variations in recording conditions, data quality, and emotional expression. Consequently, there is a need to improve the classification performance of SER techniques. To address this, a novel SER framework was introduced in this study. Prior to feature extraction, signal preprocessing and data augmentation methods were applied to enlarge the available data, and 18 informative features were then derived from each signal. A discriminative feature set was obtained using feature selection techniques and was then used as input for emotion recognition on the SAVEE, RAVDESS, and EMO-DB datasets. Furthermore, this research also implemented a cross-corpus model that incorporated all speech files related to the emotions common to the three datasets. The experimental outcomes demonstrated the superior performance of the SER framework compared to existing frameworks in the field. Specifically, the proposed model obtained accuracies of 95%, 94%, 97%, and 97% on the SAVEE, RAVDESS, EMO-DB, and cross-corpus datasets, respectively.
These results underscore the significant contribution of our proposed framework to the field of SER.
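
The abstract does not state which augmentation methods were used or how the cross-corpus set was assembled. As a minimal sketch only, the following Python assumes additive white Gaussian noise at a chosen signal-to-noise ratio as one possible augmentation, and builds a cross-corpus list by keeping only the emotion labels present in every corpus. The function names, file names, and toy labels are hypothetical and are not taken from the paper.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Return the signal plus white Gaussian noise at the requested SNR in dB.

    Assumption: noise injection stands in for the unspecified augmentation.
    """
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def common_emotions(corpora):
    """Return the set of emotion labels that occur in every corpus."""
    label_sets = [{label for _, label in items} for items in corpora.values()]
    return set.intersection(*label_sets)


def build_cross_corpus(corpora):
    """Merge all files whose label is shared by every corpus."""
    keep = common_emotions(corpora)
    return [
        (name, path, label)
        for name, items in corpora.items()
        for path, label in items
        if label in keep
    ]


if __name__ == "__main__":
    # Hypothetical toy corpora: (file name, emotion label) pairs.
    toy = {
        "corpus_a": [("a1.wav", "happy"), ("a2.wav", "sad"), ("a3.wav", "surprise")],
        "corpus_b": [("b1.wav", "happy"), ("b2.wav", "sad"), ("b3.wav", "boredom")],
    }
    print(sorted(common_emotions(toy)))
    print(len(build_cross_corpus(toy)))

    # Augment a synthetic tone at 20 dB SNR with a seeded generator.
    clean = [math.sin(2 * math.pi * t / 50) for t in range(200)]
    noisy = add_noise(clean, 20, random.Random(0))
    print(len(noisy))
```

In a real pipeline the augmented signals would feed feature extraction and feature selection before classification; those steps are omitted here because the abstract does not name the 18 features or the selection method.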

Subject(s)

Emotions; Humans; Emotions/physiology; Speech/physiology; Male; Female; Speech Recognition Software; Databases, Factual; Signal Processing, Computer-Assisted

Keywords

Cross-corpus; Data augmentation; Handcrafted features; Speech emotion recognition; Transformer

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Emotions Limits: Female / Humans / Male Language: En Journal: Comput Biol Med Year: 2024 Document type: Article Country of publication: United States
