Speech Emotion Recognition with Hybrid CNN-LSTM and Transformer Models: Evaluating the Hybrid Model Using Grad-CAM
Date: 2025-07
Authors: Kumari, HMLS; Kumari, HMNS; Nawarathne, UMMPK

Abstract
Emotion recognition and classification using artificial intelligence (AI) techniques plays a crucial role in
human-computer interaction (HCI), enabling the prediction of human emotions from audio signals, with broad
applications in psychology, medicine, education, and entertainment. This research focused on speech emotion
recognition (SER) by employing classification methods and transformer models on the Toronto Emotional Speech
Set (TESS). Initially, acoustic features were extracted from the audio dataset using several feature extraction
techniques, including chroma, Mel-scaled spectrogram, spectral contrast features, and Mel-Frequency Cepstral Coefficients (MFCCs).
Then, the study employed a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, and a hybrid
CNN-LSTM model to classify emotions. To compare the performance of these models, classical image transformer
models such as ViT (Vision Transformer) and BEiT (Bidirectional Encoder representation from Image Transformers) were
applied to Mel-spectrograms derived from the same dataset. Evaluation metrics such as accuracy, precision, recall,
and F1-score were calculated for each model to ensure a comprehensive performance comparison. According
to the results, the hybrid model outperformed the other models, achieving an accuracy of 99.01%, while the CNN,
LSTM, ViT, and BEiT models demonstrated accuracies of 95.37%, 98.57%, 98%, and 98.3%, respectively. To interpret
the output of the hybrid model and to provide visual explanations of its predictions, Grad-CAM (Gradient-weighted
Class Activation Mapping) visualizations were generated. This technique reduces the black-box character of deep models, making them
more trustworthy for use in clinical and other sensitive contexts. In conclusion, the hybrid CNN-LSTM model showed strong
performance in audio-based emotion classification.
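The abstract does not report implementation details; the following is a minimal sketch of the kind of acoustic feature extraction it describes (chroma, Mel-scaled spectrogram, spectral contrast, and MFCCs), assuming the librosa library and a per-file summary obtained by averaging each feature over time. The sample rate and number of MFCC coefficients are illustrative assumptions, not values reported by the authors.

```python
# Sketch: per-file acoustic feature extraction with librosa
# (chroma, Mel-scaled spectrogram, spectral contrast, MFCCs).
# sr and n_mfcc below are illustrative assumptions.
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)     # (n_mfcc, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # (12, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)           # (128, frames)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)   # (7, frames)
    # Average each feature over time and stack into one fixed-length vector.
    return np.concatenate([
        mfcc.mean(axis=1),
        chroma.mean(axis=1),
        mel.mean(axis=1),
        contrast.mean(axis=1),
    ])
```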
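The exact CNN-LSTM architecture is likewise not specified in the abstract; the layer sizes below are assumptions chosen only to illustrate the hybrid idea: 1D convolutions extract local patterns from the feature sequence, and an LSTM models their temporal dependencies before a softmax classifier over the TESS emotion categories.

```python
# Illustrative hybrid CNN-LSTM classifier (layer sizes are assumptions,
# not the architecture reported in the paper).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # TESS covers seven emotion categories

def build_cnn_lstm(input_shape):
    # input_shape = (time_steps, features), e.g. framed acoustic features
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(128),                      # temporal modelling of conv features
        layers.Dropout(0.3),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```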
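Grad-CAM itself follows a standard recipe: take the gradient of the predicted class score with respect to the last convolutional feature maps, average it into per-channel weights, and form a weighted, ReLU-rectified activation map. A minimal 1D version compatible with a Keras-style model like the sketch above follows; the layer-name argument is an assumption (whatever the last Conv1D layer is called in the actual model).

```python
# Minimal Grad-CAM sketch for a 1D CNN-LSTM (Keras), highlighting which
# time regions of the input features drive a given emotion prediction.
import numpy as np
import tensorflow as tf

def grad_cam_1d(model, x, last_conv_layer_name, class_index=None):
    # x: one input example with a batch dimension, shape (1, time_steps, features)
    grad_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(x)
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)         # (1, time, channels)
    weights = tf.reduce_mean(grads, axis=(0, 1))         # per-channel importance
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)  # weighted sum over channels
    cam = tf.nn.relu(cam)                                # keep positive evidence only
    cam = cam / (tf.reduce_max(cam) + 1e-8)              # normalise to [0, 1]
    return cam.numpy()                                   # one value per conv time step
```

The resulting map can be upsampled to the input time axis and overlaid on the Mel-spectrogram to visualise which segments of an utterance the model relied on for its prediction.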