
    Speech Emotion Recognition with Hybrid CNN-LSTM and Transformers Models: Evaluating the Hybrid Model Using Grad-CAM

    View/Open
    IJRCV4I2_158.pdf (745.0Kb)
    Date
    2024-07
    Author
    Kumari, HMLS
    Kumari, HMNS
    Nawarathne, UMMPK
    Abstract
    Emotion recognition and classification using artificial intelligence (AI) techniques play a crucial role in human-computer interaction (HCI), enabling the prediction of human emotions from audio signals, with broad applications in psychology, medicine, education, entertainment, and other fields. This research focused on speech emotion recognition (SER) by employing classification methods and transformer models on the Toronto Emotional Speech Set (TESS). Initially, acoustic features were extracted from the audio dataset using several feature extraction techniques, including chroma, Mel-scaled spectrogram, spectral contrast features, and Mel Frequency Cepstral Coefficients (MFCCs). This study then employed a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, and a hybrid CNN-LSTM model to classify emotions. To compare the performance of these models, classical image transformer models such as ViT (Vision Transformer) and BEiT (Bidirectional Encoder representation from Image Transformers) were applied to the Mel-spectrograms derived from the same dataset. Evaluation metrics such as accuracy, precision, recall, and F1-score were calculated for each model to ensure a comprehensive performance comparison. According to the results, the hybrid model performed better than the other models, achieving an accuracy of 99.01%, while the CNN, LSTM, ViT, and BEiT models achieved accuracies of 95.37%, 98.57%, 98%, and 98.3%, respectively. To interpret the output of the hybrid model and provide visual explanations of its predictions, Grad-CAM (Gradient-weighted Class Activation Mapping) visualizations were obtained. This technique reduces the black-box character of deep models, making them more reliable to use in clinical and other sensitive contexts. In conclusion, the hybrid CNN-LSTM model showed strong performance in audio-based emotion classification.
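    A rough, hypothetical sketch of the pipeline the abstract describes is given below: it extracts chroma, Mel-spectrogram, spectral-contrast, and MFCC features with librosa and defines a small hybrid CNN-LSTM classifier in Keras. The layer sizes, hyperparameters, and file paths are illustrative assumptions and not the configuration reported in the paper; only the seven-class output follows from the TESS emotion categories.

        # Hypothetical sketch of feature extraction + hybrid CNN-LSTM classification.
        # Layer sizes and hyperparameters are illustrative guesses, not the values
        # reported in the paper.
        import numpy as np
        import librosa
        from tensorflow.keras import layers, models

        NUM_CLASSES = 7  # TESS contains seven emotion categories

        def extract_features(path, sr=22050, n_mfcc=40):
            """Concatenate mean MFCC, chroma, Mel-spectrogram, and spectral-contrast features."""
            y, sr = librosa.load(path, sr=sr)
            mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
            chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
            mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
            contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1)
            return np.concatenate([mfcc, chroma, mel, contrast])

        def build_hybrid_cnn_lstm(input_len):
            """1-D CNN front end followed by an LSTM, ending in a softmax over emotions."""
            model = models.Sequential([
                layers.Input(shape=(input_len, 1)),
                layers.Conv1D(64, kernel_size=5, activation="relu"),
                layers.MaxPooling1D(pool_size=2),
                layers.Conv1D(128, kernel_size=5, activation="relu"),
                layers.MaxPooling1D(pool_size=2),
                layers.LSTM(128),
                layers.Dropout(0.3),
                layers.Dense(NUM_CLASSES, activation="softmax"),
            ])
            model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
            return model

        # Usage (paths and labels are placeholders):
        # X = np.stack([extract_features(p) for p in wav_paths])[..., np.newaxis]
        # model = build_hybrid_cnn_lstm(X.shape[1])
        # model.fit(X, y_labels, epochs=50, validation_split=0.2)

    A Grad-CAM heat map taken over the convolutional layers of such a model can then highlight which regions of the input drive each prediction, which is the interpretability step the abstract reports.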
    URI
    https://ir.kdu.ac.lk/handle/345/8907
    DOI
    10.64701/ijrc/345/8907
    Collections
    • Volume 03, Issue 01, 2024 [11]

    Library copyright © 2017  General Sir John Kotelawala Defence University, Sri Lanka
    Contact Us | Send Feedback