Abstract:
The Internet of Things (IoT)-enabled wireless body area network (WBAN) is an emerging
technology that combines medical devices, wireless devices, and non-medical devices for healthcare
management applications. Speech emotion recognition (SER) is an active research field in the
healthcare domain and machine learning. It is a technique that can be used to automatically identify
speakers’ emotions from their speech. However, SER systems, especially in the healthcare domain,
face several challenges: low prediction accuracy, high computational
complexity, delays in real-time prediction, and difficulty identifying appropriate features from speech.
Motivated by these research gaps, we proposed an emotion-aware IoT-enabled WBAN system within
the healthcare framework in which an edge AI system performs data processing and long-range data
transmission to predict patients’ speech emotions in real time and to capture
changes in emotions before and after treatment. Additionally, we investigated the effectiveness
of different machine learning and deep learning algorithms in terms of classification performance,
feature extraction methods, and normalization methods. We developed a hybrid deep learning
model that combines a convolutional neural network (CNN) with bidirectional long short-term memory (BiLSTM),
as well as a regularized CNN model. We combined the models with different optimization strategies and
regularization techniques to improve prediction accuracy, reduce generalization error, and lower
the computational complexity of the neural networks in terms of computational time, power,
and space. Several experiments were performed to evaluate the efficiency and effectiveness of the
proposed machine learning and deep learning algorithms. The proposed models were compared with
a related existing model for evaluation and validation using standard performance metrics such
as prediction accuracy, precision, recall, F1 score, the confusion matrix, and the differences between
the actual and predicted values. The experimental results showed that one of the proposed models
outperformed the existing model, achieving an accuracy of about 98%.
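
As an illustration of the standard metrics named above, the following minimal sketch derives accuracy and per-class precision, recall, and F1 score from a confusion matrix. The function name `metrics_from_confusion`, the convention that rows are actual classes and columns are predicted classes, and the two-class toy matrix are assumptions made for illustration; they are not taken from the study and do not reproduce its results.

```python
def metrics_from_confusion(cm):
    """Return (accuracy, per-class metrics) for a square confusion matrix.

    Assumed convention: cm[i][j] counts samples whose actual class is i
    and whose predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    accuracy = correct / total if total else 0.0

    per_class = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    return accuracy, per_class


# Toy two-class matrix (hypothetical counts, not data from the study).
toy_cm = [[8, 2],
          [1, 9]]
accuracy, per_class = metrics_from_confusion(toy_cm)
```

For a multi-class emotion task the same function applies unchanged to a larger square matrix; per-class values could then be averaged (for example, a macro average) if a single summary figure is wanted.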