Continuous Optimization for Speech Emotion Recognition (SER) Based on Deep Learning and CMARS

Date: 2023
Author: Triandi, Budi
Advisor(s): Efendi, Syahril; Mawengkang, Herman; Sawaluddin

Abstract
Emotion is a condition that influences events occurring in the human subconscious, and a person's emotional state can shape their behavior. One way to identify a person's emotions is through changes in the feature data contained in the voice. Voice signals carry complex feature data in very large volumes and contain uncertain parameters, which makes potential emotions difficult to predict; the difficulty increases when the voice signal is limited to a single form of emotion. Further research is therefore needed to study non-parametric predictive analysis in greater depth, optimizing the influential features so as to obtain optimal patterns that capture potential changes in a person's emotional state.

Collections of multiple data sets require analytical solutions appropriate to the phenomena involved, under the conditions in which the data are measured or for which numerical hypothetical results are available. Many methods have been applied to this problem, such as optimization to obtain optimal estimates from continuous data for prediction tasks. Deep learning approaches can be used in Speech Emotion Recognition (SER) research, but an alternative is needed to overcome recurring problems such as overfitting caused by long trial-and-error training.

This dissertation focuses on optimization for SER by developing a learning technique based on nonparametric regression with conic multivariate adaptive regression splines (CMARS), which describes the relationship between the dependent and independent variables and interprets the relationships among the various parameters mathematically. The research uses the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), with 600 sound files in *.WAV format serving as training and test data. Voice features are extracted with the mel-frequency cepstral coefficient (MFCC) technique to obtain the mel-cepstral coefficients.

The tests carried out in this research yield a generalized cross-validation (GCV) value of 0.0130, an estimated root mean squared error (RMSE) of 0.0062, and a prediction fit of R² score (RSq) = 0.9720, or 97.20%, indicating that the proposed model is the best nonparametric CMARS regression model for the speech emotion recognition (SER) problem.
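
For context, the sketch below gives the standard CMARS formulation from the literature; the notation is generic and assumed here, not quoted from the dissertation. MARS expresses the response as a sum of products of hinge-function basis terms, and CMARS replaces the backward pruning step of MARS with Tikhonov regularization recast as a conic quadratic program (CQP):

```latex
% MARS model: a sum of products of truncated linear (hinge) basis functions
f(\mathbf{x}) = \theta_0 + \sum_{m=1}^{M} \theta_m B_m(\mathbf{x}),
\qquad
B_m(\mathbf{x}) = \prod_{j=1}^{K_m} \big[\, s_{jm}\,\big(x_{v(j,m)} - \tau_{jm}\big) \big]_{+}

% CMARS: penalized residual sum of squares, recast as a CQP
\min_{t,\,\boldsymbol{\theta}} \; t
\quad \text{s.t.} \quad
\big\| \mathbf{B}\boldsymbol{\theta} - \mathbf{y} \big\|_2 \le t,
\qquad
\big\| \mathbf{L}\boldsymbol{\theta} \big\|_2 \le \sqrt{\tilde{M}}
```

Here \mathbf{B} is the basis-function matrix evaluated on the training data, \mathbf{L} is a discretized roughness-penalty matrix, and \sqrt{\tilde{M}} is a bound chosen by the modeler; the CQP form is what lets interior-point solvers handle the large, uncertain feature sets described above.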
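The abstract does not include the extraction code; the following is a minimal sketch of MFCC feature extraction using the librosa library, assuming the usual RAVDESS file layout. The coefficient count of 13 is an illustrative default, not the dissertation's actual configuration.

```python
import numpy as np
import librosa


def extract_mfcc(path, n_mfcc=13):
    """Load a *.WAV file and return its averaged MFCC feature vector.

    n_mfcc=13 is an illustrative assumption; the dissertation's actual
    settings are not stated in the abstract.
    """
    y, sr = librosa.load(path, sr=None)  # keep the file's native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # collapse the time axis: one vector per file


# Hypothetical usage on a RAVDESS speech file:
# features = extract_mfcc("RAVDESS/Actor_01/03-01-01-01-01-01-01.wav")
```

Averaging over the time axis is one common way to obtain a fixed-length feature vector per recording for a regression model; frame-level features are an equally valid alternative.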
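The reported GCV, RMSE, and R² values follow standard definitions; for reference, the forms commonly used with MARS-type models are shown below. C(M), the effective number of parameters, follows Friedman's convention and is an assumption, not a detail given in the abstract:

```latex
\mathrm{GCV} = \frac{\frac{1}{N}\sum_{i=1}^{N}\big(y_i - \hat{f}(\mathbf{x}_i)\big)^2}
                    {\big(1 - C(M)/N\big)^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2},
\qquad
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}
```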