LU600124B1 - Voice emotion recognition method and system - Google Patents

Voice emotion recognition method and system

Info

Publication number
LU600124B1
Authority
LU
Luxembourg
Prior art keywords
voice
emotion
signal
voice signal
model
Prior art date
Application number
LU600124A
Other languages
German (de)
Inventor
Hexing Wang
Yanjun Chen
Original Assignee
Univ Northeastern Qinhuangdao
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Northeastern Qinhuangdao filed Critical Univ Northeastern Qinhuangdao
Priority to LU600124A priority Critical patent/LU600124B1/en
Application granted granted Critical
Publication of LU600124B1 publication Critical patent/LU600124B1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a voice emotion recognition method and system, where the method includes the following steps: acquiring a voice signal and preprocessing the voice signal to acquire the preprocessed voice signal; extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; acquiring attribute information of a speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a deep neural network model and obtained by training on a training set, and the training set includes voice data of the same crowd in different emotional states; inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and obtaining the cultural background information of the speaker, and modifying the emotional state set according to the cultural background information.

Description

LU600124
DESCRIPTION
VOICE EMOTION RECOGNITION METHOD AND SYSTEM
TECHNICAL FIELD
The invention relates to the technical field of voice emotion recognition, in particular to a voice emotion recognition method and system.
BACKGROUND
A key technical problem in voice emotion recognition is how to extract feature parameters that accurately reflect changes of emotional state from voice signals. Traditional feature extraction methods, such as fundamental frequency and energy, are strongly influenced by individual differences between speakers, and their description of voice emotion is not comprehensive enough. Finding new voice emotion features with strong robustness and high discriminative power is therefore very important. This requires digging deep into the physiological and acoustic mechanisms closely related to emotion in the process of voice production, and considering the influence of the speaker's gender, age, cultural background and other factors. The extraction of such new emotional features requires knowledge across many disciplines, such as voice recognition, the psychology of emotion and acoustic phonetics, which poses a new challenge to existing feature extraction frameworks.
SUMMARY
The purpose of the invention is to provide a voice emotion recognition method and system, which fully consider the influence of individual differences of speakers and cultural factors on emotional expression, and significantly improve the accuracy and adaptability of voice emotion recognition.
In order to achieve the above objectives, the invention provides the following scheme:
A voice emotion recognition method includes the following steps: acquiring a voice signal and preprocessing the voice signal to acquire the preprocessed voice signal; extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; acquiring attribute information of a speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a deep neural network model and obtained by training on a training set, and the training set includes voice data of the same crowd in different emotional states; inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and acquiring the cultural background information of the speaker, and modifying the emotional state set according to the cultural background information to obtain an emotional recognition result.
Optionally, obtaining the preprocessed voice signal includes: converting the voice signal into a digital signal, and framing the digital signal; calculating the short-time energy and the short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing the mute part according to the voice frames, obtaining a voice signal from which the mute part is removed and performing frequency domain transformation to obtain a frequency domain signal; and denoising the frequency domain signal and performing inverse transformation to obtain the preprocessed voice signal.
Optionally, obtaining the acoustic parameter set includes: extracting the preprocessed voice signal by adopting a pitch detection algorithm to obtain fundamental frequency parameters; extracting the preprocessed voice signal by adopting a short-time energy analysis method to obtain energy parameters;
extracting the preprocessed voice signal by adopting linear predictive coding technology to obtain formant parameters; and acquiring the acoustic parameter set according to the fundamental frequency parameters, the energy parameters and the formant parameters.
Optionally, inputting the acoustic parameter set into a corresponding voice emotion model and obtaining the emotion state set includes: the convolution layer of the deep neural network extracts the local features of the acoustic parameter set, the recurrent layer captures the temporal relationships of the acoustic parameter set, and the fully connected layer maps the local features to different emotion categories to obtain the probability distribution of emotion categories.
Optionally, the method further includes inputting the acoustic parameter set into a universal voice emotion model to obtain an emotion state set if there is no corresponding voice emotion model in the pre-built voice emotion model base, where the training set of the universal voice emotion model includes voice data of different people with different emotion states.
Optionally, obtaining the cultural background information of the speaker and modifying the emotional state set according to the cultural background information includes: acquiring the cultural background information according to the attribute information; according to the cultural background information, obtaining a corresponding emotional expression feature vector from an emotional expression knowledge base; and carrying out weighted correction on each emotion category in the emotion state set according to the emotion expression feature vector to obtain the emotion recognition result.
Optionally, the weighted correction of each emotion category in the emotion state set according to the emotion expression feature vector includes: performing matching calculation on the emotional state set and the emotional expression feature vector, and judging whether correction is needed through a similarity threshold;
if so, performing weighted correction on each emotion category in the emotion state set according to the emotion expression feature vector, and processing the corrected emotion state set through a softmax normalization function to obtain the emotion recognition result.
On the other hand, the invention also provides a voice emotion recognition system, which includes a preprocessing module, an acoustic parameter extraction module, a voice emotion model selection module, a voice emotion recognition module and a recognition result correction module; the preprocessing module is used for acquiring and preprocessing a voice signal to acquire the preprocessed voice signal; the acoustic parameter extraction module is used for extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; the voice emotion model selection module is used for acquiring the attribute information of a speaker and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a deep neural network model and obtained through training on a training set, and the training set includes voice data of the same crowd in different emotional states; the voice emotion recognition module is used for inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and the recognition result correction module is used for obtaining the cultural background information of the speaker, and correcting the emotional state set according to the cultural background information to obtain an emotional recognition result.
Optionally, obtaining the preprocessed voice signal includes: converting the voice signal into a digital signal, and framing the digital signal; calculating the short-time energy and the short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate;
removing the mute part according to the voice frames, obtaining a voice signal from which the mute part is removed and performing frequency domain transformation to obtain a frequency domain signal; and denoising the frequency domain signal and performing inverse transformation to obtain the preprocessed voice signal.
Optionally, obtaining the acoustic parameter set includes: extracting the preprocessed voice signal by adopting a pitch detection algorithm to obtain fundamental frequency parameters; extracting the preprocessed voice signal by adopting a short-time energy analysis method to obtain energy parameters; extracting the preprocessed voice signal by adopting linear predictive coding technology to obtain formant parameters; acquiring the acoustic parameter set according to the fundamental frequency parameter, the energy parameter and the formant parameter.
The invention has the following beneficial effects. The disclosed voice emotion recognition method first preprocesses an input voice signal and extracts acoustic features; it then selects a corresponding model from a pre-built voice emotion model library for emotion judgment according to attribute information of the speaker, such as gender and age; finally, combined with the speaker's cultural background information, the emotional state is corrected to obtain the final recognition result.
Through multi-level and multi-dimensional analysis, the invention fully considers the influence of individual differences and cultural factors of speakers on emotional expression, and significantly improves the accuracy and adaptability of voice emotion recognition. This method can be widely used in intelligent customer service, emotional computing and other scenes that need to accurately identify voice emotions.
BRIEF DESCRIPTION OF THE FIGURES
In order to explain the embodiments of the invention or the technical scheme in the prior art more clearly, the drawings needed in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and other drawings can be obtained from these drawings by those of ordinary skill in the art without creative work.
Fig. 1 is a flowchart of a voice emotion recognition method according to an embodiment of the invention.
DESCRIPTION OF THE INVENTION
In the following, the technical scheme in the embodiment of the invention will be clearly and completely described with reference to the attached drawings.
Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative labor fall within the scope of protection of the invention.
In order to make the above objects, features and advantages of the invention more obvious and easy to understand, the invention will be further described in detail with the attached drawings and specific embodiments.
Embodiment 1: as shown in Fig. 1, this embodiment provides a voice emotion recognition method, which includes: acquiring a voice signal and preprocessing the voice signal to acquire the preprocessed voice signal; extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; acquiring attribute information of a speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a
deep neural network model and obtained by training on a training set, and the training set includes voice data of the same crowd in different emotional states; inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and obtaining the cultural background information of the speaker, and modifying the emotional state set according to the cultural background information to obtain the emotional recognition result.
Further, obtaining the preprocessed voice signal includes: converting voice signals into digital signals, and framing the digital signals; calculating the short-time energy and the short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing the mute part according to the voice frames, obtaining the voice signal from which the mute part is removed and performing frequency domain transformation to obtain the frequency domain signal; and denoising the frequency domain signal and inverse-transforming it to obtain the preprocessed voice signal.
Specifically, voice signals are acquired and converted into digital signals for processing. The digital signal is framed, with each frame 20-30 ms long and a frame shift of 10-15 ms. The short-time energy and short-time zero-crossing rate of each frame signal are calculated, and whether the frame is a voice frame is judged according to the energy and zero-crossing rate. Endpoint detection is carried out on the voice frames, the starting and ending points of the voice are determined, and the mute parts before and after the voice are removed. A frequency domain transform, such as the FFT, is performed on the voice segment to obtain frequency domain signals. The frequency domain signal is denoised by spectral subtraction: the power spectrum of the noise is estimated, and the estimated noise spectrum is subtracted from the voice spectrum to obtain the denoised voice spectrum. The denoised voice spectrum is inverse-transformed, for example by the IFFT, to obtain the denoised voice signal, which is output as the preprocessed voice signal.
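As a rough illustration of the preprocessing steps above, the following Python sketch frames a signal, computes short-time energy and zero-crossing rate for voice-frame detection, and applies a simple single-block spectral subtraction. The thresholds, FFT size and frame parameters are illustrative choices, not values fixed by the patent.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (e.g. 25 ms frames, 10 ms shift)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames.astype(float) ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of sample-to-sample sign changes in each frame."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def is_voice_frame(frames, energy_thr, zcr_thr):
    """A frame counts as voice when its energy is high and its zero-crossing rate low."""
    return (short_time_energy(frames) > energy_thr) & (zero_crossing_rate(frames) < zcr_thr)

def spectral_subtraction(noisy, noise_ref, n_fft=512):
    """Subtract an estimated noise magnitude spectrum, keep the phase, and invert."""
    noise_mag = np.abs(np.fft.rfft(noise_ref[:n_fft]))
    spec = np.fft.rfft(noisy[:n_fft])
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor negative values at zero
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=n_fft)
```

At a 16 kHz sampling rate, `frame_len=400` and `hop=160` correspond to 25 ms frames with a 10 ms shift, within the ranges stated above.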
In voice emotion recognition, preprocessing can highlight personal voice characteristics and improve recognition accuracy. Generally speaking, the application of these preprocessing technologies makes the voice signal more "clean" and "regular", which is beneficial to the subsequent analysis and processing. By eliminating irrelevant noise and highlighting effective information, the performance and robustness of voice processing system can be significantly improved.
Further, obtaining the acoustic parameter set includes: extracting fundamental frequency parameters from the preprocessed voice signal using a pitch detection algorithm; extracting energy parameters using a short-time energy analysis method; extracting formant parameters using linear predictive coding technology; and obtaining the acoustic parameter set from the fundamental frequency, energy and formant parameters.
Specifically, pitch detection is a key step in voice signal processing, used to extract the fundamental frequency parameters of voice. Pitch detection algorithms include the autocorrelation method and the cepstrum method. Taking the autocorrelation method as an example, the fundamental frequency is estimated by calculating the correlation between the voice signal and a delayed version of itself.
For a typical male speaker, the fundamental frequency usually lies between 80 and 160 Hz. Suppose the fundamental frequency of a voice segment is 120 Hz; this means the vocal cords vibrate 120 times per second. Short-time energy analysis is an important method for evaluating the strength of a voice signal: it divides the voice signal into short time frames and calculates the energy of each frame.
For example, for a 20 ms voice frame, the sum of the squared sample values can be taken as its energy value. By observing changes in the energy value, the beginning and end of voice can be judged, and voiced sounds can be distinguished from unvoiced sounds.
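The autocorrelation pitch estimator described above can be sketched as follows; the lag search bounds and frame length are illustrative assumptions of this sketch.

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one voiced frame by autocorrelation.

    The strongest autocorrelation peak is searched between the lag that
    corresponds to fmax (shortest period) and the lag that corresponds
    to fmin (longest period).
    """
    frame = frame - np.mean(frame)
    # Non-negative-lag half of the full autocorrelation sequence.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lag_lo = int(fs / fmax)
    lag_hi = int(fs / fmin)
    lag = lag_lo + int(np.argmax(ac[lag_lo : lag_hi + 1]))
    return fs / lag
```

For the 120 Hz example above, a frame of a 120 Hz sine sampled at 16 kHz yields an estimate within a few hertz of 120 Hz, the resolution being limited by the integer lag grid.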
Linear predictive coding (LPC) is a widely used voice signal analysis technology for extracting formant parameters. LPC assumes that the current voice sample can be predicted as a linear combination of past samples.
By solving for the LPC coefficients, the frequency response of the vocal tract can be obtained and the formant frequencies estimated. For example, for an adult female speaker, the first formant (F1) is usually in the range of 300-800 Hz, and the second formant (F2) in the range of 1000-2500 Hz. The fundamental frequency, energy and formant parameters are combined into an acoustic parameter set, which provides the basis for subsequent acoustic modeling. Together, these parameters describe the pitch, loudness and timbre characteristics of the voice.
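A minimal numpy sketch of this LPC route to formants: autocorrelation-method LPC solved directly from the normal equations, then the angles of the complex poles converted to frequencies. The model order and the root-selection thresholds are this sketch's assumptions.

```python
import numpy as np

def lpc_coeffs(x, order):
    """LPC by the autocorrelation method: solve the normal equations directly."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    return np.concatenate([[1.0], -a])        # A(z) = 1 - sum_k a_k z^{-k}

def formant_freqs(x, fs, order=10):
    """Formant candidates: angles of the upper-half-plane poles of the LPC filter."""
    poles = np.roots(lpc_coeffs(x, order))
    freqs = sorted(np.angle(p) * fs / (2 * np.pi) for p in poles if np.imag(p) > 0.01)
    return [f for f in freqs if f > 90.0]     # discard near-DC roots
```

Production systems usually replace the direct solve with Levinson-Durbin recursion and also check pole bandwidths before accepting a root as a formant.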
Further, inputting the acoustic parameter set into the corresponding voice emotion model to obtain the emotion state set includes: the convolution layer of the deep neural network extracts the local features of the acoustic parameter set, the recurrent layer captures the temporal relationships of the acoustic parameter set, and the fully connected layer maps the local features to different emotion categories to obtain the probability distribution of emotion categories.
Specifically, the extracted acoustic parameters are input into the pre-trained deep neural network model, which combines a convolutional neural network with a recurrent neural network. In the deep neural network, the local features of the acoustic parameter set are extracted by the convolution layer, the temporal relationships of the acoustic parameter set are then captured by the recurrent layer, and finally the features are mapped to different emotion categories by the fully connected layer. From the output of the deep neural network, the probability distribution of the voice signal over the emotion categories is obtained. The emotion analysis results are associated with the acoustic parameter set to form emotion labeling data. At the same time, the accurately labeled data are added to the training set to iteratively optimize the deep neural network model and improve its generalization ability.
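The convolution, recurrent and fully connected stages described above can be illustrated with a small numpy sketch. The weights below are randomly initialised stand-ins for a trained model, and the emotion label set is hypothetical; a real system would learn these parameters from the labeled training set.

```python
import numpy as np

rng = np.random.default_rng(0)
EMOTIONS = ["neutral", "happy", "angry", "sad"]   # hypothetical label set

def conv1d(x, w, b):
    """Valid 1-D convolution over time: x is (T, d_in), w is (k, d_in, d_out)."""
    k = w.shape[0]
    out = np.stack([np.tensordot(x[t : t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0] - k + 1)])
    return np.maximum(out + b, 0.0)               # ReLU activation

def simple_rnn(x, wx, wh, b):
    """Plain recurrent layer: h_t = tanh(x_t Wx + h_{t-1} Wh + b)."""
    h = np.zeros(wh.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ wx + h @ wh + b)
    return h                                      # final hidden state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(params, feats):
    """feats: (T, d) sequence of per-frame acoustic parameters."""
    h = conv1d(feats, params["wc"], params["bc"])                 # local features
    h = simple_rnn(h, params["wx"], params["wh"], params["bh"])   # temporal context
    return softmax(h @ params["wo"] + params["bo"])               # emotion probabilities

# Randomly initialised weights stand in for a trained model.
d_in, d_conv, d_hid = 3, 8, 16
params = {
    "wc": 0.1 * rng.standard_normal((5, d_in, d_conv)),
    "bc": np.zeros(d_conv),
    "wx": 0.1 * rng.standard_normal((d_conv, d_hid)),
    "wh": 0.1 * rng.standard_normal((d_hid, d_hid)),
    "bh": np.zeros(d_hid),
    "wo": 0.1 * rng.standard_normal((d_hid, len(EMOTIONS))),
    "bo": np.zeros(len(EMOTIONS)),
}
```

The output is a probability distribution over the emotion categories, matching the description of the fully connected layer's role.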
Further, the method also includes inputting an acoustic parameter set into a universal voice emotion model to obtain an emotion state set if there is no corresponding voice emotion model in the pre-built voice emotion model base, where the training set of the universal voice emotion model includes voice data of different people with different emotion states.
Specifically, the speaker's gender, age and other attribute information can be obtained through user registration information, face recognition technology, etc.
People of different genders and age groups have different voice characteristics when expressing their emotions. For example, the tone of young women is usually higher, while the tone of old men is relatively lower, and there may be differences in how emotions are expressed: when young people express excitement, the speaking rate may be faster and the tone may vary more, while old people may sound relatively flat. Therefore, choosing the appropriate voice emotion model according to the speaker's attribute information can improve the accuracy of emotion recognition. For example, suppose the speaker is a young woman, about 25 years old. According to the young woman's attribute information, a search is performed in the pre-built voice emotion model base. This model base is constructed in advance and contains voice emotion models trained for people with different gender, age and other attributes; it may, for example, contain a "young female model", "young male model", "old female model" and "old male model". Assuming a voice emotion model specially trained for "young female" is found, this model is selected for subsequent analysis. If no exactly matching model is found, a general voice emotion model can be chosen. This general model is usually obtained by training on the voice data of a large number of different people; although its accuracy may not match that of a dedicated model, it still provides a useful reference. After the model is selected, it is loaded into memory, much as an application is loaded from the hard disk into memory so that it can run quickly.
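The lookup-with-fallback logic can be sketched like this; the model names, attribute keys and the age cut-off are all illustrative assumptions, not values from the patent.

```python
# Hypothetical model base keyed by (gender, age group); entries are illustrative.
MODEL_BASE = {
    ("female", "young"): "young_female_model",
    ("male", "young"): "young_male_model",
    ("female", "senior"): "senior_female_model",
    ("male", "senior"): "senior_male_model",
}
GENERAL_MODEL = "general_model"  # trained on voice data from many different people

def age_group(age):
    """Illustrative cut-off; a real system would use finer-grained bands."""
    return "young" if age < 40 else "senior"

def select_model(gender, age):
    """Pick the attribute-specific model, falling back to the general one."""
    return MODEL_BASE.get((gender, age_group(age)), GENERAL_MODEL)
```

The general model here plays the role of the universal voice emotion model used when no attribute-specific model exists in the base.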
Further, obtaining the cultural background information of the speaker and modifying the emotional state set according to the cultural background information includes: obtaining the cultural background information according to the attribute information; according to the cultural background information, obtaining the corresponding emotional expression feature vector from the emotional expression knowledge base; and weighting and correcting each emotion category in the emotion state set according to the emotion expression feature vector to obtain the emotion recognition result.
Further, the weighted correction of each emotion category in the emotion state set according to the emotion expression feature vector includes: matching the emotional state set against the emotional expression feature vector, and judging whether correction is needed through a similarity threshold; if correction is needed, each emotion category in the emotion state set is weighted according to the emotion expression feature vector, and the corrected emotion state set is processed by a softmax normalization function to obtain the emotion recognition result.
Specifically, obtain the personal attribute information of the speaker, including age, gender, nationality, occupation, etc., and determine the cultural background classification of the speaker according to the attribute information. According to different cultural background classification, the corresponding emotional expression feature vectors are obtained from the emotional expression knowledge base. The emotional state set is matched with the emotional expression feature vector corresponding to the cultural background, and whether adaptive correction is needed is judged through the similarity threshold. If it needs to be corrected, the probability values of each emotion category in the emotional state set are weighted according to the emotional expression characteristics of the cultural background. The adjusted emotional probability value is processed by softmax normalization function, and the modified emotional state set is obtained.
The corrected emotional state set is output as the final emotion recognition result for the speaker. The emotional expression knowledge base is a data set containing emotional expression characteristics under different cultural backgrounds.
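One way to realise the similarity check and weighted correction is sketched below. The use of cosine similarity and a multiplicative weighting followed by softmax normalization are this sketch's assumptions; the patent does not fix a specific matching or weighting scheme.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def correct_emotions(probs, culture_vec, threshold=0.95):
    """Reweight emotion probabilities by a cultural expression feature vector.

    If the model output already matches the cultural profile (similarity at
    or above the threshold), it is returned unchanged; otherwise each
    category is weighted by the profile and re-normalised via softmax in
    log space.
    """
    if cosine(probs, culture_vec) >= threshold:
        return probs
    # softmax of summed logs == normalising the elementwise product
    return softmax(np.log(probs + 1e-12) + np.log(culture_vec + 1e-12))
```

Note that softmax over summed logs is equivalent to normalising the elementwise product `probs * culture_vec`, so categories the cultural profile favours gain probability mass and the result still sums to one.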
Embodiment 2
A voice emotion recognition system includes a preprocessing module, an acoustic parameter extraction module, a voice emotion model selection module, a voice emotion recognition module and a recognition result correction module; the preprocessing module is used for acquiring and preprocessing a voice signal to acquire the preprocessed voice signal; the acoustic parameter extraction module is used for extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; the voice emotion model selection module is used for acquiring the attribute information of the speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, where the voice emotion model is constructed based on a deep neural network model and obtained through training on a training set, and the training set includes voice data of the same crowd in different emotional states; the voice emotion recognition module is used for inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and the recognition result correction module is used for obtaining the cultural background information of the speaker, and correcting the emotional state set according to the cultural background information to obtain the emotional recognition result.
Further, obtaining the preprocessed voice signal includes: converting voice signals into digital signals, and framing the digital signals; calculating the short-time energy and the short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing the mute part according to the voice frames, obtaining the voice signal from which the mute part is removed and performing frequency domain transformation to obtain the frequency domain signal; and denoising the frequency domain signal and inverse-transforming it to obtain the preprocessed voice signal.
Further, obtaining the acoustic parameter set includes: extracting fundamental frequency parameters from the preprocessed voice signal using a pitch detection algorithm; extracting energy parameters using a short-time energy analysis method; extracting formant parameters using linear predictive coding technology; and obtaining the acoustic parameter set from the fundamental frequency, energy and formant parameters.
The above-mentioned embodiment is only a description of the preferred mode of the invention, and does not limit the scope of the invention. Under the premise of not departing from the design spirit of the invention, various modifications and improvements made by ordinary technicians in the field to the technical scheme of the invention shall fall within the protection scope determined by the claims of the invention.

Claims (10)

CLAIMS
1. A voice emotion recognition method, comprising: acquiring a voice signal and preprocessing the voice signal to acquire a preprocessed voice signal; extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set; acquiring attribute information of a speaker, and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, wherein the voice emotion model is constructed based on a deep neural network model and obtained by training on a training set, and the training set comprises voice data of the same crowd in different emotional states; inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and acquiring cultural background information of the speaker, and modifying the emotional state set according to the cultural background information to obtain an emotional recognition result.
2. The voice emotion recognition method according to claim 1, wherein obtaining the preprocessed voice signal comprises: converting the voice signal into a digital signal, and framing the digital signal; calculating a short-time energy and a short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing a mute part according to the voice frames, obtaining a voice signal from which the mute part is removed, and performing frequency domain transformation to obtain a frequency domain signal; and denoising the frequency domain signal and performing inverse transformation to obtain the preprocessed voice signal.
3. The voice emotion recognition method according to claim 2, wherein obtaining the acoustic parameter set comprises: extracting the preprocessed voice signal by adopting a pitch detection algorithm to obtain fundamental frequency parameters; extracting the preprocessed voice signal by adopting a short-time energy analysis method to obtain energy parameters; extracting the preprocessed voice signal by adopting linear predictive coding technology to obtain formant parameters; and acquiring the acoustic parameter set according to the fundamental frequency parameter, the energy parameter and the formant parameter.
4. The voice emotion recognition method according to claim 1, wherein inputting the acoustic parameter set into the corresponding voice emotion model and obtaining the emotion state set comprises: extracting local features of the acoustic parameter set through a convolution layer of the deep neural network; capturing the time sequence relationship of the acoustic parameter set through a recurrent layer; and mapping the local features to different emotion categories through a fully connected layer to obtain a probability distribution of emotion categories.
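For illustration only, the convolution → recurrent → fully-connected forward pass of claim 4 can be sketched in plain NumPy. The weights here are random (untrained), the layer sizes and emotion count are hypothetical, and a trained network would of course replace them; the sketch only shows the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def crnn_forward(params_seq, n_emotions=6, kernel=3, hidden=16):
    # params_seq: (T, D) acoustic parameters per frame.
    T, D = params_seq.shape
    # Convolution layer: local features over a short window of frames.
    Wc = rng.standard_normal((kernel * D, hidden)) * 0.1
    conv = np.array([np.tanh(params_seq[t:t + kernel].ravel() @ Wc)
                     for t in range(T - kernel + 1)])
    # Recurrent layer: capture the time-sequence relationship.
    Wx = rng.standard_normal((hidden, hidden)) * 0.1
    Wh = rng.standard_normal((hidden, hidden)) * 0.1
    h = np.zeros(hidden)
    for x in conv:
        h = np.tanh(x @ Wx + h @ Wh)
    # Fully connected layer: map to an emotion-category probability distribution.
    Wo = rng.standard_normal((hidden, n_emotions)) * 0.1
    return softmax(h @ Wo)
```

The output is a length-`n_emotions` vector of non-negative probabilities summing to one — the "emotion state set" of the claims.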
5. The voice emotion recognition method according to claim 1, wherein the method further comprises inputting the acoustic parameter set into a general voice emotion model to obtain an emotion state set if there is no corresponding voice emotion model in the pre-built voice emotion model base, wherein the training set of the general voice emotion model comprises voice data of different people with different emotion states.
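For illustration only, the claim-5 fallback reduces to a keyed lookup with a default: select the crowd-specific model matching the speaker's attributes, and fall back to the general model trained on voice data of different people when no match exists (attribute keys here are hypothetical):

```python
def select_model(model_base, attributes, general_model):
    # Key the model base on speaker attributes; fall back to the general model.
    key = (attributes.get("gender"), attributes.get("age_group"))
    return model_base.get(key, general_model)
```

A missing combination simply returns the general model rather than failing.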
6. The voice emotion recognition method according to claim 1, wherein obtaining the cultural background information of the speaker and modifying the emotion state set according to the cultural background information comprises: acquiring the cultural background information according to the attribute information; according to the cultural background information, obtaining a corresponding emotional expression feature vector from an emotional expression knowledge base; and carrying out weighted correction on each emotion category in the emotion state set according to the emotional expression feature vector to obtain the emotion recognition result.
7. The voice emotion recognition method according to claim 6, wherein the weighted correction of each emotion category in the emotion state set according to the emotion expression feature vector comprises: performing matching calculation on the emotional state set and the emotional expression feature vector, and judging whether correction is needed through a similarity threshold; if so, performing weighted correction on each emotion category in the emotion state set according to the emotion expression feature vector, and processing the corrected emotion state set through a softmax normalization function to obtain the emotion recognition result.
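For illustration only, the claim-6/claim-7 correction step can be sketched as follows. The matching calculation is assumed here to be cosine similarity, and correction is assumed to fire when similarity falls *below* the threshold (i.e. the raw output and the cultural expression profile disagree); the claims do not fix either choice:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def correct_emotions(state, culture_vec, threshold=0.9):
    # Matching calculation between the emotion state set and the
    # cultural emotional-expression feature vector.
    sim = float(state @ culture_vec /
                (np.linalg.norm(state) * np.linalg.norm(culture_vec)))
    if sim >= threshold:
        return state                   # no correction needed
    corrected = state * culture_vec    # per-category weighted correction
    return softmax(corrected)          # softmax normalization (claim 7)
```

When the two vectors already agree the state set passes through unchanged; otherwise each category is reweighted and renormalized into a valid distribution.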
8. A voice emotion recognition system implementing the method according to claim 1, comprising a preprocessing module, an acoustic parameter extraction module, a voice emotion model selection module, a voice emotion recognition module and a recognition result correction module; the preprocessing module is used for acquiring and preprocessing a voice signal to acquire the preprocessed voice signal; the acoustic parameter extraction module is used for extracting acoustic parameters from the preprocessed voice signal to obtain an acoustic parameter set;
the voice emotion model selection module is used for acquiring the attribute information of a speaker and selecting a corresponding voice emotion model from a pre-built voice emotion model base according to the attribute information, wherein the voice emotion model is constructed based on a deep neural network model and obtained through training on a training set, and the training set comprises voice data of the same crowd in different emotional states; the voice emotion recognition module is used for inputting the acoustic parameter set into the corresponding voice emotion model to obtain an emotion state set; and the recognition result correction module is used for obtaining the cultural background information of the speaker, and correcting the emotion state set according to the cultural background information to obtain an emotion recognition result.
9. The system according to claim 8, wherein obtaining the preprocessed voice signal comprises: converting the voice signal into a digital signal, and framing the digital signal; calculating a short-time energy and a short-time zero-crossing rate of each frame signal, and judging whether each frame signal is a voice frame according to the short-time energy and the short-time zero-crossing rate; removing the mute part according to the voice frame to obtain a voice signal from which the mute part is removed; performing frequency domain transformation on the voice signal from which the mute part is removed to obtain a frequency domain signal; and denoising the frequency domain signal and performing inverse transformation to obtain the preprocessed voice signal.
10. The system according to claim 8, wherein obtaining the acoustic parameter set comprises: extracting the preprocessed voice signal by adopting a pitch detection algorithm to obtain fundamental frequency parameters; extracting the preprocessed voice signal by adopting a short-time energy analysis method to obtain energy parameters; extracting the preprocessed voice signal by adopting linear predictive coding technology to obtain formant parameters; and acquiring the acoustic parameter set according to the fundamental frequency parameters, the energy parameters and the formant parameters.
LU600124A 2025-01-24 2025-01-24 Voice emotion recognition method and system LU600124B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
LU600124A LU600124B1 (en) 2025-01-24 2025-01-24 Voice emotion recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
LU600124A LU600124B1 (en) 2025-01-24 2025-01-24 Voice emotion recognition method and system

Publications (1)

Publication Number Publication Date
LU600124B1 true LU600124B1 (en) 2025-07-25

Family

ID=96547871

Family Applications (1)

Application Number Title Priority Date Filing Date
LU600124A LU600124B1 (en) 2025-01-24 2025-01-24 Voice emotion recognition method and system

Country Status (1)

Country Link
LU (1) LU600124B1 (en)

Similar Documents

Publication Publication Date Title
Venkataramanan et al. Emotion recognition from speech
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN113012720B (en) Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN116665669A (en) A voice interaction method and system based on artificial intelligence
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN107767881B (en) Method and device for acquiring satisfaction degree of voice information
Hu et al. Pitch‐based gender identification with two‐stage classification
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium, and terminal
US11495234B2 (en) Data mining apparatus, method and system for speech recognition using the same
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN114121023A (en) Speaker separation method, speaker separation device, electronic equipment and computer readable storage medium
CN118173092A (en) An online customer service platform based on AI voice interaction
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN108682432B (en) Voice emotion recognition device
CN113516987B (en) Speaker recognition method, speaker recognition device, storage medium and equipment
CN119517012A (en) A speech recognition method and system for an intelligent speech robot
CN114512133B (en) Method, device, server and storage medium for identifying sound-emitting objects
CN116206593A (en) Voice quality inspection method, device and equipment
CN117831544A (en) Method and system for extracting and identifying bird sound features oriented to complex sound scenes
CN118486297B (en) Response method based on voice emotion recognition and intelligent voice assistant system
LU600124B1 (en) Voice emotion recognition method and system
Hasan et al. Bengali speech emotion recognition: A hybrid approach using B-LSTM
Merzougui et al. Diagnosing spasmodic dysphonia with the power of AI
CN118762718A (en) A method and system capable of dynamically tracking and identifying long-term progressive changes in personal timbre
CN117457005A (en) A voiceprint recognition method and device based on momentum contrast learning

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20250725