CN113903362A - Speech emotion recognition method based on neural network - Google Patents

Speech emotion recognition method based on neural network

Info

Publication number
CN113903362A
Authority
CN
China
Prior art keywords
emotion
neural network
text
speech
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110990439.3A
Other languages
Chinese (zh)
Other versions
CN113903362B (en)
Inventor
张悦
黄逸轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110990439.3A priority Critical patent/CN113903362B/en
Publication of CN113903362A publication Critical patent/CN113903362A/en
Application granted granted Critical
Publication of CN113903362B publication Critical patent/CN113903362B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a speech emotion recognition method based on a neural network. The target speech signal is classified into four emotions: happy, sad, neutral and angry. Filter-bank-based features are extracted from the speech signal and fed separately into a convolutional neural network and a time delay neural network, each of which automatically extracts emotion features; a normalized exponential function (softmax) classifier then gives the probability of each emotion class, and the emotion with the maximum probability is selected as the speech emotion category. The target speech signal is also recognized as text, which is fed into a pre-training model of a bidirectional encoder to obtain a text emotion category. The final emotion category is obtained by fusing the three models, which solves the problems in the prior art that model fusion and multi-modal emotion recognition are difficult to train and bring little improvement in accuracy.

Description

Speech emotion recognition method based on neural network
Technical Field
The invention relates to the technical field of speech emotion recognition, in particular to a speech emotion recognition method based on a neural network.
Background
Many speech emotion recognition methods fuse different speech emotion classification models. However, because every model uses the same speech information, the models are highly correlated and fusion brings little improvement. Another approach extracts features with different models and then fuses the models with equal weights, which likewise yields little improvement.
Multi-modal methods that combine text emotion recognition with speech emotion recognition also exist, but they rely on feature fusion; because different models learn at different speeds, feature fusion cannot fully exploit the complementary information of the different modalities.
Disclosure of Invention
The invention aims to provide a speech emotion recognition method based on a neural network, in order to solve the problems in the prior art that model fusion and multi-modal emotion recognition are difficult to train and bring little improvement in accuracy.
In order to achieve the above object, the present invention adopts a speech emotion recognition method based on a neural network, comprising the following steps:
extracting speech features and feeding them into a convolutional neural network to obtain a convolutional emotion category;
feeding the speech features into a time delay neural network to obtain a time delay emotion category;
recognizing the speech as text and feeding the text into a pre-training model of a bidirectional encoder to obtain a text emotion category;
and fusing the models to obtain the final emotion category.
The speech features are filter-bank-based features of the target speech signal.
The emotion of the target speech signal is classified into four categories: happy, sad, neutral and angry; the convolutional emotion category, the time delay emotion category, the text emotion category and the final emotion category are each any one of the four categories.
In the process of extracting the speech features and feeding them into the convolutional neural network to obtain the convolutional emotion category, the convolutional neural network automatically extracts the emotion features contained in the speech features, a normalized exponential function classifier then gives the probability of each emotion class, and the class with the maximum probability is selected as the convolutional emotion category.
In the process of feeding the speech features into the time delay neural network to obtain the time delay emotion category, the time delay neural network automatically extracts the emotion features contained in the speech features, a normalized exponential function classifier then gives the probability of each emotion class, and the class with the maximum probability is selected as the time delay emotion category.
Recognizing the speech as text and feeding the text into the pre-training model of the bidirectional encoder to obtain the text emotion category comprises the following steps:
recognizing the text corresponding to the target speech signal with a speech recognition technique to obtain the speech text;
mapping the characters in the speech text to corresponding labels to form a label sequence;
feeding the label sequence into the pre-training model of the bidirectional encoder to extract the emotion features contained in the text;
and obtaining the probability of each emotion class with a normalized exponential function classifier and selecting the class with the maximum probability as the text emotion category.
In the process of obtaining the final emotion category through model fusion, the normalized exponential function (softmax) probabilities produced by the convolutional, time delay and text models are linearly combined, and the emotion corresponding to the maximum combined value is selected as the final emotion category.
In the linear combination, the weights of the different models are set to the same or different values.
According to the speech emotion recognition method based on a neural network of the invention, the target speech signal is first classified into four emotions: happy, sad, neutral and angry. Filter-bank-based features are extracted from the speech signal and fed separately into a convolutional neural network and a time delay neural network, each of which automatically extracts emotion features; a normalized exponential function (softmax) classifier then gives the probability of each emotion class, and the emotion with the maximum probability is selected as the speech emotion category. The target speech signal is also recognized as text, which is fed into a pre-training model of a bidirectional encoder to obtain a text emotion category. The final emotion category is obtained by fusing the three models, which solves the problems in the prior art that model fusion and multi-modal emotion recognition are difficult to train and bring little improvement in accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a speech emotion recognition method based on a neural network according to the present invention.
FIG. 2 is a model architecture diagram of the convolutional neural network of the present invention.
FIG. 3 is a model architecture diagram of the time delay neural network of the present invention.
Fig. 4 is a block diagram of a single layer bi-directional encoder of the present invention.
FIG. 5 is a schematic diagram of the model fusion weighting procedure of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
In this application, the corresponding terms may also be referred to by other names: the filter-bank-based features are the FBank features, the convolutional neural network is the CNN, the time delay neural network is ECAPA-TDNN, the pre-training model of the bidirectional encoder is Bert, and the normalized exponential function is softmax.
Referring to fig. 1, the present invention provides a speech emotion recognition method based on a neural network, including the following steps:
S1: extracting speech features and feeding them into a convolutional neural network to obtain a convolutional emotion category;
S2: feeding the speech features into a time delay neural network to obtain a time delay emotion category;
S3: recognizing the speech as text and feeding the text into a pre-training model of a bidirectional encoder to obtain a text emotion category;
S4: fusing the models to obtain the final emotion category.
The speech features are the filter-bank-based (FBank) features of the target speech signal.
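As an illustration, FBank features can be computed frame by frame from the raw waveform with a standard speech toolkit. The following is a minimal sketch using torchaudio; the 16 kHz recording, 80 mel bins and 25 ms / 10 ms framing are assumptions for illustration, since the patent fixes neither these parameters nor a toolkit.

```python
# Minimal sketch of filter-bank (FBank) feature extraction with torchaudio.
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical input file
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,             # assumed number of mel filters
    frame_length=25.0,           # window length in milliseconds
    frame_shift=10.0,            # hop length in milliseconds
    sample_frequency=sample_rate,
)
print(fbank.shape)  # (num_frames, 80): one FBank vector per frame
```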
The emotion of the target speech signal is classified into four categories: happy, sad, neutral and angry; the convolutional emotion category, the time delay emotion category, the text emotion category and the final emotion category can each be any one of the four.
In the process of extracting the speech features and feeding them into the convolutional neural network to obtain the convolutional emotion category, the convolutional neural network automatically extracts the emotion features contained in the speech features, a normalized exponential function classifier then gives the probability of each emotion class, and the class with the maximum probability is selected as the convolutional emotion category.
In the process of feeding the speech features into the time delay neural network to obtain the time delay emotion category, the time delay neural network automatically extracts the emotion features contained in the speech features, a normalized exponential function classifier then gives the probability of each emotion class, and the class with the maximum probability is selected as the time delay emotion category.
Recognizing the speech as text and feeding the text into the pre-training model of the bidirectional encoder to obtain the text emotion category comprises the following steps:
recognizing the text corresponding to the target speech signal with a speech recognition technique to obtain the speech text;
mapping the characters in the speech text to corresponding labels to form a label sequence;
feeding the label sequence into the pre-training model of the bidirectional encoder to extract the emotion features contained in the text;
and obtaining the probability of each emotion class with a normalized exponential function classifier and selecting the class with the maximum probability as the text emotion category.
In the process of obtaining the final emotion category through model fusion, the normalized exponential function (softmax) probabilities produced by the convolutional, time delay and text models are linearly combined, and the emotion corresponding to the maximum combined value is selected as the final emotion category.
In the linear combination, the weights of the different models may be set to the same or different values.
Further, referring to fig. 2, the model architecture of the convolutional neural network CNN is as follows:
the speech signal is used as the input of the convolutional neural network based on the characteristics of a filter bank, the model is composed of 5 layers of two-dimensional convolutional neural network blocks, each two-dimensional convolutional neural network block is composed of 3 parts, namely a two-dimensional convolutional neural network, a batch normalization layer and a maximum pooling layer. And then connecting a global average pooling layer. And then connecting the full connection layer, obtaining the probability value belonging to each type of emotion by activating the function to be the normalized index function softmax, and then selecting the emotion corresponding to the maximum probability value as the emotion category of the voice.
The architecture of the time-delay neural network ECAPA-TDNN model is shown in FIG. 3:
the method comprises the steps of using the filter bank-based features of a voice signal as the input of a model, connecting a time delay neural network to the rear of the model, connecting a modified linear unit activation function and a batch standardization network to the rear of the model, connecting a 3-layer feature compression and excitation module, inputting the output of the first and second feature compression and excitation modules and the output of the third feature compression and excitation module into the time delay neural network, connecting the modified linear unit activation function to the model, obtaining a statistical attention pooling vector based on the features of the filter bank through attention pooling calculation, carrying out batch standardization, sending the statistical attention pooling vector to a full-connection network layer, carrying out batch standardization, obtaining probability values belonging to each emotion through an additional angle margin normalization index function, and selecting the maximum class as the emotion class of the voice.
For the Bert pre-training model:
The text corresponding to the speech is recognized with a speech recognition technique, and each character in the text is then mapped to a corresponding label according to a dictionary, with different characters mapped to different labels. The label sequence of the text is then input to the pre-training model of the bidirectional encoder (Bert).
The Bert pre-training model is a stack of multiple bidirectional encoder layers. The structure of a single-layer bidirectional encoder is shown in fig. 4. The input text is embedded, position encoding is added to the input information, and the result is fed into the encoder. The output of the previous layer is then combined with the features produced by the encoder and sent to a fully connected layer and a normalized exponential function (softmax) layer for classification, which yields the emotion category of the text.
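As an illustration of the text branch, the sketch below uses the Hugging Face Transformers library with a Chinese Bert checkpoint. The library, the checkpoint name and the example sentence are assumptions, not part of the original disclosure, and the classification head would have to be fine-tuned on emotion-labelled text before its probabilities are meaningful.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumed checkpoint; the patent only says a Bert pre-training model is used.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=4)   # happy / sad / neutral / angry

text = "今天真是太开心了"                 # hypothetical ASR output ("I am so happy today")
inputs = tokenizer(text, return_tensors="pt")   # character-to-label (token id) sequence
with torch.no_grad():
    logits = model(**inputs).logits             # head is untrained here; fine-tune in practice
probs = torch.softmax(logits, dim=-1)           # probability of each emotion class
text_emotion = probs.argmax(dim=-1)             # text emotion category
```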
Further, in the process of obtaining the final emotion classification through model fusion:
referring to fig. 5, the fusion method: the probability value after the softmax of the weight 1 × CNN + the probability value after the softmax of the weight 2 × ECAPA-TDNN + the probability value after the softmax of the weight 3 × Bert is a new probability value, and then the emotion corresponding to the maximum value is selected as the final emotion category.
Wherein: weight 1+ weight 2+ weight 3 ═ 1
The invention also provides a specific embodiment illustrating the improvement in recognition accuracy.
Related terms: accuracy = number of correctly predicted samples / total number of samples.
Weighted accuracy WA: the per-class accuracies weighted by the proportion of each emotion class in the data set, i.e. the overall accuracy over all samples;
Unweighted accuracy UA: the average of the per-class accuracies, with every emotion class weighted equally.
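Under these definitions, both metrics can be computed directly from the reference and predicted labels; a small sketch with made-up labels is shown below.

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 3])   # hypothetical reference labels
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 3])   # hypothetical predictions

wa = (y_true == y_pred).mean()                # weighted accuracy = overall accuracy
per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
ua = float(np.mean(per_class))                # unweighted accuracy = mean per-class accuracy
print(f"WA={wa:.2%}, UA={ua:.2%}")
```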
Model 1: a convolutional neural network (CNN) on the filter-bank-based (FBank) features of the input speech; weighted accuracy WA 67%, unweighted accuracy UA 65%.
Model 2: a time delay neural network (ECAPA-TDNN) on the filter-bank-based (FBank) features of the input speech; weighted accuracy WA 67%, unweighted accuracy UA 66%.
Model 3: a bidirectional encoder (Bert) pre-training model on the input text; weighted accuracy WA 62%, unweighted accuracy UA 61%.
When the weights of the different models are set to the same value, the speech emotion recognition result is:
weighted accuracy WA 76%, unweighted accuracy UA 74%
fused probability = (1 × probability after softmax of model 1 + 1 × probability after softmax of model 2 + 1 × probability after softmax of model 3) / 3
When the weights are set to different values during model fusion, the performance improves considerably:
weighted accuracy WA 81%, unweighted accuracy UA 80%
fused probability = (0.5 × probability after softmax of model 1 + 2.1 × probability after softmax of model 2 + 0.4 × probability after softmax of model 3) / 3
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A speech emotion recognition method based on a neural network is characterized by comprising the following steps:
extracting speech features and feeding them into a convolutional neural network to obtain a convolutional emotion category;
feeding the speech features into a time delay neural network to obtain a time delay emotion category;
recognizing the speech as text and feeding the text into a pre-training model of a bidirectional encoder to obtain a text emotion category;
and fusing the models to obtain the final emotion category.
2. The method of claim 1, wherein the speech features are filter-bank-based features of a target speech signal.
3. The neural-network-based speech emotion recognition method of claim 2, wherein the emotion of the target speech signal is classified into four categories of happy, sad, neutral and angry, and the convolutional emotion category, the time delay emotion category, the text emotion category and the final emotion category are each any one of the four categories.
4. The neural-network-based speech emotion recognition method of claim 1, wherein, in the process of extracting the speech features and feeding them into the convolutional neural network to obtain the convolutional emotion category, the convolutional neural network automatically extracts the emotion features contained in the speech features, a normalized exponential function classifier is then used to obtain the probability of each emotion class, and the class with the maximum probability is selected as the convolutional emotion category.
5. The neural-network-based speech emotion recognition method of claim 1, wherein, in the process of feeding the speech features into the time delay neural network to obtain the time delay emotion category, the time delay neural network automatically extracts the emotion features contained in the speech features, a normalized exponential function classifier is then used to obtain the probability of each emotion class, and the class with the maximum probability is selected as the time delay emotion category.
6. The neural-network-based speech emotion recognition method of claim 2, wherein recognizing the speech as text and feeding the text into the pre-training model of the bidirectional encoder to obtain the text emotion category comprises the following steps:
recognizing the text corresponding to the target speech signal with a speech recognition technique to obtain the speech text;
mapping the characters in the speech text to corresponding labels to form a label sequence;
feeding the label sequence into the pre-training model of the bidirectional encoder to extract the emotion features contained in the text;
and obtaining the probability of each emotion class with a normalized exponential function classifier and selecting the class with the maximum probability as the text emotion category.
7. The neural-network-based speech emotion recognition method of claim 1, wherein, in the process of obtaining the final emotion category through model fusion, the normalized exponential function (softmax) probabilities of the convolutional emotion category, the time delay emotion category and the text emotion category are linearly combined, and the emotion corresponding to the maximum combined value is selected as the final emotion category.
8. The neural-network-based speech emotion recognition method of claim 7, wherein, in the linear combination, the weights of the different models are set to the same or different values.
CN202110990439.3A 2021-08-26 2021-08-26 Voice emotion recognition method based on neural network Active CN113903362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110990439.3A CN113903362B (en) 2021-08-26 2021-08-26 Voice emotion recognition method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110990439.3A CN113903362B (en) 2021-08-26 2021-08-26 Voice emotion recognition method based on neural network

Publications (2)

Publication Number Publication Date
CN113903362A true CN113903362A (en) 2022-01-07
CN113903362B CN113903362B (en) 2023-07-21

Family

ID=79188027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110990439.3A Active CN113903362B (en) 2021-08-26 2021-08-26 Voice emotion recognition method based on neural network

Country Status (1)

Country Link
CN (1) CN113903362B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN107609572A (en) * 2017-08-15 2018-01-19 中国科学院自动化研究所 Multi-modal emotion identification method, system based on neutral net and transfer learning
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
CN110489521A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Text categories detection method, device, electronic equipment and computer-readable medium
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN111081280A (en) * 2019-12-30 2020-04-28 苏州思必驰信息科技有限公司 Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
US20200192927A1 (en) * 2018-12-18 2020-06-18 Adobe Inc. Detecting affective characteristics of text with gated convolutional encoder-decoder framework
CN111583964A (en) * 2020-04-14 2020-08-25 台州学院 Natural speech emotion recognition method based on multi-mode deep feature learning
CN112700796A (en) * 2020-12-21 2021-04-23 北京工业大学 Voice emotion recognition method based on interactive attention model

Also Published As

Publication number Publication date
CN113903362B (en) 2023-07-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant