CN114898779A - Multi-mode fused speech emotion recognition method and system - Google Patents


Info

Publication number
CN114898779A
Authority
CN
China
Prior art keywords
information
classification
text
emotion
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210641067.8A
Other languages
Chinese (zh)
Inventor
刘云翔
张可欣
原鑫鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Technology filed Critical Shanghai Institute of Technology
Publication of CN114898779A publication Critical patent/CN114898779A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multi-mode fused speech emotion recognition method and system, which comprises the following steps: acquiring a voice signal, and extracting a voice emotion characteristic value of the voice signal; acquiring text information corresponding to a voice signal, and preprocessing the text information to generate text characteristic information; acquiring a pre-trained speech emotion classifier, wherein the speech emotion classifier comprises a first classification model and a second classification model; inputting the speech emotion characteristic value and the text characteristic information into the speech emotion classifier, identifying the speech emotion characteristic value through a first classification model to generate first classification information, identifying the text characteristic information through a second classification model to generate second classification information, and fusing the first classification information and the second classification information to generate identification information. The invention fully utilizes the advantages of each classifier, integrates multiple modes, can avoid the contingency of a single mode and improve the identification accuracy.

Description

Multi-mode fused speech emotion recognition method and system
Technical Field
The invention relates to speech emotion recognition, and in particular to a multi-mode fused speech emotion recognition method and system.
Background
Speech emotion recognition is a part of artificial intelligence and involves knowledge from deep learning and machine learning. Human language carries rich emotion, and how to make a computer recognize human emotion from speech the way the human brain does is a research hotspot of speech recognition. Recognition technology has already been applied in many fields such as face recognition and handwritten digit recognition. Speech emotion recognition is widely applied in education, the service industry, driving assistance and criminal investigation. In education, observing the emotional state of students and judging it accurately allows teachers to take helpful measures to prevent depressive tendencies in students. In the service industry, speech communication with a client makes it possible to judge from the client's voice whether the client is satisfied with the service, reminding service personnel to adjust their service strategy in time. In criminal investigation, when the police interrogate a suspect, whether the suspect is lying is usually judged from the tone of the suspect's speech: when people lie they become nervous, and intonation changes with emotion. Speech emotion recognition is being studied intensively both at home and abroad.
In 1997, a professor at the Massachusetts Institute of Technology proposed the concept of affective computing: only by enabling a computer to correctly recognize human emotion can real human-machine interaction be achieved. In 2000, a conference on speech emotion recognition was held in Ireland, where the technology was studied in depth. Speech emotion recognition predicts the emotion category by extracting feature parameters of speech: feature values are extracted from the speech through deep learning, and emotion classification of the extracted feature values is completed by a machine-learning classification model. The speech emotion recognition process comprises three stages, namely selecting a speech emotion database, extracting speech feature values and predicting the emotion class, and previous researchers have made many contributions in all three.
However, most studies focus on only one modality, the speech signal, to recognize emotion and do not exploit other modalities. Emotion can be recognized not only from speech but also from facial expressions and text information.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a speech emotion recognition method and system fusing multiple modes.
The multi-mode fused speech emotion recognition method provided by the invention comprises the following steps:
step S1: acquiring a voice signal, and extracting a voice emotion characteristic value of the voice signal;
step S2: acquiring text information corresponding to the voice signal, and preprocessing the text information to generate text characteristic information;
step S3: acquiring a pre-trained speech emotion classifier, wherein the speech emotion classifier comprises a first classification model and a second classification model;
step S4: inputting the speech emotion characteristic value and the text characteristic information into the speech emotion classifier, identifying the speech emotion characteristic value through a first classification model to generate first classification information, identifying the text characteristic information through a second classification model to generate second classification information, and fusing the first classification information and the second classification information to generate identification information.
Preferably, the step S1 includes the steps of:
step S101: pre-emphasis, framing, windowing and endpoint detection are sequentially carried out on the voice signals to determine the voice signals with emotion;
step S102: and reducing the dimension of the voice signal with emotion, and extracting a characteristic value with the maximum contribution degree to emotion recognition to generate the voice emotion characteristic value.
Preferably, the step S2 includes the steps of:
step S201: performing word segmentation on the text information, and removing stop words in the text information;
step S202: performing feature selection in the text information according to an information gain method to generate a plurality of features, calculating the information entropy of each feature, and determining a target feature according to the information entropy;
step S203: and constructing a keyword library, an emotion adverb library and an emotion adjective library, and giving different weights to different target characteristics according to the word libraries to generate the text characteristic information.
Preferably, step S3 includes the following steps:
Step S301: designing a DNN neural network and a first BLSTM neural network, and constructing a first classification model by using the DNN neural network and the BLSTM neural network;
step S302: designing a BERT model, a second BLSTM neural network and a softmax classifier, and constructing the BERT model, the second BLSTM neural network and the softmax classifier into the second classification model;
step S303: and acquiring a preset voice training set, a preset voice test set, a preset text training set and a preset text test set, training and testing the first classification model through the voice training set and the preset voice test set, and training and testing the second classification model through the preset text training set and the preset text test set so as to generate the voice emotion classifier.
Preferably, in the step S4: the probability distribution in the first classification information is denoted S_audio, the probability distribution in the second classification information is denoted S_text, and the weighted fusion probability of the first classification information and the second classification information is w = w1·S_audio + w2·S_text, wherein w1 represents the weight of speech and w2 represents the weight of the text.
Preferably, the first classification model comprises a first speech model and a second speech model;
the main task of the first speech model is emotion classification and its auxiliary task is noise identification; signals identified as noise are discarded to generate non-noise signals;
and the second speech model is used for carrying out emotion classification and gender classification on the input non-noise signals.
Preferably, in the second classification model, the BERT model is configured to generate vectorized information by performing vectorized representation on text information;
the second BLSTM neural network is used for processing the vector quantization information to extract text features of the context;
the softmax classifier is used for generating the second classification information according to the classification of the text features.
Preferably, the speech emotion characteristic values comprise spectrum-based features, spectrogram features, prosodic features and voice quality features.
Preferably, in step S102, the feature with the largest Fisher value is selected to generate the speech emotion feature value.
The invention provides a multi-mode fused speech emotion recognition system, which comprises the following modules:
the speech emotion feature extraction module is used for acquiring a speech signal and extracting a speech emotion feature value of the speech signal;
the text preprocessing module is used for acquiring text information corresponding to the voice signal and preprocessing the text information to generate text characteristic information;
the system comprises a classifier acquisition module, a pre-training speech emotion classifier and a pre-training speech emotion classifier acquisition module, wherein the speech emotion classifier comprises a first classification model and a second classification model;
and the information identification module is used for inputting the speech emotion characteristic value and the text characteristic information into the speech emotion classifier, identifying the speech emotion characteristic value through a first classification model to generate first classification information, identifying the text characteristic information through a second classification model to generate second classification information, and fusing the first classification information and the second classification information to generate identification information.
Compared with the prior art, the invention has the following beneficial effects:
the invention not only uses one classifier, but also integrates a plurality of classifiers, uses different classifiers for different emotion recognition modes, fully utilizes the advantages of each classifier, integrates a plurality of modes, can avoid the contingency of a single mode and improve the recognition accuracy.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flowchart illustrating steps of a method for speech emotion recognition with multi-modal fusion according to an embodiment of the present invention;
FIG. 2 is a logic flow diagram of a multi-task learning of a first classification model in an embodiment of the present invention; and
FIG. 3 is a schematic block diagram of a speech emotion recognition system with multi-modal fusion according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that persons skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Fig. 1 is a flowchart of steps of a multi-modal fused speech emotion recognition method in an embodiment of the present invention, and as shown in fig. 1, the multi-modal fused speech emotion recognition method provided by the present invention includes the following steps:
step S1: acquiring a voice signal, and extracting a voice emotion characteristic value of the voice signal;
the step S1 includes the following steps:
step S101: pre-emphasis, framing, windowing and endpoint detection are sequentially carried out on the voice signals to determine the voice signals with emotion;
step S102: and reducing the dimension of the voice signal with emotion, and extracting a characteristic value with the maximum contribution degree to emotion recognition to generate the voice emotion characteristic value.
The speech emotion characteristic values comprise spectrum-based features, spectrogram features, prosodic features and voice quality features. The speech signal is first preprocessed (pre-emphasis, framing, windowing and endpoint detection) so that signals without emotion can be identified and discarded in time. The pre-emphasis operation is implemented with a digital filter to boost the high-frequency part of the speech, smooth the signal and retain the original information. Let x(n) and y(n) denote the input and output signals at time n, and let μ be the pre-emphasis coefficient with μ = 0.97; the pre-emphasized speech signal is then y(n) = x(n) - μ·x(n-1). Because the speech signal is non-stationary, it is divided into frames of 10 ms to 30 ms for short-time analysis. Framing is implemented with a Hamming window, and overlapping frames with a frame shift of 10 ms are used to avoid losing continuity between frames. To prevent non-speech frames from adding redundant information to speech emotion recognition, endpoint detection based on the zero-crossing rate and short-time energy is used to distinguish speech, non-speech and noise, and the detected non-speech and noise segments are removed. The zero-crossing rate is used to detect unvoiced sounds and the short-time energy to detect voiced sounds.
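As an illustration of the preprocessing described above, the following minimal Python sketch applies pre-emphasis with μ = 0.97, overlapping Hamming-windowed framing with a 10 ms frame shift, and a simple energy/zero-crossing check; the function names, thresholds and frame length are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def preprocess(signal, sr, mu=0.97, frame_ms=25, shift_ms=10):
    """Pre-emphasis followed by overlapping Hamming-windowed framing.

    Assumes the signal is at least one frame long.
    """
    # Pre-emphasis: y(n) = x(n) - mu * x(n-1)
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    frame_len = int(sr * frame_ms / 1000)    # e.g. 25 ms analysis frames
    frame_shift = int(sr * shift_ms / 1000)  # 10 ms frame shift (overlapping frames)
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift

    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames

def is_speech(frame, energy_thr=1e-3, zcr_thr=0.3):
    """Crude endpoint-detection check using short-time energy (voiced sounds)
    and zero-crossing rate (unvoiced sounds); thresholds are illustrative."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return energy > energy_thr or zcr > zcr_thr
```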
However, the number of selected characteristic values is too large, so dimension reduction must be carried out. In the embodiment of the invention the Fisher criterion is used to extract the characteristic values with the largest contribution to emotion recognition: in the speech emotion classification experiment, features with large Fisher values are selected. The Fisher criterion is calculated as follows:
Fisher_d = (μ_1d - μ_2d)² / (δ_1d² + δ_2d²)
wherein μ_1d and μ_2d represent the means of the d-dimensional vectors of two different classes, and δ_1d² and δ_2d² represent the variances of the d-dimensional vectors of the two classes. The calculation shows that the MFCC and spectrogram features have the largest contribution, so MFCC (Mel-frequency cepstral coefficient) and spectrogram features are extracted as the feature values for emotion classification.
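The Fisher ranking and the MFCC/spectrogram extraction above could be sketched as follows; librosa is assumed to be available, and the file path and two-class setting are illustrative.

```python
import numpy as np
import librosa

def fisher_score(feats_class1, feats_class2):
    """Per-dimension Fisher score: (mu_1d - mu_2d)^2 / (delta_1d^2 + delta_2d^2)."""
    mu1, mu2 = feats_class1.mean(axis=0), feats_class2.mean(axis=0)
    var1, var2 = feats_class1.var(axis=0), feats_class2.var(axis=0)
    return (mu1 - mu2) ** 2 / (var1 + var2 + 1e-8)

# MFCC and mel-spectrogram features retained as the high-contribution feature values
signal, sr = librosa.load("utterance.wav", sr=16000)        # placeholder path
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)     # (13, T)
mel_spec = librosa.feature.melspectrogram(y=signal, sr=sr)  # spectrogram-type feature
```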
Step S2: acquiring text information corresponding to the voice signal, and preprocessing the text information to generate text characteristic information;
the step S2 includes the following steps:
step S201: performing word segmentation on the text information, and removing stop words in the text information;
step S202: performing feature selection in the text information according to an information gain method to generate a plurality of features, calculating the information entropy of each feature, and determining a target feature according to the information entropy;
step S203: and constructing a keyword library, an emotion adverb library and an emotion adjective library, and giving different weights to different target characteristics according to the word libraries to generate the text characteristic information.
In the embodiment of the invention, the Jieba tool is used for word segmentation; Jieba integrates rule-based and statistics-based methods and can label part of speech while segmenting words. Stop words (pronouns, particles and other function words that carry no emotional information) are removed in the preprocessing stage to eliminate redundant text;
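A minimal sketch of the Jieba-based segmentation with part-of-speech tagging and stop-word removal follows; the stop-word set shown is a tiny illustrative placeholder for a full stop-word list.

```python
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "我", "是", "在"}  # illustrative subset of a stop-word list

def tokenize(text):
    """Segment text, keep (word, part-of-speech) pairs, and drop stop words."""
    return [(p.word, p.flag) for p in pseg.cut(text) if p.word not in STOP_WORDS]

tokens = tokenize("今天的服务让我非常满意")
```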
Features with large information entropy are selected; the information entropy is calculated as follows:
H(c) = -∑_{i=1}^{k} P_i · log₂ P_i
wherein k is the number of categories, c is the category variable, and P_i is the probability of occurrence of each feature.
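A small sketch of the entropy computation behind the information-gain feature selection described above; the helper names and the split into samples containing or not containing a feature are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """H(c) = -sum_i P_i * log2(P_i) over the class distribution of `labels`."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(all_labels, labels_with_feature, labels_without_feature):
    """Information gain of a candidate text feature."""
    n = len(all_labels)
    conditional = (len(labels_with_feature) / n) * entropy(labels_with_feature) \
                + (len(labels_without_feature) / n) * entropy(labels_without_feature)
    return entropy(all_labels) - conditional
```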
Based on the constructed word libraries, the strength of the expressed emotion is analyzed: target features with a high emotional degree are given large weights and those with a low degree are given small weights.
Step S3: acquiring a pre-trained speech emotion classifier, wherein the speech emotion classifier comprises a first classification model and a second classification model;
The step S3 includes the following steps:
Step S301: designing a DNN neural network and a first BLSTM neural network, and constructing a first classification model by using the DNN neural network and the BLSTM neural network;
in the embodiment of the invention, the ReLu function is selected as the activation function of the DNN neural network, the cross entropy loss function is selected as the loss function, then the convolutional layer and the pooling layer are set, the characteristic diagram is obtained by performing convolution calculation and activation function calculation on an input image, the convolution process is to scan an input matrix according to a certain step length by using a convolution kernel with a fixed size to perform dot product operation, the convolution kernel is a weight matrix, the characteristic diagram is obtained by inputting the convolution calculation result into the activation function, the depth of the characteristic value is equal to the number of the convolution kernels set by the current layer, and the convolutional layer is used for extracting the characteristic value. The pooling layer has the effects that the pooling operation is used for reducing the size of a characteristic diagram and reducing the calculated amount, redundant information of images can be removed for the images with a large number of characteristic values, the image processing efficiency is improved, and overfitting is reduced;
Speech information is sometimes determined by the preceding sequence and sometimes by the following sequence, so a BLSTM neural network is introduced to realize bidirectional memory of the speech signal sequence, denoted [x(1), x(2), …, x(T)]; the forward feature output is denoted r→(t) and the backward feature output r←(t). The BLSTM network solves the problem that an RNN cannot realize bidirectional memory. The spectrogram of the speech signal is used as input to the DNN + BLSTM model: the speech signal is first input into the BLSTM, and the BLSTM output is then fed to the DNN to obtain the final classification result.
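An illustrative Keras sketch of the BLSTM-then-DNN arrangement described above (a BLSTM over the frame sequence, dense ReLU layers on top, cross-entropy loss); the layer sizes, input shape and number of emotion classes are assumptions rather than values from the patent.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 4                  # assumed number of emotion categories
TIME_STEPS, FEAT_DIM = 300, 40   # assumed frames x features (e.g. spectrogram slices)

model = models.Sequential([
    layers.Input(shape=(TIME_STEPS, FEAT_DIM)),
    layers.Bidirectional(layers.LSTM(128)),        # BLSTM: past and future context
    layers.Dense(256, activation="relu"),          # DNN on top of the BLSTM output
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # cross-entropy loss
              metrics=["accuracy"])
```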
Step S302: designing a BERT model, a second BLSTM neural network and a softmax classifier, and constructing the BERT model, the second BLSTM neural network and the softmax classifier into the second classification model;
step S303: and acquiring a preset voice training set, a preset voice test set, a preset text training set and a preset text test set, training and testing the first classification model through the voice training set and the preset voice test set, and training and testing the second classification model through the preset text training set and the preset text test set so as to generate the voice emotion classifier.
In the embodiment of the invention, the IEMOCAP emotion database is selected to generate the training set and the test set. The IEMOCAP database provides video, audio and text transcriptions of all utterances, and contains 5331 utterances with audio and text transcriptions. In the experiment of the invention, 90% of the data samples are divided into the training set and 10% into the test set, with speech data and text data each accounting for half of the training and test sets.
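The 90%/10% split described above could be sketched as follows; the arrays are random placeholders standing in for the extracted utterance features and emotion labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.random.randn(5331, 40)          # placeholder utterance features
labels = np.random.randint(0, 4, size=5331)   # placeholder emotion labels

# 90% of the samples for training, 10% for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=42)
```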
Step S4: inputting the speech emotion characteristic value and the text characteristic information into the speech emotion classifier, identifying the speech emotion characteristic value through a first classification model to generate first classification information, identifying the text characteristic information through a second classification model to generate second classification information, and fusing the first classification information and the second classification information to generate identification information.
In the step S4: the probability distribution in the first classification information is denoted S_audio and the probability distribution in the second classification information is denoted S_text; the weighted fusion probability of the first classification information and the second classification information is then w = w1·S_audio + w2·S_text, wherein w1 represents the weight of speech and w2 represents the weight of the text.
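A minimal sketch of the decision-level weighted fusion w = w1·S_audio + w2·S_text; the weights and the four-class probability vectors are illustrative.

```python
import numpy as np

def fuse(s_audio, s_text, w1=0.6, w2=0.4):
    """Weighted fusion of the speech and text classifiers' probability distributions."""
    w = w1 * np.asarray(s_audio) + w2 * np.asarray(s_text)
    return int(np.argmax(w))   # index of the predicted emotion class

s_audio = [0.10, 0.70, 0.15, 0.05]   # first classification information (speech branch)
s_text = [0.20, 0.30, 0.40, 0.10]    # second classification information (text branch)
predicted = fuse(s_audio, s_text)
```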
FIG. 2 is a logic flow diagram of the multi-task learning of the first classification model in an embodiment of the present invention. As shown in FIG. 2, the first classification model comprises a first speech model and a second speech model;
the main task of the first speech model is emotion classification and its auxiliary task is noise identification; signals identified as noise are discarded to generate non-noise signals;
and the second speech model is used for carrying out emotion classification and gender classification on the input non-noise signals.
In the traditional learning method, a large task is divided into several subtasks that are learned and classified separately. The drawback of this approach is that useful information in the real model, and the connections between tasks, are ignored. Multi-task learning associates the main task with several auxiliary tasks and can improve the generalization of the classification. In a multi-task learning neural network model, the bottom of the network is a shared hidden layer that learns the connections between tasks, and the top consists of task-specific layers that learn the specific attributes of each task. As shown in FIG. 2, each multi-task learning model is a BLSTM structure, which can learn past and future information of the speech signal. The shared part of the network has 2 hidden layers of 128 units each, and the nodes of these hidden layers are shared across all three attributes. A Dense layer is connected to the shared hidden layers on one side and to each task on the other, learning the task-specific representations, and each task is followed by a softmax classifier (one softmax classifier per task). The activation function is the ReLU nonlinear activation function. The network model is defined with the Keras API of TensorFlow in Python. The left half of FIG. 2 is the first speech model and the right half is the second speech model. In the first speech model, the auxiliary task identifies the noise category and signals of that category are discarded, so that noise is removed through multi-task learning. The non-noise signals are then input into the second speech model, whose auxiliary task is gender classification; learning emotion classification and gender classification simultaneously reduces the influence of gender on recognition accuracy. The two multi-task learning models are trained by minimizing the average error. The total loss value (Lov) of each model is a weighted sum of the individual task losses: the noise classification loss is Lov1 with weight α, the emotion classification loss is Lov2 with weight β, and the gender classification loss is Lov3 with weight γ. The total loss of the first speech model is Lov = α·Lov1 + β·Lov2, and the total loss of the second speech model is Lov = β·Lov2 + γ·Lov3. Both models are trained simultaneously to minimize the total loss value.
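The first speech model's multi-task arrangement (shared two-layer BLSTM of 128 units, a Dense layer, one softmax head per task, total loss Lov = α·Lov1 + β·Lov2) might be sketched with the Keras functional API as follows; the input shape, class counts and loss weights α and β are assumptions consistent with the description, not values from the patent.

```python
from tensorflow.keras import layers, Model

TIME_STEPS, FEAT_DIM = 300, 40   # assumed spectrogram input shape
ALPHA, BETA = 0.3, 0.7           # assumed weights for the noise and emotion losses

inputs = layers.Input(shape=(TIME_STEPS, FEAT_DIM))
# Shared hidden layers: two BLSTM layers of 128 units each
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)
x = layers.Bidirectional(layers.LSTM(128))(x)
shared = layers.Dense(128, activation="relu")(x)   # Dense layer feeding the task heads

emotion_out = layers.Dense(4, activation="softmax", name="emotion")(shared)  # main task
noise_out = layers.Dense(2, activation="softmax", name="noise")(shared)      # auxiliary task

model = Model(inputs, [emotion_out, noise_out])
model.compile(optimizer="adam",
              loss={"emotion": "sparse_categorical_crossentropy",
                    "noise": "sparse_categorical_crossentropy"},
              # Lov = alpha * Lov1 (noise) + beta * Lov2 (emotion)
              loss_weights={"noise": ALPHA, "emotion": BETA})
```

The second speech model would follow the same pattern with a gender head in place of the noise head and weights β and γ.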
In the embodiment of the present invention, in the second classification model, the BERT model is configured to perform vectorized representation of the text information to generate vectorized information;
the second BLSTM neural network is used for processing the vectorized information to extract contextual text features;
the softmax classifier is used for generating the second classification information according to the classification of the text features.
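The second classification model (BERT vectorization, a BLSTM over the token representations, then a softmax classifier) could be sketched as below, assuming the HuggingFace transformers package; the checkpoint name, sequence length and class count are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from transformers import BertTokenizer, TFBertModel

MAX_LEN, NUM_CLASSES = 64, 4   # assumed sequence length and number of emotion classes
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = TFBertModel.from_pretrained("bert-base-chinese")

input_ids = layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# BERT vectorizes the text; its token-level outputs feed a BLSTM for contextual features
hidden = bert(input_ids, attention_mask=attention_mask).last_hidden_state
context = layers.Bidirectional(layers.LSTM(128))(hidden)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(context)  # softmax classifier

model = Model([input_ids, attention_mask], outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```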
Compared with the Word2vec model, the BERT model can make full use of contextual semantic information. BERT sets two training objectives, obtaining semantic representations at the word and sentence level respectively: 1. the masked language model; 2. next-sentence prediction. The masked language model is similar to a "fill in the blank" task: 15% of the tokens in a sentence are randomly masked and the encoder is made to predict them. Next-sentence prediction learns relations between sentences by predicting whether two sentences form a consecutive pair. Trained in this way, BERT has a strong ability to represent sentences and words. The BERT model consists of multiple layers of bidirectional Transformers and uses a self-attention mechanism to compute the relations between contextual semantics, which reflect their relevance. The calculation formula is as follows:
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
where Q, K and V are the input word-vector matrices of the encoder and d_k is the input vector dimension.
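A NumPy sketch of the scaled dot-product self-attention formula above; the toy token count and dimension d_k are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)                            # toy example: 5 tokens, d_k = 8
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)               # (5, 8) context-aware vectors
```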
Fig. 3 is a schematic block diagram of a multi-modal fused speech emotion recognition system in an embodiment of the present invention, and as shown in fig. 3, the multi-modal fused speech emotion recognition system provided by the present invention includes the following modules:
the speech emotion feature extraction module is used for acquiring a speech signal and extracting a speech emotion feature value of the speech signal;
the text preprocessing module is used for acquiring text information corresponding to the voice signal and preprocessing the text information to generate text characteristic information;
the classifier acquisition module is used for acquiring a pre-trained speech emotion classifier, wherein the speech emotion classifier comprises a first classification model and a second classification model;
and the information identification module is used for inputting the speech emotion characteristic value and the text characteristic information into the speech emotion classifier, identifying the speech emotion characteristic value through a first classification model to generate first classification information, identifying the text characteristic information through a second classification model to generate second classification information, and fusing the first classification information and the second classification information to generate identification information.
In the embodiment of the invention, the experiments run on a PC under the Windows operating system with a Linux kernel introduced; Python is used as the programming language and TensorFlow and Keras as the deep learning framework. The best parameters are selected by grid search to achieve a better classification effect. The grid search comprises the following specific steps, and an illustrative code sketch is given after them:
Step 1: import the model_selection module of scikit-learn in Python;
Step 2: import the GridSearchCV function from sklearn;
Step 3: define the parameter grid for the search, covering the value ranges of three parameters (convolution kernel size, stride and depth); the kernel-size list is {1×5, 2×2, 3×3, 4×5}, the stride list is {1×1, 2×2, 3×3, 4×4}, and the depth list is {16, 32, 64, 128, 256};
Step 4: apply the defined parameters to the convolutional neural network model, where the first argument of the GridSearchCV function is the defined convolutional neural network model and the second is the parameter grid;
Step 5: output the optimal parameters.
The grid search method is used to select the optimal convolutional-layer parameters of the convolutional neural network: the convolution kernel size, the stride and depth of the pooling layer, and the convolution kernel size of the hidden layer. The learning rate is set to 0.01 and dropout to 0.5. After the grid search, the selected optimal parameters are a convolution kernel size of 1×5, a stride of 2×2 and a depth of 64.
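The search over kernel size, stride and depth could be sketched as below; to avoid depending on a particular Keras-to-scikit-learn wrapper, the sketch loops over the same parameter grid directly, and the CNN builder, input shape and placeholder data are assumptions.

```python
import itertools
import numpy as np
from tensorflow.keras import layers, models, optimizers

param_grid = {
    "kernel": [(1, 5), (2, 2), (3, 3), (4, 5)],
    "stride": [(1, 1), (2, 2), (3, 3), (4, 4)],
    "depth": [16, 32, 64, 128, 256],
}

def build_cnn(kernel, stride, depth, input_shape=(64, 64, 1), num_classes=4):
    """Small CNN parameterized by the three searched hyperparameters."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(depth, kernel, strides=stride, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),                       # dropout = 0.5 as in the description
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.SGD(learning_rate=0.01),   # learning rate 0.01
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Placeholder data standing in for spectrogram features and emotion labels
X_train = np.random.rand(200, 64, 64, 1); y_train = np.random.randint(0, 4, 200)
X_val = np.random.rand(50, 64, 64, 1); y_val = np.random.randint(0, 4, 50)

best_params, best_acc = None, -1.0
for kernel, stride, depth in itertools.product(*param_grid.values()):
    model = build_cnn(kernel, stride, depth)
    model.fit(X_train, y_train, epochs=3, batch_size=32, verbose=0)
    acc = model.evaluate(X_val, y_val, verbose=0)[1]
    if acc > best_acc:
        best_params, best_acc = (kernel, stride, depth), acc
print("best parameters:", best_params)
```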
Because a conclusion drawn from a single data set may be accidental, different modalities are fused on different data sets to show that multi-modality helps improve the recognition rate. On the IEMOCAP database, an experiment fusing the text and speech modalities is compared with an experiment using only the speech signal: using the speech-signal features alone, the recognition accuracy is 78.23%, and after fusing the text and speech modalities the recognition rate rises to 83.45%.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (10)

1. A multi-mode fused speech emotion recognition method is characterized by comprising the following steps:
step S1: acquiring a voice signal, and extracting a voice emotion characteristic value of the voice signal;
step S2: acquiring text information corresponding to the voice signal, and preprocessing the text information to generate text characteristic information;
step S3: acquiring a pre-trained speech emotion classifier, wherein the speech emotion classifier comprises a first classification model and a second classification model;
step S4: inputting the speech emotion characteristic value and the text characteristic information into the speech emotion classifier, identifying the speech emotion characteristic value through a first classification model to generate first classification information, identifying the text characteristic information through a second classification model to generate second classification information, and fusing the first classification information and the second classification information to generate identification information.
2. The method for speech emotion recognition fused with multiple modalities according to claim 1, wherein said step S1 includes the steps of:
step S101: pre-emphasis, framing, windowing and endpoint detection are sequentially carried out on the voice signals to determine the voice signals with emotion;
step S102: and reducing the dimension of the voice signal with emotion, and extracting a characteristic value with the maximum contribution degree to emotion recognition to generate the voice emotion characteristic value.
3. The method for speech emotion recognition fused with multiple modalities according to claim 1, wherein said step S2 includes the steps of:
step S201: performing word segmentation on the text information, and removing stop words in the text information;
step S202: performing feature selection in the text information according to an information gain method to generate a plurality of features, calculating the information entropy of each feature, and determining a target feature according to the information entropy;
step S203: and constructing a keyword library, an emotion adverb library and an emotion adjective library, and giving different weights to different target characteristics according to the word libraries to generate the text characteristic information.
4. The method for speech emotion recognition based on fusion of multiple modalities of claim 1, wherein step S3 includes the following steps:
Step S301: designing a DNN neural network and a first BLSTM neural network, and constructing a first classification model by using the DNN neural network and the BLSTM neural network;
step S302: designing a BERT model, a second BLSTM neural network and a softmax classifier, and constructing the BERT model, the second BLSTM neural network and the softmax classifier into the second classification model;
step S303: and acquiring a preset voice training set, a preset voice test set, a preset text training set and a preset text test set, training and testing the first classification model through the voice training set and the preset voice test set, and training and testing the second classification model through the preset text training set and the preset text test set so as to generate the voice emotion classifier.
5. The method for speech emotion recognition fused with multiple modalities according to claim 1, wherein in said step S4: the probability distribution in the first classification information is denoted S_audio, the probability distribution in the second classification information is denoted S_text, and the weighted fusion probability of the first classification information and the second classification information is w = w1·S_audio + w2·S_text, wherein w1 represents the weight of speech and w2 represents the weight of the text.
6. The method for speech emotion recognition fused with multiple modalities of claim 1, wherein the first classification model comprises a first speech model and a second speech model;
the main task of the first speech model is emotion classification and its auxiliary task is noise identification; signals identified as noise are discarded to generate non-noise signals;
and the second speech model is used for carrying out emotion classification and gender classification on the input non-noise signals.
7. The method for recognizing speech emotion fusing multimodal forms as claimed in claim 4, wherein in the second classification model, the BERT model is used for generating vectorized information by vectorizing representation of text information;
the second BLSTM neural network is used for processing the vectorized information to extract contextual text features;
the softmax classifier is used for generating the second classification information according to the classification of the text features.
8. The method of claim 1, wherein the speech emotion characteristic values comprise spectrum-based features, spectrogram features, prosodic features and voice quality features.
9. The method for speech emotion recognition based on multi-modality fusion, as claimed in claim 2, wherein the speech emotion feature value is generated by selecting the feature with the largest Fisher value in step S102.
10. A speech emotion recognition system fusing multiple modes is characterized by comprising the following modules:
the speech emotion feature extraction module is used for acquiring a speech signal and extracting a speech emotion feature value of the speech signal;
the text preprocessing module is used for acquiring text information corresponding to the voice signal and preprocessing the text information to generate text characteristic information;
the classifier acquisition module is used for acquiring a pre-trained speech emotion classifier, wherein the speech emotion classifier comprises a first classification model and a second classification model;
and the information identification module is used for inputting the speech emotion characteristic value and the text characteristic information into the speech emotion classifier, identifying the speech emotion characteristic value through a first classification model to generate first classification information, identifying the text characteristic information through a second classification model to generate second classification information, and fusing the first classification information and the second classification information to generate identification information.
CN202210641067.8A 2022-04-07 2022-06-08 Multi-mode fused speech emotion recognition method and system Pending CN114898779A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2022103585399 2022-04-07
CN202210358539 2022-04-07

Publications (1)

Publication Number Publication Date
CN114898779A true CN114898779A (en) 2022-08-12

Family

ID=82728948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210641067.8A Pending CN114898779A (en) 2022-04-07 2022-06-08 Multi-mode fused speech emotion recognition method and system

Country Status (1)

Country Link
CN (1) CN114898779A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115414042A (en) * 2022-09-08 2022-12-02 北京邮电大学 Multi-modal anxiety detection method and device based on emotion information assistance
CN117933269A (en) * 2024-03-22 2024-04-26 合肥工业大学 Multi-mode depth model construction method and system based on emotion distribution

Similar Documents

Publication Publication Date Title
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Tirumala et al. Speaker identification features extraction methods: A systematic review
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
Agarwalla et al. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech
Chittaragi et al. Dialect identification using spectral and prosodic features on single and ensemble classifiers
CN111583964A (en) Natural speech emotion recognition method based on multi-mode deep feature learning
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Cardona et al. Online phoneme recognition using multi-layer perceptron networks combined with recurrent non-linear autoregressive neural networks with exogenous inputs
Gupta et al. Speech emotion recognition using SVM with thresholding fusion
Kumar et al. Machine learning based speech emotions recognition system
Agrawal et al. Speech emotion recognition of Hindi speech using statistical and machine learning techniques
Atkar et al. Speech emotion recognition using dialogue emotion decoder and CNN Classifier
Alshamsi et al. Automated speech emotion recognition on smart phones
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Carofilis et al. Improvement of accent classification models through Grad-Transfer from Spectrograms and Gradient-weighted Class Activation Mapping
Dhar et al. A system to predict emotion from Bengali speech
Mendiratta et al. A robust isolated automatic speech recognition system using machine learning techniques
Daouad et al. An automatic speech recognition system for isolated Amazigh word using 1D & 2D CNN-LSTM architecture
Gupta et al. Detecting emotions from human speech: role of gender information
Dvoynikova et al. Bimodal sentiment and emotion classification with multi-head attention fusion of acoustic and linguistic information
Basu et al. Affect detection from speech using deep convolutional neural network architecture
KR20230120790A (en) Speech Recognition Healthcare Service Using Variable Language Model
Li Robotic emotion recognition using two-level features fusion in audio signals of speech
Shruti et al. A comparative study on bengali speech sentiment analysis based on audio data
Kim et al. A study on the Recommendation of Contents using Speech Emotion Information and Emotion Collaborative Filtering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination