CN106782602B - Speech emotion recognition method based on deep neural network - Google Patents

Speech emotion recognition method based on deep neural network

Info

Publication number
CN106782602B
Authority
CN
China
Prior art keywords
layer
convolution
output
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611093447.3A
Other languages
Chinese (zh)
Other versions
CN106782602A (en)
Inventor
袁亮
卢官明
闫静杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201611093447.3A priority Critical patent/CN106782602B/en
Publication of CN106782602A publication Critical patent/CN106782602A/en
Application granted granted Critical
Publication of CN106782602B publication Critical patent/CN106782602B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition method based on a long short-term memory (LSTM) network and a convolutional neural network (CNN). The method comprises constructing a speech emotion recognition system based on the LSTM and CNN, taking a speech sequence as the input of the system, training the LSTM and CNN with a back-propagation algorithm, and optimizing the network parameters to obtain an optimized network model; the trained network model is then used to classify the emotion of a newly input speech sequence into one of six categories, namely sadness, happiness, disgust, fear, fright and neutrality. The method jointly exploits the two network models, LSTM and CNN, avoids the complexity of manual feature selection and extraction, and improves the accuracy of emotion recognition.

Description

Speech emotion recognition method based on deep neural network
Technical Field
The invention relates to the field of image processing and pattern recognition, and in particular to a speech emotion recognition method based on a long short-term memory (LSTM) network and a convolutional neural network (CNN).
Background
In interpersonal communication, information is exchanged in a variety of ways, including speech, body language and facial expressions. Among them, the speech signal is the fastest and most primitive means of communication, and researchers regard it as one of the most effective channels for human-computer interaction. Over the last half century, scholars have studied many topics in speech recognition, that is, how to convert speech sequences into text. Despite significant advances in speech recognition, there is still a long way to go before natural human-machine interaction is achieved, because machines cannot understand the emotional state of the speaker. This has led to another line of research: how to identify the emotional state of the speaker from speech, i.e., speech emotion recognition.
Speech emotion recognition, as an important branch of human-computer interaction, can be widely applied in fields such as education, medical care and transportation. In a vehicle-mounted system, it can be used to monitor the mental state of the driver and judge whether the driver is in a safe condition, so that a fatigued driver can be alerted and traffic accidents avoided. In telephone services, it can be used to identify users whose speech shows strong agitation and transfer them to a human agent, thereby optimizing the user experience and improving the overall service level. In clinical medicine, the emotional changes of patients with depression or of autistic children can be tracked by means of speech emotion recognition and used as a tool for diagnosis and adjuvant therapy. In robotics research, speech information helps a robot understand a person's emotion and make friendly, intelligent responses, enabling interaction.
Most speech emotion recognition methods in the present stage adopt the traditional method of extracting features and then classifying by using a classifier. Common speech features include pitch, speech rate, intensity (prosodic features), linear prediction cepstral coefficients, mel-frequency cepstral coefficients (spectral features), and the like. Common classification methods include hidden markov models, support vector machines, and gaussian mixture models. The traditional emotion recognition method tends to mature, but has certain defects. For example, it is not clear which feature has the greatest influence on emotion recognition, and only one feature is selected as a basis for judgment in most experiments, so that the objectivity of emotion recognition is reduced. In addition, some existing features, such as pitch, speech rate, etc., are greatly affected by the style of the speaker, which increases the complexity of recognition.
With the recent development of deep learning, many researchers have chosen to train network models to perform emotion recognition. Existing approaches mainly include speech emotion recognition based on deep belief networks, on long short-term memory networks, and on convolutional neural networks. These three methods share a major disadvantage: none of them can combine the advantages of the different network models. For example, a deep belief network can take a one-dimensional sequence as input but cannot exploit the correlation between earlier and later parts of the sequence; a long short-term memory network can exploit this temporal correlation, but the dimensionality of the features it extracts is high; a convolutional neural network cannot process a speech sequence directly and requires the speech signal to be Fourier-transformed into a spectrum before being used as input. Traditional speech emotion recognition methods offer limited room for further progress in feature extraction and classification, while existing deep-learning-based speech emotion methods each rely on a single network.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a speech emotion recognition method based on a long short-term memory network and a convolutional neural network, which avoids the complex process of manually extracting and screening features and obtains the optimal emotion recognition effect by adaptively adjusting the network parameters through training.
The invention adopts the following technical scheme for solving the technical problems:
The speech emotion recognition method based on the long short-term memory network and the convolutional neural network comprises the following steps:
step A, preprocessing the speech samples in a speech emotion database so that each speech sample is represented by a sequence of equal length, thereby obtaining preprocessed speech sequences;
step B, constructing a speech emotion recognition system based on the long short-term memory network LSTM and the convolutional neural network CNN, the system comprising two basic modules: a long short-term memory network module and a convolutional neural network module;
step C, feeding the preprocessed speech sequences into the speech emotion recognition system for multiple rounds of training, and adjusting the parameters of the LSTM and CNN with a back-propagation algorithm to obtain an optimized network model;
and step D, using the network model obtained by training in step C to classify the emotion of a newly input speech sequence into one of six emotions, namely sadness, happiness, disgust, fear, fright and neutrality.
As a further optimization scheme of the speech emotion recognition method based on the long short-term memory network and the convolutional neural network, the long short-term memory network module in step B is specifically constructed by the following steps:
B1.1, setting the length of the speech sample sequence to m, where m = n × n and n is a positive integer, and denoting the outputs of the forgetting gate unit and the input gate unit at the current time as f_t and i_t, which satisfy:
f_t = σ(W_f · x_c + b_f)
i_t = σ(W_i · x_c + b_i)
where x_c = [h_{t-1}, x_t] is the new vector obtained by joining the two vectors h_{t-1} and x_t end to end, x_t is the input at the current time, h_{t-1} is the state of the hidden layer at the previous time, W_f and W_i are the weight matrices of the forgetting gate unit and the input gate unit respectively, b_f and b_i are the bias vectors of the forgetting gate unit and the input gate unit respectively, and σ(·) is the sigmoid excitation function;
B1.2, calculating the value of the current cell state C_t by the following formulas:
C_t = f_t * C_{t-1} + i_t * C̃_t
C̃_t = tanh(W_C · x_c + b_C)
where C_{t-1} is the cell state at the previous time, C̃_t is a reference value of the cell state at the current time, W_C is the weight matrix of the cell state, b_C is the bias vector of the cell state, and tanh(·) is the hyperbolic tangent function;
B1.3, obtaining the output h_t of each hidden node according to the following formulas, and connecting the h_t in sequence to form an m-dimensional feature vector:
h_t = o_t * tanh(C_t)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
where W_o is the weight matrix of the output gate unit, b_o is the bias vector of the output gate unit, and o_t is the output of the output gate unit.
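For illustration only, the gate computations of steps B1.1-B1.3 can be written as the following Python (NumPy) sketch of a single LSTM time step; the hidden size, the toy input sequence and the random weight initialization are assumptions made here for demonstration and are not specified by the patent.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    # One time step following formulas B1.1-B1.3.
    x_c = np.concatenate([h_prev, x_t])      # x_c = [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ x_c + bf)             # forgetting gate output
    i_t = sigmoid(Wi @ x_c + bi)             # input gate output
    C_tilde = np.tanh(Wc @ x_c + bc)         # reference value of the cell state
    C_t = f_t * C_prev + i_t * C_tilde       # current cell state
    o_t = sigmoid(Wo @ x_c + bo)             # output gate output
    h_t = o_t * np.tanh(C_t)                 # hidden-layer output
    return h_t, C_t

rng = np.random.default_rng(0)
hid, inp = 4, 1                              # illustrative sizes, not taken from the patent
Wf, Wi, Wc, Wo = (rng.standard_normal((hid, hid + inp)) * 0.1 for _ in range(4))
bf, bi, bc, bo = (np.zeros(hid) for _ in range(4))
h, C = np.zeros(hid), np.zeros(hid)
for sample in rng.standard_normal(10):       # a toy 10-point speech sequence
    h, C = lstm_step(np.array([sample]), h, C, Wf, bf, Wi, bi, Wc, bc, Wo, bo)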
As a further optimization scheme of the speech emotion recognition method based on the long short-term memory network and the convolutional neural network, the convolutional neural network module in step B is specifically constructed by the following steps:
B2.1, converting the m-dimensional feature vector extracted in step B1.3 into an n × n feature matrix as the input of the convolutional neural network;
B2.2, the first layer of the convolutional neural network is a convolutional layer: m_1 convolution kernels of size k_1 × k_1 are selected to perform a convolution operation on the input data with a convolution stride of s_1, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_1 feature maps of size l_1 × l_1;
B2.3, the second layer of the convolutional neural network is a pooling layer: m_2 convolution kernels of size k_2 × k_2 are selected to pool, with a stride of s_2, the feature maps output by the first convolutional layer, obtaining the output of the pooling layer, namely m_2 feature maps of size l_2 × l_2;
B2.4, the third layer of the convolutional neural network is a convolutional layer: m_3 convolution kernels of size k_3 × k_3 are selected to perform a convolution operation on the feature maps output by the second-layer pooling layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_3 feature maps of size l_3 × l_3;
B2.5, the fourth layer of the convolutional neural network is a convolutional layer: m_4 convolution kernels of size k_4 × k_4 are selected to perform a convolution operation on the feature maps output by the third-layer convolutional layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_4 feature maps of size l_4 × l_4;
B2.6, the fifth layer of the convolutional neural network is a convolutional layer: m_5 convolution kernels of size k_5 × k_5 are selected to perform a convolution operation on the feature maps output by the fourth-layer convolutional layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_5 feature maps of size l_5 × l_5;
B2.7, the sixth layer of the convolutional neural network is a pooling layer: m_6 convolution kernels of size k_6 × k_6 are selected to pool, with a stride of s_6, the feature maps output by the fifth-layer convolutional layer, obtaining the output of the pooling layer, namely m_6 feature maps of size l_6 × l_6;
B2.8, the seventh, eighth and ninth layers of the convolutional neural network are all fully connected layers; the seventh layer fully connects the feature maps output by the sixth-layer pooling layer to c nodes of that layer; the eighth layer applies a ReLU nonlinear transformation to the c nodes of the seventh layer and then controls the connection weights of the hidden-layer nodes with a dropout method, the total number of connections being c; the ninth fully connected layer has p output nodes, and the output is the softmax loss fused with the feature labels.
As a further optimization scheme of the speech emotion recognition method based on the long short-term memory network and the convolutional neural network, the function J(θ) of the softmax loss of the convolutional neural network in step B is defined as follows:
J(θ) = -(1/q) [ Σ_{i=1}^{q} Σ_{j=1}^{p} 1{y^(i) = j} · log( e^{θ_j^T · x^(i)} / Σ_{l=1}^{p} e^{θ_l^T · x^(i)} ) ]
where x^(i) is an input vector and y^(i) is the emotion category corresponding to the input vector, i = 1, 2, …, q, with q being the number of speech samples; θ_j are the model parameters, j = 1, 2, …, p, with p being the number of emotion categories; T denotes transposition and e is the natural base; 1{·} is an indicator function whose value is 1 when the expression in the braces is true and 0 otherwise.
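A minimal NumPy sketch of the loss J(θ) defined above is given below; the sample count, feature dimension and parameter matrix θ are randomly generated placeholders, and the labels are taken as 0 to p-1 here rather than 1 to 6.

import numpy as np

def softmax_loss(theta, X, y):
    # J(theta): X is (q, d), y holds class indices 0..p-1, theta is (p, d).
    q = X.shape[0]
    scores = X @ theta.T                                # theta_j^T x^(i) for every i, j
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(q), y]).mean()       # the indicator 1{y^(i)=j} picks the true class

rng = np.random.default_rng(0)
q, d, p = 8, 16, 6                                      # 6 emotion categories as in the method
X = rng.standard_normal((q, d))
y = rng.integers(0, p, size=q)
theta = 0.01 * rng.standard_normal((p, d))
print(softmax_loss(theta, X, y))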
As a further optimization scheme of the speech emotion recognition method based on the long short-term memory network and the convolutional neural network, the tanh function is expressed as tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) and the sigmoid function is expressed as σ(x) = 1 / (1 + e^{-x}), where x is a variable.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
(1) The complex process of manual feature extraction and screening is avoided, and the optimal emotion recognition effect is obtained by adaptively adjusting the network parameters through training;
(2) The speech emotion recognition method based on LSTM and CNN fuses two different network models: by means of the LSTM, a speech sequence can be processed directly and the temporal correlation within the sequence can be exploited; by means of the CNN, the interference of noise is reduced and more abstract features can be learned, improving the accuracy and robustness of emotion recognition.
Drawings
FIG. 1 is a flow chart of the speech emotion recognition method based on LSTM and CNN of the present invention.
FIG. 2 is a basic framework structure diagram of the constructed LSTM and CNN-based speech emotion recognition system.
FIG. 3 is a basic framework diagram of the long short-term memory network module in the speech emotion recognition system.
Fig. 4 is a basic framework diagram of a convolutional neural network module in a speech emotion recognition system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, which is a flow chart of the speech emotion recognition method based on LSTM and CNN of the present invention, the implementation of the speech emotion recognition method based on LSTM and CNN of the present invention mainly includes the following steps:
step 1: selecting a suitable speech emotion database and collecting the speech segments it contains;
in actual operation, the AFEW database is selected; it provides original video clips, all of which are cut from films. Compared with common laboratory databases, the speech and emotional expression in the AFEW database are closer to real-life environments and more general. The ages of the speakers range from 1 to 70 years, covering various age groups and including a large number of samples from children and adolescents, which can subsequently be used for emotion recognition in younger subjects. The samples in the database are divided into six categories, namely sadness, happiness, disgust, fear, fright and neutrality, labeled 1 to 6. The speech segments in the videos are selected as the sample set, with a sampling frequency of 48 kHz.
Step 2: reading voice sample data, and unifying the length of a sample sequence;
because of the differences in the duration of the speech samples, and considering that the useful information is mainly concentrated in the middle region of the speech sequence, 16384 sample points near the middle point of each speech sequence are actually selected to represent the entire speech. According to the following steps: and 3, randomly selecting the voice samples according to the proportion to be respectively used as a training set and a verification set. The speech sequences and labels for each sample set are stored as a pkl file.
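As a hedged illustration of step 2, the following Python sketch crops 16384 samples around the midpoint of each waveform, splits the samples into a training set and a validation set and stores each set as a pkl file; the librosa loader, the file layout and the interpretation of the 1:3 split are assumptions for demonstration, not details taken verbatim from the patent.

import pickle
import numpy as np
import librosa                     # assumed audio loader; any reader supporting 48 kHz works

SEG_LEN = 16384                    # sample points kept around the midpoint of each utterance

def crop_center(wave, seg_len=SEG_LEN):
    # Keep seg_len samples centred on the middle of the waveform (zero-pad if too short).
    if len(wave) < seg_len:
        wave = np.pad(wave, (0, seg_len - len(wave)))
    mid = len(wave) // 2
    start = max(0, mid - seg_len // 2)
    return wave[start:start + seg_len]

def build_sets(files, labels, train_fraction=0.25, seed=0):
    # train_fraction reflects one reading of the 1:3 split in step 2; treat it as an assumption.
    X = np.stack([crop_center(librosa.load(f, sr=48000)[0]) for f in files])
    y = np.asarray(labels)
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train = int(len(X) * train_fraction)
    parts = {"train.pkl": idx[:n_train], "val.pkl": idx[n_train:]}
    for name, part in parts.items():
        with open(name, "wb") as fh:
            pickle.dump({"sequences": X[part], "labels": y[part]}, fh)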
step 3: constructing the speech emotion recognition system, taking the speech sequence as input, training the long short-term memory network, and obtaining the output of its hidden layer. FIG. 2 is a basic framework structure diagram of the constructed speech emotion recognition system based on LSTM and CNN, which illustrates the whole process of emotion classification of speech samples; the system mainly comprises the two basic modules, LSTM and CNN. FIG. 3 is a basic framework diagram of the long short-term memory network module in the speech emotion recognition system, illustrating the internal structure of the LSTM network unit and reflecting the relationship between the hidden-layer state and each gate unit. FIG. 4 is a basic framework diagram of the convolutional neural network module in the speech emotion recognition system, illustrating how the feature matrix is turned into a vector containing label information after convolution, pooling and fully connected operations.
Let x_0, x_1, x_2, …, x_t, … denote the input speech sequence and h_0, h_1, h_2, …, h_t, … denote the states of the hidden nodes. x_c = [h_{t-1}, x_t] indicates that the hidden-layer state at the previous time and the input at the current time are concatenated into a vector x_c. Denoting the outputs of the forgetting gate unit and the input gate unit at time t as f_t and i_t, their values are calculated as follows:
f_t = σ(W_f · x_c + b_f)    (1)
i_t = σ(W_i · x_c + b_i)    (2)
The value of the cell state is calculated by the following formula:
C_t = f_t * C_{t-1} + i_t * tanh(W_C · x_c + b_C)    (3)
The output of the network module is determined by the current cell state and is a filtered cell value. First, a sigmoid unit gives the output-gate value o_t; the tanh function then keeps the output range of the cell state between -1 and 1, and the two are multiplied to determine the output h_t of the hidden layer:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (4)
h_t = o_t * tanh(C_t)    (5)
The output h_t of each hidden node is obtained; these outputs are then concatenated in sequence to form a 16384-dimensional feature vector, which is converted into a 128 × 128 feature matrix.
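How the hidden-node outputs could be collected and reshaped into the 128 × 128 feature matrix is sketched below; a standard PyTorch LSTM with one scalar hidden output per time step stands in for the module described above, which is an assumption made only for illustration.

import torch
import torch.nn as nn

SEQ_LEN, HIDDEN = 16384, 1        # one scalar hidden output per sample point (assumption)

lstm = nn.LSTM(input_size=1, hidden_size=HIDDEN, batch_first=True)

def lstm_feature_matrix(speech_batch):
    # speech_batch: (batch, 16384) preprocessed waveforms -> (batch, 1, 128, 128).
    x = speech_batch.unsqueeze(-1)              # (batch, 16384, 1): one sample per time step
    h_all, _ = lstm(x)                          # h_t for every time step
    feats = h_all.reshape(-1, SEQ_LEN)          # concatenate the h_t into a 16384-dim vector
    return feats.view(-1, 1, 128, 128)          # reshape into the 128 x 128 feature matrix

matrix = lstm_feature_matrix(torch.randn(2, SEQ_LEN))
print(matrix.shape)                             # torch.Size([2, 1, 128, 128])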
step 4: taking the feature matrix as input and training the convolutional neural network, which specifically comprises the following steps:
The first layer is a convolutional layer: 96 convolution kernels of size 11 × 11 are selected to perform a convolution operation on the input data with a convolution stride of 3; the convolution operation enhances the signal features and reduces noise. After convolution, 96 feature maps of size 40 × 40 are generated.
The second layer is a pooling layer: the feature maps generated by the first convolutional layer are pooled with a 4 × 4 kernel and a stride of 3, generating 96 feature maps of size 13 × 13.
The third layer is a convolutional layer: 256 convolution kernels of size 5 × 5 are selected to convolve the feature maps generated by the second layer, and edge padding and grouping are used to prevent the feature maps from shrinking during convolution. After the nonlinear transformation, 256 feature maps of size 13 × 13 are generated.
The fourth layer is a convolutional layer: 384 convolution kernels of size 5 × 5 are selected to convolve the feature maps generated by the third layer; with the same edge padding and grouping, 384 feature maps of size 13 × 13 are generated after the nonlinear transformation.
The fifth layer is a convolutional layer: 256 convolution kernels of size 5 × 5 are selected, and with edge padding, 256 feature maps of size 13 × 13 are generated after the nonlinear mapping of the feature maps produced by the fourth layer.
The sixth layer is a pooling layer: the feature maps generated by the fifth layer are pooled with a 3 × 3 kernel and a stride of 2, generating 256 feature maps of size 6 × 6.
The seventh, eighth and ninth layers are all fully connected layers. The seventh layer fully connects the feature maps generated by the sixth layer to 4096 nodes. The eighth layer applies a ReLU nonlinear transformation to the nodes of the seventh layer and then controls the working weights of the hidden-layer nodes with a dropout method; the dropout method randomly discards part of the hidden nodes during each training pass, and the discarded nodes can temporarily be regarded as not being part of the network structure, but their weights are kept, so that only part of the parameters are adjusted each time. The number of full connections in the eighth layer is 4096. The ninth fully connected layer has 6 output nodes, and the output is the softmax loss fused with the feature labels.
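The layer dimensions stated in step 4 (40 → 13 → 13 → 13 → 13 → 6) can be checked with the PyTorch sketch below; the padding, grouping and dropout settings are assumptions chosen so that the stated feature-map sizes come out, since the patent only mentions edge padding and grouping.

import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    # Nine-layer CNN from step 4; padding/groups chosen to reproduce the stated sizes.
    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=11, stride=3), nn.ReLU(),     # 96 x 40 x 40
            nn.MaxPool2d(kernel_size=4, stride=3),                     # 96 x 13 x 13
            nn.Conv2d(96, 256, 5, padding=2, groups=2), nn.ReLU(),     # 256 x 13 x 13
            nn.Conv2d(256, 384, 5, padding=2, groups=2), nn.ReLU(),    # 384 x 13 x 13
            nn.Conv2d(384, 256, 5, padding=2), nn.ReLU(),              # 256 x 13 x 13
            nn.MaxPool2d(kernel_size=3, stride=2),                     # 256 x 6 x 6
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),                   # seventh layer
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),         # eighth layer
            nn.Linear(4096, num_classes),                              # ninth layer (softmax loss applied outside)
        )

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

print(EmotionCNN()(torch.randn(2, 1, 128, 128)).shape)                 # torch.Size([2, 6])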
step 5: adjusting the parameters of the LSTM and CNN in the system using a back-propagation algorithm, selecting the optimal network model, and storing its parameters;
step 6: feeding the test-set samples into the optimal network model and performing emotion recognition on them with the trained network.
The function J(θ) of the softmax loss of the convolutional neural network is defined as follows:
J(θ) = -(1/q) [ Σ_{i=1}^{q} Σ_{j=1}^{p} 1{y^(i) = j} · log( e^{θ_j^T · x^(i)} / Σ_{l=1}^{p} e^{θ_l^T · x^(i)} ) ]
where x^(i) is an input vector and y^(i) is the emotion category corresponding to the input vector, i = 1, 2, …, q, with q being the number of speech samples; θ_j are the model parameters, j = 1, 2, …, p, with p being the number of emotion categories; T denotes transposition and e is the natural base; 1{·} is an indicator function whose value is 1 when the expression in the braces is true and 0 otherwise. As the number of training samples increases, the value of the loss function decreases continuously; the θ_j obtained when the loss function becomes stable are the parameters of the optimized network model.
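Steps 5 and 6, together with the loss above, could be realized by a training loop such as the following sketch, which reuses the lstm module and the EmotionCNN class from the earlier sketches; the optimizer, learning rate, epoch count and the toy data loaders standing in for the AFEW pkl files are illustrative assumptions.

import torch
import torch.nn as nn

# Reuses `lstm`, `lstm_feature_matrix` and `EmotionCNN` from the earlier sketches.
model = EmotionCNN()
optimizer = torch.optim.SGD(list(model.parameters()) + list(lstm.parameters()),
                            lr=1e-3, momentum=0.9)             # illustrative settings
criterion = nn.CrossEntropyLoss()                              # softmax loss J(theta)

# Toy loaders standing in for the training/validation pkl files built in step 2.
train_loader = [(torch.randn(4, 16384), torch.randint(0, 6, (4,))) for _ in range(5)]
val_loader = [(torch.randn(4, 16384), torch.randint(0, 6, (4,)))]

best_acc = 0.0
for epoch in range(3):                                         # epoch count is a placeholder
    model.train()
    for waves, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(lstm_feature_matrix(waves)), labels)
        loss.backward()                                        # back-propagation through CNN and LSTM
        optimizer.step()

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for waves, labels in val_loader:
            pred = model(lstm_feature_matrix(waves)).argmax(1)
            correct += (pred == labels).sum().item()
            total += labels.numel()
    if correct / total > best_acc:                             # keep the best-performing parameters
        best_acc = correct / total
        torch.save({"cnn": model.state_dict(), "lstm": lstm.state_dict()}, "best_model.pt")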
The tanh function (hyperbolic tangent function) in the present invention is expressed as tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}), the ReLU function (rectified linear unit function) is expressed as f(x) = max(0, x), and the sigmoid function (S-shaped growth curve) is expressed as σ(x) = 1 / (1 + e^{-x}), where x is a variable.

Claims (3)

1. A speech emotion recognition method based on a long short-term memory network and a convolutional neural network, characterized by comprising the following steps:
step A, preprocessing the speech samples in a speech emotion database so that each speech sample is represented by a sequence of equal length, thereby obtaining preprocessed speech sequences;
step B, constructing a speech emotion recognition system based on the long short-term memory network LSTM and the convolutional neural network CNN, the system comprising two basic modules: a long short-term memory network module and a convolutional neural network module;
step C, feeding the preprocessed speech sequences into the speech emotion recognition system for multiple rounds of training, and adjusting the parameters of the LSTM and CNN with a back-propagation algorithm to obtain an optimized network model;
step D, using the network model obtained by training in step C to classify the emotion of a newly input speech sequence into one of six emotions, namely sadness, happiness, disgust, fear, fright and neutrality;
the long short-term memory network module in step B being specifically constructed by the following steps:
B1.1, setting the length of the speech sample sequence to m, where m = n × n and n is a positive integer, and denoting the outputs of the forgetting gate unit and the input gate unit at the current time as f_t and i_t, which satisfy:
f_t = σ(W_f · x_c + b_f)
i_t = σ(W_i · x_c + b_i)
where x_c = [h_{t-1}, x_t] is the new vector obtained by joining the two vectors h_{t-1} and x_t end to end, x_t is the input at the current time, h_{t-1} is the state of the hidden layer at the previous time, W_f and W_i are the weight matrices of the forgetting gate unit and the input gate unit respectively, b_f and b_i are the bias vectors of the forgetting gate unit and the input gate unit respectively, and σ(·) is the sigmoid excitation function;
B1.2, calculating the value of the current cell state C_t by the following formulas:
C_t = f_t * C_{t-1} + i_t * C̃_t
C̃_t = tanh(W_C · x_c + b_C)
where C_{t-1} is the cell state at the previous time, C̃_t is a reference value of the cell state at the current time, W_C is the weight matrix of the cell state, b_C is the bias vector of the cell state, and tanh(·) is the hyperbolic tangent function;
B1.3, obtaining the output h_t of each hidden node according to the following formulas, and connecting the h_t in sequence to form an m-dimensional feature vector:
h_t = o_t * tanh(C_t)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
where W_o is the weight matrix of the output gate unit, b_o is the bias vector of the output gate unit, and o_t is the output of the output gate unit;
the convolutional neural network module in step B being specifically constructed by the following steps:
B2.1, converting the m-dimensional feature vector extracted in step B1.3 into an n × n feature matrix as the input of the convolutional neural network;
B2.2, the first layer of the convolutional neural network is a convolutional layer: m_1 convolution kernels of size k_1 × k_1 are selected to perform a convolution operation on the input data with a convolution stride of s_1, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_1 feature maps of size l_1 × l_1;
B2.3, the second layer of the convolutional neural network is a pooling layer: m_2 convolution kernels of size k_2 × k_2 are selected to pool, with a stride of s_2, the feature maps output by the first convolutional layer, obtaining the output of the pooling layer, namely m_2 feature maps of size l_2 × l_2;
B2.4, the third layer of the convolutional neural network is a convolutional layer: m_3 convolution kernels of size k_3 × k_3 are selected to perform a convolution operation on the feature maps output by the second-layer pooling layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_3 feature maps of size l_3 × l_3;
B2.5, the fourth layer of the convolutional neural network is a convolutional layer: m_4 convolution kernels of size k_4 × k_4 are selected to perform a convolution operation on the feature maps output by the third-layer convolutional layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_4 feature maps of size l_4 × l_4;
B2.6, the fifth layer of the convolutional neural network is a convolutional layer: m_5 convolution kernels of size k_5 × k_5 are selected to perform a convolution operation on the feature maps output by the fourth-layer convolutional layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_5 feature maps of size l_5 × l_5;
B2.7, the sixth layer of the convolutional neural network is a pooling layer: m_6 convolution kernels of size k_6 × k_6 are selected to pool, with a stride of s_6, the feature maps output by the fifth-layer convolutional layer, obtaining the output of the pooling layer, namely m_6 feature maps of size l_6 × l_6;
B2.8, the seventh, eighth and ninth layers of the convolutional neural network are all fully connected layers; the seventh layer fully connects the feature maps output by the sixth-layer pooling layer to c nodes of that layer; the eighth layer applies a ReLU nonlinear transformation to the c nodes of the seventh layer and then controls the connection weights of the hidden-layer nodes with a dropout method, the total number of connections being c; the ninth fully connected layer has p output nodes, and the output is the softmax loss fused with the feature labels.
2. The speech emotion recognition method based on the long short-term memory network and the convolutional neural network as claimed in claim 1, wherein the function J(θ) of the softmax loss of the convolutional neural network in step B is defined as follows:
J(θ) = -(1/q) [ Σ_{i=1}^{q} Σ_{j=1}^{p} 1{y^(i) = j} · log( e^{θ_j^T · x^(i)} / Σ_{l=1}^{p} e^{θ_l^T · x^(i)} ) ]
where x^(i) is an input vector and y^(i) is the emotion category corresponding to the input vector, i = 1, 2, …, q, with q being the number of speech samples; θ_j are the model parameters, j = 1, 2, …, p, with p being the number of emotion categories; T denotes transposition and e is the natural base; 1{·} is an indicator function whose value is 1 when the expression in the braces is true and 0 otherwise.
3. The method of claim 1, wherein the tanh function is expressed as tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) and the sigmoid function is expressed as σ(x) = 1 / (1 + e^{-x}), where x is a variable.
CN201611093447.3A 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network Active CN106782602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611093447.3A CN106782602B (en) 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611093447.3A CN106782602B (en) 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network

Publications (2)

Publication Number Publication Date
CN106782602A CN106782602A (en) 2017-05-31
CN106782602B true CN106782602B (en) 2020-03-17

Family

ID=58913860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611093447.3A Active CN106782602B (en) 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network

Country Status (1)

Country Link
CN (1) CN106782602B (en)


Families Citing this family (90)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018227169A1 (en) * 2017-06-08 2018-12-13 Newvoicemedia Us Inc. Optimal human-machine conversations using emotion-enhanced natural speech
CN107293288B (en) * 2017-06-09 2020-04-21 清华大学 Acoustic model modeling method of residual long-short term memory recurrent neural network
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107392109A (en) * 2017-06-27 2017-11-24 南京邮电大学 A kind of neonatal pain expression recognition method based on deep neural network
CN107274378B (en) * 2017-07-25 2020-04-03 江西理工大学 Image fuzzy type identification and parameter setting method based on fusion memory CNN
CN107562784A (en) * 2017-07-25 2018-01-09 同济大学 Short text classification method based on ResLCNN models
CN107562792B (en) * 2017-07-31 2020-01-31 同济大学 question-answer matching method based on deep learning
CN107293290A (en) * 2017-07-31 2017-10-24 郑州云海信息技术有限公司 The method and apparatus for setting up Speech acoustics model
CN107506414B (en) * 2017-08-11 2020-01-07 武汉大学 Code recommendation method based on long-term and short-term memory network
CN108346436B (en) 2017-08-22 2020-06-23 腾讯科技(深圳)有限公司 Voice emotion detection method and device, computer equipment and storage medium
CN107705807B (en) * 2017-08-24 2019-08-27 平安科技(深圳)有限公司 Voice quality detecting method, device, equipment and storage medium based on Emotion identification
CN109426858B (en) * 2017-08-29 2021-04-06 京东方科技集团股份有限公司 Neural network, training method, image processing method, and image processing apparatus
CN107785011B (en) * 2017-09-15 2020-07-03 北京理工大学 Training method, device, equipment and medium of speech rate estimation model and speech rate estimation method, device and equipment
CN107679557B (en) * 2017-09-19 2020-11-27 平安科技(深圳)有限公司 Driving model training method, driver identification method, device, equipment and medium
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
CN107679199A (en) * 2017-10-11 2018-02-09 北京邮电大学 A kind of external the Chinese text readability analysis method based on depth local feature
CN107703564B (en) * 2017-10-13 2020-04-14 中国科学院深圳先进技术研究院 Rainfall prediction method and system and electronic equipment
CN107818307B (en) * 2017-10-31 2021-05-18 天津大学 Multi-label video event detection method based on LSTM network
CN107862331A (en) * 2017-10-31 2018-03-30 华中科技大学 It is a kind of based on time series and CNN unsafe acts recognition methods and system
CN109754790B (en) * 2017-11-01 2020-11-06 中国科学院声学研究所 Speech recognition system and method based on hybrid acoustic model
CN108039181B (en) * 2017-11-02 2021-02-12 北京捷通华声科技股份有限公司 Method and device for analyzing emotion information of sound signal
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN107992938B (en) * 2017-11-24 2019-05-14 清华大学 Space-time big data prediction technique and system based on positive and negative convolutional neural networks
CN108280406A (en) * 2017-12-30 2018-07-13 广州海昇计算机科技有限公司 A kind of Activity recognition method, system and device based on segmentation double-stream digestion
CN108597539B (en) * 2018-02-09 2021-09-03 桂林电子科技大学 Speech emotion recognition method based on parameter migration and spectrogram
CN108304823B (en) * 2018-02-24 2022-03-22 重庆邮电大学 Expression recognition method based on double-convolution CNN and long-and-short-term memory network
CN108520753B (en) * 2018-02-26 2020-07-24 南京工程学院 Voice lie detection method based on convolution bidirectional long-time and short-time memory network
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal
CN108806698A (en) * 2018-03-15 2018-11-13 中山大学 A kind of camouflage audio recognition method based on convolutional neural networks
CN108564954B (en) * 2018-03-19 2020-01-10 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity verification method, and storage medium
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
CN108764268A (en) * 2018-04-02 2018-11-06 华南理工大学 A kind of multi-modal emotion identification method of picture and text based on deep learning
CN108564942B (en) * 2018-04-04 2021-01-26 南京师范大学 Voice emotion recognition method and system based on adjustable sensitivity
CN108766419B (en) * 2018-05-04 2020-10-27 华南理工大学 Abnormal voice distinguishing method based on deep learning
CN108806667B (en) * 2018-05-29 2020-04-17 重庆大学 Synchronous recognition method of voice and emotion based on neural network
CN110179453B (en) * 2018-06-01 2020-01-03 山东省计算中心(国家超级计算济南中心) Electrocardiogram classification method based on convolutional neural network and long-short term memory network
CN108961072A (en) * 2018-06-07 2018-12-07 平安科技(深圳)有限公司 Push method, apparatus, computer equipment and the storage medium of insurance products
CN108717856B (en) * 2018-06-16 2022-03-08 台州学院 Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN108922617B (en) * 2018-06-26 2021-10-26 电子科技大学 Autism auxiliary diagnosis method based on neural network
CN108630199A (en) * 2018-06-30 2018-10-09 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of acoustic model
CN109034034A (en) * 2018-07-12 2018-12-18 广州麦仑信息科技有限公司 A kind of vein identification method based on nitrification enhancement optimization convolutional neural networks
CN109003625B (en) * 2018-07-27 2021-01-12 中国科学院自动化研究所 Speech emotion recognition method and system based on ternary loss
CN109190514B (en) * 2018-08-14 2021-10-01 电子科技大学 Face attribute recognition method and system based on bidirectional long-short term memory network
CN109147826B (en) * 2018-08-22 2022-12-27 平安科技(深圳)有限公司 Music emotion recognition method and device, computer equipment and computer storage medium
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN109087635A (en) * 2018-08-30 2018-12-25 湖北工业大学 A kind of speech-sound intelligent classification method and system
CN109285562B (en) * 2018-09-28 2022-09-23 东南大学 Voice emotion recognition method based on attention mechanism
CN109346107B (en) * 2018-10-10 2022-09-30 中山大学 LSTM-based method for inversely solving pronunciation of independent speaker
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver's Emotion identification method and terminal device
CN109389992A (en) * 2018-10-18 2019-02-26 天津大学 A kind of speech-emotion recognition method based on amplitude and phase information
CN109282837B (en) * 2018-10-24 2021-06-22 福州大学 Demodulation method of Bragg fiber grating staggered spectrum based on LSTM network
CN109036467B (en) * 2018-10-26 2021-04-16 南京邮电大学 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system
CN109243493B (en) * 2018-10-30 2022-09-16 南京工程学院 Infant crying emotion recognition method based on improved long-time and short-time memory network
CN109243494B (en) * 2018-10-30 2022-10-11 南京工程学院 Children emotion recognition method based on multi-attention mechanism long-time memory network
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109567793B (en) * 2018-11-16 2021-11-23 西北工业大学 Arrhythmia classification-oriented ECG signal processing method
CN111222624B (en) * 2018-11-26 2022-04-29 深圳云天励飞技术股份有限公司 Parallel computing method and device
CN110096587B (en) * 2019-01-11 2020-07-07 杭州电子科技大学 Attention mechanism-based LSTM-CNN word embedded fine-grained emotion classification model
CN109637545B (en) * 2019-01-17 2023-05-30 哈尔滨工程大学 Voiceprint recognition method based on one-dimensional convolution asymmetric bidirectional long-short-time memory network
JP6580281B1 (en) * 2019-02-20 2019-09-25 ソフトバンク株式会社 Translation apparatus, translation method, and translation program
CN110322900A (en) * 2019-06-25 2019-10-11 深圳市壹鸽科技有限公司 A kind of method of phonic signal character fusion
CN110363751B (en) * 2019-07-01 2021-08-03 浙江大学 Large intestine endoscope polyp detection method based on generation cooperative network
CN112446266B (en) * 2019-09-04 2024-03-29 北京君正集成电路股份有限公司 Face recognition network structure suitable for front end
CN110600018B (en) * 2019-09-05 2022-04-26 腾讯科技(深圳)有限公司 Voice recognition method and device and neural network training method and device
CN110738852B (en) * 2019-10-23 2020-12-18 浙江大学 Intersection steering overflow detection method based on vehicle track and long and short memory neural network
CN110929762B (en) * 2019-10-30 2023-05-12 中科南京人工智能创新研究院 Limb language detection and behavior analysis method and system based on deep learning
CN112766292A (en) * 2019-11-04 2021-05-07 中移(上海)信息通信科技有限公司 Identity authentication method, device, equipment and storage medium
CN112819133A (en) * 2019-11-15 2021-05-18 北方工业大学 Construction method of deep hybrid neural network emotion recognition model
CN110956953B (en) * 2019-11-29 2023-03-10 中山大学 Quarrel recognition method based on audio analysis and deep learning
CN111028859A (en) * 2019-12-15 2020-04-17 中北大学 Hybrid neural network vehicle type identification method based on audio feature fusion
CN111179910A (en) * 2019-12-17 2020-05-19 深圳追一科技有限公司 Speed of speech recognition method and apparatus, server, computer readable storage medium
WO2021127982A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech emotion recognition method, smart device, and computer-readable storage medium
CN111241817A (en) * 2020-01-20 2020-06-05 首都医科大学 Text-based depression identification method
CN111210844B (en) * 2020-02-03 2023-03-24 北京达佳互联信息技术有限公司 Method, device and equipment for determining speech emotion recognition model and storage medium
CN111310672A (en) * 2020-02-19 2020-06-19 广州数锐智能科技有限公司 Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN111248882B (en) * 2020-02-21 2022-07-29 乐普(北京)医疗器械股份有限公司 Method and device for predicting blood pressure
CN111326178A (en) * 2020-02-27 2020-06-23 长沙理工大学 Multi-mode speech emotion recognition system and method based on convolutional neural network
CN111524535B (en) * 2020-04-30 2022-06-21 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
CN111709284B (en) * 2020-05-07 2023-05-30 西安理工大学 Dance emotion recognition method based on CNN-LSTM
CN112383369A (en) * 2020-07-23 2021-02-19 哈尔滨工业大学 Cognitive radio multi-channel spectrum sensing method based on CNN-LSTM network model
CN112037822B (en) * 2020-07-30 2022-09-27 华南师范大学 Voice emotion recognition method based on ICNN and Bi-LSTM
CN112101095B (en) * 2020-08-02 2023-08-29 华南理工大学 Suicide and violence tendency emotion recognition method based on language and limb characteristics
CN112001482B (en) * 2020-08-14 2024-05-24 佳都科技集团股份有限公司 Vibration prediction and model training method, device, computer equipment and storage medium
CN112187413B (en) * 2020-08-28 2022-05-03 中国人民解放军海军航空大学航空作战勤务学院 SFBC (Small form-factor Block code) identifying method and device based on CNN-LSTM (convolutional neural network-Link State transition technology)
CN112259126B (en) * 2020-09-24 2023-06-20 广州大学 Robot and method for assisting in identifying autism voice features
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function
CN112735479B (en) * 2021-03-31 2021-07-06 南方电网数字电网研究院有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN113221758B (en) * 2021-05-16 2023-07-14 西北工业大学 GRU-NIN model-based underwater sound target identification method
CN113255800B (en) * 2021-06-02 2021-10-15 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN114305418B (en) * 2021-12-16 2023-08-04 广东工业大学 Data acquisition system and method for intelligent assessment of depression state

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10783900B2 (en) * 2014-10-03 2020-09-22 Google Llc Convolutional, long short-term memory, fully connected deep neural networks
CN105279495B (en) * 2015-10-23 2019-06-04 天津大学 A kind of video presentation method summarized based on deep learning and text
CN105469065B (en) * 2015-12-07 2019-04-23 中国科学院自动化研究所 A kind of discrete emotion identification method based on recurrent neural network
CN105844239B (en) * 2016-03-23 2019-03-29 北京邮电大学 It is a kind of that video detecting method is feared based on CNN and LSTM cruelly
CN106096568B (en) * 2016-06-21 2019-06-11 同济大学 A kind of pedestrian's recognition methods again based on CNN and convolution LSTM network

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2816680C1 (en) * 2023-03-31 2024-04-03 Автономная некоммерческая организация высшего образования "Университет Иннополис" Method of recognizing speech emotions using 3d convolutional neural network

Also Published As

Publication number Publication date
CN106782602A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106782602B (en) Speech emotion recognition method based on deep neural network
CN108615010B (en) Facial expression recognition method based on parallel convolution neural network feature map fusion
Ren et al. Deep scalogram representations for acoustic scene classification
CN109036465B (en) Speech emotion recognition method
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN109472024A (en) A kind of file classification method based on bidirectional circulating attention neural network
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN107393554A (en) In a kind of sound scene classification merge class between standard deviation feature extracting method
CN103544963A (en) Voice emotion recognition method based on core semi-supervised discrimination and analysis
CN108962247B (en) Multi-dimensional voice information recognition system and method based on progressive neural network
CN110367967A (en) A kind of pocket lightweight human brain condition detection method based on data fusion
CN109410974A (en) Sound enhancement method, device, equipment and storage medium
Han et al. Speech emotion recognition with a resnet-cnn-transformer parallel neural network
CN112151071B (en) Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN110534133A (en) A kind of speech emotion recognition system and speech-emotion recognition method
CN111461201A (en) Sensor data classification method based on phase space reconstruction
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN115565540B (en) Invasive brain-computer interface Chinese pronunciation decoding method
CN110992988A (en) Speech emotion recognition method and device based on domain confrontation
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN112397092A (en) Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN112668486A (en) Method, device and carrier for identifying facial expressions of pre-activated residual depth separable convolutional network
Adiga et al. Multimodal emotion recognition for human robot interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Yuen Road Qixia District of Nanjing City, Jiangsu Province, No. 9 210023
Applicant after: Nanjing Post & Telecommunication Univ.
Address before: 210003, No. 66, new exemplary Road, Nanjing, Jiangsu
Applicant before: Nanjing Post & Telecommunication Univ.

GR01 Patent grant