CN112185423A - Speech emotion recognition method based on multi-head attention mechanism - Google Patents

Speech emotion recognition method based on multi-head attention mechanism

Info

Publication number
CN112185423A
CN112185423A CN202011041520.9A CN202011041520A CN112185423A
Authority
CN
China
Prior art keywords
output
time
representing
dimension
weighted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011041520.9A
Other languages
Chinese (zh)
Other versions
CN112185423B (en)
Inventor
谢跃
梁瑞宇
杜静
包永强
赵力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN202011041520.9A priority Critical patent/CN112185423B/en
Publication of CN112185423A publication Critical patent/CN112185423A/en
Application granted granted Critical
Publication of CN112185423B publication Critical patent/CN112185423B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition method based on a multi-head attention mechanism, which comprises the following steps: extracting speech features with time sequence information from original speech data; establishing an LSTM model capable of processing time sequences; applying attention weighting with multi-head attention to the output of the LSTM model in the time dimension and in the feature dimension respectively, so as to distinguish the emotional saturation of different time segments and the ability of different features to discriminate emotion; concatenating the time-dimension-weighted output and the feature-dimension-weighted output and feeding the result into a fully connected layer and a softmax layer to obtain the speech emotion category; training the emotion recognition model with known original speech data samples and evaluating its recognition performance; and inputting unknown original speech data samples into the trained emotion recognition model to output the corresponding speech emotion categories. The invention strengthens the output of the LSTM model through a multi-head attention mechanism to improve speech emotion recognition performance, and has good application prospects.

Description

Speech emotion recognition method based on multi-head attention mechanism
Technical Field
The invention relates to the technical field of speech emotion recognition, in particular to a speech emotion recognition method based on a multi-head attention mechanism.
Background
In recent years, Speech Emotion Recognition (SER) has received increasing attention because of its great practical value in human-computer interaction. As deep learning has progressed, neural networks of different structures have been investigated for SER. For example, Long Short-Term Memory (LSTM), a variant of the Recurrent Neural Network (RNN), can model and store information over longer time spans, which allows it to achieve good classification performance on SER tasks.
In most applications of LSTM, the output of its last time step is selected as the input to the final classifier. In this framework the LSTM acts as a feature extractor, and since its effective temporal range is not unlimited, the feature information extracted at the last time step may be insufficient. For this reason, some researchers in the field of computer vision have equipped convolutional neural networks with attention mechanisms that automatically learn which parts of an image matter most to the final performance. Meanwhile, Bahdanau introduced an attention mechanism into LSTM-based end-to-end models, bringing a significant improvement in machine translation. For emotion recognition tasks, Tao used an attention mechanism to update the cell state of the LSTM, attending to information between cells and taking more historical cell states into account. Mirsamadi proposed local attention, which computes frame weights with an attention parameter vector.
The traditional scaled dot-product attention algorithm executes a single attention function and is therefore somewhat one-sided. Compared with such a single-head attention mechanism, a multi-head attention mechanism maps the attention vectors into several new subspaces through different linear projections, so that the model can jointly attend to information from different feature subspaces at different positions, effectively reducing this one-sidedness.
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, the invention discloses a speech emotion recognition method based on a multi-head attention mechanism. The method replaces the traditional scaled dot-product attention mechanism with a multi-head attention mechanism, applies attention weighting in the time dimension to the historical outputs of the LSTM together with its output at the last time step so as to distinguish the emotional saturation of different time segments, and extends the multi-head attention mechanism to the feature dimension so as to distinguish the ability of different features to discriminate emotion, thereby improving speech emotion recognition performance.
Technical scheme: the invention adopts the following technical scheme. A speech emotion recognition method based on a multi-head attention mechanism, characterized by comprising the following steps:
s1, extracting frame-level voice features with time sequence information from the original voice data sample;
s2, establishing an LSTM model with the capability of processing time sequence data, and inputting the voice characteristics with time sequence information in the step S1 into the LSTM model;
s3, performing attention weighting on the output of the LSTM model in a time dimension by using a multi-head attention mechanism, and combining the matrix output at all times of the LSTM model and the output at the last time to obtain the output weighted in the time dimension;
s4, performing attention weighting on the output of the LSTM model on the characteristic dimension by using a multi-head attention mechanism, and combining the matrix output at all times of the LSTM model to obtain the output after the characteristic dimension weighting;
s5, splicing the output after the time dimension weighting and the output after the feature dimension weighting, inputting the spliced output into a full connection layer, and obtaining the speech emotion category through a softmax layer;
s6, training the emotion recognition model in the steps S1 to S5 by using the known original voice data sample and the voice emotion type thereof, testing and verifying the trained emotion recognition model, and evaluating the recognition performance;
and S7, inputting the unknown original voice data sample into the trained emotion recognition model, and outputting the corresponding voice emotion type.
Preferably, the step S3 of weighting the output of the LSTM model in the time dimension by using a multi-head attention mechanism includes the following steps:
s31, calculating the projection of the matrix output at all time points of the LSTM model and the output at the last time point in each linear subspace respectively:
T_K_i = o_all_time * T_W_{i,k} + T_b_{i,k}
T_Q_i = o_max_time * T_W_{i,q} + T_b_{i,q}
wherein * denotes matrix multiplication; o_all_time denotes the matrix of outputs of the LSTM model at all time steps, and o_max_time denotes the output of the LSTM model at the last time step; T_W_{i,k} and T_W_{i,q} respectively denote the weights of o_all_time and o_max_time in the linear transformation of the i-th linear subspace, where i denotes the index of the linear subspace, there are h linear subspaces in total, and N denotes the number of hidden-layer units of the LSTM model; T_b_{i,k} and T_b_{i,q} respectively denote the biases of o_all_time and o_max_time in the linear transformation of the i-th linear subspace; T_K_i and T_Q_i respectively denote the projections of o_all_time and o_max_time in the i-th linear subspace;
s32, calculating a weighting coefficient of a time dimension in each linear subspace:
T_score_i = softmax(T_Q_i * T_K_i^H)
wherein T_score_i denotes the weighting coefficient of the time dimension in the i-th linear subspace, T_K_i^H denotes the transpose of T_K_i, and softmax() denotes the normalized exponential function;
s33, calculating the value to be weighted in each linear subspace:
T_V_i = o_all_time * T_W_{i,v} + T_b_{i,v}
wherein T_V_i denotes the value to be weighted in the i-th linear subspace; T_W_{i,v} denotes the weight used to obtain the value to be weighted in the i-th linear subspace; T_b_{i,v} denotes the corresponding bias;
s34, calculating the output of each linear subspace, and splicing the outputs of all the linear subspaces together to obtain the output weighted by the time dimension:
T_output_i = T_score_i * T_V_i
T_Output = Concat(T_output_1, ..., T_output_h)
wherein Concat() concatenates two or more arrays; T_output_i denotes the output of the i-th linear subspace; T_Output denotes the time-dimension-weighted output.
Preferably, the step S4 of weighting the output of the LSTM model in the feature dimension by using a multi-head attention mechanism includes the following steps:
s41, calculating the projection of the matrix of outputs of the LSTM model at all time steps in each linear subspace:
F_K_i = o_all_time * F_W_{i,k} + F_b_{i,k}
wherein * denotes matrix multiplication; o_all_time denotes the matrix of outputs of the LSTM model at all time steps; F_W_{i,k} denotes the weight of o_all_time in the linear transformation of the i-th linear subspace; F_b_{i,k} denotes the bias of that linear transformation; F_K_i denotes the projection of o_all_time in the i-th linear subspace;
s42, calculating a weighting coefficient of the characteristic dimension in each linear subspace:
F_score_i = softmax(tanh(F_K_i) * F_W_score + F_b_score)
wherein F_score_i denotes the weighting coefficient of the feature dimension in the i-th linear subspace, obtained with a self-attention algorithm; F_W_score and F_b_score respectively denote the weight and the bias of the self-attention algorithm; softmax() denotes the normalized exponential function;
s43, calculating the value to be weighted in each linear subspace:
F_V_i = o_all_time * F_W_{i,v} + F_b_{i,v}
wherein F_V_i denotes the value to be weighted in the i-th linear subspace; F_W_{i,v} denotes the weight used to obtain the value to be weighted in the i-th linear subspace; F_b_{i,v} denotes the corresponding bias;
s44, calculating the output of each linear subspace, splicing the outputs of all the linear subspaces together, and summing in a time dimension to obtain the output weighted by a characteristic dimension:
F_output_i = F_score_i ⊙ F_V_i
F_Output = Σ_t Concat(F_output_1, ..., F_output_h)
wherein ⊙ denotes the Hadamard product and Σ_t denotes summation over the time dimension; Concat() concatenates two or more arrays; F_output_i denotes the output of the i-th linear subspace; F_Output denotes the feature-dimension-weighted output.
Preferably, in step S2, the LSTM model is calculated as follows:
f_t = σ(W_fh * h_{t-1} + W_fx * x_t + W_fc * C_{t-1} + b_f)
i_t = σ(W_ih * h_{t-1} + W_ix * x_t + W_ic * C_{t-1} + b_i)
C̃_t = tanh(W_ch * h_{t-1} + W_cx * x_t + b_c)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
o_t = σ(W_oh * h_{t-1} + W_ox * x_t + W_oc * C_t + b_o)
h_t = o_t ⊙ tanh(C_t)
wherein * denotes matrix multiplication and ⊙ denotes the Hadamard product; W_fh, W_fc, W_fx are respectively the weights of the forget gate f_t for the hidden-layer output, the cell state and the input, and b_f is the bias of the forget gate f_t; W_ih, W_ic, W_ix are respectively the weights of the information update value i_t for the hidden-layer output, the cell state and the input, and b_i is the bias of the information update value i_t; W_ch, W_cx are respectively the weights of the cell-state update value C̃_t for the hidden-layer output and the input, and b_c is the bias of the cell-state update value C̃_t; W_oh, W_ox, W_oc are respectively the weights of the output value o_t for the hidden-layer output, the input and the cell state, and b_o is the bias of the output value o_t; h_t, x_t, C_t respectively denote the hidden-layer output, the input and the cell state at time t.
Preferably, in step S5, the input of the fully connected layer is Concat(T_Output, F_Output), where T_Output denotes the time-dimension-weighted output and F_Output denotes the feature-dimension-weighted output.
Preferably, in step S6, the emotion recognition model is tested and verified by recall ratio:
recall = TP / (TP + FN)
wherein recall is the recall rate; TP is the number of samples predicted positive that are actually positive; FN is the number of samples predicted negative that are actually positive.
Preferably: in step S1, the extracted speech features retain timing information in the original speech data sample through the sequence relationship between the speech frames, and the dimensionality of the speech features varies with the actual length of the original speech data sample.
Beneficial effects: the invention has the following beneficial effects.
according to the voice emotion recognition method based on the multi-head attention mechanism, the multi-head attention mechanism replaces the traditional zoom point attention mechanism, the historical output and the last moment output of an LSTM model are combined, attention weighting is carried out on the time dimension of the LSTM model output, and the saturation of emotion in different time segments is distinguished; and the multi-head attention mechanism is expanded to the feature dimension, the emotion recognition capability of different features is distinguished, the speech emotion recognition performance can be improved, and the application prospect is good.
Drawings
FIG. 1 is a flowchart of a method of speech emotion recognition based on a multi-head attention mechanism according to the present invention;
FIG. 2 is a system block diagram of the speech emotion recognition method based on a multi-head attention mechanism of the present invention;
FIG. 3 is a graph of recognition performance on the eNTERFACE and GEMEP databases as a function of the number of subspaces (heads) in the time dimension;
FIG. 4 is a graph of recognition performance on the eNTERFACE and GEMEP databases as a function of the number of subspaces (heads) in the feature dimension.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The invention discloses a speech emotion recognition method based on a multi-head attention mechanism, which comprises the following steps of:
Step A, extracting frame-level speech features with time sequence information from the original speech data. The speech features retain the time sequence information of the original speech data through the sequential relationship between speech frames, and the dimensionality of the speech features varies with the actual length of the original speech data. The detailed speech feature set is shown in Table 1 below:
TABLE 1: frame-level speech feature set (reproduced as an image in the original publication)
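Since Table 1 is not reproduced in this text, the following is only a minimal sketch of the kind of frame-level, variable-length feature extraction that step A describes; the choice of MFCCs plus their deltas and of the librosa library is an assumption made for illustration, not the patent's actual feature set.

```python
import numpy as np
import librosa  # assumed feature-extraction backend for this sketch


def frame_level_features(wav_path, sr=16000, n_mfcc=13):
    """Extract frame-level features that preserve the temporal order of frames.

    Returns an array of shape (n_frames, n_features); n_frames varies with the
    actual length of the utterance, as required by step A.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
    delta = librosa.feature.delta(mfcc)                     # first-order dynamics
    return np.vstack([mfcc, delta]).T                       # (n_frames, 2 * n_mfcc)


# usage (the path is hypothetical):
# feats = frame_level_features("speech_sample.wav")
# print(feats.shape)   # e.g. (T, 26), with T depending on the utterance length
```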
Step B, establishing a long short-term memory (LSTM) model capable of processing time-series data. The model memorizes historical information through its cell units and screens out redundant historical information through the forget gate, which alleviates the long-term dependency problem and makes the model suitable for the dynamic speech features provided in step A, i.e. the model can be matched to the feature data.
The LSTM model is computed as follows, where * denotes matrix multiplication and ⊙ denotes the Hadamard product:
f_t = σ(W_fh * h_{t-1} + W_fx * x_t + W_fc * C_{t-1} + b_f) (1)
wherein W_fh, W_fc, W_fx are respectively the weights of the forget gate f_t for the hidden-layer output, the cell state and the input, and b_f is the bias of the forget gate f_t; W_fh, W_fc, W_fx and b_f are all parameters to be trained. The forget gate f_t is determined by the cell state C_{t-1} and hidden-layer output h_{t-1} of the previous time step together with the input x_t of the current time step, and expresses the degree to which historical information is retained.
i_t = σ(W_ih * h_{t-1} + W_ix * x_t + W_ic * C_{t-1} + b_i) (2)
wherein W_ih, W_ic, W_ix are respectively the weights of the information update value i_t for the hidden-layer output, the cell state and the input, and b_i is the bias of the information update value i_t; W_ih, W_ic, W_ix and b_i are all parameters to be trained. The information update value i_t is determined by C_{t-1}, h_{t-1} and x_t, and controls the amount of useful information admitted at the current time step.
C̃_t = tanh(W_ch * h_{t-1} + W_cx * x_t + b_c) (3)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t (4)
o_t = σ(W_oh * h_{t-1} + W_ox * x_t + W_oc * C_t + b_o) (5)
wherein W_ch, W_cx are respectively the weights of the cell-state update value C̃_t for the hidden-layer output and the input, and b_c is the bias of the cell-state update value C̃_t; W_ch, W_cx and b_c are all parameters to be trained. The cell-state update value C̃_t is determined by the hidden-layer output h_{t-1} of the previous time step and the input x_t of the current time step; the cell state C_t at the current time step is determined by the cell state C_{t-1} of the previous time step and the cell-state update value C̃_t. W_oh, W_ox, W_oc are respectively the weights of the output value o_t for the hidden-layer output, the input and the cell state, and b_o is the bias of the output value o_t; W_oh, W_ox, W_oc and b_o are all parameters to be trained. The output value o_t is determined by the hidden-layer output h_{t-1} of the previous time step, the input x_t and the cell state C_t of the current time step.
h_t = o_t ⊙ tanh(C_t) (6)
wherein the hidden-layer output h_t of the current time step is determined by the cell state C_t of the current time step and the output value o_t.
In formulas (1) to (6), the subscript t is the time step and corresponds to the frame index of the speech features; σ is the sigmoid function, σ(x) = 1/(1 + e^(-x)), whose value lies between 0 and 1 and therefore expresses a soft, fractional weighting.
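For concreteness, the following is a minimal NumPy sketch of the cell defined by formulas (1)-(6) and of how it produces the two quantities o_all_time and o_max_time used in steps C and D. It is an illustration only: the random initial values are for shape-checking, and training of the W and b parameters, batching and regularization are omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMCell:
    """One step of formulas (1)-(6); the gate equations also see the cell state
    (the W_fc, W_ic and W_oc terms), as in the text above."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)

        def W(cols):
            return rng.normal(scale=0.1, size=(hidden_dim, cols))

        def z():
            return np.zeros(hidden_dim)

        # forget gate, input (information update) gate, cell update, output gate
        self.W_fh, self.W_fx, self.W_fc, self.b_f = W(hidden_dim), W(input_dim), W(hidden_dim), z()
        self.W_ih, self.W_ix, self.W_ic, self.b_i = W(hidden_dim), W(input_dim), W(hidden_dim), z()
        self.W_ch, self.W_cx, self.b_c = W(hidden_dim), W(input_dim), z()
        self.W_oh, self.W_ox, self.W_oc, self.b_o = W(hidden_dim), W(input_dim), W(hidden_dim), z()

    def step(self, x_t, h_prev, C_prev):
        f_t = sigmoid(self.W_fh @ h_prev + self.W_fx @ x_t + self.W_fc @ C_prev + self.b_f)  # (1)
        i_t = sigmoid(self.W_ih @ h_prev + self.W_ix @ x_t + self.W_ic @ C_prev + self.b_i)  # (2)
        C_tilde = np.tanh(self.W_ch @ h_prev + self.W_cx @ x_t + self.b_c)                   # (3)
        C_t = f_t * C_prev + i_t * C_tilde                                                   # (4)
        o_t = sigmoid(self.W_oh @ h_prev + self.W_ox @ x_t + self.W_oc @ C_t + self.b_o)     # (5)
        h_t = o_t * np.tanh(C_t)                                                             # (6)
        return h_t, C_t


def run_lstm(cell, frames, hidden_dim):
    """Returns o_all_time (T x N matrix of hidden outputs at all time steps)
    and o_max_time (the hidden output of the last time step)."""
    h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
    outputs = []
    for x_t in frames:                    # frames: (T, input_dim) frame-level features
        h, C = cell.step(x_t, h, C)
        outputs.append(h)
    o_all_time = np.stack(outputs)        # (T, N)
    return o_all_time, o_all_time[-1]     # o_max_time is the last row


# toy usage: 120 frames of 26-dim features, 64 hidden units
rng = np.random.default_rng(1)
cell = LSTMCell(input_dim=26, hidden_dim=64)
o_all, o_last = run_lstm(cell, rng.normal(size=(120, 26)), hidden_dim=64)
print(o_all.shape, o_last.shape)          # (120, 64) (64,)
```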
Step C, replacing the traditional scaled dot-product attention mechanism with a multi-head attention mechanism, applying attention weighting to the time dimension of the LSTM model output, and distinguishing the emotional saturation of different time segments.
The multi-head attention mechanism avoids the one-sidedness of the single linear mapping used in the traditional scaled dot-product attention mechanism; "multi-head" means multiple linear projections, so a more robust information representation can be obtained. The calculation is as follows:
T_K_i = o_all_time * T_W_{i,k} + T_b_{i,k} (7)
T_Q_i = o_max_time * T_W_{i,q} + T_b_{i,q} (8)
wherein T_W_{i,k} and T_W_{i,q} are the weights and T_b_{i,k} and T_b_{i,q} the biases of the linear transformations; T_W_{i,k}, T_W_{i,q}, T_b_{i,k} and T_b_{i,q} are all parameters to be trained; i denotes the index of the linear subspace, there are h linear subspaces in total, and N corresponds to the number of hidden-layer units of the LSTM model; o_all_time is the matrix of outputs of the LSTM model at all time steps and o_max_time is the output of the LSTM model at the last time step. Because the LSTM model accumulates historical information, its output at the last time step carries relatively more effective information, so the invention takes o_max_time as the reference quantity against which the information value of the other time steps is measured; T_K_i and T_Q_i are respectively the projections of o_all_time and o_max_time in the i-th linear subspace.
T_score_i = softmax(T_Q_i * T_K_i^H) (9)
T_V_i = o_all_time * T_W_{i,v} + T_b_{i,v} (10)
T_output_i = T_score_i * T_V_i (11)
wherein softmax() denotes the normalized exponential function and T_K_i^H is the transpose of T_K_i; T_score_i is the weighting coefficient of the time dimension in the i-th linear subspace. The value to be weighted is the new projection T_V_i of o_all_time in the i-th linear subspace, obtained by a further linear transformation with weight T_W_{i,v} and bias T_b_{i,v}; T_W_{i,v} and T_b_{i,v} are parameters to be trained. Applying T_score_i to T_V_i gives the output T_output_i of each linear subspace. Finally, the outputs T_output_i of all linear subspaces are concatenated to form a completely new output T_Output:
T_Output = Concat(T_output_1, ..., T_output_h) (12)
where Concat() is used to concatenate two or more arrays.
Step D, extending the multi-head attention mechanism to the feature dimension, applying attention weighting to the output of the LSTM model in the feature dimension, and distinguishing the ability of different features to discriminate emotion.
Different features have different abilities to discriminate emotion; the weighting coefficients of the multi-head attention mechanism automatically select the deep features closely related to the emotion recognition task, so as to improve emotion recognition performance. The calculation is as follows:
F_K_i = o_all_time * F_W_{i,k} + F_b_{i,k} (13)
F_score_i = softmax(tanh(F_K_i) * F_W_score + F_b_score) (14)
F_V_i = o_all_time * F_W_{i,v} + F_b_{i,v} (15)
F_output_i = F_score_i ⊙ F_V_i (16)
wherein F_W_{i,k} and F_b_{i,k} are respectively the weight and the bias of the linear transformation of o_all_time, and both are parameters to be trained; F_K_i is the projection of o_all_time in the i-th linear subspace. F_score_i is the weighting coefficient of the feature dimension in the i-th linear subspace, obtained with a self-attention algorithm; F_W_score and F_b_score are the weight and the bias to be trained in the self-attention algorithm. As in the time dimension, the value to be weighted is the new projection F_V_i of o_all_time in the i-th linear subspace, obtained by a further linear transformation with weight F_W_{i,v} and bias F_b_{i,v}; F_W_{i,v} and F_b_{i,v} are parameters to be trained. Applying F_score_i to F_V_i gives the output F_output_i of each linear subspace; unlike the time dimension, the product between F_score_i and F_V_i is the Hadamard product ⊙. The final output F_Output concatenates the outputs F_output_i of all linear subspaces and sums over the time dimension; this summation is analogous to a traditional static statistic, except that the weighting coefficients are obtained by training rather than designed by hand:
F_Output = Σ_t Concat(F_output_1, ..., F_output_h) (17)
where Σ_t denotes summation over the time dimension.
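A combined NumPy sketch of the two attention branches, formulas (7)-(12) for the time dimension (step C) and formulas (13)-(17) for the feature dimension (step D), is given below. The per-head projection size N/h, the shapes of F_W_score and F_b_score, and the axis over which each softmax is taken are assumptions made for illustration; the random matrices stand in for the trained T_W/T_b and F_W/F_b parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def time_attention(o_all_time, o_max_time, heads):
    """Formulas (7)-(12): o_max_time queries all time steps; softmax over time."""
    outs = []
    q = o_max_time[None, :]                              # (1, N)
    for p in heads:
        T_K = o_all_time @ p["Wk"] + p["bk"]             # (T, d)   (7)
        T_Q = q @ p["Wq"] + p["bq"]                      # (1, d)   (8)
        T_score = softmax(T_Q @ T_K.T, axis=-1)          # (1, T)   (9)
        T_V = o_all_time @ p["Wv"] + p["bv"]             # (T, d)   (10)
        outs.append(T_score @ T_V)                       # (1, d)   (11)
    return np.concatenate(outs, axis=-1).ravel()         # (N,)     (12)


def feature_attention(o_all_time, heads, W_score, b_score):
    """Formulas (13)-(17): self-attention scores, Hadamard weighting, sum over time."""
    outs = []
    for p in heads:
        F_K = o_all_time @ p["Wk"] + p["bk"]                          # (T, d)   (13)
        F_score = softmax(np.tanh(F_K) @ W_score + b_score, axis=-1)  # (T, d)   (14)
        F_V = o_all_time @ p["Wv"] + p["bv"]                          # (T, d)   (15)
        outs.append(F_score * F_V)                                    # Hadamard (16)
    return np.concatenate(outs, axis=-1).sum(axis=0)                  # (N,)     (17)


# toy usage with random (untrained) parameters: T frames, N hidden units, h heads
T, N, h = 120, 64, 8
d = N // h
rng = np.random.default_rng(2)


def head():
    return {"Wk": rng.normal(size=(N, d)), "Wq": rng.normal(size=(N, d)),
            "Wv": rng.normal(size=(N, d)),
            "bk": np.zeros(d), "bq": np.zeros(d), "bv": np.zeros(d)}


o_all = rng.normal(size=(T, N))
T_Output = time_attention(o_all, o_all[-1], [head() for _ in range(h)])
F_Output = feature_attention(o_all, [head() for _ in range(h)],
                             rng.normal(size=(d, d)), np.zeros(d))
print(T_Output.shape, F_Output.shape)    # (64,) (64,)
```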
and E, adding a full connection layer and a softmax layer behind the LSTM model, wherein the weighted result of the time dimension and the weighted result of the characteristic dimension are spliced, namely Concat (T _ Output, F _ Output) is used as the input of the full connection layer, the Output of the full connection layer is input into the softmax layer for normalization, the probability of each speech emotion category is Output, and the maximum probability is the final speech emotion category.
Step F, applying known original speech data samples and their speech emotion categories to the emotion recognition model of steps A to E, training the emotion recognition model, and evaluating the recognition performance of the trained model; then inputting unknown original speech data samples into the trained emotion recognition model and outputting the corresponding speech emotion categories.
To evaluate the performance of the model, the recall rate commonly used in emotion recognition is compared with that of the traditional scaled dot-product attention algorithm, and the overall evaluation index is the Unweighted Average Recall (UAR), i.e. the unweighted mean of the per-class recall rates.
recall = TP / (TP + FN)
wherein TP is the number of samples predicted positive that are actually positive, and FN is the number of samples predicted negative that are actually positive.
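The following sketch computes the per-class recall and the UAR exactly as defined above; the toy labels are illustrative only and are not data from the experiments reported below.

```python
import numpy as np


def recall_and_uar(y_true, y_pred, n_classes):
    """Per-class recall = TP / (TP + FN); UAR = unweighted mean of the recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        recalls.append(tp / (tp + fn) if (tp + fn) > 0 else 0.0)
    return recalls, float(np.mean(recalls))


# toy usage with 6 emotion classes
recalls, uar = recall_and_uar([0, 1, 2, 2, 3, 4, 5, 5],
                              [0, 1, 2, 1, 3, 4, 5, 2], n_classes=6)
print([round(r, 2) for r in recalls], round(uar, 2))
```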
To verify the effectiveness of the invention, tests were performed on two emotion databases, eNTERFACE and GEMEP. The results are shown in Tables 2 and 3 below: Table 2 gives the results of the plain LSTM model, the LSTM model combined with scaled dot-product attention and the LSTM model combined with multi-head attention on the eNTERFACE database, and Table 3 gives the corresponding results on the GEMEP database.
As shown in Table 2, the UAR of the proposed LSTM model combined with multi-head attention reaches 91.2% on the eNTERFACE database, exceeding the other models. Compared with scaled dot-product attention, the recall of most emotions is improved, showing that multiple linear subspaces obtain more robust emotion information than a single attention function. As shown in Table 3, the LSTM model combined with multi-head attention also gives the best result on the GEMEP database, with a UAR of 62%. For the most easily recognized emotions, those with recall rates above 60%, it achieves a clear improvement over the other models in the table.
TABLE 2
Model | Anger | Disgust | Fear | Happiness | Sadness | Surprise | UAR
LSTM | 88.4% | 64.3% | 76.6% | 83.8% | 68.9% | 73.9% | 75.8%
LSTM + scaled dot-product attention | 88.4% | 85.7% | 80.9% | 97.3% | 86.7% | 84.8% | 86.9%
LSTM + multi-head attention | 90.7% | 90.5% | 87.2% | 94.6% | 93.3% | 91.3% | 91.2%
TABLE 3: results of the LSTM model, the LSTM model with scaled dot-product attention and the LSTM model with multi-head attention on the GEMEP database (reproduced as an image in the original publication)
To study the influence of the number of attention heads in the time dimension and the feature dimension on performance, the experiment first fixes the number of feature-dimension heads at 1 and studies the relationship between the number of time-dimension heads and UAR, as shown in FIG. 3: on both the GEMEP and eNTERFACE databases, UAR gradually increases as the number of time-dimension heads increases, reaching maxima of 58% and 91.2% respectively when the number of heads is 8. This shows that multiple linear subspaces can complement one another, strengthening the robustness of the emotion information and preventing the one-sidedness of a single subspace. However, this positive correlation is bounded: when the number of heads exceeds 8, UAR drops, since more heads mean more linear subspaces to learn, which increases the learning burden of the network, especially when training data are limited.
On this basis, the number of time-dimension heads is fixed at 8 and the influence of the number of feature-dimension heads on UAR is studied, as shown in FIG. 4: on the GEMEP database, UAR increases with the number of feature-dimension heads, reaching a maximum of 62% when the number of heads is 16, and decreases when the number of heads increases further. In contrast, on the eNTERFACE database, UAR falls as the number of feature-dimension heads rises. This may be due to the different complexity of the emotion recognition task in the two databases: GEMEP has twice as many emotion classes as eNTERFACE and is therefore more complex, and the more emotion categories there are, the more severely the emotional information is interwoven in the feature space and the harder recognition becomes. Multiple linear subspaces can better disentangle the overlapping emotional characteristics and improve performance. For the eNTERFACE database with only 6 emotions, there is less overlap between emotions, so increasing the number of linear subspaces only increases the training difficulty and naturally hurts performance. Therefore, the GEMEP database achieves its optimal UAR with 8 time-dimension heads and 16 feature-dimension heads, while the eNTERFACE database achieves its optimal UAR with 8 time-dimension heads and 1 feature-dimension head.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (7)

1. A speech emotion recognition method based on a multi-head attention mechanism is characterized by comprising the following steps:
s1, extracting frame-level voice features with time sequence information from the original voice data sample;
s2, establishing an LSTM model with the capability of processing time sequence data, and inputting the voice characteristics with time sequence information in the step S1 into the LSTM model;
s3, performing attention weighting on the output of the LSTM model in a time dimension by using a multi-head attention mechanism, and combining the matrix output at all times of the LSTM model and the output at the last time to obtain the output weighted in the time dimension;
s4, performing attention weighting on the output of the LSTM model on the characteristic dimension by using a multi-head attention mechanism, and combining the matrix output at all times of the LSTM model to obtain the output after the characteristic dimension weighting;
s5, splicing the output after the time dimension weighting and the output after the feature dimension weighting, inputting the spliced output into a full connection layer, and obtaining the speech emotion category through a softmax layer;
s6, training the emotion recognition model in the steps S1 to S5 by using the known original voice data sample and the voice emotion type thereof, testing and verifying the trained emotion recognition model, and evaluating the recognition performance;
and S7, inputting the unknown original voice data sample into the trained emotion recognition model, and outputting the corresponding voice emotion type.
2. The method for speech emotion recognition based on multi-head attention mechanism as claimed in claim 1, wherein the step S3 of performing attention weighting on the output of LSTM model in time dimension by using multi-head attention mechanism includes the following steps:
s31, calculating the projection of the matrix output at all time points of the LSTM model and the output at the last time point in each linear subspace respectively:
T_K_i = o_all_time * T_W_{i,k} + T_b_{i,k}
T_Q_i = o_max_time * T_W_{i,q} + T_b_{i,q}
wherein * denotes matrix multiplication; o_all_time denotes the matrix of outputs of the LSTM model at all time steps, and o_max_time denotes the output of the LSTM model at the last time step; T_W_{i,k} and T_W_{i,q} respectively denote the weights of o_all_time and o_max_time in the linear transformation of the i-th linear subspace, where i denotes the index of the linear subspace, there are h linear subspaces in total, and N denotes the number of hidden-layer units of the LSTM model; T_b_{i,k} and T_b_{i,q} respectively denote the biases of o_all_time and o_max_time in the linear transformation of the i-th linear subspace; T_K_i and T_Q_i respectively denote the projections of o_all_time and o_max_time in the i-th linear subspace;
s32, calculating a weighting coefficient of a time dimension in each linear subspace:
T_score_i = softmax(T_Q_i * T_K_i^H)
wherein T_score_i denotes the weighting coefficient of the time dimension in the i-th linear subspace, T_K_i^H denotes the transpose of T_K_i, and softmax() denotes the normalized exponential function;
s33, calculating the value to be weighted in each linear subspace:
T_V_i = o_all_time * T_W_{i,v} + T_b_{i,v}
wherein T_V_i denotes the value to be weighted in the i-th linear subspace; T_W_{i,v} denotes the weight used to obtain the value to be weighted in the i-th linear subspace; T_b_{i,v} denotes the corresponding bias;
s34, calculating the output of each linear subspace, and splicing the outputs of all the linear subspaces together to obtain the output weighted by the time dimension:
T_output_i = T_score_i * T_V_i
T_Output = Concat(T_output_1, ..., T_output_h)
wherein Concat() concatenates two or more arrays; T_output_i denotes the output of the i-th linear subspace; T_Output denotes the time-dimension-weighted output.
3. The method for speech emotion recognition based on multi-head attention mechanism as claimed in claim 1, wherein the step S4 of performing attention weighting on the output of LSTM model in feature dimension by using multi-head attention mechanism includes the following steps:
s41, calculating the projection of the matrix of outputs of the LSTM model at all time steps in each linear subspace:
F_K_i = o_all_time * F_W_{i,k} + F_b_{i,k}
wherein * denotes matrix multiplication; o_all_time denotes the matrix of outputs of the LSTM model at all time steps; F_W_{i,k} denotes the weight of o_all_time in the linear transformation of the i-th linear subspace; F_b_{i,k} denotes the bias of that linear transformation; F_K_i denotes the projection of o_all_time in the i-th linear subspace;
s42, calculating a weighting coefficient of the characteristic dimension in each linear subspace:
F_score_i = softmax(tanh(F_K_i) * F_W_score + F_b_score)
wherein F_score_i denotes the weighting coefficient of the feature dimension in the i-th linear subspace, obtained with a self-attention algorithm; F_W_score and F_b_score respectively denote the weight and the bias of the self-attention algorithm; softmax() denotes the normalized exponential function;
s43, calculating the value to be weighted in each linear subspace:
F_V_i = o_all_time * F_W_{i,v} + F_b_{i,v}
wherein F_V_i denotes the value to be weighted in the i-th linear subspace; F_W_{i,v} denotes the weight used to obtain the value to be weighted in the i-th linear subspace; F_b_{i,v} denotes the corresponding bias;
s44, calculating the output of each linear subspace, splicing the outputs of all the linear subspaces together, and summing in a time dimension to obtain the output weighted by a characteristic dimension:
F_output_i = F_score_i ⊙ F_V_i
F_Output = Σ_t Concat(F_output_1, ..., F_output_h)
wherein ⊙ denotes the Hadamard product and Σ_t denotes summation over the time dimension; Concat() concatenates two or more arrays; F_output_i denotes the output of the i-th linear subspace; F_Output denotes the feature-dimension-weighted output.
4. The method for speech emotion recognition based on multi-head attention mechanism as claimed in claim 1, wherein in step S2, the LSTM model is calculated as follows:
f_t = σ(W_fh * h_{t-1} + W_fx * x_t + W_fc * C_{t-1} + b_f)
i_t = σ(W_ih * h_{t-1} + W_ix * x_t + W_ic * C_{t-1} + b_i)
C̃_t = tanh(W_ch * h_{t-1} + W_cx * x_t + b_c)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
o_t = σ(W_oh * h_{t-1} + W_ox * x_t + W_oc * C_t + b_o)
h_t = o_t ⊙ tanh(C_t)
wherein * denotes matrix multiplication and ⊙ denotes the Hadamard product; W_fh, W_fc, W_fx are respectively the weights of the forget gate f_t for the hidden-layer output, the cell state and the input, and b_f is the bias of the forget gate f_t; W_ih, W_ic, W_ix are respectively the weights of the information update value i_t for the hidden-layer output, the cell state and the input, and b_i is the bias of the information update value i_t; W_ch, W_cx are respectively the weights of the cell-state update value C̃_t for the hidden-layer output and the input, and b_c is the bias of the cell-state update value C̃_t; W_oh, W_ox, W_oc are respectively the weights of the output value o_t for the hidden-layer output, the input and the cell state, and b_o is the bias of the output value o_t; h_t, x_t, C_t respectively denote the hidden-layer output, the input and the cell state at time t.
5. The method for speech emotion recognition based on a multi-head attention mechanism as claimed in claim 1, wherein in step S5, the input of the fully connected layer is Concat(T_Output, F_Output), where T_Output denotes the time-dimension-weighted output and F_Output denotes the feature-dimension-weighted output.
6. The method for speech emotion recognition based on multi-head attention mechanism as claimed in claim 1, wherein in step S6, the emotion recognition model is tested and verified by recall ratio:
recall = TP / (TP + FN)
wherein recall is the recall rate; TP is the number of samples predicted positive that are actually positive; FN is the number of samples predicted negative that are actually positive.
7. The method for recognizing the speech emotion based on the multi-head attention mechanism as claimed in claim 1, wherein: in step S1, the extracted speech features retain timing information in the original speech data sample through the sequence relationship between the speech frames, and the dimensionality of the speech features varies with the actual length of the original speech data sample.
CN202011041520.9A 2020-09-28 2020-09-28 Voice emotion recognition method based on multi-head attention mechanism Active CN112185423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011041520.9A CN112185423B (en) 2020-09-28 2020-09-28 Voice emotion recognition method based on multi-head attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011041520.9A CN112185423B (en) 2020-09-28 2020-09-28 Voice emotion recognition method based on multi-head attention mechanism

Publications (2)

Publication Number Publication Date
CN112185423A true CN112185423A (en) 2021-01-05
CN112185423B CN112185423B (en) 2023-11-21

Family

ID=73943860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011041520.9A Active CN112185423B (en) 2020-09-28 2020-09-28 Voice emotion recognition method based on multi-head attention mechanism

Country Status (1)

Country Link
CN (1) CN112185423B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN113053418A (en) * 2021-04-30 2021-06-29 南京工程学院 Method for quickly identifying speech emotion category based on long-time and short-time memory network
CN113628639A (en) * 2021-07-06 2021-11-09 哈尔滨理工大学 Voice emotion recognition method based on multi-head attention mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243494A (en) * 2018-10-30 2019-01-18 南京工程学院 Child emotion recognition method based on a multi-attention-mechanism long short-term memory network
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
JP2019020684A (en) * 2017-07-21 2019-02-07 日本電信電話株式会社 Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program
CN110223714A (en) * 2019-06-03 2019-09-10 杭州哲信信息技术有限公司 A kind of voice-based Emotion identification method
WO2020173133A1 (en) * 2019-02-27 2020-09-03 平安科技(深圳)有限公司 Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019020684A (en) * 2017-07-21 2019-02-07 日本電信電話株式会社 Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109243494A (en) * 2018-10-30 2019-01-18 南京工程学院 Child emotion recognition method based on a multi-attention-mechanism long short-term memory network
WO2020173133A1 (en) * 2019-02-27 2020-09-03 平安科技(深圳)有限公司 Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium
CN110223714A (en) * 2019-06-03 2019-09-10 杭州哲信信息技术有限公司 A kind of voice-based Emotion identification method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RUNNAN LI et al.: "Towards Discriminative Representation Learning for Speech Emotion Recognition", IJCAI
余莉萍 et al.: "Children's speech emotion recognition model based on improved LSTM" (in Chinese), Computer Engineering, vol. 46, no. 06
卢艳: "Research on speech emotion recognition combining neural networks and attention mechanisms" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology, no. 08, pages 41-44

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN112766172B (en) * 2021-01-21 2024-02-02 北京师范大学 Facial continuous expression recognition method based on time sequence attention mechanism
CN113053418A (en) * 2021-04-30 2021-06-29 南京工程学院 Method for quickly identifying speech emotion category based on long-time and short-time memory network
CN113628639A (en) * 2021-07-06 2021-11-09 哈尔滨理工大学 Voice emotion recognition method based on multi-head attention mechanism

Also Published As

Publication number Publication date
CN112185423B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN112185423B (en) Voice emotion recognition method based on multi-head attention mechanism
CN112784798B (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN112784730B (en) Multi-modal emotion recognition method based on time domain convolutional network
CN111310672A (en) Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN107729999A (en) Consider the deep neural network compression method of matrix correlation
CN108776796B (en) Action identification method based on global space-time attention model
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN110728656A (en) Meta-learning-based no-reference image quality data processing method and intelligent terminal
CN110534133B (en) Voice emotion recognition system and voice emotion recognition method
CN112687374B (en) Psychological crisis early warning method based on text and image information joint calculation
CN111178319A (en) Video behavior identification method based on compression reward and punishment mechanism
CN111444951A (en) Method and device for generating sample identification model, computer equipment and storage medium
Eisele et al. A comparative study of linear feature transformation techniques for automatic speech recognition
CN112100439B (en) Recommendation method based on dependency embedding and neural attention network
CN112966568A (en) Video customer service quality analysis method and device
CN115510286A (en) Multi-relation cognitive diagnosis method based on graph convolution network
Chatzikonstantinou et al. Neural network compression using higher-order statistics and auxiliary reconstruction losses
Garcia-Romero et al. Joint Factor Analysis for Speaker Recognition Reinterpreted as Signal Coding Using Overcomplete Dictionaries.
CN115795015A (en) Comprehensive knowledge tracking method for enhancing test question difficulty
Reshma et al. A survey on speech emotion recognition
CN113239678B (en) Multi-angle attention feature matching method and system for answer selection
CN114819107A (en) Mixed data assimilation method based on deep learning
CN113053418A (en) Method for quickly identifying speech emotion category based on long-time and short-time memory network
CN113469977A (en) Flaw detection device and method based on distillation learning mechanism and storage medium
CN112634947A (en) Animal voice and emotion feature set sequencing and identifying method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant