CN112185423A - Speech emotion recognition method based on multi-head attention mechanism - Google Patents

Speech emotion recognition method based on multi-head attention mechanism

Info

Publication number
CN112185423A
CN112185423A CN202011041520.9A CN202011041520A CN112185423A
Authority
CN
China
Prior art keywords
output
time
representing
dimension
weighted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011041520.9A
Other languages
Chinese (zh)
Other versions
CN112185423B (en)
Inventor
谢跃
梁瑞宇
杜静
包永强
赵力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN202011041520.9A priority Critical patent/CN112185423B/en
Publication of CN112185423A publication Critical patent/CN112185423A/en
Application granted granted Critical
Publication of CN112185423B publication Critical patent/CN112185423B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition method based on a multi-head attention mechanism, which comprises the following steps: extracting speech features with time sequence information from original speech data; establishing an LSTM model capable of processing time sequences; applying attention weighting with multi-head attention to the output of the LSTM model in the time dimension and in the feature dimension respectively, so as to distinguish the emotional saturation of different time segments and the ability of different features to discriminate emotion; concatenating the time-dimension-weighted output and the feature-dimension-weighted output and feeding the result into a fully connected layer and a softmax layer to obtain the speech emotion category; training the emotion recognition model with known original speech data samples and evaluating its recognition performance; and inputting unknown original speech data samples into the trained emotion recognition model to output the corresponding speech emotion categories. The invention strengthens the output of the LSTM model through a multi-head attention mechanism to improve speech emotion recognition performance, and has good application prospects.

Description

Speech emotion recognition method based on multi-head attention mechanism
Technical Field
The invention relates to the technical field of speech emotion recognition, in particular to a speech emotion recognition method based on a multi-head attention mechanism.
Background
In recent years, Speech Emotion Recognition (SER) has received increasing attention because of its great practical value in human-computer interaction. As deep learning has progressed, neural networks of different structures have been investigated for SER. For example, Long Short-Term Memory (LSTM), a variant of the Recurrent Neural Network (RNN), can model and store information over longer time spans, which allows it to achieve good classification performance on SER tasks.
In most applications of LSTM, the output of its last time step is selected as the input to the final classifier. In this framework the LSTM acts as a feature extractor, and since its effective temporal range is not unlimited, the feature information extracted at the last time step may be insufficient. For this reason, some researchers in the field of computer vision have equipped convolutional neural networks with attention mechanisms that automatically learn which parts of an image matter most to the final performance. Meanwhile, Bahdanau introduced an attention mechanism into LSTM-based end-to-end models, bringing a significant improvement in machine translation. For emotion recognition tasks, Tao used an attention mechanism to update the cell state of the LSTM, attending to information between cells and taking more historical cell states into account. Mirsamadi proposed local attention, which computes frame weights with an attention parameter vector.
The traditional scaled dot-product attention algorithm executes a single attention function and is therefore somewhat one-sided. Compared with such a single-head attention mechanism, a multi-head attention mechanism maps the attention vectors into several new subspaces through different linear projections, so that the model can jointly attend to information from different feature subspaces at different positions, effectively reducing this one-sidedness.
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, the invention discloses a speech emotion recognition method based on a multi-head attention mechanism. The method replaces the traditional scaled dot-product attention mechanism with a multi-head attention mechanism, applies attention weighting in the time dimension to the historical outputs of the LSTM together with its output at the last time step so as to distinguish the emotional saturation of different time segments, and extends the multi-head attention mechanism to the feature dimension so as to distinguish the ability of different features to discriminate emotion, thereby improving speech emotion recognition performance.
Technical scheme: the invention adopts the following technical scheme. A speech emotion recognition method based on a multi-head attention mechanism, characterized by comprising the following steps:
s1, extracting frame-level voice features with time sequence information from the original voice data sample;
s2, establishing an LSTM model with the capability of processing time sequence data, and inputting the voice characteristics with time sequence information in the step S1 into the LSTM model;
s3, performing attention weighting on the output of the LSTM model in a time dimension by using a multi-head attention mechanism, and combining the matrix output at all times of the LSTM model and the output at the last time to obtain the output weighted in the time dimension;
s4, performing attention weighting on the output of the LSTM model on the characteristic dimension by using a multi-head attention mechanism, and combining the matrix output at all times of the LSTM model to obtain the output after the characteristic dimension weighting;
s5, splicing the output after the time dimension weighting and the output after the feature dimension weighting, inputting the spliced output into a full connection layer, and obtaining the speech emotion category through a softmax layer;
s6, training the emotion recognition model in the steps S1 to S5 by using the known original voice data sample and the voice emotion type thereof, testing and verifying the trained emotion recognition model, and evaluating the recognition performance;
and S7, inputting the unknown original voice data sample into the trained emotion recognition model, and outputting the corresponding voice emotion type.
Preferably, the step S3 of weighting the output of the LSTM model in the time dimension by using a multi-head attention mechanism includes the following steps:
s31, calculating the projection of the matrix output at all time points of the LSTM model and the output at the last time point in each linear subspace respectively:
T_K_i = o_all_time * T_W_{i,k} + T_b_{i,k}
T_Q_i = o_max_time * T_W_{i,q} + T_b_{i,q}
wherein * denotes matrix multiplication; o_all_time denotes the matrix of outputs of the LSTM model at all time steps, and o_max_time denotes the output of the LSTM model at the last time step; T_W_{i,k} and T_W_{i,q} respectively denote the weights of o_all_time and o_max_time in the linear transformation of the i-th linear subspace, where i denotes the index of the linear subspace, there are h linear subspaces in total, and N denotes the number of hidden-layer units of the LSTM model; T_b_{i,k} and T_b_{i,q} respectively denote the biases of o_all_time and o_max_time in the linear transformation of the i-th linear subspace; T_K_i and T_Q_i respectively denote the projections of o_all_time and o_max_time in the i-th linear subspace;
s32, calculating a weighting coefficient of a time dimension in each linear subspace:
T_score_i = softmax(T_Q_i * T_K_i^H)
wherein T_score_i denotes the weighting coefficient of the time dimension in the i-th linear subspace, T_K_i^H denotes the transpose of T_K_i, and softmax() denotes the normalized exponential function;
s33, calculating the value to be weighted in each linear subspace:
T_V_i = o_all_time * T_W_{i,v} + T_b_{i,v}
wherein T_V_i denotes the value to be weighted in the i-th linear subspace; T_W_{i,v} denotes the weight used to obtain the value to be weighted in the i-th linear subspace; T_b_{i,v} denotes the corresponding bias;
s34, calculating the output of each linear subspace, and splicing the outputs of all the linear subspaces together to obtain the output weighted by the time dimension:
T_output_i = T_score_i * T_V_i
T_Output = Concat(T_output_1, ..., T_output_h)
wherein Concat() concatenates two or more arrays; T_output_i denotes the output of the i-th linear subspace; T_Output denotes the time-dimension-weighted output.
Preferably, the step S4 of weighting the output of the LSTM model in the feature dimension by using a multi-head attention mechanism includes the following steps:
s41, calculating the projection of the matrix of outputs of the LSTM model at all time steps in each linear subspace:
F_K_i = o_all_time * F_W_{i,k} + F_b_{i,k}
wherein * denotes matrix multiplication; o_all_time denotes the matrix of outputs of the LSTM model at all time steps; F_W_{i,k} denotes the weight of o_all_time in the linear transformation of the i-th linear subspace; F_b_{i,k} denotes the bias of that linear transformation; F_K_i denotes the projection of o_all_time in the i-th linear subspace;
s42, calculating a weighting coefficient of the characteristic dimension in each linear subspace:
F_score_i = softmax(tanh(F_K_i) * F_W_score + F_b_score)
wherein F_score_i denotes the weighting coefficient of the feature dimension in the i-th linear subspace, obtained with a self-attention algorithm; F_W_score and F_b_score respectively denote the weight and the bias of the self-attention algorithm; softmax() denotes the normalized exponential function;
s43, calculating the value to be weighted in each linear subspace:
F_V_i = o_all_time * F_W_{i,v} + F_b_{i,v}
wherein F_V_i denotes the value to be weighted in the i-th linear subspace; F_W_{i,v} denotes the weight used to obtain the value to be weighted in the i-th linear subspace; F_b_{i,v} denotes the corresponding bias;
s44, calculating the output of each linear subspace, splicing the outputs of all the linear subspaces together, and summing in a time dimension to obtain the output weighted by a characteristic dimension:
F_output_i = F_score_i ⊙ F_V_i
F_Output = Σ_t Concat(F_output_1, ..., F_output_h)
wherein ⊙ denotes the Hadamard product and Σ_t denotes summation over the time dimension; Concat() concatenates two or more arrays; F_output_i denotes the output of the i-th linear subspace; F_Output denotes the feature-dimension-weighted output.
Preferably, in step S2, the LSTM model is calculated as follows:
f_t = σ(W_fh * h_{t-1} + W_fx * x_t + W_fc * C_{t-1} + b_f)
i_t = σ(W_ih * h_{t-1} + W_ix * x_t + W_ic * C_{t-1} + b_i)
C̃_t = tanh(W_ch * h_{t-1} + W_cx * x_t + b_c)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
o_t = σ(W_oh * h_{t-1} + W_ox * x_t + W_oc * C_t + b_o)
h_t = o_t ⊙ tanh(C_t)
wherein * denotes matrix multiplication and ⊙ denotes the Hadamard product; W_fh, W_fc, W_fx are respectively the weights of the forget gate f_t for the hidden-layer output, the cell state and the input, and b_f is the bias of the forget gate f_t; W_ih, W_ic, W_ix are respectively the weights of the information update value i_t for the hidden-layer output, the cell state and the input, and b_i is the bias of the information update value i_t; W_ch, W_cx are respectively the weights of the cell-state update value C̃_t for the hidden-layer output and the input, and b_c is the bias of the cell-state update value C̃_t; W_oh, W_ox, W_oc are respectively the weights of the output value o_t for the hidden-layer output, the input and the cell state, and b_o is the bias of the output value o_t; h_t, x_t, C_t respectively denote the hidden-layer output, the input and the cell state at time t.
Preferably, in step S5, the input of the fully connected layer is Concat(T_Output, F_Output), where T_Output denotes the time-dimension-weighted output and F_Output denotes the feature-dimension-weighted output.
Preferably, in step S6, the emotion recognition model is tested and verified by recall ratio:
recall = TP / (TP + FN)
wherein recall is the recall rate; TP is the number of samples predicted positive that are actually positive; FN is the number of samples predicted negative that are actually positive.
Preferably: in step S1, the extracted speech features retain timing information in the original speech data sample through the sequence relationship between the speech frames, and the dimensionality of the speech features varies with the actual length of the original speech data sample.
Beneficial effects: the invention has the following beneficial effects.
according to the voice emotion recognition method based on the multi-head attention mechanism, the multi-head attention mechanism replaces the traditional zoom point attention mechanism, the historical output and the last moment output of an LSTM model are combined, attention weighting is carried out on the time dimension of the LSTM model output, and the saturation of emotion in different time segments is distinguished; and the multi-head attention mechanism is expanded to the feature dimension, the emotion recognition capability of different features is distinguished, the speech emotion recognition performance can be improved, and the application prospect is good.
Drawings
FIG. 1 is a flowchart of a method of speech emotion recognition based on a multi-head attention mechanism according to the present invention;
FIG. 2 is a system block diagram of the speech emotion recognition method based on a multi-head attention mechanism of the present invention;
FIG. 3 is a graph of recognition performance on the eNTERFACE and GEMEP databases as a function of the number of subspaces (heads) in the time dimension;
FIG. 4 is a graph of recognition performance on the eNTERFACE and GEMEP databases as a function of the number of subspaces (heads) in the feature dimension.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The invention discloses a speech emotion recognition method based on a multi-head attention mechanism, which comprises the following steps of:
Step A, extracting frame-level speech features with time sequence information from the original speech data. The speech features retain the time sequence information of the original speech data through the sequential relationship between speech frames, and the dimensionality of the speech features varies with the actual length of the original speech data. The detailed speech feature set is shown in Table 1 below:
TABLE 1: frame-level speech feature set (reproduced as an image in the original publication)
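Since Table 1 is not reproduced in this text, the following is only a minimal sketch of the kind of frame-level, variable-length feature extraction that step A describes; the choice of MFCCs plus their deltas and of the librosa library is an assumption made for illustration, not the patent's actual feature set.

```python
import numpy as np
import librosa  # assumed feature-extraction backend for this sketch


def frame_level_features(wav_path, sr=16000, n_mfcc=13):
    """Extract frame-level features that preserve the temporal order of frames.

    Returns an array of shape (n_frames, n_features); n_frames varies with the
    actual length of the utterance, as required by step A.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
    delta = librosa.feature.delta(mfcc)                     # first-order dynamics
    return np.vstack([mfcc, delta]).T                       # (n_frames, 2 * n_mfcc)


# usage (the path is hypothetical):
# feats = frame_level_features("speech_sample.wav")
# print(feats.shape)   # e.g. (T, 26), with T depending on the utterance length
```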
Step B, establishing a long short-term memory (LSTM) model capable of processing time-series data. The model memorizes historical information through its cell units and screens out redundant historical information through the forget gate, which alleviates the long-term dependency problem and makes the model suitable for the dynamic speech features provided in step A, i.e. the model can be matched to the feature data.
The LSTM model is computed as follows, where * denotes matrix multiplication and ⊙ denotes the Hadamard product:
f_t = σ(W_fh * h_{t-1} + W_fx * x_t + W_fc * C_{t-1} + b_f) (1)
wherein W_fh, W_fc, W_fx are respectively the weights of the forget gate f_t for the hidden-layer output, the cell state and the input, and b_f is the bias of the forget gate f_t; W_fh, W_fc, W_fx and b_f are all parameters to be trained. The forget gate f_t is determined by the cell state C_{t-1} and hidden-layer output h_{t-1} of the previous time step together with the input x_t of the current time step, and expresses the degree to which historical information is retained.
i_t = σ(W_ih * h_{t-1} + W_ix * x_t + W_ic * C_{t-1} + b_i) (2)
wherein W_ih, W_ic, W_ix are respectively the weights of the information update value i_t for the hidden-layer output, the cell state and the input, and b_i is the bias of the information update value i_t; W_ih, W_ic, W_ix and b_i are all parameters to be trained. The information update value i_t is determined by C_{t-1}, h_{t-1} and x_t, and controls the amount of useful information admitted at the current time step.
C̃_t = tanh(W_ch * h_{t-1} + W_cx * x_t + b_c) (3)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t (4)
o_t = σ(W_oh * h_{t-1} + W_ox * x_t + W_oc * C_t + b_o) (5)
wherein W_ch, W_cx are respectively the weights of the cell-state update value C̃_t for the hidden-layer output and the input, and b_c is the bias of the cell-state update value C̃_t; W_ch, W_cx and b_c are all parameters to be trained. The cell-state update value C̃_t is determined by the hidden-layer output h_{t-1} of the previous time step and the input x_t of the current time step; the cell state C_t at the current time step is determined by the cell state C_{t-1} of the previous time step and the cell-state update value C̃_t. W_oh, W_ox, W_oc are respectively the weights of the output value o_t for the hidden-layer output, the input and the cell state, and b_o is the bias of the output value o_t; W_oh, W_ox, W_oc and b_o are all parameters to be trained. The output value o_t is determined by the hidden-layer output h_{t-1} of the previous time step, the input x_t and the cell state C_t of the current time step.
h_t = o_t ⊙ tanh(C_t) (6)
wherein the hidden-layer output h_t of the current time step is determined by the cell state C_t of the current time step and the output value o_t.
In formulas (1) to (6), the subscript t is the time step and corresponds to the frame index of the speech features; σ is the sigmoid function, σ(x) = 1/(1 + e^(-x)), whose value lies between 0 and 1 and therefore expresses a soft, fractional weighting.
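For concreteness, the following is a minimal NumPy sketch of the cell defined by formulas (1)-(6) and of how it produces the two quantities o_all_time and o_max_time used in steps C and D. It is an illustration only: the random initial values are for shape-checking, and training of the W and b parameters, batching and regularization are omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMCell:
    """One step of formulas (1)-(6); the gate equations also see the cell state
    (the W_fc, W_ic and W_oc terms), as in the text above."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)

        def W(cols):
            return rng.normal(scale=0.1, size=(hidden_dim, cols))

        def z():
            return np.zeros(hidden_dim)

        # forget gate, input (information update) gate, cell update, output gate
        self.W_fh, self.W_fx, self.W_fc, self.b_f = W(hidden_dim), W(input_dim), W(hidden_dim), z()
        self.W_ih, self.W_ix, self.W_ic, self.b_i = W(hidden_dim), W(input_dim), W(hidden_dim), z()
        self.W_ch, self.W_cx, self.b_c = W(hidden_dim), W(input_dim), z()
        self.W_oh, self.W_ox, self.W_oc, self.b_o = W(hidden_dim), W(input_dim), W(hidden_dim), z()

    def step(self, x_t, h_prev, C_prev):
        f_t = sigmoid(self.W_fh @ h_prev + self.W_fx @ x_t + self.W_fc @ C_prev + self.b_f)  # (1)
        i_t = sigmoid(self.W_ih @ h_prev + self.W_ix @ x_t + self.W_ic @ C_prev + self.b_i)  # (2)
        C_tilde = np.tanh(self.W_ch @ h_prev + self.W_cx @ x_t + self.b_c)                   # (3)
        C_t = f_t * C_prev + i_t * C_tilde                                                   # (4)
        o_t = sigmoid(self.W_oh @ h_prev + self.W_ox @ x_t + self.W_oc @ C_t + self.b_o)     # (5)
        h_t = o_t * np.tanh(C_t)                                                             # (6)
        return h_t, C_t


def run_lstm(cell, frames, hidden_dim):
    """Returns o_all_time (T x N matrix of hidden outputs at all time steps)
    and o_max_time (the hidden output of the last time step)."""
    h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
    outputs = []
    for x_t in frames:                    # frames: (T, input_dim) frame-level features
        h, C = cell.step(x_t, h, C)
        outputs.append(h)
    o_all_time = np.stack(outputs)        # (T, N)
    return o_all_time, o_all_time[-1]     # o_max_time is the last row


# toy usage: 120 frames of 26-dim features, 64 hidden units
rng = np.random.default_rng(1)
cell = LSTMCell(input_dim=26, hidden_dim=64)
o_all, o_last = run_lstm(cell, rng.normal(size=(120, 26)), hidden_dim=64)
print(o_all.shape, o_last.shape)          # (120, 64) (64,)
```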
Step C, replacing the traditional scaled dot-product attention mechanism with a multi-head attention mechanism, applying attention weighting to the time dimension of the LSTM model output, and distinguishing the emotional saturation of different time segments.
The multi-head attention mechanism avoids the one-sidedness of the single linear mapping used in the traditional scaled dot-product attention mechanism; "multi-head" means multiple linear projections, so a more robust information representation can be obtained. The calculation is as follows:
T_K_i = o_all_time * T_W_{i,k} + T_b_{i,k} (7)
T_Q_i = o_max_time * T_W_{i,q} + T_b_{i,q} (8)
wherein T_W_{i,k} and T_W_{i,q} are the weights and T_b_{i,k} and T_b_{i,q} the biases of the linear transformations; T_W_{i,k}, T_W_{i,q}, T_b_{i,k} and T_b_{i,q} are all parameters to be trained; i denotes the index of the linear subspace, there are h linear subspaces in total, and N corresponds to the number of hidden-layer units of the LSTM model; o_all_time is the matrix of outputs of the LSTM model at all time steps and o_max_time is the output of the LSTM model at the last time step. Because the LSTM model accumulates historical information, its output at the last time step carries relatively more effective information, so the invention takes o_max_time as the reference quantity against which the information value of the other time steps is measured; T_K_i and T_Q_i are respectively the projections of o_all_time and o_max_time in the i-th linear subspace.
T_score_i = softmax(T_Q_i * T_K_i^H) (9)
T_V_i = o_all_time * T_W_{i,v} + T_b_{i,v} (10)
T_output_i = T_score_i * T_V_i (11)
wherein softmax() denotes the normalized exponential function and T_K_i^H is the transpose of T_K_i; T_score_i is the weighting coefficient of the time dimension in the i-th linear subspace. The value to be weighted is the new projection T_V_i of o_all_time in the i-th linear subspace, obtained by a further linear transformation with weight T_W_{i,v} and bias T_b_{i,v}; T_W_{i,v} and T_b_{i,v} are parameters to be trained. Applying T_score_i to T_V_i gives the output T_output_i of each linear subspace. Finally, the outputs T_output_i of all linear subspaces are concatenated to form a completely new output T_Output:
T_Output = Concat(T_output_1, ..., T_output_h) (12)
where Concat() is used to concatenate two or more arrays.
Step D, extending the multi-head attention mechanism to the feature dimension, applying attention weighting to the output of the LSTM model in the feature dimension, and distinguishing the ability of different features to discriminate emotion.
Different features have different abilities to discriminate emotion; the weighting coefficients of the multi-head attention mechanism automatically select the deep features closely related to the emotion recognition task, so as to improve emotion recognition performance. The calculation is as follows:
F_K_i = o_all_time * F_W_{i,k} + F_b_{i,k} (13)
F_score_i = softmax(tanh(F_K_i) * F_W_score + F_b_score) (14)
F_V_i = o_all_time * F_W_{i,v} + F_b_{i,v} (15)
F_output_i = F_score_i ⊙ F_V_i (16)
wherein F_W_{i,k} and F_b_{i,k} are respectively the weight and the bias of the linear transformation of o_all_time, and both are parameters to be trained; F_K_i is the projection of o_all_time in the i-th linear subspace. F_score_i is the weighting coefficient of the feature dimension in the i-th linear subspace, obtained with a self-attention algorithm; F_W_score and F_b_score are the weight and the bias to be trained in the self-attention algorithm. As in the time dimension, the value to be weighted is the new projection F_V_i of o_all_time in the i-th linear subspace, obtained by a further linear transformation with weight F_W_{i,v} and bias F_b_{i,v}; F_W_{i,v} and F_b_{i,v} are parameters to be trained. Applying F_score_i to F_V_i gives the output F_output_i of each linear subspace; unlike the time dimension, the product between F_score_i and F_V_i is the Hadamard product ⊙. The final output F_Output concatenates the outputs F_output_i of all linear subspaces and sums over the time dimension; this summation is analogous to a traditional static statistic, except that the weighting coefficients are obtained by training rather than designed by hand:
F_Output = Σ_t Concat(F_output_1, ..., F_output_h) (17)
where Σ_t denotes summation over the time dimension.
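A combined NumPy sketch of the two attention branches, formulas (7)-(12) for the time dimension (step C) and formulas (13)-(17) for the feature dimension (step D), is given below. The per-head projection size N/h, the shapes of F_W_score and F_b_score, and the axis over which each softmax is taken are assumptions made for illustration; the random matrices stand in for the trained T_W/T_b and F_W/F_b parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def time_attention(o_all_time, o_max_time, heads):
    """Formulas (7)-(12): o_max_time queries all time steps; softmax over time."""
    outs = []
    q = o_max_time[None, :]                              # (1, N)
    for p in heads:
        T_K = o_all_time @ p["Wk"] + p["bk"]             # (T, d)   (7)
        T_Q = q @ p["Wq"] + p["bq"]                      # (1, d)   (8)
        T_score = softmax(T_Q @ T_K.T, axis=-1)          # (1, T)   (9)
        T_V = o_all_time @ p["Wv"] + p["bv"]             # (T, d)   (10)
        outs.append(T_score @ T_V)                       # (1, d)   (11)
    return np.concatenate(outs, axis=-1).ravel()         # (N,)     (12)


def feature_attention(o_all_time, heads, W_score, b_score):
    """Formulas (13)-(17): self-attention scores, Hadamard weighting, sum over time."""
    outs = []
    for p in heads:
        F_K = o_all_time @ p["Wk"] + p["bk"]                          # (T, d)   (13)
        F_score = softmax(np.tanh(F_K) @ W_score + b_score, axis=-1)  # (T, d)   (14)
        F_V = o_all_time @ p["Wv"] + p["bv"]                          # (T, d)   (15)
        outs.append(F_score * F_V)                                    # Hadamard (16)
    return np.concatenate(outs, axis=-1).sum(axis=0)                  # (N,)     (17)


# toy usage with random (untrained) parameters: T frames, N hidden units, h heads
T, N, h = 120, 64, 8
d = N // h
rng = np.random.default_rng(2)


def head():
    return {"Wk": rng.normal(size=(N, d)), "Wq": rng.normal(size=(N, d)),
            "Wv": rng.normal(size=(N, d)),
            "bk": np.zeros(d), "bq": np.zeros(d), "bv": np.zeros(d)}


o_all = rng.normal(size=(T, N))
T_Output = time_attention(o_all, o_all[-1], [head() for _ in range(h)])
F_Output = feature_attention(o_all, [head() for _ in range(h)],
                             rng.normal(size=(d, d)), np.zeros(d))
print(T_Output.shape, F_Output.shape)    # (64,) (64,)
```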
and E, adding a full connection layer and a softmax layer behind the LSTM model, wherein the weighted result of the time dimension and the weighted result of the characteristic dimension are spliced, namely Concat (T _ Output, F _ Output) is used as the input of the full connection layer, the Output of the full connection layer is input into the softmax layer for normalization, the probability of each speech emotion category is Output, and the maximum probability is the final speech emotion category.
Step F, applying known original speech data samples and their speech emotion categories to the emotion recognition model of steps A to E, training the emotion recognition model, and evaluating the recognition performance of the trained model; then inputting unknown original speech data samples into the trained emotion recognition model and outputting the corresponding speech emotion categories.
To evaluate the performance of the model, the recall rate commonly used in emotion recognition is compared with that of the traditional scaled dot-product attention algorithm, and the overall evaluation index is the Unweighted Average Recall (UAR), i.e. the unweighted mean of the per-class recall rates.
recall = TP / (TP + FN)
wherein TP is the number of samples predicted positive that are actually positive, and FN is the number of samples predicted negative that are actually positive.
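The following sketch computes the per-class recall and the UAR exactly as defined above; the toy labels are illustrative only and are not data from the experiments reported below.

```python
import numpy as np


def recall_and_uar(y_true, y_pred, n_classes):
    """Per-class recall = TP / (TP + FN); UAR = unweighted mean of the recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        recalls.append(tp / (tp + fn) if (tp + fn) > 0 else 0.0)
    return recalls, float(np.mean(recalls))


# toy usage with 6 emotion classes
recalls, uar = recall_and_uar([0, 1, 2, 2, 3, 4, 5, 5],
                              [0, 1, 2, 1, 3, 4, 5, 2], n_classes=6)
print([round(r, 2) for r in recalls], round(uar, 2))
```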
To verify the effectiveness of the invention, tests were performed on two emotion databases, eNTERFACE and GEMEP. The results are shown in Tables 2 and 3 below: Table 2 gives the results of the plain LSTM model, the LSTM model combined with scaled dot-product attention and the LSTM model combined with multi-head attention on the eNTERFACE database, and Table 3 gives the corresponding results on the GEMEP database.
As shown in Table 2, the UAR of the proposed LSTM model combined with multi-head attention reaches 91.2% on the eNTERFACE database, exceeding the other models. Compared with scaled dot-product attention, the recall of most emotions is improved, showing that multiple linear subspaces obtain more robust emotion information than a single attention function. As shown in Table 3, the LSTM model combined with multi-head attention also gives the best result on the GEMEP database, with a UAR of 62%. For the most easily recognized emotions, those with recall rates above 60%, it achieves a clear improvement over the other models in the table.
TABLE 2
Model | Anger | Disgust | Fear | Happiness | Sadness | Surprise | UAR
LSTM | 88.4% | 64.3% | 76.6% | 83.8% | 68.9% | 73.9% | 75.8%
LSTM + scaled dot-product attention | 88.4% | 85.7% | 80.9% | 97.3% | 86.7% | 84.8% | 86.9%
LSTM + multi-head attention | 90.7% | 90.5% | 87.2% | 94.6% | 93.3% | 91.3% | 91.2%
TABLE 3: results of the LSTM model, the LSTM model with scaled dot-product attention and the LSTM model with multi-head attention on the GEMEP database (reproduced as an image in the original publication)
To study the influence of the number of attention heads in the time dimension and the feature dimension on performance, the experiment first fixes the number of feature-dimension heads at 1 and studies the relationship between the number of time-dimension heads and UAR, as shown in FIG. 3: on both the GEMEP and eNTERFACE databases, UAR gradually increases as the number of time-dimension heads increases, reaching maxima of 58% and 91.2% respectively when the number of heads is 8. This shows that multiple linear subspaces can complement one another, strengthening the robustness of the emotion information and preventing the one-sidedness of a single subspace. However, this positive correlation is bounded: when the number of heads exceeds 8, UAR drops, since more heads mean more linear subspaces to learn, which increases the learning burden of the network, especially when training data are limited.
On this basis, the number of time-dimension heads is fixed at 8 and the influence of the number of feature-dimension heads on UAR is studied, as shown in FIG. 4: on the GEMEP database, UAR increases with the number of feature-dimension heads, reaching a maximum of 62% when the number of heads is 16, and decreases when the number of heads increases further. In contrast, on the eNTERFACE database, UAR falls as the number of feature-dimension heads rises. This may be due to the different complexity of the emotion recognition task in the two databases: GEMEP has twice as many emotion classes as eNTERFACE and is therefore more complex, and the more emotion categories there are, the more severely the emotional information is interwoven in the feature space and the harder recognition becomes. Multiple linear subspaces can better disentangle the overlapping emotional characteristics and improve performance. For the eNTERFACE database with only 6 emotions, there is less overlap between emotions, so increasing the number of linear subspaces only increases the training difficulty and naturally hurts performance. Therefore, the GEMEP database achieves its optimal UAR with 8 time-dimension heads and 16 feature-dimension heads, while the eNTERFACE database achieves its optimal UAR with 8 time-dimension heads and 1 feature-dimension head.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (7)

1. A speech emotion recognition method based on a multi-head attention mechanism is characterized by comprising the following steps:
s1, extracting frame-level voice features with time sequence information from the original voice data sample;
s2, establishing an LSTM model with the capability of processing time sequence data, and inputting the voice characteristics with time sequence information in the step S1 into the LSTM model;
s3, performing attention weighting on the output of the LSTM model in a time dimension by using a multi-head attention mechanism, and combining the matrix output at all times of the LSTM model and the output at the last time to obtain the output weighted in the time dimension;
s4, performing attention weighting on the output of the LSTM model on the characteristic dimension by using a multi-head attention mechanism, and combining the matrix output at all times of the LSTM model to obtain the output after the characteristic dimension weighting;
s5, splicing the output after the time dimension weighting and the output after the feature dimension weighting, inputting the spliced output into a full connection layer, and obtaining the speech emotion category through a softmax layer;
s6, training the emotion recognition model in the steps S1 to S5 by using the known original voice data sample and the voice emotion type thereof, testing and verifying the trained emotion recognition model, and evaluating the recognition performance;
and S7, inputting the unknown original voice data sample into the trained emotion recognition model, and outputting the corresponding voice emotion type.
2. The method for speech emotion recognition based on multi-head attention mechanism as claimed in claim 1, wherein the step S3 of performing attention weighting on the output of LSTM model in time dimension by using multi-head attention mechanism includes the following steps:
s31, calculating the projection of the matrix output at all time points of the LSTM model and the output at the last time point in each linear subspace respectively:
T_K_i = o_all_time * T_W_{i,k} + T_b_{i,k}
T_Q_i = o_max_time * T_W_{i,q} + T_b_{i,q}
wherein * denotes matrix multiplication; o_all_time denotes the matrix of outputs of the LSTM model at all time steps, and o_max_time denotes the output of the LSTM model at the last time step; T_W_{i,k} and T_W_{i,q} respectively denote the weights of o_all_time and o_max_time in the linear transformation of the i-th linear subspace, where i denotes the index of the linear subspace, there are h linear subspaces in total, and N denotes the number of hidden-layer units of the LSTM model; T_b_{i,k} and T_b_{i,q} respectively denote the biases of o_all_time and o_max_time in the linear transformation of the i-th linear subspace; T_K_i and T_Q_i respectively denote the projections of o_all_time and o_max_time in the i-th linear subspace;
s32, calculating a weighting coefficient of a time dimension in each linear subspace:
T_score_i = softmax(T_Q_i * T_K_i^H)
wherein T_score_i denotes the weighting coefficient of the time dimension in the i-th linear subspace, T_K_i^H denotes the transpose of T_K_i, and softmax() denotes the normalized exponential function;
s33, calculating the value to be weighted in each linear subspace:
T_V_i = o_all_time * T_W_{i,v} + T_b_{i,v}
wherein T_V_i denotes the value to be weighted in the i-th linear subspace; T_W_{i,v} denotes the weight used to obtain the value to be weighted in the i-th linear subspace; T_b_{i,v} denotes the corresponding bias;
s34, calculating the output of each linear subspace, and splicing the outputs of all the linear subspaces together to obtain the output weighted by the time dimension:
T_output_i = T_score_i * T_V_i
T_Output = Concat(T_output_1, ..., T_output_h)
wherein Concat() concatenates two or more arrays; T_output_i denotes the output of the i-th linear subspace; T_Output denotes the time-dimension-weighted output.
3. The method for speech emotion recognition based on multi-head attention mechanism as claimed in claim 1, wherein the step S4 of performing attention weighting on the output of LSTM model in feature dimension by using multi-head attention mechanism includes the following steps:
s41, calculating the projection of the matrix of outputs of the LSTM model at all time steps in each linear subspace:
F_K_i = o_all_time * F_W_{i,k} + F_b_{i,k}
wherein * denotes matrix multiplication; o_all_time denotes the matrix of outputs of the LSTM model at all time steps; F_W_{i,k} denotes the weight of o_all_time in the linear transformation of the i-th linear subspace; F_b_{i,k} denotes the bias of that linear transformation; F_K_i denotes the projection of o_all_time in the i-th linear subspace;
s42, calculating a weighting coefficient of the characteristic dimension in each linear subspace:
F_score_i = softmax(tanh(F_K_i) * F_W_score + F_b_score)
wherein F_score_i denotes the weighting coefficient of the feature dimension in the i-th linear subspace, obtained with a self-attention algorithm; F_W_score and F_b_score respectively denote the weight and the bias of the self-attention algorithm; softmax() denotes the normalized exponential function;
s43, calculating the value to be weighted in each linear subspace:
F_V_i = o_all_time * F_W_{i,v} + F_b_{i,v}
wherein F_V_i denotes the value to be weighted in the i-th linear subspace; F_W_{i,v} denotes the weight used to obtain the value to be weighted in the i-th linear subspace; F_b_{i,v} denotes the corresponding bias;
s44, calculating the output of each linear subspace, splicing the outputs of all the linear subspaces together, and summing in a time dimension to obtain the output weighted by a characteristic dimension:
F_output_i = F_score_i ⊙ F_V_i
F_Output = Σ_t Concat(F_output_1, ..., F_output_h)
wherein ⊙ denotes the Hadamard product and Σ_t denotes summation over the time dimension; Concat() concatenates two or more arrays; F_output_i denotes the output of the i-th linear subspace; F_Output denotes the feature-dimension-weighted output.
4. The method for speech emotion recognition based on multi-head attention mechanism as claimed in claim 1, wherein in step S2, the LSTM model is calculated as follows:
f_t = σ(W_fh * h_{t-1} + W_fx * x_t + W_fc * C_{t-1} + b_f)
i_t = σ(W_ih * h_{t-1} + W_ix * x_t + W_ic * C_{t-1} + b_i)
C̃_t = tanh(W_ch * h_{t-1} + W_cx * x_t + b_c)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
o_t = σ(W_oh * h_{t-1} + W_ox * x_t + W_oc * C_t + b_o)
h_t = o_t ⊙ tanh(C_t)
wherein * denotes matrix multiplication and ⊙ denotes the Hadamard product; W_fh, W_fc, W_fx are respectively the weights of the forget gate f_t for the hidden-layer output, the cell state and the input, and b_f is the bias of the forget gate f_t; W_ih, W_ic, W_ix are respectively the weights of the information update value i_t for the hidden-layer output, the cell state and the input, and b_i is the bias of the information update value i_t; W_ch, W_cx are respectively the weights of the cell-state update value C̃_t for the hidden-layer output and the input, and b_c is the bias of the cell-state update value C̃_t; W_oh, W_ox, W_oc are respectively the weights of the output value o_t for the hidden-layer output, the input and the cell state, and b_o is the bias of the output value o_t; h_t, x_t, C_t respectively denote the hidden-layer output, the input and the cell state at time t.
5. The method for speech emotion recognition based on a multi-head attention mechanism as claimed in claim 1, wherein in step S5, the input of the fully connected layer is Concat(T_Output, F_Output), where T_Output denotes the time-dimension-weighted output and F_Output denotes the feature-dimension-weighted output.
6. The method for speech emotion recognition based on multi-head attention mechanism as claimed in claim 1, wherein in step S6, the emotion recognition model is tested and verified by recall ratio:
recall = TP / (TP + FN)
wherein recall is the recall rate; TP is the number of samples predicted positive that are actually positive; FN is the number of samples predicted negative that are actually positive.
7. The method for recognizing the speech emotion based on the multi-head attention mechanism as claimed in claim 1, wherein: in step S1, the extracted speech features retain timing information in the original speech data sample through the sequence relationship between the speech frames, and the dimensionality of the speech features varies with the actual length of the original speech data sample.
CN202011041520.9A 2020-09-28 2020-09-28 Voice emotion recognition method based on multi-head attention mechanism Active CN112185423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011041520.9A CN112185423B (en) 2020-09-28 2020-09-28 Voice emotion recognition method based on multi-head attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011041520.9A CN112185423B (en) 2020-09-28 2020-09-28 Voice emotion recognition method based on multi-head attention mechanism

Publications (2)

Publication Number Publication Date
CN112185423A true CN112185423A (en) 2021-01-05
CN112185423B CN112185423B (en) 2023-11-21

Family

ID=73943860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011041520.9A Active CN112185423B (en) 2020-09-28 2020-09-28 Voice emotion recognition method based on multi-head attention mechanism

Country Status (1)

Country Link
CN (1) CN112185423B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN113053418A (en) * 2021-04-30 2021-06-29 南京工程学院 Method for quickly identifying speech emotion category based on long-time and short-time memory network
CN113628639A (en) * 2021-07-06 2021-11-09 哈尔滨理工大学 Voice emotion recognition method based on multi-head attention mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243494A (en) * 2018-10-30 2019-01-18 南京工程学院 Child emotion recognition method based on a multi-attention-mechanism long short-term memory network
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
JP2019020684A (en) * 2017-07-21 2019-02-07 日本電信電話株式会社 Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program
CN110223714A (en) * 2019-06-03 2019-09-10 杭州哲信信息技术有限公司 A kind of voice-based Emotion identification method
WO2020173133A1 (en) * 2019-02-27 2020-09-03 平安科技(深圳)有限公司 Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019020684A (en) * 2017-07-21 2019-02-07 日本電信電話株式会社 Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109243494A (en) * 2018-10-30 2019-01-18 南京工程学院 Child emotion recognition method based on a multi-attention-mechanism long short-term memory network
WO2020173133A1 (en) * 2019-02-27 2020-09-03 平安科技(深圳)有限公司 Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium
CN110223714A (en) * 2019-06-03 2019-09-10 杭州哲信信息技术有限公司 A kind of voice-based Emotion identification method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RUNNAN LI et al.: "Towards Discriminative Representation Learning for Speech Emotion Recognition", IJCAI
余莉萍 et al.: "Children's speech emotion recognition model based on improved LSTM" (in Chinese), Computer Engineering, vol. 46, no. 06
卢艳: "Research on speech emotion recognition combining neural networks and attention mechanisms" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology, no. 08, pages 41-44

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism
CN112766172B (en) * 2021-01-21 2024-02-02 北京师范大学 Facial continuous expression recognition method based on time sequence attention mechanism
CN113053418A (en) * 2021-04-30 2021-06-29 南京工程学院 Method for quickly identifying speech emotion category based on long-time and short-time memory network
CN113628639A (en) * 2021-07-06 2021-11-09 哈尔滨理工大学 Voice emotion recognition method based on multi-head attention mechanism

Also Published As

Publication number Publication date
CN112185423B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN112185423B (en) Voice emotion recognition method based on multi-head attention mechanism
CN112784798B (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN112784730B (en) Multi-modal emotion recognition method based on time domain convolutional network
CN111310672A (en) Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN107729999A (en) Consider the deep neural network compression method of matrix correlation
CN108776796B (en) Action identification method based on global space-time attention model
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN110728656A (en) Meta-learning-based no-reference image quality data processing method and intelligent terminal
CN110534133B (en) Voice emotion recognition system and voice emotion recognition method
CN112687374B (en) Psychological crisis early warning method based on text and image information joint calculation
CN111178319A (en) Video behavior identification method based on compression reward and punishment mechanism
CN111444951A (en) Method and device for generating sample identification model, computer equipment and storage medium
Eisele et al. A comparative study of linear feature transformation techniques for automatic speech recognition
CN112100439B (en) Recommendation method based on dependency embedding and neural attention network
CN112966568A (en) Video customer service quality analysis method and device
CN115510286A (en) Multi-relation cognitive diagnosis method based on graph convolution network
Chatzikonstantinou et al. Neural network compression using higher-order statistics and auxiliary reconstruction losses
Garcia-Romero et al. Joint Factor Analysis for Speaker Recognition Reinterpreted as Signal Coding Using Overcomplete Dictionaries.
CN115795015A (en) Comprehensive knowledge tracking method for enhancing test question difficulty
Reshma et al. A survey on speech emotion recognition
CN113239678B (en) Multi-angle attention feature matching method and system for answer selection
CN114819107A (en) Mixed data assimilation method based on deep learning
CN113053418A (en) Method for quickly identifying speech emotion category based on long-time and short-time memory network
CN113469977A (en) Flaw detection device and method based on distillation learning mechanism and storage medium
CN112634947A (en) Animal voice and emotion feature set sequencing and identifying method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant