CN110400579A - Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term memory network - Google Patents
Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term memory network Download PDF Info
- Publication number
- CN110400579A (application number CN201910555688.2A)
- Authority
- CN
- China
- Prior art keywords
- output
- attention
- network
- feature
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The present invention relates to a speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network. The method comprises the following steps: acoustic features are first extracted from the original audio signal and fed into forward and backward long short-term memory networks, which output forward and backward features; a self-attention operation is then applied to each direction to obtain the forward and backward self-attention-weighted outputs; these weighted outputs are each mean-pooled, spliced together, and fed into a softmax layer; the softmax output and the class labels are passed to a cross-entropy loss function, the best-performing network is selected on a validation set, and the test-set data are finally fed into the trained network to obtain the final emotion category. By introducing the self-attention mechanism into the recurrent neural network, the present invention can more easily discover the correlations among signals within an utterance, and by adding a direction mechanism to the self-attention mechanism it solves the problem of classification performance degrading due to a lack of information.
Description
Technical field
The present invention relates to the technical field of speech emotion recognition, and specifically to a speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network.
Background technique
In recent years, human-computer interaction has attracted the interest of more and more data scientists. Making communication between humans and machines more natural has two main goals: first, letting the machine understand what a human says; second, letting the machine recognize the emotion with which a human speaks. Computers today can understand the meaning of human speech, but having a machine recognize the emotion in speech remains a considerable challenge.
In the early days, researchers extracted features from the speech signal and then classified them with machine learning classifiers. At the beginning of the 21st century, researchers used Gaussian mixture models or hidden Markov models for classification; later, owing to the outstanding performance of the support vector machine, it replaced those classifiers and is still often used as the baseline algorithm in the speech emotion recognition field. Subsequently, with the development of neural networks, researchers found that extracting high-level features through a neural network and feeding them into other classifiers (such as support vector machines or Gaussian mixture models) could achieve good results.
Although deep learning techniques have been used in recent years to analyze emotional changes in speech with good results, general methods cannot distinguish voiced frames from unvoiced frames well. Current approaches to this problem fall into two broad classes: the first removes unvoiced frames manually; the second uses an algorithm to adaptively learn which frames are unvoiced and which are voiced. Methods of the first class usually identify frames by pitch, but this is time-consuming and laborious and largely destroys the temporal structure of the speech data, so although usable, it has clear defects. Methods of the second class use some adaptive mechanism to assign lower weights to unvoiced frames; common methods include attention mechanisms and the CTC loss. Since the CTC loss assigns discrete weights, it forces the weight of an unvoiced segment to 0 or the weight of a voiced segment to 1; but human emotional expression is usually gradual, so assigning continuous weights is the more appropriate approach, and the attention mechanism can do exactly this.
The present invention differs from the traditional attention mechanism. The traditional attention mechanism applies a softmax transformation to the data along the time dimension to obtain temporal weights; although this has some effect, it cannot make full use of the signal. The self-attention mechanism of the present invention instead obtains the weights from a softmax transformation of the similarity of the data with itself; since the weight matrix is derived from the internal information of the signal, it can exploit the information inside an utterance more effectively.
Summary of the invention
Technical problem: the technical problem to be solved by the present invention is to provide an algorithm that can analyze the emotion of a speech signal. By adding a self-attention mechanism after the bidirectional long short-term memory network, correlations inside the signal are discovered, which in turn controls the importance of each temporal frame. The self-attention mechanism can reduce the influence of temporal frames that harm classification performance and let the network focus on the temporal frames that help classification, thereby improving the classification accuracy of the classifier on speech emotion datasets.
Technical solution: first, the raw data are divided into a training set, a validation set, and a test set. Because speech data are sequential, the present invention decodes the speech features of the training-set data with a bidirectional long short-term memory network, then weights each time step of the decoded data in both directions with the self-attention method, and finally feeds the weighted outputs together with the true class labels into a cross-entropy loss function. After the model weights are obtained on the training set, model parameters are selected on the validation set to obtain the best-performing model; the test set is then fed into this best model to evaluate its classification performance.
The technical solution adopted in the present invention can be refined further. The self-attention mechanism is defined as measuring the similarity of the signal with itself and obtaining the weight of each moment from that similarity measure. First, the features output by the bidirectional long short-term memory network are each fed into three one-dimensional convolutions, yielding three different feature mapping matrices Q, K, V. The last dimension D of the resulting Q, K, V is split to obtain three four-dimensional matrices Q', K', V'. The matrices Q' and K' are multiplied, the result is passed through a softmax layer to obtain the weight matrix W, and finally W is multiplied with the remaining four-dimensional matrix V' to obtain the self-attention-weighted output O, defined by the formula:
O = W * V'
The split dimension of the output O is merged back to obtain three-dimensional data O'; this gives the forward self-attention-weighted output and the backward self-attention-weighted output. Mean pooling is applied to each of the forward and backward weighted outputs, and the two pooled results are spliced together. The spliced result is fed into the softmax layer; the resulting softmax output and the class labels are fed together into the cross-entropy loss function, and the whole network structure is adjusted by the back-propagation algorithm.
Beneficial effects: compared with the prior art, the present invention has the following advantages:
The speech emotion recognition system of the present invention, based on a directional self-attention mechanism and a bidirectional long short-term memory network, introduces the self-attention mechanism into the bidirectional LSTM network and assigns weights to the temporal frames of speech through the attention mechanism, without manually deleting useless frames. The present invention exploits the ability of self-attention to discover correlations among signals inside an utterance: it pays more attention to voiced frames while weakening the influence of unvoiced frames that harm classification. In addition, analyzing the speech data from different directions further increases the robustness of the network, so the speech emotion recognition system of the present invention adds a direction mechanism to the self-attention mechanism, and by parsing the forward and backward high-level features of the LSTM it solves the problem of classification performance degrading due to insufficient information. Experiments show that the speech emotion recognition system of the present invention achieves ideal classification performance.
Brief description of the drawings
Fig. 1 is the overall framework of the invention as applied in the speech emotion recognition field;
Fig. 2 shows the confusion matrices of the various algorithms on the IEMOCAP improvisation dataset.
Specific embodiment
In order to describe the content of the present invention more clearly, it is described in detail below with reference to the drawings and specific embodiments. The speech emotion recognition system of the present invention, based on a directional self-attention mechanism and a bidirectional long short-term memory network (BLSTM-DSA), comprises the following steps:
Step 1: acoustic features are extracted from the original audio signal samples. The acoustic features include prosodic features (zero-crossing rate and energy) and spectral features (mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, chroma features, and chroma standard deviation). These acoustic features are extracted with the openSMILE toolkit, yielding the feature-extracted speech training-set data.
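Two of the prosodic features named above, zero-crossing rate and short-time energy, can be computed directly from framed audio. The patent uses openSMILE; the following is only an illustrative numpy sketch of the framing and the two features, with the frame length and hop chosen to match the 25 ms Hamming window and 10 ms shift described later in the text.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes within each frame."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

# Example: 1 s of a 16 kHz tone, 25 ms Hamming-windowed frames, 10 ms hop
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)
frames = frame_signal(x, frame_len=int(0.025 * sr), hop=int(0.010 * sr))
frames = frames * np.hamming(frames.shape[1])
zcr = zero_crossing_rate(frames)
energy = short_time_energy(frames)
```

A 220 Hz tone crosses zero 440 times per second, so the per-sample rate stays well below 0.1; a noisy unvoiced frame would score much higher, which is what makes the feature useful for voiced/unvoiced discrimination.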
Step 2: the feature-extracted speech training-set data are fed into a forward long short-term memory network and a backward long short-term memory network. The input training speech data are defined as D = {(x_i, y_i)}, i = 1, ..., N, where N is the number of training samples, y_i = 0 denotes the angry class, y_i = 1 the happy class, y_i = 2 the neutral class, and y_i = 3 the sad class. The long short-term memory network is defined by the following formulas:
i_t = σ(W_i x_t + U_i h_{t-1} + V_i ⊙ c_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + V_f ⊙ c_{t-1} + b_f)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o x_t + U_o h_{t-1} + V_o ⊙ c_t + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ(·) denotes the sigmoid function, whose output interval is (0, 1). Because of this property of its output interval (resembling a probability), it is often regarded as the representation closest to a probability. W_i, W_f, W_c, W_o are the learnable input-to-state weight matrices; U_i, U_f, U_c, U_o are the learnable state-to-state matrices; V_i, V_f, V_o are the learnable matrices known as peephole connections; h_t is the neuron state of layer l at time step t. i_t is the input gate, which indicates how much information of the candidate state is saved at the current moment; f_t is the forget gate, which indicates how much information in the internal state c_{t-1} of the previous time step should be forgotten; o_t is the output gate, which controls how much information of the current internal state c_t must be exposed to the external state h_t. To distinguish the forward and backward outputs, the forward feature output of the last layer is denoted h_f and the backward feature output h_b.
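One time step of the peephole LSTM described by the gate equations above can be sketched in numpy as follows. Weight names follow the text (W* input-to-state, U* state-to-state, V* peephole); the dimensions (68-dimensional frame features, 256 hidden units) match the embodiment, but the random initialization is purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One peephole-LSTM step. W* map input to state, U* map state to
    state, V* are the (diagonal) peephole connections named in the text."""
    i = sigmoid(p['Wi'] @ x_t + p['Ui'] @ h_prev + p['Vi'] * c_prev + p['bi'])
    f = sigmoid(p['Wf'] @ x_t + p['Uf'] @ h_prev + p['Vf'] * c_prev + p['bf'])
    c_tilde = np.tanh(p['Wc'] @ x_t + p['Uc'] @ h_prev + p['bc'])
    c = f * c_prev + i * c_tilde          # forget old state, write candidate
    o = sigmoid(p['Wo'] @ x_t + p['Uo'] @ h_prev + p['Vo'] * c + p['bo'])
    h = o * np.tanh(c)                    # exposed (external) state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 68, 256                       # 68-dim frame features, 256 units
p = {k: rng.standard_normal((d_h, d_in)) * 0.01 for k in ('Wi', 'Wf', 'Wc', 'Wo')}
p.update({k: rng.standard_normal((d_h, d_h)) * 0.01 for k in ('Ui', 'Uf', 'Uc', 'Uo')})
p.update({k: rng.standard_normal(d_h) * 0.01
          for k in ('Vi', 'Vf', 'Vo', 'bi', 'bf', 'bc', 'bo')})
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((10, d_in)):   # run 10 frames through the cell
    h, c = lstm_step(x_t, h, c, p)
```

The backward network applies the same recurrence with the frame sequence reversed, which is what yields the two direction-specific feature streams.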
Step 3: the forward features h_f and the backward features h_b are each passed through three one-dimensional convolutions, giving the convolution outputs Q, K, V; the forward three-dimensional feature mapping matrices are denoted Q_f, K_f, V_f and the backward ones Q_b, K_b, V_b. The one-dimensional convolution is well suited to analyzing speech data because it preserves the temporal order of the data, and compared with other operations it holds a speed advantage; the convolution is performed three times precisely to prepare for the subsequent self-attention analysis of the signal against itself. The last dimension of Q, K, V is then split, yielding three four-dimensional feature matrices, which we denote Q', K', V'; with h attention heads the size of the split dimension is D/h (h = 8 in the experiments). Scaled dot-product attention is applied to the resulting Q', K', V', with the weight matrix W = softmax(Q'K'ᵀ / √d), and the output is defined by the formula:
O = W * V'    (7)
Finally, the split dimension of the output O is merged back to obtain three-dimensional data O'; this gives the forward self-attention-weighted output and the backward self-attention-weighted output.
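The projection, head split, and scaled dot-product operation of step 3 can be sketched for one direction as follows. Since the kernel size of the one-dimensional convolutions is 1 (as stated in the embodiment), each reduces to an independent linear map per frame; the batch dimension is dropped and the sizes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, n_heads, rng):
    """F: (T, D) features from one direction of the BLSTM.
    Kernel-size-1 convolutions are per-frame linear maps Wq, Wk, Wv."""
    T, D = F.shape
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    d = D // n_heads
    # split the last dimension into heads: (n_heads, T, d)
    Qh = Q.reshape(T, n_heads, d).transpose(1, 0, 2)
    Kh = K.reshape(T, n_heads, d).transpose(1, 0, 2)
    Vh = V.reshape(T, n_heads, d).transpose(1, 0, 2)
    # scaled dot-product: each row of W weights all T frames for one frame
    W = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d))  # (n_heads, T, T)
    O = W @ Vh                                            # weighted values
    return O.transpose(1, 0, 2).reshape(T, D), W          # merge heads back

rng = np.random.default_rng(0)
T, D = 50, 128                         # 50 frames, 128 feature channels
O, W = self_attention(rng.standard_normal((T, D)), n_heads=8, rng=rng)
```

Each row of W sums to 1, so every output frame is a convex combination of all frames in the utterance; this is the mechanism by which unvoiced frames can be continuously down-weighted rather than hard-discarded.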
Mean pooling is applied to each of the obtained forward self-attention-weighted output and backward self-attention-weighted output, and the two pooled results are spliced together to give the spliced output S.
The resulting spliced result S is fed into the softmax layer; the softmax output together with the class labels is then fed into the cross-entropy loss function, and the whole network structure is adjusted by the back-propagation algorithm. The cross-entropy loss function is defined as:
L = -(1/N) Σ_{n=1}^{N} Σ_{h=1}^{H} y_{n,h} log(ŷ_{n,h})
where H is the number of classes and N is the number of samples.
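The pooling, splicing, and classification head described above can be sketched as follows. The shapes and the random output-layer weights are illustrative; in the patent the loss is averaged over a training batch and minimized by back-propagation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(O_fwd, O_bwd, W_out, b_out):
    """Mean-pool each direction over time, splice, project to class probs."""
    s = np.concatenate([O_fwd.mean(axis=0), O_bwd.mean(axis=0)])  # (2D,)
    return softmax(W_out @ s + b_out)

def cross_entropy(probs, label, eps=1e-12):
    """-log p(true class) for one sample (one-hot target)."""
    return -np.log(probs[label] + eps)

rng = np.random.default_rng(0)
T, D, H = 50, 128, 4                  # frames, feature dim, 4 emotion classes
O_fwd, O_bwd = rng.standard_normal((2, T, D))
W_out = rng.standard_normal((H, 2 * D)) * 0.01
probs = classify(O_fwd, O_bwd, W_out, np.zeros(H))
loss = cross_entropy(probs, label=2)  # e.g. true class "neutral"
```

Splicing doubles the feature dimension (2D) because the forward and backward pooled vectors are concatenated rather than summed, which is what lets the direction mechanism preserve information from both passes.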
Experimental design
Dataset: this work uses the currently most popular emotion database, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. IEMOCAP was recorded by the engineering school of the University of Southern California, USA; it contains the audiovisual recordings of 5 sessions in total, i.e. audio, video, and motion-capture data, with a total duration of 12 hours. Each session is a dialogue performed by one actor and one actress, and the performances are divided into scripted and improvised. Statistically, the database consists of 10039 utterances of varying duration with an average length of 4.5 seconds, and each utterance was given continuous labels and discrete labels by three annotators. The database focuses on five emotions: angry, happy, sad, neutral, and frustrated; however, the annotators were not limited to these emotions during labeling. Among the utterances, speech data whose labels are not considered account for 38%, unlabeled data account for 7%, data whose label cannot be determined account for 15%, and data with a determinable label account for 40%. To compare with the results of other researchers, we select only the angry, happy, neutral, and sad speech data whose labels can be determined. Table 1 shows, for each individual in the IEMOCAP improvisation dataset, the number of utterances in each emotion.
Table 1: the IEMOCAP improvisation dataset
Feature extraction: in the feature extraction phase, the original signal is converted into acoustic features (including prosodic features, spectral features, voice-quality features, and features extracted by deep learning algorithms). The prosodic features chosen in this method are zero-crossing rate and energy; the chosen spectral features are mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, chroma features, and chroma standard deviation; openSMILE is used as the speech feature extraction tool. The speech signal, sampled at 16 kHz, is first framed and windowed: in this method the window is a 25 ms Hamming window with a 10 ms frame shift. The 12-dimensional mel-frequency cepstral coefficients are computed from the log Fourier transform and 26 filters. The spectral roll-off point is set to 0.85, meaning the frequency below which 85% of the overall amplitude is contained is taken into account; spectral flux is obtained as the minimum squared distance between the current frame and the previous frame; the spectral centroid is obtained as the weighted average of the frequencies. Spectral entropy applies Shannon entropy with the energy distribution treated as a probability distribution. The spectral spread, i.e. the second-order central moment of the spectrum, is obtained by computing, for each frame, the standard deviation of the band frequencies about the spectral centroid. The zero-crossing rate is the rate at which the time-domain waveform crosses the time axis. Energy is obtained from the weighted square of each frame; in addition, the energy entropy applies Shannon entropy to the energy to determine whether the energy distribution is uniform. The full set of manually extracted low-dimensional features comprises the mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, zero-crossing rate, fundamental frequency, energy, energy entropy, and their first-order differences. Each frame thus has 68 feature dimensions; to better suit the neural network, mean-variance normalization is applied in this method.
Network training: this method uses a speaker-independent training strategy. On the IEMOCAP improvisation dataset, a leave-one-group-out (LOGO) training strategy is adopted, executing five rounds in total; in each round the utterances of four of the sessions form the training set, and in the remaining session the utterances recorded by the actress serve as the test set and those recorded by the actor as the validation set. Since the happy samples in the IEMOCAP improvisation dataset are in the minority and the emotion distribution of the data is imbalanced, the happy samples are resampled on this dataset. For the network configuration, the number of BLSTM layers is set to 2; the linear transformation of the input is initialized with the Glorot uniform distribution, the linear transformation of the recurrent layer state is initialized with the orthogonal initialization method, the number of LSTM neurons per layer is set to 256, and the dropout rate is set to 0.3. In the self-attention mechanism, the one-dimensional convolution kernels are initialized with the Glorot uniform distribution, the kernel size is 1 with 128 kernels, the regularization method is L2 with the regularization parameter set to 3×10⁻⁷, and the attention mechanism splits the features into 8 heads. The loss function is cross entropy, the batch size is set to 256, the base learning rate is set to 0.0001, and the Nadam optimizer is used for parameter optimization. To train the network better, warm-up and moving-average strategies are adopted. The warm-up strategy ramps the learning rate linearly over the first 8 training epochs; letting the learning rate grow linearly in the early phase allows the network to adapt to the data better. The moving average makes the model more robust on the test set, with the decay rate set to 0.999. To prevent overfitting, early stopping is also used during training: when the validation loss has not decreased within 10 epochs, training stops, and the model with the lowest validation loss is selected for testing. To accelerate convergence, a layer normalization layer is added between the BLSTM and the directional self-attention.
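The warm-up schedule can be sketched as below. The exact warm-up formula appears only as an image in the source, so a plain linear ramp to the base rate of 0.0001 over 8 epochs is assumed here, matching the prose description of a linearly increasing early-phase learning rate.

```python
def warmup_lr(epoch, base_lr=1e-4, warmup_epochs=8):
    """Linear warm-up: ramp the learning rate over the first epochs, then
    hold it at the base rate (simple ramp assumed; the patent's exact
    formula is not reproduced in the text)."""
    if epoch <= warmup_epochs:
        return base_lr * epoch / warmup_epochs
    return base_lr

# Learning rate per epoch for the first 12 epochs (1-indexed)
schedule = [warmup_lr(e) for e in range(1, 13)]
```

After epoch 8 the rate stays at 1e-4; in practice a decay schedule or the early-stopping criterion described above ends training.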
Evaluation metrics: this method selects weighted average recall (Weighted Accuracy, WA) and unweighted average recall (Unweighted Accuracy, UA) as the evaluation metrics of the model. WA is the proportion of correctly classified samples over the entire test set. To evaluate the influence of class imbalance on the overall model, UA, i.e. the average of the per-class classification accuracies, is also considered. WA and UA are defined as:
WA = (number of correctly classified test samples) / (total number of test samples)
UA = (1/H) Σ_{h=1}^{H} (correctly classified samples of class h) / (samples of class h)
Comparison algorithms: the comparison algorithms used by this method are CNN, LSTM, and BLSTM. The CNN consists of two convolutional layers: the first has 2×2 kernels with stride 1 and 10 kernels, and the second has 2×2 kernels with stride 1 and 20 kernels; after each convolutional layer a max-pooling layer of size 2×2 with stride 2 is added, and finally two fully connected layers of 128 neurons each are appended, with a batch normalization layer inserted between them. The LSTM in this experiment is set to two layers of 256 neurons each, with the dropout rate set to 0.3. The BLSTM uses the same experimental parameters as the LSTM, except that each forward LSTM layer is paired with an additional backward LSTM layer; all models uniformly use the Nadam optimizer.
Experimental results
Table 2 shows the experimental results of each algorithm on the IEMOCAP improvisation dataset. CNN does not perform well on this dataset, giving the lowest result on both WA and UA. With the direction mechanism added, BLSTM shows better generalization than LSTM. BLSTM-DSA, which incorporates both the self-attention mechanism and the direction mechanism, achieves the best result on both WA and UA.
Table 2: results of each algorithm on the IEMOCAP improvisation dataset
Model | WA (%) | UA (%)
CNN | 57.75 | 45.08
LSTM | 61.89 | 50.52
BLSTM | 62.01 | 52.48
BLSTM-DSA | 62.16 | 55.21
Fig. 2 illustrates the confusion matrices of the algorithms on the IEMOCAP improvisation dataset.
From the confusion matrices in Fig. 2, on the recognition rate for the angry emotion, BLSTM-DSA is the highest and CNN the lowest. On the recognition rate for the happy emotion, BLSTM-DSA is also the highest and LSTM the lowest. On the neutral emotion, every algorithm exceeds 70% and the differences among them are small; similarly, the recognition rates for the sad emotion differ little across algorithms. In summary, BLSTM-DSA achieves ideal results on the angry, neutral, and sad recognition rates. Furthermore, since the sad and neutral emotions have larger sample sizes and both have distinctive characteristics, all algorithms achieve relatively high recognition rates on them.
In conclusion it is of the invention based on direction from the speech emotion recognition system of the two-way length of attention mechanism network in short-term
System finds the correlation inside signal by being added after two-way length in short-term network from attention mechanism, and then controls each
The significance level of temporal frame.It can reduce the influence of the temporal frame unfavorable to classification performance from attention mechanism, and allow network
It focuses more on and biggish temporal frame is helped to classification performance, to improve classification essence of the classifier on speech emotional data set
Degree.In addition, the present invention also provides reference for other relevant issues in same domain, expansion extension can be carried out on this basis,
With very wide application prospect.
Claims (6)
1. A speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network, characterized by comprising the following steps:
1) extracting acoustic features from original audio signal samples to obtain feature-extracted speech training set data;
2) inputting the feature-extracted speech training set data into a forward long short-term memory network and a backward long short-term memory network, and outputting a forward feature and a backward feature;
3) applying three one-dimensional convolutions to the output forward feature and to the output backward feature respectively, obtaining three forward three-dimensional feature mapping matrices and three backward three-dimensional feature mapping matrices;
4) applying the self-attention operation to the three forward three-dimensional feature mapping matrices obtained in step 3) to obtain the forward self-attention-weighted output, and applying the self-attention operation to the three backward three-dimensional feature mapping matrices to obtain the backward self-attention-weighted output;
5) applying mean pooling to the forward self-attention-weighted output and to the backward self-attention-weighted output respectively, and splicing the two pooled results to output the spliced representation;
6) inputting the spliced representation into a softmax layer to obtain the softmax layer output, inputting the softmax layer output together with the labels into a cross-entropy loss function, and adjusting the network structure through the back-propagation algorithm.
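The six claimed steps can be sketched end to end. In the numpy sketch below, random weight matrices stand in for the trained LSTM cells and for the three one-dimensional convolutions, and the shapes (50 frames, 34 acoustic features, hidden size 16, 4 emotion classes) are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, D, C = 50, 34, 16, 4   # frames, acoustic dims, hidden size, emotion classes

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def direction_branch(h):
    """One direction: project to Q, K, V, self-attend, then mean-pool over time."""
    Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
    Q, K, V = h @ Wq, h @ Wk, h @ Wv   # stand-ins for the three 1-D convolutions
    W = softmax(Q @ K.T, axis=-1)      # (T, T) frame-to-frame attention weights
    O = W @ V                          # self-attention-weighted output, (T, D)
    return O.mean(axis=0)              # mean pooling over time, (D,)

x = rng.standard_normal((T, F))        # one utterance of acoustic features
Wf = rng.standard_normal((F, D))       # stand-in for the forward LSTM
Wb = rng.standard_normal((F, D))       # stand-in for the backward LSTM
h_fwd = np.tanh(x @ Wf)                # forward hidden states
h_bwd = np.tanh(x[::-1] @ Wb)          # backward hidden states (reversed input)

s = np.concatenate([direction_branch(h_fwd), direction_branch(h_bwd)])  # spliced, (2D,)
Wo = rng.standard_normal((2 * D, C))
p = softmax(s @ Wo)                    # softmax-layer class probabilities
```

In training, `p` would be fed with the label into the cross-entropy loss and the stand-in weights would be learned by back-propagation.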
2. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 1), the original audio signal samples come from the international speech emotion database IEMOCAP; the acoustic features of the original audio signal samples are extracted with the openSMILE toolkit; the acoustic features of the original audio signal samples include prosodic features (zero-crossing rate and energy) and spectral features (mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, chroma features, and chroma standard deviation).
3. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 2), the feature-extracted speech training set data consists of N training samples, where N denotes the number of training samples and y_i denotes the emotion class of sample i; the data is input separately into the forward long short-term memory network and the backward long short-term memory network, and the output features of the two directions, namely the forward feature and the backward feature, are obtained.
4. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 4), the self-attention operation is defined as measuring the similarity of the features against themselves and obtaining the weight of each moment from that similarity measure. First, the last dimension D of the three obtained three-dimensional feature mapping matrices is split to obtain three four-dimensional matrices Q', K' and V'. Then the matrices Q' and K' are multiplied, and a softmax transformation is applied to the result to obtain the weight matrix W, which relates each moment to all other moments. Finally, the dot product of the weight matrix W and the remaining four-dimensional matrix V' gives the self-attention-weighted output O, defined by the formula:
O = W * V'
The third dimension of the output O is merged to obtain the three-dimensional data O'; the forward self-attention-weighted output and the backward self-attention-weighted output are defined accordingly.
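The split-attend-merge procedure of claim 4 is essentially multi-head self-attention. A minimal numpy sketch, where the head count and tensor shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def directional_self_attention(Q, K, V, n_heads):
    """Split the last dimension D into heads, attend per head, merge heads back.
    Q, K, V: (batch, T, D) with D divisible by n_heads."""
    B, T, D = Q.shape
    d = D // n_heads
    # (B, T, D) -> (B, heads, T, d): the four-dimensional split of the claim
    def split(M):
        return M.reshape(B, T, n_heads, d).transpose(0, 2, 1, 3)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    W = softmax(Qh @ Kh.transpose(0, 1, 3, 2), axis=-1)  # (B, heads, T, T) weights
    O = W @ Vh                                           # weighted values
    # merge the head dimension back: (B, heads, T, d) -> (B, T, D)
    return O.transpose(0, 2, 1, 3).reshape(B, T, D)

rng = np.random.default_rng(1)
Q = rng.standard_normal((2, 5, 8))       # hypothetical (batch, T, D) projections
O = directional_self_attention(Q, Q, Q, n_heads=2)
```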
5. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 5), mean pooling over the time dimension is applied separately to the forward self-attention-weighted output and to the backward self-attention-weighted output, yielding two two-dimensional matrices; the spliced output S is obtained by splicing the forward and backward pooled outputs, so as to better retain the original features.
6. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 6), the softmax layer output is obtained by inputting the obtained S into a softmax layer of 4 neurons, yielding the probability P of each class; the obtained probability P of each class and the label y are input into the cross-entropy loss function, which is minimized:
L = -Σ_i y_i log P_i
Finally, the weights of the network are adjusted by the back-propagation algorithm.
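The softmax layer and cross-entropy loss of claim 6 can be sketched for a single sample; the logit values below are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, y):
    """Negative log-likelihood of the true class y under probabilities p."""
    return -np.log(p[y])

logits = np.array([2.0, 1.0, 0.5, 0.1])  # output of the 4-neuron layer
p = softmax(logits)                       # probability P of each class
loss = cross_entropy(p, y=0)              # loss for true label y = 0
```

Minimizing this loss by back-propagation pushes the probability of the true class toward 1.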
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910555688.2A CN110400579B (en) | 2019-06-25 | 2019-06-25 | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110400579A true CN110400579A (en) | 2019-11-01 |
CN110400579B CN110400579B (en) | 2022-01-11 |
Family
ID=68322649
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910555688.2A Active CN110400579B (en) | 2019-06-25 | 2019-06-25 | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110400579B (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111048082A (en) * | 2019-12-12 | 2020-04-21 | 中国电子科技集团公司第二十八研究所 | Improved end-to-end speech recognition method |
CN111259761A (en) * | 2020-01-13 | 2020-06-09 | 东南大学 | Electroencephalogram emotion recognition method and device based on migratable attention neural network |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111429948A (en) * | 2020-03-27 | 2020-07-17 | 南京工业大学 | Voice emotion recognition model and method based on attention convolution neural network |
CN111461173A (en) * | 2020-03-06 | 2020-07-28 | 华南理工大学 | Attention mechanism-based multi-speaker clustering system and method |
CN111477221A (en) * | 2020-05-28 | 2020-07-31 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN111508500A (en) * | 2020-04-17 | 2020-08-07 | 五邑大学 | Voice emotion recognition method, system, device and storage medium |
CN111524535A (en) * | 2020-04-30 | 2020-08-11 | 杭州电子科技大学 | Feature fusion method for speech emotion recognition based on attention mechanism |
CN111613240A (en) * | 2020-05-22 | 2020-09-01 | 杭州电子科技大学 | Camouflage voice detection method based on attention mechanism and Bi-LSTM |
CN111783469A (en) * | 2020-06-29 | 2020-10-16 | 中国计量大学 | Method for extracting text sentence characteristics |
CN111798445A (en) * | 2020-07-17 | 2020-10-20 | 北京大学口腔医院 | Tooth image caries identification method and system based on convolutional neural network |
CN112447186A (en) * | 2020-10-16 | 2021-03-05 | 华东理工大学 | Speech emotion recognition algorithm weighted according to class characteristics |
CN112581979A (en) * | 2020-12-10 | 2021-03-30 | 重庆邮电大学 | Speech emotion recognition method based on spectrogram |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN113284515A (en) * | 2021-04-19 | 2021-08-20 | 大连海事大学 | Voice emotion recognition method based on physical waves and circulating network |
CN113317791A (en) * | 2021-05-28 | 2021-08-31 | 温州康宁医院股份有限公司 | Method and device for determining severity of depression based on audio frequency of testee |
CN113469470A (en) * | 2021-09-02 | 2021-10-01 | 国网浙江省电力有限公司杭州供电公司 | Energy consumption data and carbon emission correlation analysis method based on electric brain center |
CN113571050A (en) * | 2021-07-28 | 2021-10-29 | 复旦大学 | Voice depression state identification method based on Attention and Bi-LSTM |
CN111259761B (en) * | 2020-01-13 | 2024-06-07 | 东南大学 | Electroencephalogram emotion recognition method and device based on movable attention neural network |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108597541A (en) * | 2018-04-28 | 2018-09-28 | 南京师范大学 | A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying |
CN108831450A (en) * | 2018-03-30 | 2018-11-16 | 杭州鸟瞰智能科技股份有限公司 | A kind of virtual robot man-machine interaction method based on user emotion identification |
CN109243493A (en) * | 2018-10-30 | 2019-01-18 | 南京工程学院 | Based on the vagitus emotion identification method for improving long memory network in short-term |
CN109243494A (en) * | 2018-10-30 | 2019-01-18 | 南京工程学院 | Childhood emotional recognition methods based on the long memory network in short-term of multiple attention mechanism |
CN109285562A (en) * | 2018-09-28 | 2019-01-29 | 东南大学 | Speech-emotion recognition method based on attention mechanism |
CN109522548A (en) * | 2018-10-26 | 2019-03-26 | 天津大学 | A kind of text emotion analysis method based on two-way interactive neural network |
CN109710761A (en) * | 2018-12-21 | 2019-05-03 | 中国标准化研究院 | The sentiment analysis method of two-way LSTM model based on attention enhancing |
CN109740148A (en) * | 2018-12-16 | 2019-05-10 | 北京工业大学 | A kind of text emotion analysis method of BiLSTM combination Attention mechanism |
CN109817246A (en) * | 2019-02-27 | 2019-05-28 | 平安科技(深圳)有限公司 | Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model |
Non-Patent Citations (2)
Title |
---|
XIAOFENG CAI ET AL.: "Multi-view and Attention-Based BI-LSTM for Weibo", Advances in Intelligent Systems Research, Volume 147, International Conference on Network, Communication, Computer Engineering (NCCE 2018) * |
XING JILIANG: "Research on relation classification using a Bi-LSTM recurrent neural network combined with an attention mechanism", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111048082B (en) * | 2019-12-12 | 2022-09-06 | 中国电子科技集团公司第二十八研究所 | Improved end-to-end speech recognition method |
CN111048082A (en) * | 2019-12-12 | 2020-04-21 | 中国电子科技集团公司第二十八研究所 | Improved end-to-end speech recognition method |
CN111357051B (en) * | 2019-12-24 | 2024-02-02 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111259761A (en) * | 2020-01-13 | 2020-06-09 | 东南大学 | Electroencephalogram emotion recognition method and device based on migratable attention neural network |
CN111259761B (en) * | 2020-01-13 | 2024-06-07 | 东南大学 | Electroencephalogram emotion recognition method and device based on movable attention neural network |
CN111461173A (en) * | 2020-03-06 | 2020-07-28 | 华南理工大学 | Attention mechanism-based multi-speaker clustering system and method |
CN111461173B (en) * | 2020-03-06 | 2023-06-20 | 华南理工大学 | Multi-speaker clustering system and method based on attention mechanism |
CN111429948A (en) * | 2020-03-27 | 2020-07-17 | 南京工业大学 | Voice emotion recognition model and method based on attention convolution neural network |
CN111508500B (en) * | 2020-04-17 | 2023-08-29 | 五邑大学 | Voice emotion recognition method, system, device and storage medium |
CN111508500A (en) * | 2020-04-17 | 2020-08-07 | 五邑大学 | Voice emotion recognition method, system, device and storage medium |
CN111524535A (en) * | 2020-04-30 | 2020-08-11 | 杭州电子科技大学 | Feature fusion method for speech emotion recognition based on attention mechanism |
CN111524535B (en) * | 2020-04-30 | 2022-06-21 | 杭州电子科技大学 | Feature fusion method for speech emotion recognition based on attention mechanism |
CN111613240A (en) * | 2020-05-22 | 2020-09-01 | 杭州电子科技大学 | Camouflage voice detection method based on attention mechanism and Bi-LSTM |
CN111613240B (en) * | 2020-05-22 | 2023-06-27 | 杭州电子科技大学 | Camouflage voice detection method based on attention mechanism and Bi-LSTM |
CN111477221B (en) * | 2020-05-28 | 2022-12-30 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN111477221A (en) * | 2020-05-28 | 2020-07-31 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN111783469A (en) * | 2020-06-29 | 2020-10-16 | 中国计量大学 | Method for extracting text sentence characteristics |
CN111798445B (en) * | 2020-07-17 | 2023-10-31 | 北京大学口腔医院 | Tooth image caries identification method and system based on convolutional neural network |
CN111798445A (en) * | 2020-07-17 | 2020-10-20 | 北京大学口腔医院 | Tooth image caries identification method and system based on convolutional neural network |
CN112447186A (en) * | 2020-10-16 | 2021-03-05 | 华东理工大学 | Speech emotion recognition algorithm weighted according to class characteristics |
CN112581979B (en) * | 2020-12-10 | 2022-07-12 | 重庆邮电大学 | Speech emotion recognition method based on spectrogram |
CN112581979A (en) * | 2020-12-10 | 2021-03-30 | 重庆邮电大学 | Speech emotion recognition method based on spectrogram |
CN113284515B (en) * | 2021-04-19 | 2023-05-02 | 大连海事大学 | Speech emotion recognition method based on physical wave and circulation network |
CN113284515A (en) * | 2021-04-19 | 2021-08-20 | 大连海事大学 | Voice emotion recognition method based on physical waves and circulating network |
CN113317791B (en) * | 2021-05-28 | 2023-03-14 | 温州康宁医院股份有限公司 | Method and device for determining severity of depression based on audio frequency of testee |
CN113317791A (en) * | 2021-05-28 | 2021-08-31 | 温州康宁医院股份有限公司 | Method and device for determining severity of depression based on audio frequency of testee |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN113571050A (en) * | 2021-07-28 | 2021-10-29 | 复旦大学 | Voice depression state identification method based on Attention and Bi-LSTM |
CN113469470A (en) * | 2021-09-02 | 2021-10-01 | 国网浙江省电力有限公司杭州供电公司 | Energy consumption data and carbon emission correlation analysis method based on electric brain center |
Also Published As
Publication number | Publication date |
---|---|
CN110400579B (en) | 2022-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110400579A (en) | Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term memory network | |
Chatziagapi et al. | Data Augmentation Using GANs for Speech Emotion Recognition. | |
Er | A novel approach for classification of speech emotions based on deep and acoustic features | |
Hu et al. | Temporal multimodal learning in audiovisual speech recognition | |
Sun | End-to-end speech emotion recognition with gender information | |
Mesgarani et al. | Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations | |
Dennis | Sound event recognition in unstructured environments using spectrogram image processing | |
CN103544963B (en) | A kind of speech-emotion recognition method based on core semi-supervised discrimination and analysis | |
CN111583964B (en) | Natural voice emotion recognition method based on multimode deep feature learning | |
Li et al. | Exploiting the potentialities of features for speech emotion recognition | |
Ghai et al. | Emotion recognition on speech signals using machine learning | |
Elshaer et al. | Transfer learning from sound representations for anger detection in speech | |
Huang et al. | Speech emotion recognition using convolutional neural network with audio word-based embedding | |
Gao et al. | ToneNet: A CNN Model of Tone Classification of Mandarin Chinese. | |
CN110348482B (en) | Speech emotion recognition system based on depth model integrated architecture | |
CN111968652A (en) | Speaker identification method based on 3DCNN-LSTM and storage medium | |
Kuang et al. | Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks | |
Kamaruddin et al. | Features extraction for speech emotion | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion | |
Xue et al. | Learning speech emotion features by joint disentangling-discrimination | |
Xue et al. | Driver’s speech emotion recognition for smart cockpit based on a self-attention deep learning framework | |
Rammohan et al. | Speech signal-based modelling of basic emotions to analyse compound emotion: Anxiety | |
Muralikrishna et al. | Noise-robust spoken language identification using language relevance factor based embedding | |
CN115312080A (en) | Voice emotion recognition model and method based on complementary acoustic characterization | |
Segarceanu et al. | Environmental acoustics modelling techniques for forest monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||