CN110400579A - Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term memory network - Google Patents
Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term memory network Download PDF Info
- Publication number
- CN110400579A (application number CN201910555688.2A)
- Authority
- CN
- China
- Prior art keywords
- output
- attention
- network
- feature
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The present invention relates to a speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network. The method comprises the following steps: acoustic features are first extracted from the original audio signal and fed into forward and backward long short-term memory networks, which output forward and backward features; a self-attention operation is then applied to each direction to obtain the forward and backward self-attention-weighted outputs; these weighted outputs are each mean-pooled, spliced together, and fed into a softmax layer; the softmax output and the class labels are passed to a cross-entropy loss function, the best-performing network is selected on a validation set, and the test-set data are finally fed into the trained network to obtain the final emotion category. By introducing the self-attention mechanism into the recurrent neural network, the present invention can more easily discover the correlations among signals within an utterance, and by adding a direction mechanism to the self-attention mechanism it solves the problem of classification performance degrading due to a lack of information.
Description
Technical field
The present invention relates to the technical field of speech emotion recognition, and specifically to a speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network.
Background technique
In recent years, human-computer interaction has attracted the interest of more and more data scientists. Making communication between humans and machines more natural has two main goals: first, letting the machine understand what a human says; second, letting the machine recognize the emotion with which a human speaks. Computers today can understand the meaning of human speech, but having a machine recognize the emotion in speech remains a considerable challenge.
In the early days, researchers extracted features from the speech signal and then classified them with machine learning classifiers. At the beginning of the 21st century, researchers used Gaussian mixture models or hidden Markov models for classification; later, owing to the outstanding performance of the support vector machine, it replaced those classifiers and is still often used as the baseline algorithm in the speech emotion recognition field. Subsequently, with the development of neural networks, researchers found that extracting high-level features through a neural network and feeding them into other classifiers (such as support vector machines or Gaussian mixture models) could achieve good results.
Although deep learning techniques have been used in recent years to analyze emotional changes in speech with good results, general methods cannot distinguish voiced frames from unvoiced frames well. Current approaches to this problem fall into two broad classes: the first removes unvoiced frames manually; the second uses an algorithm to adaptively learn which frames are unvoiced and which are voiced. Methods of the first class usually identify frames by pitch, but this is time-consuming and laborious and largely destroys the temporal structure of the speech data, so although usable, it has clear defects. Methods of the second class use some adaptive mechanism to assign lower weights to unvoiced frames; common methods include attention mechanisms and the CTC loss. Since the CTC loss assigns discrete weights, it forces the weight of an unvoiced segment to 0 or the weight of a voiced segment to 1; but human emotional expression is usually gradual, so assigning continuous weights is the more appropriate approach, and the attention mechanism can do exactly this.
The present invention differs from the traditional attention mechanism. The traditional attention mechanism applies a softmax transformation to the data along the time dimension to obtain temporal weights; although this has some effect, it cannot make full use of the signal. The self-attention mechanism of the present invention instead obtains the weights from a softmax transformation of the similarity of the data with itself; since the weight matrix is derived from the internal information of the signal, it can exploit the information inside an utterance more effectively.
Summary of the invention
Technical problem: the technical problem to be solved by the present invention is to provide an algorithm that can analyze the emotion of a speech signal. By adding a self-attention mechanism after the bidirectional long short-term memory network, correlations inside the signal are discovered, which in turn controls the importance of each temporal frame. The self-attention mechanism can reduce the influence of temporal frames that harm classification performance and let the network focus on the temporal frames that help classification, thereby improving the classification accuracy of the classifier on speech emotion datasets.
Technical solution: first, the raw data are divided into a training set, a validation set, and a test set. Because speech data are sequential, the present invention decodes the speech features of the training-set data with a bidirectional long short-term memory network, then weights each time step of the decoded data in both directions with the self-attention method, and finally feeds the weighted outputs together with the true class labels into a cross-entropy loss function. After the model weights are obtained on the training set, model parameters are selected on the validation set to obtain the best-performing model; the test set is then fed into this best model to evaluate its classification performance.
The technical solution adopted in the present invention can be refined further. The self-attention mechanism is defined as measuring the similarity of the signal with itself and obtaining the weight of each moment from that similarity measure. First, the features output by the bidirectional long short-term memory network are each fed into three one-dimensional convolutions, yielding three different feature mapping matrices Q, K, V. The last dimension D of the resulting Q, K, V is split to obtain three four-dimensional matrices Q', K', V'. The matrices Q' and K' are multiplied, the result is passed through a softmax layer to obtain the weight matrix W, and finally W is multiplied with the remaining four-dimensional matrix V' to obtain the self-attention-weighted output O, defined by the formula:
O = W * V'
The split dimension of the output O is merged back to obtain three-dimensional data O'; this gives the forward self-attention-weighted output and the backward self-attention-weighted output. Mean pooling is applied to each of the forward and backward weighted outputs, and the two pooled results are spliced together. The spliced result is fed into the softmax layer; the resulting softmax output and the class labels are fed together into the cross-entropy loss function, and the whole network structure is adjusted by the back-propagation algorithm.
Beneficial effects: compared with the prior art, the present invention has the following advantages:
The speech emotion recognition system of the present invention, based on a directional self-attention mechanism and a bidirectional long short-term memory network, introduces the self-attention mechanism into the bidirectional LSTM network and assigns weights to the temporal frames of speech through the attention mechanism, without manually deleting useless frames. The present invention exploits the ability of self-attention to discover correlations among signals inside an utterance: it pays more attention to voiced frames while weakening the influence of unvoiced frames that harm classification. In addition, analyzing the speech data from different directions further increases the robustness of the network, so the speech emotion recognition system of the present invention adds a direction mechanism to the self-attention mechanism, and by parsing the forward and backward high-level features of the LSTM it solves the problem of classification performance degrading due to insufficient information. Experiments show that the speech emotion recognition system of the present invention achieves ideal classification performance.
Brief description of the drawings
Fig. 1 is the overall framework of the invention as applied in the speech emotion recognition field;
Fig. 2 shows the confusion matrices of the various algorithms on the IEMOCAP improvisation dataset.
Specific embodiment
In order to describe the content of the present invention more clearly, it is described in detail below with reference to the drawings and specific embodiments. The speech emotion recognition system of the present invention, based on a directional self-attention mechanism and a bidirectional long short-term memory network (BLSTM-DSA), comprises the following steps:
Step 1: acoustic features are extracted from the original audio signal samples. The acoustic features include prosodic features (zero-crossing rate and energy) and spectral features (mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, chroma features, and chroma standard deviation). These acoustic features are extracted with the openSMILE toolkit, yielding the feature-extracted speech training-set data.
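Two of the prosodic features named above, zero-crossing rate and short-time energy, can be computed directly from framed audio. The patent uses openSMILE; the following is only an illustrative numpy sketch of the framing and the two features, with the frame length and hop chosen to match the 25 ms Hamming window and 10 ms shift described later in the text.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes within each frame."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

# Example: 1 s of a 16 kHz tone, 25 ms Hamming-windowed frames, 10 ms hop
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)
frames = frame_signal(x, frame_len=int(0.025 * sr), hop=int(0.010 * sr))
frames = frames * np.hamming(frames.shape[1])
zcr = zero_crossing_rate(frames)
energy = short_time_energy(frames)
```

A 220 Hz tone crosses zero 440 times per second, so the per-sample rate stays well below 0.1; a noisy unvoiced frame would score much higher, which is what makes the feature useful for voiced/unvoiced discrimination.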
Step 2: the feature-extracted speech training-set data are fed into a forward long short-term memory network and a backward long short-term memory network. The input training speech data are defined as D = {(x_i, y_i)}, i = 1, ..., N, where N is the number of training samples, y_i = 0 denotes the angry class, y_i = 1 the happy class, y_i = 2 the neutral class, and y_i = 3 the sad class. The long short-term memory network is defined by the following formulas:
i_t = σ(W_i x_t + U_i h_{t-1} + V_i ⊙ c_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + V_f ⊙ c_{t-1} + b_f)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o x_t + U_o h_{t-1} + V_o ⊙ c_t + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ(·) denotes the sigmoid function, whose output interval is (0, 1). Because of this property of its output interval (resembling a probability), it is often regarded as the representation closest to a probability. W_i, W_f, W_c, W_o are the learnable input-to-state weight matrices; U_i, U_f, U_c, U_o are the learnable state-to-state matrices; V_i, V_f, V_o are the learnable matrices known as peephole connections; h_t is the neuron state of layer l at time step t. i_t is the input gate, which indicates how much information of the candidate state is saved at the current moment; f_t is the forget gate, which indicates how much information in the internal state c_{t-1} of the previous time step should be forgotten; o_t is the output gate, which controls how much information of the current internal state c_t must be exposed to the external state h_t. To distinguish the forward and backward outputs, the forward feature output of the last layer is denoted h_f and the backward feature output h_b.
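One time step of the peephole LSTM described by the gate equations above can be sketched in numpy as follows. Weight names follow the text (W* input-to-state, U* state-to-state, V* peephole); the dimensions (68-dimensional frame features, 256 hidden units) match the embodiment, but the random initialization is purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One peephole-LSTM step. W* map input to state, U* map state to
    state, V* are the (diagonal) peephole connections named in the text."""
    i = sigmoid(p['Wi'] @ x_t + p['Ui'] @ h_prev + p['Vi'] * c_prev + p['bi'])
    f = sigmoid(p['Wf'] @ x_t + p['Uf'] @ h_prev + p['Vf'] * c_prev + p['bf'])
    c_tilde = np.tanh(p['Wc'] @ x_t + p['Uc'] @ h_prev + p['bc'])
    c = f * c_prev + i * c_tilde          # forget old state, write candidate
    o = sigmoid(p['Wo'] @ x_t + p['Uo'] @ h_prev + p['Vo'] * c + p['bo'])
    h = o * np.tanh(c)                    # exposed (external) state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 68, 256                       # 68-dim frame features, 256 units
p = {k: rng.standard_normal((d_h, d_in)) * 0.01 for k in ('Wi', 'Wf', 'Wc', 'Wo')}
p.update({k: rng.standard_normal((d_h, d_h)) * 0.01 for k in ('Ui', 'Uf', 'Uc', 'Uo')})
p.update({k: rng.standard_normal(d_h) * 0.01
          for k in ('Vi', 'Vf', 'Vo', 'bi', 'bf', 'bc', 'bo')})
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((10, d_in)):   # run 10 frames through the cell
    h, c = lstm_step(x_t, h, c, p)
```

The backward network applies the same recurrence with the frame sequence reversed, which is what yields the two direction-specific feature streams.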
Step 3: the forward features h_f and the backward features h_b are each passed through three one-dimensional convolutions, giving the convolution outputs Q, K, V; the forward three-dimensional feature mapping matrices are denoted Q_f, K_f, V_f and the backward ones Q_b, K_b, V_b. The one-dimensional convolution is well suited to analyzing speech data because it preserves the temporal order of the data, and compared with other operations it holds a speed advantage; the convolution is performed three times precisely to prepare for the subsequent self-attention analysis of the signal against itself. The last dimension of Q, K, V is then split, yielding three four-dimensional feature matrices, which we denote Q', K', V'; with h attention heads the size of the split dimension is D/h (h = 8 in the experiments). Scaled dot-product attention is applied to the resulting Q', K', V', with the weight matrix W = softmax(Q'K'ᵀ / √d), and the output is defined by the formula:
O = W * V'    (7)
Finally, the split dimension of the output O is merged back to obtain three-dimensional data O'; this gives the forward self-attention-weighted output and the backward self-attention-weighted output.
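The projection, head split, and scaled dot-product operation of step 3 can be sketched for one direction as follows. Since the kernel size of the one-dimensional convolutions is 1 (as stated in the embodiment), each reduces to an independent linear map per frame; the batch dimension is dropped and the sizes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, n_heads, rng):
    """F: (T, D) features from one direction of the BLSTM.
    Kernel-size-1 convolutions are per-frame linear maps Wq, Wk, Wv."""
    T, D = F.shape
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    d = D // n_heads
    # split the last dimension into heads: (n_heads, T, d)
    Qh = Q.reshape(T, n_heads, d).transpose(1, 0, 2)
    Kh = K.reshape(T, n_heads, d).transpose(1, 0, 2)
    Vh = V.reshape(T, n_heads, d).transpose(1, 0, 2)
    # scaled dot-product: each row of W weights all T frames for one frame
    W = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d))  # (n_heads, T, T)
    O = W @ Vh                                            # weighted values
    return O.transpose(1, 0, 2).reshape(T, D), W          # merge heads back

rng = np.random.default_rng(0)
T, D = 50, 128                         # 50 frames, 128 feature channels
O, W = self_attention(rng.standard_normal((T, D)), n_heads=8, rng=rng)
```

Each row of W sums to 1, so every output frame is a convex combination of all frames in the utterance; this is the mechanism by which unvoiced frames can be continuously down-weighted rather than hard-discarded.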
Mean pooling is applied to each of the obtained forward self-attention-weighted output and backward self-attention-weighted output, and the two pooled results are spliced together to give the spliced output S.
The resulting spliced result S is fed into the softmax layer; the softmax output together with the class labels is then fed into the cross-entropy loss function, and the whole network structure is adjusted by the back-propagation algorithm. The cross-entropy loss function is defined as:
L = -(1/N) Σ_{n=1}^{N} Σ_{h=1}^{H} y_{n,h} log(ŷ_{n,h})
where H is the number of classes and N is the number of samples.
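The pooling, splicing, and classification head described above can be sketched as follows. The shapes and the random output-layer weights are illustrative; in the patent the loss is averaged over a training batch and minimized by back-propagation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(O_fwd, O_bwd, W_out, b_out):
    """Mean-pool each direction over time, splice, project to class probs."""
    s = np.concatenate([O_fwd.mean(axis=0), O_bwd.mean(axis=0)])  # (2D,)
    return softmax(W_out @ s + b_out)

def cross_entropy(probs, label, eps=1e-12):
    """-log p(true class) for one sample (one-hot target)."""
    return -np.log(probs[label] + eps)

rng = np.random.default_rng(0)
T, D, H = 50, 128, 4                  # frames, feature dim, 4 emotion classes
O_fwd, O_bwd = rng.standard_normal((2, T, D))
W_out = rng.standard_normal((H, 2 * D)) * 0.01
probs = classify(O_fwd, O_bwd, W_out, np.zeros(H))
loss = cross_entropy(probs, label=2)  # e.g. true class "neutral"
```

Splicing doubles the feature dimension (2D) because the forward and backward pooled vectors are concatenated rather than summed, which is what lets the direction mechanism preserve information from both passes.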
Experimental design
Dataset: this work uses the currently most popular emotion database, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. IEMOCAP was recorded by the engineering school of the University of Southern California, USA; it contains the audiovisual recordings of 5 sessions in total, i.e. audio, video, and motion-capture data, with a total duration of 12 hours. Each session is a dialogue performed by one actor and one actress, and the performances are divided into scripted and improvised. Statistically, the database consists of 10039 utterances of varying duration with an average length of 4.5 seconds, and each utterance was given continuous labels and discrete labels by three annotators. The database focuses on five emotions: angry, happy, sad, neutral, and frustrated; however, the annotators were not limited to these emotions during labeling. Among the utterances, speech data whose labels are not considered account for 38%, unlabeled data account for 7%, data whose label cannot be determined account for 15%, and data with a determinable label account for 40%. To compare with the results of other researchers, we select only the angry, happy, neutral, and sad speech data whose labels can be determined. Table 1 shows, for each individual in the IEMOCAP improvisation dataset, the number of utterances in each emotion.
Table 1: the IEMOCAP improvisation dataset
Feature extraction: in the feature extraction phase, the original signal is converted into acoustic features (including prosodic features, spectral features, voice-quality features, and features extracted by deep learning algorithms). The prosodic features chosen in this method are zero-crossing rate and energy; the chosen spectral features are mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, chroma features, and chroma standard deviation; openSMILE is used as the speech feature extraction tool. The speech signal, sampled at 16 kHz, is first framed and windowed: in this method the window is a 25 ms Hamming window with a 10 ms frame shift. The 12-dimensional mel-frequency cepstral coefficients are computed from the log Fourier transform and 26 filters. The spectral roll-off point is set to 0.85, meaning the frequency below which 85% of the overall amplitude is contained is taken into account; spectral flux is obtained as the minimum squared distance between the current frame and the previous frame; the spectral centroid is obtained as the weighted average of the frequencies. Spectral entropy applies Shannon entropy with the energy distribution treated as a probability distribution. The spectral spread, i.e. the second-order central moment of the spectrum, is obtained by computing, for each frame, the standard deviation of the band frequencies about the spectral centroid. The zero-crossing rate is the rate at which the time-domain waveform crosses the time axis. Energy is obtained from the weighted square of each frame; in addition, the energy entropy applies Shannon entropy to the energy to determine whether the energy distribution is uniform. The full set of manually extracted low-dimensional features comprises the mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, zero-crossing rate, fundamental frequency, energy, energy entropy, and their first-order differences. Each frame thus has 68 feature dimensions; to better suit the neural network, mean-variance normalization is applied in this method.
Network training: this method uses a speaker-independent training strategy. On the IEMOCAP improvisation dataset, a leave-one-group-out (LOGO) training strategy is adopted, executing five rounds in total; in each round the utterances of four of the sessions form the training set, and in the remaining session the utterances recorded by the actress serve as the test set and those recorded by the actor as the validation set. Since the happy samples in the IEMOCAP improvisation dataset are in the minority and the emotion distribution of the data is imbalanced, the happy samples are resampled on this dataset. For the network configuration, the number of BLSTM layers is set to 2; the linear transformation of the input is initialized with the Glorot uniform distribution, the linear transformation of the recurrent layer state is initialized with the orthogonal initialization method, the number of LSTM neurons per layer is set to 256, and the dropout rate is set to 0.3. In the self-attention mechanism, the one-dimensional convolution kernels are initialized with the Glorot uniform distribution, the kernel size is 1 with 128 kernels, the regularization method is L2 with the regularization parameter set to 3×10⁻⁷, and the attention mechanism splits the features into 8 heads. The loss function is cross entropy, the batch size is set to 256, the base learning rate is set to 0.0001, and the Nadam optimizer is used for parameter optimization. To train the network better, warm-up and moving-average strategies are adopted. The warm-up strategy ramps the learning rate linearly over the first 8 training epochs; letting the learning rate grow linearly in the early phase allows the network to adapt to the data better. The moving average makes the model more robust on the test set, with the decay rate set to 0.999. To prevent overfitting, early stopping is also used during training: when the validation loss has not decreased within 10 epochs, training stops, and the model with the lowest validation loss is selected for testing. To accelerate convergence, a layer normalization layer is added between the BLSTM and the directional self-attention.
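The warm-up schedule can be sketched as below. The exact warm-up formula appears only as an image in the source, so a plain linear ramp to the base rate of 0.0001 over 8 epochs is assumed here, matching the prose description of a linearly increasing early-phase learning rate.

```python
def warmup_lr(epoch, base_lr=1e-4, warmup_epochs=8):
    """Linear warm-up: ramp the learning rate over the first epochs, then
    hold it at the base rate (simple ramp assumed; the patent's exact
    formula is not reproduced in the text)."""
    if epoch <= warmup_epochs:
        return base_lr * epoch / warmup_epochs
    return base_lr

# Learning rate per epoch for the first 12 epochs (1-indexed)
schedule = [warmup_lr(e) for e in range(1, 13)]
```

After epoch 8 the rate stays at 1e-4; in practice a decay schedule or the early-stopping criterion described above ends training.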
Evaluation metrics: this method selects weighted average recall (Weighted Accuracy, WA) and unweighted average recall (Unweighted Accuracy, UA) as the evaluation metrics of the model. WA is the proportion of correctly classified samples over the entire test set. To evaluate the influence of class imbalance on the overall model, UA, i.e. the average of the per-class classification accuracies, is also considered. WA and UA are defined as:
WA = (number of correctly classified test samples) / (total number of test samples)
UA = (1/H) Σ_{h=1}^{H} (correctly classified samples of class h) / (samples of class h)
Comparison algorithms: the comparison algorithms used by this method are CNN, LSTM, and BLSTM. The CNN consists of two convolutional layers: the first has 2×2 kernels with stride 1 and 10 kernels, and the second has 2×2 kernels with stride 1 and 20 kernels; after each convolutional layer a max-pooling layer of size 2×2 with stride 2 is added, and finally two fully connected layers of 128 neurons each are appended, with a batch normalization layer inserted between them. The LSTM in this experiment is set to two layers of 256 neurons each, with the dropout rate set to 0.3. The BLSTM uses the same experimental parameters as the LSTM, except that each forward LSTM layer is paired with an additional backward LSTM layer; all models uniformly use the Nadam optimizer.
Experimental results
Table 2 shows the experimental results of each algorithm on the IEMOCAP improvisation dataset. CNN does not perform well on this dataset, giving the lowest result on both WA and UA. With the direction mechanism added, BLSTM shows better generalization than LSTM. BLSTM-DSA, which incorporates both the self-attention mechanism and the direction mechanism, achieves the best result on both WA and UA.
Table 2: results of each algorithm on the IEMOCAP improvisation dataset
Model | WA (%) | UA (%)
CNN | 57.75 | 45.08
LSTM | 61.89 | 50.52
BLSTM | 62.01 | 52.48
BLSTM-DSA | 62.16 | 55.21
Fig. 2 illustrates the confusion matrices of the algorithms on the IEMOCAP improvisation dataset.
From the confusion matrices in Fig. 2, on the recognition rate for the angry emotion, BLSTM-DSA is the highest and CNN the lowest. On the recognition rate for the happy emotion, BLSTM-DSA is also the highest and LSTM the lowest. On the neutral emotion, every algorithm exceeds 70% and the differences among them are small; similarly, the recognition rates for the sad emotion differ little across algorithms. In summary, BLSTM-DSA achieves ideal results on the angry, neutral, and sad recognition rates. Furthermore, since the sad and neutral emotions have larger sample sizes and both have distinctive characteristics, all algorithms achieve relatively high recognition rates on them.
In conclusion it is of the invention based on direction from the speech emotion recognition system of the two-way length of attention mechanism network in short-term
System finds the correlation inside signal by being added after two-way length in short-term network from attention mechanism, and then controls each
The significance level of temporal frame.It can reduce the influence of the temporal frame unfavorable to classification performance from attention mechanism, and allow network
It focuses more on and biggish temporal frame is helped to classification performance, to improve classification essence of the classifier on speech emotional data set
Degree.In addition, the present invention also provides reference for other relevant issues in same domain, expansion extension can be carried out on this basis,
With very wide application prospect.
Claims (6)
1. A speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network, characterized by comprising the following steps:
1) extracting acoustic features from original audio signal samples to obtain feature-extracted speech training set data;
2) inputting the feature-extracted speech training set data into a forward long short-term memory network and a backward long short-term memory network, and outputting a forward feature and a backward feature;
3) applying three one-dimensional convolutions to the output forward feature and to the output backward feature respectively, obtaining three forward three-dimensional feature mapping matrices and three backward three-dimensional feature mapping matrices;
4) applying the self-attention operation to the three forward three-dimensional feature mapping matrices obtained in step 3) to obtain the forward self-attention-weighted output, and applying the self-attention operation to the three backward three-dimensional feature mapping matrices to obtain the backward self-attention-weighted output;
5) applying mean pooling to the forward self-attention-weighted output and to the backward self-attention-weighted output respectively, and splicing the two pooled results to output the spliced representation;
6) inputting the spliced representation into a softmax layer to obtain the softmax layer output, inputting the softmax layer output together with the labels into a cross-entropy loss function, and adjusting the network structure through the back-propagation algorithm.
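The six claimed steps can be sketched end to end. In the numpy sketch below, random weight matrices stand in for the trained LSTM cells and for the three one-dimensional convolutions, and the shapes (50 frames, 34 acoustic features, hidden size 16, 4 emotion classes) are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, D, C = 50, 34, 16, 4   # frames, acoustic dims, hidden size, emotion classes

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def direction_branch(h):
    """One direction: project to Q, K, V, self-attend, then mean-pool over time."""
    Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
    Q, K, V = h @ Wq, h @ Wk, h @ Wv   # stand-ins for the three 1-D convolutions
    W = softmax(Q @ K.T, axis=-1)      # (T, T) frame-to-frame attention weights
    O = W @ V                          # self-attention-weighted output, (T, D)
    return O.mean(axis=0)              # mean pooling over time, (D,)

x = rng.standard_normal((T, F))        # one utterance of acoustic features
Wf = rng.standard_normal((F, D))       # stand-in for the forward LSTM
Wb = rng.standard_normal((F, D))       # stand-in for the backward LSTM
h_fwd = np.tanh(x @ Wf)                # forward hidden states
h_bwd = np.tanh(x[::-1] @ Wb)          # backward hidden states (reversed input)

s = np.concatenate([direction_branch(h_fwd), direction_branch(h_bwd)])  # spliced, (2D,)
Wo = rng.standard_normal((2 * D, C))
p = softmax(s @ Wo)                    # softmax-layer class probabilities
```

In training, `p` would be fed with the label into the cross-entropy loss and the stand-in weights would be learned by back-propagation.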
2. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 1), the original audio signal samples come from the international speech emotion database IEMOCAP; the acoustic features of the original audio signal samples are extracted with the openSMILE toolkit; the acoustic features of the original audio signal samples include prosodic features (zero-crossing rate and energy) and spectral features (mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, chroma features, and chroma standard deviation).
3. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 2), the feature-extracted speech training set data consists of N training samples, where N denotes the number of training samples and y_i denotes the emotion class of sample i; the data is input separately into the forward long short-term memory network and the backward long short-term memory network, and the output features of the two directions, namely the forward feature and the backward feature, are obtained.
4. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 4), the self-attention operation is defined as measuring the similarity of the features against themselves and obtaining the weight of each moment from that similarity measure. First, the last dimension D of the three obtained three-dimensional feature mapping matrices is split to obtain three four-dimensional matrices Q', K' and V'. Then the matrices Q' and K' are multiplied, and a softmax transformation is applied to the result to obtain the weight matrix W, which relates each moment to all other moments. Finally, the dot product of the weight matrix W and the remaining four-dimensional matrix V' gives the self-attention-weighted output O, defined by the formula:
O = W * V'
The third dimension of the output O is merged to obtain the three-dimensional data O'; the forward self-attention-weighted output and the backward self-attention-weighted output are defined accordingly.
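The split-attend-merge procedure of claim 4 is essentially multi-head self-attention. A minimal numpy sketch, where the head count and tensor shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def directional_self_attention(Q, K, V, n_heads):
    """Split the last dimension D into heads, attend per head, merge heads back.
    Q, K, V: (batch, T, D) with D divisible by n_heads."""
    B, T, D = Q.shape
    d = D // n_heads
    # (B, T, D) -> (B, heads, T, d): the four-dimensional split of the claim
    def split(M):
        return M.reshape(B, T, n_heads, d).transpose(0, 2, 1, 3)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    W = softmax(Qh @ Kh.transpose(0, 1, 3, 2), axis=-1)  # (B, heads, T, T) weights
    O = W @ Vh                                           # weighted values
    # merge the head dimension back: (B, heads, T, d) -> (B, T, D)
    return O.transpose(0, 2, 1, 3).reshape(B, T, D)

rng = np.random.default_rng(1)
Q = rng.standard_normal((2, 5, 8))       # hypothetical (batch, T, D) projections
O = directional_self_attention(Q, Q, Q, n_heads=2)
```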
5. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 5), mean pooling over the time dimension is applied separately to the forward self-attention-weighted output and to the backward self-attention-weighted output, yielding two two-dimensional matrices; the spliced output S is obtained by splicing the forward and backward pooled outputs, so as to better retain the original features.
6. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 6), the softmax layer output is obtained by inputting the obtained S into a softmax layer of 4 neurons, yielding the probability P of each class; the obtained probability P of each class and the label y are input into the cross-entropy loss function, which is minimized:
L = -Σ_i y_i log P_i
Finally, the weights of the network are adjusted by the back-propagation algorithm.
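The softmax layer and cross-entropy loss of claim 6 can be sketched for a single sample; the logit values below are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, y):
    """Negative log-likelihood of the true class y under probabilities p."""
    return -np.log(p[y])

logits = np.array([2.0, 1.0, 0.5, 0.1])  # output of the 4-neuron layer
p = softmax(logits)                       # probability P of each class
loss = cross_entropy(p, y=0)              # loss for true label y = 0
```

Minimizing this loss by back-propagation pushes the probability of the true class toward 1.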
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910555688.2A CN110400579B (en) | 2019-06-25 | 2019-06-25 | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110400579A true CN110400579A (en) | 2019-11-01 |
CN110400579B CN110400579B (en) | 2022-01-11 |
Family
ID=68322649
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910555688.2A Active CN110400579B (en) | 2019-06-25 | 2019-06-25 | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110400579B (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111048082A (en) * | 2019-12-12 | 2020-04-21 | 中国电子科技集团公司第二十八研究所 | Improved end-to-end speech recognition method |
CN111259761A (en) * | 2020-01-13 | 2020-06-09 | 东南大学 | Electroencephalogram emotion recognition method and device based on migratable attention neural network |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111429948A (en) * | 2020-03-27 | 2020-07-17 | 南京工业大学 | Voice emotion recognition model and method based on attention convolution neural network |
CN111461173A (en) * | 2020-03-06 | 2020-07-28 | 华南理工大学 | Attention mechanism-based multi-speaker clustering system and method |
CN111477221A (en) * | 2020-05-28 | 2020-07-31 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN111508500A (en) * | 2020-04-17 | 2020-08-07 | 五邑大学 | Voice emotion recognition method, system, device and storage medium |
CN111524535A (en) * | 2020-04-30 | 2020-08-11 | 杭州电子科技大学 | Feature fusion method for speech emotion recognition based on attention mechanism |
CN111613240A (en) * | 2020-05-22 | 2020-09-01 | 杭州电子科技大学 | Camouflage voice detection method based on attention mechanism and Bi-LSTM |
CN111783469A (en) * | 2020-06-29 | 2020-10-16 | 中国计量大学 | Method for extracting text sentence characteristics |
CN111798445A (en) * | 2020-07-17 | 2020-10-20 | 北京大学口腔医院 | Tooth image caries identification method and system based on convolutional neural network |
CN112447186A (en) * | 2020-10-16 | 2021-03-05 | 华东理工大学 | Speech emotion recognition algorithm weighted according to class characteristics |
CN112581979A (en) * | 2020-12-10 | 2021-03-30 | 重庆邮电大学 | Speech emotion recognition method based on spectrogram |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN113284515A (en) * | 2021-04-19 | 2021-08-20 | 大连海事大学 | Voice emotion recognition method based on physical waves and circulating network |
CN113317791A (en) * | 2021-05-28 | 2021-08-31 | 温州康宁医院股份有限公司 | Method and device for determining severity of depression based on audio frequency of testee |
CN113469470A (en) * | 2021-09-02 | 2021-10-01 | 国网浙江省电力有限公司杭州供电公司 | Energy consumption data and carbon emission correlation analysis method based on electric brain center |
CN113571050A (en) * | 2021-07-28 | 2021-10-29 | 复旦大学 | Voice depression state identification method based on Attention and Bi-LSTM |
CN111259761B (en) * | 2020-01-13 | 2024-06-07 | 东南大学 | Electroencephalogram emotion recognition method and device based on movable attention neural network |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108597541A (en) * | 2018-04-28 | 2018-09-28 | 南京师范大学 | A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying |
CN108831450A (en) * | 2018-03-30 | 2018-11-16 | 杭州鸟瞰智能科技股份有限公司 | A kind of virtual robot man-machine interaction method based on user emotion identification |
CN109243493A (en) * | 2018-10-30 | 2019-01-18 | 南京工程学院 | Based on the vagitus emotion identification method for improving long memory network in short-term |
CN109243494A (en) * | 2018-10-30 | 2019-01-18 | 南京工程学院 | Childhood emotional recognition methods based on the long memory network in short-term of multiple attention mechanism |
CN109285562A (en) * | 2018-09-28 | 2019-01-29 | 东南大学 | Speech-emotion recognition method based on attention mechanism |
CN109522548A (en) * | 2018-10-26 | 2019-03-26 | 天津大学 | A kind of text emotion analysis method based on two-way interactive neural network |
CN109710761A (en) * | 2018-12-21 | 2019-05-03 | 中国标准化研究院 | The sentiment analysis method of two-way LSTM model based on attention enhancing |
CN109740148A (en) * | 2018-12-16 | 2019-05-10 | 北京工业大学 | A kind of text emotion analysis method of BiLSTM combination Attention mechanism |
CN109817246A (en) * | 2019-02-27 | 2019-05-28 | 平安科技(深圳)有限公司 | Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model |
Non-Patent Citations (2)
Title |
---|
XIAOFENG CAI ET AL.: "Multi-view and Attention-Based BI-LSTM for Weibo", Advances in Intelligent Systems Research, Volume 147, International Conference on Network, Communication, Computer Engineering (NCCE 2018) * |
XING JILIANG: "Research on relation classification using a Bi-LSTM recurrent neural network combined with an attention mechanism", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111048082B (en) * | 2019-12-12 | 2022-09-06 | 中国电子科技集团公司第二十八研究所 | Improved end-to-end speech recognition method |
CN111048082A (en) * | 2019-12-12 | 2020-04-21 | 中国电子科技集团公司第二十八研究所 | Improved end-to-end speech recognition method |
CN111357051B (en) * | 2019-12-24 | 2024-02-02 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111259761A (en) * | 2020-01-13 | 2020-06-09 | 东南大学 | Electroencephalogram emotion recognition method and device based on migratable attention neural network |
CN111259761B (en) * | 2020-01-13 | 2024-06-07 | 东南大学 | Electroencephalogram emotion recognition method and device based on movable attention neural network |
CN111461173A (en) * | 2020-03-06 | 2020-07-28 | 华南理工大学 | Attention mechanism-based multi-speaker clustering system and method |
CN111461173B (en) * | 2020-03-06 | 2023-06-20 | 华南理工大学 | Multi-speaker clustering system and method based on attention mechanism |
CN111429948A (en) * | 2020-03-27 | 2020-07-17 | 南京工业大学 | Voice emotion recognition model and method based on attention convolution neural network |
CN111508500B (en) * | 2020-04-17 | 2023-08-29 | 五邑大学 | Voice emotion recognition method, system, device and storage medium |
CN111508500A (en) * | 2020-04-17 | 2020-08-07 | 五邑大学 | Voice emotion recognition method, system, device and storage medium |
CN111524535A (en) * | 2020-04-30 | 2020-08-11 | 杭州电子科技大学 | Feature fusion method for speech emotion recognition based on attention mechanism |
CN111524535B (en) * | 2020-04-30 | 2022-06-21 | 杭州电子科技大学 | Feature fusion method for speech emotion recognition based on attention mechanism |
CN111613240A (en) * | 2020-05-22 | 2020-09-01 | 杭州电子科技大学 | Camouflage voice detection method based on attention mechanism and Bi-LSTM |
CN111613240B (en) * | 2020-05-22 | 2023-06-27 | 杭州电子科技大学 | Camouflage voice detection method based on attention mechanism and Bi-LSTM |
CN111477221B (en) * | 2020-05-28 | 2022-12-30 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN111477221A (en) * | 2020-05-28 | 2020-07-31 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN111783469A (en) * | 2020-06-29 | 2020-10-16 | 中国计量大学 | Method for extracting text sentence characteristics |
CN111798445B (en) * | 2020-07-17 | 2023-10-31 | 北京大学口腔医院 | Tooth image caries identification method and system based on convolutional neural network |
CN111798445A (en) * | 2020-07-17 | 2020-10-20 | 北京大学口腔医院 | Tooth image caries identification method and system based on convolutional neural network |
CN112447186A (en) * | 2020-10-16 | 2021-03-05 | 华东理工大学 | Speech emotion recognition algorithm weighted according to class characteristics |
CN112581979B (en) * | 2020-12-10 | 2022-07-12 | 重庆邮电大学 | Speech emotion recognition method based on spectrogram |
CN112581979A (en) * | 2020-12-10 | 2021-03-30 | 重庆邮电大学 | Speech emotion recognition method based on spectrogram |
CN113284515B (en) * | 2021-04-19 | 2023-05-02 | 大连海事大学 | Speech emotion recognition method based on physical wave and circulation network |
CN113284515A (en) * | 2021-04-19 | 2021-08-20 | 大连海事大学 | Voice emotion recognition method based on physical waves and circulating network |
CN113317791B (en) * | 2021-05-28 | 2023-03-14 | 温州康宁医院股份有限公司 | Method and device for determining severity of depression based on audio frequency of testee |
CN113317791A (en) * | 2021-05-28 | 2021-08-31 | 温州康宁医院股份有限公司 | Method and device for determining severity of depression based on audio frequency of testee |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN113571050A (en) * | 2021-07-28 | 2021-10-29 | 复旦大学 | Voice depression state identification method based on Attention and Bi-LSTM |
CN113469470A (en) * | 2021-09-02 | 2021-10-01 | 国网浙江省电力有限公司杭州供电公司 | Energy consumption data and carbon emission correlation analysis method based on electric brain center |
Also Published As
Publication number | Publication date |
---|---|
CN110400579B (en) | 2022-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110400579A (en) | Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term memory network | |
Chatziagapi et al. | Data Augmentation Using GANs for Speech Emotion Recognition. | |
Er | A novel approach for classification of speech emotions based on deep and acoustic features | |
Hu et al. | Temporal multimodal learning in audiovisual speech recognition | |
Sun | End-to-end speech emotion recognition with gender information | |
Mesgarani et al. | Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations | |
Dennis | Sound event recognition in unstructured environments using spectrogram image processing | |
CN103544963B (en) | A kind of speech-emotion recognition method based on core semi-supervised discrimination and analysis | |
CN111583964B (en) | Natural voice emotion recognition method based on multimode deep feature learning | |
Li et al. | Exploiting the potentialities of features for speech emotion recognition | |
Ghai et al. | Emotion recognition on speech signals using machine learning | |
Elshaer et al. | Transfer learning from sound representations for anger detection in speech | |
Huang et al. | Speech emotion recognition using convolutional neural network with audio word-based embedding | |
Gao et al. | ToneNet: A CNN Model of Tone Classification of Mandarin Chinese. | |
CN110348482B (en) | Speech emotion recognition system based on depth model integrated architecture | |
CN111968652A (en) | Speaker identification method based on 3DCNN-LSTM and storage medium | |
Kuang et al. | Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks | |
Kamaruddin et al. | Features extraction for speech emotion | |
CN112329819A (en) | Underwater target identification method based on multi-network fusion | |
Xue et al. | Learning speech emotion features by joint disentangling-discrimination | |
Xue et al. | Driver’s speech emotion recognition for smart cockpit based on a self-attention deep learning framework | |
Rammohan et al. | Speech signal-based modelling of basic emotions to analyse compound emotion: Anxiety | |
Muralikrishna et al. | Noise-robust spoken language identification using language relevance factor based embedding | |
CN115312080A (en) | Voice emotion recognition model and method based on complementary acoustic characterization | |
Segarceanu et al. | Environmental acoustics modelling techniques for forest monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||