CN110400579A - Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term memory network - Google Patents

Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term memory network Download PDF

Info

Publication number
CN110400579A
CN110400579A (application CN201910555688.2A)
Authority
CN
China
Prior art keywords
output
attention
network
feature
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910555688.2A
Other languages
Chinese (zh)
Other versions
CN110400579B (en)
Inventor
李冬冬
王喆
孙琳煜
方仲礼
杜文莉
张静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology
Priority to CN201910555688.2A priority Critical patent/CN110400579B/en
Publication of CN110400579A publication Critical patent/CN110400579A/en
Application granted granted Critical
Publication of CN110400579B publication Critical patent/CN110400579B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory (LSTM) network, comprising the following steps: acoustic features are first extracted from the original audio signal and fed into a forward and a backward long short-term memory network, which output forward and backward features; a self-attention operation is then applied to each direction to obtain the forward and backward self-attention-weighted outputs; mean pooling is applied separately to the forward and backward weighted outputs, the pooled results are concatenated and fed into a softmax layer, and the softmax output together with the class labels is input into a cross-entropy loss function; the most suitable network is selected on a validation set, and finally the test-set data are fed into the trained network to obtain the final emotion category. By introducing the self-attention mechanism into the recurrent neural network, the invention makes it easier to discover the correlations among signals within an utterance, and by adding a direction mechanism to the self-attention mechanism it solves the problem of classification performance degrading because of insufficient information.

Description

Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term memory network
Technical field
The present invention relates to the technical field of speech emotion recognition, and in particular to a speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory (LSTM) network.
Background Art
In recent years, human-computer interaction has attracted the interest of more and more data scientists. To make communication between humans and machines more natural, there are two main goals: first, to let the machine understand the meaning of what humans say; second, to let the machine recognize the emotion with which humans speak. Nowadays computers can understand the meaning of human speech, but enabling a machine to recognize the emotion in a voice remains a considerable challenge.
In the early days, researchers extracted speech signal features and then classified them with machine-learning classifiers. At the beginning of the 21st century, researchers classified emotions using Gaussian mixture models or hidden Markov models; later, owing to the outstanding performance of the support vector machine, researchers replaced those classifiers with support vector machines, and this algorithm is still often used as the baseline in the speech emotion recognition field. Subsequently, with the development of neural networks, researchers found that extracting high-level features with a neural network and feeding them into another classifier (such as a support vector machine or a Gaussian mixture model) yields good results.
Although deep-learning techniques have been used in recent years to analyze emotional changes in speech with good results, typical methods cannot distinguish well between voiced frames and unvoiced frames. Current approaches to this problem fall into two broad classes: the first removes unvoiced frames manually; the second uses an algorithm to learn adaptively which frames are unvoiced and which are voiced. Methods of the first class usually identify frames by pitch, but this is time-consuming and labor-intensive and largely destroys the temporal structure of the speech data, so although usable, it has clear drawbacks. Methods of the second class assign lower weights to unvoiced frames through some adaptive mechanism; common choices include attention mechanisms and the CTC loss. Because the CTC loss assigns discrete weights, it forces the weight of non-voiced segments to 0 or the weight of voiced segments to 1; however, human emotional expression is usually gradual, so assigning continuous weights is the more appropriate approach, and this is exactly what an attention mechanism can do well.
The present invention differs from the traditional attention mechanism. The traditional attention mechanism applies a softmax transformation to the data along the time dimension to obtain temporal weights; although this has a certain effect, it does not exploit the signal well. The self-attention mechanism used in the present invention obtains the weights by applying a softmax transformation to the similarity between the data and itself, so the weight matrix is derived from the internal information among signals and can exploit the information within an utterance more efficiently.
Summary of the invention
Technical problem: the technical problem to be solved by the invention is to provide an algorithm that can analyze the emotion in a speech signal. By adding a self-attention mechanism after the bidirectional long short-term memory network, the correlations inside the signal are discovered and the importance of each temporal frame is controlled. The self-attention mechanism can reduce the influence of temporal frames that are unfavorable to classification performance and lets the network focus more on temporal frames that contribute most to classification, thereby improving the classification accuracy of the classifier on speech emotion datasets.
Technical solution: first, the original data are divided into a training set, a validation set and a test set. Because speech data are sequential, the present invention decodes the acoustic features of the training-set data with a bidirectional long short-term memory network, then weights each time step of the decoded forward and backward features with a self-attention mechanism, and finally inputs the weighted output together with the true class labels into a cross-entropy loss function. After the model weights are obtained on the training set, parameter selection is performed on the validation set to obtain the best-performing model; the test set is then fed into this best model to obtain its classification performance.
The technical solution adopted in the present invention can be further refined. The self-attention mechanism is defined as a similarity measure of the features with themselves, from which the weight of each moment is obtained. First, the features output by the bidirectional long short-term memory network are fed into three separate one-dimensional convolutions, yielding three different feature mapping matrices Q, K and V. The last dimension D of Q, K and V is then split to obtain three four-dimensional tensors Q′, K′ and V′. The obtained Q′ and K′ matrices are multiplied, the result is passed through a softmax transformation to obtain the weight matrix W, and finally the dot product of the resulting weight matrix W and the remaining four-dimensional tensor V′ is taken to obtain the self-attention-weighted output O, defined by the formula:
O = W · V′
The third dimension of the resulting output O is merged to obtain the three-dimensional data O′; the forward self-attention-weighted output and the backward self-attention-weighted output are defined accordingly. Mean pooling is applied separately to the obtained forward and backward self-attention-weighted outputs, the two pooled results are concatenated, and the concatenated output is fed into the softmax layer to obtain the softmax output; the obtained softmax output and the class labels are input together into the cross-entropy loss function, and the whole network structure is adjusted by the back-propagation algorithm.
Beneficial effects: compared with the prior art, the present invention has the following advantages:
The speech emotion recognition system of the invention, based on a directional self-attention mechanism and a bidirectional long short-term memory network, introduces the self-attention mechanism into the bidirectional LSTM network and assigns weights to the temporal frames of the speech through the attention mechanism, without manually deleting useless frames. The invention exploits the ability of self-attention to discover correlations among signals within an utterance, paying more attention to voiced frames while weakening the influence of unvoiced frames that harm classification. In addition, analyzing the speech data from different directions further increases the robustness of the network, so the speech emotion recognition system of the invention adds a direction mechanism to the self-attention mechanism and, by parsing the forward and backward high-level features of the LSTM, solves the problem of degraded classification performance caused by insufficient information. Experiments show that the speech emotion recognition system of the invention achieves satisfactory classification performance.
Brief description of the drawings
Fig. 1 is the overall framework of the invention as applied to the speech emotion recognition field;
Fig. 2 shows the confusion matrices of the compared algorithms on the IEMOCAP improvisation dataset.
Specific embodiment
To describe the content of the present invention more clearly, it is described in detail below with reference to the drawings and specific embodiments. The speech emotion recognition system of the invention based on a directional self-attention mechanism and a bidirectional long short-term memory network (BLSTM-DSA) comprises the following steps:
Step 1: acoustic features are extracted from the original audio signal samples. The acoustic features include prosodic features, namely the zero-crossing rate and energy, and spectrum-related features, namely the mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, chroma features and chroma standard deviation. These acoustic features are extracted with the openSMILE toolbox, yielding the speech training-set data after feature extraction (a minimal sketch of such frame-level feature extraction is given below);
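As an illustration only, the sketch below computes a subset of the listed frame-level features with librosa, used here as a stand-in for the openSMILE toolbox named in the text; the 25 ms window and 10 ms shift follow the framing described in the feature-extraction section, and all function and variable names are this sketch's own.

```python
import librosa
import numpy as np

def frame_features(wav_path, sr=16000, win=0.025, hop=0.010):
    """Compute a subset of the frame-level acoustic features listed in the text."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft, hop_length = int(win * sr), int(hop * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=n_fft, hop_length=hop_length)        # (12, T)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft,
                                             hop_length=hop_length)       # (1, T)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85,
                                               n_fft=n_fft, hop_length=hop_length)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr,
                                                 n_fft=n_fft, hop_length=hop_length)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr,
                                         n_fft=n_fft, hop_length=hop_length)  # (12, T)
    rms = librosa.feature.rms(y=y, frame_length=n_fft,
                              hop_length=hop_length)                      # energy proxy
    feats = np.vstack([mfcc, zcr, rolloff, centroid, chroma, rms])        # (dims, T)
    return feats.T                                                        # (T, dims)
```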
Step 2: the speech training-set data obtained after feature extraction are input into a forward long short-term memory network and a backward long short-term memory network. The input training speech data are defined as {(x_i, y_i)}, i = 1, …, N, where N is the number of training samples; y_i = 0 denotes the angry class, y_i = 1 the happy class, y_i = 2 the neutral class and y_i = 3 the sad class. The formulas of the long short-term memory network are defined as follows:
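The original equations were not preserved in this text; a standard peephole-LSTM formulation, consistent with the parameters explained in the following paragraph (W for input-to-state, U for state-to-state, V for the peephole connections), would be:

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + V_i c_{t-1} + b_i\right)\\
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + V_f c_{t-1} + b_f\right)\\
\tilde{c}_t &= \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + V_o c_t + b_o\right)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```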
Here σ(·) denotes the sigmoid function, whose output interval is (0, 1). Because of this special property of the sigmoid output interval (it resembles a probability), it is often regarded as the form of expression closest to a normal distribution. W_i, W_f, W_c, W_o are the learnable input-to-state weight matrices, U_i, U_f, U_c, U_o are the learnable state-to-state matrices, V_i, V_f, V_o are the learnable matrices of the so-called peephole connections, and h_t^l is the output of the l-th layer at time step t. i_t is the input gate, which indicates how much information of the candidate state is saved at the current moment; f_t is the forget gate, which indicates how much information of the internal state of the previous time step should be forgotten; o_t is the output gate, which controls how much information of the current internal state is output to the external state. To distinguish the forward and backward outputs, the forward feature output of the last layer is denoted H_f and the backward feature output is denoted H_b.
Step 3: the output forward features H_f and backward features H_b are each passed through three one-dimensional convolutions to obtain the convolved outputs, where the forward three-dimensional feature mapping matrices are denoted Q_f, K_f, V_f and the backward three-dimensional feature mapping matrices are denoted Q_b, K_b, V_b. One-dimensional convolution is well suited to analyzing speech data, as it better preserves the temporal structure of the speech, and compared with other operations it has a certain advantage in speed; the three convolutions are performed precisely to facilitate the subsequent self-attention analysis. Then the last dimension of Q, K and V is split into segments, giving three four-dimensional feature tensors denoted Q′, K′, V′, where the size of the third dimension i is D/h. Scaled dot-product attention is applied to the obtained Q′, K′, V′: the Q′ and K′ matrices are multiplied, the result is transformed by a softmax layer to obtain the weight matrix W, and the weighted output is defined by the formula:
O = W · V′ (7)
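The intermediate equations are likewise not preserved; assuming the standard scaled dot-product form named above, the weight matrix used in equation (7) would be computed as:

```latex
W = \mathrm{softmax}\!\left(\frac{Q'\,{K'}^{\top}}{\sqrt{D/h}}\right)
```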
Finally, the third dimension of the resulting output O is merged to obtain the three-dimensional data O′; the output after forward self-attention weighting is denoted O′_f, and the output after backward self-attention weighting is denoted O′_b.
Mean pooling is applied separately to the obtained output after forward self-attention weighting O′_f and the output after backward self-attention weighting O′_b, yielding S_f and S_b, which are then concatenated; this operation is expressed as follows:
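The pooling and concatenation formulas did not survive extraction; with T time steps, a reconstruction consistent with the surrounding description is:

```latex
S_f = \frac{1}{T}\sum_{t=1}^{T} O'_{f,t}, \qquad
S_b = \frac{1}{T}\sum_{t=1}^{T} O'_{b,t}, \qquad
S = \left[\, S_f \, ; \, S_b \,\right]
```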
The resulting concatenated result S is fed into the softmax layer, and the softmax output together with the class labels is then input into the cross-entropy loss function; the whole network structure is adjusted by the back-propagation algorithm. The cross-entropy loss function is defined as follows:
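The loss equation itself is missing from this text; the standard categorical cross-entropy over N samples and H classes, matching the symbols defined in the next sentence, is:

```latex
L = -\frac{1}{N}\sum_{n=1}^{N}\sum_{h=1}^{H} y_{n,h}\,\log p_{n,h}
```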
where H is the number of classes and N is the number of samples.
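For illustration only, the following is a minimal PyTorch sketch of the BLSTM-DSA pipeline described in Steps 2–6. The layer sizes (two BLSTM layers of 256 units, 1-D convolutions with 128 kernels of size 1, 8 attention splits, 68-dimensional frame features, 4 emotion classes) follow the settings reported in this text; everything else, including the class and variable names, is an assumption and not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionalSelfAttention(nn.Module):
    """Self-attention over one direction of the BLSTM output (sketch)."""
    def __init__(self, in_dim=256, conv_dim=128, n_heads=8):
        super().__init__()
        # three 1-D convolutions produce the Q, K, V feature maps
        self.q_conv = nn.Conv1d(in_dim, conv_dim, kernel_size=1)
        self.k_conv = nn.Conv1d(in_dim, conv_dim, kernel_size=1)
        self.v_conv = nn.Conv1d(in_dim, conv_dim, kernel_size=1)
        self.n_heads = n_heads

    def forward(self, x):                       # x: (batch, time, in_dim)
        x = x.transpose(1, 2)                   # (batch, in_dim, time)
        q = self.q_conv(x).transpose(1, 2)      # (batch, time, conv_dim)
        k = self.k_conv(x).transpose(1, 2)
        v = self.v_conv(x).transpose(1, 2)
        b, t, d = q.shape
        h = self.n_heads
        # split the last dimension into h segments -> four-dimensional tensors
        q = q.view(b, t, h, d // h).transpose(1, 2)    # (batch, h, time, d/h)
        k = k.view(b, t, h, d // h).transpose(1, 2)
        v = v.view(b, t, h, d // h).transpose(1, 2)
        # scaled dot-product attention: W = softmax(Q' K'^T / sqrt(D/h))
        w = torch.softmax(q @ k.transpose(-2, -1) / (d // h) ** 0.5, dim=-1)
        o = w @ v                                      # (batch, h, time, d/h)
        # merge the segments back into a 3-D tensor, then mean-pool over time
        o = o.transpose(1, 2).contiguous().view(b, t, d)
        return o.mean(dim=1)                           # (batch, conv_dim)

class BLSTM_DSA(nn.Module):
    """Minimal sketch of the BLSTM-DSA classifier described in Steps 2-6."""
    def __init__(self, feat_dim=68, hidden=256, n_classes=4):
        super().__init__()
        # the layer-normalization layer mentioned in the training section is omitted here
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.fwd_att = DirectionalSelfAttention(hidden)
        self.bwd_att = DirectionalSelfAttention(hidden)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, x):                        # x: (batch, time, feat_dim)
        out, _ = self.blstm(x)                   # (batch, time, 2*hidden)
        h_fwd, h_bwd = out.chunk(2, dim=-1)      # split forward / backward outputs
        s = torch.cat([self.fwd_att(h_fwd), self.bwd_att(h_bwd)], dim=-1)
        return self.fc(s)                        # logits for softmax / cross-entropy

# usage: cross-entropy on the logits and back-propagation, as in Step 6
model = BLSTM_DSA()
x = torch.randn(4, 300, 68)                      # 4 utterances, 300 frames, 68-dim features
loss = F.cross_entropy(model(x), torch.tensor([0, 1, 2, 3]))
```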
Experimental design
Choice of experimental dataset: this work uses the currently most popular emotion database, the Interactive Emotional Dyadic Motion Capture database (IEMOCAP). The IEMOCAP database was recorded by the School of Engineering of the University of Southern California, USA, and contains the audiovisual recordings of 5 sessions in total, i.e., audio, video and motion-capture data, with a total duration of 12 hours. Each session is a dialogue performed by one actor and one actress, and the performances are divided into scripted and improvised parts. According to the statistics, the database consists of 10,039 utterances of different durations, with an average length of 4.5 seconds per utterance, and each utterance was assigned continuous labels and discrete labels by three annotators. The database focuses mainly on five emotions: angry, happy, sad, neutral and frustrated; however, the annotators were not limited to these emotions when labeling. Among the data, utterances whose labels are not considered account for 38%, utterances without a label account for 7%, utterances whose label cannot be determined account for 15%, and utterances whose label can be determined account for 40%. To allow comparison with the research results of other researchers, we only select the angry, happy, neutral and sad utterances whose labels can be determined. Table 1 shows, for the IEMOCAP improvisation dataset, how many utterances of each emotion each speaker contributes.
Table 1. The IEMOCAP improvisation dataset
Feature extraction: in the feature extraction phase, the original signal is converted into acoustic features (including prosodic features, spectrum-related features, voice-quality features and features extracted by deep-learning algorithms). The prosodic features chosen in this method are the zero-crossing rate and energy; the chosen spectrum-related features are the mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, chroma features and chroma standard deviation; and openSMILE is used as the speech feature extraction tool. The speech signal, sampled at 16 kHz, is first framed and windowed; in this method the speech window is a 25 ms Hamming window with a 10 ms frame shift. The 12-dimensional mel-frequency cepstral coefficients are computed by a logarithmic Fourier transform and 26 filters. The spectral roll-off point is set to 0.85, meaning that frequencies below 85% of the overall amplitude level are taken into account; the spectral flux is obtained from the minimum squared distance between the current frame and the previous frame; and the spectral centroid is obtained as the frequency-weighted average. The spectral entropy applies the Shannon entropy by treating the energy distribution as a probability distribution. The spectral spread, i.e., the second-order central moment of the spectrum, is obtained by computing the standard deviation of each band frequency with respect to the spectral centroid. The zero-crossing rate is the rate at which the time-domain waveform crosses the time axis. The energy is obtained as the weighted square of each frame; in addition, the energy entropy applies the Shannon entropy to the energy to determine whether the energy distribution is uniform. The whole set of manually extracted low-level features comprises the mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, zero-crossing rate, fundamental frequency, energy, energy entropy and their first-order differences. Each frame therefore ends up with 68-dimensional features, and to better suit the neural network, mean-variance normalization is applied in this method.
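As a small illustrative sketch (the exact openSMILE configuration is not reproduced here, so the assembly below is an assumption), appending first-order differences and applying per-dimension mean-variance normalization to the frame features could look like:

```python
import numpy as np

def add_deltas_and_normalize(feats):
    """feats: (T, D) frame-level features; returns z-scored (T, 2*D) vectors."""
    deltas = np.diff(feats, axis=0, prepend=feats[:1])   # first-order differences
    stacked = np.hstack([feats, deltas])                 # (T, 2*D)
    mean = stacked.mean(axis=0, keepdims=True)
    std = stacked.std(axis=0, keepdims=True) + 1e-8      # avoid division by zero
    return (stacked - mean) / std                        # mean-variance normalization
```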
Network training method: this method uses a speaker-independent training strategy. On the IEMOCAP improvisation dataset, a leave-one-group-out (LOGO) training strategy is adopted, executed in five rounds in total: in each round the utterances of four sessions serve as the training set, while in the remaining session the utterances recorded by the actress serve as the test set and the utterances recorded by the actor serve as the validation set. Because the happy-emotion samples in the IEMOCAP improvisation dataset are a minority and the emotion classes are imbalanced, the happy samples are resampled on this dataset. Regarding the training configuration, the number of BLSTM layers is set to 2; the linear transformation of the input is initialized with the Glorot uniform distribution, the linear transformation of the recurrent layer state is initialized with the orthogonal initialization method, each LSTM layer has 256 neurons, and the dropout rate is set to 0.3. In the self-attention mechanism, the one-dimensional convolution kernels are initialized with the Glorot uniform distribution, the kernel size is 1 with 128 kernels, the regularization method is L2 regularization with the regularization parameter set to 3×10⁻⁷, and the number of attention splits is 8. Cross-entropy is chosen as the loss function, the batch size is set to 256, the base learning rate is set to 0.0001, and the Nadam optimizer is used for parameter optimization. To train the network better, warm-up and moving-average strategies are adopted. During the first 8 training epochs, the warm-up strategy computes the learning rate so that it increases linearly in the early phase, which allows the network to adapt to the data better. The moving average makes the model more robust on the test set, with the decay rate set to 0.999. To prevent over-fitting, early stopping is also used during training: when the validation loss has not decreased for 10 epochs, network training stops, and the model with the lowest validation loss is finally selected for testing. To accelerate convergence, a layer normalization (Layer Norm) layer is added between the BLSTM and the directional self-attention layer.
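The exact warm-up formula is not reproduced in this text; a common linear warm-up over the first 8 epochs, consistent with the description of a linearly increasing early learning rate, might look like the following (the linear form is an assumption):

```python
def warmup_lr(epoch, base_lr=1e-4, warmup_epochs=8):
    """Linearly ramp the learning rate up to base_lr during the first epochs."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

# e.g. epochs 0..8 give 1.25e-5, 2.5e-5, ..., 1e-4, then stay at 1e-4
```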
Evaluation metrics: this method selects the weighted average recall (Weighted Accuracy, WA) and the unweighted average recall (Unweighted Accuracy, UA) as the evaluation metrics of the model. WA is the proportion of correctly classified samples over the entire test set. To evaluate the influence of class imbalance on the overall model, UA, i.e., the average of the per-class classification accuracies, is also considered. WA and UA can be defined as follows.
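The defining equations did not survive extraction; under the usual convention in speech emotion recognition (WA is the overall accuracy over the test set and UA is the unweighted mean of the per-class recalls), they can be computed, for instance, with scikit-learn:

```python
from sklearn.metrics import accuracy_score, recall_score

def wa_ua(y_true, y_pred):
    """WA: overall accuracy; UA: unweighted (macro) average of per-class recalls."""
    wa = accuracy_score(y_true, y_pred)
    ua = recall_score(y_true, y_pred, average="macro")
    return wa, ua
```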
Compared algorithms: the comparison algorithms used in this method are CNN, LSTM and BLSTM. The CNN consists of two convolutional layers: the first has 2×2 kernels with stride 1 and 10 kernels, and the second has 2×2 kernels with stride 1 and 20 kernels; a max-pooling layer of size 2×2 with stride 2 follows each convolutional layer, and finally two fully connected layers with 128 neurons each are added, with a batch normalization layer inserted between the fully connected layers. The LSTM in this experiment is set to two layers with 256 neurons per layer and a dropout rate of 0.3. The experimental settings of the BLSTM are identical to those of the LSTM, except that a backward LSTM is added alongside each forward LSTM layer; all models uniformly use the Nadam optimizer.
Experimental results
Table 2 shows the experimental results of each algorithm on the IEMOCAP improvisation dataset. The CNN does not perform well on the IEMOCAP improvisation dataset and gives the lowest results on both WA and UA. After the direction mechanism is added, the BLSTM shows better generalization ability than the LSTM. The BLSTM-DSA, which incorporates both the self-attention mechanism and the direction mechanism, achieves the best results on both WA and UA.
Table 2. Results of each algorithm on the IEMOCAP improvisation dataset
Model WA (%) UA (%)
CNN 57.75 45.08
LSTM 61.89 50.52
BLSTM 62.01 52.48
BLSTM-DSA 62.16 55.21
Fig. 2 illustrates the confusion matrices of the compared algorithms on the IEMOCAP improvisation dataset.
From the confusion matrices in Fig. 2 it can be seen that, for the recognition rate of the angry emotion, BLSTM-DSA is the highest and CNN the lowest. For the recognition rate of the happy emotion, BLSTM-DSA is also the highest and LSTM the lowest. For the neutral emotion, every algorithm exceeds 70% and the differences among the algorithms are small. Similarly to the neutral emotion, the recognition rates of the sad emotion do not differ much among the algorithms. In summary, BLSTM-DSA achieves satisfactory results on the angry, neutral and sad recognition rates. Furthermore, since the sample sizes of the sad and neutral emotions are relatively large and both emotions have distinctive characteristics, both emotions reach relatively high recognition rates with all of the algorithms.
In conclusion it is of the invention based on direction from the speech emotion recognition system of the two-way length of attention mechanism network in short-term System finds the correlation inside signal by being added after two-way length in short-term network from attention mechanism, and then controls each The significance level of temporal frame.It can reduce the influence of the temporal frame unfavorable to classification performance from attention mechanism, and allow network It focuses more on and biggish temporal frame is helped to classification performance, to improve classification essence of the classifier on speech emotional data set Degree.In addition, the present invention also provides reference for other relevant issues in same domain, expansion extension can be carried out on this basis, With very wide application prospect.

Claims (6)

1. A speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network, characterized by comprising the following steps:
1) extracting acoustic features from the original audio signal samples to obtain the speech training-set data after feature extraction;
2) inputting the speech training-set data obtained after feature extraction into a forward long short-term memory network and a backward long short-term memory network, and outputting the forward features and the backward features;
3) passing the output forward features and backward features each through three one-dimensional convolutions to obtain three forward three-dimensional feature mapping matrices and three backward three-dimensional feature mapping matrices;
4) applying a self-attention operation to the three forward three-dimensional feature mapping matrices obtained in step 3) to obtain the output after forward self-attention weighting, and applying a self-attention operation to the three backward three-dimensional feature mapping matrices obtained in step 3) to obtain the output after backward self-attention weighting;
5) applying mean pooling separately to the obtained output after forward self-attention weighting and the output after backward self-attention weighting, and concatenating the results to output the concatenated result;
6) inputting the output concatenated result into the softmax layer to obtain the softmax layer output, inputting the obtained softmax layer output together with the class labels into the cross-entropy loss function, and adjusting the whole network structure by the back-propagation algorithm.
2. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 1), the original audio signal samples come from the international speech emotion database IEMOCAP; the acoustic features of the original audio signal samples are extracted by the openSMILE toolbox; and the acoustic features of the original audio signal samples include prosodic features, namely the zero-crossing rate and energy, and spectrum-related features, namely the mel-frequency cepstral coefficients, spectral roll-off point, spectral flux, spectral centroid, spectral entropy, spectral spread, chroma features and chroma standard deviation.
3. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 2), the speech training-set data after feature extraction are {(x_i, y_i)}, where N denotes the number of training samples and y_i denotes the emotion class; the data are input separately into the forward long short-term memory network and the backward long short-term memory network, and the output features of the two directions, namely the forward features and the backward features, are obtained.
4. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 4), the self-attention operation is defined as a similarity measure of the features with themselves, and the weight of each moment is obtained from this similarity measure. First, the last dimension D of the three obtained three-dimensional feature mapping matrices Q, K, V is split to obtain three four-dimensional matrices Q′, K′, V′; then the obtained Q′ and K′ matrices are multiplied, and the result is passed through a softmax transformation to obtain the weight matrix W, the resulting weight matrix being the matrix of correlations between each moment and all other moments; finally, the dot product of the obtained weight matrix W and the remaining four-dimensional matrix V′ is taken to obtain the self-attention-weighted output O, defined by the formula:
O = W · V′
The third dimension of the resulting output O is merged to obtain the three-dimensional data O′; the forward self-attention-weighted output and the backward self-attention-weighted output are defined accordingly.
5. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 5), mean pooling is applied separately to the obtained output after forward self-attention weighting and the output after backward self-attention weighting, yielding two two-dimensional matrices, and the concatenated result is obtained by splicing the forward and backward pooled outputs so as to better retain the original features.
6. The speech emotion recognition system based on a directional self-attention mechanism and a bidirectional long short-term memory network according to claim 1, characterized in that: in step 6), the softmax layer output is the probability P of each class, obtained by inputting the obtained concatenated result S into a softmax layer with 4 neurons; the obtained probability P of each class and the class label y are input into the cross-entropy loss function, which is minimized; finally, the weights of the network are adjusted by the back-propagation algorithm.
CN201910555688.2A 2019-06-25 2019-06-25 Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network Active CN110400579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910555688.2A CN110400579B (en) 2019-06-25 2019-06-25 Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910555688.2A CN110400579B (en) 2019-06-25 2019-06-25 Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network

Publications (2)

Publication Number Publication Date
CN110400579A true CN110400579A (en) 2019-11-01
CN110400579B CN110400579B (en) 2022-01-11

Family

ID=68322649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910555688.2A Active CN110400579B (en) 2019-06-25 2019-06-25 Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network

Country Status (1)

Country Link
CN (1) CN110400579B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111048082A (en) * 2019-12-12 2020-04-21 中国电子科技集团公司第二十八研究所 Improved end-to-end speech recognition method
CN111259761A (en) * 2020-01-13 2020-06-09 东南大学 Electroencephalogram emotion recognition method and device based on migratable attention neural network
CN111357051A (en) * 2019-12-24 2020-06-30 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
CN111429948A (en) * 2020-03-27 2020-07-17 南京工业大学 Voice emotion recognition model and method based on attention convolution neural network
CN111461173A (en) * 2020-03-06 2020-07-28 华南理工大学 Attention mechanism-based multi-speaker clustering system and method
CN111477221A (en) * 2020-05-28 2020-07-31 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111508500A (en) * 2020-04-17 2020-08-07 五邑大学 Voice emotion recognition method, system, device and storage medium
CN111524535A (en) * 2020-04-30 2020-08-11 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
CN111613240A (en) * 2020-05-22 2020-09-01 杭州电子科技大学 Camouflage voice detection method based on attention mechanism and Bi-LSTM
CN111783469A (en) * 2020-06-29 2020-10-16 中国计量大学 Method for extracting text sentence characteristics
CN111798445A (en) * 2020-07-17 2020-10-20 北京大学口腔医院 Tooth image caries identification method and system based on convolutional neural network
CN112447186A (en) * 2020-10-16 2021-03-05 华东理工大学 Speech emotion recognition algorithm weighted according to class characteristics
CN112581979A (en) * 2020-12-10 2021-03-30 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113284515A (en) * 2021-04-19 2021-08-20 大连海事大学 Voice emotion recognition method based on physical waves and circulating network
CN113317791A (en) * 2021-05-28 2021-08-31 温州康宁医院股份有限公司 Method and device for determining severity of depression based on audio frequency of testee
CN113469470A (en) * 2021-09-02 2021-10-01 国网浙江省电力有限公司杭州供电公司 Energy consumption data and carbon emission correlation analysis method based on electric brain center
CN113571050A (en) * 2021-07-28 2021-10-29 复旦大学 Voice depression state identification method based on Attention and Bi-LSTM
CN111259761B (en) * 2020-01-13 2024-06-07 东南大学 Electroencephalogram emotion recognition method and device based on movable attention neural network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN109243494A (en) * 2018-10-30 2019-01-18 南京工程学院 Childhood emotional recognition methods based on the long memory network in short-term of multiple attention mechanism
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109522548A (en) * 2018-10-26 2019-03-26 天津大学 A kind of text emotion analysis method based on two-way interactive neural network
CN109710761A (en) * 2018-12-21 2019-05-03 中国标准化研究院 The sentiment analysis method of two-way LSTM model based on attention enhancing
CN109740148A (en) * 2018-12-16 2019-05-10 北京工业大学 A kind of text emotion analysis method of BiLSTM combination Attention mechanism
CN109817246A (en) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109522548A (en) * 2018-10-26 2019-03-26 天津大学 A kind of text emotion analysis method based on two-way interactive neural network
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN109243494A (en) * 2018-10-30 2019-01-18 南京工程学院 Childhood emotional recognition methods based on the long memory network in short-term of multiple attention mechanism
CN109740148A (en) * 2018-12-16 2019-05-10 北京工业大学 A kind of text emotion analysis method of BiLSTM combination Attention mechanism
CN109710761A (en) * 2018-12-21 2019-05-03 中国标准化研究院 The sentiment analysis method of two-way LSTM model based on attention enhancing
CN109817246A (en) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOFENG CAI ETC: "Multi-view and Attention-Based BI-LSTM for Weibo", 《ADVANCES IN INTELLIGENT SYSTEMS RESEARCH,VOLUME 147,INTERNATIONAL CONFERENCE ON NETWORK, COMMUNICATION, COMPUTER ENGINEERING (NCCE 2018)》 *
XING JILIANG: "Research on relation classification with a Bi-LSTM recurrent neural network combined with an attention mechanism", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111048082B (en) * 2019-12-12 2022-09-06 中国电子科技集团公司第二十八研究所 Improved end-to-end speech recognition method
CN111048082A (en) * 2019-12-12 2020-04-21 中国电子科技集团公司第二十八研究所 Improved end-to-end speech recognition method
CN111357051B (en) * 2019-12-24 2024-02-02 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
CN111357051A (en) * 2019-12-24 2020-06-30 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
CN111259761A (en) * 2020-01-13 2020-06-09 东南大学 Electroencephalogram emotion recognition method and device based on migratable attention neural network
CN111259761B (en) * 2020-01-13 2024-06-07 东南大学 Electroencephalogram emotion recognition method and device based on movable attention neural network
CN111461173A (en) * 2020-03-06 2020-07-28 华南理工大学 Attention mechanism-based multi-speaker clustering system and method
CN111461173B (en) * 2020-03-06 2023-06-20 华南理工大学 Multi-speaker clustering system and method based on attention mechanism
CN111429948A (en) * 2020-03-27 2020-07-17 南京工业大学 Voice emotion recognition model and method based on attention convolution neural network
CN111508500B (en) * 2020-04-17 2023-08-29 五邑大学 Voice emotion recognition method, system, device and storage medium
CN111508500A (en) * 2020-04-17 2020-08-07 五邑大学 Voice emotion recognition method, system, device and storage medium
CN111524535A (en) * 2020-04-30 2020-08-11 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
CN111524535B (en) * 2020-04-30 2022-06-21 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
CN111613240A (en) * 2020-05-22 2020-09-01 杭州电子科技大学 Camouflage voice detection method based on attention mechanism and Bi-LSTM
CN111613240B (en) * 2020-05-22 2023-06-27 杭州电子科技大学 Camouflage voice detection method based on attention mechanism and Bi-LSTM
CN111477221B (en) * 2020-05-28 2022-12-30 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111477221A (en) * 2020-05-28 2020-07-31 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111783469A (en) * 2020-06-29 2020-10-16 中国计量大学 Method for extracting text sentence characteristics
CN111798445B (en) * 2020-07-17 2023-10-31 北京大学口腔医院 Tooth image caries identification method and system based on convolutional neural network
CN111798445A (en) * 2020-07-17 2020-10-20 北京大学口腔医院 Tooth image caries identification method and system based on convolutional neural network
CN112447186A (en) * 2020-10-16 2021-03-05 华东理工大学 Speech emotion recognition algorithm weighted according to class characteristics
CN112581979B (en) * 2020-12-10 2022-07-12 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN112581979A (en) * 2020-12-10 2021-03-30 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN113284515B (en) * 2021-04-19 2023-05-02 大连海事大学 Speech emotion recognition method based on physical wave and circulation network
CN113284515A (en) * 2021-04-19 2021-08-20 大连海事大学 Voice emotion recognition method based on physical waves and circulating network
CN113317791B (en) * 2021-05-28 2023-03-14 温州康宁医院股份有限公司 Method and device for determining severity of depression based on audio frequency of testee
CN113317791A (en) * 2021-05-28 2021-08-31 温州康宁医院股份有限公司 Method and device for determining severity of depression based on audio frequency of testee
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113571050A (en) * 2021-07-28 2021-10-29 复旦大学 Voice depression state identification method based on Attention and Bi-LSTM
CN113469470A (en) * 2021-09-02 2021-10-01 国网浙江省电力有限公司杭州供电公司 Energy consumption data and carbon emission correlation analysis method based on electric brain center

Also Published As

Publication number Publication date
CN110400579B (en) 2022-01-11

Similar Documents

Publication Publication Date Title
CN110400579A (en) Based on direction from the speech emotion recognition of attention mechanism and two-way length network in short-term
Chatziagapi et al. Data Augmentation Using GANs for Speech Emotion Recognition.
Er A novel approach for classification of speech emotions based on deep and acoustic features
Hu et al. Temporal multimodal learning in audiovisual speech recognition
Sun End-to-end speech emotion recognition with gender information
Mesgarani et al. Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations
Dennis Sound event recognition in unstructured environments using spectrogram image processing
CN103544963B (en) A kind of speech-emotion recognition method based on core semi-supervised discrimination and analysis
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
Li et al. Exploiting the potentialities of features for speech emotion recognition
Ghai et al. Emotion recognition on speech signals using machine learning
Elshaer et al. Transfer learning from sound representations for anger detection in speech
Huang et al. Speech emotion recognition using convolutional neural network with audio word-based embedding
Gao et al. ToneNet: A CNN Model of Tone Classification of Mandarin Chinese.
CN110348482B (en) Speech emotion recognition system based on depth model integrated architecture
CN111968652A (en) Speaker identification method based on 3DCNN-LSTM and storage medium
Kuang et al. Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks
Kamaruddin et al. Features extraction for speech emotion
CN112329819A (en) Underwater target identification method based on multi-network fusion
Xue et al. Learning speech emotion features by joint disentangling-discrimination
Xue et al. Driver’s speech emotion recognition for smart cockpit based on a self-attention deep learning framework
Rammohan et al. Speech signal-based modelling of basic emotions to analyse compound emotion: Anxiety
Muralikrishna et al. Noise-robust spoken language identification using language relevance factor based embedding
CN115312080A (en) Voice emotion recognition model and method based on complementary acoustic characterization
Segarceanu et al. Environmental acoustics modelling techniques for forest monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant