CN113450830A - Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms - Google Patents
- Publication number: CN113450830A
- Application number: CN202110695847.6A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

- G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/63 — specially adapted for comparison or discrimination, for estimating an emotional state
- G10L25/21 — characterised by the type of extracted parameters, the extracted parameters being power information
- G10L25/24 — characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30 — characterised by the analysis technique, using neural networks
Abstract
The invention discloses a speech emotion recognition method based on a convolutional recurrent neural network with multiple attention mechanisms, comprising the following steps. Step 1: extract spectrogram features and frame-level features. Step 2: pass the spectrogram features into a CNN module to learn the time-frequency information they contain. Step 3: a multi-head self-attention layer acts on the CNN module to calculate the weights of different frames under global features of different scales and to fuse features from different depths of the CNN. Step 4: a multi-dimensional attention layer acts on the frame-level features fed to the LSTM, jointly considering the relationship between local and global features. Step 5: the processed frame-level features are passed into an LSTM model to acquire the temporal information in the features. Step 6: a fusion layer summarizes the outputs of the different modules to enhance model performance. Step 7: a Softmax classifier classifies the different emotions. The invention combines deep learning networks, with the modules arranged in parallel to process the features simultaneously, effectively improving speech emotion recognition performance.
Description
Technical Field
The invention relates to the technical field of speech emotion recognition, and in particular to a speech emotion recognition method using a convolutional recurrent neural network with multiple attention mechanisms.
Background
The focus of paralinguistic research is to mine the latent information in speech that characterizes the state of the speaker or the voice. As an emotion task in paralinguistics, speech emotion recognition learns the category of emotion from speech, which can support intelligent human-computer interaction. Recent deep learning research has provided speech emotion recognition with deep models that better describe the emotional state of speech. One of the most important deep learning tools is the neural network, which is typically used to learn discriminative feature representations from low-level acoustic features. Furthermore, these emotion-related tasks tend to center on convolutional neural networks (CNNs) and recurrent neural networks based on long short-term memory (LSTM) to mine local information in speech: the CNN is often used to learn time-frequency information from spectral features, while the LSTM is mainly used to extract the sequential correlation of the speech time series.
Although the neural network models described above have been successfully applied to speech emotion recognition, three problems remain. First, most existing neural network methods segment the complete utterance into fixed-length fragments to satisfy the model's fixed-size input requirement; in this process, the incomplete temporal information inevitably causes a loss of emotional detail. Second, most CNN-based methods take only the last convolutional layer as output and ignore the hidden convolutional layers that contain high-resolution, low-level information. Third, existing speech emotion studies based on LSTM attention layers attach the attention layer to the back end of the LSTM and weight the sequence in the high-level representation, thereby excluding the temporal correlation of the frame-level features in the utterance.
Disclosure of Invention
The technical problem is as follows: to overcome problems of the existing speech emotion recognition technology, the invention discloses a speech emotion recognition method using a convolutional recurrent neural network with multiple attention mechanisms (CRNN-MA).
The technical scheme is as follows: a speech emotion recognition method using a convolutional recurrent neural network with multiple attention mechanisms comprises the following steps:
Step A: spectrogram features and frame-level features are extracted separately as the inputs of the different modules of the model. The features are then fed respectively into a convolutional neural network (CNN) and a long short-term memory recurrent neural network (LSTM), with a parallel structure acquiring time-frequency information and sequence information simultaneously. Step B: the spectrogram features are passed to the CNN to learn the time-frequency information they contain. Step C: a multi-head self-attention layer (Multiple Self-Attention) acts on the CNN module to calculate the weights of different frames under global features of different scales and to fuse features from different depths of the CNN. Step D: a multi-dimensional attention layer (Multi-Dimensional Attention) acts on the frame-level features fed to the LSTM, jointly considering the relationship between local and global features. Step E: the processed frame-level features are passed into the LSTM model to acquire the temporal information in the features. Step F: a fusion layer summarizes the outputs of the different modules to enhance model performance. Step G: a Softmax classifier classifies the different emotions.
Preferably, the specific steps of extracting the spectrogram features in step A include: pre-emphasis, framing and fast Fourier transform are applied to the speech, and the energy spectrum is then passed through a set of Mel-scale triangular filter banks to obtain the spectrogram features; the first-order and second-order differences of each spectrogram feature segment are then computed. The specific steps of extracting the frame-level features in step A include: 95-dimensional low-level descriptors are extracted from each speech frame, including mel-frequency cepstral coefficients and their first-order differences, the mel spectrum and its first-order differences, spectral features, spectral flatness, chroma, zero-crossing rate and root mean square energy.
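The extraction pipeline above (pre-emphasis, framing, FFT, then first- and second-order differences) can be sketched in NumPy. This is an illustrative sketch only: the frame length, hop size and FFT size are assumed values for 16 kHz audio, and the Mel-scale filter bank stage is omitted for brevity.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a waveform into overlapping frames (25 ms / 10 ms at 16 kHz, assumed)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def log_power_spectrogram(x, frame_len=400, hop=160, n_fft=512):
    # Pre-emphasis boosts high frequencies before framing
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # energy spectrum per frame
    return np.log(power + 1e-10)                      # (Mel filter bank omitted here)

def deltas(feat):
    """First-order difference along time; second order = deltas(deltas(feat))."""
    return np.diff(feat, axis=0, prepend=feat[:1])

rng = np.random.default_rng(0)
spec = log_power_spectrogram(rng.standard_normal(16000))  # 1 s of toy audio
static, d1, d2 = spec, deltas(spec), deltas(deltas(spec))
print(static.shape, d1.shape, d2.shape)
```

The static features and the two difference features then form the three channels of the CNN input described in step B.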
Preferably, the specific steps of step B include:
Step B: the spectrogram features from step A, together with their first-order and second-order differences, form three-channel spectrogram features that are passed into the CNN module for learning. For the CNN module, AlexNet pre-trained on the ImageNet dataset is used as the initial model; the model has five convolutional layers and three pooling layers in total, and the fully-connected layers of the network are removed to better match the multi-head self-attention layer. The input size is 227 × 227 × 3; the first convolutional layer contains 96 convolution kernels of size 11 × 11, and the second contains 256 kernels of size 5 × 5; the last three convolutional layers contain 384, 384 and 256 kernels respectively, each of size 3 × 3.
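The kernel counts above follow the standard AlexNet geometry. As a quick sanity check of the spatial sizes (the strides and paddings used below are the standard AlexNet values, which the text does not spell out and are therefore an assumption):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output spatial size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

size = 227
size = conv_out(size, 11, stride=4)   # conv1: 96 kernels, 11x11 -> 55
size = conv_out(size, 3, stride=2)    # pool1 -> 27
size = conv_out(size, 5, pad=2)       # conv2: 256 kernels, 5x5 -> 27
size = conv_out(size, 3, stride=2)    # pool2 -> 13
size = conv_out(size, 3, pad=1)       # conv3: 384 kernels -> 13
size = conv_out(size, 3, pad=1)       # conv4: 384 kernels -> 13
size = conv_out(size, 3, pad=1)       # conv5: 256 kernels -> 13
size = conv_out(size, 3, stride=2)    # pool3 -> 6
print(size)  # 6
```

Under these assumptions the third pooling layer yields 6 × 6 × 256 feature maps, and the three pooling layers at sizes 27, 13 and 6 are what feed the multi-head self-attention layer in step C.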
Preferably, the specific steps of step C include:

Step C-1: the three pooling layers of the CNN in step B are set as the inputs of the self-attention layer; in the self-attention layer, the dimensionality of the input is first reduced:

F_n = σ_R(f_n * X_n)

where σ_R(·) denotes the ReLU activation function, * is the convolution operation, X_n is the input, and X_1, X_2, X_3 denote the first, second and third pooling layers of the CNN respectively;

Step C-2: an attention unit is added to calculate the interdependence of all frames and obtain the weights of the different frames:

α_n = Softmax(V_n · U_n)

where V_n = σ_S(F_n · W_n + b_n), T_0 is the time dimension, W and U are weights, b is the bias, σ_S denotes the Sigmoid activation function, and Softmax denotes the Softmax operation;

Step C-3: a 1 × 1 convolution G_n with 1024 convolution kernels is applied, where N_0 denotes the feature dimension of the input features; an N_0 × 1 max-pooling operation is then applied to G_n;

Step C-4: the output of the multi-head self-attention layer combines all the self-attention heads, where O_n = M_n · α_n ∈ R^(1024×1).
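A minimal NumPy sketch of one self-attention head over a pooled feature map may clarify steps C-1 and C-2. All dimensions and weights here are illustrative stand-ins (random, untrained), the 1 × 1 convolution plus ReLU of C-1 is approximated by a plain ReLU, and the G_n convolution and max-pooling stage of C-3 is omitted:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(X, rng):
    """One head: frame scores -> softmax weights -> weighted sum over frames.
    X has shape (T, N): T frames, N features per frame."""
    T, N = X.shape
    W = rng.standard_normal((N, N)) * 0.1     # hypothetical learned weights W_n
    U = rng.standard_normal((N, 1)) * 0.1     # hypothetical learned weights U_n
    b = np.zeros(N)
    F = np.maximum(X, 0.0)                    # stand-in for the 1x1-conv + ReLU step
    V = 1.0 / (1.0 + np.exp(-(F @ W + b)))    # V_n = sigmoid(F_n W_n + b_n)
    alpha = softmax(V @ U, axis=0)            # (T, 1) frame weights, sum to 1
    return (X * alpha).sum(axis=0), alpha     # attention-weighted frame fusion

rng = np.random.default_rng(1)
heads = []
for T, N in [(27, 32), (13, 32), (6, 32)]:    # stand-ins for the three pooling layers
    out, alpha = self_attention_head(rng.standard_normal((T, N)), rng)
    assert abs(alpha.sum() - 1.0) < 1e-9      # softmax weights over frames sum to 1
    heads.append(out)
fused = np.concatenate(heads)                  # the multi-head output combines all heads
print(fused.shape)
```

Each head produces one weighted summary of its pooling layer, and concatenating the heads mirrors how the layer fuses features from different CNN depths.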
Preferably, the specific steps of step D include:

Step D-1: in the multi-dimensional attention layer, single-channel 1 × 1 convolutions f_T and f_N are first set; the outputs of the frame dimension and the feature dimension are expressed as:

F_T = σ_R(f_T * X_T) ∈ R^(T×N)

F_N = σ_R(f_N * X_N) ∈ R^(N×T)

where X_T and X_N = (X_T)^T denote the inputs of the two dimensions of the multi-dimensional attention layer, and T and N denote the frame dimension and the feature dimension respectively;

Step D-2: an attention unit scores the frame dimension and the feature dimension, giving the weights of the two attention branches as:

α_T = Softmax(σ_R(F_T · W_T + b_T) · U_T) ∈ R^(T×1)

α_N = Softmax(σ_R(F_N · W_N + b_N) · U_N) ∈ R^(N×1)

where W_T, U_T, W_N, U_N denote weight matrices and b_T, b_N denote biases;

Step D-3: the output of the frame dimension O_T and the output of the feature dimension O_N are obtained, where e_T ∈ R^(T×1), e_N ∈ R^(N×1), and ⊙ denotes the Hadamard product; O_N is then transposed and fused with O_T as the input of the LSTM, O^(LSTM) ∈ R^(1024×1).
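The two attention branches of step D can be sketched in NumPy as follows. The weights are random stand-ins, the 1 × 1 convolutions of D-1 are omitted, and the final fusion rule for O_T and O_N is not given explicitly in the text, so simple addition is used here as one plausible choice:

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dim_attention(F, W, b, U):
    """alpha = Softmax(ReLU(F W + b) U): one weight per row of F."""
    return softmax(np.maximum(F @ W + b, 0.0) @ U, axis=0)

rng = np.random.default_rng(2)
T, N = 50, 95                      # frames x 95 low-level descriptors
X_T = rng.standard_normal((T, N))  # frame-dimension view
X_N = X_T.T                        # feature-dimension view is the transpose

alpha_T = dim_attention(X_T, rng.standard_normal((N, N)) * 0.1, 0.0,
                        rng.standard_normal((N, 1)) * 0.1)   # (T, 1) frame weights
alpha_N = dim_attention(X_N, rng.standard_normal((T, T)) * 0.1, 0.0,
                        rng.standard_normal((T, 1)) * 0.1)   # (N, 1) feature weights

# Hadamard-weight each view, transpose O_N back, and fuse as the LSTM input
O_T = X_T * alpha_T                # (T, N)
O_N = (X_N * alpha_N).T            # back to (T, N)
lstm_input = O_T + O_N             # assumed fusion rule (addition)
print(lstm_input.shape)
```

Each branch attends over a different axis of the same matrix, which is how the layer jointly weights frames and features before the LSTM.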
Preferably, the specific steps of step E include:

Step E-1: at each time step, the inputs of the LSTM are the current input value x_t, the previous output value h_(t-1) and the previous cell state c_(t-1), and the outputs are the current h_t and the current state c_t; the forget gate f_t determines which information the cell discards:

f_t = σ(W_f · [h_(t-1), x_t] + b_f)

where σ denotes the Sigmoid activation function and W and b are the weight and bias respectively; the output of f_t lies between 0 and 1, where 1 means the information is fully retained and 0 means it is fully discarded;

Step E-2: the cell decides which values to update:

i_t = σ(W_i · [h_(t-1), x_t] + b_i)

where the Sigmoid decides which values are updated and tanh is used to create the new candidate values;

Step E-3: the cell state is updated and the final state is output:

h_t = o_t * tanh(C_t).
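One LSTM time step following the gate equations of steps E-1 to E-3 can be sketched as below. The text writes out only the forget gate, input gate and final output explicitly; the candidate-state and output-gate equations used here are the standard LSTM forms and are assumed, as are all dimensions and (random) weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: gates act on the concatenation [h_(t-1), x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(p["Wf"] @ z + p["bf"])      # forget gate (step E-1)
    i_t = sigmoid(p["Wi"] @ z + p["bi"])      # input gate (step E-2)
    c_hat = np.tanh(p["Wc"] @ z + p["bc"])    # candidate values (standard form, assumed)
    c_t = f_t * c_prev + i_t * c_hat          # cell-state update
    o_t = sigmoid(p["Wo"] @ z + p["bo"])      # output gate (standard form, assumed)
    h_t = o_t * np.tanh(c_t)                  # h_t = o_t * tanh(C_t) (step E-3)
    return h_t, c_t

rng = np.random.default_rng(3)
n_in, n_hid = 95, 16                          # illustrative sizes
p = {k: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1
     for k in ("Wf", "Wi", "Wc", "Wo")}
p.update({k: np.zeros(n_hid) for k in ("bf", "bi", "bc", "bo")})

h = c = np.zeros(n_hid)
for t in range(50):                           # run over 50 frames of features
    h, c = lstm_step(rng.standard_normal(n_in), h, c, p)
print(h.shape)
```

Since o_t is in (0, 1) and tanh(c_t) is in (-1, 1), every component of h_t stays strictly inside (-1, 1), which the test below checks.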
Preferably, the specific steps of step F include:

Step F-1: the output values of the two different modules are normalized separately to accelerate the convergence of training:

O^(CNN-BN) = σ_S(BN(O^(CNN)))

O^(LSTM-BN) = σ_R(BN(O^(LSTM)))

where BN denotes batch normalization;

Step F-2: the output of the CRNN-MA model is calculated as:

O^(CRNN-MA) = σ_S([(O^(CNN-BN))^T, (O^(LSTM-BN))^T] · W) · V

where W, V ∈ R^(2048×2048) denote the weights of the fusion layer.
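The fusion of steps F-1 and F-2 can be sketched with NumPy. The batch size, the random weights W and V, and the inference-style batch normalization (no learned scale/shift) are all illustrative assumptions; only the 1024-dimensional module outputs and the 2048 × 2048 fusion weights follow the text:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Per-feature normalization over the batch (inference-style sketch)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
B = 8                                   # assumed batch of 8 utterances
O_cnn = rng.standard_normal((B, 1024))  # stand-in for O^(CNN)
O_lstm = rng.standard_normal((B, 1024)) # stand-in for O^(LSTM)

O_cnn_bn = sigmoid(batch_norm(O_cnn))           # O^(CNN-BN) = sigma_S(BN(O^(CNN)))
O_lstm_bn = np.maximum(batch_norm(O_lstm), 0.0) # O^(LSTM-BN) = sigma_R(BN(O^(LSTM)))

concat = np.concatenate([O_cnn_bn, O_lstm_bn], axis=1)   # (B, 2048)
W = rng.standard_normal((2048, 2048)) * 0.01
V = rng.standard_normal((2048, 2048)) * 0.01
O_crnn_ma = sigmoid(concat @ W) @ V                      # O^(CRNN-MA)
print(O_crnn_ma.shape)
```

The sigmoid on the CNN branch and the ReLU on the LSTM branch match the σ_S and σ_R of step F-1, and the concatenated 2048-vector is what the two fusion weight matrices act on.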
Preferably, the specific steps of step G include: the features are normalized using Softmax, where f(V_i) is the probability corresponding to feature value V_i and the sum of all probabilities equals 1; if one V_i is larger than all other values, its output probability is the highest, its mapping component approaches 1, and the mapping components of the other feature values approach 0.
Benefits: the speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms combines deep learning networks, and its modules adopt a parallel structure to process the features simultaneously, effectively improving speech emotion recognition performance.
Drawings
FIG. 1 is a diagram of the structure of the CRNN-MA model;
FIG. 2 is a diagram of a multi-headed self-attention layer structure;
FIG. 3 is a diagram of a multi-dimensional attention layer model architecture.
Detailed Description
The architecture of the proposed CRNN-MA model is shown in fig. 1. Spectrogram features and frame-level features are first input into the model. The relationship between local and global features is obtained using three pooling layers in the CNN as inputs to a multi-headed self-attention module, which is shown in fig. 2. The multidimensional attention layer computes the weights of the different frames and features as shown in fig. 3. The fusion layer is then used to fuse the different outputs, while the softmax classifier outputs the results.
To verify the performance of the proposed model, experiments were performed on the ABC and eNTERFACE emotion databases. The ABC database is a German database with 6 different emotions recorded by 4 male and 4 female speakers, containing 430 speech samples. The eNTERFACE emotion database includes 43 subjects from 14 different countries recorded in English, for a total of 1283 speech samples.
For the ABC database, the LOSO (leave-one-speaker-out) cross-validation strategy was adopted: each time, one person's speech samples are selected from the dataset as the test set and the remaining samples are used as the training set; each person's speech takes a turn as the test set, and the average of the runs is finally computed. For the eNTERFACE database, the data was randomly divided into 8 speaker-independent groups, seven containing 5 speakers each and the remaining one containing 8 speakers, and eight-fold cross-validation was performed.
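The LOSO protocol described above amounts to one fold per speaker. A small sketch (the speaker IDs and per-speaker sample counts below are illustrative, not the actual ABC corpus layout):

```python
def loso_splits(speaker_ids):
    """Leave-one-speaker-out: each speaker's samples form the test set once."""
    speakers = sorted(set(speaker_ids))
    for held_out in speakers:
        test = [i for i, s in enumerate(speaker_ids) if s == held_out]
        train = [i for i, s in enumerate(speaker_ids) if s != held_out]
        yield held_out, train, test

# toy corpus: 8 speakers as in ABC, 5 samples each (counts are illustrative)
ids = [f"spk{k}" for k in range(8) for _ in range(5)]
folds = list(loso_splits(ids))
print(len(folds), len(folds[0][1]), len(folds[0][2]))
```

Because train and test never share a speaker, every fold is speaker-independent, matching the evaluation described for both databases.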
Due to the imbalance of the emotion classes, weighted accuracy (WA) and unweighted accuracy (UA) are used to evaluate the experimental results. The weighted accuracy is the ratio of the number of correctly classified samples to the total number of samples. The unweighted accuracy is the sum of the per-class recalls divided by the number of classes, regardless of the number of samples per class.
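The two metrics can be computed as follows; UA here is taken as the mean of per-class recalls, the usual definition in speech emotion work, and the toy labels are purely illustrative:

```python
import numpy as np

def wa_ua(y_true, y_pred, n_classes):
    """WA: overall accuracy. UA: mean of per-class recalls (class-balanced)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in range(n_classes) if np.any(y_true == c)]
    return wa, float(np.mean(recalls))

# imbalanced toy case: class 0 has 8 samples, class 1 has 2
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]          # one of the two class-1 samples misclassified
wa, ua = wa_ua(y_true, y_pred, 2)
print(wa, ua)   # 0.9 and 0.75
```

On this imbalanced example WA stays high (0.9) while UA drops to 0.75, showing why UA is the stricter metric when classes are unbalanced.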
Table 1 shows the effect of the multi-head attention layer on model performance.

TABLE 1 Performance comparison of CNN with and without the multi-head attention layer (%, 'WA/UA')

Data set | ABC | eNTERFACE
---|---|---
CNN | 49.2/41.7 | 70.9/71.0
CNN + multi-head attention layer | 60.9/53.9 | 71.0/71.1
As can be seen from Table 1, the multi-head attention layer effectively improves the model's emotion recognition: the WA values on the two databases rise by 11.7% and 0.1%, and the UA values by 12.2% and 0.1%, respectively. The results show that the proposed multi-head self-attention layer improves the CNN by capturing the time-frequency information of the CNN module.
Table 2 shows the effect of the multi-dimensional attention layer on the performance of the LSTM model.

TABLE 2 Performance comparison of LSTM with and without the multi-dimensional attention layer (%, 'WA/UA')

Data set | ABC | eNTERFACE
---|---|---
LSTM | 57.2/49.4 | 71.5/71.6
LSTM + multi-dimensional attention layer | 60.1/52.5 | 74.3/74.4
As can be seen from Table 2, the multi-dimensional attention layer also effectively improves the recognition performance of the model: the WA values on the two databases rise by 2.9% and 2.8%, and the UA values by 3.1% and 2.8%, respectively. This indicates that the proposed multi-dimensional attention layer is effective in coordinating emotional segments.
Table 3 shows the enhancement of model performance by the fusion layer.

TABLE 3 Enhancement of model performance by the fusion layer (%, 'WA/UA')

Data set | ABC | eNTERFACE
---|---|---
CNN+LSTM | 58.0/49.9 | 74.8/75.0
CRNN-MA (no fusion layer) | 60.3/53.1 | 75.7/75.6
CRNN-MA | 65.3/59.7 | 78.6/78.6
From Table 3 we can see that the proposed CRNN-MA model achieves the best experimental results: the fusion layer's integration of the different high-level features has a positive influence on model performance, and its addition allows the model to capture more effective emotional information.
Claims (8)
1. A speech emotion recognition method using a convolutional recurrent neural network with multiple attention mechanisms, characterized by comprising the following steps:
step A: extracting spectrogram features and frame-level features in the voice;
and B: using CNN to learn time-frequency related information in the spectrogram;
and C: the multi-head self-attention layer acts on the CNN module to calculate the weights of different frames under global features of different scales and fuses features of different depths in the CNN;
step D: a multi-dimensional attention layer acts on the frame-level features to comprehensively consider the relationship between the local features and the global features;
step E: the processed frame-level features are transmitted into an LSTM model to obtain time information in the features;
step F: a blending layer to summarize the outputs of the different modules to enhance model performance;
step G: the emotion is classified using a softmax classifier.
2. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of extracting the spectrogram features in step A include: pre-emphasis, framing and fast Fourier transform are applied to the speech, and the energy spectrum is then passed through a set of Mel-scale triangular filter banks to obtain the spectrogram features; the first-order and second-order differences of each spectrogram feature segment are then computed; and the specific steps of extracting the frame-level features in step A include: 95-dimensional low-level descriptors are extracted from each speech frame, including mel-frequency cepstral coefficients and their first-order differences, the mel spectrum and its first-order differences, spectral features, spectral flatness, chroma, zero-crossing rate and root mean square energy.
3. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 2, wherein the specific steps of step B include:

the spectrogram features from step A, together with their first-order and second-order differences, form three-channel spectrogram features that are passed into the CNN module for learning; for the CNN module, AlexNet pre-trained on the ImageNet dataset is used as the initial model; the model has five convolutional layers and three pooling layers in total, and the fully-connected layers of the network are removed to better match the multi-head self-attention layer; the input size is 227 × 227 × 3; the first convolutional layer contains 96 convolution kernels of size 11 × 11, and the second contains 256 kernels of size 5 × 5; the last three convolutional layers contain 384, 384 and 256 kernels respectively, each of size 3 × 3.
4. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 3, wherein the specific steps of step C include:

Step C-1: the three pooling layers of the CNN in step B are set as the inputs of the self-attention layer; in the self-attention layer, the dimensionality of the input is first reduced:

F_n = σ_R(f_n * X_n)

where σ_R(·) denotes the ReLU activation function, * is the convolution operation, X_n is the input, and X_1, X_2, X_3 denote the first, second and third pooling layers of the CNN respectively;

Step C-2: an attention unit is added to calculate the interdependence of all frames and obtain the weights of the different frames:

α_n = Softmax(V_n · U_n)

where V_n = σ_S(F_n · W_n + b_n), T_0 is the time dimension, W and U are weights, b is the bias, σ_S denotes the Sigmoid activation function, and Softmax denotes the Softmax operation;

Step C-3: a 1 × 1 convolution G_n with 1024 convolution kernels is applied, where N_0 denotes the feature dimension of the input features; an N_0 × 1 max-pooling operation is then applied to G_n;

Step C-4: the output of the multi-head self-attention layer combines all the self-attention heads, where O_n = M_n · α_n ∈ R^(1024×1).
5. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step D include:

Step D-1: in the multi-dimensional attention layer, single-channel 1 × 1 convolutions f_T and f_N are first set; the outputs of the frame dimension and the feature dimension are expressed as:

F_T = σ_R(f_T * X_T) ∈ R^(T×N)

F_N = σ_R(f_N * X_N) ∈ R^(N×T)

where X_T and X_N = (X_T)^T denote the inputs of the two dimensions of the multi-dimensional attention layer, and T and N denote the frame dimension and the feature dimension respectively;

Step D-2: an attention unit scores the frame dimension and the feature dimension, giving the weights of the two attention branches as:

α_T = Softmax(σ_R(F_T · W_T + b_T) · U_T) ∈ R^(T×1)

α_N = Softmax(σ_R(F_N · W_N + b_N) · U_N) ∈ R^(N×1)

where W_T, U_T, W_N, U_N denote weight matrices and b_T, b_N denote biases;

Step D-3: the output of the frame dimension O_T and the output of the feature dimension O_N are obtained respectively.
6. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step E include:

Step E-1: at each time step, the inputs of the LSTM are the current input value x_t, the previous output value h_(t-1) and the previous cell state c_(t-1), and the outputs are the current h_t and the current state c_t; the forget gate f_t determines which information the cell discards:

f_t = σ(W_f · [h_(t-1), x_t] + b_f)

where σ denotes the Sigmoid activation function and W and b are the weight and bias respectively; the output of f_t lies between 0 and 1, where 1 means the information is fully retained and 0 means it is fully discarded;

Step E-2: the cell decides which values to update:

i_t = σ(W_i · [h_(t-1), x_t] + b_i)

where the Sigmoid decides which values are updated and tanh is used to create the new candidate values;

Step E-3: the cell state is updated and the final state is output:

h_t = o_t * tanh(C_t).
7. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step F include:

Step F-1: the output values of the two different modules are normalized separately to accelerate the convergence of training:

O^(CNN-BN) = σ_S(BN(O^(CNN)))

O^(LSTM-BN) = σ_R(BN(O^(LSTM)))

where BN denotes batch normalization;

Step F-2: the output of the CRNN-MA model is calculated as:

O^(CRNN-MA) = σ_S([(O^(CNN-BN))^T, (O^(LSTM-BN))^T] · W) · V

where W, V ∈ R^(2048×2048) denote the weights of the fusion layer.
8. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step G include: the features are normalized using Softmax, where f(V_i) is the probability corresponding to feature value V_i and the sum of all probabilities equals 1; if one V_i is larger than all other values, its output probability is the highest, its mapping component approaches 1, and the mapping components of the other feature values approach 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110695847.6A CN113450830B (en) | 2021-06-23 | 2021-06-23 | Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113450830A true CN113450830A (en) | 2021-09-28 |
CN113450830B CN113450830B (en) | 2024-03-08 |
Family ID: 77812318
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111785301A (en) * | 2020-06-28 | 2020-10-16 | 重庆邮电大学 | Residual error network-based 3DACRNN speech emotion recognition method and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108664632A (en) * | 2018-05-15 | 2018-10-16 | 华南理工大学 | A kind of text emotion sorting algorithm based on convolutional neural networks and attention mechanism |
CN110534132A (en) * | 2019-09-23 | 2019-12-03 | 河南工业大学 | A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic |
CN110853680A (en) * | 2019-11-05 | 2020-02-28 | 河南工业大学 | double-BiLSTM structure with multi-input multi-fusion strategy for speech emotion recognition |
US10699129B1 (en) * | 2019-11-15 | 2020-06-30 | Fudan University | System and method for video captioning |
US10937444B1 (en) * | 2017-11-22 | 2021-03-02 | Educational Testing Service | End-to-end neural network based automated speech scoring |
CN112784798A (en) * | 2021-02-01 | 2021-05-11 | 东南大学 | Multi-modal emotion recognition method based on feature-time attention mechanism |
Non-Patent Citations (1)
Title |
---|
Zhou Kaiye (周凯烨): "Multi-branch action recognition network based on a motion attention module", Industrial Control Computer, no. 07 *
Also Published As
Publication number | Publication date |
---|---|
CN113450830B (en) | 2024-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110400579B (en) | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network | |
US10176811B2 (en) | Neural network-based voiceprint information extraction method and apparatus | |
CN107610707B (en) | A kind of method for recognizing sound-groove and device | |
CN112466326B (en) | Voice emotion feature extraction method based on transducer model encoder | |
Umamaheswari et al. | An enhanced human speech emotion recognition using hybrid of PRNN and KNN | |
Bhat et al. | Automatic assessment of sentence-level dysarthria intelligibility using BLSTM | |
CN110459225B (en) | Speaker recognition system based on CNN fusion characteristics | |
CN111798874A (en) | Voice emotion recognition method and system | |
CN108520753A (en) | Voice lie detection method based on the two-way length of convolution memory network in short-term | |
CN111583964A (en) | Natural speech emotion recognition method based on multi-mode deep feature learning | |
Yılmaz et al. | Articulatory features for asr of pathological speech | |
CN115862684A (en) | Audio-based depression state auxiliary detection method for dual-mode fusion type neural network | |
Janbakhshi et al. | Automatic dysarthric speech detection exploiting pairwise distance-based convolutional neural networks | |
CN116524960A (en) | Speech emotion recognition system based on mixed entropy downsampling and integrated classifier | |
Alashban et al. | Speaker gender classification in mono-language and cross-language using BLSTM network | |
Jiang et al. | Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit. | |
CN110348482A (en) | A kind of speech emotion recognition system based on depth model integrated architecture | |
CN112466284B (en) | Mask voice identification method | |
CN113450830B (en) | Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
Işık et al. | Turkish dialect recognition using acoustic and phonotactic features in deep learning architectures | |
Romero et al. | Exploring transformer-based language recognition using phonotactic information | |
Zhang et al. | Autoencoder based on cepstrum separation to detect depression from speech | |
Abdiche et al. | Text-independent speaker identification using mel-frequency energy coefficients and convolutional neural networks | |
Kalita et al. | Use of Bidirectional Long Short Term Memory in Spoken Word Detection with reference to the Assamese language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||