CN113450830A - Speech emotion recognition method of a convolutional recurrent neural network with multiple attention mechanisms - Google Patents

Speech emotion recognition method of a convolutional recurrent neural network with multiple attention mechanisms

Info

Publication number
CN113450830A
Authority
CN
China
Prior art keywords
features
attention
cnn
neural network
layer
Prior art date
Legal status
Granted
Application number
CN202110695847.6A
Other languages
Chinese (zh)
Other versions
CN113450830B (en)
Inventor
姜芃旭
梁瑞宇
赵力
徐新洲
陶华伟
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110695847.6A
Publication of CN113450830A
Application granted
Publication of CN113450830B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a speech emotion recognition method based on a convolutional recurrent neural network with multiple attention mechanisms, which comprises the following steps. Step 1: extract spectrogram features and frame-level features. Step 2: feed the spectrogram features into a CNN module to learn the time-frequency information in the features. Step 3: a multi-head self-attention layer acts on the CNN module to compute the weights of different frames under global features of different scales and to fuse features of different depths in the CNN. Step 4: a multi-dimensional attention layer acts on the frame-level features fed to the LSTM so as to jointly consider the relationship between local and global features. Step 5: the processed frame-level features are passed into an LSTM model to capture the temporal information in the features. Step 6: a fusion layer aggregates the outputs of the different modules to enhance model performance. Step 7: a Softmax classifier is used to classify the different emotions. The invention combines deep learning networks, and the modules process the features simultaneously in a parallel structure, effectively improving speech emotion recognition performance.

Description

Speech emotion recognition method of a convolutional recurrent neural network with multiple attention mechanisms
Technical Field
The invention relates to the technical field of speech emotion recognition, and in particular to a speech emotion recognition method based on a convolutional recurrent neural network with multiple attention mechanisms.
Background
The focus of paralinguistic research is to mine the latent information in speech that characterizes the state of the speaker or the voice. As an affective task in paralinguistics, speech emotion recognition learns the category of emotion from speech, which can assist intelligent human-computer interaction. Recent deep-learning studies have provided speech emotion recognition with deep models that better describe the emotional state of speech. One of the most important deep learning models is the neural network, which is typically used to learn discriminative feature representations from low-level acoustic features. Furthermore, these emotion-related tasks tend to center on convolutional neural networks (CNN) and long short-term memory (LSTM) based recurrent neural networks to mine local information in speech. CNNs are often used to learn time-frequency information from spectral features, while LSTMs are mainly used to extract the sequential correlation of speech time series.
Although the neural network models described above have been successfully applied to speech emotion recognition, three problems remain to be solved. First, most existing neural network methods segment the complete utterance into fixed-length segments to meet the fixed-size input requirement of the model; in this process the incomplete temporal information inevitably causes a loss of emotional detail. Second, most CNN-based methods take only the last convolutional layer as the output and do not consider the hidden convolutional layers, which contain high-resolution low-level information. Third, existing LSTM-attention-based speech emotion studies attach the attention layer to the back end of the LSTM and weight the sequence in the high-level representation, thereby neglecting the temporal correlation of the frame-level features within the utterance.
Disclosure of Invention
The technical problem is as follows: in order to overcome the problems of existing speech emotion recognition techniques, the invention discloses a speech emotion recognition method based on a convolutional recurrent neural network with multiple attention mechanisms (CRNN-MA).
The technical solution is as follows: a speech emotion recognition method based on a convolutional recurrent neural network with multiple attention mechanisms comprises the following steps:
Step A: extract spectrogram features and frame-level features, respectively, as the inputs of the different modules of the model. The two feature sets are then fed into a convolutional neural network (CNN) and a long short-term memory (LSTM) recurrent neural network, respectively, so that time-frequency information and sequence information are acquired simultaneously in a parallel structure. Step B: the spectrogram features are fed into the CNN to learn the time-frequency information in the features. Step C: a multi-head self-attention layer (Multiple Self-Attention) acts on the CNN module to compute the weights of different frames under global features of different scales and to fuse features of different depths in the CNN. Step D: a multi-dimensional attention layer (Multi-Dimensional Attention) acts on the frame-level features fed to the LSTM so as to jointly consider the relationship between local and global features. Step E: the processed frame-level features are passed into the LSTM model to capture the temporal information in the features. Step F: a fusion layer aggregates the outputs of the different modules to enhance model performance. Step G: a Softmax classifier is used to classify the different emotions.
Preferably, the specific steps of extracting the spectrogram features in step A comprise: pre-emphasis, framing and fast Fourier transform are applied to the speech, and the energy spectrum is then passed through a bank of Mel-scale triangular filters to obtain the spectrogram features; the first-order and second-order differences of each spectrogram segment are then computed. The specific steps of extracting the frame-level features in step A comprise: 95-dimensional low-level descriptors are extracted from each speech frame, including mel-frequency cepstral coefficients and their first-order derivatives, the mel spectrum and its first-order derivative, spectral features, spectral flatness, chroma features, zero-crossing rate and root-mean-square energy.
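For illustration, the sketch below shows how the two feature streams of step A might be extracted with librosa. It is not the patented implementation: the sampling rate, frame length, hop size, number of mel bands and the exact subset of low-level descriptors are assumptions, since the text above only fixes the 95-dimensional LLD set and the mel-filter-bank spectrogram with its first- and second-order differences.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mels=64):
    """Illustrative sketch of step A: spectrogram stream and frame-level LLD stream."""
    y, sr = librosa.load(path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])           # pre-emphasis

    # Spectrogram stream: mel filter bank on the power spectrum, plus deltas.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)
    spec3d = np.stack([logmel,
                       librosa.feature.delta(logmel, order=1),
                       librosa.feature.delta(logmel, order=2)], axis=-1)   # (n_mels, frames, 3)

    # Frame-level stream: an illustrative subset of the low-level descriptors named above.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=512, hop_length=256)
    lld = np.vstack([mfcc,
                     librosa.feature.delta(mfcc),
                     librosa.feature.chroma_stft(y=y, sr=sr, n_fft=512, hop_length=256),
                     librosa.feature.spectral_flatness(y=y, n_fft=512, hop_length=256),
                     librosa.feature.zero_crossing_rate(y, frame_length=512, hop_length=256),
                     librosa.feature.rms(y=y, frame_length=512, hop_length=256)])
    return spec3d, lld.T                                  # LLDs as (frames, features)
```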
Preferably, the specific steps of step B include:
The spectrogram features from step A, together with their first-order and second-order differences, form a three-dimensional spectrogram feature that is fed into the CNN module for learning. For the CNN module, AlexNet trained on the ImageNet data set is used as the initial model; the model has five convolutional layers and three pooling layers in total, and the fully connected layers of the network are removed to better match the multi-head self-attention layer. The input size is 227 × 227 × 3; the first convolutional layer contains 96 convolution kernels of size 11 × 11, and the second contains 256 convolution kernels of size 5 × 5; the last three convolutional layers contain 384, 384 and 256 convolution kernels, respectively, each of size 3 × 3.
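A minimal sketch of the step-B backbone, assuming torchvision's ImageNet-pretrained AlexNet as the initial model (the string form of the `weights` argument requires torchvision 0.13 or later); the classifier part is dropped and the three max-pooling outputs are collected for the attention module of step C.

```python
import torch
import torchvision

class AlexNetTrunk(torch.nn.Module):
    """Sketch of the step-B CNN: AlexNet convolutional trunk (five conv layers,
    three max-pool layers) without the fully connected layers, returning the
    three pooling outputs for the multi-head self-attention module."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.alexnet(weights="IMAGENET1K_V1")
        self.features = backbone.features               # conv/pool stack only
        # indices of the three MaxPool2d layers inside AlexNet's feature stack
        self.pool_idx = [i for i, m in enumerate(self.features)
                         if isinstance(m, torch.nn.MaxPool2d)]

    def forward(self, x):                               # x: (batch, 3, 227, 227)
        pooled = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.pool_idx:
                pooled.append(x)
        return pooled                                   # [pool1, pool2, pool3]
```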
Preferably, the specific steps of step C include:
Step C-1: the three pooling layers of the CNN in step B are taken as the inputs of the self-attention layers; in each self-attention layer, the input is first reduced in dimension:
F_n = σ_R(f_n * X_n)
where σ_R(·) denotes the ReLU activation function, * is the convolution operation, X_n is the input, and X_1, X_2, X_3 denote the first, second and third pooling layers of the CNN, respectively;
Step C-2: an attention unit is added to compute the interdependence of all frames, giving the weights of the different frames α_n ∈ R^(T_0×1):
α_n = Softmax(V_n · U_n)
where V_n = σ_S(F_n · W_n + b_n), T_0 is the time dimension, W and U are weights, b is the bias, σ_S denotes the Sigmoid activation function, and Softmax denotes the softmax operation;
Step C-3: a 1 × 1 convolution g_n with 1024 convolution kernels is applied, computed as:
G_n = σ_R(g_n * F_n)
where N_0 denotes the feature dimension of the input features; a max-pooling operation of size N_0 × 1 is then applied to G_n:
M_n = MaxPool_(N_0 × 1)(G_n)
Step C-4: the output of the multi-head self-attention layer combines all of the self-attention heads:
O^(CNN) = O_1 + O_2 + O_3
where O_n = M_n · α_n ∈ R^(1024×1).
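The sketch below implements one head of the step-C module following equations C-1 to C-4 as reconstructed above; the attention dimension, the ReLU after the 1024-kernel 1 × 1 convolution, and the way the three heads are combined are assumptions, since those equations appear only as images in the original.

```python
import torch
import torch.nn.functional as F

class SelfAttentionHead(torch.nn.Module):
    """One head of the step-C module (sketch): dimension reduction (C-1), frame
    weights (C-2), 1x1 convolution and max-pooling over the feature axis (C-3),
    and the weighted output O_n = M_n · alpha_n (C-4)."""
    def __init__(self, in_ch, n0, att_dim=128):
        super().__init__()
        self.reduce = torch.nn.Conv2d(in_ch, 1, kernel_size=1)   # f_n: F_n = ReLU(f_n * X_n)
        self.W = torch.nn.Linear(n0, att_dim)                    # V_n = sigmoid(F_n W_n + b_n)
        self.U = torch.nn.Linear(att_dim, 1, bias=False)         # scores V_n · U_n
        self.expand = torch.nn.Conv2d(1, 1024, kernel_size=1)    # g_n: 1024 kernels of size 1x1

    def forward(self, x):                        # x: (batch, C, T0, N0) pooling output
        f = F.relu(self.reduce(x))               # (batch, 1, T0, N0)
        v = torch.sigmoid(self.W(f.squeeze(1)))  # (batch, T0, att_dim)
        alpha = torch.softmax(self.U(v), dim=1)  # (batch, T0, 1) frame weights alpha_n
        g = F.relu(self.expand(f))               # G_n: (batch, 1024, T0, N0)
        m = g.max(dim=3).values                  # N0 x 1 max-pool -> M_n: (batch, 1024, T0)
        return torch.bmm(m, alpha).squeeze(2)    # O_n: (batch, 1024)
```

The full module would instantiate one such head per pooling output of step B and combine the three O_n (here assumed to be a sum) into O^(CNN).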
Preferably, the specific steps of step D include:
Step D-1: in the multi-dimensional attention layer, single-channel 1 × 1 convolutions f_T and f_N are first set, and the outputs along the frame dimension and the feature dimension are expressed as:
F_T = σ_R(f_T * X_T) ∈ R^(T×N)
F_N = σ_R(f_N * X_N) ∈ R^(N×T)
where X_T and X_N = (X_T)^T denote the inputs of the two dimensions of the multi-dimensional attention layer, and T and N denote the frame dimension and the feature dimension, respectively;
Step D-2: the attention unit is used to score the frame dimension and the feature dimension, thereby obtaining the weights of the two attention branches as:
α_T = Softmax(σ_R(F_T · W_T + b_T) · U_T) ∈ R^(T×1)
α_N = Softmax(σ_R(F_N · W_N + b_N) · U_N) ∈ R^(N×1)
where W_T, U_T, W_N, U_N denote weight matrices and b_T, b_N denote biases;
Step D-3: the output O_T of the frame dimension and the output O_N of the feature dimension are expressed as:
O_T = (α_T · e_N^T) ⊙ F_T ∈ R^(T×N)
O_N = (α_N · e_T^T) ⊙ F_N ∈ R^(N×T)
where e_T ∈ R^(T×1), e_N ∈ R^(N×1), and ⊙ denotes the Hadamard product; O_N is then transposed and fused with O_T as the input to the LSTM, whose output is O^(LSTM) ∈ R^(1024×1).
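A sketch of the step-D multi-dimensional attention layer under the reconstruction above; treating e_T and e_N as broadcasting (all-ones) vectors and fusing the two weighted views by addition are assumptions.

```python
import torch

class MultiDimensionalAttention(torch.nn.Module):
    """Sketch of step D: attention over the frame axis (T) and the feature axis (N)
    of the frame-level feature matrix, then fusion of the two weighted views."""
    def __init__(self, t, n, att_dim=128):
        super().__init__()
        self.conv_t = torch.nn.Conv2d(1, 1, kernel_size=1)   # f_T (single channel, 1x1)
        self.conv_n = torch.nn.Conv2d(1, 1, kernel_size=1)   # f_N
        self.w_t = torch.nn.Linear(n, att_dim)
        self.u_t = torch.nn.Linear(att_dim, 1, bias=False)
        self.w_n = torch.nn.Linear(t, att_dim)
        self.u_n = torch.nn.Linear(att_dim, 1, bias=False)

    def forward(self, x):                                    # x: (batch, T, N) frame-level LLDs
        f_t = torch.relu(self.conv_t(x.unsqueeze(1))).squeeze(1)                   # F_T: (batch, T, N)
        f_n = torch.relu(self.conv_n(x.transpose(1, 2).unsqueeze(1))).squeeze(1)   # F_N: (batch, N, T)
        alpha_t = torch.softmax(self.u_t(torch.relu(self.w_t(f_t))), dim=1)        # (batch, T, 1)
        alpha_n = torch.softmax(self.u_n(torch.relu(self.w_n(f_n))), dim=1)        # (batch, N, 1)
        o_t = alpha_t * f_t                      # frame weights broadcast over features
        o_n = alpha_n * f_n                      # feature weights broadcast over frames
        return o_t + o_n.transpose(1, 2)         # fused (batch, T, N); fed to the LSTM of step E
```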
Preferably, the specific steps of step E include:
Step E-1: the inputs of the LSTM at each time step are the current input value x_t, the previous output value h_(t-1) and the previous cell state c_(t-1); the outputs are the current output h_t and the current cell state c_t. The forget gate f_t determines which information the cell discards:
f_t = σ(W_f · [h_(t-1), x_t] + b_f)
where σ denotes the Sigmoid activation function and W and b are the weight and bias, respectively; the output of f_t lies between 0 and 1, where 1 means the information is fully retained and 0 means it is completely discarded;
Step E-2: the cell decides which values to update:
i_t = σ(W_i · [h_(t-1), x_t] + b_i)
C̃_t = tanh(W_C · [h_(t-1), x_t] + b_C)
where the sigmoid gate i_t decides which values are to be updated and tanh creates the new candidate values C̃_t;
Step E-3: the cell state is updated and the final state is output:
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o · [h_(t-1), x_t] + b_o)
h_t = o_t * tanh(C_t).
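The gate equations of step E are the standard LSTM formulation, so torch.nn.LSTM can stand in directly; the sketch below keeps the final hidden state as O^(LSTM), with a hidden size of 1024 chosen to match the dimensions stated above.

```python
import torch

class FrameLSTM(torch.nn.Module):
    """Sketch of step E: run the attention-weighted frame-level features through
    an LSTM and keep the final hidden state as O_LSTM (1024-dimensional here)."""
    def __init__(self, feat_dim=95, hidden=1024):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size=feat_dim, hidden_size=hidden, batch_first=True)

    def forward(self, x):              # x: (batch, T, feat_dim) from the multi-dimensional attention layer
        _, (h_n, _) = self.lstm(x)     # h_n: (num_layers, batch, hidden)
        return h_n[-1]                 # O_LSTM: (batch, 1024)
```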
Preferably, the specific steps of step F include:
Step F-1: the output values of the two different modules are separately normalized to accelerate the convergence of training:
O^(CNN-BN) = σ_S(BN(O^(CNN)))
O^(LSTM-BN) = σ_R(BN(O^(LSTM)))
where BN denotes batch normalization;
Step F-2: the output of the CRNN-MA model is computed as:
O^(CRNN-MA) = σ_S([(O^(CNN-BN))^T, (O^(LSTM-BN))^T] · W) · V
where W, V ∈ R^(2048×2048) denote the weights of the fusion layer.
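A sketch of the step-F fusion layer following the reconstructed formulas: each stream is batch-normalized and activated, the two 1024-dimensional vectors are concatenated, and the two 2048 × 2048 fusion weights W and V are applied; implementing W and V as bias-free linear layers is an assumption.

```python
import torch

class FusionLayer(torch.nn.Module):
    """Sketch of step F: normalise the CNN and LSTM outputs, concatenate them,
    and apply the two fusion weights W and V (2048 x 2048 in the text)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.bn_cnn = torch.nn.BatchNorm1d(dim)
        self.bn_lstm = torch.nn.BatchNorm1d(dim)
        self.W = torch.nn.Linear(2 * dim, 2 * dim, bias=False)
        self.V = torch.nn.Linear(2 * dim, 2 * dim, bias=False)

    def forward(self, o_cnn, o_lstm):                    # each: (batch, 1024)
        a = torch.sigmoid(self.bn_cnn(o_cnn))            # O_(CNN-BN)
        b = torch.relu(self.bn_lstm(o_lstm))             # O_(LSTM-BN)
        fused = torch.cat([a, b], dim=1)                 # (batch, 2048)
        return self.V(torch.sigmoid(self.W(fused)))      # O_(CRNN-MA): (batch, 2048)
```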
Preferably, the specific steps of step G include: the Softmax function is expressed as:
f(V_i) = e^(V_i) / Σ_j e^(V_j)
where f(V_i) is the probability corresponding to the feature value V_i and the sum of all probabilities equals 1; if one V_i is larger than all the other values, its output probability is the highest, its mapped component approaches 1, and the mapped components of the other feature values approach 0.
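For step G, a minimal sketch of the classifier; the class count of six (the number of ABC emotion categories mentioned below) and the 2048-dimensional fused input are illustrative assumptions.

```python
import torch

n_classes = 6                                     # e.g. the six ABC emotion categories
classifier = torch.nn.Sequential(
    torch.nn.Linear(2048, n_classes),             # logits V_i from the fused 2048-d vector
    torch.nn.Softmax(dim=1),                      # f(V_i) = exp(V_i) / sum_j exp(V_j)
)
probs = classifier(torch.randn(4, 2048))          # (batch, n_classes); each row sums to 1
pred = probs.argmax(dim=1)                        # predicted emotion index per sample
```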
Beneficial effects: the speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms combines deep learning networks, and the modules process the features simultaneously in a parallel structure, which can effectively improve speech emotion recognition performance.
Drawings
FIG. 1 is a diagram of the structure of the CRNN-MA model;
FIG. 2 is a diagram of a multi-headed self-attention layer structure;
FIG. 3 is a diagram of a multi-dimensional attention layer model architecture.
Detailed Description
The architecture of the proposed CRNN-MA model is shown in fig. 1. The spectrogram features and the frame-level features are first input into the model. Using the three pooling layers of the CNN as the inputs of the multi-head self-attention module, the relationship between local and global features is obtained; the structure of this module is shown in fig. 2. The multi-dimensional attention layer computes the weights of the different frames and features, as shown in fig. 3. A fusion layer then fuses the different outputs, and the Softmax classifier outputs the result.
To verify the performance of the proposed model, experiments were performed on the ABC emotion database and the eNTERFACE emotion database. The ABC database is a German database with 6 different emotions recorded by 4 males and 4 females, containing 430 speech samples in total. The eNTERFACE emotion database includes 43 subjects from 14 different countries, recorded in English, for a total of 1283 speech samples.
For the ABC database, the leave-one-speaker-out (LOSO) cross-validation strategy was adopted. In this strategy, the speech samples of one speaker are selected from the data set as the test set of the experiment and the remaining samples are used as the training set; each speaker's speech takes a turn as the test set, and the average over the individual tests is finally calculated. For the eNTERFACE database, the data were randomly divided into 8 speaker-independent groups, seven of which contained the samples of 5 speakers each while the remaining group contained the samples of 8 speakers, and eight-fold cross-validation was performed.
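The LOSO protocol described above can be expressed with scikit-learn's LeaveOneGroupOut, using the speaker identity as the group label; the function and variable names below are illustrative.

```python
from sklearn.model_selection import LeaveOneGroupOut

def loso_splits(features, labels, speakers):
    """Yield (train_idx, test_idx) pairs where each test fold holds one speaker's samples."""
    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(features, labels, groups=speakers):
        yield train_idx, test_idx
```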
Due to the imbalance of the emotion classes, weighted accuracy (WA) and unweighted accuracy (UA) are used to evaluate the experimental results. The weighted accuracy is the ratio of the number of correctly classified samples to the total number of samples. The unweighted accuracy is the sum of the per-class accuracies (the recall of each class) divided by the number of classes, regardless of the number of samples per class.
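A small sketch of the two metrics as defined above: WA as overall accuracy, UA as the per-class accuracy (recall) averaged over the classes.

```python
import numpy as np

def weighted_unweighted_accuracy(y_true, y_pred):
    """WA: fraction of correctly classified samples.
    UA: per-class recall averaged over classes, ignoring class sizes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = float(np.mean(recalls))
    return wa, ua

print(weighted_unweighted_accuracy([0, 0, 0, 1], [0, 0, 1, 1]))   # WA = 0.75, UA ≈ 0.83
```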
Table 1 shows the effect of the multi-head self-attention layer on model performance.

TABLE 1 Performance comparison of the multi-head self-attention layer with CNN (%, 'WA/UA')

Data set                               ABC         eNTERFACE
CNN                                    49.2/41.7   70.9/71.0
CNN + multi-head self-attention layer  60.9/53.9   71.0/71.1
As can be seen from Table 1, the multi-head self-attention layer effectively improves the emotion recognition performance of the model: the WA values on the two databases increase by 11.7% and 0.1%, respectively, and the UA values increase by 12.2% and 0.1%. The results show that the proposed multi-head self-attention layer can improve the CNN by capturing the time-frequency information of the CNN module.
Table 2 shows the effect of the multi-dimensional attention layer on the performance of the LSTM model.

TABLE 2 Performance comparison of the multi-dimensional attention layer with LSTM (%, 'WA/UA')

Data set                                  ABC         eNTERFACE
LSTM                                      57.2/49.4   71.5/71.6
LSTM + multi-dimensional attention layer  60.1/52.5   74.3/74.4
As can be seen from Table 2, the multi-dimensional attention layer also effectively improves the recognition performance of the model: the WA values on the two databases increase by 2.9% and 2.8%, respectively, and the UA values increase by 3.1% and 2.8%. This indicates that the proposed multi-dimensional attention layer is effective in coordinating the emotional segments.
Table 3 shows the enhancement of model performance by the fusion layer.

TABLE 3 Enhancement of model performance by the fusion layer (%, 'WA/UA')

Data set                   ABC         eNTERFACE
CNN + LSTM                 58.0/49.9   74.8/75.0
CRNN-MA (no fusion layer)  60.3/53.1   75.7/75.6
CRNN-MA                    65.3/59.7   78.6/78.6
From Table 3 it can be seen that the proposed CRNN-MA model achieves the best experimental results, and that the fusion layer's integration of the different high-level features has a positive effect on model performance; moreover, adding the fusion layer enables the model to obtain more effective emotional information.

Claims (8)

1. A speech emotion recognition method of a convolutional recurrent neural network with multiple attention mechanisms, characterized by comprising the following steps:
Step A: extracting spectrogram features and frame-level features from the speech;
Step B: using a CNN to learn the time-frequency information in the spectrogram;
Step C: a multi-head self-attention layer acts on the CNN module to compute the weights of different frames under global features of different scales and to fuse features of different depths in the CNN;
Step D: a multi-dimensional attention layer acts on the frame-level features so as to jointly consider the relationship between local and global features;
Step E: the processed frame-level features are passed into an LSTM model to capture the temporal information in the features;
Step F: a fusion layer aggregates the outputs of the different modules to enhance model performance;
Step G: the emotions are classified using a Softmax classifier.
2. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of extracting the spectrogram features in step A comprise: pre-emphasis, framing and fast Fourier transform are applied to the speech, and the energy spectrum is then passed through a bank of Mel-scale triangular filters to obtain the spectrogram features; the first-order and second-order differences of each spectrogram segment are then computed; and the specific steps of extracting the frame-level features in step A comprise: 95-dimensional low-level descriptors are extracted from each speech frame, including mel-frequency cepstral coefficients and their first-order derivatives, the mel spectrum and its first-order derivative, spectral features, spectral flatness, chroma features, zero-crossing rate and root-mean-square energy.
3. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 2, wherein the specific steps of step B comprise:
the spectrogram features from step A, together with their first-order and second-order differences, form a three-dimensional spectrogram feature that is fed into the CNN module for learning; for the CNN module, AlexNet trained on the ImageNet data set is used as the initial model; the model has five convolutional layers and three pooling layers in total, and the fully connected layers of the network are removed to better match the multi-head self-attention layer; the input size is 227 × 227 × 3; the first convolutional layer contains 96 convolution kernels of size 11 × 11, and the second contains 256 convolution kernels of size 5 × 5; the last three convolutional layers contain 384, 384 and 256 convolution kernels, respectively, each of size 3 × 3.
4. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 3, wherein the specific steps of step C comprise:
Step C-1: the three pooling layers of the CNN in step B are taken as the inputs of the self-attention layers; in each self-attention layer, the input is first reduced in dimension:
F_n = σ_R(f_n * X_n)
where σ_R(·) denotes the ReLU activation function, * is the convolution operation, X_n is the input, and X_1, X_2, X_3 denote the first, second and third pooling layers of the CNN, respectively;
Step C-2: an attention unit is added to compute the interdependence of all frames, giving the weights of the different frames α_n ∈ R^(T_0×1):
α_n = Softmax(V_n · U_n)
where V_n = σ_S(F_n · W_n + b_n), T_0 is the time dimension, W and U are weights, b is the bias, σ_S denotes the Sigmoid activation function, and Softmax denotes the softmax operation;
Step C-3: a 1 × 1 convolution g_n with 1024 convolution kernels is applied, computed as:
G_n = σ_R(g_n * F_n)
where N_0 denotes the feature dimension of the input features; a max-pooling operation of size N_0 × 1 is then applied to G_n:
M_n = MaxPool_(N_0 × 1)(G_n)
Step C-4: the output of the multi-head self-attention layer combines all of the self-attention heads:
O^(CNN) = O_1 + O_2 + O_3
where O_n = M_n · α_n ∈ R^(1024×1).
5. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step D comprise:
Step D-1: in the multi-dimensional attention layer, single-channel 1 × 1 convolutions f_T and f_N are first set, and the outputs along the frame dimension and the feature dimension are expressed as:
F_T = σ_R(f_T * X_T) ∈ R^(T×N)
F_N = σ_R(f_N * X_N) ∈ R^(N×T)
where X_T and X_N = (X_T)^T denote the inputs of the two dimensions of the multi-dimensional attention layer, and T and N denote the frame dimension and the feature dimension, respectively;
Step D-2: the attention unit is used to score the frame dimension and the feature dimension, thereby obtaining the weights of the two attention branches as:
α_T = Softmax(σ_R(F_T · W_T + b_T) · U_T) ∈ R^(T×1)
α_N = Softmax(σ_R(F_N · W_N + b_N) · U_N) ∈ R^(N×1)
where W_T, U_T, W_N, U_N denote weight matrices and b_T, b_N denote biases;
Step D-3: the output O_T of the frame dimension and the output O_N of the feature dimension are expressed as:
O_T = (α_T · e_N^T) ⊙ F_T ∈ R^(T×N)
O_N = (α_N · e_T^T) ⊙ F_N ∈ R^(N×T)
where e_T ∈ R^(T×1), e_N ∈ R^(N×1), and ⊙ denotes the Hadamard product; O_N is then transposed and fused with O_T as the input to the LSTM, whose output is O^(LSTM) ∈ R^(1024×1).
6. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step E comprise:
Step E-1: the inputs of the LSTM at each time step are the current input value x_t, the previous output value h_(t-1) and the previous cell state c_(t-1); the outputs are the current output h_t and the current cell state c_t. The forget gate f_t determines which information the cell discards:
f_t = σ(W_f · [h_(t-1), x_t] + b_f)
where σ denotes the Sigmoid activation function and W and b are the weight and bias, respectively; the output of f_t lies between 0 and 1, where 1 means the information is fully retained and 0 means it is completely discarded;
Step E-2: the cell decides which values to update:
i_t = σ(W_i · [h_(t-1), x_t] + b_i)
C̃_t = tanh(W_C · [h_(t-1), x_t] + b_C)
where the sigmoid gate i_t decides which values are to be updated and tanh creates the new candidate values C̃_t;
Step E-3: the cell state is updated and the final state is output:
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o · [h_(t-1), x_t] + b_o)
h_t = o_t * tanh(C_t).
7. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step F comprise:
Step F-1: the output values of the two different modules are separately normalized to accelerate the convergence of training:
O^(CNN-BN) = σ_S(BN(O^(CNN)))
O^(LSTM-BN) = σ_R(BN(O^(LSTM)))
where BN denotes batch normalization;
Step F-2: the output of the CRNN-MA model is computed as:
O^(CRNN-MA) = σ_S([(O^(CNN-BN))^T, (O^(LSTM-BN))^T] · W) · V
where W, V ∈ R^(2048×2048) denote the weights of the fusion layer.
8. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step G comprise: the Softmax function is expressed as:
f(V_i) = e^(V_i) / Σ_j e^(V_j)
the features are normalized using Softmax, where f(V_i) is the probability corresponding to the feature value V_i; if one V_i is larger than all the other values, its output probability is the highest, its mapped component approaches 1, and the mapped components of the other feature values approach 0.
CN202110695847.6A 2021-06-23 2021-06-23 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms Active CN113450830B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110695847.6A CN113450830B (en) 2021-06-23 2021-06-23 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110695847.6A CN113450830B (en) 2021-06-23 2021-06-23 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms

Publications (2)

Publication Number Publication Date
CN113450830A true CN113450830A (en) 2021-09-28
CN113450830B CN113450830B (en) 2024-03-08

Family

ID=77812318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110695847.6A Active CN113450830B (en) 2021-06-23 2021-06-23 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms

Country Status (1)

Country Link
CN (1) CN113450830B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10937444B1 (en) * 2017-11-22 2021-03-02 Educational Testing Service End-to-end neural network based automated speech scoring
CN108664632A (en) * 2018-05-15 2018-10-16 华南理工大学 A kind of text emotion sorting algorithm based on convolutional neural networks and attention mechanism
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN110853680A (en) * 2019-11-05 2020-02-28 河南工业大学 double-BiLSTM structure with multi-input multi-fusion strategy for speech emotion recognition
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning
CN112784798A (en) * 2021-02-01 2021-05-11 东南大学 Multi-modal emotion recognition method based on feature-time attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU Kaiye: "Multi-branch action recognition network based on a motion attention module", Industrial Control Computer, no. 07

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785301A (en) * 2020-06-28 2020-10-16 重庆邮电大学 Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Also Published As

Publication number Publication date
CN113450830B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
CN107610707B (en) A kind of method for recognizing sound-groove and device
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
Umamaheswari et al. An enhanced human speech emotion recognition using hybrid of PRNN and KNN
Bhat et al. Automatic assessment of sentence-level dysarthria intelligibility using BLSTM
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN111798874A (en) Voice emotion recognition method and system
CN108520753A (en) Voice lie detection method based on the two-way length of convolution memory network in short-term
CN111583964A (en) Natural speech emotion recognition method based on multi-mode deep feature learning
Yılmaz et al. Articulatory features for asr of pathological speech
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
Janbakhshi et al. Automatic dysarthric speech detection exploiting pairwise distance-based convolutional neural networks
CN116524960A (en) Speech emotion recognition system based on mixed entropy downsampling and integrated classifier
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
Jiang et al. Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit.
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
CN112466284B (en) Mask voice identification method
CN113450830B (en) Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Işık et al. Turkish dialect recognition using acoustic and phonotactic features in deep learning architectures
Romero et al. Exploring transformer-based language recognition using phonotactic information
Zhang et al. Autoencoder based on cepstrum separation to detect depression from speech
Abdiche et al. Text-independent speaker identification using mel-frequency energy coefficients and convolutional neural networks
Kalita et al. Use of Bidirectional Long Short Term Memory in Spoken Word Detection with reference to the Assamese language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant