CN111583964A - Natural speech emotion recognition method based on multi-mode deep feature learning - Google Patents
Natural speech emotion recognition method based on multi-mode deep feature learning
- Publication number
- CN111583964A (application number CN202010290317.9A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- dimensional
- convolutional neural
- emotion
- dimensional convolutional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Hospice & Palliative Care (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Child & Adolescent Psychology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a natural speech emotion recognition method based on multi-mode deep feature learning, comprising the following steps: S1, generating appropriate multi-modal representations: three appropriate audio representations are generated from the original one-dimensional speech signal as inputs to the subsequent CNN models; S2, learning multi-modal features with a multiple deep convolutional neural network model (Multi-CNN); and S3, integrating the classification results of the different CNN models with a score-level fusion method and outputting the final speech emotion recognition result. By fusing the complementary deep multi-modal features learned by the multiple deep convolutional neural networks, the method markedly improves emotion classification performance and provides highly discriminative features for natural speech emotion recognition.
Description
Technical Field
The invention relates to the technical field of speech signal processing and pattern recognition, and in particular to a natural speech emotion recognition method based on multi-mode deep feature learning.
Background
In recent years, natural speech emotion recognition has become an active and challenging research topic in pattern recognition, speech signal processing, artificial intelligence and related fields. Unlike conventional input devices, it aims to provide intelligent affective services through direct spoken interaction with a computer, for use in voice call centers, healthcare, and affective computing.
At present, most preliminary work in speech emotion recognition has been carried out on acted emotion, because an acted emotion database is much easier to build than a natural one. In recent years, emotion recognition of natural speech in real environments has attracted growing attention from researchers, because it is closer to reality and much more difficult than recognizing acted emotion.
Speech emotion feature extraction is a key step in speech emotion recognition; it aims to extract, from emotional speech signals, feature parameters that reflect the speaker's emotional expression. Currently, much of the speech emotion recognition literature relies on hand-crafted features, such as prosodic features (fundamental frequency, amplitude, utterance duration), voice quality features (formants, spectral energy distribution, harmonic-to-noise ratio), and spectral features (Mel-frequency cepstral coefficients (MFCC), linear predictive coefficients (LPC), and linear predictive cepstral coefficients (LPCC)). However, these hand-crafted features are low-level features, and there is a semantic gap between them and the emotion labels understood by humans, so a high-level speech emotion feature extraction method is needed.
To address this problem, the deep learning techniques that have emerged in recent years may provide clues. Because they use deeper architectures, deep learning techniques generally have certain advantages over traditional approaches, including the ability to automatically detect complex structures and features without manual feature extraction.
To date, various representative deep learning techniques, such as deep neural networks (DNN), deep convolutional neural networks (CNN), and long short-term memory recurrent neural networks (LSTM-RNN), have been used for speech emotion recognition.
For example, the "speech emotion recognition method based on a multi-scale deep convolutional recurrent neural network" disclosed in Chinese patent publication No. CN108717856A combines a deep convolutional neural network (CNN) with a long short-term memory network (LSTM). Considering that two-dimensional (2D) speech spectrum segments of different lengths differ in their power to discriminate different emotion types, it proposes a multi-scale CNN + LSTM hybrid deep learning model and applies it to natural speech emotion recognition in real environments. However, a speech emotion recognition method that uses 2D speech spectrum segments as CNN input cannot capture the dynamic changes of the 2D time-frequency feature representation between consecutive frames in an utterance, and therefore cannot provide highly discriminative feature parameters for natural speech emotion recognition. Although an LSTM-RNN can be used to model temporal information, it over-emphasizes that information.
Disclosure of Invention
The invention aims to overcome the defect of the prior art that dynamic change information in the 2D time-frequency feature representation between consecutive frames in an utterance cannot be captured, so that highly discriminative feature parameters cannot be provided for natural speech emotion recognition. To this end, it provides a natural speech emotion recognition method based on multi-mode deep feature learning.
To achieve this purpose, the invention adopts the following technical scheme:
A natural speech emotion recognition method based on multi-mode deep feature learning, comprising the following steps:
S1, generating appropriate multi-modal representations: three appropriate audio representations are generated from the original one-dimensional speech signal as inputs to the subsequent CNN models;
S2, learning multi-modal features with a multiple deep convolutional neural network model;
S3, integrating the classification results of the different CNN models with a score-level fusion method, and outputting the final speech emotion recognition result.
In the scheme of the invention, a multiple deep convolutional neural network model (Multi-CNN), consisting of a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), learns deep multi-modal features that are complementary to one another, and these features are used for natural speech emotion recognition. This solves the technical problem of the prior art that dynamic change information in the 2D time-frequency feature representation between consecutive frames in an utterance cannot be captured. Fusing the complementary deep multi-modal features learned by the multiple deep convolutional neural networks markedly improves emotion classification performance and provides highly discriminative features for natural speech emotion recognition.
Preferably, step S1 comprises the following steps:
S1.1, setting the speech segment length, segmenting the one-dimensional raw speech waveform into segments of that length, and inputting the segments into the one-dimensional convolutional neural network;
S1.2, extracting a two-dimensional Mel spectrogram from the one-dimensional raw speech signal, and constructing three-channel spectrum segments, similar to RGB images, as the input of the two-dimensional convolutional neural network;
S1.3, forming video-like 3D dynamic segments from sequences of several consecutive two-dimensional Mel spectrum segments, and using the 3D dynamic segments as the input of the three-dimensional convolutional neural network for spatio-temporal feature learning.
Preferably, step S2 comprises the following steps:
S2.1, modeling the one-dimensional raw speech waveform with a one-dimensional convolutional neural network: constructing a one-dimensional convolutional neural network model and training it;
S2.2, modeling the two-dimensional Mel spectrum with a two-dimensional convolutional neural network: fine-tuning an existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the generated three-channel Mel spectrum segments to the fixed input size of the AlexNet model;
S2.3, modeling spatio-temporal dynamic information with a three-dimensional convolutional neural network: performing the spatio-temporal feature learning task on the extracted video-like 3D dynamic segments, and using dropout regularization to avoid network overfitting.
Preferably, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max pooling layers, two fully connected layers and one softmax classification output layer. Each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation function layer; that is, the input data are normalized before the one-dimensional convolutional neural network is trained.
Preferably, the AlexNet deep convolutional neural network model includes five convolutional layers, three maximum pooling layers, and two fully connected layers.
Preferably, the fine tuning in step S2.2 comprises the following steps:
1) copying all network parameters from the pre-trained AlexNet deep convolutional neural network model to initialize the two-dimensional convolutional neural network used;
2) replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector size equals the number of emotion classes in the dataset used;
3) retraining the two-dimensional convolutional neural network with the standard back-propagation strategy.
Fine-tuning is widely used for transfer learning in computer vision and alleviates the problem of data insufficiency.
Preferably, the three-dimensional convolutional neural network in step S2.3 comprises two three-dimensional convolutional layers, each including a batch normalization layer and a rectified linear unit (ReLU) activation function layer, two three-dimensional max pooling layers, two fully connected layers, and one softmax output layer.
Preferably, step S3 comprises the following steps:
S3.1, training the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters;
S3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, that is, averaging the classification results of all segments of an utterance, to produce emotion classification scores at the utterance level;
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models with a score-level fusion strategy to perform the final emotion classification.
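Steps S3.2 and S3.3 can be illustrated with a minimal sketch. The function names and the example scores below are hypothetical and are not taken from the patent; the sketch only assumes that each CNN outputs one score vector per segment.

```python
from typing import List, Sequence


def utterance_scores(segment_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-segment class scores into one utterance-level score vector (S3.2)."""
    if not segment_scores:
        raise ValueError("an utterance needs at least one segment")
    n_segments = len(segment_scores)
    n_classes = len(segment_scores[0])
    return [
        sum(seg[c] for seg in segment_scores) / n_segments
        for c in range(n_classes)
    ]


def predict_class(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring emotion class (S3.3)."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Hypothetical softmax outputs of one CNN for three segments of one utterance.
segments = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
avg = utterance_scores(segments)
label = predict_class(avg)
```

Although the first segment favours class 0, the averaged scores favour class 1, which shows how utterance-level pooling can override a single segment's decision.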
Preferably, step S3.4 may be expressed as:
$$\mathrm{score}_{fusion} = \lambda_1\,\mathrm{score}_{1D} + \lambda_2\,\mathrm{score}_{2D} + \lambda_3\,\mathrm{score}_{3D}$$
$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights of the classification scores obtained by the one-, two- and three-dimensional convolutional neural networks, respectively. The optimal values of $\lambda_1$, $\lambda_2$ and $\lambda_3$ are determined by searching the range [0, 1] in steps of 0.1.
Preferably, the updating of the network parameters in step S3.1 is shown by the following expression:
where W denotes the weights of the softmax layer and θ the network parameters; the softmax layer operates on the output of the last fully connected (FC) layer corresponding to the input data a_i; y_i denotes the class label vector of the i-th segment, which equals the emotion category of the whole utterance; H denotes the softmax log-loss function; and C denotes the total number of emotion categories.
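The expression itself is not legible in this text. One form consistent with the symbol definitions given here is sketched below. The symbol $\Upsilon$ for the last FC-layer output, the segment count $N$, and the notation $p$ for the softmax output are assumptions introduced for illustration and are not taken from the patent.

```latex
\min_{W,\theta} \sum_{i=1}^{N} H\big(\mathrm{softmax}(W \cdot \Upsilon(a_i;\theta)),\, y_i\big),
\qquad
H(p, y) = -\sum_{j=1}^{C} y_j \log p_j
```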
The beneficial effects of the invention are as follows: the method solves the technical problems of the prior art that dynamic change information in the 2D time-frequency feature representation between consecutive frames in an utterance cannot be captured, so that highly discriminative feature parameters cannot be provided for natural speech emotion recognition, and that temporal information is over-emphasized.
Drawings
FIG. 1 is a general architectural block diagram of the present invention.
FIG. 2 is the confusion matrix of the recognition results of the score-level fusion of the invention on the AFEW5.0 dataset.
FIG. 3 is the confusion matrix of the recognition results of the score-level fusion of the invention on the BAUM-1s dataset.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
Example 1: in this embodiment, a natural speech emotion recognition method based on multi-mode deep feature learning, as shown in fig. 1, includes the following steps:
S1, generating appropriate multi-modal representations: three appropriate audio representations are generated from the original one-dimensional speech signal as inputs to the subsequent CNN models.
Step S1 includes the following steps:
S1.1, setting the speech segment length, segmenting the one-dimensional raw speech waveform into segments, and inputting the segments into the one-dimensional convolutional neural network (1D-CNN). For best performance, the segment length is set to 625 frames, and the raw speech signal is down-sampled to 22 kHz and scaled to [-256, 256]. After this scaling the data are naturally close to zero mean, so the mean does not need to be subtracted.
S1.2, extracting a two-dimensional Mel spectrogram from the one-dimensional raw speech signal, and constructing three-channel spectrum segments, similar to RGB images, as the input of the two-dimensional convolutional neural network (2D-CNN);
S1.3, forming video-like 3D dynamic segments from sequences of several consecutive two-dimensional Mel spectrum segments, and using the 3D dynamic segments as the input of the three-dimensional convolutional neural network (3D-CNN) for spatio-temporal feature learning.
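Step S1.1 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: a "frame" is taken to mean one waveform sample, scaling is done by peak normalization, and segments do not overlap. The function names and the synthetic signal are hypothetical.

```python
from typing import List, Sequence

SEGMENT_LENGTH = 625      # segment length stated in the text
SCALE = 256.0             # target range [-256, 256] stated in the text


def scale_waveform(samples: Sequence[float]) -> List[float]:
    """Linearly rescale a waveform so that its peak magnitude maps to 256 (assumed scheme)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return [0.0 for _ in samples]
    return [s / peak * SCALE for s in samples]


def split_segments(samples: Sequence[float],
                   length: int = SEGMENT_LENGTH) -> List[List[float]]:
    """Cut the waveform into non-overlapping fixed-length segments, dropping the remainder."""
    n_full = len(samples) // length
    return [list(samples[i * length:(i + 1) * length]) for i in range(n_full)]


# Hypothetical one-second signal at 22 kHz (a sawtooth stands in for real audio).
signal = [((n % 200) - 100) / 100.0 for n in range(22000)]
segments = split_segments(scale_waveform(signal))
```

Each element of `segments` is one fixed-length input for the 1D-CNN; trailing samples that do not fill a whole segment are discarded in this sketch.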
In the experimental validation, we generate three-channel two-dimensional Mel spectrum segments of size 64 × 64 × 3 as the input of the two-dimensional convolutional neural network. Specifically, we extract the log Mel spectrogram of the whole speech signal with 64 Mel filter banks over the range 20 Hz to 8000 Hz, using a 25 ms Hamming window with a 10 ms overlap. Then, with a context window of 64 frames, the whole log Mel spectrogram is divided into fixed-length segments, giving static segments of size 64 × 64. Next, we compute the first-order and second-order regression coefficients of each static segment along the time axis, which gives its first-derivative (delta) and second-derivative (delta-delta) coefficients. Finally, we obtain Mel spectrum segments with three channels (static, delta, and delta-delta), similar to color RGB images in computer vision.
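The construction of the three channels can be sketched in plain Python, starting from an already computed log Mel spectrogram. The regression width of 2 frames, the edge padding, and the non-overlapping windows are assumptions made for illustration; the function names and the synthetic spectrogram are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # indexed as [mel_band][frame]


def delta(feat: Matrix, width: int = 2) -> Matrix:
    """Regression (delta) coefficients along the time axis, with edge padding."""
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for row in feat:
        last = len(row) - 1
        out.append([
            sum(
                n * (row[min(t + n, last)] - row[max(t - n, 0)])
                for n in range(1, width + 1)
            ) / denom
            for t in range(len(row))
        ])
    return out


def three_channel_segments(log_mel: Matrix, win: int = 64) -> List[List[Matrix]]:
    """Split a log-Mel spectrogram into win-frame segments of (static, delta, delta-delta)."""
    n_frames = len(log_mel[0])
    segments = []
    for start in range(0, n_frames - win + 1, win):
        static = [row[start:start + win] for row in log_mel]
        d1 = delta(static)   # first-order regression coefficients
        d2 = delta(d1)       # second-order regression coefficients
        segments.append([static, d1, d2])
    return segments


# Hypothetical log-Mel spectrogram: 64 bands x 200 frames, each band a linear ramp in time.
log_mel = [[0.1 * t + band for t in range(200)] for band in range(64)]
segs = three_channel_segments(log_mel)
```

Each entry of `segs` holds three 64 × 64 matrices, which play the role of the three colour channels of an RGB image when passed to the 2D-CNN.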
S2, learning multi-modal features with the multiple deep convolutional neural network model (Multi-CNN).
Step S2 includes the following steps:
S2.1, modeling the one-dimensional raw speech waveform with a one-dimensional convolutional neural network: constructing a one-dimensional convolutional neural network model and training it.
As shown in Table 1, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max pooling layers, two fully connected layers and one softmax classification output layer, and the stride of the convolutional layers and the max pooling layers is set to 1. Each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation function layer; that is, the input data are normalized before the one-dimensional convolutional neural network is trained. The outputs of the softmax layer correspond to all emotion categories in the dataset used.
TABLE 1
S2.2, modeling the two-dimensional Mel spectrum with a two-dimensional convolutional neural network: fine-tuning an existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the generated three-channel Mel spectrum segments to the fixed input size of the AlexNet model. As shown in Table 2, the AlexNet deep convolutional neural network model includes five convolutional layers, three max pooling layers, and two fully connected layers.
TABLE 2
The fine tuning in step S2.2 comprises the following steps:
1) copying all network parameters from the pre-trained AlexNet deep convolutional neural network model to initialize the two-dimensional convolutional neural network used;
2) replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector size equals the number of emotion classes in the dataset used;
3) retraining the two-dimensional convolutional neural network with the standard back-propagation strategy.
Fine-tuning is widely used for transfer learning in computer vision and alleviates the problem of data insufficiency.
S2.3, modeling spatio-temporal dynamic information with a three-dimensional convolutional neural network: performing the spatio-temporal feature learning task on the extracted video-like 3D dynamic segments, and using dropout regularization to avoid network overfitting.
As shown in Table 3, the three-dimensional convolutional neural network comprises two three-dimensional convolutional layers, each including a batch normalization layer and a rectified linear unit (ReLU) activation function layer, two three-dimensional max pooling layers, two fully connected layers, and one softmax output layer.
TABLE 3
S3, integrating the classification results of the different CNN models with a score-level fusion method, and outputting the final speech emotion recognition result.
The feature representations obtained from the one-dimensional and three-dimensional convolutional neural networks capture acoustic characteristics that are completely different from those captured by the two-dimensional convolutional neural network, which takes a 2D time-frequency representation as input. This indicates that the emotion features learned by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks may be complementary to one another. They are therefore integrated in a multiple convolutional neural network fusion network, which can further improve speech emotion classification performance.
Step S3 includes the following steps:
S3.1, training the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters according to the following expression:
where W denotes the weights of the softmax layer and θ the network parameters; the softmax layer operates on the output of the last fully connected (FC) layer corresponding to the input data a_i; y_i denotes the class label vector of the i-th segment, which equals the emotion category of the whole utterance; H denotes the softmax log-loss function; and C denotes the total number of emotion categories.
S3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, that is, averaging the classification results of all segments of an utterance, to produce emotion classification scores at the utterance level;
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models (1D-CNN, 2D-CNN, 3D-CNN) with a score-level fusion strategy to perform the final emotion classification, which can be expressed as:
$$\mathrm{score}_{fusion} = \lambda_1\,\mathrm{score}_{1D} + \lambda_2\,\mathrm{score}_{2D} + \lambda_3\,\mathrm{score}_{3D}$$
$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights of the classification scores obtained by the one-, two- and three-dimensional convolutional neural networks, respectively. The optimal values of $\lambda_1$, $\lambda_2$ and $\lambda_3$ are determined by searching the range [0, 1] in steps of 0.1.
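The weight search can be sketched as an exhaustive grid search over all weight triples on a 0.1 grid that sum to 1. The patent does not state which data the search is run on or which criterion is optimized; the sketch assumes classification accuracy on a labelled set, and all names and scores below are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Scores = Sequence[Sequence[float]]  # indexed as [utterance][class]
Weights = Tuple[float, float, float]


def fuse(s1: Sequence[float], s2: Sequence[float], s3: Sequence[float],
         w: Weights) -> List[float]:
    """Weighted sum of the three utterance-level score vectors."""
    return [w[0] * a + w[1] * b + w[2] * c for a, b, c in zip(s1, s2, s3)]


def accuracy(s1: Scores, s2: Scores, s3: Scores,
             labels: Sequence[int], w: Weights) -> float:
    """Fraction of utterances whose fused arg-max class equals the label."""
    hits = 0
    for a, b, c, y in zip(s1, s2, s3, labels):
        fused = fuse(a, b, c, w)
        hits += int(max(range(len(fused)), key=fused.__getitem__) == y)
    return hits / len(labels)


def search_weights(s1: Scores, s2: Scores, s3: Scores, labels: Sequence[int]):
    """Try every (l1, l2, l3) on a 0.1 grid with l1 + l2 + l3 = 1; keep the best."""
    best_w, best_acc = None, -1.0
    for i, j in product(range(11), repeat=2):
        if i + j > 10:
            continue
        w = (i / 10, j / 10, (10 - i - j) / 10)
        acc = accuracy(s1, s2, s3, labels, w)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Hypothetical utterance-level scores of the 1D-, 2D- and 3D-CNN for three utterances.
labels = [0, 1, 1]
s1 = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]]
s2 = [[0.4, 0.6], [0.3, 0.7], [0.2, 0.8]]
s3 = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
best_w, best_acc = search_weights(s1, s2, s3, labels)
```

In this toy example neither the first nor the second network classifies all three utterances correctly on its own, but a suitable weighting of both does, which is the complementarity that score-level fusion relies on.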
To verify the effectiveness of the proposed method for natural speech emotion recognition, we performed experiments on two challenging natural emotional speech datasets, AFEW5.0 and BAUM-1s, rather than on acted emotional speech datasets.
The AFEW5.0 dataset contains 7 emotion categories: anger, happiness, sadness, disgust, surprise, fear, and neutral. Three annotators were invited to annotate these emotions. The AFEW5.0 dataset is divided into three parts: a training set Train (723 samples), a validation set Val (383 samples) and a test set Test (539 samples). We do not use the test set because it is available only to researchers participating in the competition.
BAUM-1s contains not only 6 basic emotion categories (anger, happiness, sadness, disgust, fear, and surprise) but also other mental states such as uncertainty, thinking, concentration, and confusion. Here we focus only on recognizing the 6 basic emotion categories, which gives a subset of 521 emotional video samples.
A. Experimental setup
For training the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, the mini-batch size of the input data is 30, the maximum number of epochs is 300, and the learning rate is 0.001. To speed up training of the convolutional neural networks, an NVIDIA GTX TITAN X GPU with 12 GB of memory is used. Natural speech emotion recognition is evaluated with a speaker-independent cross-validation strategy, which is the strategy mainly used in real-world scenarios.
On the AFEW5.0 dataset, the original training set (Train) is used for training and the validation set (Val) is used as the test set. On the BAUM-1s dataset, which contains 31 Turkish subjects, a leave-one-speaker-group-out (LOSGO) cross-validation strategy with five speaker groups is employed, and the average recognition accuracy over the five test runs is reported.
Note that in the experiments we divide the Mel spectrogram of the whole utterance, extracted from each audio sample, into a number of Mel spectrum segments, and perform segment-level feature learning with the convolutional neural networks. In this case, the emotion label of each Mel spectrum segment is set to the utterance-level emotion label.
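A speaker-independent split of this kind can be sketched as follows. How the 31 speakers are assigned to the five groups is not stated in the text, so the round-robin assignment, the function names, and the sample identifiers below are assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Tuple


def speaker_groups(speakers: List[str], n_groups: int = 5) -> List[List[str]]:
    """Distribute speakers round-robin into n_groups disjoint groups (assumed scheme)."""
    return [speakers[g::n_groups] for g in range(n_groups)]


def losgo_folds(sample_speaker: Dict[str, str], n_groups: int = 5
                ) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_ids, test_ids); each fold holds out all samples of one speaker group."""
    speakers = sorted(set(sample_speaker.values()))
    for held_out in speaker_groups(speakers, n_groups):
        held = set(held_out)
        train = [s for s, spk in sample_speaker.items() if spk not in held]
        test = [s for s, spk in sample_speaker.items() if spk in held]
        yield train, test


# Hypothetical corpus: 31 speakers with 2 samples each (identifiers are made up).
corpus = {f"s{spk:02d}_u{u}": f"s{spk:02d}" for spk in range(31) for u in range(2)}
folds = list(losgo_folds(corpus))
```

Because whole speakers are held out, no speaker contributes samples to both the training and the test side of a fold, and every sample is tested exactly once across the five folds.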
B. Network training
The one-dimensional, two-dimensional and three-dimensional convolutional neural networks are trained, and the network parameters are updated according to the following expression:
where W denotes the weights of the softmax layer and θ the network parameters; the softmax layer operates on the output of the last FC layer corresponding to the input data a_i; y_i denotes the class label vector of the i-th segment, which equals the utterance-level emotion category; H denotes the softmax log-loss function; and C denotes the total number of emotion categories.
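The role of the softmax log-loss with segment labels inherited from the utterance can be illustrated numerically. The sketch assumes the standard cross-entropy form of the softmax log-loss and sums it over the segments of one utterance; the function names and the logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_loss(logits: Sequence[float], one_hot: Sequence[int]) -> float:
    """Softmax log-loss -sum_j y_j * log(p_j) for one segment (assumed standard form)."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)


def utterance_loss(segment_logits: Sequence[Sequence[float]],
                   utterance_label: int, n_classes: int) -> float:
    """Sum the loss over all segments, each inheriting the utterance label."""
    y = [int(c == utterance_label) for c in range(n_classes)]
    return sum(log_loss(z, y) for z in segment_logits)


# Hypothetical logits for two segments of one utterance labelled class 2 (of 7 classes).
logits = [[0.0] * 7, [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0]]
loss = utterance_loss(logits, utterance_label=2, n_classes=7)
```

The first segment is maximally uncertain and contributes log 7 to the loss, while the second segment, which already favours the correct class, contributes much less.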
C. Results and analysis
The length of the sample segments fed to the one-dimensional convolutional neural network (1D-CNN) may have a significant impact on the performance of the 1D-CNN. We therefore first studied the performance of different sample segment lengths as input to the 1D-CNN: we tested four segment lengths (125, 625, 3125, and 15625 frames) together with the corresponding number of convolutional layers. Table 4 shows the recognition performance of the four segment lengths and the associated number of convolutional layers. Note that each convolutional layer is followed by a max pooling layer, except the last convolutional layer, which is equivalent to a fully connected layer.
TABLE 4
As shown in Table 4, the sample segment length of 625 frames performed best among the four lengths on both the AFEW5.0 and BAUM-1s datasets, with an accuracy of 24.02% on AFEW5.0 and 37.37% on BAUM-1s. A larger sample segment length can help improve performance, but an overly large one does not necessarily do so. This may be because a larger sample segment length reduces the number of samples available for training the one-dimensional convolutional neural network. We therefore set the sample segment length of the one-dimensional convolutional neural network to 625 frames.
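The trade-off between segment length and number of training samples can be made concrete with a small sketch. It assumes non-overlapping segments and a hypothetical signal of 50,000 points; neither is stated in the text, and `split_waveform` is an illustrative name.

```python
def split_waveform(samples, segment_len):
    """Cut a 1-D signal into non-overlapping segments, dropping any remainder."""
    n_segments = len(samples) // segment_len
    return [samples[i * segment_len:(i + 1) * segment_len] for i in range(n_segments)]


signal = [0.0] * 50_000  # hypothetical utterance length
counts = {length: len(split_waveform(signal, length))
          for length in (125, 625, 3125, 15625)}
for length, n in counts.items():
    print(f"segment length {length}: {n} training segments")
```

Each fivefold increase in segment length cuts the number of segments obtained from the same utterance by roughly a factor of five.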
Since the extracted three-channel Mel-spectrum segments are similar to RGB images and serve as the input of the 2D-CNN, it is feasible to fine-tune an existing deep model pre-trained on ImageNet data. To evaluate the fine-tuning effect of different pre-trained deep network models, we compared the fine-tuned recognition performance of three typical deep network models, namely AlexNet, VGG-16 and ResNet-50, on the target emotion datasets. The recognition result of each model is obtained by averaging the classification scores of all segments of an utterance and then taking the class with the maximum average score.
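The averaging and maximization step can be sketched as follows, assuming each segment yields one score per emotion class. The toy scores, the class names and the function names are hypothetical.

```python
def utterance_scores(segment_scores):
    """Average-pool the per-segment class scores of one utterance."""
    n = len(segment_scores)
    n_classes = len(segment_scores[0])
    return [sum(s[c] for s in segment_scores) / n for c in range(n_classes)]


def predict_emotion(segment_scores, class_names):
    """Return the class with the highest average score, plus the averaged scores."""
    avg = utterance_scores(segment_scores)
    best = max(range(len(avg)), key=avg.__getitem__)
    return class_names[best], avg


segment_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
label, avg = predict_emotion(segment_scores, ["angry", "happy", "sad"])
print(label, avg)
```

Here the second segment alone would vote for a different class, but averaging over the utterance lets the other two segments outweigh it.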
Table 5 shows the fine-tuning recognition results for the three deep network models (AlexNet, VGG-16 and ResNet-50). As can be seen from Table 5, AlexNet performed slightly better than VGG-16 and ResNet-50: on the AFEW5.0 database, AlexNet reached an accuracy of 29.24%, against 28.16% for VGG-16 and 28.55% for ResNet-50; on the BAUM-1s database, AlexNet, VGG-16 and ResNet-50 reached 42.29%, 41.73% and 41.97%, respectively. This indicates that deeper network models such as VGG-16 and ResNet-50 bring no significant performance improvement over the shallower AlexNet, probably because the emotion datasets used are very limited, so the number of speech samples produced is not sufficient to train deeper networks.
TABLE 5
For spatio-temporal feature learning, a sequence of consecutive two-dimensional Mel-spectrum segments forms a video-like 3D dynamic segment that serves as the input of the three-dimensional convolutional neural network (3D-CNN). The length of the resulting video segment is equal to the number of consecutive two-dimensional Mel-spectrum segments it contains. The video segment length also significantly affects the performance of the 3D-CNN.
To evaluate the performance of different video segment lengths as input to the 3D-CNN, we report the recognition results for four video segment lengths (4, 6, 8 and 10 Mel-spectrum segments). For all of these lengths, the three-dimensional convolutional neural network has the same network structure except for the first convolutional layer (Conv1), in which the depth of the three-dimensional filter (i.e., the number of consecutive Mel-spectrum segments it spans) is equal to the corresponding video segment length. Table 6 gives the performance of the four video segment lengths together with the depth of the three-dimensional filter of the first convolutional layer.
TABLE 6
As can be seen from Table 6, a video segment length of 4 consecutive Mel-spectrum segments achieves the best performance on the AFEW5.0 and BAUM-1s databases, with accuracies of 28.46% and 37.97%, respectively. As the video segment length increases, the performance of the three-dimensional convolutional neural network degrades, possibly because the number of video segments available for training the network decreases as the video segment length increases.
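The construction of video-like 3D segments, and the reason longer ones leave fewer training samples, can be sketched as below. The sketch assumes non-overlapping clips and an utterance of 24 Mel-spectrum segments; both are hypothetical, as is the name `make_clips`.

```python
def make_clips(segments, clip_len):
    """Stack consecutive 2-D Mel-spectrum segments into non-overlapping clips.

    Each clip is a list of clip_len segments, i.e. a video-like 3-D input.
    """
    return [segments[i:i + clip_len]
            for i in range(0, len(segments) - clip_len + 1, clip_len)]


# Placeholder 2-D segments; only their count matters for this illustration.
segments = [[[0.0]] for _ in range(24)]
clip_counts = {n: len(make_clips(segments, n)) for n in (4, 6, 8, 10)}
for clip_len, n_clips in clip_counts.items():
    print(f"clip length {clip_len}: {n_clips} clips")
```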
In the experiment, we propose and compare two fusion methods for multiple deep convolutional neural networks: feature-level fusion and score-level fusion. For feature-level fusion, we first extract an utterance-level feature for each convolutional neural network by applying average pooling to the segment features given by the output of its last fully connected layer. The three utterance-level features from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks are then directly concatenated into a 5376-D feature vector, and finally a linear support vector machine (SVM) is used for the final emotion classification.
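A minimal sketch of the feature-level fusion step follows. It covers only the average pooling and the concatenation; the linear SVM that classifies the fused vector is omitted. The tiny feature dimensions are hypothetical stand-ins for the real ones, which sum to 5376, and the function names are illustrative.

```python
def average_pool(segment_features):
    """Average the segment-level feature vectors of one utterance."""
    n = len(segment_features)
    dim = len(segment_features[0])
    return [sum(f[d] for f in segment_features) / n for d in range(dim)]


def fuse_features(feats_1d, feats_2d, feats_3d):
    """Concatenate the utterance-level features of the three networks."""
    return average_pool(feats_1d) + average_pool(feats_2d) + average_pool(feats_3d)


# Hypothetical last-FC-layer outputs for one utterance (dimensions 2, 3 and 1).
fused = fuse_features(
    [[1.0, 3.0], [3.0, 5.0]],   # two segments from the 1D-CNN
    [[0.0, 0.0, 6.0]],          # one segment from the 2D-CNN
    [[2.0], [4.0]],             # two clips from the 3D-CNN
)
print(fused)
```

The networks may produce different numbers of segments per utterance; pooling first makes each contribution a fixed-length vector before concatenation.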
Table 7 lists the recognition results of the different fusion methods for multiple deep convolutional neural networks and of the best-performing single convolutional neural network. For the AFEW5.0 database, the optimal weight values are 0.3, 0.4; for the BAUM-1s database, the optimal weight values are 0.2, 0.5, 0.3. From the results in Table 7, it can be seen that:
1) The two-dimensional convolutional neural network (2D-CNN) performs best, followed by the three-dimensional convolutional neural network (3D-CNN) and the one-dimensional convolutional neural network (1D-CNN). This shows that it is effective to fine-tune the existing deep network model AlexNet, pre-trained on ImageNet data, using the generated two-dimensional Mel-spectrum segments that resemble RGB images, thereby alleviating the shortage of emotion data for deep neural network training.
2) Score-level fusion performs better than feature-level fusion. This shows that score-level fusion is more suitable for fusing multiple deep neural networks.
3) Compared with any single one-dimensional, two-dimensional or three-dimensional convolutional neural network, fusing the multiple convolutional neural networks at either the feature level or the score level gives better performance. This shows that the multi-modal deep features learned by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks are complementary, so integrating them in a fusion network of multiple deep convolutional neural networks significantly improves emotion classification performance.
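Score-level fusion, as defined in claim 9, is a weighted sum of the three networks' utterance-level scores, with weights that sum to 1 and are searched over [0, 1] in steps of 0.1. The sketch below illustrates that search. The toy scores are hypothetical, as are the function names and the use of labelled data to rank the weights; the text does not state on which data the weights are selected.

```python
def weight_grid(step=0.1):
    """Yield all (l1, l2, l3) on the grid with l1 + l2 + l3 = 1."""
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            yield (i / n, j / n, (n - i - j) / n)


def fuse(weights, s1, s2, s3):
    """Weighted sum of the per-class scores of the three networks."""
    return [weights[0] * a + weights[1] * b + weights[2] * c
            for a, b, c in zip(s1, s2, s3)]


def accuracy(weights, scores_1d, scores_2d, scores_3d, labels):
    """Fraction of utterances whose fused argmax matches the label."""
    correct = 0
    for s1, s2, s3, y in zip(scores_1d, scores_2d, scores_3d, labels):
        fused = fuse(weights, s1, s2, s3)
        if max(range(len(fused)), key=fused.__getitem__) == y:
            correct += 1
    return correct / len(labels)


def search_weights(scores_1d, scores_2d, scores_3d, labels):
    """Pick the grid point with the highest accuracy (first one on ties)."""
    return max(weight_grid(),
               key=lambda w: accuracy(w, scores_1d, scores_2d, scores_3d, labels))


# Toy utterance-level scores for two utterances and two classes.
scores_1d = [[0.4, 0.6], [0.6, 0.4]]
scores_2d = [[0.9, 0.1], [0.1, 0.9]]
scores_3d = [[0.5, 0.5], [0.5, 0.5]]
labels = [0, 1]
best = search_weights(scores_1d, scores_2d, scores_3d, labels)
print(best, accuracy(best, scores_1d, scores_2d, scores_3d, labels))
```

With a step of 0.1 the grid has only 66 weight triples, so an exhaustive search is cheap.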
TABLE 7
To show the recognition accuracy of each emotion, Fig. 2 and Fig. 3 show the confusion matrices of the recognition results obtained when the score-level fusion method achieved recognition accuracies of 35.77% and 44.06% on the two datasets, respectively.
As shown in Fig. 2, in the AFEW5.0 database the three emotions "anger", "neutral" and "fear" are recognized with accuracies of 56.25%, 50.79% and 43.48%, respectively, whereas the classification accuracy of each of the other four emotions, namely "disgust", "happy", "sad" and "surprise", is below 33%.
As can be seen from Fig. 3, in the BAUM-1s database the two emotions "sad" and "happy" are recognized with accuracies of 70.90% and 55.49%, respectively, whereas the accuracy of each of the other four emotions, namely "anger", "fear", "disgust" and "surprise", is below 29%.
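Per-emotion accuracies of this kind are read off the diagonal of a confusion matrix. The sketch below assumes a matrix of raw counts whose rows are the true classes; the counts, the class names and the function name are hypothetical and are not taken from Fig. 2 or Fig. 3.

```python
def per_class_accuracy(confusion, class_names):
    """Accuracy of each class: correct predictions divided by the row total."""
    result = {}
    for i, name in enumerate(class_names):
        row_total = sum(confusion[i])
        result[name] = confusion[i][i] / row_total if row_total else 0.0
    return result


# Rows are true classes, columns are predicted classes (toy counts).
confusion = [
    [8, 2],
    [1, 3],
]
print(per_class_accuracy(confusion, ["sad", "happy"]))
```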
In the scheme of the invention, deep multi-modal features are learned with multiple deep convolutional neural networks (Multi-CNN), namely a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), and are used for natural speech emotion recognition. This solves the technical problem that the prior art cannot capture the dynamic change information among consecutive frames of the 2D time-frequency representation within an utterance. Because the features learned by the three networks are complementary, fusing the multiple deep convolutional neural networks significantly improves emotion classification performance and provides highly discriminative features for natural speech emotion recognition.
Claims (10)
1. A natural speech emotion recognition method based on multi-mode deep feature learning, characterized by comprising the following steps:
S1, generating appropriate multi-modal representations: generating three appropriate audio representations from the original one-dimensional speech signal as inputs to the subsequent different CNN models;
S2, learning multi-modal features by adopting a multiple deep convolutional neural network model;
S3, integrating the classification results of the different CNN models by adopting a score-level fusion method, and outputting the final speech emotion recognition result.
2. The method for recognizing natural speech emotion based on multi-mode deep feature learning of claim 1, wherein said step S1 includes the following steps:
S1.1, dividing the one-dimensional original speech signal waveform into segments, inputting the segments into a one-dimensional convolutional neural network, and setting the length of the speech segments;
S1.2, extracting a two-dimensional Mel spectrogram from the one-dimensional original speech signal, and constructing three-channel spectrum segments similar to RGB images as the input of a two-dimensional convolutional neural network;
S1.3, forming a video-like 3D dynamic segment from a sequence of consecutive two-dimensional Mel-spectrum segments, and performing spatio-temporal feature learning by taking the 3D dynamic segment as the input of a three-dimensional convolutional neural network.
3. The method for recognizing natural speech emotion based on multi-mode deep feature learning of claim 1, wherein said step S2 includes the following steps:
S2.1, modeling the one-dimensional original speech signal waveform by adopting a one-dimensional convolutional neural network: constructing a one-dimensional convolutional neural network model, and training the constructed one-dimensional convolutional neural network model;
S2.2, modeling the two-dimensional Mel spectrum by using a two-dimensional convolutional neural network: fine-tuning the existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the generated three-channel Mel-spectrum segments to the required input size;
S2.3, modeling the spatio-temporal dynamic information based on the three-dimensional convolutional neural network: executing a spatio-temporal feature learning task based on the extracted video-like 3D dynamic segments, and avoiding network overfitting by adopting a dropout regularization method.
4. The method according to claim 3, wherein the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max pooling layers, two fully connected layers and one softmax classification output layer, and the one-dimensional convolutional layers comprise a batch normalization layer and a rectified linear unit (ReLU) activation function layer, that is, the input data is normalized before the one-dimensional convolutional neural network is trained.
5. The method according to claim 3, wherein the AlexNet deep convolutional neural network model comprises five convolutional layers, three max pooling layers and two fully connected layers.
6. The method for natural speech emotion recognition based on multi-mode deep feature learning as claimed in claim 3, wherein the fine-tuning in step S2.2 comprises the following steps:
1) copying the whole set of network parameters from the pre-trained AlexNet deep convolutional neural network model to initialize the two-dimensional convolutional neural network used;
2) replacing the softmax output layer in the AlexNet deep convolutional neural network model with a new output layer whose label vector corresponds to the number of emotion classes used in the data set;
3) retraining the two-dimensional convolutional neural network used with a standard back-propagation strategy.
7. The method according to claim 1, wherein the three-dimensional convolutional neural network in step S2.3 comprises two three-dimensional convolutional layers, two three-dimensional max pooling layers, two fully connected layers and one softmax output layer, and the three-dimensional convolutional layers comprise a batch normalization layer and a rectified linear unit (ReLU) activation function layer.
8. The method for recognizing natural speech emotion based on multi-mode deep feature learning of claim 1, wherein said step S3 includes the following steps:
S3.1, performing network training on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, and updating the network parameters;
S3.2, applying an average pooling strategy to the segment classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, that is, averaging the classification results of all segments of an utterance, so as to generate utterance-level emotion classification scores;
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models by means of a score-level fusion strategy to carry out the final emotion classification.
9. The method for recognizing natural speech emotion based on multi-mode deep feature learning as claimed in claim 8, wherein said step S3.4 can be expressed as:
$$\mathrm{score}_{\mathrm{fusion}} = \lambda_1\,\mathrm{score}_{1D} + \lambda_2\,\mathrm{score}_{2D} + \lambda_3\,\mathrm{score}_{3D};$$
$$\lambda_1 + \lambda_2 + \lambda_3 = 1;$$
wherein λ1, λ2 and λ3 represent the weight values of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively, and the optimal values of λ1, λ2 and λ3 are determined by searching the range [0, 1] in steps of 0.1.
10. The method for natural speech emotion recognition based on multi-mode deep feature learning of claim 8, wherein the updating of the network parameters in step S3.1 is represented by the following expression:
$$\min_{W,\theta} \sum_{i} H\big(\mathrm{softmax}(W \cdot \gamma(a_i;\theta)),\ y_i\big)$$
wherein W represents the weights of the softmax layer, θ represents the network parameters, γ(a_i; θ) represents the output of the last fully connected (FC) layer for the input data a_i, y_i represents the class label vector of the i-th segment, which is equal to the emotion category of the whole utterance, and H represents the softmax logarithmic loss function, expressed as:
$$H(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$
where p is the softmax output and C represents the total number of emotion categories.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010290317.9A CN111583964B (en) | 2020-04-14 | 2020-04-14 | Natural voice emotion recognition method based on multimode deep feature learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111583964A | 2020-08-25 |
CN111583964B | 2023-07-21 |
Family
ID=72126539
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010290317.9A Active CN111583964B (en) | 2020-04-14 | 2020-04-14 | Natural voice emotion recognition method based on multimode deep feature learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583964B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108717856A (en) * | 2018-06-16 | 2018-10-30 | 台州学院 | A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network |
CN109036465A (en) * | 2018-06-28 | 2018-12-18 | 南京邮电大学 | Speech-emotion recognition method |
CN110534132A (en) * | 2019-09-23 | 2019-12-03 | 河南工业大学 | A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic |
Non-Patent Citations (2)
Title |
---|
Jaebok Kim et al.: "Learning spectro-temporal features with 3D CNNs for speech emotion recognition", arXiv:1708.05071v1 |
Jianfeng Zhao et al.: "Learning deep features to recognise speech emotion using merged deep CNN", IET Signal Processing |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112347910B (en) * | 2020-11-05 | 2022-05-31 | 中国电子科技集团公司第二十九研究所 | Signal fingerprint identification method based on multi-mode deep learning |
CN112347910A (en) * | 2020-11-05 | 2021-02-09 | 中国电子科技集团公司第二十九研究所 | Signal fingerprint identification method based on multi-mode deep learning |
CN114612810B (en) * | 2020-11-23 | 2023-04-07 | 山东大卫国际建筑设计有限公司 | Dynamic self-adaptive abnormal posture recognition method and device |
CN114612810A (en) * | 2020-11-23 | 2022-06-10 | 山东大卫国际建筑设计有限公司 | Dynamic self-adaptive abnormal posture recognition method and device |
CN113116300A (en) * | 2021-03-12 | 2021-07-16 | 复旦大学 | Physiological signal classification method based on model fusion |
CN113409824A (en) * | 2021-07-06 | 2021-09-17 | 青岛洞听智能科技有限公司 | Speech emotion recognition method |
CN113780107A (en) * | 2021-08-24 | 2021-12-10 | 电信科学技术第五研究所有限公司 | Radio signal detection method based on deep learning dual-input network model |
CN113780107B (en) * | 2021-08-24 | 2024-03-01 | 电信科学技术第五研究所有限公司 | Radio signal detection method based on deep learning dual-input network model |
CN113903362A (en) * | 2021-08-26 | 2022-01-07 | 电子科技大学 | Speech emotion recognition method based on neural network |
CN113903362B (en) * | 2021-08-26 | 2023-07-21 | 电子科技大学 | Voice emotion recognition method based on neural network |
CN114726802A (en) * | 2022-03-31 | 2022-07-08 | 山东省计算中心(国家超级计算济南中心) | Network traffic identification method and device based on different data dimensions |
CN115195757A (en) * | 2022-09-07 | 2022-10-18 | 郑州轻工业大学 | Electric bus starting driving behavior modeling and recognition training method |
CN115195757B (en) * | 2022-09-07 | 2023-08-04 | 郑州轻工业大学 | Electric bus starting driving behavior modeling and recognition training method |
CN117373491A (en) * | 2023-12-07 | 2024-01-09 | 天津师范大学 | Method and device for dynamically extracting voice emotion characteristics |
CN117373491B (en) * | 2023-12-07 | 2024-02-06 | 天津师范大学 | Method and device for dynamically extracting voice emotion characteristics |
Also Published As
Publication number | Publication date |
---|---|
CN111583964B (en) | 2023-07-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||