CN111583964A

CN111583964A - Natural speech emotion recognition method based on multi-mode deep feature learning

Info

Publication number: CN111583964A
Application number: CN202010290317.9A
Authority: CN
Inventors: 张石清; 赵小明
Original assignee: Taizhou University
Current assignee: Taizhou University
Priority date: 2020-04-14
Filing date: 2020-04-14
Publication date: 2020-08-25
Anticipated expiration: 2040-04-14
Also published as: CN111583964B

Abstract

The invention discloses a natural speech emotion recognition method based on multi-mode deep feature learning, comprising the following steps: S1, generating appropriate multi-modal representations: three audio representations are generated from the original one-dimensional speech signal as inputs to the subsequent CNN models; S2, learning multi-modal features with multiple deep convolutional neural network models; and S3, integrating the classification results of the different CNN models by score-level fusion and outputting the final speech emotion recognition result. By fusing multiple deep convolutional neural networks, the method learns deep multi-modal features that are complementary to one another, which significantly improves emotion classification performance and provides highly discriminative features for natural speech emotion recognition.

Description

Natural speech emotion recognition method based on multi-mode deep feature learning

Technical Field

The invention relates to the technical field of speech signal processing and pattern recognition, and in particular to a natural speech emotion recognition method based on multi-mode deep feature learning.

Background

In recent years, natural speech emotion recognition has become an active and challenging research topic in pattern recognition, speech signal processing, artificial intelligence and related fields. Unlike conventional input devices, it aims to provide intelligent emotion services for voice call centers, healthcare and affective computing through direct speech interaction with a computer.

At present, most preliminary work in speech emotion recognition has been carried out on acted emotions, because a database of acted emotions is much easier to build than a database of natural emotions. In recent years, emotion recognition for natural speech in real environments has attracted growing attention from researchers, because it is closer to real use and much harder than recognizing acted emotions.

Speech emotion feature extraction is a key step in speech emotion recognition. It aims to extract, from emotional speech signals, feature parameters that reflect the emotional expression of the speaker. Currently, a large body of the speech emotion recognition literature uses hand-crafted features, such as prosodic features (fundamental frequency, amplitude, utterance duration), voice quality features (formants, spectral energy distribution, harmonics-to-noise ratio) and spectral features (Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC) and linear prediction cepstral coefficients (LPCC)). However, these hand-crafted speech emotion features are low-level features, and there is a semantic gap between them and the emotion labels understood by humans, so a high-level speech emotion feature extraction method needs to be developed.

Recently emerged deep learning techniques may offer a way to address this problem. Because they use deeper architectures, they generally have advantages over traditional approaches, including the ability to detect complex structures and features automatically, without manual feature extraction.

Up to now, various representative deep learning techniques, such as the deep neural network (DNN), the deep convolutional neural network (CNN) and the recurrent neural network based on long short-term memory (LSTM-RNN), have been used for speech emotion recognition.

For example, the "speech emotion recognition method based on multi-scale deep convolutional recurrent neural network" disclosed in Chinese patent publication No. CN108717856A combines a deep convolutional neural network (CNN) with a long short-term memory network (LSTM). Considering that two-dimensional (2D) speech spectrogram segments of different lengths differ in how well they discriminate different emotion types, it proposes a multi-scale CNN + LSTM hybrid deep learning model and applies it to natural speech emotion recognition in real environments. However, a speech emotion recognition method that uses 2D speech spectrogram segments as CNN input cannot capture the dynamic changes of the 2D time-frequency feature representation between consecutive frames in an utterance, and therefore cannot provide highly discriminative feature parameters for natural speech emotion recognition. Although an LSTM-RNN can model temporal information, it over-emphasizes that information.

Disclosure of Invention

To overcome the defect of the prior art, namely that the dynamic changes of the 2D time-frequency feature representation between consecutive frames in an utterance cannot be captured and therefore highly discriminative feature parameters cannot be provided for natural speech emotion recognition, the invention provides a natural speech emotion recognition method based on multi-mode deep feature learning.

In order to achieve the purpose, the invention adopts the following technical scheme:

a natural speech emotion recognition method based on multi-mode deep feature learning comprises the following steps:

S1, generating appropriate multi-modal representations: three audio representations are generated from the original one-dimensional speech signal as inputs to the subsequent CNN models;

S2, learning multi-modal features with multiple deep convolutional neural network models;

S3, integrating the classification results of the different CNN models by score-level fusion, and outputting the final speech emotion recognition result.

In the scheme of the invention, deep multi-modal features are learned by multiple deep convolutional neural network models (Multi-CNN), namely a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN). These features are complementary to some extent and are used for natural speech emotion recognition. This solves the technical problem of the prior art that the dynamic changes of the 2D time-frequency feature representation between consecutive frames in an utterance cannot be captured. By fusing the multiple deep convolutional neural networks, the method learns complementary deep multi-modal features, significantly improves emotion classification performance, and provides highly discriminative features for natural speech emotion recognition.

Preferably, the step S1 includes the steps of:

S1.1, segmenting the one-dimensional raw speech waveform into segments of a set length and inputting the segments into a one-dimensional convolutional neural network;

S1.2, extracting a two-dimensional Mel spectrogram from the one-dimensional raw speech signal and constructing three-channel spectrogram segments, similar to RGB images, as the input of a two-dimensional convolutional neural network;

S1.3, forming video-like 3D dynamic segments from sequences of consecutive two-dimensional Mel spectrogram segments, and using them as the input of a three-dimensional convolutional neural network for spatio-temporal feature learning.

Preferably, the step S2 includes the steps of:

S2.1, modeling the one-dimensional raw speech waveform with a one-dimensional convolutional neural network: a one-dimensional convolutional neural network model is constructed and trained;

S2.2, modeling the two-dimensional Mel spectrogram with a two-dimensional convolutional neural network: an existing pre-trained AlexNet deep convolutional neural network model is fine-tuned on the target data, and the generated three-channel Mel spectrogram segments are resampled to the fixed input size of the AlexNet model;

S2.3, modeling spatio-temporal dynamic information with a three-dimensional convolutional neural network: spatio-temporal feature learning is performed on the extracted video-like 3D dynamic segments, and dropout regularization is used to avoid network overfitting.

Preferably, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max pooling layers, two fully connected layers and one softmax classification output layer; each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation layer, that is, the input data are normalized before the one-dimensional convolutional neural network is trained.

Preferably, the AlexNet deep convolutional neural network model includes five convolutional layers, three maximum pooling layers, and two fully connected layers.

Preferably, the fine tuning in step S2.2 comprises the following steps:

1) copying all network parameters from the pre-trained AlexNet deep convolutional neural network model to initialize the two-dimensional convolutional neural network;

2) replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector corresponds to the number of emotion classes in the data set used;

3) retraining the two-dimensional convolutional neural network with the standard back-propagation strategy.

Fine-tuning is widely used for transfer learning in computer vision and alleviates the problem of data insufficiency.

Preferably, the three-dimensional convolutional neural network in step S2.3 includes two three-dimensional convolutional layers, each with a batch normalization layer and a rectified linear unit (ReLU) activation layer, two three-dimensional max pooling layers, two fully connected layers and one softmax output layer.

Preferably, the step S3 includes the steps of:

S3.1, training the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters;

S3.2, applying an average pooling strategy to the segment classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, that is, averaging the classification results of all segments of an utterance, to produce emotion classification scores at the utterance level;

S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;

S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models with a score-level fusion strategy to perform the final emotion classification.
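Steps S3.2 and S3.3 above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `utterance_scores` and `predict` and the toy score values are assumptions introduced here.

```python
from typing import List, Sequence


def utterance_scores(segment_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-segment class scores of one utterance (average pooling)."""
    if not segment_scores:
        raise ValueError("an utterance needs at least one segment")
    num_classes = len(segment_scores[0])
    count = len(segment_scores)
    return [sum(seg[c] for seg in segment_scores) / count for c in range(num_classes)]


def predict(scores: Sequence[float]) -> int:
    """Return the index of the highest utterance-level score."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Three segments of one utterance, three hypothetical emotion classes.
segment_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
pooled = utterance_scores(segment_scores)
print(predict(pooled))  # 1
```

Although the first segment favors class 0, the averaged scores favor class 1, which is the kind of smoothing that utterance-level pooling provides.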

Preferably, step S3.4 may be expressed as:

score^fusion = λ₁·score^1D + λ₂·score^2D + λ₃·score^3D;

λ₁ + λ₂ + λ₃ = 1;

where λ₁, λ₂ and λ₃ are the weights of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively. The optimal values of λ₁, λ₂ and λ₃ are determined by searching the range [0, 1] in steps of 0.1.
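The weight search described above can be sketched as an exhaustive grid search. This is a hedged illustration: the text does not state which data or criterion the search uses, so selecting by accuracy on labeled utterances is an assumption, and `weight_grid`, `fuse`, `accuracy`, `search_weights` and the toy scores are hypothetical names and values.

```python
from typing import List, Sequence, Tuple

Weights = Tuple[float, float, float]
ScoreTable = Sequence[Sequence[float]]  # one row of class scores per utterance


def weight_grid(steps: int = 10) -> List[Weights]:
    """All (l1, l2, l3) on a 1/steps grid in [0, 1] that sum to 1."""
    return [
        (i / steps, j / steps, (steps - i - j) / steps)
        for i in range(steps + 1)
        for j in range(steps + 1 - i)
    ]


def fuse(weights: Weights, s1d: ScoreTable, s2d: ScoreTable, s3d: ScoreTable) -> List[List[float]]:
    """Weighted sum of the three utterance-level score tables."""
    l1, l2, l3 = weights
    return [
        [l1 * a + l2 * b + l3 * c for a, b, c in zip(r1, r2, r3)]
        for r1, r2, r3 in zip(s1d, s2d, s3d)
    ]


def accuracy(scores: ScoreTable, labels: Sequence[int]) -> float:
    """Fraction of utterances whose highest fused score matches the label."""
    hits = sum(
        1 for row, label in zip(scores, labels)
        if max(range(len(row)), key=lambda c: row[c]) == label
    )
    return hits / len(labels)


def search_weights(s1d: ScoreTable, s2d: ScoreTable, s3d: ScoreTable,
                   labels: Sequence[int]) -> Tuple[Weights, float]:
    """Return the grid point with the highest accuracy and that accuracy."""
    best = max(weight_grid(), key=lambda w: accuracy(fuse(w, s1d, s2d, s3d), labels))
    return best, accuracy(fuse(best, s1d, s2d, s3d), labels)


# Toy scores for two utterances and two classes (illustrative values only).
s1d = [[0.9, 0.1], [0.6, 0.4]]
s2d = [[0.4, 0.6], [0.2, 0.8]]
s3d = [[0.5, 0.5], [0.5, 0.5]]
labels = [0, 1]
best_weights, best_accuracy = search_weights(s1d, s2d, s3d, labels)
print(len(weight_grid()), best_accuracy)  # 66 1.0
```

With a step of 0.1 and the constraint that the weights sum to 1, only 66 weight triples need to be evaluated, so an exhaustive search is cheap.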

Preferably, the updating of the network parameters in step S3.1 is given by the following expression:

where W denotes the weights of the softmax layer, θ denotes the network parameters, the term evaluated at a_i denotes the output of the last fully connected (FC) layer for the input data a_i, y_i denotes the class label vector of the i-th segment, which is set equal to the emotion category of the whole utterance, and H denotes the softmax log-loss function, given by the following expression:

where C denotes the total number of emotion categories.
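The two expressions referred to in this passage do not appear in the text. A form consistent with the stated definitions is sketched below, assuming the usual softmax log-loss objective. The symbol $f(a_i;\theta)$ for the output of the last FC layer, the symbol $N$ for the number of training segments, and the argument name $p$ are placeholders introduced here, not notation taken from the source.

```latex
\min_{W,\theta} \; \sum_{i=1}^{N} H\bigl(\operatorname{softmax}\bigl(W \cdot f(a_i;\theta)\bigr),\, y_i\bigr)

H(p, y) = -\sum_{j=1}^{C} y_j \log p_j
```

Under this reading, each segment contributes one log-loss term, and because every segment carries the label of its utterance, $y_i$ is identical for all segments cut from the same utterance.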

The beneficial effects of the invention are as follows: the method solves the technical problems of the prior art that the dynamic changes of the 2D time-frequency feature representation between consecutive frames in an utterance cannot be captured, so that highly discriminative feature parameters cannot be provided for natural speech emotion recognition, and that temporal information is over-emphasized.

Drawings

FIG. 1 is a general architectural block diagram of the present invention.

FIG. 2 is the confusion matrix of the recognition results of score-level fusion on the AFEW5.0 dataset.

FIG. 3 is the confusion matrix of the recognition results of score-level fusion on the BAUM-1s dataset.

Detailed Description

The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.

Example 1: in this embodiment, a natural speech emotion recognition method based on multi-mode deep feature learning, as shown in FIG. 1, includes the following steps:

S1, generating appropriate multi-modal representations: three audio representations are generated from the original one-dimensional speech signal as inputs to the subsequent CNN models.

Step S1 includes the following steps:

S1.1, segmenting the one-dimensional raw speech waveform into segments of a set length, which are input to a one-dimensional convolutional neural network (1D-CNN). For best performance, the segment length is set to 625 frames, and the original speech signal is down-sampled to 22 kHz and scaled to [-256, 256]. After this scaling the data are naturally close to zero mean, so the mean does not need to be subtracted.

S1.2, extracting a two-dimensional Mel spectrogram from the one-dimensional raw speech signal and constructing three-channel spectrogram segments, similar to RGB images, as the input of a two-dimensional convolutional neural network (2D-CNN);

S1.3, forming video-like 3D dynamic segments from sequences of consecutive two-dimensional Mel spectrogram segments, and using them as the input of a three-dimensional convolutional neural network (3D-CNN) for spatio-temporal feature learning.
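The waveform preparation of step S1.1 can be sketched as follows. The text gives the segment length (625) and the target range [-256, 256] but not the scaling rule or how the tail of an utterance is handled. Peak normalization, non-overlapping segments and dropping the incomplete tail are therefore assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence

SEGMENT_LENGTH = 625  # segment length stated in the text


def scale_waveform(samples: Sequence[float], limit: float = 256.0) -> List[float]:
    """Linearly scale samples so the largest magnitude maps to +/- limit (assumed rule)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return [0.0 for _ in samples]
    return [s * limit / peak for s in samples]


def split_into_segments(samples: Sequence[float],
                        length: int = SEGMENT_LENGTH) -> List[List[float]]:
    """Cut the waveform into non-overlapping segments, dropping the incomplete tail."""
    count = len(samples) // length
    return [list(samples[i * length:(i + 1) * length]) for i in range(count)]


# Toy waveform: 2000 samples of a ramp from -1 to 1.
waveform = [-1.0 + 2.0 * n / 1999 for n in range(2000)]
scaled = scale_waveform(waveform)
segments = split_into_segments(scaled)
print(len(segments), len(segments[0]))  # 3 625
```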

In the experimental validation, we generate three-channel two-dimensional Mel spectrogram segments of size 64 × 64 × 3 as the input of the two-dimensional convolutional neural network. Specifically, we extract the log Mel spectrogram of the whole speech signal in the range of 20 Hz to 8000 Hz using 64 Mel filter banks, with a 25 ms Hamming window and a 10 ms overlap. A context window of 64 frames then divides the whole log Mel spectrogram into fixed-length segments, giving static segments of size 64 × 64. Next, we compute the first-order and second-order regression coefficients of each static segment along the time axis, which gives its first-derivative (delta) and second-derivative (delta-delta) coefficients. Finally, the static, delta and delta-delta coefficients form the three channels of a Mel spectrogram segment, analogous to a color RGB image in computer vision.
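The construction of the three-channel segments can be sketched with NumPy as follows. This is an illustration under stated assumptions: the regression window half-width of 2, edge padding at the segment borders and non-overlapping 64-frame segments are choices made here, not values given in the text, and the log Mel spectrogram is assumed to be already computed as an array of shape (mel bands, frames).

```python
import numpy as np


def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta along the time axis (axis 1), with edge padding."""
    padded = np.pad(features, ((0, 0), (width, width)), mode="edge")
    num_frames = features.shape[1]
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[:, width + n: width + n + num_frames]
        behind = padded[:, width - n: width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def three_channel_segments(log_mel: np.ndarray, frames: int = 64) -> np.ndarray:
    """Split a (bands, time) log-Mel array into (N, bands, frames, 3) segments."""
    count = log_mel.shape[1] // frames
    segments = []
    for k in range(count):
        static = log_mel[:, k * frames:(k + 1) * frames].astype(float)
        d1 = delta(static)   # first-order regression coefficients
        d2 = delta(d1)       # second-order regression coefficients
        segments.append(np.stack([static, d1, d2], axis=-1))
    if not segments:
        return np.empty((0, log_mel.shape[0], frames, 3))
    return np.stack(segments)


# Toy input: 64 bands, 200 frames, each band a linear ramp in time.
log_mel = np.tile(np.arange(200, dtype=float), (64, 1))
segs = three_channel_segments(log_mel)
print(segs.shape)  # (3, 64, 64, 3)
```

For the linear ramp, the delta channel equals the slope (1.0) away from the segment borders and the delta-delta channel is 0.0 there, which is a quick sanity check on the regression formula.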

S2, learning multi-modal features with the multiple deep convolutional neural network model (Multi-CNN).

Step S2 includes the following steps:

S2.1, modeling the one-dimensional raw speech waveform with a one-dimensional convolutional neural network: a one-dimensional convolutional neural network model is constructed and trained.

As shown in Table 1, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max pooling layers, two fully connected layers and one softmax classification output layer. The stride of the convolutional layers and the max pooling layers is set to 1. Each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation layer, that is, the input data are normalized before the one-dimensional convolutional neural network is trained. The output of the softmax layer corresponds to the emotion categories of the data set used.

TABLE 1

S2.2, modeling the two-dimensional Mel spectrogram with a two-dimensional convolutional neural network: an existing pre-trained AlexNet deep convolutional neural network model is fine-tuned on the target data, and the generated three-channel Mel spectrogram segments are resampled to the fixed input size of the AlexNet model. As shown in Table 2, the AlexNet deep convolutional neural network model includes five convolutional layers, three max pooling layers and two fully connected layers.

TABLE 2

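The fine-tuning in step S2.2 comprises the following steps:

1) copying all network parameters from the pre-trained AlexNet deep convolutional neural network model to initialize the two-dimensional convolutional neural network;

2) replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector corresponds to the number of emotion classes in the data set used;

3) retraining the two-dimensional convolutional neural network with the standard back-propagation strategy.

S2.3, modeling spatio-temporal dynamic information with a three-dimensional convolutional neural network: spatio-temporal feature learning is performed on the extracted video-like 3D dynamic segments, and dropout regularization is used to avoid network overfitting.

As shown in Table 3, the three-dimensional convolutional neural network includes two three-dimensional convolutional layers, each with a batch normalization layer and a rectified linear unit (ReLU) activation layer, two three-dimensional max pooling layers, two fully connected layers and one softmax output layer.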

TABLE 3

S3, integrating the classification results of the different CNN models by score-level fusion, and outputting the final speech emotion recognition result.

The feature representations obtained from the one-dimensional and three-dimensional convolutional neural networks capture acoustic characteristics that are completely different from those captured by the two-dimensional convolutional neural network, which takes a 2D time-frequency representation as input. This indicates that the emotion features learned by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks may be complementary. They are therefore integrated in a fusion network of multiple convolutional neural networks, which may further improve speech emotion classification performance.

Step S3 includes the following steps:

S3.1, training the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters, where the update of the network parameters is given by the following expression:

where the term evaluated at a_i denotes the output of the last fully connected (FC) layer for the input data a_i, y_i denotes the class label vector of the i-th segment, which is set equal to the emotion category of the whole utterance, and H denotes the softmax log-loss function, given by the following expression:

where C denotes the total number of emotion categories.

S3.2, applying an average pooling strategy to the segment classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, that is, averaging the classification results of all segments of an utterance, to produce emotion classification scores at the utterance level;

S3.4, combining the utterance-level classification scores (score) obtained by the different convolutional neural network models (1D-CNN, 2D-CNN, 3D-CNN) with a score-level fusion strategy to perform the final emotion classification, which can be expressed as:

score^fusion = λ₁·score^1D + λ₂·score^2D + λ₃·score^3D;

λ₁ + λ₂ + λ₃ = 1;

where λ₁, λ₂ and λ₃ are the weights of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively. The optimal values of λ₁, λ₂ and λ₃ are determined by searching the range [0, 1] in steps of 0.1.

To verify the effectiveness of the proposed method for natural speech emotion recognition, we performed experiments on two challenging natural emotional speech datasets, AFEW5.0 and BAUM-1s, and did not use any acted emotional speech dataset.

The AFEW5.0 dataset contains 7 emotion categories, namely anger, happiness, sadness, disgust, surprise, fear and neutral, which were labeled by three invited annotators. The dataset is divided into three parts: a training set Train (723 samples), a validation set Val (383 samples) and a test set Test (539 samples). We do not use the test set because it is only available to researchers taking part in the competition.

BAUM-1s contains not only 6 basic emotion categories (anger, happiness, sadness, disgust, fear and surprise) but also other mental states such as uncertainty, thinking, concentration and confusion. Here we focus only on recognizing the 6 basic emotion categories, which gives a subset of 521 emotional video samples.

A. Setting of experiments

For training the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, the mini-batch size of the input data is 30, the maximum number of epochs is 300, and the learning rate is 0.001. To speed up training, an NVIDIA GTX TITAN X GPU with 12 GB of memory is used. Natural speech emotion recognition is evaluated with a speaker-independent cross-validation strategy, which is the setting mainly encountered in real scenarios.

On the AFEW5.0 dataset, the original training set (Train) is used for training and the validation set (Val) is used as the test set. On the BAUM-1s dataset, which contains 31 Turkish speakers, a leave-one-speakers-group-out (LOSGO) cross-validation strategy with 5 speaker groups is employed, and the average recognition accuracy over the five test runs is reported.
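A speaker-independent split with five speaker groups can be sketched as follows. The text does not say how the 31 speakers are assigned to groups, so the round-robin assignment in `speaker_groups` is a hypothetical rule, and the sample identifiers are made up for illustration.

```python
from typing import Iterator, List, Tuple

Sample = Tuple[str, str]  # (speaker id, utterance name)


def speaker_groups(speakers: List[str], num_groups: int = 5) -> List[List[str]]:
    """Assign speakers to groups round-robin (a hypothetical assignment rule)."""
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    for index, speaker in enumerate(speakers):
        groups[index % num_groups].append(speaker)
    return groups


def losgo_folds(samples: List[Sample],
                num_groups: int = 5) -> Iterator[Tuple[List[Sample], List[Sample]]]:
    """Yield (train, test) lists; each fold holds out one whole speaker group."""
    speakers = sorted({speaker for speaker, _ in samples})
    for group in speaker_groups(speakers, num_groups):
        held_out = set(group)
        train = [s for s in samples if s[0] not in held_out]
        test = [s for s in samples if s[0] in held_out]
        yield train, test


# 31 toy speakers with two utterances each.
samples = [(f"spk{n:02d}", f"utt{n:02d}_{k}") for n in range(31) for k in range(2)]
folds = list(losgo_folds(samples))
print([len(test) for _, test in folds])  # [14, 12, 12, 12, 12]
```

Because whole speakers are held out, no speaker contributes utterances to both the training and the test side of a fold.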

Note that in the experiments, we divide the Mel spectrogram of the whole utterance extracted from each audio sample into a number of Mel spectrogram segments and perform segment-level feature learning with the convolutional neural networks. In this case, the emotion category of each Mel spectrogram segment is set to the utterance-level emotion label.

B. Network training

The one-dimensional, two-dimensional and three-dimensional convolutional neural networks are trained and their network parameters are updated, where the update of the network parameters is given by the following expression:

where the term evaluated at a_i denotes the output of the last FC layer for the input data a_i, y_i denotes the class label vector of the i-th segment, which is set equal to the utterance-level emotion category, and H denotes the softmax log-loss function, given by the following expression:

where C denotes the total number of emotion categories.

C. Results and analysis

The length of the sample segments input to the one-dimensional convolutional neural network (1D-CNN) may have a significant impact on its performance. We therefore conducted a preliminary study in which four different sample segment lengths (125, 625, 3125 and 15625 frames), each with a corresponding number of convolutional layers, were tested as input to the 1D-CNN. Table 4 shows the recognition performance of the four sample segment lengths together with the number of convolutional layers. Note that each convolutional layer is followed by a max pooling layer, except the last convolutional layer, which is equivalent to a fully connected layer.

TABLE 4

As shown in Table 4, the sample segment length of 625 frames performs best among the four lengths on both the AFEW5.0 and BAUM-1s datasets, with an accuracy of 24.02% on AFEW5.0 and 37.37% on BAUM-1s. A larger sample segment length can help performance, but a length that is too large does not necessarily improve it, possibly because a larger segment length reduces the number of samples available for training the one-dimensional convolutional neural network. We therefore set the sample segment length of the one-dimensional convolutional neural network to 625 frames.
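The effect of segment length on the number of training samples can be illustrated with simple arithmetic. The figures below are assumptions chosen for illustration: a 3-second utterance, a sampling rate of exactly 22,000 Hz, one waveform sample per "frame", and non-overlapping segments.

```python
SAMPLE_RATE_HZ = 22_000        # assumed exact value for the stated 22 kHz
UTTERANCE_SECONDS = 3.0        # hypothetical utterance duration
SEGMENT_LENGTHS = (125, 625, 3125, 15625)  # lengths compared in the text


def segments_per_utterance(num_samples: int, segment_length: int) -> int:
    """Number of complete non-overlapping segments in one utterance."""
    return num_samples // segment_length


num_samples = int(SAMPLE_RATE_HZ * UTTERANCE_SECONDS)
counts = {length: segments_per_utterance(num_samples, length) for length in SEGMENT_LENGTHS}
print(counts)  # {125: 528, 625: 105, 3125: 21, 15625: 4}
```

Under these assumptions, each five-fold increase in segment length cuts the number of training segments per utterance by roughly a factor of five.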

Because the extracted three-channel Mel spectrogram segments resemble RGB images and serve as the input of the 2D-CNN, it is feasible to fine-tune existing deep models pre-trained on ImageNet data. To evaluate the effect of fine-tuning different pre-trained deep network models, we compare the fine-tuned recognition performance of three typical deep network models, namely AlexNet, VGG-16 and ResNet-50, on the target emotion datasets. The recognition results of these deep network models are obtained by averaging the classification scores of all segments of an utterance and then taking the maximum.

Table 5 shows the fine-tuning recognition results of the three typical deep network models (AlexNet, VGG-16 and ResNet-50). As can be seen from Table 5, AlexNet performs slightly better than VGG-16 and ResNet-50. On the AFEW5.0 database, AlexNet reaches an accuracy of 29.24%, while VGG-16 and ResNet-50 reach 28.16% and 28.55%, respectively. On the BAUM-1s database, AlexNet, VGG-16 and ResNet-50 reach 42.29%, 41.73% and 41.97%, respectively. This indicates that deeper network models such as VGG-16 and ResNet-50 do not bring a significant improvement over the shallower AlexNet, probably because the emotion datasets used are very limited, so the number of speech samples produced is not sufficient to train deeper networks.

TABLE 5

For spatio-temporal feature learning, sequences of consecutive two-dimensional Mel spectrogram segments form video-like 3D dynamic segments that serve as the input of the three-dimensional convolutional neural network (3D-CNN). The length of a video-like segment is the number of consecutive two-dimensional Mel spectrogram segments it contains. This length also significantly affects the performance of the 3D-CNN.

To evaluate different video segment lengths as input to the 3D-CNN, we report the recognition results for four different video segment lengths (4, 6, 8 and 10 Mel spectrogram segments). For these lengths, the three-dimensional convolutional neural network has the same structure except for the first convolutional layer. In the first convolutional layer (Conv1), the depth of the three-dimensional filter, that is, the number of consecutive Mel spectrogram segments it spans, is equal to the corresponding video segment length. Table 6 gives the performance of the four video segment lengths together with the depth of the three-dimensional filter of the first convolutional layer.

TABLE 6

As can be seen from Table 6, a video segment length of 4 consecutive Mel spectrogram segments achieves the best performance on the AFEW5.0 and BAUM-1s databases, with accuracies of 28.46% and 37.97%, respectively. As the video segment length increases, the performance of the three-dimensional convolutional neural network degrades, possibly because the number of video segments available for training decreases as the video segment length increases.
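The grouping of consecutive two-dimensional segments into video-like segments can be sketched with NumPy as follows. Whether neighboring video-like segments overlap is not stated in the text. Non-overlapping grouping is assumed here, and `make_clips` is a hypothetical helper name.

```python
import numpy as np


def make_clips(segments: np.ndarray, clip_length: int = 4) -> np.ndarray:
    """Group consecutive 2D segments (N, H, W, C) into clips (M, clip_length, H, W, C)."""
    count = segments.shape[0] // clip_length
    usable = segments[:count * clip_length]  # drop the incomplete tail
    return usable.reshape((count, clip_length) + segments.shape[1:])


# Ten toy 64 x 64 x 3 segments from one utterance.
segments = np.zeros((10, 64, 64, 3))
for length in (4, 6, 8, 10):
    print(length, make_clips(segments, length).shape[0])
# 4 2
# 6 1
# 8 1
# 10 1
```

The clip count shrinks as the clip length grows, which matches the explanation given above for the drop in accuracy at longer video segment lengths.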

In the experiments, we present and compare two methods for fusing the multiple deep convolutional neural networks: feature-level fusion and score-level fusion. For feature-level fusion, we first extract utterance-level features for each convolutional neural network by applying average pooling to the segment features given by the output of its last fully connected layer. The three utterance-level feature vectors from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks are then concatenated directly into a single 5376-D feature vector, and a linear support vector machine (SVM) performs the final emotion classification.
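The pooling and concatenation used for feature-level fusion can be sketched as follows. The per-network feature sizes are not stated in this text. The values 1024, 4096 and 256 in `DIMS` are hypothetical and chosen only because they sum to 5376, and the final linear SVM step is omitted.

```python
from typing import Dict, List, Sequence

# Hypothetical feature sizes; only their sum (5376) comes from the text.
DIMS: Dict[str, int] = {"1d": 1024, "2d": 4096, "3d": 256}


def average_pool(segment_features: Sequence[Sequence[float]]) -> List[float]:
    """Average segment-level feature vectors into one utterance-level vector."""
    count = len(segment_features)
    dim = len(segment_features[0])
    return [sum(f[d] for f in segment_features) / count for d in range(dim)]


def fuse_features(per_network: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Concatenate the pooled features of each network into one vector."""
    fused: List[float] = []
    for segment_features in per_network:
        fused.extend(average_pool(segment_features))
    return fused


# Toy features: each network sees a different number of segments per utterance.
features_1d = [[float(k)] * DIMS["1d"] for k in range(5)]
features_2d = [[float(k)] * DIMS["2d"] for k in range(3)]
features_3d = [[float(k)] * DIMS["3d"] for k in range(2)]
fused = fuse_features([features_1d, features_2d, features_3d])
print(len(fused))  # 5376
```

Average pooling makes the utterance-level vector independent of how many segments each network produced, so the three vectors can be concatenated directly.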

Table 7 lists the recognition results of the different fusion methods for multiple deep convolutional neural networks and of the single convolutional neural networks at their best performance. For the AFEW5.0 database, the optimal weight values are 0.3, 0.4; for the BAUM-1s database, the optimal weight values are 0.2, 0.5, 0.3. The results in Table 7 show the following:

1) The two-dimensional convolutional neural network (2D-CNN) performs best, followed by the three-dimensional convolutional neural network (3D-CNN) and the one-dimensional convolutional neural network (1D-CNN). This shows that fine-tuning the existing AlexNet model, pre-trained on ImageNet data, with the generated RGB-like two-dimensional Mel spectrogram segments is effective, and that it alleviates the difficulty of training deep neural networks on insufficient emotion data.

2) Score-level fusion performs better than feature-level fusion. This shows that score-level fusion is more suitable for fusing multiple deep neural networks.

3) Fusing multiple convolutional neural networks, at either the feature level or the score level, performs better than any single one-dimensional, two-dimensional or three-dimensional convolutional neural network. This shows that the multi-modal deep features learned by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks are complementary, so integrating them in a fusion network of multiple deep convolutional neural networks yields significantly improved emotion classification performance.

TABLE 7

To show the recognition accuracy of each emotion, FIG. 2 and FIG. 3 give the confusion matrices of the recognition results obtained with score-level fusion, which reaches recognition accuracies of 35.77% and 44.06% on the two datasets, respectively.

As shown in FIG. 2, on the AFEW5.0 database the three emotions "anger", "neutral" and "fear" are recognized with accuracies of 56.25%, 50.79% and 43.48%, respectively. The classification accuracy of the other four emotions, namely "disgust", "happiness", "sadness" and "surprise", is below 33%.

As can be seen from FIG. 3, on the BAUM-1s database the two emotions "sadness" and "happiness" are recognized with accuracies of 70.90% and 55.49%, respectively. The accuracy of the other four emotions, namely "anger", "fear", "disgust" and "surprise", is below 29%.
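Per-emotion accuracies of the kind quoted above are read from the rows of a confusion matrix. The sketch below assumes that rows hold the true classes and columns the predicted classes. The counts are toy values for illustration, not the data behind FIG. 2 or FIG. 3.

```python
from typing import Dict, List, Sequence


def per_class_accuracy(confusion: Sequence[Sequence[int]],
                       labels: Sequence[str]) -> Dict[str, float]:
    """Fraction of each true class (row) that was predicted correctly."""
    result: Dict[str, float] = {}
    for i, name in enumerate(labels):
        total = sum(confusion[i])
        result[name] = confusion[i][i] / total if total else 0.0
    return result


def overall_accuracy(confusion: Sequence[Sequence[int]]) -> float:
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


# Toy counts for three classes (illustrative only).
labels: List[str] = ["anger", "happiness", "sadness"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [1, 1, 8],
]
print(per_class_accuracy(confusion, labels))
# {'anger': 0.8, 'happiness': 0.6, 'sadness': 0.8}
```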

In the scheme of the invention, a multi-deep convolutional neural network model (Multi-CNN), consisting of a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), learns deep multi-modal features that are complementary to one another and uses them for natural speech emotion recognition. This solves the technical problem that the prior art cannot capture the dynamic change information of the 2D time-frequency feature representation across consecutive frames of an utterance. By fusing the multiple deep convolutional neural networks to learn these complementary deep multi-modal features, the emotion classification performance is significantly improved, and features with good discriminative power are provided for natural speech emotion recognition.

Claims

1. A natural speech emotion recognition method based on multi-mode deep feature learning is characterized by comprising the following steps:

2. The method for recognizing natural speech emotion based on multi-mode deep feature learning of claim 1, wherein said step S1 includes the following steps:

3. The method for recognizing natural speech emotion based on multi-mode deep feature learning of claim 1, wherein said step S2 includes the following steps:

4. The method according to claim 3, wherein the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max pooling layers, two fully-connected layers and one softmax classification output layer, and each one-dimensional convolutional layer comprises a batch normalization layer and a rectified linear unit (ReLU) activation function layer, that is, the input data is normalized before the one-dimensional convolutional neural network is trained.

5. The method according to claim 3, wherein the AlexNet deep convolutional neural network model comprises five convolutional layers, three max pooling layers and two fully-connected layers.

6. The method for natural speech emotion recognition based on multimode deep feature learning as claimed in claim 3, wherein the fine tuning in step S2.2 comprises the following steps:

7. The method according to claim 1, wherein the three-dimensional convolutional neural network in step S2.3 comprises two three-dimensional convolutional layers, two three-dimensional max pooling layers, two fully-connected layers and one softmax output layer, and each three-dimensional convolutional layer comprises a batch normalization layer and a rectified linear unit (ReLU) activation function layer.

8. The method for recognizing natural speech emotion based on multi-mode deep feature learning of claim 1, wherein said step S3 includes the following steps:

9. The method for recognizing natural speech emotion based on multi-mode deep feature learning as claimed in claim 8, wherein said step S3.4 can be expressed as:

score^fusion = λ₁·score^1D + λ₂·score^2D + λ₃·score^3D;

λ₁ + λ₂ + λ₃ = 1;

wherein λ₁, λ₂ and λ₃ represent the weight values of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively, and the optimal values of λ₁, λ₂ and λ₃ are determined by searching the range [0, 1] in steps of 0.1.
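The weighted score fusion and the step-0.1 weight search described in claim 9 can be sketched as follows. This is an illustrative sketch only: the function names, the toy score vectors, and the use of classification accuracy on a held-out set as the selection criterion are assumptions, since the claim does not state how the optimal weights are judged.

```python
# Score-level fusion with a grid search over (lambda1, lambda2, lambda3).
# ASSUMPTIONS: function names, toy scores and the accuracy-based selection
# criterion are hypothetical; only the fusion formula, the sum-to-one
# constraint and the 0.1 step come from the claim.

def fuse(scores_1d, scores_2d, scores_3d, weights):
    """Weighted sum of the three per-class score vectors."""
    l1, l2, l3 = weights
    return [l1 * a + l2 * b + l3 * c
            for a, b, c in zip(scores_1d, scores_2d, scores_3d)]


def candidate_weights(steps=10):
    """Yield all (l1, l2, l3) on a 1/steps grid in [0, 1] that sum to 1."""
    # Integer indices avoid floating-point drift when stepping by 0.1.
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j
            yield (i / steps, j / steps, k / steps)


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def search_weights(samples, labels):
    """Return the weights with the highest accuracy and that accuracy."""
    best_w, best_acc = None, -1.0
    for w in candidate_weights():
        correct = sum(argmax(fuse(s1, s2, s3, w)) == y
                      for (s1, s2, s3), y in zip(samples, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Hypothetical two-class scores from the 1D-, 2D- and 3D-CNN for three utterances.
samples = [
    ([0.7, 0.3], [0.4, 0.6], [0.2, 0.8]),
    ([0.6, 0.4], [0.7, 0.3], [0.45, 0.55]),
    ([0.2, 0.8], [0.55, 0.45], [0.4, 0.6]),
]
labels = [1, 0, 1]

best_w, best_acc = search_weights(samples, labels)
```

Because the three weights must sum to 1, only two of them are free, so the grid contains 66 candidate triples rather than 11 × 11 × 11.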

10. The method for natural speech emotion recognition based on multimode deep feature learning of claim 8, wherein the step S3.1 of updating network parameters is represented by the following expression:

wherein W represents the weight values of the softmax layer for the network with parameters θ, γ(a_i; θ) represents the output of the last fully-connected (FC) layer for the input data a_i, y_i represents the class label vector of the i-th segment, which is equal to the emotion category of the whole utterance, and H represents the softmax log-loss function, wherein H is expressed by the following expression:

where C represents the total number of emotion categories.
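The two expressions referred to in claim 10 do not appear in this text. One form consistent with the symbol definitions above is sketched below; it is an assumption, not the claim's exact wording. N (the number of training segments) and p (the softmax output vector) are symbols introduced here for illustration.

```latex
\min_{W,\theta} \sum_{i=1}^{N} H\bigl(\mathrm{softmax}(W \cdot \gamma(a_i;\theta)),\; y_i\bigr),
\qquad
H(p, y) = -\sum_{j=1}^{C} y_j \log p_j
```

Under this reading, every segment of an utterance contributes its own loss term, with the utterance's emotion label used as the target y_i.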