CN109036465B - Speech emotion recognition method - Google Patents

Speech emotion recognition method

Info

Publication number
CN109036465B
CN109036465B
Authority
CN
China
Prior art keywords
layer
neural network
convolutional neural
spectrogram
speech emotion
Prior art date
Legal status
Active
Application number
CN201810685220.0A
Other languages
Chinese (zh)
Other versions
CN109036465A (en
Inventor
孙林慧
陈嘉
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201810685220.0A priority Critical patent/CN109036465B/en
Publication of CN109036465A publication Critical patent/CN109036465A/en
Application granted granted Critical
Publication of CN109036465B publication Critical patent/CN109036465B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a speech emotion recognition method, which comprises the following steps: S1, converting the voice data used in the experiment into spectrograms; S2, performing data amplification processing on the obtained spectrograms; S3, constructing, on the basis of the traditional convolutional neural network, a convolutional neural network that fuses deep and shallow layer features; and S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and with the convolutional neural network with fused deep and shallow layer features, respectively, and comparing their speech emotion recognition rates. Compared with the traditional convolutional neural network, the convolutional neural network with fused deep and shallow features provided by the invention can fully extract spectrogram features and thus improve the speech emotion recognition rate; by reducing the dimension of the shallow features, it can fully fuse them with the deep features and thereby obtain features that better represent the various emotions. The invention not only effectively improves the speech emotion recognition rate while ensuring recognition accuracy, but also has better generalization capability.

Description

Speech emotion recognition method
Technical Field
The invention relates to a speech emotion recognition method, in particular to a speech emotion recognition method based on convolutional neural network deep and shallow layer feature fusion, and relates to the technical field of speech emotion recognition.
Background
Emotion, as a complex human psychological behavior, has always been a research hotspot in many fields such as psychology and artificial intelligence. The speech signal is the most natural mode of communication between people; it carries not only the content to be conveyed but also rich emotional cues, and it is now widely used in emotion research.
Speech emotion recognition studies the formation and change of a speaker's emotional state from the perspective of the speech signal, so as to make the interaction between computers and humans more intelligent. In current research, the acoustic features used for emotion recognition mainly include spectral-related features, prosodic features, psychoacoustic features, and fusions of these features. Moreover, research on these features often focuses only on the time domain or the frequency domain. However, the correlation between the frequency-domain and time-domain behaviour of a speech signal plays an important role in speech emotion recognition. The spectrogram is a visual representation of the speech signal: its horizontal axis represents time, its vertical axis represents frequency, and it links the two domains. If the frequency points of the spectrogram are modelled as image pixels, the relationship between adjacent frequency points can be studied using image features, so the result expresses not only the time-frequency characteristics of the speech but also the language characteristics of the speaker. Many researchers have already combined image processing and speech processing via the spectrogram and obtained good results.
Methods of speech emotion recognition are generally divided into two categories: a conventional machine learning method and a deep learning method. However, in any method, feature extraction is an important step in the speech emotion recognition process. The key of the traditional machine learning method is feature selection, which is directly related to the accuracy of speech emotion recognition. To date, a large number of spectral-related features, prosodic features, and psychoacoustic features have been used for speech emotion recognition, but these features may not be sufficient to express subjective emotion. Compared with the traditional machine learning method, the deep learning method can extract high-level features, and has achieved certain achievement in vision-related tasks.
In recent years, Deep Convolutional Neural Networks (DCNNs) have made great progress in the study of speech emotion recognition. However, in a conventional convolutional neural network, as the convolutional layer is deeper, the feature mapping dimension is smaller and smaller, and the features are more abstract, so that the semantic features become more obvious, however, the global information of the spectrogram becomes more fuzzy. The shallow features can provide global information, but the semantic features are not obvious, while the deep features provide enough semantic features but lack global information, so that the finally extracted emotional features cannot accurately distinguish various emotions.
In summary, how to provide a speech emotion recognition method based on deep and shallow feature fusion in a convolutional neural network, which fuses deep features and shallow features together to obtain more discriminative emotion features and thereby overcomes the shortcomings of the conventional convolutional neural network in speech emotion recognition, is an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above defects in the prior art, the present invention aims to provide a speech emotion recognition method based on convolutional neural network deep and shallow feature fusion.
Specifically, the speech emotion recognition method comprises the following steps:
s1, converting the voice data used in the experiment into a spectrogram;
s2, performing data amplification processing on the obtained spectrogram;
s3, constructing a convolutional neural network fusing deep and shallow layer characteristics by using the traditional convolutional neural network;
and S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and with the convolutional neural network with fused deep and shallow layer features, respectively, and comparing the speech emotion recognition rates of the two networks.
Preferably, the voice data of S1 come from the German Berlin emotional speech database; the sampling frequency of the voice data is 16 kHz with 16-bit quantization; the voice data comprise seven types of emotion, namely anger, boredom, disgust, fear, happiness, neutral and sadness.
Preferably, the step of converting the voice data used in the experiment into the spectrogram in S1 includes the following steps:
S11, framing each segment of voice data to obtain x(m, n), wherein n is the frame length and m is the number of frames;
S12, performing an FFT on x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)';
S13, taking 10·log10(Y(m, n)), converting m to a time scale M and n to a frequency scale N;
S14, plotting (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
Preferably, the data amplification processing of the obtained spectrograms in S2 comprises: performing data amplification on the spectrograms with the Keras deep learning framework, including random image rotation, horizontal translation, vertical translation, shear transformation, image scaling and horizontal flipping.
Preferably, the convolutional neural network with fused deep and shallow layer features comprises an input layer, an intermediate layer and an output layer, wherein the intermediate layer comprises convolutional layers, pooling layers and fully connected layers.
Preferably, the mapping relation of the convolutional layer is

$$x_j^l = f\Bigl(\sum_i x_i^{l-1} * k_{ij}^l + b_j^l\Bigr)$$

wherein x_j^l is the j-th feature set of the l-th convolutional layer; x_i^{l-1} represents the i-th feature set of the (l-1)-th convolutional layer; k_{ij}^l represents the convolution kernel between the two feature sets; * denotes a two-dimensional convolution operation; b_j^l denotes an additive bias; and f(·) is the activation function of the convolutional layer.
Preferably, the mapping relation of the pooling layer is

$$x_j^l = f_p\Bigl(\beta_j^l\,\mathrm{down}\bigl(x_j^{l-1}\bigr) + b_j^l\Bigr)$$

wherein f_p(·) is the activation function of the pooling layer; down(·) denotes the pooling method from layer l-1 to layer l, including mean pooling and max pooling; β_j^l and b_j^l represent the multiplicative bias and the additive bias, respectively.
Preferably, the matrix features output by the last pooling layer are arranged into a vector to form a grid layer, and the grid layer is connected to the fully connected layer; the output relation of any node j in the grid layer is

$$y_j = f_h\Bigl(\sum_i w_{i,j}\,x_i - \theta_j\Bigr)$$

wherein f_h(·) represents an activation function; w_{i,j} represents the weight between the input vector element x_i and node j; and θ_j is the node threshold.
Preferably, the fully connected layer adopts a Softmax model to solve the multi-classification problem; the loss function of Softmax is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}}$$

wherein z_j^l represents the input of the j-th neuron in layer l (usually the last layer); Σ_n e^{z_n^l} represents the sum over the inputs of all neurons in layer l; a_j^l = e^{z_j^l} / Σ_n e^{z_n^l} represents the output of the j-th neuron in layer l; e is the natural constant; and 1(·) is the indicator function, whose value is 1 when the expression in the braces is true and 0 when it is false.
Preferably, in the fully connected layer, over-large parameters are penalized during training by introducing a weight decay term; the specific expression is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}} + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$$

wherein (λ/2)·Σ_{i,j} θ_{ij}² is the weight decay term.
Compared with the prior art, the invention has the advantages that:
compared with the traditional convolutional neural network, the convolutional neural network with the fusion of the deep and shallow features provided by the invention can fully extract the spectrogram features so as to improve the speech emotion recognition rate, and can fully fuse the shallow features with the deep features by reducing the dimension of the shallow features, thereby obtaining the features which can represent various emotions. The invention not only can effectively improve the speech emotion recognition rate and ensure the recognition accuracy, but also has more excellent generalization capability. Test results show that the recognition rate of four kinds of emotions, namely boredom, aversion, happiness and neutrality, is improved to a certain extent in seven kinds of emotions of Germany Berlin library by using the convolutional neural network with the fused deep and shallow layer characteristics, particularly the recognition rate of the happiness and the neutrality is greatly improved, and the overall recognition rate is improved by 1.58%.
Meanwhile, the invention combines a transfer learning method, and utilizes the parameters of the traditional convolutional neural network training model as initialization parameters under the condition that the convolutional neural network becomes complex, thereby accelerating the convergence speed in the training process and improving the overall recognition speed and recognition efficiency.
In addition, the invention also provides reference for other related problems in the same field, can be expanded and extended on the basis of the reference, is applied to other technical schemes of voice recognition or emotion recognition algorithms in the field, and has very wide application prospect.
In conclusion, the invention provides a speech emotion recognition method based on deep and shallow feature fusion in a convolutional neural network, which has high practical and popularization value.
The following detailed description of the embodiments of the present invention is provided in conjunction with the accompanying drawings to facilitate understanding of the technical solutions of the present invention.
Drawings
FIG. 1 is a spectrogram sample of a number of emotional voices of a Berlin corpus used in the present invention;
FIG. 2 is a conventional convolutional neural network;
FIG. 3 is a modified convolutional neural network of the present invention;
FIG. 4 is a diagram of a conventional convolutional neural network training process;
FIG. 5 is a diagram of the improved convolutional neural network training process in the present invention.
Detailed Description
As shown in the attached drawings, the invention discloses a speech emotion recognition method, which comprises the following steps:
and S1, converting the voice data used in the experiment into a spectrogram.
The invention uses the German Berlin emotional speech database, with a sampling frequency of 16 kHz and 16-bit quantization; it contains seven emotions, namely anger, boredom, disgust, fear, happiness, neutral and sadness.
Specifically, the step of converting the voice data used in the experiment into the spectrogram in S1 includes the following steps:
S11, each speech segment is first framed, yielding x(100, 512), where n = 512 is the frame length and m = 100 is the number of frames;
S12, an FFT is performed to obtain X(100, 512), and the periodogram Y(100, 512) = X(100, 512) · X(100, 512)' is computed;
S13, 10·log10(Y(100, 512)) is then taken, the frame index is converted to a time scale M, and the frame length to a frequency scale N;
and S14, (M, N, 10·log10(Y(m, n))) is plotted as a two-dimensional image, which is the spectrogram. Spectrograms of some of the samples are shown in FIG. 1.
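For illustration only, the following Python sketch shows one way to carry out steps S11-S14 with NumPy and Matplotlib; the non-overlapping framing, the use of a real FFT and the plotting style are assumptions of the sketch rather than details fixed by the invention.

```python
import numpy as np
import matplotlib.pyplot as plt

def spectrogram(signal, fs=16000, frame_len=512, num_frames=100):
    """Rough sketch of S11-S14: frame, FFT, periodogram, 10*log10, plot."""
    # S11: cut the signal into m = 100 non-overlapping frames of n = 512 samples
    x = signal[:num_frames * frame_len].reshape(num_frames, frame_len)
    # S12: FFT of each frame and periodogram Y = X * conj(X) = |X|^2
    X = np.fft.rfft(x, axis=1)
    Y = (X * np.conj(X)).real
    # S13: convert to dB and build the time (M) and frequency (N) scales
    Y_db = 10.0 * np.log10(Y + 1e-10)            # small offset avoids log(0)
    M = np.arange(num_frames) * frame_len / fs   # frame index -> seconds
    N = np.fft.rfftfreq(frame_len, d=1.0 / fs)   # bin index -> Hz
    # S14: draw (M, N, 10*log10(Y)) as a two-dimensional image
    plt.pcolormesh(M, N, Y_db.T, shading="auto")
    plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)")
    plt.savefig("spectrogram.png")
    return Y_db

if __name__ == "__main__":
    spectrogram(np.random.randn(100 * 512))  # placeholder signal
```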
And S2, performing data amplification processing on the obtained spectrograms. To meet the deep neural network's demand for large amounts of data, the spectrograms are amplified with the Keras deep learning framework; the main operations are random image rotation, horizontal translation, vertical translation, shear transformation, image scaling, horizontal flipping and vertical (up-down) flipping, which finally yields the large amount of data required for the experiments.
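A minimal sketch of this amplification step, assuming the Keras ImageDataGenerator API and a hypothetical directory spectrograms/train organized by emotion class; the concrete ranges (rotation angle, shift fractions, zoom factor) and the 227x227 target size are illustrative assumptions, not the values used in the experiments.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Amplification operations named in S2: rotation, horizontal/vertical translation,
# shear, zoom (scaling) and flipping. Concrete ranges are illustrative only.
datagen = ImageDataGenerator(
    rotation_range=10,        # random image rotation
    width_shift_range=0.1,    # horizontal translation
    height_shift_range=0.1,   # vertical translation
    shear_range=0.1,          # shear transformation
    zoom_range=0.1,           # image scaling
    horizontal_flip=True,     # horizontal flipping
    vertical_flip=True)       # vertical (up-down) flipping

# Stream amplified spectrogram images, resized to an assumed 227x227 input size.
train_flow = datagen.flow_from_directory(
    "spectrograms/train", target_size=(227, 227),
    batch_size=32, class_mode="categorical")
```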
And S3, constructing the convolutional neural network fusing the deep and shallow layer characteristics by the traditional convolutional neural network.
The convolutional neural network is a feedforward neural network, and generally comprises an input layer, an intermediate layer and an output layer, wherein the intermediate layer is composed of one or more groups of feature extraction layers formed by convolution and pooling and a full connection layer, each layer is composed of a plurality of two-dimensional planes, and each plane comprises a plurality of neuron nodes (nodes). The convolution layer is used as a feature extraction layer, is the most important part in the whole convolution neural network, and can extract the features of voiceprints, energy and the like in various emotion speech spectrograms for subsequent classification processing. The mapping relation before and after convolution is as follows:
$$x_j^l = f\Bigl(\sum_i x_i^{l-1} * k_{ij}^l + b_j^l\Bigr)$$

wherein x_j^l is the j-th feature set of the l-th convolutional layer, x_i^{l-1} represents the i-th feature set of the (l-1)-th convolutional layer, k_{ij}^l represents the convolution kernel between the two feature sets, * represents a two-dimensional convolution operation, b_j^l denotes an additive bias, and f(·) is the activation function of the convolutional layer.
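As a concrete illustration of this mapping, the sketch below (assuming NumPy and SciPy) computes one output feature set as the sum of two-dimensional convolutions of the input feature sets with their kernels, plus an additive bias, followed by an activation; the array shapes and the choice of ReLU as the activation are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_feature_map(inputs, kernels, bias):
    """x_j^l = f( sum_i x_i^{l-1} * k_ij^l + b_j^l ), with f = ReLU here."""
    acc = sum(convolve2d(x_i, k_ij, mode="valid") for x_i, k_ij in zip(inputs, kernels))
    return np.maximum(acc + bias, 0.0)  # additive bias, then ReLU activation

# Example: three 8x8 input feature sets, three 3x3 kernels, one output feature set.
x_prev = [np.random.randn(8, 8) for _ in range(3)]
k = [np.random.randn(3, 3) for _ in range(3)]
y = conv_feature_map(x_prev, k, bias=0.1)   # output shape (6, 6)
```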
Usually, a pooling layer is connected behind the convolutional layer for performing dimension reduction processing on the features obtained by convolution, so as to prevent overfitting in the training process, and the pooling process is shown as follows:
$$x_j^l = f_p\Bigl(\beta_j^l\,\mathrm{down}\bigl(x_j^{l-1}\bigr) + b_j^l\Bigr)$$

wherein f_p(·) is the activation function of the pooling layer, down(·) denotes the pooling method from layer l-1 to layer l (generally either mean pooling or max pooling), and β_j^l and b_j^l represent the multiplicative bias and the additive bias, respectively.
The matrix features output by the final pooling layer are arranged into a vector to form a grid layer, which is connected to the fully connected layer; the output of any node j is:

$$y_j = f_h\Bigl(\sum_i w_{i,j}\,x_i - \theta_j\Bigr)$$

wherein f_h(·) represents the activation function, w_{i,j} represents the weight between the input vector element x_i and node j, and θ_j is the node threshold.
The fully-connected layer usually adopts a Softmax model to solve the multi-classification problem, and the loss function of Softmax is as follows:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}}$$

wherein z_j^l represents the input of the j-th neuron in layer l (usually the last layer), Σ_n e^{z_n^l} represents the sum over the inputs of all neurons in layer l, a_j^l = e^{z_j^l} / Σ_n e^{z_n^l} represents the output of the j-th neuron in layer l, e is the natural constant, and 1(·) is the indicator function, whose value is 1 when the expression in the braces is true and 0 when it is false.
To prevent J(θ) from converging to a poor local optimum, a weight decay term is introduced to penalize over-large parameters during training. The specific expression is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}} + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$$

wherein (λ/2)·Σ_{i,j} θ_{ij}² is the weight decay term.
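For illustration, a small NumPy sketch of this Softmax cross-entropy loss with an L2 weight-decay penalty is given below; the batch size, the number of classes and the value of λ are assumptions.

```python
import numpy as np

def softmax_loss_with_decay(z, y, weights, lam=1e-4):
    """J = -(1/m) * sum_i log a_{y_i} + (lambda/2) * sum theta^2.

    z: (m, k) inputs of the last layer, y: (m,) integer labels,
    weights: list of parameter arrays to be decayed.
    """
    z = z - z.max(axis=1, keepdims=True)                    # numerical stability
    a = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)    # a_j = e^{z_j} / sum_n e^{z_n}
    m = z.shape[0]
    cross_entropy = -np.log(a[np.arange(m), y] + 1e-12).mean()
    decay = 0.5 * lam * sum((w ** 2).sum() for w in weights)  # weight decay term
    return cross_entropy + decay

# Example: 4 samples, 7 emotion classes, one decayed weight matrix.
loss = softmax_loss_with_decay(np.random.randn(4, 7),
                               np.array([0, 3, 6, 2]),
                               [np.random.randn(1024, 7)])
```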
generally, the more the number of layers of the convolutional neural network is, the more distinctive the extracted features are, but problems of too long training time or difficulty in convergence and the like can be caused, so that a five-layer convolutional neural network is constructed, so that the distinctive features can be extracted, and the training time can also be reduced, and a specific network is shown in fig. 2. The convolutional neural network mainly comprises five convolutional layers, three pooling layers and three full-connection layers. The convolution kernel size of convolutional layer 1 is set to 11x11, the step size is 4, the number of neurons is 96, the convolutional layer 1 is the largest convolutional layer, the kernel size is 3x3, the step size is 2, the convolution kernel size of convolutional layer 2 is 5x5, the step size is 1, the number of neurons is 256, the convolutional layer 2 is also the largest convolutional layer, the kernel size is 3x3, the step size is 2, convolutional layers 3 and 4 are both set to have the convolution kernel size of 3x3, the step size is 1, the number of neurons is 384, the convolution kernel size of convolutional layer 5 is 3x3, the step size is 1, the number of neurons is 256, the convolutional layer 3 is also the largest convolutional layer, the kernel size is 3x3, and the step size is 2. And finally, connecting three full connection layers, wherein the number of the neurons of the first two layers is set to be 1024, and the number of the neurons of the last full connection layer is set to be 7.
As can be seen from FIG. 2, the traditional convolutional neural network ignores the influence of shallow features on classification accuracy, so in the present invention a novel convolutional neural network is constructed, as shown in FIG. 3. It mainly comprises six convolutional layers, four pooling layers and three fully connected layers. Compared with the conventional network in FIG. 2, convolutional layer 6 and pooling layer 4 are added: convolutional layer 6 has a 3x3 kernel, a stride of 1 and 256 neurons, and pooling layer 4 is also a max-pooling layer with a 3x3 kernel and a stride of 2. The features obtained after three convolutional layers are then fused with the features obtained after five convolutional layers through a fusion layer, and finally three fully connected layers are attached, with 1024 neurons in each of the first two layers and 7 neurons in the last fully connected layer. A sketch of this fused architecture is given below.
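A minimal Keras functional-API sketch of the fused architecture follows, under the assumption that the shallow branch taps the output of convolutional layer 3, is reduced in dimension by the added convolutional layer 6 and max-pooling layer 4, and is concatenated with the deep branch (after pooling layer 3) before the fully connected layers; the tap point and the use of concatenation as the fusion operation are interpretations of the description, not verbatim from it.

```python
from tensorflow.keras import layers, models

def fused_cnn(input_shape=(227, 227, 3), num_classes=7):
    """Six conv layers, four max-pooling layers, fusion of shallow and deep features."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(96, 11, strides=4, activation="relu")(inp)            # conv 1
    x = layers.MaxPooling2D(3, strides=2)(x)                                # pool 1
    x = layers.Conv2D(256, 5, padding="same", activation="relu")(x)         # conv 2
    x = layers.MaxPooling2D(3, strides=2)(x)                                # pool 2
    shallow = layers.Conv2D(384, 3, padding="same", activation="relu")(x)   # conv 3

    # Deep branch: conv 4, conv 5, pool 3 (as in the conventional network).
    deep = layers.Conv2D(384, 3, padding="same", activation="relu")(shallow)  # conv 4
    deep = layers.Conv2D(256, 3, padding="same", activation="relu")(deep)     # conv 5
    deep = layers.MaxPooling2D(3, strides=2)(deep)                            # pool 3

    # Shallow branch: added conv 6 (3x3, 256) + max-pooling 4 reduce its dimension.
    shallow = layers.Conv2D(256, 3, padding="same", activation="relu")(shallow)  # conv 6
    shallow = layers.MaxPooling2D(3, strides=2)(shallow)                         # pool 4

    # Fusion layer: concatenate shallow and deep features, then three FC layers.
    fused = layers.Concatenate()([layers.Flatten()(shallow), layers.Flatten()(deep)])
    fc = layers.Dense(1024, activation="relu")(fused)
    fc = layers.Dense(1024, activation="relu")(fc)
    out = layers.Dense(num_classes, activation="softmax")(fc)
    return models.Model(inp, out)
```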
And S4, speech emotion recognition experiments are carried out with the traditional convolutional neural network and with the convolutional neural network with fused deep and shallow layer features, respectively, and the speech emotion recognition rates of the two networks are compared.
Taking 70% of the spectrogram in the experiment as a training data set, 15% as a verification data set and the rest as a test data set, wherein the training data set is used for creating an effective classifier by adjusting the weight on the convolutional neural network, the verification data set is used for evaluating the performance of model construction in a training stage, a test platform is provided for fine tuning model parameters and selecting an optimal performance model, and the test data set is only used for testing a final trained model so as to confirm the actual classification capability of the model.
The traditional convolutional neural network is trained and tested first. The relationship between the loss and the number of iterations during training is shown in FIG. 4. The initial learning rate of the network is set to 0.0001 and is decayed to 0.1 times the current learning rate every 160 iterations (steps). The training loss begins to converge at close to 500 iterations; when the loss on the validation data set fully converges to 0.89, the model is saved. After 2500 iterations the accuracy on the validation data set reaches 63.33%, and the whole training process lasts about 50 minutes.
With a transfer learning approach, the optimal model trained with the traditional convolutional neural network is used as a pre-trained model, and the network proposed by the invention continues training on this basis, so that the parameters of that model serve as the initialization of the current network instead of random initialization, which speeds up convergence and shortens the training time. The relationship between the loss and the number of iterations during training is shown in FIG. 5. Because the network is initialized with the parameters of the pre-trained model, the initial loss starts from 1.07, and simply by inheriting those parameters the accuracy on the validation data set already reaches 54.26%. The initial learning rate of the network is set to 0.0001 and is decayed to 0.1 times the current learning rate every 160 iterations (steps). The training loss begins to converge at close to 400 iterations; when the loss on the validation data set fully converges to 0.88, the model is saved. After 2500 iterations the accuracy on the validation data set reaches 64.78%, and the whole training process lasts about 45 minutes.
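A sketch of this transfer-learning setup is given below, assuming the Keras API, the fused_cnn model from the previous sketch, a hypothetical checkpoint file name with matching layer names, and the Adam optimizer; only the initial learning rate of 0.0001 and the 0.1x decay every 160 steps are taken from the text.

```python
import tensorflow as tf

model = fused_cnn()                                  # network from the sketch above
# Initialize shared layers from the conventional network's best checkpoint
# (hypothetical file name) instead of random initialization.
model.load_weights("conventional_cnn_best.h5", by_name=True, skip_mismatch=True)

# Learning-rate schedule: 0.0001 initially, multiplied by 0.1 every 160 steps.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=160, decay_rate=0.1, staircase=True)
model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
              loss="categorical_crossentropy", metrics=["accuracy"])

# train_flow / a validation flow would come from the ImageDataGenerator sketch above:
# model.fit(train_flow, validation_data=val_flow, epochs=..., callbacks=[...])
```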
The two models were used to perform the tests in the test data sets, and the specific experimental results are shown in tables 1 and 2.
TABLE 1  Confusion matrix (%) of the traditional convolutional neural network for the seven emotions in the Berlin database

Emotion category   Anger   Boredom  Disgust  Fear    Happiness  Neutral  Sadness
Anger              76.67    2.77     2.22     1.67    16.11      0.56     0
Boredom             0      90.00     0        0        1.67      5.56     2.77
Disgust            16.11   10.00    67.78     1.11     1.11      3.89     0
Fear               19.44   15.00     3.89    31.67    20.56      2.22     7.22
Happiness          55.00    0        2.22     2.22    40.56      0        0
Neutral             0      58.33     0        0        0        38.34     3.33
Sadness             0       6.11     0        0        0         0       93.89
TABLE 2  Confusion matrix (%) of the convolutional neural network with fused deep and shallow features for the seven emotions in the Berlin database

Emotion category   Anger   Boredom  Disgust  Fear    Happiness  Neutral  Sadness
Anger              72.78    3.89     2.22     1.67    19.44      0        0
Boredom             0      96.11     0        0        1.11      1.11     1.67
Disgust            13.89   10.56    68.88     0        1.11      5.56     0
Fear               15.56   20.00     4.45    30.00    22.22      0.56     7.21
Happiness          50.56    0        2.22     1.11    46.11      0        0
Neutral             0      46.11     0        0        0        46.67     7.22
Sadness             0      10.56     0        0        0         0       89.44
From Tables 1 and 2 it can be seen that, compared with the traditional convolutional neural network, the convolutional neural network of the present invention improves the recognition rate of four of the seven emotions in the German Berlin database, namely boredom, disgust, happiness and neutral; the recognition rates of happiness and neutral in particular are greatly improved, and the overall recognition rate is improved by 1.58%. Comparing FIGS. 4 and 5 and Tables 1 and 2, the differences between the recognition rates of the two networks on the validation and test data sets are as follows: the traditional convolutional neural network achieves 63.33% accuracy on the validation set and 62.70% on the test set, a difference of 0.63%, whereas the convolutional neural network with fused deep and shallow features achieves 64.78% on the validation set and 64.28% on the test set, a difference of 0.5%. Compared with the training model of the traditional convolutional neural network, the training model of the proposed network therefore has stronger generalization capability.
The above experimental results show that: compared with the traditional convolutional neural network, the convolutional neural network with the fusion of the deep and shallow layer features can improve the speech emotion recognition rate, can accelerate the convergence rate and reduce the training time under the condition of combining with a transfer learning method, and has stronger generalization capability.
In conclusion, the invention can fully extract the spectrogram characteristics, thereby improving the speech emotion recognition rate. Compared with the traditional convolutional neural network, the convolutional neural network with the fused deep and shallow features provided by the invention can be fully fused with the deep features by reducing the dimension of the shallow features, so that the features which can represent various emotions can be obtained. The invention not only can effectively improve the speech emotion recognition rate and ensure the recognition accuracy, but also has more excellent generalization capability. Meanwhile, the invention combines a transfer learning method, and utilizes the parameters of the traditional convolutional neural network training model as initialization parameters under the condition that the convolutional neural network becomes complex, thereby accelerating the convergence speed in the training process and improving the overall recognition speed and recognition efficiency. In addition, the invention also provides reference for other related problems in the same field, can be expanded and extended on the basis of the reference, is applied to other technical schemes of voice recognition or emotion recognition algorithms in the field, and has very wide application prospect.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description is written in terms of embodiments, not every embodiment contains only a single technical solution; this manner of description is adopted merely for clarity, and those skilled in the art should treat the description as a whole, as the technical solutions in the embodiments may also be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims (4)

1. A speech emotion recognition method is characterized by comprising the following steps:
s1, converting the voice data used in the experiment into a spectrogram;
s2, performing data amplification processing on the obtained spectrogram;
s3, constructing a convolutional neural network fusing deep and shallow layer characteristics by using the traditional convolutional neural network;
S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and with the convolutional neural network with fused deep and shallow layer features, respectively, and comparing the speech emotion recognition rates of the two networks;
the convolutional neural network fused with the deep and shallow layer characteristics comprises an input layer, an intermediate layer and an output layer, wherein the intermediate layer comprises a convolutional layer, a pooling layer and a full-connection layer;
the mapping relation of the convolutional layer is

$$x_j^l = f\Bigl(\sum_i x_i^{l-1} * k_{ij}^l + b_j^l\Bigr)$$

wherein x_j^l is the j-th feature set of the l-th convolutional layer; x_i^{l-1} represents the i-th feature set of the (l-1)-th convolutional layer; k_{ij}^l represents the convolution kernel between the two feature sets; * denotes a two-dimensional convolution operation; b_j^l represents an additive bias; f(·) is the activation function of the convolutional layer;
the mapping relation of the pooling layer is

$$x_j^l = f_p\Bigl(\beta_j^l\,\mathrm{down}\bigl(x_j^{l-1}\bigr) + b_j^l\Bigr)$$

wherein f_p(·) is the activation function of the pooling layer; down(·) denotes the pooling method from layer l-1 to layer l, including mean pooling and max pooling; β_j^l and b_j^l represent the multiplicative bias and the additive bias, respectively;
the matrix features output by the last pooling layer are all arranged into a vector to form a grid layer, and the grid layer is connected to the fully connected layer; the output relation of any node j in the grid layer is

$$y_j = f_h\Bigl(\sum_i w_{i,j}\,x_i - \theta_j\Bigr)$$

wherein f_h(·) represents an activation function; w_{i,j} represents the weight between the input vector element x_i and node j; θ_j is the node threshold;
the fully connected layer adopts a Softmax model to solve the multi-classification problem; the loss function of Softmax is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}}$$

wherein z_j^l represents the input of the j-th neuron in layer l; Σ_n e^{z_n^l} represents the sum over the inputs of all neurons in layer l; a_j^l = e^{z_j^l} / Σ_n e^{z_n^l} represents the output of the j-th neuron in layer l; e represents the natural constant; 1(·) is the indicator function, whose value is 1 when the expression in the braces is true and 0 when it is false;
in the fully connected layer, over-large parameters are penalized during training by introducing a weight decay term; the specific expression is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}} + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$$

wherein (λ/2)·Σ_{i,j} θ_{ij}² is the weight decay term.
2. The speech emotion recognition method of claim 1, wherein: the voice data of S1 come from the German Berlin emotional speech database; the sampling frequency of the voice data is 16 kHz with 16-bit quantization; and the voice data comprise seven types of emotion, namely anger, boredom, disgust, fear, happiness, neutral and sadness.
3. The method for recognizing speech emotion according to claim 1, wherein the step of converting the speech data used in the experiment into a spectrogram at S1 comprises the steps of:
S11, framing each segment of voice data to obtain x(m, n), wherein n is the frame length and m is the number of frames;
S12, performing an FFT on x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)';
S13, taking 10·log10(Y(m, n)), converting m to a time scale M and n to a frequency scale N;
S14, plotting (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
4. The method for recognizing speech emotion according to claim 1, wherein the data amplification processing of the obtained spectrograms in S2 comprises: performing data amplification on the spectrograms with the Keras deep learning framework, including random image rotation, horizontal translation, vertical translation, shear transformation, image scaling and horizontal flipping.
CN201810685220.0A 2018-06-28 2018-06-28 Speech emotion recognition method Active CN109036465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810685220.0A CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810685220.0A CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Publications (2)

Publication Number Publication Date
CN109036465A CN109036465A (en) 2018-12-18
CN109036465B true CN109036465B (en) 2021-05-11

Family

ID=65520725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810685220.0A Active CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Country Status (1)

Country Link
CN (1) CN109036465B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10812334B2 (en) * 2018-06-29 2020-10-20 Forescout Technologies, Inc. Self-training classification
CN109637522B (en) * 2018-12-26 2022-12-09 杭州电子科技大学 Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN110459225B (en) * 2019-08-14 2022-03-22 南京邮电大学 Speaker recognition system based on CNN fusion characteristics
CN110534133B (en) * 2019-08-28 2022-03-25 珠海亿智电子科技有限公司 Voice emotion recognition system and voice emotion recognition method
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
CN110619889B (en) * 2019-09-19 2022-03-15 Oppo广东移动通信有限公司 Sign data identification method and device, electronic equipment and storage medium
CN110634491B (en) * 2019-10-23 2022-02-01 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN111449644A (en) * 2020-03-19 2020-07-28 复旦大学 Bioelectricity signal classification method based on time-frequency transformation and data enhancement technology
CN111583964B (en) * 2020-04-14 2023-07-21 台州学院 Natural voice emotion recognition method based on multimode deep feature learning
CN111883178B (en) * 2020-07-17 2023-03-17 渤海大学 Double-channel voice-to-image-based emotion recognition method
CN112151071B (en) * 2020-09-23 2022-10-28 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113643724B (en) * 2021-07-06 2023-04-28 中国科学院声学研究所南海研究站 Kiwi emotion recognition method and system based on time-frequency double-branch characteristics

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN107895571A (en) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 Lossless audio file identification method and device
CN108010533A (en) * 2016-10-27 2018-05-08 北京酷我科技有限公司 The automatic identifying method and device of voice data code check
CN108205535A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 The method and its system of Emotion tagging
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN107067011A (en) * 2017-03-20 2017-08-18 北京邮电大学 A kind of vehicle color identification method and device based on deep learning
CN107705806A (en) * 2017-08-22 2018-02-16 北京联合大学 A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
CN108009148A (en) * 2017-11-16 2018-05-08 天津大学 Text emotion classification method for expressing based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition;Linhui Sun et al.;《International Journal of Speech Technology》;20180829;第931-940页 *
Hypernet: Towards accurate region proposal generation and joint object detection;Tao Kong et al.;《Proceedings of the IEEE conference on computer》;20161231;第845-853页 *
Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network;Abdul Malik Badshah et al.;《International Conference on Platform》;20171231;全文 *

Also Published As

Publication number Publication date
CN109036465A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN109036465B (en) Speech emotion recognition method
Sun et al. Speech emotion recognition based on DNN-decision tree SVM model
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
CN106782602B (en) Speech emotion recognition method based on deep neural network
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN107578775B (en) Multi-classification voice method based on deep neural network
CN109992779B (en) Emotion analysis method, device, equipment and storage medium based on CNN
Kamaruddin et al. Cultural dependency analysis for understanding speech emotion
CN109637522B (en) Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN107705806A (en) A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
CN109389992A (en) A kind of speech-emotion recognition method based on amplitude and phase information
CN106847309A (en) A kind of speech-emotion recognition method
CN110534132A (en) A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN108197294A (en) A kind of text automatic generation method based on deep learning
CN107785015A (en) A kind of audio recognition method and device
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN107039036A (en) A kind of high-quality method for distinguishing speek person based on autocoding depth confidence network
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN106875940A (en) A kind of Machine self-learning based on neutral net builds knowledge mapping training method
CN109558935A (en) Emotion recognition and exchange method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant