CN109036465A - Speech-emotion recognition method - Google Patents

Speech-emotion recognition method

Info

Publication number
CN109036465A
CN109036465A (application CN201810685220.0A, granted as CN109036465B)
Authority
CN
China
Prior art keywords
layer
speech
emotion recognition
neural networks
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810685220.0A
Other languages
Chinese (zh)
Other versions
CN109036465B (en)
Inventor
孙林慧
陈嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810685220.0A
Publication of CN109036465A
Application granted
Publication of CN109036465B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a speech emotion recognition method comprising the following steps: S1, converting the experimental speech data into spectrograms; S2, applying data augmentation to the obtained spectrograms; S3, constructing a convolutional neural network that fuses deep and shallow features on the basis of a conventional convolutional neural network; S4, carrying out speech emotion recognition experiments with the conventional convolutional neural network and the network with deep-shallow feature fusion respectively, and comparing the speech emotion recognition rates of the two. The invention can fully extract spectrogram features and thereby improve the speech emotion recognition rate. Compared with a conventional convolutional neural network, the proposed network with deep-shallow feature fusion reduces the dimensionality of the shallow features and fully fuses them with the deep features, obtaining features that better represent the various emotion classes. The invention not only effectively improves the speech emotion recognition rate and ensures recognition accuracy, but also has better generalization ability.

Description

Speech-emotion recognition method
Technical field
The present invention relates to a speech emotion recognition method, and in particular to a speech emotion recognition method based on deep-shallow feature fusion in convolutional neural networks, belonging to the technical field of speech emotion recognition.
Background technique
As a complex human psychological behavior, emotion has always been a research hotspot in many fields such as psychology and artificial intelligence. Speech is the most natural way for people to communicate with each other; a speech signal carries not only the content to be conveyed but also rich emotional factors, and speech signals are now widely used in emotion research.
Speech emotion recognition studies the formation and variation of the speaker's emotional state from the perspective of the speech signal, so that interaction between computers and humans becomes more intelligent. In current research, the acoustic features used for emotion recognition mainly include spectral features, prosodic features, voice quality features, and fusions of these features. Moreover, such studies often focus only on the time domain or the frequency domain, whereas the correlation between the frequency-domain and time-domain information of the speech signal also plays an important role in speech emotion recognition. The spectrogram is a visual representation of the speech signal: its horizontal axis represents time and its vertical axis represents frequency, linking the two domains. By modeling the frequency bins of the spectrogram as image pixels, the relations between adjacent frequencies can be studied with image features, so the result reflects both the time-frequency characteristics of the speech and the linguistic characteristics of the speaker. At present many researchers combine image processing with speech processing through spectrograms and have achieved good results.
Speech emotion recognition methods generally fall into two classes: traditional machine learning methods and deep learning methods. In either case, feature extraction is a key step in the speech emotion recognition process. The key of traditional machine learning methods is feature selection, which is directly related to the accuracy of speech emotion recognition. So far, a large number of spectral, prosodic, and voice quality features have been used for speech emotion recognition, but these features may not suffice to express subjective emotion. Compared with traditional machine learning methods, deep learning methods can extract high-level features and have achieved notable results in vision-related tasks.
In recent years, deep convolutional neural networks (DCNNs) have made great progress in speech emotion recognition research. However, in a conventional convolutional neural network, as the convolutional layers go deeper, the dimensionality of the feature maps becomes smaller and the features become more abstract; the semantic features become more and more prominent, while the global information of the spectrogram becomes increasingly blurred. Shallow features provide global information but weak semantic features; deep features provide sufficient semantic features but lack global information. As a result, the finally extracted emotional features cannot accurately distinguish the emotion classes.
In conclusion how a kind of speech-emotion recognition method based on convolutional neural networks depth layer Fusion Features, will Further feature is together with shallow-layer Fusion Features, so that the bigger affective characteristics of distinction are obtained, to solve traditional convolution mind Shortcoming through network in speech emotion recognition problem just becomes those skilled in that art's urgent problem to be solved.
Summary of the invention
In view of the above drawbacks of the prior art, the purpose of the present invention is to propose a speech emotion recognition method based on deep-shallow feature fusion in convolutional neural networks.
Specifically, the speech emotion recognition method comprises the following steps:
S1, converting the experimental speech data into spectrograms;
S2, applying data augmentation to the obtained spectrograms;
S3, constructing a convolutional neural network fusing deep and shallow features on the basis of a conventional convolutional neural network;
S4, carrying out speech emotion recognition experiments with the conventional convolutional neural network and the network fusing deep and shallow features respectively, and comparing the speech emotion recognition rates of the two.
Preferably, the speech data in S1 comes from the German Berlin emotional speech database; the sampling frequency of the speech data is 16 kHz with 16-bit quantization; the speech data covers seven emotion classes: anger, boredom, disgust, fear, happiness, neutral, and sadness.
Preferably, converting the experimental speech data into spectrograms in S1 comprises the following steps:
S11, framing each segment of speech data to obtain x(m, n), where n is the frame length and m is the number of frames;
S12, applying an FFT to x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)';
S13, taking 10·log10(Y(m, n)), and mapping m to the time scale M and n to the frequency scale N;
S14, plotting (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
Preferably, applying data augmentation to the obtained spectrograms in S2 comprises: performing data augmentation on the spectrograms with the Keras deep learning framework, the augmentation operations including random rotation, horizontal translation, vertical translation, shear transformation, image scaling, and horizontal flipping.
Preferably, the convolutional neural network fusing deep and shallow features comprises an input layer, intermediate layers, and an output layer, the intermediate layers comprising convolutional layers, pooling layers, and fully connected layers.
Preferably, the mapping relation of a convolutional layer is
$$x_j^l = \sum_i x_i^{l-1} * k_{ij}^l + b_j^l$$
where $x_j^l$ is the j-th feature set of the l-th convolutional layer; $x_i^{l-1}$ denotes the i-th feature set of the (l-1)-th convolutional layer; $k_{ij}^l$ denotes the convolution kernel between the two feature sets; * denotes the two-dimensional convolution operation; and $b_j^l$ denotes the additive bias.
Preferably, the mapping relation of a pooling layer is
$$x_j^l = f_p\!\left(\beta_j^l\,\mathrm{down}(x_j^{l-1}) + b_j^l\right)$$
where $f_p(\cdot)$ is the activation function of the pooling layer; down(·) denotes the pooling operation from layer l-1 to layer l, including the two methods of mean pooling and max pooling; and $\beta_j^l$ and $b_j^l$ denote the multiplicative bias and the additive bias, respectively.
Preferably, the matrix features of the last pooling layer are arranged into a vector to form a rasterization layer, which is connected to the fully connected layers; the output of any node j in the rasterization layer is
$$y_j = f_h\!\left(\sum_i w_{i,j}\,x_i - \theta_j\right)$$
where $f_h(\cdot)$ denotes the activation function; $w_{i,j}$ denotes the weight between input $x_i$ and node j; and $\theta_j$ is the node threshold.
Preferably, the fully connected layers use a Softmax model to solve the multi-class classification problem; the loss function of Softmax is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{K} 1\{y^{(i)}=j\}\,\log a_j^l, \qquad a_j^l = \frac{e^{z_j^l}}{\sum_{k} e^{z_k^l}}$$
where $z_j^l$ denotes the input of the j-th neuron of layer l (usually the last layer); $\sum_k e^{z_k^l}$ denotes the sum over the inputs of all neurons of layer l; $a_j^l$ denotes the output of the j-th neuron of layer l; e denotes the natural constant; and 1{·} is the indicator function, which equals 1 when the expression in braces is true and 0 when it is false.
Preferably, in the fully connected layers a weight decay term is introduced to penalize excessively large parameters during training; the expression is
$$\tilde{J}(\theta) = J(\theta) + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$$
where $\frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$ is the weight decay term.
Compared with the prior art, the advantages of the present invention are mainly reflected in the following aspects:
The present invention can fully extract spectrogram features and thereby improve the speech emotion recognition rate. Compared with a conventional convolutional neural network, the network with deep-shallow feature fusion proposed in the present invention reduces the dimensionality of the shallow features and fully fuses them with the deep features, obtaining features that better represent the various emotion classes. The invention not only effectively improves the speech emotion recognition rate and ensures recognition accuracy, but also has better generalization ability. The experimental results show that with the proposed network with deep-shallow feature fusion, the recognition rates of four of the seven emotion classes in the Berlin database (boredom, disgust, happiness, and neutral) are improved; in particular, the recognition rates of happiness and neutral are greatly improved, and the overall recognition rate is improved by 1.58%.
Meanwhile present invention incorporates the methods of shift learning to utilize tradition in the case where convolutional neural networks complicate Convolutional neural networks training pattern parameter as initiation parameter, to accelerate the convergence rate in training process, mention Whole recognition speed and recognition efficiency are risen.
In addition, the present invention provides a reference for other related problems in the same field; it can be extended on this basis and applied to other speech recognition or emotion recognition solutions, so it has broad application prospects.
In conclusion the invention proposes a kind of speech emotion recognitions based on convolutional neural networks depth layer Fusion Features Method.With very high use and promotional value.
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings, so that the technical solution of the invention is easier to understand and master.
Detailed description of the invention
Fig. 1 shows spectrogram samples of some emotional utterances from the Berlin corpus used in the present invention;
Fig. 2 shows the conventional convolutional neural network;
Fig. 3 shows the improved convolutional neural network of the present invention;
Fig. 4 shows the training process of the conventional convolutional neural network;
Fig. 5 shows the training process of the improved convolutional neural network of the present invention.
Specific embodiment
As shown in the figures, the present invention discloses a speech emotion recognition method comprising the following steps:
S1, converting the experimental speech data into spectrograms.
The present invention uses the German Berlin emotional speech database, with a sampling frequency of 16 kHz and 16-bit quantization. The database covers seven emotion classes: anger, boredom, disgust, fear, happiness, neutral, and sadness.
Specifically, converting the experimental speech data into spectrograms in S1 comprises the following steps:
S11, each segment of speech is first framed, yielding x(100, 512), where m = 100 is the number of frames and n = 512 is the frame length;
S12, an FFT is then applied to obtain X(100, 512), and the periodogram Y(100, 512) = X(100, 512) · X(100, 512)' is computed;
S13, 10·log10(Y(100, 512)) is then taken, the frame index m is mapped to the time scale M, and the frame length index n is mapped to the frequency scale N;
S14, finally (M, N, 10·log10(Y(m, n))) is drawn as a two-dimensional image, which is the spectrogram. Some sample spectrograms are shown in Fig. 1.
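For concreteness, a minimal Python sketch of steps S11 to S14 is given below. Only the framing, the FFT, the periodogram, and the 10·log10 scaling are taken from the text; the frame overlap, the window, and the plotting details are illustrative assumptions.

```python
# Minimal sketch of steps S11-S14, assuming a 16 kHz WAV input; the hop size,
# the Hamming window, and the output image settings are illustrative choices.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

def wav_to_spectrogram(path, frame_len=512, hop=256, out_png="spectrogram.png"):
    fs, signal = wavfile.read(path)                     # load 16 kHz speech
    signal = signal.astype(np.float64)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])       # S11: framing -> x(m, n)
    window = np.hamming(frame_len)
    spectrum = np.fft.rfft(frames * window, axis=1)     # S12: FFT -> X(m, n)
    periodogram = (spectrum * spectrum.conj()).real     # S12: Y(m, n) = X * X'
    log_power = 10.0 * np.log10(periodogram + 1e-10)    # S13: 10*log10(Y(m, n))
    t = np.arange(n_frames) * hop / fs                  # time scale M
    f = np.fft.rfftfreq(frame_len, d=1.0 / fs)          # frequency scale N
    plt.pcolormesh(t, f, log_power.T, shading="auto")   # S14: draw 2-D image
    plt.axis("off")
    plt.savefig(out_png, bbox_inches="tight", pad_inches=0)
    plt.close()

# wav_to_spectrogram("some_berlin_utterance.wav")  # hypothetical file name
```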
S2, applying data augmentation to the obtained spectrograms. To satisfy the large amount of data that deep neural networks require, the present invention uses the Keras deep learning framework to augment the spectrograms; the main operations are random rotation, horizontal translation, vertical translation, shear transformation, image scaling, horizontal flipping, and vertical flipping, which finally yields the large amount of data required for the experiments.
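A possible Keras setup for this augmentation is sketched below; the numeric ranges of each transformation and the directory layout of the spectrogram images are assumptions, since the text only names the operations.

```python
# Augmentation along the lines described above; ranges and paths are assumed.
from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,          # random rotation
    width_shift_range=0.1,      # horizontal translation
    height_shift_range=0.1,     # vertical translation
    shear_range=0.1,            # shear transformation
    zoom_range=0.1,             # image scaling
    horizontal_flip=True,       # horizontal flip
    vertical_flip=True)         # vertical (upside-down) flip

# Generate augmented spectrograms from a directory of per-emotion subfolders
# (hypothetical path and image size).
flow = augmenter.flow_from_directory(
    "spectrograms/train", target_size=(227, 227),
    batch_size=32, class_mode="categorical")
```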
S3, constructing the convolutional neural network fusing deep and shallow features on the basis of a conventional convolutional neural network.
A convolutional neural network is a feedforward neural network that generally comprises an input layer, intermediate layers, and an output layer. The intermediate layers consist of one or more groups of "convolution + pooling" feature extraction layers together with fully connected layers; each layer is composed of several two-dimensional planes, and each plane contains a number of neuron nodes. As a feature extraction layer, the convolutional layer is the most important part of the whole convolutional neural network; it extracts features such as voiceprint and energy from the emotion spectrograms for the subsequent classification. The mapping relation before and after convolution is
$$x_j^l = \sum_i x_i^{l-1} * k_{ij}^l + b_j^l$$
where $x_j^l$ is the j-th feature set of the l-th convolutional layer, $x_i^{l-1}$ denotes the i-th feature set of the (l-1)-th convolutional layer, $k_{ij}^l$ denotes the convolution kernel between the two feature sets, * denotes the two-dimensional convolution operation, and $b_j^l$ denotes the additive bias.
A pooling layer is usually connected after each convolutional layer to reduce the dimensionality of the features obtained by convolution and to prevent overfitting during training. The pooling process is
$$x_j^l = f_p\!\left(\beta_j^l\,\mathrm{down}(x_j^{l-1}) + b_j^l\right)$$
where $f_p(\cdot)$ is the activation function of the pooling layer, down(·) denotes the pooling operation from layer l-1 to layer l, generally divided into the two methods of mean pooling and max pooling, and $\beta_j^l$ and $b_j^l$ denote the multiplicative bias and the additive bias, respectively.
The matrix features of the last pooling layer are arranged into a vector to form a rasterization layer, which is connected to the fully connected layers. The output of any node j in this layer is
$$y_j = f_h\!\left(\sum_i w_{i,j}\,x_i - \theta_j\right)$$
where $f_h(\cdot)$ denotes the activation function, $w_{i,j}$ denotes the weight between input $x_i$ and node j, and $\theta_j$ is the node threshold.
The fully connected layers generally use a Softmax model to solve the multi-class classification problem. The loss function of Softmax is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{K} 1\{y^{(i)}=j\}\,\log a_j^l, \qquad a_j^l = \frac{e^{z_j^l}}{\sum_{k} e^{z_k^l}}$$
where $z_j^l$ denotes the input of the j-th neuron of layer l (usually the last layer), $\sum_k e^{z_k^l}$ denotes the sum over the inputs of all neurons of layer l, $a_j^l$ denotes the output of the j-th neuron of layer l, e denotes the natural constant, and 1{·} is the indicator function, which equals 1 when the expression in braces is true and 0 when it is false.
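As a small numerical illustration of the Softmax output and loss above, assuming a seven-class problem with made-up logits:

```python
# Softmax output a_j and the cross-entropy term for one sample; the logits
# and the true class index are made-up values for illustration only.
import numpy as np

z = np.array([1.2, 0.3, -0.5, 0.0, 2.1, -1.0, 0.4])   # inputs z_j of the last layer
a = np.exp(z) / np.exp(z).sum()                        # softmax outputs a_j
y = 4                                                  # true class index
loss = -np.log(a[y])                                   # 1{y = j} selects a_y
```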
To prevent J(θ) from being poorly optimized, a weight decay term is introduced to penalize excessively large parameters during training. The expression is
$$\tilde{J}(\theta) = J(\theta) + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$$
where $\frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$ is the weight decay term.
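In a Keras implementation this penalty would typically be attached to the fully connected layers as an L2 regularizer; the decay coefficient below is an illustrative assumption, not a value taken from the patent.

```python
# One way to realize the weight-decay (L2) term of the fully connected layers
# in Keras; the coefficient 5e-4 is an assumed example value.
from keras.layers import Dense
from keras.regularizers import l2

fc = Dense(1024, activation="relu", kernel_regularizer=l2(5e-4))
```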
Usually, the more layers a convolutional neural network has, the more discriminative the extracted features are, but too many layers lead to excessively long training times and convergence difficulties. We therefore construct a network with five convolutional layers that can extract discriminative features while keeping the training time low; the network is shown in Fig. 2. It mainly consists of five convolutional layers, three pooling layers, and three fully connected layers. Convolutional layer 1 has a kernel size of 11x11, a stride of 4, and 96 neurons; pooling layer 1 is a max pooling layer with a kernel size of 3x3 and a stride of 2. Convolutional layer 2 has a kernel size of 5x5, a stride of 1, and 256 neurons; pooling layer 2 is also a max pooling layer with a kernel size of 3x3 and a stride of 2. Convolutional layers 3 and 4 both have a kernel size of 3x3, a stride of 1, and 384 neurons; convolutional layer 5 has a kernel size of 3x3, a stride of 1, and 256 neurons; pooling layer 3 is likewise a max pooling layer with a kernel size of 3x3 and a stride of 2. Finally, three fully connected layers are connected: the first two have 1024 neurons each, and the last has 7 neurons.
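A sketch of this five-convolutional-layer baseline in Keras follows. The input size, activation functions, and padding are not specified in the text and are assumptions of the sketch.

```python
# Baseline network of Fig. 2 (sketch); input 227x227x3, ReLU and "same"
# padding on the 3x3/5x5 convolutions are assumptions.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

baseline = Sequential([
    Conv2D(96, (11, 11), strides=4, activation="relu",
           input_shape=(227, 227, 3)),                       # conv1
    MaxPooling2D((3, 3), strides=2),                          # pool1
    Conv2D(256, (5, 5), padding="same", activation="relu"),   # conv2
    MaxPooling2D((3, 3), strides=2),                          # pool2
    Conv2D(384, (3, 3), padding="same", activation="relu"),   # conv3
    Conv2D(384, (3, 3), padding="same", activation="relu"),   # conv4
    Conv2D(256, (3, 3), padding="same", activation="relu"),   # conv5
    MaxPooling2D((3, 3), strides=2),                          # pool3
    Flatten(),
    Dense(1024, activation="relu"),                           # fc1
    Dense(1024, activation="relu"),                           # fc2
    Dense(7, activation="softmax"),                           # fc3: seven emotions
])
```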
Fig. 2 shows that a conventional convolutional neural network ignores the influence of the shallow features on classification correctness. In the present invention we therefore construct a novel convolutional neural network, shown in Fig. 3, which mainly consists of six convolutional layers, four pooling layers, and three fully connected layers. Compared with the conventional network of Fig. 2, convolutional layer 6 and pooling layer 4 are added: convolutional layer 6 has a kernel size of 3x3, a stride of 1, and 256 neurons, and pooling layer 4 is likewise a max pooling layer with a kernel size of 3x3 and a stride of 2. A fusion layer then fuses the features obtained after three convolutional layers with the features obtained after five convolutional layers. Finally, three fully connected layers are connected: the first two have 1024 neurons each, and the last has 7 neurons.
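The fusion network of Fig. 3 can be sketched with the Keras functional API as below. The text does not spell out exactly where the shallow branch taps off or how the fusion layer combines the two feature tensors, so applying convolutional layer 6 and pooling layer 4 to the output of convolutional layer 3 and fusing by concatenation are assumptions of this sketch.

```python
# Deep-shallow fusion network of Fig. 3 (sketch); branch point and fusion by
# concatenation are assumptions, as are input size, activations and padding.
from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                          Dense, Concatenate)

inputs = Input(shape=(227, 227, 3))
x = Conv2D(96, (11, 11), strides=4, activation="relu")(inputs)        # conv1
x = MaxPooling2D((3, 3), strides=2)(x)                                 # pool1
x = Conv2D(256, (5, 5), padding="same", activation="relu")(x)          # conv2
x = MaxPooling2D((3, 3), strides=2)(x)                                 # pool2
shallow = Conv2D(384, (3, 3), padding="same", activation="relu")(x)    # conv3

# Deep branch: two more convolutions and pooling.
deep = Conv2D(384, (3, 3), padding="same", activation="relu")(shallow) # conv4
deep = Conv2D(256, (3, 3), padding="same", activation="relu")(deep)    # conv5
deep = MaxPooling2D((3, 3), strides=2)(deep)                            # pool3

# Shallow branch: conv6 + pool4 reduce the shallow feature's dimensionality.
shallow = Conv2D(256, (3, 3), padding="same", activation="relu")(shallow)  # conv6
shallow = MaxPooling2D((3, 3), strides=2)(shallow)                          # pool4

fused = Concatenate()([Flatten()(deep), Flatten()(shallow)])           # fusion layer
fc = Dense(1024, activation="relu")(fused)                              # fc1
fc = Dense(1024, activation="relu")(fc)                                 # fc2
outputs = Dense(7, activation="softmax")(fc)                            # fc3

fusion_model = Model(inputs, outputs)
```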
S4, carrying out speech emotion recognition experiments with the conventional convolutional neural network and the network fusing deep and shallow features respectively, and comparing the speech emotion recognition rates of the two.
In the experiments, 70% of the spectrograms are used as the training set, 15% as the validation set, and the remainder as the test set. The training set is used to build an effective classifier by adjusting the weights of the convolutional neural network; the validation set is used to assess the performance of the model built during training, providing a test bed for fine-tuning the model parameters and selecting the best-performing model; the test set is used only to evaluate the finally trained model and confirm its actual classification ability.
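One way to obtain such a 70% / 15% / 15% split, assuming the spectrogram images and their emotion labels have already been loaded into the arrays `images` and `labels` (hypothetical names), is:

```python
# Stratified 70/15/15 split; `images` and `labels` are assumed to be loaded.
from sklearn.model_selection import train_test_split

X_train, X_rest, y_train, y_rest = train_test_split(
    images, labels, train_size=0.70, stratify=labels, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)
```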
The conventional convolutional neural network is trained and tested first. The relationship between the loss and the number of iterations during training is shown in Fig. 4. The initial learning rate of the network is set to 0.0001 and is decayed to 0.1 times the current learning rate every 160 iterations (steps). The training loss starts to converge at about 500 iterations; when the validation loss fully converges to 0.89, we save the model. After 2500 iterations the validation accuracy reaches 63.33%, and the whole training process lasts about 50 minutes.
Using the method for transfer learning, use the trained optimal model of traditional convolutional neural networks as pre-training mould Type continues to train on the basis of the model using the network proposed in the present invention, in this way can using the parameter of the model as Convergence rate, less training time are accelerated in the initialization of current network, rather than random initializtion.Loss in training process Relationship with the number of iterations is as shown in figure 5, since the parameter of initialization network is generated using the model of pre-training, initially Loss will be since 1.07, and pass through the parameter for inheriting pre-training model, so that the accuracy of validation data set reaches 54.26%, the initial learning rate of network is set as 0.0001, decays to the 0.1 of current learning rate after every 160 iteration (step-length) Times, training loss starts to restrain at iteration nearly 400 times, when validation data set loss Complete Convergence is in 0.88 when, Wo Menbao Model is deposited, after 2500 iteration, the accuracy of validation data set has reached 64.78%, and entire training process continues greatly About 45 minutes.
The two models are then tested on the test set; the specific experimental results are shown in Tables 1 and 2.
Table 1. Confusion matrix (%) of the conventional convolutional neural network on the seven emotion classes of the Berlin database

Emotion      Anger   Boredom  Disgust  Fear    Happiness  Neutral  Sadness
Anger        76.67   2.77     2.22     1.67    16.11      0.56     0
Boredom      0       90.00    0        0       1.67       5.56     2.77
Disgust      16.11   10.00    67.78    1.11    1.11       3.89     0
Fear         19.44   15.00    3.89     31.67   20.56      2.22     7.22
Happiness    55.00   0        2.22     2.22    40.56      0        0
Neutral      0       58.33    0        0       0          38.34    3.33
Sadness      0       6.11     0        0       0          0        93.89
Table 2. Confusion matrix (%) of the deep-shallow feature fusion convolutional neural network on the seven emotion classes of the Berlin database

Emotion      Anger   Boredom  Disgust  Fear    Happiness  Neutral  Sadness
Anger        72.78   3.89     2.22     1.67    19.44      0        0
Boredom      0       96.11    0        0       1.11       1.11     1.67
Disgust      13.89   10.56    68.88    0       1.11       5.56     0
Fear         15.56   20.00    4.45     30.00   22.22      0.56     7.21
Happiness    50.56   0        2.22     1.11    46.11      0        0
Neutral      0       46.11    0        0       0          46.67    7.22
Sadness      0       10.56    0        0       0          0        89.44
From Tables 1 and 2 we can see that, compared with the conventional convolutional neural network, the convolutional neural network of the present invention improves the recognition rates of four of the seven emotion classes of the Berlin database (boredom, disgust, happiness, and neutral); the recognition rates of happiness and neutral in particular are greatly improved, and the overall recognition rate is improved by 1.58%. From Figs. 4 and 5 and Tables 1 and 2 we also compare the gap between the validation and test recognition rates of the two networks: the conventional convolutional neural network reaches 63.33% on the validation set and 62.70% on the test set, a gap of 0.63%, while the deep-shallow feature fusion network of the present invention reaches 64.78% on the validation set and 64.28% on the test set, a gap of 0.5%. Compared with the model trained by the conventional convolutional neural network, the model trained by the proposed network therefore has stronger generalization ability.
The above results show that, compared with the conventional convolutional neural network, the proposed convolutional neural network with deep-shallow feature fusion improves the speech emotion recognition rate; combined with transfer learning, it accelerates convergence and reduces the training time; moreover, the model trained by the proposed network with deep-shallow feature fusion has stronger generalization ability.
In conclusion the present invention can sufficiently extract sound spectrograph feature, to improve speech emotion recognition rate.Relative to biography The convolutional neural networks of system, the present invention proposed in depth layer Fusion Features convolutional neural networks can by by shallow-layer spy Sign carries out dimensionality reduction, is fully merged with further feature, to obtain the feature more representative of all kinds of emotions.The present invention is not only The accuracy that speech emotion recognition rate can be effectively improved, ensure to identify, and there is more excellent generalization ability.Together When, present invention incorporates the methods of shift learning, in the case where convolutional neural networks complicate, utilize traditional convolutional Neural The parameter of network training model is as initiation parameter, to accelerate the convergence rate in training process, improves whole Recognition speed and recognition efficiency.It, can be as in addition, the present invention also provides reference for other relevant issues in same domain According to expansion extension is carried out, apply in field in the technical solution of other speech recognitions or emotion recognition algorithm, has very Wide application prospect.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claims involved.
Furthermore, it should be understood that, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions of the various embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (10)

1. A speech emotion recognition method, characterized by comprising the following steps:
S1, converting the experimental speech data into spectrograms;
S2, applying data augmentation to the obtained spectrograms;
S3, constructing a convolutional neural network fusing deep and shallow features on the basis of a conventional convolutional neural network;
S4, carrying out speech emotion recognition experiments with the conventional convolutional neural network and the network fusing deep and shallow features respectively, and comparing the speech emotion recognition rates of the two.
2. The speech emotion recognition method according to claim 1, characterized in that the speech data in S1 comes from the German Berlin emotional speech database; the sampling frequency of the speech data is 16 kHz with 16-bit quantization; the speech data covers seven emotion classes: anger, boredom, disgust, fear, happiness, neutral, and sadness.
3. The speech emotion recognition method according to claim 1, characterized in that converting the experimental speech data into spectrograms in S1 comprises the following steps:
S11, framing each segment of speech data to obtain x(m, n), where n is the frame length and m is the number of frames;
S12, applying an FFT to x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)';
S13, taking 10·log10(Y(m, n)), and mapping m to the time scale M and n to the frequency scale N;
S14, plotting (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
4. The speech emotion recognition method according to claim 1, characterized in that applying data augmentation to the obtained spectrograms in S2 comprises: performing data augmentation on the spectrograms with the Keras deep learning framework, the augmentation operations including random rotation, horizontal translation, vertical translation, shear transformation, image scaling, and horizontal flipping.
5. The speech emotion recognition method according to claim 1, characterized in that the convolutional neural network fusing deep and shallow features comprises an input layer, intermediate layers, and an output layer, the intermediate layers comprising convolutional layers, pooling layers, and fully connected layers.
6. The speech emotion recognition method according to claim 5, characterized in that the mapping relation of a convolutional layer is
$$x_j^l = \sum_i x_i^{l-1} * k_{ij}^l + b_j^l$$
where $x_j^l$ is the j-th feature set of the l-th convolutional layer; $x_i^{l-1}$ denotes the i-th feature set of the (l-1)-th convolutional layer; $k_{ij}^l$ denotes the convolution kernel between the two feature sets; * denotes the two-dimensional convolution operation; and $b_j^l$ denotes the additive bias.
7. The speech emotion recognition method according to claim 5, characterized in that the mapping relation of a pooling layer is
$$x_j^l = f_p\!\left(\beta_j^l\,\mathrm{down}(x_j^{l-1}) + b_j^l\right)$$
where $f_p(\cdot)$ is the activation function of the pooling layer; down(·) denotes the pooling operation from layer l-1 to layer l, including the two methods of mean pooling and max pooling; and $\beta_j^l$ and $b_j^l$ denote the multiplicative bias and the additive bias, respectively.
8. The speech emotion recognition method according to claim 5, characterized in that the matrix features of the last pooling layer are arranged into a vector to form a rasterization layer, the rasterization layer is connected to the fully connected layers, and the output of any node j in the rasterization layer is
$$y_j = f_h\!\left(\sum_i w_{i,j}\,x_i - \theta_j\right)$$
where $f_h(\cdot)$ denotes the activation function; $w_{i,j}$ denotes the weight between input $x_i$ and node j; and $\theta_j$ is the node threshold.
9. The speech emotion recognition method according to claim 5, characterized in that the fully connected layers use a Softmax model to solve the multi-class classification problem, and the loss function of Softmax is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{K} 1\{y^{(i)}=j\}\,\log a_j^l, \qquad a_j^l = \frac{e^{z_j^l}}{\sum_{k} e^{z_k^l}}$$
where $z_j^l$ denotes the input of the j-th neuron of layer l (usually the last layer); $\sum_k e^{z_k^l}$ denotes the sum over the inputs of all neurons of layer l; $a_j^l$ denotes the output of the j-th neuron of layer l; e denotes the natural constant; and 1{·} is the indicator function, which equals 1 when the expression in braces is true and 0 when it is false.
10. The speech emotion recognition method according to claim 9, characterized in that in the fully connected layers a weight decay term is introduced to penalize excessively large parameters during training, the expression being
$$\tilde{J}(\theta) = J(\theta) + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$$
where $\frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$ is the weight decay term.
CN201810685220.0A 2018-06-28 2018-06-28 Speech emotion recognition method Active CN109036465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810685220.0A CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810685220.0A CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Publications (2)

Publication Number Publication Date
CN109036465A true CN109036465A (en) 2018-12-18
CN109036465B CN109036465B (en) 2021-05-11

Family

ID=65520725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810685220.0A Active CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Country Status (1)

Country Link
CN (1) CN109036465B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110459225A (en) * 2019-08-14 2019-11-15 南京邮电大学 A kind of speaker identification system based on CNN fusion feature
CN110534133A (en) * 2019-08-28 2019-12-03 珠海亿智电子科技有限公司 A kind of speech emotion recognition system and speech-emotion recognition method
CN110619889A (en) * 2019-09-19 2019-12-27 Oppo广东移动通信有限公司 Sign data identification method and device, electronic equipment and storage medium
CN110634491A (en) * 2019-10-23 2019-12-31 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN111449644A (en) * 2020-03-19 2020-07-28 复旦大学 Bioelectricity signal classification method based on time-frequency transformation and data enhancement technology
CN111583964A (en) * 2020-04-14 2020-08-25 台州学院 Natural speech emotion recognition method based on multi-mode deep feature learning
CN111883178A (en) * 2020-07-17 2020-11-03 渤海大学 Double-channel voice-to-image-based emotion recognition method
CN112151071A (en) * 2020-09-23 2020-12-29 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
WO2021051577A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Speech emotion recognition method, apparatus, device, and storage medium
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113643724A (en) * 2021-07-06 2021-11-12 中国科学院声学研究所南海研究站 Kiwi emotion recognition method and system based on time-frequency double-branch characteristics
US11343149B2 (en) * 2018-06-29 2022-05-24 Forescout Technologies, Inc. Self-training classification

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN107067011A (en) * 2017-03-20 2017-08-18 北京邮电大学 A kind of vehicle color identification method and device based on deep learning
CN107705806A (en) * 2017-08-22 2018-02-16 北京联合大学 A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN107895571A (en) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 Lossless audio file identification method and device
CN108009148A (en) * 2017-11-16 2018-05-08 天津大学 Text emotion classification method for expressing based on deep learning
CN108010533A (en) * 2016-10-27 2018-05-08 北京酷我科技有限公司 The automatic identifying method and device of voice data code check
CN108205535A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 The method and its system of Emotion tagging

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN107895571A (en) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 Lossless audio file identification method and device
CN108010533A (en) * 2016-10-27 2018-05-08 北京酷我科技有限公司 The automatic identifying method and device of voice data code check
CN108205535A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 The method and its system of Emotion tagging
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN107067011A (en) * 2017-03-20 2017-08-18 北京邮电大学 A kind of vehicle color identification method and device based on deep learning
CN107705806A (en) * 2017-08-22 2018-02-16 北京联合大学 A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
CN108009148A (en) * 2017-11-16 2018-05-08 天津大学 Text emotion classification method for expressing based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ABDUL MALIK BADSHAH ET AL.: "Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network", 《INTERNATIONAL CONFERENCE ON PLATFORM》 *
LINHUI SUN ET AL.: "Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition", 《INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY》 *
TAO KONG ET AL.: "Hypernet: Towards accurate region proposal generation and joint object detection", 《PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER》 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11936660B2 (en) 2018-06-29 2024-03-19 Forescout Technologies, Inc. Self-training classification
US11343149B2 (en) * 2018-06-29 2022-05-24 Forescout Technologies, Inc. Self-training classification
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109637522B (en) * 2018-12-26 2022-12-09 杭州电子科技大学 Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN110459225A (en) * 2019-08-14 2019-11-15 南京邮电大学 A kind of speaker identification system based on CNN fusion feature
CN110534133A (en) * 2019-08-28 2019-12-03 珠海亿智电子科技有限公司 A kind of speech emotion recognition system and speech-emotion recognition method
CN110534133B (en) * 2019-08-28 2022-03-25 珠海亿智电子科技有限公司 Voice emotion recognition system and voice emotion recognition method
WO2021051577A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Speech emotion recognition method, apparatus, device, and storage medium
CN110619889A (en) * 2019-09-19 2019-12-27 Oppo广东移动通信有限公司 Sign data identification method and device, electronic equipment and storage medium
CN110634491B (en) * 2019-10-23 2022-02-01 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN110634491A (en) * 2019-10-23 2019-12-31 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN111449644A (en) * 2020-03-19 2020-07-28 复旦大学 Bioelectricity signal classification method based on time-frequency transformation and data enhancement technology
CN111583964A (en) * 2020-04-14 2020-08-25 台州学院 Natural speech emotion recognition method based on multi-mode deep feature learning
CN111583964B (en) * 2020-04-14 2023-07-21 台州学院 Natural voice emotion recognition method based on multimode deep feature learning
CN111883178A (en) * 2020-07-17 2020-11-03 渤海大学 Double-channel voice-to-image-based emotion recognition method
CN112151071A (en) * 2020-09-23 2020-12-29 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN112151071B (en) * 2020-09-23 2022-10-28 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113643724A (en) * 2021-07-06 2021-11-12 中国科学院声学研究所南海研究站 Kiwi emotion recognition method and system based on time-frequency double-branch characteristics

Also Published As

Publication number Publication date
CN109036465B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN109036465A (en) Speech-emotion recognition method
Zhao et al. Learning deep features to recognise speech emotion using merged deep CNN
Sun et al. Speech emotion recognition based on DNN-decision tree SVM model
Chatziagapi et al. Data Augmentation Using GANs for Speech Emotion Recognition.
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
CN103544963B (en) A kind of speech-emotion recognition method based on core semi-supervised discrimination and analysis
CN108597539A (en) Speech-emotion recognition method based on parameter migration and sound spectrograph
CN109637522B (en) Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN108717856A (en) A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN106847309A (en) A kind of speech-emotion recognition method
CN107785015A (en) A kind of audio recognition method and device
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN110047516A (en) A kind of speech-emotion recognition method based on gender perception
CN106297773A (en) A kind of neutral net acoustic training model method
CN108986798B (en) Processing method, device and the equipment of voice data
CN110148408A (en) A kind of Chinese speech recognition method based on depth residual error
CN110189766B (en) Voice style transfer method based on neural network
CN109036468A (en) Speech-emotion recognition method based on deepness belief network and the non-linear PSVM of core
CN109949821A (en) A method of far field speech dereverbcration is carried out using the U-NET structure of CNN
CN110534133A (en) A kind of speech emotion recognition system and speech-emotion recognition method
CN109767789A (en) A kind of new feature extracting method for speech emotion recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant