CN106782602A - Speech emotion recognition method based on long short-term memory network and convolutional neural network - Google Patents

Speech emotion recognition method based on long short-term memory network and convolutional neural network

Info

Publication number
CN106782602A
CN106782602A (Application CN201611093447.3A)
Authority
CN
China
Prior art keywords
layer
neural networks
convolutional neural
output
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611093447.3A
Other languages
Chinese (zh)
Other versions
CN106782602B (en)
Inventor
袁亮
卢官明
闫静杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201611093447.3A
Publication of CN106782602A
Application granted
Publication of CN106782602B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition method based on a long short-term memory (LSTM) network and a convolutional neural network (CNN). The method builds a speech emotion recognition system based on LSTM and CNN, takes speech sequences as the system input, trains the LSTM and CNN with the back-propagation algorithm to optimize the network parameters, and obtains the optimized network model. The trained network model then classifies newly input speech sequences into six emotions: sadness, happiness, disgust, fear, surprise and neutral. The method combines the LSTM and CNN network models, avoids the tedious manual selection and extraction of features, and improves the accuracy of emotion recognition.

Description

Speech emotion recognition method based on long short-term memory network and convolutional neural network
Technical field
The present invention relates to the field of image processing and pattern recognition, and in particular to a speech emotion recognition method based on a long short-term memory network and a convolutional neural network.
Background technology
In human communication, information is exchanged in many ways, including speech, body language and facial expressions. Among these, the speech signal is the fastest and most primitive means of communication, and researchers regard it as one of the most effective ways to realize human-computer interaction. Over the past half century, scholars have studied a large number of problems in speech recognition, i.e., how to convert a speech sequence into text. Although considerable progress has been made in speech recognition, machines still cannot understand the emotional state of the speaker, so there is still a long way to go before humans and machines can interact naturally. This has driven research in another direction: how to identify the emotional state of the speaker from speech, i.e., speech emotion recognition.
As an important branch of human-computer interaction, speech emotion recognition can be widely applied in education, medical care, transportation and many other fields. In vehicle-mounted systems, it can monitor the mental state of the driver and judge whether the driver is in a safe state, so that the driver can be alerted when fatigued and traffic accidents can be avoided. In telephone customer service, calls from users who express themselves aggressively can be prioritized and transferred to human agents, optimizing the user experience and improving the overall service level. In clinical medicine, speech emotion recognition can track the emotional changes of patients with depression or of autistic children, serving as a tool for diagnosis and auxiliary treatment. In robotics research, speech information helps a robot understand human emotions and respond in a friendly and intelligent way, realizing natural interaction.
Most current speech emotion recognition methods follow the traditional approach of extracting features and then classifying them with a classifier. Commonly used speech features include pitch, speaking rate and intensity (prosodic features), linear prediction cepstral coefficients and Mel-frequency cepstral coefficients (spectral features). Commonly used classification methods include hidden Markov models, support vector machines and Gaussian mixture models. Traditional emotion recognition methods have matured, but certain shortcomings remain. For example, it is still unclear which kind of feature has the greatest influence on emotion recognition, and most experiments rely on only one kind of feature as the basis for judgment, which reduces the objectivity of emotion recognition. In addition, some existing features, such as prosodic features like pitch and speaking rate, are strongly affected by the speaker's style, which increases the complexity of recognition.
With the recent development of deep learning, many researchers have chosen to train network models to perform emotion recognition. Existing speech emotion recognition methods of this kind mainly include methods based on deep belief networks, methods based on long short-term memory networks, and methods based on convolutional neural networks. The main drawback of these three kinds of methods is that they cannot combine the advantages of the different network models. For example, a deep belief network can take a one-dimensional sequence as input, but cannot exploit the correlation between earlier and later parts of the sequence; a long short-term memory network can exploit this temporal correlation, but the extracted features have a high dimensionality; a convolutional neural network cannot process a speech sequence directly, and the speech signal must first be Fourier-transformed into a spectrum before being used as input. Traditional speech emotion recognition methods have limited prospects for further development in feature extraction and classification, while existing deep-learning-based speech emotion methods each rely on a single type of network.
The content of the invention
The technical problem to be solved by the invention is to overcome the deficiencies of the prior art and to provide a speech emotion recognition method based on a long short-term memory network and a convolutional neural network, which avoids the complex process of manually extracting and screening features and obtains the best emotion recognition result by adaptively adjusting the network parameters through training.
The present invention adopts the following technical scheme to solve the above technical problem:
A speech emotion recognition method based on a long short-term memory network and a convolutional neural network according to the present invention comprises the following steps:
Step A: preprocess the speech samples in the speech emotion database so that every speech sample can be represented by a sequence of equal length, thereby obtaining preprocessed speech sequences;
Step B: build a speech emotion recognition system based on a long short-term memory network (LSTM) and a convolutional neural network (CNN), which includes two basic modules: an LSTM module and a CNN module;
Step C: feed the preprocessed speech sequences into the speech emotion recognition system and train it repeatedly, adjusting the parameters of the LSTM and CNN with the back-propagation algorithm to obtain the optimized network model;
Step D: use the network model trained in step C to classify the emotion of newly input speech sequences into six emotions: sadness, happiness, disgust, fear, surprise and neutral.
As a further refinement of the speech emotion recognition method based on a long short-term memory network and a convolutional neural network of the present invention, the LSTM module in step B is constructed as follows:
B1.1: let the length of the speech sample sequence be m, where m = n × n and n is a positive integer; let the outputs of the forget gate unit and the input gate unit at the current time step be ft and it respectively, satisfying:
ft=σ (Wf·xc+bf)
it=σ (Wi·xc+bi)
where xc=[ht-1,xt] is the new vector obtained by concatenating the two vectors ht-1 and xt end to end, xt is the input at the current time step, ht-1 is the state of the hidden layer at the previous time step, Wf and Wi are the weight matrices of the forget gate unit and the input gate unit respectively, bf and bi are the bias vectors of the forget gate unit and the input gate unit respectively, and σ(·) is the sigmoid activation function;
B1.2: compute the value of the current cell state Ct by the following formula:
Ct=ft*Ct-1+it*C̃t
where Ct-1 is the cell state at the previous time step, C̃t=tanh(WC·xc+bC) is the candidate value of the cell state at the current time step, WC is the weight matrix of the cell state, bC is the bias vector of the cell state, and tanh(·) is the hyperbolic tangent function;
B1.3: obtain the output ht of each hidden node according to the following formulas, and connect the ht in sequence to form an m-dimensional feature vector;
ht=ot*tanh(Ct)
ot=σ (Wo·[ht-1,xt]+bo)
where Wo is the weight matrix of the output gate unit, bo is the bias vector of the output gate unit, and ot is the output of the output gate unit.
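The gate equations in steps B1.1 to B1.3 can be illustrated with a minimal NumPy sketch of a single LSTM time step; the array shapes, the function name lstm_step and the layout of the parameters are illustrative assumptions, not part of the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step following steps B1.1-B1.3 (shapes and names are assumptions)."""
    x_c = np.concatenate([h_prev, x_t])      # x_c = [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ x_c + b_f)           # forget gate output
    i_t = sigmoid(W_i @ x_c + b_i)           # input gate output
    C_tilde = np.tanh(W_C @ x_c + b_C)       # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state
    o_t = sigmoid(W_o @ x_c + b_o)           # output gate output
    h_t = o_t * np.tanh(C_t)                 # hidden-layer output
    return h_t, C_t
```

With a hidden size of 1, running this step over the m sampling points of an utterance and concatenating the outputs yields the m-dimensional feature vector described in step B1.3.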
As a further refinement of the speech emotion recognition method based on a long short-term memory network and a convolutional neural network of the present invention, the CNN module in step B is constructed as follows:
B2.1: convert the m-dimensional feature vector extracted in step B1.3 into an n × n feature matrix as the input of the convolutional neural network;
B2.2: the first layer of the convolutional neural network is a convolutional layer; m1 convolution kernels of size k1×k1 are applied to the input data with a convolution stride of s1, and the result of the convolution is passed through an activation function for nonlinear mapping, giving the output of the convolutional layer, i.e., m1 feature maps of size l1×l1;
B2.3: the second layer of the convolutional neural network is a pooling layer; m2 kernels of size k2×k2 are used to pool, with stride s2, the feature maps output by the first convolutional layer, giving the output of the pooling layer, i.e., m2 feature maps of size l2×l2;
B2.4: the third layer of the convolutional neural network is a convolutional layer; m3 convolution kernels of size k3×k3 are applied to the feature maps output by the second (pooling) layer, and the result of the convolution is passed through an activation function for nonlinear mapping, giving the output of the convolutional layer, i.e., m3 feature maps of size l3×l3;
B2.5: the fourth layer of the convolutional neural network is a convolutional layer; m4 convolution kernels of size k4×k4 are applied to the feature maps output by the third convolutional layer, and the result of the convolution is passed through an activation function for nonlinear mapping, giving the output of the convolutional layer, i.e., m4 feature maps of size l4×l4;
B2.6: the fifth layer of the convolutional neural network is a convolutional layer; m5 convolution kernels of size k5×k5 are applied to the feature maps output by the fourth convolutional layer, and the result of the convolution is passed through an activation function for nonlinear mapping, giving the output of the convolutional layer, i.e., m5 feature maps of size l5×l5;
B2.7: the sixth layer of the convolutional neural network is a pooling layer; m6 kernels of size k6×k6 are used to pool, with stride s6, the feature maps output by the fifth convolutional layer, giving the output of the pooling layer, i.e., m6 feature maps of size l6×l6;
B2.8: the seventh, eighth and ninth layers of the convolutional neural network are fully connected layers; the seventh layer fully connects the feature maps output by the sixth (pooling) layer to the c nodes of that layer; the eighth layer applies a ReLU nonlinear transformation to the c nodes of the seventh layer and then uses the dropout method to control the connection weights of the hidden nodes, with c fully connected nodes; the ninth fully connected layer has p output nodes, and its output is the softmax loss incorporating the feature labels. A size-bookkeeping sketch for these layers is given below.
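The relationship between the kernel sizes ki, strides si and output sizes li in steps B2.2 to B2.7 is not written out in the text. A small helper (conv_out_size is an assumed name) reproduces the sizes quoted in the concrete embodiment described later, assuming the standard formula l = floor((n + 2·pad - k)/s) + 1 and a padding of 2 for the 5×5 convolutions with edge expansion.

```python
def conv_out_size(n, k, s, pad=0):
    """Output side length of a square conv/pool layer: floor((n + 2*pad - k)/s) + 1."""
    return (n + 2 * pad - k) // s + 1

# Sizes from the concrete embodiment (128 x 128 input matrix):
assert conv_out_size(128, k=11, s=3) == 40        # layer 1: 96 kernels, 11x11, stride 3
assert conv_out_size(40, k=4, s=3) == 13          # layer 2: pooling, 4x4, stride 3
assert conv_out_size(13, k=5, s=1, pad=2) == 13   # layers 3-5: 5x5 convs with edge padding
assert conv_out_size(13, k=3, s=2) == 6           # layer 6: pooling, 3x3, stride 2
```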
As a further refinement of the speech emotion recognition method based on a long short-term memory network and a convolutional neural network of the present invention, the softmax loss function J(θ) of the convolutional neural network in step B is defined as follows:
J(θ) = -(1/q) [ Σ_{i=1..q} Σ_{j=1..p} 1{y(i)=j} · log( e^(θj^T x(i)) / Σ_{l=1..p} e^(θl^T x(i)) ) ]
where x(i) is an input vector, y(i) is the emotion category corresponding to the input vector, i=1,2,…,q, q is the number of speech samples; θj are the parameters to be trained, j=1,2,…,p, p is the number of emotion categories, T denotes transposition, and e is the base of the natural logarithm; 1{·} is the indicator function, whose value is 1 when the expression inside the braces is true and 0 otherwise.
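A minimal NumPy sketch of J(θ) as defined above; the variable names and the max-shift used for numerical stability are implementation assumptions.

```python
import numpy as np

def softmax_loss(theta, X, y):
    """J(theta) for q samples X (q x d), labels y in {0..p-1}, parameters theta (p x d)."""
    scores = X @ theta.T                              # theta_j^T x^(i) for all i, j
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability (assumption)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    q = X.shape[0]
    # The indicator 1{y(i)=j} picks out the log-probability of the true class of each sample.
    return -log_probs[np.arange(q), y].mean()
```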
As a further refinement of the speech emotion recognition method based on a long short-term memory network and a convolutional neural network of the present invention, the tanh function is expressed as tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)), and the sigmoid function is expressed as σ(x) = 1/(1 + e^(-x)), where x is a variable.
Compared with the prior art, the present invention, by adopting the above technical scheme, has the following technical effects:
(1) the complex process of manually extracting and screening features is avoided, and the best emotion recognition result is obtained by adaptively adjusting the network parameters through training;
(2) the speech emotion recognition method based on LSTM and CNN fuses two different network models: the LSTM can process the speech sequence directly while exploiting the temporal correlation between earlier and later parts of the sequence, and the CNN reduces the interference of noise while learning more abstract features, improving the accuracy and robustness of emotion recognition.
Brief description of the drawings
Fig. 1 is the flow chart of the speech emotion recognition method based on LSTM and CNN of the invention.
Fig. 2 is the basic framework diagram of the constructed speech emotion recognition system based on LSTM and CNN.
Fig. 3 is the basic framework diagram of the LSTM module in the speech emotion recognition system.
Fig. 4 is the basic framework diagram of the CNN module in the speech emotion recognition system.
Specific embodiment
In order to make the object, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the flow chart of the speech emotion recognition method based on LSTM and CNN of the invention, the implementation of the method mainly comprises the following steps:
Step 1: select a suitable speech emotion database and collect the speech segments in it.
In practice, the AFEW database is selected. This database provides raw video clips taken from films. Compared with conventional laboratory databases, the speech and emotional expression in the AFEW database are closer to real-life situations and therefore more general. The ages of the subjects range from 1 to 70 years, covering all age groups and including a large number of children and adolescents, so the database can subsequently also be used for emotion recognition of young subjects. The samples in the database are divided into six classes: sadness, happiness, disgust, fear, surprise and neutral, labelled 1 to 6. The speech segments in the videos are taken as the sample set, with a sampling rate of 48 kHz.
Step 2: read the speech sample data and unify the sample sequence length.
Because the speech samples differ in duration, and considering that the useful information is concentrated mainly in the middle region of the speech sequence, in practice 16384 sampling points around the midpoint of each speech sequence are chosen to represent the whole utterance. The speech samples are then randomly divided into a training set and a validation set in a 7:3 ratio, and the speech sequences and labels of each sample set are stored as pkl files. A minimal preprocessing sketch is given below.
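A minimal sketch of this preprocessing step; the file names, the zero-padding of short signals and the function names (center_crop, split_and_save) are assumptions, not specified by the patent.

```python
import pickle
import numpy as np

SEQ_LEN = 16384  # sampling points kept around the midpoint of each utterance

def center_crop(signal, length=SEQ_LEN):
    """Take `length` samples centred on the midpoint; zero-pad short signals (assumption)."""
    if len(signal) < length:
        signal = np.pad(signal, (0, length - len(signal)))
    start = (len(signal) - length) // 2
    return signal[start:start + length]

def split_and_save(sequences, labels, train_path="train.pkl", valid_path="valid.pkl"):
    """Random 7:3 split into training and validation sets, stored as pkl files."""
    idx = np.random.permutation(len(sequences))
    cut = int(0.7 * len(sequences))
    with open(train_path, "wb") as f:
        pickle.dump((sequences[idx[:cut]], labels[idx[:cut]]), f)
    with open(valid_path, "wb") as f:
        pickle.dump((sequences[idx[cut:]], labels[idx[cut:]]), f)
```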
Step 3: build the speech emotion recognition system, train the long short-term memory network with the speech sequences as input, and obtain the outputs of the hidden layer. Fig. 2 is the basic framework diagram of the constructed speech emotion recognition system based on LSTM and CNN, showing the whole process by which a speech sample is classified into an emotion; the system mainly contains the two basic modules, LSTM and CNN. Fig. 3 is the basic framework diagram of the LSTM module in the speech emotion recognition system, showing the internal structure of the LSTM network unit and the relations between the hidden-layer state and the gate units. Fig. 4 is the basic framework diagram of the CNN module in the speech emotion recognition system, showing the process by which the feature matrix is turned, through convolution, pooling and fully connected operations, into a vector containing label information.
Let x0, x1, x2, …, xt, … denote the input speech sequence and h0, h1, h2, …, ht, … denote the states of the hidden nodes. xc=[ht-1,xt] denotes the vector obtained by concatenating the hidden-layer state of the previous time step with the input of the current time step. Let the outputs of the forget gate unit and the input gate unit at time t be ft and it respectively; their values are computed as follows:
ft=σ (Wf·xc+bf) (1)
it=σ (Wi·xc+bi) (2)
The value of the cell state is calculated by the following formula:
Ct=ft*Ct-1+it*C̃t, with C̃t=tanh(WC·xc+bC) (3)
The output of the network module is determined by the current cell state; it is a filtered version of the cell value. The cell state is first passed through a tanh function so that its range is kept between -1 and 1, and the result is then multiplied by the output value ot of a sigmoid unit, which determines the output ht of the hidden layer:
ot=σ (Wo·[ht-1,xt]+bo) (4)
ht=ot*tanh(Ct) (5)
After the output ht of each hidden node is obtained, the outputs are connected in sequence to form a feature vector of length 16384, which is then converted into a 128 × 128 feature matrix. A sketch of this step is given below.
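Reusing the lstm_step function from the sketch after step B1.3, the following illustrates how the 16384 hidden outputs could be collected and reshaped into the 128 × 128 feature matrix; treating each hidden output as a scalar (hidden size 1) is an assumption consistent with a 16384-dimensional feature vector.

```python
import numpy as np

def lstm_features(x_seq, params, d_h=1):
    """Run lstm_step over the whole sequence and reshape the outputs to 128 x 128."""
    h = np.zeros(d_h)
    C = np.zeros(d_h)
    outputs = []
    for x_t in x_seq:                       # x_seq contains 16384 sampling points
        h, C = lstm_step(np.atleast_1d(x_t), h, C, *params)
        outputs.append(h[0])                # one scalar hidden output per time step (assumption)
    feat = np.asarray(outputs)              # 16384-dimensional feature vector
    return feat.reshape(128, 128)           # 128 x 128 feature matrix fed to the CNN
```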
Step 4: train the convolutional neural network with the feature matrix as input. The specific steps are as follows:
The first layer is a convolutional layer: 96 convolution kernels of size 11 × 11 are applied to the input data with a convolution stride of 3; the convolution operation strengthens the signal features and reduces noise. After convolution, 96 feature maps of size 40 × 40 are generated.
The second layer is a pooling layer: using kernels of size 4 × 4, the feature maps generated by the first convolutional layer are pooled with a stride of 3, generating 96 feature maps of size 13 × 13.
The third layer is a convolutional layer: 256 convolution kernels of size 5 × 5 are applied to the feature maps generated by the second layer; the edges are expanded by padding so that the feature maps do not shrink during convolution. After the nonlinear transformation, 256 feature maps of size 13 × 13 are generated.
The fourth layer is a convolutional layer: 384 convolution kernels of size 5 × 5 are applied to the feature maps generated by the third layer, again with edge padding; after the nonlinear transformation, 384 feature maps of size 13 × 13 are generated.
The fifth layer is a convolutional layer: 256 convolution kernels of size 5 × 5 are used, again with edge padding; after the nonlinear mapping, 256 feature maps of size 13 × 13 are generated.
The sixth layer is a pooling layer: using kernels of size 3 × 3, the feature maps generated by the fifth convolutional layer are pooled with a stride of 2, generating 256 feature maps of size 6 × 6.
The seventh, eighth and ninth layers are fully connected layers. The seventh layer fully connects the feature maps generated by the sixth layer to 4096 nodes. The eighth layer applies a ReLU nonlinear transformation to the nodes of the seventh layer and then uses the dropout method to control the working weights of the hidden nodes: in each training iteration the dropout method randomly drops some hidden nodes, which are temporarily treated as not being part of the network structure, but their weights are retained, so that only part of the parameters are adjusted each time. The eighth layer has 4096 fully connected nodes. The ninth fully connected layer has 6 output nodes, and its output is the softmax loss incorporating the feature labels. A sketch of this architecture is given below.
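A hedged PyTorch sketch of the nine-layer structure described above; the patent does not name a framework, the dropout probability of 0.5 and the padding of 2 are assumptions, and layer 8 is read here as ReLU plus dropout over the 4096 nodes of layer 7.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=11, stride=3), nn.ReLU(),    # layer 1: 96 maps, 40x40
            nn.MaxPool2d(kernel_size=4, stride=3),                    # layer 2: 96 maps, 13x13
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),  # layer 3: 256 maps, 13x13
            nn.Conv2d(256, 384, kernel_size=5, padding=2), nn.ReLU(), # layer 4: 384 maps, 13x13
            nn.Conv2d(384, 256, kernel_size=5, padding=2), nn.ReLU(), # layer 5: 256 maps, 13x13
            nn.MaxPool2d(kernel_size=3, stride=2),                    # layer 6: 256 maps, 6x6
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096),                             # layer 7: 4096 nodes
            nn.ReLU(), nn.Dropout(p=0.5),                             # layer 8: ReLU + dropout
            nn.Linear(4096, num_classes),                             # layer 9: 6 output nodes
        )

    def forward(self, x):                      # x: (batch, 1, 128, 128) feature matrices
        x = self.features(x)
        return self.classifier(x.flatten(1))   # logits; softmax loss via nn.CrossEntropyLoss
```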
Step 5: adjust the parameters of the LSTM and CNN in the system with the back-propagation algorithm, select the best network model and save its parameters.
Step 6: feed the test-set samples into the optimal network model and perform emotion recognition on them with the trained network.
The softmax loss function J(θ) of the convolutional neural network is defined as follows:
J(θ) = -(1/q) [ Σ_{i=1..q} Σ_{j=1..p} 1{y(i)=j} · log( e^(θj^T x(i)) / Σ_{l=1..p} e^(θl^T x(i)) ) ]
where x(i) is an input vector, y(i) is the emotion category corresponding to the input vector, i=1,2,…,q, q is the number of speech samples; θj are the parameters to be trained, j=1,2,…,p, p is the number of emotion categories, T denotes transposition, and e is the base of the natural logarithm; 1{·} is the indicator function, whose value is 1 when the expression inside the braces is true and 0 otherwise. As the number of training iterations increases, the value of the loss function keeps decreasing; when the loss function tends to stability, the corresponding θj are the parameters of the optimized network model. A minimal training and evaluation sketch follows.
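A minimal training and evaluation loop consistent with steps 5 and 6 (back-propagation training, keeping the best model, then emotion recognition on a held-out set); the optimizer choice, learning rate, number of epochs and DataLoader usage are assumptions, and EmotionCNN refers to the sketch above.

```python
import torch
import torch.nn as nn

def train_and_evaluate(model, train_loader, valid_loader, epochs=30, lr=1e-3):
    """Back-propagation training; keeps the parameters of the best validation model."""
    criterion = nn.CrossEntropyLoss()                  # softmax loss J(theta)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    best_acc, best_state = 0.0, None
    for epoch in range(epochs):
        model.train()
        for feats, labels in train_loader:             # feats: (batch, 1, 128, 128)
            optimizer.zero_grad()
            loss = criterion(model(feats), labels)
            loss.backward()                            # back-propagation
            optimizer.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for feats, labels in valid_loader:
                preds = model(feats).argmax(dim=1)     # predicted emotion class
                correct += (preds == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if acc > best_acc:                             # keep the best network parameters
            best_acc = acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_acc
```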
In the present invention, the tanh function (hyperbolic tangent function) is expressed as tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)), the ReLU function (rectified linear unit function) is expressed as f(x) = max(0, x), and the sigmoid function (S-shaped growth curve) is expressed as σ(x) = 1/(1 + e^(-x)), where x is a variable.

Claims (5)

1. A speech emotion recognition method based on a long short-term memory network and a convolutional neural network, characterised by comprising the following steps:
Step A: preprocessing the speech samples in a speech emotion database so that every speech sample can be represented by a sequence of equal length, thereby obtaining preprocessed speech sequences;
Step B: building a speech emotion recognition system based on a long short-term memory network (LSTM) and a convolutional neural network (CNN), which comprises two basic modules: an LSTM module and a CNN module;
Step C: feeding the preprocessed speech sequences into the speech emotion recognition system and training it repeatedly, adjusting the parameters of the LSTM and CNN with the back-propagation algorithm to obtain the optimized network model;
Step D: using the network model trained in step C to classify the emotion of newly input speech sequences into six emotions: sadness, happiness, disgust, fear, surprise and neutral.
2. The speech emotion recognition method based on a long short-term memory network and a convolutional neural network according to claim 1, characterised in that the LSTM module in step B is constructed as follows:
B1.1: let the length of the speech sample sequence be m, where m = n × n and n is a positive integer; let the outputs of the forget gate unit and the input gate unit at the current time step be ft and it respectively, satisfying:
ft=σ(Wf·xc+bf)
it=σ(Wi·xc+bi)
where xc=[ht-1,xt] is the new vector obtained by concatenating the two vectors ht-1 and xt end to end, xt is the input at the current time step, ht-1 is the state of the hidden layer at the previous time step, Wf and Wi are the weight matrices of the forget gate unit and the input gate unit respectively, bf and bi are the bias vectors of the forget gate unit and the input gate unit respectively, and σ(·) is the sigmoid activation function;
B1.2: compute the value of the current cell state Ct by the following formula:
Ct=ft*Ct-1+it*C̃t
where Ct-1 is the cell state at the previous time step, C̃t=tanh(WC·xc+bC) is the candidate value of the cell state at the current time step, WC is the weight matrix of the cell state, bC is the bias vector of the cell state, and tanh(·) is the hyperbolic tangent function;
B1.3: obtain the output ht of each hidden node according to the following formulas, and connect the ht in sequence to form an m-dimensional feature vector;
ht=ot*tanh(Ct)
ot=σ(Wo·[ht-1,xt]+bo)
where Wo is the weight matrix of the output gate unit, bo is the bias vector of the output gate unit, and ot is the output of the output gate unit.
3. The speech emotion recognition method based on a long short-term memory network and a convolutional neural network according to claim 2, characterised in that the CNN module in step B is constructed as follows:
B2.1: convert the m-dimensional feature vector extracted in step B1.3 into an n × n feature matrix as the input of the convolutional neural network;
B2.2: the first layer of the convolutional neural network is a convolutional layer; m1 convolution kernels of size k1×k1 are applied to the input data with a convolution stride of s1, and the result of the convolution is passed through an activation function for nonlinear mapping, giving the output of the convolutional layer, i.e., m1 feature maps of size l1×l1;
B2.3: the second layer of the convolutional neural network is a pooling layer; m2 kernels of size k2×k2 are used to pool, with stride s2, the feature maps output by the first convolutional layer, giving the output of the pooling layer, i.e., m2 feature maps of size l2×l2;
B2.4: the third layer of the convolutional neural network is a convolutional layer; m3 convolution kernels of size k3×k3 are applied to the feature maps output by the second (pooling) layer, and the result of the convolution is passed through an activation function for nonlinear mapping, giving the output of the convolutional layer, i.e., m3 feature maps of size l3×l3;
B2.5: the fourth layer of the convolutional neural network is a convolutional layer; m4 convolution kernels of size k4×k4 are applied to the feature maps output by the third convolutional layer, and the result of the convolution is passed through an activation function for nonlinear mapping, giving the output of the convolutional layer, i.e., m4 feature maps of size l4×l4;
B2.6: the fifth layer of the convolutional neural network is a convolutional layer; m5 convolution kernels of size k5×k5 are applied to the feature maps output by the fourth convolutional layer, and the result of the convolution is passed through an activation function for nonlinear mapping, giving the output of the convolutional layer, i.e., m5 feature maps of size l5×l5;
B2.7: the sixth layer of the convolutional neural network is a pooling layer; m6 kernels of size k6×k6 are used to pool, with stride s6, the feature maps output by the fifth convolutional layer, giving the output of the pooling layer, i.e., m6 feature maps of size l6×l6;
B2.8: the seventh, eighth and ninth layers of the convolutional neural network are fully connected layers; the seventh layer fully connects the feature maps output by the sixth (pooling) layer to the c nodes of that layer; the eighth layer applies a ReLU nonlinear transformation to the c nodes of the seventh layer and then uses the dropout method to control the connection weights of the hidden nodes, with c fully connected nodes; the ninth fully connected layer has p output nodes, and its output is the softmax loss incorporating the feature labels.
4. The speech emotion recognition method based on a long short-term memory network and a convolutional neural network according to claim 3, characterised in that the softmax loss function J(θ) of the convolutional neural network in step B is defined as follows:
J(θ) = -(1/q) [ Σ_{i=1..q} Σ_{j=1..p} 1{y(i)=j} · log( e^(θj^T x(i)) / Σ_{l=1..p} e^(θl^T x(i)) ) ]
where x(i) is an input vector, y(i) is the emotion category corresponding to the input vector, i=1,2,…,q, q is the number of speech samples; θj are the parameters to be trained, j=1,2,…,p, p is the number of emotion categories, T denotes transposition, and e is the base of the natural logarithm; 1{·} is the indicator function, whose value is 1 when the expression inside the braces is true and 0 otherwise.
5. The speech emotion recognition method based on a long short-term memory network and a convolutional neural network according to claim 3, characterised in that the tanh function is expressed as tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)) and the sigmoid function is expressed as σ(x) = 1/(1 + e^(-x)), where x is a variable.
CN201611093447.3A 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network Active CN106782602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611093447.3A CN106782602B (en) 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611093447.3A CN106782602B (en) 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network

Publications (2)

Publication Number Publication Date
CN106782602A true CN106782602A (en) 2017-05-31
CN106782602B CN106782602B (en) 2020-03-17

Family

ID=58913860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611093447.3A Active CN106782602B (en) 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network

Country Status (1)

Country Link
CN (1) CN106782602B (en)

Cited By (91)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274378A (en) * 2017-07-25 2017-10-20 江西理工大学 A kind of image blurring type identification and parameter tuning method for merging memory CNN
CN107293288A (en) * 2017-06-09 2017-10-24 清华大学 A kind of residual error shot and long term remembers the acoustic model modeling method of Recognition with Recurrent Neural Network
CN107293290A (en) * 2017-07-31 2017-10-24 郑州云海信息技术有限公司 The method and apparatus for setting up Speech acoustics model
CN107392109A (en) * 2017-06-27 2017-11-24 南京邮电大学 A kind of neonatal pain expression recognition method based on deep neural network
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system
CN107506414A (en) * 2017-08-11 2017-12-22 武汉大学 A kind of code based on shot and long term memory network recommends method
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning
CN107562784A (en) * 2017-07-25 2018-01-09 同济大学 Short text classification method based on ResLCNN models
CN107633842A (en) * 2017-06-12 2018-01-26 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107679557A (en) * 2017-09-19 2018-02-09 平安科技(深圳)有限公司 Driving model training method, driver's recognition methods, device, equipment and medium
CN107679199A (en) * 2017-10-11 2018-02-09 北京邮电大学 A kind of external the Chinese text readability analysis method based on depth local feature
CN107703564A (en) * 2017-10-13 2018-02-16 中国科学院深圳先进技术研究院 A kind of precipitation predicting method, system and electronic equipment
CN107785011A (en) * 2017-09-15 2018-03-09 北京理工大学 Word speed estimates training, word speed method of estimation, device, equipment and the medium of model
CN107818307A (en) * 2017-10-31 2018-03-20 天津大学 A kind of multi-tag Video Events detection method based on LSTM networks
CN107862331A (en) * 2017-10-31 2018-03-30 华中科技大学 It is a kind of based on time series and CNN unsafe acts recognition methods and system
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN107992938A (en) * 2017-11-24 2018-05-04 清华大学 Space-time big data Forecasting Methodology and system based on positive and negative convolutional neural networks
CN108039181A (en) * 2017-11-02 2018-05-15 北京捷通华声科技股份有限公司 The emotion information analysis method and device of a kind of voice signal
CN108280406A (en) * 2017-12-30 2018-07-13 广州海昇计算机科技有限公司 A kind of Activity recognition method, system and device based on segmentation double-stream digestion
CN108304823A (en) * 2018-02-24 2018-07-20 重庆邮电大学 A kind of expression recognition method based on two-fold product CNN and long memory network in short-term
CN108346436A (en) * 2017-08-22 2018-07-31 腾讯科技(深圳)有限公司 Speech emotional detection method, device, computer equipment and storage medium
CN108520753A (en) * 2018-02-26 2018-09-11 南京工程学院 Voice lie detection method based on the two-way length of convolution memory network in short-term
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
CN108630199A (en) * 2018-06-30 2018-10-09 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of acoustic model
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN108766419A (en) * 2018-05-04 2018-11-06 华南理工大学 A kind of abnormal speech detection method based on deep learning
CN108764268A (en) * 2018-04-02 2018-11-06 华南理工大学 A kind of multi-modal emotion identification method of picture and text based on deep learning
CN108806667A (en) * 2018-05-29 2018-11-13 重庆大学 The method for synchronously recognizing of voice and mood based on neural network
CN108806698A (en) * 2018-03-15 2018-11-13 中山大学 A kind of camouflage audio recognition method based on convolutional neural networks
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
CN108922617A (en) * 2018-06-26 2018-11-30 电子科技大学 A kind of self-closing disease aided diagnosis method neural network based
WO2018227169A1 (en) * 2017-06-08 2018-12-13 Newvoicemedia Us Inc. Optimal human-machine conversations using emotion-enhanced natural speech
CN109003625A (en) * 2018-07-27 2018-12-14 中国科学院自动化研究所 Speech-emotion recognition method and system based on ternary loss
CN109034034A (en) * 2018-07-12 2018-12-18 广州麦仑信息科技有限公司 A kind of vein identification method based on nitrification enhancement optimization convolutional neural networks
CN109036467A (en) * 2018-10-26 2018-12-18 南京邮电大学 CFFD extracting method, speech-emotion recognition method and system based on TF-LSTM
CN109036459A (en) * 2018-08-22 2018-12-18 百度在线网络技术(北京)有限公司 Sound end detecting method, device, computer equipment, computer storage medium
CN109087635A (en) * 2018-08-30 2018-12-25 湖北工业大学 A kind of speech-sound intelligent classification method and system
CN109147826A (en) * 2018-08-22 2019-01-04 平安科技(深圳)有限公司 Music emotion recognition method, device, computer equipment and computer storage medium
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109190514A (en) * 2018-08-14 2019-01-11 电子科技大学 Face character recognition methods and system based on two-way shot and long term memory network
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Based on the vagitus emotion identification method for improving long memory network in short-term
CN109243494A (en) * 2018-10-30 2019-01-18 南京工程学院 Childhood emotional recognition methods based on the long memory network in short-term of multiple attention mechanism
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver's Emotion identification method and terminal device
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109282837A (en) * 2018-10-24 2019-01-29 福州大学 Bragg grating based on LSTM network interlocks the demodulation method of spectrum
CN109346107A (en) * 2018-10-10 2019-02-15 中山大学 A method of independent speaker's sound pronunciation based on LSTM is inverse to be solved
CN109389992A (en) * 2018-10-18 2019-02-26 天津大学 A kind of speech-emotion recognition method based on amplitude and phase information
WO2019037382A1 (en) * 2017-08-24 2019-02-28 平安科技(深圳)有限公司 Emotion recognition-based voice quality inspection method and device, equipment and storage medium
CN109426858A (en) * 2017-08-29 2019-03-05 京东方科技集团股份有限公司 Neural network, training method, image processing method and image processing apparatus
CN109567793A (en) * 2018-11-16 2019-04-05 西北工业大学 A kind of ECG signal processing method towards cardiac arrhythmia classification
CN109637545A (en) * 2019-01-17 2019-04-16 哈尔滨工程大学 Based on one-dimensional convolution asymmetric double to the method for recognizing sound-groove of long memory network in short-term
CN109754790A (en) * 2017-11-01 2019-05-14 中国科学院声学研究所 A kind of speech recognition system and method based on mixing acoustic model
CN110096587A (en) * 2019-01-11 2019-08-06 杭州电子科技大学 The fine granularity sentiment classification model of LSTM-CNN word insertion based on attention mechanism
CN110179453A (en) * 2018-06-01 2019-08-30 山东省计算中心(国家超级计算济南中心) Electrocardiogram classification method based on convolutional neural networks and shot and long term memory network
WO2019179036A1 (en) * 2018-03-19 2019-09-26 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity authentication method, and storage medium
CN110322900A (en) * 2019-06-25 2019-10-11 深圳市壹鸽科技有限公司 A kind of method of phonic signal character fusion
CN110363751A (en) * 2019-07-01 2019-10-22 浙江大学 A kind of big enteroscope polyp detection method based on generation collaborative network
WO2019232883A1 (en) * 2018-06-07 2019-12-12 平安科技(深圳)有限公司 Insurance product pushing method and device, computer device and storage medium
CN110600018A (en) * 2019-09-05 2019-12-20 腾讯科技(深圳)有限公司 Voice recognition method and device and neural network training method and device
CN110738852A (en) * 2019-10-23 2020-01-31 浙江大学 intersection steering overflow detection method based on vehicle track and long and short memory neural network
CN110929762A (en) * 2019-10-30 2020-03-27 中国科学院自动化研究所南京人工智能芯片创新研究院 Method and system for detecting body language and analyzing behavior based on deep learning
CN110956953A (en) * 2019-11-29 2020-04-03 中山大学 Quarrel identification method based on audio analysis and deep learning
CN111028859A (en) * 2019-12-15 2020-04-17 中北大学 Hybrid neural network vehicle type identification method based on audio feature fusion
CN111179910A (en) * 2019-12-17 2020-05-19 深圳追一科技有限公司 Speed of speech recognition method and apparatus, server, computer readable storage medium
CN111210844A (en) * 2020-02-03 2020-05-29 北京达佳互联信息技术有限公司 Method, device and equipment for determining speech emotion recognition model and storage medium
CN111222624A (en) * 2018-11-26 2020-06-02 深圳云天励飞技术有限公司 Parallel computing method and device
CN111310672A (en) * 2020-02-19 2020-06-19 广州数锐智能科技有限公司 Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN111326178A (en) * 2020-02-27 2020-06-23 长沙理工大学 Multi-mode speech emotion recognition system and method based on convolutional neural network
CN111524535A (en) * 2020-04-30 2020-08-11 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
JP2020134719A (en) * 2019-02-20 2020-08-31 ソフトバンク株式会社 Translation device, translation method, and translation program
CN111709284A (en) * 2020-05-07 2020-09-25 西安理工大学 Dance emotion recognition method based on CNN-LSTM
CN111883252A (en) * 2020-07-29 2020-11-03 济南浪潮高新科技投资发展有限公司 Auxiliary diagnosis method, device, equipment and storage medium for infantile autism
CN112001482A (en) * 2020-08-14 2020-11-27 佳都新太科技股份有限公司 Vibration prediction and model training method and device, computer equipment and storage medium
CN112037822A (en) * 2020-07-30 2020-12-04 华南师范大学 Voice emotion recognition method based on ICNN and Bi-LSTM
CN112101095A (en) * 2020-08-02 2020-12-18 华南理工大学 Suicide and violence tendency emotion recognition method based on language and limb characteristics
CN112187413A (en) * 2020-08-28 2021-01-05 中国人民解放军海军航空大学航空作战勤务学院 SFBC (Small form-factor Block code) identifying method and device based on CNN-LSTM (convolutional neural network-Link State transition technology)
CN112259126A (en) * 2020-09-24 2021-01-22 广州大学 Robot and method for assisting in recognizing autism voice features
CN112383369A (en) * 2020-07-23 2021-02-19 哈尔滨工业大学 Cognitive radio multi-channel spectrum sensing method based on CNN-LSTM network model
CN112446266A (en) * 2019-09-04 2021-03-05 北京君正集成电路股份有限公司 Face recognition network structure suitable for front end
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function
CN112735479A (en) * 2021-03-31 2021-04-30 南方电网数字电网研究院有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN112766292A (en) * 2019-11-04 2021-05-07 中移(上海)信息通信科技有限公司 Identity authentication method, device, equipment and storage medium
CN112819133A (en) * 2019-11-15 2021-05-18 北方工业大学 Construction method of deep hybrid neural network emotion recognition model
WO2021127982A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech emotion recognition method, smart device, and computer-readable storage medium
WO2021147363A1 (en) * 2020-01-20 2021-07-29 中国电子科技集团公司电子科学研究院 Text-based major depressive disorder recognition method
CN113221758A (en) * 2021-05-16 2021-08-06 西北工业大学 Underwater acoustic target identification method based on GRU-NIN model
CN113255800A (en) * 2021-06-02 2021-08-13 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
WO2021164346A1 (en) * 2020-02-21 2021-08-26 乐普(北京)医疗器械股份有限公司 Method and device for predicting blood pressure
CN114305418A (en) * 2021-12-16 2022-04-12 广东工业大学 Data acquisition system and method for depression state intelligent evaluation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
US20160099010A1 (en) * 2014-10-03 2016-04-07 Google Inc. Convolutional, long short-term memory, fully connected deep neural networks
CN105844239A (en) * 2016-03-23 2016-08-10 北京邮电大学 Method for detecting riot and terror videos based on CNN and LSTM
CN106096568A (en) * 2016-06-21 2016-11-09 同济大学 A kind of pedestrian's recognition methods again based on CNN and convolution LSTM network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160099010A1 (en) * 2014-10-03 2016-04-07 Google Inc. Convolutional, long short-term memory, fully connected deep neural networks
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
CN105844239A (en) * 2016-03-23 2016-08-10 北京邮电大学 Method for detecting riot and terror videos based on CNN and LSTM
CN106096568A (en) * 2016-06-21 2016-11-09 同济大学 A kind of pedestrian's recognition methods again based on CNN and convolution LSTM network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PENG CHEN ET AL: "Clause Sentiment Identification Basedon Convolutional Neural Network With Context Embedding", 《2016 12TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY》 *

Cited By (137)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018227169A1 (en) * 2017-06-08 2018-12-13 Newvoicemedia Us Inc. Optimal human-machine conversations using emotion-enhanced natural speech
CN107293288A (en) * 2017-06-09 2017-10-24 清华大学 A kind of residual error shot and long term remembers the acoustic model modeling method of Recognition with Recurrent Neural Network
CN107293288B (en) * 2017-06-09 2020-04-21 清华大学 Acoustic model modeling method of residual long-short term memory recurrent neural network
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107633842A (en) * 2017-06-12 2018-01-26 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107392109A (en) * 2017-06-27 2017-11-24 南京邮电大学 A kind of neonatal pain expression recognition method based on deep neural network
CN107274378A (en) * 2017-07-25 2017-10-20 江西理工大学 A kind of image blurring type identification and parameter tuning method for merging memory CNN
CN107274378B (en) * 2017-07-25 2020-04-03 江西理工大学 Image fuzzy type identification and parameter setting method based on fusion memory CNN
CN107562784A (en) * 2017-07-25 2018-01-09 同济大学 Short text classification method based on ResLCNN models
CN107562792A (en) * 2017-07-31 2018-01-09 同济大学 A kind of question and answer matching process based on deep learning
CN107293290A (en) * 2017-07-31 2017-10-24 郑州云海信息技术有限公司 The method and apparatus for setting up Speech acoustics model
CN107562792B (en) * 2017-07-31 2020-01-31 同济大学 question-answer matching method based on deep learning
CN107506414A (en) * 2017-08-11 2017-12-22 武汉大学 A kind of code based on shot and long term memory network recommends method
CN107506414B (en) * 2017-08-11 2020-01-07 武汉大学 Code recommendation method based on long-term and short-term memory network
US11922969B2 (en) 2017-08-22 2024-03-05 Tencent Technology (Shenzhen) Company Limited Speech emotion detection method and apparatus, computer device, and storage medium
WO2019037700A1 (en) * 2017-08-22 2019-02-28 腾讯科技(深圳)有限公司 Speech emotion detection method and apparatus, computer device, and storage medium
CN108346436A (en) * 2017-08-22 2018-07-31 腾讯科技(深圳)有限公司 Speech emotional detection method, device, computer equipment and storage medium
US11189302B2 (en) 2017-08-22 2021-11-30 Tencent Technology (Shenzhen) Company Limited Speech emotion detection method and apparatus, computer device, and storage medium
WO2019037382A1 (en) * 2017-08-24 2019-02-28 平安科技(深圳)有限公司 Emotion recognition-based voice quality inspection method and device, equipment and storage medium
CN109426858A (en) * 2017-08-29 2019-03-05 京东方科技集团股份有限公司 Neural network, training method, image processing method and image processing apparatus
CN109426858B (en) * 2017-08-29 2021-04-06 京东方科技集团股份有限公司 Neural network, training method, image processing method, and image processing apparatus
CN107785011A (en) * 2017-09-15 2018-03-09 北京理工大学 Word speed estimates training, word speed method of estimation, device, equipment and the medium of model
CN107785011B (en) * 2017-09-15 2020-07-03 北京理工大学 Training method, device, equipment and medium of speech rate estimation model and speech rate estimation method, device and equipment
CN107679557A (en) * 2017-09-19 2018-02-09 平安科技(深圳)有限公司 Driving model training method, driver's recognition methods, device, equipment and medium
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
CN107679199A (en) * 2017-10-11 2018-02-09 北京邮电大学 A kind of external the Chinese text readability analysis method based on depth local feature
CN107703564A (en) * 2017-10-13 2018-02-16 中国科学院深圳先进技术研究院 A kind of precipitation predicting method, system and electronic equipment
CN107703564B (en) * 2017-10-13 2020-04-14 中国科学院深圳先进技术研究院 Rainfall prediction method and system and electronic equipment
CN107818307A (en) * 2017-10-31 2018-03-20 天津大学 A kind of multi-tag Video Events detection method based on LSTM networks
CN107862331A (en) * 2017-10-31 2018-03-30 华中科技大学 It is a kind of based on time series and CNN unsafe acts recognition methods and system
CN107818307B (en) * 2017-10-31 2021-05-18 天津大学 Multi-label video event detection method based on LSTM network
CN109754790B (en) * 2017-11-01 2020-11-06 中国科学院声学研究所 Speech recognition system and method based on hybrid acoustic model
CN109754790A (en) * 2017-11-01 2019-05-14 中国科学院声学研究所 A kind of speech recognition system and method based on mixing acoustic model
CN108039181A (en) * 2017-11-02 2018-05-15 北京捷通华声科技股份有限公司 The emotion information analysis method and device of a kind of voice signal
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN107992938A (en) * 2017-11-24 2018-05-04 清华大学 Space-time big data Forecasting Methodology and system based on positive and negative convolutional neural networks
CN107992938B (en) * 2017-11-24 2019-05-14 清华大学 Space-time big data prediction technique and system based on positive and negative convolutional neural networks
CN108280406A (en) * 2017-12-30 2018-07-13 广州海昇计算机科技有限公司 A kind of Activity recognition method, system and device based on segmentation double-stream digestion
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
CN108597539B (en) * 2018-02-09 2021-09-03 桂林电子科技大学 Speech emotion recognition method based on parameter migration and spectrogram
CN108304823A (en) * 2018-02-24 2018-07-20 重庆邮电大学 Expression recognition method based on double-convolution CNN and long short-term memory network
CN108304823B (en) * 2018-02-24 2022-03-22 重庆邮电大学 Expression recognition method based on double-convolution CNN and long-and-short-term memory network
CN108520753A (en) * 2018-02-26 2018-09-11 南京工程学院 Voice lie detection method based on convolutional bidirectional long short-term memory network
CN108520753B (en) * 2018-02-26 2020-07-24 南京工程学院 Voice lie detection method based on convolution bidirectional long-time and short-time memory network
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 Emotion recognition method, device and computer device based on speech signals
CN108806698A (en) * 2018-03-15 2018-11-13 中山大学 Disguised voice recognition method based on convolutional neural networks
WO2019179036A1 (en) * 2018-03-19 2019-09-26 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity authentication method, and storage medium
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 Virtual robot human-computer interaction method based on user emotion recognition
CN108764268A (en) * 2018-04-02 2018-11-06 华南理工大学 Multi-modal image-text emotion recognition method based on deep learning
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 Speech emotion recognition method and system based on adjustable sensitivity
CN108564942B (en) * 2018-04-04 2021-01-26 南京师范大学 Voice emotion recognition method and system based on adjustable sensitivity
CN108766419A (en) * 2018-05-04 2018-11-06 华南理工大学 Abnormal speech detection method based on deep learning
CN108766419B (en) * 2018-05-04 2020-10-27 华南理工大学 Abnormal voice distinguishing method based on deep learning
CN108806667A (en) * 2018-05-29 2018-11-13 重庆大学 Method for synchronous recognition of speech and emotion based on neural network
CN110179453A (en) * 2018-06-01 2019-08-30 山东省计算中心(国家超级计算济南中心) Electrocardiogram classification method based on convolutional neural network and long short-term memory network
CN110179453B (en) * 2018-06-01 2020-01-03 山东省计算中心(国家超级计算济南中心) Electrocardiogram classification method based on convolutional neural network and long-short term memory network
WO2019232883A1 (en) * 2018-06-07 2019-12-12 平安科技(深圳)有限公司 Insurance product pushing method and device, computer device and storage medium
CN108717856B (en) * 2018-06-16 2022-03-08 台州学院 Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 Speech emotion recognition method based on multi-scale deep convolutional recurrent neural network
CN108922617A (en) * 2018-06-26 2018-11-30 电子科技大学 Neural-network-based autism auxiliary diagnosis method
CN108630199A (en) * 2018-06-30 2018-10-09 中国人民解放军战略支援部队信息工程大学 Data processing method for acoustic models
CN109034034A (en) * 2018-07-12 2018-12-18 广州麦仑信息科技有限公司 Vein recognition method based on convolutional neural networks optimized by a reinforcement learning algorithm
CN109003625A (en) * 2018-07-27 2018-12-14 中国科学院自动化研究所 Speech emotion recognition method and system based on triplet loss
CN109190514A (en) * 2018-08-14 2019-01-11 电子科技大学 Face attribute recognition method and system based on bidirectional long short-term memory network
CN109190514B (en) * 2018-08-14 2021-10-01 电子科技大学 Face attribute recognition method and system based on bidirectional long-short term memory network
CN109147826A (en) * 2018-08-22 2019-01-04 平安科技(深圳)有限公司 Music emotion recognition method, device, computer equipment and computer storage medium
CN109036459A (en) * 2018-08-22 2018-12-18 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer device, and computer storage medium
CN109147826B (en) * 2018-08-22 2022-12-27 平安科技(深圳)有限公司 Music emotion recognition method and device, computer equipment and computer storage medium
CN109087635A (en) * 2018-08-30 2018-12-25 湖北工业大学 Intelligent speech classification method and system
CN109285562B (en) * 2018-09-28 2022-09-23 东南大学 Voice emotion recognition method based on attention mechanism
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech emotion recognition method based on attention mechanism
CN109346107A (en) * 2018-10-10 2019-02-15 中山大学 Method for inversely solving independent-speaker pronunciation based on LSTM
CN109346107B (en) * 2018-10-10 2022-09-30 中山大学 LSTM-based method for inversely solving pronunciation of independent speaker
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver emotion recognition method and terminal device
CN109389992A (en) * 2018-10-18 2019-02-26 天津大学 Speech emotion recognition method based on amplitude and phase information
CN109282837A (en) * 2018-10-24 2019-01-29 福州大学 Demodulation method for interleaved Bragg grating spectra based on LSTM network
CN109036467A (en) * 2018-10-26 2018-12-18 南京邮电大学 CFFD extraction method, speech emotion recognition method and system based on TF-LSTM
CN109243494A (en) * 2018-10-30 2019-01-18 南京工程学院 Children's emotion recognition method based on multi-attention-mechanism long short-term memory network
CN109243494B (en) * 2018-10-30 2022-10-11 南京工程学院 Children emotion recognition method based on multi-attention mechanism long-time memory network
CN109243493B (en) * 2018-10-30 2022-09-16 南京工程学院 Infant crying emotion recognition method based on improved long-time and short-time memory network
CN109243493A (en) * 2018-10-30 2019-01-18 南京工程学院 Infant cry emotion recognition method based on an improved long short-term memory network
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 Natural interaction method for collaborative virtual learning environments based on speech emotion recognition
CN109567793A (en) * 2018-11-16 2019-04-05 西北工业大学 ECG signal processing method for arrhythmia classification
CN109567793B (en) * 2018-11-16 2021-11-23 西北工业大学 Arrhythmia classification-oriented ECG signal processing method
CN111222624A (en) * 2018-11-26 2020-06-02 深圳云天励飞技术有限公司 Parallel computing method and device
CN111222624B (en) * 2018-11-26 2022-04-29 深圳云天励飞技术股份有限公司 Parallel computing method and device
CN110096587B (en) * 2019-01-11 2020-07-07 杭州电子科技大学 Attention mechanism-based LSTM-CNN word embedded fine-grained emotion classification model
CN110096587A (en) * 2019-01-11 2019-08-06 杭州电子科技大学 Fine-grained sentiment classification model based on attention-mechanism LSTM-CNN word embedding
CN109637545B (en) * 2019-01-17 2023-05-30 哈尔滨工程大学 Voiceprint recognition method based on one-dimensional convolution asymmetric bidirectional long-short-time memory network
CN109637545A (en) * 2019-01-17 2019-04-16 哈尔滨工程大学 Voiceprint recognition method based on one-dimensional convolution asymmetric bidirectional long short-term memory network
JP2020134719A (en) * 2019-02-20 2020-08-31 ソフトバンク株式会社 Translation device, translation method, and translation program
CN110322900A (en) * 2019-06-25 2019-10-11 深圳市壹鸽科技有限公司 Speech signal feature fusion method
CN110363751A (en) * 2019-07-01 2019-10-22 浙江大学 Colonoscopic polyp detection method based on a generative collaborative network
CN112446266B (en) * 2019-09-04 2024-03-29 北京君正集成电路股份有限公司 Face recognition network structure suitable for front end
CN112446266A (en) * 2019-09-04 2021-03-05 北京君正集成电路股份有限公司 Face recognition network structure suitable for front end
CN110600018A (en) * 2019-09-05 2019-12-20 腾讯科技(深圳)有限公司 Voice recognition method and device and neural network training method and device
CN110738852B (en) * 2019-10-23 2020-12-18 浙江大学 Intersection steering overflow detection method based on vehicle track and long and short memory neural network
CN110738852A (en) * 2019-10-23 2020-01-31 浙江大学 Intersection steering overflow detection method based on vehicle trajectories and long short-term memory neural network
CN110929762A (en) * 2019-10-30 2020-03-27 中国科学院自动化研究所南京人工智能芯片创新研究院 Method and system for detecting body language and analyzing behavior based on deep learning
CN110929762B (en) * 2019-10-30 2023-05-12 中科南京人工智能创新研究院 Limb language detection and behavior analysis method and system based on deep learning
CN112766292A (en) * 2019-11-04 2021-05-07 中移(上海)信息通信科技有限公司 Identity authentication method, device, equipment and storage medium
CN112766292B (en) * 2019-11-04 2024-07-26 中移(上海)信息通信科技有限公司 Identity authentication method, device, equipment and storage medium
CN112819133A (en) * 2019-11-15 2021-05-18 北方工业大学 Construction method of deep hybrid neural network emotion recognition model
CN110956953A (en) * 2019-11-29 2020-04-03 中山大学 Quarrel identification method based on audio analysis and deep learning
CN110956953B (en) * 2019-11-29 2023-03-10 中山大学 Quarrel recognition method based on audio analysis and deep learning
CN111028859A (en) * 2019-12-15 2020-04-17 中北大学 Hybrid neural network vehicle type identification method based on audio feature fusion
CN111179910A (en) * 2019-12-17 2020-05-19 深圳追一科技有限公司 Speech rate recognition method and apparatus, server, and computer-readable storage medium
WO2021127982A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech emotion recognition method, smart device, and computer-readable storage medium
WO2021147363A1 (en) * 2020-01-20 2021-07-29 中国电子科技集团公司电子科学研究院 Text-based major depressive disorder recognition method
CN111210844A (en) * 2020-02-03 2020-05-29 北京达佳互联信息技术有限公司 Method, device and equipment for determining speech emotion recognition model and storage medium
CN111310672A (en) * 2020-02-19 2020-06-19 广州数锐智能科技有限公司 Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
WO2021164346A1 (en) * 2020-02-21 2021-08-26 乐普(北京)医疗器械股份有限公司 Method and device for predicting blood pressure
CN111326178A (en) * 2020-02-27 2020-06-23 长沙理工大学 Multi-mode speech emotion recognition system and method based on convolutional neural network
CN111524535B (en) * 2020-04-30 2022-06-21 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
CN111524535A (en) * 2020-04-30 2020-08-11 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
CN111709284A (en) * 2020-05-07 2020-09-25 西安理工大学 Dance emotion recognition method based on CNN-LSTM
CN112383369A (en) * 2020-07-23 2021-02-19 哈尔滨工业大学 Cognitive radio multi-channel spectrum sensing method based on CNN-LSTM network model
CN111883252A (en) * 2020-07-29 2020-11-03 济南浪潮高新科技投资发展有限公司 Auxiliary diagnosis method, device, equipment and storage medium for infantile autism
CN112037822B (en) * 2020-07-30 2022-09-27 华南师范大学 Voice emotion recognition method based on ICNN and Bi-LSTM
CN112037822A (en) * 2020-07-30 2020-12-04 华南师范大学 Voice emotion recognition method based on ICNN and Bi-LSTM
CN112101095A (en) * 2020-08-02 2020-12-18 华南理工大学 Suicide and violence tendency emotion recognition method based on language and limb characteristics
CN112101095B (en) * 2020-08-02 2023-08-29 华南理工大学 Suicide and violence tendency emotion recognition method based on language and limb characteristics
CN112001482A (en) * 2020-08-14 2020-11-27 佳都新太科技股份有限公司 Vibration prediction and model training method and device, computer equipment and storage medium
CN112001482B (en) * 2020-08-14 2024-05-24 佳都科技集团股份有限公司 Vibration prediction and model training method, device, computer equipment and storage medium
CN112187413A (en) * 2020-08-28 2021-01-05 中国人民解放军海军航空大学航空作战勤务学院 SFBC (space-frequency block code) identification method and device based on CNN-LSTM (convolutional neural network-long short-term memory)
CN112187413B (en) * 2020-08-28 2022-05-03 中国人民解放军海军航空大学航空作战勤务学院 SFBC (space-frequency block code) identification method and device based on CNN-LSTM (convolutional neural network-long short-term memory)
CN112259126A (en) * 2020-09-24 2021-01-22 广州大学 Robot and method for assisting in recognizing autism voice features
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function
CN112735479A (en) * 2021-03-31 2021-04-30 南方电网数字电网研究院有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN112735479B (en) * 2021-03-31 2021-07-06 南方电网数字电网研究院有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN113221758A (en) * 2021-05-16 2021-08-06 西北工业大学 Underwater acoustic target identification method based on GRU-NIN model
CN113221758B (en) * 2021-05-16 2023-07-14 西北工业大学 GRU-NIN model-based underwater sound target identification method
CN113255800B (en) * 2021-06-02 2021-10-15 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN113255800A (en) * 2021-06-02 2021-08-13 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN114305418B (en) * 2021-12-16 2023-08-04 广东工业大学 Data acquisition system and method for intelligent assessment of depression state
CN114305418A (en) * 2021-12-16 2022-04-12 广东工业大学 Data acquisition system and method for depression state intelligent evaluation

Also Published As

Publication number Publication date
CN106782602B (en) 2020-03-17

Similar Documents

Publication Publication Date Title
CN106782602A (en) Speech emotion recognition method based on long short-term memory network and convolutional neural networks
CN109036465B (en) Speech emotion recognition method
CN106878677B (en) Student classroom mastery degree evaluation system and method based on multiple sensors
Sun et al. Speech emotion recognition based on DNN-decision tree SVM model
EP4002362A1 (en) Method and apparatus for training speech separation model, storage medium, and computer device
CN108664632B (en) Text emotion classification algorithm based on convolutional neural network and attention mechanism
CN109637522B (en) Speech emotion recognition method for extracting depth space attention features based on spectrogram
Wang et al. Research on Web text classification algorithm based on improved CNN and SVM
CN109146066A (en) Natural interaction method for collaborative virtual learning environments based on speech emotion recognition
CN107256393A (en) Feature extraction and state recognition of one-dimensional physiological signals based on deep learning
CN102890930B (en) Speech emotion recognition method based on a hidden Markov model (HMM) / self-organizing feature map neural network (SOFMNN) hybrid model
CN106952649A (en) Speaker recognition method based on convolutional neural networks and spectrogram
CN103544963A (en) Voice emotion recognition method based on kernel semi-supervised discriminant analysis
CN107785015A (en) Speech recognition method and device
CN110853656B (en) Audio tampering identification method based on improved neural network
CN107705806A (en) Method for speech emotion recognition using spectrograms and deep convolutional neural networks
Zhou et al. Deep learning based affective model for speech emotion recognition
CN106897254A (en) Network representation learning method
CN107039036A (en) High-quality speaker recognition method based on an autoencoder deep belief network
CN109558935A (en) Emotion recognition and interaction method and system based on deep learning
CN106339718A (en) Classification method based on neural network and classification device thereof
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN113033180B (en) Automatic generation service system for primary-school Tibetan reading questions
CN110046709A (en) Multi-task learning model based on bidirectional LSTM
Dong et al. Environmental sound classification based on improved compact bilinear attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 9 Yuen Road, Qixia District, Nanjing, Jiangsu Province, 210023

Applicant after: Nanjing Post & Telecommunication Univ.

Address before: No. 66 Xinmofan Road, Nanjing, Jiangsu, 210003

Applicant before: Nanjing Post & Telecommunication Univ.

GR01 Patent grant