CN106782602B - Speech emotion recognition method based on deep neural network - Google Patents

Speech emotion recognition method based on deep neural network

Info

Publication number
CN106782602B
Authority
CN
China
Prior art keywords
layer
convolution
output
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611093447.3A
Other languages
Chinese (zh)
Other versions
CN106782602A (en)
Inventor
袁亮
卢官明
闫静杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201611093447.3A priority Critical patent/CN106782602B/en
Publication of CN106782602A publication Critical patent/CN106782602A/en
Application granted granted Critical
Publication of CN106782602B publication Critical patent/CN106782602B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition method based on a long short-term memory (LSTM) network and a convolutional neural network (CNN). The method comprises constructing a speech emotion recognition system based on the LSTM and CNN, taking a speech sequence as the input of the system, training the LSTM and CNN with a back-propagation algorithm, and optimizing the network parameters to obtain an optimized network model; the trained network model is then used to classify the emotion of a newly input speech sequence into one of six categories, namely sadness, happiness, disgust, fear, fright and neutrality. The method jointly exploits the two network models, LSTM and CNN, avoids the complexity of manual feature selection and extraction, and improves the accuracy of emotion recognition.

Description

Speech emotion recognition method based on deep neural network
Technical Field
The invention relates to the field of image processing and pattern recognition, and in particular to a speech emotion recognition method based on a long short-term memory (LSTM) network and a convolutional neural network (CNN).
Background
In interpersonal communication, information is exchanged in a variety of ways, including speech, body language and facial expressions. Among them, the speech signal is the fastest and most primitive means of communication, and researchers regard it as one of the most effective channels for human-computer interaction. Over the last half century, scholars have studied many topics in speech recognition, that is, how to convert speech sequences into text. Despite significant advances in speech recognition, there is still a long way to go before natural human-machine interaction is achieved, because machines cannot understand the emotional state of the speaker. This has led to another line of research: how to identify the emotional state of the speaker from speech, i.e., speech emotion recognition.
Speech emotion recognition, as an important branch of human-computer interaction, can be widely applied in fields such as education, medical care and transportation. In a vehicle-mounted system, it can be used to monitor the mental state of the driver and judge whether the driver is in a safe condition, so that a fatigued driver can be alerted and traffic accidents avoided. In telephone services, it can be used to identify users whose speech shows strong agitation and transfer them to a human agent, thereby optimizing the user experience and improving the overall service level. In clinical medicine, the emotional changes of patients with depression or of autistic children can be tracked by means of speech emotion recognition and used as a tool for diagnosis and adjuvant therapy. In robotics research, speech information helps a robot understand a person's emotion and make friendly, intelligent responses, enabling interaction.
Most speech emotion recognition methods in the present stage adopt the traditional method of extracting features and then classifying by using a classifier. Common speech features include pitch, speech rate, intensity (prosodic features), linear prediction cepstral coefficients, mel-frequency cepstral coefficients (spectral features), and the like. Common classification methods include hidden markov models, support vector machines, and gaussian mixture models. The traditional emotion recognition method tends to mature, but has certain defects. For example, it is not clear which feature has the greatest influence on emotion recognition, and only one feature is selected as a basis for judgment in most experiments, so that the objectivity of emotion recognition is reduced. In addition, some existing features, such as pitch, speech rate, etc., are greatly affected by the style of the speaker, which increases the complexity of recognition.
With the recent development of deep learning, many researchers have chosen to train network models to perform emotion recognition. Existing approaches mainly include speech emotion recognition based on deep belief networks, on long short-term memory networks, and on convolutional neural networks. These three methods share a major disadvantage: none of them can combine the advantages of the different network models. For example, a deep belief network can take a one-dimensional sequence as input but cannot exploit the correlation between earlier and later parts of the sequence; a long short-term memory network can exploit this temporal correlation, but the dimensionality of the features it extracts is high; a convolutional neural network cannot process a speech sequence directly and requires the speech signal to be Fourier-transformed into a spectrum before being used as input. Traditional speech emotion recognition methods offer limited room for further progress in feature extraction and classification, while existing deep-learning-based speech emotion methods each rely on a single network.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a speech emotion recognition method based on a long short-term memory network and a convolutional neural network, which avoids the complex process of manually extracting and screening features and obtains the optimal emotion recognition effect by adaptively adjusting the network parameters through training.
The invention adopts the following technical scheme for solving the technical problems:
The speech emotion recognition method based on the long short-term memory network and the convolutional neural network comprises the following steps:
step A, preprocessing the speech samples in a speech emotion database so that each speech sample is represented by a sequence of equal length, thereby obtaining preprocessed speech sequences;
step B, constructing a speech emotion recognition system based on the long short-term memory network LSTM and the convolutional neural network CNN, the system comprising two basic modules: a long short-term memory network module and a convolutional neural network module;
step C, feeding the preprocessed speech sequences into the speech emotion recognition system for multiple rounds of training, and adjusting the parameters of the LSTM and CNN with a back-propagation algorithm to obtain an optimized network model;
and step D, using the network model obtained by training in step C to classify the emotion of a newly input speech sequence into one of six emotions, namely sadness, happiness, disgust, fear, fright and neutrality.
As a further optimization scheme of the speech emotion recognition method based on the long short-term memory network and the convolutional neural network, the long short-term memory network module in step B is specifically constructed by the following steps:
B1.1, setting the length of the speech sample sequence to m, where m = n × n and n is a positive integer, and denoting the outputs of the forgetting gate unit and the input gate unit at the current time as f_t and i_t, which satisfy:
f_t = σ(W_f · x_c + b_f)
i_t = σ(W_i · x_c + b_i)
where x_c = [h_{t-1}, x_t] is the new vector obtained by joining the two vectors h_{t-1} and x_t end to end, x_t is the input at the current time, h_{t-1} is the state of the hidden layer at the previous time, W_f and W_i are the weight matrices of the forgetting gate unit and the input gate unit respectively, b_f and b_i are the bias vectors of the forgetting gate unit and the input gate unit respectively, and σ(·) is the sigmoid excitation function;
B1.2, calculating the value of the current cell state C_t by the following formulas:
C_t = f_t * C_{t-1} + i_t * C̃_t
C̃_t = tanh(W_C · x_c + b_C)
where C_{t-1} is the cell state at the previous time, C̃_t is a reference value of the cell state at the current time, W_C is the weight matrix of the cell state, b_C is the bias vector of the cell state, and tanh(·) is the hyperbolic tangent function;
B1.3, obtaining the output h_t of each hidden node according to the following formulas, and connecting the h_t in sequence to form an m-dimensional feature vector:
h_t = o_t * tanh(C_t)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
where W_o is the weight matrix of the output gate unit, b_o is the bias vector of the output gate unit, and o_t is the output of the output gate unit.
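For illustration only, the gate computations of steps B1.1-B1.3 can be written as the following Python (NumPy) sketch of a single LSTM time step; the hidden size, the toy input sequence and the random weight initialization are assumptions made here for demonstration and are not specified by the patent.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    # One time step following formulas B1.1-B1.3.
    x_c = np.concatenate([h_prev, x_t])      # x_c = [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ x_c + bf)             # forgetting gate output
    i_t = sigmoid(Wi @ x_c + bi)             # input gate output
    C_tilde = np.tanh(Wc @ x_c + bc)         # reference value of the cell state
    C_t = f_t * C_prev + i_t * C_tilde       # current cell state
    o_t = sigmoid(Wo @ x_c + bo)             # output gate output
    h_t = o_t * np.tanh(C_t)                 # hidden-layer output
    return h_t, C_t

rng = np.random.default_rng(0)
hid, inp = 4, 1                              # illustrative sizes, not taken from the patent
Wf, Wi, Wc, Wo = (rng.standard_normal((hid, hid + inp)) * 0.1 for _ in range(4))
bf, bi, bc, bo = (np.zeros(hid) for _ in range(4))
h, C = np.zeros(hid), np.zeros(hid)
for sample in rng.standard_normal(10):       # a toy 10-point speech sequence
    h, C = lstm_step(np.array([sample]), h, C, Wf, bf, Wi, bi, Wc, bc, Wo, bo)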
As a further optimization scheme of the speech emotion recognition method based on the long short-term memory network and the convolutional neural network, the convolutional neural network module in step B is specifically constructed by the following steps:
B2.1, converting the m-dimensional feature vector extracted in step B1.3 into an n × n feature matrix as the input of the convolutional neural network;
B2.2, the first layer of the convolutional neural network is a convolutional layer: m_1 convolution kernels of size k_1 × k_1 are selected to perform a convolution operation on the input data with a convolution stride of s_1, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_1 feature maps of size l_1 × l_1;
B2.3, the second layer of the convolutional neural network is a pooling layer: m_2 convolution kernels of size k_2 × k_2 are selected to pool, with a stride of s_2, the feature maps output by the first convolutional layer, obtaining the output of the pooling layer, namely m_2 feature maps of size l_2 × l_2;
B2.4, the third layer of the convolutional neural network is a convolutional layer: m_3 convolution kernels of size k_3 × k_3 are selected to perform a convolution operation on the feature maps output by the second-layer pooling layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_3 feature maps of size l_3 × l_3;
B2.5, the fourth layer of the convolutional neural network is a convolutional layer: m_4 convolution kernels of size k_4 × k_4 are selected to perform a convolution operation on the feature maps output by the third-layer convolutional layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_4 feature maps of size l_4 × l_4;
B2.6, the fifth layer of the convolutional neural network is a convolutional layer: m_5 convolution kernels of size k_5 × k_5 are selected to perform a convolution operation on the feature maps output by the fourth-layer convolutional layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_5 feature maps of size l_5 × l_5;
B2.7, the sixth layer of the convolutional neural network is a pooling layer: m_6 convolution kernels of size k_6 × k_6 are selected to pool, with a stride of s_6, the feature maps output by the fifth-layer convolutional layer, obtaining the output of the pooling layer, namely m_6 feature maps of size l_6 × l_6;
B2.8, the seventh, eighth and ninth layers of the convolutional neural network are all fully connected layers; the seventh layer fully connects the feature maps output by the sixth-layer pooling layer to c nodes of that layer; the eighth layer applies a ReLU nonlinear transformation to the c nodes of the seventh layer and then controls the connection weights of the hidden-layer nodes with a dropout method, the total number of connections being c; the ninth fully connected layer has p output nodes, and the output is the softmax loss fused with the feature labels.
As a further optimization scheme of the speech emotion recognition method based on the long short-term memory network and the convolutional neural network, the function J(θ) of the softmax loss of the convolutional neural network in step B is defined as follows:
J(θ) = -(1/q) [ Σ_{i=1}^{q} Σ_{j=1}^{p} 1{y^(i) = j} · log( e^{θ_j^T · x^(i)} / Σ_{l=1}^{p} e^{θ_l^T · x^(i)} ) ]
where x^(i) is an input vector and y^(i) is the emotion category corresponding to the input vector, i = 1, 2, …, q, with q being the number of speech samples; θ_j are the model parameters, j = 1, 2, …, p, with p being the number of emotion categories; T denotes transposition and e is the natural base; 1{·} is an indicator function whose value is 1 when the expression in the braces is true and 0 otherwise.
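A minimal NumPy sketch of the loss J(θ) defined above is given below; the sample count, feature dimension and parameter matrix θ are randomly generated placeholders, and the labels are taken as 0 to p-1 here rather than 1 to 6.

import numpy as np

def softmax_loss(theta, X, y):
    # J(theta): X is (q, d), y holds class indices 0..p-1, theta is (p, d).
    q = X.shape[0]
    scores = X @ theta.T                                # theta_j^T x^(i) for every i, j
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(q), y]).mean()       # the indicator 1{y^(i)=j} picks the true class

rng = np.random.default_rng(0)
q, d, p = 8, 16, 6                                      # 6 emotion categories as in the method
X = rng.standard_normal((q, d))
y = rng.integers(0, p, size=q)
theta = 0.01 * rng.standard_normal((p, d))
print(softmax_loss(theta, X, y))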
As a further optimization scheme of the speech emotion recognition method based on the long short-term memory network and the convolutional neural network, the tanh function is expressed as tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) and the sigmoid function is expressed as σ(x) = 1 / (1 + e^{-x}), where x is a variable.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
(1) The complex process of manual feature extraction and screening is avoided, and the optimal emotion recognition effect is obtained by adaptively adjusting the network parameters through training;
(2) The speech emotion recognition method based on LSTM and CNN fuses two different network models: by means of the LSTM, a speech sequence can be processed directly and the temporal correlation within the sequence can be exploited; by means of the CNN, the interference of noise is reduced and more abstract features can be learned, improving the accuracy and robustness of emotion recognition.
Drawings
FIG. 1 is a flow chart of the speech emotion recognition method based on LSTM and CNN of the present invention.
FIG. 2 is a basic framework structure diagram of the constructed LSTM and CNN-based speech emotion recognition system.
FIG. 3 is a basic framework diagram of the long short-term memory network module in the speech emotion recognition system.
Fig. 4 is a basic framework diagram of a convolutional neural network module in a speech emotion recognition system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, which is a flow chart of the speech emotion recognition method based on LSTM and CNN of the present invention, the implementation of the speech emotion recognition method based on LSTM and CNN of the present invention mainly includes the following steps:
step 1: selecting a suitable speech emotion database and collecting the speech segments it contains;
in actual operation, the AFEW database is selected; it provides original video clips, all of which are cut from films. Compared with common laboratory databases, the speech and emotional expression in the AFEW database are closer to real-life environments and more general. The ages of the speakers range from 1 to 70 years, covering various age groups and including a large number of samples from children and adolescents, which can subsequently be used for emotion recognition in younger subjects. The samples in the database are divided into six categories, namely sadness, happiness, disgust, fear, fright and neutrality, labeled 1 to 6. The speech segments in the videos are selected as the sample set, with a sampling frequency of 48 kHz.
Step 2: reading voice sample data, and unifying the length of a sample sequence;
because of the differences in the duration of the speech samples, and considering that the useful information is mainly concentrated in the middle region of the speech sequence, 16384 sample points near the middle point of each speech sequence are actually selected to represent the entire speech. According to the following steps: and 3, randomly selecting the voice samples according to the proportion to be respectively used as a training set and a verification set. The speech sequences and labels for each sample set are stored as a pkl file.
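As a hedged illustration of step 2, the following Python sketch crops 16384 samples around the midpoint of each waveform, splits the samples into a training set and a validation set and stores each set as a pkl file; the librosa loader, the file layout and the interpretation of the 1:3 split are assumptions for demonstration, not details taken verbatim from the patent.

import pickle
import numpy as np
import librosa                     # assumed audio loader; any reader supporting 48 kHz works

SEG_LEN = 16384                    # sample points kept around the midpoint of each utterance

def crop_center(wave, seg_len=SEG_LEN):
    # Keep seg_len samples centred on the middle of the waveform (zero-pad if too short).
    if len(wave) < seg_len:
        wave = np.pad(wave, (0, seg_len - len(wave)))
    mid = len(wave) // 2
    start = max(0, mid - seg_len // 2)
    return wave[start:start + seg_len]

def build_sets(files, labels, train_fraction=0.25, seed=0):
    # train_fraction reflects one reading of the 1:3 split in step 2; treat it as an assumption.
    X = np.stack([crop_center(librosa.load(f, sr=48000)[0]) for f in files])
    y = np.asarray(labels)
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train = int(len(X) * train_fraction)
    parts = {"train.pkl": idx[:n_train], "val.pkl": idx[n_train:]}
    for name, part in parts.items():
        with open(name, "wb") as fh:
            pickle.dump({"sequences": X[part], "labels": y[part]}, fh)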
step 3: constructing the speech emotion recognition system, taking the speech sequence as input, training the long short-term memory network, and obtaining the output of its hidden layer. FIG. 2 is a basic framework structure diagram of the constructed speech emotion recognition system based on LSTM and CNN, which illustrates the whole process of emotion classification of speech samples; the system mainly comprises the two basic modules, LSTM and CNN. FIG. 3 is a basic framework diagram of the long short-term memory network module in the speech emotion recognition system, illustrating the internal structure of the LSTM network unit and reflecting the relationship between the hidden-layer state and each gate unit. FIG. 4 is a basic framework diagram of the convolutional neural network module in the speech emotion recognition system, illustrating how the feature matrix is turned into a vector containing label information after convolution, pooling and fully connected operations.
Let x_0, x_1, x_2, …, x_t, … denote the input speech sequence and h_0, h_1, h_2, …, h_t, … denote the states of the hidden nodes. x_c = [h_{t-1}, x_t] indicates that the hidden-layer state at the previous time and the input at the current time are concatenated into a vector x_c. Denoting the outputs of the forgetting gate unit and the input gate unit at time t as f_t and i_t, their values are calculated as follows:
f_t = σ(W_f · x_c + b_f)    (1)
i_t = σ(W_i · x_c + b_i)    (2)
The value of the cell state is calculated by the following formula:
C_t = f_t * C_{t-1} + i_t * tanh(W_C · x_c + b_C)    (3)
The output of the network module is determined by the current cell state and is a filtered cell value. First, a sigmoid unit gives the output-gate value o_t; the tanh function then keeps the output range of the cell state between -1 and 1, and the two are multiplied to determine the output h_t of the hidden layer:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (4)
h_t = o_t * tanh(C_t)    (5)
The output h_t of each hidden node is obtained; these outputs are then concatenated in sequence to form a 16384-dimensional feature vector, which is converted into a 128 × 128 feature matrix.
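How the hidden-node outputs could be collected and reshaped into the 128 × 128 feature matrix is sketched below; a standard PyTorch LSTM with one scalar hidden output per time step stands in for the module described above, which is an assumption made only for illustration.

import torch
import torch.nn as nn

SEQ_LEN, HIDDEN = 16384, 1        # one scalar hidden output per sample point (assumption)

lstm = nn.LSTM(input_size=1, hidden_size=HIDDEN, batch_first=True)

def lstm_feature_matrix(speech_batch):
    # speech_batch: (batch, 16384) preprocessed waveforms -> (batch, 1, 128, 128).
    x = speech_batch.unsqueeze(-1)              # (batch, 16384, 1): one sample per time step
    h_all, _ = lstm(x)                          # h_t for every time step
    feats = h_all.reshape(-1, SEQ_LEN)          # concatenate the h_t into a 16384-dim vector
    return feats.view(-1, 1, 128, 128)          # reshape into the 128 x 128 feature matrix

matrix = lstm_feature_matrix(torch.randn(2, SEQ_LEN))
print(matrix.shape)                             # torch.Size([2, 1, 128, 128])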
step 4: taking the feature matrix as input and training the convolutional neural network, which specifically comprises the following steps:
The first layer is a convolutional layer: 96 convolution kernels of size 11 × 11 are selected to perform a convolution operation on the input data with a convolution stride of 3; the convolution operation enhances the signal features and reduces noise. After convolution, 96 feature maps of size 40 × 40 are generated.
The second layer is a pooling layer: the feature maps generated by the first convolutional layer are pooled with a 4 × 4 kernel and a stride of 3, generating 96 feature maps of size 13 × 13.
The third layer is a convolutional layer: 256 convolution kernels of size 5 × 5 are selected to convolve the feature maps generated by the second layer, and edge padding and grouping are used to prevent the feature maps from shrinking during convolution. After the nonlinear transformation, 256 feature maps of size 13 × 13 are generated.
The fourth layer is a convolutional layer: 384 convolution kernels of size 5 × 5 are selected to convolve the feature maps generated by the third layer; with the same edge padding and grouping, 384 feature maps of size 13 × 13 are generated after the nonlinear transformation.
The fifth layer is a convolutional layer: 256 convolution kernels of size 5 × 5 are selected, and with edge padding, 256 feature maps of size 13 × 13 are generated after the nonlinear mapping of the feature maps produced by the fourth layer.
The sixth layer is a pooling layer: the feature maps generated by the fifth layer are pooled with a 3 × 3 kernel and a stride of 2, generating 256 feature maps of size 6 × 6.
The seventh, eighth and ninth layers are all fully connected layers. The seventh layer fully connects the feature maps generated by the sixth layer to 4096 nodes. The eighth layer applies a ReLU nonlinear transformation to the nodes of the seventh layer and then controls the working weights of the hidden-layer nodes with a dropout method; the dropout method randomly discards part of the hidden nodes during each training pass, and the discarded nodes can temporarily be regarded as not being part of the network structure, but their weights are kept, so that only part of the parameters are adjusted each time. The number of full connections in the eighth layer is 4096. The ninth fully connected layer has 6 output nodes, and the output is the softmax loss fused with the feature labels.
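The layer dimensions stated in step 4 (40 → 13 → 13 → 13 → 13 → 6) can be checked with the PyTorch sketch below; the padding, grouping and dropout settings are assumptions chosen so that the stated feature-map sizes come out, since the patent only mentions edge padding and grouping.

import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    # Nine-layer CNN from step 4; padding/groups chosen to reproduce the stated sizes.
    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=11, stride=3), nn.ReLU(),     # 96 x 40 x 40
            nn.MaxPool2d(kernel_size=4, stride=3),                     # 96 x 13 x 13
            nn.Conv2d(96, 256, 5, padding=2, groups=2), nn.ReLU(),     # 256 x 13 x 13
            nn.Conv2d(256, 384, 5, padding=2, groups=2), nn.ReLU(),    # 384 x 13 x 13
            nn.Conv2d(384, 256, 5, padding=2), nn.ReLU(),              # 256 x 13 x 13
            nn.MaxPool2d(kernel_size=3, stride=2),                     # 256 x 6 x 6
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),                   # seventh layer
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),         # eighth layer
            nn.Linear(4096, num_classes),                              # ninth layer (softmax loss applied outside)
        )

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

print(EmotionCNN()(torch.randn(2, 1, 128, 128)).shape)                 # torch.Size([2, 6])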
step 5: adjusting the parameters of the LSTM and CNN in the system using a back-propagation algorithm, selecting the optimal network model, and storing its parameters;
step 6: feeding the test-set samples into the optimal network model and performing emotion recognition on them with the trained network.
The function J(θ) of the softmax loss of the convolutional neural network is defined as follows:
J(θ) = -(1/q) [ Σ_{i=1}^{q} Σ_{j=1}^{p} 1{y^(i) = j} · log( e^{θ_j^T · x^(i)} / Σ_{l=1}^{p} e^{θ_l^T · x^(i)} ) ]
where x^(i) is an input vector and y^(i) is the emotion category corresponding to the input vector, i = 1, 2, …, q, with q being the number of speech samples; θ_j are the model parameters, j = 1, 2, …, p, with p being the number of emotion categories; T denotes transposition and e is the natural base; 1{·} is an indicator function whose value is 1 when the expression in the braces is true and 0 otherwise. As the number of training samples increases, the value of the loss function decreases continuously; the θ_j obtained when the loss function becomes stable are the parameters of the optimized network model.
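Steps 5 and 6, together with the loss above, could be realized by a training loop such as the following sketch, which reuses the lstm module and the EmotionCNN class from the earlier sketches; the optimizer, learning rate, epoch count and the toy data loaders standing in for the AFEW pkl files are illustrative assumptions.

import torch
import torch.nn as nn

# Reuses `lstm`, `lstm_feature_matrix` and `EmotionCNN` from the earlier sketches.
model = EmotionCNN()
optimizer = torch.optim.SGD(list(model.parameters()) + list(lstm.parameters()),
                            lr=1e-3, momentum=0.9)             # illustrative settings
criterion = nn.CrossEntropyLoss()                              # softmax loss J(theta)

# Toy loaders standing in for the training/validation pkl files built in step 2.
train_loader = [(torch.randn(4, 16384), torch.randint(0, 6, (4,))) for _ in range(5)]
val_loader = [(torch.randn(4, 16384), torch.randint(0, 6, (4,)))]

best_acc = 0.0
for epoch in range(3):                                         # epoch count is a placeholder
    model.train()
    for waves, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(lstm_feature_matrix(waves)), labels)
        loss.backward()                                        # back-propagation through CNN and LSTM
        optimizer.step()

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for waves, labels in val_loader:
            pred = model(lstm_feature_matrix(waves)).argmax(1)
            correct += (pred == labels).sum().item()
            total += labels.numel()
    if correct / total > best_acc:                             # keep the best-performing parameters
        best_acc = correct / total
        torch.save({"cnn": model.state_dict(), "lstm": lstm.state_dict()}, "best_model.pt")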
The tanh function (hyperbolic tangent function) in the present invention is expressed as tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}), the ReLU function (rectified linear unit function) is expressed as f(x) = max(0, x), and the sigmoid function (S-shaped growth curve) is expressed as σ(x) = 1 / (1 + e^{-x}), where x is a variable.

Claims (3)

1. A speech emotion recognition method based on a long short-term memory network and a convolutional neural network, characterized by comprising the following steps:
step A, preprocessing the speech samples in a speech emotion database so that each speech sample is represented by a sequence of equal length, thereby obtaining preprocessed speech sequences;
step B, constructing a speech emotion recognition system based on the long short-term memory network LSTM and the convolutional neural network CNN, the system comprising two basic modules: a long short-term memory network module and a convolutional neural network module;
step C, feeding the preprocessed speech sequences into the speech emotion recognition system for multiple rounds of training, and adjusting the parameters of the LSTM and CNN with a back-propagation algorithm to obtain an optimized network model;
step D, using the network model obtained by training in step C to classify the emotion of a newly input speech sequence into one of six emotions, namely sadness, happiness, disgust, fear, fright and neutrality;
the long short-term memory network module in step B being specifically constructed by the following steps:
B1.1, setting the length of the speech sample sequence to m, where m = n × n and n is a positive integer, and denoting the outputs of the forgetting gate unit and the input gate unit at the current time as f_t and i_t, which satisfy:
f_t = σ(W_f · x_c + b_f)
i_t = σ(W_i · x_c + b_i)
where x_c = [h_{t-1}, x_t] is the new vector obtained by joining the two vectors h_{t-1} and x_t end to end, x_t is the input at the current time, h_{t-1} is the state of the hidden layer at the previous time, W_f and W_i are the weight matrices of the forgetting gate unit and the input gate unit respectively, b_f and b_i are the bias vectors of the forgetting gate unit and the input gate unit respectively, and σ(·) is the sigmoid excitation function;
B1.2, calculating the value of the current cell state C_t by the following formulas:
C_t = f_t * C_{t-1} + i_t * C̃_t
C̃_t = tanh(W_C · x_c + b_C)
where C_{t-1} is the cell state at the previous time, C̃_t is a reference value of the cell state at the current time, W_C is the weight matrix of the cell state, b_C is the bias vector of the cell state, and tanh(·) is the hyperbolic tangent function;
B1.3, obtaining the output h_t of each hidden node according to the following formulas, and connecting the h_t in sequence to form an m-dimensional feature vector:
h_t = o_t * tanh(C_t)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
where W_o is the weight matrix of the output gate unit, b_o is the bias vector of the output gate unit, and o_t is the output of the output gate unit;
the convolutional neural network module in step B being specifically constructed by the following steps:
B2.1, converting the m-dimensional feature vector extracted in step B1.3 into an n × n feature matrix as the input of the convolutional neural network;
B2.2, the first layer of the convolutional neural network is a convolutional layer: m_1 convolution kernels of size k_1 × k_1 are selected to perform a convolution operation on the input data with a convolution stride of s_1, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_1 feature maps of size l_1 × l_1;
B2.3, the second layer of the convolutional neural network is a pooling layer: m_2 convolution kernels of size k_2 × k_2 are selected to pool, with a stride of s_2, the feature maps output by the first convolutional layer, obtaining the output of the pooling layer, namely m_2 feature maps of size l_2 × l_2;
B2.4, the third layer of the convolutional neural network is a convolutional layer: m_3 convolution kernels of size k_3 × k_3 are selected to perform a convolution operation on the feature maps output by the second-layer pooling layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_3 feature maps of size l_3 × l_3;
B2.5, the fourth layer of the convolutional neural network is a convolutional layer: m_4 convolution kernels of size k_4 × k_4 are selected to perform a convolution operation on the feature maps output by the third-layer convolutional layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_4 feature maps of size l_4 × l_4;
B2.6, the fifth layer of the convolutional neural network is a convolutional layer: m_5 convolution kernels of size k_5 × k_5 are selected to perform a convolution operation on the feature maps output by the fourth-layer convolutional layer, and the convolution result is nonlinearly mapped by an excitation function to obtain the output of this convolutional layer, namely m_5 feature maps of size l_5 × l_5;
B2.7, the sixth layer of the convolutional neural network is a pooling layer: m_6 convolution kernels of size k_6 × k_6 are selected to pool, with a stride of s_6, the feature maps output by the fifth-layer convolutional layer, obtaining the output of the pooling layer, namely m_6 feature maps of size l_6 × l_6;
B2.8, the seventh, eighth and ninth layers of the convolutional neural network are all fully connected layers; the seventh layer fully connects the feature maps output by the sixth-layer pooling layer to c nodes of that layer; the eighth layer applies a ReLU nonlinear transformation to the c nodes of the seventh layer and then controls the connection weights of the hidden-layer nodes with a dropout method, the total number of connections being c; the ninth fully connected layer has p output nodes, and the output is the softmax loss fused with the feature labels.
2. The speech emotion recognition method based on the long short-term memory network and the convolutional neural network as claimed in claim 1, wherein the function J(θ) of the softmax loss of the convolutional neural network in step B is defined as follows:
J(θ) = -(1/q) [ Σ_{i=1}^{q} Σ_{j=1}^{p} 1{y^(i) = j} · log( e^{θ_j^T · x^(i)} / Σ_{l=1}^{p} e^{θ_l^T · x^(i)} ) ]
where x^(i) is an input vector and y^(i) is the emotion category corresponding to the input vector, i = 1, 2, …, q, with q being the number of speech samples; θ_j are the model parameters, j = 1, 2, …, p, with p being the number of emotion categories; T denotes transposition and e is the natural base; 1{·} is an indicator function whose value is 1 when the expression in the braces is true and 0 otherwise.
3. The method of claim 1, wherein the tanh function is expressed as tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) and the sigmoid function is expressed as σ(x) = 1 / (1 + e^{-x}), where x is a variable.
CN201611093447.3A 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network Active CN106782602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611093447.3A CN106782602B (en) 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611093447.3A CN106782602B (en) 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network

Publications (2)

Publication Number Publication Date
CN106782602A CN106782602A (en) 2017-05-31
CN106782602B true CN106782602B (en) 2020-03-17

Family

ID=58913860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611093447.3A Active CN106782602B (en) 2016-12-01 2016-12-01 Speech emotion recognition method based on deep neural network

Country Status (1)

Country Link
CN (1) CN106782602B (en)


Families Citing this family (90)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018227169A1 (en) * 2017-06-08 2018-12-13 Newvoicemedia Us Inc. Optimal human-machine conversations using emotion-enhanced natural speech
CN107293288B (en) * 2017-06-09 2020-04-21 清华大学 Acoustic model modeling method of residual long-short term memory recurrent neural network
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107392109A (en) * 2017-06-27 2017-11-24 南京邮电大学 A kind of neonatal pain expression recognition method based on deep neural network
CN107274378B (en) * 2017-07-25 2020-04-03 江西理工大学 Image fuzzy type identification and parameter setting method based on fusion memory CNN
CN107562784A (en) * 2017-07-25 2018-01-09 同济大学 Short text classification method based on ResLCNN models
CN107562792B (en) * 2017-07-31 2020-01-31 同济大学 question-answer matching method based on deep learning
CN107293290A (en) * 2017-07-31 2017-10-24 郑州云海信息技术有限公司 The method and apparatus for setting up Speech acoustics model
CN107506414B (en) * 2017-08-11 2020-01-07 武汉大学 Code recommendation method based on long-term and short-term memory network
CN108346436B (en) 2017-08-22 2020-06-23 腾讯科技(深圳)有限公司 Voice emotion detection method and device, computer equipment and storage medium
CN107705807B (en) * 2017-08-24 2019-08-27 平安科技(深圳)有限公司 Voice quality detecting method, device, equipment and storage medium based on Emotion identification
CN109426858B (en) * 2017-08-29 2021-04-06 京东方科技集团股份有限公司 Neural network, training method, image processing method, and image processing apparatus
CN107785011B (en) * 2017-09-15 2020-07-03 北京理工大学 Training method, device, equipment and medium of speech rate estimation model and speech rate estimation method, device and equipment
CN107679557B (en) * 2017-09-19 2020-11-27 平安科技(深圳)有限公司 Driving model training method, driver identification method, device, equipment and medium
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
CN107679199A (en) * 2017-10-11 2018-02-09 北京邮电大学 A kind of external the Chinese text readability analysis method based on depth local feature
CN107703564B (en) * 2017-10-13 2020-04-14 中国科学院深圳先进技术研究院 Rainfall prediction method and system and electronic equipment
CN107818307B (en) * 2017-10-31 2021-05-18 天津大学 Multi-label video event detection method based on LSTM network
CN107862331A (en) * 2017-10-31 2018-03-30 华中科技大学 It is a kind of based on time series and CNN unsafe acts recognition methods and system
CN109754790B (en) * 2017-11-01 2020-11-06 中国科学院声学研究所 Speech recognition system and method based on hybrid acoustic model
CN108039181B (en) * 2017-11-02 2021-02-12 北京捷通华声科技股份有限公司 Method and device for analyzing emotion information of sound signal
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN107992938B (en) * 2017-11-24 2019-05-14 清华大学 Space-time big data prediction technique and system based on positive and negative convolutional neural networks
CN108280406A (en) * 2017-12-30 2018-07-13 广州海昇计算机科技有限公司 A kind of Activity recognition method, system and device based on segmentation double-stream digestion
CN108597539B (en) * 2018-02-09 2021-09-03 桂林电子科技大学 Speech emotion recognition method based on parameter migration and spectrogram
CN108304823B (en) * 2018-02-24 2022-03-22 重庆邮电大学 Expression recognition method based on double-convolution CNN and long-and-short-term memory network
CN108520753B (en) * 2018-02-26 2020-07-24 南京工程学院 Voice lie detection method based on convolution bidirectional long-time and short-time memory network
CN108550375A (en) * 2018-03-14 2018-09-18 鲁东大学 A kind of emotion identification method, device and computer equipment based on voice signal
CN108806698A (en) * 2018-03-15 2018-11-13 中山大学 A kind of camouflage audio recognition method based on convolutional neural networks
CN108564954B (en) * 2018-03-19 2020-01-10 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity verification method, and storage medium
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
CN108764268A (en) * 2018-04-02 2018-11-06 华南理工大学 A kind of multi-modal emotion identification method of picture and text based on deep learning
CN108564942B (en) * 2018-04-04 2021-01-26 南京师范大学 Voice emotion recognition method and system based on adjustable sensitivity
CN108766419B (en) * 2018-05-04 2020-10-27 华南理工大学 Abnormal voice distinguishing method based on deep learning
CN108806667B (en) * 2018-05-29 2020-04-17 重庆大学 Synchronous recognition method of voice and emotion based on neural network
CN110179453B (en) * 2018-06-01 2020-01-03 山东省计算中心(国家超级计算济南中心) Electrocardiogram classification method based on convolutional neural network and long-short term memory network
CN108961072A (en) * 2018-06-07 2018-12-07 平安科技(深圳)有限公司 Push method, apparatus, computer equipment and the storage medium of insurance products
CN108717856B (en) * 2018-06-16 2022-03-08 台州学院 Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN108922617B (en) * 2018-06-26 2021-10-26 电子科技大学 Autism auxiliary diagnosis method based on neural network
CN108630199A (en) * 2018-06-30 2018-10-09 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of acoustic model
CN109034034A (en) * 2018-07-12 2018-12-18 广州麦仑信息科技有限公司 A kind of vein identification method based on nitrification enhancement optimization convolutional neural networks
CN109003625B (en) * 2018-07-27 2021-01-12 中国科学院自动化研究所 Speech emotion recognition method and system based on ternary loss
CN109190514B (en) * 2018-08-14 2021-10-01 电子科技大学 Face attribute recognition method and system based on bidirectional long-short term memory network
CN109147826B (en) * 2018-08-22 2022-12-27 平安科技(深圳)有限公司 Music emotion recognition method and device, computer equipment and computer storage medium
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN109087635A (en) * 2018-08-30 2018-12-25 湖北工业大学 A kind of speech-sound intelligent classification method and system
CN109285562B (en) * 2018-09-28 2022-09-23 东南大学 Voice emotion recognition method based on attention mechanism
CN109346107B (en) * 2018-10-10 2022-09-30 中山大学 LSTM-based method for inversely solving pronunciation of independent speaker
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver's Emotion identification method and terminal device
CN109389992A (en) * 2018-10-18 2019-02-26 天津大学 A kind of speech-emotion recognition method based on amplitude and phase information
CN109282837B (en) * 2018-10-24 2021-06-22 福州大学 Demodulation method of Bragg fiber grating staggered spectrum based on LSTM network
CN109036467B (en) * 2018-10-26 2021-04-16 南京邮电大学 TF-LSTM-based CFFD extraction method, voice emotion recognition method and system
CN109243493B (en) * 2018-10-30 2022-09-16 南京工程学院 Infant crying emotion recognition method based on improved long-time and short-time memory network
CN109243494B (en) * 2018-10-30 2022-10-11 南京工程学院 Children emotion recognition method based on multi-attention mechanism long-time memory network
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109567793B (en) * 2018-11-16 2021-11-23 西北工业大学 Arrhythmia classification-oriented ECG signal processing method
CN111222624B (en) * 2018-11-26 2022-04-29 深圳云天励飞技术股份有限公司 Parallel computing method and device
CN110096587B (en) * 2019-01-11 2020-07-07 杭州电子科技大学 Attention mechanism-based LSTM-CNN word embedded fine-grained emotion classification model
CN109637545B (en) * 2019-01-17 2023-05-30 哈尔滨工程大学 Voiceprint recognition method based on one-dimensional convolution asymmetric bidirectional long-short-time memory network
JP6580281B1 (en) * 2019-02-20 2019-09-25 ソフトバンク株式会社 Translation apparatus, translation method, and translation program
CN110322900A (en) * 2019-06-25 2019-10-11 深圳市壹鸽科技有限公司 A kind of method of phonic signal character fusion
CN110363751B (en) * 2019-07-01 2021-08-03 浙江大学 Large intestine endoscope polyp detection method based on generation cooperative network
CN112446266B (en) * 2019-09-04 2024-03-29 北京君正集成电路股份有限公司 Face recognition network structure suitable for front end
CN110600018B (en) * 2019-09-05 2022-04-26 腾讯科技(深圳)有限公司 Voice recognition method and device and neural network training method and device
CN110738852B (en) * 2019-10-23 2020-12-18 浙江大学 Intersection steering overflow detection method based on vehicle track and long and short memory neural network
CN110929762B (en) * 2019-10-30 2023-05-12 中科南京人工智能创新研究院 Limb language detection and behavior analysis method and system based on deep learning
CN112766292A (en) * 2019-11-04 2021-05-07 中移(上海)信息通信科技有限公司 Identity authentication method, device, equipment and storage medium
CN112819133A (en) * 2019-11-15 2021-05-18 北方工业大学 Construction method of deep hybrid neural network emotion recognition model
CN110956953B (en) * 2019-11-29 2023-03-10 中山大学 Quarrel recognition method based on audio analysis and deep learning
CN111028859A (en) * 2019-12-15 2020-04-17 中北大学 Hybrid neural network vehicle type identification method based on audio feature fusion
CN111179910A (en) * 2019-12-17 2020-05-19 深圳追一科技有限公司 Speed of speech recognition method and apparatus, server, computer readable storage medium
WO2021127982A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech emotion recognition method, smart device, and computer-readable storage medium
CN111241817A (en) * 2020-01-20 2020-06-05 首都医科大学 Text-based depression identification method
CN111210844B (en) * 2020-02-03 2023-03-24 北京达佳互联信息技术有限公司 Method, device and equipment for determining speech emotion recognition model and storage medium
CN111310672A (en) * 2020-02-19 2020-06-19 广州数锐智能科技有限公司 Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN111248882B (en) * 2020-02-21 2022-07-29 乐普(北京)医疗器械股份有限公司 Method and device for predicting blood pressure
CN111326178A (en) * 2020-02-27 2020-06-23 长沙理工大学 Multi-mode speech emotion recognition system and method based on convolutional neural network
CN111524535B (en) * 2020-04-30 2022-06-21 杭州电子科技大学 Feature fusion method for speech emotion recognition based on attention mechanism
CN111709284B (en) * 2020-05-07 2023-05-30 西安理工大学 Dance emotion recognition method based on CNN-LSTM
CN112383369A (en) * 2020-07-23 2021-02-19 哈尔滨工业大学 Cognitive radio multi-channel spectrum sensing method based on CNN-LSTM network model
CN112037822B (en) * 2020-07-30 2022-09-27 华南师范大学 Voice emotion recognition method based on ICNN and Bi-LSTM
CN112101095B (en) * 2020-08-02 2023-08-29 华南理工大学 Suicide and violence tendency emotion recognition method based on language and limb characteristics
CN112001482B (en) * 2020-08-14 2024-05-24 佳都科技集团股份有限公司 Vibration prediction and model training method, device, computer equipment and storage medium
CN112187413B (en) * 2020-08-28 2022-05-03 中国人民解放军海军航空大学航空作战勤务学院 SFBC (Small form-factor Block code) identifying method and device based on CNN-LSTM (convolutional neural network-Link State transition technology)
CN112259126B (en) * 2020-09-24 2023-06-20 广州大学 Robot and method for assisting in identifying autism voice features
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function
CN112735479B (en) * 2021-03-31 2021-07-06 南方电网数字电网研究院有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN113221758B (en) * 2021-05-16 2023-07-14 西北工业大学 GRU-NIN model-based underwater sound target identification method
CN113255800B (en) * 2021-06-02 2021-10-15 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN114305418B (en) * 2021-12-16 2023-08-04 广东工业大学 Data acquisition system and method for intelligent assessment of depression state

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10783900B2 (en) * 2014-10-03 2020-09-22 Google Llc Convolutional, long short-term memory, fully connected deep neural networks
CN105279495B (en) * 2015-10-23 2019-06-04 天津大学 A kind of video presentation method summarized based on deep learning and text
CN105469065B (en) * 2015-12-07 2019-04-23 中国科学院自动化研究所 A kind of discrete emotion identification method based on recurrent neural network
CN105844239B (en) * 2016-03-23 2019-03-29 北京邮电大学 It is a kind of that video detecting method is feared based on CNN and LSTM cruelly
CN106096568B (en) * 2016-06-21 2019-06-11 同济大学 A kind of pedestrian's recognition methods again based on CNN and convolution LSTM network

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2816680C1 (en) * 2023-03-31 2024-04-03 Автономная некоммерческая организация высшего образования "Университет Иннополис" Method of recognizing speech emotions using 3d convolutional neural network

Also Published As

Publication number Publication date
CN106782602A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106782602B (en) Speech emotion recognition method based on deep neural network
CN108615010B (en) Facial expression recognition method based on parallel convolution neural network feature map fusion
Ren et al. Deep scalogram representations for acoustic scene classification
CN109036465B (en) Speech emotion recognition method
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN109472024A (en) A kind of file classification method based on bidirectional circulating attention neural network
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN107393554A (en) In a kind of sound scene classification merge class between standard deviation feature extracting method
CN103544963A (en) Voice emotion recognition method based on core semi-supervised discrimination and analysis
CN108962247B (en) Multi-dimensional voice information recognition system and method based on progressive neural network
CN110367967A (en) A kind of pocket lightweight human brain condition detection method based on data fusion
CN109410974A (en) Sound enhancement method, device, equipment and storage medium
Han et al. Speech emotion recognition with a resnet-cnn-transformer parallel neural network
CN112151071B (en) Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN110534133A (en) A kind of speech emotion recognition system and speech-emotion recognition method
CN111461201A (en) Sensor data classification method based on phase space reconstruction
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN115565540B (en) Invasive brain-computer interface Chinese pronunciation decoding method
CN110992988A (en) Speech emotion recognition method and device based on domain confrontation
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN112397092A (en) Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN112668486A (en) Method, device and carrier for identifying facial expressions of pre-activated residual depth separable convolutional network
Adiga et al. Multimodal emotion recognition for human robot interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Yuen Road Qixia District of Nanjing City, Jiangsu Province, No. 9 210023
Applicant after: Nanjing Post & Telecommunication Univ.
Address before: 210003, No. 66, new exemplary Road, Nanjing, Jiangsu
Applicant before: Nanjing Post & Telecommunication Univ.

GR01 Patent grant