CN109036465B - Speech emotion recognition method - Google Patents

Speech emotion recognition method

Info

Publication number
CN109036465B
CN109036465B
Authority
CN
China
Prior art keywords
layer
neural network
convolutional neural
spectrogram
speech emotion
Prior art date
Legal status
Active
Application number
CN201810685220.0A
Other languages
Chinese (zh)
Other versions
CN109036465A (en
Inventor
孙林慧
陈嘉
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201810685220.0A priority Critical patent/CN109036465B/en
Publication of CN109036465A publication Critical patent/CN109036465A/en
Application granted granted Critical
Publication of CN109036465B publication Critical patent/CN109036465B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a speech emotion recognition method, which comprises the following steps: S1, converting the voice data used in the experiment into spectrograms; S2, performing data amplification processing on the obtained spectrograms; S3, constructing, on the basis of the traditional convolutional neural network, a convolutional neural network that fuses deep and shallow layer features; and S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and with the convolutional neural network with fused deep and shallow layer features, respectively, and comparing their speech emotion recognition rates. Compared with the traditional convolutional neural network, the convolutional neural network with fused deep and shallow features provided by the invention can fully extract spectrogram features and thus improve the speech emotion recognition rate; by reducing the dimension of the shallow features, it can fully fuse them with the deep features and thereby obtain features that better represent the various emotions. The invention not only effectively improves the speech emotion recognition rate while ensuring recognition accuracy, but also has better generalization capability.

Description

Speech emotion recognition method
Technical Field
The invention relates to a speech emotion recognition method, in particular to a speech emotion recognition method based on convolutional neural network deep and shallow layer feature fusion, and relates to the technical field of speech emotion recognition.
Background
Emotion, as a complex human psychological behavior, has always been a research hotspot in many fields such as psychology and artificial intelligence. The speech signal is the most natural mode of communication between people; it carries not only the content to be conveyed but also rich emotional cues, and it is now widely used in emotion research.
Speech emotion recognition studies the formation and change of a speaker's emotional state from the perspective of the speech signal, so as to make the interaction between computers and humans more intelligent. In current research, the acoustic features used for emotion recognition mainly include spectral-related features, prosodic features, psychoacoustic features, and fusions of these features. Moreover, research on these features often focuses only on the time domain or the frequency domain. However, the correlation between the frequency-domain and time-domain behaviour of a speech signal plays an important role in speech emotion recognition. The spectrogram is a visual representation of the speech signal: its horizontal axis represents time, its vertical axis represents frequency, and it links the two domains. If the frequency points of the spectrogram are modelled as image pixels, the relationship between adjacent frequency points can be studied using image features, so the result expresses not only the time-frequency characteristics of the speech but also the language characteristics of the speaker. Many researchers have already combined image processing and speech processing via the spectrogram and obtained good results.
Methods of speech emotion recognition are generally divided into two categories: a conventional machine learning method and a deep learning method. However, in any method, feature extraction is an important step in the speech emotion recognition process. The key of the traditional machine learning method is feature selection, which is directly related to the accuracy of speech emotion recognition. To date, a large number of spectral-related features, prosodic features, and psychoacoustic features have been used for speech emotion recognition, but these features may not be sufficient to express subjective emotion. Compared with the traditional machine learning method, the deep learning method can extract high-level features, and has achieved certain achievement in vision-related tasks.
In recent years, Deep Convolutional Neural Networks (DCNNs) have made great progress in the study of speech emotion recognition. However, in a conventional convolutional neural network, as the convolutional layer is deeper, the feature mapping dimension is smaller and smaller, and the features are more abstract, so that the semantic features become more obvious, however, the global information of the spectrogram becomes more fuzzy. The shallow features can provide global information, but the semantic features are not obvious, while the deep features provide enough semantic features but lack global information, so that the finally extracted emotional features cannot accurately distinguish various emotions.
In summary, how to provide a speech emotion recognition method based on deep and shallow feature fusion in a convolutional neural network, which fuses deep features and shallow features together to obtain more discriminative emotion features and thereby overcomes the shortcomings of the conventional convolutional neural network in speech emotion recognition, is an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above defects in the prior art, the present invention aims to provide a speech emotion recognition method based on convolutional neural network deep and shallow feature fusion.
Specifically, the speech emotion recognition method comprises the following steps:
s1, converting the voice data used in the experiment into a spectrogram;
s2, performing data amplification processing on the obtained spectrogram;
s3, constructing a convolutional neural network fusing deep and shallow layer characteristics by using the traditional convolutional neural network;
and S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and with the convolutional neural network with fused deep and shallow layer features, respectively, and comparing the speech emotion recognition rates of the two networks.
Preferably, the voice data of S1 come from the German Berlin emotional speech database; the sampling frequency of the voice data is 16 kHz with 16-bit quantization; the voice data comprise seven types of emotion, namely anger, boredom, disgust, fear, happiness, neutral and sadness.
Preferably, the step of converting the voice data used in the experiment into the spectrogram in S1 includes the following steps:
S11, framing each segment of voice data to obtain x(m, n), wherein n is the frame length and m is the number of frames;
S12, performing an FFT on x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)';
S13, taking 10·log10(Y(m, n)), converting m to a time scale M and n to a frequency scale N;
S14, plotting (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
Preferably, the data amplification processing of the obtained spectrograms in S2 comprises: performing data amplification on the spectrograms with the Keras deep learning framework, including random image rotation, horizontal translation, vertical translation, shear transformation, image scaling and horizontal flipping.
Preferably, the convolutional neural network with fused deep and shallow layer features comprises an input layer, an intermediate layer and an output layer, wherein the intermediate layer comprises convolutional layers, pooling layers and fully connected layers.
Preferably, the mapping relation of the convolutional layer is

$$x_j^l = f\Bigl(\sum_i x_i^{l-1} * k_{ij}^l + b_j^l\Bigr)$$

wherein x_j^l is the j-th feature set of the l-th convolutional layer; x_i^{l-1} represents the i-th feature set of the (l-1)-th convolutional layer; k_{ij}^l represents the convolution kernel between the two feature sets; * denotes a two-dimensional convolution operation; b_j^l denotes an additive bias; and f(·) is the activation function of the convolutional layer.
Preferably, the mapping relation of the pooling layer is

$$x_j^l = f_p\Bigl(\beta_j^l\,\mathrm{down}\bigl(x_j^{l-1}\bigr) + b_j^l\Bigr)$$

wherein f_p(·) is the activation function of the pooling layer; down(·) denotes the pooling method from layer l-1 to layer l, including mean pooling and max pooling; β_j^l and b_j^l represent the multiplicative bias and the additive bias, respectively.
Preferably, the matrix features output by the last pooling layer are arranged into a vector to form a grid layer, and the grid layer is connected to the fully connected layer; the output relation of any node j in the grid layer is

$$y_j = f_h\Bigl(\sum_i w_{i,j}\,x_i - \theta_j\Bigr)$$

wherein f_h(·) represents an activation function; w_{i,j} represents the weight between the input vector element x_i and node j; and θ_j is the node threshold.
Preferably, the fully connected layer adopts a Softmax model to solve the multi-classification problem; the loss function of Softmax is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}}$$

wherein z_j^l represents the input of the j-th neuron in layer l (usually the last layer); Σ_n e^{z_n^l} represents the sum over the inputs of all neurons in layer l; a_j^l = e^{z_j^l} / Σ_n e^{z_n^l} represents the output of the j-th neuron in layer l; e is the natural constant; and 1(·) is the indicator function, whose value is 1 when the expression in the braces is true and 0 when it is false.
Preferably, in the fully connected layer, over-large parameters are penalized during training by introducing a weight decay term; the specific expression is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}} + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$$

wherein (λ/2)·Σ_{i,j} θ_{ij}² is the weight decay term.
Compared with the prior art, the invention has the advantages that:
compared with the traditional convolutional neural network, the convolutional neural network with the fusion of the deep and shallow features provided by the invention can fully extract the spectrogram features so as to improve the speech emotion recognition rate, and can fully fuse the shallow features with the deep features by reducing the dimension of the shallow features, thereby obtaining the features which can represent various emotions. The invention not only can effectively improve the speech emotion recognition rate and ensure the recognition accuracy, but also has more excellent generalization capability. Test results show that the recognition rate of four kinds of emotions, namely boredom, aversion, happiness and neutrality, is improved to a certain extent in seven kinds of emotions of Germany Berlin library by using the convolutional neural network with the fused deep and shallow layer characteristics, particularly the recognition rate of the happiness and the neutrality is greatly improved, and the overall recognition rate is improved by 1.58%.
Meanwhile, the invention combines a transfer learning method, and utilizes the parameters of the traditional convolutional neural network training model as initialization parameters under the condition that the convolutional neural network becomes complex, thereby accelerating the convergence speed in the training process and improving the overall recognition speed and recognition efficiency.
In addition, the invention also provides reference for other related problems in the same field, can be expanded and extended on the basis of the reference, is applied to other technical schemes of voice recognition or emotion recognition algorithms in the field, and has very wide application prospect.
In conclusion, the invention provides a speech emotion recognition method based on deep and shallow feature fusion in a convolutional neural network, which has high practical and popularization value.
The following detailed description of the embodiments of the present invention is provided in conjunction with the accompanying drawings to facilitate understanding of the technical solutions of the present invention.
Drawings
FIG. 1 is a spectrogram sample of a number of emotional voices of a Berlin corpus used in the present invention;
FIG. 2 is a conventional convolutional neural network;
FIG. 3 is a modified convolutional neural network of the present invention;
FIG. 4 is a diagram of a conventional convolutional neural network training process;
FIG. 5 is a diagram of the improved convolutional neural network training process in the present invention.
Detailed Description
As shown in the attached drawings, the invention discloses a speech emotion recognition method, which comprises the following steps:
and S1, converting the voice data used in the experiment into a spectrogram.
The invention uses the German Berlin emotional speech database, with a sampling frequency of 16 kHz and 16-bit quantization; it contains seven emotions, namely anger, boredom, disgust, fear, happiness, neutral and sadness.
Specifically, the step of converting the voice data used in the experiment into the spectrogram in S1 includes the following steps:
S11, each speech segment is first framed, yielding x(100, 512), where n = 512 is the frame length and m = 100 is the number of frames;
S12, an FFT is performed to obtain X(100, 512), and the periodogram Y(100, 512) = X(100, 512) · X(100, 512)' is computed;
S13, 10·log10(Y(100, 512)) is then taken, the frame index is converted to a time scale M, and the frame length to a frequency scale N;
and S14, (M, N, 10·log10(Y(m, n))) is plotted as a two-dimensional image, which is the spectrogram. Spectrograms of some of the samples are shown in FIG. 1.
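For illustration only, the following Python sketch shows one way to carry out steps S11-S14 with NumPy and Matplotlib; the non-overlapping framing, the use of a real FFT and the plotting style are assumptions of the sketch rather than details fixed by the invention.

```python
import numpy as np
import matplotlib.pyplot as plt

def spectrogram(signal, fs=16000, frame_len=512, num_frames=100):
    """Rough sketch of S11-S14: frame, FFT, periodogram, 10*log10, plot."""
    # S11: cut the signal into m = 100 non-overlapping frames of n = 512 samples
    x = signal[:num_frames * frame_len].reshape(num_frames, frame_len)
    # S12: FFT of each frame and periodogram Y = X * conj(X) = |X|^2
    X = np.fft.rfft(x, axis=1)
    Y = (X * np.conj(X)).real
    # S13: convert to dB and build the time (M) and frequency (N) scales
    Y_db = 10.0 * np.log10(Y + 1e-10)            # small offset avoids log(0)
    M = np.arange(num_frames) * frame_len / fs   # frame index -> seconds
    N = np.fft.rfftfreq(frame_len, d=1.0 / fs)   # bin index -> Hz
    # S14: draw (M, N, 10*log10(Y)) as a two-dimensional image
    plt.pcolormesh(M, N, Y_db.T, shading="auto")
    plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)")
    plt.savefig("spectrogram.png")
    return Y_db

if __name__ == "__main__":
    spectrogram(np.random.randn(100 * 512))  # placeholder signal
```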
And S2, performing data amplification processing on the obtained spectrograms. To meet the deep neural network's demand for large amounts of data, the spectrograms are amplified with the Keras deep learning framework; the main operations are random image rotation, horizontal translation, vertical translation, shear transformation, image scaling, horizontal flipping and vertical (up-down) flipping, which finally yields the large amount of data required for the experiments.
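A minimal sketch of this amplification step, assuming the Keras ImageDataGenerator API and a hypothetical directory spectrograms/train organized by emotion class; the concrete ranges (rotation angle, shift fractions, zoom factor) and the 227x227 target size are illustrative assumptions, not the values used in the experiments.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Amplification operations named in S2: rotation, horizontal/vertical translation,
# shear, zoom (scaling) and flipping. Concrete ranges are illustrative only.
datagen = ImageDataGenerator(
    rotation_range=10,        # random image rotation
    width_shift_range=0.1,    # horizontal translation
    height_shift_range=0.1,   # vertical translation
    shear_range=0.1,          # shear transformation
    zoom_range=0.1,           # image scaling
    horizontal_flip=True,     # horizontal flipping
    vertical_flip=True)       # vertical (up-down) flipping

# Stream amplified spectrogram images, resized to an assumed 227x227 input size.
train_flow = datagen.flow_from_directory(
    "spectrograms/train", target_size=(227, 227),
    batch_size=32, class_mode="categorical")
```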
And S3, constructing the convolutional neural network fusing the deep and shallow layer characteristics by the traditional convolutional neural network.
The convolutional neural network is a feedforward neural network, and generally comprises an input layer, an intermediate layer and an output layer, wherein the intermediate layer is composed of one or more groups of feature extraction layers formed by convolution and pooling and a full connection layer, each layer is composed of a plurality of two-dimensional planes, and each plane comprises a plurality of neuron nodes (nodes). The convolution layer is used as a feature extraction layer, is the most important part in the whole convolution neural network, and can extract the features of voiceprints, energy and the like in various emotion speech spectrograms for subsequent classification processing. The mapping relation before and after convolution is as follows:
$$x_j^l = f\Bigl(\sum_i x_i^{l-1} * k_{ij}^l + b_j^l\Bigr)$$

wherein x_j^l is the j-th feature set of the l-th convolutional layer, x_i^{l-1} represents the i-th feature set of the (l-1)-th convolutional layer, k_{ij}^l represents the convolution kernel between the two feature sets, * represents a two-dimensional convolution operation, b_j^l denotes an additive bias, and f(·) is the activation function of the convolutional layer.
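As a concrete illustration of this mapping, the sketch below (assuming NumPy and SciPy) computes one output feature set as the sum of two-dimensional convolutions of the input feature sets with their kernels, plus an additive bias, followed by an activation; the array shapes and the choice of ReLU as the activation are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_feature_map(inputs, kernels, bias):
    """x_j^l = f( sum_i x_i^{l-1} * k_ij^l + b_j^l ), with f = ReLU here."""
    acc = sum(convolve2d(x_i, k_ij, mode="valid") for x_i, k_ij in zip(inputs, kernels))
    return np.maximum(acc + bias, 0.0)  # additive bias, then ReLU activation

# Example: three 8x8 input feature sets, three 3x3 kernels, one output feature set.
x_prev = [np.random.randn(8, 8) for _ in range(3)]
k = [np.random.randn(3, 3) for _ in range(3)]
y = conv_feature_map(x_prev, k, bias=0.1)   # output shape (6, 6)
```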
Usually, a pooling layer is connected behind the convolutional layer for performing dimension reduction processing on the features obtained by convolution, so as to prevent overfitting in the training process, and the pooling process is shown as follows:
$$x_j^l = f_p\Bigl(\beta_j^l\,\mathrm{down}\bigl(x_j^{l-1}\bigr) + b_j^l\Bigr)$$

wherein f_p(·) is the activation function of the pooling layer, down(·) denotes the pooling method from layer l-1 to layer l (generally either mean pooling or max pooling), and β_j^l and b_j^l represent the multiplicative bias and the additive bias, respectively.
The matrix features output by the final pooling layer are arranged into a vector to form a grid layer, which is connected to the fully connected layer; the output of any node j is:

$$y_j = f_h\Bigl(\sum_i w_{i,j}\,x_i - \theta_j\Bigr)$$

wherein f_h(·) represents the activation function, w_{i,j} represents the weight between the input vector element x_i and node j, and θ_j is the node threshold.
The fully-connected layer usually adopts a Softmax model to solve the multi-classification problem, and the loss function of Softmax is as follows:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}}$$

wherein z_j^l represents the input of the j-th neuron in layer l (usually the last layer), Σ_n e^{z_n^l} represents the sum over the inputs of all neurons in layer l, a_j^l = e^{z_j^l} / Σ_n e^{z_n^l} represents the output of the j-th neuron in layer l, e is the natural constant, and 1(·) is the indicator function, whose value is 1 when the expression in the braces is true and 0 when it is false.
To prevent J(θ) from converging to a poor local optimum, a weight decay term is introduced to penalize over-large parameters during training. The specific expression is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}} + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$$

wherein (λ/2)·Σ_{i,j} θ_{ij}² is the weight decay term.
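For illustration, a small NumPy sketch of this Softmax cross-entropy loss with an L2 weight-decay penalty is given below; the batch size, the number of classes and the value of λ are assumptions.

```python
import numpy as np

def softmax_loss_with_decay(z, y, weights, lam=1e-4):
    """J = -(1/m) * sum_i log a_{y_i} + (lambda/2) * sum theta^2.

    z: (m, k) inputs of the last layer, y: (m,) integer labels,
    weights: list of parameter arrays to be decayed.
    """
    z = z - z.max(axis=1, keepdims=True)                    # numerical stability
    a = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)    # a_j = e^{z_j} / sum_n e^{z_n}
    m = z.shape[0]
    cross_entropy = -np.log(a[np.arange(m), y] + 1e-12).mean()
    decay = 0.5 * lam * sum((w ** 2).sum() for w in weights)  # weight decay term
    return cross_entropy + decay

# Example: 4 samples, 7 emotion classes, one decayed weight matrix.
loss = softmax_loss_with_decay(np.random.randn(4, 7),
                               np.array([0, 3, 6, 2]),
                               [np.random.randn(1024, 7)])
```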
generally, the more the number of layers of the convolutional neural network is, the more distinctive the extracted features are, but problems of too long training time or difficulty in convergence and the like can be caused, so that a five-layer convolutional neural network is constructed, so that the distinctive features can be extracted, and the training time can also be reduced, and a specific network is shown in fig. 2. The convolutional neural network mainly comprises five convolutional layers, three pooling layers and three full-connection layers. The convolution kernel size of convolutional layer 1 is set to 11x11, the step size is 4, the number of neurons is 96, the convolutional layer 1 is the largest convolutional layer, the kernel size is 3x3, the step size is 2, the convolution kernel size of convolutional layer 2 is 5x5, the step size is 1, the number of neurons is 256, the convolutional layer 2 is also the largest convolutional layer, the kernel size is 3x3, the step size is 2, convolutional layers 3 and 4 are both set to have the convolution kernel size of 3x3, the step size is 1, the number of neurons is 384, the convolution kernel size of convolutional layer 5 is 3x3, the step size is 1, the number of neurons is 256, the convolutional layer 3 is also the largest convolutional layer, the kernel size is 3x3, and the step size is 2. And finally, connecting three full connection layers, wherein the number of the neurons of the first two layers is set to be 1024, and the number of the neurons of the last full connection layer is set to be 7.
As can be seen from FIG. 2, the traditional convolutional neural network ignores the influence of shallow features on classification accuracy, so in the present invention a novel convolutional neural network is constructed, as shown in FIG. 3. It mainly comprises six convolutional layers, four pooling layers and three fully connected layers. Compared with the conventional network in FIG. 2, convolutional layer 6 and pooling layer 4 are added: convolutional layer 6 has a 3x3 kernel, a stride of 1 and 256 neurons, and pooling layer 4 is also a max-pooling layer with a 3x3 kernel and a stride of 2. The features obtained after three convolutional layers are then fused with the features obtained after five convolutional layers through a fusion layer, and finally three fully connected layers are attached, with 1024 neurons in each of the first two layers and 7 neurons in the last fully connected layer. A sketch of this fused architecture is given below.
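A minimal Keras functional-API sketch of the fused architecture follows, under the assumption that the shallow branch taps the output of convolutional layer 3, is reduced in dimension by the added convolutional layer 6 and max-pooling layer 4, and is concatenated with the deep branch (after pooling layer 3) before the fully connected layers; the tap point and the use of concatenation as the fusion operation are interpretations of the description, not verbatim from it.

```python
from tensorflow.keras import layers, models

def fused_cnn(input_shape=(227, 227, 3), num_classes=7):
    """Six conv layers, four max-pooling layers, fusion of shallow and deep features."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(96, 11, strides=4, activation="relu")(inp)            # conv 1
    x = layers.MaxPooling2D(3, strides=2)(x)                                # pool 1
    x = layers.Conv2D(256, 5, padding="same", activation="relu")(x)         # conv 2
    x = layers.MaxPooling2D(3, strides=2)(x)                                # pool 2
    shallow = layers.Conv2D(384, 3, padding="same", activation="relu")(x)   # conv 3

    # Deep branch: conv 4, conv 5, pool 3 (as in the conventional network).
    deep = layers.Conv2D(384, 3, padding="same", activation="relu")(shallow)  # conv 4
    deep = layers.Conv2D(256, 3, padding="same", activation="relu")(deep)     # conv 5
    deep = layers.MaxPooling2D(3, strides=2)(deep)                            # pool 3

    # Shallow branch: added conv 6 (3x3, 256) + max-pooling 4 reduce its dimension.
    shallow = layers.Conv2D(256, 3, padding="same", activation="relu")(shallow)  # conv 6
    shallow = layers.MaxPooling2D(3, strides=2)(shallow)                         # pool 4

    # Fusion layer: concatenate shallow and deep features, then three FC layers.
    fused = layers.Concatenate()([layers.Flatten()(shallow), layers.Flatten()(deep)])
    fc = layers.Dense(1024, activation="relu")(fused)
    fc = layers.Dense(1024, activation="relu")(fc)
    out = layers.Dense(num_classes, activation="softmax")(fc)
    return models.Model(inp, out)
```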
And S4, speech emotion recognition experiments are carried out with the traditional convolutional neural network and with the convolutional neural network with fused deep and shallow layer features, respectively, and the speech emotion recognition rates of the two networks are compared.
Taking 70% of the spectrogram in the experiment as a training data set, 15% as a verification data set and the rest as a test data set, wherein the training data set is used for creating an effective classifier by adjusting the weight on the convolutional neural network, the verification data set is used for evaluating the performance of model construction in a training stage, a test platform is provided for fine tuning model parameters and selecting an optimal performance model, and the test data set is only used for testing a final trained model so as to confirm the actual classification capability of the model.
The traditional convolutional neural network is trained and tested first. The relationship between the loss and the number of iterations during training is shown in FIG. 4. The initial learning rate of the network is set to 0.0001 and is decayed to 0.1 times the current learning rate every 160 iterations (steps). The training loss begins to converge at close to 500 iterations; when the loss on the validation data set fully converges to 0.89, the model is saved. After 2500 iterations the accuracy on the validation data set reaches 63.33%, and the whole training process lasts about 50 minutes.
With a transfer learning approach, the optimal model trained with the traditional convolutional neural network is used as a pre-trained model, and the network proposed by the invention continues training on this basis, so that the parameters of that model serve as the initialization of the current network instead of random initialization, which speeds up convergence and shortens the training time. The relationship between the loss and the number of iterations during training is shown in FIG. 5. Because the network is initialized with the parameters of the pre-trained model, the initial loss starts from 1.07, and simply by inheriting those parameters the accuracy on the validation data set already reaches 54.26%. The initial learning rate of the network is set to 0.0001 and is decayed to 0.1 times the current learning rate every 160 iterations (steps). The training loss begins to converge at close to 400 iterations; when the loss on the validation data set fully converges to 0.88, the model is saved. After 2500 iterations the accuracy on the validation data set reaches 64.78%, and the whole training process lasts about 45 minutes.
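A sketch of this transfer-learning setup is given below, assuming the Keras API, the fused_cnn model from the previous sketch, a hypothetical checkpoint file name with matching layer names, and the Adam optimizer; only the initial learning rate of 0.0001 and the 0.1x decay every 160 steps are taken from the text.

```python
import tensorflow as tf

model = fused_cnn()                                  # network from the sketch above
# Initialize shared layers from the conventional network's best checkpoint
# (hypothetical file name) instead of random initialization.
model.load_weights("conventional_cnn_best.h5", by_name=True, skip_mismatch=True)

# Learning-rate schedule: 0.0001 initially, multiplied by 0.1 every 160 steps.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=160, decay_rate=0.1, staircase=True)
model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
              loss="categorical_crossentropy", metrics=["accuracy"])

# train_flow / a validation flow would come from the ImageDataGenerator sketch above:
# model.fit(train_flow, validation_data=val_flow, epochs=..., callbacks=[...])
```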
The two models were used to perform the tests in the test data sets, and the specific experimental results are shown in tables 1 and 2.
TABLE 1  Confusion matrix (%) of the traditional convolutional neural network for the seven emotions in the Berlin database

Emotion category   Anger   Boredom  Disgust  Fear    Happiness  Neutral  Sadness
Anger              76.67    2.77     2.22     1.67    16.11      0.56     0
Boredom             0      90.00     0        0        1.67      5.56     2.77
Disgust            16.11   10.00    67.78     1.11     1.11      3.89     0
Fear               19.44   15.00     3.89    31.67    20.56      2.22     7.22
Happiness          55.00    0        2.22     2.22    40.56      0        0
Neutral             0      58.33     0        0        0        38.34     3.33
Sadness             0       6.11     0        0        0         0       93.89
TABLE 2  Confusion matrix (%) of the convolutional neural network with fused deep and shallow features for the seven emotions in the Berlin database

Emotion category   Anger   Boredom  Disgust  Fear    Happiness  Neutral  Sadness
Anger              72.78    3.89     2.22     1.67    19.44      0        0
Boredom             0      96.11     0        0        1.11      1.11     1.67
Disgust            13.89   10.56    68.88     0        1.11      5.56     0
Fear               15.56   20.00     4.45    30.00    22.22      0.56     7.21
Happiness          50.56    0        2.22     1.11    46.11      0        0
Neutral             0      46.11     0        0        0        46.67     7.22
Sadness             0      10.56     0        0        0         0       89.44
From Tables 1 and 2 it can be seen that, compared with the traditional convolutional neural network, the convolutional neural network of the present invention improves the recognition rate of four of the seven emotions in the German Berlin database, namely boredom, disgust, happiness and neutral; the recognition rates of happiness and neutral in particular are greatly improved, and the overall recognition rate is improved by 1.58%. Comparing FIGS. 4 and 5 and Tables 1 and 2, the differences between the recognition rates of the two networks on the validation and test data sets are as follows: the traditional convolutional neural network achieves 63.33% accuracy on the validation set and 62.70% on the test set, a difference of 0.63%, whereas the convolutional neural network with fused deep and shallow features achieves 64.78% on the validation set and 64.28% on the test set, a difference of 0.5%. Compared with the training model of the traditional convolutional neural network, the training model of the proposed network therefore has stronger generalization capability.
The above experimental results show that: compared with the traditional convolutional neural network, the convolutional neural network with the fusion of the deep and shallow layer features can improve the speech emotion recognition rate, can accelerate the convergence rate and reduce the training time under the condition of combining with a transfer learning method, and has stronger generalization capability.
In conclusion, the invention can fully extract the spectrogram characteristics, thereby improving the speech emotion recognition rate. Compared with the traditional convolutional neural network, the convolutional neural network with the fused deep and shallow features provided by the invention can be fully fused with the deep features by reducing the dimension of the shallow features, so that the features which can represent various emotions can be obtained. The invention not only can effectively improve the speech emotion recognition rate and ensure the recognition accuracy, but also has more excellent generalization capability. Meanwhile, the invention combines a transfer learning method, and utilizes the parameters of the traditional convolutional neural network training model as initialization parameters under the condition that the convolutional neural network becomes complex, thereby accelerating the convergence speed in the training process and improving the overall recognition speed and recognition efficiency. In addition, the invention also provides reference for other related problems in the same field, can be expanded and extended on the basis of the reference, is applied to other technical schemes of voice recognition or emotion recognition algorithms in the field, and has very wide application prospect.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description is written in terms of embodiments, not every embodiment contains only a single technical solution; this manner of description is adopted merely for clarity, and those skilled in the art should treat the description as a whole, as the technical solutions in the embodiments may also be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims (4)

1. A speech emotion recognition method is characterized by comprising the following steps:
s1, converting the voice data used in the experiment into a spectrogram;
s2, performing data amplification processing on the obtained spectrogram;
s3, constructing a convolutional neural network fusing deep and shallow layer characteristics by using the traditional convolutional neural network;
S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and with the convolutional neural network with fused deep and shallow layer features, respectively, and comparing the speech emotion recognition rates of the two networks;
the convolutional neural network fused with the deep and shallow layer characteristics comprises an input layer, an intermediate layer and an output layer, wherein the intermediate layer comprises a convolutional layer, a pooling layer and a full-connection layer;
the mapping relation of the convolutional layer is

$$x_j^l = f\Bigl(\sum_i x_i^{l-1} * k_{ij}^l + b_j^l\Bigr)$$

wherein x_j^l is the j-th feature set of the l-th convolutional layer; x_i^{l-1} represents the i-th feature set of the (l-1)-th convolutional layer; k_{ij}^l represents the convolution kernel between the two feature sets; * denotes a two-dimensional convolution operation; b_j^l represents an additive bias; f(·) is the activation function of the convolutional layer;
the mapping relation of the pooling layer is

$$x_j^l = f_p\Bigl(\beta_j^l\,\mathrm{down}\bigl(x_j^{l-1}\bigr) + b_j^l\Bigr)$$

wherein f_p(·) is the activation function of the pooling layer; down(·) denotes the pooling method from layer l-1 to layer l, including mean pooling and max pooling; β_j^l and b_j^l represent the multiplicative bias and the additive bias, respectively;
the matrix features output by the last pooling layer are all arranged into a vector to form a grid layer, and the grid layer is connected to the fully connected layer; the output relation of any node j in the grid layer is

$$y_j = f_h\Bigl(\sum_i w_{i,j}\,x_i - \theta_j\Bigr)$$

wherein f_h(·) represents an activation function; w_{i,j} represents the weight between the input vector element x_i and node j; θ_j is the node threshold;
the fully connected layer adopts a Softmax model to solve the multi-classification problem; the loss function of Softmax is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}}$$

wherein z_j^l represents the input of the j-th neuron in layer l; Σ_n e^{z_n^l} represents the sum over the inputs of all neurons in layer l; a_j^l = e^{z_j^l} / Σ_n e^{z_n^l} represents the output of the j-th neuron in layer l; e represents the natural constant; 1(·) is the indicator function, whose value is 1 when the expression in the braces is true and 0 when it is false;
in the fully connected layer, over-large parameters are penalized during training by introducing a weight decay term; the specific expression is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\bigl\{y^{(i)}=j\bigr\}\log\frac{e^{z_j^l}}{\sum_{n} e^{z_n^l}} + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$$

wherein (λ/2)·Σ_{i,j} θ_{ij}² is the weight decay term.
2. The speech emotion recognition method of claim 1, wherein: the voice data of S1 come from the German Berlin emotional speech database; the sampling frequency of the voice data is 16 kHz with 16-bit quantization; and the voice data comprise seven types of emotion, namely anger, boredom, disgust, fear, happiness, neutral and sadness.
3. The method for recognizing speech emotion according to claim 1, wherein the step of converting the speech data used in the experiment into a spectrogram at S1 comprises the steps of:
S11, framing each segment of voice data to obtain x(m, n), wherein n is the frame length and m is the number of frames;
S12, performing an FFT on x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)';
S13, taking 10·log10(Y(m, n)), converting m to a time scale M and n to a frequency scale N;
S14, plotting (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
4. The method for recognizing speech emotion according to claim 1, wherein the data amplification processing of the obtained spectrograms in S2 comprises: performing data amplification on the spectrograms with the Keras deep learning framework, including random image rotation, horizontal translation, vertical translation, shear transformation, image scaling and horizontal flipping.
CN201810685220.0A 2018-06-28 2018-06-28 Speech emotion recognition method Active CN109036465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810685220.0A CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810685220.0A CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Publications (2)

Publication Number Publication Date
CN109036465A CN109036465A (en) 2018-12-18
CN109036465B true CN109036465B (en) 2021-05-11

Family

ID=65520725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810685220.0A Active CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Country Status (1)

Country Link
CN (1) CN109036465B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10812334B2 (en) * 2018-06-29 2020-10-20 Forescout Technologies, Inc. Self-training classification
CN109637522B (en) * 2018-12-26 2022-12-09 杭州电子科技大学 Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN110459225B (en) * 2019-08-14 2022-03-22 南京邮电大学 Speaker recognition system based on CNN fusion characteristics
CN110534133B (en) * 2019-08-28 2022-03-25 珠海亿智电子科技有限公司 Voice emotion recognition system and voice emotion recognition method
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
CN110619889B (en) * 2019-09-19 2022-03-15 Oppo广东移动通信有限公司 Sign data identification method and device, electronic equipment and storage medium
CN110634491B (en) * 2019-10-23 2022-02-01 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN111449644A (en) * 2020-03-19 2020-07-28 复旦大学 Bioelectricity signal classification method based on time-frequency transformation and data enhancement technology
CN111583964B (en) * 2020-04-14 2023-07-21 台州学院 Natural voice emotion recognition method based on multimode deep feature learning
CN111883178B (en) * 2020-07-17 2023-03-17 渤海大学 Double-channel voice-to-image-based emotion recognition method
CN112151071B (en) * 2020-09-23 2022-10-28 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113643724B (en) * 2021-07-06 2023-04-28 中国科学院声学研究所南海研究站 Kiwi emotion recognition method and system based on time-frequency double-branch characteristics

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN107895571A (en) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 Lossless audio file identification method and device
CN108010533A (en) * 2016-10-27 2018-05-08 北京酷我科技有限公司 The automatic identifying method and device of voice data code check
CN108205535A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 The method and its system of Emotion tagging
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN107067011A (en) * 2017-03-20 2017-08-18 北京邮电大学 A kind of vehicle color identification method and device based on deep learning
CN107705806A (en) * 2017-08-22 2018-02-16 北京联合大学 A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
CN108009148A (en) * 2017-11-16 2018-05-08 天津大学 Text emotion classification method for expressing based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition;Linhui Sun et al.;《International Journal of Speech Technology》;20180829;第931-940页 *
Hypernet: Towards accurate region proposal generation and joint object detection;Tao Kong et al.;《Proceedings of the IEEE conference on computer》;20161231;第845-853页 *
Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network;Abdul Malik Badshah et al.;《International Conference on Platform》;20171231;全文 *

Also Published As

Publication number Publication date
CN109036465A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN109036465B (en) Speech emotion recognition method
Sun et al. Speech emotion recognition based on DNN-decision tree SVM model
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
CN106782602B (en) Speech emotion recognition method based on deep neural network
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN107578775B (en) Multi-classification voice method based on deep neural network
CN109992779B (en) Emotion analysis method, device, equipment and storage medium based on CNN
Kamaruddin et al. Cultural dependency analysis for understanding speech emotion
CN109637522B (en) Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN107705806A (en) A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
CN109389992A (en) A kind of speech-emotion recognition method based on amplitude and phase information
CN106847309A (en) A kind of speech-emotion recognition method
CN110534132A (en) A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN108197294A (en) A kind of text automatic generation method based on deep learning
CN107785015A (en) A kind of audio recognition method and device
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN107039036A (en) A kind of high-quality method for distinguishing speek person based on autocoding depth confidence network
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN106875940A (en) A kind of Machine self-learning based on neutral net builds knowledge mapping training method
CN109558935A (en) Emotion recognition and exchange method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant