CN109036465A - Speech-emotion recognition method - Google Patents
- Publication number: CN109036465A
- Application number: CN201810685220.0A
- Authority: CN (China)
- Prior art keywords: layer, speech, emotion recognition, neural networks, convolutional neural
- Prior art date: 2018-06-28
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for estimating an emotional state
- G06N3/045: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
- G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
Abstract
The present invention discloses a speech emotion recognition method comprising the following steps: S1, converting the experimental speech data into spectrograms; S2, performing data augmentation on the obtained spectrograms; S3, constructing a convolutional neural network that fuses deep and shallow features on the basis of a traditional convolutional neural network; S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and the feature-fusion convolutional neural network respectively, and comparing the speech emotion recognition rates of the two. The invention can fully extract spectrogram features and thereby improve the speech emotion recognition rate. Compared with a traditional convolutional neural network, the proposed deep-shallow feature-fusion network reduces the dimensionality of the shallow features and fuses them fully with the deep features, obtaining features that are more representative of each emotion class. The invention not only effectively improves the speech emotion recognition rate and ensures recognition accuracy, but also exhibits better generalization ability.
Description
Technical field
The present invention relates to a speech emotion recognition method, in particular to a speech emotion recognition method based on deep-shallow feature fusion in convolutional neural networks, and belongs to the technical field of speech emotion recognition.
Background art
As a complex human psychological behavior, emotion has long been a research hotspot in fields such as psychology and artificial intelligence. Speech is the most natural form of human communication: a speech signal carries not only the content to be conveyed but also rich emotional cues, and speech signals are now widely used in emotion research.
Speech emotion recognition studies the formation and variation of a speaker's emotional state from the perspective of the speech signal, making interaction between computers and humans more intelligent. In current research, the acoustic features used for emotion recognition mainly include spectral features, prosodic features, voice-quality features, and fusions of these features. Moreover, studies often attend only to the time domain or only to the frequency domain, yet the correlation between the frequency-domain and time-domain behavior of the speech signal also plays an important role in speech emotion recognition. The spectrogram is a visual representation of the speech signal in which the horizontal axis represents time and the vertical axis represents frequency, so it connects both domains. By modeling each frequency bin of the spectrogram as an image pixel, image features can be used to study the relations between adjacent frequencies; the results then reflect both the time-frequency characteristics of the speech and the vocal characteristics of the speaker. Many researchers have already combined image processing with speech processing via spectrograms and achieved good results.
Speech emotion recognition methods generally fall into two classes: traditional machine learning methods and deep learning methods. In either case, feature extraction is a crucial step in the recognition process. The key to traditional machine learning methods is feature selection, which directly determines the accuracy of speech emotion recognition. To date, a large number of spectral, prosodic, and voice-quality features have been used for speech emotion recognition, but such features may be insufficient to express subjective emotion. Compared with traditional machine learning methods, deep learning methods can extract high-level features and have achieved notable success in vision-related tasks.
In recent years, deep convolutional neural networks (DCNNs) have made great progress in speech emotion recognition research. However, in a traditional convolutional neural network, as the convolutional layers go deeper, the dimensionality of the feature maps shrinks and the features become increasingly abstract: semantic features grow more and more prominent, while the global information of the spectrogram becomes increasingly blurred. Shallow features provide global information but weak semantics, whereas deep features provide sufficient semantics but lack global information, so the finally extracted emotion features cannot accurately distinguish the emotion classes.
In conclusion how a kind of speech-emotion recognition method based on convolutional neural networks depth layer Fusion Features, will
Further feature is together with shallow-layer Fusion Features, so that the bigger affective characteristics of distinction are obtained, to solve traditional convolution mind
Shortcoming through network in speech emotion recognition problem just becomes those skilled in that art's urgent problem to be solved.
Summary of the invention
In view of the above drawbacks of the prior art, the purpose of the present invention is to propose a speech emotion recognition method based on deep-shallow feature fusion in convolutional neural networks.
Specifically, the speech emotion recognition method comprises the following steps:
S1, converting the experimental speech data into spectrograms;
S2, performing data augmentation on the obtained spectrograms;
S3, constructing a convolutional neural network that fuses deep and shallow features on the basis of a traditional convolutional neural network;
S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and the feature-fusion convolutional neural network respectively, and comparing the speech emotion recognition rates of the two.
Preferably, the speech data in S1 comes from the German Berlin emotional speech database; the speech data is sampled at 16 kHz with 16-bit quantization; and the speech data covers seven emotion classes in total: anger, boredom, disgust, fear, happiness, neutral, and sadness.
Preferably, converting the experimental speech data into spectrograms in S1 comprises the following steps:
S11, framing each segment of speech data into x(m, n), where n is the frame length and m is the number of frames;
S12, applying an FFT to x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)′ (i.e., the squared magnitude spectrum);
S13, taking 10·log10(Y(m, n)), scaling m to the time axis M and n to the frequency axis N;
S14, rendering (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
Preferably, the data augmentation of the obtained spectrograms in S2 comprises the following steps: augmenting the spectrograms with the Keras deep learning framework, the augmentation including random image rotation, horizontal translation, vertical translation, shear transformation, image scaling, and horizontal flipping.
Preferably, the feature-fusion convolutional neural network comprises an input layer, intermediate layers, and an output layer, the intermediate layers comprising convolutional layers, pooling layers, and fully connected layers.
Preferably, the mapping relation of the convolutional layer is

$x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right)$

where $x_j^l$ is the j-th feature map of the l-th convolutional layer; $x_i^{l-1}$ denotes the i-th feature map of the (l-1)-th convolutional layer; $k_{ij}^l$ denotes the convolution kernel between the two feature maps; * denotes the two-dimensional convolution operation; and $b_j^l$ denotes the additive bias.
Preferably, the mapping relation of the pooling layer is

$x_j^l = f_p\left(\beta_j^l \, \mathrm{down}(x_j^{l-1}) + b_j^l\right)$

where $f_p(\cdot)$ is the activation function of the pooling layer; $\mathrm{down}(\cdot)$ denotes the pooling method from layer l-1 to layer l, including the two methods of mean pooling and max pooling; and $\beta_j^l$ and $b_j^l$ denote the multiplicative bias and the additive bias, respectively.
Preferably, the feature maps of the last pooling layer are flattened into a vector to form a rasterization layer, which is connected to the fully connected layers; the output of any node j in the rasterization layer is

$y_j = f_h\left(\sum_i w_{i,j} x_i - \theta_j\right)$

where $f_h(\cdot)$ denotes the activation function; $w_{i,j}$ denotes the weight between input $x_i$ and node j; and $\theta_j$ is the node threshold.
Preferably, the fully connected layer solves the multi-class problem with a Softmax model, whose loss function is

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log a_j^l,\qquad a_j^l = \frac{e^{z_j^l}}{\sum_{k'} e^{z_{k'}^l}}$

where $z_j^l$ denotes the input of the j-th neuron of layer l (usually the last layer); $\sum_{k'} e^{z_{k'}^l}$ sums over the inputs of all neurons of layer l; $a_j^l$ denotes the output of the j-th neuron of layer l; e denotes the natural constant; and $1\{\cdot\}$ is the indicator function, equal to 1 when the condition in braces is true and 0 when it is false.
Preferably, in the fully connected layer a weight decay term is introduced to penalize overly large parameters during training, the expression being

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log a_j^l + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$

where $\frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$ is the weight decay term.
Compared with the prior art, the advantages of the present invention are mainly reflected in the following aspects:
The invention can fully extract spectrogram features and thereby improve the speech emotion recognition rate. Compared with a traditional convolutional neural network, the proposed deep-shallow feature-fusion network reduces the dimensionality of the shallow features and fuses them fully with the deep features, obtaining features that are more representative of each emotion class. The invention not only effectively improves the speech emotion recognition rate and ensures recognition accuracy, but also has better generalization ability. The test results show that with the proposed feature-fusion convolutional neural network, the recognition rates of four of the seven emotion classes in the Berlin database (boredom, disgust, happiness, and neutral) are improved, the recognition rates of happiness and neutral in particular improve greatly, and the overall recognition rate is improved by 1.58%.
Meanwhile present invention incorporates the methods of shift learning to utilize tradition in the case where convolutional neural networks complicate
Convolutional neural networks training pattern parameter as initiation parameter, to accelerate the convergence rate in training process, mention
Whole recognition speed and recognition efficiency are risen.
In addition, the invention provides a reference for other related problems in the same field; it can be extended on this basis and applied to other speech recognition or emotion recognition solutions in the field, and thus has broad application prospects.
In conclusion the invention proposes a kind of speech emotion recognitions based on convolutional neural networks depth layer Fusion Features
Method.With very high use and promotional value.
The embodiments of the present invention are described in further detail below in conjunction with the accompanying drawings, so that the technical solution of the invention is easier to understand and master.
Brief description of the drawings
Fig. 1 shows spectrogram samples of some emotional utterances from the Berlin corpus used in the present invention;
Fig. 2 shows the traditional convolutional neural network;
Fig. 3 shows the improved convolutional neural network of the present invention;
Fig. 4 shows the training process of the traditional convolutional neural network;
Fig. 5 shows the training process of the improved convolutional neural network of the present invention.
Detailed description of the embodiments
As shown in the figures, the present invention discloses a speech emotion recognition method comprising the following steps:
S1, converting the experimental speech data into spectrograms.
The invention uses the German Berlin emotional speech database, sampled at 16 kHz with 16-bit quantization, which contains seven emotion classes in total: anger, boredom, disgust, fear, happiness, neutral, and sadness.
Specifically, converting the experimental speech data into spectrograms in S1 comprises the following steps:
S11, framing each segment of speech to obtain x(100, 512), where the frame length n is 512 and the number of frames m is 100;
S12, applying an FFT to obtain X(100, 512), and computing the periodogram Y(100, 512) = X(100, 512) · X(100, 512)′ (i.e., the squared magnitude spectrum);
S13, taking 10·log10(Y(100, 512)), scaling the number of frames to the time axis M and the frame length to the frequency axis N;
S14, rendering (M, N, 10·log10(Y(m, n))) as a two-dimensional image, which is the spectrogram. Some sample spectrograms are shown in Fig. 1.
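Steps S11-S14 map directly onto a few lines of numerical Python. The sketch below is a minimal illustration only: the hop-length computation, the 1e-10 log floor, and the use of SciPy/Matplotlib are our assumptions, while the patent fixes only m = 100 frames, n = 512 samples per frame, and the 10·log10 periodogram.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

def spectrogram(wav_path, n_frames=100, frame_len=512):
    """Convert one utterance to a log-power spectrogram (steps S11-S14)."""
    fs, signal = wavfile.read(wav_path)           # 16 kHz, 16-bit samples
    signal = signal.astype(np.float64)

    # S11: split the signal into m frames of length n -> x(m, n)
    hop = max(1, (len(signal) - frame_len) // (n_frames - 1))
    x = np.stack([signal[i * hop : i * hop + frame_len]
                  for i in range(n_frames)])

    # S12: FFT of each frame, then the periodogram Y = X * conj(X)
    X = np.fft.rfft(x, axis=1)
    Y = (X * np.conj(X)).real

    # S13: 10*log10 of the periodogram (small floor avoids log(0))
    log_power = 10.0 * np.log10(Y + 1e-10)

    # S14: render (time, frequency, power) as a two-dimensional image
    plt.imshow(log_power.T, origin="lower", aspect="auto", cmap="jet")
    plt.xlabel("Time (frames)")
    plt.ylabel("Frequency (bins)")
    return log_power
```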
S2, performing data augmentation on the obtained spectrograms. To meet the very large data requirements of deep neural networks, the invention uses the Keras deep learning framework to augment the spectrograms; the main operations include random image rotation, horizontal translation, vertical translation, shear transformation, image scaling, horizontal flipping, and vertical flipping, finally yielding the large amount of data required by the experiments.
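Since the patent names Keras as the augmentation framework, a minimal sketch of such a pipeline could look as follows; the parameter values, directory layout, target image size, and batch size are illustrative assumptions, not values given in the patent.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation operations listed in S2: rotation, shifts, shear,
# zoom (scaling), and horizontal/vertical flips.
datagen = ImageDataGenerator(
    rotation_range=20,          # random rotation
    width_shift_range=0.1,      # horizontal translation
    height_shift_range=0.1,     # vertical translation
    shear_range=0.2,            # shear transformation
    zoom_range=0.2,             # image scaling
    horizontal_flip=True,
    vertical_flip=True,
)

# Illustrative usage: stream augmented spectrogram images from disk.
train_flow = datagen.flow_from_directory(
    "spectrograms/train", target_size=(227, 227),
    batch_size=32, class_mode="categorical")
```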
S3, constructing the convolutional neural network that fuses deep and shallow features on the basis of the traditional convolutional neural network.
A convolutional neural network is a feedforward neural network that generally comprises an input layer, intermediate layers, and an output layer. The intermediate layers consist of one or more "convolution + pooling" feature extraction stages followed by fully connected layers; each layer consists of two-dimensional planes, and each plane contains a number of neuron nodes. As the feature extraction layer, the convolutional layer is the most important part of the whole network: it extracts features such as voiceprint and energy from the various emotion spectrograms for subsequent classification. The mapping relation before and after convolution is

$x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right)$

where $x_j^l$ is the j-th feature map of the l-th convolutional layer, $x_i^{l-1}$ denotes the i-th feature map of the (l-1)-th convolutional layer, $k_{ij}^l$ denotes the convolution kernel between the two feature maps, * denotes the two-dimensional convolution operation, and $b_j^l$ denotes the additive bias.
A pooling layer is usually connected after each convolutional layer to reduce the dimensionality of the convolved features and prevent overfitting during training. The pooling process is

$x_j^l = f_p\left(\beta_j^l \, \mathrm{down}(x_j^{l-1}) + b_j^l\right)$

where $f_p(\cdot)$ is the activation function of the pooling layer, $\mathrm{down}(\cdot)$ denotes the pooling method from layer l-1 to layer l (generally divided into the two methods of mean pooling and max pooling), and $\beta_j^l$ and $b_j^l$ denote the multiplicative bias and the additive bias, respectively.
The feature maps of the last pooling layer are flattened into a vector to form a rasterization layer connected to the fully connected layers, in which the output of any node j is

$y_j = f_h\left(\sum_i w_{i,j} x_i - \theta_j\right)$

where $f_h(\cdot)$ denotes the activation function, $w_{i,j}$ denotes the weight between input $x_i$ and node j, and $\theta_j$ is the node threshold.
The fully connected layers generally use a Softmax model to solve the multi-class problem. The loss function of Softmax is

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log a_j^l,\qquad a_j^l = \frac{e^{z_j^l}}{\sum_{k'} e^{z_{k'}^l}}$

where $z_j^l$ denotes the input of the j-th neuron of layer l (usually the last layer), $\sum_{k'} e^{z_{k'}^l}$ sums over the inputs of all neurons of layer l, $a_j^l$ denotes the output of the j-th neuron of layer l, e denotes the natural constant, and $1\{\cdot\}$ is the indicator function, equal to 1 when the condition in braces is true and 0 when it is false.
To prevent J(θ) from being minimized suboptimally, a weight decay term $\frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$ is introduced to penalize overly large parameters during training. The expression is

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log a_j^l + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$
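For concreteness, the regularized Softmax loss above can be written out in a few lines of NumPy. This is an illustrative sketch of the formula only, not the patent's training code; the max-shift for numerical stability and the 1e-12 log floor are our additions.

```python
import numpy as np

def softmax_loss(Z, y, theta, lam=1e-3):
    """Softmax cross-entropy with weight decay (lambda/2 * sum(theta^2)).

    Z:     (m, k) inputs z_j^l of the last layer's neurons
    y:     (m,)   integer class labels in [0, k)
    theta: flat array of all weights to be penalized
    """
    Z = Z - Z.max(axis=1, keepdims=True)                  # numerical stability
    A = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # outputs a_j^l
    m = Z.shape[0]
    data_loss = -np.log(A[np.arange(m), y] + 1e-12).mean()
    decay = 0.5 * lam * np.sum(theta ** 2)                # weight decay term
    return data_loss + decay
```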
In general, the more layers a convolutional neural network has, the more discriminative the extracted features, but too many layers lead to excessively long training times and convergence difficulties. We therefore construct a five-convolutional-layer network that extracts discriminative features while keeping the training time down; the specific network is shown in Fig. 2. It consists mainly of five convolutional layers, three pooling layers, and three fully connected layers. Convolutional layer 1 has 11x11 kernels with stride 4 and 96 neurons; pooling layer 1 is a max pooling layer with 3x3 kernels and stride 2. Convolutional layer 2 has 5x5 kernels, stride 1, and 256 neurons; pooling layer 2 is also a max pooling layer with 3x3 kernels and stride 2. Convolutional layers 3 and 4 both have 3x3 kernels, stride 1, and 384 neurons; convolutional layer 5 has 3x3 kernels, stride 1, and 256 neurons; pooling layer 3 is likewise a max pooling layer with 3x3 kernels and stride 2. Finally, three fully connected layers are attached: the first two have 1024 neurons each, and the last has 7.
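This baseline follows an AlexNet-style layout; a minimal Keras sketch is given below. The 227x227x3 input size, the ReLU activations, and the "same" padding on convolutional layers 2-5 are assumptions, since the patent specifies only the kernel sizes, strides, and neuron counts.

```python
from tensorflow.keras import layers, models

def traditional_cnn(input_shape=(227, 227, 3), n_classes=7):
    """Five-convolutional-layer baseline of Fig. 2 (a sketch)."""
    return models.Sequential([
        layers.Conv2D(96, 11, strides=4, activation="relu",
                      input_shape=input_shape),               # conv 1
        layers.MaxPooling2D(3, strides=2),                    # pool 1
        layers.Conv2D(256, 5, padding="same",
                      activation="relu"),                     # conv 2
        layers.MaxPooling2D(3, strides=2),                    # pool 2
        layers.Conv2D(384, 3, padding="same",
                      activation="relu"),                     # conv 3
        layers.Conv2D(384, 3, padding="same",
                      activation="relu"),                     # conv 4
        layers.Conv2D(256, 3, padding="same",
                      activation="relu"),                     # conv 5
        layers.MaxPooling2D(3, strides=2),                    # pool 3
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dense(1024, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),        # 7 emotions
    ])
```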
As Fig. 2 makes clear, the traditional convolutional neural network ignores the influence of the shallow features on classification accuracy. In the present invention we therefore construct a novel convolutional neural network, shown in Fig. 3, consisting mainly of six convolutional layers, four pooling layers, and three fully connected layers. Relative to the traditional network of Fig. 2, we add convolutional layer 6 (3x3 kernels, stride 1, 256 neurons) and pooling layer 4 (likewise a max pooling layer with 3x3 kernels and stride 2), and then merge, through a fusion layer, the features obtained after three convolutional layers with those obtained after five. Finally, three fully connected layers are attached: the first two have 1024 neurons each, and the last has 7.
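A minimal Keras functional-API sketch of this fusion idea follows. It assumes that the added convolutional layer 6 and pooling layer 4 form a branch that reduces the shallow (conv 3) features to the same spatial size as the deep (conv 5) features before a concatenation-style fusion layer; the exact wiring of Fig. 3 may differ.

```python
from tensorflow.keras import layers, models

def fusion_cnn(input_shape=(227, 227, 3), n_classes=7):
    """Deep-shallow feature-fusion network sketched after Fig. 3."""
    inp = layers.Input(shape=input_shape)

    x = layers.Conv2D(96, 11, strides=4, activation="relu")(inp)    # conv 1
    x = layers.MaxPooling2D(3, strides=2)(x)                        # pool 1
    x = layers.Conv2D(256, 5, padding="same", activation="relu")(x) # conv 2
    x = layers.MaxPooling2D(3, strides=2)(x)                        # pool 2
    shallow = layers.Conv2D(384, 3, padding="same",
                            activation="relu")(x)                   # conv 3

    deep = layers.Conv2D(384, 3, padding="same",
                         activation="relu")(shallow)                # conv 4
    deep = layers.Conv2D(256, 3, padding="same",
                         activation="relu")(deep)                   # conv 5
    deep = layers.MaxPooling2D(3, strides=2)(deep)                  # pool 3

    # Added branch: conv 6 + pool 4 reduce the shallow features so their
    # spatial size matches the deep branch before fusion.
    red = layers.Conv2D(256, 3, padding="same",
                        activation="relu")(shallow)                 # conv 6
    red = layers.MaxPooling2D(3, strides=2)(red)                    # pool 4

    fused = layers.Concatenate()([red, deep])                       # fusion layer
    x = layers.Flatten()(fused)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense(1024, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```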
S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and the feature-fusion convolutional neural network respectively, and comparing the speech emotion recognition rates of the two.
In the experiments, 70% of the spectrograms are used as the training set, 15% as the validation set, and the remainder as the test set. The training set is used to build an effective classifier by adjusting the weights of the convolutional neural network; the validation set is used to evaluate the model built in the training stage, providing a test bed for fine-tuning the model parameters and selecting the best-performing model; and the test set is used only to evaluate the finally trained model, confirming its actual classification ability.
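One simple way to realize such a 70/15/15 split is sketched below; it assumes the spectrogram images X and labels y are already loaded as arrays, and the use of scikit-learn and stratified sampling are our assumptions, not details given in the patent.

```python
from sklearn.model_selection import train_test_split

def split_70_15_15(X, y, seed=0):
    """Illustrative 70/15/15 train/validation/test split of spectrograms."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, train_size=0.70, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```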
The traditional convolutional neural network is trained and tested first. The relationship between loss and the number of iterations during training is shown in Fig. 4. The initial learning rate of the network is set to 0.0001 and decays to 0.1 times the current learning rate every 160 iterations (steps). The training loss starts to converge after nearly 500 iterations; when the validation loss fully converges to 0.89, we save the model. After 2500 iterations the validation accuracy reaches 63.33%, and the whole training process takes about 50 minutes.
Using the method for transfer learning, use the trained optimal model of traditional convolutional neural networks as pre-training mould
Type continues to train on the basis of the model using the network proposed in the present invention, in this way can using the parameter of the model as
Convergence rate, less training time are accelerated in the initialization of current network, rather than random initializtion.Loss in training process
Relationship with the number of iterations is as shown in figure 5, since the parameter of initialization network is generated using the model of pre-training, initially
Loss will be since 1.07, and pass through the parameter for inheriting pre-training model, so that the accuracy of validation data set reaches
54.26%, the initial learning rate of network is set as 0.0001, decays to the 0.1 of current learning rate after every 160 iteration (step-length)
Times, training loss starts to restrain at iteration nearly 400 times, when validation data set loss Complete Convergence is in 0.88 when, Wo Menbao
Model is deposited, after 2500 iteration, the accuracy of validation data set has reached 64.78%, and entire training process continues greatly
About 45 minutes.
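A minimal Keras sketch of this warm-start scheme is given below, reusing the fusion_cnn sketch above. The weight-file name, the SGD optimizer, the name-based weight transfer, and the train_flow/val_flow data generators are assumptions; the patent specifies only the initial learning rate of 0.0001 and the decay to 0.1 times the current rate every 160 steps.

```python
import tensorflow as tf

model = fusion_cnn()                       # proposed network (Fig. 3 sketch)

# Warm start: inherit the baseline's trained parameters where layer
# names match, instead of random initialization.
model.load_weights("traditional_cnn_best.h5", by_name=True,
                   skip_mismatch=True)

# Learning-rate schedule from the experiments: start at 1e-4 and
# multiply by 0.1 every 160 steps.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=160,
    decay_rate=0.1, staircase=True)

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=schedule),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, validation_data=val_flow, epochs=25)
```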
The two models are then evaluated on the test set; the specific experimental results are shown in Table 1 and Table 2.
Table 1. Confusion matrix (%) of the traditional convolutional neural network on the seven emotion classes of the Berlin database
| Emotion class | Anger | Boredom | Disgust | Fear | Happiness | Neutral | Sadness |
|---|---|---|---|---|---|---|---|
| Anger | 76.67 | 2.77 | 2.22 | 1.67 | 16.11 | 0.56 | 0 |
| Boredom | 0 | 90.00 | 0 | 0 | 1.67 | 5.56 | 2.77 |
| Disgust | 16.11 | 10.00 | 67.78 | 1.11 | 1.11 | 3.89 | 0 |
| Fear | 19.44 | 15.00 | 3.89 | 31.67 | 20.56 | 2.22 | 7.22 |
| Happiness | 55.00 | 0 | 2.22 | 2.22 | 40.56 | 0 | 0 |
| Neutral | 0 | 58.33 | 0 | 0 | 0 | 38.34 | 3.33 |
| Sadness | 0 | 6.11 | 0 | 0 | 0 | 0 | 93.89 |
Table 2. Confusion matrix (%) of the deep-shallow feature-fusion convolutional neural network on the seven emotion classes of the Berlin database
| Emotion class | Anger | Boredom | Disgust | Fear | Happiness | Neutral | Sadness |
|---|---|---|---|---|---|---|---|
| Anger | 72.78 | 3.89 | 2.22 | 1.67 | 19.44 | 0 | 0 |
| Boredom | 0 | 96.11 | 0 | 0 | 1.11 | 1.11 | 1.67 |
| Disgust | 13.89 | 10.56 | 68.88 | 0 | 1.11 | 5.56 | 0 |
| Fear | 15.56 | 20.00 | 4.45 | 30.00 | 22.22 | 0.56 | 7.21 |
| Happiness | 50.56 | 0 | 2.22 | 1.11 | 46.11 | 0 | 0 |
| Neutral | 0 | 46.11 | 0 | 0 | 0 | 46.67 | 7.22 |
| Sadness | 0 | 10.56 | 0 | 0 | 0 | 0 | 89.44 |
From Tables 1 and 2 we can see that, relative to the traditional convolutional neural network, the network of the present invention improves the recognition rates of boredom, disgust, happiness, and neutral among the seven Berlin emotion classes; the recognition rates of happiness and neutral in particular improve greatly, and the overall recognition rate improves by 1.58%. From Figs. 4 and 5 and Tables 1 and 2 we also compare, for each network, the gap between the validation and test recognition rates: the traditional network reaches 63.33% accuracy on the validation set and 62.70% on the test set, a gap of 0.63%, while the proposed deep-shallow feature-fusion network reaches 64.78% on the validation set and 64.28% on the test set, a gap of 0.5%. Relative to the traditional trained model, the proposed trained model therefore has stronger generalization ability.
The above results show that, compared with the traditional convolutional neural network, the proposed deep-shallow feature-fusion network improves the speech emotion recognition rate and, when combined with the transfer learning method, accelerates convergence and reduces the training time; moreover, the trained feature-fusion model has stronger generalization ability.
In conclusion the present invention can sufficiently extract sound spectrograph feature, to improve speech emotion recognition rate.Relative to biography
The convolutional neural networks of system, the present invention proposed in depth layer Fusion Features convolutional neural networks can by by shallow-layer spy
Sign carries out dimensionality reduction, is fully merged with further feature, to obtain the feature more representative of all kinds of emotions.The present invention is not only
The accuracy that speech emotion recognition rate can be effectively improved, ensure to identify, and there is more excellent generalization ability.Together
When, present invention incorporates the methods of shift learning, in the case where convolutional neural networks complicate, utilize traditional convolutional Neural
The parameter of network training model is as initiation parameter, to accelerate the convergence rate in training process, improves whole
Recognition speed and recognition efficiency.It, can be as in addition, the present invention also provides reference for other relevant issues in same domain
According to expansion extension is carried out, apply in field in the technical solution of other speech recognitions or emotion recognition algorithm, has very
Wide application prospect.
It is apparent to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claims involved.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity, and those skilled in the art should take the specification as a whole, since the technical solutions of the various embodiments may be combined appropriately to form other embodiments understandable to those skilled in the art.
Claims (10)
1. A speech emotion recognition method, characterized by comprising the following steps:
S1, converting the experimental speech data into spectrograms;
S2, performing data augmentation on the obtained spectrograms;
S3, constructing a convolutional neural network that fuses deep and shallow features on the basis of a traditional convolutional neural network;
S4, carrying out speech emotion recognition experiments with the traditional convolutional neural network and the feature-fusion convolutional neural network respectively, and comparing the speech emotion recognition rates of the two.
2. The speech emotion recognition method according to claim 1, characterized in that the speech data in S1 comes from the German Berlin emotional speech database; the speech data is sampled at 16 kHz with 16-bit quantization; and the speech data covers seven emotion classes in total: anger, boredom, disgust, fear, happiness, neutral, and sadness.
3. The speech emotion recognition method according to claim 1, characterized in that converting the experimental speech data into spectrograms in S1 comprises the following steps:
S11, framing each segment of speech data into x(m, n), where n is the frame length and m is the number of frames;
S12, applying an FFT to x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)′;
S13, taking 10·log10(Y(m, n)), scaling m to the time axis M and n to the frequency axis N;
S14, rendering (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
4. The speech emotion recognition method according to claim 1, characterized in that the data augmentation of the obtained spectrograms in S2 comprises the following steps: augmenting the spectrograms with the Keras deep learning framework, the augmentation including random image rotation, horizontal translation, vertical translation, shear transformation, image scaling, and horizontal flipping.
5. The speech emotion recognition method according to claim 1, characterized in that the feature-fusion convolutional neural network comprises an input layer, intermediate layers, and an output layer, the intermediate layers comprising convolutional layers, pooling layers, and fully connected layers.
6. The speech emotion recognition method according to claim 5, characterized in that the mapping relation of the convolutional layer is

$x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right)$

where $x_j^l$ is the j-th feature map of the l-th convolutional layer; $x_i^{l-1}$ denotes the i-th feature map of the (l-1)-th convolutional layer; $k_{ij}^l$ denotes the convolution kernel between the two feature maps; * denotes the two-dimensional convolution operation; and $b_j^l$ denotes the additive bias.
7. The speech emotion recognition method according to claim 5, characterized in that the mapping relation of the pooling layer is

$x_j^l = f_p\left(\beta_j^l \, \mathrm{down}(x_j^{l-1}) + b_j^l\right)$

where $f_p(\cdot)$ is the activation function of the pooling layer; $\mathrm{down}(\cdot)$ denotes the pooling method from layer l-1 to layer l, including the two methods of mean pooling and max pooling; and $\beta_j^l$ and $b_j^l$ denote the multiplicative bias and the additive bias, respectively.
8. The speech emotion recognition method according to claim 5, characterized in that the feature maps of the last pooling layer are flattened into a vector to form a rasterization layer connected to the fully connected layers, and the output of any node j in the rasterization layer is

$y_j = f_h\left(\sum_i w_{i,j} x_i - \theta_j\right)$

where $f_h(\cdot)$ denotes the activation function; $w_{i,j}$ denotes the weight between input $x_i$ and node j; and $\theta_j$ is the node threshold.
9. The speech emotion recognition method according to claim 5, characterized in that the fully connected layer solves the multi-class problem with a Softmax model, whose loss function is

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log a_j^l,\qquad a_j^l = \frac{e^{z_j^l}}{\sum_{k'} e^{z_{k'}^l}}$

where $z_j^l$ denotes the input of the j-th neuron of layer l (usually the last layer); $\sum_{k'} e^{z_{k'}^l}$ sums over the inputs of all neurons of layer l; $a_j^l$ denotes the output of the j-th neuron of layer l; e denotes the natural constant; and $1\{\cdot\}$ is the indicator function, equal to 1 when the condition in braces is true and 0 when it is false.
10. The speech emotion recognition method according to claim 9, characterized in that in the fully connected layer a weight decay term is introduced to penalize overly large parameters during training, the expression being

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log a_j^l + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$

where $\frac{\lambda}{2}\sum_{i,j}\theta_{ij}^2$ is the weight decay term.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810685220.0A CN109036465B (en) | 2018-06-28 | 2018-06-28 | Speech emotion recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810685220.0A CN109036465B (en) | 2018-06-28 | 2018-06-28 | Speech emotion recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109036465A true CN109036465A (en) | 2018-12-18 |
CN109036465B CN109036465B (en) | 2021-05-11 |
Family
ID=65520725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810685220.0A Active CN109036465B (en) | 2018-06-28 | 2018-06-28 | Speech emotion recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036465B (en) |
- 2018-06-28: Application CN201810685220.0A filed; granted as patent CN109036465B (status: Active)
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106340309A (en) * | 2016-08-23 | 2017-01-18 | 南京大空翼信息技术有限公司 | Dog bark emotion recognition method and device based on deep learning |
US20180061439A1 (en) * | 2016-08-31 | 2018-03-01 | Gregory Frederick Diamos | Automatic audio captioning |
CN107895571A (en) * | 2016-09-29 | 2018-04-10 | 亿览在线网络技术(北京)有限公司 | Lossless audio file identification method and device |
CN108010533A (en) * | 2016-10-27 | 2018-05-08 | 北京酷我科技有限公司 | The automatic identifying method and device of voice data code check |
CN108205535A (en) * | 2016-12-16 | 2018-06-26 | 北京酷我科技有限公司 | The method and its system of Emotion tagging |
CN106847309A (en) * | 2017-01-09 | 2017-06-13 | 华南理工大学 | A kind of speech-emotion recognition method |
CN107067011A (en) * | 2017-03-20 | 2017-08-18 | 北京邮电大学 | A kind of vehicle color identification method and device based on deep learning |
CN107705806A (en) * | 2017-08-22 | 2018-02-16 | 北京联合大学 | A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks |
CN108009148A (en) * | 2017-11-16 | 2018-05-08 | 天津大学 | Text emotion classification method for expressing based on deep learning |
Non-Patent Citations (3)
- Abdul Malik Badshah et al., "Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network", International Conference on Platform
- Linhui Sun et al., "Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition", International Journal of Speech Technology
- Tao Kong et al., "HyperNet: Towards accurate region proposal generation and joint object detection", Proceedings of the IEEE Conference on Computer
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11936660B2 (en) | 2018-06-29 | 2024-03-19 | Forescout Technologies, Inc. | Self-training classification |
US11343149B2 (en) * | 2018-06-29 | 2022-05-24 | Forescout Technologies, Inc. | Self-training classification |
CN109637522A (en) * | 2018-12-26 | 2019-04-16 | 杭州电子科技大学 | A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph |
CN109637522B (en) * | 2018-12-26 | 2022-12-09 | 杭州电子科技大学 | Speech emotion recognition method for extracting depth space attention features based on spectrogram |
CN109935243A (en) * | 2019-02-25 | 2019-06-25 | 重庆大学 | Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model |
CN109767790A (en) * | 2019-02-28 | 2019-05-17 | 中国传媒大学 | A kind of speech-emotion recognition method and system |
CN110459225A (en) * | 2019-08-14 | 2019-11-15 | 南京邮电大学 | A kind of speaker identification system based on CNN fusion feature |
CN110534133A (en) * | 2019-08-28 | 2019-12-03 | 珠海亿智电子科技有限公司 | A kind of speech emotion recognition system and speech-emotion recognition method |
CN110534133B (en) * | 2019-08-28 | 2022-03-25 | 珠海亿智电子科技有限公司 | Voice emotion recognition system and voice emotion recognition method |
WO2021051577A1 (en) * | 2019-09-17 | 2021-03-25 | 平安科技(深圳)有限公司 | Speech emotion recognition method, apparatus, device, and storage medium |
CN110619889A (en) * | 2019-09-19 | 2019-12-27 | Oppo广东移动通信有限公司 | Sign data identification method and device, electronic equipment and storage medium |
CN110634491B (en) * | 2019-10-23 | 2022-02-01 | 大连东软信息学院 | Series connection feature extraction system and method for general voice task in voice signal |
CN110634491A (en) * | 2019-10-23 | 2019-12-31 | 大连东软信息学院 | Series connection feature extraction system and method for general voice task in voice signal |
CN111449644A (en) * | 2020-03-19 | 2020-07-28 | 复旦大学 | Bioelectricity signal classification method based on time-frequency transformation and data enhancement technology |
CN111583964A (en) * | 2020-04-14 | 2020-08-25 | 台州学院 | Natural speech emotion recognition method based on multi-mode deep feature learning |
CN111583964B (en) * | 2020-04-14 | 2023-07-21 | 台州学院 | Natural voice emotion recognition method based on multimode deep feature learning |
CN111883178A (en) * | 2020-07-17 | 2020-11-03 | 渤海大学 | Double-channel voice-to-image-based emotion recognition method |
CN112151071A (en) * | 2020-09-23 | 2020-12-29 | 哈尔滨工程大学 | Speech emotion recognition method based on mixed wavelet packet feature deep learning |
CN112151071B (en) * | 2020-09-23 | 2022-10-28 | 哈尔滨工程大学 | Speech emotion recognition method based on mixed wavelet packet feature deep learning |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN113643724A (en) * | 2021-07-06 | 2021-11-12 | 中国科学院声学研究所南海研究站 | Kiwi emotion recognition method and system based on time-frequency double-branch characteristics |
Also Published As
Publication number | Publication date |
---|---|
CN109036465B (en) | 2021-05-11 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |