CN109036465A - Speech-emotion recognition method - Google Patents

Speech-emotion recognition method

Info

Publication number
CN109036465A
CN109036465A (application CN201810685220.0A, granted as CN109036465B)
Authority
CN
China
Prior art keywords
layer
speech
emotion recognition
neural networks
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810685220.0A
Other languages
Chinese (zh)
Other versions
CN109036465B (en)
Inventor
孙林慧
陈嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810685220.0A
Publication of CN109036465A
Application granted
Publication of CN109036465B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a speech emotion recognition method comprising the following steps: S1, converting the experimental speech data into spectrograms; S2, applying data augmentation to the obtained spectrograms; S3, constructing a convolutional neural network that fuses deep and shallow features on the basis of a conventional convolutional neural network; S4, carrying out speech emotion recognition experiments with the conventional convolutional neural network and the network with deep-shallow feature fusion respectively, and comparing the speech emotion recognition rates of the two. The invention can fully extract spectrogram features and thereby improve the speech emotion recognition rate. Compared with a conventional convolutional neural network, the proposed network with deep-shallow feature fusion reduces the dimensionality of the shallow features and fully fuses them with the deep features, obtaining features that better represent the various emotion classes. The invention not only effectively improves the speech emotion recognition rate and ensures recognition accuracy, but also has better generalization ability.

Description

Speech-emotion recognition method
Technical field
The present invention relates to a speech emotion recognition method, and in particular to a speech emotion recognition method based on deep-shallow feature fusion in convolutional neural networks, belonging to the technical field of speech emotion recognition.
Background technique
As a complex human psychological behavior, emotion has always been a research hotspot in many fields such as psychology and artificial intelligence. Speech is the most natural way for people to communicate with each other; a speech signal carries not only the content to be conveyed but also rich emotional factors, and speech signals are now widely used in emotion research.
Speech emotion recognition studies the formation and variation of the speaker's emotional state from the perspective of the speech signal, so that interaction between computers and humans becomes more intelligent. In current research, the acoustic features used for emotion recognition mainly include spectral features, prosodic features, voice quality features, and fusions of these features. Moreover, such studies often focus only on the time domain or the frequency domain, whereas the correlation between the frequency-domain and time-domain information of the speech signal also plays an important role in speech emotion recognition. The spectrogram is a visual representation of the speech signal: its horizontal axis represents time and its vertical axis represents frequency, linking the two domains. By modeling the frequency bins of the spectrogram as image pixels, the relations between adjacent frequencies can be studied with image features, so the result reflects both the time-frequency characteristics of the speech and the linguistic characteristics of the speaker. At present many researchers combine image processing with speech processing through spectrograms and have achieved good results.
Speech emotion recognition methods generally fall into two classes: traditional machine learning methods and deep learning methods. In either case, feature extraction is a key step in the speech emotion recognition process. The key of traditional machine learning methods is feature selection, which is directly related to the accuracy of speech emotion recognition. So far, a large number of spectral, prosodic, and voice quality features have been used for speech emotion recognition, but these features may not suffice to express subjective emotion. Compared with traditional machine learning methods, deep learning methods can extract high-level features and have achieved notable results in vision-related tasks.
In recent years, deep convolutional neural networks (DCNNs) have made great progress in speech emotion recognition research. However, in a conventional convolutional neural network, as the convolutional layers go deeper, the dimensionality of the feature maps becomes smaller and the features become more abstract; the semantic features become more and more prominent, while the global information of the spectrogram becomes increasingly blurred. Shallow features provide global information but weak semantic features; deep features provide sufficient semantic features but lack global information. As a result, the finally extracted emotional features cannot accurately distinguish the emotion classes.
In conclusion how a kind of speech-emotion recognition method based on convolutional neural networks depth layer Fusion Features, will Further feature is together with shallow-layer Fusion Features, so that the bigger affective characteristics of distinction are obtained, to solve traditional convolution mind Shortcoming through network in speech emotion recognition problem just becomes those skilled in that art's urgent problem to be solved.
Summary of the invention
In view of the above drawbacks of the prior art, the purpose of the present invention is to propose a speech emotion recognition method based on deep-shallow feature fusion in convolutional neural networks.
Specifically, the speech emotion recognition method comprises the following steps:
S1, converting the experimental speech data into spectrograms;
S2, applying data augmentation to the obtained spectrograms;
S3, constructing a convolutional neural network fusing deep and shallow features on the basis of a conventional convolutional neural network;
S4, carrying out speech emotion recognition experiments with the conventional convolutional neural network and the network fusing deep and shallow features respectively, and comparing the speech emotion recognition rates of the two.
Preferably, the speech data in S1 comes from the German Berlin emotional speech database; the sampling frequency of the speech data is 16 kHz with 16-bit quantization; the speech data covers seven emotion classes: anger, boredom, disgust, fear, happiness, neutral, and sadness.
Preferably, converting the experimental speech data into spectrograms in S1 comprises the following steps:
S11, framing each segment of speech data to obtain x(m, n), where n is the frame length and m is the number of frames;
S12, applying an FFT to x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)';
S13, taking 10·log10(Y(m, n)), and mapping m to the time scale M and n to the frequency scale N;
S14, plotting (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
Preferably, applying data augmentation to the obtained spectrograms in S2 comprises: performing data augmentation on the spectrograms with the Keras deep learning framework, the augmentation operations including random rotation, horizontal translation, vertical translation, shear transformation, image scaling, and horizontal flipping.
Preferably, the convolutional neural network fusing deep and shallow features comprises an input layer, intermediate layers, and an output layer, the intermediate layers comprising convolutional layers, pooling layers, and fully connected layers.
Preferably, the mapping relation of a convolutional layer is
$$x_j^l = \sum_i x_i^{l-1} * k_{ij}^l + b_j^l$$
where $x_j^l$ is the j-th feature set of the l-th convolutional layer; $x_i^{l-1}$ denotes the i-th feature set of the (l-1)-th convolutional layer; $k_{ij}^l$ denotes the convolution kernel between the two feature sets; * denotes the two-dimensional convolution operation; and $b_j^l$ denotes the additive bias.
Preferably, the mapping relation of a pooling layer is
$$x_j^l = f_p\!\left(\beta_j^l\,\mathrm{down}(x_j^{l-1}) + b_j^l\right)$$
where $f_p(\cdot)$ is the activation function of the pooling layer; down(·) denotes the pooling operation from layer l-1 to layer l, including the two methods of mean pooling and max pooling; and $\beta_j^l$ and $b_j^l$ denote the multiplicative bias and the additive bias, respectively.
Preferably, the matrix features of the last pooling layer are arranged into a vector to form a rasterization layer, which is connected to the fully connected layers; the output of any node j in the rasterization layer is
$$y_j = f_h\!\left(\sum_i w_{i,j}\,x_i - \theta_j\right)$$
where $f_h(\cdot)$ denotes the activation function; $w_{i,j}$ denotes the weight between input $x_i$ and node j; and $\theta_j$ is the node threshold.
Preferably, the fully connected layers use a Softmax model to solve the multi-class classification problem; the loss function of Softmax is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{K} 1\{y^{(i)}=j\}\,\log a_j^l, \qquad a_j^l = \frac{e^{z_j^l}}{\sum_{k} e^{z_k^l}}$$
where $z_j^l$ denotes the input of the j-th neuron of layer l (usually the last layer); $\sum_k e^{z_k^l}$ denotes the sum over the inputs of all neurons of layer l; $a_j^l$ denotes the output of the j-th neuron of layer l; e denotes the natural constant; and 1{·} is the indicator function, which equals 1 when the expression in braces is true and 0 when it is false.
Preferably, in the fully connected layers a weight decay term is introduced to penalize excessively large parameters during training; the expression is
$$\tilde{J}(\theta) = J(\theta) + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$$
where $\frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$ is the weight decay term.
Compared with the prior art, the advantages of the present invention are mainly reflected in the following aspects:
The present invention can fully extract spectrogram features and thereby improve the speech emotion recognition rate. Compared with a conventional convolutional neural network, the network with deep-shallow feature fusion proposed in the present invention reduces the dimensionality of the shallow features and fully fuses them with the deep features, obtaining features that better represent the various emotion classes. The invention not only effectively improves the speech emotion recognition rate and ensures recognition accuracy, but also has better generalization ability. The experimental results show that with the proposed network with deep-shallow feature fusion, the recognition rates of four of the seven emotion classes in the Berlin database (boredom, disgust, happiness, and neutral) are improved; in particular, the recognition rates of happiness and neutral are greatly improved, and the overall recognition rate is improved by 1.58%.
Meanwhile present invention incorporates the methods of shift learning to utilize tradition in the case where convolutional neural networks complicate Convolutional neural networks training pattern parameter as initiation parameter, to accelerate the convergence rate in training process, mention Whole recognition speed and recognition efficiency are risen.
In addition, the present invention provides a reference for other related problems in the same field; it can be extended on this basis and applied to other speech recognition or emotion recognition solutions, so it has broad application prospects.
In conclusion the invention proposes a kind of speech emotion recognitions based on convolutional neural networks depth layer Fusion Features Method.With very high use and promotional value.
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings, so that the technical solution of the invention is easier to understand and master.
Detailed description of the invention
Fig. 1 shows spectrogram samples of some emotional utterances from the Berlin corpus used in the present invention;
Fig. 2 shows the conventional convolutional neural network;
Fig. 3 shows the improved convolutional neural network of the present invention;
Fig. 4 shows the training process of the conventional convolutional neural network;
Fig. 5 shows the training process of the improved convolutional neural network of the present invention.
Specific embodiment
As shown in the figures, the present invention discloses a speech emotion recognition method comprising the following steps:
S1, converting the experimental speech data into spectrograms.
The present invention uses the German Berlin emotional speech database, with a sampling frequency of 16 kHz and 16-bit quantization. The database covers seven emotion classes: anger, boredom, disgust, fear, happiness, neutral, and sadness.
Specifically, converting the experimental speech data into spectrograms in S1 comprises the following steps:
S11, each segment of speech is first framed, yielding x(100, 512), where m = 100 is the number of frames and n = 512 is the frame length;
S12, an FFT is then applied to obtain X(100, 512), and the periodogram Y(100, 512) = X(100, 512) · X(100, 512)' is computed;
S13, 10·log10(Y(100, 512)) is then taken, the frame index m is mapped to the time scale M, and the frame length index n is mapped to the frequency scale N;
S14, finally (M, N, 10·log10(Y(m, n))) is drawn as a two-dimensional image, which is the spectrogram. Some sample spectrograms are shown in Fig. 1.
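For concreteness, a minimal Python sketch of steps S11 to S14 is given below. Only the framing, the FFT, the periodogram, and the 10·log10 scaling are taken from the text; the frame overlap, the window, and the plotting details are illustrative assumptions.

```python
# Minimal sketch of steps S11-S14, assuming a 16 kHz WAV input; the hop size,
# the Hamming window, and the output image settings are illustrative choices.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

def wav_to_spectrogram(path, frame_len=512, hop=256, out_png="spectrogram.png"):
    fs, signal = wavfile.read(path)                     # load 16 kHz speech
    signal = signal.astype(np.float64)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])       # S11: framing -> x(m, n)
    window = np.hamming(frame_len)
    spectrum = np.fft.rfft(frames * window, axis=1)     # S12: FFT -> X(m, n)
    periodogram = (spectrum * spectrum.conj()).real     # S12: Y(m, n) = X * X'
    log_power = 10.0 * np.log10(periodogram + 1e-10)    # S13: 10*log10(Y(m, n))
    t = np.arange(n_frames) * hop / fs                  # time scale M
    f = np.fft.rfftfreq(frame_len, d=1.0 / fs)          # frequency scale N
    plt.pcolormesh(t, f, log_power.T, shading="auto")   # S14: draw 2-D image
    plt.axis("off")
    plt.savefig(out_png, bbox_inches="tight", pad_inches=0)
    plt.close()

# wav_to_spectrogram("some_berlin_utterance.wav")  # hypothetical file name
```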
S2, applying data augmentation to the obtained spectrograms. To satisfy the large amount of data that deep neural networks require, the present invention uses the Keras deep learning framework to augment the spectrograms; the main operations are random rotation, horizontal translation, vertical translation, shear transformation, image scaling, horizontal flipping, and vertical flipping, which finally yields the large amount of data required for the experiments.
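A possible Keras setup for this augmentation is sketched below; the numeric ranges of each transformation and the directory layout of the spectrogram images are assumptions, since the text only names the operations.

```python
# Augmentation along the lines described above; ranges and paths are assumed.
from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,          # random rotation
    width_shift_range=0.1,      # horizontal translation
    height_shift_range=0.1,     # vertical translation
    shear_range=0.1,            # shear transformation
    zoom_range=0.1,             # image scaling
    horizontal_flip=True,       # horizontal flip
    vertical_flip=True)         # vertical (upside-down) flip

# Generate augmented spectrograms from a directory of per-emotion subfolders
# (hypothetical path and image size).
flow = augmenter.flow_from_directory(
    "spectrograms/train", target_size=(227, 227),
    batch_size=32, class_mode="categorical")
```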
S3, constructing the convolutional neural network fusing deep and shallow features on the basis of a conventional convolutional neural network.
A convolutional neural network is a feedforward neural network that generally comprises an input layer, intermediate layers, and an output layer. The intermediate layers consist of one or more groups of "convolution + pooling" feature extraction layers together with fully connected layers; each layer is composed of several two-dimensional planes, and each plane contains a number of neuron nodes. As a feature extraction layer, the convolutional layer is the most important part of the whole convolutional neural network; it extracts features such as voiceprint and energy from the emotion spectrograms for the subsequent classification. The mapping relation before and after convolution is
$$x_j^l = \sum_i x_i^{l-1} * k_{ij}^l + b_j^l$$
where $x_j^l$ is the j-th feature set of the l-th convolutional layer, $x_i^{l-1}$ denotes the i-th feature set of the (l-1)-th convolutional layer, $k_{ij}^l$ denotes the convolution kernel between the two feature sets, * denotes the two-dimensional convolution operation, and $b_j^l$ denotes the additive bias.
A pooling layer is usually connected after each convolutional layer to reduce the dimensionality of the features obtained by convolution and to prevent overfitting during training. The pooling process is
$$x_j^l = f_p\!\left(\beta_j^l\,\mathrm{down}(x_j^{l-1}) + b_j^l\right)$$
where $f_p(\cdot)$ is the activation function of the pooling layer, down(·) denotes the pooling operation from layer l-1 to layer l, generally divided into the two methods of mean pooling and max pooling, and $\beta_j^l$ and $b_j^l$ denote the multiplicative bias and the additive bias, respectively.
The matrix features of the last pooling layer are arranged into a vector to form a rasterization layer, which is connected to the fully connected layers. The output of any node j in this layer is
$$y_j = f_h\!\left(\sum_i w_{i,j}\,x_i - \theta_j\right)$$
where $f_h(\cdot)$ denotes the activation function, $w_{i,j}$ denotes the weight between input $x_i$ and node j, and $\theta_j$ is the node threshold.
The fully connected layers generally use a Softmax model to solve the multi-class classification problem. The loss function of Softmax is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{K} 1\{y^{(i)}=j\}\,\log a_j^l, \qquad a_j^l = \frac{e^{z_j^l}}{\sum_{k} e^{z_k^l}}$$
where $z_j^l$ denotes the input of the j-th neuron of layer l (usually the last layer), $\sum_k e^{z_k^l}$ denotes the sum over the inputs of all neurons of layer l, $a_j^l$ denotes the output of the j-th neuron of layer l, e denotes the natural constant, and 1{·} is the indicator function, which equals 1 when the expression in braces is true and 0 when it is false.
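As a small numerical illustration of the Softmax output and loss above, assuming a seven-class problem with made-up logits:

```python
# Softmax output a_j and the cross-entropy term for one sample; the logits
# and the true class index are made-up values for illustration only.
import numpy as np

z = np.array([1.2, 0.3, -0.5, 0.0, 2.1, -1.0, 0.4])   # inputs z_j of the last layer
a = np.exp(z) / np.exp(z).sum()                        # softmax outputs a_j
y = 4                                                  # true class index
loss = -np.log(a[y])                                   # 1{y = j} selects a_y
```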
To prevent J(θ) from being poorly optimized, a weight decay term is introduced to penalize excessively large parameters during training. The expression is
$$\tilde{J}(\theta) = J(\theta) + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$$
where $\frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$ is the weight decay term.
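In a Keras implementation this penalty would typically be attached to the fully connected layers as an L2 regularizer; the decay coefficient below is an illustrative assumption, not a value taken from the patent.

```python
# One way to realize the weight-decay (L2) term of the fully connected layers
# in Keras; the coefficient 5e-4 is an assumed example value.
from keras.layers import Dense
from keras.regularizers import l2

fc = Dense(1024, activation="relu", kernel_regularizer=l2(5e-4))
```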
Usually, the more layers a convolutional neural network has, the more discriminative the extracted features are, but too many layers lead to excessively long training times and convergence difficulties. We therefore construct a network with five convolutional layers that can extract discriminative features while keeping the training time low; the network is shown in Fig. 2. It mainly consists of five convolutional layers, three pooling layers, and three fully connected layers. Convolutional layer 1 has a kernel size of 11x11, a stride of 4, and 96 neurons; pooling layer 1 is a max pooling layer with a kernel size of 3x3 and a stride of 2. Convolutional layer 2 has a kernel size of 5x5, a stride of 1, and 256 neurons; pooling layer 2 is also a max pooling layer with a kernel size of 3x3 and a stride of 2. Convolutional layers 3 and 4 both have a kernel size of 3x3, a stride of 1, and 384 neurons; convolutional layer 5 has a kernel size of 3x3, a stride of 1, and 256 neurons; pooling layer 3 is likewise a max pooling layer with a kernel size of 3x3 and a stride of 2. Finally, three fully connected layers are connected: the first two have 1024 neurons each, and the last has 7 neurons.
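A sketch of this five-convolutional-layer baseline in Keras follows. The input size, activation functions, and padding are not specified in the text and are assumptions of the sketch.

```python
# Baseline network of Fig. 2 (sketch); input 227x227x3, ReLU and "same"
# padding on the 3x3/5x5 convolutions are assumptions.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

baseline = Sequential([
    Conv2D(96, (11, 11), strides=4, activation="relu",
           input_shape=(227, 227, 3)),                       # conv1
    MaxPooling2D((3, 3), strides=2),                          # pool1
    Conv2D(256, (5, 5), padding="same", activation="relu"),   # conv2
    MaxPooling2D((3, 3), strides=2),                          # pool2
    Conv2D(384, (3, 3), padding="same", activation="relu"),   # conv3
    Conv2D(384, (3, 3), padding="same", activation="relu"),   # conv4
    Conv2D(256, (3, 3), padding="same", activation="relu"),   # conv5
    MaxPooling2D((3, 3), strides=2),                          # pool3
    Flatten(),
    Dense(1024, activation="relu"),                           # fc1
    Dense(1024, activation="relu"),                           # fc2
    Dense(7, activation="softmax"),                           # fc3: seven emotions
])
```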
Fig. 2 shows that a conventional convolutional neural network ignores the influence of the shallow features on classification correctness. In the present invention we therefore construct a novel convolutional neural network, shown in Fig. 3, which mainly consists of six convolutional layers, four pooling layers, and three fully connected layers. Compared with the conventional network of Fig. 2, convolutional layer 6 and pooling layer 4 are added: convolutional layer 6 has a kernel size of 3x3, a stride of 1, and 256 neurons, and pooling layer 4 is likewise a max pooling layer with a kernel size of 3x3 and a stride of 2. A fusion layer then fuses the features obtained after three convolutional layers with the features obtained after five convolutional layers. Finally, three fully connected layers are connected: the first two have 1024 neurons each, and the last has 7 neurons.
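The fusion network of Fig. 3 can be sketched with the Keras functional API as below. The text does not spell out exactly where the shallow branch taps off or how the fusion layer combines the two feature tensors, so applying convolutional layer 6 and pooling layer 4 to the output of convolutional layer 3 and fusing by concatenation are assumptions of this sketch.

```python
# Deep-shallow fusion network of Fig. 3 (sketch); branch point and fusion by
# concatenation are assumptions, as are input size, activations and padding.
from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                          Dense, Concatenate)

inputs = Input(shape=(227, 227, 3))
x = Conv2D(96, (11, 11), strides=4, activation="relu")(inputs)        # conv1
x = MaxPooling2D((3, 3), strides=2)(x)                                 # pool1
x = Conv2D(256, (5, 5), padding="same", activation="relu")(x)          # conv2
x = MaxPooling2D((3, 3), strides=2)(x)                                 # pool2
shallow = Conv2D(384, (3, 3), padding="same", activation="relu")(x)    # conv3

# Deep branch: two more convolutions and pooling.
deep = Conv2D(384, (3, 3), padding="same", activation="relu")(shallow) # conv4
deep = Conv2D(256, (3, 3), padding="same", activation="relu")(deep)    # conv5
deep = MaxPooling2D((3, 3), strides=2)(deep)                            # pool3

# Shallow branch: conv6 + pool4 reduce the shallow feature's dimensionality.
shallow = Conv2D(256, (3, 3), padding="same", activation="relu")(shallow)  # conv6
shallow = MaxPooling2D((3, 3), strides=2)(shallow)                          # pool4

fused = Concatenate()([Flatten()(deep), Flatten()(shallow)])           # fusion layer
fc = Dense(1024, activation="relu")(fused)                              # fc1
fc = Dense(1024, activation="relu")(fc)                                 # fc2
outputs = Dense(7, activation="softmax")(fc)                            # fc3

fusion_model = Model(inputs, outputs)
```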
S4, carrying out speech emotion recognition experiments with the conventional convolutional neural network and the network fusing deep and shallow features respectively, and comparing the speech emotion recognition rates of the two.
In the experiments, 70% of the spectrograms are used as the training set, 15% as the validation set, and the remainder as the test set. The training set is used to build an effective classifier by adjusting the weights of the convolutional neural network; the validation set is used to assess the performance of the model built during training, providing a test bed for fine-tuning the model parameters and selecting the best-performing model; the test set is used only to evaluate the finally trained model and confirm its actual classification ability.
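One way to obtain such a 70% / 15% / 15% split, assuming the spectrogram images and their emotion labels have already been loaded into the arrays `images` and `labels` (hypothetical names), is:

```python
# Stratified 70/15/15 split; `images` and `labels` are assumed to be loaded.
from sklearn.model_selection import train_test_split

X_train, X_rest, y_train, y_rest = train_test_split(
    images, labels, train_size=0.70, stratify=labels, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)
```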
The conventional convolutional neural network is trained and tested first. The relationship between the loss and the number of iterations during training is shown in Fig. 4. The initial learning rate of the network is set to 0.0001 and is decayed to 0.1 times the current learning rate every 160 iterations (steps). The training loss starts to converge at about 500 iterations; when the validation loss fully converges to 0.89, we save the model. After 2500 iterations the validation accuracy reaches 63.33%, and the whole training process lasts about 50 minutes.
Using the method for transfer learning, use the trained optimal model of traditional convolutional neural networks as pre-training mould Type continues to train on the basis of the model using the network proposed in the present invention, in this way can using the parameter of the model as Convergence rate, less training time are accelerated in the initialization of current network, rather than random initializtion.Loss in training process Relationship with the number of iterations is as shown in figure 5, since the parameter of initialization network is generated using the model of pre-training, initially Loss will be since 1.07, and pass through the parameter for inheriting pre-training model, so that the accuracy of validation data set reaches 54.26%, the initial learning rate of network is set as 0.0001, decays to the 0.1 of current learning rate after every 160 iteration (step-length) Times, training loss starts to restrain at iteration nearly 400 times, when validation data set loss Complete Convergence is in 0.88 when, Wo Menbao Model is deposited, after 2500 iteration, the accuracy of validation data set has reached 64.78%, and entire training process continues greatly About 45 minutes.
The two models are then tested on the test set; the specific experimental results are shown in Tables 1 and 2.
Table 1. Confusion matrix (%) of the conventional convolutional neural network on the seven emotion classes of the Berlin database

Emotion      Anger   Boredom  Disgust  Fear    Happiness  Neutral  Sadness
Anger        76.67   2.77     2.22     1.67    16.11      0.56     0
Boredom      0       90.00    0        0       1.67       5.56     2.77
Disgust      16.11   10.00    67.78    1.11    1.11       3.89     0
Fear         19.44   15.00    3.89     31.67   20.56      2.22     7.22
Happiness    55.00   0        2.22     2.22    40.56      0        0
Neutral      0       58.33    0        0       0          38.34    3.33
Sadness      0       6.11     0        0       0          0        93.89
Table 2. Confusion matrix (%) of the deep-shallow feature fusion convolutional neural network on the seven emotion classes of the Berlin database

Emotion      Anger   Boredom  Disgust  Fear    Happiness  Neutral  Sadness
Anger        72.78   3.89     2.22     1.67    19.44      0        0
Boredom      0       96.11    0        0       1.11       1.11     1.67
Disgust      13.89   10.56    68.88    0       1.11       5.56     0
Fear         15.56   20.00    4.45     30.00   22.22      0.56     7.21
Happiness    50.56   0        2.22     1.11    46.11      0        0
Neutral      0       46.11    0        0       0          46.67    7.22
Sadness      0       10.56    0        0       0          0        89.44
From Tables 1 and 2 we can see that, compared with the conventional convolutional neural network, the convolutional neural network of the present invention improves the recognition rates of four of the seven emotion classes of the Berlin database (boredom, disgust, happiness, and neutral); the recognition rates of happiness and neutral in particular are greatly improved, and the overall recognition rate is improved by 1.58%. From Figs. 4 and 5 and Tables 1 and 2 we also compare the gap between the validation and test recognition rates of the two networks: the conventional convolutional neural network reaches 63.33% on the validation set and 62.70% on the test set, a gap of 0.63%, while the deep-shallow feature fusion network of the present invention reaches 64.78% on the validation set and 64.28% on the test set, a gap of 0.5%. Compared with the model trained by the conventional convolutional neural network, the model trained by the proposed network therefore has stronger generalization ability.
The above results show that, compared with the conventional convolutional neural network, the proposed convolutional neural network with deep-shallow feature fusion improves the speech emotion recognition rate; combined with transfer learning, it accelerates convergence and reduces the training time; moreover, the model trained by the proposed network with deep-shallow feature fusion has stronger generalization ability.
In conclusion the present invention can sufficiently extract sound spectrograph feature, to improve speech emotion recognition rate.Relative to biography The convolutional neural networks of system, the present invention proposed in depth layer Fusion Features convolutional neural networks can by by shallow-layer spy Sign carries out dimensionality reduction, is fully merged with further feature, to obtain the feature more representative of all kinds of emotions.The present invention is not only The accuracy that speech emotion recognition rate can be effectively improved, ensure to identify, and there is more excellent generalization ability.Together When, present invention incorporates the methods of shift learning, in the case where convolutional neural networks complicate, utilize traditional convolutional Neural The parameter of network training model is as initiation parameter, to accelerate the convergence rate in training process, improves whole Recognition speed and recognition efficiency.It, can be as in addition, the present invention also provides reference for other relevant issues in same domain According to expansion extension is carried out, apply in field in the technical solution of other speech recognitions or emotion recognition algorithm, has very Wide application prospect.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claims involved.
Furthermore, it should be understood that, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions of the various embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (10)

1. A speech emotion recognition method, characterized by comprising the following steps:
S1, converting the experimental speech data into spectrograms;
S2, applying data augmentation to the obtained spectrograms;
S3, constructing a convolutional neural network fusing deep and shallow features on the basis of a conventional convolutional neural network;
S4, carrying out speech emotion recognition experiments with the conventional convolutional neural network and the network fusing deep and shallow features respectively, and comparing the speech emotion recognition rates of the two.
2. The speech emotion recognition method according to claim 1, characterized in that the speech data in S1 comes from the German Berlin emotional speech database; the sampling frequency of the speech data is 16 kHz with 16-bit quantization; the speech data covers seven emotion classes: anger, boredom, disgust, fear, happiness, neutral, and sadness.
3. The speech emotion recognition method according to claim 1, characterized in that converting the experimental speech data into spectrograms in S1 comprises the following steps:
S11, framing each segment of speech data to obtain x(m, n), where n is the frame length and m is the number of frames;
S12, applying an FFT to x(m, n) to obtain X(m, n), and computing the periodogram Y(m, n) = X(m, n) · X(m, n)';
S13, taking 10·log10(Y(m, n)), and mapping m to the time scale M and n to the frequency scale N;
S14, plotting (M, N, 10·log10(Y(m, n))) as a two-dimensional image to obtain the spectrogram.
4. The speech emotion recognition method according to claim 1, characterized in that applying data augmentation to the obtained spectrograms in S2 comprises: performing data augmentation on the spectrograms with the Keras deep learning framework, the augmentation operations including random rotation, horizontal translation, vertical translation, shear transformation, image scaling, and horizontal flipping.
5. The speech emotion recognition method according to claim 1, characterized in that the convolutional neural network fusing deep and shallow features comprises an input layer, intermediate layers, and an output layer, the intermediate layers comprising convolutional layers, pooling layers, and fully connected layers.
6. The speech emotion recognition method according to claim 5, characterized in that the mapping relation of a convolutional layer is
$$x_j^l = \sum_i x_i^{l-1} * k_{ij}^l + b_j^l$$
where $x_j^l$ is the j-th feature set of the l-th convolutional layer; $x_i^{l-1}$ denotes the i-th feature set of the (l-1)-th convolutional layer; $k_{ij}^l$ denotes the convolution kernel between the two feature sets; * denotes the two-dimensional convolution operation; and $b_j^l$ denotes the additive bias.
7. The speech emotion recognition method according to claim 5, characterized in that the mapping relation of a pooling layer is
$$x_j^l = f_p\!\left(\beta_j^l\,\mathrm{down}(x_j^{l-1}) + b_j^l\right)$$
where $f_p(\cdot)$ is the activation function of the pooling layer; down(·) denotes the pooling operation from layer l-1 to layer l, including the two methods of mean pooling and max pooling; and $\beta_j^l$ and $b_j^l$ denote the multiplicative bias and the additive bias, respectively.
8. The speech emotion recognition method according to claim 5, characterized in that the matrix features of the last pooling layer are arranged into a vector to form a rasterization layer, the rasterization layer is connected to the fully connected layers, and the output of any node j in the rasterization layer is
$$y_j = f_h\!\left(\sum_i w_{i,j}\,x_i - \theta_j\right)$$
where $f_h(\cdot)$ denotes the activation function; $w_{i,j}$ denotes the weight between input $x_i$ and node j; and $\theta_j$ is the node threshold.
9. The speech emotion recognition method according to claim 5, characterized in that the fully connected layers use a Softmax model to solve the multi-class classification problem, and the loss function of Softmax is
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{K} 1\{y^{(i)}=j\}\,\log a_j^l, \qquad a_j^l = \frac{e^{z_j^l}}{\sum_{k} e^{z_k^l}}$$
where $z_j^l$ denotes the input of the j-th neuron of layer l (usually the last layer); $\sum_k e^{z_k^l}$ denotes the sum over the inputs of all neurons of layer l; $a_j^l$ denotes the output of the j-th neuron of layer l; e denotes the natural constant; and 1{·} is the indicator function, which equals 1 when the expression in braces is true and 0 when it is false.
10. The speech emotion recognition method according to claim 9, characterized in that in the fully connected layers a weight decay term is introduced to penalize excessively large parameters during training, the expression being
$$\tilde{J}(\theta) = J(\theta) + \frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$$
where $\frac{\lambda}{2}\sum_{i,j}\theta_{ij}^{2}$ is the weight decay term.
CN201810685220.0A 2018-06-28 2018-06-28 Speech emotion recognition method Active CN109036465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810685220.0A CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810685220.0A CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Publications (2)

Publication Number Publication Date
CN109036465A true CN109036465A (en) 2018-12-18
CN109036465B CN109036465B (en) 2021-05-11

Family

ID=65520725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810685220.0A Active CN109036465B (en) 2018-06-28 2018-06-28 Speech emotion recognition method

Country Status (1)

Country Link
CN (1) CN109036465B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110459225A (en) * 2019-08-14 2019-11-15 南京邮电大学 A kind of speaker identification system based on CNN fusion feature
CN110534133A (en) * 2019-08-28 2019-12-03 珠海亿智电子科技有限公司 A kind of speech emotion recognition system and speech-emotion recognition method
CN110619889A (en) * 2019-09-19 2019-12-27 Oppo广东移动通信有限公司 Sign data identification method and device, electronic equipment and storage medium
CN110634491A (en) * 2019-10-23 2019-12-31 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN111449644A (en) * 2020-03-19 2020-07-28 复旦大学 Bioelectricity signal classification method based on time-frequency transformation and data enhancement technology
CN111583964A (en) * 2020-04-14 2020-08-25 台州学院 Natural speech emotion recognition method based on multi-mode deep feature learning
CN111883178A (en) * 2020-07-17 2020-11-03 渤海大学 Double-channel voice-to-image-based emotion recognition method
CN112151071A (en) * 2020-09-23 2020-12-29 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
WO2021051577A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Speech emotion recognition method, apparatus, device, and storage medium
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113643724A (en) * 2021-07-06 2021-11-12 中国科学院声学研究所南海研究站 Kiwi emotion recognition method and system based on time-frequency double-branch characteristics
US11343149B2 (en) * 2018-06-29 2022-05-24 Forescout Technologies, Inc. Self-training classification

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN107067011A (en) * 2017-03-20 2017-08-18 北京邮电大学 A kind of vehicle color identification method and device based on deep learning
CN107705806A (en) * 2017-08-22 2018-02-16 北京联合大学 A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN107895571A (en) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 Lossless audio file identification method and device
CN108009148A (en) * 2017-11-16 2018-05-08 天津大学 Text emotion classification method for expressing based on deep learning
CN108010533A (en) * 2016-10-27 2018-05-08 北京酷我科技有限公司 The automatic identifying method and device of voice data code check
CN108205535A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 The method and its system of Emotion tagging

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN107895571A (en) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 Lossless audio file identification method and device
CN108010533A (en) * 2016-10-27 2018-05-08 北京酷我科技有限公司 The automatic identifying method and device of voice data code check
CN108205535A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 The method and its system of Emotion tagging
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN107067011A (en) * 2017-03-20 2017-08-18 北京邮电大学 A kind of vehicle color identification method and device based on deep learning
CN107705806A (en) * 2017-08-22 2018-02-16 北京联合大学 A kind of method for carrying out speech emotion recognition using spectrogram and deep convolutional neural networks
CN108009148A (en) * 2017-11-16 2018-05-08 天津大学 Text emotion classification method for expressing based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ABDUL MALIK BADSHAH ET AL.: "Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network", 《INTERNATIONAL CONFERENCE ON PLATFORM》 *
LINHUI SUN ET AL.: "Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition", 《INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY》 *
TAO KONG ET AL.: "Hypernet: Towards accurate region proposal generation and joint object detection", 《PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER》 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11936660B2 (en) 2018-06-29 2024-03-19 Forescout Technologies, Inc. Self-training classification
US11343149B2 (en) * 2018-06-29 2022-05-24 Forescout Technologies, Inc. Self-training classification
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109637522B (en) * 2018-12-26 2022-12-09 杭州电子科技大学 Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN110459225A (en) * 2019-08-14 2019-11-15 南京邮电大学 A kind of speaker identification system based on CNN fusion feature
CN110534133A (en) * 2019-08-28 2019-12-03 珠海亿智电子科技有限公司 A kind of speech emotion recognition system and speech-emotion recognition method
CN110534133B (en) * 2019-08-28 2022-03-25 珠海亿智电子科技有限公司 Voice emotion recognition system and voice emotion recognition method
WO2021051577A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Speech emotion recognition method, apparatus, device, and storage medium
CN110619889A (en) * 2019-09-19 2019-12-27 Oppo广东移动通信有限公司 Sign data identification method and device, electronic equipment and storage medium
CN110634491B (en) * 2019-10-23 2022-02-01 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN110634491A (en) * 2019-10-23 2019-12-31 大连东软信息学院 Series connection feature extraction system and method for general voice task in voice signal
CN111449644A (en) * 2020-03-19 2020-07-28 复旦大学 Bioelectricity signal classification method based on time-frequency transformation and data enhancement technology
CN111583964A (en) * 2020-04-14 2020-08-25 台州学院 Natural speech emotion recognition method based on multi-mode deep feature learning
CN111583964B (en) * 2020-04-14 2023-07-21 台州学院 Natural voice emotion recognition method based on multimode deep feature learning
CN111883178A (en) * 2020-07-17 2020-11-03 渤海大学 Double-channel voice-to-image-based emotion recognition method
CN112151071A (en) * 2020-09-23 2020-12-29 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN112151071B (en) * 2020-09-23 2022-10-28 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113643724A (en) * 2021-07-06 2021-11-12 中国科学院声学研究所南海研究站 Kiwi emotion recognition method and system based on time-frequency double-branch characteristics

Also Published As

Publication number Publication date
CN109036465B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN109036465A (en) Speech-emotion recognition method
Zhao et al. Learning deep features to recognise speech emotion using merged deep CNN
Sun et al. Speech emotion recognition based on DNN-decision tree SVM model
Chatziagapi et al. Data Augmentation Using GANs for Speech Emotion Recognition.
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
CN103544963B (en) A kind of speech-emotion recognition method based on core semi-supervised discrimination and analysis
CN108597539A (en) Speech-emotion recognition method based on parameter migration and sound spectrograph
CN109637522B (en) Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN108717856A (en) A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN106847309A (en) A kind of speech-emotion recognition method
CN107785015A (en) A kind of audio recognition method and device
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN110047516A (en) A kind of speech-emotion recognition method based on gender perception
CN106297773A (en) A kind of neutral net acoustic training model method
CN108986798B (en) Processing method, device and the equipment of voice data
CN110148408A (en) A kind of Chinese speech recognition method based on depth residual error
CN110189766B (en) Voice style transfer method based on neural network
CN109036468A (en) Speech-emotion recognition method based on deepness belief network and the non-linear PSVM of core
CN109949821A (en) A method of far field speech dereverbcration is carried out using the U-NET structure of CNN
CN110534133A (en) A kind of speech emotion recognition system and speech-emotion recognition method
CN109767789A (en) A kind of new feature extracting method for speech emotion recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant