CN110415728A - A kind of method and apparatus identifying emotional speech - Google Patents
- Publication number
- CN110415728A (application CN201910690493.9A)
- Authority
- CN
- China
- Prior art keywords
- signal
- network
- voice signal
- spectrogram
- emotional speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Hospice & Palliative Care (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Child & Adolescent Psychology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
This application provides a method and apparatus for identifying emotional speech. The method comprises: obtaining a speech signal to be identified; pre-processing the speech signal to be identified according to pre-processing rules to obtain multiple frames of pre-processed speech signals; obtaining the corresponding spectrogram signal from each frame of pre-processed speech signal; and inputting each spectrogram signal into a second network model configured with first optimized parameters to obtain the corresponding emotional speech type. The second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features. The application extracts features from the spectrogram with a CNN network model and lets an LSTM network model perform time-series modeling on those features, enhancing the accuracy and robustness of emotion recognition in open environments, and reduces network throughput and algorithmic complexity so that the algorithm is suited to running on embedded devices.
Description
Technical field
This application relates to the field of artificial intelligence, and in particular to a training method and device for identifying emotional speech, and a method and apparatus for identifying emotional speech.
Background technique
The purpose of emotion recognition is to give computers an ability, similar to a human's, to observe, understand, and express emotion, so as to interact better with humans. Since ancient times the great majority of communication between people has been oral; speech became the main medium of human communication because, beyond the textual content to be expressed, it carries much non-textual information, such as the speaker's language, emotion, physical condition, and gender. Simple text understanding cannot capture this. Extracting latent emotional information from speech signals has broad application potential in fields such as analyzing cognitive state in teaching, analyzing patients' emotional state, danger early-warning in public spaces, and assistive perception for the blind. As a key technology of intelligent interaction and affective computing, speech emotion recognition has therefore become a focus of artificial-intelligence research in recent years.
Research on speech emotion recognition at home and abroad has made great progress. However, the conventional acoustic features (prosodic features, spectrum-based features, and voice-quality features) describe speech emotion separately from the time-domain or frequency-domain perspective and cannot simultaneously reflect the time-frequency emotional characteristics of the speech signal. Moreover, applying still-image algorithms in natural scenes makes no effective use of dynamic sequence information, resulting in poor algorithm robustness and application results that leave room for improvement.
The currently dominant approach uses the spectrogram as the speech-emotion feature, since it reflects the time-frequency emotional characteristics of the speech signal simultaneously, and then applies deep-learning algorithms for emotional feature extraction and classification. Although the recognition accuracy of such algorithms has improved, they process large volumes of data and have high algorithmic complexity, which places heavy demands on hardware computing performance. They are mostly deployed on high-performance servers and are difficult to apply in real-life natural scenes.
Summary of the invention
The application provides a training method for identifying emotional speech, a training device for identifying emotional speech, a method of identifying emotional speech, and a device for identifying emotional speech, in order to solve the problem that current emotion-recognition algorithms have a large computational load.
To solve the above technical problem, the embodiments of the present application provide the following technical solutions:
This application provides a training method for identifying emotional speech, characterized by comprising:
successively obtaining groups of sample spectrogram signals, wherein the sample spectrogram signals are spectrogram signals divided into N groups according to N emotional speech types; the sample spectrogram signals within a group have the same emotional speech type, the emotional speech types differ between groups, and N is an integer greater than 1; each sample spectrogram signal carries a label marking its emotional speech type;
training a first network model with each group of sample spectrogram signals until a preset training termination condition is reached, so as to obtain first optimized parameters of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
Optionally, before successively obtaining a group of training speech signals, the method further comprises:
obtaining a speech signal to be identified;
pre-processing the speech signal to be identified according to pre-processing rules to obtain multiple frames of pre-processed speech signals;
generating the corresponding spectrogram signal from each frame of pre-processed speech signal;
labeling the spectrogram signals according to the N emotional speech types to obtain the N groups of sample spectrogram signals.
Optionally, the pre-processing rules comprise: a pre-emphasis rule, a windowing-and-framing rule, and an endpoint-detection rule.
Pre-processing the speech signal to be identified according to the pre-processing rules to obtain multiple frames of pre-processed speech signals comprises:
performing pre-emphasis on the speech signal to be identified according to the pre-emphasis rule to obtain a first speech signal;
windowing and framing the first speech signal according to the windowing-and-framing rule to obtain multiple frames of second speech signals;
performing endpoint detection on each frame of second speech signal according to the endpoint-detection rule to obtain the multiple frames of pre-processed speech signals.
Optionally, the endpoint-detection rule is specifically a time-domain endpoint-detection rule.
Optionally, the first sub-network comprises: a first fully connected layer, a first activation function, a first dropout (over-fitting prevention) layer, a second fully connected layer, a second activation function, a second dropout layer, a third fully connected layer, and a third activation function.
Optionally, the second sub-network comprises: an input-data reshape layer, an input-timestamp reshape layer, a first long short-term memory (LSTM) recurrent network layer, a third dropout layer, a second LSTM recurrent network layer, a third LSTM recurrent network layer, an output slice layer, an LSTM-dimension reshape layer, a fourth fully connected layer, an input-label reshape layer, a loss-function layer, and an output classification accuracy;
wherein the output slice layer is used to slice out the sequence-end vector of the second sub-network's output; the sequence-end vector is used together with the label to compute the error that is fed back to correct the network weights, or to predict the emotional speech type.
Optionally, the preset training termination condition includes at least one of the following conditions:
the percentage of correctly classified samples among the total number of test samples is greater than a preset first percentage threshold;
for each group of sample spectrogram signals, the percentage of correctly classified samples among that group's test samples is greater than a preset second percentage threshold.
This application provides a training device for identifying emotional speech, comprising:
a sample obtaining unit for successively obtaining groups of sample spectrogram signals, wherein the sample spectrogram signals are spectrogram signals divided into N groups according to N emotional speech types; the sample spectrogram signals within a group have the same emotional speech type, the emotional speech types differ between groups, and N is an integer greater than 1; each sample spectrogram signal carries a label marking its emotional speech type;
a sample training unit for training a first network model with each group of sample spectrogram signals until a preset training termination condition is reached, so as to obtain first optimized parameters of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
This application provides a method of identifying emotional speech, comprising:
obtaining a speech signal to be identified;
pre-processing the speech signal to be identified according to pre-processing rules to obtain multiple frames of pre-processed speech signals;
obtaining the corresponding spectrogram signal from each frame of pre-processed speech signal;
inputting each spectrogram signal into a second network model configured with the first optimized parameters to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
This application provides a device for identifying emotional speech, comprising:
a speech-signal obtaining unit for obtaining a speech signal to be identified;
a pre-processing unit for pre-processing the speech signal to be identified according to pre-processing rules to obtain multiple frames of pre-processed speech signals;
a spectrogram obtaining unit for obtaining the corresponding spectrogram signal from each frame of pre-processed speech signal;
an emotional-speech-type obtaining unit for inputting each spectrogram signal into a second network model configured with the first optimized parameters to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
Based on the disclosure of the above embodiments, the embodiments of the present application have the following beneficial effects:
This application provides a method and apparatus for identifying emotional speech, the method comprising: obtaining a speech signal to be identified; pre-processing it according to pre-processing rules to obtain multiple frames of pre-processed speech signals; obtaining the corresponding spectrogram signal from each frame of pre-processed speech signal; and inputting each spectrogram signal into a second network model configured with the first optimized parameters to obtain the corresponding emotional speech type; wherein the second network model comprises a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
The human brain exhibits three main characteristics when recognizing emotion: temporality, randomness, and real-time operation. Starting from these three characteristics, and taking intelligent nursing robots as the application background, the application builds an embedded intelligent front-end recognition system based on speech emotion under open-environment conditions.
The application adopts an LSTM-based dynamic emotion-recognition method: the LSTM performs cyclic collection over the image sequence, learning and memorizing sequence-correlation information, and combines single-image information with sequence-correlation information to discriminate emotion, which enhances the accuracy and robustness of emotion recognition in open environments.
The application simplifies the CNN convolutional layers to reduce algorithmic complexity, and increases the number of LSTM layers to strengthen the algorithm's ability to learn image-sequence relations. A timestamp layer (cont) is added so that the LSTM can learn sequence correlations over image sequences of different lengths, and a slice layer is added to slice out the sequence-end vector of the second sub-network's output; the sequence-end vector is used together with the label to compute the error that is fed back to correct the network weights, or to predict the emotional speech type. The volume of data the network processes is greatly reduced, and algorithmic complexity is lowered so that the algorithm is suited to running on embedded devices.
The application ports the speech emotion recognition algorithm and the network model trained on the server to an embedded platform such as Huawei's Atlas 200 DK (a system-on-chip integrating CPU, NPU, and ISP), realizing an intelligent front-end speech emotion recognition system.
Detailed description of the invention
Fig. 1 is a flow chart of the training method for identifying emotional speech provided by the embodiments of the present application;
Fig. 2 is a structure chart of the first network model of the training method for identifying emotional speech provided by the embodiments of the present application;
Fig. 3 is a structure chart of the first sub-network provided by the embodiments of the present application;
Fig. 4 is a unit block diagram of the training device for identifying emotional speech provided by the embodiments of the present application;
Fig. 5 is a flow chart of the method of identifying emotional speech provided by the embodiments of the present application;
Fig. 6 is a structure chart of the second network model of the method of identifying emotional speech provided by the embodiments of the present application;
Fig. 7 is a unit block diagram of the device for identifying emotional speech provided by the embodiments of the present application.
Specific embodiment
In the following, the application is described in detail with reference to the accompanying drawings in conjunction with specific embodiments, which are not to be taken as limiting the application.
It should be understood that various modifications can be made to the disclosed embodiments; the above description should therefore not be regarded as limiting, but merely as examples of embodiments. Those skilled in the art will envisage other modifications within the scope and spirit of the application.
The accompanying drawings, which are incorporated in and form part of the specification, illustrate embodiments of the application and, together with the general description given above and the detailed description of the embodiments given below, serve to explain the principles of the application.
These and other characteristics of the application will become apparent from the description of preferred forms of embodiment, given as non-limiting examples, with reference to the accompanying drawings.
It is also to be understood that, although the application has been described with reference to some specific examples, a person skilled in the art will certainly be able to realize many other equivalent forms of the application, having the characteristics set out in the claims and hence all falling within the field of protection defined thereby.
The above and other aspects, features, and advantages of the application will become more readily apparent in view of the following detailed description when read in conjunction with the accompanying drawings.
Specific embodiments of the application are described hereinafter with reference to the drawings; it should be understood, however, that the disclosed embodiments are merely examples of the application, which may be implemented in various ways. Well-known and/or repeated functions and structures are not described in detail, to avoid obscuring the application with unnecessary or redundant detail. Therefore, the specific structural and functional details disclosed herein are not intended to be limiting, but serve merely as a basis for the claims and as a representative basis for teaching a person skilled in the art to variously employ the application in virtually any appropriately detailed structure.
This specification may use the phrases "in one embodiment", "in another embodiment", "in yet another embodiment", or "in other embodiments", each of which may refer to one or more of the same or different embodiments in accordance with the application.
A first embodiment provided by the present application is an embodiment of the training method for identifying emotional speech.
The present embodiment is described in detail below with reference to Figs. 1, 2 and 3, wherein Fig. 1 is a flow chart of the training method for identifying emotional speech provided by the embodiments of the present application; Fig. 2 is a structure chart of the first network model of that training method; and Fig. 3 is a structure chart of the first sub-network provided by the embodiments of the present application.
The embodiments of the present application start from the three main characteristics the human brain exhibits when recognizing emotion (temporality, randomness, and real-time operation) and, taking intelligent nursing robots as the application background, build an embedded front-end recognition system based on speech emotion under open-environment conditions.
As shown in Fig. 1, in step S101 a group of sample spectrogram signals is successively obtained.
A spectrogram is a two-dimensional image describing how the spectral content of a speech signal changes over time. The horizontal axis of a spectrogram represents the time of the speech signal, and the vertical axis represents its frequency components; the strength of each frequency component at any moment is indicated by the depth of color in the image. Common methods of speech-signal analysis are frequency-domain analysis and time-domain analysis; the spectrogram combines the two and can dynamically display the magnitude of each frequency component at each moment, so the amount of information carried in a spectrogram far exceeds the sum of what a pure time-domain or a pure frequency-domain representation carries. Because of its immense value in speech analysis, the spectrogram is commonly called "visual speech".
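As a concrete illustration of how such a spectrogram is produced, the following is a minimal NumPy sketch: the signal is cut into overlapping frames, each frame is windowed and Fourier-transformed, and the log magnitudes form the time-frequency image. The frame length and shift (25 ms and 10 ms at 16 kHz) are illustrative assumptions, not parameters specified in this application.

```python
import numpy as np

def spectrogram(signal, frame_len=400, frame_shift=160):
    """Magnitude spectrogram: rows are time frames, columns are frequencies.

    frame_len/frame_shift are illustrative (25 ms / 10 ms at 16 kHz);
    the application does not specify them.
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; keep only the non-negative frequencies.
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # Log compression so that color depth tracks perceived strength.
    return np.log1p(spec)

t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone at 16 kHz
S = spectrogram(x)
print(S.shape)                     # (time frames, frequency bins)
```

For a pure 440 Hz tone, every row of `S` peaks at the bin nearest 440 Hz, which is exactly the "each frequency component at each moment" picture described above.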
The spectrogram signal is the speech-related information obtained from the spectrogram.
To train the network model to recognize emotional speech types, the embodiments of the disclosure divide the collected spectrogram signals into N groups of training samples according to N emotional speech types before training. The sample spectrogram signals are spectrogram signals divided into N groups according to N emotional speech types; the sample spectrogram signals within a group have the same emotional speech type, the emotional speech types differ between groups, and N is an integer greater than 1. Each sample spectrogram signal carries a label marking its emotional speech type.
In step S102, a first network model is trained with each group of sample spectrogram signals until a preset training termination condition is reached, so as to obtain first optimized parameters of the first network model.
The first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
The CNN+LSTM network model (i.e. the first network model) provided by the embodiments of the present application unifies a CNN network model and an LSTM network model into one framework. The purpose of this design is that the CNN is good at image processing, the LSTM is good at time-series modeling, and a deep LSTM also has the ability to map features into a separable space; combining the two exploits their complementary strengths, using the CNN to extract features from the spectrogram and letting the LSTM perform time-series modeling on those features. The embodiments of the present application simplify the AlexNet network model (a stack of several convolutional network layers), eliminating several convolutional layers to reduce the computational load and adapt to running on embedded devices. Results show that the more LSTM layers the network has, the stronger its ability to learn sequence-correlation information and the faster it converges, so the number of LSTM network layers is increased to strengthen the algorithm's sequence-learning ability and speed up convergence.
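The division of labor between the two models can be made concrete by stepping a single LSTM cell over a sequence of per-frame feature vectors, such as a CNN front end would produce. This is a generic minimal sketch in NumPy, not the patent's actual network; the feature dimension and random weights are illustrative, while the 128 hidden units and sequence length 10 echo the training setup described later in this description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM time step: input, forget, cell, and output gates
    computed jointly from the current input x and previous state h."""
    z = W @ np.concatenate([x, h]) + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update cell memory
    h = sigmoid(o) * np.tanh(c)                    # emit hidden state
    return h, c

rng = np.random.default_rng(0)
feat_dim, hidden, seq_len = 64, 128, 10            # feat_dim is illustrative
W = 0.1 * rng.standard_normal((4 * hidden, feat_dim + hidden))
b = np.zeros(4 * hidden)

# A sequence of 10 CNN feature vectors, one per spectrogram frame.
features = rng.standard_normal((seq_len, feat_dim))
h = c = np.zeros(hidden)
for x in features:
    h, c = lstm_step(x, h, c, W, b)
# h now summarizes the whole sequence: the "sequence-end vector"
# that the slice layer extracts for classification.
print(h.shape)
```

The point of the combination is visible in the shapes: the CNN collapses each spectrogram image to a fixed-length vector, and the recurrence accumulates those vectors over time into one state used for the emotion decision.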
As shown in Fig. 2, AlexNet is the first sub-network, and cont is the timestamp layer, which passes temporal information to the LSTM network model.
The first sub-network comprises: a first fully connected layer (fc1), a first activation function (relu4), a first dropout layer (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second dropout layer (drop5), a third fully connected layer (fc3), and a third activation function (relu6); for example, the simplified AlexNet network model shown in Fig. 3.
The second sub-network comprises: an input-data reshape layer (reshape-data), an input-timestamp reshape layer (reshape-cm), a first LSTM recurrent network layer (lstm1), a third dropout layer (lstm1-drop), a second LSTM recurrent network layer (lstm2), a third LSTM recurrent network layer (lstm3), an output slice layer (slice), an LSTM-dimension reshape layer (reshape-lstm), a fourth fully connected layer (fc4), an input-label reshape layer (reshape-label), a loss-function layer (loss), and an output classification accuracy (accuracy).
The output slice layer (slice) is used to slice out the sequence-end vector of the second sub-network's output; the sequence-end vector is used together with the label to compute the error that is fed back to correct the network weights, or to predict the emotional speech type. The second sub-network is, for example, the LSTM network model.
During training, a three-layer LSTM network was selected, the number of hidden neurons was set to 128, and the sequence length was set to 10. Mini-batch gradient descent was used with a batch size of 10, the number of iterations was 80,000, the LSTM gradient-clipping threshold was 5, and the Adam optimization method was used with a learning rate of 0.0005.
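The gradient-clipping threshold of 5 can be illustrated with global-norm clipping, one common variant used to stabilize LSTM training; the text does not specify which clipping scheme is used, so the norm-based form below is an assumption.

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If the gradient's L2 norm exceeds the threshold, scale it down
    to the threshold while preserving its direction; otherwise leave it."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

big = np.full(100, 1.0)    # norm 10, above the threshold of 5
small = np.full(4, 1.0)    # norm 2, left untouched
print(np.linalg.norm(clip_gradient(big)), np.linalg.norm(clip_gradient(small)))
```

Clipping like this bounds the size of each weight update without changing its direction, which keeps exploding gradients in deep recurrent stacks from derailing training.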
The preset training termination condition includes at least one of the following conditions:
Condition one: the percentage of correctly classified samples among the total number of test samples is greater than a preset first percentage threshold. For example, the preset first percentage threshold is 90%.
Condition two: for each group of sample spectrogram signals, the percentage of correctly classified samples among that group's test samples is greater than a preset second percentage threshold. For example, the preset second percentage threshold is 90%.
The embodiments of the present application use 5-fold cross-validation: all the pictures in a total of 7,000 sample spectrogram sequences (each sequence containing 10 spectrograms) are divided into 5 parts, and each time the i-th part is used as the test set and the rest as the training set (accuracy here is computed per spectrogram picture, i.e. each picture in a spectrogram sequence counts as one sample, the label of the sequence serving as the label of all its pictures), yielding a corresponding accuracy Ci. The final accuracy of the model on the emotional spectrogram dataset is then C = (C1 + C2 + C3 + C4 + C5) / 5. The recognition accuracy obtained this way is more reliable.
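The fold bookkeeping and the final average C = (C1 + ... + C5) / 5 can be sketched as follows; the classifier here is a stub standing in for the trained CNN+LSTM, with made-up per-fold accuracies.

```python
import numpy as np

def five_fold_accuracy(samples, train_and_eval):
    """Split sample indices into 5 folds; fold i is the test set,
    the other 4 folds the training set. Returns per-fold accuracies
    and their mean C."""
    folds = np.array_split(np.arange(len(samples)), 5)
    accs = []
    for i in range(5):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != i])
        accs.append(train_and_eval(train_idx, test_idx))
    return accs, float(np.mean(accs))

# Stub classifier with illustrative per-fold accuracies C1..C5.
acc_per_fold = iter([0.91, 0.93, 0.90, 0.92, 0.94])
accs, C = five_fold_accuracy(np.zeros(7000),
                             lambda tr, te: next(acc_per_fold))
print(C)
```

Because every sample is tested exactly once across the five folds, the averaged accuracy C depends far less on one lucky or unlucky train/test split than a single hold-out evaluation would.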
The embodiments of the present application simplify the CNN convolutional layers to reduce algorithmic complexity, and increase the number of LSTM layers to strengthen the algorithm's ability to learn image-sequence relations. A timestamp layer (cont) is added so that the LSTM can learn sequence correlations over image sequences of different lengths, and an output slice layer (slice) is added to slice out the sequence-end vector of the second sub-network's output; the sequence-end vector is used together with the label to compute the error that is fed back to correct the network weights, or to predict the emotional speech type. The volume of data the network processes is greatly reduced, and algorithmic complexity is lowered so that the algorithm is suited to running on embedded devices.
Before successively obtaining a group of training speech signals, the method further comprises the following steps:
Step 100-1: obtain a speech signal to be identified.
Step 100-2: pre-process the speech signal to be identified according to the pre-processing rules to obtain multiple frames of pre-processed speech signals.
Owing to the physical characteristics of the speech signal itself, the physical characteristics of the speaker's vocal organs, the recording environment, and other factors, a speaker's raw speech generally cannot be processed directly; it must first go through a pre-processing step before it can be used for subsequent processing.
The pre-processing rules comprise: a pre-emphasis rule, a windowing-and-framing rule, and an endpoint-detection rule.
Pre-processing the speech signal to be identified according to the pre-processing rules to obtain multiple frames of pre-processed speech signals comprises the following steps:
Step 100-2-1: perform pre-emphasis on the speech signal to be identified according to the pre-emphasis rule to obtain a first speech signal.
Because of glottal excitation and mouth-and-nose radiation, the average power spectrum of a speech signal decays sharply in the high-frequency band: the higher the frequency, the smaller the corresponding spectral value. Speech is excited by vocal-cord vibration, passes through the vocal tract, and then enters the ear; in this process the high-frequency part of the speech suffers a certain attenuation. Moreover, relative to the low-frequency part of the speech signal, the high-frequency information is harder to obtain, yet it better characterizes a person's emotional information.
Pre-emphasis processing filters the speech signal to be identified with a digital filter to amplify its high-frequency information.
The formula of the pre-emphasis rule is:
H(z) = 1 - μz^(-1), i.e. y(n) = x(n) - μx(n - 1);
where x(n) denotes the input speech signal to be identified; z denotes the complex frequency variable; H(z) denotes the pre-emphasis filter; μ denotes the pre-emphasis coefficient, whose range is 0.9 to 0.97 (μ = 0.9375 in the embodiments of the present application); and y(n) denotes the first speech signal.
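In the time domain this filter is the one-line difference y(n) = x(n) - μx(n - 1); a minimal NumPy sketch with μ = 0.9375 as in the embodiments:

```python
import numpy as np

def pre_emphasis(x, mu=0.9375):
    """Apply H(z) = 1 - mu * z**-1, i.e. y(n) = x(n) - mu * x(n - 1).
    Boosts the high-frequency content that carries emotional cues."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]                      # convention: treat x(-1) as 0
    y[1:] = x[1:] - mu * x[:-1]
    return y

x = np.array([1.0, 1.0, 1.0, 1.0])   # a flat (purely low-frequency) signal
print(pre_emphasis(x))
```

A constant input is attenuated to 1 - μ = 0.0625 after the first sample, while rapid sample-to-sample changes pass through almost unchanged, which is exactly the high-frequency amplification described above.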
Step 100-2-2: window and frame the first speech signal according to the windowing-and-framing rule to obtain multiple frames of second speech signals.
A speech signal is non-stationary and therefore cannot be analyzed and processed directly. Its non-stationarity, however, is caused by the resonance of the human vocal system, whose vibration is comparatively slow; within a short range of about 20 ms the signal can be regarded as a quasi-steady-state, i.e. short-time stationary, process. Therefore, before a speech signal is analyzed and processed it must first undergo short-time processing, i.e. be divided into several segments of short-time speech, each segment being called one frame of the speech signal.
To keep voice signal smooth after sub-frame processing and avoiding losing voice signal portion because of sub-frame processing as far as possible
Divide information, consecutive frame is made mutually to overlap, i.e., frame, which moves, is less than frame length, refers to that every frame voice signal or so section mutually includes one
Fraction.
To voice signal adding window framing process, it can be considered that one sliding window function of application on the voice signal, window function are sliding
Dynamic step-length is that frame moves.If oneself was carried out sub-frame processing to voice signal, framing windowing process is to add on the every frame of voice signal
One window function.
The formula of the windowing-and-framing rule:
s_ω(n) = s(n) · ω(n);
Wherein,
s(n) denotes the first speech signal;
ω(n) denotes the window function. Common window functions include the rectangular window, the Hanning window, and the Hamming window. This embodiment uses the Hamming window, whose good flatness largely avoids the truncation effect, as the window function for windowing and framing. The expression of the Hamming window is:
ω(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1; ω(n) = 0 otherwise;
Wherein, N is the frame length.
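The rule s_ω(n) = s(n)·ω(n) with a Hamming window can be sketched as follows. (Pure-Python illustration; the frame length and frame shift values used in the usage comment are placeholders, not those of the embodiment.)

```python
import math

def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_and_window(signal, frame_len, frame_shift):
    """Split the signal into overlapping frames (frame_shift < frame_len)
    and multiply each frame by the window: s_w(n) = s(n) * w(n)."""
    win = hamming(frame_len)
    return [[s * w for s, w in zip(signal[i:i + frame_len], win)]
            for i in range(0, len(signal) - frame_len + 1, frame_shift)]

# e.g. frame_and_window(samples, frame_len=400, frame_shift=160)
# gives 25 ms frames with 15 ms overlap at a 16 kHz sampling rate.
```

Because the frame shift is smaller than the frame length, consecutive windowed frames overlap, which keeps the reconstructed short-time analysis smooth.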
Step 100-2-3: perform endpoint detection on each frame of the second speech signal according to the endpoint detection rule to obtain multiple frames of preprocessed speech signal.
Endpoint detection is an important link in speech signal preprocessing. Because the recording environment is imperfect, the speech signal may contain interference such as environmental noise, long silences, and useless head and tail segments, which strongly degrade the subsequent extraction of acoustic features and the generated spectrogram, and in turn the recognition ability of the speech emotion recognition system. The purpose of endpoint detection is to find the start and end points of effective speech in the signal so that invalid silence and environmental noise can be rejected, reducing the negative effect of such interference on later stages as far as possible.
There are two common endpoint detection approaches: time-domain and frequency-domain endpoint detection.
Time-domain endpoint detection usually relies on volume. Its computation is light, but it easily misjudges unvoiced parts.
Frequency-domain endpoint detection can be further divided into two kinds: one relies on the variability of the spectrum, since the spectrum of voiced sound changes more regularly, and uses this as the decision criterion; the other relies on the entropy of the spectrum, called spectral entropy, since segments containing speech usually have small spectral entropy.
The endpoint detection rule in this embodiment is specifically a time-domain rule: it is based on volume, supplemented by the zero-crossing rate as an auxiliary detection parameter. Its computation is light and fast, and it also avoids, to a certain extent, the misjudgments caused by using volume alone.
Specifically, the rule comprises the following steps:
Step 100-2-3-1: judge whether the value of the second speech signal is greater than a preset volume threshold.
The preset volume threshold is an empirical value. Because a comprehensive check against the zero-crossing rate follows, the threshold should be set rather high than low.
Step 100-2-3-2: if so, the second speech signal is a voiced segment.
A voiced segment is the louder part of the sound a person produces.
Step 100-2-3-3: if not, it is a silent segment.
A silent segment is one whose sound level is very low: approximately no sound, environmental noise, or an unvoiced segment.
Whether a given low-volume part is unvoiced speech (the quieter part of the sound a person produces) can be judged from the short-time zero-crossing rate. Indoors, the zero-crossing rate of unvoiced speech is generally noticeably higher than that of environmental noise or silence. A zero-crossing-rate threshold is therefore preset: a segment above the threshold is considered unvoiced speech, and one below it is considered environmental noise or silence.
For convenience, the start and end points of the voiced parts detected with the volume threshold are taken as the initial sound start point and sound end point. From the sound start point, extend frame by frame toward earlier frames and judge whether the signal value exceeds the volume threshold: if so, the frame is considered voiced and becomes the new sound start point; if not, the frame is considered environmental noise, silence, or unvoiced speech, and the preset zero-crossing-rate threshold decides whether it is unvoiced speech. Similarly, from the sound end point, extend frame by frame toward later frames with the same method, which is not repeated here.
The short-time zero-crossing rate is the number of times the waveform crosses zero within one frame of the speech signal.
The formula of the short-time zero-crossing rate:
Z = Σ_{t=1}^{T−1} π{ s_t · s_{t−1} < 0 };
Wherein,
s_t denotes the value of sample t;
T denotes the frame length;
π{A} is the indicator function: its value is 1 when A is true and 0 when A is false.
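The per-frame decision of steps 100-2-3-1 through 100-2-3-3 can be sketched as follows. (Pure-Python illustration; the threshold values and the use of the sum of absolute amplitudes as "volume" are assumptions of this sketch, since the embodiment leaves both as empirical choices.)

```python
def short_time_zcr(frame):
    """Count zero crossings: samples where s_t * s_{t-1} < 0."""
    return sum(1 for t in range(1, len(frame)) if frame[t] * frame[t - 1] < 0)

def classify_frame(frame, vol_thresh, zcr_thresh):
    """Volume decides first; for quiet frames, the zero-crossing rate
    separates unvoiced speech from environmental noise / silence."""
    volume = sum(abs(s) for s in frame)          # one possible volume measure
    if volume > vol_thresh:
        return "voiced"
    return "unvoiced" if short_time_zcr(frame) > zcr_thresh else "silence"
```

A loud frame is labeled voiced regardless of its zero-crossing rate; a quiet but rapidly alternating frame is labeled unvoiced; a quiet, slowly varying frame is treated as silence or noise.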
Step 100-3: generate the corresponding spectrogram signal from each frame of preprocessed speech signal.
Traditional feature extraction uses banks of hand-designed filters, which loses information in the frequency domain. To avoid this problem, this embodiment provides a CNN+LSTM network model and feeds the speaker's spectrogram directly into the model, preserving the spectral information of the speech signal as much as possible.
Suppose the discrete speech signal x(n) is written, after framing, as x_n(m), m = 0, 1, …, N − 1; wherein n is the frame index, m is the sample index within a frame, and N is the frame length.
The short-time Fourier transform of the signal x(n) is:
X_n(e^{jω}) = Σ_{m=−∞}^{+∞} x(m) ω(n − m) e^{−jωm};
Wherein, ω(n) denotes the window function.
The discrete-time Fourier transform (DTFT) of one frame x_n(m) is:
X_n(e^{jω}) = Σ_{m=0}^{N−1} x_n(m) e^{−jωm};
The discrete Fourier transform (DFT) is:
X(n, k) = Σ_{m=0}^{N−1} x_n(m) e^{−j2πkm/N};
Wherein, for 0 ≤ k ≤ N − 1, X(n, k) is the short-time amplitude spectrum estimate of x_n(m). The spectral energy density function P(n, k) is:
P(n, k) = |X(n, k)|²;
Taking n as the abscissa and k as the ordinate, and rendering the value of P(n, k) in grayscale or color, the resulting two-dimensional image is the spectrogram. Applying the transformation 10·log10(P(n, k)) gives the spectrogram expressed in decibels.
Step 100-4: label the spectrogram signals according to the N kinds of emotional speech type to obtain the N groups of sample spectrogram signals.
The human brain exhibits three main characteristics when recognizing emotion: temporality, randomness, and real-time operation. Starting from these three characteristics, and taking an intelligent nursing robot as the application background, this application builds an embedded intelligent front-end recognition system based on speech emotion under open-environment conditions.
This application adopts an LSTM-based dynamic emotion recognition method: the LSTM performs recurrent collection of the image sequence, learning and memorizing sequence-correlation information, and combines single-image information with that sequence information to discriminate emotion, enhancing the accuracy and robustness of emotion recognition in open environments.
This application simplifies the CNN convolutional layers to reduce algorithm complexity, and increases the number of LSTM layers to strengthen the algorithm's ability to learn relations within image sequences. A timestamp layer (cont) is added to let the LSTM learn correlations across image sequences of different lengths, and a slice layer is added to split off the sequence-end vector output by the second sub-network; the sequence-end vector is used, together with the label, to compute the error fed back to correct the network weights, or to predict the emotional speech type. This greatly reduces the amount of data the network processes and lowers algorithm complexity, so that the algorithm can run on embedded devices.
This application ports the speech emotion recognition algorithm and the network model trained on the server to an embedded platform (such as Huawei's Atlas 200 DK system-on-chip, which integrates CPU, NPU, and ISP), realizing an intelligent front-end speech emotion recognition system.
Corresponding to the first embodiment provided by this application, this application also provides a second embodiment: a training apparatus for recognizing emotional speech. Since the second embodiment is substantially similar to the first embodiment, its description is relatively brief; for related parts, refer to the corresponding explanation of the first embodiment. The apparatus embodiment described below is merely illustrative.
Fig. 4 shows an embodiment of the training apparatus for recognizing emotional speech provided by this application, as a unit block diagram.
As shown in Fig. 4, this application provides a training apparatus for recognizing emotional speech, comprising: a sample obtaining unit 401 and a sample training unit 402.
The sample obtaining unit 401 is configured to obtain groups of sample spectrogram signals in turn, wherein the sample spectrogram signals are spectrogram signals divided into N groups according to N kinds of emotional speech type; within a group the emotional speech type of the sample spectrogram signals is identical, between groups it differs, and N is an integer greater than 1. Each sample spectrogram signal carries a label marking its emotional speech type.
The sample training unit 402 is configured to train the first network model with each group of sample spectrogram signals until a preset training termination condition is reached, so as to obtain the first optimized parameters of the first network model.
Wherein the first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
Optionally, the apparatus further comprises a first preprocessing unit, which includes:
an obtaining subunit, configured to obtain the speech signal to be recognized;
a first preprocessing subunit, configured to preprocess the speech signal to be recognized according to the preprocessing rule to obtain multiple frames of preprocessed speech signal;
a spectrogram-generating subunit, configured to generate the corresponding spectrogram signal from each frame of preprocessed speech signal;
a sample-spectrogram subunit, configured to label the spectrogram signals according to the N kinds of emotional speech type to obtain the N groups of sample spectrogram signals.
Optionally, the preprocessing rule comprises: a pre-emphasis rule, a windowing-and-framing rule, and an endpoint detection rule. The preprocessing subunit comprises:
a pre-emphasis subunit, configured to perform pre-emphasis on the speech signal to be recognized according to the pre-emphasis rule to obtain the first speech signal.
The formula of the pre-emphasis rule:
H(z) = 1 − μz⁻¹, i.e. y(n) = x(n) − μ·x(n − 1);
Wherein,
x(n) denotes the input speech signal to be recognized;
z denotes the frequency (z-transform) variable;
H(z) denotes the pre-emphasis filter;
μ denotes the pre-emphasis factor; μ usually lies in the range 0.9–0.97, and in this embodiment μ = 0.9375;
y(n) denotes the first speech signal.
a windowing-and-framing subunit, configured to perform windowing and framing on the first speech signal according to the windowing-and-framing rule to obtain multiple frames of second speech signal.
The formula of the windowing-and-framing rule:
s_ω(n) = s(n) · ω(n);
Wherein,
s(n) denotes the first speech signal;
ω(n) denotes the window function. Common window functions include the rectangular window, the Hanning window, and the Hamming window. This embodiment uses the Hamming window, whose good flatness largely avoids the truncation effect, as the window function for windowing and framing. The expression of the Hamming window is:
ω(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1; ω(n) = 0 otherwise;
Wherein, N is the frame length.
an endpoint detection subunit, configured to perform endpoint detection on each frame of the second speech signal according to the endpoint detection rule to obtain multiple frames of preprocessed speech signal.
Optionally, the endpoint detection rule is specifically a time-domain endpoint detection rule. The endpoint detection subunit comprises:
a judgment subunit, configured to judge whether the value of the second speech signal is greater than the preset volume threshold. The preset volume threshold is an empirical value; because a comprehensive check against the zero-crossing rate follows, the threshold should be set rather high than low.
a voiced-segment subunit, configured to treat the second speech signal as a voiced segment if the output of the judgment subunit is "yes". A voiced segment is the louder part of the sound a person produces.
a silent-segment subunit, configured to treat it as a silent segment if not. A silent segment is one whose sound level is very low: approximately no sound, environmental noise, or an unvoiced segment.
Whether a given low-volume part is unvoiced speech (the quieter part of the sound a person produces) can be judged from the short-time zero-crossing rate. Indoors, the zero-crossing rate of unvoiced speech is generally noticeably higher than that of environmental noise or silence; a zero-crossing-rate threshold is therefore preset, with a segment above the threshold considered unvoiced speech and one below it considered environmental noise or silence.
Optionally, the first sub-network comprises: a first fully connected layer (fc1), a first activation function (relu4), a first dropout layer (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second dropout layer (drop5), a third fully connected layer (fc3), and a third activation function (relu6).
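At inference time the fully-connected / activation / dropout stack of the first sub-network reduces to the following sketch. (Layer sizes and weights are invented for illustration; plain ReLU is assumed for the activations, and dropout is the identity once training is over, so it is omitted.)

```python
def relu(v):
    return [max(0.0, x) for x in v]

def fully_connected(v, weights, bias):
    """One fc layer: out_i = sum_j weights[i][j] * v[j] + bias[i]."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def first_subnetwork(v, layers):
    """Apply fc -> relu per layer (dropout omitted: identity at inference)."""
    for weights, bias in layers:
        v = relu(fully_connected(v, weights, bias))
    return v
```

Three such (fc, relu) pairs, with successively smaller output sizes, realize the feature-extraction-and-dimensionality-reduction role the first sub-network plays here.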
Optionally, the second sub-network comprises: an input-data reshape layer (reshape-data), an input-timestamp reshape layer (reshape-cm), a first long short-term memory recurrent layer (lstm1), a third dropout layer (lstm1-drop), a second LSTM layer (lstm2), a third LSTM layer (lstm3), an output slice layer (slice), an LSTM-output reshape layer (reshape-lstm), a fourth fully connected layer (fc4), a label reshape layer (reshape-label), a loss function layer (loss), and a classification-accuracy output layer (accuracy).
Wherein the output slice layer (slice) is used to split off the sequence-end vector output by the second sub-network; the sequence-end vector is used, together with the label, to compute the error fed back to correct the network weights, or to predict the emotional speech type. An example of such a network is an LSTM network model.
Optionally, the preset training termination condition includes at least one of the following conditions:
Condition one: the percentage of correctly classified samples among all test samples is greater than a preset first percentage threshold;
Condition two: for each group of sample spectrogram signals, the percentage of correctly classified samples among the test samples of that group is greater than a preset second percentage threshold.
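The two termination conditions can be checked as in the following sketch. (The threshold values 0.9 and 0.8 are placeholders; the embodiment leaves both percentage thresholds unspecified.)

```python
def check_termination(overall_correct, overall_total,
                      group_correct, group_total,
                      first_thresh=0.9, second_thresh=0.8):
    """Return (condition_one, condition_two): overall accuracy above the first
    threshold, and every group's accuracy above the second threshold."""
    cond_one = overall_correct / overall_total > first_thresh
    cond_two = all(c / t > second_thresh
                   for c, t in zip(group_correct, group_total))
    return cond_one, cond_two
```

Since the text requires "at least one" of the conditions, training stops when either flag (or both, depending on configuration) is true.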
This application further provides a third embodiment: a method for recognizing emotional speech. Since the third embodiment is related to the first embodiment, its description is relatively brief; for related parts, refer to the corresponding explanation of the first embodiment. The embodiment described below is merely illustrative.
The present embodiment is described in detail below with reference to Fig. 5 and Fig. 6, wherein Fig. 5 is a flowchart of the method for recognizing emotional speech provided by this embodiment, and Fig. 6 is a structure diagram of the second network model used by the method.
As shown in Fig. 5:
Step S501: obtain the speech signal to be recognized;
Step S502: preprocess the speech signal to be recognized according to the preprocessing rule to obtain multiple frames of preprocessed speech signal;
Step S503: obtain the corresponding spectrogram signal from each frame of preprocessed speech signal;
Step S504: input each spectrogram signal into the second network model carrying the first optimized parameters to obtain the corresponding emotional speech type;
Wherein the second network model comprises: the first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
As shown in Fig. 6, ALEXTNet is the first sub-network.
The first sub-network comprises: the first fully connected layer (fc1), the first activation function (relu4), the first dropout layer (drop4), the second fully connected layer (fc2), the second activation function (relu5), the second dropout layer (drop5), the third fully connected layer (fc3), and the third activation function (relu6).
Optionally, the third sub-network comprises: an input-data reshape layer (reshape-data), an input-timestamp reshape layer (reshape-cm), a first long short-term memory recurrent layer (lstm1), a third dropout layer (lstm1-drop), a second LSTM layer (lstm2), a third LSTM layer (lstm3), an LSTM-output reshape layer (reshape-lstm), a fourth fully connected layer (fc4), and a class-probability layer (softmax-loss).
The first network model and the second network model follow the same underlying algorithm principle. Fig. 2 and Fig. 6 are data-flow diagrams drawn by feeding the training model file (train_val.prototxt) and the deployment model file (deploy.prototxt) into netscope (a drawing tool provided with caffe).
The deploy.prototxt file is formed by deleting material from the train_val.prototxt file: by the nature of the two files, the training-only parts present in train_val.prototxt are removed in deploy.prototxt.
The final layers of the first network model, used for training, are the loss function layer (loss) and the classification-accuracy layer (accuracy); the final layer of the second network model, used for deployment, is the class-probability layer (softmax-loss). All of these are loss-type layers without any weight parameters. Inspection of the caffe source code shows that the final layers of both network models are in fact applications of softmax regression: the class-probability layer (softmax-loss), as defined, computes only the class probabilities (the front part of the computation), whereas the loss layer (loss) and the accuracy layer (accuracy) continue past the probability computation (the rear part).
The second network model requires no training, so everything after the probability computation can be deleted. By comparison, in the APIs defined by TensorFlow, training and deployment use the same set of network-layer APIs, with the distinction between training and deployment handled inside the network interfaces; caffe instead defines two sets of APIs for some network layers and uses different APIs in training and in deployment.
In addition, comparing Fig. 2 with Fig. 6 shows that, relative to the structure of the second network model, the first network model additionally contains, between the last third LSTM layer (lstm3) and the fourth fully connected layer (fc4), an output slice layer (slice) and an LSTM-output reshape layer (reshape-lstm). Their main functions are as follows: the output slice layer (slice) extracts the last element of the output sequence of the third LSTM layer (lstm3), where each element of the sequence represents one image feature; the reshape layer (reshape-lstm) changes the dimensionality of the slice layer's output so that it matches the input dimensionality of the fourth fully connected layer (fc4). This is a deliberate design of the invention: the input and output of the second sub-network are time series of equal length, and within one output time series every element is influenced by all the input elements that precede it in the same sequence. That is, the last element extracted by the slice layer is influenced by all inputs of the same sequence and therefore represents the features of the whole sequence better than any other element; the first network model uses exactly this element value, together with the label, to compute the error and then the rear part of the probability computation.
As shown in Fig. 6, the second network model of the present invention does not include the output slice layer (slice). This layer needs no trained weights and does not belong to the neural network algorithm proper; omitting it simply means that every element of the output sequence of the third LSTM layer (lstm3) is output. The slice layer could of course be added so that only the last element of the lstm3 output sequence is output. The practical difference between the two choices is that in the former, every frame of speech captured by the microphone yields a recognition-result probability, while in the latter, every captured time series of speech (10 frames) yields one recognition-result probability.
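The difference between the two deployment choices can be illustrated as follows. (Toy class scores; the sequence length 10 matches the text, but the score values are invented.)

```python
def slice_last(lstm_outputs):
    """Mimic the slice layer: keep only the last element of the LSTM output
    sequence, the one influenced by every earlier input of the sequence."""
    return lstm_outputs[-1]

# Without the slice layer: one result per frame of the 10-frame sequence.
# With it: one result per sequence.
sequence = [[0.1 * i, 1.0 - 0.1 * i] for i in range(10)]  # toy per-frame scores
per_frame_results = sequence                 # 10 recognition probabilities
per_sequence_result = slice_last(sequence)   # a single recognition probability
```

The slice-layer variant emits a tenth as many results, which matches the text's point about reducing the amount of data the network hands on to later stages.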
Corresponding to the third embodiment provided by this application, this application also provides a fourth embodiment: an apparatus for recognizing emotional speech. Since the fourth embodiment is substantially similar to the third embodiment, its description is relatively brief; for related parts, refer to the corresponding explanation of the third embodiment. The apparatus embodiment described below is merely illustrative.
Fig. 7 shows an embodiment of the apparatus for recognizing emotional speech provided by this application, as a unit block diagram.
As shown in Fig. 7, this application provides an apparatus for recognizing emotional speech, comprising: a unit 701 for obtaining the speech signal to be recognized, a preprocessing unit 702, a unit 703 for obtaining the spectrogram signal, and a unit 704 for obtaining the emotional speech type.
The unit 701 is configured to obtain the speech signal to be recognized;
the preprocessing unit 702 is configured to preprocess the speech signal to be recognized according to the preprocessing rule to obtain multiple frames of preprocessed speech signal;
the unit 703 is configured to obtain the corresponding spectrogram signal from each frame of preprocessed speech signal;
the unit 704 is configured to input each spectrogram signal into the second network model carrying the first optimized parameters to obtain the corresponding emotional speech type;
wherein the second network model comprises: the first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and the third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
The first sub-network comprises: the first fully connected layer (fc1), the first activation function (relu4), the first dropout layer (drop4), the second fully connected layer (fc2), the second activation function (relu5), the second dropout layer (drop5), the third fully connected layer (fc3), and the third activation function (relu6).
Optionally, the third sub-network comprises: an input-data reshape layer (reshape-data), an input-timestamp reshape layer (reshape-cm), a first long short-term memory recurrent layer (lstm1), a third dropout layer (lstm1-drop), a second LSTM layer (lstm2), a third LSTM layer (lstm3), an LSTM-output reshape layer (reshape-lstm), a fourth fully connected layer (fc4), and a class-probability layer (softmax-loss).
In conjunction with brain it is main when carrying out emotion recognition there are three characteristics: timing, randomness, real-time, the application from
These three characteristics are set about, and nurse the artificial application background of machine with intelligence, building is embedded in based on speech emotional under the conditions of open environment
Formula intelligent front end identifying system.
The application uses the dynamic emotion identification method of LSTM, and the circle collection of image sequence, study are carried out using LSTM
With memory sequences related information, emotion differentiation is carried out in conjunction with single image information and serial correlation information, emotion recognition is enhanced and exists
Accuracy, robustness under open environment.
The present application simplifies the CNN convolutional layers to reduce algorithm complexity, and increases the number of LSTM layers to strengthen the algorithm's ability to learn image-sequence relations. A timestamp layer (cont) is added so that the LSTM can learn correlations across image sequences of different lengths, and a slice layer is added to split off the sequence-end vector output by the second sub-network; the sequence-end vector is used either to compute the error against the label, which is fed back to correct the network weights, or to predict the emotional speech type. This greatly reduces the amount of data the network processes, lowers algorithm complexity, and adapts the algorithm to running on embedded devices.
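As an illustration only (with hypothetical sizes and randomly initialized weights), the dual use of the sequence-end vector described above — error feedback during training versus type prediction during inference — can be sketched as follows:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical second-sub-network output: T timesteps of d-dim vectors.
T, d, n_classes = 10, 6, 4
rng = np.random.default_rng(1)
seq_out = rng.standard_normal((T, d))

# Slice layer: keep only the sequence-end vector.
end_vec = seq_out[-1]

W = 0.1 * rng.standard_normal((n_classes, d))
probs = softmax(W @ end_vec)

# Training use: error against the label, fed back to correct the weights.
label = 2                                   # hypothetical ground-truth class
loss = -np.log(probs[label])                # cross-entropy error
grad_logits = probs.copy()
grad_logits[label] -= 1.0                   # d(loss)/d(logits) for softmax-CE
W -= 0.01 * np.outer(grad_logits, end_vec)  # one gradient-descent correction

# Inference use: predict the emotional speech type.
predicted = int(np.argmax(probs))
```

Only the single end-of-sequence vector flows into the classifier, which is what reduces the amount of data the network must handle.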
The present application ports the speech emotion recognition algorithm and the network model trained on the server to an embedded platform (for example, the Huawei Atlas 200 DK system-on-chip, which integrates a CPU, an NPU, and an ISP), realizing the speech emotion recognition system as an intelligent front-end recognition system.
The above embodiments are merely exemplary embodiments of the present application and are not intended to limit it; the protection scope of the present application is defined by the claims. Those skilled in the art may make various modifications or equivalent replacements to the present application within its spirit and protection scope, and such modifications or equivalent replacements shall also be regarded as falling within the protection scope of the present application.
Claims (10)
1. A training method for recognizing emotional speech, characterized by comprising:
successively obtaining groups of sample spectrogram signals, wherein the sample spectrogram signals are spectrogram signals divided into N groups according to N emotional speech types; the sample spectrogram signals within a group share the same emotional speech type, the emotional speech types of the sample spectrogram signals differ between groups, and N is an integer greater than 1; each sample spectrogram signal carries a label marking the emotional speech type;
training a first network model with each group of sample spectrogram signals until a preset training termination condition is reached, so as to obtain first optimized parameters of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signals and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
2. The training method according to claim 1, characterized in that, before successively obtaining the groups of sample spectrogram signals, the method further comprises:
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized according to a preprocessing rule to obtain multiple frames of preprocessed voice signal;
generating the corresponding spectrogram signal based on each frame of preprocessed voice signal;
marking the spectrogram signals with the labels according to the N emotional speech types to obtain the N groups of sample spectrogram signals.
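Purely as an illustrative sketch (not part of the claimed subject matter), generating a spectrogram signal from a framed signal could look like the following in NumPy; the sampling rate, frame length, hop, and FFT size are hypothetical common choices:

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t)        # 1 s synthetic stand-in for speech

frame_len, hop, n_fft = 400, 160, 512       # 25 ms frames, 10 ms hop
window = np.hamming(frame_len)
n_frames = 1 + (len(signal) - frame_len) // hop

spec = np.empty((n_frames, n_fft // 2 + 1))
for i in range(n_frames):
    frame = signal[i * hop : i * hop + frame_len] * window
    mag = np.abs(np.fft.rfft(frame, n_fft))  # magnitude spectrum of one frame
    spec[i] = 20 * np.log10(mag + 1e-10)     # log-magnitude spectrogram row
```

Each row of `spec` is the spectrum of one preprocessed frame; stacking the rows gives the time-frequency spectrogram image that the network consumes.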
3. The training method according to claim 2, characterized in that the preprocessing rule comprises: a pre-emphasis rule, a windowed-framing rule, and an endpoint-detection rule;
and preprocessing the voice signal to be recognized according to the preprocessing rule to obtain the multiple frames of preprocessed voice signal comprises:
performing pre-emphasis on the voice signal to be recognized according to the pre-emphasis rule to obtain a first voice signal;
performing windowed framing on the first voice signal according to the windowed-framing rule to obtain multiple frames of a second voice signal;
performing endpoint detection on each frame of the second voice signal according to the endpoint-detection rule to obtain the multiple frames of preprocessed voice signal.
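As an illustration only, the first two preprocessing steps of claim 3 — pre-emphasis, then windowed framing — might be sketched like this; the pre-emphasis coefficient, frame length, and hop are hypothetical standard values:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]: boosts high frequencies before analysis."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def windowed_framing(x, frame_len=400, hop=160):
    """Cut the signal into overlapping Hamming-windowed frames."""
    n = 1 + (len(x) - frame_len) // hop
    w = np.hamming(frame_len)
    return np.stack([x[i * hop : i * hop + frame_len] * w for i in range(n)])

rng = np.random.default_rng(0)
raw = rng.standard_normal(16000)            # hypothetical 1 s signal at 16 kHz
first = pre_emphasis(raw)                   # the "first voice signal"
second = windowed_framing(first)            # frames of the "second voice signal"
```

Endpoint detection (the third step) would then discard the silent frames of `second` before spectrogram generation.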
4. The training method according to claim 3, characterized in that the endpoint-detection rule is specifically a time-domain endpoint-detection rule.
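Purely as an illustrative sketch, a simple time-domain endpoint-detection rule can compare each frame's short-time energy to a threshold; the relative threshold here is a hypothetical choice:

```python
import numpy as np

def endpoint_detect(frames, rel_thresh=0.1):
    """Keep frames whose short-time energy exceeds a fraction of the
    maximum frame energy (a basic time-domain endpoint-detection rule)."""
    energy = np.mean(frames ** 2, axis=1)
    return frames[energy > rel_thresh * energy.max()]

# Toy signal: silence, a 440 Hz burst, silence; framed without overlap.
sig = np.zeros(3000)
sig[1000:2000] = np.sin(2 * np.pi * 440 * np.arange(1000) / 16000)
frames = sig.reshape(-1, 100)               # 30 frames of 100 samples
speech = endpoint_detect(frames)            # only the burst frames survive
```

Practical detectors often add a zero-crossing-rate test for unvoiced sounds, but the energy criterion alone conveys the time-domain idea.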
5. The training method according to claim 1, characterized in that the first sub-network comprises: a first fully connected layer, a first activation function, a first dropout layer, a second fully connected layer, a second activation function, a second dropout layer, a third fully connected layer, and a third activation function.
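As an illustration only, the stack of claim 5 — three fully connected layers, each followed by an activation (with dropout acting as identity at inference) — can be sketched with hypothetical layer widths:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)               # a common activation choice

rng = np.random.default_rng(0)
d_in, d1, d2, d3 = 257, 128, 64, 32         # hypothetical layer widths

def dense(d_out, d_i):
    """Random weights and zero bias for one fully connected layer."""
    return 0.1 * rng.standard_normal((d_out, d_i)), np.zeros(d_out)

layers = [dense(d1, d_in), dense(d2, d1), dense(d3, d2)]
x = rng.standard_normal(d_in)               # one spectrogram column as a vector
for W, b in layers:                         # fc_k followed by activation_k
    x = relu(W @ x + b)                     # dropout layers: identity at inference
features = x                                # reduced-dimension emotion features
```

The shrinking widths (257 → 128 → 64 → 32) show how the first sub-network performs feature extraction together with dimensionality reduction.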
6. The training method according to claim 1, characterized in that the second sub-network comprises: an input-data reshape layer, an input-timestamp reshape layer, a first long short-term memory (LSTM) recurrent layer, a third dropout layer, a second LSTM recurrent layer, a third LSTM recurrent layer, an output slice layer, an LSTM-output reshape layer, a fourth fully connected layer, an input-label reshape layer, a loss-function layer, and a classification-accuracy output layer;
wherein the output slice layer is used to split off the sequence-end vector output by the second sub-network; the sequence-end vector is used either to compute the error against the label, which is fed back to correct the network weights, or to predict the emotional speech type.
7. The training method according to claim 1, characterized in that the preset training termination condition comprises at least one of the following conditions:
the percentage of correctly classified samples among the total number of test samples is greater than a preset first percentage threshold;
for every group of sample spectrogram signals, the percentage of correctly classified samples among that group's total test samples is greater than a preset second percentage threshold.
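Purely as an illustrative sketch, the two termination conditions of claim 7 reduce to a simple predicate; the threshold values here are hypothetical:

```python
def should_stop(correct, total, group_stats, th1=0.9, th2=0.9):
    """Preset termination test: overall accuracy above th1, or every
    group's accuracy above th2 (both thresholds are hypothetical)."""
    overall = correct / total > th1
    per_group = all(c / t > th2 for c, t in group_stats)
    return overall or per_group

# e.g. 95/100 correct overall, two groups scoring 48/50 and 47/50
stop = should_stop(95, 100, [(48, 50), (47, 50)])
```

Training continues until this predicate holds, after which the current weights become the first optimized parameters.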
8. A training device for recognizing emotional speech, characterized by comprising:
a sample obtaining unit, configured to successively obtain groups of sample spectrogram signals, wherein the sample spectrogram signals are spectrogram signals divided into N groups according to N emotional speech types; the sample spectrogram signals within a group share the same emotional speech type, the emotional speech types of the sample spectrogram signals differ between groups, and N is an integer greater than 1; each sample spectrogram signal carries a label marking the emotional speech type;
a training sample unit, configured to train a first network model with each group of sample spectrogram signals until a preset training termination condition is reached, so as to obtain first optimized parameters of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signals and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
9. A method for recognizing emotional speech, characterized by comprising:
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized according to a preprocessing rule to obtain multiple frames of preprocessed voice signal;
obtaining the corresponding spectrogram signal based on each frame of preprocessed voice signal;
inputting each spectrogram signal into a second network model having the first optimized parameters, to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signals and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
10. A device for recognizing emotional speech, characterized by comprising:
a voice signal obtaining unit, configured to obtain a voice signal to be recognized;
a preprocessing unit, configured to preprocess the voice signal to be recognized according to a preprocessing rule to obtain multiple frames of preprocessed voice signal;
a spectrogram signal obtaining unit, configured to obtain the corresponding spectrogram signal based on each frame of preprocessed voice signal;
an emotional speech type obtaining unit, configured to input each spectrogram signal into a second network model having the first optimized parameters, to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signals and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910690493.9A CN110415728B (en) | 2019-07-29 | 2019-07-29 | Method and device for recognizing emotion voice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910690493.9A CN110415728B (en) | 2019-07-29 | 2019-07-29 | Method and device for recognizing emotion voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110415728A true CN110415728A (en) | 2019-11-05 |
CN110415728B CN110415728B (en) | 2022-04-01 |
Family
ID=68363877
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910690493.9A Active CN110415728B (en) | 2019-07-29 | 2019-07-29 | Method and device for recognizing emotion voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110415728B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110782872A (en) * | 2019-11-11 | 2020-02-11 | 复旦大学 | Language identification method and device based on deep convolutional recurrent neural network |
CN111312292A (en) * | 2020-02-18 | 2020-06-19 | 北京三快在线科技有限公司 | Emotion recognition method and device based on voice, electronic equipment and storage medium |
CN111326178A (en) * | 2020-02-27 | 2020-06-23 | 长沙理工大学 | Multi-mode speech emotion recognition system and method based on convolutional neural network |
CN111883178A (en) * | 2020-07-17 | 2020-11-03 | 渤海大学 | Double-channel voice-to-image-based emotion recognition method |
CN112466324A (en) * | 2020-11-13 | 2021-03-09 | 上海听见信息科技有限公司 | Emotion analysis method, system, equipment and readable storage medium |
CN112613481A (en) * | 2021-01-04 | 2021-04-06 | 上海明略人工智能(集团)有限公司 | Bearing abrasion early warning method and system based on frequency spectrum |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN113566948A (en) * | 2021-07-09 | 2021-10-29 | 中煤科工集团沈阳研究院有限公司 | Fault audio recognition and diagnosis method for robot coal pulverizer |
CN113628639A (en) * | 2021-07-06 | 2021-11-09 | 哈尔滨理工大学 | Voice emotion recognition method based on multi-head attention mechanism |
CN113689887A (en) * | 2020-05-18 | 2021-11-23 | 辉达公司 | Speech detection termination using one or more neural networks |
CN113808620A (en) * | 2021-08-27 | 2021-12-17 | 西藏大学 | Tibetan language emotion recognition method based on CNN and LSTM |
JP2023503703A (en) * | 2020-04-09 | 2023-01-31 | ワイピー ラブス カンパニー,リミテッド | Service provision method and system based on user voice |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108597539A (en) * | 2018-02-09 | 2018-09-28 | 桂林电子科技大学 | Speech-emotion recognition method based on parameter migration and sound spectrograph |
CN109767790A (en) * | 2019-02-28 | 2019-05-17 | 中国传媒大学 | A kind of speech-emotion recognition method and system |
Non-Patent Citations (2)
Title |
---|
LIM W: "Speech Emotion Recognition using Convolutional and Recurrent Neural Networks", 《2016 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE》 * |
MIAO Yuqing et al.: "Speech Emotion Recognition Based on Parameter Transfer and Convolutional Recurrent Neural Network", Computer Engineering and Applications * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110782872A (en) * | 2019-11-11 | 2020-02-11 | 复旦大学 | Language identification method and device based on deep convolutional recurrent neural network |
CN111312292A (en) * | 2020-02-18 | 2020-06-19 | 北京三快在线科技有限公司 | Emotion recognition method and device based on voice, electronic equipment and storage medium |
CN111326178A (en) * | 2020-02-27 | 2020-06-23 | 长沙理工大学 | Multi-mode speech emotion recognition system and method based on convolutional neural network |
JP7443532B2 (en) | 2020-04-09 | 2024-03-05 | ワイピー ラブス カンパニー,リミテッド | Service provision method and system based on user voice |
JP2023503703A (en) * | 2020-04-09 | 2023-01-31 | ワイピー ラブス カンパニー,リミテッド | Service provision method and system based on user voice |
CN113689887A (en) * | 2020-05-18 | 2021-11-23 | 辉达公司 | Speech detection termination using one or more neural networks |
CN111883178A (en) * | 2020-07-17 | 2020-11-03 | 渤海大学 | Double-channel voice-to-image-based emotion recognition method |
CN112466324A (en) * | 2020-11-13 | 2021-03-09 | 上海听见信息科技有限公司 | Emotion analysis method, system, equipment and readable storage medium |
CN112613481A (en) * | 2021-01-04 | 2021-04-06 | 上海明略人工智能(集团)有限公司 | Bearing abrasion early warning method and system based on frequency spectrum |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN113628639A (en) * | 2021-07-06 | 2021-11-09 | 哈尔滨理工大学 | Voice emotion recognition method based on multi-head attention mechanism |
CN113566948A (en) * | 2021-07-09 | 2021-10-29 | 中煤科工集团沈阳研究院有限公司 | Fault audio recognition and diagnosis method for robot coal pulverizer |
CN113808620A (en) * | 2021-08-27 | 2021-12-17 | 西藏大学 | Tibetan language emotion recognition method based on CNN and LSTM |
CN113808620B (en) * | 2021-08-27 | 2023-03-21 | 西藏大学 | Tibetan language emotion recognition method based on CNN and LSTM |
Also Published As
Publication number | Publication date |
---|---|
CN110415728B (en) | 2022-04-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110415728A (en) | A kind of method and apparatus identifying emotional speech | |
CN109599129B (en) | Voice depression recognition system based on attention mechanism and convolutional neural network | |
CN108831485B (en) | Speaker identification method based on spectrogram statistical characteristics | |
CN108648748B (en) | Acoustic event detection method under hospital noise environment | |
CN105632501B (en) | A kind of automatic accent classification method and device based on depth learning technology | |
CN108922513B (en) | Voice distinguishing method and device, computer equipment and storage medium | |
CN111798874A (en) | Voice emotion recognition method and system | |
CN108281146A (en) | A kind of phrase sound method for distinguishing speek person and device | |
EP2887351A1 (en) | Computer-implemented method, computer system and computer program product for automatic transformation of myoelectric signals into audible speech | |
CN111724770B (en) | Audio keyword identification method for generating confrontation network based on deep convolution | |
CN111261147A (en) | Music embedding attack defense method facing voice recognition system | |
CN108986798B (en) | Processing method, device and the equipment of voice data | |
CN104464724A (en) | Speaker recognition method for deliberately pretended voices | |
Rammo et al. | Detecting the speaker language using CNN deep learning algorithm | |
CN108922561A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN109272986A (en) | A kind of dog sound sensibility classification method based on artificial neural network | |
Alghifari et al. | On the use of voice activity detection in speech emotion recognition | |
CN109036470A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
Jie et al. | Speech emotion recognition of teachers in classroom teaching | |
CN105283916B (en) | Electronic watermark embedded device, electronic watermark embedding method and computer readable recording medium | |
Nasrun et al. | Human emotion detection with speech recognition using Mel-frequency cepstral coefficient and support vector machine | |
Zhang et al. | Depthwise separable convolutions for short utterance speaker identification | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
CN103077706A (en) | Method for extracting and representing music fingerprint characteristic of music with regular drumbeat rhythm | |
CN109767790A (en) | A kind of speech-emotion recognition method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||