CN110415728A - Method and apparatus for identifying emotional speech - Google Patents

Method and apparatus for identifying emotional speech

Info

Publication number
CN110415728A
Authority
CN
China
Prior art keywords
signal
network
voice signal
spectrogram
emotional speech
Prior art date
Legal status
Granted
Application number
CN201910690493.9A
Other languages
Chinese (zh)
Other versions
CN110415728B (en)
Inventor
崔明明
房建东
Current Assignee
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date
Filing date
Publication date
Application filed by Inner Mongolia University of Technology
Priority to CN201910690493.9A
Publication of CN110415728A
Application granted
Publication of CN110415728B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: using neural networks
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/63: for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The present application provides a method and apparatus for identifying emotional speech. The method comprises: obtaining a speech signal to be identified; preprocessing the speech signal to be identified according to preprocessing rules to obtain multiple frames of preprocessed speech; obtaining the corresponding spectrogram signal from each frame of preprocessed speech; and inputting each spectrogram signal into a second network model carrying first optimized parameters to obtain the corresponding emotional speech type, wherein the second network model comprises a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features. The application uses a CNN model to extract features from the spectrogram and an LSTM model to perform time-series modelling on those features, which improves the accuracy and robustness of emotion recognition in open environments while reducing the network's data throughput and the algorithmic complexity so that the algorithm is suited to running on embedded devices.

Description

Method and apparatus for identifying emotional speech
Technical field
This application relates to the field of artificial intelligence, and in particular to a training method and apparatus for identifying emotional speech and to a method and apparatus for identifying emotional speech.
Background art
The purpose of emotion recognition is to give computers a human-like ability to observe, understand, and express emotion, so that they can interact with people better. Since ancient times the vast majority of communication between people has been spoken. Speech has become the main medium of human communication because, besides the textual content to be expressed, it carries much information beyond the text, such as the speaker's language, emotion, physical condition, and gender, which plain text understanding cannot provide. Obtaining the latent emotional information by processing the speech signal has broad application potential in fields such as analysing cognitive state in teaching, perceiving the emotional state of patients, danger early warning in public spaces, and assisting the visually impaired. As a key technology of intelligent interaction and affective computing, speech emotion recognition has therefore become a research focus of artificial intelligence in recent years.
Research on speech emotion recognition at home and abroad has already made great progress. However, the conventional acoustic features (prosodic features, spectrum-based features, and voice-quality features) describe speech emotion from the time-domain and frequency-domain perspectives separately and cannot reflect the time-frequency emotional characteristics of the speech signal at the same time. Moreover, applying still-image algorithms to natural scenes makes no effective use of dynamic sequence information, so the algorithms lack robustness and their practical effect leaves room for improvement.
The currently dominant approach uses the spectrogram as the speech emotion feature, so that the time-frequency emotional characteristics of the speech signal are reflected simultaneously, and then performs emotional feature extraction and classification with a deep learning algorithm. Although the recognition accuracy of such algorithms has improved, the large amount of data they process and their high complexity place heavy demands on hardware computing performance, so they are mostly deployed on high-performance servers and are difficult to apply in real-life natural scenes.
Summary of the invention
The present application provides a training method for identifying emotional speech, a training apparatus for identifying emotional speech, a method for identifying emotional speech, and an apparatus for identifying emotional speech, in order to solve the problem that current emotion recognition algorithms involve a large amount of computation.
To solve the above technical problem, the embodiments of the present application provide the following technical solutions.
The present application provides a training method for identifying emotional speech, comprising:
successively obtaining groups of sample spectrogram signals, wherein the spectrogram signals are divided into N groups according to N emotional speech types, the sample spectrogram signals within a group share the same emotional speech type, the emotional speech types differ between groups, and N is an integer greater than 1, each sample spectrogram signal carrying a label marking its emotional speech type;
training a first network model with each group of sample spectrogram signals until a preset training termination condition is reached, so as to obtain first optimized parameters of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
Optionally, before successively obtaining the groups of training speech signals, the method further comprises:
obtaining a speech signal to be identified;
preprocessing the speech signal to be identified according to preprocessing rules to obtain multiple frames of preprocessed speech;
generating the corresponding spectrogram signal from each frame of preprocessed speech;
labelling the spectrogram signals according to the N emotional speech types to obtain the N groups of sample spectrogram signals.
Optionally, the preprocessing rules comprise: a pre-emphasis rule, a windowing and framing rule, and an endpoint detection rule;
and preprocessing the speech signal to be identified according to the preprocessing rules to obtain the multiple frames of preprocessed speech comprises:
pre-emphasising the speech signal to be identified according to the pre-emphasis rule to obtain a first speech signal;
windowing and framing the first speech signal according to the windowing and framing rule to obtain multiple frames of second speech signals;
performing endpoint detection on each frame of the second speech signal according to the endpoint detection rule to obtain the multiple frames of preprocessed speech.
Optionally, the endpoint detection rule is specifically a time-domain endpoint detection rule.
Optionally, the first sub-network comprises: a first fully connected layer, a first activation function, a first dropout layer, a second fully connected layer, a second activation function, a second dropout layer, a third fully connected layer, and a third activation function.
Optionally, the second sub-network comprises: an input-data reshape layer, an input-timestamp reshape layer, a first LSTM (long short-term memory) layer, a third dropout layer, a second LSTM layer, a third LSTM layer, an output slice layer, an LSTM-output reshape layer, a fourth fully connected layer, an input-label reshape layer, a loss function layer, and a classification accuracy output;
wherein the output slice layer is used to cut out the sequence-end vector of the output of the second sub-network, and the sequence-end vector is used either to compute the error against the label and feed it back to correct the network weights, or to predict the emotional speech type.
Optionally, the preset training termination condition comprises at least one of the following conditions:
the percentage of correctly classified samples among all test samples exceeds a preset first percentage threshold;
for each group of sample spectrogram signals, the percentage of correctly classified samples among the test samples of that group exceeds a preset second percentage threshold.
The present application provides a training apparatus for identifying emotional speech, comprising:
a sample obtaining unit for successively obtaining groups of sample spectrogram signals, wherein the spectrogram signals are divided into N groups according to N emotional speech types, the sample spectrogram signals within a group share the same emotional speech type, the emotional speech types differ between groups, and N is an integer greater than 1, each sample spectrogram signal carrying a label marking its emotional speech type;
a sample training unit for training a first network model with each group of sample spectrogram signals until a preset training termination condition is reached, so as to obtain first optimized parameters of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
The present application provides a method for identifying emotional speech, comprising:
obtaining a speech signal to be identified;
preprocessing the speech signal to be identified according to preprocessing rules to obtain multiple frames of preprocessed speech;
obtaining the corresponding spectrogram signal from each frame of preprocessed speech;
inputting each spectrogram signal into a second network model carrying the first optimized parameters to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
The present application provides an apparatus for identifying emotional speech, comprising:
a to-be-identified speech signal obtaining unit for obtaining the speech signal to be identified;
a preprocessing unit for preprocessing the speech signal to be identified according to preprocessing rules to obtain multiple frames of preprocessed speech;
a spectrogram signal obtaining unit for obtaining the corresponding spectrogram signal from each frame of preprocessed speech;
an emotional speech type obtaining unit for inputting each spectrogram signal into a second network model carrying the first optimized parameters to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
The disclosure of the above embodiments shows that the embodiments of the present application have the following beneficial effects.
The present application provides a method and apparatus for identifying emotional speech. The method comprises: obtaining a speech signal to be identified; preprocessing it according to preprocessing rules to obtain multiple frames of preprocessed speech; obtaining the corresponding spectrogram signal from each frame of preprocessed speech; and inputting each spectrogram signal into a second network model carrying the first optimized parameters to obtain the corresponding emotional speech type, wherein the second network model comprises a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
The human brain exhibits three main characteristics when recognising emotion: temporality, randomness, and real-time operation. Starting from these three characteristics, and taking intelligent nursing robots as the application background, the present application builds an embedded intelligent front-end recognition system based on speech emotion under open-environment conditions.
The present application adopts an LSTM-based dynamic emotion recognition method: the LSTM loops over the image sequence, learning and memorising the correlations within the sequence, and emotion is discriminated by combining the information of individual images with the sequence correlation information, which improves the accuracy and robustness of emotion recognition in open environments.
The present application simplifies the CNN convolutional layers to reduce algorithmic complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn image-sequence relations, adds a timestamp layer (cont) so that the LSTM can learn the correlations of image sequences of different lengths, and adds a slice layer to cut out the sequence-end vector of the second sub-network's output; the sequence-end vector is used either to compute the error against the label and feed it back to correct the network weights, or to predict the emotional speech type. This greatly reduces the amount of data the network processes and lowers the algorithmic complexity so that the algorithm can run on embedded devices.
The present application ports the speech emotion recognition algorithm and the network model trained on the server to an embedded platform (for example Huawei's Atlas 200 DK system on chip, which integrates a CPU, an NPU, and an ISP), realising an intelligent front-end speech emotion recognition system.
Description of the drawings
Fig. 1 is a flowchart of the training method for identifying emotional speech provided by an embodiment of the present application;
Fig. 2 is a structural diagram of the first network model of the training method for identifying emotional speech provided by an embodiment of the present application;
Fig. 3 is a structural diagram of the first sub-network provided by an embodiment of the present application;
Fig. 4 is a unit block diagram of the training apparatus for identifying emotional speech provided by an embodiment of the present application;
Fig. 5 is a flowchart of the method for identifying emotional speech provided by an embodiment of the present application;
Fig. 6 is a structural diagram of the second network model of the method for identifying emotional speech provided by an embodiment of the present application;
Fig. 7 is a unit block diagram of the apparatus for identifying emotional speech provided by an embodiment of the present application.
Specific embodiments
In the following, the present application is described in detail with reference to the accompanying drawings and specific embodiments, which are not to be taken as limiting the application.
It should be understood that various modifications can be made to the embodiments disclosed herein. The following description should therefore not be regarded as limiting, but merely as exemplifying the embodiments; those skilled in the art will conceive of other modifications within the scope and spirit of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and, together with the general description given above and the detailed description of the embodiments given below, serve to explain the principles of the application.
These and other characteristics of the application will become apparent from the following description of preferred forms of embodiment, given as non-limiting examples, with reference to the accompanying drawings.
It should also be understood that, although the application is described with reference to some specific examples, a person skilled in the art can certainly realise many other equivalent forms of the application, which have the features set out in the claims and therefore all fall within the scope of protection defined thereby.
The above and other aspects, features, and advantages of the present application will become more readily apparent in view of the following detailed description when read in conjunction with the accompanying drawings.
Specific embodiments of the application are described hereinafter with reference to the accompanying drawings; it should be understood, however, that the disclosed embodiments are merely examples of the application, which may be implemented in various ways. Well-known and/or repeated functions and structures are not described in detail, to avoid obscuring the application with unnecessary or redundant detail; the specific structural and functional details disclosed herein are therefore not intended to be limiting, but merely serve as a basis for the claims and as a representative basis for teaching a person skilled in the art to variously employ the application in virtually any appropriately detailed structure.
This specification may use the phrases "in one embodiment", "in another embodiment", "in a further embodiment", or "in other embodiments", which may each refer to one or more of the same or different embodiments in accordance with the application.
A first embodiment provided by the present application is described below: a training method for identifying emotional speech.
The present embodiment is described in detail with reference to Fig. 1, Fig. 2, and Fig. 3, where Fig. 1 is a flowchart of the training method for identifying emotional speech provided by an embodiment of the present application, Fig. 2 is a structural diagram of the first network model of that training method, and Fig. 3 is a structural diagram of the first sub-network provided by an embodiment of the present application.
The human brain exhibits three main characteristics when recognising emotion: temporality, randomness, and real-time operation. Starting from these three characteristics, and taking intelligent nursing robots as the application background, the embodiments of the present application build an embedded front-end recognition system based on speech emotion under open-environment conditions.
As shown in Fig. 1, in step S101 groups of sample spectrogram signals are obtained successively.
A spectrogram is a two-dimensional image that describes how the spectral content of a speech signal changes over time. Its horizontal axis represents the time of the speech signal and its vertical axis represents the frequency components of the speech signal; the strength of each frequency component at each moment is indicated by the depth of the colour. The common methods of speech signal analysis are frequency-domain analysis and time-domain analysis; the spectrogram combines the two and can dynamically show the magnitude of the different frequency components at each moment, so the amount of information it carries is far greater than the sum of what the time domain alone and the frequency domain alone carry. Because of its immense value in speech analysis, the spectrogram is often called visible speech.
The spectrogram signal is the speech-related information obtained from the spectrogram.
In order to train the network model to identify emotional speech types, the embodiments of the present disclosure divide the collected spectrogram signals into N groups according to N emotional speech types before training and use them as training samples. The sample spectrogram signals within a group share the same emotional speech type, the emotional speech types differ between groups, and N is an integer greater than 1; each sample spectrogram signal carries a label marking its emotional speech type.
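As a minimal organisational sketch (the directory layout, the PNG file format, and the four emotion names below are illustrative assumptions, not taken from the application, which only requires N > 1 types), the grouping of labelled sample spectrograms by emotion type might look as follows:
```python
from collections import defaultdict
from pathlib import Path

# Hypothetical emotion types; the application only requires N > 1 types.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def group_spectrograms(root: str) -> dict[str, list[Path]]:
    """Collect spectrogram images into N groups, one group per emotion label.

    Assumes an illustrative layout root/<emotion>/<sample>.png in which the
    directory name doubles as the label of every spectrogram inside it.
    """
    groups: dict[str, list[Path]] = defaultdict(list)
    for emotion in EMOTIONS:
        for path in sorted(Path(root, emotion).glob("*.png")):
            groups[emotion].append(path)
    return groups
```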
In step S102, the first network model is trained with each group of sample spectrogram signals until the preset training termination condition is reached, so as to obtain the first optimized parameters of the first network model.
The first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
The CNN+LSTM network model (i.e. the first network model) provided by the embodiments of the present application unifies a CNN and an LSTM in a single framework. The design rationale is that the CNN is good at image processing while the LSTM is good at time-series modelling, and a deep LSTM also has the ability to map features to a separable space; combining the two lets the complementary strengths of each be exploited, with the CNN extracting features from the spectrogram and the LSTM performing time-series modelling on those features. The embodiments of the present application simplify the AlexNet model (a stack of several convolutional network layers) by removing several convolutional layers, which reduces the amount of computation and suits operation on embedded devices. Experiments show that the more LSTM layers there are, the stronger the ability to learn sequence correlation information and the faster the convergence, so the number of LSTM layers is increased to strengthen the algorithm's sequence-learning ability and speed up convergence.
As shown in Fig. 2, AlexNet is the first sub-network, and cont is the timestamp layer, which passes temporal information to the LSTM network model for processing.
The first sub-network comprises: a first fully connected layer (fc1), a first activation function (relu4), a first dropout layer (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second dropout layer (drop5), a third fully connected layer (fc3), and a third activation function (relu6), for example the simplified AlexNet model shown in Fig. 3.
The second sub-network comprises: an input-data reshape layer (reshape-data), an input-timestamp reshape layer (reshape-cm), a first LSTM layer (lstm1), a third dropout layer (lstm1-drop), a second LSTM layer (lstm2), a third LSTM layer (lstm3), an output slice layer (slice), an LSTM-output reshape layer (reshape-lstm), a fourth fully connected layer (fc4), an input-label reshape layer (reshape-label), a loss function layer (loss), and a classification accuracy output (accuracy).
The output slice layer (slice) is used to cut out the sequence-end vector of the second sub-network's output; the sequence-end vector is used either to compute the error against the label and feed it back to correct the network weights, or to predict the emotional speech type. The second sub-network is, for example, an LSTM network model.
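A minimal PyTorch-style sketch of this layout is given below for orientation. The application implements the network in Caffe, so the framework, the trimmed convolutional front end, and the exact layer sizes here are assumptions; only the fc1 to fc3 head with dropout, the three stacked LSTM layers, the slice of the sequence-end vector, and the final fully connected classifier follow the layer lists above.
```python
import torch
import torch.nn as nn

class CnnLstmEmotionNet(nn.Module):
    """Sketch of the first/second network model: simplified CNN features per
    spectrogram frame, then a 3-layer LSTM over the frame sequence."""

    def __init__(self, feat_dim: int = 256, hidden: int = 128, n_classes: int = 4):
        super().__init__()
        # Simplified convolutional front end (stand-in for the trimmed AlexNet).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # fc1/relu4/drop4, fc2/relu5/drop5, fc3/relu6 of the first sub-network.
        self.head = nn.Sequential(
            nn.Linear(64 * 4 * 4, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, feat_dim), nn.ReLU(),
        )
        # Three stacked LSTM layers (lstm1..lstm3); dropout between the layers.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3,
                            batch_first=True, dropout=0.5)
        self.fc4 = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, 1, H, W), one spectrogram image per frame.
        b, t = x.shape[:2]
        feats = self.head(self.conv(x.flatten(0, 1)).flatten(1))
        out, _ = self.lstm(feats.view(b, t, -1))
        # "slice" layer: keep only the sequence-end vector of the LSTM output.
        return self.fc4(out[:, -1, :])
```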
During training three LSTM layers are used, each with 128 hidden neurons, and the sequence length (number of time steps) is set to 10. Mini-batch gradient descent is used with a batch size of 10 and 80,000 training iterations; the LSTM gradient-clipping threshold is 5, and the Adam optimisation method is used with a learning rate of 0.0005.
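A corresponding training-loop sketch, under the same PyTorch-style assumptions as the model sketch above, uses the hyperparameters just quoted (batch size 10, 80,000 iterations, gradient-clipping threshold 5, Adam with learning rate 0.0005); the data loader itself is a placeholder, not an API defined by the application.
```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, iterations: int = 80_000) -> None:
    """Train the CNN+LSTM model with the hyperparameters quoted above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    criterion = nn.CrossEntropyLoss()
    data_iter = iter(loader)                  # loader yields batches of 10 sequences
    for step in range(iterations):
        try:
            frames, labels = next(data_iter)  # frames: (10, 10, 1, H, W)
        except StopIteration:
            data_iter = iter(loader)
            frames, labels = next(data_iter)
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        # Gradient clipping with the threshold of 5 stated for the LSTM.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
```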
The preset training termination condition comprises at least one of the following conditions.
Condition 1: the percentage of correctly classified samples among all test samples exceeds a preset first percentage threshold.
For example, the preset first percentage threshold is 90%.
Condition 2: for each group of sample spectrogram signals, the percentage of correctly classified samples among the test samples of that group exceeds a preset second percentage threshold.
For example, the preset second percentage threshold is 90%.
The embodiments of the present application use 5-fold cross-validation: all the images contained in the 7000 sample spectrogram sequences (each sequence comprising 10 spectrograms) are divided into 5 parts; each time, the i-th part is used as the test set and the remainder as the training set. (Accuracy is computed per spectrogram image here, i.e. every image in every sequence counts as one sample, and the label of a spectrogram sequence is also the label of all the images of that sequence.) This yields an accuracy $C_i$ for each fold, and the final accuracy of the model on the emotional spectrogram dataset is $C = \frac{1}{5}\sum_{i=1}^{5} C_i$. The recognition accuracy obtained in this way is more reliable.
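A sketch of this 5-fold averaging is shown below; the train_fn and eval_fn callables are placeholders for the training procedure above and a per-image accuracy measurement, not APIs defined by the application.
```python
import numpy as np

def cross_validated_accuracy(samples: np.ndarray, labels: np.ndarray,
                             train_fn, eval_fn, k: int = 5) -> float:
    """5-fold cross validation: train on k-1 folds, test on the held-out fold,
    and average the per-fold accuracies C_i into the final accuracy C."""
    folds = np.array_split(np.random.permutation(len(samples)), k)
    accuracies = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(samples[train_idx], labels[train_idx])
        accuracies.append(eval_fn(model, samples[test_idx], labels[test_idx]))
    return float(np.mean(accuracies))   # C = (1/5) * sum_i C_i
```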
The embodiments of the present application simplify the CNN convolutional layers to reduce algorithmic complexity, increase the number of LSTM layers to strengthen the algorithm's ability to learn image-sequence relations, add a timestamp layer (cont) so that the LSTM can learn the correlations of image sequences of different lengths, and add an output slice layer (slice) to cut out the sequence-end vector of the second sub-network's output; the sequence-end vector is used either to compute the error against the label and feed it back to correct the network weights, or to predict the emotional speech type. This greatly reduces the amount of data the network processes and lowers the algorithmic complexity so that the algorithm can run on embedded devices.
Before the groups of training speech signals are obtained, the method further comprises the following steps.
Step 100-1: obtain the speech signal to be identified.
Step 100-2: preprocess the speech signal to be identified according to the preprocessing rules to obtain multiple frames of preprocessed speech.
Because of the physical characteristics of the speech signal itself, of the speaker's vocal organs, of the recording environment, and of various other factors, the speaker's raw speech generally cannot be processed directly; it must first go through a preprocessing step before it can be used for subsequent processing.
The preprocessing rules comprise: a pre-emphasis rule, a windowing and framing rule, and an endpoint detection rule.
Preprocessing the speech signal to be identified according to the preprocessing rules to obtain the multiple frames of preprocessed speech comprises the following steps.
Step 100-2-1: pre-emphasise the speech signal to be identified according to the pre-emphasis rule to obtain the first speech signal.
Because of glottal excitation and lip and nose radiation, the average power spectrum of a speech signal decays sharply in the high-frequency band: the higher the frequency, the smaller the corresponding spectral value. Speech is produced by vocal-cord vibration, shaped by the vocal tract, and then transmitted to the ear; in this process the high-frequency part of the speech is attenuated to some extent. Moreover, compared with the low-frequency part of the speech signal, the high-frequency information is harder to obtain yet better characterises the speaker's emotion.
Pre-emphasis processes the speech signal to be identified with a digital filter so as to amplify its high-frequency content.
The pre-emphasis rule is $H(z) = 1 - \mu z^{-1}$, which in the time domain gives the first speech signal $y(n) = x(n) - \mu\,x(n-1)$, where x(n) is the input speech signal to be identified, z is the complex variable of the z-transform, H(z) is the pre-emphasis filter, and μ is the pre-emphasis coefficient, whose typical range is 0.9 to 0.97; the embodiments of the present application use μ = 0.9375.
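A one-function NumPy sketch of this pre-emphasis rule (the float64 output type is an implementation choice, not part of the application):
```python
import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.9375) -> np.ndarray:
    """Apply the pre-emphasis filter y(n) = x(n) - mu * x(n-1).

    mu = 0.9375 is the coefficient quoted in the embodiment; any value in
    the stated 0.9-0.97 range behaves similarly.
    """
    y = np.empty_like(x, dtype=np.float64)
    y[0] = x[0]
    y[1:] = x[1:] - mu * x[:-1]
    return y
```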
Step 100-2-2: window and frame the first speech signal according to the windowing and framing rule to obtain multiple frames of second speech signals.
A speech signal is non-stationary and therefore cannot be analysed and processed directly. Its non-stationarity is caused by the resonances of the human vocal system, whose vibrations change relatively slowly, so within a short range of about 20 ms the signal can be regarded as a quasi-steady-state, i.e. short-time stationary, process. Before a speech signal is analysed and processed it is therefore first subjected to short-time processing: the signal is divided into several short-time speech segments, each of which is called a frame of the speech signal.
To keep the speech signal smooth after framing and to avoid losing part of the signal because of framing, adjacent frames are made to overlap: the frame shift is smaller than the frame length, meaning that neighbouring frames share a small common portion.
Windowing and framing a speech signal can be regarded as sliding a window function over the signal with a step equal to the frame shift. If the signal has already been framed, windowing simply applies a window function to each frame of the signal.
The windowing and framing rule is $s_{\omega}(n) = s(n)\,\omega(n)$, where s(n) is the first speech signal and ω(n) is the window function. Common window functions include the rectangular window, the Hanning window, and the Hamming window; the embodiments of the present application use the Hamming window as the window function for windowing and framing because of its good flatness and its strong suppression of the truncation effect. The Hamming window is $\omega(n) = 0.54 - 0.46\cos\bigl(2\pi n/(N-1)\bigr)$ for $0 \le n \le N-1$, where N is the frame length.
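A NumPy sketch of windowing and framing under the rule above; the 25 ms frame length and 10 ms frame shift at 16 kHz are illustrative values, since the application only requires the shift to be smaller than the frame length.
```python
import numpy as np

def frame_and_window(y: np.ndarray, frame_len: int = 400,
                     frame_shift: int = 160) -> np.ndarray:
    """Split the pre-emphasised signal into overlapping frames and apply a
    Hamming window to each frame (one windowed frame per output row)."""
    if len(y) < frame_len:
        raise ValueError("signal shorter than one frame")
    n_frames = 1 + (len(y) - frame_len) // frame_shift
    window = np.hamming(frame_len)        # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        y[i * frame_shift:i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    return frames * window
```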
Step 100-2-3: perform endpoint detection on each frame of the second speech signal according to the endpoint detection rule to obtain the multiple frames of preprocessed speech.
Endpoint detection is an important link in speech signal preprocessing. Because the recording environment is uncontrolled, the speech signal may contain interference such as environmental noise, long silences, and meaningless leading and trailing segments, which strongly affect the subsequent extraction of acoustic features and generation of spectrograms and therefore the recognition ability of the speech emotion recognition system. The purpose of endpoint detection is to find the start and end points of valid speech in the signal and to reject the invalid silence and environmental noise, so as to reduce the negative impact of such interference on the subsequent work as far as possible.
There are two common endpoint detection approaches: time-domain endpoint detection and frequency-domain endpoint detection.
Time-domain endpoint detection usually relies on volume; it requires little computation but easily misjudges unvoiced segments.
Frequency-domain endpoint detection can be subdivided into two kinds: one relies on the variability of the spectrum (the spectrum of voiced sound changes more regularly, which serves as the decision criterion), and the other relies on the entropy of the spectrum, called spectral entropy (the spectral entropy of segments containing speech is usually small).
The endpoint detection rule in the embodiments of the present application is specifically a time-domain endpoint detection rule: it is based on volume, supplemented by the zero-crossing rate as an important detection parameter. This method requires little computation and runs fast, and it also avoids, to a certain extent, the misjudgements caused by using volume alone for endpoint detection.
Specifically, it comprises the following steps.
Step 100-2-3-1: judge whether the value of the second speech signal exceeds a preset volume threshold.
The preset volume threshold is an empirical value; because a comprehensive check against the zero-crossing rate follows, the threshold should be set somewhat high rather than low.
Step 100-2-3-2: if so, the second speech signal is a voiced segment.
A voiced segment is the part of human speech with a relatively high sound level.
Step 100-2-3-3: if not, it is a silent segment.
A silent segment is one whose sound level is very low: essentially no sound, environmental noise, or an unvoiced segment.
Whether a given low-volume part is unvoiced speech (the low-level sounds a person produces) can be judged from the short-time zero-crossing rate. In general, in an indoor environment the zero-crossing rate of unvoiced speech is clearly higher than that of environmental noise and silence; a zero-crossing-rate threshold is therefore preset, and segments above the threshold are regarded as unvoiced speech while segments below it are regarded as environmental noise or silence.
For convenience, the start and end points of the voiced portion detected with the preset volume threshold are taken as the initial sound start point and sound end point. Starting from the sound start point and moving one frame at a time toward the beginning of the signal, the method judges whether the speech signal value exceeds the preset volume threshold; if so, the frame is regarded as voiced speech produced by the speaker and becomes the new sound start point; if not, the earlier part is environmental noise, silence, or unvoiced speech, and the preset zero-crossing-rate threshold is then used to decide whether it is unvoiced speech. The sound end point is handled in the same way, moving one frame at a time toward the end of the signal, and the procedure is not repeated here.
The short-time zero-crossing rate is the number of times the waveform crosses the zero axis within one frame of the speech signal. It is computed as $Z = \sum_{t=1}^{T-1} \pi\{\,s_t\,s_{t-1} < 0\,\}$, where $s_t$ is the value of the t-th sample, T is the frame length, and $\pi\{A\}$ equals 1 when A is true and 0 when A is false.
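A simplified per-frame sketch of this volume-plus-zero-crossing-rate check is given below; it keeps any frame that passes either test rather than searching for start and end points explicitly, and the choice of mean absolute amplitude as the volume measure is an assumption, since the application does not fix the exact volume definition.
```python
import numpy as np

def detect_speech_frames(frames: np.ndarray,
                         vol_threshold: float,
                         zcr_threshold: float) -> np.ndarray:
    """Time-domain endpoint detection on windowed frames.

    A frame is kept if its volume exceeds vol_threshold (likely voiced
    speech) or if its short-time zero-crossing rate exceeds zcr_threshold
    (likely unvoiced speech rather than silence or environmental noise).
    Both thresholds are empirical values.
    """
    volume = np.mean(np.abs(frames), axis=1)
    signs = np.sign(frames)
    # Z = number of samples with s_t * s_{t-1} < 0 within each frame
    zcr = np.sum(signs[:, 1:] * signs[:, :-1] < 0, axis=1)
    keep = (volume > vol_threshold) | (zcr > zcr_threshold)
    return frames[keep]
```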
Step 100-3: generate the corresponding spectrogram signal from each frame of preprocessed speech.
Traditional feature extraction uses banks of hand-designed filters and therefore loses part of the frequency-domain information. To avoid this problem, the embodiments of the present application provide the CNN+LSTM network model and feed the speaker's spectrogram directly into the network model, so that the spectral information of the speech signal is preserved as far as possible.
Suppose that after framing the discrete speech signal x(n) is written as $x_n(m)$, n = 0, 1, ..., N-1, where n is the frame index, m is the sample index within a frame, and N is the frame length.
The short-time Fourier transform of the signal x(n) is $X_n(e^{jw}) = \sum_{m=-\infty}^{\infty} x(m)\,\omega(n-m)\,e^{-jwm}$, where ω(n) is the window function; this is the discrete-time Fourier transform (DTFT) of the windowed frame. Sampling it at N uniformly spaced frequencies gives the discrete Fourier transform (DFT) $X(n,k) = \sum_{m=0}^{N-1} x_n(m)\,\omega(m)\,e^{-j2\pi km/N}$, $0 \le k \le N-1$. For $0 \le k \le N-1$, X(n,k) is the short-time spectral estimate of x(n), and the spectral energy density function at frame n is $p(n,k) = |X(n,k)|^{2}$. Taking n as the abscissa and k as the ordinate and plotting the value of p(n,k) as a grey level or colour yields a two-dimensional image, the spectrogram; converting it with $10\log_{10}(p(n,k))$ gives the spectrogram expressed in decibels.
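A NumPy sketch of the spectrogram computation described above, operating on the already-windowed frames; the small epsilon that guards the logarithm is an implementation detail, not part of the formulation.
```python
import numpy as np

def spectrogram_db(frames: np.ndarray) -> np.ndarray:
    """Compute a dB-scaled spectrogram from windowed frames.

    Each row of `frames` is one windowed frame; rfft gives the short-time
    spectrum X(n, k), |X|^2 the energy density p(n, k), and 10*log10 the
    decibel image described above.
    """
    spectrum = np.fft.rfft(frames, axis=1)          # X(n, k)
    power = np.abs(spectrum) ** 2                   # p(n, k) = |X(n, k)|^2
    return 10.0 * np.log10(power + 1e-12)           # dB-scaled spectrogram
```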
Step 100-4: label the spectrogram signals according to the N emotional speech types to obtain the N groups of sample spectrogram signals.
The human brain exhibits three main characteristics when recognising emotion: temporality, randomness, and real-time operation. Starting from these three characteristics, and taking intelligent nursing robots as the application background, the present application builds an embedded intelligent front-end recognition system based on speech emotion under open-environment conditions.
The present application adopts an LSTM-based dynamic emotion recognition method: the LSTM loops over the image sequence, learning and memorising the correlations within the sequence, and emotion is discriminated by combining the information of individual images with the sequence correlation information, which improves the accuracy and robustness of emotion recognition in open environments.
The present application simplifies the CNN convolutional layers to reduce algorithmic complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn image-sequence relations, adds a timestamp layer (cont) so that the LSTM can learn the correlations of image sequences of different lengths, and adds a slice layer to cut out the sequence-end vector of the second sub-network's output; the sequence-end vector is used either to compute the error against the label and feed it back to correct the network weights, or to predict the emotional speech type. This greatly reduces the amount of data the network processes and lowers the algorithmic complexity so that the algorithm can run on embedded devices.
The present application ports the speech emotion recognition algorithm and the network model trained on the server to an embedded platform (for example Huawei's Atlas 200 DK system on chip, which integrates a CPU, an NPU, and an ISP), realising an intelligent front-end speech emotion recognition system.
Corresponding to the first embodiment provided by the present application, the present application also provides a second embodiment, namely a training apparatus for identifying emotional speech. Since the second embodiment is substantially similar to the first embodiment, its description is relatively brief; for the relevant parts, refer to the corresponding explanation of the first embodiment. The apparatus embodiment described below is merely illustrative.
Fig. 4 shows an embodiment of the training apparatus for identifying emotional speech provided by the present application; Fig. 4 is a unit block diagram of the training apparatus for identifying emotional speech provided by an embodiment of the present application.
As shown in Fig. 4, the present application provides a training apparatus for identifying emotional speech, comprising: a sample obtaining unit 401 and a sample training unit 402.
The sample obtaining unit 401 is configured to successively obtain groups of sample spectrogram signals, wherein the spectrogram signals are divided into N groups according to N emotional speech types, the sample spectrogram signals within a group share the same emotional speech type, the emotional speech types differ between groups, and N is an integer greater than 1; each sample spectrogram signal carries a label marking its emotional speech type.
The sample training unit 402 is configured to train a first network model with each group of sample spectrogram signals until a preset training termination condition is reached, so as to obtain the first optimized parameters of the first network model.
The first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a second sub-network for accelerating algorithm convergence and classifying the emotional speech features.
Optionally, the apparatus further comprises a first preprocessing unit.
The preprocessing unit comprises:
an obtaining sub-unit for obtaining the speech signal to be identified;
a first preprocessing sub-unit for preprocessing the speech signal to be identified according to the preprocessing rules to obtain multiple frames of preprocessed speech;
a spectrogram signal generating sub-unit for generating the corresponding spectrogram signal from each frame of preprocessed speech;
a sample spectrogram signal obtaining sub-unit for labelling the spectrogram signals according to the N emotional speech types to obtain the N groups of sample spectrogram signals.
Optionally, the preprocessing rules comprise: a pre-emphasis rule, a windowing and framing rule, and an endpoint detection rule.
The preprocessing sub-unit comprises:
a pre-emphasis sub-unit for pre-emphasising the speech signal to be identified according to the pre-emphasis rule to obtain the first speech signal.
The pre-emphasis rule is $H(z) = 1 - \mu z^{-1}$, which in the time domain gives the first speech signal $y(n) = x(n) - \mu\,x(n-1)$, where x(n) is the input speech signal to be identified, z is the complex variable of the z-transform, H(z) is the pre-emphasis filter, and μ is the pre-emphasis coefficient, whose typical range is 0.9 to 0.97; the embodiments of the present application use μ = 0.9375.
A windowing and framing sub-unit is used to window and frame the first speech signal according to the windowing and framing rule to obtain multiple frames of second speech signals.
The windowing and framing rule is $s_{\omega}(n) = s(n)\,\omega(n)$, where s(n) is the first speech signal and ω(n) is the window function. Common window functions include the rectangular window, the Hanning window, and the Hamming window; the embodiments of the present application use the Hamming window as the window function for windowing and framing because of its good flatness and its strong suppression of the truncation effect. The Hamming window is $\omega(n) = 0.54 - 0.46\cos\bigl(2\pi n/(N-1)\bigr)$ for $0 \le n \le N-1$, where N is the frame length.
An endpoint detection sub-unit is used to perform endpoint detection on each frame of the second speech signal according to the endpoint detection rule to obtain the multiple frames of preprocessed speech.
Optionally, the endpoint detection rule is specifically a time-domain endpoint detection rule.
The endpoint detection sub-unit comprises:
a judging sub-unit for judging whether the value of the second speech signal exceeds a preset volume threshold;
the preset volume threshold being an empirical value; because a comprehensive check against the zero-crossing rate follows, the threshold should be set somewhat high rather than low;
a voiced-segment determining sub-unit: if the output of the judging sub-unit is "yes", the second speech signal is a voiced segment;
a voiced segment being the part of human speech with a relatively high sound level;
a silent-segment determining sub-unit: if not, it is a silent segment;
a silent segment being one whose sound level is very low: essentially no sound, environmental noise, or an unvoiced segment.
Whether a given low-volume part is unvoiced speech (the low-level sounds a person produces) can be judged from the short-time zero-crossing rate. In general, in an indoor environment the zero-crossing rate of unvoiced speech is clearly higher than that of environmental noise and silence; a zero-crossing-rate threshold is therefore preset, and segments above the threshold are regarded as unvoiced speech while segments below it are regarded as environmental noise or silence.
Optionally, the first sub-network comprises: a first fully connected layer (fc1), a first activation function (relu4), a first dropout layer (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second dropout layer (drop5), a third fully connected layer (fc3), and a third activation function (relu6).
Optionally, the second sub-network comprises: an input-data reshape layer (reshape-data), an input-timestamp reshape layer (reshape-cm), a first LSTM layer (lstm1), a third dropout layer (lstm1-drop), a second LSTM layer (lstm2), a third LSTM layer (lstm3), an output slice layer (slice), an LSTM-output reshape layer (reshape-lstm), a fourth fully connected layer (fc4), an input-label reshape layer (reshape-label), a loss function layer (loss), and a classification accuracy output (accuracy).
The output slice layer (slice) is used to cut out the sequence-end vector of the second sub-network's output; the sequence-end vector is used either to compute the error against the label and feed it back to correct the network weights, or to predict the emotional speech type. The second sub-network is, for example, an LSTM network model.
Optionally, the preset training termination condition comprises at least one of the following conditions:
Condition 1: the percentage of correctly classified samples among all test samples exceeds a preset first percentage threshold;
Condition 2: for each group of sample spectrogram signals, the percentage of correctly classified samples among the test samples of that group exceeds a preset second percentage threshold.
The human brain exhibits three main characteristics when recognising emotion: temporality, randomness, and real-time operation. Starting from these three characteristics, and taking intelligent nursing robots as the application background, the present application builds an embedded intelligent front-end recognition system based on speech emotion under open-environment conditions.
The present application adopts an LSTM-based dynamic emotion recognition method: the LSTM loops over the image sequence, learning and memorising the correlations within the sequence, and emotion is discriminated by combining the information of individual images with the sequence correlation information, which improves the accuracy and robustness of emotion recognition in open environments.
The present application simplifies the CNN convolutional layers to reduce algorithmic complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn image-sequence relations, adds a timestamp layer (cont) so that the LSTM can learn the correlations of image sequences of different lengths, and adds a slice layer to cut out the sequence-end vector of the second sub-network's output; the sequence-end vector is used either to compute the error against the label and feed it back to correct the network weights, or to predict the emotional speech type. This greatly reduces the amount of data the network processes and lowers the algorithmic complexity so that the algorithm can run on embedded devices.
The present application ports the speech emotion recognition algorithm and the network model trained on the server to an embedded platform (for example Huawei's Atlas 200 DK system on chip, which integrates a CPU, an NPU, and an ISP), realising an intelligent front-end speech emotion recognition system.
A third embodiment provided by the present application is a method for identifying emotional speech. Since the third embodiment is related to the first embodiment, its description is relatively brief; for the relevant parts, refer to the corresponding explanation of the first embodiment. The embodiment described below is merely illustrative.
The present embodiment is described in detail below with reference to Fig. 5 and Fig. 6, where Fig. 5 is a flowchart of the method for identifying emotional speech provided by an embodiment of the present application and Fig. 6 is a structural diagram of the second network model of that method.
As shown in Fig. 5, in step S501 the speech signal to be identified is obtained;
in step S502 the speech signal to be identified is preprocessed according to the preprocessing rules to obtain multiple frames of preprocessed speech;
in step S503 the corresponding spectrogram signal is obtained from each frame of preprocessed speech;
in step S504 each spectrogram signal is input into the second network model carrying the first optimized parameters to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotional speech features.
As shown in Fig. 6, AlexNet is the first sub-network.
The first sub-network comprises: a first fully connected layer (fc1), a first activation function (relu4), a first dropout layer (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second dropout layer (drop5), a third fully connected layer (fc3), and a third activation function (relu6).
Optionally, the third sub-network comprises: an input-data reshape layer (reshape-data), an input-timestamp reshape layer (reshape-cm), a first LSTM layer (lstm1), a third dropout layer (lstm1-drop), a second LSTM layer (lstm2), a third LSTM layer (lstm3), an LSTM-output reshape layer (reshape-lstm), a fourth fully connected layer (fc4), and a class-probability layer (softmax-loss).
The first network model and the second network model are based on the same underlying algorithm; Fig. 2 and Fig. 6 are data-flow diagrams drawn by inputting the training model file (train_val.prototxt) and the deployment model file (deploy.prototxt) into netscope, a drawing tool provided with Caffe.
The deploy.prototxt file is formed from the train_val.prototxt file by deletion: because of the nature of the two files, everything in train_val.prototxt that belongs to training is deleted in deploy.prototxt.
The final layers of the first network model used for training are the loss function layer (loss) and the classification accuracy output (accuracy), whereas the final layer of the second network model used for deployment is the class-probability layer (softmax-loss); neither carries any weight parameters. Inspecting the Caffe source code shows that the last layer of both network models is in fact an application of softmax regression: when the layer is defined as Softmax, the class-probability layer (softmax-loss) only computes the probabilities (the forward part), while the loss function layer (loss) and the classification accuracy output (accuracy) additionally compute what follows the probabilities (the backward part).
The second network model does not need to be trained, so everything after the probabilities that relates to training can be deleted. In contrast to the API defined in TensorFlow, where training and deployment use the same set of network-layer APIs and the network interface internally decides whether it is training or deploying, Caffe defines two sets of APIs for some network layers and uses different APIs in training and in deployment.
In addition, comparing Fig. 2 with Fig. 6 shows that, relative to the structure of the second network model, the structure of the first network model contains an output slice layer (slice) and a layer that reshapes the LSTM output dimension (reshape-lstm) between the last LSTM recurrent network layer, namely the third one (lstm3), and the fourth fully connected layer (fc4). Their main functions are as follows. The output slice layer (slice) is chiefly responsible for extracting the last element of the output sequence of the third LSTM recurrent network layer (lstm3) (each element in the sequence represents one image feature). The reshape-lstm layer is chiefly responsible for changing the dimension of the data output by the slice layer so that it matches the input data dimension of the fourth fully connected layer fc4. This is a deliberate design choice of the present invention: the input and output of the second sub-network are time series of equal length, and within an output time series of the second sub-network every element is influenced by all input elements of the same sequence that precede it. In other words, the last element of the lstm3 output sequence extracted by the slice layer is influenced by all preceding input elements of the same sequence and therefore represents the feature of the entire sequence better than the other elements of that sequence. The first network model uses exactly this element value together with the label to compute the error, and then performs the computation that follows the probability.
In the second network model shown in Fig. 6 the present invention does not place an output slice layer (slice). First, this layer needs no trained network weights and does not itself belong to the neural network algorithm; omitting the slice layer simply means that every element of the output sequence of the third LSTM recurrent network layer (lstm3) is output. An output slice layer (slice) could of course be added so that only the last element of the lstm3 output sequence is output. The difference in effect between the two implementations is that the former produces one recognition-result probability for every frame of speech captured by the microphone, whereas the latter produces one recognition-result probability for every captured time series (10 frames) of speech.
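The per-frame versus per-sequence behaviour described above can be seen in the short sketch below (illustrative PyTorch-style code with assumed tensor shapes; the patent realises the same operation with caffe Slice layers).

```python
import torch
import torch.nn as nn

seq_len, batch, hidden, num_classes = 10, 1, 256, 6     # assumed sizes
lstm3_out = torch.randn(batch, seq_len, hidden)          # output sequence of lstm3
fc4 = nn.Linear(hidden, num_classes)

# Without a slice layer (deployment, Fig. 6): one probability vector per frame
per_frame_probs = torch.softmax(fc4(lstm3_out), dim=-1)          # shape (1, 10, 6)

# With a slice layer (training, Fig. 2): keep only the last element of the sequence
last_element = lstm3_out[:, -1, :]                               # the "slice" operation
per_sequence_probs = torch.softmax(fc4(last_element), dim=-1)    # shape (1, 6): one per 10-frame series
```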
The human brain exhibits three main characteristics when performing emotion recognition: temporality, randomness, and real-time operation. Starting from these three characteristics, and taking the intelligent nursing robot as the application background, the present application builds an embedded intelligent front-end recognition system based on speech emotion under open-environment conditions.
The present application adopts an LSTM-based dynamic emotion recognition method: the LSTM performs cyclic acquisition of the image sequence and learns and memorizes the correlation information of the sequence, and emotion discrimination is carried out by combining single-image information with sequence-correlation information, which enhances the accuracy and robustness of emotion recognition in open environments.
The present application simplifies the CNN convolutional layers to reduce algorithm complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn the relations within an image sequence, adds a timestamp layer (cont) so that the LSTM can learn the correlations of image sequences of different lengths, and adds a slice layer for splitting off the sequence-end vector output by the second sub-network; the sequence-end vector is used together with the label to compute the error, feed it back and correct the network weights, or to predict the emotional speech type. This greatly reduces the amount of data the network has to process and lowers the algorithm complexity, making the algorithm suitable for running on embedded devices.
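The role of the timestamp layer (cont) can be illustrated as follows: caffe-style recurrent layers accept a per-frame continuation marker (0 at the first frame of each utterance, 1 afterwards) that tells the LSTM where one variable-length sequence ends and the next begins. The sketch below, with assumed frame counts, only shows how such markers could be constructed; it is not code from the patent.

```python
import numpy as np

def build_cont_markers(frames_per_utterance):
    """Continuation markers for caffe-style LSTM layers:
    0 marks the first frame of each utterance, 1 marks every following frame."""
    markers = []
    for n_frames in frames_per_utterance:
        markers.extend([0] + [1] * (n_frames - 1))
    return np.array(markers, dtype=np.float32)

# e.g. three utterances of different lengths
print(build_cont_markers([4, 2, 3]))   # [0. 1. 1. 1. 0. 1. 0. 1. 1.]
```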
The present application transplants the speech emotion recognition algorithm and the network model trained on the server to an embedded platform (for example the Huawei Atlas 200DK, a system-on-chip that integrates a CPU, an NPU, and an ISP), thereby realizing the speech emotion recognition system as an intelligent front-end recognition system.
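On the inference side, loading the trained weights and running a forward pass with pycaffe could look roughly like the sketch below. The file names, blob names ('data', 'cont') and shapes are assumptions for illustration only, and the actual port to the Atlas 200DK would use that platform's own toolchain rather than pycaffe.

```python
import numpy as np
import caffe

caffe.set_mode_cpu()
# deploy.prototxt plus the weights produced by server-side training (file names assumed)
net = caffe.Net('deploy.prototxt', 'emotion.caffemodel', caffe.TEST)

spectrograms = np.random.rand(10, 1, 64, 64).astype(np.float32)  # assumed 10-frame series
cont = np.array([0] + [1] * 9, dtype=np.float32)                 # sequence-continuation markers

net.blobs['data'].reshape(*spectrograms.shape)
net.blobs['data'].data[...] = spectrograms
net.blobs['cont'].reshape(*cont.shape)
net.blobs['cont'].data[...] = cont

out = net.forward()               # dict of output blobs
probs = list(out.values())[0]     # per-frame emotion probabilities
print(probs.argmax(axis=-1))
```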
Corresponding to the third embodiment provided by the present application, the present invention further provides a fourth embodiment, namely a device for recognizing emotional speech. Since the fourth embodiment is substantially similar to the third embodiment, its description is relatively brief; for the relevant parts, refer to the corresponding explanation of the third embodiment. The device embodiment described below is merely schematic.
Fig. 7 shows an embodiment of the device for recognizing emotional speech provided by the present application; it is a unit block diagram of the device for recognizing emotional speech provided by this embodiment of the application.
As shown in Fig. 7, the present application provides a device for recognizing emotional speech, comprising: a unit 701 for obtaining the voice signal to be recognized, a preprocessing unit 702, a unit 703 for obtaining spectrogram signals, and a unit 704 for obtaining the emotional speech type.
The unit 701 for obtaining the voice signal to be recognized is configured to obtain the voice signal to be recognized;
the preprocessing unit 702 is configured to preprocess the voice signal to be recognized according to preprocessing rules to obtain multiple frames of preprocessed voice signal;
the unit 703 for obtaining spectrogram signals is configured to obtain the corresponding spectrogram signal based on each frame of preprocessed voice signal;
the unit 704 for obtaining the emotional speech type is configured to input each spectrogram signal into the second network model carrying the first optimized parameters, to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating the convergence of the algorithm and classifying the emotional speech features.
The first sub-network comprises: a first fully connected layer (fc1), a first activation function (relu4), a first dropout layer for preventing over-fitting (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second dropout layer for preventing over-fitting (drop5), a third fully connected layer (fc3), and a third activation function (relu6).
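For orientation only, a PyTorch-style sketch of this fully connected stack is given below; as before, the input size, layer widths and dropout ratio are assumed values, and the patent's own implementation is defined in caffe prototxt.

```python
import torch.nn as nn

# Illustrative sketch of the first sub-network: fc1-relu4-drop4, fc2-relu5-drop5, fc3-relu6.
# The spectrogram input size and layer widths are assumptions, not values from the patent.
first_subnetwork = nn.Sequential(
    nn.Linear(64 * 64, 1024), nn.ReLU(), nn.Dropout(0.5),   # fc1, relu4, drop4
    nn.Linear(1024, 512),     nn.ReLU(), nn.Dropout(0.5),   # fc2, relu5, drop5
    nn.Linear(512, 256),      nn.ReLU(),                    # fc3, relu6
)
```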
Optionally, the third sub-network comprises: a layer that reshapes the input data dimension (reshape-data), a layer that reshapes the input timestamp dimension (reshape-cm), a first LSTM recurrent network layer (lstm1), a third dropout layer for preventing over-fitting (lstm1-drop), a second LSTM recurrent network layer (lstm2), a third LSTM recurrent network layer (lstm3), a layer that reshapes the LSTM output dimension (reshape-lstm), a fourth fully connected layer (fc4), and a class-probability computation layer (softmax-loss).
The human brain exhibits three main characteristics when performing emotion recognition: temporality, randomness, and real-time operation. Starting from these three characteristics, and taking the intelligent nursing robot as the application background, the present application builds an embedded intelligent front-end recognition system based on speech emotion under open-environment conditions.
The present application adopts an LSTM-based dynamic emotion recognition method: the LSTM performs cyclic acquisition of the image sequence and learns and memorizes the correlation information of the sequence, and emotion discrimination is carried out by combining single-image information with sequence-correlation information, which enhances the accuracy and robustness of emotion recognition in open environments.
The present application simplifies the CNN convolutional layers to reduce algorithm complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn the relations within an image sequence, adds a timestamp layer (cont) so that the LSTM can learn the correlations of image sequences of different lengths, and adds a slice layer for splitting off the sequence-end vector output by the second sub-network; the sequence-end vector is used together with the label to compute the error, feed it back and correct the network weights, or to predict the emotional speech type. This greatly reduces the amount of data the network has to process and lowers the algorithm complexity, making the algorithm suitable for running on embedded devices.
The present application transplants the speech emotion recognition algorithm and the network model trained on the server to an embedded platform (for example the Huawei Atlas 200DK, a system-on-chip that integrates a CPU, an NPU, and an ISP), thereby realizing the speech emotion recognition system as an intelligent front-end recognition system.
The above embodiments are merely exemplary embodiments of the present application and are not intended to limit it; the scope of protection of the present application is defined by the claims. Those skilled in the art may make various modifications or equivalent replacements to the present application within its spirit and scope of protection, and such modifications or equivalent replacements shall also be regarded as falling within the scope of protection of the present application.

Claims (10)

1. A training method for recognizing emotional speech, characterized by comprising:
obtaining groups of sample spectrogram signals in sequence, wherein the sample spectrogram signals are spectrogram signals divided into N groups according to N emotional speech types, one group being obtained at a time; the sample spectrogram signals within the same group have the same emotional speech type, the emotional speech types of the sample spectrogram signals differ between groups, and N is an integer greater than 1; the sample spectrogram signals include labels marking the emotional speech type;
training a first network model with each group of sample spectrogram signals in turn until a preset training termination condition is reached, so as to obtain first optimized parameters of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signals and reducing their dimensionality, and a second sub-network for accelerating the convergence of the algorithm and classifying the emotional speech features.
2. The training method according to claim 1, characterized in that, before said obtaining groups of sample spectrogram signals in sequence, the method further comprises:
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized according to preprocessing rules to obtain multiple frames of preprocessed voice signal;
generating the corresponding spectrogram signals based on each frame of preprocessed voice signal;
marking the spectrogram signals with the labels according to the N emotional speech types to obtain the N groups of sample spectrogram signals.
3. The training method according to claim 2, characterized in that the preprocessing rules comprise: a pre-emphasis rule, a windowing-and-framing rule, and an endpoint-detection rule;
and said preprocessing the voice signal to be recognized according to the preprocessing rules to obtain multiple frames of preprocessed voice signal comprises:
performing pre-emphasis on the voice signal to be recognized according to the pre-emphasis rule to obtain a first voice signal;
performing windowing and framing on the first voice signal according to the windowing-and-framing rule to obtain multiple frames of a second voice signal;
performing endpoint detection on each frame of the second voice signal according to the endpoint-detection rule to obtain the multiple frames of preprocessed voice signal.
4. The training method according to claim 3, characterized in that the endpoint-detection rule is specifically a time-domain endpoint-detection rule.
5. The training method according to claim 1, characterized in that the first sub-network comprises: a first fully connected layer, a first activation function, a first dropout layer for preventing over-fitting, a second fully connected layer, a second activation function, a second dropout layer for preventing over-fitting, a third fully connected layer, and a third activation function.
6. The training method according to claim 1, characterized in that the second sub-network comprises: a layer that reshapes the input data dimension, a layer that reshapes the input timestamp dimension, a first long short-term memory (LSTM) recurrent network layer, a third dropout layer for preventing over-fitting, a second LSTM recurrent network layer, a third LSTM recurrent network layer, an output slice layer, a layer that reshapes the LSTM output dimension, a fourth fully connected layer, a layer that reshapes the input label dimension, a loss function layer, and a classification-accuracy output layer;
wherein the output slice layer is used to split off the sequence-end vector output by the second sub-network; the sequence-end vector is used together with the label to compute the error, feed it back and correct the network weights, or to predict the emotional speech type.
7. The training method according to claim 1, characterized in that the preset training termination condition comprises at least one of the following conditions:
the percentage of correctly classified samples among the total number of test samples is greater than a preset percentage threshold;
for each group of sample spectrogram signals, the percentage of correctly classified samples among the total number of test samples of the corresponding group is greater than a preset second percentage threshold.
8. A training device for recognizing emotional speech, characterized by comprising:
a sample obtaining unit configured to obtain groups of sample spectrogram signals in sequence, wherein the sample spectrogram signals are spectrogram signals divided into N groups according to N emotional speech types, one group being obtained at a time; the sample spectrogram signals within the same group have the same emotional speech type, the emotional speech types of the sample spectrogram signals differ between groups, and N is an integer greater than 1; the sample spectrogram signals include labels marking the emotional speech type;
a sample training unit configured to train a first network model with each group of sample spectrogram signals in turn until a preset training termination condition is reached, so as to obtain first optimized parameters of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signals and reducing their dimensionality, and a second sub-network for accelerating the convergence of the algorithm and classifying the emotional speech features.
9. A method for recognizing emotional speech, characterized by comprising:
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized according to preprocessing rules to obtain multiple frames of preprocessed voice signal;
obtaining the corresponding spectrogram signal based on each frame of preprocessed voice signal;
inputting each spectrogram signal into a second network model carrying first optimized parameters to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating the convergence of the algorithm and classifying the emotional speech features.
10. A device for recognizing emotional speech, characterized by comprising:
a unit for obtaining the voice signal to be recognized, configured to obtain the voice signal to be recognized;
a preprocessing unit configured to preprocess the voice signal to be recognized according to preprocessing rules to obtain multiple frames of preprocessed voice signal;
a unit for obtaining spectrogram signals, configured to obtain the corresponding spectrogram signal based on each frame of preprocessed voice signal;
a unit for obtaining the emotional speech type, configured to input each spectrogram signal into a second network model carrying first optimized parameters to obtain the corresponding emotional speech type;
wherein the second network model comprises: a first sub-network for extracting emotional speech features from the spectrogram signal and reducing their dimensionality, and a third sub-network for accelerating the convergence of the algorithm and classifying the emotional speech features.
CN201910690493.9A 2019-07-29 2019-07-29 Method and device for recognizing emotion voice Active CN110415728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910690493.9A CN110415728B (en) 2019-07-29 2019-07-29 Method and device for recognizing emotion voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910690493.9A CN110415728B (en) 2019-07-29 2019-07-29 Method and device for recognizing emotion voice

Publications (2)

Publication Number Publication Date
CN110415728A true CN110415728A (en) 2019-11-05
CN110415728B CN110415728B (en) 2022-04-01

Family

ID=68363877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910690493.9A Active CN110415728B (en) 2019-07-29 2019-07-29 Method and device for recognizing emotion voice

Country Status (1)

Country Link
CN (1) CN110415728B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIM W: "Speech Emotion Recognition using Convolutional and Recurrent Neural Networks", 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference *
MIAO, Yuqing et al.: "Speech emotion recognition based on parameter migration and convolutional recurrent neural networks", Computer Engineering and Applications *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN111312292A (en) * 2020-02-18 2020-06-19 北京三快在线科技有限公司 Emotion recognition method and device based on voice, electronic equipment and storage medium
CN111326178A (en) * 2020-02-27 2020-06-23 长沙理工大学 Multi-mode speech emotion recognition system and method based on convolutional neural network
JP7443532B2 (en) 2020-04-09 2024-03-05 ワイピー ラブス カンパニー,リミテッド Service provision method and system based on user voice
JP2023503703A (en) * 2020-04-09 2023-01-31 ワイピー ラブス カンパニー,リミテッド Service provision method and system based on user voice
CN113689887A (en) * 2020-05-18 2021-11-23 辉达公司 Speech detection termination using one or more neural networks
CN111883178A (en) * 2020-07-17 2020-11-03 渤海大学 Double-channel voice-to-image-based emotion recognition method
CN112466324A (en) * 2020-11-13 2021-03-09 上海听见信息科技有限公司 Emotion analysis method, system, equipment and readable storage medium
CN112613481A (en) * 2021-01-04 2021-04-06 上海明略人工智能(集团)有限公司 Bearing abrasion early warning method and system based on frequency spectrum
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113628639A (en) * 2021-07-06 2021-11-09 哈尔滨理工大学 Voice emotion recognition method based on multi-head attention mechanism
CN113566948A (en) * 2021-07-09 2021-10-29 中煤科工集团沈阳研究院有限公司 Fault audio recognition and diagnosis method for robot coal pulverizer
CN113808620A (en) * 2021-08-27 2021-12-17 西藏大学 Tibetan language emotion recognition method based on CNN and LSTM
CN113808620B (en) * 2021-08-27 2023-03-21 西藏大学 Tibetan language emotion recognition method based on CNN and LSTM

Also Published As

Publication number Publication date
CN110415728B (en) 2022-04-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant