CN109817246A - Training method for an emotion recognition model, emotion recognition method, apparatus, device, and storage medium - Google Patents
Training method for an emotion recognition model, emotion recognition method, apparatus, device, and storage medium - Download PDF
- Publication number: CN109817246A
- Application number: CN201910145605.2A
- Authority: CN (China)
- Prior art keywords: speech information, emotion, frequency, mel, emotion recognition
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
- G10L25/03—…characterised by the type of extracted parameters
- G10L25/24—…the extracted parameters being the cepstrum
- G10L25/27—…characterised by the analysis technique
- G10L25/30—…using neural networks
- G10L25/45—…characterised by the type of analysis window
- G10L25/48—…specially adapted for particular use
- G10L25/51—…for comparison or discrimination
- G10L25/63—…for estimating an emotional state
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- User Interface Of Digital Computer (AREA)
- Image Analysis (AREA)
Abstract
This application relates to the field of intelligent decision-making and trains an emotion recognition model based on deep learning. Specifically disclosed are a training method for an emotion recognition model, an emotion recognition method, an apparatus, a computer device, and a storage medium. The method comprises: obtaining speech information of a user and a corresponding data label; constructing sample data from the speech information and the corresponding data label; preprocessing the speech information in the sample data according to a preset processing rule to obtain corresponding spectral vectors; extracting a preset recurrent neural network, the recurrent neural network comprising an attention mechanism used to emphasize partial regions of the speech information; and, based on the recurrent neural network, performing model training on the spectral vectors corresponding to the speech information and the data labels to obtain the emotion recognition model. The method improves the generalizability of the emotion recognition model and the accuracy of its recognition.
Description
Technical field
This application relates to the technical field of model training, and in particular to a training method for an emotion recognition model, an emotion recognition method, an apparatus, a computer device, and a storage medium.
Background
In recent years, emotion recognition models that identify user emotion from speech based on machine learning have developed extensively, but emotion recognition from sound still faces many challenges. For example, to produce sustained and accurate recognition of positive and negative emotion, some approaches combine text and acoustic features; such approaches require automatic speech recognition (ASR) technology to convert sound into text, which introduces serious latency. Meanwhile, emotion recognition models also suffer from poor generalizability: when a model is applied to a new speaker, its accuracy drops.
Summary of the invention
This application provides a training method for an emotion recognition model, an emotion recognition method, an apparatus, a computer device, and a storage medium, so as to improve the generalizability of the emotion recognition model and the accuracy of recognition.
In a first aspect, this application provides a training method for an emotion recognition model, the method comprising:
obtaining speech information of a user and a data label corresponding to the speech information;
constructing sample data from the speech information and the corresponding data label;
preprocessing the speech information in the sample data according to a preset processing rule to obtain corresponding spectral vectors;
extracting a preset recurrent neural network, the recurrent neural network comprising an attention mechanism, the attention mechanism being used to emphasize partial regions of the speech information;
based on the recurrent neural network, performing model training on the spectral vectors corresponding to the speech information and the data labels to obtain the emotion recognition model.
In a second aspect, this application further provides an emotion recognition method, the method comprising:
collecting a speech signal of a user;
preprocessing the speech signal according to a preset processing rule to obtain spectral vectors corresponding to the speech signal;
inputting the spectral vectors into an emotion recognition model to identify the user's emotion and obtain the user's emotion category, the emotion recognition model being a model trained with the training method for an emotion recognition model described above.
In a third aspect, this application further provides a training apparatus for an emotion recognition model, the apparatus comprising:
an acquisition unit for obtaining speech information of a user and a data label corresponding to the speech information;
a sample construction unit for constructing sample data from the speech information and the corresponding data label;
a preprocessing unit for preprocessing the speech information in the sample data according to a preset processing rule to obtain corresponding spectral vectors;
an extraction unit for extracting a preset recurrent neural network, the recurrent neural network comprising an attention mechanism, the attention mechanism being used to emphasize partial regions of the speech information;
a model training unit for performing model training, based on the recurrent neural network, on the spectral vectors corresponding to the speech information and the data labels to obtain the emotion recognition model.
In a fourth aspect, this application further provides an emotion recognition apparatus, the apparatus comprising:
a signal collection unit for collecting a speech signal of a user;
a signal processing unit for preprocessing the speech signal according to a preset processing rule to obtain spectral vectors corresponding to the speech signal;
an emotion recognition unit for inputting the spectral vectors into an emotion recognition model to identify the user's emotion and obtain the user's emotion category, the emotion recognition model being a model trained with the training method for an emotion recognition model described above.
In a fifth aspect, this application further provides a computer device comprising a memory and a processor; the memory stores a computer program, and the processor executes the computer program and, when executing it, implements the training method for an emotion recognition model or the emotion recognition method described above.
In a sixth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the training method for an emotion recognition model or the emotion recognition method described above.
This application discloses a training method, apparatus, device, and storage medium for an emotion recognition model. After obtaining the speech information of a user and the corresponding data label, the method preprocesses the speech information according to a preset processing rule to obtain corresponding spectral vectors, and then, based on a preset recurrent neural network, performs model training on the spectral vectors and data labels to obtain the emotion recognition model, wherein the recurrent neural network comprises an attention mechanism used to emphasize partial regions of the speech information. The emotion recognition model trained by this method has strong generalizability and high recognition accuracy.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flow diagram of a training method for an emotion recognition model provided by an embodiment of this application;
Fig. 2 is a schematic structural diagram of the recurrent neural network provided by an embodiment of this application;
Fig. 3 is a schematic flow diagram of sub-steps of the training method for an emotion recognition model in Fig. 1;
Fig. 4 is a schematic flow diagram of another training method for an emotion recognition model provided by an embodiment of this application;
Fig. 5 is a schematic flow diagram of an emotion recognition method provided by an embodiment of this application;
Fig. 6 is a schematic block diagram of a model training apparatus provided by an embodiment of this application;
Fig. 7 is a schematic block diagram of another model training apparatus provided by an embodiment of this application;
Fig. 8 is a schematic block diagram of an emotion recognition device provided by an embodiment of this application;
Fig. 9 is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings in the embodiments. Obviously, the described embodiments are some rather than all of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
The flow charts shown in the drawings are only illustrative; they need not include all contents and operations/steps, nor must the steps be executed in the order described. For example, some operations/steps can be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
The embodiments of this application provide a training method for an emotion recognition model, an emotion recognition method, an apparatus, a computer device, and a storage medium. The training method for the emotion recognition model can be performed on a server; the emotion recognition method can be applied in a terminal or a server to identify the emotion type of a user, for example happiness or sadness, from the user's speech.
The server can be an independent server or a server cluster. The terminal can be an electronic device such as a mobile phone, a tablet computer, a laptop, a desktop computer, a personal digital assistant, or a wearable device.
Some embodiments of this application are described in detail below with reference to the accompanying drawings. In the absence of conflict, the following embodiments and the features in the embodiments can be combined with each other.
Please refer to Fig. 1, which is a schematic flow diagram of a training method for an emotion recognition model provided by an embodiment of this application. The emotion recognition model is obtained by performing model training based on a preset recurrent neural network.
As shown in Fig. 2, which is a schematic structural diagram of a preset recurrent neural network provided by an embodiment of this application, the structure of the recurrent neural network includes an input layer, a recurrent layer, an attention mechanism, a fully connected layer, and an output layer. The attention mechanism establishes, according to an attention equation, a mapping relationship between the output of the recurrent layer and weight vectors, so as to emphasize partial regions of the speech information and thereby improve the recognition accuracy of the model.
The recurrent layer includes long short-term memory (LSTM) units, and the output layer uses a Softmax output. In this structure, the temporal dependence in the input sequence fed to the input layer is modeled by a recurrent layer composed of LSTM units. The attention mechanism is applied to the output of the recurrent layer at each time point in the sequence: it assigns larger weights to certain regions of the sequence, namely the regions that are important when recognizing positive and negative emotion. Compared with other recurrent neural networks (RNNs), this recurrent neural network can learn long-term dependence without suffering from vanishing or exploding gradients, yielding better recognition results.
The training method for the emotion recognition model provided by the embodiments of this application is introduced below with reference to the structure of the recurrent neural network in Fig. 2.
As shown in Fig. 1, the training method for the emotion recognition model is used to train an emotion recognition model that quickly and accurately identifies the emotion type of a user. The training method includes steps S101 to S105.
S101: Obtain speech information of a user and a data label corresponding to the speech information.
The data label is the user's emotion label, such as a positive-emotion label, a neutral-emotion label, or a negative-emotion label. Of course, the speech information can also be divided into more classes with correspondingly more data labels, for example happy, sad, fearful, or neutral; different data labels represent different emotions of the user.
Specifically, the speech information of the user is obtained from a preset database that also stores the corresponding data labels. Before this, the method further includes: collecting the speech information of the user, labeling the speech information with a data label, and storing the labeled speech information in the preset database. The users may come from different populations, such as children, young people, middle-aged people, and the elderly; it is understood that they may also come from different occupations, such as teachers, students, doctors, lawyers, and IT personnel, so as to enrich the diversity of the sample data.
In one embodiment, in order to improve the recognition accuracy of the model, the speech information is collected in a controlled way. That is, obtaining the speech information of the user and the corresponding data label comprises: obtaining the speech information recorded while the user tells stories of different emotion types, and the data labels generated by the user scoring the emotion of that speech information.
Specifically, the user is first asked to tell two pessimistic stories and two optimistic stories, and the corresponding speech information is collected. Before or after telling each story, the user scores their own mood according to a scoring criterion, for example scores of 0-5 represent negative emotion and scores of 6-10 represent positive emotion, and the corresponding data label is generated from the score. For example, a score of 4 yields a negative-emotion label for that speech information.
Of course, the collected speech information of the two pessimistic stories and two optimistic stories can also be scored in segments, with the data label of each segment determined by its segment score. For example, the speech information is divided into two speech segments: the first segment scores 0, so its data label is negative emotion; the second segment scores 10, so its data label is positive emotion.
S102: Construct sample data from the speech information and the corresponding data labels.
Specifically, the sample data is composed of the collected speech information of the users and the corresponding data labels. There may be multiple users; the specific number is not limited here. Since users' emotions differ, the sample data includes positive samples and negative samples: positive samples correspond to speech information with positive emotion, such as optimism, happiness, and excitement; negative samples correspond to speech information with negative emotion, such as pessimism, sadness, and pain.
S103: Preprocess the speech information in the sample data according to a preset processing rule to obtain corresponding spectral vectors.
The preset processing rule converts the speech information in the sample data into frequency-domain information, for example converting speech information collected in the time domain into frequency-domain information using a fast Fourier transform rule or a wavelet transform rule.
In one embodiment, in order to speed up model training and improve recognition precision, a preprocessing rule is used, as shown in Fig. 3; that is, step S103 includes sub-steps S103a to S103d.
S103a: Perform framing and windowing on the speech information in the sample data to obtain processed speech information.
In the framing and windowing, the frame length is specifically set to 40 ms. The speech information is split according to the 40 ms frame length, and each resulting segment is then multiplied by a Hamming window function so that it can subsequently be expanded by a Fourier transform.
It should be noted that the frame length set in the framing and windowing can take other values, for example 50 ms, 30 ms, or others.
In one embodiment, before framing and windowing the speech information in the sample data, pre-emphasis can also be applied to the speech information, specifically by multiplying it by a predetermined coefficient positively correlated with the frequency of the speech information, so as to boost the amplitude of high frequencies. The size of the predetermined coefficient is associated with the parameters of model training, i.e. it changes as the model parameters change; for example, it is associated with the weight vectors a_i, increasing as the mean of a_i increases and decreasing as that mean decreases. The purpose is to better improve the recognition accuracy of the model.
In an alternative embodiment, the predetermined coefficient can be set to an empirical value used to cancel the effect of the vocal cords and lips during vocalization, compensating the high-frequency part of the speech information suppressed by the articulatory system and highlighting the high-frequency formants.
S103b: Perform a frequency-domain transform on the processed speech information to obtain a corresponding amplitude spectrum.
Specifically, a fast Fourier transform (FFT) is applied to the processed speech information to obtain the corresponding parameters. In this embodiment, the amplitude after the fast Fourier transform is taken as the amplitude spectrum. Of course, other parameters after the FFT could also be used, for example the amplitude plus phase information.
It is understood that a wavelet transform can also be applied to the processed speech information to obtain the corresponding parameters, with the transformed amplitude selected as the amplitude spectrum.
S103c: Filter the amplitude spectrum through a Mel filter bank, and apply a discrete cosine transform to the filtered amplitude spectrum to obtain Mel-frequency cepstral coefficients.
Specifically, filtering the amplitude spectrum through the Mel filter bank comprises: obtaining the maximum frequency corresponding to the speech information and computing the corresponding Mel frequency with the Mel-frequency formula; computing the Mel spacing between the center frequencies of two adjacent triangular filters from the computed Mel frequency and the number of triangular filters in the Mel filter bank; distributing the triangular filters linearly according to that Mel spacing; and filtering the amplitude spectrum with the linearly distributed triangular filters.
The Mel filter bank specifically includes 40 triangular filters distributed linearly on the Mel scale. The amplitude spectrum is filtered through these 40 filters, and a discrete cosine transform is then applied to obtain the Mel-frequency cepstral coefficients.
That is, the maximum frequency in the speech information is determined; the maximum Mel frequency is computed from it with the Mel-frequency formula; the spacing between the center frequencies of two adjacent triangular filters is computed from the maximum Mel frequency and the number of triangular filters (40); and the filters are distributed linearly according to the computed spacing.
The Mel-frequency formula is:
f_mel = A · log10(1 + f/700)   (1)
In formula (1), f_mel is the Mel frequency, f is the maximum frequency corresponding to the speech information, and A is a coefficient, specifically 2595.
For example, if the determined maximum frequency is 4000 Hz, formula (1) gives a maximum Mel frequency of about 2146.1 mel.
Since the center frequencies of the triangular filters are distributed linearly at equal intervals on the Mel scale, the spacing between the center frequencies of two adjacent triangular filters can be computed as:
Δmel = f_mel / (K + 1)   (2)
where Δmel is the spacing between the center frequencies of two adjacent triangular filters, and K is the number of triangular filters.
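Formulas (1) and (2) can be checked numerically. The sketch below reproduces the 4000 Hz example and the center-frequency spacing for K = 40 filters; exact rounding may differ slightly from the figure quoted in the text.

```python
import math

A = 2595.0  # coefficient from formula (1)

def hz_to_mel(f):
    """Formula (1): convert a frequency in Hz to the Mel scale."""
    return A * math.log10(1 + f / 700.0)

max_mel = hz_to_mel(4000)        # maximum Mel frequency for 4000 Hz
delta_mel = max_mel / (40 + 1)   # formula (2): spacing for K = 40 filters
print(round(max_mel, 1), round(delta_mel, 1))  # 2146.0 52.3
```

The computed maximum Mel frequency matches the roughly 2146.1 mel quoted above, and the resulting center frequencies are then mapped back to Hz to place the 40 triangular filters.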
S103d, the mel-frequency cepstrum coefficient is normalized to obtain the corresponding frequency of the voice messaging
Compose vector.
Specifically, zero-mean normalization is used to normalize the Mel-frequency cepstral coefficients to obtain the spectral vector corresponding to the voice information. The conversion formula corresponding to zero-mean normalization is:

x* = (x − x̄) / σ

where x̄ is the mean of the Mel-frequency cepstral coefficients; σ is the standard deviation of the Mel-frequency cepstral coefficients; x is each Mel-frequency cepstral coefficient; and x* is the Mel-frequency cepstral coefficient after normalization.
The zero-mean normalization used here (Z-score standardization) is also known as standard-deviation standardization. The processed data have a mean of 0 and a standard deviation of 1. Z-score standardization uniformly converts data of different magnitudes to the same magnitude, measured uniformly by the calculated Z-score value, so as to guarantee comparability between the data.
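A minimal sketch of this zero-mean (Z-score) normalization applied to a vector of cepstral coefficients (the sample values are illustrative):

```python
import numpy as np

def z_score(x):
    # x* = (x - mean) / std: the result has zero mean and unit standard deviation
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

coeffs = np.array([12.3, -4.1, 7.8, 0.5, -9.6])
normed = z_score(coeffs)
```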
S104. A preset recurrent neural network is extracted, the recurrent neural network including an attention mechanism, the attention mechanism being used to reinforce a partial region in the voice information.
Wherein, the structure of the recurrent neural network includes an input layer, a recurrent layer, the attention mechanism, a fully connected layer and an output layer; the attention mechanism is used to establish, according to an attention equation, a mapping relationship between the output quantities of the recurrent layer and the weight vectors, so as to reinforce the partial region in the voice information.
The attention equation is:

g = Σ_i a_i · h_i

where g is the input vector of the fully connected layer; h_i is the output quantity of the recurrent layer corresponding to each time point i; and a_i is the weight vector corresponding to each time point i, representing the magnitude of the influence of each time point i on the fully connected layer and the output layer.
The key of the attention mechanism is learning this equation: for each time point i, the equation establishes a mapping relationship between the output h_i of the recurrent layer and a weight vector a_i, where h_i denotes the output of the recurrent layer and a_i represents the magnitude of the influence of each time point on the subsequent layers of the network.
Wherein, the parameters in f(h_i) can be optimized during training; its expression is specifically:

f(h_i) = tanh(W · h_i + b)    (4)

In formula (4), W and b are the parameters of a linear equation, and h_i is the output of the LSTM layer at each time point i, expressed as h = (h_0, ..., h_{T−1}), where T is the total number of time points in a given sequence. In this embodiment a simplified form of the expression is used, namely a linear function followed by a tanh activation function as in formula (4), which both achieves good results and improves the training speed of the model.
For a given time point i, the weight vector a_i is given by:

a_i = exp(u^T f(h_i)) / Σ_{j=0}^{T−1} exp(u^T f(h_j))    (5)

In formula (5), W is a matrix parameter of dimension S×D, S is a positive integer, b and u are vector parameters of dimension S, and D is the number of network units in the recurrent layer.
It should be noted that the vector g serves as the input of the fully connected layer, whose activation function is the ReLU function; the fully connected layer is followed by a Softmax function to obtain the final output.
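A numerical sketch of formulas (4) and (5) followed by the weighted sum g, using NumPy; the shapes T, D and S below are illustrative, not values from the original:

```python
import numpy as np

def attention_pool(h, W, b, u):
    """h: (T, D) recurrent-layer outputs; W: (S, D); b, u: (S,)."""
    scores = np.tanh(h @ W.T + b) @ u   # u^T f(h_i), with f(h_i) = tanh(W h_i + b), formula (4)
    a = np.exp(scores - scores.max())
    a /= a.sum()                        # softmax over the T time points -> weights a_i, formula (5)
    g = a @ h                           # g = sum_i a_i h_i, input vector of the fully connected layer
    return g, a

rng = np.random.default_rng(0)
T, D, S = 5, 8, 4
g, a = attention_pool(rng.normal(size=(T, D)),
                      rng.normal(size=(S, D)),
                      rng.normal(size=S),
                      rng.normal(size=S))
```

The weights a_i are non-negative and sum to 1, so g is a convex combination of the per-time-point outputs, with larger weight given to the time points the mechanism learns to emphasize.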
S105. Based on the recurrent neural network, model training is performed according to the spectral vectors corresponding to the voice information and the data labels to obtain the emotion recognition model.
Specifically, the spectral vectors are input to the preset recurrent neural network for model training; the attention mechanism in the improved model reinforces the important parts of the sound, and the corresponding model parameters are optimized to obtain the emotion recognition model. The model training parameters are shown in Table 1.
Table 1. Relevant parameters of the training network
Parameter type | Parameter value |
Optimization algorithm | Adam |
Learning rate | 0.0005 |
Number of LSTM units | 128 |
Number of neurons in the fully connected layer | 20 |
Dropout keep probability | 0.7 |
After obtaining the voice information of the user and the corresponding data labels, the model training method provided by the above embodiment pre-processes the voice information according to the preset processing rule to obtain the corresponding spectral vectors, and then, based on the preset recurrent neural network, performs model training according to the spectral vectors corresponding to the voice information and the data labels to obtain the emotion recognition model; the recurrent neural network includes an attention mechanism used to reinforce the partial region in the voice information. The emotion recognition model trained by this method has strong generalization ability and high recognition accuracy.
Referring to Fig. 4, Fig. 4 is a schematic flowchart of the training method of another emotion recognition model provided by an embodiment of the present application. The emotion recognition model is obtained by performing model training based on a preset recurrent neural network; of course, other networks may also be used for training.
As shown in Fig. 4, the training method of the emotion recognition model includes steps S201 to S207.
S201. Obtain the voice information of a user and the data label corresponding to the voice information.
Wherein, the data label is an affective label of the user, including a positive-emotion label, a neutral-emotion label, a negative-emotion label, and the like. Of course, the voice information may also be divided into more classes, with correspondingly more data labels, such as happy, sad, afraid or neutral; different data labels represent different emotions of the user.
S202. Construct sample data according to the voice information and the corresponding data labels, the sample data including at least positive sample data and negative sample data.
Specifically, the sample data can be constructed from the acquired voice information of the user and the corresponding data labels. Since the emotions of users differ, the sample data includes at least positive sample data and negative sample data, and may also include, for example, neutral sample data. Positive sample data correspond to voice information with positive emotion; negative sample data correspond to voice information with negative emotion.
S203. Judge whether the positive sample data and the negative sample data in the sample data are balanced.
Specifically, whether the positive sample data and the negative sample data in the sample data are balanced is judged, and a judging result is generated; the judging result is either that the positive sample data and the negative sample data are balanced, or that they are unbalanced.
Wherein, if the positive sample data and the negative sample data are unbalanced, step S204 is executed; if they are balanced, step S205 is executed.
S204. Process the sample data according to a preset data processing rule so that the positive sample data and the negative sample data reach balance.
If the positive sample data and the negative sample data are unbalanced, the sample data is processed according to the preset data processing rule so that the positive sample data and the negative sample data reach balance. Specifically, the sample data can be processed in either of two ways to balance the positive sample data and the negative sample data, as follows:
One: process the sample data by over-sampling. In the constructed sample data, one class, usually the negative sample data, is smaller than the other; specifically, the negative sample data are duplicated several times and combined with the positive sample data to compose the training sample data. Because the negative sample data in the resulting training set have been replicated several times, the new sample data thus constituted solve the problem of sample imbalance.
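The over-sampling idea can be sketched as follows; the helper and the binary label convention (1 = positive, 0 = negative) are illustrative assumptions, not part of the original:

```python
import random

def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until both classes are equally represented."""
    rng = random.Random(seed)
    pos = [(s, y) for s, y in zip(samples, labels) if y == 1]
    neg = [(s, y) for s, y in zip(samples, labels) if y == 0]
    minority, majority = (neg, pos) if len(neg) < len(pos) else (pos, neg)
    # draw minority samples with replacement until the two classes are the same size
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = pos + neg + extra
    rng.shuffle(balanced)
    return [s for s, _ in balanced], [y for _, y in balanced]

X, y = oversample(["a", "b", "c", "d", "e"], [1, 1, 1, 1, 0])  # 4 positive, 1 negative
```

After over-sampling, the single negative sample appears four times, so the training set contains four examples of each class.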
Two: process the sample data by setting a weighted loss function. The model weights θ are optimized during training by minimizing either a standard cross-entropy function or a weighted cross-entropy function. Through the idea of weighting, when a sample is known during training to belong to the rare negative class, the model parameters are adjusted according to its weight, so as to increase the influence of the negative samples.
Wherein, the expression corresponding to the standard cross-entropy loss function is:

L(θ) = −Σ_n log ŷ_n(C_n)

where ŷ_n is the Softmax output for each observed sequence n, whose input X is a matrix of dimension F×D, with F representing the number of spectral coefficients input at each time point; C_n is the class label corresponding to each observed sequence n, with value range {0, 1}, or of course {0, 1, 2}, corresponding respectively to negative, neutral and positive samples. Of course, a weighted cross-entropy function may also be used; it is similar to the standard cross-entropy loss function, and the goal of both is to solve the problem of unbalanced sample data.
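A sketch of the weighted cross-entropy idea; the per-class weight list is an illustrative assumption, with a larger weight on the rarer class so that it contributes more to the loss:

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """probs: softmax outputs per sample; labels: true class indices."""
    loss = 0.0
    for p, c in zip(probs, labels):
        loss += -class_weights[c] * math.log(p[c])  # -w_c * log(probability of the true class)
    return loss / len(labels)

# the rare negative class (index 0) is weighted 3x relative to the positive class
loss = weighted_cross_entropy([[0.9, 0.1], [0.2, 0.8]], [0, 1], [3.0, 1.0])
```

With uniform weights this reduces to the standard cross-entropy loss; raising the weight of the negative class increases the penalty for misclassifying negative samples.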
S205. Pre-process the voice information in the sample data according to the preset processing rule to obtain the corresponding spectral vectors.
Specifically, if the positive sample data and the negative sample data are balanced, the voice information in the sample data is pre-processed according to the preset processing rule to obtain the corresponding spectral vectors. Wherein, the preset processing rule is used to convert the voice information in the sample data into information in the frequency domain, specifically by using a fast Fourier transform rule or a wavelet transform rule to convert the voice information acquired in the time domain into information in the frequency domain.
S206. Extract the preset recurrent neural network, the recurrent neural network including an attention mechanism used to reinforce a partial region in the voice information.
Wherein, the structure of the recurrent neural network includes an input layer, a recurrent layer, the attention mechanism, a fully connected layer and an output layer; the attention mechanism is used to establish, according to the attention equation, a mapping relationship between the output quantities of the recurrent layer and the weight vectors, so as to reinforce the partial region in the voice information.
S207. Based on the recurrent neural network, perform model training according to the spectral vectors corresponding to the voice information and the data labels to obtain the emotion recognition model.
Specifically, the spectral vectors are input to the preset recurrent neural network for model training; the attention mechanism in the improved model reinforces the important parts of the sound, and the corresponding model parameters are optimized to obtain the emotion recognition model.
After obtaining the voice information of the user and the corresponding data labels, the model training method provided by the above embodiment pre-processes the voice information according to the preset processing rule to obtain the corresponding spectral vectors once the sample data reach balance, and then, based on the preset recurrent neural network, performs model training according to the spectral vectors corresponding to the voice information and the data labels to obtain the emotion recognition model; the recurrent neural network includes an attention mechanism used to reinforce the partial region in the voice information. The emotion recognition model trained by this method has strong generalization ability and high recognition accuracy. Moreover, since extreme emotions are often much rarer than neutral emotions, sample imbalance can lead to over-fitting; this method solves the sample-imbalance problem well and thereby improves the accuracy of the model.
Referring to Fig. 5, Fig. 5 is a schematic flowchart of an emotion recognition method provided by an embodiment of the present application. The emotion recognition method can be applied in a terminal or a server to recognize the emotion of a user according to the user's voice.
As shown in Fig. 5, the emotion recognition method includes steps S301 to S303.
S301. Acquire a voice signal of the user.
Specifically, the corresponding voice signal can be acquired by a recording device while chatting with the user; the recording device may be, for example, a voice recorder, a smartphone, a tablet computer, a notebook or an intelligent wearable device, such as a smart bracelet or a smartwatch.
S302. Pre-process the voice signal according to the preset processing rule to obtain the spectral vector corresponding to the voice signal.
Specifically, pre-processing the voice signal according to the preset processing rule to obtain the spectral vector corresponding to the voice signal comprises: performing framing and windowing on the voice information to obtain processed voice information; performing a fast Fourier transform on the processed voice information to obtain an amplitude spectrum; applying a Mel filter bank to the amplitude spectrum, and performing a discrete cosine transform on the output of the Mel filter bank to obtain the Mel-frequency cepstral coefficients; and normalizing each obtained Mel-frequency cepstral coefficient to obtain the spectral vector corresponding to the voice information.
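The steps above (framing and windowing, FFT, Mel filter bank, DCT, zero-mean normalization) can be sketched end to end as follows; the sample rate, frame length, hop size and coefficient count are illustrative assumptions, not values from the original:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, frame_len=256, hop=128, n_filters=40, n_ceps=13):
    # 1. framing with a Hamming window
    frames = np.array([signal[i:i + frame_len] * np.hamming(frame_len)
                       for i in range(0, len(signal) - frame_len + 1, hop)])
    # 2. FFT -> amplitude spectrum
    mag = np.abs(np.fft.rfft(frames, axis=1))          # (n_frames, frame_len//2 + 1)
    # 3. triangular filters linearly distributed on the Mel scale
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for k in range(1, n_filters + 1):
        l, c, r = bins[k - 1], bins[k], bins[k + 1]
        fbank[k - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[k - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    energies = np.log(mag @ fbank.T + 1e-10)
    # 4. DCT of the log filter-bank energies -> MFCCs
    n = energies.shape[1]
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    ceps = energies @ dct.T                            # (n_frames, n_ceps)
    # 5. zero-mean normalization per coefficient -> spectral vectors
    return (ceps - ceps.mean(axis=0)) / (ceps.std(axis=0) + 1e-10)

sr = 8000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 440 * t), sr=sr)       # one second of a 440 Hz tone
```

Each row of the result is the normalized spectral vector for one frame of the signal.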
S303. Input the spectral vector into the emotion recognition model to recognize the emotion of the user, so as to obtain the emotional category of the user.
Wherein, the emotion recognition model is a model trained using the emotion recognition model training method provided in the above embodiments. The input spectral vector is analyzed by the emotion recognition model to accurately obtain the emotion of the user, specifically the affective type, such as happy, sad or neutral.
The emotion recognition method provided by the above embodiment acquires the voice signal of the user, pre-processes the voice signal according to the preset processing rule to obtain the spectral vector corresponding to the voice signal, and inputs the spectral vector into the emotion recognition model to recognize the emotion of the user, so as to obtain the emotional category of the user. This method can quickly recognize the affective type of the user while also achieving high recognition accuracy.
Referring to Fig. 6, Fig. 6 is a schematic block diagram of a model training apparatus provided by an embodiment of the present application; the model training apparatus can be configured in a server to execute the aforementioned training method of the emotion recognition model.
As shown in Fig. 6, the model training apparatus 400 comprises: an information acquisition unit 401, a sample construction unit 402, a data processing unit 403, a network extraction unit 404 and a model training unit 405.
The information acquisition unit 401 is used to obtain the voice information of a user and the data label corresponding to the voice information.
The sample construction unit 402 is used to construct sample data according to the voice information and the corresponding data labels.
The data processing unit 403 is used to pre-process the voice information in the sample data according to the preset processing rule to obtain the corresponding spectral vectors.
In one embodiment, the data processing unit 403 comprises:
an information processing subunit 4031, used to perform framing and windowing on the voice information in the sample data to obtain processed voice information; an information conversion subunit 4032, used to perform a frequency-domain transform on the processed voice information to obtain the corresponding amplitude spectrum; a filtering transformation subunit 4033, used to filter the amplitude spectrum through the Mel filter bank and perform a discrete cosine transform on the filtered amplitude spectrum to obtain the Mel-frequency cepstral coefficients; and a normalization subunit 4034, used to normalize the Mel-frequency cepstral coefficients to obtain the spectral vector corresponding to the voice information.
In one embodiment, the filtering transformation subunit 4033 is specifically used to: obtain the maximum frequency corresponding to the voice information, and calculate the Mel frequency corresponding to the maximum frequency using the Mel-frequency calculation formula; calculate the Mel spacing between the centre frequencies of two adjacent triangular filters according to the calculated Mel frequency and the number of triangular filters in the Mel filter bank; complete the linear distribution of the triangular filters according to the Mel spacing; and filter the amplitude spectrum according to the linearly distributed triangular filters.
The network extraction unit 404 is used to extract the preset recurrent neural network, the recurrent neural network including an attention mechanism used to reinforce a partial region in the voice information.
The model training unit 405 is used to perform, based on the recurrent neural network, model training according to the spectral vectors corresponding to the voice information and the data labels to obtain the emotion recognition model.
Referring to Fig. 7, Fig. 7 is a schematic block diagram of another model training apparatus provided by an embodiment of the present application; the model training apparatus can be configured in a server to execute the aforementioned training method of the emotion recognition model.
As shown in Fig. 7, the model training apparatus 500 comprises: an information acquisition unit 501, a sample construction unit 502, a balance judging unit 503, a balance processing unit 504, a data processing unit 505, a network extraction unit 506 and a model training unit 507.
The information acquisition unit 501 is used to obtain the voice information of a user and the data label corresponding to the voice information.
The sample construction unit 502 is used to construct sample data according to the voice information and the corresponding data labels, the sample data including positive sample data and negative sample data.
The balance judging unit 503 is used to judge whether the positive sample data and the negative sample data in the sample data are balanced.
The balance processing unit 504 is used to, if the positive sample data and the negative sample data are unbalanced, process the sample data according to the preset data processing rule so that the positive sample data and the negative sample data reach balance.
The data processing unit 505 is used to, if the positive sample data and the negative sample data are balanced, pre-process the voice information in the sample data according to the preset processing rule to obtain the corresponding spectral vectors.
The network extraction unit 506 is used to extract the preset recurrent neural network, the recurrent neural network including an attention mechanism used to reinforce a partial region in the voice information.
The model training unit 507 is used to perform, based on the recurrent neural network, model training according to the spectral vectors corresponding to the voice information and the data labels to obtain the emotion recognition model.
Referring to Fig. 8, Fig. 8 is a schematic block diagram of an emotion recognition device provided by an embodiment of the present application; the emotion recognition device can be configured in a terminal or a server to execute the aforementioned emotion recognition method.
As shown in Fig. 8, the emotion recognition device 600 comprises: a signal acquisition unit 601, a signal processing unit 602 and an emotion recognition unit 603.
The signal acquisition unit 601 is used to acquire the voice signal of a user.
The signal processing unit 602 is used to pre-process the voice signal according to the preset processing rule to obtain the spectral vector corresponding to the voice signal.
The emotion recognition unit 603 is used to input the spectral vector into the emotion recognition model to recognize the emotion of the user and obtain the emotional category of the user; the emotion recognition model is a model trained using any of the above emotion recognition model training methods.
It should be noted that, as will be clearly understood by those skilled in the art, for convenience and conciseness of description, the specific working processes of the devices and units described above can refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The above devices can be implemented in the form of a computer program, which can run on the computer equipment shown in Fig. 9.
Referring to Fig. 9, Fig. 9 is a schematic structural block diagram of computer equipment provided by an embodiment of the present application. The computer equipment can be a server or a terminal.
The computer equipment includes a processor, a memory and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium can store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to execute the training method of any emotion recognition model or the emotion recognition method.
The processor provides computing and control capability and supports the operation of the entire computer equipment.
The internal memory provides an environment for the running of the computer program in the non-volatile storage medium; when the computer program is executed by the processor, the processor is caused to execute the training method of any emotion recognition model or the emotion recognition method.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in Fig. 9 is only a block diagram of the part of the structure relevant to the present solution and does not constitute a limitation on the computer equipment to which the present solution is applied; a specific computer equipment may include more or fewer components than shown in the figure, combine certain components, or have a different component layout.
It should be understood that the processor can be a central processing unit (CPU), or other general-purpose processors, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. The general-purpose processor can be a microprocessor, or any conventional processor.
Wherein, in one embodiment, the processor is used to run the computer program stored in the memory to implement the following steps:
obtaining the voice information of a user and the data label corresponding to the voice information; constructing sample data according to the voice information and the corresponding data labels; pre-processing the voice information in the sample data according to the preset processing rule to obtain the corresponding spectral vectors; extracting the preset recurrent neural network, the recurrent neural network including an attention mechanism used to reinforce a partial region in the voice information; and performing, based on the recurrent neural network, model training according to the spectral vectors corresponding to the voice information and the data labels to obtain the emotion recognition model.
In one embodiment, when implementing the pre-processing of the voice information in the sample data according to the preset processing rule to obtain the corresponding spectral vectors, the processor is used to implement:
performing framing and windowing on the voice information in the sample data to obtain processed voice information; performing a frequency-domain transform on the processed voice information to obtain the corresponding amplitude spectrum; filtering the amplitude spectrum through the Mel filter bank and performing a discrete cosine transform on the filtered amplitude spectrum to obtain the Mel-frequency cepstral coefficients; and normalizing the Mel-frequency cepstral coefficients to obtain the spectral vector corresponding to the voice information.
In one embodiment, when implementing the filtering of the amplitude spectrum through the Mel filter bank, the processor is used to implement:
obtaining the maximum frequency corresponding to the voice information, and calculating the Mel frequency corresponding to the maximum frequency using the Mel-frequency calculation formula; calculating the Mel spacing between the centre frequencies of two adjacent triangular filters according to the calculated Mel frequency and the number of triangular filters in the Mel filter bank; completing the linear distribution of the triangular filters according to the Mel spacing; and filtering the amplitude spectrum according to the linearly distributed triangular filters.
In one embodiment, the Mel-frequency calculation formula is:

f_mel = A · log10(1 + f / 700)

where f_mel is the Mel frequency, f is the maximum frequency corresponding to the voice information, and A is a coefficient.
In one embodiment, when implementing the normalization of the Mel-frequency cepstral coefficients to obtain the spectral vector corresponding to the voice information, the processor is used to implement:
normalizing the Mel-frequency cepstral coefficients using zero-mean normalization to obtain the spectral vector corresponding to the voice information, the conversion formula corresponding to the zero-mean normalization being:

x* = (x − x̄) / σ

where x̄ is the mean of the Mel-frequency cepstral coefficients; σ is the standard deviation of the Mel-frequency cepstral coefficients; x is each Mel-frequency cepstral coefficient; and x* is the Mel-frequency cepstral coefficient after normalization.
In one embodiment, the structure of the recurrent neural network includes an input layer, a recurrent layer, an attention mechanism, a fully connected layer and an output layer; the attention mechanism is used to establish, according to the attention equation, a mapping relationship between the output quantities of the recurrent layer and the weight vectors, so as to reinforce the partial region in the voice information;
the attention equation is:

g = Σ_i a_i · h_i

where a_i = exp(u^T f(h_i)) / Σ_j exp(u^T f(h_j)) and f(h_i) = tanh(W · h_i + b); g is the input vector of the fully connected layer; h_i is the output quantity of the recurrent layer corresponding to each time point i; a_i is the weight vector corresponding to each time point i, representing the magnitude of the influence of each time point i on the fully connected layer and the output layer; T is the total number of time points i; W is a matrix parameter of dimension S×D, S is a positive integer, b and u are vector parameters of dimension S, and D is the number of network units in the recurrent layer.
Wherein, in another embodiment, the processor is used to run the computer program stored in the memory to implement the following steps:
acquiring the voice signal of a user;
pre-processing the voice signal according to the preset processing rule to obtain the spectral vector corresponding to the voice signal;
and inputting the spectral vector into the emotion recognition model to recognize the emotion of the user and obtain the emotional category of the user, the emotion recognition model being a model trained using any of the above emotion recognition model training methods.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program; the computer program includes program instructions which, when executed by the processor, implement the training method of any emotion recognition model or the emotion recognition method provided by the embodiments of the present application.
Wherein, the computer-readable storage medium can be an internal storage unit of the computer equipment described in the foregoing embodiments, such as the hard disk or memory of the computer equipment. The computer-readable storage medium can also be an external storage device of the computer equipment, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the computer equipment.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any person familiar with the technical field can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall all be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A training method of an emotion recognition model, characterized by comprising:
obtaining the voice information of a user and the data label corresponding to the voice information;
constructing sample data according to the voice information and the corresponding data label;
pre-processing the voice information in the sample data according to a preset processing rule to obtain corresponding spectral vectors;
extracting a preset recurrent neural network, the recurrent neural network including an attention mechanism, the attention mechanism being used to reinforce a partial region in the voice information; and
performing, based on the recurrent neural network, model training according to the spectral vectors corresponding to the voice information and the data label to obtain the emotion recognition model.
2. The training method according to claim 1, characterized in that pre-processing the voice information in the sample data according to the preset processing rule to obtain the corresponding spectral vectors comprises:
performing framing and windowing on the voice information in the sample data to obtain processed voice information;
performing a frequency-domain transform on the processed voice information to obtain a corresponding amplitude spectrum;
filtering the amplitude spectrum through a Mel filter bank, and performing a discrete cosine transform on the filtered amplitude spectrum to obtain Mel-frequency cepstral coefficients; and
normalizing the Mel-frequency cepstral coefficients to obtain the spectral vector corresponding to the voice information.
3. The training method according to claim 2, characterized in that filtering the amplitude spectrum through the Mel filter bank comprises:
obtaining the maximum frequency corresponding to the voice information, and calculating the Mel frequency corresponding to the maximum frequency using a Mel-frequency calculation formula;
calculating the Mel spacing between the centre frequencies of two adjacent triangular filters according to the calculated Mel frequency and the number of triangular filters in the Mel filter bank;
completing the linear distribution of the triangular filters according to the Mel spacing; and
filtering the amplitude spectrum according to the linearly distributed triangular filters.
4. The training method according to claim 3, wherein the Mel-frequency calculation formula is:
f_mel = A · lg(1 + f / 700)
wherein f_mel is the Mel frequency, f is the maximum frequency corresponding to the voice information, and A is a coefficient;
the normalizing the Mel-frequency cepstral coefficients to obtain the spectral vectors corresponding to the voice information comprises:
normalizing the Mel-frequency cepstral coefficients using zero-mean normalization to obtain the spectral vectors corresponding to the voice information, the conversion formula of the zero-mean normalization being:
x* = (x − x̄) / σ
wherein x̄ is the mean of the Mel-frequency cepstral coefficients; σ is the standard deviation of the Mel-frequency cepstral coefficients; x is each Mel-frequency cepstral coefficient; and x* is the normalized Mel-frequency cepstral coefficient.
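The two formulas in claim 4 are short enough to check numerically. The sketch below assumes A = 2595 with a base-10 logarithm for the Mel formula, which is the conventional parameterization; the claim itself only names A as "a coefficient".

```python
import numpy as np

def mel_frequency(f, A=2595.0):
    """f_mel = A * lg(1 + f / 700); A = 2595 is an assumed (conventional) value."""
    return A * np.log10(1.0 + f / 700.0)

def zero_mean_normalize(x):
    """x* = (x - mean) / std, the zero-mean normalization of claim 4."""
    return (x - x.mean()) / x.std()
```

For example, at f = 700 Hz the Mel frequency is A · lg 2 ≈ 781.2 Mel, and any normalized coefficient vector has mean 0 and standard deviation 1.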
5. The training method according to claim 1, wherein the structure of the recurrent neural network comprises an input layer, a recurrent layer, an attention mechanism, a fully connected layer, and an output layer; the attention mechanism is configured to establish, according to an attention equation, a mapping between the outputs of the recurrent layer and a weight vector, so as to emphasize local regions in the voice information; the attention equation being:
f(h_i) = tanh(W·h_i + b); a_i = exp(u^T·f(h_i)) / Σ_{t=1..T} exp(u^T·f(h_t)); G = Σ_{i=1..T} a_i·h_i
wherein G is the input vector of the fully connected layer; h_i is the output of the recurrent layer at each time step i; a_i is the weight corresponding to each time step i, representing the influence of time step i on the fully connected layer and the output layer; T is the total number of time steps; W is a matrix parameter of dimension S×D, S being a positive integer; b and u are vector parameters of dimension S; and D is the number of network units in the recurrent layer.
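The attention pooling described in claim 5 can be sketched in a few lines of numpy. The softmax/weighted-sum form is the standard reading of the variable definitions in the claim (f(h_i) = tanh(Wh_i + b), weights a_i summing over T time steps into G); the dimensions and names below are illustrative.

```python
import numpy as np

def attention_pool(H, W, b, u):
    """Attention over recurrent-layer outputs H (shape T x D):
    f(h_i) = tanh(W h_i + b); a_i = softmax_i(u^T f(h_i)); G = sum_i a_i h_i."""
    scores = np.tanh(H @ W.T + b) @ u   # u^T f(h_i) for each time step i
    e = np.exp(scores - scores.max())   # numerically stable softmax
    a = e / e.sum()                     # weight a_i of each time step
    G = a @ H                           # input vector of the fully connected layer
    return G, a
```

Time steps with large weights a_i are the "local regions" of the utterance that the mechanism emphasizes before classification.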
6. An emotion recognition method, comprising:
acquiring a voice signal of a user;
preprocessing the voice signal according to a preset processing rule to obtain spectral vectors corresponding to the voice signal;
inputting the spectral vectors into an emotion recognition model to recognize the emotion of the user, so as to obtain an emotional category of the user, the emotion recognition model being a model trained using the training method of an emotion recognition model according to any one of claims 1 to 5.
7. A training apparatus for an emotion recognition model, comprising:
an information acquisition unit, configured to acquire voice information of a user and a data label corresponding to the voice information;
a sample construction unit, configured to construct sample data from the voice information and the corresponding data label;
a data processing unit, configured to preprocess the voice information in the sample data according to a preset processing rule to obtain corresponding spectral vectors;
a network extraction unit, configured to extract a preset recurrent neural network, the recurrent neural network including an attention mechanism, the attention mechanism being configured to emphasize local regions in the voice information;
a model training unit, configured to perform model training based on the recurrent neural network according to the spectral vectors corresponding to the voice information and the data labels, to obtain the emotion recognition model.
8. An emotion recognition apparatus, comprising:
a signal acquisition unit, configured to acquire a voice signal of a user;
a signal processing unit, configured to preprocess the voice signal according to a preset processing rule to obtain spectral vectors corresponding to the voice signal;
an emotion recognition unit, configured to input the spectral vectors into an emotion recognition model to recognize the emotion of the user, so as to obtain an emotional category of the user, the emotion recognition model being a model trained using the training method of an emotion recognition model according to any one of claims 1 to 5.
9. A computer device, comprising a memory and a processor;
the memory being configured to store a computer program;
the processor being configured to execute the computer program and, when executing the computer program, to implement the training method of an emotion recognition model according to any one of claims 1 to 5, or the emotion recognition method according to claim 6.
10. A computer-readable storage medium, the computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the training method of an emotion recognition model according to any one of claims 1 to 5, or the emotion recognition method according to claim 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910145605.2A CN109817246B (en) | 2019-02-27 | 2019-02-27 | Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium |
PCT/CN2019/117711 WO2020173133A1 (en) | 2019-02-27 | 2019-11-12 | Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910145605.2A CN109817246B (en) | 2019-02-27 | 2019-02-27 | Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109817246A true CN109817246A (en) | 2019-05-28 |
CN109817246B CN109817246B (en) | 2023-04-18 |
Family
ID=66607622
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910145605.2A Active CN109817246B (en) | 2019-02-27 | 2019-02-27 | Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109817246B (en) |
WO (1) | WO2020173133A1 (en) |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110211563A (en) * | 2019-06-19 | 2019-09-06 | 平安科技(深圳)有限公司 | Chinese speech synthesis method, apparatus and storage medium towards scene and emotion |
CN110223714A (en) * | 2019-06-03 | 2019-09-10 | 杭州哲信信息技术有限公司 | A kind of voice-based Emotion identification method |
CN110288980A (en) * | 2019-06-17 | 2019-09-27 | 平安科技(深圳)有限公司 | Audio recognition method, the training method of model, device, equipment and storage medium |
CN110400579A (en) * | 2019-06-25 | 2019-11-01 | 华东理工大学 | Based on direction from the speech emotion recognition of attention mechanism and two-way length network in short-term |
CN110532380A (en) * | 2019-07-12 | 2019-12-03 | 杭州电子科技大学 | A kind of text sentiment classification method based on memory network |
CN110890088A (en) * | 2019-10-12 | 2020-03-17 | 中国平安财产保险股份有限公司 | Voice information feedback method and device, computer equipment and storage medium |
CN111179945A (en) * | 2019-12-31 | 2020-05-19 | 中国银行股份有限公司 | Voiceprint recognition-based safety door control method and device |
CN111276119A (en) * | 2020-01-17 | 2020-06-12 | 平安科技(深圳)有限公司 | Voice generation method and system and computer equipment |
CN111341351A (en) * | 2020-02-25 | 2020-06-26 | 厦门亿联网络技术股份有限公司 | Voice activity detection method and device based on self-attention mechanism and storage medium |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111429948A (en) * | 2020-03-27 | 2020-07-17 | 南京工业大学 | Voice emotion recognition model and method based on attention convolution neural network |
CN111582382A (en) * | 2020-05-09 | 2020-08-25 | Oppo广东移动通信有限公司 | State recognition method and device and electronic equipment |
WO2020173133A1 (en) * | 2019-02-27 | 2020-09-03 | 平安科技(深圳)有限公司 | Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium |
CN111816205A (en) * | 2020-07-09 | 2020-10-23 | 中国人民解放军战略支援部队航天工程大学 | Airplane audio-based intelligent airplane type identification method |
CN111832317A (en) * | 2020-07-09 | 2020-10-27 | 平安普惠企业管理有限公司 | Intelligent information diversion method and device, computer equipment and readable storage medium |
CN111985231A (en) * | 2020-08-07 | 2020-11-24 | 中移(杭州)信息技术有限公司 | Unsupervised role recognition method and device, electronic equipment and storage medium |
CN112163571A (en) * | 2020-10-29 | 2021-01-01 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for identifying attribute of electronic equipment user |
CN112331182A (en) * | 2020-10-26 | 2021-02-05 | 平安科技(深圳)有限公司 | Voice data generation method and device, computer equipment and storage medium |
CN112466324A (en) * | 2020-11-13 | 2021-03-09 | 上海听见信息科技有限公司 | Emotion analysis method, system, equipment and readable storage medium |
CN112992177A (en) * | 2021-02-20 | 2021-06-18 | 平安科技(深圳)有限公司 | Training method, device, equipment and storage medium of voice style migration model |
CN113053361A (en) * | 2021-03-18 | 2021-06-29 | 北京金山云网络技术有限公司 | Speech recognition method, model training method, device, equipment and medium |
CN113270111A (en) * | 2021-05-17 | 2021-08-17 | 广州国音智能科技有限公司 | Height prediction method, device, equipment and medium based on audio data |
CN113327631A (en) * | 2021-07-15 | 2021-08-31 | 广州虎牙科技有限公司 | Emotion recognition model training method, emotion recognition method and emotion recognition device |
CN113421594A (en) * | 2021-06-30 | 2021-09-21 | 平安科技(深圳)有限公司 | Speech emotion recognition method, device, equipment and storage medium |
CN113889150A (en) * | 2021-10-15 | 2022-01-04 | 北京工业大学 | Speech emotion recognition method and device |
CN113889149A (en) * | 2021-10-15 | 2022-01-04 | 北京工业大学 | Speech emotion recognition method and device |
CN113935336A (en) * | 2021-10-09 | 2022-01-14 | 上海淇玥信息技术有限公司 | Method and device for determining conversational strategy for voice conversation and electronic equipment |
WO2022198923A1 (en) * | 2021-03-26 | 2022-09-29 | 之江实验室 | Speech emotion recognition method and system using fusion of crowd information |
CN116916497B (en) * | 2023-09-12 | 2023-12-26 | 深圳市卡能光电科技有限公司 | Nested situation identification-based illumination control method and system for floor cylindrical atmosphere lamp |
CN117648717A (en) * | 2024-01-29 | 2024-03-05 | 知学云(北京)科技股份有限公司 | Privacy protection method for artificial intelligent voice training |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112185423B (en) * | 2020-09-28 | 2023-11-21 | 南京工程学院 | Voice emotion recognition method based on multi-head attention mechanism |
CN112257658B (en) * | 2020-11-11 | 2023-10-10 | 微医云(杭州)控股有限公司 | Electroencephalogram signal processing method and device, electronic equipment and storage medium |
CN112733994B (en) * | 2020-12-10 | 2024-07-12 | 中国科学院深圳先进技术研究院 | Autonomous emotion generation method, system and application of robot |
CN112786017B (en) * | 2020-12-25 | 2024-04-09 | 北京猿力未来科技有限公司 | Training method and device of speech speed detection model, and speech speed detection method and device |
CN112948554B (en) * | 2021-02-28 | 2024-03-08 | 西北工业大学 | Real-time multi-mode dialogue emotion analysis method based on reinforcement learning and domain knowledge |
CN113178197B (en) * | 2021-04-27 | 2024-01-09 | 平安科技(深圳)有限公司 | Training method and device of voice verification model and computer equipment |
CN113343860A (en) * | 2021-06-10 | 2021-09-03 | 南京工业大学 | Bimodal fusion emotion recognition method based on video image and voice |
CN113420556B (en) * | 2021-07-23 | 2023-06-20 | 平安科技(深圳)有限公司 | Emotion recognition method, device, equipment and storage medium based on multi-mode signals |
CN113592001B (en) * | 2021-08-03 | 2024-02-02 | 西北工业大学 | Multi-mode emotion recognition method based on deep canonical correlation analysis |
CN113919387A (en) * | 2021-08-18 | 2022-01-11 | 东北林业大学 | Electroencephalogram signal emotion recognition based on GBDT-LR model |
CN113837299B (en) * | 2021-09-28 | 2023-09-01 | 平安科技(深圳)有限公司 | Network training method and device based on artificial intelligence and electronic equipment |
CN114548262B (en) * | 2022-02-21 | 2024-03-22 | 华中科技大学鄂州工业技术研究院 | Feature level fusion method for multi-mode physiological signals in emotion calculation |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106340309A (en) * | 2016-08-23 | 2017-01-18 | 南京大空翼信息技术有限公司 | Dog bark emotion recognition method and device based on deep learning |
US20170018270A1 (en) * | 2015-07-16 | 2017-01-19 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method |
CN108550375A (en) * | 2018-03-14 | 2018-09-18 | 鲁东大学 | A kind of emotion identification method, device and computer equipment based on voice signal |
CN109243493A (en) * | 2018-10-30 | 2019-01-18 | 南京工程学院 | Based on the vagitus emotion identification method for improving long memory network in short-term |
CN109285562A (en) * | 2018-09-28 | 2019-01-29 | 东南大学 | Speech-emotion recognition method based on attention mechanism |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107766894B (en) * | 2017-11-03 | 2021-01-22 | 吉林大学 | Remote sensing image natural language generation method based on attention mechanism and deep learning |
CN108922515A (en) * | 2018-05-31 | 2018-11-30 | 平安科技(深圳)有限公司 | Speech model training method, audio recognition method, device, equipment and medium |
CN109062937B (en) * | 2018-06-15 | 2019-11-26 | 北京百度网讯科技有限公司 | The method of training description text generation model, the method and device for generating description text |
CN109817246B (en) * | 2019-02-27 | 2023-04-18 | 平安科技(深圳)有限公司 | Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium |
2019
- 2019-02-27: CN application CN201910145605.2A filed (granted as CN109817246B, active)
- 2019-11-12: PCT application PCT/CN2019/117711 filed (published as WO2020173133A1)
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020173133A1 (en) * | 2019-02-27 | 2020-09-03 | 平安科技(深圳)有限公司 | Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium |
CN110223714A (en) * | 2019-06-03 | 2019-09-10 | 杭州哲信信息技术有限公司 | A kind of voice-based Emotion identification method |
CN110288980A (en) * | 2019-06-17 | 2019-09-27 | 平安科技(深圳)有限公司 | Audio recognition method, the training method of model, device, equipment and storage medium |
CN110211563A (en) * | 2019-06-19 | 2019-09-06 | 平安科技(深圳)有限公司 | Chinese speech synthesis method, apparatus and storage medium towards scene and emotion |
CN110211563B (en) * | 2019-06-19 | 2024-05-24 | 平安科技(深圳)有限公司 | Chinese speech synthesis method, device and storage medium for scenes and emotion |
CN110400579A (en) * | 2019-06-25 | 2019-11-01 | 华东理工大学 | Based on direction from the speech emotion recognition of attention mechanism and two-way length network in short-term |
CN110532380A (en) * | 2019-07-12 | 2019-12-03 | 杭州电子科技大学 | A kind of text sentiment classification method based on memory network |
CN110890088A (en) * | 2019-10-12 | 2020-03-17 | 中国平安财产保险股份有限公司 | Voice information feedback method and device, computer equipment and storage medium |
CN110890088B (en) * | 2019-10-12 | 2022-07-15 | 中国平安财产保险股份有限公司 | Voice information feedback method and device, computer equipment and storage medium |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111357051B (en) * | 2019-12-24 | 2024-02-02 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
WO2021127982A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, smart device, and computer-readable storage medium |
CN111179945A (en) * | 2019-12-31 | 2020-05-19 | 中国银行股份有限公司 | Voiceprint recognition-based safety door control method and device |
CN111276119A (en) * | 2020-01-17 | 2020-06-12 | 平安科技(深圳)有限公司 | Voice generation method and system and computer equipment |
CN111276119B (en) * | 2020-01-17 | 2023-08-22 | 平安科技(深圳)有限公司 | Speech generation method, system and computer equipment |
CN111341351A (en) * | 2020-02-25 | 2020-06-26 | 厦门亿联网络技术股份有限公司 | Voice activity detection method and device based on self-attention mechanism and storage medium |
CN111429948A (en) * | 2020-03-27 | 2020-07-17 | 南京工业大学 | Voice emotion recognition model and method based on attention convolution neural network |
CN111582382A (en) * | 2020-05-09 | 2020-08-25 | Oppo广东移动通信有限公司 | State recognition method and device and electronic equipment |
CN111582382B (en) * | 2020-05-09 | 2023-10-31 | Oppo广东移动通信有限公司 | State identification method and device and electronic equipment |
CN111832317B (en) * | 2020-07-09 | 2023-08-18 | 广州市炎华网络科技有限公司 | Intelligent information flow guiding method and device, computer equipment and readable storage medium |
CN111816205B (en) * | 2020-07-09 | 2023-06-20 | 中国人民解放军战略支援部队航天工程大学 | Airplane audio-based intelligent recognition method for airplane models |
CN111816205A (en) * | 2020-07-09 | 2020-10-23 | 中国人民解放军战略支援部队航天工程大学 | Airplane audio-based intelligent airplane type identification method |
CN111832317A (en) * | 2020-07-09 | 2020-10-27 | 平安普惠企业管理有限公司 | Intelligent information diversion method and device, computer equipment and readable storage medium |
CN111985231B (en) * | 2020-08-07 | 2023-12-26 | 中移(杭州)信息技术有限公司 | Unsupervised role recognition method and device, electronic equipment and storage medium |
CN111985231A (en) * | 2020-08-07 | 2020-11-24 | 中移(杭州)信息技术有限公司 | Unsupervised role recognition method and device, electronic equipment and storage medium |
WO2021189980A1 (en) * | 2020-10-26 | 2021-09-30 | 平安科技(深圳)有限公司 | Voice data generation method and apparatus, and computer device and storage medium |
CN112331182A (en) * | 2020-10-26 | 2021-02-05 | 平安科技(深圳)有限公司 | Voice data generation method and device, computer equipment and storage medium |
CN112163571B (en) * | 2020-10-29 | 2024-03-05 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for identifying attribute of electronic equipment user |
CN112163571A (en) * | 2020-10-29 | 2021-01-01 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for identifying attribute of electronic equipment user |
CN112466324A (en) * | 2020-11-13 | 2021-03-09 | 上海听见信息科技有限公司 | Emotion analysis method, system, equipment and readable storage medium |
CN112992177B (en) * | 2021-02-20 | 2023-10-17 | 平安科技(深圳)有限公司 | Training method, device, equipment and storage medium of voice style migration model |
CN112992177A (en) * | 2021-02-20 | 2021-06-18 | 平安科技(深圳)有限公司 | Training method, device, equipment and storage medium of voice style migration model |
CN113053361A (en) * | 2021-03-18 | 2021-06-29 | 北京金山云网络技术有限公司 | Speech recognition method, model training method, device, equipment and medium |
WO2022198923A1 (en) * | 2021-03-26 | 2022-09-29 | 之江实验室 | Speech emotion recognition method and system using fusion of crowd information |
CN113270111A (en) * | 2021-05-17 | 2021-08-17 | 广州国音智能科技有限公司 | Height prediction method, device, equipment and medium based on audio data |
CN113421594A (en) * | 2021-06-30 | 2021-09-21 | 平安科技(深圳)有限公司 | Speech emotion recognition method, device, equipment and storage medium |
CN113421594B (en) * | 2021-06-30 | 2023-09-22 | 平安科技(深圳)有限公司 | Speech emotion recognition method, device, equipment and storage medium |
CN113327631A (en) * | 2021-07-15 | 2021-08-31 | 广州虎牙科技有限公司 | Emotion recognition model training method, emotion recognition method and emotion recognition device |
CN113327631B (en) * | 2021-07-15 | 2023-03-21 | 广州虎牙科技有限公司 | Emotion recognition model training method, emotion recognition method and emotion recognition device |
CN113935336A (en) * | 2021-10-09 | 2022-01-14 | 上海淇玥信息技术有限公司 | Method and device for determining conversational strategy for voice conversation and electronic equipment |
CN113889150A (en) * | 2021-10-15 | 2022-01-04 | 北京工业大学 | Speech emotion recognition method and device |
CN113889150B (en) * | 2021-10-15 | 2023-08-29 | 北京工业大学 | Speech emotion recognition method and device |
CN113889149B (en) * | 2021-10-15 | 2023-08-29 | 北京工业大学 | Speech emotion recognition method and device |
CN113889149A (en) * | 2021-10-15 | 2022-01-04 | 北京工业大学 | Speech emotion recognition method and device |
CN116916497B (en) * | 2023-09-12 | 2023-12-26 | 深圳市卡能光电科技有限公司 | Nested situation identification-based illumination control method and system for floor cylindrical atmosphere lamp |
CN117648717A (en) * | 2024-01-29 | 2024-03-05 | 知学云(北京)科技股份有限公司 | Privacy protection method for artificial intelligent voice training |
CN117648717B (en) * | 2024-01-29 | 2024-05-03 | 知学云(北京)科技股份有限公司 | Privacy protection method for artificial intelligent voice training |
Also Published As
Publication number | Publication date |
---|---|
WO2020173133A1 (en) | 2020-09-03 |
CN109817246B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109817246A (en) | Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model | |
Yadav et al. | Survey on machine learning in speech emotion recognition and vision systems using a recurrent neural network (RNN) | |
CN108597492B (en) | Phoneme synthesizing method and device | |
CN110457432B (en) | Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium | |
CN109859772B (en) | Emotion recognition method, emotion recognition device and computer-readable storage medium | |
CN111312245B (en) | Voice response method, device and storage medium | |
CN108197115A (en) | Intelligent interactive method, device, computer equipment and computer readable storage medium | |
CN112259106A (en) | Voiceprint recognition method and device, storage medium and computer equipment | |
CN108899049A (en) | A kind of speech-emotion recognition method and system based on convolutional neural networks | |
WO2021047319A1 (en) | Voice-based personal credit assessment method and apparatus, terminal and storage medium | |
CN112216307B (en) | Speech emotion recognition method and device | |
CN109313892A (en) | Steady language identification method and system | |
WO2022178969A1 (en) | Voice conversation data processing method and apparatus, and computer device and storage medium | |
Al-Dujaili et al. | Speech emotion recognition: a comprehensive survey | |
Sethu et al. | Speech based emotion recognition | |
Caponetti et al. | Biologically inspired emotion recognition from speech | |
CN114127849A (en) | Speech emotion recognition method and device | |
Ali et al. | DWT features performance analysis for automatic speech recognition of Urdu | |
Yang et al. | Algorithm for speech emotion recognition classification based on mel-frequency cepstral coefficients and broad learning system | |
Akinpelu et al. | Lightweight deep learning framework for speech emotion recognition | |
CN114913859B (en) | Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium | |
Johar | Paralinguistic profiling using speech recognition | |
CN116959464A (en) | Training method of audio generation network, audio generation method and device | |
CN116416962A (en) | Audio synthesis method, device, equipment and storage medium | |
Fonnegra et al. | Speech emotion recognition based on a recurrent neural network classification model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||