CN110047516A - Speech emotion recognition method based on gender perception - Google Patents
Speech emotion recognition method based on gender perception
- Publication number
- CN110047516A CN110047516A CN201910186313.3A CN201910186313A CN110047516A CN 110047516 A CN110047516 A CN 110047516A CN 201910186313 A CN201910186313 A CN 201910186313A CN 110047516 A CN110047516 A CN 110047516A
- Authority
- CN
- China
- Prior art keywords
- gender
- feature
- perception
- layer
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The present invention discloses a speech emotion recognition method based on gender perception, which exploits two gender-aware features derived from gender information: distributed gender features and gender-driven features. The gender-aware features are fused with the spectrogram into combined features, and a CNN-BLSTM network learns high-level deep features from the combined features for emotion classification. The main steps are voice segmentation, feature preparation, feature fusion, feature extraction, and classification. Compared with existing features, the gender-aware features of the invention exploit gender information effectively, and the gender-aware speech emotion recognition method effectively improves the accuracy of speech emotion recognition.
Description
Technical field
The present invention belongs to the field of speech emotion recognition, and specifically relates to the use of gender features and their fusion with speech emotion recognition features.
Background
Human-computer interaction is now common in many forms, notably dialogue systems and intelligent voice assistants. Emotion carries important semantic information, and speech emotion recognition is believed to help machines understand user intent effectively. Accurately recognizing the user's mood enables good interactivity and improves the user experience. However, natural communication between humans and machines still faces many difficulties, and true human-computer interaction has not yet been achieved; speech emotion recognition remains a significant challenge.
Many studies have found that gender differences affect emotional expression, which suggests that gender information can help speech emotion recognition. Research has shown that building separate gender-dependent speech emotion recognition systems outperforms simply incorporating gender information into a single recognizer, and gender information has been widely used in speech emotion recognition tasks. However, simple encodings such as one-hot coding cannot exploit gender information effectively; adding gender information in this way improves emotion recognition accuracy only slightly.
To address this problem, we propose two new kinds of gender-aware features: distributed gender features and gender-driven features. The distributed gender feature describes the male/female distribution and individual differences; the gender-driven feature is extracted from the acoustic signal by a DNN. Each of these gender-aware features is fused with the spectrogram, and a CNN-BLSTM model performs the final classification.
1) Traditional speech emotion recognition does not exploit gender information efficiently; the proposed gender-aware features use it effectively.
2) The new gender-aware features not only encode male/female identity but also reflect individual differences and part of the speaker's acoustic characteristics, improving the utilization of gender information.
3) Fusing the gender-aware features with the spectrogram and classifying with a CNN-BLSTM model effectively improves the accuracy of emotion recognition.
Summary of the invention
The technical problem solved by the present invention is to propose gender-aware features that exploit gender information effectively: distributed gender features and gender-driven features. The gender-aware features are fused with the spectrogram into combined features, and a CNN-BLSTM network learns high-level deep features from the combined features for emotion classification. The specific technical solution is as follows:
Step 1, voice segmentation: the utterance-level speech signal is divided into fixed-length speech segments.
Step 2, feature preparation
1) Spectrogram extraction: a short-time Fourier transform is applied to each segment to obtain the raw spectrogram S of size a × b;
2) Gender-aware feature extraction: distributed gender features and gender-driven features;
2-1) Distributed gender feature extraction: fixed-dimension random vectors are first generated as the male and female templates. To reflect individual differences, a random variable is added to the fixed gender template. The resulting male distributed gender feature DGF_M varies in the range m–k, and the female feature DGF_F varies in the range k–z;
2-2) Gender-driven feature extraction: an x-dimensional acoustic feature vector is first extracted from each segment. To make the features gender-discriminative, a deep neural network (DNN) extracts a y-dimensional bottleneck feature from the acoustic features as the gender-driven feature GDF.
Step 3, feature fusion
The raw spectrogram S from step 2 1) and the DGF from 2-1) are fused into the combined feature F1. The combined feature vector of the j-th segment in the i-th utterance can be written as:
F1_ij = [S_ij, DGF_ij]   (1)
The raw spectrogram S from step 2 1) and the GDF from 2-2) are fused into the combined feature F2. The combined feature vector of the j-th segment in the i-th utterance can be written as:
F2_ij = [S_ij, GDF_ij]   (2)
Step 4, feature extraction. A CNN extracts high-level features from each combined feature.
Step 5, classification. The high-level features from step 4 are arranged chronologically into sentence-level sequences and fed into a BLSTM network to learn contextual temporal dependencies, completing utterance-level emotion classification. The seven emotions are: neutral, sad, fearful, happy, angry, bored, and disgusted.
The gender-driven feature in step 2 is obtained with a DNN, constructed as follows:
1) the input of the DNN is the x-dimensional acoustic feature;
2) three hidden layers h1, h2, h3 are used, where h2 has fewer hidden units than h1 and h3; h2 is called the bottleneck layer;
3) the true gender label serves as the teacher signal for training the DNN, which is trained by back-propagating the derivative of the cost function; the cost function measures the cross-entropy between the target output and the actual output on each training example.
After the DNN is trained, the output of hidden layer h2 is the bottleneck feature, i.e. the gender-driven feature.
The gender-aware speech emotion recognition method of the invention is based on a CNN-BLSTM model, configured as follows:
The CNN has two convolutional layers and two max-pooling layers. The first convolutional layer has n1 kernels of size k1 × k1; the first pooling layer has pool size p1 × p1. The second convolutional layer has n2 kernels of size k2 × k2; the second pooling layer has pool size p2 × p2. A flattening layer turns the two-dimensional feature maps into a one-dimensional vector. After the flattening layer, a fully connected layer with s hidden units maps the features to s dimensions. The BLSTM has two hidden layers, each with u hidden units.
Beneficial effects
Compared with existing features, the gender-aware features of the invention exploit gender information effectively. The gender-aware speech emotion recognition method effectively improves the accuracy of speech emotion recognition.
Brief description of the drawings
Fig. 1 is the framework of the gender-aware speech emotion recognition model based on the gender-driven feature;
Fig. 2 is the structure of the DNN model used to extract the gender-driven feature.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings.
To validate the invention, we evaluate on the Emo-DB database. Emo-DB contains 535 sentences covering seven emotions: sad, happy, fearful, neutral, angry, bored, and disgusted.
Fig. 1 shows the framework of the gender-aware speech emotion recognition model based on the gender-driven feature. As shown in Fig. 1, the method mainly comprises the following five steps.
Step 1, voice segmentation. The utterance-level speech signal is divided into fixed-length segments of 265 ms. Each segment contains 25 frames, with a frame length of 25 ms and a frame shift of 10 ms. Segmenting the Emo-DB corpus in this way yields about 50,000 speech segments for the experiments; the longest sentence is divided into 349 segments.
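A minimal NumPy sketch of this fixed-length segmentation; the shift between successive 265 ms segments is not given in the text, so the 25 ms value below is only an illustrative assumption:

```python
import numpy as np

sr = 16000                           # sampling rate (assumed)
seg_len = int(0.265 * sr)            # 265 ms segment -> 4240 samples
seg_hop = int(0.025 * sr)            # segment shift: an assumption, not stated in the text
wav = np.zeros(3 * sr)               # stand-in for a 3 s utterance waveform
starts = range(0, len(wav) - seg_len + 1, seg_hop)
segments = np.stack([wav[s:s + seg_len] for s in starts])
# every row of `segments` is one fixed-length 265 ms segment
```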
Step 2, feature preparation.
1) Spectrogram extraction: a short-time Fourier transform is applied to each segment, giving a raw spectrogram S of size 25 × 129;
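One way to obtain a 25 × 129 spectrogram from a 265 ms segment. The 16 kHz sampling rate, Hamming window, and 256-point FFT are assumptions made here to match the stated 25 × 129 size; the text only specifies the 25 ms frame length and 10 ms shift:

```python
import numpy as np

sr = 16000
frame_len, hop = int(0.025 * sr), int(0.010 * sr)   # 400-sample frames, 160-sample hop
n_frames, n_fft = 25, 256
seg = np.zeros(24 * hop + frame_len)                 # one 265 ms segment (stand-in)
window = np.hamming(frame_len)
S = np.stack([
    # rfft with n=256 crops each windowed frame to 256 samples,
    # yielding 129 frequency bins per frame (an assumption)
    np.abs(np.fft.rfft(seg[i * hop: i * hop + frame_len] * window, n=n_fft))
    for i in range(n_frames)
])                                                   # raw spectrogram, shape (25, 129)
```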
2) Gender-aware feature extraction: distributed gender features and gender-driven features.
2-1) Distributed gender feature extraction: 32-dimensional random vectors are first generated as the male and female templates. To reflect individual differences, a random variable is added to the fixed gender template. The resulting male distributed gender feature DGF_M varies in the range 0–0.5, and the female feature DGF_F varies in the range 0.5–1.
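A minimal sketch of the distributed gender feature. The uniform template distributions and the small clipped Gaussian perturbation are assumptions; the text specifies only the 32-dimensional templates and the per-gender value ranges:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32
male_template = rng.uniform(0.0, 0.5, DIM)      # fixed male template in [0, 0.5)
female_template = rng.uniform(0.5, 1.0, DIM)    # fixed female template in [0.5, 1)

def distributed_gender_feature(template, lo, hi, scale=0.02):
    """Add a random variable to the fixed template to reflect individual
    differences, clipping the result back into the gender's range."""
    return np.clip(template + rng.normal(0.0, scale, template.shape), lo, hi)

dgf_m = distributed_gender_feature(male_template, 0.0, 0.5)    # DGF_M in [0, 0.5]
dgf_f = distributed_gender_feature(female_template, 0.5, 1.0)  # DGF_F in [0.5, 1]
```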
2-2) Gender-driven feature extraction: 384-dimensional acoustic features are first extracted from each segment with the openSMILE toolkit. This 384-dimensional set is the INTERSPEECH 2009 Emotion Challenge feature set, consisting of 32 low-level descriptors (LLDs) and their statistical functionals. The 32 LLDs include zero-crossing rate, root-mean-square energy, fundamental frequency, the autocorrelation-based harmonics-to-noise ratio, and Mel-frequency cepstral coefficients.
To make the features gender-discriminative, a deep neural network (DNN) is used to extract a 32-dimensional bottleneck feature from the acoustic features as the gender-driven feature. Fig. 2 shows the structure of the DNN used to extract the bottleneck feature. The DNN input is the 384-dimensional acoustic feature, and the three hidden layers h1, h2, h3 have 1024, 32, and 1024 hidden units respectively. The output layer uses the true gender label as the teacher signal for training. After the DNN is trained, the output of hidden layer h2 is the 32-dimensional gender-driven feature.
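The 384–1024–32–1024 bottleneck DNN can be sketched as a plain NumPy forward pass. Training by cross-entropy back-propagation is omitted here, and the random weights merely stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

# layer sizes: 384-dim acoustic input, h1=1024, h2=32 (bottleneck), h3=1024, 2 gender outputs
W1 = rng.normal(0, 0.05, (384, 1024))
W2 = rng.normal(0, 0.05, (1024, 32))
W3 = rng.normal(0, 0.05, (32, 1024))
Wo = rng.normal(0, 0.05, (1024, 2))   # male/female softmax layer used only during training

def gender_driven_feature(x):
    """Return the 32-dim activation of bottleneck layer h2 as the GDF."""
    h1 = relu(x @ W1)
    return relu(h1 @ W2)

x = rng.normal(size=384)               # one segment's IS09 acoustic feature vector
gdf = gender_driven_feature(x)         # 32-dim gender-driven feature
```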
Step 3, feature fusion.
The DGF from step 2 2-1) is repeated 25 times along the time axis to form a segment-level DGF of size 25 × 32. The 25 × 129 raw spectrogram from step 2 1) and the segment-level DGF are fused into the segment-level combined feature F1, of size 25 × 161.
Similarly, the GDF from step 2 2-2) is repeated 25 times along the time axis into a segment-level GDF of size 25 × 32, which is fused with the 25 × 129 raw spectrogram from step 2 1) into the segment-level combined feature F2, also of size 25 × 161.
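The fusion of step 3 is a simple tile-and-concatenate, shown here with placeholder arrays standing in for a real spectrogram and gender-aware feature:

```python
import numpy as np

S = np.zeros((25, 129))                    # raw spectrogram of one segment
g = np.zeros(32)                           # 32-dim gender-aware feature (DGF or GDF)
g_seg = np.tile(g, (25, 1))                # repeat along the time axis -> (25, 32)
F = np.concatenate([S, g_seg], axis=1)     # segment-level combined feature, (25, 161)
```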
Step 4, feature extraction. A CNN extracts high-level features from F1 and F2 respectively. The CNN has two convolutional layers and two max-pooling layers. The first convolutional layer has 32 kernels of size 5 × 5 with ReLU activation; the first pooling layer has pool size 2 × 2. The second convolutional layer has 64 kernels of size 5 × 5 with ReLU activation; the second pooling layer has pool size 2 × 2. After the flattening layer, a fully connected layer with 1024 hidden units maps the learned features to 1024 dimensions. The output layer has 7 units with softmax activation. The features taken after the fully connected layer are used for the subsequent classification.
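Assuming 'same' padding for the 5 × 5 convolutions (the text does not state the padding scheme), the feature-map sizes through the two convolution/pooling stages can be checked arithmetically:

```python
def cnn_output_shape(h, w, n_stages=2):
    """5x5 'same' convolutions keep H x W; each 2x2 max-pool floors the halves."""
    for _ in range(n_stages):
        h, w = h // 2, w // 2
    return h, w

h, w = cnn_output_shape(25, 161)   # spatial size after two conv/pool stages
flat_units = h * w * 64            # flattened inputs to the 1024-unit dense layer
```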
Step 5, classification. The features obtained in step 4 are arranged chronologically into sentence-level sequences and fed into a BLSTM network to learn contextual temporal dependencies, completing utterance-level classification into the seven emotions: neutral, sad, fearful, happy, angry, bored, and disgusted. The BLSTM has two hidden layers, each with 1024 hidden units.
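A single-layer bidirectional LSTM pass over a sequence of segment-level features can be sketched in NumPy. Random weights stand in for trained ones, the hidden size is shrunk from the patent's 1024 units to keep the sketch light, and only one of the two BLSTM layers is shown:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(x, Wx, Wh, b):
    """Run an LSTM over x of shape (T, d); return the final hidden state (u,)."""
    u = Wh.shape[0]
    h, c = np.zeros(u), np.zeros(u)
    for xt in x:
        z = xt @ Wx + h @ Wh + b                  # stacked gate pre-activations: i, f, o, g
        i, f, o = sigmoid(z[:3 * u]).reshape(3, u)
        g = np.tanh(z[3 * u:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

T, d, u = 6, 1024, 128      # 6 segment-level CNN features per utterance (illustrative)
x = rng.normal(size=(T, d))
p = lambda: (rng.normal(0, 0.05, (d, 4 * u)), rng.normal(0, 0.05, (u, 4 * u)), np.zeros(4 * u))
h_fw = lstm_last_hidden(x, *p())            # forward direction
h_bw = lstm_last_hidden(x[::-1], *p())      # backward direction
feat = np.concatenate([h_fw, h_bw])         # utterance-level representation
logits = feat @ rng.normal(0, 0.05, (2 * u, 7))
probs = np.exp(logits - logits.max()); probs /= probs.sum()   # 7 emotion posteriors
```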
Table 1. Results of fusing gender-aware features with the spectrogram on the Emo-DB database
ID | Feature | Size | Weighted accuracy | Unweighted accuracy |
---|---|---|---|---|
1 | Spectrogram | 25×129 | 86.73% | 86.40% |
2 | Spectrogram + one-hot gender feature | 25×131 | 86.92% | 86.24% |
3 | Spectrogram + distributed gender feature | 25×161 | 88.97% | 88.31% |
4 | Spectrogram + gender-driven feature | 25×161 | 92.71% | 92.62% |
Table 1 shows the weighted and unweighted accuracy of the gender-aware speech emotion recognition model when classifying with different features. From Table 1 we draw three conclusions. 1) Fusing a one-hot gender feature (male 01, female 10) with the spectrogram does not yield good classification results: the one-hot feature has only 2 dimensions, is essentially negligible, and the CNN cannot learn the information in it. 2) Adding the distributed gender feature and the gender-driven feature to the gender-aware system reduces the relative error in unweighted accuracy by 14.04% and 45.74% respectively, compared with using the spectrogram alone. 3) The gender-driven feature achieves better classification accuracy than the distributed gender feature, because it not only encodes gender but also reflects the speaker's real individual differences and acoustic information, whereas the distributed gender feature reflects only the speaker's gender. The results demonstrate that the gender-aware speech emotion recognition method improves the accuracy of speech emotion classification and that the invention is effective.
Claims (4)
1. A speech emotion recognition method based on gender perception, characterized in that: first, gender-aware features are derived from gender information, namely distributed gender features and gender-driven features; then, each gender-aware feature is fused with the spectrogram into a combined feature, and a CNN-BLSTM network learns high-level deep features from the combined features for emotion classification.
2. The speech emotion recognition method based on gender perception according to claim 1, characterized in that the specific steps are as follows:
Step 1, voice segmentation: the utterance-level speech signal is divided into fixed-length speech segments;
Step 2, feature preparation
1) Spectrogram extraction: a short-time Fourier transform is applied to each segment to obtain the raw spectrogram S of size a × b;
2) Gender-aware feature extraction: distributed gender features and gender-driven features;
2-1) Distributed gender feature extraction: fixed-dimension random vectors are first generated as the male and female templates; a random variable is added to the fixed gender template; the resulting male distributed gender feature DGF_M varies in the range m–k, and the female feature DGF_F varies in the range k–z;
2-2) Gender-driven feature extraction: an x-dimensional acoustic feature vector is first extracted from each segment; a deep neural network (DNN) extracts a y-dimensional bottleneck feature from the acoustic features as the gender-driven feature GDF;
Step 3, feature fusion
The raw spectrogram S from step 2 1) and the DGF from 2-1) are fused into the combined feature F1; the combined feature vector of the j-th segment in the i-th utterance can be written as:
F1_ij = [S_ij, DGF_ij]   (1)
The raw spectrogram S from step 2 1) and the GDF from 2-2) are fused into the combined feature F2; the combined feature vector of the j-th segment in the i-th utterance can be written as:
F2_ij = [S_ij, GDF_ij]   (2)
Step 4, feature extraction
A CNN extracts high-level features from each combined feature;
Step 5, classification
The high-level features from step 4 are arranged chronologically into sentence-level sequences and fed into a BLSTM network to learn contextual temporal dependencies, completing utterance-level emotion classification.
3. The speech emotion recognition method based on gender perception according to claim 1, characterized in that the gender-driven feature in step 2 is constructed as follows:
1) the input of the DNN is the x-dimensional acoustic feature;
2) three hidden layers h1, h2, h3 are used, where h2 has fewer hidden units than h1 and h3; h2 is called the bottleneck layer;
3) the true gender label serves as the teacher signal for training the DNN, which is trained by back-propagating the derivative of the cost function; the cost function measures the cross-entropy between the target output and the actual output on each training example;
after the DNN is trained, the output of hidden layer h2 is the bottleneck feature, i.e. the gender-driven feature.
4. The speech emotion recognition method based on gender perception according to claim 1, characterized in that the CNN-BLSTM model used in steps 4 and 5 is configured as follows:
the CNN has two convolutional layers and two max-pooling layers:
the first convolutional layer has n1 kernels of size k1 × k1;
the first pooling layer has pool size p1 × p1; the second convolutional layer has n2 kernels of size k2 × k2;
the second pooling layer has pool size p2 × p2;
after the flattening layer, a fully connected layer with s hidden units maps the features to s dimensions;
the BLSTM has two hidden layers, each with u hidden units.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910186313.3A CN110047516A (en) | 2019-03-12 | 2019-03-12 | A kind of speech-emotion recognition method based on gender perception |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110047516A true CN110047516A (en) | 2019-07-23 |
Family
ID=67274783
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910186313.3A Pending CN110047516A (en) | 2019-03-12 | 2019-03-12 | A kind of speech-emotion recognition method based on gender perception |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110047516A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555379A (en) * | 2019-07-30 | 2019-12-10 | 华南理工大学 | human face pleasure degree estimation method capable of dynamically adjusting features according to gender |
CN110619889A (en) * | 2019-09-19 | 2019-12-27 | Oppo广东移动通信有限公司 | Sign data identification method and device, electronic equipment and storage medium |
CN110675893A (en) * | 2019-09-19 | 2020-01-10 | 腾讯音乐娱乐科技(深圳)有限公司 | Song identification method and device, storage medium and electronic equipment |
CN110728997A (en) * | 2019-11-29 | 2020-01-24 | 中国科学院深圳先进技术研究院 | Multi-modal depression detection method and system based on context awareness |
CN111402927A (en) * | 2019-08-23 | 2020-07-10 | 南京邮电大学 | Speech emotion recognition method based on segmented spectrogram and dual-Attention |
CN111899766A (en) * | 2020-08-24 | 2020-11-06 | 南京邮电大学 | Speech emotion recognition method based on optimization fusion of depth features and acoustic features |
CN112712824A (en) * | 2021-03-26 | 2021-04-27 | 之江实验室 | Crowd information fused speech emotion recognition method and system |
CN112927723A (en) * | 2021-04-20 | 2021-06-08 | 东南大学 | High-performance anti-noise speech emotion recognition method based on deep neural network |
CN113593526A (en) * | 2021-07-27 | 2021-11-02 | 哈尔滨理工大学 | Speech emotion recognition method based on deep learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013120467A (en) * | 2011-12-07 | 2013-06-17 | National Institute Of Advanced Industrial & Technology | Device and method for extracting signal features |
CN108010514A (en) * | 2017-11-20 | 2018-05-08 | 四川大学 | A kind of method of speech classification based on deep neural network |
CN109272993A (en) * | 2018-08-21 | 2019-01-25 | 中国平安人寿保险股份有限公司 | Recognition methods, device, computer equipment and the storage medium of voice class |
CN109389992A (en) * | 2018-10-18 | 2019-02-26 | 天津大学 | A kind of speech-emotion recognition method based on amplitude and phase information |
- 2019-03-12: CN application CN201910186313.3A filed; published as CN110047516A; status Pending
Non-Patent Citations (1)
LINJUAN ZHANG ET AL.: "Gender-Aware CNN-BLSTM for Speech Emotion Recognition", ICANN 2018: Artificial Neural Networks and Machine Learning.
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190723 |