CN103456302A - Emotion speaker recognition method based on emotion GMM model weight synthesis - Google Patents

Emotion speaker recognition method based on emotion GMM model weight synthesis

Info

Publication number
CN103456302A
Authority
CN
China
Prior art keywords
emotion
speaker
model
neutral
weight
Prior art date
Legal status
Granted
Application number
CN2013103945338A
Other languages
Chinese (zh)
Other versions
CN103456302B (en)
Inventor
杨莹春
陈力
吴朝晖
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201310394533.8A
Publication of CN103456302A
Application granted
Publication of CN103456302B
Legal status: Active

Abstract

The invention discloses an emotion speaker recognition method based on emotion GMM model weight synthesis. The method comprises the following steps: (1) for each speaker, a neutral GMM model is built, and the different emotion GMM models are derived from it according to the corresponding neutral-emotion weight transformation model; (2) the speech of the speaker to be identified is collected, speech features are extracted, and the extracted features are scored against all the emotion GMM models obtained in step (1); (3) all scores are compared, and the speaker corresponding to the highest-scoring emotion GMM model is taken as the speaker to be identified. By building a neutral-emotion weight model for each speaker, the method improves the robustness of recognition to emotional variation in the speaker's voice and the accuracy of speaker recognition, while only neutral speech needs to be collected from the speaker.

Description

Emotional speaker recognition method based on emotion GMM model weight synthesis
Technical field
The present invention relates to signal processing and pattern recognition, and more specifically to an emotional speaker recognition method based on emotion GMM model weight synthesis.
Background technology
Speaker recognition refers to the technology of identifying a speaker's identity from collected speech by means of signal processing and pattern recognition, and mainly comprises two steps: speaker model training and recognition of the test speech. Emotional speaker recognition addresses the performance degradation of speaker recognition systems caused by the emotional mismatch between an enrolled speaker's training speech and the test speech. The method proposed in this patent improves the recognition performance of the system by building virtual emotion models for each speaker.
At present, the short-time speech features most commonly used for speaker recognition include Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC) and perceptual linear prediction coefficients (PLP). The main speaker recognition algorithms include vector quantization (VQ), the universal background model approach (GMM-UBM) and support vector machines (SVM); among these, GMM-UBM is applied very widely throughout the speaker recognition field.
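For illustration, the sketch below shows short-time MFCC feature extraction of the kind typically used as the front end of a GMM-UBM system; librosa and the frame settings (13 coefficients, 25 ms windows with a 10 ms shift, 0.97 pre-emphasis) are assumptions of the sketch and are not specified by the patent.

```python
# Minimal sketch of short-time MFCC extraction (front end for a GMM-UBM system).
# Assumes librosa; the frame and filter settings are illustrative, not from the patent.
import numpy as np
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return an (n_frames, n_mfcc) matrix of MFCC vectors for one utterance."""
    signal, _ = librosa.load(wav_path, sr=sr)
    # Pre-emphasis boosts the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms window, 10 ms shift
    return mfcc.T
```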
In emotional speaker recognition the training speech is generally neutral, because in real applications a user typically provides only neutral speech to train his or her own model. At test time, however, the speech may carry various emotions such as happiness or sadness, and traditional speaker recognition systems cannot handle this mismatch between the training and testing conditions.
Summary of the invention
The invention provides an emotional speaker recognition method based on emotion GMM model weight synthesis. By building a neutral-emotion weight model for each speaker, the method improves the robustness of recognition to emotional variation and the accuracy of speaker recognition while requiring only the speaker's neutral speech to be collected.
An emotional speaker recognition method based on emotion GMM model weight synthesis comprises the following steps:
(1) For each speaker, build the speaker's neutral GMM model and, according to the corresponding neutral-emotion weight transformation model, derive the different emotion GMM models.
The emotions referred to in the invention can be chosen freely, for example happiness, anger, panic, sadness or depression. The more emotion categories are selected, the more accurate the final recognition result, but the computational cost also increases correspondingly; in practice an appropriate number of emotion categories can be selected as needed, and one emotion GMM model is built for each selected emotion.
(2) Collect the speech of the speaker to be identified, extract speech features, and compute a score for the extracted features against every emotion GMM model obtained in step (1).
In this step, a corresponding neutral GMM model and emotion GMM models have been built in step (1) for every speaker to be identified; a speaker for whom no such models were built in step (1) cannot be identified.
(3) Compare all scores; the speaker corresponding to the emotion GMM model with the highest score is the speaker to be identified.
A mapping exists between the weights of each speaker's neutral model and the weights of the corresponding emotion models; using this mapping, the emotion models can be computed directly from the neutral model. The neutral-emotion weight transformation model can be built with any of various prior-art algorithms, as long as a mapping between the neutral model and the emotion models can be established; preferably, the neutral-emotion weight transformation model is built using a radial basis function neural network or sparse representation.
Preferably, the neutral-emotion weight transformation model is built through the following steps:
1-1. In the development set, extract the short-time speech features of the different speakers under all emotional states and train an emotion-independent Gaussian mixture universal background model with the EM algorithm;
1-2. Using this universal background model, obtain the neutral GMM model of each speaker in the development set by mean adaptation and weight adaptation;
1-3. Using the neutral GMM model of step 1-2, obtain the emotion GMM models under the various emotional states by weight adaptation;
1-4. Using the weights of the neutral GMM model of step 1-2 and the weights of the emotion GMM models of step 1-3, train a radial basis function neural network or a sparse representation model to obtain the neutral-emotion weight transformation model.
The development set in the present invention refers to a set of speakers chosen arbitrarily before the invention is applied; the speakers in the subsequent recognition process are not necessarily the same as those in the development set and may or may not overlap with them.
Preferably, when a radial basis function neural network is used to obtain the neutral-emotion weight transformation model, the steps are specifically as follows: in the development set, use each speaker's neutral GMM weight sequence and the corresponding weight sequence of each of that speaker's emotion GMMs, and train, with the orthogonal least squares algorithm, the mapping between the neutral GMM weight sequence and each emotion GMM weight sequence, i.e. the neutral-emotion weight transformation model.
Preferably, when sparse representation is used to obtain the neutral-emotion weight transformation model, the steps are specifically as follows: in the development set, use each speaker's neutral GMM weight sequence and the corresponding weight sequence of each of that speaker's emotion GMMs to build a neutral-emotion aligned dictionary, i.e. the neutral-emotion weight transformation model.
By building a neutral-emotion weight model for each speaker, the emotional speaker recognition method based on emotion GMM model weight synthesis of the present invention improves the robustness of recognition to emotional variation and the accuracy of speaker recognition while requiring only the speaker's neutral speech to be collected.
Brief description of the drawings
Fig. 1 is the flowchart of the emotional speaker recognition method based on emotion GMM model weight synthesis of the present invention;
Fig. 2 is the structural diagram of the radial basis function neural network in the emotional speaker recognition method based on emotion GMM model weight synthesis of the present invention;
Fig. 3 is the structural diagram of the neutral-emotion aligned dictionary in the emotional speaker recognition method based on emotion GMM model weight synthesis of the present invention.
Embodiment
The emotional speaker recognition method based on emotion GMM model weight synthesis of the present invention is described in detail below with reference to the accompanying drawings.
The experimental data in the present invention come from the Chinese emotional speech database (MASC), which was recorded in a quiet environment with an Olympus DM-20 voice recorder and consists of 68 native Chinese speakers, 45 male and 23 female. The recognition method provided by the invention allows many choices; in this embodiment, for ease of description and to provide concrete test results, five emotional states were chosen, namely neutral, panic, happiness, anger and sadness, and each speaker has speech under all five emotional states. Under the neutral emotion each speaker reads 2 paragraphs (about 30 s of recording) and reads 5 words and 20 sentences 3 times each; under each of the other emotional states each speaker reads 5 words and 20 sentences 3 times each. For each speaker, the words and sentences read under the neutral state and under the other emotional states are identical, and all speakers read the same words and sentences.
The tests in the present invention were run on a Lenovo workstation configured with an E5420 CPU at 2.5 GHz and 4 GB of memory; the experiments were implemented in the Visual Studio environment.
As shown in Fig. 1, an emotional speaker recognition method based on emotion GMM model weight synthesis comprises the following steps:
(1) For each speaker, build the speaker's neutral GMM model and, according to the corresponding neutral-emotion weight transformation model, derive the different emotion GMM models.
In the tests, the speech of several arbitrarily chosen speakers forms the development set; in general no fewer than 10 speakers are chosen. For example, the speech of the first 18 speakers is used as the development set, which contains all of the recordings of these 18 speakers under the neutral state and the other emotional states, and the UBM (the Gaussian mixture universal background model of the prior art) is trained on this development set.
In the tests, the speakers in the development set are removed and the remaining speakers form the evaluation set; the neutral GMM model of each speaker in the evaluation set is obtained from the UBM trained on the development set by mean adaptation and weight adaptation.
The neutral-emotion weight transformation model in this step is built through the following steps:
1-1. In the development set, extract the short-time speech features of the different speakers under all emotional states and train an emotion-independent Gaussian mixture universal background model with the EM algorithm.
The speech signals of the different speakers in the development set under the neutral and the other emotional states are first preprocessed; preprocessing comprises sampling and quantization, removal of the DC drift, pre-emphasis (boosting the high-frequency part of the signal) and windowing (dividing a speech signal into short segments), and short-time speech features are extracted from each segment.
The short-time speech features of all speakers are used to train, with the EM algorithm, the emotion-independent Gaussian mixture universal background model UBM λ(x), whose expression is as follows:
\lambda(x) = \sum_{i=1}^{n} \omega_i \, \Phi(\mu_i, \Sigma_i; x)
where ω_i denotes the weight of the i-th Gaussian component;
Φ denotes the Gaussian density function;
μ_i denotes the mean of the i-th Gaussian component;
Σ_i denotes the variance of the i-th Gaussian component;
x denotes the short-time speech feature;
n denotes the number of Gaussian components, which can be adjusted as needed and is usually set to 512.
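As an illustration of step 1-1, the sketch below trains the emotion-independent UBM by EM, using scikit-learn's GaussianMixture as a stand-in; the library choice, the diagonal covariances and the stopping settings are assumptions, since the patent only specifies EM training with n Gaussian components (typically 512).

```python
# Minimal sketch of UBM training: λ(x) = Σ_i ω_i Φ(μ_i, Σ_i; x), estimated by EM.
# scikit-learn's GaussianMixture is an assumed stand-in for the patent's EM training.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=512):
    """features: (n_frames_total, dim) array pooled over all development speakers and emotions."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, reg_covar=1e-4)
    ubm.fit(features)                                   # EM estimation of ω_i, μ_i, Σ_i
    return ubm.weights_, ubm.means_, ubm.covariances_   # (n,), (n, dim), (n, dim)
```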
1-2. Using this universal background model, obtain the neutral GMM model of each speaker in the development set by mean adaptation and weight adaptation.
For each speaker in the development set, the speaker's neutral GMM model is obtained from the speech under the neutral emotion by mean adaptation and weight adaptation. The prior art usually adapts only the means; in the present invention both the means and the weights are adapted, and the weight adaptation is implemented with the same method as the mean adaptation.
1-3. Using the neutral GMM model of step 1-2, obtain the emotion GMM models under the various emotional states (one emotion GMM model per emotional state) by weight adaptation; in this step the weight adaptation uses the same method as in step 1-2 (a sketch of this adaptation is given below).
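For steps 1-2 and 1-3, the sketch below adapts the UBM means and weights to one speaker's data in the style of relevance-factor MAP adaptation; the relevance factor r = 16 and the final weight renormalization are assumptions, since the patent only states that the weights are adapted with the same method as the means.

```python
# Minimal sketch of adapting UBM weights and means to a speaker's (neutral or emotional)
# feature set. Relevance-factor MAP adaptation is an assumed concrete realization.
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt(features, weights, means, covars, r=16.0):
    """features: (T, dim); weights: (n,); means, covars: (n, dim) diagonal covariances."""
    T, n = len(features), len(weights)
    # Posterior responsibility γ_t(i) of each Gaussian component for each frame.
    log_like = np.stack([multivariate_normal.logpdf(features, means[i], np.diag(covars[i]))
                         for i in range(n)], axis=1)            # (T, n)
    log_post = np.log(weights) + log_like
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    n_i = post.sum(axis=0)                                       # soft counts per component
    E_x = (post.T @ features) / np.maximum(n_i[:, None], 1e-10)  # first-order statistics
    alpha = n_i / (n_i + r)                                      # data-dependent interpolation
    new_means = alpha[:, None] * E_x + (1 - alpha[:, None]) * means
    new_weights = alpha * (n_i / T) + (1 - alpha) * weights
    new_weights /= new_weights.sum()                             # keep a valid weight vector
    return new_weights, new_means
```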
1-4. Using the weights of the neutral GMM model of step 1-2 and the weights of the emotion GMM models of step 1-3, train a radial basis function neural network or a sparse representation model to obtain the neutral-emotion weight transformation model.
In the tests, both embodiments, the radial basis function neural network and the sparse representation model, are used to obtain the neutral-emotion weight transformation model, and their test results are compared.
When the radial basis function neural network is used to obtain the neutral-emotion weight transformation model, the steps are specifically as follows: in the development set, use each speaker's neutral GMM weight sequence and the corresponding weight sequence of each of that speaker's emotion GMMs, and train, with the orthogonal least squares algorithm, the mapping between the neutral GMM weight sequence and each emotion GMM weight sequence, i.e. the neutral-emotion weight transformation model.
The weight sequence of each development-set speaker's neutral GMM model is denoted [ω_{N,1}, ω_{N,2}, ..., ω_{N,n}], where N denotes the neutral emotional state and n is the number of Gaussian components; the weight sequence of the corresponding emotion GMM model of this speaker is denoted [ω_{E,1}, ω_{E,2}, ..., ω_{E,n}], where E denotes the emotional state and n is the number of Gaussian components.
As shown in Fig. 2, the radial basis function neural network consists of an input layer, a hidden layer and an output layer. The input layer takes the weight sequence of the neutral GMM model, the output layer produces the weight sequence of an emotion GMM model (one output weight sequence for each emotional state of each speaker), and the hidden-layer activation function K(x) is a radial basis function with the following expression:
K(x) = e^{-\left\| \frac{x - \nu}{\theta} \right\|^{2}}
where x is the input of the input layer, i.e. the weight sequence of the neutral GMM model;
ν is the mean (center) of the radial basis function;
θ is the variance (spread) of the radial basis function.
When the radial basis function neural network is trained, ν and θ are computed by K-means clustering, and the weights w between the hidden layer and the output layer are computed by the orthogonal least squares algorithm; these weights w constitute the neutral-emotion weight transformation model (for the detailed computation see [R. J. Schilling, J. J. Carroll and A. F. Al-Ajlouni, "Approximation of nonlinear systems with radial basis function neural networks," IEEE Transactions on Neural Networks, vol. 12, no. 1, pp. 21-28, 2001]).
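The sketch below illustrates this RBF mapping: K-means supplies the centers ν and a common spread θ, and the hidden-to-output weights w are fitted by least squares. Plain numpy least squares is used here instead of the orthogonal least squares algorithm of the cited paper, and the hidden-layer size and the spread heuristic are assumptions.

```python
# Minimal sketch of the neutral-to-emotion weight mapping with an RBF network.
# K-means gives the centers; ordinary least squares stands in for orthogonal least squares.
import numpy as np
from sklearn.cluster import KMeans

def rbf_design(X, centers, theta):
    """Hidden-layer activations K(x) = exp(-||x - ν||^2 / θ^2) for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / theta ** 2)

def train_rbf(neutral_weights, emotion_weights, n_hidden=10):
    """Both inputs: (n_speakers, n_components) weight sequences from the development set."""
    n_hidden = min(n_hidden, len(neutral_weights))
    km = KMeans(n_clusters=n_hidden, n_init=10).fit(neutral_weights)
    centers = km.cluster_centers_
    theta = np.mean(np.linalg.norm(neutral_weights - centers[km.labels_], axis=1)) + 1e-8
    H = rbf_design(neutral_weights, centers, theta)
    W, *_ = np.linalg.lstsq(H, emotion_weights, rcond=None)   # hidden-to-output weights w
    return centers, theta, W

def predict_emotion_weights(neutral_w, centers, theta, W):
    """Map one neutral weight sequence to a virtual emotion weight sequence."""
    w_e = rbf_design(np.atleast_2d(neutral_w), centers, theta) @ W
    w_e = np.clip(w_e, 1e-8, None)
    return (w_e / w_e.sum(axis=1, keepdims=True)).ravel()     # renormalize to sum to 1
```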
When sparse representation is used to obtain the neutral-emotion weight transformation model, the steps are specifically as follows: in the development set, use each speaker's neutral GMM weight sequence and the corresponding weight sequence of each of that speaker's emotion GMMs to build a neutral-emotion aligned dictionary, i.e. the neutral-emotion weight transformation model.
As shown in Fig. 3, each dashed box in Fig. 3 is one neutral-emotion aligned dictionary, in which each column consists of one speaker's neutral GMM weight sequence together with one of that speaker's emotion GMM weight sequences; each speaker thus corresponds to 4 neutral-emotion aligned dictionaries (one per non-neutral emotional state).
In Fig. 3, the upper half D_N contains the neutral GMM weight sequences of all speakers in the development set, the lower half D_E contains the emotion GMM weight sequences of all speakers in the development set, and M is the number of speakers in the development set.
After the neutral-emotion weight transformation model has been obtained on the development set, a corresponding neutral GMM model and emotion GMM models are built for each speaker in the evaluation set (the set of the 50 speakers that remain after removing the 18 development-set speakers from the 68 speakers); the building procedure differs according to how the neutral-emotion weight transformation model was obtained.
When the radial basis function neural network is used to obtain the neutral-emotion weight transformation model, each evaluation-set speaker's neutral GMM model is first computed from the UBM of step (1) by mean adaptation and weight adaptation. The neutral GMM weight sequence is denoted ω_{N,enroll} and the emotion GMM weight sequence ω_{E,enroll}; the virtual emotion weight sequence ω_{E,enroll} is computed as
\omega_{E,enroll} = \sum_{j=1}^{C} w_j K_j(\omega_{N,enroll})
where C is the number of hidden-layer neurons, K_j is the j-th hidden-layer activation function, and w_j is the weight between the j-th hidden-layer neuron and the output layer.
When sparse representation is used to obtain the neutral-emotion weight transformation model, each evaluation-set speaker's neutral GMM model is first computed from the UBM of step (1) by mean adaptation and weight adaptation. From the neutral GMM weight sequence [ω_{N,1}, ω_{N,2}, ..., ω_{N,n}] (denoted ω_N) and the neutral GMM weight dictionary D_N, the sparse coefficient vector B is obtained by solving
\arg\min_{B} \|B\|_1 \quad \text{subject to} \quad \|D_N B - \omega_N\| \le \epsilon
where ε is the error bound, which can be set according to the specific situation and is set to 1.3 in this embodiment; for the detailed computation see [J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, 2009].
The virtual emotion weight sequence is then computed as ω_{E,enroll} = D_E B.
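The sketch below illustrates the sparse-representation variant: the development speakers' weight sequences are stacked column-wise into D_N and D_E, a sparse code B is computed for an enrolled speaker's neutral weights, and the virtual emotion weights are synthesized as ω_{E,enroll} = D_E B. A Lasso formulation is used as a stand-in for the constrained ℓ1 problem, so the regularization strength and the final renormalization are assumptions.

```python
# Minimal sketch of the sparse-representation weight synthesis.
# Lasso (penalized ℓ1) stands in for the constrained problem min ||B||_1 s.t. ||D_N B - ω_N|| <= ε.
import numpy as np
from sklearn.linear_model import Lasso

def build_dictionaries(neutral_weights, emotion_weights):
    """Each column of D_N / D_E is one development speaker's weight sequence."""
    D_N = np.asarray(neutral_weights).T          # (n_components, M speakers)
    D_E = np.asarray(emotion_weights).T
    return D_N, D_E

def synthesize_emotion_weights(omega_N, D_N, D_E, alpha=1e-3):
    """omega_N: enrolled speaker's neutral weight sequence, shape (n_components,)."""
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(D_N, omega_N)                       # sparse code B over development speakers
    B = lasso.coef_
    omega_E = D_E @ B                             # ω_E,enroll = D_E · B
    omega_E = np.clip(omega_E, 1e-8, None)
    return omega_E / omega_E.sum()                # renormalize to a valid weight vector
```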
For each speaker in the evaluation set, the means and variances of the Gaussian components of the speaker's neutral GMM model, together with the virtual emotion weights, form the speaker's corresponding emotion GMM model:
\lambda_E(x) = \sum_{i=1}^{n} \omega_{E,enroll,i} \, \Phi(\mu_{N,i}, \Sigma_{N,i}; x)
where Φ denotes the Gaussian density function;
μ_{N,i} denotes the mean of the i-th Gaussian component under the neutral emotional state;
Σ_{N,i} denotes the variance of the i-th Gaussian component under the neutral emotional state;
ω_{E,enroll,i} denotes the weight of the i-th Gaussian component in the virtual emotion weight sequence ω_{E,enroll};
x denotes the short-time speech feature;
n denotes the number of Gaussian components, set to 512 in this embodiment.
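The sketch below assembles the virtual emotion GMM λ_E(x): the Gaussian means and variances are taken unchanged from the speaker's neutral model and only the weights are replaced by the synthesized virtual emotion weights; the dictionary-based model container is an assumption of the sketch.

```python
# Minimal sketch of forming λ_E(x) = Σ_i ω_E,enroll,i Φ(μ_N,i, Σ_N,i; x).
def make_emotion_gmm(neutral_model, virtual_weights):
    """neutral_model: {'weights', 'means', 'covars'}; virtual_weights: synthesized ω_E,enroll."""
    return {
        "weights": virtual_weights,           # ω_E,enroll,i  (replaced)
        "means":   neutral_model["means"],    # μ_N,i         (unchanged)
        "covars":  neutral_model["covars"],   # Σ_N,i         (unchanged)
    }
```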
After the neutral GMM models and emotion GMM models of all speakers in the evaluation set have been built, speaker identification begins.
(2) Collect the speech of the speaker to be identified, extract the short-time speech features, and compute a score for the extracted short-time speech features against every emotion GMM model obtained in step (1).
In this step, a corresponding neutral GMM model and emotion GMM models have been built in step (1) for every speaker to be identified.
The speech to be identified is scored by likelihood against all neutral GMM models and all emotion GMM models in the evaluation set; for the k-th speaker's models in the evaluation set, the likelihood scores of a short-time speech feature x_t of the speech to be identified are computed with the following formulas:
s_{N,k} = \sum_{i=1}^{n} \omega_{N,i,k} \, N(x_t; \mu_{N,i,k}, \Sigma_{N,i,k})
where s_{N,k} is the score of the speech to be identified on the k-th speaker's neutral GMM model;
ω_{N,i,k} is the weight of the i-th Gaussian component in the k-th speaker's neutral GMM model;
x_t is the short-time speech feature;
μ_{N,i,k} is the mean of the i-th Gaussian component under the k-th speaker's neutral emotional state;
Σ_{N,i,k} is the variance of the i-th Gaussian component under the k-th speaker's neutral emotional state;
n is the number of Gaussian components, set to 512 in this embodiment.
s_{E,k} = \sum_{i=1}^{n} \omega_{E,i,k} \, N(x_t; \mu_{N,i,k}, \Sigma_{N,i,k})
where s_{E,k} is the score of the speech to be identified on the k-th speaker's emotion GMM model;
ω_{E,i,k} is the weight of the i-th Gaussian component in the k-th speaker's emotion GMM model;
x_t is the short-time speech feature;
μ_{N,i,k} is the mean of the i-th Gaussian component under the k-th speaker's neutral emotional state;
Σ_{N,i,k} is the variance of the i-th Gaussian component under the k-th speaker's neutral emotional state;
n is the number of Gaussian components, set to 512 in this embodiment.
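The sketch below illustrates the likelihood scoring of step (2). The patent's formulas give the score of a single frame x_t; averaging the per-frame log-likelihoods over the utterance, as done here, is a common convention and is an assumption of the sketch, as is the dictionary-based model container.

```python
# Minimal sketch of scoring speech against one GMM (neutral or emotional).
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(features, model):
    """features: (T, dim); model: {'weights', 'means', 'covars'} with diagonal covariances."""
    n = len(model["weights"])
    log_comp = np.stack([np.log(model["weights"][i]) +
                         multivariate_normal.logpdf(features, model["means"][i],
                                                    np.diag(model["covars"][i]))
                         for i in range(n)], axis=1)          # (T, n)
    frame_scores = np.logaddexp.reduce(log_comp, axis=1)      # log Σ_i ω_i N(x_t; μ_i, Σ_i)
    return frame_scores.mean()                                # utterance-level score
```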
(3) Compare all scores; the speaker corresponding to the emotion GMM model with the highest score is the speaker to be identified.
For the k-th speaker's models, the final score S_k of the short-time speech feature x_t of the speech to be identified is the maximum of the likelihood scores over that speaker's neutral GMM model and emotion GMM models:
S_k = \max(s_{N,k}, s_{E,k})
For example, if for the k-th speaker's models a segment of speech to be identified obtains its maximum score on the happiness emotion model, the happiness score is taken as S_k.
The speaker model on which the utterance to be identified obtains the maximum score gives the final recognition result, as shown below:
id = \arg\max_{k} S_k
where id is the index of the speaker model with the maximum score.
For example, if a segment of speech to be identified obtains its largest S_k on the 20th speaker's models, the recognition result is that the speech to be identified was uttered by the 20th speaker.
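The sketch below illustrates the decision rule of step (3), reusing the gmm_log_likelihood helper from the previous sketch: for each enrolled speaker k the maximum of the neutral-model and emotion-model scores gives S_k, and the speaker with the largest S_k is returned. The list-of-dictionaries container for the enrolled models is an assumption.

```python
# Minimal sketch of the final decision: S_k = max(s_N,k, s_E,k), id = argmax_k S_k.
import numpy as np

def identify_speaker(features, speaker_models):
    """speaker_models: list; each entry holds one speaker's neutral GMM and emotion GMMs."""
    final_scores = []
    for spk in speaker_models:
        scores = [gmm_log_likelihood(features, spk["neutral"])]
        scores += [gmm_log_likelihood(features, m) for m in spk["emotions"]]
        final_scores.append(max(scores))          # S_k
    return int(np.argmax(final_scores))           # index of the identified speaker
```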
All sentences under the five kinds of emotional speech in the evaluation set are tested; the test speech amounts to 15,000 utterances (50 evaluation speakers × 5 emotions × 60 sentences, where the 60 sentences are 20 sentences each repeated 3 times). The experiments simulate the speaker identification process, and the experimental results are compared with those of the baseline GMM-UBM system in Table 1.
Table 1
Emotion category      Baseline GMM-UBM    RBF neural network    Sparse representation
Neutral               90.87%              95.23%                96.47%
Anger                 41.83%              51.97%                50.27%
Happiness             44.80%              53.57%                51.20%
Panic                 39.20%              46.70%                45.57%
Sadness               65.80%              69.60%                67.70%
Average               56.50%              63.41%                62.24%
As can be seen from Table 1, the proposed method can synthesize speakers' emotion models effectively, and the recognition accuracy under the various emotional states is greatly improved; the overall recognition accuracy is improved by 6.91% and 5.74% for the radial basis function neural network and the sparse representation respectively, which shows that the method considerably improves the accuracy and robustness of emotional speaker recognition.

Claims (5)

1. An emotional speaker recognition method based on emotion GMM model weight synthesis, characterized in that the steps are as follows:
(1) For each speaker, build the speaker's neutral GMM model and, according to the corresponding neutral-emotion weight transformation model, derive the different emotion GMM models;
(2) Collect the speech of the speaker to be identified, extract speech features, and compute a score for the extracted features against every emotion GMM model obtained in step (1);
(3) Compare all scores; the speaker corresponding to the emotion GMM model with the highest score is the speaker to be identified.
2. The emotional speaker recognition method based on emotion GMM model weight synthesis according to claim 1, characterized in that the neutral-emotion weight transformation model is built using a radial basis function neural network or sparse representation.
3. The emotional speaker recognition method based on emotion GMM model weight synthesis according to claim 2, characterized in that the neutral-emotion weight transformation model is built through the following steps:
1-1. In the development set, extract the short-time speech features of the different speakers under all emotional states and train an emotion-independent Gaussian mixture universal background model with the EM algorithm;
1-2. Using this universal background model, obtain the neutral GMM model of each speaker in the development set by mean adaptation and weight adaptation;
1-3. Using the neutral GMM model of step 1-2, obtain the emotion GMM models under the various emotional states by weight adaptation;
1-4. Using the weights of the neutral GMM model of step 1-2 and the weights of the emotion GMM models of step 1-3, train a radial basis function neural network or a sparse representation model to obtain the neutral-emotion weight transformation model.
4. The emotional speaker recognition method based on emotion GMM model weight synthesis according to claim 3, characterized in that, when a radial basis function neural network is used to obtain the neutral-emotion weight transformation model, the steps are specifically as follows: in the development set, use each speaker's neutral GMM weight sequence and the corresponding weight sequence of each of that speaker's emotion GMMs, and train, with the orthogonal least squares algorithm, the mapping between the neutral GMM weight sequence and each emotion GMM weight sequence, i.e. the neutral-emotion weight transformation model.
5. The emotional speaker recognition method based on emotion GMM model weight synthesis according to claim 3, characterized in that, when sparse representation is used to obtain the neutral-emotion weight transformation model, the steps are specifically as follows: in the development set, use each speaker's neutral GMM weight sequence and the corresponding weight sequence of each of that speaker's emotion GMMs to build a neutral-emotion aligned dictionary, i.e. the neutral-emotion weight transformation model.
CN201310394533.8A 2013-09-02 2013-09-02 Emotional speaker recognition method based on emotion GMM model weight synthesis Active CN103456302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310394533.8A CN103456302B (en) 2013-09-02 2013-09-02 Emotional speaker recognition method based on emotion GMM model weight synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310394533.8A CN103456302B (en) 2013-09-02 2013-09-02 Emotional speaker recognition method based on emotion GMM model weight synthesis

Publications (2)

Publication Number Publication Date
CN103456302A true CN103456302A (en) 2013-12-18
CN103456302B CN103456302B (en) 2016-04-20

Family

ID=49738601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310394533.8A Active CN103456302B (en) 2013-09-02 2013-09-02 Emotional speaker recognition method based on emotion GMM model weight synthesis

Country Status (1)

Country Link
CN (1) CN103456302B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178897A (en) * 2007-12-05 2008-05-14 浙江大学 Speaker recognition method using the fundamental frequency envelope to eliminate the influence of emotional speech
CN101226743A (en) * 2007-12-05 2008-07-23 浙江大学 Speaker recognition method based on conversion between neutral and emotional voiceprint models
CN101226742A (en) * 2007-12-05 2008-07-23 浙江大学 Voiceprint recognition method based on emotion compensation
US20100217595A1 (en) * 2009-02-24 2010-08-26 Korea Institute Of Science And Technology Method For Emotion Recognition Based On Minimum Classification Error

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139855A (en) * 2014-05-29 2015-12-09 哈尔滨理工大学 Speaker identification method with two-stage sparse decomposition and device
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device
CN104167208B (en) * 2014-08-08 2017-09-15 中国科学院深圳先进技术研究院 Speaker recognition method and device
CN104464724A (en) * 2014-12-08 2015-03-25 南京邮电大学 Speaker recognition method for deliberately pretended voices
CN108831435A (en) * 2018-06-06 2018-11-16 安徽继远软件有限公司 Emotional speech synthesis method based on multi-emotion speaker adaptation
CN108831435B (en) * 2018-06-06 2020-10-16 安徽继远软件有限公司 Emotional voice synthesis method based on multi-emotion speaker self-adaption
CN109491338A (en) * 2018-11-09 2019-03-19 南通大学 Sparse-GMM-based quality-related fault diagnosis method for multimode processes
CN110060657A (en) * 2019-04-04 2019-07-26 南京邮电大学 Multi-to-multi voice conversion method based on SN
CN110060692A (en) * 2019-04-19 2019-07-26 山东优化信息科技有限公司 Voiceprint recognition system and recognition method thereof
CN113327620A (en) * 2020-02-29 2021-08-31 华为技术有限公司 Voiceprint recognition method and device
WO2021169365A1 (en) * 2020-02-29 2021-09-02 华为技术有限公司 Voiceprint recognition method and device

Also Published As

Publication number Publication date
CN103456302B (en) 2016-04-20


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant