CN102592593B - Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech - Google Patents

Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Info

Publication number: CN102592593B
Application number: CN201210091525.1A
Authority: CN (China)
Other versions: CN102592593A (en)
Inventors: 吴强 (Wu Qiang), 刘琚 (Liu Ju), 孙建德 (Sun Jiande)
Original assignee: Shandong University
Current assignee: Shandong University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Shandong University; priority to CN201210091525.1A
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Landscapes

  • Complex Calculations (AREA)
  • Machine Translation (AREA)
Abstract

The invention discloses an emotional-feature extraction method for speech implemented by considering the sparsity of multilinear groups. The method comprises the following steps: considering the multiple factors carried by a speech signal, such as time, frequency, scale, and direction information; extracting features with a group sparse multilinear decomposition method; giving the energy spectrum of the speech signal a multilinear representation through Gabor functions of different scales and directions; solving the feature projection matrix with a group sparse tensor decomposition method; computing the feature projection on the frequency mode; decorrelating the features through the discrete cosine transform; and finally computing the first- and second-order difference coefficients to obtain the emotional features of the speech. According to the invention, the factors of time, frequency, scale, and direction in the speech signal are all taken into account for emotional feature extraction, and the feature projection is performed with a group sparse tensor decomposition method, ultimately improving the accuracy of multi-class speech emotion recognition.

Description

An emotional feature extraction method considering multilinear group sparsity in speech
Technical field
The present invention relates to a speech emotion feature extraction method for improving the performance of speech emotion recognition, and belongs to the field of speech processing technology.
Background art
Speech is one of the most natural ways people communicate in daily life, which has led researchers to explore how speech can serve as a medium of exchange between humans and machines. Beyond traditional interaction modes such as speech recognition, the speaker's emotion is also an important form of interactive information, and a machine's ability to automatically understand the speaker's emotion is one of the hallmarks of intelligent human-computer interaction.
Speech emotion recognition is of significant value in signal processing and intelligent human-computer interaction, with many potential applications. In human-computer interaction, recognizing the speaker's emotion allows a computer to respond more warmly and accurately; for example, a distance-education system can adjust its course content in time by recognizing a student's emotion, thereby improving teaching effectiveness. In telephone call centers and mobile communications, the user's emotional information can be obtained promptly to improve the quality of service. An in-vehicle system can detect through emotion recognition whether the driver is attentive and issue an appropriate auxiliary warning. In medicine, speech-based emotion recognition can serve as a tool to help doctors diagnose a patient's condition.
For speech emotion recognition, a key problem is how to extract effective features to represent different emotions. In traditional feature extraction, an utterance is usually divided into multiple frames so that each frame is approximately stationary. Features obtained frame by frame, such as pitch and energy, are called local features; their advantage is that existing classifiers can estimate the parameters of different emotional states fairly accurately from them, while their disadvantage is that the feature dimensionality and sample count are large, which slows feature extraction and classification. Features obtained by computing statistics over a whole sentence are called global features; they yield better classification accuracy and speed, but lose the temporal information of the speech signal and easily suffer from a shortage of training samples. In general, the features commonly used in speech emotion recognition fall into a few classes: continuous acoustic features, spectral features, features based on the Teager energy operator, and so on.
According to studies in psychology and prosody, the most intuitive cues to a speaker's emotion in speech are the continuous prosodic features, such as pitch, energy, and speaking rate. The corresponding global features include the mean, median, standard deviation, maximum, and minimum of pitch or energy, as well as the first and second formants.
Spectral features provide the useful frequency information in the speech signal and are also an important feature extraction approach in speech emotion recognition. Commonly used spectral features include linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP).
Speech is produced by nonlinear airflow in the vocal system. The Teager energy operator (TEO), proposed by Teager et al., is an operation that can rapidly track changes in signal energy within a glottal cycle and is used to analyze the fine structure of speech. Under different emotional states, muscle tension affects the motion of airflow in the vocal system; according to the findings of Bou-Ghazale et al., TEO-based features can be used to detect stress in speech.
Numerous experimental evaluations show that, for speech emotion recognition, a suitable feature representation should be selected for each classification task: features based on Teager energy are suited to detecting stress in the speech signal; continuous acoustic features are suited to distinguishing high-arousal emotions from low-arousal emotions; and for multi-class emotion classification tasks, spectral features are the best-suited speech representation. Combining spectral features with continuous acoustic features, or jointly analyzing multiple factors, can also improve classification accuracy.
Another important stage, after speech emotion features have been extracted and selected, is classification. In pattern recognition, a variety of classifiers have been applied to speech emotion features, including hidden Markov models (HMM), Gaussian mixture models (GMM), support vector machines (SVM), linear discriminant analysis (LDA), and ensemble classifiers. The hidden Markov model is one of the most widely used recognition algorithms in speech emotion recognition, owing to its broad application to speech signals; it is particularly suited to data with a temporal structure, and current results show that emotion recognition systems based on HMMs can deliver relatively high classification accuracy. A Gaussian mixture model can be regarded as an HMM with only one state and is well suited to modeling multivariate distributions; Breazeal et al. applied a GMM classifier to the KISMET speech database to classify five kinds of emotion. Support vector machines are widely used in pattern recognition; their basic principle is to project features into a higher-dimensional space through a kernel function so that the features become linearly separable. Compared with HMM and GMM, SVMs have the advantages of a globally optimal training algorithm and generalization bounds that do not depend on the data distribution, and many studies have used support vector machines as the classifier for speech emotion recognition and obtained good classification results.
As shown in Figure 1, a traditional speech emotion recognition method based on spectral features usually adopts the following steps (a sketch of the front end follows the list):
1) Preprocess the input speech signal, including windowing, filtering, and pre-emphasis;
2) Apply the short-time Fourier transform to the signal, filter through a Mel-scale triangular filter bank, and take the logarithm to obtain the log spectrum;
3) Compute the cepstrum with the discrete cosine transform, then apply weighting and cepstral mean subtraction, and compute difference coefficients;
4) Train Gaussian mixture models (GMM) to obtain a model for each emotion;
5) Use the trained emotion models to recognize the test data and obtain the recognition accuracy.
At present, relatively good accuracy has been reached for two-class emotion classification, such as negative versus neutral emotion. For multi-class emotion classification, however, the features discriminate poorly because of the imbalance of the data and because only a single factor (frequency or time) is considered, so the emotion classification accuracy is relatively low, which limits the application of speech-based emotion recognition systems.
Summary of the invention
Aimed at the problem that feature extraction in traditional speech emotion recognition considers only a single factor, such as frequency or time, and therefore yields poorly discriminative features, the present invention proposes a speech emotion feature extraction method that considers the multilinear group sparsity of speech, which can improve the accuracy of multi-class emotion recognition.
The emotion feature extraction method of the present invention, which considers multilinear group sparsity in speech, is as follows:
Considering that a speech signal carries the multiple factors of time, frequency, scale, and direction information, features are extracted by the method of group sparse multilinear decomposition: the energy spectrum of the speech signal is given a multilinear representation by Gabor functions of different scales and directions; the feature projection matrix is solved by the group sparse tensor decomposition method; the feature projection on the frequency mode is computed; the features are decorrelated through the discrete cosine transform; and the first- and second-order difference coefficients of the features are obtained by differencing. The method specifically comprises the following steps:
(1) Acquire the speech signal s(t) (collected by a device such as a microphone), transform s(t) to the time-frequency domain by the short-time Fourier transform, and obtain the time-frequency representation S(f, t) and the energy spectrum P(f, t) of the signal;
(2) Convolve the energy spectrum with two-dimensional Gabor functions of different scales and directions. The Gabor function is defined as

$$g_{\bar{k}}(\bar{x}) = \frac{\|\bar{k}\|^2}{\sigma^2}\, e^{-\|\bar{k}\|^2\|\bar{x}\|^2/2\sigma^2}\left[e^{j\bar{k}\cdot\bar{x}} - e^{-\sigma^2/2}\right],$$

where $\bar{x} = (t, f)$ indexes the element of the energy spectrum P(f, t) at frame t and frequency f; $\bar{k} = (k_v\cos\phi,\ k_v\sin\phi)$ is the vector controlling the scale and direction of the function; j denotes the imaginary unit; $k_v = 2^{-(v+2)/2}\pi$ and $\phi = u(\pi/K)$, where u indexes the direction and v the scale of the function; K denotes the total number of directions; and σ is a constant determining the envelope of the function, set to 2π.

The result of convolving the energy spectrum P(f, t) with the Gabor functions is the multilinear representation of the speech signal (denoted $\bar{G}$ here), a 5th-order tensor whose modes are time, frequency, direction, scale, and class. Filtering the frequency mode of $\bar{G}$ with a Mel-scale triangular filter bank then yields a new 5th-order tensor $\bar{P}$ of size $N_1 \times N_2 \times N_3 \times N_4 \times N_5$, the length of mode i being $N_i$, i = 1, ..., 5;
(3) Apply the group sparse tensor decomposition to the obtained multilinear representation $\bar{P}$ and compute the projection matrices $U^{(i)}$, i = 1, ..., 5, on the different factors, for use in feature projection. The following decomposition model is set up:

$$\bar{P} \approx \bar{\Lambda} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)} \times_5 U^{(5)}$$

where $U^{(i)}$ is the projection matrix of size $N_i \times K$ obtained from the decomposition; $\bar{\Lambda}$ is a 5th-order tensor with diagonal elements 1 and size K × K × K × K × K; and $\times_i$ denotes the mode-i product of a tensor and a matrix, defined as

$$\left(\bar{X} \times_i A\right)_{n_1,\ldots,n_{i-1},k,n_{i+1},\ldots,n_M} = \sum_{n_i} \bar{X}_{n_1,\ldots,n_M}\, A_{n_i,k}$$

where $\bar{X}$ denotes an Mth-order tensor of size $N_1 \times \cdots \times N_M$, A is a matrix of size $N_i \times K$, $\bar{X}_{n_1,\ldots,n_M}$ is an element of the tensor $\bar{X}$, and $A_{n_i,k}$ is an element of the matrix A;
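For illustration, a toy numpy check of the mode-i product just defined (not part of the patent): contracting mode i of a tensor with an $N_i \times K$ matrix replaces that mode's length $N_i$ with K.

```python
import numpy as np

X = np.random.rand(4, 5, 6)     # 3rd-order tensor of size N1 x N2 x N3
A = np.random.rand(5, 2)        # N2 x K matrix applied along mode 2
# (X x_2 A)_{n1,k,n3} = sum over n2 of X_{n1,n2,n3} * A_{n2,k}
Y = np.moveaxis(np.tensordot(X, A, axes=(1, 0)), -1, 1)
print(Y.shape)                  # (4, 2, 6): mode-2 length 5 replaced by K = 2
```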
The specific procedure for computing the projection matrices $U^{(i)}$, i = 1, ..., I, is as follows, where i indexes the modes (corresponding to the different factors) and I = 5:

1. Initialize $U^{(i)} \geq 0$, i = 1, ..., I, by alternating least squares or randomly;

2. Normalize each column vector $u_k^{(i)}$, i = 1, ..., I, k = 1, ..., K, of the projection matrices $U^{(i)}$;

3. While the error objective function E [expression given only as an image in the source] is greater than a given threshold, repeat the following:

● For i = 1 to I in turn, update the columns $u_k^{(i)}$ [update formula given only as an image in the source], where $\|\cdot\|_F$ denotes the Frobenius norm, $\bar{P}^{(k)}_{(i)}$ is the mode-i matrix unfolding of the tensor $\bar{P}^{(k)}$, ⊙ is the Khatri-Rao product of matrices, ∘ denotes the vector outer product, and $\lambda_k$ and $q_i$ are weight coefficients taking values between 0 and 1 that regulate the sparsity of the terms of the objective function;

● If i ≠ 5, $\gamma_k^i = u_k^{(I)T} u_k^{(I)}$, where $u_k^{(I)T}$ denotes the transpose of $u_k^{(I)}$; if i = 5, $\gamma_k^i$ is given by [formula given only as an image in the source];

4. When the objective function E falls below the threshold, the loop ends and the projection matrices $U^{(i)}$, i = 1, ..., I, are obtained;
(4) Use the obtained projection matrix $U^{(2)}$ corresponding to the frequency domain to project the multilinear representation $\bar{P}$ of the speech signal:

$$\bar{S} = \bar{P} \times_2 U^{(2)}_{+},$$

where $[Y]_+ = \max(0, Y)$ denotes the matrix formed by the nonnegative elements of a matrix Y (elements less than 0 are set to 0), $U^{(2)}_{+}$ is the matrix formed by the nonnegative elements of the pseudo-inverse of the projection matrix $U^{(2)}$, and $\times_2$ denotes the mode-2 product of $U^{(2)}_{+}$ with the tensor $\bar{P}$;

(5) Fix the time mode and apply the tensor unfolding operation to the obtained multilinear sparse representation $\bar{S}$, obtaining a feature matrix $S^{(f)}$ of size $\hat{N}_1 \times N_1$, where $\hat{N}_1 = K \cdot N_3 \cdot N_4 \cdot N_5$;

(6) Decorrelate $S^{(f)}$ with the discrete cosine transform to obtain the speech emotion feature F; computing the first- and second-order difference coefficients of the feature yields the final emotional features.
The present invention takes the factors of time, frequency, scale, and direction in the speech signal into account for emotion feature extraction and performs feature projection by the group sparse tensor decomposition method, ultimately improving the accuracy of multi-class speech emotion recognition.
Brief description of the drawings
Fig. 1 is a schematic block diagram of a traditional speech emotion recognition process;
Fig. 2 is a schematic diagram of the feature extraction method of the present invention;
Fig. 3 is a schematic block diagram of a speech emotion recognition process that adopts the present invention;
Fig. 4 compares experimental results on four-class speech emotion recognition.
Embodiment
As shown in Figure 2, the speech emotion recognition method of the present invention based on multilinear group sparse features specifically comprises the following steps:
(1) Collect the speech signal s(t) with a device such as a microphone, transform s(t) to the time-frequency domain by the short-time Fourier transform, and obtain the time-frequency representation S(f, t) and the energy spectrum P(f, t) of the signal;
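A minimal sketch of step (1), assuming scipy and the analysis parameters given in the experiments below (8 kHz sampling, 23 ms Hamming window, 10 ms shift); the function name is illustrative:

```python
import numpy as np
from scipy.signal import stft

def energy_spectrum(s, sr=8000):
    # Time-frequency representation S(f, t) via the short-time Fourier transform.
    nperseg = int(0.023 * sr)                      # 23 ms Hamming window
    hop = int(0.010 * sr)                          # 10 ms window shift
    f, t, S = stft(s, fs=sr, window='hamming',
                   nperseg=nperseg, noverlap=nperseg - hop)
    P = np.abs(S) ** 2                             # energy spectrum P(f, t)
    return f, t, P
```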
(2) Convolve the energy spectrum with two-dimensional Gabor functions of different scales and directions to obtain the multilinear representation of the speech signal (denoted $\bar{G}$ here), then filter the frequency mode of $\bar{G}$ with a Mel-scale triangular filter bank to obtain the representation $\bar{P}$.

The Gabor function is defined as

$$g_{\bar{k}}(\bar{x}) = \frac{\|\bar{k}\|^2}{\sigma^2}\, e^{-\|\bar{k}\|^2\|\bar{x}\|^2/2\sigma^2}\left[e^{j\bar{k}\cdot\bar{x}} - e^{-\sigma^2/2}\right],$$

where $\bar{x} = (t, f)$ indexes the element of the energy spectrum P(f, t) at frame t and frequency f; $\bar{k} = (k_v\cos\phi,\ k_v\sin\phi)$ is the vector controlling the scale and direction of the function; j denotes the imaginary unit; $k_v = 2^{-(v+2)/2}\pi$ and $\phi = u(\pi/K)$, where u indexes the direction and v the scale of the function; K denotes the total number of directions; and σ is a constant determining the envelope of the function, set to 2π.

The result of the Gabor convolutional filtering of the energy spectrum P(f, t) is the multilinear representation $\bar{G}$ of the speech signal, a 5th-order tensor whose modes are time, frequency, direction, scale, and class; filtering the frequency mode of $\bar{G}$ with the Mel-scale triangular filter bank yields the new 5th-order tensor $\bar{P}$ of size $N_1 \times N_2 \times N_3 \times N_4 \times N_5$, the length of mode i being $N_i$, i = 1, ..., 5;
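A sketch of the Gabor filtering in step (2), assuming scipy and the 4 scales and 4 directions used in the experiments; the kernel support, the use of response magnitudes, and the stacking order are illustrative assumptions, and the class mode arises when utterances of different emotions are stacked:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(u, v, K=4, sigma=2 * np.pi, half=8):
    # Discretized 2-D Gabor function g_k(x) with direction u and scale v.
    k_v = 2.0 ** (-(v + 2) / 2) * np.pi
    phi = u * np.pi / K
    kx, ky = k_v * np.cos(phi), k_v * np.sin(phi)
    t, f = np.mgrid[-half:half + 1, -half:half + 1]
    ksq = kx ** 2 + ky ** 2
    gauss = (ksq / sigma ** 2) * np.exp(-ksq * (t ** 2 + f ** 2) / (2 * sigma ** 2))
    return gauss * (np.exp(1j * (kx * t + ky * f)) - np.exp(-sigma ** 2 / 2))

def gabor_representation(P, n_dirs=4, n_scales=4):
    # Stack the filter response magnitudes into a 4th-order array of modes
    # (freq, time, direction, scale); stacking such arrays over utterances of
    # different emotions adds the class mode and gives the 5th-order tensor G.
    return np.stack([np.stack([np.abs(fftconvolve(P, gabor_kernel(u, v),
                                                  mode='same'))
                               for v in range(n_scales)], axis=-1)
                     for u in range(n_dirs)], axis=-2)
```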
(3) Apply the group sparse tensor decomposition to the representation $\bar{P}$ and compute the projection matrices $U^{(i)}$, i = 1, ..., 5, on the different factors, for use in feature projection. The following decomposition model is set up:

$$\bar{P} \approx \bar{\Lambda} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)} \times_5 U^{(5)}$$

where $U^{(i)}$ is the projection matrix of size $N_i \times K$ obtained from the decomposition; $\bar{\Lambda}$ is a 5th-order tensor with diagonal elements 1 and size K × K × K × K × K; and $\times_i$ denotes the mode-i product of a tensor and a matrix, defined as

$$\left(\bar{X} \times_i A\right)_{n_1,\ldots,n_{i-1},k,n_{i+1},\ldots,n_M} = \sum_{n_i} \bar{X}_{n_1,\ldots,n_M}\, A_{n_i,k}$$

where $\bar{X}$ denotes an Mth-order tensor of size $N_1 \times \cdots \times N_M$, A is a matrix of size $N_i \times K$, $\bar{X}_{n_1,\ldots,n_M}$ is an element of the tensor $\bar{X}$, and $A_{n_i,k}$ is an element of the matrix A.
To compute the projection matrices $U^{(i)}$, i = 1, ..., I, with I = 5 here, the specific decomposition procedure is as follows (a sketch follows the procedure):

A) Initialize $U^{(i)} \geq 0$, i = 1, ..., I, by alternating least squares or randomly;

B) Normalize each column vector $u_k^{(i)}$, i = 1, ..., I, k = 1, ..., K, of the projection matrices $U^{(i)}$;

C) While the error objective function E [expression given only as an image in the source] is greater than a given threshold, repeat the following:

● For i = 1 to I in turn, update the columns $u_k^{(i)}$ [update formula given only as an image in the source], where $\|\cdot\|_F$ denotes the Frobenius norm, $\bar{P}^{(k)}_{(i)}$ is the mode-i matrix unfolding of the tensor $\bar{P}^{(k)}$, ⊙ is the Khatri-Rao product of matrices, ∘ denotes the vector outer product, and $\lambda_k$ and $q_i$ are weight coefficients taking values between 0 and 1 that regulate the sparsity of the terms of the objective function;

● If i ≠ 5, $\gamma_k^i = u_k^{(I)T} u_k^{(I)}$, where $u_k^{(I)T}$ denotes the transpose of $u_k^{(I)}$; if i = 5, $\gamma_k^i$ is given by [formula given only as an image in the source];

D) When the objective function E falls below the threshold, the loop ends and the projection matrices $U^{(i)}$, i = 1, ..., I, are obtained;
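Because the exact column updates above survive only as images, the following sketch fills the loop with a standard ALS-style nonnegative CP update carrying an l1 sparsity penalty weighted by lam * q[i]; this substitute update, the stopping rule, and all parameter values are assumptions, not the patent's exact formulas:

```python
import numpy as np

def unfold(T, mode):
    # Mode-i matrix unfolding: bring mode i to the front, flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(mats):
    # Column-wise Kronecker product of a list of matrices.
    K = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ik,jk->ijk', out, M).reshape(-1, K)
    return out

def group_sparse_ntf(P, K, lam=0.1, q=None, n_iter=200, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    I = P.ndim
    q = q if q is not None else [0.5] * I
    U = [rng.random((n, K)) for n in P.shape]            # nonnegative init
    U = [u / np.linalg.norm(u, axis=0) for u in U]       # column normalization
    prev = np.inf
    for _ in range(n_iter):
        for i in range(I):
            W = khatri_rao([U[m] for m in range(I) if m != i])
            num = unfold(P, i) @ W - lam * q[i]          # l1-shrunk correlation
            den = (W * W).sum(axis=0) + 1e-12            # gamma-like normalizers
            U[i] = np.maximum(num / den, 0)              # nonnegative projection
        E = np.linalg.norm(unfold(P, 0) - U[0] @ khatri_rao(U[1:]).T)
        if abs(prev - E) < tol:                          # objective has converged
            break
        prev = E
    return U
```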
(4) Use the obtained projection matrix $U^{(2)}$ corresponding to the frequency domain to project the multilinear representation $\bar{P}$ of the speech signal:

$$\bar{S} = \bar{P} \times_2 U^{(2)}_{+},$$

where $[Y]_+ = \max(0, Y)$ denotes the matrix formed by the nonnegative elements of a matrix Y (elements less than 0 are set to 0), $U^{(2)}_{+}$ is the matrix formed by the nonnegative elements of the pseudo-inverse of the projection matrix $U^{(2)}$, and $\times_2$ denotes the mode-2 product of $U^{(2)}_{+}$ with the tensor $\bar{P}$;
(5) Fix the time mode and apply the tensor unfolding operation to the obtained multilinear sparse representation $\bar{S}$, obtaining a feature matrix $S^{(f)}$ of size $\hat{N}_1 \times N_1$, where $\hat{N}_1 = K \cdot N_3 \cdot N_4 \cdot N_5$ (the frequency mode has length K after the projection in step (4));
(6) Decorrelate $S^{(f)}$ with the discrete cosine transform to obtain the speech emotion feature F; computing the first- and second-order difference coefficients of the feature yields the final emotional features.
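A sketch of steps (5)-(6), assuming scipy; the row layout of the unfolding and the simple two-point difference used for the delta coefficients are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dct

def emotion_features(S):
    # S: (N1, K, N3, N4, N5) with the time mode first; one feature row per frame.
    Sf = S.reshape(S.shape[0], -1)               # fix time, unfold the rest
    F = dct(Sf, type=2, norm='ortho', axis=1)    # DCT decorrelation per frame
    d1 = np.gradient(F, axis=0)                  # first-order differences
    d2 = np.gradient(d1, axis=0)                 # second-order differences
    return np.hstack([F, d1, d2])                # final emotional features
```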
As shown in Figure 3, speech emotion recognition with the above feature extraction method comprises the following steps (a sketch follows the list):
1) Obtain speech signal data $s_l(t)$, l = 1, ..., L, carrying different emotion labels, with L emotion classes in total;
2) Extract the feature F of each emotion with the feature extraction method shown in Fig. 2;
3) Model the different emotional features with Gaussian mixture models (GMM); through learning and training, obtain the emotion model $M_l$ corresponding to the l-th emotion class;
4) To test a speech signal of unknown emotion type, compute the posterior probability under each GMM emotion model $M_l$, l = 1, ..., L, in turn; the emotion class with the maximum posterior probability is the emotion recognition result for that signal.
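A minimal sketch of the training and maximum a posteriori decision, assuming scikit-learn; the number of mixture components, the covariance type, and the class priors are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_by_emotion, n_components=16, seed=0):
    # One GMM M_l per emotion class l, fit on all feature frames of that class.
    return {emo: GaussianMixture(n_components, covariance_type='diag',
                                 random_state=seed).fit(F)
            for emo, F in features_by_emotion.items()}

def recognize(models, F, priors=None):
    # Maximum a posteriori decision: average log-likelihood plus log prior.
    priors = priors or {e: 1.0 / len(models) for e in models}
    scores = {e: m.score(F) + np.log(priors[e]) for e, m in models.items()}
    return max(scores, key=scores.get)
```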
The effect of the present invention can be further illustrated by experiment.
The recognition performance of the feature extraction method proposed by the present invention was tested on the FAU Aibo dataset, recognizing 4 emotion classes (Anger, Emphatic, Neutral, Rest). In this experiment the speech signal was sampled at 8 kHz and windowed with a Hamming window, with a 23 ms window length and a 10 ms window shift; the energy spectrum of the signal was computed with the short-time Fourier transform; Gabor functions with 4 different scales and 4 different directions performed time-frequency convolutional filtering of the energy spectrum; a Mel filter bank of size 36 was used to compute the Mel power spectrum; the projection matrix performed feature projection on the frequency mode; and the DCT was used to decorrelate the features.
Fig. 4 compares the recognition performance of the method proposed by the present invention with existing feature extraction techniques (MFCC and LFPC features). In terms of final recognition accuracy, the present invention effectively improves multi-class speech emotion recognition: the accuracy is 6.1% higher than the traditional MFCC method and 5.8% higher than the LFPC method.

Claims (2)

1. A speech emotion feature extraction method considering multilinear group sparse features in speech, characterized in that:
Considering that the speech signal carries the multiple factors of time, frequency, scale, and direction information, feature extraction is carried out by the method of group sparse multilinear decomposition; the energy spectrum of the speech signal is given a multilinear representation by Gabor functions of different scales and directions; the feature projection matrix is solved by the group sparse tensor decomposition method; the feature projection on the frequency mode is computed; the features are decorrelated through the discrete cosine transform; and the first- and second-order difference coefficients of the features are computed; the method specifically comprising the following steps:
(1) Acquire the speech signal s(t), transform s(t) to the time-frequency domain by the short-time Fourier transform, and obtain the time-frequency representation S(f, t) and the energy spectrum P(f, t) of the signal;
(2) Convolve the energy spectrum with two-dimensional Gabor functions of different scales and directions. The Gabor function is defined as

$$g_{\bar{k}}(\bar{x}) = \frac{\|\bar{k}\|^2}{\sigma^2}\, e^{-\|\bar{k}\|^2\|\bar{x}\|^2/2\sigma^2}\left[e^{j\bar{k}\cdot\bar{x}} - e^{-\sigma^2/2}\right],$$

where $\bar{x} = (t, f)$ indexes the element of the energy spectrum P(f, t) at frame t and frequency f; $\bar{k} = (k_v\cos\phi,\ k_v\sin\phi)$ is the vector controlling the scale and direction of the function; j denotes the imaginary unit; $k_v = 2^{-(v+2)/2}\pi$ and $\phi = u(\pi/K)$, where u indexes the direction and v the scale of the function; K denotes the total number of directions; and σ is a constant determining the envelope of the function, set to 2π;

The result of the Gabor convolutional filtering of the energy spectrum P(f, t) is the multilinear representation of the speech signal (denoted $\bar{G}$ here), a 5th-order tensor whose modes are time, frequency, direction, scale, and class; filtering the frequency mode of $\bar{G}$ with a Mel-scale triangular filter bank yields a new 5th-order tensor $\bar{P}$ of size $N_1 \times N_2 \times N_3 \times N_4 \times N_5$, the length of mode i being $N_i$, i = 1, ..., 5;
(3) Apply the group sparse tensor decomposition to the obtained multilinear representation $\bar{P}$ and compute the projection matrices $U^{(i)}$, i = 1, ..., 5, on the different factors, for use in feature projection, setting up the following decomposition model:

$$\bar{P} \approx \bar{\Lambda} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)} \times_5 U^{(5)}$$

where $U^{(i)}$ is the projection matrix of size $N_i \times K$ obtained from the decomposition, $\bar{\Lambda}$ is a 5th-order tensor with diagonal elements 1 and size K × K × K × K × K, and $\times_i$ denotes the mode-i product of a tensor and a matrix, defined as

$$\left(\bar{X} \times_i A\right)_{n_1,\ldots,n_{i-1},k,n_{i+1},\ldots,n_M} = \sum_{n_i} \bar{X}_{n_1,\ldots,n_M}\, A_{n_i,k}$$

where $\bar{X}$ denotes an Mth-order tensor of size $N_1 \times \cdots \times N_M$, A is a matrix of size $N_i \times K$, $\bar{X}_{n_1,\ldots,n_M}$ is an element of the tensor $\bar{X}$, and $A_{n_i,k}$ is an element of the matrix A;
(4) Use the obtained projection matrix $U^{(2)}$ corresponding to the frequency domain to project the multilinear representation $\bar{P}$ of the speech signal:

$$\bar{S} = \bar{P} \times_2 U^{(2)}_{+},$$

where $[Y]_+ = \max(0, Y)$ denotes the matrix formed by the nonnegative elements of a matrix Y (elements less than 0 are set to 0), $U^{(2)}_{+}$ is the matrix formed by the nonnegative elements of the pseudo-inverse of the projection matrix $U^{(2)}$, and $\times_2$ denotes the mode-2 product of $U^{(2)}_{+}$ with the tensor $\bar{P}$;
(5) Fix the time mode and apply the tensor unfolding operation to the obtained multilinear sparse representation $\bar{S}$, obtaining a feature matrix $S^{(f)}$ of size $\hat{N}_1 \times N_1$, where $\hat{N}_1 = K \cdot N_3 \cdot N_4 \cdot N_5$;
(6) Decorrelate $S^{(f)}$ with the discrete cosine transform to obtain the speech emotion feature F; the first- and second-order difference coefficients of the feature are computed to obtain the final emotional features.
2. The speech emotion feature extraction method considering multilinear group sparse features in speech according to claim 1, characterized in that the specific decomposition procedure for computing the projection matrices $U^{(i)}$, i = 1, ..., I, is as follows, where i indexes the modes and I = 5:

1. Initialize $U^{(i)} \geq 0$, i = 1, ..., I, by alternating least squares or randomly;

2. Normalize each column vector $u_k^{(i)}$, i = 1, ..., I, k = 1, ..., K, of the projection matrices $U^{(i)}$;

3. While the error objective function E [expression given only as an image in the source] is greater than a given threshold, repeat the following:

● For i = 1 to I in turn, update the columns $u_k^{(i)}$ [update formula given only as an image in the source], where $\|\cdot\|_F$ denotes the Frobenius norm, $\bar{P}^{(k)}_{(i)}$ is the mode-i matrix unfolding of the tensor $\bar{P}^{(k)}$, ⊙ is the Khatri-Rao product of matrices, ∘ denotes the vector outer product, and $\lambda_k$ and $q_i$ are weight coefficients taking values between 0 and 1 that regulate the sparsity of the terms of the objective function;

● If i ≠ 5, $\gamma_k^i = u_k^{(I)T} u_k^{(I)}$, where $u_k^{(I)T}$ denotes the transpose of $u_k^{(I)}$; if i = 5, $\gamma_k^i$ is given by [formula given only as an image in the source];

4. When the objective function E falls below the threshold, the loop ends and the projection matrices $U^{(i)}$, i = 1, ..., I, are obtained.
CN201210091525.1A 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech Expired - Fee Related CN102592593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210091525.1A CN102592593B (en) 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210091525.1A CN102592593B (en) 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Publications (2)

Publication Number Publication Date
CN102592593A CN102592593A (en) 2012-07-18
CN102592593B 2014-01-01

Family

ID=46481134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210091525.1A Expired - Fee Related CN102592593B (en) 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Country Status (1)

Country Link
CN (1) CN102592593B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102833918B (en) * 2012-08-30 2015-07-15 四川长虹电器股份有限公司 Emotional recognition-based intelligent illumination interactive method
CN103245376B (en) * 2013-04-10 2016-01-20 中国科学院上海微系统与信息技术研究所 A kind of weak signal target detection method
CN103531206B (en) * 2013-09-30 2017-09-29 华南理工大学 A kind of local speech emotional characteristic extraction method with global information of combination
CN103531199B (en) * 2013-10-11 2016-03-09 福州大学 Based on the ecological that rapid sparse decomposition and the degree of depth learn
CN103825678B (en) * 2014-03-06 2017-03-08 重庆邮电大学 A kind of method for precoding amassing 3D MU MIMO based on Khatri Rao
CN105047194B (en) * 2015-07-28 2018-08-28 东南大学 A kind of self study sound spectrograph feature extracting method for speech emotion recognition
CN107886942B (en) * 2017-10-31 2021-09-28 东南大学 Voice signal emotion recognition method based on local punishment random spectral regression
CN109060371A (en) * 2018-07-04 2018-12-21 深圳万发创新进出口贸易有限公司 A kind of auto parts and components abnormal sound detection device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030316A (en) * 2007-04-17 2007-09-05 北京中星微电子有限公司 Safety driving monitoring system and method for vehicle
CN101404060A (en) * 2008-11-10 2009-04-08 北京航空航天大学 Human face recognition method based on visible light and near-infrared Gabor information amalgamation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6964023B2 (en) * 2001-02-05 2005-11-08 International Business Machines Corporation System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
US8886206B2 (en) * 2009-05-01 2014-11-11 Digimarc Corporation Methods and systems for content processing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030316A (en) * 2007-04-17 2007-09-05 北京中星微电子有限公司 Safety driving monitoring system and method for vehicle
CN101404060A (en) * 2008-11-10 2009-04-08 北京航空航天大学 Human face recognition method based on visible light and near-infrared Gabor information amalgamation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Bimodal Emotion Recognition Based on Speech Signals and Facial Expression;Tu, Binbin; Yu, Fengqin;《6th International Conference on Intelligent Systems and Knowledge Engineering》;20111231;全文 *
Continuous Emotion Recognition Using Gabor Energy Filters;Dahmane, Mohamed; Meunier, Jean;《4th Bi-Annual International Conference of the Humaine Association on Affective Computing and Intelligent Interaction》;20111231;全文 *
Feature extraction of speech signals in emotion identification;Morales-Perez,M. et al;《30th Annual International Conference of the IEEE-Engineering-in-Medicine-and-Biology-Society》;20081231;全文 *

Also Published As

Publication number Publication date
CN102592593A (en) 2012-07-18


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140101

Termination date: 20170331

CF01 Termination of patent right due to non-payment of annual fee