CN103531206A - Voice affective characteristic extraction method capable of combining local information and global information - Google Patents

Voice affective characteristic extraction method capable of combining local information and global information

Info

Publication number
CN103531206A
Authority
CN
China
Prior art keywords
frame
feature
extraction method
characteristic extraction
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310460191.5A
Other languages
Chinese (zh)
Other versions
CN103531206B (en)
Inventor
文贵华
孙亚新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201310460191.5A priority Critical patent/CN103531206B/en
Publication of CN103531206A publication Critical patent/CN103531206A/en
Application granted granted Critical
Publication of CN103531206B publication Critical patent/CN103531206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a speech emotion feature extraction method that combines local and global information and extracts three kinds of features; it belongs to the technical fields of speech signal processing and pattern recognition. The method comprises the following steps: (1) dividing the speech signal into frames; (2) applying a Fourier transform to each frame; (3) filtering the Fourier transform result with a Mel filter bank, computing the energy of each filter output, and taking its logarithm; (4) applying a local Hu operation to the logarithmic result to obtain the first feature; (5) applying a discrete cosine transform to each frame after the local Hu operation to obtain the second feature; (6) applying a difference operation to the logarithmic result of step (3) and then a discrete cosine transform to each frame of the difference result to obtain the third feature. The method can quickly and effectively represent speech in every emotion, and its applications include speech retrieval, speech recognition, and affective computing.

Description

A speech emotion feature extraction method combining local and global information
Technical field
The present invention relates to speech signal processing and pattern recognition technology, and in particular to a speech emotion feature extraction method combining local and global information.
Background art
With the development of information technology, society places ever higher demands on affective computing. In human-computer interaction, for example, a computer with emotional capability can perceive, classify, recognize, and respond to human emotions, helping users feel efficient and warmly attended to, effectively relieving the frustration of using computers, and even helping people understand their own and others' emotional worlds. Such technology can, for instance, detect whether a driver is concentrating or under stress and respond accordingly. Affective computing can also be applied in related industries such as robotics, intelligent toys, games, and e-commerce to build more personalized styles and more realistic scenes. Emotion also reflects people's mental health; applications of affective computing can help people avoid unhealthy emotions and stay psychologically healthy and cheerful.
Facial expressions, speech, physiological indicators, and the like all reflect human emotion to some extent. The present invention concerns the speech feature extraction problem in speech-based emotion recognition. Many features are currently used in speech emotion recognition, the most widely used being the MFCC features. However, MFCC ignores the energy distribution information inside each Mel filter and the local distribution information between the different filter outputs of each frame, and it is sensitive to noise. The present invention therefore proposes a speech emotion feature extraction method that considers both kinds of information simultaneously.
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a speech emotion feature extraction method combining local and global information. The method is simple and easy to implement.
The object of the present invention is achieved through the following technical solution: a speech emotion feature extraction method combining local and global information, comprising the following steps:
[1] dividing the speech signal into frames;
[2] applying a Fourier transform to each frame;
[3] filtering the Fourier transform result with a Mel filter bank and taking the logarithm of the filtering result;
[4] applying a local Hu operation to the logarithmic result to obtain the first class of features, called the HuLFPC features;
[5] applying a discrete cosine transform to each frame after the local Hu operation to obtain the second class of features, called the HuMFCC features;
[6] applying a difference operation to the logarithmic result of step [3], then applying a discrete cosine transform to each frame of the difference result to obtain the third class of features, called the DMFCC features.
In said step [4], the logarithmic result computed in step [3] is subjected to a local Hu operation to obtain the first class of features, called the HuLFPC features.
In said step [5], a discrete cosine transform is applied to each frame after the local Hu operation to obtain the second class of features, called the HuMFCC features.
In said step [6], the logarithmic result computed in step [3] is subjected to a difference operation within a window, and then a discrete cosine transform is applied to each frame of the difference result to obtain the third class of features, called the DMFCC features.
The present invention extracts the following three classes of features:
The first class of features, called the HuLFPC features, extracts the energy distribution information inside each Mel filter. The speech signal is first divided into frames and each frame is Fourier transformed; the Fourier transform result is then filtered with the Mel filter bank, the energy of each filter output is computed, and its logarithm is taken; finally, the Hu moment of the logarithmic result is computed within local windows, giving the HuLFPC features.
The second class of features, called the HuMFCC features, also characterizes the energy distribution inside each Mel filter. After the HuLFPC features are obtained, a one-dimensional DCT is applied to the HuLFPC coefficients of each frame, giving the HuMFCC features.
The third class of features, called the DMFCC features, extracts the local distribution information between the different filter outputs of each frame. The speech signal is first divided into frames and each frame is Fourier transformed; the Fourier transform result is then filtered with the Mel filter bank, the energy of each filter output is computed, and its logarithm is taken; the difference of the logarithmic result is computed within local windows; finally, a one-dimensional DCT is applied to the difference coefficients of each frame, giving the DMFCC features.
Working principle of the present invention: when the emotion of speech changes, articulation clarity, the degree of pitch variation, vocal intensity, and speaking rate all change accordingly, and these changes alter the energy intensity of the spectrogram; for example, when pronunciation is clearer and vocal intensity is higher, the spectrogram energy is more concentrated. The first Hu moment measures how concentrated the energy of the data is around its centroid, so it captures well the changes in spectrogram energy concentration caused by changes in speech emotion. In addition, most current research applies derivatives only along the time axis of the spectrogram to measure how the energy changes; but when the emotion changes, the frequency distribution of the speech signal also changes, producing changes along the frequency axis of the spectrogram, so the present invention uses derivatives along the frequency axis to capture these changes.
Compared with the prior art, the present invention has the following advantages and effects:
1. The method is simple: the whole feature extraction framework is simple and easy to implement.
2. The algorithmic complexity is low; none of its formulas has higher computational complexity than those used in existing feature extraction methods.
3. HuLFPC has local rotation and translation invariance, highlights the formants and the overall energy distribution of unvoiced sounds, and partially overcomes various kinds of noise.
4. HuMFCC transforms the HuLFPC coefficients of each frame from the time domain to the frequency domain; in addition to the third effect above, compared with MFCC it weakens the influence of the overall energy offset caused by pitch variation.
5. DMFCC highlights the places where the speech energy changes sharply, reduces the coefficient offset caused by global energy changes, and at the same time makes the energy trend of the spectrogram more prominent.
6. As can be seen from Figs. 2, 3, 6, and 7, HuLFPC differs considerably from the existing MLFPC feature; as can be seen from Figs. 4, 5, 8, 9, 10, and 11, DMFCC and HuMFCC also differ considerably from the existing MFCC. The three newly proposed classes of speech features therefore complement traditional speech features such as MFCC and MLFPC well and are effective in application.
Brief description of the drawings
Fig. 1 is the flow chart of extracting the three classes of features in the speech emotion feature extraction method of the present invention.
Fig. 2 shows the MLFPC feature visualization of the utterance "I will go even if it rains".
Fig. 3 shows the MLFPC feature visualization of the utterance "The office worker has finished the work".
Fig. 4 shows the MFCC feature visualization of the utterance "I will go even if it rains".
Fig. 5 shows the MFCC feature visualization of the utterance "The office worker has finished the work".
Fig. 6 shows the HuLFPC feature visualization of the utterance "I will go even if it rains".
Fig. 7 shows the HuLFPC feature visualization of the utterance "The office worker has finished the work".
Fig. 8 shows the HuMFCC feature visualization of the utterance "I will go even if it rains".
Fig. 9 shows the HuMFCC feature visualization of the utterance "The office worker has finished the work".
Fig. 10 shows the DMFCC feature visualization of the utterance "I will go even if it rains".
Fig. 11 shows the DMFCC feature visualization of the utterance "The office worker has finished the work".
Fig. 12 is the structural diagram of the speech emotion recognition system.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the embodiment and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Fig. 1, a speech emotion feature extraction method combining local and global information comprises the following steps:
The first step: divide the speech signal into frames and apply a window to obtain S_k(N). Framing uses the following two formulas, where N is the frame length, inc is the number of samples by which the next frame is shifted, fix(x) is the integer nearest to x, fs is the sampling rate of the speech signal (taken from the speech data), bw is the frequency resolution of the spectrogram (set to 60 Hz in the present invention), and k denotes the k-th frame. The window function is a Hamming window.
N=fix(1.81*fs/bw), (1)
inc=1.81/(4*bw), (2);
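A minimal NumPy sketch of this framing step is given below. It is an illustrative reading rather than the patent's reference implementation: formula (2) as printed evaluates to a sub-sample value, so the sketch assumes the hop is a quarter frame (N/4 samples), consistent with formula (1); the function and parameter names are the author's own.

import numpy as np

def frame_signal(x, fs, bw=60.0):
    """Split a speech signal into overlapping Hamming-windowed frames S_k(N).
    Assumes len(x) >= N."""
    N = int(round(1.81 * fs / bw))   # frame length, formula (1); fix(x) = nearest integer
    inc = max(1, N // 4)             # frame shift in samples; assumed quarter-frame hop
    win = np.hamming(N)
    n_frames = 1 + (len(x) - N) // inc
    frames = np.empty((n_frames, N))
    for k in range(n_frames):
        frames[k] = x[k * inc: k * inc + N] * win
    return frames, N, inc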
The second step: apply a short-time Fourier transform to S_k(N) to obtain F_k(N), and use formula (3) to map F_k(N) to the Mel frequency scale, obtaining G_k(N).
Mel(f)=2595*lg(1+f/700), (3);
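Continuing the sketch above (numpy already imported as np), the Mel mapping of formula (3), its inverse, and the per-frame spectra can be written as follows; the inverse mapping is not given explicitly in the text and is added here only because it is needed to place the filter boundaries in the next step.

def hz_to_mel(f):
    # formula (3): Mel(f) = 2595 * lg(1 + f / 700), lg being log base 10
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(mel):
    # inverse of formula (3)
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def frame_spectra(frames):
    # short-time Fourier transform of each windowed frame (one row per frame)
    return np.fft.rfft(frames, axis=1)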
The third step: first use formula (4) to define a filter bank of M triangular filters; M is set to 160 when computing HuLFPC and HuMFCC, and to 40 when computing DMFCC. Then use formula (5) to compute the energy E_k(m) of the k-th frame after filtering by the m-th filter. The resulting E is a K × M matrix, where K is the number of frames of the speech segment.

H_m(k) = 0 for k < f(m-1);
H_m(k) = 2(k - f(m-1)) / [(f(m+1) - f(m-1))(f(m) - f(m-1))] for f(m-1) ≤ k ≤ f(m);
H_m(k) = 2(f(m+1) - k) / [(f(m+1) - f(m-1))(f(m+1) - f(m))] for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0 for k ≥ f(m+1),   (4)

where f(m) denotes the boundary frequencies of the filters, and the difference between f(m) and f(m-1) increases as m increases;

E_k(m) = ln( Σ_{n=0}^{N-1} |G_k(n)|^2 H_m(n) ),  0 ≤ m < M,  0 ≤ k < K.   (5)
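A sketch of the triangular filter bank of formula (4) and the log filter energies of formula (5) follows, reusing hz_to_mel / mel_to_hz from the previous sketch. The band edges f_low and f_high and the Mel-uniform spacing of the boundary frequencies f(m) are assumptions; the patent only states that the spacing f(m) - f(m-1) grows with m, which Mel-uniform spacing satisfies.

def mel_filterbank(M, n_fft, fs, f_low=0.0, f_high=None):
    """M triangular filters H_m(k) of formula (4) on an rfft grid of n_fft points."""
    if f_high is None:
        f_high = fs / 2.0
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), M + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((M, n_fft // 2 + 1))
    for m in range(1, M + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = 2.0 * (k - left) / ((right - left) * (center - left))
        for k in range(center, right):
            fb[m - 1, k] = 2.0 * (right - k) / ((right - left) * (right - center))
    return fb

def log_filter_energies(spectra, fb):
    # formula (5): E_k(m) = ln( sum_n |G_k(n)|^2 * H_m(n) )
    power = np.abs(spectra) ** 2
    return np.log(power @ fb.T + 1e-12)   # small epsilon guards against log(0)

For example, with M = 160 (HuLFPC/HuMFCC) or M = 40 (DMFCC) as stated above, E = log_filter_energies(frame_spectra(frames), mel_filterbank(M, N, fs)) yields the K × M matrix E used in the following steps.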
The fourth step: calculate the first class of features, HuLFPC; the feature visualization results are shown in Fig. 6 and Fig. 7. E is first divided into non-overlapping windows, each window being a 3×3 matrix E(r, c); the Hu feature is then computed for every E(r, c) to obtain HuLFPC, whose dimension is (K-2) × (M-2). The Hu feature is computed as follows: for the two-dimensional data E(r, c), formulas (6), (7), and (8) are used to compute the geometric moment m_pq of order p+q, the central moment μ_pq of order p+q, and the normalized central moment η_pq of order p+q, where f(k, m) denotes an element of the two-dimensional data E(r, c); formula (10) then gives the Hu feature θ of the window:

m_pq = Σ_{k=1}^{K} Σ_{m=1}^{M} k^p m^q f(k, m),  p, q = 0, 1, 2, ...,   (6)

μ_pq = Σ_{k=1}^{K} Σ_{m=1}^{M} (k - k̄)^p (m - m̄)^q f(k, m),  p, q = 0, 1, 2, ...,   (7)

η_pq = μ_pq / μ_00^ρ,  ρ = (p + q)/2 + 1,   (8)

where k̄ and m̄ denote the centroid of the data:

k̄ = m_10 / m_00,  m̄ = m_01 / m_00,   (9)

θ = η_20 + η_02.   (10)
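A sketch of the local Hu operation follows. One reading choice should be stated loudly: the text calls the 3×3 windows non-overlapping, yet the stated output size (K-2) × (M-2) matches a window sliding by one element, so the sketch assumes the sliding window; only the moments actually needed for θ = η_20 + η_02 are computed.

def hu_first_moment(win):
    """theta = eta_20 + eta_02 of a small 2-D window, per formulas (6)-(10)."""
    K_, M_ = win.shape
    k_idx = np.arange(1, K_ + 1)[:, None]
    m_idx = np.arange(1, M_ + 1)[None, :]
    m00 = win.sum()                              # geometric moment m_00, formula (6)
    if m00 == 0:
        return 0.0
    k_bar = (k_idx * win).sum() / m00            # centroid, formula (9)
    m_bar = (m_idx * win).sum() / m00
    mu20 = ((k_idx - k_bar) ** 2 * win).sum()    # central moments, formula (7)
    mu02 = ((m_idx - m_bar) ** 2 * win).sum()
    rho = 2.0                                    # rho = (p + q)/2 + 1 with p + q = 2
    return mu20 / m00 ** rho + mu02 / m00 ** rho # formulas (8) and (10)

def hulfpc(E):
    """HuLFPC: Hu feature of every 3x3 window of E (stride 1 assumed), shape (K-2) x (M-2)."""
    K, M = E.shape
    out = np.empty((K - 2, M - 2))
    for r in range(K - 2):
        for c in range(M - 2):
            out[r, c] = hu_first_moment(E[r:r + 3, c:c + 3])
    return out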
The fifth step: calculate the second class of features, HuMFCC; the feature visualization results are shown in Fig. 8 and Fig. 9. Apply a DCT to the HuLFPC coefficients of each frame and keep the second through the last coefficients, forming the HuMFCC features of dimension (K-2) × (M-3).
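A short sketch of this step, using SciPy's DCT; the type-II transform with orthonormal scaling is an assumption, since the patent does not specify the DCT variant.

from scipy.fftpack import dct

def humfcc(hulfpc_feat):
    """HuMFCC: per-frame DCT of HuLFPC, keeping the 2nd through last coefficients."""
    coeffs = dct(hulfpc_feat, type=2, norm='ortho', axis=1)
    return coeffs[:, 1:]                 # shape (K-2) x (M-3)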
The sixth step: calculate the third class of features, DMFCC; the feature visualization results are shown in Fig. 10 and Fig. 11. E is first divided into overlapping 3×3 windows, each window sliding by one element relative to the previous one; formula (11) is applied to all windows to compute the difference result, DLFPC. A DCT is then applied to the DLFPC coefficients of each frame, and the second through the last coefficients are kept, forming the DMFCC features of dimension (K-2) × (M-3):

D_k(m) = Σ_{i=-1}^{1} Σ_{j=-1}^{1} f(k + i, m + j) h(i, j),   (11)

where h(i, j) is given by

H = | -1  0  1 |
    | -1  0  1 |
    | -1  0  1 |.   (12)
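The corresponding sketch for DLFPC/DMFCC applies the kernel H of formula (12) to every sliding 3×3 window of E exactly as written in formula (11), then reuses the DCT convention assumed above (dct as imported in the HuMFCC sketch).

def dmfcc(E):
    """DMFCC: difference of formula (11) with kernel H, then per-frame DCT."""
    H = np.array([[-1, 0, 1],
                  [-1, 0, 1],
                  [-1, 0, 1]], dtype=float)
    K, M = E.shape
    dlfpc = np.empty((K - 2, M - 2))
    for k in range(1, K - 1):
        for m in range(1, M - 1):
            # formula (11): 3x3 neighbourhood of E weighted by H
            dlfpc[k - 1, m - 1] = np.sum(E[k - 1:k + 2, m - 1:m + 2] * H)
    coeffs = dct(dlfpc, type=2, norm='ortho', axis=1)
    return coeffs[:, 1:]                 # shape (K-2) x (M-3)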
The effectiveness of the proposed speech feature extraction method is verified below in a speech emotion recognition system. The experimental system, shown in Fig. 12, consists of two main processes: training and classification. Both processes include feature extraction from the speech samples, covering the three classes of features of the present invention combined with the existing MFCC, MLFPC, and LPCC features. Feature statistics then convert the features of each utterance into a feature vector; the statistics are the mean and variance of each feature and the mean and variance of its first-order difference. Sequential forward feature selection (SFS) is used for feature selection, and a support vector machine (SVM) is used as the classifier.
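The statistics step described above maps a per-frame feature matrix to a fixed-length vector; a minimal sketch is given below (the concatenation order is an assumption, and np is numpy as imported earlier).

def stats_vector(feat):
    """Mean and variance of each coefficient and of its first-order difference."""
    delta = np.diff(feat, axis=0)
    return np.concatenate([feat.mean(axis=0), feat.var(axis=0),
                           delta.mean(axis=0), delta.var(axis=0)])

Applying stats_vector to each of HuLFPC, HuMFCC, DMFCC, MFCC, MLFPC, and LPCC and concatenating the results gives the utterance-level vector that feeds the SFS feature selection and the SVM.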
The training process comprises the following steps:
1) Prepare the emotional speech training database Dw, containing the audio files and the corresponding emotion class labels: anger, fear, boredom, disgust, happiness, neutral, and sadness.
2) Using the method of the present invention, obtain the three classes of features HuMFCC, HuLFPC, and DMFCC, and also compute the traditional features MFCC, MLFPC, and LPCC; convert every training speech sample into a two-dimensional feature matrix, one dimension of which is the frame index of the sample.
3) Convert each two-dimensional feature matrix into a feature vector using the mean and variance of each feature, and the mean and variance of its first-order difference, computed along the frame dimension.
4) Apply the SFS feature selection method to the feature vectors to obtain a subset that is effective for emotion classification, yielding the feature vector V after feature selection.
5) Train an SVM emotion classifier on V, selecting the optimal SVM parameters by 5-fold cross-validation, and obtain the corresponding classification model (a minimal training sketch follows this list).
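A minimal training sketch, assuming scikit-learn: the patent names only SVM and 5-fold cross-validation, so the library, kernel, scaling step, and parameter grid are illustrative, and the SFS step is assumed to have been applied beforehand to produce X.

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_emotion_classifier(X, y):
    """Select SVM parameters by 5-fold cross-validation and fit the final model.
    X: one statistics vector per training utterance (after SFS); y: emotion labels."""
    pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    grid = {'svc__C': [1, 10, 100], 'svc__gamma': ['scale', 0.01, 0.001]}
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_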
The structure of the speech emotion recognition system of the present invention is shown in Fig. 12; the recognition process comprises the following steps:
1) Using the method of the present invention, obtain the three classes of features HuMFCC, HuLFPC, and DMFCC, and also compute the traditional features MFCC, MLFPC, and LPCC; convert every speech sample to be recognized into a two-dimensional feature matrix, one dimension of which is the frame index of the sample.
2) Convert each two-dimensional feature matrix into a feature vector using the mean and variance of each feature, and the mean and variance of its first-order difference, computed along the frame dimension.
3) Using the effective feature subset obtained in the training process, obtain the feature vector V of the sample after feature selection.
4) Use the trained SVM classification model to classify V into one of the emotion categories (see the sketch after this list).
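Classification of a new utterance then reduces to one call on the trained model, sketched below under the same scikit-learn assumptions as the training sketch; stats_vector comes from the earlier statistics sketch, and selected_idx (the index set kept by SFS during training) is a hypothetical name.

def classify_utterance(model, feature_matrices, selected_idx):
    """Classify one utterance: statistics vector -> stored SFS subset -> SVM label."""
    v = np.concatenate([stats_vector(f) for f in feature_matrices])
    v = v[selected_idx]                       # apply the feature subset chosen in training
    return model.predict(v.reshape(1, -1))[0]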
The corpus used to evaluate the emotion recognition performance of the present invention is the German EMO-DB speech emotion database, a standard database in the field of speech emotion recognition. The training process is completed first, followed by recognition tests. Testing is carried out with 5-fold cross-validation. Seven emotions can be recognized: anger, fear, boredom, disgust, happiness, neutral, and sadness. In the speaker-dependent case the average classification accuracy is 91.67%; apart from happiness being relatively easy to confuse with anger, the other emotions are discriminated well. In the speaker-independent case the average classification accuracy is 82.20%; here happiness, anger, and disgust are relatively easy to confuse with one another, while the other emotions are discriminated well. As shown in Fig. 2, Fig. 3, Fig. 6, and Fig. 7, HuLFPC differs considerably from the existing MLFPC feature; as shown in Fig. 4, Fig. 5, Fig. 8, Fig. 9, Fig. 10, and Fig. 11, DMFCC and HuMFCC also differ considerably from the existing MFCC.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited to it; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall be included within the protection scope of the present invention.

Claims (7)

1. A speech emotion feature extraction method combining local and global information, characterized in that it comprises the following steps:
[1] dividing the speech signal into frames;
[2] applying a Fourier transform to each frame;
[3] filtering the Fourier transform result with a Mel filter bank and taking the logarithm of the filtering result;
[4] applying a local Hu operation to the logarithmic result to obtain the first class of features;
[5] applying a discrete cosine transform to each frame after the local Hu operation to obtain the second class of features;
[6] applying a difference operation to the logarithmic result computed in step [3], then applying a discrete cosine transform to each frame of the difference result to obtain the third class of features.
2. The speech emotion feature extraction method combining local and global information according to claim 1, characterized in that said step [4] comprises the following steps:
1. dividing E into non-overlapping windows, each window being a matrix E(r, c) of size 3 × 3;
2. computing the Hu feature for every E(r, c) to obtain HuLFPC, whose dimension is (K-2) × (M-2),
wherein the Hu feature is computed as follows:
first, for the two-dimensional data E(r, c), the following formulas (6), (7), and (8) are used to compute the geometric moment m_pq of order p+q, the central moment μ_pq of order p+q, and the normalized central moment η_pq of order p+q:

m_pq = Σ_{k=1}^{K} Σ_{m=1}^{M} k^p m^q f(k, m),  p, q = 0, 1, 2, ...,   (6)

μ_pq = Σ_{k=1}^{K} Σ_{m=1}^{M} (k - k̄)^p (m - m̄)^q f(k, m),  p, q = 0, 1, 2, ...,   (7)

η_pq = μ_pq / μ_00^ρ,  ρ = (p + q)/2 + 1,   (8)

wherein f(k, m) is an element of the two-dimensional data E(r, c);
then, the Hu feature θ of the window is obtained:

k̄ = m_10 / m_00,  m̄ = m_01 / m_00,   (9)

θ = η_20 + η_02,   (10)

wherein k̄ and m̄ denote the centroid of the data.
3. The speech emotion feature extraction method combining local and global information according to claim 1, characterized in that in step [5], a DCT is applied to the HuLFPC coefficients of each frame, and the second through the last coefficients are kept to form the HuMFCC features of dimension (K-2) × (M-3).
4. The speech emotion feature extraction method combining local and global information according to claim 1, characterized in that said step [6] comprises the following steps:
I. dividing E into overlapping 3 × 3 windows, each window sliding by one element relative to the previous one, and applying formula (11) to all windows to compute the difference result DLFPC:

D_k(m) = Σ_{i=-1}^{1} Σ_{j=-1}^{1} f(k + i, m + j) h(i, j),   (11)

wherein h(i, j) is defined as

H = | -1  0  1 |
    | -1  0  1 |
    | -1  0  1 | ;

II. applying a DCT to the DLFPC coefficients of each frame and keeping the second through the last coefficients to form the DMFCC features of dimension (K-2) × (M-3).
5. The speech emotion feature extraction method combining local and global information according to claim 1, characterized in that in said step [1], frames are formed according to formulas (1) and (2):
N=fix(1.81*fs/bw), (1)
inc=1.81/(4*bw), (2)
wherein N is the frame length, inc is the number of samples by which the next frame is shifted, fix(x) is the integer nearest to x, fs is the sampling rate of the speech signal, taken from the speech data, bw is the frequency resolution of the spectrogram and is set to 60 Hz, and the window function is a Hamming window.
6. The speech emotion feature extraction method combining local and global information according to claim 1, characterized in that in said step [2], a short-time Fourier transform is applied to S_k(N) to obtain F_k(N), and formula (3) is used to map F_k(N) to the Mel frequency G_k(N):
Mel(f)=2595*lg(1+f/700), (3)
wherein k denotes the k-th frame.
7. The speech emotion feature extraction method combining local and global information according to claim 1, characterized in that said step [3] comprises the following steps:
(I) defining a filter bank of M filters, each filter being a triangular filter given by formula (4);
(II) using formula (5) to compute the energy E_k(m) of the k-th frame after filtering by the m-th filter, the resulting E being a K × M matrix, wherein K is the number of frames of the speech segment:

H_m(k) = 0 for k < f(m-1);
H_m(k) = 2(k - f(m-1)) / [(f(m+1) - f(m-1))(f(m) - f(m-1))] for f(m-1) ≤ k ≤ f(m);
H_m(k) = 2(f(m+1) - k) / [(f(m+1) - f(m-1))(f(m+1) - f(m))] for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0 for k ≥ f(m+1),   (4)

wherein the difference between f(m) and f(m-1) increases as m increases;

E_k(m) = ln( Σ_{n=0}^{N-1} |G_k(n)|^2 H_m(n) ),  0 ≤ m < M,  0 ≤ k < K,   (5)

wherein M is set to 160 when computing HuLFPC and HuMFCC, and to 40 when computing DMFCC.
CN201310460191.5A 2013-09-30 2013-09-30 A speech emotion feature extraction method combining local and global information Active CN103531206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310460191.5A CN103531206B (en) 2013-09-30 2013-09-30 A speech emotion feature extraction method combining local and global information


Publications (2)

Publication Number Publication Date
CN103531206A true CN103531206A (en) 2014-01-22
CN103531206B CN103531206B (en) 2017-09-29

Family

ID=49933158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310460191.5A Active CN103531206B (en) A speech emotion feature extraction method combining local and global information

Country Status (1)

Country Link
CN (1) CN103531206B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166291A1 (en) * 2010-07-06 2013-06-27 Rmit University Emotional and/or psychiatric state detection
CN101894550A (en) * 2010-07-19 2010-11-24 东南大学 Speech emotion classifying method for emotion-based characteristic optimization
CN101930733A (en) * 2010-09-03 2010-12-29 中国科学院声学研究所 Speech emotional characteristic extraction method for speech emotion recognition
CN102592593A (en) * 2012-03-31 2012-07-18 山东大学 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech
CN103021406A (en) * 2012-12-18 2013-04-03 台州学院 Robust speech emotion recognition method based on compressive sensing

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104637497A (en) * 2015-01-16 2015-05-20 南京工程学院 Speech spectrum characteristic extracting method facing speech emotion identification
CN107358946A (en) * 2017-06-08 2017-11-17 南京邮电大学 Speech-emotion recognition method based on section convolution
CN107358946B (en) * 2017-06-08 2020-11-13 南京邮电大学 Voice emotion recognition method based on slice convolution
CN107564543B (en) * 2017-09-13 2020-06-26 苏州大学 Voice feature extraction method with high emotion distinguishing degree
CN107564543A (en) * 2017-09-13 2018-01-09 苏州大学 A kind of Speech Feature Extraction of high touch discrimination
CN109920450A (en) * 2017-12-13 2019-06-21 北京回龙观医院 Information processing unit and information processing method
CN109087628A (en) * 2018-08-21 2018-12-25 广东工业大学 A kind of speech-emotion recognition method of trajectory-based time-space spectral signature
CN110246518A (en) * 2019-06-10 2019-09-17 深圳航天科技创新研究院 Speech-emotion recognition method, device, system and storage medium based on more granularity sound state fusion features
CN112786016A (en) * 2019-11-11 2021-05-11 北京声智科技有限公司 Voice recognition method, device, medium and equipment
CN112786016B (en) * 2019-11-11 2022-07-19 北京声智科技有限公司 Voice recognition method, device, medium and equipment
CN112489689A (en) * 2020-11-30 2021-03-12 东南大学 Cross-database voice emotion recognition method and device based on multi-scale difference confrontation
CN116434787A (en) * 2023-06-14 2023-07-14 之江实验室 Voice emotion recognition method and device, storage medium and electronic equipment
CN116434787B (en) * 2023-06-14 2023-09-08 之江实验室 Voice emotion recognition method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN103531206B (en) 2017-09-29


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant