CN101256768A - Time frequency two-dimension converse spectrum characteristic extracting method for recognizing language species - Google Patents

Time frequency two-dimension converse spectrum characteristic extracting method for recognizing language species

Info

Publication number
CN101256768A
CN101256768A CNA2008101033280A CN200810103328A
Authority
CN
China
Prior art keywords
frequency
frame
feature
matrix
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008101033280A
Other languages
Chinese (zh)
Other versions
CN101256768B (en)
Inventor
张卫强
刘加
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN2008101033280A priority Critical patent/CN101256768B/en
Publication of CN101256768A publication Critical patent/CN101256768A/en
Application granted granted Critical
Publication of CN101256768B publication Critical patent/CN101256768B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Complex Calculations (AREA)

Abstract

A time-frequency two-dimensional cepstrum feature extraction method for language identification. The method first computes the sub-band energies of the speech signal in each frame and juxtaposes the sub-band energies of multiple frames to obtain a time-frequency distribution matrix. It then applies a two-dimensional DCT to remove the correlation along the time and frequency directions of the matrix, and finally rearranges the transformed coefficients and reduces their dimensionality to obtain the final feature. The feature exploits the short-time stationarity of speech while also capturing the long-span information needed to distinguish languages. The method can be used for language identification.

Description

Time-frequency two-dimensional cepstrum feature extraction method for language identification
Technical field
The invention belongs to the field of speech recognition and specifically relates to a time-frequency two-dimensional cepstrum feature extraction method that can be used for language identification.
Background technology
Language identification refers to using a machine to determine the language spoken in a segment of speech. Language identification technology is mainly used in systems such as human-machine spoken dialogue, speech-based information retrieval, and monitoring.
The features most widely used in language identification today are MFCCs (Mel-frequency cepstral coefficients) and features derived from them; LPCCs (linear prediction cepstral coefficients) and PLP (perceptual linear prediction) features are also used. LPCC is motivated by the human speech production mechanism, while MFCC and PLP partly take human auditory perception into account.
In language identification, derived features are generally obtained by further computation on the basic features above and are then used together with the original features after concatenation. The most common derived features are difference (delta) features, usually the first- and second-order differences. If the basic features of frame $t$ are $\{c_j(t),\ j = 0, 1, \ldots, N-1\}$, the first-order difference is

$$\delta_j(t) = \frac{\sum_{d=1}^{D} d\,\big(c_j(t+d) - c_j(t-d)\big)}{\sum_{d=1}^{D} d^2}, \quad j = 1, 2, \ldots, N-1 \qquad (1)$$
where $D$ is the half-width of the difference window, typically 2. Similarly, the second-order difference $\alpha_j(t)$ is obtained from the first-order difference $\delta_j(t)$ by formula (2):

$$\alpha_j(t) = \frac{\sum_{d=1}^{D} d\,\big(\delta_j(t+d) - \delta_j(t-d)\big)}{\sum_{d=1}^{D} d^2}, \quad j = 1, 2, \ldots, N-1 \qquad (2)$$
Concatenating the basic features with their first- and second-order differences yields a new feature vector $\{c_j(t),\ j = 0, \ldots, N-1;\ \delta_j(t),\ j = 0, \ldots, N-1;\ \alpha_j(t),\ j = 0, \ldots, N-1\}$.
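As a concrete illustration, formulas (1)-(2) and the concatenation can be sketched in Python as follows (a minimal sketch; repeating the boundary frames at the edges is an assumption, since the text does not specify edge handling):

```python
import numpy as np

def delta(c, D=2):
    """First-order difference (delta) features, eq. (1).

    c: (num_frames, N) array of per-frame base features.
    D: difference window half-width (the patent uses D = 2).
    Edge frames are handled by repeating the boundary frame (an assumption).
    """
    T, N = c.shape
    padded = np.concatenate([np.repeat(c[:1], D, axis=0), c,
                             np.repeat(c[-1:], D, axis=0)])
    denom = sum(d * d for d in range(1, D + 1))
    out = np.zeros_like(c, dtype=float)
    for t in range(T):
        tc = t + D  # index of frame t inside the padded array
        out[t] = sum(d * (padded[tc + d] - padded[tc - d])
                     for d in range(1, D + 1)) / denom
    return out

def with_deltas(c, D=2):
    """Concatenate basic features with first- and second-order deltas."""
    d1 = delta(c, D)
    d2 = delta(d1, D)
    return np.concatenate([c, d1, d2], axis=1)
```

With 13 basic dimensions this yields the 39-dimensional vector used in the experiments below.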
In addition, temporal-sequence information is an important cue in language identification. To make fuller use of the temporal information in speech, researchers have in recent years proposed the SDC (shifted delta cepstra) feature. An SDC feature is in fact a concatenation of $K$ blocks of first-order difference features and can be expressed as

$$s_{iN+j}(t) = c_j(t + iS + b) - c_j(t + iS - b), \quad j = 1, 2, \ldots, N-1;\ i = 0, 1, \ldots, K-1 \qquad (3)$$
where $b$ is the frame offset used when computing each first-order difference, typically 1; $K$ is the number of blocks, typically 7; and $S$ is the frame shift between blocks, typically 3.

Like the difference features, SDC features can be concatenated with the basic features to form a new feature vector $\{c_j(t),\ j = 0, \ldots, N-1;\ s_{iN+j}(t),\ j = 0, \ldots, N-1,\ i = 0, \ldots, K-1\}$. Experiments show that this combination is more effective than the SDC feature alone.
Although the SDC feature contains more temporal information, being spliced together from several first-order differences it suffers from two problems: first, its dimensionality is high, which increases system complexity; second, considerable correlation remains between its dimensions, which hinders modeling by the back-end classifier.
Summary of the invention
To overcome these shortcomings of the existing SDC feature, the invention provides a method for extracting a time-frequency two-dimensional cepstrum feature that reduces both the correlation between feature dimensions and the total feature dimensionality, thereby lowering the complexity of a language identification system while improving its performance. In a digital integrated circuit implementation, the proposed 21-dimensional feature reduces the resources of the feature storage module and the classifier computation module by 62.5% compared with the commonly used 56-dimensional SDC feature.
The invention is characterized in that the method is carried out in a digital integrated circuit chip according to the following steps:
Step (1): apply zero-mean normalization and pre-emphasis to the speech signal, where zero-mean normalization subtracts the mean of the whole utterance and pre-emphasis is a high-pass filter with transfer function $H(z) = 1 - 0.975z^{-1}$;
Step (2): split the speech signal into frames of length 20 ms with a frame shift of 10 ms;
Step (3): build a two-dimensional time-frequency distribution matrix that simultaneously reflects the short-time stationarity of speech and the long-span information of the language, as follows:
Step (3.1): apply a Hamming window to each frame, obtaining data $\{x(m),\ m = 0, 1, \ldots, M-1\}$, where $M$ is the number of samples per frame;
Step (3.2): compute the DFT (discrete Fourier transform) of the windowed data:

$$X(\omega_k) = \sum_{m=0}^{M-1} x(m)\, e^{-j\frac{2\pi}{M}mk}$$

where $\omega_k$ denotes frequency and $k$ the frequency index;
Step (3.3): in the frequency domain, compute the sub-band energy $e_f$ of each of $F = 24$ triangular Mel-scale windows per frame by

$$e_f = \frac{1}{U_f - L_f + 1} \sum_{k=L_f}^{U_f} |X(\omega_k)|^2$$

where $U_f$ and $L_f$ are the upper and lower boundaries of the $f$-th sub-band; then assemble the $F$ sub-band energies into a vector

$$\mathbf{e} = [e_0, e_1, \ldots, e_{F-1}]^T$$

where the superscript $T$ denotes transposition;
Step (3.4): take $T = 19$ consecutive vectors from step (3.3) and juxtapose them to form a two-dimensional time-frequency distribution matrix

$$E(t) = [\mathbf{e}(t), \mathbf{e}(t+1), \ldots, \mathbf{e}(t+T-1)];$$
Step (4): apply a two-dimensional DCT (discrete cosine transform) to the matrix $E(t)$ to obtain the two-dimensional cepstrum coefficients:

$$C(p,q) = \gamma_p \gamma_q \sum_{\tau=0}^{T-1} \sum_{f=0}^{F-1} e_f(t+\tau)\, \cos\frac{\pi(2\tau+1)p}{2T}\, \cos\frac{\pi(2f+1)q}{2F}$$

where $\tau$ and $f$ are summation variables and $\gamma_p$, $\gamma_q$ are normalization coefficients:

$$\gamma_p = \begin{cases}\sqrt{1/T}, & p = 0\\ \sqrt{2/T}, & p \ge 1\end{cases}, \qquad \gamma_q = \begin{cases}\sqrt{1/F}, & q = 0\\ \sqrt{2/F}, & q \ge 1\end{cases}$$
Step (5): select the elements in the upper-left corner of the transformed matrix, which carry the principal components of $E(t)$, as the feature, denoted TFC; the rearrangement formula that orders the upper-left part into a vector is:

$$\mathrm{TFC}\!\left(\frac{(p+q)^2 + 3p + q}{2}\right) = C(p,q).$$
The beneficial effect of the invention is that it extracts long-span features from the speech signal that are effective for language identification, reducing both the correlation between feature dimensions and the total feature dimensionality. This improves the language identification rate while reducing the complexity of the recognition system and its demands on feature storage and classifier computation resources.
Description of drawings
Fig. 1 is a block diagram of the feature extraction flow of the invention.
Fig. 2 is a schematic diagram of the numbering of the time-frequency two-dimensional cepstrum coefficients of the invention.
Embodiment
Because speech is short-time stationary, a frame length of 20 ms is generally chosen for the short-time Fourier transform during feature extraction; with a longer frame length the speech signal is no longer stationary within a frame. The information characterizing a language, however, is contained in longer speech segments: a Chinese character, for example, lasts about 250 ms, which corresponds to roughly 25 frames at a 10 ms frame shift.
Based on these considerations, the invention first applies short-time Fourier analysis. Let the data of one frame after Hamming windowing be $\{x(m),\ m = 0, 1, \ldots, M-1\}$; its DFT is

$$X(\omega_k) = \sum_{m=0}^{M-1} x(m)\, e^{-j\frac{2\pi}{M}mk} \qquad (4)$$
where $\omega_k$ denotes frequency and $k$ the frequency index. Computing the energy of each of $F$ (typically 24) triangular Mel-scale sub-bands in the frequency domain gives

$$e_n = \frac{1}{U_n - L_n + 1} \sum_{k=L_n}^{U_n} |X(\omega_k)|^2 \qquad (5)$$

where $U_n$ and $L_n$ are the upper and lower boundaries of the $n$-th sub-band.
The $F$ sub-band energies form a vector

$$\mathbf{e} = [e_0, e_1, \ldots, e_{F-1}]^T \qquad (6)$$

where the superscript $T$ denotes transposition. Juxtaposing $T$ (typically 19) such vectors, one per frame, forms a two-dimensional time-frequency distribution matrix

$$E(t) = [\mathbf{e}(t), \mathbf{e}(t+1), \ldots, \mathbf{e}(t+T-1)] \qquad (7)$$
The matrix $E(t)$ both exploits the short-time stationarity of speech and captures long-span language information. However, its dimensionality is high, $T \times F$ in total; and because the time-frequency distribution is continuous, correlation exists between its elements along both the horizontal (time) and vertical (frequency) directions. Both aspects hinder modeling by a classifier. Linear transform techniques can remove the linear correlation between features and reduce their dimensionality.
The invention applies a two-dimensional DCT to the matrix $E(t)$, obtaining the two-dimensional cepstrum coefficients

$$C(p,q) = \gamma_p \gamma_q \sum_{\tau=0}^{T-1} \sum_{f=0}^{F-1} e_f(t+\tau)\, \cos\frac{\pi(2\tau+1)p}{2T}\, \cos\frac{\pi(2f+1)q}{2F} \qquad (8)$$

where $\tau$ and $f$ are summation variables and $\gamma_p$, $\gamma_q$ are normalization coefficients:

$$\gamma_p = \begin{cases}\sqrt{1/T}, & p = 0\\ \sqrt{2/T}, & p \ge 1\end{cases}, \qquad \gamma_q = \begin{cases}\sqrt{1/F}, & q = 0\\ \sqrt{2/F}, & q \ge 1\end{cases} \qquad (9)$$
This removes the correlation in both directions and simultaneously compresses the principal components of $E(t)$ into the upper-left corner of the matrix, so selecting the upper-left elements suffices to approximately describe the whole matrix, achieving dimensionality compression. Denoting the upper-left (triangular) part by TFC, the rearrangement formula that orders the triangular part into a vector is

$$\mathrm{TFC}\!\left(\frac{(p+q)^2 + 3p + q}{2}\right) = C(p,q) \qquad (10)$$
As shown in Fig. 1 and Fig. 2, the concrete steps for implementing the invention are as follows:
(1) Preprocess the speech signal, including zero-mean normalization and pre-emphasis;
(2) split the speech into frames of length 20 ms with a 10 ms frame shift;
(3) apply a Hamming window to each frame;
(4) apply the DFT to the windowed data to obtain the spectrum;
(5) compute the energy of each of F triangular Mel-scale sub-bands per frame in the frequency domain;
(6) arrange the sub-band energies of successive frames in time order to obtain the time-frequency energy distribution;
(7) select the time-frequency energy within a rectangular window of T points along the time axis and F points along the frequency axis to form the time-frequency distribution matrix, and apply the two-dimensional DCT to obtain the time-frequency two-dimensional cepstrum coefficient matrix;
(8) select the coefficients in the upper-left triangle of that matrix, rearrange them into a vector, and take the first L dimensions as the final feature.
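Steps (1)-(3) of the procedure above (preprocessing, framing and windowing) can be sketched as follows (a minimal sketch; keeping the first sample unchanged in the pre-emphasis filter is an assumption about initialization):

```python
import numpy as np

def preprocess_and_frame(signal, fs=8000, frame_ms=20, hop_ms=10, alpha=0.975):
    """Steps (1)-(3): zero-mean the whole utterance, pre-emphasize
    with H(z) = 1 - 0.975 z^-1, split into 20 ms frames with a
    10 ms shift, and apply a Hamming window to each frame."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                   # zero-mean normalization
    x = np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))  # pre-emphasis
    flen = int(fs * frame_ms / 1000)   # 160 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)      # 80 samples
    n = 1 + (len(x) - flen) // hop
    frames = np.stack([x[i * hop:i * hop + flen] for i in range(n)])
    return frames * np.hamming(flen)
```

One second of 8 kHz speech yields 99 windowed frames of 160 samples, which then feed the DFT and sub-band energy computation of steps (4)-(6).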
The invention was tested on the Chinese (mainland Mandarin), English (American, non-southern accent) and Japanese portions of the internationally standard CallFriend database, which contains telephone conversation speech sampled at 8 kHz with mu-law compression. The training set comes from disc 1 of each language: 20 sessions in total, each a two-sided conversation about 30 minutes long. The test set consists of 500 segments of about 30 seconds each, cut at random from disc 3 of each language.
The MFCC feature, the SDC feature, and the proposed time-frequency two-dimensional cepstrum feature were compared. For each language, all test segments were subjected to language verification; the operating point at which the false-alarm rate equals the miss rate gives the system's equal error rate (EER). The average EER over the languages is used as the evaluation metric: the lower the EER, the better the system performs.
In the experiments, GMMs (Gaussian mixture models) with 128 Gaussian components each were used as classifiers. Each model was trained by maximum likelihood: it was initialized with the K-means method and then iterated 8 times with the Baum-Welch (EM) algorithm.
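The classifier setup above can be sketched with scikit-learn's `GaussianMixture` (a sketch, not the authors' implementation: the diagonal covariance type is an assumption, and scikit-learn's k-means initialization and EM iterations stand in for the K-means initialization and Baum-Welch iterations described above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_by_lang, n_components=128):
    """One GMM per language, trained on that language's feature frames."""
    models = {}
    for lang, feats in features_by_lang.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag',   # assumption
                              init_params='kmeans',
                              max_iter=8, random_state=0)
        gmm.fit(feats)
        models[lang] = gmm
    return models

def identify(models, feats):
    """Pick the language whose model gives the highest average
    log-likelihood over the test utterance's frames."""
    scores = {lang: gmm.score(feats) for lang, gmm in models.items()}
    return max(scores, key=scores.get)
```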
The MFCC feature uses 13 basic dimensions (including C0) plus first- and second-order differences, for a 39-dimensional feature vector. The SDC feature uses N-b-S-K parameters of 7-1-3-7 (including C0) plus 7 MFCC dimensions, for a 56-dimensional feature vector. The time-frequency two-dimensional cepstrum feature uses F = 24 and T = 19, keeping the first L = 21 dimensions as the final feature vector.
The experiments show that the language identification EER is 15.57% with the MFCC feature, 8.38% with the SDC feature, and 6.55% with the time-frequency two-dimensional cepstrum feature. The proposed feature therefore clearly outperforms the commonly used MFCC and SDC features for language identification.
In addition, in a digital integrated circuit implementation, the proposed 21-dimensional feature reduces the resources of the feature storage module and the classifier computation module by 46.2% compared with the 39-dimensional MFCC feature, and by 62.5% compared with the 56-dimensional SDC feature.
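The quoted savings follow directly from the feature dimensionalities, under the assumption (implicit above) that storage and classifier computation scale linearly with feature dimension:

```python
# Resource savings implied by the feature dimensionalities, assuming
# storage and classifier computation scale linearly with dimension.
dim_tfc, dim_mfcc, dim_sdc = 21, 39, 56
saving_vs_mfcc = 1 - dim_tfc / dim_mfcc   # 18/39, about 46.2%
saving_vs_sdc = 1 - dim_tfc / dim_sdc     # 35/56, exactly 62.5%
```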

Claims (1)

1. A time-frequency two-dimensional cepstrum feature extraction method for language identification, characterized in that the method is carried out in a digital integrated circuit chip according to the following steps:
Step (1): apply zero-mean normalization and pre-emphasis to the speech signal, where zero-mean normalization subtracts the mean of the whole utterance and pre-emphasis is a high-pass filter with transfer function $H(z) = 1 - 0.975z^{-1}$;
Step (2): split the speech signal into frames of length 20 ms with a frame shift of 10 ms;
Step (3): build a two-dimensional time-frequency distribution matrix that simultaneously reflects the short-time stationarity of speech and the long-span information of the language, as follows:
Step (3.1): apply a Hamming window to each frame, obtaining data $\{x(m),\ m = 0, 1, \ldots, M-1\}$, where $M$ is the number of samples per frame;
Step (3.2): compute the DFT (discrete Fourier transform) of the windowed data:

$$X(\omega_k) = \sum_{m=0}^{M-1} x(m)\, e^{-j\frac{2\pi}{M}mk}$$

where $\omega_k$ denotes frequency and $k$ the frequency index;
Step (3.3): in the frequency domain, compute the sub-band energy $e_f$ of each of $F = 24$ triangular Mel-scale windows per frame by

$$e_f = \frac{1}{U_f - L_f + 1} \sum_{k=L_f}^{U_f} |X(\omega_k)|^2$$

where $U_f$ and $L_f$ are the upper and lower boundaries of the $f$-th sub-band; then assemble the $F$ sub-band energies into a vector

$$\mathbf{e} = [e_0, e_1, \ldots, e_{F-1}]^T$$

where the superscript $T$ denotes transposition;
Step (3.4): take $T = 19$ consecutive vectors from step (3.3) and juxtapose them to form a two-dimensional time-frequency distribution matrix

$$E(t) = [\mathbf{e}(t), \mathbf{e}(t+1), \ldots, \mathbf{e}(t+T-1)];$$

Step (4): apply a two-dimensional DCT (discrete cosine transform) to the matrix $E(t)$ to obtain the two-dimensional cepstrum coefficients:

$$C(p,q) = \gamma_p \gamma_q \sum_{\tau=0}^{T-1} \sum_{f=0}^{F-1} e_f(t+\tau)\, \cos\frac{\pi(2\tau+1)p}{2T}\, \cos\frac{\pi(2f+1)q}{2F}$$

where $\tau$ and $f$ are summation variables and $\gamma_p$, $\gamma_q$ are normalization coefficients:

$$\gamma_p = \begin{cases}\sqrt{1/T}, & p = 0\\ \sqrt{2/T}, & p \ge 1\end{cases}, \qquad \gamma_q = \begin{cases}\sqrt{1/F}, & q = 0\\ \sqrt{2/F}, & q \ge 1\end{cases}$$

Step (5): select the elements in the upper-left corner of the transformed matrix, which carry the principal components of $E(t)$, as the feature, denoted TFC; the rearrangement formula that orders the upper-left part into a vector is:

$$\mathrm{TFC}\!\left(\frac{(p+q)^2 + 3p + q}{2}\right) = C(p,q).$$
CN2008101033280A 2008-04-03 2008-04-03 Time frequency two-dimension converse spectrum characteristic extracting method for recognizing language species Expired - Fee Related CN101256768B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101033280A CN101256768B (en) 2008-04-03 2008-04-03 Time frequency two-dimension converse spectrum characteristic extracting method for recognizing language species

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008101033280A CN101256768B (en) 2008-04-03 2008-04-03 Time frequency two-dimension converse spectrum characteristic extracting method for recognizing language species

Publications (2)

Publication Number Publication Date
CN101256768A true CN101256768A (en) 2008-09-03
CN101256768B CN101256768B (en) 2011-03-30

Family

ID=39891525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101033280A Expired - Fee Related CN101256768B (en) 2008-04-03 2008-04-03 Time frequency two-dimension converse spectrum characteristic extracting method for recognizing language species

Country Status (1)

Country Link
CN (1) CN101256768B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101702314B (en) * 2009-10-13 2011-11-09 清华大学 Method for establishing identified type language recognition model based on language pair
CN102723081A (en) * 2012-05-30 2012-10-10 林其灿 Voice signal processing method, voice and voiceprint recognition method and device
CN103021407A (en) * 2012-12-18 2013-04-03 中国科学院声学研究所 Method and system for recognizing speech of agglutinative language
CN103295583A (en) * 2012-02-24 2013-09-11 佳能株式会社 Method and equipment for extracting sub-band energy features of sound and monitoring system
CN104992424A (en) * 2015-07-27 2015-10-21 北京航空航天大学 Single-pixel rapid active imaging system based on discrete cosine transform
CN105068048A (en) * 2015-08-14 2015-11-18 南京信息工程大学 Distributed microphone array sound source positioning method based on space sparsity
CN106205638A (en) * 2016-06-16 2016-12-07 清华大学 A kind of double-deck fundamental tone feature extracting method towards audio event detection
CN109036458A (en) * 2018-08-22 2018-12-18 昆明理工大学 A kind of multilingual scene analysis method based on audio frequency characteristics parameter
CN112530407A (en) * 2020-11-25 2021-03-19 北京快鱼电子股份公司 Language identification method and system
CN114067834A (en) * 2020-07-30 2022-02-18 中国移动通信集团有限公司 Bad preamble recognition method and device, storage medium and computer equipment
CN114209325A (en) * 2021-12-23 2022-03-22 东风柳州汽车有限公司 Driver fatigue behavior monitoring method, device, equipment and storage medium
CN115840877A (en) * 2022-12-06 2023-03-24 中国科学院空间应用工程与技术中心 Distributed stream processing method and system for MFCC extraction, storage medium and computer

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FI19992351A (en) * 1999-10-29 2001-04-30 Nokia Mobile Phones Ltd voice recognizer
JP3699912B2 (en) * 2001-07-26 2005-09-28 株式会社東芝 Voice feature extraction method, apparatus, and program

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101702314B (en) * 2009-10-13 2011-11-09 清华大学 Method for establishing identified type language recognition model based on language pair
CN103295583A (en) * 2012-02-24 2013-09-11 佳能株式会社 Method and equipment for extracting sub-band energy features of sound and monitoring system
CN103295583B (en) * 2012-02-24 2015-09-30 佳能株式会社 For extracting the method for the sub belt energy feature of sound, equipment and surveillance
CN102723081A (en) * 2012-05-30 2012-10-10 林其灿 Voice signal processing method, voice and voiceprint recognition method and device
CN102723081B (en) * 2012-05-30 2014-05-21 无锡百互科技有限公司 Voice signal processing method, voice and voiceprint recognition method and device
CN103021407A (en) * 2012-12-18 2013-04-03 中国科学院声学研究所 Method and system for recognizing speech of agglutinative language
CN103021407B (en) * 2012-12-18 2015-07-08 中国科学院声学研究所 Method and system for recognizing speech of agglutinative language
CN104992424B (en) * 2015-07-27 2018-05-25 北京航空航天大学 A kind of single pixel based on discrete cosine transform quickly imaging system
CN104992424A (en) * 2015-07-27 2015-10-21 北京航空航天大学 Single-pixel rapid active imaging system based on discrete cosine transform
CN105068048A (en) * 2015-08-14 2015-11-18 南京信息工程大学 Distributed microphone array sound source positioning method based on space sparsity
CN106205638A (en) * 2016-06-16 2016-12-07 清华大学 A kind of double-deck fundamental tone feature extracting method towards audio event detection
CN106205638B (en) * 2016-06-16 2019-11-08 清华大学 A kind of double-deck fundamental tone feature extracting method towards audio event detection
CN109036458A (en) * 2018-08-22 2018-12-18 昆明理工大学 A kind of multilingual scene analysis method based on audio frequency characteristics parameter
CN114067834A (en) * 2020-07-30 2022-02-18 中国移动通信集团有限公司 Bad preamble recognition method and device, storage medium and computer equipment
CN112530407A (en) * 2020-11-25 2021-03-19 北京快鱼电子股份公司 Language identification method and system
CN112530407B (en) * 2020-11-25 2021-07-23 北京快鱼电子股份公司 Language identification method and system
CN114209325A (en) * 2021-12-23 2022-03-22 东风柳州汽车有限公司 Driver fatigue behavior monitoring method, device, equipment and storage medium
CN114209325B (en) * 2021-12-23 2023-06-23 东风柳州汽车有限公司 Driver fatigue behavior monitoring method, device, equipment and storage medium
CN115840877A (en) * 2022-12-06 2023-03-24 中国科学院空间应用工程与技术中心 Distributed stream processing method and system for MFCC extraction, storage medium and computer

Also Published As

Publication number Publication date
CN101256768B (en) 2011-03-30

Similar Documents

Publication Publication Date Title
CN101256768B (en) Time frequency two-dimension converse spectrum characteristic extracting method for recognizing language species
CN106847292B (en) Method for recognizing sound-groove and device
CN102968986B (en) Overlapped voice and single voice distinguishing method based on long time characteristics and short time characteristics
Tiwari MFCC and its applications in speaker recognition
Thomas et al. Cross-lingual and multi-stream posterior features for low resource LVCSR systems.
US6370504B1 (en) Speech recognition on MPEG/Audio encoded files
CN100514446C (en) Pronunciation evaluating method based on voice identification and voice analysis
CN101226743A (en) Method for recognizing speaker based on conversion of neutral and affection sound-groove model
CN102800316A (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN102737633A (en) Method and device for recognizing speaker based on tensor subspace analysis
CN102789779A (en) Speech recognition system and recognition method thereof
CN104732972A (en) HMM voiceprint recognition signing-in method and system based on grouping statistics
CN101887722A (en) Rapid voiceprint authentication method
CN1787070B (en) On-chip system for language learner
CN101546555A (en) Constraint heteroscedasticity linear discriminant analysis method for language identification
CN103258537A (en) Method utilizing characteristic combination to identify speech emotions and device thereof
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Samal et al. On the use of MFCC feature vector clustering for efficient text dependent speaker recognition
CN104240699A (en) Simple and effective phrase speech recognition method
CN103778914A (en) Anti-noise voice identification method and device based on signal-to-noise ratio weighing template characteristic matching
Al-Rawahy et al. Text-independent speaker identification system based on the histogram of DCT-cepstrum coefficients
Liu et al. Supra-Segmental Feature Based Speaker Trait Detection.
Bansod et al. Speaker Recognition using Marathi (Varhadi) Language
Meghanani et al. Pitch-synchronous DCT features: A pilot study on speaker identification
Sailaja et al. Text independent speaker identification with finite multivariate generalized gaussian mixture model and hierarchical clustering algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20161223

Address after: 100084 Zhongguancun Haidian District East Road No. 1, building 8, floor 8, A803B,

Patentee after: Beijing Hua Kong Chuang Wei Information Technology Co., Ltd.

Address before: 100084 Beijing 100084-82 mailbox

Patentee before: Tsinghua University

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200317

Address after: 100084 Tsinghua University, Beijing, Haidian District

Patentee after: TSINGHUA University

Address before: 100084 Zhongguancun Haidian District East Road No. 1, building 8, floor 8, A803B,

Patentee before: BEIJING HUA KONG CHUANG WEI INFORMATION TECHNOLOGY Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110330

Termination date: 20210403