CN101640043A - Speaker recognition method based on multi-coordinate sequence kernel and system thereof - Google Patents

Speaker recognition method based on multi-coordinate sequence kernel and system thereof

Info

Publication number
CN101640043A
CN101640043A (application number CN200910092138A)
Authority
CN
China
Prior art keywords
sequence
coordinate
training
speaker
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910092138A
Other languages
Chinese (zh)
Inventor
何亮 (He Liang)
邓妍 (Deng Yan)
刘加 (Liu Jia)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN200910092138A priority Critical patent/CN101640043A/en
Publication of CN101640043A publication Critical patent/CN101640043A/en
Pending legal-status Critical Current

Abstract

The invention provides a speaker recognition method based on a multi-coordinate sequence kernel, comprising a training stage and a recognition stage. In the training stage, the method preprocesses the training speech; extracts a feature vector sequence from the preprocessed training speech; selects origins of a multi-coordinate system in the feature vector space and maps the feature vector sequence in each coordinate system; selects an algorithm according to the coordinate systems and splices the vector sequences of the coordinate systems into a supervector; and determines the supervector space and the kernel function of a support vector machine (SVM), training with the SVM algorithm to obtain a trained speaker model. In the recognition stage, the trained model tests the supervector and outputs a decision score. By effectively modeling the speech-signal feature sequence, the speaker recognition method utilizes the information contained in high-dimensional statistics, reduces the computational complexity on an integrated circuit, and improves speaker recognition accuracy and speed.

Description

Speaker recognition method and system based on multi-coordinate sequence kernel
Technical field
The present invention relates to speech recognition and pattern recognition technology, and in particular to a speaker recognition method and system based on a support vector machine model with a multi-coordinate sequence kernel.
Background technology
Speaker recognition is the technology of using a machine to determine the identity of the speaker of a given speech signal. According to the recognition task, speaker recognition divides into two kinds: speaker verification and speaker identification. Speaker verification judges whether a given utterance comes from a given speaker; speaker identification uses a given utterance to find the corresponding speaker in a test library. Speaker recognition technology is mainly used in systems such as security and personalized services.
As shown in Figure 1, a prior-art spectrum-based speaker recognition system follows a basic flow comprising the following steps:
Step S101: convert the speech into features that are easy to discriminate. Commonly used features include Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction (PLP), and their derived features.
Step S102: select a suitable modeling technique to discriminate the features. Common modeling techniques include the Gaussian mixture model (GMM) and the support vector machine (SVM).
Step S103: process the model output to obtain the decision result.
The GMM has simple model parameters and a clear physical meaning, and performs well when training and recognition data are sufficient. In practical applications, however, the speaker's speech is often short, which limits the performance of the GMM. The SVM seeks the optimal classification hyperplane in a high-dimensional space under the guidance of structural risk minimization, and has good recognition capability on small-sample training data. Recently, SVM theory has matured and its applications have made significant progress.
The GMM-SVM recognition system, which combines GMM and SVM, can integrate the advantages of both modeling techniques: for example, a speaker-adapted GMM maps the feature sequence into a vector space, and the SVM then classifies the resulting GMM models. However, the GMM-SVM system leaves two problems unsolved: 1) it does not use the information implied by the high-order statistics of the feature sequence; 2) it does not address the "non-uniformity" among the dimensions of the SVM input space.
Summary of the invention
The purpose of the present invention is to solve at least one of the above technical deficiencies, and in particular the deficiencies of existing speaker recognition technology.
To overcome the above technical deficiencies, one aspect of the present invention proposes a speaker recognition method based on a multi-coordinate sequence kernel, comprising the following steps:
Training stage:
preprocessing the training speech;
extracting a feature vector sequence from the preprocessed training speech;
selecting multiple coordinate-system origins in the feature vector space, and mapping the feature vector sequence in each coordinate system according to the metric relation between the feature vector sequence and each coordinate origin;
selecting an algorithm according to the coordinate systems, and splicing the vector sequences of the coordinate systems into a supervector;
determining the supervector space and the kernel function of the support vector machine (SVM), and training with the SVM algorithm to obtain a trained speaker model;
Recognition stage:
preprocessing the speech to be recognized;
extracting a feature vector sequence from the preprocessed speech;
mapping the feature vector sequence in each coordinate system according to the metric relation between the feature vector sequence and the coordinate origins selected in the training stage;
selecting an algorithm according to the coordinate systems, and splicing the vector sequences of the coordinate systems into a supervector;
testing the supervector with the trained model, outputting a decision score, and recognizing the speaker according to the decision score.
In one embodiment of the present invention, training adopts a one-versus-rest training mode.
In one embodiment of the present invention, selecting the multi-coordinate origins in the feature vector space comprises:
training a Gaussian mixture model with the EM algorithm, and using the Gaussian mixture model means as the coordinate origins.
In one embodiment of the present invention, selecting the multi-coordinate origins in the feature vector space comprises:
adopting the vector quantization (VQ) algorithm, and using the VQ codebook as the origins of the coordinate systems.
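As an illustration of the VQ-codebook embodiment, a k-means-style codebook training can be sketched as follows. The patent names the VQ algorithm but does not fix a variant; the function name, the number of code words `C`, and the iteration count are illustrative choices, and the resulting code words would serve as the coordinate-system origins.

```python
import numpy as np

def train_vq_codebook(features, C=8, iters=20, seed=0):
    """Illustrative k-means-style VQ codebook training.

    features: (T, d) array of feature vectors; returns (C, d) code words
    that can serve as the multi-coordinate origins.
    """
    rng = np.random.default_rng(seed)
    # initialize code words with C distinct feature vectors
    codebook = features[rng.choice(len(features), C, replace=False)]
    for _ in range(iters):
        # assign each feature vector to its nearest code word
        d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # move each code word to the mean of its assigned vectors
        for c in range(C):
            if np.any(labels == c):
                codebook[c] = features[labels == c].mean(axis=0)
    return codebook

# Demo on two well-separated synthetic clusters (illustrative data).
rng = np.random.default_rng(5)
feats = np.vstack([rng.standard_normal((100, 2)) + 5.0,
                   rng.standard_normal((100, 2)) - 5.0])
cb = train_vq_codebook(feats, C=2)
```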
Another aspect of the present invention proposes a speaker recognition system based on the multi-coordinate sequence kernel, comprising a speech preprocessing module, a feature extraction module, a feature sequence mapping module, a training module and a recognition module.
The speech preprocessing module preprocesses the training speech or the speech to be recognized, performing noise reduction, music removal and removal of other speaker-independent parts, and outputs a clean speech signal to the feature extraction module.
The feature extraction module reads in the preprocessed training speech or recognition speech provided by the speech preprocessing module, extracts features, and outputs a feature sequence.
The feature sequence mapping module maps the feature sequence output by the feature extraction module into a supervector according to the selected sub-coordinate systems.
The training module uses the supervectors output by the feature sequence mapping module, selects a suitable kernel function, trains speaker models with the SVM training algorithm, and builds the speaker model library.
The recognition module outputs a decision score according to the supervector mapped from the speech to be recognized and the speaker model library, and recognizes the speaker according to the decision score.
In one embodiment of the present invention, training adopts a one-versus-rest training mode.
In one embodiment of the present invention, the feature sequence mapping module trains a Gaussian mixture model with the EM algorithm and uses the Gaussian mixture model means as the coordinate origins.
In one embodiment of the present invention, the feature sequence mapping module adopts the VQ algorithm and uses the VQ codebook as the origins of the coordinate systems.
In one embodiment of the present invention, the system further comprises a model storage module for saving the speaker model library built by the training module and providing it to the recognition module.
Through effective modeling of the speech-signal feature sequence, the present invention utilizes the information contained in high-dimensional statistics, reduces the computational complexity on an integrated circuit, and improves the accuracy and speed of speaker recognition.
Additional aspects and advantages of the present invention are given in part in the following description; they will partly become obvious from the description, or may be learned through practice of the invention.
Description of drawings
The above and/or additional aspects and advantages of the present invention will become obvious and easy to understand from the following description of embodiments in conjunction with the accompanying drawings, wherein:
Fig. 1 is a basic flowchart of a prior-art spectrum-based speaker recognition system;
Fig. 2 is a flowchart of the speaker recognition method based on a support vector machine model with a multi-coordinate sequence kernel according to an embodiment of the invention;
Fig. 3 is a structural diagram of the speaker recognition system based on the multi-coordinate sequence kernel according to an embodiment of the invention.
Embodiment
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the drawings. It should be understood that the embodiments described with reference to the drawings are exemplary, intended only to explain the present invention, and should not be construed as limiting it.
As shown in Figure 2, which is the flowchart of the speaker recognition method based on a support vector machine model with a multi-coordinate sequence kernel according to an embodiment of the invention, the modeling method of the present invention can be implemented in a digital integrated circuit chip according to the following steps. The recognition method of the embodiment comprises two stages: a training stage and a recognition stage.
Training stage:
Step S201: preprocess the training speech data.
Preprocessing the training speech data comprises the following steps: apply zero-meaning and pre-emphasis to the training speech signal, where zero-meaning means subtracting the mean of the whole utterance from the speech, and pre-emphasis is high-pass filtering of the speech with filter transfer function $H(z) = 1 - \alpha z^{-1}$, where $0.95 \le \alpha \le 1$. Then divide the speech signal into frames, with frame length 20 ms and frame shift 10 ms.
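The preprocessing of Step S201 can be sketched in NumPy as follows. The 8 kHz sample rate and $\alpha = 0.97$ are illustrative choices (the patent only constrains $0.95 \le \alpha \le 1$); the function name is hypothetical.

```python
import numpy as np

def preprocess(signal, sample_rate=8000, alpha=0.97,
               frame_ms=20, shift_ms=10):
    """Zero-mean, pre-emphasis H(z) = 1 - alpha*z^-1, then framing."""
    x = signal - np.mean(signal)                 # zero-mean the utterance
    x = np.append(x[0], x[1:] - alpha * x[:-1])  # pre-emphasis filter
    frame_len = int(sample_rate * frame_ms / 1000)  # 20 ms frame
    shift = int(sample_rate * shift_ms / 1000)      # 10 ms frame shift
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    frames = np.stack([x[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    return frames

# Demo: 2 s of a synthetic tone at 8 kHz -> 160-sample frames, 80-sample shift.
frames = preprocess(np.sin(0.01 * np.arange(16000)), sample_rate=8000)
```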
Step S202: extract features from the preprocessed training speech data.
Extracting features from the preprocessed training speech data comprises the following steps:
Step S301: apply a Hamming window to the speech signal, where the Hamming window function is:

$$\omega_H(n) = \begin{cases} 0.54 - 0.46\cos\!\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1, \\ 0, & \text{otherwise.} \end{cases}$$
Step S302: apply the discrete Fourier transform (DFT) to the windowed data:

$$X(\omega_k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{N} nk}$$

where $\omega_k$ denotes frequency, $k$ is the frequency index, and $N$ is the number of DFT points.
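Steps S301 and S302 together amount to windowing a frame and taking its DFT, which can be sketched as follows (the function name is an assumption for illustration):

```python
import numpy as np

def windowed_spectrum(frame):
    """Apply the Hamming window of Step S301, then the DFT of Step S302."""
    N = len(frame)
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window
    return np.fft.fft(frame * w)  # X(omega_k) for k = 0..N-1

# Demo on one random 256-sample frame.
frame = np.random.default_rng(0).standard_normal(256)
X_k = windowed_spectrum(frame)
```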
Step S303: select a filter bank with $M$ filters ($m = 1, 2, \ldots, M$), where the $m$-th triangular filter is defined as:

$$H_m[k] = \begin{cases} 0, & k < f[m-1], \\ \dfrac{k - f[m-1]}{f[m] - f[m-1]}, & f[m-1] \le k \le f[m], \\ \dfrac{f[m+1] - k}{f[m+1] - f[m]}, & f[m] \le k \le f[m+1], \\ 0, & k > f[m+1], \end{cases}$$

normalized so that $\sum_{m=1}^{M} H_m[k] = 1$. The boundary points $f[m]$ of the triangular windows are determined by:

$$f[m] = \frac{N}{F_s}\, B^{-1}\!\left( B(f_l) + m\, \frac{B(f_h) - B(f_l)}{M+1} \right)$$

where $f_l$ and $f_h$ are the lowest and highest frequencies of the filter bank, $B$ is the mapping from frequency to the Mel scale, $B(f) = 1125 \ln(1 + f/700)$, and $B^{-1}$ is the inverse mapping from the Mel scale back to frequency, $B^{-1}(b) = 700\,(\exp(b/1125) - 1)$.
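A minimal sketch of the filter-bank construction of Step S303 follows. The sample rate and band edges in the demo are illustrative, and the sketch builds unnormalized triangles (the patent additionally normalizes so that the filters sum to 1 at each bin).

```python
import numpy as np

def mel_filterbank(M, N, fs, f_low, f_high):
    """Triangular mel filterbank per Step S303, using
    B(f) = 1125*ln(1 + f/700) and its inverse."""
    B = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
    Binv = lambda b: 700.0 * (np.exp(b / 1125.0) - 1.0)
    # boundary bins f[0..M+1] from the boundary-point formula
    bounds = (N / fs) * Binv(B(f_low) + np.arange(M + 2) *
                             (B(f_high) - B(f_low)) / (M + 1))
    H = np.zeros((M, N // 2 + 1))
    for m in range(1, M + 1):
        lo, ce, hi = bounds[m - 1], bounds[m], bounds[m + 1]
        for k in range(N // 2 + 1):
            if lo <= k <= ce:
                H[m - 1, k] = (k - lo) / (ce - lo)   # rising edge
            elif ce < k <= hi:
                H[m - 1, k] = (hi - k) / (hi - ce)   # falling edge
    return H

# Demo: 24 filters over 0-4000 Hz with a 512-point DFT at 8 kHz.
H = mel_filterbank(M=24, N=512, fs=8000, f_low=0.0, f_high=4000.0)
```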
Step S304: compute the logarithmic energy of each filter output:

$$S[m] = \ln\!\left[ \sum_{k=0}^{N-1} |X(\omega_k)|^2\, H_m[k] \right], \quad 0 < m \le M.$$
Step S305: apply the discrete cosine transform and compute the MFCC coefficients:

$$c[n] = \sum_{m=1}^{M} S[m] \cos\!\left( \pi n (m - 1/2)/M \right), \quad 0 \le n \le Q,$$

keeping the first $Q+1$ dimensions and splicing them into the basic MFCC feature $c = [c_0, c_1, \ldots, c_Q]$.
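Steps S304 and S305 can be sketched together: log filter-bank energies followed by the DCT above. The small floor added inside the logarithm and the function name are assumptions for numerical safety and illustration.

```python
import numpy as np

def mfcc_from_power(power_spectrum, H, Q=12):
    """Steps S304-S305: log filterbank energies, then DCT to MFCCs.

    power_spectrum: |X[k]|^2 for k = 0..N/2; H: (M, N/2+1) filterbank.
    Returns [c_0, ..., c_Q].
    """
    M = H.shape[0]
    S = np.log(H @ power_spectrum + 1e-12)  # log energy per filter (S304)
    m = np.arange(1, M + 1)
    # c[n] = sum_{m=1..M} S[m] * cos(pi*n*(m - 1/2)/M)   (S305)
    c = np.array([np.sum(S * np.cos(np.pi * n * (m - 0.5) / M))
                  for n in range(Q + 1)])
    return c

# Demo on random nonnegative spectra and filters (illustrative data).
rng = np.random.default_rng(1)
Hfb = np.abs(rng.standard_normal((24, 129)))
power = np.abs(rng.standard_normal(129))
c = mfcc_from_power(power, Hfb, Q=12)
```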
Step S306: compute the first-order difference feature $\delta'$ and the second-order difference feature $\delta''$. For the $j$-th feature dimension, the first-order difference is:

$$\delta'_j(n) = \frac{\sum_{d=1}^{D} d\,\big(c_j(n+d) - c_j(n-d)\big)}{\sum_{d=1}^{D} d^2}, \quad j = 1, 2, \ldots, Q,$$

where $D$ is the size of the difference window, typically $D = 2$. The second-order difference feature $\delta''$ is then computed from the first-order difference feature $\delta'$:

$$\delta''_j(n) = \frac{\sum_{d=1}^{D} d\,\big(\delta'_j(n+d) - \delta'_j(n-d)\big)}{\sum_{d=1}^{D} d^2}, \quad j = 1, 2, \ldots, Q.$$

Splicing the original features, the first-order difference features, and the second-order difference features yields the speaker recognition feature vector $y(n)$:

$$y(n) = [c_1(n), c_2(n), \ldots, c_Q(n),\ \delta'_1(n), \ldots, \delta'_Q(n),\ \delta''_1(n), \ldots, \delta''_Q(n)].$$
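The delta computation of Step S306 can be sketched as follows; the edge-replication padding at the sequence boundaries is an assumption, since the patent does not say how the ends are handled.

```python
import numpy as np

def deltas(c, D=2):
    """First-order difference of Step S306 over a (T, Q) feature matrix,
    with difference window D and denominator sum_{d=1..D} d^2."""
    T = c.shape[0]
    denom = sum(d * d for d in range(1, D + 1))
    padded = np.pad(c, ((D, D), (0, 0)), mode="edge")  # replicate ends
    out = np.zeros_like(c, dtype=float)
    for d in range(1, D + 1):
        out += d * (padded[D + d : D + d + T] - padded[D - d : D - d + T])
    return out / denom

def stack_features(c, D=2):
    """Splice static, delta and delta-delta features into y(n)."""
    d1 = deltas(c, D)
    d2 = deltas(d1, D)
    return np.hstack([c, d1, d2])

# Demo: a linear ramp in time; its delta is constant.
ramp = np.tile(np.arange(5.0)[:, None], (1, 3))
y = stack_features(ramp)
d1 = deltas(ramp)
```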
Step S203: select the multi-coordinate origins and extract the speaker supervector.
Selecting the multi-coordinate origins and extracting the speaker supervector comprises the following steps:
Step S401: choose the multi-coordinate origin sequence $o = \{o_1, o_2, \ldots, o_C\}$, where $C$ is the number of coordinate systems. The origins may be chosen as the means of a GMM trained with the EM algorithm, or as the codebook obtained by the VQ algorithm.
Step S402: select a metric $f[y(n), o_c]$, $1 \le c \le C$, between the feature vector $y(n)$ and the origin $o_c$, and compute the occupancy of the feature vector $y(n)$ in each sub-coordinate system:

$$\gamma[y(n) \mid o_j] = \frac{f[y(n), o_j]}{\sum_{c=1}^{C} f[y(n), o_c]}.$$
Step S403: select the feature expansion function $g[y(n), o_c]$ of each coordinate system and, combining the occupancies computed in Step S402, map the feature vector $y(n)$ to the supervector:

$$\upsilon(n) = \big[\gamma[y(n) \mid o_1]\, g[y(n), o_1],\ \gamma[y(n) \mid o_2]\, g[y(n), o_2],\ \ldots,\ \gamma[y(n) \mid o_C]\, g[y(n), o_C]\big].$$
Step S404: average the supervector sequence $\upsilon(n)$ formed by the feature-sequence mapping over time, obtaining the supervector of the utterance: $\upsilon = \frac{1}{T} \sum_{n=1}^{T} \upsilon(n)$.
Step S405: compute the weight vector $\omega$ or the projection space $V$ of the speaker supervectors. One way to compute the weight vector $\omega$ is $\omega_i = \sum_{id} \|\upsilon_i^{id}\|^2$, where the subscript $id$ indexes the speakers, $i$ indexes the dimensions of the weight vector, and $\upsilon_i^{id}$ is the value of the $i$-th dimension of the supervector corresponding to speaker $id$.
Step S204: build the speaker model with the support vector machine algorithm.
Building the speaker model with the support vector machine algorithm comprises the following steps:
Step S501: the support vector machine training algorithm.
Let the input sample set be $(\upsilon_p, \theta_p)$, $p = 1, 2, \ldots, P$, $\theta_p \in \{+1, -1\}$. The samples with $\theta_p = +1$ are usually called positive samples, and those with $\theta_p = -1$ negative samples. The SVM algorithm seeks the optimal classification hyperplane $\omega$ that maximizes the distance between the positive and negative sample sets. The optimal hyperplane $\omega$ is obtained by solving the following optimization problem:
$$\min\ L = \frac{1}{2}\|\omega\|^2 + C \left( \sum_{p=1}^{P} \xi_p \right),$$

where $\|\omega\|^2$ is inversely proportional to the distance between the positive and negative samples, $\xi_p$ are the slack variables introduced when the samples are not linearly separable, and $C$ controls the penalty on misclassified samples. Solving the problem in the dual space, the optimization objective becomes:
$$\max\ \sum_{p=1}^{P} \alpha_p - \frac{1}{2} \sum_{p=1}^{P} \sum_{q=1}^{P} \alpha_p \alpha_q \theta_p \theta_q K(\upsilon_p, \upsilon_q),$$

subject to

$$\sum_{p=1}^{P} \theta_p \alpha_p = 0, \quad \alpha_p \ge 0,\ p = 1, 2, \ldots, P,$$

where $K(\upsilon_p, \upsilon_q)$ is the kernel function of $\upsilon_p$ and $\upsilon_q$. Given the optimal solution $\alpha^*$, the optimal classification hyperplane is a linear combination of the training samples:
$$\omega^* = \sum_{p=1}^{P} \alpha_p^* \theta_p \upsilon_p,$$

and the optimal classification function is $f(\upsilon) = \sum_{p=1}^{P} \alpha_p^* \theta_p K(\upsilon_p, \upsilon) + b^*$.
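For a linear kernel, the classifier of Step S501 can be sketched dependency-free by sub-gradient descent on the primal objective $L = \frac{1}{2}\|\omega\|^2 + C\sum_p \xi_p$ rather than the dual shown above (both yield a linear decision function $f(\upsilon) = \omega^* \cdot \upsilon + b^*$); the learning rate, epoch count, and the synthetic demo data are assumptions.

```python
import numpy as np

def train_linear_svm(X, theta, C=1.0, lr=0.001, epochs=200):
    """Sub-gradient descent on the primal SVM objective:
    L = 0.5*||w||^2 + C * sum_p max(0, 1 - theta_p*(w.x_p + b))."""
    P, dim = X.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(epochs):
        margins = theta * (X @ w + b)
        viol = margins < 1                      # samples with slack xi_p > 0
        grad_w = w - C * (theta[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * theta[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def svm_score(w, b, v):
    """Decision function f(v) = w.v + b."""
    return v @ w + b

# Demo: target-speaker supervectors (+1) vs. background (-1), synthetic.
rng = np.random.default_rng(3)
pos = rng.standard_normal((20, 8)) + 2.0
neg = rng.standard_normal((60, 8)) - 2.0
X = np.vstack([pos, neg])
theta = np.hstack([np.ones(20), -np.ones(60)])
w_star, b_star = train_linear_svm(X, theta)
```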
Step S502: modify the kernel function of Step S501.
In Step S501, $K(\upsilon_p, \upsilon_q)$ represents the metric between the vectors $\upsilon_p$ and $\upsilon_q$; the present invention modifies it to $K(\upsilon_p, \upsilon_q, \zeta)$, where $\zeta$ adjusts the metric between $\upsilon_p$ and $\upsilon_q$. If the weight vector $\omega$ is used, then $\zeta$ denotes $\omega$ and $K(\upsilon_p, \upsilon_q, \zeta) = K(\upsilon_p \cdot \omega, \upsilon_q \cdot \omega)$, where $\upsilon_p \cdot \omega$ denotes the element-wise product of $\upsilon_p$ and $\omega$. There are several ways to choose the weight vector $\omega$; one uses the training samples: $\omega_i = \sum_{id} (\upsilon_i^{id} \cdot \upsilon_i^{id})$, where $id$ indexes the speakers and $i$ indexes the dimensions of the supervector. If the projection subspace $V$ is used, then $\zeta$ denotes $V$ and $K(\upsilon_p, \upsilon_q, \zeta) = K(V\upsilon_p, V\upsilon_q)$, where $V\upsilon_p$ denotes the projection of $\upsilon_p$ onto the subspace $V$; the projection subspace $V$ can be estimated by subspace analysis methods.
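The two kernel modifications of Step S502 can be sketched with a linear base kernel (the base-kernel choice and function names are illustrative):

```python
import numpy as np

def weighted_kernel(vp, vq, w):
    """K(vp, vq, w) = K(vp*w, vq*w): element-wise reweighting of the
    supervector dimensions before the (here, linear) base kernel."""
    return float(np.dot(vp * w, vq * w))

def projected_kernel(vp, vq, V):
    """K(vp, vq, V) = K(V@vp, V@vq): project onto subspace V first."""
    return float(np.dot(V @ vp, V @ vq))

def weight_vector(supervectors):
    """w_i = sum over speakers of (v_i^id)^2, per Steps S405/S502."""
    return (supervectors ** 2).sum(axis=0)

# Demo on random speaker supervectors (illustrative data).
rng = np.random.default_rng(4)
sv = rng.standard_normal((5, 6))
wvec = weight_vector(sv)
a, b2 = sv[0], sv[1]
```

With the identity weight (all ones) or the identity projection, both modified kernels reduce to the unmodified linear kernel, which is a quick sanity check on the construction.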
Step S503: adopt the one-versus-rest training mode, using the SVM training algorithm of Step S501 with the kernel function modified in Step S502; the trained speaker model is $\{\omega^*, b^*\}$.
Step S205: repeat the above steps to build the speaker model library.
Recognition stage:
Step S206: recognize the speaker.
First, extract the SVM input supervector of the test speech following the computation steps above (see Steps S201-S203; the details are not repeated here). Then score the SVM input supervector of the test speech with the speaker model $\{\omega^*, b^*\}$ trained in Step S503. With the weighting method, the score is $f(\upsilon_t) = \omega^* \cdot (\upsilon_t \cdot \omega) + b^*$; with the projection method, $f(\upsilon_t) = \omega^* \cdot (V\upsilon_t) + b^*$. If the score is greater than a certain threshold, the test speech is judged to come from the same speaker as the speech used to train the speaker model; if the score is less than or equal to the threshold, the test speech is judged not to come from the same speaker.
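The verification decision of Step S206 can be sketched as follows; the function name and the default threshold of 0 are assumptions (the patent only says "a certain threshold", whose value is application-dependent).

```python
import numpy as np

def verify(w_star, b_star, v_test, threshold=0.0, weight=None, V=None):
    """Score a test supervector with the trained model {w*, b*} and
    compare to a threshold (Step S206)."""
    if weight is not None:                      # weighting method
        score = float(np.dot(w_star, v_test * weight) + b_star)
    elif V is not None:                         # projection method
        score = float(np.dot(w_star, V @ v_test) + b_star)
    else:                                       # unmodified linear score
        score = float(np.dot(w_star, v_test) + b_star)
    return score, score > threshold

# Demo with a tiny hand-made model (illustrative numbers).
w_model = np.array([1.0, -1.0])
b_model = 0.5
s1, accept1 = verify(w_model, b_model, np.array([2.0, 1.0]))
s2, accept2 = verify(w_model, b_model, np.array([0.0, 1.0]),
                     weight=np.array([1.0, 2.0]))
```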
As shown in Figure 3, which is the structural diagram of the speaker recognition system based on the multi-coordinate sequence kernel according to an embodiment of the invention, the system comprises a speech preprocessing module 100, a feature extraction module 200, a feature sequence mapping module 300, a training module 400 and a recognition module 500. The speech preprocessing module 100 preprocesses the training speech or the speech to be recognized, performing noise reduction, music removal and removal of other speaker-independent parts, and outputs a clean speech signal to the feature extraction module 200. The feature extraction module 200 reads in the preprocessed training speech or recognition speech provided by the speech preprocessing module 100, extracts features, and outputs a feature sequence. The feature sequence mapping module 300 maps the feature sequence output by the feature extraction module 200 into a supervector according to the selected sub-coordinate systems. The training module 400 uses the supervectors output by the feature sequence mapping module 300, selects a suitable kernel function, trains speaker models with the SVM training algorithm, and builds the speaker model library. The recognition module 500 outputs a decision score according to the supervector mapped from the speech to be recognized and the speaker model library, and recognizes the speaker according to the decision score.
In one embodiment of the present invention, training can adopt a one-versus-rest training mode.
In one embodiment of the present invention, the feature sequence mapping module 300 can train a Gaussian mixture model with the EM algorithm and use the Gaussian mixture model means as the coordinate origins.
In an alternative embodiment of the present invention, the feature sequence mapping module 300 can adopt the VQ algorithm and use the VQ codebook as the origins of the coordinate systems.
In an alternative embodiment of the present invention, the system further comprises a model storage module 600 for saving the speaker model library built by the training module 400 and providing it to the recognition module 500.
Through effective modeling of the speech-signal feature sequence, the present invention utilizes the information contained in high-dimensional statistics, reduces the computational complexity on an integrated circuit, and improves the accuracy and speed of speaker recognition.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will appreciate that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (9)

1. A speaker recognition method based on a multi-coordinate sequence kernel, characterized by comprising the following steps:
Training stage:
preprocessing the training speech;
extracting a feature vector sequence from the preprocessed training speech;
selecting multiple coordinate-system origins in the feature vector space, and mapping the feature vector sequence in each coordinate system according to the metric relation between the feature vector sequence and each coordinate origin;
selecting an algorithm according to the coordinate systems, and splicing the vector sequences of the coordinate systems into a supervector;
determining the supervector space and the kernel function of the support vector machine (SVM), and training with the SVM algorithm to obtain a trained speaker model;
Recognition stage:
preprocessing the speech to be recognized;
extracting a feature vector sequence from the preprocessed speech;
mapping the feature vector sequence in each coordinate system according to the metric relation between the feature vector sequence and the coordinate origins selected in the training stage;
selecting an algorithm according to the coordinate systems, and splicing the vector sequences of the coordinate systems into a supervector;
testing the supervector with the trained model, outputting a decision score, and recognizing the speaker according to the decision score.
2. The speaker recognition method based on a multi-coordinate sequence kernel according to claim 1, characterized in that training adopts a one-versus-rest training mode.
3. The speaker recognition method based on a multi-coordinate sequence kernel according to claim 1, characterized in that selecting the multi-coordinate origins in the feature vector space comprises:
training a Gaussian mixture model with the EM algorithm, and using the Gaussian mixture model means as the coordinate origins.
4. The speaker recognition method based on a multi-coordinate sequence kernel according to claim 1, characterized in that selecting the multi-coordinate origins in the feature vector space comprises:
adopting the VQ algorithm, and using the VQ codebook as the origins of the coordinate systems.
5. A speaker recognition system based on a multi-coordinate sequence kernel, characterized by comprising a speech preprocessing module, a feature extraction module, a feature sequence mapping module, a training module and a recognition module, wherein:
the speech preprocessing module preprocesses the training speech or the speech to be recognized, performing noise reduction, music removal and removal of other speaker-independent parts, and outputs a clean speech signal to the feature extraction module;
the feature extraction module reads in the preprocessed training speech or recognition speech provided by the speech preprocessing module, extracts features, and outputs a feature sequence;
the feature sequence mapping module maps the feature sequence output by the feature extraction module into a supervector according to the selected sub-coordinate systems;
the training module uses the supervectors output by the feature sequence mapping module, selects a suitable kernel function, trains speaker models with the SVM training algorithm, and builds the speaker model library;
the recognition module outputs a decision score according to the supervector mapped from the speech to be recognized and the speaker model library, and recognizes the speaker according to the decision score.
6. The speaker recognition system based on a multi-coordinate sequence kernel according to claim 5, characterized in that training adopts a one-versus-rest training mode.
7. The speaker recognition system based on a multi-coordinate sequence kernel according to claim 5, characterized in that the feature sequence mapping module trains a Gaussian mixture model with the EM algorithm and uses the Gaussian mixture model means as the coordinate origins.
8. The speaker recognition system based on a multi-coordinate sequence kernel according to claim 5, characterized in that the feature sequence mapping module adopts the VQ algorithm and uses the VQ codebook as the origins of the coordinate systems.
9. The speaker recognition system based on a multi-coordinate sequence kernel according to claim 5, characterized in that the system further comprises a model storage module for saving the speaker model library built by the training module and providing it to the recognition module.
CN200910092138A 2009-09-01 2009-09-01 Speaker recognition method based on multi-coordinate sequence kernel and system thereof Pending CN101640043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910092138A CN101640043A (en) 2009-09-01 2009-09-01 Speaker recognition method based on multi-coordinate sequence kernel and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910092138A CN101640043A (en) 2009-09-01 2009-09-01 Speaker recognition method based on multi-coordinate sequence kernel and system thereof

Publications (1)

Publication Number Publication Date
CN101640043A true CN101640043A (en) 2010-02-03

Family

ID=41614993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910092138A Pending CN101640043A (en) 2009-09-01 2009-09-01 Speaker recognition method based on multi-coordinate sequence kernel and system thereof

Country Status (1)

Country Link
CN (1) CN101640043A (en)


Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186527A (en) * 2011-12-27 2013-07-03 北京百度网讯科技有限公司 System for building music classification model, system for recommending music and corresponding method
CN102543075A (en) * 2012-01-12 2012-07-04 东北石油大学 Speaker VQ-SVM (Vector Quantization-Support Vector Machine) parallel identification system based on virtual instrument technology
CN103714818A (en) * 2013-12-12 2014-04-09 清华大学 Speaker recognition method based on noise shielding nucleus
CN103714818B * 2013-12-12 2016-06-22 清华大学 Speaker recognition method based on noise masking kernel
CN107580722A (en) * 2015-05-27 2018-01-12 英特尔公司 Gauss hybrid models accelerator with the direct memory access (DMA) engine corresponding to each data flow
CN107580722B (en) * 2015-05-27 2022-01-14 英特尔公司 Gaussian mixture model accelerator with direct memory access engines corresponding to respective data streams
CN106373576A (en) * 2016-09-07 2017-02-01 Tcl集团股份有限公司 Speaker confirmation method based on VQ and SVM algorithms, and system thereof
CN106373576B (en) * 2016-09-07 2020-07-21 Tcl科技集团股份有限公司 Speaker confirmation method and system based on VQ and SVM algorithms
CN106448681A (en) * 2016-09-12 2017-02-22 南京邮电大学 Super-vector speaker recognition method
CN106888392A (en) * 2017-02-14 2017-06-23 广东九联科技股份有限公司 Set-top box automatic translation system and method
CN106910394B (en) * 2017-02-21 2019-12-10 成都景中教育软件有限公司 Dynamic geometric multi-coordinate system implementation method
CN106910394A (en) * 2017-02-21 2017-06-30 成都景中教育软件有限公司 Dynamic geometric multi-coordinate system implementation method
US10276167B2 (en) 2017-06-13 2019-04-30 Beijing Didi Infinity Technology And Development Co., Ltd. Method, apparatus and system for speaker verification
TWI719304B (en) * 2017-06-13 2021-02-21 大陸商北京嘀嘀無限科技發展有限公司 Method, apparatus and system for speaker verification
US10937430B2 (en) 2017-06-13 2021-03-02 Beijing Didi Infinity Technology And Development Co., Ltd. Method, apparatus and system for speaker verification
WO2018227381A1 (en) * 2017-06-13 2018-12-20 Beijing Didi Infinity Technology And Development Co., Ltd. Method, apparatus and system for speaker verification
CN107507611A (en) * 2017-08-31 2017-12-22 苏州大学 Voice classification recognition method and device
CN107507611B (en) * 2017-08-31 2021-08-24 苏州大学 Voice classification recognition method and device
CN109493847B (en) * 2018-12-14 2019-10-18 广州一玛网络科技有限公司 Sound recognition system and voice recognition device
CN109493847A (en) * 2018-12-14 2019-03-19 广州玛网络科技有限公司 Sound recognition system and voice recognition device
CN113779191A (en) * 2021-07-23 2021-12-10 中国人民解放军61623部队 User identification method based on user joint information super vector and joint information model
CN113779191B (en) * 2021-07-23 2024-03-05 中国人民解放军61623部队 User identification method based on user joint information supervector and joint information model

Similar Documents

Publication Publication Date Title
CN101640043A (en) Speaker recognition method based on multi-coordinate sequence kernel and system thereof
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
CN101833951B (en) Multi-background modeling method for speaker recognition
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN102737633B (en) Method and device for recognizing speaker based on tensor subspace analysis
CN110457432B (en) Interview scoring method, device, equipment and storage medium
CN107393554B (en) Feature extraction method fusing inter-class standard deviation for acoustic scene classification
CN101894548B (en) Modeling method and modeling device for language identification
CN107610707A (en) Voiceprint recognition method and device
US20160111112A1 (en) Speaker change detection device and speaker change detection method
CN105261367B (en) Speaker recognition method
CN111081279A (en) Voice emotion fluctuation analysis method and device
CN107731233A (en) RNN-based voiceprint recognition method
CN103794207A (en) Dual-mode voice identity recognition method
CN109767776B (en) Deception voice detection method based on dense neural network
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN110164453A (en) Multi-model fusion voiceprint recognition method, terminal, server and storage medium
CN105469784A (en) Generation method for probabilistic linear discriminant analysis (PLDA) model and speaker clustering method and system
CN102789779A (en) Speech recognition system and recognition method thereof
CN103489445A (en) Method and device for recognizing human voices in audio
CN108269575A (en) Speech recognition method for updating a voiceprint database, terminal device and storage medium
CN108962231A (en) Speech classification method, device, server and storage medium
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN107274890A (en) Voiceprint spectrum extraction method and device
CN104464738B (en) Voiceprint recognition method for intelligent mobile devices

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20100203