CN1652206A - Voiceprint recognition method - Google Patents

Voiceprint recognition method

Info

Publication number
CN1652206A
Authority
CN
China
Prior art keywords: vector sequence, sequence, model, speaker
Prior art date
Legal status
Granted
Application number
CNA2005100599131A
Other languages
Chinese (zh)
Other versions
CN1302456C (en)
Inventor
Zheng Fang (郑方)
Xiong Zhenyu (熊振宇)
Song Zhanjiang (宋战江)
Current Assignee
Beijing D-Ear Technologies Co., Ltd.
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CNB2005100599131A
Publication of CN1652206A
Application granted
Publication of CN1302456C
Active legal status
Anticipated expiration


Abstract

The present invention provides a voiceprint recognition method, belonging to the field of identity recognition based on biometric characteristics. The method comprises the following steps: first, acoustic features are extracted from the voice waveforms of several speakers to form each speaker's feature vector sequence; a universal background model is constructed from these feature vector sequences, and a probability model is trained for each speaker; acoustic features are then extracted from the speech to be identified to form its feature vector sequence, which is rearranged to obtain a reordered feature vector sequence; for each vector in the reordered sequence, kernel Gaussian mixtures are selected from a Gaussian mixture tree; the probability likelihood scores matching the reordered feature vector sequence against each speaker's probability model are computed and summed, pruning is applied as the sums accumulate, and the model with the maximum score is taken as the recognition result.

Description

A voiceprint recognition method
Technical field
The present invention relates to a voiceprint recognition method, and belongs to the technical field of identity recognition based on biometric characteristics.
Background technology
In the prior art, the text-independent voiceprint recognition (Voiceprint Recognition) method based on the universal background model (Universal Background Model, hereinafter UBM) comprises three parts: the training method of the universal background model UBM, the training method of the speaker models, and the voiceprint recognition method.
The universal background model UBM is trained as follows:
(1) Extract acoustic features from the voice waveforms of multiple speakers to form each speaker's feature vector sequence;
(2) Construct a universal background model from the feature vector sequences. All speakers' feature vectors are clustered with an existing clustering algorithm (such as the classical LBG algorithm) to obtain a mixture of K Gaussian distributions, where the k-th Gaussian has mean vector μ_k and diagonal covariance matrix Σ_k. Let w_k denote the fraction of all feature vectors assigned to the k-th Gaussian during clustering. The universal background model is then

UBM = {μ_k^ubm, Σ_k^ubm, w_k^ubm | 1 ≤ k ≤ K}.
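The clustering step above can be sketched as follows. This is a minimal k-means stand-in for the LBG clustering the patent mentions, summarizing each cluster as a diagonal-covariance Gaussian; the function name, parameters, and empty-cluster handling are illustrative, not from the patent:

```python
import numpy as np

def train_ubm(features, K=8, iters=20, seed=0):
    """Cluster feature vectors into K groups and summarize each group as a
    diagonal-covariance Gaussian (a k-means stand-in for LBG clustering)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from K distinct feature vectors.
    mu = features[rng.choice(len(features), K, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        d = ((features[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Re-estimate centroids (skip clusters that went empty).
        for k in range(K):
            if (labels == k).any():
                mu[k] = features[labels == k].mean(0)
    # Diagonal variances, and weights w_k = fraction of vectors in cluster k.
    var = np.stack([features[labels == k].var(0) + 1e-6
                    if (labels == k).any() else np.ones(features.shape[1])
                    for k in range(K)])
    w = np.bincount(labels, minlength=K) / len(features)
    return mu, var, w
```

The returned triple (means, variances, weights) plays the role of {μ_k, Σ_k, w_k | 1 ≤ k ≤ K} above.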
The speaker models are trained as follows:
(1) Extract acoustic features from each speaker's voice waveform to form that speaker's feature vector sequence;
(2) Adapt the universal background model to each speaker's feature vector sequence to obtain that speaker's voiceprint model, and gather all the voiceprint models into a model bank. Any existing adaptation method (such as classical MAP adaptation) may be used; the Gaussian mixtures of a speaker's voiceprint model M = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K} correspond one-to-one with those of the universal background model UBM = {μ_k^ubm, Σ_k^ubm, w_k^ubm | 1 ≤ k ≤ K}.
The voiceprint recognition procedure is:
(1) Extract acoustic features from the speech of the person to be identified to form the feature vector sequence to be identified;
(2) Match this feature vector sequence against each voiceprint model in the model bank one by one to obtain a matching score (also called the log-likelihood score, or simply the likelihood or score) for each speaker's voiceprint model. The matching score between the feature vector sequence and a speaker model is computed as follows: for each frame X_t (1 ≤ t ≤ T) of the sequence to be identified X = {X_1, ..., X_T}, first match it against the universal background model UBM = {μ_k^ubm, Σ_k^ubm, w_k^ubm | 1 ≤ k ≤ K} to find the N Gaussian mixtures k_1, ..., k_N that best match X_t; then use the corresponding Gaussian mixtures of the speaker's voiceprint model M = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K} to compute the frame's matching score

S(X_t | M) = ln Σ_{n=1}^{N} w_{k_n} · p(X_t | μ_{k_n}, Σ_{k_n});

the score of the whole sequence is then

S(X | M) = Σ_{t=1}^{T} S(X_t | M);
(3) Depending on the type of recognition (closed-set identification, open-set identification, or verification), apply a rejection decision where needed, and output the result.
Shortcoming: the main problem of the UBM-based voiceprint recognition method is that the amount of computation required for recognition is too large. The computation comprises:
(1) For each frame's feature vector X_t, 1 ≤ t ≤ T, the N best-matching mixtures must be selected from the universal background model; since the number of mixtures in the UBM is usually very large (typically 1,024 or 2,048), this step is expensive. (2) A matching score must be computed for every speaker model; although each speaker model only requires the scores of N Gaussian mixtures (usually N = 4), a very large number of speaker models likewise causes a very large amount of computation.
Summary of the invention
The object of the present invention is to propose a voiceprint recognition method that overcomes the excessive computational cost of the existing UBM-based voiceprint recognition method and improves the speed of voiceprint recognition.
The voiceprint recognition method proposed by the present invention comprises the following steps:
(1) Extract acoustic features from the voice waveforms of multiple speakers to form each speaker's feature vector sequence;
(2) Construct a universal background model from the above feature vector sequences;
(3) Construct a Gaussian mixture tree from the universal background model;
(4) Train each speaker's probability model from the universal background model;
(5) Extract acoustic features from the speech to be identified to form its feature vector sequence, and rearrange the feature vectors to obtain a reordered feature vector sequence;
(6) For each vector in the reordered feature vector sequence, select the kernel Gaussian mixtures from the constructed Gaussian mixture tree;
(7) Using the kernel Gaussian mixtures, compute the probability likelihood scores matching the reordered feature vectors of the speech to be identified against each speaker's probability model;
(8) Sum the probability likelihood scores for each speaker's probability model, prune as the sums accumulate, and take the model with the maximum score as the recognition result.
In the above method, the rearrangement of the feature vectors in step (5) to obtain the reordered feature vector sequence comprises the following steps:
(1) From the feature vector sequence X = {X_1, ..., X_T}, select vectors at intervals of n to form the vector sequence O = {X_1, X_{1+n}, X_{1+2n}, ...}, and initialize a sequence Y = O;
(2) In sequence Y, take the arithmetic mean of the indices of each pair of adjacent vectors from left to right; if the vector whose index is nearest to this mean is not yet in Y, take it from X and append it to a new vector sequence Q;
(3) Append the resulting vector sequence Q to the back of Y;
(4) Repeat steps (2) and (3) until all vectors of X = {X_1, ..., X_T} have been arranged into Y.
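The reordering steps above can be sketched over frame indices (0-based here). The final straggler fill-in is my own addition to guarantee termination for awkward values of T and n; it is not part of the patent's description:

```python
def reorder_indices(T, n):
    """Coarse-to-fine reordering of frame indices 0..T-1, a sketch of the
    observation-reordering step: spread the frames out so that early pruning
    decisions already see the whole utterance."""
    y = list(range(0, T, n))          # every n-th frame first: O = Y
    seen = set(y)
    while len(y) < T:
        q = []
        for a, b in zip(y, y[1:]):    # adjacent entries of Y, left to right
            m = round((a + b) / 2)    # index nearest the midpoint
            if m not in seen:
                seen.add(m)
                q.append(m)
        if not q:                     # fill any stragglers (termination guard)
            q = [i for i in range(T) if i not in seen]
            seen.update(q)
        y += q                        # append Q to the back of Y
    return y
```

For example, with T = 9 and n = 4 the sequence starts [0, 4, 8] and then fills in the midpoints, so every prefix samples the utterance roughly uniformly.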
In the above method, the selection of kernel Gaussian mixtures from the constructed Gaussian mixture tree for each feature vector comprises the following steps:
(1) Let all child nodes of the root of the Gaussian mixture tree be the candidate node set;
(2) For the given feature vector, compute the likelihood score of each Gaussian distribution in the candidate node set;
(3) If the candidate nodes are leaf nodes, select the N Gaussian distributions with the highest likelihood scores as the kernel Gaussian mixtures; otherwise, select the K nodes with the highest likelihood scores, take all child nodes of these K nodes as the new candidate node set, and repeat steps (2) and (3).
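The tree search above amounts to a beam search; here is a minimal sketch assuming each node stores one Gaussian and each leaf carries its mixture index in the UBM. The Node class and the uniform-depth assumption are mine, not the patent's:

```python
import numpy as np

class Node:
    def __init__(self, mu, var, children=None, index=None):
        self.mu, self.var = mu, var        # Gaussian stored at this node
        self.children = children or []     # empty list for leaf nodes
        self.index = index                 # leaf's mixture index in the UBM

def log_gauss(x, mu, var):
    # Log-density of a diagonal-covariance Gaussian at x.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def select_kernels(x, root, N=4, K=2):
    """Beam search down the Gaussian tree: keep the K best nodes per level,
    and return the indices of the N best leaves (the kernel mixtures)."""
    cand = root.children                   # start from the root's children
    while True:
        scored = sorted(cand, key=lambda nd: log_gauss(x, nd.mu, nd.var),
                        reverse=True)
        if not scored[0].children:         # reached the leaf level
            return [nd.index for nd in scored[:N]]
        # Expand the K best internal nodes into the next candidate set.
        cand = [c for nd in scored[:K] for c in nd.children]
```

Because only K subtrees are expanded per level, far fewer Gaussians are evaluated per frame than the 1,024 or 2,048 mixtures of a flat UBM.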
In the above method, the pruning over the sums of probability likelihood scores in step (8), taking the maximum score as the recognition result, comprises the following steps:
(1) Let the set of all speakers' probability models be the candidate set;
(2) For each vector in the reordered vector sequence in turn, compute the likelihood score of every probability model in the candidate set, and set the threshold Θ_τ = S(τ) − B, where S(τ) is the highest likelihood score among the models in the candidate set after the first τ frames of the reordered sequence have been computed, and B is a constant chosen according to the recognition requirements;
(3) Delete from the candidate set every speaker model whose likelihood score is below the threshold;
(4) Repeat steps (2) and (3) until only one model remains in the candidate set or all vectors have been processed.
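The pruning loop can be sketched as follows, with a hypothetical per-frame scoring callback standing in for the likelihood computation; the dictionary layout and names are illustrative:

```python
def prune_identify(Y, models, score_frame, B=10.0):
    """Frame-synchronous pruning over the reordered sequence Y: after each
    frame, drop every model more than B below the current best score."""
    cand = dict(models)                   # name -> model; all speakers start live
    total = {name: 0.0 for name in cand}  # accumulated log-likelihood scores
    for y in Y:
        for name, m in cand.items():
            total[name] += score_frame(y, m)
        best = max(total[name] for name in cand)
        theta = best - B                  # pruning threshold Θ_τ = S(τ) − B
        cand = {n: m for n, m in cand.items() if total[n] >= theta}
        if len(cand) == 1:                # only one candidate left: stop early
            break
    # Return the highest-scoring surviving model.
    return max(cand, key=lambda n: total[n])
```

Because Y is the coarse-to-fine reordered sequence, poorly matching speakers tend to fall below Θ_τ after only a few frames, which is where the speed-up comes from.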
The voiceprint recognition method proposed by the present invention introduces tree-based kernel selection (TBKS) and observation-reordering-based pruning (ORBP) into the UBM-based voiceprint recognition system. Without substantially reducing the recognition rate, it greatly reduces the computation required for voiceprint recognition and improves its speed. The method of the present invention and the general UBM-based method were tested on a speech database of 1,031 speakers and 1,816 test utterances: the general UBM-based method achieved a recognition accuracy of 95.32%, while the method of the present invention achieved 95.26% and ran 16 times faster.
Description of drawings
Fig. 1 is a schematic diagram of the structure of the Gaussian mixture tree used in the method of the present invention.
Embodiment
The voiceprint recognition method proposed by the present invention first extracts acoustic features from the voice waveforms of multiple speakers to form each speaker's feature vector sequence; constructs a universal background model from these sequences; constructs a Gaussian mixture tree from the universal background model; trains each speaker's probability model from the universal background model; extracts acoustic features from the speech to be identified, forms its feature vector sequence, and rearranges it into a reordered feature vector sequence; selects kernel Gaussian mixtures from the constructed Gaussian mixture tree for each vector of the reordered sequence; computes, using the kernel Gaussian mixtures, the probability likelihood scores matching the reordered feature vectors against each speaker's probability model; and sums these scores per speaker, pruning as the sums accumulate, taking the model with the maximum score as the recognition result.
An embodiment of the present invention is described below.
The embodiment of the voiceprint recognition method of the present invention comprises the training of the universal background model, the construction of the Gaussian mixture tree over the universal background model, the training of the speaker models, and the voiceprint recognition, described as follows:
The concrete steps of training the universal background model in this embodiment are:
(1) Take the voice data of 60 male and 60 female speakers, analyze the raw speech waveform data, and discard all silent segments;
(2) With a frame width of 32 milliseconds and a frame shift of half the frame width, extract 16-dimensional linear prediction cepstral coefficients (LPCC) from each frame and compute their regression parameters, forming 32-dimensional feature vectors; the feature vectors of all frames form the feature vector sequence;
(3) Cluster the speakers' feature vector sequences with the classical LBG algorithm to obtain a mixture of 1,024 Gaussian distributions, where the k-th Gaussian has mean vector μ_k and diagonal covariance matrix Σ_k. Let w_k be the fraction of all feature vectors assigned to the k-th Gaussian during LBG clustering. The universal background model is then UBM = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K}.
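Steps (1)-(2) above (silence removal and framing) might be sketched as follows. The energy threshold used to discard quiet frames is an assumption, since the patent does not specify its silence detector; the LPCC extraction itself is omitted:

```python
import numpy as np

def frame_signal(x, sr, frame_ms=32, energy_ratio=0.1):
    """Split a waveform into 32 ms frames with a half-frame shift and drop
    low-energy (silent) frames. The threshold choice is an assumption."""
    flen = int(sr * frame_ms / 1000)      # samples per 32 ms frame
    hop = flen // 2                       # frame shift = half the frame width
    n = 1 + max(0, (len(x) - flen) // hop)
    frames = np.stack([x[i * hop : i * hop + flen] for i in range(n)])
    # Keep frames whose mean energy exceeds a fraction of the average energy.
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > energy_ratio * energy.mean()]
```

Each retained frame would then be passed to an LPCC extractor to produce the 16-dimensional cepstra plus their 16 regression parameters.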
The concrete steps of constructing the Gaussian mixture tree over the universal background model in this embodiment are:
(1) Specify a tree structure of 5 layers: the root node (first layer) has 16 child nodes, each second-layer node has 4 child nodes, each third-layer node has 4 child nodes, and the number of fourth-layer nodes is determined by the construction method of the Gaussian mixture tree;
(2) Construct the Gaussian mixture tree with the aforementioned construction method.
The concrete steps of training the speaker models in this embodiment are:
(1) Take one speaker's voice data, analyze the raw speech waveform data, and discard all silent segments;
(2) With a frame width of 32 milliseconds and a frame shift of half the frame width, extract 16-dimensional linear prediction cepstral coefficients (LPCC) from each frame and compute their regression parameters, forming 32-dimensional feature vectors; the feature vectors of all frames form the feature vector sequence;
(3) Adapt the universal background model to the speaker's feature vector sequence with the classical MAP method to obtain the speaker model;
(4) If any speaker remains untrained, return to step (1) for the next speaker; otherwise the training process ends.
The voiceprint recognition of this embodiment comprises the following steps:
(1) Collect the voice data of the speaker to be identified, analyze the raw speech waveform data, and discard all silent segments;
(2) With the same frame width and frame shift used in voiceprint model training, extract 16-dimensional linear prediction cepstral coefficients (LPCC) from each frame and compute their regression parameter vectors, forming 32-dimensional feature vectors to be identified; the feature vectors of all frames form the feature vector sequence to be identified X = {X_1, ..., X_T};
(3) Apply the observation-reordering-based pruning method to resequence X = {X_1, ..., X_T} into a new sequence Y = {Y_1, ..., Y_T};
(4) Let the candidate set be all the speakers' voiceprint models in the voiceprint model bank;
(5) For each frame's feature vector Y_τ, 1 ≤ τ ≤ T, apply the aforementioned best-matching-mixture search to find the 4 Gaussian mixtures of the universal background model that best match this frame, with labels k_1, k_2, k_3, k_4;
(6) Take a speaker's voiceprint model M = {μ_k, Σ_k, w_k | 1 ≤ k ≤ K} from the candidate set and compute its matching score
S(Y_τ | M) = Σ_{i=1}^{4} w_{k_i} · p(Y_τ | μ_{k_i}, Σ_{k_i});
then compute the accumulated score of this model
S(M) = Σ_{t=1}^{τ} ln S(Y_t | M);
(7) Find the speaker model with the highest accumulated score in the candidate set; denoting its accumulated score S_max(τ), set the pruning threshold Θ_τ = S_max(τ) − B, and delete from the candidate set every voiceprint model whose matching score is below Θ_τ;
(8) Repeat the above steps until only one speaker model remains in the candidate set or the whole feature vector sequence has been processed;
(9) Take the maximum accumulated score S_max(T) in the candidate set and the corresponding speaker model M_max as the recognition result; output the result, and the voiceprint recognition process ends.

Claims (4)

1. A voiceprint recognition method, characterized in that the method comprises the following steps:
(1) extracting acoustic features from the voice waveforms of multiple speakers to form each speaker's feature vector sequence;
(2) constructing a universal background model from the above feature vector sequences;
(3) constructing a Gaussian mixture tree from the universal background model;
(4) training each speaker's probability model from the universal background model;
(5) extracting acoustic features from the speech to be identified to form its feature vector sequence, and rearranging the feature vectors to obtain a reordered feature vector sequence;
(6) for each vector in the reordered feature vector sequence, selecting the kernel Gaussian mixtures from the constructed Gaussian mixture tree;
(7) using the kernel Gaussian mixtures, computing the probability likelihood scores matching the reordered feature vectors of the speech to be identified against each speaker's probability model;
(8) summing the probability likelihood scores for each speaker's probability model, pruning, and taking the model with the maximum score as the recognition result.
2, the method for claim 1 is characterized in that wherein step (5) with the eigenvector rearrangement, and the method for the feature vector sequence that obtains reordering may further comprise the steps:
(1) at feature vector sequence X={X 1... X TIn, n selects vector with the interval, forms vector sequence O={X 1, X 1+n, X 1+2n... }, set up sequence Y, make Y=O;
(2) in sequence Y, get the arithmetic mean of the sequence number of adjacent vector from left to right successively, if from the vector of the nearest sequence number correspondence of this mean value not in above-mentioned Y, then from X, take out this vector and join among the new vector sequence Q;
(3) back of adding the above-mentioned vector sequence Q that obtains to vector sequence Y;
(4) repeating step (2) and (3) are up to vector sequence X={X 1... X TIn all vector arrange alls in vector sequence Y.
3, the method for claim 1 is characterized in that wherein being each eigenvector, selects the method for core Gaussian Mixture from the Gaussian Mixture tree that makes up, and comprises the steps:
(1) all child nodes of establishing the root node of Gaussian Mixture tree are the both candidate nodes set;
(2) to described each eigenvector, the likelihood mark of each Gaussian distribution in the calculated candidate node set;
(3) if both candidate nodes is a leaf node, then select N the highest Gaussian distribution of likelihood mark as the core Gaussian Mixture; If both candidate nodes is not a leaf node, then select K the highest node of likelihood mark, all child nodes of K node are gathered as both candidate nodes, repeat above-mentioned steps (2) and (3).
4, the method for claim 1 is characterized in that wherein step (8) is carried out beta pruning to the summation of probability likelihood mark, and that gets the mark maximum is the method for recognition result, may further comprise the steps:
(1) the probability model set of establishing all speakers is candidate collection;
(2) successively to each vector in the described vector sequence that reorders, the likelihood mark of all probability models in the calculated candidate set, and threshold value Θ is set τ=S (τ)-B, wherein, S (τ) is for calculating in the vector sequence that reorders behind the τ frame, and the highest likelihood mark of model in the candidate collection, B be the constant according to the identification requirement setting;
(3) all likelihood marks are deleted from candidate collection less than the speaker model of above-mentioned threshold value;
(4) repeating step (2) and (3), only surplus next model in candidate collection, or all vectors have all been calculated.
CNB2005100599131A 2005-04-01 2005-04-01 Voiceprint recognition method Active CN1302456C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100599131A CN1302456C (en) 2005-04-01 2005-04-01 Voiceprint recognition method


Publications (2)

Publication Number Publication Date
CN1652206A true CN1652206A (en) 2005-08-10
CN1302456C CN1302456C (en) 2007-02-28

Family

ID=34876833

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100599131A Active CN1302456C (en) 2005-04-01 2005-04-01 Sound veins identifying method

Country Status (1)

Country Link
CN (1) CN1302456C (en)




Also Published As

Publication number Publication date
CN1302456C (en) 2007-02-28


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: BEIJING D-EAR TECHNOLOGIES CO., LTD.

Free format text: FORMER OWNER: ZHENG FANG

Effective date: 20121231

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20121231

Address after: 100084 room 1005, B building, Tsinghua Science and Technology Park, Haidian District, Beijing

Patentee after: BEIJING D-EAR TECHNOLOGIES Co.,Ltd.

Address before: 100084 Haidian District Tsinghua Yuan, Beijing, Tsinghua University, West 14-4-202

Patentee before: Zheng Fang

DD01 Delivery of document by public notice

Addressee: Mi Qingshan

Document name: payment instructions

DD01 Delivery of document by public notice

Addressee: BEIJING D-EAR TECHNOLOGIES Co., Ltd. (person in charge of patents)

Document name: Notice of Termination of Patent Rights
