CN103258536A - Large-scale speaker identification method - Google Patents

Large-scale speaker identification method

Info

Publication number
CN103258536A
Authority
CN
China
Prior art keywords
speaker
audio features
2D-Haar
integral image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100747439A
Other languages
Chinese (zh)
Other versions
CN103258536B (en)
Inventor
罗森林
谢尔曼
潘丽敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201310074743.9A priority Critical patent/CN103258536B/en
Publication of CN103258536A publication Critical patent/CN103258536A/en
Application granted granted Critical
Publication of CN103258536B publication Critical patent/CN103258536B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention relates to a text-independent speaker identification method based on 2D-Haar audio features and suitable for large-scale speaker sets. The invention provides the concept of the 2D-Haar audio feature and a method for computing it: basic audio features are first assembled into audio feature maps; 2D-Haar audio features are then extracted from these maps; the AdaBoost.MH algorithm is then used to screen the 2D-Haar audio features and train the speaker classifiers; finally, the trained speaker classifiers are used to identify speakers. Compared with the prior art, the method effectively suppresses the decay of identification accuracy in large-scale speaker identification scenarios and offers high identification accuracy and speed. The method can be applied not only on desktop computers but also on mobile computing platforms such as cell phones and tablets.

Description

A large-scale speaker identification method
Technical field
The present invention relates to a text-independent speaker identification method applicable to large-scale speaker sets, belonging to the technical field of biometric identification; from the standpoint of its technical realization it also belongs to the fields of computer science and speech processing.
Background technology
Speaker identification (Speaker Identification) is an important branch of speaker recognition (Speaker Recognition, SR). It uses the characteristics of each speaker's voice signal to extract speaker information from a segment of speech and then determine which of several people uttered that speech; it is thus a "choose one of many" pattern recognition problem. With the rapid development of modern electronic technology in recent years, the application demand for speaker identification has kept growing (for example in court forensics, suspect voice tracking and localization, and speech retrieval), and the technology has attracted increasing attention for its unique convenience, economy and accuracy.
According to the type of spoken content, speaker identification can be divided into two broad classes: text-dependent (Text-dependent) and text-independent (Text-independent). A text-dependent speaker recognition system requires the user to pronounce prescribed content, so that an accurate identification model can be built for each person one by one, and the prescribed content must also be spoken during identification. A text-independent recognition system does not prescribe the speaker's content; its models are relatively harder to build, but its range of application is wider. In some cases people cannot (or do not wish to) force a speaker to read a specific passage aloud, and in such application scenarios a text-independent speaker identification method becomes especially important.
The basic techniques of text-independent speaker identification can be divided into three classes: voice acquisition, feature extraction, and classification; the key issues are feature extraction and classification.
For feature extraction, current mainstream approaches adopt Mel-frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (Linear Predictive Coding Cepstrum, LPCC), both based on low-level acoustic principles, as the characteristic parameters.
For classification, mainstream methods fall into three classes: template matching methods (dynamic time warping (DTW), vector quantization (VQ)), probabilistic methods (hidden Markov models (HMM), Gaussian mixture models (GMM)), and discriminative classifier algorithms (artificial neural networks (ANN), support vector machines (SVM)). The Gaussian mixture model (GMM) and support vector machine (SVM) methods are the most widely used at present. Among them, the GMM-UBM model has seen wide application; the SVM method is closely related to GMM-UBM, and the feature supervectors used by current mainstream SVM systems are generally produced by a GMM.
Based on the above methods, text-independent speaker identification has found practical application in some settings. However, as the number of speakers to be identified keeps increasing, the accuracy of these methods drops noticeably, and once the population reaches a certain scale it becomes difficult to meet the demands of practical applications. This is the major problem that text-independent speaker identification technology needs to solve.
Summary of the invention
The objective of the invention is to innovate at both the feature extraction and the classification level, and to propose a large-scale speaker identification method that still achieves high accuracy when the number of speakers to be identified is large.
The design concept of the present invention is: propose a 2D-Haar audio feature extraction method that introduces a certain amount of temporal-sequence information and extends the audio feature space to hundreds of thousands of dimensions, providing a much larger feature space for the identification algorithm; at the same time, use the AdaBoost.MH algorithm to screen representative feature combinations from this feature space and build the identification classifier for each target speaker. The present invention further improves accuracy without increasing training and identification time, and is thus both fast and accurate.
The technical scheme of the present invention is realized as follows:
Step 1: obtain the voice signals of the speakers to be identified (i.e. the target speakers) and form the basic speech library S.
The concrete method is: connect a microphone to a computer, acquire each target speaker's voice signal, and store it on the computer as an audio file, one audio file per target speaker, forming the basic speech library S = {s_1, s_2, s_3, …, s_k}, where k is the total number of target speakers.
Step 2: compute audio feature integral maps for the speech in the basic speech library S, forming the basic feature library R. The detailed process is as follows:
Step 2.1: for the k-th target speaker, divide the audio file s_k into frames (the frame length f_s and frame shift Δf_s are set by the user), extract the basic audio features of each frame (e.g. MFCC, LPCC, sub-band energy, etc.), and combine the basic audio features of all frames into a basic feature file v_k containing c frames with a p-dimensional feature vector per frame.
The feature vector of each frame in v_k is: {[basic feature 1 (p_1 dims)], [basic feature 2 (p_2 dims)], …, [basic feature n (p_n dims)]}.
In the above, for an audio file s_k of duration t, the frame count c follows from t, the frame length f_s and the frame shift Δf_s [formula image], and
$$p = \sum_{1}^{n} p_n .$$
Step 2.2: for the k-th target speaker's basic feature file v_k, use a sliding window with window length a and step s to convert all c frame feature vectors into an audio feature map sequence file G_k (see Fig. 2).
G_k = {g_1, g_2, g_3, …, g_{u_k}}, where the number of feature maps u_k is determined by c, a and s [formula image].
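As an illustration only (it is not part of the patent text), the following Python sketch shows how a per-frame basic-feature matrix might be cut into fixed-size audio feature maps with the sliding window of step 2.2; the array shapes and parameter names are assumptions.

```python
import numpy as np

def feature_maps(frame_features: np.ndarray, a: int = 32, s: int = 16):
    """Cut a (c, p) per-frame feature matrix into (a, p) audio feature maps.

    frame_features: c frames x p-dimensional basic features (e.g. MFCC + LPCC).
    a: window length in frames, s: window step in frames.
    Returns the audio feature map sequence G_k as a list of (a, p) arrays.
    """
    c = frame_features.shape[0]
    maps = []
    for start in range(0, c - a + 1, s):      # slide the window over the frames
        maps.append(frame_features[start:start + a, :])
    return maps
```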
Step 2.3: on the basis of step 2.2, for the k-th target speaker, compute the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming this speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, …, r_u}; putting together the feature integral map sequence files of all k target speakers in the basic speech library S forms the basic feature library R = {R_1, R_2, …, R_k}.
It follows that the total number m of feature integral maps of all speakers in the basic feature library is the sum of the numbers of feature maps u_k over all k speakers [formula image].
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of the feature values of the corresponding point (x', y') of the original map and of all points above and to its left. The definition is:
$$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y') ,$$
where ii(x, y) denotes the value of point (x, y) on the integral map and i(x', y') denotes the feature value of point (x', y') on the original feature map.
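A minimal Python sketch of the integral-map computation defined above; the use of NumPy cumulative sums is an implementation choice, not something the patent prescribes.

```python
import numpy as np

def integral_map(feature_map: np.ndarray) -> np.ndarray:
    """ii(x, y) = sum of feature values at and to the upper-left of (x, y)."""
    # cumulative sums along both axes give the integral map in one pass
    return feature_map.cumsum(axis=0).cumsum(axis=1)

# quick self-check against the definition ii(x, y) = sum_{x'<=x, y'<=y} i(x', y')
g = np.random.rand(32, 32)
ii = integral_map(g)
assert np.isclose(ii[10, 7], g[:11, :8].sum())
```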
Step 3: on the basis of the basic feature library R, generate each target speaker's training feature file set B. The detailed process is as follows:
Step 3.1: label the feature files in the basic feature library R. The concrete method is:
Use consecutive integer numbers as speaker labels to represent the different target speakers, for ease of computer processing. The final labeled form is R' = {(R_1, 1), (R_2, 2), …, (R_k, k)}, where Y = {1, 2, …, k} is the target speaker label set and k is the number of target speakers;
Step 3.2: on the basis of step 3.1, build for each target speaker the label file set B used for speaker registration. The concrete method is:
Perform k rounds of sorting over the speaker-labeled feature library R'. In each round, first take the k-th target speaker's audio feature file R_k as the positive sample, keeping its speaker label k; then take the remaining speakers' audio feature files as negative samples and change their speaker labels to "other"; finally store the above k audio feature files in a separate folder named B_k, that is:
B_1 = {(R_1, 1), (R_2, other), …, (R_k, other)},
B_2 = {(R_1, other), (R_2, 2), …, (R_k, other)},
……
B_k = {(R_1, other), (R_2, other), …, (R_k, k)}
After the k rounds of sorting, the final label file set B = {B_1, B_2, …, B_k}, consisting of k label folders, is formed.
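Purely for illustration, the one-vs-rest relabeling of step 3.2 could be expressed as follows; the in-memory list representation is an assumption (the patent stores the results as label folders).

```python
def build_label_sets(R, labels):
    """R: list of per-speaker feature sequences; labels: [1..k].
    Returns B, where B[k-1] keeps label k for speaker k and 'other' for the rest."""
    B = []
    for pos_label in labels:
        B_k = [(R_j, lab if lab == pos_label else "other")
               for R_j, lab in zip(R, labels)]
        B.append(B_k)
    return B

# usage (hypothetical data): B = build_label_sets([R1, R2, R3], labels=[1, 2, 3])
```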
Step 4: on the basis of step 3, extract the 2D-Haar audio features and perform speaker registration, i.e. traverse the k folders in the label file set B in turn and use the training feature files in each to train a separate "one-vs-rest" classifier for each target speaker, finally obtaining a classifier pool composed of k speaker classifiers.
For the k-th target speaker, the training process of its corresponding classifier W_k is as follows:
Step 4.1: perform 2D-Haar audio feature extraction on every integral map in all the feature integral map sequence files R_k in the label folder B_k formed in step 3.2. The concrete method is:
From each integral map, compute the corresponding H-dimensional 2D-Haar audio feature values (where H is determined by the 2D-Haar audio feature types adopted and the size of the integral map), obtaining the data set S = {(x_1, l_1), …, (x_m, l_m)} used to train the speaker classifier. Here x_i denotes the full H-dimensional 2D-Haar audio feature vector corresponding to the i-th integral map, and l_i ∈ Y (Y = {1, 2, …, k}) denotes the speaker label corresponding to the i-th integral map.
Each dimension of the H-dimensional 2D-Haar audio feature is computed on the original audio feature map, over a rectangular region of arbitrary size and position, by subtracting the sum of feature values in one specific rectangular sub-region from the sum of feature values in another; these values can be computed quickly from the integral map.
The H-dimensional 2D-Haar audio feature vector of each integral map is recorded as one row, so that the full H-dimensional 2D-Haar audio feature vectors of all m integral maps in the label folder B_k form a feature matrix X with m rows and H columns.
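The rectangle-difference computation described above can be sketched as follows (illustrative only; the layout shown is just the simplest two-rectangle pattern, and the helper names are assumptions):

```python
def rect_sum(ii, top, left, bottom, right):
    """Sum of original feature values in the inclusive rectangle, 4 lookups on the integral map ii."""
    s = ii[bottom, right]
    if top > 0:
        s -= ii[top - 1, right]
    if left > 0:
        s -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        s += ii[top - 1, left - 1]
    return s

def haar_two_rect_horizontal(ii, top, left, height, width):
    """One 2D-Haar value: sum over the left half minus sum over the right half
    of a (height x 2*width) region, computed from the integral map alone."""
    left_half = rect_sum(ii, top, left, top + height - 1, left + width - 1)
    right_half = rect_sum(ii, top, left + width, top + height - 1, left + 2 * width - 1)
    return left_half - right_half

# usage: ii = feature_map.cumsum(axis=0).cumsum(axis=1); v = haar_two_rect_horizontal(ii, 0, 0, 4, 3)
```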
Step 4.2: use the AdaBoost.MH method to perform feature screening and classifier training on the 2D-Haar audio feature matrix X obtained in step 4.1, obtaining the speaker classifier. The basic principle of the AdaBoost.MH method is: through F rounds of iteration, select F principal feature dimensions from the set of H-dimensional 2D-Haar audio feature values while training F weak classifiers, and combine them into a strong classifier.
The weak classifiers used in the above iterations must meet the following conditions: (1) the input of a weak classifier is a one-dimensional feature value (i.e. one specific dimension of the feature vector, or one column of the feature matrix X); (2) for a speaker label l_i to be identified, the output of the weak classifier is 1 or -1.
The concrete training process of AdaBoost.MH is:
Step 4.2.1: initialize the weight of every integral map, denoted D_1(i, l_i) = 1/(mk), i = 1…m, l_i ∈ Y.
Step 4.2.2: take each column of the feature matrix X in turn (i.e. the same feature dimension across all integral maps) as the input of a weak classifier, carry out H rounds of computation, and compute the value r_{f,j} according to:
$$r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \dots H$$
where h_j(x_i, l_i) denotes the weak classifier that takes the j-th feature dimension extracted from the i-th integral map as input, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and
$$K_i[l_i] = \begin{cases} +1, & l_i \in \{1, \dots, k\} \\ -1, & l_i \notin \{1, \dots, k\} \end{cases}$$
From the above H weak classifiers, select the h_j(x, l_i) that maximizes r_{f,j}, i.e. r_f = max_j(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the selected feature dimension, and at the same time denote this weak classifier h_t(x, l) and add it to the strong classifier. Here f_j(x) denotes the j-th dimension of the H-dimensional 2D-Haar audio feature vector (i.e. the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature dimension as input;
Step 4.2.3: compute the weight α_f of the weak classifier h_j(x, l) selected in step 4.2.2:
$$\alpha_f = \frac{1}{2}\ln\!\left(\frac{1 + r_f}{1 - r_f}\right);$$
Step 4.2.4: compute the weight D_{f+1} of each integral map in the next iteration:
$$D_{f+1}(i, l_i) = \frac{D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \dots m,$$
where h_f(x_i, l_i) denotes the weak classifier of the f-th iteration that takes the selected j-th feature dimension of the i-th integral map as input, and Z_f is the normalization factor
$$Z_f = \sum_{i,l} D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right), \quad i = 1 \dots m.$$
Step 4.2.5: substitute the new weights obtained in step 4.2.4 back into step 4.2.2 and, following the method of steps 4.2.2 to 4.2.4, choose a new feature dimension and obtain a new weak classifier to add to the strong classifier;
Step 4.2.6: iterate the method of steps 4.2.2 to 4.2.5 F times to obtain the strong classifier composed of F weak classifiers, i.e. the identification classifier of the k-th speaker, expressed as:
$$W_k(x) = \arg\max_l S_l, \qquad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)$$
Step 4.2.7: after the k rounds of training are finished, gather all k speaker classifiers to form the speaker classifier pool W = {W_1(x), W_2(x), …, W_k(x)}.
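For illustration, a compact Python sketch of the boosting loop of steps 4.2.1 to 4.2.6 in the one-vs-rest setting, using decision stumps as the single-feature weak classifiers; the stump-threshold choice and the data layout are assumptions that the patent does not specify.

```python
import numpy as np

def train_strong_classifier(X, y, F=400):
    """X: (m, H) 2D-Haar feature matrix; y: (m,) array of +1 (target speaker) / -1 (other).
    Returns a list of (feature_index, threshold, polarity, alpha) weak classifiers."""
    m, H = X.shape
    D = np.full(m, 1.0 / m)                       # per-sample weights (step 4.2.1)
    strong = []
    for _ in range(F):
        best = None
        for j in range(H):                        # one weak classifier per feature column (step 4.2.2)
            thr = X[:, j].mean()                  # crude stump threshold -- an assumption
            for polarity in (1, -1):
                h = np.where(polarity * X[:, j] < polarity * thr, 1, -1)
                r = np.sum(D * y * h)
                if best is None or r > best[0]:
                    best = (r, j, thr, polarity, h)
        r_f, j, thr, polarity, h = best
        alpha = 0.5 * np.log((1 + r_f) / (1 - r_f))   # step 4.2.3
        D = D * np.exp(-alpha * y * h)                # step 4.2.4
        D /= D.sum()
        strong.append((j, thr, polarity, alpha))
    return strong

def strong_score(strong, x):
    """S = sum_t alpha_t * h_t(x) for one H-dimensional feature vector x (formula (1))."""
    return sum(a * (1 if p * x[j] < p * t else -1) for j, t, p, a in strong)
```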
Step 5: use the speaker classifier pool obtained from the training in step 4 to extract 2D-Haar audio features from an unknown speaker's voice file and perform speaker identification.
Step 5.1: perform audio feature integral map extraction on the voice file to be identified, obtaining the audio feature integral map sequence G' = {g'_1, g'_2, g'_3, …, g'_{u'}} to be identified. The concrete method is the same as that described in step 2. In the feature map sequence conversion (corresponding to step 2.2), the window length a and step s take the same values as in step 2; similarly, for a voice file to be identified containing c' frames, the number of feature maps u' in the feature map sequence is determined by c', a and s [formula image].
Step 5.2: on the basis of step 5.1, extract the 2D-Haar audio features of every feature map in the feature map sequence according to the 2D-Haar audio feature extraction method described in step 4.1, forming the 2D-Haar audio feature matrix X'.
Step 5.3: input the 2D-Haar audio feature matrix X' obtained in step 5.2 into each classifier of the speaker classifier pool W, obtaining the classification result sequence R.
The classification result sequence R consists of u' elements, each of which is computed as follows:
Step 5.3.1: according to formula (1) in step 4.2.6, read a weak classifier h_t(x, l) in the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 5.3.2: for each candidate label (i.e. k or "other"), compute the output h_t(f_j(x), l) of this weak classifier, and add this output, weighted by the classifier weight α_t, to the weighted score S_{l_i} of the candidate label l_i;
Step 5.3.3: after carrying out F rounds of the loop according to steps 5.3.1–5.3.2, each candidate label l_i obtains a weighted score S_{l_i}. Select the largest weighted score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong classifier weighted sum.
Step 5.3.4: combine all the classification results of the audio to be identified into the classification result sequence R = {(l_i, S_{l_i}, u'): (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), …, (l_i, S_{l_i}, u')}.
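An illustrative sketch of the per-map scoring of step 5.3, reusing the (assumed) weak-classifier representation from the training sketch above:

```python
def classify_sequence(classifier_pool, feature_matrix):
    """classifier_pool: {speaker_label: list of (j, thr, polarity, alpha)} strong classifiers.
    feature_matrix: (u', H) 2D-Haar features of the utterance, one row per feature map.
    Returns [(best_label, best_score), ...], one entry per feature map."""
    def strong_score(strong, x):
        return sum(a * (1 if p * x[j] < p * t else -1) for j, t, p, a in strong)

    results = []
    for x in feature_matrix:                         # score one feature map at a time
        scores = {lab: strong_score(clf, x) for lab, clf in classifier_pool.items()}
        best_label = max(scores, key=scores.get)     # label with the largest weighted sum
        results.append((best_label, scores[best_label]))
    return results
```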
Step 5.4: aggregate the classification result sequence obtained in step 5.3 to obtain the final speaker identification result.
The concrete method is: weight all the strong classifier scores S_{l_i} in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
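And a corresponding sketch of the score aggregation of step 5.4 (illustrative only):

```python
from collections import defaultdict

def aggregate(results):
    """results: [(label, score), ...] from all feature maps of one utterance.
    Returns the speaker label whose accumulated strong-classifier score is largest."""
    totals = defaultdict(float)
    for label, score in results:
        totals[label] += score        # sum scores per candidate speaker label
    return max(totals, key=totals.get)
```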
Beneficial effect
Compared with characteristic parameter extraction methods based on low-level acoustic principles such as Mel-frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (LPCC), the 2D-Haar audio feature extraction method proposed by the present invention introduces a certain amount of temporal-sequence information and extends the audio feature space to hundreds of thousands of dimensions, providing a much larger feature space for the identification algorithm.
Compared with speaker classification methods such as GMM and SVM, the present invention uses the AdaBoost.MH algorithm together with single-feature-input Decision Stump weak classifiers to perform feature screening, which both improves the representativeness and discriminability of the feature vectors and reduces the computational burden of the speaker identification stage, giving a higher running speed. Combining the 2D-Haar audio features with the AdaBoost.MH algorithm enables accurate identification of large-scale speaker sets and has high practical utility.
Description of drawings
Fig. 1 is the schematic block diagram of the present invention;
Fig. 2 is a schematic diagram of the extraction of the audio feature map and feature map sequence proposed by the present invention;
Fig. 3 is a schematic diagram of the speaker registration process of the present invention;
Fig. 4 is a schematic diagram of the speaker identification process of the present invention;
Fig. 5 shows the 5 classes of 2D-Haar audio features used in the speaker training and identification processes of the embodiment;
Fig. 6 shows, for the embodiment, the performance comparison between the present invention and the GMM-UBM algorithm when testing with the TIMIT speech corpus.
Embodiment
In order to better illustrate the objects and advantages of the present invention, the method of the invention is described in further detail below in conjunction with the drawings and examples.
All the tests below were completed on the same computer, configured as follows: Intel dual-core CPU (1.8 GHz), 1 GB RAM, Windows XP SP3 operating system.
Part 1
This part uses the voice files of the TIMIT audio corpus to describe in detail the concrete process of speaker registration/training and speaker identification of the present invention when the target speaker population is 600.
The TIMIT speech corpus is a standard corpus produced jointly by the Massachusetts Institute of Technology, SRI International and Texas Instruments; it contains material from 630 speakers (438 male and 192 female), with 10 utterances per person.
All speech data of 600 people were randomly selected from the speakers; from each person's 10 utterances, 1 file longer than 5 seconds was chosen as that speaker's registration/training voice file; in addition, 1 utterance of 1 randomly chosen person was used as the identification voice file.
The concrete implementation steps are as follows:
Step 1: obtain the voice signals of the speakers to be identified (i.e. the target speakers) and form the basic speech library S.
Since the TIMIT corpus already stores complete audio files, the voice files of the 600 target speakers are used directly to form the basic speech library S = {s_1, s_2, s_3, …, s_k}, where k = 600 is the total number of target speakers.
Step 2: compute audio feature integral maps for the speech in the basic speech library S, forming the basic feature library R. The detailed process is as follows:
Step 2.1: for the k-th target speaker, divide the audio file s_k into frames and extract the basic audio features of each frame (in this embodiment MFCC, LPCC and PLPC are used); combine the basic audio features of all frames into a basic feature file v_k containing c frames with a p-dimensional feature vector per frame.
In this embodiment, the feature vector of each frame in v_k is: {[MFCC (12 dims)], [LPCC (12 dims)], [PLPC (8 dims)]}; the frame length is set to f_s = 30 ms and the frame shift to Δf_s = 20 ms.
[formula image: the frame count c follows from the file duration, f_s and Δf_s], and
$$p = \sum_{1}^{n} p_n = 12 + 12 + 8 = 32 .$$
Step 2.2: for the k-th target speaker's basic feature file v_k, use a sliding window with window length a and step s to convert all c frame feature vectors into an audio feature map sequence file G_k (see Fig. 2). In this embodiment, a = 32 and s = 16.
G_k = {g_1, g_2, g_3, …, g_{u_k}}, where the number of feature maps u_k is determined by c, a and s [formula image].
Step 2.3: on the basis of step 2.2, for the k-th target speaker, compute the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming this speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, …, r_u}; putting together the feature integral map sequence files of all 600 target speakers in the basic speech library S forms the basic feature library R = {R_1, R_2, …, R_k}.
It follows that the total number m of feature integral maps of all speakers in the basic feature library is the sum of the numbers of feature maps over all speakers [formula image].
In this embodiment, the total duration of all 600 audio files is 3630.50 s, and the formula gives m = 22690 [formula image].
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of the feature values of the corresponding point (x', y') of the original map and of all points above and to its left. The definition is:
$$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y') ,$$
where ii(x, y) denotes the value of point (x, y) on the integral map and i(x', y') denotes the feature value of point (x', y') on the original feature map.
Step 3: on the basis of the basic feature library R, generate each target speaker's training feature file set B. The detailed process is as follows:
Step 3.1: label the feature files in the basic feature library R. The concrete method is:
Use consecutive integer numbers as speaker labels to represent the different target speakers, for ease of computer processing. The final labeled form is R' = {(R_1, 1), (R_2, 2), …, (R_600, 600)}, where Y = {1, 2, …, 600} is the target speaker label set;
Step 3.2: on the basis of step 3.1, build for each target speaker the label file set B used for speaker registration. The concrete method is:
Perform 600 rounds of sorting over the speaker-labeled feature library R'. In each round, first take the k-th target speaker's audio feature file R_k as the positive sample, keeping its speaker label k; then take the remaining speakers' audio feature files as negative samples and change their speaker labels to "other"; finally store the above 600 audio feature files in a separate folder named B_k, that is:
B_1 = {(R_1, 1), (R_2, other), …, (R_600, other)},
B_2 = {(R_1, other), (R_2, 2), …, (R_600, other)},
……
B_600 = {(R_1, other), (R_2, other), …, (R_600, 600)}
After the 600 rounds of sorting, the final label file set B = {B_1, B_2, …, B_600}, consisting of 600 label folders, is formed.
Step 4: on the basis of step 3, extract the 2D-Haar audio features and perform speaker registration, i.e. traverse the 600 folders in the label file set B in turn and use the training feature files in each to train a separate "one-vs-rest" classifier for each target speaker.
For the k-th target speaker, the training process of its corresponding classifier W_k is as follows:
Step 4.1: perform 2D-Haar audio feature extraction on every integral map in all the feature integral map sequence files R_k in the label folder B_k formed in step 3.2.
From each integral map, compute the corresponding H-dimensional 2D-Haar audio feature values, obtaining the data set S = {(x_1, l_1), …, (x_m, l_m)} used to train the speaker classifier. Here x_i denotes the full H-dimensional 2D-Haar audio feature vector corresponding to the i-th integral map, and l_i ∈ Y (Y = {1, 2, …, k}) denotes the speaker label corresponding to the i-th integral map.
Fig. 5 shows the computation patterns of the 5 classes of 2D-Haar audio features used in this embodiment. The value of each 2D-Haar audio feature dimension is computed on the original audio feature map, over a rectangular region of arbitrary size and position, by subtracting the sum of the feature values in the white region from the sum of the feature values in the black region according to one of the patterns in Fig. 5. This feature has the following three characteristics:
1. Fast computation. With the integral map, extracting a 2D-Haar audio feature of any size requires only a fixed number of data reads and additions/subtractions: a 2D-Haar audio feature containing 2 rectangles only needs 6 points read from the integral map, a 3-rectangle feature only needs 8 points, and a 4-rectangle feature only needs 9 points.
2. Strong discriminability. The dimensionality of the 2D-Haar audio feature space is very high: taking the 5 classes of patterns used in this embodiment as an example, on a single 32 × 32 integral map the 5 classes of patterns can produce more than 510,000 2D-Haar audio features in total; the concrete numbers are shown in Table 2.
Table 2: Numbers of the 5 classes of 2D-Haar audio features on a 32 × 32 integral map [table image]
This dimensionality far exceeds the raw information of the audio FFT energy spectrum, and also far exceeds the dimensionality of the feature space after an SVM nonlinear mapping. 3. Temporal information. Since an audio feature map is composed of a number of consecutive audio frames, the 2D-Haar audio features can also reflect a certain amount of temporal-sequence information.
In this embodiment, the concrete method of 2D-Haar audio feature extraction is: first, from the integral map and according to the method above, compute all 510112 dimensions of 2D-Haar audio feature values, obtaining the 2D-Haar audio feature value set; then record the 510112-dimensional 2D-Haar audio feature vector of each integral map as one row, so that the 2D-Haar audio feature vectors of all m integral maps in the label folder B_k form a feature matrix X with m rows and 510112 columns, with m as given in step 2.3; in this embodiment, m = 22690.
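The figure of 510112 can be reproduced by enumerating all positions and integer scales of the classic two-, three- and four-rectangle Haar patterns in a 32 × 32 map; the sketch below assumes these are the 5 classes of Fig. 5, which is not reproduced here.

```python
def rect_feature_count(n: int, w: int, h: int) -> int:
    """Number of placements of a Haar pattern with base cell size w x h
    (all integer scales and positions) inside an n x n feature map."""
    total = 0
    for a in range(1, n // w + 1):        # horizontal scale
        for b in range(1, n // h + 1):    # vertical scale
            total += (n - a * w + 1) * (n - b * h + 1)
    return total

patterns = {"2-rect horizontal": (2, 1), "2-rect vertical": (1, 2),
            "3-rect horizontal": (3, 1), "3-rect vertical": (1, 3),
            "4-rect checkerboard": (2, 2)}
counts = {name: rect_feature_count(32, w, h) for name, (w, h) in patterns.items()}
print(counts, sum(counts.values()))       # the total comes to 510112 for a 32 x 32 map
```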
Step 4.2: use the AdaBoost.MH method to perform feature screening and classifier training on the 2D-Haar audio feature matrix X obtained in step 4.1, obtaining the speaker classifier. The basic principle of the AdaBoost.MH method is: through F rounds of iteration, select F principal feature dimensions from the 510112-dimensional 2D-Haar audio feature value set while training F weak classifiers, and combine them into a strong classifier.
In this embodiment, the value of F is 400.
The weak classifier used in the above iterations is defined as:
$$h_j(x, y) = \begin{cases} 1, & p_{j,y}\, x_j < p_{j,y}\, \theta_{j,y} \\ -1, & p_{j,y}\, x_j \ge p_{j,y}\, \theta_{j,y} \end{cases} \qquad (2)$$
where x_j denotes the input of the weak classifier, θ_{j,y} denotes the threshold obtained after training, and p_{j,y} indicates the direction of the inequality.
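Formula (2) is an ordinary decision stump; a direct transcription in Python (illustrative only):

```python
def decision_stump(x_j: float, theta: float, polarity: int) -> int:
    """Weak classifier of formula (2): outputs 1 or -1 from a single feature value x_j,
    a learned threshold theta, and a polarity p in {+1, -1}."""
    return 1 if polarity * x_j < polarity * theta else -1
```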
The concrete training process of AdaBoost.MH is:
Step 4.2.1: initialize the weight of every integral map, denoted D_1(i, l_i) = 1/(mk), i = 1…m, l_i ∈ Y.
Step 4.2.2: take each column of the feature matrix X in turn (i.e. the same feature dimension across all integral maps, 510112 dimensions in total) as the input of a weak classifier, carry out 510112 rounds of computation, and compute the value r_{f,j} according to:
$$r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \dots 510112$$
where h_j(x_i, l_i) denotes the weak classifier that takes the j-th feature dimension extracted from the i-th integral map as input, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and
$$K_i[l_i] = \begin{cases} +1, & l_i \in \{1, \dots, k\} \\ -1, & l_i \notin \{1, \dots, k\} \end{cases}$$
From the above 510112 weak classifiers, select the h_j(x, l_i) that maximizes r_{f,j}, i.e. r_f = max_j(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the selected feature dimension, and at the same time denote this weak classifier h_t(x, l) and add it to the strong classifier. Here f_j(x) denotes the j-th dimension of the 510112-dimensional 2D-Haar audio feature vector (i.e. the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature dimension as input;
Step 4.2.3: compute the weight α_f of the weak classifier h_j(x, l) selected in step 4.2.2:
$$\alpha_f = \frac{1}{2}\ln\!\left(\frac{1 + r_f}{1 - r_f}\right);$$
Step 4.2.4: compute the weight D_{f+1} of each integral map in the next iteration:
$$D_{f+1}(i, l_i) = \frac{D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \dots m,$$
where h_f(x_i, l_i) denotes the weak classifier of the f-th iteration that takes the selected j-th feature dimension of the i-th integral map as input, and Z_f is the normalization factor
$$Z_f = \sum_{i,l} D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right), \quad i = 1 \dots m.$$
Step 4.2.5: substitute the new weights obtained in step 4.2.4 back into step 4.2.2 and, following the method of steps 4.2.2 to 4.2.4, choose a new feature dimension and obtain a new weak classifier to add to the strong classifier;
Step 4.2.6: iterate the method of steps 4.2.2 to 4.2.5 400 times to obtain the strong classifier composed of 400 weak classifiers, i.e. the identification classifier of the k-th speaker, expressed as:
$$W_k(x) = \arg\max_l S_l, \qquad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)$$
Step 4.2.7: after the 600 rounds of training are finished, gather all 600 speaker classifiers to form the speaker classifier pool W = {W_1(x), W_2(x), …, W_600(x)}.
Step 5: use the speaker classifier pool obtained from the training in step 4 to extract 2D-Haar audio features from an unknown speaker's voice file and perform speaker identification.
Step 5.1: perform audio feature integral map extraction on the voice file to be identified, obtaining the audio feature integral map sequence G' = {g'_1, g'_2, g'_3, …, g'_{u'}} to be identified. The concrete method is identical to that described in step 2. In the extraction of the basic feature file v_k (corresponding to step 2.1), the frame length is set to f_s = 30 ms and the frame shift to Δf_s = 20 ms; in the feature map sequence conversion (corresponding to step 2.2), the window length is a = 32 and the step is s = 16. In this embodiment, the total duration of the voice file to be identified is 6.30 s, so [formula image]
$$p = \sum_{1}^{n} p_n = 12 + 12 + 8 = 32 .$$
Similarly, the total frame count c' of the speech to be identified is determined by the length of the voice file to be identified, and the number of feature maps u' in the feature map sequence is determined by c', a and s [formula image].
Step 5.2: on the basis of step 5.1, extract the 2D-Haar audio features of every feature map in the feature map sequence according to the 2D-Haar audio feature extraction method described in step 4.1, forming the 2D-Haar audio feature matrix X' with 39 rows and 510112 columns.
Step 5.3: input the 2D-Haar audio feature matrix X' obtained in step 5.2 into each classifier of the speaker classifier pool W, obtaining the classification result sequence R.
The classification result sequence R consists of u' elements, each of which is computed as follows:
Step 5.3.1: according to formula (1) in step 4.2.6, read a weak classifier h_t(x, l) in the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 5.3.2: for each candidate label (i.e. k or "other"), compute the output h_t(f_j(x), l) of this weak classifier, and add this output, weighted by the classifier weight α_t, to the weighted score S_{l_i} of the candidate label l_i;
Step 5.3.3: after carrying out 400 rounds of the loop according to steps 5.3.1–5.3.2, each candidate label l_i obtains a weighted score S_{l_i}. Select the largest weighted score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong classifier weighted sum.
Step 5.3.4: combine all the classification results of the audio to be identified into the classification result sequence R = {(l_i, S_{l_i}, u'): (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), …, (l_i, S_{l_i}, u')}.
Step 5.4: aggregate the classification result sequence obtained in step 5.3 to obtain the final speaker identification result.
The concrete method is: weight all the strong classifier scores S_{l_i} in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
Part 2
This part tests the performance of the present invention. The test platform, the speaker registration/training procedure and the speaker identification procedure are the same as in Part 1 and are not repeated here; the emphasis is on the method and results of the performance test.
The experimental data were generated by the following steps: (1) all speech data of 100, 200, 300, 400, 500 and 600 people were randomly selected from the speakers; (2) from each person's utterances, 7 were chosen as training data and 3 as target test data; (3) for each target speaker, 50 other people's utterances were randomly selected as impostor test data.
For comparison, the GMM-UBM method was adopted as the baseline. Each target speaker underwent 3 target tests and 50 impostor tests; the false acceptance rate (False Acceptance Rate, FAR) and false rejection rate (False Rejection Rate, FRR) of the two methods were recorded, DET curves were drawn, and accuracy and identification time were tallied. Here FAR and FRR are defined in the usual way [formula image], and accuracy = 1 - equal error rate.
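A minimal sketch of how FAR, FRR and the accuracy figure could be computed from the test outcomes (illustrative; the threshold sweep needed to locate the equal error rate is omitted):

```python
def far_frr(impostor_accepted: int, impostor_total: int,
            target_rejected: int, target_total: int):
    """FAR = impostor trials wrongly accepted / all impostor trials;
    FRR = target trials wrongly rejected / all target trials."""
    far = impostor_accepted / impostor_total
    frr = target_rejected / target_total
    return far, frr

def accuracy_from_eer(eer: float) -> float:
    """Accuracy as used in the experiments: 1 - equal error rate,
    where the EER is the operating point at which FAR == FRR."""
    return 1.0 - eer
```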
As the speaker population was increased from 100 to 600, the performance of the two methods is shown in Fig. 6 and Table 3. It can be seen that as the speaker population keeps growing, the identification accuracy of the baseline method drops noticeably, while the downward trend of the proposed method is much gentler: at the 600-person scale its accuracy exceeds that of the baseline by 4.3%, and its average identification accuracy over the 6 speaker scales reaches 91.3%.
Table 3: Accuracy (%) of the two methods at different speaker scales [table image]
To evaluate the time efficiency of the proposed algorithm, the average identification time t per second of speech was measured for different 2D-Haar feature dimensionalities F. As shown in Table 4, the proposed method achieves a high identification speed.
Table 4: Average identification time of the proposed method for different values of F [table image]
The above experiments show that the 2D-Haar audio feature effectively expands the dimensionality of the feature space while introducing temporal-sequence information, making it possible to train better-performing classifiers; at the same time, using the AdaBoost.MH algorithm with single-feature-input Decision Stump weak classifiers for feature screening both improves the representativeness and discriminability of the feature vectors and reduces the computational burden of the identification stage, giving a higher identification speed. Combining the 2D-Haar audio features with the AdaBoost.MH algorithm enables accurate identification of large-scale speaker sets.

Claims (9)

1. A large-scale speaker identification method, characterized in that the method comprises the following steps:
Step 1: obtain the voice signals of the speakers to be identified (i.e. the target speakers) and form the basic speech library S.
Step 2: compute audio feature integral maps for the speech in the basic speech library S, forming the basic feature library R.
Step 3: on the basis of the basic feature library R, generate each target speaker's training feature file set B.
Step 4: on the basis of step 3, extract the 2D-Haar audio features and perform speaker registration, i.e. traverse the k folders in the label file set B in turn and use the training feature files in each to train a separate "one-vs-rest" classifier for each target speaker, finally obtaining a classifier pool composed of k speaker classifiers.
Step 5: use the speaker classifier pool obtained from the training in step 4 to extract 2D-Haar audio features from an unknown speaker's voice file and perform speaker identification.
2. The method according to claim 1, characterized in that obtaining the voice signal of the speaker to be identified does not require the speaker to pronounce according to text content preset in a feature template.
3. The method according to claim 1, characterized in that the step of computing the audio feature integral maps specifically comprises:
Step 1: for the k-th target speaker, divide the audio file s_k into frames (the frame length f_s and frame shift Δf_s are set by the user), extract the basic audio features of each frame (e.g. MFCC, LPCC, sub-band energy, etc.; which features are used is specified by the user), and combine the basic audio features of all frames into a basic feature file v_k containing c frames with a p-dimensional feature vector per frame.
The feature vector of each frame in v_k is: {[basic feature 1 (p_1 dims)], [basic feature 2 (p_2 dims)], …, [basic feature n (p_n dims)]}.
Step 2: for the k-th target speaker's basic feature file v_k, use a sliding window with window length a and step s to convert all c frame feature vectors into an audio feature map sequence file G_k,
G_k = {g_1, g_2, g_3, …, g_u}.
Step 3: on the basis of step 2, for the k-th target speaker, compute the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming this speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, …, r_u}; putting together the feature integral map sequence files of all k target speakers in the basic speech library S forms the basic feature library R = {R_1, R_2, …, R_k}.
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of the feature values of the corresponding point (x', y') of the original map and of all points above and to its left. The definition is:
$$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y') ,$$
where ii(x, y) denotes the value of point (x, y) on the integral map and i(x', y') denotes the feature value of point (x', y') on the original feature map.
4. The method according to claim 1, characterized in that the 2D-Haar audio features are computed as follows:
Each dimension of the 2D-Haar audio feature is computed on the original audio feature map, over a rectangular region of arbitrary size and position, by subtracting the sum of feature values in one specific rectangular sub-region from the sum of feature values in another; these values can be computed quickly from the integral map. The total dimensionality H is determined by the 2D-Haar audio feature types adopted and the size of the integral map.
The H-dimensional 2D-Haar audio feature vector of each integral map is recorded as one row, so that the full H-dimensional 2D-Haar audio feature vectors of all m integral maps in the label folder B_k form a feature matrix X with m rows and H columns.
5. The method according to claim 1, characterized in that the classifier pool composed of k speaker classifiers is obtained by k rounds of training; each round of training selects, through F iterations, F principal feature dimensions from the set of H-dimensional 2D-Haar audio feature values while training F weak classifiers, and combines them into a strong classifier. The concrete method is:
Step 1: initialize the weight of every integral map, denoted D_1(i, l_i) = 1/(mk), i = 1…m, l_i ∈ Y.
Step 2: take each column of the feature matrix X in turn (i.e. the same feature dimension across all integral maps) as the input of a weak classifier, carry out H rounds of computation, and compute the value r_{f,j} according to:
$$r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \dots H$$
where h_j(x_i, l_i) denotes the weak classifier that takes the j-th feature dimension extracted from the i-th integral map as input, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and
$$K_i[l_i] = \begin{cases} +1, & l_i \in \{1, \dots, k\} \\ -1, & l_i \notin \{1, \dots, k\} \end{cases}$$
From the above H weak classifiers, select the h_j(x, l_i) that maximizes r_{f,j}, i.e. r_f = max_j(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the selected feature dimension, and at the same time denote this weak classifier h_t(x, l) and add it to the strong classifier. Here f_j(x) denotes the j-th dimension of the H-dimensional 2D-Haar audio feature vector (i.e. the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature dimension as input;
Step 3: compute the weight α_f of the weak classifier h_j(x, l) selected in step 2:
$$\alpha_f = \frac{1}{2}\ln\!\left(\frac{1 + r_f}{1 - r_f}\right);$$
Step 4: compute the weight D_{f+1} of each integral map in the next iteration:
$$D_{f+1}(i, l_i) = \frac{D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \dots m,$$
where h_f(x_i, l_i) denotes the weak classifier of the f-th iteration that takes the selected j-th feature dimension of the i-th integral map as input, and Z_f is the normalization factor
$$Z_f = \sum_{i,l} D_f(i, l_i)\exp\!\left(-\alpha_f K_i[l_i]\, h_f(x_i, l_i)\right), \quad i = 1 \dots m.$$
Step 5: substitute the new weights obtained in step 4 back into step 2 and, following the method of steps 2 to 4, choose a new feature dimension and obtain a new weak classifier to add to the strong classifier;
Step 6: iterate the method of steps 2 to 5 F times to obtain the strong classifier composed of F weak classifiers, i.e. the identification classifier of the k-th speaker, expressed as:
$$W_k(x) = \arg\max_l S_l, \qquad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)$$
6. The method according to claim 5, characterized in that the weak classifiers used in the iterations must meet the following conditions: (1) the input of a weak classifier is a one-dimensional feature value (i.e. one specific dimension of the feature vector, or one column of the feature matrix X); (2) for a speaker label l_i to be identified, the output of the weak classifier is 1 or -1.
7. The method according to claim 1, characterized in that the steps of speaker identification are:
Step 1: perform audio feature integral map extraction on the voice file to be identified, obtaining the audio feature integral map sequence G' = {g'_1, g'_2, g'_3, …, g'_{u'}} to be identified [formula image];
the concrete method and parameter values are the same as described in claim 3.
Step 2: on the basis of step 1, extract the 2D-Haar audio features of every feature map in the feature map sequence, forming the 2D-Haar audio feature matrix X'; the concrete method is the same as described in claim 4.
Step 3: input the 2D-Haar audio feature matrix X' obtained in step 2 into each classifier of the speaker classifier pool W, obtaining the classification result sequence R.
Step 4: aggregate the classification result sequence obtained in step 3 to obtain the final speaker identification result.
8. The method according to claim 7, characterized in that the classification result sequence R consists of u' elements, each of which is computed as follows:
Step 1: according to formula (1) in claim 5, read a weak classifier h_t(x, l) in the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 2: for each candidate label (i.e. k or "other"), compute the output h_t(f_j(x), l) of this weak classifier, and add this output, weighted by the classifier weight α_t, to the weighted score S_{l_i} of the candidate label l_i;
Step 3: after carrying out F rounds of the loop according to steps 1–2, each candidate label l_i obtains a weighted score S_{l_i}; select the largest weighted score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong classifier weighted sum.
Step 4: combine all the classification results of the audio to be identified into the classification result sequence R = {(l_i, S_{l_i}, u'): (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), …, (l_i, S_{l_i}, u')}.
9. The method according to claim 7, characterized in that the result aggregation in claim 7 is computed as follows:
weight all the strong classifier scores S_{l_i} in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
CN201310074743.9A 2013-03-08 2013-03-08 Large-scale speaker identification method Expired - Fee Related CN103258536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310074743.9A CN103258536B (en) 2013-03-08 2013-03-08 Large-scale speaker identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310074743.9A CN103258536B (en) 2013-03-08 2013-03-08 A kind of extensive speaker's identification method

Publications (2)

Publication Number Publication Date
CN103258536A true CN103258536A (en) 2013-08-21
CN103258536B CN103258536B (en) 2015-10-21

Family

ID=48962410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310074743.9A Expired - Fee Related CN103258536B (en) 2013-03-08 2013-03-08 Large-scale speaker identification method

Country Status (1)

Country Link
CN (1) CN103258536B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448682A (en) * 2016-09-13 2017-02-22 Tcl集团股份有限公司 Open-set speaker recognition method and apparatus
CN107393527A (en) * 2017-07-17 2017-11-24 广东讯飞启明科技发展有限公司 The determination methods of speaker's number
CN108309303A (en) * 2017-12-26 2018-07-24 上海交通大学医学院附属第九人民医院 A kind of wearable freezing of gait intellectual monitoring and walk-aid equipment
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN110134819A (en) * 2019-04-25 2019-08-16 广州智伴人工智能科技有限公司 A kind of speech audio screening system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009333A1 (en) * 1996-11-22 2003-01-09 T-Netix, Inc. Voice print system and method
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009333A1 (en) * 1996-11-22 2003-01-09 T-Netix, Inc. Voice print system and method
US6539352B1 (en) * 1996-11-22 2003-03-25 Manish Sharma Subword-based speaker verification with multiple-classifier score fusion weight and threshold adaptation
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
范小春 邱政权: "基于HAAR小波的分级说话人辨识", 《计算机工程与应用》, 31 December 2010 (2010-12-31), pages 122 - 124 *
范小春 邱政权: "说话人识别中的HOCOR和改进的MCE", 《科学技术与工程》, 31 December 2008 (2008-12-31) *
谢尔曼 罗森林 潘丽敏: "基于Haar特征的Turbo-Boost表情识别算法", 《计算机辅助设计与图形学学报》, 31 December 2011 (2011-12-31), pages 1442 - 1446 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448682A (en) * 2016-09-13 2017-02-22 Tcl集团股份有限公司 Open-set speaker recognition method and apparatus
CN107393527A (en) * 2017-07-17 2017-11-24 广东讯飞启明科技发展有限公司 The determination methods of speaker's number
CN108309303A (en) * 2017-12-26 2018-07-24 上海交通大学医学院附属第九人民医院 A kind of wearable freezing of gait intellectual monitoring and walk-aid equipment
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN108962231B (en) * 2018-07-04 2021-05-28 武汉斗鱼网络科技有限公司 Voice classification method, device, server and storage medium
CN110134819A (en) * 2019-04-25 2019-08-16 广州智伴人工智能科技有限公司 A kind of speech audio screening system
CN110134819B (en) * 2019-04-25 2021-04-23 广州智伴人工智能科技有限公司 Voice audio screening system

Also Published As

Publication number Publication date
CN103258536B (en) 2015-10-21

Similar Documents

Publication Publication Date Title
CN103198833B (en) A kind of high precision method for identifying speaker
An et al. Deep CNNs with self-attention for speaker identification
CN105261367B (en) A kind of method for distinguishing speek person
CN103854645B (en) A kind of based on speaker&#39;s punishment independent of speaker&#39;s speech-emotion recognition method
CN103177733B (en) Standard Chinese suffixation of a nonsyllabic &#34;r&#34; sound voice quality evaluating method and system
CN112562741B (en) Singing voice detection method based on dot product self-attention convolution neural network
CN103544963A (en) Voice emotion recognition method based on core semi-supervised discrimination and analysis
CN103258536B (en) A kind of extensive speaker&#39;s identification method
CN101136199A (en) Voice data processing method and equipment
CN105810191B (en) Merge the Chinese dialects identification method of prosodic information
CN110992988B (en) Speech emotion recognition method and device based on domain confrontation
CN104464738B (en) A kind of method for recognizing sound-groove towards Intelligent mobile equipment
Fan et al. Deep Hashing for Speaker Identification and Retrieval.
CN103136546A (en) Multi-dimension authentication method and authentication device of on-line signature
CN114220179A (en) On-line handwritten signature handwriting retrieval method and system based on faiss
CN106531170B (en) Spoken assessment identity identifying method based on speaker Recognition Technology
Michalevsky et al. Speaker identification using diffusion maps
Gu et al. A text-independent speaker verification system using support vector machines classifier.
Wu et al. Research on voiceprint recognition based on weighted clustering recognition SVM algorithm
Sharma et al. Speech emotion recognition using kernel sparse representation based classifier
Jin et al. Text-independent writer identification based on fusion of dynamic and static features
Kotropoulos et al. Ensemble discriminant sparse projections applied to music genre classification
Houcine et al. Novel approach in speaker identification using SVM and GMM
Lei et al. Mahalanobis Metric Scoring Learned from Weighted Pairwise Constraints in I-Vector Speaker Recognition System.
Raghavan et al. Speaker verification using support vector machines

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151021

Termination date: 20160308

CF01 Termination of patent right due to non-payment of annual fee