CN103258536B - A large-scale speaker identification method - Google Patents

A large-scale speaker identification method

Info

Publication number
CN103258536B
Authority
CN
China
Prior art keywords
speaker
audio feature
haar
integral map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310074743.9A
Other languages
Chinese (zh)
Other versions
CN103258536A (en)
Inventor
罗森林
谢尔曼
潘丽敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201310074743.9A priority Critical patent/CN103258536B/en
Publication of CN103258536A publication Critical patent/CN103258536A/en
Application granted granted Critical
Publication of CN103258536B publication Critical patent/CN103258536B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The present invention relates to a text-independent speaker identification method based on 2D-Haar audio features that is suited to large-scale speaker sets. The invention proposes the concept of the 2D-Haar audio feature and a method for computing it: base audio features are first assembled into audio feature maps; 2D-Haar audio features are then extracted from these maps, and the AdaBoost.MH algorithm is used both to screen the 2D-Haar features and to train the speaker classifiers; finally, the trained classifiers are used to perform speaker identification. Compared with the prior art, the present invention effectively suppresses the drop in identification accuracy that occurs in large-scale speaker identification and achieves higher identification accuracy and speed. It is suitable not only for desktop computers but also for mobile computing platforms such as mobile phones and tablets.

Description

A large-scale speaker identification method
Technical field
The present invention relates to a text-independent speaker identification method suited to large-scale speaker sets. It belongs to the technical field of biometric identification and, from an implementation standpoint, also to the fields of computer science and speech processing.
Background technology
Speaker identification (Speaker Identification) is an important branch of speaker recognition (Speaker Recognition, SR). It uses the characteristics of each speaker's voice signal to extract speaker information from a segment of speech and then decides which of several known people uttered that segment; it is therefore a "choose one of many" pattern recognition problem. With the rapid development of modern electronic technology in recent years, the demand for speaker identification (e.g., in forensic voice examination, tracking and locating suspects by voice, and speech retrieval) has grown increasingly strong, and the technology has attracted growing attention thanks to its uniqueness, convenience, economy and accuracy.
Depending on the type of spoken content, speaker identification can be divided into two broad classes: text-dependent (Text-dependent) and text-independent (Text-independent). A text-dependent system requires the user to pronounce specified content, so an accurate model can be built for each person, but the user must also speak the specified content at identification time. A text-independent system places no restriction on what the speaker says; its models are harder to build, but its range of application is wider. In some situations, people cannot (or do not wish to) force a speaker to read a specific passage aloud, and in such application scenarios text-independent speaker identification becomes especially important.
The basic techniques of text-independent speaker identification fall into three classes: speech acquisition, feature extraction, and classification, of which feature extraction and classification are the key problems.
For feature extraction, most mainstream approaches use Mel-frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (Linear Predictive Coding Cepstrum, LPCC), both grounded in low-level acoustics, as the characteristic parameters.
Classification methods fall into three broad classes: template matching (dynamic time warping (DTW), vector quantization (VQ)), probabilistic models (hidden Markov models (HMM), Gaussian mixture models (GMM)), and discriminative classifiers (artificial neural networks (ANN), support vector machines (SVM)). Gaussian mixture models (GMM) and support vector machines (SVM) are currently the most widely used. Among these, the GMM-UBM model has found broad application; the SVM approach is closely related to GMM-UBM, since the feature supervectors used by mainstream SVM systems are generally produced by a GMM.
Based on the above methods, text-independent speaker identification has reached practical use in some settings. However, as the number of speakers to be identified keeps growing, the accuracy of these methods drops noticeably, and once the population reaches a certain scale it becomes difficult to meet practical requirements. This is a major problem that text-independent speaker identification still needs to solve.
Summary of the invention
The object of the present invention is to innovate at both the feature extraction and the classification level and to propose a large-scale speaker identification method that still achieves high accuracy when the number of speakers to be identified is large.
The design concept of the present invention is as follows: a 2D-Haar audio feature extraction method is proposed that introduces a certain amount of temporal-relationship information and expands the audio feature space to hundreds of thousands of dimensions, providing the identification algorithm with a much larger feature space; at the same time, the AdaBoost.MH algorithm is used to screen representative feature combinations from this space and to build the identification classifier for each target speaker. While further improving accuracy, the present invention does not increase training or identification overhead, and is therefore both fast and accurate.
Technical scheme of the present invention realizes as follows:
Step 1, obtain the voice signals of the speakers to be identified (i.e., the target speakers) and form the base speech library S.
The concrete method is: connect a microphone to a computer, acquire the voice signal of each target speaker, and store it on the computer as an audio file, one audio file per target speaker, forming the base speech library S = {s_1, s_2, s_3, ..., s_k}, where k is the total number of target speakers.
Step 2, compute audio feature integral maps for the voices in the base speech library S, forming the base feature library R. The detailed process is as follows:
Step 2.1, for the k-th target speaker, split its audio file s_k into frames (the frame length f_s and frame shift Δf_s are set by the user), extract the base audio features of each frame (e.g., MFCC, LPCC, sub-band energy), and combine the base features of all frames into a base feature file v_k containing c frames with a p-dimensional feature vector per frame.
The feature vector of each frame in v_k is: {[base feature 1 (p_1 dims)], [base feature 2 (p_2 dims)], ..., [base feature n (p_n dims)]}.
In the above, for an audio file s_k of duration t, the number of frames c is determined by t, the frame length f_s and the frame shift Δf_s, and the per-frame dimension is
p = \sum_{i=1}^{n} p_i .
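The framing and base-feature step can be illustrated with a short sketch. This is not part of the patent: it assumes the librosa library and shows only the MFCC portion of v_k; the LPCC and sub-band-energy columns would be appended in the same way.

```python
# Sketch of step 2.1: per-frame base feature extraction (assumption: librosa for MFCC only).
import librosa

def base_feature_file(wav_path, sr=16000, frame_len=0.030, frame_shift=0.020, n_mfcc=12):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(frame_len * sr)        # frame length f_s
    hop = int(frame_shift * sr)        # frame shift delta f_s
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop)
    # v_k: one row per frame, p columns (here only the 12 MFCC dimensions)
    return mfcc.T                      # shape: (c frames, p dims)
```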
Step 2.2, for the base feature file v_k of the k-th target speaker, use a sliding window with window length a and step s to convert all c frames of audio feature vectors into an audio feature map sequence file G_k (see Fig. 2),
G_k = {g_1, g_2, g_3, ..., g_{u_k}}, where u_k, the number of feature maps, is determined by c, a and s.
Step 2.3, on the basis of step 2.2, compute for the k-th target speaker the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming that speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, ..., r_u}; the feature integral map sequence files of all k target speakers in the base speech library S are gathered to form the base feature library R = {R_1, R_2, ..., R_k}.
It is easy to see that the total number m of feature integral maps of all speakers in the base feature library is
m = \sum_{k} u_k .
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of all feature values at the corresponding point (x', y') of the original map and to its upper left, i.e.:
ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y'),
where ii(x, y) is the value of point (x, y) on the integral map and i(x', y') is the feature value of point (x', y') on the original feature map.
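A minimal numpy sketch of the integral-map computation defined above (the row/column orientation of the feature map is an assumption):

```python
import numpy as np

def integral_map(feature_map):
    """ii(x, y): cumulative sum of all feature values at and to the upper left of (x, y)."""
    return np.cumsum(np.cumsum(np.asarray(feature_map, dtype=np.float64), axis=0), axis=1)

# Example: a feature-map sequence G_k -> integral-map sequence R_k
# R_k = [integral_map(g) for g in G_k]
```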
Step 3, on the basis of the base feature library R, generate the training feature file set B for each target speaker. The detailed process is as follows:
Step 3.1, label the feature files in the base feature library R. The concrete method is:
Use consecutive integer numbers as speaker labels to represent the different target speakers, which is convenient for computer processing. The labeled form is R' = {(R_1, 1), (R_2, 2), ..., (R_k, k)}, where Y = {1, 2, ..., k} is the target speaker label set and k is the number of target speakers;
Step 3.2, on the basis of step 3.1, build for each target speaker the labeled file set B used for speaker enrollment. The concrete method is:
Perform k sorting passes over the speaker-labeled feature library R'. In the k-th pass, first take the audio feature file R_k of the k-th target speaker as the positive sample and keep its speaker label k; then treat the audio feature files of the remaining speakers as negative samples and change their speaker labels to "other"; finally store these k audio feature files in a separate folder named B_k, i.e.:
B_1 = {(R_1, 1), (R_2, other), ..., (R_k, other)},
B_2 = {(R_1, other), (R_2, 2), ..., (R_k, other)},
......
B_k = {(R_1, other), (R_2, other), ..., (R_k, k)}
After the k sorting passes, the labeled file set B = {B_1, B_2, ..., B_k}, consisting of k labeled folders, is obtained.
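The k relabeling passes of step 3.2 can be sketched as follows (an illustrative helper, not part of the patent):

```python
def build_label_sets(R_labeled):
    """R_labeled: list of (R_k, label_k). Returns B = [B_1, ..., B_k], where in B_j only
    speaker j keeps its label and every other file is relabeled 'other' (one-vs-rest)."""
    B = []
    for j in range(len(R_labeled)):
        B_j = [(R_i, lab_i if i == j else "other")
               for i, (R_i, lab_i) in enumerate(R_labeled)]
        B.append(B_j)
    return B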
Step 4, on the basis of step 3, extract 2D-Haar audio features and perform speaker enrollment: traverse the k folders in the labeled file set B in turn and use the training feature files in each to train an independent "one-vs-rest" classifier for each target speaker, finally obtaining a classifier pool composed of k speaker classifiers.
For the k-th target speaker, the corresponding classifier W_k is trained as follows:
Step 4.1, perform 2D-Haar audio feature extraction on every integral map of all feature integral map sequence files R_k in the labeled folder B_k formed in step 3.2. The concrete method is:
From each integral map compute the corresponding H-dimensional 2D-Haar audio feature values (H is determined by the 2D-Haar feature types adopted and the size of the integral map), obtaining the data set S = {(x_1, l_1), ..., (x_m, l_m)} used to train the speaker classifier, where x_i denotes the full H-dimensional 2D-Haar audio feature vector of the i-th integral map and l_i ∈ Y (Y = {1, 2, ..., k}) denotes the speaker label of the i-th integral map.
Each of the H 2D-Haar audio feature values is obtained, within a rectangular region of arbitrary size and position on the original audio feature map, by subtracting the sum of feature values in one specific rectangular sub-region from the sum in another; the integral map allows this to be computed quickly.
Record the H-dimensional 2D-Haar feature vector of each integral map as one row, so that the vectors of all m integral maps in the labeled folder B_k form a feature matrix X with m rows and H columns.
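The constant-time rectangle sums underlying the 2D-Haar values can be sketched as follows; the two-rectangle pattern shown is illustrative (one of the several pattern types referred to above), and ii is an integral map as computed in step 2.3:

```python
def rect_sum(ii, x0, y0, x1, y1):
    """Sum of feature values in the rectangle [x0, x1) x [y0, y1) read from the integral map ii."""
    total = ii[x1 - 1, y1 - 1]
    if x0 > 0:
        total -= ii[x0 - 1, y1 - 1]
    if y0 > 0:
        total -= ii[x1 - 1, y0 - 1]
    if x0 > 0 and y0 > 0:
        total += ii[x0 - 1, y0 - 1]
    return total

def haar_two_rect(ii, x, y, w, h):
    """One illustrative 2D-Haar value: difference of two adjacent w x h rectangles at (x, y)."""
    return rect_sum(ii, x, y, x + w, y + h) - rect_sum(ii, x + w, y, x + 2 * w, y + h)
```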
Step 4.2, use the AdaBoost.MH method to perform feature screening and classifier training on the 2D-Haar audio feature matrix X obtained in step 4.1, yielding a speaker classifier. The basic principle of AdaBoost.MH is: over F rounds of iteration, select F principal feature dimensions from the H-dimensional 2D-Haar audio feature value set while training F weak classifiers, which are combined into one strong classifier.
The weak classifiers used in this iterative computation must satisfy the following conditions: (1) the input of a weak classifier is a one-dimensional feature value (a specific dimension of the feature vector, i.e., a specific column of the feature matrix X); (2) for the speaker label l_i to be identified, the output of the weak classifier is 1 or -1.
The concrete AdaBoost.MH training procedure is as follows:
Step 4.2.1, initialize the weight of each integral map as D_1(i, l_i) = 1/(mk), i = 1 ... m, l_i ∈ Y.
Step 4.2.2, take each column of the feature matrix X in turn (i.e., each of the H groups of same-dimension features of all integral maps) as the input of one weak classifier, run H rounds of computation, and compute r_{f,j} according to
r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \ldots H
where h_j(x_i, l_i) denotes the weak classifier whose input is the j-th feature value extracted from the i-th integral map, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and K_i[l_i] = +1 if l_i ∈ {1, ..., k} and −1 otherwise.
From the above H weak classifiers select the h_j(x, l_i) that maximizes r_f = max(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the chosen feature dimension and, at the same time, add this weak classifier, denoted h_t(x, l), to the strong classifier. Here f_j(x) denotes the j-th dimension of the H-dimensional 2D-Haar feature vector (i.e., the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature value as input;
Step 4.2.3, compute the weight α_f of the weak classifier h_j(x, l) selected in step 4.2.2:
\alpha_f = \frac{1}{2} \ln\!\left(\frac{1 + r_f}{1 - r_f}\right);
Step 4.2.4, compute the weight D_{f+1} of each integral map for the next iteration:
D_{f+1}(i, l_i) = \frac{D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \ldots m,
where h_f(x_i, l_i) denotes the weak classifier selected in iteration f, taking the j-th feature value of the i-th integral map as input, and Z_f is the normalization factor
Z_f = \sum_{i,l} D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right), \quad i = 1 \ldots m.
Step 4.2.5, substitute the new weights obtained in step 4.2.4 back into step 4.2.2 and, following steps 4.2.2 to 4.2.4, choose a new feature dimension, obtaining a new weak classifier that is added to the strong classifier;
Step 4.2.6, iterate the procedure of steps 4.2.2 to 4.2.5 F times to obtain a strong classifier composed of F weak classifiers, i.e., the identification classifier of the k-th speaker, expressed as
W_k(x) = \arg\max_{l} S_l, \quad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)
Step 4.2.7, after the k rounds of training are finished, gather all k speaker classifiers into the speaker classifier pool W = {W_1(x), W_2(x), ..., W_k(x)}.
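For illustration only, the following sketch trains one "one-vs-rest" strong classifier with decision stumps over the columns of X. It uses the standard error-based AdaBoost weight update rather than the r_f correlation form given above, searches thresholds on a coarse quantile grid, and would be far too slow for H in the hundreds of thousands; it is meant to show the structure of steps 4.2.1-4.2.6, not the patented implementation.

```python
import numpy as np

def stump_predict(x_col, theta, polarity):
    """Decision stump: +1 where polarity * x < polarity * theta, else -1."""
    return np.where(polarity * x_col < polarity * theta, 1, -1)

def train_strong_classifier(X, y, F=400, n_thresholds=16):
    """X: (m, H) 2D-Haar feature matrix; y: (m,) array of +1 (target speaker) / -1 ('other').
    Returns a list of (feature index j, threshold, polarity, weight alpha)."""
    m, H = X.shape
    D = np.full(m, 1.0 / m)                          # step 4.2.1: initial map weights
    strong = []
    for _ in range(F):
        best = None                                  # (error, j, theta, polarity)
        for j in range(H):                           # step 4.2.2: try every feature column
            thresholds = np.quantile(X[:, j], np.linspace(0.05, 0.95, n_thresholds))
            for theta in thresholds:
                for polarity in (1, -1):
                    err = float(np.sum(D * (stump_predict(X[:, j], theta, polarity) != y)))
                    if best is None or err < best[0]:
                        best = (err, j, theta, polarity)
        err, j, theta, polarity = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)      # step 4.2.3: weak-classifier weight
        pred = stump_predict(X[:, j], theta, polarity)
        D = D * np.exp(-alpha * y * pred)            # step 4.2.4: re-weight the integral maps
        D = D / D.sum()                              # Z_f normalization
        strong.append((j, theta, polarity, alpha))   # step 4.2.5: grow the strong classifier
    return strong

def strong_score(strong, x):
    """S(x) = sum_t alpha_t * h_t(x): weighted vote of the selected weak classifiers."""
    return sum(alpha * (1 if polarity * x[j] < polarity * theta else -1)
               for j, theta, polarity, alpha in strong)
```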
Step 5, use the speaker classifier pool obtained in step 4: extract 2D-Haar audio features from the voice file of the unknown speaker and perform speaker identification.
Step 5.1, extract audio feature integral maps from the voice file to be identified, obtaining the integral map sequence G' = {g'_1, g'_2, g'_3, ..., g'_{u'}}. The concrete method is the same as in step 2; in particular, in the feature map sequence conversion (corresponding to step 2.2) the window length a and step s take the same values as in step 2. Similarly, for a voice file to be identified that contains c' frames, the number of feature maps u' in the sequence is determined by c', a and s.
Step 5.2, on the basis of step 5.1 and following the 2D-Haar audio feature extraction method of step 4.1, extract 2D-Haar audio features for every feature map in the sequence, forming the 2D-Haar audio feature matrix X'.
Step 5.3, feed the 2D-Haar audio feature matrix X' obtained in step 5.2 into every classifier in the speaker classifier pool W, obtaining the classification result sequence R.
The classification result sequence R consists of u' elements, each of which is computed as follows:
Step 5.3.1, according to formula (1) in step 4.2.6, read a weak classifier h_t(x, l) of the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 5.3.2, for each candidate label (i.e., k or "other"), compute the output h_t(f_j(x), l) of this weak classifier and add this output, multiplied by the weight α_t in the classifier, to the weighted score S_{l_i} of the candidate label l_i;
Step 5.3.3, after F rounds of the loop of steps 5.3.1-5.3.2, each candidate label l_i has a weighted score S_{l_i}. Select the largest score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong-classifier weighted sum.
Step 5.3.4, combine the classification results of all feature maps of the audio to be identified into the classification result sequence
R = {(l_i, S_{l_i}, u') : (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), ..., (l_i, S_{l_i}, u')}.
Step 5.4, synthesize the classification result sequence obtained in step 5.3 to obtain the final speaker identification result.
The concrete method is: accumulate the strong-classifier scores in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
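Steps 5.3-5.4 can be sketched as follows, reusing strong_score from the training sketch after step 4.2.7; the per-map attribution and the final weight summation follow the description above, but the function names are illustrative:

```python
def identify(feature_maps_X, classifier_pool):
    """classifier_pool: dict mapping speaker label -> trained strong classifier (see above).
    feature_maps_X: rows of X', one 2D-Haar feature vector per audio feature map."""
    totals = {}
    for x in feature_maps_X:
        # Step 5.3: score this map against every speaker's one-vs-rest classifier
        scores = {label: strong_score(strong, x) for label, strong in classifier_pool.items()}
        best_label = max(scores, key=scores.get)          # classification result for this map
        totals[best_label] = totals.get(best_label, 0.0) + scores[best_label]
    # Step 5.4: the label with the largest accumulated weight is the final result
    return max(totals, key=totals.get)
```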
Beneficial effects
Compared with characteristic-parameter extraction methods based on low-level acoustics such as Mel-frequency cepstral coefficients (MFCC) or linear predictive cepstral coefficients (LPCC), the 2D-Haar audio feature extraction method proposed by the present invention introduces a certain amount of temporal-relationship information and expands the audio feature space to hundreds of thousands of dimensions, providing the identification algorithm with a much larger feature space.
Compared with speaker classification methods such as GMM and SVM, the present invention uses the AdaBoost.MH algorithm together with decision-stump weak classifiers that each take a single feature as input to perform feature screening. This both improves the representativeness and discriminability of the feature vector and reduces the computational burden of the identification stage, so the method runs faster. Combining 2D-Haar audio features with the AdaBoost.MH algorithm enables accurate large-scale speaker identification and therefore has high practical value.
Brief description of the drawings
Fig. 1 is the overall block diagram of the present invention;
Fig. 2 is a schematic diagram of the audio feature map and feature map sequence extraction proposed by the present invention;
Fig. 3 is a schematic diagram of the speaker enrollment process of the present invention;
Fig. 4 is a schematic diagram of the speaker identification process of the present invention;
Fig. 5 shows the 5 classes of 2D-Haar audio features used for speaker training and identification in the embodiment;
Fig. 6 shows the performance comparison between the present invention and the GMM-UBM algorithm on the TIMIT speech corpus in the embodiment.
Embodiment
In order to better explain the objects and advantages of the present invention, the method of the invention is described in further detail below with reference to the drawings and embodiments.
All of the following tests were carried out on the same computer, configured as follows: Intel dual-core CPU (1.8 GHz), 1 GB RAM, Windows XP SP3 operating system.
Part 1
This part uses voice files from the TIMIT speech corpus to describe in detail the speaker enrollment/training and speaker identification procedures of the present invention for a target speaker population of 600 people.
The TIMIT corpus is a standard corpus produced jointly by MIT, SRI International and Texas Instruments; it contains utterances from 630 speakers (438 male and 192 female), 10 utterances per person.
All speech data of 600 people are selected at random from the speakers; from each person's 10 utterances, one file longer than 5 seconds is chosen as that speaker's enrollment/training voice file; in addition, one utterance of one person is selected at random as the voice file to be identified.
The concrete implementation steps are as follows:
Step 1, obtain the voice signals of the speakers to be identified (i.e., the target speakers) and form the base speech library S.
Because the TIMIT corpus already stores complete audio files, the voice files of the 600 target speakers directly form the base speech library S = {s_1, s_2, s_3, ..., s_k}, where k = 600 is the total number of target speakers.
Step 2, compute audio feature integral maps for the voices in the base speech library S, forming the base feature library R. The detailed process is as follows:
Step 2.1, for the k-th target speaker, split its audio file s_k into frames and extract the base audio features of each frame (in this embodiment MFCC, LPCC and PLPC are used), then combine the base features of all frames into a base feature file v_k containing c frames with a p-dimensional feature vector per frame.
In this embodiment, the feature vector of each frame in v_k is {[MFCC (12 dims)], [LPCC (12 dims)], [PLPC (8 dims)]}; the frame length is set to f_s = 30 ms and the frame shift to Δf_s = 20 ms, so
p = \sum_{i=1}^{n} p_i = 12 + 12 + 8 = 32 .
Step 2.2, for the base feature file v_k of the k-th target speaker, use a sliding window with window length a and step s to convert all c frames of audio feature vectors into an audio feature map sequence file G_k (see Fig. 2). In this embodiment a = 32 and s = 16.
G_k = {g_1, g_2, g_3, ..., g_{u_k}}, where u_k is determined by c, a and s.
Step 2.3, on the basis of step 2.2, compute for the k-th target speaker the feature integral map r_u of every feature map g_u in the feature map sequence file G_k, forming that speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, ..., r_u}; the feature integral map sequence files of all 600 target speakers in the base speech library S are gathered to form the base feature library R = {R_1, R_2, ..., R_k}.
It is easy to see that the total number m of feature integral maps of all speakers in the base feature library is m = \sum_k u_k.
In this embodiment, the total duration of all 600 audio files is 3630.50 s, which yields m = 22690 feature integral maps in total.
The feature integral map has the same size as the original feature map, and the value at any point (x, y) on it is defined as the sum of all feature values at the corresponding point (x', y') of the original map and to its upper left, i.e.:
ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y'),
where ii(x, y) is the value of point (x, y) on the integral map and i(x', y') is the feature value of point (x', y') on the original feature map.
Step 3, on the basis of the base feature library R, generate the training feature file set B for each target speaker. The detailed process is as follows:
Step 3.1, label the feature files in the base feature library R. The concrete method is:
Use consecutive integer numbers as speaker labels to represent the different target speakers, which is convenient for computer processing. The labeled form is R' = {(R_1, 1), (R_2, 2), ..., (R_600, 600)}, where Y = {1, 2, ..., 600} is the target speaker label set;
Step 3.2, on the basis of step 3.1, build for each target speaker the labeled file set B used for speaker enrollment. The concrete method is:
Perform 600 sorting passes over the speaker-labeled feature library R'. In the k-th pass, first take the audio feature file R_k of the k-th target speaker as the positive sample and keep its speaker label k; then treat the audio feature files of the remaining speakers as negative samples and change their speaker labels to "other"; finally store these 600 audio feature files in a separate folder named B_k, i.e.:
B_1 = {(R_1, 1), (R_2, other), ..., (R_600, other)},
B_2 = {(R_1, other), (R_2, 2), ..., (R_600, other)},
......
B_600 = {(R_1, other), (R_2, other), ..., (R_600, 600)}
After the 600 sorting passes, the labeled file set B = {B_1, B_2, ..., B_600}, consisting of 600 labeled folders, is obtained.
Step 4, on the basis of step 3, extract 2D-Haar audio features and perform speaker enrollment: traverse the 600 folders in the labeled file set B in turn and use the training feature files in each to train an independent "one-vs-rest" classifier for each target speaker.
For the k-th target speaker, the corresponding classifier W_k is trained as follows:
Step 4.1, perform 2D-Haar audio feature extraction on every integral map of all feature integral map sequence files R_k in the labeled folder B_k formed in step 3.2.
From each integral map compute the corresponding H-dimensional 2D-Haar audio feature values, obtaining the data set S = {(x_1, l_1), ..., (x_m, l_m)} used to train the speaker classifier, where x_i denotes the full H-dimensional 2D-Haar audio feature vector of the i-th integral map and l_i ∈ Y (Y = {1, 2, ..., k}) denotes the speaker label of the i-th integral map.
Fig. 5 illustrates the computation patterns of the 5 classes of 2D-Haar audio features used in this embodiment. Each 2D-Haar feature value is obtained, within a rectangular region of arbitrary size and position on the original audio feature map, by subtracting the sum of feature values in the white region from the sum in the black region according to one of the patterns in Fig. 5. These features have the following three properties:
(1) Fast computation. With the integral map, extracting a 2D-Haar audio feature of any size requires only a fixed number of reads and additions/subtractions: a 2-rectangle feature needs only 6 points read from the integral map, a 3-rectangle feature 8 points, and a 4-rectangle feature 9 points.
(2) Strong discriminability. The 2D-Haar audio feature space is of very high dimension: with the 5 pattern classes used in this embodiment, a 32 × 32 integral map yields more than 510,000 2D-Haar audio features in total; the exact counts are given in Table 2.
Table 2. Numbers of the 5 classes of 2D-Haar audio features for a 32 × 32 integral map
This dimension far exceeds that of the raw audio FFT energy spectrum and also far exceeds the dimension of the feature space after an SVM nonlinear mapping. In addition, because each audio feature map is composed of a number of consecutive audio frames, 2D-Haar audio features also capture a certain amount of temporal information.
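Fig. 5 is not reproduced here, so the exact pattern set is an assumption; if the five classes are taken to be the classic Haar patterns (two 2-rectangle, two 3-rectangle and one 4-rectangle types), a short enumeration reproduces the total quoted in the text:

```python
def haar_feature_count(n, patterns=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count all positions and scales of the assumed 5 Haar patterns in an n x n feature map."""
    total = 0
    for w, h in patterns:                      # base size of one pattern (in cells)
        for width in range(w, n + 1, w):       # scaled pattern width
            for height in range(h, n + 1, h):  # scaled pattern height
                total += (n - width + 1) * (n - height + 1)
    return total

print(haar_feature_count(32))                  # -> 510112 for a 32 x 32 feature map
```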
In this embodiment, the concrete 2D-Haar audio feature extraction method is: first, from each integral map and following the method above, compute all 510,112 2D-Haar audio feature values, obtaining the 2D-Haar feature value set; then record the 510,112-dimensional 2D-Haar feature vector of each integral map as one row, so that the vectors of all m integral maps in the labeled folder B_k form a feature matrix X with m rows and 510,112 columns. As noted in step 2.3, in this embodiment m = 22690.
Step 4.2, use the AdaBoost.MH method to perform feature screening and classifier training on the 2D-Haar audio feature matrix X obtained in step 4.1, yielding a speaker classifier. The basic principle of AdaBoost.MH is: over F rounds of iteration, select F principal feature dimensions from the 510,112-dimensional 2D-Haar feature value set while training F weak classifiers, which are combined into one strong classifier.
In this embodiment, F = 400.
The weak classifier used in this iterative computation is defined as
h_j(x, y) = \begin{cases} 1, & p_{j,y}\, x_j < p_{j,y}\, \theta_{j,y} \\ -1, & p_{j,y}\, x_j \ge p_{j,y}\, \theta_{j,y} \end{cases} \qquad (2)
where x_j denotes the input of the weak classifier (the j-th feature value), θ_{j,y} is the threshold obtained by training, and p_{j,y} indicates the direction of the inequality.
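A direct, illustrative transcription of eq. (2) as a decision stump (scalar inputs are assumed):

```python
def weak_classifier(x_j, theta, polarity):
    """Eq. (2): output +1 when polarity * x_j < polarity * theta, else -1.
    theta and polarity are the threshold and inequality direction learned during training."""
    return 1 if polarity * x_j < polarity * theta else -1
```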
The concrete AdaBoost.MH training procedure is as follows:
Step 4.2.1, initialize the weight of each integral map as D_1(i, l_i) = 1/(mk), i = 1 ... m, l_i ∈ Y.
Step 4.2.2, take each column of the feature matrix X in turn (i.e., each of the 510,112 groups of same-dimension features of all integral maps) as the input of one weak classifier, run 510,112 rounds of computation, and compute r_{f,j} according to
r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \ldots 510112
where h_j(x_i, l_i) denotes the weak classifier whose input is the j-th feature value extracted from the i-th integral map, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and K_i[l_i] = +1 if l_i ∈ {1, ..., k} and −1 otherwise.
From the above 510,112 weak classifiers select the h_j(x, l_i) that maximizes r_f = max(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the chosen feature dimension and, at the same time, add this weak classifier, denoted h_t(x, l), to the strong classifier. Here f_j(x) denotes the j-th dimension of the 510,112-dimensional 2D-Haar feature vector (i.e., the j-th column of the feature matrix X), and h_j(x, l) denotes the weak classifier that takes the j-th feature value as input;
Step 4.2.3, compute the weight α_f of the weak classifier h_j(x, l) selected in step 4.2.2:
\alpha_f = \frac{1}{2} \ln\!\left(\frac{1 + r_f}{1 - r_f}\right);
Step 4.2.4, compute the weight D_{f+1} of each integral map for the next iteration:
D_{f+1}(i, l_i) = \frac{D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \ldots m,
where h_f(x_i, l_i) denotes the weak classifier selected in iteration f, taking the j-th feature value of the i-th integral map as input, and Z_f is the normalization factor
Z_f = \sum_{i,l} D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right), \quad i = 1 \ldots m.
Step 4.2.5, substitute the new weights obtained in step 4.2.4 back into step 4.2.2 and, following steps 4.2.2 to 4.2.4, choose a new feature dimension, obtaining a new weak classifier that is added to the strong classifier;
Step 4.2.6, iterate the procedure of steps 4.2.2 to 4.2.5 400 times to obtain a strong classifier composed of 400 weak classifiers, i.e., the identification classifier of the k-th speaker, expressed as
W_k(x) = \arg\max_{l} S_l, \quad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1)
Step 4.2.7, after all 600 rounds of training are finished, gather all 600 speaker classifiers into the speaker classifier pool W = {W_1(x), W_2(x), ..., W_600(x)}.
Step 5, use the speaker classifier pool obtained in step 4: extract 2D-Haar audio features from the voice file of the unknown speaker and perform speaker identification.
Step 5.1, extract audio feature integral maps from the voice file to be identified, obtaining the integral map sequence G' = {g'_1, g'_2, g'_3, ..., g'_{u'}}. The concrete method is the same as in step 2. In the extraction of the base feature file (corresponding to step 2.1), the frame length is f_s = 30 ms and the frame shift is Δf_s = 20 ms; in the feature map sequence conversion (corresponding to step 2.2), the window length is a = 32 and the step is s = 16. In this embodiment the total duration of the file to be identified is 6.30 s, and the per-frame dimension is again
p = \sum_{i=1}^{n} p_i = 12 + 12 + 8 = 32 .
Similarly, the total number of frames c' is determined by the length of the voice file to be identified, and the number of feature maps u' in the sequence is determined by c', a and s.
Step 5.2, on the basis of step 5.1 and following the 2D-Haar audio feature extraction method of step 4.1, extract 2D-Haar audio features for every feature map in the sequence, forming the 2D-Haar audio feature matrix X' with 39 rows (u' = 39) and 510,112 columns.
Step 5.3, feed the 2D-Haar audio feature matrix X' obtained in step 5.2 into every classifier in the speaker classifier pool W, obtaining the classification result sequence R.
The classification result sequence R consists of u' elements, each of which is computed as follows:
Step 5.3.1, according to formula (1) in step 4.2.6, read a weak classifier h_t(x, l) of the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 5.3.2, for each candidate label (i.e., k or "other"), compute the output h_t(f_j(x), l) of this weak classifier and add this output, multiplied by the weight α_t in the classifier, to the weighted score S_{l_i} of the candidate label l_i;
Step 5.3.3, after 400 rounds of the loop of steps 5.3.1-5.3.2, each candidate label l_i has a weighted score S_{l_i}. Select the largest score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong-classifier weighted sum.
Step 5.3.4, combine the classification results of all feature maps of the audio to be identified into the classification result sequence
R = {(l_i, S_{l_i}, u') : (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), ..., (l_i, S_{l_i}, u')}.
Step 5.4, synthesize the classification result sequence obtained in step 5.3 to obtain the final speaker identification result.
The concrete method is: accumulate the strong-classifier scores in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
Part 2
This part tests the performance of the present invention. The test platform and the speaker enrollment/training and identification procedures are the same as in Part 1 and are not repeated here; the focus is on the method and results of the performance test.
The experimental data were generated as follows: (1) all speech data of 100, 200, 300, 400, 500 and 600 people were selected at random from the speakers; (2) from each person's utterances, 7 were chosen as training data and 3 as target test data; (3) for each target speaker, 50 utterances of other people were selected at random as impostor test data.
For comparison, the GMM-UBM method was used as the baseline. For each target speaker, 3 target tests and 50 impostor tests were carried out; the false acceptance rate (False Acceptance Rate, FAR) and false rejection rate (False Rejection Rate, FRR) of both methods were recorded, DET curves were drawn, and accuracy and identification time were measured, where
accuracy = 1 − equal error rate (EER).
When the speaker population grows from 100 to 600, the performance of the two methods is as shown in Fig. 6 and Table 3. It can be seen that as the population grows the identification accuracy of the baseline drops noticeably, whereas the decline of the proposed method is much slower: at a scale of 600 speakers its accuracy exceeds that of the baseline by 4.3%, and its average identification accuracy over the 6 population sizes reaches 91.3%.
Table 3. Accuracy (%) of the two methods at different speaker population sizes
To evaluate the time efficiency of the proposed algorithm, the average identification time t_p per second of speech was measured for different numbers F of 2D-Haar feature dimensions. As Table 4 shows, the proposed method identifies speakers quickly.
Table 4. Average identification time of the proposed method for different values of F
These experiments show that, while introducing temporal information, the 2D-Haar audio feature effectively expands the dimension of the feature space, making it possible to train better-performing classifiers; at the same time, using the AdaBoost.MH algorithm with single-feature decision-stump weak classifiers for feature screening both improves the representativeness and discriminability of the feature vector and reduces the computational burden of the identification stage, so identification is fast. Combining 2D-Haar audio features with the AdaBoost.MH algorithm enables accurate large-scale speaker identification.

Claims (7)

1. A large-scale speaker identification method, characterized in that the method comprises the following steps:
Step 1, obtain the voice signals of the speakers to be identified and form the base speech library S;
Step 2, compute audio feature integral maps for the voices in the base speech library S, forming the base feature library R, the computation of the audio feature integral maps specifically comprising:
Step 2.1, for the k-th speaker to be identified, split its audio file s_k into frames, the frame length f_s and frame shift Δf_s being set by the user, extract the base audio features of each frame, and combine the base features of all frames into a base feature file v_k containing c frames with a p-dimensional feature vector per frame,
the feature vector of each frame in v_k being: {[base feature 1 (p_1 dims)], [base feature 2 (p_2 dims)], ..., [base feature n (p_n dims)]},
Step 2.2, for the base feature file v_k of the k-th speaker to be identified, use a sliding window with window length a and step s to convert all c frames of audio feature vectors into an audio feature map sequence file G_k,
G_k = {g_1, g_2, g_3, ..., g_u}, where u, the number of feature maps, is determined by c, a and s,
Step 2.3, on the basis of step 2.2, compute for the k-th speaker to be identified the feature integral map r_u of every audio feature map g_u in the audio feature map sequence file G_k, forming that speaker's feature integral map sequence file R_k = {r_1, r_2, r_3, ..., r_u}; gather the feature integral map sequence files of all k speakers to be identified in the base speech library S into the base feature library R = {R_1, R_2, ..., R_k},
the feature integral map having the same size as the original audio feature map, the value at any point (x, y) on it being defined as the sum of all feature values at the corresponding point (x', y') of the original audio feature map and to its upper left, i.e.:
ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y'),
where ii(x, y) denotes the value of point (x, y) on the integral map and i(x', y') denotes the feature value of point (x', y') on the original audio feature map;
Step 3, on the basis of the base feature library R, generate the training feature file set B for each speaker to be identified;
Step 4, on the basis of step 3, extract 2D-Haar audio features and perform speaker enrollment: traverse the k folders in the labeled file set B in turn and use the training feature files in each to train an independent "one-vs-rest" classifier for each speaker to be identified, finally obtaining a classifier pool composed of k speaker classifiers, the computation of the extracted 2D-Haar audio features being:
each 2D-Haar audio feature value is obtained, within a rectangular region of arbitrary size and position on the original audio feature map, by subtracting the sum of feature values in one specific rectangular sub-region from the sum in another, computed quickly by means of the integral map, its total dimension H being determined by the 2D-Haar feature types adopted and the size of the integral map;
the H-dimensional 2D-Haar feature vector of each integral map is recorded as one row, so that the vectors of all m audio feature integral maps in the labeled folder B_k form a feature matrix X with m rows and H columns;
Step 5, use the speaker classifier pool obtained in step 4: extract 2D-Haar audio features from the voice file of the unknown speaker and finally perform speaker identification.
2. The method according to claim 1, characterized in that the acquisition of the voice signals of the speakers to be identified does not require the speakers to pronounce according to any text content preset in a feature template.
3. The method according to claim 1, characterized in that the classifier pool composed of k speaker classifiers is obtained through k rounds of training; each round consists of F iterations that select F principal feature dimensions from the H-dimensional 2D-Haar audio feature value set while training F weak classifiers, which are combined into one strong classifier; the concrete method is:
Step 1, initialize the weight of each integral map as D_1(i, l_i) = 1/(mk), i = 1 ... m, l_i ∈ Y, where l_i denotes the speaker label of the i-th integral map, Y = {1, 2, ..., k} is the target speaker label set, k is the number of target speakers, and m is the number of audio feature integral maps;
Step 2, take each column of the feature matrix X, i.e., each of the H groups of same-dimension features of all integral maps, in turn as the input of one weak classifier, run H rounds of computation, and compute r_{f,j} according to
r_{f,j} = \sum_{(i,l)} D_f(i, l_i)\, K_i[l_i]\, h_j(x_i, l_i), \quad j = 1 \ldots H
where h_j(x_i, l_i) denotes the weak classifier whose input is the j-th feature value extracted from the i-th integral map, D_f(i, l_i) denotes the weight of the i-th training integral map in the f-th iteration, and K_i[l_i] = +1 if l_i ∈ {1, ..., k} and −1 otherwise,
from the above H weak classifiers select the h_j(x, l_i) that maximizes r_f = max(r_{f,j}); take the feature f_j(x) corresponding to this classifier as the chosen feature dimension and, at the same time, add this weak classifier, denoted h_t(x, l), to the strong classifier, where f_j(x) denotes the j-th dimension of the H-dimensional 2D-Haar feature vector and h_j(x, l) denotes the weak classifier that takes the j-th feature value as input;
Step 3, compute the weight α_f of the weak classifier h_j(x, l) selected in step 2:
\alpha_f = \frac{1}{2} \ln\!\left(\frac{1 + r_f}{1 - r_f}\right);
Step 4, compute the weight D_{f+1} of each integral map for the next iteration:
D_{f+1}(i, l_i) = \frac{D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right)}{Z_f}, \quad i = 1 \ldots m,
where h_f(x_i, l_i) denotes the weak classifier selected in iteration f, taking the j-th feature value of the i-th integral map as input, and Z_f is the normalization factor
Z_f = \sum_{i,l} D_f(i, l_i) \exp\!\left(-\alpha_f K_i[l_i] h_f(x_i, l_i)\right), \quad i = 1 \ldots m;
Step 5, substitute the new weights obtained in step 4 back into step 2 and, following steps 2 to 4, choose a new feature dimension, obtaining a new weak classifier that is added to the strong classifier;
Step 6, iterate the procedure of steps 2 to 5 F times to obtain a strong classifier composed of F weak classifiers, i.e., the identification classifier of the k-th speaker, expressed as
W_k(x) = \arg\max_{l} S_l, \quad S_l = \sum_{t=1}^{F} \alpha_t h_t(x, l) \qquad (1).
4. The method according to claim 3, characterized in that the weak classifiers used in the iterative computation must satisfy the following conditions: (1) the input of a weak classifier is a one-dimensional feature value, namely a specific dimension of the feature vector, i.e., a specific column of the feature matrix X; (2) for the speaker label l_i to be identified, the output of the weak classifier is 1 or -1.
5. The method according to claim 1, characterized in that the concrete steps of the speaker identification are:
Step 1, extract audio feature integral maps from the voice file to be identified, obtaining the integral map sequence G' = {g'_1, g'_2, g'_3, ..., g'_{u'}}, where u' denotes the number of audio feature integral maps in the sequence; for a voice file to be identified that contains c' frames, the number u' of feature integral maps in the sequence is determined by c', the window length a set when generating the audio feature maps, and the step s by which the sliding window is moved in the same process;
Step 2, on the basis of step 1, extract 2D-Haar audio features for every audio feature map in the audio feature map sequence, forming the 2D-Haar audio feature matrix X';
Step 3, feed the 2D-Haar audio feature matrix X' obtained in step 2 into every classifier of the speaker classifier pool W simultaneously, obtaining the classification result sequence R;
Step 4, synthesize the classification result sequence obtained in step 3 to obtain the final speaker identification result.
6. The method according to claim 5, characterized in that the classification result sequence R consists of u' elements, each of which is computed as follows:
Step 1, according to formula (1) in step 6 of claim 3, read a weak classifier h_t(x, l) of the speaker classifier and its corresponding 2D-Haar audio feature f_j(x);
Step 2, for each candidate label k or "other", compute the output h_t(f_j(x), l) of the weak classifier and add this output, multiplied by the weight α_t in the classifier, to the weighted score S_{l_i} of the candidate label l_i;
Step 3, after F rounds of the loop of steps 1-2, each candidate label l_i has a weighted score S_{l_i}; select the largest score S_{l_i} and record the corresponding candidate label l_i as the classification result of this audio feature map, denoted (l_i, S_{l_i}), where l_i is the speaker label and S_{l_i} is the corresponding strong-classifier weighted sum;
Step 4, combine the classification results of all feature maps of the audio to be identified into the classification result sequence
R = {(l_i, S_{l_i}, u') : (l_1, S_{l_1}, 1), (l_1, S_{l_1}, 2), (l_2, S_{l_2}, 3), ..., (l_i, S_{l_i}, u')}.
7. The method according to claim 5, characterized in that the computation of the "result synthesis" step is:
accumulate the strong-classifier scores in the result sequence by speaker label l_i, and output the speaker label with the largest weighted sum as the final identification result for this segment of speech.
CN201310074743.9A 2013-03-08 2013-03-08 A large-scale speaker identification method Expired - Fee Related CN103258536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310074743.9A CN103258536B (en) 2013-03-08 2013-03-08 A large-scale speaker identification method


Publications (2)

Publication Number Publication Date
CN103258536A CN103258536A (en) 2013-08-21
CN103258536B true CN103258536B (en) 2015-10-21

Family

ID=48962410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310074743.9A Expired - Fee Related CN103258536B (en) 2013-03-08 2013-03-08 A large-scale speaker identification method

Country Status (1)

Country Link
CN (1) CN103258536B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448682A (en) * 2016-09-13 2017-02-22 Tcl集团股份有限公司 Open-set speaker recognition method and apparatus
CN107393527A (en) * 2017-07-17 2017-11-24 广东讯飞启明科技发展有限公司 The determination methods of speaker's number
CN108309303B (en) * 2017-12-26 2021-01-08 上海交通大学医学院附属第九人民医院 Wearable intelligent monitoring of gait that freezes and helps capable equipment
CN108962231B (en) * 2018-07-04 2021-05-28 武汉斗鱼网络科技有限公司 Voice classification method, device, server and storage medium
CN110134819B (en) * 2019-04-25 2021-04-23 广州智伴人工智能科技有限公司 Voice audio screening system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539352B1 (en) * 1996-11-22 2003-03-25 Manish Sharma Subword-based speaker verification with multiple-classifier score fusion weight and threshold adaptation
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Hierarchical speaker identification based on Haar wavelets; Fan Xiaochun, Qiu Zhengquan; Computer Engineering and Applications; 2010-12-31; pp. 122-124 *
Turbo-Boost facial expression recognition algorithm based on Haar features; Xie Erman, Luo Senlin, Pan Limin; Journal of Computer-Aided Design & Computer Graphics; 2011-12-31; pp. 1442-1446 *
HOCOR and improved MCE in speaker recognition; Fan Xiaochun, Qiu Zhengquan; Science Technology and Engineering; 2008-12-31; full text *

Also Published As

Publication number Publication date
CN103258536A (en) 2013-08-21


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151021

Termination date: 20160308