CN103985384A - Text-independent speaker identification device based on random projection histogram model - Google Patents
- Publication number: CN103985384A (application CN201410232526.2A; granted as CN103985384B)
- Authority
- CN
- China
- Legal status
- Granted
Abstract
The embodiment of the invention discloses a text-independent speaker identification device based on a random projection histogram model. The method comprises three steps: feature extraction, model training, and identification. In the feature extraction step, the non-normalized, monotonically increasing line spectral frequency features are converted into normalized differential line spectral frequency features, and the differential line spectral frequency features of consecutive frames are combined into composite differential line spectral frequency features that express the dynamics of the signal. In the model training step, random projection parameters are designed according to the distribution of the composite differential line spectral frequency features, random projections are applied to the training data set, and a probability model is built by computing an average histogram. In the identification step, features are extracted from the speech of the person to be identified as in the first step, the extracted features are input into the models obtained in the second step, the likelihood value under each probability model is computed, and the speaker number corresponding to the maximum likelihood value is returned. The method increases the text-independent speaker identification rate and has great practical value.
Description
Technical field
The invention belongs to the field of audio processing and describes a text-independent speaker identification device based on a random projection histogram model.
Background art
Speaker identification is a technology in which a computer uses the speaker-characteristic information contained in a speech segment to identify the speaker's identity. The technology has important research and application value in fields such as information security and remote identity authentication.
According to the identification target, speaker identification can be divided into two classes: text-dependent and text-independent. Text-dependent speaker authentication requires the speaker to pronounce key words or key sentences as training samples and to pronounce the same content at identification time; such systems are inconvenient to use, and the key content is easily stolen by recording. Text-independent speaker identification does not restrict the spoken content during training or identification; the identification target is unconstrained speech, so features and methods that characterize the speaker must be found in free speech signals. Building the speaker model is therefore relatively difficult, but the system is convenient and safe to use. The present invention describes a text-independent identification device.
Speaker identification usually comprises three components: (1) extracting features that express the speaker's characteristics from the training speech data set; (2) training, for each speaker, a model that reflects the distribution of his or her speech features; (3) making the final decision by computing how well the features of the input speech agree with the trained models.
Conventional speaker identification systems use MFCC (Mel-Frequency Cepstral Coefficients) or LSF (Line Spectral Frequencies) as the basic features in the feature extraction part, and use a GMM (Gaussian Mixture Model) or a statistical histogram as the probability model in the model training part.
Traditional features are easily corrupted by noise and cannot express dynamic information. A GMM is suitable only for modeling features with a wide distribution range. A statistical histogram can model feature signals of any distribution, but when training samples are scarce or the feature dimension is high, the resulting model contains a large number of empty bins, making the result discontinuous. The text-independent speaker identification method described in this invention largely solves these problems.
Summary of the invention
To overcome the defects of the above techniques and to improve the text-independent speaker identification rate, the invention provides a text-independent speaker identification method based on composite differential line spectral frequency features and a random projection histogram model, comprising the following steps:
One. characteristic extraction step:
A. Differential line spectral frequency feature extraction: transform the K-dimensional, non-normalized, monotonically increasing line spectral frequency features obtained from the linear predictive coding model of speech into (K+1)-dimensional normalized differential line spectral frequency features.
B. Composite differential line spectral frequency feature generation: combine the differential line spectral frequency features of 3 adjacent frames into a composite differential line spectral frequency feature that expresses the dynamics of the signal.
Two. Random projection histogram model training step: for each speaker's training speech, extract T frames of composite differential line spectral frequency features as one training data set according to Step One. Apply H random transforms to this training data set by the random projection method to obtain H groups of training features. Compute a histogram for each group, and use the average histogram of the H groups of training features as this speaker's probability model. Each speaker thus obtains his or her own trained model.
Three. Identification matching step: after a speech segment is input, generate one group of features by the method of Step One, input the features into each speaker model trained in Step Two, compute the likelihood value of the feature group under each model, and take the maximum likelihood value to determine the speaker's number.
According to the text-independent speaker identification method of an embodiment of the invention, the normalized differential line spectral frequency feature of Step A is extracted as

△x = (1/π)·[x_1, x_2 − x_1, …, x_K − x_{K−1}, π − x_K]^T

where [x_1, x_2, …, x_K]^T is the K-dimensional line spectral frequency feature before the transform, and △x is the (K+1)-dimensional normalized differential line spectral frequency feature after the transform.
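The differential transform described above can be sketched in a few lines of NumPy; the function name and the example LSF vector are illustrative assumptions, not part of the patent:

```python
import numpy as np

def differential_lsf(x):
    """Turn a K-dim increasing LSF vector (values in (0, pi)) into the
    (K+1)-dim normalized differential LSF feature: difference adjacent
    elements, using 0 and pi as the boundary terms, then divide by pi so
    the 1-norm of the result is exactly 1."""
    x = np.asarray(x, dtype=float)
    ext = np.concatenate(([0.0], x, [np.pi]))  # prepend 0, append pi
    return np.diff(ext) / np.pi                # K+1 non-negative entries

lsf = np.array([0.3, 0.9, 1.4, 2.2, 2.9])      # toy K = 5 LSF vector
dx = differential_lsf(lsf)                      # 6-dim, entries sum to 1
```

Because LSFs are strictly increasing in (0, π), every entry of △x is non-negative and the entries sum to 1, which is what makes the Beta- and Dirichlet-style modeling later in the specification applicable.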
According to the text-independent speaker identification method of an embodiment of the invention, the composite differential line spectral frequency feature of Step B is generated as follows: let △x(t) be the differential line spectral frequency feature of frame t; the composite differential line spectral frequency feature of frame t is

Sup△x(t) = [△x(t−τ)^T, △x(t)^T, △x(t+τ)^T]^T

where τ is a positive integer; the invention takes τ = 1.
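The frame stacking above can be sketched as follows; the frame data here is synthetic and the helper name is an assumption:

```python
import numpy as np

def composite_dlsf(frames, t, tau=1):
    """Stack the differential-LSF vectors of frames t-tau, t, t+tau into
    the composite feature Sup dx(t) of dimension 3*(K+1)."""
    return np.concatenate([frames[t - tau], frames[t], frames[t + tau]])

rng = np.random.default_rng(0)
frames = [rng.dirichlet(np.ones(6)) for _ in range(10)]  # toy (K+1)=6 frames
sup = composite_dlsf(frames, t=5, tau=1)                 # 18-dim composite
```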
According to the text-independent speaker identification method of an embodiment of the invention, the model training method of Step Two is as follows:
1) Apply the random projection transform y = Ax + b to the composite differential line spectral frequency feature of dimension D = K+1, where A is a D × D random rotation-scaling matrix and b is a D × 1 random translation vector.
2) Each element of the random translation vector b = [b_1, b_2, …, b_i, …, b_{K+1}]^T is a random variable uniformly distributed between 0 and 1.
3) The rotation-scaling matrix A is the product of the random rotation unit matrix U and the random scaling diagonal matrix Λ:
A = ΛU, with |U| = 1.
4) The random rotation unit matrix U is designed as follows:
1. generate a D × D random matrix V whose elements are uniformly distributed between 0 and 1;
2. apply the QR decomposition V = QR, where Q is a unitary orthogonal matrix;
3. check whether the determinant of Q equals 1, and revise the element q_11 to guarantee that the determinant of Q is 1.
5) The random scaling diagonal matrix Λ is designed as follows:
The element of the j-th dimension of the composite differential line spectral frequency feature follows a Beta distribution with probability density function f(x_j; α_j, β_j) = x_j^{α_j−1}(1−x_j)^{β_j−1}/B(α_j, β_j).
From the fitted distribution, the optimal histogram bin width h_j of each dimension is computed, where D is the dimension of the composite differential line spectral frequency feature and N is the number of training features.
The diagonal elements of Λ are then generated from the optimal bin widths h_j and the relaxation parameters θ_min = 0 and θ_max = 2.
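The exact expression for the diagonal elements is given by a formula not reproduced in this text. A plausible construction consistent with claim 8 — ASSUMING λ_j = (1 + θ_j)/h_j with θ_j drawn uniformly from [θ_min, θ_max] — would look like this sketch; the formula is an assumption, not the patent's:

```python
import numpy as np

def random_scaling_diag(h, theta_min=0.0, theta_max=2.0, seed=None):
    """ASSUMED form of Lambda: lambda_j = (1 + theta_j) / h_j, so a
    histogram bin of optimal width h_j maps to width 1 + theta_j
    (between 1 and 3) after scaling, jittered per projection."""
    rng = np.random.default_rng(seed)
    h = np.asarray(h, dtype=float)
    theta = rng.uniform(theta_min, theta_max, size=h.shape)
    return np.diag((1.0 + theta) / h)

h = np.array([0.10, 0.20, 0.05])        # hypothetical optimal bin widths
Lam = random_scaling_diag(h, seed=0)
lam = np.diag(Lam)                       # each lam_j lies in [1/h_j, 3/h_j]
```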
6) After the random projection, the probability model is built on the training data as follows:
The first half of the model estimates the probability at the empty (zero) bins of the histogram, where π_ZeroDens is the probability that an empty bin occurs in the statistical histogram and p(x|ZeroDens) is the prior probability at the empty positions; the prior used here is a compound Dirichlet process. The input feature vector is
x = Sup△x(t) = [△x(t−τ)^T, △x(t)^T, △x(t+τ)^T]^T = [△x_1, △x_2, △x_3]^T.
The second half is the average statistical histogram probability estimate, where H is the number of random projections performed. One group of training data containing N training samples is transformed by the H random projections into H groups of training data, and p(x|A_i, b_i) is the histogram probability estimate of the input test data x under the i-th transform, defined through
y = A_i x + b_i and v = |A_i|^{-1}.
According to the text-independent speaker identification method of an embodiment of the invention, the identification matching of Step Three is implemented as follows: the input feature data set is fed into the probability model trained for each speaker, and the likelihood value is computed. The likelihood of the test feature set with respect to the j-th speaker model is obtained, and the speaker's number is determined by taking the maximum of the likelihood values.
Compared with the prior art, the beneficial effects of the invention are as follows: the invention extracts composite differential line spectral frequency features as the speaker features, uses a random projection histogram to train the probability model, and provides a complete implementation system for applications. Experiments verify the efficiency of the invention, which has strong practicality.
Specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is the flow chart of the invention, in which solid lines indicate the flow of the training part and dotted lines indicate the flow of the identification part. The method comprises the following steps:
The first step: feature extraction — extract composite differential line spectral frequency features from the training speaker's speech sequence.
Step S1: convert the line spectral frequency features into differential line spectral frequency features;
Step S2: combine the differential line spectral frequency features obtained in S1 into composite differential line spectral frequency features.
The second step: train the probability model.
Step S3: build a random projection histogram model that fits the distribution of the composite differential line spectral frequency features; implementation details are shown in Fig. 2.
The third step: identification.
Repeat Steps S1 and S2 of the first step on the speech sequence of the speaker to be identified to generate a composite differential line spectral frequency test set, and input it into the models trained in Step S3.
Step S4: compute the likelihood value under each probability model, take the maximum likelihood value, and determine the speaker's number.
Each step is described in detail below:
Step S1 implements the differential line spectral frequency feature extraction: the K-dimensional, non-normalized, monotonically increasing line spectral frequency feature obtained from the linear predictive coding model of speech is transformed into the (K+1)-dimensional normalized differential line spectral frequency feature

△x = (1/π)·[x_1, x_2 − x_1, …, x_K − x_{K−1}, π − x_K]^T

where [x_1, x_2, …, x_K]^T is the K-dimensional line spectral frequency feature before the transform, and △x is the normalized differential line spectral frequency feature after the transform.
Step S2 combines the differential line spectral frequency features of 3 adjacent frames into a composite differential line spectral frequency feature that expresses the dynamics of the signal. Let △x(t) be the differential line spectral frequency feature of frame t; the composite differential line spectral frequency feature of frame t is

Sup△x(t) = [△x(t−τ)^T, △x(t)^T, △x(t+τ)^T]^T

where τ is a positive integer; the invention takes τ = 1.
Step S3: build a random projection histogram model that fits the distribution of the composite differential line spectral frequency features; the concrete flow, shown in Fig. 2, is:
1) Obtain from the overall distribution of the composite differential line spectral frequency features the prior probability of the empty (zero) bins of the histogram.
Let the input composite differential line spectral frequency feature vector be
x = Sup△x(t) = [△x(t−τ)^T, △x(t)^T, △x(t+τ)^T]^T = [△x_1, △x_2, △x_3]^T.
From the overall distribution of the composite differential line spectral frequency features, the prior probability π_ZeroDens of an empty bin occurring in the histogram is computed, and the prior distribution at the empty positions of the histogram is
π_ZeroDens · p(x|ZeroDens).
2) Apply random projections to the input composite differential line spectral frequency feature vectors and compute the average histogram.
The formula for the random projection of the D = K+1 dimensional composite differential line spectral frequency feature is y = Ax + b, where A is a D × D random rotation-scaling matrix and b is a D × 1 random translation vector.
Each element of the random translation vector b = [b_1, b_2, …, b_i, …, b_{K+1}]^T is a random variable uniformly distributed between 0 and 1.
The random rotation-scaling matrix A can be decomposed into the product of a random rotation unit matrix U and a random scaling diagonal matrix Λ:
A = ΛU, with |U| = 1.
The random rotation unit matrix U is designed as follows:
1. generate a D × D random matrix V whose elements are uniformly distributed between 0 and 1;
2. apply the QR decomposition V = QR, where Q is a unitary orthogonal matrix;
3. check whether the determinant of Q equals 1, and revise the element q_11 to guarantee that the determinant of Q is 1.
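The construction of U can be sketched as follows. One practical note: the patent revises the single element q_11, while the standard way to flip the determinant sign without breaking orthogonality is to negate Q's entire first column (which contains q_11); this sketch does the latter:

```python
import numpy as np

def random_rotation_unit_matrix(D, seed=None):
    """D x D random orthogonal matrix U with det(U) = +1, built per the
    patent's recipe: uniform(0,1) matrix V, QR decomposition V = QR, then
    a sign correction so the determinant is exactly 1."""
    rng = np.random.default_rng(seed)
    V = rng.uniform(0.0, 1.0, size=(D, D))
    Q, _ = np.linalg.qr(V)
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]   # negating one column flips the determinant sign
    return Q

U = random_rotation_unit_matrix(6, seed=0)   # orthogonal, det(U) = +1
```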
The random scaling diagonal matrix Λ is designed as follows:
1. compute the distribution of each element of the composite differential line spectral frequency feature vector; the element of the j-th dimension follows a Beta distribution with probability density function f(x_j; α_j, β_j) = x_j^{α_j−1}(1−x_j)^{β_j−1}/B(α_j, β_j);
2. compute the optimal histogram bin width h_j of each dimension, where D is the dimension of the composite differential line spectral frequency feature and N is the number of training features;
3. generate the diagonal elements λ of the diagonal matrix Λ from the optimal bin widths h.
After the random transform parameters A and b are obtained by the above flow, H random transforms are applied to the training feature data set: one group of training data containing N training samples generates H groups of training data after the random projections, and the average histogram of the H groups of training data is the mean of their individual histograms, where p(x|A_i, b_i) is the histogram probability estimate of the input test data x under the i-th transform, defined through
y = A_i x + b_i and v = |A_i|^{-1}.
The finally obtained random projection histogram probability estimate model is therefore the combination of the empty-bin prior term and the average histogram term.
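The average-histogram half of the model can be sketched end to end as follows. This is an illustrative simplification: A_i is drawn here as a random rotation times a fixed scale (standing in for ΛU with the Beta-fitted Λ), bins have unit width after projection, and empty bins simply contribute zero rather than the patent's compound-Dirichlet prior term:

```python
import numpy as np
from collections import Counter

def train_rp_histograms(X, H=8, scale=2.0, seed=0):
    """Average random-projection histogram model for data X (N x D): each
    projection i uses y = A_i x + b_i with unit-width bins after the
    transform, storing bin counts keyed by the integer bin index."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    models = []
    for _ in range(H):
        Q, _ = np.linalg.qr(rng.uniform(size=(D, D)))
        A = scale * Q                      # fixed scale stands in for Lambda
        b = rng.uniform(size=D)
        Y = X @ A.T + b
        counts = Counter(map(tuple, np.floor(Y).astype(int)))
        models.append((A, b, counts))
    return N, models

def rp_histogram_density(x, N, models):
    """p(x) averaged over the H projections: count / (N * v) per projection,
    with v = |det A_i|^{-1} the bin volume back in the original space."""
    acc = 0.0
    for A, b, counts in models:
        y = A @ x + b
        v = 1.0 / abs(np.linalg.det(A))            # v = |A_i|^{-1}
        acc += counts.get(tuple(np.floor(y).astype(int)), 0) / (N * v)
    return acc / len(models)

X = np.random.default_rng(1).normal(size=(2000, 2))
N, models = train_rp_histograms(X)
p_center = rp_histogram_density(np.zeros(2), N, models)    # near the data mass
p_far = rp_histogram_density(np.array([8.0, 8.0]), N, models)  # far in the tail
```

The factor v = |A_i|^{-1} is the volume, in the original feature space, of a unit bin in the projected space; dividing the bin's relative count by N·v turns the count into a density estimate, and averaging over the H projections smooths out the bin-placement artifacts of any single histogram.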
The identification matching of Step S4 is implemented as follows:
The input feature data set is fed into the probability model trained for each speaker, and the likelihood value is computed. The likelihood of the test feature set with respect to the j-th speaker model is obtained; the speaker's number is determined by taking the maximum of the likelihood values.
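The decision rule can be sketched with two toy speaker models; the names, the log-likelihood accumulation, and the eps floor for zero probabilities are illustrative choices (the patent instead handles empty bins with the compound-Dirichlet prior):

```python
import numpy as np

def identify_speaker(test_feats, speaker_models, eps=1e-12):
    """Score a set of test feature vectors against each speaker's density
    model p(x) and return the speaker id with maximum log-likelihood."""
    scores = {spk: sum(np.log(max(p(x), eps)) for x in test_feats)
              for spk, p in speaker_models.items()}
    return max(scores, key=scores.get), scores

# Toy models: speaker 0 concentrates near 0, speaker 1 near 3.
models = {0: lambda x: float(np.exp(-np.sum(x ** 2))),
          1: lambda x: float(np.exp(-np.sum((x - 3.0) ** 2)))}
feats = [np.array([0.1, -0.2]), np.array([0.3, 0.0])]
best, scores = identify_speaker(feats, models)   # best is speaker 0
```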
The embodiments of the proposed text-independent speaker identification scheme based on composite differential line spectral frequency features and a random projection histogram model have been set forth above with reference to the accompanying drawings. From the description of the above embodiments, one of ordinary skill in the art will clearly recognize that the invention can be realized by software plus the necessary general hardware platform. Based on this understanding, the part of the technical scheme of the invention that in essence contributes beyond the prior art can be embodied in the form of a computer software product, stored on a storage medium, comprising instructions that cause one or more computer devices to execute the method described in the embodiments of the invention.
Specific embodiments and application scopes may vary according to the idea of the invention; in summary, this description should not be construed as limiting the invention.
The above embodiments of the invention do not limit the scope of protection of the invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the invention shall be included within the scope of protection of the invention.
Claims (10)
1. A text-independent speaker identification device based on a random projection histogram model, characterized by comprising the following steps:
One. Feature extraction step:
A. Differential line spectral frequency feature extraction: transform the K-dimensional, non-normalized, monotonically increasing line spectral frequency features obtained from the linear predictive coding model of speech into (K+1)-dimensional normalized differential line spectral frequency features;
B. Composite differential line spectral frequency feature generation: combine the differential line spectral frequency features of 3 adjacent frames into a composite differential line spectral frequency feature that expresses the dynamics of the signal.
Two. Random projection histogram model training step: for each speaker's training speech, extract T frames of composite differential line spectral frequency features as one training data set according to Step One; apply H random transforms to this training data set by the random projection method to obtain H groups of training features; compute a histogram for each group, and use the average histogram of the H groups of training features as this speaker's probability model; each speaker thus obtains his or her own trained model.
Three. Identification matching step: after a speech segment is input, generate one group of features by the method of Step One, input the features into each speaker model trained in Step Two, compute the likelihood value of the feature group under each model, and take the maximum likelihood value to determine the speaker's number.
2. The speaker identification method according to claim 1, wherein in Step One A, during the differential line spectral frequency feature extraction, the traditional line spectral frequency feature vector is normalized by dividing by π, each pair of adjacent elements of the vector is subtracted to obtain the differential feature vector, and one regularizing element is added to guarantee that the 1-norm of the resulting differential vector is 1.
3. The speaker identification method according to claim 1, wherein in Step One B, the composite differential line spectral frequency feature is obtained by combining the differential line spectral frequency features of 3 adjacent frames, with a spacing of 1 between adjacent frames.
4. The speaker identification method according to claim 1, wherein in Step Two the random transform is y = Ax + b, where A is a random rotation-scaling matrix and b is a random translation vector.
5. The random translation vector b according to claim 4, wherein each element of b is uniformly distributed between 0 and 1.
6. The random rotation-scaling matrix A according to claim 4, wherein A is the product of a unitary orthogonal matrix U and a diagonal matrix Λ.
7. The unitary orthogonal matrix U according to claim 6, wherein U is generated from a square matrix V all of whose elements are uniformly distributed between 0 and 1: V undergoes a QR decomposition, and the upper-left element of the resulting matrix Q is revised, according to whether the determinant of Q is 1, to obtain U.
8. The diagonal matrix Λ according to claim 6, wherein the values of the diagonal elements of Λ are determined by the relaxation parameters θ_min = 0 and θ_max = 2 and by h_j, the optimal histogram bin width of the j-th dimension of the training features, which is decided by the distribution of the training data.
9. The speaker identification method according to claim 1, wherein in Step Two the speaker's probability model is defined such that the first half of the equation defines the probability estimate at the empty (zero) bins of the histogram and the second half defines the estimation method of the average histogram probability, where π_ZeroDens is the probability that an empty bin occurs in the statistical histogram, p(x|ZeroDens) is the prior probability at the empty positions, and p(x|A_i, b_i) is the histogram probability estimate of the input test data x under the i-th transform.
10. The prior probability p(x|ZeroDens) of the empty positions according to claim 9, wherein this prior is estimated using a compound Dirichlet distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410232526.2A CN103985384B (en) | 2014-05-28 | 2014-05-28 | Text-independent speaker identification device based on random projection histogram model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103985384A true CN103985384A (en) | 2014-08-13 |
CN103985384B CN103985384B (en) | 2015-04-15 |
Family
ID=51277327
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410232526.2A Active CN103985384B (en) | 2014-05-28 | 2014-05-28 | Text-independent speaker identification device based on random projection histogram model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103985384B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103685185A (en) * | 2012-09-14 | 2014-03-26 | 上海掌门科技有限公司 | Mobile equipment voiceprint registration and authentication method and system |
CN103207961A (en) * | 2013-04-23 | 2013-07-17 | 曙光信息产业(北京)有限公司 | User verification method and device |
Non-Patent Citations (1)
- Zhanyu Ma, Arne Leijon, W. Bastiaan Kleijn, "Vector Quantization of LSF Parameters With a Mixture of Dirichlet Distributions", IEEE Transactions on Audio, Speech, and Language Processing.
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108630207A (en) * | 2017-03-23 | 2018-10-09 | 富士通株式会社 | Method for identifying speaker and speaker verification's equipment |
CN112331215A (en) * | 2020-10-26 | 2021-02-05 | 桂林电子科技大学 | Voiceprint recognition template protection algorithm |
CN112331215B (en) * | 2020-10-26 | 2022-11-15 | 桂林电子科技大学 | Voiceprint recognition template protection algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN103985384B (en) | 2015-04-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yujin et al. | Research of speaker recognition based on combination of LPCC and MFCC | |
CN102800316B (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN102820033B (en) | Voiceprint identification method | |
CN109637545B (en) | Voiceprint recognition method based on one-dimensional convolution asymmetric bidirectional long-short-time memory network | |
CN103345923B (en) | A kind of phrase sound method for distinguishing speek person based on rarefaction representation | |
CN105869624A (en) | Method and apparatus for constructing speech decoding network in digital speech recognition | |
CN105355214A (en) | Method and equipment for measuring similarity | |
CN102637433A (en) | Method and system for identifying affective state loaded in voice signal | |
Zhang et al. | Speech emotion recognition using combination of features | |
CN103456302A (en) | Emotion speaker recognition method based on emotion GMM model weight synthesis | |
Wataraka Gamage et al. | Speech-based continuous emotion prediction by learning perception responses related to salient events: A study based on vocal affect bursts and cross-cultural affect in AVEC 2018 | |
CN103258531A (en) | Harmonic wave feature extracting method for irrelevant speech emotion recognition of speaker | |
Aliaskar et al. | Human voice identification based on the detection of fundamental harmonics | |
Shen et al. | Rars: Recognition of audio recording source based on residual neural network | |
CN104464738A (en) | Vocal print recognition method oriented to smart mobile device | |
CN103985384B (en) | Text-independent speaker identification device based on random projection histogram model | |
Koolagudi et al. | Speaker recognition in the case of emotional environment using transformation of speech features | |
Rodman et al. | Forensic speaker identification based on spectral moments | |
Feng et al. | Speech emotion recognition based on LSTM and Mel scale wavelet packet decomposition | |
Saritha et al. | Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal | |
Herrera-Camacho et al. | Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE | |
Jian et al. | An embedded voiceprint recognition system based on GMM | |
CN103871411A (en) | Text-independent speaker identifying device based on line spectrum frequency difference value | |
Li et al. | Fast speaker clustering using distance of feature matrix mean and adaptive convergence threshold | |
Fan et al. | Deceptive Speech Detection based on sparse representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |