CN103985384B

CN103985384B - Text-independent speaker identification device based on random projection histogram model

Info

Publication number: CN103985384B
Application number: CN201410232526.2A
Authority: CN
Inventors: 于泓; 马占宇; 郭军
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2014-05-28
Filing date: 2014-05-28
Publication date: 2015-04-15
Anticipated expiration: 2034-05-28
Also published as: CN103985384A

Abstract

The embodiment of the invention discloses a text-independent speaker identification device based on a random projection histogram model. The method includes the three steps of characteristic extraction, model training and identification; wherein in the characteristic extraction step, non-normalized incremental line spectrum frequency characteristics are converted into normalized differential line spectrum frequency characteristics, the differential line spectrum frequency characteristics of consecutive frames are combined to generate composite differential line spectrum frequency characteristics for expressing the dynamic performance of signals; in the model training process, random projection parameters are designed according to the distribution characteristics of the composite differential line spectrum frequency characteristics, random projection is conducted on training data sets, and probability models are established by calculating an average histogram; in the identification process, characteristic extraction is conducted on voice signals of a person to be identified according to the first step, then the extracted characteristics are input into the models obtained in the second step, the likelihood value of each probability model is calculated, the maximum likelihood value is obtained, and then the serial number of a speaker is identified. By the adoption of the method, the text-independent speaker identification rate can be increased, and the method has great practical value.

Description

A kind of text based on Random Maps histogram model has nothing to do speaker detection device

Technical field

The invention belongs to field of audio processing to describe emphatically a kind of text based on Random Maps histogram model and to have nothing to do speaker detection device.

Background technology

Speaker Identification is that computing machine utilizes can reflecting of comprising in sound bite the information of speaker characteristic is to identify the technology of speaker ' s identity, and this technology is in information security, and the fields such as long-distance identity-certifying have very important Study and appliance and are worth.

According to the difference identifying object, speaker detection can be divided into text dependent and irrelevant two classes of text.The wherein speaker detection technology of text dependent, the keyword that requirement utilizes speaker to pronounce and crucial sentence are as training sample, and utilize identical content pronunciation to identify when distinguishing, this system uses inconvenience and the easy stolen record of key content.Speaker's recognition techniques that text is irrelevant, content of speaking is not specified when training and when recognizing, identify liking voice signal freely, need the characteristic sum method finding the information that can characterize speaker in voice signal freely, therefore speaker model relative difficulty is set up, but this technology safety easy to use.Described in the invention is the identification device that text has nothing to do.

Speaker detection usually comprises 3 ingredients (1) and extracts the feature can expressing speaker's feature from training utterance data centralization; (2) for speaker trains a model that can reflect its phonetic feature regularity of distribution; (3) agreeing with degree to carry out and make final decision by the feature calculating input voice and the training pattern to have obtained.

Conventional speaker detection system adopts MFCC (Mel-frequency Cepstral Coefficients in characteristic extraction part, mel cepstrum coefficients) or LSF (Line Spectral Frequencies, line spectral frequencies) as essential characteristic, in model training part, adopt GMM (GaussianMixture Model, gauss hybrid models) or statistic histogram as probability model.

Traditional feature is easy to be subject to noise and multidate information beyond expression of words, the feature that GMM model is only suitable for for distribution range is wider carries out modeling, although statistic histogram model can carry out modeling to the characteristic signal of Arbitrary distribution, but when lack of training samples or characteristic dimension too high time, there is a large amount of zero points in the model set up, cause result discontinuous.The method for distinguishing speek person that text described in the invention has nothing to do can solve the above problems greatly.

Summary of the invention

Improve the irrelevant speaker detection rate of text in order to the defect that solves existing for above-mentioned technology, the invention provides a kind of text based on composite difference line spectral frequencies feature and stochastic transformation histogram model and to have nothing to do speaker detection method, comprise the following steps:

One. characteristic extraction step:

A, differential lines spectral frequency characteristic extraction step: it is that K+1 ties up normalized differential lines spectral frequency feature that the K obtained from speech linear predictive coding model is tieed up the non-normalized line spectral frequencies eigentransformation increased progressively.

The step of B, generation composite difference line spectral frequencies feature: 3 adjacent frame difference line spectral frequencies features are carried out combination producing composite difference line spectral frequencies feature with the dynamic perfromance of expression signal.

Two. Random Maps histogram model training step: the training utterance for each speaker extracts T frame composite difference line spectral frequencies feature as 1 group of training dataset according to the description of step one.Adopt the method for Random Maps to carry out H stochastic transformation to this training dataset and obtain H group training characteristics.Statistics with histogram is carried out to every stack features, and utilizes the probability model of average histogram as this speaker of H group training characteristics.Final each speaker can train and obtain an one's own model.

Three. differentiate coupling step: after inputting one section of voice, the method of step one is adopted to generate 1 stack features and train in the model of each speaker obtained by this feature input step two, calculate the likelihood value of this stack features for each model, get wherein maximum likelihood value to confirm the numbering of speaker.

According to a kind of text-independent speaker detection method of an embodiment of the invention, the normalized differential lines spectral frequency feature extraction mode described by steps A is as follows:

Wherein [x ₁, x ₂..., x _k] ^tfor the K before conversion ties up line spectral frequencies feature, △ x is the normalization differential lines spectral frequency feature of K+1 dimension after conversion

According to a kind of text-independent speaker detection method of an embodiment of the invention, the concrete generative process of the composite difference line spectral frequencies feature described in step B is as follows:

Suppose that the differential lines spectral frequency of t frame is characterized as △ x (t), then the composite difference line spectral frequencies of t frame is characterized as:

Sup△x(t)＝[△x(t-τ) ^T,△x(t) ^T,△x(t+τ) ^T] ^T

Wherein τ is positive integer, gets τ=1 in the present invention.

According to a kind of text-independent speaker detection method of an embodiment of the invention, the model training method described in step 2 is as follows:

1) carry out Random Maps conversion to the composite difference line spectral frequencies feature of dimension D=K+1, transformation for mula is: y=Ax+b, and wherein A is the Random-Rotation scaled matrix of D × D dimension, and b is the random translation vector that D × 1 is tieed up.

2) random translation vector b=[b ₁, b ₂..., b _i... b _k+1] ^tin each element value be equally distributed stochastic variable between 0 ~ 1.

3) product that scaled matrix A is Random-Rotation unit matrix U and random convergent-divergent diagonal matrix Λ is rotated

A＝ΛU

|U|＝1

4) design process of Random-Rotation unit matrix U is as follows:

1. generate the stochastic matrix V of D × D dimension, each element in V meets being uniformly distributed between 0 ~ 1

2. carry out QR to matrix V and decompose V=QR, wherein Q is unit orthogonal matrix

3. by judging whether the determinant of Q equals 1, comes element q ₁₁carry out revising ensure Q the value of determinant be 1

5) design process of random convergent-divergent diagonal matrix Λ is as follows:

The element of a jth dimension of composite difference line spectral frequencies feature meets Beta distribution, and its probability density function is

Beta (x_{j}; α_{j}, β_{j}) = \frac{Γ (α_{j} + β_{j})}{Γ (α_{j}) Γ (β_{j})} x_{j}^{α_{j} - 1} {(1 - x_{j})}^{β_{j} - 1}

If

R (x_{j}; α_{j}, β_{j}) = {&Integral;}_{0}^{1} B {eta}^{2} (x_{j}; α_{j}, β_{j}) {dx}_{j}

h_{j} = R {(x_{j}; α_{j}, β_{j})}^{- \frac{1}{2}} {(6 Π_{i = 1}^{D} R {(x_{i}; α_{i}, β_{i})}^{\frac{1}{2}})}^{\frac{1}{2 + D}} N^{- \frac{1}{2 + D}}

Wherein D is the dimension of composite difference line spectral frequencies feature, and N is the number of training characteristics.

Then in diagonal matrix Λ, the value of diagonal entry is

\log (λ_{j}) = Uniform [θ_{\min} + \log (h_{j}^{- 1}), θ_{\max} + \log (h_{j}^{- 1})]

Wherein θ _min=0, θ _max=2 is relaxation parameter.

6) after Random Maps, build probability model mode as follows for training data:

HD (x) = π_{ZeroDens} p (x | ZeroDens) + \frac{1 - π_{ZeroDens}}{H} Σ_{i}^{H} p (x | A_{i}, b_{i}),

First half is that probability estimate is carried out in the position at zero point in histogram, wherein for the probability that zero point in statistic histogram occurs.The prior probability that p (x|ZeroDens) is null position, priori is here compound Di Li Cray process.The proper vector of input is:

x＝Sup△x(t)＝[△x(t-τ) ^T,△x(t) ^T,△x(t+τ) ^T] ^T＝[△x ₁,△x ₂,△x ₃] ^T

p (x | ZeroDens) = Π_{n = 1}^{3} \frac{Γ (Σ_{k = 1}^{K + 1} α_{n, k})}{Π_{k = 1}^{K + 1} Γ (α_{n, k})} Π_{k = 1}^{K + 1} {({Δx}_{n, k})}^{α_{n, k} - 1}

Latter half is average statistics histogram probability estimate, and wherein H is the number of times carrying out Random Maps, 1 group of training dataset containing N number of training data h group training dataset is transformed to after H Random Maps

Wherein p (x|A _i, b _i) be the histogram probability estimate of input test data x in i-th conversion, be defined as follows:

p (x | A_{i}, b_{i}) = \frac{1}{Hv} Σ_{j = 1}^{N} II (round (y_{j}), round (y))

y＝A _ix+b _i

v＝|A _i| ^-1

According to a kind of text-independent speaker detection method of an embodiment of the invention, the discriminating matching process implementation method described in step 3 is: by the characteristic data set of input be transported to and be directed to each speaker and train in probability model, calculate likelihood value.

L_{j} (\tilde{x}) = Σ_{i = 1}^{N} \log ({HT}_{j} (x_{i}))

Wherein for test feature collection about the likelihood value of a jth speaker model, confirm the numbering of speaker by getting wherein maximum likelihood value.

Beneficial effect of the present invention is, in terms of existing technologies, the present invention's application composite difference line spectral frequencies feature is extracted as the feature of speaker, use Random Maps histogram training probability model, provide again complete implementation system for application, experiment show high efficiency of the present invention, has a very strong practicality.

Below in conjunction with accompanying drawing, specific embodiments of the present invention is described in detail.

Fig. 1 is process flow diagram of the present invention, and wherein solid line represents training part run trend, and dotted line represents differentiates part run trend, comprises the following steps:

The first step: characteristic extraction step, extracts composite difference line spectral frequencies feature from speaker's voice sequence of training.

Step S1: be differential lines spectral frequency feature by line spectral frequencies Feature Conversion;

Step S2: combination is carried out to the differential lines spectral frequency feature obtained in S1 and obtains composite difference line spectral frequencies feature.

Second step: training probability model

Step S3: the distribution building Random Maps histogram model matching composite difference line spectral frequencies feature, realizes details as shown in Figure 2.

3rd step: discrimination process

Step S1 in the first step and step S2 is repeated to speaker's voice sequence to be identified and generates composite difference line spectral frequencies characteristic test collection, input step S3 train the model obtained.

Step S4: calculate the likelihood value for each probability model, obtains maximum likelihood value, confirms speaker's numbering.

To be specifically described each step below:

Step S1 realizes the extraction of differential lines spectral frequency feature, and it is that K+1 ties up normalized differential lines spectral frequency feature that the K obtained from speech linear predictive coding model is tieed up the non-normalized line spectral frequencies eigentransformation increased progressively, and its implementation is as follows:

Wherein [x ₁, x ₂..., x _k] ^tfor the K before conversion ties up line spectral frequencies feature, △ x is the normalization differential lines spectral frequency feature of K+1 dimension after conversion.

3 adjacent frame difference line spectral frequencies features are carried out combination producing composite difference line spectral frequencies feature with the dynamic perfromance of expression signal by step S2.Suppose that the differential lines spectral frequency of t frame is characterized as △ x (t), then the composite difference line spectral frequencies of t frame is characterized as:

Sup△x(t)＝[△x(t-τ) ^T,△x(t) ^T,△x(t+τ) ^T] ^T

Wherein τ is positive integer, gets τ=1 in the present invention.

Step S3: the distribution building Random Maps histogram model matching composite difference line spectral frequencies feature, concrete realization flow as shown in Figure 2:

1) prior probability located zero point in histogram is obtained according to the overall distribution of composite difference line spectral frequencies feature.

If the conjunction composite difference line spectral frequencies proper vector of input is:

Then being distributed as of entirety of composite difference line spectral frequencies feature:

p (x | ZeroDens) = Π_{n = 1}^{3} \frac{Γ (Σ_{k = 1}^{K + 1} α_{n, k})}{Π_{k = 1}^{K + 1} Γ (α_{n, k})} Π_{k = 1}^{K + 1} {({Δx}_{n, k})}^{α_{n, k} - 1}

In histogram, the prior probability of appearance at zero point is

π_{ZeroDens} = \frac{1}{N + 1}

Then in histogram, the prior distribution of null position is:

π _ZeroDensp(x|ZeroDens)

2) the conjunction composite difference line spectral frequencies proper vector of input is carried out to Random Maps structure and calculated average histogram.

The formula composite difference line spectral frequencies feature of dimension D=K+1 being carried out to Random Maps conversion is y=Ax+b, and wherein A is the Random-Rotation scaled matrix of D × D dimension, and b is the random translation vector that D × 1 is tieed up.

Random translation vector b=[b ₁, b ₂..., b _i... b _k+1] ^tin each element value be equally distributed stochastic variable between 0 ~ 1.

Random-Rotation scaled matrix A can be analyzed to the product of Random-Rotation unit matrix U and random convergent-divergent diagonal matrix Λ

A＝ΛU

|U|＝1

Wherein the design process of Random-Rotation unit matrix U is as follows:

3. by judging whether the determinant of Q equals 1, comes element q ₁₁the value carrying out revising the determinant ensureing Q is 1

The design process of random convergent-divergent diagonal matrix Λ is as follows:

1. the distribution of each element in composite difference line spectral frequencies proper vector is calculated.The element of jth dimension meets Beta distribution, and its probability density function is

Beta (x_{j}; α_{j}, β_{j}) = \frac{Γ (α_{j} + β_{j})}{Γ (α_{j}) Γ (β_{j})} x_{j}^{α_{j} - 1} {(1 - x_{j})}^{β_{j} - 1}

2. the wide h of histogrammic optimum bin in each dimension is calculated

R (x_{j}; α_{j}, β_{j}) = {&Integral;}_{0}^{1} B {eta}^{2} (x_{j}; α_{j}, β_{j}) {dx}_{j}

h_{j} = R {(x_{j}; α_{j}, β_{j})}^{- \frac{1}{2}} {(6 Π_{i = 1}^{D} R {(x_{i}; α_{i}, β_{i})}^{\frac{1}{2}})}^{\frac{1}{2 + D}} N^{- \frac{1}{2 + D}}

3. the value λ of diagonal entry in diagonal matrix Λ is generated according to the wide h of optimum bin

\log (λ_{j}) = Uniform [θ_{\min} + \log (h_{j}^{- 1}), θ_{\max} + \log (h_{j}^{- 1})],

Wherein θ _min=0, θ _max=2 is relaxation parameter.

After obtaining stochastic transformation parameter A, b according to above-mentioned flow process, H stochastic transformation is carried out to training characteristics data set, 1 group of training dataset containing N number of training sample h group training dataset is generated after Random Maps wherein the average histogram of H group training dataset is:

\frac{1 - π_{ZeroDens}}{H} Σ_{i}^{H} p (x | A_{i}, b_{i})

p (x | A_{i}, b_{i}) = \frac{1}{Hv} Σ_{j = 1}^{N} II (round (y_{j}), round (y))

y＝A _ix+b _i

v＝|A _i| ^-1

Therefore the final Random Maps histogram figure probability estimate model obtained is:

HD (x) = π_{ZeroDens} p (x | ZeroDens) + \frac{1 - π_{ZeroDens}}{H} Σ_{i}^{H} p (x | A_{i}, b_{i})

Discriminating matching process implementation method described in step S4 is:

By the characteristic data set of input be transported to and be directed to each speaker and train in probability model, calculate likelihood value.

L_{j} (\tilde{x}) = Σ_{i = 1}^{N} \log ({HT}_{j} (x_{i}))

Below by reference to the accompanying drawings the embodiment of the proposed Speaker Identification scheme had nothing to do based on the text of composite difference line spectral frequencies feature and stochastic transformation histogram model is set forth.By the description of above embodiment, one of ordinary skill in the art clearly can recognize that the mode that the present invention can add required general hardware platform by software realizes.Based on such understanding, technical scheme of the present invention can embody the part that prior art contributes in essence in other words in form of a computer software product, this software product is stored in a storage medium, comprises some instructions and performs method described in each embodiment of the present invention in order to make one or more computer equipment.

According to thought of the present invention, all will change in specific embodiments and applications.In sum, this description should not be construed as limitation of the present invention.

Above-described embodiment of the present invention, does not form the restriction to invention protection domain.Any amendment done within the spirit and principles in the present invention, equivalent replacement and improvement etc., all should be included within protection scope of the present invention.

Claims

1. to have nothing to do a speaker detection method based on the text of Random Maps histogram model, it is characterized in that, comprise the following steps:

One. characteristic extraction step:

A, the feature extraction of differential lines spectral frequency: it is that K+1 ties up normalized differential lines spectral frequency feature that the K obtained from speech linear predictive coding model is tieed up the non-normalized line spectral frequencies eigentransformation increased progressively;

B, generation composite difference line spectral frequencies feature: 3 adjacent frame difference line spectral frequencies features are carried out combination producing composite difference line spectral frequencies feature with the dynamic perfromance of expression signal;

Two. Random Maps histogram model training step: the training utterance for each speaker extracts T frame composite difference line spectral frequencies feature as 1 group of training dataset according to the description of step one; Adopt the method for Random Maps to carry out H stochastic transformation to this training dataset and obtain H group training characteristics; Stochastic transformation mode is y=AX+b, and wherein A is Random-Rotation scaled matrix, and b is random translation vector; Each element in b, should meet being uniformly distributed between 0 ~ 1; A is the product of unit orthogonal matrices U and diagonal matrix Λ; Whether the equally distributed square formation V that U is met between 0 ~ 1 by an all elements generates, and carries out QR decomposition to V, and be 1 carry out correction to its top left hand element and obtain U according to decomposing the Q determinant of a matrix value that obtains; The diagonal entry value of Λ meets θ _min+ log (h _j ^-1) and θ _max+ log (h _j ^-1) between be uniformly distributed, wherein θ _min=0, θ _max=2, h _jfor the histogrammic best bin of training characteristics jth dimension is wide, this numerical value is decided by the regularity of distribution of training data; Statistics with histogram is carried out to every stack features, and utilizes the probability model of average histogram as this speaker of H group training characteristics; Final each speaker can train and obtain an one's own model;

2. in method for distinguishing speek person according to claim 1, the feature of step one A is, during the feature extraction of differential lines spectral frequency, by traditional line spectral frequencies proper vector except after π normalization, in vector, each adjacent element subtracts each other, acquisition Differential Characteristics vector, and increase a regular element usually ensure obtain difference vector 1 norm be 1.

3. in method for distinguishing speek person according to claim 1, the feature of step one B is, 3 adjacent frame difference line spectral frequencies features combined when composite difference line spectral frequencies feature obtains, the spacing of consecutive frame is 1.

4. in method for distinguishing speek person according to claim 1, the feature of step 2 is, the probability model of speaker is defined as

HD (x) = π_{ZeroDens} p (x | ZeroDens) + \frac{1 - π_{ZeroDens}}{H} Σ_{i}^{H} p (x | A_{i}, b_{i}),

Wherein π _zeroDensp (x|ZeroDens) defines and locates Probabilistic estimation zero point in histogram, define the method for estimation of average histogram probability; Wherein for the probability that zero point in statistic histogram occurs; The prior probability that p (x|ZeroDens) is null position; for the histogram probability estimate of input test data x in i-th conversion, wherein y _jfor the feature that a jth training data generates after i-th conversion; Y is the feature that input test data x generates after i-th conversion,

5. method for distinguishing speek person according to claim 4, wherein prior probability p (x|ZeroDens) should utilize compound Dirichlet distribute to estimate.