CN103985384A - Text-independent speaker identification device based on random projection histogram model - Google Patents
- Publication number: CN103985384A (application CN201410232526.2A; granted as CN103985384B)
- Authority
- CN
- China
- Legal status
- Granted
Abstract
The embodiment of the invention discloses a text-independent speaker identification device based on a random projection histogram model. The method comprises three steps: feature extraction, model training, and identification. In the feature extraction step, the non-normalized, monotonically increasing line spectral frequency features are converted into normalized differential line spectral frequency features, and the differential line spectral frequency features of consecutive frames are combined into composite differential line spectral frequency features that express the dynamics of the signal. In the model training step, random projection parameters are designed according to the distribution of the composite differential line spectral frequency features, random projections are applied to the training data set, and a probability model is built by computing an average histogram. In the identification step, features are extracted from the speech of the person to be identified as in the first step, the extracted features are input into the models obtained in the second step, the likelihood value under each probability model is computed, and the speaker number corresponding to the maximum likelihood value is returned. The method increases the text-independent speaker identification rate and has great practical value.
Description
Technical field
The invention belongs to the field of audio processing and describes a text-independent speaker identification device based on a random projection histogram model.
Background art
Speaker identification is a technology in which a computer uses the speaker-characteristic information contained in a speech segment to identify the speaker's identity. The technology has important research and application value in fields such as information security and remote identity authentication.
According to the identification target, speaker identification can be divided into two classes: text-dependent and text-independent. Text-dependent speaker authentication requires the speaker to pronounce key words or key sentences as training samples and to pronounce the same content at identification time; such systems are inconvenient to use, and the key content is easily stolen by recording. Text-independent speaker identification does not restrict the spoken content during training or identification; the identification target is unconstrained speech, so features and methods that characterize the speaker must be found in free speech signals. Building the speaker model is therefore relatively difficult, but the system is convenient and safe to use. The present invention describes a text-independent identification device.
Speaker identification usually comprises three components: (1) extracting features that express the speaker's characteristics from the training speech data set; (2) training, for each speaker, a model that reflects the distribution of his or her speech features; (3) making the final decision by computing how well the features of the input speech agree with the trained models.
Conventional speaker identification systems use MFCC (Mel-Frequency Cepstral Coefficients) or LSF (Line Spectral Frequencies) as the basic features in the feature extraction part, and use a GMM (Gaussian Mixture Model) or a statistical histogram as the probability model in the model training part.
Traditional features are easily corrupted by noise and cannot express dynamic information. A GMM is suitable only for modeling features with a wide distribution range. A statistical histogram can model feature signals of any distribution, but when training samples are scarce or the feature dimension is high, the resulting model contains a large number of empty bins, making the result discontinuous. The text-independent speaker identification method described in this invention largely solves these problems.
Summary of the invention
To overcome the defects of the above techniques and to improve the text-independent speaker identification rate, the invention provides a text-independent speaker identification method based on composite differential line spectral frequency features and a random projection histogram model, comprising the following steps:
One. characteristic extraction step:
A. Differential line spectral frequency feature extraction: transform the K-dimensional, non-normalized, monotonically increasing line spectral frequency features obtained from the linear predictive coding model of speech into (K+1)-dimensional normalized differential line spectral frequency features.
B. Composite differential line spectral frequency feature generation: combine the differential line spectral frequency features of 3 adjacent frames into a composite differential line spectral frequency feature that expresses the dynamics of the signal.
Two. Random projection histogram model training step: for each speaker's training speech, extract T frames of composite differential line spectral frequency features as one training data set according to Step One. Apply H random transforms to this training data set by the random projection method to obtain H groups of training features. Compute a histogram for each group, and use the average histogram of the H groups of training features as this speaker's probability model. Each speaker thus obtains his or her own trained model.
Three. Identification matching step: after a speech segment is input, generate one group of features by the method of Step One, input the features into each speaker model trained in Step Two, compute the likelihood value of the feature group under each model, and take the maximum likelihood value to determine the speaker's number.
According to the text-independent speaker identification method of an embodiment of the invention, the normalized differential line spectral frequency feature of Step A is extracted as

△x = (1/π)·[x_1, x_2 − x_1, …, x_K − x_{K−1}, π − x_K]^T

where [x_1, x_2, …, x_K]^T is the K-dimensional line spectral frequency feature before the transform, and △x is the (K+1)-dimensional normalized differential line spectral frequency feature after the transform.
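The differential transform described above can be sketched in a few lines of NumPy; the function name and the example LSF vector are illustrative assumptions, not part of the patent:

```python
import numpy as np

def differential_lsf(x):
    """Turn a K-dim increasing LSF vector (values in (0, pi)) into the
    (K+1)-dim normalized differential LSF feature: difference adjacent
    elements, using 0 and pi as the boundary terms, then divide by pi so
    the 1-norm of the result is exactly 1."""
    x = np.asarray(x, dtype=float)
    ext = np.concatenate(([0.0], x, [np.pi]))  # prepend 0, append pi
    return np.diff(ext) / np.pi                # K+1 non-negative entries

lsf = np.array([0.3, 0.9, 1.4, 2.2, 2.9])      # toy K = 5 LSF vector
dx = differential_lsf(lsf)                      # 6-dim, entries sum to 1
```

Because LSFs are strictly increasing in (0, π), every entry of △x is non-negative and the entries sum to 1, which is what makes the Beta- and Dirichlet-style modeling later in the specification applicable.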
According to the text-independent speaker identification method of an embodiment of the invention, the composite differential line spectral frequency feature of Step B is generated as follows: let △x(t) be the differential line spectral frequency feature of frame t; the composite differential line spectral frequency feature of frame t is

Sup△x(t) = [△x(t−τ)^T, △x(t)^T, △x(t+τ)^T]^T

where τ is a positive integer; the invention takes τ = 1.
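The frame stacking above can be sketched as follows; the frame data here is synthetic and the helper name is an assumption:

```python
import numpy as np

def composite_dlsf(frames, t, tau=1):
    """Stack the differential-LSF vectors of frames t-tau, t, t+tau into
    the composite feature Sup dx(t) of dimension 3*(K+1)."""
    return np.concatenate([frames[t - tau], frames[t], frames[t + tau]])

rng = np.random.default_rng(0)
frames = [rng.dirichlet(np.ones(6)) for _ in range(10)]  # toy (K+1)=6 frames
sup = composite_dlsf(frames, t=5, tau=1)                 # 18-dim composite
```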
According to the text-independent speaker identification method of an embodiment of the invention, the model training method of Step Two is as follows:
1) Apply the random projection transform y = Ax + b to the composite differential line spectral frequency feature of dimension D = K+1, where A is a D × D random rotation-scaling matrix and b is a D × 1 random translation vector.
2) Each element of the random translation vector b = [b_1, b_2, …, b_i, …, b_{K+1}]^T is a random variable uniformly distributed between 0 and 1.
3) The rotation-scaling matrix A is the product of the random rotation unit matrix U and the random scaling diagonal matrix Λ:
A = ΛU, with |U| = 1.
4) The random rotation unit matrix U is designed as follows:
1. generate a D × D random matrix V whose elements are uniformly distributed between 0 and 1;
2. apply the QR decomposition V = QR, where Q is a unitary orthogonal matrix;
3. check whether the determinant of Q equals 1, and revise the element q_11 to guarantee that the determinant of Q is 1.
5) The random scaling diagonal matrix Λ is designed as follows:
The element of the j-th dimension of the composite differential line spectral frequency feature follows a Beta distribution with probability density function f(x_j; α_j, β_j) = x_j^{α_j−1}(1−x_j)^{β_j−1}/B(α_j, β_j).
From the fitted distribution, the optimal histogram bin width h_j of each dimension is computed, where D is the dimension of the composite differential line spectral frequency feature and N is the number of training features.
The diagonal elements of Λ are then generated from the optimal bin widths h_j and the relaxation parameters θ_min = 0 and θ_max = 2.
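The exact expression for the diagonal elements is given by a formula not reproduced in this text. A plausible construction consistent with claim 8 — ASSUMING λ_j = (1 + θ_j)/h_j with θ_j drawn uniformly from [θ_min, θ_max] — would look like this sketch; the formula is an assumption, not the patent's:

```python
import numpy as np

def random_scaling_diag(h, theta_min=0.0, theta_max=2.0, seed=None):
    """ASSUMED form of Lambda: lambda_j = (1 + theta_j) / h_j, so a
    histogram bin of optimal width h_j maps to width 1 + theta_j
    (between 1 and 3) after scaling, jittered per projection."""
    rng = np.random.default_rng(seed)
    h = np.asarray(h, dtype=float)
    theta = rng.uniform(theta_min, theta_max, size=h.shape)
    return np.diag((1.0 + theta) / h)

h = np.array([0.10, 0.20, 0.05])        # hypothetical optimal bin widths
Lam = random_scaling_diag(h, seed=0)
lam = np.diag(Lam)                       # each lam_j lies in [1/h_j, 3/h_j]
```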
6) After the random projection, the probability model is built on the training data as follows:
The first half of the model estimates the probability at the empty (zero) bins of the histogram, where π_ZeroDens is the probability that an empty bin occurs in the statistical histogram and p(x|ZeroDens) is the prior probability at the empty positions; the prior used here is a compound Dirichlet process. The input feature vector is
x = Sup△x(t) = [△x(t−τ)^T, △x(t)^T, △x(t+τ)^T]^T = [△x_1, △x_2, △x_3]^T.
The second half is the average statistical histogram probability estimate, where H is the number of random projections performed. One group of training data containing N training samples is transformed by the H random projections into H groups of training data, and p(x|A_i, b_i) is the histogram probability estimate of the input test data x under the i-th transform, defined through
y = A_i x + b_i and v = |A_i|^{-1}.
According to the text-independent speaker identification method of an embodiment of the invention, the identification matching of Step Three is implemented as follows: the input feature data set is fed into the probability model trained for each speaker, and the likelihood value is computed. The likelihood of the test feature set with respect to the j-th speaker model is obtained, and the speaker's number is determined by taking the maximum of the likelihood values.
Compared with the prior art, the beneficial effects of the invention are as follows: the invention extracts composite differential line spectral frequency features as the speaker features, uses a random projection histogram to train the probability model, and provides a complete implementation system for applications. Experiments verify the efficiency of the invention, which has strong practicality.
Specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is the flow chart of the invention, in which solid lines indicate the flow of the training part and dotted lines indicate the flow of the identification part. The method comprises the following steps:
The first step: feature extraction — extract composite differential line spectral frequency features from the training speaker's speech sequence.
Step S1: convert the line spectral frequency features into differential line spectral frequency features;
Step S2: combine the differential line spectral frequency features obtained in S1 into composite differential line spectral frequency features.
The second step: train the probability model.
Step S3: build a random projection histogram model that fits the distribution of the composite differential line spectral frequency features; implementation details are shown in Fig. 2.
The third step: identification.
Repeat Steps S1 and S2 of the first step on the speech sequence of the speaker to be identified to generate a composite differential line spectral frequency test set, and input it into the models trained in Step S3.
Step S4: compute the likelihood value under each probability model, take the maximum likelihood value, and determine the speaker's number.
Each step is described in detail below:
Step S1 implements the differential line spectral frequency feature extraction: the K-dimensional, non-normalized, monotonically increasing line spectral frequency feature obtained from the linear predictive coding model of speech is transformed into the (K+1)-dimensional normalized differential line spectral frequency feature

△x = (1/π)·[x_1, x_2 − x_1, …, x_K − x_{K−1}, π − x_K]^T

where [x_1, x_2, …, x_K]^T is the K-dimensional line spectral frequency feature before the transform, and △x is the normalized differential line spectral frequency feature after the transform.
Step S2 combines the differential line spectral frequency features of 3 adjacent frames into a composite differential line spectral frequency feature that expresses the dynamics of the signal. Let △x(t) be the differential line spectral frequency feature of frame t; the composite differential line spectral frequency feature of frame t is

Sup△x(t) = [△x(t−τ)^T, △x(t)^T, △x(t+τ)^T]^T

where τ is a positive integer; the invention takes τ = 1.
Step S3: build a random projection histogram model that fits the distribution of the composite differential line spectral frequency features; the concrete flow, shown in Fig. 2, is:
1) Obtain from the overall distribution of the composite differential line spectral frequency features the prior probability of the empty (zero) bins of the histogram.
Let the input composite differential line spectral frequency feature vector be
x = Sup△x(t) = [△x(t−τ)^T, △x(t)^T, △x(t+τ)^T]^T = [△x_1, △x_2, △x_3]^T.
From the overall distribution of the composite differential line spectral frequency features, the prior probability π_ZeroDens of an empty bin occurring in the histogram is computed, and the prior distribution at the empty positions of the histogram is
π_ZeroDens · p(x|ZeroDens).
2) Apply random projections to the input composite differential line spectral frequency feature vectors and compute the average histogram.
The formula for the random projection of the D = K+1 dimensional composite differential line spectral frequency feature is y = Ax + b, where A is a D × D random rotation-scaling matrix and b is a D × 1 random translation vector.
Each element of the random translation vector b = [b_1, b_2, …, b_i, …, b_{K+1}]^T is a random variable uniformly distributed between 0 and 1.
The random rotation-scaling matrix A can be decomposed into the product of a random rotation unit matrix U and a random scaling diagonal matrix Λ:
A = ΛU, with |U| = 1.
The random rotation unit matrix U is designed as follows:
1. generate a D × D random matrix V whose elements are uniformly distributed between 0 and 1;
2. apply the QR decomposition V = QR, where Q is a unitary orthogonal matrix;
3. check whether the determinant of Q equals 1, and revise the element q_11 to guarantee that the determinant of Q is 1.
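The construction of U can be sketched as follows. One practical note: the patent revises the single element q_11, while the standard way to flip the determinant sign without breaking orthogonality is to negate Q's entire first column (which contains q_11); this sketch does the latter:

```python
import numpy as np

def random_rotation_unit_matrix(D, seed=None):
    """D x D random orthogonal matrix U with det(U) = +1, built per the
    patent's recipe: uniform(0,1) matrix V, QR decomposition V = QR, then
    a sign correction so the determinant is exactly 1."""
    rng = np.random.default_rng(seed)
    V = rng.uniform(0.0, 1.0, size=(D, D))
    Q, _ = np.linalg.qr(V)
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]   # negating one column flips the determinant sign
    return Q

U = random_rotation_unit_matrix(6, seed=0)   # orthogonal, det(U) = +1
```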
The random scaling diagonal matrix Λ is designed as follows:
1. compute the distribution of each element of the composite differential line spectral frequency feature vector; the element of the j-th dimension follows a Beta distribution with probability density function f(x_j; α_j, β_j) = x_j^{α_j−1}(1−x_j)^{β_j−1}/B(α_j, β_j);
2. compute the optimal histogram bin width h_j of each dimension, where D is the dimension of the composite differential line spectral frequency feature and N is the number of training features;
3. generate the diagonal elements λ of the diagonal matrix Λ from the optimal bin widths h.
After the random transform parameters A and b are obtained by the above flow, H random transforms are applied to the training feature data set: one group of training data containing N training samples generates H groups of training data after the random projections, and the average histogram of the H groups of training data is the mean of their individual histograms, where p(x|A_i, b_i) is the histogram probability estimate of the input test data x under the i-th transform, defined through
y = A_i x + b_i and v = |A_i|^{-1}.
The finally obtained random projection histogram probability estimate model is therefore the combination of the empty-bin prior term and the average histogram term.
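The average-histogram half of the model can be sketched end to end as follows. This is an illustrative simplification: A_i is drawn here as a random rotation times a fixed scale (standing in for ΛU with the Beta-fitted Λ), bins have unit width after projection, and empty bins simply contribute zero rather than the patent's compound-Dirichlet prior term:

```python
import numpy as np
from collections import Counter

def train_rp_histograms(X, H=8, scale=2.0, seed=0):
    """Average random-projection histogram model for data X (N x D): each
    projection i uses y = A_i x + b_i with unit-width bins after the
    transform, storing bin counts keyed by the integer bin index."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    models = []
    for _ in range(H):
        Q, _ = np.linalg.qr(rng.uniform(size=(D, D)))
        A = scale * Q                      # fixed scale stands in for Lambda
        b = rng.uniform(size=D)
        Y = X @ A.T + b
        counts = Counter(map(tuple, np.floor(Y).astype(int)))
        models.append((A, b, counts))
    return N, models

def rp_histogram_density(x, N, models):
    """p(x) averaged over the H projections: count / (N * v) per projection,
    with v = |det A_i|^{-1} the bin volume back in the original space."""
    acc = 0.0
    for A, b, counts in models:
        y = A @ x + b
        v = 1.0 / abs(np.linalg.det(A))            # v = |A_i|^{-1}
        acc += counts.get(tuple(np.floor(y).astype(int)), 0) / (N * v)
    return acc / len(models)

X = np.random.default_rng(1).normal(size=(2000, 2))
N, models = train_rp_histograms(X)
p_center = rp_histogram_density(np.zeros(2), N, models)    # near the data mass
p_far = rp_histogram_density(np.array([8.0, 8.0]), N, models)  # far in the tail
```

The factor v = |A_i|^{-1} is the volume, in the original feature space, of a unit bin in the projected space; dividing the bin's relative count by N·v turns the count into a density estimate, and averaging over the H projections smooths out the bin-placement artifacts of any single histogram.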
The identification matching of Step S4 is implemented as follows:
The input feature data set is fed into the probability model trained for each speaker, and the likelihood value is computed. The likelihood of the test feature set with respect to the j-th speaker model is obtained; the speaker's number is determined by taking the maximum of the likelihood values.
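The decision rule can be sketched with two toy speaker models; the names, the log-likelihood accumulation, and the eps floor for zero probabilities are illustrative choices (the patent instead handles empty bins with the compound-Dirichlet prior):

```python
import numpy as np

def identify_speaker(test_feats, speaker_models, eps=1e-12):
    """Score a set of test feature vectors against each speaker's density
    model p(x) and return the speaker id with maximum log-likelihood."""
    scores = {spk: sum(np.log(max(p(x), eps)) for x in test_feats)
              for spk, p in speaker_models.items()}
    return max(scores, key=scores.get), scores

# Toy models: speaker 0 concentrates near 0, speaker 1 near 3.
models = {0: lambda x: float(np.exp(-np.sum(x ** 2))),
          1: lambda x: float(np.exp(-np.sum((x - 3.0) ** 2)))}
feats = [np.array([0.1, -0.2]), np.array([0.3, 0.0])]
best, scores = identify_speaker(feats, models)   # best is speaker 0
```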
The embodiments of the proposed text-independent speaker identification scheme based on composite differential line spectral frequency features and a random projection histogram model have been set forth above with reference to the accompanying drawings. From the description of the above embodiments, one of ordinary skill in the art will clearly recognize that the invention can be realized by software plus the necessary general hardware platform. Based on this understanding, the part of the technical scheme of the invention that in essence contributes beyond the prior art can be embodied in the form of a computer software product, stored on a storage medium, comprising instructions that cause one or more computer devices to execute the method described in the embodiments of the invention.
Specific embodiments and application scopes may vary according to the idea of the invention; in summary, this description should not be construed as limiting the invention.
The above embodiments of the invention do not limit the scope of protection of the invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the invention shall be included within the scope of protection of the invention.
Claims (10)
1. A text-independent speaker identification device based on a random projection histogram model, characterized by comprising the following steps:
One. Feature extraction step:
A. Differential line spectral frequency feature extraction: transform the K-dimensional, non-normalized, monotonically increasing line spectral frequency features obtained from the linear predictive coding model of speech into (K+1)-dimensional normalized differential line spectral frequency features;
B. Composite differential line spectral frequency feature generation: combine the differential line spectral frequency features of 3 adjacent frames into a composite differential line spectral frequency feature that expresses the dynamics of the signal.
Two. Random projection histogram model training step: for each speaker's training speech, extract T frames of composite differential line spectral frequency features as one training data set according to Step One; apply H random transforms to this training data set by the random projection method to obtain H groups of training features; compute a histogram for each group, and use the average histogram of the H groups of training features as this speaker's probability model; each speaker thus obtains his or her own trained model.
Three. Identification matching step: after a speech segment is input, generate one group of features by the method of Step One, input the features into each speaker model trained in Step Two, compute the likelihood value of the feature group under each model, and take the maximum likelihood value to determine the speaker's number.
2. The speaker identification method according to claim 1, wherein in Step One A, during the differential line spectral frequency feature extraction, the traditional line spectral frequency feature vector is normalized by dividing by π, each pair of adjacent elements of the vector is subtracted to obtain the differential feature vector, and one regularizing element is added to guarantee that the 1-norm of the resulting differential vector is 1.
3. The speaker identification method according to claim 1, wherein in Step One B, the composite differential line spectral frequency feature is obtained by combining the differential line spectral frequency features of 3 adjacent frames, with a spacing of 1 between adjacent frames.
4. The speaker identification method according to claim 1, wherein in Step Two the random transform is y = Ax + b, where A is a random rotation-scaling matrix and b is a random translation vector.
5. The random translation vector b according to claim 4, wherein each element of b is uniformly distributed between 0 and 1.
6. The random rotation-scaling matrix A according to claim 4, wherein A is the product of a unitary orthogonal matrix U and a diagonal matrix Λ.
7. The unitary orthogonal matrix U according to claim 6, wherein U is generated from a square matrix V all of whose elements are uniformly distributed between 0 and 1: V undergoes a QR decomposition, and the upper-left element of the resulting matrix Q is revised, according to whether the determinant of Q is 1, to obtain U.
8. The diagonal matrix Λ according to claim 6, wherein the values of the diagonal elements of Λ are determined by the relaxation parameters θ_min = 0 and θ_max = 2 and by h_j, the optimal histogram bin width of the j-th dimension of the training features, which is decided by the distribution of the training data.
9. The speaker identification method according to claim 1, wherein in Step Two the speaker's probability model is defined such that the first half of the equation defines the probability estimate at the empty (zero) bins of the histogram and the second half defines the estimation method of the average histogram probability, where π_ZeroDens is the probability that an empty bin occurs in the statistical histogram, p(x|ZeroDens) is the prior probability at the empty positions, and p(x|A_i, b_i) is the histogram probability estimate of the input test data x under the i-th transform.
10. The prior probability p(x|ZeroDens) of the empty positions according to claim 9, wherein this prior is estimated using a compound Dirichlet distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410232526.2A CN103985384B (en) | 2014-05-28 | 2014-05-28 | Text-independent speaker identification device based on random projection histogram model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103985384A true CN103985384A (en) | 2014-08-13 |
CN103985384B CN103985384B (en) | 2015-04-15 |
Family
ID=51277327
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410232526.2A Active CN103985384B (en) | 2014-05-28 | 2014-05-28 | Text-independent speaker identification device based on random projection histogram model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103985384B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103685185A (en) * | 2012-09-14 | 2014-03-26 | 上海掌门科技有限公司 | Mobile equipment voiceprint registration and authentication method and system |
CN103207961A (en) * | 2013-04-23 | 2013-07-17 | 曙光信息产业(北京)有限公司 | User verification method and device |
Non-Patent Citations (1)
- Zhanyu Ma, Arne Leijon, W. Bastiaan Kleijn, "Vector Quantization of LSF Parameters With a Mixture of Dirichlet Distributions", IEEE Transactions on Audio, Speech, and Language Processing.
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108630207A (en) * | 2017-03-23 | 2018-10-09 | 富士通株式会社 | Method for identifying speaker and speaker verification's equipment |
CN112331215A (en) * | 2020-10-26 | 2021-02-05 | 桂林电子科技大学 | Voiceprint recognition template protection algorithm |
CN112331215B (en) * | 2020-10-26 | 2022-11-15 | 桂林电子科技大学 | Voiceprint recognition template protection algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN103985384B (en) | 2015-04-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yujin et al. | Research of speaker recognition based on combination of LPCC and MFCC | |
CN102800316B (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN102820033B (en) | Voiceprint identification method | |
CN109637545B (en) | Voiceprint recognition method based on one-dimensional convolution asymmetric bidirectional long-short-time memory network | |
CN103345923B (en) | A kind of phrase sound method for distinguishing speek person based on rarefaction representation | |
CN105869624A (en) | Method and apparatus for constructing speech decoding network in digital speech recognition | |
CN105355214A (en) | Method and equipment for measuring similarity | |
CN102637433A (en) | Method and system for identifying affective state loaded in voice signal | |
Zhang et al. | Speech emotion recognition using combination of features | |
CN103456302A (en) | Emotion speaker recognition method based on emotion GMM model weight synthesis | |
Wataraka Gamage et al. | Speech-based continuous emotion prediction by learning perception responses related to salient events: A study based on vocal affect bursts and cross-cultural affect in AVEC 2018 | |
CN103258531A (en) | Harmonic wave feature extracting method for irrelevant speech emotion recognition of speaker | |
Aliaskar et al. | Human voice identification based on the detection of fundamental harmonics | |
Shen et al. | Rars: Recognition of audio recording source based on residual neural network | |
CN104464738A (en) | Vocal print recognition method oriented to smart mobile device | |
CN103985384B (en) | Text-independent speaker identification device based on random projection histogram model | |
Koolagudi et al. | Speaker recognition in the case of emotional environment using transformation of speech features | |
Rodman et al. | Forensic speaker identification based on spectral moments | |
Feng et al. | Speech emotion recognition based on LSTM and Mel scale wavelet packet decomposition | |
Saritha et al. | Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal | |
Herrera-Camacho et al. | Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE | |
Jian et al. | An embedded voiceprint recognition system based on GMM | |
CN103871411A (en) | Text-independent speaker identifying device based on line spectrum frequency difference value | |
Li et al. | Fast speaker clustering using distance of feature matrix mean and adaptive convergence threshold | |
Fan et al. | Deceptive Speech Detection based on sparse representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |