CN105590628A - Adaptive adjustment-based Gaussian mixture model voice identification method - Google Patents
- Publication number
- CN105590628A (application CN201510977077.9A)
- Authority
- CN
- China
- Prior art keywords: gaussian, sigma, subcomponent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
Abstract
The invention relates to a voice identification method based on an adaptively adjusted Gaussian mixture model. The method uses the sum of absolute values of probability differences to improve the traditional Gaussian mixture model, dynamically adjusting the contribution of each Gaussian subcomponent when fitting the voice signal features so that every subcomponent is used to the greatest extent and the information is fully expressed, thereby improving speaker verification performance.
Description
Technical Field
The invention relates to human voice (speaker) recognition technology, in particular to a voice recognition method based on an adaptively adjusted Gaussian mixture model.
Background
Human voice recognition is a technology that identifies a speaker from his or her voice using signal processing and probability-theory methods. It mainly comprises two steps: training a speaker model and recognizing the speaker's voice.
The characteristic parameters mainly adopted for human voice recognition include Mel-frequency cepstral coefficients (MFCC), linear predictive coding coefficients (LPCC) and perceptually weighted linear prediction coefficients (PLP). Common recognition algorithms include the support vector machine (SVM), the Gaussian mixture model (GMM) and vector quantization (VQ). The Gaussian mixture model is widely applied in the speech recognition field.
The degree of mixing (number of components) of the traditional Gaussian mixture model is fixed, while the voice characteristics of human speech are diverse: some Gaussian subcomponents in the feature distribution carry little information and others carry much. This can cause over-fitting or under-fitting and reduces the speaker verification rate.
Disclosure of Invention
Aiming at these problems of voice recognition with the traditional Gaussian mixture model, the invention provides a voice recognition method based on an adaptively adjusted Gaussian mixture model: on the basis of the traditional model, the degree of mixing and the Gaussian subcomponents are adaptively adjusted to improve the recognition probability.
The technical scheme of the invention is as follows. A human voice recognition method based on an adaptively adjusted Gaussian mixture model comprises the following steps:
1) training by using the voice characteristic parameters of the speaker to generate a traditional Gaussian mixture model corresponding to the speaker;
2) calculating the probability of each frame of data generated by each Gaussian sub-component in the Gaussian mixture model, and then calculating the sum of absolute values of probability differences of the same frame of data generated by different Gaussian sub-components;
3) comparing the minimum of the sum values obtained in step 2) with a set low threshold θ3; if it is less than θ3, merging the two Gaussian subcomponents corresponding to that minimum to obtain a new Gaussian subcomponent;
4) comparing the maximum of the obtained sum values with a set high threshold θ1; if it is greater than θ1, re-assigning the weights of the two Gaussian subcomponents corresponding to that maximum to obtain two new Gaussian subcomponents;
5) comparing the maximum Gaussian subcomponent weight with a set threshold θ2; if it is greater than θ2, splitting that Gaussian subcomponent to obtain two new Gaussian subcomponents;
6) replacing the original Gaussian subcomponents with the newly obtained ones and, through multiple iterations, obtaining the finally optimized Gaussian model; then inputting the characteristic parameters of the voice to be recognized, calculating the probability that each speaker's Gaussian mixture model generates the voice signal, and taking the speaker whose model yields the largest probability as the target speaker, i.e. the true speaker of the tested voice.
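The decision logic of steps 3)-5) above can be sketched as follows. This is an illustrative sketch only: the threshold values, the toy responsibility matrix and the function name are assumptions (the patent states only that the thresholds are empirical values).

```python
import numpy as np

# Illustrative thresholds; the patent only says these are empirical values.
theta1, theta2, theta3 = 2.0, 0.5, 0.1

def decide_actions(gamma, weights):
    """Given the (L, K) per-frame responsibility matrix and the component
    weights, report which adjustment steps 3)-5) would fire."""
    # Sum over frames of |gamma(i, a) - gamma(i, b)| for every pair (a, b).
    diff = np.abs(gamma[:, :, None] - gamma[:, None, :]).sum(axis=0)
    off = diff[np.triu_indices_from(diff, k=1)]  # distinct pairs only
    actions = []
    if off.min() < theta3:
        actions.append("merge")      # step 3): overlapping subcomponents
    if off.max() > theta1:
        actions.append("reweight")   # step 4): disjoint subcomponents
    if weights.max() > theta2:
        actions.append("split")      # step 5): overloaded subcomponent
    return actions

# Toy responsibility matrix: components 0 and 1 behave identically (merge),
# and component 0 also carries most of the weight (split).
gamma = np.array([[0.45, 0.45, 0.10],
                  [0.40, 0.40, 0.20]])
acts = decide_actions(gamma, np.array([0.6, 0.2, 0.2]))
```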
The sum of absolute values of the probability differences for the same frame signal in step 2) is calculated as

D(a, b) = Σ_{i=1…L} | γ(i, a) − γ(i, b) |, where γ(i, a) = π_a N(x_i | μ_a, σ_a) / Σ_{j=1…K} π_j N(x_i | μ_j, σ_j),

where λ_n = {π_n, μ_n, σ_n} denotes the n-th Gaussian subcomponent, π_n is its weight, and μ_n and σ_n are its expectation and covariance matrix; each frame of data is fitted by K Gaussian subcomponents and there are L frames in total; x_i (i = 1, 2, …, L) is the i-th input frame of the speech signal; a and b are the ordinal numbers of two different Gaussian subcomponents; π_a is the weight of the a-th subcomponent and N(x_i | μ_a, σ_a) its probability density, with μ_a and σ_a its expectation and covariance matrix; the subscripts j and b denote the sequence numbers of the j-th and b-th Gaussian subcomponents.
The combination processing in step 3) is as follows:

where a denotes the sequence number of the a-th Gaussian subcomponent, b the sequence number of the b-th, and T the sequence number of the new Gaussian subcomponent after combination; the newly added Gaussian subcomponent λ_T replaces the original subcomponents λ_a and λ_b.
In step 4), the weights of the two Gaussian subcomponents a and b are redistributed to obtain two new Gaussian subcomponents, processed as follows:

where the expectations and covariance matrices of the two Gaussian distributions remain unchanged.
The Gaussian subcomponent in step 5) is split as follows:

where σ_a,max is the maximum element on the diagonal of σ_a and e = [1, 1, …, 1] is an all-ones vector; the two new Gaussian subcomponents λ_T and λ_{T+1} replace the original subcomponent λ_a.
The invention has the following beneficial effects. The voice recognition method based on an adaptively adjusted Gaussian mixture model improves the traditional Gaussian mixture model by using the sum of absolute values of probability differences, dynamically adjusts each Gaussian subcomponent's contribution when fitting the voice signal features, makes maximum use of every subcomponent to fully express the useful information, and thereby improves speaker verification performance.
Drawings
FIG. 1 is a schematic diagram of a training process of adaptively adjusting a Gaussian mixture model according to the present invention;
FIG. 2 is a schematic flow chart of Gaussian subcomponent weight assignment in accordance with the present invention;
FIG. 3 is a schematic flow diagram of the improved Gaussian subcomponent splitting of the present invention;
FIG. 4 is a flow chart illustrating the improved Gaussian subcomponent combination of the present invention.
Detailed Description
The experimental data in this embodiment consists of recorded voices of 43 participants (23 women, 20 men) at a sampling rate of 8000 Hz. Each participant recorded 5 voice segments in a quiet environment; each segment is a four-character idiom.
And training a certain amount of voice of different speakers to obtain traditional Gaussian mixture models corresponding to the different speakers, and optimizing the different traditional Gaussian mixture models according to the self-adaptive adjustment rule.
In the training process, three voice segments of each speaker are randomly selected and trained to obtain the optimized Gaussian mixture model corresponding to that speaker.
During testing, the recognition rate of each optimized Gaussian mixture model is tested with the remaining voice segments of the different speakers.
As shown in fig. 1, the flow chart of adaptively adjusting the gaussian mixture model training process includes the following steps:
the method comprises the steps of preprocessing a voice signal, wherein the preprocessing step comprises end point detection, framing, windowing and extracting a characteristic parameter, namely a Mel cepstrum coefficient, and a 12-dimensional Mel cepstrum coefficient (MFCC) is selected in the experiment.
The extracted MFCC parameters are then trained with the EM algorithm to obtain the traditional Gaussian mixture model of the speaker. The degree of mixing of the traditional model is K; the model is a linear superposition of K Gaussian subcomponents, and its probability density is

p(x) = Σ_{n=1…K} π_n N(x | μ_n, σ_n),

where π_n is the weight of the n-th Gaussian subcomponent and N(x | μ_n, σ_n) its probability density function; in this embodiment K = 16; μ and σ are the expectation and covariance matrix of a Gaussian subcomponent; D is the dimension of the data x; λ_n = {π_n, μ_n, σ_n} denotes the n-th Gaussian subcomponent, where n may take any integer value from 1 to K. The probability that the speaker to be identified belongs to the current model is obtained by computing p(x).
Let the i-th frame of the speaker's data be x_i (i = 1, 2, …, L). The specific estimation steps of the EM algorithm are as follows:
step one, if the first execution is carried out, initializing parameters { pi, mu, sigma } of a Gaussian mixture model; if the first execution is not performed, the parameters of the Gaussian mixture model are the result obtained by the previous iteration calculation. Then, the probability γ (i, n) of each frame data generated by the K gaussian subcomponents respectively is estimated (representing the probability of the ith frame data generated by the nth gaussian subcomponent):
j in the formula represents the serial number of the jth Gaussian subcomponent; n represents the serial number of the nth Gaussian sub-component, the total number of the Gaussian sub-components is K, i represents the ith frame data of the speaker, and the L frame data are shared.
Step 2: use the result of step 1 to estimate the parameters of the Gaussian model:

π_n = (1/L) Σ_{i=1…L} γ(i, n),
μ_n = Σ_{i=1…L} γ(i, n) x_i / Σ_{i=1…L} γ(i, n),
σ_n = Σ_{i=1…L} γ(i, n) (x_i − μ_n)(x_i − μ_n)^T / Σ_{i=1…L} γ(i, n).
and thirdly, repeating the first step and the second step until the value of the likelihood function tends to be stable.
The traditional Gaussian mixture model obtained above is then optimized.
Using the parameters of the trained traditional Gaussian model, compute the probability that each frame of data is generated by each of the K Gaussian subcomponents. With L frames of data this yields a K × L matrix; for example, the entry in row 1, column 2 is the probability that the 2nd frame was generated by the 1st Gaussian subcomponent. Then compute the absolute value of the difference between the probabilities with which two different Gaussian subcomponents generate the same frame, and sum these absolute differences over all frames for each pair of subcomponents. For the a-th and b-th Gaussian subcomponents this sum is

D(a, b) = Σ_{i=1…L} | γ(i, a) − γ(i, b) |,

where j denotes the sequence number of the j-th Gaussian subcomponent; a and b denote the sequence numbers of the a-th and b-th subcomponents, out of K subcomponents in total; and i denotes the i-th of the speaker's L frames of data.
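A minimal sketch of building the pairwise sums of absolute responsibility differences from the responsibility matrix; the matrix values and the function name are illustrative, not from the patent.

```python
import numpy as np

def abs_diff_sums(gamma):
    """Sum over frames of |gamma(i, a) - gamma(i, b)| for every pair (a, b).
    gamma: (L, K) responsibility matrix. Returns a symmetric (K, K) matrix."""
    # Broadcast (L, K, 1) against (L, 1, K), then sum the absolute
    # differences over the frame axis.
    diff = np.abs(gamma[:, :, None] - gamma[:, None, :])
    return diff.sum(axis=0)

# Toy responsibility matrix: 3 frames, 3 subcomponents.
gamma = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.1, 0.8]])
D = abs_diff_sums(gamma)
```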
Compare the minimum of the sum values obtained in the previous step with the low threshold θ3. If it is less than θ3, the two Gaussian subcomponents are considered to fit the same part of the speech signal features, i.e. their information overlaps, and they are combined into a new Gaussian subcomponent as follows:
where a denotes the sequence number of the a-th Gaussian subcomponent, b the sequence number of the b-th, and T the sequence number of the new Gaussian subcomponent after combination. The low threshold in this step is an empirical value obtained after many experiments.
The newly added Gaussian subcomponent λ_T replaces the original subcomponents λ_a and λ_b, so the degree of mixing of the Gaussian mixture model decreases by one.
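A hedged sketch of the merge step. The patent's exact merge formula is not reproduced in this text, so the standard moment-matching merge of two Gaussians is assumed for illustration; function and variable names are not from the patent.

```python
import numpy as np

def merge_components(pi_a, mu_a, cov_a, pi_b, mu_b, cov_b):
    """Merge two Gaussian subcomponents (lambda_a, lambda_b) into one
    (lambda_T) by matching the combined first and second moments."""
    pi_t = pi_a + pi_b                       # combined weight
    w_a, w_b = pi_a / pi_t, pi_b / pi_t      # relative contributions
    mu_t = w_a * mu_a + w_b * mu_b           # weighted mean
    d_a, d_b = mu_a - mu_t, mu_b - mu_t
    # Covariance: within-component spread plus between-means spread.
    cov_t = (w_a * (cov_a + np.outer(d_a, d_a))
             + w_b * (cov_b + np.outer(d_b, d_b)))
    return pi_t, mu_t, cov_t

pi_t, mu_t, cov_t = merge_components(
    0.3, np.array([0.0, 0.0]), np.eye(2),
    0.1, np.array([1.0, 0.0]), np.eye(2))
```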
Compare the maximum of the obtained sum values with the high threshold θ1. If it is greater than θ1, the two Gaussian subcomponents are considered to fit different parts of the speech signal features; in this case their weights are re-assigned as follows:

where the expectations and covariance matrices of the two Gaussian subcomponents remain unchanged.
The high threshold in this step is an empirical value taken after a number of experiments.
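A hedged sketch of the weight re-assignment step. The patent's exact formula is not reproduced in this text; equal redistribution of the two weights is assumed purely for illustration (the text states only that the expectations and covariance matrices stay unchanged).

```python
def reassign_weights(pi_a, pi_b):
    """Re-assign the weights of two subcomponents that fit different parts
    of the feature distribution. Assumption: the combined weight is shared
    equally; means and covariances are left untouched, per the text."""
    pi_new = (pi_a + pi_b) / 2.0
    return pi_new, pi_new

a, b = reassign_weights(0.5, 0.1)
```

Note that this keeps the total mixture weight unchanged, which any concrete re-assignment rule would also have to do for the weights to remain a valid distribution.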
Compare the maximum Gaussian subcomponent weight with the weight threshold θ2. If it is greater than θ2, the subcomponent contains too much information and needs to be split, as follows:

where σ_a,max is the maximum element on the diagonal of σ_a and e = [1, 1, …, 1] is an all-ones vector. The two new Gaussian subcomponents λ_T and λ_{T+1} replace the original subcomponent λ_a, and the degree of mixing of the Gaussian mixture model increases by one.
The weight threshold in this step is an empirical value taken after a number of experiments.
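A hedged sketch of the split step, following the text's hint that the mean is perturbed using the largest diagonal element of σ_a (σ_a,max) and an all-ones vector e. The scaling factor `eps` and the equal sharing of weight and covariance are assumptions; the exact formula is not reproduced in this text.

```python
import numpy as np

def split_component(pi_a, mu_a, cov_a, eps=0.5):
    """Split one over-weighted subcomponent lambda_a into two new ones
    (lambda_T, lambda_T+1) by perturbing the mean in opposite directions."""
    sigma_max = np.max(np.diag(cov_a))   # largest diagonal element of cov_a
    e = np.ones_like(mu_a)               # all-ones vector
    shift = eps * np.sqrt(sigma_max) * e
    pi_new = pi_a / 2.0                  # weight shared equally (assumption)
    return ((pi_new, mu_a + shift, cov_a.copy()),
            (pi_new, mu_a - shift, cov_a.copy()))

(p1, m1, c1), (p2, m2, c2) = split_component(
    0.6, np.zeros(2), np.diag([4.0, 1.0]))
```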
An iteration count M is preset; the above steps are repeated with the new Gaussian subcomponents, and after M executions the optimized Gaussian mixture model is obtained. Optimizing the model of each speaker yields the optimized Gaussian mixture model corresponding to each speaker. In this embodiment M = 10.
For the voice signal x to be recognized, compute the probability that it is generated by each of the different Gaussian mixture models and take the largest; the target speaker corresponding to that model is the true speaker of the tested voice.
For example, if the probability that a segment of speech to be recognized is generated by the 3rd Gaussian mixture model is the largest, the speech was uttered by the 3rd speaker.
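The final identification step — score every speaker model against the test frames and take the argmax — can be sketched as follows; the toy one-component "speaker models" are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """Total log-probability of the frames X under one speaker's GMM."""
    dens = sum(w * multivariate_normal.pdf(X, m, c)
               for w, m, c in zip(weights, means, covs))
    return np.log(dens).sum()

def identify(X, speaker_models):
    """Return the index of the model that best fits the test utterance."""
    scores = [gmm_log_likelihood(X, *model) for model in speaker_models]
    return int(np.argmax(scores))

# Two toy one-component speaker models in 2-D.
model_0 = ([1.0], [np.zeros(2)], [np.eye(2)])
model_1 = ([1.0], [np.full(2, 5.0)], [np.eye(2)])
X = np.full((3, 2), 5.0)  # test frames near speaker 1's mean
best = identify(X, [model_0, model_1])
```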
Claims (5)
1. A human voice recognition method based on an adaptively adjusted Gaussian mixture model, characterized by comprising the following steps:
1) training by using the voice characteristic parameters of the speaker to generate a traditional Gaussian mixture model corresponding to the speaker;
2) calculating the probability of each frame of data generated by each Gaussian sub-component in the Gaussian mixture model, and then calculating the sum of absolute values of probability differences of the same frame of data generated by different Gaussian sub-components;
3) comparing the minimum of the sum values obtained in step 2) with a set low threshold θ3; if it is less than θ3, merging the two Gaussian subcomponents corresponding to that minimum to obtain a new Gaussian subcomponent;
4) comparing the maximum of the obtained sum values with a set high threshold θ1; if it is greater than θ1, re-assigning the weights of the two Gaussian subcomponents corresponding to that maximum to obtain two new Gaussian subcomponents;
5) comparing the maximum Gaussian subcomponent weight with a set threshold θ2; if it is greater than θ2, splitting that Gaussian subcomponent to obtain two new Gaussian subcomponents;
6) replacing the original Gaussian subcomponents with the newly obtained ones and, through multiple iterations, obtaining the finally optimized Gaussian model; then inputting the characteristic parameters of the voice to be recognized, calculating the probability that each speaker's Gaussian mixture model generates the voice signal, and taking the speaker whose model yields the largest probability as the target speaker, i.e. the true speaker of the tested voice.
2. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 1, characterized in that the sum of absolute values of the probability differences for the same frame signal in step 2) is calculated as

D(a, b) = Σ_{i=1…L} | γ(i, a) − γ(i, b) |, where γ(i, a) = π_a N(x_i | μ_a, σ_a) / Σ_{j=1…K} π_j N(x_i | μ_j, σ_j),

where λ_n = {π_n, μ_n, σ_n} denotes the n-th Gaussian subcomponent, π_n is its weight, and μ_n and σ_n are its expectation and covariance matrix; each frame of data is fitted by K Gaussian subcomponents and there are L frames in total; x_i (i = 1, 2, …, L) is the i-th input frame of the speech signal; a and b are the ordinal numbers of two different Gaussian subcomponents; π_a is the weight of the a-th subcomponent and N(x_i | μ_a, σ_a) its probability density, with μ_a and σ_a its expectation and covariance matrix; the subscripts j and b denote the sequence numbers of the j-th and b-th Gaussian subcomponents.
3. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 2, characterized in that the combination processing in step 3) is as follows:

where a denotes the sequence number of the a-th Gaussian subcomponent, b the sequence number of the b-th, and T the sequence number of the new Gaussian subcomponent after combination; the newly added Gaussian subcomponent λ_T replaces the original subcomponents λ_a and λ_b.
4. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 2, characterized in that step 4) redistributes the weights of the two Gaussian subcomponents a and b to obtain two new Gaussian subcomponents, processed as follows:

where the expectations and covariance matrices of the two Gaussian distributions remain unchanged.
5. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 2, characterized in that the Gaussian subcomponent in step 5) is split as follows:

where σ_a,max is the maximum element on the diagonal of σ_a and e = [1, 1, …, 1] is an all-ones vector; the two new Gaussian subcomponents λ_T and λ_{T+1} replace the original subcomponent λ_a.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510977077.9A CN105590628A (en) | 2015-12-22 | 2015-12-22 | Adaptive adjustment-based Gaussian mixture model voice identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105590628A true CN105590628A (en) | 2016-05-18 |
Family
ID=55930150
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510977077.9A Pending CN105590628A (en) | 2015-12-22 | 2015-12-22 | Adaptive adjustment-based Gaussian mixture model voice identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105590628A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102324232A (en) * | 2011-09-12 | 2012-01-18 | 辽宁工业大学 | Method for recognizing sound-groove and system based on gauss hybrid models |
CN102360418A (en) * | 2011-09-29 | 2012-02-22 | 山东大学 | Method for detecting eyelashes based on Gaussian mixture model and maximum expected value algorithm |
CN102820033A (en) * | 2012-08-17 | 2012-12-12 | 南京大学 | Voiceprint identification method |
CN104485108A (en) * | 2014-11-26 | 2015-04-01 | 河海大学 | Noise and speaker combined compensation method based on multi-speaker model |
Non-Patent Citations (3)
Title |
---|
Xiong Huaqiao: "Research on Speaker Recognition Methods Based on Model Clustering", China Masters' Theses Full-text Database, Information Science and Technology |
Wang Yunqi et al.: "Adaptive Gaussian Mixture Model and Its Application to Speaker Recognition", Communications Technology |
Wang Yunqi: "Adaptive Gaussian Mixture Model and Its Application to Speaker Recognition", China Masters' Theses Full-text Database, Information Science and Technology |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | C06 | Publication |
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160518