CN105590628A - Adaptive adjustment-based Gaussian mixture model voice identification method - Google Patents

Adaptive adjustment-based Gaussian mixture model voice identification method

Info

Publication number
CN105590628A
CN105590628A (application CN201510977077.9A)
Authority
CN
China
Prior art keywords
gaussian
sigma
subcomponent
mixture model
subcomponents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510977077.9A
Other languages
Chinese (zh)
Inventor
沈希忠
包玲玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Technology filed Critical Shanghai Institute of Technology
Priority to CN201510977077.9A priority Critical patent/CN105590628A/en
Publication of CN105590628A publication Critical patent/CN105590628A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Stereophonic System (AREA)

Abstract

The invention relates to a human voice recognition method based on an adaptively adjusted Gaussian mixture model. The traditional Gaussian mixture model is improved by using the sum of the absolute values of probability differences: the contribution each Gaussian subcomponent makes when fitting the features of the speech signal is used to adjust the subcomponents dynamically, so that every subcomponent is exploited to the fullest and useful information is fully expressed, thereby improving the recognition performance of speaker verification.

Description

Human voice recognition method based on an adaptively adjusted Gaussian mixture model

Technical Field

The invention relates to human voice recognition technology, and in particular to a human voice recognition method based on an adaptively adjusted Gaussian mixture model.

Background Art

Human voice recognition identifies a speaker from his or her speech using methods from signal processing and probability theory. It consists of two main steps: training the speaker models and recognizing the speaker's speech.

The feature parameters commonly used in voice recognition include Mel cepstral coefficients (MFCC), linear predictive coding coefficients (LPCC), and perceptually weighted linear prediction coefficients (PLP). Common algorithms include support vector machines (SVM), Gaussian mixture models (GMM), and vector quantization (VQ), among which the Gaussian mixture model is very widely used in the speech recognition field.

The mixture degree of the traditional Gaussian mixture model is fixed, whereas the speech features of the human voice are diverse: some Gaussian subcomponents in the feature distribution carry little information while others carry a great deal. This mismatch leads to overfitting or underfitting and, in turn, to a lower recognition rate in speaker verification.

Summary of the Invention

To address the problems of the traditional Gaussian mixture model in human voice recognition, the present invention proposes a human voice recognition method based on an adaptively adjusted Gaussian mixture model: on top of the traditional model, the mixture degree and the Gaussian subcomponents are adjusted adaptively so as to improve the probability of correct recognition.

The technical scheme of the present invention is a human voice recognition method based on an adaptively adjusted Gaussian mixture model, comprising the following steps:

1) Train a traditional Gaussian mixture model for each speaker from that speaker's speech feature parameters.

2) Compute the probability that each frame of data is generated by each Gaussian subcomponent of the mixture model; then compute, for every pair of distinct subcomponents, the sum over all frames of the absolute values of the differences between the probabilities with which they generate the same frame.

3) Take the minimum of the sums obtained in step 2) and compare it with a preset low threshold θ_3; if it is below θ_3, merge the two Gaussian subcomponents corresponding to that minimum into a new subcomponent.

4) Take the maximum of the sums and compare it with a preset high threshold θ_1; if it exceeds θ_1, redistribute the weights of the two Gaussian subcomponents corresponding to that maximum, yielding two new subcomponents.

5) Take the maximum of the Gaussian subcomponent weights and compare it with a preset threshold θ_2; if it exceeds θ_2, split that subcomponent into two new subcomponents.

6) Replace the original Gaussian subcomponents with the newly obtained ones and iterate until the final optimized Gaussian model is reached. Then input the speech feature parameters to be recognized, compute the probability that the speech signal is generated by each speaker's Gaussian mixture model, and declare the speaker whose model yields the maximum probability the true speaker of the test speech.

In step 2), the absolute value of the probability difference for the same frame of signal is computed as:

$$p_{\mathrm{diff}} = \left| \frac{\pi_a N(x_i \mid \mu_a, \sigma_a)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)} - \frac{\pi_b N(x_i \mid \mu_b, \sigma_b)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)} \right|$$

where λ_n = {π_n, μ_n, σ_n} denotes the n-th Gaussian subcomponent, π_n is its weight, and μ_n and σ_n are its mean and covariance matrix. Each frame of data is fitted by the K Gaussian subcomponents; there are L frames in total, and x_i (i = 1, 2, …, L) is the i-th input frame of the speech signal. The indices a and b refer to two distinct Gaussian subcomponents: π_a is the weight of the a-th subcomponent, N(x_i | μ_a, σ_a) its probability density, and μ_a and σ_a its mean and covariance matrix. The subscript j denotes the serial number of the j-th Gaussian subcomponent and the subscript b that of the b-th.

The merge in step 3) proceeds as follows:

$$\pi_T = \pi_a + \pi_b, \qquad \omega_a = \frac{\pi_a}{\pi_T}, \quad \omega_b = \frac{\pi_b}{\pi_T}, \qquad \mu_T = \omega_a \mu_a + \omega_b \mu_b, \qquad \sigma_T = \omega_a \sigma_a + \omega_b \sigma_b$$

where a and b are the serial numbers of the a-th and b-th Gaussian subcomponents and T is the serial number of the new merged subcomponent; the newly added Gaussian subcomponent λ_T replaces the original subcomponents λ_a and λ_b.

In step 4), the two Gaussian subcomponents a and b are assigned new weights, yielding two new subcomponents, as follows:

$$\pi_T = \alpha_1 (\pi_a + \pi_b), \qquad \pi_{T+1} = \alpha_2 (\pi_a + \pi_b)$$

where

$$\alpha_1 = \frac{\gamma_a}{\gamma_a + \gamma_b}, \qquad \alpha_2 = \frac{\gamma_b}{\gamma_a + \gamma_b}, \qquad \gamma_a = \frac{\pi_a N(x_i \mid \mu_a, \sigma_a)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)}, \qquad \gamma_b = \frac{\pi_b N(x_i \mid \mu_b, \sigma_b)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)},$$

and the means and covariance matrices of the two Gaussian distributions remain unchanged.

In step 5), the Gaussian subcomponent is split as follows:

$$\pi_T = \tfrac{1}{2}\pi_a, \quad \pi_{T+1} = \tfrac{1}{2}\pi_a, \qquad \mu_T = \mu_a + \tau E, \quad \mu_{T+1} = \mu_a - \tau E, \qquad \sigma_T = (1+\beta)^{-1}\sigma_a, \quad \sigma_{T+1} = (1+\beta)^{-1}\sigma_a$$

where β is the maximum value on the diagonal of σ_a and E = [1, 1, …, 1] is an all-ones vector (the original filing defines τ by an inline expression not reproduced here); the two new Gaussian subcomponents λ_T and λ_{T+1} replace the original subcomponent λ_a.

The beneficial effect of the present invention is as follows: the method improves the traditional Gaussian mixture model by using the sum of the absolute values of probability differences. The contribution each Gaussian subcomponent makes when fitting the features of the speech signal is used to adjust the subcomponents dynamically, so that every subcomponent is exploited to the fullest and useful information is fully expressed, thereby improving the recognition performance of speaker verification.

Brief Description of the Drawings

Figure 1 is a schematic flowchart of training the adaptively adjusted Gaussian mixture model of the present invention;

Figure 2 is a schematic flowchart of reassigning the weights of Gaussian subcomponents in the present invention;

Figure 3 is a schematic flowchart of the improved splitting of Gaussian subcomponents in the present invention;

Figure 4 is a schematic flowchart of the improved merging of Gaussian subcomponents in the present invention.

Detailed Description

The experimental data in this embodiment consist of recordings from 43 participants (23 female, 20 male) at a sampling rate of 8000 Hz. Each participant recorded 5 speech segments in a quiet environment, each segment being a four-character idiom.

Traditional Gaussian mixture models for the different speakers are trained from a certain amount of each speaker's speech, and each traditional model is then optimized according to the adaptive adjustment rules.

During training, three speech segments are arbitrarily selected from each speaker and used to train that speaker's optimized Gaussian mixture model.

During testing, the recognition rate of each optimized Gaussian mixture model is measured on the speakers' remaining speech segments.

Figure 1 shows the training flowchart of the adaptively adjusted Gaussian mixture model. The training process is as follows:

The speech signal is first preprocessed. Preprocessing comprises endpoint detection, framing, windowing, and extraction of the feature parameters, namely Mel cepstral coefficients; this experiment uses 12-dimensional MFCCs.
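
By way of illustration, this preprocessing stage could be sketched as follows in Python. The patent does not prescribe any toolkit; the librosa library, the trim-based endpoint detection, and the file path are assumptions made here:

```python
import librosa

def extract_mfcc(wav_path, sr=8000, n_mfcc=12):
    """Return an (L, 12) matrix of MFCC frames for one recording."""
    y, sr = librosa.load(wav_path, sr=sr)        # resample to 8 kHz
    y, _ = librosa.effects.trim(y, top_db=30)    # crude endpoint detection (assumed threshold)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # framing and windowing are internal
    return mfcc.T                                # one row per frame
```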

The extracted MFCC parameters are used to train, via the EM algorithm, the traditional Gaussian mixture model corresponding to each speaker. The mixture degree of the traditional model is K, i.e. the model is a linear superposition of K Gaussian subcomponents, and its probability density is computed as:

$$p(x) = \sum_{n=1}^{K} p(n)\, p(x \mid n) = \sum_{n=1}^{K} \pi_n N(x \mid \mu_n, \sigma_n)$$

$$N(x \mid \mu, \sigma) = \frac{1}{(2\pi)^{D/2}\,|\sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^{T}\sigma^{-1}(x-\mu)\right]$$

where π_n is the weight of the n-th Gaussian subcomponent and N(x | μ_n, σ_n) is its probability density function; in this embodiment K is 16. μ and σ denote a subcomponent's mean and covariance matrix, D is the dimensionality of the data x, λ_n = {π_n, μ_n, σ_n} denotes the n-th subcomponent, and n may take any integer value from 1 to K. Evaluating p(x) gives the probability that the speaker to be identified belongs to the current model.
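
The two formulas above translate directly into Python. The following sketch (numpy only, with names chosen here for illustration) is reused by the later examples in this description:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """N(x | mu, sigma): D-dimensional Gaussian density with full covariance."""
    D = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm)

def gmm_density(x, pis, mus, sigmas):
    """p(x) = sum over the K subcomponents of pi_n * N(x | mu_n, sigma_n)."""
    return sum(pi * gaussian_density(x, mu, s)
               for pi, mu, s in zip(pis, mus, sigmas))
```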

Let the speaker's i-th frame of data be x_i (i = 1, 2, …, L). The EM algorithm proceeds as follows:

Step one: on the first pass, initialize the parameters {π, μ, σ} of the Gaussian mixture model; on later passes, use the parameters computed in the previous iteration. Then estimate the probability γ(i, n) that frame i is generated by the n-th of the K Gaussian subcomponents:

$$\gamma(i, n) = \frac{\pi_n N(x_i \mid \mu_n, \sigma_n)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)}$$

where j is the serial number of the j-th Gaussian subcomponent, n that of the n-th, K is the total number of subcomponents, and i indexes the speaker's frames, of which there are L in total.

Step two: use the results of step one to estimate the unknown parameters of the Gaussian model:

$$\mu_n = \frac{1}{\Delta} \sum_{i=1}^{L} \gamma(i, n)\, x_i$$

$$\sigma_n = \frac{1}{\Delta} \sum_{i=1}^{L} \gamma(i, n)\,(x_i - \mu_n)(x_i - \mu_n)^{T}$$

$$\pi_n = \frac{\Delta}{L}$$

where $\Delta = \sum_{i=1}^{L} \gamma(i, n)$.

Step three: repeat steps one and two until the value of the likelihood function stabilizes.
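
As an illustrative sketch, one EM iteration over the formulas above can be written as follows, continuing the numpy helpers defined earlier (the variable names are chosen here, not taken from the patent):

```python
def em_step(X, pis, mus, sigmas):
    """One EM iteration on frames X of shape (L, D); updates the parameter lists in place."""
    L, _ = X.shape
    K = len(pis)
    # E-step: gamma[i, n] = probability that frame i was generated by subcomponent n
    gamma = np.array([[pis[n] * gaussian_density(X[i], mus[n], sigmas[n])
                       for n in range(K)] for i in range(L)])
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: re-estimate mean, covariance and weight of every subcomponent
    for n in range(K):
        delta = gamma[:, n].sum()                 # Delta = sum_i gamma(i, n)
        mus[n] = (gamma[:, n] @ X) / delta
        diff = X - mus[n]
        sigmas[n] = (gamma[:, n][:, None] * diff).T @ diff / delta
        pis[n] = delta / L
    return pis, mus, sigmas, gamma
```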

The resulting traditional Gaussian mixture model is then optimized.

First, using the parameters of the trained traditional Gaussian model, compute the probability that each frame of data is generated by each of the K Gaussian subcomponents. With L frames of data this yields a K×L matrix; for example, the entry in row 1, column 2 is the probability that frame 2 is generated by the first Gaussian subcomponent. Then, for each pair of distinct subcomponents, compute the absolute value of the difference between the probabilities with which they generate the same frame, and sum these absolute differences over all frames. The absolute value of the probability difference with which the a-th and b-th subcomponents generate the same frame of signal is:

$$p_{\mathrm{diff}} = \left| \frac{\pi_a N(x_i \mid \mu_a, \sigma_a)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)} - \frac{\pi_b N(x_i \mid \mu_b, \sigma_b)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)} \right|$$

where j is the serial number of the j-th Gaussian subcomponent, a that of the a-th, and b that of the b-th; the total number of subcomponents is K, and i indexes the speaker's frames, of which there are L in total.
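
In code, the pairwise sums of absolute probability differences might look like this, continuing the EM example (gamma is the L×K responsibility matrix from the E-step, the transpose of the K×L matrix described above):

```python
def pairwise_diff_sums(gamma):
    """For every pair (a, b) of distinct subcomponents, sum |gamma[i,a] - gamma[i,b]| over frames."""
    _, K = gamma.shape
    return {(a, b): float(np.abs(gamma[:, a] - gamma[:, b]).sum())
            for a in range(K) for b in range(a + 1, K)}
```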

Take the minimum of the sums obtained in the previous step and compare it with the low threshold θ_3. If it is below θ_3, the two subcomponents are considered to be fitting the same part of the speech signal's features, i.e. their information overlaps, so they are merged into a single new Gaussian subcomponent as follows:

$$\pi_T = \pi_a + \pi_b, \qquad \omega_a = \frac{\pi_a}{\pi_T}, \quad \omega_b = \frac{\pi_b}{\pi_T}, \qquad \mu_T = \omega_a \mu_a + \omega_b \mu_b, \qquad \sigma_T = \omega_a \sigma_a + \omega_b \sigma_b$$

where a and b are the serial numbers of the a-th and b-th Gaussian subcomponents and T is the serial number of the new merged subcomponent. The low threshold used in this step is an empirical value obtained from repeated experiments.

The newly added Gaussian subcomponent λ_T replaces the original subcomponents λ_a and λ_b, so the mixture degree of the Gaussian mixture model decreases by one.
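
A sketch of this merge rule, with list-based parameters for clarity (a and b are the pair whose sum fell below θ_3):

```python
def merge_components(pis, mus, sigmas, a, b):
    """Merge subcomponents a and b into one; the mixture degree K decreases by one."""
    pi_T = pis[a] + pis[b]
    w_a, w_b = pis[a] / pi_T, pis[b] / pi_T
    mu_T = w_a * mus[a] + w_b * mus[b]
    sigma_T = w_a * sigmas[a] + w_b * sigmas[b]
    keep = [n for n in range(len(pis)) if n not in (a, b)]
    return ([pis[n] for n in keep] + [pi_T],
            [mus[n] for n in keep] + [mu_T],
            [sigmas[n] for n in keep] + [sigma_T])
```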

Take the maximum of the sums obtained above and compare it with the high threshold θ_1. If it exceeds θ_1, the two subcomponents are considered to be fitting different parts of the speech signal's features, in which case their weights are redistributed as follows:

$$\pi_T = \alpha_1 (\pi_a + \pi_b), \qquad \pi_{T+1} = \alpha_2 (\pi_a + \pi_b)$$

where

$$\alpha_1 = \frac{\gamma_a}{\gamma_a + \gamma_b}, \qquad \alpha_2 = \frac{\gamma_b}{\gamma_a + \gamma_b}, \qquad \gamma_a = \frac{\pi_a N(x_i \mid \mu_a, \sigma_a)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)}, \qquad \gamma_b = \frac{\pi_b N(x_i \mid \mu_b, \sigma_b)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)},$$

and the means and covariance matrices of the two Gaussian subcomponents remain unchanged.

The high threshold used in this step is an empirical value obtained from repeated experiments.
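
A sketch of the weight reassignment follows. Because the patent states γ_a and γ_b for a single frame x_i without fixing i, the responsibilities are aggregated over all frames here, which is one possible reading:

```python
def reweight_components(pis, gamma, a, b):
    """Redistribute the combined weight of subcomponents a and b in proportion to their
    responsibilities; means and covariances stay unchanged. Aggregating gamma over all
    frames is an assumed reading of the per-frame formula above."""
    g_a, g_b = gamma[:, a].sum(), gamma[:, b].sum()
    total = pis[a] + pis[b]
    pis[a] = total * g_a / (g_a + g_b)   # alpha_1 * (pi_a + pi_b)
    pis[b] = total * g_b / (g_a + g_b)   # alpha_2 * (pi_a + pi_b)
    return pis
```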

Take the maximum of the Gaussian subcomponent weights and compare it with the weight threshold θ_2. If it exceeds θ_2, that subcomponent contains too much information and must be split, as follows:

$$\pi_T = \tfrac{1}{2}\pi_a, \quad \pi_{T+1} = \tfrac{1}{2}\pi_a, \qquad \mu_T = \mu_a + \tau E, \quad \mu_{T+1} = \mu_a - \tau E, \qquad \sigma_T = (1+\beta)^{-1}\sigma_a, \quad \sigma_{T+1} = (1+\beta)^{-1}\sigma_a$$

where β is the maximum value on the diagonal of σ_a and E = [1, 1, …, 1] is an all-ones vector (the original filing defines τ by an inline expression not reproduced here). The two new Gaussian subcomponents λ_T and λ_{T+1} replace the original subcomponent λ_a, so the mixture degree of the Gaussian mixture model increases by one.

The weight threshold used in this step is an empirical value obtained from repeated experiments.
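
A sketch of the split rule. β follows the text (largest diagonal entry of σ_a), while the relation τ = √β is purely an assumption made here, since the filing's inline definition of τ is not reproduced above:

```python
def split_component(pis, mus, sigmas, a):
    """Split subcomponent a into two; the mixture degree K increases by one.
    beta = largest diagonal entry of sigma_a (per the text); tau = sqrt(beta) is
    an assumption, as the filing's inline definition of tau is not reproduced."""
    beta = float(np.max(np.diag(sigmas[a])))
    tau = np.sqrt(beta)                      # assumed relation
    E = np.ones_like(mus[a])                 # all-ones vector
    mu_T, mu_T1 = mus[a] + tau * E, mus[a] - tau * E
    sigma_new = sigmas[a] / (1.0 + beta)     # (1 + beta)^(-1) * sigma_a
    keep = [n for n in range(len(pis)) if n != a]
    return ([pis[n] for n in keep] + [0.5 * pis[a], 0.5 * pis[a]],
            [mus[n] for n in keep] + [mu_T, mu_T1],
            [sigmas[n] for n in keep] + [sigma_new, sigma_new.copy()])
```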

An iteration count M is set in advance, and the above steps are repeated with the new Gaussian subcomponents; after M passes, an optimized Gaussian mixture model is obtained. The model is optimized for every speaker, finally yielding an optimized Gaussian mixture model for each. In this embodiment M is 10.
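
Tying the pieces together, the adaptive training loop might be organized as follows. The threshold values are placeholders (the patent only calls them empirical), and merge and reweight are applied mutually exclusively per pass here so that subcomponent indices stay valid:

```python
def train_adaptive_gmm(X, pis, mus, sigmas, M=10,
                       theta1=5.0, theta2=0.3, theta3=0.5):
    """M passes of EM plus adaptive merge / reweight / split. M = 10 follows the
    embodiment; the theta values are illustrative placeholders."""
    for _ in range(M):
        pis, mus, sigmas, gamma = em_step(X, pis, mus, sigmas)
        sums = pairwise_diff_sums(gamma)
        (a_min, b_min), s_min = min(sums.items(), key=lambda kv: kv[1])
        (a_max, b_max), s_max = max(sums.items(), key=lambda kv: kv[1])
        if s_min < theta3:                        # overlapping information: merge
            pis, mus, sigmas = merge_components(pis, mus, sigmas, a_min, b_min)
        elif s_max > theta1:                      # disjoint information: reweight
            pis = reweight_components(pis, gamma, a_max, b_max)
        heaviest = int(np.argmax(pis))
        if pis[heaviest] > theta2:                # overloaded subcomponent: split
            pis, mus, sigmas = split_component(pis, mus, sigmas, heaviest)
    return pis, mus, sigmas
```

Each enrolled speaker would get his or her own call to train_adaptive_gmm, yielding one optimized model per speaker.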

For a speech signal x to be recognized, compute the probability that the signal is generated by each of the Gaussian mixture models and take the largest; the target speaker corresponding to the largest probability is the true speaker of the test speech.

For example, if a given speech segment to be recognized has the highest generation probability under the third Gaussian mixture model, the segment was uttered by the third speaker.
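
Identification then reduces to scoring the test frames under every speaker's model and taking the argmax. Log-likelihoods are used below for numerical stability, an implementation choice rather than something the patent specifies:

```python
def identify_speaker(X_test, speaker_models):
    """speaker_models: one (pis, mus, sigmas) tuple per enrolled speaker.
    Returns the index of the model giving the highest total log-likelihood."""
    scores = [sum(np.log(gmm_density(x, pis, mus, sigmas)) for x in X_test)
              for pis, mus, sigmas in speaker_models]
    return int(np.argmax(scores))
```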

Claims (5)

1. A human voice recognition method based on an adaptively adjusted Gaussian mixture model, characterized in that it comprises the following steps:

1) training, from a speaker's speech feature parameters, a traditional Gaussian mixture model corresponding to that speaker;

2) computing the probability that each frame of data is generated by each Gaussian subcomponent of the mixture model, and then computing, for every pair of distinct subcomponents, the sum of the absolute values of the differences between the probabilities with which they generate the same frame;

3) taking the minimum of the sums obtained in step 2) and comparing it with a preset low threshold θ_3, and if it is below θ_3, merging the two Gaussian subcomponents corresponding to that minimum into a new subcomponent;

4) taking the maximum of the sums and comparing it with a preset high threshold θ_1, and if it exceeds θ_1, redistributing the weights of the two Gaussian subcomponents corresponding to that maximum to obtain two new subcomponents;

5) taking the maximum of the Gaussian subcomponent weights and comparing it with a preset threshold θ_2, and if it exceeds θ_2, splitting that subcomponent into two new subcomponents;

6) replacing the original Gaussian subcomponents with the newly obtained ones, obtaining the final optimized Gaussian model through multiple iterations, then inputting the speech feature parameters to be recognized, computing the probability that the speech signal is generated by each Gaussian mixture model, and declaring the speaker whose model yields the maximum probability the true speaker of the test speech.

2. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 1, characterized in that in step 2) the absolute value of the probability difference for the same frame of signal is computed as:

$$p_{\mathrm{diff}} = \left| \frac{\pi_a N(x_i \mid \mu_a, \sigma_a)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)} - \frac{\pi_b N(x_i \mid \mu_b, \sigma_b)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)} \right|$$

where λ_n = {π_n, μ_n, σ_n} denotes the n-th Gaussian subcomponent, π_n is its weight, and μ_n and σ_n are its mean and covariance matrix; each frame of data is fitted by the K Gaussian subcomponents, there are L frames in total, and x_i (i = 1, 2, …, L) is the i-th input frame of the speech signal; a and b are the serial numbers of two distinct subcomponents, π_a is the weight of the a-th subcomponent, N(x_i | μ_a, σ_a) its probability density, and μ_a and σ_a its mean and covariance matrix; the subscript j denotes the serial number of the j-th Gaussian subcomponent and the subscript b that of the b-th.

3. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 2, characterized in that the merge in step 3) proceeds as follows:

$$\pi_T = \pi_a + \pi_b, \qquad \omega_a = \frac{\pi_a}{\pi_T}, \quad \omega_b = \frac{\pi_b}{\pi_T}, \qquad \mu_T = \omega_a \mu_a + \omega_b \mu_b, \qquad \sigma_T = \omega_a \sigma_a + \omega_b \sigma_b$$

where a and b are the serial numbers of the a-th and b-th Gaussian subcomponents and T is the serial number of the new merged subcomponent; the newly added Gaussian subcomponent λ_T replaces the original subcomponents λ_a and λ_b.

4. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 2, characterized in that in step 4) the two Gaussian subcomponents a and b are assigned new weights, yielding two new subcomponents, as follows:

$$\pi_T = \alpha_1 (\pi_a + \pi_b), \qquad \pi_{T+1} = \alpha_2 (\pi_a + \pi_b)$$

where

$$\alpha_1 = \frac{\gamma_a}{\gamma_a + \gamma_b}, \qquad \alpha_2 = \frac{\gamma_b}{\gamma_a + \gamma_b}, \qquad \gamma_a = \frac{\pi_a N(x_i \mid \mu_a, \sigma_a)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)}, \qquad \gamma_b = \frac{\pi_b N(x_i \mid \mu_b, \sigma_b)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \sigma_j)},$$

and the means and covariance matrices of the two Gaussian distributions remain unchanged.

5. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 2, characterized in that in step 5) the Gaussian subcomponent is split as follows:

$$\pi_T = \tfrac{1}{2}\pi_a, \quad \pi_{T+1} = \tfrac{1}{2}\pi_a, \qquad \mu_T = \mu_a + \tau E, \quad \mu_{T+1} = \mu_a - \tau E, \qquad \sigma_T = (1+\beta)^{-1}\sigma_a, \quad \sigma_{T+1} = (1+\beta)^{-1}\sigma_a$$

where β is the maximum value on the diagonal of σ_a and E = [1, 1, …, 1] is an all-ones vector; the two new Gaussian subcomponents λ_T and λ_{T+1} replace the original subcomponent λ_a.
CN201510977077.9A 2015-12-22 2015-12-22 Adaptive adjustment-based Gaussian mixture model voice identification method Pending CN105590628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510977077.9A CN105590628A (en) 2015-12-22 2015-12-22 Adaptive adjustment-based Gaussian mixture model voice identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510977077.9A CN105590628A (en) 2015-12-22 2015-12-22 Adaptive adjustment-based Gaussian mixture model voice identification method

Publications (1)

Publication Number Publication Date
CN105590628A true CN105590628A (en) 2016-05-18

Family

ID=55930150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510977077.9A Pending CN105590628A (en) 2015-12-22 2015-12-22 Adaptive adjustment-based Gaussian mixture model voice identification method

Country Status (1)

Country Link
CN (1) CN105590628A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Voiceprint recognition method and system based on Gaussian mixture model
CN102360418A (en) * 2011-09-29 2012-02-22 山东大学 Method for detecting eyelashes based on Gaussian mixture model and maximum expected value algorithm
CN102820033A (en) * 2012-08-17 2012-12-12 南京大学 Voiceprint identification method
CN104485108A (en) * 2014-11-26 2015-04-01 河海大学 Noise and speaker combined compensation method based on multi-speaker model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
沈希忠 (XIONG HUAQIAO): "Research on Speaker Recognition Methods Based on Model Clustering", China Master's Theses Full-text Database, Information Science and Technology *
王韵琪 et al.: "Adaptive Gaussian Mixture Model and Its Application to Speaker Recognition", Communications Technology *
王韵琪: "Adaptive Gaussian Mixture Model and Its Application to Speaker Recognition", China Master's Theses Full-text Database, Information Science and Technology *

Similar Documents

Publication Publication Date Title
Chou et al. One-shot voice conversion by separating speaker and content representations with instance normalization
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
Saito et al. Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors
CN114550703B (en) Training method and device of speech recognition system, speech recognition method and device
Zen et al. Continuous stochastic feature mapping based on trajectory HMMs
Juvela et al. Speaker-independent raw waveform model for glottal excitation
Chazan et al. A hybrid approach for speech enhancement using MoG model and neural network phoneme classifier
CN104240706A (en) Speaker recognition method based on GMM Token matching similarity correction scores
Mallidi et al. Autoencoder based multi-stream combination for noise robust speech recognition.
Padi et al. Towards relevance and sequence modeling in language recognition
Seneviratne et al. Noise Robust Acoustic to Articulatory Speech Inversion.
Selva Nidhyananthan et al. Assessment of dysarthric speech using Elman back propagation network (recurrent network) for speech recognition
Liu et al. Using bidirectional associative memories for joint spectral envelope modeling in voice conversion
Devi et al. Automatic speech emotion and speaker recognition based on hybrid gmm and ffbnn
Paul et al. Automated speech recognition of isolated words using neural networks
Tobing et al. Voice conversion with CycleRNN-based spectral mapping and finely tuned WaveNet vocoder
Musaev et al. Advanced feature extraction method for speaker identification using a classification algorithm
Han et al. A study on speech emotion recognition based on CCBC and neural network
Al-Rawahy et al. Text-independent speaker identification system based on the histogram of DCT-cepstrum coefficients
Ghonem et al. Classification of stuttering events using i-vector
CN104183239B (en) Text-independent speaker recognition method based on weighted Bayes hybrid model
CN105590628A (en) Adaptive adjustment-based Gaussian mixture model voice identification method
Lilley et al. Unsupervised training of a DNN-based formant tracker
Nathwani et al. Consistent DNN uncertainty training and decoding for robust ASR
Bozorg et al. Autoregressive articulatory wavenet flow for speaker-independent acoustic-to-articulatory inversion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160518