CN105590628A - Adaptive adjustment-based Gaussian mixture model voice identification method - Google Patents
- Publication number
- CN105590628A (application CN201510977077.9A)
- Authority
- CN
- China
- Prior art keywords: gaussian, sigma, subcomponent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
Abstract
The invention relates to a voice identification method based on an adaptively adjusted Gaussian mixture model. The method uses the sum of absolute values of probability differences to improve the traditional Gaussian mixture model, dynamically adjusting the contribution of each Gaussian subcomponent when fitting the voice signal features so that every subcomponent is used to the greatest extent and the information is fully expressed, thereby improving speaker verification performance.
Description
Technical Field
The invention relates to human voice (speaker) recognition technology, in particular to a voice recognition method based on an adaptively adjusted Gaussian mixture model.
Background
Human voice recognition is a technology that identifies a speaker from his or her voice using signal processing and probability-theory methods. It mainly comprises two steps: training a speaker model and recognizing the speaker's voice.
The characteristic parameters mainly adopted for human voice recognition include Mel-frequency cepstral coefficients (MFCC), linear predictive coding coefficients (LPCC) and perceptually weighted linear prediction coefficients (PLP). Common recognition algorithms include the support vector machine (SVM), the Gaussian mixture model (GMM) and vector quantization (VQ). The Gaussian mixture model is widely applied in the speech recognition field.
The degree of mixing (number of components) of the traditional Gaussian mixture model is fixed, while the voice characteristics of human speech are diverse: some Gaussian subcomponents in the feature distribution carry little information and others carry much. This can cause over-fitting or under-fitting and reduces the speaker verification rate.
Disclosure of Invention
Aiming at these problems of voice recognition with the traditional Gaussian mixture model, the invention provides a voice recognition method based on an adaptively adjusted Gaussian mixture model: on the basis of the traditional model, the degree of mixing and the Gaussian subcomponents are adaptively adjusted to improve the recognition probability.
The technical scheme of the invention is as follows. A human voice recognition method based on an adaptively adjusted Gaussian mixture model comprises the following steps:
1) training by using the voice characteristic parameters of the speaker to generate a traditional Gaussian mixture model corresponding to the speaker;
2) calculating the probability of each frame of data generated by each Gaussian sub-component in the Gaussian mixture model, and then calculating the sum of absolute values of probability differences of the same frame of data generated by different Gaussian sub-components;
3) comparing the minimum of the sum values obtained in step 2) with a set low threshold θ3; if it is less than θ3, merging the two Gaussian subcomponents corresponding to that minimum to obtain a new Gaussian subcomponent;
4) comparing the maximum of the obtained sum values with a set high threshold θ1; if it is greater than θ1, re-assigning the weights of the two Gaussian subcomponents corresponding to that maximum to obtain two new Gaussian subcomponents;
5) comparing the maximum Gaussian subcomponent weight with a set threshold θ2; if it is greater than θ2, splitting that Gaussian subcomponent to obtain two new Gaussian subcomponents;
6) replacing the original Gaussian subcomponents with the newly obtained ones and, through multiple iterations, obtaining the finally optimized Gaussian model; then inputting the characteristic parameters of the voice to be recognized, calculating the probability that each speaker's Gaussian mixture model generates the voice signal, and taking the speaker whose model yields the largest probability as the target speaker, i.e. the true speaker of the tested voice.
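The decision logic of steps 3)-5) above can be sketched as follows. This is an illustrative sketch only: the threshold values, the toy responsibility matrix and the function name are assumptions (the patent states only that the thresholds are empirical values).

```python
import numpy as np

# Illustrative thresholds; the patent only says these are empirical values.
theta1, theta2, theta3 = 2.0, 0.5, 0.1

def decide_actions(gamma, weights):
    """Given the (L, K) per-frame responsibility matrix and the component
    weights, report which adjustment steps 3)-5) would fire."""
    # Sum over frames of |gamma(i, a) - gamma(i, b)| for every pair (a, b).
    diff = np.abs(gamma[:, :, None] - gamma[:, None, :]).sum(axis=0)
    off = diff[np.triu_indices_from(diff, k=1)]  # distinct pairs only
    actions = []
    if off.min() < theta3:
        actions.append("merge")      # step 3): overlapping subcomponents
    if off.max() > theta1:
        actions.append("reweight")   # step 4): disjoint subcomponents
    if weights.max() > theta2:
        actions.append("split")      # step 5): overloaded subcomponent
    return actions

# Toy responsibility matrix: components 0 and 1 behave identically (merge),
# and component 0 also carries most of the weight (split).
gamma = np.array([[0.45, 0.45, 0.10],
                  [0.40, 0.40, 0.20]])
acts = decide_actions(gamma, np.array([0.6, 0.2, 0.2]))
```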
The sum of absolute values of the probability differences for the same frame signal in step 2) is calculated as

D(a, b) = Σ_{i=1…L} | γ(i, a) − γ(i, b) |, where γ(i, a) = π_a N(x_i | μ_a, σ_a) / Σ_{j=1…K} π_j N(x_i | μ_j, σ_j),

where λ_n = {π_n, μ_n, σ_n} denotes the n-th Gaussian subcomponent, π_n is its weight, and μ_n and σ_n are its expectation and covariance matrix; each frame of data is fitted by K Gaussian subcomponents and there are L frames in total; x_i (i = 1, 2, …, L) is the i-th input frame of the speech signal; a and b are the ordinal numbers of two different Gaussian subcomponents; π_a is the weight of the a-th subcomponent and N(x_i | μ_a, σ_a) its probability density, with μ_a and σ_a its expectation and covariance matrix; the subscripts j and b denote the sequence numbers of the j-th and b-th Gaussian subcomponents.
The combination processing in step 3) is as follows:

where a denotes the sequence number of the a-th Gaussian subcomponent, b the sequence number of the b-th, and T the sequence number of the new Gaussian subcomponent after combination; the newly added Gaussian subcomponent λ_T replaces the original subcomponents λ_a and λ_b.
In step 4), the weights of the two Gaussian subcomponents a and b are redistributed to obtain two new Gaussian subcomponents, processed as follows:

where the expectations and covariance matrices of the two Gaussian distributions remain unchanged.
The Gaussian subcomponent in step 5) is split as follows:

where σ_a,max is the maximum element on the diagonal of σ_a and e = [1, 1, …, 1] is an all-ones vector; the two new Gaussian subcomponents λ_T and λ_{T+1} replace the original subcomponent λ_a.
The invention has the following beneficial effects. The voice recognition method based on an adaptively adjusted Gaussian mixture model improves the traditional Gaussian mixture model by using the sum of absolute values of probability differences, dynamically adjusts each Gaussian subcomponent's contribution when fitting the voice signal features, makes maximum use of every subcomponent to fully express the useful information, and thereby improves speaker verification performance.
Drawings
FIG. 1 is a schematic diagram of a training process of adaptively adjusting a Gaussian mixture model according to the present invention;
FIG. 2 is a schematic flow chart of Gaussian subcomponent weight assignment in accordance with the present invention;
FIG. 3 is a schematic flow diagram of the improved Gaussian subcomponent splitting of the present invention;
FIG. 4 is a flow chart illustrating the improved Gaussian subcomponent combination of the present invention.
Detailed Description
The experimental data in this embodiment consists of recorded voices of 43 participants (23 women, 20 men) at a sampling rate of 8000 Hz. Each participant recorded 5 voice segments in a quiet environment; each segment is a four-character idiom.
And training a certain amount of voice of different speakers to obtain traditional Gaussian mixture models corresponding to the different speakers, and optimizing the different traditional Gaussian mixture models according to the self-adaptive adjustment rule.
In the training process, three voice segments of each speaker are randomly selected and trained to obtain the optimized Gaussian mixture model corresponding to that speaker.
During testing, the recognition rate of each optimized Gaussian mixture model is tested with the remaining voice segments of the different speakers.
As shown in fig. 1, the flow chart of adaptively adjusting the gaussian mixture model training process includes the following steps:
the method comprises the steps of preprocessing a voice signal, wherein the preprocessing step comprises end point detection, framing, windowing and extracting a characteristic parameter, namely a Mel cepstrum coefficient, and a 12-dimensional Mel cepstrum coefficient (MFCC) is selected in the experiment.
The extracted MFCC parameters are then trained with the EM algorithm to obtain the traditional Gaussian mixture model of the speaker. The degree of mixing of the traditional model is K; the model is a linear superposition of K Gaussian subcomponents, and its probability density is

p(x) = Σ_{n=1…K} π_n N(x | μ_n, σ_n),

where π_n is the weight of the n-th Gaussian subcomponent and N(x | μ_n, σ_n) its probability density function; in this embodiment K = 16; μ and σ are the expectation and covariance matrix of a Gaussian subcomponent; D is the dimension of the data x; λ_n = {π_n, μ_n, σ_n} denotes the n-th Gaussian subcomponent, where n may take any integer value from 1 to K. The probability that the speaker to be identified belongs to the current model is obtained by computing p(x).
Let the i-th frame of the speaker's data be x_i (i = 1, 2, …, L). The specific estimation steps of the EM algorithm are as follows:
step one, if the first execution is carried out, initializing parameters { pi, mu, sigma } of a Gaussian mixture model; if the first execution is not performed, the parameters of the Gaussian mixture model are the result obtained by the previous iteration calculation. Then, the probability γ (i, n) of each frame data generated by the K gaussian subcomponents respectively is estimated (representing the probability of the ith frame data generated by the nth gaussian subcomponent):
j in the formula represents the serial number of the jth Gaussian subcomponent; n represents the serial number of the nth Gaussian sub-component, the total number of the Gaussian sub-components is K, i represents the ith frame data of the speaker, and the L frame data are shared.
Step 2: use the result of step 1 to estimate the parameters of the Gaussian model:

π_n = (1/L) Σ_{i=1…L} γ(i, n),
μ_n = Σ_{i=1…L} γ(i, n) x_i / Σ_{i=1…L} γ(i, n),
σ_n = Σ_{i=1…L} γ(i, n) (x_i − μ_n)(x_i − μ_n)^T / Σ_{i=1…L} γ(i, n).
and thirdly, repeating the first step and the second step until the value of the likelihood function tends to be stable.
The traditional Gaussian mixture model obtained above is then optimized.
Using the parameters of the trained traditional Gaussian model, compute the probability that each frame of data is generated by each of the K Gaussian subcomponents. With L frames of data this yields a K × L matrix; for example, the entry in row 1, column 2 is the probability that the 2nd frame was generated by the 1st Gaussian subcomponent. Then compute the absolute value of the difference between the probabilities with which two different Gaussian subcomponents generate the same frame, and sum these absolute differences over all frames for each pair of subcomponents. For the a-th and b-th Gaussian subcomponents this sum is

D(a, b) = Σ_{i=1…L} | γ(i, a) − γ(i, b) |,

where j denotes the sequence number of the j-th Gaussian subcomponent; a and b denote the sequence numbers of the a-th and b-th subcomponents, out of K subcomponents in total; and i denotes the i-th of the speaker's L frames of data.
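A minimal sketch of building the pairwise sums of absolute responsibility differences from the responsibility matrix; the matrix values and the function name are illustrative, not from the patent.

```python
import numpy as np

def abs_diff_sums(gamma):
    """Sum over frames of |gamma(i, a) - gamma(i, b)| for every pair (a, b).
    gamma: (L, K) responsibility matrix. Returns a symmetric (K, K) matrix."""
    # Broadcast (L, K, 1) against (L, 1, K), then sum the absolute
    # differences over the frame axis.
    diff = np.abs(gamma[:, :, None] - gamma[:, None, :])
    return diff.sum(axis=0)

# Toy responsibility matrix: 3 frames, 3 subcomponents.
gamma = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.1, 0.8]])
D = abs_diff_sums(gamma)
```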
Compare the minimum of the sum values obtained in the previous step with the low threshold θ3. If it is less than θ3, the two Gaussian subcomponents are considered to fit the same part of the speech signal features, i.e. their information overlaps, and they are combined into a new Gaussian subcomponent as follows:
where a denotes the sequence number of the a-th Gaussian subcomponent, b the sequence number of the b-th, and T the sequence number of the new Gaussian subcomponent after combination. The low threshold in this step is an empirical value obtained after many experiments.
The newly added Gaussian subcomponent λ_T replaces the original subcomponents λ_a and λ_b, so the degree of mixing of the Gaussian mixture model decreases by one.
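A hedged sketch of the merge step. The patent's exact merge formula is not reproduced in this text, so the standard moment-matching merge of two Gaussians is assumed for illustration; function and variable names are not from the patent.

```python
import numpy as np

def merge_components(pi_a, mu_a, cov_a, pi_b, mu_b, cov_b):
    """Merge two Gaussian subcomponents (lambda_a, lambda_b) into one
    (lambda_T) by matching the combined first and second moments."""
    pi_t = pi_a + pi_b                       # combined weight
    w_a, w_b = pi_a / pi_t, pi_b / pi_t      # relative contributions
    mu_t = w_a * mu_a + w_b * mu_b           # weighted mean
    d_a, d_b = mu_a - mu_t, mu_b - mu_t
    # Covariance: within-component spread plus between-means spread.
    cov_t = (w_a * (cov_a + np.outer(d_a, d_a))
             + w_b * (cov_b + np.outer(d_b, d_b)))
    return pi_t, mu_t, cov_t

pi_t, mu_t, cov_t = merge_components(
    0.3, np.array([0.0, 0.0]), np.eye(2),
    0.1, np.array([1.0, 0.0]), np.eye(2))
```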
Compare the maximum of the obtained sum values with the high threshold θ1. If it is greater than θ1, the two Gaussian subcomponents are considered to fit different parts of the speech signal features; in this case their weights are re-assigned as follows:

where the expectations and covariance matrices of the two Gaussian subcomponents remain unchanged.
The high threshold in this step is an empirical value taken after a number of experiments.
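A hedged sketch of the weight re-assignment step. The patent's exact formula is not reproduced in this text; equal redistribution of the two weights is assumed purely for illustration (the text states only that the expectations and covariance matrices stay unchanged).

```python
def reassign_weights(pi_a, pi_b):
    """Re-assign the weights of two subcomponents that fit different parts
    of the feature distribution. Assumption: the combined weight is shared
    equally; means and covariances are left untouched, per the text."""
    pi_new = (pi_a + pi_b) / 2.0
    return pi_new, pi_new

a, b = reassign_weights(0.5, 0.1)
```

Note that this keeps the total mixture weight unchanged, which any concrete re-assignment rule would also have to do for the weights to remain a valid distribution.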
Compare the maximum Gaussian subcomponent weight with the weight threshold θ2. If it is greater than θ2, the subcomponent contains too much information and needs to be split, as follows:

where σ_a,max is the maximum element on the diagonal of σ_a and e = [1, 1, …, 1] is an all-ones vector. The two new Gaussian subcomponents λ_T and λ_{T+1} replace the original subcomponent λ_a, and the degree of mixing of the Gaussian mixture model increases by one.
The weight threshold in this step is an empirical value taken after a number of experiments.
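A hedged sketch of the split step, following the text's hint that the mean is perturbed using the largest diagonal element of σ_a (σ_a,max) and an all-ones vector e. The scaling factor `eps` and the equal sharing of weight and covariance are assumptions; the exact formula is not reproduced in this text.

```python
import numpy as np

def split_component(pi_a, mu_a, cov_a, eps=0.5):
    """Split one over-weighted subcomponent lambda_a into two new ones
    (lambda_T, lambda_T+1) by perturbing the mean in opposite directions."""
    sigma_max = np.max(np.diag(cov_a))   # largest diagonal element of cov_a
    e = np.ones_like(mu_a)               # all-ones vector
    shift = eps * np.sqrt(sigma_max) * e
    pi_new = pi_a / 2.0                  # weight shared equally (assumption)
    return ((pi_new, mu_a + shift, cov_a.copy()),
            (pi_new, mu_a - shift, cov_a.copy()))

(p1, m1, c1), (p2, m2, c2) = split_component(
    0.6, np.zeros(2), np.diag([4.0, 1.0]))
```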
An iteration count M is preset; the above steps are repeated with the new Gaussian subcomponents, and after M executions the optimized Gaussian mixture model is obtained. Optimizing the model of each speaker yields the optimized Gaussian mixture model corresponding to each speaker. In this embodiment M = 10.
For the voice signal x to be recognized, compute the probability that it is generated by each of the different Gaussian mixture models and take the largest; the target speaker corresponding to that model is the true speaker of the tested voice.
For example, if the probability that a segment of speech to be recognized is generated by the 3rd Gaussian mixture model is the largest, the speech was uttered by the 3rd speaker.
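The final identification step — score every speaker model against the test frames and take the argmax — can be sketched as follows; the toy one-component "speaker models" are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """Total log-probability of the frames X under one speaker's GMM."""
    dens = sum(w * multivariate_normal.pdf(X, m, c)
               for w, m, c in zip(weights, means, covs))
    return np.log(dens).sum()

def identify(X, speaker_models):
    """Return the index of the model that best fits the test utterance."""
    scores = [gmm_log_likelihood(X, *model) for model in speaker_models]
    return int(np.argmax(scores))

# Two toy one-component speaker models in 2-D.
model_0 = ([1.0], [np.zeros(2)], [np.eye(2)])
model_1 = ([1.0], [np.full(2, 5.0)], [np.eye(2)])
X = np.full((3, 2), 5.0)  # test frames near speaker 1's mean
best = identify(X, [model_0, model_1])
```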
Claims (5)
1. A human voice recognition method based on an adaptively adjusted Gaussian mixture model, characterized by comprising the following steps:
1) training by using the voice characteristic parameters of the speaker to generate a traditional Gaussian mixture model corresponding to the speaker;
2) calculating the probability of each frame of data generated by each Gaussian sub-component in the Gaussian mixture model, and then calculating the sum of absolute values of probability differences of the same frame of data generated by different Gaussian sub-components;
3) comparing the minimum of the sum values obtained in step 2) with a set low threshold θ3; if it is less than θ3, merging the two Gaussian subcomponents corresponding to that minimum to obtain a new Gaussian subcomponent;
4) comparing the maximum of the obtained sum values with a set high threshold θ1; if it is greater than θ1, re-assigning the weights of the two Gaussian subcomponents corresponding to that maximum to obtain two new Gaussian subcomponents;
5) comparing the maximum Gaussian subcomponent weight with a set threshold θ2; if it is greater than θ2, splitting that Gaussian subcomponent to obtain two new Gaussian subcomponents;
6) replacing the original Gaussian subcomponents with the newly obtained ones and, through multiple iterations, obtaining the finally optimized Gaussian model; then inputting the characteristic parameters of the voice to be recognized, calculating the probability that each speaker's Gaussian mixture model generates the voice signal, and taking the speaker whose model yields the largest probability as the target speaker, i.e. the true speaker of the tested voice.
2. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 1, characterized in that the sum of absolute values of the probability differences for the same frame signal in step 2) is calculated as

D(a, b) = Σ_{i=1…L} | γ(i, a) − γ(i, b) |, where γ(i, a) = π_a N(x_i | μ_a, σ_a) / Σ_{j=1…K} π_j N(x_i | μ_j, σ_j),

where λ_n = {π_n, μ_n, σ_n} denotes the n-th Gaussian subcomponent, π_n is its weight, and μ_n and σ_n are its expectation and covariance matrix; each frame of data is fitted by K Gaussian subcomponents and there are L frames in total; x_i (i = 1, 2, …, L) is the i-th input frame of the speech signal; a and b are the ordinal numbers of two different Gaussian subcomponents; π_a is the weight of the a-th subcomponent and N(x_i | μ_a, σ_a) its probability density, with μ_a and σ_a its expectation and covariance matrix; the subscripts j and b denote the sequence numbers of the j-th and b-th Gaussian subcomponents.
3. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 2, characterized in that the combination processing in step 3) is as follows:

where a denotes the sequence number of the a-th Gaussian subcomponent, b the sequence number of the b-th, and T the sequence number of the new Gaussian subcomponent after combination; the newly added Gaussian subcomponent λ_T replaces the original subcomponents λ_a and λ_b.
4. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 2, characterized in that step 4) redistributes the weights of the two Gaussian subcomponents a and b to obtain two new Gaussian subcomponents, processed as follows:

where the expectations and covariance matrices of the two Gaussian distributions remain unchanged.
5. The human voice recognition method based on an adaptively adjusted Gaussian mixture model according to claim 2, characterized in that the Gaussian subcomponent in step 5) is split as follows:

where σ_a,max is the maximum element on the diagonal of σ_a and e = [1, 1, …, 1] is an all-ones vector; the two new Gaussian subcomponents λ_T and λ_{T+1} replace the original subcomponent λ_a.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510977077.9A CN105590628A (en) | 2015-12-22 | 2015-12-22 | Adaptive adjustment-based Gaussian mixture model voice identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105590628A true CN105590628A (en) | 2016-05-18 |
Family
ID=55930150
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510977077.9A Pending CN105590628A (en) | 2015-12-22 | 2015-12-22 | Adaptive adjustment-based Gaussian mixture model voice identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105590628A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102324232A (en) * | 2011-09-12 | 2012-01-18 | 辽宁工业大学 | Method for recognizing sound-groove and system based on gauss hybrid models |
CN102360418A (en) * | 2011-09-29 | 2012-02-22 | 山东大学 | Method for detecting eyelashes based on Gaussian mixture model and maximum expected value algorithm |
CN102820033A (en) * | 2012-08-17 | 2012-12-12 | 南京大学 | Voiceprint identification method |
CN104485108A (en) * | 2014-11-26 | 2015-04-01 | 河海大学 | Noise and speaker combined compensation method based on multi-speaker model |
Non-Patent Citations (3)
Title |
---|
Xiong Huaqiao: "Research on Speaker Recognition Methods Based on Model Clustering", China Masters' Theses Full-text Database, Information Science and Technology |
Wang Yunqi et al.: "Adaptive Gaussian Mixture Model and Its Application to Speaker Recognition", Communications Technology |
Wang Yunqi: "Adaptive Gaussian Mixture Model and Its Application to Speaker Recognition", China Masters' Theses Full-text Database, Information Science and Technology |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | C06 | Publication |
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160518