CN101447182B - Vocal-tract length normalization method capable of fast online application - Google Patents
- Publication number
- CN101447182B CN101447182B CN2008100979810A CN200810097981A CN101447182B CN 101447182 B CN101447182 B CN 101447182B CN 2008100979810 A CN2008100979810 A CN 2008100979810A CN 200810097981 A CN200810097981 A CN 200810097981A CN 101447182 B CN101447182 B CN 101447182B
- Authority
- CN
- China
- Prior art keywords
- consolidation
- alpha
- factor
- vocal
- acoustic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention relates to a vocal-tract length normalization (VTLN) method suitable for fast online application. The method comprises the following steps: 1) in the training stage, train a normalized acoustic model that is independent of vocal-tract length; 2) classify the training data according to the different warping factors and train one GMM per class; 3) during testing, score the GMMs segment by segment to compute the vocal-tract length warping factor quickly; 4) select different numbers of segments according to the real-time requirement of the recognition system and update the warping factor; and 5) decode the warped acoustic features with the vocal-tract-length-normalized acoustic model. With this method, the segment length of the test speech can be chosen according to the real-time requirement of the recognition system, so that VTLN can be applied in an online system. Segmentation eliminates the influence of inaccurately detected silence without splitting continuous speech frame by frame, which would distort the delta values of the dynamic acoustic features; in addition, different weights can be assigned according to the condition of each segment.
Description
Technical field
The present invention relates to a speaker acoustic-feature normalization method in speech recognition technology, and more particularly to a speaker vocal-tract length normalization (VTLN) method suitable for fast online application.
Background art
Voice are one of natural qualities of people.Because the behavior difference that the differences of Physiological of speaker's vocal organs and the day after tomorrow form, the performance of speaker's related system is better than speaker's system without interaction in speech recognition.For the speaker's system without interaction performance decrease that reduces to cause owing to speaker's difference, the sound channel length consolidation is a kind of effective ways commonly used.The sound channel length consolidation is a kind of feature consolidation technology based on model, depends on speaker's sound channel length consolidation model.Document, H.Wakita " Normalization of Vowels by Vocal-Tract Length and itsApplication to Vowel Identification; " ICASSP77 (1977) proposes to use removal speaker sound channel length first and causes that the thought of formant frequency drift improves the discrimination of isolated vowel.Position that sound channel is different and shape have determined the generation of voice, document, E.Eide et al. 
" A Parametric Approach to Vocal Tract LengthNormalization; " ICASSP96 (1996), think that the simplest model of speaker's sound channel is the even pipe of a length from the glottis to the lip, and be the sealing of an end opening one end.They give the influence of different consolidation functions to last recognition performance.Based on the model of this even pipeline, the centre frequency that the influence of speaker's sound channel length equals the voice signal resonance peak multiply by the inverse of sound channel length.Usually speaker's sound channel length from about schoolgirl's 13cm to more than boy student's the 18cm, these change speech recognition all is disadvantageous.The thought of sound channel length consolidation technology is exactly to find certain consolidation function that the data of training and testing are all transformed to a data field that has nothing to do with speaker's sound channel length.Based on the theory of pipeline model, resonance peak is with the sound channel length linear change.In most cases the consolidation function only depends on a simple feature consolidation factor.Concrete enforcement is exactly to seek the best consolidation factor of each speaker, eliminates the different influences that bring of speaker's sound channel length by this consolidation factor pair frequency axis stretching or compression then.The principle of sound channel length consolidation technology is very simple, but effectively concrete enforcement is quite difficult.Maximum challenge is how effectively to estimate the best consolidation factor from limited data.The considerable method of tradition is based on the method for twice decoding of maximal possibility estimation, obtain speaker's content of speaking by acoustic feature before the consolidation being carried out a decoding, on acoustic model, do mandatory alignment with the feature after the text message of the content of speaking and the different consolidation factors (normally 
with the fixed step size traversal) consolidation, with the consolidation factor of likelihood value maximum the best consolidation factor as this people.This method can obtain all well and good effect, but needs twice decode time.Document, L.Lee etal. " Speaker Normalization using Efficient Frequency Warping Procedures, " ICASSP96 (1996) has proposed some comparatively successful method.For training data, they have proposed a kind of method of falling generation, with acoustic model of half training data training, take this acoustic model to estimate the consolidation factor of an other half data, on original acoustic model, estimate new acoustic model again with the data after the consolidation then.Test the time a kind of method of text-independent has been proposed, selected relevant GMM (the Gaussian Mixture Model) model of the consolidation factor for use, saved the first pass decode time.The above-mentioned consolidation factor method of asking all is that the speaker is relevant, document, S.Wegmann etal. " Speaker Normalization on Conversational Telephone Speech " ICASSP96 (1996), proposed the relevant vocal-tract length normalization method of a kind of sentence fast, vocal-tract length normalization method can be worked under half off-line provides possibility.Reported method has all obtained all well and good recognition effect now, but how many these methods have certain limitation, all need a certain amount of priori data, so can only be operated under the mode of off-line or half off-line, is difficult to be applied in the actual system.In the system of reality, particularly online system, speaker information and the content of speaking are unknown, and system can not allow long time-delay, be difficult in the existing method find a suitable solution, so be difficult to use sound channel length consolidation technology.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and provide a vocal-tract length normalization method suitable for fast online application, so that VTLN can be used in an online speech recognition system.
The object of the present invention is achieved as follows:
The vocal-tract length normalization method of the present invention, suitable for fast online application, comprises a training stage and a test stage; the concrete steps are as follows:
1) In the training stage, train a normalized acoustic model that is independent of vocal-tract length;
2) Classify the training data according to the different warping factors and train one GMM per class;
3) During testing, score the GMMs segment by segment to compute the vocal-tract length warping factor quickly;
4) Select different numbers of segments according to the real-time requirement of the recognition system and update the warping factor;
5) Decode the warped acoustic features with the vocal-tract-length-normalized acoustic model.
The flow of the fast online vocal-tract length normalization method of the present invention is shown in Fig. 1.
In Fig. 1, the left side is the VTLN acoustic-model training flow, and the right side is the test flow.
Acoustic-model training: the purpose of applying VTLN in training is to train an acoustic model independent of the speaker's vocal-tract length, thereby eliminating its influence. When training the acoustic model, the transcription of the training data is known, so the main problems are the unknown best warping factors and the unknown model parameters. Estimating the best warping factor by maximum likelihood requires the normalized acoustic model, which does not exist yet. The usual solution is to compute the best warping factor in advance with some auxiliary function, compute the warped features with it, and then train the acoustic model. In the present invention, a single-Gaussian acoustic model is used in place of the normalized acoustic model to compute the best warping factor: a single-Gaussian acoustic model has weaker descriptive power than a Gaussian mixture model and therefore better reflects the original attributes of the speech signal. A single-Gaussian acoustic model is trained on the unwarped training data, and features warped with the different factors are force-aligned against the transcription on this model. The warping factor is usually traversed over a certain range (0.80-1.20) with a certain step size (0.02).
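The frequency-axis stretching or compression by a single warping factor, together with the 0.80-1.20 grid traversed with step 0.02, can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the piecewise-linear warping function, the 8 kHz bandwidth, and the 0.85 cutoff are assumptions added for the example.

```python
import numpy as np

def alpha_grid(lo=0.80, hi=1.20, step=0.02):
    """Warping-factor grid traversed during training (0.80..1.20, step 0.02)."""
    return np.round(np.arange(lo, hi + step / 2, step), 2)

def warp_frequency(f, alpha, f_max=8000.0, f_cut=0.85):
    """Piecewise-linear frequency warping: divide by alpha below a cutoff
    knee, then interpolate linearly so that f_max still maps to f_max."""
    f = np.asarray(f, dtype=float)
    knee = f_cut * f_max * min(alpha, 1.0)
    return np.where(
        f <= knee,
        f / alpha,
        knee / alpha + (f - knee) * (f_max - knee / alpha) / (f_max - knee),
    )

grid = alpha_grid()
print(len(grid))                     # 21 factors: 0.80, 0.82, ..., 1.20
print(float(warp_frequency(1000.0, 1.0)))  # alpha = 1 leaves the axis unchanged
```

With alpha above 1 the low frequencies are compressed (1000 Hz maps to about 909 Hz for alpha = 1.1), while the band edge is pinned at f_max so the feature bandwidth is preserved.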
Training in the method of the present invention is divided into three steps, as follows:
1) Train a single-Gaussian acoustic model with the unwarped acoustic features:

θ₀ = argmax_θ ∏_{r=1}^{R} p(X_r | W_r; θ)   (1.1)

where θ₀ is the single-Gaussian acoustic model, r = 1, ..., R indexes the R speakers, X_r is the unwarped acoustic feature of speaker r, and W_r is the transcription of the corresponding spoken content.

2) Find the best warping factor for each speaker:

α_r = argmax_α p(X_r^α | W_r; θ₀)   (1.2)

where α_r is the best warping factor of speaker r, and X_r^α is the acoustic feature of speaker r warped with factor α.

3) Train the acoustic model θ′ with the warped acoustic features:

θ′ = argmax_θ ∏_{r=1}^{R} p(X_r^{α_r} | W_r; θ)   (1.3)

where θ′ is the normalized acoustic model.
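The per-speaker grid search of step 2 can be sketched as follows. This is a toy sketch, not the patent's implementation: a frame-independent diagonal-Gaussian score stands in for the forced alignment of formula 1.2, and all names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def diag_gauss_loglik(x, mean, var):
    """Total log-likelihood of the frames x under one diagonal Gaussian."""
    x = np.atleast_2d(x)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)))

def best_alpha(warp, frames, mean, var, grid):
    """Grid search: warp the frames with each candidate factor and keep
    the factor whose warped features score highest on the model."""
    scores = {a: diag_gauss_loglik(warp(frames, a), mean, var) for a in grid}
    return max(scores, key=scores.get)

# Toy check: the features were generated shrunk by 1/1.10, and "warping"
# here simply rescales them, so the search should recover alpha = 1.1.
rng = np.random.default_rng(0)
mean, var = np.full(3, 5.0), np.ones(3)
frames = rng.normal(5.0 / 1.10, 0.05, (200, 3))
warp = lambda X, a: X * a
print(best_alpha(warp, frames, mean, var, [0.9, 1.0, 1.1, 1.2]))  # -> 1.1
```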
Test flow in the method of the present invention:
Compared with the training part, the normalized acoustic model is available during testing, but the speaker's identity, the spoken content, and the best warping factor are unknown. The conventional approach obtains the speaker's identity by clustering and the spoken content by a first decoding pass, and then computes each speaker's best warping factor by formula 1.2. In a real online system, however, this procedure is computationally expensive and introduces latency, which is basically unacceptable. Speaker identity is usually unknown and hard to obtain, so at test time the warping factor is generally computed per sentence. Because a speaker's vocal-tract length is independent of the particular spoken content, the speaker's warping factor can be obtained directly from the speaker's speech. In testing, we adopt a text-independent method to find the best warping factor: it does not rely on what the speaker says, but estimates the best factor directly from the corresponding acoustic features.
First, in training, the unwarped features are classified according to their corresponding best warping factors, and a Gaussian mixture model (GMM) is then trained for each class. The concrete flow is shown in Fig. 2:

λ_α = argmax_λ p(X_α | λ)   (1.4)

where X_α is the unwarped acoustic feature whose corresponding best warping factor is α.

Second, during recognition, the warping factor whose GMM gives the maximum likelihood for the unwarped acoustic features is taken as the best warping factor α′:

α′ = argmax_α p(X | λ_α)   (1.5)

Then, the warped features are decoded:

W = argmax_W p(X^{α′} | W; θ′)   (1.6)

where W is the recognition result and X^{α′} is the feature warped with factor α′.
Silence segments contain no information about the speaker's vocal-tract length and may even corrupt the computation of the best warping factor, so when training the GMM models, silence is removed from the training data according to speech energy. The computation of the warping factor during testing is shown in Fig. 3: initialize α = 1; every n = 5 frames, decide whether the segment is silence; if it is not silence, accumulate its likelihood on the GMM models, and take the factor with the maximum cumulative likelihood as the current warping factor. By choosing the number of frames n per segment, the latency and real-time behavior of the system can be controlled.
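The segment-wise accumulation just described can be sketched as follows, under stated assumptions: a simple energy threshold stands in for the patent's silence decision, and `score_segment` stands in for the per-class GMM log-likelihood; both names are illustrative, not from the patent.

```python
import numpy as np

def online_warping_factor(frames, energies, score_segment, grid,
                          n=5, energy_floor=0.01):
    """Yield the current best warping factor after each n-frame segment.

    score_segment(segment, alpha) returns the log-likelihood of the
    segment under the GMM of class alpha (a stand-in here).
    """
    cum = {a: 0.0 for a in grid}
    alpha = 1.0                        # initial factor: no warping
    for start in range(0, len(frames) - n + 1, n):
        seg = frames[start:start + n]
        if np.mean(energies[start:start + n]) < energy_floor:
            yield alpha                # silent segment: keep the old factor
            continue
        for a in grid:
            cum[a] += score_segment(seg, a)
        alpha = max(cum, key=cum.get)  # running argmax of cumulative score
        yield alpha

# Toy run: the "GMM score" simply prefers alpha = 1.1 on speech segments.
frames = np.zeros((20, 3))
energies = np.array([1.0] * 10 + [0.0] * 10)   # second half is silence
score = lambda seg, a: -abs(a - 1.1)
factors = list(online_warping_factor(frames, energies, score, [0.9, 1.0, 1.1]))
print(factors)   # -> [1.1, 1.1, 1.1, 1.1]
```

The silent second half leaves the accumulator untouched, so the factor chosen on the speech segments persists, matching the behavior described for Fig. 3.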
The invention has the following advantages:
The method can choose the segment length of the test speech according to the real-time requirement of the recognition system, so that VTLN can be applied in an online system. The purpose of segmentation is to eliminate the influence of inaccurately detected silence without splitting continuous speech frame by frame, which would distort the delta values of the dynamic acoustic features; in addition, different weights can be assigned according to the condition of each segment.
Description of drawings
Fig. 1 is the vocal-tract length normalization system;
Fig. 2 is the GMM training flow;
Fig. 3 is the warping-factor computation flow during testing.
Embodiment
The present invention is described in detail below with reference to the drawings and embodiments.
Referring to Fig. 1, the training stage produces an acoustic model independent of vocal-tract length and the GMM models used to compute the warping factor quickly at test time.
1. Train a single-Gaussian acoustic model with the unwarped acoustic features;
2. Compute each speaker's warping factor on the single-Gaussian acoustic model, and extract acoustic features with the best warping factor;
According to the transcription of the training data, a speaker-dependent list is compiled. Each speaker's data, warped with the different factors, is force-aligned on the single-Gaussian acoustic model, and the factor with the maximum likelihood is chosen as that speaker's best warping factor.
α ranges from 0.80 to 1.20 with a step size of 0.02.
3. Train the acoustic model with the warped acoustic features.
4. Train the multi-class GMMs according to the different warping factors, following the GMM training flow of Fig. 2.
Before training the GMMs, possibly silent parts of the speech are removed according to energy. Because there is very little data with warping factors below 0.88 or above 1.12, only the range 0.88-1.12 is used as classes when training the GMMs.
Test stage
1) Speech endpoint detection and sentence segmentation;
According to change points of the acoustic environment, the audio stream is cut into acoustically homogeneous fragments, and a silence-tracking algorithm cuts the longer fragments into sentences suitable for recognition.
2) Initialize the warping factor to 1;
Since there is no prior information at the beginning, a warping factor of 1 is chosen, i.e. no warping at all.
3) Every 5 frames, decide whether the segment is silence or speech; if it is speech, accumulate the likelihood on the GMM models and update the current best warping factor;
Silence segments contain no information about the speaker's vocal-tract length and may even corrupt the computation of the best warping factor. Every n = 5 frames the segment is tested for silence; if it is not silence, its likelihood is accumulated on the GMM models, and the factor with the maximum cumulative likelihood is taken as the current warping factor. The purpose of segmentation is to eliminate the influence of inaccurately detected silence without splitting continuous speech frame by frame; in addition, different weights can be assigned according to the condition of each segment.
Furthermore, by choosing the number of frames n per segment (3 < n < 15), the real-time behavior of the system can be controlled.
4) For an offline system, the factor with the maximum final cumulative likelihood is taken as the warping factor; for an online system, once the accumulated speech exceeds a set length, the features are warped with the factor of maximum cumulative likelihood at that moment;
5) Decode with the warped acoustic features.
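The offline/online distinction in step 4 can be sketched as a small decision rule: offline waits for the end of the utterance and uses the final accumulator, while online commits as soon as enough speech has accumulated. The function name and the frame threshold are illustrative assumptions, not from the patent.

```python
def choose_factor(cum_loglik, speech_frames, online, min_frames=100):
    """Offline: always the final argmax of the cumulative log-likelihoods.
    Online: commit to the current argmax once at least min_frames speech
    frames have been accumulated; before that, keep the unwarped default."""
    if online and speech_frames < min_frames:
        return 1.0
    return max(cum_loglik, key=cum_loglik.get)

cum = {0.98: -42.0, 1.00: -40.5, 1.02: -39.1}
print(choose_factor(cum, speech_frames=250, online=True))   # -> 1.02
print(choose_factor(cum, speech_frames=30, online=True))    # -> 1.0
```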
Claims (3)
- 1. A vocal-tract length normalization method suitable for fast online application, comprising a training stage and a test stage, with the following concrete steps. The flow of the training stage is: 1) train a single-Gaussian acoustic model with the unwarped acoustic features, θ₀ = argmax_θ ∏_{r=1}^{R} p(X_r | W_r; θ), where θ₀ is the single-Gaussian acoustic model, r = 1, ..., R indexes the R speakers, X is the unwarped acoustic feature, W is the transcription of the corresponding spoken content, α is the warping factor, and θ is an acoustic model; 2) compute each speaker's warping factor on the single-Gaussian acoustic model and extract acoustic features with the best warping factor, the best factor for each speaker being α_r = argmax_α p(X_r^α | W_r; θ₀), where α_r is the best warping factor of speaker r, X_r^α is the acoustic feature of speaker r warped with factor α, and W_r is the transcription of the content spoken by speaker r; 3) train the acoustic model θ′ with the warped acoustic features, θ′ = argmax_θ ∏_{r=1}^{R} p(X_r^{α_r} | W_r; θ), where θ′ is the normalized acoustic model. In addition, the flow of the test stage is: 1) first, in training, classify the unwarped features according to their corresponding best warping factors and train a Gaussian mixture model for each class, λ_α = argmax_λ p(X_α | λ), where X_α is the unwarped acoustic feature whose corresponding best warping factor is α; 2) second, during recognition, take the warping factor whose Gaussian mixture model gives the maximum likelihood for the unwarped acoustic features as the best warping factor α′, α′ = argmax_α p(X | λ_α); 3) then decode the warped features, W = argmax_W p(X^{α′} | W; θ′), where W is the recognition result and X^{α′} is the feature warped with factor α′.
- 2. The vocal-tract length normalization method suitable for fast online application according to claim 1, characterized in that the range of the warping factor α is 0.80-1.20 with a step size of 0.02.
- 3. The vocal-tract length normalization method suitable for fast online application according to claim 1, characterized in that the range of the warping factor α is 0.88-1.12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2008100979810A CN101447182B (en) | 2007-11-28 | 2008-05-21 | Vocal-tract length normalization method capable of fast online application |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200710195420 | 2007-11-28 | ||
CN200710195420.X | 2007-11-28 | ||
CN2008100979810A CN101447182B (en) | 2007-11-28 | 2008-05-21 | Vocal-tract length normalization method capable of fast online application |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101447182A CN101447182A (en) | 2009-06-03 |
CN101447182B true CN101447182B (en) | 2011-11-09 |
Family
ID=40742822
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2008100979810A Expired - Fee Related CN101447182B (en) | 2007-11-28 | 2008-05-21 | Vocal-tract length normalization method capable of fast online application |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101447182B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102486922B (en) * | 2010-12-03 | 2014-12-03 | 株式会社理光 | Speaker recognition method, device and system |
CN102810311B (en) * | 2011-06-01 | 2014-12-03 | 株式会社理光 | Speaker estimation method and speaker estimation equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5696878A (en) * | 1993-09-17 | 1997-12-09 | Panasonic Technologies, Inc. | Speaker normalization using constrained spectra shifts in auditory filter domain |
US6823305B2 (en) * | 2000-12-21 | 2004-11-23 | International Business Machines Corporation | Apparatus and method for speaker normalization based on biometrics |
CN1591570A (en) * | 2003-08-13 | 2005-03-09 | 松下电器产业株式会社 | Bubble splitting for compact acoustic modeling |
US7003465B2 (en) * | 2000-10-12 | 2006-02-21 | Matsushita Electric Industrial Co., Ltd. | Method for speech recognition, apparatus for the same, and voice controller |
Non-Patent Citations (2)
Title |
---|
JP 2002-189491 A 2002.07.05 |
JP 2003-022088 A 2003.01.24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20111109 |
CF01 | Termination of patent right due to non-payment of annual fee |