CN105280181B

CN105280181B - Training method for a language identification model and language identification method

Info

Publication number: CN105280181B
Application number: CN201410336650.3A
Authority: CN
Inventors: 周若华; 王宪亮; 颜永红; 索宏彬
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2014-07-15
Filing date: 2014-07-15
Publication date: 2018-11-13
Anticipated expiration: 2034-07-15
Also published as: CN105280181A

Abstract

The present invention relates to a training method for a language identification model and to a language identification method, comprising: extracting the phoneme posterior probabilities of the training speech data, transforming them to the log domain, and applying dimensionality reduction and mean-variance normalization to obtain phoneme-related features; computing Baum-Welch statistics from the phoneme-related features and extracting phoneme variability factors from those statistics; modeling the phoneme variability factors to build SVM models (the language identification model); and scoring the phoneme variability factor of the speech data to be identified against the SVM models, applying mean-variance normalization to the scores, and calibrating the normalized scores with linear discriminant analysis and a Gaussian back-end to obtain the final recognition result. Compared with traditional language identification methods, this method reduces computational complexity and markedly improves recognition performance, which makes it highly practical.

Description

Training method for a language identification model and language identification method

Technical field

The present invention relates to methods for identifying the language of speech data, and more particularly to a language identification method based on phoneme-related features.

Background art

With the globalization of information in modern society, language identification has become one of the research hotspots of speech recognition technology. The goal of language identification is to build a machine that can, to some extent, imitate human thinking in identifying the language of speech: it extracts from the speech signal the information that distinguishes each language and decides on that basis which language is spoken. The features extracted from the speech signal directly affect the identification result.

Mainstream language identification techniques fall into two broad classes: those based on acoustic spectral features and those based on phoneme features.

Acoustic spectral features here refer to the shifted delta cepstral features of the Mel cepstrum (MSDC) (document [1]: P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, R. J. Greene, D. A. Reynolds, and J. R. Deller Jr., "Approaches to language identification using Gaussian mixture models and shifted delta cepstral features," in Seventh International Conference on Spoken Language Processing, Citeseer, 2002). Methods based on acoustic spectral features take cepstral features extracted from the speech as its representation and then model those features, without using any pronunciation information. Modeling usually relies on Gaussian mixture models (GMM) (document [2]: L. Burget, P. Matejka, and J. Cernocky, "Discriminative training techniques for acoustic language identification," International Conference on Acoustics, Speech, and Signal Processing, vol. 1, 2006) and support vector machines (SVM) (document [3]: W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo, "Support vector machines for speaker and language recognition," Computer Speech Language, vol. 20, no. 2-3, pp. 210-229, 2006). The i-vector system based on factor analysis (document [4]: Najim Dehak, Pedro A. Torres-Carrasquillo, Douglas A. Reynolds, and Reda Dehak, "Language recognition via i-vectors and dimensionality reduction," in INTERSPEECH, 2011, pp. 857-860) has achieved good performance in language identification and is widely used. The i-vector method defines a low-dimensional space, called the total variability space, that contains both the speaker space and the channel space, and represents the high-dimensional Gaussian supervector as a low-dimensional total variability factor. Experiments show that the low-dimensional total variability factor can fully characterize the high-dimensional Gaussian supervector. After being introduced into language identification, the method quickly became the mainstream approach to acoustic modeling, and much language identification research builds on it. However, research on the i-vector method in language identification has been limited to acoustic spectral features and has not been extended to phoneme features, which carry rich pronunciation information.

Language identification systems based on phoneme features use a phoneme recognizer to decode the speech into a phoneme sequence or phoneme lattice and then model the language using grammatical features (document [5]: W. M. Campbell, F. Richardson, and D. A. Reynolds, "Language recognition with word lattices and support vector machines," International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 2007). PPRVSM (H. Li, B. Ma, and C.-H. Lee, "A vector space modeling approach to spoken language identification," IEEE Transactions on Audio, Speech, and Language Processing, 15 (1) (2007) 271-284) introduces the vector space model into phoneme-recognition-based language identification: the phoneme sequence or phoneme lattice is treated as "text", discriminative phoneme strings are extracted from it as feature terms to form a feature vector, and a support vector machine then performs the classification. This approach achieves good language identification performance.

Because traditional systems based on phoneme features take the pronunciation characteristics of speech into account, their recognition performance is better than that of systems based on acoustic spectral features. However, decoding phoneme sequences is computationally expensive and slow, so such systems are seldom used in practical applications.

Summary of the invention

The object of the invention is to overcome the defect that traditional methods based on acoustic spectral features do not include pronunciation information, and the defect that traditional methods based on phoneme features must decode phoneme sequences at high computational cost and long running time, and thereby to provide a language identification method that reduces computational complexity and improves recognition performance.

To achieve this object, the present invention provides a training method for a language identification model and a language identification method. The training method for the language identification model includes the following steps:

Step 1-1): collect a certain amount of target-language speech data as training utterances, and extract the phoneme posterior probabilities of the training utterances;

Step 1-2): transform the phoneme posterior probabilities to the log domain, apply dimensionality reduction, and apply mean-variance normalization (MVN) to the reduced features to obtain the phoneme-related features;

Let $x_{it}$ be the log-domain phoneme posterior probability vector of frame $t$ of the $i$-th training utterance, $T_i$ the number of frames of the $i$-th training utterance, and $m_i$ the mean of the log-domain phoneme posterior probabilities over all frames of the $i$-th training utterance, obtained as follows:

The covariance matrix $C$ can be computed from the per-utterance means of the log-domain phoneme posterior probabilities, as follows:

where $N$ is the number of training utterances;

The PCA transformation matrix $A_{PCA}$ is formed from the eigenvectors corresponding to the $L$ largest eigenvalues of the covariance matrix (the value of $L$ is close to the number of phonemes). The reduced features are obtained from the PCA transformation matrix $A_{PCA}$ and the log-domain phoneme posterior probability vectors, with the expression:

Mean-variance normalization (MVN) is applied to the reduced features $y_{it}$ to obtain the phoneme-related feature vectors $z_{it}$.
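For orientation, one set of expressions consistent with the definitions in step 1-2) is sketched below. These are assumed forms, not quoted from the specification: the global mean $\bar{m}$ is a symbol introduced here for illustration, and taking the rows of $A_{PCA}$ to be the selected eigenvectors is also an assumption.

```latex
\begin{align}
m_i &= \frac{1}{T_i}\sum_{t=1}^{T_i} x_{it} \\
C &= \frac{1}{N}\sum_{i=1}^{N} \left(m_i - \bar{m}\right)\left(m_i - \bar{m}\right)^{\top},
\qquad \bar{m} = \frac{1}{N}\sum_{i=1}^{N} m_i \\
y_{it} &= A_{PCA}\, x_{it}
\end{align}
```

Under these assumptions $A_{PCA}$ has $L$ rows, so each $y_{it}$ is an $L$-dimensional vector.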

Step 1-3): compute the Baum-Welch statistics from the phoneme-related features;

Here $c$ indexes the Gaussian components, $\Omega$ is the variance of the universal background model (UBM), $p(c \mid z_{it}, \Omega)$ is the probability that frame $t$ belongs to the $c$-th Gaussian component, and $\mu_c$ is the mean vector of the $c$-th Gaussian component.
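The symbols defined above match the zeroth-order and centered first-order Baum-Welch statistics commonly used in factor-analysis front ends. A sketch of those standard forms, given here as an assumption about what step 1-3) computes, is:

```latex
\begin{align}
N_c &= \sum_{t=1}^{T_i} p\left(c \mid z_{it}, \Omega\right) \\
F_c &= \sum_{t=1}^{T_i} p\left(c \mid z_{it}, \Omega\right)\left(z_{it} - \mu_c\right)
\end{align}
```

$N_c$ is a soft count of the frames assigned to component $c$, and $F_c$ accumulates the offsets of those frames from the component mean.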

Step 1-4): extract the phoneme variability factor from the Baum-Welch statistics;

The phoneme variability factor $w$ of the $i$-th training utterance is obtained from:

$$w = \left(I + T^{\top}\Sigma^{-1}N(i)\,T\right)^{-1} T^{\top}\Sigma^{-1}F(i) \qquad (6)$$

where $N(i)$ is a diagonal matrix whose diagonal blocks are $N_c I$, $F(i)$ is obtained by concatenating the first-order Baum-Welch statistics $F_c$, and $\Sigma$ and $T$ are trained with the EM algorithm during factor analysis.
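Formula (6) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: $\Sigma$ is taken to be diagonal and stored as a vector, the supervector is laid out component by component, and the function name `extract_variability_factor` is hypothetical.

```python
import numpy as np

def extract_variability_factor(N, F, T_mat, sigma_diag):
    """Evaluate w = (I + T^T S^-1 N(i) T)^-1 T^T S^-1 F(i).

    N          : (C,)      zeroth-order statistics N_c
    F          : (C, D)    first-order statistics F_c
    T_mat      : (C*D, R)  loading matrix T
    sigma_diag : (C*D,)    diagonal of Sigma (assumed diagonal here)
    """
    C, D = F.shape
    R = T_mat.shape[1]
    # Diagonal of N(i): each N_c repeated D times (blocks N_c * I).
    n_diag = np.repeat(N, D)
    # F(i): the F_c vectors concatenated into one supervector.
    f_super = F.reshape(C * D)
    # T^T Sigma^-1, using the diagonal structure instead of a full inverse.
    Tt_Sinv = T_mat.T / sigma_diag
    precision = np.eye(R) + (Tt_Sinv * n_diag) @ T_mat
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(precision, Tt_Sinv @ f_super)
```

Solving the R-by-R system avoids an explicit matrix inverse, and only an R-by-R matrix is ever built because the diagonal structure of $N(i)$ and $\Sigma$ is used directly.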

Step 1-5): model the phoneme variability factors to build the language identification model;

Using one-versus-one and one-versus-rest strategies, the phoneme variability factors are modeled with SVMs to build SVM models; these SVM models are the language identification model.

The language identification method provided by the invention is based on the training method for the language identification model described above and includes the following steps:

Step 2-1): extract the phoneme posterior probabilities of the speech data to be identified;

Step 2-2): transform the phoneme posterior probabilities to the log domain, apply dimensionality reduction, and apply mean-variance normalization (MVN) to the reduced features to obtain the phoneme-related features;

Step 2-3): compute the Baum-Welch statistics from the phoneme-related features;

Step 2-4): extract the phoneme variability factor from the Baum-Welch statistics;

Step 2-5): score the phoneme variability factor against the SVM models, apply mean-variance normalization to the scores, and calibrate the normalized scores with linear discriminant analysis (LDA) and a Gaussian back-end to obtain the final recognition result;

The mean-variance normalization is computed as:

where $M$ is the number of support vector machine models, $s_m$ is the initial score of the $m$-th SVM model, $\mu$ and $\sigma$ are respectively the mean and standard deviation of all SVM model scores for the test data, $k$ is an adjustable parameter, and $s''_m$ is the normalized score.

The advantages of the invention are:

1. The pronunciation characteristics of the language are taken into account, so the differences between languages are more pronounced;

2. Phoneme-related features are used for factor analysis, which improves the language identification performance of the system;

3. The decoding step of traditional phoneme-feature-based recognition systems is eliminated, which greatly reduces the computational complexity of the system.

Description of the drawings

Fig. 1 is a flow chart of the training method for the language identification model;

Fig. 2 is a flow chart of the language identification method.

Detailed description

The present invention is described in further detail below with reference to the drawings.

With reference to Fig. 1, the training method for the language identification model proceeds as follows:

Step 1-1): collect a certain amount of target-language speech data as training data, and extract the phoneme posterior probabilities;

A certain amount of target-language speech data is collected as training data. Conventional speech front-end processing removes invalid segments such as silence and music from the training data and keeps the valid speech. Temporal pattern (TRAP) features are then extracted separately for the left band and the right band. Each frame is 25 ms long with a frame shift of 10 ms, and the left and right bands each take the features of 15 frames, so the features of each frame cover a span of 310 ms around it. The TRAP features of the left and right bands are fed into separate artificial neural networks to obtain the phoneme posterior probabilities of the two bands. The posterior probabilities of the two bands are concatenated and processed by another artificial neural network, which finally yields 159-dimensional phoneme posterior probabilities.

Step 1-2): transform the phoneme posterior probabilities to the log domain, apply dimensionality reduction using principal component analysis (PCA), and apply mean-variance normalization (MVN) to the reduced features to obtain the phoneme-related features;

The covariance matrix $C$ can be computed from the means $m_i$ of the log-domain phoneme posterior probabilities over all frames of the training data, as follows:

where $N$ is the number of training utterances;

The PCA transformation matrix $A_{PCA}$ is formed from the eigenvectors corresponding to the 56 largest eigenvalues of the covariance matrix. The reduced features are obtained from the PCA transformation matrix $A_{PCA}$ and the log-domain phoneme posterior probability vectors, with the expression:

Mean-variance normalization (MVN) is applied to the reduced features $y_{it}$ to remove the influence of pronunciation variation, yielding the phoneme-related features $z_{it}$, which have 56 dimensions.
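The log-domain transform, PCA reduction, and MVN of step 1-2) can be sketched as below. This is a minimal illustration, not the patented implementation: the covariance is assumed to be that of the per-utterance means about their global mean, MVN is assumed to be applied per utterance, the probability floor guards against log(0), and all function names are hypothetical.

```python
import numpy as np

def to_log_domain(posteriors, floor=1e-10):
    """Log of phoneme posteriors, floored to avoid log(0) (floor is an assumption)."""
    return np.log(np.maximum(posteriors, floor))

def fit_pca(log_posteriors_per_utt, n_components):
    """Build A_PCA from the covariance of the per-utterance means m_i."""
    means = np.stack([x.mean(axis=0) for x in log_posteriors_per_utt])  # m_i
    centered = means - means.mean(axis=0)
    cov = centered.T @ centered / len(means)            # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(cov)              # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]    # largest first
    return eigvecs[:, order].T                          # rows are eigenvectors

def phoneme_features(posteriors, a_pca):
    """Map one utterance (frames x phonemes) to normalized features z_it."""
    x = to_log_domain(posteriors)
    y = x @ a_pca.T                                     # reduced features y_it
    # Per-utterance mean-variance normalization (assumed granularity).
    return (y - y.mean(axis=0)) / (y.std(axis=0) + 1e-12)
```

With the configuration described above, `posteriors` would have 159 columns and `n_components` would be 56. Because the covariance of N utterance means has rank at most N - 1, more than 56 training utterances are needed for 56 informative components.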

Here $c$ indexes the Gaussian components, $\Omega$ is the variance of the universal background model (UBM), $p(c \mid z_{it}, \Omega)$ is the probability that frame $t$ belongs to the $c$-th Gaussian component, and $\mu_c$ is the mean vector of the $c$-th Gaussian component. The number of Gaussian components is 1024.
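A sketch of how the Baum-Welch statistics might be accumulated for one utterance is given below. It assumes a UBM with diagonal covariances and the standard zeroth-order and centered first-order statistics; the function name and argument layout are hypothetical.

```python
import numpy as np

def baum_welch_stats(z, weights, means, variances):
    """Zeroth- and centered first-order statistics for one utterance.

    z         : (T, D) phoneme-related features z_it
    weights   : (C,)   UBM mixture weights
    means     : (C, D) UBM component means mu_c
    variances : (C, D) UBM diagonal variances (diagonal form is an assumption)
    """
    diff = z[:, None, :] - means[None, :, :]                     # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                       # (T, C)
    # Normalize in the log domain for numerical stability.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                      # p(c | z_it)
    N = post.sum(axis=0)                                         # N_c
    F = post.T @ z - N[:, None] * means                          # F_c, centered
    return N, F
```

The posteriors of each frame sum to one, so the entries of `N` sum to the number of frames. With 1024 components and 56-dimensional features, `F` has shape (1024, 56).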

The phoneme variability factor $w$ of the $i$-th training utterance is:

$$w = \left(I + T^{\top}\Sigma^{-1}N(i)\,T\right)^{-1} T^{\top}\Sigma^{-1}F(i)$$

Using one-versus-one and one-versus-rest strategies, the phoneme variability factors are modeled with SVMs to build M SVM models; these SVM models are the language identification model.

With reference to Fig. 2, the language identification method includes the following steps:

Step 2-1): extract the temporal pattern features of the speech data to be identified, and extract the phoneme posterior probabilities using artificial neural networks;

Step 2-2): transform the phoneme posterior probabilities to the log domain, apply dimensionality reduction using principal component analysis (PCA), and apply mean-variance normalization (MVN) to the reduced features to obtain the phoneme-related features;

Step 2-5): score the phoneme variability factor against the M SVM models, apply mean-variance normalization to the scores, and calibrate the normalized scores with linear discriminant analysis (LDA) and a Gaussian back-end to obtain the final recognition result;

The mean-variance normalization is computed as:

where $s_m$ is the initial score of the $m$-th SVM model, $\mu$ and $\sigma$ are respectively the mean and standard deviation of all SVM model scores for the test data, $k = 100$, and $s''_m$ is the normalized score.

Experiments were conducted on the 2011 Language Recognition Evaluation (LRE) data set of the U.S. National Institute of Standards and Technology (NIST). The test covers 24 target languages. The performance metrics are EER (equal error rate), minDCF (minimum detection cost), minCavg₂₇₆ (minimum average cost over the 276 language pairs), actCavg₂₇₆ (actual average cost over the 276 language pairs), minCavg₂₄ (minimum average cost over the 24 worst language pairs), and actCavg₂₄ (actual average cost over the 24 worst language pairs). S1 denotes the traditional i-vector method, S2 denotes the traditional phoneme-feature-based PPRVSM method, and S3 denotes the language identification method proposed by the present invention. A Russian phoneme recognizer was used in the tests. The performance metrics of the three methods are compared in Table 1.

Table 1

Method	EER	minDCF	minCavg₂₇₆	actCavg₂₇₆	minCavg₂₄	actCavg₂₄
S1	6.65	6.86	2.45	3.45	11.88	14.52
S2	7.62	8.04	2.67	4.68	11.68	14.09
S3	5.52	5.75	1.45	2.68	9.00	12.01

The language identification method proposed by the present invention uses phoneme-related features, which are tied to the pronunciation content, for factor analysis. The test results show that, compared with the traditional i-vector method, the proposed method achieves a relative improvement of 16%-41% in recognition performance, and compared with the traditional phoneme-feature-based PPRVSM method, it achieves a relative improvement of 15%-46%.
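The quoted ranges can be checked against Table 1 by computing, for each metric, the relative reduction of S3 with respect to each baseline. The short script below does this arithmetic; the variable and function names are illustrative only.

```python
# Scores copied from Table 1 (lower is better for every metric).
metrics = ["EER", "minDCF", "minCavg276", "actCavg276", "minCavg24", "actCavg24"]
s1 = [6.65, 6.86, 2.45, 3.45, 11.88, 14.52]   # traditional i-vector
s2 = [7.62, 8.04, 2.67, 4.68, 11.68, 14.09]   # PPRVSM
s3 = [5.52, 5.75, 1.45, 2.68, 9.00, 12.01]    # proposed method

def relative_reduction(baseline, proposed):
    """Relative reduction in percent, metric by metric."""
    return [100.0 * (b - p) / b for b, p in zip(baseline, proposed)]

vs_ivector = relative_reduction(s1, s3)
vs_pprvsm = relative_reduction(s2, s3)

for name, a, b in zip(metrics, vs_ivector, vs_pprvsm):
    print(f"{name}: {a:.1f}% vs S1, {b:.1f}% vs S2")
```

The smallest and largest reductions against S1 come from minDCF and minCavg₂₇₆, and against S2 from actCavg₂₄ and minCavg₂₇₆. Rounded to whole percentages they reproduce the 16%-41% and 15%-46% ranges.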

Claims

1. A training method for a language identification model, comprising:

Step 1-1): collecting a certain amount of target-language speech data as training utterances, and extracting the phoneme posterior probabilities of the training utterances;

Step 1-2): transforming the phoneme posterior probabilities to the log domain, applying dimensionality reduction, and applying mean-variance normalization to the reduced features to obtain the phoneme-related features;

The calculation in step 1-4) is:

The phoneme variability factor $w$ of the $i$-th training utterance is:

$$w = \left(I + T^{\top}\Sigma^{-1}N(i)\,T\right)^{-1} T^{\top}\Sigma^{-1}F(i)$$

where $N(i)$ is a diagonal matrix whose diagonal blocks are $N_c I$, $F(i)$ is obtained by concatenating the first-order Baum-Welch statistics $F_c$, and $\Sigma$ and $T$ are trained with the expectation-maximization algorithm during factor analysis;

The process of step 1-5) is: using one-versus-one and one-versus-rest strategies, the phoneme variability factors are modeled with support vector machines to build support vector machine models, and the support vector machine models are the language identification model.

2. The training method for a language identification model according to claim 1, characterized in that the calculation in step 1-2) is:

The transformation matrix $A_{PCA}$ is formed from the eigenvectors corresponding to the $L$ largest eigenvalues of the covariance matrix, the covariance matrix being defined as follows:

where $N$ is the number of training utterances and $m_i$ is the mean of the log-domain phoneme posterior probabilities over all frames of the $i$-th training utterance, obtained as follows:

$x_{it}$ is the log-domain phoneme posterior probability vector of frame $t$ of the $i$-th training utterance, $T_i$ is the number of frames of the $i$-th training utterance, and the reduced features are:

Mean-variance normalization is applied to the reduced features $y_{it}$ to obtain the phoneme-related feature vectors $z_{it}$.

3. The training method for a language identification model according to claim 1, characterized in that the Baum-Welch statistics in step 1-3) are computed as:

where $c$ indexes the Gaussian components, $\Omega$ is the variance of the universal background model, $p(c \mid z_{it}, \Omega)$ is the probability that frame $t$ belongs to the $c$-th Gaussian component, and $\mu_c$ is the mean vector of the $c$-th Gaussian component.

4. A language identification method based on the training method for a language identification model according to any one of claims 1-3, the method comprising the following steps:

Step 2-1): extracting the phoneme posterior probabilities of the utterance to be identified;

Step 2-2): transforming the phoneme posterior probabilities to the log domain, applying dimensionality reduction, and applying mean-variance normalization to the reduced features to obtain the phoneme-related features;

Step 2-5): scoring the phoneme variability factor against the language identification model, applying mean-variance normalization to the scores, and calibrating the normalized scores with linear discriminant analysis and a Gaussian back-end to obtain the final recognition result.

5. The language identification method according to claim 4, characterized in that the mean-variance normalization in step 2-5) is computed as:

where $M$ is the number of language identification models, $s_m$ is the initial score of the $m$-th language identification model, $\mu$ and $\sigma$ are respectively the mean and standard deviation of the scores of all language identification models, $k$ is an adjustable parameter, and $s''_m$ is the normalized score.