CN105280181A

CN105280181A - Training method for language recognition model and language recognition method

Info

Publication number: CN105280181A
Application number: CN201410336650.3A
Authority: CN
Inventors: 周若华; 王宪亮; 颜永红; 索宏彬
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2014-07-15
Filing date: 2014-07-15
Publication date: 2016-01-27
Anticipated expiration: 2034-07-15
Also published as: CN105280181B

Abstract

The invention relates to a training method for a language recognition model and a language recognition method. The language recognition method comprises the steps: extracting the phoneme posterior probability of speech data, converting the phoneme posterior probability into a log domain, conducting dimensionality reduction and mean and variance normalization, and then obtaining phoneme associated features; calculating Baum-Welch statistical magnitude by means of the phoneme associated features, and extracting a phoneme variance factor through the Baum-Welch statistical magnitude; modeling the phoneme variance factor, and establishing an SVM model (a language recognition model); and marking the SVM model by a phoneme variance factor of to-be-recognized speech data, conducting mean and variance normalization on the score, performing linear discriminant analysis and Gauss back end normalization on the normalized score to realize score correction, and finally obtaining a recognition result. Compared with a conventional language recognition method, the language recognition method of the invention has the advantages that the calculation complexity is reduced, the language recognition performance is obviously improved, and the method is highly practical.

Description

A kind of training method of languages model of cognition and Language Identification

Technical field

The present invention relates to the recognition methods of speech data language information, more particularly, the present invention relates to the Language Identification based on phoneme correlated characteristic.

Background technology

Along with the globalization of modern society's information, languages identification becomes one of speech recognition technology study hotspot.The object of languages recognition technology to manufacture Language Identification is carried out in a kind of apish thinking to a certain extent machine to voice, from voice signal, namely extract the different information of each languages, and languages belonging to judging on this basis.The phonic signal character extracted directly has influence on the result of languages identification.

The languages recognition technology of main flow comprises based on the identification of acoustics spectrum signature with based on the large class of phoneme feature identification two.

Acoustics spectrum signature refers to shift differential spectrum signature (MSDC) (document [1] P.A.Torres-Carrasquillo of Mel-cepstrum, E.Singer, M.A.Kohler, R.J.Greene, D.A.Reynolds, andJRDellerJr, " ApproachestolanguageidenticationusingGaussianmixturemode lsandshifteddeltacepstralfeatures, " inSeventhInternationalConferenceonSpokenLanguageProcessi ng.Citeseer, 2002.), model method based on acoustics spectrum signature is using the feature of the cepstrum feature extracted from voice as these voice, then modeling is carried out to these features, do not relate to the pronunciation information of voice.Modeling uses gauss hybrid models (GMM) (document [2] L.Burget usually, P.MatejkaandJ.Cernocky, " Discriminativetrainingtechniquesforacousticlanguageident ification ", InternationalConferenceonAcoustics, Speech, andSignalProcessing, vol.1, 2006.) and supporting vector machine model (SVM) (document [3] W.M.Campbell, J.P.Campbell, D.A.Reynolds, E.SingerandP.A.Torres-Carrasquillo, " Supportvectormachinesforspeakerandlanguagerecognition ", ComputerSpeechLanguage, vol.20, no.2-3, pp.210-229, 2006.).Based on ivector system (document [4] NajimDehak of factorial analysis, PedroATorres-Carrasquillo, DouglasAReynolds, andRedaDehak, " Languagerecognitionviai-vectorsanddimensionalityreductio n., " inINTERSPEECH, 2011, pp.857 – 860.) in languages identification, achieve good performance, be widely used.Ivector method defines the lower dimensional space that is called total variation factor space, this space contains speaker space and channel space simultaneously, then Gauss's super vector of higher-dimension is expressed as the total variation factor of low-dimensional, experiment proves, the total variation factor of low-dimensional can Gauss's super vector of Complete Characterization higher-dimension.The method becomes rapidly the main stream approach of Acoustic Modeling after introducing languages identification, and the research of being permitted multi-speech recognition is all carried out on the method basis.But, in languages identification, acoustics spectrum signature is just confined to for the research of Ivector method, is not generalized to the phoneme feature comprising abundant phonetic pronunciation information.

Language recognition system based on phoneme feature adopts phoneme recognizer to carry out decoding to voice and obtains aligned phoneme sequence or phoneme lattice, then uses grammatical feature to carry out modeling to languages.Document [5] (W.M.Campbell, F.RichardsonandD.A.Reynolds, " Languagerecognitionwithwordlatticesandsupportvectormachi nes ", InternationalConferenceonAcoustics, Speech, andSignalProcessing, vol.4,2007.).PPRVSM (H.Li, B.Ma, C.-H.Lee, Avectorspacemodelingapproachtospokenlanguageidentificati on, Audio, Speech, andLanguageProcessing, IEEETransactionson15 (1) (2007) 271 – 284) vector space model is introduced based in the languages recognition technology of phoneme recognition, aligned phoneme sequence or phoneme lattice are considered as " text ", extract from aligned phoneme sequence or phoneme lattice and have distinctive phone string as characteristic item composition characteristic vector, then support vector machine is adopted to classify, obtain good languages recognition performance.

Traditional recognition system based on phoneme feature considers the pronunciation character of voice, recognition performance is better than based on acoustics spectrum signature system, but due to decoding aligned phoneme sequence computation complexity high, long operational time, therefore seldom uses in systems in practice.

Summary of the invention

The object of the invention is to overcome traditional defect not comprising phonetic pronunciation information based on acoustics spectrum signature method, overcome tradition high based on phoneme characterization method decoding aligned phoneme sequence computation complexity, the defect of long operational time, thus a kind of reduction computational complexity is provided, improve the Language Identification of recognition performance.

To achieve these goals, the invention provides a kind of training method and Language Identification of languages model of cognition, wherein the training method of languages model of cognition comprises the steps:

Step 1-1), the target language speech data gathering some, as training statement, extracts the phoneme posterior probability of training statement;

Step 1-2), phoneme posterior probability is transformed into log-domain, and carries out dimensionality reduction, mean variance regular (MVN) is carried out to the feature after dimensionality reduction, obtains phoneme correlated characteristic;

Suppose x _itthe t frame log-domain phoneme posterior probability vector of i-th training statement, T _ithe frame number of i-th training statement, m _ithe average of i-th training statement all frame log-domains phoneme posterior probability, as shown in the formula obtaining:

m_{i} = \frac{1}{T_{i}} Σ_{t = 1}^{T_{i}} x_{it} - - - (1)

Covariance matrix C can be calculated by the average of training statement all frame log-domains phoneme posterior probability, as shown in the formula:

C = \frac{1}{N} Σ_{i = 1}^{N} (m_{i} - \frac{1}{N} Σ_{i = 1}^{N} m_{i}) {(m_{i} - \frac{1}{N} Σ_{i = 1}^{N} m_{i})}^{T} - - - (2)

Wherein, N is the number of training statement;

The latent vector corresponding by L before covariance matrix (L value is close with phoneme number) individual dominant eigenvalue generates PCA transition matrix A _pCA, by described PCA transition matrix A _pCAobtain the feature after dimensionality reduction with log-domain phoneme posterior probability vector, its expression formula is:

y_{it} = A_{PCA}^{T} x_{it} - - - (3)

To the feature y after dimensionality reduction _itcarry out mean variance regular (MVN), obtain phoneme correlated characteristic vector z _it.

Step 1-3), utilize phoneme correlated characteristic to calculate Baum-Welch statistic;

N_{c} = Σ_{t = 1}^{T_{i}} P (c | z_{it}, Ω) - - - (4)

F_{c} = Σ_{t = 1}^{T_{i}} P (c | z_{it}, Ω) (z_{it} - μ_{c}) - - - (5)

C is gaussian component, and Ω is the variance of global context model (UBM), p (c|z _it, Ω) and represent that t frame belongs to the probability of c gaussian component, μ _cit is the mean vector of c gaussian component.

Step 1-4), utilize Baum-Welch statistic to extract the phoneme variable quantity factor;

I-th training statement phoneme variable quantity factor w obtained by following formula:

w＝(I+T ^TΣ ^-1N(i)T) ^-1T ^TΣ ^-1F(i)(6)

Wherein, N (i) is diagonal matrix, and on diagonal line, element is N _ci, F (i) are by single order Baum-Welch statistic F _csplicing obtains, Σ and T is obtained by EM Algorithm for Training in Factor Analysis.

Step 1-5), modeling is carried out to the phoneme variable quantity factor, sets up languages model of cognition;

Adopt one to one with one-to-many strategy, use SVM modeling is carried out to the phoneme variable quantity factor, set up SVM model, SVM model is described languages model of cognition.

A kind of Language Identification provided by the invention, based on the training method of the languages model of cognition of technique scheme, comprises the steps:

Step 2-1), extract the phoneme posterior probability of speech data to be identified;

Step 2-2), phoneme posterior probability is transformed into log-domain, and carries out dimensionality reduction, mean variance regular (MVN) is carried out to the feature after dimensionality reduction, obtains phoneme correlated characteristic;

Step 2-3), utilize phoneme correlated characteristic to calculate Baum-Welch statistic;

Step 2-4), utilize Baum-Welch statistic to extract the phoneme variable quantity factor;

Step 2-5), by the SVM model marking described in phoneme variable quantity factor pair, and it is regular to carry out mean variance to score, use linear discriminant to analyze (LDA) to the score after regular and Gauss rear end is regular carries out score correction, obtain final recognition result;

The regular computation process of described mean variance is:

s_{m}^{'} = \exp (\frac{s_{m} - μ}{kσ}) - - - (7)

s_{m}^{''} = \log \frac{(M - 1) s_{m}^{'}}{Σ_{j = 1}^{M} s_{j}^{'} - s_{m}^{'}} - - - (8)

Wherein, M is the number of described supporting vector machine model, s _mbe the initial score of m SVM model, μ and σ is respectively average and the standard deviation of all SVM model score of this test data, and k is adjustable parameter, s " _mit is regular rear score.

The invention has the advantages that:

1, consider the pronunciation character of language, the different information between languages is more obvious;

2, phoneme correlated characteristic is used for factorial analysis, improves the performance of system languages identification;

3, eliminate the decode procedure of tradition based on phoneme Feature Recognition System, greatly reduce the computation complexity of system.

Accompanying drawing explanation

Fig. 1 is a kind of process flow diagram of training method of languages model of cognition;

Fig. 2 is a kind of process flow diagram of Language Identification.

Embodiment

Now by reference to the accompanying drawings the present invention is described in further detail;

With reference to figure 1, a kind of flow process of training method of languages model of cognition comprises:

Step 1-1), gather the target language speech data of some as training data, extract phoneme posterior probability;

The target language speech data gathering some is as training data, and by traditional speech data front-end processing, quiet to training data excision, the invalid voice such as music, retain efficient voice; Then respectively to left frequency band and right frequency band extraction time pattern (TRAP) feature; The frame length of each frame is 25ms, and frame moves as 10ms, and the feature of 15 frames got respectively by left and right frequency band, and therefore the feature of each frame comprises the duration of 310ms around.The TRAP feature of left and right frequency band sends into the phoneme posterior probability that artificial neural network obtains two frequency bands respectively, the phoneme posterior probability of two frequency bands is stitched together, and uses another person's artificial neural networks to process, finally obtains the phoneme posterior probability of 159 dimensions.

Step 1-2), phoneme posterior probability is transformed into log-domain, uses principal component analysis (PCA) technology (PCA) to carry out dimensionality reduction, mean variance regular (MVN) is carried out to the feature after dimensionality reduction, obtains phoneme correlated characteristic;

m_{i} = \frac{1}{T_{i}} Σ_{t = 1}^{T_{i}} x_{it}

By the average m of training data all frame log-domains phoneme posterior probability _icovariance matrix C can be calculated, as shown in the formula:

C = \frac{1}{N} Σ_{i = 1}^{N} (m_{i} - \frac{1}{N} Σ_{i = 1}^{N} m_{i}) {(m_{i} - \frac{1}{N} Σ_{i = 1}^{N} m_{i})}^{T}

Wherein, N is the number of training statement;

PCA transition matrix A is generated by the latent vector that front 56 dominant eigenvalues of covariance matrix are corresponding _pCA, by described PCA transition matrix A _pCAobtain the feature after dimensionality reduction with log-domain phoneme posterior probability vector, its expression formula is:

y_{it} = A_{PCA}^{T} x_{it}

To the feature y after dimensionality reduction _itcarry out mean variance regular (MVN), remove the impact of pronunciation change, obtain phoneme correlated characteristic z _it, intrinsic dimensionality is 56 dimensions.

N_{c} = Σ_{t = 1}^{T_{i}} P (c | z_{it}, Ω)

F_{c} = Σ_{t = 1}^{T_{i}} P (c | z_{it}, Ω) (z_{it} - μ_{c})

C is gaussian component, and Ω is the variance of global context model (UBM), p (c|z _it, Ω) and represent that t frame belongs to the probability of c gaussian component, μ _cit is the mean vector of c gaussian component.Gaussage gets 1024.

I-th training statement phoneme variable quantity factor w be:

w＝(I+T ^tΣ ^-1N(i)T) ^-1T ^tΣ ^-1F(i)

Adopt one to one with one-to-many strategy, use SVM modeling is carried out to the phoneme variable quantity factor, set up M SVM model, SVM model is described languages model of cognition.

With reference to figure 2, a kind of flow process of Language Identification comprises the steps:

Step 2-1), extract the temporal mode feature of speech data to be identified, end user's artificial neural networks extracts phoneme posterior probability;

Step 2-2), phoneme posterior probability is transformed into log-domain, uses principal component analysis (PCA) technology (PCA) to carry out dimensionality reduction, mean variance regular (MVN) is carried out to the feature after dimensionality reduction, obtains phoneme correlated characteristic;

Step 2-5), by M SVM model marking described in phoneme variable quantity factor pair, and it is regular to carry out mean variance to score, use linear discriminant to analyze (LDA) to regular rear score and Gauss rear end is regular carries out score correction, obtain final recognition result;

The regular computation process of described mean variance is:

s_{m}^{'} = \exp (\frac{s_{m} - μ}{kσ})

s_{m}^{''} = \log \frac{(M - 1) s_{m}^{'}}{Σ_{j = 1}^{M} s_{j}^{'} - s_{m}^{'}}

Wherein, s _mbe the initial score of m SVM model, μ and σ is respectively average and the standard deviation of all SVM model score of this test data, k=100, s " _mit is regular rear score.

American National Standard technology administration (NIST) languages identification in 2011 evaluation and test (LRE) data set is tested, test languages comprise 24 target languages, and Performance Evaluating Indexes has EER (etc. error rate), minDCF (minimum detection mistake cost), minCavg ₂₇₆(the minimum average B configuration value of risk that 276 languages are right), actCavg ₂₇₆(the actual average value of risk that 276 languages are right), minCavg ₂₄(the minimum average B configuration value of risk that 24 languages the poorest are right) and actCavg ₂₄(the actual average value of risk that 24 languages the poorest are right).S1 represents traditional Ivector method, and S2 represents traditional PPRVSM method based on phoneme feature, and S3 represents the Language Identification that the present invention proposes.Use Russian phoneme recognizer in actual test, the test performance evaluation index comparing result of each method is as shown in table 1.

Table 1

Method	EER	minDCF	minCavg ₂₇₆	actCavg ₂₇₆	minCavg ₂₄	actCavg ₂₄
							S1	6.65	6.86	2.45	3.45	11.88	14.52
S2	7.62	8.04	2.67	4.68	11.68	14.09
							S3	5.52	5.75	1.45	2.68	9.00	12.01

The Language Identification that the present invention proposes is used for factorial analysis the phoneme correlated characteristic relevant to pronunciation content, test result shows the Language Identification of the present invention's proposition compared with traditional Ivector method, recognition performance has the lifting of relative 16%-41%, compared with traditional Language Identification PPRVSM based on phoneme feature, recognition performance has the lifting of relative 15%-46%.

Claims

1. a training method for languages model of cognition, comprising:

Step 1-2), forward phoneme posterior probability to log-domain, and carry out dimensionality reduction, mean variance carries out to the feature after dimensionality reduction regular, obtain phoneme correlated characteristic;

Step 1-5), modeling is carried out to the phoneme variable quantity factor, sets up languages model of cognition.

2., by the training method of languages model of cognition according to claim 1, it is characterized in that, described step 1-2) computation process be:

The latent vector T.G Grammar matrix A corresponding by the dominant eigenvalue of L before covariance matrix _pCA, covariance matrix is defined as follows formula:

C = \frac{1}{N} Σ_{i = 1}^{N} (m_{i} - \frac{1}{N} Σ_{i = 1}^{N} m_{i}) {(m_{i} - \frac{1}{N} Σ_{i = 1}^{N} m_{i})}^{T}

Wherein, N is the number of training statement, m _ithe average of i-th training statement all frame log-domains phoneme posterior probability, as shown in the formula obtaining:

m_{i} = \frac{1}{T_{i}} Σ_{t = 1}^{T_{i}} x_{it}

X _itthe t frame log-domain phoneme posterior probability vector of i-th training statement, T _ibe the frame number of i-th training statement, the feature after dimensionality reduction is:

y_{it} = A_{PCA}^{T} x_{it}

To the feature y after dimensionality reduction _itcarry out mean variance regular, obtain phoneme correlated characteristic vector z _it.

3., by the training method of languages model of cognition according to claim 1, it is characterized in that, described step 1-3) in Baum-Welch normalized set process be:

N_{c} = Σ_{t = 1}^{T_{i}} P (c | z_{it}, Ω)

F_{c} = Σ_{t = 1}^{T_{i}} P (c | z_{it}, Ω) (z_{it} - μ_{c})

C is gaussian component, and Ω is the variance of global context model, p (c|z _it, Ω) and represent that t frame belongs to the probability of c gaussian component, μ _cit is the mean vector of c gaussian component.

4., by the training method of languages model of cognition according to claim 1, it is characterized in that, described step 1-4) computation process be:

The phoneme variable quantity factor w of i-th training statement is:

w＝(I+T ^TΣ ^-1N(i)T) ^-1T ^TΣ ^-1F(i)

Wherein, N (i) is diagonal matrix, and on diagonal line, element is N _ci, F (i) are by single order Baum-Welch statistic F _csplicing obtains, Σ and T is trained by expectation-maximization algorithm and obtain in Factor Analysis.

5. by the training method of languages model of cognition according to claim 1, it is characterized in that, described step 1-5) process be: adopt one to one with one-to-many strategy, support vector machine is used to carry out modeling to the phoneme variable quantity factor, set up supporting vector machine model, supporting vector machine model is described languages model of cognition.

6. a Language Identification, the method is based on the training method of the languages model of cognition one of claim 1-5 Suo Shu, and the method comprises the steps:

Step 2-1), extract the phoneme posterior probability of statement to be identified;

Step 2-2), forward phoneme posterior probability to log-domain, and carry out dimensionality reduction, mean variance carries out to the feature after dimensionality reduction regular, obtain phoneme correlated characteristic;

Step 2-5), the languages model of cognition described in phoneme variable quantity factor pair is given a mark, mean variance carries out to score regular, linear discriminant analysis is used to the score after regular and Gauss rear end is regular carries out score correction, obtain final recognition result.

7., by Language Identification according to claim 6, it is characterized in that, step 2-5) described in the regular computation process of mean variance be:

s_{m}^{'} = \exp (\frac{s_{m} - μ}{kσ})

s_{m}^{''} = \log \frac{(M - 1) s_{m}^{'}}{Σ_{j = 1}^{M} s_{j}^{'} - s_{m}^{'}}

Wherein, M is the number of described languages model of cognition, s _mbe the initial score of m languages model of cognition, μ and σ is respectively average and the standard deviation of this test data all languages model of cognition score, and k is adjustable parameter, s " _mit is regular rear score.