CN109545201B - Construction method of acoustic model based on deep mixing factor analysis - Google Patents
- Publication number
- CN109545201B (application CN201811537321.XA)
- Authority
- CN
- China
- Prior art keywords
- model
- layer
- mfa
- state
- parameters
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/148—Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
Abstract
The invention relates to the technical field of speech recognition and discloses a method for constructing an acoustic model based on deep mixture factor analysis, comprising the following steps: generating a baseline system from training data using an HMM-GMM model; initializing a DMFA model according to the HMM-GMM model parameters, wherein the DMFA model consists of two layers of MFA models and its parameters are initialized by GMM clustering and probabilistic principal component analysis; estimating the overall model parameters of the DMFA model of the acoustic feature space from the training data through the baseline HMM-GMM system using the greedy EM algorithm; estimating the state model parameters of the first-layer MFA model of the DMFA model, the state model parameters comprising state-related and state-independent parameters; and estimating the state model parameters of the second-layer MFA model of the DMFA model. By introducing the deep mixture factor analysis model into the modeling of the state model, the invention provides an acoustic model based on deep mixture factor analysis with better resistance to overfitting.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method for constructing an acoustic model based on deep mixture factor analysis.
Background
The acoustic model is an important component of a continuous speech recognition system. The hidden Markov model-Gaussian mixture model (Hidden Markov Model-Gaussian Mixture Model, HMM-GMM) is the current mainstream model, and combining the HMM-GMM model with bottleneck features (Bottleneck Feature, BNF) based on a deep neural network (Deep Neural Network, DNN) greatly improves the recognition rate. However, the uncertainty of natural speech signals is very large, and it is difficult to obtain an accurate acoustic model that describes them. The uncertainties of speech signals include coarticulation, speaker, transmission channel and noise environment, all of which demand accurate modeling of the speech acoustic unit. To realize accurate modeling, and especially to overcome the "phoneme variant" problem caused by coarticulation, a context-dependent phoneme modeling method is generally adopted in the HMM-GMM model, i.e. the monophone model is expanded into triphones. The expanded model, however, places high demands on the amount of data; to obtain stable model parameter estimates, parameter sharing is realized through state tying, which improves the model parameter estimation while reducing the required amount of training data.
However, since continuous speech recognition systems often face low-resource real-world conditions, many researchers are devoted to improving acoustic models, i.e. improving modeling accuracy while minimizing the required amount of data. The subspace Gaussian mixture model (Subspace Gaussian Mixture Model, SGMM) restricts the means and weights of the Gaussian mixture components to a parameter subspace, and the subspace parameters and covariance matrices are shared among different states, so that each state can be represented by vectors in several low-dimensional parameter subspaces. This realizes effective parameter sharing, greatly reduces the number of model parameters, and improves the robustness of parameter estimation under limited data, but it lacks prior assumptions on the variables of the low-dimensional subspaces.
Continuous speech recognition systems often face low-resource and other practical problems, and the mixture factor analysis acoustic model has fewer parameters, making it an effective method for accurate acoustic modeling under limited resources. The method, however, assumes that the local coordinates of the implicit factors follow a standard normal distribution, while studies have found that the true distribution of the local coordinates of the mixture factor analysis is far from normal.
Disclosure of Invention
In view of these problems, the invention provides a method for constructing an acoustic model based on deep mixture factor analysis. The invention introduces a deep mixture factor analysis model into the modeling of the state model and provides an acoustic model based on deep mixture factor analysis with better resistance to overfitting.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the construction method of the acoustic model based on deep mixing factor analysis comprises the following steps:
step 1: generating a baseline system by using training data and adopting an HMM-GMM model;
step 2: initializing an acoustic model DMFA model based on deep mixed factor analysis according to HMM-GMM model parameters, wherein the DMFA model is a two-layer mixed factor analysis model, namely, the DMFA model consists of a first-layer MFA model and a second-layer MFA model, and initializing the DMFA model parameters by adopting a GMM clustering and probability principal component analysis method;
step 3: estimating the overall model parameters of a DMFA model of the acoustic feature space by using training data through a baseline system of an HMM-GMM model and adopting a greedy EM algorithm;
step 4: estimating state model parameters of a first layer MFA model of the DMFA model, wherein the state model parameters comprise state related parameters and state irrelevant parameters;
step 5: state model parameters of a second layer MFA model of the DMFA model are estimated.
Further, initializing the parameters of the DMFA model by GMM clustering and probabilistic principal component analysis comprises the following steps:
determining the number of local regions I of the first-layer MFA model and the latent factor dimension D_i of each region;
determining the number of sub-regions K_i of each region i of the second-layer MFA model, the total number of second-layer sub-regions S = Σ_i K_i, and the latent factor dimension of each sub-region.
Further, the step 3 includes:
step 3.1: given data X = {x_1, x_2, ..., x_T}, train the first-layer MFA model:
training a first-layer MFA model on X with the greedy EM algorithm, with I mixture-of-factor components, the subspace dimension of the i-th component being D_i, i.e. M_i is a D × D_i matrix;
step 3.2: for the data set Y_i of each region i, i ∈ [1, I], calculating the Gaussian posterior distribution of the feature vector x_t belonging to the i-th local region and the posterior probability p(i | x_t), t ∈ [1, T], and obtaining the estimate of the mixture factor component corresponding to x_t according to
ŷ_{t,i} = E[y | x_t, i] = M_i^T (M_i M_i^T + Σ_i)^{-1} (x_t − μ_i)
step 3.3: performing the second-layer MFA model training:
on the estimated first-layer coordinates {ŷ_{t,i}}, an independent second-layer MFA model is trained for each region with the greedy EM algorithm, with second-layer mixture factor coordinate dimension D_i^(2) and number of mixture components K_i.
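The first-layer coordinate estimate of step 3.2 is the standard factor-analysis posterior mean. A minimal NumPy sketch (function name and toy values are hypothetical, not from the patent):

```python
import numpy as np

def factor_posterior_mean(x, mu, M, sigma_diag):
    """E[y | x] for one factor-analysis component x = mu + M y + n,
    with y ~ N(0, I) and n ~ N(0, diag(sigma_diag)).  Standard identity:
    E[y|x] = M^T (M M^T + Sigma)^{-1} (x - mu)."""
    C = M @ M.T + np.diag(sigma_diag)        # D x D model covariance
    return M.T @ np.linalg.solve(C, x - mu)  # D'-dimensional local coordinate

# toy check: D = 3, D' = 2, unit noise
mu = np.zeros(3)
M = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
x = np.array([2.0, -1.0, 0.5])
y_hat = factor_posterior_mean(x, mu, M, np.ones(3))  # -> [1.0, -0.5]
```

With unit noise the posterior mean shrinks the observed deviation x − μ toward zero along each factor direction, which is why it never overshoots the least-squares projection.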
Further, the step 4 includes:
step 4.1: initializing the state-related parameters of the first-layer MFA model and setting the number of iterations K; the initial weight factor of state j in region i is w_ji^(0) and the initial local coordinate is y_ji^(0), where j = 1, 2, ..., J and J is the total number of context-dependent states;
step 4.2: re-estimating the state-related parameters of the j-th state according to the state alignment information of the training data: w_ji^(k) and y_ji^(k), k ∈ [1, K], where w_ji^(k) is the weight factor and y_ji^(k) the local coordinate of the first-layer MFA model at the k-th iteration;
step 4.3: re-estimating the state-independent parameters of the i-th region according to the state alignment information of the training data: μ_i^(k), M_i^(k) and Σ_i^(k), where μ_i^(k) is the local region mean, M_i^(k) the local region basis matrix, and Σ_i^(k) the state-independent covariance matrix of the first-layer MFA model at the k-th iteration.
Further, the step 5 includes:
step 5.1: initializing the state-related parameters of the second-layer MFA model and selecting the number of iterations K′; the initial weight factor of state j in sub-region s is w_js^(0) and the initial local coordinate is y_js^(0);
step 5.2: re-estimating the state-related parameters of the j-th state according to the state alignment information of the training data: w_js^(k) and y_js^(k), k ∈ [1, K′], where w_js^(k) is the weight factor and y_js^(k) the local coordinate of the second-layer MFA model at the k-th iteration;
step 5.3: re-estimating the state-independent parameters of the s-th region according to the state alignment information of the training data: μ_s^(k), M_s^(k) and Σ_s^(k), where μ_s^(k) is the local region mean, M_s^(k) the local region basis matrix, and Σ_s^(k) the state-independent covariance matrix of the second-layer MFA model at the k-th iteration.
Compared with the prior art, the invention has the beneficial effects that:
the invention considers the state acoustic model as a self-adaption of the global feature space mixing factor analysis model, and adopts state related parameters and state irrelevant parameters to jointly describe state modeling. The DMFA model is based on further deepening and expansion of the mixed factor analysis acoustic model, the original model assumes that the regional coordinate parameters follow Gaussian prior distribution, but the regional coordinate parameters are not actually obtained, so that the DMFA model is obtained by modeling the regional coordinate parameters by adopting deeper mixed factor analysis by adopting a deeper mixed factor analysis model. Because the DMFA model has a parameter sharing strategy, the method has better overfitting resistance and is more suitable for acoustic model modeling under low resources.
Drawings
Fig. 1 is a basic flowchart of a method for constructing an acoustic model based on deep mixing factor analysis according to an embodiment of the present invention.
FIG. 2 is a basic flow chart of a method for constructing an acoustic model based on deep blending factor analysis according to another embodiment of the present invention.
Fig. 3 is a schematic diagram of deep mixture factor analysis for the method of constructing an acoustic model based on deep mixing factor analysis according to an embodiment of the present invention; part (a) is a region schematic of the deep mixture factors, and part (b) is the graphical model of the deep mixture factor analysis.
Detailed Description
The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:
embodiment one:
as shown in fig. 1, the method for constructing an acoustic model based on deep mixing factor analysis of the present invention includes the following steps:
step S101: and generating a baseline system by using the training data and adopting an HMM-GMM model.
Step S102: initializing an acoustic model DMFA model based on deep mixed factor analysis according to HMM-GMM model parameters, wherein the DMFA model is a two-layer mixed factor analysis model, namely, the DMFA model consists of a first-layer MFA model and a second-layer MFA model, and the parameters of the DMFA model are initialized by adopting a GMM clustering and probability principal component analysis method.
Step S103: and estimating the overall model parameters of the DMFA model of the acoustic feature space by using a greedy EM algorithm through a baseline system of the HMM-GMM model by using training data.
Step S104: state model parameters of a first layer MFA model of the DMFA model are estimated, the state model parameters including state-related parameters and state-independent parameters.
Step S105: state model parameters of a second layer MFA model of the DMFA model are estimated.
Embodiment two:
As shown in FIG. 2, another embodiment of the method for constructing an acoustic model based on deep mixing factor analysis of the present invention proceeds as follows.
First, an HMM-GMM acoustic model is generated with the Kaldi toolkit. A DMFA model of the entire acoustic feature space is then generated; this model is essentially a global background model, and the model of each state can be considered to be derived from it by an adaptive algorithm that re-estimates the parameters using the data of each state. The invention models with a two-layer DMFA model, i.e. the DMFA model comprises two layers of MFA models; the specific flow is shown in Algorithm 1.
In Algorithm 1, w_ji^(0) is the initial weight factor and y_ji^(0) the initial local coordinate of the first-layer MFA model, and J is the total number of context-dependent states; w_ji^(k) and y_ji^(k) are the weight factor and local coordinate of the first-layer MFA model at the k-th iteration; μ_i^(k), M_i^(k) and Σ_i^(k) are the local region mean, local region basis matrix and state-independent covariance matrix of the first-layer MFA model at the k-th iteration; w_js^(0) and y_js^(0) are the initial weight factor and initial local coordinate of the second-layer MFA model; w_js^(k) and y_js^(k) are the weight factor and local coordinate of the second-layer MFA model at the k-th iteration; μ_s^(k), M_s^(k) and Σ_s^(k) are the local region mean, local region basis matrix and state-independent covariance matrix of the second-layer MFA model at the k-th iteration.
The detailed procedure is as follows:
step S201: generating a baseline system from the training data using a hidden Markov model-Gaussian mixture model (HMM-GMM) (see Povey D, Ghoshal A, Boulianne G, et al. The Kaldi speech recognition toolkit. In: Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2011);
specifically, the training data is 1987-1989 WSJ text data, about 215M.
Step S202: initializing an acoustic model DMFA model based on deep mixed factor analysis according to HMM-GMM model parameters, wherein the DMFA model is a two-layer mixed factor analysis model, namely, the DMFA model consists of a first-layer MFA model and a second-layer MFA model, and the method for initializing the parameters of the DMFA model by adopting GMM clustering and probability principal component analysis comprises the following steps:
factor analysis (Factor Analyzer) is a high dimensional data correlation modeling method that models correlation information with a low dimensional linear subspace. Let x be the high-dimensional space R D The assumption of the root factor analysis model that x can pass through a low-dimensional subspace R K The point y in (D' < D) is obtained by affine transformation:
wherein μ is the mean of the high dimensional spatial data distribution; m is a linear transformation matrix, called factor load matrix, with dimension D x D'; n is a random error term. In the factor distribution model, y follows the D' dimensional standard normal distribution, while the random error n follows the gaussian distribution with a mean of 0 and covariance of the diagonal matrix Σ. It can thus be seen that a large number of observations are concentrated around the mean μ, resulting in a local linear subspace model. Thus, the factor analysis model is well suited for modeling local in high dimensional space, and combining multiple local factor analysis models results in a mixed factor analysis model (Mixture of Factor Analysers, MFA).
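Under this generative assumption the marginal covariance of x is M M^T + Σ, which can be checked numerically. A hedged NumPy sketch with made-up dimensions and parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
D, Dp, N = 4, 2, 200_000                    # hypothetical D, D', sample count
mu = np.array([1.0, -2.0, 0.0, 3.0])
M = rng.normal(size=(D, Dp)) * 0.5          # factor loading matrix (D x D')
sigma_diag = np.full(D, 0.1)                # diagonal noise covariance

y = rng.normal(size=(N, Dp))                # y ~ N(0, I)
n = rng.normal(size=(N, D)) * np.sqrt(sigma_diag)
x = mu + y @ M.T + n                        # x = mu + M y + n

emp_cov = np.cov(x, rowvar=False)           # empirical covariance of samples
model_cov = M @ M.T + np.diag(sigma_diag)   # analytic marginal covariance
```

At this sample size the empirical covariance matches M M^T + Σ to a couple of decimal places, illustrating why the loading matrix captures the correlated part and Σ only the residual per-dimension noise.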
Assuming that the data space is divided into I local regions and that the probabilities of the observation x falling into the local regions are w_1, w_2, …, w_I, each local region is approximated by a factor analysis model, giving the mixture factor analysis model:
p(x) = Σ_{i=1}^{I} w_i N(x; μ_i, M_i M_i^T + Σ_i)    (3)
where μ_i, M_i and Σ_i are the mean, factor loading matrix and reconstruction error matrix of the i-th factor analysis model, and y_i is the coordinate vector of the observation x in that model. In formula (3), the linear subspace dimension of each local factor analysis model may differ; if the linear subspace dimension of the i-th factor analysis model is D_i, then M_i is a D × D_i matrix and the local coordinate y_i is a D_i-dimensional vector. According to formulas (2) and (3), the posterior distribution of y_i given x is Gaussian:
q(y_i | x, i) = N(y_i; M_i^T (M_i M_i^T + Σ_i)^{-1} (x − μ_i), I − M_i^T (M_i M_i^T + Σ_i)^{-1} M_i)
The parameters of the mixture factor analysis model thus comprise the number of local regions I, the latent dimension D_i of each local region, and the local factor analysis model parameters {w_i, μ_i, M_i, Σ_i}. Parameter estimation of the model can be realized with the expectation maximization (Expectation Maximization, EM) algorithm.
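The per-region posterior probabilities that drive the EM estimation can be sketched as follows; this is a generic MFA responsibility computation in NumPy with toy parameters, not the patent's implementation:

```python
import numpy as np

def mfa_responsibilities(X, w, mus, Ms, sigmas):
    """gamma_i(x_t) = w_i N(x_t; mu_i, M_i M_i^T + Sigma_i) / sum over components.
    X: (T, D); w: (I,); mus: (I, D); Ms: list of (D, D_i); sigmas: (I, D) diagonals."""
    T, D = X.shape
    log_p = np.empty((T, len(w)))
    for i in range(len(w)):
        C = Ms[i] @ Ms[i].T + np.diag(sigmas[i])          # component covariance
        diff = X - mus[i]
        _, logdet = np.linalg.slogdet(C)
        maha = np.einsum('td,td->t', diff @ np.linalg.inv(C), diff)
        log_p[:, i] = np.log(w[i]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    log_p -= log_p.max(axis=1, keepdims=True)             # stabilize before exp
    gam = np.exp(log_p)
    return gam / gam.sum(axis=1, keepdims=True)

# two well-separated toy components
w = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [10.0, 10.0]])
Ms = [np.eye(2) * 0.1, np.eye(2) * 0.1]
sigmas = np.ones((2, 2)) * 0.5
X = np.array([[0.1, -0.2], [9.8, 10.3]])
gam = mfa_responsibilities(X, w, mus, Ms, sigmas)
```

Working in the log domain and subtracting the row maximum before exponentiating is the usual guard against underflow when densities differ by many orders of magnitude.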
Assume that the model parameters after the k-th iteration are Λ^(k). The auxiliary function of the greedy EM algorithm at the (k+1)-th iteration is:
Q(Λ^(k), Λ) = Σ_{t=1}^{T} Σ_{i=1}^{I} γ_i(x_t) E_{y|x_t,i} [log(w_i p(x_t, y | i; Λ))]
where γ_i(x_t) is the posterior probability that the feature vector x_t belongs to the i-th local region given the parameters Λ^(k), computed as:
γ_i(x_t) = w_i^(k) N(x_t; μ_i^(k), M_i^(k) M_i^(k)T + Σ_i^(k)) / Σ_{i'=1}^{I} w_{i'}^(k) N(x_t; μ_{i'}^(k), M_{i'}^(k) M_{i'}^(k)T + Σ_{i'}^(k))
For convenience of derivation, let γ_i = Σ_t γ_i(x_t). At the M-step, taking the partial derivatives of Q(Λ^(k), Λ) with respect to the parameters w_i, μ_i, M_i and Σ_i and setting each partial derivative to 0 yields the parameter update formulas of the MFA model; in particular, the weight update is w_i = γ_i / T.
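The weight update w_i = γ_i/T can be sketched directly from the responsibilities. The mean update shown below is the simplified GMM-style form for illustration only; the full MFA M-step couples the mean with the loading matrix via the factor posteriors, which is omitted here:

```python
import numpy as np

def mstep_weights_means(X, gam):
    """Sketch of M-step re-estimation: w_i = (sum_t gamma_i(x_t)) / T exactly,
    and a simplified mean update mu_i = sum_t gamma_i(x_t) x_t / sum_t gamma_i(x_t).
    X: (T, D) data; gam: (T, I) responsibilities."""
    soft_counts = gam.sum(axis=0)                 # gamma_i, one per component
    w = soft_counts / gam.shape[0]                # w_i = gamma_i / T
    mus = (gam.T @ X) / soft_counts[:, None]      # responsibility-weighted means
    return w, mus

# toy data with hard (0/1) responsibilities so the result is easy to verify
X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [12.0, 12.0]])
gam = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
w, mus = mstep_weights_means(X, gam)  # w -> [0.5, 0.5]
```

With hard responsibilities the updates reduce to per-cluster relative frequency and per-cluster sample mean, which makes the soft-count formulas easy to sanity-check.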
When the MFA model converges, the accuracy of the model can be improved by increasing the number of spatial regions I and the dimension D' of the latent factor space, which amounts to adjusting the conditional probability of x under the latent factor y of every region. However, when modeling high-dimensional data, or when the amount of data is insufficient, this approach quickly leads to overfitting. To solve this problem, the prior distribution of each latent factor y is replaced: instead of the simple standard normal prior, a more complex MFA prior is used, i.e.
p(y | i) = Σ_{k_i=1}^{K_i} w_{k_i} N(y; μ_{k_i}^(2), M_{k_i}^(2) M_{k_i}^(2)T + Σ_{k_i}^(2))
where {w_{k_i}, μ_{k_i}^(2), M_{k_i}^(2), Σ_{k_i}^(2)} are the parameters of the new second-layer MFA of spatial region i, i.e. region i is itself subjected to a mixture factor analysis.
Figure 3 shows a schematic diagram of deep mixture factor analysis. Part (a) of Fig. 3 is a region schematic of the deep mixture factor analysis. In part (b), the number of first-layer mixture factor regions is I = 2; the thick lines represent the two regions of the mixture factor analysis model, indexed i = 1, 2, and a mixture factor analysis is applied within each region. Assume the number of sub-regions into which each region i is divided is K_i, indexed k_i = 1, 2, ..., K_i. In the example here, region i = 1 is further modeled by a mixture factor analysis with K_{i=1} = 4 sub-regions k_1 = 1, 2, ..., K_1, K_1 = 4; region i = 2 has sub-regions k_2 = 1, 2, ..., K_2, K_2 = 2.
The DMFA therefore replaces the prior probability of the original MFA, p_MFA(y, i) = p(i) p(y | i), with a better prior probability, i.e.:
p_DMFA(y, i, k_i) = p(i) p(k_i | i) p(y | k_i)    (15)
Thus, to sample from the DMFA, first select i using w_i; then, at the second layer, sample k_i using w_{k_i|i}, where k_i = 1, 2, ..., K_i and K_i is the number of factor analysers the i-th first-layer factor is decomposed into at the second layer; finally, y is represented by the factor component k_i. There is a simpler, fully equivalent form of the DMFA that enumerates all possible second-layer factors k_i: a new factor component indicator s = 1, 2, ..., S, with S = Σ_{i=1}^{I} K_i, denotes a specific second-layer component, as shown on the right side of part (a) of Fig. 3. The mixture weights can then be defined as w_s = w_{i(s)} w_{k_i(s)|i(s)}, where i(s) and k_i(s) denote the first-layer factor i and second-layer factor k_i corresponding to s; for example, i(2) = 1 and i(5) = 2. Note that the size of S increases exponentially with the number of layers. This representation of the generative process is very intuitive, and it will be used in the remainder of this description.
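The two-stage sampling procedure (select i by w_i, then k_i within region i) and the flattened indicator s with its i(s) mapping can be sketched as follows. The region counts follow the running example (I = 2, K_1 = 4, K_2 = 2); the weights themselves are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

w_first = np.array([0.6, 0.4])                    # p(i), hypothetical weights
w_second = [np.full(4, 0.25), np.full(2, 0.5)]    # p(k_i | i), uniform here

def sample_dmfa_indices(n):
    """Ancestral sampling of (i, k_i) from p(i) p(k_i | i); the latent y would
    then be drawn from the selected second-layer factor model (omitted)."""
    out = []
    for _ in range(n):
        i = rng.choice(len(w_first), p=w_first)
        k = rng.choice(len(w_second[i]), p=w_second[i])
        out.append((int(i), int(k)))
    return out

pairs = sample_dmfa_indices(10_000)
frac_i0 = sum(1 for i, _ in pairs if i == 0) / len(pairs)

# Flattened indicator s = 1..S with S = K_1 + K_2 = 6; i(s) mapping from the
# text: i(2) = 1 and i(5) = 2 (1-indexed).
i_of_s = [1, 1, 1, 1, 2, 2]
```

The empirical fraction of draws landing in region 1 converges to w_1 = 0.6, and every second-layer index stays within its region's K_i, mirroring the constraint that each s belongs to exactly one i.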
Part (b) of Fig. 3 shows the graphical model of the two-layer DMFA. In particular,
i ← i(s)    (19)
Formula (19) is well defined because each s belongs to one and only one i. Here Σ^(1) and Σ^(2) are diagonal matrices of dimensions D^(1) × D^(1) and D^(2) × D^(2), respectively. The variables are generic terms; note in particular that y^(2) refers to the y_s^(2) of each s, of which there are S at the second layer, and y^(1) refers to y_{i(s)}^(1): once s is determined, i is determined as well.
For a deep mixture factor analysis model with an arbitrary number of layers L, the parameters of the DMFA model comprise, for each layer l = 1, 2, ..., L, the number of local regions I^(l), the latent dimension of each local region, and the local factor analysis model parameters. The sub-regions of each layer are indexed globally here, but each sub-region belongs to only one region of the previous layer. The invention models with a two-layer DMFA model; for clearer representation, a first-layer region is denoted by i and a second-layer region by s, the number of first-layer regions is I, and the total number of second-layer sub-regions is S.
Let the weight factor of the observation vector of state j falling into the i-th local region be w_ji, satisfying the probability constraint Σ_i w_ji = 1. Assume the observation vector of state j follows a Gaussian distribution in the i-th local region and only the influence of the mean is considered: the mean is μ'_ji and the variance is Σ'_i (the variance of a region is shared by all states). Given the local region i and the mean μ'_ji, the observation probability of state j is:
p(x | j, i) = N(x; μ'_ji, Σ'_i)    (21)
Assume the mean μ'_ji is modeled with a deep factor analysis model. Taking the two-layer model as an example, let the local coordinate of the mean μ'_ji in the i-th first-layer local region be y_ji^(1). From formula (2), μ'_ji is:
μ'_ji = μ_i^(1) + M_i^(1) y_ji^(1)    (22)
Assume the local coordinate of y_ji^(1) in the s-th second-layer local region is y_js^(2). Given the second-layer local region s and y_js^(2), it follows from formula (2) that y_ji^(1) is:
y_ji^(1) = μ_s^(2) + M_s^(2) y_js^(2)    (23)
According to formulas (22) and (23), the local latent variable of the second layer can be used to represent the prior distribution of the mean, i.e.
p(μ'_ji | i, s) = N(μ'_ji; μ_i^(1) + M_i^(1) μ_s^(2), M_i^(1) M_s^(2) (M_i^(1) M_s^(2))^T)    (24)
Substituting formulas (22) and (23) into formula (24), μ'_ji is:
μ'_ji = μ_i^(1) + M_i^(1) (μ_s^(2) + M_s^(2) y_js^(2))    (25)
According to the Bayesian formula, the observation probability of state j is:
p(x | j) = Σ_{i=1}^{I} Σ_s w_ji w_js N(x; μ'_ji, Σ'_i)    (26)
To reduce the number of parameters, let M_is = M_i^(1) M_s^(2). Finally, given the second-layer region coordinates y_js^(2), the observation probability model of state j is obtained:
p(x | j) = Σ_{s=1}^{S} w_js N(x; μ_{i(s)}^(1) + M_{i(s)}^(1) μ_s^(2) + M_{i(s)s} y_js^(2), Σ'_{i(s)})    (27)
where
M_{i(s)s} = M_{i(s)}^(1) M_s^(2)    (28)
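The two-layer mean composition of formulas (22)-(25) and the collapsed matrix M_is = M_i^(1) M_s^(2) agree by associativity of matrix products; a quick NumPy check with random, hypothetical shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
D, D1, D2 = 5, 3, 2                 # made-up feature and subspace dimensions
mu_i = rng.normal(size=D)           # first-layer region mean
M_i = rng.normal(size=(D, D1))      # first-layer basis matrix
mu_s = rng.normal(size=D1)          # second-layer sub-region mean
M_s = rng.normal(size=(D1, D2))     # second-layer basis matrix
y_js = rng.normal(size=D2)          # second-layer state coordinate

# two-step composition: y_ji = mu_s + M_s y_js, then mu'_ji = mu_i + M_i y_ji
mu_prime_stepwise = mu_i + M_i @ (mu_s + M_s @ y_js)

# collapsed form using M_is = M_i M_s, as used to cut the parameter count
M_is = M_i @ M_s
mu_prime_direct = (mu_i + M_i @ mu_s) + M_is @ y_js
```

The collapsed form shows why sharing M_is across states is cheap: the state only contributes the low-dimensional coordinate y_js, while all D × D2 structure lives in region-level matrices.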
In an HMM acoustic model, each state essentially describes the observation probability distribution of a certain stationary segment of the corresponding acoustic modeling unit. The observed feature vectors of each stationary segment therefore necessarily occupy only a small number of regions of the MFA model over the whole space. Accordingly, the observation data of state j are distributed over a small area of the whole space, so the weights w_js are necessarily sparse; in fact the weights in formula (27) reflect the combined action of the first-layer and second-layer weights, which can be considered jointly when re-estimating directly, in order to reduce the number of parameters.
It can thus be seen that formulas (27) and (28) construct a state model based on deep factor analysis, and accurate modeling of each state is achieved by re-estimating its parameters. At each layer of the DMFA model, the state model parameters therefore comprise two parts. First, the state-related parameters: the weight factors w_ji, w_js and the local coordinates y_ji^(1), y_js^(2), where i = 1, 2, ..., I and s = 1, 2, ..., S. Second, the state-independent parameters: μ_i^(1), M_i^(1), Σ_i^(1) and μ_s^(2), M_s^(2), Σ_s^(2).
Since, as Algorithm 1 shows, the typical state model is generated with the greedy EM algorithm, the initial selection of the DMFA is critical. Following the method for generating a typical state model from a mixture-factor-analysis acoustic model, the generation comprises two parts: first, a GMM of the overall space is generated from the baseline system of the HMM-GMM model; second, the GMM of the overall space is expressed as a deep mixture factor analysis model.
Obtaining the GMM of the overall space is similar to the MFA-based acoustic model method, but requires two layers of clustering. Consider first the first-layer clustering. Let the total number of Gaussian mixture components in the HMM-GMM baseline system be M, number them 1 to M in some order, and let the mean of the m-th Gaussian component be μ_m^(0) and its covariance matrix Σ_m^(0). The training data is force-aligned, and the zero-order, first-order and second-order statistics of each Gaussian component are computed:
γ_m = Σ_t γ_m(t), s_m = Σ_t γ_m(t) o_t, S_m = Σ_t γ_m(t) o_t o_t^T
where γ_m(t) is the posterior probability of the t-th frame feature vector o_t on the m-th Gaussian component, computed by the Baum-Welch forward-backward algorithm. The likelihood of the training data for each component is then computed as: LLK_m = Σ_t γ_m(t) log p(o_t | m).
When the m′-th and m″-th Gaussian components are clustered and merged into a new Gaussian component m‴, the corresponding zero-order, first-order and second-order statistics are computed as
γ_m‴ = γ_m′ + γ_m″, s_m‴ = s_m′ + s_m″, S_m‴ = S_m′ + S_m″
and the merged weight w_m‴, mean vector μ_m‴ and covariance matrix Σ_m‴ are then computed with the parameter re-estimation formulas of the HMM-GMM. The loss in the log likelihood of the training data after merging is:
ΔLLK_{m′m″→m‴} = LLK_m‴ − LLK_m′ − LLK_m″    (29)
A GMM containing I Gaussian components is obtained through an M − I step clustering process. At each clustering step, the current Gaussian components are considered pairwise, the likelihood loss before and after merging is computed by formula (29), the two components with the smallest loss are merged into a new component, and the two components before merging are deleted; the weight, mean vector and covariance matrix of the new component are computed from its merged statistics. After the above clustering process is completed, the GMM parameters containing I Gaussian components are obtained as {w_i, μ_i, Σ_i}, i = 1, ..., I.
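Merging two Gaussian components from their accumulated statistics, then re-estimating the mean and covariance, can be sketched as below (toy one-dimensional data, hypothetical function names):

```python
import numpy as np

def merge_gaussians(stats_a, stats_b):
    """Merge two components from accumulated (gamma, s, S) statistics, then
    re-estimate as in HMM-GMM: mu = s / gamma, Cov = S / gamma - mu mu^T."""
    g = stats_a[0] + stats_b[0]          # zero-order: soft counts add
    s = stats_a[1] + stats_b[1]          # first-order: weighted sums add
    S = stats_a[2] + stats_b[2]          # second-order: outer-product sums add
    mu = s / g
    cov = S / g - np.outer(mu, mu)
    return g, mu, cov

def stats(X):
    """Accumulate (gamma, s, S) for hard-assigned frames X of shape (T, D)."""
    return len(X), X.sum(axis=0), np.einsum('ti,tj->ij', X, X)

# toy split: component a owns frames {0, 2}, component b owns frames {4, 6}
Xa = np.array([[0.0], [2.0]])
Xb = np.array([[4.0], [6.0]])
g, mu, cov = merge_gaussians(stats(Xa), stats(Xb))  # pooled mean 3, variance 5
```

Because the statistics are additive, the merged component is exactly the maximum-likelihood Gaussian of the pooled frames, which is what makes the greedy pairwise clustering cheap to evaluate.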
Notably, when the GMM over the entire acoustic space is acquired, some intermediate results must be recorded for the generation of the multi-layer GMM model. In a multi-layer GMM, the Gaussian components of each layer are themselves represented by a GMM, forming a hierarchy. Consider first the two-layer GMM. Let the Gaussian component set of the HMM-GMM baseline system be Ω = {m^(0)}, m = 1, 2, …, M. The first-layer GMM is obtained with the clustering algorithm described above, but for every merge in the clustering the composition of original components contained in each merged component is recorded; for example, the new component created by the k-th merge keeps the set of original components of Ω that were combined into it. The resulting first-layer GMM parameters are {w_i^(1), μ_i^(1), Σ_i^(1)}, i = 1, 2, …, I, where the superscript (1) denotes the first-layer clustering, and the original-component set of each component i is C_i, i = 1, 2, …, I, the sets C_i together partitioning Ω. For the second-layer GMM, the components in the original set C_i of the i-th first-layer component are merged with the same clustering algorithm as in the first layer, down to a specified number, completing the acquisition of the second-layer GMM parameters of the deep GMM model; the clustered second-layer components thus share the information of the first-layer components.
After the clustering process is completed, the first-layer GMM parameters with I Gaussian components are obtained as {w_i^(1), μ_i^(1), Σ_i^(1)} and the second-layer components as {w_s^(2), μ_s^(2), Σ_s^(2)}. For each covariance matrix of each layer, e.g. Σ_h, an eigenvalue analysis is performed and the eigenvalues are sorted in descending order as λ_h1, λ_h2, …, λ_hD, with corresponding eigenvectors v_h1, v_h2, …, v_hD. The cumulative contribution rate (Cumulative Contribution Rate, CCR) η_hd of the first d eigenvalues is defined as:

η_hd = (Σ_{k=1}^{d} λ_hk) / (Σ_{k=1}^{D} λ_hk)
η_hd reflects the ratio of the first d eigenvalues to the sum of all eigenvalues. For the h-th local region of the mixture factor analysis model, the latent dimension D_h is chosen as the smallest d such that η_hd exceeds 90%, i.e., the smallest number of eigenvalues whose cumulative contribution exceeds 90% is taken as the latent dimension of that local region. For region h of the first layer, the remaining parameters of its corresponding factor analysis model are initialized with the probabilistic principal component analysis method.
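The CCR-based choice of the latent dimension can be sketched as follows (`latent_dim` is an illustrative name implementing the 90%-threshold rule above):

```python
import numpy as np

def latent_dim(cov, threshold=0.9):
    """Pick the latent dimension D_h of a local region as the smallest d
    whose cumulative eigenvalue contribution rate (CCR) exceeds threshold."""
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, descending order
    ccr = np.cumsum(eigvals) / eigvals.sum()  # eta_{h,d} for d = 1..D
    return int(np.argmax(ccr > threshold)) + 1
```

`eigvalsh` is used because a covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the reversal.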
Step S203: estimating the overall model parameters of the DMFA model of the acoustic feature space from the baseline system of the HMM-GMM model with a greedy EM algorithm, using training data, comprising:
The training of the DMFA can be achieved with a greedy EM algorithm. The MFA of the first layer is trained in the standard manner; then formula (8) is used to obtain, for each training sample, its mixture-factor component and the factor of the corresponding component. The first-layer parameters are then frozen, and the first-layer factor values of each component are extracted as training data for the second-layer MFA. Algorithm 2 gives the specific layer-by-layer training algorithm. After greedy learning, the DMFA can be collapsed and refined with the update steps of the conventional EM algorithm.
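The factor extraction that feeds the second layer can be sketched as follows. Under a standard factor-analysis component x = μᵢ + Mᵢy + ε with a standard-normal prior on y, the posterior mean is E[y | x, i] = (I + MᵢᵀΣᵢ⁻¹Mᵢ)⁻¹ MᵢᵀΣᵢ⁻¹(x − μᵢ); that this matches the patent's formula (8) exactly is an assumption, since (8) is not reproduced here, and the names are illustrative:

```python
import numpy as np

def factor_posterior_mean(x, mu, M, sigma):
    """Posterior mean E[y | x, i] of the latent factor for one FA component,
    assuming x = mu + M y + noise, noise ~ N(0, sigma), y ~ N(0, I)."""
    d = M.shape[1]
    sinv = np.linalg.inv(sigma)
    prec = np.eye(d) + M.T @ sinv @ M          # posterior precision of y
    return np.linalg.solve(prec, M.T @ sinv @ (x - mu))
```

In the layer-wise scheme, each training frame is assigned to a component, its factor posterior mean is computed with the frozen first-layer parameters, and those factor vectors become the training set of the second-layer MFA.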
Step S204: estimating the state model parameters of the first-layer MFA model of the DMFA model, the state model parameters including state-dependent parameters and state-independent parameters, comprising:
According to the state-model modeling method given in step 2, the MFA-based state model parameters of the first layer — the weight factors w_ji, the local coordinates y_ji, the local-region basis matrices M_i, the local-region mean vectors μ_i and the state-independent covariance matrices Σ_i — are estimated. An auxiliary function is constructed for each parameter under the maximum likelihood criterion, and the parameter re-estimation formulas are obtained as follows:
It should be noted that, since each state model is essentially the observation probability distribution describing a certain stationary segment of the corresponding acoustic modeling unit, the observation feature vectors of that segment cover only a small part of all regions. The weights w_ji in the state-model observation space are therefore sparse. The sparsity constraint can be realized with a weight-shrinkage algorithm consisting of two parts, weight shrinkage and weight normalization: shrinkage enforces sparsity, while normalization makes the weights satisfy their statistical constraint. The final weight parameters are obtained after this algorithm, and all parameters are updated to obtain the first-layer MFA acoustic model of the state.
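The two-part weight-shrinkage algorithm can be sketched as follows; the patent does not reproduce the exact shrinkage rule, so a simple hard threshold τ is used here as an assumption:

```python
import numpy as np

def shrink_weights(w, tau=0.01):
    """Weight-shrinkage sparsity constraint: zero out weights below an
    (assumed) hard threshold tau, then renormalize so the remaining
    weights sum to one, i.e. still form a valid mixture distribution."""
    w = np.where(w >= tau, w, 0.0)   # shrinkage step: enforce sparsity
    return w / w.sum()               # normalization step: statistical constraint
```

The same routine can be applied unchanged to the second-layer weights w_js, which are subject to the same sparsity constraint.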
Step S205: estimating state model parameters of a second layer MFA model of the DMFA model, comprising:
From equations (27) and (28), once the first-layer parameters of the state DMFA model are obtained, accurate modeling of the state can be achieved through changes of the second-layer parameters alone. An auxiliary function can therefore be established as:
Thus, based on the first-layer parameters, auxiliary functions are constructed with formula (36) for the weight factors w_js, the local coordinates y_js, the local-region basis matrices, the local-region mean vectors and the state-independent covariance matrices, and the parameter re-estimation formulas are obtained as follows:
wherein the parameters H_js and g_js are given by formula (42), and γ_js, s_js and S_js are the zeroth-, first- and second-order statistics of the s-th subregion of state j, respectively.
It should be noted that, first, since the weight w_js is estimated jointly from all subregions, the same sparsity-constraint algorithm as used for w_ji above is applied to it; second, each subregion s belongs to exactly one region i, so once a subregion s is determined the corresponding region i is also determined; in addition, the matrix inverses appearing in formulas (39) to (41) are generalized inverses, and it can be shown that the relevant matrix has a left inverse in the one case and a right inverse in the other.
The invention provides an acoustic-model modeling method based on deep mixture factor analysis. The new method follows the idea of the MFA-based modeling method: the state acoustic model is regarded as an adaptation of the global feature-space mixture factor analysis model, and state modeling is described jointly by state-dependent and state-independent parameters. The model, however, further deepens and extends the MFA acoustic model: the original model assumes that the region coordinate parameters follow a Gaussian prior distribution, which is not actually the case, so a deeper mixture factor analysis model is adopted to model the region coordinate parameters. Because the DMFA model employs a parameter-sharing strategy, the method has better resistance to overfitting and is more suitable for acoustic modeling under low-resource conditions.
The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.
Claims (5)
1. The construction method of the acoustic model based on deep mixing factor analysis is characterized by comprising the following steps:
step 1: generating a baseline system by using training data and adopting an HMM-GMM model;
step 2: initializing an acoustic model DMFA model based on deep mixed factor analysis according to HMM-GMM model parameters, wherein the DMFA model is a two-layer mixed factor analysis model, namely, the DMFA model consists of a first-layer MFA model and a second-layer MFA model, and initializing the DMFA model parameters by adopting a GMM clustering and probability principal component analysis method;
step 3: estimating the overall model parameters of a DMFA model of the acoustic feature space by using training data through a baseline system of an HMM-GMM model and adopting a greedy EM algorithm;
step 4: estimating state model parameters of a first layer MFA model of the DMFA model, wherein the state model parameters comprise state related parameters and state irrelevant parameters;
estimating state model parameters of a first layer MFA model of the DMFA model according to:
wherein w_ji^(k+1) is the weight factor, y_ji^(k+1) the local coordinates, μ_i^(k+1) the local-region mean, M_i^(k+1) the local-region basis matrix and Σ_i^(k+1) the state-independent covariance matrix of the first-layer MFA model at the (k+1)-th iteration; μ_i, M_i and Σ_i are the mean, the factor loading matrix and the reconstruction error matrix of the i-th factor analysis model; y_ji are the local coordinates of the first-layer MFA model; x_t is the feature vector; and γ_ji and s_ji are the zeroth- and first-order statistics of the first-layer MFA model of state j, respectively;
step 5: estimating state model parameters of a second-layer MFA model of the DMFA model;
estimating state model parameters of a second layer MFA model of the DMFA model according to:
wherein the method comprises the steps of
Wherein the method comprises the steps ofIs the weighting factor of the MFA model of the second layer at the k+1th iteration, +.>For the local coordinates of the MFA model of the second layer at the k+1th iteration, +.>For the mean value of the local region of the MFA model of the second layer at the k+1th iteration, +.>For the local area basis matrix of the MFA model of the second layer at the k+1th iteration,/and>for the state-independent covariance matrix of the second-layer MFA model at the k+1th iteration,/->Is the local coordinates of the second-layer MFA model, gamma js 、s js And S is js Zero order, first order, and second order statistics, respectively, of the second layer MFA model of state j.
2. The method for constructing an acoustic model based on deep mixing factor analysis according to claim 1, wherein initializing DMFA model parameters by using GMM clustering and probability principal component analysis method comprises:
determining the number of local regions I and the latent factor dimension D_i^(1) of each region of the first-layer MFA model;
3. The method for constructing an acoustic model based on deep mixing factor analysis according to claim 2, wherein the step 3 comprises:
step 3.1: given data X = {x_1, x_2, …, x_T}, performing first-layer MFA model training:
training a first-layer MFA model on X with the greedy EM algorithm, wherein there are I mixture-factor components and the subspace dimension of each factor is D_i^(1), i.e., M_i is a D × D_i^(1) matrix;
Step 3.2: data set Y for each region i i ,i∈[1,I]Calculating a feature vector x t Gaussian distribution of posterior probability belonging to ith local areaAnd feature vector x t Posterior probability p (i|x) belonging to the i-th local region t ),t∈[1,T]X is obtained according to the following formula t Estimated value of corresponding blend factor component i +.>
Step 3.3: performing second-layer MFA model training:
4. The method for constructing an acoustic model based on deep mixing factor analysis according to claim 3, wherein the step 4 comprises:
step 4.1: initializing the state-dependent parameters of the first-layer MFA model and setting the number of iterations K, wherein w_ji^(0) is the initial weight factor of the first-layer MFA model, y_ji^(0) are the initial local coordinates of the first-layer MFA model, and J is the total number of context-dependent states;
step 4.2: re-estimating the state-dependent parameters of the j-th state according to the state alignment information of the training data, wherein w_ji^(k) is the weight factor and y_ji^(k) are the local coordinates of the first-layer MFA model at the k-th iteration;
step 4.3: re-estimating the state-independent parameters of the i-th region according to the state alignment information of the training data, wherein μ_i^(k) is the local-region mean, M_i^(k) the local-region basis matrix and Σ_i^(k) the state-independent covariance matrix of the first-layer MFA model at the k-th iteration.
5. The method for constructing an acoustic model based on deep mixing factor analysis according to claim 4, wherein the step 5 comprises:
step 5.1: initializing the state-dependent parameters of the second-layer MFA model and selecting the number of iterations K′, wherein w_js^(0) is the initial weight factor and y_js^(0) are the initial local coordinates of the second-layer MFA model;
step 5.2: re-estimating the state-dependent parameters of the j-th state according to the state alignment information of the training data, wherein w_js^(k) is the weight factor and y_js^(k) are the local coordinates of the second-layer MFA model at the k-th iteration;
step 5.3: re-estimating the state-independent parameters of the s-th region according to the state alignment information of the training data, wherein μ_s^(k) is the local-region mean, M_s^(k) the local-region basis matrix and Σ_s^(k) the state-independent covariance matrix of the second-layer MFA model at the k-th iteration.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811537321.XA CN109545201B (en) | 2018-12-15 | 2018-12-15 | Construction method of acoustic model based on deep mixing factor analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109545201A CN109545201A (en) | 2019-03-29 |
CN109545201B true CN109545201B (en) | 2023-06-06 |
Family
ID=65856473
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811537321.XA Active CN109545201B (en) | 2018-12-15 | 2018-12-15 | Construction method of acoustic model based on deep mixing factor analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109545201B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113421555B (en) * | 2021-08-05 | 2024-04-12 | 辽宁大学 | BN-SGMM-HMM-based low-resource voice recognition method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103117060A (en) * | 2013-01-18 | 2013-05-22 | 中国科学院声学研究所 | Modeling approach and modeling system of acoustic model used in speech recognition |
CN104795063A (en) * | 2015-03-20 | 2015-07-22 | 中国人民解放军信息工程大学 | Acoustic model building method based on nonlinear manifold structure of acoustic space |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108630199A (en) * | 2018-06-30 | 2018-10-09 | 中国人民解放军战略支援部队信息工程大学 | A kind of data processing method of acoustic model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||