CN109545201B - Construction method of acoustic model based on deep mixing factor analysis - Google Patents
- Publication number
- CN109545201B (application CN201811537321.XA)
- Authority
- CN
- China
- Prior art keywords
- model
- layer
- mfa
- state
- parameters
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/148—Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
Abstract
The invention relates to the technical field of speech recognition and discloses a method for constructing an acoustic model based on deep mixture factor analysis, comprising the following steps: generating a baseline system from training data using an HMM-GMM model; initializing a DMFA model according to the HMM-GMM model parameters, wherein the DMFA model consists of two layers of MFA models and its parameters are initialized by GMM clustering and probabilistic principal component analysis; estimating the overall model parameters of the DMFA model of the acoustic feature space from the training data through the baseline HMM-GMM system using the greedy EM algorithm; estimating the state model parameters of the first-layer MFA model of the DMFA model, the state model parameters comprising state-related and state-independent parameters; and estimating the state model parameters of the second-layer MFA model of the DMFA model. By introducing the deep mixture factor analysis model into the modeling of the state model, the invention provides an acoustic model based on deep mixture factor analysis with better resistance to overfitting.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method for constructing an acoustic model based on deep mixture factor analysis.
Background
The acoustic model is an important component of a continuous speech recognition system. The hidden Markov model-Gaussian mixture model (Hidden Markov Model-Gaussian Mixture Model, HMM-GMM) is the current mainstream model, and combining the HMM-GMM model with bottleneck features (Bottleneck Feature, BNF) based on a deep neural network (Deep Neural Network, DNN) greatly improves the recognition rate. However, the uncertainty of natural speech signals is very large, and it is difficult to obtain an accurate acoustic model that describes them. The uncertainties of speech signals include coarticulation, speaker, transmission channel and noise environment, all of which demand accurate modeling of the speech acoustic unit. To realize accurate modeling, and especially to overcome the "phoneme variant" problem caused by coarticulation, a context-dependent phoneme modeling method is generally adopted in the HMM-GMM model, i.e. the monophone model is expanded into triphones. The expanded model, however, places high demands on the amount of data; to obtain stable model parameter estimates, parameter sharing is realized through state tying, which improves the model parameter estimation while reducing the required amount of training data.
However, since continuous speech recognition systems often face low-resource real-world conditions, many researchers are devoted to improving acoustic models, i.e. improving modeling accuracy while minimizing the required amount of data. The subspace Gaussian mixture model (Subspace Gaussian Mixture Model, SGMM) restricts the means and weights of the Gaussian mixture components to a parameter subspace, and the subspace parameters and covariance matrices are shared among different states, so that each state can be represented by vectors in several low-dimensional parameter subspaces. This realizes effective parameter sharing, greatly reduces the number of model parameters, and improves the robustness of parameter estimation under limited data, but it lacks prior assumptions on the variables of the low-dimensional subspaces.
Continuous speech recognition systems often face low-resource and other practical problems, and the mixture factor analysis acoustic model has fewer parameters, making it an effective method for accurate acoustic modeling under limited resources. The method, however, assumes that the local coordinates of the implicit factors follow a standard normal distribution, while studies have found that the true distribution of the local coordinates of the mixture factor analysis is far from normal.
Disclosure of Invention
In view of these problems, the invention provides a method for constructing an acoustic model based on deep mixture factor analysis. The invention introduces a deep mixture factor analysis model into the modeling of the state model and provides an acoustic model based on deep mixture factor analysis with better resistance to overfitting.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the construction method of the acoustic model based on deep mixing factor analysis comprises the following steps:
step 1: generating a baseline system by using training data and adopting an HMM-GMM model;
step 2: initializing an acoustic model DMFA model based on deep mixed factor analysis according to HMM-GMM model parameters, wherein the DMFA model is a two-layer mixed factor analysis model, namely, the DMFA model consists of a first-layer MFA model and a second-layer MFA model, and initializing the DMFA model parameters by adopting a GMM clustering and probability principal component analysis method;
step 3: estimating the overall model parameters of a DMFA model of the acoustic feature space by using training data through a baseline system of an HMM-GMM model and adopting a greedy EM algorithm;
step 4: estimating state model parameters of a first layer MFA model of the DMFA model, wherein the state model parameters comprise state related parameters and state irrelevant parameters;
step 5: state model parameters of a second layer MFA model of the DMFA model are estimated.
Further, initializing the parameters of the DMFA model by GMM clustering and probabilistic principal component analysis comprises the following steps:
determining the number of local regions I of the first-layer MFA model and the latent factor dimension D_i of each region;
determining the number of sub-regions K_i of each region i of the second-layer MFA model, the total number of second-layer sub-regions S = Σ_i K_i, and the latent factor dimension of each sub-region.
Further, the step 3 includes:
step 3.1: given data X = {x_1, x_2, ..., x_T}, train the first-layer MFA model:
training a first-layer MFA model on X with the greedy EM algorithm, with I mixture-of-factor components, the subspace dimension of the i-th component being D_i, i.e. M_i is a D × D_i matrix;
step 3.2: for the data set Y_i of each region i, i ∈ [1, I], calculating the Gaussian posterior distribution of the feature vector x_t belonging to the i-th local region and the posterior probability p(i | x_t), t ∈ [1, T], and obtaining the estimate of the mixture factor component corresponding to x_t according to
ŷ_{t,i} = E[y | x_t, i] = M_i^T (M_i M_i^T + Σ_i)^{-1} (x_t − μ_i)
step 3.3: performing the second-layer MFA model training:
on the estimated first-layer coordinates {ŷ_{t,i}}, an independent second-layer MFA model is trained for each region with the greedy EM algorithm, with second-layer mixture factor coordinate dimension D_i^(2) and number of mixture components K_i.
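The first-layer coordinate estimate of step 3.2 is the standard factor-analysis posterior mean. A minimal NumPy sketch (function name and toy values are hypothetical, not from the patent):

```python
import numpy as np

def factor_posterior_mean(x, mu, M, sigma_diag):
    """E[y | x] for one factor-analysis component x = mu + M y + n,
    with y ~ N(0, I) and n ~ N(0, diag(sigma_diag)).  Standard identity:
    E[y|x] = M^T (M M^T + Sigma)^{-1} (x - mu)."""
    C = M @ M.T + np.diag(sigma_diag)        # D x D model covariance
    return M.T @ np.linalg.solve(C, x - mu)  # D'-dimensional local coordinate

# toy check: D = 3, D' = 2, unit noise
mu = np.zeros(3)
M = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
x = np.array([2.0, -1.0, 0.5])
y_hat = factor_posterior_mean(x, mu, M, np.ones(3))  # -> [1.0, -0.5]
```

With unit noise the posterior mean shrinks the observed deviation x − μ toward zero along each factor direction, which is why it never overshoots the least-squares projection.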
Further, the step 4 includes:
step 4.1: initializing the state-related parameters of the first-layer MFA model and setting the number of iterations K; the initial weight factor of state j in region i is w_ji^(0) and the initial local coordinate is y_ji^(0), where j = 1, 2, ..., J and J is the total number of context-dependent states;
step 4.2: re-estimating the state-related parameters of the j-th state according to the state alignment information of the training data: w_ji^(k) and y_ji^(k), k ∈ [1, K], where w_ji^(k) is the weight factor and y_ji^(k) the local coordinate of the first-layer MFA model at the k-th iteration;
step 4.3: re-estimating the state-independent parameters of the i-th region according to the state alignment information of the training data: μ_i^(k), M_i^(k) and Σ_i^(k), where μ_i^(k) is the local region mean, M_i^(k) the local region basis matrix, and Σ_i^(k) the state-independent covariance matrix of the first-layer MFA model at the k-th iteration.
Further, the step 5 includes:
step 5.1: initializing the state-related parameters of the second-layer MFA model and selecting the number of iterations K′; the initial weight factor of state j in sub-region s is w_js^(0) and the initial local coordinate is y_js^(0);
step 5.2: re-estimating the state-related parameters of the j-th state according to the state alignment information of the training data: w_js^(k) and y_js^(k), k ∈ [1, K′], where w_js^(k) is the weight factor and y_js^(k) the local coordinate of the second-layer MFA model at the k-th iteration;
step 5.3: re-estimating the state-independent parameters of the s-th region according to the state alignment information of the training data: μ_s^(k), M_s^(k) and Σ_s^(k), where μ_s^(k) is the local region mean, M_s^(k) the local region basis matrix, and Σ_s^(k) the state-independent covariance matrix of the second-layer MFA model at the k-th iteration.
Compared with the prior art, the invention has the beneficial effects that:
the invention considers the state acoustic model as a self-adaption of the global feature space mixing factor analysis model, and adopts state related parameters and state irrelevant parameters to jointly describe state modeling. The DMFA model is based on further deepening and expansion of the mixed factor analysis acoustic model, the original model assumes that the regional coordinate parameters follow Gaussian prior distribution, but the regional coordinate parameters are not actually obtained, so that the DMFA model is obtained by modeling the regional coordinate parameters by adopting deeper mixed factor analysis by adopting a deeper mixed factor analysis model. Because the DMFA model has a parameter sharing strategy, the method has better overfitting resistance and is more suitable for acoustic model modeling under low resources.
Drawings
Fig. 1 is a basic flowchart of a method for constructing an acoustic model based on deep mixing factor analysis according to an embodiment of the present invention.
FIG. 2 is a basic flow chart of a method for constructing an acoustic model based on deep blending factor analysis according to another embodiment of the present invention.
Fig. 3 is a schematic diagram of deep mixture factor analysis for the method of constructing an acoustic model based on deep mixing factor analysis according to an embodiment of the present invention; part (a) is a region schematic of the deep mixture factors, and part (b) is the graphical model of the deep mixture factor analysis.
Detailed Description
The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:
embodiment one:
as shown in fig. 1, the method for constructing an acoustic model based on deep mixing factor analysis of the present invention includes the following steps:
step S101: and generating a baseline system by using the training data and adopting an HMM-GMM model.
Step S102: initializing an acoustic model DMFA model based on deep mixed factor analysis according to HMM-GMM model parameters, wherein the DMFA model is a two-layer mixed factor analysis model, namely, the DMFA model consists of a first-layer MFA model and a second-layer MFA model, and the parameters of the DMFA model are initialized by adopting a GMM clustering and probability principal component analysis method.
Step S103: and estimating the overall model parameters of the DMFA model of the acoustic feature space by using a greedy EM algorithm through a baseline system of the HMM-GMM model by using training data.
Step S104: state model parameters of a first layer MFA model of the DMFA model are estimated, the state model parameters including state-related parameters and state-independent parameters.
Step S105: state model parameters of a second layer MFA model of the DMFA model are estimated.
Embodiment two:
As shown in FIG. 2, another embodiment of the method for constructing an acoustic model based on deep mixing factor analysis of the present invention proceeds as follows.
First, an HMM-GMM acoustic model is generated with the Kaldi toolkit. A DMFA model of the entire acoustic feature space is then generated; this model is essentially a global background model, and the model of each state can be considered to be derived from it by an adaptive algorithm that re-estimates the parameters using the data of each state. The invention models with a two-layer DMFA model, i.e. the DMFA model comprises two layers of MFA models; the specific flow is shown in Algorithm 1.
In Algorithm 1, w_ji^(0) is the initial weight factor and y_ji^(0) the initial local coordinate of the first-layer MFA model, and J is the total number of context-dependent states; w_ji^(k) and y_ji^(k) are the weight factor and local coordinate of the first-layer MFA model at the k-th iteration; μ_i^(k), M_i^(k) and Σ_i^(k) are the local region mean, local region basis matrix and state-independent covariance matrix of the first-layer MFA model at the k-th iteration; w_js^(0) and y_js^(0) are the initial weight factor and initial local coordinate of the second-layer MFA model; w_js^(k) and y_js^(k) are the weight factor and local coordinate of the second-layer MFA model at the k-th iteration; μ_s^(k), M_s^(k) and Σ_s^(k) are the local region mean, local region basis matrix and state-independent covariance matrix of the second-layer MFA model at the k-th iteration.
The detailed procedure is as follows:
step S201: generating a baseline system from the training data using a hidden Markov model-Gaussian mixture model (HMM-GMM) (see Povey D, Ghoshal A, Boulianne G, et al. The Kaldi speech recognition toolkit. In: Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2011);
specifically, the training data is 1987-1989 WSJ text data, about 215M.
Step S202: initializing an acoustic model DMFA model based on deep mixed factor analysis according to HMM-GMM model parameters, wherein the DMFA model is a two-layer mixed factor analysis model, namely, the DMFA model consists of a first-layer MFA model and a second-layer MFA model, and the method for initializing the parameters of the DMFA model by adopting GMM clustering and probability principal component analysis comprises the following steps:
factor analysis (Factor Analyzer) is a high dimensional data correlation modeling method that models correlation information with a low dimensional linear subspace. Let x be the high-dimensional space R D The assumption of the root factor analysis model that x can pass through a low-dimensional subspace R K The point y in (D' < D) is obtained by affine transformation:
wherein μ is the mean of the high dimensional spatial data distribution; m is a linear transformation matrix, called factor load matrix, with dimension D x D'; n is a random error term. In the factor distribution model, y follows the D' dimensional standard normal distribution, while the random error n follows the gaussian distribution with a mean of 0 and covariance of the diagonal matrix Σ. It can thus be seen that a large number of observations are concentrated around the mean μ, resulting in a local linear subspace model. Thus, the factor analysis model is well suited for modeling local in high dimensional space, and combining multiple local factor analysis models results in a mixed factor analysis model (Mixture of Factor Analysers, MFA).
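Under this generative assumption the marginal covariance of x is M M^T + Σ, which can be checked numerically. A hedged NumPy sketch with made-up dimensions and parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
D, Dp, N = 4, 2, 200_000                    # hypothetical D, D', sample count
mu = np.array([1.0, -2.0, 0.0, 3.0])
M = rng.normal(size=(D, Dp)) * 0.5          # factor loading matrix (D x D')
sigma_diag = np.full(D, 0.1)                # diagonal noise covariance

y = rng.normal(size=(N, Dp))                # y ~ N(0, I)
n = rng.normal(size=(N, D)) * np.sqrt(sigma_diag)
x = mu + y @ M.T + n                        # x = mu + M y + n

emp_cov = np.cov(x, rowvar=False)           # empirical covariance of samples
model_cov = M @ M.T + np.diag(sigma_diag)   # analytic marginal covariance
```

At this sample size the empirical covariance matches M M^T + Σ to a couple of decimal places, illustrating why the loading matrix captures the correlated part and Σ only the residual per-dimension noise.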
Assuming that the data space is divided into I local regions and that the probabilities of the observation x falling into the local regions are w_1, w_2, …, w_I, each local region is approximated by a factor analysis model, giving the mixture factor analysis model:
p(x) = Σ_{i=1}^{I} w_i N(x; μ_i, M_i M_i^T + Σ_i)    (3)
where μ_i, M_i and Σ_i are the mean, factor loading matrix and reconstruction error matrix of the i-th factor analysis model, and y_i is the coordinate vector of the observation x in that model. In formula (3), the linear subspace dimension of each local factor analysis model may differ; if the linear subspace dimension of the i-th factor analysis model is D_i, then M_i is a D × D_i matrix and the local coordinate y_i is a D_i-dimensional vector. According to formulas (2) and (3), the posterior distribution of y_i given x is Gaussian:
q(y_i | x, i) = N(y_i; M_i^T (M_i M_i^T + Σ_i)^{-1} (x − μ_i), I − M_i^T (M_i M_i^T + Σ_i)^{-1} M_i)
The parameters of the mixture factor analysis model thus comprise the number of local regions I, the latent dimension D_i of each local region, and the local factor analysis model parameters {w_i, μ_i, M_i, Σ_i}. Parameter estimation of the model can be realized with the expectation maximization (Expectation Maximization, EM) algorithm.
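The per-region posterior probabilities that drive the EM estimation can be sketched as follows; this is a generic MFA responsibility computation in NumPy with toy parameters, not the patent's implementation:

```python
import numpy as np

def mfa_responsibilities(X, w, mus, Ms, sigmas):
    """gamma_i(x_t) = w_i N(x_t; mu_i, M_i M_i^T + Sigma_i) / sum over components.
    X: (T, D); w: (I,); mus: (I, D); Ms: list of (D, D_i); sigmas: (I, D) diagonals."""
    T, D = X.shape
    log_p = np.empty((T, len(w)))
    for i in range(len(w)):
        C = Ms[i] @ Ms[i].T + np.diag(sigmas[i])          # component covariance
        diff = X - mus[i]
        _, logdet = np.linalg.slogdet(C)
        maha = np.einsum('td,td->t', diff @ np.linalg.inv(C), diff)
        log_p[:, i] = np.log(w[i]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    log_p -= log_p.max(axis=1, keepdims=True)             # stabilize before exp
    gam = np.exp(log_p)
    return gam / gam.sum(axis=1, keepdims=True)

# two well-separated toy components
w = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [10.0, 10.0]])
Ms = [np.eye(2) * 0.1, np.eye(2) * 0.1]
sigmas = np.ones((2, 2)) * 0.5
X = np.array([[0.1, -0.2], [9.8, 10.3]])
gam = mfa_responsibilities(X, w, mus, Ms, sigmas)
```

Working in the log domain and subtracting the row maximum before exponentiating is the usual guard against underflow when densities differ by many orders of magnitude.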
Assume that the model parameters after the k-th iteration are Λ^(k). The auxiliary function of the greedy EM algorithm at the (k+1)-th iteration is:
Q(Λ^(k), Λ) = Σ_{t=1}^{T} Σ_{i=1}^{I} γ_i(x_t) E_{y|x_t,i} [log(w_i p(x_t, y | i; Λ))]
where γ_i(x_t) is the posterior probability that the feature vector x_t belongs to the i-th local region given the parameters Λ^(k), computed as:
γ_i(x_t) = w_i^(k) N(x_t; μ_i^(k), M_i^(k) M_i^(k)T + Σ_i^(k)) / Σ_{i'=1}^{I} w_{i'}^(k) N(x_t; μ_{i'}^(k), M_{i'}^(k) M_{i'}^(k)T + Σ_{i'}^(k))
For convenience of derivation, let γ_i = Σ_t γ_i(x_t). At the M-step, taking the partial derivatives of Q(Λ^(k), Λ) with respect to the parameters w_i, μ_i, M_i and Σ_i and setting each partial derivative to 0 yields the parameter update formulas of the MFA model; in particular, the weight update is w_i = γ_i / T.
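The weight update w_i = γ_i/T can be sketched directly from the responsibilities. The mean update shown below is the simplified GMM-style form for illustration only; the full MFA M-step couples the mean with the loading matrix via the factor posteriors, which is omitted here:

```python
import numpy as np

def mstep_weights_means(X, gam):
    """Sketch of M-step re-estimation: w_i = (sum_t gamma_i(x_t)) / T exactly,
    and a simplified mean update mu_i = sum_t gamma_i(x_t) x_t / sum_t gamma_i(x_t).
    X: (T, D) data; gam: (T, I) responsibilities."""
    soft_counts = gam.sum(axis=0)                 # gamma_i, one per component
    w = soft_counts / gam.shape[0]                # w_i = gamma_i / T
    mus = (gam.T @ X) / soft_counts[:, None]      # responsibility-weighted means
    return w, mus

# toy data with hard (0/1) responsibilities so the result is easy to verify
X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [12.0, 12.0]])
gam = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
w, mus = mstep_weights_means(X, gam)  # w -> [0.5, 0.5]
```

With hard responsibilities the updates reduce to per-cluster relative frequency and per-cluster sample mean, which makes the soft-count formulas easy to sanity-check.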
When the MFA model converges, the accuracy of the model can be improved by increasing the number of spatial regions I and the dimension D' of the latent factor space, which amounts to adjusting the conditional probability of x under the latent factor y of every region. However, when modeling high-dimensional data, or when the amount of data is insufficient, this approach quickly leads to overfitting. To solve this problem, the prior distribution of each latent factor y is replaced: instead of the simple standard normal prior, a more complex MFA prior is used, i.e.
p(y | i) = Σ_{k_i=1}^{K_i} w_{k_i} N(y; μ_{k_i}^(2), M_{k_i}^(2) M_{k_i}^(2)T + Σ_{k_i}^(2))
where {w_{k_i}, μ_{k_i}^(2), M_{k_i}^(2), Σ_{k_i}^(2)} are the parameters of the new second-layer MFA of spatial region i, i.e. region i is itself subjected to a mixture factor analysis.
Figure 3 shows a schematic diagram of deep mixture factor analysis. Part (a) of Fig. 3 is a region schematic of the deep mixture factor analysis. In part (b), the number of first-layer mixture factor regions is I = 2; the thick lines represent the two regions of the mixture factor analysis model, indexed i = 1, 2, and a mixture factor analysis is applied within each region. Assume the number of sub-regions into which each region i is divided is K_i, indexed k_i = 1, 2, ..., K_i. In the example here, region i = 1 is further modeled by a mixture factor analysis with K_{i=1} = 4 sub-regions k_1 = 1, 2, ..., K_1, K_1 = 4; region i = 2 has sub-regions k_2 = 1, 2, ..., K_2, K_2 = 2.
The DMFA therefore replaces the prior probability of the original MFA, p_MFA(y, i) = p(i) p(y | i), with a better prior probability, i.e.:
p_DMFA(y, i, k_i) = p(i) p(k_i | i) p(y | k_i)    (15)
Thus, to sample from the DMFA, first select i using w_i; then, at the second layer, sample k_i using w_{k_i|i}, where k_i = 1, 2, ..., K_i and K_i is the number of factor analysers the i-th first-layer factor is decomposed into at the second layer; finally, y is represented by the factor component k_i. There is a simpler, fully equivalent form of the DMFA that enumerates all possible second-layer factors k_i: a new factor component indicator s = 1, 2, ..., S, with S = Σ_{i=1}^{I} K_i, denotes a specific second-layer component, as shown on the right side of part (a) of Fig. 3. The mixture weights can then be defined as w_s = w_{i(s)} w_{k_i(s)|i(s)}, where i(s) and k_i(s) denote the first-layer factor i and second-layer factor k_i corresponding to s; for example, i(2) = 1 and i(5) = 2. Note that the size of S increases exponentially with the number of layers. This representation of the generative process is very intuitive, and it will be used in the remainder of this description.
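The two-stage sampling procedure (select i by w_i, then k_i within region i) and the flattened indicator s with its i(s) mapping can be sketched as follows. The region counts follow the running example (I = 2, K_1 = 4, K_2 = 2); the weights themselves are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

w_first = np.array([0.6, 0.4])                    # p(i), hypothetical weights
w_second = [np.full(4, 0.25), np.full(2, 0.5)]    # p(k_i | i), uniform here

def sample_dmfa_indices(n):
    """Ancestral sampling of (i, k_i) from p(i) p(k_i | i); the latent y would
    then be drawn from the selected second-layer factor model (omitted)."""
    out = []
    for _ in range(n):
        i = rng.choice(len(w_first), p=w_first)
        k = rng.choice(len(w_second[i]), p=w_second[i])
        out.append((int(i), int(k)))
    return out

pairs = sample_dmfa_indices(10_000)
frac_i0 = sum(1 for i, _ in pairs if i == 0) / len(pairs)

# Flattened indicator s = 1..S with S = K_1 + K_2 = 6; i(s) mapping from the
# text: i(2) = 1 and i(5) = 2 (1-indexed).
i_of_s = [1, 1, 1, 1, 2, 2]
```

The empirical fraction of draws landing in region 1 converges to w_1 = 0.6, and every second-layer index stays within its region's K_i, mirroring the constraint that each s belongs to exactly one i.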
Part (b) of Fig. 3 shows the graphical model of the two-layer DMFA. In particular,
i ← i(s)    (19)
Formula (19) is well defined because each s belongs to one and only one i. Here Σ^(1) and Σ^(2) are diagonal matrices of dimensions D^(1) × D^(1) and D^(2) × D^(2), respectively. The variables are generic terms; note in particular that y^(2) refers to the y_s^(2) of each s, of which there are S at the second layer, and y^(1) refers to y_{i(s)}^(1): once s is determined, i is determined as well.
For a deep mixture factor analysis model with an arbitrary number of layers L, the parameters of the DMFA model comprise, for each layer l = 1, 2, ..., L, the number of local regions I^(l), the latent dimension of each local region, and the local factor analysis model parameters. The sub-regions of each layer are indexed globally here, but each sub-region belongs to only one region of the previous layer. The invention models with a two-layer DMFA model; for clearer representation, a first-layer region is denoted by i and a second-layer region by s, the number of first-layer regions is I, and the total number of second-layer sub-regions is S.
Let the weight factor of the observation vector of state j falling into the i-th local region be w_ji, satisfying the probability constraint Σ_i w_ji = 1. Assume the observation vector of state j follows a Gaussian distribution in the i-th local region and only the influence of the mean is considered: the mean is μ'_ji and the variance is Σ'_i (the variance of a region is shared by all states). Given the local region i and the mean μ'_ji, the observation probability of state j is:
p(x | j, i) = N(x; μ'_ji, Σ'_i)    (21)
Assume the mean μ'_ji is modeled with a deep factor analysis model. Taking the two-layer model as an example, let the local coordinate of the mean μ'_ji in the i-th first-layer local region be y_ji^(1). From formula (2), μ'_ji is:
μ'_ji = μ_i^(1) + M_i^(1) y_ji^(1)    (22)
Assume the local coordinate of y_ji^(1) in the s-th second-layer local region is y_js^(2). Given the second-layer local region s and y_js^(2), it follows from formula (2) that y_ji^(1) is:
y_ji^(1) = μ_s^(2) + M_s^(2) y_js^(2)    (23)
According to formulas (22) and (23), the local latent variable of the second layer can be used to represent the prior distribution of the mean, i.e.
p(μ'_ji | i, s) = N(μ'_ji; μ_i^(1) + M_i^(1) μ_s^(2), M_i^(1) M_s^(2) (M_i^(1) M_s^(2))^T)    (24)
Substituting formulas (22) and (23) into formula (24), μ'_ji is:
μ'_ji = μ_i^(1) + M_i^(1) (μ_s^(2) + M_s^(2) y_js^(2))    (25)
According to the Bayesian formula, the observation probability of state j is:
p(x | j) = Σ_{i=1}^{I} Σ_s w_ji w_js N(x; μ'_ji, Σ'_i)    (26)
To reduce the number of parameters, let M_is = M_i^(1) M_s^(2). Finally, given the second-layer region coordinates y_js^(2), the observation probability model of state j is obtained:
p(x | j) = Σ_{s=1}^{S} w_js N(x; μ_{i(s)}^(1) + M_{i(s)}^(1) μ_s^(2) + M_{i(s)s} y_js^(2), Σ'_{i(s)})    (27)
where
M_{i(s)s} = M_{i(s)}^(1) M_s^(2)    (28)
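The two-layer mean composition of formulas (22)-(25) and the collapsed matrix M_is = M_i^(1) M_s^(2) agree by associativity of matrix products; a quick NumPy check with random, hypothetical shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
D, D1, D2 = 5, 3, 2                 # made-up feature and subspace dimensions
mu_i = rng.normal(size=D)           # first-layer region mean
M_i = rng.normal(size=(D, D1))      # first-layer basis matrix
mu_s = rng.normal(size=D1)          # second-layer sub-region mean
M_s = rng.normal(size=(D1, D2))     # second-layer basis matrix
y_js = rng.normal(size=D2)          # second-layer state coordinate

# two-step composition: y_ji = mu_s + M_s y_js, then mu'_ji = mu_i + M_i y_ji
mu_prime_stepwise = mu_i + M_i @ (mu_s + M_s @ y_js)

# collapsed form using M_is = M_i M_s, as used to cut the parameter count
M_is = M_i @ M_s
mu_prime_direct = (mu_i + M_i @ mu_s) + M_is @ y_js
```

The collapsed form shows why sharing M_is across states is cheap: the state only contributes the low-dimensional coordinate y_js, while all D × D2 structure lives in region-level matrices.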
In an HMM acoustic model, each state essentially describes the observation probability distribution of a certain stationary segment of the corresponding acoustic modeling unit. The observed feature vectors of each stationary segment therefore necessarily occupy only a small number of regions of the MFA model over the whole space. Accordingly, the observation data of state j are distributed over a small area of the whole space, so the weights w_js are necessarily sparse; in fact the weights in formula (27) reflect the combined action of the first-layer and second-layer weights, which can be considered jointly when re-estimating directly, in order to reduce the number of parameters.
It can thus be seen that formulas (27) and (28) construct a state model based on deep factor analysis, and accurate modeling of each state is achieved by re-estimating its parameters. At each layer of the DMFA model, the state model parameters therefore comprise two parts. First, the state-related parameters: the weight factors w_ji, w_js and the local coordinates y_ji^(1), y_js^(2), where i = 1, 2, ..., I and s = 1, 2, ..., S. Second, the state-independent parameters: μ_i^(1), M_i^(1), Σ_i^(1) and μ_s^(2), M_s^(2), Σ_s^(2).
Since, as Algorithm 1 shows, the typical state model is generated with the greedy EM algorithm, the initial selection of the DMFA is critical. Following the method for generating a typical state model from a mixture-factor-analysis acoustic model, the generation comprises two parts: first, a GMM of the overall space is generated from the baseline system of the HMM-GMM model; second, the GMM of the overall space is expressed as a deep mixture factor analysis model.
Obtaining the GMM of the overall space is similar to the MFA-based acoustic model method, but requires two layers of clustering. Consider first the first-layer clustering. Let the total number of Gaussian mixture components in the HMM-GMM baseline system be M, number them 1 to M in some order, and let the mean of the m-th Gaussian component be μ_m^(0) and its covariance matrix Σ_m^(0). The training data is force-aligned, and the zero-order, first-order and second-order statistics of each Gaussian component are computed:
γ_m = Σ_t γ_m(t), s_m = Σ_t γ_m(t) o_t, S_m = Σ_t γ_m(t) o_t o_t^T
where γ_m(t) is the posterior probability of the t-th frame feature vector o_t on the m-th Gaussian component, computed by the Baum-Welch forward-backward algorithm. The likelihood of the training data for each component is then computed as: LLK_m = Σ_t γ_m(t) log p(o_t | m).
When the m′-th and m″-th Gaussian components are clustered and merged into a new Gaussian component m‴, the corresponding zero-order, first-order and second-order statistics are computed as
γ_m‴ = γ_m′ + γ_m″, s_m‴ = s_m′ + s_m″, S_m‴ = S_m′ + S_m″
and the merged weight w_m‴, mean vector μ_m‴ and covariance matrix Σ_m‴ are then computed with the parameter re-estimation formulas of the HMM-GMM. The loss in the log likelihood of the training data after merging is:
ΔLLK_{m′m″→m‴} = LLK_m‴ − LLK_m′ − LLK_m″    (29)
A GMM containing I Gaussian components is obtained through an M − I step clustering process. At each clustering step, the current Gaussian components are considered pairwise, the likelihood loss before and after merging is computed by formula (29), the two components with the smallest loss are merged into a new component, and the two components before merging are deleted; the weight, mean vector and covariance matrix of the new component are computed from its merged statistics. After the above clustering process is completed, the GMM parameters containing I Gaussian components are obtained as {w_i, μ_i, Σ_i}, i = 1, ..., I.
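Merging two Gaussian components from their accumulated statistics, then re-estimating the mean and covariance, can be sketched as below (toy one-dimensional data, hypothetical function names):

```python
import numpy as np

def merge_gaussians(stats_a, stats_b):
    """Merge two components from accumulated (gamma, s, S) statistics, then
    re-estimate as in HMM-GMM: mu = s / gamma, Cov = S / gamma - mu mu^T."""
    g = stats_a[0] + stats_b[0]          # zero-order: soft counts add
    s = stats_a[1] + stats_b[1]          # first-order: weighted sums add
    S = stats_a[2] + stats_b[2]          # second-order: outer-product sums add
    mu = s / g
    cov = S / g - np.outer(mu, mu)
    return g, mu, cov

def stats(X):
    """Accumulate (gamma, s, S) for hard-assigned frames X of shape (T, D)."""
    return len(X), X.sum(axis=0), np.einsum('ti,tj->ij', X, X)

# toy split: component a owns frames {0, 2}, component b owns frames {4, 6}
Xa = np.array([[0.0], [2.0]])
Xb = np.array([[4.0], [6.0]])
g, mu, cov = merge_gaussians(stats(Xa), stats(Xb))  # pooled mean 3, variance 5
```

Because the statistics are additive, the merged component is exactly the maximum-likelihood Gaussian of the pooled frames, which is what makes the greedy pairwise clustering cheap to evaluate.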
Notably, when the GMM over the entire acoustic space is acquired, some intermediate results must be recorded for the generation of the multi-layer GMM model. In a multi-layer GMM, the Gaussian components of each layer are themselves represented by a GMM, forming a hierarchy. Consider first the two-layer GMM. Let the Gaussian component set of the HMM-GMM baseline system be Ω = {m^(0)}, m = 1, 2, …, M. The first-layer GMM is obtained with the clustering algorithm described above, but for every merge in the clustering the composition of original components contained in each merged component is recorded; for example, the new component created by the k-th merge keeps the set of original components of Ω that were combined into it. The resulting first-layer GMM parameters are {w_i^(1), μ_i^(1), Σ_i^(1)}, i = 1, 2, …, I, where the superscript (1) denotes the first-layer clustering, and the original-component set of each component i is C_i, i = 1, 2, …, I, the sets C_i together partitioning Ω. For the second-layer GMM, the components in the original set C_i of the i-th first-layer component are merged with the same clustering algorithm as in the first layer, down to a specified number, completing the acquisition of the second-layer GMM parameters of the deep GMM model; the clustered second-layer components thus share the information of the first-layer components.
After the clustering process is completed, the first-layer GMM parameters with I Gaussian components are obtained as {w_i^(1), μ_i^(1), Σ_i^(1)} and the second-layer components as {w_s^(2), μ_s^(2), Σ_s^(2)}. For each covariance matrix of each layer, e.g. Σ_h, an eigenvalue analysis is performed and the eigenvalues are sorted in descending order as λ_h1, λ_h2, …, λ_hD, with corresponding eigenvectors v_h1, v_h2, …, v_hD. The cumulative contribution rate (Cumulative Contribution Rate, CCR) η_hd of the first d eigenvalues is defined as:

η_hd = (Σ_{k=1}^{d} λ_hk) / (Σ_{k=1}^{D} λ_hk)
η_hd reflects the ratio of the first d eigenvalues to the sum of all eigenvalues. For the h-th local region of the mixture factor analysis model, the latent dimension D_h is chosen as the smallest d such that η_hd exceeds 90%, i.e., the smallest number of eigenvalues whose cumulative contribution exceeds 90% is taken as the latent dimension of that local region. For region h of the first layer, the remaining parameters of its corresponding factor analysis model are initialized with the probabilistic principal component analysis method.
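The CCR-based choice of the latent dimension can be sketched as follows (`latent_dim` is an illustrative name implementing the 90%-threshold rule above):

```python
import numpy as np

def latent_dim(cov, threshold=0.9):
    """Pick the latent dimension D_h of a local region as the smallest d
    whose cumulative eigenvalue contribution rate (CCR) exceeds threshold."""
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, descending order
    ccr = np.cumsum(eigvals) / eigvals.sum()  # eta_{h,d} for d = 1..D
    return int(np.argmax(ccr > threshold)) + 1
```

`eigvalsh` is used because a covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the reversal.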
Step S203: estimating the overall model parameters of the DMFA model of the acoustic feature space from the baseline system of the HMM-GMM model with a greedy EM algorithm, using training data, comprising:
The training of the DMFA can be achieved with a greedy EM algorithm. The MFA of the first layer is trained in the standard manner; then formula (8) is used to obtain, for each training sample, its mixture-factor component and the factor of the corresponding component. The first-layer parameters are then frozen, and the first-layer factor values of each component are extracted as training data for the second-layer MFA. Algorithm 2 gives the specific layer-by-layer training algorithm. After greedy learning, the DMFA can be collapsed and refined with the update steps of the conventional EM algorithm.
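The factor extraction that feeds the second layer can be sketched as follows. Under a standard factor-analysis component x = μᵢ + Mᵢy + ε with a standard-normal prior on y, the posterior mean is E[y | x, i] = (I + MᵢᵀΣᵢ⁻¹Mᵢ)⁻¹ MᵢᵀΣᵢ⁻¹(x − μᵢ); that this matches the patent's formula (8) exactly is an assumption, since (8) is not reproduced here, and the names are illustrative:

```python
import numpy as np

def factor_posterior_mean(x, mu, M, sigma):
    """Posterior mean E[y | x, i] of the latent factor for one FA component,
    assuming x = mu + M y + noise, noise ~ N(0, sigma), y ~ N(0, I)."""
    d = M.shape[1]
    sinv = np.linalg.inv(sigma)
    prec = np.eye(d) + M.T @ sinv @ M          # posterior precision of y
    return np.linalg.solve(prec, M.T @ sinv @ (x - mu))
```

In the layer-wise scheme, each training frame is assigned to a component, its factor posterior mean is computed with the frozen first-layer parameters, and those factor vectors become the training set of the second-layer MFA.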
Step S204: estimating the state model parameters of the first-layer MFA model of the DMFA model, the state model parameters including state-dependent parameters and state-independent parameters, comprising:
According to the state-model modeling method given in step 2, the MFA-based state model parameters of the first layer — the weight factors w_ji, the local coordinates y_ji, the local-region basis matrices M_i, the local-region mean vectors μ_i and the state-independent covariance matrices Σ_i — are estimated. An auxiliary function is constructed for each parameter under the maximum likelihood criterion, and the parameter re-estimation formulas are obtained as follows:
It should be noted that, since each state model is essentially the observation probability distribution describing a certain stationary segment of the corresponding acoustic modeling unit, the observation feature vectors of that segment cover only a small part of all regions. The weights w_ji in the state-model observation space are therefore sparse. The sparsity constraint can be realized with a weight-shrinkage algorithm consisting of two parts, weight shrinkage and weight normalization: shrinkage enforces sparsity, while normalization makes the weights satisfy their statistical constraint. The final weight parameters are obtained after this algorithm, and all parameters are updated to obtain the first-layer MFA acoustic model of the state.
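The two-part weight-shrinkage algorithm can be sketched as follows; the patent does not reproduce the exact shrinkage rule, so a simple hard threshold τ is used here as an assumption:

```python
import numpy as np

def shrink_weights(w, tau=0.01):
    """Weight-shrinkage sparsity constraint: zero out weights below an
    (assumed) hard threshold tau, then renormalize so the remaining
    weights sum to one, i.e. still form a valid mixture distribution."""
    w = np.where(w >= tau, w, 0.0)   # shrinkage step: enforce sparsity
    return w / w.sum()               # normalization step: statistical constraint
```

The same routine can be applied unchanged to the second-layer weights w_js, which are subject to the same sparsity constraint.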
Step S205: estimating state model parameters of a second layer MFA model of the DMFA model, comprising:
From equations (27) and (28), once the first-layer parameters of the state DMFA model are obtained, accurate modeling of the state can be achieved through changes of the second-layer parameters alone. An auxiliary function can therefore be established as:
Thus, based on the first-layer parameters, auxiliary functions are constructed with formula (36) for the weight factors w_js, the local coordinates y_js, the local-region basis matrices, the local-region mean vectors and the state-independent covariance matrices, and the parameter re-estimation formulas are obtained as follows:
wherein the parameters H_js and g_js are given by formula (42), and γ_js, s_js and S_js are the zeroth-, first- and second-order statistics of the s-th subregion of state j, respectively.
It should be noted that, first, since the weight w_js is estimated jointly from all subregions, the same sparsity-constraint algorithm as used for w_ji above is applied to it; second, each subregion s belongs to exactly one region i, so once a subregion s is determined the corresponding region i is also determined; in addition, the matrix inverses appearing in formulas (39) to (41) are generalized inverses, and it can be shown that the relevant matrix has a left inverse in the one case and a right inverse in the other.
The invention provides an acoustic-model modeling method based on deep mixture factor analysis. The new method follows the idea of the MFA-based modeling method: the state acoustic model is regarded as an adaptation of the global feature-space mixture factor analysis model, and state modeling is described jointly by state-dependent and state-independent parameters. The model, however, further deepens and extends the MFA acoustic model: the original model assumes that the region coordinate parameters follow a Gaussian prior distribution, which is not actually the case, so a deeper mixture factor analysis model is adopted to model the region coordinate parameters. Because the DMFA model employs a parameter-sharing strategy, the method has better resistance to overfitting and is more suitable for acoustic modeling under low-resource conditions.
The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.
Claims (5)
1. The construction method of the acoustic model based on deep mixing factor analysis is characterized by comprising the following steps:
step 1: generating a baseline system by using training data and adopting an HMM-GMM model;
step 2: initializing an acoustic model DMFA model based on deep mixed factor analysis according to HMM-GMM model parameters, wherein the DMFA model is a two-layer mixed factor analysis model, namely, the DMFA model consists of a first-layer MFA model and a second-layer MFA model, and initializing the DMFA model parameters by adopting a GMM clustering and probability principal component analysis method;
step 3: estimating the overall model parameters of a DMFA model of the acoustic feature space by using training data through a baseline system of an HMM-GMM model and adopting a greedy EM algorithm;
step 4: estimating state model parameters of a first layer MFA model of the DMFA model, wherein the state model parameters comprise state related parameters and state irrelevant parameters;
estimating state model parameters of a first layer MFA model of the DMFA model according to:
wherein w_ji^(k+1) is the weight factor, y_ji^(k+1) the local coordinates, μ_i^(k+1) the local-region mean, M_i^(k+1) the local-region basis matrix and Σ_i^(k+1) the state-independent covariance matrix of the first-layer MFA model at the (k+1)-th iteration; μ_i, M_i and Σ_i are the mean, the factor loading matrix and the reconstruction error matrix of the i-th factor analysis model; y_ji are the local coordinates of the first-layer MFA model; x_t is the feature vector; and γ_ji and s_ji are the zeroth- and first-order statistics of the first-layer MFA model of state j, respectively;
step 5: estimating state model parameters of a second-layer MFA model of the DMFA model;
estimating state model parameters of a second layer MFA model of the DMFA model according to:
wherein the method comprises the steps of
Wherein the method comprises the steps ofIs the weighting factor of the MFA model of the second layer at the k+1th iteration, +.>For the local coordinates of the MFA model of the second layer at the k+1th iteration, +.>For the mean value of the local region of the MFA model of the second layer at the k+1th iteration, +.>For the local area basis matrix of the MFA model of the second layer at the k+1th iteration,/and>for the state-independent covariance matrix of the second-layer MFA model at the k+1th iteration,/->Is the local coordinates of the second-layer MFA model, gamma js 、s js And S is js Zero order, first order, and second order statistics, respectively, of the second layer MFA model of state j.
2. The method for constructing an acoustic model based on deep mixing factor analysis according to claim 1, wherein initializing DMFA model parameters by using GMM clustering and probability principal component analysis method comprises:
determining the number of local regions I and the latent factor dimension D_i^(1) of each region of the first-layer MFA model;
3. The method for constructing an acoustic model based on deep mixing factor analysis according to claim 2, wherein the step 3 comprises:
step 3.1: given data X = {x_1, x_2, …, x_T}, performing first-layer MFA model training:
training a first-layer MFA model on X with the greedy EM algorithm, wherein there are I mixture-factor components and the subspace dimension of each factor is D_i^(1), i.e., M_i is a D × D_i^(1) matrix;
Step 3.2: data set Y for each region i i ,i∈[1,I]Calculating a feature vector x t Gaussian distribution of posterior probability belonging to ith local areaAnd feature vector x t Posterior probability p (i|x) belonging to the i-th local region t ),t∈[1,T]X is obtained according to the following formula t Estimated value of corresponding blend factor component i +.>
Step 3.3: performing second-layer MFA model training:
4. The method for constructing an acoustic model based on deep mixing factor analysis according to claim 3, wherein the step 4 comprises:
step 4.1: initializing the state-dependent parameters of the first-layer MFA model and setting the number of iterations K, wherein w_ji^(0) is the initial weight factor of the first-layer MFA model, y_ji^(0) are the initial local coordinates of the first-layer MFA model, and J is the total number of context-dependent states;
step 4.2: re-estimating the state-dependent parameters of the j-th state according to the state alignment information of the training data, wherein w_ji^(k) is the weight factor and y_ji^(k) are the local coordinates of the first-layer MFA model at the k-th iteration;
step 4.3: re-estimating the state-independent parameters of the i-th region according to the state alignment information of the training data, wherein μ_i^(k) is the local-region mean, M_i^(k) the local-region basis matrix and Σ_i^(k) the state-independent covariance matrix of the first-layer MFA model at the k-th iteration.
5. The method for constructing an acoustic model based on deep mixing factor analysis according to claim 4, wherein the step 5 comprises:
step 5.1: initializing the state-dependent parameters of the second-layer MFA model and selecting the number of iterations K′, wherein w_js^(0) is the initial weight factor and y_js^(0) are the initial local coordinates of the second-layer MFA model;
step 5.2: re-estimating the state-dependent parameters of the j-th state according to the state alignment information of the training data, wherein w_js^(k) is the weight factor and y_js^(k) are the local coordinates of the second-layer MFA model at the k-th iteration;
step 5.3: re-estimating the state-independent parameters of the s-th region according to the state alignment information of the training data, wherein μ_s^(k) is the local-region mean, M_s^(k) the local-region basis matrix and Σ_s^(k) the state-independent covariance matrix of the second-layer MFA model at the k-th iteration.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811537321.XA CN109545201B (en) | 2018-12-15 | 2018-12-15 | Construction method of acoustic model based on deep mixing factor analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109545201A CN109545201A (en) | 2019-03-29 |
CN109545201B true CN109545201B (en) | 2023-06-06 |
Family
ID=65856473
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811537321.XA Active CN109545201B (en) | 2018-12-15 | 2018-12-15 | Construction method of acoustic model based on deep mixing factor analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109545201B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113421555B (en) * | 2021-08-05 | 2024-04-12 | 辽宁大学 | BN-SGMM-HMM-based low-resource voice recognition method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103117060A (en) * | 2013-01-18 | 2013-05-22 | 中国科学院声学研究所 | Modeling approach and modeling system of acoustic model used in speech recognition |
CN104795063A (en) * | 2015-03-20 | 2015-07-22 | 中国人民解放军信息工程大学 | Acoustic model building method based on nonlinear manifold structure of acoustic space |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108630199A (en) * | 2018-06-30 | 2018-10-09 | 中国人民解放军战略支援部队信息工程大学 | A kind of data processing method of acoustic model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||