CN108446273A - Kalman filtering word vector learning method based on Dirichlet process - Google Patents
- Publication number
- CN108446273A (grant CN108446273B, application CN201810212606.XA)
- Authority
- CN
- China
- Prior art keywords
- lds
- calculate
- parameter
- kalman filter
- language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Pure & Applied Mathematics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
A Kalman filtering word vector learning method based on the Dirichlet process, the method comprising: training and pre-processing a corpus; generating an LDS language model system and initializing the system parameters; assuming the process noise follows a normal distribution, defining the cluster θ_t = (μ_t, Σ_t), where μ_t is the frequency with which word t appears in the corpus, and computing the Dirichlet prior distribution of θ_t; computing the posterior distribution via Kalman filter derivation and Gibbs sampling estimation; extracting candidate clusters with an MCMC sampling algorithm, computing the selection probability of the candidate clusters, selecting the candidate cluster with the highest probability value as θ_t, and computing the minimum mean-squared-error estimate of that cluster; substituting the calculation results into the LDS language model and training the model with the EM algorithm until the model parameters stabilize; and feeding the pre-processed corpus into the trained LDS language model, computing the implicit vector representation via the one-step update formula of the Kalman filter.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a Kalman filtering word vector learning method based on the Dirichlet process.
Background technology
In natural language processing (NLP) tasks, handing natural language to machine-learning algorithms usually requires the language to be mathematized first, because machines only recognize mathematical symbols. Vectors are the means by which people abstract things in the natural world into things a machine can process, and word vectors are a way of mathematizing the words of a language.
The simplest word vector representation is the one-hot representation: a word is represented by a very long vector whose length is the size of the dictionary; exactly one component of the vector is 1 and all others are 0, and the position of the 1 corresponds to the position of the word in the dictionary. This representation has two drawbacks: (1) it easily suffers from the curse of dimensionality, especially when used in deep-learning algorithms; (2) it cannot capture the similarity between words well.
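As a minimal illustration (the dictionary and word below are hypothetical examples, not taken from the patent), a one-hot vector can be built as follows:

```python
import numpy as np

# Hypothetical dictionary; in practice it is generated during corpus pre-processing.
dictionary = ["cat", "dog", "king", "queen", "runs"]

def one_hot(word, dictionary):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(dictionary))
    vec[dictionary.index(word)] = 1.0
    return vec

print(one_hot("king", dictionary))  # [0. 0. 1. 0. 0.]
```

Note that the vector dimension equals the dictionary size |V|, which for a realistic dictionary runs into the hundreds of thousands — the first drawback above.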
Another word vector representation is the distributed representation, first proposed by Hinton in 1986, which can overcome the drawbacks of the one-hot representation. Its basic principle is to map each word of a language, through training, to a short vector of fixed length ("short" here relative to the "long" one-hot representation). All these vectors together form a word vector space, in which each vector is a point; by introducing a "distance" on this space, the (morphological, semantic) similarity between words can be judged from the distance between them.
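A common choice of such a "distance" is cosine similarity. The sketch below uses made-up 3-dimensional vectors, not trained embeddings, purely to show the idea:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up dense vectors standing in for trained distributed representations.
v_king  = np.array([0.8, 0.3, 0.1])
v_queen = np.array([0.7, 0.4, 0.1])
v_runs  = np.array([0.1, 0.1, 0.9])

print(cosine_similarity(v_king, v_queen))  # high: similar words
print(cosine_similarity(v_king, v_runs))   # low: dissimilar words
```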
Owing to many characteristics of text data itself, such as near-synonymy and polysemy, the feature vectors representing text data are often high-dimensional. These high-dimensional feature vectors, however, are not necessarily beneficial to correct classification; instead they can make the features sparse, reducing algorithmic efficiency and even degrading classification quality. A good word vector representation method is therefore of great significance.
To obtain context-based word vectors, a Gaussian linear dynamical system (LDS) is used to describe a language model. The advantage of this method is that it can make full use of the context of the current word. The model is learned from a large unlabeled corpus, and the implicit vector representation of each word is inferred by a Kalman filter. Suppose the dictionary is V and the corpus length is T. Let w_t denote the indicator vector of word t, with dimension |V|; if word t occupies the i-th position in the dictionary, the i-th component of w_t is nonzero. Define μ_i as the frequency with which word t appears in the corpus.
The LDS model is:
x_t = A x_{t-1} + η_t
w_t = C x_t + ξ_t
where x_t denotes the implicit vector representation of the word to be learnt, w_t denotes the observation, η_t and ξ_t denote the process noise and measurement noise of the system, and A and C denote the state-transition matrix and the observation matrix. The observation is centred by the word frequencies (μ_i being an element of μ), so that w_t has zero mean; E[·] denotes the expectation function.
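To make the state-space structure concrete, here is a minimal simulation of such an LDS under the Gaussian-noise assumption of the prior art; the dimensions and the matrices A, C, Q, R below are arbitrary illustrative choices, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 10                  # hidden dimension, observation dimension (illustrative)
A = 0.9 * np.eye(n)           # state-transition matrix (assumed)
C = rng.normal(size=(m, n))   # observation matrix (assumed)
Q = 0.01 * np.eye(n)          # process-noise covariance
R = 0.10 * np.eye(m)          # measurement-noise covariance

x = np.zeros(n)
for t in range(5):
    # x_t = A x_{t-1} + eta_t
    x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
    # w_t = C x_t + xi_t
    w = C @ x + rng.multivariate_normal(np.zeros(m), R)
    print(t, np.round(w[:3], 3))
```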
The number of words separating a word at the i-th position from another word at the j-th position in a sentence is defined as the lag. When the lag is k, μ_i and the corresponding lag-k expectation are estimated over the corpus of length T by the corresponding empirical averages, where E[·] denotes expectation and the x-subscripted expectation is taken with respect to x.
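As an illustration of these corpus statistics (the exact estimator formulas appear only as images in the original publication; the sketch below computes the natural empirical versions — word frequencies and lag-k co-occurrence counts — on a toy corpus):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()  # toy corpus, T = 9
T = len(corpus)

# Empirical frequency mu_i of each word in the corpus.
mu = {w: c / T for w, c in Counter(corpus).items()}

# Empirical lag-k co-occurrence counts: pairs of words k positions apart.
k = 2
pairs = Counter(zip(corpus, corpus[k:]))

print(mu["the"])              # 0.333...
print(pairs[("the", "sat")])  # 1
```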
The process of model parameter estimation is as follows, in two steps. First, the EM algorithm is initialized by the SSID (subspace identification) method, and the parameters A, C, Q, R (Q and R being basic parameters of the Kalman filter) are optimized by EM with ASOS (Martens, 2010), an enhanced EM algorithm. Then, based on the estimate from the first t−1 observations, where K is the Kalman filtering gain, and on the estimate based on the whole corpus (I and J being basic parameters of the Kalman filter), the implicit vector representation x_t is computed and expressed as the word vector.
The prior art assumes that the system's process noise and measurement noise are zero-mean Gaussian white noise. In practice, however, the system noise is usually uncertain; for language models in particular, many problems require mining new, unknown information from the corpus. The above assumption is therefore unrealistic, and word vectors obtained under it are inaccurate.
The Dirichlet process is a well-known nonparametric (variable-parameter) Bayesian model, particularly suited to solving various clustering problems. The advantage of this class of models is that, when building a mixture model, the number of clusters need not be specified manually but is computed autonomously from the model and the data. In the field of natural language processing, many problems require mining new, unknown information from the corpus, and such information often lacks prior knowledge, so this advantage of the Dirichlet process can be fully exploited in many natural language processing applications. However, the prior art has not yet applied the Dirichlet process to word vector representation in natural language processing.
It is therefore desirable to develop a new word vector representation method.
Summary of the invention
To achieve the object of the invention, the present invention provides a Kalman filtering word vector learning method based on the Dirichlet process, the method comprising:
training and pre-processing a corpus;
generating an LDS language model system and initializing the system parameters;
assuming the process noise follows a normal distribution, defining the cluster θ_t = (μ_t, Σ_t), where μ_t is the frequency with which word t appears in the corpus, and computing the Dirichlet prior distribution of θ_t;
computing the posterior distribution via Kalman filter derivation and Gibbs sampling estimation;
extracting candidate clusters with an MCMC sampling algorithm, computing the selection probability of the candidate clusters, selecting the candidate cluster with the highest probability value as θ_t, and computing the minimum mean-squared-error estimate of that cluster;
substituting the calculation results into the LDS language model and training the model with the EM algorithm until the model parameters stabilize;
feeding the pre-processed corpus into the trained LDS language model and computing the implicit vector representation via the one-step update formula of the Kalman filter.
Wherein, the LDS language model is as follows:
x_t = A x_{t-1} + η_t
w_t = C x_t + ξ_t
where x_t denotes the implicit vector representation of the word to be learnt, w_t denotes the observation, η_t and ξ_t denote the process noise and the measurement noise of the system, and A and C denote the state-transition matrix and the observation matrix.
Wherein, θ_t satisfies the Dirichlet process prior assumption, computed as θ_t ~ G; G ~ DP(α, G_0), where the symbol ~ denotes "is distributed as", DP denotes the Dirichlet process, α is the scale factor, and G_0 denotes the base distribution, G_0 = NIW(μ_0, κ_0, ν_0, Λ_0), with μ_0, κ_0, ν_0, Λ_0 as hyperparameters.
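As an illustration of what drawing clusters from such a prior looks like (not part of the claimed method), the sketch below uses a truncated stick-breaking construction of G ~ DP(α, G_0) over a normal-inverse-Wishart (NIW) base distribution; all hyperparameter values are arbitrary placeholders, not those of the invention:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)
alpha = 1.0                         # scale (concentration) factor, assumed
mu0, kappa0, nu0 = np.zeros(2), 1.0, 4.0
Lambda0 = np.eye(2)                 # NIW hyperparameters, assumed
K = 10                              # truncation level for stick-breaking

def sample_niw():
    """One draw (mu, Sigma) from the NIW base distribution G0."""
    Sigma = invwishart.rvs(df=nu0, scale=Lambda0, random_state=rng)
    mu = rng.multivariate_normal(mu0, Sigma / kappa0)
    return mu, Sigma

# Truncated stick-breaking construction of G ~ DP(alpha, G0).
betas = rng.beta(1.0, alpha, size=K)
weights = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
atoms = [sample_niw() for _ in range(K)]

# Draw a cluster theta_t = (mu_t, Sigma_t) for a word.
k = rng.choice(K, p=weights / weights.sum())
mu_t, Sigma_t = atoms[k]
```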
Wherein, the calculation formula of the posterior distribution is as follows:
p(x_{0:T}, θ_{1:T} | w_{1:T}) = p(x_{0:T} | θ_{1:T}, w_{1:T}) p(θ_{1:T} | w_{1:T})
where p(x_{0:T} | θ_{1:T}, w_{1:T}) can be derived by Kalman filtering and p(θ_{1:T} | w_{1:T}) can be estimated by Gibbs sampling.
Wherein, the detailed process of extracting candidate clusters with the MCMC sampling algorithm, computing the selection probability of the candidate clusters, and selecting the candidate cluster with the highest probability value as θ_t is as follows:
a word is drawn from 1, ..., T, with θ_t^{(i)} denoting the result of the i-th draw after excluding word t, i ≥ 2;
the MH algorithm extracts a candidate cluster from the proposal distribution;
the selection probability ρ of the candidate cluster is computed; if ρ > α, the candidate is accepted as θ_t, otherwise the previous value is kept.
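A generic sketch of the Metropolis–Hastings accept/reject step referenced above is given below; the target and proposal densities are stand-ins, since the patent's actual proposal formula appears only as an image:

```python
import numpy as np

rng = np.random.default_rng(2)

def mh_step(theta, log_target, propose, log_q):
    """One Metropolis-Hastings step: propose a candidate cluster and
    accept it with probability min(1, ratio); otherwise keep theta."""
    cand = propose(theta)
    log_ratio = (log_target(cand) - log_target(theta)
                 + log_q(theta, cand) - log_q(cand, theta))
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return cand          # accept the candidate cluster
    return theta             # reject: keep the current cluster

# Stand-in densities: 1-D Gaussian target, symmetric random-walk proposal.
log_target = lambda th: -0.5 * th**2
propose = lambda th: th + rng.normal(scale=0.5)
log_q = lambda a, b: 0.0     # symmetric proposal cancels in the ratio

theta = 0.0
for _ in range(1000):
    theta = mh_step(theta, log_target, propose, log_q)
```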
Wherein, the detailed process of substituting the calculation results into the LDS language model and training the model with the EM algorithm until the model parameters stabilize is as follows:
E-step: compute the state estimate of the Kalman filter at time t from the parameter values at time t−1, and then compute the covariance matrix of the state estimate:
(1) first define the required quantities, where R is the covariance matrix of the observation noise;
(2) compute the data at time t using a BP neural network model:
backward propagation: for t = T, ..., 1, compute the covariance matrices; letting N_V be the unit diagonal matrix and B(θ_t) = G_0 chol(Σ_t)^T, the covariance matrices can then be derived by the Kalman filter;
forward propagation: for t = 1, ..., T, the covariance matrices and m_{t|t}(θ_{1:t-1}) can be derived by the Kalman filter;
M-step: compute the expected value using the covariance matrices, maximize the expected value, and solve for the relevant parameters of the LDS model, namely the state-transition matrix A and the observation matrix C;
update the above parameters and repeat the above two steps until the LDS model stabilizes.
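A schematic EM training loop for an LDS of this kind might look as follows. This is a deliberate simplification: it uses filtered rather than smoothed moments and omits the sampled clusters θ_t and the BP neural network of the invention, so it illustrates only the alternate-until-stable structure of the E-step and M-step:

```python
import numpy as np

def em_train_lds(W, A, C, Q, R, n_iter=50, tol=1e-6):
    """Simplified EM loop for an LDS (illustrative sketch only)."""
    T, m = W.shape
    n = A.shape[0]
    for _ in range(n_iter):
        # --- E-step: a Kalman filter pass collects state moments ---
        x, P = np.zeros(n), np.eye(n)
        Xs, Xprev = [], []
        x_prev = x.copy()
        for t in range(T):
            x_pred = A @ x                       # predict
            P_pred = A @ P @ A.T + Q
            S = C @ P_pred @ C.T + R             # update
            K = P_pred @ C.T @ np.linalg.inv(S)
            x = x_pred + K @ (W[t] - C @ x_pred)
            P = (np.eye(n) - K @ C) @ P_pred
            Xs.append(x.copy()); Xprev.append(x_prev.copy()); x_prev = x.copy()
        X, Xp = np.array(Xs), np.array(Xprev)
        # --- M-step: least-squares updates of A and C ---
        A_new = np.linalg.lstsq(Xp, X, rcond=None)[0].T  # x_t ~ A x_{t-1}
        C_new = np.linalg.lstsq(X, W, rcond=None)[0].T   # w_t ~ C x_t
        if np.linalg.norm(A_new - A) + np.linalg.norm(C_new - C) < tol:
            break   # parameters have stabilized
        A, C = A_new, C_new
    return A, C
```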
Wherein, the detailed process of computing the implicit vector representation via the one-step update formula of the Kalman filter is as follows:
the one-step update formula of the Kalman filter is applied, where K, R, B, Q, P, I are the basic parameters of the Kalman filter; the estimate of the implicit vector representation x_t computed with the one-step update formula yields the implicit vector representation x_t, and x_t is expressed as the word vector.
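For reference, here is a textbook one-step (predict–update) iteration of the Kalman filter. The patent's own one-step update formula appears only as an image, so treating it as this standard form is an assumption:

```python
import numpy as np

def kalman_one_step(x, P, w, A, C, Q, R):
    """One predict-update step of the Kalman filter.
    Returns the new state estimate x_hat (the implicit word vector
    estimate in this setting) and its covariance P."""
    # predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # update with observation w (the one-hot word indicator)
    S = C @ P_pred @ C.T + R             # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    x_hat = x_pred + K @ (w - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_hat, P_new
```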
The invention can make full use of the unknown information in the corpus and thereby learn better word vector representations. Word vectors obtained with the model of the present invention express more accurately both the meaning a word itself carries and its latent relationships to other words, such as near-synonyms, synonyms, and antonyms.
The features and advantages of the present invention will become apparent from the following detailed description of specific embodiments with reference to the accompanying drawings.
Description of the drawings
Fig. 1 shows the flow chart of the word vector learning method of the present invention.
Specific embodiments
Embodiments of the present invention provide a Kalman filtering word vector learning method based on the Dirichlet process. The present invention assumes that the process noise and measurement noise of the system follow a Dirichlet distribution, from which the Dirichlet posterior distribution can be computed; sampling is then performed with an MCMC (Markov chain Monte Carlo) sampling algorithm to obtain the candidate cluster with the highest selection probability, which is substituted into the LDS model to train the model parameters; finally, the pre-processed corpus is fed into the trained language model, and the estimate of the implicit vector representation is computed with the one-step update formula of the Kalman filter.
The technical scheme of the present invention is described in detail below with reference to Fig. 1.
First, the corpus is trained and pre-processed, including word segmentation and dictionary generation; this is a known procedure for word vector learning in the field of natural language processing and is not repeated here.
Then, the LDS language model system of the present invention is generated, and the system parameters are initialized.
The LDS language model of the present invention is as follows:
x_t = A x_{t-1} + η_t
w_t = C x_t + ξ_t
where x_t denotes the implicit vector representation of the word to be learnt, and w_t denotes the observation, for which we use the one-hot representation. The noise terms comprise the process noise and the measurement noise of the system, denoted η_t and ξ_t respectively; A and C denote the state-transition matrix and the observation matrix. We set the measurement noise ξ_t to zero-mean Gaussian white noise, and the prior distribution of the process noise η_t is expressed as a Dirichlet process.
1. Assume η_t satisfies a normal distribution, η_t ~ N(μ_t, Σ_t), and define the cluster θ_t = (μ_t, Σ_t), where μ_t is the frequency with which word t appears in the corpus. θ_t satisfies the Dirichlet process prior assumption, computed as θ_t ~ G; G ~ DP(α, G_0), where the symbol ~ denotes "is distributed as", DP denotes the Dirichlet process, α is the scale factor, and G_0 denotes the base distribution, G_0 = NIW(μ_0, κ_0, ν_0, Λ_0), with μ_0, κ_0, ν_0, Λ_0 as hyperparameters.
2. Compute the posterior distribution:
p(x_{0:T}, θ_{1:T} | w_{1:T}) = p(x_{0:T} | θ_{1:T}, w_{1:T}) p(θ_{1:T} | w_{1:T})
where p(x_{0:T} | θ_{1:T}, w_{1:T}) can be derived by Kalman filtering and p(θ_{1:T} | w_{1:T}) can be estimated by Gibbs sampling.
A word is drawn from 1, ..., T, with θ_t^{(i)} denoting the result of the i-th draw after excluding word t, i ≥ 2. The MH algorithm within the MCMC sampling algorithm then extracts a candidate cluster from the proposal distribution, and the selection probability ρ of the candidate cluster is computed: if ρ > α, the candidate is accepted, otherwise the previous value is kept. The candidate cluster with the highest selection probability value is taken as θ_t for the subsequent computation.
(3) The minimum mean-squared-error estimate of the cluster θ_t is computed.
3. The calculation results are substituted into the LDS language model and the model is trained with the EM algorithm until the model parameters stabilize; the detailed process is as follows:
E-step: compute the state estimate of the Kalman filter at time t from the parameter values at time t−1, and then compute the covariance matrix of the state estimate. Specifically:
(1) first define the required quantities, where R is the covariance matrix of the observation noise;
(2) compute the data at time t using a BP neural network model. This model consists of two processes, forward propagation of information and backward propagation of error: forward propagation infers the data at time t from the input data before time t, while backward propagation infers the data at time t from the input data after time t.
Backward propagation: for t = T, ..., 1, compute the covariance matrices; letting N_V be the unit diagonal matrix and B(θ_t) = G_0 chol(Σ_t)^T, the covariance matrices can then be derived by the Kalman filter.
Forward propagation: for t = 1, ..., T, the covariance matrices and m_{t|t}(θ_{1:t-1}) can be derived by the Kalman filter.
M-step: compute the expected value using the covariance matrices, maximize the expected value, and solve for the relevant parameters of the LDS model, namely the state-transition matrix A and the observation matrix C.
Update the above parameters and repeat the above two steps until the LDS model stabilizes.
4. The pre-processed corpus is fed into the trained LDS language model, and the implicit vector representation x_t is computed via the one-step update formula of the Kalman filter.
The one-step update formula of the Kalman filter is an existing formula in the prior art, in which K, R, B, Q, P, I are all basic parameters of the Kalman filter; the estimate of the implicit vector representation x_t is computed with the one-step update formula, the implicit vector representation x_t is then obtained from the estimate, and x_t is expressed as the word vector.
The description above is illustrative only. It is to be understood that modifications and variations of the arrangements and details described herein will be apparent to those skilled in the art, and that any obvious substitution made without departing from the inventive concept of the present invention falls within the scope of the present invention. It is therefore intended that the invention be limited only by the scope of the appended claims, and not by the specific details presented by way of the above description and explanation.
Claims (7)
1. A Kalman filtering word vector learning method based on the Dirichlet process, the method comprising:
training and pre-processing a corpus;
generating an LDS language model system and initializing the system parameters;
assuming the process noise follows a normal distribution, defining the cluster θ_t = (μ_t, Σ_t), where μ_t is the frequency with which word t appears in the corpus, and computing the Dirichlet prior distribution of θ_t;
computing the posterior distribution via Kalman filter derivation and Gibbs sampling estimation;
extracting candidate clusters with an MCMC sampling algorithm, computing the selection probability of the candidate clusters, selecting the candidate cluster with the highest probability value as θ_t, and computing the minimum mean-squared-error estimate of that cluster;
substituting the calculation results into the LDS language model and training the model with the EM algorithm until the model parameters stabilize; and
feeding the pre-processed corpus into the trained LDS language model and computing the implicit vector representation via the one-step update formula of the Kalman filter.
2. The method according to claim 1, wherein the LDS language model is as follows:
x_t = A x_{t-1} + η_t
w_t = C x_t + ξ_t
where x_t denotes the implicit vector representation of the word to be learnt, w_t denotes the observation, η_t and ξ_t denote the process noise and the measurement noise of the system, and A and C denote the state-transition matrix and the observation matrix.
3. The method according to claim 2, wherein θ_t satisfies the Dirichlet process prior assumption, computed as θ_t ~ G; G ~ DP(α, G_0), where the symbol ~ denotes "is distributed as", DP denotes the Dirichlet process, α is the scale factor, and G_0 denotes the base distribution, G_0 = NIW(μ_0, κ_0, ν_0, Λ_0), with μ_0, κ_0, ν_0, Λ_0 as hyperparameters.
4. The method according to claim 3, wherein the calculation formula of the posterior distribution is as follows:
p(x_{0:T}, θ_{1:T} | w_{1:T}) = p(x_{0:T} | θ_{1:T}, w_{1:T}) p(θ_{1:T} | w_{1:T})
where p(x_{0:T} | θ_{1:T}, w_{1:T}) can be derived by Kalman filtering and p(θ_{1:T} | w_{1:T}) can be estimated by Gibbs sampling.
5. The method according to claim 4, wherein the detailed process of extracting candidate clusters with the MCMC sampling algorithm, computing the selection probability of the candidate clusters, and selecting the candidate cluster with the highest probability value as θ_t is as follows:
a word is drawn from 1, ..., T, with θ_t^{(i)} denoting the result of the i-th draw after excluding word t, i ≥ 2;
the MH algorithm extracts a candidate cluster from the proposal distribution;
the selection probability ρ of the candidate cluster is computed; if ρ > α, the candidate is accepted as θ_t, otherwise the previous value is kept.
6. The method according to claim 5, wherein the detailed process of substituting the calculation results into the LDS language model and training the model with the EM algorithm until the model parameters stabilize is as follows:
E-step: compute the state estimate of the Kalman filter at time t from the parameter values at time t−1, and then compute the covariance matrix of the state estimate:
(1) first define the required quantities, where R is the covariance matrix of the observation noise;
(2) compute the data at time t using a BP neural network model:
backward propagation: for t = T, ..., 1, compute the covariance matrices; letting N_V be the unit diagonal matrix and B(θ_t) = G_0 chol(Σ_t)^T, the covariance matrices can then be derived by the Kalman filter;
forward propagation: for t = 1, ..., T, the covariance matrices and m_{t|t}(θ_{1:t-1}) can be derived by the Kalman filter;
M-step: compute the expected value using the covariance matrices, maximize the expected value, and solve for the relevant parameters of the LDS model, namely the state-transition matrix A and the observation matrix C;
update the above parameters and repeat the above two steps until the LDS model stabilizes.
7. The method according to claim 6, wherein the detailed process of computing the implicit vector representation via the one-step update formula of the Kalman filter is as follows:
the one-step update formula of the Kalman filter is applied, where K, R, B, Q, P, I are the basic parameters of the Kalman filter; the estimate of the implicit vector representation x_t computed with the one-step update formula yields the implicit vector representation x_t, and x_t is expressed as the word vector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810212606.XA CN108446273B (en) | 2018-03-15 | 2018-03-15 | Kalman filtering word vector learning method based on Dirichlet process
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810212606.XA CN108446273B (en) | 2018-03-15 | 2018-03-15 | Kalman filtering word vector learning method based on Dirichlet process
Publications (2)
Publication Number | Publication Date |
---|---|
CN108446273A true CN108446273A (en) | 2018-08-24 |
CN108446273B CN108446273B (en) | 2021-07-20 |
Family
ID=63195245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810212606.XA Active CN108446273B (en) | 2018-03-15 | 2018-03-15 | Kalman filtering word vector learning method based on Dirichlet process
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108446273B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801073A (en) * | 2018-12-13 | 2019-05-24 | 中国平安财产保险股份有限公司 | Risk subscribers recognition methods, device, computer equipment and storage medium |
CN113269272A (en) * | 2021-04-30 | 2021-08-17 | 清华大学 | Model training method for artificial intelligence text analysis and related equipment |
CN116561814A (en) * | 2023-05-17 | 2023-08-08 | 杭州君方科技有限公司 | Textile chemical fiber supply chain information tamper-proof method and system thereof |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120284560A1 (en) * | 2008-08-15 | 2012-11-08 | Apple Inc. | Read xf instruction for processing vectors |
CN103278170A (en) * | 2013-05-16 | 2013-09-04 | 东南大学 | Mobile robot cascading map building method based on remarkable scenic spot detection |
CN104933199A (en) * | 2015-07-14 | 2015-09-23 | 成都理工大学 | Geological big data fusion system and method based on trusted mechanism |
CN105760365A (en) * | 2016-03-14 | 2016-07-13 | 云南大学 | Probability latent parameter estimation model of image semantic data based on Bayesian algorithm |
US20160342626A1 (en) * | 2013-06-10 | 2016-11-24 | Yahoo! Inc. | Image-based faceted system and method |
CN106547735A (en) * | 2016-10-25 | 2017-03-29 | 复旦大学 | The structure and using method of the dynamic word or word vector based on the context-aware of deep learning |
CN106815297A (en) * | 2016-12-09 | 2017-06-09 | 宁波大学 | A kind of academic resources recommendation service system and method |
CN106971176A (en) * | 2017-05-10 | 2017-07-21 | 河海大学 | Tracking infrared human body target method based on rarefaction representation |
2018
- 2018-03-15 CN CN201810212606.XA patent/CN108446273B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120284560A1 (en) * | 2008-08-15 | 2012-11-08 | Apple Inc. | Read xf instruction for processing vectors |
CN103278170A (en) * | 2013-05-16 | 2013-09-04 | 东南大学 | Mobile robot cascading map building method based on remarkable scenic spot detection |
US20160342626A1 (en) * | 2013-06-10 | 2016-11-24 | Yahoo! Inc. | Image-based faceted system and method |
CN104933199A (en) * | 2015-07-14 | 2015-09-23 | 成都理工大学 | Geological big data fusion system and method based on trusted mechanism |
CN105760365A (en) * | 2016-03-14 | 2016-07-13 | 云南大学 | Probability latent parameter estimation model of image semantic data based on Bayesian algorithm |
CN106547735A (en) * | 2016-10-25 | 2017-03-29 | 复旦大学 | The structure and using method of the dynamic word or word vector based on the context-aware of deep learning |
CN106815297A (en) * | 2016-12-09 | 2017-06-09 | 宁波大学 | A kind of academic resources recommendation service system and method |
CN106971176A (en) * | 2017-05-10 | 2017-07-21 | 河海大学 | Tracking infrared human body target method based on rarefaction representation |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801073A (en) * | 2018-12-13 | 2019-05-24 | 中国平安财产保险股份有限公司 | Risk subscribers recognition methods, device, computer equipment and storage medium |
CN113269272A (en) * | 2021-04-30 | 2021-08-17 | 清华大学 | Model training method for artificial intelligence text analysis and related equipment |
CN113269272B (en) * | 2021-04-30 | 2024-10-15 | 清华大学 | Model training method for artificial intelligent text analysis and related equipment |
CN116561814A (en) * | 2023-05-17 | 2023-08-08 | 杭州君方科技有限公司 | Textile chemical fiber supply chain information tamper-proof method and system thereof |
CN116561814B (en) * | 2023-05-17 | 2023-11-24 | 杭州君方科技有限公司 | Textile chemical fiber supply chain information tamper-proof method and system thereof |
Also Published As
Publication number | Publication date |
---|---|
CN108446273B (en) | 2021-07-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Athiwaratkun et al. | Multimodal word distributions | |
Wang et al. | A general method for robust Bayesian modeling | |
Kondratyev et al. | The market generator | |
CN108595706A (en) | A kind of document semantic representation method, file classification method and device based on theme part of speech similitude | |
CN111899254A (en) | Method for automatically labeling industrial product appearance defect image based on semi-supervised learning | |
CN109581339B (en) | Sonar identification method based on automatic adjustment self-coding network of brainstorming storm | |
Ghasemzadeh et al. | Structural action recognition in body sensor networks: Distributed classification based on string matching | |
Rau et al. | An empirical Bayesian method for estimating biological networks from temporal microarray data | |
CN103530603A (en) | Video abnormality detection method based on causal loop diagram model | |
CN108446273A | Kalman filtering word vector learning method based on Dirichlet process | |
CN106601226A (en) | Phoneme duration prediction modeling method and phoneme duration prediction method | |
CN110851176A (en) | Clone code detection method capable of automatically constructing and utilizing pseudo clone corpus | |
Asadi et al. | Creating discriminative models for time series classification and clustering by HMM ensembles | |
CN114417852B (en) | Theme modeling method based on Wasserstein self-encoder and Gaussian mixture distribution as prior | |
CN114625879A (en) | Short text clustering method based on self-adaptive variational encoder | |
Valera et al. | Infinite factorial dynamical model | |
Auddy et al. | Using Bayesian deep learning to infer planet mass from gaps in protoplanetary disks | |
CN113868374B (en) | Graph convolution network biomedical information extraction method based on multi-head attention mechanism | |
Li et al. | Compositional clustering: Applications to multi-label object recognition and speaker identification | |
CN108665001A (en) | It is a kind of based on depth confidence network across subject Idle state detection method | |
Zhang et al. | Speech emotion recognition method in educational scene based on machine learning | |
Knowles et al. | Message Passing Algorithms for the Dirichlet Diffusion Tree. | |
CN116933799A (en) | Word sense disambiguation method based on BiLSTM and GraphSAGE | |
Hammed et al. | Using speech signal for emotion recognition using hybrid features with SVM classifier | |
Taşdemir et al. | A particle-based approach for topology estimation of gene networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||