CN108446273B - Kalman filtering word vector learning method based on Dirichlet process
- Publication number
- CN108446273B (application CN201810212606.XA)
- Authority
- CN
- China
- Prior art keywords
- calculating
- lds
- model
- kalman filter
- distribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Abstract
A Kalman filtering word vector learning method based on the Dirichlet process, the method comprising: training and preprocessing the corpus and generating an LDS language model system; initializing the system parameters; assuming the process noise follows a normal distribution, defining the cluster θ_t = (μ_t, Σ_t), where μ_t is the frequency of occurrence of word t in the corpus, and computing the Dirichlet prior distribution of θ_t; computing the posterior distribution through Kalman filter derivation and Gibbs sampling estimation; drawing candidate clusters with an MCMC sampling algorithm, computing their selection probabilities, selecting the candidate cluster with the highest probability value as θ_t, and computing the minimum mean square error estimate of the cluster; substituting the result into the LDS language model and training the model with an EM algorithm until the model parameters stabilize; and inputting the preprocessed corpus into the trained LDS language model and computing the implicit vector representation with the Kalman filter one-step update formula.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to a Kalman filtering word vector learning method based on the Dirichlet process.
Background
In Natural Language Processing (NLP) tasks, natural language is processed by machine learning algorithms, which usually requires the language to be put into mathematical form first, since machines recognize only mathematical symbols. A vector is such a mathematical abstraction handed to the machine for processing, and a word vector is a way to represent the words of a language mathematically.
One of the simplest word vector representations is One-hot Representation: a word is represented by a very long vector whose length equals the size of the dictionary; the vector has a single component equal to 1, all other components are 0, and the position of the 1 corresponds to the position of the word in the dictionary. This representation has two disadvantages: (1) it is vulnerable to the curse of dimensionality, especially when used in deep learning algorithms; (2) it cannot characterize the similarity between words.
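For illustration, a minimal Python sketch of One-hot Representation with a toy five-word dictionary (real vocabularies are orders of magnitude larger, which is exactly where both disadvantages appear):

```python
import numpy as np

# Toy dictionary; a realistic |V| is 10^5 to 10^6, which is where the
# dimensionality problem becomes severe.
dictionary = ["cat", "dog", "fish", "bird", "horse"]

def one_hot(word, dictionary):
    """One-hot indicator vector: length |V|, a single 1 at the word's
    dictionary position, 0 everywhere else."""
    v = np.zeros(len(dictionary))
    v[dictionary.index(word)] = 1.0
    return v

print(one_hot("fish", dictionary))  # [0. 0. 1. 0. 0.]
# Every pair of distinct words has dot product 0, so this representation
# carries no similarity information; see disadvantage (2) above.
```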
Another word vector representation is Distributed Representation, first proposed by Hinton in 1986, which overcomes the drawbacks of One-hot Representation. The basic principle: train a mapping of every word in a language to a fixed-length short vector ("short" relative to the "long" vectors of One-hot Representation); all these vectors together form a word vector space in which each vector is a point. By introducing a "distance" on this space, the (lexical, semantic) similarity between words can be judged from the distance between their vectors.
Due to many characteristics of text data, such as near-synonyms and ambiguous words, the feature vectors representing text data are often high-dimensional. High-dimensional feature vectors are not necessarily beneficial for classification; they can lead to feature sparsity, reduce algorithm efficiency, and even degrade classification performance. A good word vector representation method is therefore of real importance.
To derive context-based word vectors, a Gaussian Linear Dynamical System (LDS) is used to describe the language model. The advantage of this approach is that the context of the current word can be fully exploited. The model is learned from a large unlabeled corpus, and the implicit vector representation of each word can be inferred by a Kalman filter. Let the lexicon be V and the corpus length be T. Use w_t to denote the indicator vector of word t, of dimension |V|. If word t is at the i-th position in the dictionary, only the i-th component of w_t is nonzero. Define μ_i as the frequency of occurrence of word t in the corpus.
The model is
x_t = A x_{t-1} + η_t
w_t = C x_t + ξ_t
where x_t is the implicit vector representation of the word to be learned, w_t is the observed value, η_t and ξ_t are the process noise and measurement noise of the system, and A and C are the state transition matrix and the observation matrix. Define μ (μ_i being one element of μ) such that w_t - μ has zero mean, and let E[·] denote the expectation function.
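For concreteness, the following Python sketch simulates this Gaussian LDS forward in time; the dimensions, matrices, and noise covariances are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, T = 4, 10, 50            # latent dim, vocabulary size, corpus length (toy)

A = 0.9 * np.eye(d)            # state transition matrix (assumed stable)
C = rng.normal(size=(V, d))    # observation matrix
Q = 0.1 * np.eye(d)            # process-noise covariance: eta_t ~ N(0, Q)
R = 0.1 * np.eye(V)            # measurement-noise covariance: xi_t ~ N(0, R)

x = np.zeros(d)
states, observations = [], []
for t in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(d), Q)   # x_t = A x_{t-1} + eta_t
    w = C @ x + rng.multivariate_normal(np.zeros(V), R)   # w_t = C x_t + xi_t
    states.append(x)
    observations.append(w)
```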
The lag between the word at the i-th position and the word at the j-th position in a sentence is defined as the number of words separating them. When the lag is k, μ_i and the lag-k second moments can be estimated over a corpus of length T by the corresponding empirical averages, where E[·] denotes the expectation and E_x[·] denotes the expectation taken with respect to x.
The model parameter estimation process is divided into two steps. First, the EM algorithm is initialized by the SSID (subspace identification) method, and the parameters A, C, Q, and R (Q and R being basic parameters of the Kalman filter) are optimized by EM with ASOS (Martens, 2010), an accelerated EM algorithm. Then the estimate based on the first t-1 observations is obtained, with K the Kalman filter gain, and the estimate based on the whole corpus (with I and J further basic parameters of the Kalman filter) is used to compute the implicit vector representation x_t, which is taken as the word vector.
In the prior art it is assumed that the system process noise and measurement noise are zero-mean white Gaussian noise. In general, however, the system noise is uncertain; for a language model in particular, many problems require mining new and unknown information in the corpus, so this assumption is unrealistic, and word vectors obtained under it are inaccurate.
The Dirichlet process is a well-known nonparametric Bayesian model, particularly suitable for all kinds of clustering problems. The advantage of this type of model is that the number of clusters used in building the mixture model need not be specified manually; it is determined autonomously from the model and the data. In natural language processing, many problems require mining new and unknown information in the corpus, information that often lacks prior knowledge, so the advantage of the Dirichlet process can be fully exploited in many NLP applications. The prior art has not applied the Dirichlet process to word vector representation in natural language processing.
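The autonomously determined number of clusters can be made concrete with a small sketch of the Chinese restaurant process, a standard constructive view of the Dirichlet process; the value of α and the item count are illustrative:

```python
import numpy as np

def crp_assignments(n_items, alpha, seed):
    """Chinese-restaurant-process draw of cluster labels: item i joins an
    existing cluster with probability proportional to its size, or opens a
    new cluster with probability proportional to alpha."""
    rng = np.random.default_rng(seed)
    counts, labels = [], []            # current cluster sizes, assigned labels
    for _ in range(n_items):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)           # a new cluster created by the model itself
        else:
            counts[k] += 1
        labels.append(k)
    return labels

print(crp_assignments(20, alpha=1.0, seed=0))
```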
Therefore, it is necessary to develop a new word vector representation method.
Disclosure of Invention
In order to achieve the object of the invention, the invention provides a Kalman filtering word vector learning method based on the Dirichlet process, comprising the following steps:
training and preprocessing the corpus,
generating an LDS language model and initializing the system parameters,
assuming the process noise follows a normal distribution, defining the cluster θ_t = (μ_t, Σ_t), where μ_t is the frequency of occurrence of word t in the corpus, and computing the Dirichlet prior distribution of θ_t,
computing the posterior distribution through Kalman filter derivation and Gibbs sampling estimation,
drawing candidate clusters with an MCMC sampling algorithm, computing their selection probabilities, selecting the candidate cluster with the highest probability value as θ_t, and computing the minimum mean square error estimate of the cluster,
substituting the result into the LDS language model and training the model with the EM algorithm until the model parameters stabilize,
inputting the preprocessed corpus into the trained LDS language model and computing the implicit vector representation with the Kalman filter one-step update formula.
The LDS language model is:
x_t = A x_{t-1} + η_t
w_t = C x_t + ξ_t
where x_t is the implicit vector representation of the word to be learned, w_t is the observed value, η_t and ξ_t are the process noise and measurement noise of the system, and A and C are the state transition matrix and the observation matrix.
Here θ_t satisfies the Dirichlet process prior assumption: θ_t ~ G; G ~ DP(α, G_0),
where ~ denotes "is distributed as", DP denotes the Dirichlet process, α is a scale factor, G_0 denotes the base distribution, G_0 = NIW(μ_0, κ_0, ν_0, Λ_0), and μ_0, κ_0, ν_0, Λ_0 are hyperparameters (NIW: normal-inverse-Wishart distribution).
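A minimal sketch of drawing one cluster parameter (μ, Σ) from such a NIW base distribution with SciPy; the hyperparameter values are illustrative assumptions, not taken from the patent:

```python
import numpy as np
from scipy.stats import invwishart, multivariate_normal

def sample_niw(mu0, kappa0, nu0, Lambda0, seed=None):
    """Draw (mu, Sigma) from a normal-inverse-Wishart base distribution G0:
    Sigma ~ IW(nu0, Lambda0), then mu | Sigma ~ N(mu0, Sigma / kappa0)."""
    Sigma = invwishart.rvs(df=nu0, scale=Lambda0, random_state=seed)
    mu = multivariate_normal.rvs(mean=mu0, cov=Sigma / kappa0, random_state=seed)
    return mu, Sigma

# Illustrative hyperparameters for a 3-dimensional cluster parameter.
d = 3
mu, Sigma = sample_niw(np.zeros(d), kappa0=1.0, nu0=d + 2, Lambda0=np.eye(d), seed=0)
print(mu, Sigma, sep="\n")
```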
The posterior distribution is calculated as:
p(x_{0:T}, θ_{1:T} | w_{1:T}) = p(x_{0:T} | θ_{1:T}, w_{1:T}) · p(θ_{1:T} | w_{1:T})
where p(x_{0:T} | θ_{1:T}, w_{1:T}) can be derived by Kalman filtering and p(θ_{1:T} | w_{1:T}) can be estimated by Gibbs sampling.
The specific procedure for drawing candidate clusters with the MCMC sampling algorithm, calculating their selection probabilities, and selecting the candidate cluster with the highest probability value as θ_t is as follows:
drawing, from the words 1, …, T, the extraction result of the i-th sampling pass with word t removed, i ≥ 2;
the MH algorithm extracts the candidate clusters according to its acceptance formula.
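The patent's exact acceptance formula is not reproduced in the text, so the sketch below shows only the generic Metropolis-Hastings step that such candidate-cluster extraction relies on; the scalar target and the symmetric random-walk proposal are placeholders:

```python
import numpy as np

def mh_step(theta, log_target, propose, rng):
    """One Metropolis-Hastings step with a symmetric proposal: accept the
    candidate with probability min(1, target(candidate) / target(current))."""
    candidate = propose(theta, rng)
    if np.log(rng.uniform()) < log_target(candidate) - log_target(theta):
        return candidate          # candidate cluster accepted
    return theta                  # rejected: keep the current cluster

# Placeholder target (standard normal) and random-walk proposal.
rng = np.random.default_rng(0)
log_target = lambda th: -0.5 * th ** 2
propose = lambda th, rng: th + rng.normal(scale=0.5)

theta, draws = 0.0, []
for _ in range(1000):
    theta = mh_step(theta, log_target, propose, rng)
    draws.append(theta)
```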
The specific process of substituting the calculated result into the LDS language model and training the model through the EM algorithm until the model parameters stabilize is as follows:
E step: calculate the state estimate of the Kalman filter at time t from the parameter values at time t-1, then calculate the covariance matrix of the state estimate:
(1) first make the following definitions, where R is the covariance matrix of the observation noise;
(2) calculate the data at time t with a BP neural network model, where N_V is a unit diagonal matrix and B(θ_t) = G_0 · chol(Σ_t)^T; then
forward propagation: t = 1, …, T.
M step: calculate an expected value using the covariance matrix, maximize the expected value, and solve for the relevant parameters of the LDS model, namely the state transition matrix A and the observation matrix C;
then update the parameters and repeat the two steps until the LDS model is stable.
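Below is a self-contained toy version of this E/M alternation for a scalar LDS with known noise variances, sketched under the assumption that the E step is the standard Kalman filter plus RTS smoother; the patent's model is multivariate and also involves the Dirichlet-process clusters, so this only illustrates the structure of the training loop:

```python
import numpy as np

def em_lds_scalar(w, q=0.1, r=0.1, n_iter=50):
    """Toy EM for a scalar LDS  x_t = a x_{t-1} + eta_t,  w_t = c x_t + xi_t,
    with known noise variances q and r.  E step: Kalman filter + RTS smoother;
    M step: closed-form updates of a and c."""
    T = len(w)
    a, c = 0.5, 1.0                              # initial guesses
    for _ in range(n_iter):
        # E step, forward pass: Kalman filter.
        mu_f, P_f = np.zeros(T), np.zeros(T)
        mu_pred, P_pred = np.zeros(T), np.zeros(T)
        mp, Pp = 0.0, 1.0                        # prior on x_1
        for t in range(T):
            if t > 0:
                mp, Pp = a * mu_f[t-1], a * a * P_f[t-1] + q
            mu_pred[t], P_pred[t] = mp, Pp
            k = Pp * c / (c * c * Pp + r)        # Kalman gain
            mu_f[t] = mp + k * (w[t] - c * mp)
            P_f[t] = (1.0 - k * c) * Pp
        # E step, backward pass: RTS smoother.
        mu_s, P_s = mu_f.copy(), P_f.copy()
        P_cross = np.zeros(T)                    # Cov(x_t, x_{t-1} | w_{1:T})
        for t in range(T - 2, -1, -1):
            J = P_f[t] * a / P_pred[t+1]
            mu_s[t] = mu_f[t] + J * (mu_s[t+1] - mu_pred[t+1])
            P_s[t] = P_f[t] + J * J * (P_s[t+1] - P_pred[t+1])
            P_cross[t+1] = J * P_s[t+1]
        Ex2 = P_s + mu_s ** 2                     # E[x_t^2]
        Exx = P_cross[1:] + mu_s[1:] * mu_s[:-1]  # E[x_t x_{t-1}]
        # M step: maximize the expected complete-data log-likelihood.
        a = Exx.sum() / Ex2[:-1].sum()
        c = (w * mu_s).sum() / Ex2.sum()
    return a, c

# Toy check: simulate with a = 0.8, c = 1.0 and recover the parameters.
rng = np.random.default_rng(0)
x, ws = 0.0, []
for _ in range(500):
    x = 0.8 * x + rng.normal(scale=np.sqrt(0.1))
    ws.append(1.0 * x + rng.normal(scale=np.sqrt(0.1)))
print(em_lds_scalar(np.array(ws)))               # approximately (0.8, 1.0)
```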
The specific process of calculating the implicit vector representation through the Kalman filter one-step update formula is as follows:
the Kalman filter one-step update formula is the standard one-step recursion, in which K, R, B, Q, P, and I are basic parameters of the Kalman filter; the estimate calculated by the one-step update formula gives the implicit vector representation x_t, and x_t is taken as the word vector.
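A sketch of a textbook Kalman filter predict-update cycle, which is what the "one-step update formula" refers to here; the patent's own symbols B, I, and J are not fully specified in the text, so only the standard form is shown:

```python
import numpy as np

def kalman_one_step(x_prev, P_prev, w, A, C, Q, R):
    """Textbook Kalman predict-update cycle for x_t = A x_{t-1} + eta_t,
    w_t = C x_t + xi_t; returns the posterior state estimate (the word's
    implicit vector representation) and its covariance."""
    # Predict step: propagate the previous estimate through the state equation.
    x_pred = A @ x_prev
    P_pred = A @ P_prev @ A.T + Q
    # Update step: correct the prediction with the new observation w_t.
    S = C @ P_pred @ C.T + R                      # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)           # Kalman gain
    x_est = x_pred + K @ (w - C @ x_pred)         # implicit vector estimate x_t
    P_est = (np.eye(len(x_prev)) - K @ C) @ P_pred
    return x_est, P_est
```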
The invention can fully exploit unknown information in the corpus and learn better word vector representations; word vectors obtained with the model of the invention express more accurately the meaning a word represents and its latent relations to other words, such as near-synonyms, synonyms, and antonyms.
The features and advantages of the present invention will become apparent by reference to the following drawings and detailed description of specific embodiments of the invention.
Drawings
FIG. 1 shows a flow chart of the word vector learning method of the present invention.
Detailed Description
The invention provides a Kalman filtering word vector learning method based on the Dirichlet process. The process noise of the system is assumed to follow a Dirichlet process prior, so the corresponding posterior distribution can be computed; sampling with an MCMC (Markov chain Monte Carlo) algorithm yields the candidate cluster with the highest selection probability; this cluster is substituted into the LDS (linear dynamical system) model to train the model parameters; finally, the preprocessed corpus is input into the trained language model, and the Kalman filter one-step update formula is used to compute the estimate of the implicit vector representation.
The technical solution of the present invention will be described in detail with reference to fig. 1.
First, the corpus is trained and preprocessed, including word segmentation and dictionary generation; these are well-known preprocessing steps for word vector learning in the field of natural language processing and are not described further here.
Then, the LDS language model system of the invention is generated, and the system parameters are initialized.
The LDS language model of the invention is:
x_t = A x_{t-1} + η_t
w_t = C x_t + ξ_t
where x_t is the implicit vector representation of the word to be learned and w_t is the observed value, taken as the one-hot representation of the word; η_t and ξ_t are the process noise and measurement noise of the system, and A and C are the state transition matrix and the observation matrix. The measurement noise ξ_t is set to zero-mean white Gaussian noise, while the process noise η_t is modeled by a Dirichlet process.
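The distinguishing assumption here, process noise η_t drawn from a Dirichlet process mixture of Gaussians rather than a single Gaussian, can be sketched with a truncated stick-breaking construction; the truncation level and base-distribution hyperparameters are illustrative assumptions:

```python
import numpy as np
from scipy.stats import invwishart

def sample_dp_noise(T, d, alpha, seed):
    """Draw process-noise terms eta_1..eta_T from a (truncated) Dirichlet
    process mixture of Gaussians via stick-breaking: a sketch of
    eta_t ~ N(mu_t, Sigma_t) with (mu_t, Sigma_t) ~ G, G ~ DP(alpha, G0)."""
    rng = np.random.default_rng(seed)
    K = 20                                   # truncation level (assumed)
    # Stick-breaking weights: pi_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha).
    v = rng.beta(1.0, alpha, size=K)
    pi = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    pi /= pi.sum()                           # renormalize after truncation
    # Cluster parameters from a simple NIW-style base distribution G0.
    Sigmas = [invwishart.rvs(df=d + 2, scale=np.eye(d), random_state=rng)
              for _ in range(K)]
    mus = [rng.multivariate_normal(np.zeros(d), S) for S in Sigmas]
    z = rng.choice(K, size=T, p=pi)          # cluster assignment of each eta_t
    return np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])

eta = sample_dp_noise(T=100, d=4, alpha=1.0, seed=0)
```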
1. Assume η_t follows a normal distribution, η_t ~ N(μ_t, Σ_t), and define the cluster θ_t = (μ_t, Σ_t), where μ_t is the frequency of occurrence of word t in the corpus. θ_t satisfies the Dirichlet process prior assumption: θ_t ~ G; G ~ DP(α, G_0), where ~ denotes "is distributed as", DP denotes the Dirichlet process, α is a scale factor, G_0 denotes the base distribution, G_0 = NIW(μ_0, κ_0, ν_0, Λ_0), and μ_0, κ_0, ν_0, Λ_0 are hyperparameters.
2. Calculate the posterior distribution:
p(x_{0:T}, θ_{1:T} | w_{1:T}) = p(x_{0:T} | θ_{1:T}, w_{1:T}) · p(θ_{1:T} | w_{1:T})
where p(x_{0:T} | θ_{1:T}, w_{1:T}) can be derived by Kalman filtering and p(θ_{1:T} | w_{1:T}) can be estimated by Gibbs sampling.
Extracting from 1, …, T wordsThe extraction result after the word t is removed in the i times of sampling is shown, and i is more than or equal to 2;
then, an MH algorithm in the MCMC sampling algorithm is used for extracting alternative clusters according to the following formula:
Selecting the candidate cluster with the highest probability value as thetatAnd performing subsequent calculation.
(3) Compute the minimum mean square error estimate of the cluster θ_t.
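Since the minimum mean square error estimate coincides with the posterior mean, it is approximated in practice by averaging retained MCMC draws; the snippet below illustrates this generic identity, with synthetic draws standing in for samples of a component of θ_t:

```python
import numpy as np

# Synthetic stand-ins for retained MCMC draws of a scalar component of theta_t.
rng = np.random.default_rng(0)
draws = rng.normal(loc=2.0, scale=0.3, size=5000)

theta_mmse = draws.mean()   # MMSE estimate = posterior mean (approximated)
print(theta_mmse)           # close to 2.0
```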
3. Substitute the calculated result into the LDS language model and train the model through the EM algorithm until the model parameters stabilize. The specific process is as follows:
E step: calculate the state estimate of the Kalman filter at time t from the parameter values at time t-1, then calculate the covariance matrix of the state estimate. Specifically:
(1) first make the following definitions, where R is the covariance matrix of the observation noise.
(2) Calculate the data at time t with a BP neural network model. The model consists of two passes, forward propagation of information and backward propagation of errors: the forward pass infers the data at time t from the data before time t, and the backward pass infers the data at time t from the data after time t. Here N_V is a unit diagonal matrix and B(θ_t) = G_0 · chol(Σ_t)^T; then
forward propagation: t = 1, …, T.
M step: calculate an expected value using the covariance matrix, maximize the expected value, and solve for the relevant parameters of the LDS model, namely the state transition matrix A and the observation matrix C.
Update the parameters and repeat the two steps until the LDS model is stable.
4. Input the preprocessed corpus into the trained LDS language model and compute the implicit vector representation x_t with the Kalman filter one-step update formula. This one-step update formula is the standard formula of the prior art, in which K, R, B, Q, P, and I are basic parameters of the Kalman filter; the estimate computed by the one-step update formula gives the implicit vector representation x_t, and x_t is taken as the word vector.
The foregoing is illustrative only. Modifications and variations of the arrangements and details described herein will be apparent to those skilled in the art, and any obvious substitution that does not depart from the inventive concept falls within the scope of the present invention. The scope of the invention is therefore intended to be limited only by the appended claims and not by the specific details presented by way of the foregoing description and explanation.
Claims (6)
1. A Kalman filtering word vector learning method based on the Dirichlet process, the method comprising:
the corpus is trained and preprocessed,
generating an LDS language model, initializing system parameters,
the LDS language model being:
x_t = A x_{t-1} + η_t
w_t = C x_t + ξ_t
where x_t is the implicit vector representation of the word to be learned, w_t is the observed value, η_t and ξ_t are the process noise and measurement noise of the system, and A and C are the state transition matrix and the observation matrix;
assuming the process noise follows a normal distribution, defining the cluster θ_t = (μ_t, Σ_t), where μ_t is the frequency of occurrence of word t in the corpus and Σ_t is the covariance matrix of word t in the corpus, and computing the Dirichlet prior distribution of θ_t,
calculating the posterior distribution through Kalman filter derivation and Gibbs sampling estimation,
drawing candidate clusters with an MCMC sampling algorithm, calculating their selection probabilities, selecting the candidate cluster with the highest probability value as θ_t, and calculating the minimum mean square error estimate of the cluster,
substituting the calculated result into an LDS language model, training the model through an EM algorithm to stabilize the model parameters,
inputting the preprocessed corpus into a trained LDS language model, and calculating the implicit vector expression by using a one-step updating formula of a Kalman filter.
2. The method of claim 1, wherein
θ_t satisfies the Dirichlet process prior assumption: θ_t ~ G; G ~ DP(α, G_0),
where ~ denotes "is distributed as", DP denotes the Dirichlet process, α is a scale factor, G_0 denotes the base distribution, G_0 = NIW(μ_0, κ_0, ν_0, Λ_0), μ_0, κ_0, ν_0, Λ_0 are hyperparameters, and NIW denotes the normal-inverse-Wishart distribution.
3. The method of claim 2, wherein the posterior distribution is calculated as follows:
p(x_{0:T}, θ_{1:T} | w_{1:T}) = p(x_{0:T} | θ_{1:T}, w_{1:T}) · p(θ_{1:T} | w_{1:T})
where p(x_{0:T} | θ_{1:T}, w_{1:T}) can be derived by Kalman filtering and p(θ_{1:T} | w_{1:T}) can be estimated by Gibbs sampling.
4. The method as claimed in claim 3, wherein the specific procedure for drawing candidate clusters with the MCMC sampling algorithm, calculating their selection probabilities, and selecting the candidate cluster with the highest probability value as θ_t is as follows:
drawing, from the words 1, …, T, the extraction result of the i-th sampling pass with word t removed, i ≥ 2;
the MH algorithm extracts the candidate clusters according to its acceptance formula.
5. The method as claimed in claim 4, wherein the specific process of substituting the calculated result into the LDS language model and training the model through the EM algorithm until the model parameters stabilize is as follows:
E step: calculating the state estimate of the Kalman filter at time t from the parameter values at time t-1, then calculating the covariance matrix of the state estimate:
(1) first making the following definitions, wherein R is the covariance matrix of the observation noise;
(2) calculating the data at time t with a BP neural network model, wherein N_V is a unit diagonal matrix, B(θ_t) = G_0 · chol(Σ_t)^T, and chol denotes the Cholesky decomposition; then
forward propagation: t = 1, …, T;
M step: calculating an expected value using the covariance matrix, maximizing the expected value, and solving for the relevant parameters of the LDS model, namely the state transition matrix A and the observation matrix C;
and updating the parameters and repeating the two steps until the LDS language model is stable.
6. The method of claim 5, wherein the implicit vector representation is calculated by the Kalman filter one-step update formula as follows:
the Kalman filter one-step update formula is the standard one-step recursion, wherein K, R, B, Q, P, and I are basic parameters of the Kalman filter; the estimate calculated by the one-step update formula gives the implicit vector representation x_t, and x_t is taken as the word vector.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201810212606.XA CN108446273B (en) | 2018-03-15 | 2018-03-15 | Kalman filtering word vector learning method based on Dirichlet process
Publications (2)
Publication Number | Publication Date |
---|---|
CN108446273A CN108446273A (en) | 2018-08-24 |
CN108446273B true CN108446273B (en) | 2021-07-20 |
Family
ID=63195245
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103278170A (en) * | 2013-05-16 | 2013-09-04 | 东南大学 | Mobile robot cascading map building method based on remarkable scenic spot detection |
CN104933199A (en) * | 2015-07-14 | 2015-09-23 | 成都理工大学 | Geological big data fusion system and method based on trusted mechanism |
CN105760365A (en) * | 2016-03-14 | 2016-07-13 | 云南大学 | Probability latent parameter estimation model of image semantic data based on Bayesian algorithm |
CN106547735A (en) * | 2016-10-25 | 2017-03-29 | 复旦大学 | The structure and using method of the dynamic word or word vector based on the context-aware of deep learning |
CN106815297A (en) * | 2016-12-09 | 2017-06-09 | 宁波大学 | A kind of academic resources recommendation service system and method |
CN106971176A (en) * | 2017-05-10 | 2017-07-21 | 河海大学 | Tracking infrared human body target method based on rarefaction representation |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |