CN103198152A - Supervised online topic model learning method based on sparse implicit characteristic expression - Google Patents

Supervised online topic model learning method based on sparse implicit characteristic expression

Info

Publication number
CN103198152A
CN103198152A
Authority
CN
China
Prior art keywords
document
training set
iteration
subset
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310146127XA
Other languages
Chinese (zh)
Other versions
CN103198152B (en)
Inventor
朱军 (Jun Zhu)
张傲南 (Aonan Zhang)
张钹 (Bo Zhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Real AI Technology Co Ltd
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201310146127.XA priority Critical patent/CN103198152B/en
Publication of CN103198152A publication Critical patent/CN103198152A/en
Application granted granted Critical
Publication of CN103198152B publication Critical patent/CN103198152B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a supervised online topic model learning method based on sparse latent feature representation, and relates to the fields of data mining and machine learning. The method comprises the following steps: using online learning, extract sparse-representation-based latent features for each document in a training set and for each word in those documents, obtaining groups of feature vectors; train a classifier from the feature vectors of the training set and the class information of the training documents, obtaining the classifier's weight vectors, where each class of classifier weight vector corresponds to a document class in the training set; extract the feature vectors of all documents to be classified; and compute the inner product of the feature vector of each document to be classified with the classifier weight vector of every class, taking the training-set class corresponding to the maximum inner product as the classification result. By adopting online learning, the method greatly increases model training speed, and by exploiting supervision information it improves classification accuracy.

Description

Supervised online topic model learning method based on sparse latent feature representation
Technical field
The present invention relates to the technical fields of data mining and machine learning, and in particular to a supervised online topic model learning method based on sparse latent feature representation.
Background technology
Latent topic models have shown remarkable advantages in mining the semantic information of documents and in handling complex document structure, and in recent years using latent topic models to efficiently mine the structure of large-scale document collections and streaming documents has become a research focus in this field.
Existing methods that use latent topic models to mine the semantic structure of documents are based on probabilistic models. Among the many models, representative ones are Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Indexing (PLSI), and Latent Dirichlet Allocation (LDA). The main problems to be solved when using latent topic models to mine the semantic structure of large-scale collections are: the number of documents is very large; the form of document input varies, e.g., streaming input; the speed of latent feature learning in the topic model needs to be improved; the expressiveness and sparsity of the latent features need to be improved; and supervision information should be exploited to improve the accuracy of the topic model.
In recent years there has been much work on using topic models to handle large-scale and streaming documents. In 2010, M. Hoffman et al. introduced online learning into Latent Dirichlet Allocation in "Online learning for latent Dirichlet allocation"; their method imports large document collections in batches to train the LDA model and learns the dictionary with online learning, so it handles large-scale and streaming input well. In 2012, D. Mimno et al. proposed in "Sparse stochastic inference for latent Dirichlet allocation" a method that introduces Gibbs sampling into online variational inference to train LDA, further improving the efficiency of online learning. The problem with both methods, however, is that they adopt probabilistic models, so the normalization constraints of probabilistic models prevent them from effectively controlling the sparsity of the latent features in the model. Moreover, neither method addresses how to handle supervision information that may exist in the documents.
To improve the expressiveness and sparsity of latent features in topic models, Jun Zhu et al. proposed the sparse topical coding model in 2011. This model innovatively introduces sparse coding into topic models: it adopts non-probabilistic modeling to remove the normalization constraint on latent feature representations, and introduces a sparsity penalty term to control the sparsity of the latent feature representation. Experiments show that this model trains faster than LDA based on probabilistic inference and controls the sparsity of latent representations better; at the same time, it can use supervision information to improve classification accuracy. However, because it adopts batch learning, this model cannot handle large-scale document collections or streaming input.
The latest achievements in the above fields provide a solid foundation for a supervised online topic model learning method based on sparse latent feature representation. However, these techniques still cannot effectively handle large-scale and streaming document input while effectively controlling the sparsity of the latent features in the topic model.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is: how to provide a supervised online topic model learning method based on sparse latent feature representation that improves the training speed of the topic model on large-scale document data sets and can handle streaming document input, while the topic model effectively controls the sparsity of the word latent features in documents and exploits supervision information to improve accuracy.
(2) Technical solution
To solve the above technical problem, the invention provides a supervised online topic model learning method based on sparse latent feature representation, comprising the following steps:
S1. Using online learning, extract sparse-representation-based latent features for each document in the training set and for each word in those documents, obtaining groups of feature vectors; the feature vectors cover all documents of every class in the training set and all words in those documents;
The online learning and feature extraction in step S1 comprise:
S11. Select a fixed-size subset of the training set, taking documents in order of their index from front to back, and minimize the loss function corresponding to this subset; the subset loss function depends on the latent features of each word in the documents contained in the subset;
The step of selecting a fixed-size subset of the training set in step S11 comprises: in each round of iteration, select a subset of size M in order; in general, the subset chosen in round i consists of the documents numbered in [((i-1)*M+1)%D, (i*M)%D], where D is the number of documents in the training set and M is an integer in [1, D] (see the index sketch after this list of steps);
S12. For the subset loss function from step S11, cyclically optimize the latent feature of each word in each document until the loss value of the subset converges, and finally update the document latent features.
S2. Update the dictionary according to the feature vectors obtained in step S1 and the classes of the documents in the training set;
The dictionary update in step S2 comprises:
S21. Compute the gradient, with respect to the dictionary vectors, of the sum of the loss functions of the documents input in this iteration;
S22. Using the gradient from step S21, perform a single stochastic gradient descent step on the dictionary vectors, then project the dictionary vectors onto the L1 ball.
S3. Train a classifier with the feature vectors obtained in step S1, obtaining the classifier's weight vectors; each class of classifier weight vector corresponds to a document class in the training set;
The training of the multi-class classifier in step S3 comprises: optimize the classifier's loss function by gradient descent, where the loss function is the sum of the classifier's losses on the documents input in this iteration.
S4. Performing steps S1, S2 and S3 once is called one round of iteration; if the round count equals a given constant, stop iterating and go to step S5, otherwise return to step S1 and increase the round count by 1 (the round count is initialized to 0);
S5. Extract features for all documents to be classified, obtaining their feature vectors;
S6. Compute the inner product of the feature vector of each document to be classified with the classifier weight vector of every class from step S3;
S7. Take the training-set class corresponding to the maximum inner product from step S6 as the classification result for the document to be classified.
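To make the round-robin subset selection of step S11 concrete, here is a minimal Python sketch (illustrative only; the function name and the 0-based indexing are assumptions, since the patent numbers documents from 1) of computing the document indices chosen in round i with wraparound:

```python
def minibatch_indices(i, M, D):
    """Return the 0-based document indices selected in round i (i >= 1).

    The patent numbers documents 1..D and picks [((i-1)*M+1)%D, (i*M)%D],
    wrapping back to the first document when the range passes the end.
    """
    start = (i - 1) * M  # 0-based position of the first document of the round
    return [(start + k) % D for k in range(M)]

# Example: D = 10 documents, minibatches of M = 4.
# Round 1 -> [0, 1, 2, 3]; round 3 -> [8, 9, 0, 1] (wraps around).
for i in (1, 2, 3):
    print(i, minibatch_indices(i, M=4, D=10))
```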
(3) Beneficial effects
Through non-probabilistic modeling, the present invention relaxes the normalization constraint of probabilistic models and introduces a sparsity penalty term to effectively control the sparsity of the word latent feature representations. At the same time, the invention adopts online learning, which improves both document classification accuracy and model training speed. In addition, the invention can effectively exploit supervision information to further improve classification accuracy.
Description of drawings
Fig. 1 is a flowchart of the supervised online topic model learning method based on sparse latent feature representation proposed by the present invention;
Fig. 2 is a flowchart of an embodiment of the supervised online topic model learning method based on sparse latent feature representation proposed by the present invention.
Embodiment
The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the invention, not to limit its scope. The supervised online topic model learning method based on sparse latent feature representation proposed by the invention is described in detail with reference to the embodiment as follows.
As shown in Fig. 2, the present embodiment comprises the following steps:
Step 1. The training set contains D documents in total. Using online learning, select M documents from the D training documents and extract sparse-representation-based latent features for each of these documents and for each word in them, obtaining an M×K feature matrix T, where each row of T is the feature vector of one document and K is the dimension of the latent features; one or more rows of the matrix T, taken as one class of the training set, represent a class in the training set.
Step 2. Update the dictionary β according to the feature vectors obtained in step 1 and the class information of the documents in the training set.
Step 3. Train a multi-class support vector machine with the feature vectors obtained in step 1, obtaining the weight matrix W of the support vector machine, where each row of W is the weight vector of one class of the support vector machine, corresponding to the respective document class in the training set.
Step 4. Performing steps 1, 2 and 3 once is called one round of iteration, with the round count initialized to 0. Judge whether the round count has reached the given constant; if so, stop iterating and go to step 5, otherwise return to step 1 and increase the round count by 1.
Step 5. Extract features for the documents to be classified in the test set, obtaining the feature vector y of each such document;
Step 6. Compute the inner product of the feature vector y from step 5 with the weight vector of each class i of the multi-class support vector machine from step 3, with value $P_i = W_i \cdot y$, where $W_i$ is the i-th row of W and the inner product $W_i \cdot y$ is defined as

$$W_i \cdot y = W_{i1} y_1 + W_{i2} y_2 + \cdots + W_{iK} y_K \qquad (1)$$

where $W_{ij}$ (j = 1, 2, ..., K) is the j-th component of the vector $W_i$ and $y_j$ (j = 1, 2, ..., K) is the j-th component of y.
Step 7. From the inner product of each class obtained in step 6, take the training-set class i corresponding to the maximum inner product P as the classification result of the document, where P is defined as

$$P = \max_j P_j \qquad (2)$$

and i is defined as

$$i = \arg\max_j P_j \qquad (3)$$
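As a concrete illustration of the prediction step in equations (1)-(3), the following NumPy sketch (illustrative; all names are assumptions) computes the inner products and the argmax class:

```python
import numpy as np

def classify(W, y):
    """Predict the class of one document.

    W : (num_classes, K) SVM weight matrix, one row per class (step 3).
    y : (K,) latent feature vector of the document to classify (step 5).
    Returns (i, P): the argmax class index and the maximum inner product,
    implementing equations (1)-(3).
    """
    P_all = W @ y              # P_j = W_j . y for every class j, eq. (1)
    i = int(np.argmax(P_all))  # eq. (3)
    return i, float(P_all[i])  # eq. (2)

# Example with 3 classes and K = 4 latent dimensions.
rng = np.random.default_rng(0)
W = rng.random((3, 4))
y = rng.random(4)
print(classify(W, y))
```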
The online learning and feature extraction in step 1 specifically comprise:
(a) Select a subset of M documents of the training set, taking documents in order of their index from front to back; for example, in round i select the documents in the range [((i-1)*M+1)%D, (i*M)%D], wrapping back to the 1st document when the range passes the end, and minimize the loss function corresponding to this subset. Let $w_n$ denote the number of occurrences of word n in a document, θ the latent feature of the document, $s_n$ the latent feature of word n in the document, and $\beta_n$ the vector corresponding to word n in the dictionary; ε is a compensation term that prevents degenerate cases, with value 0.001.
The loss function for one document is defined as

$$L = -\log \mathrm{Poisson} + \mathrm{tr}(S^{T} \Lambda S) + \rho \, \lVert S \rVert_1 \qquad (4)$$

where tr(X) denotes the trace of matrix X; Λ = (a-b)·I + b·E, I is the K×K identity matrix, E is the K×K matrix whose elements are all 1, a = γ/2 + γ²/(4λ + 2γN), and b = γ/2 - a; S is the matrix whose n-th column is the vector $s_n$, $S^T$ is the transpose of S, and N is the total number of words in the document. λ, γ, ρ are hyperparameters; in the present embodiment, λ = 0.1, γ = 0.1, ρ = 0.01. In addition, in formula (4):

$$-\log \mathrm{Poisson} = -\log \mathrm{Poisson}_1 - \log \mathrm{Poisson}_2 - \cdots - \log \mathrm{Poisson}_N \qquad (5)$$

$$\lVert S \rVert_1 = \lVert s_1 \rVert_1 + \lVert s_2 \rVert_1 + \cdots + \lVert s_N \rVert_1 \qquad (6)$$

where $\lVert s_n \rVert_1$ (n = 1, 2, ..., N) denotes the L1 norm of $s_n$.
The Poisson loss function for word n in a document is defined as

$$-\log \mathrm{Poisson}_n = s_n \cdot \beta_n - w_n \log ( s_n \cdot \beta_n + \epsilon ) \qquad (7)$$

where $s_n \cdot \beta_n$ (n = 1, 2, ..., N) denotes the inner product of the vectors $s_n$ and $\beta_n$.
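Putting equations (4)-(7) together, the per-document loss can be evaluated as in the following NumPy sketch (illustrative; the array shapes and names are assumptions, and equation (7) is used with the sign convention above):

```python
import numpy as np

def document_loss(S, beta, w, lam=0.1, gamma=0.1, rho=0.01, eps=1e-3):
    """Per-document loss of eqs. (4)-(7).

    S    : (K, N) latent word features, column n is s_n.
    beta : (K, N) dictionary vectors for the words of this document,
           column n is beta_n.
    w    : (N,) word counts w_n.
    """
    K, N = S.shape
    a = gamma / 2 + gamma**2 / (4 * lam + 2 * gamma * N)
    b = gamma / 2 - a
    # Poisson reconstruction loss, eq. (7) summed over words as in eq. (5).
    mu = np.sum(S * beta, axis=0)                  # s_n . beta_n per word
    poisson = np.sum(mu - w * np.log(mu + eps))
    # Quadratic coupling term tr(S^T Lambda S), Lambda = (a-b)I + bE.
    Lam = (a - b) * np.eye(K) + b * np.ones((K, K))
    quad = np.trace(S.T @ Lam @ S)
    # L1 sparsity penalty, eq. (6).
    l1 = rho * np.abs(S).sum()
    return poisson + quad + l1                     # eq. (4)
```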
(b) Sum the loss functions of the M documents imported in this round to obtain the overall loss function of the round. Cyclically optimize the value of each dimension of each word latent feature vector $s_n$ until the loss value converges, then update θ. The specific procedure is as follows:
For each document, optimize the loss function in formula (4) by cyclically optimizing each $s_{ni}$ (i = 1, 2, ..., K) until convergence, where $s_{ni}$ denotes the i-th component of $s_n$. Each update can be obtained by solving the quadratic equation in one unknown

$$2 a \beta_{ni} s_{ni}^{2} + c \, \beta_{ni} s_{ni} + c \sum_{i' \neq i} \beta_{ni'} s_{ni'} - w_n \beta_{ni} = 0 \qquad (8)$$

where $\beta_{ni}$ denotes the i-th component of $\beta_n$ (i = 1, 2, ..., K) and $c = \beta_{ni} + \rho + 2b \sum_{i' \neq i} s_{ni'}$.
Solving yields the closed-form solution for $s_{ni}$, denoted ν. Then truncate $s_{ni}$ at zero:

$$s_{ni} = \max(0, \nu) \qquad (9)$$

The formula for updating θ is

$$\theta = \gamma \, (s_1 + s_2 + \cdots + s_N) \, / \, (\gamma N + \lambda) \qquad (10)$$
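The coordinate-descent update of equations (8)-(10) can be sketched as follows (illustrative; the patent does not state which root of the quadratic is taken, so keeping the larger real root and skipping components with $\beta_{ni} = 0$ are assumptions):

```python
import numpy as np

def update_word_feature(s_n, beta_n, w_n, a, b, rho):
    """One cyclic pass over the components of s_n, eqs. (8)-(9).

    s_n, beta_n : (K,) word latent feature and its dictionary vector.
    w_n         : count of word n in the document.
    """
    K = s_n.shape[0]
    for i in range(K):
        if beta_n[i] == 0:
            continue  # eq. (8) degenerates; leave s_ni unchanged
        rest_s = s_n.sum() - s_n[i]                  # sum_{i'!=i} s_ni'
        rest = beta_n @ s_n - beta_n[i] * s_n[i]     # sum_{i'!=i} beta_ni' s_ni'
        c = beta_n[i] + rho + 2 * b * rest_s
        A = 2 * a * beta_n[i]
        B = c * beta_n[i]
        C = c * rest - w_n * beta_n[i]
        disc = B * B - 4 * A * C
        if disc < 0:
            nu = 0.0  # no real root: fall back to the boundary
        else:
            nu = (-B + np.sqrt(disc)) / (2 * A)  # larger root of eq. (8)
        s_n[i] = max(0.0, nu)                    # truncation, eq. (9)
    return s_n

def update_theta(S, gamma=0.1, lam=0.1):
    """Document feature update, eq. (10); S is (K, N) with columns s_n."""
    N = S.shape[1]
    return gamma * S.sum(axis=1) / (gamma * N + lam)
```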
The dictionary update in step 2 specifically comprises:
(c) Compute the gradient, with respect to β, of the sum of the loss functions of the M documents imported in this round, i.e., the gradient of the per-document loss in formula (4) with respect to β; from formula (7), the contribution of word n to the gradient with respect to $\beta_n$ is $\left(1 - w_n / (s_n \cdot \beta_n + \epsilon)\right) s_n$.
(d) Using the gradient from step (c), perform a single stochastic gradient descent step on β with step size 1/(t+10), where t is the round number of the iteration. Then project β onto the L1 ball of radius 1.
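A sketch of step (d); the patent does not name a projection algorithm, so the standard sorting-based Euclidean projection onto the L1 ball (Duchi et al.) is assumed here:

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the L1 ball {x : ||x||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]          # sorted magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho_idx = np.nonzero(u * ks > css - radius)[0][-1]
    theta = (css[rho_idx] - radius) / (rho_idx + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def dictionary_step(beta, grad, t):
    """Single SGD step on a dictionary vector with step size 1/(t+10),
    followed by projection onto the radius-1 L1 ball, as in step (d)."""
    beta = beta - grad / (t + 10)
    return project_l1_ball(beta, radius=1.0)
```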
The training of the multi-class support vector machine in step 3 specifically comprises: optimize the loss function of the support vector machine by gradient descent, where the loss function is the sum of the support vector machine's losses on the documents input in this round. For one document, the loss function is

$$R = \max_i \left( \Delta(i, j) + W_i \cdot \theta - W_j \cdot \theta \right) \qquad (11)$$

where j is the true class of the document; Δ(i, j) = 0 if i = j, and Δ(i, j) = 3600 otherwise.
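The per-document loss of equation (11) and a subgradient step can be sketched as follows (illustrative; the subgradient form is the standard one for the multi-class hinge loss and is an assumption here, since the patent only states that gradient descent is used):

```python
import numpy as np

def hinge_loss_and_grad(W, theta, j, delta=3600.0):
    """Multi-class hinge loss of eq. (11) and its subgradient w.r.t. W.

    W     : (num_classes, K) SVM weight matrix.
    theta : (K,) document latent feature vector.
    j     : true class index of the document.
    """
    scores = W @ theta
    margins = scores - scores[j] + delta   # Delta(i,j) + W_i.theta - W_j.theta
    margins[j] = 0.0                       # Delta(j,j) = 0; score terms cancel
    i_star = int(np.argmax(margins))       # the maximizing i in eq. (11)
    R = margins[i_star]
    grad = np.zeros_like(W)
    if i_star != j:                        # subgradient: raise W_j, lower W_i*
        grad[i_star] = theta
        grad[j] = -theta
    return R, grad

# One illustrative gradient-descent step on a document:
# R, g = hinge_loss_and_grad(W, theta, j); W -= 0.01 * g
```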
The supervised online topic model based on sparse latent feature representation was tested on the 20Newsgroups data set. This data set has 20 document classes; the training set contains 11,269 documents and the test set contains 7,505 documents to be classified, with a relatively balanced number of documents per class in both sets. Experiments show that the convergence time of the online topic model based on sparse latent feature representation is about 2,000 seconds, roughly 5 times faster than online Latent Dirichlet Allocation, while its accuracy reaches over 66%, about 4% higher than that of online LDA.
The above embodiment is only intended to illustrate the present invention, not to limit it. Those of ordinary skill in the relevant technical fields can make various changes and modifications without departing from the spirit and scope of the invention; therefore, all equivalent technical solutions also fall within the scope of the invention, and the patent protection scope of the invention shall be defined by the claims.

Claims (5)

1. A supervised online topic model learning method based on sparse latent feature representation, characterized in that it comprises the following steps:
S1. Using online learning, extract sparse-representation-based latent features for each document in the training set and for each word in those documents, obtaining groups of feature vectors; the feature vectors cover all documents of every class in the training set and all words in those documents;
S2. Update the dictionary according to the feature vectors obtained in S1 and the classes of the documents in the training set;
S3. Train a classifier with the feature vectors obtained in S1, obtaining the classifier's weight vectors, where each class of classifier weight vector corresponds to a document class in the training set;
S4. Performing steps S1, S2 and S3 once is called one round of iteration; if the round count equals a given constant, stop iterating and go to step S5, otherwise return to step S1 and increase the round count by 1, the round count being initialized to 0;
S5. Extract features for all documents to be classified, obtaining their feature vectors;
S6. Compute the inner product of the feature vector of each document to be classified with the classifier weight vector of every class from step S3;
S7. Take the training-set class corresponding to the maximum inner product from step S6 as the classification result for the document to be classified.
2. the method for claim 1 is characterized in that, the step of on-line study and feature extraction comprises among the described step S1:
S11, according to document code in the training set in the past backward order choose the subclass of a fixed size of training set, this subclass is minimized corresponding loss function, this subclass loss function is relevant with the implicit features of each word in the contained document of subclass;
S12, to the implicit features of each word in the subclass loss function loop optimization document among the step S11, until the loss function value convergence of described subclass, upgrade the document implicit features.
3. The method of claim 2, characterized in that the step of selecting a fixed-size subset of the training set comprises: in each round of iteration, select a subset of size M in order, the subset chosen in round i consisting of the documents numbered in [((i-1)*M+1)%D, (i*M)%D], where D is the number of documents in the training set and M is an integer in [1, D].
4. the method for claim 1 is characterized in that, the step of upgrading dictionary among the described step S2 comprises:
S21, try to achieve in this iteration the loss function sum of input document about the gradient of dictionary vector;
S22, according to step S21 gained gradient to the dictionary vector do single step at random gradient descend, then with the dictionary vector projection to the L1 hypersphere.
5. the method for claim 1, it is characterized in that, the step of the training multi-categorizer among the described step S3 comprises: the method that adopts gradient to descend is optimized the loss function of sorter, and loss function is that sorter is for the loss function sum of the document of this iteration input.
CN201310146127.XA 2013-04-24 2013-04-24 Supervised online topic model learning method based on sparse latent feature representation Active CN103198152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310146127.XA CN103198152B (en) 2013-04-24 2013-04-24 Supervised online topic model learning method based on sparse latent feature representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310146127.XA CN103198152B (en) 2013-04-24 2013-04-24 Supervised online topic model learning method based on sparse latent feature representation

Publications (2)

Publication Number Publication Date
CN103198152A true CN103198152A (en) 2013-07-10
CN103198152B CN103198152B (en) 2016-06-15

Family

ID=48720709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310146127.XA Active CN103198152B (en) 2013-04-24 2013-04-24 Supervised online topic model learning method based on sparse latent feature representation

Country Status (1)

Country Link
CN (1) CN103198152B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391902A (en) * 2014-11-12 2015-03-04 清华大学 Maximum entropy topic model-based online document classification method and device
CN109145296A (en) * 2018-08-09 2019-01-04 新华智云科技有限公司 A kind of general word recognition method and device based on monitor model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AONAN ZHANG, JUN ZHU, BO ZHANG: "Sparse Online Topic Models", INTERNATIONAL WORLD WIDE WEB CONFERENCE COMMITTEE (IW3C2) *
JUN ZHU, AMR AHMED, ERIC P. XING: "MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification", PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE *
MATTHEW D. HOFFMAN ET AL.: "Online Learning for Latent Dirichlet Allocation", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 23 (NIPS 2010) *
QIXIA JIANG ET AL.: "Monte Carlo Methods for Maximum Margin Supervised Topic Models", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 25 (NIPS 2012) *

Also Published As

Publication number Publication date
CN103198152B (en) 2016-06-15

Similar Documents

Publication Publication Date Title
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
CN111125358B (en) Text classification method based on hypergraph
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
Xiao et al. History-based attention in Seq2Seq model for multi-label text classification
Zhong et al. Adam revisited: A weighted past gradients perspective
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN105824802A (en) Method and device for acquiring knowledge graph vectoring expression
Zeng et al. A GA-based feature selection and parameter optimization for support tucker machine
CN104239554A (en) Cross-domain and cross-category news commentary emotion prediction method
CN103631859A (en) Intelligent review expert recommending method for science and technology projects
CN107451278A (en) Chinese Text Categorization based on more hidden layer extreme learning machines
CN109189926A (en) A kind of construction method of technical paper corpus
CN110175224A (en) Patent recommended method and device based on semantic interlink Heterogeneous Information internet startup disk
CN105069143A (en) Method and device for extracting keywords from document
CN102915448B (en) A kind of three-dimensional model automatic classification method based on AdaBoost
CN110647995A (en) Rule training method, device, equipment and storage medium
CN107590262A (en) The semi-supervised learning method of big data analysis
CN103150383A (en) Event evolution analysis method of short text data
CN108920446A (en) A kind of processing method of Engineering document
CN103473309B (en) Text categorization method based on probability word selection and supervision subject model
CN104537280B (en) Protein interactive relation recognition methods based on text relation similitude
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
Kawamura et al. A hybrid approach for optimal feature subset selection with evolutionary algorithms
CN106446117A (en) Text analysis method based on poisson-gamma belief network
CN106227767A (en) A kind of based on the adaptive collaborative filtering method of field dependency

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210525

Address after: 100084 a1901, 19th floor, building 8, yard 1, Zhongguancun East Road, Haidian District, Beijing

Patentee after: Beijing Ruili Wisdom Technology Co.,Ltd.

Address before: P.O. Box 100084-82, Tsinghua Yuan, Haidian District, Beijing 100084

Patentee before: TSINGHUA University

TR01 Transfer of patent right
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20130710

Assignee: Beijing Intellectual Property Management Co.,Ltd.

Assignor: Beijing Ruili Wisdom Technology Co.,Ltd.

Contract record no.: X2023110000073

Denomination of invention: Supervised online topic model learning method based on sparse implicit feature representation

Granted publication date: 20160615

License type: Common License

Record date: 20230531

EE01 Entry into force of recordation of patent licensing contract