CN103198152A - Supervised online topic model learning method based on sparse implicit characteristic expression - Google Patents

Supervised online topic model learning method based on sparse implicit characteristic expression

Info

Publication number
CN103198152A
CN103198152A
Authority
CN
China
Prior art keywords
document
training set
iteration
subset
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310146127XA
Other languages
Chinese (zh)
Other versions
CN103198152B (en)
Inventor
朱军 (Jun Zhu)
张傲南 (Aonan Zhang)
张钹 (Bo Zhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Real AI Technology Co Ltd
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201310146127.XA priority Critical patent/CN103198152B/en
Publication of CN103198152A publication Critical patent/CN103198152A/en
Application granted granted Critical
Publication of CN103198152B publication Critical patent/CN103198152B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a supervised online topic model learning method based on sparse latent feature representation, and relates to the fields of data mining and machine learning. The method comprises the following steps: using online learning, extract sparse-representation-based latent features for each document in a training set and for each word in those documents, obtaining groups of feature vectors; train a classifier from the feature vectors of the training set and the class information of the training documents, obtaining the classifier's weight vectors, where each class of classifier weight vector corresponds to a document class in the training set; extract the feature vectors of all documents to be classified; and compute the inner product of the feature vector of each document to be classified with the classifier weight vector of every class, taking the training-set class corresponding to the maximum inner product as the classification result. By adopting online learning, the method greatly increases model training speed, and by exploiting supervision information it improves classification accuracy.

Description

Supervised online topic model learning method based on sparse latent feature representation
Technical field
The present invention relates to the technical fields of data mining and machine learning, and in particular to a supervised online topic model learning method based on sparse latent feature representation.
Background technology
Latent topic models have shown remarkable advantages in mining the semantic information of documents and in handling complex document structure, and in recent years using latent topic models to efficiently mine the structure of large-scale document collections and streaming documents has become a research focus in this field.
Existing methods that use latent topic models to mine the semantic structure of documents are based on probabilistic models. Among the many models, representative ones are Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Indexing (PLSI), and Latent Dirichlet Allocation (LDA). The main problems to be solved when using latent topic models to mine the semantic structure of large-scale collections are: the number of documents is very large; the form of document input varies, e.g., streaming input; the speed of latent feature learning in the topic model needs to be improved; the expressiveness and sparsity of the latent features need to be improved; and supervision information should be exploited to improve the accuracy of the topic model.
In recent years there has been much work on using topic models to handle large-scale and streaming documents. In 2010, M. Hoffman et al. introduced online learning into Latent Dirichlet Allocation in "Online learning for latent Dirichlet allocation"; their method imports large document collections in batches to train the LDA model and learns the dictionary with online learning, so it handles large-scale and streaming input well. In 2012, D. Mimno et al. proposed in "Sparse stochastic inference for latent Dirichlet allocation" a method that introduces Gibbs sampling into online variational inference to train LDA, further improving the efficiency of online learning. The problem with both methods, however, is that they adopt probabilistic models, so the normalization constraints of probabilistic models prevent them from effectively controlling the sparsity of the latent features in the model. Moreover, neither method addresses how to handle supervision information that may exist in the documents.
To improve the expressiveness and sparsity of latent features in topic models, Jun Zhu et al. proposed the sparse topical coding model in 2011. This model innovatively introduces sparse coding into topic models: it adopts non-probabilistic modeling to remove the normalization constraint on latent feature representations, and introduces a sparsity penalty term to control the sparsity of the latent feature representation. Experiments show that this model trains faster than LDA based on probabilistic inference and controls the sparsity of latent representations better; at the same time, it can use supervision information to improve classification accuracy. However, because it adopts batch learning, this model cannot handle large-scale document collections or streaming input.
The latest achievements in the above fields provide a solid foundation for a supervised online topic model learning method based on sparse latent feature representation. However, these techniques still cannot effectively handle large-scale and streaming document input while effectively controlling the sparsity of the latent features in the topic model.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is: how to provide a supervised online topic model learning method based on sparse latent feature representation that improves the training speed of the topic model on large-scale document data sets and can handle streaming document input, while the topic model effectively controls the sparsity of the word latent features in documents and exploits supervision information to improve accuracy.
(2) Technical solution
To solve the above technical problem, the invention provides a supervised online topic model learning method based on sparse latent feature representation, comprising the following steps:
S1. Using online learning, extract sparse-representation-based latent features for each document in the training set and for each word in those documents, obtaining groups of feature vectors; the feature vectors cover all documents of every class in the training set and all words in those documents;
The online learning and feature extraction in step S1 comprise:
S11. Select a fixed-size subset of the training set, taking documents in order of their index from front to back, and minimize the loss function corresponding to this subset; the subset loss function depends on the latent features of each word in the documents contained in the subset;
The step of selecting a fixed-size subset of the training set in step S11 comprises: in each round of iteration, select a subset of size M in order; in general, the subset chosen in round i consists of the documents numbered in [((i-1)*M+1)%D, (i*M)%D], where D is the number of documents in the training set and M is an integer in [1, D] (see the index sketch after this list of steps);
S12. For the subset loss function from step S11, cyclically optimize the latent feature of each word in each document until the loss value of the subset converges, and finally update the document latent features.
S2. Update the dictionary according to the feature vectors obtained in step S1 and the classes of the documents in the training set;
The dictionary update in step S2 comprises:
S21. Compute the gradient, with respect to the dictionary vectors, of the sum of the loss functions of the documents input in this iteration;
S22. Using the gradient from step S21, perform a single stochastic gradient descent step on the dictionary vectors, then project the dictionary vectors onto the L1 ball.
S3. Train a classifier with the feature vectors obtained in step S1, obtaining the classifier's weight vectors; each class of classifier weight vector corresponds to a document class in the training set;
The training of the multi-class classifier in step S3 comprises: optimize the classifier's loss function by gradient descent, where the loss function is the sum of the classifier's losses on the documents input in this iteration.
S4. Performing steps S1, S2 and S3 once is called one round of iteration; if the round count equals a given constant, stop iterating and go to step S5, otherwise return to step S1 and increase the round count by 1 (the round count is initialized to 0);
S5. Extract features for all documents to be classified, obtaining their feature vectors;
S6. Compute the inner product of the feature vector of each document to be classified with the classifier weight vector of every class from step S3;
S7. Take the training-set class corresponding to the maximum inner product from step S6 as the classification result for the document to be classified.
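To make the round-robin subset selection of step S11 concrete, here is a minimal Python sketch (illustrative only; the function name and the 0-based indexing are assumptions, since the patent numbers documents from 1) of computing the document indices chosen in round i with wraparound:

```python
def minibatch_indices(i, M, D):
    """Return the 0-based document indices selected in round i (i >= 1).

    The patent numbers documents 1..D and picks [((i-1)*M+1)%D, (i*M)%D],
    wrapping back to the first document when the range passes the end.
    """
    start = (i - 1) * M  # 0-based position of the first document of the round
    return [(start + k) % D for k in range(M)]

# Example: D = 10 documents, minibatches of M = 4.
# Round 1 -> [0, 1, 2, 3]; round 3 -> [8, 9, 0, 1] (wraps around).
for i in (1, 2, 3):
    print(i, minibatch_indices(i, M=4, D=10))
```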
(3) Beneficial effects
Through non-probabilistic modeling, the present invention relaxes the normalization constraint of probabilistic models and introduces a sparsity penalty term to effectively control the sparsity of the word latent feature representations. At the same time, the invention adopts online learning, which improves both document classification accuracy and model training speed. In addition, the invention can effectively exploit supervision information to further improve classification accuracy.
Description of drawings
Fig. 1 is a flowchart of the supervised online topic model learning method based on sparse latent feature representation proposed by the present invention;
Fig. 2 is a flowchart of an embodiment of the supervised online topic model learning method based on sparse latent feature representation proposed by the present invention.
Embodiment
The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the invention, not to limit its scope. The supervised online topic model learning method based on sparse latent feature representation proposed by the invention is described in detail with reference to the embodiment as follows.
As shown in Fig. 2, the present embodiment comprises the following steps:
Step 1. The training set contains D documents in total. Using online learning, select M documents from the D training documents and extract sparse-representation-based latent features for each of these documents and for each word in them, obtaining an M×K feature matrix T, where each row of T is the feature vector of one document and K is the dimension of the latent features; one or more rows of the matrix T, taken as one class of the training set, represent a class in the training set.
Step 2. Update the dictionary β according to the feature vectors obtained in step 1 and the class information of the documents in the training set.
Step 3. Train a multi-class support vector machine with the feature vectors obtained in step 1, obtaining the weight matrix W of the support vector machine, where each row of W is the weight vector of one class of the support vector machine, corresponding to the respective document class in the training set.
Step 4. Performing steps 1, 2 and 3 once is called one round of iteration, with the round count initialized to 0. Judge whether the round count has reached the given constant; if so, stop iterating and go to step 5, otherwise return to step 1 and increase the round count by 1.
Step 5. Extract features for the documents to be classified in the test set, obtaining the feature vector y of each such document;
Step 6. Compute the inner product of the feature vector y from step 5 with the weight vector of each class i of the multi-class support vector machine from step 3, with value $P_i = W_i \cdot y$, where $W_i$ is the i-th row of W and the inner product $W_i \cdot y$ is defined as

$$W_i \cdot y = W_{i1} y_1 + W_{i2} y_2 + \cdots + W_{iK} y_K \qquad (1)$$

where $W_{ij}$ (j = 1, 2, ..., K) is the j-th component of the vector $W_i$ and $y_j$ (j = 1, 2, ..., K) is the j-th component of y.
Step 7. From the inner product of each class obtained in step 6, take the training-set class i corresponding to the maximum inner product P as the classification result of the document, where P is defined as

$$P = \max_j P_j \qquad (2)$$

and i is defined as

$$i = \arg\max_j P_j \qquad (3)$$
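As a concrete illustration of the prediction step in equations (1)-(3), the following NumPy sketch (illustrative; all names are assumptions) computes the inner products and the argmax class:

```python
import numpy as np

def classify(W, y):
    """Predict the class of one document.

    W : (num_classes, K) SVM weight matrix, one row per class (step 3).
    y : (K,) latent feature vector of the document to classify (step 5).
    Returns (i, P): the argmax class index and the maximum inner product,
    implementing equations (1)-(3).
    """
    P_all = W @ y              # P_j = W_j . y for every class j, eq. (1)
    i = int(np.argmax(P_all))  # eq. (3)
    return i, float(P_all[i])  # eq. (2)

# Example with 3 classes and K = 4 latent dimensions.
rng = np.random.default_rng(0)
W = rng.random((3, 4))
y = rng.random(4)
print(classify(W, y))
```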
The online learning and feature extraction in step 1 specifically comprise:
(a) Select a subset of M documents of the training set, taking documents in order of their index from front to back; for example, in round i select the documents in the range [((i-1)*M+1)%D, (i*M)%D], wrapping back to the 1st document when the range passes the end, and minimize the loss function corresponding to this subset. Let $w_n$ denote the number of occurrences of word n in a document, θ the latent feature of the document, $s_n$ the latent feature of word n in the document, and $\beta_n$ the vector corresponding to word n in the dictionary; ε is a compensation term that prevents degenerate cases, with value 0.001.
The loss function for one document is defined as

$$L = -\log \mathrm{Poisson} + \mathrm{tr}(S^{T} \Lambda S) + \rho \, \lVert S \rVert_1 \qquad (4)$$

where tr(X) denotes the trace of matrix X; Λ = (a-b)·I + b·E, I is the K×K identity matrix, E is the K×K matrix whose elements are all 1, a = γ/2 + γ²/(4λ + 2γN), and b = γ/2 - a; S is the matrix whose n-th column is the vector $s_n$, $S^T$ is the transpose of S, and N is the total number of words in the document. λ, γ, ρ are hyperparameters; in the present embodiment, λ = 0.1, γ = 0.1, ρ = 0.01. In addition, in formula (4):

$$-\log \mathrm{Poisson} = -\log \mathrm{Poisson}_1 - \log \mathrm{Poisson}_2 - \cdots - \log \mathrm{Poisson}_N \qquad (5)$$

$$\lVert S \rVert_1 = \lVert s_1 \rVert_1 + \lVert s_2 \rVert_1 + \cdots + \lVert s_N \rVert_1 \qquad (6)$$

where $\lVert s_n \rVert_1$ (n = 1, 2, ..., N) denotes the L1 norm of $s_n$.
The Poisson loss function for word n in a document is defined as

$$-\log \mathrm{Poisson}_n = s_n \cdot \beta_n - w_n \log ( s_n \cdot \beta_n + \epsilon ) \qquad (7)$$

where $s_n \cdot \beta_n$ (n = 1, 2, ..., N) denotes the inner product of the vectors $s_n$ and $\beta_n$.
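Putting equations (4)-(7) together, the per-document loss can be evaluated as in the following NumPy sketch (illustrative; the array shapes and names are assumptions, and equation (7) is used with the sign convention above):

```python
import numpy as np

def document_loss(S, beta, w, lam=0.1, gamma=0.1, rho=0.01, eps=1e-3):
    """Per-document loss of eqs. (4)-(7).

    S    : (K, N) latent word features, column n is s_n.
    beta : (K, N) dictionary vectors for the words of this document,
           column n is beta_n.
    w    : (N,) word counts w_n.
    """
    K, N = S.shape
    a = gamma / 2 + gamma**2 / (4 * lam + 2 * gamma * N)
    b = gamma / 2 - a
    # Poisson reconstruction loss, eq. (7) summed over words as in eq. (5).
    mu = np.sum(S * beta, axis=0)                  # s_n . beta_n per word
    poisson = np.sum(mu - w * np.log(mu + eps))
    # Quadratic coupling term tr(S^T Lambda S), Lambda = (a-b)I + bE.
    Lam = (a - b) * np.eye(K) + b * np.ones((K, K))
    quad = np.trace(S.T @ Lam @ S)
    # L1 sparsity penalty, eq. (6).
    l1 = rho * np.abs(S).sum()
    return poisson + quad + l1                     # eq. (4)
```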
(b) Sum the loss functions of the M documents imported in this round to obtain the overall loss function of the round. Cyclically optimize the value of each dimension of each word latent feature vector $s_n$ until the loss value converges, then update θ. The specific procedure is as follows:
For each document, optimize the loss function in formula (4) by cyclically optimizing each $s_{ni}$ (i = 1, 2, ..., K) until convergence, where $s_{ni}$ denotes the i-th component of $s_n$. Each update can be obtained by solving the quadratic equation in one unknown

$$2 a \beta_{ni} s_{ni}^{2} + c \, \beta_{ni} s_{ni} + c \sum_{i' \neq i} \beta_{ni'} s_{ni'} - w_n \beta_{ni} = 0 \qquad (8)$$

where $\beta_{ni}$ denotes the i-th component of $\beta_n$ (i = 1, 2, ..., K) and $c = \beta_{ni} + \rho + 2b \sum_{i' \neq i} s_{ni'}$.
Solving yields the closed-form solution for $s_{ni}$, denoted ν. Then truncate $s_{ni}$ at zero:

$$s_{ni} = \max(0, \nu) \qquad (9)$$

The formula for updating θ is

$$\theta = \gamma \, (s_1 + s_2 + \cdots + s_N) \, / \, (\gamma N + \lambda) \qquad (10)$$
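The coordinate-descent update of equations (8)-(10) can be sketched as follows (illustrative; the patent does not state which root of the quadratic is taken, so keeping the larger real root and skipping components with $\beta_{ni} = 0$ are assumptions):

```python
import numpy as np

def update_word_feature(s_n, beta_n, w_n, a, b, rho):
    """One cyclic pass over the components of s_n, eqs. (8)-(9).

    s_n, beta_n : (K,) word latent feature and its dictionary vector.
    w_n         : count of word n in the document.
    """
    K = s_n.shape[0]
    for i in range(K):
        if beta_n[i] == 0:
            continue  # eq. (8) degenerates; leave s_ni unchanged
        rest_s = s_n.sum() - s_n[i]                  # sum_{i'!=i} s_ni'
        rest = beta_n @ s_n - beta_n[i] * s_n[i]     # sum_{i'!=i} beta_ni' s_ni'
        c = beta_n[i] + rho + 2 * b * rest_s
        A = 2 * a * beta_n[i]
        B = c * beta_n[i]
        C = c * rest - w_n * beta_n[i]
        disc = B * B - 4 * A * C
        if disc < 0:
            nu = 0.0  # no real root: fall back to the boundary
        else:
            nu = (-B + np.sqrt(disc)) / (2 * A)  # larger root of eq. (8)
        s_n[i] = max(0.0, nu)                    # truncation, eq. (9)
    return s_n

def update_theta(S, gamma=0.1, lam=0.1):
    """Document feature update, eq. (10); S is (K, N) with columns s_n."""
    N = S.shape[1]
    return gamma * S.sum(axis=1) / (gamma * N + lam)
```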
The dictionary update in step 2 specifically comprises:
(c) Compute the gradient, with respect to β, of the sum of the loss functions of the M documents imported in this round, i.e., the gradient of the per-document loss in formula (4) with respect to β; from formula (7), the contribution of word n to the gradient with respect to $\beta_n$ is $\left(1 - w_n / (s_n \cdot \beta_n + \epsilon)\right) s_n$.
(d) Using the gradient from step (c), perform a single stochastic gradient descent step on β with step size 1/(t+10), where t is the round number of the iteration. Then project β onto the L1 ball of radius 1.
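A sketch of step (d); the patent does not name a projection algorithm, so the standard sorting-based Euclidean projection onto the L1 ball (Duchi et al.) is assumed here:

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the L1 ball {x : ||x||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]          # sorted magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho_idx = np.nonzero(u * ks > css - radius)[0][-1]
    theta = (css[rho_idx] - radius) / (rho_idx + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def dictionary_step(beta, grad, t):
    """Single SGD step on a dictionary vector with step size 1/(t+10),
    followed by projection onto the radius-1 L1 ball, as in step (d)."""
    beta = beta - grad / (t + 10)
    return project_l1_ball(beta, radius=1.0)
```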
The training of the multi-class support vector machine in step 3 specifically comprises: optimize the loss function of the support vector machine by gradient descent, where the loss function is the sum of the support vector machine's losses on the documents input in this round. For one document, the loss function is

$$R = \max_i \left( \Delta(i, j) + W_i \cdot \theta - W_j \cdot \theta \right) \qquad (11)$$

where j is the true class of the document; Δ(i, j) = 0 if i = j, and Δ(i, j) = 3600 otherwise.
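The per-document loss of equation (11) and a subgradient step can be sketched as follows (illustrative; the subgradient form is the standard one for the multi-class hinge loss and is an assumption here, since the patent only states that gradient descent is used):

```python
import numpy as np

def hinge_loss_and_grad(W, theta, j, delta=3600.0):
    """Multi-class hinge loss of eq. (11) and its subgradient w.r.t. W.

    W     : (num_classes, K) SVM weight matrix.
    theta : (K,) document latent feature vector.
    j     : true class index of the document.
    """
    scores = W @ theta
    margins = scores - scores[j] + delta   # Delta(i,j) + W_i.theta - W_j.theta
    margins[j] = 0.0                       # Delta(j,j) = 0; score terms cancel
    i_star = int(np.argmax(margins))       # the maximizing i in eq. (11)
    R = margins[i_star]
    grad = np.zeros_like(W)
    if i_star != j:                        # subgradient: raise W_j, lower W_i*
        grad[i_star] = theta
        grad[j] = -theta
    return R, grad

# One illustrative gradient-descent step on a document:
# R, g = hinge_loss_and_grad(W, theta, j); W -= 0.01 * g
```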
The supervised online topic model based on sparse latent feature representation was tested on the 20Newsgroups data set. This data set has 20 document classes; the training set contains 11,269 documents and the test set contains 7,505 documents to be classified, with a relatively balanced number of documents per class in both sets. Experiments show that the convergence time of the online topic model based on sparse latent feature representation is about 2,000 seconds, roughly 5 times faster than online Latent Dirichlet Allocation, while its accuracy reaches over 66%, about 4% higher than that of online LDA.
The above embodiment is only intended to illustrate the present invention, not to limit it. Those of ordinary skill in the relevant technical fields can make various changes and modifications without departing from the spirit and scope of the invention; therefore, all equivalent technical solutions also fall within the scope of the invention, and the patent protection scope of the invention shall be defined by the claims.

Claims (5)

1. A supervised online topic model learning method based on sparse latent feature representation, characterized in that it comprises the following steps:
S1. Using online learning, extract sparse-representation-based latent features for each document in the training set and for each word in those documents, obtaining groups of feature vectors; the feature vectors cover all documents of every class in the training set and all words in those documents;
S2. Update the dictionary according to the feature vectors obtained in S1 and the classes of the documents in the training set;
S3. Train a classifier with the feature vectors obtained in S1, obtaining the classifier's weight vectors, where each class of classifier weight vector corresponds to a document class in the training set;
S4. Performing steps S1, S2 and S3 once is called one round of iteration; if the round count equals a given constant, stop iterating and go to step S5, otherwise return to step S1 and increase the round count by 1, the round count being initialized to 0;
S5. Extract features for all documents to be classified, obtaining their feature vectors;
S6. Compute the inner product of the feature vector of each document to be classified with the classifier weight vector of every class from step S3;
S7. Take the training-set class corresponding to the maximum inner product from step S6 as the classification result for the document to be classified.
2. the method for claim 1 is characterized in that, the step of on-line study and feature extraction comprises among the described step S1:
S11, according to document code in the training set in the past backward order choose the subclass of a fixed size of training set, this subclass is minimized corresponding loss function, this subclass loss function is relevant with the implicit features of each word in the contained document of subclass;
S12, to the implicit features of each word in the subclass loss function loop optimization document among the step S11, until the loss function value convergence of described subclass, upgrade the document implicit features.
3. The method of claim 2, characterized in that the step of selecting a fixed-size subset of the training set comprises: in each round of iteration, select a subset of size M in order, the subset chosen in round i consisting of the documents numbered in [((i-1)*M+1)%D, (i*M)%D], where D is the number of documents in the training set and M is an integer in [1, D].
4. the method for claim 1 is characterized in that, the step of upgrading dictionary among the described step S2 comprises:
S21, try to achieve in this iteration the loss function sum of input document about the gradient of dictionary vector;
S22, according to step S21 gained gradient to the dictionary vector do single step at random gradient descend, then with the dictionary vector projection to the L1 hypersphere.
5. the method for claim 1, it is characterized in that, the step of the training multi-categorizer among the described step S3 comprises: the method that adopts gradient to descend is optimized the loss function of sorter, and loss function is that sorter is for the loss function sum of the document of this iteration input.
CN201310146127.XA 2013-04-24 2013-04-24 Supervised online topic model learning method based on sparse latent feature representation Active CN103198152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310146127.XA CN103198152B (en) 2013-04-24 2013-04-24 Supervised online topic model learning method based on sparse latent feature representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310146127.XA CN103198152B (en) 2013-04-24 2013-04-24 Supervised online topic model learning method based on sparse latent feature representation

Publications (2)

Publication Number Publication Date
CN103198152A true CN103198152A (en) 2013-07-10
CN103198152B CN103198152B (en) 2016-06-15

Family

ID=48720709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310146127.XA Active CN103198152B (en) 2013-04-24 2013-04-24 Supervised online topic model learning method based on sparse latent feature representation

Country Status (1)

Country Link
CN (1) CN103198152B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391902A (en) * 2014-11-12 2015-03-04 清华大学 Maximum entropy topic model-based online document classification method and device
CN109145296A (en) * 2018-08-09 2019-01-04 新华智云科技有限公司 A kind of general word recognition method and device based on monitor model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AONAN ZHANG, JUN ZHU, BO ZHANG: "Sparse Online Topic Models", INTERNATIONAL WORLD WIDE WEB CONFERENCE COMMITTEE (IW3C2) *
JUN ZHU, AMR AHMED, ERIC P. XING: "MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification", PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE *
MATTHEW D. HOFFMAN ET AL.: "Online Learning for Latent Dirichlet Allocation", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 23 (NIPS 2010) *
QIXIA JIANG ET AL.: "Monte Carlo Methods for Maximum Margin Supervised Topic Models", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 25 (NIPS 2012) *

Also Published As

Publication number Publication date
CN103198152B (en) 2016-06-15

Similar Documents

Publication Publication Date Title
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
CN111125358B (en) Text classification method based on hypergraph
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
Xiao et al. History-based attention in Seq2Seq model for multi-label text classification
Zhong et al. Adam revisited: A weighted past gradients perspective
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN105824802A (en) Method and device for acquiring knowledge graph vectoring expression
Zeng et al. A GA-based feature selection and parameter optimization for support tucker machine
CN104239554A (en) Cross-domain and cross-category news commentary emotion prediction method
CN103631859A (en) Intelligent review expert recommending method for science and technology projects
CN107451278A (en) Chinese Text Categorization based on more hidden layer extreme learning machines
CN109189926A (en) A kind of construction method of technical paper corpus
CN110175224A (en) Patent recommended method and device based on semantic interlink Heterogeneous Information internet startup disk
CN105069143A (en) Method and device for extracting keywords from document
CN102915448B (en) A kind of three-dimensional model automatic classification method based on AdaBoost
CN110647995A (en) Rule training method, device, equipment and storage medium
CN107590262A (en) The semi-supervised learning method of big data analysis
CN103150383A (en) Event evolution analysis method of short text data
CN108920446A (en) A kind of processing method of Engineering document
CN103473309B (en) Text categorization method based on probability word selection and supervision subject model
CN104537280B (en) Protein interactive relation recognition methods based on text relation similitude
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
Kawamura et al. A hybrid approach for optimal feature subset selection with evolutionary algorithms
CN106446117A (en) Text analysis method based on poisson-gamma belief network
CN106227767A (en) A kind of based on the adaptive collaborative filtering method of field dependency

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210525

Address after: 100084 a1901, 19th floor, building 8, yard 1, Zhongguancun East Road, Haidian District, Beijing

Patentee after: Beijing Ruili Wisdom Technology Co.,Ltd.

Address before: P.O. Box 100084-82, Tsinghua Yuan, Haidian District, Beijing 100084

Patentee before: TSINGHUA University

TR01 Transfer of patent right
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20130710

Assignee: Beijing Intellectual Property Management Co.,Ltd.

Assignor: Beijing Ruili Wisdom Technology Co.,Ltd.

Contract record no.: X2023110000073

Denomination of invention: Supervised online topic model learning method based on sparse implicit feature representation

Granted publication date: 20160615

License type: Common License

Record date: 20230531

EE01 Entry into force of recordation of patent licensing contract