Content of the invention
The invention provides a logistic-normal model topic extraction method capable of increasing the speed of topic extraction.
The invention provides a logistic-normal model topic extraction method, the method including:
S1: a parameter server stores the count matrix of topic-word correspondences of the training set on compute nodes in a distributed manner; the parameter server distributes all documents in the training set to the compute nodes; each compute node stores the count matrix and the documents sent by the parameter server;
S2: for each word in the documents on a compute node, the compute node performs Gibbs sampling of the topic corresponding to the word according to the count matrix stored on that node;
S3: the compute node samples the feature vector of each document according to the topics sampled for the words of that document;
S4: the compute node computes the sum and the sum of squares of the feature vectors of the documents on that node, computes from the sum and sum of squares the posterior distribution obeyed by the mean and covariance of all the feature vectors, and samples the mean and covariance of the document feature vectors from the posterior distribution;
S5: the compute node judges whether the number of iterations has reached a predetermined constant; if so, iteration stops and S6 is executed; if not, the number of iterations is incremented by 1 and S2, S3 and S4 are executed in turn;
S6: the compute node executes S2 and S3 in turn on its documents, applies a softmax transformation to the feature vectors sampled in S3, and outputs the proportion of each topic in each document on that node.
Further, the method further includes:
the compute node splits the posterior distribution of the topic into a term of the count matrix stored on that node and a term of the prior; by introducing an augmentation uniformly distributed random variable, only the non-zero entries are sampled when sampling from the term of the count matrix.
Further, sampling by the compute node of the feature vector of a document according to the topics sampled for the words of that document further includes:
S31: introducing an auxiliary variable for each dimension of the feature vector;
S32: approximately sampling each auxiliary variable from its conditional distribution given the current feature vector, using a Gaussian distribution;
S33: sampling each dimension of the feature vector in turn from its conditional distribution given all other dimensions of the feature vector and the auxiliary variables;
S34: judging whether the loop count has reached a preset loop count; if not, the loop count is incremented by 1 and S32 and S33 are executed in turn.
Further, the preset loop count is 8.
Further, step S32 includes: approximately sampling each auxiliary variable from the conditional distribution of that auxiliary variable given the current feature vector, using a transformed Polya-Gamma(1, z) distribution.
Further, the method also includes: removing the latent topic-word distribution matrix from the posterior distribution of each topic by integration.
Further, the method also includes:
the compute node records the increments of the count matrix on that node and periodically synchronizes each row of the count matrix with the parameter server corresponding to that row, wherein the parameter server is a distributed server and different rows of the count matrix are stored on different nodes.
Further, the recording by the compute node of the increments of the count matrix on that node, and the periodic synchronization of each row of the count matrix with the parameter server corresponding to that row, specifically include:
determining, from the index of the row, the parameter server storing that row, and sending the increment of that row on the compute node to that parameter server;
updating, by the parameter server, the count matrix on the parameter server according to the increment sent to it, and sending back to the compute node the difference between the corresponding row on the parameter server and the row on the compute node;
updating, by the compute node, that row on the compute node according to the received difference.
The logistic-normal model topic extraction method provided by the invention processes large-scale data by distributed computation and can increase the speed of topic extraction.
Specific embodiment
To make the purpose, technical scheme and advantages of the embodiments of the present invention clearer, the technical schemes in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work fall within the scope of protection of the present invention.
The embodiment of the present invention provides a logistic-normal model topic extraction method. Referring to Fig. 1, the method includes:
S1: a parameter server stores the count matrix of topic-word correspondences of the training set on compute nodes in a distributed manner; the parameter server distributes all documents in the training set to the compute nodes; each compute node stores the count matrix and the documents sent by the parameter server;
S2: for each word in the documents on a compute node, the compute node performs Gibbs sampling of the topic corresponding to the word according to the count matrix stored on that node;
S3: the compute node samples the feature vector of each document according to the topics sampled for the words of that document;
S4: the compute node computes the sum and the sum of squares of the feature vectors of the documents on that node, computes from the sum and sum of squares the posterior distribution obeyed by the mean and covariance of all the feature vectors, and samples the mean and covariance of the document feature vectors from the posterior distribution;
S5: the compute node judges whether the number of iterations has reached a predetermined constant; if so, iteration stops and S6 is executed; if not, the number of iterations is incremented by 1 and S2, S3 and S4 are executed in turn;
S6: the compute node executes S2 and S3 in turn on its documents, applies a softmax transformation to the feature vectors sampled in S3, and outputs the proportion of each topic in each document on that node.
The logistic-normal model topic extraction method provided by the embodiment of the present invention processes large-scale data by distributed computation and can increase the speed of topic extraction.
Here, a topic extraction system includes one parameter server and at least one compute node. The parameter server distributes the documents to be processed in the training set to the compute nodes and sends the count matrix to the compute nodes; each compute node stores the part of the training-set documents distributed to it by the parameter server and performs topic extraction on the documents it stores.
In step S1, the parameter server stores the count matrix C of topic-word correspondences of the training set on the compute nodes in a distributed manner; the parameter server distributes all documents in the training set to the compute nodes, and each compute node stores the count matrix C and the documents sent by the parameter server.
Here C_kv = #{(d, n) : z_dn = k, w_dn = v}, D is the number of documents, N_d is the length of the d-th document, w_dn ∈ [1, V] is the index of the n-th word of the d-th document, V is the size of the vocabulary, z_dn ∈ [1, K] is the topic index of the n-th word of the d-th document, K is the number of topics, and #A denotes the number of elements of the set A.
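As an illustrative sketch (function and variable names are assumptions, not from the patent), the count matrix of topic-word correspondences described above can be built from the per-word topic assignments, here with 0-based indices:

```python
# Build the topic-word count matrix C, where C[k][v] counts how many
# word positions (d, n) have topic z_dn == k and word w_dn == v.

def build_count_matrix(words, topics, K, V):
    """words[d][n] = w_dn, topics[d][n] = z_dn (0-based)."""
    C = [[0] * V for _ in range(K)]
    for doc_words, doc_topics in zip(words, topics):
        for w, z in zip(doc_words, doc_topics):
            C[z][w] += 1
    return C

words  = [[0, 1, 1], [2, 0]]   # two documents, vocabulary size V = 3
topics = [[0, 0, 1], [1, 1]]   # K = 2 topics
C = build_count_matrix(words, topics, K=2, V=3)
print(C)  # [[1, 1, 0], [1, 1, 1]]
```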
In step S2, for each word in the documents on the compute node, the compute node performs Gibbs sampling of the corresponding topic z_dn according to the count matrix C stored on that node.
In step S3, the compute node samples the feature vector η_d of each document according to the topics z_dn sampled for the words of that document, where η_dk denotes the k-th dimension of the feature vector of document d.
In step S4, the compute node computes the sum and the sum of squares of the feature vectors η_d of the documents on that node, computes from the sum and sum of squares the posterior distribution obeyed by the mean and covariance of the feature vectors, and samples the mean μ and covariance Σ of the document feature vectors from the posterior distribution.
In step S6, the compute node executes S2 and S3 in turn on its documents, applies a softmax transformation to the feature vectors η_d sampled in S3, and outputs the proportion θ_d of each topic in each document on that node, where the softmax transformation is defined as θ_dk = exp(η_dk) / Σ_k' exp(η_dk'), so that after the transformation the topic proportions of each document sum to 1.
Further, the method includes: the compute node splits the posterior distribution of the topic into a term of the count matrix stored on that node and a term of the prior; by introducing an augmentation uniformly distributed random variable, only the non-zero entries are sampled when sampling from the term of the count matrix.
Step S3 specifically includes:
S31: introducing an auxiliary variable for each dimension of the feature vector;
S32: approximately sampling each auxiliary variable from its conditional distribution given the current feature vector, using a Gaussian distribution;
S33: sampling each dimension of the feature vector in turn from its conditional distribution given all other dimensions of the feature vector and the auxiliary variables.
Specifically, the i-th dimension of the feature vector is drawn as: i-th dimension ~ P(i-th dimension of the feature vector | i-th dimension of the auxiliary variables, dimensions of the feature vector other than i).
S34: judging whether the loop count has reached the preset loop count; if not, the loop count is incremented by 1 and S32 and S33 are executed in turn.
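The control flow of steps S31 to S34 can be sketched as follows; the two conditional samplers are placeholders (assumptions), since the actual conditional distributions are specified later in the description:

```python
import random

def augmented_gibbs(eta, sample_aux, sample_dim, loops=8):
    """Alternate S32 (sample one auxiliary variable per dimension,
    given eta) and S33 (sample each dimension of eta given the other
    dimensions and the auxiliaries), for a preset loop count (S34)."""
    K = len(eta)
    for _ in range(loops):
        lam = [sample_aux(eta, k) for k in range(K)]   # S32
        for k in range(K):                              # S33
            eta[k] = sample_dim(eta, lam, k)
    return eta

# Placeholder conditionals, just to exercise the control flow.
random.seed(0)
eta = augmented_gibbs(
    [0.0, 0.0],
    sample_aux=lambda eta, k: 1.0,
    sample_dim=lambda eta, lam, k: random.gauss(0.0, 1.0 / (1.0 + lam[k])),
)
print(len(eta))  # 2
```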
Here, the preset loop count is preferably 8. Step S32 can also be realized by the following method: each auxiliary variable is approximately sampled from the conditional distribution of that auxiliary variable given the current feature vector, using a transformed Polya-Gamma(1, z) distribution.
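The Polya-Gamma(1, z) draws mentioned above can be approximated by truncating the distribution's infinite-sum representation, PG(b, c) = (1/(2π²)) Σ_k g_k / ((k − 1/2)² + c²/(4π²)) with g_k ~ Gamma(b, 1). This is a sketch under that truncation, not the exact transformed sampler of the method:

```python
import math
import random

def sample_pg1(c, trunc=200, rng=random):
    """Approximate draw from PG(1, c) by truncating the infinite-sum
    representation of the Polya-Gamma distribution; g_k ~ Gamma(1, 1)."""
    s = 0.0
    for k in range(1, trunc + 1):
        g = rng.gammavariate(1.0, 1.0)
        s += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return s / (2.0 * math.pi ** 2)

random.seed(1)
c = 2.0
draws = [sample_pg1(c) for _ in range(4000)]
mean = sum(draws) / len(draws)
print(round(mean, 3))                         # empirical mean
print(round(math.tanh(c / 2) / (2 * c), 3))   # E[PG(1, c)] = tanh(c/2)/(2c)
```

The empirical mean should land close to the theoretical mean tanh(c/2)/(2c), which is a quick sanity check on the truncation.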
In addition, in order to increase the speed of topic extraction, it is preferable to remove the latent topic-word distribution matrix from the posterior distribution of each topic by integration.
The method maintains the count matrix by a periodic asynchronous incremental update method, which specifically includes: the compute node records the increments of the count matrix on that node and periodically synchronizes each row of the count matrix with the parameter server corresponding to that row, wherein the parameter server is a distributed server and different rows of the count matrix are stored on different nodes.
Here, the recording by the compute node of the increments of the count matrix on that node, and the periodic synchronization of each row of the count matrix with the parameter server corresponding to that row, specifically include:
determining, from the index of the row, the parameter server storing that row, and sending the increment of that row on the compute node to that parameter server;
updating, by the parameter server, the count matrix on the parameter server according to the increment sent to it, and sending back to the compute node the difference between the corresponding row on the parameter server and the row on the compute node;
updating, by the compute node, that row on the compute node according to the received difference.
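The row synchronization just described (node sends its increment; server applies it and returns the difference; node applies the difference) might be sketched as follows, for a single row; all class and method names are illustrative assumptions:

```python
# Sketch of periodic incremental synchronization of one count-matrix
# row between compute nodes and the parameter server holding that row.

class ParameterServer:
    def __init__(self, row):
        self.row = list(row)  # authoritative copy of the row

    def sync(self, node_row, increment):
        # Apply the node's increment, then return the difference
        # between the server's row and the node's current row.
        self.row = [a + b for a, b in zip(self.row, increment)]
        return [a - b for a, b in zip(self.row, node_row)]

class ComputeNode:
    def __init__(self, row):
        self.row = list(row)
        self.increment = [0] * len(row)  # local changes since last sync

    def add(self, v, delta):
        self.row[v] += delta
        self.increment[v] += delta

    def synchronize(self, server):
        diff = server.sync(self.row, self.increment)
        self.row = [a + b for a, b in zip(self.row, diff)]
        self.increment = [0] * len(self.row)

# Two nodes update the same row; after each synchronizes twice,
# every copy agrees with the server.
server = ParameterServer([0, 0, 0])
n1, n2 = ComputeNode([0, 0, 0]), ComputeNode([0, 0, 0])
n1.add(0, 2); n2.add(1, 3)
n1.synchronize(server); n2.synchronize(server)
n1.synchronize(server); n2.synchronize(server)
print(server.row, n1.row, n2.row)  # [2, 3, 0] three times
```

Returning the server-minus-node difference, rather than the whole row, lets each node fold in the other nodes' updates without retransmitting rows it already has.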
The sampling procedure of the method specifically includes:
a: on the compute node, the topic z_dn of each word in a document is sampled by Gibbs sampling from its posterior distribution given all word-topic assignments and the feature vectors,
p(z_dn = k | w_dn = v, η_d, C) ∝ exp(η_dk) · (C_kv + β_v) / (C_k + Σ_v β_v),
where C_k = Σ_v C_kv and β is a prior set in advance, β = (0.01, …, 0.01).
b: using the count matrix C of topic-word correspondences, the probability of each topic is computed in turn, exploiting the sparsity of C. The conditional distribution splits into a count term p_k = exp(η_dk) C_kv / (C_k + Σ_v β_v), which is non-zero only for topics with C_kv > 0, and a prior term q_k = exp(η_dk) β_v / (C_k + Σ_v β_v). The method of sampling from this conditional distribution is:
draw u ~ U(0, 1); if u < Σ_k p_k / (Σ_k p_k + Σ_k q_k), draw z_dn ~ Mult(p / Σ_k p_k), visiting only the non-zero entries of C; otherwise draw z_dn ~ Mult(q / Σ_k q_k);
where Mult(A) is the multinomial distribution with vector A as parameter.
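Assuming the standard split of (C_kv + β_v)/(C_k + V·β) into a sparse count term and a dense prior term, the two-bucket sampling with u ~ U(0, 1) described above could be sketched as follows (a reconstruction, with hypothetical names):

```python
import math
import random

def sample_topic(eta, C, v, beta, rng=random):
    """Sample z ~ p(k) ∝ exp(eta[k]) * (C[k][v] + beta) / (C_k + V*beta),
    split into a sparse count bucket (non-zero only where C[k][v] > 0)
    and a dense prior bucket, so non-zero counts are visited only once."""
    K, V = len(C), len(C[0])
    denom = [sum(C[k]) + V * beta for k in range(K)]
    p = {k: math.exp(eta[k]) * C[k][v] / denom[k]
         for k in range(K) if C[k][v] > 0}              # sparse bucket
    q = [math.exp(eta[k]) * beta / denom[k] for k in range(K)]
    P, Q = sum(p.values()), sum(q)
    u = rng.random()
    if u < P / (P + Q):                                  # count bucket
        r = rng.random() * P
        for k, w in p.items():
            r -= w
            if r <= 0:
                return k
        return next(iter(p))
    r = rng.random() * Q                                 # prior bucket
    for k, w in enumerate(q):
        r -= w
        if r <= 0:
            return k
    return K - 1

random.seed(0)
C = [[5, 0], [0, 3], [0, 0]]   # K = 3 topics, V = 2 words
draws = [sample_topic([0.0, 0.0, 0.0], C, v=0, beta=0.01)
         for _ in range(3000)]
print(draws.count(0), draws.count(1), draws.count(2))
```

With these counts, topic 0 dominates for word 0, while topic 1 (which never emitted word 0) is drawn almost only through the small prior bucket.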
c: an auxiliary variable λ_dk is introduced for each dimension of the feature vector η_d;
d: each auxiliary variable is approximately sampled from its conditional distribution given the current feature vector, using a Gaussian distribution or a transformed Polya-Gamma(1, z) distribution, λ_dk ~ PG(1, z_dk), where PG(·, ·) denotes the Polya-Gamma distribution;
e: each dimension of the feature vector is sampled in turn from its conditional distribution given all other dimensions of the feature vector and the auxiliary variables; given the auxiliary variable λ_dk, this conditional distribution is a Gaussian whose mean and variance are determined by the topic counts of the document, λ_dk and the prior parameters μ and Σ;
f: steps d and e are repeated until the number of repetitions reaches the preset loop count S = 8.
g: each compute node computes the mean and covariance of the feature vectors of the documents on that node; from these quantities, the mean and covariance of the feature vectors of all documents in the training set are computed.
h: the mean and covariance of the feature vectors of all documents in the training set are used to compute the parameters of their posterior distribution, and a new mean and covariance are sampled according to these parameters. If the prior distribution of (μ, Σ) is the Normal-Inverse-Wishart distribution NIW(μ_0, κ, ρ, W), then the posterior distribution is NIW(μ', κ', ρ', W'), where
μ' = (κ μ_0 + D x̄) / (κ + D), ρ' = ρ + D, κ' = κ + D,
W'^(−1) = W^(−1) + S̄ + (κ D / (κ + D)) (x̄ − μ_0)(x̄ − μ_0)^T,
with the sample mean x̄ = (1/D) Σ_d η_d and the sample scatter S̄ = Σ_d (η_d − x̄)(η_d − x̄)^T.
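Step h's conjugate update (the fragments ρ' = ρ + D and κ' = κ + D survive in the text) matches the standard Normal-Inverse-Wishart posterior. A one-dimensional sketch of computing these posterior parameters, offered as an assumption about the full multivariate update:

```python
# Normal-Inverse-Wishart posterior parameters from the sample mean and
# scatter of the document feature vectors:
#   mu'   = (kappa*mu0 + D*xbar) / (kappa + D)
#   kappa'= kappa + D,  rho' = rho + D
#   Winv' = Winv + scatter + kappa*D/(kappa+D) * (xbar - mu0)^2
# Pure Python, scalar feature "vectors" for brevity.

def niw_posterior(mu0, kappa, rho, winv, etas):
    D = len(etas)
    xbar = sum(etas) / D
    scatter = sum((e - xbar) ** 2 for e in etas)
    mu_p = (kappa * mu0 + D * xbar) / (kappa + D)
    winv_p = winv + scatter + kappa * D / (kappa + D) * (xbar - mu0) ** 2
    return mu_p, kappa + D, rho + D, winv_p

mu_p, kappa_p, rho_p, winv_p = niw_posterior(
    mu0=0.0, kappa=1.0, rho=3.0, winv=1.0, etas=[1.0, 2.0, 3.0])
print(mu_p, kappa_p, rho_p)  # 1.5 4.0 6.0
```

Sampling the new mean and covariance from NIW(μ', κ', ρ', W') would then follow, e.g. via an inverse-Wishart draw for Σ and a Gaussian draw for μ.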
It should be noted that during topic extraction the compute node updates the count matrix on that node and notifies the parameter server of the updates; the parameter server updates the count matrix stored on the parameter server according to the updates sent by all compute nodes and, after the update is completed, sends the latest version of the updated count matrix to all compute nodes; on receiving this latest version, a compute node updates the count matrix on that node accordingly. In addition, the method provided by the embodiment of the present invention uses a logistic-normal prior, and thereby obtains the correlations between topics.
As can be seen from the above description, the logistic-normal model topic extraction method provided by the embodiment of the present invention uses distributed computation to process large-scale document collections and, by way of data augmentation, obtains an exact Gibbs sampling algorithm, which improves computational efficiency and precision and can increase the speed of topic extraction.
It should be noted that relational terms herein such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device including that element.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by related hardware under program instructions. The aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk or an optical disk.
Finally, it should be noted that the foregoing is merely a preferred embodiment of the present invention, intended only to illustrate the technical scheme of the present invention and not to limit its scope of protection. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be contained within the scope of protection of the present invention.