Content of the invention
The invention provides a logistic-normal model topic extraction method capable of increasing the speed of topic extraction.
The invention provides a logistic-normal model topic extraction method, the method including:
S1: a parameter server stores the count matrix of topic-word correspondences of the training set on compute nodes in a distributed manner; the parameter server distributes all documents in the training set to the compute nodes; each compute node stores the count matrix and the documents sent by the parameter server;
S2: for each word in the documents on a compute node, the compute node performs Gibbs sampling of the topic corresponding to the word according to the count matrix stored on that node;
S3: the compute node samples the feature vector of each document according to the topics sampled for the words of that document;
S4: the compute node computes the sum and the sum of squares of the feature vectors of the documents on that node, computes from the sum and sum of squares the posterior distribution obeyed by the mean and covariance of all the feature vectors, and samples the mean and covariance of the document feature vectors from the posterior distribution;
S5: the compute node judges whether the number of iterations has reached a predetermined constant; if so, iteration stops and S6 is executed; if not, the number of iterations is incremented by 1 and S2, S3 and S4 are executed in turn;
S6: the compute node executes S2 and S3 in turn on its documents, applies a softmax transformation to the feature vectors sampled in S3, and outputs the proportion of each topic in each document on that node.
Further, the method further includes:
the compute node splits the posterior distribution of the topic into a term of the count matrix stored on that node and a term of the prior; by introducing an augmentation uniformly distributed random variable, only the non-zero entries are sampled when sampling from the term of the count matrix.
Further, sampling by the compute node of the feature vector of a document according to the topics sampled for the words of that document further includes:
S31: introducing an auxiliary variable for each dimension of the feature vector;
S32: approximately sampling each auxiliary variable from its conditional distribution given the current feature vector, using a Gaussian distribution;
S33: sampling each dimension of the feature vector in turn from its conditional distribution given all other dimensions of the feature vector and the auxiliary variables;
S34: judging whether the loop count has reached a preset loop count; if not, the loop count is incremented by 1 and S32 and S33 are executed in turn.
Further, the preset loop count is 8.
Further, step S32 includes: approximately sampling each auxiliary variable from the conditional distribution of that auxiliary variable given the current feature vector, using a transformed Polya-Gamma(1, z) distribution.
Further, the method also includes: removing the latent topic-word distribution matrix from the posterior distribution of each topic by integration.
Further, the method also includes:
the compute node records the increments of the count matrix on that node and periodically synchronizes each row of the count matrix with the parameter server corresponding to that row, wherein the parameter server is a distributed server and different rows of the count matrix are stored on different nodes.
Further, the recording by the compute node of the increments of the count matrix on that node, and the periodic synchronization of each row of the count matrix with the parameter server corresponding to that row, specifically include:
determining, from the index of the row, the parameter server storing that row, and sending the increment of that row on the compute node to that parameter server;
updating, by the parameter server, the count matrix on the parameter server according to the increment sent to it, and sending back to the compute node the difference between the corresponding row on the parameter server and the row on the compute node;
updating, by the compute node, that row on the compute node according to the received difference.
The logistic-normal model topic extraction method provided by the invention processes large-scale data by distributed computation and can increase the speed of topic extraction.
Specific embodiment
To make the purpose, technical scheme and advantages of the embodiments of the present invention clearer, the technical schemes in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work fall within the scope of protection of the present invention.
The embodiment of the present invention provides a logistic-normal model topic extraction method. Referring to Fig. 1, the method includes:
S1: a parameter server stores the count matrix of topic-word correspondences of the training set on compute nodes in a distributed manner; the parameter server distributes all documents in the training set to the compute nodes; each compute node stores the count matrix and the documents sent by the parameter server;
S2: for each word in the documents on a compute node, the compute node performs Gibbs sampling of the topic corresponding to the word according to the count matrix stored on that node;
S3: the compute node samples the feature vector of each document according to the topics sampled for the words of that document;
S4: the compute node computes the sum and the sum of squares of the feature vectors of the documents on that node, computes from the sum and sum of squares the posterior distribution obeyed by the mean and covariance of all the feature vectors, and samples the mean and covariance of the document feature vectors from the posterior distribution;
S5: the compute node judges whether the number of iterations has reached a predetermined constant; if so, iteration stops and S6 is executed; if not, the number of iterations is incremented by 1 and S2, S3 and S4 are executed in turn;
S6: the compute node executes S2 and S3 in turn on its documents, applies a softmax transformation to the feature vectors sampled in S3, and outputs the proportion of each topic in each document on that node.
The logistic-normal model topic extraction method provided by the embodiment of the present invention processes large-scale data by distributed computation and can increase the speed of topic extraction.
Here, a topic extraction system includes one parameter server and at least one compute node. The parameter server distributes the documents to be processed in the training set to the compute nodes and sends the count matrix to the compute nodes; each compute node stores the part of the training-set documents distributed to it by the parameter server and performs topic extraction on the documents it stores.
In step S1, the parameter server stores the count matrix C of topic-word correspondences of the training set on the compute nodes in a distributed manner; the parameter server distributes all documents in the training set to the compute nodes, and each compute node stores the count matrix C and the documents sent by the parameter server.
Here C_kv = #{(d, n) : z_dn = k, w_dn = v}, D is the number of documents, N_d is the length of the d-th document, w_dn ∈ [1, V] is the index of the n-th word of the d-th document, V is the size of the vocabulary, z_dn ∈ [1, K] is the topic index of the n-th word of the d-th document, K is the number of topics, and #A denotes the number of elements of the set A.
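As an illustrative sketch (function and variable names are assumptions, not from the patent), the count matrix of topic-word correspondences described above can be built from the per-word topic assignments, here with 0-based indices:

```python
# Build the topic-word count matrix C, where C[k][v] counts how many
# word positions (d, n) have topic z_dn == k and word w_dn == v.

def build_count_matrix(words, topics, K, V):
    """words[d][n] = w_dn, topics[d][n] = z_dn (0-based)."""
    C = [[0] * V for _ in range(K)]
    for doc_words, doc_topics in zip(words, topics):
        for w, z in zip(doc_words, doc_topics):
            C[z][w] += 1
    return C

words  = [[0, 1, 1], [2, 0]]   # two documents, vocabulary size V = 3
topics = [[0, 0, 1], [1, 1]]   # K = 2 topics
C = build_count_matrix(words, topics, K=2, V=3)
print(C)  # [[1, 1, 0], [1, 1, 1]]
```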
In step S2, for each word in the documents on the compute node, the compute node performs Gibbs sampling of the corresponding topic z_dn according to the count matrix C stored on that node.
In step S3, the compute node samples the feature vector η_d of each document according to the topics z_dn sampled for the words of that document, where η_dk denotes the k-th dimension of the feature vector of document d.
In step S4, the compute node computes the sum and the sum of squares of the feature vectors η_d of the documents on that node, computes from the sum and sum of squares the posterior distribution obeyed by the mean and covariance of the feature vectors, and samples the mean μ and covariance Σ of the document feature vectors from the posterior distribution.
In step S6, the compute node executes S2 and S3 in turn on its documents, applies a softmax transformation to the feature vectors η_d sampled in S3, and outputs the proportion θ_d of each topic in each document on that node, where the softmax transformation is defined as θ_dk = exp(η_dk) / Σ_k' exp(η_dk'), so that after the transformation the topic proportions of each document sum to 1.
Further, the method includes: the compute node splits the posterior distribution of the topic into a term of the count matrix stored on that node and a term of the prior; by introducing an augmentation uniformly distributed random variable, only the non-zero entries are sampled when sampling from the term of the count matrix.
Step S3 specifically includes:
S31: introducing an auxiliary variable for each dimension of the feature vector;
S32: approximately sampling each auxiliary variable from its conditional distribution given the current feature vector, using a Gaussian distribution;
S33: sampling each dimension of the feature vector in turn from its conditional distribution given all other dimensions of the feature vector and the auxiliary variables.
Specifically, the i-th dimension of the feature vector is drawn as: i-th dimension ~ P(i-th dimension of the feature vector | i-th dimension of the auxiliary variables, dimensions of the feature vector other than i).
S34: judging whether the loop count has reached the preset loop count; if not, the loop count is incremented by 1 and S32 and S33 are executed in turn.
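The control flow of steps S31 to S34 can be sketched as follows; the two conditional samplers are placeholders (assumptions), since the actual conditional distributions are specified later in the description:

```python
import random

def augmented_gibbs(eta, sample_aux, sample_dim, loops=8):
    """Alternate S32 (sample one auxiliary variable per dimension,
    given eta) and S33 (sample each dimension of eta given the other
    dimensions and the auxiliaries), for a preset loop count (S34)."""
    K = len(eta)
    for _ in range(loops):
        lam = [sample_aux(eta, k) for k in range(K)]   # S32
        for k in range(K):                              # S33
            eta[k] = sample_dim(eta, lam, k)
    return eta

# Placeholder conditionals, just to exercise the control flow.
random.seed(0)
eta = augmented_gibbs(
    [0.0, 0.0],
    sample_aux=lambda eta, k: 1.0,
    sample_dim=lambda eta, lam, k: random.gauss(0.0, 1.0 / (1.0 + lam[k])),
)
print(len(eta))  # 2
```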
Here, the preset loop count is preferably 8. Step S32 can also be realized by the following method: each auxiliary variable is approximately sampled from the conditional distribution of that auxiliary variable given the current feature vector, using a transformed Polya-Gamma(1, z) distribution.
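The Polya-Gamma(1, z) draws mentioned above can be approximated by truncating the distribution's infinite-sum representation, PG(b, c) = (1/(2π²)) Σ_k g_k / ((k − 1/2)² + c²/(4π²)) with g_k ~ Gamma(b, 1). This is a sketch under that truncation, not the exact transformed sampler of the method:

```python
import math
import random

def sample_pg1(c, trunc=200, rng=random):
    """Approximate draw from PG(1, c) by truncating the infinite-sum
    representation of the Polya-Gamma distribution; g_k ~ Gamma(1, 1)."""
    s = 0.0
    for k in range(1, trunc + 1):
        g = rng.gammavariate(1.0, 1.0)
        s += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return s / (2.0 * math.pi ** 2)

random.seed(1)
c = 2.0
draws = [sample_pg1(c) for _ in range(4000)]
mean = sum(draws) / len(draws)
print(round(mean, 3))                         # empirical mean
print(round(math.tanh(c / 2) / (2 * c), 3))   # E[PG(1, c)] = tanh(c/2)/(2c)
```

The empirical mean should land close to the theoretical mean tanh(c/2)/(2c), which is a quick sanity check on the truncation.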
In addition, in order to increase the speed of topic extraction, it is preferable to remove the latent topic-word distribution matrix from the posterior distribution of each topic by integration.
The method maintains the count matrix by a periodic asynchronous incremental update method, which specifically includes: the compute node records the increments of the count matrix on that node and periodically synchronizes each row of the count matrix with the parameter server corresponding to that row, wherein the parameter server is a distributed server and different rows of the count matrix are stored on different nodes.
Here, the recording by the compute node of the increments of the count matrix on that node, and the periodic synchronization of each row of the count matrix with the parameter server corresponding to that row, specifically include:
determining, from the index of the row, the parameter server storing that row, and sending the increment of that row on the compute node to that parameter server;
updating, by the parameter server, the count matrix on the parameter server according to the increment sent to it, and sending back to the compute node the difference between the corresponding row on the parameter server and the row on the compute node;
updating, by the compute node, that row on the compute node according to the received difference.
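The row synchronization just described (node sends its increment; server applies it and returns the difference; node applies the difference) might be sketched as follows, for a single row; all class and method names are illustrative assumptions:

```python
# Sketch of periodic incremental synchronization of one count-matrix
# row between compute nodes and the parameter server holding that row.

class ParameterServer:
    def __init__(self, row):
        self.row = list(row)  # authoritative copy of the row

    def sync(self, node_row, increment):
        # Apply the node's increment, then return the difference
        # between the server's row and the node's current row.
        self.row = [a + b for a, b in zip(self.row, increment)]
        return [a - b for a, b in zip(self.row, node_row)]

class ComputeNode:
    def __init__(self, row):
        self.row = list(row)
        self.increment = [0] * len(row)  # local changes since last sync

    def add(self, v, delta):
        self.row[v] += delta
        self.increment[v] += delta

    def synchronize(self, server):
        diff = server.sync(self.row, self.increment)
        self.row = [a + b for a, b in zip(self.row, diff)]
        self.increment = [0] * len(self.row)

# Two nodes update the same row; after each synchronizes twice,
# every copy agrees with the server.
server = ParameterServer([0, 0, 0])
n1, n2 = ComputeNode([0, 0, 0]), ComputeNode([0, 0, 0])
n1.add(0, 2); n2.add(1, 3)
n1.synchronize(server); n2.synchronize(server)
n1.synchronize(server); n2.synchronize(server)
print(server.row, n1.row, n2.row)  # [2, 3, 0] three times
```

Returning the server-minus-node difference, rather than the whole row, lets each node fold in the other nodes' updates without retransmitting rows it already has.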
The sampling procedure of the method specifically includes:
a: on the compute node, the topic z_dn of each word in a document is sampled by Gibbs sampling from its posterior distribution given all word-topic assignments and the feature vectors,
p(z_dn = k | w_dn = v, η_d, C) ∝ exp(η_dk) · (C_kv + β_v) / (C_k + Σ_v β_v),
where C_k = Σ_v C_kv and β is a prior set in advance, β = (0.01, …, 0.01).
b: using the count matrix C of topic-word correspondences, the probability of each topic is computed in turn, exploiting the sparsity of C. The conditional distribution splits into a count term p_k = exp(η_dk) C_kv / (C_k + Σ_v β_v), which is non-zero only for topics with C_kv > 0, and a prior term q_k = exp(η_dk) β_v / (C_k + Σ_v β_v). The method of sampling from this conditional distribution is:
draw u ~ U(0, 1); if u < Σ_k p_k / (Σ_k p_k + Σ_k q_k), draw z_dn ~ Mult(p / Σ_k p_k), visiting only the non-zero entries of C; otherwise draw z_dn ~ Mult(q / Σ_k q_k);
where Mult(A) is the multinomial distribution with vector A as parameter.
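Assuming the standard split of (C_kv + β_v)/(C_k + V·β) into a sparse count term and a dense prior term, the two-bucket sampling with u ~ U(0, 1) described above could be sketched as follows (a reconstruction, with hypothetical names):

```python
import math
import random

def sample_topic(eta, C, v, beta, rng=random):
    """Sample z ~ p(k) ∝ exp(eta[k]) * (C[k][v] + beta) / (C_k + V*beta),
    split into a sparse count bucket (non-zero only where C[k][v] > 0)
    and a dense prior bucket, so non-zero counts are visited only once."""
    K, V = len(C), len(C[0])
    denom = [sum(C[k]) + V * beta for k in range(K)]
    p = {k: math.exp(eta[k]) * C[k][v] / denom[k]
         for k in range(K) if C[k][v] > 0}              # sparse bucket
    q = [math.exp(eta[k]) * beta / denom[k] for k in range(K)]
    P, Q = sum(p.values()), sum(q)
    u = rng.random()
    if u < P / (P + Q):                                  # count bucket
        r = rng.random() * P
        for k, w in p.items():
            r -= w
            if r <= 0:
                return k
        return next(iter(p))
    r = rng.random() * Q                                 # prior bucket
    for k, w in enumerate(q):
        r -= w
        if r <= 0:
            return k
    return K - 1

random.seed(0)
C = [[5, 0], [0, 3], [0, 0]]   # K = 3 topics, V = 2 words
draws = [sample_topic([0.0, 0.0, 0.0], C, v=0, beta=0.01)
         for _ in range(3000)]
print(draws.count(0), draws.count(1), draws.count(2))
```

With these counts, topic 0 dominates for word 0, while topic 1 (which never emitted word 0) is drawn almost only through the small prior bucket.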
c: an auxiliary variable λ_dk is introduced for each dimension of the feature vector η_d;
d: each auxiliary variable is approximately sampled from its conditional distribution given the current feature vector, using a Gaussian distribution or a transformed Polya-Gamma(1, z) distribution, λ_dk ~ PG(1, z_dk), where PG(·, ·) denotes the Polya-Gamma distribution;
e: each dimension of the feature vector is sampled in turn from its conditional distribution given all other dimensions of the feature vector and the auxiliary variables; given the auxiliary variable λ_dk, this conditional distribution is a Gaussian whose mean and variance are determined by the topic counts of the document, λ_dk and the prior parameters μ and Σ;
f: steps d and e are repeated until the number of repetitions reaches the preset loop count S = 8.
g: each compute node computes the mean and covariance of the feature vectors of the documents on that node; from these quantities, the mean and covariance of the feature vectors of all documents in the training set are computed.
h: the mean and covariance of the feature vectors of all documents in the training set are used to compute the parameters of their posterior distribution, and a new mean and covariance are sampled according to these parameters. If the prior distribution of (μ, Σ) is the Normal-Inverse-Wishart distribution NIW(μ_0, κ, ρ, W), then the posterior distribution is NIW(μ', κ', ρ', W'), where
μ' = (κ μ_0 + D x̄) / (κ + D), ρ' = ρ + D, κ' = κ + D,
W'^(−1) = W^(−1) + S̄ + (κ D / (κ + D)) (x̄ − μ_0)(x̄ − μ_0)^T,
with the sample mean x̄ = (1/D) Σ_d η_d and the sample scatter S̄ = Σ_d (η_d − x̄)(η_d − x̄)^T.
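Step h's conjugate update (the fragments ρ' = ρ + D and κ' = κ + D survive in the text) matches the standard Normal-Inverse-Wishart posterior. A one-dimensional sketch of computing these posterior parameters, offered as an assumption about the full multivariate update:

```python
# Normal-Inverse-Wishart posterior parameters from the sample mean and
# scatter of the document feature vectors:
#   mu'   = (kappa*mu0 + D*xbar) / (kappa + D)
#   kappa'= kappa + D,  rho' = rho + D
#   Winv' = Winv + scatter + kappa*D/(kappa+D) * (xbar - mu0)^2
# Pure Python, scalar feature "vectors" for brevity.

def niw_posterior(mu0, kappa, rho, winv, etas):
    D = len(etas)
    xbar = sum(etas) / D
    scatter = sum((e - xbar) ** 2 for e in etas)
    mu_p = (kappa * mu0 + D * xbar) / (kappa + D)
    winv_p = winv + scatter + kappa * D / (kappa + D) * (xbar - mu0) ** 2
    return mu_p, kappa + D, rho + D, winv_p

mu_p, kappa_p, rho_p, winv_p = niw_posterior(
    mu0=0.0, kappa=1.0, rho=3.0, winv=1.0, etas=[1.0, 2.0, 3.0])
print(mu_p, kappa_p, rho_p)  # 1.5 4.0 6.0
```

Sampling the new mean and covariance from NIW(μ', κ', ρ', W') would then follow, e.g. via an inverse-Wishart draw for Σ and a Gaussian draw for μ.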
It should be noted that during topic extraction the compute node updates the count matrix on that node and notifies the parameter server of the updates; the parameter server updates the count matrix stored on the parameter server according to the updates sent by all compute nodes and, after the update is completed, sends the latest version of the updated count matrix to all compute nodes; on receiving this latest version, a compute node updates the count matrix on that node accordingly. In addition, the method provided by the embodiment of the present invention uses a logistic-normal prior, and thereby obtains the correlations between topics.
As can be seen from the above description, the logistic-normal model topic extraction method provided by the embodiment of the present invention uses distributed computation to process large-scale document collections and, by way of data augmentation, obtains an exact Gibbs sampling algorithm, which improves computational efficiency and precision and can increase the speed of topic extraction.
It should be noted that relational terms herein such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device including that element.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by related hardware under program instructions. The aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk or an optical disk.
Finally, it should be noted that the foregoing is merely a preferred embodiment of the present invention, intended only to illustrate the technical scheme of the present invention and not to limit its scope of protection. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be contained within the scope of protection of the present invention.