CN105868186A - Simple and efficient topic extracting method - Google Patents


Info

Publication number
CN105868186A
Authority
CN
China
Prior art keywords
topic
row
column
lexical item
word
Legal status
Pending
Application number
CN201610382578.7A
Other languages
Chinese (zh)
Inventor
朱军 (Jun Zhu)
陈文光 (Wenguang Chen)
陈键飞 (Jianfei Chen)
李恺威 (Kaiwei Li)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201610382578.7A
Publication of CN105868186A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F40/30 Semantic analysis
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention discloses a simple and efficient topic extraction method which can increase the speed of topic extraction. The method comprises the following steps: S1) in the word/document stage, each computing node processes the columns/rows of certain small blocks of a topic matrix, scanning in turn each column/row assigned to that node and performing an acceptance step and a proposal step on it; S2) judging whether the number of iterations has reached a preset constant; if so, stopping the iteration, and if not, incrementing the iteration count by 1 and repeating steps S1 and S2.

Description

Simple and efficient method for extracting topics
Technical field
The present invention relates to the field of data mining technology, and in particular to a simple and efficient method for extracting topics.
Background art
Topic models show clear advantages both in mining the semantic information of documents and in handling complex document structures. Using topic models to mine the semantics and structure of large-scale document collections raises three main problems: the number of documents is enormous, so an efficient algorithm is required; the number of topics to extract and the vocabulary of the data set are both very large, so special optimization is required to save storage space; and the algorithm must be simple to implement, so that more users can adopt it.
The data to which topic models are applied have grown from small text collections to large-scale social networks and even the whole Internet. Traditional single-machine learning methods cannot meet the demands of big data; an algorithm is needed that is fast and can run in a distributed computing environment.
In the prior art, an algorithm combining the Metropolis-Hastings algorithm, model parallelism, and a parameter server can run in time linear in the size of the data set. However, it requires extensive random memory access, so it cannot make full use of the CPU cache, and it must store a huge topic count matrix.
As the above shows, existing topic extraction algorithms are slow, have high storage complexity, and are complicated to implement.
Summary of the invention
In view of this, an embodiment of the present invention provides a simple and efficient method for extracting topics, which can increase the speed of topic extraction.
The embodiment of the present invention proposes a simple and efficient method for extracting topics, comprising:
S1: In the word/document stage, each computing node processes the columns/rows of certain small blocks of the topic matrix, scanning in turn each column/row assigned to that node and performing an acceptance step and a proposal step on each column/row. All lexical items (tokens) of the training data are represented as a sparse matrix in which each document is a row and each word is a column, and each entry of the sparse matrix stores the current topic of the corresponding token together with several proposal topics. The acceptance step and the proposal step alternate between the word stage and the document stage. In the acceptance step, the topic count vector of the column/row is computed from the current topics of its tokens, the acceptance probabilities of the proposal topics are then computed from that count vector, and the current topics of the tokens in the column/row are updated. In the proposal step, new proposal topics are generated from the current topics of the tokens in the column/row.
S2: Judge whether the number of iterations has reached a preset constant; if so, stop iterating; if not, increment the iteration count by 1 and repeat S1 and S2.
The simple and efficient topic extraction method provided by the embodiment of the present invention processes large-scale data by distributed computation across the computing nodes, updating current topics and generating new proposal topics. It can increase the speed of topic extraction, and the algorithm is easy to implement.
Brief description of the drawings
Fig. 1 is a schematic flow chart of one embodiment of the simple and efficient method for extracting topics of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, the present embodiment discloses a simple and efficient method for extracting topics, comprising:
S1: In the word/document stage, each computing node processes the columns/rows of certain small blocks of the topic matrix, scanning in turn each column/row assigned to that node and performing an acceptance step and a proposal step on each column/row. All lexical items (tokens) of the training data are represented as a sparse matrix in which each document is a row and each word is a column, and each entry of the sparse matrix stores the current topic of the corresponding token together with several proposal topics. The acceptance step and the proposal step alternate between the word stage and the document stage. In the acceptance step, the topic count vector of the column/row is computed from the current topics of its tokens, the acceptance probabilities of the proposal topics are then computed from that count vector, and the current topics of the tokens in the column/row are updated. In the proposal step, new proposal topics are generated from the current topics of the tokens in the column/row.
It should be noted that a topic extraction system comprises several computing nodes connected by a high-speed network, such as InfiniBand.
The training data set contains a vocabulary of size V. Each item of the vocabulary is called a word; in addition, each occurrence of a word is called a lexical item, or token.
The training data set contains D documents, the d-th of which contains L_d tokens. The n-th token of document d is denoted w_{dn}, meaning that this token is an occurrence of the w_{dn}-th word of the vocabulary. The difference between a word and a token is that a token is one occurrence of a word at a particular position in a particular document; a word may correspond to many tokens, namely its occurrences at different positions in different articles.
Each token has a current topic z_{dn} and M proposal topics (z'_{d1}, ..., z'_{dM}), where M is a constant set in advance. The set formed by the current topic and the proposal topics of a token is called the topic vector of that token, y_{dn} = (z_{dn}, z'_{d1}, ..., z'_{dM}). Current topics and proposal topics take values in {1, ..., K}, where K is a preset constant giving the number of topics. For each token w_{dn}, its topic vector is stored at position (d, w_{dn}) of a matrix Y, i.e. Y_{d,w} = { y_{dn} | w_{dn} = w, 1 ≤ n ≤ L_d }. Y is a D × V sparse matrix, called the topic matrix. Observe that row d of the topic matrix contains the topic vectors of all tokens in document d, and column w contains the topic vectors of all tokens of word w.
In step S1, the topic matrix is stored in compressed sparse column (CSC) format, i.e. the tokens of each word are stored contiguously, column by column; the topic matrix stored in this format is denoted Y_CSC. In addition, pointers to the tokens are stored in compressed sparse row (CSR) format, i.e. the pointers to all tokens of each document are stored contiguously, row by row; this pointer structure into the topic matrix is denoted Y_CSR.
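To make the layout concrete, here is a minimal Python/NumPy sketch of the two access structures just described; the function and variable names (build_token_layout, doc_ids, word_ids) are illustrative assumptions, not notation from the patent.

    import numpy as np

    def build_token_layout(doc_ids, word_ids, D, V):
        # doc_ids[n], word_ids[n]: document and word of the n-th token.
        # CSC side: tokens grouped word by word, so column w is one contiguous slice.
        csc_order = np.argsort(word_ids, kind="stable")
        col_ptr = np.concatenate(([0], np.cumsum(np.bincount(word_ids, minlength=V))))
        # Position of every token inside the CSC storage.
        pos_in_csc = np.empty(len(csc_order), dtype=np.int64)
        pos_in_csc[csc_order] = np.arange(len(csc_order))
        # CSR side: for each document, pointers to its tokens inside the CSC storage.
        row_ptr = np.concatenate(([0], np.cumsum(np.bincount(doc_ids, minlength=D))))
        doc_token_ptr = pos_in_csc[np.argsort(doc_ids, kind="stable")]
        return csc_order, col_ptr, row_ptr, doc_token_ptr

Column w of the topic matrix is then the contiguous slice col_ptr[w]:col_ptr[w+1] of the token storage (word-stage access), while row d is reached indirectly through doc_token_ptr[row_ptr[d]:row_ptr[d+1]] (document-stage access).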
Before the computation, the topic matrix is cut into as many small blocks per dimension as there are computing nodes. Let the number of computing nodes be P; the topic matrix is then cut into P × P small blocks, and the block in the i-th block-row and j-th block-column is denoted Y^(i,j).
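As a small illustrative sketch (the patent does not fix a particular partitioning rule, so equal-width stripes are assumed here), the block owning a given token can be computed as follows:

    def block_index(d, w, D, V, P):
        # Documents are split into P equal row stripes and words into P column
        # stripes; token (d, w) then falls into block (i, j).
        return (d * P) // D, (w * P) // V

For example, with D = V = 1000 and P = 4, the token (d=120, w=900) lands in block (0, 3).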
The computation is divided into a document stage and a word stage. In the document stage, computing node i processes the i-th row of blocks, Y^(i,1), ..., Y^(i,P). Each computing node processes its documents one by one, where processing document d means reading and writing row d of the topic matrix, reached through the row-contiguous pointers of Y_CSR; the concrete reads and writes are described below. In the word stage, computing node j processes the j-th column of blocks, Y^(1,j), ..., Y^(P,j). Each computing node processes its words one by one, where processing word w means reading and writing the column-contiguous tokens of word w in Y_CSC; the concrete reads and writes are described below.
Both for each row in the document stage and for each column in the word stage, the concrete reads and writes are divided into an acceptance step and a proposal step. The acceptance step comprises:
S110: Compute the topic count vector of this column/row from the current topics of all its tokens. If row d contains L_d tokens with current topics z_{d1}, ..., z_{dL_d}, then the topic count vector of row d is C_d = (C_{d1}, ..., C_{dK}), where C_{dk} is the number of times topic k occurs in the corresponding document, i.e. C_{dk} = |{n ∈ {1, ..., L_d} : z_{dn} = k}|, with |·| denoting the cardinality of a set. Similarly, if column w contains L_w tokens, then the topic count vector of column w is C_w = (C_{w1}, ..., C_{wK}), where C_{wk} is the number of times topic k occurs with the corresponding word, i.e. C_{wk} = |{(d, n) : 1 ≤ d ≤ D, 1 ≤ n ≤ L_d, z_{dn} = k, w_{dn} = w}|.
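S110 is a plain counting pass; a one-line NumPy sketch (assumed names, with topics numbered from 0 rather than from 1 as in the text):

    import numpy as np

    def topic_count_vector(current_topics, K):
        # C[k] = number of tokens in this row/column whose current topic is k.
        return np.bincount(np.asarray(current_topics), minlength=K)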
S111: Compute the acceptance probabilities from the topic count vector of this column/row and the proposal topics, and update the current topics of its tokens.
For each token, let the current topic be k_0 and the proposal topics be k'_1, ..., k'_M. Construct the Metropolis-Hastings chain k_1, ..., k_M, in which k_i = k'_i with probability π_i and k_i = k_{i-1} otherwise.
In the document stage, if the current document is d, the acceptance probability π_i is
π_i = min{ 1, ( (C_{dk'_i} + α_{k'_i}) (C_{k_{i-1}} + Vβ) ) / ( (C_{dk_{i-1}} + α_{k_{i-1}}) (C_{k'_i} + Vβ) ) };
in the word stage, if the current word is the w-th word of the vocabulary, the acceptance probability π_i is
π_i = min{ 1, ( (C_{wk'_i} + β) (C_{k_{i-1}} + Vβ) ) / ( (C_{wk_{i-1}} + β) (C_{k'_i} + Vβ) ) },
where α_1, ..., α_K and β are constants given in advance, and C_k = Σ_{d=1}^{D} C_{dk} is the global topic count vector. Finally, the current topic is updated to k_M.
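The following sketch runs this chain for one token in the document stage; the names (mh_chain_document, C_d, C_glob) and the use of NumPy's random generator are illustrative assumptions. The word-stage variant is identical except that C_d[...] + alpha[...] is replaced by C_w[...] + beta.

    import numpy as np

    def mh_chain_document(k0, proposals, C_d, C_glob, alpha, beta, V, rng):
        # k0: current topic of the token; proposals: its stored z'_1 ... z'_M;
        # C_d: this document's topic counts; C_glob: global topic counts.
        k = k0
        for kp in proposals:
            pi = ((C_d[kp] + alpha[kp]) * (C_glob[k] + V * beta)) / \
                 ((C_d[k] + alpha[k]) * (C_glob[kp] + V * beta))
            if rng.random() < min(1.0, pi):   # accept: the chain moves to z'_i
                k = kp
        return k                              # k_M becomes the new current topic

Called once per token with, for example, rng = np.random.default_rng(0), the returned k_M overwrites the token's current topic.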
The proposal step comprises:
S120: Generate new proposal topics from the current topics of the tokens in this column/row. When processing document d in the document stage, topic k is proposed with probability proportional to C_{dk} + α_k; when processing word w in the word stage, topic k is proposed with probability proportional to C_{wk} + β.
In the present embodiment the new proposal topics can be generated with the alias table method, which is prior art and is not described in detail here; a compact sketch follows.
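The alias table method referred to here is Walker's alias method: built in O(K) from the weights (C_{dk} + α_k in the document stage, C_{wk} + β in the word stage), it afterwards draws each proposal in O(1). A sketch with assumed names:

    import numpy as np

    def build_alias(weights):
        K = len(weights)
        prob = np.asarray(weights, dtype=float) * K / np.sum(weights)
        alias = np.zeros(K, dtype=int)
        small = [k for k in range(K) if prob[k] < 1.0]
        large = [k for k in range(K) if prob[k] >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            alias[s] = l                  # overflow of bucket s is served by l
            prob[l] -= 1.0 - prob[s]
            (small if prob[l] < 1.0 else large).append(l)
        for k in small + large:           # leftovers are (numerically) full buckets
            prob[k] = 1.0
        return prob, alias

    def draw_alias(prob, alias, rng):
        # One uniform bucket choice plus one biased coin flip: O(1) per draw.
        k = rng.integers(len(prob))
        return k if rng.random() < prob[k] else alias[k]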
In each acceptance step of the word/document stage, the acceptance probabilities of several proposal topics can be computed at once, and in each proposal step, several proposal topics can be proposed at once.
In each word/document stage, every computing node processes one column/row of small blocks; after each word/document stage ends, the small blocks on the current computing node are sent to the computing nodes that need them in the next document/word stage.
In addition, to support pipelined iterative computation, each of the aforementioned small blocks can be cut further into B × B particles, where B is a preset constant. In each word/document stage the columns/rows of particles are processed in order, and as soon as a particle has been processed it is sent asynchronously to the computing node that needs it in the next document/word stage; a sketch of the particle indexing follows.
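Under the same equal-stripe assumption as the block partition above (names assumed):

    def particle_index(d, w, d_lo, d_hi, w_lo, w_hi, B):
        # Position of token (d, w) inside the B x B particle grid of the block
        # covering documents [d_lo, d_hi) and words [w_lo, w_hi).
        return ((d - d_lo) * B) // (d_hi - d_lo), ((w - w_lo) * B) // (w_hi - w_lo)

Because a particle is a self-contained piece of a block, a node can overlap communication with computation, shipping each finished particle while it is still sampling the next one.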
S2: Judge whether the number of iterations has reached a preset constant; if so, stop iterating; if not, increment the iteration count by 1 and repeat S1 and S2.
The simple and efficient topic extraction method provided by this embodiment processes large-scale data by distributed computation across the computing nodes, updating current topics and generating new proposal topics. It can increase the speed of topic extraction, and the algorithm is easy to implement.
Optionally, another embodiment of the simple and efficient method for extracting topics of the present invention further comprises:
after each iteration completes, computing the probability of the joint distribution of the tokens and the current topics, and judging the convergence of the algorithm from said probability.
In the embodiment of the present invention, the joint distribution of tokens and topics is scored by its log probability F = f_d + f_k + f_w, where
f_d = Σ_{d=1}^{D} ( log Γ(ᾱ) − log Γ(ᾱ + L_d) ) + Σ_{d=1}^{D} Σ_{k=1}^{K} ( log Γ(α_k + C_{dk}) − log Γ(α_k) ),
f_k = Σ_{k=1}^{K} ( log Γ(Vβ) − log Γ(Vβ + C_k) ),
f_w = Σ_{k=1}^{K} Σ_{w=1}^{V} ( log Γ(β + C_{wk}) − log Γ(β) ),
where Γ(·) is Euler's gamma function and ᾱ = Σ_{k=1}^{K} α_k. f_d can be computed in the document stage, and f_k and f_w in the word stage.
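A dense reference sketch of F with NumPy/SciPy follows. The point made in the next paragraph is precisely that the count matrices need not be stored; this version materializes C_dk and C_wk only for clarity, and the names are assumptions.

    import numpy as np
    from scipy.special import gammaln

    def log_joint(C_dk, C_wk, alpha, beta):
        # C_dk: D x K document-topic counts; C_wk: V x K word-topic counts;
        # alpha: length-K Dirichlet prior; beta: scalar prior.
        V = C_wk.shape[0]
        C_k = C_wk.sum(axis=0)                # global topic counts
        L_d = C_dk.sum(axis=1)                # tokens per document
        a_bar = alpha.sum()
        f_d = np.sum(gammaln(a_bar) - gammaln(a_bar + L_d)) \
            + np.sum(gammaln(alpha + C_dk) - gammaln(alpha))
        f_k = np.sum(gammaln(V * beta) - gammaln(V * beta + C_k))
        f_w = np.sum(gammaln(beta + C_wk) - gammaln(beta))
        return f_d + f_k + f_w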
It should be noted that this algorithm does not need to store the topic count matrices formed by C_{dk} and C_{wk}; it only computes them when they are used, which saves storage space.
To judge whether the algorithm has converged: if the absolute value of the difference between the joint log probability of words and topics computed after one iteration and that computed after the previous iteration is smaller than some small constant, the algorithm has converged; otherwise, it has not converged.
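Put together, the stopping rule is simply the following (the tolerance is an assumed example value):

    def converged(F_curr, F_prev, tol=1e-4):
        # Stop once the joint log probability has essentially stopped changing.
        return abs(F_curr - F_prev) < tol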
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the present invention, and all such modifications and variations fall within the scope defined by the appended claims.

Claims (6)

1. A simple and efficient method for extracting topics, characterized by comprising:
S1: in the word/document stage, each computing node processes the columns/rows of certain small blocks of the topic matrix, scanning in turn each column/row assigned to that node and performing an acceptance step and a proposal step on each column/row, wherein all lexical items (tokens) of the training data are represented as a sparse matrix in which each document is a row and each word is a column, each entry of the sparse matrix stores the current topic of the corresponding token together with several proposal topics, and the acceptance step and the proposal step alternate between the word stage and the document stage; the acceptance step computes the topic count vector of the column/row from the current topics of its tokens, then computes the acceptance probabilities of the proposal topics from that count vector and updates the current topics of the tokens of the column/row; the proposal step generates new proposal topics from the current topics of the tokens of the column/row;
S2: judging whether the number of iterations has reached a preset constant; if so, stopping the iteration; if not, incrementing the iteration count by 1 and repeating S1 and S2.
2. The method according to claim 1, characterized in that said S1 comprises:
in the word/document stage, computing the acceptance probability in each acceptance step and proposing new proposal topics in each proposal step according to the Metropolis-Hastings algorithm, wherein the probability of generating each topic as a new proposal topic is proportional to the sum of the number of times that topic occurs in the corresponding document and the Dirichlet prior of that topic.
3. The method according to claim 2, characterized in that said S1 comprises:
in each acceptance step of the word/document stage, computing the acceptance probabilities of several proposal topics at once, and in each proposal step, proposing several proposal topics at once.
4. The method according to claim 1, characterized in that the sparse matrix stores the tokens contiguously by column and stores the pointers to the tokens contiguously by row; in the word stage, the tokens are accessed through the column-contiguous token storage; in the document stage, the tokens are accessed through the row-contiguous pointers to the tokens.
5. The method according to claim 1, characterized in that said S1 comprises:
cutting the topic matrix into M × M small blocks, processing each column/row of small blocks in order in each word/document stage, and, after each column/row of small blocks has been processed, asynchronously sending the blocks of that column/row to the computing nodes that need them in the next document/word stage, wherein M is a preset constant.
6. The method according to claim 1, characterized by further comprising:
after each iteration completes, computing the probability of the joint distribution of the tokens and the current topics, and judging the convergence of the algorithm from said probability.
CN201610382578.7A, filed 2016-06-01 (priority 2016-06-01): Simple and efficient topic extracting method. Status: Pending. Published as CN105868186A.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201610382578.7A | 2016-06-01 | 2016-06-01 | Simple and efficient topic extracting method


Publications (1)

Publication Number | Publication Date
CN105868186A | 2016-08-17

Family

ID=56676360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610382578.7A Pending CN105868186A (en) 2016-06-01 2016-06-01 Simple and efficient topic extracting method

Country Status (1)

Country Link
CN (1) CN105868186A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150268930A1 (en) * 2012-12-06 2015-09-24 Korea University Research And Business Foundation Apparatus and method for extracting semantic topic
CN103605658A (en) * 2013-10-14 2014-02-26 北京航空航天大学 Search engine system based on text emotion analysis
CN103810282A (en) * 2014-02-19 2014-05-21 清华大学 Logistic-normal model topic extraction method
CN103870447A (en) * 2014-03-11 2014-06-18 北京优捷信达信息科技有限公司 Keyword extracting method based on implied Dirichlet model
CN104462253A (en) * 2014-11-20 2015-03-25 武汉数为科技有限公司 Topic detection or tracking method for network text big data
CN105354333A (en) * 2015-12-07 2016-02-24 天云融创数据科技(北京)有限公司 Topic extraction method based on news text

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jianfei Chen et al.: "WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation", Proceedings of the VLDB Endowment *
沈聚敏 et al.: 《钢筋混凝土有限元与板壳极限分析》 [Finite Elements for Reinforced Concrete and Limit Analysis of Plates and Shells], Tsinghua University Press, 30 November 1993 *
陈平 et al.: "Metropolis-Hastings自适应算法及其应用" [The Adaptive Metropolis-Hastings Algorithm and Its Applications], 《系统工程理论与实践》 [Systems Engineering: Theory & Practice] *

Similar Documents

Publication Publication Date Title
Ryang et al. High utility pattern mining over data streams with sliding window technique
CN105447179B (en) Topic auto recommending method and its system based on microblogging social networks
CN102298579A (en) Scientific and technical literature-oriented model and method for sequencing papers, authors and periodicals
CN104537025A (en) Frequent sequence mining method
CN101339553A (en) Approximate quick clustering and index method for mass data
CN103530402A (en) Method for identifying microblog key users based on improved Page Rank
CN110020435B (en) Method for optimizing text feature selection by adopting parallel binary bat algorithm
Plumecoq et al. From template analysis to generating partitions: I: Periodic orbits, knots and symbolic encodings
CN105069290B (en) A kind of parallelization key node towards consignment data finds method
Cevahir et al. Site-based partitioning and repartitioning techniques for parallel pagerank computation
Baillie et al. Cluster identification algorithms for spin models—Sequential and parallel
CN113159287A (en) Distributed deep learning method based on gradient sparsity
CN109145107A (en) Subject distillation method, apparatus, medium and equipment based on convolutional neural networks
CN105913063A (en) Sparse expression acceleration method for image data set and device
CN109446478A (en) A kind of complex covariance matrix computing system based on iteration and restructural mode
CN105868186A (en) Simple and efficient topic extracting method
Minato et al. Frequent pattern mining and knowledge indexing based on zero-suppressed BDDs
CN107818125A (en) Assessment is iterated by SIMD processor register pair data
Glondu et al. Fast collision detection for fracturing rigid bodies
CN116128701A (en) Device and method for executing graph calculation task
WO2020037512A1 (en) Neural network calculation method and device
US9122997B1 (en) Generating attribute-class-statistics for decision trees
CN112734625B (en) Hardware acceleration system and method based on 3D scene design
CN104268270A (en) Map Reduce based method for mining triangles in massive social network data
Singh et al. RSTDB a new candidate generation and test algorithm for frequent pattern mining

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination