CN105868186A - Simple and efficient topic extracting method - Google Patents


Info

Publication number
CN105868186A
Authority
CN
China
Prior art keywords
topic
row
column
lexical item
word
Legal status
Pending
Application number
CN201610382578.7A
Other languages
Chinese (zh)
Inventor
朱军 (Jun Zhu)
陈文光 (Wenguang Chen)
陈键飞 (Jianfei Chen)
李恺威 (Kaiwei Li)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201610382578.7A
Publication of CN105868186A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F40/30 Semantic analysis
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention discloses a simple and efficient topic extraction method which can increase the speed of topic extraction. The method comprises the following steps: S1) in the word/document stage, each computing node processes the columns/rows of certain small blocks of a topic matrix, scanning in turn each column/row assigned to that node and performing an acceptance step and a proposal step on it; S2) judging whether the number of iterations has reached a preset constant; if so, stopping the iteration, and if not, incrementing the iteration count by 1 and repeating steps S1 and S2.

Description

Simple and efficient method for extracting topics
Technical field
The present invention relates to the field of data mining technology, and in particular to a simple and efficient method for extracting topics.
Background art
Topic models show clear advantages both in mining the semantic information of documents and in handling complex document structures. Using topic models to mine the semantics and structure of large-scale document collections raises three main problems: the number of documents is enormous, so an efficient algorithm is required; the number of topics to extract and the vocabulary of the data set are both very large, so special optimization is required to save storage space; and the algorithm must be simple to implement, so that more users can adopt it.
The data to which topic models are applied have grown from small text collections to large-scale social networks and even the whole Internet. Traditional single-machine learning methods cannot meet the demands of big data; an algorithm is needed that is fast and can run in a distributed computing environment.
In the prior art, an algorithm combining the Metropolis-Hastings algorithm, model parallelism, and a parameter server can run in time linear in the size of the data set. However, it requires extensive random memory access, so it cannot make full use of the CPU cache, and it must store a huge topic count matrix.
As the above shows, existing topic extraction algorithms are slow, have high storage complexity, and are complicated to implement.
Summary of the invention
In view of this, an embodiment of the present invention provides a simple and efficient method for extracting topics, which can increase the speed of topic extraction.
The embodiment of the present invention proposes a simple and efficient method for extracting topics, comprising:
S1: In the word/document stage, each computing node processes the columns/rows of certain small blocks of the topic matrix, scanning in turn each column/row assigned to that node and performing an acceptance step and a proposal step on each column/row. All lexical items (tokens) of the training data are represented as a sparse matrix in which each document is a row and each word is a column, and each entry of the sparse matrix stores the current topic of the corresponding token together with several proposal topics. The acceptance step and the proposal step alternate between the word stage and the document stage. In the acceptance step, the topic count vector of the column/row is computed from the current topics of its tokens, the acceptance probabilities of the proposal topics are then computed from that count vector, and the current topics of the tokens in the column/row are updated. In the proposal step, new proposal topics are generated from the current topics of the tokens in the column/row.
S2: Judge whether the number of iterations has reached a preset constant; if so, stop iterating; if not, increment the iteration count by 1 and repeat S1 and S2.
The simple and efficient topic extraction method provided by the embodiment of the present invention processes large-scale data by distributed computation across the computing nodes, updating current topics and generating new proposal topics. It can increase the speed of topic extraction, and the algorithm is easy to implement.
Brief description of the drawings
Fig. 1 is a schematic flow chart of one embodiment of the simple and efficient method for extracting topics of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, the present embodiment discloses a simple and efficient method for extracting topics, comprising:
S1: In the word/document stage, each computing node processes the columns/rows of certain small blocks of the topic matrix, scanning in turn each column/row assigned to that node and performing an acceptance step and a proposal step on each column/row. All lexical items (tokens) of the training data are represented as a sparse matrix in which each document is a row and each word is a column, and each entry of the sparse matrix stores the current topic of the corresponding token together with several proposal topics. The acceptance step and the proposal step alternate between the word stage and the document stage. In the acceptance step, the topic count vector of the column/row is computed from the current topics of its tokens, the acceptance probabilities of the proposal topics are then computed from that count vector, and the current topics of the tokens in the column/row are updated. In the proposal step, new proposal topics are generated from the current topics of the tokens in the column/row.
It should be noted that a topic extraction system comprises several computing nodes connected by a high-speed network, such as InfiniBand.
The training data set contains a vocabulary of size V. Each item of the vocabulary is called a word; in addition, each occurrence of a word is called a lexical item, or token.
The training data set contains D documents, the d-th of which contains L_d tokens. The n-th token of document d is denoted w_{dn}, meaning that this token is an occurrence of the w_{dn}-th word of the vocabulary. The difference between a word and a token is that a token is one occurrence of a word at a particular position in a particular document; a word may correspond to many tokens, namely its occurrences at different positions in different articles.
Each token has a current topic z_{dn} and M proposal topics (z'_{d1}, ..., z'_{dM}), where M is a constant set in advance. The set formed by the current topic and the proposal topics of a token is called the topic vector of that token, y_{dn} = (z_{dn}, z'_{d1}, ..., z'_{dM}). Current topics and proposal topics take values in {1, ..., K}, where K is a preset constant giving the number of topics. For each token w_{dn}, its topic vector is stored at position (d, w_{dn}) of a matrix Y, i.e. Y_{d,w} = { y_{dn} | w_{dn} = w, 1 ≤ n ≤ L_d }. Y is a D × V sparse matrix, called the topic matrix. Observe that row d of the topic matrix contains the topic vectors of all tokens in document d, and column w contains the topic vectors of all tokens of word w.
In step S1, the topic matrix is stored in compressed sparse column (CSC) format, i.e. the tokens of each word are stored contiguously, column by column; the topic matrix stored in this format is denoted Y_CSC. In addition, pointers to the tokens are stored in compressed sparse row (CSR) format, i.e. the pointers to all tokens of each document are stored contiguously, row by row; this pointer structure into the topic matrix is denoted Y_CSR.
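To make the layout concrete, here is a minimal Python/NumPy sketch of the two access structures just described; the function and variable names (build_token_layout, doc_ids, word_ids) are illustrative assumptions, not notation from the patent.

    import numpy as np

    def build_token_layout(doc_ids, word_ids, D, V):
        # doc_ids[n], word_ids[n]: document and word of the n-th token.
        # CSC side: tokens grouped word by word, so column w is one contiguous slice.
        csc_order = np.argsort(word_ids, kind="stable")
        col_ptr = np.concatenate(([0], np.cumsum(np.bincount(word_ids, minlength=V))))
        # Position of every token inside the CSC storage.
        pos_in_csc = np.empty(len(csc_order), dtype=np.int64)
        pos_in_csc[csc_order] = np.arange(len(csc_order))
        # CSR side: for each document, pointers to its tokens inside the CSC storage.
        row_ptr = np.concatenate(([0], np.cumsum(np.bincount(doc_ids, minlength=D))))
        doc_token_ptr = pos_in_csc[np.argsort(doc_ids, kind="stable")]
        return csc_order, col_ptr, row_ptr, doc_token_ptr

Column w of the topic matrix is then the contiguous slice col_ptr[w]:col_ptr[w+1] of the token storage (word-stage access), while row d is reached indirectly through doc_token_ptr[row_ptr[d]:row_ptr[d+1]] (document-stage access).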
Before the computation, the topic matrix is cut into as many small blocks per dimension as there are computing nodes. Let the number of computing nodes be P; the topic matrix is then cut into P × P small blocks, and the block in the i-th block-row and j-th block-column is denoted Y^(i,j).
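As a small illustrative sketch (the patent does not fix a particular partitioning rule, so equal-width stripes are assumed here), the block owning a given token can be computed as follows:

    def block_index(d, w, D, V, P):
        # Documents are split into P equal row stripes and words into P column
        # stripes; token (d, w) then falls into block (i, j).
        return (d * P) // D, (w * P) // V

For example, with D = V = 1000 and P = 4, the token (d=120, w=900) lands in block (0, 3).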
The computation is divided into a document stage and a word stage. In the document stage, computing node i processes the i-th row of blocks, Y^(i,1), ..., Y^(i,P). Each computing node processes its documents one by one, where processing document d means reading and writing row d of the topic matrix, reached through the row-contiguous pointers of Y_CSR; the concrete reads and writes are described below. In the word stage, computing node j processes the j-th column of blocks, Y^(1,j), ..., Y^(P,j). Each computing node processes its words one by one, where processing word w means reading and writing the column-contiguous tokens of word w in Y_CSC; the concrete reads and writes are described below.
Both for each row in the document stage and for each column in the word stage, the concrete reads and writes are divided into an acceptance step and a proposal step. The acceptance step comprises:
S110: Compute the topic count vector of this column/row from the current topics of all its tokens. If row d contains L_d tokens with current topics z_{d1}, ..., z_{dL_d}, then the topic count vector of row d is C_d = (C_{d1}, ..., C_{dK}), where C_{dk} is the number of times topic k occurs in the corresponding document, i.e. C_{dk} = |{n ∈ {1, ..., L_d} : z_{dn} = k}|, with |·| denoting the cardinality of a set. Similarly, if column w contains L_w tokens, then the topic count vector of column w is C_w = (C_{w1}, ..., C_{wK}), where C_{wk} is the number of times topic k occurs with the corresponding word, i.e. C_{wk} = |{(d, n) : 1 ≤ d ≤ D, 1 ≤ n ≤ L_d, z_{dn} = k, w_{dn} = w}|.
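S110 is a plain counting pass; a one-line NumPy sketch (assumed names, with topics numbered from 0 rather than from 1 as in the text):

    import numpy as np

    def topic_count_vector(current_topics, K):
        # C[k] = number of tokens in this row/column whose current topic is k.
        return np.bincount(np.asarray(current_topics), minlength=K)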
S111: Compute the acceptance probabilities from the topic count vector of this column/row and the proposal topics, and update the current topics of its tokens.
For each token, let the current topic be k_0 and the proposal topics be k'_1, ..., k'_M. Construct the Metropolis-Hastings chain k_1, ..., k_M, in which k_i = k'_i with probability π_i and k_i = k_{i-1} otherwise.
In the document stage, if the current document is d, the acceptance probability π_i is
π_i = min{ 1, ( (C_{dk'_i} + α_{k'_i}) (C_{k_{i-1}} + Vβ) ) / ( (C_{dk_{i-1}} + α_{k_{i-1}}) (C_{k'_i} + Vβ) ) };
in the word stage, if the current word is the w-th word of the vocabulary, the acceptance probability π_i is
π_i = min{ 1, ( (C_{wk'_i} + β) (C_{k_{i-1}} + Vβ) ) / ( (C_{wk_{i-1}} + β) (C_{k'_i} + Vβ) ) },
where α_1, ..., α_K and β are constants given in advance, and C_k = Σ_{d=1}^{D} C_{dk} is the global topic count vector. Finally, the current topic is updated to k_M.
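The following sketch runs this chain for one token in the document stage; the names (mh_chain_document, C_d, C_glob) and the use of NumPy's random generator are illustrative assumptions. The word-stage variant is identical except that C_d[...] + alpha[...] is replaced by C_w[...] + beta.

    import numpy as np

    def mh_chain_document(k0, proposals, C_d, C_glob, alpha, beta, V, rng):
        # k0: current topic of the token; proposals: its stored z'_1 ... z'_M;
        # C_d: this document's topic counts; C_glob: global topic counts.
        k = k0
        for kp in proposals:
            pi = ((C_d[kp] + alpha[kp]) * (C_glob[k] + V * beta)) / \
                 ((C_d[k] + alpha[k]) * (C_glob[kp] + V * beta))
            if rng.random() < min(1.0, pi):   # accept: the chain moves to z'_i
                k = kp
        return k                              # k_M becomes the new current topic

Called once per token with, for example, rng = np.random.default_rng(0), the returned k_M overwrites the token's current topic.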
The proposal step comprises:
S120: Generate new proposal topics from the current topics of the tokens in this column/row. When processing document d in the document stage, topic k is proposed with probability proportional to C_{dk} + α_k; when processing word w in the word stage, topic k is proposed with probability proportional to C_{wk} + β.
In the present embodiment the new proposal topics can be generated with the alias table method, which is prior art and is not described in detail here; a compact sketch follows.
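The alias table method referred to here is Walker's alias method: built in O(K) from the weights (C_{dk} + α_k in the document stage, C_{wk} + β in the word stage), it afterwards draws each proposal in O(1). A sketch with assumed names:

    import numpy as np

    def build_alias(weights):
        K = len(weights)
        prob = np.asarray(weights, dtype=float) * K / np.sum(weights)
        alias = np.zeros(K, dtype=int)
        small = [k for k in range(K) if prob[k] < 1.0]
        large = [k for k in range(K) if prob[k] >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            alias[s] = l                  # overflow of bucket s is served by l
            prob[l] -= 1.0 - prob[s]
            (small if prob[l] < 1.0 else large).append(l)
        for k in small + large:           # leftovers are (numerically) full buckets
            prob[k] = 1.0
        return prob, alias

    def draw_alias(prob, alias, rng):
        # One uniform bucket choice plus one biased coin flip: O(1) per draw.
        k = rng.integers(len(prob))
        return k if rng.random() < prob[k] else alias[k]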
In each acceptance step of the word/document stage, the acceptance probabilities of several proposal topics can be computed at once, and in each proposal step, several proposal topics can be proposed at once.
In each word/document stage, every computing node processes one column/row of small blocks; after each word/document stage ends, the small blocks on the current computing node are sent to the computing nodes that need them in the next document/word stage.
In addition, to support pipelined iterative computation, each of the aforementioned small blocks can be cut further into B × B particles, where B is a preset constant. In each word/document stage the columns/rows of particles are processed in order, and as soon as a particle has been processed it is sent asynchronously to the computing node that needs it in the next document/word stage; a sketch of the particle indexing follows.
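Under the same equal-stripe assumption as the block partition above (names assumed):

    def particle_index(d, w, d_lo, d_hi, w_lo, w_hi, B):
        # Position of token (d, w) inside the B x B particle grid of the block
        # covering documents [d_lo, d_hi) and words [w_lo, w_hi).
        return ((d - d_lo) * B) // (d_hi - d_lo), ((w - w_lo) * B) // (w_hi - w_lo)

Because a particle is a self-contained piece of a block, a node can overlap communication with computation, shipping each finished particle while it is still sampling the next one.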
S2: Judge whether the number of iterations has reached a preset constant; if so, stop iterating; if not, increment the iteration count by 1 and repeat S1 and S2.
The simple and efficient topic extraction method provided by this embodiment processes large-scale data by distributed computation across the computing nodes, updating current topics and generating new proposal topics. It can increase the speed of topic extraction, and the algorithm is easy to implement.
Optionally, another embodiment of the simple and efficient method for extracting topics of the present invention further comprises:
after each iteration completes, computing the probability of the joint distribution of the tokens and the current topics, and judging the convergence of the algorithm from said probability.
In the embodiment of the present invention, the joint distribution of tokens and topics is scored by its log probability F = f_d + f_k + f_w, where
f_d = Σ_{d=1}^{D} ( log Γ(ᾱ) − log Γ(ᾱ + L_d) ) + Σ_{d=1}^{D} Σ_{k=1}^{K} ( log Γ(α_k + C_{dk}) − log Γ(α_k) ),
f_k = Σ_{k=1}^{K} ( log Γ(Vβ) − log Γ(Vβ + C_k) ),
f_w = Σ_{k=1}^{K} Σ_{w=1}^{V} ( log Γ(β + C_{wk}) − log Γ(β) ),
where Γ(·) is Euler's gamma function and ᾱ = Σ_{k=1}^{K} α_k. f_d can be computed in the document stage, and f_k and f_w in the word stage.
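A dense reference sketch of F with NumPy/SciPy follows. The point made in the next paragraph is precisely that the count matrices need not be stored; this version materializes C_dk and C_wk only for clarity, and the names are assumptions.

    import numpy as np
    from scipy.special import gammaln

    def log_joint(C_dk, C_wk, alpha, beta):
        # C_dk: D x K document-topic counts; C_wk: V x K word-topic counts;
        # alpha: length-K Dirichlet prior; beta: scalar prior.
        V = C_wk.shape[0]
        C_k = C_wk.sum(axis=0)                # global topic counts
        L_d = C_dk.sum(axis=1)                # tokens per document
        a_bar = alpha.sum()
        f_d = np.sum(gammaln(a_bar) - gammaln(a_bar + L_d)) \
            + np.sum(gammaln(alpha + C_dk) - gammaln(alpha))
        f_k = np.sum(gammaln(V * beta) - gammaln(V * beta + C_k))
        f_w = np.sum(gammaln(beta + C_wk) - gammaln(beta))
        return f_d + f_k + f_w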
It should be noted that this algorithm does not need to store the topic count matrices formed by C_{dk} and C_{wk}; it only computes them when they are used, which saves storage space.
To judge whether the algorithm has converged: if the absolute value of the difference between the joint log probability of words and topics computed after one iteration and that computed after the previous iteration is smaller than some small constant, the algorithm has converged; otherwise, it has not converged.
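Put together, the stopping rule is simply the following (the tolerance is an assumed example value):

    def converged(F_curr, F_prev, tol=1e-4):
        # Stop once the joint log probability has essentially stopped changing.
        return abs(F_curr - F_prev) < tol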
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the present invention, and all such modifications and variations fall within the scope defined by the appended claims.

Claims (6)

1. A simple and efficient method for extracting topics, characterized by comprising:
S1: in the word/document stage, each computing node processes the columns/rows of certain small blocks of the topic matrix, scanning in turn each column/row assigned to that node and performing an acceptance step and a proposal step on each column/row, wherein all lexical items (tokens) of the training data are represented as a sparse matrix in which each document is a row and each word is a column, each entry of the sparse matrix stores the current topic of the corresponding token together with several proposal topics, and the acceptance step and the proposal step alternate between the word stage and the document stage; the acceptance step computes the topic count vector of the column/row from the current topics of its tokens, then computes the acceptance probabilities of the proposal topics from that count vector and updates the current topics of the tokens of the column/row; the proposal step generates new proposal topics from the current topics of the tokens of the column/row;
S2: judging whether the number of iterations has reached a preset constant; if so, stopping the iteration; if not, incrementing the iteration count by 1 and repeating S1 and S2.
2. The method according to claim 1, characterized in that said S1 comprises:
in the word/document stage, computing the acceptance probability in each acceptance step and proposing new proposal topics in each proposal step according to the Metropolis-Hastings algorithm, wherein the probability of generating each topic as a new proposal topic is proportional to the sum of the number of times that topic occurs in the corresponding document and the Dirichlet prior of that topic.
3. The method according to claim 2, characterized in that said S1 comprises:
in each acceptance step of the word/document stage, computing the acceptance probabilities of several proposal topics at once, and in each proposal step, proposing several proposal topics at once.
4. The method according to claim 1, characterized in that the sparse matrix stores the tokens contiguously by column and stores the pointers to the tokens contiguously by row; in the word stage, the tokens are accessed through the column-contiguous token storage; in the document stage, the tokens are accessed through the row-contiguous pointers to the tokens.
5. The method according to claim 1, characterized in that said S1 comprises:
cutting the topic matrix into M × M small blocks, processing each column/row of small blocks in order in each word/document stage, and, after each column/row of small blocks has been processed, asynchronously sending the blocks of that column/row to the computing nodes that need them in the next document/word stage, wherein M is a preset constant.
6. The method according to claim 1, characterized by further comprising:
after each iteration completes, computing the probability of the joint distribution of the tokens and the current topics, and judging the convergence of the algorithm from said probability.
CN201610382578.7A, filed 2016-06-01 (priority 2016-06-01): Simple and efficient topic extracting method. Status: Pending. Published as CN105868186A.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201610382578.7A | 2016-06-01 | 2016-06-01 | Simple and efficient topic extracting method


Publications (1)

Publication Number | Publication Date
CN105868186A | 2016-08-17

Family

ID=56676360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610382578.7A Pending CN105868186A (en) 2016-06-01 2016-06-01 Simple and efficient topic extracting method

Country Status (1)

Country Link
CN (1) CN105868186A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150268930A1 (en) * 2012-12-06 2015-09-24 Korea University Research And Business Foundation Apparatus and method for extracting semantic topic
CN103605658A (en) * 2013-10-14 2014-02-26 北京航空航天大学 Search engine system based on text emotion analysis
CN103810282A (en) * 2014-02-19 2014-05-21 清华大学 Logistic-normal model topic extraction method
CN103870447A (en) * 2014-03-11 2014-06-18 北京优捷信达信息科技有限公司 Keyword extracting method based on implied Dirichlet model
CN104462253A (en) * 2014-11-20 2015-03-25 武汉数为科技有限公司 Topic detection or tracking method for network text big data
CN105354333A (en) * 2015-12-07 2016-02-24 天云融创数据科技(北京)有限公司 Topic extraction method based on news text

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jianfei Chen et al.: "WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation", Proceedings of the VLDB Endowment *
沈聚敏 et al.: 《钢筋混凝土有限元与板壳极限分析》 [Finite Elements for Reinforced Concrete and Limit Analysis of Plates and Shells], Tsinghua University Press, 30 November 1993 *
陈平 et al.: "Metropolis-Hastings自适应算法及其应用" [The Adaptive Metropolis-Hastings Algorithm and Its Applications], 《系统工程理论与实践》 [Systems Engineering: Theory & Practice] *

Similar Documents

Publication Publication Date Title
Ryang et al. High utility pattern mining over data streams with sliding window technique
CN105447179B (en) Topic auto recommending method and its system based on microblogging social networks
CN102298579A (en) Scientific and technical literature-oriented model and method for sequencing papers, authors and periodicals
CN104537025A (en) Frequent sequence mining method
CN101339553A (en) Approximate quick clustering and index method for mass data
CN103530402A (en) Method for identifying microblog key users based on improved Page Rank
CN110020435B (en) Method for optimizing text feature selection by adopting parallel binary bat algorithm
Plumecoq et al. From template analysis to generating partitions: I: Periodic orbits, knots and symbolic encodings
CN105069290B (en) A kind of parallelization key node towards consignment data finds method
Cevahir et al. Site-based partitioning and repartitioning techniques for parallel pagerank computation
Baillie et al. Cluster identification algorithms for spin models—Sequential and parallel
CN113159287A (en) Distributed deep learning method based on gradient sparsity
CN109145107A (en) Subject distillation method, apparatus, medium and equipment based on convolutional neural networks
CN105913063A (en) Sparse expression acceleration method for image data set and device
CN109446478A (en) A kind of complex covariance matrix computing system based on iteration and restructural mode
CN105868186A (en) Simple and efficient topic extracting method
Minato et al. Frequent pattern mining and knowledge indexing based on zero-suppressed BDDs
CN107818125A (en) Assessment is iterated by SIMD processor register pair data
Glondu et al. Fast collision detection for fracturing rigid bodies
CN116128701A (en) Device and method for executing graph calculation task
WO2020037512A1 (en) Neural network calculation method and device
US9122997B1 (en) Generating attribute-class-statistics for decision trees
CN112734625B (en) Hardware acceleration system and method based on 3D scene design
CN104268270A (en) Map Reduce based method for mining triangles in massive social network data
Singh et al. RSTDB a new candidate generation and test algorithm for frequent pattern mining

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination