CN106570088A

CN106570088A - Discovering and evolution tracking method for scientific research document topics

Info

Publication number: CN106570088A
Application number: CN201610913510.7A
Authority: CN
Inventors: 周厚奎; 于慧敏
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2016-10-20
Filing date: 2016-10-20
Publication date: 2017-04-19

Abstract

The invention discloses a method for discovering scientific research document topics and tracking their evolution. The method comprises: downloading the scientific research documents of a given discipline, organizing the resulting document metadata, and preprocessing it to obtain a document metadata set; extracting topics with a topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions; dividing the extracted topics along the time axis to form sub-topics in different time periods; and finally calculating the correlation between topics and tracking the topic evolution paths to obtain an evolution diagram of the research topics. Because the topics are discovered using both the document text and the citation information, they are of higher quality and better reflect actual conditions. The method discovers important research topics and tracks how they evolve over time, so that scientific and technical personnel can quickly grasp the research topics and their development.

Description

A method for discovering scientific research document topics and tracking their evolution

Technical field

The present invention relates to knowledge discovery and data mining technology in the field of scientific research, and more particularly to a method for discovering scientific research document topics and tracking their evolution.

Background art

Scientific documents record important academic research results and are the carriers of academic dissemination and exchange. Scientific results are cumulative: most of them are sustained improvements built on the work of predecessors. With the emergence of electronic literature index databases such as PubMed and DBLP and the development of the Internet, the number of accumulated scientific documents keeps growing. Faced with this flood of documents, researchers, and newcomers in particular, urgently want to find the important research topics of their own field quickly and to track how those topics evolve over time. Automatic research topic discovery and evolution tracking can help researchers quickly grasp how research topics develop and change, and therefore has significant practical value.

Current topic discovery models, in China and abroad, are developed from the LDA topic model^[1]. LDA is a three-layer document-topic-word model whose main assumption is the exchangeability of documents and of words, i.e. the "bag of documents" and "bag of words" assumptions. LDA regards each document in the corpus as a distribution over latent topic variables and each topic as a distribution over words; both distributions have Dirichlet priors. Compared with the PLSA model^[2], LDA is a fully Bayesian model and estimates unseen documents and words more accurately; PLSA can be regarded as a MAP estimate of LDA and is prone to over-fitting. Improvements to LDA so far fall into three areas: (1) modelling the relations between documents or between words, so that documents and words are no longer exchangeable^[3-8]; (2) learning the number of topics adaptively by introducing non-parametric Bayesian models^[9-12]; (3) introducing additional information besides the text to achieve supervised or semi-supervised learning and improve the performance of the topic discovery model^[13-16]. From another angle, classifying by whether the model structure is hierarchical and whether learning is supervised, topic models fall into four classes: 1) unsupervised, non-hierarchical topic models; 2) unsupervised, hierarchical topic models; 3) supervised, non-hierarchical topic models; 4) supervised, hierarchical topic models.

According to how time is divided during topic evolution, existing research topic evolution analysis methods fall into two broad classes: discrete-time topic evolution methods and continuous-time topic evolution methods.

The general process of a discrete-time topic evolution method is as follows: (1) the text corpus is divided into subsets according to its time stamps; (2) topics are extracted from each subset with a probabilistic topic model; (3) evolution relations between the topics of different subsets are established according to a topic correlation measure; (4) the topic evolution diagram is formed. Depending on the probabilistic topic model used, these methods fall into two classes. The first uses parametric Bayesian models, in which the number of topics is fixed, such as TTM (Temporal Text Mining)^[17], DTM (Dynamic Topic Model)^[18] and MTTM (Multiscale Topic Tomography Model)^[19]. The second uses non-parametric Bayesian models, in which the number of topics is not fixed, such as TDPM (Temporal Dirichlet Process Mixture Model)^[20] and iDTM (infinite Dynamic Topic Model)^[21]. Discrete-time topic evolution models require the document set to be divided by time. Such a manual division is hard to make scientifically accurate, because different types of documents may call for different divisions, and this often affects the final topic evolution result. To address this problem, some scholars have proposed a different approach that treats time as a variable when the topics are modelled, so that each resulting topic is a distribution over both words and time. Such continuous-time models mainly include Topics Over Time (TOT)^[22], the continuous time Dynamic Topic Model (cDTM)^[23], the Trend Analysis Model (TAM)^[24] and non-parametric Topics Over Time (npTOT)^[25].

The overwhelming majority of existing research topic discovery models do not make full use of the multi-source structural information in scientific documents (such as content, citations, authors and source journals) when discovering research topics. To address this, the method of the present invention uses both the citation information and the content information of scientific documents, and achieves better results than existing methods that use only a single type of information. In addition, the topic evolution analysis produced by existing topic evolution models mostly concerns how the same topic evolves over different time periods, whereas few existing techniques analyze the evolution between different topics across time periods. To address this, the present invention tracks how the different research topics of a field evolve over time. Another difference from the prior art is that the topic evolution analysis of the present invention first extracts the topics and then divides them by time, which avoids the topic alignment problem that arises when the corpus is first discretized and the topics are extracted afterwards.

Taking scientific documents as the object of study, discovering important research topics and tracking their evolution is of great significance for knowledge discovery and data mining on scientific documents, and plays an important role in helping researchers carry out their work and in promoting the development of scientific research.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of existing research topic discovery and evolution tracking techniques by providing a method for discovering research topics and tracking their evolution. The method uses both the citation and the content information of scientific documents to discover research topics and tracks the evolution between different research topics. It achieves better topic discovery results than existing methods and attains the goal of tracking the evolution paths between different research topics.

To solve the above technical problem, the invention provides a method for discovering scientific research document topics and tracking their evolution, comprising the following steps:

A1. Download the scientific documents of a given discipline and organize the resulting document metadata;

A2. Preprocess the document data to form a document data set;

A3. Extract topics with a topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions;

A4. Divide the discovered topics along the time axis to form sub-topics in different time periods;

A5. Measure the correlation between sub-topics and track the topic evolution paths.

The metadata obtained for each document in step A1 comprises: the document ID (IDs are assigned in order of publication time), the publication time of the document, the content of the document, and the citation relationship matrix of the documents.

The preprocessing in step A2 comprises: removing stop words, digits and non-English characters; stemming the words; removing low-frequency words that occur fewer than 5 times across all documents; building the document-term frequency matrix of the data set; building the vocabulary; and building the citation relationship matrix of the documents.
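The text-side part of this preprocessing can be sketched as follows. This is a minimal illustration only: the stop-word list and the `naive_stem` helper are hypothetical stand-ins (the source names neither a stop-word list nor a stemming algorithm), and `min_count` corresponds to the threshold of 5 occurrences.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the source names no list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "on", "with"}

def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer (hypothetical helper)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(texts, min_count=5):
    """Return (vocabulary, document-term frequency matrix) for raw texts."""
    tokenized = []
    for text in texts:
        # Keep only runs of English letters: drops digits and non-English characters.
        words = re.findall(r"[a-z]+", text.lower())
        tokenized.append([naive_stem(w) for w in words if w not in STOP_WORDS])
    totals = Counter(w for doc in tokenized for w in doc)
    # Drop words occurring fewer than min_count times across all documents.
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in tokenized]
    for i, doc in enumerate(tokenized):
        for w in doc:
            if w in index:
                matrix[i][index[w]] += 1
    return vocab, matrix
```

Row i of the returned matrix holds the term frequencies of document i over the retained vocabulary; a production pipeline would likely substitute an established stemmer and stop-word list.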

Step A3 specifically comprises:

A31. A document citation matrix [M] m*m is built from the citation relationships between the documents of the data set, where m is the number of documents that have citation relationships. Matrix M is row-normalized and then decomposed by non-negative matrix factorization into two non-negative matrices [B] m*z and [H] z*m, i.e. M=B*H.

A32. Matrices B and H are each row-normalized to obtain matrices C and M. Each element c_i,j of matrix C represents the probability that topic (cluster) i contains cited document j, and each element m_i,j of matrix M represents the probability that citing document i belongs to topic (cluster) j.
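Steps A31 and A32 can be illustrated with a small dependency-free sketch. The source does not say which non-negative matrix factorization algorithm is used; the multiplicative update rules of Lee and Seung are assumed here, and all function names are hypothetical.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def row_normalize(A):
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    out = []
    for row in A:
        s = sum(row)
        out.append([x / s for x in row] if s > 0 else list(row))
    return out

def nmf(M, z, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative M (m x n) into B (m x z) and H (z x n) with M ~ B*H.

    ASSUMPTION: multiplicative updates are used; the source only says
    "non-negative matrix factorization".
    """
    rng = random.Random(seed)
    m, n = len(M), len(M[0])
    B = [[rng.random() + eps for _ in range(z)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(z)]
    for _ in range(iterations):
        # H <- H * (B^T M) / (B^T B H), element-wise
        Bt = transpose(B)
        num, den = matmul(Bt, M), matmul(matmul(Bt, B), H)
        H = [[h * a / (d + eps) for h, a, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # B <- B * (M H^T) / (B H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(M, Ht), matmul(B, matmul(H, Ht))
        B = [[b * a / (d + eps) for b, a, d in zip(br, nr, dr)]
             for br, nr, dr in zip(B, num, den)]
    return B, H
```

Applying `row_normalize` to the citation matrix before `nmf`, and again to the returned B and H, gives each document a distribution over clusters and each cluster a distribution over cited documents. For matrices of realistic size a numerical library would be the practical choice.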

A33. For each topic c_i,j generated in step A32, an LDA probabilistic topic model based on the "bag of words" assumption is built from the content of the documents that make up the topic. A topic is regarded as a distribution over the word set. The generative process is: a topic z_d,n is drawn from the document-topic distribution θ_d, and a word of the document is then drawn from the topic-word distribution φ of that topic. The model parameters, namely the topic-word distribution φ and the document-topic distribution θ_j,k, are estimated by Gibbs sampling, where θ_d ~ Dir(α) and φ likewise has a Dirichlet prior. The resulting parameters φ and θ_j,k of the topic model constitute the topics.

The division of topics along the time axis in step A4 mainly uses the time information of the documents belonging to each topic: each topic is divided over the different time periods, forming the sub-topics of each period. The time division scheme is as follows: given the number of time periods P, the start time t₀ and the end time t_s of the documents, the length of each time period is (t_s-t₀)/P.
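The division scheme can be illustrated with a short sketch. The function names are hypothetical, and treating periods as half-open intervals, with the end time assigned to the last period, is an assumption the source does not spell out.

```python
def period_index(year, t0, ts, P):
    """0-based index of the time period that contains `year`."""
    length = (ts - t0) / P                 # period length (t_s - t_0) / P
    idx = int((year - t0) / length)
    return min(max(idx, 0), P - 1)         # the end time falls in the last period

def split_topic(doc_years, t0, ts, P):
    """Group a topic's document ids by period, giving one sub-topic per period."""
    periods = [[] for _ in range(P)]
    for pid, year in doc_years.items():
        periods[period_index(year, t0, ts, P)].append(pid)
    return periods
```

For example, `split_topic({1: 1995, 2: 1998, 3: 2012}, 1995, 2012, 6)` places document 1 in the first period, document 2 in the second and document 3 in the last.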

The topic evolution analysis in step A5 specifically comprises:

A51. For two topics z_i and z_j in any two adjacent time periods, the word distribution of each topic and the core-document distribution of each topic are used to calculate the correlation between the two topics.

A52. The correlation measure of step A51 is computed for every pair of topics in adjacent time periods. A directed edge is created between two topics whose measure exceeds a given threshold, with the direction of the edge determined by the temporal order of the topics. The evolution relationship graph between topics is built in this way.

Beneficial effects of the present invention: Because scientific documents contain rich structural information, the present invention uses both their text and their citation information to discover research topics, uses the time information of the core papers of the extracted topics to divide the topics, and thereby tracks topic evolution. Compared with traditional research topic discovery and evolution tracking, the invention uses the text and the citation information of the documents together, so the topics obtained are of higher quality and better match reality. In addition, the corpus does not have to be divided first; topics are extracted first and divided afterwards, which avoids the topic alignment problem. The research topic discovery and evolution tracking provided by the embodiments of the invention discovers important research topics and tracks how they evolve over time, helping scientific and technical personnel quickly grasp the research topics and the course of their evolution.

Description of the drawings

Fig. 1 is a flow chart of an embodiment of the scientific document topic discovery and evolution tracking of the present invention;

Fig. 2 shows the scheme for dividing topics along the time axis in the present embodiment;

Fig. 3 shows the evolution analysis of the 30 topics in the present embodiment;

Fig. 4 compares the perplexity of LDA, citation-based LDA and the topic discovery method of the present invention in the present embodiment.

Detailed description of the embodiments

The method proposed by the present invention comprises several steps: acquisition and organization of scientific documents, preprocessing of the document information, research topic discovery based on multi-source information, topic division, and topic evolution tracking. Acquisition and organization obtains a certain quantity of scientific document data and organizes it into a document corpus. Preprocessing removes stop words and low-frequency words and derives from the corpus the document-term frequency matrix, the vocabulary and the document citation relationship matrix. Topic discovery based on multi-source information discovers the research topics and yields, for each topic, its topic-word distribution and its topic-document distribution. Topic division divides the extracted topics along the time axis to form sub-topics in different time periods. Topic evolution tracking mainly comprises measuring the relations between sub-topics in adjacent time periods and building the topic evolution diagram.

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The embodiments described with reference to the drawings are exemplary; they serve only to explain the invention and do not limit the claims.

Fig. 1 is a flow chart of a specific embodiment of the scientific document topic discovery and evolution tracking of the present invention. As shown in Fig. 1, the workflow of this embodiment comprises the following steps:

A1: Download the scientific documents of a given discipline and organize the resulting document metadata.

In this step, all papers (except editorials) published from January 1995 to September 2012 in IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI), a top international journal in pattern recognition and image processing, are downloaded, giving 2719 research documents in total. Each collected record is organized into document metadata comprising the document ID pid, the publication time year (accurate to the year), the content text (title, keywords and abstract only) and the citation sequence cit (the references that themselves belong to the downloaded collection). Once the records of all 2719 downloaded documents have been organized into document metadata, the method proceeds to step A2.

A2: Preprocess the document metadata obtained in A1 to obtain the document metadata set.

In this step, the document metadata set obtained in A1 is preprocessed, including removing stop words and removing low-frequency words that occur fewer than 5 times across all documents. Preprocessing yields a vocabulary V of 881 terms, the document-term frequency matrix D=[d_ij]_2719×881 formed by the 2719 documents and 881 words (where d_ij is the frequency of the j-th word in the i-th document), and the citation relationship matrix C=[c_mn]_2719×2719 between the 2719 documents (where c_mn indicates whether document m and document n have a citation relationship: c_mn=1 indicates a citation relationship, and any other value indicates none). After the scientific documents have been preprocessed, the method proceeds to step A3.
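As an illustration of the data involved, the metadata record and the citation relationship matrix could be represented as below. The field names follow the pid, year, text and cit items of this embodiment; the class and function names are hypothetical, and reading c_mn = 1 as "document m cites document n" is an assumption, since the text only speaks of a citation relationship between the two documents.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentMeta:
    pid: int                                   # document ID
    year: int                                  # publication time, accurate to the year
    text: str                                  # title, keywords and abstract
    cit: list = field(default_factory=list)    # pids of cited documents in the collection

def citation_matrix(docs):
    """Build C = [c_mn]; here c_mn = 1 when document m cites document n (assumed direction)."""
    row = {doc.pid: i for i, doc in enumerate(docs)}
    C = [[0] * len(docs) for _ in docs]
    for doc in docs:
        for cited in doc.cit:
            if cited in row:                   # ignore references outside the collection
                C[row[doc.pid]][row[cited]] = 1
    return C
```

With 2719 documents this yields the 2719×2719 matrix described above; a sparse representation would be preferable at that size.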

A3: Extract topics with the topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions.

In this step, the citation and content information of the documents is used to build the topic discovery method for the collected scientific documents. It comprises three sub-steps:

1). Matrix [C]_2719*2719 is row-normalized to obtain matrix M, which is decomposed by non-negative matrix factorization into two non-negative matrices [B]_2719*10 and [H]_10*2719, i.e. M=B*H. The documents are thus decomposed into 10 large clusters; the factorization runs for 1000 iterations.

2). Matrices B and H are each row-normalized and transposed to obtain matrices N and M. Each element n_i,j of matrix N represents the probability that topic (cluster) i contains cited document j, and each element m_i,j of matrix M represents the probability that citing document i belongs to topic (cluster) j.

3). For each topic n_i,j generated in step 2), an LDA probabilistic topic model based on the "bag of words" assumption is built from the content of the documents that make up the topic. A topic is regarded as a distribution over the word set. The generative process is: a topic z_d,n is drawn from the document-topic distribution θ_d, and a word of the document is then drawn from the topic-word distribution φ of that topic. The model parameters, namely the topic-word distribution φ_k,w and the document-topic distribution θ_j,k, are estimated by Gibbs sampling, where θ_d ~ Dir(α) and φ_k ~ Dir(β). The resulting parameters φ_k,w and θ_j,k of the topic model constitute the topics. The formulas for obtaining the model parameters by Gibbs sampling are:

$$\varphi_{k,w} = \frac{n_k^{(w)} + \beta_w}{\sum_{t}\left(n_k^{(t)} + \beta_t\right)}, \qquad \theta_{j,k} = \frac{n_j^{(k)} + \alpha_k}{\sum_{k'}\left(n_j^{(k')} + \alpha_{k'}\right)}$$

where n_k^(w) is the number of occurrences of word w assigned to topic k, n_j^(k) is the number of words of document j assigned to topic k, α_k is the Dirichlet prior parameter vector of θ_j,k, θ_j,k is the probability of the k-th topic in document j, β_t is the Dirichlet prior parameter vector of φ, and φ_k,w is the probability of the w-th term in topic k. The sums run over all terms t of the vocabulary and over all topics k'. Here K=3, and α_k and β_t are set to 0.5 and 0.01 respectively.

In the present embodiment, the sampling process tends to converge after 300 Gibbs sampling iterations of step 3).
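A collapsed Gibbs sampler for LDA, run on the documents of one cluster, can be sketched as follows. The defaults K=3, α=0.5, β=0.01 and 300 iterations are taken from this embodiment; the use of symmetric priors, the function name and the variable names are assumptions made for illustration, and the sketch is not claimed to be the sampler actually used.

```python
import random

def lda_gibbs(docs, V, K=3, alpha=0.5, beta=0.01, iterations=300, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids in [0, V)."""
    rng = random.Random(seed)
    n_kw = [[0] * V for _ in range(K)]     # occurrences of word w assigned to topic k
    n_jk = [[0] * K for _ in docs]         # words of document j assigned to topic k
    n_k = [0] * K                          # total words assigned to topic k
    z = [[rng.randrange(K) for _ in doc] for doc in docs]   # random initial assignments
    for j, doc in enumerate(docs):
        for w, k in zip(doc, z[j]):
            n_kw[k][w] += 1
            n_jk[j][k] += 1
            n_k[k] += 1
    for _ in range(iterations):
        for j, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[j][n]                # remove the token's current assignment
                n_kw[k][w] -= 1
                n_jk[j][k] -= 1
                n_k[k] -= 1
                # Full conditional of the topic given all other assignments.
                weights = [(n_kw[t][w] + beta) / (n_k[t] + V * beta) * (n_jk[j][t] + alpha)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[j][n] = k
                n_kw[k][w] += 1
                n_jk[j][k] += 1
                n_k[k] += 1
    # Point estimates of the topic-word and document-topic distributions.
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)] for k in range(K)]
    theta = [[(n_jk[j][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for j, doc in enumerate(docs)]
    return phi, theta
```

Each row of `phi` is a topic's distribution over the vocabulary and each row of `theta` is a document's distribution over topics. Running the sampler with K=3 on each of the 10 citation clusters would give 30 topics, which matches the topic count reported for this embodiment, although the source does not state this derivation explicitly.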

In this step, the multi-source information fusion topic discovery model computes 30 topics for the 2719 scientific documents. The description of each topic has two parts: (a) the 10 terms most relevant to the topic and their probabilities; (b) the 10 core documents most relevant to the topic and their probabilities. The topic-term distributions and the topic core-document distributions of 2 representative topics computed from the 2719 scientific documents are shown in Table 1 and Table 2 respectively:

Table 1：The distribution of 2 topic words

Table 1 above gives two typical examples of the topics obtained on the scientific document data set used in this embodiment: the 10 highest-probability words, with their probabilities, of topic 18 "shape representation" and topic 3 "classification algorithms". Table 1 shows that the high-frequency words of topic 18 include "recognition", "shape", "image", "affine" and "invariant", which relate to shape representation, while the high-frequency words of topic 3, such as "classification", "feature", "nearest" and "neighbor", all relate to classification algorithms.

Table 2：The distribution of 2 topic papers

After the probability distributions of topic words and of topic documents have been extracted by the above method for the 30 topics of the 2719 scientific documents of the PAMI data set, the method proceeds to step A4.

A4: Divide the 30 extracted topics along the time axis to form sub-topics in different time periods.

The topic division mainly uses the time information of the documents: each topic is projected onto the different time periods to form the sub-topics of each period. The time division scheme is shown in Fig. 2: the number of time periods is 6, the documents span 1995 to 2012, and each time period is 3 years long. Table 3 below lists the time division result of topic 18 "shape representation" under this scheme. To save space, only the 4 highest-probability words and 4 documents of each sub-topic are listed; documents are identified by their id, and Table 4 gives the title corresponding to each id.

Table 3: Time division result of topic 18 "shape representation"

Table 4: Topic document ids and their corresponding titles in topic 29 "3D reconstruction"

A5: Calculate topic correlation with the inter-topic correlation measure and track the topic evolution paths to obtain the evolution of the research topics. This step is implemented as follows:

1). For two topics z_i and z_j in any two adjacent time periods, the word distribution of each topic and the core-document distribution of each topic are used to calculate the correlation between the two topics, where the parameter μ of the computing formula is set to 0.5.

2). The correlation measure of step 1) is computed for every pair of topics in adjacent time periods. A directed edge is created between two topics whose measure exceeds a given threshold (0.2 here), with the direction of the edge determined by the temporal order of the topics. The evolution relationship graph between topics is built in this way.
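The graph construction can be sketched as follows. The correlation formula itself is not reproduced in this text, so the `correlation` function below is purely an assumption: it combines the cosine similarity of the word distributions and of the core-document distributions with weight μ. Only the threshold of 0.2, the value μ=0.5 and the restriction to adjacent time periods come from the description; the data layout and all names are hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity of two sparse distributions stored as dicts."""
    dot = sum(v * q.get(key, 0.0) for key, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def correlation(a, b, mu=0.5):
    # ASSUMPTION: a mu-weighted sum of word and core-document similarity;
    # the source's own formula may differ.
    return mu * cosine(a["words"], b["words"]) + (1 - mu) * cosine(a["docs"], b["docs"])

def evolution_edges(subtopics, threshold=0.2, mu=0.5):
    """subtopics[p] maps topic id -> {"words": {...}, "docs": {...}} for period p."""
    edges = []
    for p in range(len(subtopics) - 1):
        for i, a in subtopics[p].items():
            for j, b in subtopics[p + 1].items():
                score = correlation(a, b, mu)
                if score > threshold:
                    # The edge points from the earlier period to the later one.
                    edges.append(((p, i), (p + 1, j), score))
    return edges
```

Each returned edge is a triple of source node, target node and score, where a node is a (period, topic id) pair. Any other similarity measure can be swapped in by replacing `correlation`.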

Through this step, the way the 30 topics discovered in the 2719 documents of this embodiment's data set evolved between 1995 and 2012 is obtained. The result helps researchers understand how the important research topics of image processing and pattern recognition evolve over time. Fig. 3 gives the result of tracking the evolution of the 30 topics. The topic evolution diagram on the PAMI data set divides into four parts corresponding to four different research directions: image segmentation, face recognition, handwritten digit recognition and tracking. The number on each node is the serial number of the topic, and the edges between nodes are the evolution paths between topics. Each of the four separate parts, i.e. each research direction, contains one or more topic evolution paths. The image segmentation direction has as many as seven different evolution paths, concentrated mainly on topics 4, 5, 6, 7 and 17. The face recognition direction has six different evolution paths, concentrated mainly on topics 13, 14 and 15. The handwritten digit recognition direction has three different evolution paths, concentrated mainly on topics 10 and 26. The tracking direction has only one evolution path, concentrated mainly on topics 20 and 21. To save space, the detailed keywords of each topic are listed in Table 5.

Table 5: Top 10 keywords of each topic in the different time periods of Fig. 3

In this embodiment, all papers (except editorials) published from January 1995 to September 2012 in IEEE Transactions on Pattern Analysis and Machine Intelligence, a top international journal in pattern recognition and image processing, were downloaded, giving 2719 research documents in total. The raw data was organized into document metadata, which was preprocessed to obtain the metadata set. The multi-source topic discovery method based on citations and content extracted 30 research topics, yielding the probability distributions of topic words and of topic documents. Combining the time information of the documents with the topics obtained, the 30 topics were divided into 6 time periods between 1995 and 2012, each 3 years long, forming 180 sub-topics in total. From the word distribution and the document distribution of each topic, the proposed topic correlation measure produced the evolution relationship graph of the research topics. Through these steps the important research topics of the chosen field and the way they evolve over time were uncovered, which is of considerable practical significance.

In practice, perplexity is the standard measure of a model's generalization ability: the smaller the perplexity, the stronger the generalization ability. To evaluate the generalization ability of the multi-source information fusion topic discovery model of the present invention, this embodiment further splits the 2719 scientific documents into two parts: 1360 documents as the training set and 1359 documents as the test set. In the topic discovery model of the present invention, the perplexity of the scientific documents in the test set D_test is computed as:

$$\mathrm{perplexity}(D_{test}) = \exp\left(-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)$$

where N_d is the number of words in document d, w_d=(w_1d, w_2d, …, w_id, …, w_nd) is the word vector of document d, and M is the total number of documents in the test set, here 1359.
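Perplexity is commonly defined as the exponential of the negative average per-word log-likelihood of the test set. Under that standard definition, and assuming each test document d already has a topic distribution `theta[d]` (how it is inferred for unseen documents is not described here), a minimal sketch is given below; the function and parameter names are hypothetical.

```python
import math

def perplexity(test_docs, theta, phi):
    """exp(-sum_d log p(w_d) / sum_d N_d), with p(w) = sum_k theta[d][k] * phi[k][w]."""
    log_likelihood = 0.0
    n_words = 0
    for d, doc in enumerate(test_docs):
        for w in doc:
            # Probability of word w in document d under the topic mixture.
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_w)
        n_words += len(doc)                # N_d, the number of words in document d
    return math.exp(-log_likelihood / n_words)
```

As a sanity check on the definition, a model that assigns every word of a 4-term vocabulary the same probability has a perplexity of 4.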

Fig. 4 gives the perplexity comparison between three models: the topic discovery model of this embodiment, the standard LDA topic model based on document content, and the LDA topic model based on citation relationships (see "Wang, X., Zhai, C., Roth, D., 2013. Understanding evolution of research themes: a probabilistic generative model for citations. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, p.1115-1123."). Fig. 4 shows that the topic discovery model of this embodiment has lower perplexity than the other two models, i.e. better generalization ability. Moreover, once the number of topics exceeds 30, the perplexity of all three models remains essentially unchanged, which indicates that choosing 30 topics in this embodiment is appropriate and reasonably reflects the real number of topics contained in the TPAMI data set.

The above are only some embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims

1. A method for discovering scientific research document topics and tracking their evolution, characterized by comprising the following steps:

A1, download the scientific documents of a given discipline and organize the resulting document metadata.

A2, preprocess the document data downloaded in A1 to form a document data set S.

A3, for the document data set S formed in A2, extract topics with a topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions.

A4, using the time information of all documents belonging to each topic, divide the extracted topics along the time axis to form sub-topics in different time periods.

A5, calculate topic correlation with the inter-topic correlation measure and track the topic evolution paths to obtain the evolution diagram of the research topics.

Above-mentioned steps A3 specifically include following sub-step：

A31. reference citation matrix [M] m*m is set up according to the adduction relationship between the document of data in literature collection S, wherein m be with The quantity of the document of adduction relationship.Matrix M is pressed into row normalization, matrix M is decomposed into into two with the method for Non-negative Matrix Factorization Individual nonnegative matrix [B] m*z and [H] z*m, wherein m be decompose after B matrixes row and H-matrix row quantity, i.e. M=B*H.

A32. Matrix C and M are obtained by row normalization respectively to matrix B and H.Wherein, each element c of Matrix C_i,jRepresent each Comprising the probability for quoting document j, each element m of matrix M in topic (cluster) i_i,jRepresent that each is quoted document i and belongs to a certain The probability of individual topic (cluster) j.

A33. For each topic c_{i,j} generated in step A32, build an LDA probabilistic topic model based on the "bag of words" model, using the content of the documents that make up the topic. The topic is treated as an LDA probabilistic topic model over a set of words, with the following generative process: a topic z_{d,n} is generated according to the document-topic distribution θ_d, and the words of the document are then generated according to the topic-word distribution. Gibbs sampling is used to estimate the model parameters, namely the topic-word distribution φ_{k,w} and the document-topic distribution θ_{j,k}, where θ_d～Dir(α) and the topic-word distribution follows Dir(β), α and β being the parameters of the Dirichlet distributions. The resulting topic model parameters φ_{k,w} and θ_{j,k} constitute the topic.
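Sub-steps A31 and A32 can be illustrated with a minimal sketch. The claim only states that non-negative matrix factorization is used; the solver below (Lee-Seung multiplicative updates for squared error), the helper names `row_normalize`, `matmul`, `transpose` and `nmf`, the toy citation matrix, the choice z=2 and the iteration count are all assumptions made for illustration.

```python
import random

def row_normalize(mat):
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    out = []
    for row in mat:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else list(row))
    return out

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain-Python product of a (n x k) and b (k x p)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def nmf(m, z, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix m (n x p) into b (n x z) and h (z x p).
    Assumed solver: Lee-Seung multiplicative updates for squared error."""
    rng = random.Random(seed)
    n, p = len(m), len(m[0])
    b = [[rng.random() + 0.1 for _ in range(z)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(p)] for _ in range(z)]
    for _ in range(iters):
        # H <- H * (B^T M) / (B^T B H)
        bt = transpose(b)
        num, den = matmul(bt, m), matmul(matmul(bt, b), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(p)]
             for i in range(z)]
        # B <- B * (M H^T) / (B H H^T)
        ht = transpose(h)
        num, den = matmul(m, ht), matmul(b, matmul(h, ht))
        b = [[b[i][j] * num[i][j] / (den[i][j] + eps) for j in range(z)]
             for i in range(n)]
    return b, h

# Toy citation matrix (hypothetical data): two groups of mutually citing documents.
citations = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
M = row_normalize(citations)   # A31: row-normalized citation matrix
B, H = nmf(M, z=2)             # A31: M is approximated by B*H
doc_topic = row_normalize(B)   # A32: row-normalized B
topic_doc = row_normalize(H)   # A32: row-normalized H
```

Each row of `doc_topic` and `topic_doc` sums to 1, so the entries can be read as probabilities in the sense of A32. In practice a library NMF routine would likely replace the hand-written update loop.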

The division of topics along the time axis in step A4 above mainly uses the time information of the documents belonging to a given topic to split the topic over different time periods, forming the sub-topics on each time period, where K is the number of topics and P is the number of time periods. The specific time division scheme is as follows: from the number of time periods P, the start time t₀ of the documents and the end time t_s, the size of the time interval of each period is determined as (t_s-t₀)/P.
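The time division scheme can be sketched as follows. The function name `split_topic_by_time`, the (document ID, year) input format and the boundary convention (half-open intervals, with the end time t_s assigned to the last period) are assumptions for illustration; the claim only fixes the interval width (t_s-t₀)/P.

```python
def split_topic_by_time(docs, t0, ts, P):
    """Split the documents of one topic into P equal-width time periods.

    docs: list of (doc_id, publication_time) pairs with t0 <= time <= ts.
    Returns a list of P lists of document IDs, one per period.
    """
    width = (ts - t0) / P          # interval size (t_s - t0) / P
    periods = [[] for _ in range(P)]
    for doc_id, t in docs:
        # Assumed convention: a document published exactly at ts
        # is placed in the last period.
        idx = min(int((t - t0) / width), P - 1)
        periods[idx].append(doc_id)
    return periods

# Hypothetical documents of a single topic, with publication years.
topic_docs = [("d1", 2000), ("d2", 2003), ("d3", 2004), ("d4", 2007), ("d5", 2008)]
periods = split_topic_by_time(topic_docs, t0=2000, ts=2008, P=4)
print(periods)  # [['d1'], ['d2'], ['d3'], ['d4', 'd5']]
```

Applying the same split to each of the K topics yields up to K×P sub-topics, each defined by the documents that fall in one period.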

A51. to two topic z in two time intervals of arbitrary neighborhood_iAnd z_jUsing the distribution of the word of each topic With the distribution of the Core article of each topicTo calculate the relation of two topics；

A52. Compute the correlation measure as described in A51 for every pair of topics in adjacent time periods, and establish a directed edge between any two topics whose measure exceeds a certain threshold, the direction of the edge being determined by the temporal relation between the topics, thereby building the evolution graph between topics.
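The graph construction in A52 can be sketched as below. The correlation measure used here is not the formula of the invention: cosine similarity over word distributions only is swapped in as a stand-in, and the core-document distribution is ignored. The names `cosine` and `build_evolution_graph`, the sub-topic labels, the toy distributions and the threshold value are likewise hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity of two sparse word distributions (dict: word -> prob).
    Stand-in measure, assumed for illustration only."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def build_evolution_graph(subtopics, threshold):
    """subtopics: dict label -> (period_index, word_distribution).
    Returns directed edges (earlier, later, score) between sub-topics in
    adjacent periods whose similarity exceeds the threshold."""
    edges = []
    for a, (pa, wa) in subtopics.items():
        for b, (pb, wb) in subtopics.items():
            if pb == pa + 1:               # edge points forward in time
                score = cosine(wa, wb)
                if score > threshold:
                    edges.append((a, b, score))
    return edges

# Hypothetical sub-topics in two adjacent time periods.
subtopics = {
    "z1@p0": (0, {"topic": 0.5, "model": 0.5}),
    "z2@p0": (0, {"image": 0.6, "segmentation": 0.4}),
    "z1@p1": (1, {"topic": 0.4, "model": 0.4, "evolution": 0.2}),
    "z2@p1": (1, {"image": 0.5, "retrieval": 0.5}),
}
edges = build_evolution_graph(subtopics, threshold=0.5)
for src, dst, score in edges:
    print(f"{src} -> {dst} ({score:.2f})")
```

Raising the threshold prunes weaker links, so the threshold controls how dense the resulting evolution graph is.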

2. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the metadata obtained for each document in step A1 includes: the document ID, the publication time of the document, the document content comprising only the title, keywords and abstract, the citation information of the document, etc.

3. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the preprocessing of the document metadata in step A2 specifically includes: removing stop words, removing low-frequency words that occur fewer than 5 times across all documents, building the document-word frequency matrix, building the vocabulary of all documents, and building the citation relation matrix between the documents in the data set.
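A minimal sketch of this preprocessing is given below. Whitespace tokenization, lowercasing, the tiny stop-word list, the function names `preprocess` and `citation_matrix`, and the toy documents are all assumptions; only the steps themselves and the threshold of 5 occurrences come from the claim.

```python
from collections import Counter

# Illustrative stop-word list (assumed; a real list would be much longer).
STOP_WORDS = {"the", "of", "a", "and", "for", "in"}

def preprocess(docs, min_count=5):
    """docs: dict doc_id -> text (title, keywords and abstract).
    Returns (vocabulary, document-word frequency matrix)."""
    tokens = {d: [w for w in text.lower().split() if w not in STOP_WORDS]
              for d, text in docs.items()}
    totals = Counter(w for ws in tokens.values() for w in ws)
    # Drop words occurring fewer than min_count times across all documents.
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for d in docs:
        row = [0] * len(vocab)
        for w in tokens[d]:
            if w in index:
                row[index[w]] += 1
        matrix.append(row)
    return vocab, matrix

def citation_matrix(doc_ids, citations):
    """Entry [i][j] is 1 if document i cites document j, both in the data set."""
    pos = {d: i for i, d in enumerate(doc_ids)}
    m = [[0] * len(doc_ids) for _ in doc_ids]
    for src, dst in citations:
        if src in pos and dst in pos:
            m[pos[src]][pos[dst]] = 1
    return m

# Hypothetical toy data set.
docs = {
    "d1": "the topic model of topic evolution",
    "d2": "topic model for topic tracking",
    "d3": "a topic graph",
}
vocab, matrix = preprocess(docs)
cites = citation_matrix(list(docs), [("d1", "d2"), ("d3", "d1"), ("d3", "dX")])
```

On this toy input only "topic" reaches 5 occurrences, so the vocabulary has a single entry; the citation to the unknown document "dX" is ignored because it lies outside the data set.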

4. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the model parameters φ_{k,w} and θ_{j,k} in step A33 are computed as follows:

φ_{k,w} = \frac{n_{k}^{(w)} + β_{w}}{\sum_{w=1}^{V} (n_{k}^{(w)} + β_{w})}

θ_{j,k} = \frac{n_{j}^{(k)} + α_{k}}{\sum_{k=1}^{K} (n_{j}^{(k)} + α_{k})}

where n_{k}^{(w)} denotes the number of occurrences of word w assigned to topic k, n_{j}^{(k)} denotes the number of times topic k is assigned in document j, α_k is the Dirichlet prior parameter vector of θ_{j,k}, θ_{j,k} denotes the probability of the k-th topic in document j, β_w is a Dirichlet prior parameter vector, and φ_{k,w} denotes the probability of the w-th term in topic k.
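The two estimates can be checked numerically with a short sketch. It assumes the count tables n_{k}^{(w)} and n_{j}^{(k)} have already been collected from a Gibbs sampling run, which is not shown; the function name `estimate_phi_theta`, the toy counts and the prior values are hypothetical.

```python
def estimate_phi_theta(n_kw, n_jk, alpha, beta):
    """Point estimates of phi and theta from Gibbs-sampling counts.

    n_kw[k][w]: number of occurrences of word w assigned to topic k.
    n_jk[j][k]: number of times topic k is assigned in document j.
    alpha[k], beta[w]: Dirichlet prior parameters.
    """
    phi = []
    for counts in n_kw:
        denom = sum(c + b for c, b in zip(counts, beta))
        phi.append([(c + b) / denom for c, b in zip(counts, beta)])
    theta = []
    for counts in n_jk:
        denom = sum(c + a for c, a in zip(counts, alpha))
        theta.append([(c + a) / denom for c, a in zip(counts, alpha)])
    return phi, theta

# Hypothetical counts: K = 2 topics, V = 3 words, 2 documents.
n_kw = [[3, 1, 0], [0, 2, 4]]
n_jk = [[4, 0], [0, 6]]
phi, theta = estimate_phi_theta(n_kw, n_jk, alpha=[0.5, 0.5], beta=[0.1, 0.1, 0.1])
# Document 0: theta = (4 + 0.5) / 5 and (0 + 0.5) / 5, i.e. about 0.9 and 0.1.
```

Because the priors are added to every count, a word or topic with zero count still receives a small non-zero probability, and each row of `phi` and `theta` sums to 1.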

5. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the time division scheme in step A4 is as follows: from the number of time periods S and the start time t₀ and end time t_s of the documents contained in the data set, the size of the time interval of each period is determined as (t_s-t₀)/S, and the sub-topics on each time period are obtained.

6. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the formula for computing the relation between two topics in step A51 is: