CN106570088A - Discovering and evolution tracking method for scientific research document topics

Info

Publication number: CN106570088A
Authority: CN (China)
Prior art keywords: topic, document, topics, time, word
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN201610913510.7A
Other languages: Chinese (zh)
Inventors: 周厚奎, 于慧敏
Current Assignee: Zhejiang University ZJU (the listed assignees may be inaccurate)
Original Assignee: Zhejiang University ZJU
Application filed by: Zhejiang University ZJU
Priority: CN201610913510.7A
Publication: CN106570088A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for discovering topics in scientific research documents and tracking their evolution. The method comprises: downloading the scientific research documents of a given discipline, organizing the resulting document metadata, and preprocessing it to obtain a document metadata set; extracting topics with a topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions, and then dividing the extracted topics on the time axis to form sub-topics in different time periods; and finally calculating the correlation between topics and tracking the topic evolution paths to obtain an evolution diagram of the research topics. Because topics are discovered from both the document text and the citation information, the topics obtained are of higher quality and closer to reality. Important research topics can be discovered and their evolution over time can be tracked, so that scientific and technical personnel can quickly grasp the research topics and how they develop.

Description

A method for discovering topics in scientific research documents and tracking their evolution
Technical field
The present invention relates to knowledge discovery and data mining in the field of scientific research, and in particular to a method for tracking the evolution of topics in scientific documents.
Background art
Scientific documents record important academic research achievements and are the carriers of academic dissemination and exchange. Scientific achievements are cumulative: most are sustained improvements built on the research results of predecessors. With the emergence of electronic literature index databases such as PubMed and DBLP and the development of the Internet, the number of accumulated scientific documents keeps growing. Faced with this flood of documents, researchers, and novices in particular, urgently want to discover quickly the important research topics of their own field and to track how these topics evolve over time. Automatic topic discovery and evolution tracking can help researchers quickly grasp how research topics develop and change, and therefore has important practical value and significance.
At present, topic discovery models in China and abroad have developed from the LDA topic model [1]. LDA is a three-layer document-topic-word model. Its main assumption is the exchangeability of documents and of words, that is, the "bag of documents" and "bag of words" models. LDA regards each document in the corpus as a distribution over latent topic variables and each topic as a distribution over words; both distributions are given Dirichlet priors. Compared with the PLSA model [2], LDA is a fully Bayesian model and gives more accurate estimates for unseen documents and vocabulary; PLSA can be regarded as a MAP estimate of LDA and suffers from over-fitting. Current improvements to LDA fall into three areas: (1) considering the relations between documents or between words, so that documents and words are no longer exchangeable [3-8]; (2) learning the number of topics adaptively by introducing non-parametric Bayesian models [9-12]; (3) introducing additional information beyond the text to achieve supervised or semi-supervised learning and improve the performance of the topic discovery model [13-16]. From another point of view, classifying by whether the model structure is hierarchical and whether learning is supervised, topic models fall into four classes: 1) unsupervised, non-hierarchical topic models; 2) unsupervised, hierarchical topic models; 3) supervised, non-hierarchical topic models; 4) supervised, hierarchical topic models.
According to how time is partitioned in topic evolution, existing methods for analysing the evolution of scientific research topics fall into two broad classes: discrete-time topic evolution methods and continuous-time topic evolution methods.
The general process of discrete-time topic evolution methods is as follows: (1) the text corpus is divided into subsets according to its time labels; (2) topics are extracted in each subset with a probabilistic topic model; (3) evolution relations between the topics of different subsets are established according to a measure of the relation between topics; (4) a topic evolution graph is formed. Depending on the probabilistic topic model used, these models fall into two classes. The first uses parametric Bayesian models, in which the number of topics is fixed, such as TTM (Temporal Text Mining) [17], DTM (Dynamic Topic Model) [18] and MTTM (Multiscale Topic Tomography Model) [19]. The second uses non-parametric Bayesian models, in which the number of topics is not fixed, such as TDPM (Temporal Dirichlet Process Mixture Model) [20] and iDTM (infinite Dynamic Topic Model) [21]. Discrete-time topic evolution models require the document set to be divided by time. Such a manual division is hard to make accurate, because different types of documents may call for different divisions, and the division often affects the final evolution result. To address this problem, some scholars have proposed a different approach in which time is treated as a variable during topic modelling, so that the resulting topics are distributions over both words and time. Such models mainly include Topics Over Time (TOT) [22], the continuous time Dynamic Topic Model (cDTM) [23], the Trend Analysis Model (TAM) [24] and non-parametric Topics Over Time (npTOT) [25].
The overwhelming majority of existing topic discovery models do not make full use of the multi-source structural information in scientific documents (such as content, citations, authors and source journals) to discover research topics. To address this, the method of the present invention uses both the citation information and the content information of scientific documents, and achieves better results than existing methods that use only one type of information. In addition, the evolution analysis produced by existing topic evolution models mostly concerns how the same topic evolves across time periods; existing techniques rarely analyse the evolution between different topics over different time periods. To address this, the present invention tracks the evolution over time of the different research topics of a given field. Another difference from the prior art is that the topic evolution analysis of the present invention first extracts topics and then splits them in time, which avoids the topic alignment problem that arises when the corpus is discretized first and topics are extracted afterwards.
Taking scientific documents as the object of study, discovering important research topics and tracking their evolution is of great significance for knowledge discovery and data mining in scientific literature, and also plays an important role in helping researchers carry out their work and in promoting the development of scientific research.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings of existing techniques for discovering scientific research topics and tracking their evolution by providing a method for both. The method jointly uses the citation and content information of scientific documents to discover research topics and tracks the evolution between different research topics. It achieves better topic discovery than existing methods and achieves the goal of tracking the evolution paths between different research topics.
To solve the above technical problem, the present invention provides a method for discovering topics in scientific documents and tracking their evolution, the method comprising the following steps:
A1. Download the scientific documents of a given discipline and organize the resulting document metadata;
A2. Preprocess the document data to form a document data set;
A3. Extract topics with a topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions;
A4. Split the discovered topics on the time axis to form sub-topics in different time periods;
A5. Measure the correlation between sub-topics and track the topic evolution paths.
The metadata obtained for each document in step A1 includes: the document ID (IDs are ordered by publication time), the publication time of the document, the content of the document, and the citation relation matrix of the document.
The preprocessing in step A2 includes: removing stop words, digits and non-English characters; stemming; removing low-frequency words that occur fewer than 5 times across all documents; building the document word-frequency matrix of the data set; building the vocabulary; and building the citation relation matrix of the documents.
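As an illustration only, the text-cleaning part of step A2 could be sketched in Python as follows. The tiny stop-word list and the names tokenize and build_corpus are assumptions made for this sketch and do not appear in the specification; stemming is omitted for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "on"}

def tokenize(text):
    """Lowercase, keep only runs of English letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_corpus(texts, min_count=5):
    """Return (vocabulary, document word-frequency rows) after low-frequency filtering."""
    docs = [tokenize(t) for t in texts]
    totals = Counter(tok for doc in docs for tok in doc)
    # Keep only words that occur at least min_count times across all documents.
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc:
            if tok in index:
                row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix

texts = ["Shape shape shape 3D", "the shape of shape image", "image"]
vocab, matrix = build_corpus(texts, min_count=5)
```

Each row of matrix corresponds to one document and each column to one vocabulary word, matching the document word-frequency matrix described above.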
Step A3 specifically includes:
A31. Build the document citation matrix [M]_{m×m} from the citation relations between the documents of the data set, where m is the number of documents that have citation relations. Normalize M by rows and use non-negative matrix factorization to decompose M into two non-negative matrices [B]_{m×z} and [H]_{z×m}, i.e. M = B*H.
A32. Normalize B and H by rows to obtain matrices C and M respectively. Each element c_{i,j} of C represents the probability that topic (cluster) i contains cited document j, and each element m_{i,j} of M represents the probability that cited document i belongs to topic (cluster) j.
A33. For each topic c_{i,j} generated in step A32, build an LDA probabilistic topic model based on the "bag of words" model from the content of the documents that make up the topic. In the LDA model a topic is a distribution over words, and the generative process is as follows: a topic z_{d,n} is generated according to the document-topic distribution θ_d, and a word of the document is then generated according to the topic-word distribution. The model parameters, namely the topic-word distribution and the document-topic distribution θ_{j,k}, are estimated by Gibbs sampling, where θ_d ~ Dir(α) and the topic-word distribution likewise has a Dirichlet prior. The resulting model parameters, the topic-word distribution and θ_{j,k}, constitute the topics.
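The factorization in steps A31 and A32 can be sketched as below. This is a minimal illustration, assuming NumPy and the classical multiplicative-update rules for non-negative matrix factorization; the specification does not say which NMF algorithm is used, and the function names and the toy citation matrix are invented for the example.

```python
import numpy as np

def row_normalize(a):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    sums = a.sum(axis=1, keepdims=True)
    return np.divide(a, sums, out=np.zeros_like(a, dtype=float), where=sums > 0)

def nmf(m, z, n_iter=1000, seed=0, eps=1e-9):
    """Factor a non-negative matrix m into b (rows x z) and h (z x cols), m ~ b @ h.

    Uses multiplicative updates for the squared-error objective (an assumed choice).
    """
    rng = np.random.default_rng(seed)
    b = rng.random((m.shape[0], z)) + eps
    h = rng.random((z, m.shape[1])) + eps
    for _ in range(n_iter):
        h *= (b.T @ m) / (b.T @ b @ h + eps)
        b *= (m @ h.T) / (b @ h @ h.T + eps)
    return b, h

# Toy 4-document citation matrix (hypothetical data).
citations = np.array([[0, 1, 1, 0],
                      [1, 0, 1, 0],
                      [0, 0, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
m = row_normalize(citations)       # row-normalized citation matrix
b, h = nmf(m, z=2)                 # two clusters in this toy example
cluster_rows = row_normalize(h)    # one row per cluster, spread over documents
document_rows = row_normalize(b)   # one row per document, spread over clusters
```

Row-normalizing the factors turns them into the probability-like tables that step A32 describes, one over documents per cluster and one over clusters per document.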
The splitting of topics on the time axis in step A4 mainly uses the time information of the documents belonging to each topic: the topic is split across the different time periods, forming the sub-topics on each period. The specific scheme is as follows: given the number of time periods P, the start time t0 of the documents and their end time ts, the length of each time period is (ts - t0)/P.
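The time-splitting rule can be illustrated with a short Python sketch. The function names and the clamping of the end time into the last period are assumptions of this example; the specification only fixes the period length (ts - t0)/P.

```python
def period_index(year, t0, ts, p):
    """Map a publication year to one of p equal-width periods over [t0, ts]."""
    width = (ts - t0) / p
    idx = int((year - t0) / width)
    # Clamp so that a document published exactly at ts falls in the last period.
    return min(max(idx, 0), p - 1)

def split_topic(doc_years, t0, ts, p):
    """Group the document ids of one topic by time period."""
    periods = [[] for _ in range(p)]
    for doc_id, year in doc_years.items():
        periods[period_index(year, t0, ts, p)].append(doc_id)
    return periods

# Hypothetical topic with three documents, split into 4 periods of 3 years each.
sub_topics = split_topic({"a": 2000, "b": 2005, "c": 2012}, 2000, 2012, 4)
```

Each inner list holds the documents of one sub-topic; a period with no documents simply yields an empty sub-topic.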
The topic evolution analysis in step A5 specifically includes:
A51. For two topics z_i and z_j in any two adjacent time intervals, compute the relation between the two topics from the word distribution of each topic and the core-document distribution of each topic. [Formula not reproduced in this text.]
A52. Compute the correlation measure of step A51 for every pair of topics in adjacent time periods, and establish a directed edge between two topics whose measure exceeds a given threshold. The direction of the edge is determined by the time order of the topics. The evolution graph between topics is built in this way.
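Because the relation formula of step A51 did not survive in this text, the following Python sketch shows only one plausible shape for such a measure: a weighted combination, with weight mu, of a word-distribution similarity and a core-document-distribution similarity. The use of cosine similarity and the names cosine and topic_relation are assumptions of this sketch, not the patented formula.

```python
import math

def cosine(p, q):
    """Cosine similarity of two sparse distributions given as dicts."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p > 0 and norm_q > 0 else 0.0

def topic_relation(words_i, docs_i, words_j, docs_j, mu=0.5):
    """Hypothetical score: mu * word similarity + (1 - mu) * core-document similarity."""
    return mu * cosine(words_i, words_j) + (1.0 - mu) * cosine(docs_i, docs_j)

# Two sub-topics with identical word distributions but disjoint core documents.
words = {"shape": 0.6, "image": 0.4}
score = topic_relation(words, {"p1": 1.0}, words, {"p2": 1.0}, mu=0.5)
```

With mu = 0.5 the two sources of evidence are weighted equally, so sub-topics that share vocabulary but no core documents score about halfway between unrelated and identical.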
Beneficial effects of the present invention: Because scientific documents contain rich structural information, the present invention uses both their text and their citation information to discover research topics, uses the time information of the core papers of each extracted topic to split the topic, and thereby tracks topic evolution. Compared with traditional topic discovery and evolution tracking, the invention uses the text and citation information of documents jointly, so the topics obtained are of higher quality and closer to reality. Moreover, the corpus is not divided first; topics are extracted first and then split, which avoids the topic alignment problem. The topic discovery and evolution tracking provided by the embodiments of the invention can discover important research topics and track their evolution over time, helping scientific and technical personnel quickly grasp research topics and the course of their evolution.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the topic discovery and evolution tracking method of the present invention;
Fig. 2 shows the scheme for dividing topics on the time axis in this embodiment;
Fig. 3 shows the evolution analysis of the 30 topics in this embodiment;
Fig. 4 compares the perplexity of LDA, citation LDA and the topic discovery method of the present invention in this embodiment.
Detailed description of the embodiments
The method proposed by the present invention includes several steps: acquisition and organization of scientific documents, preprocessing of document information, discovery of research topics based on multi-source information, splitting of topics, and tracking of topic evolution. Acquisition and organization obtains a certain number of scientific documents and organizes them into a document corpus. Preprocessing includes removing stop words and low-frequency words and deriving from the corpus the document word-frequency matrix, the vocabulary and the citation relation matrix. Topic discovery based on multi-source information is responsible for discovering research topics and obtains, for each topic, the topic-word distribution and the topic-document distribution. Topic splitting divides the extracted topics on the time axis to form sub-topics in different time periods. Evolution tracking mainly comprises measuring the relation between sub-topics in adjacent time periods and building the topic evolution graph.
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The embodiments described with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting the claims.
Fig. 1 is a flow chart of a specific embodiment of the topic discovery and evolution tracking method of the present invention. As shown in Fig. 1, the workflow of this embodiment comprises the following steps:
A1: Download the scientific documents of a given discipline and organize the resulting document metadata.
In this step, all papers (excluding articles by the editor-in-chief) published from January 1995 to September 2012 in IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI), a top international journal in pattern recognition and image processing, are downloaded, giving 2719 research documents in total. Each collected record is organized into document metadata comprising the document ID pid, the publication time year (accurate to the year), the content text (title, keywords and abstract only), and the citation sequence cit (the references that belong to the downloaded collection). After the records of all 2719 downloaded documents have been organized into document metadata, proceed to step A2.
A2: Preprocess the document metadata obtained in A1 to obtain the document metadata set.
In this step, the document metadata obtained in A1 is preprocessed, including removing stop words and removing low-frequency words that occur fewer than 5 times across all documents. Preprocessing yields a vocabulary V of 881 terms, a document word-frequency matrix D = [d_ij]_{2719×881} over the 2719 documents and 881 words (where d_ij is the frequency of the j-th word in the i-th document), and a citation relation matrix C = [c_mn]_{2719×2719} between the 2719 documents (where c_mn indicates whether the m-th and n-th documents have a citation relation: c_mn = 1 indicates a citation relation, and otherwise there is none). After preprocessing, proceed to step A3.
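A small Python sketch of how the citation relation matrix C might be assembled from the per-document citation sequences cit is given below. It assumes NumPy, documents numbered 0 to n-1, and the convention that row m marks the documents cited by document m; the specification does not state the direction, so that convention is an assumption.

```python
import numpy as np

def citation_matrix(cit):
    """Build a 0/1 matrix from {doc_id: [ids of cited documents in the collection]}."""
    n = len(cit)
    c = np.zeros((n, n), dtype=int)
    for m, cited in cit.items():
        for k in cited:
            c[m, k] = 1  # document m cites document k (assumed direction)
    return c

# Hypothetical collection of three documents, numbered by publication order.
c = citation_matrix({0: [], 1: [0], 2: [0, 1]})
```

Because IDs follow publication order and a paper can only cite earlier work, the matrix built this way is lower triangular in the toy example.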
A3: Extract topics with the topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions.
In this step, the citation and content information of the documents is used to build the topic discovery method for the collected scientific documents. It comprises three sub-steps:
1). Normalize matrix [C]_{2719×2719} by rows to obtain matrix M, and use non-negative matrix factorization to decompose M into two non-negative matrices [B]_{2719×10} and [H]_{10×2719}, i.e. M = B*H. The documents are thus decomposed into 10 large clusters; the number of decomposition iterations is 1000.
2). Row-normalize and transpose B and H to obtain matrices N and M respectively. Each element n_{i,j} of N represents the probability that topic (cluster) i contains cited document j, and each element m_{i,j} of M represents the probability that cited document i belongs to topic (cluster) j.
3). For each topic n_{i,j} generated in step 2), build an LDA probabilistic topic model based on the "bag of words" model from the content of the documents that make up the topic. In the LDA model a topic is a distribution over words, and the generative process is as follows: a topic z_{d,n} is generated according to the document-topic distribution θ_d, and a word of the document is then generated according to the topic-word distribution. The model parameters, namely the topic-word distribution and the document-topic distribution θ_{j,k}, are estimated by Gibbs sampling, where θ_d ~ Dir(α) and the topic-word distribution likewise has a Dirichlet prior. The resulting model parameters, the topic-word distribution and θ_{j,k}, constitute the topics. The model parameters are obtained from the Gibbs samples by the following formula: [formula not reproduced in this text]
In the formula, the two count terms denote the number of times word w is assigned to topic k and the number of words in document j assigned to topic k; α_k is the Dirichlet prior parameter vector of θ_{j,k}; θ_{j,k} is the probability of the k-th topic in document j; β_t is the Dirichlet prior parameter vector of the topic-word distribution, which gives the probability of the w-th term in topic k. Here K = 3, and α_k and β_t take the values 0.5 and 0.01 respectively.
In this embodiment, the sampling process tends to converge after 300 iterations of the Gibbs sampling in step 3).
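For orientation, a compact collapsed Gibbs sampler for LDA is sketched below in Python with NumPy. It uses the standard textbook update and point estimates with symmetric scalar priors, which is an assumption here: the specification's own formula is not reproduced in this text, and the function name lda_gibbs and the toy corpus are invented for the example.

```python
import numpy as np

def lda_gibbs(docs, n_vocab, n_topics=3, alpha=0.5, beta=0.01, n_iter=300, seed=0):
    """docs: list of lists of word ids. Returns (phi, theta) point estimates."""
    rng = np.random.default_rng(seed)
    n_kw = np.zeros((n_topics, n_vocab))    # times word w is assigned to topic k
    n_jk = np.zeros((len(docs), n_topics))  # tokens of document j assigned to topic k
    n_k = np.zeros(n_topics)                # total tokens assigned to topic k
    z = []
    for j, doc in enumerate(docs):          # random initial assignments
        zj = rng.integers(n_topics, size=len(doc))
        z.append(zj)
        for w, k in zip(doc, zj):
            n_kw[k, w] += 1
            n_jk[j, k] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for j, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[j][i]                 # remove the token's current assignment
                n_kw[k, w] -= 1
                n_jk[j, k] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic for this token.
                p = (n_kw[:, w] + beta) / (n_k + n_vocab * beta) * (n_jk[j] + alpha)
                k = rng.choice(n_topics, p=p / p.sum())
                z[j][i] = k                 # record the newly sampled assignment
                n_kw[k, w] += 1
                n_jk[j, k] += 1
                n_k[k] += 1
    phi = (n_kw + beta) / (n_k[:, None] + n_vocab * beta)
    theta = (n_jk + alpha) / (n_jk.sum(axis=1, keepdims=True) + n_topics * alpha)
    return phi, theta

# Toy corpus over a 4-word vocabulary (hypothetical data).
docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
phi, theta = lda_gibbs(docs, n_vocab=4, n_topics=2, n_iter=50)
```

Here phi holds one word distribution per topic and theta one topic distribution per document; the defaults mirror the values K = 3, 0.5, 0.01 and 300 iterations quoted in this embodiment.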
In this step, the multi-source topic discovery model computes 30 topics for the 2719 scientific documents. Each topic is described by two parts: (a) the 10 terms most relevant to the topic with their probabilities; (b) the 10 core papers most relevant to the topic with their probabilities. The term distributions and core-paper distributions of 2 representative topics of the 2719 documents are shown in Table 1 and Table 2 respectively:
Table 1: Word distributions of 2 topics
Table 1 gives two typical examples of the topics obtained on the data set of this embodiment: the 10 highest-probability words, with their probabilities, of topic 18 "shape representation" and topic 3 "classification algorithms". Table 1 shows that the high-frequency words in topic 18, such as "recognition", "shape", "image", "affine" and "invariant", relate to shape representation, while the high-frequency words in topic 3, such as "classification", "feature", "nearest" and "neighbor", all relate to classification algorithms.
Table 2: Paper distributions of 2 topics
After the word probability distributions and document probability distributions of the 30 topics of the 2719 documents in the PAMI data set have been extracted by the above method, proceed to step A4.
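The per-topic descriptions in Tables 1 and 2 amount to listing the highest-probability entries of each distribution. A minimal Python sketch, assuming NumPy and a topic-by-word probability matrix named phi (a name chosen for this example), is:

```python
import numpy as np

def top_terms(phi, vocab, n=10):
    """For each topic row of phi, return the n most probable (term, probability) pairs."""
    result = []
    for row in phi:
        order = np.argsort(row)[::-1][:n]  # indices sorted by descending probability
        result.append([(vocab[i], float(row[i])) for i in order])
    return result

# Hypothetical 2-topic, 3-word example.
phi = np.array([[0.1, 0.6, 0.3],
                [0.5, 0.2, 0.3]])
ranked = top_terms(phi, ["image", "shape", "affine"], n=2)
```

The same function applied to a topic-by-document matrix would list the core papers of each topic.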
A4: Divide the 30 extracted topics on the time axis to form sub-topics in different time periods.
Topic division mainly uses the time information of the documents: each topic is projected onto the different time periods to form the sub-topics on each period. The division scheme is shown in Fig. 2: the number of time periods is 6, the documents span 1995 to 2012, and each time period covers 3 years. Table 3 below lists the result of dividing topic 18 "shape representation" under this scheme. To save space, only the 4 highest-probability words and 4 documents are listed for each sub-topic; documents are represented by their id, and Table 4 gives the titles corresponding to the ids.
Table 3: Time division result of topic 18 "shape representation"
Table 4: Document ids and their corresponding titles in topic 29 "3D reconstruction"
A5: Compute the topic correlation using the correlation measure between topics and track the topic evolution paths to obtain the evolution of the research topics. This step is implemented as follows:
1). For two topics z_i and z_j in any two adjacent time intervals, compute the relation between the two topics from the word distribution of each topic and the core-document distribution of each topic. [Formula not reproduced in this text.] Here μ = 0.5.
2). Compute the correlation measure of step 1) for every pair of topics in adjacent time periods, and establish a directed edge between two topics whose measure exceeds a given threshold (0.2 here). The direction of the edge is determined by the time order of the topics. The evolution graph between topics is built in this way.
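The graph construction of sub-step 2) can be sketched in Python as follows. The relation score is passed in as a callable, since its formula is not reproduced in this text; the function name, the sub-topic labels and the scores are all hypothetical.

```python
def build_evolution_graph(periods, relation, threshold=0.2):
    """Return directed edges (earlier, later, score) between adjacent time periods.

    periods:  list, ordered in time, of lists of sub-topic ids.
    relation: callable (earlier_id, later_id) -> float.
    """
    edges = []
    for earlier, later in zip(periods, periods[1:]):
        for a in earlier:
            for b in later:
                score = relation(a, b)
                if score > threshold:
                    edges.append((a, b, score))  # edge points forward in time
    return edges

# Hypothetical sub-topics in three consecutive periods with made-up scores.
periods = [["t4@p1", "t5@p1"], ["t4@p2"], ["t6@p3"]]
scores = {("t4@p1", "t4@p2"): 0.7, ("t5@p1", "t4@p2"): 0.1,
          ("t4@p2", "t6@p3"): 0.25, ("t4@p1", "t6@p3"): 0.9}
edges = build_evolution_graph(periods, lambda a, b: scores.get((a, b), 0.0))
```

Only pairs in adjacent periods are compared, so the high score between the first and third periods in the toy data produces no edge.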
Through this step, the evolution over time, between 1995 and 2012, of the 30 topics discovered in the 2719 documents of this embodiment's data set is obtained. The result helps researchers understand how the important research topics of image processing and pattern recognition evolve over time. Fig. 3 gives the result of tracking the evolution of the 30 topics. The topic evolution graph on the PAMI data set divides into four parts, corresponding to four different research directions: image segmentation, face recognition, handwritten digit recognition and tracking. The numbers on the nodes are topic numbers, and the edges between nodes are the evolution paths between topics. Each of the four separate parts contains one or more topic evolution paths: the image segmentation direction has as many as seven different evolution paths, concentrated mainly on topics 4, 5, 6, 7 and 17; the face recognition direction has six, concentrated mainly on topics 13, 14 and 15; the handwritten digit recognition direction has three, concentrated mainly on topics 10 and 26; and the tracking direction has only one, concentrated mainly on topics 20 and 21. To save space, the detailed keyword information of each topic is listed in Table 5.
Table 5: Top 10 keywords of each topic in the different time periods in Fig. 3
In this embodiment, all papers (excluding articles by the editor-in-chief) published from January 1995 to September 2012 in IEEE Transactions on Pattern Analysis and Machine Intelligence, a top international journal in pattern recognition and image processing, are downloaded, giving 2719 research documents in total. The raw data is organized into document metadata and preprocessed to obtain the metadata set. The multi-source topic discovery method based on citations and content extracts 30 research topics, giving the topic-word and topic-document probability distributions. Combining the time information of the documents with the extracted topics, the 30 topics are divided into 6 time periods between 1995 and 2012, each covering 3 years, forming 180 sub-topics in total. From the word distribution and document distribution of each topic, the proposed topic correlation measure yields the evolution graph of the research topics. Through the above steps, the important research topics of the chosen field and the way they evolve over time are discovered, which has considerable practical significance.
In practice, perplexity is a standard measure of a model's generalization ability: the smaller the perplexity, the stronger the generalization ability. To evaluate the generalization ability of the multi-source topic discovery model of the present invention, this embodiment further divides the 2719 documents into two parts, with 1360 documents as the training set and 1359 documents as the test set. In the topic discovery model of the present invention, the perplexity of the documents in the test set D_test is computed as follows: [formula not reproduced in this text]
In the formula, N_d is the number of words in document d, w_d = (w_1d, w_2d, ..., w_id, ..., w_nd) is the word vector of document d, and M is the total number of documents in the test set, here 1359.
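The perplexity formula itself is not reproduced in this text. As a hedged illustration, the Python sketch below uses the commonly used definition, exp(-Σ_d log p(w_d) / Σ_d N_d), with p(w) = Σ_k θ_{d,k} φ_{k,w}; whether the specification uses exactly this form is an assumption, and the names phi and theta are chosen for the example.

```python
import math

def perplexity(test_docs, phi, theta):
    """Perplexity of test documents under topic-word (phi) and document-topic (theta) tables.

    test_docs: list of lists of word ids; phi[k][w] and theta[d][k] are probabilities.
    """
    log_lik = 0.0
    n_words = 0
    for d, doc in enumerate(test_docs):
        for w in doc:
            # Mixture probability of word w in document d.
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p_w)
        n_words += len(doc)
    return math.exp(-log_lik / n_words)

# Sanity check: uniform word distributions over 4 words give perplexity 4.
phi = [[0.25] * 4, [0.25] * 4]
theta = [[0.5, 0.5], [0.3, 0.7]]
value = perplexity([[0, 1], [2, 3, 3]], phi, theta)
```

A model that spreads probability uniformly over V words has perplexity V, so lower values indicate that the model concentrates probability on the words that actually occur.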
Fig. 4 gives the perplexity comparison of three models: the topic discovery model of this embodiment, the standard LDA topic model based on document content, and the LDA topic model based on citation relations (see "Wang, X., Zhai, C., Roth, D., 2013. Understanding evolution of research themes: a probabilistic generative model for citations. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, p.1115-1123."). Fig. 4 shows that the topic discovery model of this embodiment has lower perplexity than the other two models, i.e. better generalization ability. Moreover, when the number of topics exceeds 30, the perplexity of all three models remains essentially unchanged, which indicates that choosing 30 topics in this embodiment is appropriate and reasonably reflects the true number of topics in the TPAMI data set.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims (6)

1. A method for discovering scientific document topics and tracking their evolution, characterized by comprising the following steps:
A1: download the scientific documents of a given discipline and organize the resulting document metadata.
A2: preprocess the document data downloaded in A1 to form a document data set S.
A3: for the document data set S formed in A2, extract topics using a topic discovery method based on citation and content information, and obtain the topic-word distributions and the topic-document distributions.
A4: using the time information of all documents belonging to a given topic, split the extracted topics along the time axis to form sub-topics in different time periods.
A5: compute the relevance between topics using a topic relevance measure, and track the paths along which topics evolve to obtain the evolution graph of the research topics.
Step A3 above specifically comprises the following sub-steps:
A31. Build the citation matrix [M]m*m from the citation relationships among the documents of the document data set S, where m is the number of documents that have citation relationships. Normalize the matrix M by rows and decompose it by non-negative matrix factorization into two non-negative matrices [B]m*z and [H]z*m, i.e., M=B*H, where z is the number of columns of B and the number of rows of H after decomposition.
A32. Row-normalize the matrices B and H, respectively, to obtain the matrices C and M. Each element c_{i,j} of matrix C represents the probability that topic (cluster) i contains the cited document j, and each element m_{i,j} of matrix M represents the probability that cited document i belongs to topic (cluster) j.
A33. For each topic c_{i,j} generated in step A32, build an LDA probabilistic topic model based on the bag-of-words model using the content of the documents constituting that topic. The topic is treated as an LDA probabilistic topic model over a set of words, whose generative process is as follows: a topic z_{d,n} is generated according to the document-topic distribution θ_d, and the words of the document are then generated according to the topic-word distribution. The model parameters, namely the topic-word distribution φ_{k,w} and the document-topic distribution θ_{j,k}, are estimated by Gibbs sampling, where θ_d ~ Dir(α) and φ_k ~ Dir(β), and α and β are the parameters of the Dirichlet distributions. The resulting topic model parameters φ_{k,w} and θ_{j,k} constitute the topic.
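Sub-steps A31 and A32 can be sketched as follows. The claim does not say which non-negative matrix factorization algorithm is used. This sketch assumes the Lee-Seung multiplicative update rules for the squared-error objective, and the helper names (`row_normalize`, `nmf`, `doc_topic`, `topic_doc`) and the toy citation matrix are illustrative only.

```python
import random


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def row_normalize(a):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    out = []
    for row in a:
        s = sum(row)
        out.append([x / s if s > 0 else 0.0 for x in row])
    return out


def nmf(m, z, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix m (rows x cols) into b (rows x z) and h (z x cols).

    Uses multiplicative updates (an assumed choice, not taken from the claim).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(m), len(m[0])
    b = [[rng.random() + eps for _ in range(z)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(z)]
    for _ in range(iters):
        # H <- H * (B^T M) / (B^T B H), element-wise
        bt = transpose(b)
        num, den = matmul(bt, m), matmul(matmul(bt, b), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(z)]
        # B <- B * (M H^T) / (B H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(m, ht), matmul(b, matmul(h, ht))
        b = [[b[i][j] * num[i][j] / (den[i][j] + eps) for j in range(z)]
             for i in range(n_rows)]
    return b, h


# Toy citation matrix: documents 0 and 1 cite each other, as do 2 and 3.
citation = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
b, h = nmf(row_normalize(citation), z=2)
doc_topic = row_normalize(b)  # m x z: each row is a distribution over clusters
topic_doc = row_normalize(h)  # z x m: each row is a distribution over documents
```

The two normalized factors play the role of the probability matrices described in A32; the sketch names them by shape rather than by the letters used in the claim.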
The splitting of topics along the time axis in step A4 above mainly uses the time information of the documents belonging to a given topic to split the topic over different time periods, forming the sub-topics in each time period, where K is the number of topics and P is the number of time periods. The specific time division scheme is as follows: given the number of time periods P, the start time t_0 and the end time t_s of the documents, the length of each time period is (t_s - t_0)/P.
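The equal-width division can be illustrated with a short sketch. It assumes document times are plain numbers (for example publication years) and that a document published exactly at t_s falls into the last period. The function name `split_into_periods` and the sample years are hypothetical.

```python
def split_into_periods(doc_times, num_periods):
    """Assign the documents of one topic to equal-width time periods.

    doc_times[i] is the publication time of document i.
    Returns the period length (t_s - t_0) / P and, per period, the document ids.
    """
    t0, ts = min(doc_times), max(doc_times)
    width = (ts - t0) / num_periods
    periods = [[] for _ in range(num_periods)]
    for doc_id, t in enumerate(doc_times):
        if width > 0:
            # Clamp so that t == ts lands in the last period instead of overflowing.
            idx = min(int((t - t0) / width), num_periods - 1)
        else:
            idx = 0  # all documents share one timestamp
        periods[idx].append(doc_id)
    return width, periods


# Documents of one topic, published between 2000 and 2010, split into P = 2 periods.
width, periods = split_into_periods([2000, 2003, 2005, 2009, 2010], 2)
```

Applying the same split to the documents of each of the K topics gives one sub-topic per topic and period.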
The topic evolution analysis in step A5 above specifically comprises:
A51. For any two topics z_i and z_j in two adjacent time intervals, compute the relation between the two topics using the word distribution of each topic and the distribution of the core documents of each topic;
A52. Compute the relevance measure as in step A51 for any two topics in adjacent time periods, and create a directed edge between two topics whose measure exceeds a certain threshold, with the direction of the edge determined by the temporal order of the topics, thereby building the evolution graph among topics.
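The graph construction in A52 can be sketched as below. The relevance formula of A51 is not reproduced in this text, so the sketch takes the measure as a pluggable function and uses cosine similarity between topic-word vectors purely as a stand-in. That default is an assumption, not the measure of the claims, and all names are illustrative.

```python
import math


def cosine(p, q):
    """Stand-in relevance measure (assumed): cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def build_evolution_graph(periods, threshold, similarity=cosine):
    """Link topics in adjacent time periods whose relevance exceeds the threshold.

    periods[p] is the list of topic vectors in time period p.
    Each edge is ((p, i), (p + 1, j), score) and points from earlier to later.
    """
    edges = []
    for p in range(len(periods) - 1):
        for i, z_i in enumerate(periods[p]):
            for j, z_j in enumerate(periods[p + 1]):
                score = similarity(z_i, z_j)
                if score > threshold:
                    edges.append(((p, i), (p + 1, j), score))
    return edges


# Two periods with two topics each; each topic continues into its close counterpart.
toy_periods = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]
toy_edges = build_evolution_graph(toy_periods, threshold=0.5)
```

Because only adjacent periods are compared and edges always run forward in time, the result is a directed acyclic graph whose paths trace how a topic evolves.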
2. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the metadata organized for each document in step A1 comprises: the document ID, the publication time of the document, the document content consisting only of the title, keywords and abstract, the citation information of the document, etc.
3. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the preprocessing of the document metadata in step A2 specifically comprises: removing stop words, removing low-frequency words that occur fewer than 5 times across all documents, building the document-word frequency matrix, building the vocabulary of all documents, and building the citation relationship matrix among the documents in the data set.
4. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the model parameters φ_{k,w} and θ_{j,k} in step A33 are computed as follows:
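A minimal sketch of the preprocessing listed in claim 3 follows. It assumes whitespace tokenization and lowercasing, which the claim does not specify, and represents citations as (citing, cited) index pairs. The function name `preprocess` and its parameters are hypothetical.

```python
from collections import Counter


def preprocess(docs, citations, stop_words, min_count=5):
    """Build the vocabulary, document-word frequency matrix and citation matrix.

    docs: list of document texts (title, keywords and abstract joined together).
    citations: iterable of (citing_doc_index, cited_doc_index) pairs.
    Words occurring fewer than min_count times over all documents are dropped.
    """
    # Tokenization by whitespace is an assumption made for this sketch.
    tokenized = [[w for w in text.lower().split() if w not in stop_words]
                 for text in docs]
    totals = Counter(w for doc in tokenized for w in doc)
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    index = {w: i for i, w in enumerate(vocab)}

    doc_term = [[0] * len(vocab) for _ in docs]
    for d, doc in enumerate(tokenized):
        for w in doc:
            if w in index:
                doc_term[d][index[w]] += 1

    n = len(docs)
    cite = [[0] * n for _ in range(n)]
    for src, dst in citations:
        cite[src][dst] = 1  # document src cites document dst
    return vocab, doc_term, cite


# Toy run with a lowered threshold so the tiny corpus keeps some words.
vocab, doc_term, cite = preprocess(
    ["the topic model topic", "the citation model"],
    citations=[(0, 1)],
    stop_words={"the"},
    min_count=2,
)
```

With the default `min_count=5` the function applies the threshold stated in the claim; the toy call lowers it only because the sample corpus is two short strings.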
$$\varphi_{k,w} = \frac{n_k^{(w)} + \beta_w}{\sum_{w=1}^{V}\left(n_k^{(w)} + \beta_w\right)}$$
$$\theta_{j,k} = \frac{n_j^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_j^{(k)} + \alpha_k\right)}$$
where n_k^{(w)} denotes the number of occurrences of word w assigned to topic k, n_j^{(k)} denotes the number of occurrences of topic k in document j, α_k is the Dirichlet prior parameter vector of θ_{j,k}, θ_{j,k} denotes the probability of the k-th topic in document j, β_w is the Dirichlet prior parameter vector of φ_{k,w}, and φ_{k,w} denotes the probability of the w-th term in topic k.
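Both estimates have the same form: add the Dirichlet prior to the Gibbs-sampling counts and normalize each row. A minimal sketch, with the function name `smoothed_rows` and the toy counts chosen only for illustration:

```python
def smoothed_rows(counts, prior):
    """Normalize each row of counts after adding the Dirichlet prior.

    With counts[k][w] = n_k^(w) and prior = beta this returns phi[k][w].
    With counts[j][k] = n_j^(k) and prior = alpha this returns theta[j][k].
    """
    result = []
    for row in counts:
        denom = sum(n + a for n, a in zip(row, prior))
        result.append([(n + a) / denom for n, a in zip(row, prior)])
    return result


# One topic over a two-word vocabulary: counts (3, 1), symmetric prior 0.5.
phi_example = smoothed_rows([[3, 1]], [0.5, 0.5])
# One document over two topics: counts (0, 4), symmetric prior 1.0.
theta_example = smoothed_rows([[0, 4]], [1.0, 1.0])
```

The prior keeps every probability strictly positive, so a word that never appears under a topic in the sample still receives a small non-zero probability.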
5. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the time division scheme in step A4 is as follows: given the number of time periods S into which the time span is divided, and the start time t_0 and end time t_s of the documents contained in the data set, the length of each time period is (t_s - t_0)/S, yielding the sub-topics in each time period.
6. The method for discovering scientific document topics and tracking their evolution according to claim 1, characterized in that the formula for computing the relation between two topics in step A51 is:

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610913510.7A CN106570088A (en) 2016-10-20 2016-10-20 Discovering and evolution tracking method for scientific research document topics

Publications (1)

Publication Number Publication Date
CN106570088A true CN106570088A (en) 2017-04-19

Family

ID=58533362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610913510.7A Pending CN106570088A (en) 2016-10-20 2016-10-20 Discovering and evolution tracking method for scientific research document topics

Country Status (1)

Country Link
CN (1) CN106570088A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390051A (en) * 2013-07-25 2013-11-13 南京邮电大学 Topic detection and tracking method based on microblog data
CN103984681A (en) * 2014-03-31 2014-08-13 同济大学 News event evolution analysis method based on time sequence distribution information and topic model
CN105335349A (en) * 2015-08-26 2016-02-17 天津大学 Time window based LDA microblog topic trend detection method and apparatus
CN105956130A (en) * 2016-05-09 2016-09-21 浙江农林大学 Multi-information fusion scientific research literature theme discovering and tracking method and system thereof
CN106021222A (en) * 2016-05-09 2016-10-12 浙江农林大学 Analysis method and device for scientific research literature theme evolution

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307278A (en) * 2020-10-26 2021-02-02 中国科学院计算技术研究所 Real-time generation method and system for topic venation of any scale
CN112307278B (en) * 2020-10-26 2024-02-23 中国科学院计算技术研究所 Topic context real-time generation method and system with arbitrary scale
CN117891959A (en) * 2024-03-15 2024-04-16 中国标准化研究院 Document metadata storage method and system based on Bayesian network
CN117891959B (en) * 2024-03-15 2024-05-10 中国标准化研究院 Document metadata storage method and system based on Bayesian network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170419

RJ01 Rejection of invention patent application after publication