CN106570088A - Discovering and evolution tracking method for scientific research document topics - Google Patents
- Publication number
- CN106570088A (application CN201610913510.7A)
- Authority
- CN
- China
- Prior art keywords
- topic
- document
- topics
- time
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for discovering scientific research topics in the literature and tracking their evolution. The method comprises: downloading the scientific literature of a given discipline, organizing the resulting document metadata, and preprocessing it to obtain a document metadata set; extracting topics with a topic discovery method based on citation and content information, yielding the topic-word distributions and the topic-document distributions; dividing the extracted topics along the time axis to form sub-topics in different time periods; and finally computing the correlation between topics and tracking the topic evolution paths to obtain an evolution graph of the research topics. Because topics are discovered from both the document text and the citation information, the resulting topics are of higher quality and better reflect actual conditions. The method discovers important research topics and tracks how they evolve over time, so that researchers can quickly grasp the research topics of a field and their development.
Description
Technical field
The present invention relates to knowledge discovery and data mining techniques for the scientific research domain, and in particular to a method for discovering topics in scientific literature and tracking their evolution.
Background art
Scientific literature records important academic research results and is the vehicle for academic dissemination and exchange. Scientific results are cumulative: most of them are continued improvements on the results of earlier researchers. With the appearance of electronic literature index databases such as PubMed and DBLP and the development of the Internet, the amount of accumulated scientific literature keeps growing. Faced with this flood of literature, researchers, and newcomers in particular, urgently want to find the important research topics of their own field quickly and to follow how these topics evolve over time. Automatic research topic discovery and evolution tracking can help researchers quickly grasp how research topics develop and change, and therefore has significant practical value.
Current topic discovery models, in China and abroad, are derived from the LDA topic model [1]. LDA is a three-layer document-topic-word model. Its main assumptions are the exchangeability of documents and the exchangeability of words, i.e. the "bag of documents" and "bag of words" models. LDA regards each document in the corpus as a distribution over latent topics, and each topic as a distribution over words; both distributions have Dirichlet priors. Compared with the PLSA model [2], LDA is a fully Bayesian model and gives more accurate estimates for unseen documents and words; PLSA can be regarded as a MAP estimate of LDA and suffers from overfitting. Improvements to LDA currently fall into three areas: (1) modeling relations between documents or between words, so that documents and words are no longer exchangeable [3-8]; (2) learning the number of topics adaptively by introducing non-parametric Bayesian models [9-12]; (3) introducing additional information besides the text to enable supervised or semi-supervised learning and improve the performance of topic discovery models [13-16]. From another perspective, classifying by whether the model structure is hierarchical and whether learning is supervised, topic models fall into four classes: 1) unsupervised, non-hierarchical topic models; 2) unsupervised, hierarchical topic models; 3) supervised, non-hierarchical topic models; 4) supervised, hierarchical topic models.
According to how time is treated in topic evolution, existing research topic evolution analysis methods fall into two broad classes: discrete-time topic evolution methods and continuous-time topic evolution methods.
The general procedure of discrete-time topic evolution methods is as follows: (1) divide the text corpus into subsets according to the time labels; (2) extract topics from each subset with a probabilistic topic model; (3) establish evolution relations between the topics of different subsets according to a measure of relatedness between topics; (4) form the topic evolution graph. Depending on the probabilistic topic model used, these methods fall into two classes. The first class uses parametric Bayesian models, in which the number of topics is fixed, for example TTM (Temporal Text Mining) [17], DTM (Dynamic Topic Model) [18] and MTTM (Multiscale Topic Tomography Model) [19]. The second class uses non-parametric Bayesian models, in which the number of topics is not fixed, for example TDPM (Temporal Dirichlet Process Mixture Model) [20] and iDTM (infinite Dynamic Topic Model) [21]. Discrete-time topic evolution models require the document set to be divided by time. Such a manual division is hard to make rigorous and accurate, because different types of documents may call for different divisions, and the choice often affects the final evolution result. To address this, some researchers have proposed a different approach in which time is taken into account during topic modeling itself: time is treated as a variable, and the resulting topics are distributions over both words and time. Models of this kind include Topics Over Time (TOT) [22], the continuous time Dynamic Topic Model (cDTM) [23], the Trend Analysis Model (TAM) [24] and non-parametric Topics Over Time (npTOT) [25].
The vast majority of existing research topic discovery models do not make full use of the multi-source structured information in scientific literature (such as the content, citations, authors and source journals of a document) to discover research topics. To address this, the method of the present invention uses both the citation information and the content information of scientific documents to discover research topics, and achieves better results than existing methods that rely on a single type of information. In addition, the topic evolution analyses produced by existing topic evolution models mostly concern how the same topic evolves across time periods; few existing techniques analyze evolution between different topics across time periods. The present invention addresses the problem of tracking how the different research topics of a field evolve over time. A further difference from the prior art is that the present invention first extracts the topics and then splits them over time, which avoids the topic alignment problem that arises when the corpus is discretized first and topics are extracted afterwards.
Taking scientific literature as the object of study, discovering important research topics and tracking their evolution is of great significance to knowledge discovery and data mining in scientific literature, and also plays an important role in helping researchers carry out their work and in promoting the development of scientific research.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings of existing research topic discovery and evolution tracking techniques by providing a method for discovering research topics and tracking their evolution. The method combines the citation and content information of scientific documents to discover research topics and tracks the evolution between different research topics. It achieves better topic detection than existing methods and realizes the goal of tracking evolution paths between different research topics.
To solve the above technical problem, the invention provides a method for discovering topics in scientific literature and tracking their evolution, comprising the following steps:
A1. Download the scientific literature of a given discipline and organize the resulting document metadata;
A2. Preprocess the document data to form a document data set;
A3. Extract topics using a topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions;
A4. Split the discovered topics along the time axis to form sub-topics in different time periods;
A5. Measure the correlation between sub-topics and track the topic evolution paths.
The metadata obtained for each document in step A1 include: the document ID (IDs are assigned in order of publication time), the publication time, the content of the document, and the citation relation matrix of the documents.
The preprocessing in step A2 includes: removing stop words, digits and non-English characters; stemming the words; removing low-frequency words that occur fewer than 5 times across all documents; building the document-term frequency matrix of the data set; building the vocabulary; and building the citation relation matrix of the documents.
Step A3 specifically comprises:
A31. Build the citation matrix $[M]_{m \times m}$ from the citation relations between the documents of the data set, where m is the number of documents that have citation relations. Normalize M by rows, then use non-negative matrix factorization to decompose M into two non-negative matrices $[B]_{m \times z}$ and $[H]_{z \times m}$, i.e. M = B*H, where z is the number of clusters.
A32. Row-normalize B and H to obtain matrices C and M respectively. Each element $c_{i,j}$ of C represents the probability that topic (cluster) i contains cited document j, and each element $m_{i,j}$ of M represents the probability that cited document i belongs to topic (cluster) j.
A33. For each topic $c_{i,j}$ generated in step A32, build an LDA probabilistic topic model, based on the "bag of words" model, from the content of the documents that make up the topic. Each topic is modeled as a set of words under the LDA probabilistic topic model. The generative process is: draw each topic assignment $z_{d,n}$ from the document-topic distribution $\theta_d$, then draw each word of the document from the topic-word distribution $\varphi_{z_{d,n}}$, where $\theta_d \sim \mathrm{Dir}(\alpha)$ and $\varphi_k \sim \mathrm{Dir}(\beta)$. Gibbs sampling is used to estimate the model parameters, i.e. the topic-word distribution $\varphi_{k,w}$ and the document-topic distribution $\theta_{j,k}$. The parameters $\varphi_{k,w}$ and $\theta_{j,k}$ of the resulting topic models constitute the topics.
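The citation factorization of steps A31 and A32 can be sketched as follows. This is only a minimal illustration under stated assumptions: the text does not name a particular NMF algorithm, so the Lee-Seung multiplicative updates for squared error are used as a stand-in, and the function names (`nmf`, `row_normalize`), the random initialization, the seed and the iteration count are all hypothetical choices rather than the patent's own.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def row_normalize(A):
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    out = []
    for row in A:
        s = sum(row)
        out.append([x / s for x in row] if s > 0 else list(row))
    return out

def nmf(M, z, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix M into B (m x z) and H (z x n).

    Assumption: Lee-Seung multiplicative updates (squared error);
    the source text only says "non-negative matrix factorization".
    """
    rng = random.Random(seed)
    m, n = len(M), len(M[0])
    B = [[rng.random() + 0.1 for _ in range(z)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(z)]
    for _ in range(iters):
        # Update H with B fixed.
        Bt = transpose(B)
        num, den = matmul(Bt, M), matmul(matmul(Bt, B), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)]
             for k in range(z)]
        # Update B with H fixed.
        Ht = transpose(H)
        num, den = matmul(M, Ht), matmul(B, matmul(H, Ht))
        B = [[B[i][k] * num[i][k] / (den[i][k] + eps) for k in range(z)]
             for i in range(m)]
    return B, H
```

A possible use: `B, H = nmf(row_normalize(citations), z=10)`, after which `row_normalize(B)` gives each document's weights over clusters and `row_normalize(H)` gives each cluster's weights over documents.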
In step A4, the topics are split along the time axis mainly by using the time information of the documents belonging to each topic: each topic is split over the different time periods, forming a sub-topic for each time period. The time division scheme is as follows: given the number of time periods P, the start time $t_0$ and the end time $t_s$ of the documents, the length of each time period is $(t_s - t_0)/P$.
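The time division above can be sketched in a few lines. This is a hedged illustration: the function names `period_index` and `split_topic` and the dictionary-based data layout are hypothetical, and placing a document dated exactly at the end time into the last period is an assumption, since the text does not say how boundaries are handled.

```python
def period_index(year, t0, ts, P):
    """Map a publication year to one of P equal-width periods over [t0, ts]."""
    width = (ts - t0) / P
    idx = int((year - t0) / width)
    # Assumed boundary rule: clamp so that year == ts lands in the last period.
    return min(max(idx, 0), P - 1)

def split_topic(doc_years, doc_weights, t0, ts, P):
    """Group the documents of one topic by time period.

    doc_years:   {doc_id: publication year}
    doc_weights: {doc_id: probability of the document under the topic}
    Returns a list of P dicts, one sub-topic document set per period.
    """
    periods = [dict() for _ in range(P)]
    for doc_id, weight in doc_weights.items():
        p = period_index(doc_years[doc_id], t0, ts, P)
        periods[p][doc_id] = weight
    return periods
```

With K topics and P periods, calling `split_topic` once per topic gives K × P sub-topic document sets, some of which may be empty.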
The topic evolution analysis in step A5 specifically comprises:
A51. For two topics $z_i$ and $z_j$ in any two adjacent time periods, compute the relatedness of the two topics from the word distribution of each topic and the core-paper distribution of each topic [formula not available].
A52. Compute the relatedness measure of step A51 for every pair of topics in adjacent time periods, and create a directed edge between any two topics whose measure exceeds a given threshold. The direction of the edge is determined by the temporal order of the topics. The topic evolution graph is built in this way.
Beneficial effects of the present invention: Exploiting the rich structural information contained in scientific literature, the invention uses both the text and the citation information of scientific documents to discover research topics, uses the time information of the core papers of each extracted topic to split the topics, and thereby tracks topic evolution. Compared with traditional research topic discovery and evolution tracking, the invention combines the text and the citation information of the literature to discover topics, so the resulting topics are of higher quality and better reflect reality. Moreover, the corpus does not have to be divided in advance; topics are extracted first and split afterwards, which avoids the topic alignment problem. The research topic discovery and evolution tracking provided by the embodiments of the invention make it possible to discover important research topics and to track how they evolve over time, helping researchers quickly grasp the research topics and the course of their evolution.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the scientific literature topic discovery and evolution tracking of the present invention;
Fig. 2 shows the scheme for dividing topics along the time axis in the present embodiment;
Fig. 3 shows the evolution analysis of the 30 topics in the present embodiment;
Fig. 4 compares the perplexity of LDA, citation LDA and the topic discovery method of the present invention in the present embodiment.
Detailed description of the embodiments
The method proposed by the present invention comprises several steps: acquisition and organization of scientific literature, preprocessing of the document information, research topic discovery based on multi-source information, topic splitting, and topic evolution tracking. Acquisition and organization obtains a quantity of scientific literature data and organizes it into a document corpus. Preprocessing removes stop words and low-frequency words and derives from the corpus the document-term frequency matrix, the vocabulary, and the document citation relation matrix. Research topic discovery based on multi-source information discovers the research topics and yields, for each topic, its topic-word distribution and its topic-document distribution. Topic splitting divides the extracted topics along the time axis to form sub-topics in different time periods. Topic evolution tracking mainly comprises measuring the relation between sub-topics in adjacent time periods and constructing the topic evolution graph.
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The embodiments described with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting the claims.
Fig. 1 is a flow chart of a specific embodiment of the scientific literature topic discovery and evolution tracking of the present invention. As shown in Fig. 1, the workflow of this embodiment comprises the following steps:
A1: Download the scientific literature of a given discipline and organize the resulting document metadata.
In this step, all papers (excluding editorials) published from January 1995 to September 2012 in IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI), a top international journal in the field of pattern recognition and image processing, are downloaded, giving 2719 research documents in total. Each collected document record is organized into document metadata comprising: the document ID pid; the publication time year (accurate to the year); the content text (title, keywords and abstract only); and the citation sequence cit (the references of the document that themselves fall within the downloaded collection). Once all 2719 downloaded document records have been organized into document metadata, proceed to step A2.
A2: Preprocess the document metadata obtained in A1 to obtain the document metadata set.
In this step, the document metadata set obtained in A1 is preprocessed by removing stop words and removing low-frequency words that occur fewer than 5 times across all documents. Preprocessing yields a vocabulary V of 881 terms, the document-term frequency matrix $D = [d_{ij}]_{2719 \times 881}$ formed by the 2719 documents and 881 words (where $d_{ij}$ is the frequency of the j-th word in the i-th document), and the citation relation matrix $C = [c_{mn}]_{2719 \times 2719}$ between the 2719 documents (where $c_{mn}$ indicates whether the m-th document and the n-th document have a citation relation: $c_{mn} = 1$ indicates a citation relation, and otherwise there is none). After the scientific documents have been preprocessed, proceed to step A3.
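The preprocessing of step A2 can be sketched as below. This is a simplified illustration, not the embodiment's actual code: the stop-word list is a tiny hypothetical stand-in, stemming is omitted, and the names `tokenize` and `build_matrices` as well as the input layout (a list of texts plus a list of cited document indices) are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "is", "on"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_matrices(docs, citations, min_count=5):
    """Build the vocabulary, document-term matrix D and citation matrix C.

    docs:      list of document texts (title, keywords, abstract)
    citations: citations[m] is the list of document indices cited by document m
    """
    tokens = [tokenize(d) for d in docs]
    totals = Counter(t for doc in tokens for t in doc)
    # Drop words that occur fewer than min_count times across all documents.
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    index = {w: j for j, w in enumerate(vocab)}
    D = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(tokens):
        for t in doc:
            if t in index:
                D[i][index[t]] += 1
    C = [[0] * len(docs) for _ in docs]
    for m, cited in enumerate(citations):
        for n in cited:
            C[m][n] = 1  # 1 marks a citation relation, 0 marks none
    return vocab, D, C
```

For a corpus of this size (2719 × 2719 for C), a sparse representation would likely be preferable to nested lists; the dense form is used here only to mirror the matrices named in the text.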
A3: Extract topics using the topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions.
In this step, the citation and content information of the documents is used to build the topic discovery method for the collected scientific literature, in three sub-steps:
1). Row-normalize the matrix $[C]_{2719 \times 2719}$ to obtain the matrix M, and use non-negative matrix factorization to decompose M into two non-negative matrices $[B]_{2719 \times 10}$ and $[H]_{10 \times 2719}$, i.e. M = B*H. This decomposes the documents into 10 large clusters; the number of factorization iterations is 1000.
2). Row-normalize (with transposition) B and H to obtain matrices N and M respectively. Each element $n_{i,j}$ of N represents the probability that topic (cluster) i contains cited document j, and each element $m_{i,j}$ of M represents the probability that cited document i belongs to topic (cluster) j.
3). For each topic $n_{i,j}$ generated in step 2), build an LDA probabilistic topic model, based on the "bag of words" model, from the content of the documents that make up the topic. Each topic is modeled as a set of words under the LDA probabilistic topic model. The generative process is: draw each topic assignment $z_{d,n}$ from the document-topic distribution $\theta_d$, then draw each word of the document from the topic-word distribution $\varphi_{z_{d,n}}$, where $\theta_d \sim \mathrm{Dir}(\alpha)$ and $\varphi_k \sim \mathrm{Dir}(\beta)$. Gibbs sampling is used to estimate the model parameters, i.e. the topic-word distribution $\varphi_{k,w}$ and the document-topic distribution $\theta_{j,k}$, which together constitute the topics. The estimates obtained by Gibbs sampling are:
$$\varphi_{k,w} = \frac{n_k^{(w)} + \beta_w}{\sum_{t=1}^{V} \left( n_k^{(t)} + \beta_t \right)}, \qquad \theta_{j,k} = \frac{n_j^{(k)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_j^{(k')} + \alpha_{k'} \right)}$$
where $n_k^{(w)}$ is the number of occurrences of word w assigned to topic k, $n_j^{(k)}$ is the number of words in document j assigned to topic k, $\alpha_k$ is the Dirichlet prior parameter vector of $\theta_{j,k}$, $\theta_{j,k}$ is the probability of the k-th topic in document j, $\beta_t$ is the Dirichlet prior parameter vector of $\varphi_{k,w}$, and $\varphi_{k,w}$ is the probability of the w-th term under topic k. Here K = 3, and $\alpha_k$ and $\beta_t$ are set to 0.5 and 0.01 respectively.
In the present embodiment, the sampling process in step 3) tends to converge after 300 iterations of Gibbs sampling.
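Sub-step 3) can be illustrated with a small collapsed Gibbs sampler for LDA. This is a sketch under assumptions rather than the embodiment's implementation: it uses symmetric scalar priors, the textbook collapsed sampling rule, and the usual smoothed-count estimates (count plus prior, divided by total plus summed prior). The function name `lda_gibbs`, the word-id input format and the seed are hypothetical; the defaults K = 3, alpha = 0.5, beta = 0.01 and 300 iterations are taken from the text.

```python
import random

def lda_gibbs(docs, V, K=3, alpha=0.5, beta=0.01, iters=300, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in range(V).
    Returns (phi, theta): phi[k][w] is the topic-word distribution,
    theta[j][k] is the document-topic distribution.
    """
    rng = random.Random(seed)
    n_kw = [[0] * V for _ in range(K)]   # tokens of word w assigned to topic k
    n_jk = [[0] * K for _ in docs]       # tokens of document j assigned to topic k
    n_k = [0] * K                        # all tokens assigned to topic k

    def bump(j, k, w, delta):
        n_kw[k][w] += delta
        n_jk[j][k] += delta
        n_k[k] += delta

    # Random initial topic assignment for every token.
    z = []
    for j, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[j]):
            bump(j, k, w, 1)

    for _ in range(iters):
        for j, doc in enumerate(docs):
            for n, w in enumerate(doc):
                bump(j, z[j][n], w, -1)  # remove the current assignment
                weights = [(n_kw[k][w] + beta) / (n_k[k] + V * beta)
                           * (n_jk[j][k] + alpha) for k in range(K)]
                z[j][n] = rng.choices(range(K), weights=weights)[0]
                bump(j, z[j][n], w, 1)   # record the new assignment

    phi = [[(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
           for k in range(K)]
    theta = [[(n_jk[j][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for j, doc in enumerate(docs)]
    return phi, theta
```

In the setting described here, one such model would presumably be fitted per citation cluster, using only the documents assigned to that cluster.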
In this step, the multi-information-fusion topic discovery model yields 30 topics for the 2719 scientific documents. Each topic is described by two parts: (a) the top 10 terms most relevant to the topic and their probabilities; (b) the top 10 core documents most relevant to the topic and their probabilities. The topic-term distributions and topic core-paper distributions of 2 representative topics are shown in Table 1 and Table 2 respectively:
Table 1: Word distributions of 2 topics
Table 1 gives two typical examples of the topics obtained on the scientific literature data set used in this embodiment: the 10 highest-probability words, with their probabilities, of topic 18 "shape representation" and topic 3 "classification algorithms". Table 1 shows that the high-frequency words of topic 18, such as "recognition", "shape", "image", "affine" and "invariant", relate to shape representation, while the high-frequency words of topic 3, such as "classification", "feature", "nearest" and "neighbor", all relate to classification algorithms.
Table 2: Paper distributions of 2 topics
After the topic-word probability distributions and topic-document probability distributions of the 30 topics have been extracted from the 2719 documents of the PAMI data set by the above method, proceed to step A4.
A4: Divide the 30 extracted topics along the time axis to form sub-topics in different time periods.
The research topic division mainly uses the time information of the documents: each topic is projected onto the different time periods, forming a sub-topic for each time period. The time division scheme is shown in Fig. 2: the number of time periods is 6, the documents span 1995 to 2012, and the length of each time period is 3 years. Table 3 below lists the result of dividing topic 18 "shape representation" under this scheme. To save space, only the 4 highest-probability words and 4 documents are listed for each topic; documents are represented by their ids, and Table 4 gives the titles of the documents corresponding to the ids.
Table 3: Time division result of topic 18 "shape representation"
Table 4: Topic document ids and their corresponding titles in topic 29 "3D reconstruction"
A5: Compute topic relatedness using the relatedness measure between topics and track the topic evolution paths to obtain the evolution of the research topics. This step is carried out as follows:
1). For two topics $z_i$ and $z_j$ in any two adjacent time periods, compute the relatedness of the two topics from the word distribution of each topic and the core-paper distribution of each topic [formula not available]. Here μ = 0.5.
2). Compute the relatedness measure of step 1) for every pair of topics in adjacent time periods, and create a directed edge between any two topics whose measure exceeds a threshold (0.2 here). The direction of the edge is determined by the temporal order of the topics. The topic evolution graph is built in this way.
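The graph construction in step A5 can be sketched as follows. The relatedness formula itself is not available in this text, so the sketch substitutes an explicitly assumed form: a weighted sum, with weight mu, of the cosine similarity between the word distributions and the cosine similarity between the core-paper distributions. That form, the function names, and the dictionary layout of a sub-topic are all hypothetical; only mu = 0.5 and the threshold 0.2 come from the text. How the document-side vectors are made comparable across periods is also not specified, so both are simply taken as dictionaries over shared keys.

```python
import math

def cosine(p, q):
    """Cosine similarity of two sparse vectors given as {key: weight} dicts."""
    dot = sum(w * q.get(k, 0.0) for k, w in p.items())
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm > 0 else 0.0

def relatedness(a, b, mu=0.5):
    """ASSUMED stand-in: mu * word similarity + (1 - mu) * document similarity."""
    return (mu * cosine(a["words"], b["words"])
            + (1 - mu) * cosine(a["docs"], b["docs"]))

def evolution_edges(subtopics, mu=0.5, threshold=0.2):
    """Build directed edges between sub-topics in adjacent time periods.

    subtopics: {(topic_id, period): {"words": {...}, "docs": {...}}}
    Returns a list of (source, target, score), pointing from period p to p + 1.
    """
    edges = []
    for (i, p), a in subtopics.items():
        for (j, q), b in subtopics.items():
            if q == p + 1:
                score = relatedness(a, b, mu)
                if score > threshold:
                    edges.append(((i, p), (j, q), score))
    return edges
```

Any other similarity between distributions (for example one based on a divergence) could be swapped into `relatedness` without changing the edge-building loop.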
Carrying out this step yields the pattern of evolution over time, from 1995 to 2012, of the 30 topics discovered in the 2719 documents of the data set of this embodiment. The result helps researchers fully understand how the important research topics of image processing and pattern recognition have evolved over time. Fig. 3 shows the tracked evolution of the 30 topics. The topic evolution graph on the PAMI data set divides into four parts, corresponding to four different research directions: image segmentation, face recognition, handwritten character recognition, and tracking. The numbers on the nodes of the graph are topic indices, and the edges between nodes are evolution paths between topics. Each of the four separate parts, i.e. each research direction, contains one or more topic evolution paths: the image segmentation direction has as many as seven different evolution paths, concentrated on topics 4, 5, 6, 7 and 17; the face recognition direction has six different evolution paths, concentrated on topics 13, 14 and 15; the handwritten character recognition direction has three different evolution paths, concentrated on topics 10 and 26; and the tracking direction has only one evolution path, concentrated on topics 20 and 21. To save space, the detailed keywords of each topic are listed in Table 5.
Table 5: Top 10 keywords of each topic in Fig. 3 in the different time periods
In this embodiment, all papers (excluding editorials) published from January 1995 to September 2012 in IEEE Transactions on Pattern Analysis and Machine Intelligence, a top international journal in pattern recognition and image processing, were downloaded, giving 2719 research documents. The raw data were organized into document metadata, which were preprocessed to obtain the metadata set. The multi-source topic discovery method based on citations and content was used to extract 30 research topics, yielding the topic-word and topic-document probability distributions. Combining the time information of the documents with the extracted topics, the 30 topics were divided over 6 time periods between 1995 and 2012, each 3 years long, forming 180 sub-topics in total. From the word distribution and document distribution of each topic, the proposed relatedness measure was used to obtain the research topic evolution graph. Through these steps of discovering research topics and tracking their evolution, the important research topics of the chosen field and the way they evolve over time were uncovered, which is of considerable practical significance.
In practice, perplexity is the standard measure of a model's generalization ability: the lower the perplexity, the better the model generalizes. To evaluate the generalization ability of the multi-information-fusion research topic discovery model of the invention, this embodiment further divides the 2719 scientific documents into two parts, with 1360 documents as the training set and 1359 documents as the test set. In the topic discovery model of the invention, the perplexity of the scientific documents in the test set $D_{test}$ is computed as:
$$\mathrm{perplexity}(D_{test}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\}$$
where $N_d$ is the number of words in document d, $\mathbf{w}_d = (w_{1d}, w_{2d}, \ldots, w_{id}, \ldots, w_{nd})$ is the vector of words that make up document d, and M is the total number of documents in the test set, here 1359.
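Perplexity as commonly defined for topic models, the exponential of the negative total log-likelihood divided by the total number of test words, can be computed along these lines. The sketch assumes that the likelihood of each word is the mixture of the topic-word probabilities weighted by the document's topic proportions, and that those proportions are already available for the test documents; how they are inferred for held-out documents is not described in the text. The function name `perplexity` and the argument layout are hypothetical.

```python
import math

def perplexity(test_docs, phi, theta):
    """Perplexity of a topic model on held-out documents.

    test_docs:   list of documents, each a list of word ids
    phi[k][w]:   probability of word w under topic k
    theta[d][k]: probability of topic k in test document d (assumed given)
    """
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(test_docs):
        for w in doc:
            # p(w | d) as a mixture over topics
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p_w)
        n_words += len(doc)
    return math.exp(-log_lik / n_words)
```

As a sanity check, a model that spreads probability uniformly over a vocabulary of V words has perplexity V regardless of the topic proportions.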
Fig. 4 shows the experimental comparison of the perplexity of three models: the topic discovery model of this embodiment, the standard LDA topic model based on document content, and the citation-based LDA topic model (see "Wang, X., Zhai, C., Roth, D., 2013. Understanding evolution of research themes: a probabilistic generative model for citations. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, p. 1115-1123."). Fig. 4 shows that the topic discovery model of this embodiment has lower perplexity than the two baseline models, i.e. better generalization ability. Moreover, when the number of topics exceeds 30, the perplexity of all three models stays essentially unchanged, which indicates that choosing 30 topics in this embodiment is appropriate and reasonably reflects the true number of topics contained in the TPAMI data set.
The above are only some embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the invention.
Claims (6)
1. A method for discovering topics in scientific literature and tracking their evolution, characterized in that it comprises the following steps:
A1, download the scientific literature of a given subject area and organize the resulting document metadata.
A2, preprocess the document data downloaded in A1 to form a document data set S.
A3, on the document data set S formed in A2, extract topics using a topic discovery method based on citation and content information, obtaining the topic-word distributions and the topic-document distributions.
A4, using the time information of all documents belonging to each topic, divide the extracted topics along the time axis to form sub-topics in different time periods.
A5, compute topic relatedness using the relatedness measure between topics and track the topic evolution paths to obtain the evolution graph of the research topics.
Step A3 above specifically comprises the following sub-steps:
A31. Build a citation matrix [M] m*m from the citation relations among the documents of data set S, where m is
the number of documents that have citation relations. Normalize matrix M by rows, then use non-negative matrix
factorization to decompose M into two non-negative matrices [B] m*z and [H] z*m, where z is the number of
columns of B and of rows of H after decomposition, i.e. M=B*H.
A32. Normalize matrices B and H by rows to obtain matrices C and M, respectively. Each element c_{i,j} of C
represents the probability that topic (cluster) i contains cited document j, and each element m_{i,j} of M
represents the probability that citing document i belongs to topic (cluster) j.
A33. For each topic generated in step A32, build an LDA probabilistic topic model based on the bag-of-words
model from the content of the documents that make up the topic. The LDA model treats a topic as a probability
distribution over a set of words. Its generative process is as follows: a topic z_{d,n} is drawn according to the
document-topic distribution θ_d, and a word of the document is then drawn according to the topic-word
distribution of that topic. The model parameters, namely the topic-word distribution and the document-topic
distribution θ_{j,k}, are estimated by Gibbs sampling, where θ_d ~ Dir(α) and the topic-word distribution has
prior Dir(β); α and β are the parameters of the Dirichlet distributions. The resulting topic-word and
document-topic parameters constitute the topics.
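The factorization and row normalization of sub-steps A31 and A32 can be sketched as follows. This is a minimal illustration that assumes the Lee-Seung multiplicative update rules for non-negative matrix factorization; the patent text does not say which NMF algorithm is used, and all function names here are hypothetical.

```python
import random

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def row_normalize(a):
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    out = []
    for row in a:
        s = sum(row)
        out.append([v / s for v in row] if s > 0 else list(row))
    return out

def nmf(m, z, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative m (n x n) as b (n x z) times h (z x n).

    Uses multiplicative updates (an assumed choice of algorithm).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(m), len(m[0])
    b = [[rng.random() + eps for _ in range(z)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(z)]
    for _ in range(iters):
        # H <- H * (B^T M) / (B^T B H), element-wise
        bt = transpose(b)
        num = matmul(bt, m)
        den = matmul(matmul(bt, b), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(z)]
        # B <- B * (M H^T) / (B H H^T), element-wise
        ht = transpose(h)
        num = matmul(m, ht)
        den = matmul(b, matmul(h, ht))
        b = [[b[i][j] * num[i][j] / (den[i][j] + eps) for j in range(z)]
             for i in range(n_rows)]
    return b, h

# Toy citation matrix: entry [i][j] is 1 if document i cites document j.
citations = [[0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
m = row_normalize(citations)
b, h = nmf(m, z=2)
doc_topic = row_normalize(b)   # each row: topic membership of one document
topic_doc = row_normalize(h)   # each row: distribution of one topic over documents
```

The row-normalized factors play the role of the probability matrices described in A32. In practice the factorization is only approximate, so the product of the two factors is close to, rather than exactly equal to, the normalized citation matrix.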
In step A4 above, topics are divided along the time axis mainly by using the time information of the documents
belonging to each topic: each topic is split over the different time periods to form the sub-topics of each
period, where K is the number of topics and P is the number of time periods. The specific division scheme is as
follows: given the number of time periods P and the start time t0 and end time ts of the documents, the length
of each time period is (ts-t0)/P.
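The equal-width division described above can be sketched as follows. The function names and the choice of years as the time unit are assumptions made for illustration; the sketch also assumes ts > t0 and places the end time ts in the last period.

```python
def time_slice_index(t, t0, ts, p):
    """Index (0 to p-1) of the period containing time t, for p equal periods over [t0, ts]."""
    if not t0 <= t <= ts:
        raise ValueError("t lies outside [t0, ts]")
    width = (ts - t0) / p  # length of each period, (ts - t0) / P
    return min(int((t - t0) / width), p - 1)  # the end time ts falls in the last period

def split_topic_by_time(doc_times, t0, ts, p):
    """Group the documents of one topic into p sub-topics by publication time.

    doc_times: dict mapping document ID to publication time.
    """
    slices = [[] for _ in range(p)]
    for doc_id, t in doc_times.items():
        slices[time_slice_index(t, t0, ts, p)].append(doc_id)
    return slices
```

For example, with t0 = 2000, ts = 2010 and P = 5, each period is two years long, so a document published in 2003 lands in the second period.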
The topic evolution analysis in step A5 above specifically comprises:
A51. For two topics z_i and z_j in any two adjacent time periods, compute the relation between the two topics
from the word distribution of each topic and the core-document distribution of each topic;
A52. Compute the relevance measure of step A51 for every pair of topics in adjacent time periods, and create a
directed edge between any two topics whose measure exceeds a given threshold, with the direction of the edge
determined by the temporal order of the topics, thereby building the evolution graph of the topics.
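Steps A51 and A52 can be sketched as follows. The relevance formula of the patent is not reproduced in this text, so the sketch substitutes an assumed measure: a weighted sum of the cosine similarity of the word distributions and the cosine similarity of the core-document distributions. The weight, the threshold and all names are hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0

def topic_relation(words_i, docs_i, words_j, docs_j, weight=0.5):
    """Assumed relevance measure combining word and core-document similarity."""
    return weight * cosine(words_i, words_j) + (1 - weight) * cosine(docs_i, docs_j)

def build_evolution_edges(slices, threshold=0.5, weight=0.5):
    """Build directed edges between related topics in adjacent time periods.

    slices: list ordered by time; each item maps topic ID to
            (word_distribution, core_document_distribution).
    Returns a list of (earlier_topic, later_topic, score) tuples.
    """
    edges = []
    for earlier, later in zip(slices, slices[1:]):
        for ti, (wi, di) in earlier.items():
            for tj, (wj, dj) in later.items():
                score = topic_relation(wi, di, wj, dj, weight)
                if score > threshold:
                    # edges always point from the earlier period to the later one
                    edges.append((ti, tj, score))
    return edges
```

Only adjacent periods are compared, so a longer evolution path appears as a chain of edges through consecutive periods.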
2. The method for discovering topics in scientific research documents and tracking their evolution according to
claim 1, characterized in that the metadata obtained for each document in step A1 includes: the document ID, the
publication time, the document content consisting only of the title, keywords and abstract, the citation information of the document, etc.
3. The method for discovering topics in scientific research documents and tracking their evolution according to
claim 1, characterized in that the preprocessing of document metadata in step A2 specifically comprises: removing
stop words, removing low-frequency words that occur fewer than 5 times across all documents, building the word
frequency matrix of the documents, building the vocabulary of all documents, and building the citation relation
matrix between the documents in the data set.
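The preprocessing of claim 3 can be sketched as follows. Whitespace tokenization, the small stop-word list and the function names are assumptions for illustration only; the threshold of 5 occurrences comes from the claim.

```python
from collections import Counter

# Illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "for", "to"}

def preprocess(docs, min_count=5, stop_words=STOP_WORDS):
    """Build the vocabulary and per-document word counts.

    docs: dict mapping document ID to its text (title, keywords, abstract).
    Words occurring fewer than min_count times over all documents are dropped.
    Returns (vocabulary, counts), where counts[doc_id][i] is the count of vocabulary[i].
    """
    tokenized = {d: [w for w in text.lower().split() if w not in stop_words]
                 for d, text in docs.items()}
    totals = Counter(w for words in tokenized.values() for w in words)
    vocabulary = sorted(w for w, n in totals.items() if n >= min_count)
    index = {w: i for i, w in enumerate(vocabulary)}
    counts = {}
    for d, words in tokenized.items():
        row = [0] * len(vocabulary)
        for w in words:
            if w in index:
                row[index[w]] += 1
        counts[d] = row
    return vocabulary, counts

def citation_matrix(doc_ids, citations):
    """0/1 matrix with entry [i][j] = 1 if document doc_ids[i] cites doc_ids[j].

    citations: iterable of (citing_id, cited_id) pairs; pairs that mention
    documents outside doc_ids are ignored.
    """
    pos = {d: i for i, d in enumerate(doc_ids)}
    m = [[0] * len(doc_ids) for _ in doc_ids]
    for citing, cited in citations:
        if citing in pos and cited in pos:
            m[pos[citing]][pos[cited]] = 1
    return m
```

The word counts feed the content-based topic model, while the citation matrix is the input to the citation-based factorization.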
4. The method for discovering topics in scientific research documents and tracking their evolution according to
claim 1, characterized in that the model parameters in step A33, namely the topic-word distribution and θ_{j,k}, are computed as follows:
where the counts in the formulas denote, respectively, the number of occurrences of word w assigned to topic k
and the number of words in document j assigned to topic k; α_k is the Dirichlet prior parameter vector of
θ_{j,k}; θ_{j,k} is the probability of the k-th topic in document j; β_t is the Dirichlet prior parameter vector
of the topic-word distribution; and the topic-word distribution gives the probability of the w-th term under topic k.
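The parameter estimates of claim 4 can be sketched as follows. The formulas themselves are not reproduced in this text, so the sketch assumes the estimates commonly used with collapsed Gibbs sampling for LDA, in which each distribution is the normalized sum of assignment counts and Dirichlet prior parameters. The name `phi` for the topic-word distribution and all function names are hypothetical.

```python
def smoothed_distribution(counts, prior):
    """Normalize (count + prior) over one row so that the result sums to 1."""
    denom = sum(n + p for n, p in zip(counts, prior))
    return [(n + p) / denom for n, p in zip(counts, prior)]

def estimate_parameters(topic_word_counts, doc_topic_counts, alpha, beta):
    """Assumed point estimates from the Gibbs sampling counts.

    topic_word_counts[k][w]: times word w was assigned to topic k
    doc_topic_counts[j][k]: words in document j assigned to topic k
    alpha: Dirichlet prior vector over topics
    beta: Dirichlet prior vector over vocabulary terms
    Returns (phi, theta): topic-word and document-topic distributions.
    """
    phi = [smoothed_distribution(row, beta) for row in topic_word_counts]
    theta = [smoothed_distribution(row, alpha) for row in doc_topic_counts]
    return phi, theta
```

The prior terms keep every probability strictly positive when the prior parameters are positive, even for words or topics with a zero count.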
5. The method for discovering topics in scientific research documents and tracking their evolution according to
claim 1, characterized in that the time division scheme in step A4 is as follows: given the number of time
periods S and the start time t0 and end time ts of the documents contained in the data set, the length of each
time period is (ts-t0)/S, yielding the sub-topics in each time period.
6. The method for discovering topics in scientific research documents and tracking their evolution according to
claim 1, characterized in that the formula for computing the relation between two topics in step A51 is:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610913510.7A CN106570088A (en) | 2016-10-20 | 2016-10-20 | Discovering and evolution tracking method for scientific research document topics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610913510.7A CN106570088A (en) | 2016-10-20 | 2016-10-20 | Discovering and evolution tracking method for scientific research document topics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106570088A true CN106570088A (en) | 2017-04-19 |
Family
ID=58533362
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610913510.7A Pending CN106570088A (en) | 2016-10-20 | 2016-10-20 | Discovering and evolution tracking method for scientific research document topics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106570088A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103390051A (en) * | 2013-07-25 | 2013-11-13 | 南京邮电大学 | Topic detection and tracking method based on microblog data |
CN103984681A (en) * | 2014-03-31 | 2014-08-13 | 同济大学 | News event evolution analysis method based on time sequence distribution information and topic model |
CN105335349A (en) * | 2015-08-26 | 2016-02-17 | 天津大学 | Time window based LDA microblog topic trend detection method and apparatus |
CN105956130A (en) * | 2016-05-09 | 2016-09-21 | 浙江农林大学 | Multi-information fusion scientific research literature theme discovering and tracking method and system thereof |
CN106021222A (en) * | 2016-05-09 | 2016-10-12 | 浙江农林大学 | Analysis method and device for scientific research literature theme evolution |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112307278A (en) * | 2020-10-26 | 2021-02-02 | 中国科学院计算技术研究所 | Real-time generation method and system for topic venation of any scale |
CN112307278B (en) * | 2020-10-26 | 2024-02-23 | 中国科学院计算技术研究所 | Topic context real-time generation method and system with arbitrary scale |
CN117891959A (en) * | 2024-03-15 | 2024-04-16 | 中国标准化研究院 | Document metadata storage method and system based on Bayesian network |
CN117891959B (en) * | 2024-03-15 | 2024-05-10 | 中国标准化研究院 | Document metadata storage method and system based on Bayesian network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170419 |