CN109408782A - Research hotspot evolution behavior detection method based on KL distance similarity measurement - Google Patents

Research hotspot evolution behavior detection method based on KL distance similarity measurement

Info

Publication number
CN109408782A
Authority
CN
China
Prior art keywords
topic
publication
time slice
theme
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811216206.2A
Other languages
Chinese (zh)
Other versions
CN109408782B (en
Inventor
黄芳
杜春修
赵义健
张祖平
章成源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dragon Totem Technology Hefei Co ltd
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN201811216206.2A
Publication of CN109408782A
Application granted
Publication of CN109408782B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for detecting the evolution behavior of research hotspots based on KL distance similarity measurement. By combining the topical nature of publications with their time order, it proposes a time-series publication topic model, TS-JTM, to extract the temporal hotspots of academic journals. On this basis, a topic snapshot publication research hotspot evolution model based on time series is established. Using the KL distance between probability distributions as the similarity measure, the method detects the evolution behavior of topics across topic snapshots at adjacent times, enabling fine-grained analysis of how research hotspots in a publication evolve.

Description

KL distance similarity measurement-based research hotspot evolution behavior detection method
Technical Field
The invention belongs to the technical field of literature theme analysis and detection, and particularly relates to a KL distance similarity measurement-based method for detecting evolution behaviors of research hotspots.
Background
As scientific research advances, the research hotspots of academic fields change. Driven by the interpenetration of disciplines and the application of new technologies, some old research problems disappear, new research problems keep emerging, and some research problems split or merge with others over time; these behaviors drive the development of academic research hotspots. It is therefore necessary to analyze the development of research hotspots in an academic field and trace their trajectories in order to predict their trends. Such analysis helps scholars understand current hot research problems and helps researchers and managers grasp the regularities of scientific development. The results and progress of researchers are reflected in the academic papers published in academic publications, which collect a large body of research results by category. Because publications appear periodically, they in effect record the development of the research field they cover, so extracting a publication's research hotspots and discovering how they evolve over time is highly valuable.
In document topic analysis, the Author-Topic Model (ATM) is a commonly used topic clustering method. ATM models the interests of a document's authors and can be used to analyze their academic preferences [1]. It is a three-layer Bayesian probability model with three levels: words, topics, and author interests. The model maps directly onto a publication topic model: a publication selects a topic with a certain probability, and the topic generates topic words with a certain probability. However, the evolution of topics over time is an important factor in topic extraction, and the author topic model does not take time into account. When it is applied directly to the corpus of each time slice, each slice has independent model parameters with no temporal dependency, so the effect of topic change over time is ignored and the uncertainty of assigning words to topics increases. The DTM model was proposed by Blei on the basis of the LDA (latent Dirichlet allocation) model [2]. However, DTM does not model publications, so it cannot yield the topics contained in each publication of a document data set or their evolution over time, and therefore cannot meet the needs of publication topic research.
Therefore, there is no effective means for detecting evolution behavior based on publication time-series topics in the prior art.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the invention provides a method for detecting the evolution behavior of research hotspots based on KL distance similarity measurement. By combining the topical nature and time order of publications, it proposes a time-series publication topic model, TS-JTM (Time Sequence Journal Topic Model), performs temporal topic extraction on publications with this model, and measures topic evolution using KL-distance topic similarity, thereby detecting the continuation, new generation, split, fusion and extinction behaviors of topics.
A method for detecting evolution behaviors of research hotspots based on KL distance similarity measurement comprises the following steps:
step 1: acquiring publication documents, and constructing a subject term corpus with time attributes based on publication time of the publication documents;
dividing time slices by publication document publication time, wherein the subject term corpus is composed of data sets on each time slice, and the data set on each time slice is composed of document feature vectors of publication documents published at matching time;
$$C_t = \{(w_1, j_1), (w_2, j_2), \ldots, (w_{n_1}, j_{n_1})\}, \qquad w_i = \{c_1, c_2, \ldots, c_{n_2}\}$$
in the formula, $C_t$ is the data set on time slice t, $(w_i, j_i)$ is the document feature vector of publication document i, $w_i$ is the set of feature words of publication document i, $j_i$ is the publication to which publication document i belongs, $c_i$ is the i-th feature word in the feature word set, $n_1$ is the number of publication documents on time slice t, and $n_2$ is the number of feature words in publication document i;
wherein, the characteristic words of the publication documents are obtained after the content of the publication documents is subjected to word segmentation processing;
step 2: constructing a time sequence publication theme model based on publication themes and time sequence;
each time slice in the time-series publication topic model corresponds to a publication topic model, and for any two adjacent time slices, the Dirichlet prior parameter α of the publication-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ in the publication topic model of the later time slice are associated with the corresponding parameters α and β of the earlier time slice;
step 3: performing topic extraction on the data set of each time slice in turn, using the publication topic model of that time slice in the time-series publication topic model, to obtain the publication-topic distribution and the topic-word distribution on each time slice;
step 4: obtaining the topics and topic-word distributions of a publication under test on each time slice, calculating the KL distance between any two topics of the same publication under test on adjacent time slices based on the topic-word distributions, and obtaining the evolution behavior of each topic in the publication under test based on a topic snapshot publication research hotspot evolution model;
the topic snapshot publication research hotspot evolution model comprises five types of evolution behavior detection rules, namely topic continuation, new generation, extinction, division and fusion, each type of evolution behavior detection rule is identified based on the similarity of topics on adjacent time slices and the evolution behavior characteristic, the evolution behavior characteristic is related to the similarity, and the similarity of the two topics is measured by adopting KL distance.
On the one hand, the invention provides a topic snapshot publication research hotspot evolution model. It uses the KL distance to measure the similarity between two topics of the same publication under test on two adjacent time slices, covers detection rules for the continuation, new generation, split, fusion and extinction behaviors in topic evolution, and thereby detects the evolution behavior of time-series topics in the publication under test. The evolution behaviors are characterized as follows: ① continuation: the topic of the current time slice continues into the next time slice, so it is very similar to exactly one topic of the next time slice and dissimilar to the others; ② new generation: the topic of the current time slice has no connection with any topic of the previous time slice, so it is dissimilar to all topics of the previous time slice; ③ split: the topic of the current time slice divides into several topics, so it is similar to two or more topics of the next time slice; ④ fusion: several topics of the previous time slice merge into the topic of the current time slice, so it is similar to two or more topics of the previous time slice; ⑤ extinction: the topic of the current time slice is not continued in the next time slice, so it is dissimilar to all topics of the next time slice. The detection rules are derived from these characteristics and from the similarity of topics of the same publication on adjacent time slices, so they reflect the topic evolution process.
On the other hand, the time-series publication topic model is constructed from the topical nature and time order of publications. It accounts for topic change over time and links the publication topic models on adjacent time slices by parameter transmission, which reduces the uncertainty of assigning words to topics and lowers model perplexity. Moreover, the model targets publications in the literature data set, and a publication characterizes a subject field more strongly than an author does, so compared with the conventional author topic model ATM and the conventional DTM model, the time-series publication topic model better meets the need to study the topic evolution of publications.
Further preferably, the topic snapshot publication research hotspot evolution model comprises the following detection rules:
a: when the KL distance between topic i on time slice t and one topic on the next time slice t+1 is less than the similarity threshold, and the KL distances between topic i and the remaining topics on time slice t+1 are greater than or equal to the similarity threshold, topic i continues into the next time slice t+1;
b: when the KL distance between topic i on time slice t and every topic on the adjacent previous time slice t-1 is greater than the similarity threshold, topic i on time slice t is a new topic;
c: when the KL distance between topic i on time slice t and every topic on the adjacent next time slice t+1 is greater than the similarity threshold, topic i on time slice t is not continued in time slice t+1 and dies out;
d: when the KL distances between topic i on time slice t and at least two topics on the adjacent next time slice t+1 are less than the similarity threshold, topic i on time slice t splits into multiple topics in time slice t+1;
e: when the KL distances between topic i on time slice t and at least two topics on the adjacent previous time slice t-1 are less than the similarity threshold, topic i on time slice t is formed by the fusion of multiple topics from time slice t-1.
Further preferably, the detection formula of each detection rule in the topic snapshot publication research hotspot evolution model is as follows:
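the detection formula of the continuation behavior in rule a is:
$$\exists\, j \in T_{t+1}:\; KL(\varphi_i^{t} \,\|\, \varphi_j^{t+1}) < threshold\_A \;\wedge\; \forall\, k \in T_{t+1},\ k \neq j:\; KL(\varphi_i^{t} \,\|\, \varphi_k^{t+1}) \geq threshold\_A$$
in the formula, $KL(\varphi_i^{t} \| \varphi_j^{t+1})$ and $KL(\varphi_i^{t} \| \varphi_k^{t+1})$ are the KL distances between topic i and topic j on time slice t+1 and between topic i and topic k on time slice t+1, respectively; $\varphi_i^{t}$, $\varphi_j^{t+1}$ and $\varphi_k^{t+1}$ are the topic-word distributions of topic i, topic j and topic k; $T_{t+1}$ is the topic set on time slice t+1; and threshold_A is the similarity threshold;
the detection formula of the new-topic behavior in rule b is:
$$\forall\, j \in T_{t-1}:\; KL(\varphi_j^{t-1} \,\|\, \varphi_i^{t}) > threshold\_A$$
in the formula, $KL(\varphi_j^{t-1} \| \varphi_i^{t})$ is the KL distance between topic j on time slice t-1 and topic i, and $T_{t-1}$ is the topic set on time slice t-1;
the detection formula of the extinction behavior in rule c is:
$$\forall\, j \in T_{t+1}:\; KL(\varphi_i^{t} \,\|\, \varphi_j^{t+1}) > threshold\_A$$
the detection formula of the split behavior in rule d is:
$$\left|\{\, j \in T_{t+1} : KL(\varphi_i^{t} \,\|\, \varphi_j^{t+1}) < threshold\_A \,\}\right| \geq 2$$
the detection formula of the fusion behavior in rule e is:
$$\left|\{\, j \in T_{t-1} : KL(\varphi_j^{t-1} \,\|\, \varphi_i^{t}) < threshold\_A \,\}\right| \geq 2$$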
further preferably, the KL distance calculation formula of the two subjects is as follows:
$$KL(\varphi_j^{t-1} \,\|\, \varphi_i^{t}) = \sum_{x \in X} \varphi_j^{t-1}(x)\,\log\frac{\varphi_j^{t-1}(x)}{\varphi_i^{t}(x)}$$
in the formula, $KL(\varphi_j^{t-1} \| \varphi_i^{t})$ is the KL distance from topic j on time slice t-1 to topic i; $\varphi_j^{t-1}$ and $\varphi_i^{t}$ are the topic-word distributions of topic j on time slice t-1 and topic i on time slice t, respectively; $\varphi_j^{t-1}(x)$ and $\varphi_i^{t}(x)$ are the probabilities of topic word x under these two topic-word distributions; X is the topic word set of topic j on time slice t-1; and x is any topic word in X.
It should be understood that this is a general formula and is also used to calculate KL distances on any other pair of adjacent time slices. Note that if a topic word x in the formula does not exist in the topic word set of topic i at time t, then $\varphi_i^{t}(x)$ is set to a preset small value, for example 0.001.
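The KL distance with the small-value fallback can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `kl_distance`, the dictionary representation of a topic-word distribution, and the use of the natural logarithm are all assumptions.

```python
import math

def kl_distance(phi_prev, phi_curr, floor=0.001):
    """KL distance from topic j (earlier slice) to topic i (later slice).

    phi_prev, phi_curr: dicts mapping a topic word to its probability.
    The sum runs over the words of phi_prev; a word missing from
    phi_curr is given the preset small value `floor`.
    The natural logarithm is an assumption of this sketch.
    """
    total = 0.0
    for word, p in phi_prev.items():
        if p <= 0.0:
            continue  # a zero-probability word contributes nothing
        q = phi_curr.get(word, floor)
        total += p * math.log(p / q)
    return total

# Hypothetical topic-word distributions for illustration only.
topic_j = {"cloud": 0.5, "storage": 0.3, "network": 0.2}
topic_i = {"cloud": 0.5, "storage": 0.3, "security": 0.2}
same = kl_distance(topic_j, topic_j)
shifted = kl_distance(topic_j, topic_i)
```

Because the sum runs only over the words of the earlier topic, the measure is asymmetric, and the fallback value directly controls how heavily a missing word is penalized.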
Further preferably, the similarity threshold is 0.4.
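Rules a to e can be sketched as a small classifier over a precomputed distance matrix. This is an illustrative reading of the rules under stated assumptions: `kl[i][j]` holds the KL distance from topic i on the earlier slice to topic j on the later slice, a distance below the threshold counts as "similar" and anything else as "dissimilar", and the label `inherited` (a later topic with exactly one similar predecessor) is a hypothetical name not used in the text.

```python
def classify_evolution(kl, threshold=0.4):
    """Label topic evolution between two adjacent time slices.

    kl[i][j]: KL distance from topic i (earlier slice) to topic j
    (later slice). Returns (earlier_labels, later_labels).
    """
    n_later = len(kl[0]) if kl else 0
    earlier, later = {}, {}
    # Rules a, c, d: look forward from each earlier topic.
    for i, row in enumerate(kl):
        similar = [j for j, d in enumerate(row) if d < threshold]
        if not similar:
            earlier[i] = "extinction"
        elif len(similar) == 1:
            earlier[i] = "continuation"
        else:
            earlier[i] = "split"
    # Rules b, e: look backward from each later topic.
    for j in range(n_later):
        similar = [i for i, row in enumerate(kl) if row[j] < threshold]
        if not similar:
            later[j] = "new"
        elif len(similar) >= 2:
            later[j] = "fusion"
        else:
            later[j] = "inherited"  # hypothetical label
    return earlier, later

# Hypothetical distances: 4 earlier topics x 4 later topics.
kl = [
    [0.1, 0.9, 0.8, 0.9],
    [0.9, 0.2, 0.3, 0.9],
    [0.9, 0.9, 0.9, 0.9],
    [0.3, 0.9, 0.9, 0.9],
]
earlier, later = classify_evolution(kl)
```

In this example, earlier topic 1 splits into later topics 1 and 2, earlier topic 2 dies out, later topic 0 is a fusion of earlier topics 0 and 3, and later topic 3 is new.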
Further preferably, the dirichlet prior parameter α of the publication-topic distribution θ and the dirichlet prior parameter β of the topic-word distribution Φ in the publication topic model on the adjacent time slice in step 2 are correlated as follows:
$$\beta_t \mid \beta_{t-1} \sim N(\beta_{t-1}, \sigma^2 I)$$
$$\alpha_t \mid \alpha_{t-1} \sim N(\alpha_{t-1}, \delta^2 I)$$
in the formula, $\beta_t$ and $\beta_{t-1}$ are the Dirichlet prior parameters of the topic-word distribution in the publication topic models on time slice t and time slice t-1, respectively; $\alpha_t$ and $\alpha_{t-1}$ are the Dirichlet prior parameters of the publication-topic distribution in the publication topic models on time slice t and time slice t-1, respectively; $N(\beta_{t-1}, \sigma^2 I)$ and $N(\alpha_{t-1}, \delta^2 I)$ are both normal distributions; and $\sigma^2 I$ and $\delta^2 I$ are the variances of the corresponding random variables;
$\beta_t \mid \beta_{t-1} \sim N(\beta_{t-1}, \sigma^2 I)$ means that the prior parameter $\beta_t$ of the topic-word distribution on time slice t depends on the prior parameter $\beta_{t-1}$ of the previous time slice t-1 and follows the distribution $N(\beta_{t-1}, \sigma^2 I)$; $\alpha_t \mid \alpha_{t-1} \sim N(\alpha_{t-1}, \delta^2 I)$ means that the Dirichlet prior parameter $\alpha_t$ of the publication-topic distribution on time slice t depends on the Dirichlet prior parameter $\alpha_{t-1}$ of the previous time slice t-1 and follows the distribution $N(\alpha_{t-1}, \delta^2 I)$.
The invention takes into account that academic publications are published periodically and that their topics evolve gradually, so adjacent time slices are connected by parameter transmission, that is, through the two Dirichlet prior parameters α and β. Because the values of α and β influence the formation of topics and change the distribution of words within them, the invention passes the influence of the publication-topic distribution θ and the topic-word distribution φ in the preceding time slice to the topic model parameters of the adjacent next time slice through α and β, which reduces the uncertainty of assigning words to topics and lowers model perplexity.
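The parameter transmission between adjacent time slices can be sketched as a Gaussian draw centred on the previous slice's prior. This is only a sketch: the function name `propagate_prior`, the standard deviations used, and the clipping of draws to a small positive floor (Dirichlet parameters must be positive) are assumptions that the text does not specify.

```python
import random

def propagate_prior(prev, sigma, rng, floor=1e-6):
    """Draw next-slice prior parameters from N(prev, sigma^2 I).

    prev: list of Dirichlet prior parameters from the previous slice.
    Each component is drawn independently and clipped at `floor`;
    the clipping is an assumption of this sketch, added because a
    Dirichlet parameter cannot be zero or negative.
    """
    return [max(rng.gauss(p, sigma), floor) for p in prev]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
beta_prev = [0.01] * 5   # topic-word prior on slice t-1
alpha_prev = [1.0] * 3   # publication-topic prior on slice t-1
beta_t = propagate_prior(beta_prev, sigma=0.005, rng=rng)    # sigma is hypothetical
alpha_t = propagate_prior(alpha_prev, sigma=0.1, rng=rng)    # delta is hypothetical
```

A smaller standard deviation keeps the priors of adjacent slices closer together, which corresponds to assuming more gradual topic evolution.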
Further preferably, the number of topics of the time-series publication topic model in step 2, and the Dirichlet prior parameter α of the publication-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ in the publication topic model on the first time slice, are preset values.
Further preferably, the number of topics of the time-series publication topic model is 50.
Further preferably, the Dirichlet prior parameter α of the publication-topic distribution θ in the publication topic model on the first time slice is 1, and the Dirichlet prior parameter β of the topic-word distribution φ is 0.01.
Advantageous effects
1. The invention provides a new topic snapshot publication research hotspot evolution model. It uses the KL (Kullback-Leibler) distance to measure the similarity between two topics of the same publication under test on two adjacent time slices and detects the continuation, new generation, split, fusion and extinction behaviors of topics across topic snapshots at adjacent times. This enables fine-grained analysis of the evolution of research hotspots in a publication and fills the gap left by the prior art, which lacks an effective means of detecting evolution behavior based on publication time-series topics. The detection rules for continuation, new generation, split, fusion and extinction in the model are derived from the similarity of topics of the same publication on adjacent time slices and reflect the topic evolution process.
2. The time-series publication topic model is constructed from the topical nature and time order of publications and combines the characteristics of the publication topic model JTM and the DTM model. On the one hand, it accounts for topic change over time: the publication topic models on adjacent time slices are linked by parameter transmission, which reduces the uncertainty of assigning words to topics, lowers model perplexity, and overcomes the weakness of a single topic model, which ignores topic change over time and so increases that uncertainty. Adjacent time slices are connected through the Dirichlet prior parameters α and β. Because the values of α and β influence the formation of topics and change the distribution of words within them, the influence of the publication-topic distribution θ and the topic-word distribution φ in the preceding time slice is transmitted through α and β to the topic model parameters of the next time slice. On the other hand, the model is built for publications in the literature data set, so it is better suited to studying the topic evolution of publications.
3. Experiments show that the time-series publication topic model performs well in both perplexity and running time: its perplexity is lower than that of the author topic model ATM and the DTM model, and its running time is close to that of DTM and shorter than that of ATM.
Drawings
FIG. 1 is a schematic flow chart of a method for detecting evolution behavior of a research hotspot based on KL distance similarity measurement according to the present invention;
FIG. 2 is a schematic view of a publication topic model provided by the present invention;
FIG. 3 is a schematic diagram of a temporal publication topic model provided by the present invention;
FIG. 4 is a schematic diagram of topic evolution behavior in a topic snapshot publication research hotspot evolution model provided by the invention;
FIG. 5 is a schematic diagram of the evolution of the subject under publication ID 003 in the years 2010-2016 according to the present invention;
FIG. 6 is a diagram comparing the perplexity of the ATM model, the DTM model and the TS-JTM model according to the present invention.
Detailed Description
The present invention will be further described with reference to the following examples.
Because the research hotspot in the academic field is mainly reflected in the academic publication, how to analyze the evolution behavior of the subject in the data set of the academic publication has important significance for scientific researchers to know the development track of the subject research hotspot and grasp the development rule of the research hotspot. As shown in fig. 1, the present invention provides a method for detecting evolution behavior of research hotspot based on KL distance similarity measurement based on this requirement, which includes the following steps:
Step 1: Preprocess the literature information. Publication documents are first obtained from a public literature information base and preprocessed, and a topic word corpus with time attributes is then constructed based on the publication time of the documents.
The preprocessing proceeds as follows: extract and format the title, abstract, keywords, journal name and publication time of each publication document; split the abstract and the title into phrases with a word segmentation tool; delete stop words; and combine the remaining phrases with the keywords to form the feature words of the document. In other possible embodiments, the feature words of a document may also be derived from the abstract only, from the abstract and the keywords, or from the abstract and the title; the invention is not particularly limited in this respect.
After the characteristic word set of each document is obtained, the time slices are divided according to the publication time of the document, and the characteristic words of the document belonging to the same time slice and publication information of the document form a data set of the time slices. The data sets for each time slice constitute a corpus of topic words.
For example, scientific and technical literature information is obtained from the national knowledge network public literature repository to construct the topic word corpus. The abstracts of 6487 articles, with their journal names and publication times, are selected from computer-field publications of 2010-2016 as experimental data. All literature information is divided by year into data sets for 7 time slices, and each abstract is then segmented and stripped of stop words using NLPIR, the Chinese word segmentation system of the Chinese Academy of Sciences, to form the topic word set of each document. The document feature vector of document i is denoted $(w_i, j_i)$, where $w_i$ is the set of feature words in document i and $j_i$ is the publication in which document i was published. The data set $C_t$ formed by the $n_1$ documents in time slice t can be expressed as
$$C_t = \{(w_1, j_1), (w_2, j_2), \ldots, (w_{n_1}, j_{n_1})\}$$
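The construction of the time-sliced corpus can be sketched as below. This is a simplified illustration: the record fields, the stop-word list, and the whitespace tokenizer standing in for a real Chinese word segmenter such as NLPIR are all assumptions, and no NLPIR API is used.

```python
from collections import defaultdict

STOP_WORDS = {"the", "of", "a", "for", "and", "in"}  # hypothetical list

def feature_words(text):
    """Whitespace tokenizer standing in for a real word segmenter."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_corpus(records):
    """Group (feature words, journal) pairs by publication year.

    records: iterable of dicts with keys 'abstract', 'journal', 'year'
    (hypothetical field names). Returns {year: [(w_i, j_i), ...]},
    one data set per yearly time slice, in time order.
    """
    corpus = defaultdict(list)
    for rec in records:
        pair = (feature_words(rec["abstract"]), rec["journal"])
        corpus[rec["year"]].append(pair)
    return dict(sorted(corpus.items()))

# Made-up records for illustration only.
records = [
    {"abstract": "Topic model for the journal", "journal": "J1", "year": 2011},
    {"abstract": "A study of cloud storage", "journal": "J2", "year": 2010},
    {"abstract": "Evolution of topics in time", "journal": "J1", "year": 2010},
]
corpus = build_corpus(records)
```

Each value in `corpus` plays the role of one per-slice data set, and each tuple plays the role of one document feature vector.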
Step 2: a temporal journal topic model (TS-JTM) is constructed based on the topic and the chronological order.
The model of the time series publication topic model (TS-JTM) in each time slice is a publication topic model, which is shown in FIG. 2.α and β in the model represent Dirichlet (Dirichlet) prior parameters of a publication-topic distribution theta and a topic-word distribution phi, respectively, K represents the total number of publications, and T represents the number of topics.
The journal topic model on adjacent time slices in the time series topic model (TS-JTM) of the present invention has an association relationship, as with the DTM model, as shown in FIG. 3, adjacent time slices are connected by Dirichlet priors α and β, wherein the values of Dirichlet priors α and β affect the formation of the topic and change the distribution of words in the topic.
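$$\beta_t \mid \beta_{t-1} \sim N(\beta_{t-1}, \sigma^2 I) \quad (1)$$
$$\alpha_t \mid \alpha_{t-1} \sim N(\alpha_{t-1}, \delta^2 I) \quad (2)$$
$$\varphi_t \sim Dir(\beta_t) \quad (3)$$
$$\theta_t \sim Dir(\alpha_t) \quad (4)$$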
Formula (1) states that the prior parameter $\beta_t$ of the topic-word distribution on time slice t depends on the prior parameter $\beta_{t-1}$ of the previous time slice t-1 and follows the distribution $N(\beta_{t-1}, \sigma^2 I)$; $\beta_t$ and $\beta_{t-1}$ form a first-order Markov process. Likewise, formula (2) states that the Dirichlet prior parameter $\alpha_t$ of the publication-topic distribution on time slice t depends on the Dirichlet prior parameter $\alpha_{t-1}$ of the previous time slice t-1 and follows the distribution $N(\alpha_{t-1}, \delta^2 I)$. Formulas (3) and (4) state that $\beta_t$ and $\alpha_t$ are the Dirichlet prior parameters of the topic-word distribution $\varphi_t$ and the publication-topic distribution $\theta_t$ in the model, respectively, so the values of $\alpha_t$ and $\beta_t$ affect the publication-topic and topic-word distributions.
Based on the structure of the time-series publication topic model, the number of topics and the values of the Dirichlet prior parameters $\beta_1$ and $\alpha_1$ on the first time slice are set, and topic extraction on the data set of the first time slice yields the publication-topic distribution $\theta_1$ and the topic-word distribution $\varphi_1$ on that slice. Formulas (1) and (2) are then used to calculate new parameters $\beta_1'$ and $\alpha_1'$ from the Dirichlet prior parameters $\beta_1$ and $\alpha_1$ of the first time slice, and the new parameters $\beta_1'$ and $\alpha_1'$ are passed to the next time slice. Repeating this process yields the publication-topic and topic-word distributions on every time slice; that is, the Dirichlet prior parameters $\alpha_t$ and $\beta_t$ on the other time slices are calculated from $\alpha_{t-1}$ and $\beta_{t-1}$ on the previous time slice. Performing topic extraction on the data set of a time slice with the publication topic model of that slice to obtain the publication-topic and topic-word distributions is prior art, so it is described here only briefly.
Parameter inference for the publication-topic distribution θ and the topic-word distribution φ in the time-series publication topic model uses Gibbs sampling. For each word, the publication and topic are sampled according to formula (5), whose right-hand side is p(topic | journal) · p(word | topic), that is, the probability that the publication selects the topic multiplied by the probability that the topic selects the word. Since there are T topics and K publications, the formula in effect samples among these K × T paths.
$$P(z_i = j, x_i = k \mid w_i = m, z_{-i}, x_{-i}) \propto \frac{C^{JT}_{kj} + \alpha}{\sum_{j'} C^{JT}_{kj'} + T\alpha} \cdot \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + N\beta} \quad (5)$$
In the formula, $z_i = j$, $x_i = k$ means that the i-th word in a document is assigned to the j-th topic and the k-th publication. $w_i = m$ means that the i-th word is the m-th word in the dictionary. $z_{-i}$, $x_{-i}$ are the topic and publication assignments of all words other than the i-th word. $C^{WT}_{mj}$ is the total number of times word m has been assigned to topic j before this assignment, and $C^{JT}_{kj}$ is the total number of words in publication k assigned to topic j so far. N is the total number of words in the dictionary, which consists of all the distinct feature words in the data set. Parameter estimation therefore only needs to maintain two matrices: the N × T topic-word (word by topic) count matrix and the K × T publication-topic (journal by topic) count matrix. From these two count matrices, the topic-word distribution φ and the publication-topic distribution θ are calculated by formula (6) and formula (7), respectively.
$$\varphi_{mj} = \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + N\beta} \quad (6)$$
$$\theta_{kj} = \frac{C^{JT}_{kj} + \alpha}{\sum_{j'} C^{JT}_{kj'} + T\alpha} \quad (7)$$
In the formula, $\varphi_{mj}$ is the probability that topic j uses word m, $\theta_{kj}$ is the probability that publication k selects topic j, m' is any word assigned to topic j, and j' is any topic assigned to publication k.
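The step from the two count matrices to the two distributions can be sketched as follows. This sketch assumes the usual smoothed-count estimators of author-topic style models, with symmetric scalar priors; the function name and the toy counts are hypothetical.

```python
def estimate_distributions(word_topic, journal_topic, alpha, beta):
    """Smoothed estimates from the two Gibbs count matrices.

    word_topic[m][j]: times word m is assigned to topic j (N x T).
    journal_topic[k][j]: words of publication k assigned to topic j (K x T).
    alpha, beta: symmetric scalar priors (an assumption of this sketch).
    Returns phi (N x T) and theta (K x T) as nested lists.
    """
    n_words, n_topics = len(word_topic), len(word_topic[0])
    # Total number of word assignments per topic (column sums).
    topic_totals = [sum(word_topic[m][j] for m in range(n_words))
                    for j in range(n_topics)]
    phi = [[(word_topic[m][j] + beta) / (topic_totals[j] + n_words * beta)
            for j in range(n_topics)]
           for m in range(n_words)]
    theta = [[(count + alpha) / (sum(row) + n_topics * alpha) for count in row]
             for row in journal_topic]
    return phi, theta

# Toy counts: 3 dictionary words, 2 topics, 2 publications.
word_topic = [[2, 0], [1, 1], [0, 3]]
journal_topic = [[3, 1], [0, 3]]
phi, theta = estimate_distributions(word_topic, journal_topic, alpha=1.0, beta=0.01)
```

With these estimators each topic's word probabilities sum to one over the dictionary, and each publication's topic probabilities sum to one over the topics.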
Step 3: Perform topic extraction on the data set of each time slice in turn, using the publication topic model of that time slice in the time-series publication topic model, to obtain the publication-topic distribution and the topic-word distribution on each time slice.
Based on the framework of the time-series publication topic model constructed in step 2, this embodiment sets the number of topics and the initial values of the Dirichlet prior parameter α of the publication-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ, and then performs topic extraction on the data set of each time slice in turn to obtain the publication-topic and topic-word distributions on all time slices. Topic extraction with the TS-JTM model executes steps 1.1, 1.2 and 1.3 for each time slice t in a loop:
1.1 On time slice t, use the TS-JTM model to extract topics from the data set, obtaining the topic set $T_t$ and the topic-word distribution;
1.2 Add the topic set $T_t$ to the time-series topic set TC;
1.3 Update the TS-JTM model using the parameters $\alpha_t$ and $\beta_t$ of the current time-slice model.
It should be appreciated that updating the TS-JTM model updates the next time-slice model parameters α, β in the temporal journal topic model.
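The loop over steps 1.1 to 1.3 can be sketched as below. This is only an outline under stated assumptions: `fit_slice` is a hypothetical callable standing in for one run of the publication topic model on a single time slice, and it is assumed to return the slice's topics together with the parameters to carry forward. The source does not specify such an interface.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical interface: fit one time slice given the current priors and
# return (topics_of_slice, alpha_t, beta_t).
FitSlice = Callable[[Any, float, float], Tuple[Any, float, float]]


def extract_time_series_topics(
    slices: Sequence[Any],
    fit_slice: FitSlice,
    alpha0: float,
    beta0: float,
) -> List[Any]:
    """Run topic extraction slice by slice, carrying the priors forward."""
    tc: List[Any] = []           # time-series topic set TC
    alpha, beta = alpha0, beta0  # preset priors for the first time slice
    for data in slices:
        # 1.1 extract the topic set T_t (with its topic-word distribution)
        topics_t, alpha_t, beta_t = fit_slice(data, alpha, beta)
        # 1.2 add T_t to TC
        tc.append(topics_t)
        # 1.3 the current slice's parameters initialise the next slice
        alpha, beta = alpha_t, beta_t
    return tc
```

Passing the model in as a callable keeps the sketch independent of any particular sampler; only the order of operations is taken from the text.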
Step 4: Obtain the topics and topic-word distributions of the publication under test on each time slice, calculate the KL distance between any two topics of the same publication under test on adjacent time slices based on the topic-word distributions, and obtain the evolution behavior of each topic in the publication under test based on the topic snapshot publication research hotspot evolution model.
As shown in FIG. 4, the topic snapshot publication research hotspot evolution model provided by the invention covers the following behavior characteristics of topics: ① a one-to-one relation indicates that a topic of the current time slice continues a topic of the previous time slice; ② a topic of the current time slice that is not connected to any topic of the previous time slice is a new topic; ③ a one-to-many relation indicates that a topic of the previous time slice splits into several topics; ④ a many-to-one relation indicates that several topics fuse into one topic; and ⑤ a topic of the previous time slice that is not connected to any topic of the next time slice has become extinct.
To measure the similarity between two topics, the invention uses the KL distance. The KL (Kullback-Leibler divergence) distance, proposed by Solomon Kullback and Richard Leibler [3] and also called relative entropy, is often used to measure the similarity between two probability distributions, so it can be used to measure the similarity between any two topics on adjacent time slices. Formula 8 is the calculation formula of the KL distance, written here in its standard form:

$$D_{KL}(P \,\|\, Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \tag{8}$$

where P and Q denote the two probability distributions. When the two probability distributions are identical, the KL distance is 0.
The KL distance is used to measure the similarity between two topics on two adjacent time slices and thereby to establish the correspondence between the topics of adjacent time slices; the probability distributions in the formula correspond to the topic-word distributions of the topics.
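As an illustration of how the KL distance between topic-word distributions might be computed, the sketch below uses the standard definition of relative entropy. The natural logarithm, the clipping constant `eps`, and the function names are assumptions; the source does not state the logarithm base or how zero probabilities are handled.

```python
import numpy as np


def kl_distance(p, q, eps=1e-12):
    """KL distance sum_x p(x) * log(p(x) / q(x)) between two distributions.

    Assumption: probabilities are clipped to eps and renormalised so that
    zero entries do not produce log(0) or division by zero.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def kl_matrix(phi_prev, phi_next):
    """Pairwise KL distances between topics of two adjacent time slices.

    phi_prev and phi_next hold one topic-word distribution per row; entry
    [i, j] is the distance from topic i of the earlier slice to topic j of
    the later slice.
    """
    return np.array([[kl_distance(p, q) for q in phi_next] for p in phi_prev])
```

Note that the KL distance is not symmetric, so the direction in which two topics are compared has to be fixed once and used consistently.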
Based on evolution behaviors ① to ⑤ above, the topic snapshot publication research hotspot evolution model of the invention comprises the following detection rules:
a: When the KL distance between topic i on time slice t and one topic on the next adjacent time slice t+1 is less than the similarity threshold, and the KL distances between topic i and all remaining topics on time slice t+1 are greater than or equal to the similarity threshold, topic i continues into the next time slice t+1.
b: When the KL distance between topic i on time slice t and every topic on the previous adjacent time slice t-1 is greater than the similarity threshold, topic i on time slice t is a new topic.
c: When the KL distance between topic i on time slice t and every topic on the next adjacent time slice t+1 is greater than the similarity threshold, topic i on time slice t does not continue into time slice t+1 and becomes extinct.
d: When the KL distances between topic i on time slice t and at least two topics on the next adjacent time slice t+1 are smaller than the similarity threshold, topic i on time slice t splits into multiple topics on time slice t+1.
e: When the KL distances between topic i on time slice t and at least two topics on the previous adjacent time slice t-1 are smaller than the similarity threshold, topic i on time slice t is formed by the fusion of multiple topics from time slice t-1.
In summary, the detection formulas corresponding to detection rules a to e are as follows:
where the state identifier in the formula denotes the evolution behavior of the i-th topic on the t-th time slice, and threshold_A is the similarity threshold. Repeated experiments show that setting threshold_A to 0.4 reasonably reflects the evolution behavior of topics; other feasible embodiments may use other values.
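Rules a to e reduce to counting how many KL distances fall below the similarity threshold, looking forward to the next time slice or backward to the previous one. The sketch below is one way to express that. The function names and return labels are hypothetical, and a distance exactly equal to the threshold is treated as "not similar", which the rules leave slightly ambiguous.

```python
from typing import Sequence

THRESHOLD_A = 0.4  # similarity threshold reported in the text


def _matches(kl_values: Sequence[float], threshold: float) -> int:
    """Count the topics whose KL distance is below the threshold."""
    return sum(1 for d in kl_values if d < threshold)


def forward_behavior(kl_to_next: Sequence[float],
                     threshold: float = THRESHOLD_A) -> str:
    """Classify topic i on slice t from its KL distances to slice t+1."""
    n = _matches(kl_to_next, threshold)
    if n == 0:
        return "extinction"    # rule c: no similar topic on slice t+1
    if n == 1:
        return "continuation"  # rule a: exactly one similar topic
    return "split"             # rule d: at least two similar topics


def backward_behavior(kl_from_prev: Sequence[float],
                      threshold: float = THRESHOLD_A) -> str:
    """Classify topic i on slice t from its KL distances to slice t-1."""
    n = _matches(kl_from_prev, threshold)
    if n == 0:
        return "new"           # rule b: no similar topic on slice t-1
    if n == 1:
        return "continuation"  # one-to-one link with the previous slice
    return "fusion"            # rule e: at least two similar topics
```

Keeping the forward and backward checks separate mirrors the rules: splitting and extinction are judged against the next time slice, while new topics and fusion are judged against the previous one.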
For the publication under test, the following steps 2.2.1 and 2.2.2 are executed on each time slice:
2.2.1 Extract from the set TC the topic set $T_t$ of the current time slice and the topic sets $T_{t-1}$ and $T_{t+1}$ of the two time slices adjacent to it, and obtain the topics of the publication under test in $T_{t-1}$, $T_t$ and $T_{t+1}$;
2.2.2 Detect the evolution behavior of each topic of the publication under test on the current time slice according to formula 9.
In order to more clearly describe the aspects of the present invention, a number of examples will be provided below.
1. Change of topic words over time
As shown in Table 1 below, for the publication with ID 003 in the data set, topic number 2 in the 2010 topic distribution is a topic related to the field of face recognition. Table 1 shows the topic-word distribution of topic number 2 from 2010 to 2016, listing the 10 topic words with the highest probability for each year. The table shows that the core words of the "face recognition" topic do not change greatly over time: common words related to face recognition, such as "image", "feature" and "face recognition", remain in the topic throughout. However, "genetic algorithm", which appears in 2013, and "deep learning", which appears in 2015, reflect the application of new methods in the field of face recognition. From 2010 to 2016, the KL values between the topics of adjacent time slices are 0.20, 0.26, 0.23, 0.17, 0.21 and 0.19, all smaller than the similarity threshold threshold_A, while the KL distances between the "face recognition" topic and the other topics of the next time slice are all larger than threshold_A. The "face recognition" topic therefore continues from 2010 to 2016.
Table 1. Topic-word distribution of the "face recognition" topic
2. Evolution of publication topics over time
For convenience of description, topics are referred to by their English abbreviations in the following text. Table 2 shows the 10 topic words with the highest probability from 2010 to 2016 for the three topics "Neural Network (NN)", "Deep Learning (DL)" and "Speech Recognition (SR)". As Table 2 shows, core words of topic NN such as "neural network", "neuron" and "feature" remain basically unchanged, while the distribution of peripheral words such as "sample" and "particle swarm" changes greatly across time slices. Topic NN in 2013 and topic DL in 2014 share the words "training", "classification", "performance", "feature" and "neuron" among their top 10 topic words. Because their word distributions are similar, the KL value between the two topics is small: 0.27, which is below the similarity threshold threshold_A. The KL values computed for this comparison are 0.55, 0.27, 0.21, 0.69, 1.84, 1.16, 0.92 and 1.53; the two smallest correspond to DL and NN, and the rest are larger than the similarity threshold, so topic DL is generated by the splitting of topic NN.
Table 2. Topic-word distributions of "speech recognition" and the other two topics, 2010 to 2016
3. Publication topic evolution analysis
FIG. 5 shows the evolution of the topics of the publication with ID 003 between 2010 and 2016. Because clustering assigns different numbers to the same topic on different time slices, each topic is identified in the figure by its English abbreviation. As FIG. 5 shows, the KL values between the topics of 2015 and topic SR of 2016 are 0.74, 0.46, 0.23, 0.16, 0.81, 0.95 and 1.37. The two topics below the similarity threshold are NN and SR, and the remaining KL values are all greater than the threshold, which indicates that topic NN is fused into topic SR. The KL distances between the "aircraft" topic of 2014 and the topics of 2015 are 1.72, 1.46, 1.25, 1.07, 1.20, 0.83 and 1.59; the minimum, 0.83, is greater than the similarity threshold, so the "aircraft" topic becomes extinct in 2015. The KL values between all topics of 2010 and the "cloud computing" topic of 2011 are 1.16, 0.75, 1.37, 2.32 and 1.51; the minimum, 0.75, is greater than the similarity threshold, so this topic is new. Similarly, the "target tracking" topic remains in a continuing state throughout, and "entity recognition" is a new topic in 2013.
Model performance verification
To verify the performance of the time-series publication topic model (TS-JTM) provided by the invention, the perplexity index is used. Formula 10 is the calculation formula of the perplexity, written here in its standard form:

$$\mathrm{Perplexity}(D_{test}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(W_d)}{\sum_{d=1}^{M} N_d}\right) \tag{10}$$

where $D_{test}$ is the test set, a collection of M documents; $p(W_d)$ is the probability of the words in document d; $N_d$ is the number of words in document d; and $W_d = (w_{1d}, w_{2d}, \ldots, w_{id}, \ldots, w_{nd})$ is the word vector of document d. A smaller perplexity value indicates better model performance.
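A minimal sketch of the perplexity computation follows, assuming the usual definition: the exponential of the negative total log-likelihood divided by the total number of words in the test set. The function name and the choice to pass per-document log-likelihoods as input are assumptions; how $p(W_d)$ is obtained from the trained model is outside this sketch.

```python
import math
from typing import Sequence


def perplexity(log_likelihoods: Sequence[float],
               doc_lengths: Sequence[int]) -> float:
    """Perplexity of a test set from per-document log p(W_d) and N_d.

    Computes exp(-sum_d log p(W_d) / sum_d N_d), using natural logarithms.
    """
    if len(log_likelihoods) != len(doc_lengths):
        raise ValueError("one log-likelihood and one length per document")
    total_words = sum(doc_lengths)
    if total_words <= 0:
        raise ValueError("the test set must contain at least one word")
    return math.exp(-sum(log_likelihoods) / total_words)
```

As a sanity check, a model that assigns every word a probability of 1/4 yields a perplexity of 4 regardless of document lengths, matching the intuition that perplexity measures the effective number of equally likely word choices.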
To measure the performance of the time-series publication topic model (TS-JTM), three parameters of the model need to be set before the experiment. The number of topics |T| is increased gradually from 10. The two Dirichlet hyperparameters of ATM are set to α = 50/|T| and β = 0.01. For DTM and the time-series publication topic model, the two Dirichlet hyperparameters on the first time slice are set to α = 50/|T| and β = 0.01, and α and β on the remaining time slices are obtained automatically by the model. The comparison results are shown in FIG. 6, where the horizontal axis is the number of topics and the vertical axis is the perplexity. The perplexity of TS-JTM is always the lowest as the number of topics changes, which indicates that TS-JTM performs best. In addition, the perplexity decreases as the number of topics increases; once the number of topics exceeds 50 the perplexity is about 3950, so the number of topics of the TS-JTM model is preferably set to 50.
In another aspect, the invention tests the running time of the time-series publication topic model (TS-JTM) on the data set and compares it with the running times of the Author Topic Model (ATM) and the Dynamic Topic Model (DTM). The same data was processed with each of the three models, and the running times of TS-JTM, ATM and DTM were 23.8 minutes, 25.6 minutes and 24.2 minutes, respectively. The running times of TS-JTM and DTM are therefore very close, and ATM takes the longest. Together with the perplexity results of FIG. 6, this shows that the time-series publication topic model (TS-JTM) not only has low perplexity but also performs well in running time.
In summary, the topic evolution of academic publications reflects the development trend of research hotspots in an academic field. Because the topics and timeliness of a publication influence the distribution and evolution of its topics, and because evolution behaviors occur during topic evolution, identifying the evolution trajectory of a publication's research hotspots is complicated. This work combines publication topics with publication time order to propose the time-series publication topic model TS-JTM, uses TS-JTM to extract the temporal hotspots of academic publications, and verifies the performance of TS-JTM through a perplexity comparison experiment. On this basis, a time-series-based topic snapshot publication research hotspot evolution model is established; with similarity measured by the KL distance, it detects the continuation, new generation, splitting, fusion and extinction of topics between topic snapshots at adjacent times, achieving fine-grained analysis of research hotspot evolution in publications.
It should be emphasized that the examples described herein are illustrative and not restrictive, and thus the invention is not to be limited to the examples described herein, but rather to other embodiments that may be devised by those skilled in the art based on the teachings herein, and that various modifications, alterations, and substitutions are possible without departing from the spirit and scope of the present invention.
The references are as follows:
[1] Rosen-Zvi M, Griffiths T, Steyvers M. The Author-Topic Model for Authors and Documents [C]. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. 2004: 487-494.
[2] Blei D M, Lafferty J D. Dynamic Topic Models [C]. Proceedings of the 23rd International Conference on Machine Learning, 2006: 113-120.
[3] David J. C. MacKay. Information Theory, Inference, and Learning Algorithms [M]. Cambridge University Press, 2003: 22-48.

Claims (9)

1. A method for detecting the evolution behavior of research hotspots based on KL distance similarity measurement, characterized by comprising the following steps:
step 1: acquiring publication documents, and constructing a subject term corpus with time attributes based on publication time of the publication documents;
dividing time slices by publication document publication time, wherein the subject term corpus is composed of data sets on each time slice, and the data set on each time slice is composed of document feature vectors of publication documents published at matching time;
in the formula, $C_t$ is the data set on time slice t, $(w_i, j_i)$ is the document feature vector of publication document i, $w_i$ is the set of feature words of publication document i, $j_i$ is the publication to which publication document i belongs, $c_i$ is the i-th feature word in the feature word set, $n_1$ is the number of publication documents on time slice t, and $n_2$ is the number of feature words in publication document i;
wherein, the characteristic words of the publication documents are obtained after the content of the publication documents is subjected to word segmentation processing;
step 2: constructing a time-series publication topic model based on publication topics and time order;
each time slice in the time-series publication topic model corresponds to one publication topic model, and for two adjacent time slices, the Dirichlet prior parameter α of the publication-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ in the publication topic model of the later time slice are associated with the two Dirichlet prior parameters α and β of the earlier time slice;
step 3: based on the publication topic model on each time slice of the time-series publication topic model, performing topic extraction in turn on the data set of the matching time slice to obtain the publication-topic distribution and the topic-word distribution on each time slice;
step 4: obtaining the topics and topic-word distributions of the publication under test on each time slice, calculating the KL distance between any two topics of the same publication under test on adjacent time slices based on the topic-word distributions, and obtaining the evolution behavior of each topic in the publication under test based on a topic snapshot publication research hotspot evolution model;
the topic snapshot publication research hotspot evolution model comprises detection rules for five types of evolution behavior, namely topic continuation, new generation, extinction, splitting and fusion; each type of evolution behavior is identified based on the similarity of topics on adjacent time slices and on the evolution behavior characteristic, the evolution behavior characteristic is related to the similarity, and the similarity of two topics is measured by the KL distance.
2. The method of claim 1, wherein: the topic snapshot publication research hotspot evolution model comprises the following detection rules:
a: when the KL distance between topic i on time slice t and one topic on the next adjacent time slice t+1 is less than the similarity threshold, and the KL distances between topic i and all remaining topics on time slice t+1 are greater than or equal to the similarity threshold, topic i continues into the next time slice t+1;
b: when the KL distance between topic i on time slice t and every topic on the previous adjacent time slice t-1 is greater than the similarity threshold, topic i on time slice t is a new topic;
c: when the KL distance between topic i on time slice t and every topic on the next adjacent time slice t+1 is greater than the similarity threshold, topic i on time slice t does not continue into time slice t+1 and becomes extinct;
d: when the KL distances between topic i on time slice t and at least two topics on the next adjacent time slice t+1 are smaller than the similarity threshold, topic i on time slice t splits into multiple topics on time slice t+1;
e: when the KL distances between topic i on time slice t and at least two topics on the previous adjacent time slice t-1 are smaller than the similarity threshold, topic i on time slice t is formed by the fusion of multiple topics from time slice t-1.
3. The method of claim 2, wherein: the detection formula of each detection rule in the topic snapshot publication research hotspot evolution model is as follows:
the detection formula of the continuation evolution behavior in rule a is as follows:
in the formula, the two KL terms are the KL distance between topic i and topic j on time slice t+1 and the KL distance between topic i and topic k on time slice t+1, respectively; the distribution terms are the topic-word distributions of topic i and of topics j and k on time slice t+1; $T_{t+1}$ is the topic set on time slice t+1; and threshold_A is the similarity threshold;
the detection formula of the new-topic evolution behavior in rule b is as follows:
in the formula, the KL term is the KL distance between topic j on time slice t-1 and topic i, and $T_{t-1}$ is the topic set on time slice t-1;
the detection formula of the extinction evolution behavior in rule c is as follows:
the detection formula of the splitting evolution behavior in rule d is as follows:
the detection formula of the fusion evolution behavior in rule e is as follows:
4. The method of claim 1, wherein: the KL distance between two topics is calculated as follows:
in the formula, the left-hand side is the KL distance between topic j on time slice t-1 and topic i on time slice t; the two distribution terms are the topic-word distributions of topic j on time slice t-1 and of topic i on time slice t, respectively; the two probability terms are the probabilities of topic word x under the respective topic-word distributions; X denotes the set of topic words, and x denotes any topic word in X.
5. The method of claim 1, wherein: the similarity threshold is 0.4.
6. The method as claimed in claim 1, wherein the Dirichlet prior parameter α of the publication-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ in the publication topic model at the adjacent time slice in step 2 are related as follows:
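$\beta_t \mid \beta_{t-1} \sim N(\beta_{t-1}, \sigma^2 I)$
$\alpha_t \mid \alpha_{t-1} \sim N(\alpha_{t-1}, \delta^2 I)$
in the formula, $\beta_t$ and $\beta_{t-1}$ are the Dirichlet prior parameters of the topic-word distribution in the publication topic models on time slice t and time slice t-1, respectively; $\alpha_t$ and $\alpha_{t-1}$ are the Dirichlet prior parameters of the publication-topic distribution in the publication topic models on time slice t and time slice t-1, respectively; $N(\beta_{t-1}, \sigma^2 I)$ and $N(\alpha_{t-1}, \delta^2 I)$ are both normal distributions, and $\sigma^2 I$ and $\delta^2 I$ are the variances of the corresponding random variables;
$\beta_t \mid \beta_{t-1} \sim N(\beta_{t-1}, \sigma^2 I)$ indicates that the prior parameter $\beta_t$ of the topic-word distribution on time slice t depends on the prior parameter $\beta_{t-1}$ of the topic-word distribution on the previous time slice t-1 and follows the distribution $N(\beta_{t-1}, \sigma^2 I)$; $\alpha_t \mid \alpha_{t-1} \sim N(\alpha_{t-1}, \delta^2 I)$ indicates that the prior parameter $\alpha_t$ of the publication-topic distribution on time slice t depends on the prior parameter $\alpha_{t-1}$ of the publication-topic distribution on the previous time slice t-1 and follows the distribution $N(\alpha_{t-1}, \delta^2 I)$.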
7. The method as claimed in claim 6, wherein the number of topics of the time-series publication topic model in step 2, and the Dirichlet prior parameter α of the publication-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ in the publication topic model on the first time slice, are preset values.
8. The method of claim 7, wherein: the number of topics for the temporal publication topic model is 50.
9. The method as recited in claim 7, wherein the Dirichlet prior parameter α of the publication-topic distribution θ is 1 and the Dirichlet prior parameter β of the topic-word distribution φ is 0.01 in the publication topic model at the first time slice.
CN201811216206.2A 2018-10-18 2018-10-18 KL distance similarity measurement-based research hotspot evolution behavior detection method Active CN109408782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811216206.2A CN109408782B (en) 2018-10-18 2018-10-18 KL distance similarity measurement-based research hotspot evolution behavior detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811216206.2A CN109408782B (en) 2018-10-18 2018-10-18 KL distance similarity measurement-based research hotspot evolution behavior detection method

Publications (2)

Publication Number Publication Date
CN109408782A true CN109408782A (en) 2019-03-01
CN109408782B CN109408782B (en) 2020-07-03

Family

ID=65468456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811216206.2A Active CN109408782B (en) 2018-10-18 2018-10-18 KL distance similarity measurement-based research hotspot evolution behavior detection method

Country Status (1)

Country Link
CN (1) CN109408782B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102646114A (en) * 2012-02-17 2012-08-22 清华大学 News topic timeline abstract generating method based on breakthrough point
CN102902700A (en) * 2012-04-05 2013-01-30 中国人民解放军国防科学技术大学 Online-increment evolution topic model based automatic software classifying method
CN103559176A (en) * 2012-10-29 2014-02-05 中国人民解放军国防科学技术大学 Microblog emotional evolution analysis method and system
CN103984681A (en) * 2014-03-31 2014-08-13 同济大学 News event evolution analysis method based on time sequence distribution information and topic model
CN105868415A (en) * 2016-05-06 2016-08-17 黑龙江工程学院 Microblog real-time filtering model based on historical microblogs
US20160241346A1 (en) * 2015-02-17 2016-08-18 Adobe Systems Incorporated Source separation using nonnegative matrix factorization with an automatically determined number of bases
CN106204140A (en) * 2016-07-12 2016-12-07 华东师范大学 A kind of colony based on KL distance viewpoint migrates detection method
CN107918611A (en) * 2016-10-09 2018-04-17 郑州大学 A kind of model analyzed microblog topic and developed

Also Published As

Publication number Publication date
CN109408782B (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN105677873B (en) Text Intelligence association cluster based on model of the domain knowledge collects processing method
CN110188047B (en) Double-channel convolutional neural network-based repeated defect report detection method
CN106844424A (en) A kind of file classification method based on LDA
CN110807084A (en) Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
CN116992007B (en) Limiting question-answering system based on question intention understanding
CN111832289A (en) Service discovery method based on clustering and Gaussian LDA
CN108304479B (en) Quick density clustering double-layer network recommendation method based on graph structure filtering
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN108959305A (en) A kind of event extraction method and system based on internet big data
CN110728151B (en) Information depth processing method and system based on visual characteristics
CN106294863A (en) A kind of abstract method for mass text fast understanding
Pembeci Using word embeddings for ontology enrichment
CN107832467A (en) A kind of microblog topic detecting method based on improved Single pass clustering algorithms
Zhou et al. Neural storyline extraction model for storyline generation from news articles
Poudyal et al. Using Clustering Techniques to Identify Arguments in Legal Documents.
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN109408782B (en) KL distance similarity measurement-based research hotspot evolution behavior detection method
CN110633363A (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision
Sun et al. Stylometric and Neural Features Combined Deep Bayesian Classifier for Authorship Verification.
CN113987536A (en) Method and device for determining security level of field in data table, electronic equipment and medium
Chou et al. Text mining technique for Chinese written judgment of criminal case
Wu et al. Leveraging document-level and query-level passage cumulative gain for document ranking
Fan et al. Research and application of automated search engine based on machine learning
Efrizoni et al. Hybrid Modeling to Classify and Detect Outliers on Multilabel Dataset based on Content and Context
Nyandag et al. Keyword extraction based on statistical information for Cyrillic Mongolian script

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240315

Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee after: Dragon totem Technology (Hefei) Co.,Ltd.

Country or region after: China

Address before: Yuelu District City, Hunan province 410083 Changsha Lushan Road No. 932

Patentee before: CENTRAL SOUTH University

Country or region before: China
