CN109408782A - Method for detecting the evolution behavior of research hotspots based on KL distance similarity measure - Google Patents
- Authority
- CN
- China
- Prior art keywords
- topic
- publication
- time slice
- time
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
Technical Field
The invention belongs to the technical field of document topic analysis and detection, and in particular relates to a method for detecting the evolution behavior of research hotspots based on a KL distance similarity measure.
Background
As scientific research and exploration advance, the research hotspots of academic fields change with them. The interpenetration of disciplines and the application of new technologies drive research hotspots to evolve over time. In this process some old research questions disappear, new ones keep emerging, and some questions split or merge with other questions over time. These behaviors constitute the evolution of a discipline's research hotspots. Analyzing this evolution and tracing its trajectory is therefore necessary for predicting how research hotspots will develop. It helps scholars understand the current hot research questions and assists researchers and managers in grasping the development patterns of scientific research. The results and progress of researchers are concentrated in the academic journals where they publish their papers. These journals collect a large number of research results by category, and because a journal is published periodically, it essentially records the development of the research field it covers. Discovering how research hotspots evolve over time by extracting the topics of journals is therefore very meaningful.
In document topic analysis, the Author-Topic Model (ATM) is a commonly used topic clustering method. ATM models the interests of document authors and can be used to analyze their academic preferences [1]. It is a three-layer Bayesian probabilistic model comprising words, topics and author interests. The model can be mapped directly onto a journal topic model, in which a journal selects a topic with a certain probability and the topic generates topic words with a certain probability. However, the evolution of topics over time is an important factor in topic extraction, and ATM does not take time into account. When ATM is applied directly to the corpus of each time slice, the model parameters of each slice are independent and carry no temporal dependence, so the influence of topic change over time is ignored and the uncertainty in assigning topic words to topics increases. Blei proposed the DTM model [2] on the basis of the LDA (Latent Dirichlet Allocation) model to extract time-ordered topics. DTM, however, models the content of the data set rather than the journals, so it does not yield the topics contained in each journal of a document data set or their evolution over time, and it cannot meet the needs of journal topic research.
The prior art therefore lacks an effective means of detecting topic evolution behavior over time on a per-journal basis.
Summary of the Invention
The purpose of the invention is to address the deficiencies of the prior art by providing a method for detecting the evolution behavior of research hotspots based on a KL distance similarity measure. By combining the topical and temporal nature of journals, the invention proposes the Time Sequence Journal Topic Model (TS-JTM) and uses it to extract temporal topics from journals. It then measures topic evolution with a KL-distance topic similarity, and thereby detects the continuation, emergence, splitting, fusion and demise of topics.
A method for detecting the evolution behavior of research hotspots based on a KL distance similarity measure comprises the following steps:
Step 1: Obtain journal documents and build a topic word corpus with a time attribute based on their publication time.
Time slices are divided according to the publication time of the journal documents. The topic word corpus consists of the data sets of all time slices, and the data set of each time slice consists of the document feature vectors of the journal documents published in the matching period.
Here, $C_t$ is the data set on time slice t, $(w_i, j_i)$ is the document feature vector of journal document i, $w_i$ is the feature word set of journal document i, $j_i$ is the journal to which journal document i belongs, $c_i$ is the i-th feature word in a feature word set, $n_1$ is the number of journal documents on time slice t, and $n_2$ is the number of feature words in journal document i.
The feature words of a journal document are obtained by word segmentation of its content.
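The data set of step 1 can be written compactly. The expression below is a reconstruction from the symbol definitions given for $C_t$, $w_i$, $c_i$, $n_1$ and $n_2$, and may differ in notation from the patent's own formula:

```latex
C_t = \{(w_1, j_1), (w_2, j_2), \ldots, (w_{n_1}, j_{n_1})\},
\qquad
w_i = \{c_1, c_2, \ldots, c_{n_2}\}
```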
Step 2: Build a time-sequence journal topic model based on the topical and temporal nature of journals.
Each time slice of the time-sequence journal topic model corresponds to one journal topic model. For two adjacent time slices, the Dirichlet prior parameter α of the journal-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ in the journal topic model of the later slice are linked to the two Dirichlet prior parameters α and β of the earlier slice.
Step 3: Using the journal topic model of each time slice in turn, extract topics from the data set of the matching time slice to obtain the journal-topic distribution and the topic-word distribution on each time slice.
Step 4: Obtain the topics and topic-word distributions of the journal under test on each time slice, compute from the topic-word distributions the KL distance between every pair of topics of that journal on adjacent time slices, and then derive the evolution behavior of each topic of the journal from the topic-snapshot journal research hotspot evolution model.
The topic-snapshot journal research hotspot evolution model comprises detection rules for five types of evolution behavior: continuation, emergence, demise, splitting and fusion. Each rule identifies a behavior from the similarity of topics on adjacent time slices and from the characteristics of that behavior, which are themselves expressed in terms of similarity. The similarity of two topics is measured by the KL distance.
On the one hand, the invention proposes a topic-snapshot journal research hotspot evolution model. It uses the KL distance to measure the similarity between two topics of the same journal on two adjacent time slices, and it covers detection rules for continuation, emergence, splitting, fusion and demise, so that the evolution of time-ordered topics in the journal under test can be detected. The characteristics of the behaviors are as follows. (1) Continuation: a topic of the current time slice persists into the next slice, so it is very similar to exactly one topic of the next slice and dissimilar to all the others. (2) Emergence: a topic of the current time slice has no connection with the topics of the previous slice, so it is dissimilar to every topic of the previous slice. (3) Splitting: a topic of the current time slice splits into several topics, so it is similar to two or more topics of the next slice. (4) Fusion: several topics merge into one, so a topic of the current time slice is similar to two or more topics of the previous slice. (5) Demise: a topic of the current time slice has no connection with the topics of the next slice, so it is dissimilar to every topic of the next slice. From these characteristics and the KL values used to measure similarity, the invention derives the evolution behavior of the topics of the journal under test.
On the other hand, the invention builds a time-sequence journal topic model based on the topical and temporal nature of journals. The model accounts for the influence of topic change over time and links the journal topic models of adjacent time slices by parameter passing, which reduces the uncertainty in assigning topic words to topics and lowers the perplexity of the model. In addition, the model is built for the journals in the document data set. Because the subject area represented by a journal is more strongly topical than the subject area represented by an author, the time-sequence journal topic model suits the invention's goal of studying journal topic evolution better than the conventional author topic model ATM and the DTM model.
Preferably, the topic-snapshot journal research hotspot evolution model comprises the following detection rules:
a: If the KL distance between topic i on time slice t and exactly one topic on the next time slice t+1 is less than the similarity threshold, and its KL distance to every remaining topic on time slice t+1 is greater than or equal to the similarity threshold, topic i continues into time slice t+1.
b: If the KL distance between topic i on time slice t and every topic on the previous time slice t-1 is greater than the similarity threshold, topic i on time slice t is an emerging topic.
c: If the KL distance between topic i on time slice t and every topic on the next time slice t+1 is greater than the similarity threshold, topic i does not continue into time slice t+1 and dies out.
d: If the KL distance between topic i on time slice t and at least two topics on the next time slice t+1 is less than the similarity threshold, topic i splits into multiple topics in time slice t+1.
e: If the KL distance between topic i on time slice t and at least two topics on the previous time slice t-1 is less than the similarity threshold, topic i on time slice t is the fusion of multiple topics of time slice t-1.
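Rules a to e can be sketched as two small classifiers, one looking forward to slice t+1 and one looking back to slice t-1. This is a minimal illustration, not the patent's implementation: the function names and return labels are hypothetical, a distance equal to the threshold is treated as "not similar" (as rule a does), and the backward case with exactly one similar predecessor, which rules b and e do not name, returns `None`.

```python
from typing import Optional, Sequence

import numpy as np

THRESHOLD = 0.4  # the similarity threshold value the text prefers


def forward_behavior(kl_to_next: Sequence[float],
                     threshold: float = THRESHOLD) -> str:
    """Rules a, c, d: topic i on slice t versus every topic on slice t+1."""
    similar = int(np.sum(np.asarray(kl_to_next) < threshold))
    if similar == 0:
        return "demise"        # rule c: no similar successor
    if similar == 1:
        return "continuation"  # rule a: exactly one similar successor
    return "split"             # rule d: two or more similar successors


def backward_behavior(kl_from_prev: Sequence[float],
                      threshold: float = THRESHOLD) -> Optional[str]:
    """Rules b, e: topic i on slice t versus every topic on slice t-1."""
    similar = int(np.sum(np.asarray(kl_from_prev) < threshold))
    if similar == 0:
        return "emergence"     # rule b: no similar predecessor
    if similar >= 2:
        return "fusion"        # rule e: two or more similar predecessors
    return None                # one predecessor: not covered by rules b or e
```

Each argument is one row (or column) of the KL distance matrix between the topics of two adjacent slices, so a topic receives one forward label and, where rules b or e apply, one backward label.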
Preferably, each detection rule of the topic-snapshot journal research hotspot evolution model has its own detection formula.
The detection formula for continuation (rule a) is expressed in terms of the following quantities: the KL distance between topic i on time slice t and topic j on time slice t+1, and the KL distance between topic i on time slice t and topic k on time slice t+1; $\phi_i^t$, $\phi_j^{t+1}$ and $\phi_k^{t+1}$, the topic-word distributions of topic i on time slice t, topic j on time slice t+1 and topic k on time slice t+1; $T_{t+1}$, the set of topics on time slice t+1; and threshold_A, the similarity threshold.
The detection formula for emergence (rule b) uses the KL distance between topic j on time slice t-1 and topic i on time slice t, and $T_{t-1}$, the set of topics on time slice t-1.
The detection formulas for demise (rule c), splitting (rule d) and fusion (rule e) are expressed with the same quantities.
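The formulas themselves are not reproduced in this text. One way to write them that agrees with rules a to e and with the symbols $\phi_i^t$, $T_{t+1}$, $T_{t-1}$ and threshold_A is given below. The notation $KL(\cdot \,\|\, \cdot)$, with the earlier time slice as first argument, is an assumption and may differ from the patent's own notation:

```latex
\begin{aligned}
\text{(a) continuation:}\quad & \exists\, j \in T_{t+1}:\;
  KL(\phi_i^{t} \,\|\, \phi_j^{t+1}) < threshold\_A
  \;\wedge\; \forall\, k \in T_{t+1},\, k \neq j:\;
  KL(\phi_i^{t} \,\|\, \phi_k^{t+1}) \ge threshold\_A \\
\text{(b) emergence:}\quad & \forall\, j \in T_{t-1}:\;
  KL(\phi_j^{t-1} \,\|\, \phi_i^{t}) > threshold\_A \\
\text{(c) demise:}\quad & \forall\, j \in T_{t+1}:\;
  KL(\phi_i^{t} \,\|\, \phi_j^{t+1}) > threshold\_A \\
\text{(d) splitting:}\quad & \bigl|\{\, j \in T_{t+1} :
  KL(\phi_i^{t} \,\|\, \phi_j^{t+1}) < threshold\_A \,\}\bigr| \ge 2 \\
\text{(e) fusion:}\quad & \bigl|\{\, j \in T_{t-1} :
  KL(\phi_j^{t-1} \,\|\, \phi_i^{t}) < threshold\_A \,\}\bigr| \ge 2
\end{aligned}
```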
Preferably, the KL distance between two topics is computed from their topic-word distributions.
In the KL distance between topic j on time slice t-1 and topic i on time slice t, $\phi_j^{t-1}$ and $\phi_i^t$ denote the topic-word distributions of topic j on time slice t-1 and topic i on time slice t, $\phi_j^{t-1}(x)$ and $\phi_i^t(x)$ are the probabilities of topic word x under those two distributions, X is the topic word set of topic j on time slice t-1, and x is any topic word in X.
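The formula is not reproduced in this text. The standard Kullback-Leibler divergence, summed over the topic word set X of topic j, matches the symbol definitions above and is offered here as a reconstruction rather than the patent's exact expression:

```latex
KL\bigl(\phi_j^{t-1} \,\|\, \phi_i^{t}\bigr)
  = \sum_{x \in X} \phi_j^{t-1}(x)\,
    \log \frac{\phi_j^{t-1}(x)}{\phi_i^{t}(x)}
```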
It should be understood that this is a general formula: the KL distance for any other pair of adjacent time slices is computed in the same way. Note that if a topic word x does not occur in the topic word set of topic i on time slice t, $\phi_i^t(x)$ is set to a preset small value, for example 0.001.
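The computation, including the fallback for missing topic words, can be sketched as follows. This is a minimal illustration that assumes the standard Kullback-Leibler form and represents each topic-word distribution as a dictionary from word to probability. The function name `kl_distance` and the dictionary representation are hypothetical.

```python
import math
from typing import Dict

EPSILON = 0.001  # preset small value suggested in the text for missing words


def kl_distance(phi_prev: Dict[str, float],
                phi_curr: Dict[str, float],
                eps: float = EPSILON) -> float:
    """KL distance from a topic on the earlier slice to one on the later slice.

    The sum runs over the topic word set X of the earlier topic. A word that
    is absent from the later topic (or has probability 0 there) gets `eps`.
    """
    total = 0.0
    for word, p in phi_prev.items():
        if p <= 0.0:
            continue  # a zero-probability word contributes nothing to the sum
        q = phi_curr.get(word) or eps
        total += p * math.log(p / q)
    return total
```

With this fallback, a topic whose words are all absent from the other topic receives a large but finite distance, so it can still be compared against the similarity threshold.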
Preferably, the similarity threshold is 0.4.
Preferably, in step 2 the Dirichlet prior parameter α of the journal-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ in the journal topic models of adjacent time slices are related as follows:
$$\beta_t \mid \beta_{t-1} \sim N(\beta_{t-1}, \sigma^2 I)$$
$$\alpha_t \mid \alpha_{t-1} \sim N(\alpha_{t-1}, \delta^2 I)$$
where $\beta_t$ and $\beta_{t-1}$ are the Dirichlet prior parameters of the topic-word distribution in the journal topic models of time slices t and t-1, $\alpha_t$ and $\alpha_{t-1}$ are the Dirichlet prior parameters of the journal-topic distribution in the journal topic models of time slices t and t-1, $N(\beta_{t-1}, \sigma^2 I)$ and $N(\alpha_{t-1}, \delta^2 I)$ are normal distributions, and $\sigma^2 I$ and $\delta^2 I$ denote the variances of the corresponding random variables.
$\beta_t \mid \beta_{t-1} \sim N(\beta_{t-1}, \sigma^2 I)$ means that the prior parameter $\beta_t$ of the topic-word distribution on time slice t depends on the prior parameter $\beta_{t-1}$ of the previous time slice t-1 and follows the distribution $N(\beta_{t-1}, \sigma^2 I)$. Likewise, $\alpha_t \mid \alpha_{t-1} \sim N(\alpha_{t-1}, \delta^2 I)$ means that the Dirichlet prior parameter $\alpha_t$ of the journal-topic distribution on time slice t depends on $\alpha_{t-1}$ of the previous time slice t-1 and follows the distribution $N(\alpha_{t-1}, \delta^2 I)$.
The invention takes into account that academic journals are published periodically and that their topics evolve gradually. Adjacent time slices are therefore connected by parameter passing, that is, through the two Dirichlet prior parameters α and β. Because the values of α and β affect how topics form and change the distribution of words within a topic, the invention uses these two parameters to carry the influence of the journal-topic distribution θ and the topic-word distribution φ of the preceding time slice into the topic model parameters of the next slice. This reduces the uncertainty in assigning topic words to topics and lowers the perplexity of the model.
Preferably, the number of topics of the time-sequence journal topic model in step 2, and the Dirichlet prior parameter α of the journal-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ in the journal topic model of the first time slice, are preset values.
Preferably, the number of topics of the time-sequence journal topic model is 50.
Preferably, in the journal topic model of the first time slice, the Dirichlet prior parameter α of the journal-topic distribution θ is 1 and the Dirichlet prior parameter β of the topic-word distribution φ is 0.01.
Beneficial Effects
1. The invention proposes a new topic-snapshot journal research hotspot evolution model. It uses the KL distance to measure the similarity between two topics of the same journal on two adjacent time slices and detects the continuation, emergence, splitting, fusion and demise of topics between the topic snapshots of adjacent moments. This enables fine-grained analysis of how research hotspots evolve within a journal and fills the gap left by the prior art, which lacks an effective means of detecting time-ordered topic evolution on a per-journal basis. The detection rules for the five behaviors are derived from the similarity between topics of the same journal on adjacent time slices and accurately reflect the process of topic evolution.
2. A time-sequence journal topic model is built on the topical and temporal nature of journals, combining the characteristics of the journal topic model JTM and the DTM model. On the one hand, it accounts for the influence of topic change over time and links the journal topic models of adjacent time slices by parameter passing. This reduces the uncertainty in assigning topic words to topics and lowers the perplexity of the model, overcoming the defect of a standalone journal topic model, which ignores topic change over time and thereby increases that uncertainty. The model connects adjacent time slices through the Dirichlet prior parameters α and β. Because the values of α and β affect how topics form and change the distribution of words within a topic, these two parameters carry the influence of the journal-topic distribution θ and the topic-word distribution φ of the preceding time slice into the topic model parameters of the next slice. On the other hand, the model is built for the journals in the document data set, and the subject area represented by a journal is more strongly topical than the subject area represented by an author. Although the DTM model accounts for the influence of earlier topics as topics change over time, it models only the content of the data set and ignores the journals, so it cannot meet the needs of modeling journal topic evolution. Compared with a standalone journal topic model JTM and the existing DTM model, the time-sequence journal topic model therefore better suits the invention's goal of studying journal topic evolution. The journal-topic distribution and topic-word distribution it yields on each time slice lay the foundation for the subsequent detection of journal topic evolution.
3. Experiments show that the time-sequence journal topic model performs well in both perplexity and running time. Its perplexity is lower than that of the author topic model ATM and the DTM model, and its running time is close to that of the DTM model and shorter than that of ATM.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method for detecting the evolution behavior of research hotspots based on a KL distance similarity measure provided by the invention;
Fig. 2 is a schematic diagram of the journal topic model provided by the invention;
Fig. 3 is a schematic diagram of the time-sequence journal topic model provided by the invention;
Fig. 4 is a schematic diagram of the topic evolution behaviors in the topic-snapshot journal research hotspot evolution model provided by the invention;
Fig. 5 is a schematic diagram of the evolution of the topics of journal ID 003 from 2010 to 2016;
Fig. 6 is a comparison of the perplexity of the ATM model, the DTM model and the TS-JTM model.
Detailed Description
The invention is further described below with reference to embodiments.
Because the research hotspots of academic fields are reflected mainly in academic journals, analyzing the evolution of topics in journal data sets is important for helping researchers understand the development trajectory of disciplinary research hotspots and grasp their development patterns. As shown in Fig. 1, the invention addresses this need with a method for detecting the evolution behavior of research hotspots based on a KL distance similarity measure, comprising the following steps:
Step 1: Document information preprocessing. Journal documents are first obtained from a public document information database and preprocessed, and a topic word corpus with a time attribute is then built based on their publication time.
The preprocessing is as follows. The title, abstract, keywords, journal name and publication time of each journal document are extracted and formatted. A word segmentation tool splits the abstract and the title into words, stop words are removed, and the remaining words together with the keywords form the feature words of the document. In other feasible embodiments, the feature words may come from the abstract alone, from the abstract and keywords, from the abstract and title, and so on. The invention places no specific limitation on this.
Once the feature word set of each document has been obtained, time slices are divided according to publication time. The feature words of the documents belonging to the same time slice, together with the journal to which each document belongs, form the data set of that time slice. The data sets of all time slices make up the topic word corpus.
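The grouping into time slices can be sketched as follows. This is a minimal illustration that assumes word segmentation and stop-word removal have already been done and that one time slice corresponds to one publication year. The `Record` type and `build_corpus` function are hypothetical names, not part of the patent.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Record(NamedTuple):
    words: List[str]  # feature words after segmentation and stop-word removal
    journal: str      # journal the document belongs to
    year: int         # publication year, used here as the time slice key


FeatureVector = Tuple[List[str], str]  # (feature word set, journal)


def build_corpus(records: List[Record]) -> Dict[int, List[FeatureVector]]:
    """Group document feature vectors by time slice, in chronological order."""
    slices: Dict[int, List[FeatureVector]] = defaultdict(list)
    for rec in records:
        slices[rec.year].append((rec.words, rec.journal))
    return dict(sorted(slices.items()))
```

Each value of the returned dictionary plays the role of the data set of one time slice, and the dictionary as a whole plays the role of the topic word corpus.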
For example, scientific and technical document information is obtained from the CNKI public document repository to build the topic word corpus. Abstracts of 6487 articles from computer science journals published from 2010 to 2016, together with their journal names and publication times, were selected as experimental data. All document information was divided by year into data sets for 7 time slices, and the NLPIR Chinese word segmentation system of the Chinese Academy of Sciences was used to segment each abstract and remove stop words, forming the topic word set of each document. The document feature vector of document i is written $(w_i, j_i)$, where $w_i$ is the feature word set of document i and $j_i$ is the journal in which document i was published. The data set $C_t$ formed by the $n_1$ documents in time slice t can be expressed as $C_t = \{(w_1, j_1), \ldots, (w_{n_1}, j_{n_1})\}$.
Step 2: Build the Time Sequence Journal Topic Model (TS-JTM) based on the topical and temporal nature of journals.
Within each time slice, the Time Sequence Journal Topic Model (TS-JTM) is a journal topic model, shown in Fig. 2. In the model, α and β are the Dirichlet prior parameters of the journal-topic distribution θ and the topic-word distribution φ, K is the total number of journals, and T is the number of topics. The core idea of the journal topic model is that the journal J to which an article belongs selects a topic z from its topic distribution θ and then randomly generates a word w according to the probability distribution φ of that topic over words. This process is repeated until every word of the article has been generated.
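The generative process just described can be sketched as follows. This is an illustration under stated assumptions: θ is a K x T matrix with one topic distribution per journal, φ is a T x V matrix with one word distribution per topic, and the function name and array layout are hypothetical. It shows only the forward sampling of words, not the inference that recovers θ and φ from data.

```python
from typing import List

import numpy as np


def generate_document(journal: int,
                      theta: np.ndarray,
                      phi: np.ndarray,
                      vocab: List[str],
                      n_words: int,
                      rng: np.random.Generator) -> List[str]:
    """Generate the words of one article published in `journal`.

    theta[journal] is the journal's distribution over topics and
    phi[z] is topic z's distribution over the vocabulary.
    """
    words = []
    for _ in range(n_words):
        z = rng.choice(theta.shape[1], p=theta[journal])  # journal picks a topic
        w = rng.choice(phi.shape[1], p=phi[z])            # topic emits a word
        words.append(vocab[w])
    return words
```

A call such as `generate_document(0, theta, phi, vocab, 20, np.random.default_rng(0))` produces 20 words for an article of journal 0.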
In the TS-JTM of the present invention, the journal topic models on adjacent time slices are linked. As in the DTM model, and as shown in Figure 3, adjacent time slices are connected through the Dirichlet prior parameters α and β, whose values influence how topics form and change the distribution of words within a topic. The parameters of adjacent time slices are related as follows:
$\beta_t \mid \beta_{t-1} \sim \mathcal{N}(\beta_{t-1}, \sigma^2 I)$ (1)
$\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I)$ (2)
$\varphi_t \sim \mathrm{Dir}(\beta_t)$ (3)
$\theta_t \sim \mathrm{Dir}(\alpha_t)$ (4)
Here, Formula (1) states that the prior parameter $\beta_t$ of the topic-word distribution in time slice $t$ depends on the prior parameter $\beta_{t-1}$ of the topic-word distribution in the previous time slice $t-1$ and follows the distribution $\mathcal{N}(\beta_{t-1}, \sigma^2 I)$, so that $\beta_t$ and $\beta_{t-1}$ form a first-order Markov process. Likewise, Formula (2) states that the Dirichlet prior parameter $\alpha_t$ of the journal-topic distribution in time slice $t$ depends on the Dirichlet prior parameter $\alpha_{t-1}$ of the journal-topic distribution in the previous time slice $t-1$ and follows the distribution $\mathcal{N}(\alpha_{t-1}, \delta^2 I)$. Formulas (3) and (4) state that $\beta_t$ and $\alpha_t$ are the Dirichlet prior parameters of the topic-word distribution $\varphi_t$ and the journal-topic distribution $\theta_t$ in the model, respectively. The values of $\alpha_t$ and $\beta_t$ therefore influence the journal-topic and topic-word distributions.
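Formulas (1) to (4) can be illustrated with a short numerical sketch. The text does not say how a Gaussian draw is kept positive so that it remains a valid Dirichlet parameter; the `floor` clipping below is an assumption of this sketch, as are the helper name `next_prior`, the toy sizes, and the standard deviations.

```python
import numpy as np

def next_prior(prev, std, rng, floor=1e-3):
    """Draw the next slice's prior vector from N(prev, std^2 I).

    `floor` keeps every entry positive so that the result is a valid
    Dirichlet parameter; this clipping is an assumption of the sketch.
    """
    draw = rng.normal(loc=prev, scale=std)
    return np.maximum(draw, floor)

rng = np.random.default_rng(0)
T, N = 5, 8                            # topics and vocabulary size (toy values)
alpha_prev = np.full(T, 50.0 / T)      # alpha_{t-1}
beta_prev = np.full(N, 0.01)           # beta_{t-1}

alpha_t = next_prior(alpha_prev, std=0.1, rng=rng)    # Formula (2), delta assumed 0.1
beta_t = next_prior(beta_prev, std=0.001, rng=rng)    # Formula (1), sigma assumed 0.001
theta_t = rng.dirichlet(alpha_t)       # Formula (4): one journal's topic distribution
phi_t = rng.dirichlet(beta_t)          # Formula (3): one topic's word distribution
```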
Based on this architecture, the number of topics and the values of the Dirichlet prior parameters $\beta_1$ and $\alpha_1$ on the first time slice are set, and topic extraction on the first time slice's dataset yields the journal-topic distribution $\theta_1$ and the topic-word distribution $\varphi_1$ for that slice. Formulas (1) and (2) are then used to compute new parameters $\beta_1'$ and $\alpha_1'$ from $\beta_1$ and $\alpha_1$; these are passed to the second time slice as the initial values of its hyperparameters, and topic extraction is performed on the second time slice's dataset. Repeating this process yields the journal-topic and topic-word distributions on every time slice. In other words, the Dirichlet prior parameters $\alpha_t$ and $\beta_t$ on every later time slice are computed from $\alpha_{t-1}$ and $\beta_{t-1}$ of the preceding slice. Extracting topics from a time slice's dataset with that slice's journal topic model to obtain the journal-topic and topic-word distributions is prior art, so it is only outlined here rather than described in detail.
The journal-topic distribution θ and the topic-word distribution φ of the TS-JTM are inferred by Gibbs sampling. For each word, a journal and a topic are sampled according to Formula (5), whose right-hand side is p(topic|journal)·p(word|topic), that is, the probability that the journal selects the topic and the topic selects the word. Since there are T topics and K journals, the formula amounts to sampling among these K×T paths.
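Formula (5) itself is not reproduced in this text. As a hedged reconstruction, assuming it follows the collapsed Gibbs sampler of the author-topic model [1] with journals in place of authors, it would take the form below. The count symbols $C^{JT}_{kj}$ (number of times journal $k$ is assigned to topic $j$) and $C^{WT}_{mj}$ (number of times word $m$ is assigned to topic $j$) are names introduced here for illustration; both counts exclude the current word $i$, $T$ is the number of topics, and $N$ is the vocabulary size.

```latex
P(z_i = j,\, x_i = k \mid w_i = m,\, z_{-i},\, x_{-i}) \;\propto\;
\frac{C^{JT}_{kj} + \alpha}{\sum_{j'} C^{JT}_{kj'} + T\alpha}
\cdot
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + N\beta}
```

The first factor plays the role of p(topic|journal) and the second the role of p(word|topic).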
In Formula (5), $z_i = j$ and $x_i = k$ mean that the $i$-th word of a document is assigned to the $j$-th topic and the $k$-th journal, and $w_i = m$ means that the $i$-th word is the $m$-th entry of the vocabulary. $z_{-i}$ and $x_{-i}$ denote the topic and journal assignments of all words other than the $i$-th. The two count terms are, respectively, the number of times word $m$ has been assigned to topic $j$ before the current assignment and the number of times journal $k$ has been assigned to topic $j$ so far. $N$ is the size of the vocabulary, which consists of all distinct feature words in the dataset. For parameter estimation, Formula (5) requires only two matrices to be maintained: the $N \times T$ word-by-topic count matrix and the $K \times T$ journal-by-topic count matrix. From these two count matrices, the topic-word distribution φ and the journal-topic distribution θ are estimated with Formulas (6) and (7), respectively.
In these formulas, $\varphi_{mj}$ is the probability that topic $j$ uses word $m$, $\theta_{kj}$ is the probability that journal $k$ selects topic $j$, $m'$ ranges over the words assigned to topic $j$, and $j'$ ranges over the topics assigned to journal $k$.
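The estimation from the two count matrices can be sketched as follows. Formulas (6) and (7) are not reproduced in this text, so the sketch uses the standard smoothed estimators of the author-topic model [1], on the assumption that the patent's formulas have the same form. The function name and the toy counts are illustrative.

```python
import numpy as np

def estimate_distributions(word_topic, journal_topic, alpha, beta):
    """Smoothed point estimates from the two Gibbs count matrices.

    word_topic:    (N, T) counts of word m assigned to topic j.
    journal_topic: (K, T) counts of journal k assigned to topic j.
    Returns phi with shape (N, T) and theta with shape (K, T).
    """
    n_words, n_topics = word_topic.shape
    # phi[m, j]: each topic column sums to 1 over the vocabulary.
    phi = (word_topic + beta) / (word_topic.sum(axis=0) + n_words * beta)
    # theta[k, j]: each journal row sums to 1 over the topics.
    theta = (journal_topic + alpha) / (
        journal_topic.sum(axis=1, keepdims=True) + n_topics * alpha
    )
    return phi, theta

word_topic = np.array([[3, 0], [1, 2], [0, 4]])   # toy counts: N=3 words, T=2 topics
journal_topic = np.array([[4, 1], [0, 5]])        # toy counts: K=2 journals
phi, theta = estimate_distributions(word_topic, journal_topic, alpha=0.5, beta=0.01)
```

Keeping only these two matrices is what makes the sampler cheap to update: reassigning one word changes one entry in each matrix.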
Step 3: Using the journal topic model on each time slice of the TS-JTM, perform topic extraction on the dataset of the matching time slice in turn, obtaining the journal-topic distribution and the topic-word distribution on each time slice.
Following the TS-JTM architecture built in Step 2, this embodiment sets the number of topics together with the initial values of the Dirichlet prior parameter α of the journal-topic distribution θ and the Dirichlet prior parameter β of the topic-word distribution φ, and then performs topic extraction on each time slice's dataset in turn to obtain the journal-topic and topic-word distributions on each time slice. The procedure uses the TS-JTM model for topic extraction: for each time slice $t$, steps 1.1, 1.2 and 1.3 are executed in a loop:
1.1 Apply the TS-JTM model in time slice $t$ to extract topics from the dataset, obtaining the topic set $T_t$ and the topic-word distributions;
1.2 Add the topic set $T_t$ to the time-series topic collection TC;
1.3 Update the TS-JTM model using the parameters $\alpha_t$, $\beta_t$ of the current time slice's model.
It should be understood that updating the TS-JTM model means updating the model parameters α and β of the next time slice in the TS-JTM.
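The loop over steps 1.1 to 1.3 can be sketched as a small driver. The two callables `extract_topics` and `update_priors` are hypothetical placeholders for the per-slice topic extraction and the prior update; the stub versions below exist only to show the control flow and carry no modelling meaning.

```python
def run_ts_jtm(time_slices, alpha_1, beta_1, extract_topics, update_priors):
    """Run steps 1.1-1.3 over the time slices in chronological order.

    extract_topics(dataset, alpha, beta) -> (topics, phi) and
    update_priors(alpha, beta) -> (alpha_next, beta_next) are
    hypothetical callables supplied by the caller.
    """
    alpha, beta = alpha_1, beta_1
    topic_collection = []                                    # the collection TC
    for dataset in time_slices:
        topics, phi = extract_topics(dataset, alpha, beta)   # step 1.1
        topic_collection.append((topics, phi))               # step 1.2
        alpha, beta = update_priors(alpha, beta)             # step 1.3
    return topic_collection

# Stub callables, used only to demonstrate the control flow.
def fake_extract(dataset, alpha, beta):
    return [f"topic-{len(dataset)}"], {"alpha": alpha, "beta": beta}

def fake_update(alpha, beta):
    return alpha * 0.9, beta

tc = run_ts_jtm([["d1"], ["d2", "d3"]], 5.0, 0.01, fake_extract, fake_update)
```

The priors used for slice t are always those produced after slice t-1, which is the only coupling between slices in this sketch.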
Step 4: Obtain the topics and topic-word distributions of the journal under test on each time slice, compute from the topic-word distributions the KL distance between every pair of topics of that journal on adjacent time slices, and then derive the evolution behavior of each topic of the journal from the topic-snapshot research hotspot evolution model.
Figure 4 shows the topic snapshots of three adjacent time slices of a journal; the dashed lines between time slices indicate relationships between topics. The topic-snapshot research hotspot evolution model proposed by the invention covers the following behaviors between topics: ① a one-to-one relationship means the topic in the current time slice continues a topic from the previous time slice; ② a topic in the next time slice with no connection to any topic in the previous time slice is a newly emerged topic; ③ a one-to-many relationship means a topic in the previous time slice has split into several topics; ④ a many-to-one relationship means several topics have merged into one; ⑤ a topic in the previous time slice with no connection to any topic in the next time slice has died out.
To measure the similarity between two topics, the invention uses the KL distance. The KL (Kullback-Leibler divergence) distance, introduced by Solomon Kullback and Richard Leibler [3] and also called relative entropy, is commonly used to measure how similar two probability distributions are, and can therefore measure the similarity between any two topics in adjacent time slices. Formula (8) gives the KL distance, where the two probability distributions are written here as $P$ and $Q$; when the two distributions are identical, the KL distance is 0.
$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$ (8)
The invention uses the KL distance to measure the similarity between two topics located on two adjacent time slices and so establishes the correspondence between topics of adjacent time slices; the probability distributions in the formula are the topic-word distributions of the topics.
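Computing the KL distance between topic-word distributions can be sketched as below, using the standard definition of KL divergence. Two points are assumptions of this sketch rather than statements from the source: the small `eps` smoothing that avoids division by zero, and the direction of the divergence (topic on slice t against topic on slice t+1), since KL divergence is not symmetric.

```python
import numpy as np

def kl_distance(p, q, eps=1e-12):
    """KL divergence D(p || q) between two topic-word distributions.

    `eps` smooths zero probabilities before renormalising; this
    smoothing is an assumption of the sketch, not part of the source.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_matrix(phi_t, phi_next):
    """Pairwise distances; entry [i, j] compares topic i at t with topic j at t+1."""
    return np.array([[kl_distance(p, q) for q in phi_next] for p in phi_t])

# Toy topic-word distributions over a 2-word vocabulary (one row per topic).
phi_t = [[0.9, 0.1], [0.5, 0.5]]
phi_next = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
distances = kl_matrix(phi_t, phi_next)
```

Each row of `distances` can then be compared against the similarity threshold to decide which topics on the next slice a given topic is linked to.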
Based on evolution behaviors ① to ⑤ above, the topic-snapshot research hotspot evolution model of the invention includes the following detection rules:
a: When topic i on time slice t has a KL distance below the similarity threshold to exactly one topic on the next time slice t+1, and its KL distances to all remaining topics on time slice t+1 are greater than or equal to the similarity threshold, topic i continues into time slice t+1.
b: When the KL distances between topic i on time slice t and every topic on the previous time slice t-1 are all greater than the similarity threshold, topic i on time slice t is a newly emerged topic.
c: When the KL distances between topic i on time slice t and every topic on the next time slice t+1 are all greater than the similarity threshold, topic i has no continuation in time slice t+1 and dies out.
d: When the KL distances between topic i on time slice t and at least two topics on the next time slice t+1 are below the similarity threshold, topic i splits into multiple topics in time slice t+1.
e: When the KL distances between topic i on time slice t and at least two topics on the previous time slice t-1 are below the similarity threshold, topic i on time slice t results from the merging of multiple topics of time slice t-1.
The detection formula corresponding to rules a to e is as follows:
In the formula, the state flag identifies the evolution behavior of the i-th topic in the t-th time slice, and threshold_A is the similarity threshold. Repeated experiments showed that setting threshold_A to 0.4 reflects the evolution behavior of topics reasonably well; other feasible embodiments may use other values.
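Rules a to e can be sketched as a small classifier over the KL distances of one topic to its neighbouring time slices. This is a sketch under stated assumptions, not the patent's Formula (9): the behavior labels are names chosen here, a distance exactly equal to the threshold is treated as "not similar", and a topic may receive one label from each direction (backward and forward). The input values are the KL values quoted in the worked examples later in this section.

```python
import numpy as np

def classify_topic(kl_to_prev, kl_to_next, threshold=0.4):
    """Return the set of behaviors of one topic on time slice t.

    kl_to_prev: KL distances to every topic on slice t-1 (None if unavailable).
    kl_to_next: KL distances to every topic on slice t+1 (None if unavailable).
    """
    behaviors = set()
    if kl_to_next is not None:
        close_next = int(np.sum(np.asarray(kl_to_next) < threshold))
        if close_next == 1:
            behaviors.add("continuation")   # rule a
        elif close_next == 0:
            behaviors.add("extinction")     # rule c
        else:
            behaviors.add("split")          # rule d
    if kl_to_prev is not None:
        close_prev = int(np.sum(np.asarray(kl_to_prev) < threshold))
        if close_prev == 0:
            behaviors.add("emergence")      # rule b
        elif close_prev >= 2:
            behaviors.add("merge")          # rule e
    return behaviors

# KL values quoted in the worked examples of this document.
nn_2013 = classify_topic(None, [0.55, 0.27, 0.21, 0.69, 1.84, 1.16, 0.92, 1.53])
sr_2016 = classify_topic([0.74, 0.46, 0.23, 0.16, 0.81, 0.95, 1.37], None)
aircraft_2014 = classify_topic(None, [1.72, 1.46, 1.25, 1.07, 1.20, 0.83, 1.59])
cloud_2011 = classify_topic([1.16, 0.75, 1.37, 2.32, 1.51], None)
```

Under these assumptions the four calls reproduce the behaviors reported in the examples: a split for NN in 2013, a merge into SR in 2016, extinction of the "aircraft" topic, and emergence of the "cloud computing" topic.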
For the journal under test, the following steps 2.2.1 and 2.2.2 are executed on each time slice:
2.2.1 Extract from the collection TC the topic set $T_t$ of the current time slice and the topic sets $T_{t-1}$ and $T_{t+1}$ of the two adjacent time slices, and obtain the topics of the journal under test in $T_{t-1}$, $T_t$ and $T_{t+1}$;
2.2.2 Detect the evolution behavior of each topic of the journal under test on the current time slice according to Formula (9).
To describe the solution of the invention more clearly, several examples are given below.
1. Change of topic words over time
As shown in Table 1 below, for the journal with ID 003 in the dataset, topic number 2 in the 2010 topic distribution relates to face recognition. Table 1 gives the topic-word distribution of topic number 2 from 2010 to 2016, listing the 10 highest-probability topic words of this topic for each year. The table shows that the core vocabulary of the "face recognition" topic changed little over time: popular terms such as "image", "feature" and "face recognition" are present throughout. "Genetic algorithm", which appears in 2013, and "deep learning", which appears in 2015, reflect the application of new methods to face recognition. From 2010 to 2016, the KL values between this topic in each pair of adjacent time slices are 0.20, 0.26, 0.23, 0.17, 0.21 and 0.19, all below the similarity threshold threshold_A, while the KL distances between the "face recognition" topic and every other topic in the next time slice are all above threshold_A. The "face recognition" topic therefore continued throughout 2010 to 2016.
Table 1. Topic-word distribution of the "face recognition" topic
2. Evolution of journal topics over time
For convenience, topics are referred to below by their English abbreviations. Table 2 lists the 10 highest-probability topic words in each year from 2010 to 2016 for three topics: "neural network (NN)", "deep learning (DL)" and "speech recognition (SR)". Table 2 shows that core terms of topic NN such as "neural network", "neuron" and "feature" stay essentially unchanged, while the distribution of peripheral terms such as "sample" and "particle swarm" varies considerably across time slices. Among their top 10 topic words, topic NN in 2013 and topic DL in 2014 share "training", "classification", "performance", "feature" and "neuron". Because their word distributions are similar, the KL value between these two topics is small: 0.27, below the similarity threshold threshold_A. The KL values between topic NN in 2013 and all topics in the next time slice are 0.55, 0.27, 0.21, 0.69, 1.84, 1.16, 0.92 and 1.53. The two smallest values correspond to topics DL and NN, and all the others exceed the similarity threshold, so topic DL was produced by a split of topic NN.
Table 2. Word distributions of three topics, including "speech recognition", for 2010-2016
3. Analysis of journal topic evolution
Figure 5 shows how the topics of the journal with ID 003 evolved between 2010 and 2016. Because clustering assigns the same topic different numbers in different time slices, topics in the figure are labeled with English abbreviations. As Figure 5 shows, the KL values between all topics in 2015 and topic SR in 2016 are 0.74, 0.46, 0.23, 0.16, 0.81, 0.95 and 1.37; the two topics below the similarity threshold are NN and SR, and the remaining KL values are above the threshold, which means topic NN merged into topic SR. The KL distances between the 2014 "aircraft" topic and all 2015 topics are 1.72, 1.46, 1.25, 1.07, 1.20, 0.83 and 1.59; the minimum, 0.83, exceeds the similarity threshold, so the "aircraft" topic died out in 2015. The KL values between all 2010 topics and the 2011 "cloud computing" topic are 1.16, 0.75, 1.37, 2.32 and 1.51; the minimum, 0.75, exceeds the similarity threshold, so this topic is newly emerged. By the same reasoning, the "target tracking" topic continues throughout, and "entity recognition" is a newly emerged topic in 2013.
Model performance verification
To verify the performance of the proposed Time Series Journal Topic Model (TS-JTM), the invention uses perplexity as the metric. Formula (10) gives the perplexity, where $D_{test}$ is the test set, a collection of $M$ documents; $p(W_d)$ is the probability of the words in document $d$; $N_d$ is the number of words in document $d$; and $W_d = (w_{1d}, w_{2d}, \ldots, w_{id}, \ldots, w_{nd})$ is the word vector of document $d$. The lower the perplexity, the better the model performs.
$\mathrm{Perplexity}(D_{test}) = \exp\left\{-\frac{\sum_{d=1}^{M} \log p(W_d)}{\sum_{d=1}^{M} N_d}\right\}$ (10)
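The perplexity computation can be sketched as below, using the standard topic-model definition: the exponential of the negative total log-likelihood divided by the total word count. It is assumed here that Formula (10) has this standard form. How $\log p(W_d)$ is obtained from a fitted model is outside the scope of the sketch, so the inputs are per-document log-likelihoods supplied by the caller.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """Perplexity of a test set.

    doc_log_likelihoods[d] is log p(W_d) (natural log) and
    doc_lengths[d] is N_d, the number of words in document d.
    """
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("test set contains no words")
    return math.exp(-sum(doc_log_likelihoods) / total_words)

# Toy check: if every word has probability 1/8, the perplexity is 8.
log_liks = [3 * math.log(1 / 8), 5 * math.log(1 / 8)]
value = perplexity(log_liks, [3, 5])
```

The toy check illustrates the usual reading of perplexity as an effective number of equally likely word choices.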
To measure the performance of TS-JTM, three model parameters must be set before the experiment. The number of topics |T| is increased gradually, starting from 10. The two Dirichlet hyperparameters of the author-topic model (ATM) are set to α = 50/|T| and β = 0.01. For the dynamic topic model (DTM) and TS-JTM, the two Dirichlet hyperparameters in the first time slice are set to α = 50/|T| and β = 0.01, and α and β in the remaining time slices are obtained automatically by the model. The comparison results are shown in Figure 6, where the horizontal axis is the number of topics and the vertical axis is perplexity. Across all topic counts, TS-JTM has the lowest perplexity, which indicates that it performs best. In addition, perplexity decreases as the number of topics increases and stays unchanged once the number of topics exceeds 50, which indicates that 50 is a reasonable number of topics for TS-JTM. In this embodiment, the number of topics of the TS-JTM is preferably 50, with the two Dirichlet hyperparameters set to α = 50/|T| and β = 0.01.
In addition, the running time of TS-JTM on the dataset was measured and compared with that of ATM and DTM. Processing the same data, the running times of TS-JTM, ATM and DTM were 23.8, 25.6 and 24.2 minutes, respectively. TS-JTM and DTM thus have similar running times, and ATM is the slowest. Together with the perplexity results in Figure 6, this shows that TS-JTM not only has low perplexity but also performs well in running time.
In summary, the topic evolution of academic journals reflects how research hotspots in an academic field develop. Because the topicality and temporality of a journal affect the distribution and evolution of its topics, and because topics exhibit evolution behaviors as they change, identifying the evolution track of a journal's research hotspots is complicated. This work combines the topicality and temporality of journals to propose the Time Series Journal Topic Model TS-JTM, uses TS-JTM to extract temporal hotspots from academic journals, and verifies the performance of TS-JTM through a perplexity comparison experiment. On this basis, a time-series topic-snapshot research hotspot evolution model is built, with the KL distance as the similarity measure, to detect continuation, emergence, splitting, merging and extinction of topics between topic snapshots at adjacent times, giving a fine-grained analysis of how research hotspots in a journal evolve.
It should be emphasized that the examples described in the present invention are illustrative rather than restrictive. The invention is therefore not limited to the examples described in the detailed description; any other embodiment derived by a person skilled in the art from the technical solution of the invention that does not depart from its spirit and scope, whether by modification or substitution, likewise falls within the scope of protection of the invention.
References:
[1] Rosen-Zvi M, Griffiths T, Steyvers M. The Author-Topic Model for Authors and Documents [C]. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004: 487-494.
[2] Blei D M, Lafferty J D. Dynamic Topic Models [C]. Proceedings of the 23rd International Conference on Machine Learning, 2006: 113-120.
[3] MacKay D J C. Information Theory, Inference, and Learning Algorithms [M]. Cambridge University Press, 2003: 22-48.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811216206.2A CN109408782B (en) | 2018-10-18 | 2018-10-18 | KL distance similarity measurement-based research hotspot evolution behavior detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811216206.2A CN109408782B (en) | 2018-10-18 | 2018-10-18 | KL distance similarity measurement-based research hotspot evolution behavior detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109408782A true CN109408782A (en) | 2019-03-01 |
CN109408782B CN109408782B (en) | 2020-07-03 |
Family
ID=65468456
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811216206.2A Active CN109408782B (en) | 2018-10-18 | 2018-10-18 | KL distance similarity measurement-based research hotspot evolution behavior detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408782B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102646114A (en) * | 2012-02-17 | 2012-08-22 | 清华大学 | A Timeline Summary Generation Method for News Topics Based on Breakthrough Points |
CN102902700A (en) * | 2012-04-05 | 2013-01-30 | 中国人民解放军国防科学技术大学 | Online-increment evolution topic model based automatic software classifying method |
CN103559176A (en) * | 2012-10-29 | 2014-02-05 | 中国人民解放军国防科学技术大学 | Microblog emotional evolution analysis method and system |
CN103984681A (en) * | 2014-03-31 | 2014-08-13 | 同济大学 | News event evolution analysis method based on time sequence distribution information and topic model |
CN105868415A (en) * | 2016-05-06 | 2016-08-17 | 黑龙江工程学院 | Microblog real-time filtering model based on historical microblogs |
US20160241346A1 (en) * | 2015-02-17 | 2016-08-18 | Adobe Systems Incorporated | Source separation using nonnegative matrix factorization with an automatically determined number of bases |
CN106204140A (en) * | 2016-07-12 | 2016-12-07 | 华东师范大学 | A kind of colony based on KL distance viewpoint migrates detection method |
CN107918611A (en) * | 2016-10-09 | 2018-04-17 | 郑州大学 | A kind of model analyzed microblog topic and developed |
Also Published As
Publication number | Publication date |
---|---|
CN109408782B (en) | 2020-07-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109670039B (en) | Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis | |
Vadicamo et al. | Cross-media learning for image sentiment analysis in the wild | |
CN109376242B (en) | Text classification method based on cyclic neural network variant and convolutional neural network | |
US10719664B2 (en) | Cross-media search method | |
WO2018218708A1 (en) | Deep-learning-based public opinion hotspot category classification method | |
US20150310862A1 (en) | Deep learning for semantic parsing including semantic utterance classification | |
WO2020253583A1 (en) | Written composition off-topic detection method | |
CN108304479B (en) | Quick density clustering double-layer network recommendation method based on graph structure filtering | |
CN115017887B (en) | Chinese rumor detection method based on graph convolution | |
CN118113849B (en) | Information consulting service system and method based on big data | |
CN110209818A (en) | A kind of analysis method of Semantic-Oriented sensitivity words and phrases | |
CN116186268A (en) | Multi-document summary extraction method and system based on Capsule-BiGRU network and event automatic classification | |
CN110046353A (en) | An Aspect-Level Sentiment Analysis Method Based on Multilingual Hierarchical Mechanism | |
Wan | Sentiment analysis of Weibo comments based on deep neural network | |
Lee et al. | Detecting suicidality with a contextual graph neural network | |
CN117708336A (en) | A multi-strategy sentiment analysis method based on topic enhancement and knowledge distillation | |
CN113761125A (en) | Dynamic summary determination method and device, computing equipment and computer storage medium | |
Sheeba et al. | A fuzzy logic based on sentiment classification | |
CN110245682A (en) | A kind of network representation learning method based on topic | |
CN109408782B (en) | KL distance similarity measurement-based research hotspot evolution behavior detection method | |
Zhang et al. | Text Semantic Analysis Algorithm Based on LDA Model and Doc2vec | |
Efrizoni et al. | Hybrid Modeling to Classify and Detect Outliers on Multilabel Dataset based on Content and Context | |
Zheng et al. | Automatic Labeling of SDN Controller Defect Text based on Neural Topic Model | |
CN114742048A (en) | System and method for automatic generation of Internet news hot events | |
Zhu et al. | A performance comparison of fake news detection approaches |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20240315 Address after: 230000 floor 1, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province Patentee after: Dragon totem Technology (Hefei) Co.,Ltd. Country or region after: China Address before: Yuelu District City, Hunan province 410083 Changsha Lushan Road No. 932 Patentee before: CENTRAL SOUTH University Country or region before: China |