CN110543564B - Domain Label Acquisition Method Based on Topic Model - Google Patents
Info
- Publication number
- CN110543564B (application CN201910784200.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- topic
- words
- model
- weight
- Prior art date
- 2019-08-23
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention provides a topic-model-based method for acquiring domain labels. On the basis of massive academic data, the inherent characteristics of academic data are analyzed, and academic word-frequency features are introduced to construct an FLDA topic model; the topic model performs "topic-phrase" extraction on the academic documents of a given scholar. A domain label system is then introduced: the phrases extracted by the topic model and the system labels are represented as vectors, position weighting is applied, and similarity is used to map the phrases onto the system, finally yielding the scholar's domain labels. Experiments show that, compared with the traditional LDA model, the statistics-based TFIDF algorithm, and the network-graph-based TextRank algorithm, the FLDA model obtains better label words with higher accuracy, indicating that topic-model-based label extraction is well suited to the academic domain.
Description
Technical Field

The present invention relates to a method for acquiring domain labels based on a topic model, and in particular to a method for acquiring the domain labels of scholars; it belongs to the technical field of information processing.
Background Art

The vigorous development of the economy and society has driven the continuous emergence of scientific and technological projects, and such projects require the participation of leading scholars at every stage from approval through review to acceptance. Traditionally, scholars have been chosen manually by dedicated staff, who tally each scholar's research fields and select scholars whose fields match the project. This prior-art approach has the following drawbacks: a large number of projects require scholar participation at the same time, which greatly increases the workload of manual selection; and manual selection is subject to human subjectivity and limitations, being influenced throughout by the selector's own knowledge, social relations, personal preferences, and interests, so that judgments of a scholar's field are incomplete, which in turn reduces the accuracy of the selection results.

Domain label acquisition in the prior art falls mainly into two categories: traditional domain label acquisition and keyword-based domain label acquisition.

Among traditional approaches, one method extracts labels from scholars' profiles on various Internet platforms; these are called network tags. Unlike extracted labels, network tags are usually written and added by the scholar or by others, with no uniform specification and arbitrary wording, so the acquired tags are heterogeneous and of low usability. Moreover, because Internet content is informal and bears each author's writing style, it is difficult to separate correct information from useless information during tag extraction, and extraction schemes often have to be designed for a specific platform or a specific scholar, which adds further workload.

Another method designs a P2P-mode management system for scholars' research fields based on ontology technology and uses RDF to solve the problem of acquiring a scholar's fields; however, because this method relies on a specific template, its extensibility is insufficient.

Yet another method implements a scholar information management system with J2EE: scholars' basic information and research fields are updated manually, and a consulting-scholar recommendation module computes the Pearson similarity between a user's question text and the scholars' research fields to recommend scholars. This method requires scholars to log in to the website to update and maintain their information; it suits only a small number of scholars in a single field and cannot perform large-scale scholar discovery or machine-based domain label extraction.

For keyword-based domain label extraction, many extraction methods exist; commonly used foundations include statistics, topics, and network graphs. Keywords are a series of words that summarize an article at a high level, and automatic keyword extraction is the technique of identifying representative words or phrases in text. Since the field was proposed, researchers have put forward a wide variety of methods, broadly divided into supervised and unsupervised. Supervised methods require a manually annotated corpus and suit small text collections, but as massive Internet data has grown, the cost of manual annotation has risen, and work has gradually shifted toward unsupervised methods in recent years. The core idea of the statistical approach is to extract keywords from the statistical information of words in the text; it needs no training data and judges and filters directly by word frequency, position, and so on. For example, one method uses weighting factors and word contribution to revise TFIDF weights, improving keyword extraction in fine-grained domains; another uses an N-gram language model with document-weight merging to recognize a scholar's field automatically: the N-gram model computes directly on word order, classifying documents by domain without word segmentation or feature extraction, and thereby finds the scholar's domain labels. Topic models use probability distributions to extract keywords; the most popular at present is the LDA (Latent Dirichlet Allocation) model. Examples include using static LDA and an improved fLDA to extract topic words for mining users' historical interests when studying the evolution of user behavior, and introducing time-series features and word-frequency weighting into the LDA algorithm to mitigate data sparsity in microblog hot events; the topic keywords obtained by the latter are highly interpretable and express the topic's content well. The prior art further proposes an LDA-based topic clustering method that first clusters the keywords obtained by LDA and then uses the clusters to refine the topics LDA produces, effectively improving the precision and recall of the clustering results. Among network-graph methods, the TextRank algorithm adapted from PageRank is the best known. To address poor academic keyword extraction, one approach uses prior knowledge to compute candidates' weights in the academic field and then ranks the candidates comprehensively with TextRank, finally obtaining highly relevant academic keywords; another builds a probability transition matrix from word vectors to improve the TextRank algorithm's performance.

Although unsupervised methods based on statistics or network graphs need no manual annotation of the corpus in advance, they depend heavily on the quality and scale of the corpus. Methods such as TFIDF are structurally simple, and the keywords they extract lack distributional and semantic information. Methods such as TextRank can obtain keyword distribution information, but building the network graph requires large amounts of data to form edges, and the extracted keywords lack topical relevance. Despite these shortcomings, unsupervised methods retain an advantage in workload.

The present invention intends to use a topic-based extraction method: the collection of academic documents is treated as the corpus to be processed, and an improved FLDA topic model performs "topic-phrase" extraction on the corpus to obtain a topic distribution matrix, realizing automatic label acquisition.
Summary of the Invention

To solve the problems of the prior art, the present invention proposes a topic-model-based domain label acquisition method. On the basis of massive academic data, the inherent characteristics of academic data are analyzed, academic word-frequency features are introduced to construct an FLDA topic model, and the topic model performs "topic-phrase" extraction on the academic documents of a given scholar. A domain label system is then introduced: the phrases extracted by the topic model and the system labels are represented as vectors, position weighting is applied, and similarity is used to map the phrases onto the system, finally yielding the scholar's domain labels.

To achieve the above technical purpose, the present invention adopts the following technical solution.

A topic-model-based domain label acquisition method, referring to Fig. 1, comprises the following steps:

S1, data preprocessing:

acquire the initial data set;

S2, keyword extraction:

perform "topic-phrase" extraction with FLDA, assign each phrase a weight according to its position in the text, and represent it as a vector with word2vec;

S3, domain system mapping:

map the "topic-phrases" onto the label system to enable unified management of scholars' fields;

S4, comprehensive ranking:

rank the vector-representation results together with the assigned weights, and obtain the label words that best represent the scholar via a threshold.
According to the aforementioned topic-model-based domain label acquisition method, specifically, S1, data preprocessing, comprises S11, data deduplication, and S12, word segmentation.

Specifically, to eliminate the influence on the computation of duplicate records caused by crawling multiple data sources, S11, data deduplication, is required in the S1 data preprocessing stage, yielding each scholar's document collection.

The present invention builds the cleaning model from character co-occurrence, author overlap, and keyword coincidence rate:

for two texts to be compared, first check whether their DOIs are identical; texts with identical DOIs are filtered directly;

if the DOIs differ or are absent, first judge the titles by character co-occurrence; if the co-occurrence exceeds 80%, further check the author co-occurrence count and the keyword co-occurrence rate, and if the author co-occurrence count is greater than 1 and the keyword co-occurrence rate is greater than 0.5, the pair is judged a duplicate and removed.
The co-occurrence formula is:

$$co(A, B) = \frac{len(A \cap B)}{\min\{len(A),\, len(B)\}}$$

where A and B are the character sets of the two titles, len(A) and len(B) are the lengths of the character sets of titles A and B, len(A ∩ B) is the length of the intersection of the two title character sets, and min{len(A), len(B)} is the smaller of the two lengths.
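As an illustration, the following Python sketch implements this cleaning model; the function names (`title_cooccurrence`, `is_duplicate`) and the record layout are assumptions, and the denominator of the keyword co-occurrence rate is likewise an assumption, since the formula above is stated only for titles.

```python
def title_cooccurrence(a: str, b: str) -> float:
    """Character co-occurrence of two titles: len(A ∩ B) / min{len(A), len(B)}."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def is_duplicate(doc1: dict, doc2: dict) -> bool:
    """Cleaning rules: identical DOI, or title co-occurrence above 80% together
    with author co-occurrence count > 1 and keyword co-occurrence rate > 0.5."""
    if doc1.get("doi") and doc1.get("doi") == doc2.get("doi"):
        return True
    if title_cooccurrence(doc1["title"], doc2["title"]) > 0.8:
        shared_authors = len(set(doc1["authors"]) & set(doc2["authors"]))
        kw1, kw2 = set(doc1["keywords"]), set(doc2["keywords"])
        kw_rate = len(kw1 & kw2) / min(len(kw1), len(kw2)) if kw1 and kw2 else 0.0
        return shared_authors > 1 and kw_rate > 0.5
    return False
```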
S12, word segmentation.

In the word segmentation stage, the paper keyword data are first extracted and added to the user dictionary of the segmentation tool; at the same time, TextRank is used to extract key phrases before the computation, and these key phrases are also added to the user dictionary.

In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and the irrelevant words are added to the segmentation tool's stop-word list.
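A minimal sketch of this segmentation setup with jieba; the file names, toy strings, and TextRank parameters are assumptions for illustration:

```python
import jieba
import jieba.analyse

corpus_text = "基于主题模型的领域标签获取方法……"    # full corpus as one string (toy example)
document_text = "本文提出一种FLDA主题模型……"        # one document to segment (toy example)

# Paper keywords and TextRank key phrases go into the user dictionary so that
# multi-word domain terms survive segmentation intact.
jieba.load_userdict("paper_keywords.txt")             # assumed file of paper keywords
for phrase in jieba.analyse.textrank(corpus_text, topK=200, withWeight=False):
    jieba.add_word(phrase)

# Manually screened high-frequency irrelevant words serve as stop words.
with open("stopwords.txt", encoding="utf-8") as f:    # assumed stop-word list
    stopwords = set(f.read().split())
tokens = [w for w in jieba.cut(document_text) if w not in stopwords]
```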
According to the aforementioned topic-model-based domain label acquisition method, specifically, in S2, keyword extraction, "topic-phrase" extraction is performed with FLDA.

Among prior-art topic-based keyword extraction models, the mainstream one is the LDA topic model. It holds that a batch of documents contains multiple topics, and each topic can in turn be approximately represented by a series of phrases. A document is formed by selecting a topic with a certain probability and then selecting a phrase under the current topic with a certain probability, repeating this process until a document is complete; LDA topic extraction is the inverse of this generative process. The LDA topic model is widely applied in the news domain, but in the domain of scientific literature, topic modeling is affected by the peculiar word-frequency distribution of scientific documents.

Statistical analysis of scholars' academic documents shows that their frequency information follows a power-law distribution. Fig. 2 shows the top 2000 high-frequency words in the academic documents, where the abscissa is the rank of each high-frequency word sorted in descending order of frequency and the ordinate is its frequency.

Statistics show that the top 10% of words by frequency account for 81.1% of the entire word set of the academic documents, conforming to the Zipf distribution. Research on word frequency also finds that the words that best represent a topic are usually neither the extremely high-frequency nor the extremely low-frequency words, but the mid-to-high-frequency words. Applying the LDA model directly to the documents would therefore lose some mid-frequency words; at the same time, high-frequency feature words tend to appear in pairs and have a higher probability of being assigned to topics, which lowers the distinguishability of the topics. Although stop words are filtered in the S1 data preprocessing stage, the filtering cannot be complete.

Therefore, the present invention proposes a word-frequency-weighted LDA topic model: the word-frequency information in the documents is first counted, and the word-frequency features are introduced into the Gibbs sampling process to reduce the influence of high-frequency words and raise the influence of mid-frequency feature words, constructing the FLDA model so that it does not overemphasize high-frequency feature words. The FLDA model is described as follows.
The LDA model obtains the sampling parameters $\phi$ and $\theta$ through Gibbs sampling; the purpose of obtaining $\phi$ and $\theta$ is to construct a convergent Markov chain from which suitable samples can then be drawn.

LDA's assignment of topics to phrases is the sampling of $z_i$, whose posterior satisfies:

$$P(z_i = j \mid z_{-i}, w) \propto P(w_i \mid z_i = j, z_{-i}, w_{-i})\, P(z_i = j \mid z_{-i})$$

where $z_i = j$ assigns topic $j$ to the current word $w_i$, $z_{-i}$ denotes the (weighted) assignments of all words other than the current one, and $w_{-i}$ denotes the words at the non-current positions.

It is known that $P(w \mid z)$ is related only to $\phi$, where $\phi$ is the Gibbs sampling parameter and $\phi_j$ is the Gibbs sampling parameter corresponding to the current topic $j$; integrating over $\phi$ gives

$$P(w \mid z) = \int P(w \mid z, \phi)\, p(\phi)\, d\phi .$$

$\phi$ is the "topic-phrase" multinomial distribution, following

$$P(w \mid z, \phi) = \prod_i \phi_{z_i}^{(w_i)} .$$

In addition, $\mathrm{Dirichlet}(\beta)$ is the prior distribution of $\phi$, so integrating the posterior over $\phi$ yields

$$P(w \mid z) = \left(\frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}}\right)^{T} \prod_{j=1}^{T} \frac{\prod_{w}\Gamma\!\left(n_j^{(w)}+\beta\right)}{\Gamma\!\left(n_j^{(\cdot)}+V\beta\right)}$$

where $n_j^{(w)}$ is the weight sum of the words assigned to topic $j$ that are identical to word $w$, $n_j^{(\cdot)}$ is the weight sum of all words assigned to topic $j$, $\beta$ is the parameter of the Dirichlet distribution, and $V$ is the vocabulary size.

Likewise, $P(z)$ is related only to $\theta$, so integrating over $\theta$ gives

$$P(z) = \left(\frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\right)^{M} \prod_{d=1}^{M} \frac{\prod_{j=1}^{T}\Gamma\!\left(n_j^{(d)}+\alpha\right)}{\Gamma\!\left(n_{\cdot}^{(d)}+T\alpha\right)}$$

where $n_j^{(d)}$ is the weight sum of the words in document $d_i$ assigned to topic $j$ and $T$ is the number of topics.

Combining these with $P(z_i = j \mid z_{-i}, w) \propto P(w_i \mid z_i = j, z_{-i}, w_{-i})\, P(z_i = j \mid z_{-i})$ gives

$$P(z_i = j \mid z_{-i}, w) \propto \frac{n_{-i,j}^{(w_i)}+\beta}{n_{-i,j}^{(\cdot)}+V\beta}\cdot\frac{n_{-i,j}^{(d_i)}+\alpha}{n_{-i,\cdot}^{(d_i)}+T\alpha} .$$

The computation above yields LDA's unnormalized distribution; the sum of the probabilities over all "topic-phrase" assignments must still be divided out, as follows:

$$P(z_i = j \mid z_{-i}, w_i) = \frac{\dfrac{n_{-i,j}^{(w_i)}+\beta}{n_{-i,j}^{(\cdot)}+V\beta}\cdot\dfrac{n_{-i,j}^{(d_i)}+\alpha}{n_{-i,\cdot}^{(d_i)}+T\alpha}}{\displaystyle\sum_{j'=1}^{T}\frac{n_{-i,j'}^{(w_i)}+\beta}{n_{-i,j'}^{(\cdot)}+V\beta}\cdot\frac{n_{-i,j'}^{(d_i)}+\alpha}{n_{-i,\cdot}^{(d_i)}+T\alpha}}$$

where $w_i$ is the $i$-th word, $z_i = j$ assigns the current topic $j$ to the current word $w_i$, $z_{-i}$ is the weight sum of the words assigned other than $z_i$, $n_{-i,j}^{(w_i)}$ is the weight sum of the words with topic $j$ that are identical to word $w_i$, $n_{-i,j}^{(d_i)}$ is the weight sum of the words in document $d_i$ with topic $j$, $n_{-i,\cdot}^{(d_i)}$ is the weight sum of the topic-assigned words in the current document, $V$ is the vocabulary size, $T$ is the number of topics, and $P(z_i = j \mid z_{-i}, w_i)$ is the recomputed posterior probability.
The word-frequency weighting formula of the model is

$$C_i = 2 - \frac{\left|\,n_i - n_{mid}\,\right|}{n_{max} - n_{min}}$$

where $n_i$ is the frequency of the current word, $n_{mid}$ is the frequency of the selected mid-frequency word, $n_{max}$ is the maximum and $n_{min}$ the minimum of the word-frequency statistics, and $C_i$, the weight of the current word, takes values in $[1, 2]$. To ensure that the total number of feature words is unchanged after weighting, the weight of each feature word is adjusted:

$$F_i = C_i\, n_i \cdot \frac{\sum_k n_k}{\sum_k C_k\, n_k}$$

where $F_i$ is the adjusted weight of the feature word, $n_i$ is the number of occurrences of the current word, and $\sum_k C_k n_k$ is the weight sum over all words. Referring to Fig. 3, because the probability with which word $w$ is assigned to topic $z$ is random when Gibbs sampling is initialized, the computed $F_i$ replaces the random value initialized in the Gibbs sampling process, and on this basis the computation is iterated until convergence, yielding the parameters $\phi$ and $\theta$.
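The following Python sketch illustrates the scheme: frequency-adjusted weights are computed from the formulas above, and every token contributes its weight $F_i$ (rather than 1) to the Gibbs counts. The function name, the default choice of $n_{mid}$ as the median frequency, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def flda_gibbs(docs, T=20, alpha=0.1, beta=0.01, iters=200, n_mid=None):
    """Collapsed Gibbs sampling for LDA in which every token of word w adds its
    frequency-adjusted weight F[w] to the counts instead of 1 (FLDA sketch)."""
    vocab = sorted({w for d in docs for w in d})
    w2id = {w: i for i, w in enumerate(vocab)}
    V, D = len(vocab), len(docs)
    freq = Counter(w for d in docs for w in d)
    n_max, n_min = max(freq.values()), min(freq.values())
    span = max(n_max - n_min, 1)
    n_mid = n_mid if n_mid is not None else int(np.median(list(freq.values())))
    C = {w: 2 - abs(n - n_mid) / span for w, n in freq.items()}   # C_i in [1, 2]
    total = sum(freq.values())
    norm = sum(C[u] * freq[u] for u in freq)
    F = {w: C[w] * freq[w] * total / norm for w in freq}          # adjusted weights

    nwz = np.zeros((V, T)); nz = np.zeros(T)                      # topic-word counts
    ndz = np.zeros((D, T)); nd = np.zeros(D)                      # document-topic counts
    z = [[0] * len(d) for d in docs]
    rng = np.random.default_rng(0)
    for d, doc in enumerate(docs):                                # weighted initialization
        for i, w in enumerate(doc):
            t = int(rng.integers(T)); z[d][i] = t; f = F[w]
            nwz[w2id[w], t] += f; nz[t] += f; ndz[d, t] += f; nd[d] += f
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, f, wid = z[d][i], F[w], w2id[w]
                nwz[wid, t] -= f; nz[t] -= f; ndz[d, t] -= f; nd[d] -= f
                p = ((nwz[wid] + beta) / (nz + V * beta)
                     * (ndz[d] + alpha) / (nd[d] + T * alpha))    # weighted Gibbs update
                t = int(rng.choice(T, p=p / p.sum())); z[d][i] = t
                nwz[wid, t] += f; nz[t] += f; ndz[d, t] += f; nd[d] += f
    phi = (nwz + beta) / (nz + V * beta)                          # "topic-phrase" matrix
    theta = (ndz + alpha) / (nd[:, None] + T * alpha)             # document-topic matrix
    return phi, theta
```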
Word vector representation with word2vec

Word2vec is trained by deep learning on million-entry dictionaries and corpora of hundreds of millions of tokens; the training result is a word vector model, and the word vectors effectively express the semantic information of words in space. The training model for the vectors is the shallow neural network CBOW or Skip-gram model; the CBOW model is shown in Fig. 4.

The CBOW model predicts the current word from its context. During training, an N-dimensional word vector is first initialized for every word; the model sums the context words within the input window and builds a Huffman tree from the word frequencies to obtain Huffman paths. The probability of each leaf node is computed along its path, and gradient descent then adjusts the parameters of the non-leaf nodes and the context word vectors; after many iterations the result converges to the true result.
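A minimal sketch of CBOW training with gensim; the toy corpus and parameter values are assumptions (the experiments below actually use the word-vector model published by Tencent AI Lab):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this is the segmented academic corpus.
sentences = [["主题", "模型", "领域", "标签"],
             ["学术", "文档", "主题", "抽取"]]

model = Word2Vec(sentences, vector_size=200, window=5, min_count=1,
                 sg=0,   # sg=0 selects CBOW (predict the current word from context)
                 hs=1)   # hs=1 uses the Huffman-tree hierarchical softmax described above

vec = model.wv["主题"]   # 200-dimensional word vector
```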
Weight assignment

An academic paper is usually divided into title, abstract, keywords, body, and other information. In past experience the title usually carries the central idea of the full text and is an important summary of its content, so the present invention increases the final weight of the words in the title. The keyword section is also somewhat representative of the gist of the full text, while the abstract is taken as a brief summary of the full content.

Preferably, the weights are assigned in the present invention as follows: the weight of the title is set to 4, the weight of the keywords to 3, and the weight of the abstract to 2.
Selection of the FLDA model parameters

In the present invention, 20 is selected as the optimal number of topics.
S3, domain system mapping

Because the phrases obtained by the topic model differ from scholar to scholar, the scholars cannot be managed uniformly, so an academic domain system is introduced to measure them uniformly.

The domain system is formulated with reference to the field system of the National Natural Science Foundation of China and covers the research scope of each field to the greatest possible extent.

The present invention maps the results of the topic model onto the domain system with the following mapping formula:

$$F(A, B) = sim(A, B) \cdot C_A \cdot L_A$$

where $A$ is a phrase obtained by the topic model and $B$ is a system word; the corresponding word vectors are obtained from the vector model, and for out-of-vocabulary words the character vectors are concatenated into a word vector. $sim(A, B)$ is the computed cosine similarity, $C_A$ is the probability assigned by the topic model, $L_A$ is the position coefficient of the phrase in the document, taking values in $\{2, 3, 4\}$, and $F(A, B)$ is the weighted similarity. $C_B$ is the final score of the system word.
S4, comprehensive ranking

The final scores $C_B$ are obtained through the mapping formula; all system words corresponding to the current scholar are sorted by $C_B$ from high to low, and the four highest-scoring system words are taken as the domain label words that best represent the scholar's research field.
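A sketch of the mapping and ranking steps, assuming a loaded vector model. How the per-phrase values $F(A, B)$ are combined into the system-word score $C_B$ is not spelled out above; this sketch sums them, which is one natural choice, and the helper names are assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_system_labels(phrases, system_words, vec, top_n=4):
    """phrases: list of (phrase, C_A topic probability, L_A position coefficient
    in {2, 3, 4}); vec(word) returns its word vector (out-of-vocabulary words
    built from character vectors). Returns the top_n system words ranked by
    C_B = sum over phrases of F(A, B) = sim(A, B) * C_A * L_A."""
    scores = {b: sum(cosine(vec(a), vec(b)) * c_a * l_a
                     for a, c_a, l_a in phrases)
              for b in system_words}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```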
With the above technical solution, the present invention obtains the following technical effects.

Compared with the traditional LDA model, the statistics-based TFIDF algorithm, and the network-graph-based TextRank algorithm, the FLDA model of the present invention finally obtains better label words with higher accuracy, which shows that the topic-model-based label extraction method is well suited to the academic domain.
Brief Description of the Drawings

Fig. 1 is a schematic framework diagram of the topic-model-based domain label acquisition method of the present invention;

Fig. 2 is the document word-frequency distribution diagram;

Fig. 3 is the Gibbs sampling flowchart;

Fig. 4 is a schematic diagram of the CBOW model;

Fig. 5 is the diagram of perplexity versus number of topics.
Detailed Description of the Embodiments

To make the purpose, technical solution, and beneficial effects of the present invention clearer, the technical solution of the present invention is described below clearly and completely in conjunction with the embodiments and the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

Embodiment:
A topic-model-based domain label acquisition method comprises the following steps:

S1, data preprocessing

The initial data set is acquired; specifically, the following method is adopted:

S11, data deduplication

This embodiment builds the cleaning model from character co-occurrence, author overlap, and keyword coincidence rate:

for two texts to be compared, first check whether their DOIs are identical; texts with identical DOIs are filtered directly;

if the DOIs differ or are absent, first judge the titles by character co-occurrence; if the co-occurrence exceeds 80%, further check the author co-occurrence count and the keyword co-occurrence rate, and if the author co-occurrence count is greater than 1 and the keyword co-occurrence rate is greater than 0.5, the pair is judged a duplicate and removed.

The co-occurrence formula is:

$$co(A, B) = \frac{len(A \cap B)}{\min\{len(A),\, len(B)\}}$$

where A and B are the character sets of the two titles, len(A) and len(B) are the lengths of the character sets of titles A and B, len(A ∩ B) is the length of the intersection of the two title character sets, and min{len(A), len(B)} is the smaller of the two lengths.

S12, word segmentation

In the word segmentation stage, the paper keyword data are first extracted and added to the user dictionary of the segmentation tool; at the same time, TextRank is used to extract key phrases before the computation, and these key phrases are also added to the user dictionary.

In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and the irrelevant words are added to the segmentation tool's stop-word list.
S2, keyword extraction

"Topic-phrase" extraction is performed with FLDA, each phrase is assigned a weight according to its position in the text, and word2vec represents it as a vector.

"Topic-phrase" extraction with FLDA:

The LDA model obtains the sampling parameters $\phi$ and $\theta$ through Gibbs sampling; the purpose of obtaining $\phi$ and $\theta$ is to construct a convergent Markov chain from which suitable samples can then be drawn.

LDA's assignment of topics to phrases is the sampling of $z_i$, whose posterior satisfies:

$$P(z_i = j \mid z_{-i}, w) \propto P(w_i \mid z_i = j, z_{-i}, w_{-i})\, P(z_i = j \mid z_{-i})$$

where $z_i = j$ assigns topic $j$ to the current word $w_i$, $z_{-i}$ denotes the (weighted) assignments of all words other than the current one, and $w_{-i}$ denotes the words at the non-current positions.

It is known that $P(w \mid z)$ is related only to $\phi$, where $\phi$ is the Gibbs sampling parameter and $\phi_j$ is the Gibbs sampling parameter corresponding to the current topic $j$; integrating over $\phi$ gives

$$P(w \mid z) = \int P(w \mid z, \phi)\, p(\phi)\, d\phi .$$

$\phi$ is the "topic-phrase" multinomial distribution, following

$$P(w \mid z, \phi) = \prod_i \phi_{z_i}^{(w_i)} .$$

In addition, $\mathrm{Dirichlet}(\beta)$ is the prior distribution of $\phi$, so integrating the posterior over $\phi$ yields

$$P(w \mid z) = \left(\frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}}\right)^{T} \prod_{j=1}^{T} \frac{\prod_{w}\Gamma\!\left(n_j^{(w)}+\beta\right)}{\Gamma\!\left(n_j^{(\cdot)}+V\beta\right)}$$

where $n_j^{(w)}$ is the weight sum of the words assigned to topic $j$ that are identical to word $w$, $n_j^{(\cdot)}$ is the weight sum of all words assigned to topic $j$, $\beta$ is the parameter of the Dirichlet distribution, and $V$ is the vocabulary size.

Likewise, $P(z)$ is related only to $\theta$, so integrating over $\theta$ gives

$$P(z) = \left(\frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\right)^{M} \prod_{d=1}^{M} \frac{\prod_{j=1}^{T}\Gamma\!\left(n_j^{(d)}+\alpha\right)}{\Gamma\!\left(n_{\cdot}^{(d)}+T\alpha\right)}$$

where $n_j^{(d)}$ is the weight sum of the words in document $d_i$ assigned to topic $j$ and $T$ is the number of topics.

Combining these with $P(z_i = j \mid z_{-i}, w) \propto P(w_i \mid z_i = j, z_{-i}, w_{-i})\, P(z_i = j \mid z_{-i})$ gives

$$P(z_i = j \mid z_{-i}, w) \propto \frac{n_{-i,j}^{(w_i)}+\beta}{n_{-i,j}^{(\cdot)}+V\beta}\cdot\frac{n_{-i,j}^{(d_i)}+\alpha}{n_{-i,\cdot}^{(d_i)}+T\alpha} .$$

The computation above yields LDA's unnormalized distribution; the sum of the probabilities over all "topic-phrase" assignments must still be divided out, as follows:

$$P(z_i = j \mid z_{-i}, w_i) = \frac{\dfrac{n_{-i,j}^{(w_i)}+\beta}{n_{-i,j}^{(\cdot)}+V\beta}\cdot\dfrac{n_{-i,j}^{(d_i)}+\alpha}{n_{-i,\cdot}^{(d_i)}+T\alpha}}{\displaystyle\sum_{j'=1}^{T}\frac{n_{-i,j'}^{(w_i)}+\beta}{n_{-i,j'}^{(\cdot)}+V\beta}\cdot\frac{n_{-i,j'}^{(d_i)}+\alpha}{n_{-i,\cdot}^{(d_i)}+T\alpha}}$$

where $w_i$ is the $i$-th word, $z_i = j$ assigns the current topic $j$ to the current word $w_i$, $z_{-i}$ is the weight sum of the words assigned other than $z_i$, $n_{-i,j}^{(w_i)}$ is the weight sum of the words with topic $j$ that are identical to word $w_i$, $n_{-i,j}^{(d_i)}$ is the weight sum of the words in document $d_i$ with topic $j$, $n_{-i,\cdot}^{(d_i)}$ is the weight sum of the topic-assigned words in the current document, $V$ is the vocabulary size, $T$ is the number of topics, and $P(z_i = j \mid z_{-i}, w_i)$ is the recomputed posterior probability.

The word-frequency weighting formula of the model is

$$C_i = 2 - \frac{\left|\,n_i - n_{mid}\,\right|}{n_{max} - n_{min}}$$

where $n_i$ is the frequency of the current word, $n_{mid}$ is the frequency of the selected mid-frequency word, $n_{max}$ is the maximum and $n_{min}$ the minimum of the word-frequency statistics, and $C_i$, the weight of the current word, takes values in $[1, 2]$. To ensure that the total number of feature words is unchanged after weighting, the weight of each feature word is adjusted:

$$F_i = C_i\, n_i \cdot \frac{\sum_k n_k}{\sum_k C_k\, n_k}$$

where $F_i$ is the adjusted weight of the feature word, $n_i$ is the number of occurrences of the current word, and $\sum_k C_k n_k$ is the weight sum over all words. Referring to Fig. 3, because the probability with which word $w$ is assigned to topic $z$ is random when Gibbs sampling is initialized, the computed $F_i$ replaces the random value initialized in the Gibbs sampling process, and on this basis the computation is iterated until convergence, yielding the parameters $\phi$ and $\theta$.
Word vector representation with word2vec

Word2vec is trained by deep learning on million-entry dictionaries and corpora of hundreds of millions of tokens; the training result is a word vector model, and the word vectors effectively express the semantic information of words in space.

Specifically, the CBOW model is adopted for vector training. The CBOW model predicts the current word from its context. During training, an N-dimensional word vector is first initialized for every word; the model sums the context words within the input window and builds a Huffman tree from the word frequencies to obtain Huffman paths. The probability of each leaf node is computed along its path, and gradient descent then adjusts the parameters of the non-leaf nodes and the context word vectors; after many iterations the result converges to the true result.

Weight assignment

In this embodiment, the weight of the title is set to 4, the weight of the keywords to 3, and the weight of the abstract to 2.

As a preferred solution, 20 is selected as the optimal number of topics in this embodiment.
S3, domain system mapping

The "topic-phrases" are mapped onto the label system to enable unified management of scholars' fields.

This embodiment maps the results of the topic model onto the domain system with the following mapping formula:

$$F(A, B) = sim(A, B) \cdot C_A \cdot L_A$$

where $A$ is a phrase obtained by the topic model and $B$ is a system word; the corresponding word vectors are obtained from the vector model, and for out-of-vocabulary words the character vectors are concatenated into a word vector. $sim(A, B)$ is the computed cosine similarity, $C_A$ is the probability assigned by the topic model, $L_A$ is the position coefficient of the phrase in the document, taking values in $\{2, 3, 4\}$, and $F(A, B)$ is the weighted similarity. $C_B$ is the final score of the system word.

S4, comprehensive ranking

The vector-representation results are ranked together with the assigned weights to obtain the label words that best represent the scholar.

Specifically, the final scores $C_B$ are obtained through the mapping formula; all system words corresponding to the current scholar are sorted by score from high to low, and the four highest-scoring system words are taken as the domain label words that best represent the scholar's research field.
Experimental example:

To obtain experimental data as realistic as possible, the present invention uses web crawler technology to crawl paper data from the CNKI and Wanfang databases; the data are segmented with jieba, and the vector representation uses the word-vector model published by Tencent AI Lab. This experimental example is presented in four parts: the evaluation criteria, data preprocessing, the selection of the number of topics for the FLDA model, and the evaluation of the topic-model-based labeling algorithm.
Evaluation criteria:

Because the LDA topic model is unsupervised, there is no directly intuitive criterion for measuring its quality. This experimental example evaluates via the "topic-phrase" matrix of the topic model and introduces perplexity as the evaluation criterion; it is generally held that the lower the perplexity, the better the model. The perplexity is computed as

$$\mathrm{Perplexity}(D) = \exp\!\left(-\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d}\right),\qquad p(w) = \sum_{z} p(z \mid d)\, p(w \mid z)$$

where $\mathrm{Perplexity}(D)$ is the perplexity of the current model, $d$ indexes the academic document texts, $M$ is the number of academic documents, $\sum_{d} N_d$ is the total number of words in the current corpus, $p(w)$ is the probability that word $w$ appears in the matrix, $p(z \mid d)$ is the probability that academic document $d$ has topic $z$, and $p(w \mid z)$ is the probability that word $w$ appears in topic $z$. The perplexity measures how well the results predicted by the topic model fit the original sample information.
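A sketch of this perplexity computation from fitted document-topic ($\theta$) and topic-word ($\phi$) matrices; the array layout follows the Gibbs sketch above and is an assumption.

```python
import numpy as np

def perplexity(docs, theta, phi, w2id):
    """Perplexity(D) = exp(-sum_d log p(w_d) / sum_d N_d), where for each token
    p(w) = sum_z p(z|d) p(w|z) = theta[d] . phi[w]."""
    log_likelihood, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p_w = float(theta[d] @ phi[w2id[w]])   # sum over topics z
            log_likelihood += np.log(p_w)
            n_tokens += 1
    return float(np.exp(-log_likelihood / n_tokens))
```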
When computing label accuracy, the F1 value is used to measure the accuracy of the scholars' labels. Several scholars are picked at random, and with manual reference to the domain labels and knowledge of each scholar, the 4 most suitable labels are chosen as the correct labels. The top four labels produced by each algorithm's ranking are evaluated against the correct labels with the indicators below, and the average F1 value is finally computed:

$$P_i = \frac{h_i \cap m_i}{m_i},\qquad R_i = \frac{h_i \cap m_i}{h_i},\qquad F1 = \frac{1}{N}\sum_{i=1}^{N}\frac{2\,P_i R_i}{P_i + R_i}$$

where $h_i$ is the number of standard labels, $m_i$ is the number of labels produced by the algorithm, $h_i \cap m_i$ is the number of correct labels produced by the algorithm, and $N$ is the total number of samples.
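A sketch of the averaged-F1 computation over the sampled scholars; the function and argument names are assumptions.

```python
def average_f1(standard_labels, predicted_labels):
    """standard_labels[i] and predicted_labels[i] are the label lists for
    scholar i; returns the mean F1 over all scholars."""
    f1_sum = 0.0
    for h, m in zip(standard_labels, predicted_labels):
        hit = len(set(h) & set(m))            # |h_i ∩ m_i|
        p = hit / len(m) if m else 0.0        # precision
        r = hit / len(h) if h else 0.0        # recall
        f1_sum += 2 * p * r / (p + r) if p + r else 0.0
    return f1_sum / len(standard_labels)
```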
Data preprocessing

To eliminate the influence on the computation of duplicate records caused by crawling multiple data sources, the data must be deduplicated in the preprocessing stage. The cleaning model is built from character co-occurrence, author overlap, and keyword coincidence rate: for two texts to be compared, first check whether their DOIs are identical, and filter directly when they are; if the DOIs differ or are absent, judge the titles by character co-occurrence, and if the co-occurrence exceeds 80%, further check the author co-occurrence count and the keyword co-occurrence rate; if the author co-occurrence count is greater than 1 and the keyword co-occurrence rate is greater than 0.5, the pair is judged a duplicate and removed.

In the word segmentation stage, the keyword data of the massive paper collection are first extracted and added to the user dictionary of the segmentation tool; TextRank is used to extract key phrases before the computation, and these are also added to the user dictionary. In addition, the whole corpus is sorted by word frequency, high-frequency irrelevant words are screened manually, and the irrelevant words are added to the segmentation tool's stop-word list.
Selection of the FLDA model parameters

The number of topics is an important factor affecting topic clustering: if it is set too small, the clustering results lack distinguishability; if it is set too large, the current document is wrongly split across other topics. This section therefore computes the perplexity experimentally for different numbers of topics and determines the final number of topics from the perplexity. Only the number of topics is varied while the remaining parameters are held fixed; the experimental results are shown in Fig. 5.

The LDA curve is the perplexity curve of the LDA topic model, and the FLDA curve is the perplexity curve of the word-frequency-weighted LDA topic model. In the figure, the abscissa is the number of topics and the ordinate is the perplexity. Each experiment was repeated three times with the same parameter set, and the results were averaged.

The experimental results show that, as the number of topics increases, the perplexity of the models trends downward, and the decline slows or even converges at around 20 topics; therefore 20 is selected as the optimal number of topics.
Evaluation of the labeling algorithms

A domain system is introduced to measure scholars' academic fields uniformly. The domain system is formulated with reference to the field system of the National Natural Science Foundation of China, with appropriate modifications on that basis; an example of the domain system is shown in Table 1.

Table 1 Example of the domain system

Because authoritative data sets for academic domain label extraction are lacking, this experiment uses the academic-paper data of 12 scholars to verify the effectiveness of the algorithm; the scholars' academic labels are obtained with the TFIDF, TextRank, LDA, and FLDA algorithms respectively and compared. Specific examples are shown in Table 2.

When evaluating the algorithms, appropriate words from the system are selected manually, based on each scholar's homepage introduction, admission brochure, and similar information, as the standard answer for the scholar's domain labels; the label words obtained by an algorithm are called the current answer, and the current answer is evaluated against the standard answer.
Table 2 Comparison of the label results extracted by the algorithms

Table 3 Comparison of the algorithms' F1 values
As Table 2 shows, the labels obtained by the FLDA algorithm coincide closely with the standard answers, and the algorithm outperforms the LDA and TFIDF algorithms.

The data in Table 3 are F1 values computed with the precision, recall, and F1 formulas above, where 4-2 means that the algorithm's 4 highest-scoring labels are evaluated against two standard answers, 4-3 against three standard answers, and 4-4 against four standard answers. The analysis shows that, across these prediction settings, the F1 value of FLDA is higher than those of the traditional LDA algorithm, the statistics-based TFIDF algorithm, and the network-graph-based TextRank algorithm. This shows that the FLDA model with multi-word-frequency feature weighting can not only analyze the content of documents and the connections between them at the discourse level, but also reduce the dimensionality of academic data, making it better suited to processing academic documents at scale and facilitating the subsequent label mapping and computation. To a certain extent the model reflects scholars' research directions, letting users gain a comprehensive understanding of a scholar conveniently and saving time and effort. It also indirectly shows that the FLDA algorithm weighted by multi-word-frequency features extracts the key information in academic texts better than the traditional algorithms.
The technical solution provided by the present invention is not limited by the above embodiments; all technical solutions formed by transformation and substitution using the structures and methods of the present invention fall within the protection scope of the present invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910784200.3A CN110543564B (en) | 2019-08-23 | 2019-08-23 | Domain Label Acquisition Method Based on Topic Model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910784200.3A CN110543564B (en) | 2019-08-23 | 2019-08-23 | Domain Label Acquisition Method Based on Topic Model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110543564A CN110543564A (en) | 2019-12-06 |
CN110543564B true CN110543564B (en) | 2023-06-20 |
Family
ID=68712039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910784200.3A Active CN110543564B (en) | 2019-08-23 | 2019-08-23 | Domain Label Acquisition Method Based on Topic Model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110543564B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111241283B (en) * | 2020-01-15 | 2023-04-07 | 电子科技大学 | Rapid characterization method for portrait of scientific research student |
CN111831804B (en) * | 2020-06-29 | 2024-04-26 | 深圳价值在线信息科技股份有限公司 | Method and device for extracting key phrase, terminal equipment and storage medium |
CN112508376A (en) * | 2020-11-30 | 2021-03-16 | 中国科学院深圳先进技术研究院 | Index system construction method |
CN112446204B (en) * | 2020-12-07 | 2024-08-02 | 北京明略软件系统有限公司 | Method, system and computer equipment for determining document label |
CN112883148B (en) * | 2021-01-15 | 2023-03-28 | 博观创新(上海)大数据科技有限公司 | Subject talent evaluation control method and device based on research trend matching |
CN113190672A (en) * | 2021-05-12 | 2021-07-30 | 上海热血网络科技有限公司 | Advertisement judgment model, advertisement filtering method and system |
CN113298399B (en) * | 2021-05-31 | 2023-04-07 | 西南大学 | Scientific research project analysis method based on big data |
CN114492425B (en) * | 2021-12-30 | 2023-04-07 | 中科大数据研究院 | Method for communicating multi-dimensional data by adopting one set of field label system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9449096B2 (en) * | 2014-01-07 | 2016-09-20 | International Business Machines Corporation | Identifying influencers for topics in social media |
- 2019-08-23 CN CN201910784200.3A patent/CN110543564B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740342A (en) * | 2016-01-22 | 2016-07-06 | 天津中科智能识别产业技术研究院有限公司 | Social relation topic model based social network friend recommendation method |
CN109766544A (en) * | 2018-12-24 | 2019-05-17 | 中国科学院合肥物质科学研究院 | Document keyword extraction method and device based on LDA and word vector |
Non-Patent Citations (3)
- Ximing Li et al., "Supervised topic models for multi-label classification," Neurocomputing, 2015.
- 王胜 et al., "基于SL-LDA的领域标签获取方法" (Domain label acquisition method based on SL-LDA), 《计算机科学》 (Computer Science), Nov. 2020.
- 李熙铭, "基于主题模型的多标签文本分类和流文本数据建模若干问题研究" (Research on multi-label text classification and stream-text data modeling based on topic models), 中国优秀博士学位论文库 (China Doctoral Dissertations Database), Aug. 2015.
Also Published As
Publication number | Publication date |
---|---|
CN110543564A (en) | 2019-12-06 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |