WO2021128158A1 - Method for disambiguating between authors with same name on basis of network representation and semantic representation - Google Patents
- Publication number
- WO2021128158A1 (application PCT/CN2019/128642)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- paper
- papers
- similarity
- semantic
- thesis
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- the present invention mainly relates to the field of entity disambiguation, heterogeneous network embedding technology, and word vector embedding technology, in particular to a disambiguation technology for authors of the same name in a paper based on network representation and semantic representation.
- Disambiguation of authors of the same name in a paper refers to the use of information about the paper, such as title, author, author organization, abstract, keywords, etc., to assign the paper to the correct author file through some methods.
- the purpose of the present invention is to provide a method for disambiguating authors of the same name in a paper based on the network representation and semantic representation of the paper.
- The method uses information about each paper, including its title, abstract, authors, institutions, and journal, to mine both the relationships between papers and the semantic information in their text. From these it derives paper representation vectors and a paper similarity matrix, then clusters on the similarity matrix so that the papers of different authors fall into different clusters, thereby disambiguating authors with the same name.
- The method also applies similarity threshold matching to the outlier papers that arise in this process, which raises the accuracy of the disambiguation.
- the present invention specifically includes the following steps:
- Step 1: Analyze the features of the papers in the paper database and divide them into semantic features and discrete features.
- Step 2: Based on the discrete features from Step 1, construct a heterogeneous network of paper-to-paper relationships, generate a set of paths of paper ids by meta-path-based random walks, and train relationship representation vectors for the papers with the word2vec model to obtain the relationship similarity matrix of the papers.
- Step 3: Based on the semantic features from Step 1, train word vectors with word2vec and derive the semantic representation vector of each paper, thereby obtaining the semantic similarity matrix of the papers.
- Step 4: Based on the similarity matrices generated in Steps 2 and 3, cluster with the DBSCAN algorithm; each resulting cluster represents the set of papers belonging to one real author.
- Step 5: Apply similarity threshold matching to the outlier paper set produced in Steps 2, 3, and 4, and assign each paper in that set to the correct cluster.
- A method for disambiguating authors with the same name based on network representation and semantic representation, comprising the steps excerpted below.
- The target paper library is the paper library obtained for the author to be disambiguated.
- Each cluster after clustering represents the set of papers belonging to one author; papers that do not belong to any cluster are added to the outlier paper set.
- 16) If s(p_i, p_j) computed in step 15) is greater than the set threshold α, paper p_i is assigned to the cluster containing paper p_j; otherwise paper p_i is assigned to a new cluster of its own.
- The similarity between the papers in the outlier paper set is calculated pair by pair, and if it is greater than the set threshold, the clusters containing the two papers are merged.
- The heterogeneous network is constructed as follows: each paper in the target paper library is taken as a node, and a number of relationships are set; if one of the set relationships holds between two papers, an edge is constructed between the corresponding nodes and its weight is set, yielding the heterogeneous network.
- The set relationships include having a common author and having a common institution.
- The path set is generated by a meta-path-based random walk strategy.
- The discrete features include author and institution; the semantic features include title, journal, institution, publication year, and keywords.
- The model is a word2vec model.
- a computer-readable storage medium is characterized by storing a computer program, and the computer program includes instructions for executing each step in the above-mentioned method.
- The present invention uses both the relationship features between papers and the semantic features of the papers to obtain the representation vectors of the papers, and then clusters the papers to achieve disambiguation.
- The present invention also accounts for papers whose features are not distinctive enough and whose similarity to other papers is therefore low, and proposes a similarity threshold matching method to further process these outlier papers, thereby improving the accuracy of disambiguation.
- Figure 1 is a model architecture diagram of the present invention
- Figure 2 is a schematic diagram of a heterogeneous network
- Figure 3 is a schematic diagram of random walk path generation based on meta-path.
- The present invention aims to resolve the ambiguity of same-name authors in papers. It uses the main information of each paper, such as title, abstract, authors, journal, author institutions, publication year, and keywords, learns relationship representations and semantic representations of the papers, and clusters them. Outlier papers produced along the way are handled by similarity threshold matching. The final result is a partition in which the papers of one real author fall into one cluster and the papers of different authors fall into different clusters.
- Figure 1 is a diagram of the model architecture of the present invention.
- Step 1 Perform feature analysis on the relevant information of the papers in the paper database, and divide these features into semantic features and discrete features.
- semantic features refer to features with text information, such as titles, abstracts, and keywords. These features can be transformed into text vectors using semantic representation learning models, such as word2vec. Discrete feature means that the feature itself has no great value, but it can be used to express the relationship between papers, such as authors, institutions, etc. Some of these features can be treated as discrete features or semantic features.
- the present invention defines author and institution as discrete features; defines title, journal, institution, publication year, and keywords as semantic features.
- Step 2: Based on the discrete features from Step 1, construct a heterogeneous network of paper-to-paper relationships, generate a set of paths of paper ids with a meta-path-based random walk strategy, and train the relationship representation vectors of the papers with the word2vec model (specifically the word2vec model in the Python gensim library) to obtain the relationship similarity matrix of the papers.
- This part mainly extracts the relationship information of the paper from the discrete features of the paper through the method of network embedding, and realizes the representation learning of the relationship between the papers.
- the network mainly contains one type of nodes: papers, and two types of edges: CoAuthor and CoOrg.
- CoAuthor indicates that two papers have co-authors in common (not counting the name to be disambiguated), and the edge weight is the number of common co-authors. If two papers share co-authors, an edge with the corresponding weight is built; if they share none, no edge is built.
- CoOrg represents the similarity between the institutions of the name to be disambiguated in the two papers.
- The institution of the author to be disambiguated in each paper is treated as the set of its words after stop words are removed.
- The institution similarity is the size of the intersection of the two word sets. If the author institutions of two papers have co-occurring words, an edge is built whose weight is the number of co-occurring words. If the intersection is empty, that is, the two institutions share no words, no edge is built.
- Random walks follow a meta-path of the form p1→CoAuthor→p2→CoOrg→p3 and generate a path set composed of paper ids.
- Specifically, each paper node in the heterogeneous network is selected in turn as the initial node, and a random walk is performed along this meta-path.
- At each step, the walk considers the edge type that the meta-path specifies for the current position and, according to the edge weights, probabilistically selects a node connected by an edge of that type as the next node, which is then saved to the path. The transition probability of the random walk is proportional to the edge weight.
- the paper id path set can be obtained through the above random walk process, and the path set is used as a training corpus, and the skip-gram model in word2vec is used for training, so as to obtain the relationship representation vector of the paper.
- word2vec learns from text to represent the semantic information of words as word vectors, that is, it builds an embedding space in which semantically similar words lie close together.
- Through this word vector embedding technique, papers with similar relationships also lie close together in the embedding space.
- the cosine similarity calculation method can be used to obtain the relationship similarity matrix of the paper.
- the present invention uses the idea of bagging to repeat the above process several times to obtain multiple paper relationship similarity matrices, and add and average them to obtain a final paper relationship similarity matrix.
- Step 3 Based on the semantic features of step 1, use word2vec to train the word vector, and obtain the semantic representation vector of the paper, thereby obtaining the semantic similarity matrix of the paper.
- For each paper, the semantic features are passed through the pre-trained word vector model described above to obtain the semantic representation vector of the paper. These semantic features include the title, journal, institution, publication year, and keywords of the paper. After data cleaning, lowercasing, word segmentation, and stop-word removal, the text information corresponding to each paper is obtained. Using the pre-trained word vectors, a text vector is computed for each paper's text information by averaging its word vectors. These text vectors constitute the semantic representation vectors of the papers.
- the cosine similarity calculation method is also used to obtain the semantic similarity matrix of the paper.
- Step 4 Based on the similarity matrix generated in Step 2 and Step 3, clustering is performed using the DBSCAN algorithm, and the clusters after clustering represent the collection of papers contained by real authors.
- the two similarity matrices are weighted and summed to obtain the final paper similarity matrix.
- the weight of the semantic similarity matrix is 0.5.
- Step 5: Apply similarity threshold matching to the outlier paper set produced in Steps 2, 3, and 4, and assign each paper in that set to the correct cluster.
- The initial s(p_i, p_j) is 0;
- tanimoto(p, q) refers to the Tanimoto similarity of two string sets, where p and q are the corresponding strings.
- For each paper in the outlier paper set, first compare it with the papers that have already been clustered. If its similarity to the most similar clustered paper is greater than the threshold α, assign it to the cluster containing that paper; otherwise assign it to a new cluster of its own. Second, compare each paper in the outlier paper set with the other papers in that set. If the similarity between two of them is greater than the threshold α, merge the clusters containing them.
- The threshold α is defined as 1.5.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
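A method for disambiguating between authors with the same name on the basis of network representation and semantic representation. The method comprises: 1) extracting the semantic features and discrete features of each paper in a target paper library; 2) calculating the similarity between papers based on the discrete features to obtain a relationship similarity matrix of the papers, and adding any paper that shares no author or institution with the other papers to an outlier paper set; 3) calculating a semantic similarity matrix of the papers based on the semantic features of each paper, and adding the papers in the target paper library that contain no semantic features to the outlier paper set; 4) computing a weighted sum of the relationship similarity matrix and the semantic similarity matrix to obtain a paper similarity matrix and clustering on it, and adding the papers that do not belong to any cluster to the outlier paper set; 5) assigning the papers in the outlier paper set to the corresponding clusters using a method based on similarity threshold matching. The method achieves high-accuracy disambiguation of same-name authors of papers.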
Description
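The present invention mainly relates to the technical fields of entity disambiguation, heterogeneous network embedding, and word vector embedding, and in particular to a technique for disambiguating same-name authors of papers based on network representation and semantic representation.
In many fields, such as literature management and social network analysis, name disambiguation has long been regarded as a meaningful but challenging problem. In the academic domain, the emergence of academic search systems such as Google Scholar and Aminer has greatly facilitated paper search and academic exchange. However, because the number of papers is huge and paper information is complex and diverse, many papers are assigned to the wrong author, and the ambiguity of same-name authors is an important but thorny part of this problem. Disambiguating same-name authors of papers means using information about a paper, such as its title, authors, author institutions, abstract, and keywords, to assign the paper to the correct author profile. Many researchers have proposed solutions. These mainly either perform rule-based matching on paper information, or use representation learning to represent paper information and then cluster the representations with methods such as hierarchical clustering or DBSCAN, so that similar papers gather into one cluster and dissimilar papers are separated into different clusters. Semantic representation learning is a technique that converts raw data into a form that machine learning can exploit effectively; with representation learning, the semantic information of papers can be mapped to a hidden layer and used for clustering. Following network representation learning methods such as DeepWalk and LINE, same-name author disambiguation methods based on network representation learning have been proposed. They build a paper network and map the features of papers into a new space in which similar papers lie close together and dissimilar papers lie far apart, and then cluster the papers to disambiguate same-name authors.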
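Summary of the Invention
The purpose of the present invention is to provide a method for disambiguating same-name authors of papers based on the network representation and semantic representation of the papers. The method uses information about each paper, including its title, abstract, authors, institutions, and journal, to mine both the relationships between papers and the semantic information in their text. From these it derives paper representation vectors and a paper similarity matrix, then clusters on the similarity matrix so that the papers of different authors fall into different clusters, thereby disambiguating authors with the same name. The method also applies similarity threshold matching to the outlier papers that arise in this process, which yields high-accuracy disambiguation of same-name authors.
The present invention specifically comprises the following steps:
Step 1: Analyze the features of the papers in the paper database and divide them into semantic features and discrete features.
Step 2: Based on the discrete features from Step 1, construct a heterogeneous network of paper-to-paper relationships, generate a set of paths of paper ids by meta-path-based random walks, and train relationship representation vectors for the papers with the word2vec model to obtain the relationship similarity matrix of the papers.
Step 3: Based on the semantic features from Step 1, train word vectors with word2vec and derive the semantic representation vector of each paper, thereby obtaining the semantic similarity matrix of the papers.
Step 4: Based on the similarity matrices generated in Steps 2 and 3, cluster with the DBSCAN algorithm; each resulting cluster represents the set of papers belonging to one real author.
Step 5: Apply similarity threshold matching to the outlier paper set produced in Steps 2, 3, and 4, and assign each paper in that set to the correct cluster.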
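The technical solution of the present invention is as follows:
A method for disambiguating authors with the same name based on network representation and semantic representation, comprising the steps of:
1) extracting the semantic features and discrete features of each paper in a target paper library, wherein the target paper library is the paper library obtained for the author to be disambiguated;
2) constructing a heterogeneous network of papers based on the discrete features of each paper, generating a path set from the heterogeneous network and using it as a training corpus to train a model, using the model to generate relationship representation vectors for the papers in the target paper library, and calculating the similarity between papers from the relationship representation vectors to obtain a relationship similarity matrix of the papers; for a paper a in the target paper library, if paper a shares no author or institution with the other papers, adding it to an outlier paper set;
3) generating a semantic representation vector for each paper based on its semantic features, and calculating the similarity between papers from the semantic representation vectors to obtain a semantic similarity matrix of the papers; adding the papers in the target paper library that contain no semantic features to the outlier paper set;
4) computing a weighted sum of the relationship similarity matrix and the semantic similarity matrix to obtain a paper similarity matrix and clustering on it, each resulting cluster representing the set of papers belonging to one author; adding the papers that do not belong to any cluster to the outlier paper set;
5) assigning the papers in the outlier paper set to the corresponding clusters using a method based on similarity threshold matching.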
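Further, the method of assigning the papers in the outlier paper set to the corresponding clusters using similarity threshold matching is:
11) select any paper p_i from the outlier paper set and, for each paper p_j in each cluster, initialize the similarity s(p_i, p_j) between paper p_i and paper p_j to 0;
12) compute s(p_i, p_j) = s(p_i, p_j) + (number of common authors of p_i and p_j) × N, where N is a set empirical value;
13) compute s(p_i, p_j) = s(p_i, p_j) + tanimoto(journal name of p_i, journal name of p_j), where the function tanimoto(p, q) calculates the Tanimoto similarity of two sets p and q;
14) compute s(p_i, p_j) = s(p_i, p_j) + tanimoto(institution of the author to be disambiguated in p_i, institution of the author to be disambiguated in p_j);
15) compute s(p_i, p_j) = s(p_i, p_j) + (number of words shared by the topics and keywords of p_i and p_j) / M, where M is a set empirical value;
16) if s(p_i, p_j) computed in step 15) is greater than the set threshold α, assign paper p_i to the cluster containing paper p_j; otherwise assign paper p_i to a new cluster of its own.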
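Further, for the papers in the outlier paper set, the similarity is calculated pair by pair, and if it is greater than the set threshold, the clusters containing the two papers are merged.
Further, the heterogeneous network is constructed as follows: each paper in the target paper library is taken as a node in the heterogeneous network, and a number of relationships are set; if one of the set relationships holds between two papers, an edge is constructed between the corresponding nodes and its weight is set, yielding the heterogeneous network.
Further, the set relationships include having a common author and having a common institution.
Further, the path set is generated by a meta-path-based random walk strategy.
Further, the discrete features include author and institution; the semantic features include title, journal, institution, publication year, and keywords.
Further, the model is a word2vec model.
A computer-readable storage medium, characterized in that it stores a computer program comprising instructions for executing each step of the above method.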
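Compared with the prior art, the positive effects of the present invention are as follows:
The present invention uses both the relationship features between papers and the semantic features of the papers to obtain the representation vectors of the papers, and then clusters the papers to achieve disambiguation. It also accounts for papers whose features are not distinctive enough and whose similarity to other papers is therefore low, and proposes a similarity threshold matching method to further process these outlier papers, thereby improving the accuracy of disambiguation.
Figure 1 is a model architecture diagram of the present invention;
Figure 2 is a schematic diagram of the heterogeneous network;
Figure 3 is a schematic diagram of meta-path-based random walk path generation.
The present invention is further described below with reference to the drawings and embodiments.
The present invention aims to resolve the ambiguity of same-name authors in papers. It uses the main information of each paper, such as title, abstract, authors, journal, author institutions, publication year, and keywords, learns relationship representations and semantic representations of the papers, and clusters them. Outlier papers produced along the way are handled by similarity threshold matching. The final result is a partition in which the papers of one real author fall into one cluster and the papers of different authors fall into different clusters. Figure 1 is a model architecture diagram of the present invention.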
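Step 1: Analyze the features of the papers in the paper database and divide them into semantic features and discrete features.
The features are first analyzed and divided into two types according to the kind of information they contain: semantic features and discrete features. Semantic features are features that carry text information, such as title, abstract, and keywords; they can be converted into text vectors with a semantic representation learning model such as word2vec. Discrete features are features that have little value in themselves but can express relationships between papers, such as authors and institutions. Some features can be treated as either discrete or semantic. In the specific implementation, the present invention defines author and institution as discrete features, and defines title, journal, institution, publication year, and keywords as semantic features.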
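Step 2: Based on the discrete features from Step 1, construct a heterogeneous network of paper-to-paper relationships, generate a set of paths of paper ids with a meta-path-based random walk strategy, and train the relationship representation vectors of the papers with the word2vec model (specifically the word2vec model in the Python gensim library) to obtain the relationship similarity matrix of the papers.
This part uses network embedding to extract the relationship information of papers from their discrete features and thereby learn a representation of the relationships between papers.
First, the heterogeneous network of papers is built. For each name to be disambiguated, the relationships among all of its papers are extracted to construct a paper heterogeneous network, as shown in Figure 2. The network contains one type of node, the paper, and two types of edges, CoAuthor and CoOrg.
CoAuthor indicates that two papers have co-authors in common (not counting the name to be disambiguated), and the edge weight is the number of common co-authors. If two papers share co-authors, an edge with the corresponding weight is built; if they share none, no edge is built.
CoOrg represents the similarity between the institutions of the name to be disambiguated in the two papers. When building the CoOrg relationship, the institution of the author to be disambiguated in each paper is treated as the set of its words after stop words are removed, and the institution similarity is the size of the intersection of the two word sets. If the author institutions of two papers have co-occurring words, an edge is built whose weight is the number of co-occurring words. If the intersection is empty, that is, the two institutions share no words, no edge is built.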
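The two edge types described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`authors`, `org`), the function names, the sample papers, and the stop-word list are all hypothetical, since the source does not specify a data format or which stop words it removes.

```python
from itertools import combinations

# Hypothetical stop-word list; the source does not say which one it uses.
STOP_WORDS = {"of", "the", "and", "for", "at", "in"}


def org_words(org):
    """Lower-case an institution string and drop stop words."""
    return {w for w in org.lower().split() if w not in STOP_WORDS}


def build_edges(papers, target_name):
    """Return {(id_a, id_b): {"CoAuthor": w, "CoOrg": w}} for related pairs.

    `papers[pid]["org"]` is assumed to hold the institution of the author
    whose name is being disambiguated.
    """
    edges = {}
    for a, b in combinations(sorted(papers), 2):
        pa, pb = papers[a], papers[b]
        # Common co-authors, excluding the name to be disambiguated.
        co_auth = (set(pa["authors"]) & set(pb["authors"])) - {target_name}
        # Words shared by the two institutions after stop-word removal.
        co_org = org_words(pa["org"]) & org_words(pb["org"])
        weights = {}
        if co_auth:
            weights["CoAuthor"] = len(co_auth)
        if co_org:
            weights["CoOrg"] = len(co_org)
        if weights:  # no edge at all when neither relation holds
            edges[(a, b)] = weights
    return edges


papers = {
    "p1": {"authors": ["Li Wei", "A. Chen", "B. Zhao"],
           "org": "Institute of Computing Technology"},
    "p2": {"authors": ["Li Wei", "A. Chen"], "org": "Institute of Physics"},
    "p3": {"authors": ["Li Wei", "C. Xu"], "org": "Department of Physics"},
}
edges = build_edges(papers, target_name="Li Wei")
```

In this toy input, p1 and p3 share neither a co-author nor an institution word, so no edge is created between them. A paper with no edges at all would be a candidate for the outlier paper set.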
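After the heterogeneous network of papers is built, we perform random walks along a meta-path of the form p1→CoAuthor→p2→CoOrg→p3 to generate a path set composed of paper ids. Specifically, each paper node in the heterogeneous network is selected in turn as the initial node, and a random walk is performed along this meta-path. At each step, the walk considers the edge type that the meta-path specifies for the current position and, according to the edge weights, probabilistically selects a node connected by an edge of that type as the next node, which is then saved to the path. The transition probability of the random walk is proportional to the edge weight. The walk is repeated until the specified path length is reached, which yields one path of paper ids. Another node in the heterogeneous network is then selected as the initial node and the same operation is performed to obtain its path. Iterating this process N times yields the paper id path set, which serves as the training corpus for relationship representation learning. The random walk process is illustrated in Figure 3.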
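A minimal sketch of the walk just described, assuming the edge dictionary format used here (pair of paper ids mapped to per-type weights) and alternating CoAuthor and CoOrg steps. The function names, the default walk length, the number of iterations, and the choice to stop a walk early at a dead end are illustrative assumptions, not values from the source.

```python
import random


def neighbors(edges, node, edge_type):
    """Neighbours of `node` over edges of `edge_type`, with their weights."""
    out = []
    for (a, b), weights in edges.items():
        if edge_type in weights and node in (a, b):
            out.append((b if node == a else a, weights[edge_type]))
    return out


def meta_path_walk(edges, start, walk_length, rng):
    """One walk alternating CoAuthor and CoOrg edges (p1→CoAuthor→p2→CoOrg→p3)."""
    path = [start]
    edge_types = ("CoAuthor", "CoOrg")
    step = 0
    while len(path) < walk_length:
        candidates = neighbors(edges, path[-1], edge_types[step % 2])
        if not candidates:
            break  # assumption: a walk simply ends at a dead end
        nodes, weights = zip(*candidates)
        # Transition probability is proportional to the edge weight.
        path.append(rng.choices(nodes, weights=weights, k=1)[0])
        step += 1
    return path


def generate_paths(edges, nodes, walk_length=5, iterations=2, seed=0):
    """Start one walk from every node, and repeat that `iterations` times."""
    rng = random.Random(seed)
    return [meta_path_walk(edges, n, walk_length, rng)
            for _ in range(iterations) for n in nodes]


edges = {
    ("p1", "p2"): {"CoAuthor": 2, "CoOrg": 1},
    ("p2", "p3"): {"CoOrg": 3},
    ("p1", "p3"): {"CoAuthor": 1},
}
paths = generate_paths(edges, ["p1", "p2", "p3"])
```

Each resulting path is a list of paper ids, which is the "sentence" format a word2vec implementation expects as training input.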
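The random walk process above produces the paper id path set, which is used as a training corpus for the skip-gram model in word2vec to obtain the relationship representation vectors of the papers. word2vec learns from text to represent the semantic information of words as word vectors, that is, it builds an embedding space in which semantically similar words lie close together. Through the same word vector embedding technique, papers with similar relationships also lie close together in the embedding space.
Once the relationship representation vectors of the papers are obtained, the relationship similarity matrix of the papers is computed with cosine similarity. In addition, the present invention borrows the idea of bagging: the above process is repeated several times to obtain multiple paper relationship similarity matrices, which are summed and averaged to give the final paper relationship similarity matrix.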
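The cosine similarity matrix and the bagging-style averaging can be sketched as follows. The vectors below are made-up stand-ins for the embeddings that two separate training runs might produce; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarity between all paper vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def average_matrices(matrices):
    """Element-wise mean of several equally sized square matrices."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
            for i in range(n)]


# Hypothetical paper embeddings from two independent runs (rows = papers).
run_a = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
run_b = [[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]]

relation_sim = average_matrices(
    [similarity_matrix(run_a), similarity_matrix(run_b)]
)
```

Averaging similarity matrices rather than the vectors themselves is a sensible fit here, because embeddings from separate runs are generally not aligned with each other, whereas pairwise cosine similarities remain comparable across runs.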
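Papers that have neither of the two relationships defined above with any other paper are added to the outlier paper set and processed separately later.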
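Step 3: Based on the semantic features from Step 1, train word vectors with word2vec and derive the semantic representation vector of each paper, thereby obtaining the semantic similarity matrix of the papers.
For semantic representation learning, we first merge the features that carry semantic information from all paper data into a single corpus, which mainly includes the titles, abstracts, journals, and the institutions of all authors. After data cleaning, word segmentation, and stop-word removal, the word2vec model is trained on this corpus to obtain a pre-trained word vector model, which is used to build the semantic representation vectors of the papers.
For each paper, the semantic features are passed through this pre-trained word vector model to obtain the semantic representation vector of the paper. These semantic features include the title, journal, institution, publication year, and keywords of the paper. After data cleaning, lowercasing, word segmentation, and stop-word removal, the text information corresponding to each paper is obtained. Using the pre-trained word vectors, a text vector is computed for each paper's text information by averaging its word vectors. These text vectors constitute the semantic representation vectors of the papers.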
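The averaging step can be illustrated with a small sketch. The word vectors here are a hand-written dictionary standing in for a trained word2vec model, and the whitespace tokenizer, stop-word set, and function names are simplifying assumptions.

```python
def tokenize(text, stop_words):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


def paper_vector(tokens, word_vectors, dim):
    """Mean of the word vectors of all known tokens, or None if there are none."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # the paper carries no usable semantic features
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


# Toy 2-dimensional "pre-trained" word vectors (hypothetical values).
word_vectors = {
    "graph": [1.0, 0.0],
    "embedding": [0.0, 1.0],
    "network": [1.0, 1.0],
}
stop_words = {"of", "the"}

vec = paper_vector(
    tokenize("Graph Embedding of the Network", stop_words), word_vectors, dim=2
)
empty = paper_vector(tokenize("of the", stop_words), word_vectors, dim=2)
```

Returning `None` for a paper without any known token gives the caller a simple way to route such papers to the outlier paper set instead of computing a similarity for them.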
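Once the semantic representation vectors of the papers are obtained, the semantic similarity matrix of the papers is likewise computed with cosine similarity.
Papers that contain no semantic features are added to the outlier paper set and processed separately later.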
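Step 4: Based on the similarity matrices generated in Steps 2 and 3, cluster with the DBSCAN algorithm; each resulting cluster represents the set of papers belonging to one real author.
The paper relationship similarity matrix and the paper semantic similarity matrix obtained above are combined by a weighted sum to give the final paper similarity matrix. Based on experiments, both weights are set to 0.5. The result is then clustered with the DBSCAN algorithm, specifically the DBSCAN method in the Python sklearn.cluster module. This method does not require the number of clusters (the value of K) to be fixed in advance. Our parameter settings are listed in the table below.
Parameter | Value |
---|---|
eps | 0.2 |
min_samples | 4 |
metric | precomputed |
During clustering, the minimum number of samples is set to 4, that is, a cluster contains at least 4 papers. Papers that are not similar to any other paper therefore belong to no cluster; we add them to the outlier paper set and process them separately.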
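To make the clustering step concrete, the sketch below combines two toy similarity matrices with weights 0.5 and 0.5 and clusters the result with eps = 0.2 and min_samples = 4. Two things are assumptions rather than statements from the source. First, the similarity is converted to a distance as 1 − similarity, since scikit-learn's `metric="precomputed"` expects a distance matrix. Second, to keep the example dependency-free, a small pure-Python DBSCAN stands in for `sklearn.cluster.DBSCAN`; the matrices and helper names are invented for illustration.

```python
def dbscan_precomputed(dist, eps, min_samples):
    """Minimal DBSCAN on a precomputed distance matrix; -1 marks noise.

    Stand-in for sklearn.cluster.DBSCAN(eps=eps, min_samples=min_samples,
    metric="precomputed").fit_predict(dist). A point counts as its own neighbour.
    """
    n = len(dist)
    labels = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = [j for j in range(n) if dist[i][j] <= eps]
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = [k for k in range(n) if dist[j][k] <= eps]
            if len(nb) >= min_samples:  # j is a core point: keep expanding
                queue.extend(nb)
    return labels


def block_similarity(groups, n, inside, outside):
    """Toy similarity matrix: `inside` within a group, `outside` elsewhere."""
    sim = [[outside] * n for _ in range(n)]
    for g in groups:
        for i in g:
            for j in g:
                sim[i][j] = inside
    for i in range(n):
        sim[i][i] = 1.0
    return sim


groups = [range(0, 4), range(4, 8)]  # paper 8 belongs to no group
relation = block_similarity(groups, 9, inside=0.90, outside=0.10)
semantic = block_similarity(groups, 9, inside=0.95, outside=0.20)

# Weighted sum with both weights set to 0.5, then similarity -> distance.
combined = [[0.5 * r + 0.5 * s for r, s in zip(rr, sr)]
            for rr, sr in zip(relation, semantic)]
dist = [[1.0 - v for v in row] for row in combined]

labels = dbscan_precomputed(dist, eps=0.2, min_samples=4)
outliers = [i for i, lab in enumerate(labels) if lab == -1]
```

Papers labelled −1 are the ones that would be added to the outlier paper set for the threshold-matching stage.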
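Step 5: Apply similarity threshold matching to the outlier paper set produced in Steps 2, 3, and 4, and assign each paper in that set to the correct cluster.
The outlier paper set produced in the three preceding steps is processed with a method based on similarity threshold matching.
First, we define the following similarity rules, where s(p_i, p_j) denotes the similarity between paper p_i and paper p_j.
1. Initially, s(p_i, p_j) = 0;
2. s(p_i, p_j) = s(p_i, p_j) + (number of common authors of p_i and p_j) × 1.5;
3. s(p_i, p_j) = s(p_i, p_j) + tanimoto(journal name of p_i, journal name of p_j);
4. s(p_i, p_j) = s(p_i, p_j) + tanimoto(institution of the author to be disambiguated in p_i, institution of the author to be disambiguated in p_j);
5. s(p_i, p_j) = s(p_i, p_j) + (number of words shared by the topics and keywords of p_i and p_j) / 3.0;
6. Output s(p_i, p_j).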
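Here, tanimoto(p, q) denotes the Tanimoto similarity of two string sets, where p and q are the corresponding strings:
$$\mathrm{tanimoto}(p, q) = \frac{|p \cap q|}{|p| + |q| - |p \cap q|} \tag{1}$$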
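The scoring rules can be sketched as below, assuming the usual set form of the Tanimoto coefficient (size of the intersection divided by size of the union). Several details are assumptions made only for this illustration: strings are turned into sets of lower-cased words, the name being disambiguated is excluded from the common-author count, the shared-word count uses a `keywords` list, and the record fields and function names are hypothetical.

```python
def tanimoto(p, q):
    """Tanimoto (Jaccard) similarity of two collections treated as sets."""
    p, q = set(p), set(q)
    union = len(p | q)
    return len(p & q) / union if union else 0.0


def pair_score(a, b, target_name, n=1.5, m=3.0):
    """Accumulate the rule-based similarity s(p_i, p_j) of two papers."""
    s = 0.0
    # Rule 2: common authors (assumed to exclude the ambiguous name) times N.
    co_authors = (set(a["authors"]) & set(b["authors"])) - {target_name}
    s += len(co_authors) * n
    # Rule 3: Tanimoto similarity of the journal names, as word sets.
    s += tanimoto(a["venue"].lower().split(), b["venue"].lower().split())
    # Rule 4: Tanimoto similarity of the target author's institutions.
    s += tanimoto(a["org"].lower().split(), b["org"].lower().split())
    # Rule 5: number of shared topic/keyword words divided by M.
    s += len(set(a["keywords"]) & set(b["keywords"])) / m
    return s


paper_a = {
    "authors": ["Li Wei", "A. Chen"],
    "venue": "Journal of Data Mining",
    "org": "Institute of Physics",
    "keywords": ["clustering", "embedding", "graph"],
}
paper_b = {
    "authors": ["Li Wei", "A. Chen", "C. Xu"],
    "venue": "Journal of Data Science",
    "org": "Institute of Physics",
    "keywords": ["embedding", "graph", "walk"],
}
score = pair_score(paper_a, paper_b, target_name="Li Wei")
```

For this pair the four contributions are 1.5 (one common co-author), 0.6 (journal), 1.0 (institution), and 2/3 (two shared keywords), so the score clears a threshold of 1.5 comfortably.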
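For each paper in the outlier paper set, first compare it with the papers that have already been clustered. If its similarity to the most similar clustered paper is greater than the threshold α, assign it to the cluster containing that paper; otherwise assign it to a new cluster of its own. Second, compare each paper in the outlier paper set with the other papers in that set. If the similarity between two of them is greater than the threshold α, merge the clusters containing them. Here the threshold α is defined as 1.5.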
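The two passes of this assignment procedure can be sketched as follows. The `score` argument is any pairwise similarity function; here it is a lookup table with invented values so the example stays self-contained. Integer cluster ids, the function name, and the data layout are assumptions for illustration.

```python
def assign_outliers(outliers, clusters, score, alpha=1.5):
    """Assign outlier papers to clusters, then merge similar outliers' clusters.

    clusters: dict mapping an integer cluster id to a list of paper ids.
    Returns an updated copy; the input is left unchanged.
    """
    clusters = {cid: list(members) for cid, members in clusters.items()}
    clustered = [(pid, cid) for cid, members in clusters.items() for pid in members]
    next_id = max(clusters, default=-1) + 1
    home = {}  # outlier paper id -> id of the cluster it currently sits in

    # Pass 1: compare each outlier with the papers that are already clustered.
    for p in outliers:
        best = max(clustered, key=lambda pc: score(p, pc[0]), default=None)
        if best is not None and score(p, best[0]) > alpha:
            cid = best[1]
        else:
            cid, next_id = next_id, next_id + 1  # a new cluster of its own
            clusters[cid] = []
        clusters[cid].append(p)
        home[p] = cid

    # Pass 2: merge the clusters of outlier pairs that exceed the threshold.
    for i, p in enumerate(outliers):
        for q in outliers[i + 1:]:
            if home[p] != home[q] and score(p, q) > alpha:
                src, dst = home[q], home[p]
                for member in clusters.pop(src):
                    clusters[dst].append(member)
                    if member in home:
                        home[member] = dst
    return clusters


# Hypothetical pairwise scores; unlisted pairs score 0.
table = {frozenset(("o1", "a")): 2.0, frozenset(("o2", "o3")): 1.8}


def lookup(x, y):
    return table.get(frozenset((x, y)), 0.0)


result = assign_outliers(
    ["o1", "o2", "o3"], {0: ["a", "b"], 1: ["c"]}, lookup, alpha=1.5
)
```

In this toy run, o1 joins the cluster of paper a, while o2 and o3 first receive clusters of their own and are then merged because their mutual score exceeds the threshold.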
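With the similarity threshold matching method above, papers whose features are not distinctive enough (outlier papers) can be processed, and merging this result with the earlier pre-clustering result gives the final paper clustering, achieving disambiguation of same-name authors.
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be as defined in the claims.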
Claims (10)
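- A method for disambiguating authors with the same name based on network representation and semantic representation, comprising the steps of: 1) extracting the semantic features and discrete features of each paper in a target paper library, wherein the target paper library is the paper library obtained for the author to be disambiguated; 2) constructing a heterogeneous network of papers based on the discrete features of each paper, generating a path set from the heterogeneous network and using it as a training corpus to train a model, using the model to generate relationship representation vectors for the papers in the target paper library, and calculating the similarity between papers from the relationship representation vectors to obtain a relationship similarity matrix of the papers; for a paper a in the target paper library, if paper a shares no author or institution with the other papers, adding it to an outlier paper set; 3) generating a semantic representation vector for each paper based on its semantic features, and calculating the similarity between papers from the semantic representation vectors to obtain a semantic similarity matrix of the papers; adding the papers in the target paper library that contain no semantic features to the outlier paper set; 4) computing a weighted sum of the relationship similarity matrix and the semantic similarity matrix to obtain a paper similarity matrix and clustering on it, each resulting cluster representing the set of papers belonging to one author; adding the papers that do not belong to any cluster to the outlier paper set; 5) assigning the papers in the outlier paper set to the corresponding clusters using a method based on similarity threshold matching.
- The method according to claim 1, characterized in that the method of assigning the papers in the outlier paper set to the corresponding clusters using similarity threshold matching is: 11) select any paper p_i from the outlier paper set and, for each paper p_j in each cluster, initialize the similarity s(p_i, p_j) between paper p_i and paper p_j to 0; 12) compute s(p_i, p_j) = s(p_i, p_j) + (number of common authors of p_i and p_j) × N, where N is a set empirical value; 13) compute s(p_i, p_j) = s(p_i, p_j) + tanimoto(journal name of p_i, journal name of p_j), where the function tanimoto(p, q) calculates the Tanimoto similarity of two sets p and q; 14) compute s(p_i, p_j) = s(p_i, p_j) + tanimoto(institution of the author to be disambiguated in p_i, institution of the author to be disambiguated in p_j); 15) compute s(p_i, p_j) = s(p_i, p_j) + (number of words shared by the topics and keywords of p_i and p_j) / M, where M is a set empirical value; 16) if s(p_i, p_j) computed in step 15) is greater than the set threshold α, assign paper p_i to the cluster containing paper p_j; otherwise assign paper p_i to a new cluster of its own.
- The method according to claim 1 or 2, characterized in that, for the papers in the outlier paper set, the similarity is calculated pair by pair, and if it is greater than the set threshold, the clusters containing the two papers are merged.
- The method according to claim 1, characterized in that the heterogeneous network is constructed as follows: each paper in the target paper library is taken as a node in the heterogeneous network, and a number of relationships are set; if one of the set relationships holds between two papers, an edge is constructed between the corresponding nodes and its weight is set, yielding the heterogeneous network.
- The method according to claim 5, characterized in that the set relationships include having a common author and having a common institution.
- The method according to claim 1, characterized in that the path set is generated by a meta-path-based random walk strategy.
- The method according to claim 1, characterized in that the discrete features include author and institution; the semantic features include title, journal, institution, publication year, and keywords.
- The method according to claim 1, characterized in that the model is a word2vec model.
- A computer-readable storage medium, characterized in that it stores a computer program comprising instructions for executing each step of the method according to any one of claims 1 to 9.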
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/603,391 US11775594B2 (en) | 2019-12-25 | 2019-12-26 | Method for disambiguating between authors with same name on basis of network representation and semantic representation |
EP19957434.4A EP3940582A4 (en) | 2019-12-25 | 2019-12-26 | METHOD FOR DISAMBIGUATING BETWEEN AUTHORS WITH THE SAME NAME BASED ON A NETWORK REPRESENTATION AND A SEMANTIC REPRESENTATION |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911352416.9 | 2019-12-25 | ||
CN201911352416.9A CN111191466B (zh) | 2019-12-25 | 2019-12-25 | Method for disambiguating between authors with same name on basis of network representation and semantic representation |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021128158A1 (zh) | 2021-07-01 |
Family
ID=70710506
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/128642 WO2021128158A1 (zh) | Method for disambiguating between authors with same name on basis of network representation and semantic representation | 2019-12-25 | 2019-12-26 |
Country Status (4)
Country | Link |
---|---|
US (1) | US11775594B2 (zh) |
EP (1) | EP3940582A4 (zh) |
CN (1) | CN111191466B (zh) |
WO (1) | WO2021128158A1 (zh) |
-
2019
- 2019-12-25 CN CN201911352416.9A patent/CN111191466B/zh active Active
- 2019-12-26 WO PCT/CN2019/128642 patent/WO2021128158A1/zh active Application Filing
- 2019-12-26 US US17/603,391 patent/US11775594B2/en active Active
- 2019-12-26 EP EP19957434.4A patent/EP3940582A4/en not_active Withdrawn
Also Published As
Publication number | Publication date |
---|---|
EP3940582A1 (en) | 2022-01-19 |
US20220318317A1 (en) | 2022-10-06 |
US11775594B2 (en) | 2023-10-03 |
CN111191466A (zh) | 2020-05-22 |
EP3940582A4 (en) | 2022-08-17 |
CN111191466B (zh) | 2022-04-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19957434; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2019957434; Country of ref document: EP; Effective date: 20211013 |
 | WWE | Wipo information: entry into national phase | Ref document number: 2019957434; Country of ref document: EP |
 | NENP | Non-entry into the national phase | Ref country code: DE |