US11775594B2 - Method for disambiguating between authors with same name on basis of network representation and semantic representation - Google Patents

Method for disambiguating between authors with same name on basis of network representation and semantic representation

Publication number
US11775594B2
Authority
US
United States
Prior art keywords
publication
theses
discrete
similarity
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US17/603,391
Other languages
English (en)
Other versions
US20220318317A1 (en)
Inventor
Yi Du
Hanxue Wang
Ziyue Qiao
Yuanchun Zhou
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Computer Network Information Center of CAS
Publication of US20220318317A1
Application granted
Publication of US11775594B2
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • The present invention relates mainly to entity disambiguation, heterogeneous network embedding, and word vector embedding, and in particular to a technique for disambiguating same-name authors of theses based on network representation and semantic representation.
  • The object of the present invention is to provide a method for disambiguating between authors with the same name on the basis of network representation and semantic representation.
  • The method mines the relationship information between the theses and the semantic information of the publication text, obtains a publication representation vector and a publication similarity matrix, and divides the publication sets of different authors into different clusters by clustering on the similarity matrix, thereby disambiguating between authors with the same name.
  • A similarity threshold matching method is then used to handle the discrete theses left over by the above process, so that same-name authors of theses are disambiguated with high accuracy.
  • The invention can specifically include the following steps.
  • Step 1: analyzing the relevant information of theses in the publication library, and dividing features into semantic and discrete features.
  • Step 2: constructing a heterogeneous network of publications and publication relationships from the discrete features of step 1, generating a path set of publication ids by meta-path-based random walk, training the relationship representation vectors of the publications using the word2vec model, and obtaining the relationship similarity matrix of the publications.
  • Step 3: based on the semantic features of step 1, training word vectors by word2vec to obtain a semantic representation vector of each publication, and then obtaining a semantic similarity matrix of the publications.
  • Step 4: clustering by the DBSCAN algorithm based on the similarity matrices generated by steps 2 and 3, where each resulting cluster represents the publication set of one real author.
  • Step 5: processing the discrete publication set generated in the above-mentioned step 2, step 3 and step 4 by using a method based on similarity threshold matching, and allocating the theses in the discrete publication set to a correct cluster.
  • The technical solution of the present invention comprises the following.
  • a method for disambiguating between authors with a same name on basis of network representation and semantic representation can include the following steps:
  • The step of allocating the theses in the discrete publication set to corresponding clusters by using a method based on similarity threshold matching can include:
  • The similarity is calculated pairwise, and the clusters in which the two theses are respectively located are merged if their similarity is greater than a set threshold.
  • The step of constructing a heterogeneous network can include taking each publication in the target publication library as a node in the heterogeneous network, and setting several relationships; and if one of the set relationships holds between two theses, constructing an edge between the nodes corresponding to the two theses, and setting a weight value of the edge to obtain the heterogeneous network.
  • The set relationships can include having a common author and having a common institution.
  • the path set is generated by the random walk strategy based on the meta path.
  • the discrete features comprise authors and institutions; and the semantic features can include title, journal, institution, publication year, and keywords.
  • model can be the word2vec model.
  • A computer-readable storage medium is characterized by storing a computer program comprising instructions for performing the steps of the above method.
  • the invention has the following beneficial effects.
  • the present invention can obtain the representation vector of the theses by using the relationship features between the theses and the semantic features of the theses, and then cluster the theses to achieve disambiguation. Meanwhile, the present invention also takes into account that there may be some theses whose features are not obvious enough and the similarity is small compared with other theses, and proposes a method based on the similarity threshold matching to further process these discrete theses, thereby improving the accuracy of disambiguation.
  • FIG. 1 is a model architecture diagram of the present invention.
  • FIG. 2 is a schematic diagram of a heterogeneous network.
  • FIG. 3 is a schematic diagram of random walk path generation based on a meta path.
  • The present invention aims to solve the ambiguity problem of same-name authors of theses. It uses the main information of the theses, such as title, abstract, author, journal, author organization, publication year and keywords, to learn the relationship representation and semantic representation of the theses, and clusters them using a clustering method. Meanwhile, the present invention processes the discrete theses generated in the process using a method based on similarity threshold matching, so as to obtain a final publication division result. Namely, theses that truly belong to the same author are placed in one cluster, and theses of different authors are placed in different clusters.
  • FIG. 1 is a model architecture diagram of the present invention.
  • Step 1: analyzing the relevant information of theses in the publication library, and dividing features into semantic and discrete features.
  • Semantic features refer to features with text information, such as titles, abstracts, and keywords, which can be transformed into text vectors using semantic representation learning models such as word2vec.
  • Discrete features mean that the features themselves have little value, but they can be used to express the relationship between theses, such as authors and institutions. Some of these features can be regarded as either discrete or semantic features.
  • The invention defines the author and the institution as discrete features, and defines the title, journal, institution, publication year, and keywords as semantic features.
  • Step 2: constructing a heterogeneous network of publications and publication relationships from the discrete features of step 1, generating a path set of publication ids by the meta-path-based random walk strategy, training the relationship representation vectors of the publications using the word2vec model (in particular, the word2vec model in the gensim library in Python), and obtaining the relationship similarity matrix of the publications.
  • This part extracts the publication's relationship information from the publication's discrete features by the network embedding method to realize the representation learning of the publication's relationship.
  • The network mainly comprises one type of node, article, and two types of edges, CoAuthor and CoOrg.
  • CoAuthor represents that two theses have common authors (excluding the name to be disambiguated), and the weight on the edge represents the number of common authors. If two theses share at least one common author, an edge weighted by the number of common authors is built; if they share none, no edge is built.
  • CoOrg represents the similarity relationship between the institutions of the authors with the name to be disambiguated in the two theses.
  • The author institutions of the two theses with author names to be disambiguated are each treated as a set of words after removing stop words, and the similarity relationship of the institutions depends on the size of the intersection of the two word sets. That is, if the author institutions of the two theses have co-occurring words, an edge is constructed whose weight is the number of co-occurring words; and if the intersection is empty, i.e. there is no co-occurring word between the two institutions, this edge is not constructed.
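The edge rules above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the record layout (`id`, `authors`, `org`), the stop-word list, and the function names `org_words` and `build_network` are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"of", "the", "and", "for", "in", "at"}

def org_words(org):
    """Lowercase an institution string and drop stop words."""
    return {w for w in org.lower().split() if w not in STOP_WORDS}

def build_network(papers, target_name):
    """Return {edge_type: {paper_id: {neighbor_id: weight}}}."""
    graph = {"CoAuthor": {}, "CoOrg": {}}
    for a, b in combinations(papers, 2):
        # CoAuthor weight: shared co-authors, excluding the ambiguous name.
        shared_authors = (set(a["authors"]) & set(b["authors"])) - {target_name}
        # CoOrg weight: words shared by the two institution strings.
        shared_words = org_words(a["org"]) & org_words(b["org"])
        for etype, weight in (("CoAuthor", len(shared_authors)),
                              ("CoOrg", len(shared_words))):
            if weight > 0:  # no edge is built when nothing is shared
                graph[etype].setdefault(a["id"], {})[b["id"]] = weight
                graph[etype].setdefault(b["id"], {})[a["id"]] = weight
    return graph

# Toy records for the ambiguous name "Li Wei" (invented data).
papers = [
    {"id": "p1", "authors": ["Li Wei", "A. Chen", "B. Liu"],
     "org": "Institute of Computing Technology"},
    {"id": "p2", "authors": ["Li Wei", "A. Chen"],
     "org": "Computing Technology Institute"},
    {"id": "p3", "authors": ["Li Wei", "C. Zhao"],
     "org": "Department of Physics"},
]
graph = build_network(papers, target_name="Li Wei")
```

Storing one adjacency map per edge type keeps the later meta-path walk simple, since each step only needs the neighbors reachable through a single edge type.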
  • A meta path based on p1 → CoAuthor → p2 → CoOrg → p3 is used to make random walks and generate a path set composed of publication ids.
  • Each publication node in the publication heterogeneous network is selected in turn as an initial node, and a random walk is taken according to the above-mentioned meta path. At each step, the edge type is the one specified by the current position in the meta path; a next node connected by an edge of that type is selected with a probability determined by the edge weight, and that node is appended to the path. The transition probability of the random walk is specified to be proportional to the weight of the edge.
  • a publication id path is obtained by repeating such walks several times until the prescribed path length is reached. Then, by reselecting another node in the heterogeneous network as the initial node, the same operation is performed to obtain the corresponding publication id path. With N iterations of the above process, the publication id path set is obtained as the training corpus of relationship representation learning.
  • a schematic diagram of the random walk process is shown in FIG. 3 .
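A minimal sketch of the meta-path walk described above, assuming the network is stored as one weighted adjacency map per edge type. The names `meta_path_walk` and `generate_paths`, the toy graph, and the choice to stop a walk early when the current node has no edge of the required type are assumptions; the source does not say how dead ends are handled.

```python
import random

META_PATH = ("CoAuthor", "CoOrg")  # edge types alternate along the walk

def meta_path_walk(graph, start, length, rng):
    """One walk of at most `length` publication ids starting at `start`."""
    path = [start]
    while len(path) < length:
        # Edge type required at this position of the meta path.
        etype = META_PATH[(len(path) - 1) % len(META_PATH)]
        neighbors = graph[etype].get(path[-1], {})
        if not neighbors:  # dead end for this edge type: stop early (assumption)
            break
        ids = list(neighbors)
        weights = [neighbors[i] for i in ids]
        # Transition probability is proportional to the edge weight.
        path.append(rng.choices(ids, weights=weights, k=1)[0])
    return path

def generate_paths(graph, nodes, num_iters, length, seed=0):
    """Start a walk from every node, repeated num_iters times."""
    rng = random.Random(seed)
    return [meta_path_walk(graph, n, length, rng)
            for _ in range(num_iters) for n in nodes]

# Toy network: {edge_type: {node: {neighbor: weight}}} (invented data).
walk_graph = {
    "CoAuthor": {"p1": {"p2": 2}, "p2": {"p1": 2}},
    "CoOrg": {"p1": {"p3": 3}, "p2": {"p3": 1}, "p3": {"p1": 3, "p2": 1}},
}
paths = generate_paths(walk_graph, ["p1", "p2", "p3"], num_iters=2, length=5)
```

The resulting list of id sequences plays the role of the training corpus: each path is treated like a sentence whose words are publication ids.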
  • the publication id path set can be obtained by the above-mentioned random walk process, the path set can be used as a training corpus, and the skip-gram model in word2vec can be used for training, so as to obtain the relationship representation vector of the publication.
  • Word2vec characterizes word semantic information by learning word vectors from text, i.e. semantically similar words are close to each other in the embedding space. With the publication id paths as the corpus, theses with similar relationships will likewise be close in the embedding space.
  • the relationship similarity matrix of the theses can be obtained by using the cosine similarity calculation method.
  • the present invention uses the idea of bagging to repeat the above-mentioned process several times to obtain a plurality of publication relationship similarity matrices, and sums and averages them to obtain a final publication relationship similarity matrix.
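The cosine-similarity and averaging steps can be sketched with NumPy as follows. The function names and the toy embeddings are assumptions; each element of `embedding_runs` stands for the vectors produced by one repetition of the walk-and-train process, with rows in the same publication order.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    x = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return x @ x.T

def bagged_similarity(embedding_runs):
    """Average the similarity matrices obtained from several runs."""
    matrices = [cosine_similarity_matrix(run) for run in embedding_runs]
    return np.mean(matrices, axis=0)

# Two toy runs over the same two publications (invented vectors).
runs = [
    [[1.0, 0.0], [0.0, 1.0]],  # run 1: orthogonal, similarity 0
    [[1.0, 0.0], [1.0, 0.0]],  # run 2: identical, similarity 1
]
relation_similarity = bagged_similarity(runs)
```

Averaging over runs smooths out the randomness of individual walks, which is the point of the bagging idea mentioned above.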
  • Step 3: based on the semantic features of step 1, training word vectors by word2vec to obtain a semantic representation vector of each publication, and then obtaining a semantic similarity matrix of the publications.
  • The semantic features are used to obtain the semantic representation vector of the publication by the word vector pre-training model. These semantic features include the title, journal, institution, publication year, keywords, etc. of the publication.
  • The text information corresponding to each publication can be obtained by data cleaning, lowercasing, word segmentation, stop-word removal and other operations on these semantic features.
  • A corresponding text vector can be obtained for the text information of each publication, wherein the text vector is obtained by averaging the word vectors.
  • the semantic similarity matrix of the theses is also obtained by using the cosine similarity calculation method.
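The preprocessing and averaging in Step 3 can be illustrated as below. This is a sketch under assumptions: the regular-expression tokenizer, the stop-word list, the zero-vector fallback for texts with no known words, and the tiny hand-written `word_vectors` table (standing in for trained word2vec vectors) are all invented for the example.

```python
import re
import numpy as np

# Hypothetical stop-word list; the source does not specify one.
SEMANTIC_STOP_WORDS = {"the", "of", "and", "a", "an", "in", "for", "on"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in SEMANTIC_STOP_WORDS]

def text_vector(text, word_vectors, dim):
    """Average the vectors of the known words in `text`."""
    vecs = [word_vectors[w] for w in tokenize(text) if w in word_vectors]
    if not vecs:
        return np.zeros(dim)  # fallback for empty text (assumption)
    return np.mean(vecs, axis=0)

# Stand-in for trained word vectors (invented values).
word_vectors = {
    "graph": np.array([1.0, 0.0]),
    "embedding": np.array([0.0, 1.0]),
}
semantic_vector = text_vector("The Graph of Embedding", word_vectors, dim=2)
```

In practice the text passed in would be the concatenation of a publication's title, journal, institution, publication year and keywords, and the resulting vectors would feed the same cosine-similarity computation used for the relationship vectors.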
  • Step 4: clustering by the DBSCAN algorithm based on the similarity matrices generated by steps 2 and 3, where each resulting cluster represents the publication set of one real author.
  • The two similarity matrices are weighted and summed to obtain a final publication similarity matrix; based on experiments, the publication relationship similarity matrix and the publication semantic similarity matrix are both given a weight of 0.5.
  • The DBSCAN clustering algorithm is used to cluster the theses, specifically the DBSCAN class in the sklearn.cluster module in Python. This method does not need the number of clusters (K value) to be determined in advance, and our parameters are set as shown in the following table.
  • The minimum number of samples is set to 4, i.e. the minimum number of theses in a cluster is 4, so that theses which are not similar to other theses will not belong to any cluster; these theses are added to the discrete publication set and processed separately.
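A sketch of the fusion and clustering step with scikit-learn follows. The 0.5 weights and `min_samples=4` come from the text; the conversion of similarity to distance as `1 - similarity`, the `eps` value, and the function name `cluster_publications` are assumptions, because the parameter table is not reproduced here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_publications(rel_sim, sem_sim, eps=0.2, min_samples=4):
    """Fuse two similarity matrices and cluster with DBSCAN."""
    sim = 0.5 * rel_sim + 0.5 * sem_sim
    # DBSCAN expects distances; 1 - similarity is an assumed conversion.
    dist = np.clip(1.0 - sim, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    # Label -1 marks noise points: these form the discrete publication set.
    discrete = [i for i, label in enumerate(labels) if label == -1]
    return labels, discrete

# Toy data: four mutually similar publications and one outlier.
toy_sim = np.full((5, 5), 0.1)
toy_sim[:4, :4] = 0.9
np.fill_diagonal(toy_sim, 1.0)
labels, discrete = cluster_publications(toy_sim, toy_sim)
```

With `metric="precomputed"`, DBSCAN uses the supplied matrix directly, so the fused similarity never has to be turned back into feature vectors.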
  • Step 5: processing the discrete publication set generated in the above-mentioned step 2, step 3 and step 4 by using a method based on similarity threshold matching, and allocating the theses in the discrete publication set to a correct cluster.
  • the discrete publication sets generated in the above three steps are processed using a method based on similarity threshold matching.
  • A similarity rule is defined, where s(p_i, p_j) denotes the similarity of publication p_i and publication p_j.
  • tanimoto(p, q) refers to the tanimoto similarity of two character string sets, and p, q are the corresponding character strings:
  • First, for each publication in the discrete publication set, its similarity to the clustered theses is compared. If its similarity to the most similar clustered publication is greater than the threshold value a, it is allocated to the cluster in which that clustered publication is located; otherwise it is placed in a new cluster of its own. Second, for each publication in the discrete publication set, its similarity to the other theses in the discrete publication set is compared. If the similarity between the two is greater than the threshold value a, the clusters in which the two are located are merged.
  • the threshold value a is defined to be 1.5.
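The two-pass threshold matching can be sketched as follows. The `tanimoto` function uses the standard set definition (intersection size over union size). The patent's actual rule for s(p_i, p_j) is not reproduced in this text, so `toy_similarity`, which sums Tanimoto scores over two fields, is purely a hypothetical stand-in chosen so that scores can exceed 1 and the threshold of 1.5 is meaningful. All names and data are assumptions.

```python
def tanimoto(p, q):
    """Standard Tanimoto similarity of two collections of strings."""
    p, q = set(p), set(q)
    union = len(p | q)
    return len(p & q) / union if union else 0.0

def assign_discrete(discrete, clustered, labels, similarity, threshold=1.5):
    """labels maps clustered ids to cluster ids; returns labels for all ids."""
    result = dict(labels)
    next_label = max(labels.values(), default=-1) + 1
    # Pass 1: attach each discrete publication to the cluster of its most
    # similar clustered publication, or open a new singleton cluster.
    for d in discrete:
        best = max(clustered, key=lambda c: similarity(d, c), default=None)
        if best is not None and similarity(d, best) > threshold:
            result[d] = labels[best]
        else:
            result[d] = next_label
            next_label += 1
    # Pass 2: merge the clusters of any two discrete publications whose
    # similarity exceeds the threshold.
    for i, a in enumerate(discrete):
        for b in discrete[i + 1:]:
            if similarity(a, b) > threshold and result[a] != result[b]:
                old, new = result[b], result[a]
                for key in list(result):
                    if result[key] == old:
                        result[key] = new
    return result

# Invented records: title words and co-author names per publication.
pubs = {
    "c1": {"words": {"graph", "embedding"}, "coauthors": {"A", "B"}},
    "d1": {"words": {"graph", "embedding"}, "coauthors": {"A", "B", "C"}},
    "d2": {"words": {"protein"}, "coauthors": {"X", "Y"}},
    "d3": {"words": {"protein"}, "coauthors": {"X", "Y"}},
}

def toy_similarity(i, j):
    """Hypothetical score in [0, 2]; not the patent's formula."""
    return (tanimoto(pubs[i]["words"], pubs[j]["words"])
            + tanimoto(pubs[i]["coauthors"], pubs[j]["coauthors"]))

final_labels = assign_discrete(["d1", "d2", "d3"], ["c1"], {"c1": 0},
                               toy_similarity)
```

Here `d1` joins the existing cluster of `c1`, while `d2` and `d3` first receive singleton clusters and are then merged with each other in the second pass.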

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
US17/603,391 2019-12-25 2019-12-26 Method for disambiguating between authors with same name on basis of network representation and semantic representation Active US11775594B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
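CN201911352416.9A CN111191466B (zh) 2019-12-25 2019-12-25 Method for disambiguating between authors with same name on basis of network representation and semantic representation
CN201911352416.9 2019-12-25
PCT/CN2019/128642 WO2021128158A1 (zh) 2019-12-25 2019-12-26 Method for disambiguating between authors with same name on basis of network representation and semantic representation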

Publications (2)

Publication Number Publication Date
US20220318317A1 US20220318317A1 (en) 2022-10-06
US11775594B2 US11775594B2 (en) 2023-10-03

Family

ID=70710506

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/603,391 Active US11775594B2 (en) 2019-12-25 2019-12-26 Method for disambiguating between authors with same name on basis of network representation and semantic representation

Country Status (4)

Country Link
US (1) US11775594B2 (zh)
EP (1) EP3940582A4 (zh)
CN (1) CN111191466B (zh)
WO (1) WO2021128158A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
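CN111881693B (zh) * 2020-07-28 2023-01-13 平安科技(深圳)有限公司 Method, apparatus and computer device for disambiguating paper authors
CN112417082B (zh) * 2020-10-14 2022-06-07 西南科技大学 Method for disambiguating, archiving and storing scientific research achievement data
CN113111178B (zh) * 2021-03-04 2021-12-10 中国科学院计算机网络信息中心 Unsupervised representation-learning-based method and apparatus for disambiguating authors with the same name
CN113051397A (zh) * 2021-03-10 2021-06-29 北京工业大学 Same-name disambiguation method for academic papers based on heterogeneous information network representation learning and word vector representation
CN113962293B (zh) * 2021-09-29 2022-10-14 中国科学院计算机网络信息中心 Name disambiguation method and system based on LightGBM classification and representation learning
CN114818736B (zh) * 2022-05-31 2023-06-09 北京百度网讯科技有限公司 Text processing method, entity linking method for short text, apparatus and storage medium
CN117312565B (zh) * 2023-11-28 2024-02-06 山东科技大学 Method for disambiguating document author names based on relation fusion and representation learning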

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110137898A1 (en) 2009-12-07 2011-06-09 Xerox Corporation Unstructured document classification
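CN105653590A (zh) 2015-12-21 2016-06-08 青岛智能产业技术研究院 Method for disambiguating duplicate author names in Chinese literature
CN106021424A (zh) 2016-05-13 2016-10-12 南京邮电大学 Method for detecting duplicate names of document authors
CN109558494A (zh) 2018-10-29 2019-04-02 中国科学院计算机网络信息中心 Scholar name disambiguation method based on heterogeneous network embedding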

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
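KR101155041B1 (ko) * 2010-05-27 2012-06-11 주식회사 씩스클릭 Animation authoring system and animation authoring method
CN104111973B (zh) * 2014-06-17 2017-10-27 中国科学院计算技术研究所 Method and system for disambiguating scholars with duplicate names
CN105488092B (zh) * 2015-07-13 2018-05-22 中国科学院信息工程研究所 Time-sensitive and adaptive online subtopic detection method and system
US10839947B2 (en) * 2016-01-06 2020-11-17 International Business Machines Corporation Clinically relevant medical concept clustering
CN105868347A (zh) * 2016-03-28 2016-08-17 南京邮电大学 Duplicate-name disambiguation method based on multi-step clustering
CN108280061B (zh) * 2018-01-17 2021-10-26 北京百度网讯科技有限公司 Text processing method and apparatus based on ambiguous entity words
CN108304380B (zh) * 2018-01-24 2020-09-22 华南理工大学 Scholar name disambiguation method incorporating academic influence
CN108763333B (zh) * 2018-05-11 2022-05-17 北京航空航天大学 Method for constructing an event graph based on social media
CN108959577B (zh) * 2018-07-06 2021-12-07 中国民航大学 Entity matching method and computer program based on non-primary-attribute outlier detection
CN110516146B (zh) * 2019-07-15 2022-08-19 中国科学院计算机网络信息中心 Author name disambiguation method based on heterogeneous graph convolutional neural network embedding
CN111581949B (zh) * 2020-05-12 2023-03-21 上海市研发公共服务平台管理中心 Scholar name disambiguation method, apparatus, storage medium and terminal
CN111881693B (zh) * 2020-07-28 2023-01-13 平安科技(深圳)有限公司 Method, apparatus and computer device for disambiguating paper authors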

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
In-Su Kang Ed—Gosse Bouma et al: "Author Disambiguation Using Wikipedia—Based Explicit Semantic Analysis", Jun. 26, 2012 (Jun. 26, 2012), Natural Language Processing and Information Systems Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 351 354 XP047008633, ISBN: 978-3-642-31177-2.
Ryan Perozzi et al: "DeepWalk", Knowledge Discovery and Data Mining, ACM, 2 Penn Plaza, Suite 701 New York NY 10121-0701 USA, Aug. 24, 2014 (Aug. 24, 2014), pp. 701-710, XP058053805, DOI: 10.1145/2623330.2623732 ISBN: 978-1-4503-2956-9.
Zhang Baichuan ZHAN1910QPURDUE Edu et al: "Name Disambiguation in Anonymized Graphs using Network Embedding", Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACMPUB27, New York, NY, USA, Nov. 6, 2017 (Nov. 6, 2017), pp. 1239-1248, XP058542111, DOI: 10.1145/3132847.3132873 ISBN: 978-1-4503-5586-5.
Zhang Yutao YT—ZHANG13QMAILS Tsinghua Edu CN et al: "Name Disambiguation in a Miner Clustering, Maintenance, and Human in the Loop", High Performance Compilation, Computing and Communications, ACM, 2 Penn Plaza, Suite 701New YorkNY10121-0701USA, Jul. 19, 2018 (Jun. 19, 2018), pp. 1002-1011, XP058654402, DOI: 10.1145/3219819.3219859 ISBN: 978-1-4503-6638-0.

Also Published As

Publication number Publication date
EP3940582A1 (en) 2022-01-19
EP3940582A4 (en) 2022-08-17
WO2021128158A1 (zh) 2021-07-01
CN111191466B (zh) 2022-04-01
US20220318317A1 (en) 2022-10-06
CN111191466A (zh) 2020-05-22

Similar Documents

Publication Publication Date Title
US11775594B2 (en) Method for disambiguating between authors with same name on basis of network representation and semantic representation
Rao et al. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information
Allahyari et al. Automatic topic labeling using ontology-based topic models
Cheng et al. Contextual text understanding in distributional semantic space
US20170193086A1 (en) Methods, devices, and systems for constructing intelligent knowledge base
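CN107688870B (zh) Method and apparatus for visual analysis of hierarchical factors of a deep neural network based on text stream input
CN106796600A (zh) Computer-implemented identification of related items
JP2009093651A (ja) Modeling topics using statistical distributions
Chatterjee et al. Single document extractive text summarization using genetic algorithms
WO2013118435A1 (ja) Semantic similarity calculation method, system and program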
Song et al. Joint learning for coreference resolution with markov logic
Alian et al. Arabic semantic similarity approaches-review
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
Bondielli et al. On the use of summarization and transformer architectures for profiling résumés
Ramaprabha et al. Survey on sentence similarity evaluation using deep learning
CN117010373A (zh) Method for recommending the category and group to which power equipment asset management data belongs
CN111581365A (zh) Predicate extraction method
Bhavani et al. An efficient clustering approach for fair semantic web content retrieval via tri-level ontology construction model with hybrid dragonfly algorithm
Camargo et al. Sentiment polarity classification of tweets using a extended dictionary
CN113849639A (zh) Method and system for constructing city-level data warehouse topic model categories
Tejada-Cárcamo et al. Unsupervised WSD by finding the predominant sense using context as a dynamic thesaurus
Alfarra et al. Graph-based Growing self-organizing map for Single Document Summarization (GGSDS)
CN111476037B (zh) Text processing method, apparatus, computer device and storage medium
Ning Research on the extraction of accounting multi-relationship information based on cloud computing and multimedia
Ketineni et al. Metaheuristic Aided Improved LSTM for Multi-document Summarization: A Hybrid Optimization Model

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE