WO2021139256A1 - Disambiguation method, device and computer equipment for paper authors - Google Patents

Info

Publication number
WO2021139256A1
Authority
WO
WIPO (PCT)
Prior art keywords
name
paper
author
papers
authors
Application number
PCT/CN2020/118531
Inventor
马文佳
林桂
倪渊
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a disambiguation method, device, and computer equipment for paper authors.
  • The main purpose of this application is to provide a disambiguation method for paper authors, aiming to solve the technical problem that the correspondence between papers and author names in a database does not reach a usable level.
  • This application proposes a disambiguation method for paper authors, including:
  • forming the names of the authors involved in all papers in the database into name trees according to preset rules;
  • obtaining a heterogeneous network of association relationships corresponding to all papers in the database, where the heterogeneous network of association relationships includes an association relationship between authors and collaborators and an association relationship between authors and institutions;
  • This application also provides a disambiguation device for paper authors, including:
  • a forming module, used to form the names of the authors involved in all the papers in the database into name trees according to preset rules;
  • a first obtaining module, configured to obtain the heterogeneous network of association relationships corresponding to all the papers in the database, where the heterogeneous network of association relationships includes an association relationship between authors and collaborators and an association relationship between authors and institutions;
  • a second acquisition module, used to acquire the paper semantic representations corresponding to all papers in the database;
  • a construction module, used to construct a similarity matrix based on the name tree, the heterogeneous network of association relationships, and the semantic representation of the paper;
  • a clustering module, configured to cluster the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database;
  • a first judgment module, used to judge whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to the designated author, where the designated author is any one of the authors involved in all the papers in the database;
  • a determination module, used to determine that the author to be disambiguated is different from the designated author if the paper cluster group does not belong to the paper cluster group corresponding to the designated author.
  • The present application also provides a computer device, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements a disambiguation method for paper authors, including:
  • forming the names of the authors involved in all papers in the database into name trees according to preset rules;
  • obtaining a heterogeneous network of association relationships corresponding to all papers in the database, where the heterogeneous network of association relationships includes an association relationship between authors and collaborators and an association relationship between authors and institutions;
  • The present application also provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements a disambiguation method for paper authors, including:
  • forming the names of the authors involved in all papers in the database into name trees according to preset rules;
  • obtaining a heterogeneous network of association relationships corresponding to all papers in the database, where the heterogeneous network of association relationships includes an association relationship between authors and collaborators and an association relationship between authors and institutions;
  • This application preprocesses each author's name: the name is disassembled into component blocks, a name tree is constructed from the component blocks through their inclusion relationships, and a hierarchical matrix corresponding to the author information is then formed from the name tree. This eliminates clustering errors caused by different ways of writing a name, so that the same author's name written in different ways is, as far as possible, not split into two different groups, which improves the accuracy of name disambiguation.
  • Fig. 1 is a schematic flow chart of a disambiguation method for paper authors according to an embodiment of the present application;
  • Fig. 2 is a schematic diagram of the components of an author's name in an embodiment of the present application;
  • Fig. 3 is a schematic diagram of the structure of the name tree of an author's name in an embodiment of the present application;
  • Fig. 4 is a schematic structural diagram of a disambiguation device for paper authors according to an embodiment of the present application;
  • Fig. 5 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present application.
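  • A disambiguation method for paper authors in this embodiment includes:
  • S1 Form the names of the authors involved in all the papers in the database into name trees according to preset rules;
  • S2 Obtain a heterogeneous network of association relationships corresponding to all papers in the database, where the heterogeneous network of association relationships includes an association relationship between authors and collaborators and an association relationship between authors and institutions;
  • S3 Obtain the paper semantic representations corresponding to all papers in the database;
  • S4 Construct a similarity matrix based on the name tree, the heterogeneous network of association relationships, and the semantic representation of the paper;
  • S5 Cluster the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database;
  • S6 Determine whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to the designated author, where the designated author is any one of the authors involved in all the papers in the database;
  • S7 If it does not belong to the paper cluster group corresponding to the designated author, determine that the author to be disambiguated is different from the designated author.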
  • The preset rules for forming the name tree in this application include preprocessing the author's name.
  • The preprocessing disassembles the author's name into components and then associates the components through their inclusion relationships to construct the name tree. The hierarchical matrix corresponding to the author information is then formed from the name tree, which eliminates clustering errors caused by different ways of writing a name and ensures that, as far as possible, the same author's name written in different ways is not split into two different groups, improving the accuracy of name disambiguation.
  • This application comprehensively considers three factors, namely the name tree, the heterogeneous network of association relationships, and the semantic representation of the paper, to construct a similarity matrix. This expands the range of information referenced during disambiguation and further improves the accuracy of the one-to-one correspondence between papers and authors.
  • Through clustering, all papers related to a designated author are collected into a paper cluster group, and the relationship between the paper cluster group corresponding to the author to be disambiguated and the paper cluster group corresponding to the designated author is then used to determine whether the two are the same author.
  • If the paper cluster group corresponding to the author to be disambiguated is included in the paper cluster group corresponding to the designated author, the two are determined to be the same author; otherwise they are not. Different authors can thus be distinguished and the purpose of disambiguation achieved, namely a precise and unique correspondence between each paper and its author's name.
  • The step S1 of forming the names of the authors involved in all the papers in the database into name trees according to preset rules includes:
  • S11 Order the designated name alphabetically by initial letter and divide it, from front to back, into a first part and a second part, where the designated name is any one of the names of the authors involved in all the papers in the database;
  • S12 Combine the first letters of the first part and the second part as the first name, use the first word of the first part as the second name, use the first word of the second part as the third name, use the remainder of the first part other than its first word as the fourth name, and use the remainder of the second part other than its first word as the fifth name;
  • S13 Form a first branch corresponding to the second name according to the fourth name, and form a second branch corresponding to the third name according to the fifth name;
  • S14 Link the first branch and the second branch with the first name as the root to form the name tree corresponding to the designated name.
  • The database of this application includes a large number of paper texts.
  • Before disambiguation, the names of the authors involved in all the papers are clustered and divided, and the names that may belong to the same natural person are associated together with the related documents.
  • This application aims to accurately identify which author names belong to the same natural person.
  • The author's name is first preprocessed as follows: the name is divided into several components according to name composition rules. A name generally consists of two parts, the surname and the given name.
  • The step S13 of forming the first branch corresponding to the second name according to the fourth name and forming the second branch corresponding to the third name according to the fifth name includes:
  • S131 Obtain each first name combination that meets a preset similarity with the fourth name, and obtain each second name combination that meets the preset similarity with the fifth name;
  • S132 Connect each of the first name combinations in parallel to the second name to form the first branch, and connect each of the second name combinations in parallel to the third name to form the second branch.
  • When constructing the name tree in this application, the author's name is first divided into blocks according to l2_name and l3_name, and each block then forms one branch of the name tree. Figure 3 shows the name tree formed according to l2_name and l3_name for the author name Ferrari Marquez.
  • For example, Ferrari, Juan cruz may be the author name of the same natural person as Ferrari Luis, Juan cruz, only written differently, because the branch of one name is a sub-branch of the branch of the other.
  • The sub-branches of the name tree are constructed according to the inclusion relationships of l4_name and l5_name.
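  • The step S3 of obtaining the paper semantic representations corresponding to all papers in the database includes:
  • S31 Obtain the title content and abstract content of a designated paper, where the designated paper is any one of all the papers in the database;
  • S32 Obtain, through word2vec, the semantic representation vector corresponding to each word in the title content and the abstract content;
  • S33 Calculate the average of the semantic representation vectors corresponding to the title content and the abstract content from the semantic representation vector of each word;
  • S34 Use the average of the semantic representation vectors as the paper semantic representation of the designated paper.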
  • This application converts the content of each paper into a semantic representation vector through word2vec and uses it to calculate the semantic similarity between papers, thereby constructing a semantic similarity matrix for paper classification.
  • word2vec is used to obtain the vector corresponding to each word in the title content and abstract content of the same paper, and the vectors are then arranged in the original word order to form the paper's semantic representation vector.
  • In this way the semantic focus of the paper is more concentrated and accurate, and the resulting semantic representation is more relevant to the content of the paper, which improves its accuracy.
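  • The step S2 of obtaining the heterogeneous network of association relationships corresponding to all the papers in the database includes:
  • S21 Obtain the authors and collaborators included in each paper, together with the work organization information of each first author, as the paper node types of the heterogeneous network of association relationships;
  • S22 Compare the papers in the database in pairs, and determine whether the number of common words in the work organization information of the first authors of the two papers exceeds a first preset number, and whether the number of co-authors shared among the first authors and collaborators of the two papers exceeds a second preset number;
  • S23 If so, link the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper organization, and link the nodes corresponding to two papers whose number of co-authors exceeds the second preset number to form an edge corresponding to the paper co-authors;
  • S24 Form the heterogeneous network of association relationships from the paper node types, the edges corresponding to each paper organization, and the edges corresponding to the co-authors of each paper.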
  • This application uses the meta-path method for heterogeneous networks to characterize the relationships between the first author and collaborators and between the first author and the work organization, forming a relationship similarity matrix.
  • The node types used in the heterogeneous network of association relationships include the first author and collaborators of the same paper, and the work organization information of the first author to be disambiguated.
  • The work organization information includes, but is not limited to, the name of the work organization.
  • There are two kinds of edges between papers: one is the edge corresponding to the paper organization, and the other is the edge corresponding to the paper co-authors.
  • The degree of the edge corresponding to the paper organization is the number of common words, and the degree of the edge corresponding to the paper co-authors is the number of co-authors.
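The pairwise comparison in step S22 and the two edge kinds described above can be sketched in Python. This is a minimal illustration, not the application's implementation: the paper records, the function name `build_edges`, and the threshold values are hypothetical, and splitting the organization name on whitespace is an assumed way of counting common words.

```python
from itertools import combinations

# Hypothetical records for papers that share one ambiguous first-author name.
papers = {
    "p1": {"org": "Institute of Computing Shenzhen",
           "collaborators": {"Li Wei", "Ana Costa"}},
    "p2": {"org": "Shenzhen Institute of Computing",
           "collaborators": {"Li Wei", "Ana Costa", "Tom Reed"}},
    "p3": {"org": "Department of Physics Lisbon",
           "collaborators": {"Ana Costa"}},
}

def build_edges(papers, first_preset_number, second_preset_number):
    """Compare papers in pairs and return the two weighted edge sets."""
    org_edges, coauthor_edges = {}, {}
    for id_a, id_b in combinations(sorted(papers), 2):
        a, b = papers[id_a], papers[id_b]
        # Common words in the first authors' work organization information.
        common_words = set(a["org"].lower().split()) & set(b["org"].lower().split())
        # Collaborators that appear on both papers.
        shared = a["collaborators"] & b["collaborators"]
        if len(common_words) > first_preset_number:
            org_edges[(id_a, id_b)] = len(common_words)   # edge degree
        if len(shared) > second_preset_number:
            coauthor_edges[(id_a, id_b)] = len(shared)    # edge degree
    return org_edges, coauthor_edges

org_edges, coauthor_edges = build_edges(
    papers, first_preset_number=2, second_preset_number=1)
```

With these assumed thresholds only p1 and p2 are linked, by both an organization edge and a co-author edge; p3 shares a single word and a single collaborator with the others, which does not exceed either threshold.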
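  • The step S4 of constructing a similarity matrix based on the name tree, the heterogeneous network of association relationships, and the semantic representation of the paper includes:
  • S41 Form the first core object of the similarity matrix according to the name tree of the author to be disambiguated;
  • S42 Obtain, through the meta-path random walk strategy and according to a preset path length, all paths in the heterogeneous network of association relationships that include papers of the author to be disambiguated, as the second core object of the similarity matrix;
  • S43 Use the semantic representations of all the papers of the author to be disambiguated as the third core object of the similarity matrix;
  • S44 Integrate the first core object, the second core object, and the third core object to form the similarity matrix corresponding to the author to be disambiguated.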
  • The similarity matrix of this application includes three parts: the similarity matrix corresponding to the semantic representation of the paper, the hierarchical similarity matrix corresponding to the name tree, and the relationship similarity matrix from the heterogeneous network of association relationships.
  • The similarity matrices obtained from these different influencing factors are integrated to evaluate whether the author to be disambiguated and the currently designated author are the same natural person, which improves the accuracy of disambiguation and makes the unique correspondence between paper and author clearer, more precise, and more specific.
  • Through the similarity matrix corresponding to the semantic representation, it can be considered whether the author to be disambiguated belongs to the same research field as the currently designated author; through the hierarchical similarity matrix corresponding to the name tree, whether the names of the two belong to the same name tree; and through the relationship similarity matrix, whether the two have close relationship information.
  • The relationship information includes, but is not limited to, whether most of the collaborators are the same and whether the work organization is the same.
  • To build the relationship similarity matrix, paths are sampled in the heterogeneous network of association relationships starting from any paper, and the meta-path random walk strategy is used to form paths containing the node information of the author to be disambiguated, which are aggregated into the relationship similarity matrix.
  • The path lengths are all set to the same value, such as 10 or 20, and the embedding of the paper corresponding to each path is then formed through network embedding.
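The fixed-length path sampling can be sketched as a meta-path guided random walk. This is only an illustration under stated assumptions: the text does not specify the meta-path, so alternating between co-author edges and organization edges is assumed here, and the graph, the function name `meta_path_walk`, and the seed are hypothetical. The network-embedding step that would consume the sampled paths is not shown.

```python
import random

# Hypothetical paper graph: for each paper, its neighbours per edge kind.
graph = {
    "p1": {"coauthor": ["p2"], "org": ["p2", "p3"]},
    "p2": {"coauthor": ["p1"], "org": ["p1"]},
    "p3": {"coauthor": [], "org": ["p1"]},
}

def meta_path_walk(graph, start, meta_path, length, rng):
    """Sample one path of at most `length` papers, following the edge
    kinds listed in `meta_path` in a repeating cycle."""
    path = [start]
    current = start
    step = 0
    while len(path) < length:
        edge_kind = meta_path[step % len(meta_path)]
        neighbours = graph[current][edge_kind]
        if not neighbours:
            break  # dead end: no edge of the required kind
        current = rng.choice(neighbours)
        path.append(current)
        step += 1
    return path

rng = random.Random(0)  # seeded so that the sketch is repeatable
walks = [meta_path_walk(graph, paper, ("coauthor", "org"), 10, rng)
         for paper in graph]
```

A walk stops early when the current paper has no edge of the kind the meta-path requires next, which is why the walk starting at p3 contains only p3 itself.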
  • The step S5 of clustering the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database includes:
  • S51 Obtain, according to a density clustering algorithm, the density-reachable papers corresponding to the first core object, the second core object, and the third core object respectively;
  • S52 Collect the density-reachable papers corresponding to the first core object, the second core object, and the third core object into the paper cluster group.
  • This application uses the density clustering method DBSCAN to cluster the similarity matrix. It does not require the number of author names to be determined in advance and needs little prior knowledge, which makes calculation and processing convenient.
  • The density clustering algorithm of this application determines the paper clusters corresponding to the three core objects through the principle of density reachability and then combines them to form the paper cluster groups.
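As a rough illustration of density-reachability clustering, the following is a simplified textbook DBSCAN over a precomputed distance matrix, not the implementation of this application. Converting similarity to distance as 1 - similarity, the parameter values `eps` and `min_pts`, and the toy similarity matrix are all assumptions made for the sketch.

```python
def dbscan(distance, eps, min_pts):
    """Minimal DBSCAN on a precomputed distance matrix (list of lists).

    Returns one label per point: 0, 1, ... for clusters, -1 for noise.
    """
    n = len(distance)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # A point counts as its own neighbour, as in the usual definition.
        return [j for j in range(n) if distance[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is a core object: expand
                queue.extend(j_neighbours)
    return labels

# Toy similarity matrix for six papers (invented values).
similarity = [
    [1.0, 0.9, 0.8, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.1, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 1.0],
]
distance = [[1.0 - s for s in row] for row in similarity]
labels = dbscan(distance, eps=0.3, min_pts=2)
```

The number of clusters is not passed in; it falls out of the density parameters. In this toy input the last paper is similar to nothing else and is left as noise.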
  • The method also includes:
  • S503 Classify the outlier papers into the paper cluster group corresponding to the maximum similarity value.
  • After clustering the papers in the database, this application judges whether there are outliers, that is, outlier papers that do not belong to any paper cluster group. If there is an outlier, it is merged, according to the maximum similarity value, into the paper cluster group containing the paper most similar to it. This enables the papers in the database to form correspondences with the authors of each name tree and enlarges the range of papers used for disambiguation, avoiding gaps in disambiguation.
  • Classifying the outlier papers into the paper cluster group with the largest similarity value allows the paper corresponding to each outlier to find its corresponding author.
  • An outlier paper is classified into the paper cluster group corresponding to the maximum similarity value when that similarity value is greater than a preset threshold.
  • Otherwise, this part of the data is defined as noise data and is either discarded or passed to manual analysis and correction, such as fixing typographical errors made when the information was entered, or verifying the identity of a person who has changed their name so as to confirm that the names designate the same natural person, so that the outlier papers can be classified more accurately.
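The outlier handling can be sketched as follows. The function name `assign_outlier`, the data layout, and the threshold value are hypothetical; the sketch only shows the decision rule of joining the cluster of the most similar paper when the similarity exceeds a threshold and otherwise treating the paper as noise.

```python
def assign_outlier(similarity_to_papers, labels, threshold):
    """Return the cluster of the most similar clustered paper,
    or None when the outlier should be treated as noise."""
    best_paper = max(similarity_to_papers, key=similarity_to_papers.get)
    if similarity_to_papers[best_paper] > threshold:
        return labels[best_paper]
    return None  # noise: discard or hand over to manual review

# Invented similarities between one outlier paper and clustered papers.
similarity_to_papers = {"p1": 0.2, "p2": 0.9, "p3": 0.4}
labels = {"p1": 0, "p2": 1, "p3": 1}

assigned = assign_outlier(similarity_to_papers, labels, threshold=0.5)
rejected = assign_outlier(similarity_to_papers, labels, threshold=0.95)
```

Here the outlier joins cluster 1 under the lower threshold and is left as noise under the stricter one.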
  • The disambiguation device for paper authors of an embodiment of the present application includes:
  • the forming module 1, used to form the names of the authors involved in all the papers in the database into name trees according to preset rules;
  • the first obtaining module 2, configured to obtain the heterogeneous network of association relationships corresponding to all the papers in the database, where the heterogeneous network of association relationships includes an association relationship between authors and collaborators and an association relationship between authors and institutions;
  • the second acquisition module 3, used to acquire the paper semantic representations corresponding to all papers in the database;
  • the construction module 4, used to construct a similarity matrix based on the name tree, the heterogeneous network of association relationships, and the semantic representation of the paper;
  • the clustering module 5, used to cluster the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database;
  • the first judgment module 6, used to judge whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to the designated author, where the designated author is any one of the authors involved in all the papers in the database;
  • the determining module 7, used to determine that the author to be disambiguated is different from the designated author if the paper cluster group does not belong to the paper cluster group corresponding to the designated author.
  • The forming module 1 includes:
  • a splitting unit, used to split the designated name at the writing separator, with the parts ordered alphabetically, into a first part and a second part from front to back, where the designated name is any one of the names of the authors involved in all the papers in the database;
  • a combination unit, used to combine the first letters of the first part and the second part as the first name, use the first word of the first part as the second name, use the first word of the second part as the third name, use the remainder of the first part other than its first word as the fourth name, and use the remainder of the second part other than its first word as the fifth name;
  • a first forming unit, used to form a first branch corresponding to the second name according to the fourth name, and to form a second branch corresponding to the third name according to the fifth name;
  • a first linking unit, used to link the first branch and the second branch with the first name as the root to form the name tree corresponding to the designated name.
  • For example, Ferrari Marquez before the comma is taken as the first part and Juan Luis after the comma as the second part. The first letters of the two parts are combined as F_J, which serves as the first name, that is, l1_name in the figure, because F comes before J in the English alphabet. The first word of the first part, Ferrari, is called l2_name; the remaining part of the first part, Marquez, is called l4_name; the first word of the second part, Juan, is called l3_name; and the remaining part of the second part, Luis, is called l5_name.
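The decomposition of Ferrari Marquez, Juan Luis into l1_name through l5_name can be sketched in Python. This is a minimal illustration that assumes a single comma is the writing separator; the function name `split_author_name` and the dictionary layout are hypothetical.

```python
def split_author_name(name: str) -> dict:
    """Split a name such as 'Ferrari Marquez, Juan Luis' into five components.

    Assumes exactly one comma separates two non-empty parts (an assumption
    of this sketch; other separators would need extra handling).
    """
    parts = [p.strip() for p in name.split(",", 1)]
    # Order the two parts alphabetically by initial letter, so that the same
    # name written as 'A, B' or 'B, A' yields the same components.
    first_part, second_part = sorted(parts, key=lambda p: p[:1].upper())
    first_words = first_part.split()
    second_words = second_part.split()
    return {
        "l1_name": f"{first_words[0][0].upper()}_{second_words[0][0].upper()}",
        "l2_name": first_words[0],              # first word of the first part
        "l3_name": second_words[0],             # first word of the second part
        "l4_name": " ".join(first_words[1:]),   # rest of the first part
        "l5_name": " ".join(second_words[1:]),  # rest of the second part
    }

components = split_author_name("Ferrari Marquez, Juan Luis")
swapped = split_author_name("Juan Luis, Ferrari Marquez")
```

Because the parts are ordered alphabetically before the components are extracted, both spellings above produce the same root F_J and the same branches.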
  • The first forming unit includes:
  • an obtaining subunit, configured to obtain each first name combination that meets a preset similarity with the fourth name, and to obtain each second name combination that meets the preset similarity with the fifth name;
  • a forming subunit, used to connect each of the first name combinations in parallel to the second name to form the first branch, and to connect each of the second name combinations in parallel to the third name to form the second branch.
  • The second acquisition module 3 includes:
  • a first obtaining unit, used to obtain the title content and abstract content of a designated paper, where the designated paper is any one of all the papers in the database;
  • a second acquiring unit, configured to acquire, through word2vec, the semantic representation vector corresponding to each word in the title content and the abstract content;
  • a calculation unit, configured to calculate the average of the semantic representation vectors corresponding to the title content and the abstract content from the semantic representation vector of each word;
  • a first unit, used to take the average of the semantic representation vectors as the paper semantic representation of the designated paper.
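The averaging performed by the calculation unit can be sketched in Python. To keep the example self-contained, a small hand-written dictionary of toy vectors stands in for a trained word2vec model; the words, the vector values, and the function names are invented, and the use of cosine similarity to compare two papers is an assumption, since the text does not name a similarity measure.

```python
import math

# Toy 3-dimensional vectors standing in for a trained word2vec model.
TOY_VECTORS = {
    "author": [1.0, 0.0, 0.0],
    "name": [0.0, 1.0, 0.0],
    "disambiguation": [0.0, 0.0, 1.0],
    "clustering": [0.0, 1.0, 1.0],
}

def paper_vector(title: str, abstract: str, vectors: dict) -> list:
    """Average the vectors of all known words in the title and abstract."""
    words = (title + " " + abstract).lower().split()
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no word of the paper is in the vocabulary")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

p1 = paper_vector("Author name disambiguation", "clustering author name", TOY_VECTORS)
p2 = paper_vector("Name clustering", "disambiguation", TOY_VECTORS)
similarity = cosine_similarity(p1, p2)
```

Computing such a value for every pair of papers would fill one semantic similarity matrix; with a real model the toy dictionary would be replaced by lookups into the trained word vectors.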
  • The first obtaining module 2 includes:
  • a third obtaining unit, used to obtain the authors and collaborators included in each paper, together with the work organization information of each first author, as the paper node types of the heterogeneous network of association relationships;
  • a comparison unit, used to compare the papers in the database in pairs, and to determine whether the number of common words in the work organization information of the first authors of the two papers exceeds a first preset number, and whether the number of co-authors shared among the first authors and collaborators of the two papers exceeds a second preset number;
  • a second linking unit, used to, if so, link the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper organization, and link the nodes corresponding to two papers whose number of co-authors exceeds the second preset number to form an edge corresponding to the paper co-authors;
  • a second forming unit, used to form the heterogeneous network of association relationships from the paper node types corresponding to each first author and collaborator and the work organization information of each first author, the edges corresponding to each paper organization, and the edges corresponding to the co-authors of each paper.
  • The construction module 4 includes:
  • a third forming unit, used to form the first core object of the similarity matrix according to the name tree of the author to be disambiguated;
  • a fourth acquiring unit, configured to acquire, through the meta-path random walk strategy and according to a preset path length, all paths in the heterogeneous network of association relationships that include papers of the author to be disambiguated, as the second core object of the similarity matrix;
  • a second unit, used to take the semantic representations of all the papers of the author to be disambiguated as the third core object of the similarity matrix;
  • a fourth forming unit, used to integrate the first core object, the second core object, and the third core object to form the similarity matrix corresponding to the author to be disambiguated.
  • the similarity matrix of this application includes three parts, which are the similarity matrix corresponding to the semantic representation of the paper, the hierarchical similarity matrix corresponding to the name tree, and the relationship similarity matrix in the heterogeneous network of association relations.
  • the similarity matrices obtained from these different influencing factors are used together to evaluate whether the author to be disambiguated and the currently designated author are the same natural person, which improves the accuracy of disambiguation and makes the unique correspondence between papers and authors clearer, more precise, and more specific.
  • through the similarity matrix corresponding to the semantic representation, it can be considered whether the author to be disambiguated belongs to the same research field as the currently designated author; through the hierarchical similarity matrix corresponding to the name tree, whether the names of the author to be disambiguated and the currently designated author belong to the same name tree; and through the relationship similarity matrix, whether the author to be disambiguated and the currently designated author have closely matching relationship information.
  • the above-mentioned relationship information includes, but is not limited to, whether most of the collaborators are the same and whether the work organization is the same.
  • the relationship similarity matrix of the present application is obtained by sampling paths in the heterogeneous network of association relationships with any paper as the starting point, using the meta-path random walk strategy to form paths containing the node information of the author to be disambiguated, and aggregating those paths into the relationship similarity matrix.
  • the above path length is set to the same value, such as 10, 20, etc., and then the embedding of the paper corresponding to each path is formed through network embedding.
  • the clustering module 5 includes:
  • a fifth acquiring unit, configured to acquire, according to a density clustering algorithm, the density-reachable papers corresponding to the first core object, the second core object, and the third core object, respectively;
  • the collection unit is configured to collect the density-reachable papers corresponding to the first core object, the second core object, and the third core object, respectively, into the paper cluster group.
  • This application uses the density clustering method DBSCAN to perform clustering calculations on the similarity matrix, without the need to determine the number of author names in advance, and requires fewer priors, which is convenient for calculation and processing.
  • the density clustering algorithm of this application determines the paper clusters corresponding to the three core objects respectively through the calculation principle of density reachability, and then combines them to form the paper cluster groups.
  • the disambiguation device of the author of the paper includes:
  • the second judgment module is used to judge whether there are outlier papers;
  • the calculation module is used to calculate the similarity between the outlier paper and each of the paper cluster groups if there is an outlier paper;
  • the classification module is used to classify outlier papers into the paper cluster group corresponding to the maximum similarity value.
  • This application judges whether, after the database papers are clustered, there are outlier points corresponding to outlier papers that do not belong to any paper cluster group. If there is an outlier point, it is merged, by maximum similarity value, into the paper cluster group where its most similar papers are located.
  • this enables the papers in the database to form a corresponding relationship with the authors of each name tree, and widens the range of database papers covered by disambiguation to avoid disambiguation loopholes.
  • This application classifies the outlier papers into the corresponding paper cluster group when the similarity value is the largest, so that the paper corresponding to the outlier can find the corresponding relationship with the author.
  • in other embodiments, the outlier paper is classified into the paper cluster group corresponding to the maximum similarity value only when that value reaches the preset threshold.
  • when the maximum similarity value is less than the preset threshold, this part of the data is defined as noise data and discarded, or manual analysis and correction are applied to the noise data, such as correcting clerical errors made when entering information, or verifying the identity of a person who has changed their name and completing proof that they are the same designated natural person, so that the outlier papers corresponding to the outlier points can be classified more accurately.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 5.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium.
  • the database of the computer equipment is used to store all the data needed for the disambiguation process of the author of the paper.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize the disambiguation method of the author of the paper.
  • the disambiguation method for paper authors executed by the processor includes: forming name trees, according to preset rules, from the author names involved in all the papers in the database; obtaining the heterogeneous network of association relationships corresponding to all the papers in the database, wherein the heterogeneous network of association relationships includes the association relationships between authors and collaborators and the association relationships between authors and institutions; obtaining the semantic representations of the papers corresponding to all the papers in the database; constructing a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the semantic representations of the papers; clustering the similarity matrix to obtain the paper cluster groups corresponding to all the papers in the database; determining whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to the designated author, wherein the designated author is any one of all authors involved in all the papers in the database; and if not, determining that the author to be disambiguated is different from the designated author.
  • the above-mentioned computer equipment preprocesses the author’s name, including disassembling the name into different constituent blocks, and then constructs a name tree according to each constituent block through the included relationship, and then forms a hierarchical matrix corresponding to the author’s information according to the name tree. It eliminates the clustering error caused by different expressions in name writing, and ensures that when the name of the same author is expressed by different writing methods, it is not separated into two different groups as much as possible, and the accuracy of name disambiguation is improved.
  • the step in which the above-mentioned processor forms name trees, according to preset rules, from the names of the authors involved in all the papers in the database includes: splitting a designated name, according to the writing separator and in the order of the initial letters in the English alphabet, from front to back into a first part and a second part, wherein the designated name is any one of the names of the authors involved in all the papers in the database; combining the initial letters of the first part and the second part into the first name, taking the first word of the first part as the second name, taking the first word of the second part as the third name, taking the remainder of the first part other than its first word as the fourth name, and taking the remainder of the second part other than its first word as the fifth name; forming the first branch corresponding to the second name according to the fourth name, and forming the second branch corresponding to the third name according to the fifth name; and using the first name as the root directory to link the first branch and the second branch, forming the name tree corresponding to the designated name.
  • the step in which the processor forms the first branch corresponding to the second name according to the fourth name, and forms the second branch corresponding to the third name according to the fifth name, includes: obtaining each first name combination that meets the preset similarity with the fourth name, and obtaining each second name combination that meets the preset similarity with the fifth name; connecting the first name combinations in parallel to the second name to form the first branch, and connecting the second name combinations in parallel to the third name to form the second branch.
  • the step in which the above-mentioned processor obtains the semantic representations of the papers corresponding to all the papers in the database includes: obtaining the title content and abstract content of a specified paper, wherein the specified paper is any one of all the papers in the database; obtaining, through word2vec, the semantic representation vector corresponding to each word in the title content and abstract content; calculating, from the semantic representation vectors of the words, the average value of the semantic representation vectors corresponding to the title content and the abstract content respectively; and using the average value of the semantic representation vectors as the semantic representation of the specified paper.
  • the step in which the above-mentioned processor obtains the heterogeneous network of association relationships corresponding to all the papers in the database includes: obtaining the first author and collaborators included in each paper, and the work organization information of each first author, as the paper node types of the heterogeneous network of association relationships; comparing the papers in the database pairwise to determine whether the number of common words in the work organization information of the first authors of the two papers exceeds the first preset number, and whether the number of co-authors shared among the first authors and collaborators of the two papers exceeds the second preset number; if so, linking the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper organization, and linking the nodes corresponding to two papers whose number of co-authors exceeds the second preset number to form an edge corresponding to the co-authors of the paper; and forming the heterogeneous network of association relationships based on the paper node types corresponding to the first authors and collaborators and to the work organization information of the first authors, the edges corresponding to the paper organizations, and the edges corresponding to the paper co-authors.
  • the step in which the above-mentioned processor constructs a similarity matrix based on the name tree, the heterogeneous network of association relations, and the semantic representation of the paper includes: forming the first core object of the similarity matrix according to the name tree of the author to be disambiguated; obtaining, according to the preset path length and through the meta-path random walk strategy, all paths in the heterogeneous network of association relations that include papers of the author to be disambiguated, as the second core object of the similarity matrix; taking the semantic representations of all the papers of the author to be disambiguated as the third core object of the similarity matrix; and integrating the first core object, the second core object, and the third core object to form the similarity matrix corresponding to the author to be disambiguated.
  • the step in which the above-mentioned processor clusters the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database includes: obtaining, according to a density clustering algorithm, the density-reachable papers corresponding to the first core object, the second core object, and the third core object, respectively; and collecting those density-reachable papers into the paper cluster group.
  • FIG. 5 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile, and has a computer program stored thereon; when the computer program is executed by a processor, the disambiguation method for paper authors is realized, including: forming name trees, according to preset rules, from the names of authors involved in all papers in the database; obtaining a heterogeneous network of association relationships corresponding to all papers in the database, wherein the heterogeneous network of association relationships includes the association relationships between authors and collaborators and the association relationships between authors and institutions; obtaining the semantic representations of the papers corresponding to all the papers in the database; constructing a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the semantic representations of the papers; clustering the similarity matrix to obtain the paper cluster groups corresponding to all the papers in the database; determining whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to the designated author, wherein the designated author is any one of all authors involved in all the papers in the database; and if not, determining that the author to be disambiguated is different from the designated author.
  • the above-mentioned computer-readable storage medium preprocesses the author's name, which includes disassembling the name into different component blocks, constructing a name tree from the component blocks through containment relationships, and then forming the hierarchical matrix corresponding to the author information according to the name tree. This eliminates the clustering errors caused by different written forms of a name and ensures that, when the name of the same author is written in different ways, it is as far as possible not separated into two different groups, improving the accuracy of name disambiguation.
  • the step in which the above-mentioned processor forms name trees, according to preset rules, from the names of the authors involved in all the papers in the database includes: splitting a designated name, according to the writing separator and in the order of the initial letters in the English alphabet, from front to back into a first part and a second part, wherein the designated name is any one of the names of the authors involved in all the papers in the database; combining the initial letters of the first part and the second part into the first name, taking the first word of the first part as the second name, taking the first word of the second part as the third name, taking the remainder of the first part other than its first word as the fourth name, and taking the remainder of the second part other than its first word as the fifth name; forming the first branch corresponding to the second name according to the fourth name, and forming the second branch corresponding to the third name according to the fifth name; and using the first name as the root directory to link the first branch and the second branch, forming the name tree corresponding to the designated name.
  • the step in which the processor forms the first branch corresponding to the second name according to the fourth name, and forms the second branch corresponding to the third name according to the fifth name, includes: obtaining each first name combination that meets the preset similarity with the fourth name, and obtaining each second name combination that meets the preset similarity with the fifth name; connecting the first name combinations in parallel to the second name to form the first branch, and connecting the second name combinations in parallel to the third name to form the second branch.
  • the step in which the above-mentioned processor obtains the semantic representations of the papers corresponding to all the papers in the database includes: obtaining the title content and abstract content of a specified paper, wherein the specified paper is any one of all the papers in the database; obtaining, through word2vec, the semantic representation vector corresponding to each word in the title content and abstract content; calculating, from the semantic representation vectors of the words, the average value of the semantic representation vectors corresponding to the title content and the abstract content respectively; and using the average value of the semantic representation vectors as the semantic representation of the specified paper.
  • the step in which the above-mentioned processor obtains the heterogeneous network of association relationships corresponding to all the papers in the database includes: obtaining the first author and collaborators included in each paper, and the work organization information of each first author, as the paper node types of the heterogeneous network of association relationships; comparing the papers in the database pairwise to determine whether the number of common words in the work organization information of the first authors of the two papers exceeds the first preset number, and whether the number of co-authors shared among the first authors and collaborators of the two papers exceeds the second preset number; if so, linking the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper organization, and linking the nodes corresponding to two papers whose number of co-authors exceeds the second preset number to form an edge corresponding to the co-authors of the paper; and forming the heterogeneous network of association relationships based on the paper node types corresponding to the first authors and collaborators and to the work organization information of the first authors, the edges corresponding to the paper organizations, and the edges corresponding to the paper co-authors.
  • the step in which the above-mentioned processor constructs a similarity matrix based on the name tree, the heterogeneous network of association relationships, and the semantic representation of the paper includes: forming the first core object of the similarity matrix according to the name tree of the author to be disambiguated; obtaining, according to the preset path length and through the meta-path random walk strategy, all paths in the heterogeneous network of association relations that include papers of the author to be disambiguated, as the second core object of the similarity matrix; taking the semantic representations of all the papers of the author to be disambiguated as the third core object of the similarity matrix; and integrating the first core object, the second core object, and the third core object to form the similarity matrix corresponding to the author to be disambiguated.
  • the step in which the above-mentioned processor clusters the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database includes: obtaining, according to a density clustering algorithm, the density-reachable papers corresponding to the first core object, the second core object, and the third core object, respectively; and collecting those density-reachable papers into the paper cluster group.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

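A disambiguation method for paper authors, comprising: forming name trees, according to preset rules, from the author names involved in all papers in a database (S1); obtaining a heterogeneous network of association relationships corresponding to all papers in the database (S2); obtaining the paper semantic representation corresponding to each paper in the database (S3); constructing a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the paper semantic representations (S4); clustering the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database (S5); determining whether the paper cluster group corresponding to an author to be disambiguated belongs to the paper cluster group corresponding to a designated author (S6); and if not, determining that the author to be disambiguated is different from the designated author (S7). Author names are preprocessed to build name trees, which eliminates the clustering errors caused by different written forms of a name, keeps the names of the same author in the same group as far as possible, and improves the accuracy of name disambiguation.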

Description

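Disambiguation method, device, and computer device for paper authors
This application claims priority to Chinese patent application No. 202010740289.6, entitled "Disambiguation method, device, and computer device for paper authors", filed with the China Patent Office on July 28, 2020, the entire contents of which are incorporated herein by reference.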
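Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a disambiguation method, device, and computer device for paper authors.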
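Background
Paper databases contain an enormous number of papers, and each paper usually involves more than one author. It is therefore difficult to form, from the database, a unique academic ID for each author that would put the papers in the database into unique correspondence with the natural persons who wrote them, distinguish the papers of authors who share a name, and improve retrieval accuracy. However, the inventors realized that existing implementations require a high degree of author participation, such as uploading papers and maintaining personal information. Authors show little enthusiasm for this, so such schemes are hard to roll out, the database information is hard to keep complete, and the correspondence between papers and author names in the database does not reach a usable level.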
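Technical Problem
The main purpose of this application is to provide a disambiguation method for paper authors, aiming to solve the technical problem that the correspondence between papers and author names in a database does not reach a usable level.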
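Technical Solution
This application proposes a disambiguation method for paper authors, including:
forming name trees, according to preset rules, from the author names involved in all papers in a database;
obtaining the heterogeneous network of association relationships corresponding to all papers in the database, wherein the heterogeneous network of association relationships includes author-collaborator association relationships and author-institution association relationships;
obtaining the paper semantic representation corresponding to each paper in the database;
constructing a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the paper semantic representations;
clustering the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database;
determining whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to a designated author, wherein the designated author is any one of all authors involved in all papers in the database;
if not, determining that the author to be disambiguated is different from the designated author.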
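This application also provides a disambiguation device for paper authors, including:
a forming module, configured to form name trees, according to preset rules, from the author names involved in all papers in a database;
a first acquiring module, configured to obtain the heterogeneous network of association relationships corresponding to all papers in the database, wherein the heterogeneous network of association relationships includes author-collaborator association relationships and author-institution association relationships;
a second acquiring module, configured to obtain the paper semantic representation corresponding to each paper in the database;
a building module, configured to construct a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the paper semantic representations;
a clustering module, configured to cluster the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database;
a first judgment module, configured to determine whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to a designated author, wherein the designated author is any one of all authors involved in all papers in the database;
a determination module, configured to determine that the author to be disambiguated is different from the designated author if that paper cluster group does not belong to the paper cluster group corresponding to the designated author.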
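This application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements a disambiguation method for paper authors, including:
forming name trees, according to preset rules, from the author names involved in all papers in a database;
obtaining the heterogeneous network of association relationships corresponding to all papers in the database, wherein the heterogeneous network of association relationships includes author-collaborator association relationships and author-institution association relationships;
obtaining the paper semantic representation corresponding to each paper in the database;
constructing a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the paper semantic representations;
clustering the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database;
determining whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to a designated author, wherein the designated author is any one of all authors involved in all papers in the database;
if not, determining that the author to be disambiguated is different from the designated author.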
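This application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements a disambiguation method for paper authors, including:
forming name trees, according to preset rules, from the author names involved in all papers in a database;
obtaining the heterogeneous network of association relationships corresponding to all papers in the database, wherein the heterogeneous network of association relationships includes author-collaborator association relationships and author-institution association relationships;
obtaining the paper semantic representation corresponding to each paper in the database;
constructing a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the paper semantic representations;
clustering the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database;
determining whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to a designated author, wherein the designated author is any one of all authors involved in all papers in the database;
if not, determining that the author to be disambiguated is different from the designated author.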
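Beneficial Effects
This application preprocesses author names, which includes breaking each name into separate component blocks, building a name tree from the blocks through containment relationships, and then forming a hierarchical matrix corresponding to the author information from the name tree. This eliminates the clustering errors caused by different written forms of a name and ensures that, when the name of the same author is written in different ways, it is as far as possible not split into two different groups, improving the accuracy of name disambiguation.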
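Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a disambiguation method for paper authors according to an embodiment of this application;
FIG. 2 is a schematic diagram of the components of an author name according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of the name tree of an author name according to an embodiment of this application;
FIG. 4 is a schematic structural diagram of a disambiguation device for paper authors according to an embodiment of this application;
FIG. 5 is a schematic diagram of the internal structure of a computer device according to an embodiment of this application.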
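Best Mode for Carrying Out the Invention
Referring to FIG. 1, a disambiguation method for paper authors according to this embodiment includes:
S1: forming name trees, according to preset rules, from the author names involved in all papers in a database;
S2: obtaining the heterogeneous network of association relationships corresponding to all papers in the database, wherein the heterogeneous network of association relationships includes author-collaborator association relationships and author-institution association relationships;
S3: obtaining the paper semantic representation corresponding to each paper in the database;
S4: constructing a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the paper semantic representations;
S5: clustering the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database;
S6: determining whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to a designated author, wherein the designated author is any one of all authors involved in all papers in the database;
S7: if not, determining that the author to be disambiguated is different from the designated author.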
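The preset rules for forming a name tree in this application include preprocessing the author name: the name is broken into separate component blocks, and the blocks are then associated through containment relationships to build the name tree. A hierarchical matrix corresponding to the author information is then formed from the name tree. This eliminates the clustering errors caused by different written forms of a name and ensures that, when the name of the same author is written in different ways, it is as far as possible not split into two different groups, improving the accuracy of name disambiguation. This application builds the similarity matrix by jointly considering three factors, namely the name tree, the heterogeneous network of association relationships, and the paper semantic representations, which widens the range of information consulted for disambiguation and further improves the accuracy of the one-to-one correspondence between papers and authors. Using the density-reachability principle of a density clustering algorithm, all papers related to the designated author are gathered into a paper cluster group; whether the author to be disambiguated and the designated author are the same author is then determined from the relationship between the paper cluster group of the author to be disambiguated and the paper cluster group of the designated author. For example, if the paper cluster group of the author to be disambiguated is contained in the paper cluster group of the designated author, the two are determined to be the same author; otherwise they are not. This distinguishes different authors, removes the ambiguity, and gives papers and author names an accurate, unique correspondence.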
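The decision in steps S6 and S7 can be read as a set-containment check. The following is a minimal sketch of that reading, not the implementation of this application: the function name `is_same_author`, the `labels` mapping from paper ID to cluster label, and the sample paper IDs are all assumptions introduced for illustration.

```python
def is_same_author(labels, candidate_papers, designated_papers):
    """Return True if the candidate's clusters are contained in the designated author's.

    labels maps a paper ID to its cluster label (hypothetical data layout).
    """
    candidate_clusters = {labels[p] for p in candidate_papers}
    designated_clusters = {labels[p] for p in designated_papers}
    # S6: containment test; an empty candidate set is treated as "not the same".
    return bool(candidate_clusters) and candidate_clusters <= designated_clusters


labels = {"p1": 0, "p2": 0, "p3": 1}
print(is_same_author(labels, ["p1"], ["p1", "p2"]))  # papers share a cluster group
print(is_same_author(labels, ["p3"], ["p1", "p2"]))  # S7: different authors
```

Treating an empty candidate set as "different" is a design choice made here so that an author with no clustered papers is never merged by default.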
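Further, step S1 of forming name trees, according to preset rules, from the author names involved in all papers in the database includes:
S11: splitting a designated name, according to the writing separator and in the order of the initial letters in the English alphabet, from front to back into a first part and a second part, wherein the designated name is any one of the author names involved in all papers in the database;
S12: combining the initial letters of the first part and the second part into a first name, taking the first word of the first part as a second name, taking the first word of the second part as a third name, taking the remainder of the first part other than its first word as a fourth name, and taking the remainder of the second part other than its first word as a fifth name;
S13: forming a first branch corresponding to the second name according to the fourth name, and forming a second branch corresponding to the third name according to the fifth name;
S14: taking the first name as the root, linking the first branch and the second branch to form the name tree corresponding to the designated name.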
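The database of this application contains a massive number of paper texts. To achieve an accurate and unique correspondence between paper texts and author names, the classification errors caused by different natural persons sharing the same author name, and by the same natural person writing their name in different ways, must be disambiguated. Before disambiguation, the author names involved in all papers are clustered into blocks, so that names that may belong to the same natural person, together with the related literature, are associated with each other. To accurately identify the author names of the same natural person, this application preprocesses author names as follows. First, the author name is divided into several components according to the rules of name composition. A name generally consists of a family name and a given name, and writing conventions differ between countries: in some the family name comes first, in others it comes last. For various reasons, the writing order of an author name may be reversed, for example "zhang, wei" and "wei, zhang"; in addition, according to incomplete statistics, abbreviated forms often appear in written names. To avoid recognition errors during classification, this application readjusts the rules for representing author names. As shown in FIG. 2, the family name and given name are arranged by the order of their initial letters in the English alphabet, without distinguishing which is the family name and which is the given name. In the figure, "Ferrari Marquez" before the comma is taken as the first part and "Juan Luis" after the comma as the second part, and the initial letters of the two parts are combined into F_J as the first name, l1_name in the figure, because F precedes J in the English alphabet. The first word of the first part, Ferrari, is called l2_name, and the remainder of the first part, Marquez, is called l4_name; the first word of the second part, Juan, is called l3_name, and the remainder, Luis, is called l5_name. This avoids the situations in name writing that cause author names to be misrecognized, including recognition errors during classification caused by reversed writing order, omitted middle names, abbreviated names, and similar writing problems, so that the author names of the same natural person are not split into different groups when written in different ways. Cases where the author name has substantively changed through human modification, such as misspellings or a change of family or given name, are excepted.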
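The decomposition into l1_name through l5_name can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation of this application: the function name `split_name` is hypothetical, the writing separator is assumed to be a comma, names are lowercased for comparison, and ties between equal initial letters are broken by comparing the whole part.

```python
def split_name(name):
    """Split an author name such as 'Ferrari Marquez, Juan Luis' into five blocks."""
    # Assumption: the writing separator is a comma and yields exactly two parts.
    parts = [p.strip() for p in name.split(",") if p.strip()]
    if len(parts) != 2:
        raise ValueError("expected exactly two comma-separated parts")
    # Order the parts alphabetically so 'zhang, wei' and 'wei, zhang' agree.
    first, second = sorted(parts, key=str.lower)
    words1, words2 = first.lower().split(), second.lower().split()
    return {
        "l1_name": f"{words1[0][0].upper()}_{words2[0][0].upper()}",
        "l2_name": words1[0],
        "l3_name": words2[0],
        "l4_name": " ".join(words1[1:]),
        "l5_name": " ".join(words2[1:]),
    }


print(split_name("Ferrari Marquez, Juan Luis"))
print(split_name("zhang, wei") == split_name("wei, zhang"))
```

Because the two parts are ordered before the blocks are extracted, a reversed writing order produces the same five blocks, which is the property the paragraph above relies on.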
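Further, step S13 of forming the first branch corresponding to the second name according to the fourth name, and forming the second branch corresponding to the third name according to the fifth name, includes:
S131: obtaining each first name combination that meets a preset similarity with the fourth name, and obtaining each second name combination that meets the preset similarity with the fifth name;
S132: connecting the first name combinations in parallel to the second name to form the first branch, and connecting the second name combinations in parallel to the third name to form the second branch.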
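When building the name tree, this application first divides author names into blocks according to l2_name and l3_name, and then builds name tree branches inside each block. FIG. 3 shows the name tree formed for the author name Ferrari Marquez according to l2_name and l3_name. For example, "Ferrari, Juan cruz" and "Ferrari Luis, Juan cruz" may be author names of the same natural person that are merely written differently, because luis^cruz is a sub-branch of ^cruz. The next level of branches below a name tree sub-branch is built according to containment relationships with l4_name and l5_name.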
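The containment relationship between branches can be illustrated with a small sketch. It assumes that each name inside one (l2_name, l3_name) block is represented by its (l4_name, l5_name) pair and that "containment" means word-level inclusion; both the representation and the helper names `is_sub_branch` and `build_branches` are assumptions made for this example.

```python
def is_sub_branch(parent, child):
    """True if child's (l4_name, l5_name) contains parent's, word for word."""
    def contains(outer, inner):
        return set(inner.split()) <= set(outer.split())
    return (parent != child
            and contains(child[0], parent[0])
            and contains(child[1], parent[1]))


def build_branches(keys):
    """Map each (l4_name, l5_name) key to the keys that are its sub-branches."""
    return {k: sorted(c for c in keys if is_sub_branch(k, c)) for k in keys}


# All three names fall in the same block: l2_name = "ferrari", l3_name = "juan".
keys = [("", "cruz"), ("luis", "cruz"), ("marquez", "luis")]
print(build_branches(keys))
```

Here ("luis", "cruz") is reported as a sub-branch of ("", "cruz"), mirroring the luis^cruz example above, while ("marquez", "luis") stays on a separate branch.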
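Further, step S3 of obtaining the paper semantic representation corresponding to each paper in the database includes:
S31: obtaining the title content and abstract content of a specified paper, wherein the specified paper is any one of all papers in the database;
S32: obtaining, through word2vec, the semantic representation vector corresponding to each word in the title content and abstract content;
S33: calculating, from the semantic representation vectors corresponding to each word in the title content and abstract content, the average value of the semantic representation vectors corresponding to the title content and the abstract content respectively;
S34: taking the average value of the semantic representation vectors as the paper semantic representation corresponding to the specified paper.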
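This application uses word2vec to convert the content of each paper into semantic representation vectors, uses them to compute the semantic similarity between papers, and thereby builds a semantic similarity matrix for classifying papers. When building the paper semantic representation, each of the words in the title content and abstract content of a paper is passed through word2vec to obtain its vector, and the vectors are arranged in the original order of the words to form the semantic representation vectors of the paper. To represent a paper, the semantic representation vectors corresponding to the title content and the abstract content are averaged. Taking both the title and the abstract into account makes the semantic focus of the paper more concentrated and precise, so that the resulting paper semantic representation fits the paper content more closely and the accuracy of the paper semantic representation improves.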
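The averaging in steps S33 and S34 can be sketched without a trained model. In this sketch a small hand-written lookup table stands in for word2vec, and the title mean and abstract mean are themselves averaged; that combination rule, the whitespace tokenization, and the names `mean_vector` and `paper_representation` are assumptions, since the text does not fix them.

```python
def mean_vector(tokens, vectors):
    """Average the vectors of the tokens found in the lookup table."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def paper_representation(title, abstract, vectors):
    """Average the title mean vector and the abstract mean vector."""
    parts = [mean_vector(text.lower().split(), vectors) for text in (title, abstract)]
    parts = [p for p in parts if p is not None]
    if not parts:
        raise ValueError("no known words in title or abstract")
    return [sum(column) / len(parts) for column in zip(*parts)]


# Toy two-dimensional vectors standing in for a trained word2vec model.
toy_vectors = {"graph": [1.0, 0.0], "cluster": [0.0, 1.0]}
print(paper_representation("graph graph", "graph cluster", toy_vectors))
```

In practice the lookup table would be replaced by the word vectors of a trained model, and the pairwise cosine similarity of the resulting paper vectors would populate the semantic similarity matrix.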
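Further, step S2 of obtaining the heterogeneous network of association relationships corresponding to all papers in the database includes:
S21: obtaining the first author and collaborators included in each paper, and the work organization information of each first author, as the paper node types of the heterogeneous network of association relationships;
S22: comparing the papers in the database pairwise, to determine whether the number of common words in the work organization information of the first authors of the two papers exceeds a first preset number, and whether the number of co-authors shared among the first authors and collaborators of the two papers exceeds a second preset number;
S23: if so, linking the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper organization, and linking the nodes corresponding to two papers whose number of co-authors exceeds the second preset number to form an edge corresponding to the paper co-authors;
S24: forming the heterogeneous network of association relationships based on the paper node types corresponding to the first authors and collaborators and to the work organization information of the first authors, the edges corresponding to the paper organizations, and the edges corresponding to the paper co-authors.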
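To mine the association relationships between the authors of different papers, this application uses the meta-path (metapath) method of heterogeneous networks to build representations of the relationships between the first author and collaborators and between the first author and the work organization, forming a relationship similarity matrix. The node types used in the heterogeneous network of association relationships include the first author and collaborators of the same paper, and the work organization information of the first author to be disambiguated; the work organization information includes but is not limited to the name of the work organization. Network embedding is then used to build the association relationship representation of each paper. In the heterogeneous network of association relationships of this application, two kinds of edges exist between papers: edges corresponding to the paper organization and edges corresponding to the paper co-authors. The degree of an edge corresponding to the paper organization is the number of common words, and the degree of an edge corresponding to the paper co-authors is the number of co-authors.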
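The pairwise comparison that produces the two edge types can be sketched as below. The data layout (a dictionary with `authors` and `org` fields), the function name `build_paper_graph`, the whitespace tokenization of organization names, and the sample records are all hypothetical; only the strict "exceeds the preset number" test and the edge degrees follow the description above.

```python
from itertools import combinations


def build_paper_graph(papers, first_preset, second_preset):
    """Return {'org': {...}, 'coauthor': {...}} mapping paper pairs to edge degrees."""
    edges = {"org": {}, "coauthor": {}}
    for a, b in combinations(sorted(papers), 2):
        common_words = (set(papers[a]["org"].lower().split())
                        & set(papers[b]["org"].lower().split()))
        shared_authors = set(papers[a]["authors"]) & set(papers[b]["authors"])
        # An edge is created only when the count strictly exceeds its preset number.
        if len(common_words) > first_preset:
            edges["org"][(a, b)] = len(common_words)
        if len(shared_authors) > second_preset:
            edges["coauthor"][(a, b)] = len(shared_authors)
    return edges


papers = {
    "p1": {"authors": ["wei zhang", "li na", "john smith"], "org": "Acme Research Lab"},
    "p2": {"authors": ["wei zhang", "li na"], "org": "Acme Research"},
    "p3": {"authors": ["wei zhang", "maria lopez"], "org": "Example University"},
}
print(build_paper_graph(papers, first_preset=1, second_preset=1))
```

With these sample records only p1 and p2 are linked, by an organization edge of degree 2 and a co-author edge of degree 2; the shared name "wei zhang" alone is not enough to link p3.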
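Further, step S4 of constructing a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the paper semantic representations includes:
S41: forming the first core object of the similarity matrix according to the name tree of the author to be disambiguated;
S42: obtaining, according to a preset path length and through a meta-path random walk strategy, all paths in the heterogeneous network of association relationships that include papers of the author to be disambiguated, as the second core object of the similarity matrix;
S43: taking the paper semantic representations of all papers of the author to be disambiguated as the third core object of the similarity matrix;
S44: integrating the first core object, the second core object, and the third core object to form the similarity matrix corresponding to the author to be disambiguated.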
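Step S44 does not state how the three parts are integrated. One simple possibility, shown here purely as an assumption, is a weighted element-wise sum of three equally sized paper-by-paper similarity matrices; the function name `combine_similarity`, the weights, and the toy matrices are all hypothetical.

```python
def combine_similarity(matrices, weights):
    """Weighted element-wise sum of equally sized square matrices."""
    if len(matrices) != len(weights):
        raise ValueError("one weight per matrix is required")
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


# Toy 2x2 matrices for two papers (hypothetical values).
name_sim = [[1.0, 1.0], [1.0, 1.0]]
relation_sim = [[1.0, 0.0], [0.0, 1.0]]
semantic_sim = [[1.0, 0.5], [0.5, 1.0]]
combined = combine_similarity([name_sim, relation_sim, semantic_sim], [0.25, 0.25, 0.5])
print(combined)
```

Other integration schemes, such as learned weights or concatenating the underlying representations before computing similarity, would fit the wording of S44 equally well.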
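The similarity matrix of this application includes three parts: the similarity matrix corresponding to the paper semantic representations, the hierarchical similarity matrix corresponding to the name tree, and the relationship similarity matrix from the heterogeneous network of association relationships. The similarity matrices obtained from these different factors are used together to evaluate whether the author to be disambiguated and the currently designated author are the same natural person, improving disambiguation accuracy and making the unique correspondence between papers and authors clearer, more precise, and more specific. The similarity matrix corresponding to the semantic representations makes it possible to consider whether the author to be disambiguated belongs to the same research field as the currently designated author; the hierarchical similarity matrix corresponding to the name tree, whether their names belong to the same name tree; and the relationship similarity matrix, whether they have closely matching relationship information, which includes but is not limited to whether most of their collaborators are the same and whether their work organizations are the same. Introducing multiple related core objects into the similarity matrix enables a more comprehensive, interrelated analysis of the information and improves disambiguation accuracy. The relationship similarity matrix of this application is obtained by sampling paths in the heterogeneous network of association relationships starting from any paper, using a meta-path random walk strategy to form paths that contain the node information of the author to be disambiguated, and aggregating them into the relationship similarity matrix. The path lengths are set to the same value, such as 10 or 20, and network embedding is then used to form the embedding of the papers corresponding to each path.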
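A meta-path random walk of fixed length can be sketched as follows. The adjacency layout, the cyclic meta-path of edge types, the early stop at dead ends, and the name `metapath_walk` are assumptions made for this example; the subsequent network embedding step (for instance, training a skip-gram model on the sampled walks) is not shown.

```python
import random


def metapath_walk(adjacency, start, metapath, length, rng):
    """Sample one walk of at most `length` papers, cycling through edge types.

    adjacency: {edge_type: {paper: [neighbouring papers]}} (hypothetical layout).
    """
    walk = [start]
    while len(walk) < length:
        edge_type = metapath[(len(walk) - 1) % len(metapath)]
        neighbours = adjacency.get(edge_type, {}).get(walk[-1], [])
        if not neighbours:
            break  # dead end: the walk is shorter than the preset length
        walk.append(rng.choice(neighbours))
    return walk


adjacency = {
    "coauthor": {"p1": ["p2"], "p2": ["p1"], "p3": ["p1"]},
    "org": {"p1": ["p3"], "p2": ["p3"], "p3": ["p2"]},
}
rng = random.Random(0)  # seeded so repeated runs sample the same walks
print(metapath_walk(adjacency, "p1", ["coauthor", "org"], 5, rng))
```

In this toy graph every paper has a single neighbour per edge type, so the walk is fully determined; on real data many walks per starting paper would be sampled and passed to the embedding step.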
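Further, step S5 of clustering the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database includes:
S51: obtaining, according to a density clustering algorithm, the density-reachable papers corresponding to the first core object, the second core object, and the third core object, respectively;
S52: collecting the density-reachable papers corresponding to the first core object, the second core object, and the third core object, respectively, into the paper cluster group.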
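This application uses the density clustering method DBSCAN to cluster the similarity matrix. The number of author names does not need to be determined in advance and few priors are required, which makes the computation convenient. Using the density-reachability principle, the density clustering algorithm of this application determines the paper clusters corresponding to each of the three core objects and then takes their union to form the paper cluster group.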
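For readers unfamiliar with density reachability, the following is a compact, textbook-style DBSCAN over a precomputed distance matrix. It is a generic sketch, not the three-core-object variant described above: it assumes a single distance matrix (for example, one minus a combined similarity), and the parameter names `eps` and `min_pts` and the toy data are assumptions.

```python
def dbscan(dist, eps, min_pts):
    """Label points by density reachability; -1 marks noise (outliers)."""
    n = len(dist)
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


groups = [0, 0, 0, 1, 1, 2]  # toy ground truth used only to build distances
dist = [[0.0 if i == j else (0.1 if groups[i] == groups[j] else 0.9)
         for j in range(6)] for i in range(6)]
print(dbscan(dist, eps=0.2, min_pts=2))
```

No cluster count is supplied: the two groups are discovered from density alone, and the isolated sixth paper is labelled -1, which corresponds to the outlier papers handled after clustering.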
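Further, after step S5 of clustering the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database, the method includes:
S501: determining whether there are outlier papers;
S502: if so, calculating the similarity between the outlier paper and each of the paper cluster groups;
S503: classifying the outlier paper into the paper cluster group corresponding to the maximum similarity value.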
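This application checks whether, after clustering, the database papers include outlier points corresponding to outlier papers that do not belong to any paper cluster group. If an outlier point exists, it is merged, by maximum similarity value, into the paper cluster group containing the papers most similar to it, so that every paper in the database can be matched to the author of a name tree; this widens the range of database papers covered by disambiguation and avoids disambiguation gaps. By classifying the outlier paper into the paper cluster group with the maximum similarity value, the paper corresponding to the outlier point finds its correspondence with an author. In other embodiments of this application, the maximum similarity value may further be compared with a preset threshold, and the outlier paper is classified into the corresponding paper cluster group only when the maximum similarity value reaches the preset threshold, improving clustering accuracy. When the maximum similarity value is less than the preset threshold, this part of the data is defined as noise data and discarded, or manual analysis and correction are applied to the noise data, for example correcting clerical errors made during data entry, or verifying the identity of a person who has changed their name and completing proof that they are the same designated natural person, so that the outlier papers corresponding to the outlier points can be classified more accurately.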
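Steps S501 to S503, together with the optional threshold, can be sketched as below. The similarity between an outlier paper and a cluster group is taken here as the mean similarity to the members of that group; this definition, the `-1` label for outliers, and the name `assign_outliers` are assumptions made for this example.

```python
def assign_outliers(labels, sim, threshold=None):
    """Move each outlier (label -1) to its most similar cluster, if similar enough."""
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(idx)
    result = list(labels)
    for idx, label in enumerate(labels):
        if label != -1 or not clusters:
            continue
        # Assumption: cluster similarity = mean similarity to the cluster's members.
        scores = {c: sum(sim[idx][m] for m in members) / len(members)
                  for c, members in clusters.items()}
        best = max(scores, key=scores.get)
        if threshold is None or scores[best] >= threshold:
            result[idx] = best  # S503: join the cluster with maximum similarity
        # Otherwise the paper stays labelled -1, i.e. noise for manual review.
    return result


labels = [0, 0, 1, 1, -1]
sim = [[0.0] * 5 for _ in range(5)]
sim[4] = [0.8, 0.6, 0.2, 0.1, 1.0]  # similarities of the outlier to every paper
print(assign_outliers(labels, sim))
print(assign_outliers(labels, sim, threshold=0.9))
```

Without a threshold the outlier always joins its closest cluster group; with a threshold it remains noise when even the best match is too weak, matching the alternative embodiment described above.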
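Referring to FIG. 4, a disambiguation device for paper authors according to an embodiment of this application includes:
a forming module 1, configured to form name trees, according to preset rules, from the author names involved in all papers in a database;
a first acquiring module 2, configured to obtain the heterogeneous network of association relationships corresponding to all papers in the database, wherein the heterogeneous network of association relationships includes author-collaborator association relationships and author-institution association relationships;
a second acquiring module 3, configured to obtain the paper semantic representation corresponding to each paper in the database;
a building module 4, configured to construct a similarity matrix based on the name trees, the heterogeneous network of association relationships, and the paper semantic representations;
a clustering module 5, configured to cluster the similarity matrix to obtain the paper cluster groups corresponding to all papers in the database;
a first judgment module 6, configured to determine whether the paper cluster group corresponding to the author to be disambiguated belongs to the paper cluster group corresponding to a designated author, wherein the designated author is any one of all authors involved in all papers in the database;
a determination module 7, configured to determine that the author to be disambiguated is different from the designated author if that paper cluster group does not belong to the paper cluster group corresponding to the designated author.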
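The preset rules for forming a name tree in this application include preprocessing the author name: the name is broken into separate component blocks, and the blocks are then associated through containment relationships to build the name tree. A hierarchical matrix corresponding to the author information is then formed from the name tree. This eliminates the clustering errors caused by different written forms of a name and ensures that, when the name of the same author is written in different ways, it is as far as possible not split into two different groups, improving the accuracy of name disambiguation. This application builds the similarity matrix by jointly considering three factors, namely the name tree, the heterogeneous network of association relationships, and the paper semantic representations, which widens the range of information consulted for disambiguation and further improves the accuracy of the one-to-one correspondence between papers and authors. Using the density-reachability principle of a density clustering algorithm, all papers related to the designated author are gathered into a paper cluster group; whether the author to be disambiguated and the designated author are the same author is then determined from the relationship between the paper cluster group of the author to be disambiguated and the paper cluster group of the designated author. For example, if the paper cluster group of the author to be disambiguated is contained in the paper cluster group of the designated author, the two are determined to be the same author; otherwise they are not. This distinguishes different authors, removes the ambiguity, and gives papers and author names an accurate, unique correspondence.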
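Further, the forming module 1 includes:
a splitting unit, configured to split a designated name, according to the writing separator and in the order of the initial letters in the English alphabet, from front to back into a first part and a second part, wherein the designated name is any one of the author names involved in all papers in the database;
a combining unit, configured to combine the initial letters of the first part and the second part into a first name, take the first word of the first part as a second name, take the first word of the second part as a third name, take the remainder of the first part other than its first word as a fourth name, and take the remainder of the second part other than its first word as a fifth name;
a first forming unit, configured to form a first branch corresponding to the second name according to the fourth name, and form a second branch corresponding to the third name according to the fifth name;
a first linking unit, configured to take the first name as the root and link the first branch and the second branch to form the name tree corresponding to the designated name.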
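The database of this application contains a massive number of paper texts. To achieve an accurate and unique correspondence between paper texts and author names, the classification errors caused by different natural persons sharing the same author name, and by the same natural person writing their name in different ways, must be disambiguated. Before disambiguation, the author names involved in all papers are clustered into blocks, so that names that may belong to the same natural person, together with the related literature, are associated with each other. To accurately identify the author names of the same natural person, this application preprocesses author names as follows. First, the author name is divided into several components according to the rules of name composition. A name generally consists of a family name and a given name, and writing conventions differ between countries: in some the family name comes first, in others it comes last. For various reasons, the writing order of an author name may be reversed, for example "zhang, wei" and "wei, zhang"; in addition, according to incomplete statistics, abbreviated forms often appear in written names. To avoid recognition errors during classification, this application readjusts the rules for representing author names. As shown in FIG. 2 for the author name "Ferrari Marquez, Juan Luis", the family name and given name are arranged by the order of their initial letters in the English alphabet, without distinguishing which is the family name and which is the given name. In the figure, "Ferrari Marquez" before the comma is taken as the first part and "Juan Luis" after the comma as the second part, and the initial letters of the two parts are combined into F_J as the first name, l1_name in the figure, because F precedes J in the English alphabet. The first word of the first part, Ferrari, is called l2_name, and the remainder of the first part, Marquez, is called l4_name; the first word of the second part, Juan, is called l3_name, and the remainder, Luis, is called l5_name. This avoids the situations in name writing that cause author names to be misrecognized, including recognition errors during classification caused by reversed writing order, omitted middle names, abbreviated names, and similar writing problems, so that the author names of the same natural person are not split into different groups when written in different ways. Cases where the author name has substantively changed through human modification, such as misspellings or a change of family or given name, are excepted.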
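Further, the first forming unit includes:
an acquiring subunit, configured to obtain each first name combination that meets a preset similarity with the fourth name, and obtain each second name combination that meets the preset similarity with the fifth name;
a forming subunit, configured to connect the first name combinations in parallel to the second name to form the first branch, and connect the second name combinations in parallel to the third name to form the second branch.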
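When building the name tree, this application first divides author names into blocks according to l2_name and l3_name, and then builds name tree branches inside each block. FIG. 3 shows the name tree formed for the author name Ferrari Marquez according to l2_name and l3_name. For example, "Ferrari, Juan cruz" and "Ferrari Luis, Juan cruz" may be author names of the same natural person that are merely written differently, because luis^cruz is a sub-branch of ^cruz. The next level of branches below a name tree sub-branch is built according to containment relationships with l4_name and l5_name.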
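Further, the second acquiring module 3 includes:
a first acquiring unit, configured to obtain the title content and abstract content of a specified paper, wherein the specified paper is any one of all papers in the database;
a second acquiring unit, configured to obtain, through word2vec, the semantic representation vector corresponding to each word in the title content and abstract content;
a calculation unit, configured to calculate, from the semantic representation vectors corresponding to each word in the title content and abstract content, the average value of the semantic representation vectors corresponding to the title content and the abstract content respectively;
a first serving unit, configured to take the average value of the semantic representation vectors as the paper semantic representation corresponding to the specified paper.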
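This application uses word2vec to convert the content of each paper into semantic representation vectors, uses them to compute the semantic similarity between papers, and thereby builds a semantic similarity matrix for classifying papers. When building the paper semantic representation, each of the words in the title content and abstract content of a paper is passed through word2vec to obtain its vector, and the vectors are arranged in the original order of the words to form the semantic representation vectors of the paper. To represent a paper, the semantic representation vectors corresponding to the title content and the abstract content are averaged. Taking both the title and the abstract into account makes the semantic focus of the paper more concentrated and precise, so that the resulting paper semantic representation fits the paper content more closely and the accuracy of the paper semantic representation improves.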
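Further, the first acquiring module 2 includes:
a third acquiring unit, configured to obtain the first author and collaborators included in each paper, and the work organization information of each first author, as the paper node types of the heterogeneous network of association relationships;
a comparison unit, configured to compare the papers in the database pairwise, to determine whether the number of common words in the work organization information of the first authors of the two papers exceeds a first preset number, and whether the number of co-authors shared among the first authors and collaborators of the two papers exceeds a second preset number;
a second linking unit, configured to, if so, link the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper organization, and link the nodes corresponding to two papers whose number of co-authors exceeds the second preset number to form an edge corresponding to the paper co-authors;
a second forming unit, configured to form the heterogeneous network of association relationships based on the paper node types corresponding to the first authors and collaborators and to the work organization information of the first authors, the edges corresponding to the paper organizations, and the edges corresponding to the paper co-authors.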
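To mine the associations between the authors of different papers, the present application uses the metapath method of a heterogeneous network to construct relationship representations between first authors and collaborators and between first authors and work institutions, forming a relationship similarity matrix. The node types used in the heterogeneous association network include the first author and collaborators of the same paper, and the work institution information of the first author to be disambiguated; the work institution information includes but is not limited to the name of the work institution. Network embedding is then used to construct the association representation of each paper. In the heterogeneous association network of the present application, two kinds of edges exist between papers: edges corresponding to the paper institution and edges corresponding to the paper co-authors. The degree (that is, the weight) of an edge corresponding to the paper institution is the number of common words, and the degree of an edge corresponding to the paper co-authors is the number of shared authors.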
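The pairwise comparison that creates the two kinds of edges might look like the sketch below. The data layout (a dictionary of paper ids with `org` and `authors` fields), the whitespace tokenisation of institution names, and the function name `build_edges` are illustrative assumptions.

```python
from itertools import combinations


def build_edges(papers, min_shared_words, min_shared_authors):
    """Return (institution_edges, coauthor_edges) between pairs of papers.

    Each edge maps a pair of paper ids to its weight: the number of common
    institution words or the number of shared authors.
    """
    org_edges, author_edges = {}, {}
    for a, b in combinations(sorted(papers), 2):
        words_a = set(papers[a]["org"].lower().split())
        words_b = set(papers[b]["org"].lower().split())
        shared_words = words_a & words_b
        shared_authors = set(papers[a]["authors"]) & set(papers[b]["authors"])

        # An edge is created only when the count exceeds the preset number.
        if len(shared_words) > min_shared_words:
            org_edges[(a, b)] = len(shared_words)
        if len(shared_authors) > min_shared_authors:
            author_edges[(a, b)] = len(shared_authors)
    return org_edges, author_edges
```

Comparing every pair of papers is quadratic in the number of papers, so in practice the comparison would presumably be restricted to papers inside the same name block.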
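Further, the construction module 4 includes:
A third forming unit, configured to form a first core object of the similarity matrix according to the name tree of the author to be disambiguated;
A fourth acquisition unit, configured to acquire, according to a preset path length and through a metapath random walk strategy, all paths in the heterogeneous association network that include papers of the author to be disambiguated, as a second core object of the similarity matrix;
A second determining unit, configured to take the paper semantic representations of all papers of the author to be disambiguated as a third core object of the similarity matrix;
A fourth forming unit, configured to integrate the first core object, the second core object and the third core object to form the similarity matrix corresponding to the author to be disambiguated.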
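The similarity matrix of the present application includes three parts: the similarity matrix corresponding to the paper semantic representations, the hierarchical similarity matrix corresponding to the name tree, and the relationship similarity matrix from the heterogeneous association network. Using similarity matrices derived from different influencing factors, whether the author to be disambiguated and the current designated author are the same natural person is evaluated comprehensively, which improves disambiguation accuracy and makes the unique correspondence between papers and authors clearer, more accurate and more specific. Through the similarity matrix corresponding to the semantic representations, it can be assessed whether the author to be disambiguated and the current designated author belong to the same research field; through the hierarchical similarity matrix corresponding to the name tree, whether their names belong to the same name tree; and through the relationship similarity matrix, whether their relationship information is close, where the relationship information includes but is not limited to whether most of their collaborators are the same and whether their work institutions are the same. Introducing multiple related core objects into the similarity matrix enables a more comprehensive, interrelated analysis of the information and improves disambiguation accuracy. For the relationship similarity matrix, paths are sampled in the heterogeneous association network starting from any paper, using the metapath random walk strategy, to form paths containing the node information of the author to be disambiguated, and these paths are gathered into the relationship similarity matrix. The path lengths are all set to the same value, such as 10 or 20, and the embedding of the papers corresponding to each path is then formed through network embedding.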
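A fixed-length metapath random walk over the two edge types can be sketched as below. The adjacency layout, the alternating metapath `["coauthor", "org"]` and the function name `metapath_walk` are assumptions for illustration; the application only states that paths of a common preset length are sampled.

```python
import random


def metapath_walk(adjacency, start, metapath, length, rng):
    """Sample one path of at most `length` paper nodes.

    adjacency: {edge_type: {paper_id: [neighbouring paper ids]}}
    metapath:  edge types to follow in turn, repeated cyclically.
    The walk stops early if the current paper has no neighbour of the
    required edge type.
    """
    path = [start]
    while len(path) < length:
        edge_type = metapath[(len(path) - 1) % len(metapath)]
        neighbours = adjacency.get(edge_type, {}).get(path[-1], [])
        if not neighbours:
            break
        path.append(rng.choice(neighbours))
    return path


# Usage: a seeded generator keeps the sampled paths reproducible.
adjacency = {
    "coauthor": {"p1": ["p2"], "p2": ["p1"]},
    "org": {"p1": ["p2"], "p2": ["p1"]},
}
walk = metapath_walk(adjacency, "p1", ["coauthor", "org"], 10, random.Random(0))
```

The sampled paths would then be fed to a network embedding model (for example a skip-gram model trained on the paths) to obtain one embedding per paper.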
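Further, the clustering module 5 includes:
A fifth acquisition unit, configured to acquire, according to a density clustering algorithm, the density-reachable papers respectively corresponding to the first core object, the second core object and the third core object;
An aggregation unit, configured to aggregate the density-reachable papers respectively corresponding to the first core object, the second core object and the third core object into the paper cluster.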
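The present application uses the density clustering method DBSCAN to cluster the similarity matrix. The number of author names does not need to be determined in advance, and little prior knowledge is required, which simplifies the computation. Based on the principle of density reachability, the density clustering algorithm determines the paper groups respectively corresponding to the three core objects, and then takes their union to form the paper cluster.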
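For readers unfamiliar with density reachability, the following is a compact, dependency-free DBSCAN over a precomputed distance matrix. It is a textbook version of the algorithm, not the applicant's implementation; the conversion of a similarity matrix to distances (for instance distance = 1 − similarity) and the parameter values are assumptions.

```python
def dbscan(dist, eps, min_pts):
    """Cluster points given a square distance matrix.

    Returns one label per point: 0, 1, ... for clusters and -1 for noise.
    A point is a core point if at least `min_pts` points (itself included)
    lie within distance `eps`.
    """
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1

    def region(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Usage on six points placed on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=0.5, min_pts=2)
```

Points labelled -1 are the outliers that the subsequent reassignment step deals with.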
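Further, the disambiguation apparatus for paper authors includes:
A second judgment module, configured to determine whether an outlier paper exists;
A calculation module, configured to, if an outlier paper exists, calculate the similarity between the outlier paper and each of the paper clusters;
A classification module, configured to classify the outlier paper into the paper cluster for which the similarity value is the largest.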
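After clustering, the present application determines whether the database contains outlier points, that is, outlier papers that do not belong to any paper cluster. If an outlier exists, it is merged, by maximum similarity, into the paper cluster that contains the paper most similar to it, so that every paper in the database can be matched to the author of some name tree; this widens the range of papers covered by disambiguation and avoids gaps. By classifying the outlier paper into the paper cluster with the largest similarity value, the paper corresponding to the outlier obtains a correspondence with an author. In other embodiments of the present application, the largest similarity value may further be compared with a preset threshold, and the outlier paper is classified into the corresponding paper cluster only when the value reaches the preset threshold, so as to improve clustering precision. When the largest similarity value is below the preset threshold, the data is treated as noise and discarded, or the noise data is submitted to manual analysis and correction, for example by correcting clerical errors made when the information was entered, or by verifying the identity behind a change of name and confirming that it is the same natural person, so that the outlier paper can be classified more accurately.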
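The reassignment of an outlier paper can be sketched as follows. Cosine similarity over paper vectors, the function names and the optional `threshold` argument are assumptions chosen for illustration; the application does not fix the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_outlier(vector, clusters, threshold=None):
    """Return the id of the cluster holding the paper most similar to `vector`.

    clusters: {cluster_id: [paper vectors]}
    If `threshold` is given and the best similarity is below it, None is
    returned and the paper is treated as noise.
    """
    best_id, best_sim = None, -1.0
    for cluster_id, members in clusters.items():
        for member in members:
            sim = cosine(vector, member)
            if sim > best_sim:
                best_id, best_sim = cluster_id, sim
    if threshold is not None and best_sim < threshold:
        return None
    return best_id
```

Returning `None` for low-similarity papers corresponds to the embodiment in which such data is discarded as noise or handed over for manual review.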
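Referring to FIG. 5, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores all the data required by the disambiguation process for paper authors. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer program implements the disambiguation method for paper authors.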
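The disambiguation method for paper authors executed by the processor includes: forming name trees from the author names involved in all papers in a database according to a preset rule; acquiring a heterogeneous association network corresponding to all the papers in the database, wherein the heterogeneous association network includes associations between authors and collaborators and associations between authors and institutions; acquiring the paper semantic representation corresponding to each of the papers in the database; constructing a similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations; clustering the similarity matrix to obtain the paper clusters corresponding to all the papers in the database; determining whether the paper cluster corresponding to an author to be disambiguated belongs to the paper cluster corresponding to a designated author, wherein the designated author is any one of all the authors involved in all the papers in the database; and if not, determining that the author to be disambiguated is different from the designated author.
The above computer device preprocesses the author names by splitting each name into different component blocks, constructs a name tree from the containment relationships of each component block, and then forms the hierarchical matrix corresponding to the author information according to the name tree. This eliminates the clustering errors that arise when a name is written in different ways, ensures as far as possible that the name of the same author is not separated into two different groups when it is written differently, and improves the accuracy of name disambiguation.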
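In an embodiment, the step in which the processor forms name trees from the author names involved in all papers in the database according to the preset rule includes: splitting a designated name at its written separator into a first part and a second part, ordered from first to last according to the positions of their initial letters in the English alphabet, wherein the designated name is any one of the author names involved in all the papers in the database; combining the initial letters respectively corresponding to the first part and the second part into a first name element, taking the first word of the first part as a second name element, taking the first word of the second part as a third name element, taking the remainder of the first part other than its first word as a fourth name element, and taking the remainder of the second part other than its first word as a fifth name element; forming a first branch corresponding to the second name element according to the fourth name element, and forming a second branch corresponding to the third name element according to the fifth name element; and linking the first branch and the second branch with the first name element as the root, so as to form the name tree corresponding to the designated name.
In an embodiment, the step in which the processor forms the first branch corresponding to the second name element according to the fourth name element and forms the second branch corresponding to the third name element according to the fifth name element includes: acquiring first name combinations that satisfy a preset similarity with the fourth name element, and acquiring second name combinations that satisfy the preset similarity with the fifth name element; and connecting the first name combinations in parallel to the second name element to form the first branch, and connecting the second name combinations in parallel to the third name element to form the second branch.
In an embodiment, the step in which the processor acquires the paper semantic representation corresponding to each of the papers in the database includes: acquiring the title content and the abstract content of a designated paper, wherein the designated paper is any one of all the papers in the database; acquiring, through word2vec, the semantic representation vector corresponding to each word in the title content and the abstract content; calculating, according to the semantic representation vector corresponding to each word in the title content and the abstract content, the average of the semantic representation vectors respectively corresponding to the title content and the abstract content; and taking the average of the semantic representation vectors as the paper semantic representation corresponding to the designated paper.
In an embodiment, the step in which the processor acquires the heterogeneous association network corresponding to all the papers in the database includes: acquiring the first author and the collaborators included in each paper, and the work institution information of each first author, as the paper node types of the heterogeneous association network; comparing the papers in the database pairwise, determining whether the number of common words in the work institution information of the first authors of the two papers exceeds a first preset number, and determining whether the number of shared authors among the first authors and collaborators of the two papers exceeds a second preset number; if so, linking the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper institution, and linking the nodes corresponding to two papers whose number of shared authors exceeds the second preset number to form an edge corresponding to the paper co-authors; and forming the heterogeneous association network based on the paper node types respectively corresponding to the first authors and collaborators and to the work institution information of the first authors, together with the edges corresponding to the paper institutions and the edges corresponding to the paper co-authors.
In an embodiment, the step in which the processor constructs the similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations includes: forming a first core object of the similarity matrix according to the name tree of the author to be disambiguated; acquiring, according to a preset path length and through a metapath random walk strategy, all paths in the heterogeneous association network that include papers of the author to be disambiguated, as a second core object of the similarity matrix; taking the paper semantic representations of all papers of the author to be disambiguated as a third core object of the similarity matrix; and integrating the first core object, the second core object and the third core object to form the similarity matrix corresponding to the author to be disambiguated.
In an embodiment, the step in which the processor clusters the similarity matrix to obtain the paper clusters corresponding to all the papers in the database includes: acquiring, according to a density clustering algorithm, the density-reachable papers respectively corresponding to the first core object, the second core object and the third core object; and aggregating the density-reachable papers respectively corresponding to the first core object, the second core object and the third core object into the paper cluster.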
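Those skilled in the art can understand that the structure shown in FIG. 5 is merely a block diagram of part of the structure related to the solution of the present application, and does not limit the computer device to which the solution of the present application is applied.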
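An embodiment of the present application further provides a computer-readable storage medium, which may be non-volatile or volatile and on which a computer program is stored. When executed by a processor, the computer program implements a disambiguation method for paper authors, including: forming name trees from the author names involved in all papers in a database according to a preset rule; acquiring a heterogeneous association network corresponding to all the papers in the database, wherein the heterogeneous association network includes associations between authors and collaborators and associations between authors and institutions; acquiring the paper semantic representation corresponding to each of the papers in the database; constructing a similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations; clustering the similarity matrix to obtain the paper clusters corresponding to all the papers in the database; determining whether the paper cluster corresponding to an author to be disambiguated belongs to the paper cluster corresponding to a designated author, wherein the designated author is any one of all the authors involved in all the papers in the database; and if not, determining that the author to be disambiguated is different from the designated author.
The above computer-readable storage medium preprocesses the author names by splitting each name into different component blocks, constructs a name tree from the containment relationships of each component block, and then forms the hierarchical matrix corresponding to the author information according to the name tree. This eliminates the clustering errors that arise when a name is written in different ways, ensures as far as possible that the name of the same author is not separated into two different groups when it is written differently, and improves the accuracy of name disambiguation.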
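In an embodiment, the step in which the processor forms name trees from the author names involved in all papers in the database according to the preset rule includes: splitting a designated name at its written separator into a first part and a second part, ordered from first to last according to the positions of their initial letters in the English alphabet, wherein the designated name is any one of the author names involved in all the papers in the database; combining the initial letters respectively corresponding to the first part and the second part into a first name element, taking the first word of the first part as a second name element, taking the first word of the second part as a third name element, taking the remainder of the first part other than its first word as a fourth name element, and taking the remainder of the second part other than its first word as a fifth name element; forming a first branch corresponding to the second name element according to the fourth name element, and forming a second branch corresponding to the third name element according to the fifth name element; and linking the first branch and the second branch with the first name element as the root, so as to form the name tree corresponding to the designated name.
In an embodiment, the step in which the processor forms the first branch corresponding to the second name element according to the fourth name element and forms the second branch corresponding to the third name element according to the fifth name element includes: acquiring first name combinations that satisfy a preset similarity with the fourth name element, and acquiring second name combinations that satisfy the preset similarity with the fifth name element; and connecting the first name combinations in parallel to the second name element to form the first branch, and connecting the second name combinations in parallel to the third name element to form the second branch.
In an embodiment, the step in which the processor acquires the paper semantic representation corresponding to each of the papers in the database includes: acquiring the title content and the abstract content of a designated paper, wherein the designated paper is any one of all the papers in the database; acquiring, through word2vec, the semantic representation vector corresponding to each word in the title content and the abstract content; calculating, according to the semantic representation vector corresponding to each word in the title content and the abstract content, the average of the semantic representation vectors respectively corresponding to the title content and the abstract content; and taking the average of the semantic representation vectors as the paper semantic representation corresponding to the designated paper.
In an embodiment, the step in which the processor acquires the heterogeneous association network corresponding to all the papers in the database includes: acquiring the first author and the collaborators included in each paper, and the work institution information of each first author, as the paper node types of the heterogeneous association network; comparing the papers in the database pairwise, determining whether the number of common words in the work institution information of the first authors of the two papers exceeds a first preset number, and determining whether the number of shared authors among the first authors and collaborators of the two papers exceeds a second preset number; if so, linking the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper institution, and linking the nodes corresponding to two papers whose number of shared authors exceeds the second preset number to form an edge corresponding to the paper co-authors; and forming the heterogeneous association network based on the paper node types respectively corresponding to the first authors and collaborators and to the work institution information of the first authors, together with the edges corresponding to the paper institutions and the edges corresponding to the paper co-authors.
In an embodiment, the step in which the processor constructs the similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations includes: forming a first core object of the similarity matrix according to the name tree of the author to be disambiguated; acquiring, according to a preset path length and through a metapath random walk strategy, all paths in the heterogeneous association network that include papers of the author to be disambiguated, as a second core object of the similarity matrix; taking the paper semantic representations of all papers of the author to be disambiguated as a third core object of the similarity matrix; and integrating the first core object, the second core object and the third core object to form the similarity matrix corresponding to the author to be disambiguated.
In an embodiment, the step in which the processor clusters the similarity matrix to obtain the paper clusters corresponding to all the papers in the database includes: acquiring, according to a density clustering algorithm, the density-reachable papers respectively corresponding to the first core object, the second core object and the third core object; and aggregating the density-reachable papers respectively corresponding to the first core object, the second core object and the third core object into the paper cluster.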
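Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media provided in the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).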
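It should be noted that, in this document, the terms "include", "comprise" and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, apparatus, article or method that includes the element.
The above are only preferred embodiments of the present application and do not limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.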

Claims (20)

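  1. A disambiguation method for paper authors, comprising:
    forming name trees from the author names involved in all papers in a database according to a preset rule;
    acquiring a heterogeneous association network corresponding to all the papers in the database, wherein the heterogeneous association network includes associations between authors and collaborators and associations between authors and institutions;
    acquiring the paper semantic representation corresponding to each of the papers in the database;
    constructing a similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations;
    clustering the similarity matrix to obtain the paper clusters corresponding to all the papers in the database;
    determining whether the paper cluster corresponding to an author to be disambiguated belongs to the paper cluster corresponding to a designated author, wherein the designated author is any one of all the authors involved in all the papers in the database;
    if not, determining that the author to be disambiguated is different from the designated author.
  2. The disambiguation method for paper authors according to claim 1, wherein the step of forming name trees from the author names involved in all papers in the database according to the preset rule comprises:
    splitting a designated name at its written separator into a first part and a second part, ordered from first to last according to the positions of their initial letters in the English alphabet, wherein the designated name is any one of the author names involved in all the papers in the database;
    combining the initial letters respectively corresponding to the first part and the second part into a first name element, taking the first word of the first part as a second name element, taking the first word of the second part as a third name element, taking the remainder of the first part other than its first word as a fourth name element, and taking the remainder of the second part other than its first word as a fifth name element;
    forming a first branch corresponding to the second name element according to the fourth name element, and forming a second branch corresponding to the third name element according to the fifth name element;
    linking the first branch and the second branch with the first name element as the root, so as to form the name tree corresponding to the designated name.
  3. The disambiguation method for paper authors according to claim 2, wherein the step of forming the first branch corresponding to the second name element according to the fourth name element and forming the second branch corresponding to the third name element according to the fifth name element comprises:
    acquiring first name combinations that satisfy a preset similarity with the fourth name element, and acquiring second name combinations that satisfy the preset similarity with the fifth name element;
    connecting the first name combinations in parallel to the second name element to form the first branch, and connecting the second name combinations in parallel to the third name element to form the second branch.
  4. The disambiguation method for paper authors according to claim 1, wherein the step of acquiring the paper semantic representation corresponding to each of the papers in the database comprises:
    acquiring the title content and the abstract content of a designated paper, wherein the designated paper is any one of all the papers in the database;
    acquiring, through word2vec, the semantic representation vector corresponding to each word in the title content and the abstract content;
    calculating, according to the semantic representation vector corresponding to each word in the title content and the abstract content, the average of the semantic representation vectors respectively corresponding to the title content and the abstract content;
    taking the average of the semantic representation vectors as the paper semantic representation corresponding to the designated paper.
  5. The disambiguation method for paper authors according to claim 1, wherein the step of acquiring the heterogeneous association network corresponding to all the papers in the database comprises:
    acquiring the first author and the collaborators included in each paper, and the work institution information of each first author, as the paper node types of the heterogeneous association network;
    comparing the papers in the database pairwise, determining whether the number of common words in the work institution information of the first authors of the two papers exceeds a first preset number, and determining whether the number of shared authors among the first authors and collaborators of the two papers exceeds a second preset number;
    if so, linking the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper institution, and linking the nodes corresponding to two papers whose number of shared authors exceeds the second preset number to form an edge corresponding to the paper co-authors;
    forming the heterogeneous association network based on the paper node types respectively corresponding to the first authors and collaborators and to the work institution information of the first authors, together with the edges corresponding to the paper institutions and the edges corresponding to the paper co-authors.
  6. The disambiguation method for paper authors according to claim 1, wherein the step of constructing the similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations comprises:
    forming a first core object of the similarity matrix according to the name tree of the author to be disambiguated;
    acquiring, according to a preset path length and through a metapath random walk strategy, all paths in the heterogeneous association network that include papers of the author to be disambiguated, as a second core object of the similarity matrix;
    taking the paper semantic representations of all papers of the author to be disambiguated as a third core object of the similarity matrix;
    integrating the first core object, the second core object and the third core object to form the similarity matrix corresponding to the author to be disambiguated.
  7. The disambiguation method for paper authors according to claim 6, wherein the step of clustering the similarity matrix to obtain the paper clusters corresponding to all the papers in the database comprises:
    acquiring, according to a density clustering algorithm, the density-reachable papers respectively corresponding to the first core object, the second core object and the third core object;
    aggregating the density-reachable papers respectively corresponding to the first core object, the second core object and the third core object into the paper cluster.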
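  8. A disambiguation apparatus for paper authors, comprising:
    a forming module, configured to form name trees from the author names involved in all papers in a database according to a preset rule;
    a first acquisition module, configured to acquire a heterogeneous association network corresponding to all the papers in the database, wherein the heterogeneous association network includes associations between authors and collaborators and associations between authors and institutions;
    a second acquisition module, configured to acquire the paper semantic representation corresponding to each of the papers in the database;
    a construction module, configured to construct a similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations;
    a clustering module, configured to cluster the similarity matrix to obtain the paper clusters corresponding to all the papers in the database;
    a first judgment module, configured to determine whether the paper cluster corresponding to an author to be disambiguated belongs to the paper cluster corresponding to a designated author, wherein the designated author is any one of all the authors involved in all the papers in the database;
    a determination module, configured to determine, if the paper cluster does not belong to the paper cluster corresponding to the designated author, that the author to be disambiguated is different from the designated author.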
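  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements a disambiguation method for paper authors, the method comprising:
    forming name trees from the author names involved in all papers in a database according to a preset rule;
    acquiring a heterogeneous association network corresponding to all the papers in the database, wherein the heterogeneous association network includes associations between authors and collaborators and associations between authors and institutions;
    acquiring the paper semantic representation corresponding to each of the papers in the database;
    constructing a similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations;
    clustering the similarity matrix to obtain the paper clusters corresponding to all the papers in the database;
    determining whether the paper cluster corresponding to an author to be disambiguated belongs to the paper cluster corresponding to a designated author, wherein the designated author is any one of all the authors involved in all the papers in the database;
    if not, determining that the author to be disambiguated is different from the designated author.
  10. The computer device according to claim 9, wherein the step of forming name trees from the author names involved in all papers in the database according to the preset rule comprises:
    splitting a designated name at its written separator into a first part and a second part, ordered from first to last according to the positions of their initial letters in the English alphabet, wherein the designated name is any one of the author names involved in all the papers in the database;
    combining the initial letters respectively corresponding to the first part and the second part into a first name element, taking the first word of the first part as a second name element, taking the first word of the second part as a third name element, taking the remainder of the first part other than its first word as a fourth name element, and taking the remainder of the second part other than its first word as a fifth name element;
    forming a first branch corresponding to the second name element according to the fourth name element, and forming a second branch corresponding to the third name element according to the fifth name element;
    linking the first branch and the second branch with the first name element as the root, so as to form the name tree corresponding to the designated name.
  11. The computer device according to claim 10, wherein the step of forming the first branch corresponding to the second name element according to the fourth name element and forming the second branch corresponding to the third name element according to the fifth name element comprises:
    acquiring first name combinations that satisfy a preset similarity with the fourth name element, and acquiring second name combinations that satisfy the preset similarity with the fifth name element;
    connecting the first name combinations in parallel to the second name element to form the first branch, and connecting the second name combinations in parallel to the third name element to form the second branch.
  12. The computer device according to claim 9, wherein the step of acquiring the paper semantic representation corresponding to each of the papers in the database comprises:
    acquiring the title content and the abstract content of a designated paper, wherein the designated paper is any one of all the papers in the database;
    acquiring, through word2vec, the semantic representation vector corresponding to each word in the title content and the abstract content;
    calculating, according to the semantic representation vector corresponding to each word in the title content and the abstract content, the average of the semantic representation vectors respectively corresponding to the title content and the abstract content;
    taking the average of the semantic representation vectors as the paper semantic representation corresponding to the designated paper.
  13. The computer device according to claim 9, wherein the step of acquiring the heterogeneous association network corresponding to all the papers in the database comprises:
    acquiring the first author and the collaborators included in each paper, and the work institution information of each first author, as the paper node types of the heterogeneous association network;
    comparing the papers in the database pairwise, determining whether the number of common words in the work institution information of the first authors of the two papers exceeds a first preset number, and determining whether the number of shared authors among the first authors and collaborators of the two papers exceeds a second preset number;
    if so, linking the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper institution, and linking the nodes corresponding to two papers whose number of shared authors exceeds the second preset number to form an edge corresponding to the paper co-authors;
    forming the heterogeneous association network based on the paper node types respectively corresponding to the first authors and collaborators and to the work institution information of the first authors, together with the edges corresponding to the paper institutions and the edges corresponding to the paper co-authors.
  14. The computer device according to claim 9, wherein the step of constructing the similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations comprises:
    forming a first core object of the similarity matrix according to the name tree of the author to be disambiguated;
    acquiring, according to a preset path length and through a metapath random walk strategy, all paths in the heterogeneous association network that include papers of the author to be disambiguated, as a second core object of the similarity matrix;
    taking the paper semantic representations of all papers of the author to be disambiguated as a third core object of the similarity matrix;
    integrating the first core object, the second core object and the third core object to form the similarity matrix corresponding to the author to be disambiguated.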
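  15. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements a disambiguation method for paper authors, the method comprising:
    forming name trees from the author names involved in all papers in a database according to a preset rule;
    acquiring a heterogeneous association network corresponding to all the papers in the database, wherein the heterogeneous association network includes associations between authors and collaborators and associations between authors and institutions;
    acquiring the paper semantic representation corresponding to each of the papers in the database;
    constructing a similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations;
    clustering the similarity matrix to obtain the paper clusters corresponding to all the papers in the database;
    determining whether the paper cluster corresponding to an author to be disambiguated belongs to the paper cluster corresponding to a designated author, wherein the designated author is any one of all the authors involved in all the papers in the database;
    if not, determining that the author to be disambiguated is different from the designated author.
  16. The computer-readable storage medium according to claim 15, wherein the step of forming name trees from the author names involved in all papers in the database according to the preset rule comprises:
    splitting a designated name at its written separator into a first part and a second part, ordered from first to last according to the positions of their initial letters in the English alphabet, wherein the designated name is any one of the author names involved in all the papers in the database;
    combining the initial letters respectively corresponding to the first part and the second part into a first name element, taking the first word of the first part as a second name element, taking the first word of the second part as a third name element, taking the remainder of the first part other than its first word as a fourth name element, and taking the remainder of the second part other than its first word as a fifth name element;
    forming a first branch corresponding to the second name element according to the fourth name element, and forming a second branch corresponding to the third name element according to the fifth name element;
    linking the first branch and the second branch with the first name element as the root, so as to form the name tree corresponding to the designated name.
  17. The computer-readable storage medium according to claim 16, wherein the step of forming the first branch corresponding to the second name element according to the fourth name element and forming the second branch corresponding to the third name element according to the fifth name element comprises:
    acquiring first name combinations that satisfy a preset similarity with the fourth name element, and acquiring second name combinations that satisfy the preset similarity with the fifth name element;
    connecting the first name combinations in parallel to the second name element to form the first branch, and connecting the second name combinations in parallel to the third name element to form the second branch.
  18. The computer-readable storage medium according to claim 15, wherein the step of acquiring the paper semantic representation corresponding to each of the papers in the database comprises:
    acquiring the title content and the abstract content of a designated paper, wherein the designated paper is any one of all the papers in the database;
    acquiring, through word2vec, the semantic representation vector corresponding to each word in the title content and the abstract content;
    calculating, according to the semantic representation vector corresponding to each word in the title content and the abstract content, the average of the semantic representation vectors respectively corresponding to the title content and the abstract content;
    taking the average of the semantic representation vectors as the paper semantic representation corresponding to the designated paper.
  19. The computer-readable storage medium according to claim 15, wherein the step of acquiring the heterogeneous association network corresponding to all the papers in the database comprises:
    acquiring the first author and the collaborators included in each paper, and the work institution information of each first author, as the paper node types of the heterogeneous association network;
    comparing the papers in the database pairwise, determining whether the number of common words in the work institution information of the first authors of the two papers exceeds a first preset number, and determining whether the number of shared authors among the first authors and collaborators of the two papers exceeds a second preset number;
    if so, linking the nodes corresponding to two papers whose number of common words exceeds the first preset number to form an edge corresponding to the paper institution, and linking the nodes corresponding to two papers whose number of shared authors exceeds the second preset number to form an edge corresponding to the paper co-authors;
    forming the heterogeneous association network based on the paper node types respectively corresponding to the first authors and collaborators and to the work institution information of the first authors, together with the edges corresponding to the paper institutions and the edges corresponding to the paper co-authors.
  20. The computer-readable storage medium according to claim 15, wherein the step of constructing the similarity matrix based on the name trees, the heterogeneous association network and the paper semantic representations comprises:
    forming a first core object of the similarity matrix according to the name tree of the author to be disambiguated;
    acquiring, according to a preset path length and through a metapath random walk strategy, all paths in the heterogeneous association network that include papers of the author to be disambiguated, as a second core object of the similarity matrix;
    taking the paper semantic representations of all papers of the author to be disambiguated as a third core object of the similarity matrix;
    integrating the first core object, the second core object and the third core object to form the similarity matrix corresponding to the author to be disambiguated.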
PCT/CN2020/118531 2020-07-28 2020-09-28 论文作者的消歧方法、装置和计算机设备 WO2021139256A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010740289.6 2020-07-28
CN202010740289.6A CN111881693B (zh) 2020-07-28 2020-07-28 论文作者的消歧方法、装置和计算机设备

Publications (1)

Publication Number Publication Date
WO2021139256A1 true WO2021139256A1 (zh) 2021-07-15

Family

ID=73200336

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118531 WO2021139256A1 (zh) 2020-07-28 2020-09-28 论文作者的消歧方法、装置和计算机设备

Country Status (2)

Country Link
CN (1) CN111881693B (zh)
WO (1) WO2021139256A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191466B (zh) * 2019-12-25 2022-04-01 中国科学院计算机网络信息中心 一种基于网络表征和语义表征的同名作者消歧方法
CN112528089B (zh) * 2020-12-04 2023-11-14 平安科技(深圳)有限公司 论文作者消歧的方法、装置和计算机设备
CN113111178B (zh) * 2021-03-04 2021-12-10 中国科学院计算机网络信息中心 无监督的基于表示学习的同名作者消歧方法及装置
CN113051397A (zh) * 2021-03-10 2021-06-29 北京工业大学 一种基于异质信息网络表示学习和词向量表示的学术论文同名排歧方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140164298A1 (en) * 2012-12-01 2014-06-12 Sirius-Beta Corporation System and method for ontology derivation
CN106372239A (zh) * 2016-09-14 2017-02-01 电子科技大学 一种基于异质网络的社交网络事件关联分析方法
CN109558494A (zh) * 2018-10-29 2019-04-02 中国科学院计算机网络信息中心 一种基于异质网络嵌入的学者名字消歧方法
CN111191466A (zh) * 2019-12-25 2020-05-22 中国科学院计算机网络信息中心 一种基于网络表征和语义表征的同名作者消歧方法
CN111325326A (zh) * 2020-02-21 2020-06-23 北京工业大学 一种基于异质网络表示学习的链路预测方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080049239A (ko) * 2006-11-30 2008-06-04 한국과학기술정보연구원 원문으로부터의 정보추출기법을 사용한 동명저자 중의성해소 방법
US7953724B2 (en) * 2007-05-02 2011-05-31 Thomson Reuters (Scientific) Inc. Method and system for disambiguating informational objects
US8538898B2 (en) * 2011-05-28 2013-09-17 Microsoft Corporation Interactive framework for name disambiguation
CN104111973B (zh) * 2014-06-17 2017-10-27 中国科学院计算技术研究所 一种学者重名的消歧方法及其系统
CN106055539B (zh) * 2016-05-27 2018-12-28 中国科学技术信息研究所 姓名消歧的方法和装置
CN108664468A (zh) * 2018-05-02 2018-10-16 武汉烽火普天信息技术有限公司 一种基于词典和语义消歧的人名识别方法和装置
CN109670014B (zh) * 2018-11-21 2021-02-19 北京大学 一种基于规则匹配和机器学习的论文作者名消歧方法

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869461A (zh) * 2021-07-21 2021-12-31 中国人民解放军国防科技大学 一种用于科学合作异质网络的作者迁移分类方法
CN113869461B (zh) * 2021-07-21 2024-03-12 中国人民解放军国防科技大学 一种用于科学合作异质网络的作者迁移分类方法
CN113672706A (zh) * 2021-08-31 2021-11-19 清华大学苏州汽车研究院(相城) 一种基于属性异质网络的文本摘要抽取方法
CN113672706B (zh) * 2021-08-31 2024-04-26 清华大学苏州汽车研究院(相城) 一种基于属性异质网络的文本摘要抽取方法
CN117312565A (zh) * 2023-11-28 2023-12-29 山东科技大学 一种基于关系融合与表示学习的文献作者姓名消歧方法
CN117312565B (zh) * 2023-11-28 2024-02-06 山东科技大学 一种基于关系融合与表示学习的文献作者姓名消歧方法

Also Published As

Publication number Publication date
CN111881693A (zh) 2020-11-03
CN111881693B (zh) 2023-01-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911403

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911403

Country of ref document: EP

Kind code of ref document: A1