CN111143457A - Student homonymy disambiguation method based on multiple source data sets - Google Patents

Publication number
CN111143457A
Authority
CN
China
Prior art keywords
data
clustering
author
nodes
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911384229.9A
Other languages
Chinese (zh)
Inventor
张思洋
鄂新华
黄韬
刘江
杨帆
霍如
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-28
Filing date
2019-12-28
Publication date
2020-05-12
Application filed by Beijing University of Technology
Priority to CN201911384229.9A
Publication of CN111143457A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a scholar homonymy disambiguation method based on data from multiple data sources, comprising the following steps: (1) integrating data from a plurality of data sources; (2) performing data processing and Chinese-English mapping; (3) performing graph computation on the processed data to generate a knowledge graph; (4) extracting strong feature values for each same-name author and performing fuzzy matching, and adding data that matches on a field to the category corresponding to that field; (5) extracting weak features on the basis of the previous level of clustering and searching for paths through the knowledge graph; (6) calculating the similarity between two nodes, clustering nodes whose similarity is higher than a threshold, and judging nodes whose similarity is lower than the threshold as a class; (7) extracting strong feature values for each same-name author again to perform recheck clustering; (8) finally obtaining the aggregated knowledge graph.

Description

Student homonymy disambiguation method based on multiple source data sets
Technical Field
The invention belongs to the field of data analysis and data mining, and in particular relates to a scholar homonymy disambiguation method based on multiple source data sets.
Background
At present, many literature retrieval platforms, including domestic ones such as CNKI, among others, and foreign ones such as DBLP and Google Scholar, provide their own libraries of expert scholars. The remaining problem is this: when a scholar library needs to compute information such as a scholar's influence or number of papers, scholars who share the same name are difficult to distinguish accurately, which degrades the accuracy of everything computed from them. How to reduce the impact of duplicate names and get the most out of an expert knowledge base has therefore become a concern for researchers, and "homonymy disambiguation" was proposed and has attracted wide attention. Put simply, homonymy disambiguation divides a given collection of articles that carry the same author name into several classes, so that the articles within each class were written by the same person.
Homonymy disambiguation methods combine various attribute features of a paper and, depending on how much they rely on training data, fall into two families: supervised and unsupervised learning. Supervised approaches are mainly probabilistic-model-based. They are accurate, but training on massive data requires a large amount of manual labeling, which costs time and labor; and because the data changes rapidly over time, supervised models also port poorly. Unsupervised approaches mainly comprise graph-theoretic, clustering, and constraint-based methods. They can reach high precision, but their recall is relatively low. Their most important advantage over supervised homonymy disambiguation is that they need neither large amounts of training data nor long training time. On large-scale data, unsupervised algorithms are more feasible and more scalable than supervised ones.
The method adopts unsupervised learning. It first merges same-name authors on strong feature values, which give high accuracy, so that each merged group of same-name authors becomes as large as possible, and then extracts weak feature values on top of these larger groups for further merging. The aim is to raise recall as far as possible while preserving accuracy.
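The precision/recall trade-off described above can be made concrete with a small sketch. The source does not say how precision and recall are measured; the pairwise definition used here (counting pairs of papers placed in the same class) is an assumption, and the paper IDs and clusterings are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """Return the set of unordered paper-ID pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall of a predicted clustering (assumed metric)."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Hypothetical ground truth: two different people share one name and wrote p1..p5.
truth = [{"p1", "p2", "p3"}, {"p4", "p5"}]
# A cautious clustering that only merges when strong evidence is available.
predicted = [{"p1", "p2"}, {"p3"}, {"p4", "p5"}]

precision, recall = pairwise_precision_recall(predicted, truth)
print(precision, recall)
```

In this toy case every merge that was made is correct, but paper p3 was left on its own, which is the "high precision, lower recall" situation the later weak-feature stage is meant to improve.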
Disclosure of Invention
The invention aims to solve the homonymy ambiguity that arises when a retrieval system searches for a scholar. It provides a scholar homonymy disambiguation method based on multiple source data sets that clusters same-name scholars as accurately as possible and thereby improves retrieval accuracy.
The technical scheme is as follows:
A scholar homonymy disambiguation method based on data from multiple data sources comprises the following steps:
integrating data from a plurality of data sources;
performing duplicate checking, data standardization, and Chinese-English mapping on the data;
performing graph computation on the processed data to generate a knowledge graph;
extracting strong feature values for each same-name author and performing fuzzy matching, and adding data that matches on a field to the category corresponding to that field;
extracting weak features on the basis of the previous level of clustering, and searching for paths through the knowledge graph;
calculating the similarity between two nodes, clustering nodes whose similarity is higher than a threshold, and judging nodes whose similarity is lower than the threshold as a class;
extracting strong feature values for each same-name author again to perform recheck clustering;
and finally obtaining the aggregated knowledge graph.
Preferably, the data obtained from the plurality of sources comprises:
dynamic data, including: number of citations, number of downloads, and legal status;
static data, including: title, author, abstract, keywords, DOI, classification number, organization, and author code.
Preferably, graph computation is performed on the processed data to generate the knowledge graph as follows: each processed, non-duplicate record is treated as an independent category, and all authors of the paper in that category are connected to one another as collaborators.
Preferably, the strong feature values extracted for each same-name author include: the organization to which the author belongs and the author's contact information.
Preferably, the strong feature values extracted for each same-name author are fuzzy-matched, and records in the database that match on that field are added to the category of the record from which the feature value was extracted.
Preferably, each round of clustering builds on the previous one: the result of the previous clustering is treated as a whole for feature extraction.
Preferably, weak features are extracted on the basis of the previous level of clustering; weak features include, but are not limited to, collaborator relationships and research direction.
Preferably, paths are searched through the knowledge graph, a path being a connection that links two same-name nodes through weak feature values.
Preferably, the similarity between two nodes is calculated from the paths found by the path search; a similarity threshold is determined through a large number of experiments, and two nodes whose similarity is higher than the threshold are clustered and merged.
Drawings
FIG. 1 is a flow chart of the scholar homonymy disambiguation method according to an embodiment of the invention.
FIG. 2 is a block diagram of the scholar homonymy disambiguation method according to an embodiment of the invention.
FIG. 3 is an organization flow chart of the scholar homonymy disambiguation method according to an embodiment of the invention.
Detailed Description
The following is a detailed description of embodiments of the invention, illustrated in the accompanying drawings in which like or similar reference numerals refer to the same or similar components or components having the same or similar functions throughout the several views. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of illustrating the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
FIG. 1 is a flow chart of the scholar homonymy disambiguation method according to an embodiment of the present invention:
Step 101: integrating data from a plurality of data sources;
Step 102: performing data processing and Chinese-English mapping;
Step 103: performing graph computation on the processed data to generate a knowledge graph;
Step 104: extracting strong feature values for each same-name author and performing fuzzy matching, and adding data that matches on a field to the category corresponding to that field;
Step 105: extracting weak features on the basis of the previous level of clustering, and searching for paths through the knowledge graph;
Step 106: calculating the similarity between two nodes, clustering nodes whose similarity is higher than a threshold, and judging nodes whose similarity is lower than the threshold as a class;
Step 107: extracting strong feature values for each same-name author again to perform recheck clustering;
Step 108: finally obtaining the aggregated knowledge graph.
In step 101, integrating data from a plurality of data sources comprises the following:
The data comes mainly from data sources across various channels. The collected fields include dynamic data, including but not limited to: number of citations, number of downloads, and legal status.
Static data includes, but is not limited to: title, author, abstract, keywords, DOI, classification number, organization, and author code.
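As an illustration of step 101, the following minimal sketch models one record with the static and dynamic fields listed above and concatenates records from several sources. The class name `PaperRecord`, the `source` bookkeeping field, the function `integrate`, and the source names are all hypothetical; the source text does not prescribe a schema.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional


@dataclass
class PaperRecord:
    # Static fields listed in the text.
    title: str
    authors: List[str]
    abstract: str = ""
    keywords: List[str] = field(default_factory=list)
    doi: Optional[str] = None
    classification_number: Optional[str] = None
    organization: Optional[str] = None
    author_code: Optional[str] = None
    # Dynamic fields listed in the text.
    citation_count: int = 0
    download_count: int = 0
    legal_status: Optional[str] = None
    # Bookkeeping field added for this sketch (not in the source text).
    source: str = ""


def integrate(sources: Dict[str, List[PaperRecord]]) -> List[PaperRecord]:
    """Concatenate records from every source, tagging each copy with its origin."""
    merged = []
    for source_name, source_records in sources.items():
        for record in source_records:
            merged.append(replace(record, source=source_name))
    return merged


# Hypothetical input from two channels.
records = integrate({
    "source_a": [PaperRecord(title="Paper One", authors=["Li Wei"])],
    "source_b": [PaperRecord(title="Paper Two", authors=["Li Wei"], citation_count=3)],
})
print([(r.title, r.source) for r in records])
```

Keeping the origin on each record is one way to let later stages trace a merged cluster back to the channels that contributed to it.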
In step 102, the data processing part includes:
performing operations such as duplicate checking, data standardization, and Chinese-English mapping on the data.
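A minimal sketch of what step 102 might look like follows. The source names the three operations but not how they are carried out, so everything here is an assumption: duplicates are detected by DOI when present and by normalized title otherwise, standardization is Unicode NFKC plus whitespace collapsing, and `NAME_MAP` is a hypothetical one-entry Chinese-to-English lookup table.

```python
import unicodedata

# Hypothetical Chinese-to-English lookup; a real table would be far larger.
NAME_MAP = {"北京工业大学": "Beijing University of Technology"}


def tidy(text):
    """Standardize width variants (NFKC) and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def map_to_english(value, mapping=NAME_MAP):
    """Replace a known Chinese value with its English form; pass others through."""
    value = tidy(value)
    return mapping.get(value, value)


def dedupe_key(record):
    """Assumed duplicate key: DOI when available, otherwise the tidied title."""
    doi = record.get("doi")
    if doi:
        return ("doi", tidy(doi).lower())
    return ("title", tidy(record["title"]).lower())


def clean(records):
    """Drop duplicates, then standardize the fields of the surviving records."""
    seen = set()
    cleaned = []
    for record in records:
        key = dedupe_key(record)
        if key in seen:
            continue
        seen.add(key)
        out = dict(record)
        out["title"] = tidy(record["title"])
        out["organization"] = map_to_english(record.get("organization") or "")
        cleaned.append(out)
    return cleaned


raw = [
    {"title": "Graph  Clustering", "doi": "10.1000/X1", "organization": "北京工业大学"},
    {"title": "Graph Clustering", "doi": "10.1000/x1", "organization": "Beijing University of Technology"},
    {"title": "Another Paper", "doi": None, "organization": "Unknown Lab"},
]
processed = clean(raw)
print(processed)
```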
In step 103, graph computation is performed on the processed data to generate the knowledge graph:
each processed, non-duplicate record is taken as an independent category, and the authors of the paper in that category are connected to one another as collaborators.
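The graph construction in step 103 can be sketched as follows. Representing each author occurrence as a `(paper_id, author_name)` node, so that same-name authors on different papers start out as separate nodes, is an assumption made for this illustration; the function name `build_graph` and the sample data are hypothetical.

```python
from itertools import combinations


def build_graph(papers):
    """Build an undirected co-author graph.

    papers: dict mapping paper_id -> list of author names.
    Returns a dict mapping each (paper_id, author_name) node to the set of
    nodes it collaborates with on that paper.
    """
    graph = {}
    for paper_id, authors in papers.items():
        nodes = [(paper_id, name) for name in authors]
        for node in nodes:
            # Single-author papers still need a node, even with no edges.
            graph.setdefault(node, set())
        for a, b in combinations(nodes, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical papers; "Li Wei" appears twice and is not yet known to be one person.
papers = {
    "p1": ["Li Wei", "Chen Jing"],
    "p2": ["Li Wei", "Wang Fang"],
    "p3": ["Zhao Lei"],
}
graph = build_graph(papers)
print(graph[("p1", "Li Wei")])
```

At this point the two "Li Wei" nodes are unconnected; the later matching and clustering steps decide whether they should be merged.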
In step 104, the strong feature values extracted for each same-name author include, but are not limited to: the organization to which the author belongs and the author's contact information.
In step 104, the strong feature values extracted for each same-name author are fuzzy-matched, and records in the database that match on that field are added to the category of the record from which the feature value was extracted.
In step 104, each round of clustering builds on the previous one: the result of the previous clustering is treated as a whole for feature extraction.
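The strong-feature matching of step 104 might be sketched as below. The source does not name a fuzzy-matching algorithm or a cut-off, so the use of `difflib.SequenceMatcher`, the 0.85 ratio, the exact-match rule for e-mail addresses, and the union-find grouping are all assumptions; the mention IDs and field names are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_equal(a, b, threshold=0.85):
    """True when two non-empty strings are similar enough (assumed cut-off)."""
    if not a or not b:
        return False
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


class DisjointSet:
    """Minimal union-find used to group mentions that match transitively."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def strong_feature_clusters(mentions, threshold=0.85):
    """Group same-name mentions whose organization or e-mail matches.

    mentions: dict mapping mention_id -> {"org": str, "email": str}.
    """
    ids = list(mentions)
    groups = DisjointSet(ids)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            fa, fb = mentions[a], mentions[b]
            same_email = bool(fa.get("email")) and fa.get("email") == fb.get("email")
            if same_email or fuzzy_equal(fa.get("org"), fb.get("org"), threshold):
                groups.union(a, b)
    clusters = {}
    for x in ids:
        clusters.setdefault(groups.find(x), set()).add(x)
    return list(clusters.values())


# Hypothetical mentions that all carry the same author name.
mentions = {
    "m1": {"org": "Beijing University of Technology", "email": ""},
    "m2": {"org": "Beijing Univ. of Technology", "email": ""},
    "m3": {"org": "Tsinghua University", "email": "lw@example.org"},
    "m4": {"org": "", "email": "lw@example.org"},
}
clusters = strong_feature_clusters(mentions)
print(clusters)
```

Union-find is used so that a match between A and B and another between B and C place all three in one category, which is one reading of "adding matching data to the category" in the text.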
In step 105, weak features are extracted on the basis of the previous level of clustering; weak features include, but are not limited to, collaborator relationships and research direction.
In step 105, paths are searched through the knowledge graph, a path being a connection that links two same-name nodes through weak feature values.
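The path search of step 105 could be sketched with a bounded breadth-first search, as below. The source does not specify a search algorithm or a hop limit; the BFS, the `max_hops` parameter, the node labels such as `LiWei#1`, and the `coauthor:` / `topic:` prefixes for weak-feature nodes are assumptions for illustration.

```python
from collections import deque


def make_undirected(edges):
    """Turn a list of (u, v) pairs into an adjacency dict."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def find_paths(graph, start, goal, max_hops=4):
    """Return every simple path from start to goal using at most max_hops edges."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in path:  # keep paths simple (no revisits)
                queue.append(path + [neighbour])
    return paths


# Hypothetical graph: three clusters named "Li Wei" linked to weak-feature nodes.
kg = make_undirected([
    ("LiWei#1", "coauthor:Chen Jing"),
    ("LiWei#2", "coauthor:Chen Jing"),
    ("LiWei#1", "topic:graph mining"),
    ("LiWei#2", "topic:graph mining"),
    ("LiWei#3", "topic:organic chemistry"),
])
paths_12 = find_paths(kg, "LiWei#1", "LiWei#2", max_hops=2)
paths_13 = find_paths(kg, "LiWei#1", "LiWei#3", max_hops=2)
print(paths_12)
print(paths_13)
```

Here the first two clusters are linked through a shared collaborator and a shared research topic, while the third has no connecting path within the hop limit.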
In step 106, the similarity between two nodes is calculated from the paths found by the path search; a similarity threshold is determined through a large number of experiments, and two nodes whose similarity is higher than the threshold are clustered and merged.
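The source does not give the similarity formula or the threshold value used in step 106. Purely as an illustration, the sketch below scores each connecting path by the type of weak feature it passes through, discounts longer paths, and combines paths with a noisy-OR; the weights, the decay factor, the 0.6 threshold, and the `coauthor:` / `topic:` label prefixes are all hypothetical.

```python
# Hypothetical evidence weights per weak-feature type.
FEATURE_WEIGHTS = {"coauthor": 0.6, "topic": 0.3}


def path_evidence(path, weights=FEATURE_WEIGHTS, decay=0.5):
    """Evidence in [0, 1) carried by one path between two same-name nodes.

    Intermediate nodes labelled "kind:value" are weak-feature nodes; the
    weakest one bounds the path, and each hop beyond the shortest possible
    path (one feature node) halves the evidence.
    """
    inner = path[1:-1]
    kinds = [node.split(":", 1)[0] for node in inner if ":" in node]
    if not kinds:
        return 0.0
    base = min(weights.get(kind, 0.0) for kind in kinds)
    extra_hops = len(path) - 3
    return base * (decay ** extra_hops)


def similarity(paths):
    """Noisy-OR combination: more independent paths give higher similarity."""
    remaining = 1.0
    for path in paths:
        remaining *= 1.0 - path_evidence(path)
    return 1.0 - remaining


def should_merge(paths, threshold=0.6):
    """Merge only when similarity is strictly higher than the threshold."""
    return similarity(paths) > threshold


both = [
    ["LiWei#1", "coauthor:Chen Jing", "LiWei#2"],
    ["LiWei#1", "topic:graph mining", "LiWei#2"],
]
topic_only = [["LiWei#1", "topic:graph mining", "LiWei#2"]]
print(similarity(both), should_merge(both))
print(similarity(topic_only), should_merge(topic_only))
```

With these made-up weights, a shared collaborator plus a shared topic clears the threshold, while a shared topic alone does not. In practice the threshold would be tuned experimentally, as the text states.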
In step 107, the strong feature matching stage introduces a large amount of duplicate data, and because some data is missing, weak feature clustering cannot eliminate all of it. A strong-feature recheck clustering stage is therefore added: on the basis of the previous clustering, strong feature values are extracted for each same-name author and clustering is performed again.
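One way to read the recheck in step 107 is that each existing cluster is treated as a whole: its members' strong feature values are pooled, and clusters whose pooled values overlap are merged. The sketch below follows that reading. The exact-match comparison on lower-cased values, the field names `org` and `email`, and the function names are assumptions; the source does not describe the recheck at this level of detail.

```python
def pooled_features(cluster, mentions):
    """Union of all non-empty strong feature values held by a cluster's members."""
    values = set()
    for member in cluster:
        for key in ("org", "email"):
            value = mentions[member].get(key)
            if value:
                values.add((key, value.strip().lower()))
    return values


def recheck(clusters, mentions):
    """Merge clusters whose pooled strong features overlap.

    Groups kept in `merged` always have pairwise disjoint feature sets, so
    absorbing every overlapping group in one pass is sufficient.
    """
    merged = []  # list of (members, features)
    for cluster in clusters:
        members = set(cluster)
        features = pooled_features(cluster, mentions)
        kept = []
        for other_members, other_features in merged:
            if other_features & features:
                members |= other_members
                features |= other_features
            else:
                kept.append((other_members, other_features))
        kept.append((members, features))
        merged = kept
    return [members for members, _ in merged]


# Hypothetical mentions with partly missing data.
mentions = {
    "a": {"org": "BJUT", "email": ""},
    "b": {"org": "", "email": "lw@example.org"},
    "c": {"org": "bjut ", "email": "lw@example.org"},
    "d": {"org": "Other Univ", "email": ""},
}
rechecked = recheck([{"a"}, {"b"}, {"c"}, {"d"}], mentions)
print(rechecked)
```

Pooling lets mention "c" bridge "a" (organization only) and "b" (e-mail only), which neither could do on its own because of the missing fields.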
FIG. 2 is a block diagram of the scholar homonymy disambiguation method according to an embodiment of the present invention. The method is suited to academic information retrieval scenarios. The data acquisition module in the figure integrates data from a plurality of data sources. The data processing module deduplicates the data, standardizes the data format, and performs Chinese-English mapping. The processed data is passed to the graph computation module, which generates a knowledge graph with each article as a category. The strong feature matching module extracts the strong feature values of each author, matches them against the data of other same-name scholars, and adds data that matches on a feature to the category corresponding to that field. The data is then passed to the weak feature clustering module, which treats each match result from the previous module as a whole, extracts weak features from it, searches for paths in the knowledge graph, calculates the similarity between two same-name nodes, and clusters them when the similarity exceeds the threshold. The clustering result is sent to the strong feature recheck clustering module. This module is added because the strong feature matching module produces a large amount of duplicate data, and factors such as missing data prevent the weak feature clustering module from eliminating it well. It clusters data whose similarity exceeds the threshold, and the aggregated knowledge graph is finally output.
FIG. 3 is an organization flow chart of the scholar homonymy disambiguation method according to an embodiment of the invention.
Data from a plurality of data sources is integrated, processed, and mapped between Chinese and English. A knowledge graph is generated from the processed data: each processed, non-duplicate record is taken as a class and a node is generated for each author. The nodes of the knowledge graph are then traversed, and for each node it is determined whether any other node has the same name. If there is none, traversal continues with the next node. If a same-name scholar exists, strong feature values are extracted for each same-name author and fuzzy-matched, and data that matches on a field is added to the class corresponding to that field. Weak features are then extracted on the basis of the previous level of clustering, paths are searched through the knowledge graph, the similarity between two nodes is calculated, and nodes whose similarity exceeds the threshold are clustered. Strong feature values are then extracted again for each same-name author for recheck clustering, nodes whose similarity exceeds the threshold are clustered, and the aggregated knowledge graph is finally output.
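The traversal at the start of this flow, which skips nodes that have no same-name counterpart, can be sketched as a single grouping pass. Comparing names after lower-casing and whitespace collapsing is an assumption, as are the function name and the sample nodes.

```python
from collections import defaultdict


def same_name_groups(nodes):
    """Group author nodes by name and keep only names shared by several nodes.

    nodes: iterable of (node_id, author_name) pairs.
    Returns a dict mapping the normalized name to the list of node IDs.
    """
    by_name = defaultdict(list)
    for node_id, name in nodes:
        by_name[" ".join(name.split()).lower()].append(node_id)
    # Names carried by a single node need no disambiguation and are skipped.
    return {name: ids for name, ids in by_name.items() if len(ids) > 1}


# Hypothetical author nodes taken from three papers.
nodes = [
    ("p1:0", "Li Wei"),
    ("p1:1", "Chen Jing"),
    ("p2:0", "li  wei"),
    ("p3:0", "Zhao Lei"),
]
candidates = same_name_groups(nodes)
print(candidates)
```

Only the groups returned here would go on to the strong-feature matching, weak-feature clustering, and recheck stages.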
With the technical scheme provided by the invention, same-name scholars can be clustered, which further improves the accuracy of academic retrieval.
Those skilled in the art will appreciate that the present invention may be directed to an apparatus for performing one or more of the operations described in the present application. The apparatus may be specially designed and constructed for the required purposes, or it may comprise any known apparatus in a general purpose computer selectively activated or reconfigured by a program stored in the general purpose computer.
It will be understood by those within the art that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the methods specified in the block or blocks of the block diagrams and/or flowchart block or blocks.
Those of skill in the art will appreciate that various operations, methods, steps in the processes, acts, or solutions discussed in the present application may be alternated, modified, combined, or deleted. Further, various operations, methods, steps in the flows, which have been discussed in the present application, may be interchanged, modified, rearranged, decomposed, combined, or eliminated. Further, steps, measures, schemes in the various operations, methods, procedures disclosed in the prior art and the present invention can also be alternated, changed, rearranged, decomposed, combined, or deleted.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A scholar homonymy disambiguation method based on data from multiple data sources, characterized by comprising the following steps:
integrating data from a plurality of data sources;
performing duplicate checking, data standardization, and Chinese-English mapping on the data;
performing graph computation on the processed data to generate a knowledge graph;
extracting strong feature values for each same-name author and performing fuzzy matching, and adding data that matches on a field to the category corresponding to that field;
extracting weak features on the basis of the previous level of clustering, and searching for paths through the knowledge graph;
calculating the similarity between two nodes, clustering nodes whose similarity is higher than a threshold, and judging nodes whose similarity is lower than the threshold as a class;
extracting strong feature values for each same-name author again to perform recheck clustering;
and finally obtaining the aggregated knowledge graph.
2. The method of claim 1, wherein the data obtained from the plurality of sources comprises:
dynamic data, including: number of citations, number of downloads, and legal status;
static data, including: title, author, abstract, keywords, DOI, classification number, organization, and author code.
3. The method of claim 1, wherein graph computation is performed on the processed data to generate the knowledge graph as follows: each processed, non-duplicate record is treated as an independent category, and all authors of the paper in that category are connected to one another as collaborators.
4. The method of claim 1, wherein the strong feature values extracted for each same-name author comprise: the organization to which the author belongs and the author's contact information.
5. The method of claim 1, wherein the strong feature values extracted for each same-name author are fuzzy-matched, and records in the database that match on that field are added to the category of the record from which the feature value was extracted.
6. The method of claim 1, wherein each round of clustering builds on the previous one, and the result of the previous clustering is treated as a whole for feature extraction.
7. The method of claim 1, wherein weak features are extracted on the basis of the previous level of clustering, and wherein the weak features comprise: collaborator relationships and research direction.
8. The method of claim 1, wherein paths are searched through the knowledge graph, a path being a connection that links two same-name nodes through weak feature values.
9. The method of claim 1, wherein the similarity between two nodes is calculated from the paths found by the path search, a similarity threshold is determined through a large number of experiments, and two nodes whose similarity is higher than the threshold are clustered and merged.
CN201911384229.9A 2019-12-28 2019-12-28 Student homonymy disambiguation method based on multiple source data sets Pending CN111143457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911384229.9A CN111143457A (en) 2019-12-28 2019-12-28 Student homonymy disambiguation method based on multiple source data sets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911384229.9A CN111143457A (en) 2019-12-28 2019-12-28 Student homonymy disambiguation method based on multiple source data sets

Publications (1)

Publication Number Publication Date
CN111143457A true CN111143457A (en) 2020-05-12

Family

ID=70521316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911384229.9A Pending CN111143457A (en) 2019-12-28 2019-12-28 Student homonymy disambiguation method based on multiple source data sets

Country Status (1)

Country Link
CN (1) CN111143457A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9779363B1 (en) * 2014-12-09 2017-10-03 Linkedin Corporation Disambiguating personal names
CN104111973B (en) * 2014-06-17 2017-10-27 中国科学院计算技术研究所 Disambiguation method and its system that a kind of scholar bears the same name
CN109726280A (en) * 2018-12-29 2019-05-07 北京邮电大学 A kind of row's discrimination method and device for scholar of the same name
CN110362692A (en) * 2019-07-23 2019-10-22 中南大学 A kind of academic circle construction method of knowledge based map

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SIYANG ZHANG: "《A multi-Level Author Name Disambiguation Algorithm》", 《IEEE ACCESS》 *
冯钧: "Chinese ensemble entity linking method fusing multiple features", 《计算机与现代化》 (Computer and Modernization) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051397A (en) * 2021-03-10 2021-06-29 北京工业大学 Academic paper homonymy disambiguation method based on heterogeneous information network representation learning and word vector representation
CN113742450A (en) * 2021-08-30 2021-12-03 中信百信银行股份有限公司 User data grade label falling method and device, electronic equipment and storage medium
CN113742450B (en) * 2021-08-30 2023-05-30 中信百信银行股份有限公司 Method, device, electronic equipment and storage medium for user data grade falling label

Similar Documents

Publication Publication Date Title
Zhang et al. Ad hoc table retrieval using semantic similarity
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
CN109190117B (en) Short text semantic similarity calculation method based on word vector
Adelfio et al. Schema extraction for tabular data on the web
CN103970729B (en) A kind of multi-threaded extracting method based on semantic category
CN104991905B (en) A kind of mathematic(al) representation search method based on level index
CN106649275A (en) Relation extraction method based on part-of-speech information and convolutional neural network
CN110399392B (en) Semantic relation database operation
CN105279252A (en) Related word mining method, search method and search system
CN102043851A (en) Multiple-document automatic abstracting method based on frequent itemset
CN103488724A (en) Book-oriented reading field knowledge map construction method
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN111897914A (en) Entity information extraction and knowledge graph construction method for field of comprehensive pipe gallery
CN105550216A (en) Searching method and device of academic research information and excavating method and device of academic research information
Bin et al. Web mining research
Wang et al. Neural related work summarization with a joint context-driven attention mechanism
Wang et al. Representing document as dependency graph for document clustering
Alian et al. Arabic semantic similarity approaches-review
CN107436955A (en) A kind of English word relatedness computation method and apparatus based on Wikipedia Concept Vectors
CN107871002A (en) A kind of across language plagiarism detection method based on fingerprint fusion
CN103678499A (en) Data mining method based on multi-source heterogeneous patent data semantic integration
Ahmadi et al. Unsupervised matching of data and text
CN108021682A (en) Open information extracts a kind of Entity Semantics method based on wikipedia under background
CN111143457A (en) Student homonymy disambiguation method based on multiple source data sets
CN107168953A (en) The new word discovery method and system that word-based vector is characterized in mass text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200512