CN111143457A

CN111143457A - Scholar name disambiguation method based on multiple source data sets

Info

Publication number: CN111143457A
Application number: CN201911384229.9A
Authority: CN
Inventors: 张思洋; 鄂新华; 黄韬; 刘江; 杨帆; 霍如
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2019-12-28
Filing date: 2019-12-28
Publication date: 2020-05-12

Abstract

The invention discloses a method for disambiguating scholars who share the same name, based on data from multiple sources, which comprises the following steps: (1) integrating data from a plurality of data sources; (2) performing data processing and Chinese-English mapping; (3) performing graph computation on the processed data and generating a knowledge graph; (4) extracting strong feature values for each same-name author, performing fuzzy matching, and adding records whose corresponding field matches to the category associated with that field; (5) extracting weak features on the basis of the previous level of clustering and performing path search through the knowledge graph; (6) calculating the similarity between two nodes, clustering nodes whose similarity is higher than a threshold, and treating nodes whose similarity is lower than the threshold as a separate class; (7) extracting strong feature values for each same-name author to perform re-check clustering; and (8) finally obtaining the aggregated knowledge graph.

Description

Scholar name disambiguation method based on multiple source data sets

Technical Field

The invention belongs to the field of data analysis and data mining, and particularly relates to a scholar name disambiguation method based on multiple source data sets.

Background

At present, many literature retrieval platforms, such as CNKI, Wanfang and Baidu Scholar in China and DBLP and Google Scholar abroad, provide their own libraries of experts and scholars. The current problem is that when such a scholar library needs to compute information such as a scholar's influence or number of papers, scholars with the same name are difficult to distinguish accurately, which affects the accuracy of every downstream calculation. How to reduce the impact of name collisions and make the most of the expert knowledge base has therefore become a concern for researchers. "Name disambiguation" was proposed in response and has attracted wide attention from experts and scholars. Put simply, name disambiguation divides a given collection of articles bearing the same author name into several classes, so that the articles within each class are written by the same person.

Name disambiguation methods combine various attribute features of a paper and, according to how much they depend on training data, fall into two modes: supervised learning and unsupervised learning. Supervised learning mainly refers to methods based on probabilistic models. These methods are highly accurate, but training on massive data requires a large amount of manual labeling, which is time-consuming and labor-intensive; moreover, because data are updated rapidly over time, supervised methods suffer from poor portability. Unsupervised learning mainly refers to methods based on graph theory, clustering and constraints. These methods can achieve high precision, but their recall is relatively low. The most important advantage of unsupervised over supervised name disambiguation is that it does not require large amounts of training data or training time. On large-scale data, unsupervised algorithms are more feasible and scalable than supervised ones.

The present method adopts the unsupervised mode. It first merges same-name authors as far as possible on the basis of strong feature values, which give high accuracy, thereby enlarging each same-name author cluster, and then extracts weak feature values from these larger clusters for further merging. Recall is thus improved as much as possible while accuracy is preserved.

Disclosure of Invention

The invention aims to solve the name ambiguity that arises when a retrieval system searches for a scholar, and provides a scholar name disambiguation method based on multiple source data sets, which clusters same-name scholars as accurately as possible and thereby improves retrieval accuracy.

The technical scheme is as follows:

A scholar name disambiguation method based on data from multiple data sources comprises the following steps:

integrating data from a plurality of data sources;

performing deduplication, data normalization and Chinese-English mapping on the data;

performing graph computation on the processed data and generating a knowledge graph;

extracting strong feature values for each same-name author, performing fuzzy matching, and adding records whose corresponding field matches to the category associated with that field;

extracting weak features on the basis of the previous level of clustering, and performing path search through the knowledge graph;

calculating the similarity between two nodes, clustering nodes whose similarity is higher than a threshold, and treating nodes whose similarity is lower than the threshold as a separate class;

extracting strong feature values for each same-name author to perform re-check clustering; and

finally obtaining the aggregated knowledge graph.

Preferably, the data obtained from the plurality of sources comprise:

dynamic data, including: number of citations, number of downloads, and legal status;

static data, including: title, author, abstract, keywords, DOI, classification number, organization, and author code.

Preferably, graph computation is performed on the processed data to generate the knowledge graph as follows: each piece of processed, non-duplicate data is regarded as an independent category, and all authors of the paper in that category are connected to one another as collaborators.

Preferably, the strong feature values extracted for each same-name author include: the organization to which the author belongs and the author's contact information.

Preferably, the strong feature values extracted for each same-name author are fuzzy-matched, and records in the database whose corresponding field matches are added to the category of the record from which the feature value was extracted.

Preferably, each round of clustering depends on the previous round, and each cluster produced by the previous round is treated as a whole for feature extraction.

Preferably, weak features are extracted on the basis of the previous level of clustering, wherein the weak features include, but are not limited to, collaborator relationships and research direction.

Preferably, a path search is performed through the knowledge graph, wherein a path is a line connecting two same-name nodes through weak feature values.

Preferably, the similarity between two nodes is calculated from the paths obtained by the path search, a similarity threshold is determined empirically through extensive experiments, and two nodes whose similarity is higher than the threshold are clustered and merged.

Drawings

FIG. 1 is a flowchart of the scholar name disambiguation method according to an embodiment of the present invention.

FIG. 2 is a block diagram of the scholar name disambiguation method according to an embodiment of the present invention.

FIG. 3 is an organizational flowchart of the scholar name disambiguation method according to an embodiment of the present invention.

Detailed Description

The following is a detailed description of embodiments of the invention, illustrated in the accompanying drawings in which like or similar reference numerals refer to the same or similar components or components having the same or similar functions throughout the several views. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of illustrating the present invention and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

FIG. 1 is a flowchart of the scholar name disambiguation method according to an embodiment of the present invention, which comprises the following steps:

101, integrating data from a plurality of data sources;

102, performing data processing and Chinese-English mapping;

103, performing graph computation on the processed data and generating a knowledge graph;

104, extracting strong feature values for each same-name author, performing fuzzy matching, and adding records whose corresponding field matches to the category associated with that field;

105, extracting weak features on the basis of the previous level of clustering, and performing path search through the knowledge graph;

106, calculating the similarity between two nodes, clustering nodes whose similarity is higher than a threshold, and treating nodes whose similarity is lower than the threshold as a separate class;

107, extracting strong feature values for each same-name author to perform re-check clustering;

108, finally obtaining the aggregated knowledge graph.

In step 101, integrating data from a plurality of data sources comprises:

The data come from data sources of various channels. The collected fields comprise dynamic data, including but not limited to: number of citations, number of downloads, legal status, etc.

Static data include, but are not limited to: title, author, abstract, keywords, DOI, classification number, organization, author code, etc.
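As an illustration only, the fields listed for step 101 could be held in a single record type. The following is a minimal sketch; the class name `PaperRecord`, the field names, the default values and the `source` field are assumptions made for this example and are not specified by the invention.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PaperRecord:
    # Static data: fixed once the paper has been published.
    title: str
    authors: List[str]
    abstract: str = ""
    keywords: List[str] = field(default_factory=list)
    doi: Optional[str] = None
    classification_number: Optional[str] = None
    organizations: List[str] = field(default_factory=list)
    author_codes: List[str] = field(default_factory=list)
    # Dynamic data: changes over time and is refreshed on each collection run.
    citation_count: int = 0
    download_count: int = 0
    legal_status: Optional[str] = None
    # Hypothetical bookkeeping field: which data source supplied the record.
    source: str = ""


record = PaperRecord(
    title="An Example Paper",
    authors=["Zhang Wei", "Li Na"],
    doi="10.0000/example",
    organizations=["Example University"],
    source="source_a",
)
```

Keeping static and dynamic fields in one record makes it straightforward to refresh the dynamic counters without touching the fields later used as features.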

In step 102, the data processing comprises:

performing operations such as deduplication, data normalization and Chinese-English mapping on the data.
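The three operations of step 102 might be sketched as follows. This is an assumption-laden illustration: the description does not state how duplicates are detected, so the sketch keys on the DOI and falls back to normalized title plus authors, and the mapping table `ZH_EN_ORG` is a hypothetical one-entry stand-in for a real Chinese-English dictionary.

```python
import re
import unicodedata

# Hypothetical mapping table; a real system would load a much larger one.
ZH_EN_ORG = {"北京工业大学": "Beijing University of Technology"}


def normalize(text):
    """Fold full-width characters (NFKC), collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def map_org(org):
    """Map a Chinese organization name to English when a mapping is known."""
    org = org.strip()
    return ZH_EN_ORG.get(org, org)


def dedup_key(rec):
    """Assumed duplicate key: the DOI if present, else title plus authors."""
    if rec.get("doi"):
        return ("doi", normalize(rec["doi"]))
    authors = tuple(normalize(a) for a in rec["authors"])
    return ("title", normalize(rec["title"]), authors)


def process(records):
    seen, out = set(), []
    for rec in records:
        key = dedup_key(rec)
        if key in seen:
            continue  # the same paper was already collected from another source
        seen.add(key)
        cleaned = dict(rec)
        cleaned["orgs"] = [map_org(o) for o in rec.get("orgs", [])]
        out.append(cleaned)
    return out


raw = [
    {"title": "Graph  Methods", "authors": ["Zhang Wei"],
     "doi": "10.0000/X1", "orgs": ["北京工业大学"]},
    {"title": "Graph Methods", "authors": ["Zhang Wei"],
     "doi": "10.0000/x1", "orgs": ["Beijing University of Technology"]},
    {"title": "Other Paper", "authors": ["Li Na"], "doi": None, "orgs": []},
]
processed = process(raw)
```

In this example the first two records differ only in DOI letter case and spacing, so they collapse into one record whose organization is stored in English.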

In step 103, graph computation is performed on the processed data to generate the knowledge graph:

each piece of processed, non-duplicate data is taken as an independent category, and the authors of the paper in that category are connected to one another as collaborators.
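One possible reading of step 103 is sketched below, assuming that every author occurrence on a paper becomes its own node, so that two papers signed with the same name start out as two distinct nodes. The function name `build_graph` and the node representation `(paper_id, author_name)` are choices made for this example only.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(papers):
    """Return (adjacency, name_index) for a dict of paper_id -> author list.

    Each paper is its own category; the author nodes inside one paper are
    fully connected to each other as collaborators.
    """
    adjacency = defaultdict(set)
    name_index = defaultdict(list)
    for paper_id, authors in papers.items():
        nodes = [(paper_id, name) for name in authors]
        for node in nodes:
            adjacency[node]  # touching the key registers single-author papers
            name_index[node[1]].append(node)
        for a, b in combinations(nodes, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency, name_index


papers = {
    "p1": ["Zhang Wei", "Li Na"],
    "p2": ["Zhang Wei", "Wang Fang"],
    "p3": ["Zhang Wei"],
}
adjacency, name_index = build_graph(papers)

# Names that occur on more than one node are the candidates to disambiguate.
same_name = {n: nodes for n, nodes in name_index.items() if len(nodes) > 1}
```

The `name_index` gives the later steps a direct way to find all nodes that share a name and therefore need to be compared.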

In step 104, the strong feature values extracted for each same-name author include, but are not limited to: the organization to which the author belongs, the author's contact information, etc.

In step 104, the strong feature values extracted for each same-name author are fuzzy-matched, and records in the database whose corresponding field matches are added to the category of the record from which the feature value was extracted.

In step 104, each round of clustering depends on the previous round, and each cluster produced by the previous round is treated as a whole for feature extraction.
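The fuzzy matching of step 104 is not tied to a particular string metric in this description. The sketch below assumes the ratio from Python's standard `difflib.SequenceMatcher` with an illustrative cutoff of 0.85, and uses a small union-find structure to merge matching records; `fuzzy_equal`, `UnionFind`, `strong_feature_clusters` and the cutoff are all assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_equal(a, b, cutoff=0.85):
    """True when two strings are nearly identical (cutoff is an assumed value)."""
    if not a or not b:
        return False  # a missing field never counts as a match
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


class UnionFind:
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def strong_feature_clusters(mentions):
    """Group same-name records whose organization or contact fuzzily match."""
    ids = list(mentions)
    uf = UnionFind(ids)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            ma, mb = mentions[a], mentions[b]
            if (fuzzy_equal(ma.get("org"), mb.get("org"))
                    or fuzzy_equal(ma.get("contact"), mb.get("contact"))):
                uf.union(a, b)
    clusters = {}
    for x in ids:
        clusters.setdefault(uf.find(x), set()).add(x)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# All three records carry the same author name.
mentions = {
    "p1": {"org": "Beijing University of Technology", "contact": None},
    "p2": {"org": "Beijing Univ. of Technology", "contact": "zw@example.org"},
    "p3": {"org": "Example Institute of Physics", "contact": None},
}
clusters = strong_feature_clusters(mentions)
```

Here the abbreviated organization name on `p2` is close enough to the full name on `p1` to merge the two records, while `p3` stays in its own category.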

In step 105, weak features are extracted on the basis of the previous level of clustering, wherein the weak features include, but are not limited to, collaborator relationships, research direction, etc.

In step 105, a path search is performed through the knowledge graph, wherein a path is a line connecting two same-name nodes through weak feature values.
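The path search of step 105 can be illustrated with a small bipartite graph in which the clusters from the previous level sit on one side and weak feature values on the other. The search strategy, the `max_edges` bound and the `"coauthor:"` and `"topic:"` prefixes are assumptions for this sketch; the description only requires that a path connect two same-name nodes through weak feature values.

```python
from collections import defaultdict


def build_weak_graph(cluster_features):
    """Bipartite graph: cluster ids on one side, weak feature values on the other."""
    graph = defaultdict(set)
    for cluster_id, features in cluster_features.items():
        for feat in features:
            graph[("cluster", cluster_id)].add(("feature", feat))
            graph[("feature", feat)].add(("cluster", cluster_id))
    return graph


def find_paths(graph, start, goal, max_edges=4):
    """Enumerate simple paths from start to goal with at most max_edges edges."""
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_edges:
            continue  # path is already as long as allowed
        for nxt in graph.get(node, ()):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                stack.append(path + [nxt])
    return paths


# Weak features of three same-name clusters from the previous level.
cluster_features = {
    "A": {"coauthor:Li Na", "topic:knowledge graph"},
    "B": {"coauthor:Li Na", "topic:name disambiguation"},
    "C": {"topic:organic chemistry"},
}
graph = build_weak_graph(cluster_features)
paths_ab = find_paths(graph, ("cluster", "A"), ("cluster", "B"))
paths_ac = find_paths(graph, ("cluster", "A"), ("cluster", "C"))
```

Clusters A and B are linked by one path through their shared collaborator, whereas A and C share no weak feature and therefore have no connecting path.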

In step 106, the similarity between the two nodes is calculated from the paths obtained by the path search; a similarity threshold is determined empirically through extensive experiments, and two nodes whose similarity is higher than the threshold are clustered and merged.
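The description does not give the similarity formula or the threshold value. Purely as an illustration, the sketch below assumes that each path contributes a weight that decays with its length, that the weights are combined with a noisy-OR rule, and that the threshold is 0.7; `path_weight`, `similarity`, `merge_above_threshold` and all numeric constants are hypothetical.

```python
def path_weight(n_edges, base=0.6, decay=0.5):
    """Weight of one path; the shortest possible path (2 edges) gets `base`."""
    return base * decay ** ((n_edges - 2) / 2)


def similarity(path_lengths):
    """Noisy-OR combination: more independent paths push the score towards 1."""
    miss = 1.0
    for n_edges in path_lengths:
        miss *= 1.0 - path_weight(n_edges)
    return 1.0 - miss


def merge_above_threshold(pair_paths, threshold=0.7):
    """Split node pairs into those to merge and those kept apart."""
    merged, separate = [], []
    for pair, lengths in pair_paths.items():
        if similarity(lengths) >= threshold:
            merged.append(pair)
        else:
            separate.append(pair)
    return merged, separate


# Edge counts of the paths found between each pair of same-name nodes.
pair_paths = {
    ("A", "B"): [2, 2],  # two direct shared weak features
    ("A", "C"): [4],     # one longer, indirect connection
}
merged, separate = merge_above_threshold(pair_paths)
```

With these assumed constants, two short paths are enough to exceed the threshold, while a single indirect path is not; in practice both the weighting and the threshold would be tuned experimentally, as the step describes.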

In step 107, because the strong feature matching stage introduces a large amount of duplicate data, and weak feature clustering cannot eliminate the duplicates completely when part of the data is missing, a strong feature re-check clustering stage is added: on the basis of the previous clustering, strong feature values are extracted for each same-name author and re-check clustering is performed.
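A minimal sketch of the re-check in step 107 follows, assuming that each cluster pools the strong feature values of its members and that two clusters are merged when any pair of pooled values fuzzily matches. The function names, the data layout and the 0.85 cutoff are assumptions for illustration, and `difflib.SequenceMatcher` again stands in for whatever matcher is actually used.

```python
from difflib import SequenceMatcher


def close(a, b, cutoff=0.85):
    """Fuzzy string equality with an assumed cutoff."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def recheck(clusters):
    """Merge clusters that share a fuzzily matching strong feature value.

    clusters: list of dicts with 'members' (set) and 'strong' (set of str).
    Features are pooled on every merge and the scan restarts until no pair
    of clusters matches any more.
    """
    clusters = [{"members": set(c["members"]), "strong": set(c["strong"])}
                for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(close(x, y)
                       for x in clusters[i]["strong"]
                       for y in clusters[j]["strong"]):
                    clusters[i]["members"] |= clusters[j]["members"]
                    clusters[i]["strong"] |= clusters[j]["strong"]
                    del clusters[j]
                    changed = True
                    break
            if changed:
                break
    return clusters


# Clusters left after weak feature clustering; p4 lacks an organization field.
after_weak = [
    {"members": {"p1", "p2"},
     "strong": {"Beijing University of Technology", "zw@example.org"}},
    {"members": {"p4"}, "strong": {"zw@example.org"}},
    {"members": {"p3"}, "strong": {"Example Institute of Physics"}},
]
result = recheck(after_weak)
```

The record `p4`, which could not be matched by organization because that field is missing, is picked up here through the contact value pooled from the larger cluster.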

FIG. 2 is a block diagram of the scholar name disambiguation method according to an embodiment of the present invention. The method is suited to academic information retrieval scenarios. Data from a plurality of data sources are integrated by the data acquisition module shown in the figure. The data processing module deduplicates the data, normalizes the data format and performs Chinese-English mapping. The processed data are passed to the graph computation module, which generates a knowledge graph with each article as a category. The strong feature matching module extracts strong feature values for each author, matches them against the data of other same-name scholars, and adds the matching data to the category corresponding to the field. The data are then passed to the weak feature clustering module, which treats each matching result of the previous module as a whole, extracts weak features from it, performs path search on the knowledge graph, calculates the similarity between two same-name nodes, and clusters them when the similarity is greater than the threshold. The clustering result is sent to the strong feature re-check clustering module. Because the strong feature matching module produces a large amount of duplicate data, and the weak feature clustering module cannot fully eliminate the duplicates owing to factors such as missing data, this re-check module is added; it clusters the data whose similarity is greater than the threshold and finally outputs the aggregated knowledge graph.

FIG. 3 is an organizational flowchart of the scholar name disambiguation method according to an embodiment of the present invention.

Data from a plurality of data sources are integrated, processed and mapped between Chinese and English. A knowledge graph is generated from the processed data, with each piece of processed, non-duplicate data taken as a category and a node generated for each author. The nodes of the knowledge graph are then traversed, and for each node it is judged whether any other node bears the same name. If no same-name node exists, traversal continues with the next node. If a same-name scholar exists, strong feature values are extracted for each same-name author for fuzzy matching, and records whose corresponding field matches are added to the category associated with that field. Weak features are then extracted on the basis of the previous level of clustering, paths are searched through the knowledge graph, the similarity between two nodes is calculated, and nodes whose similarity is greater than the threshold are clustered. Strong feature values are then extracted again for each same-name author for re-check clustering, in which nodes whose similarity is greater than the threshold are clustered. Finally, the aggregated knowledge graph is output.

With the technical solution provided by the invention, scholars with the same name can be clustered, thereby further improving the accuracy of academic retrieval.

Those skilled in the art will appreciate that the present invention may be directed to an apparatus for performing one or more of the operations described in the present application. The apparatus may be specially designed and constructed for the required purposes, or it may comprise any known apparatus in a general purpose computer selectively activated or reconfigured by a program stored in the general purpose computer.

It will be understood by those within the art that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the methods specified in the block or blocks of the block diagrams and/or flowchart block or blocks.

Those of skill in the art will appreciate that various operations, methods, steps in the processes, acts, or solutions discussed in the present application may be alternated, modified, combined, or deleted. Further, various operations, methods, steps in the flows, which have been discussed in the present application, may be interchanged, modified, rearranged, decomposed, combined, or eliminated. Further, steps, measures, schemes in the various operations, methods, procedures disclosed in the prior art and the present invention can also be alternated, changed, rearranged, decomposed, combined, or deleted.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements shall also fall within the protection scope of the present invention.

Claims

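1. A scholar name disambiguation method based on data from multiple data sources, characterized by comprising the following steps:

integrating data from a plurality of data sources;

performing deduplication, data normalization and Chinese-English mapping on the data;

performing graph computation on the processed data and generating a knowledge graph;

extracting strong feature values for each same-name author, performing fuzzy matching, and adding records whose corresponding field matches to the category associated with that field;

extracting weak features on the basis of the previous level of clustering, and performing path search through the knowledge graph;

calculating the similarity between two nodes, clustering nodes whose similarity is higher than a threshold, and treating nodes whose similarity is lower than the threshold as a separate class;

extracting strong feature values for each same-name author to perform re-check clustering;

and finally obtaining the aggregated knowledge graph.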

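2. The method of claim 1, wherein the data obtained from the plurality of sources comprise: dynamic data, including number of citations, number of downloads, and legal status; and static data, including title, author, abstract, keywords, DOI, classification number, organization, and author code.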

3. The method of claim 1, wherein graph computation is performed on the processed data to generate the knowledge graph as follows: each piece of processed, non-duplicate data is regarded as an independent category, and all authors of the paper in that category are connected to one another as collaborators.

4. The method of claim 1, wherein the strong feature values extracted for each same-name author comprise: the organization to which the author belongs and the author's contact information.

5. The method of claim 1, wherein the strong feature values extracted for each same-name author are fuzzy-matched, and records in the database whose corresponding field matches are added to the category of the record from which the feature value was extracted.

6. The method of claim 1, wherein each round of clustering depends on the previous round, and each cluster produced by the previous round is treated as a whole for feature extraction.

7. The method of claim 1, wherein weak features are extracted on the basis of the previous level of clustering, and wherein the weak features comprise: collaborator relationships and research direction.

8. The method of claim 1, wherein the path search is performed through the knowledge graph, and a path is a line connecting two same-name nodes through weak feature values.

9. The method of claim 1, wherein the similarity between two nodes is calculated from the paths obtained by the path search, a similarity threshold is determined empirically through extensive experiments, and two nodes whose similarity is higher than the threshold are clustered and merged.