CN112016328B

CN112016328B - Academic institution name entity alignment method based on text features

Info

Publication number: CN112016328B
Application number: CN202010867785.8A
Authority: CN
Inventors: 林欣; 郭晨亮; 李继洲
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2023-06-09
Anticipated expiration: 2040-08-26
Also published as: CN112016328A

Abstract

The invention discloses a text-feature-based entity alignment method for academic institution names, comprising five steps: converting English abbreviations into English full names; correcting incorrect correspondences between English abbreviations and English full names; completing English full names and Chinese names by translation; correcting incorrect Chinese names; and merging academic institutions based on text features. The method performs entity alignment on academic institution data extracted from Chinese and English text. Each institution record contains an English abbreviation, an English full name, a Chinese name and a geographic location, with some fields missing and a small number of errors. By completing missing data, correcting erroneous data and merging records of the same institution, the method finally obtains the multiple different names that correspond to the same institution. The invention combines the text features of institution names with geographic location information for entity alignment of academic institution names, requires neither known correspondences between institution names nor contextual semantic information about the names, and achieves a good alignment result at low complexity.

Description

Academic institution name entity alignment method based on text features

Technical Field

The invention relates to entity alignment, entity disambiguation, knowledge graph construction, data preprocessing and the union-find (disjoint-set) algorithm, and in particular to a method for performing entity alignment on academic institution names when constructing an academic knowledge graph, namely a text-feature-based entity alignment method for academic institution names.

Background

In recent years, with the development of computers and networks and the accumulation of data, more and more electronic data has become available to help computers accomplish more tasks. To let a computer understand the relationships among everyday objects and learn more knowledge, a knowledge graph can be constructed for them: each object corresponds to an entity node in the graph, and each relationship between objects corresponds to an edge connecting entities. As large numbers of academic papers are published in electronic form, the classification of academic research fields is also becoming more detailed and complex. To provide more convenient and effective literature retrieval, an academic knowledge graph needs to be constructed from the research fields of papers and the information on authors and institutions, so that deeper association relationships can be used to optimize query results.

When a knowledge graph is constructed, entity information is extracted from text and the same entity can be expressed in several different ways, so the different expressions of one entity must be merged to reduce errors. For data from different sources, the constructed knowledge graphs must be merged and the corresponding entities with different names must be found. Institutions are an important component of academic knowledge graphs, and disambiguating academic institution names is one step in constructing an academic knowledge graph.

Common entity alignment methods decide by combining semantic information from the entity's context, or by the entity's relationships to other entities in the knowledge graph; they require context information about the entity and labeled data of known alignment relations. Institution names in an academic knowledge graph, however, lack sufficient context information: an institution is associated with many fields and only weakly with its context; institution names often appear as abbreviations rather than common words; there is not enough labeled data of known institution-name alignments, so methods that require labeled data perform poorly; and the entities associated with institutions, such as authors, themselves need to be aligned. There is currently no good method for solving these problems.

Entity alignment of academic institution names faces several problems: the data sources are in two languages, Chinese and English; institution names have no uniform format; institutions often appear as English abbreviations; the correspondence between an English abbreviation extracted from text and its full name may be wrong; and some institution names are cut out of detailed addresses, which introduces errors. Institution name data has the form (English abbreviation, English full name, Chinese name, geographic location), and some items may be missing or wrongly associated. The invention makes combined use of the text features of institution names and geographic location information to better solve these problems.

Disclosure of Invention

In view of the shortcomings, difficulties and challenges of the prior art, the invention aims to provide a text-feature-based entity alignment method for academic institution names. By using the text features of academic institution names and the associations between abbreviations and full names, between Chinese and English, and between geographic locations and institutions, the method addresses the problems of prior-art schemes that academic institution entities without context cannot be aligned, that the algorithms are complex to implement, and that there is not enough labeled data to train a model. Using only a small number of known correspondences between abbreviated words and full words, it performs the alignment task on academic institution entities that have no associated paper or author information and achieves a good result.

The specific technical scheme for realizing the aim of the invention is as follows:

A text-feature-based entity alignment method for academic institution names, characterized in that entity alignment is performed on academic institution names based on a number of academic institution records extracted from Chinese and English text data. Each record has the format (English abbreviation, English full name, Chinese name, geographic location) and may contain a small number of missing items and incorrect correspondences. The method completes the missing items, corrects the erroneous content and merges the records that represent the same institution, finally obtaining the multiple different Chinese and English names and the unique geographic location of each institution, that is, the group of records corresponding to each institution. The specific steps are as follows:

Step 1: Convert English abbreviations into English full names

For original data that contains an English abbreviation but no English full name, automatically generate the English full name from the English abbreviation using a word replacement method based on geographic location, completing the missing English full name;

Step 2: Correct the correspondence between English abbreviations and English full names

For original data that contains both an English abbreviation and an English full name, judge whether the correspondence between the two is correct, and split incorrect correspondences to correct the error; when splitting, the English abbreviation is input to step 1 as one new record, the English full name and the Chinese name form another record, and both keep the same geographic location;

Step 3: Complete English full names and Chinese names by translation

For the data without a Chinese name after completion in step 1, the data without a Chinese name after correction in step 2, and the original data that has only an English full name but neither an English abbreviation nor a Chinese name, complete the missing Chinese name by English-to-Chinese translation; for original data that contains a Chinese name but no English full name, complete the missing English full name by Chinese-to-English translation;

Step 4: Correct incorrect Chinese names

For the data corrected in step 2 and the original data that contain a Chinese name, and for the data whose Chinese name was completed by English-to-Chinese translation in step 3, identify and correct incorrect Chinese names using a suffix frequency statistics correction method;

Step 5: Merge academic institutions based on text features

Merge the different records of the same institution among the data whose English full names were completed by Chinese-to-English translation in step 3 and the data corrected in step 4, completing the entity alignment of academic institution names and obtaining the multiple different Chinese and English names and the unique geographic location corresponding to each institution.

The word replacement method based on geographic location in step 1 specifically comprises:

A1: Count word frequencies in the institutions' English abbreviations and construct a correspondence from the common abbreviated words to their full words; when one abbreviated word corresponds to several full words, select the appropriate correspondence according to the geographic location;

A2: Find all substrings of the institution's English abbreviation that are abbreviated words, where the characters on both sides of each substring must be neither letters nor single quotation marks and the substring must not be contained in another abbreviated-word substring, and replace them according to the correspondence constructed in A1 to obtain the English full name.

Judging whether the correspondence between the English abbreviation and the English full name is correct in step 2 specifically comprises:

B1: Split the English abbreviation and the English full name of the institution into words, using every character that is neither a letter nor a single quotation mark as a separator, and convert them to lowercase, obtaining the abbreviation words wj and the full-name words wq; concatenate the initial letters of all words in wq into the acronym ws of the English full name;

B2: Using wj, wq and ws obtained in B1, calculate the similarity sim between the English abbreviation and the English full name and set a threshold v; if the similarity is less than the threshold, the English abbreviation and the English full name are judged not to correspond correctly; otherwise the correspondence is correct.

Calculating the similarity sim between the English abbreviation and the English full name in B2 specifically comprises:

If the English abbreviation, with only its English letters retained, is a subsequence of the English full name, then sim = 1; otherwise sim is calculated by the following formula:

[formula image not reproduced in this text]

where nj is the number of words in the English abbreviation, i.e. in wj; nq is the number of words in the English full name, i.e. in wq; cnta and cntb are the numbers of words in wj that find a partial and a complete correspondence in wq, respectively; cntc and cntd are the numbers of words in wq that find a partial and a complete correspondence in wj, respectively; and 0 < e < 1 is the weight of a partial correspondence. cnta, cntb, cntc and cntd are calculated as follows. Two words are called a complete match if they are identical or satisfy a correspondence from A1, a partial match if one is a subsequence of the other, and a permuted match if one, after its characters are reordered, is a subsequence of the other. cnta, cntb, cntc and cntd are initially 0. For each word in wj: if it completely matches a word in wq, cntb is increased by 1; otherwise, if it completely or partially matches ws, cntb is increased by 1 and cntd is increased by the word length; otherwise, if it partially matches a word in wq, cnta is increased by 1; otherwise cntc is increased by the word length. For each word in wq: if it completely matches a word in wj, cntd is increased by 1; otherwise, if it partially matches a word in wj, cntc is increased by 1.

The suffix frequency statistics correction method in step 4 specifically comprises:

D1: Delete from the data the Chinese names that are too short, i.e. whose length is less than the threshold k; delete the Chinese names whose proportion of English letters exceeds the threshold p; correct the Chinese names that end with non-Chinese characters by deleting the trailing non-Chinese characters; and delete any name that becomes too short as a result;

D2: Count the frequencies of suffixes of different lengths in the institutions' Chinese names, and automatically determine the common suffixes of institution Chinese names by a recursive method;

D3: For a Chinese name that does not end with a common suffix determined in D2, delete a trailing substring so that it ends with a common suffix; if the corrected name is not too short in the sense of D1, accept the correction; otherwise delete the Chinese name.

Automatically determining the common suffixes of institution Chinese names by a recursive method in D2 specifically comprises:

D21: Let the total number of institution records be N. All suffixes of length 1 form the initial set of common suffixes, and the number of occurrences of each suffix in the institution data is stored in the set along with it;

D22: Split the suffixes in the initial set whose longer forms are uniformly distributed and delete the suffixes that occur too rarely, until no suffix in the set is uniformly distributed or too rare. Suppose the suffix x has length i and occurs c times in the institution data. If c satisfies

[formula image not reproduced in this text]

the suffix x is deleted from the set. Otherwise, let the n suffixes b1, b2, …, bn of length i+1 that end with x occur c1, c2, …, cn times in the institution data, with c1 ≥ c2 ≥ … ≥ cn and c = c1 + c2 + … + cn. If the number of categories n is greater than the threshold t, the information entropy v1 is greater than the threshold t1, and the high-frequency suffix proportion v2 is less than the threshold t2, the suffix x is split into the suffixes b1, b2, …, bn. The formulas for v1 and v2 are:

[formula image not reproduced in this text]

Merging the multiple different records of the same institution in step 5 specifically comprises:

E1: Treat the records with the same Chinese name and the same geographic location, and the records with no Chinese name but the same English full name and the same geographic location, as one group of data for the same institution, and assign each group a unique ID; if two groups with different IDs have the same geographic location and contain records with the same English full name, change them to the same ID; the preliminary merge is completed with a union-find algorithm;

E2: For any two IDs, ID1 and ID2, whose groups of data have the same geographic location and share common words, calculate the similarity s between the two groups from the text features of the English full names, the geographic-location words and the quantity words in the names; if s is greater than the threshold pp, the two groups are considered to represent the same institution and are changed to the same ID. Further merging is completed with the union-find algorithm, finally yielding, for each ID, the complete information of the institution as a group of English abbreviations, English full names and Chinese names together with a unique address.

Calculating the similarity s between the two groups of data in E2 specifically comprises:

[formula image not reproduced in this text]

where x1, x2, …, xq are the q records of ID1, y1, y2, …, ym are the m records of ID2, and ss(xi, yj) is the one-directional similarity of the institution records xi and yj. The geographic-location word sets pxi, pyj and the quantity word sets nxi, nyj are extracted from xi and yj. The geographic-location words consist of the geographic location in the record, the words of the segmented English full name that appear in a place-name lexicon, and the pinyin words in the English full name; pinyin words are identified with a dynamic programming algorithm. If at least one of pxi, pyj is non-empty and the intersection pxi ∩ pyj is empty, or at least one of nxi, nyj is non-empty and the intersection nxi ∩ nyj is empty, then ss(xi, yj) = 0; otherwise ss(xi, yj) is calculated by:

[formula image not reproduced in this text]

where wx and wy are the words of the English full names of xi and yj after the geographic-location words and quantity words that xi and yj share have been removed; nx and ny are the numbers of words in wx and wy; ca and cb are the numbers of words in wx that find a partial and a complete correspondence in wy, respectively; inv is the number of inversions in the order of the words of wy matched by the words of wx; 0 < ee < 1 is the weight of a partial correspondence relative to a complete correspondence; and 0 < z < 1 is the weight between the similarity and the inversion count. ca, cb and inv are calculated as follows. Two words are called a complete match if they are identical or satisfy a correspondence from step A1, and a partial match if one is a subsequence of the other. ca and cb are initially 0. Each word in wx is matched against the words in wy with a greedy strategy: if a complete match with a word in wy is found, cb is increased by 1; otherwise, if a partial match with a word in wy is found, ca is increased by 1. Each word in wy can be matched with a word in wx only once; if several words in wy can match one word in wx, the word that comes first in the segmentation order takes priority. inv is the number of inversions in the order of the wy positions matched in this way.
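The matching procedure for ca, cb and inv can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the optional `table` argument (standing in for the A1 correspondences) and the example words are assumptions, and the final formula for ss is deliberately left out because it is not reproduced in this text.

```python
from typing import Dict, List, Optional, Set, Tuple


def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long` in the same order."""
    it = iter(long)
    return all(ch in it for ch in short)


def conflicting(a: Set[str], b: Set[str]) -> bool:
    """Gate for ss = 0: at least one set is non-empty and they share nothing."""
    return bool(a or b) and not (a & b)


def greedy_match(
    wx: List[str], wy: List[str], table: Optional[Dict[str, str]] = None
) -> Tuple[int, int, int]:
    """Return (ca, cb, inv) for the word lists wx and wy.

    `table` is a hypothetical stand-in for the A1 abbreviation table.
    """
    table = table or {}

    def complete(a: str, b: str) -> bool:
        return a == b or table.get(a) == b or table.get(b) == a

    def partial(a: str, b: str) -> bool:
        return is_subsequence(a, b) or is_subsequence(b, a)

    used = [False] * len(wy)   # each word of wy may be matched only once
    order: List[int] = []      # wy positions in the order wx matched them
    ca = cb = 0
    for w in wx:
        # Earlier words of wy take priority; complete matches are tried first.
        hit = next((j for j, v in enumerate(wy)
                    if not used[j] and complete(w, v)), None)
        if hit is not None:
            cb += 1
        else:
            hit = next((j for j, v in enumerate(wy)
                        if not used[j] and partial(w, v)), None)
            if hit is None:
                continue
            ca += 1
        used[hit] = True
        order.append(hit)
    # Count inversions: pairs of matched positions that appear out of order.
    inv = sum(
        1
        for i in range(len(order))
        for k in range(i + 1, len(order))
        if order[i] > order[k]
    )
    return ca, cb, inv
```

For example, `greedy_match(["univ", "sci"], ["university", "of", "science"])` finds two partial matches in the original order, so it yields two partial matches, no complete match and no inversion.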

The beneficial effects of the invention are as follows. The invention uses the text features of academic institution names, combining Chinese with English, full names with abbreviations, and geographic location information, to design a reasonable name similarity calculation method; it corrects and completes the data by several complementary means and uses efficient rules and algorithms to complete the entity alignment of academic institution names. It addresses the lack of sufficient context information and of labeled data of known alignment relations: using only text similarity, the semantic information of words and geographic location information, and without relying on links to other entities such as authors or paper texts, it achieves a good alignment result for academic institution name entities, obtains the multiple different Chinese and English names and the unique geographic location of each institution, and finds the group of records corresponding to the same institution.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

Embodiments of the invention are described in further detail below with reference to specific examples, from which those skilled in the art will further appreciate that the subject matter of the invention is a text-feature-based entity alignment method for academic institution names. The invention can also be applied through different embodiments according to the actual problem; for example, different threshold parameters can be used for different data. The processes of the invention may also be modified and combined in method and detail according to practical circumstances.

Referring to FIG. 1, the invention comprises the following five steps: converting English abbreviations into English full names; correcting incorrect correspondences between English abbreviations and English full names; completing English full names and Chinese names by translation; correcting incorrect Chinese names; and merging academic institutions based on text features.

Step 1 of the invention, converting English abbreviations into English full names, performs information completion. For original data that contains an English abbreviation but no English full name, the English full name is generated automatically from the English abbreviation using a word replacement method based on geographic location, completing the missing English full name.

Because English abbreviations contain many words that cannot be translated automatically or understood intuitively, converting them into English full names improves the English-to-Chinese translation in step 3 and supports the similarity comparison of English full names in step 5.

The word replacement method based on geographic location comprises the following steps:

A1: Count word frequencies in the institutions' English abbreviations and construct a correspondence from the abbreviated words to their full words; when one abbreviated word corresponds to several full words, select the appropriate correspondence according to the geographic location. Common words are selected according to the counted frequencies and labeled manually, giving correspondences from 402 abbreviated words to full words.

A2: Find all substrings of the institution's English abbreviation that are abbreviated words, where the characters on both sides of each substring must be neither letters nor single quotation marks and the substring must not be contained in another abbreviated-word substring, and replace them according to the correspondence constructed in A1 to obtain the English full name. If one abbreviated word is a substring of another, the longer word is chosen for replacement; the requirement on both sides of the substring ensures that the abbreviated word is a separate word rather than part of a word. When the geographic location is missing, the most common full word can be chosen as the default replacement. Matching is done in lowercase, and the full word is kept in lowercase when it is substituted.
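The replacement in A2 can be sketched with a regular expression. This is a minimal sketch under stated assumptions: the table `ABBREV`, its entries, the location key `"usa"` and the function name are hypothetical and only illustrate the shape of the 402-entry table; the empty-string key holds the default expansion used when the location is missing or unknown.

```python
import re

# Hypothetical abbreviation table: short word -> {location: full word}.
# The "" key is the default (most common) expansion.
ABBREV = {
    "univ": {"": "university"},
    "inst": {"": "institute"},
    "acad": {"": "academy"},
    "sci": {"": "science"},
    "technol": {"": "technology"},
    "nw": {"": "northwest", "usa": "northwestern"},
}


def expand_abbreviation(short_name: str, location: str = "") -> str:
    """Replace each known short word in an abbreviated name with its full word."""
    # Longest alternatives first, so that a short word contained in a longer
    # short word never wins over the longer one.
    keys = sorted(ABBREV, key=len, reverse=True)
    # A match may not touch a letter or a single quote on either side, which
    # restricts replacement to standalone words.
    pattern = re.compile(
        r"(?<![a-z'])(" + "|".join(map(re.escape, keys)) + r")(?![a-z'])"
    )

    def pick(match):
        options = ABBREV[match.group(1)]
        # Fall back to the default expansion when the location is unknown.
        return options.get(location, options[""])

    # Matching and output are both lowercase, as described in A2.
    return pattern.sub(pick, short_name.lower())
```

With this table, `expand_abbreviation("NW Univ", "usa")` returns `"northwestern university"`, while the same call without a location falls back to `"northwest university"`. The lookaround assertions are one way to express the "neither a letter nor a single quotation mark on both sides" rule without consuming the neighboring characters.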

Step 2 of the invention, correcting incorrect correspondences between English abbreviations and English full names, performs information correction. For original data that contains both an English abbreviation and an English full name, it judges whether the correspondence between the two is correct and splits incorrect correspondences to correct the error; when splitting, the English abbreviation is input to step 1 as one record, the English full name and the Chinese name form the other record, and both keep the same geographic location.

Correcting incorrect correspondences in this step reduces the propagation of errors from the original data into the entity alignment of step 5. Since it is observed that the Chinese name and the geographic location generally correspond correctly to the English full name, a record (jc, qc, zh, pos) with an incorrect correspondence is split as follows: if jc does not appear in other data, the split record (jc, "", "", pos) is input to step 1 as new data; if qc does not appear in other data, the split record ("", qc, zh, pos) is retained.
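The split can be sketched as below. This is an illustration only: it assumes the split produces an abbreviation-only record and a record holding the full name and the Chinese name, both keeping the location, as step 2 describes; the `Record` type and the function and parameter names are hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Record(NamedTuple):
    abbr: str  # English abbreviation (jc)
    full: str  # English full name (qc)
    zh: str    # Chinese name (zh)
    pos: str   # geographic location (pos)


def split_mismatched(
    rec: Record, other_abbrs: Set[str], other_fulls: Set[str]
) -> Tuple[List[Record], List[Record]]:
    """Split a record whose abbreviation and full name do not correspond.

    Returns (records to feed into step 1, records to keep as they are).
    """
    to_step1: List[Record] = []
    kept: List[Record] = []
    # The abbreviation goes back to step 1 only if no other record has it.
    if rec.abbr not in other_abbrs:
        to_step1.append(Record(rec.abbr, "", "", rec.pos))
    # The full name keeps the Chinese name, since the two usually agree.
    if rec.full not in other_fulls:
        kept.append(Record("", rec.full, rec.zh, rec.pos))
    return to_step1, kept
```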

The method for judging whether the English abbreviation and the English full name correspond correctly comprises the following steps:

B1: Split the English abbreviation and the English full name of the institution into words, using every character that is neither a letter nor a single quotation mark as a separator, and convert them to lowercase, obtaining the abbreviation words wj and the full-name words wq; concatenate the initial letters of all words in wq into the acronym ws of the English full name. The number of words in the English abbreviation, i.e. in wj, is nj, and the number of words in the English full name, i.e. in wq, is nq.
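A minimal sketch of the splitting in B1 follows. The function names are illustrative, and the sketch assumes names written in ASCII letters, so any other character (including accented letters) acts as a separator.

```python
import re
from typing import List, Tuple


def split_words(name: str) -> List[str]:
    """Lowercase a name and split it on anything that is not a letter or '."""
    return [w for w in re.split(r"[^a-z']+", name.lower()) if w]


def prepare(abbr: str, full: str) -> Tuple[List[str], List[str], str]:
    """Return (wj, wq, ws) as defined in B1."""
    wj = split_words(abbr)
    wq = split_words(full)
    ws = "".join(w[0] for w in wq)  # acronym built from the word initials
    return wj, wq, ws
```

For instance, `prepare("ECNU", "East China Normal University")` gives the single abbreviation word `ecnu`, four full-name words, and the acronym `ecnu`; nj and nq are then simply `len(wj)` and `len(wq)`.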

B2: Using wj, wq and ws obtained in B1, calculate the similarity sim between the English abbreviation and the English full name and set a threshold v; if sim < v, the English abbreviation and the English full name are judged not to correspond correctly; otherwise the correspondence is correct. If the English abbreviation, with only its English letters retained, is a subsequence of the English full name, then sim = 1; otherwise sim is calculated by the following formula:

[formula image not reproduced in this text]

where cnta, cntb, cntc and cntd represent the degree of association between the words of the English abbreviation and the words of the English full name: cnta and cntb are the numbers of words in wj that find a partial and a complete correspondence in wq, respectively; cntc and cntd are the numbers of words in wq that find a partial and a complete correspondence in wj, respectively; and 0 < e < 1 is the weight of a partial correspondence.

Two words are called a complete match if they are identical or satisfy one of the 402 correspondences in A1, a partial match if one is a subsequence of the other, and a permuted match if one, after its characters are reordered, is a subsequence of the other.

cnta, cntb, cntc and cntd are initially 0. For each word in wj: if it completely matches a word in wq, cntb is increased by 1; otherwise, if it completely or partially matches ws, cntb is increased by 1 and cntd is increased by the word length; otherwise, if it partially matches a word in wq, cnta is increased by 1; otherwise cntc is increased by the word length. For each word in wq: if it completely matches a word in wj, cntd is increased by 1; otherwise, if it partially matches a word in wj, cntc is increased by 1.
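The three kinds of word match and the sim = 1 shortcut can be sketched as follows. The function names and the `table` argument (a stand-in for the A1 correspondences) are assumptions, and the permuted match is read here as multiset containment of characters, which is what "a subsequence after reordering" amounts to. The counters and the sim formula itself are not sketched.

```python
from collections import Counter
from typing import Dict, Optional


def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long` in the same order."""
    it = iter(long)
    return all(ch in it for ch in short)


def complete_match(a: str, b: str, table: Optional[Dict[str, str]] = None) -> bool:
    """Identical words, or a pair listed in the (hypothetical) A1 table."""
    table = table or {}
    return a == b or table.get(a) == b or table.get(b) == a


def partial_match(a: str, b: str) -> bool:
    """One word is a subsequence of the other."""
    return is_subsequence(a, b) or is_subsequence(b, a)


def permuted_match(a: str, b: str) -> bool:
    """One word's characters, in any order, are all contained in the other."""
    ca, cb = Counter(a), Counter(b)
    return not (ca - cb) or not (cb - ca)


def is_trivially_similar(abbr: str, full: str) -> bool:
    """Shortcut for sim = 1: the letters of abbr form a subsequence of full."""
    def letters(s: str) -> str:
        return "".join(ch for ch in s.lower() if "a" <= ch <= "z")
    return is_subsequence(letters(abbr), letters(full))
```

The shortcut catches initialisms directly: `is_trivially_similar("E.C.N.U.", "East China Normal University")` is true because `e`, `c`, `n`, `u` occur in that order in the full name.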

Step 3 of the invention, completing English full names and Chinese names by translation, performs information completion. For the data without a Chinese name after completion in step 1, the data without a Chinese name after correction in step 2, and the original data that has only an English full name but neither an English abbreviation nor a Chinese name, the missing Chinese name is completed by English-to-Chinese translation; for original data that contains a Chinese name but no English full name, the missing English full name is completed by Chinese-to-English translation.

In this step, several institution names are separated by line-feed characters and submitted in a single translation query, which improves efficiency. It is observed that some institutions have very different English names but similar Chinese names, while others have very different Chinese names but similar English names, so completing the bilingual name information by translation helps the comparison and merging of institutions in step 5.
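The batching can be sketched as below. The `translate` callable is a placeholder for whatever translation service is used, which the text does not name; the batch size and the one-by-one fallback when the reply does not split back into the same number of lines are assumptions added for illustration.

```python
from typing import Callable, Iterable, List


def translate_names(
    names: Iterable[str],
    translate: Callable[[str], str],  # hypothetical: text in, translated text out
    batch_size: int = 50,             # assumed value, not taken from the text
) -> List[str]:
    """Translate names in newline-joined batches to reduce the number of queries."""
    names = list(names)
    out: List[str] = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        reply = translate("\n".join(batch)).split("\n")
        if len(reply) != len(batch):
            # The service merged or split lines; query this batch one by one.
            reply = [translate(name) for name in batch]
        out.extend(reply)
    return out
```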

Step 4 of the invention, correcting incorrect Chinese names, performs information correction. For the data corrected in step 2 and the original data that contain a Chinese name, and for the data whose Chinese name was completed by English-to-Chinese translation in step 3, incorrect Chinese names are identified and corrected using the suffix frequency statistics correction method.

Because of misplaced extraction boundaries, some records carry content at either end that is not part of the institution name; some institution names carry overly detailed sub-institution names, of which only the leading parent institution name needs to be kept; and some institution names retain many English words after translation because the translation was incomplete. This step identifies and corrects these cases, which reduces incorrect correspondences between Chinese and English institution names and improves the accuracy of entity alignment.

The suffix frequency statistics correction method comprises the following steps:

D1: Since an institution's Chinese name is usually not too short, cannot contain too high a proportion of English letters, and should not end with non-Chinese characters, the check is made from these three angles. Chinese names whose length is less than the threshold k are deleted from the data as too short; Chinese names whose proportion of English letters exceeds the threshold p are deleted; Chinese names that end with non-Chinese characters are corrected by deleting the trailing non-Chinese characters; and any Chinese name whose length then falls below k is deleted.
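The three checks in D1 can be sketched as one function. The default values of k and p are placeholders, since the text gives no concrete thresholds, and "Chinese character" is approximated here by the basic CJK Unified Ideographs range; both are assumptions.

```python
import re
from typing import Optional

_NON_CJK_TAIL = re.compile(r"[^\u4e00-\u9fff]+$")  # trailing non-Chinese run
_LATIN = re.compile(r"[A-Za-z]")


def clean_chinese_name(name: str, k: int = 4, p: float = 0.3) -> Optional[str]:
    """Apply the D1 checks; return the cleaned name, or None if it is deleted.

    k (minimum length) and p (maximum share of English letters) are
    illustrative defaults only.
    """
    if not name or len(name) < k:
        return None                      # too short
    if len(_LATIN.findall(name)) / len(name) > p:
        return None                      # too many English letters
    name = _NON_CJK_TAIL.sub("", name)   # drop the non-Chinese ending
    if len(name) < k:
        return None                      # too short after the correction
    return name
```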

D2: According to observation, institutions' Chinese names usually fall into only a few specific types, so whether a Chinese name is correct is judged from its suffix. The frequencies of suffixes of different lengths in institutions' Chinese names are counted, and the common suffixes of institution Chinese names are determined automatically by a recursive method, as follows:

D22: Split the suffixes in the initial set whose longer forms are uniformly distributed and delete the suffixes that occur too rarely, until no suffix in the set is uniformly distributed or too rare. Suppose the suffix x has length i and occurs c times in the institution data, and the n suffixes b1, b2, …, bn of length i+1 that end with x occur c1, c2, …, cn times in the institution data, with c1 ≥ c2 ≥ … ≥ cn and c = c1 + c2 + … + cn; each of b1, b2, …, bn is x preceded by one character. If

[formula image not reproduced in this text]

the suffix x occurs too rarely and is deleted from the set. If the number of categories n is greater than the threshold t and the suffixes b1, b2, …, bn are uniformly distributed, the suffix x is split: x is deleted from the set and b1, b2, …, bn are added to it. To measure whether the suffix counts are uniformly distributed, the information entropy v1 and the high-frequency suffix proportion v2 are calculated. The larger v1 is, the greater the uncertainty and the more uniform the distribution; the smaller v2 is, the smaller the share taken by the high-frequency suffixes and the more uniform the distribution. Thresholds t1 and t2 are set, and the distribution is judged uniform if v1 > t1 and v2 < t2. The formulas for v1 and v2 are:

[formula image not reproduced in this text]
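The counting behind D22 can be sketched as follows. Because the formulas for v1 and v2 are not reproduced in this text, the sketch only shows how the counts c1, …, cn of the longer suffixes can be gathered, and uses the standard Shannon entropy in base 2 as an assumed form of v1; the actual definition, the logarithm base and the definition of v2 may differ. The function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def child_suffix_counts(names: Iterable[str], x: str) -> Dict[str, int]:
    """Count the suffixes of length len(x) + 1 that end with the suffix x."""
    n = len(x) + 1
    counts: Counter = Counter()
    for name in names:
        if len(name) >= n and name.endswith(x):
            counts[name[-n:]] += 1
    return dict(counts)


def shannon_entropy(counts: Iterable[int]) -> float:
    """Shannon entropy (base 2) of a count distribution; assumed form of v1."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)
```

A suffix such as a single character that is preceded by many different characters with similar frequency gives a high entropy, which is the situation in which D22 splits it into its longer forms.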

D3: A Chinese name that does not end with a common suffix determined in D2 is changed to end with a common suffix by deleting a trailing substring. If the corrected Chinese name is not too short, i.e. its length is not below the threshold k, the correction is kept; otherwise the Chinese name is deleted. After correction, all remaining institution Chinese names end with one of the common suffixes obtained in D22.

In step 5, academic institutions are merged based on text features: using the data completed by translation in step 3 and corrected in step 4, multiple different records of the same institution are merged. This completes the entity alignment of academic institution names and yields, for each institution, its different Chinese and English names and a unique geographic position.

Entity alignment is performed on the completed and corrected data. Records that obviously belong to the same institution are first labeled with the same ID and merged for a preliminary alignment; similarities are then compared for further alignment; finally, the group of records corresponding to each institution is aggregated. Combining Chinese and English makes it easier to find more alignment relations, and combining geographic position information with number words markedly reduces text-similarity errors for institutions that differ only in address or number, which are otherwise hard to tell apart.

Multiple different records of the same institution are merged as follows:

E1: Records with the same Chinese name and geographic position, and records with no Chinese name but the same English full name and geographic position, are each treated as one group belonging to the same institution and labeled with a unique ID. If the groups corresponding to two different IDs have the same geographic position and contain records with the same English full name, they are changed to the same ID. This preliminary merge is carried out with a union-find (disjoint-set) algorithm.

After the previous steps, all data used in E1 contain an English full name and most contain a Chinese name; English abbreviations are too short to be reliable and are not considered here. Because a Chinese name tends to aggregate more records than an English full name, Chinese names are used first for ID labeling and merging.

The union-find algorithm works as follows. T trees are created, one rooted at each ID. When two records x1 ∈ ID1 and x2 ∈ ID2 with ID1 ≠ ID2 are found to have the same English full name, the trees tree1 and tree2 containing ID1 and ID2 are merged by making the root of tree1 a child of the root of tree2. Each time the root of the tree containing some ID is queried, every node on the path to the root is re-attached directly to the root, which compresses the depth of the tree. After all merge operations are complete, all IDs in each resulting tree are changed to the same ID.
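The merge described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class and function names are hypothetical, full names are compared case-insensitively as an assumption, and the check that both groups share a geographic position is omitted for brevity.

```python
class UnionFind:
    """Disjoint-set forest over institution IDs with path compression."""

    def __init__(self, ids):
        # Each ID starts as the root of its own tree.
        self.parent = {i: i for i in ids}

    def find(self, i):
        # Locate the root, then re-attach every node on the path to it.
        root = i
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[i] != root:
            self.parent[i], i = root, self.parent[i]
        return root

    def union(self, id1, id2):
        # Make the root of id1's tree a child of the root of id2's tree.
        r1, r2 = self.find(id1), self.find(id2)
        if r1 != r2:
            self.parent[r1] = r2


def merge_by_full_name(records):
    """records: iterable of (institution_id, english_full_name) pairs.

    Returns a dict mapping every ID to the representative ID of its tree.
    """
    records = list(records)
    uf = UnionFind(rid for rid, _ in records)
    first_id_for_name = {}
    for rid, name in records:
        key = name.lower()  # assumption: case differences are ignored
        if key in first_id_for_name:
            uf.union(rid, first_id_for_name[key])
        else:
            first_id_for_name[key] = rid
    return {rid: uf.find(rid) for rid, _ in records}
```

After all unions, calling `find` on every ID gives the single representative ID that each tree is relabeled with.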

E2: For any two IDs, ID1 and ID2, whose groups of records have the same geographic position and share common words, the similarity s between the two groups is calculated from the text features of the English full names and from the geographic position words and number words in the names. If the similarity exceeds a threshold, i.e. s > pp, the two groups are considered to represent the same institution and are changed to the same ID; this further merge is again carried out with the union-find algorithm. Finally, the set of English abbreviations, English full names, Chinese names and the unique address corresponding to each ID constitutes the complete information of the institution.

In this step, the English full name is used for the similarity calculation because every record has one. Because the number of institutions is large, the institution IDs corresponding to each word are computed in advance so that IDs that may represent the same institution can be found quickly; this reduces the number of comparisons needed for entity alignment by about 99%. Judging whether two institution names are the same from text similarity alone makes it hard to detect the small textual differences introduced by geographic position words and number words, even though such words change the meaning of a name substantially. Number words and geographic position words are therefore extracted from the English full name for use in the similarity calculation.
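One way to realize the word-to-ID lookup is an inverted index, sketched below. The function name and the input layout are assumptions made for illustration; the text does not say how very common words such as "university" are handled, so this sketch simply pairs every two IDs that share any word.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names_by_id):
    """names_by_id: {institution_id: [english_full_name, ...]}.

    Returns the set of ID pairs that share at least one word, which are
    the only pairs that need a full similarity calculation.
    """
    ids_by_word = defaultdict(set)
    for inst_id, names in names_by_id.items():
        for name in names:
            for word in name.lower().split():
                ids_by_word[word].add(inst_id)

    pairs = set()
    for ids in ids_by_word.values():
        # Every two IDs listed under the same word become a candidate pair.
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Pairs of IDs with no word in common never enter the similarity calculation, which is where the reduction in comparisons comes from.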

The number words are the words extracted from the English full name that contain any of the following substrings:

['zero','one','two','three','four','five','six','seven','eight','nine','ten','first','second','thir','fif','eleven','twelve','twenty','hundred','thousand','million','billion','1','2','3','4','5','6','7','8','9','0']
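A minimal sketch of this extraction, assuming the name is lowercased and split on whitespace (the tokenization is not specified in the text, and the function name is hypothetical):

```python
NUMBER_SUBSTRINGS = [
    'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
    'nine', 'ten', 'first', 'second', 'thir', 'fif', 'eleven', 'twelve',
    'twenty', 'hundred', 'thousand', 'million', 'billion',
    '1', '2', '3', '4', '5', '6', '7', '8', '9', '0',
]


def number_words(full_name):
    """Return the words of an English full name that contain a number substring."""
    words = full_name.lower().split()
    return [w for w in words if any(s in w for s in NUMBER_SUBSTRINGS)]
```

Because the test is plain substring containment, words such as "network" (which contains "two") are also returned; the text does not describe any additional filtering.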

Geographic position words are of two kinds: words matched against a lexicon of 6213 place names, and words that consist of several pinyin syllables, identified with a dynamic programming algorithm over the set of all Chinese pinyin syllables enumerated according to pinyin rules.

Whether a word consists of the pinyin of several characters is judged by dynamic programming as follows. Every substring of the word that may represent the pinyin of a single character is considered. Such a substring is marked true if it starts at position 1, and false otherwise. A substring marked false is then marked true if some substring already marked true ends immediately before its start position. If a substring marked true ends at the last position of the word, the word can be written as the pinyin of several characters.
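The dynamic programming check can be sketched as below. The syllable set is a tiny illustrative subset, not the full table the method relies on, and requiring at least two syllables is one reading of "several"; both are assumptions, as is the function name.

```python
# Illustrative subset only; a complete pinyin syllable table is needed in practice.
PINYIN_SYLLABLES = {
    "bei", "jing", "guang", "zhou", "jiao", "tong",
    "shang", "hai", "xi", "an", "nan",
}


def is_multi_pinyin(word, syllables=PINYIN_SYLLABLES, max_len=6):
    """Return True if `word` can be split into two or more pinyin syllables."""
    word = word.lower()
    n = len(word)
    # best[e] = largest number of syllables that exactly spell word[:e],
    # or -1 if word[:e] cannot be spelled with syllables at all.
    best = [-1] * (n + 1)
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] >= 0 and word[start:end] in syllables:
                best[end] = max(best[end], best[start] + 1)
    return best[n] >= 2
```

Here `best[end]` plays the role of the true/false mark in the description: a prefix is reachable only if a valid syllable ends there and the prefix before that syllable is itself reachable.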

The similarity s between two groups of data is then obtained by calculating the similarity of each pair of institution records and averaging:

where x1, x2, …, xq are the q records of ID1, y1, y2, …, ym are the m records of ID2, and ss(xi, yj) denotes the one-way similarity between the institution records xi and yj. If the intersection of the geographic position information and number words of xi and yj is empty, ss(xi, yj) = 0; otherwise ss(xi, yj) is calculated as a weighted sum of the matching degree and the inversion pairs:

Here wx and wy are the word lists obtained by segmenting the English full names of xi and yj after removing the geographic position words and number words that xi and yj have in common; nx and ny are the numbers of words in wx and wy; ca and cb are the numbers of words in wx for which a partial and a complete correspondence, respectively, is found in wy; inv is the number of inversion pairs in the order in which the words of wx are matched in wy; 0 < ee < 1 is the relative weight of partial versus complete correspondences; and 0 < z < 1 is the relative weight of the similarity versus the inversion pairs.

Two words form a complete match if they are identical or satisfy one of the 402 correspondences in step A1, and a partial match if one is a subsequence of the other.

ca, cb and inv are calculated as follows. ca and cb are initially 0. For each word in wx, a match is sought in wy with a greedy strategy: if a complete match with a word in wy is found, cb is incremented by 1; otherwise, if a partial match with a word in wy is found, ca is incremented by 1. Each word in wy can be matched with at most one word in wx; if several words in wy can match a word in wx, the one that comes first in word order takes priority. inv is the number of inversion pairs in the resulting matching order. For example, when the 1st, 2nd and 4th words in wx match the 5th, 1st and 3rd words in wy, inv = 2, because the sequence 5, 1, 3 contains 2 inversion pairs.
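The counting procedure can be sketched as follows. The `full_forms` dictionary stands in for the A1 correspondence table and maps each abbreviated word to a single full word, which is a simplification; all names here are hypothetical, and the final weighted formula for ss is not reproduced.

```python
def is_subsequence(short, long):
    """True if the characters of `short` appear in `long` in the same order."""
    it = iter(long)
    return all(ch in it for ch in short)


def match_kind(a, b, full_forms):
    """Classify two words as 'complete', 'partial' or None (no match)."""
    if a == b or full_forms.get(a) == b or full_forms.get(b) == a:
        return "complete"
    if is_subsequence(a, b) or is_subsequence(b, a):
        return "partial"
    return None


def count_inversions(seq):
    """Number of pairs (i, j) with i < j and seq[i] > seq[j]."""
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def greedy_match(wx, wy, full_forms):
    """Return (ca, cb, inv) for word lists wx and wy."""
    ca = cb = 0
    used = set()   # indices of wy that are already matched
    order = []     # matched wy indices, in the order of the words in wx
    for a in wx:
        hit = None
        # Prefer a complete match; within one kind, the earliest word in wy wins.
        for kind in ("complete", "partial"):
            for j, b in enumerate(wy):
                if j not in used and match_kind(a, b, full_forms) == kind:
                    hit = (j, kind)
                    break
            if hit is not None:
                break
        if hit is None:
            continue
        j, kind = hit
        used.add(j)
        order.append(j)
        if kind == "complete":
            cb += 1
        else:
            ca += 1
    return ca, cb, count_inversions(order)
```

For the sequence 5, 1, 3 from the text, `count_inversions([5, 1, 3])` returns 2, matching the stated value of inv.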

Examples

Referring to fig. 1, four types of institution data can be extracted from the English text data source: records containing only an English abbreviation, only an English full name, both an English abbreviation and a full name, and both Chinese and English names, labeled (1), (2), (3) and (4) respectively. Two types can be extracted from the Chinese text data source: records containing only a Chinese name, and records containing both Chinese and English names, labeled (5) and (6) respectively. Data labeled (1) have their English full names completed by step 1, and the result is labeled (7). Data labeled (3), (4) and (6) have the correspondence between English abbreviation and full name corrected by step 2, and the results are labeled (8), (9) and (9) respectively. Data labeled (2), (7) and (8) are translated into Chinese names by step 3 and the Chinese names are corrected by step 4; the result is labeled (9). Data labeled (5) have their Chinese names translated into English full names by step 3, and the result is labeled (9). After steps 1, 2, 3 and 4, all data carry the label (9); except for a small amount of erroneous data whose Chinese names have been removed, every record has both a Chinese name and an English full name. The data labeled (9) are ID-labeled and merged in step 5 to obtain the entity alignment result, in which the group of records corresponding to each ID represents the same institution.

The detailed procedure of the academic institution name entity alignment method follows the flow of fig. 1 and includes the following steps:

Step 1: converting English abbreviations to English full names

A1: First, based on the abbreviated words that are common in institution names, 402 correspondences between abbreviated words and full words are annotated; for example, sci corresponds to science, OR corresponds to Oregon, and inst corresponds to institute. Some of these correspondences depend on geographic location, e.g. WA corresponds to Washington in the United States and to Western Australia in Australia.

A2: For institution data that contain only an English abbreviation, such as ('George Wa Univ', 'United States'), the English full name is completed according to the correspondence rules. Using the correspondences between abbreviated words and full words from A1, the positions of all abbreviated words in the institution's English abbreviation are found, ignoring letter case; the characters on either side of an abbreviated word must not be single quotation marks or English letters. For example, in George Wa Univ, Wa appears at (8, 9) and Univ appears at (11, 14); or appears at (3, 4), but position 2 is the letter e, so it is not an independent word.

These words are replaced by the correct full words, taking the institution's geographic location into account: Univ is replaced by University and, according to the geographic location, Wa is replaced by Washington, giving the full name 'George Washington University'. The completed record is ('George Wa Univ', 'George Washington University', 'United States').
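Steps A1 and A2 can be sketched with a regular expression whose lookarounds enforce the boundary rule. The table below is a hypothetical five-entry excerpt, not the 402 annotated correspondences, and the function name and table layout are assumptions.

```python
import re

# Hypothetical excerpt of the abbreviation table. A value is either a full
# word or a mapping from geographic location to the full word.
ABBREVIATIONS = {
    "univ": "University",
    "inst": "Institute",
    "sci": "Science",
    "or": "Oregon",
    "wa": {"United States": "Washington", "Australia": "Western Australia"},
}


def expand_abbreviation(short_name, location, table=ABBREVIATIONS):
    """Replace abbreviated words in `short_name` with their full words."""

    def replace(match):
        entry = table[match.group(0).lower()]
        if isinstance(entry, dict):
            # Location-dependent entry; keep the word if the location is unknown.
            return entry.get(location, match.group(0))
        return entry

    keys = sorted(table, key=len, reverse=True)
    # An abbreviated word must not touch an English letter or a single quote.
    pattern = re.compile(
        r"(?<![A-Za-z'])(?:" + "|".join(map(re.escape, keys)) + r")(?![A-Za-z'])",
        re.IGNORECASE,
    )
    return pattern.sub(replace, short_name)
```

With this table, `expand_abbreviation('George Wa Univ', 'United States')` leaves the "or" inside "George" untouched because it is surrounded by letters, and expands the other two words.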

Step 2: correcting erroneous correspondences between English abbreviations and English full names

For data containing both an English abbreviation and an English full name, the text-feature similarity between the two is calculated, and erroneous data with low similarity are corrected.

For example, given x1 = ('caps Med Univ', 'Peking University', 'Beijing university', 'Beijing') and x2 = ('Univ Pompeu Fabra UPF', 'Pompeu Fabra University', 'poincare university', ''), x1 is an incorrect correspondence and x2 is a correct one.

B1: The English abbreviation and the English full name are segmented into words to obtain wj and wq, and the acronym ws of the English full name is formed. For x1, wj = (caps, med, univ), wq = (peking, university) and ws = pu; for x2, wj = (univ, pompeu, fabra, upf), wq = (pompeu, fabra, university) and ws = pfu;

B2: The similarity sim between the English abbreviation and the English full name is calculated from the text features of the segmented words:

In this specific example, the weight is e = 0.5 and the threshold is v = 0.25. For x1, univ corresponds to university, so cnta = 1, cntb = 0, cntc = 1, cntd = 0, nj = 3 and nq = 2, and the resulting similarity falls below the threshold v.

For x2, univ, pompeu, fabra and upf correspond to university, pompeu, fabra and pfu respectively, so cnta = 2, cntb = 2, cntc = 4, cntd = 2, nj = 4 and nq = 3, and the resulting similarity exceeds the threshold v.

It can be seen that the similarity readily distinguishes whether the correspondence between an English abbreviation and an English full name is correct. The incorrectly corresponding record x1 can be split into two records, ('caps Med Univ', '', '', 'Beijing') and ('', 'Peking University', 'Beijing university', 'Beijing').

Step 3: completing English full names and Chinese names by translation

English full names are translated into Chinese names, and Chinese names into English full names, by calling the Baidu Translate API. To improve query efficiency, each query packs multiple institution names separated by line feeds, up to 4800 characters for English and 1600 characters for Chinese, without exceeding the maximum query length. The translation results corresponding to each line are returned in the trans_result field of the response, and a translation speed of about 200 institution names per second can be achieved.
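The batching of names into line-feed-separated queries can be sketched as below. The function name is hypothetical and no translation service is called; only the packing under a character limit (4800 for English, 1600 for Chinese in the text) is shown.

```python
def batch_names(names, max_chars):
    """Pack names into newline-separated queries of at most `max_chars` characters.

    A single name longer than `max_chars` is placed in a batch of its own.
    """
    batches, current, size = [], [], 0
    for name in names:
        extra = len(name) + (1 if current else 0)  # +1 for the line feed
        if current and size + extra > max_chars:
            batches.append("\n".join(current))
            current, size = [], 0
            extra = len(name)
        current.append(name)
        size += extra
    if current:
        batches.append("\n".join(current))
    return batches
```

Because each name occupies exactly one line of a query, the i-th line of a response can be paired with the i-th name of the batch that produced it.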

Step 4: correcting wrong Chinese names

In the translation results of step 3, a small amount of data fails to translate and yields wrong Chinese names, and a small amount yields errors because the English full name contains redundant information; the Chinese names in the original data also contain a small amount of redundant information that produces errors. This step corrects Chinese names and identifies wrong ones according to the characteristics of institution Chinese names. For example, 'University Paris 10' is translated as 'Paris University 10' because the English data contain a redundant suffix, and 'Rettew Associates Incorporated' being translated as 'retewassociates' is a translation failure.

D1: A Chinese name should not end with non-Chinese characters, should not contain too many non-Chinese characters, and should not be too short. In this specific example, the length threshold is k = 3 and the English character ratio threshold is p = 0.75. Accordingly, the Chinese name of length 2 in ('Tec', 'technology', '') is deleted, ('', 'Rettew Associates Incorporated', 'retnewassociates', '') is changed to ('', 'Rettew Associates Incorporated', '', ''), and 'Paris University 10' is changed to 'Paris University'.
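A minimal sketch of the D1 checks follows. The order of the checks, the use of the CJK Unified Ideographs range to decide what counts as a Chinese character, and deleting a name when its non-Chinese ratio exceeds p are all assumptions. The Chinese strings in the usage note are back-translations chosen for illustration and do not come from the text.

```python
def is_chinese(ch):
    # Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def clean_chinese_name(name, k=3, p=0.75):
    """Return the cleaned Chinese name, or None if the name should be deleted."""
    # Strip trailing non-Chinese characters.
    end = len(name)
    while end > 0 and not is_chinese(name[end - 1]):
        end -= 1
    name = name[:end]
    # Delete names that are too short.
    if len(name) < k:
        return None
    # Delete names dominated by non-Chinese characters.
    non_chinese = sum(1 for ch in name if not is_chinese(ch))
    if non_chinese / len(name) > p:
        return None
    return name
```

Under these assumptions, a name such as '巴黎大学10' is trimmed to '巴黎大学', while a purely Latin string or a two-character name is rejected.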

D2: Based on the result of D1, the occurrence frequencies of suffixes of different lengths of the institution Chinese names are counted, and the common suffixes of institution Chinese names are determined automatically by a recursive method. In this specific example, the suffix-type threshold is t = 10, the threshold of the information entropy v1 is t1 = 2, and the threshold of the high-frequency suffix ratio v2 is t2 = 0.477. Automatic statistics yield 266 common suffixes: the most frequent suffix occurs 1127185 times, followed by 'center' (289467 times), 'hospital' (272157 times), 'school' (167655 times), a further suffix (153338 times), and so on.

D3: For a Chinese name processed by D1, if its suffix is one of the common suffixes obtained in D2, it is identified as a correct name. Otherwise, if deleting a trailing part makes the name end with a common suffix, the name is corrected; for example, 'Rogowski State University Communication' is corrected to 'Rogowski State University'. Chinese names that cannot be corrected, or that are too short after correction, are deleted, so that all remaining Chinese names are correct and end with a common suffix.
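The D3 correction can be sketched as follows. Keeping the longest prefix that ends with a common suffix is an assumption, since the text does not say which trailing part is deleted when several are possible; the function name, the suffix set and the placeholder names in the usage note are hypothetical.

```python
def correct_by_suffix(name, common_suffixes, k=3):
    """Trim `name` so that it ends with a common suffix.

    Returns the (possibly unchanged) name, or None if it cannot be
    corrected or would be shorter than k characters after correction.
    """
    # Try the full name first, then progressively shorter prefixes.
    for end in range(len(name), 0, -1):
        candidate = name[:end]
        if any(candidate.endswith(s) for s in common_suffixes):
            return candidate if len(candidate) >= k else None
    return None
```

For example, with the suffixes {'大学', '中心', '医院'}, the placeholder name '某某州立大学通讯' is trimmed to '某某州立大学', and a name that already ends with a common suffix is returned unchanged.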

Step 5: academic institution merging based on text features

After the completion and correction in the previous steps, all data contain both an English full name and a Chinese name, except for a small amount of data found to be wrong in step 4 that have no Chinese name. Entity alignment can therefore be carried out by combining the Chinese and English names with the geographic positions.

E1: The data are labeled with IDs, and different records that clearly represent the same institution are merged. Because translation maps many different English full names of the same institution to the same Chinese name, all records with the same Chinese name are first labeled with the same ID, and records with no Chinese name but the same English full name are labeled with the same ID; finally, IDs that share the same English full name are merged with the union-find algorithm. For example, ('09547', 'Mcmaster Unvers', 'marst university') and ('15852', 'Mcmaster Univ', 'Mcmasters University', 'marst university') are merged because their translated Chinese names are the same, and ('33886', 'Beijing Jiao Tong University', 'beijing transportation university') and ('33945', 'Beijing Jiao tong University', 'Beijing Jiao tong University', 'beijing intersection') are merged because their English full names are the same.

E2: The alignment relations obtained in E1 are further extended. Because most Chinese names are produced by translation and not every record has one, whereas every record has an English full name, the text similarity of the English full names is used to judge whether the two groups of records corresponding to two IDs represent the same institution:

For the two groups of institution records corresponding to two IDs, the similarity is calculated pairwise and then averaged; the geographic position words and number words of each institution must be extracted first. In this specific example, the threshold is pp = 0.655 and the weights are ee = 0.5 and z = 0.8. For example, for ('60516', 'Guangzhou First People Hosp', 'Guangzhou First People Hosp', ['Guangzhou'], ['first']) the geographic position word is Guangzhou and the number word is first, i.e. 1. This record and the group {('51140', 'Guangzhou 1st Municipal Peoples Hosp', 'Guangzhou 1st Municipal Peoples Hospital', 'the first people hospital in Guangzhou', ['Guangzhou'], ['1']), ……} belong to the same institution; the similarity calculated by the above formula is 0.8125 > 0.655, so they can be merged.

After academic institution IDs have been matched and merged by the above process, the group of records corresponding to each ID represents one institution, and different IDs correspond to different institutions. Collecting all Chinese names, English full names and English abbreviations among the records of one ID gives the institution's various names and its unique geographic position, which is the final result of academic institution name entity alignment.

In summary, the invention provides a framework for entity alignment of academic institution names. It derives a suitable processing flow from the actual conditions and characteristics of academic institution data, completes the entity alignment task in 5 steps, and uses the alignment result in the construction of academic knowledge graphs. Through a series of methods based on text features, geographic position information and the combination of Chinese and English, erroneous data are corrected and missing data are completed, overcoming the difficulties of scarce labeled data and the absence of contextual semantic information. The resulting good entity alignment helps build a better academic knowledge graph and improves related applications and literature search.

Those skilled in the art can make equivalent modifications and changes to the specific implementation details of the present invention on the basis of the main idea of the process set forth herein, without departing from the spirit and scope of the inventive concept, and such modifications are intended to be covered by the appended claims.

Claims

1. An academic institution name entity alignment method based on text features, which is characterized by comprising the following specific steps:

Step 1: converting English abbreviations to English full names

Step 3: completing English full names and Chinese names by translation

Step 4: correcting wrong Chinese names

Step 5: academic institution merging is performed based on the text features,

2. The text feature-based academic institution name entity alignment method of claim 1, wherein the geographic location-based word replacement method of step 1 specifically comprises:

3. The method for aligning academic institution name entities based on text features according to claim 1, wherein determining in step 2 whether the English abbreviation corresponds correctly to the English full name specifically comprises:

B2: calculating the similarity sim between the English abbreviation and the English full name using wj, wq and ws obtained in step B1, and setting a threshold v; if the similarity is smaller than the threshold, i.e. sim < v, the English abbreviation and the English full name are judged to correspond incorrectly; otherwise the correspondence is correct.

4. The method for aligning academic institution name entities based on text features according to claim 3, wherein calculating the similarity sim between the English abbreviation and the English full name in B2 specifically comprises:

wherein nj is the number of words in the English abbreviation, i.e. the number of words in wj; nq is the number of words in the English full name, i.e. the number of words in wq; cnta and cntb respectively represent the numbers of partial and complete correspondences found in wq for the words in wj; cntc and cntd respectively represent the numbers of partial and complete correspondences found in wj for the words in wq; and 0 < e < 1 represents the weight given to partial correspondences; cnta, cntb, cntc and cntd are calculated as follows: two words form a complete match if they are identical or satisfy a correspondence in A1, a partial match if one is a subsequence of the other, and a permutation match if one, after its characters are reordered, is a subsequence of the other; cnta, cntb, cntc and cntd are initially 0; for each word in wj, if it completely matches a word in wq, cntb is incremented by 1; otherwise, if it completely or partially matches ws, cntb is incremented by 1 and cntd is incremented by the word length; otherwise, if it partially matches a word in wq, cnta is incremented by 1; otherwise cntc is incremented by the word length; for each word in wq, if it completely matches a word in wj, cntd is incremented by 1; otherwise, if it partially matches a word in wj, cntc is incremented by 1.

5. The text feature-based academic institution name entity alignment method according to claim 1, wherein the suffix frequency statistics correction method in step 4 specifically includes:

6. The text feature-based academic institution name entity alignment method of claim 5, wherein D2 automatically determines the common suffixes of institution Chinese names by a recursive method, specifically comprising:

deleting the suffix x from the set; otherwise, the n suffixes b1, b2, …, bn of length i+1 that end with x occur c1, c2, …, cn times in the institution data, with c1 ≥ c2 ≥ … ≥ cn and c = c1 + c2 + … + cn; if the number of types n is greater than the threshold t, the information entropy v1 is greater than the threshold t1, and the high-frequency suffix ratio v2 is less than the threshold t2, the suffix x is split into the suffixes b1, b2, …, bn; the calculation formulas of v1 and v2 are as follows:

7. The text feature-based academic institution name entity alignment method according to claim 1, wherein the merging of the multiple different pieces of data of the same institution in step 5 specifically includes:

8. The text feature-based academic institution name entity alignment method of claim 7, wherein the calculating the similarity s between the two sets of data in E2 specifically includes:

wherein wx and wy are the word lists obtained by segmenting the English full names of xi and yj after removing the geographic position words and number words that xi and yj have in common; nx and ny are the numbers of words in the English full names of xi and yj, i.e. the numbers of words in wx and wy; ca and cb respectively represent the numbers of words in wx for which a partial and a complete correspondence is found in wy; inv represents the number of inversion pairs in the order in which the words of wx are matched in wy; 0 < ee < 1 represents the relative weight of partial and complete correspondences; and 0 < z < 1 represents the relative weight of the similarity and the inversion pairs; ca, cb and inv are calculated as follows: two words form a complete match if they are identical or satisfy a correspondence in step A1, and a partial match if one is a subsequence of the other; ca and cb are initially 0; for each word in wx, a match is sought in wy with a greedy strategy: if a complete match with a word in wy is found, cb is incremented by 1; otherwise, if a partial match with a word in wy is found, ca is incremented by 1; each word in wy can be matched with at most one word in wx, and if several words in wy can match a word in wx, the one that comes first in word order takes priority; inv is the number of inversion pairs in the resulting matching order of wx and wy.