CN112016328A - Text feature-based academic institution name entity alignment method

Info

Publication number: CN112016328A
Application number: CN202010867785.8A
Authority: CN
Inventors: 林欣; 郭晨亮; 李继洲
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2020-12-01
Anticipated expiration: 2040-08-26
Also published as: CN112016328B

Abstract

The invention discloses a text feature-based method for aligning academic institution name entities, comprising five steps: converting English abbreviations into English full names; correcting the correspondence between English abbreviations and English full names; completing English full names and Chinese names by translation; correcting erroneous Chinese names; and merging academic institutions based on text features. The method performs entity alignment on academic institution data extracted from Chinese and English text. Each institution record contains an English abbreviation, an English full name, a Chinese name and a geographic location, and the records have missing fields and a small number of errors. By completing the missing data, correcting the erroneous data and merging the records of the same institution, the method obtains the set of different names that correspond to each institution. The invention combines the text features of institution names with geographic location information to align academic institution name entities. It requires neither pre-labeled name correspondences nor the semantic context of the names, and achieves a good alignment result with low complexity.

Description

Text feature-based academic institution name entity alignment method

Technical Field

The invention relates to the technical fields of entity alignment, entity disambiguation, knowledge graph construction, data preprocessing and union-find (disjoint-set) algorithms, in particular to performing entity alignment on academic institution names when constructing academic knowledge graphs, and specifically to a text feature-based academic institution name entity alignment method.

Background

In recent years, with the development of computers and networks and the accumulation of data, more and more electronic data has become available to help computers perform more tasks. To let a computer understand the relationships between everyday objects and acquire more knowledge, a knowledge graph can be constructed for real-world entities: each object corresponds to an entity node in the graph, and each relationship between objects corresponds to an edge connecting entities. As large numbers of academic papers are published in electronic form, the classification of academic research fields is becoming more detailed and complex. To provide more convenient and effective document retrieval, an academic knowledge graph needs to be constructed from the fields, authors and institutions associated with each paper, so that deeper association relationships can be used to improve query results.

When constructing a knowledge graph, entity information is extracted from text, and the same entity may be expressed in several different ways. The different expressions of the same entity therefore need to be merged to reduce errors. For data from different sources, the constructed knowledge graphs need to be merged by finding corresponding entities that have different names. Institutions are an important component of the academic knowledge graph, and disambiguating academic institution names is one step in constructing it.

Common entity alignment methods make their judgment from the semantic information in the context where the entity occurs, or from the entity's relationships with other entities in the knowledge graph. They require context information for the entity and labeled data of known alignment relationships. However, institution names in an academic knowledge graph have insufficient context information: institutions cover a wide range of fields and are only weakly associated with their context. Institution names often appear in abbreviated form rather than as common words. There is not enough labeled data of known institution name alignments, so methods that require labeled data perform poorly in this setting. Related entity information, such as the authors associated with an institution, also needs to be aligned itself. At present there is no satisfactory method for solving these problems.

Aligning academic institution name entities raises several problems: the data sources are in both Chinese and English; institution names have no uniform format; institutions often appear as English abbreviations; the abbreviation-to-full-name correspondences extracted from text may be wrong; and some institution names contain errors because they were cut out of detailed addresses. Institution name data has the form (English abbreviation, English full name, Chinese name, geographic location), and some fields may be missing or incorrectly paired. The invention addresses these problems by jointly using the text features of institution names and geographic location information.

Disclosure of Invention

In view of the deficiencies of the prior art and the difficulties described above, the invention aims to provide a text feature-based method for aligning academic institution names that is tailored to the characteristics of such names. The prior art cannot align academic institution entities that lack context, has high implementation complexity, and lacks sufficient labeled data for training a model. The invention addresses these problems by using the text features of academic institution names and the associations between abbreviations and full names, between Chinese and English, and between geographic locations and institutions. Using only a small number of known abbreviation-to-full-word correspondences and place name data, and without related paper or author information, it aligns academic institution entities and achieves a good result.

The specific technical scheme for realizing the purpose of the invention is as follows:

A text feature-based academic institution name entity alignment method, characterized in that entity alignment is performed on academic institution names on the basis of a number of academic institution records extracted from Chinese and English text data. Each record has the format (English abbreviation, English full name, Chinese name, geographic location), and the data contains a small number of missing fields and incorrect correspondences. The method completes the missing fields, corrects the erroneous content and merges the records that represent the same institution, finally obtaining the different Chinese and English names and the unique geographic location of each institution, that is, the group of records corresponding to each institution. The specific steps are as follows:

Step 1: Converting English abbreviations into English full names

For original data that contains an English abbreviation but no English full name, the English full name is generated automatically from the abbreviation using a geographic location-based word replacement method, completing the missing English full name;

Step 2: Correcting the correspondence between English abbreviations and English full names

For original data that contains both an English abbreviation and an English full name, it is judged whether the two correspond correctly. An incorrect correspondence is corrected by splitting the record: the English abbreviation is input to step 1 as a new record, and the English full name and the Chinese name are kept as another record with the same geographic location;

Step 3: Completing English full names and Chinese names by translation

For data completed in step 1 that contains no Chinese name, data corrected in step 2 that contains no Chinese name, and original data that contains only an English full name with neither an English abbreviation nor a Chinese name, the missing Chinese name is completed by English-to-Chinese translation. For original data that contains a Chinese name but no English full name, the missing English full name is completed by Chinese-to-English translation;

Step 4: Correcting erroneous Chinese names

For data corrected in step 2 that contains a Chinese name from the original data, and for data whose Chinese name was completed by translation in step 3, erroneous Chinese names are identified and corrected using a correction method based on suffix frequency statistics;

Step 5: Merging academic institutions based on text features

The data completed by translation in step 3 and the data corrected in step 4 are merged, completing the entity alignment of academic institution names and yielding the different Chinese and English names and the unique geographic location corresponding to each institution.

The geographic location-based word replacement method in step 1 specifically comprises:

A1: Counting word frequencies in the English abbreviations of institutions, and constructing a correspondence from common abbreviation words to full words. When one abbreviation word corresponds to several full words, the appropriate correspondence is selected according to the geographic location;

A2: Finding all abbreviation-word substrings that appear in the English abbreviation of the institution, where each substring must be bounded at both ends by characters that are neither letters nor single quotation marks and must not be contained in another abbreviation word, and replacing them according to the correspondence constructed in A1 to obtain the English full name.

In step 2, judging whether the English abbreviation corresponds correctly to the English full name specifically comprises:

B1: Splitting the English abbreviation and the English full name of the institution, using every character that is neither a letter nor a single quotation mark as a separator, and converting the results to lowercase to obtain the abbreviation words wj and the full-name words wq. The initial letters of all words in wq are concatenated to form the initialism ws of the English full name;

B2: Calculating the similarity sim between the English abbreviation and the English full name from the wj, wq and ws obtained in B1, and setting a threshold v. If the similarity is below the threshold, that is, sim < v, the English abbreviation is judged not to correspond correctly to the English full name; otherwise the correspondence is judged to be correct.

The calculation of the similarity sim between the English abbreviation and the English full name in B2 is specifically as follows:

If the English abbreviation, with only its letters retained, is a subsequence of the English full name, then sim = 1; otherwise the calculation formula is:

where nj is the number of abbreviation words, that is, the number of words in wj; nq is the number of full-name words, that is, the number of words in wq; cnta and cntb are the numbers of partial and complete correspondences found in wq for the words in wj; cntc and cntd are the numbers of partial and complete correspondences found in wj for the words in wq; and 0 < e < 1 is the weight given to partial correspondences. cnta, cntb, cntc and cntd are calculated as follows. Two words are a complete match if they are identical or satisfy a correspondence in A1; a partial match if one is a subsequence of the other; and a permuted match if one, after its characters are reordered, is a subsequence of the other. cnta, cntb, cntc and cntd are initially 0. For each word in wj: if it completely matches a word in wq, cntb is incremented by 1; otherwise, if it completely or partially matches ws, cntb is incremented by 1 and cntd is incremented by the word length; otherwise, if it partially matches a word in wq, cnta is incremented by 1; otherwise, if it is a permuted match of ws, cnta is incremented by 1 and cntc is incremented by the word length. For each word in wq: if it completely matches a word in wj, cntd is incremented by 1; otherwise, if it partially matches a word in wj, cntc is incremented by 1.

In step 4, the correction method based on suffix frequency statistics specifically comprises:

D1: Deleting Chinese names whose length is below a threshold k; deleting Chinese names in which the proportion of English letters exceeds a threshold p; correcting Chinese names that do not end with a Chinese character by deleting the trailing non-Chinese suffix; and deleting any name that becomes too short as a result of this correction;

D2: Counting the occurrence frequencies of suffixes of different lengths in the Chinese names of institutions, and automatically determining the common suffixes of institution Chinese names by a recursive method;

D3: For a Chinese name that does not end with a common suffix determined in D2, if deleting a substring at its end makes it end with a common suffix and the corrected name is not too short in the sense of D1, the name is corrected; otherwise the Chinese name is deleted.

In D2, the recursive method for automatically determining the common suffixes of institution Chinese names comprises:

D21: Let the total number of institution records be N. All suffixes of length 1 form the initial set of common suffixes, and the number of occurrences of each suffix in the institution data is stored in the set alongside it;

D22: Uniformly distributed suffixes in the set are split and suffixes with too few occurrences are deleted, until no suffix in the set is uniformly distributed or occurs too few times. Let the suffix x have length i and occur c times in the institution data. If

the suffix x is removed from the set. Otherwise, let b1, b2, …, bn be the n suffixes of length i+1 that end with x, occurring c1, c2, …, cn times in the institution data, with c1 ≥ c2 ≥ … ≥ cn and c = c1 + c2 + … + cn. If the number of kinds n is greater than the threshold t, the information entropy v1 is greater than the threshold t1 and the high-frequency suffix ratio v2 is less than the threshold t2, the suffix x is split into the suffixes b1, b2, …, bn. The calculation formulas of v1 and v2 are as follows:

In step 5, merging the different records of the same institution specifically comprises:

E1: Records with the same Chinese name and geographic location, or with no Chinese name but the same English full name and geographic location, are labeled with a unique ID as one group of data for the same institution. If the two groups of data corresponding to two different IDs have the same geographic location and contain two records with the same English full name, the two groups are changed to the same ID. This preliminary merge is carried out with a union-find algorithm;

E2: The similarity s between two groups of data is calculated from the text features of the English full names and from the geographic location words and number words in the names. If s is greater than a threshold pp, the two groups are considered to represent the same institution and are changed to the same ID; this further merge is again carried out with a union-find algorithm. Finally, the English abbreviations, English full names, Chinese names and unique address belonging to each ID form the complete information of one institution.

The calculation of the similarity s between two groups of data in E2 is specifically as follows:

where x1, x2, …, xq are the q records of ID1; y1, y2, …, ym are the m records of ID2; and ss(xi, yj) is the one-way similarity of institution records xi and yj. The geographic location word sets pxi and pyj and the number word sets nxi and nyj are extracted from xi and yj. The geographic location words consist of the geographic location in the record, the words of the segmented English full name that appear in a place name lexicon, and the pinyin words in the English full name; pinyin words are identified using a dynamic programming algorithm. If at least one of the geographic location word sets pxi, pyj is non-empty and the intersection pxi ∩ pyj is empty, or at least one of the number word sets nxi, nyj is non-empty and the intersection nxi ∩ nyj is empty, then ss(xi, yj) = 0; otherwise ss(xi, yj) is calculated by the formula:

where wx and wy are the word lists of the English full names of xi and yj after the geographic location words and number words they share have been removed; nx and ny are the numbers of words in wx and wy; ca and cb are the numbers of partial and complete correspondences found in wy for the words in wx; inv is the number of inversions in the order of the corresponding words found in wy for the words in wx; 0 < ee < 1 is the weight of a partial correspondence relative to a complete correspondence; and 0 < z < 1 is the weight of the similarity relative to the inversion count. ca, cb and inv are calculated as follows. Two words are a complete match if they are identical or satisfy a correspondence in step A1, and a partial match if one is a subsequence of the other. ca and cb are initially 0. Each word in wx is matched against the words in wy by a greedy strategy: if a word in wy that completely matches it is found, cb is incremented by 1; otherwise, if a word in wy that partially matches it is found, ca is incremented by 1. Each word in wy can be matched with at most one word in wx, and if several words in wy can match one word in wx, the earliest in word order is chosen. inv is the number of inversions in the matching order between wx and wy.
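The greedy matching and the inversion count described above can be illustrated with a minimal Python sketch. The function names and the optional `table` argument (standing in for the A1 abbreviation correspondences) are illustrative assumptions, and the sketch stops at ca, cb and inv without combining them into ss.

```python
def is_subsequence(a, b):
    """True if the characters of a appear in b in the same order."""
    it = iter(b)
    return all(ch in it for ch in a)


def greedy_match(wx, wy, table=None):
    """Return (ca, cb, inv) for word lists wx and wy.

    table is a hypothetical dict of A1-style abbreviation -> full word pairs.
    """
    table = table or {}

    def complete(a, b):
        return a == b or table.get(a) == b or table.get(b) == a

    def partial(a, b):
        return is_subsequence(a, b) or is_subsequence(b, a)

    used = [False] * len(wy)
    order = []  # indices in wy of the matched words, in wx order
    ca = cb = 0
    for w in wx:
        # Prefer a complete match; take the earliest unused word in wy.
        hit = next((j for j, y in enumerate(wy)
                    if not used[j] and complete(w, y)), None)
        if hit is not None:
            cb += 1
        else:
            hit = next((j for j, y in enumerate(wy)
                        if not used[j] and partial(w, y)), None)
            if hit is not None:
                ca += 1
        if hit is not None:
            used[hit] = True  # each word in wy is matched at most once
            order.append(hit)
    # Count pairs of matched words whose relative order differs in wy.
    inv = sum(1 for i in range(len(order))
              for j in range(i + 1, len(order)) if order[i] > order[j])
    return ca, cb, inv
```

For example, `greedy_match(["university", "of", "science"], ["science", "univ", "of"])` finds one partial match, two complete matches and two inversions.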

The beneficial effects of the invention are as follows. The invention combines Chinese and English names, full names and abbreviations, and geographic location information, and designs a name similarity calculation suited to the text features of academic institution names. It corrects and completes the data in several ways and uses efficient rules and algorithms to carry out the entity alignment of academic institution names. It also overcomes the lack of context information and of labeled data for known alignment relationships: it uses only the text similarity and semantic information of words together with geographic location information, and needs no links to other entities such as authors and papers. It achieves a good alignment result for academic institution names, obtaining the different Chinese and English names and the unique geographic location of each institution, that is, the group of records corresponding to each institution.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

Embodiments of the invention are described in further detail below with reference to specific examples, from which those skilled in the art can further understand the text feature-based academic institution name entity alignment method. The invention can also be applied in different embodiments depending on the problem at hand; for example, different threshold parameters can be used depending on the data. The individual processes of the invention may be modified and combined as circumstances require.

Referring to FIG. 1, the invention comprises the following five steps: converting English abbreviations into English full names; correcting the correspondence between English abbreviations and English full names; completing English full names and Chinese names by translation; correcting erroneous Chinese names; and merging academic institutions based on text features.

Step 1 of the invention, converting English abbreviations into English full names, performs information completion. For original data that contains an English abbreviation but no English full name, the English full name is generated automatically from the abbreviation using a geographic location-based word replacement method, completing the missing English full name.

English abbreviations contain many words that cannot be translated automatically or understood intuitively. Converting them into English full names therefore improves the English-to-Chinese translation in step 3, and the resulting full names are used for the English full name similarity comparison in step 5.

The geographic location-based word replacement method comprises the following steps:

A1: Word frequencies in the English abbreviations of institutions are counted, and a correspondence from common abbreviation words to full words is constructed. When one abbreviation word corresponds to several full words, the appropriate correspondence is selected according to the geographic location. Common words are selected for manual labeling according to their frequency in the English abbreviations, yielding correspondences from 402 abbreviation words to full words.

A2: All abbreviation-word substrings in the English abbreviation are found, where each substring must be bounded at both ends by characters that are neither letters nor single quotation marks and must not be contained in another abbreviation-word substring, and they are replaced according to the correspondence constructed in A1 to obtain the English full name. If one abbreviation word is a substring of another, the longer one is selected for replacement. The boundary requirement ensures that an abbreviation word is matched only as a separate word and not as part of a word. When the geographic location is missing, the most common full word is used as the default replacement. Abbreviation words are matched in lowercase, and the replacement full words are kept in lowercase.
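Steps A1 and A2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `ABBREV` table and its entries are hypothetical stand-ins for the 402 manually labeled correspondences, the key `None` holds the default (most common) full word, and location-specific alternatives are keyed by a location string.

```python
import re

# Hypothetical A1 table: abbreviation -> {location or None: full word}.
# None marks the default expansion used when the location is missing
# or has no dedicated entry.
ABBREV = {
    "univ": {None: "university"},
    "inst": {None: "institute"},
    "technol": {None: "technology"},
    "sci": {None: "science"},
    "acad sci": {None: "academy of sciences"},
    "cent": {None: "center", "UK": "centre"},
}


def expand_abbreviation(short_name, location=None, table=ABBREV):
    """Replace abbreviation words in short_name with full words (A2)."""
    # Longer abbreviations come first so they win over their substrings.
    keys = sorted(table, key=len, reverse=True)
    # Both ends must be bounded by a non-letter, non-single-quote character.
    pattern = re.compile(
        r"(?<![a-z'])(" + "|".join(re.escape(k) for k in keys) + r")(?![a-z'])"
    )

    def replace(match):
        options = table[match.group(1)]
        return options.get(location, options[None])

    # Matching is done in lowercase and the result stays in lowercase.
    return pattern.sub(replace, short_name.lower())
```

With this table, `expand_abbreviation("Chinese Acad Sci, Inst Technol")` expands `acad sci` as a whole rather than `sci` alone, and `science` is left untouched because `sci` is not a separate word there.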

Step 2 of the invention, correcting erroneous correspondences between English abbreviations and English full names, performs information correction. For original data that contains both an English abbreviation and an English full name, it is judged whether the two correspond correctly. An incorrect correspondence is corrected by splitting the record: the English abbreviation is input to step 1 as one record, and the English full name and the Chinese name are kept as another record with the same geographic location.

This step corrects erroneous correspondences and reduces the propagation of errors in the original data into the entity alignment of step 5. It has been observed that the Chinese name and the geographic location usually correspond correctly to the English full name. Therefore, for a split record (jc, qc, zh, pos), the record (jc, "", "", "") is input to step 1 as new data if the abbreviation jc does not appear in other data, and the record ("", qc, zh, pos) is retained if the full name qc does not appear in other data.
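The splitting rule can be written as a small Python sketch. The function name and the two lookup sets are illustrative assumptions; the sets are taken to hold the abbreviations and full names that already appear in other records.

```python
def split_record(record, other_abbreviations, other_full_names):
    """Split a mismatched (jc, qc, zh, pos) record into up to two records."""
    jc, qc, zh, pos = record
    result = []
    if jc not in other_abbreviations:
        # The abbreviation alone goes back to step 1 for expansion.
        result.append((jc, "", "", ""))
    if qc not in other_full_names:
        # Full name, Chinese name and location are kept together.
        result.append(("", qc, zh, pos))
    return result
```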

The method for judging whether the English abbreviation corresponds correctly to the English full name comprises the following steps:

B1: The English abbreviation and the English full name of the institution are split, using every character that is neither a letter nor a single quotation mark as a separator, and converted to lowercase to obtain the abbreviation words wj and the full-name words wq. The initial letters of all words in wq are concatenated to form the initialism ws of the English full name. The number of abbreviation words is nj, that is, the number of words in wj, and the number of full-name words is nq, that is, the number of words in wq.

B2: The similarity sim between the English abbreviation and the English full name is calculated from the wj, wq and ws obtained in B1, and a threshold v is set. If the similarity is below the threshold, that is, sim < v, the English abbreviation is judged not to correspond correctly to the English full name; otherwise the correspondence is judged to be correct. If the English abbreviation, with only its letters retained, is a subsequence of the English full name, then sim = 1; otherwise the similarity sim is calculated by the formula:

where cnta and cntb are the numbers of partial and complete correspondences found in wq for the words in wj, cntc and cntd are the numbers of partial and complete correspondences found in wj for the words in wq, and 0 < e < 1 is the weight given to partial correspondences.

Two words are a complete match if they are identical or satisfy one of the 402 correspondences in A1, a partial match if one is a subsequence of the other, and a permuted match if one, after its characters are reordered, is a subsequence of the other.

cnta, cntb, cntc and cntd are initially 0. For each word in wj: if it completely matches a word in wq, cntb is incremented by 1; otherwise, if it completely or partially matches ws, cntb is incremented by 1 and cntd is incremented by the word length; otherwise, if it partially matches a word in wq, cnta is incremented by 1; otherwise, if it is a permuted match of ws, cnta is incremented by 1 and cntc is incremented by the word length. For each word in wq: if it completely matches a word in wj, cntd is incremented by 1; otherwise, if it partially matches a word in wj, cntc is incremented by 1.
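The tokenization of B1, the three kinds of word match and the counting of cnta, cntb, cntc and cntd can be sketched in Python as below. This is one reading of the procedure, not a definitive implementation: the function names are illustrative, `table` is a hypothetical stand-in for the A1 correspondences, the permuted match is implemented as multiset containment of characters, and the sketch does not combine the counters into sim.

```python
import re
from collections import Counter


def tokenize(name):
    """B1: split on anything that is not a letter or a single quote."""
    return [w for w in re.split(r"[^a-z']+", name.lower()) if w]


def initialism(words):
    """B1: concatenate the first character of every word (ws)."""
    return "".join(w[0] for w in words)


def is_subsequence(a, b):
    it = iter(b)
    return all(ch in it for ch in a)


def complete_match(a, b, table):
    return a == b or table.get(a) == b or table.get(b) == a


def partial_match(a, b):
    return is_subsequence(a, b) or is_subsequence(b, a)


def permuted_match(a, b):
    # Some reordering of one word is a subsequence of the other exactly
    # when its characters, counted with multiplicity, are contained in it.
    ca, cb = Counter(a), Counter(b)
    return not (ca - cb) or not (cb - ca)


def count_matches(wj, wq, ws, table=None):
    """Return (cnta, cntb, cntc, cntd) as described for B2."""
    table = table or {}
    cnta = cntb = cntc = cntd = 0
    for w in wj:
        if any(complete_match(w, q, table) for q in wq):
            cntb += 1
        elif w == ws or partial_match(w, ws):
            cntb += 1
            cntd += len(w)  # one initial stands for one full-name word
        elif any(partial_match(w, q) for q in wq):
            cnta += 1
        elif permuted_match(w, ws):
            cnta += 1
            cntc += len(w)
    for q in wq:
        if any(complete_match(q, w, table) for w in wj):
            cntd += 1
        elif any(partial_match(q, w) for w in wj):
            cntc += 1
    return cnta, cntb, cntc, cntd
```

For instance, the abbreviation `ECNU` against `East China Normal University` yields the initialism `ecnu`, so the single abbreviation word counts as one complete correspondence that covers four full-name words.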

Step 3 of the invention, completing English full names and Chinese names by translation, performs information completion. For data completed in step 1 that contains no Chinese name, data corrected in step 2 that contains no Chinese name, and original data that contains only an English full name with neither an English abbreviation nor a Chinese name, the missing Chinese name is completed by English-to-Chinese translation. For original data that contains a Chinese name but no English full name, the missing English full name is completed by Chinese-to-English translation.

In the translation step, multiple institution names are separated by line feed characters and submitted in a single query, which improves efficiency. It has been observed that some institutions have very different English names but similar Chinese names, while others have very different Chinese names but similar English names. Completing the bilingual name information by translation is therefore very helpful for the comparison and merging of institutions in step 5.
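The batching of names into one query can be sketched as follows. The source does not name a translation service, so `translate` here is a hypothetical caller-supplied function that takes one string and returns its translation with line breaks preserved; `batch_size` is likewise an illustrative parameter.

```python
def translate_names(names, translate, batch_size=50):
    """Translate names in newline-separated batches, one query per batch."""
    results = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        translated = translate("\n".join(batch)).split("\n")
        if len(translated) != len(batch):
            # The translator merged or split lines; alignment is lost.
            raise ValueError("translator changed the number of lines")
        results.extend(translated)
    return results
```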

Step 4 of the invention, correcting erroneous Chinese names, performs information correction. For data corrected in step 2 that contains a Chinese name from the original data, and for data whose Chinese name was completed by English-to-Chinese translation in step 3, erroneous Chinese names are identified and corrected using a correction method based on suffix frequency statistics.

Some names contain content at either end that is not part of the institution name, because the text was cut at the wrong position during extraction. Some institution names include overly detailed sub-institution names, of which only the preceding parent institution name needs to be kept. Some translated names retain a large number of English words because the translation was incomplete.

The correction method based on suffix frequency statistics comprises the following steps:

D1: The Chinese name of an institution is usually not too short, does not contain too high a proportion of English letters, and should not end with a non-Chinese character, so names are checked on these three points. Chinese names whose length is below a threshold k are deleted; Chinese names in which the proportion of English letters exceeds a threshold p are deleted; Chinese names that end with non-Chinese characters are corrected by deleting the trailing suffix; and names whose length falls below the threshold k as a result of this correction are deleted.

D2: It has been observed that the Chinese names of institutions usually end with only a few specific kinds of suffix, so the suffix is used to judge whether a Chinese name is correct. The occurrence frequencies of suffixes of different lengths in the Chinese names of institutions are counted, and the common suffixes are determined automatically by a recursive method, as follows:

D22: Uniformly distributed suffixes in the initial set are split and suffixes with too few occurrences are deleted, until no suffix in the set is uniformly distributed or occurs too few times. Let the suffix x have length i and occur c times in the institution data, and let b1, b2, …, bn be the n suffixes of length i+1 that end with x, occurring c1, c2, …, cn times, with c1 ≥ c2 ≥ … ≥ cn and c = c1 + c2 + … + cn. Each of b1, b2, …, bn is a single character followed by x. If

the suffix x occurs too few times and is removed from the set. If the number of kinds n is greater than the threshold t and the suffixes b1, b2, …, bn are uniformly distributed, the suffix x is split: x is deleted from the set and b1, b2, …, bn are added to it. To measure whether the suffix counts are uniformly distributed, the information entropy v1 and the high-frequency suffix ratio v2 are calculated. A larger v1 means greater uncertainty and a more uniform distribution; a smaller v2 means the high-frequency suffixes take a smaller share and the distribution is more uniform. Thresholds t1 and t2 are set, and the distribution is judged to be uniform if v1 > t1 and v2 < t2. The calculation formulas of v1 and v2 are as follows:

D3: For a Chinese name that does not end with a common suffix determined in D2, if deleting a substring at its end makes it end with a common suffix and the length of the corrected name is not below the threshold k, the name is corrected; otherwise the Chinese name is deleted. After correction, every remaining institution Chinese name ends with one of the common suffixes obtained in D22.
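Steps D1 to D3 can be sketched in Python as below. Several details are assumptions rather than the source's own formulas, since the formulas are not reproduced in this text: the removal condition is taken to be `c / N < min_ratio`, v1 is taken to be the Shannon entropy of the child suffix counts, v2 is taken to be the share of the most frequent child suffix, and D3 is taken to delete the shortest possible tail. All default threshold values and function names are illustrative.

```python
import math
import re
from collections import Counter

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK ideograph range


def clean_name(name, k=4, p=0.5):
    """D1: return the cleaned name, or None if it should be deleted."""
    if len(name) < k:
        return None
    letters = sum(ch.isascii() and ch.isalpha() for ch in name)
    if letters / len(name) > p:
        return None
    while name and not CJK.match(name[-1]):
        name = name[:-1]  # drop trailing non-Chinese characters
    return name if len(name) >= k else None


def common_suffixes(names, min_ratio=0.01, t=3, t1=1.0, t2=0.5):
    """D2: recursively split uniformly distributed suffixes."""
    total = len(names)
    found = {}

    def visit(x, c):
        if c / total < min_ratio:  # assumed form of the removal condition
            return
        children = Counter(n[-(len(x) + 1):] for n in names
                           if len(n) > len(x) and n.endswith(x))
        if len(children) > t:
            s = sum(children.values())
            probs = [ci / s for ci in children.values()]
            v1 = -sum(q * math.log2(q) for q in probs)  # assumed: entropy
            v2 = max(probs)                             # assumed: c1 / c
            if v1 > t1 and v2 < t2:
                for b, cb in children.items():
                    visit(b, cb)
                return
        found[x] = c

    for x, c in Counter(n[-1] for n in names if n).items():
        visit(x, c)
    return found


def correct_by_suffix(name, suffixes, k=4):
    """D3: trim the tail until the name ends with a common suffix."""
    suffixes = tuple(suffixes)
    if name.endswith(suffixes):
        return name
    for end in range(len(name) - 1, 0, -1):
        if name[:end].endswith(suffixes):
            return name[:end] if end >= k else None
    return None
```

With the suffixes `大学` and `研究所`, a name such as `华东师范大学计算机系` is trimmed back to the parent institution `华东师范大学`.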

In the method for combining academic institutions based on text characteristics in step 5, a plurality of different data of the same institution are combined for the data after being translated and complemented in step 3 and the data corrected in step 4, so that the entity alignment of the names of the academic institutions is completed, and a plurality of different Chinese and English names and unique geographic positions corresponding to the same institution are obtained.

In this step, entity alignment is performed on the completed and corrected data. IDs are labeled first and obviously identical institutions are merged for a preliminary alignment; similarities are then compared for further alignment. The final aim is to aggregate one group of data for each institution. Combining Chinese and English makes it easier to find more alignment relations, and combining geographic location information and number words markedly reduces errors for institutions that differ only in address or number, which text similarity alone can hardly tell apart.

The method for merging multiple different records of the same institution comprises the following steps:

E1: Records with the same Chinese name and geographic location, and records without a Chinese name but with the same English full name and geographic location, are each labeled with a unique ID as a group of data of the same institution. If the two groups of data corresponding to two IDs have the same geographic location and contain two records with different IDs but the same English full name, the two IDs are changed to the same ID. This preliminary merging is carried out with a union-find algorithm.

After the previous steps, the data used by E1 all contain English full names and most contain Chinese names, while English abbreviations are too ambiguous to be considered here. Since a Chinese name aggregates more data than an English full name, Chinese names are used first for ID labeling and merging.

The specific steps of the union-find algorithm are as follows. With T IDs currently labeled, T trees are built, each with one ID as its root. Each time two records x1 belonging to ID1 and x2 belonging to ID2, with ID1 ≠ ID2, are found to have the same English full name, the two trees tree1 and tree2 containing ID1 and ID2 are merged by making the root of tree1 a child of the root of tree2. When the root of an ID is queried, every node on the path from that ID to the root is re-attached as a child of the root, which compresses the depth of the tree. After all merge operations are completed, all IDs in each resulting tree are changed to the same ID.
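The tree-merging procedure described above corresponds to a standard union-find (disjoint-set) structure with path compression. The following is a minimal sketch, not the patented implementation; the class name `UnionFind`, the helper `merge_ids_by_full_name`, the record layout `(id, english_full_name, location)` and the location values in the usage example are assumptions made for illustration.

```python
class UnionFind:
    """Disjoint-set forest: one tree per ID, merged when IDs are found equal."""

    def __init__(self, ids):
        # Initially every ID is the root of its own tree.
        self.parent = {i: i for i in ids}

    def find(self, x):
        # Walk up to the root of the tree containing x.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: re-attach every node on the path to the root.
        while x != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        # Make the root of a's tree a child of the root of b's tree.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb


def merge_ids_by_full_name(records):
    """records: iterable of (id, english_full_name, location) tuples (assumed layout).

    Returns a dict mapping every original ID to the representative ID of its tree.
    """
    records = list(records)
    uf = UnionFind({rid for rid, _, _ in records})
    first_seen = {}  # (full name, location) -> first ID seen with that key
    for rid, name, location in records:
        key = (name, location)
        if key in first_seen:
            uf.union(first_seen[key], rid)
        else:
            first_seen[key] = rid
    return {rid: uf.find(rid) for rid, _, _ in records}
```

For instance, calling `merge_ids_by_full_name` on two records that carry the IDs '33886' and '33945' and share the full name 'Beijing Jiao Tong University' and the same location maps both IDs to one representative.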

E2: For any two IDs, ID1 and ID2, whose two groups of data have the same geographic location and contain common words, the similarity s between the two groups is calculated from the text features of the English full names and the geographic location words and number words in the names. If the similarity exceeds the threshold, s > pp, the two groups represent the same institution and are changed to the same ID; the merging is again completed with the union-find algorithm. Finally, the set of English abbreviations, English full names, Chinese names and the unique address corresponding to each ID is the complete information of that institution.

In this step, because every record has an English full name, the similarity is calculated on the English full names. Because the number of institutions is large, the institution IDs corresponding to each word are computed in order to improve efficiency and quickly find the IDs that may represent the same institution; this reduces the number of comparisons needed for entity alignment by about 99%. Judging whether two institution names are the same from text similarity alone makes it hard to detect the small textual differences introduced by geographic location words and number words, even though these words can change the meaning of an institution name greatly. The number words and geographic location words are therefore extracted from the English full name and used in the similarity calculation.
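One way to compute "the institution IDs corresponding to each word" is an inverted index from words to IDs, so that only IDs sharing a location and at least one word are ever compared. The sketch below is an illustration under assumptions: the function name `candidate_id_pairs`, the input layout (ID mapped to a location and a list of English full names) and the sample data are hypothetical, and the reduction it achieves on real data is not measured here.

```python
from collections import defaultdict
from itertools import combinations


def candidate_id_pairs(groups):
    """groups: dict mapping ID -> (location, [english full names]) (assumed layout).

    Returns the set of ID pairs that share a location and at least one word,
    i.e. the only pairs whose similarity needs to be computed.
    """
    index = defaultdict(set)  # (location, word) -> IDs whose names contain the word
    for gid, (location, names) in groups.items():
        for name in names:
            for word in name.lower().split():
                index[(location, word)].add(gid)

    pairs = set()
    for ids in index.values():
        # Every two IDs listed under the same (location, word) key are candidates.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

Very frequent words such as "university" still produce many candidate pairs within one location, so a practical version might skip such words; that refinement is not described in the text and is left out.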

The number words are the words extracted from the English full name that contain any of the following substrings:

['zero','one','two','three','four','five','six','seven','eight','nine','ten','first','second','thir','fif','eleven','twelve','twenty','hundred','thousand','million','billion','1','2','3','4','5','6','7','8','9','0']
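The substring test above can be sketched in a few lines. The list is copied from the text; the function name `number_words` and the choice to split the name on whitespace and lower-case it are assumptions.

```python
NUMBER_SUBSTRINGS = [
    'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
    'nine', 'ten', 'first', 'second', 'thir', 'fif', 'eleven', 'twelve',
    'twenty', 'hundred', 'thousand', 'million', 'billion',
    '1', '2', '3', '4', '5', '6', '7', '8', '9', '0',
]


def number_words(full_name):
    """Return the words of an English full name that contain a number substring."""
    words = full_name.lower().split()  # assumed tokenization: whitespace split
    return [w for w in words if any(s in w for s in NUMBER_SUBSTRINGS)]
```

Because the test is a plain substring match, a word such as "stone" (which contains "one") would also be returned; the text does not say whether such cases are filtered further.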

The geographic location words are of two types: words matched against a lexicon of 6213 place names, and words identified as sequences of several pinyin syllables, determined by enumerating all Chinese pinyin syllables according to the pinyin rules and applying a dynamic programming algorithm.

Whether a word is composed of the pinyin of several characters is judged by dynamic programming as follows. Every substring of the word that can be the pinyin of a single character is marked true if it starts at position 1 and false otherwise. Each substring marked false is then changed to true if some substring marked true ends immediately before its starting position. If any substring marked true ends at the last position of the word, the word can be expressed as the pinyin of several characters.
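The dynamic programming check can be written as a prefix-reachability table, which is equivalent to marking substrings true when they chain from position 1. This is a sketch under assumptions: the function name `is_pinyin_sequence` is hypothetical, and `SAMPLE_SYLLABLES` is a tiny illustrative subset standing in for the full syllable table that the pinyin rules would generate. As written it also accepts a word consisting of a single syllable.

```python
# Small illustrative subset; a real table would list every valid pinyin syllable.
SAMPLE_SYLLABLES = {"guang", "zhou", "bei", "jing", "shang", "hai", "jiao", "tong"}


def is_pinyin_sequence(word, syllables=SAMPLE_SYLLABLES):
    """Return True if word can be split entirely into pinyin syllables."""
    word = word.lower()
    n = len(word)
    max_len = max((len(s) for s in syllables), default=0)

    # reachable[k] is True when word[:k] is a concatenation of syllables.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if reachable[start] and word[start:end] in syllables:
                reachable[end] = True
                break
    return n > 0 and reachable[n]
```

With the sample table, "Guangzhou" and "Jiaotong" are recognized as pinyin sequences while "Paris" is not.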

The similarity s between two groups of data is then obtained by calculating the similarity of each pair of institution records and averaging:

where x1, x2, …, xq are the q records of ID1, y1, y2, …, ym are the m records of ID2, and ss(xi, yj) is the one-way similarity of institution records xi and yj. If the intersection of the geographic location words and number words of xi and yj is empty, ss(xi, yj) = 0; otherwise ss(xi, yj) is a weighted sum of the matching degree and the inversion pairs of the match:

Here wx and wy are the word segmentations of the English full names of xi and yj after the geographic location words and number words have been removed; nx and ny are the numbers of words in the English full names of xi and yj, that is, the numbers of words in wx and wy; ca and cb are the numbers of words in wx that find a partial and a complete correspondence in wy, respectively; inv is the number of inversion pairs in the sequence of positions in wy matched by the words of wx; 0 < ee < 1 is the weight of a partial correspondence relative to a complete correspondence; and 0 < z < 1 is the weight of the similarity relative to the inversion pairs.

Two words are a complete match if they are identical or satisfy one of the 402 correspondences in step A1, and a partial match if one is a subsequence of the other.

ca, cb and inv are calculated as follows. ca and cb are initially 0. Each word in wx is matched against the words in wy with a greedy strategy: if it completely matches a word in wy, cb is incremented by 1; otherwise, if it partially matches a word in wy, ca is incremented by 1. Each word in wy can be matched with a word in wx only once, and if several words in wy can match one word in wx, the one earliest in the segmentation order is matched first. inv is the number of inversion pairs in the matching order of wx and wy; for example, when the 1st, 2nd and 4th words in wx match the 5th, 1st and 3rd words in wy, inv is 2 because the sequence 5, 1, 3 contains 2 inversion pairs.
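The counting of ca, cb and inv can be sketched as follows, taking a partial match to mean that one word is a subsequence of the other, as stated in claim 8. The function names and the optional `correspondences` argument (a stand-in for the 402 pairs of step A1, empty by default) are assumptions; the final similarity ss is not computed because its formula is not reproduced in this text.

```python
def is_subsequence(short, long):
    """True if the characters of short appear in long in the same order."""
    it = iter(long)
    return all(ch in it for ch in short)


def match_kind(a, b, correspondences=frozenset()):
    """Classify a pair of words as 'complete', 'partial' or None."""
    if a == b or (a, b) in correspondences or (b, a) in correspondences:
        return "complete"
    if is_subsequence(a, b) or is_subsequence(b, a):
        return "partial"
    return None


def count_inversions(seq):
    """Number of pairs (i, j) with i < j and seq[i] > seq[j]."""
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def greedy_match(wx, wy, correspondences=frozenset()):
    """Return (ca, cb, inv) for two word lists, matching each wy word at most once."""
    ca = cb = 0
    used = set()    # indices of wy already matched
    matched = []    # positions in wy, in the order the wx words matched them
    for word in wx:
        chosen = None
        for kind in ("complete", "partial"):  # complete matches take precedence
            for j, other in enumerate(wy):    # earliest position in wy first
                if j not in used and match_kind(word, other, correspondences) == kind:
                    chosen = (j, kind)
                    break
            if chosen:
                break
        if chosen is None:
            continue
        j, kind = chosen
        used.add(j)
        matched.append(j)
        if kind == "complete":
            cb += 1
        else:
            ca += 1
    return ca, cb, count_inversions(matched)
```

For the sequence in the example above, `count_inversions([5, 1, 3])` evaluates to 2.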

Examples

Referring to FIG. 1, four types of institution data can be extracted from the English text data source: records containing only an English abbreviation, only an English full name, both an English abbreviation and a full name, and both Chinese and English names, marked ①, ②, ③ and ④ respectively. Two types can be extracted from the Chinese text data source: records containing only a Chinese name and records containing both Chinese and English names, marked ⑤ and ⑥ respectively. For the data marked ①, the English full name is supplemented in step 1 and the result is marked ⑦. The correspondence between English abbreviation and full name is then corrected in step 2. Data that have an English name but no Chinese name are translated into Chinese in step 3, and the Chinese name is corrected in step 4; for the data marked ⑤, the Chinese name is translated into an English name in step 3. After steps 1, 2, 3 and 4, all data carry the same final mark; except for a small amount of erroneous data without Chinese names, every record has both a Chinese name and an English full name. The data carrying the final mark are ID-labeled and merged in step 5 to obtain the entity alignment result, in which the group of data corresponding to each ID represents the same institution.

Referring to the flow chart of FIG. 1, which shows the detailed process of the academic institution name entity alignment method, the present invention includes the following steps:

Step 1: Conversion from English abbreviation to English full name

A1: First, based on the abbreviated words that are common in institution names, 402 correspondences between abbreviated words and full words are labeled, such as sci for science, or for oregon, and inst for institute. Some of these correspondences depend on geographic location; for example, WA corresponds to Washington in the United States and to Western Australia in Australia.

A2: For institution data that have only an English abbreviation, for example ('George Wa Univ', 'usa'), the English abbreviation is completed according to the correspondence rules. Using the correspondences between abbreviated words and full words in A1, the positions of all abbreviated words in the English abbreviation of the institution are found, ignoring letter case; the characters on either side of an abbreviated word must not be a single quotation mark or an English letter. For example, in 'George Wa Univ', Wa appears at (8,9) and Univ at (11,14), while or appears at (3,4) but position 2 is the letter e, so that occurrence is not an independent word.

The abbreviated words are then replaced with the correct full words: Univ is replaced with University, and Wa is replaced with Washington according to the geographic location, the United States, giving the full name 'George Washington University'. The completed record is ('George Wa Univ', 'George Washington University', 'United States').
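The boundary rule and the location-dependent replacement of steps A1 and A2 can be illustrated with a regular expression that rejects matches adjacent to a letter or a single quotation mark. This is only a sketch: `ABBREVIATIONS` holds a handful of entries in place of the 402 labeled correspondences, and the location keys 'usa' and 'australia' as well as the function name `expand_abbreviation` are assumptions.

```python
import re

# Illustrative subset of the correspondence table; a dict value means the
# full word depends on the geographic location of the record.
ABBREVIATIONS = {
    "univ": "University",
    "inst": "Institute",
    "sci": "Science",
    "or": "Oregon",
    "wa": {"usa": "Washington", "australia": "Western Australia"},
}


def expand_abbreviation(short_name, location, table=ABBREVIATIONS):
    """Replace stand-alone abbreviated words in short_name with full words."""
    keys = sorted(table, key=len, reverse=True)  # try longer abbreviations first
    pattern = re.compile(
        r"(?<![A-Za-z'])(" + "|".join(map(re.escape, keys)) + r")(?![A-Za-z'])",
        re.IGNORECASE,  # letter case is ignored when locating abbreviations
    )

    def replace(match):
        entry = table[match.group(0).lower()]
        if isinstance(entry, dict):
            # Location-dependent entry: keep the abbreviation if location is unknown.
            return entry.get(location.lower(), match.group(0))
        return entry

    return pattern.sub(replace, short_name)
```

In 'George Wa Univ' the "or" inside "George" is skipped because it is surrounded by letters, while "Wa" and "Univ" are expanded.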

Step 2: Correcting the correspondence between English abbreviation and English full name

For data containing both an English abbreviation and an English full name, the text-feature similarity between the two is calculated, and erroneous data with low similarity are corrected.

For example, in record x1 the abbreviation 'Capital Med Univ' is paired with the full name 'Peking University', and in record x2 the abbreviation 'Univ Pompeu Fabra UPF' is paired with the full name 'Pompeu Fabra University'; x1 is a wrong correspondence and x2 is a correct one.

B1: The English abbreviation and the English full name are segmented into words to obtain wj and wq, and the initialism ws of the English full name is formed. For example, for x1, wj = (capital, med, univ), wq = (peking, university) and ws = (pu); for x2, wj = (univ, pompeu, fabra, upf), wq = (pompeu, fabra, university) and ws = (pfu);

B2: The similarity sim between the English abbreviation and the English full name is calculated from the text features of the segmented words:

In this specific example, the weight e is set to 0.5 and the threshold v is set to 0.25. For x1, univ corresponds to university, so cnta = 1, cntb = 0, cntc = 1, cntd = 0, nj = 3, nq = 2.

For x2, univ, pompeu, fabra and upf correspond to university, pompeu, fabra and pfu respectively, so cnta = 2, cntb = 2, cntc = 4, cntd = 2, nj = 4, nq = 3.

It can be seen that the similarity readily distinguishes whether the correspondence between an English abbreviation and an English full name is correct. The record x1, whose correspondence is wrong, can be split into two records, ('Capital Med Univ', 'Beijing') and ('Peking University', 'Beijing').
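The counts quoted for x1 and x2 can be reproduced with a short sketch of the counting rule stated in claim 4. It rests on assumptions: names are lower-cased and split on whitespace, the A1 correspondence table is left out of the complete-match test (so a complete match means identical words), and the function names are hypothetical. The similarity sim itself is not computed, since its formula is not reproduced in this text.

```python
from collections import Counter


def is_subsequence(short, long):
    """True if the characters of short appear in long in the same order."""
    it = iter(long)
    return all(ch in it for ch in short)


def partial(a, b):
    """Partial match: one word is a subsequence of the other."""
    return is_subsequence(a, b) or is_subsequence(b, a)


def permuted(a, b):
    """Permuted match: after reordering, one word is a subsequence of the other."""
    ca, cb = Counter(a), Counter(b)
    return not (ca - cb) or not (cb - ca)


def match_counts(short_name, full_name):
    """Return (cnta, cntb, cntc, cntd, nj, nq) for an abbreviation and a full name."""
    wj = short_name.lower().split()
    wq = full_name.lower().split()
    ws = "".join(w[0] for w in wq)  # initialism of the full name
    cnta = cntb = cntc = cntd = 0
    for w in wj:
        if w in wq:                            # complete match with a full-name word
            cntb += 1
        elif w == ws or partial(w, ws):        # complete or partial match with ws
            cntb += 1
            cntd += len(w)
        elif any(partial(w, q) for q in wq):   # partial match with a full-name word
            cnta += 1
        elif permuted(w, ws):                  # permuted match with ws
            cnta += 1
            cntc += len(w)
    for q in wq:
        if q in wj:
            cntd += 1
        elif any(partial(q, w) for w in wj):
            cntc += 1
    return cnta, cntb, cntc, cntd, len(wj), len(wq)
```

Under these assumptions, `match_counts('Capital Med Univ', 'Peking University')` gives the counts listed for x1 and `match_counts('Univ Pompeu Fabra UPF', 'Pompeu Fabra University')` gives those listed for x2.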

Step 3: Completing the English full name and Chinese name by translation

The Baidu translation API is called to translate English full names into Chinese names (English-to-Chinese translation) and Chinese names into English full names (Chinese-to-English translation), thereby completing the missing names. The group of translation results corresponding to each line is obtained from the trans_result field of the returned result, and a translation speed of about 200 institution names per second can be reached.

Step 4: Correcting erroneous Chinese names

In the translation results of step 3, a small amount of data fails to translate and yields wrong Chinese names, and a small amount yields errors because the English full name contains redundant information; the Chinese names in the original data also contain a small amount of redundant information, which causes errors. This step corrects the Chinese names, identifying the wrong ones from the characteristics of institution Chinese names. For example, the translation of 'University Paris 10' to 'University of Paris 10' carries an extra suffix that comes from the English data, and the translation of 'Rettew Associates Incorporated' to 'Rettew Associates Corp' is a translation failure.

D1: A Chinese name should not end with non-Chinese characters, should not contain too many non-Chinese characters, and should not be too short. In this specific example, the length threshold k is 3 and the English character ratio threshold p is 0.75. Thus the Chinese name of length 2 in the record with abbreviation 'Tec' and full name 'Tecnology' is deleted, the untranslated Chinese name in the record for 'Rettew Associates Incorporated' is cleared, and 'University of Paris 10' is changed to 'University of Paris'.

D2: For the Chinese names processed by D1, the occurrence counts of institution Chinese name suffixes of different lengths are tallied, and the common suffixes of institution Chinese names are determined automatically by a recursive method. In this specific example, the suffix type threshold t is 10, the threshold t1 of the information entropy v1 is 2, and the threshold t2 of the high-frequency suffix ratio v2 is 0.477. Automatic statistics yield 266 common suffixes: shi 1127185 times, center 289467 times, hospital 272157 times, school 167655 times, school 153338 times, and so on.
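The statistics behind the splitting decision can be sketched as follows: count the suffixes that are one character longer than a given suffix, then compute the information entropy of those counts. This is an illustration only. The base-2 logarithm, the function names and the sample names are assumptions, and the high-frequency suffix ratio v2 is left out because its definition is not reproduced in this text.

```python
import math
from collections import Counter


def child_suffix_counts(names, x):
    """Count the suffixes of length len(x) + 1 among the names that end with x."""
    k = len(x) + 1
    return Counter(n[-k:] for n in names if n.endswith(x) and len(n) >= k)


def entropy(counts):
    """Shannon entropy (base 2, an assumption) of a Counter of occurrence counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical sample: three names end with the suffix "大学" (university).
names = ["北京大学", "南京大学", "清华大学", "人民医院"]
children = child_suffix_counts(names, "大学")
v1 = entropy(children)  # higher values indicate a more uniform distribution
```

In this sample the suffix "大学" has two longer suffixes, "京大学" (twice) and "华大学" (once); with the thresholds of this example, a suffix would be split only if it had more than t = 10 such types and v1 exceeded t1 = 2, in addition to the v2 condition.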

D3: for the Chinese name processed by D1, if the suffix is the common suffix obtained by D2, the name is identified as the correct name; otherwise, if the ending part of the Chinese name can be deleted and changed to end with a common suffix, the modification is performed, for example, the 'Rostoff State university communication' is modified to 'Rostoff State university'; chinese names that cannot be corrected or are too short after correction are deleted, and finally correct Chinese names that all end with common suffixes are obtained.

And 5: academic institution consolidation based on text features

After completion and correction of the previous steps, except for a small amount of data with errors found in the step 4 without Chinese names, other data simultaneously contain English full names and Chinese names, and entity alignment can be carried out by combining Chinese and English names and geographic positions.

E1: The data are labeled with IDs, and different records that clearly represent the same institution are merged. Because translation maps many English full names of the same institution to the same Chinese name, all records with the same Chinese name are first labeled with the same ID, records without a Chinese name but with the same English full name are labeled with the same ID, and finally the IDs of groups that share an English full name are merged. For example, the records with IDs '09547' and '15852', both with the full name 'Mcmaster University', are merged because their translated Chinese names are the same, and the groups with IDs '33886' and '33945' are merged because both contain the English full name 'Beijing Jiao Tong University'.

E2: The alignment relations are further extended on the basis of the entity alignment obtained in E1. Because most Chinese names are produced by translation and not every record has one, whereas every record has an English full name, the text similarity of the English full names is used to judge whether the two groups of data corresponding to two IDs represent the same institution:

For the two groups of institution data corresponding to two IDs, the similarity is calculated pair by pair and then averaged; the geographic location words and number words of each institution must be extracted first. In this specific example, the threshold pp is 0.655, the weight ee is 0.5 and z is 0.8. For example, the groups with IDs '51140' and '60516' contain variants of the same hospital name in Guangzhou written with '1st' and with 'First'; the geographic location word is Guangzhou and the number word is First. The two groups are the same entity, and since the similarity calculated by the above formula is 0.8125 > 0.655, they can be merged.

After the academic institution IDs have been matched and merged with the union-find algorithm according to the above process, the group of data corresponding to each ID represents one institution, and different IDs correspond to different institutions. Collecting all Chinese names, English full names and English abbreviations in the data of the same ID gives the multiple different names and the unique geographic location of that institution, which is the final result of academic institution name entity alignment.

In summary, the invention provides a framework for entity alignment of academic institution names, devises a processing flow adapted to the actual condition and characteristics of academic institution data, completes the entity alignment task in 5 steps, and uses the alignment result in the construction of an academic knowledge graph. By using a series of methods that combine text features, geographic location information and both Chinese and English names, the method corrects erroneous data and completes missing data, overcomes the scarcity of labeled data and the absence of contextual semantic information, and obtains a good entity alignment effect. This helps to build a better academic knowledge graph and to improve related applications and literature search.

Equivalent modifications and variations of the details of implementation of the particular parts, which can be realized by a person skilled in the art on the basis of the principle idea of the process proposed by the present invention without departing from the spirit and scope of the inventive concept, are intended to be included in the present invention and protected by the following claims.

Claims

1. A text feature-based academic institution name entity alignment method is characterized by comprising the following specific steps:

Step 1: Conversion from English abbreviation to English full name

Step 3: Completing the English full name and Chinese name by translation

Step 4: Correcting erroneous Chinese names

Step 5: Academic institution merging based on text features,

2. The method according to claim 1, wherein the geographic location-based word replacement method in step 1 specifically comprises:

3. The method according to claim 1, wherein the step 2 of determining whether the english abbreviation corresponds to the english full name correctly comprises:

4. The method of claim 3, wherein the calculating of similarity sim between the english abbreviation and the english full name in B2 specifically comprises:

where nj is the number of words in the English abbreviation, that is, the number of words in wj; nq is the number of words in the English full name, that is, the number of words in wq; cnta and cntb are the numbers of words in wj that find a partial and a complete correspondence in wq, respectively; cntc and cntd are the numbers of words in wq that find a partial and a complete correspondence in wj, respectively; and 0 < e < 1 is the weight of a partial correspondence. cnta, cntb, cntc and cntd are calculated as follows. Two words are a complete match if they are identical or satisfy a correspondence in A1, a partial match if one is a subsequence of the other, and a permuted match if one, after its characters are reordered, is a subsequence of the other. cnta, cntb, cntc and cntd are initially 0. For each word in wj: if it completely matches a word in wq, cntb is incremented by 1; otherwise, if it completely or partially matches ws, cntb is incremented by 1 and cntd is incremented by the word length; otherwise, if it partially matches a word in wq, cnta is incremented by 1; otherwise, if it is a permuted match of ws, cnta is incremented by 1 and cntc is incremented by the word length. For each word in wq: if it completely matches a word in wj, cntd is incremented by 1; otherwise, if it partially matches a word in wj, cntc is incremented by 1.

5. The method according to claim 1, wherein the step 4 of correcting the suffix frequency statistics comprises:

6. The method as claimed in claim 5, wherein the step of determining common suffixes of names of institutions in a recursive manner according to D2 comprises:

7. The method according to claim 1, wherein the step 5 of merging a plurality of different pieces of data of the same institution comprises:

8. The method according to claim 7, wherein the calculating the similarity s between two sets of data in E2 specifically comprises:

wherein wx and wy are the word segmentations of the English full names of xi and yj after the geographic location words and number words have been removed; nx and ny are the numbers of words in the English full names of xi and yj, that is, the numbers of words in wx and wy; ca and cb are the numbers of words in wx that find a partial and a complete correspondence in wy, respectively; inv is the number of inversion pairs in the sequence of positions in wy matched by the words of wx; 0 < ee < 1 is the weight of a partial correspondence relative to a complete correspondence; and 0 < z < 1 is the weight of the similarity relative to the inversion pairs. ca, cb and inv are calculated as follows: two words are a complete match if they are identical or satisfy a correspondence in step A1, and a partial match if one is a subsequence of the other; ca and cb are initially 0; each word in wx is matched against the words in wy with a greedy strategy, cb being incremented by 1 if the word completely matches a word in wy and otherwise ca being incremented by 1 if it partially matches a word in wy; each word in wy can be matched with a word in wx only once, and if several words in wy can match one word in wx, the one earliest in the segmentation order is matched first; inv is the number of inversion pairs in the matching order of wx and wy.