CN113609304A - Entity matching method and device - Google Patents

Entity matching method and device

Info

Publication number
CN113609304A
CN113609304A (Application CN202110818313.8A)
Authority
CN
China
Prior art keywords
data set
entity
combinations
sentences
combination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110818313.8A
Other languages
Chinese (zh)
Other versions
CN113609304B (en)
Inventor
周琥晨
李默涵
张雨成
顾钊铨
韩伟红
唐可可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202110818313.8A priority Critical patent/CN113609304B/en
Publication of CN113609304A publication Critical patent/CN113609304A/en
Application granted granted Critical
Publication of CN113609304B publication Critical patent/CN113609304B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology

Abstract

The invention relates to the technical field of entity matching, and discloses an entity matching method and device, wherein the method comprises the following steps: acquiring a first data set and a second data set, wherein each data set comprises a plurality of entity records, and each entity record comprises a plurality of attributes; obtaining the Cartesian product of the first data set and the second data set to obtain a third data set, and combining each entity record in the third data set into a sentence according to preset potential relations among the attributes in the entity records, so as to obtain a fourth data set comprising second combinations; and inputting the second combinations in the fourth data set into a preset Bert model, wherein the Bert model is used for judging whether the two sentences of each second combination match and outputting a matching result. Advantageous effects: the entity records in the third data set are replaced with sentences generated according to the potential relations among attributes, so that the data input into the Bert model as second combinations preserves the relations between the attributes, and the matching result for the entity records in the data sets is more accurate.

Description

Entity matching method and device
Technical Field
The present invention relates to the technical field of entity matching, and in particular, to a method and an apparatus for entity matching.
Background
The goal of entity matching is to identify heterogeneous expressions of the same real-world entity across different data sources. Entity matching is an important step in knowledge fusion, but the real world presents a multi-source heterogeneous data environment, including structured data, dirty data, textual data and the like. These multi-source heterogeneous environments must be carefully taken into account, and targeted processing methods are needed.
In the task of entity matching, the data to be matched are two data sets A and B. Each of the data sets A and B comprises a plurality of entity records, each entity record comprises a plurality of attributes of one entity, and the two data sets share the same attributes. Data sets A and B come from two different sources, and each may contain entity records describing the same real-world entities; the object of the entity matching task is to find all matched entity record pairs between data set A and data set B. For example, each matched pair of entity records consists of two entity records tA and tB from data sets A and B respectively, where tA and tB describe the same real-world entity, and there may be multiple entity records ti in data set A corresponding to a single record tB in data set B.
Some entity matching methods exist in the prior art, but these methods often match entity records directly, without considering the relations among the attributes within each record, so the matching results contain large errors. The prior entity matching methods therefore need to be improved in order to raise the accuracy of entity matching.
Disclosure of Invention
The purpose of the invention is to provide an entity matching method and device that comprehensively consider the content of the entity records and improve the accuracy of entity matching.
In order to achieve the above object, the present invention provides an entity matching method, including:
acquiring a first data set and a second data set which need to be matched, wherein the first data set and the second data set respectively comprise a plurality of entity records, and each entity record comprises a plurality of attributes.
And acquiring a Cartesian product of the first data set and the second data set to obtain a third data set, wherein the third data set comprises a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
According to preset potential relations among a plurality of attributes in the entity records, sentence combination is carried out on each entity record in the third data set, and a fourth data set is obtained; the fourth data set includes a number of sets of second combinations, each second combination being a combination of the sentences converted from the corresponding entity records.
And inputting each group of second combinations in the fourth data set into a preset Bert model, converting the input sentences into entity embedding vectors by the Bert model, comparing whether two sentences in each group of second combinations are matched or not through the entity embedding vectors, and outputting a matching result.
Further, after the third data set is obtained, performing a blocking operation on the third data set to remove a negative example in the third data set, where the negative example is a first combination of entity records of the first data set and entity records of the second data set that are obviously unmatched.
Further, the blocking operation is performed on the third data set, and the specific method includes: attribute equality blocking and rule-based blocking;
The attribute-equality blocking specifically includes: judging whether the attribute values of the two entity records in each group of first combinations are equal; if a first number of the attribute values are not equal, deleting the first combination, and otherwise reserving the first combination, wherein the first number is smaller than the number of attribute values in an entity record.
The rule-based blocking specifically comprises: and judging whether the attribute values of the two entity records in each group of first combinations simultaneously meet a preset first condition, if so, reserving, and if not, deleting.
Further, after the blocking operation is performed on the third data set, the third data set is subjected to a first preprocessing, so that the third data set subjected to the first preprocessing meets the SBert model input standard.
Further, according to preset potential relationships among a plurality of attributes in the entity records, sentence combination is performed on each entity record in the third data set, specifically:
acquiring a potential relation between any two attributes in the entity record, and acquiring a phrase formed by any two attributes according to the potential relation;
composing the obtained plurality of phrases into sentences;
and replacing the obtained sentences into a third data set according to the corresponding relation between the sentences and the entity records.
Further, the Bert model is specifically an SBert model, and the SBert model includes a first Bert model and a second Bert model arranged as a weight-sharing twin neural network; when a second combination is input into the SBert model, the first Bert model and the second Bert model respectively process the two sentences in the second combination, and the entity embedding vector converted from each sentence is stored.
Further, when a sentence in a subsequently input second combination has already been processed by the SBert model, the saved entity embedding vector is called for the matching judgment.
Further, comparing whether two sentences in each group of second combinations are matched through the entity embedded vector specifically includes:
calculating the cosine similarity of the entity embedding vectors corresponding to the two sentences in the second combination, and judging whether the value of the cosine similarity is larger than or equal to a preset first threshold; if so, the two sentences in the second combination are determined to be matched, and if not, they are determined to be unmatched.
The invention also discloses an entity matching device which is characterized by comprising a first acquisition module, a second acquisition module, a first processing module and a second processing module.
The first obtaining module is configured to obtain a first data set and a second data set that need to be matched, where the first data set and the second data set each include a plurality of entity records, and each entity record includes a plurality of attributes.
The second obtaining module is configured to obtain a cartesian product of the first data set and the second data set to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
The first processing module is used for combining each entity record in the third data set into a sentence according to preset potential relations among a plurality of attributes in the entity records, so as to obtain a fourth data set; the fourth data set includes a number of sets of second combinations, each second combination being a combination of the sentences converted from the corresponding entity records.
And the second processing module is used for inputting each group of second combinations in the fourth data set into a preset Bert model, converting the input sentences into entity embedded vectors by the Bert model, comparing whether two sentences in each group of second combinations are matched or not through the entity embedded vectors, and outputting a matching result.
Further, the matching device further comprises a third processing module arranged between the second acquiring module and the first processing module.
And the third processing module is used for performing blocking operation on the third data set and removing a negative example in the third data set, wherein the negative example is a combination of entity records of the first data set and entity records of the second data set which are obviously unmatched.
Compared with the prior art, the entity matching method and device have the following advantage: the entity records in the third data set are replaced with sentences generated according to the potential relations among attributes, so that the data input into the Bert model as second combinations preserves the relations between the attributes, and the matching result for the entity records in the data sets is more accurate.
Drawings
FIG. 1 is a schematic flow chart of an entity matching method of the present invention;
fig. 2 is a schematic structural diagram of an entity matching apparatus according to the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Example 1:
referring to the attached figure 1, the invention discloses an entity matching method, which is applied to entity matching among different data sets and mainly comprises the following steps:
step S1, obtaining a first data set and a second data set to be matched, where the first data set and the second data set each include a number of entity records, and each entity record includes a number of attributes.
Step S2, obtaining a cartesian product of the first data set and the second data set to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
Step S3, according to the preset potential relations among a plurality of attributes in the entity records, sentence combination is carried out on each entity record in the third data set, and a fourth data set is obtained; the fourth data set includes a number of sets of second combinations, each second combination being a combination of the sentences converted from the corresponding entity records.
Step S4, inputting each group of second combinations in the fourth data set to a preset Bert model, where the Bert model converts the input sentences into entity-embedded vectors, compares whether two sentences in each group of second combinations match with each other through the entity-embedded vectors, and outputs a matching result.
In step S1, a first data set and a second data set that need to be matched are obtained, where the first data set and the second data set each include a number of entity records, and each entity record includes a number of attributes. For the sake of clarity, the description is given in mathematical language, as follows:
Acquire data set A = (A1, A2, A3, …, Ai) and data set B = (B1, B2, B3, …, Bj), wherein element Ai of data set A is an entity record comprising a plurality of attributes (A11, A12, A13, …, A1k); for example, A1 comprises the attributes (A11, A12, A13). Element Bj of data set B is an entity record comprising a plurality of attributes (B11, B12, B13, …, B1l); for example, B1 comprises the attributes (B11, B12, B13). The values of i, j, k and l are natural numbers larger than zero.
In step S2, a cartesian product of the first data set and the second data set is obtained to obtain a third data set, where the third data set includes several groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
The meaning of the Cartesian product is known to the person skilled in the art. Specifically, the Cartesian product (also known as the direct product) of two sets X and Y is the set of all ordered pairs in which the first object is a member of X and the second object is a member of Y. Assuming that set A is {a, b} and set B is {0, 1, 2}, the Cartesian product of the two sets is {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}.
In this embodiment, the Cartesian product of data set A and data set B is obtained to produce data set D = (A1B1, A1B2, A1B3, …, A1Bj, A2B1, A2B2, …, AiBj), wherein A1B1 represents the combination of the first entity record in the first data set with the first entity record in the second data set; the other combinations can be understood by analogy. Such a combination is designated a first combination, so data set D includes several first combinations.
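The construction of the third data set from the Cartesian product can be sketched in a few lines; the toy records below are hypothetical examples, not data from the patent:

```python
from itertools import product

# Hypothetical toy records: each entity record is a dict of attribute values.
dataset_a = [{"id": "A1", "title": "iPhone 12"}, {"id": "A2", "title": "Galaxy S21"}]
dataset_b = [{"id": "B1", "title": "Apple iPhone 12"}, {"id": "B2", "title": "Pixel 5"}]

# The third data set D is the Cartesian product A x B: every (Ai, Bj) first combination.
dataset_d = list(product(dataset_a, dataset_b))
print(len(dataset_d))  # 2 * 2 = 4 first combinations
```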
In step S3, according to preset potential relations among a plurality of attributes in the entity records, sentence combination is performed on each entity record in the third data set to obtain a fourth data set; the fourth data set includes a number of sets of second combinations, each second combination being a combination of the sentences converted from the corresponding entity records.
In this embodiment, according to a preset potential relationship among a plurality of attributes in an entity record, each entity record in the third data set is subjected to sentence combination, specifically:
and acquiring the potential relationship between any two attributes in the entity record, and acquiring a phrase formed by any two attributes according to the potential relationship. And forming a sentence by the obtained plurality of phrases. And replacing the obtained sentences into a third data set according to the corresponding relation between the sentences and the entity records.
In this embodiment, potential relationship embedding is performed on data set D. Specifically: according to the potential relations among the attributes in element Ai, the different attributes are connected in series into a sentence based on those potential relations.
For example, A1 comprises the attributes (A11, A12, A13). rel(A11, A12) is a phrase indicating the potential relationship between A11 and A12; for example, if A11 and A12 are title and manufacturer, rel(A11, A12) is "made by". Potential relationship embedding: rel(A11, A12) is embedded between the corresponding attributes A11 and A12 of the entity record (A11, A12, A13). The above process is repeated to establish the relationship between A11 and A13, and between A12 and A13, so that all attributes are connected in series by their potential relations and finally combined into one sentence; a plurality of attributes can be connected in series into a sentence of the form "A11 rel(A11, A12) A12 … A1k-1 rel(A1k-1, A1k) A1k". The entity record is then no longer made up of separate attributes, but becomes a sentence that describes the entity.
Latent relationship embedding is performed on data set A and data set B, so that all entity records are converted into corresponding sentences. The sentence converted from an entity record in the first data set is recorded as SAi, and the sentence converted from an entity record in the second data set is recorded as SBj. After the latent relationship embedding, a data set E (the fourth data set) is generated: (SA1SB1, SA1SB2, SA1SB3, …, SA1SBj, SA2SB1, SA2SB2, …, SAiSBj), wherein SAiSBj is a second combination.
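The serialization step above can be sketched as follows; the relation table, attribute names and record values are hypothetical illustrations, not the patent's actual assignments:

```python
# A sketch of "potential relationship embedding": attributes are joined into one
# sentence using preset relation phrases. The RELATIONS table is a hypothetical example.
RELATIONS = {("title", "manufacturer"): "made by", ("manufacturer", "year"): "released in"}

def record_to_sentence(record, attr_order):
    parts = [str(record[attr_order[0]])]
    for prev, curr in zip(attr_order, attr_order[1:]):
        rel = RELATIONS.get((prev, curr), "")
        if rel:
            parts.append(rel)  # embed the relation phrase between the two attributes
        parts.append(str(record[curr]))
    return " ".join(parts)

record = {"title": "iPhone 12", "manufacturer": "Apple", "year": 2020}
print(record_to_sentence(record, ["title", "manufacturer", "year"]))
# "iPhone 12 made by Apple released in 2020"
```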
Further, the potential relations between the attributes are uniformly assigned according to the data sets and the semantics of the attribute columns. In entity matching, the attribute types of the entity records are usually the same; for example, the first entity record may be (Zhang San, male, 18 years old) and the second entity record (Li Si, male, 20 years old). When the potential relations are acquired, the attribute types of the first entity record and the second entity record are known to be the same. Therefore, data can be collected and sorted in the data acquisition stage so that attributes of the same type fall in the same column of the table; when latent relationship embedding is performed, the assignment can then be made per attribute column, which simplifies the process of assigning the potential relations.
In the embodiment, the entities of the data set in the entity matching have many attributes, and if the entities are simply spliced, the potential semantic association among the attributes is lost. By the method for adding the potential relationship, relationship perception can be carried out on each attribute of the entity record, the potential relationship embedding is added, semantic information is enriched, and entity information sentences which are easier to understand by a Bert model are generated.
In step S4, each set of second combinations in the fourth data set is input to a preset Bert model, which converts the input sentences into entity-embedded vectors and compares whether two sentences in each set of second combinations match with each other by the entity-embedded vectors and outputs a matching result.
In this embodiment, for further optimization of the technical solution, the Bert model is specifically an SBert model, and the SBert model includes a first Bert model and a second Bert model that adopt a weight sharing twin neural network; when the second combination is input into the SBert model, the first Bert model and the second Bert model are respectively used for processing two sentences in the second combination and storing entity embedded vectors converted by each sentence.
In this embodiment, when a sentence in a subsequently input second combination has already been processed by the SBert model, the saved entity embedding vector is called for the matching judgment.
In the present embodiment, assuming that the sizes of the two data sets A and B are m and n entity records, respectively, the Cartesian product of the two data sets A and B has a size of m × n, and if input into the Bert model in a pair-wise manner, m × n calculations are required. However, by adopting the SBert model in the invention, each entity record of A and B can be independently input into the Bert model for coding calculation, and only m + n times of calculation is needed, thereby greatly reducing the calculation amount and improving the efficiency of entity matching.
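The m + n saving can be illustrated with a cached encoder; `fake_encode` below is a stand-in placeholder for illustration, not a real SBert encoder:

```python
# Illustration of the m + n encoding saving: each sentence is encoded once and
# cached, so every later pair reuses the stored vector instead of re-running the
# encoder. `fake_encode` is a hypothetical placeholder, not SBert.
cache = {}
calls = 0

def fake_encode(sentence):
    # placeholder encoder: a bag-of-characters vector, for illustration only
    vec = [0.0] * 26
    for ch in sentence.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def encode_cached(sentence):
    global calls
    if sentence not in cache:
        calls += 1  # the encoder runs only on a cache miss
        cache[sentence] = fake_encode(sentence)
    return cache[sentence]

sentences_a = ["iphone made by apple", "galaxy made by samsung"]  # m = 2
sentences_b = ["apple iphone 12", "pixel made by google"]         # n = 2
for sa in sentences_a:        # m * n = 4 candidate pairs ...
    for sb in sentences_b:
        encode_cached(sa), encode_cached(sb)
print(calls)  # ... but only m + n = 4 encoder calls
```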
In this embodiment, the comparing, by using the entity-embedded vector, whether two sentences in each group of second combinations match specifically includes:
calculating the cosine similarity of the entity embedding vectors corresponding to the two sentences in the second combination, and judging whether the value of the cosine similarity is larger than or equal to a preset first threshold; if so, the two sentences in the second combination are determined to be matched, and if not, they are determined to be unmatched.
In this embodiment, SBert uses a twin neural network, i.e., a model formed by a pair of parameter-sharing Bert models, and has higher encoding efficiency. The SBert model is composed of two Bert models; a pair of entity records can be separately input into the Bert models for encoding to generate embedding vectors. Each entity record is encoded, calculated and stored independently, so repeated encoding of entity records is avoided, reducing waste of time and space.
In this embodiment, the encoding process inputs each entity-record sentence into SBert separately and uses SBert to generate an embedding vector for the sentence. An optional implementation is as follows: the embedding is a matrix of dimension [n, 768], where n is the number of words in the input data, and each row of the matrix represents the association information and position information between the word at the current position and all words of the input data. For example, if "I love you" (three words) is the input to Bert, then the first row [1, 768] of the output matrix [3, 768] represents all the association information of the word "I" with respect to all words of the input, i.e., "love" and "you", including semantic association, position information, etc. The generated embedding matrix [n, 768] contains the information of one entity record; the [n, 768] matrix is then passed through an average pooling layer to extract information, reduce the vector size and reduce the computation, and finally the cosine similarity of the output vectors is calculated.
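A minimal sketch of mean pooling over an [n, 768] token matrix followed by cosine similarity; random vectors stand in for real SBert token embeddings:

```python
import math
import random

def mean_pool(token_matrix):
    # collapse an [n, 768] token matrix into a single [768] sentence vector
    n, dim = len(token_matrix), len(token_matrix[0])
    return [sum(row[d] for row in token_matrix) / n for d in range(dim)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

random.seed(0)
tokens_a = [[random.gauss(0, 1) for _ in range(768)] for _ in range(3)]  # 3-word sentence
tokens_b = [[random.gauss(0, 1) for _ in range(768)] for _ in range(5)]  # 5-word sentence
sim = cosine_similarity(mean_pool(tokens_a), mean_pool(tokens_b))
print(sim)  # a value in [-1, 1]
```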
During model training, the model parameters of SBert can be adjusted according to the labels. After multiple rounds of iterative training, once training reaches a certain number of steps the model is evaluated on a validation set; the model is saved as the updated checkpoint whenever its performance improves, and the final model is obtained when training finishes.
After the model is trained, the entity record pairs SAiSBj of the test set are compared by calculating the cosine similarity (range: −1 to 1) of their encoded entity embedding vectors. A plurality of candidate similarity thresholds θ are set; under a given threshold, a pair is judged matched when the cosine similarity is larger than or equal to θ and unmatched when it is smaller than θ. An F1 score is calculated under this criterion, so each similarity threshold θ corresponds to an F1 score, and the similarity threshold θ with the highest F1 score is selected as the optimal similarity threshold (the first threshold).
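Selecting the optimal threshold by F1 score can be sketched as follows, with hypothetical similarity values and gold labels:

```python
# Sweep candidate thresholds theta and keep the one with the highest F1 score.
# The similarity scores and labels below are toy illustrations.
def f1_at_threshold(sims, labels, theta):
    preds = [s >= theta for s in sims]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

sims = [0.95, 0.80, 0.60, 0.30]      # toy cosine similarities
labels = [True, True, False, False]  # toy gold match labels
best_theta = max([0.5, 0.7, 0.9], key=lambda t: f1_at_threshold(sims, labels, t))
print(best_theta)  # 0.7: it separates the two matches from the two non-matches
```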
Using the optimal similarity threshold as the first threshold for judging whether an entity record pair SAiSBj matches, each second combination in data set E (the fourth data set) is then input into the SBert model, encoded to generate vectors, and the vectors are stored. The cosine similarity of the embedding vectors of an entity pair SAiSBj is calculated; when the cosine similarity is larger than or equal to the optimal similarity threshold θ, the pair is judged to be matched, and when it is smaller, unmatched. The corresponding matching result is stored or recorded after each group of second combinations is input, until every group of second combinations has undergone matching judgment.
In this embodiment, the matching method further includes: after the third data set is obtained, performing a blocking operation on the third data set, and removing a negative example in the third data set, wherein the negative example is a first combination of entity records of the first data set and entity records of the second data set which are obviously unmatched.
In this embodiment, the blocking operation on the third data set specifically includes: attribute equality blocking and rule-based blocking;
the attribute equal blocking specifically includes: and judging whether a plurality of attribute values of two entity records in each group of first combination are equal, if the attribute values of the first number are not equal, deleting the first combination, and if the attribute values of the first number are not equal, reserving the first combination, wherein the first number is smaller than the number of the attribute values of the entity records.
The rule-based blocking specifically comprises: and judging whether the attribute values of the two entity records in each group of first combinations simultaneously meet a preset first condition, if so, reserving, and if not, deleting.
In this embodiment, the rule-based blocking is a conditional judgment on the values of one or more entity attributes; if the condition is not met, the record pair is deleted. For example, suppose both records of an entity record pair A1B1 have a year attribute and the rule is set to year ≥ 2000; if the attribute values of the entity record pair (A1.year, B1.year) do not meet the rule, the pair is deleted.
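The two blocking strategies can be sketched as follows; the attribute names, records and rule are illustrative assumptions:

```python
# Attribute-equality blocking keeps a pair only if at least `min_equal` attribute
# values agree; rule-based blocking applies a condition (e.g. year >= 2000) to
# both records of the pair. All values here are hypothetical.
def attribute_equal_block(pair, attrs, min_equal):
    a, b = pair
    equal = sum(a[attr] == b[attr] for attr in attrs)
    return equal >= min_equal  # True = keep the pair, False = delete it

def rule_block(pair, rule):
    a, b = pair
    return rule(a) and rule(b)  # both records must satisfy the condition

pair = ({"title": "iPhone", "year": 2020}, {"title": "iPhone", "year": 1999})
print(attribute_equal_block(pair, ["title", "year"], min_equal=1))  # True: titles agree
print(rule_block(pair, lambda r: r["year"] >= 2000))                # False: 1999 fails
```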
In this embodiment, after the blocking operation is performed on the third data set, the third data set is subjected to a first preprocessing, so that the third data set subjected to the first preprocessing satisfies SBert model input criteria.
The preprocessing specifically comprises steps such as removing null values and normalizing formats. If no preprocessing is performed, null values or non-uniform formats introduce interference and errors into the matching result and affect the model's judgment; therefore the input data is preprocessed to eliminate null values and unify formats, avoiding interference from unnecessary factors.
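A minimal sketch of this preprocessing step (null removal plus whitespace and case normalization), with hypothetical records:

```python
# Drop records containing null values, then normalize whitespace and case so the
# serialized sentences fed to the model are uniform. Records are hypothetical.
def preprocess(records):
    cleaned = []
    for rec in records:
        if any(v is None or v == "" for v in rec.values()):
            continue  # remove records with null values
        cleaned.append({k: " ".join(str(v).split()).lower() for k, v in rec.items()})
    return cleaned

raw = [{"title": "  iPhone   12 ", "year": 2020}, {"title": None, "year": 2019}]
print(preprocess(raw))  # [{'title': 'iphone 12', 'year': '2020'}]
```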
Example 2:
referring to fig. 2, the present invention further discloses an entity matching apparatus, which is applied to the same application scenario as the embodiment, and performs entity matching between data sets, wherein the apparatus includes a first obtaining module 101, a second obtaining module 102, a first processing module 103, and a second processing module 104.
The first obtaining module 101 is configured to obtain a first data set and a second data set that need to be matched, where the first data set and the second data set each include a plurality of entity records, and each entity record includes a plurality of attributes;
the second obtaining module 102 is configured to obtain a cartesian product of the first data set and the second data set to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set;
the first processing module 103 is configured to perform sentence combination on each entity record in the third data set according to a preset potential relationship among a plurality of attributes in the entity record, so as to obtain a fourth data set; the fourth data set comprises a plurality of groups of second combinations, wherein the second combinations are sentences and sentence combinations of corresponding entity records;
the second processing module 104 is configured to input each group of second combinations in the fourth data set to a preset Bert model, where the Bert model converts an input sentence into an entity embedding vector, and compares, by using the entity embedding vector, whether two sentences in each group of second combinations match with each other, and outputs a matching result.
In this embodiment, the matching apparatus further includes a third processing module disposed between the second obtaining module and the first processing module;
and the third processing module is used for performing blocking operation on the third data set and removing a negative example in the third data set, wherein the negative example is a combination of entity records of the first data set and entity records of the second data set which are obviously unmatched.
Since embodiment 2 is written based on embodiment 1, some of the same technical features are not described in embodiment 2.
To sum up, compared with the prior art, the entity matching method and device of the embodiment of the invention have the following beneficial effects:
(1) The entity records in the third data set are replaced with sentences generated according to the potential relations among attributes, so that the data input into the Bert model as second combinations preserves the relations between the attributes, and the matching result for the entity records in the data sets is more accurate.
(2) In the prior art, the second combination is input into the Bert model as a whole when entity matching is performed; if data sets A and B contain m and n entity records respectively, the Cartesian product of A and B has size m × n, and inputting the pairs into the Bert model pair-wise requires m × n calculations. By adopting the SBert model of the invention, each entity record of A and B is independently input into the Bert model for encoding calculation, and when a sentence in a subsequently input second combination has already been processed by the SBert model, the stored entity embedding vector is called for the matching judgment; only m + n encoding calculations are needed, which greatly reduces the computation and improves the efficiency of entity matching.
(3) The input data is preprocessed and blocked, so that the data input to the model is standard and uniform, invalid second combinations are eliminated, and matching efficiency and accuracy are improved. No manual work is involved in the data processing, blocking operation and model matching processes, so the method can be well applied to entity matching across different data sets while saving cost.
The above description covers only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and substitutions without departing from the technical principle of the present invention, and such modifications and substitutions shall also fall within the protection scope of the present invention.

Claims (10)

1. An entity matching method, comprising:
acquiring a first data set and a second data set which need to be matched, wherein the first data set and the second data set respectively comprise a plurality of entity records, and each entity record comprises a plurality of attributes;
acquiring a Cartesian product of a first data set and a second data set to obtain a third data set, wherein the third data set comprises a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set;
combining each entity record in the third data set into a sentence according to preset potential relations among the plurality of attributes in the entity records, to obtain a fourth data set; wherein the fourth data set comprises a plurality of groups of second combinations, and each second combination is a combination of the sentences corresponding to the entity records of a first combination;
inputting each group of second combinations in the fourth data set into a preset Bert model, wherein the Bert model converts the input sentences into entity embedding vectors, compares through the entity embedding vectors whether the two sentences in each group of second combinations match, and outputs a matching result.
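The Cartesian-product step of claim 1 can be sketched as follows; the toy records and attribute names are illustrative assumptions, not taken from the patent.

```python
from itertools import product

# Toy first and second data sets: each entity record is a dict of attributes.
A = [{"id": 1, "title": "iphone 12"}, {"id": 2, "title": "galaxy s21"}]
B = [{"id": "x", "title": "iphone 12 128gb"}]

# Third data set: every first combination (one record from A, one from B),
# i.e. the Cartesian product of the two data sets.
third_data_set = list(product(A, B))
```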
2. The entity matching method of claim 1, further comprising: after the third data set is obtained, performing a blocking operation on the third data set to remove negative examples from it, wherein a negative example is a first combination of an entity record of the first data set and an entity record of the second data set that clearly do not match.
3. The entity matching method according to claim 2, wherein the blocking operation performed on the third data set specifically comprises attribute-equality blocking and rule-based blocking;
the attribute-equality blocking specifically comprises: judging whether the attribute values of the two entity records in each group of first combinations are equal; if at least a first number of attribute values are unequal, deleting the first combination, otherwise retaining it, wherein the first number is smaller than the number of attribute values of an entity record;
the rule-based blocking specifically comprises: judging whether the attribute values of the two entity records in each group of first combinations simultaneously satisfy a preset first condition; if so, retaining the first combination, and if not, deleting it.
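Claim 3's two blocking passes can be sketched as follows. The record layout, attribute names, and the example rule are assumptions for illustration only.

```python
def blocking(first_combinations, first_number, rule):
    """Illustrative sketch of claim 3's blocking operation."""
    kept = []
    for a, b in first_combinations:
        # Attribute-equality blocking: delete if >= first_number values differ.
        unequal = sum(1 for key in a if a.get(key) != b.get(key))
        if unequal >= first_number:
            continue
        # Rule-based blocking: keep only pairs satisfying the preset condition.
        if not rule(a, b):
            continue
        kept.append((a, b))
    return kept

pairs = [
    ({"title": "iphone", "year": "2020"}, {"title": "iphone", "year": "2020"}),
    ({"title": "iphone", "year": "2020"}, {"title": "galaxy", "year": "2019"}),
]
# Example rule (an assumption): the two records must share the same year.
kept = blocking(pairs, first_number=2, rule=lambda a, b: a["year"] == b["year"])
```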
4. The entity matching method according to claim 2, further comprising: after the blocking operation is performed on the third data set, performing a first preprocessing on the third data set so that the preprocessed third data set satisfies the input requirements of the SBert model.
5. The entity matching method according to claim 1, wherein combining each entity record in the third data set into a sentence according to the preset potential relations among the plurality of attributes in the entity record specifically comprises:
acquiring the potential relation between any two attributes in the entity record, and forming a phrase from the two attributes according to that potential relation;
composing the obtained phrases into a sentence;
replacing the corresponding entity records in the third data set with the obtained sentences according to the correspondence between sentences and entity records.
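The phrase-then-sentence composition of claim 5 can be sketched as follows. The templates encoding the preset "potential relation" between attribute pairs, and the attribute names, are hypothetical.

```python
# Hypothetical templates expressing the preset "potential relation" between
# pairs of attributes; names and wording are assumptions, not from the patent.
TEMPLATES = {
    ("title", "brand"): "{title} is made by {brand}",
    ("title", "price"): "{title} costs {price}",
}

def record_to_sentence(record):
    # One phrase per attribute pair that has a template, joined into a sentence.
    phrases = [
        template.format(**record)
        for (attr1, attr2), template in TEMPLATES.items()
        if attr1 in record and attr2 in record
    ]
    return ", ".join(phrases) + "."
```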
6. The entity matching method according to claim 1, wherein the Bert model is specifically an SBert model comprising a first Bert model and a second Bert model that form a twin neural network with shared weights; when a second combination is input into the SBert model, the first Bert model and the second Bert model respectively process the two sentences in the second combination, and the entity embedding vector converted from each sentence is stored.
7. The entity matching method according to claim 6, wherein, when a sentence in a later-input second combination has already been processed by the SBert model, the stored entity embedding vector is retrieved to make the matching judgment.
8. The entity matching method according to claim 1, wherein comparing the two sentences in each group of second combinations through the entity embedding vectors specifically comprises:
calculating the cosine similarity of the entity embedding vectors corresponding to the two sentences in the second combination, and judging whether the cosine similarity is greater than or equal to a preset first threshold; if so, the two sentences in the second combination match, and if not, they do not match.
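The cosine-similarity decision of claim 8 can be sketched as follows; the threshold value 0.85 is illustrative, as the patent does not fix a number.

```python
import numpy as np

def is_match(u, v, threshold=0.85):
    # Cosine similarity of two embedding vectors, compared to a first threshold.
    sim = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return sim >= threshold
```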
9. An entity matching device is characterized by comprising a first acquisition module, a second acquisition module, a first processing module and a second processing module;
the first obtaining module is configured to obtain a first data set and a second data set that need to be matched, where the first data set and the second data set each include a plurality of entity records, and each entity record includes a plurality of attributes;
the second obtaining module is configured to obtain a cartesian product of the first data set and the second data set to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set;
the first processing module is configured to combine each entity record in the third data set into a sentence according to preset potential relations among the plurality of attributes in the entity records, to obtain a fourth data set; the fourth data set comprises a plurality of groups of second combinations, and each second combination is a combination of the sentences corresponding to the entity records of a first combination;
the second processing module is configured to input each group of second combinations in the fourth data set into a preset Bert model, wherein the Bert model converts the input sentences into entity embedding vectors, compares through the entity embedding vectors whether the two sentences in each group of second combinations match, and outputs a matching result.
10. The entity matching device of claim 9, further comprising a third processing module disposed between the second obtaining module and the first processing module;
the third processing module is configured to perform a blocking operation on the third data set and remove negative examples from it, wherein a negative example is a first combination of an entity record of the first data set and an entity record of the second data set that clearly do not match.
CN202110818313.8A 2021-07-20 2021-07-20 Entity matching method and device Active CN113609304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110818313.8A CN113609304B (en) 2021-07-20 2021-07-20 Entity matching method and device

Publications (2)

Publication Number Publication Date
CN113609304A true CN113609304A (en) 2021-11-05
CN113609304B CN113609304B (en) 2023-05-23

Family

ID=78337975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110818313.8A Active CN113609304B (en) 2021-07-20 2021-07-20 Entity matching method and device

Country Status (1)

Country Link
CN (1) CN113609304B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955837A (en) * 2011-12-13 2013-03-06 East China Normal University Analogy retrieval control method based on Chinese word pair relationship similarity
CN106205608A (en) * 2015-05-29 2016-12-07 Microsoft Technology Licensing, LLC Language modeling for speech recognition using knowledge graphs
CN107145523A (en) * 2017-04-12 2017-09-08 Zhejiang University Large-scale heterogeneous knowledge base alignment method based on iterative matching
CN107818081A (en) * 2017-09-25 2018-03-20 Shenyang Aerospace University Sentence similarity assessment method based on deep semantic model and semantic role labeling
US20200227032A1 (en) * 2018-02-24 2020-07-16 Twenty Lane Media, LLC Systems and Methods for Generating and Recognizing Jokes
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method

Also Published As

Publication number Publication date
CN113609304B (en) 2023-05-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant