CN113609304A - Entity matching method and device - Google Patents

Entity matching method and device

Info

Publication number
CN113609304A
CN113609304A (Application CN202110818313.8A)
Authority
CN
China
Prior art keywords
data set
entity
combinations
sentences
combination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110818313.8A
Other languages
Chinese (zh)
Other versions
CN113609304B (en)
Inventor
周琥晨
李默涵
张雨成
顾钊铨
韩伟红
唐可可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202110818313.8A priority Critical patent/CN113609304B/en
Publication of CN113609304A publication Critical patent/CN113609304A/en
Application granted granted Critical
Publication of CN113609304B publication Critical patent/CN113609304B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology

Abstract

The invention relates to the technical field of entity matching, and discloses an entity matching method and device, wherein the method comprises the following steps: acquiring a first data set and a second data set, wherein each data set comprises a plurality of entity records, and each entity record comprises a plurality of attributes; obtaining the Cartesian product of the first data set and the second data set to obtain a third data set, and combining each entity record in the third data set into a sentence according to preset potential relations among the attributes in the entity records, so as to obtain a fourth data set comprising second combinations; and inputting the second combinations in the fourth data set into a preset Bert model, wherein the Bert model is used for judging whether the two sentences of each second combination match and outputting a matching result. Advantageous effects: the entity records in the third data set are replaced with sentences generated according to the potential relations among attributes, so that the data input into the Bert model as second combinations preserves the relations between the attributes, and the matching result for the entity records in the data sets is more accurate.

Description

Entity matching method and device
Technical Field
The present invention relates to the technical field of entity matching, and in particular, to a method and an apparatus for entity matching.
Background
The goal of entity matching is to identify heterogeneous expressions of the same real-world entity across different data sources. Entity matching is an important step in knowledge fusion, but the real world presents a multi-source heterogeneous data environment, including structured data, dirty data, textual data and the like. These multi-source heterogeneous environments must be carefully taken into account, and targeted processing methods are needed.
In the task of entity matching, the data to be matched are two data sets A and B. Each of the data sets A and B comprises a plurality of entity records, each entity record comprises a plurality of attributes of one entity, and the two data sets share the same attributes. Data sets A and B come from two different sources, and each may contain entity records describing the same real-world entities; the object of the entity matching task is to find all matched entity record pairs between data set A and data set B. For example, each matched pair of entity records consists of two entity records tA and tB from data sets A and B respectively, where tA and tB describe the same real-world entity, and there may be multiple entity records ti in data set A corresponding to a single record tB in data set B.
Some entity matching methods exist in the prior art, but these methods often match entity records directly, without considering the relations among the attributes within each record, so the matching results contain large errors. The prior entity matching methods therefore need to be improved in order to raise the accuracy of entity matching.
Disclosure of Invention
The purpose of the invention is to provide an entity matching method and device that comprehensively consider the content of the entity records and improve the accuracy of entity matching.
In order to achieve the above object, the present invention provides an entity matching method, including:
acquiring a first data set and a second data set which need to be matched, wherein the first data set and the second data set respectively comprise a plurality of entity records, and each entity record comprises a plurality of attributes.
And acquiring a Cartesian product of the first data set and the second data set to obtain a third data set, wherein the third data set comprises a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
According to preset potential relations among a plurality of attributes in the entity records, sentence combination is carried out on each entity record in the third data set, and a fourth data set is obtained; the fourth data set includes a number of sets of second combinations, each second combination being a combination of the sentences converted from the corresponding entity records.
And inputting each group of second combinations in the fourth data set into a preset Bert model, converting the input sentences into entity embedding vectors by the Bert model, comparing whether two sentences in each group of second combinations are matched or not through the entity embedding vectors, and outputting a matching result.
Further, after the third data set is obtained, performing a blocking operation on the third data set to remove a negative example in the third data set, where the negative example is a first combination of entity records of the first data set and entity records of the second data set that are obviously unmatched.
Further, the blocking operation is performed on the third data set, and the specific method includes: attribute equality blocking and rule-based blocking;
The attribute-equality blocking specifically includes: judging whether the attribute values of the two entity records in each group of first combinations are equal; if a first number of the attribute values are not equal, deleting the first combination, and otherwise reserving the first combination, wherein the first number is smaller than the number of attribute values in an entity record.
The rule-based blocking specifically comprises: and judging whether the attribute values of the two entity records in each group of first combinations simultaneously meet a preset first condition, if so, reserving, and if not, deleting.
Further, after the blocking operation is performed on the third data set, the third data set is subjected to a first preprocessing, so that the third data set subjected to the first preprocessing meets the SBert model input standard.
Further, according to preset potential relationships among a plurality of attributes in the entity records, sentence combination is performed on each entity record in the third data set, specifically:
acquiring a potential relation between any two attributes in the entity record, and acquiring a phrase formed by any two attributes according to the potential relation;
composing the obtained plurality of phrases into sentences;
and replacing the obtained sentences into a third data set according to the corresponding relation between the sentences and the entity records.
Further, the Bert model is specifically an SBert model, and the SBert model includes a first Bert model and a second Bert model arranged as a weight-sharing twin neural network; when a second combination is input into the SBert model, the first Bert model and the second Bert model respectively process the two sentences in the second combination, and the entity embedding vector converted from each sentence is stored.
Further, when a sentence in a subsequently input second combination has already been processed by the SBert model, the saved entity embedding vector is called for the matching judgment.
Further, comparing whether two sentences in each group of second combinations are matched through the entity embedded vector specifically includes:
calculating the cosine similarity of the entity embedding vectors corresponding to the two sentences in the second combination, and judging whether the value of the cosine similarity is larger than or equal to a preset first threshold; if so, the two sentences in the second combination are determined to be matched, and if not, they are determined to be unmatched.
The invention also discloses an entity matching device which is characterized by comprising a first acquisition module, a second acquisition module, a first processing module and a second processing module.
The first obtaining module is configured to obtain a first data set and a second data set that need to be matched, where the first data set and the second data set each include a plurality of entity records, and each entity record includes a plurality of attributes.
The second obtaining module is configured to obtain a cartesian product of the first data set and the second data set to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
The first processing module is used for combining each entity record in the third data set into a sentence according to preset potential relations among a plurality of attributes in the entity records, so as to obtain a fourth data set; the fourth data set includes a number of sets of second combinations, each second combination being a combination of the sentences converted from the corresponding entity records.
And the second processing module is used for inputting each group of second combinations in the fourth data set into a preset Bert model, converting the input sentences into entity embedded vectors by the Bert model, comparing whether two sentences in each group of second combinations are matched or not through the entity embedded vectors, and outputting a matching result.
Further, the matching device further comprises a third processing module arranged between the second acquiring module and the first processing module.
And the third processing module is used for performing blocking operation on the third data set and removing a negative example in the third data set, wherein the negative example is a combination of entity records of the first data set and entity records of the second data set which are obviously unmatched.
Compared with the prior art, the entity matching method and device have the following advantage: the entity records in the third data set are replaced with sentences generated according to the potential relations among attributes, so that the data input into the Bert model as second combinations preserves the relations between the attributes, and the matching result for the entity records in the data sets is more accurate.
Drawings
FIG. 1 is a schematic flow chart of an entity matching method of the present invention;
fig. 2 is a schematic structural diagram of an entity matching apparatus according to the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Example 1:
referring to the attached figure 1, the invention discloses an entity matching method, which is applied to entity matching among different data sets and mainly comprises the following steps:
step S1, obtaining a first data set and a second data set to be matched, where the first data set and the second data set each include a number of entity records, and each entity record includes a number of attributes.
Step S2, obtaining a cartesian product of the first data set and the second data set to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
Step S3, according to the preset potential relations among a plurality of attributes in the entity records, sentence combination is carried out on each entity record in the third data set, and a fourth data set is obtained; the fourth data set includes a number of sets of second combinations, each second combination being a combination of the sentences converted from the corresponding entity records.
Step S4, inputting each group of second combinations in the fourth data set to a preset Bert model, where the Bert model converts the input sentences into entity-embedded vectors, compares whether two sentences in each group of second combinations match with each other through the entity-embedded vectors, and outputs a matching result.
In step S1, a first data set and a second data set that need to be matched are obtained, where the first data set and the second data set each include a number of entity records, and each entity record includes a number of attributes. For the sake of clarity, the description is given in mathematical language, as follows:
Acquire data set A = (A1, A2, A3, …, Ai) and data set B = (B1, B2, B3, …, Bj), wherein element Ai of data set A is an entity record comprising a plurality of attributes (A11, A12, A13, …, A1k); for example, A1 comprises the attributes (A11, A12, A13). Element Bj of data set B is an entity record comprising a plurality of attributes (B11, B12, B13, …, B1l); for example, B1 comprises the attributes (B11, B12, B13). The values of i, j, k and l are natural numbers larger than zero.
In step S2, a cartesian product of the first data set and the second data set is obtained to obtain a third data set, where the third data set includes several groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
The meaning of the Cartesian product is known to the person skilled in the art. Specifically, the Cartesian product (also known as the direct product) of two sets X and Y is the set of all ordered pairs in which the first object is a member of X and the second object is a member of Y. Assuming that set A is {a, b} and set B is {0, 1, 2}, the Cartesian product of the two sets is {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}.
In this embodiment, the Cartesian product of data set A and data set B is obtained to produce data set D = (A1B1, A1B2, A1B3, …, A1Bj, A2B1, A2B2, …, AiBj), wherein A1B1 represents the combination of the first entity record in the first data set with the first entity record in the second data set; the other combinations can be understood by analogy. Such a combination is designated a first combination, so data set D includes several first combinations.
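The construction of the third data set from the Cartesian product can be sketched in a few lines; the toy records below are hypothetical examples, not data from the patent:

```python
from itertools import product

# Hypothetical toy records: each entity record is a dict of attribute values.
dataset_a = [{"id": "A1", "title": "iPhone 12"}, {"id": "A2", "title": "Galaxy S21"}]
dataset_b = [{"id": "B1", "title": "Apple iPhone 12"}, {"id": "B2", "title": "Pixel 5"}]

# The third data set D is the Cartesian product A x B: every (Ai, Bj) first combination.
dataset_d = list(product(dataset_a, dataset_b))
print(len(dataset_d))  # 2 * 2 = 4 first combinations
```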
In step S3, according to preset potential relations among a plurality of attributes in the entity records, sentence combination is performed on each entity record in the third data set to obtain a fourth data set; the fourth data set includes a number of sets of second combinations, each second combination being a combination of the sentences converted from the corresponding entity records.
In this embodiment, according to a preset potential relationship among a plurality of attributes in an entity record, each entity record in the third data set is subjected to sentence combination, specifically:
and acquiring the potential relationship between any two attributes in the entity record, and acquiring a phrase formed by any two attributes according to the potential relationship. And forming a sentence by the obtained plurality of phrases. And replacing the obtained sentences into a third data set according to the corresponding relation between the sentences and the entity records.
In this embodiment, potential relationship embedding is performed on data set D. Specifically: according to the potential relations among the attributes in element Ai, the different attributes are connected in series into a sentence based on those potential relations.
For example, A1 comprises the attributes (A11, A12, A13). rel(A11, A12) is a phrase indicating the potential relationship between A11 and A12; for example, if A11 and A12 are title and manufacturer, rel(A11, A12) is "made by". Potential relationship embedding: rel(A11, A12) is embedded between the corresponding attributes A11 and A12 of the entity record (A11, A12, A13). The above process is repeated to establish the relationship between A11 and A13, and between A12 and A13, so that all attributes are connected in series by their potential relations and finally combined into one sentence; a plurality of attributes can be connected in series into a sentence of the form "A11 rel(A11, A12) A12 … A1k-1 rel(A1k-1, A1k) A1k". The entity record is then no longer made up of separate attributes, but becomes a sentence that describes the entity.
Latent relationship embedding is performed on data set A and data set B, so that all entity records are converted into corresponding sentences. The sentence converted from an entity record in the first data set is recorded as SAi, and the sentence converted from an entity record in the second data set is recorded as SBj. After the latent relationship embedding, a data set E (the fourth data set) is generated: (SA1SB1, SA1SB2, SA1SB3, …, SA1SBj, SA2SB1, SA2SB2, …, SAiSBj), wherein SAiSBj is a second combination.
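The serialization step above can be sketched as follows; the relation table, attribute names and record values are hypothetical illustrations, not the patent's actual assignments:

```python
# A sketch of "potential relationship embedding": attributes are joined into one
# sentence using preset relation phrases. The RELATIONS table is a hypothetical example.
RELATIONS = {("title", "manufacturer"): "made by", ("manufacturer", "year"): "released in"}

def record_to_sentence(record, attr_order):
    parts = [str(record[attr_order[0]])]
    for prev, curr in zip(attr_order, attr_order[1:]):
        rel = RELATIONS.get((prev, curr), "")
        if rel:
            parts.append(rel)  # embed the relation phrase between the two attributes
        parts.append(str(record[curr]))
    return " ".join(parts)

record = {"title": "iPhone 12", "manufacturer": "Apple", "year": 2020}
print(record_to_sentence(record, ["title", "manufacturer", "year"]))
# "iPhone 12 made by Apple released in 2020"
```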
Further, the potential relations between the attributes are uniformly assigned according to the data sets and the semantics of the attribute columns. In entity matching, the attribute types of the entity records are usually the same; for example, the first entity record may be (Zhang San, male, 18 years old) and the second entity record (Li Si, male, 20 years old). When the potential relations are acquired, the attribute types of the first entity record and the second entity record are known to be the same. Therefore, data can be collected and sorted in the data acquisition stage so that attributes of the same type fall in the same column of the table; when latent relationship embedding is performed, the assignment can then be made per attribute column, which simplifies the process of assigning the potential relations.
In the embodiment, the entities of the data set in the entity matching have many attributes, and if the entities are simply spliced, the potential semantic association among the attributes is lost. By the method for adding the potential relationship, relationship perception can be carried out on each attribute of the entity record, the potential relationship embedding is added, semantic information is enriched, and entity information sentences which are easier to understand by a Bert model are generated.
In step S4, each set of second combinations in the fourth data set is input to a preset Bert model, which converts the input sentences into entity-embedded vectors and compares whether two sentences in each set of second combinations match with each other by the entity-embedded vectors and outputs a matching result.
In this embodiment, for further optimization of the technical solution, the Bert model is specifically an SBert model, and the SBert model includes a first Bert model and a second Bert model that adopt a weight sharing twin neural network; when the second combination is input into the SBert model, the first Bert model and the second Bert model are respectively used for processing two sentences in the second combination and storing entity embedded vectors converted by each sentence.
In this embodiment, when a sentence in a subsequently input second combination has already been processed by the SBert model, the saved entity embedding vector is called for the matching judgment.
In the present embodiment, assuming that the sizes of the two data sets A and B are m and n entity records, respectively, the Cartesian product of the two data sets A and B has a size of m × n, and if input into the Bert model in a pair-wise manner, m × n calculations are required. However, by adopting the SBert model in the invention, each entity record of A and B can be independently input into the Bert model for coding calculation, and only m + n times of calculation is needed, thereby greatly reducing the calculation amount and improving the efficiency of entity matching.
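The m + n saving can be illustrated with a cached encoder; `fake_encode` below is a stand-in placeholder for illustration, not a real SBert encoder:

```python
# Illustration of the m + n encoding saving: each sentence is encoded once and
# cached, so every later pair reuses the stored vector instead of re-running the
# encoder. `fake_encode` is a hypothetical placeholder, not SBert.
cache = {}
calls = 0

def fake_encode(sentence):
    # placeholder encoder: a bag-of-characters vector, for illustration only
    vec = [0.0] * 26
    for ch in sentence.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def encode_cached(sentence):
    global calls
    if sentence not in cache:
        calls += 1  # the encoder runs only on a cache miss
        cache[sentence] = fake_encode(sentence)
    return cache[sentence]

sentences_a = ["iphone made by apple", "galaxy made by samsung"]  # m = 2
sentences_b = ["apple iphone 12", "pixel made by google"]         # n = 2
for sa in sentences_a:        # m * n = 4 candidate pairs ...
    for sb in sentences_b:
        encode_cached(sa), encode_cached(sb)
print(calls)  # ... but only m + n = 4 encoder calls
```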
In this embodiment, the comparing, by using the entity-embedded vector, whether two sentences in each group of second combinations match specifically includes:
calculating the cosine similarity of the entity embedding vectors corresponding to the two sentences in the second combination, and judging whether the value of the cosine similarity is larger than or equal to a preset first threshold; if so, the two sentences in the second combination are determined to be matched, and if not, they are determined to be unmatched.
In this embodiment, SBert uses a twin neural network, i.e., a model formed by a pair of parameter-sharing Bert models, and has higher encoding efficiency. The SBert model is composed of two Bert models; a pair of entity records can be separately input into the Bert models for encoding to generate embedding vectors. Each entity record is encoded, calculated and stored independently, so repeated encoding of entity records is avoided, reducing waste of time and space.
In this embodiment, the encoding process inputs each entity-record sentence into SBert separately and uses SBert to generate an embedding vector for the sentence. An optional implementation is as follows: the embedding is a matrix of dimension [n, 768], where n is the number of words in the input data, and each row of the matrix represents the association information and position information between the word at the current position and all words of the input data. For example, if "I love you" (three words) is the input to Bert, then the first row [1, 768] of the output matrix [3, 768] represents all the association information of the word "I" with respect to all words of the input, i.e., "love" and "you", including semantic association, position information, etc. The generated embedding matrix [n, 768] contains the information of one entity record; the [n, 768] matrix is then passed through an average pooling layer to extract information, reduce the vector size and reduce the computation, and finally the cosine similarity of the output vectors is calculated.
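A minimal sketch of mean pooling over an [n, 768] token matrix followed by cosine similarity; random vectors stand in for real SBert token embeddings:

```python
import math
import random

def mean_pool(token_matrix):
    # collapse an [n, 768] token matrix into a single [768] sentence vector
    n, dim = len(token_matrix), len(token_matrix[0])
    return [sum(row[d] for row in token_matrix) / n for d in range(dim)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

random.seed(0)
tokens_a = [[random.gauss(0, 1) for _ in range(768)] for _ in range(3)]  # 3-word sentence
tokens_b = [[random.gauss(0, 1) for _ in range(768)] for _ in range(5)]  # 5-word sentence
sim = cosine_similarity(mean_pool(tokens_a), mean_pool(tokens_b))
print(sim)  # a value in [-1, 1]
```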
During model training, the model parameters of SBert can be adjusted according to the labels. After multiple rounds of iterative training, once training reaches a certain number of steps the model is evaluated on a validation set; the model is saved as the updated checkpoint whenever its performance improves, and the final model is obtained when training finishes.
After the model is trained, the entity record pairs SAiSBj of the test set are compared by calculating the cosine similarity (range: −1 to 1) of their encoded entity embedding vectors. A plurality of candidate similarity thresholds θ are set; under a given threshold, a pair is judged matched when the cosine similarity is larger than or equal to θ and unmatched when it is smaller than θ. An F1 score is calculated under this criterion, so each similarity threshold θ corresponds to an F1 score, and the similarity threshold θ with the highest F1 score is selected as the optimal similarity threshold (the first threshold).
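Selecting the optimal threshold by F1 score can be sketched as follows, with hypothetical similarity values and gold labels:

```python
# Sweep candidate thresholds theta and keep the one with the highest F1 score.
# The similarity scores and labels below are toy illustrations.
def f1_at_threshold(sims, labels, theta):
    preds = [s >= theta for s in sims]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

sims = [0.95, 0.80, 0.60, 0.30]      # toy cosine similarities
labels = [True, True, False, False]  # toy gold match labels
best_theta = max([0.5, 0.7, 0.9], key=lambda t: f1_at_threshold(sims, labels, t))
print(best_theta)  # 0.7: it separates the two matches from the two non-matches
```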
Using the optimal similarity threshold as the first threshold for judging whether an entity record pair SAiSBj matches, each second combination in data set E (the fourth data set) is then input into the SBert model, encoded to generate vectors, and the vectors are stored. The cosine similarity of the embedding vectors of an entity pair SAiSBj is calculated; when the cosine similarity is larger than or equal to the optimal similarity threshold θ, the pair is judged to be matched, and when it is smaller, unmatched. The corresponding matching result is stored or recorded after each group of second combinations is input, until every group of second combinations has undergone matching judgment.
In this embodiment, the matching method further includes: after the third data set is obtained, performing a blocking operation on the third data set, and removing a negative example in the third data set, wherein the negative example is a first combination of entity records of the first data set and entity records of the second data set which are obviously unmatched.
In this embodiment, the blocking operation on the third data set specifically includes: attribute equality blocking and rule-based blocking;
the attribute equal blocking specifically includes: and judging whether a plurality of attribute values of two entity records in each group of first combination are equal, if the attribute values of the first number are not equal, deleting the first combination, and if the attribute values of the first number are not equal, reserving the first combination, wherein the first number is smaller than the number of the attribute values of the entity records.
The rule-based blocking specifically comprises: and judging whether the attribute values of the two entity records in each group of first combinations simultaneously meet a preset first condition, if so, reserving, and if not, deleting.
In this embodiment, the rule-based blocking is a conditional judgment on the values of one or more entity attributes; if the condition is not met, the record pair is deleted. For example, suppose both records of an entity record pair A1B1 have a year attribute and the rule is set to year ≥ 2000; if the attribute values of the entity record pair (A1.year, B1.year) do not meet the rule, the pair is deleted.
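The two blocking strategies can be sketched as follows; the attribute names, records and rule are illustrative assumptions:

```python
# Attribute-equality blocking keeps a pair only if at least `min_equal` attribute
# values agree; rule-based blocking applies a condition (e.g. year >= 2000) to
# both records of the pair. All values here are hypothetical.
def attribute_equal_block(pair, attrs, min_equal):
    a, b = pair
    equal = sum(a[attr] == b[attr] for attr in attrs)
    return equal >= min_equal  # True = keep the pair, False = delete it

def rule_block(pair, rule):
    a, b = pair
    return rule(a) and rule(b)  # both records must satisfy the condition

pair = ({"title": "iPhone", "year": 2020}, {"title": "iPhone", "year": 1999})
print(attribute_equal_block(pair, ["title", "year"], min_equal=1))  # True: titles agree
print(rule_block(pair, lambda r: r["year"] >= 2000))                # False: 1999 fails
```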
In this embodiment, after the blocking operation is performed on the third data set, the third data set is subjected to a first preprocessing, so that the third data set subjected to the first preprocessing satisfies SBert model input criteria.
The preprocessing specifically comprises steps such as removing null values and normalizing formats. If no preprocessing is performed, null values or non-uniform formats introduce interference and errors into the matching result and affect the model's judgment; therefore the input data is preprocessed to eliminate null values and unify formats, avoiding interference from unnecessary factors.
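A minimal sketch of this preprocessing step (null removal plus whitespace and case normalization), with hypothetical records:

```python
# Drop records containing null values, then normalize whitespace and case so the
# serialized sentences fed to the model are uniform. Records are hypothetical.
def preprocess(records):
    cleaned = []
    for rec in records:
        if any(v is None or v == "" for v in rec.values()):
            continue  # remove records with null values
        cleaned.append({k: " ".join(str(v).split()).lower() for k, v in rec.items()})
    return cleaned

raw = [{"title": "  iPhone   12 ", "year": 2020}, {"title": None, "year": 2019}]
print(preprocess(raw))  # [{'title': 'iphone 12', 'year': '2020'}]
```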
Example 2:
referring to fig. 2, the present invention further discloses an entity matching apparatus, which is applied to the same application scenario as the embodiment, and performs entity matching between data sets, wherein the apparatus includes a first obtaining module 101, a second obtaining module 102, a first processing module 103, and a second processing module 104.
The first obtaining module 101 is configured to obtain a first data set and a second data set that need to be matched, where the first data set and the second data set each include a plurality of entity records, and each entity record includes a plurality of attributes;
the second obtaining module 102 is configured to obtain a cartesian product of the first data set and the second data set to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set;
the first processing module 103 is configured to perform sentence combination on each entity record in the third data set according to a preset potential relationship among a plurality of attributes in the entity record, so as to obtain a fourth data set; the fourth data set comprises a plurality of groups of second combinations, wherein the second combinations are sentences and sentence combinations of corresponding entity records;
the second processing module 104 is configured to input each group of second combinations in the fourth data set to a preset Bert model, where the Bert model converts an input sentence into an entity embedding vector, and compares, by using the entity embedding vector, whether two sentences in each group of second combinations match with each other, and outputs a matching result.
In this embodiment, the matching apparatus further includes a third processing module disposed between the second obtaining module and the first processing module;
and the third processing module is used for performing blocking operation on the third data set and removing a negative example in the third data set, wherein the negative example is a combination of entity records of the first data set and entity records of the second data set which are obviously unmatched.
Since embodiment 2 is written based on embodiment 1, some of the same technical features are not described in embodiment 2.
To sum up, compared with the prior art, the entity matching method and device of the embodiment of the invention have the following beneficial effects:
(1) The entity records in the third data set are replaced with sentences generated according to the potential relations among attributes, so that the data input into the Bert model as second combinations preserves the relations between the attributes, and the matching result for the entity records in the data sets is more accurate.
(2) In the prior art, the second combination is input into the Bert model as a whole when entity matching is performed; if data sets A and B contain m and n entity records respectively, the Cartesian product of A and B has size m × n, and inputting the pairs into the Bert model pair-wise requires m × n calculations. By adopting the SBert model of the invention, each entity record of A and B is independently input into the Bert model for encoding calculation, and when a sentence in a subsequently input second combination has already been processed by the SBert model, the stored entity embedding vector is called for the matching judgment; only m + n encoding calculations are needed, which greatly reduces the computation and improves the efficiency of entity matching.
(3) The input data is preprocessed and blocked, so that the data input to the model is standard and uniform, invalid second combinations are eliminated, and matching efficiency and accuracy are improved. No manual work is involved in the data processing, blocking operation and model matching processes, so the method can be well applied to entity matching across different data sets while saving cost.
The above description covers only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and substitutions without departing from the technical principle of the present invention, and such modifications and substitutions shall also fall within the protection scope of the present invention.

Claims (10)

1. An entity matching method, comprising:
acquiring a first data set and a second data set which need to be matched, wherein the first data set and the second data set respectively comprise a plurality of entity records, and each entity record comprises a plurality of attributes;
acquiring a Cartesian product of a first data set and a second data set to obtain a third data set, wherein the third data set comprises a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set;
combining each entity record in the third data set into a sentence according to preset potential relations among the plurality of attributes in the entity records, to obtain a fourth data set; wherein the fourth data set comprises a plurality of groups of second combinations, and each second combination is a combination of the sentences corresponding to the entity records of a first combination;
inputting each group of second combinations in the fourth data set into a preset Bert model, wherein the Bert model converts the input sentences into entity embedding vectors, compares through the entity embedding vectors whether the two sentences in each group of second combinations match, and outputs a matching result.
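The Cartesian-product step of claim 1 can be sketched as follows; the toy records and attribute names are illustrative assumptions, not taken from the patent.

```python
from itertools import product

# Toy first and second data sets: each entity record is a dict of attributes.
A = [{"id": 1, "title": "iphone 12"}, {"id": 2, "title": "galaxy s21"}]
B = [{"id": "x", "title": "iphone 12 128gb"}]

# Third data set: every first combination (one record from A, one from B),
# i.e. the Cartesian product of the two data sets.
third_data_set = list(product(A, B))
```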
2. The entity matching method of claim 1, further comprising: after the third data set is obtained, performing a blocking operation on the third data set to remove negative examples from it, wherein a negative example is a first combination of an entity record of the first data set and an entity record of the second data set that clearly do not match.
3. The entity matching method according to claim 2, wherein the blocking operation performed on the third data set specifically comprises attribute-equality blocking and rule-based blocking;
the attribute-equality blocking specifically comprises: judging whether the attribute values of the two entity records in each group of first combinations are equal; if at least a first number of attribute values are unequal, deleting the first combination, otherwise retaining it, wherein the first number is smaller than the number of attribute values of an entity record;
the rule-based blocking specifically comprises: judging whether the attribute values of the two entity records in each group of first combinations simultaneously satisfy a preset first condition; if so, retaining the first combination, and if not, deleting it.
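Claim 3's two blocking passes can be sketched as follows. The record layout, attribute names, and the example rule are assumptions for illustration only.

```python
def blocking(first_combinations, first_number, rule):
    """Illustrative sketch of claim 3's blocking operation."""
    kept = []
    for a, b in first_combinations:
        # Attribute-equality blocking: delete if >= first_number values differ.
        unequal = sum(1 for key in a if a.get(key) != b.get(key))
        if unequal >= first_number:
            continue
        # Rule-based blocking: keep only pairs satisfying the preset condition.
        if not rule(a, b):
            continue
        kept.append((a, b))
    return kept

pairs = [
    ({"title": "iphone", "year": "2020"}, {"title": "iphone", "year": "2020"}),
    ({"title": "iphone", "year": "2020"}, {"title": "galaxy", "year": "2019"}),
]
# Example rule (an assumption): the two records must share the same year.
kept = blocking(pairs, first_number=2, rule=lambda a, b: a["year"] == b["year"])
```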
4. The entity matching method according to claim 2, further comprising: after the blocking operation is performed on the third data set, performing a first preprocessing on the third data set so that the preprocessed third data set satisfies the input requirements of the SBert model.
5. The entity matching method according to claim 1, wherein combining each entity record in the third data set into a sentence according to the preset potential relations among the plurality of attributes in the entity record specifically comprises:
acquiring the potential relation between any two attributes in the entity record, and forming a phrase from the two attributes according to that potential relation;
composing the obtained phrases into a sentence;
replacing the corresponding entity records in the third data set with the obtained sentences according to the correspondence between sentences and entity records.
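The phrase-then-sentence composition of claim 5 can be sketched as follows. The templates encoding the preset "potential relation" between attribute pairs, and the attribute names, are hypothetical.

```python
# Hypothetical templates expressing the preset "potential relation" between
# pairs of attributes; names and wording are assumptions, not from the patent.
TEMPLATES = {
    ("title", "brand"): "{title} is made by {brand}",
    ("title", "price"): "{title} costs {price}",
}

def record_to_sentence(record):
    # One phrase per attribute pair that has a template, joined into a sentence.
    phrases = [
        template.format(**record)
        for (attr1, attr2), template in TEMPLATES.items()
        if attr1 in record and attr2 in record
    ]
    return ", ".join(phrases) + "."
```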
6. The entity matching method according to claim 1, wherein the Bert model is specifically an SBert model comprising a first Bert model and a second Bert model that form a twin neural network with shared weights; when a second combination is input into the SBert model, the first Bert model and the second Bert model respectively process the two sentences in the second combination, and the entity embedding vector converted from each sentence is stored.
7. The entity matching method according to claim 6, wherein, when a sentence in a later-input second combination has already been processed by the SBert model, the stored entity embedding vector is retrieved to make the matching judgment.
8. The entity matching method according to claim 1, wherein comparing the two sentences in each group of second combinations through the entity embedding vectors specifically comprises:
calculating the cosine similarity of the entity embedding vectors corresponding to the two sentences in the second combination, and judging whether the cosine similarity is greater than or equal to a preset first threshold; if so, the two sentences in the second combination match, and if not, they do not match.
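The cosine-similarity decision of claim 8 can be sketched as follows; the threshold value 0.85 is illustrative, as the patent does not fix a number.

```python
import numpy as np

def is_match(u, v, threshold=0.85):
    # Cosine similarity of two embedding vectors, compared to a first threshold.
    sim = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return sim >= threshold
```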
9. An entity matching device is characterized by comprising a first acquisition module, a second acquisition module, a first processing module and a second processing module;
the first obtaining module is configured to obtain a first data set and a second data set that need to be matched, where the first data set and the second data set each include a plurality of entity records, and each entity record includes a plurality of attributes;
the second obtaining module is configured to obtain a cartesian product of the first data set and the second data set to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set;
the first processing module is configured to combine each entity record in the third data set into a sentence according to preset potential relations among the plurality of attributes in the entity records, to obtain a fourth data set; the fourth data set comprises a plurality of groups of second combinations, and each second combination is a combination of the sentences corresponding to the entity records of a first combination;
the second processing module is configured to input each group of second combinations in the fourth data set into a preset Bert model, wherein the Bert model converts the input sentences into entity embedding vectors, compares through the entity embedding vectors whether the two sentences in each group of second combinations match, and outputs a matching result.
10. The entity matching device of claim 9, further comprising a third processing module disposed between the second obtaining module and the first processing module;
the third processing module is configured to perform a blocking operation on the third data set and remove negative examples from it, wherein a negative example is a first combination of an entity record of the first data set and an entity record of the second data set that clearly do not match.
CN202110818313.8A 2021-07-20 2021-07-20 Entity matching method and device Active CN113609304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110818313.8A CN113609304B (en) 2021-07-20 2021-07-20 Entity matching method and device

Publications (2)

Publication Number Publication Date
CN113609304A true CN113609304A (en) 2021-11-05
CN113609304B CN113609304B (en) 2023-05-23

Family

ID=78337975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110818313.8A Active CN113609304B (en) 2021-07-20 2021-07-20 Entity matching method and device

Country Status (1)

Country Link
CN (1) CN113609304B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955837A (en) * 2011-12-13 2013-03-06 East China Normal University Analogy retrieval control method based on Chinese word pair relationship similarity
CN106205608A (en) * 2015-05-29 2016-12-07 Microsoft Technology Licensing, LLC Language modeling for speech recognition using knowledge graphs
CN107145523A (en) * 2017-04-12 2017-09-08 Zhejiang University Large-scale heterogeneous knowledge base alignment method based on iterative matching
CN107818081A (en) * 2017-09-25 2018-03-20 Shenyang Aerospace University Sentence similarity assessment method based on deep semantic model and semantic role labeling
US20200227032A1 (en) * 2018-02-24 2020-07-16 Twenty Lane Media, LLC Systems and Methods for Generating and Recognizing Jokes
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method

Also Published As

Publication number Publication date
CN113609304B (en) 2023-05-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant