CN113609304B - Entity matching method and device - Google Patents


Info

Publication number
CN113609304B
Authority
CN
China
Prior art keywords
data set
entity
sentences
combinations
attributes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110818313.8A
Other languages
Chinese (zh)
Other versions
CN113609304A (en)
Inventor
周琥晨
李默涵
张雨成
顾钊铨
韩伟红
唐可可
Current Assignee
Guangzhou University
Original Assignee
Guangzhou University
Priority date
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202110818313.8A priority Critical patent/CN113609304B/en
Publication of CN113609304A publication Critical patent/CN113609304A/en
Application granted granted Critical
Publication of CN113609304B publication Critical patent/CN113609304B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology

Abstract

The invention relates to the technical field of entity matching, and discloses an entity matching method and device, wherein the method comprises the following steps: acquiring a first data set and a second data set, each comprising a plurality of entity records, where each entity record comprises a plurality of attributes; taking the Cartesian product of the first data set and the second data set to obtain a third data set, and combining each entity record in the third data set into a sentence according to preset potential relations among the plurality of attributes in the entity records, to obtain a fourth data set comprising second combinations; and inputting each second combination in the fourth data set into a preset Bert model, wherein the Bert model judges whether the two sentences of the second combination match and outputs a matching result. The beneficial effect is that replacing the entity records of the third data set with sentences generated according to the potential relations among attributes preserves the relations among attributes when the second combination is input into the Bert model, so that the entity record matching results of the data sets are more accurate.

Description

Entity matching method and device
Technical Field
The present invention relates to the field of entity matching technologies, and in particular, to a method and an apparatus for entity matching.
Background
The goal of entity matching is to identify heterogeneous expressions of the same real-world entity in different data sources. Entity matching is an important step in knowledge fusion, but real-world data comes in multiple heterogeneous forms, such as structured data, dirty data, and textual data. These multi-source heterogeneous environments must be given serious consideration and handled with targeted processing methods.
In the task of entity matching, the data to be matched are two data sets A and B, where A and B each comprise a plurality of entity records, each entity record comprises a plurality of attributes of one entity, and A and B have the same attributes. The two data sets A and B come from two different sources, each containing entity records that describe the same real-world entities, and the goal of the entity matching task is to find all matched entity record pairs between the first data set A and the second data set B. For example, each matched pair of entity records consists of two entity records tA and tB from the first data set A and the second data set B respectively, where tA and tB describe the same real-world entity; there may be multiple entity records ti in the first data set corresponding to a record tB of the second data set.
Some entity matching methods exist in the prior art, but they usually match on raw entity records and do not consider the relationships among the attributes within an entity record, so the matching results carry large errors. The existing entity matching methods therefore need to be improved so that the accuracy of entity matching can be raised.
Disclosure of Invention
The purpose of the invention is to provide an entity matching method and device that comprehensively consider the content of the entity records and improve the accuracy of entity matching.
In order to achieve the above object, the present invention provides an entity matching method, including:
and acquiring a first data set and a second data set which are required to be matched, wherein the first data set and the second data set both comprise a plurality of entity records, and each entity record comprises a plurality of attributes.
And obtaining a Cartesian product of the first data set and the second data set to obtain a third data set, wherein the third data set comprises a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
According to the preset potential relation among the plurality of attributes in the entity records, sentence combination is carried out on each entity record in the third data set to obtain a fourth data set; the fourth data set comprises several groups of second combinations, where each second combination is the pair of sentences corresponding to the two entity records of a first combination.
And inputting each group of second combinations in the fourth data set into a preset Bert model, converting the input sentences into entity embedded vectors, comparing whether two sentences in each group of second combinations are matched or not through the entity embedded vectors, and outputting a matching result.
Further, after the third data set is obtained, a blocking operation is performed on the third data set, and negative examples in the third data set are removed, wherein the negative examples are first combinations of entity records of the first data set and entity records of the second data set which are obviously not matched.
Further, the blocking operation is performed on the third data set, and the specific method includes: attribute equality blocking and rule-based blocking;
the attribute equal blocking specifically is: judging whether a plurality of attribute values recorded by two entities in each group of first combinations are equal or not, deleting the first combination if the attribute values with the first number are not equal, and reserving the first combination if the attribute values with the first number are not equal, wherein the first number is smaller than the attribute number of the entity records.
The rule-based blocking is specifically: judging whether the attribute values of the two entity records in each group of first combinations meet a preset first condition at the same time, if so, reserving, and if not, deleting.
Further, after the blocking operation is performed on the third data set, the first preprocessing is performed on the third data set, so that the third data set after the first preprocessing meets the SBert model input standard.
Further, according to a preset potential relationship among a plurality of attributes in the entity records, sentence combination is performed on each entity record in the third dataset, specifically:
acquiring potential relations between any two attributes in the entity record, and acquiring phrases formed by any two attributes according to the potential relations;
forming sentences from the obtained phrases;
and replacing the corresponding entity records in the third data set with the obtained sentences, according to the correspondence between sentences and entity records.
Further, the Bert model is specifically an SBert model, where the SBert model comprises a first Bert model and a second Bert model arranged as a weight-sharing twin neural network; when a second combination is input into the SBert model, the first Bert model and the second Bert model respectively process the two sentences of the second combination, and the entity embedded vector converted from each sentence is stored.
Further, when a sentence in a subsequently input second combination has already been processed by the SBert model, the saved entity embedded vector is retrieved to carry out the matching judgment.
Further, the comparing, by the entity embedding vector, whether two sentences in each group of second combinations match, specifically is:
and calculating cosine similarity of entity embedded vectors corresponding to the two sentences in the second combination, judging whether the value of the cosine similarity is larger than or equal to a preset first threshold value, if so, determining that the two sentences in the first combination are matched, and if not, determining that the two sentences in the first combination are not matched.
The invention also discloses an entity matching device which is characterized by comprising a first acquisition module, a second acquisition module, a first processing module and a second processing module.
The first acquisition module is used for acquiring a first data set and a second data set which are required to be matched, wherein the first data set and the second data set both comprise a plurality of entity records, and each entity record comprises a plurality of attributes.
The second obtaining module is configured to obtain a cartesian product of the first data set and the second data set, and obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
The first processing module is configured to combine each entity record in the third data set into a sentence according to a preset potential relationship among the plurality of attributes in the entity records, so as to obtain a fourth data set; the fourth data set comprises several groups of second combinations, where each second combination is the pair of sentences corresponding to the two entity records of a first combination.
The second processing module is configured to input each set of second combinations in the fourth dataset into a preset Bert model, where the Bert model converts the input sentence into an entity embedded vector, compares whether two sentences in each set of second combinations are matched through the entity embedded vector, and outputs a matching result.
Further, the matching device further comprises a third processing module arranged between the second acquisition module and the first processing module.
The third processing module is configured to perform blocking operation on the third data set, and remove negative examples in the third data set, where the negative examples are a combination of entity records of the first data set and entity records of the second data set that are obviously mismatched.
Compared with the prior art, the entity matching method and device provided by the embodiment of the invention have the beneficial effect that replacing the entity records in the third data set with sentences generated according to the potential relations among attributes preserves the relations among attributes when the second combination is input into the Bert model, so that the entity record matching results of the data sets are more accurate.
Drawings
FIG. 1 is a flow chart of an entity matching method of the present invention;
fig. 2 is a schematic structural diagram of an entity matching device according to the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
Example 1:
referring to fig. 1, the invention discloses an entity matching method, which is applied to entity matching among different data sets, and comprises the following main steps:
step S1, a first data set and a second data set which are required to be matched are obtained, wherein the first data set and the second data set both comprise a plurality of entity records, and each entity record comprises a plurality of attributes.
Step S2, obtaining a Cartesian product of the first data set and the second data set to obtain a third data set, wherein the third data set comprises a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set.
Step S3, according to the preset potential relation among the plurality of attributes in the entity records, sentence combination is carried out on each entity record in the third data set to obtain a fourth data set; the fourth data set comprises several groups of second combinations, where each second combination is the pair of sentences corresponding to the two entity records of a first combination.
And S4, inputting each group of second combinations in the fourth data set into a preset Bert model, converting the input sentences into entity embedded vectors by the Bert model, comparing whether the two sentences in each group of second combinations are matched through the entity embedded vectors, and outputting a matching result.
In step S1, a first data set and a second data set to be matched are acquired, wherein the first data set and the second data set each comprise a plurality of entity records, and each entity record comprises a plurality of attributes. For the sake of clarity, the description is illustrated using mathematical language as follows:
A data set A = (A_1, A_2, A_3, …, A_i) and a data set B = (B_1, B_2, B_3, …, B_j) are acquired, where an element A_i of data set A is an entity record (A_i1, A_i2, A_i3, …, A_ik); for example, A_1 comprises the attributes (A_11, A_12, A_13). An element B_j of data set B is an entity record (B_j1, B_j2, B_j3, …, B_jl); for example, B_1 comprises the attributes (B_11, B_12, B_13). The values i, j, k, l are natural numbers greater than zero.
In step S2, a cartesian product of the first data set and the second data set is obtained, resulting in a third data set comprising several sets of first combinations of entity records of the first data set and entity records of the second data set.
The meaning of a Cartesian product is known to the person skilled in the art. Specifically, the Cartesian product of two sets X and Y, also known as the direct product and written X × Y, is the set of all ordered pairs whose first component is a member of X and whose second component is a member of Y. For example, if set A = {a, b} and set B = {0, 1, 2}, the Cartesian product of the two sets is {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}.
In this embodiment, the Cartesian product of data set A and data set B is taken to obtain data set D = (A_1B_1, A_1B_2, A_1B_3, …, A_1B_j, A_2B_1, A_2B_2, …, A_iB_j), where A_1B_1 represents the combination of the first entity record in the first data set with the first entity record in the second data set, and the meaning of the other combinations follows by analogy. Such a combination is termed a first combination, so data set D comprises several first combinations.
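As a minimal sketch of this step (with hypothetical toy records, not data from the patent), building data set D as the Cartesian product of candidate pairs can be written as:

```python
from itertools import product

# Toy entity records: each record is a tuple of attribute values (hypothetical).
dataset_a = [("iPhone 12", "Apple"), ("Galaxy S21", "Samsung")]
dataset_b = [("Apple iPhone 12", "Apple Inc."), ("Pixel 5", "Google")]

# Data set D is the Cartesian product A x B: every record of A paired with
# every record of B, giving |A| * |B| first combinations.
dataset_d = list(product(dataset_a, dataset_b))

print(len(dataset_d))  # 2 * 2 = 4 first combinations
```

Each element of `dataset_d` is one first combination (A_iB_j); blocking then prunes this list before any model is involved.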
In step S3, sentence combination is performed on each entity record in the third data set according to a preset potential relationship among the plurality of attributes in the entity records, so as to obtain a fourth data set; the fourth data set comprises several groups of second combinations, where each second combination is the pair of sentences corresponding to the two entity records of a first combination.
In this embodiment, sentence combination is performed on each entity record in the third dataset according to a preset potential relationship among a plurality of attributes in the entity record, which specifically includes:
and acquiring potential relations between any two attributes in the entity record, and acquiring phrases formed by any two attributes according to the potential relations. And forming the plurality of obtained phrases into sentences. And replacing the obtained sentences into a third data set according to the corresponding relation between the sentences and the entity records.
In this embodiment, embedding the potential relationships into data set D is specifically: according to the potential relations among the attributes of an element A_i, the different attributes are connected in series on the basis of those relations to form a sentence.
For example, A_1 comprises the attributes (A_11, A_12, A_13), and rel(A_11, A_12) is a phrase expressing the potential relationship between A_11 and A_12; e.g., if A_11 and A_12 are a title and a manufacturer, rel(A_11, A_12) may be a phrase such as "made by". Potential relationship embedding: rel(A_11, A_12) is embedded between the corresponding attributes A_11 and A_12 of the entity record (A_11, A_12, A_13); the process is repeated to establish the relation between A_11 and A_13 and the relation between A_12 and A_13, so that all attributes are connected in series by their potential relations and finally combined into one sentence of the form "A_11 rel(A_11, A_12) A_12 … A_1k-1 rel(A_1k-1, A_1k) A_1k". The entity record is then no longer composed of discrete attributes but becomes a sentence describing the entity.
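The serialization just described can be sketched as follows; the attribute names, the rel() phrase table, and the `serialize` helper are all illustrative assumptions, not part of the patented method:

```python
# Hypothetical relation phrases between adjacent attribute columns;
# rel(title, manufacturer) = "is made by" is an invented example.
REL = {("title", "manufacturer"): "is made by",
       ("manufacturer", "price"): "which sells it for"}

def serialize(record: dict, columns: list) -> str:
    """Chain attribute values together with their latent-relation phrases,
    turning an entity record into one descriptive sentence."""
    parts = [str(record[columns[0]])]
    for prev, cur in zip(columns, columns[1:]):
        parts.append(REL.get((prev, cur), ""))
        parts.append(str(record[cur]))
    return " ".join(p for p in parts if p)

record = {"title": "iPhone 12", "manufacturer": "Apple", "price": "$799"}
sentence = serialize(record, ["title", "manufacturer", "price"])
print(sentence)  # "iPhone 12 is made by Apple which sells it for $799"
```

The output sentence plays the role of S_Ai: one record, all attributes, with the latent relations made explicit for the language model.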
Potential relationships are embedded into data set A and data set B so that all entity records are converted into corresponding sentences: an entity record in the first data set is converted into a sentence denoted S_Ai, and an entity record in the second data set is converted into a sentence denoted S_Bj. After the potential relationships are embedded, a data set E (the fourth data set) is generated: (S_A1S_B1, S_A1S_B2, S_A1S_B3, …, S_A1S_Bj, S_A2S_B1, S_A2S_B2, …, S_AiS_Bj), where each S_AiS_Bj is a second combination.
Further, the potential relationships between attributes are assigned according to the semantic uniformity of the data set and its attribute columns. In entity matching, the attribute types of the entity records are often the same; for example, a first entity record may be (Zhang San, male, eighteen years old) and a second entity record (Li Si, male, twenty years old). When acquiring the potential relationships, it is known that the corresponding attributes of the first and second entity records have the same types. Therefore, the data can be collected and sorted in the data acquisition stage so that attributes of the same category lie in the same column of the table, and the potential relationships can then be assigned per attribute column during embedding, which simplifies the process of assigning potential relationships.
In this embodiment, the entities of the data sets in entity matching possess many attributes, and if the attributes are simply concatenated, the potential semantic associations among them are lost. By the method of adding potential relations, each attribute of the entity record is made relation-aware; the embedded potential relations enrich the semantic information and generate entity-information sentences that the Bert model understands more easily.
In step S4, each set of second combinations in the fourth dataset is input to a preset Bert model that converts the input sentence into an entity embedded vector and compares whether two sentences in each set of second combinations match by the entity embedded vector and outputs a matching result.
In this embodiment, for further optimization of the technical solution, the Bert model is specifically an SBert model, where the SBert model includes a first Bert model and a second Bert model that adopt weight sharing twin neural networks; when the second combination is input to the SBert model, the first Bert model and the second Bert model are respectively used for processing two sentences in the second combination, and storing entity embedded vectors converted by each sentence.
In this embodiment, when a sentence in a subsequently input second combination has already been processed by the SBert model, the saved entity embedded vector is retrieved for the matching judgment.
In this embodiment, assuming the two data sets A and B contain m and n entity records respectively, the size of their Cartesian product is m × n, and if the record pairs were input into the Bert model pairwise, m × n encoding computations would be required. With the SBert model of the invention, however, each entity record of A and B can be input into a Bert model independently for encoding, so only m + n encoding computations are needed, which greatly reduces the amount of calculation and improves entity matching efficiency.
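The m + n versus m × n saving can be illustrated with a toy sketch, where `encode` is a hypothetical stand-in for one Bert forward pass and the record sentences are invented:

```python
# Count how many "forward passes" are spent when each record is encoded once
# and the resulting vectors are cached and reused across all comparisons.
calls = 0

def encode(sentence):
    """Stand-in for one Bert encoding pass (returns a placeholder value)."""
    global calls
    calls += 1
    return hash(sentence)  # placeholder for a real embedded vector

sentences_a = [f"record a{i}" for i in range(3)]   # m = 3
sentences_b = [f"record b{j}" for j in range(4)]   # n = 4

cache = {s: encode(s) for s in sentences_a + sentences_b}  # m + n = 7 calls

# All m * n = 12 comparisons now reuse cached vectors: no further encoding.
pairs = [(cache[sa], cache[sb]) for sa in sentences_a for sb in sentences_b]
print(calls, len(pairs))
```

Pair-wise input would instead have cost m × n = 12 encoding passes; the gap widens rapidly as the data sets grow.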
In this embodiment, the comparing, by the entity embedding vector, whether two sentences in each set of second combinations match, specifically is:
and calculating cosine similarity of entity embedded vectors corresponding to the two sentences in the second combination, judging whether the value of the cosine similarity is larger than or equal to a preset first threshold value, if so, determining that the two sentences in the first combination are matched, and if not, determining that the two sentences in the first combination are not matched.
In this embodiment, SBert uses a twin neural network model, i.e., a pair of parameter-sharing Bert models, which gives higher encoding efficiency. The SBert model is composed of two Bert models; the two records of an entity record pair are input into the Bert models respectively and encoded into embedded vectors, and each entity record is encoded, calculated, and stored independently, so that entity record encodings need not be recomputed, reducing waste of time and space.
In this embodiment, the encoding process inputs each entity-record sentence separately into the SBert model, which generates an embedded vector for the sentence. In an alternative implementation, the matrix dimension of the embedded vector is [n, 768], where n is the number of words in the input data, and each row of the matrix represents the combination of the positional information of the word at that position with its association information with all words of the input. For example, for the input "I love you" fed to Bert, the first row of the output matrix [3, 768] is a [1, 768] vector representing the word "I" together with its semantic associations and positional information with respect to all words of the input, i.e., both "love" and "you". The generated [n, 768] embedded vector thus contains the information of one entity record. An average pooling layer then extracts information from the [n, 768] vector, reducing its size and the amount of computation, and finally the cosine similarity of the vectors output by this operation is calculated.
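A minimal illustration of average pooling followed by cosine similarity, using tiny 4-dimensional toy token vectors in place of real [n, 768] Bert outputs:

```python
import math

def mean_pool(token_vectors):
    """Average-pool an [n, d] list of token vectors into one sentence vector."""
    n, d = len(token_vectors), len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(d)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (range -1 to 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-token x 4-dim "embeddings" standing in for real Bert output matrices.
emb_a = mean_pool([[1, 0, 0, 1], [0, 1, 0, 1], [1, 1, 0, 0]])
emb_b = mean_pool([[1, 0, 0, 1], [0, 1, 1, 1], [1, 0, 0, 0]])

similarity = cosine(emb_a, emb_b)
print(round(similarity, 3))
```

The resulting scalar is then compared against the first threshold to decide whether the sentence pair matches.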
During model training, the parameters of the SBert model can be adjusted according to the labels. After several rounds of iterative training, the model is evaluated on a validation set once training reaches a certain number of steps; the updated model is saved whenever its performance improves, and the final model is obtained when training finishes.
After training the model, the cosine similarity (ranging from -1 to 1) of the encoded entity embedded vectors is calculated for each entity record pair S_AiS_Bj of the test set. Several candidate similarity thresholds θ are set; a pair is judged matched when its cosine similarity is greater than or equal to θ and unmatched when it is smaller. An F1 score is calculated under this criterion for each candidate threshold θ, and the θ with the highest F1 score is selected as the optimal similarity threshold (the first threshold).
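The threshold-selection step can be sketched as below; the similarity values, gold labels, and candidate thresholds are hypothetical:

```python
def f1_score(preds, labels):
    """F1 = harmonic mean of precision and recall over boolean predictions."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical test-set cosine similarities and gold match labels.
sims = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [True, True, False, False, False]

# Sweep candidate thresholds theta; keep the one with the best F1.
best_theta = max([0.3, 0.5, 0.7, 0.9],
                 key=lambda t: f1_score([s >= t for s in sims], labels))
print(best_theta)
```

Here θ = 0.7 separates the matched pairs perfectly, so it would be chosen as the first threshold.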
The optimal similarity threshold is taken as the first threshold for judging whether an entity record pair S_AiS_Bj in data set E (the fourth data set) matches. Each second combination in data set E is then input into the SBert model, encoded into vectors, and stored; the cosine similarity of the embedded vectors of the entity record pair S_AiS_Bj is calculated, and the pair is judged matched when the similarity is greater than or equal to the optimal similarity threshold θ and unmatched when it is smaller. After each group of second combinations is input, the corresponding matching result is stored or recorded until every group of second combinations has undergone the matching judgment.
In this embodiment, the matching method further includes: after the third data set is obtained, a blocking operation is performed on the third data set, and negative examples in the third data set are removed, wherein the negative examples are first combinations of entity records of the first data set and entity records of the second data set which are obviously not matched.
In this embodiment, the blocking operation is performed on the third data set, and the specific method includes: attribute equality blocking and rule-based blocking;
the attribute equal blocking specifically is: judging whether a plurality of attribute values recorded by two entities in each group of first combinations are equal or not, deleting the first combination if the attribute values with the first number are not equal, and reserving the first combination if the attribute values with the first number are not equal, wherein the first number is smaller than the attribute number of the entity records.
The rule-based blocking is specifically: judging whether the attribute values of the two entity records in each group of first combinations meet a preset first condition at the same time, if so, reserving, and if not, deleting.
In this embodiment, rule-based blocking makes a conditional judgment on the values of one or more entity attributes and deletes a record pair if the condition is not met. For example, if both records of an entity record pair A_1B_1 have the attribute year and the rule year >= 2000 is set, the pair is deleted when its attribute values (A_1.year, B_1.year) do not satisfy the rule.
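Both blocking strategies can be sketched together; the records, the minimum equal-attribute count, and the year >= 2000 rule mirror the example above but use invented data:

```python
def attribute_equality_block(pair, min_equal=1):
    """Keep a candidate pair only if at least `min_equal` attribute values
    coincide (min_equal is smaller than the attribute count)."""
    a, b = pair
    return sum(x == y for x, y in zip(a, b)) >= min_equal

def rule_block(pair, rule):
    """Keep a pair only if both records satisfy the rule simultaneously."""
    a, b = pair
    return rule(a) and rule(b)

# Toy first combinations: (title, year) records from two sources.
pairs = [(("iPhone 12", 2020), ("iPhone 12", 2020)),
         (("iPhone 12", 2020), ("Walkman", 1999))]

kept = [p for p in pairs
        if attribute_equality_block(p) and rule_block(p, lambda r: r[1] >= 2000)]
print(len(kept))  # only the first pair survives both filters
```

Obvious negative examples are discarded cheaply here, so the expensive model comparison runs on far fewer pairs.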
In this embodiment, after the blocking operation is performed on the third data set, the first preprocessing is performed on the third data set, so that the first preprocessed third data set meets the SBert model input standard.
The preprocessing specifically comprises clearing null values, unifying formats, and the like. Without preprocessing, null values or inconsistent formats would interfere with the matching result and mislead the model's judgment; therefore, the input data is preprocessed to clear null values and unify formats, avoiding interference with the model's judgment from unnecessary factors.
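A minimal sketch of this first preprocessing (null clearing and format unification); the specific normalizations chosen here — whitespace collapsing and lowercasing — are assumptions, not prescribed by the patent:

```python
def preprocess(record: dict) -> dict:
    """Drop null values and unify string formats before model input."""
    cleaned = {}
    for key, value in record.items():
        if value is None or str(value).strip() == "":
            continue  # clear null / empty values
        # Unify format: collapse internal whitespace and lowercase the text.
        cleaned[key] = " ".join(str(value).split()).lower()
    return cleaned

raw = {"title": "  iPhone   12 ", "manufacturer": None, "price": "$799"}
print(preprocess(raw))
```

Real pipelines would tailor this per attribute (dates, units, casing of proper nouns), but the shape of the step is the same.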
Example 2:
referring to fig. 2, the invention also discloses an entity matching device, which is applied to the same application scenario as the embodiment, and is used for performing entity matching between data sets, and comprises a first acquisition module 101, a second acquisition module 102, a first processing module 103 and a second processing module 104.
The first obtaining module 101 is configured to obtain a first data set and a second data set that need to be matched, where the first data set and the second data set each include a plurality of entity records, and each entity record includes a plurality of attributes;
the second obtaining module 102 is configured to obtain a cartesian product of the first data set and the second data set, to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set;
the first processing module 103 is configured to combine sentences of each entity record in the third dataset according to a preset potential relationship among a plurality of attributes in the entity records, so as to obtain a fourth dataset; the fourth data set comprises a plurality of groups of second combinations, wherein the second combinations are the sentences and the combinations of the sentences recorded by the corresponding entities;
the second processing module 104 is configured to input each set of second combinations in the fourth dataset into a preset Bert model, where the Bert model converts the input sentence into an entity embedded vector, compares whether two sentences in each set of second combinations match with each other through the entity embedded vector, and outputs a matching result.
In this embodiment, the matching device further includes a third processing module disposed between the second acquisition module and the first processing module;
the third processing module is configured to perform blocking operation on the third data set, and remove negative examples in the third data set, where the negative examples are a combination of entity records of the first data set and entity records of the second data set that are obviously mismatched.
Since embodiment 2 is written on the basis of embodiment 1, some of the same technical features are not described in embodiment 2.
In summary, compared with the prior art, the entity matching method and device provided by the embodiment of the invention have the beneficial effects that:
(1) Replacing the entity records in the third data set with sentences generated according to the potential relations among attributes preserves the relations among attributes when the second combination is input into the Bert model, so that the entity record matching results of the data sets are more accurate.
(2) In the prior art, each second combination is input into a Bert model as a whole when entity matching is performed; assuming the two data sets A and B contain m and n entity records respectively, the size of their Cartesian product is m × n, and inputting the record pairs into the Bert model pairwise would require m × n encoding computations. With the SBert model of the invention, however, each entity record of A and B can be input into a Bert model independently for encoding, and when a sentence in a subsequently input second combination has already been processed by the SBert model, the saved entity embedded vector is retrieved for the matching judgment, so only m + n encoding computations are needed, which greatly reduces the amount of calculation and improves entity matching efficiency.
(3) The input data are preprocessed and blocked to unify the data specification of the model input and to eliminate invalid second combinations, which improves matching efficiency and accuracy. No manual intervention is involved in the data processing, blocking, and model matching procedures, so the method can be readily applied to entity matching on different data sets, saving cost.
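The attribute-to-sentence serialization described in benefit (1) can be sketched as follows. The attribute names, relation words, and the example record are illustrative assumptions, not taken from the patent itself.

```python
# Sketch: serialize an entity record into a sentence by joining attribute
# pairs with assumed latent-relation words, then joining the phrases.
RELATIONS = {
    ("title", "author"): "is written by",   # assumed relation word
    ("title", "year"): "is published in",   # assumed relation word
}

def record_to_sentence(record: dict) -> str:
    """Build phrases from attribute pairs, then combine them into one sentence."""
    phrases = []
    for (a, b), rel in RELATIONS.items():
        if a in record and b in record:
            phrases.append(f"{record[a]} {rel} {record[b]}")
    return ", and ".join(phrases) + "."

record = {"title": "Deep Matching", "author": "Li", "year": "2020"}
print(record_to_sentence(record))
# → Deep Matching is written by Li, and Deep Matching is published in 2020.
```

The resulting sentence, rather than the raw attribute tuple, is what would be fed to the language model, so the relation words carry the inter-attribute structure into the encoding.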
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and substitutions can be made by those skilled in the art without departing from the technical principles of the present invention, and these modifications and substitutions should also be considered as being within the scope of the present invention.
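The m+n versus m×n encoding argument of benefit (2) can be illustrated with a minimal bi-encoder sketch. A toy letter-count encoder stands in for the real SBert model, and the sentences and threshold are assumptions made purely for illustration.

```python
import math

# Sketch of the bi-encoder caching idea: every sentence is encoded at most
# once and cached, so matching A (m records) against B (n records) needs at
# most m+n encodings instead of m*n pair encodings.
def toy_encode(sentence: str) -> list:
    """Toy stand-in for the SBert encoder: a 26-dim letter-count vector."""
    vec = [0.0] * 26
    for ch in sentence.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

cache = {}
encodings = 0

def embed(sentence):
    global encodings
    if sentence not in cache:        # encode each distinct sentence only once
        encodings += 1
        cache[sentence] = toy_encode(sentence)
    return cache[sentence]

A = ["alpha record one", "beta record two"]        # m = 2
B = ["alpha record one", "gamma record three"]     # n = 2
threshold = 0.95
matches = [(a, b) for a in A for b in B
           if cosine(embed(a), embed(b)) >= threshold]
print(encodings)  # 3: each distinct sentence encoded once, not m*n = 4 pairs
```

Four candidate pairs are compared, but only three encoder calls are made because one sentence appears in both data sets; with a cross-encoder, every pair would require its own forward pass.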

Claims (9)

1. An entity matching method, comprising:
acquiring a first data set and a second data set which are required to be matched, wherein the first data set and the second data set both comprise a plurality of entity records, and each entity record comprises a plurality of attributes;
obtaining a Cartesian product of the first data set and the second data set to obtain a third data set, wherein the third data set comprises a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set;
performing sentence combination on each entity record in the third data set according to a preset latent relationship among the plurality of attributes in the entity records, to obtain a fourth data set; wherein the fourth data set comprises a plurality of groups of second combinations, and each second combination is a combination of the sentences of the corresponding entity records;
inputting each group of second combinations in the fourth data set into a preset Bert model, converting the input sentences into entity embedded vectors by the Bert model, comparing whether two sentences in each group of second combinations are matched or not through the entity embedded vectors, and outputting a matching result;
wherein the sentence combination of each entity record in the third data set according to the preset latent relationship among the plurality of attributes specifically comprises:
acquiring the latent relationship between any two attributes in the entity record, and obtaining a phrase formed by the two attributes according to the latent relationship; wherein the latent relationships among the attributes are assigned according to the data set and the unified semantics of each attribute column;
forming sentences from the obtained phrases;
and substituting the obtained sentences into the third data set according to the correspondence between the sentences and the entity records.
2. The entity matching method according to claim 1, further comprising: after the third data set is obtained, performing a blocking operation on the third data set and removing negative examples from it, wherein a negative example is a first combination of an entity record of the first data set and an entity record of the second data set that clearly do not match.
3. The entity matching method according to claim 2, wherein the blocking operation on the third data set specifically comprises attribute-equality blocking and rule-based blocking;
the attribute-equality blocking is specifically: judging whether the attribute values of the two entity records in each group of first combinations are equal; if at least a first number of attribute values are unequal, deleting the first combination, and otherwise retaining the first combination, wherein the first number is smaller than the number of attributes of the entity records;
the rule-based blocking is specifically: judging whether the attribute values of the two entity records in each group of first combinations simultaneously satisfy a preset first condition; if so, retaining the first combination, and if not, deleting it.
4. The entity matching method according to claim 2, further comprising: after the blocking operation is performed on the third data set, performing a first preprocessing on the third data set so that the preprocessed third data set meets the SBert model input specification.
5. The entity matching method according to claim 1, wherein the Bert model is specifically an SBert model, and the SBert model comprises a first Bert model and a second Bert model that form a weight-sharing twin (Siamese) neural network; when a second combination is input into the SBert model, the first Bert model and the second Bert model respectively process the two sentences in the second combination and store the entity-embedding vector converted from each sentence.
6. The entity matching method according to claim 5, wherein, when a sentence in a subsequently input second combination has already been processed by the SBert model, the stored entity-embedding vector is retrieved for the matching decision.
7. The entity matching method according to claim 1, wherein comparing, through the entity-embedding vectors, whether the two sentences in each group of second combinations match specifically comprises:
calculating the cosine similarity of the entity-embedding vectors corresponding to the two sentences in the second combination, and judging whether the cosine similarity is greater than or equal to a preset first threshold; if so, determining that the two sentences in the second combination match, and if not, determining that they do not match.
8. The entity matching device is characterized by comprising a first acquisition module, a second acquisition module, a first processing module and a second processing module;
the first acquisition module is used for acquiring a first data set and a second data set which are required to be matched, wherein the first data set and the second data set both comprise a plurality of entity records, and each entity record comprises a plurality of attributes;
the second acquisition module is configured to acquire a cartesian product of the first data set and the second data set to obtain a third data set, where the third data set includes a plurality of groups of first combinations, and the first combinations are combinations of entity records of the first data set and entity records of the second data set;
the first processing module is configured to perform sentence combination on each entity record in the third data set according to a preset latent relationship among the plurality of attributes in the entity records, to obtain a fourth data set; wherein the fourth data set comprises a plurality of groups of second combinations, and each second combination is a combination of the sentences of the corresponding entity records;
the second processing module is configured to input each group of second combinations in the fourth data set into a preset Bert model, where the Bert model converts each input sentence into an entity-embedding vector, compares the two sentences in each group of second combinations through their entity-embedding vectors to determine whether they match, and outputs a matching result;
wherein the sentence combination of each entity record in the third data set according to the preset latent relationship among the plurality of attributes specifically comprises:
acquiring the latent relationship between any two attributes in the entity record, and obtaining a phrase formed by the two attributes according to the latent relationship; wherein the latent relationships among the attributes are assigned according to the data set and the unified semantics of each attribute column;
forming sentences from the obtained phrases;
and substituting the obtained sentences into the third data set according to the correspondence between the sentences and the entity records.
9. The entity matching device of claim 8, further comprising a third processing module disposed between the second acquisition module and the first processing module;
the third processing module is configured to perform a blocking operation on the third data set and remove negative examples from it, wherein a negative example is a combination of an entity record of the first data set and an entity record of the second data set that clearly do not match.
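The attribute-equality blocking of claim 3 can be sketched as follows; the records, attribute names, and threshold (`first_number`) are illustrative assumptions, not values fixed by the patent.

```python
# Sketch of attribute-equality blocking: a first combination is deleted
# when at least `first_number` attribute values differ between the two
# entity records; otherwise it is retained.
def attribute_equality_block(pairs, first_number):
    kept = []
    for rec_a, rec_b in pairs:
        unequal = sum(1 for k in rec_a if rec_a[k] != rec_b.get(k))
        if unequal < first_number:   # fewer mismatches than the threshold: keep
            kept.append((rec_a, rec_b))
    return kept

pairs = [
    ({"name": "acme", "city": "gz"}, {"name": "acme", "city": "gz"}),  # identical
    ({"name": "acme", "city": "gz"}, {"name": "beta", "city": "sz"}),  # 2 mismatches
]
print(len(attribute_equality_block(pairs, first_number=2)))  # → 1
```

Only the identical pair survives the block, so only it would go on to the more expensive model-based comparison.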
CN202110818313.8A 2021-07-20 2021-07-20 Entity matching method and device Active CN113609304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110818313.8A CN113609304B (en) 2021-07-20 2021-07-20 Entity matching method and device

Publications (2)

Publication Number Publication Date
CN113609304A CN113609304A (en) 2021-11-05
CN113609304B true CN113609304B (en) 2023-05-23

Family

ID=78337975

Country Status (1)

Country Link
CN (1) CN113609304B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955837A (en) * 2011-12-13 2013-03-06 华东师范大学 Analogy retrieval control method based on Chinese word pair relationship similarity
CN106205608A (en) * 2015-05-29 2016-12-07 微软技术许可有限责任公司 Utilize the Language Modeling for speech recognition of knowledge graph
CN107145523A (en) * 2017-04-12 2017-09-08 浙江大学 Large-scale Heterogeneous Knowledge storehouse alignment schemes based on Iterative matching
CN107818081A (en) * 2017-09-25 2018-03-20 沈阳航空航天大学 Sentence similarity appraisal procedure based on deep semantic model and semantic character labeling
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10878817B2 (en) * 2018-02-24 2020-12-29 Twenty Lane Media, LLC Systems and methods for generating comedy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant