CN114781471B - Entity record matching method and system - Google Patents
- Publication number
- CN114781471B (application number CN202110614418.1A)
- Authority
- CN
- China
- Prior art keywords
- entity
- attribute
- entity record
- matching
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an entity record matching method and system. The method comprises: acquiring an entity record set to be matched, wherein the entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm. The invention converts entities into attribute value vectors through the neural network, overcomes the poor interpretability of deep learning by means of an automatically constructed key attribute tree, and allows the learned key attribute tree to be converted into matching rules that can be applied to other data sets. In addition, training the model requires only a small number of labeled entity record pairs, which overcomes the need of existing methods for a large number of labeled entity record pairs.
Description
Technical Field
The invention relates to the technical field of entity matching, in particular to an entity record matching method and system.
Background
With the popularization and development of the Internet, the volume of data stored on the open-domain Internet is increasing rapidly, and real-world entities are stored, in the form of entity records, in the databases that make up the Internet. However, these stored entity records contain a large amount of redundancy, and the same entity may be stored as multiple entity records. Taking an e-commerce platform as an example, the same product may be sold by different merchants, so that the same product is registered as different commodity records. Here, the product corresponds to a real-world entity, and each registered commodity record corresponds to an entity record.
To identify the entity records of the same entity, it is necessary to determine whether two given entity records refer to the same entity. This task is abstracted as the entity record matching task, also called the entity matching task. The input of the task is two entity records represented as attribute-attribute value pairs, and the output is a matching result indicating whether the two records refer to the same entity.
Existing methods for the entity matching task fall into two categories. The first category uses data feature engineering to construct matching features manually and matches records of the same entity based on rules. The second category treats the attribute values in entity records as natural language, represents the entity records as semantic vectors using a neural network, and trains a binary classifier on the vectorized representations and labeled entity record pairs. However, both categories require manual intervention, so the algorithm cannot be fully automated, and the resulting feature engineering cannot be migrated to a new data set, which means that existing methods generalize poorly to new data sets.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an entity record matching method and system.
The invention provides an entity record matching method, which comprises the following steps:
acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities;
and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
According to the entity record matching method provided by the invention, the trained entity record matching model is obtained through the following steps:
constructing a first training sample set according to the sample entity record;
training a neural network on the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, wherein the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
constructing a second training sample set according to the sample entity record representation vector;
training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, wherein the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;
and constructing a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model.
According to the entity record matching method provided by the invention, training the neural network on the first training sample set based on the self-supervised learning method to obtain the heterogeneous information fusion model comprises the following steps:
after the sample entity records are input into the neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;
converting the attribute value matrix into an attribute feature vector through an attention mechanism, based on an intra-attribute attention layer of the neural network;
based on an inter-attribute attention layer of the neural network, obtaining, through an attention mechanism and according to the attribute feature vector, a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;
and performing iterative training on the multilayer perceptron of the neural network output layer through the second attribute value vector representation until a preset training condition is met, obtaining a heterogeneous information fusion model, and outputting a sample entity record representation vector.
According to the entity record matching method provided by the invention, the calculation formula of the attribute feature vector is as follows:
[Formula not reproduced in this text.]
wherein the feature matrix of an entity record consists of m attribute feature vectors; $I[j]$ denotes the feature vector corresponding to the j-th attribute; two sets of trainable parameters are associated with the j-th attribute; each word in the attribute value of the j-th attribute has its own attention weight; the number of words is that of the attribute value corresponding to the j-th attribute; each word has a compressed feature vector; and $d_{emb}$ denotes the dimension of the word vectors;
the calculation formula represented by the second attribute value vector is as follows:
[Formula not reproduced in this text.]
wherein the two parameter sets denote learned parameters; $D[j]$, the first attribute value vector representation, denotes the feature vector corresponding to the j-th attribute after information compression; $C[j]$, the context vector representation, denotes the feature vector of the context corresponding to the j-th attribute; $\alpha_{j,i}$ denotes the attention coefficient; $v_i$ denotes the i-th attribute value in an entity record; the key vector denotes the key representation of the i-th attribute value; and the query vector denotes the query for the j-th attribute value.
According to the entity record matching method provided by the invention, the key attribute tree model consists of leaf nodes and internal nodes, wherein the internal nodes comprise a key attribute, a vector distance function and a distance threshold; and the leaf node records an entity matching result.
According to an entity record matching method provided by the present invention, before the constructing a first training sample set according to sample entity records, the method further includes:
and masking attribute values in the sample entity records according to a preset masking condition, so as to construct the first training sample set from the masked sample entity records.
The invention also provides an entity record matching system, comprising:
an entity acquisition module, configured to acquire an entity record set to be matched, wherein the entity records in the entity record set consist of attributes and attribute values of entities;
and an entity matching module, configured to input the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
According to an entity record matching system provided by the invention, the system further comprises:
the first sample set construction module is used for constructing a first training sample set according to the sample entity record;
the heterogeneous information fusion training module is used for training the neural network on the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, and the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
the second sample set construction module is used for constructing a second training sample set according to the sample entity record representation vector;
the key attribute tree training module is used for training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, and the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;
and the model construction module is used for constructing and obtaining a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the entity record matching method as described in any one of the above.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the entity record matching method as described in any of the above.
According to the entity record matching method and system provided by the invention, entities are converted into attribute value vectors through the neural network, the poor interpretability of deep learning is overcome by means of the automatically constructed key attribute tree, and the learned key attribute tree can be converted into matching rules that can be applied to other data sets. In addition, training the model requires only a small number of labeled entity record pairs, which overcomes the need of existing methods for a large number of labeled entity record pairs.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an entity record matching method according to the present invention;
FIG. 2 is a schematic structural diagram of an entity record matching system according to the present invention;
FIG. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
At present, there are two categories of methods for implementing the entity matching task. The first category uses data feature engineering to construct matching features manually and matches records of the same entity based on rules. For example, the paper "Magellan: Toward Building Entity Matching Management Systems" designs an entity matching system, called Magellan, that relies on domain experts to design the data feature engineering. The system integrates tools for data cleaning, data representation, matching rule design and the like, and can assist domain experts in constructing an entity matching pipeline.
The second category treats the attribute values in entity records as natural language, represents the entity records as semantic vectors using a neural network, and trains a binary classifier on the vectorized representations and labeled entity record pairs. For example, the paper "Deep Learning for Entity Matching: A Design Space Exploration" proposes training a neural network, called DeepMatcher, end to end to perform the entity matching task. In that approach, the words in each attribute value of an entity record are first represented as word vectors, and the word vectors are then used to extract an attribute value vector for each attribute value. The neural network next compares the attribute value vectors of the two entity records being matched to obtain a comparison result vector for each attribute, and finally applies a differentiable binary classifier to obtain the final matching result.
Both existing entity record matching methods have drawbacks. The Magellan system depends on the intervention of human experts to help design the data feature engineering scheme, so the algorithm cannot be fully automated; moreover, the feature engineering cannot be migrated to a new data set, which means that the method generalizes poorly to new data sets.
DeepMatcher uses a neural network for feature extraction. However, because it is trained end to end, a large number of matching and non-matching entity record pairs must be labeled, so DeepMatcher also requires considerable manual labeling effort to ensure that the algorithm generalizes well. Another drawback is the poor interpretability of the model: because the method is built on end-to-end training, it cannot give explicit entity record matching rules.
FIG. 1 is a schematic flow chart of the entity record matching method provided by the present invention. As shown in FIG. 1, the present invention provides an entity record matching method, including:
Step 101: acquiring an entity record set to be matched, wherein the entity records in the entity record set consist of attributes and attribute values of entities;
In the present invention, the complete set of attributes defining an entity contains m attributes, where m is the number of attributes, and $v_i$ denotes the attribute value corresponding to the i-th attribute. An entity record is thus formed from attribute-attribute value pairs.
Step 102: inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
In the present invention, the entity matching task requires an entity matching algorithm EM(·) which, as a function, makes a judgment on any two entity records: it gives a positive result if and only if the two entity records refer to the same real-world entity, and a negative result otherwise. The invention trains a neural network by a self-supervised learning method and trains a decision tree model by a decision tree algorithm, and constructs the entity record matching model from the two. When an entity record set is input into this model, the entity records in the set are matched, yielding a matching result indicating whether different entity records belong to the same entity.
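By way of non-limiting illustration, the interface of the entity matching task may be sketched as follows. The type names `EntityRecord` and `Matcher`, the helper `match_all`, and the placeholder matcher `exact_match` are assumptions introduced only for this sketch; the placeholder stands in for the trained model and is not the matching model of the invention.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: an entity record is modelled as attribute name -> attribute value.
EntityRecord = Dict[str, str]
# EM(.) takes two entity records and returns True when they refer to the same entity.
Matcher = Callable[[EntityRecord, EntityRecord], bool]


def exact_match(r1: EntityRecord, r2: EntityRecord) -> bool:
    """Placeholder matcher: all shared attributes must have identical values."""
    shared = r1.keys() & r2.keys()
    return bool(shared) and all(r1[a] == r2[a] for a in shared)


def match_all(records: List[EntityRecord], em: Matcher) -> List[Tuple[int, int]]:
    """Apply em to every unordered pair and return the index pairs judged to match."""
    return [
        (i, j)
        for i in range(len(records))
        for j in range(i + 1, len(records))
        if em(records[i], records[j])
    ]


records = [
    {"name": "cola", "volume": "1 liter"},
    {"name": "cola", "volume": "1 liter"},
    {"name": "cola", "volume": "100 ml"},
]
matches = match_all(records, exact_match)
```

Any function with the `Matcher` signature, including a trained model, could be passed to `match_all` in place of the placeholder.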
According to the entity record matching method and system provided by the invention, entities are converted into attribute value vectors through the neural network, the poor interpretability of deep learning is overcome by means of the automatically constructed key attribute tree, and the learned key attribute tree can be converted into matching rules that can be applied to other data sets. In addition, training the model requires only a small number of labeled entity record pairs, which overcomes the need of existing methods for a large number of labeled entity record pairs.
On the basis of the above embodiment, the trained entity record matching model is obtained by the following steps:
constructing a first training sample set according to the sample entity record;
training the neural network on the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, wherein the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
constructing a second training sample set according to the sample entity record representation vector;
training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, wherein the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;
and constructing a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model.
In the invention, the trained entity record matching model mainly comprises two functions, obtained by training a neural network and a decision tree model respectively. The first function is heterogeneous information fusion, which unifies attribute values that differ in form but share the same semantics in a vector space. This is needed because entity matching requires comparing attribute values, and the attribute values in the original entity records are often heterogeneous: directly comparing the attribute values of two different entity records often yields an erroneous comparison result. For example, given the attribute values "1 liter", "1000 ml" and "100 ml", a direct comparison may consider "1000 ml" and "100 ml" to be closer because they share more of the same words, while considering "1 liter" and "1000 ml" to be quite different. In the invention, the neural network is trained by a self-supervised learning method, and entity records are converted into representation vectors. Specifically, each entity record is composed of a series of attribute values, and each attribute value is converted into an attribute value vector during heterogeneous information fusion.
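The pitfall of direct comparison described above can be reproduced with a short sketch. Token-level Jaccard similarity is used here only as one assumed example of a direct surface comparison; the source does not name a specific measure, and `token_jaccard` is a hypothetical helper.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-separated word sets of two strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


# "1000 ml" and "100 ml" share the word "ml", so they look similar on the surface.
close_by_surface = token_jaccard("1000 ml", "100 ml")
# "1 liter" and "1000 ml" share no word, although they denote the same volume.
far_by_surface = token_jaccard("1 liter", "1000 ml")
```

The surface measure ranks the semantically different pair above the semantically identical one, which is the error that mapping attribute values into a common vector space is intended to avoid.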
Further, a key attribute tree model is trained using a decision tree algorithm and labeled entity record pairs, wherein the nodes of the key attribute tree are attributes. For an input pair of entity records, the key attribute tree model compares the attribute value vectors corresponding to the attributes on the nodes, selects a path from the root to a leaf node according to the comparison results, and determines the matching result from the leaf node at the end of the path. After the heterogeneous information fusion model and the key attribute tree model have been trained, the two models are combined to obtain the trained entity record matching model.
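As a hedged illustration of how a decision tree algorithm might choose one node of such a tree, the sketch below turns each labeled pair of entity records into per-attribute vector distances and selects the attribute and distance threshold with the lowest weighted Gini impurity. The use of Euclidean distance, the Gini criterion, and the names `pair_features`, `gini` and `best_split` are assumptions made for this sketch; the source does not specify them.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def pair_features(h1: Sequence[Vector], h2: Sequence[Vector]) -> List[float]:
    """One distance per attribute between the attribute value vectors of two records."""
    return [math.dist(u, v) for u, v in zip(h1, h2)]


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels (1 = match)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(X: List[List[float]], y: List[int]) -> Tuple[int, float]:
    """Return (attribute index, distance threshold) minimising weighted Gini impurity."""
    best = (float("inf"), 0, 0.0)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive observed distances.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[0]:
                best = (score, j, t)
    return best[1], best[2]


# Four labeled pairs, two attributes; attribute 0 separates matches from non-matches.
X = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [1.0, 0.5]]
y = [1, 1, 0, 0]
attribute, threshold = best_split(X, y)
```

Applying the same selection recursively to the two resulting subsets would grow a full tree; only the single-node step is shown here.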
On the basis of the above embodiment, training the neural network through the first training sample set to obtain a heterogeneous information fusion model, including:
after the sample entity records are input into a neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;
converting the attribute value matrix into an attribute feature vector through an attention mechanism, based on an intra-attribute attention layer of the neural network;
based on an inter-attribute attention layer of the neural network, obtaining, through an attention mechanism and according to the attribute feature vector, a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;
and performing iterative training on the multilayer perceptron of the neural network output layer through the second attribute value vector representation until a preset training condition is met to obtain a heterogeneous information fusion model, and outputting a sample entity record representation vector.
In the invention, the heterogeneous information fusion model converts the attribute values in the entity records into attribute value vectors. The model consists of a word vector embedding layer, an intra-attribute attention layer, an inter-attribute attention layer and an output layer. Specifically, the word vector embedding layer converts each word of the attribute values in the entity record into a vector. Given an attribute whose attribute value $v = [w_1, w_2, \dots, w_T]$ consists of T words, the word vector embedding layer first inserts the symbols beg and end, which mark the beginning and end of the sentence, at the head and tail of the attribute value. It then appends the padding symbol <pad> to the end of the attribute value until the length reaches a fixed value, so that in the resulting representation all words after end are <pad>. Further, the word vector embedding layer converts the words into pre-trained word vectors by table lookup. The attribute value is thereby converted into an attribute value matrix, where $d_{emb}$ is the dimension of the word vectors.
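The padding and table-lookup steps described above may be sketched as follows. The literal token strings, the function names `pad_attribute_value` and `embed`, the toy lookup table, and the zero vector used for out-of-vocabulary words are all assumptions of this sketch.

```python
from typing import Dict, List

# Assumed spellings of the begin, end and padding symbols.
BEG, END, PAD = "<beg>", "<end>", "<pad>"


def pad_attribute_value(words: List[str], fixed_len: int) -> List[str]:
    """Wrap the words in begin/end markers, then pad with <pad> to fixed_len tokens."""
    tokens = [BEG] + words + [END]
    if len(tokens) > fixed_len:
        raise ValueError("attribute value is longer than the fixed length")
    return tokens + [PAD] * (fixed_len - len(tokens))


def embed(tokens: List[str], table: Dict[str, List[float]], d_emb: int) -> List[List[float]]:
    """Look up each token in a pre-trained table; unknown tokens map to a zero vector."""
    unknown = [0.0] * d_emb
    return [table.get(tok, unknown) for tok in tokens]


tokens = pad_attribute_value(["1000", "ml"], fixed_len=6)
# Toy two-dimensional table standing in for pre-trained word vectors.
table = {"1000": [0.5, 0.5], "ml": [1.0, 0.0]}
matrix = embed(tokens, table, d_emb=2)
```

The result `matrix` has one row per token and `d_emb` columns, which corresponds to the attribute value matrix described in the text.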
Further, the intra-attribute attention layer converts each attribute value matrix into an attribute feature vector by using an attention mechanism. Specifically, the feature vector of each word is converted into a compressed feature vector through linear mapping, then the weight of each word is extracted, and the attribute feature vector is calculated in a weighted average mode according to the weight of the word and the compressed feature vector. The calculation formula of the attribute feature vector is as follows:
[Formula not reproduced in this text.]
wherein the feature matrix of an entity record consists of m attribute feature vectors; $I[j]$ denotes the feature vector corresponding to the j-th attribute, that is, the row vector in the j-th row of the feature matrix; two sets of trainable parameters are associated with the j-th attribute; each word in the attribute value of the j-th attribute has its own attention weight; the number of words is that of the attribute value corresponding to the j-th attribute; each word has a compressed feature vector; and $d_{emb}$ denotes the dimension of the word vectors.
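A minimal sketch of the intra-attribute attention step, written from the prose description only, is given below. It assumes that the two trainable parameter sets of an attribute are a compression matrix `W` and a scoring vector `a`, and that word weights are obtained with a softmax; these choices and all names are assumptions, since the formula itself is not available in this text.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W: List[Vector], x: Vector) -> Vector:
    """Multiply matrix W (rows of length len(x)) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def intra_attribute_attention(E: List[Vector], W: List[Vector], a: Vector) -> Vector:
    """Compress each word vector with W, weight the words by softmax(a . e), and average."""
    compressed = [matvec(W, e) for e in E]
    weights = softmax([sum(ai * ei for ai, ei in zip(a, e)) for e in E])
    return [sum(w * c[k] for w, c in zip(weights, compressed)) for k in range(len(W))]


# Two words with 2-dimensional embeddings, identity compression, uniform attention.
E = [[2.0, 0.0], [0.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0]
feature = intra_attribute_attention(E, W, a)
```

With a zero scoring vector the weights are uniform, so the output is the plain average of the compressed word vectors; a trained scoring vector would shift weight toward the more informative words.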
Furthermore, the inter-attribute attention layer uses an inter-attribute attention mechanism to recover the information of an attribute value from the attribute values of the other attributes. After the inter-attribute attention layer, each attribute value in each entity record is represented by a vector (the second attribute value vector representation), which is divided into two parts: a context part (the context vector representation) and the original part of the attribute value itself (the first attribute value vector representation). Specifically, an entity record is represented by a matrix formed from these two parts, where | denotes the row-wise concatenation of the matrices. The calculation formula of the second attribute value vector representation is as follows:
[Formula not reproduced in this text.]
wherein the two parameter sets denote learned parameters; $D[j]$, the first attribute value vector representation, denotes the feature vector corresponding to the j-th attribute after information compression; $C[j]$, the context vector representation, denotes the feature vector of the context corresponding to the j-th attribute, and is obtained as a weighted average, using the attention coefficients, of the information-compressed matrix $D$; $\alpha_{j,i}$ denotes the attention coefficient; $v_i$ denotes the i-th attribute value in the entity record; the key vector denotes the key representation of the i-th attribute value; and the query vector denotes the query for the j-th attribute value. The score with which the j-th attribute queries the i-th attribute is computed as the dot product of the query vector and the key vector, and is further used to calculate the attention coefficient.
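The inter-attribute attention step can be sketched in the same hedged manner. The sketch assumes that queries and keys are linear maps (`Wq`, `Wk`) of the rows of D, that the attention coefficients are a softmax of the query-key dot products, and that attribute j attends only to the other attributes; each of these is an assumption, as are all names.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W: List[Vector], x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def inter_attribute_attention(D: List[Vector], Wq: List[Vector], Wk: List[Vector]) -> List[Vector]:
    """For each attribute j, return D[j] concatenated with its context vector C[j]."""
    queries = [matvec(Wq, d) for d in D]
    keys = [matvec(Wk, d) for d in D]
    out = []
    for j, q in enumerate(queries):
        # Assumption: attribute j attends to every other attribute, not to itself.
        others = [i for i in range(len(D)) if i != j]
        alpha = softmax([dot(q, keys[i]) for i in others])
        context = [sum(a * D[i][k] for a, i in zip(alpha, others)) for k in range(len(D[0]))]
        out.append(list(D[j]) + context)  # concatenate D[j] and C[j]
    return out


D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
uniform = inter_attribute_attention(D, zero, identity)
```

With a zero query matrix all scores are equal, so each context vector is the plain average of the other attributes' vectors; learned parameters would instead concentrate the weight on the attributes most useful for recovering the j-th value.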
Finally, the multilayer perceptron (MLP) of the output layer represents each entity record as a feature matrix. In this computation, the MLPs of different attributes do not share parameters, specifically:
It should be noted that, in the present invention, the heterogeneous information fusion model acts as a function that transforms one entity record into a feature matrix.
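The output layer can be sketched as one small MLP per attribute with no parameters shared between attributes. The layer sizes and the ReLU activation are assumptions made for illustration; the source states only that the MLPs of different attributes do not share parameters.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]
Layer = Tuple[Matrix, Vector]  # (weights, bias)


def dense(layer: Layer, x: Vector, relu: bool) -> Vector:
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, bias = layer
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


def output_layer(rows: Matrix, mlps: List[List[Layer]]) -> Matrix:
    """Apply the j-th MLP to the j-th attribute vector; no parameters are shared."""
    if len(rows) != len(mlps):
        raise ValueError("need exactly one MLP per attribute")
    feature_matrix = []
    for x, layers in zip(rows, mlps):
        for idx, layer in enumerate(layers):
            # ReLU on every layer except the last one of each MLP.
            x = dense(layer, x, relu=idx < len(layers) - 1)
        feature_matrix.append(x)
    return feature_matrix
```

Keeping a separate parameter list per attribute makes the lack of sharing explicit: changing the MLP of one attribute cannot affect the row produced for another.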
On the basis of the above embodiment, the key attribute tree model consists of leaf nodes and internal nodes, wherein each internal node comprises a key attribute, a vector distance function and a distance threshold, and each leaf node records an entity matching result.
In the present invention, the feature matrices H_1, H_2 of a pair of entity records are input into the key attribute tree model, which outputs whether the two entity records match. The key attribute tree is a special binary tree consisting of leaf nodes and internal nodes: each internal node comprises a key attribute, a vector distance function and a distance threshold, and each leaf node records an entity matching result. Entity matching inference based on the key attribute tree finds a path from the root node to a leaf node according to the input feature matrices H_1, H_2, and the result recorded at the leaf node on that path is taken as the final matching result.
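The inference procedure on a key attribute tree can be sketched as below. The class names, the Euclidean distance and the convention that a distance not exceeding the threshold follows the `near` branch are illustrative assumptions; the source specifies only the contents of the nodes and the root-to-leaf traversal.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Union

Vector = List[float]
Matrix = List[Vector]


def euclidean(a: Vector, b: Vector) -> float:
    """One possible vector distance function."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class Leaf:
    is_match: bool  # entity matching result recorded at the leaf


@dataclass
class InternalNode:
    attribute: int                               # index of the key attribute
    distance: Callable[[Vector, Vector], float]  # vector distance function
    threshold: float                             # distance threshold
    near: "Node"                                 # taken when distance <= threshold
    far: "Node"                                  # taken otherwise


Node = Union[Leaf, InternalNode]


def kat_match(node: Node, H1: Matrix, H2: Matrix) -> bool:
    """Walk from the root to a leaf and return the result stored there."""
    while isinstance(node, InternalNode):
        d = node.distance(H1[node.attribute], H2[node.attribute])
        node = node.near if d <= node.threshold else node.far
    return node.is_match


# Example tree: compare attribute 0 first, then attribute 1.
tree = InternalNode(0, euclidean, 0.5,
                    near=InternalNode(1, euclidean, 1.0, Leaf(True), Leaf(False)),
                    far=Leaf(False))
```

Because every decision is a comparison of one attribute against one threshold, each root-to-leaf path reads directly as a matching rule.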
On the basis of the above embodiment, before the constructing the first training sample set according to the sample entity record, the method further includes:
masking the attribute values in the sample entity records according to a preset masking condition, so as to construct the first training sample set from the masked sample entity records.
In the invention, according to a preset masking condition (for example, masking attribute values in a certain range), part of the attribute values in an entity record are masked, and the model is made to recover the masked attribute values. The neural network is trained in this way to obtain the heterogeneous information fusion model.
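The construction of the masked training set can be sketched as follows. The masking condition used here, an independent probability `mask_prob` per attribute value, and the placeholder token `[MASK]` are hypothetical choices; the source leaves the preset masking condition open.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"  # hypothetical placeholder token

Record = Dict[str, str]


def mask_record(record: Record, mask_prob: float,
                rng: random.Random) -> Tuple[Record, Record]:
    """Return (masked record, original values of the masked attributes)."""
    masked: Record = {}
    targets: Record = {}
    for attr, value in record.items():
        # Empty values carry nothing to recover, so they are never masked.
        if value and rng.random() < mask_prob:
            masked[attr] = MASK
            targets[attr] = value
        else:
            masked[attr] = value
    return masked, targets


def build_first_training_set(records: List[Record], mask_prob: float = 0.15,
                             seed: int = 0) -> List[Tuple[Record, Record]]:
    """Pair every masked record with the attribute values the model must recover."""
    rng = random.Random(seed)
    return [mask_record(r, mask_prob, rng) for r in records]
```

The recovery targets come from the records themselves, so no manually labeled pairs are needed at this stage.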
In one embodiment, the performance of the entity record matching model of the present invention is evaluated on a standard entity matching dataset and an e-commerce entity matching dataset. Specifically, 1% to 10% of the entity pairs are selected as labeled data, and the performance of the entity record matching model is evaluated in this weakly supervised setting. Experimental results show that, under weak supervision, the entity record matching method provided by the invention achieves an F1 score significantly higher than those of Magellan and DeepMatcher. Moreover, the depth of the key attribute tree does not exceed 3 levels, which shows that a high-quality matching result can be obtained by comparing only a very small number of attributes, thereby demonstrating the effectiveness of the entity record matching method of the invention.
The entity record matching method provided by the invention does not rely on manual data cleaning, and can clean the data automatically using the heterogeneous information fusion model obtained by neural network training. It can also construct a key attribute tree, extract key attributes automatically, and provide a matching result with good interpretability. In addition, the invention places low requirements on the number of labeled matching pairs and can achieve high matching quality with only a small number of labeled entity pairs.
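For reference, the F1 score used in such an evaluation is the harmonic mean of precision and recall over the predicted matches. The sketch below uses made-up labels purely to show the computation; it does not reproduce any result from the experiments.

```python
from typing import List


def f1_score(predicted: List[bool], actual: List[bool]) -> float:
    """F1 of the positive (match) class; 0.0 when there are no true positives."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    # Equivalent to 2 * precision * recall / (precision + recall).
    return 2 * tp / (2 * tp + fp + fn)
```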
Fig. 2 is a schematic structural diagram of the entity record matching system provided by the present invention. As shown in Fig. 2, the present invention provides an entity record matching system comprising an entity acquisition module 201 and an entity matching module 202. The entity acquisition module 201 is configured to acquire an entity record set to be matched, where each entity record in the set consists of attributes and attribute values of an entity. The entity matching module 202 is configured to input the entity record set into a trained entity record matching model to obtain the matching result between entity records in the set, where the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
The entity record matching system provided by the invention converts the entity into the attribute value vector through the neural network, overcomes the defect of poor deep learning interpretability by utilizing the automatically constructed key attribute tree, can convert the learned key attribute tree into the matching rule, and applies the matching rule to other data sets; meanwhile, the training of the corresponding model of the invention only needs a small number of labeled entity record pairs, and overcomes the defect that the existing method needs a large number of labeled entity record pairs.
On the basis of the above embodiment, the system further comprises a first sample set construction module, a heterogeneous information fusion training module, a second sample set construction module, a key attribute tree training module and a model construction module. The first sample set construction module is configured to construct a first training sample set from the sample entity records. The heterogeneous information fusion training module is configured to train the neural network on the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, which converts entity records into entity record representation vectors. The second sample set construction module is configured to construct a second training sample set from the sample entity record representation vectors. The key attribute tree training module is configured to train a decision tree model on the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, which classifies entity records and matches entity records with the same classification result to the same entity. The model construction module is configured to construct the trained entity record matching model from the heterogeneous information fusion model and the key attribute tree model.
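The second stage can be sketched as follows, assuming that each labeled pair of feature matrices is turned into one distance per attribute and that a tree is then grown over these distances. For brevity the sketch learns only a single split (a decision stump) by counting misclassifications; a full decision tree algorithm would apply such splits recursively and would normally use an impurity criterion. The function names and the Euclidean distance are illustrative assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def distance_features(H1: Matrix, H2: Matrix) -> List[float]:
    """One feature per attribute: Euclidean distance between the two attribute vectors."""
    return [math.dist(a, b) for a, b in zip(H1, H2)]


def best_split(features: List[List[float]],
               labels: List[bool]) -> Tuple[int, float, int]:
    """Return (attribute index, threshold, errors) of the single best split.

    A pair is predicted to match when its distance on the chosen attribute
    does not exceed the threshold.
    """
    best = (0, 0.0, len(labels) + 1)
    for j in range(len(features[0])):
        # Candidate thresholds are the observed distances on attribute j.
        for threshold in sorted({row[j] for row in features}):
            errors = sum((row[j] <= threshold) != label
                         for row, label in zip(features, labels))
            if errors < best[2]:
                best = (j, threshold, errors)
    return best
```

The attribute index returned by `best_split` plays the role of a key attribute, and the threshold that of the distance threshold stored in an internal node.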
The system provided by the present invention is used for executing the above method embodiments, and for the specific processes and details, reference is made to the above embodiments, which are not described herein again.
Fig. 3 is a schematic structural diagram of an electronic device provided by the present invention. As shown in Fig. 3, the electronic device may include a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with each other through the communication bus 304. The processor 301 may invoke logic instructions in the memory 303 to perform an entity record matching method comprising: acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
In addition, when the logic instructions in the memory 303 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the entity record matching method provided by the above method embodiments, the method comprising: acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
In still another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the entity record matching method provided by the above embodiments, the method comprising: acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (7)
1. An entity record matching method, comprising:
acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities;
inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm;
the trained entity record matching model is obtained through the following steps:
constructing a first training sample set according to the sample entity record;
training a neural network on the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, wherein the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
constructing a second training sample set according to the sample entity record representation vector;
training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, wherein the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;
constructing a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model;
wherein training the neural network on the first training sample set based on the self-supervised learning method to obtain the heterogeneous information fusion model comprises the following steps:
after the sample entity records are input into a neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;
converting the attribute value matrix into an attribute feature vector through an attention mechanism based on an intra-attribute attention layer of the neural network;
based on an inter-attribute attention layer of the neural network, obtaining a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation through an attention mechanism according to the attribute feature vector, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;
and performing iterative training on the multilayer perceptron of the neural network output layer through the second attribute value vector representation until a preset training condition is met to obtain a heterogeneous information fusion model, and outputting a sample entity record representation vector.
2. The entity record matching method according to claim 1, wherein the attribute feature vector is calculated by the formula:
wherein the feature matrix of an entity record consists of m attribute feature vectors; the matrix entry indexed by [j] is the feature vector corresponding to the j-th attribute; each attribute j has two sets of trainable parameters; each word in the attribute value of the j-th attribute has its own attention weight, the number of weights being the number of words in that attribute value; the result is the compressed feature vector; and d_emb denotes the dimension of the word vectors;
the second attribute value vector representation is calculated as follows:
wherein the two parameters are learned parameters; D[j], the first attribute value vector representation, is the feature vector of the j-th attribute after information compression; C[j], the context vector representation, is the context feature vector of the j-th attribute; α_{j,i} denotes the attention coefficient; v_i denotes the i-th attribute value in the entity record; the key vector is the key representation of the i-th attribute value; and the query vector is the query for the j-th attribute value.
3. The entity record matching method according to claim 1, wherein the key attribute tree model is composed of leaf nodes and internal nodes, the internal nodes containing a key attribute, a vector distance function and a distance threshold; and the leaf node records an entity matching result.
4. The entity record matching method of claim 1, wherein prior to said constructing a first training sample set from sample entity records, said method further comprises:
masking the attribute values in the sample entity records according to a preset masking condition, so as to construct the first training sample set from the masked sample entity records.
5. An entity record matching system, comprising:
the entity acquisition module is used for acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities;
the entity matching module is used for inputting the entity record set into a trained entity record matching model to obtain a matching result between entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm;
the system further comprises:
the first sample set construction module is used for constructing a first training sample set according to the sample entity record;
the heterogeneous information fusion training module is used for training the neural network on the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, and the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
the second sample set construction module is used for constructing a second training sample set according to the sample entity record representation vector;
the key attribute tree training module is used for training the decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, and the key attribute tree model is used for classifying the entity records and matching the entity records with the same classification result into the same entity;
the model construction module is used for constructing and obtaining a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model;
the heterogeneous information fusion training module is specifically used for:
after the sample entity records are input into the neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;
converting the attribute value matrix into attribute feature vectors through an attention mechanism based on an intra-attribute attention layer of the neural network;
based on an inter-attribute attention layer of the neural network, obtaining a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation according to the attribute feature vector through an attention mechanism, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;
and performing iterative training on the multilayer perceptron of the neural network output layer through the second attribute value vector representation until a preset training condition is met to obtain a heterogeneous information fusion model, and outputting a sample entity record representation vector.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the entity record matching method according to any one of claims 1 to 4 when executing the computer program.
7. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the entity record matching method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110614418.1A CN114781471B (en) | 2021-06-02 | 2021-06-02 | Entity record matching method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114781471A (en) | 2022-07-22 |
CN114781471B (en) | 2022-12-27 |
Family
ID=82424310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110614418.1A Active CN114781471B (en) | 2021-06-02 | 2021-06-02 | Entity record matching method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114781471B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107545033A (en) * | 2017-07-24 | 2018-01-05 | 清华大学 | A kind of computational methods based on the knowledge base entity classification for representing study |
CN108268643A (en) * | 2018-01-22 | 2018-07-10 | 北京邮电大学 | A kind of Deep Semantics matching entities link method based on more granularity LSTM networks |
CN108427714A (en) * | 2018-02-02 | 2018-08-21 | 北京邮电大学 | The source of houses based on machine learning repeats record recognition methods and system |
CN109408555A (en) * | 2018-09-19 | 2019-03-01 | 智器云南京信息科技有限公司 | Data type recognition methods and device, data storage method and device |
CN110334212A (en) * | 2019-07-01 | 2019-10-15 | 南京审计大学 | A kind of territoriality audit knowledge mapping construction method based on machine learning |
CN111339872A (en) * | 2020-02-18 | 2020-06-26 | 国网信通亿力科技有限责任公司 | Power grid fault classification method based on classification model |
CN111522961A (en) * | 2020-04-09 | 2020-08-11 | 武汉理工大学 | Attention mechanism and entity description based industrial map construction method |
CN112906396A (en) * | 2021-04-01 | 2021-06-04 | 翻车信息科技(杭州)有限公司 | Cross-platform commodity matching method and system based on natural language processing |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200073953A1 (en) * | 2018-08-30 | 2020-03-05 | Salesforce.Com, Inc. | Ranking Entity Based Search Results Using User Clusters |
CN110991185A (en) * | 2019-11-05 | 2020-04-10 | 北京声智科技有限公司 | Method and device for extracting attributes of entities in article |
Non-Patent Citations (3)
Title |
---|
《Entity Matching in Online Social Networks》;Olga Peled等;《SocialCom/PASSAT/BigData/EconCom/BioMedCom 2013》;20131231;第339-344页 * |
《Entity Matching on Unstructured Data: An Active Learning Approach》;Ursin Brunner等;《2019 6th Swiss Conference on Data Science (SDS)》;20191231;第97-102页 * |
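《Research on Entity Matching Methods and Their Application in Credit Reporting Systems》;陈波 (Chen Bo);《China Doctoral Dissertations Full-text Database, Economics and Management Sciences》;20100915(No. 9);pp. J145-31 * |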
Also Published As
Publication number | Publication date |
---|---|
CN114781471A (en) | 2022-07-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||