CN114781471B - Entity record matching method and system - Google Patents


Info

Publication number
CN114781471B
Authority
CN
China
Prior art keywords
entity
attribute
entity record
matching
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110614418.1A
Other languages
Chinese (zh)
Other versions
CN114781471A (en
Inventor
李涓子
姚子俊
侯磊
吕鑫
唐杰
张鹏
许斌
戴泽林
张亦弛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110614418.1A
Publication of CN114781471A
Application granted
Publication of CN114781471B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an entity record matching method and system. The method comprises: acquiring an entity record set to be matched, wherein each entity record in the set consists of attributes of an entity and their attribute values; and inputting the entity record set into a trained entity record matching model to obtain matching results between the entity records in the set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm. The invention converts entity records into attribute value vectors through the neural network and uses an automatically constructed key attribute tree to overcome the poor interpretability of deep learning; the learned key attribute tree can be converted into matching rules, and the matching rules can be applied to other data sets. In addition, training the model requires only a small number of labeled entity record pairs, which overcomes the need of existing methods for a large number of labeled entity record pairs.

Description

Entity record matching method and system
Technical Field
The invention relates to the technical field of entity matching, in particular to an entity record matching method and system.
Background
With the popularization and development of the Internet, the volume of data stored on the open-domain Internet is increasing rapidly, and real-world entities are stored in the databases that make up the Internet in the form of entity records. However, these stored entity records contain a large amount of redundancy, and the same entity may be stored as multiple entity records. Taking an e-commerce platform as an example, the same product is sold by different merchants, so the same product is registered as different commodity records. Here, the product corresponds to a real-world entity, and each registered commodity record corresponds to an entity record.
In order to identify the entity records of the same entity, it is necessary to determine whether two given entity records refer to the same entity. This task is abstracted as the entity record matching task, also called the entity matching task. The input of the task is two entity records represented as attribute-attribute value pairs, and the output is a matching result indicating whether the two records refer to the same entity.
At present, methods for the entity matching task fall into two categories. One category uses feature engineering to construct matching features manually and matches records of the same entity based on rules. The other category treats the attribute values in entity records as natural language, represents the entity records as semantic vectors using a neural network, and trains a binary classifier on the vectorized representations and labeled entity record pairs. However, both categories require manual intervention, so the algorithm cannot be fully automated, and hand-built feature engineering cannot be transferred to a new data set, which means that existing methods generalize poorly to new data sets.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an entity record matching method and system.
The invention provides an entity record matching method, which comprises the following steps:
acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities;
and inputting the entity record set into a trained entity record matching model to obtain matching results between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
According to the entity record matching method provided by the invention, the trained entity record matching model is obtained through the following steps:
constructing a first training sample set according to the sample entity record;
training a neural network on the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, wherein the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
constructing a second training sample set according to the sample entity record representation vector;
training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, wherein the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;
and constructing a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model.
According to the entity record matching method provided by the invention, training the neural network on the first training sample set based on the self-supervised learning method to obtain the heterogeneous information fusion model comprises the following steps:
after the sample entity records are input into the neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;
converting the attribute value matrix into an attribute feature vector through an attention mechanism, based on an intra-attribute attention layer of the neural network;
based on an inter-attribute attention layer of the neural network, obtaining, through an attention mechanism and according to the attribute feature vector, a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;
and iteratively training the multilayer perceptron of the output layer of the neural network on the second attribute value vector representation until a preset training condition is met, to obtain the heterogeneous information fusion model and output the sample entity record representation vectors.
According to the entity record matching method provided by the invention, the attribute feature vector is calculated as follows:
(Figure BDA0003097471950000031)
wherein (Figure BDA0003097471950000032) is the feature matrix of an entity record, consisting of m attribute feature vectors, and the vector indexed by [j] in this matrix is the feature vector corresponding to the j-th attribute; (Figure BDA0003097471950000033) and (Figure BDA0003097471950000034) are two sets of trainable parameters for the j-th attribute; (Figure BDA0003097471950000035) denotes the j-th attribute; (Figure BDA0003097471950000036) denotes the attention weight of each of the (Figure BDA0003097471950000037) words of the j-th attribute; (Figure BDA0003097471950000038) denotes the number of words in the attribute value corresponding to the j-th attribute; (Figure BDA0003097471950000039) denotes the compressed feature vector; and d_emb denotes the dimension of a word vector;
the second attribute value vector representation is calculated as follows:
(Figure BDA00030974719500000310)
(Figure BDA00030974719500000311)
(Figure BDA0003097471950000041)
(Figure BDA0003097471950000042)
(Figure BDA0003097471950000043)
(Figure BDA0003097471950000044)
(Figure BDA0003097471950000045)
wherein (Figure BDA0003097471950000046), (Figure BDA0003097471950000047) and (Figure BDA0003097471950000048) denote learned parameters; D[j] is the first attribute value vector representation, that is, the feature vector corresponding to the j-th attribute after information compression; C[j] is the context vector representation, that is, the feature vector of the context corresponding to the j-th attribute; α_{j,i} denotes the attention coefficient; v_i denotes the i-th attribute value in an entity record; the vector k_{v_i} is the key representation of the i-th attribute value; and the vector q_{u_j} is the query of the j-th attribute value.
According to the entity record matching method provided by the invention, the key attribute tree model consists of leaf nodes and internal nodes, wherein the internal nodes comprise a key attribute, a vector distance function and a distance threshold; and the leaf node records an entity matching result.
According to an entity record matching method provided by the present invention, before the constructing a first training sample set according to sample entity records, the method further includes:
masking the attribute values in the sample entity records according to a preset masking condition, so that the first training sample set is constructed from the masked sample entity records.
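The masking step can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the claimed procedure: the source only speaks of a "preset masking condition", so the mask token `<mask>`, the per-attribute masking probability `mask_prob`, and the function name `mask_record` are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask token; the source does not name one


def mask_record(record, mask_prob=0.3, rng=None):
    """Return (masked_record, targets).

    masked_record is a copy of the record in which some attribute values are
    hidden; targets holds the hidden originals. Masking each value
    independently with probability mask_prob is an assumed masking condition.
    """
    rng = rng or random.Random()
    masked, targets = {}, {}
    for attribute, value in record.items():
        if rng.random() < mask_prob:
            masked[attribute] = MASK
            targets[attribute] = value  # what a self-supervised objective would try to recover
        else:
            masked[attribute] = value
    return masked, targets


record = {"title": "sparkling water", "volume": "1 liter", "brand": "acme"}
masked, targets = mask_record(record, mask_prob=0.5, rng=random.Random(0))
```

Because the hidden values are kept as targets, no manual labels are needed to build this first training sample set, which is the point of a self-supervised setup.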
The invention also provides an entity record matching system, comprising:
an entity acquisition module, used for acquiring an entity record set to be matched, wherein each entity record in the entity record set consists of attributes of an entity and their attribute values;
and an entity matching module, used for inputting the entity record set into a trained entity record matching model to obtain matching results between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
According to an entity record matching system provided by the invention, the system further comprises:
the first sample set construction module is used for constructing a first training sample set according to the sample entity record;
the heterogeneous information fusion training module is used for training the neural network on the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, and the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
the second sample set construction module is used for constructing a second training sample set according to the sample entity record representation vector;
the key attribute tree training module is used for training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, and the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;
and the model construction module is used for constructing and obtaining a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the entity record matching method as described in any one of the above.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the entity record matching method as described in any of the above.
According to the entity record matching method and system provided by the invention, entity records are converted into attribute value vectors through the neural network, and the automatically constructed key attribute tree overcomes the poor interpretability of deep learning; the learned key attribute tree can be converted into matching rules, and the matching rules can be applied to other data sets. In addition, training the model requires only a small number of labeled entity record pairs, which overcomes the need of prior methods for a large number of labeled entity record pairs.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an entity record matching method according to the present invention;
FIG. 2 is a schematic structural diagram of an entity record matching system according to the present invention;
FIG. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
At present, there are two classes of methods for the entity matching task. The first class uses feature engineering to construct matching features manually and matches records of the same entity based on rules. For example, the paper "Magellan: Toward Building Entity Matching Management Systems" designs an entity matching system, called Magellan, that relies on domain experts to design the feature engineering. The system integrates tools for data cleaning, data representation, matching rule design and the like, and assists domain experts in building an entity matching pipeline.
The second class treats the attribute values in entity records as natural language, represents entity records as semantic vectors using neural networks, and trains a binary classifier on the vectorized representations and labeled entity record pairs. For example, the paper "Deep Learning for Entity Matching: A Design Space Exploration" proposes training a neural network called DeepMatcher end to end to perform the entity matching task. In that approach, the words in each attribute value of an entity record are first represented as word vectors, and the word vectors are then used to extract an attribute value vector for each attribute value. The neural network then compares the attribute value vectors of the two entity records to be matched, obtaining a comparison result vector for each attribute, and finally applies a differentiable binary classifier to this comparison to obtain the final matching result.
Both existing entity record matching methods have drawbacks. The Magellan system, on the one hand, depends on human experts to help design the feature engineering scheme, so the algorithm cannot be fully automated; on the other hand, its feature engineering cannot be transferred to a new data set, which means that the method generalizes poorly to new data sets.
DeepMatcher uses a neural network for feature extraction. However, because it is trained end to end, a large number of matching and non-matching entity record pairs must be labeled, so DeepMatcher also needs substantial manual labeling effort to generalize well. Another drawback is poor interpretability: because the model is built by end-to-end training, it cannot give explicit entity record matching rules.
FIG. 1 is a schematic flow chart of the entity record matching method provided by the present invention. As shown in FIG. 1, the present invention provides an entity record matching method, including:
Step 101: acquiring an entity record set to be matched, wherein each entity record in the entity record set consists of attributes of an entity and their attribute values.
In the present invention, the complete set of attributes defining an entity is (Figure BDA0003097471950000071), where m is the number of attributes. v_i denotes the attribute value corresponding to the attribute (Figure BDA0003097471950000072), also written as (Figure BDA0003097471950000073). An entity record is thus formed from attribute-attribute value pairs, and one entity record (Figure BDA0003097471950000074) is expressed as (Figure BDA0003097471950000081).
Step 102: inputting the entity record set into a trained entity record matching model to obtain matching results between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
In the present invention, the entity matching task requires an entity matching algorithm EM(·), a function that makes a judgment for any two entity records (entity record (Figure BDA0003097471950000082) and entity record (Figure BDA0003097471950000083)). It must satisfy (Figure BDA0003097471950000084) if and only if the entity records (Figure BDA0003097471950000085) and (Figure BDA0003097471950000086) refer to the same real-world entity; otherwise it gives (Figure BDA0003097471950000087).
The invention constructs the entity record matching model from the neural network trained by the self-supervised learning method and the decision tree model trained by the decision tree algorithm. When an entity record set is input into this model, the entity records in the set are matched, yielding a matching result that indicates whether different entity records belong to the same entity.
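The composition described above, an encoder followed by a decision component, can be sketched as follows. This is only an illustration under loud assumptions: `toy_encode` is a hand-written unit converter standing in for the learned neural network, `toy_decide` is a fixed tolerance check standing in for the learned decision tree, and every name in the block is hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]           # attribute -> attribute value
Encoded = Dict[str, List[float]]  # attribute -> attribute value vector


def make_em(encode: Callable[[Record], Encoded],
            decide: Callable[[Encoded, Encoded], bool]) -> Callable[[Record, Record], bool]:
    """Build EM(r1, r2) by encoding both records and then deciding on the vectors."""
    def em(a: Record, b: Record) -> bool:
        return decide(encode(a), encode(b))
    return em


UNIT_TO_ML = {"ml": 1.0, "liter": 1000.0}  # assumed toy vocabulary of units


def toy_encode(record: Record) -> Encoded:
    # Stand-in for the trained neural network: map "<amount> <unit>" to millilitres.
    amount, unit = record["volume"].split()
    return {"volume": [float(amount) * UNIT_TO_ML[unit]]}


def toy_decide(a: Encoded, b: Encoded) -> bool:
    # Stand-in for the trained decision tree: a single distance threshold.
    return abs(a["volume"][0] - b["volume"][0]) < 1e-6


em = make_em(toy_encode, toy_decide)
same = em({"volume": "1 liter"}, {"volume": "1000 ml"})
different = em({"volume": "1000 ml"}, {"volume": "100 ml"})
```

Keeping the encoder and the decision component behind separate callables mirrors the two-part construction in the text, where each part is trained on its own.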
According to the entity record matching method and system provided by the invention, entity records are converted into attribute value vectors through the neural network, and the automatically constructed key attribute tree overcomes the poor interpretability of deep learning; the learned key attribute tree can be converted into matching rules, and the matching rules can be applied to other data sets. In addition, training the model requires only a small number of labeled entity record pairs, which overcomes the need of existing methods for a large number of labeled entity record pairs.
On the basis of the above embodiment, the trained entity record matching model is obtained by the following steps:
constructing a first training sample set according to the sample entity record;
training the neural network on the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, wherein the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
constructing a second training sample set according to the sample entity record representation vector;
training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, wherein the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;
and constructing a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model.
In the invention, the trained entity record matching model mainly comprises two functional components, obtained by training a neural network and a decision tree model respectively. The first component is heterogeneous information fusion, whose purpose is to unify, in a vector space, attribute values that differ in form but share the same semantics. This is needed because entity matching requires comparing attribute values, while the attribute values in raw entity records are often heterogeneous; the attribute values of two different entity records cannot be compared directly, and comparing heterogeneous attribute values directly often gives a wrong result. Consider, for example, the attribute values "1 liter", "1000 ml" and "100 ml". A direct comparison may judge "1000 ml" and "100 ml" to be closer because they share more characters, while judging "1 liter" and "1000 ml" to be quite different, even though the latter pair denotes the same volume. In the invention, the neural network is trained with a self-supervised learning method, and entity records are converted into representation vectors. Specifically, each entity record consists of a series of attribute values, and each attribute value is converted into an attribute value vector during heterogeneous information fusion.
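The failure of direct comparison in this example can be reproduced with a short sketch. It assumes a character-level similarity from Python's standard `difflib` module as the "direct comparison"; the source does not name a particular string measure, and the function name `surface_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; it knows nothing about units."""
    return SequenceMatcher(None, a, b).ratio()


# Different volumes, but nearly identical strings.
close_but_wrong = surface_similarity("1000 ml", "100 ml")

# The same volume, but almost no shared characters.
far_but_right = surface_similarity("1 liter", "1000 ml")

print(close_but_wrong > far_but_right)
```

The surface measure ranks the two different volumes as more similar than the two equal ones, which is the kind of error that mapping attribute values into a shared vector space is meant to avoid.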
Further, a key attribute tree model is constructed by training with a decision tree algorithm on labeled entity record pairs, wherein the nodes of the key attribute tree are attributes. For an input pair of entity records, the key attribute tree model compares the attribute value vectors corresponding to the attribute at each node. According to the comparison results, a path from the root to a leaf node is selected on the tree, and the matching result is determined by the leaf node at the end of the path. After the heterogeneous information fusion model and the key attribute tree model have been trained, the two models are combined to obtain the trained entity record matching model.
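The root-to-leaf decision described above can be sketched as follows. The node contents (key attribute, distance function, threshold, and a match result at each leaf) follow the description of the key attribute tree; the class names, the Euclidean distance, the example thresholds, and the convention that a distance at or below the threshold takes the `near` branch are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Vector = List[float]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


@dataclass
class Leaf:
    is_match: bool  # matching result recorded at the leaf


@dataclass
class Internal:
    attribute: str                               # key attribute compared at this node
    distance: Callable[[Vector, Vector], float]  # vector distance function
    threshold: float                             # distance threshold
    near: "Node"                                 # taken when distance <= threshold (assumed convention)
    far: "Node"                                  # taken otherwise


Node = Union[Leaf, Internal]


def kat_match(node: Node, rec_a: Dict[str, Vector], rec_b: Dict[str, Vector]) -> bool:
    """Walk from the root to a leaf, comparing one key attribute per internal node."""
    while isinstance(node, Internal):
        d = node.distance(rec_a[node.attribute], rec_b[node.attribute])
        node = node.near if d <= node.threshold else node.far
    return node.is_match


# Hypothetical tree: titles must be close, then volumes must be close.
tree = Internal("title", euclidean, 0.5,
                near=Internal("volume", euclidean, 0.1, near=Leaf(True), far=Leaf(False)),
                far=Leaf(False))

a = {"title": [0.1, 0.2], "volume": [1.0]}
b = {"title": [0.1, 0.3], "volume": [1.0]}
result = kat_match(tree, a, b)
```

Each decision touches a single attribute and a single threshold, so the path taken for any pair of records can be read back directly.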
On the basis of the above embodiment, training the neural network on the first training sample set to obtain the heterogeneous information fusion model includes:
after the sample entity records are input into the neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;
converting the attribute value matrix into attribute feature vectors through an attention mechanism, based on an intra-attribute attention layer of the neural network;
based on an inter-attribute attention layer of the neural network, obtaining, through an attention mechanism and according to the attribute feature vectors, a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;
and iteratively training the multilayer perceptron of the output layer of the neural network on the second attribute value vector representation until a preset training condition is met, to obtain the heterogeneous information fusion model and output the sample entity record representation vectors.
In the invention, the heterogeneous information fusion model converts the attribute values in an entity record into attribute value vectors. The model consists of a word vector embedding layer, an intra-attribute attention layer, an inter-attribute attention layer and an output layer. Specifically, the word vector embedding layer converts each word of the attribute values in the entity record into a vector. Given an attribute (Figure BDA0003097471950000101) whose attribute value v = [w_1, w_2, …, w_T] has T words, the word vector embedding layer first inserts the symbols beg and end, which mark the beginning and the end of the sentence, at the head and the tail of the attribute value. It then appends the padding symbol <pad> to the end of the attribute value until the length reaches a fixed value (Figure BDA0003097471950000102). The resulting new representation is (Figure BDA0003097471950000103), in which all words following end are <pad>. The word vector embedding layer then converts the words into pre-trained word vectors by table lookup. The attribute (Figure BDA0003097471950000104) is thus converted into an attribute value matrix (Figure BDA0003097471950000105), where d_emb is the dimension of the word vectors.
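The preprocessing performed by the word vector embedding layer can be sketched as follows. The sketch assumes whitespace tokenization, the literal spellings `<beg>` and `<end>` for the boundary symbols, a zero vector for words missing from the lookup table, and an error when a value exceeds the fixed length; none of these details is given in the source, and the tiny `table` is invented for illustration.

```python
from typing import Dict, List

BEG, END, PAD = "<beg>", "<end>", "<pad>"  # boundary symbol spellings are assumptions


def pad_tokens(value: str, fixed_len: int) -> List[str]:
    """Wrap the words of an attribute value in BEG/END, then pad to fixed_len."""
    tokens = [BEG] + value.split() + [END]
    if len(tokens) > fixed_len:
        # The source does not say how over-long values are handled.
        raise ValueError("attribute value longer than the fixed length")
    return tokens + [PAD] * (fixed_len - len(tokens))


def embed(tokens: List[str], table: Dict[str, List[float]], d_emb: int) -> List[List[float]]:
    """Table lookup: one d_emb-dimensional row per token (zeros for unknown words, an assumption)."""
    unknown = [0.0] * d_emb
    return [table.get(token, unknown) for token in tokens]


# Hypothetical 2-dimensional "pre-trained" vectors.
table = {BEG: [1.0, 0.0], END: [0.0, 1.0], PAD: [0.0, 0.0],
         "1000": [0.5, 0.5], "ml": [0.2, 0.8]}

tokens = pad_tokens("1000 ml", fixed_len=6)
matrix = embed(tokens, table, d_emb=2)  # attribute value matrix: 6 rows, 2 columns
```

Padding every attribute value to the same length gives each attribute value matrix the same shape, which is what lets the later attention layers treat all attributes uniformly.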
Further, the intra-attribute attention layer converts each attribute value matrix into an attribute feature vector by using an attention mechanism. Specifically, the feature vector of each word is converted into a compressed feature vector through a linear mapping, the weight of each word is then extracted, and the attribute feature vector is calculated as the weighted average of the compressed feature vectors under the word weights. The attribute feature vector is calculated as follows:
(Figure BDA0003097471950000111)
wherein (Figure BDA0003097471950000112) is the feature matrix of one entity record, consisting of m attribute feature vectors, and the feature vector corresponding to the j-th attribute is the row vector in the j-th row of this feature matrix; (Figure BDA0003097471950000113) and (Figure BDA0003097471950000114) are two sets of trainable parameters for the j-th attribute; (Figure BDA0003097471950000115) denotes the j-th attribute; (Figure BDA0003097471950000116) denotes the attention weight of each of the (Figure BDA0003097471950000117) words of the j-th attribute; (Figure BDA0003097471950000118) denotes the number of words in the attribute value corresponding to the j-th attribute; (Figure BDA0003097471950000119) denotes the compressed feature vector; and d_emb denotes the dimension of the word vectors.
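The three operations named above (linear compression of each word vector, extraction of one weight per word, and a weighted average) can be sketched in plain Python. The exact formula is not recoverable from the text, so this is a generic attention-pooling sketch and not the patented equation: scoring each word by a dot product with a parameter vector `a`, normalizing with a softmax, and all function and parameter names are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def intra_attribute_pool(X: Matrix, W: Matrix, a: Vector) -> Tuple[Vector, Vector]:
    """Pool the word vectors of one attribute value into one feature vector.

    X: word vectors, one row per word (T rows, d_emb columns)
    W: assumed compression matrix (d rows, d_emb columns)
    a: assumed scoring vector (d_emb entries) that yields one score per word
    """
    compressed = [matvec(W, x) for x in X]                           # linear mapping
    weights = softmax([sum(ai * xi for ai, xi in zip(a, x)) for x in X])
    pooled = [sum(w * h[k] for w, h in zip(weights, compressed))     # weighted average
              for k in range(len(W))]
    return pooled, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three words, d_emb = 2
W = [[1.0, 1.0]]                          # compress to one dimension
a = [0.0, 0.0]                            # equal scores give uniform weights
pooled, weights = intra_attribute_pool(X, W, a)
```

The two parameters `W` and `a` play the role of the "two sets of trainable parameters" per attribute; in a trained model they would be learned rather than fixed by hand.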
Furthermore, the inter-attribute attention layer recovers the information of the attribute value from the attribute values of other attributes by using an inter-attribute attention mechanism, and after passing through the inter-attribute attention layer, each attribute value in each entity record is represented by a vector (i.e. a second attribute value vector), and the vector is divided into two parts, namely a context part
Figure BDA00030974719500001110
(context vector representation) and the original part of the attribute value itself
Figure BDA00030974719500001111
(first attribute value vector representation). Specifically, an entity record is represented by a matrix:
Figure BDA00030974719500001112
and is
Figure BDA00030974719500001113
Where | represents the row-wise concatenation of the matrices. The calculation formula represented by the second attribute value vector is as follows:
Figure BDA00030974719500001114
Figure BDA00030974719500001115
Figure BDA00030974719500001116
Figure BDA0003097471950000121
Figure BDA0003097471950000122
Figure BDA0003097471950000123
Figure BDA0003097471950000124
wherein [Figure BDA0003097471950000125] and [Figure BDA0003097471950000126] are learned parameters; D[j] is the first attribute value vector representation, i.e. the feature vector corresponding to the j-th attribute after information compression; C[j] is the context vector representation, i.e. the feature vector of the context corresponding to the j-th attribute, obtained as a weighted average of the compressed D using the attention coefficients; α_{j,i} denotes the attention coefficient; v_i denotes the i-th attribute value in the entity record; the vector [Figure BDA0003097471950000127] is the key representation of the i-th attribute value; and the vector [Figure BDA0003097471950000128] is the query of the j-th attribute value. The dot product of the vector [Figure BDA0003097471950000129] and the vector [Figure BDA00030974719500001210] gives the score between the i-th and j-th attributes (the query of the j-th attribute against the key of the i-th attribute), which is then used to calculate the attention coefficient.
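The query/key computation described above can be sketched as follows. This is an illustrative reading of the text, not the formulas in the images: the projection names `Wq` and `Wk`, the softmax normalisation, the exclusion of an attribute's own key from its context, and the concatenation order (context first, then original) are all assumptions. The sketch requires at least two attributes.

```python
import math

def project_rows(A, W):
    """Multiply each row of A (m x d) by W (d x k), returning an m x k list."""
    return [[sum(r[t] * W[t][c] for t in range(len(r))) for c in range(len(W[0]))]
            for r in A]

def inter_attribute_attention(D, Wq, Wk):
    """Return the second attribute value vectors [C[j] | D[j]] and the coefficients.

    D:  m x d matrix; row j is the compressed feature vector of attribute j
    Wq: d x k query projection, Wk: d x k key projection (hypothetical names)
    """
    Q, K = project_rows(D, Wq), project_rows(D, Wk)
    m = len(D)
    alpha, C = [], []
    for j in range(m):
        # Score of attribute j's query against every other attribute's key;
        # the attribute's own key is excluded (an assumption of this sketch).
        scores = [sum(q * k for q, k in zip(Q[j], K[i])) if i != j else -math.inf
                  for i in range(m)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        alpha.append(row)
        # Context vector: attention-weighted average of the rows of D.
        C.append([sum(row[i] * D[i][c] for i in range(m)) for c in range(len(D[0]))])
    return [C[j] + D[j] for j in range(m)], alpha

# Usage: three attributes with 2-dimensional compressed vectors.
D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
second, alpha = inter_attribute_attention(D, identity, identity)
```

Each output row has twice the width of a row of D: the first half is the context recovered from the other attributes, the second half is the attribute's own vector.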
Finally, a multi-layer perceptron (MLP) in the output layer represents each entity record as a feature matrix [Figure BDA00030974719500001214]. In this computation, the MLPs of different attributes do not share parameters, specifically:
Figure BDA00030974719500001211
It should be noted that, in the present invention, the heterogeneous information fusion model acts as a function [Figure BDA00030974719500001212] that transforms one entity record [Figure BDA00030974719500001213] into a feature matrix, denoted as [Figure BDA0003097471950000131].
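The point that the output-layer MLPs do not share parameters across attributes can be illustrated with a small sketch. The single hidden layer, the ReLU activation, the layer sizes and the random initialisation are assumptions made for illustration; the source only states that each attribute has its own MLP.

```python
import random

def make_mlp(d_in, d_hidden, d_out, rng):
    """Create randomly initialised weights for a one-hidden-layer perceptron."""
    def layer(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W1": layer(d_in, d_hidden), "b1": [0.0] * d_hidden,
            "W2": layer(d_hidden, d_out), "b2": [0.0] * d_out}

def mlp_forward(params, x):
    """Apply linear -> ReLU -> linear to one input vector."""
    hidden = [
        max(0.0, sum(x[t] * params["W1"][t][c] for t in range(len(x))) + params["b1"][c])
        for c in range(len(params["b1"]))
    ]
    return [
        sum(hidden[t] * params["W2"][t][c] for t in range(len(hidden))) + params["b2"][c]
        for c in range(len(params["b2"]))
    ]

def output_layer(second_vectors, mlps):
    """Map attribute j's vector through attribute j's own MLP; the result is row j of H."""
    return [mlp_forward(mlps[j], v) for j, v in enumerate(second_vectors)]

# Usage: two attributes, each with its own independently initialised MLP.
rng = random.Random(0)
mlps = [make_mlp(4, 8, 3, rng) for _ in range(2)]
H = output_layer([[1.0, 0.0, 0.5, 0.5], [1.0, 0.0, 0.5, 0.5]], mlps)
```

Because each attribute owns a separate parameter set, the same input vector is in general mapped to different rows of H depending on which attribute it belongs to.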
On the basis of the embodiment, the key attribute tree model consists of leaf nodes and internal nodes, wherein the internal nodes comprise a key attribute, a vector distance function and a distance threshold; and the leaf node records an entity matching result.
In the present invention, the feature matrices H_1, H_2 of a pair of entity records [Figure BDA0003097471950000132] are input into the key attribute tree model, which outputs whether the two entity records match. The key attribute tree is a special binary tree composed of leaf nodes and internal nodes: each internal node contains a key attribute, a vector distance function and a distance threshold, and each leaf node records an entity matching result. Inference with the key attribute tree uses the input feature matrices H_1, H_2 to find a path from the root node to a leaf node, and the result recorded at that leaf node is taken as the final matching result.
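The root-to-leaf inference just described can be sketched as below. The class names, the Euclidean distance, and the convention that a distance at or below the threshold follows the `near` branch are assumptions for illustration; the source specifies only that each internal node holds a key attribute, a vector distance function and a distance threshold.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Union

Vector = List[float]

@dataclass
class Leaf:
    match: bool  # entity matching result recorded at the leaf

@dataclass
class Internal:
    attribute: int                               # index of the key attribute (row of H)
    distance: Callable[[Vector, Vector], float]  # vector distance function
    threshold: float                             # distance threshold
    near: Union["Internal", Leaf]                # taken when distance <= threshold
    far: Union["Internal", Leaf]                 # taken otherwise

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def infer(node, H1, H2):
    """Follow one root-to-leaf path and return the leaf's matching result."""
    while isinstance(node, Internal):
        d = node.distance(H1[node.attribute], H2[node.attribute])
        node = node.near if d <= node.threshold else node.far
    return node.match

# Usage: a depth-2 tree that requires attributes 0 and 1 to be close.
tree = Internal(0, euclidean, 0.5,
                Internal(1, euclidean, 0.5, Leaf(True), Leaf(False)),
                Leaf(False))
H1 = [[0.0, 0.0], [1.0, 1.0]]
H2 = [[0.1, 0.0], [1.0, 1.2]]
H3 = [[2.0, 2.0], [1.0, 1.0]]
same = infer(tree, H1, H2)
different = infer(tree, H1, H3)
```

Every path through such a tree reads as a conjunction of per-attribute distance tests, which is what allows a learned tree to be converted into a human-readable matching rule.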
On the basis of the above embodiment, before the constructing the first training sample set according to the sample entity record, the method further includes:
masking attribute values in the sample entity records according to a preset masking condition, so as to construct the first training sample set from the masked sample entity records.
In the present invention, part of the attribute values in an entity record are masked according to a preset masking condition (for example, masking the attribute values within a certain range), and the model is made to recover the masked attribute values. The neural network is trained in this way to obtain the heterogeneous information fusion model.
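As an illustration of how masked training samples might be built, the sketch below masks each attribute value independently with a fixed probability and keeps the original value as the recovery target. The `[MASK]` token, the probability-based masking condition and the dictionary record format are assumptions; the source leaves the preset masking condition open.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token

def mask_record(record, mask_prob, rng):
    """Return (masked_record, targets) for one entity record.

    record:    dict mapping attribute name -> attribute value
    mask_prob: probability of masking each attribute (an assumed masking condition)
    targets:   original values of the masked attributes, which the model must recover
    """
    masked, targets = {}, {}
    for attribute, value in record.items():
        if rng.random() < mask_prob:
            masked[attribute] = MASK
            targets[attribute] = value
        else:
            masked[attribute] = value
    return masked, targets

# Usage: build one self-supervised training sample from an unlabeled record.
record = {"title": "wireless mouse", "brand": "Acme", "price": "19.99"}
masked, targets = mask_record(record, 0.5, random.Random(0))
```

No matching labels are needed at this stage: the supervision signal comes from the record itself, which is why the fusion model can be trained on unlabeled entity records.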
In one embodiment, the performance of the entity record matching model of the present invention is evaluated on a standard entity matching dataset and on an e-commerce entity matching dataset. Specifically, 1% to 10% of the entity pairs are selected as annotated data, and the performance of the entity record matching model is evaluated in this weakly supervised setting. Experimental results show that, under weak supervision, the entity record matching method provided by the invention achieves an F1 score significantly higher than those of Magellan and DeepMatcher. Moreover, the depth of the key attribute tree does not exceed 3 layers, which shows that high matching quality can be obtained by comparing only a very small number of attributes, thereby demonstrating the effectiveness of the entity record matching method of the invention.
The entity record matching method provided by the invention does not rely on manual data cleaning and can clean the data automatically by using the heterogeneous information fusion model obtained through neural network training; meanwhile, it can construct a key attribute tree, automatically extract key attributes, and provide matching results with good interpretability; in addition, the invention requires few labeled matching pairs and can achieve high labeling quality with only a small number of labeled entity pairs.
Fig. 2 is a schematic structural diagram of an entity record matching system provided by the present invention. As shown in Fig. 2, the present invention provides an entity record matching system, which includes an entity obtaining module 201 and an entity matching module 202, where the entity obtaining module 201 is configured to obtain an entity record set to be matched, and an entity record in the entity record set is composed of attributes and attribute values of an entity; the entity matching module 202 is configured to input the entity record set into a trained entity record matching model to obtain a matching result between entity records in the entity record set, where the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
The entity record matching system provided by the invention converts the entity into the attribute value vector through the neural network, overcomes the defect of poor deep learning interpretability by utilizing the automatically constructed key attribute tree, can convert the learned key attribute tree into the matching rule, and applies the matching rule to other data sets; meanwhile, the training of the corresponding model of the invention only needs a small number of labeled entity record pairs, and overcomes the defect that the existing method needs a large number of labeled entity record pairs.
On the basis of the above embodiment, the system further comprises a first sample set construction module, a heterogeneous information fusion training module, a second sample set construction module, a key attribute tree training module and a model construction module. The first sample set construction module is configured to construct a first training sample set from the sample entity records; the heterogeneous information fusion training module is configured to train the neural network on the first training sample set using a self-supervised learning method to obtain a heterogeneous information fusion model, which converts entity records into entity record representation vectors; the second sample set construction module is configured to construct a second training sample set from the sample entity record representation vectors; the key attribute tree training module is configured to train a decision tree model on the second training sample set using a decision tree algorithm to obtain a key attribute tree model, which classifies entity records and matches entity records with the same classification result to the same entity; and the model construction module is configured to construct the trained entity record matching model from the heterogeneous information fusion model and the key attribute tree model.
The system provided by the present invention is used for executing the above method embodiments, and for the specific processes and details, reference is made to the above embodiments, which are not described herein again.
Fig. 3 is a schematic structural diagram of an electronic device provided by the present invention. As shown in Fig. 3, the electronic device may include: a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with each other through the communication bus 304. The processor 301 may invoke logic instructions in the memory 303 to perform an entity record matching method comprising: acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
In addition, the logic instructions in the memory 303 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the entity record matching method provided by the above method embodiments, the method comprising: acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
In still another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the entity record matching method provided by the above embodiments, the method including: acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. An entity record matching method, comprising:
acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities;
inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm;
the trained entity record matching model is obtained through the following steps:
constructing a first training sample set according to the sample entity record;
training a neural network through the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, wherein the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
constructing a second training sample set according to the sample entity record representation vector;
training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, wherein the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;
constructing a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model;
wherein training the neural network through the first training sample set based on the self-supervised learning method to obtain the heterogeneous information fusion model comprises the following steps:
after the sample entity records are input into a neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;
converting the attribute value matrix into an attribute feature vector through an attention mechanism based on an intra-attribute attention layer of the neural network;
based on an inter-attribute attention layer of the neural network, obtaining a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation through an attention mechanism according to the attribute feature vector, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;
and performing iterative training on the multilayer perceptron of the neural network output layer through the second attribute value vector representation until a preset training condition is met to obtain a heterogeneous information fusion model, and outputting a sample entity record representation vector.
2. The entity record matching method according to claim 1, wherein the attribute feature vector is calculated by the formula:
Figure FDA0003920493090000021
wherein [Figure FDA0003920493090000022] is the feature matrix of one entity record, consisting of m attribute feature vectors; the index [j] selects the feature vector corresponding to the j-th attribute; [Figure FDA0003920493090000023] and [Figure FDA0003920493090000024] are two sets of trainable parameters for the j-th attribute; [Figure FDA0003920493090000025] denotes the j-th attribute; [Figure FDA0003920493090000026] denotes the attention weights of the [Figure FDA0003920493090000027] words of the j-th attribute; [Figure FDA0003920493090000028] denotes the number of words in the attribute value corresponding to the j-th attribute; [Figure FDA0003920493090000029] denotes the compressed feature vectors; and d_emb denotes the dimension of the word vectors;
the calculation formula represented by the second attribute value vector is as follows:
Figure FDA00039204930900000210
Figure FDA00039204930900000211
Figure FDA00039204930900000212
Figure FDA00039204930900000213
Figure FDA0003920493090000031
Figure FDA0003920493090000032
Figure FDA0003920493090000033
wherein [Figure FDA0003920493090000034] and [Figure FDA0003920493090000035] are learned parameters; D[j] is the first attribute value vector representation, i.e. the feature vector corresponding to the j-th attribute after information compression; C[j] is the context vector representation, i.e. the feature vector of the context corresponding to the j-th attribute; α_{j,i} denotes the attention coefficient; v_i denotes the i-th attribute value in the entity record; the vector [Figure FDA0003920493090000036] is the key representation of the i-th attribute value; and the vector [Figure FDA0003920493090000037] is the query of the j-th attribute value.
3. The entity record matching method according to claim 1, wherein the key attribute tree model is composed of leaf nodes and internal nodes, the internal nodes containing a key attribute, a vector distance function and a distance threshold; and the leaf node records an entity matching result.
4. The entity record matching method of claim 1, wherein prior to said constructing a first training sample set from sample entity records, said method further comprises:
masking attribute values in the sample entity records according to a preset masking condition, so as to construct the first training sample set from the masked sample entity records.
5. An entity record matching system, comprising:
an entity acquisition module, configured to acquire an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities;
an entity matching module, configured to input the entity record set into a trained entity record matching model to obtain a matching result between entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm;
the system further comprises:
the first sample set construction module is used for constructing a first training sample set according to the sample entity record;
the heterogeneous information fusion training module is used for training the neural network through the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, and the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;
the second sample set construction module is used for constructing a second training sample set according to the sample entity record representation vector;
the key attribute tree training module is used for training the decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, and the key attribute tree model is used for classifying the entity records and matching the entity records with the same classification result into the same entity;
the model construction module is used for constructing and obtaining a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model;
the heterogeneous information fusion training module is specifically used for:
after the sample entity records are input into the neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;
converting the attribute value matrix into attribute feature vectors through an attention mechanism based on an intra-attribute attention layer of the neural network;
based on an inter-attribute attention layer of the neural network, obtaining a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation according to the attribute feature vector through an attention mechanism, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;
and performing iterative training on the multilayer perceptron of the neural network output layer through the second attribute value vector representation until a preset training condition is met to obtain a heterogeneous information fusion model, and outputting a sample entity record representation vector.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the entity record matching method according to any one of claims 1 to 4 when executing the computer program.
7. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the entity record matching method according to any one of claims 1 to 4.
CN202110614418.1A 2021-06-02 2021-06-02 Entity record matching method and system Active CN114781471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110614418.1A CN114781471B (en) 2021-06-02 2021-06-02 Entity record matching method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110614418.1A CN114781471B (en) 2021-06-02 2021-06-02 Entity record matching method and system

Publications (2)

Publication Number Publication Date
CN114781471A CN114781471A (en) 2022-07-22
CN114781471B true CN114781471B (en) 2022-12-27

Family

ID=82424310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110614418.1A Active CN114781471B (en) 2021-06-02 2021-06-02 Entity record matching method and system

Country Status (1)

Country Link
CN (1) CN114781471B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107545033A (en) * 2017-07-24 2018-01-05 清华大学 A kind of computational methods based on the knowledge base entity classification for representing study
CN108268643A (en) * 2018-01-22 2018-07-10 北京邮电大学 A kind of Deep Semantics matching entities link method based on more granularity LSTM networks
CN108427714A (en) * 2018-02-02 2018-08-21 北京邮电大学 The source of houses based on machine learning repeats record recognition methods and system
CN109408555A (en) * 2018-09-19 2019-03-01 智器云南京信息科技有限公司 Data type recognition methods and device, data storage method and device
CN110334212A (en) * 2019-07-01 2019-10-15 南京审计大学 A kind of territoriality audit knowledge mapping construction method based on machine learning
CN111339872A (en) * 2020-02-18 2020-06-26 国网信通亿力科技有限责任公司 Power grid fault classification method based on classification model
CN111522961A (en) * 2020-04-09 2020-08-11 武汉理工大学 Attention mechanism and entity description based industrial map construction method
CN112906396A (en) * 2021-04-01 2021-06-04 翻车信息科技(杭州)有限公司 Cross-platform commodity matching method and system based on natural language processing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200073953A1 (en) * 2018-08-30 2020-03-05 Salesforce.Com, Inc. Ranking Entity Based Search Results Using User Clusters
CN110991185A (en) * 2019-11-05 2020-04-10 北京声智科技有限公司 Method and device for extracting attributes of entities in article

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
《Entity Matching in Online Social Networks》;Olga Peled等;《SocialCom/PASSAT/BigData/EconCom/BioMedCom 2013》;20131231;第339-344页 *
《Entity Matching on Unstructured Data: An Active Learning Approach》;Ursin Brunner等;《2019 6th Swiss Conference on Data Science (SDS)》;20191231;第97-102页 *
《征信系统中实体匹配方法及应用研究》;陈波;《中国博士学位论文全文数据库 经济与管理科学辑》;20100915(第9期);第J145-31页 *

Also Published As

Publication number Publication date
CN114781471A (en) 2022-07-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant