CN114781471B

CN114781471B - Entity record matching method and system

Info

Publication number: CN114781471B
Application number: CN202110614418.1A
Authority: CN
Inventors: 李涓子; 姚子俊; 侯磊; 吕鑫; 唐杰; 张鹏; 许斌; 戴泽林; 张亦弛
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2021-06-02
Filing date: 2021-06-02
Publication date: 2022-12-27
Anticipated expiration: 2041-06-02
Also published as: CN114781471A

Abstract

The invention provides an entity record matching method and system. The method comprises: acquiring an entity record set to be matched, wherein the entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm. The invention converts entity records into attribute value vectors through the neural network, overcomes the poor interpretability of deep learning by using an automatically constructed key attribute tree, and can convert the learned key attribute tree into matching rules that can be applied to other data sets. In addition, training the model requires only a small number of labeled entity record pairs, which overcomes the need of existing methods for a large number of labeled entity record pairs.

Description

Entity record matching method and system

Technical Field

The invention relates to the technical field of entity matching, in particular to an entity record matching method and system.

Background

With the popularization and development of the Internet, the amount of data stored on the open-domain Internet is increasing rapidly, and real-world entities are stored in the databases that make up the Internet in the form of entity records. However, these stored entity records contain a large amount of redundancy, and the same entity may be stored as multiple entity records. Taking an e-commerce platform as an example, the same product is sold by different merchants, so the same product is registered as different commodity records. Here, a product corresponds to a real-world entity, and a registered commodity record corresponds to an entity record.

In order to identify the entity records of the same entity, it is necessary to determine whether two given entity records refer to the same entity. This task is abstracted as the entity record matching task, also called the entity matching task. The input of the task is two entity records represented as attribute-attribute value pairs, and the output is a matching result indicating whether the two records refer to the same entity.

Current methods for the entity matching task fall into two categories. The first category uses data feature engineering to construct matching features manually and matches records of the same entity based on rules. The second category treats the attribute values in the entity records as natural language, represents the entity records as semantic vectors using a neural network, and trains a binary classifier on the vectorized representations and labeled entity record pairs. However, both categories require manual intervention, so the algorithm cannot be fully automated, and the hand-built feature engineering cannot be transferred to a new data set, which means that the existing methods generalize poorly to new data sets.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides an entity record matching method and system.

The invention provides an entity record matching method, which comprises the following steps:

acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities;

and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.

According to the entity record matching method provided by the invention, the trained entity record matching model is obtained through the following steps:

constructing a first training sample set according to the sample entity record;

training a neural network through the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, wherein the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;

constructing a second training sample set according to the sample entity record representation vector;

training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, wherein the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;

and constructing a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model.

According to the entity record matching method provided by the invention, training the neural network through the first training sample set based on the self-supervised learning method to obtain the heterogeneous information fusion model comprises the following steps:

after the sample entity records are input into a neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;

converting the attribute value matrix into an attribute feature vector through an attention mechanism, based on an intra-attribute attention layer of the neural network;

based on an inter-attribute attention layer of the neural network, obtaining a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation through an attention mechanism according to the attribute feature vector, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;

and performing iterative training on the multilayer perceptron of the neural network output layer through the second attribute value vector representation until a preset training condition is met, obtaining a heterogeneous information fusion model, and outputting a sample entity record representation vector.

According to the entity record matching method provided by the invention, the calculation formula of the attribute feature vector is as follows:

[Formula not reproduced in the source text]

wherein $D$ is the feature matrix of an entity record, consisting of m attribute feature vectors; $D[j]$ represents the feature vector corresponding to the jth attribute; the jth attribute has two sets of trainable parameters; each word in the attribute value of the jth attribute has its own attention weight, the number of weights being the number of words in the attribute value corresponding to the jth attribute; the word vectors are mapped to compressed feature vectors; and $d_{emb}$ represents the dimension of a word vector;

the calculation formula of the second attribute value vector representation is as follows:

[Formula not reproduced in the source text]

wherein two sets of learned parameters are used; $D[j]$, the first attribute value vector representation, is the feature vector corresponding to the jth attribute after information compression; $C[j]$, the context vector representation, is the feature vector of the context corresponding to the jth attribute; $\alpha_{j,i}$ denotes the attention coefficient; $v_i$ represents the ith attribute value in an entity record; the vector $k_{v_i}$ is the key of the ith attribute value; and the vector $q_{u_j}$ is the query of the jth attribute value.

According to the entity record matching method provided by the invention, the key attribute tree model consists of leaf nodes and internal nodes, wherein the internal nodes comprise a key attribute, a vector distance function and a distance threshold; and the leaf node records an entity matching result.

According to an entity record matching method provided by the present invention, before the constructing a first training sample set according to sample entity records, the method further includes:

and masking the attribute values in the sample entity records according to a preset masking condition, so as to construct the first training sample set from the masked sample entity records.

The invention also provides an entity record matching system, comprising:

an entity acquisition module, used for acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities;

and an entity matching module, used for inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.

According to an entity record matching system provided by the invention, the system further comprises:

the first sample set construction module is used for constructing a first training sample set according to the sample entity record;

the heterogeneous information fusion training module is used for training the neural network through the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, and the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;

the second sample set construction module is used for constructing a second training sample set according to the sample entity record representation vector;

the key attribute tree training module is used for training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, and the key attribute tree model is used for classifying entity records and matching the entity records with the same classification result into the same entity;

and the model construction module is used for constructing and obtaining a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the entity record matching method as described in any one of the above.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the entity record matching method as described in any of the above.

According to the entity record matching method and system provided by the invention, the entity is converted into the attribute value vector through the neural network, the defect of poor deep learning interpretability is overcome by utilizing the automatically constructed key attribute tree, the learned key attribute tree can be converted into the matching rule, and the matching rule is applied to other data sets; meanwhile, the training of the corresponding model of the invention only needs a small number of labeled entity record pairs, and overcomes the defect that the prior method needs a large number of labeled entity record pairs.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of an entity record matching method according to the present invention;

FIG. 2 is a schematic structural diagram of an entity record matching system according to the present invention;

FIG. 3 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.

At present, there are two types of methods for implementing the entity matching task. The first type uses data feature engineering to construct matching features manually and matches records of the same entity based on rules. For example, the paper "Magellan: Toward Building Entity Matching Management Systems" designs an entity matching system, called Magellan, that relies on domain experts to design the data feature engineering. The system integrates tools for data cleaning, data representation, matching rule design and the like, and assists domain experts in constructing an entity matching pipeline.

The second type of method treats the attribute values in entity records as natural language, represents the entity records as semantic vectors using a neural network, and trains a binary classifier on the vectorized representations and labeled entity record pairs. For example, the paper "Deep Learning for Entity Matching: A Design Space Exploration" proposes training a neural network called DeepMatcher end to end to perform the entity matching task. In this approach, the words in each attribute value of an entity record are first represented as word vectors, and the word vectors are then used to extract an attribute value vector for each attribute value. The neural network then compares the corresponding attribute value vectors of the two entity records being matched to obtain a comparison result vector for each attribute, and finally applies a differentiable binary classifier to the comparison results to obtain the final matching result.

Both existing entity record matching methods have drawbacks. The Magellan system depends on the intervention of human experts to help design the data feature engineering scheme, so the algorithm cannot be fully automated; in addition, the feature engineering cannot be transferred to a new data set, which means that the method generalizes poorly to new data sets.

DeepMatcher uses a neural network for feature extraction. However, because it is trained end to end, a large number of matching and non-matching entity record pairs must be labeled, so DeepMatcher also requires substantial manual labeling effort to generalize well. Another drawback is poor interpretability: because the model is built by end-to-end training, it cannot provide explicit entity record matching rules.

FIG. 1 is a schematic flow diagram of an entity record matching method provided by the present invention. As shown in FIG. 1, the present invention provides an entity record matching method, including:

Step 101, acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities;

In the present invention, the complete set of attributes defining an entity contains m attributes, where m is the number of attributes. $v_i$ represents the attribute value corresponding to the ith attribute. An entity record is thus formed from attribute-attribute value pairs.

Step 102, inputting the entity record set into a trained entity record matching model to obtain a matching result between entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.

In the present invention, the entity matching task requires an entity matching algorithm EM(·), a function that makes a judgment on any two entity records: it gives a matching result if and only if the two entity records refer to the same real-world entity; otherwise, it gives a non-matching result.

The invention trains a neural network by a self-supervised learning method and a decision tree model by a decision tree algorithm, and constructs the entity record matching model from the two. When an entity record set is input into the model, the entity records in the set are matched, yielding a matching result that indicates whether different entity records belong to the same entity.
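As a minimal sketch of how a pairwise matcher can be applied to a whole entity record set, the following Python example enumerates all record pairs and keeps those judged to match. The names `match_record_set` and `toy_em`, the attribute names, and the sample records are hypothetical; `toy_em` is only a placeholder and does not represent the trained model.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]  # attribute name -> attribute value


def match_record_set(
    records: List[Record],
    em: Callable[[Record, Record], bool],
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, that the matcher judges to be the same entity."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if em(a, b)
    ]


def toy_em(a: Record, b: Record) -> bool:
    """Placeholder matcher (NOT the trained model): compares a hypothetical 'title'."""
    return a["title"].strip().lower() == b["title"].strip().lower()


records = [
    {"title": "Acme Kettle", "volume": "1 liter"},
    {"title": "acme kettle ", "volume": "1000 ml"},
    {"title": "Acme Toaster", "volume": ""},
]
matches = match_record_set(records, toy_em)
```

Enumerating all pairs is quadratic in the number of records; the source does not state how candidate pairs are selected, so this exhaustive loop is an assumption made for brevity.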

According to the entity record matching method and system provided by the invention, the entity is converted into the attribute value vector through the neural network, the defect of poor deep learning interpretability is overcome by utilizing the automatically constructed key attribute tree, the learned key attribute tree can be converted into the matching rule, and the matching rule is applied to other data sets; meanwhile, the training of the corresponding model of the invention only needs a small number of labeled entity record pairs, and overcomes the defect that the existing method needs a large number of labeled entity record pairs.

On the basis of the above embodiment, the trained entity record matching model is obtained by the following steps:

constructing a first training sample set according to the sample entity record;

training the neural network through the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, wherein the heterogeneous information fusion model is used for converting entity records into entity record representation vectors;

In the invention, the trained entity record matching model provides two main functions, obtained by training a neural network and a decision tree model respectively. The first function is heterogeneous information fusion, which unifies attribute values that differ in form but share the same semantics in a common vector space. This is needed because entity matching requires comparing attribute values, and the attribute values in raw entity records are often heterogeneous: the attribute values of two different entity records cannot be compared directly, and comparing heterogeneous attribute values directly often produces a wrong result. For example, given the attribute values "1 liter", "1000 ml" and "100 ml", a direct comparison may judge "1000 ml" and "100 ml" to be closer because they share more characters, while judging "1 liter" and "1000 ml" to be quite different. In the invention, the neural network is trained by a self-supervised learning method and converts entity records into representation vectors. Specifically, each entity record is composed of a series of attribute values, and heterogeneous information fusion converts each attribute value into an attribute value vector.
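The pitfall of comparing heterogeneous attribute values directly can be illustrated with a short Python sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for a surface-level string comparison; this choice of similarity measure is an assumption made for illustration and is not part of the described method.

```python
from difflib import SequenceMatcher


def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a, b).ratio()


# "1 liter" and "1000 ml" denote the same volume but share few characters.
same_volume = surface_similarity("1 liter", "1000 ml")
# "1000 ml" and "100 ml" denote different volumes but look almost identical.
different_volume = surface_similarity("1000 ml", "100 ml")

# The surface score ranks the semantically wrong pair higher.
misleading = different_volume > same_volume
```

A representation that places "1 liter" and "1000 ml" close together in a vector space avoids this ranking error, which is the role the text assigns to heterogeneous information fusion.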

Further, a key attribute tree model is constructed by training with a decision tree algorithm on labeled entity record pairs, wherein the nodes of the key attribute tree are attributes. For an input pair of entity records, the key attribute tree model compares the attribute value vectors corresponding to the attributes on the nodes. A path from the root to a leaf node is selected according to the comparison results, and the matching result is determined by the leaf node at the end of the path. After the heterogeneous information fusion model and the key attribute tree model have been trained, the two models are combined to obtain the trained entity record matching model.
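One way a decision tree algorithm can pick a key attribute and a distance threshold for a single node is sketched below. The source does not name the algorithm or the split criterion; the CART-style weighted Gini impurity, the function names `gini` and `best_split`, and the toy data are all assumptions. Each row of `distances` holds the per-attribute vector distances for one labeled record pair.

```python
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of binary labels (1 = match, 0 = non-match)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(distances: List[List[float]], labels: List[int]) -> Tuple[int, float]:
    """Return (attribute index, threshold) minimizing the weighted Gini impurity.

    distances[n][j] is the distance between the two records of pair n on attribute j.
    """
    n_pairs = len(labels)
    best = (float("inf"), 0, 0.0)
    for j in range(len(distances[0])):
        values = sorted({row[j] for row in distances})
        # Candidate thresholds lie midway between consecutive observed distances.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for row, y in zip(distances, labels) if row[j] <= threshold]
            right = [y for row, y in zip(distances, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_pairs
            if score < best[0]:
                best = (score, j, threshold)
    return best[1], best[2]


# Toy data: attribute 0 separates matches (small distance) from non-matches.
distances = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.8], [0.9, 0.2]]
labels = [1, 1, 0, 0]
key_attribute, threshold = best_split(distances, labels)
```

Applying the same search recursively to the pairs on each side of the split would grow a full tree; only the single-node step is shown here.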

On the basis of the above embodiment, training the neural network through the first training sample set to obtain a heterogeneous information fusion model, including:

converting the attribute value matrix into attribute feature vectors through an attention mechanism, based on an intra-attribute attention layer of the neural network;

based on an inter-attribute attention layer of the neural network, obtaining a first attribute value vector representation and a context vector representation corresponding to the first attribute value vector representation according to the attribute feature vector through an attention mechanism, and concatenating the first attribute value vector representation and the context vector representation to construct a second attribute value vector representation;

and performing iterative training on the multilayer perceptron of the neural network output layer through the second attribute value vector representation until a preset training condition is met to obtain a heterogeneous information fusion model, and outputting a sample entity record representation vector.

In the invention, the heterogeneous information fusion model converts the attribute values in the entity records into attribute value vectors. The model consists of a word vector embedding layer, an intra-attribute attention layer, an inter-attribute attention layer and an output layer. Specifically, the word vector embedding layer converts each word of the attribute values in the entity record into a vector. Given an attribute whose attribute value is $v = [w_1, w_2, \ldots, w_T]$, that is, an attribute value of T words, the word vector embedding layer first inserts the symbols beg and end at the head and the tail of the attribute value to mark the beginning and the end of the sentence. It then appends the padding symbol <pad> to the end of the attribute value until the length reaches a fixed value, so that in the new representation all words following end are <pad>. Further, the word vector embedding layer converts the words into pre-trained word vectors by table lookup. The attribute value is thus converted into an attribute value matrix of word vectors, where $d_{emb}$ is the dimension of a word vector.
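The preprocessing performed by the word vector embedding layer can be sketched as follows. The spellings `<beg>` and `<end>` for the marker symbols, the truncation rule for over-long values, the zero vector for unknown words, and the toy embedding table are assumptions; the source only states that the markers are inserted, the value is padded to a fixed length, and pre-trained vectors are looked up.

```python
from typing import Dict, List

BEG, END, PAD = "<beg>", "<end>", "<pad>"


def pad_tokens(words: List[str], max_len: int) -> List[str]:
    """Wrap the words in begin/end markers, then right-pad to max_len tokens."""
    tokens = [BEG] + words + [END]
    if len(tokens) > max_len:
        # Truncation policy is an assumption; the source does not describe it.
        tokens = tokens[: max_len - 1] + [END]
    return tokens + [PAD] * (max_len - len(tokens))


def embed(tokens: List[str], table: Dict[str, List[float]], d_emb: int) -> List[List[float]]:
    """Look up each token; unknown tokens map to a zero vector (an assumption)."""
    zero = [0.0] * d_emb
    return [list(table.get(tok, zero)) for tok in tokens]


d_emb = 3
table = {  # toy stand-in for pre-trained word vectors
    BEG: [1.0, 0.0, 0.0],
    END: [0.0, 1.0, 0.0],
    PAD: [0.0, 0.0, 0.0],
    "1000": [0.2, 0.1, 0.9],
    "ml": [0.5, 0.5, 0.1],
}
tokens = pad_tokens("1000 ml".split(), max_len=6)
matrix = embed(tokens, table, d_emb)  # attribute value matrix, 6 rows of d_emb values
```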

Further, the intra-attribute attention layer converts each attribute value matrix into an attribute feature vector by using an attention mechanism. Specifically, the feature vector of each word is converted into a compressed feature vector through a linear mapping, the weight of each word is then extracted, and the attribute feature vector is calculated as the weighted average of the compressed feature vectors under the word weights. The calculation formula of the attribute feature vector is as follows:

[Formula not reproduced in the source text]

wherein $D$ is the feature matrix of an entity record, consisting of m attribute feature vectors; $D[j]$ represents the feature vector corresponding to the jth attribute and is the row vector in the jth row of the feature matrix; the jth attribute has two sets of trainable parameters; each word in the attribute value of the jth attribute has its own attention weight; the word vectors are mapped to compressed feature vectors; and $d_{emb}$ represents the dimension of a word vector.
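The intra-attribute attention step described above (linear compression, per-word weights, weighted average) can be sketched in plain Python. Because the formula itself is not available in the text, the parameter names `W` (compression matrix) and `u` (scoring vector) and the use of a softmax to normalize the word weights are assumptions standing in for the two sets of trainable parameters of one attribute.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def project(rows: Matrix, weights: Matrix) -> Matrix:
    """Multiply each row vector (length n) by an n x d weight matrix."""
    return [[dot(row, col) for col in zip(*weights)] for row in rows]


def intra_attribute_attention(E: Matrix, W: Matrix, u: Vector) -> Vector:
    """Pool a T x d_emb attribute value matrix into one attribute feature vector."""
    compressed = project(E, W)                       # T compressed feature vectors
    weights = softmax([dot(row, u) for row in E])    # one attention weight per word
    d = len(compressed[0])
    return [sum(b * row[k] for b, row in zip(weights, compressed)) for k in range(d)]


E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 words, d_emb = 2
W = [[1.0, 0.0], [0.0, 1.0]]              # identity map, so compression is a no-op
pooled = intra_attribute_attention(E, W, [0.0, 0.0])   # zero scores: uniform weights
focused = intra_attribute_attention(E, W, [5.0, -5.0])  # favors the first word
```

With zero scores the result is the plain average of the rows, which makes the weighted-average behavior easy to check by hand.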

Furthermore, the inter-attribute attention layer uses an inter-attribute attention mechanism to recover the information of an attribute value from the attribute values of the other attributes. After passing through the inter-attribute attention layer, each attribute value in each entity record is represented by a vector (the second attribute value vector representation) that is divided into two parts: a context part (the context vector representation) and the original part of the attribute value itself (the first attribute value vector representation). Specifically, an entity record is represented by the matrix obtained by concatenating the two parts, where | denotes the row-wise concatenation of matrices. The calculation formula of the second attribute value vector representation is as follows:

[Formula not reproduced in the source text]

wherein two sets of learned parameters are used; $D[j]$, the first attribute value vector representation, is the feature vector corresponding to the jth attribute after information compression; $C[j]$, the context vector representation, is the feature vector of the context corresponding to the jth attribute, obtained as the weighted average of the compressed information $D$ under the attention coefficients; $\alpha_{j,i}$ denotes the attention coefficient; $v_i$ represents the ith attribute value in the entity record; the vector $k_{v_i}$ is the key of the ith attribute value; and the vector $q_{u_j}$ is the query of the jth attribute value. The dot product of the key vector and the query vector gives the score between the query of the jth attribute and the key of the ith attribute, which is then used to calculate the attention coefficient.
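The inter-attribute attention step can be sketched as follows: query and key vectors are derived from the attribute feature vectors, their dot products are normalized into attention coefficients, the context vector is the weighted average of the other attributes' feature vectors, and the two parts are concatenated. The parameter names `W_q` and `W_k`, the softmax normalization, and the exclusion of an attribute from its own context are assumptions, since the formula is not available in the text.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def project(rows: Matrix, weights: Matrix) -> Matrix:
    return [[dot(row, col) for col in zip(*weights)] for row in rows]


def inter_attribute_attention(D: Matrix, W_q: Matrix, W_k: Matrix) -> Matrix:
    """Return one row [D[j] | C[j]] per attribute; requires at least two attributes."""
    Q, K = project(D, W_q), project(D, W_k)
    m, d = len(D), len(D[0])
    fused = []
    for j in range(m):
        # Assumption: attribute j attends only to the other attributes.
        others = [i for i in range(m) if i != j]
        alpha = softmax([dot(Q[j], K[i]) for i in others])
        context = [sum(a * D[i][k] for a, i in zip(alpha, others)) for k in range(d)]
        fused.append(list(D[j]) + context)  # concatenate original and context parts
    return fused


D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3 attribute feature vectors
zeros = [[0.0, 0.0], [0.0, 0.0]]          # zero projections give uniform attention
uniform_fused = inter_attribute_attention(D, zeros, zeros)
```

With uniform attention, the context of the first attribute is the mean of the other two rows, which gives a result that can be verified by hand.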

Finally, each entity record is represented as a feature matrix by the multilayer perceptron (MLP) of the output layer. In this calculation, the MLPs of different attributes do not share parameters, specifically:

[Formula not reproduced in the source text]

It should be noted that, in the present invention, the heterogeneous information fusion model acts as a function that transforms an entity record into a feature matrix.

On the basis of the embodiment, the key attribute tree model consists of leaf nodes and internal nodes, wherein the internal nodes comprise a key attribute, a vector distance function and a distance threshold; and the leaf node records an entity matching result.

In the present invention, the feature matrices $H_1$ and $H_2$ of a pair of entity records are input into the key attribute tree model, which outputs whether the two entity records match. The key attribute tree is a special binary tree consisting of leaf nodes and internal nodes: each internal node contains a key attribute, a vector distance function and a distance threshold, and each leaf node records an entity matching result. Entity matching inference with the key attribute tree finds a path from the root node to a leaf node according to the input feature matrices $H_1$ and $H_2$, and the result recorded at the leaf node on that path is the final matching result.
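Inference with a key attribute tree can be sketched with two small node types. The class names `Leaf` and `Internal`, the Euclidean distance, the convention that a distance at or below the threshold follows the `near` branch, and the toy tree are assumptions; the source only specifies that an internal node holds a key attribute, a vector distance function and a distance threshold, and that a leaf holds the matching result.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Union

Vector = List[float]
Matrix = List[Vector]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class Leaf:
    match: bool  # entity matching result recorded at the leaf


@dataclass
class Internal:
    attribute: int                                # row index of the key attribute
    distance: Callable[[Vector, Vector], float]   # vector distance function
    threshold: float                              # distance threshold
    near: "Node"                                  # followed when distance <= threshold
    far: "Node"                                   # followed otherwise


Node = Union[Leaf, Internal]


def infer(node: Node, H1: Matrix, H2: Matrix) -> bool:
    """Walk from the root to a leaf, comparing the key attribute at each internal node."""
    while isinstance(node, Internal):
        d = node.distance(H1[node.attribute], H2[node.attribute])
        node = node.near if d <= node.threshold else node.far
    return node.match


# Toy tree of depth 2: both key attributes must be close for a match.
tree = Internal(0, euclidean, 0.5,
                near=Internal(1, euclidean, 1.0, near=Leaf(True), far=Leaf(False)),
                far=Leaf(False))

H1 = [[0.0, 0.0], [1.0, 1.0]]
H2 = [[0.1, 0.0], [1.0, 1.5]]
H3 = [[2.0, 2.0], [1.0, 1.0]]
same = infer(tree, H1, H2)
different = infer(tree, H1, H3)
```

Each root-to-leaf path reads directly as a rule, for example "attribute 0 within 0.5 and attribute 1 within 1.0 implies a match", which is how a learned tree can be turned into explicit matching rules.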

On the basis of the above embodiment, before the constructing the first training sample set according to the sample entity record, the method further includes:

and masking the attribute values in the sample entity records according to a preset masking condition, so as to construct the first training sample set from the masked sample entity records.

In the invention, part of the attribute values in an entity record are masked according to a preset masking condition (for example, masking the attribute values within a certain range), and the neural network is trained to recover the masked attribute values, thereby obtaining the heterogeneous information fusion model.
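Constructing a self-supervised training sample by masking can be sketched as follows. The `<mask>` placeholder, the random selection of attributes at a fixed rate, and the function name `mask_record` are assumptions; the source leaves the preset masking condition unspecified.

```python
import random
from typing import Dict, Tuple

Record = Dict[str, str]
MASK = "<mask>"  # hypothetical placeholder token


def mask_record(
    record: Record, mask_rate: float, rng: random.Random
) -> Tuple[Record, Dict[str, str]]:
    """Hide a random subset of attribute values.

    Returns the masked record (model input) and the hidden values (recovery targets).
    """
    attrs = list(record)
    n_mask = max(1, round(mask_rate * len(attrs)))
    hidden = rng.sample(attrs, n_mask)
    masked = {a: (MASK if a in hidden else v) for a, v in record.items()}
    targets = {a: record[a] for a in hidden}
    return masked, targets


record = {
    "title": "Acme Kettle",
    "brand": "Acme",
    "volume": "1 liter",
    "color": "silver",
}
masked, targets = mask_record(record, mask_rate=0.25, rng=random.Random(0))
```

No labeled pairs are needed for this step, since the recovery targets come from the record itself.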

In one embodiment, the performance of the entity record matching model of the present invention is evaluated on a standard entity matching dataset and an e-commerce entity matching dataset. Specifically, 1% to 10% of the entity pairs are selected as labeled data, and the performance of the entity record matching model is evaluated in this weakly supervised setting. Experimental results show that, under weak supervision, the entity record matching method provided by the invention achieves an F1 score significantly higher than those of Magellan and DeepMatcher. Moreover, the depth of the key attribute tree does not exceed 3 levels, which shows that high matching quality can be obtained by comparing only a very small number of attributes, thereby demonstrating the effectiveness of the entity record matching method of the invention.
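For reference, the F1 score used in the evaluation is the harmonic mean of precision and recall over predicted matching pairs. The sketch below computes it for sets of record-index pairs; the function name and the sample pairs are made up for illustration and are not results from the source.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def f1_score(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Harmonic mean of precision and recall over predicted matching pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2.0 * precision * recall / (precision + recall)


predicted = {(0, 1), (2, 3), (4, 5)}
gold = {(0, 1), (2, 3), (6, 7), (8, 9)}
score = f1_score(predicted, gold)  # precision 2/3, recall 1/2
```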

The entity record matching method provided by the invention does not rely on manual data cleaning, and can clean the data automatically by using the heterogeneous information fusion model obtained by neural network training. It can also construct a key attribute tree, extract key attributes automatically, and provide matching results with good interpretability. In addition, the invention places a low requirement on the number of labeled matching pairs, and can achieve high matching quality with only a small number of labeled entity pairs.

FIG. 2 is a schematic structural diagram of an entity record matching system provided by the present invention. As shown in FIG. 2, the present invention provides an entity record matching system, which includes an entity acquisition module 201 and an entity matching module 202. The entity acquisition module 201 is configured to acquire an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities. The entity matching module 202 is configured to input the entity record set into a trained entity record matching model to obtain a matching result between entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.

The entity record matching system provided by the invention converts the entity into the attribute value vector through the neural network, overcomes the defect of poor deep learning interpretability by utilizing the automatically constructed key attribute tree, can convert the learned key attribute tree into the matching rule, and applies the matching rule to other data sets; meanwhile, the training of the corresponding model of the invention only needs a small number of labeled entity record pairs, and overcomes the defect that the existing method needs a large number of labeled entity record pairs.

On the basis of the above embodiment, the system further comprises a first sample set construction module, a heterogeneous information fusion training module, a second sample set construction module, a key attribute tree training module and a model construction module. The first sample set construction module is used for constructing a first training sample set according to the sample entity records. The heterogeneous information fusion training module is used for training the neural network through the first training sample set based on a self-supervised learning method to obtain a heterogeneous information fusion model, which is used for converting entity records into entity record representation vectors. The second sample set construction module is used for constructing a second training sample set according to the sample entity record representation vectors. The key attribute tree training module is used for training a decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, which is used for classifying entity records and matching the entity records with the same classification result to the same entity. The model construction module is used for constructing the trained entity record matching model from the heterogeneous information fusion model and the key attribute tree model.

The system provided by the present invention is used for executing the above method embodiments, and for the specific processes and details, reference is made to the above embodiments, which are not described herein again.

Fig. 3 is a schematic structural diagram of an electronic device provided by the present invention. As shown in Fig. 3, the electronic device may include: a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with each other through the communication bus 304. The processor 301 may invoke logic instructions in the memory 303 to perform an entity record matching method comprising: acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.

In addition, the logic instructions in the memory 303 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the entity record matching method provided by the above method embodiments, the method comprising: acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.

In still another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the entity record matching method provided by the above embodiments, the method including: acquiring an entity record set to be matched, wherein entity records in the entity record set consist of attributes and attribute values of entities; and inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed from a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An entity record matching method, comprising:

inputting the entity record set into a trained entity record matching model to obtain a matching result between the entity records in the entity record set, wherein the trained entity record matching model is constructed by a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm;

the trained entity record matching model is obtained through the following steps:

constructing a first training sample set according to the sample entity record;

constructing a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model;

the method for training the neural network through the first training sample set based on the self-supervision learning to obtain the heterogeneous information fusion model comprises the following steps:

2. The entity record matching method according to claim 1, wherein the attribute feature vector is calculated by the formula:

wherein,

and

two sets of trainable parameters for the jth attribute,

represents the jth attribute;

representing j-th attribute

The respective attention weight of the individual words,

wherein,

and

represent learned parameters; d[j] is the first attribute value vector representation, i.e. the feature vector of the j-th attribute after information compression; c[j] is the context vector representation, i.e. the feature vector of the context corresponding to the j-th attribute; α_{j,i} denotes the attention coefficient, v_i represents the i-th attribute value in the entity record,

the vector represents a key-value representation of the ith attribute value,

the vector represents the query for the jth attribute value.

3. The entity record matching method according to claim 1, wherein the key attribute tree model is composed of leaf nodes and internal nodes, the internal nodes containing a key attribute, a vector distance function and a distance threshold; and the leaf node records an entity matching result.

4. The entity record matching method of claim 1, wherein prior to said constructing a first training sample set from sample entity records, said method further comprises:

5. An entity record matching system, comprising:

the entity matching module is used for inputting the entity record set into a trained entity record matching model to obtain a matching result between entity records in the entity record set, wherein the trained entity record matching model is constructed by a neural network trained by a self-supervised learning method and a decision tree model trained by a decision tree algorithm;

the system further comprises:

the key attribute tree training module is used for training the decision tree model through the second training sample set based on a decision tree algorithm to obtain a key attribute tree model, and the key attribute tree model is used for classifying the entity records and matching the entity records with the same classification result into the same entity;

the model construction module is used for constructing and obtaining a trained entity record matching model through the heterogeneous information fusion model and the key attribute tree model;

the heterogeneous information fusion training module is specifically used for:

after the sample entity records are input into the neural network, converting each word of the attribute values in the sample entity records into a vector through a word vector embedding layer of the neural network to obtain an attribute value matrix;

6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the entity record matching method according to any one of claims 1 to 4 when executing the computer program.

7. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the entity record matching method according to any one of claims 1 to 4.