CN111767325A - Multi-source data deep fusion method based on deep learning - Google Patents
- Publication number
- CN111767325A CN111767325A CN202010914905.5A CN202010914905A CN111767325A CN 111767325 A CN111767325 A CN 111767325A CN 202010914905 A CN202010914905 A CN 202010914905A CN 111767325 A CN111767325 A CN 111767325A
- Authority
- CN
- China
- Prior art keywords
- data
- model
- deep learning
- fused
- entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Fuzzy Systems (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the application provide a deep multi-source data fusion method based on deep learning, which comprises the steps of: obtaining the relational data tables to be fused; constructing a deep learning model, importing training data to perform word vectorization on the contents of the relational data tables to be fused, and performing schema matching on the processed data; hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structural model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing with that model; and judging whether the data in each bucket refer to the same entity, and fusing the data that refer to the same entity to obtain a data table composed of the fused data. The method models string data with word vectors, capturing both the text and the semantics of each string, which improves tolerance to dirty data.
Description
Technical Field
The application belongs to the field of data processing, and particularly relates to a multi-source data depth fusion method based on deep learning.
Background
Deep fusion of multi-source data refers to fusing multi-source structured data with deep learning methods, making the data easier for data scientists to analyze. In this application, fusion means discovering data in the multiple sources that refer to the same real-world entity (each tuple in a table refers to one entity), a task also known as entity matching. For example, recognizing different representations of the same mobile phone is one of the important subjects in the field of data science. With deep learning, dirty multi-source data can be predicted quickly and accurately and the value of the data mined, addressing the 4V challenges of big data: Volume, Velocity, Variety, and Value.
Real-world data tends to be dirty; for example, "Qinghua University" may have multiple representations, such as "Tsinghua University" or "Tsinghua Univ.". The existence of dirty data greatly affects the precision with which machines process the data and degrades processing performance.
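To make the effect of such dirty variants concrete, even a simple character-trigram Jaccard similarity (an illustrative string-level baseline, not the method of this application) already scores the variant spellings as close:

```python
def trigrams(s):
    # Set of character 3-grams of the lowercased string.
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    # Jaccard similarity between the trigram sets of two strings.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# The dirty variant still scores higher than an unrelated university name.
same = jaccard("Tsinghua University", "Tsinghua Univ.")
diff = jaccard("Tsinghua University", "Peking University")
```

String-level similarity alone misses semantic equivalence (e.g. transliterations such as "Qinghua"), which is why the application moves to word vectors.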
Disclosure of Invention
To overcome the defects and shortcomings of the prior art, a deep-learning-based multi-source data deep fusion method is provided. Faced with dirty multi-source data, the method can fuse data on both the structural and the semantic level, facilitating further analysis by data scientists.
Specifically, the embodiment of the application provides a deep learning-based multi-source data deep fusion method, which includes:
acquiring a relational data table to be fused including a first data table and a second data table;
constructing a deep learning model, introducing training data into the deep learning model, carrying out word vectorization processing on the contents in the relational data table to be fused, and carrying out pattern matching on the processed data;
hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structural model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the bucketing model;
and judging whether the data in each bucket refer to the same entity, and fusing the data that refer to the same entity to obtain a data table composed of fused data.
Optionally, the building a deep learning model, introducing training data into the deep learning model, performing word vectorization processing on the content in the to-be-fused relational data table, and performing pattern matching on the processed data includes:
carrying out data-based labeling on the relational data table to be fused to obtain a labeled data set containing whether the data are matched or not;
recoding the marked data set to complete word vectorization processing of the marked data set;
and importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on attributes, attribute values and topics, and performing data matching based on calculation results.
Optionally, the data-based labeling is performed on the relational data table to be fused to obtain a labeled data set including whether the data is matched, where the data-based labeling includes:
with annotation from a public data set, the annotated pattern-matching data is a set of labeled examples of the form (r, s, y), where r denotes the ith attribute value of the jth tuple in one relation table and s the corresponding value in the other, and y is the data to be annotated: 0 represents a match and 1 represents a mismatch.
Optionally, the recoding the tagged data set to complete word vectorization processing of the tagged data set includes:
for words not present in the vocabulary (out-of-vocabulary words), the special token UNK is used instead.
Optionally, the importing the word vector after the word vectorization processing into a deep learning model, performing similarity calculation based on three aspects of an attribute, an attribute value, and a theme, and performing data matching based on a calculation result includes:
mining a theme vector of each column according to the attribute value by adopting a theme model, and predicting according to the similarity of the attribute, the attribute value and the theme;
vectorizing the two attributes and the corresponding values, classifying according to the learned parameters, and calculating the probability of matching the two attributes;
finally, a matching combination which is matched to have the maximum probability is found between the first data table and the second data table to serve as a final matching result.
Optionally, the hierarchical sampling of the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, the importing of the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and the entity-based data bucketing based on the bucketing model, includes:
acquiring a similarity interval formed by the similarities of all entities, segmenting the acquired similarity interval, extracting a preset number of entity pairs in each acquired segment, and labeling the acquired entity pairs;
selecting sample data from the first data table and the second data table respectively, obtaining the hash code of each sample datum, calculating the similarity of the two hash codes, calculating the post-bucketing loss value, and assigning the data with the minimum loss value to the same bucket.
Optionally, the method further includes:
and adjusting the weight of the data type in the process of calculating the similarity.
Optionally, the determining whether the data in each bucket refers to the same entity or not, and performing data fusion on the data referring to the same entity to obtain a data table formed by fused data includes:
judging whether the data belong to the same entity or not according to the entity name in each data in the bucket;
and fusing the data belonging to the same entity according to the same attribute to obtain a fused data table.
The technical scheme provided by the application brings the following beneficial effect:
the method adopts a Word vector (Word Embedding) mode to model the character string data, can simultaneously model the text and the semantics of the character string, and improves the tolerance of the dirty data.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic flow chart of a deep learning-based multi-source data deep fusion method according to the present application;
FIG. 2 is a detailed flow chart of step 12 described herein;
FIG. 3 is a detailed flow chart of step 13 described herein;
FIG. 4 is a diagram of the model training architecture of the two-tower structure model described herein.
Detailed Description
To make the structure and advantages of the present application clearer, the structure of the present application will be further described with reference to the accompanying drawings.
Example one
Specifically, the embodiment of the present application provides a deep learning-based multi-source data deep fusion method, as shown in fig. 1, including:
11. acquiring a relational data table to be fused including a first data table and a second data table;
12. constructing a deep learning model, introducing training data into the deep learning model, carrying out word vectorization processing on the contents in the relational data table to be fused, and carrying out pattern matching on the processed data;
13. hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structural model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the bucketing model;
14. judging whether the data in each bucket refer to the same entity, and fusing the data that refer to the same entity to obtain a data table composed of fused data.
In implementation, in particular, multi-source data fusion involves the following steps:
the first step is Schema Matching (Schema Matching), names of schemas (attributes) of different source data may be different, for example, one article has a related data source attribute "title", another data source has an attribute "article name", these two attributes refer to the same Schema although the names are different, and data fusion requires that the attributes are aligned first.
The second step is data bucketing (Blocking). If N pieces of data need to be matched, the complexity of pairwise entity matching is O(N²), which is unacceptable for big data. The application therefore buckets the data with a deep learning method and performs entity matching only within each bucket, greatly reducing the complexity of matching.
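The reduction in comparisons can be illustrated with a crude blocking key (a simple stand-in for the learned bucketing described here; records and key rule are made up):

```python
from itertools import combinations

records = [
    {"id": 1, "name": "iPhone 12 128GB"},
    {"id": 2, "name": "iphone 12, 128 GB"},
    {"id": 3, "name": "Galaxy S21"},
    {"id": 4, "name": "galaxy s21 5g"},
]

def block_key(r):
    # Crude stand-in for a learned hash code: first token, lowercased.
    return r["name"].lower().split()[0].strip(",")

buckets = {}
for r in records:
    buckets.setdefault(block_key(r), []).append(r)

# Candidate pairs are generated only within each bucket,
# instead of all C(N, 2) pairs over the whole data set.
pairs = [p for b in buckets.values() for p in combinations(b, 2)]
```

Here 4 records yield 2 candidate pairs instead of the 6 exhaustive ones; the learned bucketing plays the role of `block_key` for dirty data where no simple rule works.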
The third step is entity matching. After schema matching and data bucketing, existing methods usually solve entity matching with rules or traditional machine learning; such methods usually lack generality and handle dirty data poorly. The application therefore models entity pairs with a deep learning method and matches the entity pairs that refer to the same real-world entity.
For dirty data, the method models string data with word vectors (Word Embedding), capturing both the text and the semantics of each string and thereby improving tolerance to dirty data. Further, in schema (attribute) matching, considering only the attribute itself is insufficient; the contents of the attribute column are considered as well, and the attributes and contents of the multiple data sources are modeled uniformly with word vectors. For data bucketing, conventional methods are usually rule-based, lack generality, and perform poorly on dirty data; the application instead learns the buckets end-to-end with deep learning. Entity matching suffers from the same dirty-data problem, which is likewise addressed with deep learning.
Optionally, as shown in fig. 2, step 12 specifically includes:
121. carrying out data-based labeling on the relational data table to be fused to obtain a labeled data set containing whether the data are matched or not;
122. recoding the marked data set to complete word vectorization processing of the marked data set;
123. and importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on attributes, attribute values and topics, and performing data matching based on calculation results.
In practice, step 121 is embodied as a process for preparing training data.
Training data is essential because a deep learning model must be trained. The training data can be obtained in two ways. One is to use the annotation of an existing public data set: the pattern-matching annotations are labeled examples of the form (r, s, y), where r denotes the ith attribute value of the jth tuple in one relation table and s the corresponding value in the other; y is the data to be annotated, with 0 representing a match and 1 a mismatch. The other is used when no ground truth is available: the pattern-matching annotation y is obtained from human annotators.
When the amount of data is small, a few experts may be asked to perform the annotation. When the data volume is large, finding enough experts is costly, and the labels can instead be obtained by crowdsourcing. Crowdsourcing uses the power of the internet for problems, such as data annotation, that are difficult for computers to solve automatically. The labeling result is usually determined by majority voting: an annotation question is given to multiple workers, say five, and if the majority of the returned answers indicate a match, the attribute pair is treated as matched; otherwise it is not.
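The majority-vote rule can be sketched in a few lines (assuming, as in the labeling scheme above, that 0 denotes a match and 1 a mismatch):

```python
from collections import Counter

def majority_label(answers):
    # Return the most frequent answer among the crowd workers.
    # 0 = match, 1 = mismatch, as in the labeling scheme above.
    return Counter(answers).most_common(1)[0][0]

# Five workers answer one annotation question; three say "match".
label = majority_label([0, 0, 1, 0, 1])
```

Using an odd number of workers (five here) avoids ties in the binary case.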
The specific content of step 122 is a word vectorization process.
Word vectors are a fundamental tool of natural-language modeling. A word vector maps a word or phrase in a vocabulary to a vector of real numbers; many generation methods exist, including neural networks, PCA dimensionality reduction, and probabilistic models. In this step, each attribute value in the two tables to be matched is encoded into a d-dimensional vector. Note in particular that for words not present in the vocabulary (OOV words), the special token UNK is usually used instead.
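A minimal sketch of the vocabulary lookup with a shared UNK vector (the table, words, and dimension are toy values, not a trained embedding):

```python
import random

random.seed(0)
d = 4  # embedding dimension (tiny, for illustration only)
vocab = ["beijing", "tsinghua", "university"]
# Hypothetical lookup table: a random d-dimensional vector per known word.
table = {w: [random.random() for _ in range(d)] for w in vocab}
table["UNK"] = [0.0] * d  # one shared vector for all OOV words

def embed(word):
    # Known words get their own vector; unknown words fall back to UNK.
    return table.get(word.lower(), table["UNK"])

vecs = [embed(w) for w in "Tsinghua Univ".split()]
```

Note that "Univ" maps to UNK here, which is exactly the information loss the character-level embeddings discussed next are meant to avoid.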
Word embedding may operate in units of words or in finer-grained units of characters. In the latter case, the input to the model is not a word; instead, each word's characters are fed through a neural network to generate a d-dimensional vector. The advantage is that the internal structure of words can be captured, especially for compound words such as "kindness", composed of "kind" and "ness". In addition, working in units of characters handles the OOV problem well.
In addition, whether to use a pre-trained model or to train word vectors from scratch is a further choice. Pre-trained options include word-level models (word2vec, GloVe) and character-level models (fastText). A pre-trained model has two advantages: it greatly shortens training time, and, having been trained on a large corpus, it is robust. The application lets the user freely decide between word-level and character-level vectors, and between a pre-trained model and training from scratch.
The specific content of step 123 is a building model training process.
The present application uses deep neural networks to build training models.
First, the input to the network consists of the word vectors of the two columns of data to be matched, comprising both attribute information and attribute-value information: vectors describing the text and semantics of the attribute names, and vectors describing the text and semantics of the attribute values. The application models these with a two-tower model.
In addition, the topic corresponding to an attribute is also important for matching attributes. The application therefore also employs a topic model to mine a topic vector for each column from its attribute values. The model makes its prediction from the similarity of the attributes, the attribute values, and the topics.
Cross entropy between the predicted value and the ground-truth label y is used as the loss function, and the model is trained by back-propagation and stochastic gradient descent.
Specifically, the application concatenates the attribute vector and the topic vector to form a new feature vector, where the topic vector is solved automatically by an LDA model from the attribute values.
The new features corresponding to the two attributes then interact: a similarity calculation layer computes the similarity between the two feature vectors, for which a fixed distance function such as cosine similarity or Euclidean distance may be chosen.
The loss is computed through a fully connected layer and a classification layer, and back-propagated to update the parameters of the neural network. At prediction time, the two attributes and their corresponding values are first vectorized along the forward-propagation path and then classified with the learned parameters, yielding the probability that the two attributes match. Finally, the matching combination with the maximum probability between the first data table and the second data table is taken as the final matching result.
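The final selection step — keep the maximum-probability combination of attribute pairs — can be sketched with a toy probability matrix (the greedy rule below and all values are illustrative stand-ins, not the exact procedure of the application):

```python
def best_matches(prob):
    # prob[i][j]: predicted probability that attribute i of the first table
    # matches attribute j of the second table. Greedily pick the
    # highest-probability pairs without reusing any attribute.
    ranked = sorted(
        ((p, i, j) for i, row in enumerate(prob) for j, p in enumerate(row)),
        reverse=True,
    )
    used_a, used_b, result = set(), set(), {}
    for p, i, j in ranked:
        if i not in used_a and j not in used_b:
            result[i] = j
            used_a.add(i)
            used_b.add(j)
    return result

# Hypothetical classifier outputs for a 2x2 attribute grid.
prob = [[0.9, 0.2],
        [0.3, 0.8]]
matches = best_matches(prob)
```

An optimal assignment (e.g. the Hungarian algorithm) could replace the greedy pass when the attribute count is large.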
Optionally, as shown in fig. 3, the specific content of step 13 includes:
131. acquiring a similarity interval formed by the similarities of all entities, segmenting the acquired similarity interval, extracting a preset number of entity pairs in each acquired segment, and labeling the acquired entity pairs;
132. selecting sample data from the first data table and the second data table respectively, obtaining the hash code of each sample datum, calculating the similarity of the two hash codes, calculating the post-bucketing loss value, and assigning the data with the minimum loss value to the same bucket.
In practice, the reason for data bucketing is:
(1) After discretization, inner products of sparse vectors are faster to compute, the results are easy to store, and the approach scales easily.
(2) Discrete features are more robust to outliers; for example, if age > 30 maps to 1 and otherwise to 0, an outlier such as age = 200 does not greatly disturb the model.
(3) Logistic regression (LR) is a generalized linear model with limited expressive power; after discretization, each variable has its own weight, which is equivalent to introducing nonlinearity and improves the model's expressive power and fit.
(4) After discretization, features can be crossed, turning M + N variables into M × N variables, further introducing nonlinearity and improving expressive power.
(5) After discretization, the model is more stable; for example, a user's age interval does not change merely because the user grows slightly older.
(6) Missing values can be introduced into the model as a separate category.
(7) All variables are transformed to similar scales.
Bucketing methods divide into unsupervised and supervised ones. Common unsupervised methods include equal-frequency bucketing, equal-width bucketing, and cluster-based bucketing; supervised methods mainly include best-KS bucketing and chi-square bucketing.
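As a minimal contrast between the two most common unsupervised schemes, the sketch below implements equal-width and equal-frequency bucketing from scratch (purely illustrative; this is not the application's learned bucketing):

```python
def equal_width(values, k):
    # Split the value range [min, max] into k intervals of equal width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency(values, k):
    # Assign values to k buckets so each bucket gets roughly equal counts.
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * k // len(values)
    return buckets

vals = [1, 2, 3, 4, 100]
ew = equal_width(vals, 2)      # the outlier 100 dominates the range
ef = equal_frequency(vals, 2)  # counts stay balanced despite the outlier
```

The example shows why rule-based bucketing is brittle on dirty data: a single outlier collapses the equal-width scheme into one crowded bucket.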
Step 131: training data is prepared. For data tablesAssume it contains N tuples(ii) a Another watchAssume it contains M tuples(ii) a Each tuple contains m attributes and the attributes between the two tables are aligned. The data bucket separation or data fusion is characterized in that the number of matched entity pairs in the training data is small, and the number of unmatched entity pairs is large, so that the problem of sample imbalance can be caused, and the training effect is deviated. Therefore, how to select the training data is one of the important points of attention in the present application. Specifically, the method and the device can perform hierarchical sampling according to the similarity between the entities so as to achieve the aim of training set balance. Generally, the reason why there are many pairs of unmatched entities is that there are many entities with low similarity. Thus, the similarity of all similar entities can be calculated (at [0,1 ]]Within an interval) scores. Then [0,1 ] is mixed]Interval segments, say 10 segments, extract a certain number of entity pairs within each segment. For the selected entity pair, if the selected entity pair is a public standard data set, the method directly uses the public labeling or manual labeling manner shown above, which is not described herein again.
Step 131: and (5) building model training. A model of a double tower structure was also used as shown in fig. 4. The model inputs at this point are different entitiesNeed to be provided withIs coded into. It is to be noted that eachThe method includes a plurality of attributes, and the word vector corresponding to each attribute may be obtained by the method in step 122, and then the word vectors of the attributes are integrated to obtain the word vector representation of the entire entity. There are many common methods that can be selected, and the present application can provide many methods related to the existing common natural language processing techniques, such as the method of vector direct sum, the method of recurrent neural network such as LSTM, and the method with attention mechanism. After the features of each entity are obtained, for convenience of bucket splitting, the feature vector is connected with a hash layer, namely a vector consisting of 0 and 1. The reason for this is to facilitate binning, i.e., each identical hash string represents a bucket, where matching tuples are expected to be grouped into one bucket, where unmatched entities are in different buckets, and where hash-coding distances between similar entities are relatively short. Therefore, a similarity calculation layer is constructed to calculate the similarity of two hash codes, and then a classification layer is used to calculate a loss function so as to meet the function of the model. After the model is trained, for the newly arriving entity, it is propagated forward through the network, and the matched entity pair will be sorted into a bucket.
Optionally, the method further includes:
and adjusting the weight of the data type in the process of calculating the similarity.
In practice, if the bucketing operation of step 132 has a high recall requirement, a moderate number of unmatched entities may also share a bucket, but matched entities must end up in the same bucket. Therefore, during model training, the weight of the matched training labels is increased.
Optionally, the specific content of step 14 includes:
judging whether the data belong to the same entity or not according to the entity name in each data in the bucket;
and fusing the data belonging to the same entity according to the same attribute to obtain a fused data table.
In implementation, after schema matching and data bucketing are complete, it must be determined whether entity pairs in the same bucket refer to the same entity. The application again employs a deep neural network to solve the entity-matching problem.
Model building, training, and prediction: each attribute tuple of each entity is encoded into a vector; a two-tower network is built; the vectors corresponding to the different attributes are integrated by direct vector summation or by a recurrent-neural-network method such as LSTM; a similarity calculation layer is constructed; and finally the classification loss is computed.
The similarity calculation layer closely resembles the network structure used for data bucketing, so the idea of transfer learning is adopted to reuse that structure and accelerate training. Samples can be obtained and drawn in the same way as before.
An example of the algorithm:
inputting: relational data sheetAndrespectively with m attributesAndrespectively including a plurality of entities at the same timeAnd。
1) Constructing attribute pairs to be matched;
2) acquiring the labeling data of the attribute pair;
5) Constructing a pattern matching model for training;
6) sampling to generate entity pair training data;
7) Training a data bucket model;
8) training an entity matching model;
11) performing data fusion by using a data fusion model;
The sequence numbers in the above embodiments are merely for description, and do not represent the sequence of the assembly or the use of the components.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (8)
1. The multi-source data depth fusion method based on deep learning is characterized by comprising the following steps:
acquiring a relational data table to be fused including a first data table and a second data table;
constructing a deep learning model, introducing training data into the deep learning model, carrying out word vectorization processing on the contents in the relational data table to be fused, and carrying out pattern matching on the processed data;
hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structural model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the bucketing model;
and judging whether the data in each bucket refer to the same entity, and fusing the data that refer to the same entity to obtain a data table composed of fused data.
2. The deep learning-based multi-source data deep fusion method of claim 1, wherein the deep learning model is built, training data is introduced into the deep learning model, word vectorization processing is performed on contents in a relational data table to be fused, and pattern matching is performed on the processed data, and the method comprises the following steps:
carrying out data-based labeling on the relational data table to be fused to obtain a labeled data set containing whether the data are matched or not;
recoding the marked data set to complete word vectorization processing of the marked data set;
and importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on attributes, attribute values and topics, and performing data matching based on calculation results.
3. The deep learning-based multi-source data deep fusion method of claim 2, wherein the data-based labeling is performed on the relational data table to be fused to obtain a labeled data set containing whether data are matched, and the method comprises the following steps:
4. The deep learning-based multi-source data deep fusion method of claim 2, wherein the recoding of the labeled data set to complete word vectorization processing of the labeled data set comprises:
for words not present in the vocabulary (out-of-vocabulary words), the special token UNK is used instead.
5. The deep learning-based multi-source data deep fusion method of claim 2, wherein the step of importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on three aspects of attributes, attribute values and topics, and performing data matching based on calculation results comprises:
mining a theme vector of each column according to the attribute value by adopting a theme model, and predicting according to the similarity of the attribute, the attribute value and the theme;
vectorizing the two attributes and the corresponding values, classifying according to the learned parameters, and calculating the probability of matching the two attributes;
finally, a matching combination which is matched to have the maximum probability is found between the first data table and the second data table to serve as a final matching result.
6. The deep learning-based multi-source data deep fusion method of claim 1, wherein hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the bucketing model, comprises the following steps:
acquiring a similarity interval formed by the similarities of all entities, segmenting the acquired similarity interval, extracting a preset number of entity pairs in each acquired segment, and labeling the acquired entity pairs;
the method comprises the steps of selecting sample data from a first data table and a second data table respectively, obtaining Hash codes of each sample data, calculating the similarity of the two Hash codes, calculating loss values after bucket division, and dividing the data with the minimum loss values into the same bucket division.
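A minimal sketch of the two operations in claim 6, under assumed simplifications: similarities lie in the unit interval, Hamming distance stands in for the bucketing loss, and all function names are illustrative:

```python
# (1) Stratified sampling of entity pairs over segments of the similarity
# interval, and (2) assigning a record to whichever bucket minimizes the
# distance between its hash code and the bucket's representative code.
import random

def stratified_sample(pairs_with_sim, n_segments=5, per_segment=2, seed=0):
    """pairs_with_sim: list of ((entity_a, entity_b), similarity in [0, 1]).
    Returns up to per_segment pairs from each similarity segment."""
    rng = random.Random(seed)
    segments = [[] for _ in range(n_segments)]
    for pair, sim in pairs_with_sim:
        idx = min(int(sim * n_segments), n_segments - 1)
        segments[idx].append(pair)
    sample = []
    for seg in segments:
        rng.shuffle(seg)
        sample.extend(seg[:per_segment])
    return sample

def hamming(code_a, code_b):
    """Number of differing bits between two equal-length hash codes."""
    return sum(x != y for x, y in zip(code_a, code_b))

def assign_bucket(code, bucket_codes):
    """Pick the bucket index whose representative hash code is closest to
    this record's code (the 'minimum loss' bucket)."""
    return min(range(len(bucket_codes)), key=lambda i: hamming(code, bucket_codes[i]))
```

Sampling across all similarity segments gives the bucketing model both easy and hard entity pairs to train on, which is the point of the stratification.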
7. The deep learning-based multi-source data deep fusion method of claim 3, further comprising:
adjusting the weights of the data types during the similarity calculation.
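Claim 7 only states that data-type weights are adjustable during similarity calculation; one plausible reading, with assumed type names and weights, is a per-type rescaling of each field's similarity contribution:

```python
# Hypothetical weighted similarity: each field's similarity is scaled by a
# weight looked up from its data type, then normalized by the total weight.
def weighted_similarity(field_sims, field_types, type_weights):
    """field_sims: per-field similarities in [0, 1]; field_types: the data
    type of each field; type_weights: data type -> assumed weight."""
    weights = [type_weights.get(t, 1.0) for t in field_types]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, field_sims)) / total
```

Raising the weight of a type (say, text fields over numeric ones) makes agreement on that type count more toward the overall match score.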
8. The deep learning-based multi-source data deep fusion method of claim 1, wherein determining whether the data in each bucket refer to the same entity and fusing the data that refer to the same entity to obtain a data table of fused data comprises:
judging whether data in a bucket belong to the same entity according to the entity name in each record;
fusing the data belonging to the same entity attribute by attribute to obtain the fused data table.
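The fusion step of claim 8 can be sketched as follows, under an assumed record layout (dicts carrying an `entity` key); the keep-first-non-empty merge rule is an assumption, not something the claim specifies:

```python
# Hypothetical fusion: records sharing an entity name are grouped, and their
# attributes are merged into one record per entity, keeping the first
# non-empty value seen for each attribute.
def fuse_records(records):
    """records: list of dicts, each containing an 'entity' key.
    Returns one merged dict per distinct entity."""
    fused = {}
    for rec in records:
        entity = rec["entity"]
        merged = fused.setdefault(entity, {"entity": entity})
        for key, value in rec.items():
            if key != "entity" and value not in (None, ""):
                merged.setdefault(key, value)  # keep the first non-empty value
    return list(fused.values())
```

A production system would also need conflict resolution when two sources disagree on the same attribute, which this sketch sidesteps.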
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010914905.5A CN111767325B (en) | 2020-09-03 | 2020-09-03 | Multi-source data deep fusion method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111767325A true CN111767325A (en) | 2020-10-13 |
CN111767325B CN111767325B (en) | 2020-11-24 |
Family
ID=72729245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010914905.5A Active CN111767325B (en) | 2020-09-03 | 2020-09-03 | Multi-source data deep fusion method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111767325B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341220A (en) * | 2017-06-28 | 2017-11-10 | Alibaba Group Holding Ltd. | Multi-source data fusion method and device |
CN109308311A (en) * | 2018-09-05 | 2019-02-05 | Guangzhou Xiaonan Technology Co., Ltd. | Multi-source heterogeneous data fusion system |
CN110110082A (en) * | 2019-04-12 | 2019-08-09 | Huang Hongmei | Multi-source heterogeneous data fusion optimization method |
CN110515926A (en) * | 2019-08-28 | 2019-11-29 | State Grid Tianjin Electric Power Company | Method for organizing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254641A (en) * | 2021-05-27 | 2021-08-13 | The 15th Research Institute of China Electronics Technology Group Corporation | Information data fusion method and device |
CN113254641B (en) * | 2021-05-27 | 2021-11-16 | The 15th Research Institute of China Electronics Technology Group Corporation | Information data fusion method and device |
CN113609715A (en) * | 2021-10-11 | 2021-11-05 | Shenzhen Aoya Design Co., Ltd. | Multi-model data fusion method and system in a digital twin context |
CN113609715B (en) * | 2021-10-11 | 2022-02-22 | Shenzhen Aoya Design Co., Ltd. | Multi-model data fusion method and system in a digital twin context |
CN114153839A (en) * | 2021-10-29 | 2022-03-08 | Hangzhou Weiming Information Technology Co., Ltd. | Multi-source heterogeneous data integration method, apparatus, device, and storage medium |
CN114997419A (en) * | 2022-07-18 | 2022-09-02 | Beijing Trusfort Technology Co., Ltd. | Scorecard model updating method and device, electronic device, and storage medium |
CN116303392A (en) * | 2023-03-02 | 2023-06-23 | Chongqing Planning and Natural Resources Information Center | Multi-source data table management method for real estate registration data |
CN116303392B (en) * | 2023-03-02 | 2023-09-01 | Chongqing Planning and Natural Resources Information Center | Multi-source data table management method for real estate registration data |
Also Published As
Publication number | Publication date |
---|---|
CN111767325B (en) | 2020-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111767325B (en) | Multi-source data deep fusion method based on deep learning | |
CN108573411B (en) | Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments | |
CN109271529B (en) | Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian | |
CN111325029B (en) | Text similarity calculation method based on deep learning integrated model | |
CN109271506A (en) | Construction method of a deep-learning-based knowledge graph question answering system for the power communication field | |
CN103336852B (en) | Cross-language ontology construction method and device | |
CN117171333B (en) | Electric power file question-answering type intelligent retrieval method and system | |
CN110888991A (en) | Sectional semantic annotation method in weak annotation environment | |
CN112417170B (en) | Relationship linking method for incomplete knowledge graph | |
CN116127090B (en) | Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction | |
CN112632250A (en) | Question and answer method and system under multi-document scene | |
CN115563313A (en) | Knowledge graph-based document book semantic retrieval system | |
CN112036178A (en) | Distribution network entity related semantic search method | |
Başarslan et al. | Sentiment analysis on social media reviews datasets with deep learning approach | |
CN112364132A (en) | Similarity calculation model and system based on dependency syntax and method for building system | |
CN116244448A (en) | Knowledge graph construction method, device and system based on multi-source data information | |
CN114238653A (en) | Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education | |
CN115390806A (en) | Software design mode recommendation method based on bimodal joint modeling | |
CN114443846B (en) | Classification method and device based on multi-level text different composition and electronic equipment | |
CN117973519A (en) | Knowledge graph-based data processing method | |
CN111666374A (en) | Method for integrating additional knowledge information into deep language model | |
Suresh et al. | Data mining and text mining—a survey | |
CN114626367A (en) | Sentiment analysis method, system, equipment and medium based on news article content | |
Bao et al. | HTRM: a hybrid neural network algorithm based on tag-aware | |
CN111581326A (en) | Method for extracting answer information based on heterogeneous external knowledge source graph structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||