CN111767325B - Multi-source data deep fusion method based on deep learning - Google Patents

Multi-source data deep fusion method based on deep learning

Info

Publication number
CN111767325B
Authority
CN
China
Prior art keywords
data
model
deep learning
fused
entity
Prior art date
Legal status
Active
Application number
CN202010914905.5A
Other languages
Chinese (zh)
Other versions
CN111767325A (en)
Inventor
李国良
柴成亮
李熊
李飞飞
叶翔
裘炜浩
丁麒
杨世旺
金王英
章晓明
李舜
Current Assignee
Tsinghua University
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Tsinghua University
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Tsinghua University, State Grid Zhejiang Electric Power Co Ltd, Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd filed Critical Tsinghua University
Priority to CN202010914905.5A priority Critical patent/CN111767325B/en
Publication of CN111767325A publication Critical patent/CN111767325A/en
Application granted granted Critical
Publication of CN111767325B publication Critical patent/CN111767325B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Fuzzy Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a deep learning-based multi-source data deep fusion method, which comprises: obtaining relational data tables to be fused; constructing a deep learning model, importing training data to perform word vectorization on the contents of the tables to be fused, and performing pattern matching on the processed data; hierarchically sampling the data in the tables to be fused based on the similarity between the entities the data correspond to, importing the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing with that model; and judging whether the data in each bucket refer to the same entity, and fusing the data that refer to the same entity to obtain a data table composed of fused data. The method models string data with word vectors, can model both the text and the semantics of a string, and improves tolerance to dirty data.

Description

Multi-source data deep fusion method based on deep learning
Technical Field
The application belongs to the field of data processing, and particularly relates to a deep learning-based multi-source data deep fusion method.
Background
The deep fusion of multi-source data refers to fusing multi-source structured data with deep learning methods, so that data scientists can analyze the data conveniently. In this application, fusion refers to discovering the data in the multiple sources that refer to the same real-world entity (each tuple in a table refers to an entity), also called entity matching. For example, identifying different textual representations of the same mobile phone is one of the important problems in the field of data science. With deep learning, dirty multi-source data can be matched quickly and accurately and the value of the data mined, which addresses the 4V challenges of big data: Volume, Velocity, Variety, and Value.
Real-world data tends to be dirty; for example, "Tsinghua University" may have multiple representations such as "Tsinghua Univ." or "Qinghua University". The existence of dirty data greatly reduces the accuracy with which machines can process the data, degrading processing performance.
Disclosure of Invention
In order to overcome the defects and shortcomings in the prior art, a deep learning-based multi-source data deep fusion method is provided. Facing dirty multi-source data, the method can fuse data in both structure and semantics, which makes further analysis by data scientists convenient.
Specifically, the embodiment of the application provides a deep learning-based multi-source data deep fusion method, which includes:
acquiring a relational data table to be fused including a first data table and a second data table;
constructing a deep learning model, introducing training data into the deep learning model, carrying out word vectorization processing on the contents in the relational data table to be fused, and carrying out pattern matching on the processed data;
hierarchically sampling the data in the relational data table to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the data bucketing model;
and judging whether the data in each bucket refer to the same entity, and performing data fusion on the data referring to the same entity to obtain a data table composed of fused data.
Optionally, the building a deep learning model, introducing training data into the deep learning model, performing word vectorization processing on the content in the to-be-fused relational data table, and performing pattern matching on the processed data includes:
carrying out data-based labeling on the relational data table to be fused to obtain a labeled data set containing whether the data are matched or not;
recoding the marked data set to complete word vectorization processing of the marked data set;
and importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on attributes, attribute values and topics, and performing data matching based on calculation results.
Optionally, the data-based labeling is performed on the relational data table to be fused to obtain a labeled data set including whether the data is matched, where the data-based labeling includes:
with annotation of a public data set, the labeled data of pattern matching being $(t^1_j[A_i], t^2_j[A_i], y)$, where $t^k_j[A_i]$ denotes the i-th attribute value of the j-th tuple in relation table $T_k$, and $y$ is the data to be annotated: 0 represents a match and 1 represents a mismatch.
Optionally, the recoding the tagged data set to complete word vectorization processing of the tagged data set includes:
encoding the attribute values in $T_1$ and $T_2$ as d-dimensional vectors, i.e. $v^1_{j,i}, v^2_{j,i} \in \mathbb{R}^d$, where UNK is used instead for words not present in the vocabulary.
Optionally, the importing of the word vectors after word vectorization into a deep learning model, the similarity calculation based on the three aspects of attribute, attribute value, and topic, and the data matching based on the calculation results include:
mining a topic vector for each column from the attribute values by adopting a topic model, and predicting according to the similarity of the attributes, the attribute values, and the topics;
vectorizing the two attributes and the corresponding values, classifying according to the learned parameters, and calculating the probability of matching the two attributes;
finally, the matching combination with the maximum matching probability between the first data table and the second data table is taken as the final matching result.
Optionally, the hierarchical sampling of the data in the relational data table to be fused based on the similarity between the entities corresponding to the data, the importing of the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and the entity-based data bucketing based on the data bucketing model include:
acquiring a similarity interval formed by the similarities of all entities, segmenting the acquired similarity interval, extracting a preset number of entity pairs in each acquired segment, and labeling the acquired entity pairs;
the method comprises the steps of selecting sample data from a first data table and a second data table respectively, obtaining Hash codes of each sample data, calculating the similarity of the two Hash codes, calculating loss values after bucket division, and dividing the data with the minimum loss values into the same bucket division.
Optionally, the method further includes:
and adjusting the weight of the data type in the process of calculating the similarity.
Optionally, the determining whether the data in each bucket refers to the same entity or not, and performing data fusion on the data referring to the same entity to obtain a data table formed by fused data includes:
judging whether the data belong to the same entity or not according to the entity name in each data in the bucket;
and fusing the data belonging to the same entity according to the same attribute to obtain a fused data table.
The beneficial effects brought by the technical solution provided in this application are:
the method adopts a Word vector (Word Embedding) mode to model the character string data, can simultaneously model the text and the semantics of the character string, and improves the tolerance of the dirty data.
Drawings
In order to illustrate the technical solutions of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a deep learning-based multi-source data deep fusion method according to the present application;
FIG. 2 is a detailed flow chart of step 12 described herein;
FIG. 3 is a detailed flow chart of step 13 described herein;
FIG. 4 is a diagram of the model training architecture of the two-tower structure model described herein.
Detailed Description
To make the structure and advantages of the present application clearer, the structure of the present application will be further described with reference to the accompanying drawings.
Example one
Specifically, the embodiment of the present application provides a deep learning-based multi-source data deep fusion method, as shown in fig. 1, including:
11. acquiring a relational data table to be fused including a first data table and a second data table;
12. constructing a deep learning model, introducing training data into the deep learning model, carrying out word vectorization processing on the contents in the relational data table to be fused, and carrying out pattern matching on the processed data;
13. hierarchically sampling the data in the relational data table to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the data bucketing model;
14. judging whether the data in each bucket refer to the same entity, and performing data fusion on the data referring to the same entity to obtain a data table composed of fused data.
In implementation, in particular, multi-source data fusion involves the following steps:
the first step is Schema Matching (Schema Matching), names of schemas (attributes) of different source data may be different, for example, one article has a related data source attribute "title", another data source has an attribute "article name", these two attributes refer to the same Schema although the names are different, and data fusion requires that the attributes are aligned first.
The second step is data bucketing (Blocking): if N pieces of data need to be matched, the complexity of pairwise entity matching is N^2, which is unacceptable for big data. The method therefore buckets the data with a deep learning method and performs entity matching only on the data within each bucket, greatly reducing the matching complexity.
The third step is entity matching: after pattern matching and data bucketing, existing methods usually solve entity matching with rules or traditional machine learning. Such methods usually lack generality and cannot handle dirty data well, so the method models entity pairs with deep learning and matches the entity pairs that refer to the same real-world entity.
For dirty data, the method adopts Word Embedding to model string data, capturing both the text and the semantics of a string, which improves tolerance to dirty data. Further, in pattern (attribute) matching, considering only the attribute itself is insufficient; the contents of the attribute column are also considered, and the attributes and contents of multiple data sources are modeled uniformly with word vectors. In data bucketing, conventional methods usually rely on rules, lack generality, and perform poorly on dirty data; the present application uses deep learning to learn buckets end to end. The entity matching process suffers the same dirty-data problem, which is likewise solved with deep learning.
Optionally, as shown in fig. 2, step 12 specifically includes:
121. carrying out data-based labeling on the relational data table to be fused to obtain a labeled data set containing whether the data are matched or not;
122. recoding the marked data set to complete word vectorization processing of the marked data set;
123. and importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on attributes, attribute values and topics, and performing data matching based on calculation results.
In practice, step 121 is embodied as a process for preparing training data.
Training data is essential for training a deep learning model, and it can be obtained in two ways. One is to use the annotation of an existing public data set: the labeled data of pattern matching is $(t^1_j[A_i], t^2_j[A_i], y)$, where $t^k_j[A_i]$ denotes the i-th attribute value of the j-th tuple in relation table $T_k$, and $y$ is the data to be annotated, with 0 representing a match and 1 a mismatch. The other, when no ground truth exists, is to obtain the pattern-matching annotation $y$ from annotators.
When the amount of data is small, several experts may be asked to annotate. When the data volume is large, hiring experts becomes expensive, and the labels can instead be obtained by crowdsourcing. Crowdsourcing refers to using the power of the internet to solve problems, such as data annotation, that are hard for computers to solve automatically. The labeling result is usually decided by majority voting: a labeling question is distributed to multiple workers, say 5, and the majority answer determines whether the attribute pair matches.
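As an illustration, the following is a minimal Python sketch of the majority-voting aggregation; the five-worker setting follows the example above, and the function name is hypothetical:

from collections import Counter

def majority_vote(answers):
    """Aggregate crowdsourced labels for one attribute pair:
    the most common worker answer becomes the final annotation."""
    label, _ = Counter(answers).most_common(1)[0]
    return label

# Five workers label one attribute pair; the majority answer wins.
print(majority_vote([1, 1, 0, 1, 1]))  # -> 1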
The specific content of step 122 is a word vectorization process.
Word vectors are a fundamental tool of natural-language modeling: a word or phrase from the vocabulary is mapped to a vector of real numbers. There are many ways to generate word vectors, including neural networks, PCA dimensionality reduction, and probabilistic models. In this step, the attribute values in $T_1$ and $T_2$ can be encoded into d-dimensional vectors, i.e. $v^1_{j,i}, v^2_{j,i} \in \mathbb{R}^d$. Note in particular that for words not present in the vocabulary (OOV), UNK is usually used instead.
Word Embedding may operate in units of words or in finer-grained units of characters. In the latter case, the input to the model is not a word; instead, each word's characters are fed through a neural network to generate a d-dimensional vector. The advantage is that the internal features of words can be captured, especially compound words; for example, "kindness" is composed of "kind" and "ness". In addition, character-level units solve the OOV problem well.
In addition, using pre-trained word vectors or training them from scratch are different options. Pre-trained options include word-level models (word2vec, GloVe) and a character-level model (fastText). Pre-trained models have two advantages: training time is greatly shortened, and because they are trained on large corpora they are robust. The present application allows the user to freely decide whether to use word-level or character-level vectors, and whether to use a pre-trained model or train from scratch.
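For concreteness, here is a minimal sketch of word-level value encoding with UNK substitution; the toy random vocabulary stands in for a real pre-trained model such as word2vec or GloVe, and averaging the token vectors is just one simple way to pool a multi-word value:

import numpy as np

d = 50  # embedding dimension, illustrative
rng = np.random.default_rng(0)

# Toy vocabulary -> d-dimensional vectors; in practice these rows would come
# from a pre-trained model (word2vec/GloVe/fastText) or be trained from scratch.
vocab = {w: rng.normal(size=d) for w in ["tsinghua", "university", "beijing"]}
UNK = rng.normal(size=d)  # shared vector substituted for OOV words

def encode_value(value: str) -> np.ndarray:
    """Encode an attribute value as the mean of its token vectors,
    substituting UNK for out-of-vocabulary tokens."""
    tokens = value.lower().split()
    return np.mean([vocab.get(t, UNK) for t in tokens], axis=0)

v1 = encode_value("Tsinghua University")
v2 = encode_value("Tsinghua Univ.")  # "univ." is OOV and maps to UNK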
The specific content of step 123 is a building model training process.
The present application uses deep neural networks to build training models.
First, the input to the network is the word vectors of the two columns of data to be matched, comprising the attribute information and the attribute-value information, namely $(e^1_A, e^1_V)$ and $(e^2_A, e^2_V)$, where $e^1_A$ and $e^2_A$ describe the textual and semantic information of the attributes and $e^1_V, e^2_V$ describe the textual and semantic information of the attribute values; these are modeled with a two-tower model.
In addition, the topic corresponding to an attribute is also important for attribute matching. The present application therefore also employs a Topic Model to mine a topic vector for each column from its attribute values, represented as $e_{topic}$. The model makes predictions based on the similarity of the attributes, the attribute values, and the topics.
Cross entropy is used as the loss function between the predicted value and the true value $y$, and the model is trained by back propagation and stochastic gradient descent.
Specifically, the present application concatenates the attribute vector $e_A$ and the topic vector $e_{topic}$ into a new set of features, where the topic vector $e_{topic}$ is solved automatically by an LDA model from the attribute values.
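The following is a minimal sketch of mining a column's topic vector with LDA via scikit-learn; averaging the per-cell topic mixtures into a single column-level vector is an assumption about this step, as is the function name:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def column_topic_vector(values, n_topics=5):
    """Fit LDA over one column's cell texts and average the per-cell
    topic mixtures into a single column-level topic vector."""
    counts = CountVectorizer().fit_transform(values)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)  # one topic mixture per cell
    return doc_topics.mean(axis=0)

column = ["Tsinghua University", "Peking University", "Zhejiang University"]
e_topic = column_topic_vector(column)  # shape (n_topics,)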
Then the new features corresponding to the two attributes interact: a similarity calculation layer computes the similarity between the two features, where a fixed distance function such as cosine similarity or Euclidean distance can be selected.
The features then pass through a fully connected layer to a classification layer where the loss is computed, and the loss is back-propagated to update the parameters of the neural network. At prediction time, the two attributes and their corresponding values are first vectorized along the forward-propagation path and then classified with the learned parameters to obtain the probability that the two attributes match. Finally, the matching combination with the maximum matching probability between the first data table and the second data table is taken as the final matching result.
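The following is a minimal PyTorch sketch of this two-tower matcher: shared towers encode the two columns' concatenated features, a cosine-similarity layer feeds a classification layer, and cross entropy is back-propagated; all dimensions and names are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerMatcher(nn.Module):
    """Sketch of the two-tower pattern matcher described above."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        # One shared tower encodes a column's [attribute; value; topic] features.
        self.tower = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # Similarity score -> 2-way classification (match / mismatch).
        self.classifier = nn.Linear(1, 2)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        h1, h2 = self.tower(x1), self.tower(x2)
        sim = F.cosine_similarity(h1, h2).unsqueeze(-1)  # similarity layer
        return self.classifier(sim)

model = TwoTowerMatcher(in_dim=105)  # e.g. d + d + n_topics, illustrative
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x1, x2 = torch.randn(8, 105), torch.randn(8, 105)  # dummy column features
y = torch.randint(0, 2, (8,))                      # dummy match labels
opt.zero_grad()
loss = loss_fn(model(x1, x2), y)
loss.backward()
opt.step()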
Optionally, as shown in fig. 3, the specific content of step 13 includes:
131. acquiring a similarity interval formed by the similarities of all entities, segmenting the acquired similarity interval, extracting a preset number of entity pairs in each acquired segment, and labeling the acquired entity pairs;
132. selecting sample data from the first data table and the second data table respectively, obtaining a hash code for each piece of sample data, calculating the similarity of two hash codes, calculating the loss value after bucketing, and assigning the data with the minimum loss value to the same bucket.
In practice, the reasons for data bucketing are:
(1) After discretization, inner products of sparse vectors are faster to compute, the results are convenient to store, and the approach scales easily.
(2) Discretized features are more robust to outliers: for example, "age > 30" becomes 1 and otherwise 0, so an age of 200 does not greatly disturb the model.
(3) LR is a generalized linear model with limited expressive power; after discretization, each variable has its own weight, which is equivalent to introducing nonlinearity, improving the model's expressive power and its fit.
(4) After discretization, features can be crossed, turning M + N variables into M × N variables, further introducing nonlinearity and improving expressive power.
(5) After discretization the model is more stable; for example, a user's age interval does not change merely because the user is a little older.
(6) Missing values can enter the model as a separate category.
(7) All variables are transformed to a similar scale.
Bucketing methods are divided into unsupervised and supervised ones. Common unsupervised methods include equal-frequency bucketing, equal-width bucketing, and clustering-based bucketing; supervised methods mainly include Best-KS bucketing and chi-square bucketing.
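As a concrete illustration, equal-width and equal-frequency bucketing can be sketched with pandas; clustering-based bucketing would use, e.g., k-means instead, and the age values here are made up:

import pandas as pd

ages = pd.Series([18, 22, 25, 31, 40, 45, 52, 70])

# Equal-width (equidistant) bucketing: four intervals of equal span.
equidistant = pd.cut(ages, bins=4)
# Equal-frequency bucketing: four quantile-based buckets of equal size.
equal_freq = pd.qcut(ages, q=4)
print(equidistant.value_counts(), equal_freq.value_counts(), sep="\n")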
Step 13 provides a solution for data bucketing based on deep learning.
Step 131: prepare the training data. For data table $T_1$, assume it contains N tuples $\{t^1_1, \dots, t^1_N\}$; for another table $T_2$, assume it contains M tuples $\{t^2_1, \dots, t^2_M\}$. Each tuple contains m attributes, and the attributes of the two tables are aligned. A characteristic of data bucketing (and of data fusion) is that the matched entity pairs in the training data are few while the unmatched pairs are extremely many, which causes sample imbalance and biases the training. How to select the training data is therefore one of the key concerns of the present application. Specifically, the application samples hierarchically according to the similarity between entities so as to balance the training set. In general, there are many unmatched entity pairs because there are many entities with low similarity. Thus, similarity scores (within the interval [0,1]) can be computed for all entity pairs; the interval [0,1] is then split into segments, say 10, and a certain number of entity pairs are extracted from each segment. For the selected entity pairs, if they come from a public standard data set, the public annotation shown above is used directly, or else the manual annotation described above is applied, which is not repeated here.
Step 132: build and train the model. A two-tower model is again used, as shown in fig. 4. The model inputs are now different entities $t^1_j$ and $t^2_j$, which need to be encoded into vectors $e^1_j$ and $e^2_j$. Note that each tuple contains several attributes; the word vector corresponding to each attribute can be obtained by the method of step 122, and the attribute word vectors are then integrated into the word-vector representation of the whole entity. Many common methods can be selected here, and the application can provide many options drawn from common natural-language-processing techniques, such as direct vector summation, recurrent neural networks such as LSTM, and attention mechanisms. After the features of each entity are obtained, for convenience of bucketing, the feature vector is connected to a hash layer, i.e. a vector consisting of 0s and 1s. The reason is to facilitate binning: each distinct hash string represents a bucket, matching tuples are expected to fall into the same bucket, unmatched entities fall into different buckets, and the hash codes of similar entities are closer together. Therefore a similarity calculation layer is constructed to compute the similarity of two hash codes, and a classification layer then computes the loss function so that the model fulfils this goal. After the model is trained, a newly arriving entity is propagated forward through the network, and matched entity pairs are sorted into the same bucket.
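To illustrate the bucketing step, here is a minimal sketch that groups entities whose hash strings are identical into the same bucket; a sign binarization stands in for the trained hash layer, whose codes would come from the network in the actual model:

import numpy as np
from collections import defaultdict

def hash_code(feature: np.ndarray) -> str:
    """Binarize a feature vector into a 0/1 hash string (stand-in for the
    trained hash layer)."""
    return "".join("1" if x > 0 else "0" for x in feature)

def bucketize(entities, features):
    """Entities with identical hash strings fall into the same bucket."""
    buckets = defaultdict(list)
    for entity, feature in zip(entities, features):
        buckets[hash_code(feature)].append(entity)
    return buckets

feats = np.random.default_rng(0).normal(size=(4, 8))  # dummy entity features
print(bucketize(["e1", "e2", "e3", "e4"], feats))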
Optionally, the method further includes:
and adjusting the weight of the data type in the process of calculating the similarity.
In practice, if the bucketing operation of step 132 has a high recall requirement, a moderate number of unmatched entities may also fall into a bucket, but matched entities must be in the same bucket. Therefore, during model training, the weight of the matched training labels is increased.
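A minimal sketch of this weighting, assuming a cross-entropy loss as above and the convention that label 0 means a match; the weight value 5.0 is illustrative:

import torch
import torch.nn as nn

# Give matched pairs (class 0 under the 0 = match convention) a larger weight,
# so separating a true match costs more than keeping a mismatch together.
class_weights = torch.tensor([5.0, 1.0])  # [match, mismatch]
loss_fn = nn.CrossEntropyLoss(weight=class_weights)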
Optionally, the specific content of step 14 includes:
judging whether the data belong to the same entity or not according to the entity name in each data in the bucket;
and fusing the data belonging to the same entity according to the same attribute to obtain a fused data table.
In implementation, after pattern matching and data bucketing are completed, it is necessary to determine whether entity pairs in the same bucket refer to the same entity. The present application still employs a deep neural network to solve the entity matching problem.
Model building, training, and prediction: each attribute of each entity is encoded into a vector; a network is then built with the two-tower model, the vectors corresponding to different attributes are integrated by direct vector summation or by a recurrent neural network such as LSTM, a similarity calculation layer is constructed, and finally the classification loss is computed.
The similarity calculation layer is quite similar to the network structure of the data bucketing model, so that structure is reused following the idea of transfer learning, which accelerates training. The samples can be obtained and sampled by the same method as before.
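A minimal sketch of that reuse: the trained tower of the bucketing model initializes the entity matcher's tower as a warm start (identical module shapes are assumed, and the dimensions are illustrative):

import torch.nn as nn

def transfer_tower(src: nn.Module, dst: nn.Module) -> None:
    """Warm-start dst with the trained weights of src (transfer learning)."""
    dst.load_state_dict(src.state_dict())

bucketing_tower = nn.Sequential(nn.Linear(105, 64), nn.ReLU())  # trained earlier
matcher_tower = nn.Sequential(nn.Linear(105, 64), nn.ReLU())    # to be fine-tuned
transfer_tower(bucketing_tower, matcher_tower)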
An example of the algorithm:
Input: relational data tables $T_1$ and $T_2$, each with m attributes $\{A^1_1, \dots, A^1_m\}$ and $\{A^2_1, \dots, A^2_m\}$, and each containing a number of entities $\{t^1_j\}$ and $\{t^2_j\}$.
Output: fused data table $T$.
1) Construct the attribute pairs to be matched;
2) Acquire the labeling data for the attribute pairs;
3) Vectorize and encode the attributes and the columns of attribute values, generating $e^1_A$, $e^2_A$, and $e_V$;
4) Compute a topic model from the attribute values, generating the topic vectors $e_{topic}$;
5) Build the pattern matching model for training;
6) Sample to generate entity-pair training data;
7) Vectorize and encode the entity pairs, generating $e_t$;
8) Train the data bucketing model;
9) Train the entity matching model;
10) Align $T_1$ and $T_2$ using the trained pattern matching model;
11) Bucket the entities between $T_1$ and $T_2$ using the data bucketing model;
12) Perform data fusion using the data fusion model;
13) Return the fused data table $T$.
The sequence numbers in the above embodiments are merely for description and do not represent any order of assembly or use.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (6)

1. A multi-source data deep fusion method based on deep learning, characterized by comprising the following steps:
acquiring a relational data table to be fused including a first data table and a second data table;
constructing a deep learning model, introducing training data into the deep learning model, carrying out word vectorization processing on the contents in the relational data table to be fused, and carrying out pattern matching on the processed data;
hierarchically sampling the data in the relational data table to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the data bucketing model;
and judging whether the data in each bucket refer to the same entity, and performing data fusion on the data referring to the same entity to obtain a data table composed of fused data.
2. The deep learning-based multi-source data deep fusion method of claim 1, wherein the deep learning model is built, training data is introduced into the deep learning model, word vectorization processing is performed on contents in a relational data table to be fused, and pattern matching is performed on the processed data, and the method comprises the following steps:
carrying out data-based labeling on the relational data table to be fused to obtain a labeled data set containing whether the data are matched or not;
recoding the marked data set to complete word vectorization processing of the marked data set;
and importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on attributes, attribute values and topics, and performing data matching based on calculation results.
3. The deep learning-based multi-source data deep fusion method of claim 2, wherein the step of importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on three aspects of attributes, attribute values and topics, and performing data matching based on calculation results comprises:
mining a topic vector for each column from the attribute values by adopting a topic model, and predicting according to the similarity of the attributes, the attribute values, and the topics;
vectorizing the two attributes and the corresponding values, classifying according to the learned parameters, and calculating the probability of matching the two attributes;
finally, the matching combination with the maximum matching probability between the first data table and the second data table is taken as the final matching result.
4. The deep learning-based multi-source data deep fusion method of claim 1, wherein the hierarchical sampling of the data in the relational data table to be fused based on the similarity between the entities corresponding to the data, the importing of the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and the entity-based data bucketing based on the data bucketing model comprise the following steps:
acquiring a similarity interval formed by the similarities of all entities, segmenting the acquired similarity interval, extracting a preset number of entity pairs in each acquired segment, and labeling the acquired entity pairs;
the method comprises the steps of selecting sample data from a first data table and a second data table respectively, obtaining Hash codes of each sample data, calculating the similarity of the two Hash codes, calculating loss values after bucket division, and dividing the data with the minimum loss values into the same bucket division.
5. The deep learning-based multi-source data deep fusion method according to claim 1, further comprising:
and adjusting the weight of the data type in the process of calculating the similarity.
6. The deep learning-based multi-source data deep fusion method of claim 1, wherein the determining whether the data in each bucket refer to the same entity is performed, and the data referring to the same entity is subjected to data fusion to obtain a data table composed of fused data, and the method comprises:
judging whether the data belong to the same entity or not according to the entity name in each data in the bucket;
and fusing the data belonging to the same entity according to the same attribute to obtain a fused data table.
CN202010914905.5A 2020-09-03 2020-09-03 Multi-source data deep fusion method based on deep learning Active CN111767325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010914905.5A CN111767325B (en) 2020-09-03 2020-09-03 Multi-source data deep fusion method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010914905.5A CN111767325B (en) 2020-09-03 2020-09-03 Multi-source data deep fusion method based on deep learning

Publications (2)

Publication Number Publication Date
CN111767325A CN111767325A (en) 2020-10-13
CN111767325B true CN111767325B (en) 2020-11-24

Family

ID=72729245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010914905.5A Active CN111767325B (en) 2020-09-03 2020-09-03 Multi-source data deep fusion method based on deep learning

Country Status (1)

Country Link
CN (1) CN111767325B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254641B (en) * 2021-05-27 2021-11-16 中国电子科技集团公司第十五研究所 Information data fusion method and device
CN113609715B (en) * 2021-10-11 2022-02-22 深圳奥雅设计股份有限公司 Multivariate model data fusion method and system under digital twin background
CN114153839A (en) * 2021-10-29 2022-03-08 杭州未名信科科技有限公司 Integration method, device, equipment and storage medium of multi-source heterogeneous data
CN114997419A (en) * 2022-07-18 2022-09-02 北京芯盾时代科技有限公司 Updating method and device of rating card model, electronic equipment and storage medium
CN116303392B (en) * 2023-03-02 2023-09-01 重庆市规划和自然资源信息中心 Multi-source data table management method for real estate registration data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341220B (en) * 2017-06-28 2020-05-12 阿里巴巴集团控股有限公司 Multi-source data fusion method and device
CN109308311A (en) * 2018-09-05 2019-02-05 广州小楠科技有限公司 A kind of multi-source heterogeneous data fusion system
CN110110082A (en) * 2019-04-12 2019-08-09 黄红梅 Multi-source heterogeneous data fusion optimization method
CN110515926A (en) * 2019-08-28 2019-11-29 国网天津市电力公司 Heterogeneous data source mass data carding method based on participle and semantic dependency analysis

Also Published As

Publication number Publication date
CN111767325A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
CN111767325B (en) Multi-source data deep fusion method based on deep learning
CN108573411B (en) Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN103336852B (en) Across language ontology construction method and device
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN112036178A (en) Distribution network entity related semantic search method
CN112632250A (en) Question and answer method and system under multi-document scene
Başarslan et al. Sentiment analysis on social media reviews datasets with deep learning approach
CN114238653A (en) Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education
CN117171333A (en) Electric power file question-answering type intelligent retrieval method and system
CN115390806A (en) Software design mode recommendation method based on bimodal joint modeling
Suresh et al. Data mining and text mining—a survey
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN114239828A (en) Supply chain affair map construction method based on causal relationship
CN115795060A (en) Entity alignment method based on knowledge enhancement
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
Bao et al. HTRM: A hybrid neural network algorithm based on tag-aware
Jardaeh et al. ArEmotive Bridging the Gap: Automatic Ontology Augmentation using Zero-shot Classification for Fine-grained Sentiment Analysis of Arabic Text
Zafartavanaelmi Semantic Question Answering Over Knowledge Graphs: Pitfalls and Pearls
Koukaras et al. Introducing a novel bi-functional method for exploiting sentiment in complex information networks
Yang et al. Construction and analysis of scientific and technological personnel relational graph for group recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant