CN111767325A - Multi-source data deep fusion method based on deep learning - Google Patents
- Publication number
- CN111767325A CN111767325A CN202010914905.5A CN202010914905A CN111767325A CN 111767325 A CN111767325 A CN 111767325A CN 202010914905 A CN202010914905 A CN 202010914905A CN 111767325 A CN111767325 A CN 111767325A
- Authority
- CN
- China
- Prior art keywords
- data
- model
- deep learning
- fused
- entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Fuzzy Systems (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the application provide a deep multi-source data fusion method based on deep learning, which comprises the steps of: obtaining the relational data tables to be fused; constructing a deep learning model, importing training data to perform word vectorization on the contents of the relational data tables to be fused, and performing schema matching on the processed data; hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structural model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing with that model; and judging whether the data in each bucket refer to the same entity, and fusing the data that refer to the same entity to obtain a data table composed of the fused data. The method models string data with word vectors, capturing both the text and the semantics of each string, which improves tolerance to dirty data.
Description
Technical Field
The application belongs to the field of data processing, and particularly relates to a multi-source data depth fusion method based on deep learning.
Background
Deep fusion of multi-source data refers to fusing multi-source structured data with deep learning methods, making the data easier for data scientists to analyze. In this application, fusion means discovering data in the multiple sources that refer to the same real-world entity (each tuple in a table refers to one entity), a task also known as entity matching. For example, recognizing different representations of the same mobile phone is one of the important subjects in the field of data science. With deep learning, dirty multi-source data can be predicted quickly and accurately and the value of the data mined, addressing the 4V challenges of big data: Volume, Velocity, Variety, and Value.
Real-world data tends to be dirty; for example, "Qinghua University" may have multiple representations, such as "Tsinghua University" or "Tsinghua Univ.". The existence of dirty data greatly affects the precision with which machines process the data and degrades processing performance.
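To make the effect of such dirty variants concrete, even a simple character-trigram Jaccard similarity (an illustrative string-level baseline, not the method of this application) already scores the variant spellings as close:

```python
def trigrams(s):
    # Set of character 3-grams of the lowercased string.
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    # Jaccard similarity between the trigram sets of two strings.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# The dirty variant still scores higher than an unrelated university name.
same = jaccard("Tsinghua University", "Tsinghua Univ.")
diff = jaccard("Tsinghua University", "Peking University")
```

String-level similarity alone misses semantic equivalence (e.g. transliterations such as "Qinghua"), which is why the application moves to word vectors.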
Disclosure of Invention
To overcome the defects and shortcomings of the prior art, a deep-learning-based multi-source data deep fusion method is provided. Faced with dirty multi-source data, the method can fuse data on both the structural and the semantic level, facilitating further analysis by data scientists.
Specifically, the embodiment of the application provides a deep learning-based multi-source data deep fusion method, which includes:
acquiring a relational data table to be fused including a first data table and a second data table;
constructing a deep learning model, introducing training data into the deep learning model, carrying out word vectorization processing on the contents in the relational data table to be fused, and carrying out pattern matching on the processed data;
hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structural model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the bucketing model;
and judging whether the data in each bucket refer to the same entity, and fusing the data that refer to the same entity to obtain a data table composed of fused data.
Optionally, the building a deep learning model, introducing training data into the deep learning model, performing word vectorization processing on the content in the to-be-fused relational data table, and performing pattern matching on the processed data includes:
carrying out data-based labeling on the relational data table to be fused to obtain a labeled data set containing whether the data are matched or not;
recoding the marked data set to complete word vectorization processing of the marked data set;
and importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on attributes, attribute values and topics, and performing data matching based on calculation results.
Optionally, the data-based labeling is performed on the relational data table to be fused to obtain a labeled data set including whether the data is matched, where the data-based labeling includes:
with annotation from a public data set, the annotated pattern-matching data is a set of labeled examples of the form (r, s, y), where r denotes the ith attribute value of the jth tuple in one relation table and s the corresponding value in the other, and y is the data to be annotated: 0 represents a match and 1 represents a mismatch.
Optionally, the recoding the tagged data set to complete word vectorization processing of the tagged data set includes:
for words not present in the vocabulary (out-of-vocabulary words), the special token UNK is used instead.
Optionally, the importing the word vector after the word vectorization processing into a deep learning model, performing similarity calculation based on three aspects of an attribute, an attribute value, and a theme, and performing data matching based on a calculation result includes:
mining a theme vector of each column according to the attribute value by adopting a theme model, and predicting according to the similarity of the attribute, the attribute value and the theme;
vectorizing the two attributes and the corresponding values, classifying according to the learned parameters, and calculating the probability of matching the two attributes;
finally, a matching combination which is matched to have the maximum probability is found between the first data table and the second data table to serve as a final matching result.
Optionally, the hierarchical sampling of the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, the importing of the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and the entity-based data bucketing based on the bucketing model, includes:
acquiring a similarity interval formed by the similarities of all entities, segmenting the acquired similarity interval, extracting a preset number of entity pairs in each acquired segment, and labeling the acquired entity pairs;
selecting sample data from the first data table and the second data table respectively, obtaining the hash code of each sample datum, calculating the similarity of the two hash codes, calculating the post-bucketing loss value, and assigning the data with the minimum loss value to the same bucket.
Optionally, the method further includes:
and adjusting the weight of the data type in the process of calculating the similarity.
Optionally, the determining whether the data in each bucket refers to the same entity or not, and performing data fusion on the data referring to the same entity to obtain a data table formed by fused data includes:
judging whether the data belong to the same entity or not according to the entity name in each data in the bucket;
and fusing the data belonging to the same entity according to the same attribute to obtain a fused data table.
The technical scheme provided by the application brings the following beneficial effect:
the method adopts a Word vector (Word Embedding) mode to model the character string data, can simultaneously model the text and the semantics of the character string, and improves the tolerance of the dirty data.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic flow chart of a deep learning-based multi-source data deep fusion method according to the present application;
FIG. 2 is a detailed flow chart of step 12 described herein;
FIG. 3 is a detailed flow chart of step 13 described herein;
FIG. 4 is a diagram of the model training architecture of the two-tower structure model described herein.
Detailed Description
To make the structure and advantages of the present application clearer, the structure of the present application will be further described with reference to the accompanying drawings.
Example one
Specifically, the embodiment of the present application provides a deep learning-based multi-source data deep fusion method, as shown in fig. 1, including:
11. acquiring a relational data table to be fused including a first data table and a second data table;
12. constructing a deep learning model, introducing training data into the deep learning model, carrying out word vectorization processing on the contents in the relational data table to be fused, and carrying out pattern matching on the processed data;
13. hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structural model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the bucketing model;
14. judging whether the data in each bucket refer to the same entity, and fusing the data that refer to the same entity to obtain a data table composed of fused data.
In implementation, in particular, multi-source data fusion involves the following steps:
the first step is Schema Matching (Schema Matching), names of schemas (attributes) of different source data may be different, for example, one article has a related data source attribute "title", another data source has an attribute "article name", these two attributes refer to the same Schema although the names are different, and data fusion requires that the attributes are aligned first.
The second step is data bucketing (Blocking). If N pieces of data need to be matched, the complexity of pairwise entity matching is O(N²), which is unacceptable for big data. The application therefore buckets the data with a deep learning method and performs entity matching only within each bucket, greatly reducing the complexity of matching.
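The reduction in comparisons can be illustrated with a crude blocking key (a simple stand-in for the learned bucketing described here; records and key rule are made up):

```python
from itertools import combinations

records = [
    {"id": 1, "name": "iPhone 12 128GB"},
    {"id": 2, "name": "iphone 12, 128 GB"},
    {"id": 3, "name": "Galaxy S21"},
    {"id": 4, "name": "galaxy s21 5g"},
]

def block_key(r):
    # Crude stand-in for a learned hash code: first token, lowercased.
    return r["name"].lower().split()[0].strip(",")

buckets = {}
for r in records:
    buckets.setdefault(block_key(r), []).append(r)

# Candidate pairs are generated only within each bucket,
# instead of all C(N, 2) pairs over the whole data set.
pairs = [p for b in buckets.values() for p in combinations(b, 2)]
```

Here 4 records yield 2 candidate pairs instead of the 6 exhaustive ones; the learned bucketing plays the role of `block_key` for dirty data where no simple rule works.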
The third step is entity matching. After schema matching and data bucketing, existing methods usually solve entity matching with rules or traditional machine learning; such methods usually lack generality and handle dirty data poorly. The application therefore models entity pairs with a deep learning method and matches the entity pairs that refer to the same real-world entity.
For dirty data, the method models string data with word vectors (Word Embedding), capturing both the text and the semantics of each string and thereby improving tolerance to dirty data. Further, in schema (attribute) matching, considering only the attribute itself is insufficient; the contents of the attribute column are considered as well, and the attributes and contents of the multiple data sources are modeled uniformly with word vectors. For data bucketing, conventional methods are usually rule-based, lack generality, and perform poorly on dirty data; the application instead learns the buckets end-to-end with deep learning. Entity matching suffers from the same dirty-data problem, which is likewise addressed with deep learning.
Optionally, as shown in fig. 2, step 12 specifically includes:
121. carrying out data-based labeling on the relational data table to be fused to obtain a labeled data set containing whether the data are matched or not;
122. recoding the marked data set to complete word vectorization processing of the marked data set;
123. and importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on attributes, attribute values and topics, and performing data matching based on calculation results.
In practice, step 121 is embodied as a process for preparing training data.
Training data is essential because a deep learning model must be trained. The training data can be obtained in two ways. One is to use the annotation of an existing public data set: the pattern-matching annotations are labeled examples of the form (r, s, y), where r denotes the ith attribute value of the jth tuple in one relation table and s the corresponding value in the other; y is the data to be annotated, with 0 representing a match and 1 a mismatch. The other is used when no ground truth is available: the pattern-matching annotation y is obtained from human annotators.
When the amount of data is small, a few experts may be asked to perform the annotation. When the data volume is large, finding enough experts is costly, and the labels can instead be obtained by crowdsourcing. Crowdsourcing uses the power of the internet for problems, such as data annotation, that are difficult for computers to solve automatically. The labeling result is usually determined by majority voting: an annotation question is given to multiple workers, say five, and if the majority of the returned answers indicate a match, the attribute pair is treated as matched; otherwise it is not.
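The majority-vote rule can be sketched in a few lines (assuming, as in the labeling scheme above, that 0 denotes a match and 1 a mismatch):

```python
from collections import Counter

def majority_label(answers):
    # Return the most frequent answer among the crowd workers.
    # 0 = match, 1 = mismatch, as in the labeling scheme above.
    return Counter(answers).most_common(1)[0][0]

# Five workers answer one annotation question; three say "match".
label = majority_label([0, 0, 1, 0, 1])
```

Using an odd number of workers (five here) avoids ties in the binary case.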
The specific content of step 122 is a word vectorization process.
Word vectors are a fundamental tool of natural-language modeling. A word vector maps a word or phrase in a vocabulary to a vector of real numbers; many generation methods exist, including neural networks, PCA dimensionality reduction, and probabilistic models. In this step, each attribute value in the two tables to be matched is encoded into a d-dimensional vector. Note in particular that for words not present in the vocabulary (OOV words), the special token UNK is usually used instead.
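A minimal sketch of the vocabulary lookup with a shared UNK vector (the table, words, and dimension are toy values, not a trained embedding):

```python
import random

random.seed(0)
d = 4  # embedding dimension (tiny, for illustration only)
vocab = ["beijing", "tsinghua", "university"]
# Hypothetical lookup table: a random d-dimensional vector per known word.
table = {w: [random.random() for _ in range(d)] for w in vocab}
table["UNK"] = [0.0] * d  # one shared vector for all OOV words

def embed(word):
    # Known words get their own vector; unknown words fall back to UNK.
    return table.get(word.lower(), table["UNK"])

vecs = [embed(w) for w in "Tsinghua Univ".split()]
```

Note that "Univ" maps to UNK here, which is exactly the information loss the character-level embeddings discussed next are meant to avoid.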
Word embedding may operate in units of words or in finer-grained units of characters. In the latter case, the input to the model is not a word; instead, each word's characters are fed through a neural network to generate a d-dimensional vector. The advantage is that the internal structure of words can be captured, especially for compound words such as "kindness", composed of "kind" and "ness". In addition, working in units of characters handles the OOV problem well.
In addition, whether to use a pre-trained model or to train word vectors from scratch is a further choice. Pre-trained options include word-level models (word2vec, GloVe) and character-level models (fastText). A pre-trained model has two advantages: it greatly shortens training time, and, having been trained on a large corpus, it is robust. The application lets the user freely decide between word-level and character-level vectors, and between a pre-trained model and training from scratch.
The specific content of step 123 is a building model training process.
The present application uses deep neural networks to build training models.
First, the input to the network consists of the word vectors of the two columns of data to be matched, comprising both attribute information and attribute-value information: vectors describing the text and semantics of the attribute names, and vectors describing the text and semantics of the attribute values. The application models these with a two-tower model.
In addition, the topic corresponding to an attribute is also important for matching attributes. The application therefore also employs a topic model to mine a topic vector for each column from its attribute values. The model makes its prediction from the similarity of the attributes, the attribute values, and the topics.
Cross entropy between the predicted value and the ground-truth label y is used as the loss function, and the model is trained by back-propagation and stochastic gradient descent.
Specifically, the application concatenates the attribute vector and the topic vector to form a new feature vector, where the topic vector is solved automatically by an LDA model from the attribute values.
The new features corresponding to the two attributes then interact: a similarity calculation layer computes the similarity between the two feature vectors, for which a fixed distance function such as cosine similarity or Euclidean distance may be chosen.
The loss is computed through a fully connected layer and a classification layer, and back-propagated to update the parameters of the neural network. At prediction time, the two attributes and their corresponding values are first vectorized along the forward-propagation path and then classified with the learned parameters, yielding the probability that the two attributes match. Finally, the matching combination with the maximum probability between the first data table and the second data table is taken as the final matching result.
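The final selection step — keep the maximum-probability combination of attribute pairs — can be sketched with a toy probability matrix (the greedy rule below and all values are illustrative stand-ins, not the exact procedure of the application):

```python
def best_matches(prob):
    # prob[i][j]: predicted probability that attribute i of the first table
    # matches attribute j of the second table. Greedily pick the
    # highest-probability pairs without reusing any attribute.
    ranked = sorted(
        ((p, i, j) for i, row in enumerate(prob) for j, p in enumerate(row)),
        reverse=True,
    )
    used_a, used_b, result = set(), set(), {}
    for p, i, j in ranked:
        if i not in used_a and j not in used_b:
            result[i] = j
            used_a.add(i)
            used_b.add(j)
    return result

# Hypothetical classifier outputs for a 2x2 attribute grid.
prob = [[0.9, 0.2],
        [0.3, 0.8]]
matches = best_matches(prob)
```

An optimal assignment (e.g. the Hungarian algorithm) could replace the greedy pass when the attribute count is large.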
Optionally, as shown in fig. 3, the specific content of step 13 includes:
131. acquiring a similarity interval formed by the similarities of all entities, segmenting the acquired similarity interval, extracting a preset number of entity pairs in each acquired segment, and labeling the acquired entity pairs;
132. selecting sample data from the first data table and the second data table respectively, obtaining the hash code of each sample datum, calculating the similarity of the two hash codes, calculating the post-bucketing loss value, and assigning the data with the minimum loss value to the same bucket.
In practice, the reason for data bucketing is:
(1) After discretization, inner products of sparse vectors are faster to compute, the results are easy to store, and the approach scales easily.
(2) Discrete features are more robust to outliers; for example, if age > 30 maps to 1 and otherwise to 0, an outlier such as age = 200 does not greatly disturb the model.
(3) Logistic regression (LR) is a generalized linear model with limited expressive power; after discretization, each variable has its own weight, which is equivalent to introducing nonlinearity and improves the model's expressive power and fit.
(4) After discretization, features can be crossed, turning M + N variables into M × N variables, further introducing nonlinearity and improving expressive power.
(5) After discretization, the model is more stable; for example, a user's age interval does not change merely because the user grows slightly older.
(6) Missing values can be introduced into the model as a separate category.
(7) All variables are transformed to similar scales.
Bucketing methods divide into unsupervised and supervised ones. Common unsupervised methods include equal-frequency bucketing, equal-width bucketing, and cluster-based bucketing; supervised methods mainly include best-KS bucketing and chi-square bucketing.
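As a minimal contrast between the two most common unsupervised schemes, the sketch below implements equal-width and equal-frequency bucketing from scratch (purely illustrative; this is not the application's learned bucketing):

```python
def equal_width(values, k):
    # Split the value range [min, max] into k intervals of equal width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency(values, k):
    # Assign values to k buckets so each bucket gets roughly equal counts.
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * k // len(values)
    return buckets

vals = [1, 2, 3, 4, 100]
ew = equal_width(vals, 2)      # the outlier 100 dominates the range
ef = equal_frequency(vals, 2)  # counts stay balanced despite the outlier
```

The example shows why rule-based bucketing is brittle on dirty data: a single outlier collapses the equal-width scheme into one crowded bucket.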
Step 131: training data is prepared. For data tablesAssume it contains N tuples(ii) a Another watchAssume it contains M tuples(ii) a Each tuple contains m attributes and the attributes between the two tables are aligned. The data bucket separation or data fusion is characterized in that the number of matched entity pairs in the training data is small, and the number of unmatched entity pairs is large, so that the problem of sample imbalance can be caused, and the training effect is deviated. Therefore, how to select the training data is one of the important points of attention in the present application. Specifically, the method and the device can perform hierarchical sampling according to the similarity between the entities so as to achieve the aim of training set balance. Generally, the reason why there are many pairs of unmatched entities is that there are many entities with low similarity. Thus, the similarity of all similar entities can be calculated (at [0,1 ]]Within an interval) scores. Then [0,1 ] is mixed]Interval segments, say 10 segments, extract a certain number of entity pairs within each segment. For the selected entity pair, if the selected entity pair is a public standard data set, the method directly uses the public labeling or manual labeling manner shown above, which is not described herein again.
Step 131: and (5) building model training. A model of a double tower structure was also used as shown in fig. 4. The model inputs at this point are different entitiesNeed to be provided withIs coded into. It is to be noted that eachThe method includes a plurality of attributes, and the word vector corresponding to each attribute may be obtained by the method in step 122, and then the word vectors of the attributes are integrated to obtain the word vector representation of the entire entity. There are many common methods that can be selected, and the present application can provide many methods related to the existing common natural language processing techniques, such as the method of vector direct sum, the method of recurrent neural network such as LSTM, and the method with attention mechanism. After the features of each entity are obtained, for convenience of bucket splitting, the feature vector is connected with a hash layer, namely a vector consisting of 0 and 1. The reason for this is to facilitate binning, i.e., each identical hash string represents a bucket, where matching tuples are expected to be grouped into one bucket, where unmatched entities are in different buckets, and where hash-coding distances between similar entities are relatively short. Therefore, a similarity calculation layer is constructed to calculate the similarity of two hash codes, and then a classification layer is used to calculate a loss function so as to meet the function of the model. After the model is trained, for the newly arriving entity, it is propagated forward through the network, and the matched entity pair will be sorted into a bucket.
Optionally, the method further includes:
and adjusting the weight of the data type in the process of calculating the similarity.
In practice, if the bucketing operation of step 132 has a high recall requirement, a moderate number of unmatched entities may also share a bucket, but matched entities must end up in the same bucket. Therefore, during model training, the weight of the matched training labels is increased.
Optionally, the specific content of step 14 includes:
judging whether the data belong to the same entity or not according to the entity name in each data in the bucket;
and fusing the data belonging to the same entity according to the same attribute to obtain a fused data table.
In implementation, after schema matching and data bucketing are complete, it must be determined whether entity pairs in the same bucket refer to the same entity. The application again employs a deep neural network to solve the entity-matching problem.
Model building, training, and prediction: each attribute tuple of each entity is encoded into a vector; a two-tower network is built; the vectors corresponding to the different attributes are integrated by direct vector summation or by a recurrent-neural-network method such as LSTM; a similarity calculation layer is constructed; and finally the classification loss is computed.
The similarity calculation layer closely resembles the network structure used for data bucketing, so the idea of transfer learning is adopted to reuse that structure and accelerate training. Samples can be obtained and drawn in the same way as before.
An example of the algorithm:
inputting: relational data sheetAndrespectively with m attributesAndrespectively including a plurality of entities at the same timeAnd。
1) Constructing attribute pairs to be matched;
2) acquiring the labeling data of the attribute pair;
5) Constructing a pattern matching model for training;
6) sampling to generate entity pair training data;
7) Training a data bucket model;
8) training an entity matching model;
11) performing data fusion by using a data fusion model;
The sequence numbers in the above embodiments are merely for description, and do not represent the sequence of the assembly or the use of the components.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (8)
1. The multi-source data depth fusion method based on deep learning is characterized by comprising the following steps:
acquiring a relational data table to be fused including a first data table and a second data table;
constructing a deep learning model, introducing training data into the deep learning model, carrying out word vectorization processing on the contents in the relational data table to be fused, and carrying out pattern matching on the processed data;
hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structural model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the bucketing model;
and judging whether the data in each bucket refer to the same entity, and fusing the data that refer to the same entity to obtain a data table composed of fused data.
2. The deep learning-based multi-source data deep fusion method of claim 1, wherein the deep learning model is built, training data is introduced into the deep learning model, word vectorization processing is performed on contents in a relational data table to be fused, and pattern matching is performed on the processed data, and the method comprises the following steps:
carrying out data-based labeling on the relational data table to be fused to obtain a labeled data set containing whether the data are matched or not;
recoding the marked data set to complete word vectorization processing of the marked data set;
and importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on attributes, attribute values and topics, and performing data matching based on calculation results.
3. The deep learning-based multi-source data deep fusion method of claim 2, wherein the data-based labeling is performed on the relational data table to be fused to obtain a labeled data set containing whether data are matched, and the method comprises the following steps:
4. The deep learning-based multi-source data deep fusion method of claim 2, wherein the recoding of the labeled data set to complete word vectorization processing of the labeled data set comprises:
for words not present in the vocabulary (out-of-vocabulary words), the special token UNK is used instead.
5. The deep learning-based multi-source data deep fusion method of claim 2, wherein the step of importing the word vectors subjected to word vectorization into a deep learning model, performing similarity calculation based on three aspects of attributes, attribute values and topics, and performing data matching based on calculation results comprises:
mining a theme vector of each column according to the attribute value by adopting a theme model, and predicting according to the similarity of the attribute, the attribute value and the theme;
vectorizing the two attributes and the corresponding values, classifying according to the learned parameters, and calculating the probability of matching the two attributes;
finally, a matching combination which is matched to have the maximum probability is found between the first data table and the second data table to serve as a final matching result.
6. The deep learning-based multi-source data deep fusion method of claim 1, wherein hierarchically sampling the data in the relational data tables to be fused based on the similarity between the entities corresponding to the data, importing the sampled data into a preset structure model for word-vector-based integration to obtain a trained data bucketing model, and performing entity-based data bucketing based on the bucketing model, comprises the following steps:
acquiring a similarity interval formed by the similarities of all entities, segmenting the acquired similarity interval, extracting a preset number of entity pairs in each acquired segment, and labeling the acquired entity pairs;
the method comprises the steps of selecting sample data from a first data table and a second data table respectively, obtaining Hash codes of each sample data, calculating the similarity of the two Hash codes, calculating loss values after bucket division, and dividing the data with the minimum loss values into the same bucket division.
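A minimal sketch of the two operations in claim 6, under assumed simplifications: similarities lie in the unit interval, Hamming distance stands in for the bucketing loss, and all function names are illustrative:

```python
# (1) Stratified sampling of entity pairs over segments of the similarity
# interval, and (2) assigning a record to whichever bucket minimizes the
# distance between its hash code and the bucket's representative code.
import random

def stratified_sample(pairs_with_sim, n_segments=5, per_segment=2, seed=0):
    """pairs_with_sim: list of ((entity_a, entity_b), similarity in [0, 1]).
    Returns up to per_segment pairs from each similarity segment."""
    rng = random.Random(seed)
    segments = [[] for _ in range(n_segments)]
    for pair, sim in pairs_with_sim:
        idx = min(int(sim * n_segments), n_segments - 1)
        segments[idx].append(pair)
    sample = []
    for seg in segments:
        rng.shuffle(seg)
        sample.extend(seg[:per_segment])
    return sample

def hamming(code_a, code_b):
    """Number of differing bits between two equal-length hash codes."""
    return sum(x != y for x, y in zip(code_a, code_b))

def assign_bucket(code, bucket_codes):
    """Pick the bucket index whose representative hash code is closest to
    this record's code (the 'minimum loss' bucket)."""
    return min(range(len(bucket_codes)), key=lambda i: hamming(code, bucket_codes[i]))
```

Sampling across all similarity segments gives the bucketing model both easy and hard entity pairs to train on, which is the point of the stratification.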
7. The deep learning-based multi-source data deep fusion method of claim 3, further comprising:
adjusting the weights of the data types during the similarity calculation.
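Claim 7 only states that data-type weights are adjustable during similarity calculation; one plausible reading, with assumed type names and weights, is a per-type rescaling of each field's similarity contribution:

```python
# Hypothetical weighted similarity: each field's similarity is scaled by a
# weight looked up from its data type, then normalized by the total weight.
def weighted_similarity(field_sims, field_types, type_weights):
    """field_sims: per-field similarities in [0, 1]; field_types: the data
    type of each field; type_weights: data type -> assumed weight."""
    weights = [type_weights.get(t, 1.0) for t in field_types]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, field_sims)) / total
```

Raising the weight of a type (say, text fields over numeric ones) makes agreement on that type count more toward the overall match score.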
8. The deep learning-based multi-source data deep fusion method of claim 1, wherein determining whether the data in each bucket refer to the same entity and fusing the data that refer to the same entity to obtain a data table of fused data comprises:
judging whether data in a bucket belong to the same entity according to the entity name in each record;
fusing the data belonging to the same entity attribute by attribute to obtain the fused data table.
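The fusion step of claim 8 can be sketched as follows, under an assumed record layout (dicts carrying an `entity` key); the keep-first-non-empty merge rule is an assumption, not something the claim specifies:

```python
# Hypothetical fusion: records sharing an entity name are grouped, and their
# attributes are merged into one record per entity, keeping the first
# non-empty value seen for each attribute.
def fuse_records(records):
    """records: list of dicts, each containing an 'entity' key.
    Returns one merged dict per distinct entity."""
    fused = {}
    for rec in records:
        entity = rec["entity"]
        merged = fused.setdefault(entity, {"entity": entity})
        for key, value in rec.items():
            if key != "entity" and value not in (None, ""):
                merged.setdefault(key, value)  # keep the first non-empty value
    return list(fused.values())
```

A production system would also need conflict resolution when two sources disagree on the same attribute, which this sketch sidesteps.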
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010914905.5A CN111767325B (en) | 2020-09-03 | 2020-09-03 | Multi-source data deep fusion method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111767325A true CN111767325A (en) | 2020-10-13 |
CN111767325B CN111767325B (en) | 2020-11-24 |
Family
ID=72729245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010914905.5A Active CN111767325B (en) | 2020-09-03 | 2020-09-03 | Multi-source data deep fusion method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111767325B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341220A (en) * | 2017-06-28 | 2017-11-10 | Alibaba Group Holding Ltd. | Multi-source data fusion method and device |
CN109308311A (en) * | 2018-09-05 | 2019-02-05 | Guangzhou Xiaonan Technology Co., Ltd. | Multi-source heterogeneous data fusion system |
CN110110082A (en) * | 2019-04-12 | 2019-08-09 | Huang Hongmei | Multi-source heterogeneous data fusion optimization method |
CN110515926A (en) * | 2019-08-28 | 2019-11-29 | State Grid Tianjin Electric Power Company | Method for organizing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254641A (en) * | 2021-05-27 | 2021-08-13 | The 15th Research Institute of China Electronics Technology Group Corporation | Information data fusion method and device |
CN113254641B (en) * | 2021-05-27 | 2021-11-16 | The 15th Research Institute of China Electronics Technology Group Corporation | Information data fusion method and device |
CN113609715A (en) * | 2021-10-11 | 2021-11-05 | Shenzhen Aoya Design Co., Ltd. | Multi-model data fusion method and system in a digital twin context |
CN113609715B (en) * | 2021-10-11 | 2022-02-22 | Shenzhen Aoya Design Co., Ltd. | Multi-model data fusion method and system in a digital twin context |
CN114153839A (en) * | 2021-10-29 | 2022-03-08 | Hangzhou Weiming Information Technology Co., Ltd. | Multi-source heterogeneous data integration method, apparatus, device, and storage medium |
CN114997419A (en) * | 2022-07-18 | 2022-09-02 | Beijing Trusfort Technology Co., Ltd. | Scorecard model updating method and device, electronic device, and storage medium |
CN116303392A (en) * | 2023-03-02 | 2023-06-23 | Chongqing Planning and Natural Resources Information Center | Multi-source data table management method for real estate registration data |
CN116303392B (en) * | 2023-03-02 | 2023-09-01 | Chongqing Planning and Natural Resources Information Center | Multi-source data table management method for real estate registration data |
Also Published As
Publication number | Publication date |
---|---|
CN111767325B (en) | 2020-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111767325B (en) | Multi-source data deep fusion method based on deep learning | |
CN108573411B (en) | Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments | |
CN109271529B (en) | Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian | |
CN111325029B (en) | Text similarity calculation method based on deep learning integrated model | |
CN109271506A (en) | Construction method of a deep-learning-based knowledge graph question answering system for the power communication field | |
CN103336852B (en) | Cross-language ontology construction method and device | |
CN117171333B (en) | Electric power file question-answering type intelligent retrieval method and system | |
CN110888991A (en) | Sectional semantic annotation method in weak annotation environment | |
CN112417170B (en) | Relationship linking method for incomplete knowledge graph | |
CN116127090B (en) | Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction | |
CN112632250A (en) | Question and answer method and system under multi-document scene | |
CN115563313A (en) | Knowledge graph-based document book semantic retrieval system | |
CN112036178A (en) | Distribution network entity related semantic search method | |
Başarslan et al. | Sentiment analysis on social media reviews datasets with deep learning approach | |
CN112364132A (en) | Similarity calculation model and system based on dependency syntax and method for building system | |
CN116244448A (en) | Knowledge graph construction method, device and system based on multi-source data information | |
CN114238653A (en) | Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education | |
CN115390806A (en) | Software design mode recommendation method based on bimodal joint modeling | |
CN114443846B (en) | Classification method and device based on multi-level text different composition and electronic equipment | |
CN117973519A (en) | Knowledge graph-based data processing method | |
CN111666374A (en) | Method for integrating additional knowledge information into deep language model | |
Suresh et al. | Data mining and text mining—a survey | |
CN114626367A (en) | Sentiment analysis method, system, equipment and medium based on news article content | |
Bao et al. | HTRM: a hybrid neural network algorithm based on tag-aware | |
CN111581326A (en) | Method for extracting answer information based on heterogeneous external knowledge source graph structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||