CN114153839A - Integration method, device, equipment and storage medium of multi-source heterogeneous data

Info

Publication number: CN114153839A
Application number: CN202111274821.0A
Authority: CN
Inventors: 龚小龙; 郑聪; 单超炳; 麻志毅
Original assignee: Advanced Institute of Information Technology AIIT of Peking University; Hangzhou Weiming Information Technology Co Ltd
Current assignee: Advanced Institute of Information Technology AIIT of Peking University; Hangzhou Weiming Information Technology Co Ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-03-08

Abstract

The application discloses a method, apparatus, device and storage medium for integrating multi-source heterogeneous data. The method comprises: preprocessing the acquired data table to obtain a preprocessed data table; matching the preprocessed data table according to a pre-trained matching model and a linear programming algorithm to obtain a matching relation of the data table; and performing data fusion according to the matching relation of the data table and a pre-trained fusion model to obtain fused data. The data integration method provided by the embodiment of the application can break down data islands and interconnect heterogeneous data sources; it analyzes the semantic structure of a system database and quickly assists in reconstructing and correcting the database dictionary, the data dictionary, manual data and the like; and the model suits a variety of application scenarios and efficiently assists manual data screening, which makes large-scale data table search and the fusion of incomplete data tables feasible.

Description

Integration method, device, equipment and storage medium of multi-source heterogeneous data

Technical Field

The invention relates to the technical field of data processing, in particular to an integration method, device, equipment and storage medium of multi-source heterogeneous data.

Background

In recent decades, the rapid development of science and technology and the continuing spread of informatization have led to exponential growth in data volume in many fields, such as government, enterprise and healthcare. Mining and analyzing massive data yields a large amount of valuable information that can better empower business applications and support the development of new ones. However, the data in an organization's internal information systems, the data shared between organizations, and the public data collected from the internet are all fragmented: scattered, heterogeneous and of low quality. Inside an organization, the data of different departments are stored separately because of the organizational structure, and the data are presented in many different forms because the level of informatization differs from unit to unit. The data are independent and closed, so they cannot be effectively interconnected, and serious information islands result. Between enterprises, differences in the physical storage and processing techniques of the source data make shared data less reliable, because an application or its physical storage may change. Data integration is therefore urgently needed before the value of the data can be mined in depth.

At present, most research on data integration targets specific scenarios. A deep learning model has to be trained specifically for each scenario and needs a large number of labels, so such models do not generalize across scenarios when only a small amount of labeled data is available. Scattered files, folders and databases cannot be interconnected. Only structured data can be input, and multi-source data input is not supported.

In summary, organizations that lack technical talent and application tools, such as industrial manufacturing enterprises and small, medium and micro enterprises, urgently need solutions to the following problems: 1. how to break down data islands and interconnect heterogeneous data sources, where data islands include but are not limited to scattered files, folders and databases; 2. how to analyze the semantic structure of data, including database dictionaries, data dictionaries, manual data and the like; 3. how to produce a data integration model that is easy to operate, including applicability to multiple application scenarios and a low technical threshold for use.

Disclosure of Invention

The embodiment of the application provides an integration method, device and equipment of multi-source heterogeneous data and a storage medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

In a first aspect, an embodiment of the present disclosure provides an integration method of multi-source heterogeneous data, including:

preprocessing the acquired data table to obtain a preprocessed data table;

matching the preprocessed data table according to the pre-trained matching model and a linear programming algorithm to obtain a matching relation of the data table;

and performing data fusion according to the matching relation of the data table and the pre-trained fusion model to obtain fused data.

In an optional embodiment, matching the preprocessed data table according to a pre-trained matching model and a linear programming algorithm to obtain a matching relationship of the data table, including:

inputting the preprocessed data table into a pre-trained first matching model for pattern matching to obtain a first similarity matrix;

and obtaining the matching relation of the data table according to the first similarity matrix and a linear programming algorithm.

inputting the preprocessed data table into a pre-trained second matching model for pattern matching to obtain a second similarity matrix;

obtaining a matching relation and a matching score of the data table according to the second similarity matrix and a linear programming algorithm;

and sorting from high to low according to the matching scores to obtain the matching relation of the sorted data table.

In an optional embodiment, performing data fusion according to the matching relationship of the data table and the pre-trained fusion model to obtain fused data includes:

performing queue association according to the matching relation of the data table to obtain an associated data queue;

and inputting the associated data queue into a pre-trained fusion model to obtain fused data.

In an optional embodiment, after obtaining the preprocessed data table, the method further includes:

training a first matching model, a second matching model and a fusion model;

the neural network structure of the first matching model, the second matching model and the fusion model is an improved GAE (graph autoencoder) network structure, the improved GAE network structure comprises an encoding layer and a decoding layer (the "reverse encoding layer"), the encoding layer comprises a GNN network and an FC network which are sequentially connected, and the decoding layer comprises an FC network.

In an optional embodiment, after obtaining the fused data, the method further includes:

performing statistical analysis on the fused data to obtain type information, missing value information, memory size information, minimum value information, maximum value information, median information, standard deviation information, coefficient of variation information, histogram information and correlation heatmap information of the data;

and automatically generating a data analysis report according to the analysis result.

In an optional embodiment, the preprocessing the acquired data table, and obtaining the preprocessed data table includes:

acquiring a data table of multi-source heterogeneous data;

and performing data filling, abnormal value processing and format conversion on the acquired data table to obtain a preprocessed data table.

In a second aspect, an embodiment of the present disclosure provides an integrated apparatus for multi-source heterogeneous data, including:

the data preprocessing module is used for preprocessing the acquired data table to obtain a preprocessed data table;

the intelligent data association module is used for matching the preprocessed data table according to the pre-trained matching model and the linear programming algorithm to obtain the matching relation of the data table;

and the intelligent data fusion module is used for carrying out data fusion according to the matching relation of the data table and the pre-trained fusion model to obtain fused data.

In a third aspect, an embodiment of the present disclosure provides an integration apparatus for multi-source heterogeneous data, including a processor and a memory storing program instructions, where the processor is configured to execute the integration method for multi-source heterogeneous data provided in the foregoing embodiment when executing the program instructions.

In a fourth aspect, the disclosed embodiments provide a computer-readable medium, on which computer-readable instructions are stored, where the computer-readable instructions are executed by a processor to implement a method for integrating multi-source heterogeneous data provided in the foregoing embodiments.

The technical scheme provided by the embodiment of the application can have the following beneficial effects:

the integration method of multi-source heterogeneous data provided by the embodiment of the application can break down data islands and process multi-source heterogeneous data intelligently; it supports access to different heterogeneous data sources, including scattered files, shared folders and the like, and resolves input data consistency problems, including data format, abnormal data and data filling; it analyzes the semantic structure of the system database and quickly assists in reconstructing and correcting the database dictionary, the data dictionary and manual data. The model suits a variety of application scenarios and efficiently assists manual data screening, which makes large-scale data table search and the fusion of incomplete data tables feasible.

Furthermore, for newly fused data, the system analyzes the data with an intelligent perspective function to find correlated data and abnormal data, and automatically constructs analysis charts to display the results, which greatly reduces the manpower and effort required for data-driven decisions.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow diagram illustrating a method for integrating multi-source heterogeneous data in accordance with an exemplary embodiment;

FIG. 2 is another flow diagram illustrating a method for integration of multi-source heterogeneous data in accordance with an exemplary embodiment;

FIG. 3 is a schematic flow diagram illustrating a method of model training in accordance with an exemplary embodiment;

FIG. 4 is a schematic diagram illustrating the structure of a GAE network in accordance with an exemplary embodiment;

FIG. 5 is a block diagram illustrating an integrated device for multi-source heterogeneous data according to an exemplary embodiment;

FIG. 6 is another block diagram illustrating an integrated device for multi-source heterogeneous data according to an example embodiment;

FIG. 7 is a block diagram illustrating an integrated device for multi-source heterogeneous data, according to an example embodiment;

FIG. 8 is a schematic diagram illustrating a computer storage medium in accordance with an exemplary embodiment.

Detailed Description

The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.

It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with certain aspects of the invention, as detailed in the appended claims.

The embodiment of the application provides an intelligent data integration method consisting of five parts: 1. multi-source heterogeneous data input; 2. intelligent data preprocessing; 3. intelligent data association; 4. intelligent data fusion; 5. intelligent data perspective. The data integration method processes multi-source heterogeneous data intelligently, supports access to different heterogeneous data sources, including scattered files, shared folders and the like, and resolves input data consistency problems, including data format, abnormal data and data filling. It can analyze the semantic structure of an information system's database and quickly assist in reconstructing and correcting the database dictionary, the data dictionary and manual data. It also provides an adaptable, industry-level pre-trained model that generalizes well and suits multiple scenarios within an industry; input data can be computed quickly and results fed back promptly, which supports agile application development.

The method for integrating multi-source heterogeneous data provided by the embodiment of the application is described in detail below with reference to the accompanying drawings. Referring to fig. 1, the method specifically includes the following steps.

S101, preprocessing the acquired data table to obtain a preprocessed data table.

Specifically, the accessed data table is obtained first. The data integration method in the embodiment of the application supports access to multi-source heterogeneous data tables and can access data in various forms, such as scattered files, shared folders and databases.

Next, the acquired data table is preprocessed.

First, missing data are filled intelligently. In one possible implementation, the k-nearest neighbor (KNN) algorithm is used for data filling. The core idea of KNN is that if most of the k nearest neighbors of a sample in the feature space belong to a certain class, the sample also belongs to that class and shares the characteristics of the samples in that class. In making a classification decision, the method determines the class of the sample to be classified only from the classes of its nearest one or several samples, so the decision depends on only a very small number of adjacent samples. Because KNN determines the class mainly from a limited number of surrounding samples rather than by discriminating class domains, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily. The KNN algorithm can therefore use the correlation of the data across dimensions to fill and correct missing or abnormal values, achieving intelligent data filling.
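The KNN filling idea can be illustrated with a minimal sketch. It is not the patent's implementation: the function name `knn_impute`, the choice of k, and the distance (root mean squared difference over the columns two rows share) are assumptions made for illustration only.

```python
import math

def knn_impute(rows, k=2):
    """Fill None cells with the mean of that column over the k nearest rows.

    Distance between two rows is the root mean squared difference over the
    columns that both rows have observed (an assumption of this sketch).
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a neighbor must have the missing column observed
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical two-column table; the last row is missing its second value.
table = [
    [1.0, 10.0],
    [1.1, 11.0],
    [5.0, 50.0],
    [1.05, None],
]
result = knn_impute(table, k=2)
```

In this toy table the two rows closest to the incomplete row carry the values 10.0 and 11.0, so the missing cell is filled with their mean rather than being pulled toward the distant third row.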

In a possible implementation, other methods may be used to fill in missing values, for example a parametric statistical method, or filling with statistics computed from the data.

Further, abnormal value detection may be performed on the preprocessed data, for example detecting abnormal values with a Bayesian regularization method, or detecting and correcting them with an averaging method.
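The text does not specify how the averaging method works. One plausible reading, sketched below under that assumption, flags values that lie more than a chosen number of standard deviations from the column mean and replaces them with the mean of the remaining values. The function name `correct_outliers` and the threshold of 2.0 are hypothetical, and the Bayesian regularization variant is not shown.

```python
import statistics

def correct_outliers(values, z_threshold=2.0):
    """Flag values far from the mean and replace them with the inlier mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values), []  # constant column: nothing to flag
    flagged = [i for i, v in enumerate(values)
               if abs(v - mean) / std > z_threshold]
    inliers = [v for i, v in enumerate(values) if i not in flagged]
    if not inliers:
        return list(values), []
    replacement = sum(inliers) / len(inliers)
    corrected = [replacement if i in flagged else v
                 for i, v in enumerate(values)]
    return corrected, flagged

# Hypothetical sensor readings with one obviously abnormal value.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 100.0]
corrected, flagged = correct_outliers(readings)
```

A mean-based threshold is itself sensitive to extreme values, so a production system would likely prefer a more robust statistic; the sketch only illustrates the detect-then-correct flow.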

Further, format conversion may be performed on the preprocessed data, for example removing leading and trailing spaces, case conversion, time format conversion, and the like.
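A minimal sketch of the three conversions named above follows. The list of accepted date formats and the target format (ISO `YYYY-MM-DD`, lower case for text) are assumptions chosen for illustration; the patent does not fix them.

```python
from datetime import datetime

# Hypothetical set of input date formats to recognize.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")

def normalize_cell(text):
    """Strip surrounding spaces, unify date formats, and lower-case other text."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # not a date in this format, try the next one
    return cleaned.lower()

normalized = [normalize_cell(c) for c in ["  Hangzhou ", "2021/10/29", " 29.10.2021"]]
```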

Through this step, the accessed multi-source heterogeneous data are preprocessed into tidy data with a uniform format, which facilitates subsequent processing.

S102, matching the preprocessed data table according to the pre-trained matching model and the linear programming algorithm to obtain the matching relation of the data table.

In a possible implementation manner, matching the preprocessed data tables to obtain a matching relationship of the data tables includes: inputting the preprocessed data table into a pre-trained first matching model for pattern matching, matching the target data table with other data tables one by one through the first matching model, and outputting a first similarity matrix, wherein the first similarity matrix refers to a similarity matrix of columns in the table.

Further, the first similarity matrix is input into a linear programming algorithm, for example the Hungarian algorithm, to obtain the matching relation between columns in the data tables. The matching relation between columns refers to the relation between the column fields of the data table to be matched and the column fields of the matched data table, and can be represented by a score.
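Turning a similarity matrix into a one-to-one column matching is an assignment problem. The sketch below solves it by exhaustive search over permutations, which reaches the same optimum the Hungarian algorithm would but is only practical for small matrices; it is a stand-in for illustration, not the Hungarian algorithm itself. The function name and the example matrix are hypothetical.

```python
from itertools import permutations

def best_column_matching(similarity):
    """Return (row, col, score) triples maximizing the total similarity.

    similarity[i][j] is the score between column i of the table to be matched
    and column j of the candidate table; rows must not outnumber columns.
    """
    n_rows, n_cols = len(similarity), len(similarity[0])
    best_total, best_assignment = float("-inf"), None
    for cols in permutations(range(n_cols), n_rows):
        total = sum(similarity[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_assignment = total, cols
    return [(i, j, similarity[i][j]) for i, j in enumerate(best_assignment)]

# Hypothetical similarity matrix as a matching model might output it.
similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.10, 0.20],
    [0.20, 0.30, 0.70],
]
matches = best_column_matching(similarity)
```

The example is chosen so that picking the largest score first (0.90) would force a poor match for the second column; the global optimum instead pairs column 0 with column 1 and column 1 with column 0, which is why an assignment solver is used rather than a greedy pick.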

Optionally, the preprocessed data table may be input into a pre-trained second matching model for pattern matching; the second matching model matches the data tables and outputs a second similarity matrix, where the second similarity matrix is the table-to-table similarity matrix.

Further, the second similarity matrix is input into a linear programming algorithm, such as the Hungarian algorithm, to obtain the matching relation and matching scores of the data tables. The matching relation shows which data table a given data table A matches, and the matching scores between data table A and the other tables are derived from the similarities.

The matching relations are then sorted from high to low by matching score to obtain the sorted matching relation of the data tables. In an exemplary scenario, the names of the tables in the database that match table A, their matching scores and their corresponding ranks are output, with the matched tables ordered from the highest matching score to the lowest.
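The ranked output described for table A can be sketched as follows. The table names and scores are made up for illustration, and `rank_matches` is a hypothetical helper.

```python
def rank_matches(scores):
    """Sort candidate tables by matching score, highest first, and add ranks."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        {"table": name, "score": score, "rank": rank}
        for rank, (name, score) in enumerate(ordered, start=1)
    ]

# Hypothetical matching scores of table A against three other tables.
scores_for_table_a = {"orders_2020": 0.72, "customers": 0.35, "orders_2021": 0.91}
ranking = rank_matches(scores_for_table_a)
```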

Typically, before matching the data table, training the first matching model or the second matching model is also included.

Specifically, a public industry data set is obtained, for example data from a knowledge-network industry knowledge service platform, data from open-source data platforms, and the like.

Features of the data are then extracted. For example, data rule features are extracted by constructing an industry rule algorithm; features such as character-string features and text statistics are extracted by constructing a machine learning model; and semantic features of the text are extracted by constructing a BERT semantic model.
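As a rough illustration of the character-string and text-statistics features, the sketch below computes a few simple per-column statistics. The specific features are assumptions, since the patent does not list them, and the rule-based and BERT-based features are not shown.

```python
def column_text_features(values):
    """Compute simple string statistics for one column of a data table."""
    texts = [str(v) for v in values]
    total_chars = sum(len(t) for t in texts) or 1  # avoid division by zero
    return {
        "mean_length": sum(len(t) for t in texts) / len(texts),
        "digit_ratio": sum(ch.isdigit() for t in texts for ch in t) / total_chars,
        "alpha_ratio": sum(ch.isalpha() for t in texts for ch in t) / total_chars,
        "unique_ratio": len(set(texts)) / len(texts),
    }

# Hypothetical column of product codes.
features = column_text_features(["A1", "B2", "A1", "C3"])
```

Columns with similar statistics (for example two ID columns made mostly of digits with a high unique ratio) end up close in feature space, which is what a downstream matching model can exploit.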

Further, feature recombination, recoding and standard vectorization are performed on the extracted features, and the training set and test set of the first matching model or the second matching model are constructed from the processed data.

Further, a network structure of the first matching model or the second matching model is constructed, the first matching model or the second matching model is trained according to the training set and the test set corresponding to the models, and the trained first matching model or the trained second matching model can be obtained when preset evaluation indexes are reached.

The first matching model matches columns in the data tables, and the second matching model matches whole data tables. Optionally, the second matching model can also match columns: when the second matching model is trained, the training data set can be changed so that the model learns the matching relation between columns, in which case the output similarity matrix gives the similarity between columns.

In an alternative embodiment, the columns in the data table are matched by a first matching model.

In a possible implementation manner, network structures such as the deep learning model GNN, the machine learning model IOU_match, the rule model IOU_schema, and the like may be used for training. The embodiment of the present disclosure is not particularly limited in this respect, and the structure may be selected according to the actual situation.

The embodiment of the present application provides an improved GAE (graph autoencoder) network structure. Fig. 4 is a schematic diagram of this structure. As shown in fig. 4, the improved GAE network structure includes an encoding layer and a decoding layer (the "anti-encoding layer"), where the encoding layer includes a GNN network and an FC network that are connected in sequence, and the decoding layer includes only an FC network.
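The layer arrangement described for fig. 4 can be sketched as an untrained forward pass. Everything beyond the ordering "GNN then FC in the encoder, FC only in the decoder" is an assumption of this sketch: the GNN layer is taken to be a mean-aggregation graph convolution with ReLU, the decoder is taken to reconstruct node features, biases are omitted, and the class name and dimensions are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def normalize_adjacency(adj):
    """Add self-loops and row-normalize, so each node averages itself and its neighbors."""
    n = len(adj)
    looped = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in looped]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class ImprovedGAESketch:
    """Encoder: GNN layer followed by an FC layer. Decoder: a single FC layer."""

    def __init__(self, in_dim, hidden_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.w_gnn = random_matrix(in_dim, hidden_dim, rng)
        self.w_enc_fc = random_matrix(hidden_dim, latent_dim, rng)
        self.w_dec_fc = random_matrix(latent_dim, in_dim, rng)

    def encode(self, adj, features):
        a_hat = normalize_adjacency(adj)
        hidden = relu(matmul(matmul(a_hat, features), self.w_gnn))  # GNN layer
        return matmul(hidden, self.w_enc_fc)                        # FC layer

    def decode(self, latent):
        return matmul(latent, self.w_dec_fc)                        # FC-only decoder

    def forward(self, adj, features):
        latent = self.encode(adj, features)
        return latent, self.decode(latent)

# Hypothetical graph of three nodes, each with four features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.8], [0.5, 0.5, 0.1, 0.0]]
model = ImprovedGAESketch(in_dim=4, hidden_dim=3, latent_dim=2)
latent, reconstruction = model.forward(adj, features)
```

A conventional graph autoencoder decodes with an inner product of the latent vectors; replacing that with a fully connected layer, as the text describes, keeps decoding cost linear in the number of nodes, which is consistent with the claimed shorter training time, though the patent does not state this reasoning.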

In an optional embodiment, model training is performed with the improved GAE neural network structure. Specifically, the data rule features are encoded with a pattern encoding technique to generate an initial feature vector X1; the machine learning statistical features are encoded with the pattern encoding technique to generate an initial feature vector X2; the semantic features of the data are encoded with the pattern encoding technique to generate an initial feature vector X3; and the X1, X2 and X3 vectors are recoded and standard-vectorized to obtain a standard vector X. The data are then randomly split into a test set and a training set at a ratio of 2:8.
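The combination of X1, X2 and X3 into a standard vector X and the random 2:8 split can be sketched as below. Reading "recoding and standard vectorization" as concatenation followed by per-dimension z-score standardization is an assumption of this sketch, as are the function names and the fixed random seed.

```python
import random
import statistics

def standardize(vectors):
    """Z-score each dimension; constant dimensions are left centered at zero."""
    columns = list(zip(*vectors))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in vectors]

def build_dataset(x1, x2, x3, test_ratio=0.2, seed=42):
    """Concatenate the three feature groups, standardize, and split test:train = 2:8."""
    combined = [a + b + c for a, b, c in zip(x1, x2, x3)]
    x = standardize(combined)
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(x) * test_ratio)
    test = [x[i] for i in indices[:n_test]]
    train = [x[i] for i in indices[n_test:]]
    return train, test

# Hypothetical feature groups for ten samples: rule (2-d), statistical (3-d), semantic (4-d).
rng = random.Random(0)
x1 = [[rng.random() for _ in range(2)] for _ in range(10)]
x2 = [[rng.random() for _ in range(3)] for _ in range(10)]
x3 = [[rng.random() for _ in range(4)] for _ in range(10)]
train_set, test_set = build_dataset(x1, x2, x3)
```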

Further, a model evaluation index is constructed, the graph autoencoder neural network (GAE) is constructed, and default initial parameters are set. The training set is input into the constructed GAE model to update the parameters, and the model is then evaluated on the test set. If the model does not meet the evaluation index, the process returns and the parameters are adjusted manually; if it meets the evaluation index, construction of the pre-trained model is complete. The improved GAE network can shorten model training time and offers better generalization, higher robustness and faster inference.

The matching model provided by the embodiment of the application can perform association matching on data tables and mine their matching relations. The model is applicable to different application scenarios within the same industry: when the application scenario changes, only the data of the new scenario need to be input, their features extracted again, and the resulting feature vectors fed into the model. The model does not need to be retrained, so it is an adaptable, industry-level, general-purpose pre-trained model.

S103, data fusion is carried out according to the matching relation of the data table and the pre-trained fusion model, and fused data are obtained.

Specifically, queue association is performed according to the matching relation of the data tables to obtain an associated data queue. For example, if the first row of table A matches the second row of table B, the first row of table A may be associated with the second row of table B.

The associated data queue is then input into the pre-trained fusion model to obtain fused data. In one possible implementation, the associated data queue and a sample table are input into the fusion model, and a new wide table is synthesized according to the sample table. Alternatively, the associated data queue alone is input into the pre-trained fusion model, and the fusion model outputs a fused data table.

In a possible implementation manner, the method further includes training the fusion model before the data are fused. The network structure of the fusion model is the improved GAE network structure, and it is trained in the same way as the first matching model and the second matching model, which is not described again here. Using the improved GAE network structure shortens model training time and offers better generalization, higher robustness and faster inference.

The data integration method can also fuse incomplete data tables, for example data tables in which a key column has been lost or has a severe proportion of missing values for some reason.

In an optional embodiment, after the fused data is obtained, data analysis can be automatically performed, and a data analysis report is generated.

Specifically, statistical analysis is performed on the fused data to obtain: summary information, such as data type, unique values, missing values and memory size; statistical information, such as minimum, maximum and median; descriptive information, such as standard deviation, coefficient of variation and skewness coefficient; and graphical information, such as histograms. Correlation analysis of the data is also visualized, highlighting strongly correlated variables, for example with Spearman and Pearson correlation matrix heatmaps.
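Several of the statistics listed above can be computed with the standard library alone, as in the sketch below. It covers missing and unique counts, minimum, maximum, median, standard deviation, coefficient of variation, and the Pearson and Spearman correlation coefficients; chart generation, memory size and skewness are left out. The function names and sample data are hypothetical.

```python
import statistics

def describe(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    std = statistics.pstdev(present)
    return {
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "std": std,
        "cv": std / mean if mean else None,  # coefficient of variation
    }

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical fused columns: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
summary = describe([1, 2, None, 4])
```

For the sample columns the Spearman coefficient is 1 because the relationship is monotonic, while the Pearson coefficient is slightly below 1 because it is not linear; reporting both, as the text suggests, helps distinguish the two cases.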

Through these steps, correlation analysis is performed on the data, correlated data and abnormal data are identified, and analysis charts are automatically constructed to display the results, greatly reducing the manpower and effort required for data-driven decisions.

In order to facilitate understanding of the integration method of multi-source heterogeneous data provided by the embodiment of the present application, the following description is made with reference to fig. 2. As shown in fig. 2, the method includes the following steps.

Enterprise data are obtained, and the pre-trained model is fine-tuned and corrected for the enterprise's application scenario; the model itself is not changed, and only the feature vectors of the input data are adjusted, yielding the pattern matching model and the fusion model. The data tables are then acquired. This includes acquiring a database or data file with complete meaning and a database or data file to be analyzed, and acquiring a data table sample and a database, a plurality of data tables, or a plurality of data files to be extracted.

Pattern matching is performed on the acquired data tables with the pre-trained pattern matching model to obtain a matching relation; the candidate data tables are fused according to the matching relation and the fusion model to obtain a fused data table; and the fused data table is automatically analyzed and summarized to obtain a data perspective view.

Fig. 3 is a schematic diagram of a model training method. As shown in fig. 3, public data of a given industry are acquired; a rule algorithm is constructed to extract data rule features; a machine learning model is constructed to extract features such as character-string features and text statistics; and a BERT semantic model is constructed to extract semantic features of the text.

Further, feature recombination, recoding and standard vectorization are performed on the extracted features, and training data for the pattern matching model and for the fusion model are constructed respectively.

Model training is then carried out with the constructed training data and the improved graph neural network GAE to obtain the trained matching model and the trained fusion model.

In the data integration method provided by the embodiment of the application, each of the four functional modules can operate independently. Each functional module processes data efficiently, which greatly reduces manual involvement and improves working efficiency. The intelligent data association function and the intelligent data fusion function can find the internal relations in the data quickly and with high accuracy, and efficiently assist manual data screening, which makes large-scale data table search and the fusion of incomplete data tables feasible. For newly fused data, the method uses an intelligent perspective function to perform correlation analysis, finds correlated data and abnormal data, and automatically constructs analysis charts to display the results, greatly reducing the manpower and effort required for data-driven decisions.

An embodiment of the present application further provides an apparatus for integrating multi-source heterogeneous data, where the apparatus is configured to perform the method for integrating multi-source heterogeneous data according to the foregoing embodiment, and as shown in fig. 5, the apparatus includes:

a data preprocessing module 501, configured to preprocess the obtained data table to obtain a preprocessed data table;

the intelligent data association module 502 is configured to match the preprocessed data tables according to the pre-trained matching model and the linear programming algorithm to obtain a matching relationship of the data tables;

and the intelligent data fusion module 503 is configured to perform data fusion according to the matching relationship of the data table and the pre-trained fusion model to obtain fused data.

FIG. 6 is another block diagram illustrating an integrated device for multi-source heterogeneous data according to an example embodiment. As shown in fig. 6, the data integration apparatus in the embodiment of the present application may further include: a data obtaining module, configured to obtain multi-source heterogeneous data; an intelligent data processing module, configured to preprocess the data; an intelligent data association module, configured to match the data; an intelligent data fusion module, configured to fuse the matched data; and an intelligent data perspective module, configured to perform statistical analysis on the fused data and automatically generate an analysis report.

It should be noted that, when the integration apparatus for multi-source heterogeneous data provided in the foregoing embodiment executes the integration method for multi-source heterogeneous data, only the division of the functional modules is used as an example, in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the integration device of multi-source heterogeneous data and the integration method embodiment of multi-source heterogeneous data provided by the above embodiments belong to the same concept, and details of the implementation process are shown in the method embodiment and are not described herein again.

The embodiment of the present application further provides an electronic device corresponding to the multi-source heterogeneous data integration method provided by the foregoing embodiments, so as to execute that method.

Referring to FIG. 7, a schematic diagram of an electronic device provided in some embodiments of the present application is shown. As shown in FIG. 7, the electronic device includes a processor 700, a memory 701, a bus 702 and a communication interface 703, wherein the processor 700, the communication interface 703 and the memory 701 are connected through the bus 702. The memory 701 stores a computer program executable on the processor 700, and the processor 700, when running the computer program, performs the method for integrating multi-source heterogeneous data provided by any of the foregoing embodiments of the present application.

The memory 701 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is established through at least one communication interface 703 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, or the like.

The bus 702 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. The memory 701 is used for storing a program, and the processor 700 executes the program after receiving an execution instruction. The integration method of multi-source heterogeneous data disclosed in any embodiment of the present application may be applied to the processor 700 or implemented by the processor 700.

The processor 700 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 700 or by instructions in the form of software. The processor 700 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 701; the processor 700 reads the information in the memory 701 and completes the steps of the method in combination with its hardware.

The electronic device provided by the embodiment of the present application is based on the same inventive concept as the multi-source heterogeneous data integration method provided by the embodiment of the present application, and has the same beneficial effects as the method it adopts, runs or implements.

Referring to FIG. 8, the computer-readable storage medium is shown as an optical disc 800 on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the method for integrating multi-source heterogeneous data provided in any of the foregoing embodiments.

It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.

The computer-readable storage medium provided by the above embodiment of the present application is based on the same inventive concept as the method for integrating multi-source heterogeneous data provided by the embodiment of the present application, and has the same beneficial effects as the method adopted, run or implemented by the application program stored on it.

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any such combination should be considered within the scope of this specification as long as it involves no contradiction.

The above examples show only some embodiments of the present invention. They are described in relatively specific detail, but they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for integrating multi-source heterogeneous data, characterized by comprising the following steps:

preprocessing the acquired data table to obtain a preprocessed data table;

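matching the preprocessed data table according to a pre-trained matching model and a linear programming algorithm to obtain a matching relation of the data table;

and performing data fusion according to the matching relation of the data table and a pre-trained fusion model to obtain fused data.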

2. The method of claim 1, wherein matching the preprocessed data tables according to a pre-trained matching model and a linear programming algorithm to obtain a matching relationship of the data tables comprises:

3. The method of claim 1, wherein matching the preprocessed data tables according to a pre-trained matching model and a linear programming algorithm to obtain a matching relationship of the data tables comprises:

obtaining a matching relation and a matching score of a data table according to the second similarity matrix and a linear programming algorithm;

and sorting the matching relations in descending order of matching score to obtain the sorted matching relation of the data table.

4. The method according to claim 1, wherein performing data fusion according to the matching relationship of the data table and the pre-trained fusion model to obtain fused data comprises:

5. The method of claim 1, further comprising, after obtaining the preprocessed data table:

training a first matching model, a second matching model and a fusion model;

the neural network structure of the first matching model, the second matching model and the fusion model is an improved GAE network structure, the improved GAE network structure comprises an encoding layer and a reverse encoding layer, the encoding layer comprises a GNN network and an FC network which are sequentially connected, and the reverse encoding layer comprises an FC network.

6. The method of claim 1, further comprising, after obtaining the fused data:

performing statistical analysis on the fused data to obtain data type information, missing value information, memory size information, minimum value information, maximum value information, median information, standard deviation information, coefficient of variation information, histogram information and correlation color-scale map (heat map) information;

7. The method according to claim 1, wherein preprocessing the obtained data table to obtain a preprocessed data table comprises:

acquiring a data table of multi-source heterogeneous data;

8. An apparatus for integrating multi-source heterogeneous data, comprising:

the intelligent data association module is used for matching the preprocessed data table according to a pre-trained matching model and a linear programming algorithm to obtain a matching relation of the data table;

9. An integration apparatus for multi-source heterogeneous data, comprising a processor and a memory storing program instructions, the processor being configured to perform the integration method for multi-source heterogeneous data according to any one of claims 1 to 7 when executing the program instructions.

10. A computer readable medium having computer readable instructions stored thereon which are executed by a processor to implement a method of integrating multi-source heterogeneous data as claimed in any one of claims 1 to 7.