CN111309792B

CN111309792B - Data extraction and conversion method covering complex heterogeneous conditions

Info

Publication number: CN111309792B
Application number: CN201911419254.6A
Authority: CN
Inventors: 刘太敏; 张翠侠; 杨博文; 张永伟; 段然; 陈奡
Original assignee: CETC 28 Research Institute
Current assignee: CETC 28 Research Institute
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2023-12-08
Anticipated expiration: 2039-12-31
Also published as: CN111309792A

Abstract

The invention discloses a data extraction and conversion method covering complex heterogeneous conditions. The method first examines the difficulties and problems encountered during heterogeneous data conversion and collects them into a heterogeneous data conversion problem library. A solution is then proposed for each problem, and the solutions are assembled into a solution set. The invention handles various heterogeneous conditions that arise in heterogeneous data conversion, including different data organization structures, different storage forms, different metadata, and attachment migration, and improves the efficiency of conversion between heterogeneous data.

Description

Data extraction and conversion method covering complex heterogeneous conditions

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a data extraction and conversion method covering complex heterogeneous conditions.

Background

Data extraction and conversion is the data processing flow that carries data from different heterogeneous data sources to unified target data in the course of data flow. It is the foundation of data applications and is widely used in big data computing and in data mining and analysis across industries. Demand for full-featured, high-performance data extraction and conversion algorithms keeps growing, and such algorithms are key to whether data applications can run efficiently.

However, the existing data extraction and conversion methods in the industry have two shortcomings. On the one hand, they do not consider complex heterogeneous conditions such as master-slave table splitting, file migration, and file storage format conversion, so no existing result can be reused when these conditions arise, and resources must be spent on custom development for each specific case. On the other hand, some tools do provide conversion methods and functions, but their coverage is incomplete, so they can hardly satisfy, at one time, all the complex heterogeneous extraction and conversion conditions a project requires.

Disclosure of Invention

The invention aims to provide a data extraction and conversion method that covers multiple complex heterogeneous data conversion conditions and can satisfy the requirements for multiple conversion methods at the same time.

The technical solution for realizing the purpose of the invention is as follows: a data extraction and conversion method covering complex heterogeneous conditions comprises the following steps:

Step 1: sorting out the heterogeneous data structures before and after conversion, and marking the structural differences and organizational correspondences between the tables and fields of the source and target structures.

Step 2: finding fields that are newly added or missing between the source and target data structures, and retaining, deleting, or supplementing those fields as required.

Step 3: comparing the names of synonymous fields in the source and target heterogeneous data, and marking and mapping the fields that have the same meaning but different names.

Step 4: checking whether any data is stored as files before conversion, marking the fields that store file paths, and selecting a migration tool for migration.

Step 5: checking for differences in the storage mode of file formats between the source and target heterogeneous data, marking the storage modes, and selecting a corresponding conversion tool to define the conversion method.

Step 6: comparing the differing metadata in the source and target heterogeneous data, and marking and mapping the fields whose metadata differ.

The specific implementation method of step 1 is as follows:

step 1-1, comparing the organization of the tables in the source and target heterogeneous data structures, including the table data, the types of information the tables describe, and the form in which master and slave tables are expressed; identifying the differences in these respects and, based on them, marking the correspondence between the source and target table structures in terms of the number of tables and the master-slave table form;

step 1-2, comparing the field correspondences of the source and target heterogeneous data structures, and marking the identical corresponding fields in the software.
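The comparison in step 1 can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `TableSchema` class, the `compare_structures` function, and the sample tables are hypothetical names that do not appear in the patent, and real structures would normally be read from a database catalog rather than written by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableSchema:
    """Hypothetical description of one table in a data structure."""
    name: str
    fields: set
    master: Optional[str] = None  # name of the master table if this is a slave table


def compare_structures(source, target):
    """Mark table-level differences and identically named fields."""
    source_fields = set().union(*(t.fields for t in source))
    target_fields = set().union(*(t.fields for t in target))
    return {
        # Number of tables before and after conversion.
        "table_count": (len(source), len(target)),
        # Slave tables on each side, to expose master-slave differences.
        "slave_tables": (
            sorted(t.name for t in source if t.master),
            sorted(t.name for t in target if t.master),
        ),
        # Fields that appear under the same name on both sides.
        "same_fields": sorted(source_fields & target_fields),
    }


# Assumed example: one flat source table becomes a master table plus a slave table.
source = [TableSchema("person", {"id", "name", "phone"})]
target = [
    TableSchema("person", {"id", "name"}),
    TableSchema("contact", {"person_id", "phone"}, master="person"),
]
report = compare_structures(source, target)
```

The resulting `report` records the kind of marks that step 1 calls for: the table counts differ, the target introduces a slave table, and three fields keep their names.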

The specific implementation method of step 2 is as follows:

step 2-1, comparing and analyzing the table information and field information that the data structure to be converted lacks relative to the target data structure;

step 2-2, comparing and analyzing the table information and field information that is redundant in the data structure to be converted relative to the target data structure;

step 2-3, obtaining the conversion requirements, and either supplementing the missing tables and fields by calculation or forgoing supplementation;

step 2-4, obtaining the conversion requirements, and either deleting the redundant tables and fields or retaining and translating them.
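Steps 2-1 to 2-4 amount to two set differences followed by a per-field decision. The following Python sketch shows one possible shape for this, assuming hypothetical helper names (`diff_fields`, `plan_fields`) and a hypothetical way of expressing the conversion requirements as a dictionary of supplement functions and a set of fields to retain; none of these come from the patent.

```python
def diff_fields(source_fields, target_fields):
    """Return (missing, redundant) field sets relative to the target structure."""
    source_fields, target_fields = set(source_fields), set(target_fields)
    missing = target_fields - source_fields    # step 2-1
    redundant = source_fields - target_fields  # step 2-2
    return missing, redundant


def plan_fields(source_fields, target_fields, supplements, keep):
    """Decide what to do with each missing or redundant field.

    supplements: dict mapping a missing field to a function that computes it
                 from a source row (assumed form of the conversion requirement).
    keep:        set of redundant fields that should be retained.
    """
    missing, redundant = diff_fields(source_fields, target_fields)
    plan = {}
    for name in sorted(missing):      # step 2-3
        plan[name] = "supplement" if name in supplements else "skip"
    for name in sorted(redundant):    # step 2-4
        plan[name] = "retain" if name in keep else "delete"
    return plan


# Assumed example: "age" is computed from "birth_year"; "fax" is kept for reference.
supplements = {"age": lambda row: 2019 - row["birth_year"]}
plan = plan_fields(
    ["id", "birth_year", "fax"],
    ["id", "age", "remark"],
    supplements,
    keep={"fax"},
)
computed_age = supplements["age"]({"id": 1, "birth_year": 1990, "fax": "n/a"})
```

Keeping the decision (`plan`) separate from the computation (`supplements`) lets the marks be reviewed before any data is changed.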

The specific implementation method of step 3 is as follows:

step 3-1, finding, through analysis and comparison, the synonymous fields whose names differ because the field names themselves are inconsistent;

step 3-2, finding, through analysis and comparison, the synonymous fields whose names differ because of different naming conventions;

step 3-3, converting and associating all synonymous fields with different names.
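The two kinds of synonymous fields in step 3 call for different handling: fields with genuinely different names need an explicit alias table, while fields that differ only in naming convention can be matched after normalization. A minimal Python sketch follows; the function names, the alias table, and the normalization rule (camelCase, hyphens, and spaces reduced to lowercase snake_case) are assumptions made for illustration.

```python
import re


def normalize(name):
    """Reduce a field name to lowercase words joined by underscores."""
    # Insert an underscore at lower-to-upper transitions, e.g. userName -> user_Name.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Treat hyphens and whitespace as word separators as well.
    return re.sub(r"[\s\-]+", "_", name).lower()


def map_synonyms(source_fields, target_fields, aliases):
    """Associate each source field with its synonymous target field."""
    by_normalized = {normalize(t): t for t in target_fields}
    mapping = {}
    for field in source_fields:
        if field in aliases:                        # step 3-1: different names
            mapping[field] = aliases[field]
        elif normalize(field) in by_normalized:     # step 3-2: naming conventions
            mapping[field] = by_normalized[normalize(field)]
    return mapping                                  # step 3-3: the association


# Assumed example: "tel" and "phone" are declared synonyms by hand.
mapping = map_synonyms(
    ["userName", "tel", "CREATE-TIME"],
    ["user_name", "phone", "create_time"],
    aliases={"tel": "phone"},
)
```

Source fields that match neither rule are simply left out of `mapping`, so they can be reviewed manually.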

The specific implementation method of step 4 is as follows:

step 4-1, obtaining information on all fields in the data to be converted that store files by means of a storage path;

step 4-2, marking the selected fields to be converted;

step 4-3, selecting a migration tool, or using a related migration method, to convert and migrate the paths and files.
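For step 4, a field that stores a file path has to be migrated in two parts: the file itself is copied, and the stored path is rewritten. The Python sketch below is an assumed illustration using only the standard library; `migrate_files`, the `attachment` field, and the directory layout are hypothetical and are not the migration tool referred to in the patent.

```python
import shutil
import tempfile
from pathlib import Path


def migrate_files(rows, path_field, source_root, target_root):
    """Copy each referenced file under target_root and rewrite the stored path."""
    source_root, target_root = Path(source_root), Path(target_root)
    migrated = []
    for row in rows:
        old_path = Path(row[path_field])
        # Keep the directory layout below the source root (raises ValueError
        # if the stored path is not under source_root).
        new_path = target_root / old_path.relative_to(source_root)
        new_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(old_path, new_path)  # copies content and file metadata
        migrated.append({**row, path_field: str(new_path)})
    return migrated


# Assumed example: one record whose "attachment" field stores a file path.
with tempfile.TemporaryDirectory() as tmp:
    old_root, new_root = Path(tmp) / "old", Path(tmp) / "new"
    (old_root / "docs").mkdir(parents=True)
    (old_root / "docs" / "a.txt").write_text("hello")
    rows = [{"id": 1, "attachment": str(old_root / "docs" / "a.txt")}]
    result = migrate_files(rows, "attachment", old_root, new_root)
    copied_text = Path(result[0]["attachment"]).read_text()
```

The original rows are not modified; the function returns new rows so that the old paths remain available if the migration has to be repeated.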

The specific implementation method of step 5 is as follows:

step 5-1, obtaining information on all fields in the data to be converted that store data blocks directly in BLOB form;

step 5-2, marking the selected fields to be converted;

step 5-3, calling a related migration method, or manually writing a conversion method block, to copy, transmit, and convert the data blocks.
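Step 5 can be illustrated with SQLite from the Python standard library. This is a sketch under assumptions: the `report` table, the helper names, and the choice of Base64 text as the target storage form are all hypothetical, and the code stands in for the "conversion method block" only in the loosest sense. Table and column names are interpolated into the SQL text, so they are assumed to come from a trusted schema, not from user input.

```python
import base64
import sqlite3


def blob_columns(conn, table):
    """Step 5-1: list the columns of a table whose declared type is BLOB."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [name for _cid, name, col_type, *_rest in rows if col_type.upper() == "BLOB"]


def convert_blobs(conn, table, key, columns, convert=base64.b64encode):
    """Step 5-3: read the marked BLOB columns and convert each data block.

    Assumes the marked columns contain no NULL values.
    """
    converted = {}
    column_list = ", ".join(columns)
    for pk, *blobs in conn.execute(f"SELECT {key}, {column_list} FROM {table}"):
        converted[pk] = {
            column: convert(blob).decode("ascii")
            for column, blob in zip(columns, blobs)
        }
    return converted


# Assumed example: an in-memory table with one BLOB column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (id INTEGER PRIMARY KEY, title TEXT, body BLOB)")
conn.execute("INSERT INTO report VALUES (1, 'demo', ?)", (b"\x00\x01binary",))
marked = blob_columns(conn, "report")                 # step 5-2: the marked fields
result = convert_blobs(conn, "report", "id", marked)
```

Passing a different `convert` function (any callable that takes bytes and returns ASCII-encodable bytes) would change the target representation without touching the extraction logic.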

The specific implementation method of step 6 is as follows:

step 6-1, analyzing the metadata and comparing the source and target heterogeneous data to collect the set of all fields whose metadata differ;

step 6-2, marking the selected metadata difference fields;

step 6-3, associating the metadata difference fields by using the software, and executing the conversion.
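The metadata comparison in step 6 can be sketched as a dictionary diff followed by a cast. In the Python example below, the metadata layout (a `type` name plus an optional `length`), the `CASTS` table, and both function names are assumptions chosen for illustration; real metadata may carry many more attributes.

```python
def metadata_diff(source_meta, target_meta, mapping):
    """Step 6-1: collect, per mapped field, the metadata attributes that differ."""
    diffs = {}
    for src, tgt in mapping.items():
        attributes = source_meta[src].keys() | target_meta[tgt].keys()
        changed = {
            attr: (source_meta[src].get(attr), target_meta[tgt].get(attr))
            for attr in attributes
            if source_meta[src].get(attr) != target_meta[tgt].get(attr)
        }
        if changed:
            diffs[src] = changed
    return diffs


# Hypothetical mapping from metadata type names to Python casts.
CASTS = {"int": int, "float": float, "str": str}


def convert_row(row, mapping, target_meta):
    """Step 6-3: cast each mapped value to the target field's metadata."""
    out = {}
    for src, tgt in mapping.items():
        meta = target_meta[tgt]
        value = CASTS[meta["type"]](row[src])
        if meta["type"] == "str" and "length" in meta:
            value = value[: meta["length"]]  # truncate to the target length
        out[tgt] = value
    return out


# Assumed example: "age" changes type, "name" changes maximum length.
source_meta = {"age": {"type": "str"}, "name": {"type": "str", "length": 50}}
target_meta = {"age": {"type": "int"}, "name": {"type": "str", "length": 5}}
mapping = {"age": "age", "name": "name"}
diffs = metadata_diff(source_meta, target_meta, mapping)   # step 6-2: the marks
converted = convert_row({"age": "42", "name": "Zhang Wei"}, mapping, target_meta)
```

Truncating over-long strings is one possible policy; a stricter converter could raise an error instead, depending on the conversion requirements.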

Compared with the prior art, the invention has the following notable advantage: it integrates multiple complex heterogeneous extraction and conversion conditions, and can provide solutions to problems such as table splitting, file migration, and file storage format conversion.

1) For table splitting, the organization of the tables in the source and target heterogeneous data structures is compared, including the table data, the types of information the tables describe, the form in which master and slave tables are expressed, and the field correspondences. The differences in these respects are analyzed and, based on them, the correspondence between the source and target table structures is marked in terms of the number of tables and the master-slave table form.

2) For file migration, information on the fields in the data to be converted that store files by means of a storage path is obtained, the selected fields to be converted are marked, and a migration tool is selected, or a related migration method is used, to convert and migrate the paths and files.

3) For file storage format conversion, information on all fields in the data to be converted that store data blocks directly, in BLOB or a similar form, is obtained, the selected fields to be converted are marked, and a related migration method is called, or a conversion method block is written manually, to copy, transmit, and convert the data blocks.

In addition, the method offers multiple schemes, which can satisfy a project team's requirements for multiple conversion methods.
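For the table-splitting case in point 1) above, once the master-slave correspondence has been marked, the split itself can be quite mechanical. The Python sketch below is an assumed illustration: `split_master_slave`, the order example, and the choice of which fields go to which table are hypothetical and are not taken from the patent.

```python
def split_master_slave(rows, key, master_fields, slave_fields):
    """Split flat rows into one master row per key and one slave row per input row."""
    master = {}
    slave = []
    for row in rows:
        # The first row seen for a key supplies the master record.
        master.setdefault(row[key], {f: row[f] for f in master_fields})
        # Every row contributes a slave record that points back to its master.
        slave.append({key: row[key], **{f: row[f] for f in slave_fields}})
    return list(master.values()), slave


# Assumed example: a flat order table split into orders (master) and items (slave).
flat_rows = [
    {"order_id": 1, "customer": "A", "item": "pen"},
    {"order_id": 1, "customer": "A", "item": "ink"},
    {"order_id": 2, "customer": "B", "item": "pad"},
]
orders, items = split_master_slave(
    flat_rows, "order_id", ["order_id", "customer"], ["item"]
)
```

The sketch assumes that the master fields are identical across all rows sharing a key; conflicting values would need a rule from the conversion requirements.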

Drawings

FIG. 1 is a general flow chart of an embodiment of the present invention.

FIG. 2 is a schematic diagram of an embodiment of the present invention.

FIG. 3 is a diagram illustrating an exemplary file migration algorithm according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating an example of a metadata conversion algorithm according to an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the drawings and examples.

As shown in FIG. 1 and FIG. 2, this embodiment uses two heterogeneous data structures for illustration, and the implementation steps are as follows.

a) Compare the organization of the tables in the source and target heterogeneous data structures, including the table data, the types of information the tables describe, and the form in which master and slave tables are expressed. Identify the differences in these respects and, based on them, mark the correspondence between the source and target table structures in terms of the number of tables and the master-slave table form.

b) Compare the field correspondences of the source and target heterogeneous data structures, and mark the identical corresponding fields in the software.

c) Compare and analyze the table information and field information that the data structure to be converted lacks relative to the target data structure.

d) Compare and analyze the table information and field information that is redundant in the data structure to be converted relative to the target data structure.

e) Obtain the conversion requirements, and either supplement the missing tables and fields by calculation or forgo supplementation.

f) Obtain the conversion requirements, and either delete the redundant tables and fields or retain and translate them.

g) Through analysis and comparison, find the synonymous fields whose names differ because the field names themselves are inconsistent.

h) Through analysis and comparison, find the synonymous fields whose names differ because of different naming conventions.

i) Convert and associate all synonymous fields with different names.

j) Through the software's data manager, obtain information on all fields in the data to be converted that store files by means of a storage path.

k) Mark the selected fields to be converted.

l) Select a migration tool, or call a related migration method, to convert and migrate the paths and files.

m) Through the software's data manager, obtain information on all fields in the data to be converted that store data blocks directly, in BLOB or a similar form.

n) Mark the selected fields to be converted.

o) Call a related migration method, or manually write a conversion method block, to copy, transmit, and convert the data blocks; a conversion code example is shown in FIG. 3.

p) Analyze the metadata and compare the source and target heterogeneous data to collect the set of all fields whose metadata differ.

q) Mark the selected metadata difference fields.

r) Associate the metadata difference fields by using the software and execute the conversion; an example is shown in FIG. 4.

Claims

1. A data extraction and conversion method covering complex heterogeneous conditions, characterized by comprising the following steps:

step 1: sorting out the heterogeneous data structures before and after conversion, and marking the structural differences and organizational correspondences between the tables and fields of the source and target structures:

step 1-2, comparing the field correspondences of the source and target heterogeneous data structures, and marking the identical corresponding fields in the software;

step 2: finding fields that are newly added or missing between the source and target data structures, and retaining, deleting, or supplementing those fields as required;

step 2-4, obtaining the conversion requirements, and either deleting the redundant tables and fields or retaining and translating them;

step 3: comparing the names of synonymous fields in the source and target heterogeneous data, and marking and mapping the fields that have the same meaning but different names:

step 3-3, converting and associating all synonymous fields with different names;

step 4: checking whether any data is stored as files before conversion, marking the fields that store file paths, and selecting a migration tool for migration; the specific implementation method is as follows:

step 4-2, marking the selected fields to be converted;

step 4-3, selecting a migration tool, or using a related migration method, to convert and migrate the paths and files;

step 5: checking for differences in the storage mode of file formats between the source and target heterogeneous data, marking the changed storage modes, and selecting a corresponding conversion tool to define the conversion method; the specific implementation method is as follows:

step 5-2, marking the selected fields to be converted;

step 5-3, calling a related migration method, or manually writing a conversion method block, to copy, transmit, and convert the data blocks;

step 6: comparing the differing metadata in the source and target heterogeneous data, and marking and mapping the fields whose metadata differ; the specific implementation method is as follows:

step 6-2, marking the selected metadata difference fields;