CN108228908B

CN108228908B - Data extraction method and device

Info

Publication number: CN108228908B
Application number: CN201810132705.7A
Authority: CN
Inventors: 林明; 欧阳小兵; 戴丽玛; 于鸿鹏; 陈宏亮; 张丹; 张素钊
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2018-02-09
Filing date: 2018-02-09
Publication date: 2021-11-12
Anticipated expiration: 2038-02-09
Also published as: CN108228908A

Abstract

The invention discloses a data extraction method and a device, wherein the method comprises the following steps: analyzing the acquired data extraction task of the source system, and generating a data extraction list corresponding to the data extraction task according to the data partition granularity; determining a data extraction mode for the source system according to the data capacity in the data extraction list; if a first preset data extraction mode is adopted for data extraction, extracting data of each data partition, generating a first data file, and storing the first data file to a target system; and if the data extraction is carried out by adopting a second preset data extraction mode, carrying out data extraction on the source system, and storing the extracted data file to the target system. The invention realizes the partition extraction of the data table, improves the data extraction efficiency and reduces the data extraction errors.

Description

Data extraction method and device

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data extraction method and apparatus.

Background

With the development of internet technology, more and more systems need to transmit and apply data, and data of some systems need to be extracted and imported or exported to corresponding destination systems.

The existing data extraction scheme is generally implemented by the following steps: identifying the range of source system data tables that need to be exported; and writing export and import statements, exporting the DMP file to the target system, or importing the DMP file. The whole processing flow of the existing scheme must be controlled and executed by an operator, so data extraction efficiency is low because of manual intervention. Moreover, errors easily occur when a large number of data tables are exported and imported: when hundreds or even thousands of tables are exported at one time, a problem in the exported DMP file means that none of the tables can be exported and imported successfully, so both efficiency and accuracy are low. In addition, because the structures of the data tables differ, the existing scheme cannot use data extraction statements to extract only the part of the data in each table that meets the requirements, and data extraction errors may result.

Disclosure of Invention

In view of the above problems, the present invention provides a data extraction method and apparatus, which achieve the purpose of extracting data table partitions, improving data extraction efficiency, and reducing data extraction errors.

In order to achieve the purpose, the invention provides the following technical scheme:

a method of data extraction, comprising:

analyzing the acquired data extraction task of the source system, and generating a data extraction list corresponding to the data extraction task according to the data partition granularity;

determining a data extraction mode for the source system according to the data capacity in the data extraction list;

if a first preset data extraction mode is adopted for data extraction, extracting data of each data partition, generating a first data file, and storing the first data file to a target system;

and if the data extraction is carried out by adopting a second preset data extraction mode, carrying out data extraction on the source system, and storing the extracted data file to the target system.

Preferably, the analyzing the acquired data extraction task of the source system, and generating a data extraction list corresponding to the data extraction task according to the data partition granularity includes:

analyzing the acquired data extraction task to obtain configuration information corresponding to the data extraction task;

partitioning the data extraction tasks according to the configuration information, and generating data extraction subtasks corresponding to the partitions for the data extraction tasks of each partition;

and generating a data extraction list by each data extraction subtask.

Preferably, before saving the data file to the target system, the method further comprises:

judging whether the target system has a partition corresponding to the data file, if not, adding the partition corresponding to the data file in the target system;

if the partition exists, the target partition corresponding to the data file in the target system is located, and the data in the target partition is emptied.

Preferably, the determining a data extraction manner for the source system according to the data capacity in the data extraction list includes:

judging whether the data capacity in the data extraction list is larger than a preset data quantity threshold value, if so, extracting data from the source system by adopting a first preset data extraction mode, and otherwise, extracting data by adopting a second preset data extraction mode;

the first data extraction mode is a DMP file extraction mode, and the second data extraction mode is a DBLINK data extraction mode.

Preferably, when saving the first data file to the target system, the method further comprises:

the first data file is transmitted to the target system in parallel according to a preset parallelism degree through a preset data transmission mode;

judging whether the target system successfully acquires the first data file according to the data extraction list in parallel, if so, continuing to upload the first data file; if not, judging whether the first data file has errors or not.

A data extraction apparatus comprising:

the generating module is used for analyzing the acquired data extraction task of the source system and generating a data extraction list corresponding to the data extraction task according to the data partition granularity;

the determining module is used for determining a data extraction mode of the source system according to the data capacity in the data extraction list;

the first extraction module is used for extracting the data of each data partition if a first preset data extraction mode is adopted for data extraction, generating a first data file and storing the first data file to a target system;

and the second extraction module is used for extracting data from the source system and storing the extracted data file to the target system if a second preset data extraction mode is adopted for data extraction.

Preferably, the generating module comprises:

the analysis unit is used for analyzing the acquired data extraction task to obtain configuration information corresponding to the data extraction task;

the partitioning unit is used for partitioning the data extraction tasks according to the configuration information and generating data extraction subtasks corresponding to the partitions for the data extraction tasks of each partition;

and the generating unit is used for generating a data extraction list from each data extraction subtask.

the judging module is used for judging whether the target system has a partition corresponding to the data file, and if not, the partition corresponding to the data file is added in the target system;

Preferably, the determining module comprises:

a capacity judging unit, configured to judge whether a data capacity in the data extraction list is greater than a preset data amount threshold, if so, perform data extraction on the source system in a first preset data extraction manner, and otherwise, perform data extraction in a second preset data extraction manner;

the transmission unit is used for transmitting the first data file to the target system in parallel according to a preset parallelism by a preset data transmission mode;

the acquisition judging unit is used for judging whether the target system successfully acquires the first data file in parallel according to the data extraction list, and if so, continuously uploading the first data file; if not, judging whether the first data file has errors or not.

Compared with the prior art, the data extraction method and the data extraction device provided by the invention have the advantages that the acquired data extraction tasks of the source system are analyzed, the data extraction tasks are divided into sub-tasks according to the partition granularity, and finally the task extraction list is generated, so that the problem of low data extraction efficiency caused by extraction of all data to be extracted in the prior art can be solved when the data are extracted in parallel by partitions, namely the data are imported or exported. Meanwhile, two data extraction modes are set, so that the extraction efficiency can be improved by adopting the corresponding data extraction mode according to the data capacity, manual intervention is not needed, the data synchronization of a source system and a target system can be realized, and the error influence of data extraction is reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flow chart of a data extraction method according to an embodiment of the present invention;

fig. 2 is a schematic flow chart of another data extraction method according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a data extraction device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first" and "second," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements but may include steps or elements not listed.

An embodiment of the present invention provides a data extraction method, and referring to fig. 1, the method may include the following steps:

s11, analyzing the acquired data extraction task of the source system, and generating a data extraction list corresponding to the data extraction task according to the data partition granularity;

in another embodiment of the present invention, a method for generating a data extraction list is further provided, which may include the following steps:

and generating a data extraction list by each data extraction subtask.

The data extraction task configured at the front end is analyzed, wherein the data extraction task is generated by front-end task personnel selecting a table range, a date range, a region range and the like according to needs, and the data extraction can be specifically data export or data import in the embodiment of the invention.

Then, the background parses the configuration information of the data extraction task through preset code, for example p_gen_task_file, and finally generates the data extraction list by generating one data extraction subtask per sub-partition; that is, the data extraction list includes a plurality of data extraction subtasks.

It should be noted that p_gen_task_file is only the program that parses the import task configured at the front end and generates the subtasks (file list). The program that actually performs the export is exp_process.sh: the shell script reads the subtasks (file list), and each subtask is exported to a dmp file by an expdp call according to the set parallelism. For example, if the parallelism is set to 10, ten expdp statements run in the background at the same time and generate 10 dmp files. The program that performs the import is imp_process.sh: it reads the subtasks (file list), and each subtask is imported by an impdp call according to the set parallelism.
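The bounded-parallelism behaviour described above can be sketched as follows. This is a minimal illustration in Python rather than the shell scripts named in the text; the function names run_subtasks and fake_export, the subtask naming, and the status strings are assumptions introduced for the example, and fake_export merely stands in for launching one expdp call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subtasks(subtasks, export_one, parallelism=10):
    """Run export_one on every subtask, with at most `parallelism` running at once."""
    results = {}
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # All subtasks are queued; the pool starts a new one whenever a worker frees up.
        futures = {name: pool.submit(export_one, name) for name in subtasks}
        for name, future in futures.items():
            try:
                results[name] = ("done", future.result())
            except Exception as exc:
                # A failure of one subtask does not affect the others.
                results[name] = ("failed", str(exc))
    return results


def fake_export(subtask):
    """Hypothetical stand-in for one export call; returns the dump file name."""
    if subtask.endswith("BAD"):
        raise RuntimeError("simulated export failure")
    return subtask + ".dmp"


demo = run_subtasks(["T1:P1", "T1:P2", "T2:BAD"], fake_export, parallelism=2)
```

In this sketch a failing subtask is recorded as failed while the remaining subtasks still complete, which mirrors the statement that the export and import of different files do not affect one another.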

S12, determining a data extraction mode for the source system according to the data capacity in the data extraction list;

the embodiment of the invention also provides a method for determining a data extraction mode, which can comprise the following steps:

The data extraction mode is selected according to the data capacity in the data extraction list: the DBLINK data extraction mode is adopted when the data volume is small, and the DMP file extraction mode is adopted when the data volume is large. Whether the data volume is small or large is judged against a preset threshold; for example, a table is generally considered small when its data volume is less than 100M, and the tables with a small data volume are recorded in a configuration table for the program to use.
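The threshold decision can be illustrated by the short sketch below. It assumes that "100M" means 100 MiB and that a volume exactly equal to the threshold falls to the DBLINK mode, consistent with the "larger than" wording used elsewhere in this description; the constant and function names are hypothetical.

```python
# Assumption: "100M" in the text is read as 100 MiB.
SMALL_TABLE_THRESHOLD_BYTES = 100 * 1024 * 1024


def choose_mode(data_bytes, threshold=SMALL_TABLE_THRESHOLD_BYTES):
    """Return "DMP" when the volume exceeds the threshold, otherwise "DBLINK"."""
    return "DMP" if data_bytes > threshold else "DBLINK"


small_mode = choose_mode(50 * 1024 * 1024)   # below the threshold
large_mode = choose_mode(2 * 1024 ** 3)      # well above the threshold
```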

S13, if a first preset data extraction mode is adopted for data extraction, extracting data of each data partition, generating a first data file, and storing the first data file to a target system;

and S14, if a second preset data extraction mode is adopted for data extraction, performing data extraction on the source system, and storing the extracted data file to the target system.

For example, when data is extracted, an export file list (EXP_TASK_FILES) is designed whose primary key consists of DMP_TASK_ID and DMP_file. The table is created in the source system, provides the export file information, and records the state of each export file. Each record corresponds to one partition or to one table (for a table without partitions), and the records in EXP_TASK_FILES are inserted by the target system program through the DBLINK.
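As an illustration only, the export file list can be modelled in memory as below. The composite key (DMP_TASK_ID, DMP_file) comes from the description; the remaining fields (table_name, partition, status), the status values, and the class names are assumptions, since the description does not list the other columns of the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExportFileRecord:
    dmp_task_id: str                 # first part of the primary key
    dmp_file: str                    # second part of the primary key
    table_name: str                  # hypothetical column
    partition: Optional[str] = None  # None for a table without partitions
    status: str = "pending"          # hypothetical status value


class ExportTaskFiles:
    """In-memory stand-in for the export file list table."""

    def __init__(self):
        self._rows = {}

    def insert(self, record):
        key = (record.dmp_task_id, record.dmp_file)
        if key in self._rows:
            raise KeyError(f"duplicate primary key {key}")
        self._rows[key] = record

    def set_status(self, dmp_task_id, dmp_file, status):
        self._rows[(dmp_task_id, dmp_file)].status = status

    def by_status(self, status):
        return [r for r in self._rows.values() if r.status == status]


files = ExportTaskFiles()
files.insert(ExportFileRecord("T001", "acct_p1.dmp", "ACCT", "P1"))
files.insert(ExportFileRecord("T001", "codes.dmp", "CODES"))
files.set_status("T001", "acct_p1.dmp", "exported")
```

Each record maps to exactly one partition or one unpartitioned table, so the status of every dump file can be tracked independently.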

According to the data extraction method provided by the invention, the acquired data extraction task of the source system is analyzed, the data extraction task is divided into sub-tasks according to the partition granularity, and finally a task extraction list is generated, so that when the data is extracted in parallel by partitions, namely the data is imported or exported, the problem of low data extraction efficiency caused by extraction of all the data to be extracted in the prior art can be solved. Meanwhile, two data extraction modes are set, so that the extraction efficiency can be improved by adopting the corresponding data extraction mode according to the data capacity, manual intervention is not needed, the data synchronization of a source system and a target system can be realized, and the error influence of data extraction is reduced.

The embodiment of the invention also provides a partition adding and cleaning method, which comprises the following steps:

After the data file is generated, the partition corresponding to the data file may not exist in the target system, so it is necessary to judge whether the partition exists. If the partition does not exist, it is added in the target system; if the partition exists, it is cleaned, that is, the original data of the partition is deleted. For a table without partitions, the data of the entire table is emptied before the data file is imported.
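The add-or-clean decision can be sketched as follows, with the target system modelled as a nested dictionary instead of a database. The function name prepare_target, the return values, and the use of the key None for an unpartitioned table are assumptions made for the example.

```python
def prepare_target(target, table, partition=None):
    """Make the target partition ready for import.

    `target` maps table name -> {partition name -> list of rows}; the key
    None holds the rows of a table without partitions.
    """
    parts = target.setdefault(table, {})
    if partition is not None and partition not in parts:
        parts[partition] = []   # partition missing: add it
        return "added"
    parts[partition] = []       # partition (or whole table) exists: empty it
    return "cleared"


target = {"ACCT": {"P201801": ["old row"]}, "CODES": {None: ["x", "y"]}}
r_new = prepare_target(target, "ACCT", "P201802")
r_old = prepare_target(target, "ACCT", "P201801")
r_tab = prepare_target(target, "CODES")
```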

The data volume of a table without partitions is usually very small, so such a table can basically be extracted in the dblink data extraction mode; if the dmp file extraction mode is used, only one dmp file is generated. In addition, subtasks (file list) are generated at the finest available granularity. For a table without partitions, there is only one subtask (file list entry), namely the table itself. For a table with only single-level partitions (not composite partitions, that is, no sub-partitions), the number of subtasks (file list entries) equals the number of partitions. For a table with composite partitions, the number of subtasks (file list entries) equals the number of sub-partitions.
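The three granularity cases can be made concrete with a small sketch. The input shape (a mapping from partition name to its list of sub-partition names) and the "table:partition" naming of subtasks are assumptions chosen for the illustration.

```python
def build_subtasks(table, partitions=None):
    """Generate subtasks at the finest available granularity.

    `partitions` is None for a table without partitions; otherwise it maps
    each partition name to a list of sub-partition names (empty if none).
    """
    if not partitions:
        return [table]                      # one subtask: the table itself
    subtasks = []
    for part, subparts in partitions.items():
        if subparts:
            subtasks.extend(f"{table}:{sub}" for sub in subparts)
        else:
            subtasks.append(f"{table}:{part}")
    return subtasks


plain = build_subtasks("CODES")
single = build_subtasks("ACCT", {"P1": [], "P2": []})
composite = build_subtasks("TXN", {"P1": ["P1_BJ", "P1_SH"], "P2": ["P2_BJ"]})
```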

The embodiment of the invention also provides a data file transmission method, which comprises the following steps:

In the embodiment of the present invention, the preset data transmission mode is FTP (File Transfer Protocol), and FTP is used to transmit the dmp file to the target system. Specifically, the generated DMP files are transferred to the target system by FTP in parallel according to the set parallelism; if the FTP transfer succeeds, the file status in imp_task_files is set to ready for the import function to use:

for the case where the target system and the source system can be connected by DBLINK, the source system updates the file status to "ready" via the DBLINK;

for the case where the target system and the source system cannot be connected by DBLINK, the target system determines whether all the tables of the source system under the DMP_TASK_ID have been successfully received by checking whether an empty file whose name is the DMP_TASK_ID plus the name of the source system exists in the receiving directory of the target system; if so, the target system sets the status of all files of that source system under the task to ready.
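The marker-file check for the case without DBLINK can be sketched as below. It assumes that the marker name is the plain concatenation of the DMP_TASK_ID and the source system name with no separator, and that file states are kept in a dictionary keyed by (task id, source system, file name); both are assumptions, as are the function names.

```python
import tempfile
from pathlib import Path


def source_transfer_complete(receive_dir, dmp_task_id, source_name):
    """True when the empty marker file for this task and source system exists."""
    marker = Path(receive_dir) / f"{dmp_task_id}{source_name}"
    return marker.is_file() and marker.stat().st_size == 0


def mark_ready(file_states, receive_dir, dmp_task_id, source_name):
    """Set every file of the given task and source system to "ready" if the marker exists."""
    if not source_transfer_complete(receive_dir, dmp_task_id, source_name):
        return False
    for key in file_states:
        if key[0] == dmp_task_id and key[1] == source_name:
            file_states[key] = "ready"
    return True


with tempfile.TemporaryDirectory() as receive_dir:
    states = {
        ("T001", "SRC_A", "acct_p1.dmp"): "sent",
        ("T001", "SRC_B", "codes.dmp"): "sent",
    }
    before = mark_ready(states, receive_dir, "T001", "SRC_A")  # no marker yet
    (Path(receive_dir) / "T001SRC_A").touch()                  # marker arrives
    after = mark_ready(states, receive_dir, "T001", "SRC_A")
```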

The data import function in the data extraction process is divided into a DMP file extraction mode and a DBLINK data extraction mode:

the DMP file extraction mode comprises the following steps:

the target system executes import according to the configured parallelism parameter when the file state in the import file list is ready and the corresponding partition is processed;

When the data volume is large, the DMP file extraction mode is selected. Each sub-partition is exported to its own DMP file, and for a table without partitions the data of the whole table is exported to generate one DMP file. The source system transmits the DMP files to the target system through the preset transmission mode, FTP (File Transfer Protocol). An advantage of import and export in the DMP file extraction mode is that resuming from a breakpoint is supported: if an export or import reports an error, the program automatically recognizes it and performs the export or import again. Because resuming from a breakpoint is supported, batch peak periods can also be avoided, which makes the import and export of files more efficient.
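Resuming from a breakpoint can be understood as re-running only those files whose status is not yet final. The sketch below illustrates this under stated assumptions: the status values ("ready", "imported", "error") and the helper names are hypothetical, and flaky_import simulates an import that fails once and then succeeds.

```python
def pending_files(status_by_file):
    """Files that still need to be imported (anything not yet "imported")."""
    return sorted(f for f, s in status_by_file.items() if s != "imported")


def run_round(status_by_file, import_one):
    """Attempt every pending file once and record the outcome per file."""
    for name in pending_files(status_by_file):
        try:
            import_one(name)
            status_by_file[name] = "imported"
        except Exception:
            status_by_file[name] = "error"
    return status_by_file


attempts = {}


def flaky_import(name):
    """Hypothetical importer: b.dmp fails on its first attempt only."""
    attempts[name] = attempts.get(name, 0) + 1
    if name == "b.dmp" and attempts[name] == 1:
        raise RuntimeError("simulated import error")


files = {"a.dmp": "ready", "b.dmp": "ready"}
run_round(files, flaky_import)
first_round = dict(files)        # snapshot after the first round
run_round(files, flaky_import)   # second round retries only b.dmp
```

Only the failed file is attempted again in the second round, so work already completed is not repeated.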

And a DBLINK data extraction mode:

For the case where the target system and the source system can be connected by DBLINK, the DBLINK data extraction mode is provided. The specific configuration data is as follows: cfg_value is the specific table name, part_col is the main partition field, and subpart_col is the sub-partition field. According to this configuration information and the generated import file list, the information is spliced into select statements, and the different select statements are executed in parallel through dbms_schedule according to the set parallelism.
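How the configuration fields might be spliced into a select statement is sketched below. The field names cfg_value, part_col, and subpart_col come from the description; the database link name SRC_LINK, the sample table and column names, and the exact shape of the statement are assumptions. Values are interpolated directly for readability only; a real program would validate identifiers and use bind variables.

```python
def build_select(cfg_value, part_col=None, part_value=None,
                 subpart_col=None, subpart_value=None, dblink="SRC_LINK"):
    """Build one select statement for a table, partition, or sub-partition."""
    sql = f"SELECT * FROM {cfg_value}@{dblink}"
    conditions = []
    if part_col and part_value is not None:
        conditions.append(f"{part_col} = '{part_value}'")
    if subpart_col and subpart_value is not None:
        conditions.append(f"{subpart_col} = '{subpart_value}'")
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


whole_table = build_select("ACCT_DETAIL")
one_subpart = build_select("ACCT_DETAIL", "DATA_DATE", "20180131",
                           "PROV_CODE", "110")
```

One statement is produced per entry in the import file list, so the statements can be run independently within the configured parallelism.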

The embodiment of the present invention further provides another data extraction method, which is shown in fig. 2 and mainly includes:

s21, data importing and configuring;

s22, importing configuration analysis;

s23, a data exporting step;

s24, file transmission;

s25, a data importing step;

and S26, front end display step.

In the step of data import configuration of S21, configuring data import tasks in province ranges through the data table range and the date range provided by the front-end interface;

in the step of analyzing the import configuration in S22, analyzing the data import configuration according to the partition information of the background process, and producing a corresponding export import file list, where each partition or each table (a table without partitions) corresponds to one record;

in the data export step of S23, a DMP file is generated for each record based on the export file list and the export is performed;

in the file transmission step of S24, FTP is performed on the generated DMP file to the target system, and the file status is updated;

in the data import step of S25, according to the import file list, the DMP files are imported, or the data is imported directly from the source system through the DBLINK;

in the S26 front end presentation step, the front end presents the execution state of the configured import task and provides the already imported partition list for each table.

The whole process of traditional data extraction requires manual intervention: determining the data range, writing the execution statements, executing the export, generating the DMP file, and executing the import are all carried out step by step by an operator, which is time-consuming, labor-intensive, and inefficient. In the present invention, only the data range is configured at the front end, and all subsequent operations are automated without manual intervention.

In the traditional scheme, errors easily occur when a large number of data tables are exported and imported. Hundreds or even thousands of tables are exported at one time, so once the exported DMP file has a problem, none of the tables can be exported and imported successfully, and the whole export and import must be executed again. In the present invention, the parallelism of export and import is controlled by parameters: after one file has been imported successfully, a new import process is started automatically, and the number of concurrent exports and imports is kept within the parallelism set by the parameters, which prevents the system from going down because excessive parallelism overloads it. The export and import of different files do not affect one another, so if an individual partition has a problem, the export and import of the other partitions are not affected.

The embodiment of the invention judges whether the export or import of a file has finished by checking whether the export or import process still exists, then automatically reads the generated log file and searches it for error keywords to judge whether the export or import succeeded. Because the finest granularity can be the sub-partition level, a data extraction requirement is divided into a plurality of sub-partitions for export and import, so that part of the data of a data table can be extracted without exporting and importing the whole table. In addition, to handle differences in table structure that arise in day-to-day operation, the data extraction program automatically synchronizes the table structures between the source system and the target system during execution, so that the data extraction process is more efficient and requires no manual intervention.
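The log check can be sketched as a simple keyword search. The keyword list and the sample log text below are invented for the illustration and are not taken from any real export or import tool; an actual deployment would use the error markers that its own tools write.

```python
# Hypothetical keyword list; real keywords depend on the tools in use.
ERROR_KEYWORDS = ("error", "fail")


def log_indicates_success(log_text, keywords=ERROR_KEYWORDS):
    """Return True when none of the error keywords occurs in the log text."""
    lowered = log_text.lower()
    return not any(keyword.lower() in lowered for keyword in keywords)


# Synthetic log text used only to exercise the function.
clean_log = "step 1 ok\nstep 2 ok\n"
bad_log = "step 1 ok\nERROR: simulated failure in step 2\n"
clean_result = log_indicates_success(clean_log)
bad_result = log_indicates_success(bad_log)
```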

Corresponding to the data extraction method provided in the embodiment of the present invention, a data extraction device is further provided in the embodiment of the present invention, with reference to fig. 3, including:

the generation module 1 is used for analyzing the acquired data extraction task of the source system and generating a data extraction list corresponding to the data extraction task according to the data partition granularity;

a determining module 2, configured to determine a data extraction manner for the source system according to the data capacity in the data extraction list;

the first extraction module 3 is configured to extract data of each data partition if a first preset data extraction manner is adopted for data extraction, generate a first data file, and store the first data file to a target system;

and the second extraction module 4 is configured to, if data extraction is performed in a second preset data extraction manner, perform data extraction on the source system, and store the extracted data file to the target system.

Optionally, in another embodiment of the present invention, the generating module includes:

Optionally, in another embodiment of the present invention, before saving the data file to the target system, the method further includes:

Optionally, in another embodiment of the present invention, the determining module includes:

Optionally, in another embodiment of the present invention, when saving the first data file to the target system, the method further includes:

According to the data extraction device provided by the invention, the acquired data extraction task of the source system is analyzed, the data extraction task is divided into sub-tasks according to the partition granularity, and finally a task extraction list is generated, so that when the data is extracted in parallel by partitions, namely the data is imported or exported, the problem of low data extraction efficiency caused by extraction of all the data to be extracted in the prior art can be solved. Meanwhile, two data extraction modes are set, so that the extraction efficiency can be improved by adopting the corresponding data extraction mode according to the data capacity, manual intervention is not needed, the data synchronization of a source system and a target system can be realized, and the error influence of data extraction is reduced.

The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data extraction method, comprising:

judging whether the data capacity in the data extraction list is larger than a preset data quantity threshold value, if so, extracting data from the source system by adopting a first preset data extraction mode, and otherwise, extracting data by adopting a second preset data extraction mode; the first preset data extraction mode is a DMP file extraction mode, and the second preset data extraction mode is a DBLINK data extraction mode;

2. The method according to claim 1, wherein the parsing the acquired data extraction task of the source system and generating a data extraction list corresponding to the data extraction task according to a data partition granularity includes:

and generating a data extraction list by each data extraction subtask.

3. The method of claim 1, further comprising, prior to saving a data file to the target system:

4. The method of claim 1, when saving the first data file to a target system, further comprising:

5. A data extraction apparatus, comprising:

the second extraction module is used for extracting data from the source system and storing the extracted data file to the target system if a second preset data extraction mode is adopted for data extraction;

the determining module comprises:

the first preset data extraction mode is a DMP file mode, and the second preset data extraction mode is a DBLINK data mode.

6. The apparatus of claim 5, wherein the generating module comprises:

7. The apparatus of claim 5, further comprising, prior to saving a data file to the target system:

8. The apparatus of claim 5, when saving the first data file to a target system, further comprising: