CN111159266A - ETL task batch generation method based on metadata - Google Patents

ETL task batch generation method based on metadata Download PDF

Info

Publication number
CN111159266A
CN111159266A CN201911234181.3A CN201911234181A CN111159266A CN 111159266 A CN111159266 A CN 111159266A CN 201911234181 A CN201911234181 A CN 201911234181A CN 111159266 A CN111159266 A CN 111159266A
Authority
CN
China
Prior art keywords
etl
metadata
data
task
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911234181.3A
Other languages
Chinese (zh)
Inventor
陈旋
王冲
郝大松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Aijia Household Products Co Ltd
Original Assignee
Jiangsu Aijia Household Products Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Aijia Household Products Co Ltd filed Critical Jiangsu Aijia Household Products Co Ltd
Priority to CN201911234181.3A priority Critical patent/CN111159266A/en
Publication of CN111159266A publication Critical patent/CN111159266A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24573Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Abstract

The invention discloses a metadata-based ETL task batch generation method, which belongs to the field of data warehouses and is suitable for any ETL tool with metadata storage. The invention can greatly improve the development efficiency of the ETL, can be suitable for various ETL tools, once the template is formulated, the ETL tasks do not need to be developed independently, and only can be generated in batches, the work development of a single ETL task and a plurality of ETL tasks is almost equivalent, and the tasks generated in batches are the ETL tasks based on template standardization and unified specification, so that the standard of the ETL tasks is standardized, the development efficiency of the ETL tasks is improved, and the construction period of a data warehouse is shortened.

Description

ETL task batch generation method based on metadata
Technical Field
The invention belongs to the technical field of data warehouses, and particularly relates to a batch generation method of ETL tasks based on metadata.
Background
With the continuous improvement of enterprise informatization, data of enterprises are gradually precipitated into various informatization systems to form data isolated islands, the data of the whole enterprise cannot be comprehensively analyzed, and most enterprises can choose to construct data warehouses of the enterprises and provide Business Intelligence (BI) systems to assist enterprise operation decisions in order to exert the real value of the data of the enterprises and improve the informatization level of the enterprises. In the whole data warehouse construction process, data acquisition and calculation work (ETL) usually occupies more than half of the workload, and is the basic and core work in the data warehouse construction process. The data warehouse architecture is shown in fig. 5.
At present, main ETL tools in an enterprise include Powercenter, Datastage, open-source key and other tools, the tools cannot provide flexible and definable batch development functions, and in addition, the ETL tools are generally developed by professional ETL development engineers one by one, which generally cannot ensure standardization and standardization of developed ETL tasks, and the development efficiency is low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a metadata-based ETL task batch generation method for any ETL tool with metadata storage, wherein ETL is a process of extracting data from a data source, converting and cleaning the data and finally loading the data to a target, an ETL universal template based on the data processing logic of the ETL tool can be established by analyzing an ETL tool metadata model, and then ETL tasks can be established in batches based on database metadata and the ETL universal template.
The invention adopts the following technical scheme for solving the technical problems:
a batch generation method of ETL tasks based on metadata specifically comprises the following steps:
step 1, creating a sample task: the sample task is an ETL task developed by an ETL tool according to the processing logic of the data, and is stored in the metadata of the ETL tool, and the whole data processing logic is defined in the metadata;
step 2, analyzing a sample task: analyzing the metadata of the sample task created in the step 1, and finding out the key information of description in the sample ETL task;
step 3, manufacturing a sample template: replacing the key information extracted in the step 2 with parametric description, and replacing the parameter content with the content of the actual ETL task in a mode of realizing batch text replacement of a section of program code;
step 4, establishing a mapping relation: establishing the mapping relation between the database table of the data source and the database table of the data writing,
step 5, creating ETL tasks in batches, confirming a list of source data tables needing synchronization, and determining which tables need synchronization according to actual data requirements, wherein all tables in a database can be realized by realizing a section of program: and (3) circularly acquiring metadata of a corresponding table in the source database and metadata of a table in the target database according to the source data table list, replacing the template according to the acquired metadata content, describing parameterization in the sample template manufactured in the step (3) as actual content, and outputting a file to form an ETL task of a corresponding ETL tool.
As a further preferable scheme of the batch generation method of ETL tasks based on metadata, the step 2 is specifically as follows: and (3) opening a task document generated by the ETL tool by using a text tool, namely the ETL sample task generated in the step (1), wherein the structure of the task document is described by an XML structure, and respectively finding out metadata contents describing source data, conversion data and target data in the metadata of the sample task and analyzing the structure of the metadata contents.
As a further preferable solution of the ETL task batch generation method based on metadata of the present invention, in step 2, the key information includes a source data connection string, a target data connection string, data structures of a source table and a target table, and a mapping relationship between a source table field and a target table field.
As a further preferable solution of the present invention, in the ETL task batch generation method based on metadata, in step 3, the parameterized description includes # sourceConnection #, # sourceTableName #.
As a further preferable solution of the metadata-based ETL task batch generation method of the present invention, in step 4, the data warehouse acquires the source data by creating a target table with the same structure as the source data in a buffer area or an ODS area of an operational data storage area, and the naming of the target table is also specified: and establishing a mapping relation between the origin table and the target table through a target table naming specification.
Advantageous effects
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. the invention is a relatively universal method, is not limited to a certain ETL tool, is suitable for various ETL tools which can be described by metadata, and can be realized by using various modes (programming languages);
2. the invention solves the problem of how to standardize and standardize the ETL development in the data warehouse construction process, and the ETL development specification is preset in the template to uniformly generate the standardized ETL task, thereby improving the quality of the ETL development work;
3. the invention solves the problem of development efficiency of hundreds of ETL development tasks in the data warehouse construction process, and improves the efficiency of ETL development work by defining ETL conversion logic into a template and generating ETL tasks in batches;
4. the invention can greatly improve the development efficiency of the ETL, can be suitable for various ETL tools, once the template is formulated, the ETL tasks do not need to be developed independently, and only can be generated in batches, the work development of a single ETL task and a plurality of ETL tasks is almost equivalent, and the tasks generated in batches are the ETL tasks based on template standardization and unified specification, so that the standard of the ETL tasks is standardized, the development efficiency of the ETL tasks is improved, and the construction period of a data warehouse is shortened.
Drawings
FIG. 1 is a schematic diagram of the process of extracting data from a data source and loading the data into a target according to the present invention;
FIG. 2 is a diagram illustrating a method for generating an ETL task based on metadata according to the present invention;
FIG. 3 is a schematic diagram of an example ETL task of the present invention;
FIG. 4 is a diagram illustrating an example of the contents of a template according to the present invention;
FIG. 5 is a schematic diagram of a data warehouse architecture;
fig. 6 is a program implemented logic flow of the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A batch generation method of ETL tasks based on metadata comprises the following steps:
ETL is a process of extracting data from a data source, converting and cleaning the data, and finally loading the data to a target, as shown in fig. 1 below.
The steps of the ETL task generation method based on the metadata are shown in FIG. 2:
method step
1. Creating sample tasks
A sample is generally an ETL task developed by an ETL tool according to some data processing logic (such as extracting data of an A table into a B table with the same structure), and the ETL task is stored in metadata of the ETL tool, and the whole data processing logic is defined in the metadata, as shown in Table 1.
TABLE 1
Figure 428814DEST_PATH_IMAGE002
The sample ETL task structure is shown in fig. 3.
2. Analyzing sample task metadata to find out key information described in sample ETL task (step 1)
The general analysis process is that firstly, a task document generated by an ETL tool is opened by a text tool (or other tools) to obtain an ETL sample task, the structure of which is generally described by an XML structure (the main ETL tools such as Powercenter, Datastage, and keyboard are all XML descriptions), the metadata contents describing source data, conversion, and target data in the sample task metadata are respectively found out, and the structure is analyzed (the structure mainly comprises a source data connection string, a target data connection string, data structures of a source table and a target table, and mapping relations between source table fields and target table fields);
3. making a sample template
Key contents are extracted, which are generally required to be replaced by parameterized descriptions (such as: # parameter name # #), and then the parameter contents can be replaced by the contents of the actual ETL task in a text replacement mode by using tools.
Example of the contents of the template, as shown in FIG. 4.
4. Establishing a mapping relation between a source table and a target table: the mapping relation between a source table (a database table of a data source) and a target table (a database table of data writing) is established, the data warehouse acquires source data, a target table with the same structure as the source data is generally established in a buffer area or an ODS (operation type data storage area) area, the naming of the target table is also specified (data warehouse partition identification, data source identification and source table name), and the mapping relation between the source table and the target table is easily established through the target table naming specification.
The table mapping relationship is shown in table 2.
TABLE 2
Figure 267326DEST_PATH_IMAGE004
5. Batch creation of ETL tasks
The ETL task is created in batch, a list of source data tables needing synchronization is confirmed (generally, which tables need synchronization is determined according to actual data requirements, or all tables in a database), and the ETL task is realized by implementing a section of program (specifically, implementation logic is seen in a technical implementation scheme): and circularly acquiring metadata of a corresponding table in the source database and metadata of a table in the target database according to the source data table list, replacing parameters in the template (the template established in the step 3) with actual contents according to the acquired metadata contents, and outputting the file to form the ETL task of the corresponding ETL tool.
And (6) parameter replacement.
And outputting the ETL task (a button tool example).
An example batch processing program implementation is shown in FIG. 6.
The examples are as follows: the invention adopts the following technical scheme for solving the technical problems: the method is realized through a section of program code (java, python and other languages can all be used), and the specific implementation logic is as follows:
1. loading the mapping relation list data of the synchronous table to obtain a source table, a target table list and mapping relations which need to be processed, wherein the default state of all data is unprocessed;
2. judging whether the 'unprocessed' state data in the mapping relation list of the synchronization table is not empty, if so, ending, and continuing the next step if not;
3. reading the first row configuration data in the unprocessed state to obtain the mapping relation data of the source table and the target table which need to be processed currently;
4. loading metadata of the source table and the target table, and obtaining a source database connection string, a target database connection string, data structures of the source table and the target table, a mapping relation between fields of the source table and fields of the target table and the like by loading the metadata
5. Loading and replacing ETL template parameters, and replacing corresponding parameters in the ETL template with the metadata content obtained in the step 4;
6. outputting an ETL task file, and directly outputting the ETL template after the replacement in the step 5 into the ETL task file;
7. setting the state of the corresponding row in the list data loaded in the step 1 as a processed state; and (5) repeatedly executing the step 2 until the data states in all the lists are all processed, and finishing the processing.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention. While the embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. A batch generation method of ETL tasks based on metadata is characterized in that: the method specifically comprises the following steps:
step 1, creating a sample task: the sample task is an ETL task developed by an ETL tool according to the processing logic of the data, and is stored in the metadata of the ETL tool, and the whole data processing logic is defined in the metadata;
step 2, analyzing a sample task: analyzing the metadata of the sample task created in the step 1, and finding out the key information of description in the sample ETL task;
step 3, manufacturing a sample template: replacing the key information extracted in the step 2 with parametric description, and replacing the parameter content with the content of the actual ETL task in a mode of realizing batch text replacement of a section of program code;
step 4, establishing a mapping relation: establishing the mapping relation between the database table of the data source and the database table of the data writing,
step 5, creating ETL tasks in batches, confirming a list of source data tables needing synchronization, and determining which tables need synchronization according to actual data requirements, wherein all tables in a database can be realized by realizing a section of program: and (3) circularly acquiring metadata of a corresponding table in the source database and metadata of a table in the target database according to the source data table list, replacing the template according to the acquired metadata content, describing parameterization in the sample template manufactured in the step (3) as actual content, and outputting a file to form an ETL task of a corresponding ETL tool.
2. The method of claim 1, wherein the ETL tasks based on the metadata are generated in batches, and the method comprises the following steps: in one embodiment, the step 2 is specifically as follows: and (3) opening a task document generated by the ETL tool by using a text tool, namely the ETL sample task generated in the step (1), wherein the structure of the task document is described by an XML structure, and respectively finding out metadata contents describing source data, conversion data and target data in the metadata of the sample task and analyzing the structure of the metadata contents.
3. The batch generation method of ETL tasks based on metadata as claimed in claim 2, wherein: in one embodiment, in step 2, the key information includes a source data connection string, a target data connection string, data structures of the source table and the target table, and a mapping relationship between fields of the source table and fields of the target table.
4. The batch generation method of ETL tasks based on metadata as claimed in claim 2, wherein: in one embodiment, in step 3, the parametric description contains # sourceConnection #, # sourceTableName #.
5. The batch generation method of ETL tasks based on metadata as claimed in claim 2, wherein: in one embodiment, in step 4, the data warehouse acquires the source data by creating a target table with the same structure as the source data in a buffer area or an ODS area of an operational data storage area, and naming the target table is also specified: and establishing a mapping relation between the origin table and the target table through a target table naming specification.
CN201911234181.3A 2019-12-05 2019-12-05 ETL task batch generation method based on metadata Withdrawn CN111159266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911234181.3A CN111159266A (en) 2019-12-05 2019-12-05 ETL task batch generation method based on metadata

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911234181.3A CN111159266A (en) 2019-12-05 2019-12-05 ETL task batch generation method based on metadata

Publications (1)

Publication Number Publication Date
CN111159266A true CN111159266A (en) 2020-05-15

Family

ID=70556423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911234181.3A Withdrawn CN111159266A (en) 2019-12-05 2019-12-05 ETL task batch generation method based on metadata

Country Status (1)

Country Link
CN (1) CN111159266A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076365A (en) * 2021-04-07 2021-07-06 杭州数梦工场科技有限公司 Data synchronization method and device, electronic equipment and storage medium
CN113448657A (en) * 2021-09-01 2021-09-28 深圳市信润富联数字科技有限公司 Method for generating and executing dynamic spark task
CN113934786A (en) * 2021-09-29 2022-01-14 浪潮卓数大数据产业发展有限公司 Implementation method for constructing unified ETL

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076365A (en) * 2021-04-07 2021-07-06 杭州数梦工场科技有限公司 Data synchronization method and device, electronic equipment and storage medium
CN113448657A (en) * 2021-09-01 2021-09-28 深圳市信润富联数字科技有限公司 Method for generating and executing dynamic spark task
CN113448657B (en) * 2021-09-01 2021-11-30 深圳市信润富联数字科技有限公司 Method for generating and executing dynamic spark task
CN113934786A (en) * 2021-09-29 2022-01-14 浪潮卓数大数据产业发展有限公司 Implementation method for constructing unified ETL
CN113934786B (en) * 2021-09-29 2023-09-08 浪潮卓数大数据产业发展有限公司 Implementation method for constructing unified ETL

Similar Documents

Publication Publication Date Title
CN111159266A (en) ETL task batch generation method based on metadata
US20180159747A1 (en) Automated feature deployment for active analytics microservices
CN108647883B (en) Business approval method, device, equipment and medium
US7854376B2 (en) System and method for managing item interchange and identification in an extended enterprise
CN106844190B (en) Automatic test script generation method and device
CN106874244B (en) Method for constructing automatic document generation model based on work sequence
US11734000B2 (en) System and method for automated cross-application dependency mapping
KR101966518B1 (en) Method to control version of excel-based architecture design file
US20060106856A1 (en) Method and system for dynamic transform and load of data from a data source defined by metadata into a data store defined by metadata
CN101059695A (en) Programmatic access to controller construct and variable names
US20180165386A1 (en) Chaining analytic models in tenant-specific space for a cloud-based architecture
US20200250897A1 (en) Intelligent prognostics and health management system and method
CN112651218A (en) Automatic generation method and management method of bidding document, medium and computer
CN107992293A (en) A kind of enterprise attributes dividing system and its implementation
Augenstein et al. Exploring Design Principles for a Business Model Mining Tool.
US10360208B2 (en) Method and system of process reconstruction
US20160063154A1 (en) Method and system for structured simulation of enterprise model and data
EP3343372A1 (en) Distributed cache cleanup for analytic instance runs processing operating data from industrial assets
CN112445492B (en) ANTLR 4-based source code translation method
CN110879710B (en) Method for automatically converting RPG program into JAVA program
CN109032578B (en) Database SQL (structured query language) -based code generation method and system
Reusch Extending project management processes
CN112199287A (en) Cross-project software defect prediction method based on enhanced hybrid expert model
CN116860227B (en) Data development system and method based on big data ETL script arrangement
US20240126759A1 (en) Converting an api into a graph api

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200515

WW01 Invention patent application withdrawn after publication