CN114880387A - Data integration script generation method and device, storage medium and electronic equipment

Publication number
CN114880387A
Authority
CN
China
Prior art keywords
data
data integration
script
task
information
Legal status
Pending
Application number
CN202210492533.0A
Other languages
Chinese (zh)
Inventor
牙祖将
钱丽雯
Current Assignee
Bank of China Ltd
Original Assignee
Bank of China Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/254: Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2457: Query processing with adaptation to user needs
    • G06F16/24573: Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data integration script generation method and device, a storage medium and electronic equipment, which can be applied in the financial field or other fields. The method comprises: when a task script of a data integration task needs to be generated, determining data acquisition information; according to the data acquisition information, determining a target metadata identifier and the target metadata information corresponding to the target metadata identifier in a preset data lake metadata information base; determining, according to the target metadata information, a database type corresponding to the target metadata identifier; determining a script generation strategy corresponding to the database type; determining a data acquisition mode corresponding to the data integration task; and, if the data acquisition mode is a full acquisition mode, generating a full data integration script according to the script generation strategy and the data acquisition information to obtain the task script of the data integration task. The method generates the data integration script automatically, so the script does not need to be written by hand and working efficiency can be improved.

Description

Data integration script generation method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for generating a data integration script, a storage medium, and an electronic device.
Background
In enterprise informatization, the data lake is a commonly used data storage architecture. A data lake obtains raw data from multiple enterprise data sources and stores all of it in its original form, including structured data (such as relational database data), semi-structured data (such as CSV, XML and JSON), unstructured data (such as e-mail, documents and PDF files) and binary data (such as images, audio and video). This forms a centralized store containing data in every form, which in turn supports downstream business scenarios such as reporting, visual analysis, real-time analysis and machine learning.
Data analysis based on a data lake requires data integration: the data in the lake is extracted, transformed and loaded into a specified data store, that is, it undergoes ETL (Extract-Transform-Load) processing.
Data integration tasks are typically implemented with data integration scripts, that is, ETL scripts. At present, ETL scripts are mainly written by hand: technical personnel analyze their own business data and apply their knowledge of databases, data table structures and SQL. Writing ETL scripts manually consumes a large amount of human resources, takes a long time and is inefficient. In addition, the data sources must be checked manually during writing, so omissions and errors occur easily and adversely affect the data integration work.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data integration script generation method to solve the problems that manually writing an ETL script is time-consuming, inefficient and error-prone.
The embodiment of the invention also provides a data integration script generation device which is used for ensuring the actual realization and application of the method.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
a data integration script generation method comprises the following steps:
when a task script of a data integration task needs to be generated, determining data acquisition information corresponding to the data integration task;
according to the data acquisition information, determining a target metadata identifier corresponding to the data integration task and target metadata information corresponding to the target metadata identifier in a preset data lake metadata information base, wherein the data lake metadata information base comprises a plurality of preset metadata identifiers and metadata information corresponding to each preset metadata identifier;
determining a database type corresponding to the target metadata identification according to the target metadata information;
determining a script generation strategy corresponding to the database type;
determining a data acquisition mode corresponding to the data integration task;
and if the data acquisition mode is a full acquisition mode, generating a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data acquisition information, and taking the full data integration script as a task script of the data integration task.
Optionally, in the above method, the determining of data acquisition information corresponding to the data integration task includes:
acquiring a business theme, an affiliated department, an affiliated product, a metadata identifier and an acquisition mode input by a user, and taking the business theme, the affiliated department, the affiliated product, the metadata identifier and the acquisition mode as the data acquisition information corresponding to the data integration task.
Optionally, the determining, according to the data acquisition information, a target metadata identifier corresponding to the data integration task in a preset data lake metadata information base includes:
and matching the data acquisition information with the metadata information corresponding to each preset metadata identifier respectively, and determining the preset metadata identifier corresponding to the metadata information matched with the data acquisition information as the target metadata identifier.
Optionally, the above method further includes:
if the data acquisition mode is not the full acquisition mode, determining a start-stop condition corresponding to the data integration task;
performing a condition check on the data integration task according to the target metadata information and the start-stop condition;
and if the data integration task passes the condition check, generating an incremental data integration script corresponding to the target metadata identifier according to the script generation strategy, the data acquisition information, the start-stop condition and the target metadata information, and taking the incremental data integration script as the task script of the data integration task.
Optionally, in the above method, the performing of a condition check on the data integration task according to the target metadata information and the start-stop condition includes:
determining incremental information corresponding to the target metadata information, wherein the incremental information comprises an incremental variable identifier and an incremental variable data structure;
judging whether the start-stop condition matches the incremental information;
if the start-stop condition matches the incremental information, determining that the data integration task passes the condition check;
and if the start-stop condition does not match the incremental information, determining that the data integration task does not pass the condition check.
Optionally, the above method further includes:
if the data integration task does not pass the condition check, giving an error prompt and ending the generation process of the task script of the data integration task.
A data integration script generating apparatus comprising:
a first determining unit, configured to determine data acquisition information corresponding to a data integration task when a task script of the data integration task needs to be generated;
a second determining unit, configured to determine, according to the data acquisition information, a target metadata identifier corresponding to the data integration task and target metadata information corresponding to the target metadata identifier in a preset data lake metadata information base, where the data lake metadata information base includes a plurality of preset metadata identifiers and metadata information corresponding to each preset metadata identifier;
a third determining unit, configured to determine, according to the target metadata information, a database type corresponding to the target metadata identifier;
a fourth determining unit, configured to determine a script generation strategy corresponding to the database type;
a fifth determining unit, configured to determine a data acquisition mode corresponding to the data integration task;
and a first generating unit, configured to generate, if the data acquisition mode is a full acquisition mode, a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data acquisition information, and to take the full data integration script as the task script of the data integration task.
The above apparatus, optionally, further comprises:
a sixth determining unit, configured to determine, if the data acquisition mode is not the full acquisition mode, a start-stop condition corresponding to the data integration task;
a checking unit, configured to perform a condition check on the data integration task according to the target metadata information and the start-stop condition;
and a second generating unit, configured to generate, if the data integration task passes the condition check, an incremental data integration script corresponding to the target metadata identifier according to the script generation strategy, the data acquisition information, the start-stop condition and the target metadata information, and to take the incremental data integration script as the task script of the data integration task.
A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium is located to execute the data integration script generation method as described above.
An electronic device comprising a memory, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform the data integration script generation method as described above.
The data integration script generation method provided by the embodiment of the invention comprises the following steps: when a task script of a data integration task needs to be generated, determining data acquisition information corresponding to the data integration task; according to the data acquisition information, determining a target metadata identifier and target metadata information corresponding to the data integration task in a preset data lake metadata information base, wherein the data lake metadata information base comprises a plurality of preset metadata identifiers and metadata information corresponding to each preset metadata identifier; determining a database type corresponding to the target metadata identifier according to the target metadata information; determining a script generation strategy corresponding to the database type; determining a data acquisition mode corresponding to the data integration task; and if the data acquisition mode is a full acquisition mode, generating a full data integration script according to the script generation strategy and the data acquisition information, and taking the full data integration script as the task script of the data integration task. By applying the method provided by the embodiment of the invention, the task script of the data integration task is generated by an automated process, so the script does not need to be written by hand; this saves a large amount of human resources, shortens the processing time and improves working efficiency. In addition, the human errors that arise when scripts are written manually, and their adverse effects on the data integration work, can be avoided.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below are merely embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for generating a data integration script according to an embodiment of the present invention;
Fig. 2 is an interface diagram of a submission interface of a data integration task according to an embodiment of the present invention;
Fig. 3 is an exemplary diagram of a data integration script generation process provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data integration script generating apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiment of the invention provides a method for generating a data integration script, which can be applied to a data integration platform, wherein an execution main body of the method can be a server of the data integration platform, and a method flow chart of the method is shown in fig. 1 and comprises the following steps:
S101: when a task script of a data integration task needs to be generated, determining data acquisition information corresponding to the data integration task;
In the method provided by the embodiment of the invention, a user can submit the relevant information of the data integration task through the front end, which sends the server a trigger instruction to generate the task script of the data integration task. On receiving the trigger instruction, the server acquires the information input by the user and thereby determines the data acquisition information corresponding to the data integration task whose task script is currently to be generated. The data acquisition information may include the identifier of the metadata to be collected, the affiliated department, the affiliated product, and so on.
S102: according to the data acquisition information, determining a target metadata identifier corresponding to the data integration task and target metadata information corresponding to the target metadata identifier in a preset data lake metadata information base, wherein the data lake metadata information base comprises a plurality of preset metadata identifiers and metadata information corresponding to each preset metadata identifier;
In the method provided by the embodiment of the invention, a data lake metadata information base can be constructed in advance by presetting, for each metadata item of the data lake, a metadata identifier and the metadata information corresponding to that identifier. The data lake metadata information base thus comprises a plurality of preset metadata identifiers and the metadata information corresponding to each, with each piece of metadata information corresponding to one metadata item in the data lake. The metadata information corresponding to a preset metadata identifier may include the preset metadata identifier itself, a metadata name, a database type, an acquisition mode, an incremental variable identifier and an incremental variable data structure.
In the method provided by the embodiment of the invention, the target metadata identifier can be determined in the data lake metadata information base according to the data acquisition information of the data integration task, and the metadata information corresponding to the target metadata identifier in the data lake metadata information base is determined as the target metadata information.
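As an illustration only, the metadata information base and the lookup of the target metadata information might be sketched as follows. The field names, the sample entries `cust_txn` and `branch_dim`, and the function `find_target_metadata` are hypothetical and are not taken from the embodiment; the sketch assumes the base can be held as an in-memory mapping keyed by the preset metadata identifier.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MetadataInfo:
    """One entry of a (hypothetical) data lake metadata information base."""
    metadata_id: str                                # preset metadata identifier
    name: str                                       # metadata name
    database_type: str                              # e.g. "hive", "hbase"
    acquisition_mode: str                           # "full" or "incremental"
    increment_variable: Optional[str] = None        # incremental variable identifier
    increment_data_structure: Optional[str] = None  # incremental variable data structure


# Assumed in-memory form: preset identifier -> metadata information.
METADATA_BASE: Dict[str, MetadataInfo] = {
    "cust_txn": MetadataInfo("cust_txn", "Customer transactions", "hive",
                             "incremental", "txn_time", "timestamp"),
    "branch_dim": MetadataInfo("branch_dim", "Branch dimension", "hbase", "full"),
}


def find_target_metadata(acquisition_info: dict) -> MetadataInfo:
    """Return the entry whose identifier equals the one in the acquisition info."""
    target_id = acquisition_info["metadata_id"]
    if target_id not in METADATA_BASE:
        raise LookupError(f"no metadata registered for identifier {target_id!r}")
    return METADATA_BASE[target_id]
```

Keying the mapping by identifier reflects the statement that the target metadata information is the entry stored under the target metadata identifier; a real information base would more likely be a database table.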
S103: determining a database type corresponding to the target metadata identification according to the target metadata information;
In the method provided by the embodiment of the present invention, the database type corresponding to the target metadata identifier, that is, the type of the database storing the metadata corresponding to the target metadata identifier, can be obtained from the target metadata information. The database type may be Hive, HBase, HDFS, NoSQL, or another database type.
S104: determining a script generation strategy corresponding to the database type;
In the method provided by the embodiment of the present invention, a script generation strategy can be set in advance for each database type according to the data integration manner of that type. Specifically, a script template can be set and the script generated by replacing template parameters; alternatively, script statements can be preset and the relevant parameters imported to generate the script, and so on.
In the method provided by the embodiment of the invention, the database type corresponding to the target metadata identifier can be matched with each preset script generation strategy to obtain the script generation strategy corresponding to the database type.
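One possible way to hold the preset strategies is a registry keyed by database type, with each strategy built from a script template. The sketch below is an assumption, not the embodiment's implementation: the template texts, the registry `STRATEGIES` and the function `strategy_for` are hypothetical, and Python's standard `string.Template` stands in for whatever template mechanism a real platform uses.

```python
from string import Template
from typing import Callable, Dict

# A strategy is any callable that turns a parameter mapping into script text.
ScriptStrategy = Callable[[Dict[str, str]], str]


def template_strategy(template_text: str) -> ScriptStrategy:
    """Build a strategy that fills a script template by parameter replacement."""
    template = Template(template_text)
    return lambda params: template.substitute(params)


# Hypothetical templates; real ones depend on the ETL engine in use.
STRATEGIES: Dict[str, ScriptStrategy] = {
    "hive": template_strategy(
        "INSERT OVERWRITE TABLE ${target} SELECT * FROM ${source};"),
    "hdfs": template_strategy("hdfs dfs -cp ${source} ${target}"),
}


def strategy_for(database_type: str) -> ScriptStrategy:
    """Match a database type against the preset strategies (case-insensitive)."""
    key = database_type.lower()
    if key not in STRATEGIES:
        raise ValueError(f"no script generation strategy for {database_type!r}")
    return STRATEGIES[key]
```

For example, `strategy_for("Hive")({"source": "ods.t", "target": "dw.t"})` fills the Hive template with the two table names.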
S105: determining a data acquisition mode corresponding to the data integration task;
In the method provided by the embodiment of the invention, a user can set the acquisition mode of the data integration task through the front end. This setting is contained in the data acquisition information, so the acquisition mode set by the user can be read from the data acquisition information and determined as the data acquisition mode corresponding to the data integration task. Specifically, the data acquisition mode may be a full acquisition mode or an incremental acquisition mode.
S106: and if the data acquisition mode is a full acquisition mode, generating a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data acquisition information, and taking the full data integration script as a task script of the data integration task.
In the method provided by the embodiment of the invention, if the data acquisition mode is the full acquisition mode, a full data integration script, that is, an ETL script, is generated according to the script generation strategy and the data acquisition information, and the full data integration script is taken as the task script of the data integration task. For example, using the full-acquisition script template in the script generation strategy, parameter replacement may be performed with the metadata identifier, business theme, affiliated department, affiliated product and other data in the data acquisition information, and the full data integration script is then generated from the template whose parameters have been replaced.
In the method provided by the embodiment of the invention, the generated task script can be displayed to the user for checking. The task script of the data integration task can subsequently be executed by calling the data integration engine, that is, the ETL execution engine is called to carry out the ETL.
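The parameter replacement described for the full acquisition mode can be sketched as below. This is a minimal illustration under stated assumptions: the template text, the `lake.` and `source.` schema names and the function `generate_full_script` are hypothetical, and the Hive-style statement is only one example of what a generated script could look like.

```python
from string import Template

# Hypothetical full-acquisition template for a Hive source.
FULL_TEMPLATE = Template(
    "-- theme=${business_theme} dept=${department} product=${product}\n"
    "INSERT OVERWRITE TABLE lake.${metadata_id}\n"
    "SELECT * FROM source.${metadata_id};"
)


def generate_full_script(acquisition_info: dict) -> str:
    """Fill the full-acquisition template from the data acquisition information."""
    # substitute() raises KeyError when a placeholder has no value, so an
    # incomplete submission fails loudly instead of yielding a broken script.
    return FULL_TEMPLATE.substitute(acquisition_info)
```

Extra keys in the acquisition information (such as the acquisition mode) are simply ignored by the template, so the same dictionary can be passed to other steps unchanged.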
Based on the method provided by the embodiment of the invention, when the task script of the data integration task needs to be generated, the data acquisition information corresponding to the data integration task is determined; according to the data acquisition information, a target metadata identifier corresponding to the data integration task and the target metadata information corresponding to the target metadata identifier are determined in a preset data lake metadata information base; according to the target metadata information, the database type corresponding to the target metadata identifier is determined; a script generation strategy corresponding to the database type is determined; a data acquisition mode corresponding to the data integration task is determined; and if the data acquisition mode is a full acquisition mode, a full data integration script corresponding to the target metadata identifier is generated according to the script generation strategy and the data acquisition information and taken as the task script of the data integration task. By applying the method provided by the embodiment of the invention, the script generation strategy is determined from the data acquisition information of the data integration task and the preset data lake metadata information base, and the task script is then generated. Because the task script is generated by an automated process, the script does not need to be written by hand, which saves a large amount of human resources, shortens the processing time and improves working efficiency. In addition, the human errors that arise when scripts are written manually, and their adverse effects on the data integration work, can be avoided.
On the basis of the method shown in fig. 1, in the method provided in the embodiment of the present invention, the process of determining the data acquisition information corresponding to the data integration task, which is mentioned in step S101, includes:
acquiring a business theme, an affiliated department, an affiliated product, a metadata identifier and an acquisition mode input by a user, and taking the business theme, the affiliated department, the affiliated product, the metadata identifier and the acquisition mode as the data acquisition information corresponding to the data integration task.
In the method provided by the embodiment of the invention, a user can input, at the front end, the relevant information of the metadata to be acquired, including the business theme, the affiliated department, the affiliated product, the metadata identifier and the acquisition mode; the information input by the user is taken as the data acquisition information corresponding to the data integration task.
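The five user inputs can be gathered into a single record. The following sketch assumes the front end submits a flat dictionary of strings; the class `DataAcquisitionInfo`, the key names and the function `from_form` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class AcquisitionMode(Enum):
    FULL = "full"
    INCREMENTAL = "incremental"


@dataclass(frozen=True)
class DataAcquisitionInfo:
    business_theme: str
    department: str
    product: str
    metadata_id: str
    mode: AcquisitionMode


REQUIRED_FIELDS = ("business_theme", "department", "product", "metadata_id", "mode")


def from_form(form: dict) -> DataAcquisitionInfo:
    """Build acquisition info from submitted form fields, rejecting blank values."""
    missing = [k for k in REQUIRED_FIELDS if not str(form.get(k, "")).strip()]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return DataAcquisitionInfo(
        business_theme=form["business_theme"].strip(),
        department=form["department"].strip(),
        product=form["product"].strip(),
        metadata_id=form["metadata_id"].strip(),
        mode=AcquisitionMode(form["mode"].strip()),  # ValueError on unknown mode
    )
```

Representing the acquisition mode as an enumeration makes the later branch between full and incremental generation an explicit comparison rather than a string test.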
It should be noted that the specific content of the data acquisition information described above merely illustrates one specific embodiment of the method provided by the present invention; in a specific implementation, the data acquisition information may also include other data content.
On the basis of the method shown in fig. 1, in the method provided in the embodiment of the present invention, in step S102, a process of determining a target metadata identifier corresponding to the data integration task in a preset data lake metadata information base according to the data acquisition information includes:
and matching the data acquisition information with the metadata information corresponding to each preset metadata identifier respectively, and determining the preset metadata identifier corresponding to the metadata information matched with the data acquisition information as the target metadata identifier.
In the method provided by the embodiment of the invention, each piece of metadata information in the data lake metadata information base can be matched against the data acquisition information. Specifically, the metadata identifier contained in the metadata information is compared with the metadata identifier in the data acquisition information; if the two are the same, the metadata information is considered to match the data acquisition information. The preset metadata identifier corresponding to the matching metadata information is taken as the target metadata identifier. In a specific application scenario, the preset metadata identifiers in the data lake metadata information base all differ from one another, so generally only one piece of metadata information matches the data acquisition information.
On the basis of the method shown in fig. 1, the method provided by the embodiment of the present invention further includes:
if the data acquisition mode is not the full acquisition mode, determining a start-stop condition corresponding to the data integration task;
In the method provided by the embodiment of the invention, if the data acquisition mode corresponding to the data integration task is not the full acquisition mode, it is the incremental acquisition mode. When the user selects the incremental acquisition mode, the start-stop condition of data acquisition can be input through the front end. The start-stop condition may comprise both a start condition and an end condition, or only a start condition with no end condition set. The start-stop condition characterizes a data range, for example, acquiring data generated after a first preset time point and before a second preset time point.
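A start-stop condition with a mandatory start and an optional end might be represented as follows. The class `StartStopCondition`, its field names and the SQL-style rendering are assumptions made for illustration; the sketch also assumes the incremental variable is a time, as in the example above, and that both bounds are exclusive.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass(frozen=True)
class StartStopCondition:
    variable: str                    # incremental variable the range applies to
    start: datetime                  # start condition (required)
    end: Optional[datetime] = None   # end condition (optional, open-ended if None)

    def __post_init__(self) -> None:
        if self.end is not None and self.end <= self.start:
            raise ValueError("end must be later than start")

    def to_predicate(self) -> str:
        """Render the range as a SQL-style filter: after start, before end."""
        clause = f"{self.variable} > '{self.start.strftime(TIME_FORMAT)}'"
        if self.end is not None:
            clause += f" AND {self.variable} < '{self.end.strftime(TIME_FORMAT)}'"
        return clause
```

Rejecting an end that is not later than the start is an extra sanity check added in this sketch; the embodiment itself only describes checking the condition against the incremental information.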
performing a condition check on the data integration task according to the target metadata information and the start-stop condition;
In the method provided by the embodiment of the invention, the condition check can be performed on the data integration task based on the target metadata information and the start-stop condition, that is, whether the start-stop condition corresponding to the data integration task is feasible is judged according to the target metadata information.
and if the data integration task passes the condition check, generating an incremental data integration script corresponding to the target metadata identifier according to the script generation strategy, the data acquisition information, the start-stop condition and the target metadata information, and taking the incremental data integration script as the task script of the data integration task.
In the method provided by the embodiment of the invention, if the data integration task passes the condition check, an incremental data integration script is generated based on the script generation strategy, the data acquisition information, the start-stop condition and the target metadata information, and the incremental data integration script is taken as the task script of the data integration task. Specifically, using the incremental-acquisition script template in the script generation strategy, parameter replacement may be performed with the metadata identifier, business theme, affiliated department, affiliated product and start-stop condition in the data acquisition information, together with the incremental variable identifier contained in the target metadata information, and the incremental data integration script is then generated from the template whose parameters have been replaced.
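The incremental case differs from the full case mainly in that the template also receives the incremental variable identifier and the start-stop condition. A minimal sketch, assuming a Hive-style template and string-valued bounds; the template text and the function `generate_incremental_script` are hypothetical.

```python
from string import Template
from typing import Optional

# Hypothetical incremental-acquisition template for a Hive source.
INCREMENTAL_TEMPLATE = Template(
    "INSERT INTO TABLE lake.${metadata_id}\n"
    "SELECT * FROM source.${metadata_id}\n"
    "WHERE ${increment_variable} > '${start}'${end_clause};"
)


def generate_incremental_script(metadata_id: str, increment_variable: str,
                                start: str, end: Optional[str] = None) -> str:
    """Fill the incremental template; the end bound is added only when given."""
    end_clause = f" AND {increment_variable} < '{end}'" if end else ""
    return INCREMENTAL_TEMPLATE.substitute(
        metadata_id=metadata_id,
        increment_variable=increment_variable,
        start=start,
        end_clause=end_clause,
    )
```

The values are inserted as plain text here for brevity; a production generator would need to validate or escape them before they reach an execution engine.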
On the basis of the method provided by the foregoing embodiment, in the method provided by the embodiment of the present invention, the process of performing condition check on the data integration task according to the target metadata information and the start-stop condition includes:
determining incremental information corresponding to the target metadata information, wherein the incremental information comprises an incremental variable identifier and an incremental variable data structure;
in the method provided by the embodiment of the present invention, the target metadata information includes an incremental variable identifier and an incremental variable data structure corresponding to the target metadata identifier, and the incremental variable identifier and the incremental variable data structure can be determined as incremental information.
Judging whether the start-stop condition is matched with the increment information;
In the method provided by the embodiment of the present invention, whether the start-stop condition matches the incremental information may be determined from the incremental variable that the start-stop condition refers to. For example, if the start-stop condition refers to time, and the incremental variable characterized by the incremental variable identifier and the incremental variable data structure in the incremental information is also time, the start-stop condition is determined to match the incremental information. Conversely, if the incremental variable that the start-stop condition refers to differs from the one characterized in the incremental information, the two are determined not to match.
If the start-stop condition matches the incremental information, determining that the data integration task passes the condition check;
and if the start-stop condition does not match the incremental information, determining that the data integration task does not pass the condition check.
In the method provided by the embodiment of the invention, if the start-stop condition is matched with the increment information, the data integration task is determined to pass the condition check, otherwise, the data integration task does not pass the condition check.
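One possible way to realize such a check is sketched below. It assumes, purely for illustration, that the incremental variable data structure is recorded as a short type name and that a condition "matches" when each of its values can be parsed as that type; the names `PARSERS` and `passes_condition_check` are hypothetical.

```python
from datetime import datetime

# Hypothetical mapping from an incremental-variable data structure to a parser
# that succeeds only when a condition value has that structure.
PARSERS = {
    "date": lambda value: datetime.strptime(value, "%Y-%m-%d"),
    "int": int,
}


def passes_condition_check(start_stop, increment_info):
    """Return True when every start-stop value matches the incremental variable's structure."""
    parser = PARSERS.get(increment_info["structure"])
    if parser is None:
        return False  # unknown structure: treat as a failed check
    try:
        for value in start_stop:
            parser(value)
    except ValueError:
        return False
    return True


info = {"identifier": "txn_date", "structure": "date"}
print(passes_condition_check(("2022-01-01", "2022-01-31"), info))
print(passes_condition_check(("100",), info))
```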
On the basis of the method provided by the above embodiment, the method provided by the embodiment of the present invention further includes:
and if the data integration task does not pass the condition check, carrying out error reporting prompt and ending the generation process of the task script of the data integration task.
In the method provided by the embodiment of the invention, if the data integration task fails to pass the condition check, the error information is reported, and the generation process of the task script is ended.
In order to better explain the method provided by the embodiment of the present invention, next, in combination with an actual application scenario, the embodiment of the present invention provides another method for generating a data integration script.
The method provided by the embodiment of the invention is applied to a data analysis scenario in a banking institution that is based on a data lake architecture. A data lake architecture stores data in its natural format in a system or repository. A data lake is a large repository that stores a wide variety of an enterprise's raw data, where the data is available for access, processing, analysis, and transmission. The data lake is divided into three levels: underlying storage such as databases, metadata management, and an SQL engine that spans different data sources. Traditional data warehouses store data in relational tables, while data lakes use a flat structure. Each data element is assigned a unique identifier and is tagged with a set of metadata tags.
The application process of the method provided by the embodiment of the invention comprises the following steps:
All enterprise-level metadata in the data lake is sorted out together with its ETL-related information, such as the business topic, department, product, collection mode, incremental variable identifier, and incremental variable data structure, in preparation for the subsequent import of ETL parameters (namely metadata information). The sorted-out content can be recorded according to the structure shown in the following table:
TABLE 1
[Table 1 is provided as an image in the original publication: Figure BDA0003632041900000111]
The index items, from highest to lowest order, are: business topic > department > product > metadata.
A task number is established in preparation for subsequently importing metadata parameter information and adding ETL parameters. The task number data structure can be as follows:
TABLE 2
[Table 2 is provided as an image in the original publication: Figure BDA0003632041900000112]
For the first import of baseline (initial-load) data, a task number for the initial load must be established first, in preparation for the subsequent baseline import.
An established task number is selected and the metadata ETL parameters (namely the metadata information corresponding to a metadata identifier) are imported. During import, the system allocates a unique identifier to each data element and marks the data element with a set of metadata tags. The content of the metadata tag (18 bytes in total) is as follows:
TABLE 3
[Table 3 is provided as an image in the original publication: Figure BDA0003632041900000113]
The metadata ETL parameters are stored in an ETL parameter table, whose structure can be as follows:
TABLE 4
[Table 4 is provided as an image in the original publication: Figure BDA0003632041900000121]
The database type may be, but is not limited to, Hive, HBase, HDFS, NoSQL, and the like.
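Because Table 4 survives only as an image, the exact columns of the ETL parameter table are not recoverable here. As a rough sketch, one row could be modeled with the fields that the description elsewhere lists for a single ETL parameter entry; the class name, field names, and sample values below are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EtlParameter:
    """One hypothetical row of the ETL parameter table."""
    metadata_name: str
    collection_mode: str                      # e.g. "full" or "incremental"
    business_topic: str
    department: str
    product: str
    database_type: str                        # e.g. "Hive", "HBase", "HDFS", "NoSQL"
    task_number: str
    inc_var_id: Optional[str] = None          # only meaningful for incremental collection
    inc_var_structure: Optional[str] = None   # e.g. "date"


row = EtlParameter(
    "loan_txn", "incremental", "loans", "retail", "mortgage",
    "Hive", "T0001", "txn_date", "date",
)
print(row.database_type, row.inc_var_id)
```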
After the ETL parameter table is established, when data needs to be loaded for data analysis, the user can select the metadata requiring ETL, the acquisition mode, and the incremental condition or segmentation interval according to the data access rule provided by the data analyst. Based on the user's selection, the system automatically matches the parameters in the ETL parameter table to obtain the database type corresponding to the metadata and, if the acquisition mode is incremental, also the incremental variable identifier, and then automatically generates the ETL script for the data analysis.
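Choosing how to generate a script from the database type can be sketched as a lookup from type to generator function. The two generators, the `stage_` table prefix, and the `/lake` and `/stage` paths are hypothetical placeholders, not part of the embodiment; the statement shapes they emit are ordinary HiveQL and HDFS shell commands.

```python
def hive_full(table):
    # Full copy of a Hive table into a hypothetical staging table.
    return f"INSERT OVERWRITE TABLE stage_{table} SELECT * FROM {table};"


def hdfs_full(path):
    # Full copy of an HDFS directory into a hypothetical staging area.
    return f"hdfs dfs -cp /lake/{path} /stage/{path}"


# Script generation strategy keyed by database type (assumed registry).
STRATEGIES = {"Hive": hive_full, "HDFS": hdfs_full}


def generate_full_script(database_type, metadata_id):
    """Pick the strategy for the database type and generate a full-collection script."""
    try:
        strategy = STRATEGIES[database_type]
    except KeyError:
        raise ValueError(f"no script generation strategy for {database_type!r}") from None
    return strategy(metadata_id)


print(generate_full_script("Hive", "loan_txn"))
```

A registry of this kind keeps the per-database details in one place, so supporting another database type means adding one entry rather than changing the calling code.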
The ETL parameters support addition, deletion, modification, and maintenance. When ETL parameters are later added, the system checks their validity according to the information selected or entered by the operator, such as the task number, metadata name, business topic, department, product, acquisition mode, incremental variable identifier, and incremental variable data structure, and establishes the ETL parameters of the metadata after the check passes. The system also supports metadata-level ETL parameter query and panorama-level list query: the metadata-level query shows all ETL parameters related to one item of metadata, and the panorama-level list query shows the ETL parameters of all metadata.
Because ETL work is closely tied to tasks, ETL for the relevant data is provided only after a data analysis task has been started. This processing mechanism mainly adds authorization for ETL of data lake data: ETL on the relevant data can be performed only after authorization.
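The description does not spell out the validity rules, so the following is only a guess at what such a check might cover: required fields are present, the task number is one that was established beforehand, and incremental collection carries its incremental variable fields. All names are hypothetical.

```python
REQUIRED = (
    "task_number", "metadata_name", "business_topic",
    "department", "product", "collection_mode",
)


def validate_new_parameter(record, known_tasks):
    """Return a list of problems; an empty list means the record may be added."""
    problems = [f"missing {name}" for name in REQUIRED if not record.get(name)]
    if record.get("task_number") and record["task_number"] not in known_tasks:
        problems.append("unknown task number")
    if record.get("collection_mode") == "incremental":
        # Assumed rule: incremental collection needs both incremental fields.
        for name in ("inc_var_id", "inc_var_structure"):
            if not record.get(name):
                problems.append(f"missing {name}")
    return problems


good = {
    "task_number": "T0001", "metadata_name": "loan_txn", "business_topic": "loans",
    "department": "retail", "product": "mortgage", "collection_mode": "incremental",
    "inc_var_id": "txn_date", "inc_var_structure": "date",
}
print(validate_new_parameter(good, {"T0001"}))
```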
The method provided by the embodiment of the invention can be applied to a data integration platform, wherein the data integration platform is an instantiation of the method shown in FIG. 1, and the platform can comprise the following modules:
the online module mainly comprises the following sub-modules:
a task maintenance submodule: this submodule mainly provides task information maintenance, that is, online transactions for changing, querying, and maintaining the ETL parameters in the ETL parameter table. The main information maintained online includes the metadata tag, metadata name, acquisition mode, incremental variable identifier, incremental variable data structure, task number, and other related information. ETL required for data analysis in the data lake is driven mainly by data analysis tasks; without a data analysis task, no person or organization may arbitrarily change published ETL parameters. This applies both to the first baseline load and to baseline imports after a baseline load task has been established.
Maintenance of batch task information mainly comprises four functions, namely addition, modification, deletion, and query, and each operation takes effect only after authorization by a supervisor.
A metadata ETL parameter adding submodule: this submodule mainly handles the addition of metadata ETL parameters, which can be done in two ways, single entry and batch import. The main fields of the single-entry online transaction interface comprise: metadata name, collection mode, business topic, department, product, database type, incremental variable identifier, incremental variable data structure, and task number, where the task number is one established in advance and is selected from a pull-down menu. Batch import supports loading one or more ETL parameters from an Excel import table filled in according to the template in Table 1. When an ETL parameter is added, the system checks its validity according to information entered by the operator, such as the metadata name, the business topic of the metadata, and the product to which the metadata belongs, and allows the ETL parameter record to be added only after the check passes.
A metadata tag generation submodule: after the user enters or imports the data and submits the add transaction, the system generates a metadata tag for each entered or imported record by encoding the metadata ETL parameter information submitted by the user, and the metadata tag is then stored in a database by the storage module as part of the record's information.
A query submodule: the query of ETL parameters supports single record level query that exposes all ETL parameters related to the record and panorama level list query that exposes the ETL parameters of all records.
A deletion submodule: in contrast to the add operation, the delete operation removes the ETL parameters of the corresponding record from the ETL parameter table.
A modification submodule: this submodule supports selecting a record from the ETL parameter list query for modification. The system regenerates the metadata tag information according to the modified information and updates the associated fields in the modified record.
A storage module: mainly comprising ETL parameter storage.
An ETL generation module: this module mainly provides a visual online transaction screen on which the user can select the ETL generation conditions required for data analysis as needed; after submission, the ETL script the user requires is generated automatically.
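The add, modify, delete, and query submodules listed above can be pictured as operations on an in-memory parameter table. This is a simplified sketch, not the platform's design: the class and method names are hypothetical, rows are keyed by metadata name for convenience, metadata tag regeneration is omitted, and only the mutating operations are assumed to need supervisor authorization.

```python
class EtlParameterTable:
    """Toy stand-in for the ETL parameter storage and its maintenance operations."""

    def __init__(self):
        self._rows = {}  # metadata name -> parameter dict

    @staticmethod
    def _require(authorized):
        if not authorized:
            raise PermissionError("supervisor authorization required")

    def add(self, row, authorized):
        self._require(authorized)
        name = row["metadata_name"]
        if name in self._rows:
            raise KeyError(f"{name} already has ETL parameters")
        self._rows[name] = dict(row)

    def modify(self, metadata_name, changes, authorized):
        self._require(authorized)
        self._rows[metadata_name].update(changes)

    def delete(self, metadata_name, authorized):
        self._require(authorized)
        del self._rows[metadata_name]

    def query_record(self, metadata_name):
        """Single-record query: all ETL parameters of one record."""
        return dict(self._rows[metadata_name])

    def query_panorama(self):
        """Panorama-level list query: ETL parameters of all records."""
        return [dict(row) for row in self._rows.values()]


table = EtlParameterTable()
table.add({"metadata_name": "loan_txn", "database_type": "Hive"}, authorized=True)
table.modify("loan_txn", {"database_type": "HBase"}, authorized=True)
print(table.query_record("loan_txn"))
```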
In the method provided by the embodiment of the present invention, a user interface for a user to submit relevant information of a data integration task may be as shown in fig. 2. The interface includes selection of a task number and input of the business topic, the department and product to which the data belongs, the metadata (identifier), the acquisition mode, the start condition, and the termination condition, where the start condition and the termination condition may be filled in manually and the other data items may be selected from drop-down menus.
As shown in fig. 3, the process of generating a data integration script provided in the embodiment of the present invention specifically includes:
S201: a user submits a task;
in the method provided by the embodiment of the present invention, a front-end interface as shown in fig. 2 may be initialized, and a user may input relevant information of the ETL through the front-end interface, including selecting a data analysis task number, a business topic to which data belongs, a department to which the data belongs, a product to which the data belongs, selecting metadata (identification), selecting an acquisition mode, and submitting a task of generating an ETL script by clicking a submission control.
S202: judging whether the acquisition mode selected by the user is a full acquisition mode;
S203: if the acquisition mode is not the full acquisition mode, judging whether the acquisition mode is an interval acquisition mode;
in the method provided by the embodiment of the invention, the incremental acquisition mode can be further subdivided into an interval acquisition mode and a non-interval acquisition mode, wherein in the interval acquisition mode, a user needs to input a start condition and a termination condition, and in the non-interval acquisition mode, the user only needs to input the start condition.
S204: if the acquisition mode is an interval acquisition mode, acquiring the start condition and the termination condition input by the user to determine the start-stop condition;
S205: if the acquisition mode is a non-interval acquisition mode, acquiring the start condition input by the user, with the termination condition defaulting to the start condition, to determine the start-stop condition;
S206: reading in the ETL parameters corresponding to the metadata selected by the user;
S207: acquiring, from the ETL parameters, the database type, the incremental variable identifier, and the data structure corresponding to the incremental variable identifier;
S208: checking whether the data structure of the incremental variable (namely the data structure corresponding to the incremental variable identifier) matches the start-stop condition;
S209: judging from the check result whether the check is passed;
S210: if the check is passed, generating an incremental ETL script according to the information input by the user, the database type, the incremental variable identifier, and the start-stop condition;
S211: showing the ETL script to the user for review;
S212: if the check is not passed, reporting an error and exiting the process;
S213: if the acquisition mode is judged to be the full acquisition mode in step S202, reading the ETL parameters corresponding to the metadata selected by the user;
S214: obtaining the database type from the ETL parameters;
S215: generating a full ETL script according to the information input by the user and the database type, and proceeding to step S211.
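The branching in steps S202 to S215 can be traced in a compact sketch. The request keys, the mode names, the injected `check` callable, and the one-line "scripts" it returns are all placeholders chosen for illustration; real output would come from the script generation strategy for the database type.

```python
def resolve_start_stop(mode, start=None, end=None):
    """S203-S205: interval mode needs both conditions; non-interval mode defaults end to start."""
    if mode == "interval":
        if start is None or end is None:
            raise ValueError("interval mode needs a start and a termination condition")
        return start, end
    if mode == "non_interval":
        if start is None:
            raise ValueError("non-interval mode needs a start condition")
        return start, start
    raise ValueError(f"unknown acquisition mode: {mode!r}")


def generate_etl_script(request, etl_params, check):
    mode, metadata_id = request["mode"], request["metadata_id"]
    if mode == "full":                                        # S202
        db_type = etl_params[metadata_id]["database_type"]    # S213-S214
        return f"[{db_type}] full load of {metadata_id}"      # S215
    start, end = resolve_start_stop(mode, request.get("start"), request.get("end"))
    params = etl_params[metadata_id]                          # S206
    db_type = params["database_type"]                         # S207
    inc_var = params["inc_var_id"]
    if not check(params["inc_var_structure"], (start, end)):  # S208-S209
        raise ValueError("start-stop condition does not match the incremental variable")  # S212
    return f"[{db_type}] incremental load of {metadata_id} where {inc_var} in [{start}, {end}]"  # S210


etl = {"loan_txn": {"database_type": "Hive", "inc_var_id": "txn_date", "inc_var_structure": "date"}}
only_dates = lambda structure, bounds: structure == "date"  # stand-in for the real check
print(generate_etl_script({"metadata_id": "loan_txn", "mode": "full"}, etl, only_dates))
```

Passing the check in as a callable keeps the flow independent of how matching is actually decided.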
Based on the method provided by the embodiment of the invention, the ETL script can be quickly generated on the basis of the data lake architecture, the repeated labor in the aspect of ETL operation infrastructure is eliminated, the data set in the data lake can be found and can be used for query and analysis, the time of ETL and data cataloging in an analysis project is greatly shortened, and the ETL generation becomes more automatic and more intelligent.
Corresponding to the method for generating a data integration script shown in fig. 1, an embodiment of the present invention further provides a device for generating a data integration script, which is used to implement the method shown in fig. 1 specifically, and a schematic structural diagram of the device is shown in fig. 4, where the device includes:
the first determining unit 301 is configured to determine, when a task script of a data integration task needs to be generated, data acquisition information corresponding to the data integration task;
a second determining unit 302, configured to determine, according to the data acquisition information, a target metadata identifier corresponding to the data integration task and target metadata information corresponding to the target metadata identifier in a preset data lake metadata information base, where the data lake metadata information base includes a plurality of preset metadata identifiers and metadata information corresponding to each preset metadata identifier;
a third determining unit 303, configured to determine, according to the target metadata information, a database type corresponding to the target metadata identifier;
a fourth determining unit 304, configured to determine a script generation policy corresponding to the database type;
a fifth determining unit 305, configured to determine a data acquisition mode corresponding to the data integration task;
a first generating unit 306, configured to generate a full data integration script corresponding to the target metadata identifier according to the script generation policy and the data acquisition information if the data acquisition mode is a full acquisition mode, and use the full data integration script as a task script of the data integration task.
Based on the device provided by the embodiment of the invention, when the task script of the data integration task needs to be generated, the data acquisition information corresponding to the data integration task is determined; according to the data acquisition information, a target metadata identifier corresponding to the data integration task and target metadata information corresponding to the target metadata identifier are determined in a preset data lake metadata information base; according to the target metadata information, a database type corresponding to the target metadata identifier is determined; a script generation strategy corresponding to the database type is determined; a data acquisition mode corresponding to the data integration task is determined; and if the data acquisition mode is a full acquisition mode, a full data integration script corresponding to the target metadata identifier is generated according to the script generation strategy and the data acquisition information and is used as the task script of the data integration task. By applying the device provided by the embodiment of the invention, the script generation strategy can be determined based on the data acquisition information of the data integration task and the preset data lake metadata information base, and the task script is then generated. Because the task script is generated through an automated process, no manual script writing is needed, which saves a large amount of human resources, shortens the processing time, and improves working efficiency. In addition, human errors caused by writing scripts manually, and their adverse effects on the data integration work, can be avoided.
On the basis of the device provided by the above embodiment, the device provided by the embodiment of the present invention further includes:
a sixth determining unit, configured to determine, if the data acquisition mode is not the full acquisition mode, a start-stop condition corresponding to the data integration task;
a checking unit, configured to perform a condition check on the data integration task according to the target metadata information and the start-stop condition;
and a second generating unit, configured to generate an incremental data integration script corresponding to the target metadata identifier according to the script generation policy, the data acquisition information, the start-stop condition, and the target metadata information if the data integration task passes condition check, and use the incremental data integration script as a task script of the data integration task.
On the basis of the apparatus provided in the foregoing embodiment, in the apparatus provided in an embodiment of the present invention, the first determining unit 301 includes:
and the acquisition subunit is used for acquiring a business topic, a department, a product, a metadata identifier, and an acquisition mode input by a user, and taking the business topic, the department, the product, the metadata identifier, and the acquisition mode as the data acquisition information corresponding to the data integration task.
On the basis of the apparatus provided in the foregoing embodiment, in the apparatus provided in the embodiment of the present invention, the second determining unit 302 includes:
and the matching subunit is used for respectively matching the data acquisition information with the metadata information corresponding to each preset metadata identifier, and determining the preset metadata identifier corresponding to the metadata information matched with the data acquisition information as the target metadata identifier.
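The matching performed by this subunit might look like the sketch below, which compares the user-supplied acquisition information with each preset entry in turn. Which fields take part in the comparison is an assumption, as are the function and key names.

```python
MATCH_KEYS = ("business_topic", "department", "product")  # assumed comparison fields


def find_target_metadata(acquisition, metadata_base):
    """Return (identifier, info) of the preset entry matching the acquisition info, or None."""
    for metadata_id, info in metadata_base.items():
        same_id = metadata_id == acquisition["metadata_id"]
        same_fields = all(info.get(key) == acquisition.get(key) for key in MATCH_KEYS)
        if same_id and same_fields:
            return metadata_id, info
    return None


base = {
    "loan_txn": {"business_topic": "loans", "department": "retail",
                 "product": "mortgage", "database_type": "Hive"},
    "card_txn": {"business_topic": "cards", "department": "retail",
                 "product": "credit_card", "database_type": "HBase"},
}
request = {"metadata_id": "card_txn", "business_topic": "cards",
           "department": "retail", "product": "credit_card"}
print(find_target_metadata(request, base))
```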
On the basis of the apparatus provided in the above embodiment, in the apparatus provided in an embodiment of the present invention, the checking unit includes:
the first determining subunit is configured to determine incremental information corresponding to the target metadata information, where the incremental information includes an incremental variable identifier and an incremental variable data structure;
a judging subunit, configured to judge whether the start-stop condition matches the incremental information;
the second determining subunit is configured to determine that the data integration task passes the condition check if the start-stop condition matches the incremental information, and to determine that the data integration task does not pass the condition check if the start-stop condition does not match the incremental information.
On the basis of the device provided by the above embodiment, the device provided by the embodiment of the present invention further includes:
and the error reporting subunit is used for performing error reporting prompt and ending the generation process of the task script of the data integration task if the data integration task fails to pass the condition check.
The embodiment of the present invention further provides a storage medium, where the storage medium includes stored instructions, and when the instructions are executed, the device on which the storage medium is located is controlled to execute the above method for generating a data integration script.
An embodiment of the present invention further provides an electronic device, whose structural diagram is shown in fig. 5. The electronic device specifically includes a memory 401 and one or more instructions 402, where the one or more instructions 402 are stored in the memory 401 and configured to be executed by one or more processors 403 to perform the following operations:
when a task script of a data integration task needs to be generated, determining data acquisition information corresponding to the data integration task;
according to the data acquisition information, determining a target metadata identifier corresponding to the data integration task and target metadata information corresponding to the target metadata identifier in a preset data lake metadata information base, wherein the data lake metadata information base comprises a plurality of preset metadata identifiers and metadata information corresponding to each preset metadata identifier;
determining a database type corresponding to the target metadata identification according to the target metadata information;
determining a script generation strategy corresponding to the database type;
determining a data acquisition mode corresponding to the data integration task;
and if the data acquisition mode is a full acquisition mode, generating a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data acquisition information, and taking the full data integration script as a task script of the data integration task.
The data integration script generation method and apparatus, the storage medium, and the electronic device provided by the present invention may be used in the financial field or other fields, for example, may be used in a data analysis application scenario in the financial field. The other fields are arbitrary fields other than the financial field, for example, the field of communication services. The above description is only an example, and does not limit the application fields of the data integration script generation method and apparatus, the storage medium, and the electronic device provided by the present invention.
The embodiments in the present specification are described in a progressive manner; for the parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus or system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The above-described apparatus and system embodiments are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A data integration script generation method is characterized by comprising the following steps:
when a task script of a data integration task needs to be generated, determining data acquisition information corresponding to the data integration task;
according to the data acquisition information, determining a target metadata identifier corresponding to the data integration task and target metadata information corresponding to the target metadata identifier in a preset data lake metadata information base, wherein the data lake metadata information base comprises a plurality of preset metadata identifiers and metadata information corresponding to each preset metadata identifier;
determining a database type corresponding to the target metadata identification according to the target metadata information;
determining a script generation strategy corresponding to the database type;
determining a data acquisition mode corresponding to the data integration task;
and if the data acquisition mode is a full acquisition mode, generating a full data integration script corresponding to the target metadata identifier according to the script generation strategy and the data acquisition information, and taking the full data integration script as a task script of the data integration task.
2. The method of claim 1, wherein the determining data collection information corresponding to the data integration task comprises:
and acquiring a business theme, a department to which the business theme belongs, a product to which the business theme belongs, a metadata identifier and an acquisition mode input by a user, and taking the business theme, the department to which the business theme belongs, the product to which the business theme belongs, the metadata identifier and the acquisition mode as data acquisition information corresponding to the data integration task.
3. The method according to claim 1, wherein the determining, in a preset data lake metadata information base, a target metadata identifier corresponding to the data integration task according to the data acquisition information includes:
and matching the data acquisition information with the metadata information corresponding to each preset metadata identifier respectively, and determining the preset metadata identifier corresponding to the metadata information matched with the data acquisition information as the target metadata identifier.
4. The method of claim 1, further comprising:
if the data acquisition mode is not the full acquisition mode, determining a start-stop condition corresponding to the data integration task;
performing a condition check on the data integration task according to the target metadata information and the start-stop condition;
and if the data integration task passes the condition check, generating an incremental data integration script corresponding to the target metadata identification according to the script generation strategy, the data acquisition information, the start-stop condition and the target metadata information, and taking the incremental data integration script as a task script of the data integration task.
5. The method of claim 4, wherein the conditionally checking the data integration task according to the target metadata information and the start-stop condition comprises:
determining incremental information corresponding to the target metadata information, wherein the incremental information comprises an incremental variable identifier and an incremental variable data structure;
judging whether the start-stop condition is matched with the increment information;
if the start-stop condition matches the incremental information, determining that the data integration task passes the condition check;
and if the start-stop condition does not match the incremental information, determining that the data integration task does not pass the condition check.
6. The method of claim 4, further comprising:
and if the data integration task does not pass the condition check, carrying out error reporting prompt and ending the generation process of the task script of the data integration task.
7. A data integration script generating apparatus, comprising:
a first determining unit, configured to determine data acquisition information corresponding to a data integration task when a task script of the data integration task needs to be generated;
a second determining unit, configured to determine, according to the data acquisition information, a target metadata identifier corresponding to the data integration task and target metadata information corresponding to the target metadata identifier in a preset data lake metadata information base, where the data lake metadata information base includes a plurality of preset metadata identifiers and metadata information corresponding to each preset metadata identifier;
a third determining unit, configured to determine, according to the target metadata information, a database type corresponding to the target metadata identifier;
a fourth determining unit, configured to determine a script generation policy corresponding to the database type;
a fifth determining unit, configured to determine a data acquisition mode corresponding to the data integration task;
and a first generating unit, configured to generate, if the data acquisition mode is a full acquisition mode, a full data integration script corresponding to the target metadata identifier according to the script generation policy and the data acquisition information, and to use the full data integration script as the task script of the data integration task.
8. The apparatus of claim 7, further comprising:
a sixth determining unit, configured to determine, if the data acquisition mode is not the full acquisition mode, a start-stop condition corresponding to the data integration task;
a checking unit, configured to perform a condition check on the data integration task according to the target metadata information and the start-stop condition;
and a second generating unit, configured to generate, if the data integration task passes the condition check, an incremental data integration script corresponding to the target metadata identifier according to the script generation policy, the data acquisition information, the start-stop condition, and the target metadata information, and to use the incremental data integration script as the task script of the data integration task.
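The cooperation of the units in claims 7 and 8 can be sketched as a single function. This is an illustration under stated assumptions, not the patented implementation: the metadata base contents, the two per-database strategies, the SQL text they emit, and the half-open time window are all hypothetical. Values are interpolated into the SQL only for readability; a real generator would need proper quoting or parameter binding.

```python
# Hypothetical data lake metadata information base: metadata identifier -> metadata information.
METADATA_BASE = {
    "ods.customer": {"db_type": "mysql", "table": "customer", "increment_column": "update_time"},
    "ods.account": {"db_type": "oracle", "table": "ACCOUNT", "increment_column": "UPDATED_AT"},
}


def mysql_strategy(table, column=None, start=None, stop=None):
    # Assumed MySQL flavour: backtick-quoted identifiers, string date literals.
    sql = f"SELECT * FROM `{table}`"
    if column is not None:
        sql += f" WHERE `{column}` >= '{start}' AND `{column}` < '{stop}'"
    return sql + ";"


def oracle_strategy(table, column=None, start=None, stop=None):
    # Assumed Oracle flavour: bounds converted with TO_DATE.
    sql = f"SELECT * FROM {table}"
    if column is not None:
        sql += (
            f" WHERE {column} >= TO_DATE('{start}', 'YYYY-MM-DD')"
            f" AND {column} < TO_DATE('{stop}', 'YYYY-MM-DD')"
        )
    return sql


# Script generation policy per database type.
STRATEGIES = {"mysql": mysql_strategy, "oracle": oracle_strategy}


def generate_task_script(acquisition_info, mode, start_stop=None):
    """Return a full or incremental script for the task described by acquisition_info."""
    metadata = METADATA_BASE[acquisition_info["metadata_id"]]  # target metadata information
    strategy = STRATEGIES[metadata["db_type"]]                 # policy for the database type
    if mode == "full":
        return strategy(metadata["table"])
    column, start, stop = start_stop                           # start-stop condition
    if column != metadata["increment_column"]:                 # simplified condition check
        raise ValueError("start-stop condition does not match the incremental information")
    return strategy(metadata["table"], column, start, stop)
```

Keeping the strategies in a dictionary keyed by database type means that supporting another database only requires registering one more function, which is one plausible reading of a "script generation policy corresponding to the database type".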
9. A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium is located to perform the data integration script generation method according to any one of claims 1 to 6.
10. An electronic device comprising a memory, one or more processors, and one or more instructions stored in the memory, wherein the one or more instructions are configured to be executed by the one or more processors to perform the data integration script generation method according to any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210492533.0A CN114880387A (en) 2022-05-07 2022-05-07 Data integration script generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114880387A true CN114880387A (en) 2022-08-09

Family

ID=82673561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210492533.0A Pending CN114880387A (en) 2022-05-07 2022-05-07 Data integration script generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114880387A (en)

Similar Documents

Publication Publication Date Title
US20200019558A1 (en) Intelligent data ingestion system and method for governance and security
CN110443552B (en) Method and device for automatically transmitting product main data information
CN110471949B (en) Data blood margin analysis method, device, system, server and storage medium
US8838627B2 (en) Systems and methods for providing template based output management
US20070162482A1 (en) Method and system of using artifacts to identify elements of a component business model
CN105138312B (en) A kind of table generation method and device
CN105243528A (en) Financial IT system graphical centralized reconciliation system and method under big data environment
US11893066B2 (en) Binding traits to case nodes
CN103455896B (en) With no paper assembling Quality Control method based on Internet of Things
US20190310840A1 (en) Efficient information storage and retrieval using subgraphs
Ashari et al. Design and build inventory management information system using the scrum method
CN108268468A (en) The analysis method and system of a kind of big data
CN112598364A (en) Expert information management platform based on cloud service
CN110879939A (en) Method and device for generating response document
CN111061733B (en) Data processing method, device, electronic equipment and computer readable storage medium
CN109636303B (en) Storage method and system for semi-automatically extracting and structuring document information
CN116823464A (en) Data asset management platform, electronic device, and computer-readable storage medium
CN114880387A (en) Data integration script generation method and device, storage medium and electronic equipment
CN109542890A (en) Data modification method, device, computer equipment and storage medium
CN116976800A (en) Data processing method and device, storage medium and electronic equipment
US20140149186A1 (en) Method and system of using artifacts to identify elements of a component business model
CN114138787A (en) Bar code identification method, equipment and medium
CN110457332B (en) Information processing method and related equipment
CN113705180A (en) Document editing and reviewing method and device, electronic equipment and storage medium
CN115587087B (en) Efficient data sharing platform based on data extraction and system modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination