CN111597243A

CN111597243A - Data warehouse-based abstract data loading method and system

Info

Publication number: CN111597243A
Application number: CN202010411049.1A
Authority: CN
Inventors: 李湘玲; 聂冬琴; 唐一帆
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2020-05-15
Filing date: 2020-05-15
Publication date: 2020-08-28
Anticipated expiration: 2040-05-15
Also published as: CN111597243B

Abstract

The invention provides a data warehouse-based method and system for abstract data loading. The method comprises: preprocessing a job script to obtain an actual service script; extracting, from the actual service script, the association relationship among the target table, the source table fields and the inserted content used in statement parsing, and extracting predicate information from the actual service script; determining the association relationship between the loading program and the loading algorithm by using the predicate information and the loading algorithm standard information; and loading data for the input job according to the association relationship among the target table, the source table fields and the inserted content and the association relationship between the loading program and the loading algorithm. The invention remedies the disorganized management of the large number of manually written loading programs that a data warehouse accumulates across different technical platforms, and removes the dependence on manual searching that prevents the logical correspondence of the data warehouse model from being obtained accurately. It thereby shortens loading program development time, reduces labor cost, and improves both the speed and the accuracy of problem localization and analysis.

Description

Data warehouse-based abstract data loading method and system

Technical Field

The invention relates to the technical field of data loading, in particular to a method and a system for abstract data loading based on a data warehouse.

Background

Big data technology is developing rapidly and database software is updated continuously, so data warehouse data is frequently migrated among the SAS, Teradata, Hadoop and GaussDB platforms. During the migration and conversion that precede production operation, a large number of loading programs inevitably have to be written and modified by hand to suit the characteristics of each new technical platform. Although a unified template-based loading tool is established in the later stage of a platform's development, the maintainability of the earlier programs remains poor: the many manual loading programs written in the early stage can only be analyzed and modified one at a time after being located by hand, and common concerns such as loading standards and technical platform characteristics cannot be planned and managed uniformly.

Disclosure of Invention

In order to solve the above problem, an embodiment of the present invention provides a method for abstract data loading based on a data warehouse, where the method includes:

preprocessing a job script to obtain an actual service script;

extracting, from the actual service script, the association relationship among the target table, the source table fields and the inserted content used in statement parsing, and extracting predicate information from the actual service script;

determining the association relationship between the loading program and the loading algorithm by using the predicate information and the loading algorithm standard information;

and loading data for the input job according to the association relationship among the target table, the source table fields and the inserted content and the association relationship between the loading program and the loading algorithm.

Optionally, in an embodiment of the present invention, preprocessing the job script to obtain the actual service script includes: reading the content of the job script into a variable, and removing unnecessary information from the variable by using regular expressions, to obtain an actual service script that contains only the actual service functions.

Optionally, in an embodiment of the present invention, extracting the association relationship among the target table, the source table fields and the inserted content used in statement parsing in the actual service script, and extracting the predicate information from the actual service script, includes: extracting, one by one by using regular expressions, the association relationship among the target table, the source table fields and the inserted content used in statement parsing in the actual service script, together with the SQL statement predicate information in the actual service script.

Optionally, in an embodiment of the present invention, loading data for the input job according to the association relationship among the target table, the source table fields and the inserted content and the association relationship between the loading program and the loading algorithm includes: obtaining a templated loading program from the job name, the table name, the loading algorithm and the field-level logical correspondence of the input job, according to the field-level association relationship between the target table and the inserted content and the field-level association relationship between the target table and the source table fields; and loading data by using the templated loading program.

The embodiment of the present invention further provides a system for abstract data loading based on a data warehouse, where the system includes:

a script processing module, configured to preprocess a job script to obtain an actual service script;

a script parsing module, configured to extract, from the actual service script, the association relationship among the target table, the source table fields and the inserted content used in statement parsing, to extract predicate information from the actual service script, and to determine the association relationship between the loading program and the loading algorithm by using the predicate information and the loading algorithm standard information;

and a data loading module, configured to load data for the input job according to the association relationship among the target table, the source table fields and the inserted content and the association relationship between the loading program and the loading algorithm.

Optionally, in an embodiment of the present invention, the script processing module includes: a statement extraction unit, configured to read the content of the job script into a variable and to remove unnecessary information from the variable by using regular expressions, to obtain an actual service script that contains only the actual service functions.

Optionally, in an embodiment of the present invention, the script parsing module includes: a script parsing unit, configured to extract, one by one by using regular expressions, the association relationship among the target table, the source table fields and the inserted content used in statement parsing in the actual service script, together with the SQL statement predicate information in the actual service script.

Optionally, in an embodiment of the present invention, the data loading module includes: a loading program unit, configured to obtain a templated loading program from the job name, the table name, the loading algorithm and the field-level logical correspondence of the input job, according to the field-level association relationship between the target table and the inserted content and the field-level association relationship between the target table and the source table fields; and a data loading unit, configured to load data by using the templated loading program.

Optionally, in an embodiment of the present invention, the system further includes: a source data module, configured to store the job scripts of application systems.

Optionally, in an embodiment of the present invention, the system further includes: a loading algorithm standard module, configured to store the loading algorithm standard information.

Optionally, in an embodiment of the present invention, the system further includes: a storage module, configured to store the association relationship among the target table, the source table fields and the inserted content and the association relationship between the loading program and the loading algorithm.

The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the following steps are implemented:

preprocessing the operation script to obtain an actual service script;

An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps:

preprocessing the operation script to obtain an actual service script;

The invention remedies the disorganized management of the large number of manually written loading programs that a data warehouse accumulates across different technical platforms, and removes the dependence on manual searching that prevents the logical correspondence of the data warehouse model from being obtained accurately. It thereby shortens loading program development time, reduces labor cost, and improves both the speed and the accuracy of problem localization and analysis.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a method for abstracting data loading based on a data warehouse, according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a system for abstract data loading based on a data warehouse according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a script file structure of a source data application system according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating an SQL statement and a corresponding extraction result in an actual service script according to an embodiment of the present invention;

FIG. 5 is a flowchart of a procedure for extracting association relation according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a method and a system for abstract data loading based on a data warehouse.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes, and it provides data support for business analysis and management decisions. To embody these characteristics, the loading jobs of the data warehouse subject model use the main data algorithms listed in Table 1 (note: these are the general algorithms currently used in the industry for data warehouse modeling).

TABLE 1

In recent years, big data technology has developed rapidly and database software has been updated continuously, so data warehouse data is frequently migrated among the SAS, Teradata, Hadoop and GaussDB platforms. As a result, during the migration and conversion in the early stage of operation, a large number of loading programs inevitably have to be written and modified by hand to suit the characteristics of each new technical platform.

The invention relates to the technical fields of big data platforms, banking, and Internet financial technology information, and in particular to a method for abstracting the manual data loading programs of a data warehouse into a tool-based loading template. FIG. 1 is a flowchart of a method for abstract data loading based on a data warehouse according to an embodiment of the present invention. The method includes:

Step S1: preprocess the job script to obtain an actual service script. The content of the job script file is read into a variable, and line comments, block comments, variables and prompt information are removed by regular expression matching, so that only the SQL statements containing actual service functions remain in the variable. These statements are exported to a file and stored at a designated location for use in relationship extraction. The location is not fixed; any file system location with enough space to store the file may be designated.

Step S2: extract, from the actual service script, the association relationship among the target table, the source table fields and the inserted content used in statement parsing, and extract predicate information from the actual service script. Specifically, the association relationships among the target table, intermediate temporary tables, subqueries, the source table fields and the inserted content (including the various field transformation operations) used in statement parsing in the script are extracted one by one by regular expressions, together with the SQL statement predicate information in the script.

Step S3: determine the association relationship between the loading program and the loading algorithm by using the predicate information and the loading algorithm standard information.

Step S4: load data for the input job according to the association relationship among the target table, the source table fields and the inserted content and the association relationship between the loading program and the loading algorithm. A templated loading program can be generated by inputting the job name, table name, loading algorithm and field-level logical correspondence of the relevant job. The templated loading program is used for data loading and makes program maintenance and field-level logic analysis easier for developers.

As an embodiment of the present invention, preprocessing the job script to obtain the actual service script includes: reading the content of the job script into a variable, and removing unnecessary information from the variable by using regular expressions, to obtain an actual service script that contains only the actual service functions.

In this embodiment, the job script may specifically be preprocessed as follows: the content of the job script file is read into a variable, and line comments, block comments, variables and prompt information are removed by regular expression matching, so that only the SQL statements containing actual service functions remain in the variable; these statements are then exported to a file.

As an embodiment of the present invention, extracting the association relationship among the target table, the source table fields and the inserted content used in statement parsing in the actual service script, and extracting the predicate information from the actual service script, includes: extracting, one by one by using regular expressions, the association relationship among the target table, the source table fields and the inserted content used in statement parsing in the actual service script, together with the SQL statement predicate information in the actual service script.
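As an illustration of the field-level extraction, the following Python sketch pairs each target column of an `INSERT INTO ... SELECT` statement with the expression inserted into it. The statement shape, the regular expression and the names `extract_field_mapping` and `split_top_level` are assumptions made for this example; the source does not disclose its actual patterns.

```python
import re

# Assumed statement shape: INSERT INTO <target> (<cols>) SELECT <exprs> FROM <source>.
INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>[\w.]+)\s*\((?P<cols>[^)]*)\)\s*"
    r"SELECT\s+(?P<exprs>.*?)\s+FROM\s+(?P<source>[\w.]+)",
    re.IGNORECASE | re.DOTALL,
)


def split_top_level(text: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def extract_field_mapping(statement: str) -> dict:
    """Map each target column to the expression inserted into it."""
    match = INSERT_SELECT.search(statement)
    if match is None:
        return {}
    columns = split_top_level(match.group("cols"))
    expressions = split_top_level(match.group("exprs"))
    return {
        "target": match.group("target"),
        "source": match.group("source"),
        "fields": dict(zip(columns, expressions)),
    }
```

The sketch stops at the first `FROM` keyword, so expressions that themselves contain `FROM`, as well as subqueries and intermediate temporary tables, would need additional handling.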

As an embodiment of the present invention, loading data for the input job according to the association relationship among the target table, the source table fields and the inserted content and the association relationship between the loading program and the loading algorithm includes: obtaining a templated loading program from the job name, the table name, the loading algorithm and the field-level logical correspondence of the input job, according to the field-level association relationship between the target table and the inserted content and the field-level association relationship between the target table and the source table fields; and loading data by using the templated loading program.
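A templated loading program of this kind might be generated as in the sketch below. The template text is hypothetical, since the source does not reproduce its templates; it models a full-refresh load (clear the target table, then insert), which is how the description characterizes the F1 algorithm. The function name `render_loader` and its parameters are likewise illustrative.

```python
from string import Template

# Hypothetical template for a full-refresh load: clear the target, then insert.
FULL_REFRESH_TEMPLATE = Template(
    "-- job: $job_name (algorithm: $algorithm)\n"
    "DELETE FROM $target;\n"
    "INSERT INTO $target ($columns)\n"
    "SELECT $expressions\n"
    "FROM $source;"
)


def render_loader(job_name, algorithm, target, source, fields):
    """Fill the template from the job name, table names, algorithm and field mapping.

    `fields` maps each target column to the source expression loaded into it.
    """
    return FULL_REFRESH_TEMPLATE.substitute(
        job_name=job_name,
        algorithm=algorithm,
        target=target,
        source=source,
        columns=", ".join(fields),
        expressions=", ".join(fields.values()),
    )
```

Keeping the field mapping as data rather than hand-written SQL is what lets the same mapping be re-rendered for another platform by swapping the template.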

For the specific implementation of the method, reference may be made to the implementation of the data warehouse-based abstract data loading system.

The invention remedies the disorganized management of the large number of manually written loading programs that a data warehouse accumulates across different technical platforms, and removes the dependence on manual searching that prevents the field-level logical correspondence of the data warehouse model from being obtained accurately. The invention can extract the algorithm information and the field-level logical relationship information contained in the manual loading programs of the data warehouse model and store them in physical tables for unified use and management. The manual loading programs of the data warehouse can therefore be abstracted accurately and quickly into loading program elements, and all manual loading programs can eventually be converted to a template-based, tool-driven loading mode. This shortens loading program development time, reduces labor cost, and improves both the speed and the accuracy of problem localization and analysis.

Fig. 2 is a schematic structural diagram of a system for abstracting data loading based on a data warehouse according to an embodiment of the present invention, where the system includes:

The script processing module 2 is configured to preprocess the job script to obtain the actual service script. The content of the job script file is read into a variable, and line comments, block comments, variables and prompt information are removed by regular expression matching, so that only the SQL statements containing actual service functions remain in the variable. These statements are exported to a file and stored at a designated location for use in relationship extraction. The location is not fixed; any file system location with enough space to store the file may be designated.

The script parsing module 4 is configured to extract, from the actual service script, the association relationship among the target table, the source table fields and the inserted content used in statement parsing, to extract predicate information from the actual service script, and to determine the association relationship between the loading program and the loading algorithm by using the predicate information and the loading algorithm standard information. Specifically, the association relationships among the target table, intermediate temporary tables, subqueries, the source table fields and the inserted content (including the various field transformation operations) used in statement parsing in the script are extracted one by one by regular expressions, together with the SQL statement predicate information in the script.

The data loading module 6 is configured to load data for the input job according to the association relationship among the target table, the source table fields and the inserted content and the association relationship between the loading program and the loading algorithm. A templated loading program can be generated by inputting the job name, table name, loading algorithm and field-level logical correspondence of the relevant job. The templated loading program is used for data loading and makes program maintenance and field-level logic analysis easier for developers.

As an embodiment of the invention, the script processing module includes: a statement extraction unit, configured to read the content of the job script into a variable and to remove unnecessary information from the variable by using regular expressions, to obtain an actual service script that contains only the actual service functions.

As an embodiment of the invention, the script parsing module includes: a script parsing unit, configured to extract, one by one by using regular expressions, the association relationship among the target table, the source table fields and the inserted content used in statement parsing in the actual service script, together with the SQL statement predicate information in the actual service script.

As an embodiment of the present invention, the data loading module includes: a loading program unit, configured to obtain a templated loading program from the job name, the table name, the loading algorithm and the field-level logical correspondence of the input job, according to the field-level association relationship between the target table and the inserted content and the field-level association relationship between the target table and the source table fields; and a data loading unit, configured to load data by using the templated loading program.

As an embodiment of the present invention, the system further includes a source data module 1 for storing the job scripts of application systems. The source data module covers a plurality of application systems, whose scripts are stored in a designated directory of the file system and distinguished by system name.

As an embodiment of the invention, the system further includes a loading algorithm standard module 3 for storing loading algorithm standard information. For example, the F1 algorithm in Table 1 is characterized by an operation that clears all data in the target table followed by an operation that inserts data into the target table, performed in that order; see the algorithm descriptions in Table 1 for details.

As an embodiment of the invention, the system further includes a storage module 5 for storing the association relationship among the target table, the source table fields and the inserted content and the association relationship between the loading program and the loading algorithm.

In an embodiment of the present invention, as shown in FIG. 2, the source data module 1 covers a plurality of application systems, whose scripts are stored in a designated directory of the file system and distinguished by system name. FIG. 3 is a schematic diagram of the script file structure of a source data application system.
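A directory layout of this kind could be enumerated as in the following sketch, which assumes one subdirectory per application system and a `.sql` file suffix. Both are assumptions, because the structure shown in FIG. 3 is not reproduced in the text, and `list_job_scripts` is a hypothetical helper name.

```python
from pathlib import Path


def list_job_scripts(root: Path, suffix: str = ".sql") -> dict:
    """Map each application system name to its job script files.

    Assumes every immediate subdirectory of `root` is named after one
    application system and holds that system's scripts.
    """
    scripts = {}
    for system_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(system_dir.rglob(f"*{suffix}"))
        if files:
            scripts[system_dir.name] = files
    return scripts
```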

The system job SQL statement extraction module, that is, the script processing module 2, preprocesses the job scripts of the source data application systems. It reads the content of the job script file into a variable and removes line comments, block comments, variables and prompt information by regular expression matching, so that only the SQL statements containing actual service functions remain in the variable. These statements are exported to a file and stored at a designated location, where the script parsing module 4 uses them for relationship extraction. The location is not fixed; any file system location with enough space to store the file may be designated.

The loading algorithm standard information module, that is, the loading algorithm standard module 3, holds the predicate feature standards extracted for each algorithm. For example, the F1 algorithm is characterized by an operation that clears all data in the target table followed by an operation that inserts data into the target table, performed in that order. See the algorithm descriptions in Table 1 for details.

The theoretical number of possible algorithms is as follows:

wherein n is the number of corresponding syntax predicates.

The system job script SQL parsing module, that is, the script parsing module 4, uses regular expressions to extract, one by one, the association relationships among the target table, intermediate temporary tables, subqueries, the source table fields and the inserted content (including the various field transformation operations) used in statement parsing in the script, together with the SQL statement predicate information in the script.
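The predicate information can be pictured as the ordered list of leading keywords of the statements in a script. The sketch below rests on that reading, which is an interpretation: the text does not define "predicate" precisely. The keyword list and the name `predicate_sequence` are assumptions.

```python
import re

# Assumed set of statement keywords of interest (illustrative, not exhaustive).
PREDICATE = re.compile(
    r"^\s*(INSERT|DELETE|UPDATE|TRUNCATE|CREATE|DROP|MERGE)\b",
    re.IGNORECASE,
)


def predicate_sequence(script: str) -> list:
    """Return the leading keyword of every matching statement, in script order."""
    sequence = []
    # Naive split: a semicolon inside a string literal would break this.
    for statement in script.split(";"):
        match = PREDICATE.match(statement)
        if match:
            sequence.append(match.group(1).upper())
    return sequence
```

Preserving statement order matters here because the algorithm standards, such as the one described for F1, depend on which operation comes first.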

FIG. 4 shows SQL statements from an actual script and the corresponding extraction result. Element 41 in the figure is a set of target table operation statements from the T00_APP_FIELD_CD_H_ZG0_a job script, and element 42 is the extraction result. Element 42 shows part of the field-level logical correspondence between the target table and the inserted content in the statements, together with the predicate order of the job program's statements.

The script analysis module 4 extracts the field-level logic comparison relationship between the target table field and the inserted content and the corresponding relationship between the program operation and the algorithm. The specific logic formula of the analysis relation of the algorithm is as follows:

if

then

JOB is Algorithm_i

else

JOB is UNKNOW-Algorithm

where x, y and z in JOB(x, y, z) denote the statement sequence, the predicate and the condition in the job, respectively.
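The decision rule can be sketched as follows. Because the condition of the if-branch is not reproduced in the text, this example assumes that a job matches an algorithm when its ordered (predicate, has-condition) pairs equal one of that algorithm's standard patterns. Only the F1 ordering (clear the whole target table, then insert) comes from the description; the keyword spellings, the `None` wildcard and the name `classify_job` are hypothetical.

```python
# Hypothetical standard information. Each step is (predicate, has_condition);
# None means the presence of a condition does not matter for that step.
ALGORITHM_STANDARDS = {
    "F1": [
        [("DELETE", False), ("INSERT", None)],
        [("TRUNCATE", False), ("INSERT", None)],
    ],
}


def _matches(job, pattern):
    """True when the job has the same steps, in the same order, as the pattern."""
    if len(job) != len(pattern):
        return False
    return all(
        keyword == p_keyword and (p_cond is None or cond == p_cond)
        for (keyword, cond), (p_keyword, p_cond) in zip(job, pattern)
    )


def classify_job(job, standards=ALGORITHM_STANDARDS):
    """Return the matching algorithm name, or the fallback label from the text."""
    for name, patterns in standards.items():
        if any(_matches(job, pattern) for pattern in patterns):
            return name
    return "UNKNOW-Algorithm"
```

For example, `classify_job([("DELETE", False), ("INSERT", True)])` yields `"F1"`, while a conditional delete followed by an insert falls through to `"UNKNOW-Algorithm"`.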

FIG. 5 is a schematic diagram of the extraction process, in which the SQL statements are scanned and parsed in a loop until the association relationships are finally found.

After extraction, the field-level logical correspondence between the target table fields and the inserted content and the correspondence between program jobs and algorithms are stored in the models and tables of the database, that is, the storage module 5. The complete relationship extraction flow is shown in FIG. 5.

Step 1: process the manual loading programs one by one. Determine whether the parsing tasks of all programs are finished; if so, exit the program; otherwise, obtain the next manual loading program and start parsing it.

Step 2: determine whether the current SQL statement is the last SQL statement in the manual loading program. If so, parsing of that loading program is complete, and the association relationships and the predicate order are obtained and stored in an array for the next processing step; if not, continue parsing the next statement until parsing is complete.

Step 3: for each field in the target table, extract from the field association relationships the association between the target table field and the source table fields. Once the last field has been processed, proceed to the next step; otherwise, continue extracting until the last field has been handled.

Step 4: from the extracted field associations between the target table and the source table, obtain the field-level logical correspondence between the target fields and the source table fields by regular expression matching. Finally, store this field-level logical correspondence and the relationship between the loading job and the loading algorithm in a database table, and then exit the program.
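The final storage step might look like the following sketch, which uses SQLite purely for illustration. The table and column names are hypothetical; the source says only that the relationships are stored in database tables.

```python
import sqlite3


def store_relations(conn, job_name, algorithm, target, field_map):
    """Persist one job's algorithm and its field-level mapping.

    `field_map` maps each target field to the source expression loaded into it.
    Table and column names are illustrative assumptions.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_algorithm ("
        "job_name TEXT PRIMARY KEY, algorithm TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS field_mapping ("
        "job_name TEXT, target_table TEXT, target_field TEXT, source_expr TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO job_algorithm VALUES (?, ?)",
        (job_name, algorithm),
    )
    conn.executemany(
        "INSERT INTO field_mapping VALUES (?, ?, ?, ?)",
        [(job_name, target, field, expr) for field, expr in field_map.items()],
    )
    conn.commit()
```

With both relationships held in tables, locating every job that writes a given target field becomes a query rather than a manual search through scripts.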

The model loading template program, that is, the data loading module 6, takes as input the job name, table name, loading algorithm and field-level logical correspondence of the relevant job and generates a templated loading program. The templated loading program is used for data loading and makes program maintenance and field-level processing logic analysis easier for developers.

preprocessing the operation script to obtain an actual service script;

Based on the same inventive concept as the data warehouse-based abstract data loading method, the invention also provides the computer device and the computer-readable storage medium. Because they solve the problem on a principle similar to that of the method, their implementation may refer to the implementation of the method, and repeated details are not described again.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method for abstracting data loading based on a data warehouse, the method comprising:

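preprocessing the operation script to obtain an actual service script;

extracting the association relationship among the target table, the source table field, and the inserted content used in statement parsing in the actual service script, and extracting the predicate information in the actual service script;

determining the association relationship between the loading program and the loading algorithm by using the predicate information and the loading algorithm standard information;

and loading data of the input job according to the association relationship among the target table, the source table field, and the inserted content and the association relationship between the loading program and the loading algorithm.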

2. The method of claim 1, wherein preprocessing the operation script to obtain the actual service script comprises:

reading the content of the operation script into a variable, and removing unnecessary information in the variable by using a regular expression to obtain an actual service script only containing an actual service function.
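
By way of non-limiting illustration, the preprocessing recited in claim 2 might be sketched in Python as follows. The sketch assumes that the "unnecessary information" consists of SQL comments and blank lines; the specification excerpt does not enumerate what is removed, and the function name `preprocess_script` is hypothetical.

```python
import re


def preprocess_script(text: str) -> str:
    """Return the script with comments and blank lines removed.

    Assumption: "unnecessary information" means SQL block comments,
    line comments, and empty lines. Comment markers that appear inside
    string literals are not handled by this simple sketch.
    """
    # Remove /* ... */ block comments, including multi-line ones.
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    # Remove -- line comments up to the end of the line.
    text = re.sub(r"--[^\n]*", "", text)
    # Drop trailing whitespace and lines that are now empty.
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line.strip())


raw = "/* header */\nINSERT INTO t (a)\n-- note\nSELECT a FROM s;  \n\n"
print(preprocess_script(raw))
```

The script content is held in a single variable (`text`) and reduced by successive regular-expression substitutions, which mirrors the wording of the claim; a production implementation would need rules matched to the actual script dialect.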

3. The method of claim 1, wherein the extracting the association relationship among the target table, the source table field, and the inserted content used in the parsing of the statement in the actual service script, and the extracting the predicate information in the actual service script comprises:

extracting, one by one using regular expressions, the association relationship among the target table, the source table field, and the inserted content used in statement parsing in the actual service script, and the SQL statement predicate information in the actual service script.
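
By way of non-limiting illustration, the regular-expression extraction recited in claim 3 might be sketched in Python as follows. The sketch assumes a simple `INSERT INTO ... (cols) SELECT ... FROM ...` statement form and assumes that "predicate information" refers to the data-modifying SQL keywords found in the script; neither assumption is confirmed by this excerpt, and all names (`INSERT_SELECT`, `PREDICATE`, `parse_statement`, `extract_predicates`) are hypothetical.

```python
import re

# Matches: INSERT INTO <target> (<cols>) SELECT <exprs> FROM <source>
INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>[\w.]+)\s*\((?P<cols>[^)]*)\)\s*"
    r"SELECT\s+(?P<exprs>.+?)\s+FROM\s+(?P<source>[\w.]+)",
    re.IGNORECASE | re.DOTALL,
)
# Assumption: predicate information = data-modifying keywords.
PREDICATE = re.compile(r"\b(INSERT|UPDATE|DELETE|MERGE)\b", re.IGNORECASE)


def parse_statement(sql):
    """Extract target table, source table, and the field-level mapping."""
    m = INSERT_SELECT.search(sql)
    if m is None:
        return None
    cols = [c.strip() for c in m.group("cols").split(",")]
    # Naive comma split: expressions containing commas are not supported.
    exprs = [e.strip() for e in m.group("exprs").split(",")]
    return {
        "target": m.group("target"),
        "source": m.group("source"),
        "field_map": dict(zip(cols, exprs)),
    }


def extract_predicates(script):
    """List the data-modifying keywords in order of appearance."""
    return [p.upper() for p in PREDICATE.findall(script)]


sql = (
    "INSERT INTO dw.t_acct (acct_id, bal) "
    "SELECT s.id, s.balance FROM ods.s_acct s WHERE s.dt = '20200515';"
)
info = parse_statement(sql)
```

Here `field_map` pairs each target-table field with the expression inserted into it, which is one possible representation of the field-level association relationship.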

4. The method of claim 1, wherein the loading data of the input job according to the association relationship among the target table, the source table field, and the inserted content and the association relationship between the loading program and the loading algorithm comprises:

obtaining a templated loading program by using the job name, the table name, the loading algorithm, and the field-level logical correspondence of the input job, according to the field-level association relationship between the target table and the inserted content and the field-level association relationship between the target table and the source table field;

and loading data by using the templated loading program.
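
By way of non-limiting illustration, the generation of a templated loading program recited in claim 4 might be sketched in Python as follows. The algorithm names `append` and `full_refresh`, the template text, and the function `render_loader` are hypothetical; the loading algorithm standard information actually used is not reproduced in this excerpt.

```python
from string import Template

# Hypothetical loading algorithms, each with its own SQL template.
TEMPLATES = {
    "append": Template(
        "-- job: $job\n"
        "INSERT INTO $target ($cols)\n"
        "SELECT $exprs\n"
        "FROM $source;"
    ),
    "full_refresh": Template(
        "-- job: $job\n"
        "DELETE FROM $target;\n"
        "INSERT INTO $target ($cols)\n"
        "SELECT $exprs\n"
        "FROM $source;"
    ),
}


def render_loader(job, target, source, algorithm, field_map):
    """Fill the template for the chosen algorithm with job-specific values.

    field_map maps each target field to the source expression loaded
    into it (the field-level correspondence).
    """
    template = TEMPLATES[algorithm]
    return template.substitute(
        job=job,
        target=target,
        source=source,
        cols=", ".join(field_map),
        exprs=", ".join(field_map.values()),
    )


program = render_loader(
    "J1", "dw.t", "ods.s", "append", {"a": "s.a", "b": "s.b"}
)
print(program)
```

Keeping one template per loading algorithm means that a new job only supplies its name, tables, and field mapping, which is one way to realize the templating described in the claim.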

5. A system for abstracting data loading based on a data warehouse, the system comprising:

6. The system of claim 5, wherein the script processing module comprises:

a sentence extraction unit, configured to read the content of the operation script into a variable and to remove unnecessary information in the variable by using a regular expression, to obtain an actual service script containing only the actual service function.

7. The system of claim 5, wherein the script parsing module comprises:

a script analysis unit, configured to extract, one by one using regular expressions, the association relationship among the target table, the source table field, and the inserted content used in statement parsing in the actual service script, and the SQL statement predicate information in the actual service script.

8. The system of claim 5, wherein the data loading module comprises:

a loading program unit, configured to obtain a templated loading program by using the job name, the table name, the loading algorithm, and the field-level logical correspondence of the input job, according to the field-level association relationship between the target table and the inserted content and the field-level association relationship between the target table and the source table field;

and a data loading unit, configured to load data by using the templated loading program.

9. The system of claim 5, further comprising: a source data module, configured to store the operation script of the application system.

10. The system of claim 5, further comprising: a loading algorithm standard module, configured to store the loading algorithm standard information.

11. The system of claim 5, further comprising: a storage module, configured to store the association relationship among the target table, the source table field, and the inserted content and the association relationship between the loading program and the loading algorithm.

12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 4 when executing the computer program.

13. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of any of claims 1 to 4.