CN112416752A - ETL (extract transform load) layered test method based on data warehouse - Google Patents

ETL (extract transform load) layered test method based on data warehouse Download PDF

Info

Publication number
CN112416752A
CN112416752A CN202011201216.6A CN202011201216A CN112416752A CN 112416752 A CN112416752 A CN 112416752A CN 202011201216 A CN202011201216 A CN 202011201216A CN 112416752 A CN112416752 A CN 112416752A
Authority
CN
China
Prior art keywords
data
layer
test
etl
layering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011201216.6A
Other languages
Chinese (zh)
Other versions
CN112416752B (en
Inventor
邓海霞
张瀚
黄小丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan XW Bank Co Ltd filed Critical Sichuan XW Bank Co Ltd
Priority to CN202011201216.6A priority Critical patent/CN112416752B/en
Publication of CN112416752A publication Critical patent/CN112416752A/en
Application granted granted Critical
Publication of CN112416752B publication Critical patent/CN112416752B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3688Test management for test execution, e.g. scheduling of test suites
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Abstract

The invention discloses a method for ETL layered testing based on a data warehouse, which comprises the following steps: A. selecting a layering method, drawing up a layering principle, establishing an ETL operation rule layer by layer to layer the data warehouse system, and checking the layering effect after layering is finished; B. analyzing the level of the tested program and defining the testing range; C. preparing test data according to a data source mode; D. and carrying out ETL test layer by layer, sorting, evaluating and reporting the test result, and marking risks for the failed test inspection items. According to the invention, the problem can be identified and solved in the first time through the layered test, the risk of spreading the hidden problem to the upper layer is reduced, the test attention points are favorably concentrated to the layer and the upper and lower layers, and then the layer-by-layer process is carried out, and the test case compiling and executing are more compact and rigorous, so that the accurate test of the big data is realized.

Description

ETL (extract transform load) layered test method based on data warehouse
Technical Field
The invention relates to the field of software development and software testing, in particular to a method for ETL (extract transform load) layered testing based on a data warehouse.
Background
The data warehouse is a software system, is positioned at the bottom layer of the big data system, and is used for collecting data provided by each system in an enterprise and each three-party company outside the enterprise, and cleaning and processing the data according to the needs, thereby providing more comprehensive data support for enterprise operation condition analysis and management decision making. ETL is used to describe the process of extracting (Extract), converting (Transform), loading (Load) data from a source end to a destination end, and is commonly used in data warehouses. The data warehouse is divided into a plurality of layers, and each layer has different functional logic, but can be summarized into an ETL operation.
In the development and test stage of the data warehouse system, a tested system obtains target end data after performing ETL operation on source end data by a plurality of layers, and at present, a tester generally adopts the following two modes when testing the data warehouse system, (1) taking the inside of the whole data warehouse as a black box, firstly constructing initial source end data, then executing an ETL program of the data warehouse, running from A to Z layer by layer, and finally performing a comparison test on result data and the source end data.
(2) When testing personnel test the data warehouse system, the data warehouse is layered, but only the code change layer is tested, the target end data generated by the nearest dependent layer Y is directly constructed, then the tested program Z is operated, and finally the result data is tested.
However, in the first mode, all ETL layer operations are considered as a whole black box, and when a problem occurs in a certain layer inside the ETL layer, the problem cannot be found and repaired within a first time period, but hidden and diffused, so that hidden troubles are buried for data quality. The second way is to shield the complete path on which Z depends, cut in directly from the nearest dependent layer Y, only run the Z program, and then test the result data set. Therefore, both of the two testing methods have the problem that insufficient testing easily causes data quality.
Disclosure of Invention
Aiming at the problems, the invention provides a method for ETL layered test based on a data warehouse, which solves the problems that the data quality is easy to cause and the data cannot be found in time due to insufficient test by carrying out ETL layered test on the data warehouse.
A method for hierarchical testing based on a data warehouse ETL, comprising:
A. selecting a layering method, drawing up a layering principle, establishing an ETL operation rule layer by layer to layer the data warehouse system, and checking the layering effect after layering is finished;
B. analyzing the level of the tested program and defining the testing range;
C. preparing test data according to a data source mode;
D. and carrying out ETL test layer by layer, sorting, evaluating and reporting the test result, and marking risks for the failed test inspection items.
The invention can quickly and accurately define the test range by dividing the data warehouse system hierarchy and the tested code analysis, and sets the test card point on each layer through the hierarchical test, thereby being beneficial to identifying the data problem at the first time, simplifying the test of each layer, dividing and treating the complex problem, and finally solving the large data quality problem into the test problem of each layer.
Further, step a comprises:
A1. the data processing process of the data warehouse system is analyzed, the data are divided from bottom to top according to the data flow direction, and each layer of output data is one ETL layer;
A2. in the layered data warehouse system, each layer completes specific ETL operation and depends on data provided by the lower layer, and any operation of the layer shields the upper layer so that the change of the current layer does not affect the upper layer;
A3. establishing a specification for each layer of ETL, wherein the specification comprises a naming specification of metadata such as a program code name, a database name and a table name, and embedding the hierarchical information in the metadata; and the writing specification of the code, for example, the table name meaning is annotated to all enumerated fields, and data elimination is needed to be carried out when the tables are connected;
operational boundaries are established for each layer of ETL, such as: the data application layer can not directly extract the data of the technical buffer layer across layers, and the codes and the result table on the same layer can not be mutually dependent;
A4. and checking the layering effect, including checking the layering model and evaluating the layering effect.
The step A is used for layering the database, and can shield upstream and downstream changes to lock the changes on each layer, but not the changes on one layer cause linkage changes on other layers. The structure is clearer, and the maintenance and the expansion are more convenient.
Further, step B includes:
B1. the method for judging the hierarchy of the tested program through the static information or the dynamic information specifically comprises the following steps:
step A, carrying out system layering on a data warehouse and embedding layering information in metadata, namely extracting the layering information through a code name;
and (3) judging the layer of the tested program through dynamic information: analyzing ETL operation executed by the tested program, acquiring data source and output data depended by data input, and deducing to obtain the layer of the tested program;
B2. after the layer to which the tested program belongs is determined, a blood margin dependent layer path and an influence layer path can be determined, wherein each related layer ETL is a test range.
The step B analyzes the tested code, and can quickly and accurately define the test range.
Further, step C includes:
C1. drawing up a test data preparation strategy according to a data source mode, wherein the test data preparation strategy comprises data finding and data making; searching reusable test data in a test environment or desensitizing the test environment to obtain data; or selecting to construct data by blocking the dependence system through a real scene or a mock program;
C2. and loading the prepared data into the test environment according to the formulated data preparation strategy.
Further, step D includes:
D1. splitting each layer according to data actions, wherein the splitting comprises an input part, a processing part and an output part;
D2. performing input check on the data input part, including checking the validity, authenticity, stability, timeliness and safety of the data;
D3. the processing part calculates and transforms data, continuously splits the processing part according to execution actions until the processing part cannot be split, and then performs unit test on the split minimum processing logic branch, wherein the unit test needs to test whether the processing logic branch is correct when different data flow in detail, and check the normalization of the processing process and the logic accuracy of processing calculation;
D4. the data output part is subjected to output end inspection test, and the integrity, consistency, rationality and safety of the data are inspected;
D5. and (4) sorting, evaluating and reporting the layer-by-layer ETL test results, and marking risks for the failed test inspection items.
And step D, through the layered test, setting a test card point at each layer, which is beneficial to identifying the data problem at the first time.
According to the invention, through the layered test, the data flow test can be compared from a coarse-grained source data end and a coarse-grained destination end to each layer of fine-grained data input, processing and output test, and the test execution is more detailed and comprehensive. The problem can be recognized and solved at the first time, the risk that the hidden problem spreads to the upper layer is reduced, the test attention points can be focused on the layer and the upper and lower layers, then the test case is gradually shifted, and the test case is compiled and executed more closely and strictly, so that the large data accurate test is realized.
Drawings
Fig. 1 is a flowchart illustrating a method for a hierarchical test based on an ETL of a data warehouse according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples. This should not be understood as limiting the scope of the above-described subject matter of the present invention to the following examples. Various substitutions and alterations according to the general knowledge and conventional practice in the art are intended to be included within the scope of the present invention without departing from the technical spirit of the present invention as described above.
1. Layering a data warehouse system
1.1 selection of layering method: the data warehouse system analyzes the data processing process, the data warehouse system is divided from bottom to top according to the data flow direction, and one layer of ETL is formed when the data is output once.
1.2 drawing up a layering principle: in the layered data warehouse system, each layer completes specific ETL operation, provides data depending on the lower layer, shields the upper layer from modification details, and provides data for the upper layer.
1.3 establishing ETL operation criteria layer by layer:
1.3.1 establish specifications for each layer of ETL, such as naming specifications including: program code name, database name, table name and other metadata, and embedding the hierarchical information in the metadata; as code writing specifications: all enumerated fields need to be annotated with table name meanings, and data elimination needs to be carried out firstly when the tables are connected.
1.3.2 set up operational boundaries for each layer of ETL: for example, the data application layer cannot directly acquire data of the technology buffer layer across layers, or the same-layer code and the result table cannot be interdependent.
1.4 check the layering effect:
1.4.1 test layering model: the layered model substantially conforms to the industry common model and generally comprises: a technology buffer layer, a near source model layer, a subject model layer, a common processing layer, a data application layer, etc.
1.4.2 evaluation of the effects of stratification: the problem data can be found in a first occurrence layer at the first time after the problem data appear; problem data is limited to a certain layer and cannot be diffused upwards; the change range of the bug repairing process can be defined in a certain layer, and multi-layer linkage modification is not needed.
2. Analyzing the level of the tested program
2.1, judging the layer of the tested program according to static information or dynamic information;
and judging through static information: step 1, the data warehouse system is layered, and meanwhile, the layer information is embedded in the name of the tested program in step 1.3.1, so that the layer information can be extracted through the code name.
And judging through dynamic information: and analyzing the ETL operation executed by the tested program, acquiring a data source and output data depended on during data input, and deducing to obtain the layer of the tested program. The operations of ETL are three actions of extraction, loading and conversion, and the data source on which the data input depends can be obtained on the extracted action.
2.2 after the layer to which the tested program belongs is determined, the blood margin dependent layer path and the influence layer path can be determined, wherein each related layer ETL is a test range.
3. Preparing test data
3.1 test data preparation strategy is drawn up: according to the data source mode, the method is divided into the following steps: find data and make data.
A: finding data: searching reusable test data in a test environment; or from the production environment, desensitization.
B: data generation: constructing through real scene simulation; or the dependent system structure is blocked by the mock program.
3.2 test data preparation
Selecting a test environment, preparing data according to a 3.1 drawn strategy, and loading the data into the test environment.
4. ETL testing layer by layer
4.1 splitting each layer according to data actions, and dividing into: inputting, processing and outputting three parts, if the input ratio is taken as an independent variable x, the processing process is to calculate or transform data, and functions similar to functions are performed, so that the processing ratio can be taken as a function f (), the output ratio can be taken as y, the mathematical formula can be simplified into a mathematical formula of a model of y ═ f (x), and then tests are performed on each part of the formula:
4.1.1 for data entry component x, an entry check is made, the check items including:
legitimacy: checking whether the data source is legal or not, and whether the source end data is illegally obtained
Authenticity: checking whether the source data is real and has no false forgery
Stability: checking whether the source data is stable or not, and continuously and stably providing data support
And (3) timeliness: checking whether source end data timeliness meets service requirements
Safety: checking whether the data is safe or not in the process of flowing out from the source end
4.1.2 for the processing function part f (), continuing to split according to the execution action until the processing function part f () is the minimum execution unit function ff (), then performing unit test on all ff () functions, and when the unit test needs to test different data flows in detail, whether the function processing logic branch is correct needs to be checked:
standardization: whether the machining process complies with the specification or not and the operation rule provided in the step 1.3 needs to be satisfied
The accuracy is as follows: the processing and calculation logic is accurate, and the deviation of numerical values due to program realization is avoided
4.1.3 to data output part y, carry out output end inspection test, the inspection object divides into individual and whole two dimensions:
A. the application layer data is subjected to individual sampling test, several records are extracted, the records can be extracted randomly, the records can also be extracted according to a rule condition, and the data is checked one by one and in a field mode.
B. And performing integral test on the result set, taking all the result sets as an integral, and checking the data volume, the numerical value distribution and the extreme value interval from the full set.
The two-dimensional tests of A and B require the following checks on the output data:
integrity: checking whether the output data table reveals complete information without missing
Consistency: checking whether the output data is consistent with the information obtained from other channels
Rationality: checking whether the output data can be interpreted or not, and whether the real scene is satisfied or not
Safety: checking whether there is potential safety hazard when the output data provides information to outside
4.2 compile test quality reports
And (4) sorting, evaluating and reporting the results of the layer-by-layer ETL test in the step 4.1, and marking risks for the failed test examination items.

Claims (5)

1. A method for hierarchical testing based on data warehouse ETL is characterized by comprising the following steps:
A. selecting a layering method, drawing up a layering principle, establishing an ETL operation rule layer by layer to layer the data warehouse system, and checking the layering effect after layering is finished;
B. analyzing the level of the tested program and defining the testing range;
C. preparing test data according to a data source mode;
D. and carrying out ETL test layer by layer, sorting, evaluating and reporting the test result, and marking risks for the failed test inspection items.
2. The method for data warehouse ETL-based tiered testing as recited in claim 1, wherein step a comprises:
A1. the data processing process of the data warehouse system is analyzed, the data are divided from bottom to top according to the data flow direction, and each layer of output data is one ETL layer;
A2. in the layered data warehouse system, each layer completes specific ETL operation and depends on data provided by the lower layer, and any operation of the layer shields the upper layer so that the change of the current layer does not affect the upper layer;
A3. establishing a specification for each layer of ETL, wherein the specification comprises a naming specification of metadata such as a program code name, a database name and a table name, and embedding the hierarchical information in the metadata; and code writing specification, including annotating table name meanings to all enumerated fields, and performing data elimination when the tables are connected;
setting an operation boundary for each layer of ETL, wherein the operation boundary comprises that a data application layer can not directly extract technical buffer layer data across layers, and codes and a result table at the same layer can not be mutually dependent;
A4. and checking the layering effect, including checking the layering model and evaluating the layering effect.
3. The method for data warehouse ETL-based tiered testing as recited in claim 1, wherein step B comprises:
B1. the method for judging the hierarchy of the tested program through the static information or the dynamic information specifically comprises the following steps:
step A, carrying out system layering on a data warehouse and embedding layering information in metadata, namely extracting the layering information through a code name;
and (3) judging the layer of the tested program through dynamic information: analyzing ETL operation executed by the tested program, acquiring data source and output data depended by data input, and deducing to obtain the layer of the tested program;
B2. after the layer to which the tested program belongs is determined, a blood margin dependent layer path and an influence layer path can be determined, wherein each related layer ETL is a test range.
4. The method for data warehouse ETL-based tiered testing as recited in claim 1, wherein: the step C comprises the following steps:
C1. drawing up a test data preparation strategy according to a data source mode, wherein the test data preparation strategy comprises data finding and data making; searching reusable test data in a test environment or desensitizing the test environment to obtain data; or selecting to construct data by blocking the dependence system through a real scene or a mock program;
C2. and loading the prepared data into the test environment according to the formulated data preparation strategy.
5. The method for data warehouse ETL-based tiered testing as recited in claim 1, wherein: the step D comprises the following steps:
D1. splitting each layer according to data actions, wherein the splitting comprises an input part, a processing part and an output part;
D2. performing input check on the data input part, including checking the validity, authenticity, stability, timeliness and safety of the data;
D3. the processing part calculates and transforms data, continuously splits the processing part according to execution actions until the processing part cannot be split, and then performs unit test on the split minimum processing logic branch, wherein the unit test needs to test whether the processing logic branch is correct when different data flow in detail, and check the normalization of the processing process and the logic accuracy of processing calculation;
D4. the data output part is subjected to output end inspection test, and the integrity, consistency, rationality and safety of the data are inspected;
D5. and (4) sorting, evaluating and reporting the layer-by-layer ETL test results, and marking risks for the failed test inspection items.
CN202011201216.6A 2020-11-02 2020-11-02 Data warehouse ETL (extract-transform-load) layered test method Active CN112416752B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011201216.6A CN112416752B (en) 2020-11-02 2020-11-02 Data warehouse ETL (extract-transform-load) layered test method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011201216.6A CN112416752B (en) 2020-11-02 2020-11-02 Data warehouse ETL (extract-transform-load) layered test method

Publications (2)

Publication Number Publication Date
CN112416752A true CN112416752A (en) 2021-02-26
CN112416752B CN112416752B (en) 2023-06-06

Family

ID=74828295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011201216.6A Active CN112416752B (en) 2020-11-02 2020-11-02 Data warehouse ETL (extract-transform-load) layered test method

Country Status (1)

Country Link
CN (1) CN112416752B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069029A (en) * 2015-07-17 2015-11-18 电子科技大学 Real-time ETL (extraction-transformation-loading) system and method
CN105589874A (en) * 2014-10-22 2016-05-18 阿里巴巴集团控股有限公司 ETL task dependence relationship detecting method and device and ETL tool
CN106649754A (en) * 2016-12-26 2017-05-10 中国建设银行股份有限公司 Method and system for testing data warehouse performance
US20180074944A1 (en) * 2016-09-15 2018-03-15 Talend, Inc. Test case generator built into data-integration workflow editor
CN111078766A (en) * 2019-11-18 2020-04-28 江苏艾佳家居用品有限公司 Data warehouse model construction system and method based on multidimensional theory

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105589874A (en) * 2014-10-22 2016-05-18 阿里巴巴集团控股有限公司 ETL task dependence relationship detecting method and device and ETL tool
CN105069029A (en) * 2015-07-17 2015-11-18 电子科技大学 Real-time ETL (extraction-transformation-loading) system and method
US20180074944A1 (en) * 2016-09-15 2018-03-15 Talend, Inc. Test case generator built into data-integration workflow editor
CN106649754A (en) * 2016-12-26 2017-05-10 中国建设银行股份有限公司 Method and system for testing data warehouse performance
CN111078766A (en) * 2019-11-18 2020-04-28 江苏艾佳家居用品有限公司 Data warehouse model construction system and method based on multidimensional theory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
VASILEIOS THEODOROU 等: ""Data generator for evaluating ETL process quality"" *
王冬敏: ""大数据ETL工具的测试设计与实现"" *

Also Published As

Publication number Publication date
CN112416752B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
Hanam et al. Discovering bug patterns in JavaScript
US8019795B2 (en) Data warehouse test automation framework
CN111459799A (en) Software defect detection model establishing and detecting method and system based on Github
Li et al. Classification of software defect detected by black-box testing: An empirical study
Rodriguez-Perez et al. Watch out for extrinsic bugs! a case study of their impact in just-in-time bug prediction models on the openstack project
CN103914379B (en) Fault is automatically injected the method with fault detect and system thereof
Zhao et al. Towards an understanding of change types in bug fixing code
CN109408385B (en) A kind of disfigurement discovery method based on mischief rule and classifying feedback
CN105260300A (en) Service test method based on CAS (General Classification Standards of China Accounting Standards) application platform
CN113487211A (en) Nuclear power equipment quality tracing method and system, computer equipment and medium
Tang et al. Compiler testing: a systematic literature analysis
Han et al. Code smells detection via modern code review: A study of the openstack and qt communities
CN116303380B (en) Data quality checking method, equipment and medium in monitoring service
CN106991050A (en) A kind of static test null pointer dereference defect false positive recognition methods
CN115080448B (en) Method and device for automatically detecting inaccessible path of software code
CN112416752B (en) Data warehouse ETL (extract-transform-load) layered test method
CN114492379B (en) Construction method and application method and device of digital standard meta-model based on semantics
Al-Zubidy et al. Review of systematic literature review tools
Wiesner et al. An ontology-based environment for effective collaborative and concurrent process engineering
Silva Santos et al. Resilience in the management and business research field: a bibliometric analysis
KR102217092B1 (en) Method and apparatus for providing quality information of application
Wu et al. Automated identification of uniqueness in junit tests
CN114331165A (en) Automatic generation and analysis method and system for urban safety performance evaluation report
Liu et al. An ast-based approach to classifying defects
Wang et al. Using workflow patterns to model and validate service requirements

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant