[ summary of the invention ]
The technical problems to be solved by the invention are as follows:
in the ETL test process, the data volume is usually huge, and if the whole data is matched one by one, the task volume is heavy, the matching efficiency is low, and the running time of the test process is long; moreover, when the adjustment improvement is carried out, the adjustment of the test process is lack of pertinence and purpose and is time-consuming.
The invention achieves the above purpose by the following technical scheme:
in a first aspect, the present invention provides an ETL software testing method, for source data, generating expected results in advance, the method comprising:
importing source data into a preset test flow for processing; the test process comprises data reading, data interaction conversion and data loading;
acquiring a processed test result, matching the test result with an expected result step by step, and recording the matching result;
feeding back the recorded matching result to a tester;
the test result and the expected result are both composed of two-dimensional tables, and the step-by-step matching of the test result and the expected result is specifically as follows: matching the data structure of the test result with the data structure of the expected result; and after the data structure is successfully matched, matching the line number of the test result with the line number of the expected result, and after the line number is successfully matched, continuously matching each line of data in the test result with the corresponding line of data in the expected result line by line.
Preferably, the matching the data structure of the test result and the data structure of the expected result specifically includes: matching the column number of the test result with the column number of the expected result; and after the column number is successfully matched, continuously matching the data definition of each column of the test result with the data definition of each column of the expected result.
Preferably, when the row matching between the test result and the expected result is successful, the method further comprises: and counting the total data volume of the test result, matching the total data volume of the test result with the total data volume of the expected result, and continuously matching each line of data in the test result with the corresponding line of data in the expected result after the total data volume is successfully matched.
Preferably, the matching of each row of data in the test result with the corresponding row of data in the expected result specifically includes:
counting the data size of each row in the test result;
matching the data size of each line in the test result with the data size of the corresponding line in the expected result line by line;
and when the data volume of each row is successfully matched, continuously matching the specific data of each row in the test result with the specific data of the corresponding row in the expected result row by row.
Preferably, the matching of each row of data in the test result with the corresponding row of data in the expected result specifically includes:
counting the data volume of each line in the test result, and arranging the lines in the two-dimensional table according to the sequence of the data volume from small to large;
matching the data size of each line in the test result with the data size of the corresponding line in the expected result line by line;
and when the data volume of each row is successfully matched, sequentially matching the specific data of each row in the test result with the specific data of the corresponding row in the expected result line by line according to the arrangement sequence of the rows.
Preferably, when there is a data structure mismatch between the test result and the expected result, or there is a row number mismatch, or there is a data mismatch of any row between the test result and the expected result, the method further includes:
outputting differential data between the test result and the expected result according to a data matching result;
importing the source data into a preset test flow for processing again, and recording intermediate data generated in each link in the processing process;
and respectively matching the differential data with intermediate data generated in each link, further determining the links of the differential data in the test flow, and generating a test report to feed back to a tester.
Preferably, in the process of matching each line of data in the test result with the corresponding line of data in the expected result, when any line of data in the test result is not matched with the data in the corresponding line of the expected result for the first time, stopping data matching; or stopping data matching when the proportion of the differentiated data between the test result and the expected result reaches a preset threshold value.
Preferably, the method for acquiring the source data specifically includes:
connecting a data source of software to be detected, acquiring relevant information of a data source table by reading a system table of the data source, and writing the relevant information into an ETL system table; wherein the related information comprises one or more items of a data structure, a field type, and primary key information.
Preferably, the importing the source data into the preset test flow for processing specifically includes:
creating a test flow of data synchronization, adding a data reading component, a data cleaning and converting component and a data loading component in the test flow, and setting a source table needing synchronization; the source table is used for storing source data needing synchronization;
carrying out different conversion designs on the test flow according to the test function, wherein the conversion designs comprise one or more of incremental data synchronous design, data filtering design and data cleaning conversion design;
and through the added data components, the source data is subjected to data synchronization processing according to the designed test flow.
Preferably, after the recorded matching result is fed back to a tester, the method further comprises writing and executing an ant script; the ant script is specifically written as follows:
compiling a script of preset conditions of the test flow;
calling codes compiled when the source data are imported into the test flow and the test result is obtained for matching, and completing the test flow and data matching;
calling a code compiled when the matching result is fed back, wherein the code is used for finishing the feedback of the matching result;
compiling a recovery script for recovering the source data and the expected result for the test to an initial state;
the execution of the ant script is specifically as follows:
and executing the ant script according to a preset period, and further completing the periodic ETL software test.
In a second aspect, the present invention further provides an ETL software testing apparatus, including at least one processor and a memory, where the at least one processor and the memory are connected through a data bus, and the memory stores instructions executable by the at least one processor, and the instructions are used to complete the ETL software testing method according to the first aspect after being executed by the processor.
Compared with the prior art, the invention has the beneficial effects that:
the ETL software testing method provided by the invention can realize the automatic testing of the ETL software, the processes of data source acquisition, data interaction conversion, result matching inspection and the like are completed in an automatic mode, and the simple and difficult step-by-step matching can be carried out during result matching verification, so that the testing operation time is greatly saved, and the matching efficiency is improved. Meanwhile, unmatched differential data are effectively utilized, and a link causing matching failure can be determined by matching the differential data with intermediate data, so that a tester can quickly and accurately lock problems in testing, and can make adjustment in a targeted manner, and the debugging efficiency is improved.
[ detailed description ] embodiments
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The invention will be described in detail below with reference to the figures and examples.
Example 1:
the embodiment of the invention provides an ETL software testing method, as shown in FIG. 1, the method specifically comprises the following steps:
step 10, importing source data into a preset test flow for processing; the test process comprises data reading, data interaction conversion and data loading. The testing process is customized by a research and development staff according to specific scene requirements and/or customer requirements, the invention tests the customized and differentiated ETL software, takes the DMETL of the dream database as an example, and specifically calls an API (Application Programming Interface) of the DMETL through java codes to pre-create and set the testing process. Before the step 10 is executed, source data is also acquired in an automated manner, specifically: connecting a data source of software to be detected through jdbc, acquiring relevant information of a source table by reading a system table of the data source, writing the relevant information into a system table of an ETL (extract transform load), and acquiring source data for testing through the relevant information; wherein the related information comprises one or more items of a data structure, a field type, and primary key information. In the embodiment of the present invention, the method for testing ETL software is described by taking DMETL in dameng database as an example, but the present invention is not limited thereto.
And 20, acquiring the processed test result, matching the test result with an expected result, and recording the matching result.
Wherein, for the source data, an expected result needs to be generated in advance, and the expected result is used as a matching standard, and the test result and the expected result in the embodiment of the present invention are both composed of a two-dimensional table, as shown in fig. 2: the columns represent attribute items, which are predetermined, such as "name", "sex", "age", "telephone", "address" and the like in fig. 2, and the rows represent result items, which correspond to specific data under the attribute items. In this step, the processed test result (including the data structure and the data set result) can be obtained by writing the java code, and the expected result is read at the same time, so that the test result is matched with the expected result, and the matching result can be written into result. Here, both the results of successful matching and failed matching can be recorded, and usually, only the result of failed matching can be recorded for later reference.
In making the data match, the test flow is deemed successful if and only if the test result matches the expected result. Assuming that the tester stops matching when the tester is currently set to find that matching fails, in order to improve matching efficiency, the matching process may specifically include three steps, with reference to fig. 3: firstly, matching a data structure of the test result with a data structure of the expected result; secondly, matching the line number of the test result with the line number of the expected result; and thirdly, matching each line of data in the test result with the corresponding line of data in the expected result. The matching of the data structure is specifically as follows: matching the column number of the test result with the column number of the expected result; after the column number matching is successful, the data definition of each column of the test result is continuously matched with the data definition of each column of the expected result, taking fig. 3 as an example, that is, whether each column is respectively defined by data such as name, gender, age, and the like is compared. Because the matching of the data structure and the matching of the row number are much simpler and the matching speed is much faster than the matching of the specific data of each row, the second step can be carried out only after the first step is successfully matched; if the matching fails in the first step, the testing process is proved to have problems, the second step of matching is not needed, and the matching process is directly stopped; similarly, if the first matching is successful and the second matching is failed, the matching process can be directly stopped without performing the third matching. Therefore, the matching efficiency can be greatly improved and the test running time can be saved by the simple step-by-step matching method.
And step 30, feeding back the recorded matching result to a tester. In the step, reading of the recorded result in result.xml can be realized by writing java specifically, and then the function of writing the mail to the tester by calling the mail server interface is called, so that the tester can obtain the recorded result of the matching failure, and further the problem can be checked, the reason can be found out, and the test flow can be adjusted.
The ETL software testing method provided by the invention can realize the automatic testing of the ETL software, and the processes of data acquisition, data interaction conversion, result matching inspection and the like are completed in an automatic mode; when the result matching is verified, the step matching from simple to difficult can be carried out, the matching process can be stopped when the first step matching fails, further matching is not needed, the test running time is greatly saved, and the matching efficiency is improved.
After the step 30, an ant script needs to be written and executed, so that the steps 10 to 30 are effectively connected in series, wherein the writing of the ant script specifically comprises: 1) compiling a script of preset conditions of the test flow; 2) calling codes compiled when the source data are imported into the test flow and the test result is obtained for matching, and completing the test flow and data matching; calling the entry of the code written in the steps 10 and 20 to run the program of the automatic test; 3) calling a code compiled when the matching result is fed back, wherein the code is used for finishing the feedback of the matching result; calling the main program of the program written in the step 30 to complete the acquisition of the matching result by the tester; 4) and writing a recovery script for recovering the source data and the expected result for the test to the initial state. The execution of the ant script is specifically as follows: and executing the ant script according to a preset period, and further completing the periodic ETL software test. For example, ETL software testing needs to be performed once a day, and herein an ant script can be executed once a day in a timed manner using a timer and a bat script, so as to automatically perform ETL testing once a day.
With reference to the embodiment of the present invention, there is also a preferred implementation scheme, in step 20, after the line number matching in the first step is performed and the line number matching is successful, before performing the second step matching, the method further includes: and analyzing and counting the total data volume of the test result, matching the total data volume of the test result with the total data volume of the expected result, and continuing to match each line of data in the test result with the corresponding line of data in the expected result after the total data volume is successfully matched, namely executing the matching of the second step. When the matching of the total data volume fails, the testing process is proved to have problems, and the process of matching each row of data line by line in the second step is not needed. The matching of the total data size is also fast, so that the matching efficiency can be improved to a certain extent by increasing the matching of the total data size before the second step of matching, and the test running time is saved.
With reference to fig. 4, a preferred implementation scheme also exists in combination with the embodiment of the present invention, and for the third step matching process in step 20, the method specifically includes the following steps:
step 201, counting the data size of each line in the test result. Wherein the data size of each line in the expected result is counted in advance.
Step 202, matching the data size of each line in the test result with the data size of the corresponding line in the expected result line by line.
And 203, continuing to match the specific data of each row in the test result with the specific data of the corresponding row in the expected result row by row after the data amount of each row is successfully matched. Compared with the matching of each specific data, the matching of the data volume of each row is simple, the matching speed is high, if the data volume of any row is not matched, the matching process can be directly stopped, the specific data of each row does not need to be matched, the matching efficiency can be improved to a certain extent, and the test running time is saved.
In a more preferred embodiment, referring to fig. 5, for the third matching process in step 20, the following steps may be specifically included:
step 201', counting the data size of each line in the test result, and arranging the lines in the two-dimensional table according to the sequence of the data size from small to large. Compared with the step 201, steps of sorting the rows according to the data volume size are added, for example, the data volume sizes of the first row to the fifth row are respectively 10M, 12M, 13M, 15M and 20M, and are arranged from small to large; the expected results can also be arranged in advance according to the sequence of the line data from small to large, and then correspond to the test results line by line.
Step 202', the data size of each line in the test result is matched with the data size of the corresponding line in the expected result line by line. Taking the five-row data as an example, the sequence formed by the data size of each row of the test result is as follows: 10. 12, 13, 15, 20, only if the sequence formed by the data size of each row of the expected result is also the above-mentioned number sequence, the data size of each row is successfully matched, and if any value is not matched, the matching is failed.
And 203', when the data volumes of the rows are successfully matched, sequentially matching the specific data of each row in the test result with the specific data of the corresponding row in the expected result line by line according to the arrangement sequence of the rows. Still taking the above five-row data as an example, after the data volumes of the rows are successfully matched, the first row with the smallest data volume is matched in sequence, and then the second row, the third row, the fourth row and the fifth row are matched in sequence, wherein the smaller the data volume, the smaller the byte occupied by the row of data is, the faster the matching speed is. Therefore, according to the matching from small to large, the data matching of the rows as many as possible can be completed in the same time, and the unmatched rows can be found conveniently in time, so that the matching speed is increased, and the test running time is saved.
With reference to the embodiment of the present invention, there is also a preferred implementation scheme, when any one of the three matching steps in step 20 fails, that is, when the data structure between the test result and the expected result is not matched, or the number of rows is not matched, or the data in any row is not matched, the problem of matching failure can be found by using the unmatched differentiated data, and the specific method is as follows:
firstly, according to a data matching result, outputting differentiated data between the test result and the expected result, and further feeding the differentiated data back to a tester.
Then, the source data is imported into the preset test flow again for processing, that is, step 20 is executed again, except that intermediate data generated in each link in the retest process is recorded, for example, data after data reading, data after data filtering, and data after different data cleansing or data conversion, which are intermediate data to be recorded.
And finally, matching the differentiated data with intermediate data generated in each link respectively, further determining the links of the differentiated data in the test flow, and generating a test report to feed back to a tester. For example, when the differential data is successfully matched with the data after the data cleaning, the differential data appears in the data cleaning process, which proves that the data cleaning design has problems, and by generating a corresponding test report and feeding the test report back to a tester, the tester can know the error link in time, so that the data cleaning process can be adjusted in a targeted manner, and the whole test flow can be prevented from being adjusted blindly.
In the method, the link causing the matching failure can be determined by utilizing the differentiated data, and the link is effectively fed back to the tester, so that the tester can quickly and accurately lock the problem in the test flow, and further can make targeted adjustment, and the debugging efficiency is improved.
In step 20, when the line-by-line matching in the third step is performed, stopping data matching when a situation that any line of data in the test result is not matched with data in a corresponding line of the expected result occurs for the first time, and recording a corresponding matching failure result; or stopping data matching when the ratio of the differentiated data between the test result and the expected result reaches a preset threshold value, and recording a corresponding matching failure result; the preset threshold value can be adjusted by a tester according to actual needs. For example, when the requirement on the test precision is not high and a slight error is allowed in the test process, the preset threshold value may be set to 2%, the matching may continue as long as the differentiated data is controlled within 2%, and if the differentiated data exceeds 2%, the matching is considered to be failed, and the test process may be ended. For another example, the tester may set that the matching process is continued after the matching fails, and the matching is ended when a preset threshold is reached, and the data of the matching failure is fed back to the tester so that the tester can analyze the data.
Referring to fig. 6, the step 10 is further described in detail, and specifically includes the following steps:
step 101, creating a test flow of data synchronization, adding a data reading component, a data cleaning and converting component and a data loading component in the test flow, and setting a source table needing synchronization; wherein, the source table is used for storing the source data needing synchronization. The data reading assembly is used for extracting data from a data source to a source table, the data cleaning and converting assembly is used for cleaning and converting the data, the data loading assembly is used for loading the data to a target table, and each data assembly is a functional assembly of the DMETL and can be directly called.
Before a test flow is created, a server connected with the DMETL is usually required to be tested, and if the DMETL cannot be connected, abnormal information is thrown out; then, a corresponding project and a corresponding conversion flow are created, for example, a project named as an automatic test project can be created, and a conversion flow named as test data synchronization is created under the project; adding a data reading component, a data cleaning conversion component and a data loading component of the DMETL in a test data synchronization conversion process; and setting a source table to be synchronized according to the relevant information written into the ETL system table, and setting the size of the sub-batch custom cache for the source data to be synchronized in the source table preferably, thereby improving the synchronization efficiency. For example, if the data size to be synchronized is 1G, it can be set to complete 4 times, 256M for each synchronization, which is faster than the 1G data at a time.
And 102, carrying out different conversion designs on the test flow according to the test function, wherein the conversion designs comprise one or more of incremental data synchronous design, data filtering design and data cleaning conversion design.
According to different incremental data acquisition modes, the incremental data synchronization design further comprises a data synchronization design of a trigger increment, an MD5 increment, a shadow table increment and an increment comparison component, and the trigger increment is designed as follows: creating a trigger on a data source to capture incremental change data and operation, and recording the incremental change data and operation into a system table of the DMETL so as to synchronize the incremental data; the MD5 increment is designed as follows: calculating the MD5 value of each row of data, recording the MD5 value in an MD5 table created by the DMETL, obtaining incremental data and operation through primary key matching, and recording the incremental data and the operation in a system table of the DMETL, thereby synchronizing the incremental data; the shadow table increment is designed as follows: copying the data of the source table to a shadow table created by the DMETL, obtaining incremental data and operation through primary key matching, and recording the incremental data and the operation in a system table of the DMETL, so as to synchronize the incremental data; the design of the increment comparison component is as follows: and sequencing the source table and the target table in a database layer, comparing the conditions such as a unique matching example and a comparison column by configuring, obtaining incremental data and operation, converting the incremental data into sql by a code, sending the sql to a target database, and completing the synchronization of the incremental data.
The data filtering is designed as follows: if condition judgment can be adopted, the source data in the source table is filtered according to the filtering condition, and only the data meeting the filtering condition is synchronized to the target table.
The data cleansing conversion is divided into three categories, wherein for field deletion, merging and splitting: the data cleaning conversion component of the DMETL can be used for setting, and deleting, merging or splitting fields according to conditions such as positions or separators; for field content cleaning: the field content can be cleaned through a java function; for date time string formatting: and setting through a data cleaning conversion component of the DMETL, calling a format () method of date conversion, and throwing abnormal information if an error occurs, thereby formatting the date-time character string.
And 103, performing data synchronization processing on the source data according to the designed test flow through the added data components. According to the flow designed in step 102, the synchronization of the source data from the source table to the target table is completed through data reading, data exchange conversion and data loading, and then the corresponding synchronous test flow is completed.
In conclusion, the ETL software testing method provided by the invention can realize the automatic testing of the ETL software, including the acquisition of data sources, the data interaction conversion, the result matching inspection and the like, and can carry out the step-by-step matching from simple to difficult during the result matching verification, thereby greatly saving the testing running time and improving the matching efficiency. Meanwhile, unmatched differential data can be effectively utilized and timely fed back to a tester, and a link leading to matching failure can be determined by matching the differential data with intermediate data, so that the tester can quickly and accurately lock problems in a testing process, and can make adjustment in a targeted manner, and the debugging efficiency is improved.
Example 2:
on the basis of the ETL software testing method provided in embodiment 1, the present invention further provides an ETL software testing apparatus for implementing the method, as shown in fig. 7, which is a schematic diagram of an apparatus architecture in an embodiment of the present invention. The ETL software testing apparatus of the present embodiment includes one or more processors 21 and a memory 22. In fig. 7, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 7 illustrates the connection by a bus as an example.
The memory 22, which is a non-volatile computer-readable storage medium for the ETL software testing method, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the ETL software testing method in embodiment 1. The processor 21 executes various functional applications and data processing of the ETL software testing apparatus by running the nonvolatile software programs, instructions and modules stored in the memory 22, that is, implements the ETL software testing method of embodiment 1.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22, and when executed by the one or more processors 21, perform the ETL software testing method in the above embodiment 1, for example, perform the steps shown in fig. 1, 3-6 described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.