CN113868283A - Data testing method, device, equipment and computer storage medium

Data testing method, device, equipment and computer storage medium

Info

Publication number
CN113868283A
CN113868283A (application CN202111091898.4A)
Authority
CN
China
Prior art keywords: data, task, field, backup, big data
Prior art date
Legal status
Pending
Application number
CN202111091898.4A
Other languages
Chinese (zh)
Inventor
尹小芳
徐山凌
江旻
杨杨
张晶
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202111091898.4A
Publication of CN113868283A

Classifications

    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G06F16/243 Natural language query formulation
    • G06F16/2457 Query processing with adaptation to user needs

Abstract

The embodiment of the application provides a data testing method, a data testing apparatus, an electronic device, and a computer storage medium. The method includes the following steps: for each big data service requirement, determining a corresponding client-level data model, where the client-level data model includes at least one table in a data warehouse tool corresponding to the big data service requirement and a query keyword of each table in the at least one table, the query keyword being used to represent at least one type of client identity information; performing data backup by client identity according to the client-level data model corresponding to each big data service requirement to obtain a backup result, where the backup result includes backup data of each client identity in the at least one client identity; after a data test request is obtained, processing the backup result to obtain data to be tested; and performing a data test on the data to be tested.

Description

Data testing method, device, equipment and computer storage medium
Technical Field
The present application relates to big data testing technology in financial technology (Fintech), and relates to, but is not limited to, a data testing method, apparatus, electronic device, and computer storage medium.
Background
With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology. However, the financial industry's requirements for security and real-time performance also place higher demands on these technologies.
At present, data capture and data backup can be achieved with a data warehouse tool such as hive in order to perform big data testing. However, in the related art, data capture and backup with hive operate on one partition at a time; data cannot be captured for the records of a specific customer, so it is difficult to meet the actual requirements of big data testing tasks.
Disclosure of Invention
The embodiment of the application provides a data testing method, a data testing apparatus, an electronic device, and a computer storage medium, which can realize data backup for each customer identity, so that test data corresponding to each customer identity can be obtained during data testing, thereby meeting the actual requirements of a big data testing task.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a data testing method, which comprises the following steps:
for each big data service requirement, determining a corresponding client-level data model, where the client-level data model includes at least one table in a data warehouse tool corresponding to the big data service requirement and a query keyword of each table in the at least one table, the query keyword being used to represent at least one type of client identity information;
performing data backup by client identity according to the client-level data model corresponding to each big data service requirement to obtain a backup result, where the backup result includes backup data of each client identity in the at least one client identity;
after a data test request is obtained, processing the backup result to obtain data to be tested; and performing a data test on the data to be tested.
In some embodiments of the present application, determining the corresponding customer-level data model for each big data service requirement includes:
determining at least one big data task corresponding to each big data service requirement; analyzing the at least one big data task, and determining a task dependency tree of the at least one big data task, where the task dependency tree represents the task dependency relationship of each big data task in the at least one big data task;
analyzing each task in the task dependency tree, and determining a result table in the data warehouse tool corresponding to each task in the task dependency tree; taking the result table as the at least one table in the data warehouse tool corresponding to each big data service requirement;
and determining the customer-level data model corresponding to each big data service requirement according to the at least one table in the data warehouse tool corresponding to each big data service requirement and the query keywords of each table in the at least one table.
In some embodiments of the present application, the analyzing each task in the task dependency tree to determine a result table in the data warehouse tool corresponding to each task in the task dependency tree includes:
when the task in the task dependency tree is a data extraction task, reading a task dependency table in the task dependency tree, and taking the read table as a result table in a data warehouse tool corresponding to the task in the task dependency tree;
and when the task in the task dependency tree is a processing task, determining a script storage path corresponding to the task in the task dependency tree, reading a script from the script storage path, and executing the read script to obtain a result table in the data warehouse tool corresponding to the task in the task dependency tree.
In some embodiments of the present application, determining the customer-level data model corresponding to each big data service requirement according to the at least one table in the data warehouse tool corresponding to each big data service requirement and the query keyword of each table in the at least one table includes:
identifying a specific field in each table of the at least one table, the specific field including at least one of a difference field and a date field, where the difference field represents: in the case where the at least one table includes a plurality of tables, fields in the plurality of tables having the same field name and different field values;
obtaining the customer-level data model corresponding to each big data service requirement according to the query keyword of each table in the at least one table, where the customer-level data model further includes: the identified specific field in each table of the at least one table.
In some embodiments of the present application, identifying the difference field in each table of the at least one table includes:
acquiring a designated field, and taking the designated field in each table of the at least one table as a difference field;
counting the number of enumeration values appearing for each other field in the at least one table, and determining the difference field in each table of the at least one table according to the number of enumeration values appearing for each other field in the at least one table, where the other fields represent fields other than the designated field.
In some embodiments of the present application, determining the difference field in each table of the at least one table according to the number of enumeration values appearing for each other field in the at least one table includes:
when the number of enumeration values appearing for a target field in the at least one table is greater than or equal to a first set threshold, determining the target field as a difference field; the target field represents any other field in the at least one table.
In some embodiments of the present application, the first set threshold is the product of the total number of records of each other field in the at least one table and a set proportion, the set proportion being a positive number less than 1.
In some embodiments of the present application, the processing the backup result to obtain data to be tested includes:
and processing the specific field in each table in the backup result to obtain the data to be tested.
In some embodiments of the present application, the processing the specific field in each table in the backup result to obtain the data to be tested includes:
when the specific field comprises a difference field, replacing the difference field in each table in the backup result to obtain the data to be tested;
and when the specific field comprises a date field, performing time sequence translation on the date field in each table in the backup result to obtain the data to be tested.
In some embodiments of the present application, the performing data testing on the data to be tested includes:
and performing mock test on at least part of the data to be tested to obtain a test result corresponding to the at least part of the data.
The embodiment of the application provides a data testing apparatus, the apparatus including:
the determining module is used for determining a corresponding client-level data model aiming at each big data service requirement, wherein the client-level data model comprises at least one table in a data warehouse tool corresponding to each big data service requirement and a query keyword of each table in the at least one table, and the query keyword is used for representing at least one type of client identity information;
the backup module is used for carrying out data backup according to the client identity aiming at the client level data model corresponding to each big data service requirement to obtain a backup result, and the backup result comprises backup data of each client identity in the at least one client identity;
the test module is used for processing the backup result after acquiring the data test request to obtain the data to be tested; and carrying out data test on the data to be tested.
An embodiment of the present application provides an electronic device, which includes:
a memory for storing executable instructions;
and the processor is used for realizing any one of the data testing methods when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium, which stores executable instructions and is configured to, when executed by a processor, implement any one of the above data testing methods.
In the embodiment of the application, a corresponding client-level data model is determined according to each big data service requirement; according to the client-level data model corresponding to each big data service requirement, data backup is carried out according to the identity of a client to obtain a backup result; after the data test request is obtained, processing the backup result to obtain data to be tested; and carrying out data test on the data to be tested.
Therefore, in the embodiment of the application, data backup can be realized for each customer identity, so that test data corresponding to each customer identity can be obtained during data test, and the actual requirements of a big data test task can be met.
Drawings
FIG. 1 is a flow chart of a data testing method according to an embodiment of the present application;
FIG. 2 is a flow chart of another data testing method according to an embodiment of the present application;
FIG. 3 is a diagram of a task dependency tree and a relationship tree of tasks and tables in an embodiment of the present application;
FIG. 4 is a diagram illustrating an example of building a relationship tree of requirement-task-table through static text analysis in an embodiment of the present application;
FIG. 5 is a schematic diagram of an implementation flow of data backup and data cloning in an embodiment of the present application;
FIG. 6 is a diagram illustrating an example of time-shifting a date field according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a mock test performed on task 3 based on FIG. 4 according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an alternative structure of the data testing device according to the embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the related art, the data warehouse tool represents a tool for data extraction, transformation, and loading, and can store, query, and analyze large-scale data; illustratively, the data warehouse tool may be a hive or other data warehouse tool.
The tables in the data warehouse tool belong to a class of data models in the data warehouse tool, and each table in hive illustratively has a corresponding directory to store data.
For example, a hive Select query generally scans the table content of the entire hive table, which consumes much time on unnecessary work; sometimes only the part of the data of interest in a table needs to be scanned. Therefore, table creation in hive introduces the concept of a partition, generally taking a date as the partition; relative to a T-day partition, the partition of the previous day can be called the T-1-day partition, where T is an integer greater than 1.
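Illustratively, the partition relationship can be sketched as follows (a minimal Python sketch, not part of the original text; the table name loan_detail and partition column dt are assumptions):

```python
from datetime import date, timedelta

def previous_partition(t_day: str) -> str:
    """Given the T-day partition date (YYYY-MM-DD), return the T-1-day partition date."""
    return (date.fromisoformat(t_day) - timedelta(days=1)).isoformat()

# Scan a single partition instead of the whole table (hypothetical table/column names):
query = "SELECT * FROM loan_detail WHERE dt = '%s'" % previous_partition("2021-09-16")
print(query)  # SELECT * FROM loan_detail WHERE dt = '2021-09-15'
```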
In the related art, for a big data test task, if data is to be reused during testing or automated testing is to be realized in regression, testing is performed based on the original hive partition data, or a hive command is used for partition translation; data operations for a specific customer cannot be performed. Capturing and backing up data with hive is an operation on one partition, and each partition is usually the slice data of one date, so the actual scenarios represented by one partition are limited. If data of multiple scenarios is to be tested, data must be captured from different partitions, which is costly; moreover, in the related art, data of different scenarios in different partitions cannot be aggregated into one target partition.
In the related art, artificial data can also be inserted directly into each hive table, which increases the cost and difficulty of data preparation. For example, for the big data of loan products, manually fabricating hive data is very difficult: the relationships between the tables are strongly coupled and the number of tables is large, so when inserting data manually, the dependency relationships of the tables must be analyzed clearly and reasonable data must be inserted into each table; otherwise the scripts run without data associations and the script logic cannot be tested.
Big data testing mainly tests the processing logic of big data, and verifying the correctness of the script logic depends on how rich the hive data is. In bank loan products, dimensions such as loan amount, overdue days, and repayment mode combine into many scenarios, and a processing script generally involves many tables. During testing, the scenarios are often incomplete or the data of associated tables has not been captured, so a single big data test case is very time-consuming, and the first batch of data may take one or two days of searching to obtain. However, once the data has been used up and the task needs to be retested, first, the previous test data is scattered across multiple partitions and cannot be collected together for testing; second, the partitions may have been cleaned up. Moreover, the task has many associated tables and the tables are strongly coupled, so the data analysis cost of manually inserting data directly into hive is very high. Therefore, data has to be captured again and tested in the task sequence of transaction -> batch -> big data run, where batch refers to the day-end cutover and the scheduling of that day's batch tasks. How to enable data captured once to be reused many times by testers is a technical problem to be solved urgently.
The technical scheme of the embodiment of the application is provided for solving the technical problems in the related art.
The present application will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the examples provided herein are merely illustrative of the present application and are not intended to limit the present application. In addition, the following examples are provided as partial examples for implementing the present application, not all examples for implementing the present application, and the technical solutions described in the examples of the present application may be implemented in any combination without conflict.
It should be noted that, in the embodiments of the present application, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a method or apparatus including a series of elements includes not only the explicitly recited elements but also other elements not explicitly listed or inherent to the method or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other related elements (e.g., steps in a method or units in an apparatus, where a unit may be, for example, part of a circuit, part of a processor, or part of a program or software) in the method or apparatus that includes the element.
For example, the data testing method provided in the embodiment of the present application includes a series of steps, but the data testing method provided in the embodiment of the present application is not limited to the described steps, and similarly, the data testing apparatus provided in the embodiment of the present application includes a series of modules, but the apparatus provided in the embodiment of the present application is not limited to include the explicitly described modules, and may include modules that are required to obtain relevant information or perform processing based on the information.
Embodiments of the application may be applied to terminals and/or servers, where the terminals may be thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, and the like. The server may be a small computer system, a mainframe computer system, a distributed cloud computing environment including any of the above systems, and so forth.
Electronic devices such as servers may include program modules that execute computer instructions and, in general, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Fig. 1 is a flowchart of a data testing method according to an embodiment of the present application, and as shown in fig. 1, the flowchart may include:
step 101: and determining a corresponding client-level data model aiming at each big data service requirement, wherein the client-level data model comprises at least one table in the data warehouse tool corresponding to each big data service requirement and a query keyword of each table in the at least one table, and the query keyword is used for representing at least one type of client identity information.
Here, each big data service requirement represents one function point, and implementing one function point may require executing multiple big data tasks (BIG DATA TASK). Illustratively, a big data task is a shell script whose logic is to Extract data from a DataBase (DB) into hive or to process hive data, so big data tasks can be divided into Extract-Transform-Load (ETL) tasks and processing tasks.
It can be understood that each big data business requirement can be realized by executing a plurality of big data tasks, and the essence of executing the big data tasks is to read a plurality of tables in the data warehouse tool, perform associative processing in a script through a Structured Query Language (SQL) statement, and store the final result after the output processing in a result table of the data warehouse tool. Therefore, at least one table corresponding to each big data service requirement can be determined according to each big data service requirement.
In practical applications, the definition of the search mode for a customer may differ across service scenarios. Here, the query keyword representing customer identity information may be configured manually according to the service scenario corresponding to the big data service requirement. Illustratively, for the big data service requirement corresponding to a loan service, the query keyword may include at least one of the following: the customer identity card number id_no, the customer number cust_id, the account number acct_no, and the in-bank unified serial number ecifno. In the embodiment of the present application, the query keyword may also be referred to as a search keyword (key).
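Illustratively, such a configuration could look as follows (the four keyword names come from the description above; the mapping layout itself is an assumption):

```python
# Query keywords (search keys) configured for the loan-service big data requirement.
LOAN_QUERY_KEYS = {
    "id_no":   "customer identity card number",
    "cust_id": "customer number",
    "acct_no": "account number",
    "ecifno":  "in-bank unified serial number",
}
```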
Step 102: for the client-level data model corresponding to each big data service requirement, perform data backup by client identity to obtain a backup result, where the backup result includes backup data of each client identity in the at least one client identity.
Here, since the query keyword is used to represent information of at least one client identity, for each client identity a query may be performed in each table of the at least one table according to the query keyword of that table in the client-level data model, so as to obtain the query result corresponding to the at least one table, and the query result corresponding to the at least one table is taken as the backup data corresponding to that client identity.
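Illustratively, the per-identity backup query could be sketched as follows (a hedged sketch; the function, parameter, and table names are hypothetical):

```python
def backup_queries(table_keys: dict, id_values: dict, dt: str) -> list:
    """Build one backup SELECT per table, filtered by the query keywords the table hits.

    table_keys -- {table name: [query keyword fields present in that table]}
    id_values  -- {query keyword field: value} for one customer identity
    dt         -- partition date to back up
    """
    queries = []
    for table, keys in table_keys.items():
        conds = " OR ".join("%s = '%s'" % (k, id_values[k])
                            for k in keys if k in id_values)
        if conds:
            queries.append("SELECT * FROM %s WHERE dt = '%s' AND (%s)"
                           % (table, dt, conds))
    return queries

print(backup_queries({"t_loan_acct": ["cust_id", "acct_no"]},
                     {"cust_id": "C001", "acct_no": "62220001"}, "2021-09-16"))
```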
It can be seen that, in the embodiment of the present application, corresponding data backup may be performed for each client identity according to the information of the client identity in the client-level data model, so that data cloning (DATA CLONE) can subsequently be performed according to the backup data corresponding to each client identity. In the embodiment of the present application, a data clone represents a copy of a data form (DATA FORM); a data form represents data having a complex life cycle, and each stage of the data life cycle corresponds to one data form.
Step 103: after a data test request is obtained, process the backup result to obtain data to be tested, and perform a data test on the data to be tested.
In the embodiment of the application, after the data test request is obtained, the backup result can be processed according to the identity information of the target client to obtain the data to be tested. For example, in the case where the data warehouse tool is hive, the hive cloning technique can be used to clone the form data that has already been backed up onto a completely new, empty target account; here, in the course of hive cloning, the backed-up form data may be modified.
In practical applications, the steps 101 to 103 may be implemented based on a Processor of an electronic Device, where the Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic device implementing the above-described processor function may be other electronic devices, and the embodiments of the present application are not limited thereto.
Therefore, in the embodiment of the application, data backup can be realized for each customer identity, so that test data corresponding to each customer identity can be obtained during data test, and the actual requirements of a big data test task can be met.
In the related art, a data model can be obtained by supervised training, and the process of supervised training is as follows: the data model is trained (TRAINING) with preselected samples, that is, known data forms are taken as samples and the data form model is automatically extracted from the samples. However, for hive data, the whole pipeline through which transaction, batch run, and big data run tasks enter the hive table is long, and standard samples are difficult to obtain; therefore, it is not easy to obtain a data model for hive data by supervised training, and the implementation complexity is high.
In view of the above technical problem, in some embodiments of the present application, an UNSUPERVISED TRAINING (UNSUPERVISED TRAINING) may be used to obtain a client-level data model, and the UNSUPERVISED TRAINING is different from the supervised TRAINING in that there is no TRAINING sample in advance and the data may be directly modeled.
The following is an exemplary description of an implementation of unsupervised training.
For example, determining the corresponding client-level data model for each big data service requirement may include:
determining at least one big data task corresponding to each big data service requirement; analyzing the at least one big data task, and determining a task dependency tree of the at least one big data task, wherein the task dependency tree represents the task dependency relationship of each big data task in the at least one big data task;
analyzing each task in the task dependency tree, and determining a result table in the data warehouse tool corresponding to each task in the task dependency tree; taking the result table as at least one table in a data warehouse tool corresponding to each big data service requirement;
and determining the customer-level data model corresponding to each big data service requirement according to at least one table in the data warehouse tool corresponding to each big data service requirement and the query key words of each table in the at least one table.
In practical applications, if a table in the data warehouse tool is produced by another task, the relationship between tasks and tables can be determined through task dependency relationships in the big data task running mechanism. Illustratively, all dependent tasks of each big data task may be analyzed by a big data lineage analysis tool, that is, the task dependency tree of the at least one big data task is obtained.
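Illustratively, the layer-by-layer expansion could be sketched as follows (a hedged sketch; the lineage analysis tool is assumed to expose the direct pre-dependencies of each task as a mapping):

```python
def build_dependency_tree(top_task: str, upstream_of: dict) -> dict:
    """Expand all pre-dependencies of the configured top task, layer by layer."""
    tree, stack = {}, [top_task]
    while stack:
        task = stack.pop()
        if task in tree:
            continue                      # already expanded
        deps = upstream_of.get(task, [])  # direct dependencies from the lineage tool
        tree[task] = deps
        stack.extend(deps)
    return tree

# e.g. the dependencies of FIG. 3: task 1 -> tasks 2, 3, 4; task 2 -> 5, 6; task 3 -> 7, 8
lineage = {"task1": ["task2", "task3", "task4"],
           "task2": ["task5", "task6"],
           "task3": ["task7", "task8"]}
print(build_dependency_tree("task1", lineage))
```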
It can be understood that, in the embodiment of the present application, the client-level data model can be established according to the corresponding relationship between the big data service requirement and the big data task and the corresponding relationship between the big data task and the table in the data warehouse tool.
In some embodiments of the present application, the analyzing each task in the task dependency tree to determine a result table in the data warehouse tool corresponding to each task in the task dependency tree may include:
when the task in the task dependency tree is a data extraction task, reading a task dependency table in the task dependency tree, and taking the read table as a result table in a data warehouse tool corresponding to the task in the task dependency tree;
when the task in the task dependency tree is a processing task, determining a script storage path corresponding to the task in the task dependency tree, reading the script from the script storage path, and executing the read script to obtain a result table in the data warehouse tool corresponding to the task in the task dependency tree.
In the embodiment of the application, each task in the task dependency tree can be sequentially subjected to text analysis; when the task in the task dependency tree is a data extraction task, it indicates that the task belongs to a leaf node in the task dependency tree, in this case, a preset ETL snapshot task configuration table may be directly read, a table on which the task depends is obtained, and then the task analysis process of the task is ended, where the preset ETL snapshot task configuration table is used to indicate a dependency relationship between the preset task and the table.
When the task in the task dependency tree is a processing task, the script storage path corresponding to the task can be read through the corresponding script configuration information, so that the script is read from the script storage path, and the read script is executed. Here, the script configuration information may indicate a correspondence relationship of a preset task and a script storage path.
In this way, the embodiment of the application can process each task according to its type in the task dependency tree, so as to obtain the result table in the data warehouse tool corresponding to each task.
In some embodiments of the present application, determining the customer-level data model corresponding to each big data service requirement according to the at least one table in the data warehouse tool corresponding to each big data service requirement and the query keyword of each table in the at least one table may include:
identifying a specific field in each table of the at least one table, the specific field including at least one of a difference field and a date field, where the difference field indicates: in the case where the at least one table includes a plurality of tables, fields having the same field name but different field values in the plurality of tables;
obtaining the customer-level data model corresponding to each big data service requirement, where the customer-level data model further includes: the identified specific field in each table of the at least one table.
In the embodiment of the application, after data backup is performed by customer identity, if the backed-up data is to be applied to a new customer identity, the fields in each table that need to be replaced and those that do not must be identified. Here, the fields that need to be replaced are the difference fields, and the fields that do not need to be replaced may be recorded as same fields. In an actual application scenario, some fields in a table take different values even under the same service scenario simply because different customer accounts are used; such fields are difference fields.
In the related art, when a data model is obtained by supervised training, two samples of the same form need to be constructed: the same-named fields of all tables of the two samples are captured and their field values are compared, and if the values differ, the field needs to be replaced when the backup data is applied to a new customer identity, that is, it is a difference field. However, in practical scenarios it is very difficult to construct two samples of the same form. For this problem, in the embodiment of the present application, unsupervised training may be used to obtain the customer-level data model; during the unsupervised training, statistical analysis may be performed on the fields in the tables, so that the difference fields are determined from the statistical analysis result.
Illustratively, a field that can be formatted as a date can be identified, and the identified field that can be formatted as a date can be recorded as a date field, so that date fields of different formats can each be identified.
It can be understood that the difference field and the date field in each table of at least one table can be identified, so that the difference field and the date field in the table can be adaptively modified more specifically when the backup result is processed subsequently.
It is to be understood that, according to the definition of a difference field, the generation source of a difference field in a table is often a random number, a self-incrementing value, or an inherent customer attribute such as the above query keyword; such sources are characterized in that the number of enumeration values increases as the number of customers or records increases. Therefore, in one implementation, identifying the difference field in each table of the at least one table may include:
acquiring a designated field, and taking the designated field in each table of the at least one table as a difference field; counting the number of enumeration values appearing for each other field in the at least one table, and determining the difference field in each table of the at least one table according to that number, where the other fields represent fields other than the designated field.
In practical applications, the designated field can be set manually, and a manually set designated field is a default difference field; thus, a portion of the difference fields can first be identified from the names of the designated fields. After the designated field in each table of the at least one table is taken as a difference field, the other fields can also be analyzed to determine the difference fields among them.
For example, for the big data service of loan products, the fields of the query keywords and the business date (BUSINESS DATE) field may be set as default difference fields. Here, for the big data service of loan products, a field in year-month-day format indicates the business date field; a unified date field is set in the big data service system of the loan product, whose value rolls forward automatically with the current time, and the transaction dates related to the loan product are based on the value of this date field.
For example, for the big data service of loan products, fields with decimal values represent amount-related fields; for samples of the same form, the values of such fields are the same in the two scenarios, so amount-related fields are by default determined to be same fields according to this characteristic.
It can be seen that, in the embodiment of the present application, the designated field may be preferentially used as the difference field, and then other difference fields are determined according to analysis of values of other fields, so that the difference field in the table can be accurately and comprehensively determined.
In some embodiments of the present application, determining the difference field in each table of the at least one table according to the number of enumeration values appearing for each other field in the at least one table may include:
when the number of enumeration values appearing for the target field in the at least one table is greater than or equal to a first set threshold, determining the target field as a difference field; the target field represents any other field in the at least one table.
Illustratively, the first set threshold may be a fixed value, e.g., 200, 300, or 400; the first set threshold may also be the product of the total number of records of each other field in the at least one table and a first set proportion, the first set proportion being a positive number less than 1, for example, 15%, 20%, or 25%.
In the embodiment of the application, after the number of enumeration values appearing for the target field in the at least one table is determined, it can be compared with the first set threshold. If that number is smaller than the first set threshold, the target field is a same field or an unknown field, where an unknown field is a field whose type cannot currently be determined; after a period of data accumulation, the type of an unknown field can be determined, that is, whether it is a same field or a difference field. If that number is greater than or equal to the first set threshold, the target field is a difference field.
For example, the comparison may be performed when the total number of records of each field in the at least one table is greater than or equal to a second set threshold; if the total number of records of each field in the at least one table is smaller than the second set threshold, the target field may be regarded as an unknown field.
Here, the second set threshold may be set empirically, for example, 800, 1000, or 1200.
Illustratively, the type of a target field in the at least one table may be determined according to steps A1 to A4 below.
Step A1: perform deduplication processing on the records in the at least one table, and perform statistical analysis on the number of enumeration values appearing for each field in the deduplicated tables to obtain a statistical analysis result.
Step A2: if the total number of records of each field in the at least one table is greater than 1000 and the number of enumeration values appearing for the target field in the at least one table satisfies a first set condition, determine that the target field is a same field.
Here, the first set condition is: the number of enumeration values appearing for the target field in the at least one table is less than 5% of the total number of records of each field in the at least one table, and is less than 300.
Step A3: if the total number of records of each field in the at least one table is greater than 1000 and the number of enumeration values appearing for the target field in the at least one table satisfies a second set condition, determine that the target field is a difference field.
Here, the second set condition is: the number of enumeration values appearing for the target field in the at least one table is greater than or equal to 20% of the total number of records of each field in the at least one table, or is greater than or equal to 300.
Step A4: if the total number of records of each field in the at least one table is less than or equal to 1000, or the number of enumeration values appearing for the target field satisfies neither the first set condition nor the second set condition, determine that the target field is an unknown field; in this case, data is accumulated until the total number of records of each field in the at least one table is greater than 1000, and the type of the target field is then determined by re-executing steps A2 to A4.
In the embodiment of the application, if the total number of records of each field in the at least one table is less than or equal to 1000, the records are too few and the confidence in determining the type of the target field is low, so the target field can first be determined to be an unknown field, and its type can be accurately determined later.
In other words, if the total number of records of each field in the at least one table is greater than 1000, let A denote the number of enumeration values appearing for the target field in the at least one table, and let B denote the ratio of A to the total number of records of each field in the at least one table; then when B ∈ [0, 5%] and A < 300, the target field is determined to be a same field; when B ∈ (5%, 20%), the target field is determined to be an unknown field; and when B ∈ [20%, 1] or A ≥ 300, the target field is determined to be a difference field.
In the embodiment of the present application, the execution order of steps A2 to A4 is not limited; steps A2 to A4 may even be evaluated simultaneously.
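Illustratively, steps A1 to A4 could be sketched as follows (a minimal Python sketch, not part of the disclosure; the thresholds are the values given above):

```python
def classify_field(total_records: int, enum_count: int) -> str:
    """Classify a target field as 'same', 'difference', or 'unknown' (steps A1-A4).

    total_records -- total number of records of the field in the table(s), after dedup
    enum_count    -- number of distinct enumeration values the field takes (A above)
    """
    if total_records <= 1000:                 # too few records: low confidence (step A4)
        return "unknown"
    ratio = enum_count / total_records        # B above
    if ratio < 0.05 and enum_count < 300:     # first set condition (step A2)
        return "same"
    if ratio >= 0.20 or enum_count >= 300:    # second set condition (step A3)
        return "difference"
    return "unknown"                          # B in (5%, 20%): wait for more data

assert classify_field(5000, 12) == "same"
assert classify_field(5000, 2400) == "difference"
assert classify_field(800, 700) == "unknown"
```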
For example, the first set threshold, the second set threshold, the first set proportion, and the like may be determined according to characteristics of an application product of the big data service.
It can be seen that in the embodiment of the present application, the type of the target field in the table can be determined, and as data is accumulated, the number of unknown fields in each table is reduced.
For example, for an unknown field in a table, when the backup result is subsequently processed, the unknown field may be handled in the same way as a same field, that is, its value is kept unchanged, so that unknown fields do not affect the processing of the backup result.
It can be seen that, according to the comparison between the number of enumeration values appearing for the target fields in the at least one table and the first set threshold, the embodiment of the present application can accurately determine the field types in the at least one table, improving the accuracy and reliability of the subsequent processing of the backup result.
In some embodiments of the present application, the processing the backup result to obtain the implementation manner of the data to be tested may include: and processing the specific field in each table in the backup result to obtain the data to be tested.
Illustratively, when the specific field comprises a difference field, replacing the difference field in each table in the backup result to obtain the data to be tested; and when the specific field comprises a date field, performing time sequence translation on the date field in each table in the backup result to obtain the data to be tested.
Here, time-sequence translation refers to moving a date from a source date to a target date according to a reference standard; for example, in the embodiment of the present application, for the big data service of a loan product, the first bill date may be adopted as the reference, because loan products require a fixed monthly billing date.
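Illustratively, the time-sequence translation could be sketched as follows (a hypothetical helper; the offset of each date from the reference, e.g. the first bill date, is preserved):

```python
from datetime import date

def shift_date(value: date, source_ref: date, target_ref: date) -> date:
    """Move a date from the source timeline to the target timeline,
    keeping its offset from the reference date (e.g. the first bill date) unchanged."""
    return target_ref + (value - source_ref)

# e.g. a repayment date 45 days after the source first bill date stays
# 45 days after the target first bill date:
src_bill, tgt_bill = date(2021, 6, 1), date(2021, 9, 1)
print(shift_date(date(2021, 7, 16), src_bill, tgt_bill))  # 2021-10-16
```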
It can be seen that the difference field and the date field in each table in the backup result are modified correspondingly, which is beneficial to obtaining the data to be tested which is in accordance with the actual requirement.
In some embodiments of the present application, an implementation manner of performing data testing on data to be tested may include: and performing mock test on at least part of the data to be tested to obtain a test result corresponding to at least part of the data.
In practical applications, if a certain task in the task dependency tree cannot actually be tested, a mock test can be configured for that task; that is, the result table in the backup result corresponding to the task is used directly, the big data test related to the task does not need to be realized by running the task, and the big data tests related to the other tasks can still be executed.
It can be seen that performing a mock test on at least part of the data to be tested allows the data to be tested comprehensively and improves the accuracy of the test result.
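Illustratively, a mock run could be sketched as follows (a hypothetical runner interface; a mocked task's result table is taken directly from the backup result instead of being produced by executing the task):

```python
def run_with_mocks(tasks_in_order, mocked, backup_results, execute):
    """Execute the task dependency tree in topological order, skipping mocked tasks.

    mocked         -- set of task names to mock (e.g. {"task3"})
    backup_results -- {task name: its result table taken from the backup result}
    execute        -- callable that actually runs a task and returns its result table
    """
    results = {}
    for t in tasks_in_order:
        if t in mocked:
            results[t] = backup_results[t]    # reuse the backed-up result table directly
        else:
            results[t] = execute(t, results)  # run the task against upstream results
    return results
```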
The data testing method of the embodiment of the present application is further exemplified below with reference to the accompanying drawings.
Referring to fig. 2, a data testing method according to an embodiment of the present application may include:
step 201: and modeling data.
Here, the implementation manner of step 201 may be: and determining a corresponding client-level data model aiming at each big data service requirement. Illustratively, a relation tree of requirement-task-table can be established through static text analysis, query keywords of each hive table are obtained, and finally difference fields and date fields of each hive table are extracted through a statistical analysis method.
Step 202: and (6) backing up data.
Here, data backup can be performed according to the client identity for the client-level data model corresponding to each big data service requirement, so as to obtain a backup result; illustratively, the backup results may be stored in a data repository.
Step 203: and (5) cloning data.
Here, the implementation manner of step 203 may be: after the data test request is obtained, processing the backup result to obtain data to be tested; and carrying out data test on the data to be tested.
It should be noted that, in the data cloning process in the embodiment of the present application, the backup result is not directly used and tested, but the data is tested after the difference field and the date field in the backup result are adaptively modified.
According to the embodiment of the application, a backup result in the data warehouse can be applied to a target test environment, data captured in different environments at different times can be applied to the same partition of the same environment, and running some tasks in mock form is supported.
Illustratively, the implementation of step 201 may include:
step B1: and establishing a relation tree of requirement-task-table through static text analysis.
First, the top task corresponding to a big data service requirement can be configured manually; in FIG. 3, task 1 is the top task. After the top task is determined, the task dependency tree of the top task can be obtained through a big data lineage analysis tool (that is, all dependencies are obtained by analyzing the pre-dependencies of the tasks layer by layer).
In fig. 3, task 1 depends on task 2, task 3, and task 4, task 2 depends on task 5 and task 6, and task 3 depends on task 7 and task 8.
After the task dependency tree is obtained, text analysis can be sequentially performed on each task in the task dependency tree to obtain the type of the task in the task dependency tree, so that the task is correspondingly processed according to the type of the task.
Illustratively, referring to FIG. 3, the tables a task depends on and the result table of a task can be parsed out using the common syntax formats with which big data scripts operate on tables, resulting in a relation tree of tasks and tables. For example, the tables a task depends on are parsed out from the select syntax, and the result table produced by a task is parsed out from the insert syntax; a result table that is also parsed out from the drop syntax is an intermediate table.
In FIG. 3, task 1 depends on tables A, B, and C; task 2 depends on tables E and F, and the result table of task 2 is table A; task 3 depends on tables F and H, and the result table of task 3 is table B; the result table of task 4 is table C, the result table of task 5 is table E, the result table of task 6 is table F, and the result table of task 8 is table H.
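Illustratively, the static text analysis could be sketched as follows (a rough regex-based sketch; real hive scripts would require a proper SQL parser, and the patterns below are assumptions):

```python
import re

SELECT_RE = re.compile(r"\bfrom\s+(\w+)", re.IGNORECASE)    # tables a task depends on
INSERT_RE = re.compile(r"\binsert\s+(?:overwrite\s+|into\s+)?table\s+(\w+)",
                       re.IGNORECASE)                       # result tables of a task
DROP_RE = re.compile(r"\bdrop\s+table\s+(?:if\s+exists\s+)?(\w+)", re.IGNORECASE)

def analyze_script(script_text: str) -> dict:
    """Static text analysis of one big data script (step B1 sketch).
    Joined tables would need extra patterns; this only scans FROM clauses."""
    depends = set(SELECT_RE.findall(script_text))
    results = set(INSERT_RE.findall(script_text))
    dropped = set(DROP_RE.findall(script_text))  # dropped result tables are intermediate
    return {"depends_on": sorted(depends - results),
            "result_tables": sorted(results - dropped),
            "intermediate": sorted(results & dropped)}

demo = "INSERT OVERWRITE TABLE tbl_a SELECT e.* FROM tbl_e e;"
print(analyze_script(demo))
# {'depends_on': ['tbl_e'], 'result_tables': ['tbl_a'], 'intermediate': []}
```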
By executing step B1, a relationship tree of tasks and tables can be established for each big data service requirement; FIG. 4 is an exemplary diagram of building a requirement-task-table relationship tree through static text analysis, where task 1, task 12, and task 13 correspond to three different top tasks of requirement 1.
Step B2: querying in each hive table according to the query keywords of that hive table to obtain the query results corresponding to the at least one table, and taking the query results corresponding to the at least one table as the backup data corresponding to each client identity.
The embodiment of the application can realize hive data capture for a specified customer: for example, if the identity information of a certain customer is input, the tables related to the requirement can be captured for that identity information. After the requirement-task-table relation tree is established, the query keywords can be found for each hive table, so that each hive table is queried according to its query keywords.
In practical application, an SQL statement for obtaining the query keyword may be established for the big data service of each product, and exemplarily, only one configuration is required for each product.
After the query keywords are determined, the hive table can be retrieved according to them: the field name of each query keyword is matched one by one against the fields of the hive table. If a match succeeds, the corresponding query keyword is hit in the hive table and can be classified as a query keyword of that table; if multiple query keywords are hit, all of them are attributed to the query keywords of the table. In the embodiment of the application, a query keyword can be retrieved even if the value of the corresponding field in the hive table is null.
After the retrieval of query keywords of the hive tables finishes, if a table matches no query keyword, manual intervention can be performed: if two fields have the same meaning but different field names, the mapping field of the query keyword can be reconfigured, and if the analysis reveals a new query keyword, it can be added manually. After the manual intervention, the above retrieval process can be executed again until a query keyword has been found for every table except parameter-related tables.
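Illustratively, the field-matching retrieval could be sketched as follows (hypothetical structures; matching is by field name, which is why it works even when field values are null):

```python
def match_query_keys(table_fields: dict, query_keys: list) -> dict:
    """Match each configured query keyword against each hive table by field name.

    table_fields -- {table name: [field names of that table]}
    query_keys   -- configured query keyword field names, e.g. ["id_no", "cust_id"]
    Returns {table name: [query keywords hit in that table]}.
    """
    hits = {}
    for table, fields in table_fields.items():
        hits[table] = [k for k in query_keys if k in fields]
        # an empty list marks a table needing manual intervention
    return hits

tables = {"t_loan_acct": ["acct_no", "cust_id", "amt"], "t_param": ["param_key"]}
print(match_query_keys(tables, ["id_no", "cust_id", "acct_no", "ecifno"]))
# {'t_loan_acct': ['cust_id', 'acct_no'], 't_param': []}
```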
By executing step 201, unsupervised training can be performed for each big data service requirement to obtain a customer-level data model corresponding to the big data service requirement, where the key points of the customer-level data model include a relation tree of requirement-task-table and a query keyword of each table. Step 202 may be performed through a relationship tree of requirements-tasks-tables and a query key for each table.
Illustratively, the implementation of step 202 may include:
When a tester tests a certain big data requirement and the test verification meets expectations, the hive data backup tool is used for backup. The input parameters of the hive data backup tool may include: requirement name, environment, customer identity information id_no, and partition date. From the requirement name and the requirement-task-table relation tree, the list of tables the requirement needs to back up can be determined, and all query keywords are searched in each table according to the customer identity information id_no; the data of each table for the corresponding customer is then captured according to the query keywords of each table in the data model, so that the data of each table of the corresponding customer is backed up to the hive data warehouse.
Referring to FIG. 5, the source customer identity information includes at least one piece of customer identity information. For each piece of customer identity information in the source customer identity information, corresponding data backup may be performed according to the information of the customer identity in the customer-level data model, and corresponding form data is generated; the identity number (ID) of the form data is the unique identifier of the backup data corresponding to that customer identity.
After the corresponding form data generated by each type of customer identity information is obtained, the form data generated by each type of customer identity information can be permanently stored in a data warehouse, and data backup of the form data generated by each type of customer identity information is realized.
In the embodiment of the present application, insert statements are used for data multiplexing during data cloning. In practical applications, referring to FIG. 5, the target identity information includes at least one piece of customer identity information; for each piece of customer identity information in the target identity information, the difference fields in the backup result are replaced, and, because the cloned target partition differs from the source partition, the date fields in the backup result are translated in time sequence. Only after the difference fields and date fields in the backup result have been adaptively modified are the insert statements actually executed against the hive table, so that new form data is generated in the target environment, where the new form data represents the form data obtained by adaptively modifying the difference fields and date fields of the form data in the data warehouse.
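Illustratively, the adaptive modification of one backed-up record before the insert statement is executed could be sketched as follows (a hypothetical field model; the field types come from the customer-level data model described above):

```python
def clone_row(row: dict, field_types: dict, target_ids: dict, shift) -> dict:
    """Adapt one backed-up record before it is inserted into the target partition.

    field_types -- {field name: 'difference' | 'date' | 'same' | 'unknown'}
    target_ids  -- replacement values for difference fields (target identity)
    shift       -- callable performing the time-sequence translation of a date value
    """
    out = {}
    for name, value in row.items():
        kind = field_types.get(name, "unknown")
        if kind == "difference":
            out[name] = target_ids.get(name, value)  # replace with target identity value
        elif kind == "date":
            out[name] = shift(value)                 # translate to the target timeline
        else:
            out[name] = value                        # same/unknown fields stay unchanged
    return out
```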
Illustratively, referring to FIG. 6, a time-sequence shift may be applied to the date fields in the source form data, resulting in new form data that can be applied in the target environment. Here, the current business date of the target environment is different from the business date in the source form data.
Illustratively, for the data cloning process, the embodiment of the application also supports running some of the tasks in mock form; that is, fig. 7 is a schematic diagram of performing a mock test on task 3 on the basis of fig. 4.
In summary, on the basis of data cloning technology and big data lineage analysis technology, the embodiment of the present application may adopt an unsupervised training method to establish a corresponding client-level data model for each big data service requirement, and use that model to perform data backup and data cloning. The key points of the embodiment of the application include: establishing the requirement-task-table relation tree through text analysis of the shell scripts and of the task dependency tree; then training the difference fields of each table through statistical analysis of the test environment data and finding the query keyword of each table, thereby establishing a client-level data model of a loan product by an unsupervised training method. At the application layer, client-level data backup and data cloning can be performed on hive data through the embodiment of the application, and a mock test can be started during data cloning. The embodiment of the application can improve the efficiency of fabricating big data test data and of preparing data for the automation of big data logic functions.
On the basis of the data testing method provided by the foregoing embodiment, the embodiment of the present application further provides a data testing apparatus; fig. 8 is a schematic diagram of an alternative structure of a data testing apparatus according to an embodiment of the present application, and as shown in fig. 8, the data testing apparatus 800 may include:
a determining module 801, configured to determine, for each big data service requirement, a corresponding client-level data model, where the client-level data model includes at least one table in the data warehouse tool corresponding to each big data service requirement, and a query keyword of each table in the at least one table, where the query keyword is used to represent information of at least one client identity;
a backup module 802, configured to perform data backup according to a client identity for the client-level data model corresponding to each big data service requirement, so as to obtain a backup result, where the backup result includes backup data of each client identity in the at least one client identity;
the test module 803 is configured to, after acquiring the data test request, process the backup result to obtain data to be tested; and carrying out data test on the data to be tested.
In some embodiments of the present application, the determining module 801 is specifically configured to:
determining at least one big data task corresponding to each big data service requirement; analyzing the at least one big data task, and determining a task dependency tree of the at least one big data task, wherein the task dependency tree represents the task dependency relationship of each big data task in the at least one big data task;
analyzing each task in the task dependency tree, and determining a result table in the data warehouse tool corresponding to each task in the task dependency tree; taking the result table as at least one table in a data warehouse tool corresponding to each big data service requirement;
and determining the customer-level data model corresponding to each big data service demand according to at least one table in the data warehouse tool corresponding to each big data service demand and the query keywords of each table in the at least one table.
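For illustration only, the client-level data model assembled by the determining module could be represented by a structure like the one below; the field names are assumptions made for this sketch, not terms fixed by the patent:

```python
from dataclasses import dataclass, field

# Illustrative container for the client-level data model of one requirement.
@dataclass
class ClientLevelDataModel:
    requirement: str
    task_dependency_tree: dict                            # task -> upstream tasks
    tables: list                                          # result tables in the warehouse
    query_keywords: dict = field(default_factory=dict)    # table -> keyword column
    specific_fields: dict = field(default_factory=dict)   # table -> difference/date fields
```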
In some embodiments of the present application, the determining module 801 is specifically configured to:
when the task in the task dependency tree is a data extraction task, reading a task dependency table in the task dependency tree, and taking the read table as a result table in a data warehouse tool corresponding to the task in the task dependency tree;
and when the task in the task dependency tree is a processing task, determining a script storage path corresponding to the task in the task dependency tree, reading a script from the script storage path, and executing the read script to obtain a result table in the data warehouse tool corresponding to the task in the task dependency tree.
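A rough sketch of this two-branch resolution is given below. The embodiment obtains the result table by executing the read script; as a simplified stand-in, this sketch extracts the insert targets by text analysis of the script, in the spirit of the shell-script analysis mentioned earlier, and the task object's attributes are assumptions:

```python
import re

# Sketch: resolve the result table(s) of a task in the dependency tree.
def result_tables_for(task):
    if task.kind == "extract":
        return list(task.dependency_tables)   # tables the extraction task depends on
    if task.kind == "process":
        with open(task.script_path) as f:     # read script from its storage path
            script = f.read()
        # collect target tables of INSERT statements in the hive script
        return re.findall(r"insert\s+(?:into|overwrite)\s+table\s+(\w+\.\w+)",
                          script, flags=re.IGNORECASE)
    return []
```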
In some embodiments of the present application, the determining module 801 is specifically configured to:
identifying a specific field in each table of the at least one table, the specific field including at least one of a difference field and a date field, where the difference field represents: in a case where the at least one table includes a plurality of tables, fields in the plurality of tables having the same field name but different field values;
obtaining the customer-level data model corresponding to each big data service requirement according to the query keyword of each table in the at least one table, where the customer-level data model further includes: the identified specific field in each table of the at least one table.
In some embodiments of the present application, the determining module 801 is specifically configured to:
acquiring a designated field, and taking the designated field in each table of the at least one table as the difference field;
counting the number of enumeration values appearing in each other field in the at least one table, and determining the difference field in each table of the at least one table according to that count; the other fields are the fields other than the designated field.
In some embodiments of the present application, the determining module 801 is specifically configured to:
when the number of enumerated values appearing in the target field in the at least one table is greater than or equal to a first set threshold value, determining the target field as a difference field; the target field represents any other field in the at least one table.
In some embodiments of the present application, the first set threshold is a product of a total number of records of each of the other fields in the at least one table and a set ratio, the set ratio being a positive number less than 1.
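Read together, the last two paragraphs amount to a simple statistical rule, sketched below; the ratio of 0.9 is an illustrative assumption, since the embodiment only requires a set ratio that is a positive number less than 1:

```python
# Sketch of difference-field detection: a field counts as a difference field
# when its number of distinct enumeration values reaches total_records * ratio.
def is_difference_field(values, ratio=0.9):
    total = len(values)                  # total number of records for this field
    distinct = len(set(values))          # number of enumeration values appearing
    return distinct >= total * ratio     # compare against the first set threshold
```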
In some embodiments of the present application, the test module 803 is specifically configured to:
and processing the specific field in each table in the backup result to obtain the data to be tested.
In some embodiments of the present application, the test module 803 is specifically configured to:
when the specific field comprises a difference field, replacing the difference field in each table in the backup result to obtain the data to be tested;
and when the specific field comprises a date field, performing time sequence translation on the date field in each table in the backup result to obtain the data to be tested.
In some embodiments of the present application, the test module 803 is specifically configured to:
and performing mock test on at least part of the data to be tested to obtain a test result corresponding to the at least part of the data.
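A hedged sketch of such a mock run (cf. fig. 7, where task 3 is mocked) is shown below; the task interface and helper functions are assumptions made for illustration:

```python
# Sketch: run the task tree, but substitute prepared rows for mocked tasks
# instead of executing the real big data task.
def run_with_mock(tasks_in_order, mocked, prepared_rows, run_task, write_table):
    for task in tasks_in_order:
        if task.name in mocked:
            write_table(task.result_table, prepared_rows[task.name])  # mock output
        else:
            run_task(task)                                            # real run
```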
In practical applications, the determining module 801, the backup module 802 and the testing module 803 may be implemented by a processor of an electronic device, and the processor may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller and a microprocessor. It is understood that the electronic device implementing the above-described processor function may be other electronic devices, and the embodiments of the present application are not limited thereto.
It should be noted that the above description of the embodiment of the apparatus, similar to the above description of the embodiment of the method, has similar beneficial effects as the embodiment of the method. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the method described above is implemented in the form of a software functional module and sold or used as a standalone product, it may also be stored in a computer-readable storage medium. Based on such understanding, the portions of the technical solutions of the embodiments of the present application that are essential, or that contribute to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk or an optical disc. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application further provides a computer program product, where the computer program product includes computer-executable instructions, and the computer-executable instructions are used to implement any one of the data testing methods provided in the embodiment of the present application.
Accordingly, an embodiment of the present application further provides a computer storage medium, where computer-executable instructions are stored on the computer storage medium, and the computer-executable instructions are used to implement any one of the data testing methods provided in the foregoing embodiments.
An embodiment of the present application further provides an electronic device, fig. 9 is an optional schematic structural diagram of the electronic device provided in the embodiment of the present application, and as shown in fig. 9, the electronic device 90 includes:
a memory 901 for storing executable instructions;
the processor 902 is configured to implement any one of the above data testing methods when executing the executable instructions stored in the memory 901.
The processor 902 may be at least one of ASIC, DSP, DSPD, PLD, FPGA, CPU, controller, microcontroller, and microprocessor.
The computer-readable storage medium/memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, a Compact Disc Read-Only Memory (CD-ROM), or the like; it may also be any of various terminals that include one of or any combination of the above memories, such as a mobile phone, a computer, a tablet device or a personal digital assistant.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be appreciated that reference throughout this specification to "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment of the present application. Thus, the appearances of the phrase "in some embodiments" appearing in various places throughout the specification are not necessarily all referring to the same embodiments. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Alternatively, if the integrated units described above in the present application are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a removable storage device, a ROM, a magnetic disk, an optical disk, or other various media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method for data testing, the method comprising:
for each big data service demand, determining a corresponding client-level data model, wherein the client-level data model comprises at least one table in a data warehouse tool corresponding to each big data service demand and a query keyword of each table in the at least one table, and the query keyword is used for representing at least one type of client identity information;
according to the client-level data model corresponding to each big data service requirement, data backup is carried out according to the client identity to obtain a backup result, and the backup result comprises backup data of each client identity in the at least one client identity;
after the data test request is obtained, processing the backup result to obtain data to be tested; and carrying out data test on the data to be tested.
2. The method of claim 1, wherein determining a corresponding customer-level data model for each big data service demand comprises:
determining at least one big data task corresponding to each big data service requirement; analyzing the at least one big data task, and determining a task dependency tree of the at least one big data task, wherein the task dependency tree represents the task dependency relationship of each big data task in the at least one big data task;
analyzing each task in the task dependency tree, and determining a result table in the data warehouse tool corresponding to each task in the task dependency tree; taking the result table as at least one table in a data warehouse tool corresponding to each big data service requirement;
and determining the customer-level data model corresponding to each big data service demand according to at least one table in the data warehouse tool corresponding to each big data service demand and the query keywords of each table in the at least one table.
3. The method of claim 2, wherein analyzing each task in the task dependency tree to determine a result table in a data warehouse tool corresponding to each task in the task dependency tree comprises:
when the task in the task dependency tree is a data extraction task, reading a task dependency table in the task dependency tree, and taking the read table as a result table in a data warehouse tool corresponding to the task in the task dependency tree;
and when the task in the task dependency tree is a processing task, determining a script storage path corresponding to the task in the task dependency tree, reading a script from the script storage path, and executing the read script to obtain a result table in the data warehouse tool corresponding to the task in the task dependency tree.
4. The method according to claim 2, wherein the determining the customer-level data model corresponding to each big data service requirement according to at least one table in the data warehouse tool corresponding to each big data service requirement and the query keyword of each table in the at least one table comprises:
identifying a particular field in each of the at least one table, the particular field including at least one of a difference field and a date field, the difference field representing: in a case where the at least one table includes a plurality of tables, fields in the plurality of tables having the same field name and different field values;
obtaining the customer-level data model corresponding to each big data service requirement according to the query keyword of each table in the at least one table, wherein the customer-level data model further comprises: the identified particular field in each table of the at least one table.
5. The method of claim 4, wherein identifying a difference field in each of the at least one table comprises:
acquiring a designated field, and taking the designated field in each table of the at least one table as the difference field;
counting the number of enumeration values appearing in each other field in the at least one table, and determining a difference field in each table of the at least one table according to that count; the other fields represent fields other than the designated field.
6. The method of claim 5, wherein determining the difference field in each of the at least one table based on the enumerated number of occurrences of each of the other fields in the at least one table comprises:
when the number of enumerated values appearing in the target field in the at least one table is greater than or equal to a first set threshold value, determining the target field as a difference field; the target field represents any other field in the at least one table.
7. The method of claim 6, wherein the first set threshold is a product of a total number of records in each of the other fields of the at least one table and a set ratio, the set ratio being a positive number less than 1.
8. The method of claim 4, wherein the processing the backup result to obtain the data to be tested comprises:
and processing the specific field in each table in the backup result to obtain the data to be tested.
9. The method of claim 8, wherein the processing the specific field in each table in the backup result to obtain the data to be tested comprises:
when the specific field comprises a difference field, replacing the difference field in each table in the backup result to obtain the data to be tested;
and when the specific field comprises a date field, performing time sequence translation on the date field in each table in the backup result to obtain the data to be tested.
10. The method of claim 1, wherein the data testing the data to be tested comprises:
and performing mock test on at least part of the data to be tested to obtain a test result corresponding to the at least part of the data.
11. A data testing apparatus, characterized in that the apparatus comprises:
the determining module is used for determining a corresponding client-level data model aiming at each big data service requirement, wherein the client-level data model comprises at least one table in a data warehouse tool corresponding to each big data service requirement and a query keyword of each table in the at least one table, and the query keyword is used for representing at least one type of client identity information;
the backup module is used for carrying out data backup according to the client identity aiming at the client level data model corresponding to each big data service requirement to obtain a backup result, and the backup result comprises backup data of each client identity in the at least one client identity;
the test module is used for processing the backup result after acquiring the data test request to obtain the data to be tested; and carrying out data test on the data to be tested.
12. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the data testing method of any one of claims 1 to 10 when executing executable instructions stored in the memory.
13. A computer-readable storage medium storing executable instructions for implementing the data testing method of any one of claims 1 to 10 when executed by a processor.
CN202111091898.4A 2021-09-17 2021-09-17 Data testing method, device, equipment and computer storage medium Pending CN113868283A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111091898.4A CN113868283A (en) 2021-09-17 2021-09-17 Data testing method, device, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111091898.4A CN113868283A (en) 2021-09-17 2021-09-17 Data testing method, device, equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN113868283A true CN113868283A (en) 2021-12-31

Family

ID=78996500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111091898.4A Pending CN113868283A (en) 2021-09-17 2021-09-17 Data testing method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113868283A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115827452A (en) * 2022-11-29 2023-03-21 广发银行股份有限公司 Data processing type test system, method, storage medium and equipment
CN115827452B (en) * 2022-11-29 2023-11-10 广发银行股份有限公司 Data processing type test system, method, storage medium and equipment

Similar Documents

Publication Publication Date Title
US11328003B2 (en) Data relationships storage platform
D’Ambros et al. Evaluating defect prediction approaches: a benchmark and an extensive comparison
CN106980573B (en) Method, device and system for constructing test case request object
CN110569214B (en) Index construction method and device for log file and electronic equipment
CN110300963A (en) Data management system in large-scale data repository
AU2013329525B2 (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
CN110168515A (en) System for analyzing data relationship to support query execution
CN111259004B (en) Method for indexing data in storage engine and related device
CN111078513B (en) Log processing method, device, equipment, storage medium and log alarm system
CN104424360A (en) Method and system for accessing a set of data tables in a source database
US20180225331A1 (en) Query modification in a database management system
CN111881011A (en) Log management method, platform, server and storage medium
CN109271545A (en) A kind of characteristic key method and device, storage medium and computer equipment
CN113837584B (en) Service processing system and abnormal data processing method based on service processing system
CN114490554A (en) Data synchronization method and device, electronic equipment and storage medium
CN113868283A (en) Data testing method, device, equipment and computer storage medium
CN110309206B (en) Order information acquisition method and system
CN110705297A (en) Enterprise name-identifying method, system, medium and equipment
CN115829412A (en) Index data quantization processing method, system and medium based on business process
CN115756486A (en) Data interface analysis method and device
CN112100186B (en) Data processing method and device based on distributed system and computer equipment
CN113868141A (en) Data testing method and device, electronic equipment and storage medium
CN112131215A (en) Bottom-up database information acquisition method and device
CN112200212A (en) Artificial intelligence-based enterprise material classification catalogue construction method
CN113553320B (en) Data quality monitoring method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination