CN108255712B - Test system and test method of data system

Info

Publication number: CN108255712B
Application number: CN201711486827.8A
Authority: CN
Inventors: 郭庆; 付戈; 刘倩; 狄静舒; 张建磊
Original assignee: National Computer Network and Information Security Management Center; Dawning Information Industry Beijing Co Ltd
Current assignee: National Computer Network and Information Security Management Center; Dawning Information Industry Beijing Co Ltd
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2021-05-14
Anticipated expiration: 2037-12-29
Also published as: CN108255712A

Abstract

The invention provides a test system and a test method for a data system. The test system comprises: a user layer, which provides the user with ways to access the test system; a middle application layer, which receives requests submitted by the user from the user layer and coordinates a plurality of tasks on the bottom-layer big data platform to analyze and process those requests; and a bottom-layer big data platform, which is the data storage layer of the data system and comprises a local file system, a distributed file system and a database. The middle application layer further comprises a data access interface for reading and accessing the data and files of various big data platforms. The invention provides a comprehensive and complete test system that is oriented to the loading, storage and retrieval of mass data and tailored to the characteristics of big data software.

Description

Test system and test method of data system

Technical Field

The present invention relates generally to the field of computer technology, and more particularly, to a test system and test method for data systems.

Background

With the commercialization of open-source technologies such as MapReduce, HDFS, HBase and Spark, big data management technology has developed rapidly. As big data analysis platforms continue to be built, how to compare different big data systems objectively, that is, how to evaluate big data systems, has become an important technical direction.

The Transaction Processing Performance Council (TPC) is the best-known standards organization for data management system benchmarks. Over the past twenty years it has published a number of database benchmarks that are widely used in industry, such as TPC-A, TPC-D, TPC-H and TPC-DS. The recently published TPC-DS 2.x is the industry's first standard benchmark for SQL-based big data systems. It covers systems based on Hadoop and Apache Spark in addition to relational database management systems (RDBMS), builds a decision-support data warehouse using a hybrid of the star and snowflake models, and uses more diverse query templates to accurately simulate the operations of a decision-support system, almost all of which impose high I/O load and CPU demands.

The existing benchmark series have changed little in the decades since they were developed, which makes them easy to understand and use and allows both longitudinal comparison (performance of hardware and software systems from different periods) and transverse comparison (systems from different vendors in the same period). However, the TPC benchmark series is not sufficient for evaluating the various types of big data systems that have appeared in recent years. Because big data management and analysis systems (including various NoSQL databases and systems following the S3 standard) were developed only recently, they differ from traditional databases in their target applications, data models, consistency constraints and so on. Big data and its applications continue to evolve over time, and a new big data benchmark is needed to evaluate different systems objectively and in depth.

Disclosure of Invention

The present invention provides a test system and a test method for a data system, which can solve the above problems.

According to an aspect of the present invention, there is provided a test system of a data system, including: a user layer for providing the user with ways to access the test system; a middle application layer for receiving requests submitted by the user from the user layer and coordinating a plurality of tasks on the bottom-layer big data platform to analyze and process the requests; and a bottom-layer big data platform, which is the data storage layer of the data system and comprises a local file system, a distributed file system and a database, wherein the middle application layer further comprises a data access interface for reading and accessing the data and files of various big data platforms.

Preferably, the middle application layer further comprises: a core service module for providing data services, processing received user requests and distributing the decomposed requests to the subtask modules; and a service common module for providing auxiliary log management, exception handling and service exception monitoring for the services.

Preferably, the core service module includes: a data generation service submodule for providing test data and test tasks; a task submission service submodule for submitting the user's test requests to the corresponding big data platform; a result comparison service submodule for judging the correctness of each test task and calculating its technical performance indices; and a resource monitoring service submodule for collecting load information of the system under test and providing data support for analyzing its performance indices.

Preferably, the test system further comprises a test benchmark module, which is designed based on data types, sample data, an SQL standard and standard interfaces, and which extends a functional benchmark, a performance benchmark, a stability benchmark and a reliability benchmark on that basis.

Preferably, the data generation service submodule, in providing test data, further generates the sample data and expected result data on which the test tasks submitted by the test system depend, wherein, based on the data samples in the test benchmark module, the distribution characteristics of real data are preserved and the data size, data format and structure are controlled according to a doubling principle and parameters, so that the data can be used in test environments of different scales.

Preferably, the doubling principle comprises: when the sample data is doubled, keeping a column of data entirely distinct so that every record is different.

Preferably, the doubling principle comprises: doubling the sample data by a cross operation over columns of the same type in the sample data, so that the data characteristics remain unchanged.

Preferably, the data generation service submodule includes: test data generation, query task generation, query result set generation and dependency table creation, wherein test data generation either selects the sample data shipped with the test system or regenerates new sample data according to the doubling principle.

Preferably, the sample data includes normal data and abnormal data, wherein the abnormal data is used for testing abnormal conditions such as system fault tolerance and data skew.

Preferably, the data generation service submodule includes: test data generation, query task generation, query result set generation and dependency table creation.

Preferably, the task submission service submodule, in submitting the user's test requests to the corresponding big data platform, submits the requests after the tables and data on which the test tasks depend have been generated and the preparation work has been completed.

Preferably, the task submission service sub-module includes loading task submission, query task submission and other task submissions.

Preferably, the result comparison service submodule, in judging the correctness and technical performance indices of each test task, further provides correctness comparison and calculation of the related indices for loading-class and query-class test tasks.

Preferably, the result comparison service submodule comprises query result judgment, loading result judgment and other test task result judgment.

Preferably, the resource monitoring service submodule, in collecting load information of the system under test, further monitors the load and resource usage of the big data cluster nodes and of the key services of each component on the nodes.

According to another aspect of the present invention, there is provided a method for testing a data system, including: generating a test task and data on which the test task depends; submitting the test task to a big data platform; and outputting the test result of the test task after the test task is completed.

Preferably, the test method further comprises: monitoring, during execution of the test task, the load and resource usage of the big data cluster nodes and of the key services of each component on the nodes.

Preferably, when the test task is a loading test task, the method further comprises: generating the loading test task and the original test data required by the loading test; submitting the loading test task to the big data platform; and, after the loading test task is completed, receiving the information returned by the big data platform and calculating the loading performance indices from the returned information.

Preferably, when the test task is a query test task, the method further includes: generating the query test task, the original data corresponding to the query result set, and the expected result set; submitting the query test task to the big data platform; and, after the query test task is completed, receiving the information returned by the big data platform and calculating the query-related performance indices from the returned information.

Preferably, the test method further comprises: continuously collecting node and service information of the big data platform during execution of the query test task and outputting load information, wherein the load information is combined with the performance indices to output a test result evaluation report.

The test system and test method provided by the invention are oriented to the loading, storage and retrieval of mass data, provide a comprehensive and complete test method tailored to the characteristics of big data software, and make up for the shortcomings of traditional test methods designed for structured databases. The automated test system automates the complex test process for each test point, covers many types of system application scenarios and achieves good results in application.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a simplified block diagram of a test system of a data system according to an embodiment of the present invention;

FIG. 2 is a block diagram of a test system according to an embodiment of the present invention;

FIG. 3 is a functional block diagram of a test system according to an embodiment of the present invention;

FIG. 4 is a block diagram of an interworking function module according to an embodiment of the present invention;

FIG. 5 is a flow diagram of loading a test task according to an embodiment of the invention;

FIG. 6 is a flow diagram of a query test task according to an embodiment of the invention;

FIG. 7 is a block diagram of a test benchmark according to an embodiment of the present invention;

FIG. 8 is a flow diagram of a data generation service submodule according to an embodiment of the invention;

FIG. 9 is a flow diagram of a task submission service submodule, according to an embodiment of the invention;

FIG. 10 is a flow diagram of a result comparison service submodule according to an embodiment of the invention; and

FIG. 11 is a flow diagram of a resource monitoring service sub-module according to an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 is a simplified block diagram of a test system of a data system according to an embodiment of the present invention. Hereinafter, a test system of the data system will be described with reference to fig. 1.

Referring to FIG. 1, a test system 100 of a data system according to an embodiment of the present invention includes: a user layer 102 for providing the user with ways to access the test system; a middle application layer 104 for receiving requests submitted by the user from the user layer 102 and coordinating a plurality of tasks on the bottom-layer big data platform 106 to analyze and process the requests; and a bottom-layer big data platform 106, which is the data storage layer of the data system and includes a local file system, a distributed file system and a database, wherein the middle application layer 104 further includes a data access interface 108 for reading and accessing the data and files of various big data platforms.

The test system provided by the embodiment of the invention is oriented to the loading, storage and retrieval of mass data and is a comprehensive and complete test system tailored to the characteristics of big data software.

In addition, the middle application layer 104 further includes: a core service module for providing data services, processing received user requests and distributing the decomposed requests to the subtask modules; and a service common module for providing auxiliary log management, exception handling and service exception monitoring for the services.

The core service module comprises: a data generation service submodule for providing test data and test tasks; a task submission service submodule for submitting the user's test requests to the corresponding big data platform; a result comparison service submodule for judging the correctness of each test task and calculating its technical performance indices; and a resource monitoring service submodule for collecting load information of the system under test and providing data support for analyzing its performance indices.

The test system also comprises a test benchmark module, which is designed based on data types, sample data, the SQL standard and standard interfaces, and which extends a functional benchmark, a performance benchmark, a stability benchmark and a reliability benchmark on that basis.

Specifically, the data generation service submodule, in providing test data, further generates the sample data and expected result data on which the test tasks submitted by the test system depend. Based on the data samples in the test benchmark module, it preserves the distribution characteristics of real data and controls the data size, data format and structure according to the doubling principle and parameters, so that the data can be used in test environments of different scales. The doubling principle comprises: when the sample data is doubled, a column of data is kept entirely distinct so that every record is different. The doubling principle further comprises: the sample data is doubled by a cross operation over columns of the same type, so that the data characteristics remain unchanged. The data generation service submodule comprises test data generation, query task generation, query result set generation and dependency table creation, wherein test data generation either selects the sample data shipped with the test system or regenerates new sample data according to the doubling principle. The sample data comprises normal data and abnormal data, wherein the abnormal data is used for testing abnormal conditions such as system fault tolerance and data skew.

Specifically, the task submission service submodule submits the user's test requests to the corresponding big data platform after the tables and data on which the test tasks depend have been generated and the preparation work has been completed. The task submission service submodule comprises loading task submission, query task submission and other task submissions.

Specifically, the result comparison service submodule, in judging the correctness and technical performance indices of each test task, further provides correctness comparison and calculation of the related indices for loading-class and query-class test tasks. The result comparison service submodule comprises query result judgment, loading result judgment and other test task result judgment.

In addition, the resource monitoring service submodule, in collecting load information of the system under test, further monitors the load and resource usage of the big data cluster nodes and of the key services of each component on the nodes.
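As a non-limiting illustration, continuous load collection during a test task might be sketched as a background sampler. The specification does not say how load information is gathered, so the probe is left as a caller-supplied function; the names ResourceMonitor, probe and interval_s are hypothetical.

```python
import threading
from typing import Callable, Dict, List


class ResourceMonitor:
    """Collect load samples in the background while a test task runs."""

    def __init__(self, probe: Callable[[], Dict[str, float]],
                 interval_s: float = 1.0) -> None:
        self._probe = probe              # assumed: returns one load sample
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples: List[Dict[str, float]] = []

    def _run(self) -> None:
        # Take a sample first, then wait; stop as soon as the event is set.
        while True:
            self.samples.append(self._probe())
            if self._stop.wait(self._interval_s):
                break

    def __enter__(self) -> "ResourceMonitor":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        self._thread.join()
```

A test task would then run inside a `with ResourceMonitor(probe) as monitor:` block, after which `monitor.samples` holds the load information to be combined with the performance indices.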

A test method and a test system for a mass data loading, storage and retrieval system are provided. They comprise a set of test benchmarks for loading, storage and retrieval and a test system designed and implemented on the basis of those benchmarks, in order to meet the application requirements of mass data storage and processing platforms. The test object is a big data system. The test system mainly evaluates and analyzes the loading, storage and retrieval capabilities of the big data system, can guide and assist functional and performance testing of platforms from different vendors, and provides strong data support for the selection and expansion of big data systems.

Hereinafter, the test system will be described with reference to FIG. 2. The test system adopts a layered design comprising, from top to bottom, a presentation layer, a service layer, a task processing layer, a data access layer and a big data base platform. The system gives users two ways to access the test prototype system: web access and the command line. The bottom-layer big data storage platform comprises a distributed storage platform, a file storage system, a database system and the like, wherein the distributed storage platform supports structured, semi-structured and unstructured data storage. The middle application layer is composed of four services. It receives the requests submitted by users from the layer above and coordinates multiple tasks on the layer below to analyze and process those requests. The system also provides several common support modules, such as log management and service monitoring.

The middle layer service of the test system is divided into the following parts:

1. The core service program provides data services externally, processes received user requests and distributes the decomposed requests to the subtask modules. The main services defined by the system are data generation, task submission, result comparison and resource monitoring. The data generation service prepares test data and test tasks; the task submission service submits the user's test requests to the corresponding big data platform for execution; the result comparison service judges the correctness of each test task and calculates its performance indices; and the resource monitoring service collects load information of the system under test and provides data support for analyzing its performance indices. To generate data efficiently and meet the needs of large-scale data testing, the system adopts the MapReduce parallel computing framework: a task is divided into several subtasks that are processed in parallel, the subtasks are automatically distributed to and executed on the cluster nodes, and their results are collected, so that large-scale data is generated concurrently in a simple and efficient way;

2. The service common module assists the services with log management, exception handling and service exception monitoring. The log management module provides a uniform interface or function for the underlying task programs to call, so as to manage and record the logs of the whole system. Key operation and maintenance information and execution result information are recorded in log files; the operation information provides data for system auditing and monitoring, and the recorded exception information is the main basis on which developers locate and fix faults. This ensures that each service runs normally and, in particular, that when the system is in an abnormal state the condition is recorded in time and a self-recovery operation is performed;

3. The big data base platform data access interface is mainly used for reading and accessing the data or files of the various big data base platforms. The data access layer manages the client interfaces and driver information and, according to the test requirements, calls different interfaces to submit test tasks, obtain test results and so on. The interfaces include the common HTTP request interface, the JDBC interface required for queries, the S3 interface required for unstructured data access, and the client interfaces required by other components.

4. The data storage layer is the underlying storage platform for system data and comprises a local file system for storing sample data, distributed storage for large-scale test data sets, a metadata database for key system configuration information, and the like.
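As a non-limiting illustration of the parallel data generation described in item 1, the split, distribute and collect pattern might be sketched locally as follows. A thread pool stands in for the MapReduce framework named in the description, and the row layout and the names split_ranges, generate_partition and generate_parallel are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Row = Tuple[int, int]


def split_ranges(total_rows: int, num_tasks: int) -> List[Tuple[int, int]]:
    """Divide [0, total_rows) into num_tasks contiguous, near-equal ranges."""
    base, extra = divmod(total_rows, num_tasks)
    ranges = []
    start = 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def generate_partition(bounds: Tuple[int, int]) -> List[Row]:
    """Subtask: generate the rows of one range, seeded for reproducibility."""
    start, end = bounds
    rng = random.Random(start)
    return [(row_id, rng.randint(0, 255)) for row_id in range(start, end)]


def generate_parallel(total_rows: int, num_tasks: int) -> List[Row]:
    """Run the subtasks in parallel and collect their results in order."""
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        partitions = list(pool.map(generate_partition,
                                   split_ranges(total_rows, num_tasks)))
    return [row for partition in partitions for row in partition]
```

Because each subtask owns a disjoint range of row identifiers, the partitions can be generated independently and concatenated without coordination between subtasks.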

Hereinafter, the functional modules of the test system are described in detail with reference to FIG. 3. The test system provides the following four functions according to the different types of requirements: test data generation, test task submission, test result output and resource monitoring.

1. The test data generation function is the basis of the whole test benchmark system and is divided into three main parts according to the service scenario:

(1) function, performance and stability test data generated from the SQL standard syntax and the sample data;

(2) the SQL statements of the different components, such as database creation, table creation, index creation, view/sequence creation, query, retrieval, analysis, user management, privilege management and resource management; and

(3) result set data for query and retrieval tasks.

The test data types comprise structured, semi-structured and unstructured data; structured data includes CSV, Avro and the like, and unstructured data includes text, pictures, audio, video, compressed files and the like.

2. The task submission function is the core of user interaction. It handles the submission of the various test tasks and supports single-task submission, concurrent task submission and mixed submission of multiple tasks.

3. The result comparison module is the result output part of the whole test flow. It outputs the test result, that is, whether each test case succeeded, failed or ended with an execution error, together with the calculated indices of all test tasks.

4. The resource monitoring service monitors the resource usage of the whole cluster, including network, disk I/O, memory, CPU and the like. The monitoring results are combined with the performance test results to give an evaluation of the system performance test and to help the user make decisions such as product selection.
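As a non-limiting illustration of items 2 and 3, concurrent task submission and the three-way result judgment (success, failure, execution error) might be sketched as follows. Each test task is modeled as a plain callable with an expected result; the names Verdict, run_case and submit_concurrently are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum
from typing import Callable, Dict, Tuple


class Verdict(Enum):
    SUCCESS = "success"   # task ran and matched the expected result
    FAILURE = "failure"   # task ran but the result differed
    ERROR = "error"       # task raised while executing


def run_case(task: Callable[[], object], expected: object) -> Verdict:
    """Execute one test task and judge its result."""
    try:
        actual = task()
    except Exception:
        return Verdict.ERROR
    return Verdict.SUCCESS if actual == expected else Verdict.FAILURE


def submit_concurrently(
    cases: Dict[str, Tuple[Callable[[], object], object]],
    max_workers: int = 4,
) -> Dict[str, Verdict]:
    """Submit all cases at once and collect one verdict per case name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(run_case, task, expected)
                   for name, (task, expected) in cases.items()}
        return {name: future.result() for name, future in futures.items()}
```

Single-task submission is the special case of a one-entry dictionary, and a mixed submission simply places loading and query callables in the same dictionary.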

Hereinafter, a test method of the data system will be described with reference to fig. 4.

The test method of the data system according to the embodiment of the invention comprises the following steps: generating a test task and data on which the test task depends; submitting the test task to a big data platform; and outputting the test result of the test task after the test task is completed.

The automated test method automates the complex test process for each test point, covers many types of system application scenarios and achieves good results in application.

In addition, the load and resource usage of the big data cluster nodes and of the key services of each component on the nodes are monitored during execution of the test task. When the test task is a loading test task, the method further comprises: generating the loading test task and the original test data required by the loading test; submitting the loading test task to the big data platform; and, after the loading test task is completed, receiving the information returned by the big data platform and calculating the loading performance indices from the returned information. When the test task is a query test task, the method further comprises: generating the query test task, the original data corresponding to the query result set, and the expected result set; submitting the query test task to the big data platform; and, after the query test task is completed, receiving the information returned by the big data platform and calculating the query-related performance indices from the returned information. Node and service information of the big data platform is continuously collected during execution of the query test task and output as load information, and the load information is combined with the performance indices to output a test result evaluation report.

The loading test task is described in detail with reference to FIG. 5. The modules interact in the loading test flow as follows. The data generation module generates the original test data required by the loading test. The task submission module submits the loading task, which writes the data to be loaded into the big data platform, and waits for the task to complete after successful submission. The acquired task execution state is returned to the task output module. After loading is finished, the task output module calculates the loading performance indices from the information returned by the big data platform. Throughout the process, the resource monitoring module continuously collects node and service information of the big data platform and outputs it to the task output module, which makes it easier to judge whether the test result meets the requirements.
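As a non-limiting illustration, the calculation of loading performance indices from the returned information might be sketched as follows. The specification does not enumerate the indices or the fields returned by the platform, so the record and byte throughput shown here, and the names LoadResult and load_metrics, are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LoadResult:
    """Assumed shape of the information returned after a loading task."""
    records_loaded: int
    bytes_loaded: int
    start_time: float   # seconds
    end_time: float     # seconds


def load_metrics(result: LoadResult) -> Dict[str, float]:
    """Derive elapsed time and throughput from one loading result."""
    elapsed = result.end_time - result.start_time
    if elapsed <= 0:
        raise ValueError("end_time must be later than start_time")
    return {
        "elapsed_s": elapsed,
        "records_per_s": result.records_loaded / elapsed,
        "mb_per_s": result.bytes_loaded / (1024 * 1024) / elapsed,
    }
```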

As shown in FIG. 6, the modules interact in the query test flow as follows. The data generation module generates the query test task, the original data corresponding to the query result set, and the expected result set, and the original data is imported into the target query table of the big data platform. The task submission module submits the test task to the big data platform and waits for the task to complete after successful submission. The acquired task execution state is returned to the task output module. After the query is completed, the task output module calculates the query-related performance indices from the information returned by the big data platform. Throughout the process, the resource monitoring module continuously collects node and service information of the big data platform and outputs it as load information, and the load information is combined with the performance indices to output the final test result evaluation report.
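As a non-limiting illustration, combining the correctness check, the query performance indices and the collected load information into one evaluation report might be sketched as follows. The report fields, the order-insensitive comparison and the names same_result_set and evaluation_report are assumptions; rows are modeled as tuples.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Row = Tuple[object, ...]


def same_result_set(actual: Sequence[Row], expected: Sequence[Row]) -> bool:
    """Compare two result sets as multisets, ignoring row order."""
    return Counter(actual) == Counter(expected)


def evaluation_report(
    actual: Sequence[Row],
    expected: Sequence[Row],
    elapsed_s: float,
    load_samples: List[Dict[str, float]],
) -> Dict[str, Optional[object]]:
    """Merge correctness, query timing and load information."""
    cpu = [sample["cpu_percent"] for sample in load_samples]
    return {
        "correct": same_result_set(actual, expected),
        "elapsed_s": elapsed_s,
        "rows_per_s": len(actual) / elapsed_s if elapsed_s > 0 else None,
        "peak_cpu_percent": max(cpu) if cpu else None,
        "avg_cpu_percent": sum(cpu) / len(cpu) if cpu else None,
    }
```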

First, test benchmark

Referring to FIG. 7, the test benchmark is designed based on basic data types, the SQL standard and standard interfaces, and the functional, performance, stability and reliability benchmarks are extended on that basis.

Data type

The data types cover numeric types, character types, time types, IP types, complex types, binary types and other types.

The numeric types include: tinyint, smallint, int, bigint, float, double, decimal;

the character types include: char, varchar, string, text;

the time types include: date, time, timestamp;

the IP types include: ipv4_addr, ipv6_addr;

the complex types include: array, map, struct;

the binary types include: binary;

other types: unstructured objects, custom types.
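As a non-limiting illustration, the named types above might be held in a catalog and used to check that a sample table covers every type. The category keys and the names TYPE_CATALOG and missing_types are hypothetical, and the open-ended "other types" category is left out because it names no concrete types.

```python
from typing import Dict, List

# Type names as listed in the benchmark; category keys are illustrative.
TYPE_CATALOG: Dict[str, List[str]] = {
    "numeric": ["tinyint", "smallint", "int", "bigint",
                "float", "double", "decimal"],
    "character": ["char", "varchar", "string", "text"],
    "time": ["date", "time", "timestamp"],
    "ip": ["ipv4_addr", "ipv6_addr"],
    "complex": ["array", "map", "struct"],
    "binary": ["binary"],
}


def missing_types(schema: Dict[str, str]) -> List[str]:
    """Return catalog types that no column of the schema uses.

    schema maps column name to type name.
    """
    used = set(schema.values())
    return [type_name
            for type_names in TYPE_CATALOG.values()
            for type_name in type_names
            if type_name not in used]
```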

Sample data

Based on actual business and on the combined characteristics of structured, semi-structured and unstructured data, sample data containing all data types is prepared, and a data doubling rule that introduces as little repetition as possible is adopted to generate test data of different scales.

Repeated data may be compressed when it is stored, which would reduce the effective data volume. To avoid this, when data is doubled at least one column of data is kept entirely distinct, so that every record is different. Doubling is performed by a cross operation over columns of the same type in the sample data, which keeps the data characteristics unchanged.
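As a non-limiting illustration, the doubling rule might be sketched as follows. The sketch assumes an integer identifier column as the entirely distinct column and reads the "cross operation" as swapping the values of two columns of the same type within each derived record, which is only one possible interpretation; the names double_rows, id_col and swap_pairs are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def double_rows(rows: List[Row], id_col: str,
                swap_pairs: List[Tuple[str, str]]) -> List[Row]:
    """Return the input rows followed by one derived copy of each row.

    id_col receives a fresh value in every derived row, so no record
    repeats. Each pair in swap_pairs names two same-typed columns whose
    values are exchanged in the derived row.
    """
    next_id = max(int(row[id_col]) for row in rows) + 1
    derived = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        new_row[id_col] = next_id + i
        for col_a, col_b in swap_pairs:
            new_row[col_a], new_row[col_b] = row[col_b], row[col_a]
        derived.append(new_row)
    return rows + derived
```

Calling the function repeatedly on its own output doubles the data set each time while keeping every identifier distinct.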

The sample data falls into two major classes: normal data and abnormal data. The abnormal data is used for testing abnormal conditions such as system fault tolerance and data skew.
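As a non-limiting illustration, two kinds of abnormal data might be produced as follows: a skewed key column in which one key dominates, and a record with one field replaced by an invalid value. The specification does not describe how abnormal data is constructed, so the approach and the names skewed_keys and corrupt_record are assumptions.

```python
import random
from typing import List


def skewed_keys(n: int, hot_key: str, hot_fraction: float,
                num_other_keys: int, seed: int = 0) -> List[str]:
    """Generate n keys where roughly hot_fraction of them equal hot_key."""
    rng = random.Random(seed)
    return [
        hot_key if rng.random() < hot_fraction
        else "key_%d" % rng.randrange(num_other_keys)
        for _ in range(n)
    ]


def corrupt_record(fields: List[str], index: int,
                   bad_value: str = "not_a_number") -> List[str]:
    """Return a copy of the record with one field made invalid."""
    corrupted = list(fields)
    corrupted[index] = bad_value
    return corrupted
```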

When generating test data, the sample data shipped with the test system can be selected, or new sample data can be regenerated according to the sample data generation principle.

SQL Standard

The SQL standard covers the standard syntax for: creating and deleting a database; creating, modifying and deleting tables; creating, modifying and deleting indexes; creating, modifying and deleting views/sequences; inserting, deleting, updating and querying data; compression formats such as RCFile and ORC; creating user groups; adding and deleting users; granting and revoking privileges; and allocating and reclaiming resources. All syntax is designed from the actual services of the data center, and the syntax for each component of the data center software system (hdfs, hbase, hive, spark, storm, es, mpp, databases and so on) is then generated from the standard syntax.

The standard syntax contains both correct syntax and incorrect syntax, the latter for fault tolerance testing.

If new business requirements arise later, SQL standard syntax can be added directly to extend the test content.

Standard interface

The test prototype system is developed against a data center standard interface, such as JDBC, so that it is easy to port and extend.

Functional reference

Correct and incorrect syntax for every component of the data center is generated according to the SQL standard, and a full-flow test is performed covering user creation, resource allocation, privilege granting, database creation, table creation, index creation, loading, storage, query, retrieval and analysis. Functional integrity is checked through correctness verification.

The table-creation syntax covers dimension tables and service tables; service tables are divided into partitioned and non-partitioned tables.

Loading covers multiple client modes, such as the http protocol. ETL tools are also tested, including static data extraction and loading, incremental extraction, and real-time online extraction.

The storage includes various compression methods such as RCFile, ORC, etc.

Query syntax includes filtering, sorting, grouping, statistics, deduplication, joins, exact/fuzzy matching, TOP-K, sub-queries, arithmetic operations, function operations, full table scans, large result set output and so on. It covers general queries, stream queries, search queries, analytical queries and other query types.

The stream query syntax also includes operations such as windowing, stream joins and data sources.

The retrieval syntax contains operator type support (arithmetic, logic, interval Range), functions (aggregation function, normal function), Term retrieval, Range retrieval, WildCard query, Fuzzy query, Regexp query, Phrase query, Boolean query, Boost query, subnet retrieval, stride grouping, relevance ranking, weight retrieval, etc.

The analysis grammar comprises a classification algorithm, a clustering algorithm, a graph analysis algorithm, an association analysis algorithm, a frequent pattern mining algorithm, a self-defined algorithm and the like.

Performance benchmark

Performance test data of different scales is generated from the sample data, and loading, query, analysis and retrieval performance tests are carried out. At the same time, the disk IO, CPU, memory and network utilization of the underlying system is collected. The execution time and resource usage of each test case are output for performance evaluation, providing a basis for analyzing system bottlenecks and for system selection.

The specific performance items include single-machine performance, cluster performance, single-machine cached data volume, loading with different compression algorithms, loading delay under high and low load, single-task performance, concurrent-task performance and mixed-load performance for query, retrieval and analysis, maximum cluster concurrency, storage scale, number of tables, number of users and so on.

Stability benchmark

High-load and temporary-overload conditions are constructed, and loading, query, retrieval and analysis tasks are submitted continuously for 7 × 24 hours or longer. The longest period for which the system runs stably while still producing correct results is recorded.

Reliability benchmark

With the number of data replicas configured, loading, query, retrieval and analysis tests are carried out while power failures, network disconnections, shutdowns and service failures are induced periodically, and the correctness of the test results is verified. The number of failed nodes is designed separately for each replica count.

With data replicas configured, a fault is induced in the data center and the data center is recovered after a period of time; the test monitors whether data backup and data recovery take place automatically.

Second, data generation

Data generation produces the sample data, result data and other inputs on which the test tasks submitted by the test system (loading, query and so on) depend. It preserves the distribution characteristics of real data according to the data samples in the test benchmark, and the data scale, data format and structure can be flexibly controlled through parameters according to the doubling increase principle. The generated data can therefore be used in test environments of different scales while keeping tests fair across those environments.

The module mainly implements test data generation, query statement generation, expected result set generation, and generation of other supporting test tasks such as table creation and privilege management.

The processing flow of the data generation module is shown in fig. 8, and the execution steps are as follows:

after receiving a user request, the data generation service first parses the request, checks its syntax and so on, and determines whether the request is valid;

the service forwards the parsed request to interfaces of the same type on different nodes;

the interface on each child node parses the request further, calls the corresponding programs and samples, and performs the operation, such as generating test data or test tasks;

test data is generated for the following data types: numeric types (tinyint, smallint, int, bigint, float, double, decimal), character types (char, varchar, string, text), time types (date, time, timestamp), IP types (ipv4_addr, ipv6_addr), complex types (array, map, struct), unstructured files, and the like;

test task generation covers the SQL statements, adapted to the different components, for creating databases, tables, indexes and views/sequences, for query, retrieval and analysis, and for user management, privilege management and resource management, together with the expected result sets required by query, retrieval and analysis;

each child node returns its result to the service, and the service management node returns the aggregated result to the client.

In addition, the data generation module generates test data in a multithreaded manner across multiple nodes and multiple disks, making full use of disk IO so that test data and test cases are generated efficiently.
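As a rough illustration of multithreaded generation across several disks, the following Python sketch writes CSV shards round-robin into a list of directories. It covers a single node only; the record layout, the file naming and the functions `write_shard` and `generate` are assumptions made for illustration and are not taken from the described system.

```python
import csv
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_shard(directory: str, shard: int, start: int, count: int) -> str:
    """Write `count` synthetic records to one CSV shard and return its path."""
    rng = random.Random(shard)  # per-shard seed keeps runs reproducible
    path = os.path.join(directory, f"part-{shard:05d}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row_id in range(start, start + count):
            ip = ".".join(str(rng.randint(0, 255)) for _ in range(4))
            writer.writerow([row_id, ip, rng.randint(1, 65535)])
    return path


def generate(directories, total_rows: int, shards: int):
    """Spread `shards` files round-robin over `directories`.

    The pool uses one worker thread per directory, so each disk is kept
    busy with roughly one writer at a time.
    """
    base, extra = divmod(total_rows, shards)
    jobs, start = [], 0
    for shard in range(shards):
        count = base + (1 if shard < extra else 0)
        jobs.append((directories[shard % len(directories)], shard, start, count))
        start += count
    with ThreadPoolExecutor(max_workers=len(directories)) as pool:
        return list(pool.map(lambda job: write_shard(*job), jobs))


# Two temporary directories stand in for two physical disks.
disks = [tempfile.mkdtemp(prefix="disk0-"), tempfile.mkdtemp(prefix="disk1-")]
paths = generate(disks, total_rows=1000, shards=4)
```

Assigning contiguous, non-overlapping row-id ranges to the shards keeps the records unique without any coordination between threads.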

Third, test task submission

Once the tables and data on which a test task depends have been generated and the preparation work is finished, the test task can be submitted. By type, test tasks mainly comprise loading tasks, query tasks and other tasks that produce no result set.

The loading task mainly tests loading and persistent storage; the query task mainly tests the query and retrieval capability of each component; other tasks without result set output mainly cover DDL, DCL and similar functions that do not need to output a result set, such as creation, deletion and structure modification of tables, creation, deletion and rebuilding of indexes, and granting and revoking of privileges.

The test task submission module mainly implements concurrent submission of the various test tasks, selecting a different client to run depending on the task type. As shown in fig. 9, the specific steps are as follows:

(1) the task submitting service receives a user request;

(2) judging the request type, and selecting and calling different task control interfaces according to different task types;

(3) the loading task interface further decomposes the user request, selects a loading mode such as http or ftp, starts multiple loading clients, submits concurrent loading tasks, and records the loading response information in a log file;

(4) the query task interface connects through different jdbc drivers and calls the jdbc interface, concurrently submits multiple query tasks, and writes the specified result set to a file according to the user request; and

(5) other test tasks such as DDL and DCL statements are submitted through jdbc.
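The dispatch-by-type and concurrent submission in steps (1) to (5) can be sketched in Python as follows. The handler functions are placeholders: a real system would start http/ftp loading clients or call a jdbc driver at those points. The names `Task`, `HANDLERS` and `submit_all` and the sample tasks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    kind: str      # "load", "query", or "other" (DDL/DCL, no result set)
    payload: str   # file path for a load, SQL text otherwise


def submit_load(task: Task) -> str:
    # Placeholder for starting a loading client (for example over http or ftp).
    return f"load client started for {task.payload}"


def submit_query(task: Task) -> str:
    # Placeholder for a query sent through a database driver.
    return f"query submitted: {task.payload}"


def submit_other(task: Task) -> str:
    # Placeholder for DDL/DCL statements that return no result set.
    return f"statement submitted: {task.payload}"


HANDLERS: Dict[str, Callable[[Task], str]] = {
    "load": submit_load,
    "query": submit_query,
    "other": submit_other,
}


def submit_all(tasks: List[Task], concurrency: int = 4) -> List[str]:
    """Route each task to the handler for its type and run them concurrently."""
    def dispatch(task: Task) -> str:
        handler = HANDLERS.get(task.kind)
        if handler is None:
            raise ValueError(f"unknown task type: {task.kind}")
        return handler(task)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map returns results in the order the tasks were given.
        return list(pool.map(dispatch, tasks))


tasks = [
    Task("load", "/data/part-00000.csv"),
    Task("query", "SELECT COUNT(*) FROM t"),
    Task("other", "CREATE INDEX idx ON t (c)"),
]
results = submit_all(tasks)
```

Keeping the handlers in a lookup table means a new task type can be supported by registering one more function, without changing the submission logic.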

Fourth, task result comparison

Task result comparison is the core of the output that the test system provides externally. It provides correctness comparison for test tasks covering the functions and performance of loading, storage and retrieval, together with services that compute the related indexes.

The flowchart of task output is shown in fig. 10, and the execution steps are as follows:

(1) the task output service receives a test result judgment request;

(2) determining whether the task is of the loading type, the query type or the no-result-set type;

(3) performing the comparison according to the type of test task, for example:

a) Loading class: the number of records stored in the loading target table is obtained and compared with the loaded original data. If the record counts are consistent, the detail data is compared; for efficiency in time and other resources, the detail data is compared by sampling marked data. When the comparison results are consistent, the test case is judged to have passed.

b) Query class: it is first determined whether an expected result set exists. If it does, the record count of the expected result set is compared with that of the actual result set, and the detail data is compared if the record counts are consistent.

(4) Performance calculation

a) The loading test task requires calculating the loading speed and storage indexes. The concurrent loading speed is calculated by merging and aggregating the loading time and loaded data volume recorded by each loading client; storage-related indexes such as compression ratio and storage speed are calculated from the size of the original data and the size of the physical files in hdfs;

b) the query test task mainly calculates query response time and query export speed: the minimum, maximum and average response times of concurrent queries are obtained from the log files of the query clients, and the export speed of the query result set is calculated.
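The comparison and calculation steps above can be sketched in Python as follows. The formulas are assumptions, since the text does not give them: the concurrent loading rate is taken as total bytes divided by the wall-clock span across all clients, the compression ratio as original size divided by stored size, and the detail comparison draws a random sample rather than sampling marked records. All function names are hypothetical.

```python
import random
from statistics import mean


def compare_result_sets(expected, actual, sample_size=100, seed=0):
    """Compare record counts first, then spot-check sampled detail rows."""
    if len(expected) != len(actual):
        return False
    actual_rows = set(actual)  # rows must be hashable, e.g. tuples
    rng = random.Random(seed)
    picked = rng.sample(expected, min(sample_size, len(expected)))
    return all(row in actual_rows for row in picked)


def concurrent_load_rate(client_logs):
    """Aggregate loading rate in bytes per second.

    `client_logs` holds one (bytes_loaded, start_time, end_time) tuple per
    loading client, with times in seconds.
    """
    total_bytes = sum(loaded for loaded, _, _ in client_logs)
    span = max(end for _, _, end in client_logs) - min(
        start for _, start, _ in client_logs)
    return total_bytes / span


def compression_ratio(raw_bytes, stored_bytes):
    """Original data size divided by the size of the stored files."""
    return raw_bytes / stored_bytes


def response_time_summary(times):
    """Minimum, maximum and average response time of concurrent queries."""
    return {"min": min(times), "max": max(times), "avg": mean(times)}


expected = [(1, "a"), (2, "b"), (3, "c")]
same = compare_result_sets(expected, [(3, "c"), (1, "a"), (2, "b")])
rate = concurrent_load_rate([(100, 0.0, 10.0), (300, 0.0, 20.0)])
summary = response_time_summary([1.0, 2.0, 6.0])
```

Checking the record count before the detail rows mirrors the order described above and lets a mismatched load or query fail cheaply.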

Fifth, resource monitoring

The module provides external monitoring of the load and resource usage of the big data cluster nodes and of the key services of each component on those nodes.

As shown in fig. 11, the main steps of the resource monitoring module are as follows:

(1) the resource monitoring service acquires a user request;

(2) parsing the user request to obtain a node list or the list for a given component;

(3) calling the system resource acquisition interface to connect to each node and collect system resources such as CPU, memory and network card information; and

(4) calling the component resource acquisition interface to obtain a state instance for a given component, and obtaining the component's monitoring information through that instance.
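Continuous collection while a test task runs can be sketched in Python with a background sampling thread. The acquisition interfaces themselves are not specified in the text, so the sampler is passed in as a callable; `ResourceMonitor`, `fake_sampler` and the metric names are hypothetical, and the fixed values stand in for real node measurements.

```python
import threading
import time


class ResourceMonitor:
    """Call `sampler` repeatedly in a background thread until stopped."""

    def __init__(self, sampler, interval=1.0):
        self.sampler = sampler
        self.interval = interval
        self.samples = []  # list of (timestamp, sampler result)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            # Sample first, then wait, so at least one sample is recorded.
            self.samples.append((time.time(), self.sampler()))
            if self._stop.wait(self.interval):
                break

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()


def fake_sampler():
    # Stand-in for the system and component resource acquisition interfaces.
    return {"node1": {"cpu_pct": 12.5, "mem_pct": 40.0}}


with ResourceMonitor(fake_sampler, interval=0.01) as monitor:
    time.sleep(0.05)  # stands in for the test task being executed
```

The collected samples can then be joined with the performance indexes by timestamp when the evaluation report is produced.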

According to the test system and test method of the data system of the embodiments of the invention, a comprehensive and complete test method is provided that targets the loading, storage and retrieval of mass data and the characteristics of big data software, remedying the shortcomings of traditional test methods designed for structured databases. The automated test system automates the complex test process for each test point, covers many types of system application scenarios, and has a good application effect.

The invention provides a comprehensive and complete test benchmark covering function, performance, stability and reliability, together with the scenarios that refine each of them. According to the characteristics and application scenarios of each piece of big data software, the corresponding test benchmark is mapped from the requirements of the test task and the complete test benchmark, and test data generation, test task selection, test task submission, test task execution, result analysis, test process monitoring and so on are carried out.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A test system for a data system, comprising:

the user layer is used for providing a mode for accessing the test system for a user;

the middle application layer is used for receiving the request submitted by the user from the user layer and coordinating a plurality of tasks for analyzing and processing the request for the bottom layer big data platform; and

a bottom big data platform which is a data storage layer of the data system and comprises a local file system, a distributed file system and a database,

wherein the intermediate application layer further comprises a data access interface for reading and accessing data and files of various big data platforms,

the test system also comprises a test benchmark module which is used for designing based on a data type, sample data, an SQL standard and a standard interface, expanding a functional benchmark, a performance benchmark, a stability benchmark and a reliability benchmark on the basis of the data type, the sample data, the SQL standard and the standard interface,

providing test data includes: generating sample data and expected result data on which the test tasks submitted by the test system depend, wherein, according to the sample data in the test benchmark module, the distribution characteristics of the real data are kept, and the data size, the data format and the structure are controlled according to a doubling increase principle and parameters, for use in test environments of different scales.

2. The test system of claim 1, wherein the intermediate application layer further comprises:

the core service module is used for providing data service, processing the received user request and distributing the decomposed request to each subtask module; and

and the service common module is used for providing auxiliary log management, exception handling and service exception monitoring for various services.

3. The test system of claim 2, wherein the core services module comprises:

the data generation service submodule is used for providing the test data and the test task;

the task submitting service sub-module is used for submitting the test request of the user to a corresponding big data platform;

the result comparison service submodule is used for judging the correctness and the technical performance index of each test task; and

and the resource monitoring service submodule is used for acquiring the load information of the tested system and providing data support for analyzing the performance index of the tested system.

4. The test system of claim 3, wherein the doubling increase principle comprises: when the sample data is doubled, ensuring that a column of data is completely different so that each record is different.

5. The test system of claim 3, wherein the doubling increase principle comprises: doubling the sample data by a cross operation over columns of the same type in the sample data, so that the data characteristics are unchanged.

6. The test system of claim 3, wherein the sample data comprises normal data and abnormal data, wherein the abnormal data is used for testing system fault tolerance and abnormal situations of data skew.

7. The test system of claim 3, wherein the data generation service submodule comprises: generating test data, generating a query task, generating a query result set and generating a dependency table, wherein the test data generates and selects sample data carried by the test system or regenerates new sample data according to the doubling increase principle.

8. The testing system of claim 3, wherein the task submission service submodule is configured to submit the user's test request to the corresponding big data platform further comprising: and after the tables and data depended by the test tasks are generated and the preparation work is completed, submitting the test requests of the users to the corresponding big data platforms.

9. The test system of claim 8, wherein the task submission service submodule includes load task submission, query task submission, and other task submissions.

10. The testing system of claim 3, wherein the result comparison service sub-module is configured to determine correctness and technical performance indicators of each testing task, and further comprises a computing service for providing correctness comparison and related indicators for the testing tasks of the loading class and the query class.

11. The test system of claim 10, wherein the result comparison service sub-module comprises query result determination, load result determination, and other test task result determination.

12. The testing system of claim 3, wherein the resource monitoring service sub-module is configured to collect load information of the system under test and further includes providing monitoring of load and resource usage of the big data cluster node and critical services of each component on the node.

13. A method for testing a data system, comprising:

generating a test task and data on which the test task depends;

submitting the test task to a big data platform; and

outputting a test result of the test task after the test task is completed,

wherein the design is based on data type, sample data, SQL standard and standard interface, and the design is based on the data type, sample data, SQL standard and standard interface to expand function standard, performance standard, stability standard and reliability standard,

providing test data includes: generating sample data and expected result data which are depended by the test system to submit the test task, wherein according to the sample data, the distribution characteristics of the real data are kept, and the data size, the data format and the structure are controlled according to a doubling increase principle and parameters so as to be used for test environments with different scales.

14. The method of testing of claim 13, further comprising: monitoring the load and resource usage of the big data cluster nodes and of the key services of each component on the nodes during the execution of the test task.

15. The method according to claim 13, wherein when the test task is a load test task, further comprising:

generating a loading test task and original test data required by a loading test;

submitting the loading test task to the big data platform; and

after the loading test task is completed, receiving the information returned by the big data platform and calculating the loading performance index according to the returned information.

16. The method according to claim 13, wherein when the test task is a query test task, further comprising:

generating the query test task, original data corresponding to the query result set, and an expected result set;

submitting the query test task to the big data platform; and

after the query test task is completed, receiving the information returned by the big data platform and calculating the query-related performance index according to the returned information.

17. The test method of claim 16, further comprising: continuously acquiring node and service information of the big data platform during the execution of the query test task, and outputting load information, wherein the load information is combined with the performance index to output a test result evaluation report.