CN112241354A - Application-oriented transaction load generation system and transaction load generation method - Google Patents


Info

Publication number
CN112241354A
CN112241354A (application number CN201910800259.7A)
Authority
CN
China
Prior art keywords
load
parameter
transaction
application
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910800259.7A
Other languages
Chinese (zh)
Other versions
CN112241354B (en)
Inventor
张蓉
李宇明
张舒燕
舒科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pingkai Star Beijing Technology Co ltd
Original Assignee
Beijing Pingkai Star Technology Development Co ltd
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Pingkai Star Technology Development Co ltd, East China Normal University filed Critical Beijing Pingkai Star Technology Development Co ltd
Priority to CN201910800259.7A priority Critical patent/CN112241354B/en
Publication of CN112241354A publication Critical patent/CN112241354A/en
Application granted granted Critical
Publication of CN112241354B publication Critical patent/CN112241354B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3433Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment for load management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452Performance evaluation by statistical analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution

Abstract

The invention provides an application-oriented synthetic load generation system that captures the load characteristics of a real online transaction processing (OLTP) application and then generates a synthetic load whose performance indexes are highly similar to those of the actual application load, while ensuring information concealment, tool scalability and load scalability. The invention proposes a new method for describing the load characteristics of online transaction processing applications: OLTP load characteristics are characterized from the two dimensions of transaction logic and data access distribution. A method is provided for extracting the transaction logic and the data access distribution from a real application load trace while ensuring the concealment of application information. From the perspective of controlling transaction conflicts and the distributed transaction proportion, a method is provided for preserving implicit transaction logic by controlling the dependency relationships of operation parameters. Three aspects depicting the data access distribution are designed and realized: skewness, dynamics, and continuity. The first application-oriented OLTP load generator is realized, and the authenticity and scalability of the generated load for performance evaluation are ensured.

Description

Application-oriented transaction load generation system and transaction load generation method
Technical Field
The present invention relates to the field of transaction load generation technologies, and in particular, to an application-oriented transaction load generation system and a transaction load generation method.
Background
In different application fields, there are many benchmarks for database performance evaluation. In the OLAP field, TPC-H, TPC-DS and SSB [14], which have predefined standard database schemas and test queries, are commonly used as benchmarks. The TPC-C [1], TPC-E and SmallBank [10] benchmarks are used to evaluate the transaction processing capability of database systems. CH-benCHmark [15] and HTAPBench [16] provide unified evaluation for hybrid transactional/analytical processing (HTAP) systems. In addition, YCSB [11] is commonly used to measure the throughput of cloud serving systems; its load is simple but its scalability requirements are high. However, the evaluation loads of these standard benchmarks are abstractions over a class of applications, so they are too general to evaluate database performance for a specific application.
To obtain a load similar to that of the target application, load trace replay is an alternative approach. Microsoft SQL Server is equipped with two tools, SQL Server Profiler [17] and SQL Server Distributed Replay [2], for reproducing production loads based on SQL traces. Oracle Database Replay [3], [18] enables users to record the load trace on a production system with minimal performance impact and then replay a complete load with the same concurrency and load characteristics as the actual load. However, load replay is difficult to apply in practical database performance evaluation because it requires the real database state and the raw load trace, raising data privacy concerns [3]. In addition, load scaling (e.g., scaling up concurrency) is also a problem that current replay technologies can hardly solve.
Therefore, load simulation is necessary and urgent. Load-aware data and query generators [4], [5], [19], [20] are used to evaluate the database performance of OLAP applications. The inputs of these tools typically include the database schema, basic data characteristics, and size specifications for intermediate results of the query tree; the output is a synthetic database instance and instantiated test queries conforming to the specified data and load characteristics. Also for OLAP applications, there are database scaling tools [6], [7] that can scale a given database instance up or down to support application-specific database benchmarking. Workload analyzers [8], [9] aim at studying and better understanding application loads, but they cannot generate synthetic loads. There are also some load generators [12], [13] built for database performance benchmarking: Jeong et al. [12] propose a load generator that can simulate real hardware resource consumption, and NoWog [13] introduces a load description language for generating synthetic loads to test NoSQL databases. None of these efforts can simulate the various loads of a real OLTP application to achieve application-oriented database performance evaluation.
When evaluating the query performance of a database management system (DBMS), a synthetic load is typically run against the DBMS and the throughput and response time of the system are then observed, which makes the synthetic load critical in the process of evaluating DBMS performance. If the DBMS performance evaluation targets a specific application, the similarity between the synthetic load and the real load directly determines whether the evaluation result is credible. However, the loads currently used for performance evaluation rarely share the load characteristics of the target application, which leads to inaccurate evaluation results. To solve this problem, the invention designs a synthetic load generation method for transactional applications, which captures the load characteristics of a real online transaction processing (OLTP) application and then generates a synthetic load highly similar to the actual application load in terms of performance indexes.
Disclosure of Invention
In order to overcome the defects in the background art, the invention provides an application-oriented transaction load generation system, which comprises a database generation module and a load generation module, wherein:
the database generation module is used for generating a test database by acquiring a database mode and data characteristics;
the load generation module generates a synthetic load similar to a real load by analyzing a load track of the real application; the load generation module includes:
a transaction logic analyzer that extracts transaction logic information by analyzing a full load trace over a short period of time;
a data access distribution analyzer which extracts data access distribution information and throughput information by analyzing a partial load trajectory over a long period of time;
and the load generator is used for instantiating the parameter values in the transaction template by utilizing the previously extracted transaction logic information, the data access distribution information and the throughput information to generate a composite load.
In the application-oriented transaction load generation system provided by the invention, the data characteristics are automatically acquired by a data characteristics extractor using SQL queries.
In the application-oriented transaction load generation system provided by the invention, the load generator can be configured with the number of test nodes and the number of test threads on each node to simulate concurrency; for each test thread, a separate database connection is established.
In the application-oriented transaction load generation system provided by the invention, when a transaction is executed, the load generator uses the structural information of the transaction logic to determine whether the operation in the branch structure needs to be executed or not and the number of times of executing the operation in the loop structure; for the execution of an SQL operation, instantiating all parameters one by one, and then sending the operation with specific parameter values to a test database; after the operation is performed, the result set and parameters are saved as an intermediate state for generating other parameters in subsequent operations within the same transaction instance.
In the application-oriented transaction load generation system provided by the invention, for a parameter, if only dependency $1 exists, the load generator can calculate the value of the parameter directly from the increment Δ and the associated smaller parameter; if only dependency $2 exists, it first attempts to instantiate the parameter by randomly selecting a dependent item according to the probability of the dependent item, and when no dependent item is selected, it instantiates the parameter using the data access distribution; if dependencies $2 and $3 both exist, the corresponding operation must be in a loop structure: the parameter is instantiated based on dependency $2 and the data access distribution on the first execution of the loop, while for non-first loop executions the load generator first attempts to use dependency $3 to instantiate the parameter according to its probability, and uses dependency $2 and the data access distribution to instantiate the parameter if no dependent item is selected.
Based on the system, the invention also provides an application-oriented transaction load generation method, which comprises the following steps:
step A: generating a test database by acquiring a database mode and data characteristics;
and B: generating a composite load similar to the real load by analyzing the load trajectory of the real application, including:
step B1: extracting transaction logic information by analyzing a complete load trace in a short time period;
step B2: extracting data access distribution information and throughput information by analyzing partial load tracks in a longer time period;
step B3: and instantiating parameter values in the transaction template by using the previously extracted transaction logic information, the data access distribution information and the throughput information to generate a synthetic load.
In the application-oriented transaction load generation method provided by the invention, in the step A, the test database is a plurality of tables meeting the primary key, foreign key constraint and non-key value attribute data characteristics; the method specifically comprises the following steps:
step A1: generating primary keys in sequence;
step A2: when the foreign key is generated, the foreign key is randomly generated in the value range of the primary key which is referred to by the foreign key;
step A3: values of non-key-value attributes are generated by a random attribute generator that includes a random index generator and an index value translator, while satisfying desired data characteristics.
The application-oriented transaction load generation method provided by the invention determines value ranges before generating key values: first, if the primary key contains only a single attribute, its value range is [1, s], where s is the size of the table; second, the value range of a foreign key attribute is determined by the primary key it references; third, when handling the value range of the non-foreign-key attribute in a composite primary key, with exactly one non-foreign-key attribute in the composite primary key, the value range of this attribute is [1, ⌈s/|d_fk|⌉], where d_fk is the value range of one of the foreign key attributes in the composite primary key and |d_fk| is its size; the second and third steps are performed multiple times if cascading references are involved.
The invention provides an application-oriented transaction load generation method, wherein the output of a random index generator is an integer from 1 to n, wherein n is the cardinal number of an attribute; given an index, the index value translator deterministically maps it to a value in the attribute value domain; according to the data type of the attribute, different index value converters are adopted: for numerical types, a linear function is used that maps indices uniformly to attribute value ranges; for the character string type, randomly generating a seed character string meeting the length requirement; first, a seed string is selected according to an input index, and then the index and the selected seed string are connected as an output value.
In step B1, the transaction logic extraction algorithm includes:
step B11: calculating the number of executions of each operation in the transaction template by traversing the load trace, thereby deriving the execution probability of each branch and the average number of executions of each loop operation;
step B12: identifying all parameter pairs <p_{i,l}, p_{i,j}> in the transaction template that satisfy BR, and then traversing the load trace to obtain the average increment Δ; constructing the dependency $1 for p_{i,j};
step B13: for the parameter p_{i,j} in each transaction template, traversing every parameter p_{m,n} preceding it and counting the number of transactions in which the parameter pair satisfies ER; similarly, traversing every preceding return set r_{x,y} and counting the number of transactions satisfying ER and IR, respectively;
step B14: randomly selecting N transaction instance groups, two per group, from the K transaction instances, and then calculating the LR coefficients (a, b) of the parameter pairs in each group;
step B15: constructing the dependency $2 for each parameter p_{i,j} using the statistics obtained in steps B13-B14;
step B16: the same operation in a loop structure runs multiple times, and the dependency $3 describes the change of its parameters; calculating the changes of parameter values in consecutively executed loop operations by traversing the load trace; calculating the coefficients (a, b) as in step B14, and then constructing the dependency $3 from the statistics.
The application-oriented transaction load generation method provided by the invention specifically comprises the following steps in the step B:
step B21: generating all high frequency terms that satisfy the expected repetition rate;
step B22: traversing all parameters in the previous time window, and selecting a repeat parameter for each interval until the parameter repetition rate on the interval is met;
step B23: deriving an index for the parameter from its value, thereby identifying its belonging interval in the current time window; if the index is not in the index field of the current time window, ignoring the parameter;
step B24: generating a random parameter added to each interval to meet the cardinality requirement; instantiating a parameter using a parameter generation mechanism based on the candidate parameter; within a certain interval, only one candidate parameter needs to be randomly selected as output.
The invention discloses an application-oriented synthetic load generation system and method, which can capture the load characteristics of a real online transaction processing (OLTP) application and then generate a synthetic load whose performance indexes are highly similar to those of the actual application load, while ensuring information concealment, tool scalability and load scalability. The main innovation points include:
1. a new method of describing online transaction processing (OLTP) application load features is proposed: OLTP load characteristics are characterized from two dimensions of transactional logic and data access distribution.
2. A method for extracting transaction logic and data access distribution from a real application load track is provided, and meanwhile, the concealment of application information is guaranteed.
3. From the perspective of controlling transaction conflict and distributed transaction proportion, a method for ensuring implicit transaction logic by controlling the dependence relationship of operation parameters is provided.
4. Three aspects depicting the data access distribution are designed and realized: skewness, dynamics, and continuity.
5. The first application-oriented OLTP load generator is realized, and the authenticity and scalability of the generated load for performance evaluation are ensured.
References
[1] TPC-C benchmark, http://www.tpc.org/tpcc/.
[2] SQL Server Distributed Replay, https://docs.microsoft.com/en-us/sql/tools/distributed-replay/sql-server-distributed-replay?view=sql-server-2017.
[3] L. Galanis, S. Buranawatanachoke, R. Colle, B. Dageville, K. Dias, J. Klein, S. Papadomanolakis, L. L. Tan, V. Venkataramani, Y. Wang, et al., "Oracle database replay," in SIGMOD, 2008, pp. 1159–1170.
[4] E. Lo, N. Cheng, W. W. K. Lin, W. Hon, and B. Choi, "MyBenchmark: generating databases for query workloads," in VLDBJ, 2014, pp. 895–913.
[5] Y. Li, R. Zhang, X. Yang, Z. Zhang, and A. Zhou, "Touchstone: Generating enormous query-aware test databases," in USENIX ATC, 2018, pp. 575–586.
[6] Y. Tay, B. T. Dai, D. T. Wang, E. Y. Sun, Y. Lin, and Y. Lin, "UpSizeR: Synthetically scaling an empirical relational database," in Information Systems, 2013, pp. 1168–1183.
[7] J. Zhang and Y. Tay, "Dscaler: Synthetically scaling a given relational database," in PVLDB, 2016, pp. 1671–1682.
[8] P. S. Yu, M.-S. Chen, H.-U. Heiss, and S. Lee, "On workload characterization of relational database environments," in IEEE Transactions on Software Engineering, 1992, pp. 347–355.
[9] Q. T. Tran, K. Morfonios, and N. Polyzotis, "Oracle workload intelligence," in SIGMOD, 2015, pp. 1669–1681.
[10] M. Alomari, M. Cahill, A. Fekete, and U. Rohm, "The cost of serializability on platforms that use snapshot isolation," in ICDE, 2008, pp. 576–585.
[11] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking cloud serving systems with YCSB," in SoCC, 2010, pp. 143–154.
[12] H. J. Jeong and S. H. Lee, "A workload generator for database system benchmarks," in iiWAS, 2005, pp. 813–822.
[13] P. Ameri, N. Schlitter, J. Meyer, and A. Streit, "NoWog: a workload generator for database performance benchmarking," in DASC/PiCom/DataCom/CyberSciTech, 2016, pp. 666–673.
[14] P. E. O'Neil, E. J. O'Neil, X. Chen, and S. Revilak, "The star schema benchmark and augmented fact table indexing," in TPCTC, 2009, pp. 237–252.
[15] R. Cole, F. Funke, L. Giakoumakis, W. Guy, A. Kemper, S. Krompass, H. Kuno, R. Nambiar, T. Neumann, M. Poess, et al., "The mixed workload CH-benCHmark," in DBTest, 2011, p. 8.
[16] F. Coelho, J. Paulo, R. Vilaça, J. Pereira, and R. Oliveira, "HTAPBench: Hybrid transactional and analytical processing benchmark," in ICPE, 2017, pp. 293–304.
[17] SQL Server Profiler, https://docs.microsoft.com/en-us/sql/tools/sql-server-profiler/sql-server-profiler?view=sql-server-2017.
[18] Y. Wang, S. Buranawatanachoke, R. Colle, K. Dias, L. Galanis, S. Papadomanolakis, and U. Shaft, "Real application testing with database replay," in DBTest, 2009, p. 8.
[19] C. Binnig, D. Kossmann, E. Lo, and M. T. Özsu, "QAGen: generating query-aware test databases," in SIGMOD, 2007, pp. 341–352.
[20] A. Arasu, R. Kaushik, and J. Li, "Data generation using declarative constraints," in SIGMOD, 2011, pp. 685–696.
Drawings
Fig. 1 is a basic architecture diagram of the present invention.
FIG. 2 is a diagram illustrating the determination of value ranges for key-value attributes according to the present invention.
FIG. 3 is an example of an S-Dist of the present invention.
FIG. 4 is an example of parameter generation according to the present invention.
FIG. 5 is an example of the C-Dist of the present invention.
FIG. 6 is a graph of the deviation of the simulated TPC-C load performance index on the PostgreSQL database of the present invention.
FIG. 7 shows the evaluation of S-Dist and D-Dist in the present invention on a skewed and dynamic load.
FIG. 8 is a C-Dist test in the present invention for a continuous load.
FIG. 9 shows the performance of the transaction logic extraction algorithm of the present invention (with K = N).
FIG. 10 shows the performance of the C-Dist extraction algorithm of the present invention.
Detailed Description
The invention is further described in detail below with reference to specific examples and the accompanying drawings. Except for the contents specifically mentioned below, the procedures, conditions and experimental methods for carrying out the present invention are common knowledge in the art, and the present invention is not particularly limited thereto.
Application-oriented database performance evaluation has the following requirements:
and (4) fidelity. The load used for evaluation should be highly similar to the real application load. The performance indicators obtained from the evaluations (e.g., throughput, delay, utilization of physical resources) should be consistent with the results of the actual application run. The similarity of the evaluation load and the real application load is measured by the deviation degree of the performance index. The smaller the degree of deviation, the higher the similarity.
And (4) concealment. Data privacy is a basic requirement of commercial applications, so the load of real applications cannot be directly used as database performance evaluation.
Tool extensibility. The target application may have a large data size and high concurrency/throughput. This requires that the load generation tool be capable of scaling across multiple nodes and support concurrent database and load generation.
Load scalability. Sometimes it is desirable to scale the current application load to measure the performance of the DBMS at an expected future load scale. Since the present invention primarily focuses on generating transactional loads, query concurrency and query throughput are the critical concerns.
Based on these needs, the present invention formalizes the application-oriented transactional load generation problem:
application-oriented transactional load generation (Application-oriented transactional load generation): and generating a synthetic load which is highly similar to the target application, and simultaneously ensuring the fidelity, the concealment, the tool expandability and the load expandability.
Based on the definition of the data generation problem, the basic architecture of the design of the present invention is shown in fig. 1. In order to solve the data privacy problem, the method of the invention isolates the production environment and the evaluation environment, thereby ensuring that the data owner protects the data privacy. From a functional point of view, the implementation of the present invention can be divided into a Database Generation module (Database Generation) and a load Generation module (Workload Generation).
Database Generation module (Database Generation): the input of the database generator mainly consists of two parts, the database schema and the data characteristics; the output is a test database (Test DB). Since collecting data characteristics is tedious and they need to be retrieved from the real database, the present invention provides a Data Characteristics Extractor to help retrieve this information automatically using simple SQL queries. By design, the present invention is only concerned with some fundamental data characteristics of the test database (e.g., the value range of an attribute), since the load characteristics of the synthetic load are the key factor affecting DBMS performance.
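For illustration only, the following Python sketch shows how such a data characteristics extractor might issue its SQL queries; the DB-API connection object and the table/column names are assumptions, not part of the invention:

    # Minimal sketch of a data characteristics extractor; `conn` is assumed to be a
    # DB-API connection to the real database, and the table/column names are examples.
    def extract_column_characteristics(conn, table, column):
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM " + table)
        table_size = cur.fetchone()[0]
        cur.execute("SELECT MIN({c}), MAX({c}), COUNT(DISTINCT {c}) FROM {t}"
                    .format(c=column, t=table))
        min_value, max_value, cardinality = cur.fetchone()
        return {"table_size": table_size,
                "value_range": (min_value, max_value),
                "cardinality": cardinality}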
Load Generation module (Workload Generation): the load generation module consists of three parts: a Transaction Logic Analyzer (Transaction Logic Analyzer), a Data Access Distribution Analyzer (Data Distribution Analyzer), and a load Generator (Workload Generator). The transactional logic analyzer extracts the transactional logic information by analyzing the full load trace (containing all the parameters and returned result sets for each SQL operation) over a short period of time. The data access distribution analyzer extracts data access distribution information and throughput information by analyzing partial load traces (containing only some key parameters of SQL operations) over a longer period of time. The load generator uses this information to instantiate parameter values in the transaction template, generating a composite load.
The database generation in the invention:
the data characteristics required for all database generation are the same as Touchstone [5], e.g., table size, value range of attributes, cardinality of attributes, i.e., number of non-duplicate values. Generating a test database is in fact a way to generate multiple tables that satisfy primary/foreign key constraints and non-key value attribute data characteristics.
Without loss of generality, the present invention assumes that primary and foreign keys are integers. Primary keys are identifiers of records, usually without physical meaning, so their data characteristics are not considered. When generating primary keys, the present invention simply generates them sequentially; when generating foreign keys, it generates them randomly within the value range of the primary key they reference. This ensures uniqueness of primary keys and referential integrity of foreign keys. Before generating these key values, three steps are required to determine their value ranges. First, if the primary key contains only a single attribute, its value range is [1, s], where s is the size of the table (① in FIG. 2); second, the value range of a foreign key attribute is determined by the primary key it references (② in FIG. 2); third, the present invention handles the value range of the non-foreign-key attribute in a composite primary key (③ in FIG. 2). The most common and reasonable case is that there is exactly one non-foreign-key attribute in the composite primary key, and the value range of this attribute is then [1, ⌈s/|d_fk|⌉], where d_fk is the value range of one of the foreign key attributes in the composite primary key and |d_fk| is its size. The other cases are handled similarly. The second and third steps may be performed multiple times if cascading references are involved.
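A minimal sketch of the three value-range rules follows; the ceiling-based formula for the composite-key case is the reconstruction given above (the original formula is only available as an image), and the function names are illustrative:

    import math

    def pk_value_range(table_size):
        # rule 1: a single-attribute primary key takes the range [1, s]
        return (1, table_size)

    def fk_value_range(referenced_pk_range):
        # rule 2: a foreign key inherits the value range of the referenced primary key
        return referenced_pk_range

    def composite_nonfk_value_range(table_size, fk_range_size):
        # rule 3 (reconstructed): enough values so that (foreign key, attribute)
        # pairs can cover all table_size records
        return (1, math.ceil(table_size / fk_range_size))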
A random attribute generator [5], comprising a random index generator and an index value converter, is used to generate values of non-key-value attributes while satisfying the desired data characteristics, in particular the cardinality characteristics. The output of the random index generator is an integer from 1 to n, where n is the cardinality of the attribute. Given an index, the index value converter deterministically maps it to a value in the attribute value domain. The present invention employs different index value converters depending on the data type of the attribute. For numeric types, such as integers, the present invention simply uses a linear function that maps indexes uniformly onto the attribute value range; for string types, such as varchar, a number of seed strings satisfying the length requirement are generated randomly. The present invention first selects a seed string according to the input index (e.g., the (i % k)-th seed string, where i is the index and k is the number of seed strings), and then concatenates the index and the selected seed string as the output value.
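An illustrative sketch of this random attribute generator is given below; the '#' separator and the (i % k) seed selection follow the text, while the helper names are assumptions:

    import random

    def make_numeric_converter(low, high, cardinality):
        # linear index value converter: maps index 1..n uniformly onto [low, high]
        def convert(index):
            return low + (high - low) * (index - 1) // max(cardinality - 1, 1)
        return convert

    def make_string_converter(seed_strings):
        # select the (i % k)-th seed string and concatenate the index with it
        def convert(index):
            seed = seed_strings[index % len(seed_strings)]
            return "{}#{}".format(index, seed)
        return convert

    def random_attribute_value(cardinality, converter):
        index = random.randint(1, cardinality)   # uniform random index generator
        return converter(index)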
In general, each table is generated independently of the other. And for each table, the invention can realize parallel data generation on a plurality of nodes by distributing the primary key generation range to each thread.
The transaction logic in the invention:
the transactional logic referred to herein represents potential business logic in OLTP applications. In the process of testing the database, transaction logic can cause obvious influence on deadlock possibility and distributed transaction proportion, thereby influencing the performance of the tested database. In this section, the invention first introduces the definition of the transactional logic and then presents the algorithm to extract the transactional logic.
The relationships between SQL parameters, and between SQL parameters and return items, in a transaction template determine the hidden semantics among SQL operations. After investigating existing OLTP benchmark loads and actual application loads, the present invention focuses on four types of relationships. First, the Equality Relationship (ER) is the most common: for example, two SQL parameters are equal with a certain probability. Second, the Inclusion Relationship (IR) is also common: because an SQL result set may be a set of tuples, the value of an SQL parameter may be one of the values in a previous result set. Third, the Linear Relationship (LR) complements and extends the equality relationship with greater expressive power. Fourth, the Between Relationship (BR) is proposed for predicates like "col between p1 and p2" and "col ≥ p1 and col ≤ p2", where there is a between-relationship between p2 and p1.
The transaction logic is formally defined below. O_i represents the i-th operation in the transaction template; p_{i,j} represents the j-th parameter of O_i; r_{i,j} represents the j-th return item of O_i; both i and j count from 1.
Definition 2 (Transaction logic): for a transaction template, the transaction logic consists of transaction structure information and parameter dependency information, detailed as follows:
transaction structure information:
#1: the execution probability of each branch in a branch structure.
#2: the average number of executions of an operation in a loop structure.
Parameter dependency information (for each parameter p_{i,j}):
$1: [p_{i,l}, p_{i,j}, BR, Δ]
$2: a list of dep-items, where dep-item ∈ {[p_{m,n}, p_{i,j}, ER, ξ], [p_{m,n}, p_{i,j}, LR, ξ, (a,b)], [r_{x,y}, p_{i,j}, ER, ξ], [r_{x,y}, p_{i,j}, LR, ξ, (a,b)], [r_{x,y}, p_{i,j}, IR, ξ]}; m ≤ i; x < i; if m = i, then n < j.
$3: a list of [p_{i,j}, LR, ξ, (a,b)]; O_i must be an operation in a loop structure.
If a parameter p_{i,j} has dependency $1, then neither $2 nor $3 need be present, since p_{i,j} can be expressed as (p_{i,l} + Δ), where Δ is the average increment calculated from the load trace. In each dependency, ξ represents the probability that the corresponding dependency is satisfied; (a, b) are the two coefficients describing a linear relationship. Dependency $3 represents a linear relationship between the values of the same parameter in consecutively executed loop operations.
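For clarity, the dependency records $1, $2 and $3 can be held in simple data structures; the following Python sketch is illustrative only, and the class and field names are assumptions rather than terms of the invention:

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class BetweenDep:                      # $1: p_{i,j} = p_{i,l} + delta (BR)
        source_param: Tuple[int, int]      # (i, l)
        delta: float

    @dataclass
    class DepItem:                         # one element of the $2 list
        source: Tuple[int, int]            # parameter (m, n) or return item (x, y)
        source_is_return: bool
        relation: str                      # "ER", "IR" or "LR"
        probability: float                 # xi
        linear_coeff: Optional[Tuple[float, float]] = None   # (a, b) for LR

    @dataclass
    class LoopDep:                         # one element of the $3 list
        probability: float                 # xi
        linear_coeff: Tuple[float, float]  # (a, b) across consecutive loop iterations

    @dataclass
    class ParameterDependency:             # all dependency information of one parameter p_{i,j}
        between: Optional[BetweenDep] = None
        dep_items: List[DepItem] = field(default_factory=list)
        loop_items: List[LoopDep] = field(default_factory=list)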
The transaction logic is an embodiment of application-layer business logic and does not change frequently, so the load trace over a long period does not need to be analyzed. Because the transaction logic analysis of each transaction template is identical and independent, the extraction algorithm of the invention is described for a single transaction template. The transaction logic extraction algorithm comprises six steps; its input is K transaction instances of a transaction template and the corresponding load trace. The algorithm is as follows:
Step 1: extract transaction structure information. By traversing the load trace, the number of executions of each operation within the transaction template is calculated, from which the execution probability of each branch and the average number of executions of each loop operation are derived.
Step 2: determine BR. First, the present invention identifies all parameter pairs <p_{i,l}, p_{i,j}> in the transaction template that satisfy BR, and then traverses the load trace to obtain the average increment Δ. The dependency $1 is then constructed for p_{i,j}, and the processing of steps 3-6 may skip p_{i,j}.
Step 3: collect ER and IR information. For each parameter p_{i,j} in the transaction template, the present invention traverses every parameter p_{m,n} preceding it and counts the number of transactions in which the parameter pair satisfies ER (i.e., p_{i,j} = p_{m,n}); similarly, it traverses every preceding return set r_{x,y} and counts the number of transactions satisfying ER and IR, respectively.
Step 4: collect LR information. LR only involves parameters and return sets of numeric type, and the return sets must come from operations filtered by a primary key. Since calculating the LR coefficients (a, b) requires two transaction instances, the present invention randomly selects N transaction instance groups (two per group) from the K transaction instances, and then calculates the LR coefficients (a, b) of the parameter pairs in each group. LR with coefficients (1, 0) is ignored here because it is already represented by ER.
Step 5: determine ER, IR and LR by a trade-off. Using the statistics obtained in steps 3-4, the present invention can readily construct the dependency $2 for each parameter p_{i,j}. However, for each parameter there may be many dependent items, and the probability (i.e., ξ) of some of them is small, so the present invention makes a trade-off among these dependent items to eliminate noise and reduce subsequent computation. The method keeps the most important dependent items, i.e., those with higher probability, while ensuring that the sum of the probabilities of the selected dependent items is less than 1. The present invention is more inclined to preserve ER, as experiments show ER is much more important than IR and LR.
Step 6: construct the LR of loop structures. The same operation in a loop structure runs multiple times, and the present invention uses the dependency $3 to describe the change of its parameters. By traversing the load trace, the changes of parameter values in consecutively executed loop operations are calculated; the coefficients (a, b) are calculated as in step 4, and the dependency $3 is then constructed from the statistics.
If steps 3-4 encounter an operation inside a loop, the present invention uses only the load trace of the loop's first execution.
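A minimal sketch of steps 3-5 is given below (counting ER matches and fitting LR coefficients from sampled pairs of transaction instances); representing a transaction instance as a dict from parameter positions to values is an assumption made purely for illustration:

    from collections import Counter
    import random

    def equality_probability(instances, target, candidate):
        # fraction of transaction instances in which the two positions carry equal values
        hits = sum(1 for t in instances if t.get(target) == t.get(candidate))
        return hits / len(instances)

    def linear_coefficients(instances, target, candidate, n_groups):
        # sample n_groups pairs of instances and solve y = a*x + b for each pair
        counter = Counter()
        for _ in range(n_groups):
            t1, t2 = random.sample(instances, 2)
            x1, y1 = t1[candidate], t1[target]
            x2, y2 = t2[candidate], t2[target]
            if x1 == x2:
                continue
            a = (y2 - y1) / (x2 - x1)
            b = y1 - a * x1
            if (round(a, 6), round(b, 6)) == (1.0, 0.0):
                continue                   # LR with coefficients (1, 0) is already covered by ER
            counter[(round(a, 6), round(b, 6))] += 1
        return counter.most_common(1)[0][0] if counter else None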
Data access distribution in the present invention:
data access distribution has a considerable influence on the conflict intensity of load transactions and the cache hit rate of a database system, and is always taken as an important characteristic of application load. In the part, the invention firstly describes basic oblique data access distribution, then describes the dynamics and continuity of the data access distribution, and finally provides a generation algorithm of candidate parameters.
The data access distribution of the synthetic load depends on the parameter values instantiated in the transaction templates. Without loss of generality, it is assumed that the predicates deciding the data access distribution in OLTP applications can all be represented as "col op para". The present invention represents the skewed data access distribution (S-Dist) of each parameter using a high frequency item set (HFI) and histogram statistics (HS) extracted from the load trace. The HFI records the H most frequently occurring hot data items. The value domain of the attribute is evenly divided into I intervals, and the frequency and cardinality of the parameters (excluding those already appearing in the HFI) on each interval are recorded as the HS. FIG. 3 is an example of S-Dist, where both H and I are 5 and the corresponding attribute is an integer with value domain [0, 2000]. In the HFI, the hottest item is 57 with an access frequency of 0.17; in the first interval of the HS, there are 20 unique parameter values with a total access frequency of 0.08.
The data items in the S-Dist all come from load traces running on real data, but the same data does not necessarily exist in the synthetic database generated by the present invention, so the present invention first performs a data transformation for the HFI. Assume that the attribute generator is: index ∈ [1, 400], value = index × 5, where 400 is the attribute cardinality. In the first step, the data items in the HFI are regenerated using the attribute generator; as shown in FIG. 4, 57 is replaced with 195. In the second step, a cumulative probability array, i.e., "cumu prob" in FIG. 4, is calculated from the frequencies of the high frequency items and of all the intervals. Finally, a random number between 0 and 1 is generated and mapped to one item of the cumulative probability array, thereby selecting an appropriate parameter to fill the predicate. FIG. 4 gives two examples of selecting parameter values.
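A minimal sketch of the FIG. 4 selection mechanism follows; representing HFI items as (value, frequency) pairs and intervals as (frequency, index_generator) pairs is an assumption made for illustration:

    import bisect
    import random

    def build_cumulative(hfi, intervals):
        # hfi: list of (value, frequency); intervals: list of (frequency, index_generator)
        probs, choices = [], []
        total = 0.0
        for value, freq in hfi:
            total += freq
            probs.append(total)
            choices.append(("hfi", value))
        for freq, gen in intervals:
            total += freq
            probs.append(total)
            choices.append(("interval", gen))
        return probs, choices

    def generate_parameter(probs, choices, index_value_converter):
        r = random.random()
        pos = min(bisect.bisect_left(probs, r), len(probs) - 1)
        kind, payload = choices[pos]
        if kind == "hfi":
            return payload                            # hot item, already in the synthetic domain
        return index_value_converter(payload())       # draw an index in the interval, convert to a value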
In addition, in order to control the cardinality of the generated parameters on each interval, the present invention redefines the random index generator as index = minIdx_i + ⌊⌊r × cdn_i⌋ × cdn_avg / cdn_i⌋, where r is a random number in [0, 1), cdn_i is the cardinality of the target interval, cdn_avg is the average cardinality per interval, and minIdx_i is the minimum index of the target interval. In FIG. 4, the random index generator for interval 2 uses cdn_2 = 50, cdn_avg = 400/5 = 80 and minIdx_2 = 80 × 2 + 1 = 161.
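A sketch of this interval-constrained random index generator is shown below, using the FIG. 4 numbers (cdn_2 = 50, cdn_avg = 80, minIdx_2 = 161); since the original formula is only available as an image, the expression here follows the reconstruction above and is an assumption consistent with the surrounding text:

    import random

    def interval_index_generator(cdn_i, cdn_avg, min_idx_i):
        def gen():
            slot = int(random.random() * cdn_i)              # one of cdn_i distinct slots
            return min_idx_i + int(slot * cdn_avg / cdn_i)   # spread the slots over the interval's index range
        return gen

    gen_interval_2 = interval_index_generator(cdn_i=50, cdn_avg=80, min_idx_i=161)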
Although the parameters in the above examples are of integer type, the method of the present invention is general. For all numeric parameters on non-key-value attributes, the S-Dist and parameter generation are exactly the same. For parameters on key-value attributes there are small differences: because the primary keys in the synthetic database are generated sequentially, the value domain of a key-value attribute in the synthetic database may differ from that of the primary key in the real database. Therefore, when collecting S-Dist statistics, the present invention uses the value domain of the key-value attribute in the real database to partition the intervals and construct the HS, whereas during synthetic load generation it uses the value domain of the key-value attribute in the synthetic database to support parameter generation. For string type parameters, the main difference is how the intervals are divided: when constructing the HS, the interval to which a string parameter belongs is calculated as h % I, where h is the hash value of the parameter. The index value converter is consistent with the one used for synthetic database generation.
Dynamics. If the data access distribution changes dynamically, S-Dist is inaccurate or even completely wrong. Assume a table with 100 records and a load lasting 100 seconds, where in the i-th second all database requests of the load access only the i-th record of the table, and the throughput of the database is stable during this period. If only S-Dist is used to represent the whole data access process, it appears that there is no hot data and the data access distribution is very uniform, which is clearly contrary to the fact. With a synthetic load generated from this S-Dist, the conflict intensity of transactions on the database would be much lower than that of the actual load. Therefore, the present invention proposes D-Dist, which adds a description of dynamics on top of S-Dist. First, the load trace of a parameter is divided into multiple equal-length time windows according to the log timestamps. Then, for the parameter trace in each time window, the present invention builds a separate S-Dist, and defines the D-Dist of the entire parameter trace as the list of these S-Dists. Finally, the present invention instantiates the parameter using the S-Dist corresponding to the generation time. Furthermore, for numeric parameters, the range of data accessed within a time window may be much smaller than the attribute's full domain; to improve the accuracy of the HS, the intervals may be divided according to the data range of the current window. Of course, when generating parameters, the corresponding index ranges of these intervals must also be used.
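Conceptually, D-Dist is simply a list of per-window S-Dists selected by generation time; a minimal sketch (class and attribute names are illustrative):

    class DDist:
        def __init__(self, s_dists, window_length):
            self.s_dists = s_dists              # one S-Dist per time window
            self.window_length = window_length  # same time unit as the elapsed generation time

        def s_dist_at(self, elapsed_time):
            window = int(elapsed_time // self.window_length)
            return self.s_dists[min(window, len(self.s_dists) - 1)]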
Continuity. In some applications, the heat of data is closely related to time; in particular, the same data may be accessed continuously for a period of time. This is referred to herein as the continuity of the data access distribution. D-Dist, as introduced above, captures the skew of data access within a time window but ignores the continuity of data access between consecutive time windows. When it is used to generate a synthetic load, the data accessed in consecutive time windows may be quite different, resulting in a lower cache hit rate. Therefore, the present invention proposes C-Dist, which adds a description of continuity on top of D-Dist. When collecting statistics, the present invention calculates the repetition rate between the high frequency items of the current time window and those of the previous time window, as well as the parameter repetition rate of every interval. FIG. 5 adds the repetition rates of the HFI and the HS to the example in FIG. 3. In this example, the repetition rate of the HFI is 0.6, i.e., three high frequency items are retained from the previous time window. The repetition rates of the five intervals are 0, 0.33, 0.5, 0.46 and 0.56, respectively. Suppose cdn_1 = 15; then 15 × 0.33 ≈ 5 parameters in interval 1 appeared in the previous time window.
In order to guarantee the repetition rate of the parameters in the C-Dist, the invention needs to pre-generate candidate parameters for each time window. A detailed generation process of the candidate parameters is given in algorithm 1. In lines 1-2, all high frequency terms are generated that meet the expected repetition rate. In lines 3-10, all parameters in the previous time window are traversed and a repeat parameter is selected for each interval until the parameter repetition rate over the interval is met. On line 6, the index of the parameter is derived from its value, identifying its belonging interval in the current time window. If the index is not in the index field of the current time window, the parameter is ignored. For string type parameters such as "296 # dgtckuy", the preceding part of the "#" character is the index required by the present invention. Finally, in lines 11-13, the present invention generates random parameters that are added to each interval to achieve the cardinality requirement. Based on these candidate parameters, the present invention instantiates the parameters using the parameter generation mechanism of FIG. 4. Within a certain determined interval, the method only needs to randomly select one candidate parameter as output. In addition, if candidate parameters are generated online in the synthetic load generation process, the load generator may become a performance bottleneck, thereby affecting the correctness of the evaluation result. Thus, the present invention can generate candidate parameters for all time windows off-line and store them on disk, and then read them as needed when generating the composite load.
(Algorithm 1: candidate parameter generation.)
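Since the pseudocode of Algorithm 1 is only available as a figure, the following Python sketch reproduces the steps described above; the data structures (interval index ranges, per-interval repetition rates and cardinalities, the value-to-index mapping) are assumptions made for illustration:

    import random

    def locate_interval(index, interval_index_ranges):
        for i, (lo, hi) in enumerate(interval_index_ranges):
            if lo <= index <= hi:
                return i
        return None                                  # index outside the current window's index domain

    def generate_candidates(prev_hfi, prev_params, hfi_rep_rate, interval_rep_rates,
                            interval_cardinalities, interval_index_ranges,
                            value_to_index, index_to_value):
        # lines 1-2: keep part of the previous high-frequency items to meet the HFI repetition rate
        # (new hot items that would top up the HFI to H entries are omitted for brevity)
        keep = int(round(len(prev_hfi) * hfi_rep_rate))
        hfi = random.sample(prev_hfi, keep)

        candidates = [[] for _ in interval_cardinalities]
        targets = [int(round(c * r)) for c, r in zip(interval_cardinalities, interval_rep_rates)]

        # lines 3-10: reuse parameters of the previous time window, per interval, up to the repetition target
        for value in prev_params:
            index = value_to_index(value)
            interval = locate_interval(index, interval_index_ranges)
            if interval is None:
                continue
            if len(candidates[interval]) < targets[interval]:
                candidates[interval].append(value)

        # lines 11-13: fill each interval with fresh random parameters up to its cardinality
        for i, (lo, hi) in enumerate(interval_index_ranges):
            while len(candidates[i]) < interval_cardinalities[i]:
                candidates[i].append(index_to_value(random.randint(lo, hi)))
        return hfi, candidates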
Load generation in this embodiment:
given the transaction logic of each transaction template and the data access distribution of each parameter, the load Generator (Workload Generator) in FIG. 1 is responsible for generating the composite load that satisfies the specified configuration. At the same time, efficiently generating high-concurrency, high-throughput composite loads in a distributed environment is also a fundamental requirement for the load generator of the present invention. The details of load generation will be described below from three levels of threading model, transaction execution, and parameter instantiation.
Threading model. When deploying the load generator, the user can configure the number of test nodes and the number of test threads on each node to simulate concurrency. For each test thread, the present invention establishes a separate database connection. Two execution models are implemented for test threads to issue transactions: no-wait loop and fixed throughput. In the no-wait-loop setting, all test threads issue transactions without stopping, with no think time between requests. In the fixed-throughput setting, the user may specify a fixed request throughput or a throughput scaling factor; if a throughput scaling factor is specified, the product of the throughput in each time window and the scaling factor is taken as the target throughput of that window. A test thread achieves the required throughput by controlling the think time between transaction requests. When the required throughput exceeds the maximum throughput the current test threads can achieve, the execution model falls back to the no-wait loop. These different execution models enable the present invention to build synthetic loads with scalability.
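A minimal sketch of the fixed-throughput model for a single test thread is shown below; the per-thread target rate and the issue_transaction callback are illustrative assumptions:

    import time

    def run_fixed_throughput(issue_transaction, target_tps_per_thread, duration_s):
        interval = 1.0 / target_tps_per_thread
        end = time.time() + duration_s
        while time.time() < end:
            start = time.time()
            issue_transaction()
            think = interval - (time.time() - start)
            if think > 0:
                time.sleep(think)   # think time; if the thread cannot keep up, this
                                    # degenerates into the no-wait loop described above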
Transaction execution. A test thread invokes different types of transactions based on the transaction proportions extracted from the load trace; the transaction proportions are adjusted periodically per time window. When executing a transaction, the structure information of the transaction logic is used to determine whether operations in a branch structure need to be executed and how many times operations in a loop structure are executed. For the execution of an SQL operation, the present invention first instantiates all parameters one by one and then sends the operation with concrete parameter values to the test database. After the operation is executed, the result set and parameters are saved as intermediate state for generating other parameters of subsequent operations within the same transaction instance.
Parameter instantiation. When instantiating parameters, consistency of the transaction logic is guaranteed first, and then the data access distribution of the synthetic load. For a parameter, there are the following cases. Case 1: if only dependency $1 exists, the value of the parameter can be calculated directly from the increment Δ and the associated smaller parameter. Case 2: if only dependency $2 exists, the present invention first attempts to instantiate the parameter by randomly selecting a dependent item according to its probability; when no dependent item is selected, the parameter is instantiated using the data access distribution. Case 3: if dependencies $2 and $3 both exist, the corresponding operation must be in a loop structure. On the first execution of the loop, as in case 2, the parameter is instantiated based on dependency $2 and the data access distribution; for non-first loop executions, the present invention first attempts to instantiate the parameter using dependency $3 according to its probability, and if no dependent item is selected, instantiates the parameter using dependency $2 and the data access distribution.
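A sketch of the three instantiation cases is given below, reusing the dependency records sketched earlier; the `intermediate` map of saved parameter values and result sets, and the fallback `data_access_dist` callable, are illustrative assumptions:

    import random

    def pick_by_probability(items):
        r, acc = random.random(), 0.0
        for item in items:
            acc += item.probability
            if r < acc:
                return item
        return None                                      # no dependent item selected

    def instantiate(dep, intermediate, data_access_dist,
                    first_loop_iteration=True, previous_loop_value=None):
        if dep.between is not None:                      # case 1: only $1
            return intermediate[dep.between.source_param] + dep.between.delta

        if dep.loop_items and not first_loop_iteration:  # case 3, non-first loop execution
            item = pick_by_probability(dep.loop_items)
            if item is not None:
                a, b = item.linear_coeff
                return a * previous_loop_value + b

        item = pick_by_probability(dep.dep_items)        # case 2 (and case 3, first execution)
        if item is not None:
            source_value = intermediate[item.source]
            if item.relation == "ER":
                return source_value
            if item.relation == "IR":
                return random.choice(source_value)       # pick a value from an earlier result set
            a, b = item.linear_coeff                     # LR
            return a * source_value + b

        return data_access_dist()                        # fall back to the data access distribution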
In summary, transaction execution and parameter instantiation are independent across test threads, so the load generator of the present invention can be deployed on multiple nodes to efficiently generate high-concurrency, high-throughput synthetic loads while satisfying the required load characteristics and configurations.
Examples
Experimental Environment
Experimental hardware configuration: 4 physical nodes, each node has 2 CPUs, and the model is Intel Xeon Silver4110@2.1 GHz; the memory is 120 GB; the storage is 4TB, RAID-5, 4GB RAID cache. The physical nodes communicate with each other by using a gigabit Ethernet.
Experiment one: using the TPC-C load as the simulation target and varying the scale factor, the fidelity of the synthetic load is examined by comparing the throughput, latency, CPU utilization and disk utilization of the real load (i.e., the TPC-C load) and the synthetic load on the same database. The experimental database is PostgreSQL. The experimental results are shown in FIG. 6. The concurrency of database requests is the same as the scale factor. FIG. 6a presents the transaction execution throughput of the real load and the synthetic load; the results show that the throughputs of the two loads are very similar, with a maximum deviation of 6.29%. The average latency and the 95% latency are shown in FIG. 6b; on these two indexes, the synthetic load is very close to the real load, with a maximum deviation of only 8.99%. FIG. 6c and FIG. 6d show the CPU and disk utilization of the two loads, respectively. The results show that the resource consumption of executing the synthetic load and the real load on the PostgreSQL database is consistent, which further verifies the high fidelity of the synthetic load generated by the present invention.
Experiment two: this experiment demonstrates the ability of the data access distributions (i.e., S-Dist, D-Dist, and C-Dist) to describe the skewness, dynamics, and continuity of data access. Since the data access distribution of existing benchmark loads is typically neither dynamic nor continuous, an evaluation load is built based on YCSB. The experiments were performed on a MySQL database containing a test table from YCSB. The size of the test table is 10^6 records, and the number of concurrent database requests is 20.
The evaluation load in FIG. 7 has only one transaction type. The transaction consists of five pairs of read-write operations; each pair first reads a record and then updates it. The extended YCSB load runs for 90 seconds and is divided into three phases, and the data requests of each phase are randomly selected from 10^3 records. In the first phase, the data access distribution is a Zipf distribution with parameter s = 1; the second phase is still a Zipf distribution but with parameter s = 1.2; the third phase is a uniform distribution. FIG. 7 shows the dynamic variation of transaction throughput, latency and deadlock count for YCSB and the load generated by the present invention. From the results, it can be seen that when D-Dist is used, the synthetic load produced by the present invention is dynamically consistent with the real load produced by YCSB in terms of throughput, latency and deadlocks, indicating that D-Dist describes the dynamics of the workload well. Meanwhile, within each time window D-Dist is represented by an S-Dist, which also indicates that S-Dist characterizes the skewness of the load well. The global S-Dist, however, performs poorly, because it is defined over the entire load duration and ignores the dynamic changes of the load.
The evaluation load in FIG. 8 is the single-row update transaction of YCSB, running for 100 seconds with a time window of 1 second. The data requests in each time window are issued against 10^3 randomly selected records, with 50% of the selected records in each time window coinciding with the previous window. innodb_buffer_pool_size of MySQL is set to 16 MB. FIG. 8 shows the throughput and the Innodb_buffer_pool_reads increments of the loads generated by YCSB and by the present invention, respectively; Innodb_buffer_pool_reads is the number of logical reads that InnoDB has to serve directly from disk. The results show that with D-Dist the disk access rate is significantly higher than that of YCSB and the throughput is lower. This is because D-Dist cannot capture the continuity of the data access distribution, resulting in almost completely different data requests in each time window and a low cache hit rate. The load performance using C-Dist is consistent with YCSB, which indicates that C-Dist characterizes the continuity of data access well.
Experiment three: with the number of transaction instances (K) and the number of transaction instance groups (N) kept equal, K and N are varied simultaneously, and the performance of the transaction logic extraction algorithm is observed through execution time and memory consumption. The experimental results are shown in FIG. 9. From the results, it can be seen that when K and N are both 10^4, the transaction logic extraction time is only 2.1 seconds and the memory consumption is 1.1 GB. As K and N increase, execution time and memory consumption grow almost linearly. In general, the transaction logic extraction proposed herein is efficient and can be completed within a few seconds.
Experiment four: the length of the load trace is varied, and the performance of the C-Dist extraction algorithm is observed through execution time and memory consumption. The experimental results are shown in FIG. 10. The results show that the C-Dist extraction time is linear in the length of the load trace, while the memory consumption is constant. This is because C-Dist is a window-based data access distribution, and the load trace of each time window can be deleted from memory after it has been processed. In FIG. 10, when the TPC-C load trace covers 10^4 seconds (transaction throughput 3610.3, log volume 33.8 GB), the C-Dist extraction time is 678.7 s and the memory consumption is 1.2 GB. Since the maximum load period in an actual evaluation is typically one day, the present invention can effectively support performance evaluation of high-throughput loads.
The protection scope of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art without departing from the spirit and scope of the inventive concept are included in the present invention, and the scope of protection is defined by the appended claims.

Claims (11)

1. An application-oriented transaction load generation system, comprising: a database generation module and a load generation module; wherein,
the database generation module is configured to generate a test database by acquiring a database schema and data characteristics;
the load generation module generates a synthetic load similar to the real load by analyzing the load trace of the real application; the load generation module comprises:
a transaction logic analyzer, which extracts transaction logic information by analyzing the complete load trace over a short time period;
a data access distribution analyzer, which extracts data access distribution information and throughput information by analyzing the partial load trace over a longer time period;
and a load generator, which instantiates the parameter values in the transaction templates by using the previously extracted transaction logic information, data access distribution information and throughput information to generate the synthetic load.
2. The application-oriented transaction load generation system according to claim 1, wherein the data characteristics are automatically obtained by a data characteristic extractor using SQL queries.
3. The application-oriented transaction load generation system according to claim 1, wherein the load generator can configure the number of test nodes and the number of test threads on each node to simulate concurrency, and a separate database connection is established for each test thread.
4. The application-oriented transaction load generation system according to claim 1, wherein the load generator uses the structure information of the transaction logic when executing transactions to determine whether the operations in a branch structure need to be executed and how many times the operations in a loop structure are executed; for the execution of an SQL operation, all parameters are instantiated one by one, and the operation with concrete parameter values is then sent to the test database; after the operation is executed, the result set and the parameter values are saved as intermediate state for generating other parameters in subsequent operations within the same transaction instance.
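Illustrative sketch (not part of the claims): the Python fragment below walks a transaction template whose operations may sit in branch or loop structures, instantiates parameters one by one, and keeps the result set and parameter values as intermediate state for later operations in the same transaction instance. The template representation (a list of dicts), its keys, and the instantiate/execute callbacks are assumptions made only for this example.

import random

def run_transaction(template, instantiate, execute):
    # intermediate state: parameter values and result sets of operations already executed
    state = {"params": {}, "results": {}}
    for op in template:                             # op: dict describing one SQL operation
        if op.get("branch_prob") is not None and random.random() >= op["branch_prob"]:
            continue                                # branch structure: execute with probability p
        repetitions = op.get("loop_count", 1)       # loop structure: number of executions
        for i in range(repetitions):
            # instantiate is assumed to close over the data access distribution
            values = [instantiate(p, state, first_loop_pass=(i == 0)) for p in op["params"]]
            result = execute(op["sql"], values)     # send the operation with concrete values
            state["results"][op["id"]] = result     # save result set as intermediate state
            for p, v in zip(op["params"], values):
                state["params"][p] = v              # save parameter values as intermediate state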
5. The application-oriented transaction load generation system according to claim 1, wherein, for a parameter, if only dependency $1 is relied upon, the load generator calculates the value of the parameter directly from the increment and the associated smaller parameter; if only dependency $2 is relied upon, the load generator first attempts to instantiate the parameter by randomly selecting a dependent item according to the probability of each dependent item, and instantiates the parameter using the data access distribution when no dependent item is selected; if both dependencies $2 and $3 exist, the corresponding operation must be in a loop structure: on the first execution of the loop, the parameter is instantiated based on dependency $2 and the data access distribution; on subsequent loop executions, the load generator first attempts to instantiate the parameter using dependency $3 according to its probability, and uses dependency $2 and the data access distribution if no dependent item is selected.
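Illustrative sketch (not part of the claims): the Python fragment below encodes the three cases of the parameter instantiation rule described above. The dependency representation (fields dep1, dep2, dep3, the probability-weighted items and the state dictionary) is an assumption introduced only for the example, not the patent's data structures.

import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class Dep1:                       # "$1": arithmetic relation to an earlier, smaller parameter
    base: str                     # name of the earlier parameter
    delta: float                  # average increment observed in the trace

@dataclass
class DepChoice:                  # "$2" / "$3": probability-weighted dependent items
    items: List[Tuple[Callable[[dict], object], float]] = field(default_factory=list)

@dataclass
class Param:
    dep1: Optional[Dep1] = None
    dep2: Optional[DepChoice] = None
    dep3: Optional[DepChoice] = None

def pick(dep: Optional[DepChoice]):
    # select at most one dependent item according to its probability; may select none
    if dep is None:
        return None
    r, acc = random.random(), 0.0
    for fn, p in dep.items:
        acc += p
        if r < acc:
            return fn
    return None

def instantiate(param: Param, state: dict, dist_sample, first_loop_pass=True):
    if param.dep1 and not param.dep2 and not param.dep3:
        return state[param.dep1.base] + param.dep1.delta   # $1 only: base value plus increment
    if param.dep2 and not param.dep3:                      # $2 only
        fn = pick(param.dep2)
        return fn(state) if fn else dist_sample()
    if param.dep2 and param.dep3:                          # loop structure
        if not first_loop_pass:
            fn = pick(param.dep3)                          # later passes: try $3 first
            if fn:
                return fn(state)
        fn = pick(param.dep2)                              # first pass or fallback: $2
        return fn(state) if fn else dist_sample()
    return dist_sample()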
6. An application-oriented transaction load generation method, characterized by comprising the steps of:
step A: generating a test database by acquiring a database schema and data characteristics;
step B: generating a synthetic load similar to the real load by analyzing the load trace of the real application, comprising:
step B1: extracting transaction logic information by analyzing the complete load trace over a short time period;
step B2: extracting data access distribution information and throughput information by analyzing the partial load trace over a longer time period;
step B3: instantiating parameter values in the transaction templates by using the previously extracted transaction logic information, data access distribution information and throughput information to generate the synthetic load.
7. The application-oriented transaction load generation method according to claim 6, wherein in step A, the test database consists of a plurality of tables satisfying primary key constraints, foreign key constraints and non-key-value attribute data characteristics; step A specifically comprises the following steps:
step A1: generating primary keys in sequence;
step A2: when the foreign key is generated, the foreign key is randomly generated in the value range of the primary key which is referred to by the foreign key;
step A3: values of non-key-value attributes are generated by a random attribute generator that includes a random index generator and an index value translator, while satisfying desired data characteristics.
8. The application-oriented transaction load generation method according to claim 7, wherein, before the primary key values are generated, their value ranges are first determined: first, if the primary key comprises only a single attribute, its value range is [1, s], where s is the size of the table; second, the value range of a foreign key attribute is determined by the primary key it references; third, when determining the value range of the non-foreign-key attribute in a composite primary key, there is only one non-foreign-key attribute in the composite primary key, and its value range is given by the formula
Figure FDA0002182117710000021
where d_fk denotes the value range of one of the foreign key attributes in the composite primary key; the second and third steps are performed multiple times if cascading references are involved.
9. The application-oriented transaction load generation method according to claim 7, wherein the output of the random index generator is an integer from 1 to n, where n is the cardinality of the attribute; given an index, the index value converter deterministically maps it to a value in the attribute value domain; different index value converters are adopted according to the data type of the attribute: for numeric types, a linear function is used that maps indexes uniformly onto the attribute value range; for string types, seed strings satisfying the length requirement are randomly generated in advance, a seed string is first selected according to the input index, and the index and the selected seed string are then concatenated as the output value.
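Illustrative sketch (not part of the claims): a Python version of the random attribute generator described in claims 7 and 9. The structure mirrors the claim (a uniform random index plus a deterministic index-to-value converter per data type), while the seed count, seed length and the concrete linear mapping are assumptions.

import random
import string

def random_index(cardinality):
    # random index generator: uniform integer in [1, cardinality]
    return random.randint(1, cardinality)

def make_numeric_converter(lo, hi, cardinality):
    # linear function mapping indexes uniformly onto the numeric value range [lo, hi]
    def convert(idx):
        return lo + (hi - lo) * (idx - 1) / max(cardinality - 1, 1)
    return convert

def make_string_converter(seed_count=16, seed_len=8):
    # pre-generate seed strings of the required length; the output concatenates
    # the index with a seed string selected by the index, so it is deterministic
    seeds = ["".join(random.choices(string.ascii_letters, k=seed_len))
             for _ in range(seed_count)]
    def convert(idx):
        return f"{idx}-{seeds[idx % seed_count]}"
    return convert

if __name__ == "__main__":
    n = 1000                                      # attribute cardinality
    to_number = make_numeric_converter(0.0, 500.0, n)
    to_string = make_string_converter()
    idx = random_index(n)
    print(to_number(idx), to_string(idx))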
10. The application-oriented transactional load generation method according to claim 6, wherein in step B1, the transactional logic extraction algorithm comprises:
step B11: counting the number of executions of each operation in the transaction template by traversing the load trace, and from this calculating the execution probability of each branch and the average number of executions of each loop operation;
step B12: identifying all parameter pairs <p_{i,l}, p_{i,j}> in the transaction template that satisfy BR, then traversing the load trace to obtain the average increment delta, and constructing dependency $1 for p_{i,j};
step B13: for the parameter p_{i,j} in each transaction template, traversing each preceding parameter p_{m,n} and counting the number of transactions whose parameter pair satisfies ER; similarly, traversing each preceding result set r_{x,y} and counting the number of transactions satisfying ER and IR, respectively;
step B14: randomly selecting N transaction instance groups from the K transaction instances, two per group; then calculating LR coefficients (a, b) of the parameter pairs in each group of transactions;
step B15: constructing a dependency $2 for each parameter pi, j using the statistics obtained in steps B13-B14;
step B16: the same operation in a loop structure runs multiple times, with dependency $3 describing the change in parameters therein; calculating the change of parameter values in continuously executed cyclic operation by traversing the load track; the coefficients (a, B) are calculated as in step B14, and then the dependency $3 is constructed from the statistical information.
11. The application-oriented transaction load generation method according to claim 6, wherein step B specifically comprises:
step B21: generating all high-frequency items that satisfy the expected repetition rate;
step B22: traversing all parameters in the previous time window, and selecting repeated parameters for each interval until the parameter repetition rate of the interval is met;
step B23: deriving the index of a parameter from its value, thereby identifying its interval in the current time window; if the index is not in the index domain of the current time window, ignoring the parameter;
step B24: generating random parameters to add to each interval so as to meet the cardinality requirement; when a parameter is instantiated with the candidate-parameter-based generation mechanism, only one candidate parameter within the given interval needs to be randomly selected as the output.
CN201910800259.7A 2019-08-28 2019-08-28 Application-oriented transaction load generation system and transaction load generation method Active CN112241354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910800259.7A CN112241354B (en) 2019-08-28 2019-08-28 Application-oriented transaction load generation system and transaction load generation method

Publications (2)

Publication Number Publication Date
CN112241354A true CN112241354A (en) 2021-01-19
CN112241354B (en) 2022-07-29

Family

ID=74168133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910800259.7A Active CN112241354B (en) 2019-08-28 2019-08-28 Application-oriented transaction load generation system and transaction load generation method

Country Status (1)

Country Link
CN (1) CN112241354B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270595A1 (en) * 2007-04-30 2008-10-30 Jerome Rolia System and method for generating synthetic workload traces
US20130254210A1 (en) * 2008-12-30 2013-09-26 Teradata Corporation Index selection in a multi-system database management system
US20160371353A1 (en) * 2013-06-28 2016-12-22 Qatar Foundation A method and system for processing data
CN104734918A (en) * 2015-03-25 2015-06-24 浪潮集团有限公司 Lightweight webpage performance testing architecture and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIAN WEINING et al.: "Challenges and research progress of benchmarks for big data management systems", Big Data *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932655A (en) * 2023-09-18 2023-10-24 成都市杉岩科技有限公司 Distributed key value database operation method and computer readable storage medium
CN116932655B (en) * 2023-09-18 2023-11-24 成都市杉岩科技有限公司 Distributed key value database operation method and computer readable storage medium

Also Published As

Publication number Publication date
CN112241354B (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN105989194B (en) Method and system for comparing table data
US10754856B2 (en) System and method for optimizing large database management systems using bloom filter
Sewall et al. PALM: Parallel architecture-friendly latch-free modifications to B+ trees on many-core processors
US10824622B2 (en) Data statistics in data management systems
Piatov et al. An interval join optimized for modern hardware
Sun et al. On supporting efficient snapshot isolation for hybrid workloads with multi-versioned indexes
Li et al. Xuanyuan: An ai-native database.
Kan et al. Topology modeling and analysis of a power grid network using a graph database
Cheng et al. Efficiently handling skew in outer joins on distributed systems
Sen et al. Characterizing resource sensitivity of database workloads
Makreshanski et al. Many-query join: efficient shared execution of relational joins on modern hardware
Grund et al. An overview of HYRISE-a Main Memory Hybrid Storage Engine.
CN112241354B (en) Application-oriented transaction load generation system and transaction load generation method
Shetty et al. Growth of relational model: Interdependence and complementary to big data
Awada et al. Cost Estimation Across Heterogeneous SQL-Based Big Data Infrastructures in Teradata IntelliSphere.
Lima et al. OLAP query processing in a database cluster
Watson et al. Daskdb: Scalable data science with unified data analytics and in situ query processing
CN112784435B (en) GPU real-time power modeling method based on performance event counting and temperature
Qu et al. Application-oriented workload generation for transactional database performance evaluation
Meyer et al. Assessing the suitability of in-memory databases in an enterprise context
Mihaylov et al. Scalable learning to troubleshoot query performance problems
Hauck et al. Highspeed graph processing exploiting main-memory column stores
Ayub et al. Performance comparison of in-memory and disk-based Databases using transaction processing performance council (TPC) benchmarking
Zhang et al. HG-Bitmap join index: A hybrid GPU/CPU bitmap join index mechanism for OLAP
Han et al. PI-Join: Efficiently processing join queries on massive data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211009

Address after: 200062 No. 3663, Zhongshan North Road, Putuo District, Shanghai

Applicant after: EAST CHINA NORMAL University

Applicant after: Pingkai star (Beijing) Technology Co.,Ltd.

Address before: 200062 No. 3663, Zhongshan North Road, Putuo District, Shanghai

Applicant before: East China Normal University

Applicant before: Beijing Pingkai Star Technology Development Co.,Ltd.

GR01 Patent grant