CN113157541B

CN113157541B - Multi-concurrency OLAP type query performance prediction method and system for distributed database

Info

Publication number: CN113157541B
Application number: CN202110425574.3A
Authority: CN
Inventors: 李晖; 丁玺润; 闵圣天; 戴震宇
Original assignee: Guizhou Youlian Borui Technology Co ltd
Current assignee: Guizhou Youlian Borui Technology Co ltd
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2024-04-05
Anticipated expiration: 2041-04-20
Also published as: CN113157541A

Abstract

The embodiments of the invention relate to the technical field of data processing and disclose a multi-concurrency OLAP type query performance prediction method and system for a distributed database. The method comprises the following steps: calculating the interference degree, calculating the sensitivity, and predicting the delay. The system comprises a query interference degree calculation module, a query sensitivity calculation module, a cache module and a query delay calculation module. Compared with the prior art, the method and system occupy fewer resources for query optimization, the algorithm is clear and simple, and the performance requirement is low, so they are easier to deploy in actual use, which ensures practicability.

Description

Multi-concurrency OLAP type query performance prediction method and system for distributed database

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to database data processing technology.

Background

Executing queries in parallel in a database brings many advantages. For example, it can shorten the overall run time of multiple queries and increase hardware utilization. For any one of the concurrent queries, however, the execution time may be longer or shorter than when it is executed alone. The main reason is the interaction among the queries: some interactions speed up a query, while others slow it down because of competition for resources.

Concurrent query performance prediction has great application value for query scheduling, among other uses. For example, if query execution times are known in advance, the order of multiple queries can be changed so that the user's SLA requirements are met. Accurate query performance prediction can also drive a query progress display: knowing the progress of the current query, the DBA can decide whether to wait for the query to finish or to kill it. Query performance prediction can also guide the query optimizer, which can then create concurrency-aware query plans that shorten the overall execution time.

Because query performance prediction is so valuable, it has been studied extensively. The research addresses two types of queries. The first is the OLTP type query, which refers mainly to transactional queries in a relational database that have strict timing requirements and generally short execution times. The second is the OLAP type query, which is mainly used in data warehouses, faces relatively large data volumes and has relatively long execution times. This document is primarily directed to OLAP type queries. Some existing techniques can make performance predictions for analytical queries, but they have certain limitations in terms of practicality and extensibility.

The inventor finds that at least the following problems exist in the prior art: the prior art can predict the performance of the analysis type query, but the techniques have certain limitations in the aspects of practicability and expansibility.

Disclosure of Invention

The embodiments of the invention aim to provide a multi-concurrency OLAP type query performance prediction method and system for a distributed database in which query optimization occupies fewer resources, thereby ensuring practicability, and in which the calculation process is clear, simple and easy to enhance, thereby ensuring extensibility.

In order to solve the technical problems, the embodiment of the invention provides a multi-concurrency OLAP type query performance prediction method for a distributed database, which comprises the following steps:

calculating the interference degree: based on the query request, calculating the occupation condition of the computing resources related to the query request to obtain the query interference degree;

calculating sensitivity: based on the query request, calculating by combining the query interference degree to obtain the query sensitivity;

prediction delay: a query delay is calculated based on the query sensitivity.

The computing resource occupation specifically includes: the time the query request takes when executed alone, the percentage of the query request's total running time that is I/O time, the I/O time shared by the master query and a concurrent query, the I/O time shared between concurrent queries, and the network interference of the concurrent queries on the master query.

In the step of calculating the interference degree, the query interference degree is calculated in the following manner:

wherein the terms denote, in order: the time the query request takes when executed alone; the percentage of the query request's total run time that is I/O time; the I/O time shared by the master query and a concurrent query; the I/O time shared between concurrent queries; and the network interference of the concurrent queries on the master query.

The query sensitivity is a linear dependent variable of the query interference degree, and a plurality of groups of query sensitivity and query interference degree values are adopted for training to obtain a linear relation parameter; the query sensitivity for training is calculated based on the query delay, the time the query request is executed in the worst environment, and the time the query request is executed alone.

The query delay used for training is obtained by measurement.

The linear relationship between the query sensitivity and the query interference is that,

$c_{q,m} = \mu_q \cdot \gamma_{q,m} + b_q$

where $\mu_q$ and $b_q$ are the linear relationship parameters.

The query sensitivity for training is calculated by the following method:

where $\tau_{q,m}$ is the query delay, $\tau_{maxq}$ is the execution time of the query request in the worst-case environment, and $\tau_{minq}$ is the time the query request takes when executed alone.

In the step of predicting the delay, the query delay is calculated based on the following formula:

where $c_{q,m}$ is the query sensitivity, $\tau_{q,m}$ is the query delay, $\tau_{maxq}$ is the execution time of the query request in the worst-case environment, and $\tau_{minq}$ is the time the query request takes when executed alone.

The query request is a primary query and/or a concurrent query.

The embodiment of the invention also provides a multi-concurrency OLAP type query performance prediction system for a distributed database, comprising: a query interference degree calculation module, a query sensitivity calculation module, a cache module and a query delay calculation module, wherein

the query interference degree calculation module is used to perform the interference degree calculation described above;

the query sensitivity calculation module is used to perform the sensitivity calculation described above;

the cache module is used to cache the query interference degree, query sensitivity and query delay data produced during the calculation;

the query delay calculation module is used to perform the delay prediction described above.

Compared with the prior art, the method and system occupy fewer resources for query optimization, the algorithm is clear and simple, and the performance requirement is low, so they are easier to deploy in actual use, which ensures practicability.

In addition, the data processing process of the embodiment of the invention is clear and easy to understand, and is easy to further improve subsequently, so that the invention is easy to enhance, and the expansibility is ensured.

In addition, the calculation is not only fast: by taking network resource overhead into account, it also accurately computes the query delay, which reflects query performance.

The foregoing is only an overview of the technical solution of the present invention. Specific embodiments are described below so that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention become more readily apparent.

Drawings

One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.

FIG. 1 is a flowchart of a multi-concurrency OLAP type query performance prediction method for a distributed database according to the first to ninth embodiments of the present invention;

FIG. 2 is a schematic connection diagram of a distributed database oriented multi-concurrent OLAP query performance prediction system module according to a tenth embodiment of the present invention;

FIG. 3 is a flowchart of a method for predicting multi-concurrency OLAP query performance for a distributed database according to an eleventh embodiment of the present invention;

FIG. 4 is a graph comparing predicted query results of different I/O contention optimizations for experiments performed by a multi-concurrent OLAP query performance prediction method for a distributed database according to an eleventh embodiment of the present invention;

FIG. 5 is a comparison chart of predicted query results for different numbers of concurrent queries in experiments performed with the multi-concurrency OLAP type query performance prediction method for a distributed database according to an eleventh embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that numerous technical details are set forth in the embodiments to help the reader better understand the present application. The technical solutions claimed in the present application can nevertheless be implemented without these technical details and with various changes and modifications based on the following embodiments. The embodiments are divided for convenience of description only; the division should not be construed as limiting the specific implementation of the invention, and the embodiments may be combined with and refer to one another where no contradiction arises.

The first embodiment of the invention relates to a multi-concurrency OLAP type query performance prediction method for a distributed database. The core of the embodiment is to calculate the query interference degree and the query sensitivity and to derive the query delay from them, so that the performance of database queries (especially concurrent queries in a distributed database) is predicted accurately. Because the resource occupation calculation takes network resource overhead into account, the query delay, which reflects query performance, is calculated accurately. Because no complex model such as deep learning is used, the method is sufficiently practical and extensible.

The flow of the method in this embodiment is shown in FIG. 1 and includes the following steps:

calculating the interference degree: based on the query request, calculating the occupation of the computing resources related to the query request to obtain the query interference degree;

calculating sensitivity: based on the query request, calculating the query sensitivity in combination with the query interference degree;

predicting delay: calculating a query delay based on the query sensitivity.

The above steps are divided only for clarity of description. In implementation they may be combined into one step, or a step may be split into several, and as long as the same logical relationship is preserved they fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without altering the core design of the algorithm and flow also falls within the protection scope of this patent.

The second embodiment of the invention relates to a multi-concurrency OLAP type query performance prediction method for a distributed database. The second embodiment is substantially the same as the first embodiment. In the second embodiment, the computing resource occupation specifically includes: the time the query request takes when executed alone, the percentage of the query request's total running time that is I/O time, the I/O time shared by the master query and a concurrent query, the I/O time shared between concurrent queries, and the network interference of the concurrent queries on the master query. In addition, those skilled in the art will appreciate that the above data can be obtained directly from the operating system by practical comparison during the calculation, using prior-art means.

The third embodiment of the invention relates to a multi-concurrency OLAP type query performance prediction method for a distributed database. In the third embodiment, in the step of calculating the interference degree, the query interference degree is calculated in the following manner:

The fourth embodiment of the invention relates to a multi-concurrency OLAP type query performance prediction method for a distributed database. In the fourth embodiment of the present invention, the query sensitivity is a linear dependent variable of the query interference degree, and a plurality of sets of query sensitivity and query interference degree values are adopted to train to obtain a linear relation parameter; the query sensitivity for training is calculated based on the query delay, the time the query request is executed in the worst environment, and the time the query request is executed alone.

The fifth embodiment of the invention relates to a multi-concurrency OLAP type query performance prediction method for a distributed database. The fifth embodiment is substantially the same as the fourth embodiment. In the fifth embodiment, the query delay used for training is obtained by measurement. Specifically, the system time is read before and after the query and the two readings are subtracted.
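The measurement described above (one reading before the query, one after, then a subtraction) might be sketched as follows. This is only an illustration: `run_query` is a hypothetical callable standing in for whatever actually submits the query, and the choice of `time.perf_counter` as the clock is an assumption, not something the embodiment specifies.

```python
import time
from typing import Callable


def measure_query_delay(run_query: Callable[[], object]) -> float:
    """Return the elapsed seconds for one execution of run_query."""
    before = time.perf_counter()  # clock reading before the query starts
    run_query()                   # hypothetical callable that runs the query
    after = time.perf_counter()   # clock reading after the query finishes
    return after - before         # delay = reading after - reading before
```

In practice the callable would wrap a database client call; any such client is outside the scope of this sketch.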

The sixth embodiment of the invention relates to a multi-concurrency OLAP (on-line analytical processing) type query performance prediction method for a distributed database. The sixth embodiment is substantially the same as the fourth embodiment, and in the sixth embodiment of the present invention, the linear relationship between the query sensitivity and the query interference degree is that,

$c_{q,m} = \mu_q \cdot \gamma_{q,m} + b_q$

where $\mu_q$ and $b_q$ are the linear relationship parameters.

The seventh embodiment of the invention relates to a multi-concurrency OLAP type query performance prediction method for a distributed database. The seventh embodiment is substantially the same as the fourth embodiment, and in the seventh embodiment of the present invention, the query sensitivity for training is calculated as follows:

The eighth embodiment of the invention relates to a multi-concurrency OLAP type query performance prediction method for a distributed database. The eighth embodiment is substantially the same as the first embodiment, and in the eighth embodiment of the present invention, the inquiry delay is calculated based on the following formula:

The ninth embodiment of the invention relates to a multi-concurrency OLAP type query performance prediction method for a distributed database. The ninth embodiment is substantially the same as the first embodiment, and in the ninth embodiment of the present invention, the query request is a master query and/or a concurrent query.

A tenth embodiment of the present invention relates to a multi-concurrent OLAP query performance prediction system for a distributed database, as shown in fig. 2, including: a query interference degree calculation module, a query sensitivity calculation module, a cache module and a query delay calculation module, wherein,

It is to be noted that this embodiment is a system example corresponding to the first embodiment, and can be implemented in cooperation with the first embodiment. The related technical details mentioned in the first embodiment are still valid in this embodiment, and in order to reduce repetition, a detailed description is omitted here. Accordingly, the related art details mentioned in the present embodiment can also be applied to the first embodiment.

It should be noted that each module in this embodiment is a logic module. In practical applications, a logic unit may be a physical unit, part of a physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present invention, units that are less closely related to solving the technical problem addressed by the invention are not introduced in this embodiment; this does not mean that no other units are present in this embodiment.
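As a rough illustration of how the logic modules of this embodiment could cooperate, the sketch below groups the sensitivity calculation, the delay calculation and the cache into one class. Everything about its shape is an assumption: the class name, the cache keyed by a combination identifier, and the parameters `mu`, `b`, `tau_min` and `tau_max` (taken as already trained or measured) are hypothetical. The interference module is represented only by its output, `cqi`, and the delay step assumes the performance range point is the delay's normalized position between `tau_min` and `tau_max`.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class QueryPerformancePredictor:
    mu: float        # slope of the linear relation (assumed already trained)
    b: float         # intercept of the linear relation
    tau_min: float   # execution time when the query runs alone
    tau_max: float   # execution time in the worst-case environment
    # Cache module: stores interference, sensitivity and delay per combination.
    cache: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def sensitivity(self, combo_id: str, cqi: float) -> float:
        """Sensitivity module: linear function of the interference degree."""
        c = self.mu * cqi + self.b
        self.cache[(combo_id, "cqi")] = cqi
        self.cache[(combo_id, "sensitivity")] = c
        return c

    def delay(self, combo_id: str, cqi: float) -> float:
        """Delay module: maps the sensitivity back into the delay range
        (assumed min-max form of the performance range point)."""
        c = self.sensitivity(combo_id, cqi)
        tau = self.tau_min + c * (self.tau_max - self.tau_min)
        self.cache[(combo_id, "delay")] = tau
        return tau
```

A caller would construct one predictor per primary query and call `delay` once per query combination; the cache then holds all three intermediate values for later inspection.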

The eleventh embodiment of the invention relates to a multi-concurrency OLAP type query performance prediction method for a distributed database. The eleventh embodiment is implemented in a specific production environment in combination with the first to ninth embodiments, as shown in FIG. 3. This embodiment uses mathematical symbols; the main symbols are defined in Table 1.

TABLE 1 Primary symbol meanings

A. Calculation of query interference

The query interference degree (CQI, Concurrent Query Interference) describes the quality of the current execution environment of the primary query, that is, the degree of resource contention.

Assume a query combination m that includes a master query q and a set of queries $C = \{c_1, c_2, \ldots, c_n\}$ executed in parallel with the master query, where n is the number of concurrent queries. First, the I/O and network resources that each concurrent query $c_i$ requires when run alone are obtained; in this case no resource contention occurs. Then, the impact on the master query of each concurrent query competing with it for resources is estimated. Finally, the impact on the master query of resource contention among the concurrent queries is evaluated.

Baseline I/O refers to the benchmark I/O of a query, that is, the percentage of the total execution time taken up by I/O time when the query is executed independently. The greater the percentage, the more I/O resources the query requires. This percentage is recorded for each concurrent query $c_i$.

When a master query is executed together with concurrent queries, a concurrent query that scans a different table from the master query will "interfere" with the master query, because the different queries contend for I/O. When the concurrent query scans the same table as the master query, the interference is greatly reduced, and the master query may even be sped up. This is because a table that is scanned frequently is kept in the database's shared cache, so subsequent requests for its data are served directly from the shared cache, avoiding repeated I/O operations.

Let t be a table scanned by both the primary query q and the concurrent query $c_i$. The following value is defined:

As can be seen, this value can only be 0 or 1. The shared I/O time is calculated as follows.

where n represents the total number of tables scanned by the main query and the concurrent queries, and $S_t$ is the time taken to scan table t. A query statement of the form select * from [table] is used to obtain the total time spent scanning [table], that is, the table-scan share of a query statement's execution time. In Greenplum, the data of a table is distributed across the nodes and queries are executed on every node, so if multiple queries contain a common table, the time needed to scan that table from disk repeatedly can be "saved". Equation (3) calculates the time saved due to shared I/O.
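The shared I/O time between the primary query and one concurrent query can be illustrated with a short sketch. It assumes that equation (3) sums, over all scanned tables, the scan time $S_t$ multiplied by the 0/1 value that marks whether both queries scan table t; this form is inferred from the description rather than copied from the formula. The function and parameter names are hypothetical.

```python
from typing import Dict, Set


def shared_io_time(
    master_tables: Set[str],
    concurrent_tables: Set[str],
    scan_time: Dict[str, float],
) -> float:
    """Sum the scan time of every table scanned by both queries."""
    total = 0.0
    for table, s_t in scan_time.items():
        # Indicator is 1 only when the table is scanned by both queries.
        both = table in master_tables and table in concurrent_tables
        indicator = 1 if both else 0
        total += indicator * s_t
    return total
```

For example, with `master_tables={"store_sales", "item"}`, `concurrent_tables={"item", "customer"}` and scan times of 30, 5 and 12 for the three tables, only `item` is shared, so the saved time is its scan time.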

In addition to the I/O shared between the primary query and a concurrent query, the I/O impact between the concurrent queries themselves must be measured. For example, when the primary query is executed together with two concurrent queries a and b, a and b may save I/O time by executing concurrently. First, define table t as a table scanned by concurrent query $c_i$ together with other non-master queries:

Define $d_t$ as the number of concurrent queries that scan table t, where $d_t$ must be greater than 1. In addition, since only table scans shared between concurrent queries are considered here, table t must not appear in the master query. The time saved by shared I/O among the concurrent queries is calculated as:

Here n is again the total number of tables scanned by the primary query and the concurrent queries.
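The selection of tables that qualify for sharing among concurrent queries can be sketched as below. The sketch only computes $d_t$ and applies the two stated conditions ($d_t$ greater than 1, and the table absent from the master query); it does not attempt to reproduce the time-saving formula itself. Function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def concurrent_shared_tables(
    master_tables: Set[str],
    concurrent_queries: List[Set[str]],
) -> Dict[str, int]:
    """Map each qualifying table t to d_t, the number of concurrent
    queries that scan it."""
    # Each query contributes at most 1 per table because its tables are a set.
    counts = Counter(t for tables in concurrent_queries for t in tables)
    return {
        t: d_t
        for t, d_t in counts.items()
        if d_t > 1 and t not in master_tables  # conditions from the text
    }
```

A table scanned by two concurrent queries but also by the master query is excluded, because that sharing is already covered by the master/concurrent case.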

In a distributed database, data is distributed across the nodes of the cluster, so a table join in an SQL query necessarily involves data transmission, that is, data on one node is migrated to another node. Greenplum migrates data in two ways: broadcast and redistribution. Broadcast sends the data on one node to all other nodes so that every node has the complete data of the table. Redistribution computes a hash value for the table's data from the join key and redistributes the data to the nodes according to that hash value. If a table has N records, the amount of data to be redistributed is N and the amount of data to be broadcast is N × the number of nodes; the data migration volume of a join operation can be calculated in this way. The total data migration volume of the main query is $t_q$, and each concurrent query has a corresponding migration volume. The network interference of the concurrent queries on the master query is defined as:

As can be seen from the above, the larger the data migration volume of concurrent query $c_i$, the more it interferes with the primary query, and vice versa. This is because the network bandwidth of the system is fixed, so data transmitted by other queries on the network necessarily affects the data transmission of the master query.
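The data migration volumes that feed the network interference term follow directly from the two rules stated above: N records for a redistribution and N × the number of nodes for a broadcast. A minimal sketch, with hypothetical function names and a hypothetical string label for the migration strategy:

```python
from typing import Iterable, Tuple


def migration_volume(num_records: int, strategy: str, num_nodes: int) -> int:
    """Records moved for one table under the given strategy."""
    if strategy == "redistribute":
        return num_records               # each record is sent once
    if strategy == "broadcast":
        return num_records * num_nodes   # N x the number of nodes
    raise ValueError(f"unknown strategy: {strategy!r}")


def total_migration_volume(
    moves: Iterable[Tuple[int, str]], num_nodes: int
) -> int:
    """Total volume for a query, given (num_records, strategy) per table moved."""
    return sum(migration_volume(n, s, num_nodes) for n, s in moves)
```

How these volumes are combined into the interference value is left to the formula of the embodiment; the sketch stops at the per-query totals.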

After the above variables are obtained, the impact of a concurrent query $c_i$ on the master query can be defined.

Equation (7) can be understood as the impact of concurrent query $c_i$ on the primary query. The first half of equation (7) is the primary query's time minus the I/O time it shares with the concurrent query. With network contention fixed, the larger $r_{c_i}$ is, the more the concurrent query contends with the master query for I/O rather than sharing it, in which case the query delay of the master query is extended. Conversely, the smaller this value is, the smaller the resource competition between the concurrent query and the main query, and the smaller the effect on the delay of the main query.

In a query combination, the CQI value of the primary query is denoted $\gamma_{q,m}$ and is calculated as follows:

The above formula takes the average over the concurrent queries.
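The averaging step can be illustrated in a few lines. The sketch assumes the per-concurrent-query impact values have already been computed by whatever form equation (7) takes, and it only shows the final average; the function name is hypothetical.

```python
from typing import Sequence


def cqi(per_query_impacts: Sequence[float]) -> float:
    """Average the impact values of the n concurrent queries."""
    if not per_query_impacts:
        raise ValueError("at least one concurrent query is required")
    return sum(per_query_impacts) / len(per_query_impacts)
```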

B. Calculation of query sensitivity

The query performance range (PR, Performance Range) is the range of a query's delay times; the values in this interval represent the execution time of the query in different environments. The maximum of the interval, $\tau_{maxq}$, is the execution time of the query in the worst resource environment. This document simulates the worst case by continuously reading large files and transferring them between different nodes. The minimum, $\tau_{minq}$, is the delay in the current environment when only this query is executed. These two values correspond to executing the query in the two extreme environments; the execution time in any other environment lies within the query performance range. The PRP (Performance Range Point) value of the main query is defined as follows:

Once the value of $c_{q,m}$ is known, it is substituted into formula (9), which is then solved in reverse for $\tau_{q,m}$, the query delay of the master query.
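The relationship between the delay and its position in the performance range can be sketched as a pair of inverse functions. The sketch assumes that the PRP is the min-max normalized position of the delay between the standalone time and the worst-case time; this is a plausible reading of the definition above, not a transcription of formula (9). The function names are hypothetical.

```python
def prp(tau: float, tau_min: float, tau_max: float) -> float:
    """Position of delay tau within [tau_min, tau_max] (assumed form)."""
    if tau_max <= tau_min:
        raise ValueError("tau_max must be greater than tau_min")
    return (tau - tau_min) / (tau_max - tau_min)


def delay_from_prp(c: float, tau_min: float, tau_max: float) -> float:
    """Inverse of prp: recover the delay from a known PRP value c."""
    return tau_min + c * (tau_max - tau_min)
```

Under this assumption a query that runs in 10 s alone and 100 s in the worst case has a PRP of 0.5 when it takes 55 s, and the inverse function maps 0.5 back to 55 s.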

Given a query combination m and a master query q, the CQI value can be calculated using equation (8), and a linear regression model is then used to predict the performance of the query. To further illustrate the linear relationship between query performance and CQI, the query sensitivity (QS, Query Sensitivity) is introduced here.

Assuming that CQI and PRP have a linear relationship, the following formula is defined:

$c_{q,m} = \mu_q \cdot \gamma_{q,m} + b_q$    (10)

where $\mu_q$ is the slope, $b_q$ is the intercept, and $c_{q,m}$ is linear in $\gamma_{q,m}$.

In summary, the flow of the present embodiment is as follows:

First step: generate a query combination m that contains the main query q by using LHS;

Second step: obtain $\tau_{minq}$, $\tau_{maxq}$ and $\tau_{q,m}$ for q. $\tau_{minq}$ and $\tau_{maxq}$ can be obtained in advance, and $\tau_{q,m}$ can be obtained from experimental data. Substituting them into formula (9) gives $c_{q,m}$. In this way, a large number of $(c_{q,m}, \gamma_{q,m})$ value pairs can be obtained from the experimentally generated test sets;

Third step: use the obtained $(c_{q,m}, \gamma_{q,m})$ value pairs to train the QS model (formula (10)) by least-squares linear regression, obtaining the QS model of query q;

Fourth step: when q is in another query combination m′ and its query delay there is to be predicted, calculate the CQI value $\gamma_{q,m'}$ of q in that combination;

Fifth step: obtain $c'_{q,m'}$ from the QS model generated in the third step;

Sixth step: substitute $c'_{q,m'}$ into formula (9) again to obtain the query delay $\tau_{q,m'}$ of q in the query combination m′.
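The second to sixth steps can be sketched end to end as follows. Several points are assumptions made for illustration: formula (9) is taken to be the min-max normalized position of the delay between the standalone and worst-case times, the training samples are given as (CQI, measured delay) pairs, the least-squares fit uses the standard closed form for a single variable, and all function names are hypothetical. Generating the query combinations (first step) and computing the CQI values are outside the sketch.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Closed-form least-squares fit of y = mu * x + b."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    mu = sxy / sxx
    return mu, mean_y - mu * mean_x


def train_qs_model(
    samples: List[Tuple[float, float]], tau_min: float, tau_max: float
) -> Tuple[float, float]:
    """samples holds (cqi, measured_delay) pairs for one primary query."""
    cqis = [g for g, _ in samples]
    # Second step: turn each measured delay into a PRP value (assumed form).
    prps = [(tau - tau_min) / (tau_max - tau_min) for _, tau in samples]
    # Third step: least-squares fit of PRP against CQI.
    return fit_line(cqis, prps)


def predict_delay(
    model: Tuple[float, float], cqi_new: float, tau_min: float, tau_max: float
) -> float:
    mu, b = model
    c_new = mu * cqi_new + b                      # fifth step
    return tau_min + c_new * (tau_max - tau_min)  # sixth step
```

With `tau_min=10`, `tau_max=110` and samples `(0.2, 30)`, `(0.4, 40)`, `(0.8, 60)`, the fitted line has slope 0.5 and intercept 0.1, and a new combination with a CQI of 0.6 is predicted to take 50 time units.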

The experiments of the eleventh embodiment were performed on a Greenplum distributed cluster running Greenplum version 5.0.0-alpha+79a3598. The cluster has 4 nodes in total: one master node and three slave nodes. The slave nodes are mainly used for storing data and executing queries, while the master node is responsible for distributing queries and aggregating results. The master node has 32 GB of memory and a 4-core Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz; each slave node has 16 GB of memory and a CPU with the same core count and model as the master node. Each slave node hosts four database instances, each of which is equivalent to a complete PostgreSQL database and processes a part of the data. The master and slave nodes all run CentOS 7.4 with Linux kernel version 3.10. The tables and data are generated by TPC-DS, a decision-support benchmark. The data size used in the experiment is 50 GB. Ten templates in TPC-DS were selected to generate 10 queries for training and testing the model. These 10 queries are mainly I/O-sensitive queries with long execution times, which helps improve the accuracy of the prediction model.

The impact of the various components of the CQI on the error rate is evaluated first, and the CQI is then used to predict the query delay. When the number of queries run simultaneously, MPL (Multi-programming Level), is 3, the prediction error of the query delay for each variable is shown in FIG. 4. In the figure:

Baseline I/O refers to the benchmark I/O of a query, that is, the percentage of the total execution time taken up by I/O time when the query is executed independently; a larger percentage indicates that the query requires more I/O resources.

Positive I/O refers to the situation where concurrent queries "interfere" with the primary query by contending for I/O.

Concurrent I/O refers to the I/O time that two concurrent queries a and b save by executing concurrently when a primary query is executed together with them.

Network refers to the optimized I/O occupancy of the eleventh embodiment.

It can be seen that the error is large when only Baseline I/O is used to predict the query delay, and that the error rate drops significantly once the interaction between concurrent queries is taken into account. Adding the Concurrent I/O and network contention factors does not improve prediction accuracy noticeably, so Positive I/O is the main factor affecting the accuracy of the prediction model, while the other factors improve accuracy by a small margin. In summary, the eleventh embodiment takes the main influencing factors between concurrent queries into account and is a good prediction model.

For a particular query q, the query combinations containing q are found, and the QS model of the eleventh embodiment is constructed with q as the primary query. The model is used to predict the execution time, which is compared with the actual execution time to obtain the results shown in FIG. 5.

It can be seen that, under the different MPLs, the errors of all queries except queries 61 and 62 are below 25%, and some are even below 20%. The errors of queries 61 and 62 are higher because their execution times are shorter, which leads to a larger error. The experimental results show that the QS model can adapt to different query execution environments (different query combinations under different MPLs) and can therefore predict the execution delay of a query fairly accurately.

In summary, the experimental results show that most of the prediction error rate of the method can be maintained below 25%, and the delay time of the query can be predicted more accurately.
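The text does not state how the prediction error rate is computed. One common choice, shown here purely as an assumption, is the relative error between predicted and measured delay, averaged over the tested combinations; the function names are hypothetical.

```python
from typing import Iterable, Tuple


def relative_error(predicted: float, actual: float) -> float:
    """|predicted - actual| / actual (assumed error definition)."""
    if actual <= 0:
        raise ValueError("actual delay must be positive")
    return abs(predicted - actual) / actual


def mean_relative_error(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average relative error over (predicted, actual) delay pairs."""
    errors = [relative_error(p, a) for p, a in pairs]
    if not errors:
        raise ValueError("at least one pair is required")
    return sum(errors) / len(errors)
```

Under this definition a fixed absolute deviation weighs more heavily on a short query than on a long one, which matches the observation that the short-running queries show the larger errors.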

Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be implemented by a program stored in at least one storage medium, the program including several instructions that cause a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or part of the steps of the methods of the embodiments described herein. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. The multi-concurrency OLAP type query performance prediction method for the distributed database is characterized by comprising the following steps of:

the query interference degree is calculated in the following manner:

wherein the terms denote, in order: the time the query request takes when executed alone; the percentage of the query request's total run time that is I/O time; the I/O time shared by the master query and a concurrent query; the I/O time shared between concurrent queries; and the network interference of the concurrent queries on the master query;

$c_{q,m} = \mu_q \cdot \gamma_{q,m} + b_q$

wherein $c_{q,m}$ is the query sensitivity, $\gamma_{q,m}$ is the query interference degree, and $\mu_q$ and $b_q$ are linear relationship parameters;

prediction delay: calculating a query delay based on the query sensitivity;

the query delay is calculated based on the following formula:

2. The multi-concurrency OLAP type query performance prediction method for a distributed database according to claim 1, wherein the computing resource occupation specifically includes: the time the query request takes when executed alone, the percentage of the query request's total running time that is I/O time, the I/O time shared by the master query and a concurrent query, the I/O time shared between concurrent queries, and the network interference of the concurrent queries on the master query.

3. The multi-concurrency OLAP type query performance prediction method for a distributed database according to claim 1, wherein the query sensitivity is a linear dependent variable of the query interference degree, and a plurality of sets of query sensitivity and query interference degree values are used for training to obtain the linear relationship parameters; the query sensitivity used for training is calculated based on the query delay, the time the query request takes in the worst-case environment, and the time the query request takes when executed alone.

4. The multi-concurrency OLAP type query performance prediction method for a distributed database according to claim 3, wherein the query delay used for training is obtained by measurement.

5. The multi-concurrency OLAP type query performance prediction method for a distributed database according to claim 3, wherein the query sensitivity used for training is calculated as follows:

6. The multi-concurrent OLAP type query performance prediction method for a distributed database of claim 1, wherein the query request is a primary query and/or a concurrent query.

7. The multi-concurrency OLAP type query performance prediction system for the distributed database is characterized by comprising the following components: a query interference degree calculation module, a query sensitivity calculation module, a cache module and a query delay calculation module, wherein,

the query interference degree calculation module is configured to perform the interference degree calculation of claim 1;

the query sensitivity calculation module is configured to perform the sensitivity calculation of claim 1;

the query delay calculation module is configured to perform the delay prediction of claim 1.