CN114218287B

CN114218287B - Query time prediction method for time sequence database

Info

Publication number: CN114218287B
Application number: CN202111662253.1A
Authority: CN
Inventors: 王宏志; 李同欣; 张凯欣; 郑博; 梁栋; 叶天生; 燕钰; 丁小欧
Original assignee: Beijing Nosi Spacetime Technology Co ltd; Harbin Institute of Technology
Current assignee: Beijing Nosi Spacetime Technology Co ltd; Harbin Institute of Technology
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2022-11-04
Anticipated expiration: 2041-12-30
Also published as: CN114218287A

Abstract

A query time prediction method for a time sequence database relates to the technical field of computers, and aims at the problem of low query time prediction speed in the prior art, and comprises the following steps: the method comprises the following steps: reading time sequence data; step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time; step three: encoding the query statement into vectorized data; step four: extracting data distribution characteristics from the opposite quantitative data; step five: using PCA to reduce the dimension of the data distribution characteristics; step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output; step seven: and predicting the query time by using the trained gradient lifting regression tree model. In the prediction time, the model can give a prediction result in tens of milliseconds in the experiment, and has considerable response speed.

Description

Query time prediction method for time sequence database

Technical Field

The invention relates to the technical field of computers, in particular to a query time prediction method facing a time sequence database.

Background

Query time prediction is the technical basis for multiple hot spot problems in the field of databases, such as admission control, query optimization, query scheduling and the like. For example, in database optimization, the main objectives of the optimization are query response time and space utilization, and therefore, the execution time of the query will be used as an important feedback index to indicate the quality of the optimization result. In practice, however, if the query is physically executed on the query load, the actual execution time is obtained, which may bring unacceptable cost to the optimization process, because the load is often executed for hundreds of thousands of rounds repeatedly.

At present, the related research for inquiring the time prediction direction is relatively mature. In general, there are two approaches to the query time prediction technology on the database: one is to encode the query at the query plan level and the other is to encode the query at the physical operator level. The coding granularity of the query plan layer is low, the prediction effect depends on the feature similarity of the training set and the test set, the prediction effect on unknown queries is poor, the coding of the physical operator layer generally utilizes the depth-first traversal of the query tree to obtain the sequence of operators, so that the structural features in the query tree can be extracted, the coding granularity is high, the prediction effect is good relative to the query plan layer in terms of dealing with unknown queries, but the prediction speed of the test queries with the features similar to the training set is low.

Disclosure of Invention

The purpose of the invention is: aiming at the problem of low query time prediction speed in the prior art, a query time prediction method facing a time sequence database is provided.

The technical scheme adopted by the invention to solve the technical problems is as follows:

a query time prediction method facing a time sequence database comprises the following steps:

the method comprises the following steps: reading time sequence data;

step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time;

step three: encoding the query statement into vectorized data;

step four: extracting data distribution characteristics from the opposite quantitative data;

step five: using PCA to reduce the dimension of the data distribution characteristics;

step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output;

step seven: and predicting the query time by using the trained gradient lifting regression tree model.

Further, before the third step, the following steps are also included: the CnoSQL is rewritten to standard SQL.

Further, the encoding in the third step includes encoding a join graph and encoding column information, and the results of encoding the join graph and encoding the column information are connected as the encoding of the whole query.

Further, the specific steps of the join graph coding are as follows:

analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of the two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows.

Further, the specific steps of the column information encoding are as follows:

and aiming at each table joining the join, setting the coding column participating in the query predicate in each table joining the join as 1, setting the coding column not participating in the query predicate in each table joining the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.

Further, the column information coding is performed by a one-hot coding method.

The beneficial effects of the invention are:

the method generates a quote _ Big table by using stock transaction data of a 2016, 10, 24 and 24 new york exchange on an NYSE website, and tests 11 commonly used queries on the quote _ Big table, and by applying the method, under the condition that the data scale of a database does not change elastically, the coefficient of the query prediction regression model can reach over 0.996, the Mean Absolute Percentage Error (MAPE) can reach within 17 percent, and when the data scale of the database changes elastically within 20 percent, the coefficient of the query prediction regression model can reach over 0.971, and the Mean Absolute Percentage Error (MAPE) can reach within 19 percent.

In addition, in the prediction time, the model can give a prediction result in tens of milliseconds in the experiment, and has very considerable response speed.

Drawings

FIG. 1 is a schematic diagram of a standard SQL rewrite of CnoSQL;

FIG. 2 is a sequence diagram of one-dimensional encoding by compressing the upper half area;

fig. 3 is a schematic overall flow chart of the present application.

Detailed Description

It should be noted that, in the case of conflict, the various embodiments disclosed in the present application may be combined with each other.

The first embodiment is as follows: specifically, the present embodiment is described with reference to fig. 3, and the query time prediction method for a time series database according to the present embodiment includes the following steps:

the method comprises the following steps: reading time sequence data;

step two: writing time series data into CnOSDB, using a CnoSQL query statement by the CnOSDB to perform query retrieval on the time series data, and recording query time;

step three: encoding the query statement into vectorized data;

The second embodiment is as follows: this embodiment is further described with respect to the first embodiment, and the difference between this embodiment and the first embodiment is that the following steps are further included before the third step: the CnoSQL is rewritten to standard SQL.

The third concrete implementation mode: this embodiment is a further description of the second embodiment, and the difference between this embodiment and the second embodiment is that the encoding in step three includes join graph encoding and column information encoding, and the results of join graph encoding and column information encoding are connected as the encoding of the whole query.

The fourth concrete implementation mode is as follows: this embodiment is a further description of a third embodiment, and the difference between this embodiment and the third embodiment is that the specific steps of the join graph encoding are:

The fifth concrete implementation mode is as follows: the present embodiment is further described with respect to a fourth embodiment, and the difference between the present embodiment and the fourth embodiment is that the specific step of the column information encoding is:

The sixth specific implementation mode: this embodiment mode is a further description of a fifth embodiment mode, and the difference between this embodiment mode and the fifth embodiment mode is that the column information encoding is performed by a one-hot encoding method.

Time series data, referred to simply as time series data, refers to a sequence of data points acquired by a fixed acquisition device at fixed time intervals over a period of time, each data point being associated with a time stamp indicating the time of acquisition. By analyzing time series data of a period of time, the development trend of the transaction in the period of time can be obtained, and therefore the future prediction is achieved. The time sequence data has obvious characteristics, fixed data structure, fixed write-in interval, large write-in quantity, large occupied original data space, wide time range of each query, less updating operation requirements and less single-point query requirements. These characteristics result in that the conventional relational database does not support the time-series data well. Databases that are specifically optimized for time series data are referred to as time series databases.

The characterized coding scheme provided by the application achieves the aim of quickly and accurately inquiring the time prediction under the condition of the business requirement of intensive inquiring workload.

Corresponding to SQL query of a relational database, cnosDB carries out query retrieval on time-series data by using CnosQL query language. In actual prediction, the CnoSQL is first rewritten to standard SQL as shown in fig. 1.

To predict the query time using an algorithm, the query statement needs to be encoded first. The invention carries out the characteristic coding of the query statement of a CnosDB time sequence database, mainly carries out the analysis and the coding from the query level, and is divided into two main parts: the method comprises the steps of joint image coding and column information coding, and the results of the two codes are connected to be used as the code of the whole query.

The Join graph coding means that a Join list involved in the whole SQL query is analyzed, the connection relation of the Join list is analyzed, whether connection is needed between every two Join lists is judged in sequence, and the number of the connection needed is set to be 1 in a two-dimensional matrix. Considering that the connection is required to be in a bidirectional relationship, the two-dimensional matrix only needs to be reserved by half and compressed into a one-dimensional matrix for subsequent encoding. As shown, there is a connection between a and B, B and D, a and C, i.e. 1 is set in the two-dimensional array, compressing the upper half into a one-dimensional encoding resulting in the sequence shown in fig. 2.

The column information encoding means that for each table, the column participating in the query predicate is set to be 1, and the encoding is performed by adopting a one-hot encoding mode, that is, for a table with n columns, the condition that each column participates in the query predicate is represented by n digits, 1 represents that the column participates in the query predicate, and 0 represents that the column does not participate. The two are connected to obtain the final inquiry code

In order to adapt the characteristic code to the flexible change of the database data scale in the actual service, the data distribution characteristics need to be extracted. For this reason, it is assumed that the data size has an upper limit value within a certain service range, and the upper limit value is manually set by experience according to actual service requirements. The characteristic coding length of data distribution is recorded to be equal to the number of tables, indexes of characteristic vectors correspond to table sequence numbers, and values of coding bits are ratios of current table data entries to upper limits of the table data entries. And the data distribution feature vector is spliced to the front of the query feature vector.

After the query and data distribution are characterized and encoded, the PCA dimension reduction is performed on the data matrix. Because the query load in the actual service generally cannot cover the tags in all the time series data, some relation tables are not used, and therefore a large number of zero values are stored in the data matrix and are sparse. The PCA dimension reduction processing can effectively remove redundant components of the data matrix, accelerate the model regression speed and improve the accuracy.

Finally, the training data are used for training a gradient lifting regression tree model, so that the model can be used for accurately and quickly predicting the time of the query.

It should be noted that the detailed description is only for explaining and explaining the technical solution of the present invention, and the scope of protection of the claims is not limited thereby. It is intended that all such modifications and variations that fall within the spirit and scope of the invention be limited only by the claims and the description.

Claims

1. A query time prediction method facing a time sequence database is characterized by comprising the following steps:

the method comprises the following steps: reading time sequence data;

step three: encoding the query statement into vectorized data;

step four: extracting data distribution characteristics from the opposite quantization data;

step six: training a gradient lifting regression tree model by using vectorized data and the data distribution characteristics after dimensionality reduction as input and query time as output;

step seven: carrying out query time prediction by using the trained gradient lifting regression tree model;

before the third step, the following steps are also included: rewriting CnoSQL into standard SQL;

the encoding in the third step comprises join graph encoding and column information encoding, and the results of the join graph encoding and the column information encoding are connected to be used as the encoding of the whole query;

the specific steps of the join graph coding are as follows:

analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of a two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows;

the specific steps of the column information coding are as follows:

and aiming at each table participating in the join, setting the coding column participating in the query predicate in each table participating in the join as 1, setting the coding column not participating in the query predicate in each table participating in the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.

2. The method of claim 1, wherein the column information is encoded by a one-hot encoding method.