CN114218287A - Query time prediction method for time sequence database - Google Patents
Query time prediction method for time sequence database Download PDFInfo
- Publication number
- CN114218287A CN114218287A CN202111662253.1A CN202111662253A CN114218287A CN 114218287 A CN114218287 A CN 114218287A CN 202111662253 A CN202111662253 A CN 202111662253A CN 114218287 A CN114218287 A CN 114218287A
- Authority
- CN
- China
- Prior art keywords
- query
- time
- data
- encoding
- steps
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2474—Sequence data queries, e.g. querying versioned data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
- G06F16/2456—Join operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A query time prediction method for a time sequence database relates to the technical field of computers, and aims at the problem of low query time prediction speed in the prior art, and comprises the following steps: the method comprises the following steps: reading time sequence data; step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time; step three: encoding the query statement into vectorized data; step four: extracting data distribution characteristics from the opposite quantitative data; step five: using PCA to reduce the dimension of the data distribution characteristics; step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output; step seven: and predicting the query time by using the trained gradient lifting regression tree model. In the prediction time, the model can give a prediction result in tens of milliseconds in the experiment, and has considerable response speed.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a query time prediction method facing a time sequence database.
Background
Query time prediction is the technical basis for multiple hot spot problems in the field of databases, such as admission control, query optimization, query scheduling and the like. For example, in database optimization, the main objectives of the optimization are query response time and space utilization, and therefore, the execution time of the query will be used as an important feedback index to indicate the quality of the optimization result. In practice, however, if the query is physically executed on the query load, the actual execution time is obtained, which may bring unacceptable cost to the optimization process, because the load is often executed for hundreds of thousands of rounds repeatedly.
At present, the related research for inquiring the time prediction direction is relatively mature. In general, there are two schemes for the query time prediction technology on the database: one is to encode the query at the query plan level and the other is to encode the query at the physical operator level. The coding granularity of the query plan layer is low, the prediction effect depends on the feature similarity of a training set and a test set, the prediction effect on unknown queries is poor, the coding of the physical operator layer generally utilizes depth-first traversal of the query tree to obtain the sequence of operators, so that the structural features in the query tree can be extracted, the coding granularity is higher, the prediction effect is better than that of the query plan layer in terms of dealing with unknown queries, but the prediction speed of the test queries with the features similar to the training set is low.
Disclosure of Invention
The purpose of the invention is: aiming at the problem of low query time prediction speed in the prior art, a query time prediction method facing a time sequence database is provided.
The technical scheme adopted by the invention to solve the technical problems is as follows:
a query time prediction method facing a time sequence database comprises the following steps:
the method comprises the following steps: reading time sequence data;
step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time;
step three: encoding the query statement into vectorized data;
step four: extracting data distribution characteristics from the opposite quantitative data;
step five: using PCA to reduce the dimension of the data distribution characteristics;
step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output;
step seven: and predicting the query time by using the trained gradient lifting regression tree model.
Further, before the third step, the following steps are also included: the CnoSQL is rewritten to standard SQL.
Further, the encoding in the third step includes encoding a join graph and encoding column information, and the results of encoding the join graph and encoding the column information are connected as the encoding of the whole query.
Further, the specific steps of the join graph coding are as follows:
analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of the two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows.
Further, the specific steps of the column information encoding are as follows:
and aiming at each table participating in the join, setting the coding column participating in the query predicate in each table participating in the join as 1, setting the coding column not participating in the query predicate in each table participating in the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.
Further, the column information is encoded by a one-hot encoding method.
The invention has the beneficial effects that:
the method generates a quote _ Big table by using stock transaction data of a 2016, 10, 24 and 24 new york exchange on an NYSE website, and tests 11 commonly used queries on the quote _ Big table, and by applying the method, under the condition that the data scale of a database does not change elastically, the coefficient of the query prediction regression model can reach over 0.996, the Mean Absolute Percentage Error (MAPE) can reach within 17 percent, and when the data scale of the database changes elastically within 20 percent, the coefficient of the query prediction regression model can reach over 0.971, and the Mean Absolute Percentage Error (MAPE) can reach within 19 percent.
In addition, in the prediction time, the model can give a prediction result in tens of milliseconds in the experiment, and has very considerable response speed.
Drawings
Fig. 1 is a schematic diagram of SQL rewritten to the standard by CnoSQL;
FIG. 2 is a sequence diagram of one-dimensional encoding by compressing the upper half area;
fig. 3 is a schematic overall flow chart of the present application.
Detailed Description
It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.
The first embodiment is as follows: specifically, the present embodiment is described with reference to fig. 3, and the query time prediction method for a time series database according to the present embodiment includes the following steps:
the method comprises the following steps: reading time sequence data;
step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time;
step three: encoding the query statement into vectorized data;
step four: extracting data distribution characteristics from the opposite quantitative data;
step five: using PCA to reduce the dimension of the data distribution characteristics;
step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output;
step seven: and predicting the query time by using the trained gradient lifting regression tree model.
The second embodiment is as follows: this embodiment is a further description of the first embodiment, and the difference between this embodiment and the first embodiment is that the following steps are further included before the third step: the CnoSQL is rewritten to standard SQL.
The third concrete implementation mode: this embodiment is a further description of the second embodiment, and the difference between this embodiment and the second embodiment is that the encoding in step three includes join graph encoding and column information encoding, and the results of join graph encoding and column information encoding are connected as the encoding of the whole query.
The fourth concrete implementation mode: this embodiment is a further description of a third embodiment, and the difference between this embodiment and the third embodiment is that the specific steps of the join graph encoding are:
analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of the two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows.
The fifth concrete implementation mode: the present embodiment is further described with respect to a fourth embodiment, and the difference between the present embodiment and the fourth embodiment is that the specific step of the column information encoding is:
and aiming at each table participating in the join, setting the coding column participating in the query predicate in each table participating in the join as 1, setting the coding column not participating in the query predicate in each table participating in the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.
The sixth specific implementation mode: this embodiment mode is a further description of a fifth embodiment mode, and the difference between this embodiment mode and the fifth embodiment mode is that the column information encoding is performed by a one-hot encoding method.
Time series data, referred to simply as time series data, refers to a sequence of data points acquired by a fixed acquisition device at fixed time intervals over a period of time, each data point being associated with a time stamp indicating the time of acquisition. By analyzing time series data of a period of time, the development trend of the transaction in the period of time can be obtained, and therefore future prediction is achieved. The time sequence data has obvious characteristics, fixed data structure, fixed write-in interval, large write-in quantity, large occupied original data space, wide time range of each query, less updating operation requirements and less single-point query requirements. These characteristics result in that the conventional relational database does not support the time-series data well. Databases that are specifically optimized for time series data are referred to as time series databases.
The characterized coding scheme provided by the application achieves the aim of quickly and accurately inquiring the time prediction under the condition of the business requirement of intensive inquiring workload.
Corresponding to SQL query of a relational database, CnosDB carries out query retrieval on time-series data by using CnosQL query language. In actual prediction, the CnoSQL is first rewritten to standard SQL as shown in fig. 1.
To predict the query time using an algorithm, the query statement needs to be encoded first. The invention carries out the characteristic coding of the query statement of a CnosDB time sequence database, mainly carries out the analysis and the coding from the query level, and is divided into two main parts: and encoding the join graph and the column information, wherein the results of the two encodings are connected to be used as the encoding of the whole query.
The Join graph coding means that a Join list involved in the whole SQL query is analyzed, the connection relation of the Join list is analyzed, whether connection is needed between every two Join lists is judged in sequence, and the number of the connection needed is set to be 1 in a two-dimensional matrix. Considering that the connection is required to be bidirectional, the two-dimensional matrix only needs to be half reserved and compressed into a one-dimensional matrix for subsequent encoding. As shown, there is a connection between a and B, B and D, a and C, i.e. 1 is set in the two-dimensional array, compressing the upper half into a one-dimensional encoding resulting in the sequence shown in fig. 2.
The column information encoding means that for each table, the column of the query predicate is set to be 1, and one-hot encoding is adopted for encoding, that is, for a table with n columns, n bits are used for representing the condition that each column participates in the query predicate, 1 represents that the column participates in the query predicate, and 0 represents that the column does not participate. The two are connected to obtain the final inquiry code
In order to adapt the characteristic code to the flexible change of the database data scale in the actual service, the data distribution characteristics need to be extracted. For this reason, it is assumed that the data size has an upper limit value within a certain service range, and the upper limit value is set by experience according to actual service requirements. The characteristic coding length of data distribution is recorded to be equal to the number of tables, indexes of characteristic vectors correspond to table sequence numbers, and values of coding bits are ratios of current table data entries to upper limits of the table data entries. And the data distribution feature vector is spliced to the front of the query feature vector.
After the query and data distribution are characterized and encoded, the PCA dimension reduction is performed on the data matrix. Because the query load in the actual service generally cannot cover the tags in all the time series data, some relation tables are not used, and therefore a large number of zero values are stored in the data matrix and are sparse. The PCA dimension reduction processing can effectively remove redundant components of the data matrix, accelerate the model regression speed and improve the accuracy.
Finally, the training data are used for training a gradient lifting regression tree model, so that the model can be used for accurately and quickly predicting the time of the query.
It should be noted that the detailed description is only for explaining and explaining the technical solution of the present invention, and the scope of protection of the claims is not limited thereby. It is intended that all such modifications and variations be included within the scope of the invention as defined in the following claims and the description.
Claims (6)
1. A query time prediction method facing a time sequence database is characterized by comprising the following steps:
the method comprises the following steps: reading time sequence data;
step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time;
step three: encoding the query statement into vectorized data;
step four: extracting data distribution characteristics from the opposite quantitative data;
step five: using PCA to reduce the dimension of the data distribution characteristics;
step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output;
step seven: and predicting the query time by using the trained gradient lifting regression tree model.
2. The method according to claim 1, wherein the step three is preceded by the steps of: the CnoSQL is rewritten to standard SQL.
3. The method according to claim 2, wherein the encoding in three steps comprises join graph encoding and column information encoding, and the results of join graph encoding and column information encoding are concatenated as the encoding of the whole query.
4. The method as claimed in claim 3, wherein the encoding of the join graph comprises the following steps:
analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of the two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows.
5. The method according to claim 4, wherein the column information is encoded by the following steps:
and aiming at each table participating in the join, setting the coding column participating in the query predicate in each table participating in the join as 1, setting the coding column not participating in the query predicate in each table participating in the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.
6. The method of claim 5, wherein the column information is encoded by a one-hot encoding method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111662253.1A CN114218287B (en) | 2021-12-30 | 2021-12-30 | Query time prediction method for time sequence database |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111662253.1A CN114218287B (en) | 2021-12-30 | 2021-12-30 | Query time prediction method for time sequence database |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114218287A true CN114218287A (en) | 2022-03-22 |
CN114218287B CN114218287B (en) | 2022-11-04 |
Family
ID=80707313
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111662253.1A Active CN114218287B (en) | 2021-12-30 | 2021-12-30 | Query time prediction method for time sequence database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114218287B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114665884A (en) * | 2022-03-29 | 2022-06-24 | 北京诺司时空科技有限公司 | Time sequence database self-adaptive lossy compression method, system and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080270346A1 (en) * | 2007-04-30 | 2008-10-30 | Abhay Mehta | Estimating the static execution time of a database query |
US20130226903A1 (en) * | 2012-02-27 | 2013-08-29 | Nec Laboratories America, Inc. | Predicting query execution time |
CN107491508A (en) * | 2017-08-01 | 2017-12-19 | 浙江大学 | A kind of data base querying time forecasting methods based on Recognition with Recurrent Neural Network |
CN111611274A (en) * | 2020-05-28 | 2020-09-01 | 华中科技大学 | Database query optimization method and system |
CN113032418A (en) * | 2021-02-08 | 2021-06-25 | 浙江大学 | Method for converting complex natural language query into SQL (structured query language) based on tree model |
CN113711198A (en) * | 2019-04-30 | 2021-11-26 | 微软技术许可有限责任公司 | Learning resource consumption model for optimizing big data queries |
-
2021
- 2021-12-30 CN CN202111662253.1A patent/CN114218287B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080270346A1 (en) * | 2007-04-30 | 2008-10-30 | Abhay Mehta | Estimating the static execution time of a database query |
US20130226903A1 (en) * | 2012-02-27 | 2013-08-29 | Nec Laboratories America, Inc. | Predicting query execution time |
CN107491508A (en) * | 2017-08-01 | 2017-12-19 | 浙江大学 | A kind of data base querying time forecasting methods based on Recognition with Recurrent Neural Network |
CN113711198A (en) * | 2019-04-30 | 2021-11-26 | 微软技术许可有限责任公司 | Learning resource consumption model for optimizing big data queries |
CN111611274A (en) * | 2020-05-28 | 2020-09-01 | 华中科技大学 | Database query optimization method and system |
CN113032418A (en) * | 2021-02-08 | 2021-06-25 | 浙江大学 | Method for converting complex natural language query into SQL (structured query language) based on tree model |
Non-Patent Citations (2)
Title |
---|
JINGXIONG NI: "DeepQT: Learning Sequential Context for Query Execution Time Prediction", 《DATABASE SYSTEMS FOR ADVANCED APPLICATIONS. 25TH INTERNATIONAL CONFERENCE, DASFAA 2020. PROCEEDINGS. LECTURE NOTES IN COMPUTER SCIENCE (LNCS 12114)》 * |
毕里缘等: "基于循环神经网络的数据库查询开销预测", 《软件学报》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114665884A (en) * | 2022-03-29 | 2022-06-24 | 北京诺司时空科技有限公司 | Time sequence database self-adaptive lossy compression method, system and medium |
CN114665884B (en) * | 2022-03-29 | 2022-11-25 | 北京诺司时空科技有限公司 | Time sequence database self-adaptive lossy compression method, system and medium |
Also Published As
Publication number | Publication date |
---|---|
CN114218287B (en) | 2022-11-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10762071B2 (en) | Value-ID-based sorting in column-store databases | |
US6728720B1 (en) | Identifying preferred indexes for databases | |
CN108776673B (en) | Automatic conversion method and device of relation mode and storage medium | |
CN101499065B (en) | Table item compression method and device based on FA, table item matching method and device | |
CN111444220B (en) | Cross-platform SQL query optimization method combining rule driving and data driving | |
CN110597844B (en) | Unified access method for heterogeneous database data and related equipment | |
CN107491508B (en) | Database query time prediction method based on recurrent neural network | |
CN110674211B (en) | Automatic analysis method and device for AWR report of Oracle database | |
CN112434024A (en) | Relational database-oriented data dictionary generation method, device, equipment and medium | |
KR20160071746A (en) | Personalized log analysis system and rule based log data grouping method in the same | |
WO2021012861A1 (en) | Method and apparatus for evaluating data query time consumption, and computer device and storage medium | |
CN114218287B (en) | Query time prediction method for time sequence database | |
CN112783867A (en) | Database optimization method for meeting real-time big data service requirements and cloud server | |
CN117609470A (en) | Question-answering system based on large language model and knowledge graph, construction method thereof and intelligent data management platform | |
CN106844541B (en) | Online analysis processing method and device | |
CN110209589B (en) | Knowledge base system test method, device, equipment and medium | |
CN117271481B (en) | Automatic database optimization method and equipment | |
Plase et al. | Accelerating data queries on Hadoop framework by using compact data formats | |
CN117290376A (en) | Two-stage Text2SQL model, method and system based on large language model | |
CN113157814B (en) | Query-driven intelligent workload analysis method under relational database | |
CN116842076A (en) | Data analysis method, device, analysis equipment and readable storage medium | |
CN111046054A (en) | Method and system for analyzing power marketing business data | |
CN115422180A (en) | Data verification method and system | |
CN112835932B (en) | Batch processing method and device for business table and nonvolatile storage medium | |
CN111984625B (en) | Database load characteristic processing method and device, medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |