CN114218287A - Query time prediction method for time sequence database - Google Patents

Query time prediction method for time sequence database Download PDF

Info

Publication number
CN114218287A
CN114218287A CN202111662253.1A CN202111662253A CN114218287A CN 114218287 A CN114218287 A CN 114218287A CN 202111662253 A CN202111662253 A CN 202111662253A CN 114218287 A CN114218287 A CN 114218287A
Authority
CN
China
Prior art keywords
query
time
data
encoding
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111662253.1A
Other languages
Chinese (zh)
Other versions
CN114218287B (en
Inventor
王宏志
李同欣
张凯欣
郑博
梁栋
叶天生
燕钰
丁小欧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Nosi Spacetime Technology Co ltd
Harbin Institute of Technology
Original Assignee
Beijing Nosi Spacetime Technology Co ltd
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Nosi Spacetime Technology Co ltd, Harbin Institute of Technology filed Critical Beijing Nosi Spacetime Technology Co ltd
Priority to CN202111662253.1A priority Critical patent/CN114218287B/en
Publication of CN114218287A publication Critical patent/CN114218287A/en
Application granted granted Critical
Publication of CN114218287B publication Critical patent/CN114218287B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474Sequence data queries, e.g. querying versioned data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations
    • G06F16/2456Join operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A query time prediction method for a time sequence database relates to the technical field of computers, and aims at the problem of low query time prediction speed in the prior art, and comprises the following steps: the method comprises the following steps: reading time sequence data; step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time; step three: encoding the query statement into vectorized data; step four: extracting data distribution characteristics from the opposite quantitative data; step five: using PCA to reduce the dimension of the data distribution characteristics; step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output; step seven: and predicting the query time by using the trained gradient lifting regression tree model. In the prediction time, the model can give a prediction result in tens of milliseconds in the experiment, and has considerable response speed.

Description

Query time prediction method for time sequence database
Technical Field
The invention relates to the technical field of computers, in particular to a query time prediction method facing a time sequence database.
Background
Query time prediction is the technical basis for multiple hot spot problems in the field of databases, such as admission control, query optimization, query scheduling and the like. For example, in database optimization, the main objectives of the optimization are query response time and space utilization, and therefore, the execution time of the query will be used as an important feedback index to indicate the quality of the optimization result. In practice, however, if the query is physically executed on the query load, the actual execution time is obtained, which may bring unacceptable cost to the optimization process, because the load is often executed for hundreds of thousands of rounds repeatedly.
At present, the related research for inquiring the time prediction direction is relatively mature. In general, there are two schemes for the query time prediction technology on the database: one is to encode the query at the query plan level and the other is to encode the query at the physical operator level. The coding granularity of the query plan layer is low, the prediction effect depends on the feature similarity of a training set and a test set, the prediction effect on unknown queries is poor, the coding of the physical operator layer generally utilizes depth-first traversal of the query tree to obtain the sequence of operators, so that the structural features in the query tree can be extracted, the coding granularity is higher, the prediction effect is better than that of the query plan layer in terms of dealing with unknown queries, but the prediction speed of the test queries with the features similar to the training set is low.
Disclosure of Invention
The purpose of the invention is: aiming at the problem of low query time prediction speed in the prior art, a query time prediction method facing a time sequence database is provided.
The technical scheme adopted by the invention to solve the technical problems is as follows:
a query time prediction method facing a time sequence database comprises the following steps:
the method comprises the following steps: reading time sequence data;
step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time;
step three: encoding the query statement into vectorized data;
step four: extracting data distribution characteristics from the opposite quantitative data;
step five: using PCA to reduce the dimension of the data distribution characteristics;
step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output;
step seven: and predicting the query time by using the trained gradient lifting regression tree model.
Further, before the third step, the following steps are also included: the CnoSQL is rewritten to standard SQL.
Further, the encoding in the third step includes encoding a join graph and encoding column information, and the results of encoding the join graph and encoding the column information are connected as the encoding of the whole query.
Further, the specific steps of the join graph coding are as follows:
analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of the two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows.
Further, the specific steps of the column information encoding are as follows:
and aiming at each table participating in the join, setting the coding column participating in the query predicate in each table participating in the join as 1, setting the coding column not participating in the query predicate in each table participating in the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.
Further, the column information is encoded by a one-hot encoding method.
The invention has the beneficial effects that:
the method generates a quote _ Big table by using stock transaction data of a 2016, 10, 24 and 24 new york exchange on an NYSE website, and tests 11 commonly used queries on the quote _ Big table, and by applying the method, under the condition that the data scale of a database does not change elastically, the coefficient of the query prediction regression model can reach over 0.996, the Mean Absolute Percentage Error (MAPE) can reach within 17 percent, and when the data scale of the database changes elastically within 20 percent, the coefficient of the query prediction regression model can reach over 0.971, and the Mean Absolute Percentage Error (MAPE) can reach within 19 percent.
In addition, in the prediction time, the model can give a prediction result in tens of milliseconds in the experiment, and has very considerable response speed.
Drawings
Fig. 1 is a schematic diagram of SQL rewritten to the standard by CnoSQL;
FIG. 2 is a sequence diagram of one-dimensional encoding by compressing the upper half area;
fig. 3 is a schematic overall flow chart of the present application.
Detailed Description
It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.
The first embodiment is as follows: specifically, the present embodiment is described with reference to fig. 3, and the query time prediction method for a time series database according to the present embodiment includes the following steps:
the method comprises the following steps: reading time sequence data;
step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time;
step three: encoding the query statement into vectorized data;
step four: extracting data distribution characteristics from the opposite quantitative data;
step five: using PCA to reduce the dimension of the data distribution characteristics;
step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output;
step seven: and predicting the query time by using the trained gradient lifting regression tree model.
The second embodiment is as follows: this embodiment is a further description of the first embodiment, and the difference between this embodiment and the first embodiment is that the following steps are further included before the third step: the CnoSQL is rewritten to standard SQL.
The third concrete implementation mode: this embodiment is a further description of the second embodiment, and the difference between this embodiment and the second embodiment is that the encoding in step three includes join graph encoding and column information encoding, and the results of join graph encoding and column information encoding are connected as the encoding of the whole query.
The fourth concrete implementation mode: this embodiment is a further description of a third embodiment, and the difference between this embodiment and the third embodiment is that the specific steps of the join graph encoding are:
analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of the two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows.
The fifth concrete implementation mode: the present embodiment is further described with respect to a fourth embodiment, and the difference between the present embodiment and the fourth embodiment is that the specific step of the column information encoding is:
and aiming at each table participating in the join, setting the coding column participating in the query predicate in each table participating in the join as 1, setting the coding column not participating in the query predicate in each table participating in the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.
The sixth specific implementation mode: this embodiment mode is a further description of a fifth embodiment mode, and the difference between this embodiment mode and the fifth embodiment mode is that the column information encoding is performed by a one-hot encoding method.
Time series data, referred to simply as time series data, refers to a sequence of data points acquired by a fixed acquisition device at fixed time intervals over a period of time, each data point being associated with a time stamp indicating the time of acquisition. By analyzing time series data of a period of time, the development trend of the transaction in the period of time can be obtained, and therefore future prediction is achieved. The time sequence data has obvious characteristics, fixed data structure, fixed write-in interval, large write-in quantity, large occupied original data space, wide time range of each query, less updating operation requirements and less single-point query requirements. These characteristics result in that the conventional relational database does not support the time-series data well. Databases that are specifically optimized for time series data are referred to as time series databases.
The characterized coding scheme provided by the application achieves the aim of quickly and accurately inquiring the time prediction under the condition of the business requirement of intensive inquiring workload.
Corresponding to SQL query of a relational database, CnosDB carries out query retrieval on time-series data by using CnosQL query language. In actual prediction, the CnoSQL is first rewritten to standard SQL as shown in fig. 1.
To predict the query time using an algorithm, the query statement needs to be encoded first. The invention carries out the characteristic coding of the query statement of a CnosDB time sequence database, mainly carries out the analysis and the coding from the query level, and is divided into two main parts: and encoding the join graph and the column information, wherein the results of the two encodings are connected to be used as the encoding of the whole query.
The Join graph coding means that a Join list involved in the whole SQL query is analyzed, the connection relation of the Join list is analyzed, whether connection is needed between every two Join lists is judged in sequence, and the number of the connection needed is set to be 1 in a two-dimensional matrix. Considering that the connection is required to be bidirectional, the two-dimensional matrix only needs to be half reserved and compressed into a one-dimensional matrix for subsequent encoding. As shown, there is a connection between a and B, B and D, a and C, i.e. 1 is set in the two-dimensional array, compressing the upper half into a one-dimensional encoding resulting in the sequence shown in fig. 2.
The column information encoding means that for each table, the column of the query predicate is set to be 1, and one-hot encoding is adopted for encoding, that is, for a table with n columns, n bits are used for representing the condition that each column participates in the query predicate, 1 represents that the column participates in the query predicate, and 0 represents that the column does not participate. The two are connected to obtain the final inquiry code
In order to adapt the characteristic code to the flexible change of the database data scale in the actual service, the data distribution characteristics need to be extracted. For this reason, it is assumed that the data size has an upper limit value within a certain service range, and the upper limit value is set by experience according to actual service requirements. The characteristic coding length of data distribution is recorded to be equal to the number of tables, indexes of characteristic vectors correspond to table sequence numbers, and values of coding bits are ratios of current table data entries to upper limits of the table data entries. And the data distribution feature vector is spliced to the front of the query feature vector.
After the query and data distribution are characterized and encoded, the PCA dimension reduction is performed on the data matrix. Because the query load in the actual service generally cannot cover the tags in all the time series data, some relation tables are not used, and therefore a large number of zero values are stored in the data matrix and are sparse. The PCA dimension reduction processing can effectively remove redundant components of the data matrix, accelerate the model regression speed and improve the accuracy.
Finally, the training data are used for training a gradient lifting regression tree model, so that the model can be used for accurately and quickly predicting the time of the query.
It should be noted that the detailed description is only for explaining and explaining the technical solution of the present invention, and the scope of protection of the claims is not limited thereby. It is intended that all such modifications and variations be included within the scope of the invention as defined in the following claims and the description.

Claims (6)

1. A query time prediction method facing a time sequence database is characterized by comprising the following steps:
the method comprises the following steps: reading time sequence data;
step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time;
step three: encoding the query statement into vectorized data;
step four: extracting data distribution characteristics from the opposite quantitative data;
step five: using PCA to reduce the dimension of the data distribution characteristics;
step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output;
step seven: and predicting the query time by using the trained gradient lifting regression tree model.
2. The method according to claim 1, wherein the step three is preceded by the steps of: the CnoSQL is rewritten to standard SQL.
3. The method according to claim 2, wherein the encoding in three steps comprises join graph encoding and column information encoding, and the results of join graph encoding and column information encoding are concatenated as the encoding of the whole query.
4. The method as claimed in claim 3, wherein the encoding of the join graph comprises the following steps:
analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of the two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows.
5. The method according to claim 4, wherein the column information is encoded by the following steps:
and aiming at each table participating in the join, setting the coding column participating in the query predicate in each table participating in the join as 1, setting the coding column not participating in the query predicate in each table participating in the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.
6. The method of claim 5, wherein the column information is encoded by a one-hot encoding method.
CN202111662253.1A 2021-12-30 2021-12-30 Query time prediction method for time sequence database Active CN114218287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111662253.1A CN114218287B (en) 2021-12-30 2021-12-30 Query time prediction method for time sequence database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111662253.1A CN114218287B (en) 2021-12-30 2021-12-30 Query time prediction method for time sequence database

Publications (2)

Publication Number Publication Date
CN114218287A true CN114218287A (en) 2022-03-22
CN114218287B CN114218287B (en) 2022-11-04

Family

ID=80707313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111662253.1A Active CN114218287B (en) 2021-12-30 2021-12-30 Query time prediction method for time sequence database

Country Status (1)

Country Link
CN (1) CN114218287B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114665884A (en) * 2022-03-29 2022-06-24 北京诺司时空科技有限公司 Time sequence database self-adaptive lossy compression method, system and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270346A1 (en) * 2007-04-30 2008-10-30 Abhay Mehta Estimating the static execution time of a database query
US20130226903A1 (en) * 2012-02-27 2013-08-29 Nec Laboratories America, Inc. Predicting query execution time
CN107491508A (en) * 2017-08-01 2017-12-19 浙江大学 A kind of data base querying time forecasting methods based on Recognition with Recurrent Neural Network
CN111611274A (en) * 2020-05-28 2020-09-01 华中科技大学 Database query optimization method and system
CN113032418A (en) * 2021-02-08 2021-06-25 浙江大学 Method for converting complex natural language query into SQL (structured query language) based on tree model
CN113711198A (en) * 2019-04-30 2021-11-26 微软技术许可有限责任公司 Learning resource consumption model for optimizing big data queries

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270346A1 (en) * 2007-04-30 2008-10-30 Abhay Mehta Estimating the static execution time of a database query
US20130226903A1 (en) * 2012-02-27 2013-08-29 Nec Laboratories America, Inc. Predicting query execution time
CN107491508A (en) * 2017-08-01 2017-12-19 浙江大学 A kind of data base querying time forecasting methods based on Recognition with Recurrent Neural Network
CN113711198A (en) * 2019-04-30 2021-11-26 微软技术许可有限责任公司 Learning resource consumption model for optimizing big data queries
CN111611274A (en) * 2020-05-28 2020-09-01 华中科技大学 Database query optimization method and system
CN113032418A (en) * 2021-02-08 2021-06-25 浙江大学 Method for converting complex natural language query into SQL (structured query language) based on tree model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JINGXIONG NI: "DeepQT: Learning Sequential Context for Query Execution Time Prediction", 《DATABASE SYSTEMS FOR ADVANCED APPLICATIONS. 25TH INTERNATIONAL CONFERENCE, DASFAA 2020. PROCEEDINGS. LECTURE NOTES IN COMPUTER SCIENCE (LNCS 12114)》 *
毕里缘等: "基于循环神经网络的数据库查询开销预测", 《软件学报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114665884A (en) * 2022-03-29 2022-06-24 北京诺司时空科技有限公司 Time sequence database self-adaptive lossy compression method, system and medium
CN114665884B (en) * 2022-03-29 2022-11-25 北京诺司时空科技有限公司 Time sequence database self-adaptive lossy compression method, system and medium

Also Published As

Publication number Publication date
CN114218287B (en) 2022-11-04

Similar Documents

Publication Publication Date Title
US10762071B2 (en) Value-ID-based sorting in column-store databases
US6728720B1 (en) Identifying preferred indexes for databases
CN108776673B (en) Automatic conversion method and device of relation mode and storage medium
CN101499065B (en) Table item compression method and device based on FA, table item matching method and device
CN111444220B (en) Cross-platform SQL query optimization method combining rule driving and data driving
CN110597844B (en) Unified access method for heterogeneous database data and related equipment
CN107491508B (en) Database query time prediction method based on recurrent neural network
CN110674211B (en) Automatic analysis method and device for AWR report of Oracle database
CN112434024A (en) Relational database-oriented data dictionary generation method, device, equipment and medium
KR20160071746A (en) Personalized log analysis system and rule based log data grouping method in the same
WO2021012861A1 (en) Method and apparatus for evaluating data query time consumption, and computer device and storage medium
CN114218287B (en) Query time prediction method for time sequence database
CN112783867A (en) Database optimization method for meeting real-time big data service requirements and cloud server
CN117609470A (en) Question-answering system based on large language model and knowledge graph, construction method thereof and intelligent data management platform
CN106844541B (en) Online analysis processing method and device
CN110209589B (en) Knowledge base system test method, device, equipment and medium
CN117271481B (en) Automatic database optimization method and equipment
Plase et al. Accelerating data queries on Hadoop framework by using compact data formats
CN117290376A (en) Two-stage Text2SQL model, method and system based on large language model
CN113157814B (en) Query-driven intelligent workload analysis method under relational database
CN116842076A (en) Data analysis method, device, analysis equipment and readable storage medium
CN111046054A (en) Method and system for analyzing power marketing business data
CN115422180A (en) Data verification method and system
CN112835932B (en) Batch processing method and device for business table and nonvolatile storage medium
CN111984625B (en) Database load characteristic processing method and device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant