CN114218287B - Query time prediction method for time sequence database - Google Patents

Query time prediction method for time sequence database Download PDF

Info

Publication number
CN114218287B
CN114218287B CN202111662253.1A CN202111662253A CN114218287B CN 114218287 B CN114218287 B CN 114218287B CN 202111662253 A CN202111662253 A CN 202111662253A CN 114218287 B CN114218287 B CN 114218287B
Authority
CN
China
Prior art keywords
query
data
time
encoding
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111662253.1A
Other languages
Chinese (zh)
Other versions
CN114218287A (en
Inventor
王宏志
李同欣
张凯欣
郑博
梁栋
叶天生
燕钰
丁小欧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Nosi Spacetime Technology Co ltd
Harbin Institute of Technology
Original Assignee
Beijing Nosi Spacetime Technology Co ltd
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Nosi Spacetime Technology Co ltd, Harbin Institute of Technology filed Critical Beijing Nosi Spacetime Technology Co ltd
Priority to CN202111662253.1A priority Critical patent/CN114218287B/en
Publication of CN114218287A publication Critical patent/CN114218287A/en
Application granted granted Critical
Publication of CN114218287B publication Critical patent/CN114218287B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474Sequence data queries, e.g. querying versioned data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations
    • G06F16/2456Join operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Abstract

A query time prediction method for a time sequence database relates to the technical field of computers, and aims at the problem of low query time prediction speed in the prior art, and comprises the following steps: the method comprises the following steps: reading time sequence data; step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time; step three: encoding the query statement into vectorized data; step four: extracting data distribution characteristics from the opposite quantitative data; step five: using PCA to reduce the dimension of the data distribution characteristics; step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output; step seven: and predicting the query time by using the trained gradient lifting regression tree model. In the prediction time, the model can give a prediction result in tens of milliseconds in the experiment, and has considerable response speed.

Description

Query time prediction method for time sequence database
Technical Field
The invention relates to the technical field of computers, in particular to a query time prediction method facing a time sequence database.
Background
Query time prediction is the technical basis for multiple hot spot problems in the field of databases, such as admission control, query optimization, query scheduling and the like. For example, in database optimization, the main objectives of the optimization are query response time and space utilization, and therefore, the execution time of the query will be used as an important feedback index to indicate the quality of the optimization result. In practice, however, if the query is physically executed on the query load, the actual execution time is obtained, which may bring unacceptable cost to the optimization process, because the load is often executed for hundreds of thousands of rounds repeatedly.
At present, the related research for inquiring the time prediction direction is relatively mature. In general, there are two approaches to the query time prediction technology on the database: one is to encode the query at the query plan level and the other is to encode the query at the physical operator level. The coding granularity of the query plan layer is low, the prediction effect depends on the feature similarity of the training set and the test set, the prediction effect on unknown queries is poor, the coding of the physical operator layer generally utilizes the depth-first traversal of the query tree to obtain the sequence of operators, so that the structural features in the query tree can be extracted, the coding granularity is high, the prediction effect is good relative to the query plan layer in terms of dealing with unknown queries, but the prediction speed of the test queries with the features similar to the training set is low.
Disclosure of Invention
The purpose of the invention is: aiming at the problem of low query time prediction speed in the prior art, a query time prediction method facing a time sequence database is provided.
The technical scheme adopted by the invention to solve the technical problems is as follows:
a query time prediction method facing a time sequence database comprises the following steps:
the method comprises the following steps: reading time sequence data;
step two: writing the time series data into a CnOSDB, wherein the CnOSDB uses a CnoSQL query statement to query and retrieve the time series data and records query time;
step three: encoding the query statement into vectorized data;
step four: extracting data distribution characteristics from the opposite quantitative data;
step five: using PCA to reduce the dimension of the data distribution characteristics;
step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output;
step seven: and predicting the query time by using the trained gradient lifting regression tree model.
Further, before the third step, the following steps are also included: the CnoSQL is rewritten to standard SQL.
Further, the encoding in the third step includes encoding a join graph and encoding column information, and the results of encoding the join graph and encoding the column information are connected as the encoding of the whole query.
Further, the specific steps of the join graph coding are as follows:
analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of the two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows.
Further, the specific steps of the column information encoding are as follows:
and aiming at each table joining the join, setting the coding column participating in the query predicate in each table joining the join as 1, setting the coding column not participating in the query predicate in each table joining the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.
Further, the column information coding is performed by a one-hot coding method.
The beneficial effects of the invention are:
the method generates a quote _ Big table by using stock transaction data of a 2016, 10, 24 and 24 new york exchange on an NYSE website, and tests 11 commonly used queries on the quote _ Big table, and by applying the method, under the condition that the data scale of a database does not change elastically, the coefficient of the query prediction regression model can reach over 0.996, the Mean Absolute Percentage Error (MAPE) can reach within 17 percent, and when the data scale of the database changes elastically within 20 percent, the coefficient of the query prediction regression model can reach over 0.971, and the Mean Absolute Percentage Error (MAPE) can reach within 19 percent.
In addition, in the prediction time, the model can give a prediction result in tens of milliseconds in the experiment, and has very considerable response speed.
Drawings
FIG. 1 is a schematic diagram of a standard SQL rewrite of CnoSQL;
FIG. 2 is a sequence diagram of one-dimensional encoding by compressing the upper half area;
fig. 3 is a schematic overall flow chart of the present application.
Detailed Description
It should be noted that, in the case of conflict, the various embodiments disclosed in the present application may be combined with each other.
The first embodiment is as follows: specifically, the present embodiment is described with reference to fig. 3, and the query time prediction method for a time series database according to the present embodiment includes the following steps:
the method comprises the following steps: reading time sequence data;
step two: writing time series data into CnOSDB, using a CnoSQL query statement by the CnOSDB to perform query retrieval on the time series data, and recording query time;
step three: encoding the query statement into vectorized data;
step four: extracting data distribution characteristics from the opposite quantitative data;
step five: using PCA to reduce the dimension of the data distribution characteristics;
step six: training a gradient lifting regression tree model by using vectorized data and data distribution characteristics after dimension reduction as input and query time as output;
step seven: and predicting the query time by using the trained gradient lifting regression tree model.
The second embodiment is as follows: this embodiment is further described with respect to the first embodiment, and the difference between this embodiment and the first embodiment is that the following steps are further included before the third step: the CnoSQL is rewritten to standard SQL.
The third concrete implementation mode: this embodiment is a further description of the second embodiment, and the difference between this embodiment and the second embodiment is that the encoding in step three includes join graph encoding and column information encoding, and the results of join graph encoding and column information encoding are connected as the encoding of the whole query.
The fourth concrete implementation mode is as follows: this embodiment is a further description of a third embodiment, and the difference between this embodiment and the third embodiment is that the specific steps of the join graph encoding are:
analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of the two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows.
The fifth concrete implementation mode is as follows: the present embodiment is further described with respect to a fourth embodiment, and the difference between the present embodiment and the fourth embodiment is that the specific step of the column information encoding is:
and aiming at each table joining the join, setting the coding column participating in the query predicate in each table joining the join as 1, setting the coding column not participating in the query predicate in each table joining the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.
The sixth specific implementation mode: this embodiment mode is a further description of a fifth embodiment mode, and the difference between this embodiment mode and the fifth embodiment mode is that the column information encoding is performed by a one-hot encoding method.
Time series data, referred to simply as time series data, refers to a sequence of data points acquired by a fixed acquisition device at fixed time intervals over a period of time, each data point being associated with a time stamp indicating the time of acquisition. By analyzing time series data of a period of time, the development trend of the transaction in the period of time can be obtained, and therefore the future prediction is achieved. The time sequence data has obvious characteristics, fixed data structure, fixed write-in interval, large write-in quantity, large occupied original data space, wide time range of each query, less updating operation requirements and less single-point query requirements. These characteristics result in that the conventional relational database does not support the time-series data well. Databases that are specifically optimized for time series data are referred to as time series databases.
The characterized coding scheme provided by the application achieves the aim of quickly and accurately inquiring the time prediction under the condition of the business requirement of intensive inquiring workload.
Corresponding to SQL query of a relational database, cnosDB carries out query retrieval on time-series data by using CnosQL query language. In actual prediction, the CnoSQL is first rewritten to standard SQL as shown in fig. 1.
To predict the query time using an algorithm, the query statement needs to be encoded first. The invention carries out the characteristic coding of the query statement of a CnosDB time sequence database, mainly carries out the analysis and the coding from the query level, and is divided into two main parts: the method comprises the steps of joint image coding and column information coding, and the results of the two codes are connected to be used as the code of the whole query.
The Join graph coding means that a Join list involved in the whole SQL query is analyzed, the connection relation of the Join list is analyzed, whether connection is needed between every two Join lists is judged in sequence, and the number of the connection needed is set to be 1 in a two-dimensional matrix. Considering that the connection is required to be in a bidirectional relationship, the two-dimensional matrix only needs to be reserved by half and compressed into a one-dimensional matrix for subsequent encoding. As shown, there is a connection between a and B, B and D, a and C, i.e. 1 is set in the two-dimensional array, compressing the upper half into a one-dimensional encoding resulting in the sequence shown in fig. 2.
The column information encoding means that for each table, the column participating in the query predicate is set to be 1, and the encoding is performed by adopting a one-hot encoding mode, that is, for a table with n columns, the condition that each column participates in the query predicate is represented by n digits, 1 represents that the column participates in the query predicate, and 0 represents that the column does not participate. The two are connected to obtain the final inquiry code
In order to adapt the characteristic code to the flexible change of the database data scale in the actual service, the data distribution characteristics need to be extracted. For this reason, it is assumed that the data size has an upper limit value within a certain service range, and the upper limit value is manually set by experience according to actual service requirements. The characteristic coding length of data distribution is recorded to be equal to the number of tables, indexes of characteristic vectors correspond to table sequence numbers, and values of coding bits are ratios of current table data entries to upper limits of the table data entries. And the data distribution feature vector is spliced to the front of the query feature vector.
After the query and data distribution are characterized and encoded, the PCA dimension reduction is performed on the data matrix. Because the query load in the actual service generally cannot cover the tags in all the time series data, some relation tables are not used, and therefore a large number of zero values are stored in the data matrix and are sparse. The PCA dimension reduction processing can effectively remove redundant components of the data matrix, accelerate the model regression speed and improve the accuracy.
Finally, the training data are used for training a gradient lifting regression tree model, so that the model can be used for accurately and quickly predicting the time of the query.
It should be noted that the detailed description is only for explaining and explaining the technical solution of the present invention, and the scope of protection of the claims is not limited thereby. It is intended that all such modifications and variations that fall within the spirit and scope of the invention be limited only by the claims and the description.

Claims (2)

1. A query time prediction method facing a time sequence database is characterized by comprising the following steps:
the method comprises the following steps: reading time sequence data;
step two: writing time series data into CnOSDB, using a CnoSQL query statement by the CnOSDB to perform query retrieval on the time series data, and recording query time;
step three: encoding the query statement into vectorized data;
step four: extracting data distribution characteristics from the opposite quantization data;
step five: using PCA to reduce the dimension of the data distribution characteristics;
step six: training a gradient lifting regression tree model by using vectorized data and the data distribution characteristics after dimensionality reduction as input and query time as output;
step seven: carrying out query time prediction by using the trained gradient lifting regression tree model;
before the third step, the following steps are also included: rewriting CnoSQL into standard SQL;
the encoding in the third step comprises join graph encoding and column information encoding, and the results of the join graph encoding and the column information encoding are connected to be used as the encoding of the whole query;
the specific steps of the join graph coding are as follows:
analyzing tables of joining joins related in CnoSQL or SQL query sentences, analyzing the connection relation of every two tables of joining joins, judging whether every two tables of joining joins are connected or not, if so, setting codes corresponding to the joins as 1, if not, setting the codes corresponding to the joins as 0, and finally, reserving an upper triangular matrix part of a two-dimensional matrix and expanding the upper triangular matrix part into a one-dimensional matrix according to rows;
the specific steps of the column information coding are as follows:
and aiming at each table participating in the join, setting the coding column participating in the query predicate in each table participating in the join as 1, setting the coding column not participating in the query predicate in each table participating in the join as 0, and connecting the column set as 1 with the column set as 0 to obtain the query code.
2. The method of claim 1, wherein the column information is encoded by a one-hot encoding method.
CN202111662253.1A 2021-12-30 2021-12-30 Query time prediction method for time sequence database Active CN114218287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111662253.1A CN114218287B (en) 2021-12-30 2021-12-30 Query time prediction method for time sequence database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111662253.1A CN114218287B (en) 2021-12-30 2021-12-30 Query time prediction method for time sequence database

Publications (2)

Publication Number Publication Date
CN114218287A CN114218287A (en) 2022-03-22
CN114218287B true CN114218287B (en) 2022-11-04

Family

ID=80707313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111662253.1A Active CN114218287B (en) 2021-12-30 2021-12-30 Query time prediction method for time sequence database

Country Status (1)

Country Link
CN (1) CN114218287B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114665884B (en) * 2022-03-29 2022-11-25 北京诺司时空科技有限公司 Time sequence database self-adaptive lossy compression method, system and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491508A (en) * 2017-08-01 2017-12-19 浙江大学 A kind of data base querying time forecasting methods based on Recognition with Recurrent Neural Network
CN113711198A (en) * 2019-04-30 2021-11-26 微软技术许可有限责任公司 Learning resource consumption model for optimizing big data queries

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7831594B2 (en) * 2007-04-30 2010-11-09 Hewlett-Packard Development Company, L.L.P. Estimating the static execution time of a database query
US8874548B2 (en) * 2012-02-27 2014-10-28 Nec Laboratories America, Inc. Predicting query execution time
CN111611274A (en) * 2020-05-28 2020-09-01 华中科技大学 Database query optimization method and system
CN113032418B (en) * 2021-02-08 2022-11-11 浙江大学 Method for converting complex natural language query into SQL (structured query language) based on tree model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491508A (en) * 2017-08-01 2017-12-19 浙江大学 A kind of data base querying time forecasting methods based on Recognition with Recurrent Neural Network
CN113711198A (en) * 2019-04-30 2021-11-26 微软技术许可有限责任公司 Learning resource consumption model for optimizing big data queries

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DeepQT: Learning Sequential Context for Query Execution Time Prediction;Jingxiong Ni;《Database Systems for Advanced Applications. 25th International Conference, DASFAA 2020. Proceedings. Lecture Notes in Computer Science (LNCS 12114)》;20201231;第188-203页 *
基于循环神经网络的数据库查询开销预测;毕里缘等;《软件学报》;20171206(第03期);第799-810页 *

Also Published As

Publication number Publication date
CN114218287A (en) 2022-03-22

Similar Documents

Publication Publication Date Title
CN109977110B (en) Data cleaning method, device and equipment
US10762071B2 (en) Value-ID-based sorting in column-store databases
US8463822B2 (en) Data merging in distributed computing
CN112434024B (en) Relational database-oriented data dictionary generation method, device, equipment and medium
CN101499065B (en) Table item compression method and device based on FA, table item matching method and device
CN108776673B (en) Automatic conversion method and device of relation mode and storage medium
US11669523B2 (en) Question library for data analytics interface
KR101679050B1 (en) Personalized log analysis system using rule based log data grouping and method thereof
CN114218287B (en) Query time prediction method for time sequence database
Paludo Licks et al. SmartIX: A database indexing agent based on reinforcement learning
WO2021012861A1 (en) Method and apparatus for evaluating data query time consumption, and computer device and storage medium
CN110674211A (en) Automatic analysis method and device for AWR report of Oracle database
CN112783867A (en) Database optimization method for meeting real-time big data service requirements and cloud server
CN110209589B (en) Knowledge base system test method, device, equipment and medium
CN106844541B (en) Online analysis processing method and device
CN115062070A (en) Question and answer based text table data query method
CN113034193A (en) Working method for modeling of APP2VEC in wind control system
CN117290376A (en) Two-stage Text2SQL model, method and system based on large language model
CN113157814B (en) Query-driven intelligent workload analysis method under relational database
CN116842076A (en) Data analysis method, device, analysis equipment and readable storage medium
CN111046054A (en) Method and system for analyzing power marketing business data
US10877998B2 (en) Highly atomized segmented and interrogatable data systems (HASIDS)
CN113779215A (en) Data processing platform
CN116108072B (en) Data query method and query prediction model training method
CN117609281B (en) Text2Sql method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant