CN115617830A - Data query optimization processing method and device based on machine learning - Google Patents

Data query optimization processing method and device based on machine learning Download PDF

Info

Publication number
CN115617830A
CN115617830A CN202110785793.2A CN202110785793A CN115617830A CN 115617830 A CN115617830 A CN 115617830A CN 202110785793 A CN202110785793 A CN 202110785793A CN 115617830 A CN115617830 A CN 115617830A
Authority
CN
China
Prior art keywords
query
information
execution plan
learning
cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110785793.2A
Other languages
Chinese (zh)
Inventor
刘芳
闫文
赵洪松
李方岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Heilongjiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Heilongjiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Heilongjiang Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202110785793.2A priority Critical patent/CN115617830A/en
Publication of CN115617830A publication Critical patent/CN115617830A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data query optimization processing method and device based on machine learning. The method comprises the following steps: acquiring input structured query statement information; determining an execution plan corresponding to the structured query statement information, and converting execution environment parameter information corresponding to the execution plan into a corresponding expression vector; inputting the expression vector into a query optimizer model based on a long-short term memory network to obtain cost information output by the query optimizer model, and determining a corresponding target execution plan based on the cost information; the query optimizer model is obtained by training a historical execution plan serving as sample data and an actual cost corresponding to the historical execution plan serving as a target value. By adopting the method disclosed by the invention, more accurate cost estimation can be obtained under different execution environments based on dynamic execution environment parameter information, so that a better execution plan is selected, and the efficiency and the accuracy of query optimization processing are improved.

Description

Data query optimization processing method and device based on machine learning
Technical Field
The invention relates to the technical field of computer application, in particular to a data query optimization processing method and device based on machine learning. In addition, an electronic device and a processor-readable storage medium are also related.
Background
Structured Query Language (SQL), which is a database Query and programming Language, is used to access data and Query, update, and manage data stored in a relational database system. Structured query languages can allow users to work on high-level data structures. The method does not require a user to specify a data storage method, and does not require the user to know a specific data storage mode, so that different database systems with completely different underlying structures can use the same structured query language as an interface for data input and management. Meanwhile, the structured query language statements can be nested, so that the structured query language statements have great flexibility and strong functions.
In recent years, with the rapid development of internet technology, data query implemented by means of structured query language is more and more widely applied. However, because the execution environment of the data query process is extremely complex, especially with the change of data amount and data distribution, the accuracy of optimization quality is not high when the optimization scheme matching is performed by using static statistical data in the prior art, and various execution scenarios of data query cannot be handled at the same time, which results in low data query efficiency. Therefore, how to design a stable and efficient data query optimization processing scheme based on machine learning becomes an important topic studied by those skilled in the art.
Disclosure of Invention
Therefore, the invention provides a data query optimization processing method and device based on machine learning, and aims to solve the problems of relatively high data query optimization processing scheme limitation and relatively low data query efficiency and accuracy under a complex execution environment in the prior art.
In a first aspect, the present invention provides a data query optimization processing method based on machine learning, including: acquiring input structured query statement information;
determining an execution plan corresponding to the structured query statement information, and converting execution environment parameter information corresponding to the execution plan into corresponding expression vectors;
inputting the expression vector into a query optimizer model based on a long-short term memory network to obtain cost information output by the query optimizer model, and determining a corresponding target execution plan based on the cost information;
the query optimizer model is obtained by training a historical execution plan serving as sample data and an actual cost corresponding to the historical execution plan serving as a target value.
In one embodiment, the execution environment parameter information includes a query parse tree, query load data for executing the structured query statement information, hardware feature information of an execution environment, and software feature information.
In one embodiment, the query optimizer model includes: the system comprises a radix estimation module, a cost estimation module, a connection sequence selection module and a learning type optimizer module with machine learning and reinforcement learning capabilities;
the cardinality estimation module is used for determining the total row number processed on each level of the execution plan and determining the total row number as the cardinality information of the execution plan;
the cost estimation module is used for determining the cost information of each physical path in the corresponding query optimizer based on the statistical information corresponding to the structured query statement information; wherein the statistical information comprises: high frequency value statistical information, histogram statistical information, target coefficient statistical information, type high frequency value information, array type histogram information, and range value type statistical information;
the connection sequence selection module is used for carrying out connection sequence selection optimization based on the reinforcement learning capacity of the long-term and short-term memory network to obtain a corresponding execution plan;
and the learning-type optimizer module is used for determining a corresponding target execution plan from the execution plans by using a machine learning and reinforcement learning mode based on the base information and the cost information.
In one embodiment, the data query optimization processing method based on machine learning further includes: optimizing and rewriting a target query statement in the structured query statement information according to a preset rewriting rule and a predetermined optimization type of the structured query statement information; or, based on a preset learning rule, performing optimization selection on the structured query statement information by adopting a machine learning and reinforcement learning mode to obtain a corresponding query rewriting result.
In one embodiment, the data query optimization processing method based on machine learning further includes: analyzing the structured query statement information by using an SQL analyzer in a query analysis tool corresponding to the database to obtain a query analysis tree containing an operator expression; and optimizing the operator expression according to a preset plugging principle by using the query optimizer model to obtain a corresponding query analysis result.
In one embodiment, the long-short term memory network query optimizer model is configured to analyze and process the structured query statement information based on a representation feature obtained by feature extraction from the historical execution plan, and determine corresponding cost information.
In a second aspect, the present invention further provides a data query optimization processing apparatus based on machine learning, including: the system comprises a structural query statement acquisition unit, a query statement processing unit and a query statement processing unit, wherein the structural query statement acquisition unit is used for acquiring input structural query statement information;
the vector representation unit is used for determining an execution plan corresponding to the structured query statement information and converting the execution environment parameter information corresponding to the execution plan into a corresponding representation vector;
the query optimization processing unit is used for inputting the expression vector into a query optimizer model based on a long-short term memory network to obtain cost information output by the query optimizer model and determining a corresponding target execution plan based on the cost information;
the query optimizer model is obtained by training a historical execution plan serving as sample data and an actual cost corresponding to the historical execution plan serving as a target value.
In one embodiment, the execution environment parameter information includes a query parse tree, query load data for executing the structured query statement information, hardware feature information of an execution environment, and software feature information.
In one embodiment, the query optimizer model includes: the system comprises a radix estimation module, a cost estimation module, a connection sequence selection module and a learning type optimizer module with machine learning and reinforcement learning capabilities;
the cardinality estimation module is used for determining the total row number processed on each level of the execution plan and determining the total row number as the cardinality information of the execution plan;
the cost estimation module is used for determining the cost information of each physical path in the corresponding query optimizer based on the statistical information corresponding to the structured query statement information; wherein the statistical information comprises: high frequency value statistical information, histogram statistical information, target coefficient statistical information, type high frequency value information, array type histogram information, and range value type statistical information;
the connection sequence selection module is used for performing connection sequence selection optimization based on the reinforcement learning capacity of the long-short term memory network to obtain a corresponding execution plan;
the learning-type optimizer module is used for determining a corresponding target execution plan from the execution plans in a machine learning and reinforcement learning mode based on the cardinality information and the cost information.
In one embodiment, the data query optimization processing apparatus based on machine learning further includes: the query rewriting unit is used for optimizing and rewriting a target query statement in the structured query statement information according to a preset rewriting rule and the predetermined optimization type of the structured query statement information; or based on a preset learning rule, performing optimization selection on the structured query statement information by adopting a machine learning and reinforcement learning mode to obtain a corresponding query rewriting result.
In one embodiment, the data query optimization processing method based on machine learning further includes: the query analysis unit is used for analyzing the structured query statement information by adopting an SQL analyzer in a query analysis tool corresponding to the database to obtain a query analysis tree containing an operator expression; and optimizing the operator expression according to a preset plugging principle by using the query optimizer model to obtain a corresponding query analysis result.
In one embodiment, the long-short term memory network query optimizer model is configured to analyze and process the structured query statement information based on a representation feature obtained by feature extraction from the historical execution plan, and determine corresponding cost information.
In a third aspect, the present invention also provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method for processing data query optimization based on machine learning as described in any one of the above when executing the program.
In a fourth aspect, the present invention further provides a processor-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for optimizing data query based on machine learning according to any one of the above items.
By adopting the data query optimization processing method based on machine learning, more accurate cost estimation can be obtained under different execution environments based on dynamic execution environment parameter information, so that a better execution plan can be selected quickly and accurately, and the efficiency and the accuracy of query optimization processing are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a data query optimization processing method based on machine learning according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a method for implementing data query optimization processing based on machine learning according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a query optimization module in the data query optimization processing method based on machine learning according to the embodiment of the present invention;
fig. 4 is a schematic structural diagram of an SQL statement reinforcement learning model in the data query optimization processing method based on machine learning according to the embodiment of the present invention;
fig. 5 is a schematic diagram of a specific implementation flow corresponding to the data query optimization processing method based on machine learning according to the embodiment of the present invention;
fig. 6 is a schematic structural diagram of a data query optimization processing apparatus based on machine learning according to an embodiment of the present invention;
fig. 7 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following describes an embodiment of the data query optimization processing method based on machine learning according to the present invention in detail. As shown in fig. 1, which is a schematic flow chart of a data query optimization processing method based on machine learning according to an embodiment of the present invention, a specific implementation process includes the following steps:
step 101: and acquiring the input structured query statement information.
In the embodiment of the present invention, before this step is performed, model training needs to be performed in advance. Specifically, it is first required to convert execution environment information, such as a historically executed SQL (Structured Query Language) execution plan, attribute columns, tables, and Query trees related to the execution plan, load data of an execution SQL statement, hardware feature information, and software feature information, into corresponding vector representations, that is, determine corresponding representation vectors, use the representation vectors as input of an original Query optimizer model, use actual cost of executing the execution plan as a target value, compare the target value with a predicted value to calculate corresponding loss, perform back propagation update parameters, and implement model training. The original query optimizer model refers to a cost model based on LSTM (Long Short-Term Memory; long Short-Term Memory network). Further, in this step, structured query statement information, i.e., SQL query statements, input when the SQL data is queried is obtained.
Step 102: and determining an execution plan corresponding to the structured query statement information, and converting the execution environment parameter information corresponding to the execution plan into a corresponding expression vector. The execution environment parameter information includes a query parsing tree, query load data for executing the structured query statement information, hardware characteristic information and software characteristic information of the execution environment, and the like. The execution plan is an SQL execution plan.
In the embodiment of the invention, each execution plan corresponding to the SQL query statement information is determined, the execution environment information such as attribute columns and tables related to the execution plan, a query analysis tree, load data, hardware characteristic information, software characteristic information and the like is converted into corresponding expression vectors, the corresponding expression vectors are input into a trained cost model, and the cost output by the cost model is obtained, so that the execution plan with the lowest cost is selected for execution in the follow-up process. The query parse Tree, namely Abstract Syntax Tree (AST).
In the process of query feature representation, the performance of the machine learning method depends on the form or features of the data represented by the vector to a large extent. Therefore, the method and the device extract the characteristics of the execution environment parameter information, the database metadata (tables, attributes, statistical information and the like) and the like when the user inquires the data, and can adopt the One-Hot mode to perform characteristic representation. Specifically, the features are expressed as: and converting the objects such as attribute columns, tables and the like of the SQL execution plan and the execution plan, the generated query analysis tree and the like into corresponding expression vectors as the input of the cost model. The vector is represented as: and converting attribute columns, tables, query parsing trees and the like into corresponding expression vectors to be used as input of the cost model. The query parsing Tree is represented by a Tree-LSTM, which is one of the GNNs (Graph Neural Networks), and the parameters of the Tree-LSTM may be set according to preset rules, which are not described in detail herein.
In a specific implementation, the representation of the SQL query statement may include: let n be the number of tables in the database, each table having a unique identifier from 0 to n-1; let m be an n x n matrix with each element being 0 or 1; mi, j is 1, which means that a connection relation exists between the ith table and the jth table, and if the connection relation does not exist, the connection relation is 0; the original matrix m is then flattened into a vector v by putting all the rows (from 0 to n-1) of the matrix into the vector, i.e. Vi n + j = mi, j; v | = n × (convert a matrix into a one-dimensional vector).
Representation of the attribute column: based on the preset SQL query statement, the expression form of the column only needs to be constructed by using the predicate in the SQL, and the database does not need to be scanned.
For a table with k columns, its representation vector is represented using its k column vectors. Some symbolic operators in SQL query statements are supposed to be encoded in the form of One-Hot. In addition, in the actual implementation process, related One-Hot vectorization is also performed on the Hash Join, seq Scan, and the like in the radix estimation, each table, and the attribute value of each table, and the like, and details are not repeated here.
During model processing, the embedding layer: and converting the original sparse high-dimensional representation vector into a low-dimensional dense representation vector by a layer of Embedding of the vectorized operator, the attributes of various tables, the fields and the like. And inputting the finally processed data into a cost model for calculation. Respectively using operator, metadata, radix estimation operation method and node graph as O t ,M t ,P t ,B t And (4) showing. The processing through the active layer can be expressed as follows, for example:
embed(O t )=ReLU(W o O t +b o )
embed(M t )=ReLU(W m M t +b m )
embed(B t )=ReLU(W b B t +b b )
wherein, embed represents the embedding layer function; reLU is a linear rectification function; w o 、W m 、W b 、b o 、b m 、b b Is a preset parameter.
The following node types, for example, can be expressed as follows:
Figure BDA0003159185690000081
wherein, P t The selection type is expressed as a base number estimation operation method, AND specifically comprises AND, OR AND other selection type operations; embed represents the embedding layer function; expr represents an expression; w m 、b p
Figure BDA0003159185690000091
Is a preset parameter.
Presentation layer: the problem to be solved in the embodiments of the present invention is to avoid the loss between global nodes and between root nodes, so it is proposed to use the LSTM-based cost model for model training. The embedding layer return value is input into the presentation layer, which may correspond to the following formalized definition:
output=LSTM(embed(O t ),embed(M t ),embed(B t ),embed(P t ))
evaluation layer: the connection is made using a two-layer Fully Connected network (FC), setting a specific Dropout to prevent over-fitting of the training. The activation function is carried out by using a ReLU (reconstructed Linear Unit, reLU) Linear rectification function or a Sigmoid (S-type function), and finally, the target value and the predicted value are compared to calculate corresponding loss, and parameters are updated by back propagation. The formalization of the ensemble can be defined as follows:
Final_output=sigmoid(Dropout(FC(output)))
the method can be adjusted according to actual services to achieve the optimal effect, and finally the trained query optimizer model based on the long-term and short-term memory network can be obtained.
Step 103: and inputting the expression vector into a query optimizer model based on a long-short term memory network to obtain cost information output by the query optimizer model, and determining a corresponding target execution plan based on the cost information. The query optimizer model is obtained by training a historical execution plan serving as sample data and actual cost corresponding to the historical execution plan serving as a target value. And the long-short term memory network query optimizer model is used for analyzing and processing the structured query statement information based on the expression characteristics obtained by extracting the characteristics from the historical execution plan and determining the corresponding cost information.
As shown in fig. 2, in the embodiment of the present invention, the query processing system includes a query parser, a query rewriter, a query optimizer, a query executor, and the like; the storage management system comprises a buffer management module and the like; a machine learning model manager.
In the query analysis process, an SQL (structured query language) analyzer in a query analysis tool corresponding to the database can be adopted to analyze the structured query statement information to obtain a query analysis tree containing an operator expression; and optimizing the operator expression according to a preset plugging and unplugging principle by using the query optimizer model to obtain a corresponding query analysis result. Specifically, the query analysis tool calcium corresponding to the database can be used for various offline, search and real-time query engines, such as Drill, hive, kylin, solr, flink, samza and the like, and the query analysis tool calcium can be used for analyzing SQL query statements, checking SQL and the like. And analyzing by using an SQL analyzer of the Call to obtain a query analysis tree, wherein the query analysis tree comprises an operator expression. The query optimizer can optimize the operator expression according to a plugging principle, wherein the information provided by the metadata database can be used for operations such as cost calculation.
In the query rewriting process, optimizing and rewriting a target query statement in the structured query statement information according to a preset rewriting rule and a predetermined optimization type of the structured query statement information; or, based on a preset learning rule, performing optimization selection on the structured query statement information by adopting a machine learning and reinforcement learning mode to obtain a corresponding query rewriting result. Specifically, a preset SQL rewrite tool is provided, index recommendations are optimized, the advantages and costs of building an index given query workload can be learned based on a model, and then an index is automatically built according to the learned information, for example: and indexing by using a foreign key so as to improve the query speed. Compared with the traditional query rewriting method, the SQL statement is rewritten only based on the content in the SQL, the click behavior of the user, the semantic information and the acquired Session information of different users. On the basis, the method is improved in two ways, namely, the predetermined SQL optimization type is optimized and rewritten for the target query statement in the structured query statement information based on the preset rewriting rule; and based on a preset learning mode, optimizing and selecting the SQL query sentence through machine learning and reinforcement learning.
In the query optimization process, the query optimizer model includes: the system comprises a radix estimation module, a cost estimation module, a connection sequence selection module, a rule-based optimizer, a cost-based optimizer and a learning-type optimizer module with machine learning and reinforcement learning capabilities. The cardinality estimation module is used for determining the total row number processed on each level of the execution plan and determining the total row number as the cardinality information of the execution plan; the cost estimation module is used for determining the cost information of each physical path in the corresponding query optimizer based on the statistical information corresponding to the structured query statement information; wherein the statistical information includes: high-frequency value statistical information, histogram statistical information, target coefficient statistical information, type high-frequency value information, array type histogram information and range value type statistical information; the connection sequence selection module is used for performing connection sequence selection optimization based on the reinforcement learning capacity of the long-short term memory network to obtain a corresponding execution plan; and the learning-type optimizer module is used for determining a corresponding target execution plan from the execution plans by using a machine learning and reinforcement learning mode based on the base information and the cost information.
In the query optimizer, machine learning is used for carrying out corresponding classification on SQL query statements by using classification and regression in a Spark MLib library, and then a corresponding optimization method is extracted from the classified classes. The overall structure of the query optimizer is shown in FIG. 3. It mainly comprises two major parts: an SQL optimization component and a query optimizer component. When the SQL optimization part is carried out, SQL query statements rewritten by the SQL optimization part can be optimized, and optimization functions such as index recommendation, view recommendation and partition recommendation are formed through feature extraction, dimension reduction, collaborative filtering, association analysis and trend prediction of data. The query Optimizer part is intended to use combination of radix estimation, cost estimation, join order selection (using the Tree-LSTM model structure), and also to use Rule-Based optimizers (RBO), cost-Based optimizers (CBO), and learning-Based optimizers with machine learning and reinforcement learning functions.
The scale estimation, i.e. cardinality estimation. Firstly, determining the total row number processed on each level of an execution plan, which is called the cardinality of the execution plan; an overhead pattern of the algorithm specified by the operator used in the query; the first factor (cardinality) is used as an input parameter for the second factor (overhead pattern); thus, increasing the cardinality will reduce estimation overhead, thereby speeding up execution planning.
And estimating the cost, namely the cost model. Determining that the query optimizer performs physical optimization requires the cost of computing the various physical paths, while cost estimation relies heavily on statistical information. Whether the statistical information can accurately describe the data distribution in the table is one of the important conditions for determining the accuracy of the cost. In the specific implementation process, how many data, how many pages are used, the frequency of a certain value and the like of a table can be known through cost estimation, and then, according to the information, how much data can be filtered out by a constraint condition, wherein the proportion of the data filtered out by the constraint condition to the total data amount is called as the selection rate.
According to the SQL query statement, various forms of statistical information are provided, including single-column statistical information and multi-column statistical information, the single-column statistical information means that the attribute column of each table generates a corresponding statistical tuple in the system table, and the tuple is responsible for describing data distribution in the attribute from multiple angles.
The statistical information includes: high frequency value statistics, histogram statistics, correlation coefficient statistics, type high frequency values, array type histograms, and finally range value type statistics. The confidence of the data population is formed from the above statistical information, and the estimation of the selection rate needs to be based on the statistical information (including histogram, high frequency value, NULL value rate, etc.).
In the actual implementation process, the selection rate of the constraint condition is determined, that is, the proportion of the results to be scanned through the scanning path or the proportion of the tuples obtained through the join operation is determined, the number of the intermediate results and the final results can be calculated according to the proportion, and then the number is used for calculating the cost, so that the corresponding cost is determined.
The plan is generated, i.e. the connection order is selected. And performing connection optimization of the query by adopting a Tree-LSTM model and reinforcement learning. Specifically, it uses reinforcement learning with long term short term memory (LSTM) of tree structure for connection order selection. Firstly, capturing a structure of a connection tree by adopting a graph neural network; in addition it supports modification of database schemas and multialias table names. A number of experiments performed on JOB (Join Order Benchmark) and TPC-H based optimizers have shown that RTOS (Real-time operating system) is superior to traditional optimizers and existing Learning optimizers based on DRL (Deep discovery Learning). In particular, compared to dynamic planning, RTOS connection planning (estimated) costs 101% and execution time 67%.
Further, in the process of selecting an efficient execution plan, a reinforcement learning method may be adopted to train the actual execution plan time as an execution cost (Reward), that is, to perform adjustment of the execution plan (i.e., the scheduling policy). The goal is to find the execution plan that minimizes the execution plan time for the data query.
The structure of the reinforcement learning main model is shown in fig. 4. The load data of the query and the effectiveness of the SQL query statement are changed every time the SQL query statement is analyzed, so that the SQL query statement is tuned and optimized in a sequence. And the reinforcement learning can better solve the sequence decision process and can reduce the dependence on training samples. Reinforcement learning interacts with the environment in a continuous trial and error mode through a scheduler (Agent) to find an optimal scheduling strategy (Action) so as to maximize the accumulated rewards obtained by interacting with the environment. Mathematically modelable as a Markov Decision Process (MDP) for simulating a stochastic strategy and reward implementable by an agent in an environment where the system state has Markov properties, the elements including state, actions, strategy functions and reward functions. The invention can adopt Q-Learning and DQN (Deep Q-Learning Network) asynchronous and offline RL algorithms, and accelerate the reinforcement Learning efficiency through episodic element reinforcement Learning. Reinforcement learning mainly obtains feedback from interaction with the environment, has limited interpretability, and further causes poor interpretability when deep learning is used for solving a target. In the scenario calculation applied to the database, practical experience is mainly relied on, for example: its Reward calculation derives benefits mainly from the front and back query times. Therefore, the project not only uses the Tree-LSTM method of the cost model as a baseline method, but also is integrated or fused with the reinforcement learning method, so that the efficiency and the interpretability are further improved.
As shown in fig. 2, in one implementation, a user inputs SQL query statement information through a user interface, and the machine learning library and the reinforcement learning library expand the SQL query statement information to convert the SQL query statement information into a relational SQL query statement. After the query processing system receives the SQL query statement, the SQL query statement is analyzed, rewritten, optimized and executed, and then an execution result is extracted from the index file or the cache. The query processing system depends on the collector, and can combine the load performance indexes of the machine learning library and the reinforcement learning library with the query processing performance to finally achieve effective interaction on the environment. The storage Management module (i.e. storage Management System) adopts a heterogeneous data platform exchange technology provided by a hybrid data storage Management System (DBMS) to realize data reading and complex association query of various databases, thereby simplifying the data preprocessing process. The query optimization model can be specifically refined into scale estimation, cost estimation, final plan generation and the like.
In the embodiment of the invention, the machine learning library mainly adopts a Spark MLib, ML Algorithm, stored Procedure, spark UDF and other machine learning libraries and related methods. In addition, the invention can also develop the self-defining algorithm for the provided machine learning library. The illustrated metadatabase is used to store database schemas, knowledge maps, user query load logs, system performance logs, machine learning libraries, reinforcement learning libraries, and the like.
In another practical implementation, as shown in FIG. 5, data set A may be used for model testing and training. The data set a is a real data set based on the hit leaderboard, which provides real load data. It contains 113 SQL query statements from 33 templates, and 3.6GB of data (11 GB when computing the index) and 21 tables. The tables in each SQL query statement range from 4 to 17. Specifically, firstly, SQL query statements in the data set a are extracted and loaded, the loaded SQL query statements are distributed according to the proportion of the training set to the test set 9, after the SQL query statements are loaded, features (table names, column names, predicates, and the like) of the table are encoded to obtain an expression vector, the expression vector is used as input of a query optimizer model based on deep reinforcement learning, and then features extracted from the encoding are trained. The intermediate state is a junction tree containing partial states. The intermediate state will be passed to the query optimizer model to select the operation that should now be performed and the state of itself will be continuously updated according to the selected operation. The termination state of the system is to join all tables together to form a join tree containing all tables, which will be converted into an execution plan of the SQL query statement and passed to the hybrid data storage management system managing the storage modules for execution.
The invention adopts methods based on machine learning, reinforcement learning and the like, improves the operation efficiency of the query optimizer, continuously learns the queried load data of the user, and improves the self-adaptive optimization capability of database SQL data query, namely, system optimization is carried out according to the data scale of the user and the characteristics of the queried load data, thereby promoting intelligent charging for data query analysis.
By adopting the data query optimization processing method based on machine learning, more accurate cost estimation can be obtained under different execution environments based on dynamic execution environment parameter information, so that a better execution plan can be selected quickly and accurately, and the efficiency and the accuracy of query optimization processing are improved.
Corresponding to the method for optimizing the data query based on the machine learning, which is provided by the invention, the invention also provides a device for optimizing the data query based on the machine learning, which is provided by the invention. Since the embodiment of the apparatus is similar to the above method embodiment, the description is relatively simple, and please refer to the description in the above method embodiment section for relevant points, and the embodiment of the data query optimization processing apparatus based on machine learning described below is only illustrative. Fig. 6 is a schematic structural diagram of a data query optimization processing device based on machine learning according to an embodiment of the present invention.
The data query optimization processing device based on machine learning specifically comprises the following parts:
a structured query statement acquisition unit 601 configured to acquire input structured query statement information;
a vector representing unit 602, configured to determine an execution plan corresponding to the structured query statement information, and convert execution environment parameter information corresponding to the execution plan into a corresponding representation vector;
the query optimization processing unit 603 is configured to input the expression vector into a query optimizer model based on a long-short-term memory network, obtain cost information output by the query optimizer model, and determine a corresponding target execution plan based on the cost information.
The query optimizer model is obtained by training a historical execution plan serving as sample data and an actual cost corresponding to the historical execution plan serving as a target value.
By adopting the data query optimization processing device based on machine learning, more accurate cost estimation can be obtained under different execution environments based on dynamic execution environment parameter information, so that a better execution plan can be selected quickly and accurately, and the efficiency and the accuracy of query optimization processing are improved.
Corresponding to the data query optimization processing method based on machine learning, the invention also provides electronic equipment. Since the embodiment of the electronic device is similar to the above method embodiment, the description is simple, and please refer to the description of the above method embodiment, and the electronic device described below is only schematic. Fig. 7 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. The electronic device may include: a processor (processor) 701, a memory (memory) 702 and a communication bus 703, wherein the processor 701 and the memory 702 communicate with each other through the communication bus 703 and communicate with the outside through the communication interface 704. Processor 701 may invoke logic instructions in memory 702 to perform a machine learning-based data query optimization processing method. The method comprises the following steps: acquiring input structured query statement information; determining an execution plan corresponding to the structured query statement information, and converting execution environment parameter information corresponding to the execution plan into corresponding expression vectors; inputting the expression vector into a query optimizer model based on a long-short term memory network to obtain cost information output by the query optimizer model, and determining a corresponding target execution plan based on the cost information; the query optimizer model is obtained by training a historical execution plan serving as sample data and actual cost corresponding to the historical execution plan serving as a target value.
Furthermore, the logic instructions in the memory 702 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a Memory chip, a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, embodiments of the present invention further provide a computer program product, where the computer program product includes a computer program stored on a processor-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the method for optimizing processing of data query based on machine learning provided by the above-mentioned method embodiments. The method comprises the following steps: acquiring input structured query statement information; determining an execution plan corresponding to the structured query statement information, and converting execution environment parameter information corresponding to the execution plan into a corresponding expression vector; inputting the expression vector into a query optimizer model based on a long-short term memory network to obtain cost information output by the query optimizer model, and determining a corresponding target execution plan based on the cost information; the query optimizer model is obtained by training a historical execution plan serving as sample data and an actual cost corresponding to the historical execution plan serving as a target value.
In still another aspect, an embodiment of the present invention further provides a processor-readable storage medium, where a computer program is stored on the processor-readable storage medium, and when the computer program is executed by a processor, the computer program is implemented to perform the method for optimizing and processing data query based on machine learning provided in the foregoing embodiments. The method comprises the following steps: acquiring input structured query statement information; determining an execution plan corresponding to the structured query statement information, and converting execution environment parameter information corresponding to the execution plan into a corresponding expression vector; inputting the expression vector into a query optimizer model based on a long-short term memory network to obtain cost information output by the query optimizer model, and determining a corresponding target execution plan based on the cost information; the query optimizer model is obtained by training a historical execution plan serving as sample data and an actual cost corresponding to the historical execution plan serving as a target value.
The processor-readable storage medium may be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memories (NAND FLASH), solid State Disks (SSDs)), etc.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A data query optimization processing method based on machine learning is characterized by comprising the following steps:
acquiring input structured query statement information;
determining an execution plan corresponding to the structured query statement information, and converting execution environment parameter information corresponding to the execution plan into a corresponding expression vector;
inputting the expression vector into a query optimizer model based on a long-short term memory network to obtain cost information output by the query optimizer model, and determining a corresponding target execution plan based on the cost information;
the query optimizer model is obtained by training a historical execution plan serving as sample data and an actual cost corresponding to the historical execution plan serving as a target value.
2. The machine-learning-based data query optimization processing method of claim 1, wherein the execution environment parameter information comprises a query parse tree, query load data for executing the structured query statement information, hardware feature information of an execution environment, and software feature information.
3. The machine-learning-based data query optimization processing method of claim 1, wherein the query optimizer model comprises: the system comprises a radix estimation module, a cost estimation module, a connection sequence selection module and a learning type optimizer module with machine learning and reinforcement learning capabilities;
the cardinality estimation module is used for determining the total row number processed on each level of the execution plan and determining the total row number as the cardinality information of the execution plan;
the cost estimation module is used for determining the cost information of each physical path in the corresponding query optimizer based on the statistical information corresponding to the structured query statement information; wherein the statistical information comprises: high-frequency value statistical information, histogram statistical information, target coefficient statistical information, type high-frequency value information, array type histogram information and range value type statistical information;
the connection sequence selection module is used for carrying out connection sequence selection optimization based on the reinforcement learning capacity of the long-term and short-term memory network to obtain a corresponding execution plan;
the learning-type optimizer module is used for determining a corresponding target execution plan from the execution plans in a machine learning and reinforcement learning mode based on the cardinality information and the cost information.
4. The machine-learning-based data query optimization processing method of claim 1, further comprising:
analyzing the structured query statement information by using an SQL analyzer in a query analysis tool corresponding to the database to obtain a query analysis tree containing an operator expression;
and optimizing the operator expression according to a preset plugging principle by using the query optimizer model to obtain a corresponding query analysis result.
5. The machine-learning-based data query optimization processing method according to claim 1, further comprising:
optimizing and rewriting a target query statement in the structured query statement information according to a preset rewriting rule and a predetermined optimization type of the structured query statement information; alternatively, the first and second electrodes may be,
and based on a preset learning rule, performing optimization selection on the structured query statement information by adopting a machine learning and reinforcement learning mode to obtain a corresponding query rewriting result.
6. The machine-learning-based data query optimization processing method of claim 1, wherein the long-short term memory network query optimizer model is configured to analyze and process the structured query statement information based on an expression feature obtained by feature extraction from the historical execution plan, and determine corresponding cost information.
7. A data query optimization processing apparatus based on machine learning, comprising:
the structured query statement acquisition unit is used for acquiring input structured query statement information;
the vector representation unit is used for determining an execution plan corresponding to the structured query statement information and converting the execution environment parameter information corresponding to the execution plan into a corresponding representation vector;
the query optimization processing unit is used for inputting the expression vector into a query optimizer model based on a long-short term memory network to obtain cost information output by the query optimizer model and determining a corresponding target execution plan based on the cost information;
the query optimizer model is obtained by training a historical execution plan serving as sample data and an actual cost corresponding to the historical execution plan serving as a target value.
8. The machine-learning-based data query optimization processing apparatus of claim 7, wherein the execution environment parameter information comprises a query parse tree, query load data for executing the structured query statement information, hardware feature information of an execution environment, and software feature information.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for machine learning-based data query optimization processing according to any one of claims 1 to 6 when executing the program.
10. A processor-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for machine learning-based data query optimization processing according to any one of claims 1 to 6.
CN202110785793.2A 2021-07-12 2021-07-12 Data query optimization processing method and device based on machine learning Pending CN115617830A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785793.2A CN115617830A (en) 2021-07-12 2021-07-12 Data query optimization processing method and device based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110785793.2A CN115617830A (en) 2021-07-12 2021-07-12 Data query optimization processing method and device based on machine learning

Publications (1)

Publication Number Publication Date
CN115617830A true CN115617830A (en) 2023-01-17

Family

ID=84856112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110785793.2A Pending CN115617830A (en) 2021-07-12 2021-07-12 Data query optimization processing method and device based on machine learning

Country Status (1)

Country Link
CN (1) CN115617830A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116339744A (en) * 2023-03-23 2023-06-27 深圳计算科学研究院 Custom function optimization method, device, computer equipment and medium
CN117056361A (en) * 2023-07-03 2023-11-14 杭州拓数派科技发展有限公司 Data query method and device for distributed database
CN117520380A (en) * 2023-11-21 2024-02-06 深圳计算科学研究院 Method, device, equipment and medium for executing database query optimization plan
CN117609288A (en) * 2024-01-17 2024-02-27 矩阵起源(深圳)信息科技有限公司 Data query policy optimization method, device, terminal equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116339744A (en) * 2023-03-23 2023-06-27 深圳计算科学研究院 Custom function optimization method, device, computer equipment and medium
CN116339744B (en) * 2023-03-23 2024-01-16 深圳计算科学研究院 Custom function optimization method, device, computer equipment and medium
CN117056361A (en) * 2023-07-03 2023-11-14 杭州拓数派科技发展有限公司 Data query method and device for distributed database
CN117520380A (en) * 2023-11-21 2024-02-06 深圳计算科学研究院 Method, device, equipment and medium for executing database query optimization plan
CN117609288A (en) * 2024-01-17 2024-02-27 矩阵起源(深圳)信息科技有限公司 Data query policy optimization method, device, terminal equipment and storage medium
CN117609288B (en) * 2024-01-17 2024-04-30 矩阵起源(深圳)信息科技有限公司 Data query policy optimization method, device, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
Marcus et al. Plan-structured deep neural network models for query performance prediction
US20200349162A1 (en) System and Method for Optimizing Large Database Management Systems with Multiple Optimizers
US7933894B2 (en) Parameter-sensitive plans for structural scenarios
CN115617830A (en) Data query optimization processing method and device based on machine learning
US8108399B2 (en) Filtering of multi attribute data via on-demand indexing
US6108648A (en) Optimizer with neural network estimator
US7509311B2 (en) Use of statistics on views in query optimization
EP2369506B1 (en) System and method of optimizing performance of schema matching
CN108804473B (en) Data query method, device and database system
CN112988782B (en) Hive-supported interactive query method and device and storage medium
US9110949B2 (en) Generating estimates for query optimization
EP4075292A1 (en) Method and apparatus for processing database
CN114041128A (en) Learning-based query plan caching for capturing low-cost query plans
Zou et al. Survey on learnable databases: A machine learning perspective
CN113656440A (en) Database statement optimization method, device and equipment
CN111723076A (en) Method and device for generating database index
US20230126509A1 (en) Database management system and method for graph view selection for a relational-graph database
US9135302B2 (en) Query rewrite with a nested materialized view
Fang et al. A query-level distributed database tuning system with machine learning
Sharma et al. Indexer++ workload-aware online index tuning with transformers and reinforcement learning
Kamali et al. Roq: Robust Query Optimization Based on a Risk-aware Learned Cost Model
KR102605929B1 (en) Method for processing structured data and unstructured data by allocating different processor resource and data processing system providing the method
KR102599008B1 (en) Method for processing multi-queries based on multi-query scheduler and data processing system providing the method
Peng et al. Performance analysis between different decision trees for uncertain data
Malysheva et al. Evaluation of Unsupervised and Reinforcement Learning approaches for Horizontal Fragmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination