CN113010547B - Database query optimization method and system based on graph neural network - Google Patents

Database query optimization method and system based on graph neural network

Info

Publication number
CN113010547B
CN113010547B · Application CN202110491751.8A (CN202110491751A)
Authority
CN
China
Prior art keywords
data
algorithm
query
database
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110491751.8A
Other languages
Chinese (zh)
Other versions
CN113010547A (en
Inventor
范淑焕
侯孟书
何东升
廖建明
周世杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110491751.8A priority Critical patent/CN113010547B/en
Publication of CN113010547A publication Critical patent/CN113010547A/en
Application granted granted Critical
Publication of CN113010547B publication Critical patent/CN113010547B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G06F 16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution
    • G06F 16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a database query optimization method and system based on a graph neural network. The method comprises the following steps: data acquisition: reasonably adjusting sampling behavior according to the usage patterns of the data; cardinality and cost estimation: extracting the structural features of the query plan tree with a tree-structured graph neural network and extracting the association relationships and join topology of the data columns with a graph convolutional network, thereby establishing a cardinality and cost estimation network model; connection order optimization: extracting query features and connection order features with a graph convolution algorithm, and perceiving the real database environment with a reinforcement learning algorithm so as to continuously optimize the model and improve connection order decisions. By learning data distribution features, association relationships and database environment features, a cardinality estimation model and a connection order optimization model are established, and the algorithms are integrated into a distributed relational database. The invention improves algorithm efficiency in the logical optimization and physical optimization stages and increases query execution speed.

Description

Database query optimization method and system based on graph neural network
Technical Field
The invention relates to the field of databases, in particular to a database query optimization method and a database query optimization system based on a graph neural network.
Background
With the development of internet technology, data volumes have grown explosively, and big data has become the dominant theme of the information era. For the database field, the problem faced is how to organize and manage data efficiently. Many new structured databases are continually being created to meet storage requirements, but query performance optimization has not seen breakthrough development. How to improve the performance of database query execution has always been an important research direction in the database field, and traditional database query optimization methods are no longer adequate for today's large data scales and databases with novel architectures. The mature development of artificial intelligence technology and the strong learning ability of deep learning can effectively solve many of these problems. The fusion of artificial intelligence technology and databases has also become one of the important directions of current development; much research has made good progress, but huge challenges remain.
TiDB, as a distributed relational database, has good data organization and management functions and a query optimization function, but its query optimization methods are still fairly traditional. For example, cardinality estimation still adopts probability calculations under independence assumptions, lacking a certain flexibility and accuracy. For connection order optimization, which has a larger influence on queries, greedy algorithms and dynamic programming algorithms based on cardinality estimation are mainly adopted; with erroneous cardinality estimates and search algorithms of high complexity, search efficiency and query performance are not improved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art. Aiming at the problem that data distribution features and association relationships cannot be effectively utilized in query optimization, it uses the feature extraction capability of graph neural networks on graph-structured data to analyze the structural features of the query plan tree and the data association features, and provides a database query optimization method and system based on a graph neural network.
The purpose of the invention is realized by the following technical scheme:
a database query optimization method based on a graph neural network comprises the following steps:
data acquisition: reasonably adjusting sampling behavior according to the usage patterns of the data;
cardinality and cost estimation: extracting the structural features of the query plan tree with a tree-structured graph neural network, and extracting the association relationships and join topology of the data columns with a graph convolutional network, thereby establishing a cardinality and cost estimation network model;
connection order optimization: extracting query features and connection order features with a graph convolution algorithm, perceiving the real database environment with a reinforcement learning algorithm, and continuously optimizing the model with real feedback, thereby improving connection order decisions;
and establishing a cardinality estimation model and a connection order optimization model by learning data distribution features, association relationships and database environment features, and integrating the algorithms into the TiDB distributed relational database.
Further, the data acquisition comprises acquisition of global data, acquisition of offline data and acquisition of online data; global data acquisition is actively triggered: modified data are re-sampled and old data are replaced. Offline data acquisition maintains a data pool, samples related data on a large scale, and decides whether to sample according to the effect of the algorithm. Online data acquisition occurs when the algorithm actively requests data: data are loaded from the database environment according to the request and then returned to the algorithm, completing one data synchronization process.
Furthermore, the cardinality and cost estimation network model uses a Tree-LSTM structure to extract features of the query plan tree and a GCN graph neural network model to extract features of the relationships between columns, thereby designing and realizing a cardinality estimation network model that combines the two graph structures.
Further, the method also comprises network remote calling, which is implemented with gRPC to facilitate access to the database, and comprises the following steps:
step 1: defining the data structure of the service, wherein the service is defined using the Protobuf serialization tool; the definition comprises the composition structure of the data objects as well as the name, input parameters and return value of the service interface, used for calling a specific service function;
step 2: implementing the processing logic of the service: the client's input parameters are deserialized from the network request, the computation is completed at the server according to the specific request and the business logic, and the serialized result is finally returned to the client;
and step 3: implementing the remote service call: a remote service network connection request is first established, the parameters of the request and the specified service are constructed, the request is then serialized and sent to the server, and the specific result is obtained after the server processes the request.
A database query optimization system based on a graph neural network is an extension of the TiDB architecture and comprises a data acquisition module, an algorithm control module, a cardinality estimation module and a connection order selection module. The data acquisition module collects the information generated during the whole operation of the system, accumulates the data required by subsequent algorithm training, and also stores the intermediate information generated during a single query. The algorithm control module determines whether the model needs to be re-optimized according to the execution efficiency of the algorithm, thereby ensuring the normal operation of the model. The cardinality estimation module predicts the size of the final result of a physical plan and makes cost estimates based on cardinality. The connection order selection module optimizes the join order between tables in the execution plan tree and, following the reinforcement learning method, dynamically optimizes its network from system feedback in an online learning manner.
Furthermore, the data acquisition module provides a uniform data query interface and returns data in a standardized format.
Further, the cardinality estimation module comprises an offline training unit and an online prediction unit. The offline training unit obtains offline data pairs from the data acquisition module and continuously trains and optimizes the model using the difference between prediction and reality as the loss. The online prediction unit is used by the query optimizer to judge the quality of candidate plans during physical optimization; when a large difference appears between the predicted result and the real result, this is detected in time, the training process is restarted, and the model updates itself to adapt to the current environment.
Furthermore, the connection sequence selection module integrates online training and prediction, is based on a reinforcement learning model and uses an online learning mode, and each step of operation needs to be optimized according to real feedback of the system.
The invention has the beneficial effects that:
1. On the basis of deep research and analysis of the theory and implementation of database query optimization algorithms, a query optimization algorithm based on a graph neural network is proposed, which analyzes the data association features by means of a tree network and graph convolution.
2. A cardinality estimation model fusing graph neural networks is innovatively proposed. Based on the theory of LSTM and graph convolution, a Tree-LSTM structure is used to extract features of the query plan tree, a GCN graph neural network model is used to extract features of the relationships between columns, and a cardinality prediction network model combining the two graph structures is designed and realized to improve the accuracy of cardinality and cost prediction.
3. A join optimization model combining graph neural networks and reinforcement learning is innovatively proposed. The related algorithms are designed and realized: graph convolution is used to analyze the join association relationships, and a reinforcement learning method trains the algorithm model in an online manner to improve the accuracy of the connection order algorithm.
4. The query optimization algorithm of TiDB is studied, and the two optimization algorithms above are combined with the search optimization algorithm, further improving the algorithm efficiency of TiDB in the logical and physical optimization stages and increasing query execution speed.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a diagram of the overall architecture of the system.
Fig. 3 is a diagram of a tree convolution structure.
FIG. 4 is a diagram of a tree LSTM architecture.
Fig. 5 is a graph convolution calculation diagram of a column relationship.
Fig. 6 is a gRPC logic architecture diagram.
FIG. 7 is a table selectivity diagram.
Fig. 8 is a schematic diagram of a table structure representation layer.
FIG. 9 is a predicate encoding diagram.
FIG. 10 is a diagram of a cardinality estimation model.
FIG. 11 is a schematic diagram of a cardinality estimation training algorithm.
Fig. 12 is a schematic diagram of the RPC interface.
FIG. 13 is a cardinality estimation flow diagram.
FIG. 14 is a connection weight relationship diagram.
FIG. 15 is a diagram showing a connection sequence Q-network structure.
FIG. 16 is a diagram of the connection sequence training algorithm.
FIG. 17 is a diagram of a dual Q-network architecture.
Fig. 18 is a connection sequence flowchart.
FIG. 19 is an interface diagram of a selection connection table.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
In this embodiment, as shown in fig. 1, a database query optimization method based on a graph neural network includes the following steps:
data acquisition: reasonably adjusting sampling behavior according to the usage patterns of the data;
cardinality and cost estimation: extracting the structural features of the query plan tree with a tree-structured graph neural network, and extracting the association relationships and join topology of the data columns with a graph convolutional network, thereby establishing a cardinality and cost estimation network model;
connection order optimization: extracting query features and connection order features with a graph convolution algorithm, perceiving the real database environment with a reinforcement learning algorithm, and continuously optimizing the model with real feedback, thereby improving connection order decisions;
and establishing a cardinality estimation model and a connection order optimization model by learning data distribution features, association relationships and database environment features, and integrating the algorithms into the TiDB distributed relational database.
1. System architecture design
Since the algorithm needs to be applied to the distributed database TiDB, the overall architecture of the system must be compatible with the TiDB system structure.
the TiDB is used as a distributed database, and the system needs to have certain expandability, so that all functional modules are decoupled and independently exist, and are communicated with each other through RPC. This has the advantage that the individual functions can be kept independent and separately deployed, without affecting each other. Because the optimization algorithms based on big data and artificial intelligence also occupy a large amount of resources in the training and running processes, in order to reduce the influence on the normal execution of the database, the algorithms also need to be independently deployed respectively.
Fig. 2 shows the overall architecture of the system. The overall architecture is expanded on the structure of the TiDB, and a data acquisition and algorithm control module and a radix estimation and connection sequence selection module are newly added. When a user initiates a query request, the PD node is queried to find the TiDB service node which is requested to be processed, the TiDB analyzes the request, a reasonable execution plan is obtained according to the query optimization module, data meeting the requirements are loaded from the TiKV node, and the data are returned to the user. The data acquisition module simultaneously stores intermediate information generated in the one-time query process, and the algorithm control module determines whether the model needs to be re-optimized according to the execution efficiency of the algorithm so as to ensure the normal work of the model. The following mainly describes the design of three newly added modules.
The data acquisition is mainly used for collecting information generated in the whole operation process of the system and accumulating data required by subsequent algorithm training. The collected information comprises three major parts, wherein the first part is related to query and comprises a query statement, a logic plan tree and a physical plan tree; the second part is the collection of the execution performance information of the database, and comprises indexes which can reflect the execution efficiency, such as the real cardinal number, the execution time and the like of each operator in the execution process; the third part is related information of data stored in the database, including statistical information of data tables in the database, including data table structure, statistical information of columns, constraint information between columns and the like. After the data module collects the relevant information, the information of each party needs to be organized according to the sequence of the query execution time, and the query information and the execution result are subjected to relational binding, so that subsequent query and use are facilitated. Cardinality estimation is used to predict the size of the final outcome of the physical plan and make cost estimates based on cardinality. The module is divided into two parts, off-line training and on-line prediction. The offline training mainly obtains offline data pairs (query plan, real cost estimation) from a data acquisition module, and continuously trains and optimizes the model according to the difference between prediction and reality as cost. The online prediction module is mainly used by a query optimizer and used for judging whether a candidate plan is good or bad in the physical optimization process. Because the data, configuration and environment in the database are continuously changed in the using process of the database, the module can timely discover and restart the training process when a large difference occurs between the predicted result and the real result, and can spontaneously update the model to adapt to the current environment. The connection sequence optimization mainly optimizes the connection sequence between the tables in the execution plan tree. Unlike radix module estimation, the connection order optimization is trained and predicted on line, because the connection order optimization module is based on a reinforcement learning model and uses an on-line learning mode, and each operation needs to be adjusted according to the real feedback of the system. It is noted that for a real execution plan, the execution efficiency of a single step cannot be directly obtained from the system when a complete plan is not planned, so that the single step needs to be disassembled from the final result, which is different from the common reinforcement learning. When the query optimizer is in the stage of constructing the execution plan, the current state information is input into the connection sequence optimization module to obtain the currently considered optimal execution sequence, the module can continuously adjust the decision model according to the real execution load, and the whole process is in an online environment to adapt to the dynamic change of the database. In general, the overall architecture of the system is an extension of the TiDB architecture, and each functional module of the TiDB can be greatly compatible. 
Meanwhile, the newly added function module can be ensured to have certain expansibility, and particularly, under the condition that the algorithm has certain calculation force support, the stability of the system can be ensured by the aid of the architecture design.
2. Algorithm design
2.1 data acquisition Algorithm design
The data acquisition module is used as a basic module for supporting a subsequent algorithm, and the quality of acquired data can directly influence the effect of the algorithm. In most databases, some data query interfaces have been provided for users, including metadata structures for data, statistics, query performance information, and for distributed databases, topology information of the system. The data acquisition module mainly finishes acquisition and summarization of the information to form structured information which can be directly used by the algorithm.
However, for part of data, the database itself does not provide a relevant interface, including an execution plan tree structure, a cost calculation method, physical operator information, and the like, and for this part of information, the data acquisition module needs to extend the interface of the database and expose the structured information in the database to the outside.
In TiDB, all modules are relatively independent and communicate with each other through RPC, and the data acquisition and collection module adopts the same design pattern.
The data acquisition module needs to design a sampling period, because the collection process has a certain cost, if the sampling frequency is too high, the performance of the database is reduced, and if the sampling frequency is too low, the data may be too old, thereby affecting the algorithm effect. Therefore, the invention provides an adaptive sampling mechanism, namely the sampling behavior is reasonably adjusted according to the use rule of data. The data can be divided into three types according to the use purpose of the data, wherein the first type is global data, the second type is data aiming at an off-line algorithm, and the third type is data used by an on-line algorithm.
For the first type, global data, the change frequency is relatively low, for example the data table structure and the system topology. For such data, sampling is actively triggered: when a database user modifies the database with a DDL (data definition language) statement and the operation completes, the data acquisition module resamples the modified data to replace the old data. This keeps the sampling cost low while ensuring the data are up to date.
For the second type, the data needed by offline algorithms, offline algorithms mostly mine patterns from a large amount of historical data to optimize and realize the algorithm model, so they tolerate a certain amount of data lag. The data acquisition module maintains a data pool holding a certain amount of historical data, and the sampling time is determined in two ways. First, when the load on the database is low, the related data can be sampled on a large scale to fill the data pool, avoiding competition with normal database transactions. Second, sampling is decided according to the effect of the algorithm: an effect threshold can be established for each algorithm, and if during operation an algorithm's effect does not meet expectations, the algorithm is no longer adapted to the current database environment and needs newer data to recover its effect through a retraining stage.
For the third type, the training data required by online algorithms, the sampling strategy is synchronized with the steps of the online algorithm. Because online algorithms must stay online, the data must be up to date; for example, for an online reinforcement learning algorithm the feedback data must be instantaneous. The coordination works as follows: when the algorithm actively requests data, the acquisition module loads the data from the database environment according to the request and then returns them to the algorithm, completing one data synchronization process.
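As an illustration of how the three sampling strategies could be coordinated, the following Python sketch shows a minimal adaptive sampling scheduler. All class, method and threshold names here are hypothetical illustrations, not part of the patented implementation.

```python
class AdaptiveSampler:
    """Minimal sketch of the adaptive sampling mechanism (hypothetical interface)."""

    def __init__(self, db, load_threshold=0.3, effect_threshold=0.8):
        self.db = db                              # handle to the database environment (assumed API)
        self.load_threshold = load_threshold      # sample offline data below this load
        self.effect_threshold = effect_threshold  # retrain when algorithm effect drops below this
        self.data_pool = []                       # pool of offline training samples

    def on_ddl_change(self, table):
        """Global data: actively triggered after a DDL statement completes."""
        self.db.refresh_metadata(table)           # resample the modified metadata, replacing old data

    def maybe_sample_offline(self, algorithm_effect):
        """Offline data: sample when load is low or the algorithm effect degrades."""
        if (self.db.current_load() < self.load_threshold
                or algorithm_effect < self.effect_threshold):
            self.data_pool.extend(self.db.sample_workload())

    def serve_online_request(self, request):
        """Online data: load fresh data on demand and return it to the algorithm."""
        return self.db.load(request)              # one data synchronization step
```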
2.2 Cardinality and cost estimation algorithm design
Cardinality estimation and cost estimation are complementary; the relation between cardinality and cost is very close, and in general a larger cardinality corresponds to a larger cost, so the factors influencing cardinality and cost are largely shared and can be estimated with the same model. Some models output the cost directly, generally in units of time, providing a reference for physical optimization, but do not output the cardinality directly. In the model of the present invention, the output includes both the cardinality and the cost; this is done for compatibility with the traditional cost model, which is fixed and takes cardinality as input, so the traditional cost model can be retained as a candidate.
There are many factors affecting the accuracy of cardinality estimation. The most important is the execution tree structure of the query plan: in the volcano execution model, the execution tree determines the data flow of data loading and execution for the whole query plan, and different plan trees lead to different intermediate results. For example, for a three-table join over very large data there are two plans, (A ⋈ B) ⋈ C and A ⋈ (B ⋈ C). If A ⋈ B yields a very small result while B ⋈ C yields a large one, the final join of the first plan is cheaper because there are fewer intermediate results. Therefore, the features of the tree structure need to be extracted; the most widely used approaches at present are TreeCNN, an implementation of CNN on trees, and TreeLSTM.
In fig. 3, the convolution kernel on the tree is a triangular learnable weight operator. The kernel aggregates the feature information of the two child nodes and the node itself; the result of the convolution forms a new node, and the convolution is then expanded repeatedly to obtain the features of the whole tree. Finally a max pooling layer pools the convolution results, turning the tree data into a feature vector, and a fully connected layer maps the features to the final result. One disadvantage of this approach is that the convolution can only extract information around a node, whereas during plan execution information flows from bottom to top: the cost of a compute node is affected by the structure of its entire left and right subtrees, not just its immediate left and right children.
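For concreteness, a plausible form of this triangular tree convolution followed by dynamic max pooling is sketched below in Python/NumPy. The node structure (feature, left, right) and the weight shapes are illustrative assumptions rather than the exact TreeCNN formulation.

```python
import numpy as np

def tree_conv(node, W_self, W_left, W_right, b, outputs):
    """Aggregate a node's features with those of its two children (triangular kernel)."""
    x = W_self @ node.feature
    if node.left is not None:
        x = x + W_left @ tree_conv(node.left, W_self, W_left, W_right, b, outputs)
    if node.right is not None:
        x = x + W_right @ tree_conv(node.right, W_self, W_left, W_right, b, outputs)
    h = np.maximum(x + b, 0.0)          # ReLU; the convolved result forms a new node
    outputs.append(h)
    return h

def tree_cnn_embedding(root, W_self, W_left, W_right, b):
    """Convolve the whole plan tree, then max-pool all node features into one vector."""
    outputs = []
    tree_conv(root, W_self, W_left, W_right, b, outputs)
    return np.max(np.stack(outputs), axis=0)   # dynamic max pooling over the tree
```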
The TreeLSTM graph neural network fits this trait better. LSTM is a network model for processing sequential data and can mine the temporal features of each element of a sequence. For TreeLSTM, the direction of the sequence is from the leaf nodes to the root node; unlike the single direction of a conventional LSTM, a state S of TreeLSTM receives feature input from two directions, namely the information gathered from the left subtree and from the right subtree. The specific structure is shown in fig. 4.
The computation of TreeLSTM basically stays consistent with LSTM, except that the input comes from the two child nodes, so a different computation is needed when processing the input, as follows:
H_{t-1} = (H_{t-1}^L + H_{t-1}^R) / 2
R_t = σ(W_{sr} S_t + W_{hr} H_{t-1} + b_r)
Z_t = σ(W_{sz} S_t + W_{hz} H_{t-1} + b_z)
N_t = tanh(W_{sn} S_t + R_t ⊙ (W_{hn} H_{t-1}) + b_n)
H_t = (1 - Z_t) ⊙ N_t + Z_t ⊙ H_{t-1}
where H_t is the hidden state at state t, S_t is the encoding of the current input, H_{t-1} is the mean of the hidden states of the two children from the previous state, and R_t, Z_t and N_t are the reset gate, the update gate and the candidate hidden state controlling the hidden layer, respectively. In this way TreeLSTM can mine the plan tree features from bottom to top along the data flow.
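A direct transcription of these update rules into Python might look as follows; this is a sketch assuming the children's hidden states are simply averaged as stated above, with the weights stored in a plain dictionary (the key names W['sr'], W['hr'], ... are illustrative).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_cell(S_t, H_left, H_right, W):
    """One bottom-up update of the tree cell defined by the equations above.

    S_t: encoding of the current plan node; H_left, H_right: hidden states of the
    two children; W: dictionary of weight matrices and bias vectors.
    """
    H_prev = 0.5 * (H_left + H_right)                              # mean of the children's hidden states
    R = sigmoid(W['sr'] @ S_t + W['hr'] @ H_prev + W['br'])        # reset gate
    Z = sigmoid(W['sz'] @ S_t + W['hz'] @ H_prev + W['bz'])        # update gate
    N = np.tanh(W['sn'] @ S_t + R * (W['hn'] @ H_prev) + W['bn'])  # candidate hidden state
    H_t = (1.0 - Z) * N + Z * H_prev                               # new hidden state
    return H_t
```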
In addition to the influence of the tree structure on cardinality estimation, the operator types in the tree nodes are also very important. For example, for table scan operators, different scan operators have different costs: a sequential scan reads data from disk in the order in which the data are stored, while an index scan first scans the table's index and, if necessary, fetches whole rows through a table lookup, which is a random read. For a mechanical hard disk, sequential reads are much faster than random reads, so when large amounts of data are read the IO cost of a sequential scan is lower. Similarly, for join operations the cost difference between join operators is large, for example between merge join and hash join: if the join keys of the two tables are already ordered, merge join may be cheaper than hash join, because hash join has to go through the two phases of hash and probe and is less efficient. The impact of different operators on cost therefore also needs to be distinguished within the tree structure.
Besides the above-mentioned very intuitive factors that can affect the estimation of the cardinality, there is also a very important factor, which is a problem that has been ignored by previous research, that is, the correlation of data. In databases, there is a tendency to organize together data that have associative relationships or are frequently found together, and such associative types may be logically embodied, for example, in relational databases, grouping them into the same relationship (table); or they may be stored physically close together. This relationship is often significant to cost.
The leaf nodes of the plan tree are typically table scan operators, and as the starting positions of the plan, the cardinality estimates of the leaf nodes are decisive for the subsequent join operations. Cardinality estimation for a leaf node is related to the data distribution of the table to be scanned and the filtering predicates; the predicates usually act on several columns, and different predicate combinations and the association relationships between the columns all affect the estimate. For example, consider finding all people younger than 30 with a monthly salary below 10,000 from a payroll table. Under the conventional estimation method the two columns age and salary are assumed independent, giving E(X) = N · P(age < 30) · P(salary < 10000); but in a typical company there is a strong correlation between age and salary, and younger employees tend to have lower salaries, so P(age < 30, salary < 10000) > P(age < 30) · P(salary < 10000). Previous deep learning based methods simply used the predicates and their ranges as model input when estimating such conditions, without explicitly using the relationships between columns as an influencing factor. In this scheme, an association network topology graph is constructed through correlation analysis and a graph neural network to improve the accuracy of cardinality estimation.
In order to obtain the association relationships between the columns of a table, the present invention adopts an association analysis method based on KCCA (kernel canonical correlation analysis). Ordinary linear CCA can only explore the linear relationship between two groups of random variables, whereas in practice the relationships between variables are often nonlinear; KCCA is a commonly used nonlinear CCA algorithm. KCCA introduces the idea of kernel functions into CCA: low-dimensional data are mapped into a high-dimensional feature space (the kernel space) through a kernel function, and the correlation analysis is conveniently performed in that space. Let X and Y be two columns of the same table and choose the kernel function K(x, z) = exp(-‖x - z‖² / (2σ²)); the largest eigenvalue of the following equation is computed as the association weight of X and Y:
(K_x + ηI)^{-1} K_y (K_y + ηI)^{-1} K_x a = λ² a
where:
K_x = K(X, X), K_y = K(Y, Y)
An adjacency matrix for the table is then established, each element of which is the KCCA eigenvalue computed for the corresponding pair of columns, and the table is convolved using this adjacency matrix, as shown in fig. 5.
Besides the table scans in the leaf nodes, which are the starting point of cardinality estimation, the intermediate nodes of the plan tree are basically all related to join operations and are the factor that affects plan efficiency the most. In much previous research, a join is handled by encoding only the join operator and its input tables and then processing them with a tree (special graph) network; the explicit relationship between the joined columns is still not used. The association between tables determines to a large extent the cardinality of their join. For example, when two tables are equi-joined to find the comedies in which a comedy actor has appeared, the join condition is typically an equality between the actor's movie id column and the id column of the movie_info table, and the result size depends on the overlap (intersection) of the two columns. This example shows that the degree of similarity of the joined columns also affects the cardinality estimate to some extent.
A relation graph is also needed between the columns of different tables; the nodes of the graph are the columns of each table and the edge weight between two columns is their association relationship. The method used within a table cannot be applied here, because the cardinalities of two different tables differ, so the samples X and Y cannot be aligned element by element for correlation analysis, i.e. the mapping between their elements is unknown and their covariance matrix cannot be computed. Instead, P(X = Y) is estimated by Monte Carlo sampling, and the obtained probability is used as the association weight between the columns.
For data tables with N columns in total, the adjacency matrix G of the graph has size N²; obviously, for a database with many columns this matrix is very large, so the graph needs to be reduced to some extent. In practice, not every connection between two columns is meaningful, or the result of joining two columns is almost 0. For example, a join between the id of the cast table and the name of the movie table is meaningless: id and name are two different field types, and the result would be almost empty. Therefore, during sampling, such meaningless join pairs can be skipped and their weight directly assigned 0.
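A minimal sketch of this Monte Carlo estimate of P(X = Y) used as the cross-table association weight might look as follows (uniform sampling with replacement; the sample size is an arbitrary illustrative choice, and pairs known to be meaningless are simply assigned weight 0).

```python
import random

def cross_table_weight(col_x, col_y, n_samples=10000, meaningful=True):
    """Estimate P(x == y) for two columns from different tables by Monte Carlo sampling."""
    if not meaningful:                 # e.g. incompatible field types: skip and assign 0
        return 0.0
    hits = 0
    for _ in range(n_samples):
        if random.choice(col_x) == random.choice(col_y):
            hits += 1
    return hits / n_samples            # used as the edge weight between the two columns
```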
After the join relationship graph between columns is established, a graph convolution can be performed for each query according to the adjacency matrix. A query plan is usually presented as a tree, and the join topology contained in the plan tree needs to be extracted in order to capture the association relationships. First the columns involved in the query plan are extracted and encoded; feature information is added according to the join relationships and predicates of the columns; finally, graph convolution is applied to the join topology, and the result becomes an important reference input for cardinality estimation.
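One common form of this graph convolution step is the normalized propagation rule H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W). The sketch below is a generic single GCN layer over the column-association adjacency matrix, not the exact layer configuration of the patent.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution layer over the column-association graph.

    A: N x N adjacency matrix of association weights between columns.
    H: N x d matrix of per-column feature encodings (predicates, join flags, ...).
    W: d x d' learnable weight matrix.
    """
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    D_inv_sqrt = np.diag(d_inv_sqrt)                  # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)   # ReLU
```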
2.3 connection order optimization Algorithm design
In database optimization, the connection order directly determines the skeleton of the whole query plan, and the rest of the optimization is performed on top of it, so it is a crucial optimization step for execution efficiency. In summary, the factors affecting join optimization are mainly the accuracy of the cost estimate and the search algorithm. Finding the optimal execution order is an NP problem: if the cost estimates were completely accurate, algorithms such as dynamic programming would certainly find the optimal plan, but perfectly accurate cost estimation is nearly impossible, the time complexity of dynamic programming is high, and approximation or heuristic algorithms can only obtain suboptimal solutions with certain limitations; moreover these are generic algorithms that cannot adapt dynamically to changes in the environment. Therefore, on top of improving cardinality estimation, the connection order optimization algorithm itself needs to be improved. The method for improving cardinality estimation was described in the previous section; the focus here is on optimizing the connection order with a graph neural network and reinforcement learning. The graph neural network is mainly used to build features from the relationships among the factors that influence the connection order, while the reinforcement learning method continuously optimizes the network from system feedback in an online learning manner, so as to adapt to database changes in time.
With the reinforcement learning method, the connection order optimizer acts as the agent, the database environment is the perceived real world (env), choosing a particular connection order is the agent's action, and the real execution time of the plan under that order is the reward of the current action. By continuously trying various actions, the agent keeps optimizing its action decisions according to the reward so as to improve the expected reward, and can thus generate the connection order that maximizes the plan reward (minimizes the cost).
For a query plan, in the connection order stage, the state S consists of the tables to be joined and is composed of two parts: the already joined sub-plan and the tables not yet joined. In the current state, the action a taken by the query optimizer is to select a table T to join with the current sub-plan plan_sub. Q(s, a) is defined as the final expected reward, i.e. the final cost, obtained after taking action a in the current state s. The value function V(s) expresses how much reward can be expected in the current state s, i.e.:
V^π(s) = E_{a~π}[ Q^π(s, a) ]
Q(s, a) reflects how good it is to take action a in the current state, while V(s) reflects how good the current state is, i.e. an overall cost of the complete plans formed from the current sub-plan. The optimal decision strategy π', i.e. the optimal action to take at each step, is defined as:
π'(s) = argmax_a Q^π(s, a)
Suppose accurate values of Q^π(s, a) can be obtained; then by continuously updating the strategy, the final result must converge to the optimal decision.
Assume that a series of transitions s_0 a_0 r_0, ..., s_n a_n r_n has been collected from the real environment and that the Q-network is accurate; from the definitions above it follows that:
Q(s_t, a_t) = r_t + max_{a'} Q(s_{t+1}, a')
Because this equation satisfies the Bellman equation, Q can be solved iteratively; but in practice the parameters and states of Q are extremely numerous and solving it directly is often very difficult, so a neural network is used in place of Q, i.e. Q ⇒ Q_θ, and real data are used to continuously minimize the gap between the two sides, that is:
L(θ) = l( Q_θ(s_t, a_t), r_t + max_{a'} Q_θ(s_{t+1}, a') )
where l is a specific error measure; by minimizing this loss, the value of the current state s and action a is truly reflected, so the connection order that maximizes Q(s, a) can be selected each time. Note that the action space shrinks gradually as the whole plan is completed step by step, which differs from conventional reinforcement learning.
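A minimal Python sketch of this temporal-difference loss for one join-order step is shown below; q_theta stands for the Q-network described next, the squared error is just one possible choice of the error measure l, and no discount factor is applied (a plausible reading for finite join episodes, not something the patent specifies).

```python
def td_loss(q_theta, transition):
    """Temporal-difference loss for a single join-order decision (squared-error form of l).

    transition: (state, action, reward, next_state, remaining_tables), where
    remaining_tables is the shrinking action space of tables not yet joined.
    """
    s, a, r, s_next, remaining = transition
    if remaining:                                           # plan unfinished: bootstrap from the next state
        target = r + max(q_theta(s_next, a2) for a2 in remaining)
    else:                                                   # complete plan: the real feedback is the target
        target = r
    return (q_theta(s, a) - target) ** 2
```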
On the basis of reinforcement learning, a target network whose inputs are the state S and the action needs to be designed. For the connection order, S consists of two parts: the sub-plan tree already generated and the tables not yet joined. The generated sub-plan tree can be encoded with the two graph neural network models of the previous section, TreeLSTM and graph convolution; for the tables not yet joined there is no structural relationship with the sub-plan tree, so a tree-structured encoding cannot be used. Meanwhile, the action is an element of the set of unjoined tables and is the key node connecting the current sub-plan with the tables not yet joined; since the quantity to be computed is Q(s, a), the current action plays the linking role, so this part must also be represented in the network architecture.
A graph convolutional neural network is used here as well. Besides the encoding of the current sub-plan tree itself, the join relationships between the current sub-plan and the remaining tables must also be represented: in the state graph, the current sub-plan is treated as a single node, join relationships are then established between it and the remaining tables, forming the graph for this state, and the graph convolution features of this graph are used as the input of the Q-network.
3. Network communication design
The independent modules of the distributed database TiDB are deployed separately and communicate with each other via RPC. As an extended implementation of the TiDB query optimization module, the acquisition module consumes a large amount of IO resources and the algorithms occupy a large amount of computing resources during training and use; therefore, to avoid affecting the normal transactions of TiDB, each of these modules is likewise kept independent.
For compatibility with the network communication module of TiDB, the gRPC protocol is also used for each module of the invention. RPC is a way of requesting an application service from a remote computer over the network, and it makes the calling method transparent to the implementation. gRPC, as an open-source RPC framework, is widely applied in distributed applications. The logical structure of a complete gRPC service is shown in fig. 6:
the data structure and the service interface of the RPC are mainly defined in the proto, wherein the data structure comprises a data field name and a type, so that the data sequence and deserialization are facilitated, and the service interface comprises a service name, an input parameter and a return parameter, so that the interaction between a client and a server is facilitated. And a stub is generated on the client side according to the proto file, so that the remote interface is convenient to call. And generating a service interface at the service side according to the proto file, wherein a user needs to realize a specific function. The client and the server negotiate a specific data API format and communicate therewith.
The steps for implementing a network-invoked service with gRPC are typically as follows:
1. Define the data structures of the service. Because RPC calls are transparent, the data must be serialized for network transmission so that both client and server can use it. Before building a gRPC service, the service is defined with the Protobuf serialization tool, including the composition structure of the data objects, which is usually a combination of some basic data types; at the same time, the name, input parameters and return value of the service interface are defined, similar to functions in a programming language, and are used to call a specific service function.
2. Implement the processing logic of the service. After the serialization work is completed, Protobuf automatically generates framework code for the chosen programming language, and the concrete functions of the server are implemented according to one's own functional logic. The server-side implementation is roughly divided into three stages: the client's input parameters are deserialized from the network request, the computation is completed at the server according to the specific request and the business logic, and the serialized result is finally returned to the client.
3. Implement the remote service call. With the first two steps completed, an RPC service is available, and the client completes a full service request through the following steps: first a network connection to the remote service is established, the parameters of the request and the specified service are constructed, then the request is serialized and sent to the server, and the specific result is obtained after the server processes the request. This completes the life cycle of one RPC call.
In this way a complete gRPC service can be established, and the network communication parts of the modules described below are developed according to these implementation steps.
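As a sketch of these three steps with Python's grpcio, assume a hypothetical Estimate RPC has been declared in a proto file and compiled by protoc into cardinality_pb2 / cardinality_pb2_grpc; all service, message and field names below are illustrative, not the actual interfaces of the system.

```python
from concurrent import futures
import grpc
# Hypothetical modules generated by protoc from the service's .proto definition (step 1).
import cardinality_pb2
import cardinality_pb2_grpc

def run_model(encoded_plan):
    """Placeholder for the cardinality estimation model."""
    return 0

class CardinalityService(cardinality_pb2_grpc.CardinalityServicer):
    def Estimate(self, request, context):
        # Step 2: the request is already deserialized; run the model and return the serialized reply.
        rows = run_model(request.encoded_plan)
        return cardinality_pb2.EstimateReply(cardinality=rows)

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    cardinality_pb2_grpc.add_CardinalityServicer_to_server(CardinalityService(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

def call_remote(plan_bytes):
    # Step 3: connect, build and serialize the request, send it, and read the reply.
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = cardinality_pb2_grpc.CardinalityStub(channel)
        reply = stub.Estimate(cardinality_pb2.EstimateRequest(encoded_plan=plan_bytes))
        return reply.cardinality
```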
4. Data acquisition module
The data acquisition module provides the basic data for the optimization algorithms; its main function is to record the information generated during database operation. Besides meeting each algorithm's requirements for data quantity and quality, it must keep its performance cost low and avoid degrading the overall performance of the database by competing for resources with normal transactions.
The various information in the database may originate from different modules, and the data generation cycle is different, so that the data needs to be organized and managed effectively. In order to satisfy the use of data by each module, the data acquisition module needs to provide two functions of data query and collection.
4.1 Data query
The data query function mainly provides query services to the other modules and returns correct results according to query requests. All data in the storage module are stored in JSON format, that is, each single piece of information is stored as key-value pairs. The data query interface provides query services according to the RESTful interface standard.
The data return format is still the JSON format, and is organized and arranged according to the inheritance relationship, and the requester can access the data according to the affiliated range and the specific name of the data. The interaction mode of each module to the data can be effectively unified through the standardized interface and the data transmission format, and the whole system can conveniently use the data.
4.2 acquisition of Global data
For static data, the information generally includes data with a long change period such as the hardware parameters of the database environment, database architecture information, configuration information and data table meta-information. Such information can be obtained directly through the operating system and the query interfaces provided by the database; the data acquisition module only needs to obtain it through these interfaces as required and then store it according to the unified storage specification.
Since the global data does not change frequently, the collection strategy adopts passive collection, that is, when the relevant change occurs, the initiator of the change needs to notify the collection module, and the collection module can collect the changed data again according to the changed condition. For example, when a data administrator changes the structure of the data table by using a DDL statement, the database scheduling unit initiates a data change notification to the acquisition module after the database is changed, and then the acquisition module acquires new structure information of the data table again to cover old data.
4.3 acquisition of offline data
The off-line data is mainly used for collecting historical state information of the database, the data is generated gradually in the running process of the database, and the generated cycle frequency is not constant. For the subsequent optimization algorithm, the module mainly collects load information generated in the database query process and statistical information of data in the data table, and collects the load information and the statistical information by taking one complete query as a unit. The method mainly comprises statements of data query, an execution plan and finally executed cardinality and cost information.
Because the periodic frequency generated by the off-line data is unstable, the sampling can not be carried out at a certain fixed sampling frequency, and the sampling can only be carried out at a dynamic sampling period. The triggering of sampling is divided into two cases, one is when the system is under low load, and the second is when the effect of the optimization algorithm is not ideal.
In the first case, the data acquisition module constantly monitors the load of the database with a default load threshold; when the overall load of the system is below the threshold, the acquisition module starts collecting data, and when the load rises above the threshold, sampling stops to avoid competing with normal database transactions. In the second case, when the optimization algorithm needs retraining, for example when its effect is obviously poor or the database structure has changed so that the algorithm no longer fits the current database, the model must be readjusted and data must be obtained again. The algorithm module sends a data request to the data acquisition module, which loads qualifying data from the current data cache according to the requested volume; if there are not enough, new data are collected in time. If the algorithm's model remains in a good state, no new training data are needed and the workload of the data module is reduced. The first mode buffers data that are as fresh as possible for the algorithm module while preserving database performance, and the second mode ensures that data are used on demand, reducing unnecessary collection.
4.4 acquisition of Online data
The acquisition of online data must ensure the freshness of the data and meet the data requirements of the online algorithms. The online data acquisition module must be coordinated synchronously with the algorithm so that data can be obtained stably during the algorithm's computation.
In order to maintain online and synchronization with the algorithm, during the operation of the algorithm, the algorithm registers a collection transaction ID with the online data collection, and the data collection module identifies each online algorithm transaction based on the ID. Such as reinforcement learning based join order optimization algorithms, which require the step of tracking each query plan to optimize the model. The data acquisition module establishes buffers respectively according to the currently active transactions for recording some session states of the transactions. When the data acquisition module receives the data request obtained by the algorithm, the data acquisition module acquires data from the database according to the data request and then returns the acquired data. For example, for a connection sequence optimization algorithm, the algorithm needs to know the currently generated sub-plan tree and the estimated cost of the current sub-plan tree in the operation process, and the data acquisition module obtains the relevant information from the database and then returns the relevant information to the algorithm.
Since the amount of data required by an online algorithm is generally small and what matters is obtaining the most recent data, the data acquisition module does not need a large buffer and places little additional pressure on the database.
5. Cardinality estimation module
The cardinality estimation module estimates the cost of query plan trees so that the query optimizer can select, from the candidate plans, the plan with the minimum cost and the best performance. The module is divided into two parts, algorithm training and algorithm use, and the detailed implementation of both parts is described here.
The cardinality estimation module mainly uses historical data and estimates the cost of physical plans with a graph neural network. The data acquisition module collects data during query processing, and the information of each query is organized in a data set S as {query, plan, cardinality/cost}. From this data set the cardinality estimation module builds a supervised learning model that learns the relationship between queries and their cost from historical data, so that the cost of a plan can be estimated effectively.
5.1 data processing and representation
To build a deep network model, the first task is to convert the structured information in the database into numerical feature representations that the network can process. The quality of the data encoding directly determines the final effect of the model: the more information the encoding carries, the more effective the features the model can extract from it.
5.1.1 metadata information representation layer
In a relational database, the metadata is usually defined in advance; it mainly describes the structure of the data tables, the names and types of their columns, and index-related information, and the data contents are organized under these logical structures. Another category of information concerns operators: a query plan tree contains operators of various kinds, and the tasks executed by different operators differ and are relatively independent of one another. Such artificially defined, discrete information is usually represented with one-hot vectors, which effectively distinguish the different categories. For example, for a data table with n columns, the i-th column can be represented by a vector v = [0, ..., 1, ..., 0] in which the i-th position is 1 and the rest are 0. One-hot encoding is simple and does not lose essential information, which is why it is commonly adopted in other studies. To enrich the table information during encoding, many studies add the table's bitmap to the encoding, i.e., positions that satisfy the query condition are set to 1 by sampling and the rest to 0. Although this can have some effect, sampling is time-consuming, and earlier work shows that the bitmap is usually a very sparse matrix, often even an all-zero matrix, which only adds computational overhead.
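Before turning to the bitmap alternative below, a minimal sketch of the one-hot encoding just described; the operator list is an illustrative assumption.

```python
import numpy as np

def one_hot(index, size):
    """One-hot vector with a 1 at `index` and 0 elsewhere."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

# A table with n = 5 columns: column i = 2 is encoded as [0, 0, 1, 0, 0].
column_code = one_hot(2, 5)

# Operator types (scan, join, sort, aggregate, ...) are encoded the same way.
OPERATORS = ["seq_scan", "index_scan", "hash_join", "merge_join", "sort", "agg"]
op_code = one_hot(OPERATORS.index("hash_join"), len(OPERATORS))
```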
Therefore, the traditional bitmap encoding is abandoned in the table encoding; instead, the table information involved in each query is represented indirectly by combining the selectivities of its columns. The specific form is shown in fig. 7.
For a table being queried, the relationship between the predicates and the table is represented by a column selectivity vector. Each column corresponds to a position in the vector whose value is the selectivity of the selection predicates acting on that column, computed directly with the database's own estimation method. This avoids the cost of sampling, converts the original two-dimensional data into a one-dimensional vector, reduces the computation cost, and expresses the filtering effect of each column intuitively. The cardinality is then obtained by combining the results of the individual columns; traditional methods combine them under the independence assumption, which tends to produce underestimates. The associations between columns can instead be modeled with graph convolution, combining the column vectors to obtain comprehensive information about the table. The weights between nodes in the column–column correlation graph are obtained with KCCA by computing the optimal matching mapping, and the largest eigenvalue is taken as the correlation weight between two columns to form an adjacency matrix. Combined with the column selectivity matrix, the features of each table under the query are encoded by graph convolution.
As shown in fig. 8, the column KCCA matrix captures the correlation between the columns of a table: the tighter the relationship between two columns, the larger the value, and the looser the relationship, the smaller the correlation coefficient. After this layer of mapping, the relations between columns are captured effectively; if a column is closely related to other columns, the model can learn how they influence one another and, in turn, the final cardinality estimate. Compared with the traditional approach, this avoids the underestimation caused by the independence assumption and makes effective use of the information, improving the accuracy of cardinality estimation.
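The table encoding just described can be sketched as a single graph-convolution step over the column–column correlation graph; the KCCA weights are assumed to be precomputed, and the standard GCN propagation ReLU(Â X W) stands in for the exact network of figs. 7 and 8.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric GCN normalization  Â = D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0], dtype=A.dtype)
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def encode_table(selectivities, kcca_adj, W):
    """One graph-convolution step over the column correlation graph.

    selectivities -- shape (n_cols, 1): per-column selectivity of the query's predicates
    kcca_adj      -- shape (n_cols, n_cols): column correlation weights (e.g. from KCCA)
    W             -- shape (1, d): learned projection to a d-dimensional feature
    """
    A_hat = normalize_adjacency(kcca_adj)
    return np.maximum(A_hat @ selectivities @ W, 0.0)   # ReLU(Â X W)

# Toy example: 3 columns with predicate selectivities 0.1 / 1.0 / 0.5.
sel = np.array([[0.1], [1.0], [0.5]], dtype=np.float32)
adj = np.array([[0.0, 0.8, 0.1],
                [0.8, 0.0, 0.3],
                [0.1, 0.3, 0.0]], dtype=np.float32)
W = np.random.randn(1, 8).astype(np.float32)
table_features = encode_table(sel, adj, W)    # shape (3, 8)
```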
5.1.2 predicate representation layer
Predicates are typically filter conditions or join conditions consisting of an operator and parameters, where the parameters are usually data columns and constants, such as the filter predicate a.id < 3 or the equi-join of two tables a.id = b.id. A single predicate is usually encoded as a triple corresponding to the predicate operator and its two parameters. The operators represented by nodes in the physical execution plan tree specify concrete computation operations, including scan operators (such as sequential scan and index scan), join operators (such as hash join and nested-loop join), sort operators (such as hash-based sort and merge sort), and aggregation operators (such as grouping and summation). Each operator type can be represented with a one-hot vector. Parameters are also usually one-hot encoded; for constant values, a real number is encoded as itself and a character string is encoded with a word embedding. Predicates usually appear in combination and can be logically composed, for example with "and" and "or", but most current research ignores this information, simply stacking the predicate codes and discarding the logical relationships. Therefore, to extract this information for the predicates on a table, their connection relationships can be built as a tree in which every non-leaf node corresponds to a logical combination of predicates. However, "and" and "or" are not differentiable as such; to solve this, max pooling is chosen as the computation for "and", since "and" usually expresses coexistence and max selects the largest feature in each dimension as the final result, which matches its character, and similarly the "or" operator is computed with min pooling.
Fig. 9 illustrates a predicate join relationship of complex form: the leaf nodes are the triple encodings of individual predicates, a predicate encoding tree is constructed following the organization of their logical relationships, and it is characterized and computed using max and min pooling layers.
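A minimal sketch of the predicate encoding tree, with max pooling for "and" and min pooling for "or" as described above; the triple encodings of the leaves are assumed to be given.

```python
import numpy as np

def encode_leaf(op_code, left_code, right_code):
    """Leaf = triple (operator, parameter 1, parameter 2), concatenated."""
    return np.concatenate([op_code, left_code, right_code])

def encode_predicate_tree(node):
    """Recursively encode a predicate tree.

    node is either ("leaf", vector) or ("and"/"or", [children]).
    "and" is computed with element-wise max pooling and "or" with min pooling,
    so that the combination stays differentiable.
    """
    kind = node[0]
    if kind == "leaf":
        return node[1]
    child_codes = np.stack([encode_predicate_tree(c) for c in node[1]])
    if kind == "and":
        return child_codes.max(axis=0)
    if kind == "or":
        return child_codes.min(axis=0)
    raise ValueError(f"unknown node kind: {kind}")

# (a.id < 3) AND (a.city = 'NY' OR a.city = 'SF'), with pre-computed triple codes:
p1, p2, p3 = (np.random.rand(12).astype(np.float32) for _ in range(3))
tree = ("and", [("leaf", p1), ("or", [("leaf", p2), ("leaf", p3)])])
predicate_code = encode_predicate_tree(tree)
```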
5.2 implementation of neural network model
The foregoing described in detail how to design the encoding and representation layers for the metadata information and the predicates. The metadata representation layer extracts the structural information of the data tables and the association characteristics of the data touched by the query; the predicate representation layer obtains the selection characteristics of the query from the predicates. Combining this information, the network model can build a behavioral model of cardinality estimation and thereby improve its accuracy. The overall cardinality estimation model, combining the network structures of the representation layers, is shown in FIG. 10.
The model consists mainly of two parts: a representation layer for the query plan tree and a representation layer for the global connection relation. The query plan tree representation layer extracts the structural information of the query from the physical execution plan tree. The connection relation representation layer extracts the data distribution characteristics of the query from the correlations of the joined columns. Finally, the information of the two representation layers is combined through a fully connected neural network to obtain the estimated cardinality or cost.
According to the execution flow of the volcano model, the input of each node comes from the output of its children and data flows one way from bottom to top; to match this computation model, the physical plan tree is represented with a TreeLSTM. A leaf node of the query plan tree is usually a table scan operation. Since most predicate conditions are pushed down to the table scan during logical optimization, a large number of selection conditions attach to this node, so the representation layer of the joined table is chosen as its encoding input. That representation layer is obtained mainly from the predicate tree features together with the graph convolution over the column selectivities and column correlations, forming the query-related features of the table. Non-leaf nodes are usually join-related operators composed of the left and right join columns and the join condition; according to the structure of the plan tree, the left and right subtrees are the joined data sources (i.e., the subtrees containing the join columns). TreeLSTM fuses this information and keeps passing lower-level information upward along the tree, yielding the features of the whole query plan tree.
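The bottom-up plan-tree encoding can be sketched with a child-sum TreeLSTM cell as below; the exact cell variant, the dimensions, and the padding of operator/table codes to a common input width are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTM(nn.Module):
    """Bottom-up encoding of a plan tree (child-sum TreeLSTM, simplified sketch)."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.iou = nn.Linear(in_dim + hid_dim, 3 * hid_dim)  # input, output, update gates
        self.f = nn.Linear(in_dim + hid_dim, hid_dim)        # one forget gate per child
        self.hid_dim = hid_dim

    def forward(self, node):
        """node = (x, children): x is the operator/table encoding of this plan node."""
        x, children = node
        child_states = [self.forward(c) for c in children]
        if child_states:
            h_sum = torch.stack([h for h, _ in child_states]).sum(dim=0)
        else:
            h_sum = torch.zeros(self.hid_dim)
        i, o, u = torch.chunk(self.iou(torch.cat([x, h_sum])), 3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        c = i * u
        for h_k, c_k in child_states:
            f_k = torch.sigmoid(self.f(torch.cat([x, h_k])))
            c = c + f_k * c_k
        h = o * torch.tanh(c)
        return h, c

# Leaf nodes carry the joined-table representation, inner nodes the one-hot join operator
# (padded to a common in_dim); the root's hidden state summarizes the whole plan tree.
```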
The representation layer of the global connection relation reflects the fact that the correlation between columns differs from pair to pair, so the cardinalities produced by different join conditions differ greatly. To capture this, the association relationships between columns must be constructed; here they are represented by an adjacency matrix of association weights. Meanwhile, considering the influence of different join predicates and columns, for a given query this information is assembled into a column join relation vector from the join columns and join predicate features involved in the query. Feature extraction is performed by graph convolution, so that features can be extracted from the column join vectors according to the correlations of the different columns. After the model is built, it is trained and tuned with offline data; the specific algorithm is shown in fig. 11. From each batch of historical queries and execution results, the association between the query plan and the data joins is analyzed, the model computes the features in the training step to obtain an estimated cardinality, and the model parameters are continuously optimized according to the error between the estimate and the true cardinality until the model converges, yielding an estimation model suited to the current data set.
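A sketch of the offline training loop of fig. 11, under the assumption that the loss is the squared error on log cardinality; the actual error criterion and optimizer of the patent's algorithm may differ.

```python
import torch
import torch.nn as nn

def train_cardinality_model(model, batches, epochs=20, lr=1e-3):
    """Offline training sketch.

    model   -- maps an encoded (plan, join-graph) pair to log(cardinality)
    batches -- list of (features, true_cardinality) tensor pairs from the data module
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        total = 0.0
        for features, true_card in batches:
            pred_log_card = model(features)
            loss = loss_fn(pred_log_card, torch.log(true_card))  # error vs. real cardinality
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: loss {total / max(len(batches), 1):.4f}")
```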
5.3 integration of cardinality estimation with the database
In TiDB, the physical optimization stage turns the logical plan tree into a physical plan executable by the database according to the principle of cost minimization. TiDB proceeds top-down with a memoization-based search. The specific search method is as follows:
1. Starting from the root node, search downward and explore the possible physical operators for the logical operator of each node; for a logical join, for example, the physical operator may be a hash join, a merge join, a nested-loop join, and so on.
2. Continue exploring the possible execution schemes of the sub-plans downward under the constraints of each physical operator. For example, for a merge join the input tables must be ordered on the join keys, so this property must be guaranteed when searching further down.
3. After a complete path has been searched, the current plan is stored in a hash table; if the same sub-plan is encountered again during the search, it is returned directly, avoiding repeated searching.
4. Among all possible plans, the cost is calculated from the cardinalities and the plan with the minimum cost is selected. In TiDB, the cardinality of a plan is estimated with a histogram-based probability estimation method; for join operations, if no cardinality is available for the relevant columns, a selectivity of 0.8 is used. Given the cardinality estimates, the cost of the plan tree is the sum of the costs of its compute nodes, and the cost formula of each compute node can be summarized as:
cost = Cardinality * (CPUFactor + MemoryFactor + DiskFactor + NetworkFactor)
The way the cardinality is computed differs somewhat between compute nodes, but all follow the rule above. To integrate the cardinality and cost estimation algorithm of the present invention into TiDB, the cost calculation needs to be modified. TiDB implements a GetCost method for each physical operator whose input is a physical plan and whose output is a real number, so it is only necessary to replace this calculation with the model of the invention.
An RPC interface for the cardinality and cost computation service is defined, consistent with TiDB's own interface, as shown in fig. 12.
The cardinality estimation module deploys the model learned from historical data into the online environment as a computation service. After receiving an RPC request for a cost calculation, the service encodes the physical plan as described above, feeds it to the model to obtain the estimate, and returns the estimate to the caller, completing one cost calculation. The algorithm control module periodically sends newly acquired data to the cost estimation module; the cardinality module uses it to verify the real performance of the current model, and if the accuracy is clearly below the set threshold, a new round of training is started and the current model is updated, avoiding inaccurate plan estimates caused by excessive errors. The specific flow of the cardinality estimation module is shown in fig. 13.
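Since the concrete interface is given in fig. 12 and not reproduced here, the following Python-side handler only illustrates the request/response flow; the class, the method names and the retraining threshold are hypothetical.

```python
class CostEstimationService:
    """Illustrative handler for the cardinality/cost RPC (names are hypothetical)."""

    def __init__(self, model, encoder, error_threshold=2.0):
        self.model = model                    # trained cardinality/cost model
        self.encoder = encoder                # physical plan -> feature encoding
        self.error_threshold = error_threshold

    def GetCost(self, physical_plan):
        """Encode the physical plan, run the model and return the estimated cost."""
        features = self.encoder.encode(physical_plan)
        return float(self.model(features))

    def Validate(self, recent_samples):
        """Periodic check by the algorithm control module: retrain if accuracy drops."""
        errors = [abs(self.GetCost(plan) - true_cost) / max(true_cost, 1e-6)
                  for plan, true_cost in recent_samples]
        if errors and sum(errors) / len(errors) > self.error_threshold:
            self.retrain(recent_samples)

    def retrain(self, samples):
        ...   # start a new round of training with the newly collected data
```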
6. Connection sequence optimization module
The optimization of the join order occupies a very important position in query optimization, because the order of the joins directly determines the overall structure of the plan tree. A good optimization algorithm must ensure both the execution efficiency and the accuracy of the algorithm, but the two are often hard to achieve at the same time.
6.1 data processing and representation
Unlike cardinality estimation, join order optimization occurs in the logical optimization stage. The plan tree being processed is a logical plan tree whose nodes have not yet been assigned physical operators, so it carries slightly less information than the physical plan tree, and the data encoding and representation layers of the cardinality estimation model can be reused.
Because the model of the present invention is based on reinforcement learning, the optimizer is expected to obtain the largest possible reward, but the feedback for a query plan is its execution time, and shorter is better, so the negative of the time is taken as the reward. At the same time, execution times differ enormously between plans, from nanoseconds to seconds; to keep the output stable, the times are converted to a common unit and then taken logarithmically, i.e., t_out = log(t_true).
Moreover, in reinforcement learning the algorithm has to be optimized with the real execution efficiency, and what is ultimately executed is the physical plan rather than the logical plan, so the physical plan must be parsed back into the logical plan to obtain the execution time of each step of the logical plan and thereby optimize the join order.
6.2 neural network model construction
The model must judge the effect of an action from the current state S, optimize the algorithm, and select the best join order from the result. Specifically, S represents the join order determined so far, and the action represents selecting one of the not-yet-joined tables to join with the current sub-plan, producing the next state. Since Q(S, a) represents the expected reward, after taking action a in the current state, until the whole plan is completed, the action with the largest Q value can be chosen at every step until the complete plan is built.
Each choice of action depends both on the current sub-plan and on the remaining unjoined tables. A join operation is related not only to the columns, predicates and types of the joins to be performed; the association between columns also largely determines the efficiency of the join, so in addition to the local features, the global association relationships between columns must be considered. Because a join crosses tables, there is no fixed mapping between the columns and the association between two columns cannot be obtained with the correlation analysis method KCCA; instead, the cardinality of the join is computed directly, and the association weight is then calculated from the cardinalities of the tables. This is shown in detail in fig. 14.
For column a in tables A and B, to compute their join association, an equi-join can first be performed on the two columns; if the result has size K and the Cartesian product of the two columns has size M × N, their ratio is used as the association weight W ∈ [0, 1]. In practice each table is very large and a full-table join for every column pair is inefficient, so sampling is used and only a certain amount of data is joined. Because the cardinality of each table differs, sampling is done at a fixed proportion, 30% by default; the equi-join is then performed between the two sampled columns, and the ratio of the result size to the product of the two sampled sizes is used as the weight. In this way the weight matrix W over all columns of the database is finally obtained, with the specific formula:
W_ij = K_ij / (M_i × N_j)
where i and j are the indices of any two columns, W is the adjacency matrix whose entry is the sampled join weight between the two columns, K_ij is the size of the equi-join result between the sampled columns, and M_i and N_j are the sizes of the two samples. It should be noted that when the number of columns in the database is very large, the computation above becomes huge; to reduce the overhead, only pairs with a higher join probability are computed and the remaining entries are simply set to 0. Pairs with a higher join probability include two columns of the same value type and pairs for which a join operation appears in the query load.
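A sketch of the sampling-based weight computation, assuming the two columns are available as (non-empty) lists of values; the 30% ratio is the default mentioned above.

```python
import random

def join_weight(col_a, col_b, sample_ratio=0.3):
    """Sampling-based association weight between two join columns.

    W = |sample(A) equi-join sample(B)| / (|sample(A)| * |sample(B)|), so W ∈ [0, 1].
    col_a, col_b -- lists of column values
    """
    sa = random.sample(col_a, max(1, int(len(col_a) * sample_ratio)))
    sb = random.sample(col_b, max(1, int(len(col_b) * sample_ratio)))
    counts = {}
    for v in sb:                           # multiset of the right-hand sample
        counts[v] = counts.get(v, 0) + 1
    k = sum(counts.get(v, 0) for v in sa)  # size of the equi-join result
    return k / (len(sa) * len(sb))

# Only column pairs with the same value type, or pairs that actually appear in join
# predicates of the query load, are computed; all other entries of W default to 0.
```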
To let the Q-network judge more accurately from the current state whether an action is good, graph convolution is again used to extract features from the state and the action. At each step an unjoined table is selected to join with the current sub-plan, producing a new sub-plan that is then joined with the remaining tables. To compute Q(S, a), S and a are joined to obtain the new subtree, and two graph vectors are generated, one for the columns already joined and one for the columns not yet joined, the elements of the vectors being the column encodings. Graph convolution is then performed over these vectors using the column association adjacency matrix, and the results are concatenated and fed into the Q-network. For convenience, the convolution can be computed uniformly over all columns and then multiplied element-wise by a mask matrix, with 1 for columns that appear and 0 for those that do not. Fig. 15 illustrates the neural-network-based Q-network model for join order selection.
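A sketch of the feature construction for Q(S, a): one graph convolution over all column codes, masked into "already joined" and "not yet joined" parts and concatenated; the pooling by summation is an assumption.

```python
import numpy as np

def qnetwork_input(col_codes, join_adj, joined_mask, pending_mask):
    """Build the state/action features fed to the Q-network (illustrative sketch).

    col_codes    -- (n_cols, d) encodings of every column in the database
    join_adj     -- (n_cols, n_cols) sampled join-weight adjacency matrix W
    joined_mask  -- (n_cols,) 1 for columns already in the current sub-plan plus action
    pending_mask -- (n_cols,) 1 for columns of tables not yet joined
    """
    conv = np.maximum(join_adj @ col_codes, 0.0)          # one convolution over all columns
    joined = conv * joined_mask[:, None]                  # keep only columns that appear
    pending = conv * pending_mask[:, None]
    return np.concatenate([joined.sum(axis=0), pending.sum(axis=0)])  # splice the two parts
```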
Besides the column association features, the structure of the tree is also important reference information for the state S. In the same way as the physical execution plan tree is encoded with the TreeLSTM graph neural network described above, the TreeLSTM operation is applied to the logical plan tree in a bottom-up direction. Finally, by aggregating the features of the current plan tree, the selected join table, and the new plan tree after the join, the Q-network predicts through a fully connected layer the expected reward (the negative logarithm of the cost) of the action of joining the currently selected table. The decision algorithm selects, among all candidate join tables, the one with the largest reward as the best join in the current state. This establishes a Q-network for join order selection based on the graph neural network and yields a decision method for the join order. The specific training algorithm of the model is shown in fig. 16.
The training algorithm stays online, and every call of the algorithm is also an update of the algorithm. For the join optimization of one query, the join plan tree is built bottom-up: each time, the set of tables still to be joined is traversed, the join table with the largest reward according to the Q-network is chosen as the current plan, the chosen table is removed from the set, and the complete plan is produced by iterating in this way. After the finished plan is executed, the real latency fed back by the database is obtained and the model is continuously adjusted according to it. In practice, however, a fully online mode is affected by errors in the system environment and in sampling, which leads to a large variance in the results, so the method adopts a double Q-network structure with a decision network and a target network; the specific structure is shown in fig. 17.
The decision network Q_π determines the expected reward of action a (the join table) in the current state, and the target network Q'_π predicts the reward value after the table is joined. When Q_π has become stable, the target network is updated, which avoids the excessive variance that online updates would otherwise suffer from an insufficient number of samples.
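A sketch of the double Q-network bookkeeping: the target network is refreshed from the decision network once the latter is stable and supplies the one-step target value; gamma and the hard/soft update factor are assumptions.

```python
import torch

def update_target(decision_net, target_net, tau=1.0):
    """Copy (or softly blend) the decision network into the target network once the
    decision network has stabilized; tau = 1.0 is a hard update."""
    with torch.no_grad():
        for p_t, p_d in zip(target_net.parameters(), decision_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p_d)

def td_target(target_net, reward, next_features, done, gamma=1.0):
    """One-step target: r + gamma * max_a' Q'(s', a') predicted by the target network."""
    if done:
        return torch.tensor(reward)
    with torch.no_grad():
        return reward + gamma * target_net(next_features).max()

# Typical setup: the target network starts as a frozen copy of the decision network,
# e.g. target_net = copy.deepcopy(decision_net), and is refreshed only periodically.
```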
6.3 connection order optimization in combination with TiDB

Join ordering in TiDB takes place in the logical optimization stage; when the number of joined tables is large, a greedy optimization algorithm is used (the default threshold is 0, i.e., the greedy algorithm is used directly), and the algorithm model of the invention is integrated into TiDB's greedy optimization process. The reinforcement-learning-based algorithm requires the real feedback of the physical execution plan as its optimization basis, but this stage in TiDB is the logical optimization stage and the data being processed is the logical plan tree, so the physical plan must be converted back to the logical plan tree to unify the two. Moreover, online reinforcement learning needs timely feedback for every step of the operation, whereas the database only executes once a complete plan has been formulated, so per-step feedback cannot be obtained directly. To handle this, the state of every step is saved in a cache; after the database has actually executed the plan, the complete plan is decomposed and the execution time of each step is assigned back to its state. The algorithm finally optimizes the model with this data; the execution flow of the algorithm is shown in fig. 18.
The optimization algorithm also follows the framework of the greedy algorithm, selecting the best join table from the unjoined tables at each step; TiDB's original evaluation criterion is the computed cardinality, whereas the algorithm of the invention computes the score with the model. The interaction between the algorithm and the model also uses the RPC service; the interface for selecting the join table is defined as shown in fig. 19.
At each call, the caller passes the currently joined sub-plan and the array of tables that can be joined with it as input parameters. After receiving the request, the model encodes the current state in the manner described above, computes one by one the Q value of taking each table as the action, selects the table with the largest value, arg max_{a∈Actions} Q(CurrentState, a), and returns it to the caller, completing one call of the algorithm.
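A sketch of the greedy selection loop driven by the Q-network; encode_state and the seeding of the first table are assumptions standing in for the encoding above and the TiDB-side call of figs. 18 and 19.

```python
def choose_join_order(qnetwork, encode_state, tables):
    """Greedy construction of the join order: at every step, ask the Q-network for the
    expected reward of joining each remaining table and pick the best one.
    encode_state(subplan, table) is assumed to build the features described above."""
    remaining = list(tables)
    subplan = remaining.pop(0)        # seed the sub-plan with the first table
    order = [subplan]
    while remaining:
        scores = {t: float(qnetwork(encode_state(subplan, t))) for t in remaining}
        best = max(scores, key=scores.get)      # arg max over candidate actions
        order.append(best)
        remaining.remove(best)
        subplan = (subplan, best)               # the new sub-plan joins the chosen table
    return order
```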
The foregoing shows and describes the general principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and the description only illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A database query optimization method based on a graph neural network is characterized by comprising the following steps:
data acquisition: sampling is adjusted reasonably according to how the data is used, comprising: collecting information generated throughout the operation of the system and accumulating the data required for subsequent algorithm training;
cardinality and cost estimation: extracting the structural characteristics of the query plan tree with a tree-shaped graph neural network, and extracting the association relations and join topology of the data columns with a graph convolution network, thereby establishing a cardinality and cost estimation network model; the cardinality and cost estimation network model comprises a representation layer of the global connection relation and a representation layer of the query plan tree; the global connection relation representation layer constructs a column–column connection correlation graph by a sampling method; the query plan tree representation layer extracts the query structure information from the physical execution plan tree, the physical execution plan tree being represented with the tree neural network TreeLSTM, one-hot coding being used for the join operators of non-leaf nodes, and a table data representation layer being used for the joined tables at the leaf nodes; the table data representation layer comprises a column correlation graph constructed by linear correlation analysis, a column matrix constructed by the database's existing cardinality estimation method, and a query predicate tree in which "and" and "or" are computed with max and min pooling;
optimizing the connection sequence: extracting query features and connection sequence features with a graph convolution algorithm, sensing the real environment of the database with a reinforcement learning algorithm, and continuously optimizing the model with the real feedback, thereby improving the effect of connection sequence decisions; in the reinforcement learning, the connection sequence optimizer acts as the agent, the database environment is the real world being perceived, selecting a certain connection sequence is the action of the agent, and the real execution time of the plan corresponding to that sequence is the reward of the current action; during optimization, the data encoding and representation layer of the logical plan tree uses the same model as in the cardinality estimation; the reinforcement learning further comprises performing graph convolution over vectors using the column connection weight relation graph, concatenating the results and feeding them into a Double Q-Network structure comprising a decision network and a target network, and updating the target network after the decision network has become stable so as to avoid the excessive variance caused by an insufficient number of samples during online updating;
and establishing a cardinality estimation model and a connection sequence optimization model by learning the data distribution characteristics, association relations and database environment characteristics, and integrating the algorithms into the TiDB distributed relational database.
2. The database query optimization method based on a graph neural network according to claim 1, wherein the data acquisition comprises the acquisition of global data, of offline data and of online data; the acquisition of global data is actively triggered, modified data are re-sampled and old data are replaced; the acquisition of offline data maintains a data pool, samples the relevant data on a large scale, and decides whether to sample according to the effect of the algorithm; the acquisition of online data takes place when the algorithm actively requests data: the data are loaded from the database environment according to the request and then returned to the algorithm, completing one data synchronization.
3. The database query optimization method based on a graph neural network according to claim 1, wherein the cardinality and cost estimation network model performs feature extraction on the query plan tree with a tree structure and performs feature extraction on the relations between columns with a GCN graph neural network model, and a cardinality estimation network model combining the two graph structures is designed and realized.
4. The database query optimization method based on a graph neural network according to claim 1, further comprising a network remote call, realized with gRPC to facilitate access to the database, comprising the following steps:
step 1: defining the data structures of the service, including the composition of the data objects, and defining the name, input parameters and return value of the service interface with the Protobuf serialization tool, for calling the specific service function;
step 2: implementing the processing logic of the service: the client's input parameters are deserialized from the network request, the computation is completed at the server according to the specific request and the business logic, and the serialized result is finally returned to the client;
step 3: realizing the remote service call: a network connection request to the remote service is first established, the parameters of the request and the specified service are constructed, the request is then serialized and sent to the server, and the specific result is obtained after the server has processed the request.
5. A database query optimization system based on a graph neural network, for realizing the database query optimization method based on a graph neural network according to any one of claims 1 to 4, extended on the architecture of TiDB, characterized by comprising a data acquisition module, an algorithm control module, a cardinality estimation module and a connection sequence selection module; the data acquisition module is used for collecting information generated throughout the operation of the system, accumulating the data required for subsequent algorithm training, and storing the intermediate information generated during one query; the algorithm control module determines, according to the execution efficiency of the algorithm, whether the model needs to be re-optimized, thereby ensuring that the model works normally; the cardinality estimation module is used for predicting the size of the final result of a physical plan and making a cost estimate based on the cardinality; the connection sequence selection module optimizes the join order between the tables in the execution plan tree and, according to the reinforcement learning method, dynamically optimizes the network from the system feedback in an online learning mode.
6. The graph neural network based database query optimization system of claim 5, wherein the data acquisition module provides a unified data query interface and returns data in a normalized format.
7. The graph neural network-based database query optimization system of claim 5, wherein the cardinality estimation module comprises an offline training unit and an online prediction unit; the offline training unit obtains offline data pairs from the data acquisition module and continuously trains and optimizes the model using the difference between the prediction and the real value as the loss; the online prediction unit is used by the query optimizer to judge the quality of candidate plans during physical optimization, and when the prediction result deviates from the real result, the training process is restarted so that the model updates itself to adapt to the current environment.
8. The database query optimization system based on the graph neural network as claimed in claim 5, wherein the connection sequence selection module integrates online training and prediction, is based on a reinforcement learning model, uses an online learning mode, and optimizes every operation step according to the real feedback of the system.
CN202110491751.8A 2021-05-06 2021-05-06 Database query optimization method and system based on graph neural network Active CN113010547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110491751.8A CN113010547B (en) 2021-05-06 2021-05-06 Database query optimization method and system based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110491751.8A CN113010547B (en) 2021-05-06 2021-05-06 Database query optimization method and system based on graph neural network

Publications (2)

Publication Number Publication Date
CN113010547A CN113010547A (en) 2021-06-22
CN113010547B true CN113010547B (en) 2023-04-07

Family

ID=76380619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110491751.8A Active CN113010547B (en) 2021-05-06 2021-05-06 Database query optimization method and system based on graph neural network

Country Status (1)

Country Link
CN (1) CN113010547B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398395A (en) * 2022-01-19 2022-04-26 吉林大学 Radix cost estimation method based on attention mechanism
CN114881654B (en) * 2022-05-18 2023-04-11 天津大学 Efficient block chain transaction query technology based on radix estimation method
CN117472942A (en) * 2022-07-18 2024-01-30 中兴通讯股份有限公司 Cost estimation method, electronic device, storage medium, and computer program product
CN115134326B (en) * 2022-08-29 2022-11-04 中航信移动科技有限公司 Data processing device for generating messages to be sent
CN115130043B (en) * 2022-08-30 2022-11-25 智因科技(深圳)有限公司 Database-based data processing method, device, equipment and storage medium
CN115640278B (en) * 2022-09-30 2023-08-08 北京柏睿数据技术股份有限公司 Method and system for intelligently optimizing database performance
CN116561173B (en) * 2023-07-11 2023-10-13 太原理工大学 Method and system for selecting query execution plan by using relational graph and attention neural network
CN116975032B (en) * 2023-07-14 2024-04-12 南京领行科技股份有限公司 Data alignment method, system, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107271925A (en) * 2017-06-26 2017-10-20 湘潭大学 The level converter Fault Locating Method of modularization five based on depth convolutional network
CN110750560A (en) * 2019-10-25 2020-02-04 东北大学 System and method for optimizing network multi-connection
CN111597209A (en) * 2020-04-30 2020-08-28 清华大学 Database materialized view construction system, method and system creation method

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927346B (en) * 2014-03-28 2017-02-15 浙江大学 Query connection method on basis of data volumes
US9996574B2 (en) * 2015-06-30 2018-06-12 International Business Machines Corporation Enhancements for optimizing query executions
US10706354B2 (en) * 2016-05-06 2020-07-07 International Business Machines Corporation Estimating cardinality selectivity utilizing artificial neural networks
CN107239541B (en) * 2017-06-02 2020-02-14 星环信息科技(上海)有限公司 Cost estimation method and device
CN111949631A (en) * 2019-05-14 2020-11-17 华为技术有限公司 Method and device for determining configuration parameters of database
CN111581454B (en) * 2020-04-27 2023-05-23 清华大学 Parallel query performance prediction system and method based on depth map compression algorithm
CN111611274A (en) * 2020-05-28 2020-09-01 华中科技大学 Database query optimization method and system
CN112328578B (en) * 2020-11-26 2023-03-28 电子科技大学 Database query optimization method based on reinforcement learning and graph attention network
CN112749191A (en) * 2021-01-19 2021-05-04 成都信息工程大学 Intelligent cost estimation method and system applied to database and electronic equipment


Also Published As

Publication number Publication date
CN113010547A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
CN113010547B (en) Database query optimization method and system based on graph neural network
JP2022120014A (en) System and method for capturing data and facilitating user access to data
US8108399B2 (en) Filtering of multi attribute data via on-demand indexing
CN111581454B (en) Parallel query performance prediction system and method based on depth map compression algorithm
CN109491989B (en) Data processing method and device, electronic equipment and storage medium
WO2019185039A1 (en) A data processing method and electronic apparatus
CN112765477B (en) Information processing method and device, information recommendation method and device, electronic equipment and storage medium
CN103336790A (en) Hadoop-based fast neighborhood rough set attribute reduction method
KR102565455B1 (en) Domain-specific language interpreter and interactive visual interface for rapid screening
US20220100963A1 (en) Event extraction from documents with co-reference
CN103336791A (en) Hadoop-based fast rough set attribute reduction method
CN113254630B (en) Domain knowledge map recommendation method for global comprehensive observation results
US20220100772A1 (en) Context-sensitive linking of entities to private databases
EP4075292A1 (en) Method and apparatus for processing database
WO2023016537A1 (en) Database management system, data processing method, and device
CN114356971A (en) Data processing method, device and system
CN111897891B (en) Data processing method and device
CN114911844B (en) Approximate query optimization system based on machine learning
CN112860685A (en) Automatic recommendation of analysis of data sets
EP4222635A1 (en) Lifecycle management for customized natural language processing
US20220100967A1 (en) Lifecycle management for customized natural language processing
CN115617830A (en) Data query optimization processing method and device based on machine learning
CN116150437B (en) Graph query method
Lyu et al. Fine-grained modeling and optimization for intelligent resource management in big data processing
WO2023134329A1 (en) Index selection method, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant