CN117472942A - Cost estimation method, electronic device, storage medium, and computer program product - Google Patents

Cost estimation method, electronic device, storage medium, and computer program product

Info

Publication number
CN117472942A
Authority
CN
China
Prior art keywords
information
node
edge
coding
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210840332.5A
Other languages
Chinese (zh)
Inventor
林阳
屠要峰
韩银俊
陈正华
徐墨
郭斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN202210840332.5A priority Critical patent/CN117472942A/en
Priority to PCT/CN2023/102435 priority patent/WO2024016946A1/en
Publication of CN117472942A publication Critical patent/CN117472942A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides a cost estimation method, an electronic device, a storage medium, and a computer program product. The method comprises the following steps: acquiring a query plan tree to be estimated, wherein the query plan tree comprises a plurality of execution nodes and at least one connecting edge for connecting adjacent execution nodes; performing coding processing on each execution node to obtain node coding information; performing coding processing on each connecting edge to obtain edge coding information corresponding to each connecting edge, wherein the edge coding information is used for representing data transmission information between the two execution nodes connected by the connecting edge; constructing graph structure data according to the node coding information and the edge coding information; inputting the graph structure data into a trained graph neural network model to obtain vertex characteristic data; and inputting the vertex characteristic data into a trained cost estimation model to obtain a cost estimation result. The embodiment of the invention can make full use of all the information of the query plan tree and makes cost estimation based on an artificial intelligence algorithm applicable to a novel hardware environment.

Description

Cost estimation method, electronic device, storage medium, and computer program product
Technical Field
The present invention relates to the field of database technologies, and in particular, to a cost estimation method, an electronic device, a storage medium, and a computer program product.
Background
The query optimizer is a vital part of the database system: it selects the least costly query plan from a plurality of candidate query plans to execute. With the mature development of artificial intelligence (Artificial Intelligence, AI) technology, the strong learning capability of deep learning can effectively solve the problems in traditional cost estimation. An AI-based cost estimation method can reduce the error of cost estimation while avoiding the problem of configuring cost reference units through manual intervention. An existing AI-based cost estimation method processes feature vectors of a query plan tree through a tree-structured neural network (Tree-structured Long Short-Term Memory, Tree-LSTM) to obtain a cost estimation result. Although this solves the problems encountered by traditional cost estimation and improves the accuracy of cost estimation, it is not suitable for an environment containing novel hardware, because it does not consider problems such as the difference in computing capability between novel hardware and common hardware or the cost of data transmission between them. In addition, when the existing method uses the Tree-LSTM, its input only contains the coding information of the nodes of the query plan tree, so all the information of the query plan tree in the novel hardware environment cannot be fully utilized. Therefore, how to make the existing cost estimation based on an artificial intelligence algorithm applicable to the novel hardware environment while fully utilizing all the information of the query plan tree is a problem that needs to be considered at present.
Disclosure of Invention
The embodiment of the invention provides a cost estimation method, an electronic device, a storage medium, and a computer program product, which can make full use of all the information of a query plan tree and make cost estimation based on an artificial intelligence algorithm applicable to a novel hardware environment.
In a first aspect, an embodiment of the present invention provides a cost estimation method, where the method includes:
acquiring a query plan tree to be estimated, wherein the query plan tree comprises a plurality of execution nodes and at least one connecting edge for connecting adjacent execution nodes;
performing coding processing on each execution node to obtain node coding information corresponding to each execution node;
performing coding processing on each connecting edge to obtain edge coding information corresponding to each connecting edge, wherein the edge coding information is used for representing data transmission information between two executing nodes connected with the connecting edge;
constructing graph structure data according to the node coding information corresponding to each execution node and the edge coding information corresponding to each connecting edge;
inputting the graph structure data into a trained graph neural network model to obtain vertex characteristic data;
and inputting the vertex characteristic data into a trained cost estimation model to obtain a cost estimation result.
In a second aspect, an embodiment of the present invention provides an electronic device, including:
a processor and a memory;
the memory has stored thereon program instructions which, when executed by the processor, cause the processor to perform the cost estimation method as described in the first aspect above.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing program instructions that, when executed by a computer, implement the cost estimation method as described in the first aspect above.
In a fourth aspect, embodiments of the present invention provide a computer program product storing program instructions which, when executed by a computer, cause the computer to implement the cost estimation method as described in the first aspect above.
In the embodiment of the invention, cost estimation in a novel hardware environment is realized through the graph neural network model, and in the process of constructing the graph structure data, the coding information of the vertices and edges of the graph contains the important information related to the novel hardware. According to the scheme of the embodiment of the invention, a query plan tree to be estimated is first obtained, wherein the query plan tree comprises a plurality of execution nodes and at least one connecting edge for connecting adjacent execution nodes. Then, in order to fully utilize all the information of the query plan tree, each execution node is subjected to coding processing to obtain node coding information corresponding to each execution node, and each connecting edge is subjected to coding processing to obtain edge coding information corresponding to each connecting edge, wherein the edge coding information is used for representing data transmission information between the two execution nodes connected by the connecting edge. Next, graph structure data is constructed according to the node coding information corresponding to each execution node and the edge coding information corresponding to each connecting edge, so that the node information and the edge information of the query plan tree are preserved simultaneously and more sufficient information is provided for subsequent operations. The graph structure data is input into the trained graph neural network model to obtain vertex characteristic data, and the vertex characteristic data is input into the trained cost estimation model to obtain a cost estimation result. Compared with existing feature extraction methods for query plan trees, the embodiment of the invention constructs the query plan tree into graph structure data, preserves the node information and the edge information of the query plan tree simultaneously so as to fully utilize all the information of the query plan tree, adapts the method to the novel hardware environment, and provides an intelligent cost estimation method for a database query optimizer in the novel hardware environment.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate the invention and do not limit it.
Fig. 1 is a flow chart of a cost estimation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a query plan tree provided by an embodiment of the present invention;
fig. 3 is a schematic flow chart of step S120 provided in the embodiment of the present invention;
FIG. 4 is a schematic diagram of encoding of an execution node of a query plan tree provided by an embodiment of the present invention;
fig. 5 is a schematic flow chart of step S130 according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating encoding of a connection edge of a query plan tree according to an embodiment of the present invention;
fig. 7 is a schematic flow chart of step S140 according to an embodiment of the present invention;
fig. 8 is a schematic flow chart of step S142 according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of graph structure data provided in an embodiment of the present invention;
FIG. 10 is a schematic diagram of a process for obtaining feature information of vertices in graph structure data according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a process of cost estimation for vertices in graph structure data according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be appreciated that in the description of embodiments of the present invention, the descriptions of "first," "second," etc. are for the purpose of distinguishing between technical features only and are not to be construed as indicating or implying relative importance, the number of technical features indicated, or the precedence of the technical features indicated. "At least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relation of associated objects and indicates that three kinds of relations may exist; for example, "A and/or B" may indicate that A exists alone, A and B both exist, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following" and similar expressions mean any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, and c may represent: a, b, c, a and b, a and c, b and c, or a and b and c, wherein a, b, and c may each be single or multiple.
In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The query optimizer is a vital part of the database system: it selects the least costly query plan from a plurality of candidate query plans to execute. A conventional Cost Estimation (CE) model is a simple formula consisting of a number of different cost reference units and the estimated cardinality. The association between the cost reference units and hardware features is very strong, so they need manual configuration; the estimated cardinality depends on the statistical information of the database, and cardinality estimation errors can be caused when the statistical information is not updated in time. Therefore, the traditional cost estimation model suffers from errors and requires manual intervention. With the mature development of artificial intelligence (Artificial Intelligence, AI) technology, the strong learning capability of deep learning can effectively solve the problems in traditional cost estimation. An AI-based cost estimation method reduces the error of cost estimation and at the same time avoids the problem of configuring cost reference units through manual intervention.
Novel hardware and the environments built from it change the traditional computing, storage, and network systems. In particular, the huge gap between general-purpose processors and dedicated accelerators in computing capability, and the scale and cost of data transmission between different hardware, change the underlying architecture design of the database system and the cost estimation model of the query optimizer. The query optimizer of the database therefore needs to be adapted to the characteristics of the novel hardware in order to bring out its potential.
An existing cost estimation method for novel hardware adopts the traditional cost estimation model and corrects it by estimating the capability of heterogeneous hardware and calibrating the actual capability of the database system hardware. This solves, to a certain extent, the problem that different hardware has different computing capabilities in the novel hardware environment, but the traditional cost estimation model still has errors and still requires manual intervention.
The existing AI-based cost estimation method is not suitable for the novel hardware environment. It processes the feature vectors of the query plan tree through a tree-structured neural network (Tree-structured Long Short-Term Memory, Tree-LSTM) to obtain a cost estimation result. The method uses AI to solve the problems encountered by traditional cost estimation, improves the accuracy of cost estimation, and reduces manual intervention. However, it is not suitable for an environment containing novel hardware, because it does not consider problems such as the difference in computing capability between novel hardware and common hardware or the cost of data transmission between them.
In addition, the existing AI-based cost estimation method does not fully utilize the information of the query plan tree. When it uses the Tree-LSTM to perform cost estimation on the query plan tree, the input of the Tree-LSTM only contains the coding information of the nodes of the query plan tree and does not contain the coding information of the relations between the nodes; that is, the cost of data transmitted between nodes in the novel hardware environment cannot be reflected, so all the information of the query plan tree in the novel hardware environment cannot be fully utilized. Therefore, how to make the existing cost estimation based on an artificial intelligence algorithm applicable to the novel hardware environment while fully utilizing all the information of the query plan tree is a problem that needs to be considered at present.
In order to solve the above problems, embodiments of the present invention provide a cost estimation method, an electronic device, a computer readable storage medium, and a computer program product, which can make full use of all information of a query plan tree, and realize that cost estimation based on an artificial intelligence algorithm is applicable to a novel hardware environment.
The embodiment of the invention can be applied to any cost-based database query optimizer, including PostgreSQL, TiDB, Oracle, MySQL, and the like, and is particularly suitable for database query optimizers in novel hardware environments, such as those containing graphics processing units (Graphics Processing Unit, GPU) and field programmable gate arrays (Field Programmable Gate Array, FPGA).
Fig. 1 is a schematic flowchart of a cost estimation method according to an embodiment of the present invention. The implementation of the method includes, but is not limited to, the following steps S110 to S160, which are described in turn below:
step S110, a query plan tree to be estimated is obtained, wherein the query plan tree comprises a plurality of execution nodes and at least one connecting edge for connecting adjacent execution nodes.
It can be understood that, in practical applications of the embodiment of the invention, the database query optimizer has a plurality of candidate execution plans. An intelligent cost estimation model is provided for the database query optimizer in the novel hardware environment to help it select a better execution plan, thereby improving the query performance of the database in the novel hardware environment.
Specifically, each execution plan corresponds to a query plan tree as described in step S110. The workload is executed through EXPLAIN ANALYZE to obtain the query plan tree in the novel hardware environment, thereby obtaining the plurality of execution nodes and the at least one connecting edge for connecting adjacent execution nodes that each query plan tree comprises.
For example, as shown in fig. 2, which is a schematic diagram of a query plan tree according to an embodiment of the present invention, assuming that the query plan tree has 7 execution nodes numbered nodes 1 to 7, the query plan tree correspondingly comprises 6 connecting edges for connecting adjacent execution nodes, which may be denoted as connecting edges 1 to 6.
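As a concrete illustration, the sketch below represents such a plan tree as a node list plus a list of parent-child connecting edges; the operator assigned to each node is a hypothetical assumption, while the node numbers and edge endpoints follow the examples of fig. 2 and fig. 6.

```python
# Minimal sketch of the query plan tree of Fig. 2: seven execution nodes
# (numbered 1 to 7) joined by six connecting edges. The operator names are
# illustrative; only the counts and the edge endpoints come from the text.
plan_nodes = {
    1: "Hash Join",
    2: "Hash Join",
    3: "Merge Join",
    4: "Seq Scan",
    5: "Index Scan",
    6: "Seq Scan",
    7: "Seq Scan",
}

# Each connecting edge links an execution node to one of its child nodes.
plan_edges = [
    (1, 2),  # connecting edge 1
    (2, 3),  # connecting edge 2
    (3, 4),  # connecting edge 3
    (3, 5),  # connecting edge 4
    (2, 6),  # connecting edge 5
    (1, 7),  # connecting edge 6
]

assert len(plan_nodes) == 7 and len(plan_edges) == 6
```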
And step S120, carrying out coding processing on each execution node to obtain node coding information corresponding to each execution node.
It can be understood that, in order to make full use of all information of the query plan tree, graph structure data including all important information related to the new hardware can be constructed, and node coding information corresponding to each execution node is obtained by performing coding processing on each execution node.
As shown in fig. 3, which is a specific flowchart of step S120 provided in the embodiment of the present invention, step S120 specifically includes, but is not limited to, the following steps S121 to S122:
step S121, obtaining node basic information of each execution node.
Specifically, the node basic information includes at least one of node execution position information, operator type information, table information, column information, predicate condition information, or example bitmap information.
The node execution location information is used to indicate an execution location corresponding to the execution node, and may include at least one of a central processing unit (Central Processing Unit, CPU), a GPU, and an FPGA.
It should be noted that the operator type information is used to indicate the operator type corresponding to the execution node, and may include at least one of Seq Scan, Index Scan, Bitmap Scan, Nested Loop, Hash Join, Merge Join, and Agg.
The table information and the column information indicate a table and a column to which the execution node corresponds, respectively.
It should be noted that the predicate condition information may include a filtering predicate and a join predicate, where the filtering predicate is used to indicate that a certain table or column in an SQL statement satisfies a certain condition, and the join predicate is used to indicate that a column in one table is equal to a column in another table.
It should be noted that the example bitmap information indicates, for a specified number of samples randomly selected from the data set, whether each sample satisfies the node's condition. For example, when 1000 pieces of data are randomly selected from the data set, each sampled piece of data is checked against the set predicate condition; the corresponding bit is set to 1 when the condition is satisfied and to 0 when it is not, thereby obtaining a 1000-dimensional example bitmap vector. By acquiring the example bitmap information, the embodiment of the invention can better understand the distribution of the data, and the coding information corresponding to the example bitmap contains more data information, so the accuracy of cost estimation can be improved. A minimal sketch of producing such a bitmap is shown below.
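In this sketch the table contents, the filtering predicate, and the sample size of 1000 are illustrative assumptions.

```python
import random

def example_bitmap(rows, predicate, sample_size=1000, seed=0):
    """Randomly select `sample_size` rows and record, as a 0-1 vector, whether
    each selected row satisfies the node's predicate condition."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return [1 if predicate(row) else 0 for row in sample]

# Hypothetical data set and filtering predicate ("price > 100").
rows = [{"price": p} for p in range(5000)]
bitmap = example_bitmap(rows, lambda r: r["price"] > 100)
print(len(bitmap), sum(bitmap))  # 1000-dimensional bitmap and its match count
```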
Step S122, the node basic information is subjected to coding processing to obtain node coding information.
Specifically, each part of the node basic information is subjected to coding processing to obtain the corresponding node coding information. For example, the node execution location information is encoded to obtain a one-hot code of the node execution location; the operator type information is encoded to obtain a one-hot code of the node type; the table information is encoded to obtain a one-hot code of the table; the column information is encoded to obtain a one-hot code of the column; the predicate condition information is encoded to obtain the predicate condition code; and the example bitmap information is encoded to obtain the example bitmap code.
It should be noted that the predicate condition code includes the one-hot code of the left operand, the one-hot code of the operator, and the code of the right operand.
It should be noted that the example bitmap code is a 0-1 code representing whether each of a specified number of samples randomly selected from the data set satisfies the node's condition.
For example, as shown in fig. 4, which is a schematic diagram of encoding the execution nodes of the query plan tree according to an embodiment of the present invention, assume that the execution nodes in the query plan tree shown in fig. 2 are subjected to coding processing, so that the node coding information corresponding to each execution node is obtained. Specifically, node 1 yields the code of node 1, node 2 yields the code of node 2, node 3 yields the code of node 3, node 4 yields the code of node 4, node 5 yields the code of node 5, node 6 yields the code of node 6, and node 7 yields the code of node 7.
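The following sketch illustrates how the pieces of node basic information described above could be concatenated into a single node code; the vocabularies (three execution locations, seven operator types, a two-table schema) and the predicate layout are illustrative assumptions rather than the patented encoding.

```python
import numpy as np

LOCATIONS = ["CPU", "GPU", "FPGA"]
OPERATORS = ["Seq Scan", "Index Scan", "Bitmap Scan",
             "Nested Loop", "Hash Join", "Merge Join", "Agg"]
TABLES = ["t_orders", "t_items"]               # hypothetical schema
COLUMNS = ["t_orders.price", "t_items.id"]

def one_hot(value, vocab):
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab.index(value)] = 1.0
    return vec

def encode_node(location, operator, table, column, predicate_code, bitmap):
    """Concatenate the one-hot codes of execution location, operator type,
    table and column with the predicate condition code and the example
    bitmap to form the node coding information."""
    return np.concatenate([
        one_hot(location, LOCATIONS),
        one_hot(operator, OPERATORS),
        one_hot(table, TABLES),
        one_hot(column, COLUMNS),
        np.asarray(predicate_code, dtype=np.float32),
        np.asarray(bitmap, dtype=np.float32),
    ])

# Predicate "price > 100": left-operand one-hot, operator one-hot (<, =, >),
# and the right operand value.
predicate_code = list(one_hot("t_orders.price", COLUMNS)) + [0, 0, 1] + [100.0]
node_code = encode_node("GPU", "Seq Scan", "t_orders", "t_orders.price",
                        predicate_code, bitmap=[1, 0, 1, 1])
print(node_code.shape)
```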
Step S130, performing encoding processing on each connecting edge to obtain edge encoding information corresponding to each connecting edge, wherein the edge encoding information is used for representing data transmission information between two execution nodes connected by the connecting edge.
It will be appreciated that, in order to make full use of all the information of the query plan tree, the embodiment of the present invention preserves the node information and the edge information of the query plan tree simultaneously. Specifically, each connecting edge is subjected to coding processing to obtain the edge coding information corresponding to each connecting edge, where the edge coding information is used for representing data transmission information between the two execution nodes connected by the connecting edge.
As shown in fig. 5, which is a specific flowchart of step S130 provided in the embodiment of the present invention, step S130 specifically includes, but is not limited to, the following steps S131 to S132:
step S131, obtaining basic information of each connection edge.
Specifically, after the coding processing is performed on each execution node, the basic information of each connecting edge is acquired.
It should be noted that the basic information of a connecting edge in the embodiment of the present invention includes at least one of the following: node connection relation coding information, inter-node data transmission direction coding information, or inter-node data transmission size information.
It should be noted that the node connection relation coding information is a 0-1 code indicating whether there is a connection relation between the nodes; the corresponding node connection relation indicates that any node in the query plan tree has a connection relation with its child nodes.
It should be noted that the inter-node data transmission direction coding information is a one-hot code indicating the data transmission direction between the nodes, and the corresponding inter-node data transmission direction may include at least one of CPU to GPU, GPU to CPU, CPU to FPGA, and the like.
It should be noted that the inter-node data transmission size information represents the size of the data transmitted between the nodes, and may be represented by the estimated cardinality between the two nodes. A minimal sketch of assembling this basic information into a raw edge feature vector is shown below.
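In this sketch the direction vocabulary and the cardinality value are assumptions; it simply concatenates the three kinds of edge basic information into one vector to be fed to the transcoding model.

```python
import numpy as np

# Hypothetical set of inter-node data transmission directions.
DIRECTIONS = ["CPU->CPU", "CPU->GPU", "GPU->CPU", "CPU->FPGA", "FPGA->CPU"]

def encode_edge_basic_info(connected, direction, estimated_cardinality):
    """Edge basic information: a 0-1 connection flag, a one-hot transmission
    direction, and the estimated cardinality standing in for the data size."""
    direction_vec = np.zeros(len(DIRECTIONS), dtype=np.float32)
    direction_vec[DIRECTIONS.index(direction)] = 1.0
    return np.concatenate([
        np.array([1.0 if connected else 0.0], dtype=np.float32),
        direction_vec,
        np.array([float(estimated_cardinality)], dtype=np.float32),
    ])

edge_basic = encode_edge_basic_info(True, "CPU->GPU", 12000)
print(edge_basic.shape)  # 7-dimensional raw edge feature, input to the FC model
```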
Step S132, inputting the basic information of each connecting edge to the trained transcoding model to obtain the edge coding information corresponding to the connecting edge.
It can be understood that, in order to construct the graph structure data, the coded basic information of each connecting edge needs to be converted into a numerical form; that is, the basic information of each connecting edge is input into the trained transcoding model to obtain the edge coding information corresponding to the connecting edge, where the edge coding information is a one-dimensional numerical value.
It should be noted that the transcoding model in the embodiment of the present invention is built with a fully connected neural network (Fully Connected Neural Network, FC). A first training sample set is acquired, where the first training sample set includes a plurality of first training samples, and each first training sample includes sample basic information and its corresponding sample output information. The sample basic information of each sample is input into a preset fully connected neural network for model training to output target output information. Model parameters of the fully connected neural network are adjusted according to the loss value obtained from the sample output information and the target output information until a preset training end condition is met, so as to obtain the transcoding model.
It should be noted that the number of layers of the fully connected neural network corresponding to the transcoding model is not fixed. For example, when the number of network layers in the transcoding model is 2, the structure of the transcoding model may include: a fully connected layer, a regularization layer, a fully connected layer, and a regularization layer. A sketch of such a model is given below.
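In this PyTorch sketch, the hidden width, the choice of LayerNorm as the regularization layer, the MSE loss, and the omission of the trailing regularization layer on the scalar output are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class EdgeTranscoder(nn.Module):
    """Maps the edge basic-information vector to a one-dimensional edge code."""
    def __init__(self, in_dim, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # first fully connected layer
            nn.LayerNorm(hidden_dim),        # regularization layer
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),        # second fully connected layer
        )

    def forward(self, edge_basic_info):
        return self.net(edge_basic_info)

# Training sketch over the first training sample set (placeholder tensors).
model = EdgeTranscoder(in_dim=7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
sample_basic_info = torch.randn(64, 7)    # sample basic information
sample_outputs = torch.randn(64, 1)       # sample output information
for _ in range(100):                      # until a preset end condition is met
    optimizer.zero_grad()
    loss = loss_fn(model(sample_basic_info), sample_outputs)
    loss.backward()
    optimizer.step()
```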
For example, as shown in fig. 6, which is a schematic diagram of encoding the connecting edges of the query plan tree according to an embodiment of the present invention, assume that connecting-edge coding is performed on the node-encoded query plan tree shown in fig. 4, and the basic information of each connecting edge is input into the trained transcoding model, so that the edge coding information corresponding to each connecting edge is obtained. Specifically, connecting edge 1 yields FC (encoding of edge 1), connecting edge 2 yields FC (encoding of edge 2), connecting edge 3 yields FC (encoding of edge 3), connecting edge 4 yields FC (encoding of edge 4), connecting edge 5 yields FC (encoding of edge 5), and connecting edge 6 yields FC (encoding of edge 6).
Step S140, constructing graph structure data according to the node coding information corresponding to each execution node and the edge coding information corresponding to each connecting edge.
As shown in fig. 7, which is a specific flowchart of step S140 provided in the embodiment of the present invention, step S140 specifically includes, but is not limited to, the following steps S141 to S143:
step S141, obtaining a vertex matrix according to node coding information corresponding to each execution node.
It can be understood that, in the process of constructing the graph structure data, the coding information of the vertices and edges of the graph contains the important information related to the novel hardware. First, a vertex matrix is obtained according to the node coding information corresponding to each execution node.
It will be appreciated that, as shown in fig. 4, by performing the encoding process on the nodes 1 to 7, the node 1 to node 7 codes are obtained respectively, and the vertex matrix is constructed according to the node 1 to node 7 codes, for example, the form of the vertex matrix may be expressed as: { encoding of node 1; encoding of node 2; encoding of node 3; encoding of node 4; encoding of node 5; encoding of node 6; code of node 7 }.
Step S142, obtaining an edge matrix according to the edge coding information corresponding to each connecting edge.
Specifically, the edge coding information corresponding to each connecting edge is the one-dimensional output obtained through the transcoding model.
As shown in fig. 8, which is a specific flowchart of step S142 provided in the embodiment of the present invention, step S142 specifically includes, but is not limited to, the following steps S1421 to S1423:
step S1421, for each connecting edge, acquiring node number information of the executing nodes at two ends of the connecting edge, and obtaining first number information and second number information.
It can be understood that, in the embodiment of the present invention, the graph structure data is constructed according to the execution nodes and the connecting edges of the query plan tree, and for each connecting edge, the node number information of the execution nodes at its two ends is acquired to obtain the first number information and the second number information. For example, as shown in fig. 6, the node number information of the execution nodes at the two ends of connecting edge 1 is obtained as 1 and 2, respectively.
Step S1422, obtaining the triplet data corresponding to each connecting edge according to the first number information, the second number information and the edge coding information corresponding to each connecting edge.
It can be understood that the edge matrix of the graph structure data is composed of the triplet data corresponding to all the connecting edges of the query plan tree, and each triplet contains the numbers of the two execution nodes connected by the connecting edge together with the edge coding information. For example, as shown in fig. 6, after acquiring the node number information of the execution nodes at the two ends of connecting edge 1, which are 1 and 2 respectively, the edge coding information FC (encoding of edge 1) of connecting edge 1 is acquired, and the triplet data [1, 2, FC (encoding of edge 1)] corresponding to connecting edge 1 is constructed.
Step S1423, obtaining the edge matrix according to all the triplet data.
It can be understood that, in the embodiment of the present invention, corresponding triplet data is constructed for each connecting edge in the query plan tree, so that the edge matrix corresponding to the graph structure data is obtained from all the obtained triplet data. For example, according to fig. 6, the node number information of the execution nodes at the two ends of each connecting edge and the corresponding edge code are obtained to form the triplet data of each connecting edge, so as to obtain the edge matrix of the graph structure data, which may be expressed as: { [1, 2, FC (encoding of edge 1)]; [2, 3, FC (encoding of edge 2)]; [3, 4, FC (encoding of edge 3)]; [3, 5, FC (encoding of edge 4)]; [2, 6, FC (encoding of edge 5)]; [1, 7, FC (encoding of edge 6)] }.
And step S143, obtaining graph structure data according to the vertex matrix and the edge matrix.
It will be appreciated that, since the vertices and edges of the graph structure data carry the important information related to the novel hardware, once the vertex matrix and the edge matrix of the query plan tree have been obtained, the graph structure data is obtained from the vertex matrix and the edge matrix. The vertices of the graph structure data are the execution nodes corresponding to the node coding information, and the edges of the graph structure data are the connecting edges corresponding to the edge coding information.
For example, as shown in fig. 9, which is a schematic structural diagram of graph structure data provided in an embodiment of the present invention, assume that after the execution nodes and connecting edges of the query plan tree are encoded according to fig. 6, the vertex matrix is obtained from the node coding information corresponding to each execution node, and the edge matrix is obtained from the edge coding information corresponding to each connecting edge. Finally, the graph structure data is obtained from the vertex matrix and the edge matrix, and the graph corresponding to the graph structure data at this point comprises vertices 1 to 7. The sketch below puts steps S141 to S143 together.
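In this sketch the 4-dimensional node codes and the random edge codes are placeholders, while the edge endpoints follow the example of fig. 6.

```python
import numpy as np

def build_graph_structure(node_codes, edge_endpoints, edge_codes):
    """node_codes: {node number: node coding vector}; edge_endpoints: (first
    number, second number) per connecting edge; edge_codes: the one-dimensional
    transcoding-model output per connecting edge."""
    # Vertex matrix: one row of node coding information per execution node.
    vertex_matrix = np.stack([node_codes[n] for n in sorted(node_codes)])
    # Edge matrix: one triplet [first number, second number, edge code] per edge.
    edge_matrix = np.array(
        [[src, dst, code] for (src, dst), code in zip(edge_endpoints, edge_codes)],
        dtype=np.float32,
    )
    return vertex_matrix, edge_matrix

# Seven placeholder 4-dimensional node codes and the six edges of Fig. 6.
node_codes = {n: np.random.rand(4).astype(np.float32) for n in range(1, 8)}
edge_endpoints = [(1, 2), (2, 3), (3, 4), (3, 5), (2, 6), (1, 7)]
edge_codes = np.random.rand(6).astype(np.float32)
vertex_matrix, edge_matrix = build_graph_structure(node_codes, edge_endpoints, edge_codes)
print(vertex_matrix.shape, edge_matrix.shape)  # (7, 4) (6, 3)
```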
And step S150, inputting the graph structure data into a trained graph neural network model to obtain vertex characteristic data.
Specifically, when cost estimation is performed on the nodes in the query plan tree, the vertex characteristic data is obtained by inputting the vertex matrix and the edge matrix into the trained graph neural network model, where the vertex characteristic data includes the feature information of each vertex in the graph structure data.
The feature information of each vertex in the graph structure data includes vertex coding information, coding information of adjacent vertices, and coding information of adjacent edges.
It should be noted that the type of neural network used in the graph neural network model in the embodiment of the present invention is not limited, and may be a graph convolutional network (Graph Convolution Network, GCN), a generative adversarial network (Generative Adversarial Network, GAN), a graph generative network (Graph Generative Network, GGN), a graph autoencoder (Graph Autoencoder), and the like, and the number of layers of the neural network is not fixed.
It should be noted that the graph neural network model is trained by acquiring a second training sample set, where the second training sample set includes a plurality of second training samples, and each second training sample includes a sample graph structure and the sample vertex features corresponding to the sample graph structure. Each sample graph structure is input into a preset initial neural network for model training to output target vertex features. Model parameters of the initial neural network are adjusted according to the loss value obtained from the target vertex features and the sample vertex features until a preset training end condition is met, so as to obtain the graph neural network model.
For example, as shown in fig. 10, which is a schematic diagram of extracting the feature information of vertices in the graph structure data through the graph neural network, assume that a GCN with 2 network layers is used as the graph neural network in the embodiment of the present invention. The structure of the graph neural network is then: a first graph convolution layer, a first regularization layer, a second graph convolution layer, and a second regularization layer. The vertex matrix and the edge matrix corresponding to the graph structure data are input into the trained graph neural network model and processed sequentially by the first graph convolution layer, the first regularization layer, the second graph convolution layer, and the second regularization layer to obtain the vertex characteristic data corresponding to the graph structure data, where the vertex characteristic data includes the feature information of each vertex in the graph structure data and may be expressed as { features of vertex 1; features of vertex 2; features of vertex 3; features of vertex 4; features of vertex 5; features of vertex 6; features of vertex 7 }.
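A sketch of such a two-layer GCN, written directly over the vertex matrix and the edge triplets, is shown below; the symmetric-normalized adjacency weighted by the edge codes, the LayerNorm regularization layers, and the layer widths are illustrative choices rather than the patented architecture.

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """Graph convolution -> regularization -> graph convolution -> regularization."""
    def __init__(self, in_dim, hidden_dim=64, out_dim=32):
        super().__init__()
        self.lin1, self.norm1 = nn.Linear(in_dim, hidden_dim), nn.LayerNorm(hidden_dim)
        self.lin2, self.norm2 = nn.Linear(hidden_dim, out_dim), nn.LayerNorm(out_dim)

    @staticmethod
    def normalized_adjacency(num_nodes, edge_matrix):
        # Build A + I from the [src, dst, edge code] triplets, weighting each
        # connection by its edge code, then apply symmetric normalization.
        adj = torch.eye(num_nodes)
        for src, dst, code in edge_matrix:
            s, d = int(src) - 1, int(dst) - 1   # node numbers start at 1
            adj[s, d] = adj[d, s] = float(code)
        deg = adj.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.clamp(min=1e-6).pow(-0.5))
        return d_inv_sqrt @ adj @ d_inv_sqrt

    def forward(self, vertex_matrix, edge_matrix):
        a_hat = self.normalized_adjacency(vertex_matrix.size(0), edge_matrix)
        h = torch.relu(self.norm1(self.lin1(a_hat @ vertex_matrix)))
        return self.norm2(self.lin2(a_hat @ h))   # per-vertex feature data

gcn = TwoLayerGCN(in_dim=4)
vertex_features = gcn(torch.rand(7, 4), torch.tensor(
    [[1, 2, 0.3], [2, 3, 0.5], [3, 4, 0.2], [3, 5, 0.4], [2, 6, 0.1], [1, 7, 0.6]]))
print(vertex_features.shape)  # (7, 32): feature information of each vertex
```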
Step S160, vertex characteristic data are input into the trained cost estimation model, and a cost estimation result is obtained.
It can be understood that the characteristic information of each vertex in the graph structure data is input to the trained cost estimation model, so as to obtain cost estimation corresponding to each vertex.
It should be noted that the cost estimation model in the embodiment of the present invention is built with a fully connected neural network. A third training sample set is acquired, where the third training sample set includes a plurality of third training samples, and each third training sample includes sample feature information and its corresponding sample cost estimate. The sample feature information of each sample is input into a preset initial neural network for model training to output a target cost estimate. Model parameters of the initial neural network are adjusted according to the loss value obtained from the sample cost estimate and the target cost estimate until a preset training end condition is met, so as to obtain the cost estimation model.
For example, as shown in fig. 11, which is a schematic diagram of performing cost estimation on the vertices in the graph structure data, assume that a fully connected neural network with 2 network layers is used as the cost estimation model in the embodiment of the present invention. The structure of the cost estimation model is then: a first fully connected layer, a third regularization layer, a second fully connected layer, and a fourth regularization layer. The vertex characteristic data is input into the trained cost estimation model and processed sequentially by the first fully connected layer, the third regularization layer, the second fully connected layer, and the fourth regularization layer to obtain the cost estimation result, where the cost estimation result includes the cost estimate corresponding to each vertex.
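A matching sketch of such a per-vertex cost head is shown below; the hidden width, the MSE objective, and the omission of the trailing regularization layer on the scalar output are assumptions made to keep the regression target well-defined.

```python
import torch
import torch.nn as nn

class CostEstimator(nn.Module):
    """Maps each vertex's feature vector to a scalar cost estimate."""
    def __init__(self, feature_dim=32, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),  # first fully connected layer
            nn.LayerNorm(hidden_dim),            # regularization layer
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),            # second fully connected layer
        )

    def forward(self, vertex_features):
        return self.net(vertex_features).squeeze(-1)  # one cost per vertex

cost_model = CostEstimator()
costs = cost_model(torch.rand(7, 32))   # cost estimate for vertices 1 to 7
print(costs.shape)                      # torch.Size([7])

# Training sketch over the third training sample set (placeholder targets).
loss = nn.MSELoss()(costs, torch.rand(7))
loss.backward()
```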
According to the scheme of the embodiment of the invention, a query plan tree to be estimated is first obtained, where the query plan tree comprises a plurality of execution nodes and at least one connecting edge for connecting adjacent execution nodes. Then, in order to fully utilize all the information of the query plan tree, the node basic information of each execution node is acquired and encoded to obtain the node coding information, and the basic information of each connecting edge is acquired and input into the trained transcoding model to obtain the edge coding information corresponding to the connecting edge, where the edge coding information is used for representing data transmission information between the two execution nodes connected by the connecting edge. Next, the vertex matrix is obtained from the node coding information corresponding to each execution node. For each connecting edge, the node number information of the execution nodes at its two ends is acquired to obtain the first number information and the second number information, the triplet data corresponding to each connecting edge is obtained from the first number information, the second number information, and the edge coding information of that edge, and the edge matrix is then obtained from all the triplet data. The graph structure data is obtained from the vertex matrix and the edge matrix, and the node information and the edge information of the query plan tree are preserved simultaneously so as to provide more sufficient information for subsequent operations. The graph structure data is input into the trained graph neural network model to obtain the vertex characteristic data, and the vertex characteristic data is input into the trained cost estimation model to obtain the cost estimation result. Compared with existing feature extraction methods for query plan trees, the embodiment of the invention constructs the query plan tree into graph structure data, preserves the node information and the edge information of the query plan tree simultaneously so as to fully utilize all the information of the query plan tree, adapts the method to the novel hardware environment, and provides an intelligent cost estimation method for a database query optimizer in the novel hardware environment.
It should be noted that although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
In addition, in the embodiments of the present invention, the descriptions of the embodiments are emphasized, and the details or descriptions of some embodiments may be referred to in the related descriptions of other embodiments.
The embodiment of the present invention further provides an electronic device, as shown in fig. 12, where the electronic device 1200 includes, but is not limited to:
a processor 1210 and a memory 1220;
the memory 1220 has stored thereon program instructions that, when executed by the processor 1210, cause the processor 1210 to perform the cost estimation method as described in any of the embodiments above.
The processor 1210 and the memory 1220 may be connected by a bus or otherwise.
It should be appreciated that the processor 1210 may employ a central processing unit (Central Processing Unit, CPU). The processor may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. Or the processor 1210 may employ one or more integrated circuits for executing associated programs to carry out the techniques provided by embodiments of the present invention.
Memory 1220 acts as a non-transitory computer readable storage medium that may be used to store a non-transitory software program and a non-transitory computer executable program, such as the cost estimation methods described in any of the embodiments of the present invention. Processor 1210 implements the cost estimation method described above by running non-transitory software programs and instructions stored in memory 1220.
Memory 1220 may include a storage program area and a storage data area, where the storage program area may store an operating system and at least one application program required for functions, and the storage data area may store data involved in performing the cost estimation method or training the models described above. In addition, memory 1220 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some implementations, the memory 1220 may optionally include memory remotely located relative to the processor 1210, which may be connected to the processor 1210 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the cost estimation methods described above are stored in memory 1220, which when executed by one or more processors 1210, perform the cost estimation methods provided by any of the embodiments of the present invention.
The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores program instructions, and when the program instructions are executed by a computer, the cost estimation method described in any embodiment above is realized.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Embodiments of the present invention provide a computer program product having stored thereon program instructions that, when executed by a computer, cause the computer to implement the cost estimation method as described in any of the embodiments above.
The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to the above embodiments. Those skilled in the art will appreciate that various equivalent modifications and substitutions may be made without departing from the spirit of the present invention, and such equivalent modifications and substitutions are intended to be included within the scope of the present invention as defined by the following claims.

Claims (10)

1. A cost estimation method, the method comprising:
acquiring a query plan tree to be estimated, wherein the query plan tree comprises a plurality of execution nodes and at least one connecting edge for connecting adjacent execution nodes;
performing coding processing on each execution node to obtain node coding information corresponding to each execution node;
performing coding processing on each connecting edge to obtain edge coding information corresponding to each connecting edge, wherein the edge coding information is used for representing data transmission information between two executing nodes connected with the connecting edge;
constructing graph structure data according to the node coding information corresponding to each execution node and the edge coding information corresponding to each connecting edge;
inputting the graph structure data into a trained graph neural network model to obtain vertex characteristic data;
and inputting the vertex characteristic data into a trained cost estimation model to obtain a cost estimation result.
2. The method of claim 1, wherein the encoding each of the execution nodes to obtain node encoded information corresponding to each of the execution nodes includes:
acquiring node basic information of each execution node;
and carrying out coding processing on the node basic information to obtain node coding information.
3. The method of claim 2, wherein the node base information comprises at least one of:
the node performs location information, operator type information, table information, column information, predicate condition information, or example bitmap information.
4. The method of claim 1, wherein said encoding each of said connecting edges to obtain edge coding information comprises:
basic information of each connecting edge is obtained;
inputting the basic information of each connecting edge into a trained transcoding model to obtain the edge coding information corresponding to the connecting edge;
wherein the basic information of the connection edge comprises at least one of the following: node connection relation coding information, inter-node data transmission direction coding information or inter-node data transmission size information.
5. The method according to any one of claims 1 to 4, wherein said constructing graph structure data from the node coding information corresponding to each of the execution nodes and the edge coding information corresponding to each of the connecting edges comprises:
obtaining a vertex matrix according to the node coding information corresponding to each execution node;
obtaining an edge matrix according to the edge coding information corresponding to each connecting edge;
obtaining graph structure data according to the vertex matrix and the edge matrix;
and the vertex of the graph structure data is the execution node corresponding to each node coding information, and the edge of the graph structure data is the connecting edge corresponding to each edge coding information.
6. The method of claim 5, wherein the obtaining an edge matrix from the edge coding information corresponding to each connection edge comprises:
acquiring node number information of the execution nodes positioned at two ends of each connecting edge aiming at each connecting edge to obtain first number information and second number information;
obtaining triplet data corresponding to each connecting edge according to the first number information, the second number information and the edge coding information corresponding to each connecting edge;
and obtaining an edge matrix according to all the triplet data.
7. The method of claim 5, wherein inputting the graph structure data into a trained graph neural network model to obtain vertex feature data comprises:
inputting the vertex matrix and the edge matrix into a trained graph neural network model to obtain vertex characteristic data, wherein the vertex characteristic data comprises characteristic information of each vertex in the graph structure data;
the characteristic information of each vertex comprises vertex coding information, coding information of adjacent vertices and coding information of adjacent edges.
8. An electronic device, comprising:
a processor and a memory;
the memory has stored thereon program instructions which, when executed by the processor, cause the processor to perform the method of any of claims 1-7.
9. A computer readable storage medium, characterized in that program instructions are stored, which, when executed by a computer, implement the method of any of claims 1-7.
10. A computer program product, characterized in that it stores program instructions that, when executed by a computer, cause the computer to implement the method of any of claims 1-7.
CN202210840332.5A 2022-07-18 2022-07-18 Cost estimation method, electronic device, storage medium, and computer program product Pending CN117472942A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210840332.5A CN117472942A (en) 2022-07-18 2022-07-18 Cost estimation method, electronic device, storage medium, and computer program product
PCT/CN2023/102435 WO2024016946A1 (en) 2022-07-18 2023-06-26 Cost estimation method, electronic device, storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210840332.5A CN117472942A (en) 2022-07-18 2022-07-18 Cost estimation method, electronic device, storage medium, and computer program product

Publications (1)

Publication Number Publication Date
CN117472942A true CN117472942A (en) 2024-01-30

Family

ID=89616987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210840332.5A Pending CN117472942A (en) 2022-07-18 2022-07-18 Cost estimation method, electronic device, storage medium, and computer program product

Country Status (2)

Country Link
CN (1) CN117472942A (en)
WO (1) WO2024016946A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9798772B2 (en) * 2013-04-12 2017-10-24 Oracle International Corporation Using persistent data samples and query-time statistics for query optimization
CN111581454B (en) * 2020-04-27 2023-05-23 清华大学 Parallel query performance prediction system and method based on depth map compression algorithm
CN112749191A (en) * 2021-01-19 2021-05-04 成都信息工程大学 Intelligent cost estimation method and system applied to database and electronic equipment
CN113010547B (en) * 2021-05-06 2023-04-07 电子科技大学 Database query optimization method and system based on graph neural network

Also Published As

Publication number Publication date
WO2024016946A1 (en) 2024-01-25

Similar Documents

Publication Publication Date Title
Scrucca GA: A package for genetic algorithms in R
US7809713B2 (en) Efficient search space analysis for join factorization
US20150370919A1 (en) Graph travelsal operator and extensible framework inside a column store
US20210065052A1 (en) Bayesian optimization of sparsity ratios in model compression
CN103942253B (en) A kind of spatial data handling system of load balancing
CN113723618B (en) SHAP optimization method, equipment and medium
CN111310918B (en) Data processing method, device, computer equipment and storage medium
CN112463159A (en) Compiling method, compiling device, electronic equipment and storage medium
CN110750560B (en) System and method for optimizing network multi-connection
CN108073641B (en) Method and device for querying data table
KR102189811B1 (en) Method and Apparatus for Completing Knowledge Graph Based on Convolutional Learning Using Multi-Hop Neighborhoods
CN116112563A (en) Dual-strategy self-adaptive cache replacement method based on popularity prediction
CN114781654A (en) Federal transfer learning method, device, computer equipment and medium
CN115544029A (en) Data processing method and related device
CN117472942A (en) Cost estimation method, electronic device, storage medium, and computer program product
CN113705798A (en) Processing unit, computing device and computation graph optimization method of deep learning model
CN115102935B (en) Point cloud encoding method, point cloud decoding method and related equipment
CN113609126B (en) Integrated storage management method and system for multi-source space-time data
CN110728359A (en) Method, device, equipment and storage medium for searching model structure
US20240119329A1 (en) Quantum computer operating system, quantum computer and readable storage medium
WO2022252694A1 (en) Neural network optimization method and apparatus
CN113836174B (en) Asynchronous SQL (structured query language) connection query optimization method based on reinforcement learning DQN (direct-to-inverse) algorithm
CN113688191B (en) Feature data generation method, electronic device, and storage medium
CN115238134A (en) Method and apparatus for generating a graph vector representation of a graph data structure
CN114218210A (en) Data processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication