WO2023016537A1

WO2023016537A1 - Database management system, data processing method, and device

Info

Publication number: WO2023016537A1
Application number: PCT/CN2022/111991
Authority: WO
Inventors: 李国良; 李士福; 周煊赫; 王天庆
Original assignee: 华为技术有限公司; 清华大学
Priority date: 2021-08-13
Filing date: 2022-08-12
Publication date: 2023-02-16
Also published as: CN115705322A

Abstract

Disclosed in the present application are a database management system, a data processing method, and a device. The system comprises: an optimizer including n models; a training data collector; a model manager; and a model evaluator. The optimizer is used to obtain a target physical plan by means of the n models according to an SQL statement; the training data collector is used to construct m training sets according to operating data of a database process; the model manager is used to fine-tune a first target model (which belongs to the n models and needs to meet a preset requirement, e.g. a performance reduction) by using a target training set (which belongs to the m training sets), so as to obtain a second target model; and the model evaluator is used to evaluate the performance of the second target model, and update the first target model to the second target model when the performance meets a preset requirement (e.g. a performance improvement). The present application is combined with machine learning, so as to realize the function of automatically executing database tuning and updating and other database management tasks that are traditionally executed by a DBA, without manual intervention.

Description

A database management system, data processing method and device

This application claims the priority of the Chinese patent application with the application number 202110930569.8 and the application title "A database management system, data processing method and equipment" filed with the China Patent Office on August 13, 2021, the entire contents of which are incorporated by reference in this application.

technical field

The present application relates to the field of database management, in particular to a database management system, data processing method and equipment.

Background technique

A database is a "warehouse that organizes, stores and manages data according to the data structure", and is a collection of organized, shareable and unified management of a large amount of data stored in a computer for a long time. Databases are critical to the efficient operation of modern businesses.

In practical applications, the management method of the database mainly depends on the database administrator (DBA for short), and the DBA often needs to spend a lot of time and energy to manage and maintain the database manually, which is very error-prone and affects the uptime, performance and quality of the database. Security has catastrophic effects. For example, failure to properly patch and install security updates in a timely manner can lead to database vulnerabilities, weakening or even complete failure of database protection measures, which in turn exposes enterprises to serious data breach risks, severe financial impact and loss of goodwill.

However, this traditional database management method is based on traditional optimization techniques based on expert experience, such as cost estimation, connection sequence selection, and parameter tuning, which can no longer meet the requirements of multi-scenario business, massive applications, and extreme performance.

Contents of the invention

The embodiment of the present application provides a database management system, data processing method and equipment, which can automatically perform database optimization, protection, update and other routine database management tasks traditionally performed by DBAs by combining with machine learning technology function without human intervention.

Based on this, the embodiment of the present application provides the following technical solutions:

In the first aspect, the embodiment of the present application firstly provides a database management system, which may specifically include: a self-learning optimizer (it should be noted that some kernel components included in the optimizer provided by the present application may be native kernels Components, but the n models included in the optimizer are newly added in this application, n≥1), training data collector, model manager and model evaluator, wherein, the optimizer is used according to the SQL input database statement, the final physical plan to be executed (may be called the target physical plan) is obtained through the n models, wherein the target physical plan refers to the execution cost that meets a certain preset requirement (may be called the first preset requirement) physical plan. The training data collector is used to obtain training data according to the operation data of the process in the database (such as database operation indicators, query logs, system logs, etc.), and construct m training sets based on the obtained training data, where m≥1. A model manager, configured to use the target training set corresponding to the first target model to fine-tune the first target model when the first target model satisfies a certain preset requirement (which may be referred to as a second preset requirement). (finetune), thereby obtaining the second target model (the second target model is essentially the first target model with model parameters updated). Wherein, the first target model is one of the n models, and the target training set is one of the m training sets. After that, the model parameters of the second target model after finetune can be passed to the model evaluator to evaluate the model performance. What needs to be explained here is that the values of m and n can be the same or different. If m=n, it means that each of the n models corresponds to a training set used by itself. If m≠n, Then it shows that among n models, there may be multiple models sharing a training set (that is, the case of m<n), and there may also be a situation where a model can have multiple training sets for training (that is, the case of m>n ), the specific application is not limited to this. The model evaluator is used to evaluate the performance of the obtained second target model, and when the performance of the second target model satisfies a certain preset requirement (which may be called the third preset requirement), the first An object model is updated to the second object model. As an example, the update process may specifically be: the model evaluator sends the model parameters of the second target model to the optimizer, and the optimizer assigns the received updated model parameters to the first target model, thereby obtaining the Second target model. It should be noted that, in some embodiments of the present application, the model evaluator may be a performance prediction model based on graph embedding.

In the above embodiments of the present application, the database management system includes an optimizer with n models, a training data collector, a model manager, and a model evaluator, wherein the optimizer with n models replaces the traditional heuristic optimizer, By combining with machine learning technology, logical queries can be transformed into physical execution plans with higher execution efficiency; the training data collector can generate training data for models involved in the database based on the running data of the processes in the database, and can realize continuous Optimize the database and provide credible autonomous operation and maintenance services; the model manager can call the corresponding target training set in the training data collector to fine-tune the first target model, so as to dynamically update and replace the database according to the real-time operation status of the database The corresponding model used in ; the model evaluator is used to provide secondary verification capabilities to ensure that the best valid model is provided. The database management system constructed in the embodiment of the present application can realize the function of automatically performing database tuning, protection, updating and other routine database management tasks traditionally performed by DBAs without human intervention by combining with machine learning technology.

In a possible implementation manner of the first aspect, the database management system may further include a suggester, which may include p models, where p≥1; the suggester is used to discover the Abnormal conditions (that is, abnormal data found) in the operation data (such as CPU utilization rate, user response time, etc.), and the abnormal cause is diagnosed based on the obtained abnormal data, and then based on the obtained abnormal cause optimization and the abnormal cause The corresponding optimization module (the optimization module is also located in the advisor, and the function of the optimization module is to optimize the parameters of the database) is to reduce the probability of abnormal occurrence of the operation data of the process in the database.

In the above-mentioned embodiments of the present application, the constructed database management system can be based on machine learning methods, and a new suggester is added to the database management system to realize the functions of self-monitoring, self-diagnosis, and self-optimization, so as to automatically and intelligently optimize the database . This solves the problem that database monitoring, configuration, diagnosis, and optimization methods (such as parameter tuning, slow SQL diagnosis, index/view consultant) in the prior art rely on DBA, have high cost, and cannot adapt to large-scale instances (such as cloud databases). .

In a possible implementation manner of the first aspect, the suggester may specifically include three models, which may be respectively referred to as a codec, a first model, and a second model, and are respectively used to perform self-monitoring, self-diagnosis, and Self-optimize these three steps. Specifically, the suggester encodes and decodes the running data of the process in the database through a codec to obtain encoded data, and compares the encoded data with the running data input to the codec to obtain abnormal data . What needs to be explained here is that the principle of using the codec to obtain abnormal data is: the codec can restore the normal original data, but not the abnormal original data, so that the input original data is encoded and then decoded to obtain the encoded data. The encoded data is compared with the original data to know whether there is abnormal data. After obtaining the abnormal data, if the operation data belongs to the system index data (such as page failure), the suggester can further use the first model to diagnose the cause of the abnormality according to the abnormal data, wherein the first model is based on depth A model constructed by a learning algorithm; if the operating data belongs to query index data (such as average delay), the suggester can further diagnose and obtain the cause of the abnormality based on the abnormal data through the second model, wherein the second model is based on Models built by deep learning algorithms. It should be noted here that in other implementations of this application, the suggester may also include more or fewer models for realizing the process of self-monitoring and self-diagnosis of the database. In the embodiment of this application, it is recommended Included are three models for illustration only.

In the above embodiments of the present application, it is specifically explained how the advisor implements the functions of self-monitoring and self-diagnosis of the database through the models contained therein, and realizes the purpose of the database management system to automatically and intelligently manage the operating data of the database.

In a possible implementation of the first aspect, the first model may include an LSRM model and a classifier. Specifically, the suggester encodes the found abnormal data into a compressed vector (ie dimensionality reduction/ Dimension-uplifted vector), and then use learning-type classifiers (such as binary classifiers, multi-classifiers, etc.) to reason about the corresponding root cause (such as database backup operations); the second model can include Tree-LSTM models and The softmax function, specifically, the suggester encodes the slow query (that is, the execution time of the query is long) by calling the Tree-LSTM model, locates the operation of the physical operator (that is, the execution operator) that causes the exception, and then uses the softmax function to identify the resulting The root cause of the exception.

In the above-mentioned implementation manners of the present application, a typical implementation manner of the first model and the second model is described in detail, which is realizable.

In a possible implementation of the first aspect, if the database management system constructed in the embodiment of the present application includes the suggester, the above-mentioned first target model can not only be any one of the n models in the optimizer , can also be any one of the p models in the suggester.

In the above embodiments of the present application, it is clarified that the first target model can be any one of the optimizer and the recommender, which expands the selection range of the first target model, so that the subsequent model evaluator can not only evaluate the model in the optimizer performance, you can further evaluate the performance of the model in the proposer, and have the feasibility of the scheme.

In a possible implementation of the first aspect, the optimizer may specifically include three models, which may be referred to as the third model, the fourth model, and the fifth model, which are respectively used to perform logical query rewriting, cost estimation, and The physical plan is generated in these three steps. Specifically, the optimizer rewrites the logical query of the SQL statement (also referred to as SQL query) input to the database through the third model, so as to obtain the rewritten logical plan, wherein the third model is based on tree search The model constructed by the algorithm, for example, can be a Monte Carlo tree search algorithm; afterward, the optimizer generates q physical plans according to the logical plan through the fourth model, wherein the fourth model is constructed based on a deep learning algorithm The model, for example, can be a model based on Tree-LSTM, q≥1; finally, the optimizer calculates the q execution costs corresponding to the q physical plans through the fifth model (one physical plan corresponds to one execution cost) , and determine the final target physical plan to be executed according to the q execution overheads, wherein the fifth model is a model constructed based on a reinforcement learning algorithm, for example, it may be a model based on deep reinforcement learning (deep Q-learning, DQN) . It should be noted here that in other implementations of this application, the optimizer may also include more or fewer models to implement the process of logical query rewriting, cost estimation, and physical plan generation. In the example, the optimizer includes three models for illustration only.

In the above-mentioned embodiments of the present application, it is specifically explained how the optimizer implements the functions of logical query rewriting, cost estimation and physical plan generation of the database through the model included in it, thereby replacing the traditional heuristic optimizer. The combination of technologies realizes the conversion of logical queries into physical execution plans with higher execution efficiency, and can effectively solve the problems of inaccurate cost evaluation caused by current database architecture problems and poor physical plans generated by complex SQL statements.

In a possible implementation of the first aspect, if the performance of the second target model does not meet the third preset requirement, the model evaluator will also be used to trigger the database to use the native kernel components in the database to generate the final to-be-executed The target physical plan, for example, the index selection module enables the traditional hill-climbing algorithm to create new indexes to execute the logic of the SQL statement. That is to say, if the performance of the second target model after finetune still does not meet the requirements, the traditional kernel component algorithm of the database is used to generate the target physical plan.

In the above embodiments of the present application, since the original kernel components of the database are not deleted, but coexist with the newly added optimizer in the database software, therefore, during the operation of the database, if the performance of the second target model does not meet the requirements of the first Three preset requirements, the model evaluator will also trigger the database to use the native kernel components in the database to generate the final target physical plan to be executed (because in this case, the native kernel components perform better), so that it can be based on The method of dynamically adjusting the evaluation of the current performance of the second target model in real time to generate the target physical plan improves the performance of the database as a whole.

In a possible implementation of the first aspect, if the performance of the second target model does not meet the third preset requirement, the model evaluator can be used to trigger the database to generate the final target to be executed using native kernel components in the database The physical plan may also feed back the information that the model update fails (that is, the performance of the second target model does not meet the third preset requirement) to the model manager, so that the model manager adjusts the first target model based on the information fine-tuning strategy.

In the above-mentioned embodiments of the present application, the model evaluator can further feed back information about model update failure to the model manager, so that the model manager can provide reference for subsequent model training strategies, thereby improving model training capability and training efficiency.

In a possible implementation manner of the first aspect, the execution overhead meeting the first preset requirement includes but is not limited to:

1) The execution overhead of the target physical plan is the lowest among the q execution overheads. The q execution overheads are the corresponding execution overheads of the q physical plans generated based on the SQL statements input into the database. One physical plan corresponds to one execution overhead, where ,q≥1;

2) The execution cost of the target physical plan is lower than a certain preset value (which may be referred to as a first preset threshold). For ease of illustration, in subsequent embodiments of the present application, the execution cost of the target physical plan is the lowest among the q execution costs as the case where the execution cost meets the first preset requirement, and details will not be described later.

In the above embodiments of the present application, several specific situations in which the execution cost meets the first preset requirement are specifically described, which has wide applicability and flexibility.

In a possible implementation manner of the first aspect, the first target model meeting the second preset requirement includes but is not limited to:

1) The performance of the first target model begins to decline;

2) The performance of the first target model not only declines, but the degree of decline reaches a certain preset value (which may be called the second preset threshold);

3) Evaluate the real-time performance of the first target model and predict the next performance. For example, the performance of the first target model can be predicted by a model evaluator, assuming that the predicted performance of the first target model declines The probability of reaching a certain preset value (may be referred to as the third preset threshold value), for example, the probability that the predicted performance will decline reaches 80%;

4) The continuous running time of the first target model reaches a preset time period, for example, the continuous running time of the first target model reaches 30 minutes.

In the above embodiments of the present application, several specific situations in which the first target model satisfies the second preset requirement are specifically described, which has wide applicability and flexibility.

In a possible implementation manner of the first aspect, the second target model meeting the third preset requirement may include but not limited to:

1) Compared with the performance of the first target model, the performance of the second target model has increased by a certain preset value (which may be called the fourth preset threshold). As an example, the fourth preset threshold can be zero, which means that as long as When the performance of the second target model reaches the performance level of the original first target model, it is considered that the second target model meets the third preset requirement; as another example, the fourth preset threshold can also be a certain value greater than zero Or a certain ratio, indicating that only when the performance of the second target model is improved to a certain extent compared with the performance of the original first target model, the second target model is considered to meet the third preset requirement;

2) The performance of the second target model is improved by a fifth preset threshold compared with the performance of the native kernel components in the database. That is to verify the performance improvement of the second target model compared with the traditional database algorithm. If the performance improvement reaches a certain threshold, the model used by the corresponding module of the data is actually replaced. Otherwise, the traditional database algorithm is still used to execute the target physical plan. Wherein, the value of the fifth preset threshold may be zero, or a certain value or a certain ratio greater than zero. For details, please refer to the above-mentioned first method, which will not be repeated here.

In the above embodiments of the present application, several specific situations in which the second target model satisfies the third preset requirement are specifically described, which has wide applicability and flexibility.

The second aspect of the embodiment of the present application also provides a data management method, the method includes: first, the local computer device receives the SQL statement sent by the client device to the database deployed in the computer device, wherein the database includes Optimizer and database native kernel components, the optimizer includes n models, n≥1. After the computer device receives the SQL statement sent by the client device, it will first judge whether there are n models in the optimizer that do not meet a certain preset requirement (which can be called the second preset requirement). A target model (that is, one of the n models) does not meet the second preset requirement, then the computer device will obtain the target physical plan through the n models included in the optimizer according to the SQL statement, wherein the target physical plan To execute a physical plan whose cost meets the first preset requirement. After the computer device obtains the final target physical plan, it will execute the target physical plan. The essence of this execution process is to use the generated target physical plan to execute the actual logic of the input SQL statement.

In the above embodiments of the present application, it is explained how the computer device obtains the target physical plan based on the optimizer included in the database and finally executes the target physical plan. The database deployed in the computer device includes an optimizer with n models, It replaces the traditional heuristic optimizer and realizes the conversion of logical queries into physical execution plans with higher execution efficiency by combining with machine learning technology.

In a possible implementation of the second aspect, the optimizer may specifically include three models, which may be referred to as the third model, the fourth model, and the fifth model, which are respectively used to perform logical query rewriting, cost estimation, and Steps for physical plan generation. In this case, the way the computer device obtains the target physical plan through the n models included in the optimizer according to the SQL statement can specifically be: first, the computer device uses the third model to input the SQL statement (also can be referred to as SQL query) to rewrite the logical query, so as to obtain the rewritten logical plan, wherein, the model A is a model based on a tree search algorithm, for example, it can be a Monte Carlo tree search algorithm; after that, through The fourth model is to generate q physical plans according to the logical plan, wherein the fourth model is a model based on a deep learning algorithm, for example, a model based on Tree-LSTM, q≥1; finally calculated by the fifth model q execution overheads corresponding to the q physical plans (one physical plan corresponds to one execution overhead), and determine the final target physical plan to be executed according to the q execution overheads, wherein the fifth model is constructed based on a reinforcement learning algorithm The model of , for example, can be a DQN-based model. It should be noted here that in other implementations of this application, the optimizer may also include more or fewer models to implement the process of logical query rewriting, cost estimation, and physical plan generation. In the example, the optimizer includes three models for illustration only.

In a possible implementation of the second aspect, the computer device may also send the running data of the processes in the database deployed therein to the recommender, and the recommender may be deployed in the computer device or in the The remote device is not limited here. After receiving the operation data, the advisor can send abnormal data based on the operation data, and diagnose the cause of the abnormality based on the obtained abnormal data, and finally optimize the self-optimization module corresponding to the abnormal cause based on the abnormal cause, so as to reduce the subsequent process in the database. The probability of anomalies in the running data of , where the suggester includes p models, p≥1.

In the above-mentioned embodiments of the present application, it is specifically stated that the computer device can also feed back the running data of the processes in the database to the suggester, and the suggester can give all-round optimization suggestions for the database based on the running data, and can realize an unattended database Performance monitoring and root cause identification greatly liberate operation and maintenance manpower, and can help the database system quickly recover from abnormalities or improve performance.

In a possible implementation manner of the second aspect, the computer device may also send the running data of the processes in the database deployed therein to the recommender, or send the running data to the training data collector, the The training data collector can be deployed in the computer device or in the remote device, which is not limited here. After receiving the running data, the training data collector can obtain training data according to the running data, and construct m training sets based on these training data, where m≥1.

In the above-mentioned embodiments of the present application, it is specifically stated that the computer device can also feed back the running data of the processes in the database to the training data collector, and the training data collector can generate the training data of the models involved in the database based on the running data of the processes in the database. Data can realize continuous optimization of the database system, reduce the probability of misjudgment of the database system, and provide credible autonomous operation and maintenance services.

In a possible implementation of the second aspect, if the first target model satisfies the second preset requirement, the computer device may further send a model manager (the model manager may be deployed in the computer device, or may be not deployed on the computer device, not limited here) to send a first instruction, the first instruction is used to instruct the model manager to fine-tune the first target model, and, when the performance of the second target model satisfies a predetermined In the case of setting a requirement (which may be referred to as a third preset requirement), the computer device receives the model parameters of the second target model sent by the model manager, wherein the second target model is used by the model manager and the first target model The target training set corresponding to the model is a model obtained by fine-tuning the first target model. The target training set is one of the m training sets. Finally, the computer device updates the first target model to the second target model, and passes The updated n models (at this time, the updated n models do not include the first target model, but include the second target model) to obtain the target physical plan.

In the above-mentioned embodiments of the present application, it is specifically stated that in the case that the first target model does not meet the second preset requirements, it is also necessary to use the target training set corresponding to the first target model to set the first target model through the model manager. The model is fine-tuned, and when the performance of the second target model obtained after fine-tuning meets the third preset requirement, the first target model is updated, so as to dynamically update and replace the used in the database according to the real-time operation status of the database corresponding model.

In a possible implementation of the second aspect, if the performance of the second target model does not meet the third preset requirement, then the computer device will also receive a second instruction sent by the model evaluator, and the second The instruction is used to instruct the database to use native kernel components in the database to generate the final target physical plan, wherein the model evaluator can be deployed in the computer device or in a remote device, which is not limited here. The model evaluator is used to evaluate the performance of the second target model.

In the above-mentioned embodiments of the present application, it is specifically stated that when the fine-tuned second target model still does not meet the third preset requirement, the computer device receives the second instruction of the model evaluator to instruct the database to use the traditional algorithm of the database (that is, the native kernel component) to generate the target physical plan. The embodiment of the present application provides multiple options for generating the target physical plan, and has flexibility.

In a possible implementation of the second aspect, the first target model meeting the second preset requirements includes but is not limited to:

1) The performance of the first target model begins to decline;

In a possible implementation manner of the second aspect, the second target model meeting the third preset requirement may include but not limited to:

1) Compared with the performance of the first target model, the performance of the second target model has increased by a certain preset value (which may be called the fourth preset threshold). As an example, the fourth preset threshold can be zero, which means that as long as When the performance of the second target model reaches the performance level of the original first target model, it is considered that the second target model meets the third preset requirement; as another example, the fourth preset threshold can also be a certain value greater than zero Or a certain ratio, it means that only when the performance of the second target model is improved to a certain extent compared with the performance of the original first target model, the second target model is considered to meet the third preset requirement;

In a possible implementation of the second aspect, the execution overhead meeting the first preset requirement includes but is not limited to:

A third aspect of the embodiments of the present application provides a computer device, where the computer device has a function of implementing the method of the second aspect or any possible implementation manner of the second aspect. This function may be implemented by hardware, or may be implemented by executing corresponding software on the hardware. The hardware or software includes one or more modules corresponding to the above functions.

The fourth aspect of the embodiment of the present application provides a computer device, which may include a memory, a processor, and a bus system, wherein the memory is used to store a program, and the processor is used to call the program stored in the memory to execute the second aspect of the embodiment of the present application Or any possible implementation method of the second aspect.

The fifth aspect of the embodiment of the present application provides a computer-readable storage medium, the computer-readable storage medium stores instructions, and when it is run on a computer, the computer can execute any one of the above-mentioned second aspect or the second aspect. method of possible implementation.

The sixth aspect of the embodiments of the present application provides a computer program, which, when running on a computer, causes the computer to execute the method of the above-mentioned second aspect or any possible implementation manner of the second aspect.

The seventh aspect of the embodiment of the present application provides a chip, the chip includes at least one processor and at least one interface circuit, the interface circuit is coupled to the processor, and the at least one interface circuit is used to perform the function of sending and receiving, and send instructions to At least one processor, at least one processor is used to run computer programs or instructions, which has the function of realizing the method of the second aspect or any possible implementation mode of the second aspect above, and this function can be realized by hardware or by software Realization can also be achieved through a combination of hardware and software, where the hardware or software includes one or more modules corresponding to the above functions. In addition, the interface circuit is used to communicate with other modules outside the chip.

Description of drawings

Fig. 1 is a schematic diagram of a system architecture for constructing a database management system provided by the embodiment of the present application;

FIG. 2 is a schematic diagram of a logical architecture of the database management system provided by the embodiment of the present application;

Fig. 3 is a schematic diagram of the principle of the optimizer provided by the embodiment of the present application;

Fig. 4 is a schematic diagram of the principle of the model evaluator provided by the embodiment of the present application;

Fig. 5 is a schematic diagram of the principle of the suggester provided by the embodiment of the present application;

FIG. 6 is a schematic diagram of the comparison between the optimizer provided by the embodiment of the present application and three re-strategies;

Fig. 7 is a schematic diagram comparing the model evaluator provided by the embodiment of the present application with two known performance evaluation methods;

FIG. 8 is a schematic flowchart of a data processing method provided in an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a computer device provided by an embodiment of the present application;

FIG. 10 is another schematic structural diagram of a computer device provided by an embodiment of the present application.

Detailed ways

The terms "first", "second" and the like in the specification and claims of the present application and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific sequence or sequence. It should be understood that the terms used in this way can be interchanged under appropriate circumstances, and this is merely a description of the manner in which objects with the same attribute are described in the embodiments of the present application. Furthermore, the terms "comprising" and "having", as well as any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, product, or apparatus comprising a series of elements is not necessarily limited to those elements, but may include elements not expressly included. Other elements listed explicitly or inherent to the process, method, product, or apparatus.

The embodiment of the present application involves a lot of related knowledge about databases, models, etc. In order to better understand the solution of the embodiment of the present application, the following first introduces related terms and concepts that may be involved in the embodiment of the present application. It should be understood that the interpretation of related concepts may be limited due to the specific conditions of the embodiment of the application, but it does not mean that the application is limited to the specific conditions, and there may be differences in the specific conditions of different embodiments. Specifically, there is no limitation here.

(1) database

A database is a computer software system that stores and manages data according to its data structure. The concept of the database actually includes two meanings: a. The database is an entity, which is a "warehouse" that can reasonably store data. Users store transaction data to be managed in the "warehouse". Concepts are combined into a database. b. Database is a new method and technology of data management. It can organize data more appropriately, maintain data more conveniently, control data more closely and utilize data more effectively.

(2) Database software

Database software is deployed on local devices, such as local servers and local terminal devices (such as mobile phones, smart watches, personal computers, etc.), and usually exists in the form of single or multiple processes, so database software can also be called database processes .

(3) neural network

A neural network can be composed of neural units. Specifically, it can be understood as a neural network with an input layer, a hidden layer, and an output layer. Generally speaking, the first layer is the input layer, and the last layer is the output layer. The number of layers in the middle is is the hidden layer. Among them, a neural network with many hidden layers is called a deep neural network (DNN). The work of each layer in a neural network can be expressed mathematically

To describe, from the physical level, the work of each layer in the neural network can be understood as completing the transformation from the input space to the output space (that is, the row space of the matrix to the column space) through five operations on the input space (a collection of input vectors). Space), these five operations include: 1. Dimension up/down; 2. Zoom in/out; 3. Rotate; 4. Translation; 5. "Bending". Among them, the operations of 1, 2, and 3 are performed by

Complete, the operation of 4 is completed by "+b", and the operation of 5 is realized by "a()". The reason why the word "space" is used here is because the object to be classified is not a single thing, but a class of things. Space refers to the collection of all individuals of such things, where W is the weight matrix of each layer of the neural network , each value in this matrix represents the weight value of a neuron in the layer. The matrix W determines the space transformation from the input space to the output space mentioned above, that is, the W of each layer of the neural network controls how to transform the space. The purpose of training the neural network is to finally obtain the weight matrix of all layers of the trained neural network. Therefore, the training process of the neural network is essentially to learn the way to control the spatial transformation, and more specifically, to learn the weight matrix.

(4) Loss function (loss function)

In the process of training the neural network, because it is hoped that the output of the neural network is as close as possible to the value that you really want to predict, you can compare the predicted value of the current network with the target value you really want, and then according to the difference between the two to update the weight matrix of each layer of neural network (of course, there is usually an initialization process before the first update, that is, to pre-configure parameters for each layer in the neural network), for example, if the predicted value of the network is high, then Adjust the weight matrix to make it predict lower, and keep adjusting until the neural network can predict the desired target value. Therefore, it is necessary to pre-define "how to compare the difference between the predicted value and the target value", which is the loss function or objective function, which is an important equation for measuring the difference between the predicted value and the target value. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so the training of the neural network becomes a process of reducing the loss as much as possible.

(5) Back propagation algorithm

In the training process of the neural network, the error back propagation (BP) algorithm can be used to correct the size of the parameters in the initial neural network model, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the forward transmission of the input signal until the output will generate an error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss converges. The backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.

(6) Machine learning

Machine learning is a multi-field interdisciplinary subject, involving probability theory, statistics, convex analysis, algorithm complexity and other disciplines. It specializes in how computers simulate or realize human learning behaviors to acquire knowledge or skills and reorganize existing The knowledge structure enables it to continuously improve its own functions. The following introduces several machine learning models used in the embodiments of this application:

a. Monte Carlo tree search

Monte Carlo tree search (monte carlo tree search, MCTS) is a method of strategy optimization on artificial intelligence problems, usually for those parts that require mobile planning in combination games, Monte Carlo tree search will randomly simulate Versatility combined with tree search accuracy. Due to its achievements in computer Go and the potential to solve some difficult problems, the application field of the Monte Carlo tree search algorithm can cover any field that can be described in the form of (state, action) and predict the result through simulation ( e.g., the rewrite sequence selection problem in query rewriting).

b. Recurrent neural networks (RNN)

RNN is a kind of neural network, and the purpose of RNN is to process sequence data. In the traditional neural network model, from the input layer to the hidden layer to the output layer, the layers are fully connected, and the nodes between each layer are unconnected. But this ordinary neural network is powerless for many problems. For example, to predict what the next word in a sentence is, you generally need to use the previous words, because the preceding and following words in a sentence are not independent. The reason why RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous output. The specific manifestation is that the network will remember the previous information and apply it to the calculation of the current output, that is, the nodes between the hidden layers are no longer connected but connected, and the input of the hidden layer not only includes the output of the input layer Also includes the output of the hidden layer at the previous moment. In theory, RNN can process sequence data of any length.

Since pure RNN cannot deal with recursion, weight exponential explosion or gradient disappearance, it is difficult to capture long-term time correlation; and combining different LSTMs can solve this problem well. The time recurrent neural network can describe dynamic time behavior, because unlike the feedforward neural network (feed forward neural network) that accepts a more specific structure input, RNN circulates the state in its own network, so it can accept a wider range of time series structure input .

c. Long short-term memory artificial neural network (long short-term memory, LSTM)

LSTM can also be called long-term short-term memory network. It is a time-cycle neural network. It is specially designed to solve the long-term dependence problem of general RNN. All RNNs have a chain form of repeated neural network modules. . In standard RNNs, this repeated structural module has only a very simple structure, such as a tanh layer.

d. Tree long short-term memory artificial neural network (tree long short-term memory, Tree-LSTM)

Traditional LSTM can perform better on sequence model tasks because it can maintain sequence information. However, many problems in the database are modeled as a tree structure, such as the logical plan and physical plan of query statements, and LSTM can only input linear sequences, which makes it difficult to handle tree-shaped input well. Tree-LSTM mainly extends LSTM to the tree-shaped input structure, and surpasses the traditional LSTM model in predicting semantic relevance and semantic classification tasks on the tree.

e. Convolutional neural networks (CNN)

CNN is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor composed of a convolutional layer and a subsampling layer. The feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolutional feature map. The convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network. In the convolutional layer of a convolutional neural network, a neuron can only be connected to some adjacent neurons. A convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as a way to extract image information that is independent of location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. That means that the image information learned in one part can also be used in another part. So for all positions on the image, we can use the same learned image information. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.

The convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.

f. Graph convolutional network (GCN)

Motivated by the great success of CNNs in computer vision, many approaches have recently emerged to redefine the concept of convolution for graph data. These methods belong to the category of GCN. Since spectral methods usually process the entire graph at the same time and are difficult to parallelize or scale to large graphs, spatial-based GCNs directly perform convolutions on the graph structure by aggregating information from neighboring nodes. Combined with the sampling strategy, calculations can be performed on a batch of nodes instead of the entire graph, which can effectively improve the processing efficiency of problems related to graph models in databases (such as concurrent query modeling).

g. Reinforcement learning (reinforcement learning, RL)

RL is a field in machine learning that emphasizes how to act based on the environment to maximize the expected benefit. RL is the third basic machine learning method besides supervised learning and unsupervised learning. Unlike supervised learning, RL does not require labeled input/output pairs, nor does it require accurate correction of non-optimal solutions. Its focus is to find the balance between exploration (for unknown domains) and utilization (for existing knowledge), the exchange of "exploration-utilization" in reinforcement learning, multi-armed bandit problems and limited Markov decision-making processes (markov decision process (MDP) is the most studied.

For example, in the embodiment of the present application, for query statements of multi-connection operations, RL controls the execution sequence of each connection operation to ensure that the total query execution cost is minimized. RL also has a training process, which requires continuous execution of actions, observation of the effect of the execution of actions, and accumulation of experience to form a model. Different from supervised learning, each action here generally does not have a directly calibrated label value as a supervisory signal, and the system only gives a feedback to the action performed by the algorithm.

(7) Remote procedure call protocol (remote procedure call, RPC)

RPC is a protocol for requesting services from remote computer programs over a network without requiring knowledge of the underlying network technology.

In the embodiment of the present application, RPC is used to realize the rapid interaction between the database kernel component and the external model manager, such as making a model update request, creating a new model, and so on.

In addition, before introducing the embodiment of the present application, several management structures of the current database management system are briefly introduced, so that the subsequent understanding of the embodiment of the present application is facilitated.

Method 1. Database management system SageDB

The core idea of SageDB is to build multiple cumulative distribution functions (CDF) about data distribution, and use these CDF models to generate learning indexes, replacement cost estimation models, and accelerated physical operators.

As a conceptual system, SageDB first assumes that it can learn a "perfect" CDF model, that is, the probability distribution of the model accurately conforms to the data distribution of the corresponding data table, and then inserts the CDF model into different modules of the database to provide machine learning-based Reasoning ability: 1) For the optimizer, SageDB directly replaces the cost estimation model with the CDF model learned on a single table to estimate the cost (cost) and cardinality of different queries; 2) For the data structure, SageDB directly uses the traditional The block on the multi-way search tree (balance tree, B-Tree) is replaced by the learned CDF model, and a list of abnormal conditions is summarized to correct the problem of CDF positioning errors; 3) For physical operator acceleration, use Take the sorting operation as an example. SageDB first performs rough sorting on the underlying data based on the learned CDF model (input data values, output relative position numbers), and then use traditional sorting algorithms (such as quick sort) to get the final sorting results.

SageDB is still in the process of research, using a simple CDF experimental model, which is difficult to adapt to large-scale data sets and other issues; and all learning functions of SageDB are based on the CDF model cluster learned on a single table, suitable for simple single-table query scenario, but cannot effectively deal with multi-table connection problems; in addition, CDF is only used to learn data or load distribution, and cannot provide intelligent decision-making functions such as query rewriting, query plan generation, abnormal diagnosis, etc., and SageDB does not provide multi-CDF model management and model update mechanism.

Method 2, relational database management system Oracle

Oracle Database has a long-term investment in the automated operation and maintenance of the database. Oracle 10g introduces various self-management functions to simplify management, improve efficiency, and reduce the total cost related to system management. These management functions include: 1) Statistical analysis related to SQL query optimization; 2) Automatic storage manager: Simplifies how data files, control files and log files are stored; 3) Automatic workload repository: Stores and manages Adjusted information; 4) Automatic database diagnostic monitor: analyze stored statistics, identify possible performance bottlenecks and provide suggestions for solving found problems; 5) Automatic query optimization: determine the execution structure by using query rewriting rules and cost models 6) Automatically generate tuning suggestions for SQL statements or workloads. These recommendations are provided to users, who accept or reject these recommendations subjectively; 7) SQL tuning recommendations make decisions based on information provided by query optimizers, including automatic database diagnostic monitors and automatic workload repositories; 8) based on current load Quantity recommended indexes (including bitmap indexes, function-based indexes, and B-tree indexes), materialized views, table partitions, and indexes. Get the content from the SQL cache, and select the appropriate index and materialized view after analysis; 9) Optimizer statistics collector, collect statistics about optimization; 10) By managing database snapshots and storing information, coordinating all in the server Self-management; 11) Server generates alerts, configure the system to automatically generate alerts when events are triggered; 12) Automate pre-installation and post-installation tasks: systems are automatically checked before installation to ensure the success of the installation process and recommend changes; 13) Automatically manage the shared memory used by the Oracle database instance, and eliminate the need for administrators to manually configure shared memory components; 14) The database resource manager allows the DBA to logically divide the workload into different units, and central processing unit (central processing unit) , CPU) resources are allocated to these units without additional overhead. During peak hours, on-line transaction processing (OLTP) should take precedence over on-line analytical processing (OLAP), and vice versa. Its scheduling mechanism is used for fixed time intervals and controls the number of active sessions executing at one time. When an active session's available slots are filled with new sessions, the remaining sessions are queued until a slot becomes available. Its first autonomous database release version is 19c, which provides external services in the form of public cloud, including functions such as automatic indexing, distributed columns, and materialized view recommendations.

However, the optimization capabilities of the Oracle database are mainly presented in the form of single-point, independent functions, and the functions are not unified to form a closed loop. Users need to call on demand according to their own needs; in addition, the autonomous functions of the Oracle database are mainly reflected in the In the stage of analysis and management, based on limited rules or traditional statistical learning, the ability to optimize database exceptions is limited; moreover, Oracle database does not provide a unified model and training data management and update mechanism, nor does it provide component performance verification functions, which is passive Tuning.

To sum up, in order to solve the above problems, the embodiment of this application provides a new database management system, which is based on machine learning algorithms and expert experience, realizes self-learning kernel and model optimization, and builds all-round autonomous functions of the database.

Embodiments of the present application are described below in conjunction with the accompanying drawings. Those of ordinary skill in the art know that, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

First, describe the system architecture of the database management system constructed by the embodiment of the present application. Please refer to FIG. 1 for details. FIG. 1 is a schematic diagram of a system architecture for constructing the database management system provided by the embodiment of the present application. Machine learning platform component 102, self-learning suggester 103 (may be referred to as suggester 103 for short), the module functions of each structure are introduced below:

(1) The module function of database software 101

The database software 101 (similar to server software) is deployed on a local device, for example, can be deployed on a local server or a local terminal device (such as a mobile phone, a personal computer, etc.), and usually exists in the form of a single or multiple processes. The system architecture of the embodiment of the present application includes the self-learning kernel components of the database, and replaces or replaces the algorithm or implementation of the traditional database native kernel components (it should be noted that the native kernel components are not deleted, and they are still in the database process). Overall database reliability or performance improvement. The self-learning kernel component means that it does not replace a certain algorithm through machine learning at a single point. Its outstanding capabilities are based on changes in system load or business status, automatic adaptation to scenarios, automatic updates based on algorithm training models, and access to feedback mechanisms and verification mechanisms. , for automatic model drift and continuous model availability.

It should be noted that, in the embodiment of the present application, the self-learning kernel components can be identified as self-learning optimizer 1011 (may be referred to as optimizer 1011 for short), self-learning index, self-learning storage and self-learning executor, etc., different The self-learning kernel component can correspondingly realize the functions of the corresponding modules in the native kernel component. For the convenience of explanation, in the embodiment of the present application, the implementation mechanism and calling logic of the self-learning kernel component are mainly introduced by using the optimizer 1011 .

(2) Module functions of the machine learning platform component 102

In the embodiment of this application, no matter which machine learning algorithm is used, the data source is the database system, including but not limited to the internal indicators of the database (such as the number of transaction processing per second (transactions per second, TPS), cache hits, active transactions, resource usage, etc.), operating system information, log information, etc. These information will be used for model training. Specifically, the information can be written into the training data collector 1021 (also referred to as the training data collection platform) through the information collector, and the model manager 1022 (also referred to as the model management platform) based on the requirements The training algorithm, combined with the data information, completes the model training. The trained model will be pushed to the model evaluator 1023, and only the model that has been evaluated and meets business expectations can be identified as the model to be applied, otherwise it needs to be readjusted and trained.

It should be noted here that the information collector can be deployed in the database software 101, or it can be a process deployed separately outside the database software 101. The purpose of the separate deployment is to decouple from the database and to implement specific functions (that is, the process in the database The collection function of the operating data), the specific embodiment of the present application does not limit the deployment mode of the information collector.

It should also be noted that in other embodiments of the present application, since the function of the information collector is to collect data, for the sake of illustration, the function of the information collector can be integrated in the optimizer 1011, or integrated in the training data collection The specific application does not limit this, but for the convenience of explanation, in the following embodiments of the application, the training data collector 1021 also functions as an information collector is taken as an example for illustration, and details will not be described later.

It should be noted that, in the embodiment of the present application, the models included in the machine learning platform component 102 may be pre-trained in advance, and the models deployed in the system architecture may refer to pre-trained models. In this case, the training data collected by the training data collector 1021 in the embodiment of the present application can be used to fine-tune the pre-trained model (for example, fine-tuning can be performed when the performance of the model declines after the model has been applied for a period of time); the machine The models included in the learning platform component 102 may also not be pre-trained in advance, but the initialized models are directly deployed on the system architecture, and then the training data collected by the training data collector 1021 are used to train each model and Subsequent fine-tuning, specifically, this application does not limit the state of the model included in the machine learning platform component 102 during deployment (that is, whether it has been pre-trained).

It should also be noted that, in some embodiments of the present application, the machine learning platform component 102 may be deployed on a remote device (for example, a remote server, as shown in FIG. 1 is deployed on a remote device), or may Deploying local devices (for example, local servers) can even be implemented together with the database in the same process (that is, implemented in the database kernel). The individual component program is the least intrusive way to the existing capabilities of the database, and can be used as an accessory to iteratively replace the database kernel module capabilities gradually. The same is true for deployment on local devices. It is less intrusive to the database, but it will force the same device resources as the database. Usually, new scheduling components are required for balance and resource control. Integrate the machine learning component platform in the database kernel, that is, the database itself provides machine learning (including but not limited to deep learning, reinforcement learning, etc.) capabilities, which is very intrusive to the database, but the data privacy protection is good, communication overhead is reduced, and the interface is indirectly convenient and more convenient. Easy model tuning or fine-tuning.

(3) Module functions of the suggester 103

The suggester 103 is used to discover possible problems in the running process of the database, and perform diagnosis and tuning for intelligent operation and maintenance management of the database. The recommender 103 also needs a machine learning platform component to manage the algorithm model used in the intelligent operation and maintenance process. The machine learning platform used can be of the same origin as the machine learning platform components used by the database system, that is, the two can use the same machine learning platform, or can be managed separately, that is, the model in the suggester 103 can deploy a machine learning platform component separately, specifically This application does not limit this, but the functions or mechanisms of the machine learning platform components have not changed, and the model is also automatically updated to provide a learning and feedback mechanism to ensure the availability of the model.

What needs to be explained here is that the suggester 103 is not a manifestation of the database kernel capability, but is used for database management, and can tune or strengthen the capabilities provided by the database kernel module. After interacting with the database, the suggester 103 obtains more information and suggestions, and the model is more optimized, which is beneficial to the intelligent implementation of the system.

Usually, the suggester 103 needs to report diagnostic information and health index, and also needs to accept instructions from users. At this time, a web front-end is required to realize this function. The implementation method is a conventional operation, which will not be described here.

It should also be noted that, in the above embodiments of the present application, the system architecture of the data management system includes a suggester 103, and it can be seen from the above description that the module function of the suggester 103 is to find possible problems in the database operation process, and perform diagnostics and tuning. Therefore, in other implementation manners of the present application, the suggester 103 may not be required.

Based on the above description of the system architecture of the data management system, the following specifically introduces the specific functions and calling logic of each functional module under the database management system provided by the embodiment of the present application. Each functional module is respectively the optimizer 1011 shown in Figure 1, Training Data Collector 1021 (Training Data Collector may also be referred to as Training Data Collection Platform), Model Manager 1022 (Model Manager may also be referred to as Model Management Platform), Model Evaluator 1023, Adviser 103 (in some embodiments , the suggester 103 may also not be required). Please refer to FIG. 2 for details. FIG. 2 is a schematic diagram of a logical architecture of the database management system provided by the embodiment of the present application, wherein the SQL query parser and the storage engine in the database management system are existing modules, and others are new to the system. Add module, the new module includes self-learning optimizer 201 (it should be noted that some kernel components included in the optimizer 201 provided by this application can be native kernel components, but the n models included in the optimizer 201 This application is newly added), training data collector 202, model manager 203, model evaluator 204 and self-learning suggester 205 (in some embodiments, suggester 205 may not be required), the following are respectively from the specific functions Each function module is described logically and called:

1. Optimizer 201

Most problems in optimizers (e.g., query rewriting, cost estimation, and physical plan generation) are NP-hard problems (it is difficult to solve non-deterministic problems with efficient algorithms), and existing optimization techniques use heuristic methods, may fall into a local optimum. Therefore, in order to solve these problems, the learning optimizer provided by the embodiment of the present application uses machine learning technology to improve performance.

Specifically, in the embodiment of the present application, the optimizer 201 includes n models, where n≥1. The optimizer 201 is used to obtain the final physical plan (which may be called the target physical plan) to be executed through the n models according to the SQL statement input into the database, wherein the target physical plan means that the execution cost meets a certain predetermined A physical plan for setting requirements (which may be referred to as the first predetermined requirement).

It should be noted that, in some embodiments of the present application, the execution overhead meeting the first preset requirements includes but is not limited to: 1) The execution overhead of the target physical plan is the lowest among the q execution overheads, and the q execution overheads are The corresponding execution costs of q physical plans generated based on the SQL statements input into the database, one physical plan corresponds to one execution cost, where q≥1; 2) The execution cost of the target physical plan is lower than a certain preset value (which can be called is the first preset threshold). For ease of illustration, in subsequent embodiments of the present application, the execution cost of the target physical plan is the lowest among the q execution costs as the case where the execution cost meets the first preset requirement, and details will not be described later.

It should be noted that, in some embodiments of the present application, the optimizer 201 may specifically include three models, which may be referred to as model A, model B, and model C respectively, which are respectively used to perform logical query rewriting, cost estimation, and physical The plan generates these three steps. It should be noted here that, in other implementations of this application, the optimizer 201 may also include more or fewer models for realizing the process of logical query rewriting, cost estimation, and physical plan generation. In this application In the embodiment, the three models included in the optimizer 201 are only for illustration, and will not be described in detail later.

Specifically, the optimizer 201 performs logical query rewriting on the SQL statement (also referred to as SQL query) input to the database through model A, so as to obtain a rewritten logical plan, wherein the model A is based on a tree search algorithm The constructed model, for example, may be a Monte Carlo tree search algorithm; afterward, the optimizer 201 generates q physical plans according to the logical plan through the model B, wherein the model B is a model constructed based on a deep learning algorithm, For example, it may be a model based on Tree-LSTM, q≥1; finally, the optimizer 201 calculates the q execution costs corresponding to the q physical plans (one physical plan corresponds to one execution cost) through the model C, and The final target physical plan to be executed is determined according to the q execution costs, wherein the model C is a model constructed based on a reinforcement learning algorithm, for example, a model based on DQN.

It should be noted that, in some embodiments of the present application, the above three steps performed by the optimizer 201 can specifically be implemented through three sub-modules: a learning-type rewriter, a learning-type cost estimator, and a learning-type plan generator. Generation, this process can specifically be: first, the rewriter provided in the embodiment of the present application uses a model based on a tree search algorithm to convert the initial SQL statement input into the database system into a semantically equivalent logical plan A recognizable by the database system; then based on The cost estimator obtains a rewritten logical plan B with higher execution efficiency. The plan generator generates x physical plans (for example, x=5) based on the logical plan B, and then obtains the respective execution costs of the x physical plans based on the cost estimator, and selects the physical plan whose execution cost meets the first preset requirement (that is, the target physical plan), and finally use the generated target physical plan to execute the actual logic of the initial SQL statement.

For ease of understanding, taking the optimizer 201 including three sub-modules of a learning-type rewriter, a learning-type cost estimator, and a learning-type plan generator as an example, the logical query rewriting and cost estimation executed by the optimizer 201 And the process of generating the physical plan is described in detail:

(1) Logical query rewriting

The learning rewriter uses an MCTS-based approach to rewrite the input SQL statement into an equivalent but less expensive query. It first constructs a strategy tree, where the root node is the original query, and the tree nodes are obtained from its The parent node rewrites the resulting query. This rewriter uses an MCTS method. First, it constructs a strategy tree with the input logical plan as the root node, in which each child node represents the semantically equivalent logical plan obtained by the parent node through a rewriting operation; then in the strategy tree The last iteration selects the equivalent logical plan with the least cost or the lowest probability of being selected, and expands the strategy tree (that is, according to all the rewriting strategies of the selected plan, add a new child node under the tree node corresponding to the selected plan); finally The logical plan with the least execution cost on the strategy tree is selected as the output of the rewriter.

(2) Cost estimation

The learned cost estimator uses deep learning-based methods to estimate the cost and cardinality of queries, which can capture the correlation between different columns. It designs a tree structure model of a physical plan that matches the query statement. The analogy physical plan is composed of multiple sub-plans, and each tree structure model can be composed of several sub-models. The embodiment of this application uses the tree structure model to estimate the cost of the plan or base.

(3) Physical plan generation

The learning plan generator uses a method based on reinforcement learning to generate an optimized physical plan (that is, the target physical plan). The logic here is: the generated equivalent logical plan will correspond to multiple execution plan trees, and each execution plan tree includes One or more execution operators (also called physical operators), each execution plan tree may have multiple execution paths, involving different execution operators, one execution plan tree corresponds to a total cost, our The goal is to find the physical plan with the smallest total overhead. It uses the reinforcement learning of LSTM with tree structure to select the connection order. Specifically, the plan generator uses Tree-LSTM to encode the current physical plan into a compressed vector as the state of deep reinforcement learning, and then iterates multiple times, each Select the connection operation with the highest long-term income for the first time, and finally output the physical plan with the lowest execution cost as the actual logic for executing the SQL statement. In this embodiment of the present application, GCN may be used to capture the structure of the connection tree supporting database schema update and multiple aliases for table names. The model can automatically select the appropriate physical operator.

To sum up, the call logic of the optimizer 201 is: in the learning optimizer 201 provided by the embodiment of the present application, for the logical plan input through the SQL query parser, the rewriter first builds a The logical plan is the strategy tree of the root node, where each child node represents the equivalent logical plan obtained by the parent node after a rewriting operation. Based on MCTS, the rewriter searches the strategy tree for an equivalent logical plan with the least cost, and then inputs it to the plan generator. The plan generator iteratively adjusts the sequence of join operations to obtain multiple different physical plans. For each physical plan, the cost estimator is used to estimate the execution cost, and then the physical plan output with the smallest execution cost is selected.

For ease of understanding, based on the principle of the optimizer 201, the execution process of the optimizer 201 is described in detail below. Please refer to FIG. 3 for details. FIG. 3 is a schematic diagram of the principle of the optimizer provided by the embodiment of the present application. The learning model rewrites SQL statements, estimates costs, and selects physical execution plans during database execution. The model in this process can be updated through feedback and incremental training to dynamically adapt to load changes. Its core steps are as follows:

1) Before a SQL statement input into the database is executed, call the logical query rewriting function of the optimizer (that is, a word learning rewriter) to rewrite it at the statement level to prevent performance problems caused by bad SQL writing habits . The specific process is as follows:

a. In the first step, a strategy tree is constructed first, in which the root node is the input SQL query, and the non-root nodes are rewritten query statements. Find the rewriting sequence that obtains the maximum benefit through the MCTS search algorithm, that is, iteratively select the equivalent logical plan with the least cost or the least frequently selected equivalent logical plan on the strategy tree, and expand the strategy tree (that is, according to all rewrite strategies of the selected plan , add a new child node under the corresponding tree node of the selected plan); finally select the logical plan with the least execution cost on the strategy tree as the output of the rewriter.

b. The second step is to confirm the potential income of each tree node, and design a neural network-based income estimation model ( For example, the attention layer calculates the similarity between rules and rules on the rewriting operator), and predicts the execution overhead that can be reduced in the subsequent queries.

c. In the third step, in order to improve the search efficiency, especially when there are many logical operators in the query, dynamic programming is used to calculate from the bottom up to select the optimal top N nodes without ancestor-descendant relationship from each node and its subtrees. node, to ensure the maximum total revenue value; then output the node selection scheme corresponding to the root node, which means that the optimal rewritten query can be obtained with the highest probability from the corresponding N node expansion strategy tree.

d. In the fourth step, when the maximum number of iterations is reached or there is no new leaf node, output the rewritten query statement with the smallest execution cost.

2) Call the learning-type cost estimator, use the machine learning model to calculate the cost estimate, and use the Tree-LSTM model to estimate the cardinality and cost of any execution plan for the optimizer to select the path. The specific process is as follows:

a. For offline training, the training data is based on the collected historical query statements, and through feature extraction, the training data is input into the model to be trained. During training, based on the current training loss, the weight of the model is updated through backpropagation.

b. When using cost estimation online, if the sub-plan of the root of the current node has been evaluated, obtain sub-plan estimation information from the cache pool; if the current sub-plan has not been evaluated, encode the root, and convert the encoded plan vector Input the Tree-LSTM model, and then the model returns the estimated cost and plan to the optimizer, and puts the new evaluation information into the cache pool for subsequent query statements.

3) When a complex SQL statement is detected, the model based on deep reinforcement learning is called to enumerate the execution path. Compared with the heuristic algorithm based on genetic algorithm in the traditional database, it can obtain feasible data faster and more efficiently. The SQL execution plan. The specific process is as follows:

a. Discover the optimal plan by combining the Deep Q Network (DQN) with the Tree-LSTM model.

b. Initialize the empty state first, which only contains the basic information of the query, and then further design many intermediate states, each intermediate state contains optional plans for part of the plan tree, forming a plan forest.

c. The training process is divided into cost training and delay adjustment. Among them, the cost training continuously selects fragments in the execution plan through the reinforcement learning method to judge whether the operation of the newly selected plan conforms to the optimal plan; in this process, the Q value retrieval method in the Tree-LSTM model is used to initially determine the pros and cons of the plan. When delay tuning, only a few planned delays are used as training data for model fine-tuning.

d. DQN uses Q network estimation and finds which execution tree is better. In the plan tree, there are three types of leaf nodes, including columns, tables, and operations. Use the depth-first search to traverse the plan tree, and the Tree-LSTM network layer judges that each leaf node represents the cost.

4) When a simple SQL statement is detected, based on the statement characteristics and data distribution, the machine learning algorithm is used to search the list of optional execution plans in the plan buffer and match similar plans as the final execution plan. The specific process is as follows:

a. Build a buffer. When a simple SQL statement is executed for the first time, add its plan to the buffer.

b. In the second and subsequent executions of the simple SQL statement, based on the 90:10 strategy, 90% of the operations are to obtain the plan in the buffer, and 10% of the operations are to be executed after regenerating the plan. If the regenerated plan does not exist in the buffer, it will be added to the buffer as a candidate plan, and each statement can store up to y candidate plans, where y can be customized, for example, y=5.

c. Match the characteristics of the new execution statement with the information in the buffer plan through the K-nearest neighbor (KNN) classification algorithm, and select the execution plan that matches successfully. That is to say, the matching plan is the real executable plan of the statement.

5) The database execution engine executes the above-mentioned optimized SQL statement execution plan, that is, executes the final target physical plan.

2. Training data collector 202

In the embodiment of the present application, the training data collector 202 is configured to obtain training data according to the running data of the processes in the database, and construct m training sets based on the obtained training data, where m≥1.

Specifically, in some embodiments of the present application, the training data collector 202 can automatically collect statistical information of the database, including database operation indicators, query logs, system logs, etc., and use these information to generate all the learning data involved in the database management system. type model (i.e. the model included in the suggester 205 (if any), the optimizer 201 and the model evaluator 204), and can generate corresponding training sets for different models (i.e. construct m training sets) .

As an example, assuming that the optimizer 201, the recommender 205, and the model evaluation 204 include 6 models in total, then 6 different training sets corresponding to each can be generated based on the characteristics of the respective models, that is, m=6, in this In the case, a model corresponds to a training set. In other embodiments of the present application, the training set constructed can also be less than 6 types, that is, m<6. In this case, some models can share one training set. There is no limit to the corresponding relationship between the training set and the model.

It should be noted that the models involved in the embodiments of this application may be pre-trained in advance, that is, the models deployed in the database management system all refer to pre-trained models. In this case, the implementation of this application For example, the training data collected by the training data collector 202 can be used to fine-tune the pre-trained model (for example, fine-tuning can be performed when the performance of the model declines after a period of application); the model involved in the embodiment of the present application can also be It may not be pre-trained in advance, but directly deploy the initialized model to the database management system, and then use the training data collected by the training data collector 202 to train each model and perform subsequent fine-tuning. The deployment state of the model included in the management system (that is, whether it has been pre-trained) is not limited.

It should also be noted that, in this embodiment of the application, the training data collector 202 can collect the running data of the processes in the database from various aspects, including but not limited to: 1) Database indicators: the running status of the database, such as queries per second Number (query per second, QPS), CPU usage, cache hit rate, etc. These are usually represented by time-series data; 2) SQL queries: which collect SQL queries and their statistics, such as physical plan, response time and duration, etc. 3) Database log: It collects running logs. Since different models in the database management system require different training data, this embodiment of the application can intelligently organize training data for different learning modules, including organizing related columns into the same table To reduce connection overhead, select training data for the model, etc.

In summary, the invocation logic of the training data collector 202 is: the training data collector 202 receives the running data (such as the collection information of the database Agent program) from the process in the database, and performs data cleaning on the received running data And data processing operations (for example, after data cleaning, data merging, multi-indicator direct correlation analysis and other operations, so that the data is more suitable for subsequent model training or fine-tuning), the training data is obtained, and the training data is constructed into m training sets, It is used for training or fine-tuning of each model in the database management system. If the model is a model that has not been pre-trained, train the model based on the specified algorithm and training data; if the model is a model that has been pre-trained, evaluate the relationship between the newly obtained training data and the pre-trained model relationship, and continuously monitor to decide whether to update the model; the monitoring time is based on the data source of the training model, whether it is high frequency and easy to change data.

3. Model Manager 203

In the embodiment of the present application, the model manager 203 is configured to use the target training set corresponding to the first target model when the first target model satisfies a certain preset requirement (which may be referred to as the second preset requirement). The first target model is fine-tuned to obtain a second target model (the second target model is essentially the first target model with model parameters updated). Wherein, the first target model is one of the n models, and the target training set is one of the m training sets. Afterwards, the model parameters of the finetune second target model can be passed to the model evaluator 204 for model performance evaluation.

What needs to be explained here is that the values of m and n can be the same or different. If m=n, it means that each of the n models corresponds to a training set used by itself. If m≠n, Then it shows that among n models, there may be multiple models sharing a training set (that is, the case of m<n), and there may also be a situation where a model can have multiple training sets for training (that is, the case of m>n ), the specific application is not limited to this.

It should also be noted that, in some embodiments of the present application, the first target model meeting the second preset requirements includes but is not limited to: 1) the performance of the first target model starts to decline; 2) the first target model The performance of the model not only declines, but the degree of decline reaches a certain preset value (which may be called the second preset threshold); 3) Evaluate the real-time performance of the first target model and predict the next performance, for example , the performance of the first target model can be predicted by the model evaluator 204, assuming that the predicted probability of performance degradation of the first target model reaches a certain preset value (which can be called a third preset threshold), such as predicting The probability that the performance of the first target model will decrease reaches 80%; 4) The continuous running time of the first target model reaches a certain preset time period, for example, the continuous running time of the first target model has reached 30 minutes.

In the embodiment of the present application, the model manager 203 integrates commonly used machine learning capabilities to provide a unified application access interface and support management and scheduling of learning models. Specifically, the model manager 203 generates a better model according to the training data updated by the training data collector 202 to conform to the current system running state.

In summary, the invocation logic of the model manager 203 is: after receiving the training data, the model manager 203 judges whether the model needs to be updated, and if it needs to be updated, after the model is updated, the model parameters of the model are passed to the model for verification device 204.

4. Model evaluator 204

In the embodiment of the present application, the model evaluator 204 is used to evaluate the performance of the obtained second target model, and when the performance of the second target model satisfies a certain preset requirement (which may be referred to as the third preset requirement) Next, update the first objective model in the optimizer 201 to the second objective model. As an example, the update process may specifically be: the model evaluator 204 sends the model parameters of the second target model to the optimizer 201, and the optimizer 201 assigns the received updated model parameters to the first target model, Thus the second target model is obtained. It should be noted that, in some embodiments of the present application, the model evaluator 204 may be a performance prediction model based on graph embedding.

It should be noted that, in some embodiments of the present application, the second target model meeting the third preset requirements may include but not limited to: 1) The performance of the second target model is improved by a certain amount compared with the performance of the first target model A preset value (can be called the fourth preset threshold), as an example, the fourth preset threshold can be zero, indicating that as long as the performance of the second target model reaches the performance level of the original first target model, it is considered The second target model satisfies the third preset requirement; as another example, the fourth preset threshold can also be a certain value or a certain ratio greater than zero, indicating that only the performance of the second target model is compared with that of the original first target model. The second target model is considered to meet the third preset requirement only when the performance has been improved to a certain extent; 2) The performance of the second target model is improved by the fifth preset threshold compared with the performance of the native kernel components in the database. That is to verify the performance improvement of the second target model compared with the traditional database algorithm. If the performance improvement reaches a certain threshold, the model used by the corresponding module of the data is actually replaced. Otherwise, the traditional database algorithm is still used to execute the target physical plan. Wherein, the value of the fifth preset threshold may be zero, or a certain value or a certain ratio greater than zero. For details, please refer to the above-mentioned first method, which will not be repeated here.

It should also be noted that, in some implementations of the present application, if the performance of the second target model does not meet the third preset requirement, the model evaluator 204 will also be used to trigger the database to be generated using native kernel components in the database. The final target physical plan to be executed, for example, the index selection module enables the traditional hill-climbing algorithm to create an index to execute the logic of the SQL statement. That is to say, if the performance of the second target model after finetune still does not meet the requirements, the traditional kernel component algorithm of the database is used to generate the target physical plan. Since the original kernel components of the database are not deleted, but coexist with the newly added optimizer in the database software, therefore, during the operation of the database, which method can improve the performance of the database will be dynamically adjusted in real time to improve overall performance. database performance.

It should also be noted that, in other embodiments of the present application, if the performance of the second target model does not meet the third preset requirement, the model evaluator 204 can be used to trigger the database to use native kernel components in the database to generate the final The target physical plan to be executed may also feed back the information that the model update fails (that is, the performance of the second target model does not meet the third preset requirement) to the model manager 203, so that the model manager 203 can based on the information Adjust the fine-tuning strategy for the first target model to provide a reference for subsequent model training strategies.

Based on the above, the purpose of the model evaluator is to verify whether the model is effective for the workload. If the database uses a learning model, the model evaluator 204 can be used to predict the performance of the model. For the models deployed in the database management system provided by the embodiment of the present application, the present application can perform performance prediction through the model evaluator 204 . If the performance of the model becomes better (for example, the performance improvement reaches a certain threshold), the new model obtained (that is, the original model has updated model parameters) is marked as the best model, which can be actually deployed to the database management system; otherwise, it is marked as The model needs to be updated, abandoning the deployment.

To sum up, the calling logic of the model evaluator 204 is: the model evaluator 204 obtains the latest model generated by the model manager 203, and verifies whether the model is stable and reliable, and at the same time can improve system performance. The verification result is fed back to the model manager 203, which identifies it as the best model or needs to re-update the model.

For ease of understanding, the execution process of the model evaluator 204 is described in detail below based on the principle of the model evaluator 204. For details, please refer to FIG. 4. FIG. The evaluator evaluates the performance of the model deployed in the database management system constructed in the embodiment of the present application (such as the model included in the optimizer and the recommender), and checks whether it can obtain performance improvement. If the new model does not improve performance, then abandon the deployment of the model. Specifically, in some embodiments of the present application, the model evaluator may be implemented based on GNN, and its structure is shown in FIG. 4 . Its core steps are as follows:

1) First, by vectorizing the load to be executed by the user, and then calling the GNN-based performance prediction model for evaluation, the estimated execution effect can be given by combining the load characteristics of the old and new models, and then comparing the new model is it effective.

2) Secondly, input the workload graph into the prediction model. If the graph is too large, it may affect the prediction efficiency. Therefore, the embodiment of the present application proposes a graph compression algorithm, which deletes redundant vertices and merges similar vertices. The specific implementation process is as follows:

a. First, in the construction of the load graph, a graph model is used to capture the workload characteristics, where the vertices represent the operator characteristics extracted from the query plan, and the edges between two operators represent the query correlation and resources between them compete.

b. Second, the performance prediction model feeds features into the prediction model, in which this application proposes a graph embedding algorithm that embeds graph features (e.g., operator features and K-hop neighbors) at the operator level, and constructs a Deep learning models to predict query performance.

c. In addition, if the graph is too large, use the graph compression algorithm in the load graph optimization program to reduce the size of the load graph by merging nodes with overlapping time. The method used is to first divide the existing time according to the execution time range of each node The overlapping nodes are clustered, and then the nodes without edge relationship are merged within each class by least fully connected subgraph (clique).

d. Finally, predict query performance and verify whether the input model brings benefits.

5. Adviser 205

Existing database monitoring, configuration, diagnosis, and optimization methods (eg, parameter tuning, slow SQL diagnosis, index/view consultant) rely on DBAs, are costly, and cannot be adapted to large-scale instances (such as cloud databases). Therefore, in order to solve this problem, this application can realize the functions of self-monitoring, self-diagnosis and self-optimization based on machine learning methods, so as to automatically and intelligently optimize the database.

Therefore, in the embodiment of the present application, the database management system may further include a suggester 205, and the suggester 205 includes p models, where p≧1. The suggester 205 is used to discover abnormal conditions (that is, find abnormal data) in the running data (such as CPU utilization rate, user response time, etc.) of processes in the database, and diagnose the cause of the abnormality based on the obtained abnormal data, and then , then optimize the optimization module corresponding to the abnormal cause based on the obtained abnormal cause (the optimization module is also located in the suggester 205, and the function of the optimization module is to optimize the parameters of the database), so as to reduce the running time of the process in the database The probability of anomalies in the data. It should be noted here that if the database management system constructed in the embodiment of the present application includes the suggester 205, the above-mentioned first target model can not only be any one of the n models in the optimizer 201, but can also be a suggestion Any one of the p models in the device 205.

It should be noted that, in some embodiments of the present application, the recommender 205 may specifically include three models, which may be called codec, model D, and model E, respectively, and are respectively used to perform self-monitoring, self-diagnosis and Self-optimize these three steps. It should be noted here that in other implementations of this application, the suggester 205 may also include more or fewer models for realizing the process of self-monitoring, self-diagnosis and self-optimization of the database. In this example, the suggester 205 includes three models for illustration only, and details will not be described later.

Specifically, the suggester 205 uses a codec to encode and then decode the running data of the process in the database to obtain the coded data, and compares the coded data with the running data input to the codec to obtain an abnormal data. What needs to be explained here is that the principle of using the codec to obtain abnormal data is: the codec can restore the normal original data, but not the abnormal original data, so that the input original data is encoded and then decoded to obtain the encoded data. The encoded data is compared with the original data to know whether there is abnormal data. After obtaining the abnormal data, if the operation data belongs to the system index data (such as page failure), the suggester 205 can further diagnose the cause of the abnormality based on the abnormal data through the model D, wherein the model D is based on deep learning A model constructed by an algorithm, for example, the model D may include an LSRM model and a classifier. Specifically, the suggester 205 encodes the abnormal data found by calling the LSRM model into a compressed vector (that is, a dimension-reduced/up-dimensional vector) , and then use learning-type classifiers (such as binary classifiers, multi-classifiers, etc.) to reason about the corresponding root cause (such as database backup operations); if the running data belongs to query index data (such as average delay), then the The suggester 205 can further use the model E to diagnose and obtain the cause of the abnormality based on the abnormal data, wherein the model E is a model constructed based on a deep learning algorithm. For example, the model E may include a Tree-LSTM model and a softmax function. Specifically, the suggester 205 encodes a slow query (that is, the execution time of the query is long) by calling the Tree-LSTM model, and locates the physical operator that causes the exception (that is, Execute the operator) operation, and then use the softmax function to identify the root cause of the abnormality.

Finally, after the suggester 205 finds the root cause of the data anomaly based on self-monitoring and self-diagnosis, it will select the corresponding optimization module for optimization according to the root cause of the performance degradation of the database system. For example, if the performance degradation is caused by the lack of indexes, The suggester 205 can call the index selection module to build a new index based on deep reinforcement learning, so that the performance of the query load can be improved (the root cause is that there is no suggestion for the index); Parameter tuning based on empirical rules or reinforcement learning at the level, connection level, and system level. The purpose of optimization is to make the system have as few abnormal operating data as possible.

It should be noted here that there is a corresponding relationship (but not necessarily a one-to-one correspondence) between the optimization modules included in the suggester 205 and the root causes. For example, one optimization module may correspond to a series (that is, multiple) root causes It is also possible that one optimization module corresponds to one root cause, which is not limited in this application. As an example, Table 1 is a schematic diagram of the corresponding relationship between optimization modules in the suggester 205 and some root causes.

Table 1. Correspondence between optimization modules and some root causes

Based on the above, the suggester 205 is mainly used to implement the following three functions:

(1) Self-monitoring

Self-monitoring database status, and providing running data (such as CPU usage, response time, running log) when the database is running. For anomaly detection, this application utilizes a codec to automatically detect anomalies based on data distribution and metric correlation. Specifically, the running data is converted into a low-dimensional representation by an encoder, and the low-dimensional representation is restored by a decoder. Data that the codec cannot reconstruct well is considered outlier.

(2) Self-diagnosis

Self-diagnosis aims at automatically diagnosing anomalies and discovering the root causes of abnormal data. If the abnormal data is system index data (for example, lock conflict), the abnormal data found is encoded into a compressed vector (that is, the vector after dimension reduction/dimension enhancement) by calling the LSRM model, and then a learning classifier (such as , binary classifier, multi-classifier, etc.) to infer the corresponding root cause (such as database backup operation); if the abnormal data is query index data (such as slow query), the slow query is encoded by calling the Tree-LSTM model, Locate the physical operator (that is, the execution operator) operation that causes the exception, and then use the softmax function to identify the root cause of the exception.

(3) Self-optimization

Self-tuning automatically optimizes the database for query workloads, e.g. index/view recommendations. Using deep reinforcement learning to automatically recommend indexes, learned view recommendation utilizes an encoder-decoder model to automatically recommend views. Self-optimization is optimized for the database system, and the learning parameter adjustment module uses deep reinforcement learning technology to adjust parameter values. The embodiment of the present application can use the Actor-Critic model to automatically select appropriate parameter values, and can support SQL-level, session-level and system-level parameter tuning.

To sum up, the invocation logic of the suggester 205 is as follows: first, dynamically collect database and query execution status indicators, and then use the self-monitoring module (ie codec) to find abnormal data. For abnormal data, the self-diagnosis module uses the system diagnosis function (namely model D) and the query diagnosis function (namely model E) to find the root cause of database performance degradation, and then specifies the self-optimization module to perform corresponding optimization functions. For example, if the root cause is that no index is established for the accessed column, the self-diagnosis module will call the index selection module of the self-optimization module for optimization.

It should be noted that, in some embodiments of the present application, the module function of the suggester 205 is to discover possible problems in the running process of the database, and perform diagnosis and optimization. Therefore, in some implementation manners of the present application, the suggester 103 may not be needed, and the present application does not limit this.

For ease of understanding, based on the principle of the adviser 205, the execution process of the adviser 205 will be described in detail below. Please refer to FIG. 5 for details. FIG. There are three parts: monitoring, self-diagnosis and self-optimization. Among them, self-monitoring can judge whether there are problems in the history, current and future of the database according to the performance indicators of the running data of the process in the database, and judge the abnormal or possible abnormal state of the database. By exploring the abnormal state of the database, the database self-diagnosis and self-optimization functions are used to solve the actual problems of the database. For the sake of illustration, it is assumed that the advisor includes a self-monitoring module, a self-diagnosis module, and a self-optimization module, which are used to realize self-monitoring, self-diagnosis, and self-optimization respectively. The core steps are as follows:

1) The self-monitoring module continuously collects database performance indicators. When abnormalities occur inside or outside the database, it can be reflected through corresponding indicators and system logs. Therefore, openGauss performs real-time anomaly monitoring and discovery by analyzing database and operating system indicators. The specific process is as follows:

a. First, the training data collector continuously collects indicators and logs from the database and operating system, such as QPS, operation logs, etc., and then puts these data together to form time series data.

b. Secondly, the reconstruction-based algorithm is used to find anomalies, that is, normal time series data always have a regular change pattern, and the abnormal change pattern is likely to be a system anomaly. The embodiment of the present application adopts an LSTM-based automatic codec with an attention layer. Raw time series data is encoded into a low-dimensional representation, and a decoder parses the representation and attempts to recover the original data. The training loss is the reconstruction quality. The model learns the distribution of these multidimensional data and acquires the ability to reconstruct. Data that cannot be reconstructed (errors exceed a threshold) are reported as anomalies. The embodiment of the present application adopts the statistical method "extreme value theory" to determine the dynamic threshold. Therefore, the user needs to set the system sensitivity to 1% or 5%, and it will calculate the corresponding threshold based on historical data: the present invention first normalizes the training data, and then inputs the processed data into the time series autoencoder to update the model parameters, After the model has the ability to reconstruct normal database indicators, openGauss collects the reconstruction error and calculates the threshold.

2) If no abnormality is found, wait for a period of time (that is, the preset duration), and then repeat step 1); if abnormality is found, then execute step 3).

3) When the database history, current or future problems or potential problems are found, the self-diagnosis module is called to conduct root cause analysis.

4) The self-diagnosis module judges the found faults, and if there are indeed problems, it gives the root cause of the problem at the system level or SQL statement level. The specific process is as follows: The self-diagnosis function of the database can identify the root causes of faults or abnormalities at the system level and SQL statement level. Among them, the system-level fault analysis method is realized by the LSTM+KNN algorithm; the SQL statement-level fault root cause is realized by the Tree-LSTM algorithm. For locating the root cause of the fault, call the self-optimization function and give corresponding optimization suggestions to solve the problem.

5) Make suggestions to the database system through the root cause given by the self-diagnosis, and call the self-optimization module for tuning. The specific process is as follows:

a. The self-optimization module includes optimizing parameter configuration according to the characteristics of the database system. Parameter recommendation is realized through deep reinforcement learning: First, the embodiment of the present application models the database parameter configuration and its corresponding performance through historical learning, that is, searches for the parameter with the best performance in the search space composed of selected parameters Then, the deep reinforcement learning model takes the database state and load characteristics as the input state, and selects the appropriate parameter configuration as the output behavior according to the parameter tuning experience learned from historical data, thus giving the optimal database parameter optimization scheme.

b. The self-optimization module also includes tuning for database SQL statements, for example, materialized view recommendation and index recommendation. Among them, materialized view recommendation is implemented through RNN and reinforcement learning. By analyzing the user's load, enumeration and evaluation methods are used to recommend materialized views that can be created. Users can create such materialized views to achieve load acceleration. Index recommendation refers to the index recommendation for the load level, and provides the optimal index configuration scheme adapted to it according to the user's addition, deletion, query and modification ratio.

In order to have a more intuitive understanding of the beneficial effects brought by the embodiments of the present application, the technical effects brought by the embodiments of the present application are further compared below, and the comparison results are as follows:

(1) Comparison results between the optimizer 201 and three rewriting strategies (random rewriting, top-down rewriting, and heuristic rewriting).

Please refer to Figure 6 for details. Figure 6 is a schematic diagram of the comparison between the optimizer and the three rewriting strategies provided by the embodiment of the present application. Taking query rewriting as an example, the embodiment of the present application compares the query rewriting in openGauss with the three rewriting strategies Strategies (random rewriting, top-down rewriting, and heuristic rewriting) are compared. For random rewriting and top-down rewriting, the embodiment of the present application extracts 82 rewriting rules from the query optimization engine Calcite, and rewrites the query with corresponding strategies. In addition, the embodiment of the present application uses the tool SQL-smith to generate 15,750 and 10,673 slow queries (>1s) for TPC-H and JOB respectively. As shown in Figure 6, this rewriting strategy outperforms other methods in all cases, namely, the execution time of TPC-H is reduced by more than 49.7%, and that of JOB is reduced by more than 36.8%. The reason is mainly twofold: First, openGauss explores a rewrite order that is less expensive to execute than the default top-down order in PostgreSQL. For example, with outer joins, PostgreSQL cannot push down predicates to input tables, while openGauss solves the problem by first converting outer joins to inner joins and then pushing down predicates. Second, the estimation model in openGauss predicts potential cost reductions, from which openGauss chooses a rewrite order with lower execution overhead. In addition, openGauss works better on TPC-H than JOB, because TPC-H query contains many subqueries that can be optimized by query rewriting, while multi-joins in JOB query will be further optimized by plan enumerator.

To sum up, the optimizer 201 included in the database management system provided by the embodiment of the present application can perform fine-grained optimization during the execution of SQL statements according to the optimization method given by the AI model, improve the execution efficiency of SQL statements, and improve the performance of the database. Performance.

(2) The comparison result between the suggester 205 (if any) and two index strategies (default index, artificially designed index).

For details, please refer to Table 2 and Table 3. Table 2 and Table 3 are a schematic diagram of the comparison between the suggester provided by the embodiment of this application and the two index strategies. Taking index selection as an example, the present invention is based on TPC-H and TPC-C Experiments were conducted and the proposer of the embodiment of the application was compared with the default index and the artificially designed index. The results are shown in Table 2 and Table 3. The index selection algorithm of the present application outperforms the default and manual indexes on both workloads. This is because the index selection algorithm encodes system statistics into the state representation and is able to optimize the index selection strategy based on historical data in order to dynamically update the index configuration.

Table 2. Indexing program

the	TPC-H(s)TPC-H(s)	TPC-C(tpmC)TPC-C (tpmC)
数据库管理系统openGaussDatabase management system openGauss	122.9122.9	1020210202
数据库管理员DBAdatabase administrator DBA	130.1130.1	1000110001
默认设置default setting	140.8140.8	97009700

Table 3. Anomaly detection (TPC-C)

the	准确率Accuracy	召回率recall rate	F1分数F1 score
数据库管理系统openGaussDatabase management system openGauss	0.7950.795	0.7760.776	0.7850.785
变分自编码器VAEVariational Autoencoder VAE	0.3020.302	0.8210.821	0.4410.441
生成式对抗网络GANGenerative Adversarial Network GAN	0.5540.554	0.7450.745	0.6350.635

To sum up, the optimizer 205 (if any) included in the database management system provided by the embodiment of the present application can timely discover whether there are faults or abnormalities in the database history, current and future, and give corresponding solutions according to business types and characteristics. root causes of failures, and provide optimal optimization and configuration solutions.

(3) Comparison results between the model evaluator 204 and two known performance evaluation methods.

Please refer to Figure 7 for details. Figure 7 is a schematic diagram of the comparison between the model evaluator provided by the embodiment of the present application and the two known performance evaluation methods of BAL and DL. Among them, BAL estimates the average buffer access delay and uses linear regression to predict Query latency of concurrent queries; DL uses a neural network designed according to the query plan structure to predict the performance of a single query. The embodiment of this application compares the prediction accuracy and prediction time on JOB, and the results are shown in Figure 7. From the comparison results in Figure 7, it can be seen that the error rate of the model evaluator provided by the embodiment of this application is the lowest, which is about 29.9 times lower than that of BAL , 22.5 times lower than DL. There are two reasons: First, the workload graph in the model evaluator provided by the embodiment of this application encodes concurrency factors such as resource contention, which increases the query delay of JOB by more than 20% compared with serial execution. In contrast, BAL collects buffer access latencies, while DL relies on a single query function. Second, openGauss utilizes a graph embedding network to directly map structural information to performance factors, which can improve generality when workloads vary. Instead, BAL uses a linear regression approach that requires many statistical samples for a single workload. In addition, it can be seen from Fig. 7 that the prediction delay of the model evaluator provided by the embodiment of the present application is less than that of BAL and DL, and when the concurrency level increases, the prediction delay of the model evaluator provided by the embodiment of the present application is relatively stable. For openGauss, the model evaluator provided in the embodiment of this application simultaneously predicts the execution time of all vertices. It embeds the localized graph for all vertices in the workload graph, so the total prediction time for the workload is close to predicting the vertex with the largest localized graph. Whereas for BAL, it takes the longest time to predict because it predicts the performance while executing the workload. For DL, it propagates intermediate data features in the query plan tree in a bottom-up manner, which takes relatively longer time than openGauss.

To sum up, the model evaluator 204 included in the database management system provided by the embodiment of the present application can effectively and timely check whether the new model is valid, and if it is effective, it will be deployed; otherwise, the model update will be abandoned.

(4) The training data collector 202 and the model manager 203 do not exist in the known database management system, and these two modules included in the database management system constructed in this application can ensure the reliability of data processing.

Based on the description of the database management system constructed by the embodiment of the present application in Figures 1 to 7 above, it can be seen that the embodiment of the present application provides an autonomous database framework, which implements a self-learning kernel and a suggester based on machine learning algorithms and expert experience , to build a full range of autonomous functions of the database.

Specifically, the learning-type optimizer built into the database kernel in the database management system constructed in the embodiment of the present application may specifically include an MCTS-based rewriter, a Tree-LSTM-based cost estimator, and an RL-based plan generator. Realize the efficient query optimization of the optimizer to meet the demands of multi-scenario business; the database management system provided by the embodiment of the present invention may also include a learning-type suggester of the database, which is based on machine learning technology to realize automatic abnormal monitoring, automatic system diagnosis, automatic slow Query diagnosis and automatic performance optimization (such as parameter tuning, index recommendation, and view recommendation) functions, satisfy customers with one-click operation and maintenance management operations, improve operation and maintenance efficiency and database execution efficiency; the database management system provided by the embodiment of the present invention also provides It can include an efficient model evaluator, which is based on machine learning technology, estimates the performance of the model deployed in the database management system, judges the benefits brought by the application of the corresponding model, and ensures that the database management system is always running at high performance and high reliability; The database management system provided by the embodiment of the present invention may also include a training data collector and a model manager based on a unified interface. The training data collector automatically collects the running data of processes in the database, including database running indicators, query logs, system logs, etc. Use this information to generate training data for the model deployed in the database management system; the model manager provides a unified interface to manage and control the model version, and dynamically update and replace the models used by each module.

Based on the above-mentioned database management system, it can specifically solve the following technical problems existing in the existing traditional database technology:

1) Traditional database technologies have different modules and functions, and it is difficult to select a suitable automatic algorithm to maximize performance benefits. If you need to replace a database module that may have a performance bottleneck with a learning model, you first need to select an appropriate machine learning model or algorithm. For example, deep learning models can be applied to cost estimation problems, because deep learning can fit data characteristics and access correlations of high-dimensional data columns; deep reinforcement learning can be applied to parameter tuning problems, because deep reinforcement learning does not require prior Provide training data, which can efficiently explore continuous high-dimensional parameter spaces under the condition of a small number of samples. The embodiment of the present application provides a suitable machine learning model or algorithm for complex and changeable business scenarios, large concurrency and high-performance business through the performance evaluation of the model evaluator.

2) It is difficult to evaluate the validity of the model. Before designing a new learning model and deploying it to the database, it is necessary to confirm in advance whether the above model has performance improvement compared with traditional algorithms. Traditional methods often rely on expert experience or a large number of running tests, which have high overhead and low evaluation efficiency. In the embodiment of the present application, a model evaluator is designed to predict the performance of different learning models in typical application scenarios.

3) Each module of a traditional database is based on a classic heuristic or rule-defined algorithm. However, after replacing each module with a machine learning model, each machine learning model needs to collect data and train the model, and update the model when the scene changes. If the above operations are performed separately, the training and management costs of the model are very high. The embodiment of the present application provides a training data collector and a model manager with a unified interface to evaluate the availability of the model and automatically update the model according to changes in the collected information.

On the basis of the embodiments corresponding to Figures 1 to 5, the method for applying the database management system is introduced below, please refer to Figure 8 for details, Figure 8 is a schematic flow chart of the data processing method provided by the embodiment of the present application , the method may specifically include the following steps:

801. The computer device receives the SQL statement sent by the client to the database. The database is deployed in the computer device. The database includes an optimizer and native kernel components. The optimizer includes n models, where n≥1.

First, a local computer device (i.e. the local device described in Figure 1) receives the SQL statement sent by the client device to the database deployed in the computer device, wherein the database includes an optimizer and a database native kernel component, the The optimizer includes n models, n≥1.

802. When the first target model does not meet the second preset requirement, the computer device obtains the target physical plan through the n models included in the optimizer according to the SQL statement, and the target physical plan is that the execution cost meets the first preset requirement. Assuming a required physical plan, the first target model is one of n models.

After the computer device receives the SQL statement sent by the client device, it will first judge whether there are n models in the optimizer that do not meet a certain preset requirement (which can be called the second preset requirement). A target model (that is, one of the n models) does not meet the second preset requirement, then the computer device will obtain the target physical plan through the n models included in the optimizer according to the SQL statement, wherein the target physical plan To execute a physical plan whose cost meets the first preset requirement.

It should also be noted that, in some embodiments of the present application, the failure of the first target model to meet the second preset requirements may include but not limited to: 1) the performance of the first target model does not degrade; Although the performance of a target model has declined, the degree of decline has not reached a certain preset value (which can be called a second preset threshold); 3) Evaluate the real-time performance of the first target model and predict the next performance Performance, assuming that the predicted probability of performance degradation of the first target model does not reach a certain preset value (which may be referred to as the third preset threshold), such as the probability that the predicted performance will decline does not reach 80%; 4) the first The duration of continuous running of the target model has not reached a certain preset duration, for example, the duration of continuous running of the first target model has not reached 30 minutes.

It should also be noted that, in some embodiments of the present application, as an example, the optimizer may specifically include three models, which may be called model A, model B, and model C respectively, and are respectively used to perform logical query rewriting , cost estimation and physical plan generation steps. In this case, the way the computer device obtains the target physical plan through the n models included in the optimizer according to the SQL statement can specifically be: first, the computer device uses the model A to input the SQL statement of the database (also can be called SQL query) to rewrite the logical query to obtain the rewritten logical plan, wherein, the model A is a model based on a tree search algorithm, for example, it can be a Monte Carlo tree search algorithm; after that, through the model B, generate q physical plans according to the logical plan, where the model B is a model based on a deep learning algorithm, for example, it can be a model based on Tree-LSTM, q≥1; finally, the q physical plans are calculated by the model C The q execution overheads corresponding to the physical plan (one physical plan corresponds to one execution overhead), and the final target physical plan to be executed is determined according to the q execution overheads, where the model C is a model constructed based on a reinforcement learning algorithm, for example, It can be a model based on deep reinforcement learning (deep Q-learning, DQN).

It should be noted here that in other implementations of this application, the optimizer may also include more or fewer models to implement the process of logical query rewriting, cost estimation, and physical plan generation. In the example, the optimizer includes three models for illustration only, and details will not be described later.

It should also be noted that in the embodiment of the present application, the specific functions and calling logic of the optimizer involved in the computer device can refer to the part of the optimizer 201 described in the embodiment corresponding to Figure 2 above, and details will not be repeated here. .

803. The computer device executes the target physical plan.

After the computer equipment obtains the final target physical plan, it will execute the target physical plan. The essence of this execution process is to use the generated target physical plan to execute the actual logic of the input SQL statement.

In the above embodiments of the present application, it is explained how the computer device obtains the target physical plan based on the optimizer included in the database and finally executes the target physical plan. The database deployed in the computer device includes an optimizer with n models, Therefore, it replaces the traditional heuristic optimizer. By combining with machine learning technology, it realizes the conversion of logical queries into physical execution plans with higher execution efficiency, and can effectively solve the inaccurate and complex cost evaluation caused by the current database architecture problems. Problem with poor physical plan generation for SQL statements.

It should be noted that, in some embodiments of the present application, the computer device can also send the running data of the processes in the database deployed in it to the recommender, and the recommender can be deployed in the computer device or can not be deployed For the remote device, there is no limitation here. After receiving the operation data, the advisor can send abnormal data based on the operation data, and diagnose the cause of the abnormality based on the obtained abnormal data, and finally optimize the self-optimization module corresponding to the abnormal cause based on the abnormal cause, so as to reduce the subsequent process in the database. The probability of anomalies in the running data of , where the suggester includes p models, p≥1.

It should be noted that, in the embodiment of the present application, the specific functions and calling logic of the recommender involved in the embodiment of the present application can refer to the part of the recommender 205 described in the above-mentioned embodiment corresponding to FIG. 2 , and details will not be repeated here. .

It should also be noted that, in other embodiments of the present application, the computer device may also send the running data of the processes in the database deployed in it to the recommender, or send the running data to the training data collector Sending, the training data collector can be deployed in the computer device, and can not be deployed in the remote device, which is not limited here. After receiving the running data, the training data collector can obtain training data according to the running data, and construct m training sets based on these training data, where m≥1.

It should be noted that in the embodiment of the present application, the specific functions and calling logic of the training data collector involved in the embodiment of the present application can refer to the part of the training data collector 202 described in the embodiment corresponding to FIG. 2 above. I won't repeat them here.

It should also be noted that in the above-mentioned embodiments of the present application, the situation that the first target model does not meet the second preset requirements includes but is not limited to, on the contrary, the first target model meets the second preset requirements including but not limited to The limited situation is: 1) the performance of the first target model begins to decline; 2) the performance of the first target model not only declines, but the degree of decline reaches a certain preset value (which can be called the second preset value). threshold); 3) evaluate the real-time performance of the first target model and predict the next performance, for example, the performance of the first target model can be predicted by the model evaluator 204, assuming the predicted first The probability that the performance of the target model declines reaches a certain preset value (which can be referred to as the third preset threshold), such as the probability that the predicted performance will decline reaches 80%; 4) the duration of the first target model's continuous operation reaches a certain preset The duration is set, for example, the duration of the continuous operation of the first target model has reached 30 minutes.

Therefore, in some other embodiments of the present application, if the first target model satisfies the second preset requirement, the computer device may further submit a request to the model manager (the model manager may be deployed in the computer device, or may not deployed on the computer device, not limited here) to send a first instruction, the first instruction is used to instruct the model manager to fine-tune the first target model, and, when the performance of the second target model satisfies In the case of a certain preset requirement (which may be referred to as a third preset requirement), the computer device receives the model parameters of the second target model sent by the model manager, wherein the second target model is used by the model manager and the first target model A target training set corresponding to a target model is obtained by fine-tuning the first target model, the target training set is one of m training sets, and finally, the computer device updates the first target model to a second target model, And the target physical plan is obtained through the updated n models (at this time, the updated n models do not include the first target model, but include the second target model).

In the above embodiments of the present application, it is specifically explained that when the first target model of the computer device meets the second preset requirements, the model manager invokes the corresponding target training set in the training data collector to fine-tune the first target model , which can dynamically update and replace the corresponding model used in the database according to the real-time operation status of the database.

It should be noted that in the embodiment of the present application, the specific functions and calling logic of the model manager involved in the embodiment of the present application can refer to the part of the model manager 203 described in the embodiment corresponding to Figure 2 above. I will repeat.

It should also be noted that if the performance of the second target model does not meet the third preset requirement, then the computer device will also receive a second instruction sent by the model evaluator, and the second instruction is used to instruct the database to adopt The native kernel component in the database generates the final target physical plan, wherein, the model evaluator can be deployed in the computer device or a remote device, which is not limited here. The model evaluator is used to evaluate the performance of the second target model.

It should be noted that in the embodiment of the present application, the specific functions and calling logic of the model evaluator involved in the embodiment of the present application can refer to the part of the model evaluator 204 described in the embodiment corresponding to Figure 2 above. I will repeat.

On the basis of the embodiment corresponding to FIG. 8 , in order to better implement the above solution of the embodiment of the present application, the computer equipment for implementing the above solution is also provided below. Refer to FIG. 9 for details. FIG. 9 is a schematic diagram of a computer device provided by an embodiment of the present application. The computer device 900 may specifically include: a receiving module 901, a determining module 902, and an execution module 903, wherein the receiving module 901 is used to receive The SQL statement sent by the client to the database; the determination module 902 is used to obtain the target physical plan through the n models according to the SQL statement when the first target model does not meet the second preset requirement, and the target physical The plan is a physical plan whose execution cost meets a first preset requirement, and the first target model is one of the n models; the execution module 903 is configured to execute the target physical plan.

In a possible design, the n models include a third model, a fourth model, and a fifth model. In this case, the determination module 902 is specifically configured to: use the third model to execute the SQL statement Rewrite the logical query to obtain the rewritten logical plan. The third model is a model based on the tree search algorithm; through the fourth model, q physical plans are generated according to the logical plan. The fourth model is based on deep learning The model constructed by the algorithm, q≥1; through the fifth model, calculate the q execution costs corresponding to the q physical plans, and determine the target physical plan according to the q execution costs, one physical plan corresponds to one execution cost, The fifth model is a model constructed based on a reinforcement learning algorithm.

In a possible design, the computer device 900 further includes a sending module 904, configured to: send the operation data of the processes in the database to the advisor, so that the advisor finds abnormal data based on the operation data, And make the recommender diagnose the cause of the abnormality based on the abnormal data, and optimize the self-optimization module corresponding to the abnormal cause based on the abnormal cause, so as to reduce the probability of abnormal occurrence of the operating data, wherein the suggester includes p models , p≥1.

In a possible design, the sending module 904 can also be configured to: send the running data of the processes in the database to the training data collector, so that the training data collector can obtain training data according to the running data, and based on The training data constructs m training sets, m≥1.

In a possible design, the sending module 904 may also be configured to: when the first target model meets the second preset requirement, send a first instruction to the model manager, where the first instruction is used to instructing the model manager to fine-tune the first target model, where the first target model is one of the n models; the receiving module 901 may also be used for: when the performance of the second target model satisfies a third preset If required, receive the model parameters of the second target model sent by the model manager, the second target model is the first target model that the model manager uses the target training set corresponding to the first target model The model obtained by fine-tuning, the target training set is one of the m training sets; the determination module 902 can also be used to: update the first target model to the second target model, and pass the updated n models Get a physics plan for that goal.

In a possible design, the receiving module 901 may also be configured to: when the performance of the second target model does not meet the third preset requirement, receive the second instruction sent by the model evaluator, the first The second instruction is used to instruct the database to use native kernel components in the database to generate the target physical plan, and the model evaluator is used to evaluate the performance of the second target model.

In a possible design, the meeting of the second preset requirement by the first target model includes at least any one of the following: the performance of the first target model is degraded; or, the degree of performance degradation of the first target model reaches the second Two preset thresholds, or, the predicted probability of performance degradation of the first target model reaches a third preset threshold; or, the duration of continuous operation of the first target model satisfies a preset duration.

In a possible design, the performance of the second target model meeting the third preset requirement includes at least any one of the following: the performance of the second target model is improved by a fourth preset compared with the performance of the first target model threshold; or, the performance of the second target model is improved by a fifth preset threshold compared with the performance of the native kernel components in the database.

In a possible design, the execution overhead meeting the first preset requirement includes at least any one of the following: the execution overhead of the target physical plan is the lowest among the q execution overheads, and the q execution overheads are based on the SQL statement Each of the generated q physical plans corresponds to an execution cost, one physical plan corresponds to one execution cost, q≥1; or, the execution cost of the target physical plan is lower than a first preset threshold.

It should be noted that the information interaction and execution process among the various modules/units in the computer device 900 are based on the same idea as the method embodiment corresponding to Figure 8 in this application, and the specific content can refer to the method shown above in this application The narration in the embodiment will not be repeated here.

Next, another computer device provided by the embodiment of the present application is introduced. Please refer to FIG. 10. FIG. 10 is a schematic structural diagram of the computer device provided by the embodiment of the present application. The computer device 1000 can be deployed with The described computer device 900 is used to realize the functions of the computer device 900 in the embodiment corresponding to FIG. Differences may include one or more central processing units (central processing units, CPU) 1022 and memory 1032, one or more storage media 1030 (such as one or more mass storage devices) for storing application programs 1042 or data 1044. Wherein, the memory 1032 and the storage medium 1030 may be temporary storage or persistent storage. The program stored in the storage medium 1030 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the computer device 1000 . Furthermore, the central processing unit 1022 may be configured to communicate with the storage medium 1030 , and execute a series of instruction operations in the storage medium 1030 on the computer device 1000 .

Computer device 1000 can also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input and output interfaces 1058, and/or, one or more operating systems 1041, such as Windows Server™, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.

In the embodiment of the present application, the central processing unit 1022 is configured to execute the steps performed by the computer device in the embodiment corresponding to FIG. 8 . For example, the central processing unit 1022 can be used to: firstly, receive the SQL statement sent by the client device to the database deployed in the computer device, wherein the database includes an optimizer and database native kernel components, and the optimizer includes n model, n≥1. After receiving the SQL statement sent by the client device, it will first judge whether there are n models in the optimizer that do not meet a preset requirement (which can be called the second preset requirement). If there is a first goal If the model (that is, one of the n models) does not meet the second preset requirement, then according to the SQL statement, the target physical plan is obtained through the n models included in the optimizer, wherein the target physical plan is that the execution cost meets the first A physical plan of preset requirements. After the final target physical plan is obtained, the target physical plan will be executed. The essence of this execution process is to use the generated target physical plan to execute the actual logic of the input SQL statement.

It should be noted that the specific way for the central processing unit 1022 to execute the above-mentioned steps is based on the same concept as the method embodiment corresponding to FIG. Refer to the descriptions in the foregoing method embodiments of the present application, and details are not repeated here.

In addition, it should be noted that the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, and the components shown as units may or may not be A physical unit can be located in one place, or it can be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided in the present application, the connection relationship between the modules indicates that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.

Through the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus necessary general-purpose hardware, and of course it can also be realized by special hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, Special components, etc. to achieve. In general, all functions completed by computer programs can be easily realized by corresponding hardware, and the specific hardware structure used to realize the same function can also be various, such as analog circuits, digital circuits or special-purpose circuit etc. However, for this application, software program implementation is a better implementation mode in most cases. Based on this understanding, the essence of the technical solution of this application or the part that contributes to the prior art can be embodied in the form of a software product, and the computer software product is stored in a readable storage medium, such as a floppy disk of a computer , U disk, mobile hard disk, read-only memory (read-only memory, ROM), random access memory (random access memory, RAM), magnetic disk or optical disk, etc., including several instructions to make a computer device (which can be A personal computer, a training device, or a network device, etc.) executes the methods described in various embodiments of the present application.

In the above embodiments, all or part of them may be implemented by software, hardware, firmware or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part. The computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transferred from a website, computer, training device, or data The center transmits to another website site, computer, training device or data center via wired (eg, coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (eg, infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a training device or a data center integrated with one or more available media. The available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (solid state disk, SSD)), etc.

Claims

A database management system, characterized in that it comprises:

An optimizer, a training data collector, a model manager, and a model evaluator, wherein the optimizer includes n models, n≥1;

The optimizer is configured to obtain a target physical plan through the n models according to the SQL statement input into the database, and the target physical plan is a physical plan whose execution cost meets a first preset requirement;

The training data collector is used to obtain training data according to the running data of the process in the database, and construct m training sets based on the training data, where m≥1;

The model manager is configured to use the target training set corresponding to the first target model to fine-tune the first target model to obtain a second target when the first target model satisfies a second preset requirement model, the first target model is one of the n models, and the target training set is one of the m training sets;

The model evaluator is configured to evaluate the performance of the second target model, and update the first target model to the first target model when the performance of the second target model meets a third preset requirement. Two-objective model.
The system according to claim 1, further comprising:

a suggester, the suggester comprising p models, p≥1;

The advisor is used to discover abnormal data in the operating data, and diagnose the cause of the abnormality based on the abnormal data;

The suggester is further configured to optimize an optimization module corresponding to the abnormal cause based on the abnormal cause, so as to reduce the probability of abnormal occurrence of the operating data.
The system according to claim 2, wherein the p models include a codec, a first model, and a second model, and the suggester is specifically used for:

Encoding and decoding the operating data by the codec to obtain encoded data, and comparing the encoded data with the operating data to obtain the abnormal data;

Through the first model, when the operation data belongs to the system index data, the cause of the abnormality is obtained according to the diagnosis of the abnormal data, and the first model is a model constructed based on a deep learning algorithm;

Through the second model, when the operation data belongs to the query index data, the cause of the abnormality is diagnosed and obtained according to the abnormal data, and the second model is a model constructed based on a deep learning algorithm.
The system according to claim 3, characterized in that,

The first model includes a long short-term memory model LSTM and a classifier;

The second model includes a tree-length short-term memory model Tree-LSTM and a softmax function.
The system according to any one of claims 2-4, wherein the first object model further comprises:

Any one of the p models.
The system according to any one of claims 1-5, wherein the n models include a third model, a fourth model, and a fifth model, and the optimizer is specifically used for:

Through the third model, rewrite the logical query of the SQL statement input to the database to obtain a rewritten logical plan, and the third model is a model constructed based on a tree search algorithm;

Through the fourth model, q physical plans are generated according to the logical plan, and the fourth model is a model built based on a deep learning algorithm, and q≥1;

Through the fifth model, calculate q execution costs corresponding to the q physical plans, and determine the target physical plan according to the q execution costs, one physical plan corresponds to one execution cost, the fifth model For models built based on reinforcement learning algorithms.
The system according to any one of claims 1-6, wherein the model evaluator is also used for:

In the case that the performance of the second target model does not meet the third preset requirement, the database is triggered to generate a target physical plan using native kernel components in the database.
The system according to claim 7, wherein the model evaluator is also used for:

Feedback information about model update failure to the model manager, so that the model manager adjusts a strategy for fine-tuning the first target model based on the information, and the model update failure is the second target model The performance of does not meet the third preset requirement.
The system according to any one of claims 1-8, wherein the execution cost meeting the first preset requirement includes at least any of the following:

The execution overhead of the target physical plan is the lowest among the q execution overheads. The q execution overheads are the corresponding execution overheads of the q physical plans generated based on the SQL statement. One physical plan corresponds to one execution overhead, and q ≥1;

or,

The execution cost of the target physical plan is lower than a first preset threshold.
The system according to any one of claims 1-9, wherein the first object model meeting the second preset requirement includes at least any one of the following:

The performance of the first target model is degraded;

or,

The degree of performance degradation of the first target model reaches a second preset threshold;

or,

The predicted probability of performance degradation of the first target model reaches a third preset threshold;

or,

The duration of continuous operation of the first target model reaches a preset duration.
The system according to any one of claims 1-10, wherein the performance of the second target model meeting the third preset requirement includes at least any of the following:

The performance of the second target model is increased by a fourth preset threshold compared with the performance of the first target model;

or,

The performance of the second target model is increased by a fifth preset threshold compared with the performance of the native kernel components in the database.
A data processing method, characterized in that the method is applied to computer equipment, the computer equipment is deployed with a database, the database includes an optimizer and native kernel components, and the optimizer includes n models, n≥1 , the method includes:

The computer device receives the SQL statement sent by the client to the database;

In the case that the first target model does not meet the second preset requirement, the computer device obtains a target physical plan through the n models according to the SQL statement, and the target physical plan is that the execution cost meets the first preset The required physical plan, the first target model is one of the n models;

The computer device executes the target physical plan.
The method according to claim 12, wherein the n models include a third model, a fourth model, and a fifth model, and the computer device obtains the target physical model through the n models according to the SQL statement. Programs include:

The computer device rewrites the logical query of the SQL statement through the third model to obtain a rewritten logical plan, and the third model is a model constructed based on a tree search algorithm;

The computer device generates q physical plans according to the logical plan through the fourth model, the fourth model is a model constructed based on a deep learning algorithm, and q≥1;

The computer device calculates q execution overheads corresponding to the q physical plans through the fifth model, and determines the target physical plan according to the q execution overheads, one physical plan corresponds to one execution overhead, so The fifth model mentioned above is a model constructed based on a reinforcement learning algorithm.
The method according to any one of claims 12-13, wherein the method further comprises:

The computer device sends the operation data of the processes in the database to the advisor, so that the advisor finds abnormal data based on the operation data, and enables the advisor to diagnose the cause of the abnormality based on the abnormal data, and The self-optimization module corresponding to the abnormal reason is optimized based on the abnormal reason, so as to reduce the probability of abnormal occurrence of the operation data, wherein the suggester includes p models, p≧1.
The method according to any one of claims 12-14, wherein the method further comprises:

The computer device sends the operation data of the processes in the database to the training data collector, so that the training data collector obtains training data according to the operation data, and constructs m training sets based on the training data, m ≥1.
The method according to claim 15, further comprising:

When the first target model satisfies the second preset requirement, the computer device sends a first instruction to the model manager, and the first instruction is used to instruct the model manager to The target model is fine-tuned, and the first target model is one of the n models;

When the performance of the second target model meets the third preset requirement, the computer device receives the model parameters of the second target model sent by the model manager, the second target model is the model management A model obtained by fine-tuning the first target model using a target training set corresponding to the first target model, where the target training set is one of the m training sets;

The computer device updates the first target model to the second target model, and obtains the target physical plan through the updated n models.
The method according to claim 16, further comprising:

When the performance of the second target model does not meet the third preset requirement, the computer device receives a second instruction sent by the model evaluator, and the second instruction is used to instruct the database to adopt the The in-database native kernel component generates the target physical plan, and the model evaluator is used to evaluate the performance of the second target model.
The method according to any one of claims 16-17, wherein the first target model meeting the second preset requirement at least includes any of the following:

The performance of the first target model is degraded;

or,

The degree of performance degradation of the first target model reaches a second preset threshold;

or,

The predicted probability of performance degradation of the first target model reaches a third preset threshold;

or,

The duration for which the first target model continues to run satisfies a preset duration.
The method according to any one of claims 16-18, wherein the performance of the second target model meeting the third preset requirement includes at least any of the following:

The performance of the second target model is increased by a fourth preset threshold compared with the performance of the first target model;

or,

The performance of the second target model is increased by a fifth preset threshold compared with the performance of the native kernel components in the database.
The method according to any one of claims 12-19, wherein the execution cost meeting the first preset requirement includes at least any of the following:

The execution overhead of the target physical plan is the lowest among the q execution overheads. The q execution overheads are the corresponding execution overheads of the q physical plans generated based on the SQL statement. One physical plan corresponds to one execution overhead, and q ≥1;

or,

The execution cost of the target physical plan is lower than a first preset threshold.
A computer device, the device has the function of implementing the method according to any one of claims 12-20, the function is implemented by hardware or by executing corresponding software through hardware, and the hardware or the software includes one or more A module corresponding to the described function.
A computer device comprising a processor and a memory, the processor being coupled to the memory, characterized in that,

The memory is used to store programs;

The processor is configured to execute the program in the memory, so that the computer device executes the method according to any one of claims 12-20.
A computer-readable storage medium, including a program, which, when run on a computer, causes the computer to execute the method according to any one of claims 12-20.
A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method as claimed in any one of claims 12-20.
A chip, the chip includes a processor and a data interface, the processor reads instructions stored on the memory through the data interface, and executes the method according to any one of claims 12-20.