WO2023016537A1 - Database management system, data processing method and device - Google Patents

Database management system, data processing method and device

Info

Publication number
WO2023016537A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
target
database
performance
data
Prior art date
Application number
PCT/CN2022/111991
Other languages
English (en)
Chinese (zh)
Inventor
李国良
李士福
周煊赫
王天庆
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
清华大学 (Tsinghua University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.) and 清华大学 (Tsinghua University)
Publication of WO2023016537A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/30 - Monitoring
    • G06F 11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 - Design, administration or maintenance of databases
    • G06F 16/24 - Querying
    • G06F 16/245 - Query processing
    • G06F 16/2453 - Query optimisation
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 - Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045 - Combinations of networks
    • G06N 3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G06N 3/047 - Probabilistic or stochastic networks
    • G06N 3/08 - Learning methods
    • G06N 3/092 - Reinforcement learning

Definitions

  • The present application relates to the field of database management, and in particular to a database management system, a data processing method and a device.
  • A database is a "warehouse that organizes, stores and manages data according to the data structure": an organized, shareable collection of a large amount of data stored long-term in a computer under unified management. Databases are critical to the efficient operation of modern businesses.
  • In the prior art, database management depends mainly on the database administrator (DBA). The DBA often needs to spend considerable time and effort managing and maintaining the database manually, which is error-prone and affects the uptime, performance and security of the database, sometimes with catastrophic effects. For example, failure to properly patch and install security updates in a timely manner can lead to database vulnerabilities, weakening or even completely disabling database protection measures, which in turn exposes enterprises to serious data-breach risks, severe financial impact and loss of goodwill.
  • This traditional database management approach relies on expert-experience-based optimization techniques, such as cost estimation, join order selection and parameter tuning, which can no longer meet the requirements of multi-scenario business, massive applications and extreme performance.
  • The embodiments of the present application provide a database management system, a data processing method and a device which, by combining with machine learning technology, can automatically perform database optimization, protection, updating and other routine database management tasks traditionally performed by DBAs, without human intervention.
  • A first aspect of the embodiments of the present application provides a database management system, which may specifically include: a self-learning optimizer (it should be noted that some kernel components included in the optimizer provided by the present application may be native kernel components, but the n models included in the optimizer are newly added by the present application, n ≥ 1), a training data collector, a model manager and a model evaluator. The optimizer is used to obtain, according to an SQL statement input into the database, the final physical plan to be executed (which may be called the target physical plan) through the n models, where the target physical plan is a physical plan whose execution cost meets a certain preset requirement (which may be called the first preset requirement).
  • The training data collector is used to obtain training data from the running data of processes in the database (such as database operation metrics, query logs, system logs, etc.), and to construct m training sets from the obtained training data, where m ≥ 1.
  • The model manager is configured to fine-tune the first target model using the target training set corresponding to the first target model when the first target model satisfies a certain preset requirement (which may be referred to as the second preset requirement), thereby obtaining a second target model (the second target model is essentially the first target model with updated model parameters).
  • The first target model is one of the n models, and the target training set is one of the m training sets.
  • The model parameters of the fine-tuned second target model can be passed to the model evaluator to evaluate model performance.
  • The model evaluator is used to evaluate the performance of the obtained second target model, and, when the performance of the second target model satisfies a certain preset requirement (which may be called the third preset requirement), the first target model is updated to the second target model.
  • The update process may specifically be: the model evaluator sends the model parameters of the second target model to the optimizer, and the optimizer assigns the received updated model parameters to the first target model, thereby obtaining the second target model.
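The parameter handoff in this update process can be sketched as follows. This is an illustrative stand-in, not the patent's implementation: the class and method names are assumptions, and plain dictionaries stand in for real model parameters.

```python
# Sketch of the update process: the model evaluator sends the fine-tuned
# parameters to the optimizer, which assigns them to the first target model
# in place, turning it into the second target model.

class Optimizer:
    def __init__(self, models):
        self.models = models                    # model name -> parameter dict

    def assign_parameters(self, name, params):
        # Overwrite the first target model's parameters with the updated ones.
        self.models[name].update(params)

class ModelEvaluator:
    def push_update(self, optimizer, name, params):
        # In the patent, this happens only after the second target model's
        # performance has been verified against the third preset requirement.
        optimizer.assign_parameters(name, params)

opt = Optimizer({"cost_model": {"w": 1.0, "b": 0.0}})
ModelEvaluator().push_update(opt, "cost_model", {"w": 1.3})
print(opt.models["cost_model"])   # {'w': 1.3, 'b': 0.0}
```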
  • The model evaluator may be a graph-embedding-based performance prediction model.
  • As can be seen from the above, the database management system includes an optimizer with n models, a training data collector, a model manager and a model evaluator. The optimizer with n models replaces the traditional heuristic optimizer;
  • the training data collector can generate training data for the models involved in the database based on the running data of processes in the database, enabling continuous optimization of the database and credible autonomous operation-and-maintenance services;
  • the model manager can call the corresponding target training set in the training data collector to fine-tune the first target model, so as to dynamically update and replace the corresponding model used in the database according to the real-time running state of the database;
  • the model evaluator provides a secondary verification capability to ensure that the best valid model is provided.
  • The database management system constructed in the embodiments of the present application can, by combining with machine learning technology, automatically perform database tuning, protection, updating and other routine database management tasks traditionally performed by DBAs, without human intervention.
  • In a possible design, the database management system may further include a suggester, which may include p models, where p ≥ 1. The suggester is used to discover abnormal conditions in the running data of processes in the database (such as CPU utilization, user response time, etc.), that is, to find abnormal data; to diagnose the cause of the abnormality based on the obtained abnormal data; and then, based on the obtained abnormal cause, to optimize the optimization module corresponding to that cause (the optimization module is also located in the suggester, and its function is to optimize the parameters of the database), so as to reduce the probability that the running data of processes in the database becomes abnormal.
  • In the constructed database management system, a suggester is newly added based on machine learning methods to realize the functions of self-monitoring, self-diagnosis and self-optimization, so as to optimize the database automatically and intelligently.
  • This solves the problem that database monitoring, configuration, diagnosis and optimization methods in the prior art (such as parameter tuning, slow-SQL diagnosis, and index/view advisors) rely on DBAs, are costly, and cannot scale to large numbers of instances (such as cloud databases).
  • In a possible design, the suggester may specifically include three models, which may be referred to respectively as a codec, a first model and a second model, used respectively to perform the three steps of self-monitoring, self-diagnosis and self-optimization.
  • Specifically, the suggester encodes and then decodes the running data of the processes in the database through the codec to obtain reconstructed data, and compares the reconstructed data with the running data input into the codec to obtain the abnormal data.
  • The principle of using the codec to obtain abnormal data is that the codec can restore normal original data but not abnormal original data: the input original data is encoded and then decoded to obtain reconstructed data, and comparing the reconstructed data with the original data reveals whether abnormal data exists.
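A minimal sketch of this principle, assuming a toy "encoder/decoder": normal values round-trip with negligible reconstruction error, while abnormal values do not. A real codec would be a learned encoder-decoder network; the statistics-based stand-in below (and every name in it) is an assumption for illustration only.

```python
# Toy codec: encode compresses a metric into a bounded latent value, decode
# maps it back. Values inside the normal range survive the round trip; values
# outside it lose information, so their reconstruction error is large.

def fit_codec(normal_history):
    mean = sum(normal_history) / len(normal_history)
    spread = max(abs(x - mean) for x in normal_history) or 1.0
    return mean, spread

def encode(x, mean, spread):
    z = (x - mean) / spread
    return max(-1.0, min(1.0, z))          # clip: out-of-range inputs are not restorable

def decode(z, mean, spread):
    return z * spread + mean

def find_anomalies(values, mean, spread, tol=1e-6):
    # Flag points whose reconstruction error exceeds the tolerance.
    return [x for x in values
            if abs(x - decode(encode(x, mean, spread), mean, spread)) > tol]

normal_cpu = [40.0, 45.0, 50.0, 55.0, 60.0]   # normal CPU-utilization samples
mean, spread = fit_codec(normal_cpu)
observed = [52.0, 48.0, 95.0]                 # 95.0 lies outside the normal range
print(find_anomalies(observed, mean, spread)) # [95.0]
```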
  • The suggester can further use the first model to diagnose the cause of the abnormality from the abnormal data, where the first model is a model constructed based on a deep learning algorithm; if the running data belongs to query metric data (such as average latency), the suggester can further diagnose the cause of the abnormality from the abnormal data through the second model, where the second model is also a model constructed based on a deep learning algorithm.
  • It should be noted that the suggester may also include more or fewer models for realizing the self-monitoring and self-diagnosis of the database; in the embodiments of this application, the suggester including three models is for illustration only.
  • The suggester implements self-monitoring and self-diagnosis of the database through the models it contains, achieving the goal of the database management system managing the running data of the database automatically and intelligently.
  • In a possible design, the first model may include an LSTM model and a classifier.
  • Specifically, the suggester encodes the discovered abnormal data into a compressed vector (i.e., a dimension-reduced/dimension-raised vector), and then uses a learned classifier (such as a binary or multi-class classifier) to infer the corresponding root cause (such as a database backup operation);
  • the second model may include a Tree-LSTM model and a softmax function. Specifically, the suggester encodes the slow query (that is, a query whose execution time is long) by calling the Tree-LSTM model, locates the physical operator (that is, the execution operator) operation that causes the exception, and then uses the softmax function to identify the root cause of the exception.
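The final softmax step of this self-diagnosis can be sketched as follows: given per-operator anomaly scores (which the patent obtains from a Tree-LSTM encoding of the slow query), a softmax turns the scores into a probability distribution over candidate root causes. The scores and operator names below are made up for illustration; no real encoder is involved.

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

operators = ["SeqScan", "HashJoin", "Sort"]
scores = [0.5, 3.2, 1.1]                 # assumed output of a Tree-LSTM encoder
probs = softmax(scores)
root_cause = operators[probs.index(max(probs))]
print(root_cause)                        # HashJoin
```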
  • In a possible design, the above-mentioned first target model can be not only any one of the n models in the optimizer, but also any one of the p models in the suggester.
  • The first target model can thus be any model in the optimizer or the suggester, which expands the selection range of the first target model, so that the model evaluator can subsequently evaluate not only the performance of the models in the optimizer but also the performance of the models in the suggester, and the scheme is feasible.
  • In a possible design, the optimizer may specifically include three models, which may be referred to as the third model, the fourth model and the fifth model, used respectively to perform the three steps of logical query rewriting, cost estimation and physical plan generation.
  • Specifically, the optimizer rewrites the logical query of the SQL statement (also referred to as the SQL query) input into the database through the third model to obtain a rewritten logical plan, where the third model is a model constructed based on a tree search algorithm, for example a Monte Carlo tree search algorithm; afterwards, the optimizer generates q physical plans from the logical plan through the fourth model, where the fourth model is a model constructed based on a deep learning algorithm, for example a Tree-LSTM-based model, q ≥ 1; finally, the optimizer calculates, through the fifth model, the q execution costs corresponding to the q physical plans (one physical plan corresponds to one execution cost) and determines the final target physical plan to be executed according to the q execution costs, where the fifth model is a model constructed based on a reinforcement learning algorithm, for example a deep Q-learning (DQN) model.
  • The optimizer implements logical query rewriting, cost estimation and physical plan generation for the database through the models it contains, thereby replacing the traditional heuristic optimizer.
  • By combining with machine learning technology, logical queries are converted into physical execution plans with higher execution efficiency, which can effectively solve the problems of inaccurate cost estimation caused by current database architectures and of poor physical plans generated for complex SQL statements.
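The three-step optimizer described above (rewrite, candidate physical plans, cost-based selection) can be sketched end to end. Each stage is a trivial stand-in: the patent uses a tree-search model, a Tree-LSTM model and a DQN-based model for these steps, and every function name, rewrite rule and cost value below is an assumption for illustration.

```python
def rewrite(logical_query):
    # Stand-in for the third model: a single rewrite rule (assumes duplicates
    # are impossible for this query, so DISTINCT can be dropped).
    return logical_query.replace("SELECT DISTINCT", "SELECT")

def generate_plans(logical_plan):
    # Stand-in for the fourth model: enumerate q candidate physical plans.
    return [(logical_plan, join) for join in ("nested_loop", "hash_join", "merge_join")]

def estimate_cost(plan):
    # Stand-in for the fifth model: fixed per-operator costs.
    return {"nested_loop": 90.0, "hash_join": 12.0, "merge_join": 25.0}[plan[1]]

def pick_target_plan(sql):
    plans = generate_plans(rewrite(sql))
    costs = [estimate_cost(p) for p in plans]
    return plans[costs.index(min(costs))]   # first preset requirement: lowest cost

print(pick_target_plan("SELECT DISTINCT a FROM t JOIN s ON t.k = s.k")[1])  # hash_join
```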
  • In a possible design, since the native kernel components of the database are not deleted but coexist with the newly added optimizer in the database software, during the running of the database, if the performance of the second target model does not meet the third preset requirement, the model evaluator will also trigger the database to use the native kernel components in the database to generate the final target physical plan to be executed (because in this case the native kernel components perform better). For example, the index selection module enables the traditional hill-climbing algorithm to create new indexes to execute the logic of the SQL statement. That is to say, if the performance of the fine-tuned second target model still does not meet the requirements, the traditional kernel component algorithm of the database is used to generate the target physical plan. In this way, the target physical plan is generated based on a real-time, dynamically adjusted evaluation of the current performance of the second target model, improving the performance of the database as a whole.
  • In a possible design, in addition to triggering the database to use its native kernel components to generate the final target physical plan to be executed, the model evaluator may also feed back the information that the model update failed (that is, that the performance of the second target model does not meet the third preset requirement) to the model manager, so that the model manager adjusts the fine-tuning strategy for the first target model based on this information.
  • The model evaluator can thus provide a reference for subsequent model training strategies, thereby improving model training capability and training efficiency.
  • In a possible design, the execution cost meeting the first preset requirement includes but is not limited to:
  • the execution cost of the target physical plan is the lowest among the q execution costs, where the q execution costs are the execution costs corresponding to the q physical plans generated from the SQL statement input into the database, one physical plan corresponding to one execution cost, q ≥ 1;
  • or, the execution cost of the target physical plan is lower than a certain preset value (which may be referred to as the first preset threshold).
  • The following embodiments take the case in which the execution cost of the target physical plan is the lowest among the q execution costs as the case in which the execution cost meets the first preset requirement; details will not be repeated later.
  • In a possible design, the first target model meeting the second preset requirement includes but is not limited to:
  • the performance of the first target model can be predicted by the model evaluator, and the probability that the predicted performance of the first target model will decline reaches a certain preset value (which may be referred to as the third preset threshold), for example, the probability that the performance will decline reaches 80%;
  • or, the continuous running time of the first target model reaches a preset duration, for example, the continuous running time of the first target model reaches 30 minutes.
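The two fine-tune triggers above can be sketched as a single predicate. The 80% probability and 30-minute window are the examples given in the text; the function and parameter names are assumptions.

```python
def needs_finetune(decline_probability, runtime_minutes,
                   prob_threshold=0.8, max_runtime_minutes=30):
    # Second preset requirement: either the evaluator predicts a likely
    # performance decline, or the model has run continuously for too long.
    return (decline_probability >= prob_threshold
            or runtime_minutes >= max_runtime_minutes)

print(needs_finetune(0.85, 10))   # True  (predicted decline probability too high)
print(needs_finetune(0.10, 45))   # True  (continuous running time exceeded)
print(needs_finetune(0.10, 10))   # False
```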
  • In a possible design, the second target model meeting the third preset requirement may include but is not limited to:
  • compared with the performance of the first target model, the performance of the second target model is improved by a certain preset value (which may be called the fourth preset threshold). For example, the fourth preset threshold can be zero, which means that as long as the performance of the second target model reaches the performance level of the original first target model, the second target model is considered to meet the third preset requirement; as another example, the fourth preset threshold can also be a certain value or a certain ratio greater than zero, indicating that only when the performance of the second target model is improved to a certain extent compared with the performance of the original first target model is the second target model considered to meet the third preset requirement;
  • or, the performance of the second target model is improved by a fifth preset threshold compared with the performance of the native kernel components in the database; that is, the performance improvement of the second target model over the traditional database algorithm is verified. If the performance improvement reaches a certain threshold, the model used by the corresponding module of the database is actually replaced; otherwise, the traditional database algorithm is still used to generate the target physical plan.
  • The value of the fifth preset threshold may be zero, or a certain value or ratio greater than zero; for details, refer to the first condition above, which will not be repeated here.
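The acceptance criteria above can be sketched as a small check. The text presents the two conditions as alternative designs, so the sketch accepts the update if either holds; thresholds of zero mean "at least as good". All names and performance numbers are assumptions for illustration.

```python
def accept_update(new_perf, old_perf, native_perf,
                  fourth_threshold=0.0, fifth_threshold=0.0):
    # Third preset requirement: improvement over the old model by the fourth
    # preset threshold, or over the native kernel component by the fifth.
    beats_old_model = new_perf - old_perf >= fourth_threshold
    beats_native = new_perf - native_perf >= fifth_threshold
    return beats_old_model or beats_native

# Fine-tuned model matches the old model and beats the native component: accepted.
print(accept_update(new_perf=0.90, old_perf=0.90, native_perf=0.80))   # True
# Fine-tuned model regressed below both baselines: fall back to the native kernel.
print(accept_update(new_perf=0.70, old_perf=0.90, native_perf=0.80))   # False
```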
  • A second aspect of the embodiments of the present application provides a data processing method. The method includes: first, a local computer device receives an SQL statement sent by a client device to a database deployed in the computer device, where the database includes an optimizer and native kernel components of the database, and the optimizer includes n models, n ≥ 1. After the computer device receives the SQL statement sent by the client device, it first judges whether, among the n models in the optimizer, there is a first target model (that is, one of the n models) that does not meet a certain preset requirement (which may be called the second preset requirement).
  • The computer device obtains the target physical plan through the n models included in the optimizer according to the SQL statement, where the target physical plan is a physical plan whose execution cost meets the first preset requirement.
  • After the computer device obtains the final target physical plan, it executes the target physical plan; the essence of this execution process is to use the generated target physical plan to execute the actual logic of the input SQL statement.
  • In the above embodiment, the computer device obtains the target physical plan based on the optimizer included in the database and finally executes the target physical plan. The database deployed in the computer device includes an optimizer with n models, which replaces the traditional heuristic optimizer and, by combining with machine learning technology, converts logical queries into physical execution plans with higher execution efficiency.
  • In a possible design, the optimizer may specifically include three models, which may be referred to as the third model, the fourth model and the fifth model, used respectively to perform the steps of logical query rewriting, cost estimation and physical plan generation.
  • Specifically, the way the computer device obtains the target physical plan through the n models included in the optimizer according to the SQL statement can be as follows: first, the computer device rewrites the logical query of the input SQL statement (also referred to as the SQL query) through the third model to obtain a rewritten logical plan, where the third model is a model constructed based on a tree search algorithm, for example a Monte Carlo tree search algorithm; afterwards, q physical plans are generated from the logical plan through the fourth model, where the fourth model is a model constructed based on a deep learning algorithm, for example a Tree-LSTM-based model, q ≥ 1; finally, the q execution costs corresponding to the q physical plans are calculated through the fifth model (one physical plan corresponds to one execution cost), and the final target physical plan to be executed is determined according to the q execution costs.
  • The optimizer implements logical query rewriting, cost estimation and physical plan generation for the database through the models it contains, thereby replacing the traditional heuristic optimizer.
  • By combining with machine learning technology, logical queries are converted into physical execution plans with higher execution efficiency, which can effectively solve the problems of inaccurate cost estimation caused by current database architectures and of poor physical plans generated for complex SQL statements.
  • In a possible design, the computer device may also send the running data of the processes in the database deployed in it to the suggester, where the suggester may be deployed in the computer device or in a remote device, which is not limited here. The suggester can find abnormal data from the running data, diagnose the cause of the abnormality based on the obtained abnormal data, and finally optimize the optimization module corresponding to the abnormal cause based on that cause, so as to reduce the probability that the running data of processes in the database subsequently becomes abnormal.
  • In the above embodiment, the computer device can also feed back the running data of the processes in the database to the suggester, and the suggester can give all-round optimization suggestions for the database based on the running data, realizing unattended database performance monitoring and root cause identification, greatly freeing operation-and-maintenance manpower, and helping the database system quickly recover from abnormalities or improve performance.
  • In a possible design, in addition to sending the running data of the processes in the database deployed in it to the suggester, the computer device may also send the running data to a training data collector, where the training data collector can be deployed in the computer device or in a remote device, which is not limited here. After receiving the running data, the training data collector can obtain training data from the running data and construct m training sets from these training data, where m ≥ 1.
  • In the above embodiment, the computer device can also feed back the running data of the processes in the database to the training data collector, and the training data collector can generate training data for the models involved in the database based on this running data, enabling continuous optimization of the database system, reducing the probability of misjudgment by the database system, and providing credible autonomous operation-and-maintenance services.
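The training data collector's construction of m training sets from raw running data can be sketched as follows. The grouping key, record fields and function name are assumptions for illustration: one training set per kind of running data (metrics, query logs, system logs, etc.).

```python
from collections import defaultdict

def build_training_sets(running_data):
    # running_data: iterable of (kind, record) pairs, e.g. kind = "query_log".
    sets = defaultdict(list)
    for kind, record in running_data:
        sets[kind].append(record)
    return dict(sets)                      # m training sets, m = number of kinds

running_data = [
    ("metrics", {"cpu": 0.72, "mem": 0.55}),
    ("query_log", {"sql": "SELECT 1", "latency_ms": 3}),
    ("metrics", {"cpu": 0.80, "mem": 0.60}),
]
training_sets = build_training_sets(running_data)
print(sorted(training_sets), len(training_sets["metrics"]))  # ['metrics', 'query_log'] 2
```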
  • In a possible design, the computer device may further send a first instruction to a model manager (the model manager may or may not be deployed in the computer device, which is not limited here), where the first instruction is used to instruct the model manager to fine-tune the first target model. When the performance of the second target model satisfies the third preset requirement, the computer device receives the model parameters of the second target model sent by the model manager, where the second target model is a model obtained by the model manager fine-tuning the first target model using the target training set corresponding to the first target model, and the target training set is one of the m training sets. Finally, the computer device updates the first target model to the second target model, and obtains the target physical plan through the updated n models (at this time, the updated n models no longer include the first target model but include the second target model).
  • In the above embodiment, if the first target model does not meet the second preset requirement, it is also necessary to fine-tune the first target model through the model manager using the target training set corresponding to the first target model and, when the performance of the second target model obtained after fine-tuning meets the third preset requirement, to update the first target model, so as to dynamically update and replace the corresponding model used in the database according to the real-time running state of the database.
  • In a possible design, the computer device will also receive a second instruction sent by the model evaluator, where the second instruction is used to instruct the database to use the native kernel components in the database to generate the final target physical plan, and the model evaluator can be deployed in the computer device or in a remote device, which is not limited here.
  • the model evaluator is used to evaluate the performance of the second target model.
  • In the above embodiment, the computer device receives the second instruction of the model evaluator instructing the database to use the traditional algorithm of the database (that is, the native kernel components) to generate the target physical plan; the embodiments of the present application thus provide multiple options for generating the target physical plan, which is flexible.
  • In a possible design, the first target model meeting the second preset requirement includes but is not limited to:
  • the performance of the first target model can be predicted by the model evaluator, and the probability that the predicted performance of the first target model will decline reaches a certain preset value (which may be referred to as the third preset threshold), for example, the probability that the performance will decline reaches 80%;
  • or, the continuous running time of the first target model reaches a preset duration, for example, the continuous running time of the first target model reaches 30 minutes.
  • the second target model meeting the third preset requirement may include but not limited to:
  • the performance of the second target model has increased by a certain preset value (which may be called the fourth preset threshold).
  • the fourth preset threshold can be zero, which means that as long as When the performance of the second target model reaches the performance level of the original first target model, it is considered that the second target model meets the third preset requirement; as another example, the fourth preset threshold can also be a certain value greater than zero Or a certain ratio, it means that only when the performance of the second target model is improved to a certain extent compared with the performance of the original first target model, the second target model is considered to meet the third preset requirement;
• the performance of the second target model is improved, compared with the performance of the native kernel components in the database, by a fifth preset threshold. That is, the performance improvement of the second target model over the traditional database algorithm is verified: if the improvement reaches the threshold, the model used by the corresponding module of the database is actually replaced; otherwise, the traditional database algorithm is still used to execute the target physical plan.
  • the value of the fifth preset threshold may be zero, or a certain value or a certain ratio greater than zero. For details, please refer to the above-mentioned first method, which will not be repeated here.
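The two verification methods above amount to simple threshold checks. The following Python sketch is illustrative only: the function names, the performance numbers, and the idea of expressing each method as a separate predicate are assumptions, not part of the claimed method.

```python
# Hypothetical sketch of the model-replacement checks described above.
# Names and values are illustrative assumptions.

def meets_third_preset_requirement(new_perf: float, old_perf: float,
                                   fourth_threshold: float = 0.0) -> bool:
    """Method 1: the second (retrained) model must improve on the first
    target model by at least the fourth preset threshold."""
    return new_perf - old_perf >= fourth_threshold

def beats_native_algorithm(new_perf: float, native_perf: float,
                           fifth_threshold: float = 0.0) -> bool:
    """Method 2: the second model must improve on the database's native
    kernel algorithm by at least the fifth preset threshold; otherwise
    the traditional algorithm keeps executing the target physical plan."""
    return new_perf - native_perf >= fifth_threshold

# With the fourth threshold at zero, matching the old model is enough.
print(meets_third_preset_requirement(100.0, 100.0))               # True
print(meets_third_preset_requirement(95.0, 100.0))                # False
print(beats_native_algorithm(100.0, 90.0, fifth_threshold=15.0))  # False
```

As in the text, a zero threshold means "at least as good", while a positive value or ratio demands a genuine improvement before the model is swapped in.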
  • the execution overhead meeting the first preset requirement includes but is not limited to:
  • the execution overhead of the target physical plan is the lowest among the q execution overheads.
  • the q execution overheads are the corresponding execution overheads of the q physical plans generated based on the SQL statements input into the database.
• One physical plan corresponds to one execution overhead, where q ≥ 1;
  • the execution cost of the target physical plan is lower than a certain preset value (which may be referred to as a first preset threshold).
• in the following embodiments, the case where the execution cost of the target physical plan is the lowest among the q execution costs is taken as an example of the execution cost meeting the first preset requirement, and details will not be repeated later.
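The two cases of the first preset requirement can be sketched as a small selection routine; the plan names, costs, and threshold below are invented for illustration.

```python
# Minimal sketch: selecting the target physical plan whose execution
# overhead meets the first preset requirement. Plans/costs are invented.

def pick_target_plan(plans_with_costs, first_threshold=None):
    """Among q (plan, cost) candidates, pick the lowest-cost plan (case 1).
    If `first_threshold` is given, pick the first plan whose cost is
    below the threshold instead (case 2); None if no plan qualifies."""
    if first_threshold is not None:
        for plan, cost in plans_with_costs:
            if cost < first_threshold:
                return plan
        return None
    return min(plans_with_costs, key=lambda pc: pc[1])[0]

candidates = [("hash_join_plan", 120.0), ("index_scan_plan", 45.0),
              ("nested_loop_plan", 300.0)]
print(pick_target_plan(candidates))                        # index_scan_plan
print(pick_target_plan(candidates, first_threshold=50.0))  # index_scan_plan
```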
  • a third aspect of the embodiments of the present application provides a computer device, where the computer device has a function of implementing the method of the second aspect or any possible implementation manner of the second aspect.
  • This function may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
• The fourth aspect of the embodiments of the present application provides a computer device, which may include a memory, a processor, and a bus system, wherein the memory is used to store a program, and the processor is used to call the program stored in the memory to execute the method of the second aspect or any possible implementation manner of the second aspect.
• The fifth aspect of the embodiments of the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the method of the above-mentioned second aspect or any possible implementation manner of the second aspect.
  • the sixth aspect of the embodiments of the present application provides a computer program, which, when running on a computer, causes the computer to execute the method of the above-mentioned second aspect or any possible implementation manner of the second aspect.
• The seventh aspect of the embodiments of the present application provides a chip. The chip includes at least one processor and at least one interface circuit, the interface circuit being coupled to the processor. The at least one interface circuit is used to perform sending and receiving functions and to send instructions to the at least one processor, and the at least one processor is used to run computer programs or instructions having the function of implementing the method of the second aspect or any possible implementation manner of the second aspect. This function may be implemented by hardware, by software, or by a combination of hardware and software, where the hardware or software includes one or more modules corresponding to the above functions.
  • the interface circuit is used to communicate with other modules outside the chip.
  • Fig. 1 is a schematic diagram of a system architecture for constructing a database management system provided by the embodiment of the present application;
  • FIG. 2 is a schematic diagram of a logical architecture of the database management system provided by the embodiment of the present application.
  • Fig. 3 is a schematic diagram of the principle of the optimizer provided by the embodiment of the present application.
  • Fig. 4 is a schematic diagram of the principle of the model evaluator provided by the embodiment of the present application.
  • Fig. 5 is a schematic diagram of the principle of the suggester provided by the embodiment of the present application.
• FIG. 6 is a schematic diagram comparing the optimizer provided by the embodiment of the present application with three rewriting strategies;
• FIG. 7 is a schematic diagram comparing the model evaluator provided by the embodiment of the present application with two known performance evaluation methods;
  • FIG. 8 is a schematic flowchart of a data processing method provided in an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 10 is another schematic structural diagram of a computer device provided by an embodiment of the present application.
• The embodiments of the present application provide a database management system, data processing method and device which, by incorporating machine learning technology, can automatically perform database optimization, protection, update and other routine database management functions traditionally performed by DBAs, without human intervention.
  • the embodiment of the present application involves a lot of related knowledge about databases, models, etc.
• The following first introduces related terms and concepts that may be involved in the embodiments of the present application. It should be understood that the interpretation of related concepts may be constrained by the specific circumstances of each embodiment, but this does not mean that the application is limited to those specific circumstances; the specific circumstances of different embodiments may differ, and no limitation is imposed here.
  • a database is a computer software system that stores and manages data according to its data structure.
• The concept of the database actually includes two meanings: a. The database is an entity, a "warehouse" that reasonably stores data; users store the transaction data to be managed in this "warehouse", and the two concepts of "data" and "warehouse" combine into a database.
• b. The database is a new method and technology of data management, which can organize data more appropriately, maintain data more conveniently, control data more strictly, and utilize data more effectively.
• Database software is deployed on local devices, such as local servers and local terminal devices (such as mobile phones, smart watches, personal computers, etc.), and usually exists in the form of one or more processes, so database software may also be called a database process.
• A neural network can be composed of neural units. Specifically, it can be understood as a network with an input layer, hidden layers, and an output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are the hidden layers. A neural network with many hidden layers is called a deep neural network (DNN).
• The work of each layer in a neural network can be described mathematically as y = a(W·x + b). From the physical level, the work of each layer can be understood as completing the transformation from the input space to the output space (that is, from the row space of the matrix to its column space) through five operations on the input space (a set of input vectors). These five operations are: 1. raising/lowering the dimension; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 by +b, and operation 5 by a(). The word "space" is used here because the classified object is not a single thing but a class of things, and space refers to the collection of all individuals of such things. W is the weight matrix of each layer of the neural network, and each value in this matrix represents the weight value of one neuron in that layer.
  • the matrix W determines the space transformation from the input space to the output space mentioned above, that is, the W of each layer of the neural network controls how to transform the space.
  • the purpose of training the neural network is to finally obtain the weight matrix of all layers of the trained neural network. Therefore, the training process of the neural network is essentially to learn the way to control the spatial transformation, and more specifically, to learn the weight matrix.
• The error back propagation (BP) algorithm can be used to correct the values of the parameters in the initial neural network model, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward propagation of the input signal up to the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
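As a minimal numeric illustration of this idea (forward pass, error loss, backward parameter update until the loss converges), the following sketch trains a one-weight model y = w·x by gradient descent; the data and learning rate are invented for illustration.

```python
# A minimal numeric illustration of error backpropagation: a one-weight
# "network" y = w*x is trained by propagating the error loss backward
# to update w until the loss converges. Values are illustrative only.
w = 0.0
lr = 0.1
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # target relation y = 2x

for _ in range(100):
    for x, y_true in data:
        y_pred = w * x                      # forward pass
        grad = 2 * (y_pred - y_true) * x    # dLoss/dw for squared error
        w -= lr * grad                      # backward update
print(round(w, 3))  # converges to 2.0
```

Here the "optimal parameter" sought by backpropagation is simply w = 2, and the loop recovers it from the error signal alone.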
• Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, convex analysis, algorithm complexity theory and other disciplines. It specializes in how computers simulate or realize human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance.
  • the following introduces several machine learning models used in the embodiments of this application:
• Monte Carlo tree search (MCTS) is a method of strategy optimization for artificial intelligence problems, usually applied to move planning in combinatorial games; it combines the generality of random simulation with the accuracy of tree search. Owing to its achievements in computer Go and its potential for solving difficult problems, the Monte Carlo tree search algorithm can be applied to any field that can be described in the form of (state, action) and whose results can be predicted through simulation (e.g., the rewrite sequence selection problem in query rewriting).
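A compact sketch of the "(state, action) plus random simulation" idea follows. It is a one-level tree search (effectively a UCB bandit over root actions) rather than a full MCTS, and the toy reward function is invented for illustration.

```python
import math
import random

# One-level MCTS sketch: repeatedly select an action by the UCB score
# (tree-search accuracy), estimate its value by random simulation, and
# finally return the most-visited action. The toy rewards are invented.

def ucb(total_value, visits, parent_visits, c=1.4):
    if visits == 0:
        return float("inf")  # try every action at least once
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def mcts_best_action(actions, simulate, iterations=2000, seed=0):
    random.seed(seed)
    visits = {a: 0 for a in actions}
    value = {a: 0.0 for a in actions}
    for i in range(1, iterations + 1):
        # selection: the action with the highest UCB score
        act = max(actions, key=lambda a: ucb(value[a], visits[a], i))
        # simulation: a random rollout estimates the action's reward
        value[act] += simulate(act)
        visits[act] += 1
    return max(actions, key=lambda a: visits[a])

# Toy problem: action 3 has the highest mean payoff and should be found.
reward_means = {1: 0.2, 2: 0.5, 3: 0.8}
best = mcts_best_action([1, 2, 3], lambda a: random.gauss(reward_means[a], 0.1))
print(best)  # 3
```

In the query-rewriting setting mentioned above, the actions would be candidate rewrite steps and the simulation would estimate the resulting plan's cost; that mapping is an assumption for illustration.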
• Recurrent neural network (RNN)
  • RNN is a kind of neural network, and the purpose of RNN is to process sequence data.
• In a traditional neural network, the layers are fully connected from the input layer through the hidden layer to the output layer, while the nodes within each layer are unconnected. Such an ordinary neural network is powerless for many problems. For example, to predict the next word in a sentence, one generally needs to use the previous words, because the words in a sentence are not independent of one another. RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs.
  • RNN can process sequence data of any length.
• A temporal recurrent neural network can describe dynamic temporal behavior: unlike a feedforward neural network, which accepts input of a fixed structure, an RNN circulates state within its own network and can therefore accept a wider range of time-series inputs.
• LSTM, also called a long short-term memory network, is a temporal recurrent neural network specially designed to solve the long-term dependency problem of general RNNs. All RNNs have a chain form of repeated neural network modules. In a standard RNN, this repeated module has only a very simple structure, such as a single tanh layer.
• Tree-structured long short-term memory network (Tree-LSTM)
  • Tree-LSTM mainly extends LSTM to the tree-shaped input structure, and surpasses the traditional LSTM model in predicting semantic relevance and semantic classification tasks on the tree.
• Convolutional neural network (CNN)
  • the CNN is a deep neural network with a convolutional structure.
  • the convolutional neural network includes a feature extractor composed of a convolutional layer and a subsampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolutional feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to some adjacent neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units of the same feature plane share weights, and the shared weights here are convolution kernels.
• Shared weights can be understood as a way of extracting image information that is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part; therefore, the same learned image information can be used for all positions on the image.
• Multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
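A back-of-the-envelope comparison shows why weight sharing reduces connections: a fully connected layer over an image needs a weight per input–output pair, while a shared convolution kernel reuses the same few weights at every position. The 32×32 input size and 3×3 kernel below are invented for illustration.

```python
# Illustration of weight sharing: parameter counts of a fully connected
# layer vs. a single shared 3x3 convolution kernel over a 32x32 image.

h, w = 32, 32
fc_params = (h * w) * (h * w)  # every input pixel connected to every output unit
conv_params = 3 * 3            # one shared 3x3 kernel used at all positions
print(fc_params)    # 1048576
print(conv_params)  # 9
```

The five-orders-of-magnitude gap is the "reduced connections" benefit stated above, and it is also why fewer parameters lower the risk of overfitting.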
  • RL is a field in machine learning that emphasizes how to act based on the environment to maximize the expected benefit.
• RL is the third basic machine learning paradigm besides supervised learning and unsupervised learning. Unlike supervised learning, RL does not require labelled input/output pairs, nor does it require explicit correction of non-optimal actions. Its focus is on finding a balance between exploration (of unknown territory) and exploitation (of existing knowledge); in the study of this "exploration–exploitation" trade-off, multi-armed bandit problems and finite Markov decision processes (MDP) are the most investigated.
• In the embodiment of the present application, RL controls the execution order of the join operations to ensure that the total query execution cost is minimized.
• RL also has a training process, which requires continuously executing actions, observing their effects, and accumulating experience to form a model. Unlike supervised learning, each action here generally has no directly labelled value as a supervisory signal; the system only gives feedback on the action performed by the algorithm.
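The feedback-only training loop described above can be sketched as a tiny tabular value-learning agent; the actions, reward values and hyperparameters below are invented for illustration and are not part of the embodiment.

```python
import random

# Schematic RL feedback loop: the agent repeatedly executes an action,
# observes only a reward signal (no labelled answer), and accumulates
# the experience into a value table. Rewards are invented.

random.seed(1)
q = {"a": 0.0, "b": 0.0}            # accumulated experience per action
true_reward = {"a": 1.0, "b": 5.0}  # hidden from the agent
alpha, eps = 0.1, 0.2               # learning rate, exploration rate

for _ in range(500):
    # epsilon-greedy: mostly exploit current knowledge, sometimes explore
    action = random.choice(["a", "b"]) if random.random() < eps else max(q, key=q.get)
    reward = true_reward[action] + random.gauss(0, 0.1)  # feedback only
    q[action] += alpha * (reward - q[action])            # update experience

print(max(q, key=q.get))  # b
```

In the join-ordering setting mentioned above, the actions would be candidate join orders and the reward would be the negated execution cost; that mapping is an assumption for illustration.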
• Remote procedure call (RPC)
  • RPC is a protocol for requesting services from remote computer programs over a network without requiring knowledge of the underlying network technology.
• In the embodiment of the present application, RPC is used to realize rapid interaction between the database kernel components and the external model manager, such as issuing a model update request, creating a new model, and so on.
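As a hedged sketch of such kernel-to-model-manager interaction, the following uses Python's standard-library XML-RPC; the method name `request_model_update`, the port choice and the in-process server are illustrative assumptions, since the patent does not specify a concrete RPC framework.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical sketch: a database kernel component issues a model-update
# request to an external model manager over RPC. All names are invented.

def request_model_update(model_name: str) -> str:
    # stands in for the model manager scheduling retraining of a model
    return f"update scheduled for {model_name}"

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)  # ephemeral port
server.register_function(request_model_update)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

client = ServerProxy(f"http://127.0.0.1:{port}")  # the database kernel side
print(client.request_model_update("cost_estimator"))  # update scheduled for cost_estimator
server.shutdown()
```

The caller needs no knowledge of the underlying transport, which is exactly the RPC property the text relies on.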
• The core idea of SageDB is to build multiple cumulative distribution functions (CDF) describing the data distribution, and to use these CDF models to generate learned indexes, replace cost estimation models, and accelerate physical operators.
• SageDB first assumes that a "perfect" CDF model can be learned, that is, the probability distribution of the model accurately matches the data distribution of the corresponding data table, and then inserts the CDF model into different modules of the database to provide machine-learning-based reasoning abilities: 1) for the optimizer, SageDB directly replaces the cost estimation model with the CDF model learned on a single table to estimate the cost and cardinality of different queries; 2) for data structures, SageDB replaces the blocks of the traditional multi-way balanced search tree (B-Tree) with the learned CDF model, and maintains a list of abnormal cases to correct CDF positioning errors; 3) for physical operator acceleration, taking the sorting operation as an example, SageDB first performs a rough sort of the underlying data based on the learned CDF model (input: data values; output: relative position numbers), and then uses a traditional sorting algorithm (such as quicksort) to obtain the final sorted result.
• SageDB is still under research; it uses a simple experimental CDF model that is difficult to adapt to large-scale data sets. Moreover, all learning functions of SageDB are based on CDF model clusters learned on single tables, which suits simple single-table query scenarios but cannot effectively handle multi-table join problems. In addition, the CDF is only used to learn the data or load distribution and cannot provide intelligent decision-making functions such as query rewriting, query plan generation, and anomaly diagnosis; and SageDB does not provide a multi-CDF model management and model update mechanism.
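The learned-CDF idea (predict a record's position from a CDF model, then correct small errors locally, as in point 2 above) can be sketched as follows; the straight-line "model", the key set and the error window are invented stand-ins for a learned CDF.

```python
import bisect

# Sketch of the learned-index idea: a CDF over sorted keys predicts a
# record's position, and a small local search corrects prediction error.
# A straight-line CDF stands in for the learned model.

keys = list(range(0, 1000, 2))  # sorted "table" of 500 even keys

def learned_position(key):
    # "model": for uniformly spaced keys the CDF is linear, pos ~= key/2
    return max(0, min(len(keys) - 1, key // 2))

def lookup(key):
    pos = learned_position(key)
    # error correction: search a small window around the prediction
    lo, hi = max(0, pos - 2), min(len(keys), pos + 3)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

print(lookup(424))  # 212
print(lookup(423))  # -1 (odd keys are absent)
```

When the model's error is bounded, the correction window stays constant-sized, which is what lets the learned CDF replace B-Tree block traversal.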
  • Oracle Database has a long-term investment in the automated operation and maintenance of the database.
• Oracle 10g introduced various self-management functions to simplify management, improve efficiency, and reduce the total cost of system management. These functions include: 1) statistical analysis related to SQL query optimization; 2) the automatic storage manager, which simplifies how data files, control files and log files are stored; 3) the automatic workload repository, which stores and manages tuning information; 4) the automatic database diagnostic monitor, which analyzes stored statistics, identifies possible performance bottlenecks and provides suggestions for solving the problems found; 5) automatic query optimization, which determines the execution plan by using query rewriting rules and cost models; 6) automatically generated tuning suggestions for SQL statements or workloads.
• 7) SQL tuning recommendations, which make decisions based on information provided by the query optimizer, including the automatic database diagnostic monitor and the automatic workload repository; 8) recommendation, based on the current workload, of indexes (including bitmap indexes, function-based indexes, and B-tree indexes), materialized views, and table partitions.
• 9) the optimizer statistics collector, which collects statistics for optimization; 10) coordination of all self-management in the server by managing database snapshots and stored information; 11) server-generated alerts, with the system configured to automatically generate alerts when events are triggered; 12) automated pre-installation and post-installation tasks: the system is automatically checked before installation to ensure the success of the installation process and to recommend changes; 13) automatic management of the shared memory used by an Oracle database instance, eliminating the need for administrators to manually configure shared memory components; 14) the database resource manager, which allows the DBA to logically divide the workload into different units and allocate central processing unit (CPU) resources to these units without additional overhead.
• For example, on-line transaction processing (OLTP) should take precedence over on-line analytical processing (OLAP), or vice versa.
• Its scheduling mechanism operates at fixed time intervals and controls the number of active sessions executing at one time. When the available slots for active sessions are filled, new sessions are queued until a slot becomes available.
  • Its first autonomous database release version is 19c, which provides external services in the form of public cloud, including functions such as automatic indexing, distributed columns, and materialized view recommendations.
• However, the optimization capabilities of the Oracle database are mainly presented as single-point, independent functions that are not unified into a closed loop, and users need to invoke them on demand according to their own needs. In addition, the autonomous functions of the Oracle database are mainly reflected in the analysis and management stage and are based on limited rules or traditional statistical learning, so the ability to optimize database anomalies is limited. Moreover, the Oracle database does not provide a unified model and training data management and update mechanism, nor does it provide a component performance verification function, so its tuning is passive.
• To this end, the embodiment of this application provides a new database management system which, based on machine learning algorithms and expert experience, realizes self-learning kernel and model optimization and builds all-round autonomous functions for the database.
  • FIG. 1 is a schematic diagram of a system architecture for constructing the database management system provided by the embodiment of the present application.
• The architecture includes the machine learning platform component 102 and the self-learning suggester 103 (the self-learning suggester 103 may be referred to as suggester 103 for short).
  • the module functions of each structure are introduced below:
• The database software 101 (similar to server software) is deployed on a local device, for example a local server or a local terminal device (such as a mobile phone, a personal computer, etc.), and usually exists in the form of one or more processes.
• The system architecture of the embodiment of the present application includes self-learning kernel components of the database, which replace or supplement the algorithms or implementations of the traditional native kernel components of the database (it should be noted that the native kernel components are not deleted; they remain in the database process), improving overall database reliability or performance.
• A self-learning kernel component does not merely replace a certain algorithm with machine learning at a single point. Its distinguishing capabilities are automatic adaptation to scenarios based on changes in system load or business status, automatic updates of models trained by the algorithms, and access to feedback and verification mechanisms for handling model drift automatically and keeping the models continuously available.
• The self-learning kernel components may include a self-learning optimizer 1011 (which may be referred to as optimizer 1011 for short), a self-learning index, self-learning storage, a self-learning executor, and so on; different self-learning kernel components correspondingly realize the functions of the corresponding modules in the native kernel components.
• In the embodiment of the present application, the implementation mechanism and calling logic of the self-learning kernel components are introduced mainly by taking the optimizer 1011 as an example.
• The data source is the database system, including but not limited to internal database indicators (such as transactions per second (TPS), cache hits, active transactions, resource usage, etc.), operating system information, log information, etc.
• The model manager 1022 (also referred to as the model management platform) selects a training algorithm and, combining it with the collected data information, completes the model training.
• The trained model is pushed to the model evaluator 1023, and only a model that has been evaluated and meets business expectations is identified as a model to be applied; otherwise it needs to be readjusted and retrained.
• The information collector can be deployed in the database software 101, or it can be a process deployed separately outside the database software 101. The purpose of separate deployment is to decouple it from the database while still implementing its specific function (that is, collecting the operating data of the database process). The embodiment of the present application does not limit the deployment mode of the information collector.
• The function of the information collector is to collect data. This function can be integrated in the optimizer 1011 or in the training data collector 1021. In the embodiment of the present application, the case where the training data collector 1021 also functions as the information collector is taken as an example for illustration, and details will not be described later.
• The models included in the machine learning platform component 102 may be pre-trained in advance, in which case the models deployed in the system architecture refer to pre-trained models.
• The training data collected by the training data collector 1021 in the embodiment of the present application can be used to fine-tune the pre-trained models (for example, fine-tuning can be performed when the performance of a model declines after it has been applied for a period of time). Alternatively, the models included in the machine learning platform component 102 may not be pre-trained; instead, initialized models are deployed directly in the system architecture, and the training data collected by the training data collector 1021 are then used to train each model and for subsequent fine-tuning. This application does not limit the state of the models included in the machine learning platform component 102 at deployment (that is, whether they have been pre-trained).
• The machine learning platform component 102 may be deployed on a remote device (for example, a remote server, as shown in FIG. 1), or on a local device (for example, a local server), or may even be implemented in the same process as the database (that is, implemented in the database kernel).
• Deploying it as an individual component program is the least intrusive approach to the existing capabilities of the database, and it can serve as an accessory that gradually and iteratively replaces the capabilities of the database kernel modules. The same is true for deployment on local devices: it is less intrusive to the database, but it will compete for the same device resources as the database, so new scheduling components are usually required for load balancing and resource control.
• Integrating the machine learning platform component in the database kernel means that the database itself provides machine learning capabilities (including but not limited to deep learning, reinforcement learning, etc.). This is highly intrusive to the database, but data privacy protection is good, communication overhead is reduced, and the direct interface makes model tuning or fine-tuning more convenient.
  • the suggester 103 is used to discover possible problems in the running process of the database, and perform diagnosis and tuning for intelligent operation and maintenance management of the database.
• The suggester 103 also needs a machine learning platform component to manage the algorithm models used in the intelligent operation and maintenance process.
• The machine learning platform it uses can share the same origin as the machine learning platform component used by the database system, that is, the two can use the same machine learning platform; or they can be managed separately, that is, a separate machine learning platform component can be deployed for the models in the suggester 103. This application does not limit this; in either case, the functions and mechanisms of the machine learning platform component are unchanged, and the models are likewise updated automatically with a learning and feedback mechanism to ensure model availability.
• The suggester 103 is not a manifestation of the database kernel capability but is used for database management; it can tune or strengthen the capabilities provided by the database kernel modules. After interacting with the database, the suggester 103 obtains more information and suggestions and its models become better optimized, which is beneficial to the intelligent implementation of the system.
• The suggester 103 needs to report diagnostic information and a health index, and also needs to accept instructions from users; a web front end is therefore required to realize this function.
  • the implementation method is a conventional operation, which will not be described here.
• The system architecture of the data management system includes the suggester 103; as can be seen from the above description, the module function of the suggester 103 is to find possible problems during database operation and to perform diagnosis and tuning. Therefore, in other implementation manners of the present application, the suggester 103 may be omitted.
• The functional modules are, respectively, the optimizer 1011 shown in FIG. 1, the training data collector 1021 (which may also be referred to as the training data collection platform), the model manager 1022 (which may also be referred to as the model management platform), the model evaluator 1023, and the suggester 103 (in some embodiments, the suggester 103 may be omitted). Please refer to FIG. 2 for details.
• The new modules include the self-learning optimizer 201 (it should be noted that some kernel components included in the optimizer 201 provided by this application can be native kernel components, but the n models included in the optimizer 201 are newly added by this application), the training data collector 202, the model manager 203, the model evaluator 204 and the self-learning suggester 205 (in some embodiments, the suggester 205 may be omitted). Each functional module is described below in terms of its specific functions and calling logic:
  • the learning optimizer used by the embodiment of the present application uses machine learning technology to improve performance.
• The optimizer 201 includes n models, where n ≥ 1.
• The optimizer 201 is used to obtain, through the n models and according to the SQL statement input into the database, the final physical plan to be executed (which may be called the target physical plan), where the target physical plan is a physical plan whose execution overhead meets a certain preset requirement (which may be referred to as the first preset requirement).
• The execution overhead meeting the first preset requirement includes but is not limited to: 1) the execution overhead of the target physical plan is the lowest among q execution overheads, where the q execution overheads are those corresponding to the q physical plans generated based on the SQL statement input into the database, one physical plan corresponds to one execution overhead, and q ≥ 1; 2) the execution overhead of the target physical plan is lower than a certain preset value (which may be referred to as the first preset threshold).
• In the following, the case where the execution cost of the target physical plan is the lowest among the q execution costs is taken as an example of the execution cost meeting the first preset requirement, and details will not be described later.
  • The optimizer 201 may specifically include three models, which may be referred to as model A, model B, and model C, respectively used to perform the three steps of logical query rewriting, cost estimation, and physical plan generation. It should be noted that in other implementations of this application, the optimizer 201 may include more or fewer models to realize the processes of logical query rewriting, cost estimation, and physical plan generation; the three models included in the optimizer 201 here are for illustration only, and details will not be repeated later.
  • The optimizer 201 performs logical query rewriting on the SQL statement (also referred to as the SQL query) input into the database through model A to obtain a rewritten logical plan, where model A is a model constructed based on a tree search algorithm, for example, a Monte Carlo tree search (MCTS) algorithm. Afterwards, the optimizer 201 generates q physical plans from the logical plan through model B, where model B is a model constructed based on a deep learning algorithm, for example, a Tree-LSTM-based model, q ≥ 1. Finally, the optimizer 201 calculates, through model C, the q execution costs corresponding to the q physical plans (one physical plan corresponding to one execution cost) and determines the final target physical plan to be executed according to the q execution costs, where model C is a model constructed based on a reinforcement learning algorithm, for example, a DQN-based model.
  • the above three steps performed by the optimizer 201 can specifically be implemented through three sub-modules: a learning-type rewriter, a learning-type cost estimator, and a learning-type plan generator.
  • This process can specifically be: first, the rewriter provided in the embodiment of the present application uses a model based on a tree search algorithm to convert the initial SQL statement input into the database system into a semantically equivalent logical plan A recognizable by the database system; then, based on the cost estimator, a rewritten logical plan B with higher execution efficiency is obtained.
  • Taking the optimizer 201 including the three sub-modules of a learning-type rewriter, a learning-type cost estimator, and a learning-type plan generator as an example, the processes of logical query rewriting, cost estimation, and physical plan generation executed by the optimizer 201 are described in detail below:
  • The learning-type rewriter uses an MCTS-based method to rewrite the input SQL statement into an equivalent but less expensive query. It first constructs a strategy tree, where the root node is the original query and each non-root node is a query obtained by rewriting its parent node.
  • Each child node represents a semantically equivalent logical plan obtained from its parent node through one rewriting operation. The search then iteratively selects on the strategy tree the equivalent logical plan with the least cost or the one least frequently selected, and expands the strategy tree (that is, according to all the rewriting strategies applicable to the selected plan, new child nodes are added under the tree node corresponding to the selected plan). Finally, the logical plan with the least execution cost on the strategy tree is selected as the output of the rewriter.
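The strategy-tree search described above can be sketched as follows. This is a toy illustration rather than the patent's actual implementation: all names (`PlanNode`, `rewrite_search`, the UCT-style score) are invented, and the rewrite/cost functions are placeholders supplied by the caller. Selection favors plans that are cheap or rarely visited, expansion applies all rewrite rules to the selected plan, and the cheapest plan on the tree is returned.

```python
import math

# Toy sketch of the rewriter's strategy-tree search; names are invented.
class PlanNode:
    def __init__(self, plan, cost, parent=None):
        self.plan, self.cost, self.parent = plan, cost, parent
        self.children, self.visits = [], 0

def uct_score(node, total_visits, c=1.4):
    # Cheap plans score high (exploitation); rarely selected plans score high (exploration).
    exploit = 1.0 / (1.0 + node.cost)
    explore = c * math.sqrt(math.log(total_visits + 1) / (node.visits + 1))
    return exploit + explore

def rewrite_search(root_plan, root_cost, apply_rewrites, estimate_cost, iterations=50):
    root = PlanNode(root_plan, root_cost)
    all_nodes = [root]
    for _ in range(iterations):
        total = sum(n.visits for n in all_nodes)
        node = max(all_nodes, key=lambda n: uct_score(n, total))   # selection
        node.visits += 1
        if not node.children:                                      # expansion
            for new_plan in apply_rewrites(node.plan):
                child = PlanNode(new_plan, estimate_cost(new_plan), parent=node)
                node.children.append(child)
                all_nodes.append(child)
    return min(all_nodes, key=lambda n: n.cost)   # cheapest equivalent plan found

# Toy usage: a "plan" is just an integer cost; each rewrite lowers it by one.
best = rewrite_search(10, 10, lambda p: [p - 1] if p > 0 else [], lambda p: p)
```

A real rewriter would represent plans as operator trees and derive children from the applicable rewrite rules rather than from a numeric decrement.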
  • The learning-type cost estimator uses a deep-learning-based method to estimate the cost and cardinality of queries, and can capture correlations between different columns. It designs a tree-structured model matching the physical plan of the query statement.
  • By analogy with the way a physical plan is composed of multiple sub-plans, each tree-structured model can be composed of several sub-models.
  • The embodiment of this application uses the tree-structured model to estimate the cost or cardinality of the plan.
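As a minimal illustration of composing a tree-structured estimate from per-operator sub-models (the operator names and cost formulas below are invented placeholders, not taken from the patent):

```python
# Toy per-operator sub-models: each maps an operator's estimated rows and its
# children's costs to a cost, mirroring how the plan tree is built from sub-plans.
SUB_MODELS = {
    "SeqScan":   lambda rows, child_costs: 1.0 * rows,
    "IndexScan": lambda rows, child_costs: 0.1 * rows,
    "HashJoin":  lambda rows, child_costs: sum(child_costs) + 0.5 * rows,
    "NestLoop":  lambda rows, child_costs: sum(child_costs) + 2.0 * rows,
}

def estimate(plan):
    """plan = (operator, estimated_rows, [child plans]); returns the composed cost."""
    op, rows, children = plan
    child_costs = [estimate(c) for c in children]
    return SUB_MODELS[op](rows, child_costs)

plan = ("HashJoin", 100, [("SeqScan", 1000, []), ("IndexScan", 500, [])])
cost = estimate(plan)   # 1000 + 50 + 0.5 * 100 = 1100.0
```

A learned version would replace each lambda with a trained sub-model whose inputs include operator features and child embeddings, but the recursive composition over the plan tree is the same.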
  • the learning plan generator uses a method based on reinforcement learning to generate an optimized physical plan (that is, the target physical plan).
  • The logic here is: the generated equivalent logical plan corresponds to multiple execution plan trees, each of which includes one or more execution operators (also called physical operators). Each execution plan tree may have multiple execution paths involving different execution operators, and each execution plan tree corresponds to a total cost; the goal is to find the physical plan with the smallest total cost. Reinforcement learning with a tree-structured LSTM is used to select the join order.
  • Specifically, the plan generator uses Tree-LSTM to encode the current physical plan into a compressed vector as the state of deep reinforcement learning, and then iterates multiple times, each time selecting the join operation with the highest long-term reward, finally outputting the physical plan with the lowest execution cost as the actual logic for executing the SQL statement.
  • A GCN may be used to capture the structure of the join tree, supporting database schema updates and multiple aliases for table names. The model can automatically select the appropriate physical operator.
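The iterative join-order selection can be sketched as follows, with the learned Q-function stubbed out by a toy heuristic; `choose_join_order` and the stub are invented names, and a real implementation would obtain Q-values from the Tree-LSTM encoding of the current plan state:

```python
def choose_join_order(tables, q_value):
    """Greedily pick the next join with the highest estimated long-term reward.

    q_value(state, pair) -> estimated reward of joining `pair` next.
    """
    state, order = list(tables), []
    while len(state) > 1:
        pairs = [(a, b) for i, a in enumerate(state) for b in state[i + 1:]]
        best = max(pairs, key=lambda p: q_value(tuple(state), p))
        order.append(best)
        # Replace the two joined relations with their join result.
        state = [t for t in state if t not in best] + ["(%s JOIN %s)" % best]
    return order

# Toy Q-function: prefer joining base tables (no parentheses in their names) first.
order = choose_join_order(["c", "a", "b"], lambda s, p: -(p[0] + p[1]).count("("))
```

In the patent's setting, the reward would reflect estimated (and later, measured) execution cost rather than this lexical stand-in.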
  • The invocation logic of the optimizer 201 is: in the learning-type optimizer 201 provided by the embodiment of the present application, for the logical plan input through the SQL query parser, the rewriter first builds a strategy tree with the logical plan as the root node, where each child node represents the equivalent logical plan obtained from its parent node after one rewriting operation. Based on MCTS, the rewriter searches the strategy tree for the equivalent logical plan with the least cost, which is then input to the plan generator. The plan generator iteratively adjusts the sequence of join operations to obtain multiple different physical plans. For each physical plan, the cost estimator is used to estimate the execution cost, and the physical plan with the smallest execution cost is then selected as the output.
  • FIG. 3 is a schematic diagram of the principle of the optimizer provided by the embodiment of the present application.
  • the learning model rewrites SQL statements, estimates costs, and selects physical execution plans during database execution.
  • the model in this process can be updated through feedback and incremental training to dynamically adapt to load changes. Its core steps are as follows:
  • In the first step, a strategy tree is constructed, in which the root node is the input SQL query and the non-root nodes are rewritten query statements. The rewriting sequence that obtains the maximum benefit is found through the MCTS search algorithm: iteratively select on the strategy tree the equivalent logical plan with the least cost or the one least frequently selected, and expand the strategy tree (that is, according to all rewriting strategies applicable to the selected plan, add new child nodes under the tree node corresponding to the selected plan); finally, select the logical plan with the least execution cost on the strategy tree as the output of the rewriter.
  • The second step is to estimate the potential benefit of each tree node: a neural-network-based benefit estimation model is designed (for example, an attention layer computes the similarity between rewriting rules and operators) to predict the execution cost that can be reduced in subsequent queries.
  • In the third step, to improve search efficiency, especially when the query contains many logical operators, dynamic programming is used to compute bottom-up, selecting from each node and its subtrees the optimal top-N nodes that have no ancestor-descendant relationship, so as to maximize the total benefit; the node selection scheme corresponding to the root node is then output, meaning that expanding the strategy tree from the corresponding N nodes yields the optimal rewritten query with the highest probability.
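A simplified version of this bottom-up dynamic program can be sketched as follows (structures and names invented; the real algorithm keeps the top-N candidates per subtree, while this sketch makes only the binary choice of taking a node or the best selections from its subtrees, which already guarantees no ancestor-descendant pair is chosen):

```python
def best_selection(node, benefit, children):
    """Return (total_benefit, chosen_nodes) for the subtree rooted at `node`.

    Either the node itself is taken (excluding all its descendants) or the
    best selections from its child subtrees are combined.
    """
    child_results = [best_selection(c, benefit, children) for c in children(node)]
    take_children = (sum(r[0] for r in child_results),
                     [n for r in child_results for n in r[1]])
    take_self = (benefit(node), [node])
    return max(take_self, take_children, key=lambda r: r[0])

# Toy strategy tree with a benefit value per node.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": [], "a1": [], "a2": []}
gains = {"root": 3, "a": 4, "b": 2, "a1": 3, "a2": 2}
total, picked = best_selection("root", gains.get, lambda n: tree.get(n, []))
# Taking a1 (3) + a2 (2) beats taking a (4), so the best set is {a1, a2, b}.
```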
  • The training data is derived from the collected historical query statements; after feature extraction, the training data is fed into the model to be trained.
  • The weights of the model are updated through backpropagation.
  • Each intermediate state contains candidate plans for part of the plan tree, forming a plan forest.
  • The training process is divided into cost training and latency tuning. In cost training, fragments of the execution plan are continuously selected through the reinforcement learning method to judge whether the newly selected plan operation conforms to the optimal plan; in this process, the Q-value estimated by the Tree-LSTM model is used to make an initial judgment on the quality of the plan.
  • In latency tuning, only the measured latencies of a few plans are used as training data for model fine-tuning.
  • The DQN uses the Q-network's estimates to determine which execution tree is better.
  • The plan tree has three types of leaf nodes: columns, tables, and operations. The plan tree is traversed by depth-first search, and the Tree-LSTM network layer estimates the cost represented by each leaf node.
  • The selected plan is the actual executable plan of the statement.
  • the database execution engine executes the above-mentioned optimized SQL statement execution plan, that is, executes the final target physical plan.
  • the training data collector 202 is configured to obtain training data according to the running data of the processes in the database, and construct m training sets based on the obtained training data, where m ⁇ 1.
  • The training data collector 202 can automatically collect statistical information of the database, including database operation indicators, query logs, system logs, and so on, and use this information to generate training data for all the learning-type models involved in the database management system (that is, the models included in the suggester 205 (if any), the optimizer 201, and the model evaluator 204).
  • Corresponding training sets can be generated for different models, that is, m training sets are constructed. For example, if the optimizer 201, the suggester 205, and the model evaluator 204 include 6 models in total, 6 training sets can be constructed, one model corresponding to one training set.
  • Fewer than 6 training sets may also be constructed, that is, m ≤ 6, in which case some models share one training set. This application does not limit the correspondence between training sets and models.
  • the models involved in the embodiments of this application may be pre-trained in advance, that is, the models deployed in the database management system all refer to pre-trained models.
  • The training data collected by the training data collector 202 can be used to fine-tune the pre-trained models (for example, fine-tuning can be performed when the performance of a model declines after a period of use). Alternatively, the models involved in the embodiments of the present application may not be pre-trained in advance; instead, the initialized models are directly deployed to the database management system, after which the training data collected by the training data collector 202 is used to train each model and perform subsequent fine-tuning.
  • This application does not limit the deployment state of the models included in the management system (that is, whether they have been pre-trained).
  • The training data collector 202 can collect the running data of processes in the database from various aspects, including but not limited to: 1) database indicators: the running status of the database, such as queries per second (QPS), CPU usage, cache hit rate, and so on, usually represented as time-series data; 2) SQL queries: the SQL queries and their statistics, such as the physical plan, response time, and duration; 3) database logs: the collected running logs. Since different models in the database management system require different training data, the embodiment of this application can intelligently organize training data for different learning modules, including organizing related columns into the same table to reduce join overhead, selecting training data for each model, and so on.
  • The invocation logic of the training data collector 202 is: the training data collector 202 receives the running data of processes in the database (such as information collected by the database agent program), performs data cleaning and data processing operations on the received running data (for example, data cleaning, data merging, multi-indicator correlation analysis, and other operations, so that the data is more suitable for subsequent model training or fine-tuning) to obtain the training data, and constructs the training data into m training sets for the training or fine-tuning of each model in the database management system.
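The collection pipeline can be sketched schematically (the record shapes and routing predicates below are invented for illustration): raw agent samples are cleaned, merged by timestamp, and routed into per-model training sets.

```python
def build_training_sets(raw_records, routes):
    """routes: model_name -> predicate selecting which cleaned records it trains on."""
    cleaned = [r for r in raw_records if r.get("value") is not None]   # data cleaning
    cleaned.sort(key=lambda r: r["ts"])                                # merge by time
    return {model: [r for r in cleaned if keep(r)] for model, keep in routes.items()}

raw = [
    {"ts": 2, "kind": "metric", "name": "qps", "value": 120},
    {"ts": 1, "kind": "query",  "name": "latency", "value": 0.03},
    {"ts": 3, "kind": "metric", "name": "cpu", "value": None},         # dropped in cleaning
]
sets = build_training_sets(raw, {
    "cost_model": lambda r: r["kind"] == "query",   # cost estimator trains on queries
    "anomaly_ae": lambda r: r["kind"] == "metric",  # anomaly autoencoder trains on metrics
})
```

A production collector would add the correlation analysis mentioned above and persist the sets, but the clean-merge-route shape is the same.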
  • If a model has not been pre-trained, the model is trained based on the specified algorithm and the training data; if a model has been pre-trained, the relationship between the newly obtained training data and the pre-trained model is evaluated, with continuous monitoring to decide whether to update the model. The monitoring interval depends on the data source used to train the model, in particular whether its data are high-frequency and prone to change.
  • The model manager 203 is configured to fine-tune the first target model using the target training set corresponding to the first target model when the first target model satisfies a certain preset requirement (which may be referred to as the second preset requirement), so as to obtain a second target model (the second target model is essentially the first target model with updated model parameters).
  • the first target model is one of the n models
  • the target training set is one of the m training sets.
  • The model parameters of the fine-tuned second target model can be passed to the model evaluator 204 for model performance evaluation.
  • The first target model meeting the second preset requirement includes but is not limited to: 1) the performance of the first target model starts to decline; 2) the performance of the first target model not only declines, but the degree of decline reaches a certain preset value (which may be called the second preset threshold); 3) the real-time performance of the first target model is evaluated and its next-period performance predicted, for example, by the model evaluator 204; if the predicted probability of performance degradation of the first target model reaches a certain preset value (which may be called the third preset threshold), for example, an 80% probability that the performance of the first target model will decline, the requirement is met; 4) the continuous running time of the first target model reaches a certain preset duration, for example, the continuous running time of the first target model has reached 30 minutes.
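The second preset requirement thus amounts to a disjunction of trigger conditions; a schematic check, with all threshold values invented for illustration, might look like:

```python
def needs_finetune(perf_drop, predicted_drop_prob, runtime_min,
                   drop_threshold=0.1,     # second preset threshold (illustrative)
                   prob_threshold=0.8,     # third preset threshold (illustrative)
                   max_runtime_min=30):    # preset duration (illustrative)
    """Return True if any fine-tuning trigger fires for the first target model."""
    return (perf_drop >= drop_threshold
            or predicted_drop_prob >= prob_threshold
            or runtime_min >= max_runtime_min)
```

The actual thresholds would be configuration values of the model manager rather than hard-coded defaults.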
  • the model manager 203 integrates commonly used machine learning capabilities to provide a unified application access interface and support management and scheduling of learning models. Specifically, the model manager 203 generates a better model according to the training data updated by the training data collector 202 to conform to the current system running state.
  • The invocation logic of the model manager 203 is: after receiving the training data, the model manager 203 judges whether a model needs to be updated; if so, after the model is updated, the model parameters of the model are passed to the model evaluator 204.
  • The model evaluator 204 is used to evaluate the performance of the obtained second target model and, when the performance of the second target model satisfies a certain preset requirement (which may be referred to as the third preset requirement), update the first target model in the optimizer 201 to the second target model.
  • the update process may specifically be: the model evaluator 204 sends the model parameters of the second target model to the optimizer 201, and the optimizer 201 assigns the received updated model parameters to the first target model, Thus the second target model is obtained.
  • the model evaluator 204 may be a performance prediction model based on graph embedding.
  • The second target model meeting the third preset requirement may include but is not limited to: 1) the performance of the second target model is improved over the performance of the first target model by a certain preset value (which may be called the fourth preset threshold). As an example, the fourth preset threshold can be zero, indicating that as long as the performance of the second target model reaches the performance level of the original first target model, the second target model is considered to satisfy the third preset requirement; as another example, the fourth preset threshold can also be a certain value or ratio greater than zero, indicating that the second target model is considered to meet the third preset requirement only when its performance has improved over that of the original first target model to a certain extent; 2) the performance of the second target model is improved over the performance of the native kernel components in the database by a fifth preset threshold. That is, the performance improvement of the second target model over the traditional database algorithm is verified; if the improvement reaches the threshold, the model actually used by the corresponding module of the database is replaced, otherwise the traditional database algorithm is still used to execute the target physical plan.
  • The value of the fifth preset threshold may be zero, or a certain value or ratio greater than zero. For details, refer to the description of the first manner above, which will not be repeated here.
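The comparisons against the fourth and fifth preset thresholds can be condensed into a single decision function (the function name and threshold values are invented; the patent does not prescribe this exact form):

```python
def evaluate_update(new_perf, old_perf, native_perf,
                    fourth_threshold=0.0,    # margin over the previous learned model
                    fifth_threshold=0.05):   # margin over the native kernel component
    """Decide what to do with the fine-tuned (second target) model."""
    if new_perf - old_perf < fourth_threshold:
        return "retrain"       # fine-tuning did not recover the old performance level
    if new_perf - native_perf < fifth_threshold:
        return "use_native"    # fall back to the traditional kernel algorithm
    return "deploy"            # replace the first target model with the second
```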
  • In addition, the model evaluator 204 is also used to trigger the database to generate the final target physical plan to be executed using the native kernel components in the database; for example, the index selection module enables the traditional hill-climbing algorithm to create an index for executing the logic of the SQL statement. That is, if the performance of the fine-tuned second target model still does not meet the requirements, the traditional kernel-component algorithms of the database are used to generate the target physical plan. Since the native kernel components of the database are not deleted but coexist with the newly added optimizer in the database software, during the operation of the database, whichever approach improves database performance is dynamically selected in real time, thereby improving overall database performance.
  • When the model evaluator 204 triggers the database to use the native kernel components to generate the final target physical plan to be executed, it may also feed back the information that the model update failed (that is, that the performance of the second target model does not meet the third preset requirement) to the model manager 203, so that the model manager 203 can adjust the fine-tuning strategy for the first target model based on this information, providing a reference for subsequent model training strategies.
  • The purpose of the model evaluator is to verify whether a model is effective for the workload. If the database uses a learning-type model, the model evaluator 204 can be used to predict the performance of the model. For the models deployed in the database management system provided by the embodiments of the present application, performance prediction can be performed through the model evaluator 204. If the performance of the model improves (for example, the performance improvement reaches a certain threshold), the new model obtained (that is, the original model with updated model parameters) is marked as the best model and can actually be deployed to the database management system; otherwise, it is marked as a model needing update, and deployment is abandoned.
  • The invocation logic of the model evaluator 204 is: the model evaluator 204 obtains the latest model generated by the model manager 203 and verifies whether the model is stable and reliable and can improve system performance. The verification result is fed back to the model manager 203, which marks the model as the best model or as needing to be re-updated.
  • The evaluator evaluates the performance of the models deployed in the database management system constructed in the embodiments of the present application (such as the models included in the optimizer and the suggester), and checks whether a performance improvement can be obtained. If a new model does not improve performance, deployment of that model is abandoned.
  • the model evaluator may be implemented based on GNN, and its structure is shown in FIG. 4 . Its core steps are as follows:
  • The estimated execution effect can be given by combining the load characteristics for the old and new models, and the two are then compared to determine whether the new model is effective.
  • the embodiment of the present application proposes a graph compression algorithm, which deletes redundant vertices and merges similar vertices.
  • the specific implementation process is as follows:
  • A graph model is used to capture the workload characteristics, where the vertices represent operator features extracted from the query plan, and an edge between two operators represents the query correlation and resource competition between them.
  • The performance prediction model feeds these features into the prediction model; here, this application proposes a graph embedding algorithm that embeds graph features (e.g., operator features and K-hop neighbors) at the operator level, and constructs a deep learning model to predict query performance.
  • The graph compression algorithm in the load-graph optimization program reduces the size of the load graph by merging nodes with overlapping time.
  • The method used is to first cluster the nodes whose execution time ranges overlap, and then, within each cluster, merge the nodes that have no edge relationship using fully connected subgraphs (cliques).
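The time-overlap clustering step can be illustrated with a greedy interval-clustering sketch (names invented; the subsequent clique-based merge within each cluster is omitted here):

```python
def cluster_by_time(intervals):
    """Cluster named (start, end) execution intervals that overlap transitively.

    Nodes are scanned in start order; a node joins the current cluster if its
    interval begins before the cluster's running end time.
    """
    clusters = []
    for name, iv in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        if clusters and iv[0] < clusters[-1]["end"]:
            clusters[-1]["nodes"].append(name)
            clusters[-1]["end"] = max(clusters[-1]["end"], iv[1])
        else:
            clusters.append({"nodes": [name], "end": iv[1]})
    return [c["nodes"] for c in clusters]

# v1 and v2 overlap in time, v3 runs separately.
groups = cluster_by_time({"v1": (0, 5), "v2": (3, 8), "v3": (10, 12)})
```

Only nodes inside the same group would then be candidates for merging in the compressed workload graph.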
  • the database management system may further include a suggester 205, and the suggester 205 includes p models, where p ≥ 1.
  • The suggester 205 is used to discover abnormal conditions (that is, find abnormal data) in the running data of processes in the database (such as CPU utilization, user response time, etc.), diagnose the cause of the abnormality based on the obtained abnormal data, and then invoke the optimization module corresponding to that cause (the optimization module is also located in the suggester 205, and its function is to optimize the parameters of the database), so as to reduce the probability of anomalies appearing in the running data of processes in the database.
  • The above-mentioned first target model may be not only any one of the n models in the optimizer 201, but also any one of the p models in the suggester 205.
  • The suggester 205 may specifically include three models, which may be called the codec, model D, and model E, respectively used to perform the three steps of self-monitoring, self-diagnosis, and self-optimization. It should be noted that in other implementations of this application, the suggester 205 may include more or fewer models to realize the self-monitoring, self-diagnosis, and self-optimization of the database; the three models here are for illustration only, and details will not be repeated later.
  • The suggester 205 uses the codec to encode and then decode the running data of processes in the database to obtain reconstructed data, and compares the reconstructed data with the running data input to the codec to obtain abnormal data.
  • The principle of using the codec to obtain abnormal data is: the codec can restore normal original data but not abnormal original data, so that, after the input original data is encoded and then decoded, comparing the reconstructed data with the original data reveals whether abnormal data exist.
  • If the running data belongs to system indicator data (such as lock conflicts), the suggester 205 can further diagnose the cause of the abnormality based on the abnormal data through model D, where model D is a model constructed based on a deep learning algorithm; for example, model D may include an LSTM model and a classifier.
  • Specifically, the suggester 205 calls the LSTM model to encode the discovered abnormal data into a compressed vector (that is, a dimension-reduced or dimension-raised vector), and then uses a learning-type classifier (such as a binary or multi-class classifier) to infer the corresponding root cause (such as a database backup operation). If the running data belongs to query indicator data (such as average latency), the suggester 205 can further use model E to diagnose the cause of the abnormality based on the abnormal data, where model E is a model constructed based on a deep learning algorithm.
  • the model E may include a Tree-LSTM model and a softmax function.
  • Specifically, the suggester 205 calls the Tree-LSTM model to encode a slow query (that is, a query whose execution time is long), locates the physical operator (that is, the execution operator) operation causing the exception, and then uses the softmax function to identify the root cause of the abnormality.
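With the learned encoder and classifier stubbed out, the diagnosis step reduces to scoring candidate root causes and taking the softmax argmax; everything below other than the softmax formula itself is an invented placeholder for illustration:

```python
import math

ROOT_CAUSES = ["backup_running", "missing_index", "lock_conflict"]  # toy label set

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def diagnose(anomaly, encode, score):
    """encode: anomaly -> feature vector; score: vector -> per-cause scores."""
    probs = softmax(score(encode(anomaly)))
    return ROOT_CAUSES[probs.index(max(probs))]

# Toy stubs standing in for the learned Tree-LSTM encoder and classifier head.
cause = diagnose({"kind": "slow_query", "latency": 4.2},
                 encode=lambda a: [a["latency"]],
                 score=lambda v: [0.1, 2.0 if v[0] > 1 else 0.0, 0.3])
```

In the patent's design, `encode` would be the Tree-LSTM over the query plan (or the LSTM over system metrics) and `score` a trained classifier, but the dispatch shape is the same.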
  • After the suggester 205 finds the root cause of the data anomaly through self-monitoring and self-diagnosis, it selects the corresponding optimization module according to the root cause of the database system's performance degradation. For example, if the performance degradation is caused by missing indexes, the suggester 205 can call the index selection module to build new indexes based on deep reinforcement learning, so that the performance of the query load can be improved; if the root cause lies in parameter settings, parameter tuning based on empirical rules or reinforcement learning can be performed at the SQL level, session level, and system level. The purpose of optimization is to keep abnormal running data in the system to a minimum.
  • One optimization module may correspond to a series of (that is, multiple) root causes, or one optimization module may correspond to one root cause; this application does not limit this.
  • Table 1 is a schematic diagram of the corresponding relationship between optimization modules in the suggester 205 and some root causes.
  • the suggester 205 is mainly used to implement the following three functions:
  • Self-monitoring: monitors database status and provides running data (such as CPU usage, response time, and running logs) while the database is running.
  • Specifically, this application utilizes a codec to automatically detect anomalies based on data distributions and metric correlations: the running data is converted into a low-dimensional representation by the encoder, and the decoder restores the representation; data that the codec cannot reconstruct well are considered anomalous.
  • Self-diagnosis aims at automatically diagnosing anomalies and discovering the root causes of abnormal data.
  • If the abnormal data is system indicator data (for example, a lock conflict), the discovered abnormal data is encoded into a compressed vector (that is, a dimension-reduced or dimension-raised vector) by calling the LSTM model, and a learning-type classifier (such as a binary or multi-class classifier) is then used to infer the corresponding root cause (such as a database backup operation);
  • If the abnormal data is query indicator data (such as a slow query), the slow query is encoded by calling the Tree-LSTM model, the physical operator (that is, the execution operator) operation causing the exception is located, and the softmax function is then used to identify the root cause of the exception.
  • Self-tuning automatically optimizes the database for query workloads, e.g. index/view recommendations.
  • learned view recommendation utilizes an encoder-decoder model to automatically recommend views.
  • Self-optimization targets the database system itself; the learning-type parameter tuning module uses deep reinforcement learning technology to adjust parameter values.
  • the embodiment of the present application can use the Actor-Critic model to automatically select appropriate parameter values, and can support SQL-level, session-level and system-level parameter tuning.
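A skeletal Actor-Critic update of this kind can be sketched as follows (toy numbers and invented knob names; a real tuner would use neural networks for both components and a full reward-plus-discounted-next-state TD target over observed database performance):

```python
class Critic:
    """Scalar state-value estimate updated by the TD error."""
    def __init__(self):
        self.value = 0.0
    def td_error(self, reward, next_value=0.0, gamma=0.9):
        return reward + gamma * next_value - self.value
    def update(self, td, lr=0.1):
        self.value += lr * td

class Actor:
    """Per-action preferences; the TD error nudges the taken action's preference."""
    def __init__(self, actions):
        self.prefs = {a: 0.0 for a in actions}
    def act(self):
        return max(self.prefs, key=self.prefs.get)
    def update(self, action, td, lr=0.1):
        self.prefs[action] += lr * td

actor = Actor(["raise_work_mem", "lower_work_mem"])   # invented knob actions
critic = Critic()
action = actor.act()
td = critic.td_error(reward=1.0)    # the knob change improved performance
critic.update(td)
actor.update(action, td)
# A positive TD error raises both the value estimate and the taken action's preference.
```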
  • the invocation logic of the suggester 205 is as follows: first, dynamically collect database and query execution status indicators, and then use the self-monitoring module (ie codec) to find abnormal data.
  • the self-diagnosis module uses the system diagnosis function (namely model D) and the query diagnosis function (namely model E) to find the root cause of database performance degradation, and then specifies the self-optimization module to perform corresponding optimization functions. For example, if the root cause is that no index is established for the accessed column, the self-diagnosis module will call the index selection module of the self-optimization module for optimization.
  • the module function of the suggester 205 is to discover possible problems in the running process of the database, and perform diagnosis and optimization. Therefore, in some implementation manners of the present application, the suggester 103 may not be needed, and the present application does not limit this.
  • As shown in FIG. 5, there are three parts: self-monitoring, self-diagnosis, and self-optimization. Among them, self-monitoring can judge, according to the performance indicators in the running data of processes in the database, whether problems exist in the database's past, present, or future, identifying abnormal or potentially abnormal states of the database. By exploring the abnormal states of the database, the self-diagnosis and self-optimization functions are used to solve the actual problems of the database.
  • the advisor includes a self-monitoring module, a self-diagnosis module, and a self-optimization module, which are used to realize self-monitoring, self-diagnosis, and self-optimization respectively.
  • the core steps are as follows:
  • the self-monitoring module continuously collects database performance indicators. When abnormalities occur inside or outside the database, it can be reflected through corresponding indicators and system logs. Therefore, openGauss performs real-time anomaly monitoring and discovery by analyzing database and operating system indicators. The specific process is as follows:
  • the training data collector continuously collects indicators and logs from the database and operating system, such as QPS, operation logs, etc., and then puts these data together to form time series data.
  • the reconstruction-based algorithm is used to find anomalies, that is, normal time series data always have a regular change pattern, and the abnormal change pattern is likely to be a system anomaly.
  • the embodiment of the present application adopts an LSTM-based autoencoder with an attention layer.
  • Raw time series data is encoded into a low-dimensional representation, and a decoder parses the representation and attempts to recover the original data.
  • the training loss is the reconstruction quality.
  • the model learns the distribution of these multidimensional data and acquires the ability to reconstruct. Data that cannot be reconstructed (errors exceed a threshold) are reported as anomalies.
  • the embodiment of the present application adopts the statistical method "extreme value theory" to determine the dynamic threshold.
  • the user sets the system sensitivity, for example to 1% or 5%, and the corresponding threshold is calculated based on historical data: the present invention first normalizes the training data, and then inputs the processed data into the time-series autoencoder to update the model parameters. After the model acquires the ability to reconstruct normal database indicators, openGauss collects the reconstruction errors and calculates the threshold.
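  • the threshold-calibration and anomaly-reporting steps above can be sketched in a few lines of pure Python; the quantile cut below is an illustrative stand-in for the extreme-value-theory fit, and the reconstruction errors would in practice come from the LSTM autoencoder.

```python
def calibrate_threshold(reconstruction_errors, sensitivity=0.01):
    """Derive a dynamic anomaly threshold from historical reconstruction
    errors, leaving roughly `sensitivity` of the history above it (a simple
    quantile cut standing in for the extreme-value-theory fit)."""
    errors = sorted(reconstruction_errors)
    cut = int(len(errors) * (1.0 - sensitivity))
    return errors[min(cut, len(errors) - 1)]

def detect_anomalies(errors, threshold):
    """Report indices of points whose reconstruction error exceeds the threshold."""
    return [i for i, e in enumerate(errors) if e > threshold]

# Normal errors cluster low; the 0.9 spike simulates data the autoencoder
# cannot reconstruct and is therefore reported as an anomaly.
history = [0.01, 0.02, 0.015, 0.012, 0.018, 0.011, 0.02, 0.013, 0.017, 0.014]
threshold = calibrate_threshold(history, sensitivity=0.1)
print(detect_anomalies([0.015, 0.9, 0.012], threshold))  # → [1]
```
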
  • step 2: if no abnormality is found, wait for a period of time (that is, the preset duration) and repeat step 1); if an abnormality is found, execute step 3).
  • the self-diagnosis module is called to conduct root cause analysis.
  • the self-diagnosis module judges the found faults, and if there are indeed problems, it gives the root cause of the problem at the system level or SQL statement level.
  • the specific process is as follows:
  • the self-diagnosis function of the database can identify the root causes of faults or abnormalities at the system level and SQL statement level.
  • the system-level fault analysis method is realized by the LSTM+KNN algorithm.
  • the root cause of faults at the SQL statement level is identified by the Tree-LSTM algorithm. After locating the root cause of the fault, the self-optimization function is called and corresponding optimization suggestions are given to solve the problem.
  • the self-optimization module includes optimizing parameter configuration according to the characteristics of the database system.
  • parameter recommendation is realized through deep reinforcement learning. First, the embodiment of the present application models the database parameter configuration and its corresponding performance through historical learning, that is, it searches for the best-performing parameters in the search space composed of the selected parameters. Then, the deep reinforcement learning model takes the database state and load characteristics as the input state and, according to the parameter-tuning experience learned from historical data, selects an appropriate parameter configuration as the output action, thus giving the optimal database parameter optimization scheme.
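  • as an illustrative sketch (not the deep reinforcement learning agent of the embodiment), the search over a discrete knob space can be outlined with an epsilon-greedy loop; the knob names and the toy performance function are hypothetical.

```python
import random

def tune_knobs(candidates, measure, episodes=200, epsilon=0.5, seed=0):
    """Epsilon-greedy search over a discrete knob search space: mostly
    exploit the best configuration seen so far, occasionally explore a
    random one. A real agent would also update a learned policy from each
    (state, action, reward) transition."""
    rng = random.Random(seed)
    best, best_perf = None, float("-inf")
    for _ in range(episodes):
        if best is None or rng.random() < epsilon:
            cfg = rng.choice(candidates)   # explore
        else:
            cfg = best                     # exploit
        perf = measure(cfg)                # reward: observed performance
        if perf > best_perf:
            best, best_perf = cfg, perf
    return best

# Hypothetical knob space and performance model (higher is better; the
# optimum is shared_buffers_mb=256, work_mem_mb=16).
space = [{"shared_buffers_mb": b, "work_mem_mb": w}
         for b in (128, 256, 512) for w in (4, 16, 64)]
score = lambda c: -abs(c["shared_buffers_mb"] - 256) - abs(c["work_mem_mb"] - 16)
print(tune_knobs(space, score))
```
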
  • the self-optimization module also includes tuning for database SQL statements, for example, materialized view recommendation and index recommendation.
  • materialized view recommendation is implemented through RNN and reinforcement learning.
  • Index recommendation refers to the index recommendation for the load level, and provides the optimal index configuration scheme adapted to it according to the user's addition, deletion, query and modification ratio.
  • Figure 6 is a schematic diagram of the comparison between the optimizer and the three rewriting strategies provided by the embodiment of the present application.
  • the embodiment of the present application compares the query rewriting in openGauss with three rewriting strategies (random rewriting, top-down rewriting, and heuristic rewriting).
  • the embodiment of the present application extracts 82 rewriting rules from the query optimization engine Calcite, and rewrites the query with corresponding strategies.
  • the embodiment of the present application uses the tool SQL-smith to generate 15,750 and 10,673 slow queries (>1s) for TPC-H and JOB respectively.
  • this rewriting strategy outperforms the other methods in all cases: the execution time on TPC-H is reduced by more than 49.7%, and that on JOB by more than 36.8%.
  • the reason is mainly twofold: First, openGauss explores a rewrite order that is less expensive to execute than the default top-down order in PostgreSQL. For example, with outer joins, PostgreSQL cannot push down predicates to input tables, while openGauss solves the problem by first converting outer joins to inner joins and then pushing down predicates. Second, the estimation model in openGauss predicts potential cost reductions, from which openGauss chooses a rewrite order with lower execution overhead. In addition, openGauss works better on TPC-H than JOB, because TPC-H query contains many subqueries that can be optimized by query rewriting, while multi-joins in JOB query will be further optimized by plan enumerator.
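  • the cost-guided choice of rewrite order described above can be sketched as a greedy loop (a deliberate simplification of the MCTS-based rewriter); the rule names and benefit figures below are hypothetical, mirroring the outer-join example in which predicate push-down only pays off once the outer join has been converted.

```python
def choose_rewrite_order(query_cost, rules, estimate_benefit):
    """Greedy stand-in for the MCTS-based rewriter: at each step apply the
    rule with the largest predicted cost reduction, stopping when no rule
    helps. `estimate_benefit(rule, cost)` plays the role of the learned
    cost-reduction model."""
    order, cost = [], query_cost
    while True:
        scored = [(estimate_benefit(r, cost), r) for r in rules if r not in order]
        scored = [(g, r) for g, r in scored if g > 0]
        if not scored:
            return order, cost
        gain, rule = max(scored)
        order.append(rule)
        cost -= gain

def benefit(rule, cost):
    # Hypothetical model: push-down is only beneficial after the outer join
    # has been rewritten (i.e., once the cost has already dropped below 100).
    if rule == "outer_to_inner_join":
        return 10.0
    if rule == "predicate_pushdown" and cost < 100.0:
        return 50.0
    return 0.0

order, final_cost = choose_rewrite_order(
    100.0, ["predicate_pushdown", "outer_to_inner_join"], benefit)
print(order, final_cost)  # → ['outer_to_inner_join', 'predicate_pushdown'] 40.0
```
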
  • the optimizer 201 included in the database management system provided by the embodiment of the present application can perform fine-grained optimization during the execution of SQL statements according to the optimization method given by the AI model, improving the execution efficiency of SQL statements and the performance of the database.
  • Table 2 and Table 3 compare the suggester provided by the embodiment of this application with two index strategies. Taking index selection as an example, the present invention conducted experiments based on TPC-H and TPC-C, comparing the suggester of the embodiment of the application with the default index and a manually designed index. The results are shown in Table 2 and Table 3.
  • the index selection algorithm of the present application outperforms the default and manual indexes on both workloads. This is because the index selection algorithm encodes system statistics into the state representation and is able to optimize the index selection strategy based on historical data in order to dynamically update the index configuration.
  •                                          TPC-H (s)    TPC-C (tpmC)
      Database management system openGauss   122.9        10202
      Database administrator (DBA)           130.1        10001
      Default setting                        140.8        9700
  • the suggester 205 included in the database management system provided by the embodiment of the present application can discover in time whether there are faults or abnormalities in the database's past, present, or future, give the root causes of failures according to business types and characteristics, and provide optimal optimization and configuration solutions.
  • Figure 7 is a schematic diagram of the comparison between the model evaluator provided by the embodiment of the present application and the two known performance evaluation methods of BAL and DL.
  • BAL estimates the average buffer access latency and uses linear regression to predict the query latency of concurrent queries.
  • DL uses a neural network designed according to the query plan structure to predict the performance of a single query.
  • the embodiment of this application compares the prediction accuracy and prediction time on JOB, with the results shown in Figure 7. From the comparison results, it can be seen that the error rate of the model evaluator provided by the embodiment of this application is the lowest: about 29.9 times lower than that of BAL and 22.5 times lower than that of DL.
  • the workload graph in the model evaluator encodes concurrency factors such as resource contention, which increase the query latency of JOB by more than 20% compared with serial execution.
  • BAL collects buffer access latencies, while DL relies on a single query function.
  • openGauss utilizes a graph embedding network to directly map structural information to performance factors, which improves generality when workloads vary. Instead, BAL uses a linear regression approach that requires many statistical samples for a single workload. In addition, it can be seen from Figure 7 that the prediction latency of the model evaluator provided by the embodiment of the present application is less than that of BAL and DL, and when the concurrency level increases, the prediction latency of the model evaluator remains relatively stable.
  • the model evaluator provided in the embodiment of this application predicts the execution time of all vertices simultaneously. It embeds the localized graph for all vertices in the workload graph, so the total prediction time for the workload is close to the time of predicting the vertex with the largest localized graph.
  • for BAL, prediction takes the longest because it predicts performance while executing the workload.
  • for DL, intermediate data features are propagated through the query plan tree in a bottom-up manner, which takes relatively longer than openGauss.
  • the model evaluator 204 included in the database management system provided by the embodiment of the present application can effectively and timely check whether the new model is valid, and if it is effective, it will be deployed; otherwise, the model update will be abandoned.
  • the training data collector 202 and the model manager 203 do not exist in the known database management system, and these two modules included in the database management system constructed in this application can ensure the reliability of data processing.
  • the embodiment of the present application provides an autonomous database framework, which implements a self-learning kernel and a suggester based on machine learning algorithms and expert experience, to build comprehensive autonomous functions for the database.
  • the learning-type optimizer built into the database kernel in the database management system constructed in the embodiment of the present application may specifically include an MCTS-based rewriter, a Tree-LSTM-based cost estimator, and an RL-based plan generator, realizing efficient query optimization of the optimizer to meet the demands of multi-scenario business;
  • the database management system provided by the embodiment of the present invention may also include a learning-type suggester for the database, which, based on machine learning technology, realizes automatic abnormality monitoring, automatic system diagnosis, automatic slow-query diagnosis, and automatic performance optimization (such as parameter tuning, index recommendation, and view recommendation), satisfying customers with one-click operation and maintenance operations and improving operation and maintenance efficiency and database execution efficiency;
  • the database management system provided by the embodiment of the present invention may also include an efficient model evaluator, which, based on machine learning technology, estimates the performance of the models deployed in the database management system, judges the benefits brought by applying the corresponding model, and ensures that the database management system always runs with high performance and high reliability;
  • the training data collector automatically collects the running data of processes in the database, including database running indicators, query logs, system logs, etc., and uses this information to generate training data for the models deployed in the database management system; the model manager provides a unified interface to manage and control model versions, and to dynamically update and replace the models used by each module.
  • Each module of a traditional database is based on a classic heuristic or rule-defined algorithm. However, after replacing each module with a machine learning model, each machine learning model needs to collect data and train the model, and update the model when the scene changes. If the above operations are performed separately, the training and management costs of the model are very high.
  • the embodiment of the present application provides a training data collector and a model manager with a unified interface to evaluate the availability of the model and automatically update the model according to changes in the collected information.
  • Figure 8 is a schematic flow chart of the data processing method provided by the embodiment of the present application; the method may specifically include the following steps:
  • the computer device receives the SQL statement sent by the client to the database.
  • the database is deployed in the computer device.
  • the database includes an optimizer and native kernel components.
  • the optimizer includes n models, where n ≥ 1.
  • a local computer device receives the SQL statement sent by the client device to the database deployed in the computer device, wherein the database includes an optimizer and a database native kernel component, and the optimizer includes n models, n ≥ 1.
  • the computer device obtains the target physical plan through the n models included in the optimizer according to the SQL statement, the target physical plan being a physical plan whose execution cost meets the first preset requirement, and the first target model is one of the n models.
  • after the computer device receives the SQL statement sent by the client device, it first judges whether any of the n models in the optimizer fails to meet a certain preset requirement (which may be called the second preset requirement). If the first target model (that is, one of the n models) does not meet the second preset requirement, the computer device obtains the target physical plan through the n models included in the optimizer according to the SQL statement, wherein the target physical plan is a physical plan whose execution cost meets the first preset requirement.
  • the execution overhead meeting the first preset requirement includes but is not limited to: 1) the execution cost of the target physical plan is the lowest among q execution costs, the q execution costs being the costs corresponding to q physical plans generated based on the SQL statement input into the database, one physical plan corresponding to one execution cost, where q ≥ 1; 2) the execution cost of the target physical plan is lower than a certain preset value (which may be called the first preset threshold).
  • the following takes the case where the execution cost of the target physical plan is the lowest among the q execution costs as the case where the execution cost meets the first preset requirement, and details will not be repeated later.
  • the failure of the first target model to meet the second preset requirement may include but is not limited to: 1) the performance of the first target model has not degraded; 2) although the performance of the first target model has declined, the degree of decline has not reached a certain preset value (which may be called the second preset threshold); 3) the real-time performance of the first target model is evaluated and its future performance is predicted, and the predicted probability of performance degradation does not reach a certain preset value (which may be called the third preset threshold), for example the predicted probability of decline does not reach 80%; 4) the duration of continuous running of the first target model has not reached a certain preset duration, for example 30 minutes.
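  • taken together with their converse, the conditions above amount to a simple predicate deciding whether the first target model meets the second preset requirement (and should therefore be fine-tuned); the threshold values below are illustrative, not prescribed by the embodiment.

```python
def meets_second_preset_requirement(model_stats,
                                    degradation_threshold=0.1,   # second preset threshold
                                    probability_threshold=0.8,   # third preset threshold
                                    max_runtime_minutes=30):     # preset duration
    """Return True when the first target model meets the second preset
    requirement, i.e. any of: performance drop reaches the second preset
    threshold, predicted probability of degradation reaches the third
    preset threshold, or continuous runtime reaches the preset duration."""
    return (model_stats["performance_drop"] >= degradation_threshold
            or model_stats["predicted_drop_probability"] >= probability_threshold
            or model_stats["runtime_minutes"] >= max_runtime_minutes)

healthy = {"performance_drop": 0.0, "predicted_drop_probability": 0.1,
           "runtime_minutes": 5}
stale   = {"performance_drop": 0.02, "predicted_drop_probability": 0.85,
           "runtime_minutes": 12}
print(meets_second_preset_requirement(healthy),
      meets_second_preset_requirement(stale))  # → False True
```
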
  • the optimizer may specifically include three models, which may be called model A, model B, and model C respectively, and which are used respectively to perform the steps of logical query rewriting, cost estimation, and physical plan generation.
  • the way the computer device obtains the target physical plan through the n models included in the optimizer according to the SQL statement may specifically be as follows: first, the computer device uses model A to rewrite the logical query of the SQL statement input into the database (which may also be called an SQL query) to obtain the rewritten logical plan, wherein model A is a model based on a tree search algorithm, for example a Monte Carlo tree search algorithm; then, through model B, q physical plans are generated according to the logical plan, wherein model B is a model based on a deep learning algorithm, for example a Tree-LSTM-based model, q ≥ 1; finally, through model C, the q execution costs corresponding to the q physical plans are calculated (one physical plan corresponds to one execution cost), and the final target physical plan to be executed is determined according to the q execution costs, wherein model C is a model constructed based on a reinforcement learning algorithm.
  • the optimizer may also include more or fewer models to implement the process of logical query rewriting, cost estimation, and physical plan generation.
  • the optimizer includes three models for illustration only, and details will not be described later.
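  • the three-model pipeline described above can be sketched as follows; the three callables are toy stand-ins for models A, B, and C, and the plan shapes and cost figures are hypothetical.

```python
def run_optimizer(sql, rewrite, generate_plans, estimate_cost):
    """Pipeline sketch of the learned optimizer: model A rewrites the
    logical query, model B proposes q physical plans, model C scores them,
    and the plan with the lowest estimated execution cost is chosen as the
    target physical plan."""
    logical_plan = rewrite(sql)                 # model A (tree search)
    plans = generate_plans(logical_plan)        # model B, q >= 1 plans
    costs = [estimate_cost(p) for p in plans]   # model C, one cost per plan
    return min(zip(costs, plans))[1]            # plan with lowest cost

# Toy stand-ins for the three models.
rewrite = lambda sql: sql.replace("OUTER JOIN", "JOIN")
generate = lambda lp: [("hash_join", lp), ("nested_loop", lp)]
cost = lambda plan: {"hash_join": 10.0, "nested_loop": 42.0}[plan[0]]
print(run_optimizer("SELECT * FROM a OUTER JOIN b", rewrite, generate, cost))
```
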
  • the computer device executes the target physical plan.
  • after the computer device obtains the final target physical plan, it executes the target physical plan.
  • the essence of this execution process is to use the generated target physical plan to execute the actual logic of the input SQL statement.
  • the computer device obtains the target physical plan based on the optimizer included in the database and finally executes the target physical plan.
  • the database deployed in the computer device includes an optimizer with n models, thereby replacing the traditional heuristic optimizer. By combining with machine learning technology, it converts logical queries into physical execution plans with higher execution efficiency, and can effectively solve the problems of inaccurate and complex cost evaluation and poor physical plan generation for SQL statements caused by the current database architecture.
  • the computer device can also send the running data of the processes in the database deployed in it to the suggester, and the suggester can be deployed in the computer device or in a remote device, which is not limited here.
  • the suggester can find abnormal data based on the operation data, diagnose the cause of the abnormality based on the obtained abnormal data, and finally, based on the abnormal cause, optimize the self-optimization module corresponding to that cause, so as to reduce the probability of subsequent abnormalities in the database.
  • the specific functions and calling logic of the suggester involved in the embodiment of the present application can refer to the part describing the suggester 205 in the embodiment corresponding to FIG. 2, and details will not be repeated here.
  • the computer device can also feed back the running data of the processes in the database to the suggester, and the suggester can give all-round optimization suggestions for the database based on the running data. This realizes unattended database performance monitoring and root cause identification, greatly liberating operation and maintenance manpower, and can help the database system quickly recover from abnormalities or improve performance.
  • in addition to sending the running data of the processes in the database to the suggester, the computer device may also send the running data to the training data collector.
  • the training data collector can be deployed in the computer device or in a remote device, which is not limited here.
  • after receiving the running data, the training data collector can obtain training data from the running data and construct m training sets based on these training data, where m ≥ 1.
  • the specific functions and calling logic of the training data collector involved in the embodiment of the present application can refer to the part describing the training data collector 202 in the embodiment corresponding to FIG. 2 above, and will not be repeated here.
  • the computer device can also feed back the running data of the processes in the database to the training data collector, and the training data collector can generate training data for the models involved in the database based on this running data. This realizes continuous optimization of the database system, reduces the probability of misjudgment by the database system, and provides credible autonomous operation and maintenance services.
  • the situations in which the first target model does not meet the second preset requirement include but are not limited to those listed above; conversely, the situations in which the first target model meets the second preset requirement include but are not limited to: 1) the performance of the first target model begins to decline; 2) the performance of the first target model not only declines, but the degree of decline reaches a certain preset value (which may be called the second preset threshold); 3) the performance of the first target model can be predicted by the model evaluator 204, and the predicted probability of performance degradation reaches a certain preset value (which may be called the third preset threshold), for example the predicted probability of decline reaches 80%; 4) the duration of continuous operation of the first target model reaches a certain preset duration, for example 30 minutes.
  • in this case, the computer device may further send a first instruction to the model manager (the model manager may be deployed in the computer device or in a remote device, which is not limited here), the first instruction being used to instruct the model manager to fine-tune the first target model. When the performance of the second target model satisfies a third preset requirement, the computer device receives the model parameters of the second target model sent by the model manager, wherein the second target model is obtained by the model manager fine-tuning the first target model with a target training set corresponding to the first target model, the target training set being one of the m training sets. Finally, the computer device updates the first target model to the second target model, and obtains the target physical plan through the updated n models (at this time, the updated n models no longer include the first target model, but include the second target model instead).
  • the second target model meeting the third preset requirement may include but is not limited to: 1) the performance of the second target model is improved, compared with the performance of the first target model, by a certain preset value (which may be called the fourth preset threshold). As an example, the fourth preset threshold may be zero, indicating that as long as the performance of the second target model reaches the performance level of the original first target model, the second target model is considered to satisfy the third preset requirement; as another example, the fourth preset threshold may also be a certain value or ratio greater than zero, indicating that the second target model is considered to meet the third preset requirement only when its performance has improved to a certain extent compared with the original first target model; 2)
  • the performance of the second target model is improved, compared with the performance of the native kernel components in the database, by a fifth preset threshold. That is, the performance improvement of the second target model over the traditional database algorithm is verified; if the improvement reaches a certain threshold, the model used by the corresponding module of the database is actually replaced; otherwise, the traditional database algorithm is still used to generate the target physical plan.
  • the value of the fifth preset threshold may be zero, or a certain value or ratio greater than zero; for details, refer to the first method above, which will not be repeated here.
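  • the two acceptance checks above can be sketched as a predicate; the threshold values and the conjunction of the two checks below are illustrative assumptions, since the text presents them as alternative forms of the third preset requirement.

```python
def should_deploy(new_perf, old_perf, native_perf,
                  fourth_threshold=0.0, fifth_threshold=0.05):
    """Accept the fine-tuned second target model only if it improves on the
    first target model by at least the fourth preset threshold AND on the
    native kernel component by at least the fifth preset threshold;
    otherwise the traditional algorithm keeps generating plans."""
    beats_old = new_perf - old_perf >= fourth_threshold
    beats_native = new_perf - native_perf >= fifth_threshold
    return beats_old and beats_native

print(should_deploy(1.10, 1.00, 1.02))   # improved enough on both baselines
print(should_deploy(0.99, 1.00, 1.02))   # regressed: keep the old model
```
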
  • the model manager invokes the corresponding target training set in the training data collector to fine-tune the first target model, so that the corresponding model used in the database can be dynamically updated and replaced according to the real-time operating status of the database.
  • when the performance of the second target model does not meet the third preset requirement, the computer device will also receive a second instruction sent by the model evaluator, the second instruction being used to instruct the database to use the native kernel component in the database to generate the final target physical plan, wherein the model evaluator can be deployed in the computer device or in a remote device, which is not limited here.
  • the model evaluator is used to evaluate the performance of the second target model.
  • that is, the computer device receives the second instruction from the model evaluator instructing the database to use the traditional algorithm of the database (that is, the native kernel component) to generate the target physical plan.
  • the embodiment of the present application provides multiple options for generating the target physical plan, and has flexibility.
  • FIG. 9 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the computer device 900 may specifically include: a receiving module 901, a determining module 902, and an execution module 903, wherein the receiving module 901 is used to receive the SQL statement sent by the client to the database; the determination module 902 is used to obtain the target physical plan through the n models according to the SQL statement when the first target model does not meet the second preset requirement, the target physical plan being a physical plan whose execution cost meets a first preset requirement, and the first target model being one of the n models; the execution module 903 is configured to execute the target physical plan.
  • the n models include a third model, a fourth model, and a fifth model.
  • the determination module 902 is specifically configured to: rewrite the logical query of the SQL statement through the third model to obtain the rewritten logical plan, wherein the third model is a model based on a tree search algorithm; generate q physical plans according to the logical plan through the fourth model, wherein the fourth model is a model constructed based on a deep learning algorithm, q ≥ 1; and calculate, through the fifth model, the q execution costs corresponding to the q physical plans and determine the target physical plan according to the q execution costs, one physical plan corresponding to one execution cost, wherein the fifth model is a model constructed based on a reinforcement learning algorithm.
  • the computer device 900 further includes a sending module 904, configured to: send the operation data of the processes in the database to the suggester, so that the suggester finds abnormal data based on the operation data, diagnoses the cause of the abnormality based on the abnormal data, and optimizes the self-optimization module corresponding to the abnormal cause, so as to reduce the probability of abnormalities occurring in the operating data, wherein the suggester includes p models, p ≥ 1.
  • the sending module 904 can also be configured to: send the running data of the processes in the database to the training data collector, so that the training data collector can obtain training data according to the running data and construct m training sets based on the training data, m ≥ 1.
  • the sending module 904 may also be configured to: when the first target model meets the second preset requirement, send a first instruction to the model manager, where the first instruction is used to instruct the model manager to fine-tune the first target model, the first target model being one of the n models; the receiving module 901 may also be used to: when the performance of the second target model satisfies a third preset requirement, receive the model parameters of the second target model sent by the model manager, the second target model being obtained by the model manager fine-tuning the first target model with the target training set corresponding to the first target model, the target training set being one of the m training sets; the determination module 902 may also be used to: update the first target model to the second target model, and obtain the target physical plan through the updated n models.
  • the receiving module 901 may also be configured to: when the performance of the second target model does not meet the third preset requirement, receive a second instruction sent by the model evaluator, the second instruction being used to instruct the database to use the native kernel component in the database to generate the target physical plan, the model evaluator being used to evaluate the performance of the second target model.
  • the first target model meeting the second preset requirement includes at least any one of the following: the performance of the first target model is degraded; or the degree of performance degradation of the first target model reaches the second preset threshold; or the predicted probability of performance degradation of the first target model reaches a third preset threshold; or the duration of continuous operation of the first target model satisfies a preset duration.
  • the performance of the second target model meeting the third preset requirement includes at least any one of the following: the performance of the second target model is improved by a fourth preset threshold compared with the performance of the first target model; or the performance of the second target model is improved by a fifth preset threshold compared with the performance of the native kernel components in the database.
  • the execution overhead meeting the first preset requirement includes at least any one of the following: the execution cost of the target physical plan is the lowest among q execution costs, the q execution costs being the costs corresponding to q physical plans generated based on the SQL statement, one physical plan corresponding to one execution cost, q ≥ 1; or the execution cost of the target physical plan is lower than a first preset threshold.
  • FIG. 10 is a schematic structural diagram of the computer device provided by the embodiment of the present application.
  • the computer device 1000 may deploy the computer device 900 described in the embodiment corresponding to FIG. 9, to realize the functions of the computer device 900. The computer device 1000 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 1022 and memory 1032, and one or more storage media 1030 (such as one or more mass storage devices) for storing application programs 1042 or data 1044.
  • the memory 1032 and the storage medium 1030 may be temporary storage or persistent storage.
  • the program stored in the storage medium 1030 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the computer device 1000 .
  • the central processing unit 1022 may be configured to communicate with the storage medium 1030 , and execute a series of instruction operations in the storage medium 1030 on the computer device 1000 .
  • the computer device 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the central processing unit 1022 is configured to execute the steps performed by the computer device in the embodiment corresponding to FIG. 8 .
  • the central processing unit 1022 may be used to: first, receive the SQL statement sent by the client device to the database deployed in the computer device, where the database includes an optimizer and native database kernel components, and the optimizer includes n models, n ≥ 1. After the SQL statement sent by the client device is received, it is first judged whether any of the n models in the optimizer meets a preset requirement (which may be called the second preset requirement).
  • the target physical plan is obtained through the n models included in the optimizer, where the target physical plan is a physical plan whose execution overhead meets the first preset requirement. After the final target physical plan is obtained, the target physical plan is executed; in essence, this execution uses the generated target physical plan to carry out the actual logic of the input SQL statement.
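The per-statement flow described above (check the n models, refresh any that meets the second preset requirement, then generate and execute the target physical plan) might be sketched as follows; `stale_models`, `retrain`, `replace`, `plan`, and `execute` are hypothetical interfaces, not names from the application, and the acceptance test is simplified:

```python
def handle_sql(sql, optimizer, native_kernel, model_manager):
    """Process one incoming SQL statement end to end."""
    # 1. Check whether any of the optimizer's n models meets the second
    #    preset requirement (e.g. its performance has degraded).
    for model in optimizer.stale_models():
        candidate = model_manager.retrain(model)        # fine-tune on a target training set
        if candidate.performance >= model.performance:  # simplified third preset requirement
            optimizer.replace(model, candidate)
    # 2. Obtain the target physical plan, whose execution overhead meets
    #    the first preset requirement, and execute it.
    target_plan = optimizer.plan(sql)
    return native_kernel.execute(target_plan)
```

Updating models before planning means a degraded model is never used to plan the very statement that exposed its degradation.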
  • the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place, or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between the modules indicates that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.
  • the essence of the technical solution of this application, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc, and includes several instructions for making a computer device (which may be a personal computer, a training device, a network device, etc.) execute the methods described in the various embodiments of the present application.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • when implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (solid state disk, SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a database management system, a data processing method, and a device. The system comprises: an optimizer comprising n models; a training data collector; a model manager; and a model evaluator. The optimizer is used to obtain a target physical plan by means of the n models according to an SQL statement; the training data collector is used to construct m training sets according to the running data of a database process; the model manager is used to fine-tune a first target model (which belongs to the n models and meets a preset requirement, for example performance degradation) using a target training set (which belongs to the m training sets), so as to obtain a second target model; and the model evaluator is used to evaluate the performance of the second target model, and to update the first target model to the second target model when that performance meets a preset requirement (for example a performance improvement). By combining with machine learning, the present application automatically performs database tuning, updating, and other database management tasks traditionally carried out by a database administrator, without manual intervention.
PCT/CN2022/111991 2021-08-13 2022-08-12 Database management system, data processing method, and device WO2023016537A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110930569.8 2021-08-13
CN202110930569.8A CN115705322A (zh) 2021-08-13 Database management system, data processing method, and device

Publications (1)

Publication Number Publication Date
WO2023016537A1 true WO2023016537A1 (fr) 2023-02-16

Family

ID=85180159

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/111991 WO2023016537A1 (fr) 2022-08-12 Database management system, data processing method, and device

Country Status (2)

Country Link
CN (1) CN115705322A (fr)
WO (1) WO2023016537A1 (fr)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108025B (zh) * 2023-04-14 2023-08-01 Anyuan Technology Co., Ltd. A data virtualization performance optimization method
CN116821171B (zh) * 2023-06-27 2024-04-19 Hangzhou Yunzhi Zhongqi Technology Co., Ltd. A method for generating new virtual views to accelerate computing tasks
CN116627773B (zh) * 2023-07-21 2023-09-22 Sichuan Development Environmental Science and Technology Research Institute Co., Ltd. Anomaly analysis method and system for a production-sales difference statistics platform system
CN118153065A (zh) * 2024-05-09 2024-06-07 Qingdao Guochuang Intelligent Home Appliance Research Institute Co., Ltd. Method, apparatus, and system for identifying cryptographic misuse vulnerabilities based on a graph neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100030758A1 (en) * 2008-07-30 2010-02-04 Oracle International Corporation Hybrid optimization strategies in automatic SQL tuning
CN106991116A (zh) * 2017-02-10 2017-07-28 Alibaba Group Holding Limited Optimization method and apparatus for database execution plans
CN111813888A (zh) * 2019-04-12 2020-10-23 Microsoft Technology Licensing, LLC Training a target model
CN112215357A (zh) * 2020-09-29 2021-01-12 Sany Special Purpose Vehicle Co., Ltd. Model optimization method, apparatus, device, and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG HE-BIAO, LIU LING, YANG LI-FAN: "A Study of the Automated Programming Assessment Model for SQL Based on Structure Similarity Matching", COMPUTER ENGINEERING AND SCIENCE, GUOFANG KEJI DAXUE JISUANJI XUEYUAN, CN, vol. 32, no. 11, 15 November 2010 (2010-11-15), CN, pages 92 - 96, XP009543424, ISSN: 1007-130X *

Also Published As

Publication number Publication date
CN115705322A (zh) 2023-02-17

Similar Documents

Publication Publication Date Title
US11567937B2 (en) Automated configuration parameter tuning for database performance
WO2023016537A1 (fr) Système de gestion de base de données, procédé de traitement de données et dispositif
Li et al. Qtune: A query-aware database tuning system with deep reinforcement learning
Zhang et al. An end-to-end automatic cloud database tuning system using deep reinforcement learning
Hilprecht et al. Learning a partitioning advisor for cloud databases
Van Aken et al. An inquiry into machine learning-based automatic configuration tuning services on real-world database management systems
Christophides et al. End-to-end entity resolution for big data: A survey
Zhang et al. Towards dynamic and safe configuration tuning for cloud databases
US20180032405A1 (en) Method for data protection for cloud-based service system
US20180157978A1 (en) Predicting Performance of Database Queries
Zhao et al. Queryformer: A tree transformer model for query plan representation
Raza et al. Autonomic performance prediction framework for data warehouse queries using lazy learning approach
Zhao et al. Automatic database knob tuning: a survey
CN109634924A (zh) 基于机器学习的文件系统参数自动调优方法及系统
CN113010547A (zh) 一种基于图神经网络的数据库查询优化方法及系统
Zhang et al. CDBTune+: An efficient deep reinforcement learning-based automatic cloud database tuning system
Pavlo et al. Make your database system dream of electric sheep: towards self-driving operation
Paludo Licks et al. SmartIX: A database indexing agent based on reinforcement learning
Prats et al. You only run once: spark auto-tuning from a single run
US20150356485A1 (en) Methods and systems for intelligent evolutionary optimization of workflows using big data infrastucture
Cai et al. HUNTER: an online cloud database hybrid tuning system for personalized requirements
Zhu et al. Lero: A learning-to-rank query optimizer
Doshi et al. Kepler: Robust learning for parametric query optimization
Chen et al. Leon: A new framework for ml-aided query optimization
Zou et al. Survey on learnable databases: A machine learning perspective

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22855517

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE