WO2021169231A1 - 一种数据库管理系统性能预估的方法、装置及系统 - Google Patents

一种数据库管理系统性能预估的方法、装置及系统 Download PDF

Info

Publication number
WO2021169231A1
WO2021169231A1 PCT/CN2020/113204 CN2020113204W WO2021169231A1 WO 2021169231 A1 WO2021169231 A1 WO 2021169231A1 CN 2020113204 W CN2020113204 W CN 2020113204W WO 2021169231 A1 WO2021169231 A1 WO 2021169231A1
Authority
WO
WIPO (PCT)
Prior art keywords
dbms
parameter group
vector
performance
parameter
Prior art date
Application number
PCT/CN2020/113204
Other languages
English (en)
French (fr)
Inventor
李国良
王天庆
李士福
李凌涛
李坤
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP20921945.0A priority Critical patent/EP4099178A4/en
Publication of WO2021169231A1 publication Critical patent/WO2021169231A1/zh
Priority to US17/897,008 priority patent/US11940966B2/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/217Database tuning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452Performance evaluation by statistical analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3428Benchmarking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80Database-specific techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Definitions

  • This application relates to the technical field of database management, and in particular to a method, device and system for predicting the performance of a database management system.
  • DBMS database management system
  • This method of obtaining the performance of the DBMS by interacting with the DBMS will cause interference to the DBMS, which in turn affects the performance of the services supported by the DBMS.
  • the embodiment of the present application provides a method for predicting the performance of a database management system, which can predict the performance of the DBMS without interacting with the database management system (DBMS), which is a self-tuning and self-tuning method for the DBMS.
  • DBMS database management system
  • Management, etc. provide a means of performance measurement to ensure the reliable and stable operation of the DBMS.
  • a method for predicting the performance of a database management system including: obtaining a first parameter group configured for a DBMS of the database management system, and obtaining a data volume of active data in the data managed by the DBMS, the first parameter group including at least One parameter; obtain a second parameter group, the second parameter group includes at least one parameter; determine the performance change ratio of the DBMS according to the data volume of the first parameter group, the second parameter group and the active data, and the performance change ratio indicates the use of the second parameter group After the DBMS is configured, the performance of the DBMS is compared to the degree of change when the DBMS is configured using the first parameter group.
  • the first parameter group usually includes multiple parameters (Knob), and the parameters in the first parameter group may include, for example, the buffer pool size of the storage engine (innodb_buffer_pool_size), and the buffer size of the storage engine.
  • the block size in the pool innodb_buffer_pool_chunk_size
  • the number of buffer pool instances of the storage engine innodb_buffer_pool_instances
  • the size of the temporary table tmp_table_size
  • key_buffer_size size of the key buffer
  • Active data refers to the data used when the DBMS executes historical SQL statements.
  • the parameters included in the second parameter group may correspond to the parameters in the first parameter group, and the value of each parameter in the second parameter group may be the same as or different from the value of the parameter in the first parameter group. Usually At least one parameter in the first parameter group and the second parameter group has a different value.
  • the difference between the first parameter group and the second parameter group is that the parameters in the first parameter group have been configured on the DBMS, which means that the DBMS runs using the parameters in the first parameter group.
  • the second parameter group is not configured on the DBMS, and the second parameter group may be obtained after tuning using the tuning model of DBMS parameters, or may be determined by the user.
  • the first parameter group, the second parameter group, and the data volume of the active data in the data managed by the DBMS can determine the performance change ratio of the DBMS, and there is no need to actually configure the second parameter group to the DBMS.
  • the DBMS executes the job and then observes, that is, without interacting with the DBMS, the rate of change of the DBMS can be estimated, that is, the security of the DBMS can be ensured, and the performance measurement is provided for the self-tuning and self-management of the DBMS.
  • the method further includes: obtaining a Structured Query Language (SQL) statement; estimating the first estimated execution time for the DBMS to execute the SQL statement.
  • SQL Structured Query Language
  • the SQL statement does not need to be entered into the DBMS and the SQL statement executed by the DBMS can estimate the estimated execution time of the SQL statement executed by the DBMS, which adds another item to the tuning of the DBMS Reference indicators, this can ensure a better realization of the self-tuning and self-management of the DBMS, and further ensure the reliable and stable operation of the DBMS.
  • the method further includes: estimating the performance index of the DBMS according to the performance change ratio of the DBMS, the first estimated execution time and the data volume of the active data, the performance index including the first SQL statement 2. At least one of the estimated execution time and the throughput of the DBMS, and the second estimated execution time is a modified value of the first estimated execution time.
  • the estimated execution time of the SQL statement can be further optimized, and the throughput of the DBMS when the DBMS is configured with the second parameter group can also be estimated, so that the second parameter group can be better confirmed Whether it is suitable to be configured to DBMS, if the performance indicator is not suitable, it is not configured to DBMS.
  • This is more conducive to ensuring the stability of DBMS operation and improving the stability of DBMS operation compared to the DBMS execution result after configuration to DBMS and then determining that the second parameter group is not suitable for DBMS The accuracy of DBMS parameter configuration is improved.
  • the above steps predict the performance index of the DBMS according to the performance change ratio of the DBMS, the first estimated execution time and the data volume of active data, including: the performance change ratio of the DBMS,
  • the first estimated execution time and the amount of active data are input to the performance estimation model.
  • the training samples of the performance estimation model include the historical performance change ratio of the DBMS, the execution time of the historical SQL statement, and the DBMS managed when the DBMS executes the historical SQL statement The data volume and historical performance indicators of the active data.
  • historical performance change ratio and “historical SQL statement execution time” referred to in this application refer to the data collected during the execution of the previous DBMS.
  • the performance indicators of the DBMS are estimated through the three dimensions of the DBMS performance change ratio, the first estimated execution time and the data volume of the active data, such as: execution time or throughput, etc. Adjusting the DBMS after the situation reflected by the performance indicators can determine more suitable parameters for the DBMS, which can improve the accuracy of DBMS tuning.
  • the above steps determining the performance change ratio of the DBMS according to the data volume of the first parameter group, the second parameter group and the active data, including: correcting the first parameter group according to the data volume of the active data in the DBMS Parameters related to the data volume of the active data in the first parameter group and the second parameter group; dimensionality reduction is performed on the revised first parameter group and the second parameter group to obtain the first vector and the second vector, the first parameter group It is a vector with a dimension higher than the first vector, and the second parameter group is a vector with a dimension higher than the second vector; the performance change ratio of the DBMS is determined according to the first vector and the second vector.
  • the first parameter group and the second parameter group are corrected by the data volume of the active data, so that the data volume of the active data can be fitted with the parameters of the DBMS, so as to ensure the determination of the DBMS tuning
  • the output parameters match the current situation of the DBMS to the greatest extent, so that excessive resource waste will not occur, thereby ensuring the resource utilization of the database system.
  • the above steps correcting the parameters related to the data volume of the active data in the first parameter group and the second parameter group according to the data volume of the active data in the DBMS, including: changing the first parameter
  • the target parameters related to the data volume of the active data in the group and the second parameter group are adjusted to be equal to the data volume of the active data, and the target parameter is a parameter in the first parameter group and the second parameter group that is larger than the data volume of the active data.
  • the data correction process refers to if the value of a parameter in the first parameter group or the second parameter group is greater than the data volume of the active data, then the value of the parameter is reduced to the same value as that of the active data. The amount of data is equal. Take the parameter innodb_buffer_pool_size as an example. If the value of the parameter in the first parameter group is greater than the data volume of the active data, the value of the parameter in the first parameter group is corrected to be equal to the data volume.
  • the data amount correction module can increase the speed of parameter correction in the first parameter group or the second parameter group.
  • the above steps determining the performance change ratio of the DBMS according to the first vector and the second vector, including: inputting the first vector and the second vector into the dual-parameter hybrid model, and the dual-parameter hybrid
  • the training samples of the model include multiple pairs of the first historical parameter group and the second historical parameter group that the DBMS has configured, and the performance change ratio of the DBMS corresponding to each pair of the first historical parameter group and the second historical parameter group; through dual-parameter mixing
  • the model determines the performance change rate of the DBMS.
  • the first historical parameter group is a set of parameters used by the DBMS
  • the second historical parameter group is another set of parameters used by the DBMS.
  • the first historical parameter group and the second historical parameter group form a pair, and the second historical parameter group is used.
  • the performance change ratio of the DBMS in the parameter group compared to the performance change ratio of the first historical parameter group is the performance change ratio of the DBMS corresponding to the pair of the first historical parameter group and the second historical parameter group.
  • a double knobs mixture model can quickly determine the performance change ratio of the DBMS.
  • the above steps estimating the first estimated execution time of the SQL statement executed by the DBMS, including: converting the SQL statement into a third vector; determining the complexity coefficient of the SQL statement according to the third vector ; Estimate the first estimated execution time of the SQL statement executed by the DBMS according to the complexity coefficient.
  • the complexity coefficient of the SQL statement is determined through the vector, and the execution time is further estimated through the complexity coefficient, ensuring the accuracy of the execution time estimation.
  • the SQL statement includes a sub-query statement
  • the above step: converting the SQL statement into a third vector includes: determining the sub-query statement by using the relationship between the complexity and the condition factor
  • the condition factor includes the condition type in the subquery and the table type involved in the subquery; the vector of the subquery is determined according to the complexity of the subquery; the vector of the subquery is input into the cost model to obtain the subquery Complexity coefficient, the training sample of the cost model includes the vector of historical SQL statements executed by the DBMS and the complexity coefficient of the historical SQL statement; according to the complexity coefficient of the sub-query statement, and the relationship between the sub-query statement and the upper-level query statement containing the sub-query statement The nesting relationship determines the third vector of the SQL statement.
  • the step of converting the SQL statement into the third vector can also be expressed as:
  • the SQL statement is converted into a statement template. If the SQL statement includes a sub-query statement, the statement template includes the corresponding sub-query statement At least one sub-template; use the relationship between complexity and condition factor to determine the complexity of each sub-template, the condition factor includes the condition type in the sub-template and the type of table involved; determine the sub-query according to the complexity of each sub-template The vector of the statement; input the vector of the sub-query statement into the cost model to obtain the complexity coefficient of the sub-query statement.
  • the training sample of the cost model includes the vector of the historical SQL statement executed by the DBMS and the complex coefficient of the historical SQL statement; according to the sub-query statement
  • the complexity coefficient and the recursive relationship between the sub-query statement and the upper-level query statement recursively determine the third vector of the SQL statement.
  • the preset template rule refers to which structures in the SQL statement can form a template, for example, the select*from*where structure can form a template.
  • a SQL statement can contain two or more layers of query relationships, and the statement of each layer can be understood as a sub-query statement.
  • the large template corresponding to the SQL statement can be called a statement template, where the template corresponding to the sub-query statement can be called a sub-template, for example: if the statement template includes Select*from where ⁇ select*from*where; +min ⁇ +orderby+limit , Then select*from*where in ⁇ select*from*where; +min ⁇ ; can be called a sub-template, and min can also be called a sub-template.
  • the sub-query statement may include "equal conditions” or "unequal conditions", etc. These "equal conditions” or “unequal conditions” are all condition types, and the sub-template may also involve some tables, such as: query area Forms or query forms for ethnic groups, etc., these are all form types.
  • the initial vector of the sub-query sentence can be identified as 0 in each dimension.
  • the specific value of the complexity of the sub-template can be used to modify the 0 of the corresponding dimension of the sub-template in the initial vector, thus Determine the vector of the sub-query statement, through the cost model, you can get the complexity coefficient of the sub-query statement, process it recursively, you can get the vector of the previous query statement, and so on, you can get the SQL statement
  • the vector is the third vector.
  • determining the complexity coefficient of the SQL statement according to the third vector includes: inputting the third vector into the cost model, and determining the complexity coefficient of the SQL statement through the cost model.
  • the complexity coefficient is determined through a pre-trained cost model, which can increase the speed of determining the complexity coefficient.
  • the above steps estimate the first estimated execution time of the SQL statement executed by the DBMS according to the complexity coefficient, including: according to the relationship between the complexity coefficient and the execution time, and the complexity coefficient of the SQL statement Estimate the first estimated execution time of the SQL statement executed by the DBMS.
  • the above steps estimating the first estimated execution time of the SQL statement executed by the DBMS according to the complex coefficients, including: inputting the complex coefficients into the execution time correction model, and training of the execution time correction model
  • the sample includes the complexity coefficient of the historical SQL statement executed by the DBMS and the execution time of the historical SQL statement; the execution time correction model is used to determine the first estimated execution time of the SQL statement executed by the DBMS.
  • the execution time is estimated through a pre-trained execution time correction model, which can increase the speed of the execution time estimation.
  • a second aspect of the present application provides a method for predicting the performance of a database management system, including: obtaining a first parameter group and a second parameter group, the first parameter group including at least one parameter, and the second parameter group including at least one parameter; and obtaining a DBMS
  • the data volume of the active data in the managed data; the performance change ratio of the DBMS is determined according to the data volume of the first parameter group, the second parameter group, and the active data. Compared with the degree of change when using the first parameter group to configure the DBMS.
  • any possible implementation of the foregoing first aspect can also be combined into any possible implementation of the second aspect in the second aspect.
  • the difference between the second aspect and the first aspect is that the first parameter set obtained in the second aspect may not be the first parameter set used by the DBMS, and the first parameter set in the second aspect may also be obtained through a parameter tuning model , Or user-configured.
  • a device for predicting the performance of a database management system which is used to execute the foregoing first aspect or any possible implementation of the first aspect.
  • the device includes a module or unit for executing the above-mentioned first aspect or any possible implementation of the first aspect.
  • an apparatus for predicting the performance of a database management system which is used to execute the foregoing second aspect or any possible implementation method of the second aspect.
  • the device includes a module or unit for executing the foregoing second aspect or any possible implementation of the second aspect.
  • a device for predicting the performance of a database management system may include at least one processor, memory, and communication interface.
  • the processor is coupled with the memory and the communication interface.
  • the memory is used to store instructions
  • the processor is used to execute the instructions
  • the communication interface is used to communicate with other network elements under the control of the processor.
  • a device for predicting the performance of a database management system may include at least one processor, memory, and communication interface.
  • the processor is coupled with the memory and the communication interface.
  • the memory is used to store instructions
  • the processor is used to execute the instructions
  • the communication interface is used to communicate with other network elements under the control of the processor.
  • a database system in the seventh aspect, includes the device for predicting the performance of the database management system of the third aspect or the fifth aspect, a database management system and a database.
  • a database system in an eighth aspect, includes the device for predicting the performance of the database management system of the fourth aspect or the sixth aspect, a database management system and a database.
  • a computer-readable storage medium stores a program that enables the data query device to execute the above-mentioned first aspect and any one of its various implementation modes for database management The method of system performance estimation.
  • a computer-readable storage medium stores a program that enables the data query device to execute the above-mentioned second aspect and any one of its various implementation modes for database management The method of system performance estimation.
  • a computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium; at least one processor of the device can read the computer-readable storage medium.
  • a computer-executable instruction is executed by at least one processor to enable the device to implement the method for predicting the performance of a database management system provided by the first aspect or any one of the possible implementations of the first aspect.
  • a computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium; at least one processor of the device can read the computer-readable storage medium. At least one processor executes a computer-executable instruction to enable the device to implement the method for predicting the performance of the database management system provided by the second aspect or any one of the possible implementations of the second aspect.
  • any of the above-provided devices, computer storage media, or computer program products for predicting the performance of a database management system are all used to execute the corresponding method for predicting the performance of a database management system provided above.
  • beneficial effects that can be achieved reference may be made to the beneficial effects in the corresponding methods provided above, which will not be repeated here.
  • Figure 1A is a schematic diagram of a stand-alone database system
  • Figure 1B is a schematic diagram of a cluster database system using a shared disk architecture
  • Figure 1C is a schematic diagram of a cluster database system adopting a shared nothing architecture
  • FIG. 2 is a schematic diagram of an embodiment of a method for predicting performance of a database management system provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a scene of a method for predicting performance of a database management system provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of another embodiment of a method for predicting performance of a database management system provided by an embodiment of the present application
  • FIG. 5 is a schematic diagram of another embodiment of a method for predicting performance of a database management system provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another embodiment of a method for predicting performance of a database management system provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an interface for comparing performance indicators provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an interface of the query ratio trend provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an embodiment of a database server provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of an embodiment of an apparatus for predicting the performance of a database management system provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of another embodiment of an apparatus for predicting the performance of a database management system provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of another embodiment of an apparatus for predicting performance of a database management system provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of an embodiment of a cluster database system provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of another embodiment of a database system provided by an embodiment of the present application.
  • the embodiment of the present application provides a method for predicting the performance of a database management system, which can predict the performance of the DBMS without interacting with the database management system (DBMS), which is a self-tuning and self-tuning method for the DBMS.
  • DBMS database management system
  • Management, etc. provide a means of performance measurement to ensure the reliable and stable operation of the DBMS. Detailed descriptions are given below.
  • the method provided in the embodiments of the present application may be implemented as a part of a database management system, or implemented as an independent component.
  • the method provided in the embodiments of the present application can be applied to a single-machine database system or a cluster database system.
  • the database system generally consists of the following three parts: (1) Database (DB) refers to a collection of organized and shareable data stored in a computer for a long time. The data in the database is organized, described, and stored according to a certain mathematical model, with less redundancy, higher data independence and easy scalability, and can be shared by various users.
  • Hardware including data storage required to store data, such as memory and/or disks.
  • Software including DBMS.
  • DBMS is the core software of the database system. It is used for scientifically organizing and storing data, as well as efficiently acquiring and maintaining data.
  • the database engine is the core component of the DBMS.
  • FIG. 1A is a schematic diagram of a stand-alone database system, including a database management system and a data store.
  • the database management system is used to provide services such as database query and modification.
  • the database management system stores data in the data store. .
  • the database management system and data storage are usually located on a single server, such as a symmetric multi-processor (SMP) server.
  • SMP server includes multiple processors, and all processors share resources, such as buses, memory, and I/O systems.
  • the functions of the database management system can be implemented by one or more processors executing programs in the memory.
  • FIG. 1B is a schematic diagram of a cluster database system adopting a shared-storage architecture, including multiple nodes (nodes 1-N in Figure 1B).
  • Each node is deployed with a database management system to provide users with database queries.
  • multiple database management systems store shared data in the shared data storage, and perform read and write operations on the data in the data storage through the switch.
  • the shared data storage can be a shared disk array.
  • the nodes in the cluster database system can be physical machines, such as database servers, or virtual machines (VM) or containers (containers) running on abstract hardware resources.
  • the database management system deployed in the node is a DBMS instance, and the DBMS may be a process or a thread.
  • the switch is a storage area network (SAN) switch, an Ethernet switch, an optical fiber switch, or other physical switching equipment.
  • the switch is a virtual switch.
  • FIG. 1C is a schematic diagram of a cluster database system adopting a shared-nothing architecture.
  • Each node has its own hardware resources (such as data storage), operating system, and database, and the nodes communicate through the network. Under this system, data will be distributed to each node according to the database model and application characteristics, and the query task will be divided into several parts, which will be executed in parallel on all nodes and coordinated with each other to provide database services as a whole. All communication functions are in Realized on a high-bandwidth network interconnection system.
  • the nodes can be either physical machines or virtual machines.
  • the data store of the database system includes, but is not limited to, solid state drives (SSD), disk arrays, or other types of non-transitory computer-readable media.
  • SSD solid state drives
  • the database is stored in a data storage.
  • a database system may include fewer or more components than those shown in Figures 1A-1C, or include components different from those shown in Figures 1A-1C.
  • Figures 1A-1C only The components that are more related to the implementation manner disclosed in the embodiments of the present application are shown.
  • a cluster database system can include any number of nodes.
  • the database management system function of each node can be implemented by a suitable combination of software, hardware and/or firmware running on each node.
  • a node in a cluster database system with a shared-nothing architecture that is, a database server
  • a shared-nothing architecture that is, a database server
  • the method for predicting the performance of the database management system may include the following solutions:
  • the second type Estimate the first estimated execution time of the SQL statement executed by the DBMS.
  • the third type use the first type of "DBMS performance change ratio" and the second type of "first estimated execution time", as well as the amount of active data in the data managed by the DBMS to determine the performance indicators of the DBMS.
  • an embodiment of the method for predicting the performance of a database management system provided in an embodiment of the present application includes:
  • 201 Acquire a first parameter group configured for a DBMS, and acquire a data amount of active data in data managed by the DBMS, where the first parameter group includes at least one parameter.
  • the first parameter group usually includes multiple parameters (Knob).
  • the parameters in the first parameter group may include: the buffer pool size of the storage engine (innodb_buffer_pool_size), the block size in the buffer pool of the storage engine (innodb_buffer_pool_chunk_size), and The number of engine buffer pool instances (innodb_buffer_pool_instances), temporary table size (tmp_table_size) and key buffer size (key_buffer_size), etc.
  • Active data refers to the data used when the DBMS executes historical SQL statements.
  • the parameters included in the second parameter group may correspond to the parameters in the first parameter group, and the value of each parameter in the second parameter group may be the same as or different from the value of the parameter in the first parameter group. Usually At least one parameter in the first parameter group and the second parameter group has a different value.
  • the difference between the first parameter group and the second parameter group is that each parameter in the first parameter group has been configured on the DBMS, which means that the DBMS runs using the parameters in the first parameter group.
  • the second parameter group is not configured on the DBMS, and the second parameter group may be obtained after tuning using the tuning model of DBMS parameters, or may be determined by the user.
  • the performance change ratio indicates that after the DBMS is configured using the second parameter group, the performance of the DBMS is compared with the configuration using the first parameter group The degree of change in the DBMS.
  • the variation is a general term, and the value of the performance change ratio can indicate the specific degree of change, for example, the performance is increased by 0.5 times, or decreased by 0.1 times, etc.
  • the first parameter group, the second parameter group and the data volume of the active data in the data managed by the DBMS can determine the performance change ratio of the DBMS, and there is no need to actually configure the second parameter group to DBMS, then the DBMS executes the job and then observes, that is, you can estimate the rate of change of the DBMS without interacting with the DBMS, that is, you can ensure the security of the DBMS, and provide performance metrics for the self-tuning and self-management of the DBMS.
  • the method guarantees the reliable and stable operation of the DBMS.
  • the second solution provided in the embodiments of the present application may be implemented on the basis of the first solution, or may be implemented independently of the first solution.
  • An embodiment of the method for predicting the performance of a database management system provided by the second solution includes: obtaining a structured query language SQL statement; estimating the first estimated execution time of the SQL statement executed by the DBMS.
  • the third solution provided by the embodiments of the present application is executed on the basis of the first solution and the second solution.
  • An embodiment of the method for predicting the performance of a database management system provided by the second solution includes: according to DBMS The performance change ratio, the first estimated execution time and the amount of active data are estimated to be the performance indicators of the DBMS.
  • the performance indicators include at least one of the second estimated execution time of the SQL statement and the throughput of the DBMS.
  • the second estimate The execution time is the modified value of the first estimated execution time.
  • the device for performance estimation of the database management system includes a ratio estimation model, a query-estimator and a performance estimation model.
  • the ratio prediction model is obtained by training with training samples.
  • the training samples include multiple pairs of the first historical parameter group and the second historical parameter group and the performance of the DBMS corresponding to each pair of the first historical parameter group and the second historical parameter group.
  • the rate of change, and the amount of active data when the first historical parameter group and the second historical parameter group are used to configure the DBMS.
  • the query estimation model is obtained through SQL pipeline training, and the SQL pipeline includes the execution time of the SQL statement corresponding to the SQL pipeline processed by the DBMS, or the complexity coefficient of the SQL statement.
  • the training samples of the performance estimation model include the historical performance change ratio of the DBMS, the execution time of historical SQL statements, and the data volume and historical performance indicators of the active data managed by the DBMS when the DBMS executes the historical SQL statements.
  • the aforementioned SQL pipeline and training samples can be collected by interacting with the DBMS through an agent.
  • the ratio estimation model receives the data volume of the first parameter group, the second parameter group and the active data, and then estimates it to obtain the performance change ratio of the DBMS.
  • the performance change ratio indicates the use of the second parameter group.
  • the performance of the DBMS is compared to the degree of change when the DBMS is configured with the first parameter group.
  • the query estimation model After the query estimation model receives the SQL statement input by the user, the first estimated execution time of the SQL statement executed by the DBMS can be obtained.
  • the performance estimation model obtains the performance change rate of the DBMS output by the rate estimation model, the first estimated execution time of the SQL statement, and the data volume of the active data, and then the performance indicators of the DBMS can be estimated.
  • the performance indicators include SQL At least one of the second estimated execution time of the statement and the throughput of the DBMS, and the second estimated execution time is a modified value of the first estimated execution time.
  • the above step 203 may include: correcting the parameters related to the data volume of the active data in the first parameter group and the second parameter group according to the data volume of the active data in the DBMS;
  • the parameter group is dimensionally reduced to obtain the first vector and the second vector.
  • the first parameter group is a vector with a dimension higher than the first vector
  • the second parameter group is a vector with a dimension higher than the second vector; according to the first vector and the first vector The two vectors determine the performance change ratio of the DBMS.
  • the ratio prediction model includes a data volume correction module and a double knobs mixture model (DKMM).
  • the data volume correction module may adjust the target parameter related to the data volume of the active data in the first parameter group and the second parameter group to the data volume of the active data, and the target parameter is the data volume of the active data. Parameters in the first parameter group and the second parameter group that are greater than the data amount of the active data.
  • the dual-parameter mixed model is trained based on the multiple pairs of the first historical parameter group and the second historical parameter group that have been configured for the DBMS, and the performance change ratio of the DBMS corresponding to each pair of the first historical parameter group and the second historical parameter group.
  • the first historical parameter group is a group of parameters used by the DBMS
  • the second historical parameter group is another group of parameters used by the DBMS.
  • the first historical parameter group and the second historical parameter group form a pair, and the second historical parameter group is used
  • the performance change ratio of the DBMS compared to the performance change ratio of the first historical parameter group at that time is the performance change ratio of the DBMS corresponding to the pair of the first historical parameter group and the second historical parameter group.
  • a first history parameter group that has been configured for the DBMS, a second history parameter group that has been configured for the DBMS, and the performance of the DBMS when the second history parameter group is used is compared with the performance change ratio of the first historical parameter group is used as a training sample, and when there are multiple similar training samples, the two-parameter mixed model can be obtained through a machine learning method.
  • the training model based on training samples can be implemented with reference to the prior art, and details are not described in the embodiment of the present invention.
  • the data volume correction module receives the data volume of the active data in the first parameter group, the second parameter group, and the DBMS, and then corrects the parameters related to the data volume of the active data in the first parameter group and the second parameter group. .
  • the process of data correction refers to that if the value of a parameter in the first parameter group or the second parameter group is greater than the data amount of the active data, then the value of the parameter is reduced to be equal to the data amount.
  • the value of the parameter in the first parameter group is corrected to be equal to the data volume.
  • the value of the parameter in the second parameter group is equal to the data volume. If the value of the parameter is also greater than the amount of data, it is also corrected to be equal to the amount of data, or the value of the parameter is less than the amount of data, no correction is made. No matter it is the parameter in the first parameter group or the second parameter group, only Correct the situation that is greater than the amount of data.
  • the parameters in the modified first parameter group and the second parameter group are usually high-dimensional vectors. Therefore, the scale prediction model may also include an autoencoder, so that the autoencoder can perform the correction of the first parameter group.
  • the parameter group and the second parameter group are dimensionally reduced to obtain the first vector and the second vector.
  • the first parameter group is a vector with a dimension higher than the first vector
  • the second parameter group is a vector with a dimension higher than the second vector.
  • DKMM receives the first vector and the second vector, and determines the performance change ratio of the DBMS through the neural network structure of DKMM.
  • DKMM can concatenate the first vector and the second vector into a long vector, and then determine the performance change ratio of the DBMS based on the long vector.
  • Table 1 is used to illustrate the performance change ratios estimated by the solution of the present application when the data amounts of the first parameter group, the second parameter group, and the active data take different values.
  • the first parameter group and the second parameter group of Table 1 above each include five parameters, and the five parameters may be innodb_buffer_pool_size, innodb_buffer_pool_chunk_size, innodb_buffer_pool_instances, tmp_table_size, and key_buffer_size.
  • sample 3 when the first parameter and the second parameter of the second parameter group are increased relative to the first parameter group, the value of innodb_buffer_pool_size is adjusted from 0.25 to 0.5, and the value of innodb_buffer_pool_chunk_size is adjusted from 0.1 to 0.2.
  • the amount of active data continues to increase (100GBytes)
  • the performance difference of the DBMS relative to sample 2 is not much different, and the performance change ratio is 2.3 Indicates that the performance of the DBMS when the second set of parameters is used to configure the DBMS is 2.3 times the performance of the DBMS when the first set of parameters is used to configure the DBMS.
  • the data volume of active data is taken into consideration as preprocessing, that is, parameter correction, which not only reduces input features, avoids overfitting, but also makes the prediction result sensitive to the data volume.
  • the result we output is a proportional value, which eliminates the influence of other factors (such as system status) on the result.
  • the foregoing estimation of the first estimated execution time of the SQL statement executed by the DBMS includes: converting the SQL statement into a third vector; determining the complexity coefficient of the SQL statement according to the third vector; estimating the execution of the SQL statement by the DBMS according to the complexity coefficient The first estimated execution time of.
  • the SQL statement includes a subquery statement
  • the SQL statement is converted into a third vector, including: using the relationship between the complexity and the condition factor to determine the complexity of the subquery statement.
  • the condition factor includes the condition type and the subquery in the subquery. The type of table involved; determine the vector of the sub-query statement according to the complexity of the sub-query statement; input the vector of the sub-query statement into the cost model to obtain the complexity coefficient of the sub-query statement.
  • the training sample of the cost model includes the historical SQL executed by the DBMS
  • This step converts the SQL statement into the third vector, which can also be expressed as:
  • the SQL statement is converted into a statement template. If the SQL statement includes a sub-query statement, the statement template includes the corresponding sub-query statement At least one sub-template; use the relationship between complexity and condition factor to determine the complexity of each sub-template, the condition factor includes the condition type in the sub-template and the type of table involved; determine the sub-query according to the complexity of each sub-template The vector of the statement; input the vector of the sub-query statement into the cost model to obtain the complexity coefficient of the sub-query statement.
  • the training sample of the cost model includes the vector of the historical SQL statement executed by the DBMS and the complex coefficient of the historical SQL statement; according to the sub-query statement
  • the complexity coefficient and the recursive relationship between the sub-query statement and the upper-level query statement recursively determine the third vector of the SQL statement.
  • determining the complexity coefficient of the SQL statement according to the third vector includes: inputting the third vector into the cost model; and determining the complexity coefficient of the SQL statement through the cost model.
  • the first estimated execution time of the SQL statement executed by the DBMS is estimated according to the complexity coefficient, which may include:
  • the first estimated execution time for the DBMS to execute the SQL statement is estimated. Or, input the complex coefficients into the execution time correction model.
  • the training samples of the execution time correction model include the complex coefficients of the historical SQL statements executed by the DBMS and the execution time of the historical SQL statements; the execution time correction model determines the first SQL statement executed by the DBMS Estimated execution time.
  • the query prediction model includes a vector conversion model, a cost model, and an execution time correction model.
  • the process of the above query prediction model is completed by the vector conversion model, the cost model, and the execution time correction model.
  • the cost model is obtained through training of training samples, and the training samples of the cost model include the vector of historical SQL statements executed by the DBMS and the complexity coefficients of the historical SQL statements.
  • the execution time correction model is obtained through training of training samples.
  • the training samples of the execution time correction model include the complex coefficients of the historical SQL statements executed by the DBMS and the execution time of the historical SQL statements.
  • the vector conversion model converts the SQL statement into a statement template according to a preset template rule. If the SQL statement includes a sub-query statement, the statement template includes at least one sub-template corresponding to the sub-query statement Use the relationship between complexity and condition factor to determine the complexity of each sub-template, where the condition factor includes the condition type in the sub-template and the table type involved; according to the complexity of each sub-template The degree determines the vector of the sub-query sentence.
  • the vector conversion model inputs the vector of the sub-query sentence into the cost model to obtain the complexity coefficient of the sub-query sentence.
  • the vector conversion model and the cost model recursively determine the third vector of the SQL statement based on the complexity coefficient of the sub-query statement and the recursive relationship between the sub-query statement and the upper-level query statement.
  • the cost model receives the third vector, and determines the complexity coefficient of the SQL statement through the third vector.
  • the execution time correction model receives the complexity coefficient of the SQL statement, and according to the relationship between the complexity coefficient and the execution time, estimates the first estimated execution time of the SQL statement executed by the DBMS.
  • the SQL statement is:
  • the above SQL statement is a very complex SQL statement.
  • the statement template of the SQL statement can be: Select*from where ⁇ select*from*where; +min ⁇ +orderby+limit.
  • the step of converting the above SQL statement into a vector may include first identifying the sub-template in the lowermost sub-query statement ⁇ select*from*where; +min ⁇ in the SQL statement.
  • the lowermost sub-query statement includes two sub-templates.
  • the templates are "select*from*where;” and "min”.
  • the calculation process of the complexity of each sub-template is introduced by taking "select*from*where;” as an example.
  • complex factor represents the complexity of the sub-template
  • table_num represents the number of table types involved in the sub-template
  • condition_factor represents the condition type coefficient in the sub-template
  • sum represents the sum.
  • the parameters described above represent the complexity, the number of table types, and the condition type coefficient in the uppermost template.
  • condition_factor take the where condition in the sub-template "select*from*where;” as an example.
  • the coefficients of various condition types in the sub-template “select*from*where;” can be set to the values in Table 2:
  • the vectorized value of each dimension in the sub-query statement ⁇ select*from*where; +min ⁇ is 0 when it is initialized, after calculating the complexity of the sub-template "select*from*where;” and the sub-template "min”, use The value of the complexity of these two sub-templates replaces the 0 in the corresponding dimension in the initialization vector of the sub-query statement, and the vector v1 of the sub-query statement can be obtained.
  • the vector v1 of the sub-query statement can be expressed as [0, 0,6.5,0,1,00 together].
  • the cost model After obtaining the vector v1 of the sub-query sentence, input the vector v1 into the cost model, and the cost model will output the complex coefficient c1 corresponding to the sub-query sentence, and then perform recursive calculations according to the relationship between the sub-query sentences of each layer in the SQL sentence. Repeating the above process similar to the sub-query statement ⁇ select*from*where; +min ⁇ , the third vector of the SQL statement can be calculated.
  • the execution time correction model has a relationship between complex coefficients and execution time, when the time correction model receives the complex coefficients of the SQL statement, it can predict the first estimated execution of the SQL statement executed by the DBMS based on the relationship. duration.
  • the predicted execution time obtained by the solution provided by the embodiment of the application is not much different from the actual execution time when the DBMS actually executes the SQL statement of each row.
  • the accuracy of the predicted execution time obtained by the solution provided by the application embodiment is very high, so that the automatic tuning and automatic management of the DBMS can be realized through the solution provided by this application.
  • the vector conversion model performs vector conversion on the SQL statement input by the user, and outputs the converted vector.
  • the converted vector can be expressed as [1,0,3,4,0,1,...,0], for example.
  • the cost conversion model determines the complexity coefficient of the SQL statement according to the converted vector.
  • the time correction model determines the first estimated execution duration according to the complexity coefficient determined in step 302.
  • steps 301 to 303 can be understood with reference to the corresponding content of the embodiment corresponding to FIG. 5, and will not be described in detail here.
  • the data volume correction module receives the data volume of the first parameter group, the second parameter group, and the active data, and then corrects the parameters in the first parameter group and the second parameter group according to the data volume.
  • the autoencoder performs dimensionality reduction on the modified first parameter group and the second parameter group, and converts a high-dimensional vector into a low-dimensional first vector and a second vector.
  • the DKMM determines the performance change ratio according to the first vector and the second vector.
  • steps 304 to 306 can be understood with reference to the corresponding content of the embodiment corresponding to FIG. 4, which will not be described in detail here.
  • the performance estimation model receives the first estimated execution time output in step 303, the performance change ratio output in step 306, and the data volume of active data, and then according to the first estimated execution time, performance change ratio, and active data Perform performance estimation on the amount of data to obtain performance indicators.
  • the performance indicator may include at least one of the second estimated execution duration and throughput.
  • the second estimated execution time and throughput output by the performance estimation model can be displayed on the user interface (UI) of the terminal device, and the display form can be understood with reference to FIG. 7.
  • the throughput when the first parameter group is used to configure the DBMS is much lower than the throughput when the second parameter group is used to configure the DBMS.
  • the execution time of executing one SQL statement when the DBMS is configured with the first parameter group is much longer than the execution time of executing the same SQL statement when the DBMS is configured with the second parameter group. Therefore, the parameter group used to update the DBMS can be determined based on the estimated results of this application, so that the configuration can be performed under the condition of predicting the effect, without the need to configure and then check the effect, and then modify it, which can effectively improve the DBMS parameters
  • the accuracy of the configuration improves the stability of the DBMS.
  • the throughput can also reflect the query volume per second (QPS).
  • QPS query volume per second
  • the user can also see that the second parameter group is used to configure the DBMS for a period of time.
  • the change trend of the QPS can be understood with reference to Figure 8. As shown in Figure 8, the QPS has a steady growth trend after a period of small fluctuations, indicating that the performance will become better after the second parameter group is configured with a DBMS.
  • the device for predicting the performance of the database management system described above is independent of the DBMS. In fact, the device for predicting the performance of the database management system can also be integrated on the DBMS.
  • the following describes a database server integrated on the DBMS with reference to Figure 9 .
  • an embodiment of the present application provides a database server 100, including: at least one processor 104, a non-transitory computer-readable medium 106 storing executable codes, and database management System 108.
  • the executable code when executed by the at least one processor 104, is configured to implement the components and functions of the database management system 108.
  • the non-transitory computer readable medium 106 may include one or more non-volatile memories.
  • the non-volatile memory includes a semiconductor memory device, such as erasable programmable read only memory (EPROM) , Electrically erasable read only memory (EEPROM) and flash memory; magnetic disks, such as internal hard disks or removable disks, magneto optical disks , And CD ROM and DVD-ROM.
  • the non-transitory computer-readable medium 106 may also include any device configured as a main memory.
  • the at least one processor 104 may include any type of general computing circuit or dedicated logic circuit, such as a field-programmable gate array (FPGA) or an application specific integrated circuit (ASIC).
  • the at least one processor 104 may also be one or more processors, such as a CPU, coupled to one or more semiconductor substrates.
  • the database management system 108 may be a relational database management system (relational database management system, RDBMS).
  • the database management system 108 supports structured query language (SQL).
  • SQL refers to a special programming language used to manage data stored in a relational database.
  • SQL can refer to various types of data-related languages, including, for example, data definition languages and data manipulation languages.
  • the scope of SQL can include data insertion, query, update and deletion, schema creation and modification, and data access control.
  • SQL may include descriptions related to various language elements, including clauses, expressions, predicates, queries, and statements.
  • clauses can refer to various components of statements and queries, and in some cases, clauses can be considered optional.
  • the expression can be configured to generate scalar values and/or tables including data columns and/or rows.
  • predicates can be configured to specify conditions for adjusting the effects of sentences and queries.
  • a query is a request to view, access, and/or manipulate data stored in the database.
  • the database management system 108 may receive queries in SQL format (referred to as SQL queries) from the database client 102.
  • the SQL query can also be called an SQL statement.
  • the database management system 108 generates query results corresponding to the query by accessing relevant data from the database and manipulating the relevant data, and returns the query results to the database client 102.
  • a database is a collection of data organized, described, and stored according to a certain mathematical model.
  • the database can include one or more database structures or formats, such as row storage and column storage.
  • the database is usually stored in a data storage, such as the external data storage 120 in FIG. 9 or the non-transitory computer readable medium 106. When the database is stored on the non-transitory computer readable medium 106, the database management system 108 is an in-memory database management system.
  • the database client 102 may include any type of device or application that is configured to interact with the database management system 108.
  • the database client 102 includes one or more application servers.
  • the database management system 108 includes a SQL engine 110, an execution engine 122, a storage engine 134, and a database management system performance estimation device 124 in the embodiment of the present application.
  • the SQL engine 110 generates a corresponding execution plan according to the SQL statement submitted by the client 102, such as a query, and the execution engine 122 operates according to the execution plan of the statement to generate a query result.
  • the storage engine 134 is responsible for managing table data and actual content of the index on the file system, and also managing data such as Cache, Buffer, transaction, and Log at runtime. For example, the storage engine 134 may write the execution result of the execution engine 122 into the data storage 120 through physical I/O.
  • the SQL engine 110 includes a parser 112 and an optimizer 114.
  • the parser 110 is used to perform grammatical and semantic analysis of SQL statements, expand and divide the view in the query into small query blocks.
  • the optimizer 114 generates a set of execution plans that may be used for the statement, estimates the cost of each execution plan, compares the cost of the plans, and finally selects an execution plan with the least cost.
  • the device 124 for predicting the performance of the database management system can obtain historical data of the DBMS from the SQL engine 110 and the execution engine 122, for example: historical configuration parameters of the DBMS and the execution duration of historical SQL statements.
  • the method for predicting the performance of the database management system provided by the embodiments of the present application can be implemented by a device for predicting the performance of the database management system, and the device for predicting the performance of the database management system can be integrated with the above description.
  • the DBMS 108 can also be independent of the above-described DBMS 108. No matter whether the database management system performance estimation device is integrated on the DBMS or independent of the DBMS, the database management system performance estimation device can be used with or without Based on the interaction of the DBMS, the performance of the DBMS is estimated, which provides a means of performance measurement for the self-tuning and self-management of the database, and ensures the reliable and stable operation of the DBMS.
  • the device for predicting the performance of the database management system may be one or more servers, a physical machine, or a virtual machine (VM).
  • an embodiment of an apparatus 40 for predicting the performance of a database management system includes:
  • the first obtaining unit 401 is configured to obtain a first parameter group configured for a DBMS of a database management system, and obtain a data amount of active data in data managed by the DBMS, the first parameter group includes at least one parameter;
  • the second obtaining unit 402 is configured to obtain a second parameter group, where the second parameter group includes at least one parameter;
  • the determining unit 403 is configured to determine the performance change ratio of the DBMS according to the first parameter group and the data volume of the active data obtained by the first obtaining unit 401 and the second parameter group obtained by the second obtaining unit 402, and the performance change ratio indicates the use of the first parameter group. After the DBMS is configured with the second parameter group, the performance of the DBMS changes compared to the degree of change when the DBMS is configured with the first parameter group.
  • the performance change ratio of the DBMS can be determined, without actually configuring the second parameter group Give the DBMS, and then the DBMS will perform the job and observe it, that is, you can estimate the rate of change of the DBMS without interacting with the DBMS, which can ensure the safety of the DBMS, and provide performance for the self-tuning and self-management of the DBMS.
  • the measurement method ensures the reliable and stable operation of the DBMS.
  • the device 40 further includes: a third acquiring unit 404 and a first estimating unit 405.
  • the third obtaining unit 404 is configured to obtain a structured query language SQL statement
  • the first estimating unit 405 is configured to estimate the first estimated execution time length of the SQL statement obtained by the third obtaining unit executed by the DBMS.
  • the device 40 further includes a second estimation unit 406.
  • the second estimating unit 406 is configured to estimate the DBMS performance based on the DBMS performance change ratio determined by the determining unit, the first estimated execution duration estimated by the first estimating unit, and the data volume of the active data acquired by the first acquiring unit
  • the performance indicator, the performance indicator includes at least one of the second estimated execution time of the SQL statement and the throughput of the DBMS, and the second estimated execution time is a modified value of the first estimated execution time.
  • the second estimation unit 406 is configured to: input the performance change ratio of the DBMS, the first estimated execution time and the amount of active data into the performance estimation model, and the training samples of the performance estimation model Including the historical performance change ratio of the DBMS, the execution time of historical SQL statements, and the data volume and historical performance indicators of the active data managed by the DBMS when the DBMS executes the historical SQL statements; the performance indicators of the DBMS are estimated through the performance estimation model.
  • the determining unit 403 is configured to: modify the parameters related to the data volume of the active data in the first parameter group and the second parameter group according to the data volume of the active data in the DBMS; Dimensionality reduction is performed between the group and the second parameter group to obtain the first vector and the second vector.
  • the first parameter group is a vector with a dimension higher than the first vector
  • the second parameter group is a vector with a dimension higher than the second vector
  • the first vector and the second vector determine the performance change ratio of the DBMS.
  • the determining unit 403 is configured to: adjust the target parameter related to the data volume of the active data in the first parameter group and the second parameter group to be equal to the data volume of the active data, and the target parameter is the first parameter group. Parameters in the parameter group and the second parameter group that are larger than the data amount of the active data.
  • the determining unit 403 is configured to: input the first vector and the second vector into the dual-parameter mixed model, and the training samples of the dual-parameter mixed model include multiple pairs of first historical parameter groups and The second historical parameter group and the performance change ratio of the DBMS corresponding to each pair of the first historical parameter group and the second historical parameter group; the performance change ratio of the DBMS is determined by a two-parameter hybrid model.
  • the first estimation unit 405 is configured to: convert the SQL statement into a third vector; determine the complexity coefficient of the SQL statement according to the third vector; estimate the first prediction of the SQL statement executed by the DBMS according to the complexity coefficient Estimate the execution time.
  • the SQL statement includes a sub-query statement
  • the first estimation unit 405 is configured to: use the relationship between the complexity and the condition factor to determine the complexity of the sub-query statement, and the condition factor includes the The condition type and the table type involved in the subquery; determine the vector of the subquery according to the complexity of the subquery; input the vector of the subquery into the cost model to obtain the complexity coefficient of the subquery.
  • the training sample of the cost model includes DBMS execution The vector of past historical SQL statements and the complexity coefficient of historical SQL statements; according to the complexity coefficient of the sub-query statement and the nesting relationship between the sub-query statement and the upper-level query statement containing the sub-query statement, the third SQL statement is determined vector.
  • the first estimation unit 405 is configured to: convert the SQL statement into a statement template according to a preset template rule. If the SQL statement includes a sub-query statement, the statement template includes a statement corresponding to the sub-query statement. At least one sub-template of the sub-template; determine the complexity of each sub-template by using the relationship between the complexity and the condition factor.
  • the condition factor includes the condition type in the sub-template and the table type involved; the sub-template is determined according to the complexity of each sub-template
  • the vector of the query statement; the vector of the sub-query statement is input into the cost model to obtain the complexity coefficient of the sub-query statement.
  • the training sample of the cost model includes the vector of the historical SQL statement executed by the DBMS and the complex coefficient of the historical SQL statement; according to the sub-query statement The complexity coefficient of, and the recursive relationship between the sub-query statement and the upper-level query statement, recursively determine the third vector of the SQL statement.
  • the first estimation unit 405 is configured to: input the third vector into the cost model; and determine the complexity coefficient of the SQL statement through the cost model.
  • the first estimating unit 405 is configured to estimate the first estimated execution time of the SQL statement executed by the DBMS according to the relationship between the complexity coefficient and the execution time length, and the complexity coefficient of the SQL statement.
  • the first estimation unit 405 is configured to: input complex coefficients into the execution time correction model, and the training samples of the execution time correction model include the complex coefficients of the historical SQL statements executed by the DBMS and the execution of the historical SQL statements Duration: Determine the first estimated execution duration of the SQL statement executed by the DBMS through the execution time correction model.
  • FIG. 12 is a schematic diagram of a possible logical structure of the apparatus 50 for predicting the performance of the database management system involved in the above-mentioned embodiment provided by the embodiment of this application.
  • the device 50 for predicting the performance of a database management system includes a processor 501, a communication interface 502, a memory 503, and a bus 504.
  • the processor 501, the communication interface 502, and the memory 503 are connected to each other through a bus 504.
  • the processor 501 is configured to control and manage the actions of the device 50 for predicting the performance of the database management system.
  • the processor 501 is configured to execute steps 201 to 203 in FIG. 3, and steps in FIG. 301 to 307 and/or other processes used in the techniques described herein.
  • the communication interface 502 is used to communicate with the device 50 that supports the performance estimation of the database management system.
  • the communication interface 502 can execute the data volume of the first parameter group, the second parameter group, the active data, and the SQL statement in the foregoing method embodiment. step.
  • the memory 503 is used to store the program code and data of the database server 50.
  • the processor 501 may be a central processing unit, a general-purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array, or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of this application.
  • the processor may also be a combination of computing functions, for example, a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and so on.
  • the bus 504 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc.
  • PCI Peripheral Component Interconnect
  • EISA Extended Industry Standard Architecture
  • an embodiment of the present application further provides a cluster database system 500, which includes a hardware layer 1007, a virtual machine monitor (VMM) 1001 running on the hardware layer 1007, and multiple virtual machines 1002.
  • a virtual machine can be used as a data node of the cluster database system 500.
  • the virtual machine 1002 is a virtual computer that is simulated on public hardware resources through virtual machine software.
  • An operating system and application programs can be installed on the virtual machine, and the virtual machine can also access network resources.
  • the virtual machine is like working in a real computer.
  • Hardware layer 1007 a hardware platform running in a virtualized environment, which can be abstracted from the hardware resources of one or more physical hosts.
  • the hardware layer may include a variety of hardware, such as a processor 1004 (such as a CPU) and a memory 1005, and may also include a network card 1003 (such as an RDMA network card), high-speed/low-speed input/output (I/O, Input/Output) devices , And other equipment with specific processing functions.
  • the virtual machine 1002 runs an executable program based on the VMM and the hardware resources provided by the hardware layer 1007 to implement part or all of the functions of the database management system performance estimation device in the above-mentioned related embodiments of FIGS. 3-9. For the sake of brevity, I will not repeat them here.
  • the cluster database system 500 may also include a host machine (Host): as a management layer, it is used to complete the management and allocation of hardware resources; present a virtual hardware platform for the virtual machine; and realize the scheduling and isolation of the virtual machine.
  • Host may be a virtual machine monitor (VMM); it may also be a combination of VMM and a privileged virtual machine.
  • the virtual hardware platform provides various hardware resources for each virtual machine running on it, such as providing a virtual processor (such as a VCPU), a virtual memory, a virtual disk, a virtual network card, and so on.
  • the virtual disk may correspond to a file or a logical block device of the Host.
  • the virtual machine runs on the virtual hardware platform prepared by the Host, and one or more virtual machines run on the Host.
  • the VCPU of the virtual machine 1002 executes an executable program stored in its corresponding virtual memory to implement or execute the method steps described in the foregoing method embodiments of the present invention. For example, part or all of the functions of the device for predicting the performance of the database management system in the above-mentioned embodiments related to FIG. 3 to FIG. 9 are realized.
  • an embodiment of the present application also provides a database system, including: a database server 800 based on a client device 900 connected to the database server 800 through a communication network.
  • a client operating system 904 runs on the hardware layer 906 of the client device 900, and an application program 902 runs on the operating system 904; an operating system 814 runs on the hardware layer 816 of the database server 800, and a database runs on the operating system 814 Management system 812.
  • the application program 902 is connected to the database management system 812 running on the database server 800 via the communication network and accesses or operates the database stored in the data storage 818, for example, querying, updating or deleting data in the database through SQL statements, or importing new The data to the database.
  • the hardware layers 906 and 816 include basic hardware units required for operating an operating system and application programs, such as a processor, such as a CPU, a memory (Memory), an input/output device, a network interface, and the like.
  • a processor such as a CPU, a memory (Memory), an input/output device, a network interface, and the like.
  • the data storage 818 may be an external storage of the database server 800, such as a hard disk, a magnetic disk, a storage array, or a storage server, etc., and is connected to the database server 800 in communication. Alternatively, the data storage 818 may also be integrated inside the database server 800, and exchange data with the processor and I/O devices through a bus or other internal communication methods.
  • Executable code is stored in the memory of the database server 800, and the executable code is configured to implement the components and functions of the database management system 812 when executed by the processor.
  • the database management system 812 may specifically be the database management system 108 shown in FIG. 2.
  • FIG. 2 For related functions and implementation details, please refer to the related embodiments of FIG. 3 to FIG. 9, which will not be repeated here.
  • a computer-readable storage medium stores computer-executable instructions.
  • the device executes the above-mentioned figure. The method for predicting the performance of the database management system described in the part of the embodiment from 3 to FIG. 9.
  • a computer program product in another embodiment of the present application, includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium; at least one processor of the device can be accessed from a computer.
  • the read storage medium reads the computer-executed instruction, and at least one processor executes the computer-executed instruction to make the device execute the method for predicting the performance of the database management system described in the above-mentioned embodiments in FIGS. 3 to 9.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative, for example, the division of units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components can be combined or integrated. To another system, or some features can be ignored, or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the function is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solutions of the embodiments of the present application are essentially or the part that contributes to the prior art or the part of the technical solutions can be embodied in the form of a software product, and the computer software product is stored in a storage medium.
  • Including several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods in the embodiments of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program code .

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

一种数据库管理系统性能预估的方法,根据获取到的第一参数组、第二参数组和DBMS管理的数据中活跃数据的数据量即可确定出DBMS的性能变化比率,不需要实际将第二参数组配置给DBMS;然后由DBMS执行作业再进行观察,即不需要与DBMS交互就可以预估出DBMS的性能变化比率,即可确保DBMS的安全,而且为DBMS的自调优、自管理等提供了性能度量的手段,保障了DBMS可靠稳定运行。

Description

一种数据库管理系统性能预估的方法、装置及系统
本申请要求于2020年2月28日提交中国专利局、申请号为202010134058.0、发明名称为“一种数据库管理系统性能预估的方法、装置及系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据库管理技术领域,具体涉及一种数据库管理系统性能预估的方法、装置及系统。
背景技术
数据库管理系统(database management system,DBMS)的性能是数据库管理中的一个十分重要的指标,它的优劣会直接影响到实际业务的发展和用户体验。影响DBMS性能的因素有很多,例如:DBMS负载的波动、DBMS事务处理(transaction process,TP)、分析处理(analysis process,AP)等不同类型的处理作业、DBMS配置参数的变化、DBMS所在系统配置参数的变化以及DBMS存储空间的变化都会导致DBMS的性能发生变化。由于影响DBMS性能的因素众多,因此对其性能变化进行预测是非常有难度的。
目前可以通过与DBMS交互来获取DBMS的执行表现,从而预估出DBMS的性能。例如:在DBMS执行作业过程中观察执行效果,并使用解释(explain)指令,通过DBMS的优化器给出一个预估的代价(cost)值,从而得到DBMS的性能。
这种通过与DBMS交互的方法来获取DBMS的性能的方法,会对DBMS造成干扰,进而影响DBMS支撑的业务的性能。
发明内容
本申请实施例提供一种数据库管理系统性能预估的方法,可以在不与数据库管理系统(database management system,DBMS)进行交互的基础上预估出DBMS的性能,为DBMS的自调优、自管理等提供了性能度量的手段,保障了DBMS可靠稳定运行。
为达到上述目的,本申请的实施例采用如下技术方案:
第一方面,提供一种数据库管理系统性能预估的方法,包括:获取为数据库管理系统DBMS配置的第一参数组,以及获取DBMS管理的数据中活跃数据的数据量,第一参数组包括至少一个参数;获取第二参数组,第二参数组包括至少一个参数;根据第一参数组、第二参数组和活跃数据的数据量确定DBMS的性能变化比率,性能变化比率指示使用第二参数组配置DBMS后,DBMS的性能相比于使用第一参数组配置DBMS时的变化程度。
上述第一方面提供的技术方案中,第一参数组中通常包括多个参数(Knob),该第一参数组中的参数例如可以包括:存储引擎的缓冲池大小(innodb_buffer_pool_size),存储引擎的缓冲池中块大小(innodb_buffer_pool_chunk_size),存储引擎的缓冲池实例数(innodb_buffer_pool_instances),临时表大小(tmp_table_size)和键缓冲大小(key_buffer_size)等。活跃数据指的是DBMS执行历史SQL语句时所使用过的数据。第二参数组中包括的参数可以是与第一参数组中的参数对应的,第二参数组中各参数取值的大小可以与第一参数组中参数的取值相同,也可以不同,通常第一参数组和第二参数组中至少有一个参数的取值不同。第一参数组与第二参数组的区别在于:第一参数组中的各参数 已在DBMS上配置,也就是说DBMS运行使用的就是第一参数组中的参数。第二参数组未配置到DBMS上,该第二参数组可以是使用DBMS参数的调优模型进行调优后的到的,也可以是用户自行确定的。性能变化比率指的是若采用第二参数组配置DBMS的性能/采用第一参数组配置DBMS的性能,该性能变化比率可以指示若采用第二参数组配置DBMS的后该DBMS的性能是变好了还是变差了,若该性能变化比率=1,则表示没有变化,若大于1,则表示性能变好了,若小于1,则表示性能变差了,当然,变好或者变差都是笼统的说法,该性能变化比率的数值可以指示出具体的变化程度,例如,性能提高了0.5倍,或者降低了0.1倍等。由该第一方面可知,获取到第一参数组、第二参数组和DBMS管理的数据中活跃数据的数据量即可确定出DBMS的性能变化比率,不需要实际将第二参数组配置给DBMS,然后由DBMS执行作业再进行观察,即不需要与DBMS交互就可以预估出DBMS的能变化比率,即可以确保DBMS的安全,而且为DBMS的自调优、自管理等提供了性能度量的手段,保障了DBMS可靠稳定运行。
在第一方面一种可能的实现方式中,该方法还包括:获取结构化查询语言(Structured Query Language,SQL)语句;预估DBMS执行该SQL语句的第一预估执行时长。
该种可能的实现方式中,不需要将SQL语句输入到DBMS中由DBMS执行该SQL语句就可以预估出该DBMS执行该SQL语句的预估执行时长,为DBMS的调优又增加了一项参考指标,这样可以确保更好的实现DBMS的自调优、自管理,进一步的保障了DBMS可靠稳定运行。
在第一方面一种可能的实现方式中,该方法还包括:根据DBMS的性能变化比率、第一预估执行时长和活跃数据的数据量预估DBMS的性能指标,性能指标包括SQL语句的第二预估执行时长和DBMS的吞吐量中的至少一个,第二预估执行时长是第一预估执行时长的修正值。
该种可能的实现方式中,还可以进一步优化SQL语句的预估执行时长,并且还可以预估出DBMS采用第二参数组配置时DBMS的吞吐量,这样可以更好的确认出第二参数组是否适合配置给DBMS,性能指标指示不适合,则不配置给DBMS,这样相对于配置给DBMS后通过DBMS执行结果再确定第二参数组不适合DBMS更有利于保障DBMS运行的稳定性,也提高了DBMS参数配置的准确度。
在第一方面一种可能的实现方式中,上述步骤:根据DBMS的性能变化比率、第一预估执行时长和活跃数据的数据量预估DBMS的性能指标,包括:将DBMS的性能变化比率、第一预估执行时长和活跃数据的数据量输入性能预估模型,性能预估模型的训练样本包括DBMS的历史性能变化比率、历史SQL语句的执行时长、以及DBMS执行历史SQL语句时DBMS所管理的活跃数据的数据量和历史性能指标。
需要说明的是,本申请中涉及到的“历史性能变化比率”和“历史SQL语句的执行时长”指的是之前DBMS执行过程中所收集到的数据。
该种可能的实现方式中,通过DBMS的性能变化比率、第一预估执行时长和活跃数据的数据量三个维度的数据来预估DBMS的性能指标,如:执行时长或吞吐量等,通过性能指标所反映出来的情况再调整DBMS,可以为DBMS确定出更适配的参数,从而可以提高DBMS调优的准确度。
在第一方面一种可能的实现方式中,上述步骤:根据第一参数组、第二参数组和活跃数据的数据量确定DBMS的性能变化比率,包括:根据DBMS中活跃数据的数据量修正第一参 数组和第二参数组中与活跃数据的数据量相关的参数;对修正后的第一参数组和第二参数组进行降维,以得到第一向量和第二向量,第一参数组是维度高于第一向量的向量,第二参数组是维度高于第二向量的向量;根据第一向量和第二向量确定DBMS的性能变化比率。
该种可能的实现方式中,通过活跃数据的数据量来修正第一参数组和第二参数组,这样可以将活跃数据的数据量与DBMS的参数进行拟合,从而可以确保为DBMS调优确定出的参数最大程度的匹配DBMS当前的情况,不至于出现过大的资源浪费,从而确保数据库系统的资源利用率。
在第一方面一种可能的实现方式中,上述步骤:根据DBMS中活跃数据的数据量修正第一参数组和第二参数组中与活跃数据的数据量相关的参数,包括:将第一参数组和第二参数组中与活跃数据的数据量相关的目标参数调整到与活跃数据的数据量相等,目标参数为第一参数组和第二参数组中大于与活跃数据的数据量的参数。
该种可能的实现方式中,该数据修正的过程指的是若第一参数组中或第二参数组中的某一个参数的值大于活跃数据的数据量,则将该参数的值缩小到与该数据量相等。以参数innodb_buffer_pool_size为例,若第一参数组中该参数的值大于活跃数据的数据量,则将该第一参数组中该参数的值修正到与该数据量相等,若第二参数组中该参数的值也大于该数据量,则也修正为与该数据量相等,或该参数的值都小于该数据量,则不作修正,无论是第一参数组还是第二参数组中的参数,只修正大于该数据量的情况。该种可能的实现方式中,通过数据量校正模块可以提高第一参数组中或第二参数组中参数修正的速度。
在第一方面一种可能的实现方式中,上述步骤:根据第一向量和第二向量确定DBMS的性能变化比率,包括:将第一向量和第二向量输入到双参数混合模型,双参数混合模型的训练样本包括DBMS已配置过的多对第一历史参数组和第二历史参数组以及与每对第一历史参数组和第二历史参数组对应的DBMS的性能变化比率;通过双参数混合模型确定DBMS的性能变化比率。其中,第一历史参数组是DBMS使用过的一组参数,第二历史参数组是DBMS使用过的另一组参数,第一历史参数组和第二历史参数组构成一对,使用第二历史参数组时所述DBMS的性能相比于使用第一历史参数组的性能变化比率即为该对第一历史参数组和第二历史参数组对应的DBMS的性能变化比率。
该种可能的实现方式中,双参数混合模型(double knobs mixture model,DKMM)可以快速确定出DBMS的性能变化比率。
在第一方面一种可能的实现方式中,上述步骤:预估DBMS执行SQL语句的第一预估执行时长,包括:将SQL语句转换为第三向量;根据第三向量确定SQL语句的复杂系数;根据复杂系数预估DBMS执行SQL语句的第一预估执行时长。
该种可能的实现方式中,通过向量确定出SQL语句的复杂系数,进一步通过复杂系数预估出执行时长,确保了执行时长预估的准确度。
在第一方面一种可能的实现方式中,所述SQL语句中包括子查询语句,上述步骤:将SQL语句转换为第三向量,包括:利用复杂度与条件因子的关系式,确定子查询语句的复杂度,条件因子包括子查询中的条件类型和子查询涉及到的表格类型;根据子查询语句的复杂度确定子查询语句的向量;将子查询语句的向量输入代价模型以得到子查询语句的复杂系数, 代价模型的训练样本包括DBMS执行过的历史SQL语句的向量以及历史SQL语句的复杂系数;根据子查询语句的复杂系数,以及子查询语句与包含子查询语句的上层查询语句之间的嵌套关系,确定出SQL语句的第三向量。
该将SQL语句转换为第三向量的步骤也可以表述为:按照预设模板规则,将SQL语句转换为语句模板,若SQL语句中包括子查询语句,则语句模板中包括与子查询语句对应的至少一个子模板;利用复杂度与条件因子的关系式,确定每个子模板的复杂度,条件因子包括子模板中的条件类型和所涉及到的表格类型;根据每个子模板的复杂度确定子查询语句的向量;将子查询语句的向量输入代价模型以得到子查询语句的复杂系数,代价模型的训练样本包括DBMS执行过的历史SQL语句的向量以及历史SQL语句的复杂系数;根据子查询语句的复杂系数,以及子查询语句与上层查询语句之间的递归关系,递归的确定出SQL语句的第三向量。
该种可能的实现方式中,预设模板规则指的是SQL语句中的哪些结构可以组成一个模板,例如:select*from*where结构可以构成一个模板等。一个SQL语句可以包含两层或更多层的查询关系,每一层的语句都可以理解为是一个子查询语句。SQL语句对应的大模板可以称为语句模板,其中,子查询语句对应的模板可以称为子模板,例如:若语句模板包括Select*from where{select*from*where;+min}+orderby+limit,则{select*from*where;+min}中的select*from*where;可以称为一个子模板,min也可以称为一个子模板。子查询语句中可能会包括“相等条件”或“不相等条件”等,这些“相等条件”或“不相等条件”都属于条件类型,子模板中还可能会涉及到一些表格,例如:查询区域的表格或查询民族的表格等,这些都属于表格类型。子查询语句的初始向量在各个维度上都可以标识为0,在确定出子模板的复杂度后,可以用子模板的复杂度的具体值去修改初始向量中该子模板对应维度的0,从而确定出该子查询语句的向量,通过代价模型就可以得到该子查询语句的复杂系数,按照递归方式去处理,就可以得到上一层查询语句的向量,以此类推,就可以得到SQL语句的向量,也就是第三向量。
在第一方面一种可能的实现方式中,上述步骤:根据第三向量确定SQL语句的复杂系数,包括:将第三向量输入代价模型,通过代价模型确定SQL语句的复杂系数。
该种可能的实现方式中,通过预先训练好的代价模型来确定复杂系数,可以提高复杂系数确定的速度。
在第一方面一种可能的实现方式中,上述步骤:根据复杂系数预估DBMS执行SQL语句的第一预估执行时长,包括:根据复杂系数与执行时长的关系式,以及SQL语句的复杂系数预估DBMS执行SQL语句的第一预估执行时长。
在第一方面一种可能的实现方式中,上述步骤:根据复杂系数预估DBMS执行SQL语句的第一预估执行时长,包括:将复杂系数输入执行时间校正模型,该执行时间校正模型的训练样本包括DBMS执行过的历史SQL语句的复杂系数以及历史SQL语句的执行时长;通过执行时间校正模型确定DBMS统执行SQL语句的第一预估执行时长。
该种可能的实现方式中,通过预先训练好的执行时间校正模型来预估执行时长,可以提高执行时长预估的速度。
本申请第二方面提供一种数据库管理系统性能预估的方法,包括:获取第一参数组和第二参数组,第一参数组包括至少一个参数,第二参数组包括至少一个参数;获取DBMS管理的数据中活跃数据的数据量;根据第一参数组、第二参数组和活跃数据的数据量确定DBMS的性能变化比率,性能变化比率指示使用第二参数组配置DBMS后,DBMS的性能相比于使用第一参数组配置DBMS时的变化程度。
结合上述第二方面,上述第一方面中的任一可能的实现也都可以结合在该第二方面中成为第二方面中的任一可能的实现方式。
第二方面与第一方面的差别在于第二方面中获取的第一参数组可以不是DBMS在使用的第一参数组,该第二方面中的第一参数组也可以是通过参数调优模型得到的,或者用户配置的。
第三方面,提供了一种数据库管理系统性能预估的装置,用于执行上述第一方面或第一方面的任意可能的实现方式中的方法。具体地,该装置包括用于执行上述第一方面或第一方面的任意可能的实现方式中的方法的模块或单元。
第四方面,提供了一种数据库管理系统性能预估的装置,用于执行上述第二方面或第二方面的任意可能的实现方式中的方法。具体地,该装置包括用于执行上述第二方面或第二方面的任意可能的实现方式中的方法的模块或单元。
第五方面,提供了一种数据库管理系统性能预估的装置。该装置可以包括至少一个处理器、存储器和通信接口。处理器与存储器和通信接口耦合。存储器用于存储指令,处理器用于执行该指令,通信接口用于在处理器的控制下与其他网元进行通信。该指令在被处理器执行时,使处理器执行第一方面或第一方面的任意可能的实现方式中的方法。
第六方面,提供了一种数据库管理系统性能预估的装置。该装置可以包括至少一个处理器、存储器和通信接口。处理器与存储器和通信接口耦合。存储器用于存储指令,处理器用于执行该指令,通信接口用于在处理器的控制下与其他网元进行通信。该指令在被处理器执行时,使处理器执行第二方面或第二方面的任意可能的实现方式中的方法。
第七方面,提供了一种数据库系统。该数据库系统包括第三方面或第五方面的数据库管理系统性能预估的装置、数据库管理系统和数据库。
第八方面,提供了一种数据库系统。该数据库系统包括第四方面或第六方面的数据库管理系统性能预估的装置、数据库管理系统和数据库。
第九方面,提供了一种计算机可读存储介质,该计算机可读存储介质存储有程序,该程序使得数据查询的装置执行上述第一方面,及其各种实现方式中的任一种数据库管理系统性能预估的方法。
第十方面,提供了一种计算机可读存储介质,该计算机可读存储介质存储有程序,该程序使得数据查询的装置执行上述第二方面,及其各种实现方式中的任一种数据库管理系统性能预估的方法。
第十一方面,提供一种计算机程序产品,该计算机程序产品包括计算机执行指令,该计算机执行指令存储在计算机可读存储介质中;设备的至少一个处理器可以从计算机可读存储介质读取该计算机执行指令,至少一个处理器执行该计算机执行指令使得设备实施上 述第一方面或者第一方面的任一种可能的实现方式所提供的数据库管理系统性能预估的方法。
第十二方面,提供一种计算机程序产品,该计算机程序产品包括计算机执行指令,该计算机执行指令存储在计算机可读存储介质中;设备的至少一个处理器可以从计算机可读存储介质读取该计算机执行指令,至少一个处理器执行该计算机执行指令使得设备实施上述第二方面或者第二方面的任一种可能的实现方式所提供的数据库管理系统性能预估的方法。
可以理解地,上述提供的任一种数据库管理系统性能预估的装置、计算机存储介质或者计算机程序产品均用于执行上文所提供的对应的数据库管理系统性能预估的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
附图说明
图1A是单机数据库系统示意图;
图1B是采用共享磁盘架构的集群数据库系统示意图;
图1C是采用无共享架构的集群数据库系统示意图;
图2是本申请实施例提供的数据库管理系统性能预估的方法的一实施例示意图;
图3是本申请实施例提供的数据库管理系统性能预估的方法的一场景示意图;
图4是本申请实施例提供的数据库管理系统性能预估的方法的另一实施例示意图;
图5是本申请实施例提供的数据库管理系统性能预估的方法的另一实施例示意图;
图6是本申请实施例提供的数据库管理系统性能预估的方法的另一实施例示意图;
图7是本申请实施例提供的性能指标对比的一界面示意图;
图8是本申请实施例提供的查询比走势的一界面示意图;
图9是本申请实施例提供的数据库服务器的一实施例示意图;
图10是本申请实施例提供的数据库管理系统性能预估的装置的一实施例示意图;
图11是本申请实施例提供的数据库管理系统性能预估的装置的另一实施例示意图;
图12是本申请实施例提供的数据库管理系统性能预估的装置的另一实施例示意图;
图13是本申请实施例提供的集群数据库系统的一实施例示意图;
图14是本申请实施例提供的数据库系统的另一实施例示意图。
具体实施方式
下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它 步骤或单元。
本申请实施例提供一种数据库管理系统性能预估的方法,可以在不与数据库管理系统(database management system,DBMS)进行交互的基础上预估出DBMS的性能,为DBMS的自调优、自管理等提供了性能度量的手段,保障了DBMS可靠稳定运行。以下分别进行详细说明。
本申请实施例提供的方法、可以实现为数据库管理系统的一部分,或者实现为一个独立的组件。本申请实施例提供的方法可以应用于单机数据库系统或者集群数据库系统。数据库系统(database system)一般由以下三部分组成:(1)数据库(database,DB),指长期存储在计算机内的,有组织,可共享的数据的集合。数据库中的数据按一定的数学模型组织、描述和存储,具有较小的冗余,较高的数据独立性和易扩展性,并可为各种用户共享。(2)硬件,包括存储数据所需的数据存储器,例如内存和/或磁盘。(3)软件,包括DBMS,DBMS是数据库系统的核心软件,是用于科学地组织和存储数据,以及高效获取和维护数据的系统软件,数据库引擎是DBMS的核心组件。
具体地,图1A为单机数据库系统示意图,包括一个数据库管理系统和数据存储器(data store),该数据库管理系统用于提供数据库的查询和修改等服务,该数据库管理系统将数据存储到数据存储器中。在单机数据库系统中,数据库管理系统和数据存储器通常位于单一服务器上,比如一台对称多处理(symmetric multi-processor,SMP)服务器。该SMP服务器包括多个处理器,所有的处理器共享资源,如总线,内存和I/O系统等。数据库管理系统的功能可由一个或多个处理器执行内存中的程序来实现。
图1B为采用共享磁盘(shared-storage)架构的集群数据库系统示意图,包括多个节点(如图1B中的节点1-N),每个节点部署有数据库管理系统,分别为用户提供数据库的查询和修改等服务,多个数据库管理系统存储有共享的数据在共享数据存储器中,并且通过交换机对数据存储器中的数据执行读写操作。共享数据存储器可以为共享磁盘阵列。集群数据库系统中的节点可以为物理机,比如数据库服务器,也可以为运行在抽象硬件资源上的虚拟机(virtual machine,VM)或容器(container)。当节点为虚拟机或容器时,节点中部署的数据库管理系统为DBMS实例(instance),该DBMS可以是进程或线程。若节点为物理机,则交换机为存储区网络(storage area network,SAN)交换机、以太网交换机,光纤交换机或其它物理交换设备。若节点为虚拟机或容器,则交换机为虚拟交换机。
图1C为采用无共享(shared-nothing)架构的集群数据库系统示意图,每个节点具有各自独享的硬件资源(如数据存储器)、操作系统和数据库,节点之间通过网络来通信。该体系下,数据将根据数据库模型和应用特点被分配到各个节点上,查询任务将被分割成若干部分,在所有节点上并行执行,彼此协同计算,作为整体提供数据库服务,所有通信功能都在一个高宽带网络互联体系上实现。如同图1B所描述的shared-storage架构的集群数据库系统一样,这里的节点既可以是物理机,也可以是虚拟机。
在本申请所有实施例中,数据库系统的数据存储器(data store)包括但不限于固态硬盘(solid state drives,SSD)、磁盘阵列或其他类型的非瞬态计算机可读介质。图1A-1C中虽未示出数据库,应理解,数据库存储在数据存储器中。所属领域的技术人员可以理解 一个数据库系统可能包括比图1A-1C中所示的部件更少或更多的组件,或者包括与图1A-1C中所示组件不同的组件,图1A-1C仅仅示出了与本申请实施例所公开的实现方式更加相关的组件。例如,虽然图1B和1C中已经描述了4个节点,但所属领域的技术人员可理解成一个集群数据库系统可包含任何数量的节点。各节点的数据库管理系统功能可分别由运行在各节点上的软件、硬件和/或固件的适当组合来实现。
为了便于理解和描述,作为示例而非限定,下面以shared-nothing架构集群数据库系统中一个节点,即数据库服务器为例说明本申请实施例的方案。但是本领域技术人员根据本申请实施例的教导可以很清楚地理解,本申请实施例的方法同样可以应用于shared-storage架构的集群数据库系统,单机数据库系统,以及任何类型的关系型数据库系统。
基于上述图1A至图1C的数据库系统,本申请实施例提供的数据库管理系统性能预估的方法可以包括如下几种方案:
第一种:确定DBMS的性能变化比率。
第二种:预估DBMS执行SQL语句的第一预估执行时长。
第三种:使用第一种的“DBMS的性能变化比率”和第二种的“第一预估执行时长”,以及DBMS管理的数据中活跃数据的数据量确定DBMS的性能指标。
下面结合附图对上述三种方案分别进行介绍。
针对第一种方案,如图2所示,本申请实施例提供的数据库管理系统性能预估的方法的一实施例包括:
201、获取为DBMS配置的第一参数组,以及获取DBMS管理的数据中活跃数据的数据量,第一参数组包括至少一个参数。
第一参数组中通常包括多个参数(Knob),例如,该第一参数组中的参数可以包括:存储引擎的缓冲池大小(innodb_buffer_pool_size),存储引擎的缓冲池中块大小(innodb_buffer_pool_chunk_size),存储引擎的缓冲池实例数(innodb_buffer_pool_instances),临时表大小(tmp_table_size)和键缓冲大小(key_buffer_size)等。
活跃数据指的是DBMS执行历史SQL语句时所使用过的数据。
202、获取第二参数组,第二参数组包括至少一个参数。
第二参数组中包括的参数可以是与第一参数组中的参数对应的,第二参数组中各参数取值的大小可以与第一参数组中参数的取值相同,也可以不同,通常第一参数组和第二参数组中至少有一个参数的取值不同。
第一参数组与第二参数组的区别在于:第一参数组中的各参数已在DBMS上配置,也就是说DBMS运行使用的就是第一参数组中的参数。第二参数组未配置到DBMS上,该第二参数组可以是使用DBMS参数的调优模型进行调优后的到的,也可以是用户自行确定的。
203、根据第一参数组、第二参数组和活跃数据的数据量确定DBMS的性能变化比率,性能变化比率指示使用第二参数组配置DBMS后,DBMS的性能相比于使用第一参数组配置DBMS时的变化程度。
在一个实施例中,性能变化比率指示:若采用第二参数组配置DBMS的性能/采用第一参 数组配置DBMS的性能,该性能变化比率可以指示若采用第二参数组配置DBMS的后该DBMS的性能是变好了还是变差了,若该性能变化比率=1,则表示没有变化,若大于1,则表示性能变好了,若小于1,则表示性能变差了,当然,变好或者变差都是笼统的说法,该性能变化比率的数值可以指示出具体的变化程度,例如,性能提高了0.5倍,或者降低了0.1倍等。
由上述第一种方案可知,获取到第一参数组、第二参数组和DBMS管理的数据中活跃数据的数据量即可确定出DBMS的性能变化比率,不需要实际将第二参数组配置给DBMS,然后由DBMS执行作业再进行观察,即不需要与DBMS交互就可以预估出DBMS的能变化比率,即可以确保DBMS的安全,而且为DBMS的自调优、自管理等提供了性能度量的手段,保障了DBMS可靠稳定运行。
本申请实施例提供的第二种方案可以是在第一种方案的基础上进行,也可以是独立于第一种方案单独执行。
该第二种方案提供的数据库管理系统性能预估的方法的一实施例包括:获取结构化查询语言SQL语句;预估DBMS执行SQL语句的第一预估执行时长。
本申请实施例提供的第三种方案是在第一种方案和第二种方案的基础上执行的,该第二种方案提供的数据库管理系统性能预估的方法的一实施例包括:根据DBMS的性能变化比率、第一预估执行时长和活跃数据的数据量预估DBMS的性能指标,性能指标包括SQL语句的第二预估执行时长和DBMS的吞吐量中的至少一个,第二预估执行时长是第一预估执行时长的修正值。
为了更好的理解上述三种方案,下面结合图3的场景示意图对上述三种方案的实现过程做具体介绍。
如图3所示,该场景中,数据库管理系统性能预估的装置包括比率预估模型、查询预估模型(query-estimator)和性能预估模型。
该比率预估模型是通过训练样本训练得到的,该训练样本包括多对第一历史参数组和第二历史参数组以及与每对第一历史参数组和第二历史参数组对应的DBMS的性能变化比率,以及使用第一历史参数组和第二历史参数组配置DBMS时的活跃数据的数据量。
该查询预估模型是通过SQL流水训练得到的,该SQL流水包括DBMS处理该SQL流水对应的SQL语句的执行时长,或SQL语句的复杂系数。
性能预估模型的训练样本包括DBMS的历史性能变化比率、历史SQL语句的执行时长、以及DBMS执行历史SQL语句时DBMS所管理的活跃数据的数据量和历史性能指标。
上述SQL流水和训练样本可以通过代理(agent)与DBMS交互进行采集。
上述比率预估模型、查询预估模型(query-estimator)和性能预估模型训练好后,就可以应用在数据库管理系统性能预估的装置中执行相应的预估功能。
在预估过程中,比率预估模型接收到第一参数组、第二参数组和活跃数据的数据量,然后进行预估,就可以得到DBMS的性能变化比率,该性能变化比率指示使用第二参数组配置DBMS后,DBMS的性能相比于使用第一参数组配置DBMS时的变化程度。
查询预估模型接收到用户输入的SQL语句,就可以得到DBMS执行SQL语句的第一预估执行时长。
性能预估模型获取到比率预估模型输出的DBMS的性能变化比率、SQL语句的第一预估执行时长,以及活跃数据的数据量,就可以预估出DBMS的性能指标,该性能指标包括SQL语句的第二预估执行时长和DBMS的吞吐量中的至少一个,第二预估执行时长是第一预估执行时长的修正值。
可选地,上述步骤203可以包括:根据DBMS中活跃数据的数据量修正第一参数组和第二参数组中与活跃数据的数据量相关的参数;对修正后的第一参数组和第二参数组进行降维,以得到第一向量和第二向量,第一参数组是维度高于第一向量的向量,第二参数组是维度高于第二向量的向量;根据第一向量和第二向量确定DBMS的性能变化比率。
也可以理解为比率预估模型包括数据量校正模块和双参数混合模型(double knobs mixture model,DKMM)。
该数据量校正模块可以将所述第一参数组和所述第二参数组中与所述活跃数据的数据量相关的目标参数调整到与所述活跃数据的数据量,所述目标参数为所述第一参数组和所述第二参数组中大于与所述活跃数据的数据量的参数。
该双参数混合模型是根据为DBMS已配置过的多对第一历史参数组和第二历史参数组以及与每对第一历史参数组和第二历史参数组对应的DBMS的性能变化比率训练得到的。第一历史参数组是DBMS使用过的一组参数,第二历史参数组是DBMS使用过的另一组参数,第一历史参数组和第二历史参数组构成一对,使用第二历史参数组时所述DBMS的性能相比于使用第一历史参数组的性能变化比率即为该对第一历史参数组和第二历史参数组对应的DBMS的性能变化比率。举例来说,将为DBMS已配置过的一个第一历史参数组、为DBMS已配置过的一个第二历史参数组,以及使用该第二历史参数组时所述DBMS的性能相比于使用该第一历史参数组的性能变化比率作为一个训练样本,当有多个相似的训练样本时,通过机器学习的方法即可得到该双参数混合模型。基于训练样本训练模型可以参照现有技术实现,本发明实施例不再赘述。
这样,比率预估模型执行预估的过程可以通过数据量校正模块和双参数混合模型来完成,该过程可以参阅图4进行理解。
如图4所示,数据量校正模块接收第一参数组、第二参数组和DBMS中活跃数据的数据量,然后修正第一参数组和第二参数组中与活跃数据的数据量相关的参数。
该数据修正的过程指的是若第一参数组中或第二参数组中的某一个参数的值大于活跃数据的数据量,则将该参数的值缩小到与该数据量相等。以参数innodb_buffer_pool_size为例,若第一参数组中该参数的值大于活跃数据的数据量,则将该第一参数组中该参数的值修正到与该数据量相等,若第二参数组中该参数的值也大于该数据量,则也修正为与该数据量相等,或该参数的值都小于该数据量,则不作修正,无论是第一参数组还是第二参数组中的参数,只修正大于该数据量的情况。
修正后的第一参数组和第二参数组中的参数通常是高维向量,因此该比例预估模型中还可以包括自编码器(autoencoder),这样该自编码器可以对修正后的第一参数组和第二参数组进行降维,以得到第一向量和第二向量,第一参数组是维度高于第一向量的向量,第二参数组是维度高于第二向量的向量。
然后,DKMM接收到第一向量和第二向量,会通过DKMM的神经网络结构确定出DBMS的性能变化比率。DKMM可以将第一向量和第二向量拼接成一条长向量,然后根据该长向量确定出DBMS的性能变化比率。
为了便于理解,下面通过表1示例出第一参数组、第二参数组、活跃数据的数据量取不同值时通过本申请的方案预估出的性能变化比率。
表1:性能变化比率表
Figure PCTCN2020113204-appb-000001
上述表1第一参数组和第二参数组都分别包括5个参数,这五个参数可以是innodb_buffer_pool_size,innodb_buffer_pool_chunk_size,innodb_buffer_pool_instances,tmp_table_size和key_buffer_size。
表1中,通过样本1可以看出,当第二参数组相对于第一参数组只是调大了其中第一个参数,innodb_buffer_pool_size的值由0.25调为0.5。当活跃数据的数据量(0.5GBytes)不大时,无论是采用第一组参数配置DBMS,还是采用第二组参数配置DBMS,DBMS的性能都基本相同,性能变化比率如样本1中的结果所示出的1,表示DBMS的性能基本不变。
通过样本2可以看出,当第二参数组相对于第一参数组只是调大了其中第一个参数,innodb_buffer_pool_size的值由0.25调为0.5。当活跃数据的数据量增大(10GBytes)时,采用第一组参数配置DBMS,还是采用第二组参数配置DBMS,DBMS的性能出现了差异,性能变化比率2表示采用第二组参数配置DBMS时DBMS的性能时采用第一组参数配置DBMS是DBMS的性能的2倍。
通过样本3可以看出,当第二参数组相对于第一参数组调大了其中第一个参数和第二个参数,innodb_buffer_pool_size的值由0.25调为0.5,innodb_buffer_pool_chunk_size由0.1调为0.2。当活跃数据的数据量持续增大(100GBytes)时,采用第一组参数配置DBMS,还是采用第二组参数配置DBMS,DBMS的性能差异相对于样本2的变化就不大了,性能变化比率2.3表示采用第二组参数配置DBMS时DBMS的性能时采用第一组参数配置DBMS是DBMS的性能的2.3倍。
通过样本4可以看出,当第二参数组相对于第一参数组调小了其中第一个参数、第二个参数和第三个参数,innodb_buffer_pool_size的值由0.5调为0.15,innodb_buffer_pool_chunk_size由0.1调为0.02,innodb_buffer_pool_instances由0.6 调为0.3。当第二参数组的参数值普遍小于第一参数组的参数值时,DBMS的性能变差,性能变化比率0.1表示采用第二组参数配置DBMS时DBMS的性能时采用第一组参数配置DBMS是DBMS的性能的0.1倍。
本申请实施例将活跃数据的数据量考虑进来用作预处理,也就是参数修正,这样既减少了输入特征,避免过拟合,又能够使预测结果对数据量敏感。另外,我们输出的结果是一个比例值,这就消除了其他因素(如系统状态等)对结果造成的影响。
可选地,上述预估DBMS执行SQL语句的第一预估执行时长,包括:将SQL语句转换为第三向量;根据第三向量确定SQL语句的复杂系数;根据复杂系数预估DBMS执行SQL语句的第一预估执行时长。
其中,SQL语句中包括子查询语句,将SQL语句转换为第三向量,包括:利用复杂度与条件因子的关系式,确定子查询语句的复杂度,条件因子包括子查询中的条件类型和子查询涉及到的表格类型;根据子查询语句的复杂度确定子查询语句的向量;将子查询语句的向量输入代价模型以得到子查询语句的复杂系数,代价模型的训练样本包括DBMS执行过的历史SQL语句的向量以及历史SQL语句的复杂系数;根据子查询语句的复杂系数,以及子查询语句与包含子查询语句的上层查询语句之间的嵌套关系,确定出SQL语句的第三向量。
该步骤将SQL语句转换为第三向量,也可以表述为:按照预设模板规则,将SQL语句转换为语句模板,若SQL语句中包括子查询语句,则语句模板中包括与子查询语句对应的至少一个子模板;利用复杂度与条件因子的关系式,确定每个子模板的复杂度,条件因子包括子模板中的条件类型和所涉及到的表格类型;根据每个子模板的复杂度确定子查询语句的向量;将子查询语句的向量输入代价模型以得到子查询语句的复杂系数,代价模型的训练样本包括DBMS执行过的历史SQL语句的向量以及历史SQL语句的复杂系数;根据子查询语句的复杂系数,以及子查询语句与上层查询语句之间的递归关系,递归的确定出SQL语句的第三向量。
其中,根据第三向量确定SQL语句的复杂系数,包括:将第三向量输入代价模型;通过代价模型确定SQL语句的复杂系数。
其中,根据复杂系数预估DBMS执行SQL语句的第一预估执行时长,可以包括:
根据复杂系数与执行时长的关系式,以及SQL语句的复杂系数预估DBMS执行SQL语句的第一预估执行时长。或者,将复杂系数输入执行时间校正模型,执行时间校正模型的训练样本包括DBMS执行过的历史SQL语句的复杂系数以及历史SQL语句的执行时长;通过执行时间校正模型确定DBMS执行SQL语句的第一预估执行时长。
也可以理解为,如图5所示,查询预估模型包括向量转换模型、代价模型和执行时间校正模型,上述查询预估模型的过程由向量转换模型、代价模型和执行时间校正模型来完成。
该代价模型是通过训练样本训练得到的,代价模型的训练样本包括所述DBMS执行过的历史SQL语句的向量以及所述历史SQL语句的复杂系数。
该执行时间校正模型是通过训练样本训练得到的,该执行时间校正模型的训练样本包括所述DBMS执行过的历史SQL语句的复杂系数以及所述历史SQL语句的执行时长。
该向量转换模型按照预设模板规则,将所述SQL语句转换为语句模板,若所述SQL语句 中包括子查询语句,则所述语句模板中包括与所述子查询语句对应的至少一个子模板;利用复杂度与条件因子的关系式,确定所述每个子模板的复杂度,所述条件因子包括所述子模板中的条件类型和所涉及到的表格类型;根据所述每个子模板的复杂度确定所述子查询语句的向量。
向量转换模型将所述子查询语句的向量输入代价模型以得到所述子查询语句的复杂系数。
向量转换模型和代价模型根据所述子查询语句的复杂系数,以及所述子查询语句与上层查询语句之间的递归关系,递归的确定出所述SQL语句的第三向量。
然后代价模型接收到第三向量,通过第三向量确定SQL语句的复杂系数。
然后执行时间校正模型接收到该SQL语句的复杂系数,并根据复杂系数与执行时长的关系式,预估DBMS执行SQL语句的第一预估执行时长。
上述向量转换模型和代价模型将SQL语句转换为第三向量的过程可以参阅如下示例进行理解。
SQL语句为:
Figure PCTCN2020113204-appb-000002
Figure PCTCN2020113204-appb-000003
上述SQL语句是一条非常复杂的SQL语句,该SQL语句的语句模板可以为:Select*from where{select*from*where;+min}+orderby+limit。
将上述SQL语句转为为向量的步骤可以包括,先识别SQL语句中的最下层的子查询语句{select*from*where;+min}中的子模板,该最下层的子查询语句包括两个子模板,分别为“select*from*where;”和“min”,每个子模板的复杂度的计算过程以“select*from*where;”为例进行介绍。
复杂度与条件因子的关系式可以参阅如下关系式进行理解:
complex factor=table_nums*1+sum(condition_factor)
其中,complex factor表示子模板的复杂度,table_num表示该子模板所涉及到的表格类型的数量,condition_factor表示子模板中条件类型系数,sum表示求和。当然,若该关系式应用到最上层模板,则上述所介绍的各参数表示在最上层模板中的复杂度、表格类型的数量以及条件类型系数等。
关于condition_factor,以子模板“select*from*where;”中where条件为例,中各种条件类型系数例如可以设置为表2中的数值:
表2:条件类型系数表
相等条件 0.5
不相等条件 0.7
范围查询 0.3
涉及子查询 该子查询的complex factor
由上述子模板“select*from*where;”可知,from下包括partsupp、supplier、nation和region四个表格类型,则根据table_nums*1可以确定该子模板中from处的复杂 度=4*1=4,该子模板中where处有5个相等条件,分别为:p_partkey=ps_partkey、s_suppkey=ps_suppkey、s_nationkey=n_nationkey、n_regionkey=r_regionkey和r_name=ASIA,则根据sum(condition_factor)以及表2中相等条件的系数=0.5可以确定该子模板中where处的复杂系数=5*0.5=2.5。
根据上述子模板的复杂度的关系式,就可以确定出该子模板的复杂度=4+5*0.5=6.5。
同理,因为子模板“min”只涉及到ps_supplycost这一个条件,所以子模板“min”的复杂度=1。
因为子查询语句{select*from*where;+min}初始化时各维度上的向量化值都是0,计算出子模板“select*from*where;”和子模板“min”的复杂度后,用这两个子模板的复杂度的值去替换该子查询语句初始化向量中相应维度上的0,就可以得到该子查询语句的向量v1,例如:该子查询语句的向量v1可以表示为[0,0,6.5,0,1,00……0]。
得到子查询语句的向量v1后,将该向量v1输入代价模型,代价模型就会输出该子查询语句对应的复杂系数c1,然后按照SQL语句中各层子查询语句之间的关系进行递归计算,重复执行上述类似于子查询语句{select*from*where;+min}的过程,就可以计算得到该SQL语句的第三向量。
然后将该第三向量再输入到代价模型中,就可以得到该SQL语句的复杂系数。
因为执行时间校正模型具有复杂系数与执行时长的关系式,所以当时间校正模型接收到该SQL语句的复杂系数后,就可以根据该关系式,预估出DBMS执行SQL语句的第一预估执行时长。
需要说明的是,上述给出的复杂度的计算公式是通过加法的关系给出的,实际上这只是一种示例,其他减法、除法、乘法或者其他算式关系也可以适用于本申请上述复杂度的计算过程。
为了便于说明通过本申请实施例预估的第一预估执行时长的准确度,下面通过表3的一组对比数据来表示。
表3:时长对照表
Figure PCTCN2020113204-appb-000004
Figure PCTCN2020113204-appb-000005
如表3每行所示,在输入不同的SQL语句时,通过本申请实施例所提供的方案得到的预测执行时长与DBMS真实执行每行的SQL语句时的真实执行时长相差不大,可见本申请实施例提供的方案得到的预测执行时长的准确度很高,这样就可以通过本申请提供的方案来实现DBMS的自动调优以及自动管理。
为了更好的理解本申请实施例提供的数据库管理系统性能预估的过程,下面参阅图6对该预估过程进行介绍。
301、向量转换模型对用户输入的SQL语句进行向量转换,并输出转换后的向量。
该转换后的向量例如可以表示为[1,0,3,4,0,1,…,0]。
302、代价转换模型根据转换后的向量确定SQL语句的复杂系数。
303、时间校正模型根据步骤302确定的复杂系数确定第一预估执行时长。
步骤301至303的过程可以参阅上述图5对应的实施例的相应内容进行理解,此处不再详细赘述。
304、数据量校正模块接收第一参数组、第二参数组和活跃数据的数据量,然后根据该数据量对第一参数组和第二参数组中的参数进行修正。
305、自编码器对修正后的第一参数组和第二参数组进行降维,将高维度的向量转换为低维度的第一向量和第二向量。
306、DKMM根据第一向量和第二向量确定性能变化比率。
步骤304至306的过程可以参阅上述图4对应的实施例的相应内容进行理解,此处不再详细赘述。
307、性能预估模型接收到步骤303输出的第一预估执行时长、步骤306输出的性能变化比率,以及活跃数据的数据量,然后根据第一预估执行时长、性能变化比率,以及活跃数据的数据量进行性能预估,以得到性能指标。该性能指标可以包括第二预估执行时长和吞吐量中的至少一个。
该性能预估模型输出的第二预估执行时长和吞吐量可以在终端设备的用户界面(user interface,UI)上展示,展示的形式可以参阅图7进行理解。
如图7所示,从吞吐量上来看,采用第一参数组配置DBMS时的吞吐量远远低于采用第二 参数组配置DBMS时的吞吐量。从执行时长上来看,采用第一参数组配置DBMS时执行一个SQL语句的执行时长远远大于采用第二参数组配置DBMS时执行同一个SQL语句的执行时长。因此,可以通过本申请预估的结果来确定用于更新DBMS的参数组,这样可以在预知效果的情况下去配置,而不需要先配置再查看效果,再进行修改,这样可以有效的提高DBMS参数配置的精准度,提高DBMS的稳定性。
本申请实施例中,吞吐量中还可以反映出查询量/秒(queries per second,QPS)的情况,这样通过QPS的变化趋势,用户还可以看到在采用第二参数组配置DBMS后一段时间内性能的变化范围。该QPS的变化趋势,可以参阅图8进行理解。如图8所示,该QPS在经历一段时间小幅度波动后成稳步增长的趋势,表明在更换第二参数组配置DBMS后性能会变好。
以上描述的数据库管理系统性能预估的装置是独立于DBMS的,实际上,该数据库管理系统性能预估的装置也可以集成在DBMS上,下面结合图9介绍集成于DBMS上的一种数据库服务器。
如图9所示,本申请实施例提供一种数据库服务器100,包括:至少一个处理器104、存储可执行代码的非瞬态计算机可读介质(non-transitory computer-readable medium)106和数据库管理系统108。可执行代码在被至少一个处理器104执行时被配置为实现数据库管理系统108的组件和功能。非瞬态计算机可读介质106可以包括一个或多个非易失性存储器,作为示例,非易失性存储器包括半导体存储器设备,例如可擦可编程只读存储器(erasable programmable read only memory,EPROM),电可擦只读存储器(electrically erasable programmable read only memory,EEPROM)和闪存(flash memory);磁盘,例如内部硬盘(internal hard disk)或可移动磁盘(removable disk),磁光盘(magneto optical disk),以及CD ROM和DVD-ROM。此外,非瞬态计算机可读介质106还可以包括被配置为主存储器(main memory)的任何设备。至少一个处理器104可以包括任何类型的通用计算电路或专用逻辑电路,例如:现场可编程门阵列(field-programmable gate array,FPGA)或专用集成电路(application specific integrated circuit,ASIC)。至少一个处理器104也可以是耦合到一个或多个半导体基板的一个或多个处理器,例如CPU。
数据库管理系统108可以是关系型数据库管理系统(relational database management system,RDBMS)。数据库管理系统108支持结构化查询语言(structured query language,SQL)。通常,SQL是指专门用于管理关系型数据库中保存的数据的专用编程语言。SQL可以指代各种类型的数据相关语言,包括例如数据定义语言和数据操纵语言,其中SQL的范围可以包括数据插入,查询,更新和删除,模式创建和修改以及数据访问控制。此外,在一些示例中,SQL可以包括与各种语言元素相关的描述,包括子句(clause),表达式(expression),谓词(predicate),查询(query)和语句(statement)。例如,子句可以指语句和查询的各种组成部分,并且在一些情况下,子句可以被认为是可选的。此外,表达式可以被配置为产生包括数据列和/或行的标量值(scalar value)和/或表。另外,谓词可经配置以指定条件,以用于调节语句和查询的效果。
查询(query)是请求查看,访问和/或操纵存储在数据库中的数据。数据库管理系统108可以从数据库客户端102接收SQL格式的查询(称为SQL查询)。该SQL查询也可以称为SQL 语句。通常,数据库管理系统108通过从数据库访问相关数据并操纵相关数据以生成查询所对应的查询结果,并将查询结果返回到数据库客户端102。数据库是按一定的数学模型组织、描述和存储的数据集合,数据库可以包括一个或多个数据库结构或格式,例如行存储和列存储。数据库通常存储于数据存储器中,比如图9中的外部数据存储器120,或者非瞬态计算机可读介质106。当数据库存储于非瞬态计算机可读介质106时,数据库管理系统108为内存数据库管理系统。
数据库客户端102可以包括被配置成与数据库管理系统108交互的任何类型的设备或应用程序。在一些示例中,数据库客户端102包括一个或多个应用服务器。
数据库管理系统108包括SQL引擎110、执行引擎122、存储引擎134和本申请实施例中的数据库管理系统性能预估的装置124。SQL引擎110根据客户端102提交的SQL语句,例如查询(Query),生成对应的执行计划,执行引擎122依照语句的执行计划进行操作,以产生查询结果。存储引擎134负责在文件系统之上,管理表的数据、索引的实际内容,同时也会管理运行时的Cache、Buffer、事务、Log等数据。例如存储引擎134可以将执行引擎122的执行结果通过物理I/O写入数据存储器120。SQL引擎110包括解析器112和优化器114,其中,解析器110用于执行对SQL语句的语法、语义分析,将查询中的视图展开、划分为小的查询块。优化器114为语句生成一组可能被使用的执行计划,估算出每个执行计划的代价,比较计划的代价,最终选择一个代价最小的执行计划。数据库管理系统性能预估的装置124可以从SQL引擎110和执行引擎122获取DBMS的历史数据,例如:DBMS的历史配置参数和对历史SQL语句的执行时长等。
由上述描述克制,本申请实施例提供的数据库管理系统性能预估的方法可以是通过数据库管理系统性能预估的装置该实现的,该数据库管理系统性能预估的装置可以是集成于以上描述的DBMS 108,也可以是独立于以上描述的DBMS 108之外的,无论数据库管理系统性能预估的装置是集成在DBMS上还是独立于DBMS,该数据库管理系统性能预估的装置都可以在不与DBMS进行交互的基础上预估出DBMS的性能,为数据库的自调优、自管理等提供了性能度量的手段,保障了DBMS可靠稳定运行。下面介绍本申请实施例提供的数据库管理系统性能预估的方法。该数据库管理系统性能预估的装置可以是一台或多台服务器,可以是物理机,也可以是虚拟机(virtual machine,VM)。
以上描述了本申请实施例提供的方法及数据库系统,下面结合附图介绍本申请实施例提供的数据库管理系统性能预估的装置。
如图10所示,本申请实施例提供的数据库管理系统性能预估的装置40的一实施例包括:
第一获取单元401,用于获取为数据库管理系统DBMS配置的第一参数组,以及获取DBMS管理的数据中活跃数据的数据量,第一参数组包括至少一个参数;
第二获取单元402,用于获取第二参数组,第二参数组包括至少一个参数;
确定单元403,用于根据第一获取单元401获取的第一参数组和活跃数据的数据量,以及第二获取单元402获取的第二参数组确定DBMS的性能变化比率,性能变化比率指示使用第二参数组配置DBMS后,DBMS的性能相比于使用第一参数组配置DBMS时的变化程度。
本申请实施例提供的方案,只要获取到第一参数组、第二参数组和DBMS管理的数据中 活跃数据的数据量即可确定出DBMS的性能变化比率,不需要实际将第二参数组配置给DBMS,然后由DBMS执行作业再进行观察,即不需要与DBMS交互就可以预估出DBMS的能变化比率,即可以确保DBMS的安全,而且为DBMS的自调优、自管理等提供了性能度量的手段,保障了DBMS可靠稳定运行。
一种可能的实施例中,如图11所示,该装置40还包括:第三获取单元404和第一预估单元405。
第三获取单元404,用于获取结构化查询语言SQL语句;
第一预估单元405,用于预估DBMS执行第三获取单元获取的SQL语句的第一预估执行时长。
一种可能的实施例中,如图11所示,该装置40还包括第二预估单元406。
第二预估单元406,用于根据确定单元确定的DBMS的性能变化比率、第一预估单元预估的第一预估执行时长和第一获取单元获取的活跃数据的数据量预估DBMS的性能指标,性能指标包括SQL语句的第二预估执行时长和DBMS的吞吐量中的至少一个,第二预估执行时长是第一预估执行时长的修正值。
一种可能的实施例中,第二预估单元406用于:将DBMS的性能变化比率、第一预估执行时长和活跃数据的数据量输入性能预估模型,该性能预估模型的训练样本包括DBMS的历史性能变化比率、历史SQL语句的执行时长、以及DBMS执行历史SQL语句时DBMS所管理的活跃数据的数据量和历史性能指标;通过性能预估模型预估DBMS的性能指标。
一种可能的实施例中,确定单元403用于:根据DBMS中活跃数据的数据量修正第一参数组和第二参数组中与活跃数据的数据量相关的参数;对修正后的第一参数组和第二参数组进行降维,以得到第一向量和第二向量,第一参数组是维度高于第一向量的向量,第二参数组是维度高于第二向量的向量;根据第一向量和第二向量确定DBMS的性能变化比率。
一种可能的实施例中,确定单元403用于:将第一参数组和第二参数组中与活跃数据的数据量相关的目标参数调整到与活跃数据的数据量相等,目标参数为第一参数组和第二参数组中大于与活跃数据的数据量的参数。
一种可能的实施例中,确定单元403用于:将第一向量和第二向量输入到双参数混合模型,双参数混合模型的训练样本包括DBMS已配置过的多对第一历史参数组和第二历史参数组以及与每对第一历史参数组和第二历史参数组对应的DBMS的性能变化比率;通过双参数混合模型确定DBMS的性能变化比率。
一种可能的实施例中,第一预估单元405用于:将SQL语句转换为第三向量;根据第三向量确定SQL语句的复杂系数;根据复杂系数预估DBMS执行SQL语句的第一预估执行时长。
一种可能的实施例中,SQL语句中包括子查询语句,第一预估单元405用于:利用复杂度与条件因子的关系式,确定子查询语句的复杂度,条件因子包括子查询中的条件类型和子查询涉及到的表格类型;根据子查询语句的复杂度确定子查询语句的向量;将子查询语句的向量输入代价模型以得到子查询语句的复杂系数,代价模型的训练样本包括DBMS执行过的历史SQL语句的向量以及历史SQL语句的复杂系数;根据子查询语句的复杂系数,以及子查询语句与包含子查询语句的上层查询语句之间的嵌套关系,确定出SQL语句的第三向量。
一种可能的实施例中,第一预估单元405用于:按照预设模板规则,将SQL语句转换为语句模板,若SQL语句中包括子查询语句,则语句模板中包括与子查询语句对应的至少一个子模板;利用复杂度与条件因子的关系式,确定每个子模板的复杂度,条件因子包括子模板中的条件类型和所涉及到的表格类型;根据每个子模板的复杂度确定子查询语句的向量;将子查询语句的向量输入代价模型以得到子查询语句的复杂系数,代价模型的训练样本包括DBMS执行过的历史SQL语句的向量以及历史SQL语句的复杂系数;根据子查询语句的复杂系数,以及子查询语句与上层查询语句之间的递归关系,递归的确定出SQL语句的第三向量。
一种可能的实施例中,第一预估单元405用于:将第三向量输入代价模型;通过代价模型确定SQL语句的复杂系数。
一种可能的实施例中,第一预估单元405用于:根据复杂系数与执行时长的关系式,以及SQL语句的复杂系数预估DBMS执行SQL语句的第一预估执行时长。
一种可能的实施例中,第一预估单元405用于:将复杂系数输入执行时间校正模型,执行时间校正模型的训练样本包括DBMS执行过的历史SQL语句的复杂系数以及历史SQL语句的执行时长;通过执行时间校正模型确定DBMS执行SQL语句的第一预估执行时长。
图12所示,为本申请的实施例提供的上述实施例中所涉及的数据库管理系统性能预估的装置50的一种可能的逻辑结构示意图。数据库管理系统性能预估的装置50包括:处理器501、通信接口502、存储器503以及总线504。处理器501、通信接口502以及存储器503通过总线504相互连接。在本申请的实施例中,处理器501用于对数据库管理系统性能预估的装置50的动作进行控制管理,例如,处理器501用于执行图3中的步骤201至203,图7中步骤301至307和/或用于本文所描述的技术的其他过程。通信接口502用于支持数据库管理系统性能预估的装置50进行通信,例如:通信接口502可以执行上述方法实施例中接收第一参数组、第二参数组、活跃数据的数据量以及SQL语句的步骤。存储器503,用于存储数据库服务器50的程序代码和数据。
其中,处理器501可以是中央处理器单元,通用处理器,数字信号处理器,专用集成电路,现场可编程门阵列或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,数字信号处理器和微处理器的组合等等。总线504可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图12中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
参见图13,本申请实施例还提供一种集群数据库系统500,包括:包括硬件层1007和运行在硬件层1007之上的虚拟机监控器(VMM)1001,以及多个虚拟机1002。一个虚拟机可以作为集群数据库系统500的一个数据节点。可选第,还可以指定一个虚拟机作为协调节点。
具体的,虚拟机1002是通过虚拟机软件在公共硬件资源上模拟出的虚拟的计算机,虚拟机上可以安装操作系统和应用程序,虚拟机还可访问网络资源。对于在虚拟机中运行的应用程序而言,虚拟机就像是在真正的计算机中进行工作。
硬件层1007:虚拟化环境运行的硬件平台,可以由一个或多个物理主机的硬件资源抽象得到的。其中,硬件层可包括多种硬件,例如包括处理器1004(例如CPU)和存储器1005,还可以包括网卡1003(例如RDMA网卡)、高速/低速输入/输出(I/O,Input/Output)设备,及具有特定处理功能的其它设备。
虚拟机1002基于VMM,以及硬件层1007提供的硬件资源,运行可执行程序,以实现上述图3至图9相关的实施例中数据库管理系统性能预估的装置的部分或全部功能。为了简洁,在此不再赘述。
进一步地,该集群数据库系统500还可以包括宿主机(Host):作为管理层,用以完成硬件资源的管理、分配;为虚拟机呈现虚拟硬件平台;实现虚拟机的调度和隔离。其中,Host可能是虚拟机监控器(VMM);也有可能是由VMM和1个特权虚拟机的结合。其中,虚拟硬件平台对其上运行的各个虚拟机提供各种硬件资源,如提供虚拟处理器(如VCPU)、虚拟内存、虚拟磁盘、虚拟网卡等等。其中,该虚拟磁盘可对应Host的一个文件或者一个逻辑块设备。虚拟机运行在Host为其准备的虚拟硬件平台上,Host上运行一个或多个虚拟机。虚拟机1002的VCPU通过执行存储在其对应的虚拟内存中的可执行程序,以实现或者执行本发明上述各方法实施例中所描述的方法步骤。例如,实现上述图3至图9相关的实施例中数据库管理系统性能预估的装置的部分或全部功能。
参见图14,本申请实施例还提供一种数据库系统,包括:数据库服务器800,基于与数据库服务器800通过通信网络连接的客户端设备900。其中,
客户端设备900的硬件层906上运行有客户端操作系统904,操作系统904上运行有应用程序902;数据库服务器800的硬件层816上运行有操作系统814,以及在操作系统814上运行有数据库管理系统812。应用程序902经由通信网络与在数据库服务器800上运行的数据库管理系统812连接并且访问或者操作存储在数据存储器818中的数据库,例如,通过SQL语句查询、更新或删除数据库中的数据,或者导入新的数据至数据库。
硬件层906和816包含操作系统和应用程序运行所需的基本硬件单元,例如,处理器,例如CPU,内存(Memory)、输入/输出设备、网络接口等。
数据存储器818可以是数据库服务器800的外部存储器,比如硬盘、磁盘、存储阵列,或存储服务器等,与数据库服务器800通信连接。或者,数据存储器818也可以集成在数据库服务器800内部,与处理器和I/O设备通过总线或其它内部通信方式交互数据。
数据库服务器800的内存中存储有可执行代码,该可执行代码在被处理器执行时被配置为实现数据库管理系统812的组件和功能。数据库管理系统812具体可以为图2所示的数据库管理系统108,相关功能及实现细节可参照图3至图9相关的实施例,此处不再赘述。
在本申请的另一实施例中,还提供一种计算机可读存储介质,计算机可读存储介质中存储有计算机执行指令,当设备的至少一个处理器执行该计算机执行指令时,设备执行上述图3至图9部分实施例所描述的数据库管理系统性能预估的方法。
在本申请的另一实施例中,还提供一种计算机程序产品,该计算机程序产品包括计算机执行指令,该计算机执行指令存储在计算机可读存储介质中;设备的至少一个处理器可以从计算机可读存储介质读取该计算机执行指令,至少一个处理器执行该计算机执行指令 使得设备执行上述图3至图9部分实施例所描述的数据库管理系统性能预估的方法。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请实施例所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请实施例各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请实施例各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上,仅为本申请实施例的具体实施方式,但本申请实施例的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请实施例揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请实施例的保护范围之内。因此,本申请实施例的保护范围应以权利要求的保护范围为准。

Claims (28)

  1. 一种数据库管理系统性能预估的方法,其特征在于,包括:
    获取为数据库管理系统DBMS配置的第一参数组,以及获取所述DBMS管理的数据中活跃数据的数据量,所述第一参数组包括至少一个参数;
    获取第二参数组,所述第二参数组包括至少一个参数;
    根据所述第一参数组、所述第二参数组和所述活跃数据的数据量确定所述DBMS的性能变化比率,所述性能变化比率指示使用所述第二参数组配置所述DBMS后,所述DBMS的性能相比于使用所述第一参数组配置所述DBMS时的变化程度。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    获取结构化查询语言SQL语句;
    预估所述DBMS执行所述SQL语句的第一预估执行时长。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    根据所述DBMS的性能变化比率、所述第一预估执行时长和所述活跃数据的数据量预估所述DBMS的性能指标,所述性能指标包括所述SQL语句的第二预估执行时长和所述DBMS的吞吐量中的至少一个,所述第二预估执行时长是所述第一预估执行时长的修正值。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述DBMS的性能变化比率、所述第一预估执行时长和所述活跃数据的数据量预估所述DBMS的性能指标,包括:
    将所述DBMS的性能变化比率、所述第一预估执行时长和所述活跃数据的数据量输入性能预估模型,所述性能预估模型的训练样本包括所述DBMS的历史性能变化比率、历史SQL语句的执行时长、所述DBMS执行所述历史SQL语句时活跃数据的数据量和历史性能指标;
    通过所述性能预估模型预估所述DBMS的性能指标。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述根据所述第一参数组、所述第二参数组和所述活跃数据的数据量确定所述DBMS的性能变化比率,包括:
    根据所述DBMS中活跃数据的数据量修正所述第一参数组和所述第二参数组中与所述活跃数据的数据量相关的参数;
    对所述修正后的第一参数组和第二参数组进行降维,以得到第一向量和第二向量,所述第一参数组是维度高于所述第一向量的向量,所述第二参数组是维度高于所述第二向量的向量;
    根据所述第一向量和所述第二向量确定所述DBMS的性能变化比率。
  6. 根据权利要求5所述的方法,其特征在于,所述根据所述DBMS中活跃数据的数据量修正所述第一参数组和所述第二参数组中与所述活跃数据的数据量相关的参数,包括:
    将所述第一参数组和所述第二参数组中与所述活跃数据的数据量相关的目标参数调整到与所述活跃数据的数据量相等,所述目标参数为所述第一参数组和所述第二参数组中大于与所述活跃数据的数据量的参数。
  7. 根据权利要求5或6所述的方法,其特征在于,所述根据所述第一向量和所述第二向量确定所述DBMS的性能变化比率,包括:
    将所述第一向量和所述第二向量输入到双参数混合模型,所述双参数混合模型的训练 样本包括所述DBMS已配置过的多对第一历史参数组和第二历史参数组以及与每对所述第一历史参数组和所述第二历史参数组对应的所述DBMS的性能变化比率;
    通过所述双参数混合模型确定所述DBMS的性能变化比率。
  8. 根据权利要求2-4任一项所述的方法,其特征在于,所述预估所述DBMS执行所述SQL语句的第一预估执行时长,包括:
    将所述SQL语句转换为第三向量;
    根据所述第三向量确定所述SQL语句的复杂系数;
    根据所述复杂系数预估所述DBMS执行所述SQL语句的第一预估执行时长。
  9. 根据权利要求8所述的方法,其特征在于,所述SQL语句中包括子查询语句,所述将所述SQL语句转换为第三向量,包括:
    利用复杂度与条件因子的关系式,确定所述子查询语句的复杂度,所述条件因子包括所述子查询中的条件类型和所述子查询涉及到的表格类型;
    根据所述子查询语句的复杂度确定所述子查询语句的向量;
    将所述子查询语句的向量输入代价模型以得到所述子查询语句的复杂系数,所述代价模型的训练样本包括所述DBMS执行过的历史SQL语句的向量以及所述历史SQL语句的复杂系数;
    根据所述子查询语句的复杂系数,以及所述子查询语句与包含所述子查询语句的上层查询语句之间的嵌套关系,确定出所述SQL语句的第三向量。
  10. 根据权利要求8或9所述的方法,其特征在于,所述根据所述第三向量确定所述SQL语句的复杂系数,包括:
    将所述第三向量输入所述代价模型;
    通过所述代价模型确定所述SQL语句的复杂系数。
  11. 根据权利要求8-10任一项所述的方法,其特征在于,所述根据所述复杂系数预估所述DBMS执行所述SQL语句的第一预估执行时长,包括:
    根据复杂系数与执行时长的关系式,以及所述SQL语句的复杂系数预估所述DBMS执行所述SQL语句的第一预估执行时长。
  12. 根据权利要求8-10任一项所述的方法,其特征在于,所述根据所述复杂系数预估所述DBMS执行所述SQL语句的第一预估执行时长,包括:
    将所述复杂系数输入执行时间校正模型,所述执行时间校正模型的训练样本包括所述DBMS执行过的历史SQL语句的复杂系数以及所述历史SQL语句的执行时长;
    通过所述执行时间校正模型确定所述DBMS执行所述SQL语句的第一预估执行时长。
  13. 一种数据库管理系统性能预估的装置,其特征在于,包括:
    第一获取单元,用于获取为数据库管理系统DBMS配置的第一参数组,以及获取所述DBMS管理的数据中活跃数据的数据量,所述第一参数组包括至少一个参数;
    第二获取单元,用于获取第二参数组,所述第二参数组包括至少一个参数;
    确定单元,用于根据所述第一获取单元获取的第一参数组和所述活跃数据的数据量,以及所述第二获取单元获取的所述第二参数组确定所述DBMS的性能变化比率,所述性能变 化比率指示使用所述第二参数组配置所述DBMS后,所述DBMS的性能相比于使用所述第一参数组配置所述DBMS时的变化程度。
  14. 根据权利要求13所述的装置,其特征在于,所述装置还包括:
    第三获取单元,用于获取结构化查询语言SQL语句;
    第一预估单元,用于预估所述DBMS执行所述第三获取单元获取的SQL语句的第一预估执行时长。
  15. 根据权利要求14所述的装置,其特征在于,所述装置还包括:
    第二预估单元,用于根据所述确定单元确定的所述DBMS的性能变化比率、所述第一预估单元预估的第一预估执行时长和所述第一获取单元获取的所述活跃数据的数据量预估所述DBMS的性能指标,所述性能指标包括所述SQL语句的第二预估执行时长和所述DBMS的吞吐量中的至少一个,所述第二预估执行时长是所述第一预估执行时长的修正值。
  16. 根据权利要求15所述的装置,其特征在于,
    所述第二预估单元用于:
    将所述DBMS的性能变化比率、所述第一预估执行时长和所述活跃数据的数据量输入性能预估模型,所述性能预估模型的训练样本包括所述DBMS的历史性能变化比率、历史SQL语句的执行时长、以及所述DBMS执行所述历史SQL语句时所述DBMS所管理的活跃数据的数据量和历史性能指标;
    通过所述性能预估模型预估所述DBMS的性能指标。
  17. 根据权利要求13-16任一项所述的装置,其特征在于,
    所述确定单元用于:
    根据所述DBMS中活跃数据的数据量修正所述第一参数组和所述第二参数组中与所述活跃数据的数据量相关的参数;
    对所述修正后的第一参数组和第二参数组进行降维,以得到第一向量和第二向量,所述第一参数组是维度高于所述第一向量的向量,所述第二参数组是维度高于所述第二向量的向量;
    根据所述第一向量和所述第二向量确定所述DBMS的性能变化比率。
  18. 根据权利要求17所述的装置,其特征在于,
    所述确定单元用于:
    将所述第一参数组和所述第二参数组中与所述活跃数据的数据量相关的目标参数调整到与所述活跃数据的数据量相等,所述目标参数为所述第一参数组和所述第二参数组中大于与所述活跃数据的数据量的参数。
  19. 根据权利要求17或18所述的装置,其特征在于,
    所述确定单元用于:
    将所述第一向量和所述第二向量输入到双参数混合模型,所述双参数混合模型的训练样本包括所述DBMS已配置过的多对第一历史参数组和第二历史参数组以及与每对所述第一历史参数组和所述第二历史参数组对应的所述DBMS的性能变化比率;
    通过所述双参数混合模型确定所述DBMS的性能变化比率。
  20. 根据权利要求14-16任一项所述的装置,其特征在于,
    所述第一预估单元用于:
    将所述SQL语句转换为第三向量;
    根据所述第三向量确定所述SQL语句的复杂系数;
    根据所述复杂系数预估所述DBMS执行所述SQL语句的第一预估执行时长。
  21. 根据权利要求20所述的装置,其特征在于,所述SQL语句中包括子查询语句,
    所述第一预估单元用于:
    利用复杂度与条件因子的关系式,确定所述子查询语句的复杂度,所述条件因子包括所述子查询中的条件类型和所述子查询涉及到的表格类型;
    根据所述子查询语句的复杂度确定所述子查询语句的向量;
    将所述子查询语句的向量输入代价模型以得到所述子查询语句的复杂系数,所述代价模型的训练样本包括所述DBMS执行过的历史SQL语句的向量以及所述历史SQL语句的复杂系数;
    根据所述子查询语句的复杂系数,以及所述子查询语句与包含所述子查询语句的上层查询语句之间的嵌套关系,确定出所述SQL语句的第三向量。
  22. 根据权利要求20或21所述的装置,其特征在于,
    所述第一预估单元用于:
    将所述第三向量输入所述代价模型;
    通过所述代价模型确定所述SQL语句的复杂系数。
  23. 根据权利要求20所述的装置,其特征在于,
    所述第一预估单元用于:
    根据复杂系数与执行时长的关系式,以及所述SQL语句的复杂系数预估所述DBMS执行所述SQL语句的第一预估执行时长。
  24. 根据权利要求20、21或23所述的装置,其特征在于,
    所述第一预估单元用于:
    将所述复杂系数输入执行时间校正模型,所述执行时间校正模型的训练样本包括所述DBMS执行过的历史SQL语句的复杂系数以及所述历史SQL语句的执行时长;
    通过所述执行时间校正模型确定所述DBMS执行所述SQL语句的第一预估执行时长。
  25. 一种数据库管理系统性能预估的装置,其特征在于,所述装置包括至少一个处理器、存储器及存储在所述存储器上并可被所述至少一个处理器执行的指令,其特征在于,所述至少一个处理器执行所述指令,以实现权利要求1至12中任一项所述的方法的步骤。
  26. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该程序被处理器执行时实现权利要求1至12中任一项所述的方法的步骤。
  27. 一种数据库系统,其特征在于,包括数据库和集成有权利要求13至24中任一项所述的数据库管理系统性能预估的装置的数据库管理系统。
  28. 一种数据库系统,其特征在于,包括权利要求13至24中任一项所述的数据库管理系统性能预估的装置、数据库管理系统性和数据库。
PCT/CN2020/113204 2020-02-28 2020-09-03 一种数据库管理系统性能预估的方法、装置及系统 WO2021169231A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20921945.0A EP4099178A4 (en) 2020-02-28 2020-09-03 METHOD, DEVICE AND SYSTEM FOR PERFORMANCE ESTIMATION OF A DATABASE MANAGEMENT SYSTEM
US17/897,008 US11940966B2 (en) 2020-02-28 2022-08-26 Method, apparatus, and system for estimating database management system performance

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010134058.0A CN113326246A (zh) 2020-02-28 2020-02-28 一种数据库管理系统性能预估的方法、装置及系统
CN202010134058.0 2020-02-28

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/897,008 Continuation US11940966B2 (en) 2020-02-28 2022-08-26 Method, apparatus, and system for estimating database management system performance

Publications (1)

Publication Number Publication Date
WO2021169231A1 true WO2021169231A1 (zh) 2021-09-02

Family

ID=77412937

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113204 WO2021169231A1 (zh) 2020-02-28 2020-09-03 一种数据库管理系统性能预估的方法、装置及系统

Country Status (4)

Country Link
US (1) US11940966B2 (zh)
EP (1) EP4099178A4 (zh)
CN (1) CN113326246A (zh)
WO (1) WO2021169231A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247811A (zh) * 2017-07-21 2017-10-13 中国联合网络通信集团有限公司 基于Oracle数据库的SQL语句性能优化方法及装置
CN109213664A (zh) * 2018-08-23 2019-01-15 北京小度信息科技有限公司 Sql语句的性能分析方法、装置、存储介质和电子设备
CN110019229A (zh) * 2017-12-25 2019-07-16 航天信息股份有限公司 数据库配置系统
US20190370235A1 (en) * 2017-02-27 2019-12-05 Huawei Technologies Co., Ltd. Information Processing Method and Apparatus

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7539608B1 (en) 2002-05-10 2009-05-26 Oracle International Corporation Techniques for determining effects on system performance of a memory management parameter
US7596560B2 (en) * 2004-12-23 2009-09-29 Raytheon Company System and method for adaptive query identification and acceleration
US20070130231A1 (en) * 2005-12-06 2007-06-07 Brown Douglas P Closed-loop supportability architecture
US8015454B1 (en) * 2008-06-02 2011-09-06 Quest Software, Inc. Computer systems and methods for predictive performance management of data transactions
US9110948B2 (en) 2012-11-09 2015-08-18 International Business Machines Corporation Relative performance prediction of a replacement database management system (DBMS)
CN105138621B (zh) * 2015-08-14 2019-03-26 浪潮电子信息产业股份有限公司 一种Sybase ASE数据库配置优化系统及方法
US11314720B2 (en) 2017-03-01 2022-04-26 Oracle International Corporation Global consistency with recency in distributed database management systems
CN109240901B (zh) 2018-08-28 2023-01-17 北京星选科技有限公司 性能分析方法、性能分析装置、存储介质和电子设备
US11514275B2 (en) * 2019-10-21 2022-11-29 Sap Se Database instance tuning in a cloud platform
US11816589B2 (en) * 2020-01-16 2023-11-14 Sap Se Performance throttling identification service for autonomous databases as a service

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190370235A1 (en) * 2017-02-27 2019-12-05 Huawei Technologies Co., Ltd. Information Processing Method and Apparatus
CN107247811A (zh) * 2017-07-21 2017-10-13 中国联合网络通信集团有限公司 基于Oracle数据库的SQL语句性能优化方法及装置
CN110019229A (zh) * 2017-12-25 2019-07-16 航天信息股份有限公司 数据库配置系统
CN109213664A (zh) * 2018-08-23 2019-01-15 北京小度信息科技有限公司 Sql语句的性能分析方法、装置、存储介质和电子设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4099178A4

Also Published As

Publication number Publication date
CN113326246A (zh) 2021-08-31
EP4099178A1 (en) 2022-12-07
EP4099178A4 (en) 2023-08-09
US20220414075A1 (en) 2022-12-29
US11940966B2 (en) 2024-03-26

Similar Documents

Publication Publication Date Title
US20200301916A1 (en) Query Template Based Architecture For Processing Natural Language Queries For Data Analysis
US10210209B2 (en) Hybrid query execution plan
US20170083573A1 (en) Multi-query optimization
Kamat et al. Distributed and interactive cube exploration
Hutchison et al. LaraDB: A minimalist kernel for linear and relational algebra computation
US9305044B2 (en) System and method for modelling data
US20190026335A1 (en) Query engine selection
CN110795455A (zh) 依赖关系解析方法、电子装置、计算机设备及可读存储介质
US8219547B2 (en) Indirect database queries with large OLAP cubes
CN108804473B (zh) 数据查询的方法、装置和数据库系统
Wu et al. Combining Design and Performance in a Data Visualization Management System.
US11055353B2 (en) Typeahead and autocomplete for natural language queries
Lambert et al. rFSA: an R package for finding best subsets and interactions
US20140025658A1 (en) Validating database table partitioning schemes using stratified random sampling
US10157234B1 (en) Systems and methods for transforming datasets
US9984124B2 (en) Data management in relational databases
Demirci et al. Scaling sparse matrix-matrix multiplication in the accumulo database
WO2021169231A1 (zh) 一种数据库管理系统性能预估的方法、装置及系统
US11669538B1 (en) Systems, devices, and methods for data analytics
Reniers et al. Schema design support for semi-structured data: Finding the sweet spot between NF and De-NF
CN108780452B (zh) 一种存储过程处理方法及装置
WO2017019883A1 (en) Locality-sensitive hashing for algebraic expressions
WO2017019889A1 (en) Maintaining performance in the presence of insertions, deletions, and streaming queries
CN113806190A (zh) 一种预测数据库管理系统的性能的方法、装置及系统
Misal et al. DBQA: Multi-Environment Analyzer for Query Execution Time and Cost

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921945

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020921945

Country of ref document: EP

Effective date: 20220831

NENP Non-entry into the national phase

Ref country code: DE