CN113704246A

CN113704246A - Database index optimization method and system based on virtual index

Info

Publication number: CN113704246A
Application number: CN202011146017.XA
Authority: CN
Inventors: 曹靖城; 张继东; 王猛德; 王培才
Original assignee: Tianyi Smart Family Technology Co Ltd
Current assignee: Tianyi Digital Life Technology Co Ltd
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2021-11-26

Abstract

The invention provides a method and a system for database index optimization based on virtual indexes. The method comprises the following steps: responding to the index updating request, and synchronizing the production database and the log information; creating a set of virtual indices based on an analysis of corresponding query statements in the log; selecting and automatically updating one or more best indexes to a production database based on the actual query execution time for each virtual index in the set of virtual indexes; and reselecting and automatically updating the one or more best indexes to the production database based on a virtual index rating value reflecting an improvement in query execution time before and after each index is created.

Description

Database index optimization method and system based on virtual index

Technical Field

The invention relates to the field of databases, in particular to a database index optimization method and system based on virtual indexes.

Background

Databases are ubiquitous in software systems, both web applications and mobile APPs, and are not isolated. MySQL is one of the most popular relational database management systems, is popular with developers, has gradually become the mainstream database solution, and is widely applied in various business fields.

The index is a structure for sequencing values of one or more columns in a database table, and provides pointers to data values stored in a designated column of the table.

The creation of the current index mainly depends on the experience of developers, and when the developers realize the function of the business system, the developers create the index simultaneously according to the past experience of the developers. Because developers do not have knowledge of mastering the indexes of the database systematically, the subjectivity is strong, the indexes cannot be matched with the SQL sentences, and the query efficiency is low; in addition, when the data volume of the database table changes or the distribution of the table data changes, the historical index is not suitable any more, and the query efficiency cannot be guaranteed. These make it unreasonable to have a lot of indexes in practical application, and finally affect the query performance of the system.

The chinese patent application "database index creation method and apparatus" (201610181731.X) proposes to perform statistical analysis on each query field combination of data query operations, and to establish an index according to the query field combination when the number of data query operations satisfies a preset automatic index creation condition. The method introduced by the patent only uses the combination of query fields with more query times as a basis to create the corresponding data table index, but does not consider the running time of each index in the SQL statement of the actual environment, and cannot determine that the created index improves the actual query speed.

In Chinese patent application, "Python-based database index optimization method, device and system" (CN201811284394.2), a device is designed, a django frame based on Python is used as a front end, a database operation statement input by a user is received, and an open source tool SQLAdvisor is adopted to process and analyze at the rear end to obtain an index optimization result. The method of the patent analyzes SQL sentences input by developers to obtain index optimization suggestions, and can help SQL create or optimize indexes before online; however, when the data volume of the database table or the distribution of the table data in the actual environment changes, the index created before online cannot match the existing environment, and the query efficiency cannot be guaranteed.

Therefore, in order to further improve the query efficiency of the database index and adapt to the automatic optimization and adjustment of the database index in different service scenarios, it is desirable to provide an improved method and system for optimizing the query efficiency of the database index.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

The invention aims to provide a database index optimization method and system based on virtual indexes, which select the optimal index from the actual operation efficiency of SQL query statements, can adapt to automatic optimization and adjustment of database indexes under different service scenes, does not depend on manual experience any more, improves the index query efficiency of a database system, and improves the query performance of the database, wherein the database comprises but is not limited to a MySQL database. The invention establishes three modules: the system comprises a production module, a replication module and a self-analysis module. The method comprises the steps of synchronizing database data and logs in a production module to a replication module, selecting SQL sentences in slow query logs to perform lexical analysis and syntactic analysis to generate a group of index sets, evaluating and obtaining effective index combinations, namely virtual indexes, calculating the running time of each virtual index, comparing and selecting the best index and automatically updating the best index to a production database to realize the first updating of the virtual index, training and establishing a virtual index evaluation value model by adopting a method of fusing three models of XBOOT, LIGHT and CATBOOST on the basis of data actually generated in the first updating through a self-analysis module, predicting the best index of a production environment with real-time change of the data, automatically updating the model to the production database, and completing the real-time monitoring and maintenance of the index efficiency.

According to an aspect of the present invention, there is provided a database index optimization method based on a virtual index, the method including:

responding to the index updating request, and synchronizing the production database and the log information;

creating a set of virtual indices based on an analysis of corresponding query statements in the log;

selecting and automatically updating one or more best indexes to a production database based on the actual query execution time for each virtual index in the set of virtual indexes; and

the one or more best indexes are again selected and automatically updated to the production database based on the virtual index rating values reflecting the improvements to query execution times before and after each index creation.

According to one embodiment of the invention, the index update request is triggered periodically or in response to production database query efficiency falling below a threshold preset in the log.

According to a further embodiment of the present invention, the creating a set of virtual indices further comprises:

selecting corresponding query sentences in the log to perform lexical analysis and syntactic analysis to generate a candidate index set; and

evaluating and acquiring a virtual index set with index evaluation parameters, wherein the index evaluation parameters comprise: column combinations of equivalent predicate associations behind WHEREs in the query; fields in the ORDER or GROUP need to be included in the combined index and are consistent in sequence; and an overlay index.

According to a further embodiment of the present invention, the reselecting and automatically updating one or more optimal indexes to the production database based on the virtual index rating values reflecting improvements for query execution time before and after creation of each index further comprises:

record query and index features F ═ (F)₁,f₂,…,f_k) (ii) a And

generating training data including the features for training of the virtual index evaluation value model, wherein the data amount is a virtual index combination number (SQL statement number) query times.

According to a further embodiment of the invention, the virtual index evaluation value model is trained and established by adopting a method of fusing XGBOOT, LIGHT TGBM and CATBOST models.

According to a further embodiment of the present invention, the re-selecting and automatically updating one or more best indexes to the production database based on the virtual index rating value reflecting the improvement of the query execution time before and after each index creation further comprises:

obtaining the actual running time of each virtual index in the virtual index set and the characteristic F ' (F ' of query and index '₁,f′₂,…,f′_k)；

By passing

To train and obtain a model M, wherein the model M₁、M₂、M₃Are respectively obtained by XGBOOT, LIGHT TGBM and CATBOOST training, and the optimal parameter Para₁、Para₂、Para₃Are respectively obtained through five-fold cross validation;

by passing

Fitting best model M_best＝a*M₁+b*M₂+c*M₃Where Opt (I, q) is the index I's optimization index to query SQL statement q;

simulating a predicted optimal index using the optimal model for different SQL statements related to the same table in the log,

calculating an optimal index using an index evaluation function Z, I_best＝min(Z|I_i)i∈[0,k]Wherein Z ═ Σ_q∈SOpt (I, q); and

and comparing the index with the minimum virtual index evaluation value with the current application index, and reserving one or more best indexes.

According to a further embodiment of the invention, the Opt (I, q) is the inverse of the difference in execution time before and after creation of the index,

according to another aspect of the present invention, there is provided a virtual index-based database index optimization system, the system including:

a production module configured to: recording log information, monitoring query efficiency, and sending an index updating request at regular time or when the query efficiency is lower than a preset threshold value;

a replication module configured to:

synchronizing production databases, log information, and

creating a set of virtual indices based on an analysis of corresponding query statements in the log; and a self-analysis module configured to:

According to an embodiment of the present invention, the log information includes a user query condition in a preset time period, including an original file of a slow query log, and a statistically generated slow query SQL statement (SQL _ text), a start execution time (start _ time), a statement execution time (query _ time), a lock table time (lock _ time), a record number returned by a query (rows _ send), a number of scanned rows (rows _ extended), and a number of times of statement execution (rows _ count).

According to a further embodiment of the invention, the self-analysis module is further configured to:

record query and askThe characteristic of the lead is F ═ F₁,f₂,…,f_k)；

Generating training data comprising the features for training of a virtual index evaluation value model, wherein the data amount is a virtual index combination number (SQL statement number) query times; and

and training and establishing the virtual index evaluation value model by adopting a method of fusing XGBOOT, LIGHT TGBM and CATBOOST models based on the training data, and performing optimal index prediction based on the virtual index evaluation value model.

Compared with the scheme in the prior art, the database index optimization system and the database index optimization method provided by the invention have the following advantages:

(1) at present, some techniques implement index creation methods based on field word frequency analysis in query logs, SQL statement execution analysis, and other methods, but the query efficiency of indexes is not strongly related to the format of query fields and SQL statements, and the characteristics, data volume, data distribution of a database table, and the like are also considered, so that these techniques cannot generate indexes with optimal query efficiency. The invention provides a simulation operation method through virtual indexing, which can finally obtain the most accurate indexing result, greatly improve the actual query speed of the indexing and reduce the expenditure of hardware resources.

(2) The invention provides a method for selecting the optimal index by analyzing the virtual index combination from the slow query log and simulating the SQL statement operation calculation, which improves the operation effect of the traditional field combination index optimization based on the query log without considering the index in the actual production environment.

(3) The invention provides a brand-new method for improving index query efficiency of a MySQL database, which comprises the steps of automatically creating training samples and collecting data characteristics from SQL query simulation operation based on virtual indexes, and effectively solving the problem of insufficient prediction accuracy of the traditional index creation method by adopting a method of fusing XGBOOT, LIGHT TGBM and CATBOOST machine learning models.

(4) Has the online learning ability and strong adaptability. The traditional index creating method is mainly based on data statistics and analysis, and the method disclosed by the patent continuously trains and automatically optimizes an index evaluation value model by adopting machine learning according to the change of data characteristics of a database, continuously adapts to the requirement of the database on indexes, and finally improves the query performance.

(5) The method has universality, although database query statements and data tables of different services are different and the indexing requirements are different, the method selects the optimal index based on virtual index to copy environment simulation operation, and simultaneously, a virtual index evaluation value model is established by utilizing real data based on machine learning training and can be used for predicting the optimal index, so that the method can be used for optimizing the database index under different scenes of different services.

These and other features and advantages will become apparent upon reading the following detailed description and upon reference to the accompanying drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of aspects as claimed.

Drawings

So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only some typical aspects of this invention and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.

FIG. 1 is an exemplary architecture diagram of a virtual index based database index query efficiency optimization system according to one embodiment of the invention.

FIG. 2 is a flowchart of a method for optimizing the efficiency of database index queries based on virtual indexes, according to an embodiment of the present invention.

FIG. 3 is a flow diagram of a method for performing a first update of a virtual index based on an actual operation best index update, according to one embodiment of the invention.

FIG. 4 is a flowchart of a method for performing a second update of a virtual index based on a virtual index rating value model, according to one embodiment of the invention.

Detailed Description

The present invention will be described in detail below with reference to the attached drawings, and the features of the present invention will be further apparent from the following detailed description.

FIG. 1 is an exemplary architecture diagram of a virtual index based database index optimization system 100, according to one embodiment of the invention. As shown in FIG. 1, the system 100 of the present invention includes a production module 101, a replication module 102, and a self-analysis module 103.

The production module 101 includes a timing unit 104, an abnormality detection unit 105, a user log unit 106, and a production database 107. The production database 107 is a database of real operating services. The user log unit 106 is responsible for recording and counting user query conditions in a preset time period, including a slow query log original file, and counting generated slow query SQL statements (SQL _ text), start execution time (start _ time), statement execution time (query _ time), lock table time (lock _ time), query returned record number (rows _ send), scanned row number (rows _ exposed), statement execution times (rows _ count), and other information. The anomaly detection unit 105 is configured to record and detect whether the execution time of the current query statement is significantly aggravated than the history, generate anomalous query data, and trigger an index update request. The timing unit 104 is used to time synchronize all data and log information of the database and trigger an index update request.

The copy module 102 includes a copy unit 108 and a virtual index unit 109. The replication unit 108 is used to replicate and synchronize the production database and the user log. The virtual index unit 109 decomposes each query statement in the copied user log into query condition information and field information, analyzes and generates an index combination, and finally generates a group of virtual index combinations with index evaluation parameters by combining with the self-analysis module 103.

The self-analysis module 103 includes a calculation unit 110, a prediction unit 111, and an optimization unit 112. The calculation unit 110 is configured to perform corresponding query statement calculations using the various virtual indexes in the copied database again to generate an effect evaluation value that measures the actual effect of the virtual indexes. The prediction unit 111 is used for optimal virtual index training based on data features to realize an index automatic creation process based on data change trend. The optimization unit 112 evaluates and adjusts the prediction results.

Those skilled in the art will appreciate that the database index optimization system of the present invention and its various modules may be implemented in hardware or software, and that the modules may be combined or combined in any suitable manner.

FIG. 2 is a flow diagram of a virtual index based database index optimization method 200 according to one embodiment of the invention. The method 200 begins at step 201 where the production module 101 records log information, monitors query efficiency, and sends an index update request at regular times or upon detection of an anomaly. The log information comprises a slow query log original file, and information such as a statistically generated slow query SQL statement (SQL _ text), a start execution time (start _ time), a statement execution time (query _ time), a lock table time (lock _ time), a record number (rows _ send) returned by the query, a number of scanned rows (rows _ exposed), a statement execution frequency (rows _ count), and the like. The query efficiency criterion is based on a threshold in the slow query log, which is set based on expert experience, for example, the threshold defaults to 1 in the present system, and if the query efficiency is lower than the threshold, an exception alarm will be triggered and recorded in the exception log. The anomaly detection unit 105 in the production module 101 triggers an index update request in response to the production database query efficiency falling below a preset threshold in the slow query log. The timing unit 104 in the production module 101 periodically synchronizes all data and log information of the database to the replication module, triggering an index update request, e.g., the timing unit synchronizes data every 3 months and triggers an index update request.

In step 202, replication module 102 creates a virtual index set based on the binlog log synchronized production database, log information. After the replication and synchronization of the database and the log information are completed by the replication unit 108, the virtual index unit 109 in the replication module 102 selects the corresponding query statements to perform one-by-one analysis according to the index update triggering condition, and the analysis process is completed by a lexical analysis tool Flex and a syntactic analysis tool Bison. As an example, the SQL statement is: SELECT a, B FROM T WHERE C ═ x and a > y ORDER BY B, which can be resolved BY the virtual index unit 109 as follows:

{SELECT:{A,B},FROM:{T},WHERE:{AND:{C:{C＝x},A:{A>y}}},ORDER:{B}}

the virtual index unit 109 generates a set of candidate index sets (e.g., { [ a ], [ C, a, B ], [ C, B, a ] }) according to the parsing result after the parsing is completed. The virtual index unit 109 then acquires a set of valid index combinations (i.e., virtual indexes) based on the index evaluation criterion. The virtual index evaluation criteria are as follows: (1) column combinations of equivalent predicate associations behind WHEREs in the query; (2) fields in ORDER or GROUP need to be included in the combined index and are ordered consistently (3) to include all columns in the query statement (overlay index). Examples of virtual indices with index evaluation criteria parameters are as follows:

	index combination	Evaluation criteria parameters
			Index_1	[A]
Index_2	[C]
			Index_3	[C,A]	[1]
Index_4	[C,A,B]	[1,3]
			Index_5	[C,B,A]	[2,3]

In step 203, the self-analysis module 103 performs a first update of the virtual index based on the actual operation of the optimal index update. The specific update steps are further illustrated in fig. 3.

In step 204, the self-analysis module 103 performs a second update of the virtual index based on the virtual index evaluation value model, and completes the real-time index detection and maintenance. The specific update steps are further illustrated in fig. 4.

FIG. 3 is a flow diagram of a method 300 for performing a first update of a virtual index based on an actual operation best index update, according to one embodiment of the invention. The method 300 starts in step 301, and creates each virtual index (including the un-indexed index) acquired by the virtual indexing unit 109 from the computing unit 110 in the analysis module 103, and executes and records the actual running time of each SQL query.

In step 302, the computing unit 110 records the feature attributes of the query and index as follows:

name (R)	Object	Description of the invention
			S_TYPE	Query q	Query categories (select, insert, update, etc.),delete)
S_TNUM	Query q	Looking up the number of involved tables
			S_RNUM	Query q	Looking up the number of records in the related table (averaging multiple tables)
S_ROWS_EXAMINED	Query q	Querying the number of scanned rows
			S_COST	Query q	Cost of query
S_ROWS_SENT	Query q	Querying number of records returned
			S_ROWS_COUNT	Query q	Number of times of execution of query statement
S_OPERATION	Query q	Number of operators involved in a query
			S_CODE	Query q	Coding conditions of query condition field
S_ZNUM	Query q	Number of types of query field
			I_RNUM	Index I	Number of records in data table where index is located
I_ZNUM	Index I	Number of fields contained in index
			I_CODE	Index I	Coding cases for fields contained in the index

In step 303, training data is generated for model training and building by the prediction unit 111, where the training data includes the actual running time and feature summary F ═ F (F) of each virtual index₁,f₂,…,f_k) And the data quantity is the virtual index combination number SQL statement number table number query times.

At step 304, the actual run times of the virtual indices are compared, the top K best indices are selected as outputs to be fed back to the production module and the indices are automatically created. And finishing the first updating process of the virtual index.

FIG. 4 is a flow diagram of a method 400 for performing a second update of a virtual index based on a virtual index rating value model, according to one embodiment of the invention. The method 400 begins at step 401 by obtaining the index actual run time and feature summary in the first update, F ═ F (F)₁,f₂,…,f_k) Wherein a class feature (e.g., S _ TYPE) is quantized to convert to a class value, a data feature (e.g., S _ RNUM) is binned, and finally the features are summarized as F ═ F'₁,f′₂,…,f′_k)。

In step 402, the model M is obtained by training xgoost, lightgbm and catboost respectively₁、M₂、M₃。

In step 403, the optimal parameters Para of the 3 algorithms are respectively obtained through five-fold cross validation₁、Para₂、Para₃。

At step 404, training obtains model M

Where M reflects the relationship between the index and the query, M (F) → Opt (I, q), where Opt (I, q) is an index of optimization of index I to query SQL statement q. Opt (I, q) is measured by the inverse of the difference in execution time before and after creation of the index, i.e.

In real time, Opt (I, q) cannot be directly obtained, and the value can be predicted, for example, by machine learning.

In step 405, by

Fitting best model M_best＝a*M₁+b*M₂+c*M₃。

At step 406, the predicted best index is simulated for different SQL statements related to the same table in the log, i.e., the best index is simulated

In step 407, the index evaluation function Z of the same table is used to calculate the optimal index, i.e. the index evaluation function Z of the same table is used to calculate the optimal index

I_best＝min(Z|I_i)i∈[0,k]

Wherein Z ═ Σ_q∈sOpt(I,q)。

In step 408, the index with the smallest evaluation value of the virtual index is indexed with the current applicationAnd comparing the rows, and keeping the first K best indexes, namely automatically deleting the indexes which are not in the first K best indexes and automatically creating the indexes which are not created in the first K best indexes. Furthermore, in the case where the production database does not actually create an index, the virtual index may be directly based on the best model M fitted in step 405_bestImplementation of evaluation value Z_I＝∑_q∈sOpt (I, q).

It should be understood that, in an actual database environment, after an index update request is issued, the self-analysis module 103 no longer needs to spend a lot of time to create and execute each SQL query statement under each virtual index, and compares actual running time to select an optimal index, but completes an index real-time detection and maintenance task based on the virtual index evaluation value model.

What has been described above includes examples of aspects of the claimed subject matter. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but one of ordinary skill in the art may recognize that many further combinations and permutations of the claimed subject matter are possible. Accordingly, the disclosed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims.

Claims

1. A database index optimization method based on virtual index is characterized by comprising the following steps:

2. The method of claim 1, wherein the index update request is triggered periodically or in response to production database query efficiency falling below a threshold preset in the log.

3. The method of claim 1, wherein the creating a set of virtual indices further comprises:

4. The method of claim 1, wherein the reselecting and automatically updating one or more best indexes to a production database based on a virtual index rating value reflecting improvements to query execution time before and after each index creation further comprises:

record query and index features F ═ (F)₁，f₂，…，f_k) (ii) a And

5. The method of claim 4, wherein the virtual index rating model is trained and established by a method of fusing XGBOOT, LIGHT TGBM and CATBOOST models.

6. The method of claim 5, wherein reselecting and automatically updating one or more best indexes to the production database based on a virtual index rating value reflecting improvements to query execution time before and after each index creation further comprises:

obtaining an actual query execution time for each virtual index in the set of virtual indices and quantized features F ' (F ' of queries and indices '₁，f′₂，…，f′_k)；

By passing

by passing

calculating an optimal index using an index evaluation function Z, I_best＝min(Z|I_i)i∈[0，k]Wherein Z ═ Σ_q∈SOpt (I, q); and

7. The method of claim 6, wherein Opt (I, q) is the inverse of the difference between the pre-creation and post-creation execution times for the index,

8. a virtual index based database index optimization system, the system comprising:

a replication module configured to:

synchronizing production databases, log information, and

9. The system of claim 8, wherein the log information includes user query conditions in a preset time period, including a slow query log original file, and statistically generated slow query SQL statements (SQL _ text), start execution time (start _ time), statement execution time (query _ time), lock table time (lock _ time), number of records returned by the query (rows _ send), number of scanned rows (rows _ affected), and number of times of statement execution (rows _ count).

10. The system of claim 8, wherein the self-analysis module is further configured to:

record query and index features F ═ (F)₁，f₂，…，f_k)；