CN108460052B

CN108460052B - Method and device for automatically creating index and database system

Info

Publication number: CN108460052B
Application number: CN201710094631.8A
Authority: CN
Inventors: 谢东
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2017-02-22
Filing date: 2017-02-22
Publication date: 2022-11-01
Anticipated expiration: 2037-02-22
Also published as: WO2018153210A1; CN108460052A

Abstract

The invention discloses a method and a device for automatically creating an index and a database system. The method comprises the following steps: analyzing a target data table and operating conditions in the data query statement; judging whether data distribution needs to be analyzed, and when the data distribution needs to be analyzed, calculating the distribution condition of table data meeting the operation conditions in the target data table; and when the distribution condition of the table data meets a preset index creation strategy, creating the operating condition as the index of the target data table. Compared with the existing method for creating the index by manual analysis and estimation, the method has higher accuracy and can effectively improve the execution efficiency of the SQL statement.

Description

Method and device for automatically creating index and database system

Technical Field

The invention relates to the technical field of databases, in particular to a method and a device for automatically creating an index and a database system.

Background

The relational database is a database built on the basis of a relational database model, and data in the database are processed by means of concepts and methods such as set algebra and the like. With the development of information technology, data has penetrated into various industries and applications, and relational databases have been widely used in various industries.

Under the technical condition of the existing database, a database application model is established in the following way: designing a table structure by analyzing the logical relationship of data; through analyzing the application scene of the application to the database, estimating which possible SQL statements are in advance, analyzing possible query conditions, and then selecting one or more columns of the table as index fields to create indexes. It is known that the index is created in advance according to the experience of the designer. Over time, the demand for data usage varies. For example: holidays may require new query SQL statements, underyear reports may require new complex report SQL statements, and the system may require an index corresponding thereto.

It can be seen that the current index of relational databases is created based on the need collection and the experience of the designer. If the demand collection may be inaccurate and the designers may have uneven capabilities, the indexing may be inaccurate, unreasonable and inefficient. Moreover, the index is manually created at the beginning of system design or manually optimized and adjusted in an upgrading mode at the later stage, and the indexing flexibility is relatively low. With the continuous change of real world data characteristics, for the situations of complex data structure and large data volume, which are collectively referred to as big data, the current index creation method for the relational database cannot meet the requirements of people on higher data processing efficiency and flexibility.

Disclosure of Invention

The invention provides a method and a device for automatically creating an index and a database system, which are used for solving the problems of low creating efficiency and low flexibility in the existing index creating method.

In order to achieve the purpose, the invention adopts the following technical scheme;

according to an aspect of the present invention, there is provided a method of automatically creating an index, comprising:

analyzing a target data table and operating conditions in the data query statement;

judging whether data distribution needs to be analyzed, and when the data distribution needs to be analyzed, calculating the distribution condition of table data meeting the operation conditions in the target data table;

and when the distribution condition of the table data meets a preset index creation strategy, creating the operating condition as the index of the target data table.

Preferably, analyzing the target data table and the operating condition in the data query statement specifically includes:

carrying out syntax check on the data query statement, and judging whether the data query statement is an invalid statement;

and when the statement is judged to be valid, acquiring the target data table and the operating condition according to the grammar rule of the data query statement.

Preferably, the determining whether the data distribution needs to be analyzed specifically includes:

acquiring a creating basis of the index creating strategy, and judging whether the target data table is in a preset data table white list or not according to the creating basis;

and if so, continuously judging whether data distribution needs to be analyzed according to the creation basis, otherwise, forbidding to continuously create the index for the target data table.

Preferably, calculating the distribution of the table data satisfying the operating condition in the target data table specifically includes:

reading a preset table data scanning strategy;

acquiring table data meeting the operating conditions from the target data table according to the scanning strategy;

and calculating the distribution condition of the table data according to a preset statistical method.

Preferably, when the distribution condition of the table data meets a preset index creation policy, creating the operating condition as an index of the target data table, specifically including:

obtaining the screening efficiency of each operating condition according to the distribution condition of the table data, sorting the operating conditions according to the screening efficiency, and establishing a composite index for the sorted operating conditions;

creating a single index in the leader of the compound index;

judging whether the result set proportion of the composite index and the single index is greater than a preset threshold value: if so, taking the single index as the index of the target data table; and if not, taking the composite index as the index of the target data table.

Preferably, after the creating the operating condition as the index of the target data table, the method further comprises:

and monitoring whether the data query statement and/or the data table in the database are changed or not in real time, and determining to build an index or delete the index according to the change condition when the change is monitored.

Preferably, the method further comprises:

and when judging that the data distribution does not need to be analyzed, directly creating the index of the target data table by using the operating condition.

Preferably, the operation condition includes a condition corresponding to any one or more of a where screening, a group by grouping, and an order by sorting.

According to an aspect of the present invention, there is provided an apparatus for automatically creating an index, comprising:

the analysis unit is used for analyzing the target data table and the operating conditions in the data query statement;

a calculating unit, configured to determine whether data distribution needs to be analyzed, and when it is determined that data distribution needs to be analyzed, calculate a distribution situation of table data that satisfies the operating condition in the target data table;

and the creating unit is used for creating the operating condition as the index of the target data table when the distribution condition of the table data meets a preset index creating strategy.

Preferably, the analysis unit is specifically configured to:

performing syntax check on the data query statement, and judging whether the data query statement is an invalid statement;

and when the sentence is judged to be a valid sentence, acquiring the target data table and the operating condition according to the grammar rule of the data query sentence.

Preferably, the computing unit is specifically configured to:

when judging whether data distribution needs to be analyzed, acquiring a creating basis of the index creating strategy, and judging whether the target data table is in a preset data table white list according to the creating basis;

Preferably, the computing unit is specifically configured to:

reading a preset table data scanning strategy;

and calculating the distribution condition of the table data according to a preset statistical device.

Preferably, the creating unit is specifically configured to:

creating a single index in the leader of the compound index;

judging whether the result set proportion of the composite index and the single index is greater than a preset threshold value: if so, taking the single index as the index of the target data table; otherwise, the compound index is used as the index of the target data table.

Preferably, the apparatus further comprises a monitoring unit for:

and after the operating condition is created as the index of the target data table, monitoring whether a data query statement and/or a data table in a database are changed or not in real time, and when the change is monitored, determining to establish the index or delete the index according to the change condition.

Preferably, the creating unit is further configured to:

According to an aspect of the present invention, there is provided a database system comprising the above-mentioned apparatus for automatically creating an index.

The invention has the following beneficial effects:

according to the method, the device and the database system for automatically creating the index, which are provided by the invention, the index is automatically created for the SQL statement to use by analyzing the composition structure of the current SQL statement and based on the distribution condition of the table data in the table. Compared with the existing method for creating indexes by manual analysis and estimation, the method has higher accuracy and can improve the execution efficiency of the SQL sentences.

The above description is only an overview of the technical solutions of the present invention, and the present invention can be implemented in accordance with the content of the description so as to make the technical means of the present invention more clearly understood, and the above and other objects, features, and advantages of the present invention will be more clearly understood.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a method for automatically creating an index according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating an execution flow of insert statements provided by the present invention;

FIG. 3 is a schematic diagram illustrating a new SQL statement execution flow provided by the present invention; .

FIG. 4 is a schematic structural diagram of an apparatus for automatically creating an index according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a data publishing model provided by the present invention;

FIG. 6 is a schematic diagram of a data consumption model provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides a method for automatically creating an index, which specifically comprises the following steps as shown in fig. 1:

step S101, analyzing a target data table and operating conditions in a data query statement;

here, the data query statement is an SQL statement. And when detecting that the system starts an index creating command, automatically acquiring the SQL statement. The data source mode can be obtained from an application program or a database related data dictionary table. For a data table, it is necessary to obtain as many related SQL statements as possible in order to build an efficient and comprehensive index.

When analyzing the obtained SQL sentences, the SQL sentences which conform to ANSI SQL syntax are split, and the related components of SQL execution efficiency are mainly considered. The split SQL statement comprises a target data table name and an operation condition. Wherein, the key relevant operation comprises conditions corresponding to the operations of where screening, group by grouping and order by sorting. And classifying and storing the analysis results according to the data table.

Preferably, the analyzing the target data table and the operating condition in the data query statement specifically includes:

reading table structure information in advance, carrying out syntax check on SQL sentences in advance, and acquiring a target data table and operating conditions according to syntax rules of data query sentences when the SQL sentences are judged to be valid sentences; when the statement is determined to be invalid, the data query statement is not analyzed. By excluding invalid SQL statements in advance, the speed of creating the index can be effectively improved.

And step S102, judging whether data distribution needs to be analyzed, and calculating the distribution situation of the table data meeting the operation conditions in the target data table when the data distribution needs to be analyzed.

Wherein, judging whether the data distribution needs to be analyzed specifically comprises:

acquiring a creating basis of an index creating strategy, and judging whether a target data table is in a preset data table white list or not according to the creating basis; and if so, continuously judging whether the data distribution needs to be analyzed according to the creation basis, otherwise, forbidding to continuously create the index for the target data table.

Here, the index creation policy is to create index expert knowledge (conditions or rules to be satisfied for creating an index), is pre-established, and can be periodically maintained, modified and deleted. The creation policy here includes according to SQL statements, according to SQL statements and data distribution, and according to predefined rules. Other pre-programmed strategies are certainly possible and are not specifically limited herein.

When the SQL-based statement of the index is created and data distribution does not need to be analyzed, the field to be operated is used as the index of the target data table. For example, in the SQL statement of the key value query, the index is required to be established by the where condition field. When the basis of creating the index is SQL statements and data distribution, adding the consideration of the data distribution of the table, and determining how to create the high-efficiency index according to the specific distribution condition; when the basis for creating the index can also be a predefined rule, whether the index needs to be created is judged according to a preset rule. For example, for a table with frequently inserted data, the creation of an index is prohibited, depending on the actual running conditions of the application.

Wherein, in calculating the distribution of the table data satisfying the operation condition in the target data table, the method specifically comprises the following steps:

reading a preset table data scanning strategy;

acquiring data meeting the operating condition table from a target data table according to a scanning strategy;

The scanning strategy includes information such as trigger condition and/or scanning granularity. The trigger condition may be a timing scan, and may be a new SQL statement trigger. By setting a trigger condition, when a new SQL statement needs to be run and is changed, the system can automatically create an index corresponding to the SQL statement or delete an index which is not needed any more. The scan granularity may be a full scan, or may be sampled by a percentage of data.

And when the distribution condition of the data in the table is calculated, analyzing the distribution condition of the data in the table based on the table composition structure. The analysis data includes fields of operation, data type of each field, distribution of each field data, cardinality of the data. The distribution of the data includes the data divided into several columns (belonging fields), and the statistical information such as maximum, minimum, average, variance and the like of the data in each column, and can also be the discrete distribution of the data. And are not intended to be unduly limiting herein. The reading scanning and the calculation can be combined, and the relational operation capability of the database is fully utilized. The calculated results are also sorted and stored according to the data table, and are embodied in each column.

And step S103, when the distribution condition of the table data meets a preset index creation strategy, creating the operation condition as the index of the target data table.

The index creation strategy needs to be further combined with specific data distribution conditions to determine whether preset establishment conditions are met, and when the preset establishment conditions are met, the index creation strategy can be further created. The preset establishment condition is a condition for ensuring the establishment of the efficient index of the index, and specifically includes the following steps:

creating a single index in the leader of the composite index;

judging whether the ratio of the composite index to the result set of the single index is greater than a preset threshold value: if the index is larger than the preset value, the single index is used as the index of the target data table; otherwise, the compound index is used as the index of the target data table.

The composite index is an index including a plurality of operating conditions, and the leader of the composite index is a single index created by the operating condition with the highest screening efficiency. The manner in which the above-described index is established will be described later in a specific embodiment. According to the invention, the index is created by combining the data distribution condition, so that the SQL statement execution efficiency is highest.

Based on the above, the method for automatically creating the index provided by the invention automatically creates the index for the SQL statement to use based on the distribution condition of the table data in the table by analyzing the composition structure of the current SQL statement. Compared with the existing method for creating indexes by manual analysis and estimation, the method has higher accuracy and can improve the execution efficiency of the SQL sentences.

The method for automatically creating an index according to the present invention is described below with reference to specific embodiments. Taking the user calling process as an example, the information is recorded in the call log table, and the call log needs to be queried. As shown in table 1, the calllog table calllog main fields include: serial number, calling number, called number, start time, end time, details, etc.

TABLE 1

The system needs to satisfy the operator's detailed query for a certain user over a certain period of time. For example, a call log table of a certain day is queried according to a calling number, and the obtained SQL statement is:

select*from calllog where cno＝’15800000000’

and sdate>＝’2016-02-01 00:00:00’

and sdate<’2016-02-02 00:00:00’

after the SQL sentence is split, the method is divided into the following parts:

the lookup table name is calllog.

The where conditions are in part:

return field list: * And represents all fields.

After the SQL statement is analyzed, the two fields are screened at the same time, and the results are respectively as follows:

field cno, single value screening, operating conditions are: cno equals 15800000000;

field sdate, range screening, operating conditions are: sdate in the interval [2016-02-01 00,2016-02-02 00.

Then, the distribution of the data on the table and the columns is calculated, and two fields of the calling number and the starting time are considered according to the SQL analysis result. For example, the calculation results are shown in table 2:

TABLE 2

In this embodiment, the analysis index module, in combination with the SQL analysis result and the data distribution result, relates to a plurality of fields, reads the corresponding policy, and performs the following process analysis:

strategy one: under the condition of a plurality of screening conditions, the screening efficiency is high, the screening efficiency is placed foremost, and the efficiency is higher

Because the ratio of the first condition result is 0.0005%, the screening efficiency is high, the ratio of the second condition result is 1.9%, and the screening efficiency is low, the sequence of the fields in the index is as follows: cno, sdate. Thus, the index that takes into account all screening cases is: index one, calllog (cno, sdate).

And (2) strategy two: the leader in the compound index may also create the index, but may not be present at the same time. Therefore, an index can also be created that is: index two, calllog (cno)

Strategy three: generally, the result set occupation ratio of the screening record number is greater than a preset threshold value, here, greater than 15%, the screening efficiency is low, and the creation of the index is not recommended. From condition one to condition three, the resulting record ratio =5/460=1.1%. Through comparison, the screening efficiency is less than 15%, and the screening efficiency is high, so that the index I is a suggested index, the index II cannot exist at the same time, and the index II is not suggested. By analysis, the final proposed index is: index one, calllog (cno, sdate).

In order to make the addition and deletion of the index more flexible, the method for automatically creating the index provided by the embodiment of the invention further comprises the following steps after the creation of the index is completed:

and monitoring whether the data query statement and/or the data table in the database are changed in real time, and determining to establish an index or delete the index according to the change condition when the change is monitored.

When the data in the database data table is monitored to change, for example, the cardinality or the data distribution of the data change, the distribution of the table data in the table is recalculated. The calculation mode can be incremental calculation or full calculation. And re-establishing the index according to the re-calculated data distribution condition.

As shown in fig. 2, the example of the execution flow of insert statement is described, which includes the following steps:

step S201, data acquisition is started.

And step S202, executing insert statement. After the database application program collects a data set from a data source, the data set is packaged into insert statements and then sent to a database for execution. After the execution is successful, the data is saved in a database table.

In step S203, data distribution is calculated. And detecting the change of the table data in the database, and recalculating the distribution index of the data in the table. The calculation mode can configure a strategy, and for example, the calculation mode can be incremental calculation or full calculation. Should not significantly affect system performance. The calculation result is used as the basis for automatically creating the index.

And step S204, ending.

As shown in fig. 3, the new SQL statement execution flow diagram mainly includes the following steps:

step S301 starts.

Step S302, analyzing indexes needed by the new SQL statement. By analyzing the SQL statement where screening condition part, group by grouping part and order by sequencing part, how to create index is found out, so that the efficiency of the SQL statement is the highest.

Step S303, based on the table index definition information, determines whether there is a corresponding index. If not, an index needs to be created, and the step S304 is carried out; if so, the SQL statement is executed without recreating the index, and the step S305 is performed.

In step S304, an index is automatically created. It should be noted that, in order not to affect the system performance, step S340 may be asynchronous, and step S370 is performed without waiting for the execution result.

In step S305, an SQL statement is executed. The database executes the SQL statement and returns the calculation result to the database application program.

And step S306, ending.

Based on the above, when a new SQL statement needs to be run and is changed, the system can automatically create the index corresponding to the new SQL statement and delete the index which is not needed any more; when the data distribution is different and the data bases in the columns are different, the method for automatically creating the index can automatically adapt to the change all the time, deletes the index which is not suitable any more and creates an effective index.

An embodiment of the present invention further provides an apparatus for automatically creating an index, as shown in fig. 4, including:

an analysis unit 41 for analyzing the target data table and the operation condition in the data query statement;

a calculation unit 42 for judging whether the data distribution needs to be analyzed, and when it is judged that the data distribution needs to be analyzed, calculating a distribution situation of table data satisfying the operation condition in the target data table;

a creating unit 43, configured to create an operation condition as an index of the target data table when the distribution of the table data meets a preset index creation policy; and when the data distribution does not need to be analyzed, directly creating the index of the target data table by the operating condition.

The operation conditions comprise conditions corresponding to any one or more of the operations of where screening, group by grouping and order by sequencing.

Further, the analysis unit 41 is specifically configured to:

and when the sentence is judged to be a valid sentence, acquiring a target data table and the operating condition according to the grammar rule of the data query sentence.

Further, the calculating unit 42 is specifically configured to:

when judging whether data distribution needs to be analyzed, acquiring a creating basis of an index creating strategy, and judging whether a target data table is in a preset data table white list according to the creating basis;

and if so, continuously judging whether the data distribution needs to be analyzed according to the creation basis, otherwise, forbidding to continuously create the index for the target data table.

Further, the calculating unit 42 is specifically configured to:

reading a preset table data scanning strategy;

acquiring table data meeting the operating conditions from a target data table according to a scanning strategy;

Further, the creating unit 43 is specifically configured to:

creating a single index in the leader of the composite index;

and when judging whether the ratio of the composite index to the single index is greater than a preset threshold value: if the index is larger than the preset value, the single index is used as the index of the target data table; otherwise, the compound index is used as the index of the target data table.

Further, the apparatus further comprises a monitoring unit for:

and after the operating condition is created as the index of the target data table, monitoring whether the data query statement and/or the data table in the database are changed or not in real time, and when the change is monitored, determining to establish the index or delete the index according to the change condition.

The invention also provides a database system which comprises the automatic index creating device in the embodiment. The system can be applied to a data publishing model and a data consuming model.

Referring to fig. 5, the structural diagram of the data publishing model provided by the present invention is that the database application is used as a publisher role, which is simpler. The data publishing model includes the following parts:

publisher (database application): the database application program function is to collect data sets from data sources and send the data sets to a database system.

A database: and storing the data set and managing the data.

The automatic index creating device comprises: for performing the automatic index creation method described above.

For the designated flow of each part in the model, reference may be made to the execution sequence in fig. 2, which is not described here.

Referring to fig. 6, a schematic structural diagram of a data consumption model provided by the present invention is shown. After the database system is established, the data of the data source is acquired to the system and can be provided for an application program to use, and the use scene of the data consumption model is to inquire, modify and delete the data in the database. The consumption model comprises the following parts:

consumer (database application): and converting the data use request of the user into SQL (structured query language) statements of types such as query, modification and deletion, sending the SQL statements to a database system for execution, and returning a result.

The automatic index creating device: and acquiring the SQL statements, and automatically creating indexes if no needed indexes exist. And meanwhile, monitoring the change of the data in the database, and recalculating the distribution index of the data in the table if the data changes.

A database: and after the saved data is calculated, returning the data to the application program.

For the designated flow of each part in the model, reference may be made to the execution sequence in fig. 3, which is not described here.

The database system provided by the invention optimizes the database design, and the operation on the data comprises two aspects of storage and use. When the data is stored, the application program belongs to a publisher of the data; the application belongs to the consumer of the data when the data is used. As a publisher, focus on data structures; as a consumer, attention is paid to the index. After the method and the system are adopted, the composition structure of the SQL statement is automatically analyzed, the index is automatically created for the SQL statement to use based on the table composition structure and the distribution condition of data in the table, a consumer is not limited by the limitation of fixed index in the initial stage of system construction, and the database design is more reasonable. Under the large data environment, the data volume is large, the later data use requirements are unknown or variable, and the beneficial effects of the invention are more obvious.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device embodiments, since they are substantially similar to the method embodiments, reference may be made to some of the descriptions of the method embodiments for relevant points. Also, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrases "comprising a," "8230," "8230," or "comprising" does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

In addition, those skilled in the art will appreciate that all or part of the steps of implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for automatically creating an index, comprising:

when the distribution condition of the table data meets a preset index creation strategy, creating the operating condition as the index of the target data table;

when the distribution condition of the table data meets a preset index creation policy, creating the operating condition as an index of the target data table, specifically including:

creating a single index in the leader of the compound index;

judging whether the result set ratio of the composite index and the single index is greater than a preset threshold value: if so, taking the single index as the index of the target data table; and if not, taking the composite index as the index of the target data table.

2. The method of claim 1, wherein analyzing the target data table and the operating conditions in the data query statement specifically comprises:

3. The method of claim 1, wherein determining whether data distribution needs to be analyzed specifically comprises:

4. The method according to claim 1 or 3, wherein calculating the distribution of the table data satisfying the operating condition in the target data table specifically comprises:

reading a preset table data scanning strategy;

5. The method of claim 1, wherein after the creating the operating condition as an index to the target data table, the method further comprises:

6. The method of claim 1, wherein the method further comprises:

and when the data distribution does not need to be analyzed, directly creating the index of the target data table by using the operating condition.

7. The method of claim 1, wherein the operating condition comprises a condition corresponding to any one or more of a where filter, group by grouping, order by ordering.

8. An apparatus for automatically creating an index, comprising:

the creating unit is used for creating the operating condition as the index of the target data table when the distribution condition of the table data meets a preset index creating strategy;

the creating unit is specifically configured to:

creating a single index in the leader of the compound index;

9. The apparatus of claim 8, wherein the analysis unit is specifically configured to:

10. The apparatus of claim 8, wherein the computing unit is specifically configured to:

11. The apparatus according to claim 8 or 10, wherein the computing unit is specifically configured to:

reading a preset table data scanning strategy;

12. The apparatus of claim 8, further comprising a monitoring unit to:

13. The apparatus of claim 8, wherein the creating unit is further configured to:

14. The apparatus of claim 8, wherein the operating condition comprises a condition corresponding to any one or more of a where filter, group by grouping, order by ordering.

15. A database system comprising the apparatus for automatically creating an index according to any one of claims 8 to 14.