CN111274270B

CN111274270B - Statistical information processing and using method of database optimizer and storage device

Info

Publication number: CN111274270B
Application number: CN202010111247.6A
Authority: CN
Inventors: 丁晓晖; 谭忠涛; 王兴家
Original assignee: Beijing Dongfang Jinxin Technology Co ltd
Current assignee: Beijing Dongfang Jinxin Technology Co.,Ltd.
Priority date: 2020-02-24
Filing date: 2020-02-24
Publication date: 2020-09-18
Anticipated expiration: 2040-02-24
Also published as: CN111274270A

Abstract

The invention relates to a statistical information processing and using method and a storage device of a database optimizer, wherein the method comprises the following specific processes: s1, constructing multilayer statistical information including specific constraint condition statistical information, hotspot statistical information and traditional statistical information; s2, obtaining the selectivity of the specified constraint condition, the hotspot interval histogram statistical information and the full-interval histogram statistical information from the multi-layer statistical information in a layered searching mode, and further obtaining the estimated line number of the specified constraint condition. The invention adopts the storage structure design of multi-layer statistical information with different precisions, reduces the coupling of the system, and the high-precision statistical information dynamically constructed based on the service query characteristics is pertinently used for the optimization of specific query and specific data interval query.

Description

Statistical information processing and using method of database optimizer and storage device

Technical Field

The invention relates to a statistical information processing and using method and storage equipment of a database optimizer, and relates to the technical field of database optimization.

Background

Cost-based query optimization cbo (cost Base optimizer) may also be referred to as cost optimization, physical optimization, and the main flow thereof is to enumerate various to-be-selected physical query paths, and calculate the costs of the to-be-selected paths according to context information, thereby selecting the path with the minimum cost.

The physical optimization of the database needs to calculate the cost of various physical paths, and the estimation process of the cost depends on the statistical information of the database. Whether the statistical information can accurately describe the distribution of data in the table is one of important conditions for determining the accuracy of the cost evaluation. Common table-level statistics are shown in table 1, and column-level statistics are shown in table 2:

TABLE 1

T_tuple	Total number of lines
		T_page	Total number of pages

TABLE 2

In the case of a large amount of data, the statistical information is generated by randomly sampling a part of data lines, instead of using all the data. Increasing the sample space may increase the significance of the statistics, but may reduce the performance of the statistics calculations, requiring a balance between accuracy and performance. Through statistical information, the cost estimation system can estimate the proportion of data filtered by the constraint condition to the total data quantity, and the proportion is called the degree of selectivity. The selectivity is the basis for estimating the number of rows in the result set, and the accuracy of the selectivity directly influences the path selection of the optimizer.

In conventional statistical information processing, the method for improving the statistical accuracy is to increase the sample capacity, i.e. increase the sampling rate, and the limitations are that: 1) an excessively high sampling rate may reduce performance of statistical information calculation and increase maintenance costs. In a typical service scenario with data being updated continuously, statistical information also needs to be updated synchronously, so that the performance of database query is seriously influenced; 2) there is a lack of statistical and feedback mechanisms for the data access characteristics that occur in the actual query.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide a statistical information processing and using method for a database optimizer and a storage device, which effectively improve the accuracy of cost evaluation and improve the performance of database query by recording data access characteristics in actual query and using multi-layer statistical information.

In order to achieve the purpose, the invention adopts the technical scheme that:

in a first aspect, the present invention provides a statistical information processing and using method for a database optimizer, which comprises a specific process S1 of constructing multi-layer statistical information including specific constraint statistical information, hotspot statistical information and traditional statistical information;

s2, obtaining the selectivity of the specified constraint condition, the hotspot interval histogram statistical information and the full-interval histogram statistical information from the multi-layer statistical information in a layered searching mode, and further obtaining the estimated line number of the specified constraint condition.

Further, the specific process of step S1 is as follows:

s11, establishing statistical information of specific constraint conditions, and storing the specific constraint statistical information in a selection degree mode;

s12, establishing statistical information of the hot spot interval, and storing the histogram statistical information of the hot spot interval;

and S13, establishing traditional statistical information, and storing the full-interval histogram statistical information of the traditional statistical information.

Furthermore, the constraint condition refers to a filtering constraint condition used in a WHERE clause in database query, the query comprises a plurality of tables, the optimizer decomposes the filtering condition according to the table to which the filtering condition belongs to form a single table condition, and each corresponding single table condition is a specific constraint condition.

Further, the specific process of storing the specific constraint statistical information in a selectivity manner is as follows:

recording the use frequency of the constraint conditions in historical query, storing and updating the selectivity of the constraint conditions N before the use frequency is ranked, storing specific constraint statistical information in a selectivity mode, and calculating the value of the selectivity by the actual line number of single table scanning/the total line number of the table returned during query execution.

Further, the specific process of S12 is as follows: and (3) historical query of the data is carried out, the data is counted according to hit intervals on the column histogram, interval data histograms of M buckets are established for the hit intervals N before the access frequency ranking, and the histogram information of the M buckets is stored.

Further, the specific process of S2 is as follows:

if the specified constraint condition can be hit in the specific constraint statistical information, returning the estimated line number;

if the specified constraint condition is not hit in the specific constraint statistical information, trying to estimate the line number by using the hotspot statistical table information, if the constraint condition is range query and the query range can be found, calling hit, and returning the estimated line number by using the hotspot statistical information; if the constraint condition is range query and a part of the query range can be found, the query range is called partial hit, and at the moment, hotspot statistical information and traditional statistical information are used for returning the estimated line number;

if the constraint condition is not hit in the hotspot statistical information, returning the estimated line number by adopting the traditional statistical information.

In a second aspect, the present invention further provides a storage device, where the storage device stores computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer.

In a third aspect, the present invention further provides a computer program, which includes computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer.

The technical scheme adopted by the invention has the following advantages:

1. because the global high sampling rate does not have feasibility and practicability in an actual service scene, the method carries out modeling and statistics on data access characteristics generated in database query, marks the most common query and the most common data interval, adopts the high sampling rate for the local data, effectively improves the accuracy of hotspot query cost evaluation by recording and using a multi-layer statistical information mode, and further outputs a more optimal query plan;

2. the invention fully utilizes the data access characteristics in the actual query, pertinently and dynamically carries out specific query and specific data interval, improves the sampling rate of statistical information, and has better performance and lower establishment and maintenance cost compared with the global high sampling rate;

3. the invention adopts the storage structure design of multi-layer statistical information with different precisions, reduces the coupling of the system, and the high-precision statistical information dynamically constructed based on the service query characteristics is pertinently used for the optimization of specific query and specific data interval query;

in conclusion, the method and the device can be widely applied to the cost evaluation optimizer.

Drawings

Fig. 1 is a schematic view of a statistical information hierarchical searching process according to a first embodiment of the present invention;

FIG. 2(a) is an example of a full-range histogram according to a first embodiment of the present invention;

fig. 2(b) is an example of a hotspot interval histogram in the first embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

Example one

The statistical information processing and using method of the database optimizer provided by the embodiment comprises the following contents:

s1, whether the statistical information can accurately describe the distribution of the data fields in the table is one of the important conditions for determining the accuracy of the cost evaluation of the optimizer. Constructing multilayer statistical information comprising specific constraint condition statistical information, hotspot statistical information and traditional statistical information, and the specific process is as follows:

s11, establishing statistical information of specific constraint conditions

Constraints refer to filtering constraints used in the WHERE clause in a database query. The query includes a plurality of tables, and the optimizer decomposes the filter conditions according to the tables to which the filter conditions belong to form single table conditions. Each corresponding single table condition is a specific constraint condition.

The use frequency of the constraint conditions in the history query is recorded, the constraint conditions of the top N (for example, N is 10) of the use frequency are ranked, and the selection degree is stored and updated. The constraint-specific statistics are stored in a selective manner. The value of the degree of selection is calculated from the actual number of rows of the single table scan/the total number of rows of the table that are returned when the query is executed.

S12, establishing statistical information of hot spot intervals

Taking column t1.a in table 6 as an example, the histogram is divided into a 5-bucket equal frequency histogram, and the table lists the edge values of the histogram, and the edges divide the data into 5 bins. In real traffic, for example, a range query t1.a >0and t1.a <100 for column t1.a with the bin falling within the first bucket of the histogram, the access frequency of bin [0,218] is increased once after the query is executed, and bin [0,218] is the hit bin of the constraint.

For historical query of data, statistics is carried out on hit intervals on column histograms, and interval data histograms of M buckets are established for hit intervals with the access frequency ranked N (for example, N is 10). Table 5 lists the histogram statistics for a typical hotspot interval: including the edge values of the equal frequency histogram of the columns, intervals and intervals.

S13, establishing traditional statistical information

Taking the source database postgresql as an example, the generation of the single-column statistical information can be completed by executing the analyze command. Such information is referred to as conventional statistics in this embodiment and includes the table level statistics and column level statistics described above.

For example, the column-level statistical information may be calculated by using a full-interval histogram, where the column-level statistical information is divided into S intervals, and if a certain interval of the S intervals is a hot-spot interval, the hot-spot statistical information is a histogram of M intervals created by subdividing the interval, and is a histogram of local data created for the hot-spot interval, for example, fig. 2(a) is a full-interval histogram of conventional statistical information, and fig. 2(b) is an interval histogram of hot-spot statistical information.

In summary, the specific constraint statistical information is higher than the hotspot statistical information in accuracy than the traditional statistical information. The specific constraint statistics are smaller than the hotspot statistics than the traditional statistics over the coverage area. In the use of the statistical information, the statistical information with higher accuracy is preferentially used.

S2, as shown in fig. 1, obtaining the selectivity of the specified constraint condition from the multi-layer statistical information by using a hierarchical search method, wherein when the database optimizer requires to calculate the estimated line number of the specified constraint condition, the specific process is as follows:

if a given constraint hits in a particular constraint statistic, i.e., a single table constraint used by a query, for example, can be found in table 4, then the estimated number of rows returned is the total number of rows (T) in the table × of the corresponding condition's selectivity stored in table 4_tuple)；

If the specified constraint condition is not hit in the specific constraint statistical information, trying to estimate the line number by using the hotspot statistical table information, if the constraint condition is range query and the query range can be found in the table 5, calling the line number as hit, and returning the estimated line number; for example, for a range query t1.a >0and t1.a <100 for column t1.a, whose bin falls within bin [0,218] of the histogram of Table 5, the number of estimated rows returned is calculated as:

total number of rows (T) in (100-0)/(218-0) × 1/(M × N) × Table_tuple)

If the constraint condition is range query and a part of the query range can be found in the table 5, the query range is called partial hit, and at this time, the estimated line number is calculated and returned by using the table 5 and the table 6 together; for example, for a range query of column t1.a, t1.a >300and t1.a <500, whose bins fall within bin [218,410] of the histogram of Table 5 and bin [410,611] of the histogram of Table 6, the number of estimated rows returned is calculated as:

the total number of rows (T) of (410-300)/(410-218) × 1/(M × N) + (500-410)/(611-410) × 1/N) ×_tuple)

If the constraint misses in the hotspot statistics, the estimated number of rows is returned using the statistics in table 6. For example, for a range query for column t1.a, t1.a >500and t1.a <600, whose bin falls within the histogram's table 6 bin [410,611], the number of estimated rows returned is calculated as:

(600 + 500)/(611 + 410) × 1/N × table_tuple)

In summary, the statistical information processing and using method of the database optimizer provided by the invention can improve the speed and accuracy of calculating the estimated line number of the hot point constraint condition in the system, thereby improving the accuracy of cost evaluation of the database system and obtaining a better query plan.

Further, the example of the multi-layer statistical information construction process of this embodiment specifically includes:

by means of query feedback, when a query is executed, historical constraints are recorded, and the historical constraints are represented as shown in table 3:

TABLE 3

Execution time	Constraint conditions
		11:00:00	t1.a＝10
10:50:00	t1.a>5and t1.a<20
		10:40:00	t1.a＝10
10:20:00	t1.a>8and t1.a<15
		…	…

According to the execution load of the database, the selectivity of the constraint condition of the TOPN frequency is counted and stored in the specific constraint condition statistical information when the database is idle. And counting the constraint condition interval of the using frequency TOP N, counting a histogram of the constraint condition interval and recording the histogram into hotspot statistical information.

Simplified constraint-specific statistics examples, simplified hotspot-specific statistics examples, and simplified conventional statistics are shown in tables 4-6

TABLE 4

TABLE 5

Column(s) of	Hot spot interval	Histogram of the data
			t1.a	[0,218]	{1,20,51,90,120,218}
t1.a	[218,410]	{218,260,280,310,350,410}
			…	…	…

TABLE 6

Column(s) of	Histogram of the data
		t1.a	{0,218,410,611,804,1000}
…	…

Example two

The present embodiment further provides a storage device, where the storage device stores computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer of the first embodiment.

EXAMPLE III

The present embodiment further provides a computer program, which includes computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer of the first embodiment.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the scope of protection thereof, and although the present application is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: after reading this application, those skilled in the art will be able to make various changes, modifications and equivalents to the embodiments of the application, which are within the scope of the claims of this application.

Claims

1.A statistical information processing and using method of a database optimizer is characterized by comprising the following specific processes:

s1, constructing multilayer statistical information including specific constraint condition statistical information, hotspot statistical information and traditional statistical information, wherein the specific process is as follows:

s13, establishing traditional statistical information, and storing the full-interval histogram statistical information of the traditional statistical information;

s2, obtaining the selectivity of the specified constraint condition, the hotspot interval histogram statistical information and the full-interval histogram statistical information from the multilayer statistical information by using a layered search mode, and further obtaining the estimated line number of the specified constraint condition, wherein the specific process is as follows:

2. The statistical information processing and using method of a database optimizer of claim 1, wherein the constraint condition refers to a filter constraint condition used in a WHERE clause in a database query, the query includes a plurality of tables, the optimizer decomposes the filter condition according to the table to which the filter condition belongs to form a single table condition, and each corresponding single table condition is a specific constraint condition.

3. The statistical information processing and using method of the database optimizer of claim 1, wherein the specific process of storing the specific constraint statistical information in a selective manner is:

4. The statistical information processing and using method of database optimizer as claimed in claim 1, wherein the specific process of S12 is as follows:

and (3) historical query of the data is carried out, the data is counted according to hit intervals on the column histogram, interval data histograms of M buckets are established for the hit intervals N before the access frequency ranking, and the histogram information of the M buckets is stored.

5. A storage device having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, are adapted to implement the steps corresponding to the statistical information processing and usage method of the database optimizer of any one of claims 1-4.