CN111274270B - Statistical information processing and using method of database optimizer and storage device - Google Patents
Statistical information processing and using method of database optimizer and storage device Download PDFInfo
- Publication number
- CN111274270B CN111274270B CN202010111247.6A CN202010111247A CN111274270B CN 111274270 B CN111274270 B CN 111274270B CN 202010111247 A CN202010111247 A CN 202010111247A CN 111274270 B CN111274270 B CN 111274270B
- Authority
- CN
- China
- Prior art keywords
- statistical information
- query
- specific
- constraint
- constraint condition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a statistical information processing and using method and a storage device of a database optimizer, wherein the method comprises the following specific processes: s1, constructing multilayer statistical information including specific constraint condition statistical information, hotspot statistical information and traditional statistical information; s2, obtaining the selectivity of the specified constraint condition, the hotspot interval histogram statistical information and the full-interval histogram statistical information from the multi-layer statistical information in a layered searching mode, and further obtaining the estimated line number of the specified constraint condition. The invention adopts the storage structure design of multi-layer statistical information with different precisions, reduces the coupling of the system, and the high-precision statistical information dynamically constructed based on the service query characteristics is pertinently used for the optimization of specific query and specific data interval query.
Description
Technical Field
The invention relates to a statistical information processing and using method and storage equipment of a database optimizer, and relates to the technical field of database optimization.
Background
Cost-based query optimization cbo (cost Base optimizer) may also be referred to as cost optimization, physical optimization, and the main flow thereof is to enumerate various to-be-selected physical query paths, and calculate the costs of the to-be-selected paths according to context information, thereby selecting the path with the minimum cost.
The physical optimization of the database needs to calculate the cost of various physical paths, and the estimation process of the cost depends on the statistical information of the database. Whether the statistical information can accurately describe the distribution of data in the table is one of important conditions for determining the accuracy of the cost evaluation. Common table-level statistics are shown in table 1, and column-level statistics are shown in table 2:
TABLE 1
Ttuple | Total number of lines |
Tpage | Total number of pages |
TABLE 2
In the case of a large amount of data, the statistical information is generated by randomly sampling a part of data lines, instead of using all the data. Increasing the sample space may increase the significance of the statistics, but may reduce the performance of the statistics calculations, requiring a balance between accuracy and performance. Through statistical information, the cost estimation system can estimate the proportion of data filtered by the constraint condition to the total data quantity, and the proportion is called the degree of selectivity. The selectivity is the basis for estimating the number of rows in the result set, and the accuracy of the selectivity directly influences the path selection of the optimizer.
In conventional statistical information processing, the method for improving the statistical accuracy is to increase the sample capacity, i.e. increase the sampling rate, and the limitations are that: 1) an excessively high sampling rate may reduce performance of statistical information calculation and increase maintenance costs. In a typical service scenario with data being updated continuously, statistical information also needs to be updated synchronously, so that the performance of database query is seriously influenced; 2) there is a lack of statistical and feedback mechanisms for the data access characteristics that occur in the actual query.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a statistical information processing and using method for a database optimizer and a storage device, which effectively improve the accuracy of cost evaluation and improve the performance of database query by recording data access characteristics in actual query and using multi-layer statistical information.
In order to achieve the purpose, the invention adopts the technical scheme that:
in a first aspect, the present invention provides a statistical information processing and using method for a database optimizer, which comprises a specific process S1 of constructing multi-layer statistical information including specific constraint statistical information, hotspot statistical information and traditional statistical information;
s2, obtaining the selectivity of the specified constraint condition, the hotspot interval histogram statistical information and the full-interval histogram statistical information from the multi-layer statistical information in a layered searching mode, and further obtaining the estimated line number of the specified constraint condition.
Further, the specific process of step S1 is as follows:
s11, establishing statistical information of specific constraint conditions, and storing the specific constraint statistical information in a selection degree mode;
s12, establishing statistical information of the hot spot interval, and storing the histogram statistical information of the hot spot interval;
and S13, establishing traditional statistical information, and storing the full-interval histogram statistical information of the traditional statistical information.
Furthermore, the constraint condition refers to a filtering constraint condition used in a WHERE clause in database query, the query comprises a plurality of tables, the optimizer decomposes the filtering condition according to the table to which the filtering condition belongs to form a single table condition, and each corresponding single table condition is a specific constraint condition.
Further, the specific process of storing the specific constraint statistical information in a selectivity manner is as follows:
recording the use frequency of the constraint conditions in historical query, storing and updating the selectivity of the constraint conditions N before the use frequency is ranked, storing specific constraint statistical information in a selectivity mode, and calculating the value of the selectivity by the actual line number of single table scanning/the total line number of the table returned during query execution.
Further, the specific process of S12 is as follows: and (3) historical query of the data is carried out, the data is counted according to hit intervals on the column histogram, interval data histograms of M buckets are established for the hit intervals N before the access frequency ranking, and the histogram information of the M buckets is stored.
Further, the specific process of S2 is as follows:
if the specified constraint condition can be hit in the specific constraint statistical information, returning the estimated line number;
if the specified constraint condition is not hit in the specific constraint statistical information, trying to estimate the line number by using the hotspot statistical table information, if the constraint condition is range query and the query range can be found, calling hit, and returning the estimated line number by using the hotspot statistical information; if the constraint condition is range query and a part of the query range can be found, the query range is called partial hit, and at the moment, hotspot statistical information and traditional statistical information are used for returning the estimated line number;
if the constraint condition is not hit in the hotspot statistical information, returning the estimated line number by adopting the traditional statistical information.
In a second aspect, the present invention further provides a storage device, where the storage device stores computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer.
In a third aspect, the present invention further provides a computer program, which includes computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer.
The technical scheme adopted by the invention has the following advantages:
1. because the global high sampling rate does not have feasibility and practicability in an actual service scene, the method carries out modeling and statistics on data access characteristics generated in database query, marks the most common query and the most common data interval, adopts the high sampling rate for the local data, effectively improves the accuracy of hotspot query cost evaluation by recording and using a multi-layer statistical information mode, and further outputs a more optimal query plan;
2. the invention fully utilizes the data access characteristics in the actual query, pertinently and dynamically carries out specific query and specific data interval, improves the sampling rate of statistical information, and has better performance and lower establishment and maintenance cost compared with the global high sampling rate;
3. the invention adopts the storage structure design of multi-layer statistical information with different precisions, reduces the coupling of the system, and the high-precision statistical information dynamically constructed based on the service query characteristics is pertinently used for the optimization of specific query and specific data interval query;
in conclusion, the method and the device can be widely applied to the cost evaluation optimizer.
Drawings
Fig. 1 is a schematic view of a statistical information hierarchical searching process according to a first embodiment of the present invention;
FIG. 2(a) is an example of a full-range histogram according to a first embodiment of the present invention;
fig. 2(b) is an example of a hotspot interval histogram in the first embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example one
The statistical information processing and using method of the database optimizer provided by the embodiment comprises the following contents:
s1, whether the statistical information can accurately describe the distribution of the data fields in the table is one of the important conditions for determining the accuracy of the cost evaluation of the optimizer. Constructing multilayer statistical information comprising specific constraint condition statistical information, hotspot statistical information and traditional statistical information, and the specific process is as follows:
s11, establishing statistical information of specific constraint conditions
Constraints refer to filtering constraints used in the WHERE clause in a database query. The query includes a plurality of tables, and the optimizer decomposes the filter conditions according to the tables to which the filter conditions belong to form single table conditions. Each corresponding single table condition is a specific constraint condition.
The use frequency of the constraint conditions in the history query is recorded, the constraint conditions of the top N (for example, N is 10) of the use frequency are ranked, and the selection degree is stored and updated. The constraint-specific statistics are stored in a selective manner. The value of the degree of selection is calculated from the actual number of rows of the single table scan/the total number of rows of the table that are returned when the query is executed.
S12, establishing statistical information of hot spot intervals
Taking column t1.a in table 6 as an example, the histogram is divided into a 5-bucket equal frequency histogram, and the table lists the edge values of the histogram, and the edges divide the data into 5 bins. In real traffic, for example, a range query t1.a >0and t1.a <100 for column t1.a with the bin falling within the first bucket of the histogram, the access frequency of bin [0,218] is increased once after the query is executed, and bin [0,218] is the hit bin of the constraint.
For historical query of data, statistics is carried out on hit intervals on column histograms, and interval data histograms of M buckets are established for hit intervals with the access frequency ranked N (for example, N is 10). Table 5 lists the histogram statistics for a typical hotspot interval: including the edge values of the equal frequency histogram of the columns, intervals and intervals.
S13, establishing traditional statistical information
Taking the source database postgresql as an example, the generation of the single-column statistical information can be completed by executing the analyze command. Such information is referred to as conventional statistics in this embodiment and includes the table level statistics and column level statistics described above.
For example, the column-level statistical information may be calculated by using a full-interval histogram, where the column-level statistical information is divided into S intervals, and if a certain interval of the S intervals is a hot-spot interval, the hot-spot statistical information is a histogram of M intervals created by subdividing the interval, and is a histogram of local data created for the hot-spot interval, for example, fig. 2(a) is a full-interval histogram of conventional statistical information, and fig. 2(b) is an interval histogram of hot-spot statistical information.
In summary, the specific constraint statistical information is higher than the hotspot statistical information in accuracy than the traditional statistical information. The specific constraint statistics are smaller than the hotspot statistics than the traditional statistics over the coverage area. In the use of the statistical information, the statistical information with higher accuracy is preferentially used.
S2, as shown in fig. 1, obtaining the selectivity of the specified constraint condition from the multi-layer statistical information by using a hierarchical search method, wherein when the database optimizer requires to calculate the estimated line number of the specified constraint condition, the specific process is as follows:
if a given constraint hits in a particular constraint statistic, i.e., a single table constraint used by a query, for example, can be found in table 4, then the estimated number of rows returned is the total number of rows (T) in the table × of the corresponding condition's selectivity stored in table 4tuple);
If the specified constraint condition is not hit in the specific constraint statistical information, trying to estimate the line number by using the hotspot statistical table information, if the constraint condition is range query and the query range can be found in the table 5, calling the line number as hit, and returning the estimated line number; for example, for a range query t1.a >0and t1.a <100 for column t1.a, whose bin falls within bin [0,218] of the histogram of Table 5, the number of estimated rows returned is calculated as:
total number of rows (T) in (100-0)/(218-0) × 1/(M × N) × Tabletuple)
If the constraint condition is range query and a part of the query range can be found in the table 5, the query range is called partial hit, and at this time, the estimated line number is calculated and returned by using the table 5 and the table 6 together; for example, for a range query of column t1.a, t1.a >300and t1.a <500, whose bins fall within bin [218,410] of the histogram of Table 5 and bin [410,611] of the histogram of Table 6, the number of estimated rows returned is calculated as:
the total number of rows (T) of (410-300)/(410-218) × 1/(M × N) + (500-410)/(611-410) × 1/N) ×tuple)
If the constraint misses in the hotspot statistics, the estimated number of rows is returned using the statistics in table 6. For example, for a range query for column t1.a, t1.a >500and t1.a <600, whose bin falls within the histogram's table 6 bin [410,611], the number of estimated rows returned is calculated as:
(600 + 500)/(611 + 410) × 1/N × tabletuple)
In summary, the statistical information processing and using method of the database optimizer provided by the invention can improve the speed and accuracy of calculating the estimated line number of the hot point constraint condition in the system, thereby improving the accuracy of cost evaluation of the database system and obtaining a better query plan.
Further, the example of the multi-layer statistical information construction process of this embodiment specifically includes:
by means of query feedback, when a query is executed, historical constraints are recorded, and the historical constraints are represented as shown in table 3:
TABLE 3
Execution time | Constraint conditions |
11:00:00 | t1.a=10 |
10:50:00 | t1.a>5and t1.a<20 |
10:40:00 | t1.a=10 |
10:20:00 | t1.a>8and t1.a<15 |
… | … |
According to the execution load of the database, the selectivity of the constraint condition of the TOPN frequency is counted and stored in the specific constraint condition statistical information when the database is idle. And counting the constraint condition interval of the using frequency TOP N, counting a histogram of the constraint condition interval and recording the histogram into hotspot statistical information.
Simplified constraint-specific statistics examples, simplified hotspot-specific statistics examples, and simplified conventional statistics are shown in tables 4-6
TABLE 4
TABLE 5
Column(s) of | Hot spot interval | Histogram of the data |
t1.a | [0,218] | {1,20,51,90,120,218} |
t1.a | [218,410] | {218,260,280,310,350,410} |
… | … | … |
TABLE 6
Column(s) of | Histogram of the data |
t1.a | {0,218,410,611,804,1000} |
… | … |
Example two
The present embodiment further provides a storage device, where the storage device stores computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer of the first embodiment.
EXAMPLE III
The present embodiment further provides a computer program, which includes computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer of the first embodiment.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the scope of protection thereof, and although the present application is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: after reading this application, those skilled in the art will be able to make various changes, modifications and equivalents to the embodiments of the application, which are within the scope of the claims of this application.
Claims (5)
1.A statistical information processing and using method of a database optimizer is characterized by comprising the following specific processes:
s1, constructing multilayer statistical information including specific constraint condition statistical information, hotspot statistical information and traditional statistical information, wherein the specific process is as follows:
s11, establishing statistical information of specific constraint conditions, and storing the specific constraint statistical information in a selection degree mode;
s12, establishing statistical information of the hot spot interval, and storing the histogram statistical information of the hot spot interval;
s13, establishing traditional statistical information, and storing the full-interval histogram statistical information of the traditional statistical information;
s2, obtaining the selectivity of the specified constraint condition, the hotspot interval histogram statistical information and the full-interval histogram statistical information from the multilayer statistical information by using a layered search mode, and further obtaining the estimated line number of the specified constraint condition, wherein the specific process is as follows:
if the specified constraint condition can be hit in the specific constraint statistical information, returning the estimated line number;
if the specified constraint condition is not hit in the specific constraint statistical information, trying to estimate the line number by using the hotspot statistical table information, if the constraint condition is range query and the query range can be found, calling hit, and returning the estimated line number by using the hotspot statistical information; if the constraint condition is range query and a part of the query range can be found, the query range is called partial hit, and at the moment, hotspot statistical information and traditional statistical information are used for returning the estimated line number;
if the constraint condition is not hit in the hotspot statistical information, returning the estimated line number by adopting the traditional statistical information.
2. The statistical information processing and using method of a database optimizer of claim 1, wherein the constraint condition refers to a filter constraint condition used in a WHERE clause in a database query, the query includes a plurality of tables, the optimizer decomposes the filter condition according to the table to which the filter condition belongs to form a single table condition, and each corresponding single table condition is a specific constraint condition.
3. The statistical information processing and using method of the database optimizer of claim 1, wherein the specific process of storing the specific constraint statistical information in a selective manner is:
recording the use frequency of the constraint conditions in historical query, storing and updating the selectivity of the constraint conditions N before the use frequency is ranked, storing specific constraint statistical information in a selectivity mode, and calculating the value of the selectivity by the actual line number of single table scanning/the total line number of the table returned during query execution.
4. The statistical information processing and using method of database optimizer as claimed in claim 1, wherein the specific process of S12 is as follows:
and (3) historical query of the data is carried out, the data is counted according to hit intervals on the column histogram, interval data histograms of M buckets are established for the hit intervals N before the access frequency ranking, and the histogram information of the M buckets is stored.
5. A storage device having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, are adapted to implement the steps corresponding to the statistical information processing and usage method of the database optimizer of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010111247.6A CN111274270B (en) | 2020-02-24 | 2020-02-24 | Statistical information processing and using method of database optimizer and storage device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010111247.6A CN111274270B (en) | 2020-02-24 | 2020-02-24 | Statistical information processing and using method of database optimizer and storage device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111274270A CN111274270A (en) | 2020-06-12 |
CN111274270B true CN111274270B (en) | 2020-09-18 |
Family
ID=71000308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010111247.6A Active CN111274270B (en) | 2020-02-24 | 2020-02-24 | Statistical information processing and using method of database optimizer and storage device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111274270B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112380243B (en) * | 2020-11-19 | 2021-11-09 | 东北大学 | SQL query selectivity estimation method based on machine learning |
CN117667993A (en) * | 2022-08-25 | 2024-03-08 | 华为技术有限公司 | Selective rate estimation method and estimation device |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339244A (en) * | 2016-08-30 | 2017-01-18 | 中国银行股份有限公司 | Method and device for realizing statistical information collection |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105069134B (en) * | 2015-08-18 | 2018-07-27 | 上海新炬网络信息技术股份有限公司 | A kind of automatic collection method of Oracle statistical informations |
US10853368B2 (en) * | 2018-04-02 | 2020-12-01 | Cloudera, Inc. | Distinct value estimation for query planning |
CN108664635B (en) * | 2018-05-15 | 2020-12-04 | 上海达梦数据库有限公司 | Method, device, equipment and storage medium for acquiring database statistical information |
CN108829768A (en) * | 2018-05-29 | 2018-11-16 | 中国银行股份有限公司 | A kind of collection method and device of statistical information |
-
2020
- 2020-02-24 CN CN202010111247.6A patent/CN111274270B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339244A (en) * | 2016-08-30 | 2017-01-18 | 中国银行股份有限公司 | Method and device for realizing statistical information collection |
Also Published As
Publication number | Publication date |
---|---|
CN111274270A (en) | 2020-06-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8812481B2 (en) | Management of interesting database statistics | |
US9747337B2 (en) | Group-by size result estimation | |
US5778353A (en) | Computer program product for optimizing data retrieval using index scanning | |
CN111274270B (en) | Statistical information processing and using method of database optimizer and storage device | |
US5875445A (en) | Performance-related estimation using pseudo-ranked trees | |
US8914354B2 (en) | Cardinality and selectivity estimation using a single table join index | |
US20040225639A1 (en) | Optimizer dynamic sampling | |
US9218391B2 (en) | Scalable computation of data | |
US8880511B2 (en) | Database query optimization and cost estimation | |
US20080046455A1 (en) | Query feedback-based configuration of database statistics | |
US20030167255A1 (en) | Getpage-workload based index optimizer | |
WO2018153210A1 (en) | Method, device and database system for use in automatically creating indexes | |
CN108764307A (en) | The density peaks clustering method of natural arest neighbors optimization | |
US20120296905A1 (en) | Density-based data clustering method | |
CN108733781A (en) | The cluster temporal data indexing means calculated based on memory | |
CA2266990A1 (en) | System and techniques for fast approximate query answering | |
El-Helw et al. | Collecting and maintaining just-in-time statistics | |
CN110619168A (en) | Super-large-scale chip information analysis method based on netlist | |
Peyravi | A schema selection framework for data warehouse design | |
US9158815B2 (en) | Estimating a number of unique values in a list | |
CN115146141A (en) | Index recommendation method and device based on data characteristics | |
Tao et al. | Performance analysis of R*-trees with arbitrary node extents | |
CN114328605A (en) | Database information statistical method and device based on sketch technology | |
US9367594B2 (en) | Determining statistics for cost-based optimization of a workflow | |
Lin et al. | A Cardinality Estimation Approach Based on Two Level Histograms. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP03 | "change of name, title or address" |
Address after: 301, 3 / F, block F, Zhizao street, Zhongguancun, 45 Chengfu Road, Haidian District, Beijing 100062 Patentee after: Beijing Dongfang Jinxin Technology Co.,Ltd. Address before: 9 / F, Jiahe Guoxin building, 15 Baiqiao street, Dongcheng District, Beijing 100062 Patentee before: Beijing Dongfang Jinxin Technology Co.,Ltd. |
|
CP03 | "change of name, title or address" |