CN111274270B - Statistical information processing and using method of database optimizer and storage device - Google Patents

Statistical information processing and using method of database optimizer and storage device Download PDF

Info

Publication number
CN111274270B
CN111274270B CN202010111247.6A CN202010111247A CN111274270B CN 111274270 B CN111274270 B CN 111274270B CN 202010111247 A CN202010111247 A CN 202010111247A CN 111274270 B CN111274270 B CN 111274270B
Authority
CN
China
Prior art keywords
statistical information
query
specific
constraint
constraint condition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010111247.6A
Other languages
Chinese (zh)
Other versions
CN111274270A (en
Inventor
丁晓晖
谭忠涛
王兴家
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dongfang Jinxin Technology Co.,Ltd.
Original Assignee
Beijing Dongfang Jinxin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dongfang Jinxin Technology Co ltd filed Critical Beijing Dongfang Jinxin Technology Co ltd
Priority to CN202010111247.6A priority Critical patent/CN111274270B/en
Publication of CN111274270A publication Critical patent/CN111274270A/en
Application granted granted Critical
Publication of CN111274270B publication Critical patent/CN111274270B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a statistical information processing and using method and a storage device of a database optimizer, wherein the method comprises the following specific processes: s1, constructing multilayer statistical information including specific constraint condition statistical information, hotspot statistical information and traditional statistical information; s2, obtaining the selectivity of the specified constraint condition, the hotspot interval histogram statistical information and the full-interval histogram statistical information from the multi-layer statistical information in a layered searching mode, and further obtaining the estimated line number of the specified constraint condition. The invention adopts the storage structure design of multi-layer statistical information with different precisions, reduces the coupling of the system, and the high-precision statistical information dynamically constructed based on the service query characteristics is pertinently used for the optimization of specific query and specific data interval query.

Description

Statistical information processing and using method of database optimizer and storage device
Technical Field
The invention relates to a statistical information processing and using method and storage equipment of a database optimizer, and relates to the technical field of database optimization.
Background
Cost-based query optimization cbo (cost Base optimizer) may also be referred to as cost optimization, physical optimization, and the main flow thereof is to enumerate various to-be-selected physical query paths, and calculate the costs of the to-be-selected paths according to context information, thereby selecting the path with the minimum cost.
The physical optimization of the database needs to calculate the cost of various physical paths, and the estimation process of the cost depends on the statistical information of the database. Whether the statistical information can accurately describe the distribution of data in the table is one of important conditions for determining the accuracy of the cost evaluation. Common table-level statistics are shown in table 1, and column-level statistics are shown in table 2:
TABLE 1
Ttuple Total number of lines
Tpage Total number of pages
TABLE 2
Figure BDA0002390082450000011
In the case of a large amount of data, the statistical information is generated by randomly sampling a part of data lines, instead of using all the data. Increasing the sample space may increase the significance of the statistics, but may reduce the performance of the statistics calculations, requiring a balance between accuracy and performance. Through statistical information, the cost estimation system can estimate the proportion of data filtered by the constraint condition to the total data quantity, and the proportion is called the degree of selectivity. The selectivity is the basis for estimating the number of rows in the result set, and the accuracy of the selectivity directly influences the path selection of the optimizer.
In conventional statistical information processing, the method for improving the statistical accuracy is to increase the sample capacity, i.e. increase the sampling rate, and the limitations are that: 1) an excessively high sampling rate may reduce performance of statistical information calculation and increase maintenance costs. In a typical service scenario with data being updated continuously, statistical information also needs to be updated synchronously, so that the performance of database query is seriously influenced; 2) there is a lack of statistical and feedback mechanisms for the data access characteristics that occur in the actual query.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a statistical information processing and using method for a database optimizer and a storage device, which effectively improve the accuracy of cost evaluation and improve the performance of database query by recording data access characteristics in actual query and using multi-layer statistical information.
In order to achieve the purpose, the invention adopts the technical scheme that:
in a first aspect, the present invention provides a statistical information processing and using method for a database optimizer, which comprises a specific process S1 of constructing multi-layer statistical information including specific constraint statistical information, hotspot statistical information and traditional statistical information;
s2, obtaining the selectivity of the specified constraint condition, the hotspot interval histogram statistical information and the full-interval histogram statistical information from the multi-layer statistical information in a layered searching mode, and further obtaining the estimated line number of the specified constraint condition.
Further, the specific process of step S1 is as follows:
s11, establishing statistical information of specific constraint conditions, and storing the specific constraint statistical information in a selection degree mode;
s12, establishing statistical information of the hot spot interval, and storing the histogram statistical information of the hot spot interval;
and S13, establishing traditional statistical information, and storing the full-interval histogram statistical information of the traditional statistical information.
Furthermore, the constraint condition refers to a filtering constraint condition used in a WHERE clause in database query, the query comprises a plurality of tables, the optimizer decomposes the filtering condition according to the table to which the filtering condition belongs to form a single table condition, and each corresponding single table condition is a specific constraint condition.
Further, the specific process of storing the specific constraint statistical information in a selectivity manner is as follows:
recording the use frequency of the constraint conditions in historical query, storing and updating the selectivity of the constraint conditions N before the use frequency is ranked, storing specific constraint statistical information in a selectivity mode, and calculating the value of the selectivity by the actual line number of single table scanning/the total line number of the table returned during query execution.
Further, the specific process of S12 is as follows: and (3) historical query of the data is carried out, the data is counted according to hit intervals on the column histogram, interval data histograms of M buckets are established for the hit intervals N before the access frequency ranking, and the histogram information of the M buckets is stored.
Further, the specific process of S2 is as follows:
if the specified constraint condition can be hit in the specific constraint statistical information, returning the estimated line number;
if the specified constraint condition is not hit in the specific constraint statistical information, trying to estimate the line number by using the hotspot statistical table information, if the constraint condition is range query and the query range can be found, calling hit, and returning the estimated line number by using the hotspot statistical information; if the constraint condition is range query and a part of the query range can be found, the query range is called partial hit, and at the moment, hotspot statistical information and traditional statistical information are used for returning the estimated line number;
if the constraint condition is not hit in the hotspot statistical information, returning the estimated line number by adopting the traditional statistical information.
In a second aspect, the present invention further provides a storage device, where the storage device stores computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer.
In a third aspect, the present invention further provides a computer program, which includes computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer.
The technical scheme adopted by the invention has the following advantages:
1. because the global high sampling rate does not have feasibility and practicability in an actual service scene, the method carries out modeling and statistics on data access characteristics generated in database query, marks the most common query and the most common data interval, adopts the high sampling rate for the local data, effectively improves the accuracy of hotspot query cost evaluation by recording and using a multi-layer statistical information mode, and further outputs a more optimal query plan;
2. the invention fully utilizes the data access characteristics in the actual query, pertinently and dynamically carries out specific query and specific data interval, improves the sampling rate of statistical information, and has better performance and lower establishment and maintenance cost compared with the global high sampling rate;
3. the invention adopts the storage structure design of multi-layer statistical information with different precisions, reduces the coupling of the system, and the high-precision statistical information dynamically constructed based on the service query characteristics is pertinently used for the optimization of specific query and specific data interval query;
in conclusion, the method and the device can be widely applied to the cost evaluation optimizer.
Drawings
Fig. 1 is a schematic view of a statistical information hierarchical searching process according to a first embodiment of the present invention;
FIG. 2(a) is an example of a full-range histogram according to a first embodiment of the present invention;
fig. 2(b) is an example of a hotspot interval histogram in the first embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example one
The statistical information processing and using method of the database optimizer provided by the embodiment comprises the following contents:
s1, whether the statistical information can accurately describe the distribution of the data fields in the table is one of the important conditions for determining the accuracy of the cost evaluation of the optimizer. Constructing multilayer statistical information comprising specific constraint condition statistical information, hotspot statistical information and traditional statistical information, and the specific process is as follows:
s11, establishing statistical information of specific constraint conditions
Constraints refer to filtering constraints used in the WHERE clause in a database query. The query includes a plurality of tables, and the optimizer decomposes the filter conditions according to the tables to which the filter conditions belong to form single table conditions. Each corresponding single table condition is a specific constraint condition.
The use frequency of the constraint conditions in the history query is recorded, the constraint conditions of the top N (for example, N is 10) of the use frequency are ranked, and the selection degree is stored and updated. The constraint-specific statistics are stored in a selective manner. The value of the degree of selection is calculated from the actual number of rows of the single table scan/the total number of rows of the table that are returned when the query is executed.
S12, establishing statistical information of hot spot intervals
Taking column t1.a in table 6 as an example, the histogram is divided into a 5-bucket equal frequency histogram, and the table lists the edge values of the histogram, and the edges divide the data into 5 bins. In real traffic, for example, a range query t1.a >0and t1.a <100 for column t1.a with the bin falling within the first bucket of the histogram, the access frequency of bin [0,218] is increased once after the query is executed, and bin [0,218] is the hit bin of the constraint.
For historical query of data, statistics is carried out on hit intervals on column histograms, and interval data histograms of M buckets are established for hit intervals with the access frequency ranked N (for example, N is 10). Table 5 lists the histogram statistics for a typical hotspot interval: including the edge values of the equal frequency histogram of the columns, intervals and intervals.
S13, establishing traditional statistical information
Taking the source database postgresql as an example, the generation of the single-column statistical information can be completed by executing the analyze command. Such information is referred to as conventional statistics in this embodiment and includes the table level statistics and column level statistics described above.
For example, the column-level statistical information may be calculated by using a full-interval histogram, where the column-level statistical information is divided into S intervals, and if a certain interval of the S intervals is a hot-spot interval, the hot-spot statistical information is a histogram of M intervals created by subdividing the interval, and is a histogram of local data created for the hot-spot interval, for example, fig. 2(a) is a full-interval histogram of conventional statistical information, and fig. 2(b) is an interval histogram of hot-spot statistical information.
In summary, the specific constraint statistical information is higher than the hotspot statistical information in accuracy than the traditional statistical information. The specific constraint statistics are smaller than the hotspot statistics than the traditional statistics over the coverage area. In the use of the statistical information, the statistical information with higher accuracy is preferentially used.
S2, as shown in fig. 1, obtaining the selectivity of the specified constraint condition from the multi-layer statistical information by using a hierarchical search method, wherein when the database optimizer requires to calculate the estimated line number of the specified constraint condition, the specific process is as follows:
if a given constraint hits in a particular constraint statistic, i.e., a single table constraint used by a query, for example, can be found in table 4, then the estimated number of rows returned is the total number of rows (T) in the table × of the corresponding condition's selectivity stored in table 4tuple);
If the specified constraint condition is not hit in the specific constraint statistical information, trying to estimate the line number by using the hotspot statistical table information, if the constraint condition is range query and the query range can be found in the table 5, calling the line number as hit, and returning the estimated line number; for example, for a range query t1.a >0and t1.a <100 for column t1.a, whose bin falls within bin [0,218] of the histogram of Table 5, the number of estimated rows returned is calculated as:
total number of rows (T) in (100-0)/(218-0) × 1/(M × N) × Tabletuple)
If the constraint condition is range query and a part of the query range can be found in the table 5, the query range is called partial hit, and at this time, the estimated line number is calculated and returned by using the table 5 and the table 6 together; for example, for a range query of column t1.a, t1.a >300and t1.a <500, whose bins fall within bin [218,410] of the histogram of Table 5 and bin [410,611] of the histogram of Table 6, the number of estimated rows returned is calculated as:
the total number of rows (T) of (410-300)/(410-218) × 1/(M × N) + (500-410)/(611-410) × 1/N) ×tuple)
If the constraint misses in the hotspot statistics, the estimated number of rows is returned using the statistics in table 6. For example, for a range query for column t1.a, t1.a >500and t1.a <600, whose bin falls within the histogram's table 6 bin [410,611], the number of estimated rows returned is calculated as:
(600 + 500)/(611 + 410) × 1/N × tabletuple)
In summary, the statistical information processing and using method of the database optimizer provided by the invention can improve the speed and accuracy of calculating the estimated line number of the hot point constraint condition in the system, thereby improving the accuracy of cost evaluation of the database system and obtaining a better query plan.
Further, the example of the multi-layer statistical information construction process of this embodiment specifically includes:
by means of query feedback, when a query is executed, historical constraints are recorded, and the historical constraints are represented as shown in table 3:
TABLE 3
Execution time Constraint conditions
11:00:00 t1.a=10
10:50:00 t1.a>5and t1.a<20
10:40:00 t1.a=10
10:20:00 t1.a>8and t1.a<15
According to the execution load of the database, the selectivity of the constraint condition of the TOPN frequency is counted and stored in the specific constraint condition statistical information when the database is idle. And counting the constraint condition interval of the using frequency TOP N, counting a histogram of the constraint condition interval and recording the histogram into hotspot statistical information.
Simplified constraint-specific statistics examples, simplified hotspot-specific statistics examples, and simplified conventional statistics are shown in tables 4-6
TABLE 4
Figure BDA0002390082450000051
Figure BDA0002390082450000061
TABLE 5
Column(s) of Hot spot interval Histogram of the data
t1.a [0,218] {1,20,51,90,120,218}
t1.a [218,410] {218,260,280,310,350,410}
TABLE 6
Column(s) of Histogram of the data
t1.a {0,218,410,611,804,1000}
Example two
The present embodiment further provides a storage device, where the storage device stores computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer of the first embodiment.
EXAMPLE III
The present embodiment further provides a computer program, which includes computer program instructions, where the program instructions, when executed by a processor, are configured to implement the steps corresponding to the statistical information processing and using method of the database optimizer of the first embodiment.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the scope of protection thereof, and although the present application is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: after reading this application, those skilled in the art will be able to make various changes, modifications and equivalents to the embodiments of the application, which are within the scope of the claims of this application.

Claims (5)

1.A statistical information processing and using method of a database optimizer is characterized by comprising the following specific processes:
s1, constructing multilayer statistical information including specific constraint condition statistical information, hotspot statistical information and traditional statistical information, wherein the specific process is as follows:
s11, establishing statistical information of specific constraint conditions, and storing the specific constraint statistical information in a selection degree mode;
s12, establishing statistical information of the hot spot interval, and storing the histogram statistical information of the hot spot interval;
s13, establishing traditional statistical information, and storing the full-interval histogram statistical information of the traditional statistical information;
s2, obtaining the selectivity of the specified constraint condition, the hotspot interval histogram statistical information and the full-interval histogram statistical information from the multilayer statistical information by using a layered search mode, and further obtaining the estimated line number of the specified constraint condition, wherein the specific process is as follows:
if the specified constraint condition can be hit in the specific constraint statistical information, returning the estimated line number;
if the specified constraint condition is not hit in the specific constraint statistical information, trying to estimate the line number by using the hotspot statistical table information, if the constraint condition is range query and the query range can be found, calling hit, and returning the estimated line number by using the hotspot statistical information; if the constraint condition is range query and a part of the query range can be found, the query range is called partial hit, and at the moment, hotspot statistical information and traditional statistical information are used for returning the estimated line number;
if the constraint condition is not hit in the hotspot statistical information, returning the estimated line number by adopting the traditional statistical information.
2. The statistical information processing and using method of a database optimizer of claim 1, wherein the constraint condition refers to a filter constraint condition used in a WHERE clause in a database query, the query includes a plurality of tables, the optimizer decomposes the filter condition according to the table to which the filter condition belongs to form a single table condition, and each corresponding single table condition is a specific constraint condition.
3. The statistical information processing and using method of the database optimizer of claim 1, wherein the specific process of storing the specific constraint statistical information in a selective manner is:
recording the use frequency of the constraint conditions in historical query, storing and updating the selectivity of the constraint conditions N before the use frequency is ranked, storing specific constraint statistical information in a selectivity mode, and calculating the value of the selectivity by the actual line number of single table scanning/the total line number of the table returned during query execution.
4. The statistical information processing and using method of database optimizer as claimed in claim 1, wherein the specific process of S12 is as follows:
and (3) historical query of the data is carried out, the data is counted according to hit intervals on the column histogram, interval data histograms of M buckets are established for the hit intervals N before the access frequency ranking, and the histogram information of the M buckets is stored.
5. A storage device having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, are adapted to implement the steps corresponding to the statistical information processing and usage method of the database optimizer of any one of claims 1-4.
CN202010111247.6A 2020-02-24 2020-02-24 Statistical information processing and using method of database optimizer and storage device Active CN111274270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010111247.6A CN111274270B (en) 2020-02-24 2020-02-24 Statistical information processing and using method of database optimizer and storage device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010111247.6A CN111274270B (en) 2020-02-24 2020-02-24 Statistical information processing and using method of database optimizer and storage device

Publications (2)

Publication Number Publication Date
CN111274270A CN111274270A (en) 2020-06-12
CN111274270B true CN111274270B (en) 2020-09-18

Family

ID=71000308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010111247.6A Active CN111274270B (en) 2020-02-24 2020-02-24 Statistical information processing and using method of database optimizer and storage device

Country Status (1)

Country Link
CN (1) CN111274270B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380243B (en) * 2020-11-19 2021-11-09 东北大学 SQL query selectivity estimation method based on machine learning
CN117667993A (en) * 2022-08-25 2024-03-08 华为技术有限公司 Selective rate estimation method and estimation device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339244A (en) * 2016-08-30 2017-01-18 中国银行股份有限公司 Method and device for realizing statistical information collection

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069134B (en) * 2015-08-18 2018-07-27 上海新炬网络信息技术股份有限公司 A kind of automatic collection method of Oracle statistical informations
US10853368B2 (en) * 2018-04-02 2020-12-01 Cloudera, Inc. Distinct value estimation for query planning
CN108664635B (en) * 2018-05-15 2020-12-04 上海达梦数据库有限公司 Method, device, equipment and storage medium for acquiring database statistical information
CN108829768A (en) * 2018-05-29 2018-11-16 中国银行股份有限公司 A kind of collection method and device of statistical information

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339244A (en) * 2016-08-30 2017-01-18 中国银行股份有限公司 Method and device for realizing statistical information collection

Also Published As

Publication number Publication date
CN111274270A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
US8812481B2 (en) Management of interesting database statistics
US9747337B2 (en) Group-by size result estimation
US5778353A (en) Computer program product for optimizing data retrieval using index scanning
CN111274270B (en) Statistical information processing and using method of database optimizer and storage device
US5875445A (en) Performance-related estimation using pseudo-ranked trees
US8914354B2 (en) Cardinality and selectivity estimation using a single table join index
US20040225639A1 (en) Optimizer dynamic sampling
US9218391B2 (en) Scalable computation of data
US8880511B2 (en) Database query optimization and cost estimation
US20080046455A1 (en) Query feedback-based configuration of database statistics
US20030167255A1 (en) Getpage-workload based index optimizer
WO2018153210A1 (en) Method, device and database system for use in automatically creating indexes
CN108764307A (en) The density peaks clustering method of natural arest neighbors optimization
US20120296905A1 (en) Density-based data clustering method
CN108733781A (en) The cluster temporal data indexing means calculated based on memory
CA2266990A1 (en) System and techniques for fast approximate query answering
El-Helw et al. Collecting and maintaining just-in-time statistics
CN110619168A (en) Super-large-scale chip information analysis method based on netlist
Peyravi A schema selection framework for data warehouse design
US9158815B2 (en) Estimating a number of unique values in a list
CN115146141A (en) Index recommendation method and device based on data characteristics
Tao et al. Performance analysis of R*-trees with arbitrary node extents
CN114328605A (en) Database information statistical method and device based on sketch technology
US9367594B2 (en) Determining statistics for cost-based optimization of a workflow
Lin et al. A Cardinality Estimation Approach Based on Two Level Histograms.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 "change of name, title or address"

Address after: 301, 3 / F, block F, Zhizao street, Zhongguancun, 45 Chengfu Road, Haidian District, Beijing 100062

Patentee after: Beijing Dongfang Jinxin Technology Co.,Ltd.

Address before: 9 / F, Jiahe Guoxin building, 15 Baiqiao street, Dongcheng District, Beijing 100062

Patentee before: Beijing Dongfang Jinxin Technology Co.,Ltd.

CP03 "change of name, title or address"