CN103927346B - Query connection method on basis of data volumes

Info

Publication number: CN103927346B
Application number: CN201410124531.1A
Authority: CN
Inventors: 陈岭; 周强
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-03-28
Filing date: 2014-03-28
Publication date: 2017-02-15
Anticipated expiration: 2034-03-28
Also published as: CN103927346A

Abstract

The invention discloses a query join method based on data volume. For real-time queries over big data, the method takes characteristics such as columnar file reading into account when estimating cost, so that the optimal join order can be generated. The method mainly comprises: constructing a metadata server; collecting statistical information; querying the metadata server to obtain the statistics of each table participating in the join; estimating the selectivity and related parameters such as the data volume from the statistics; and computing the cost of each execution plan to find the optimal join order. The method improves the accuracy of cost estimation, which helps ensure that the optimal execution plan is found and effectively improves overall query efficiency.

Description

Query join method based on data volume

Technical field

The present invention relates to the field of real-time query optimization for big data, and more particularly to a query join method based on data volume.

Background

Real-time querying is an important big data technology. Existing big data query systems include Google Dremel, Cloudera Impala, Berkeley Shark, and Apache Drill. Real-time big data query systems typically adopt a distributed computing architecture and, because they weaken support for features such as transactions, scale better than relational database clusters. Because they can also satisfy users' real-time query requirements, they have broad application prospects in fields such as the Internet and smart cities.

Multi-join order optimization is an important component of a database management system and is equally indispensable in real-time big data querying. Using a chosen optimization method, it traverses the search space of execution plans to find the optimal join order and generate the optimal execution plan, thereby improving the performance of the big data query system and meeting the real-time requirements of user queries.

Cost estimation is a crucial part of multi-join order optimization, and an effective method for estimating result size is key to effective query optimization. The traditional cost estimation method is based on table cardinality; it handles conventional cost estimation well and thus ensures that the best execution plan under the cost model is found. In distributed database systems and data warehouses, however, some tables are stored in columnar file formats, which optimize I/O performance and reduce the amount of data transferred when the underlying data is read. Take RCFile as an example: it first partitions the data horizontally by row and then vertically by column, and only the required columns are read and transferred. When tables stored in columnar file formats participate in a join, the traditional cardinality-based estimate can deviate seriously. The execution plan that the join-order optimization algorithm selects under the cost model is then not actually optimal, that is, the join order found is not the best one, and the overall query latency is higher.

Summary of the invention

The technical problem addressed by the present invention is how to improve the accuracy of cost estimation when a real-time big data query system performs multi-join order optimization, and thereby improve overall query efficiency. To solve the above problems of traditional cardinality-based cost estimation, the present invention proposes a cost estimation method for multi-join queries based on data volume. Considering that some of the relations participating in the joins of a user-submitted query may be stored as columnar files, the method takes characteristics such as columnar file reading into account, adds finer-grained statistical information, and uses the average length of each field to estimate the size of the intermediate join results, thereby ensuring the accuracy of cost estimation.

A query join method based on data volume, comprising:

Step 1: submit a query request to the metadata server to obtain the statistical information of each table participating in the join;

Step 2: estimate, from the obtained statistical information, the data volume of all tables in the current execution plan;

Step 3: repeat step 1 and step 2 until the search space of execution plans has been traversed, find the execution plan whose data volumes give the minimum query cost, and join the tables in the join order of that execution plan.

The search space of execution plans is the set of table join orders given by all execution plans.

The present invention uses data volume as the query cost to determine the join order in a multi-join query. This improves the accuracy of cost estimation when a real-time big data query system performs multi-join order optimization, and thereby improves overall query efficiency.
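
The three steps can be illustrated with a minimal Python sketch that enumerates every join order and keeps the cheapest one. The patent does not spell out its cost model, so everything here is an assumption made for illustration: plans are left-deep, the cost of a plan is the sum of the estimated byte sizes of its intermediate results, the width of a joined row is the sum of the input row widths, and the names `plan_cost`, `best_join_order`, `tables`, and `join_sel` are hypothetical.

```python
from itertools import permutations

# Hypothetical per-table estimates after local filtering:
# "rows" = selectivity * total rows, "row_bytes" = sum of average column sizes.
tables = {
    "a": {"rows": 1000, "row_bytes": 10},
    "b": {"rows": 100, "row_bytes": 200},
    "c": {"rows": 10, "row_bytes": 4},
}
# Hypothetical join selectivities; a missing pair means no join predicate (1.0).
join_sel = {frozenset(("a", "b")): 0.01, frozenset(("b", "c")): 0.1}


def plan_cost(order, tables, join_sel):
    """Assumed cost model: total bytes of all intermediate results (left-deep)."""
    first = tables[order[0]]
    rows, width = first["rows"], first["row_bytes"]
    joined = [order[0]]
    cost = 0.0
    for name in order[1:]:
        sel = 1.0
        for prev in joined:  # apply every predicate linking the new table
            sel *= join_sel.get(frozenset((prev, name)), 1.0)
        rows = rows * tables[name]["rows"] * sel
        width += tables[name]["row_bytes"]
        cost += rows * width  # data volume of this intermediate result
        joined.append(name)
    return cost


def best_join_order(tables, join_sel):
    """Traverse the whole search space and return the cheapest join order."""
    return min(permutations(tables), key=lambda o: plan_cost(o, tables, join_sel))


best = best_join_order(tables, join_sel)
```

Under these assumed numbers, orders that join the small tables `b` and `c` first are cheaper than orders that start with the wide intermediate result of `a` and `b`, even though all orders end with the same final row count.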

The metadata server is built as follows: choose a relational database and design a table schema at column-level granularity; then create the metadata database and table relations in that relational database according to the designed schema, thereby obtaining the metadata server.

To provide the query system with statistical information at three granularities (table level, partition level, and column level), the designed table schema must satisfy a suitable normal form while keeping unnecessary storage overhead as low as possible, provided that cost estimation can still be completed.

The statistical information in the metadata server is the statistical information of each table, obtained by collecting statistics on the table according to the designed table schema.

The granularity of the statistical information follows the granularity of the table schema. Because the table schema is a column-level schema, the statistical information includes column-level statistics.

The relational database is a MySQL database, a Derby database, or an Oracle database.

A suitable relational database is chosen as the metadata server of the real-time big data query system according to the actual requirements of the enterprise users and the system.

The statistical information includes: the column name, the lower bound of the data values in the column, the upper bound of the data values in the column, the number of null values in the column, the number of distinct values in the column, the average length of the field data in the column, the maximum length of the field data in the column, and the total number of rows of the table or view.
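
As a rough illustration of such a schema, the sketch below creates a table-level and a column-level statistics table. It uses Python's built-in `sqlite3` module purely as a stand-in for the MySQL, Derby, or Oracle database named in the text. The table and column names (`table_stats`, `column_stats`, `low_value`, and so on) are hypothetical; only `avg_col_len` mirrors the AVG_COL_LEN field mentioned in the description, and partition-level statistics are omitted for brevity.

```python
import sqlite3

# Hypothetical schema: one row per table, one row per (table, column).
SCHEMA = """
CREATE TABLE table_stats (
    table_name TEXT PRIMARY KEY,
    num_rows   INTEGER NOT NULL          -- total rows of the table or view
);
CREATE TABLE column_stats (
    table_name   TEXT NOT NULL REFERENCES table_stats(table_name),
    column_name  TEXT NOT NULL,
    low_value    REAL,                   -- lower bound of the column values
    high_value   REAL,                   -- upper bound of the column values
    num_nulls    INTEGER,                -- number of null values
    num_distinct INTEGER,                -- number of distinct values
    avg_col_len  REAL,                   -- average field length in bytes
    max_col_len  INTEGER,                -- maximum field length in bytes
    PRIMARY KEY (table_name, column_name)
);
"""


def create_metastore(path=":memory:"):
    """Create the metadata database and its statistics tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping the row count in `table_stats` and the per-column figures in `column_stats` avoids repeating the row count for every column, which is one way to read the requirement of a suitable normal form with little redundant storage.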

Both the construction of the metadata server and the storage of the statistical information in it are completed offline.

Because the metadata server is built and the statistics are collected offline, returning the statistics during an actual query incurs little run-time overhead, which significantly reduces the latency of cost estimation.

In step 2, the data volume of each table is calculated from the table's selectivity, the average data size of its fields, and its total number of rows.

The selectivity, denoted selectivity, is estimated from statistics such as the lower and upper bounds of the data values in a column, together with the conditions in the join query.

The selectivity is computed from the query conditions in the query and the statistical information; it is the proportion of rows satisfying the query conditions within the object set being queried.

The object set is a table, a view, or an intermediate result set.
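
The text does not give a formula for deriving the selectivity of a single condition from the column bounds. One common approach, shown here only as an assumed illustration, is to treat the values as uniformly distributed between the stored lower and upper bounds for range conditions, and to use the reciprocal of the distinct-value count for equality conditions. The function names are hypothetical.

```python
def range_selectivity(low, high, lo_bound=None, hi_bound=None):
    """Estimate the fraction of rows with lo_bound <= value <= hi_bound.

    low/high are the column's stored lower and upper bounds. Values are
    ASSUMED to be uniformly distributed between them; this is not stated
    in the source text.
    """
    lo = low if lo_bound is None else max(low, lo_bound)
    hi = high if hi_bound is None else min(high, hi_bound)
    if hi < lo:
        return 0.0  # the condition lies entirely outside the column's range
    if high == low:
        return 1.0  # single-valued column that satisfies the condition
    return (hi - lo) / (high - low)


def equality_selectivity(num_distinct):
    """Estimate the fraction of rows equal to one value (assumed uniform)."""
    return 1.0 / num_distinct if num_distinct > 0 else 0.0
```

For example, with stored bounds 0 and 100, the condition `value <= 25` is estimated to keep a quarter of the rows.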

The data volume is calculated as follows:

$$\mathrm{size} = \mathrm{selectivity} \times \mathrm{numOfTableLine} \times \sum_{i=1}^{j} \mathrm{avgColSize}_{i}$$

Here selectivity is the selectivity of the query, numOfTableLine is the total number of rows of the table or view, avgColSize_i is the average data size of the i-th field of the table that needs to be returned, and j is the number of columns of the table.

Compared with the traditional cardinality-based estimation method, this method depends not only on the number of rows in the intermediate query results but also takes the estimated data volume into account, thereby improving the accuracy of cost estimation.
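
The formula translates directly into code. The sketch below is a minimal Python version; the function and parameter names are chosen for illustration, and the second function shows the traditional row-count estimate for contrast.

```python
def estimate_data_volume(selectivity, num_of_table_line, avg_col_sizes):
    """size = selectivity * numOfTableLine * sum of avgColSize_i.

    avg_col_sizes lists the average size of each column that the query
    needs to return, so columns that a columnar reader skips are left out.
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    return selectivity * num_of_table_line * sum(avg_col_sizes)


def estimate_cardinality(selectivity, num_of_table_line):
    """Traditional estimate: number of rows only, ignoring row width."""
    return selectivity * num_of_table_line
```

For a table of 1000 rows, selectivity 0.5, and two returned columns averaging 8 and 24 bytes, the volume estimate is 0.5 × 1000 × 32 = 16000 bytes, whereas the cardinality estimate is 500 rows regardless of which columns are read.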

The advantages of the present invention include:

To address the inaccurate estimates of the traditional cardinality-based cost method, the invention takes characteristics such as columnar file reading into account and adds finer-grained statistical information, effectively improving the accuracy of cost estimation.

Storing and maintaining the table statistics in a metadata server avoids repeating large amounts of analysis work, reduces run-time overhead, and improves the efficiency of cost estimation.

Brief description of the drawings

Fig. 1 is the overall flow chart of the query join method based on data volume according to one embodiment of the invention;

Fig. 2 is the query processing architecture diagram used in this embodiment of the invention;

Fig. 3 is the flow chart of metadata server construction in this embodiment of the invention;

Fig. 4 is the flow chart of statistical information collection in this embodiment of the invention;

Fig. 5 is the flow chart of statistical information lookup in this embodiment of the invention;

Fig. 6 is the flow chart of data volume estimation in this embodiment of the invention;

Fig. 7 is the flow chart of join order generation in this embodiment of the invention.

Detailed description

The present invention proposes a query join method based on data volume, which performs cost estimation when a multi-join query is processed. The overall flow of the cost estimation method is shown in Fig. 1. First, the metadata server is built; then the statistical information is collected; next, the statistics of each table participating in the join are obtained by querying the metadata server; then the selectivity and related parameters such as the data volume are estimated from the statistics; finally, the data-volume-based estimation method is used to compute the cost of each execution plan and find the optimal join order.

To show more intuitively the role of the proposed method in query optimization, the query processing architecture is given in Fig. 2, which illustrates the relationship between the data-volume-based cost estimation module and the join order generation module. In the join order generation module, the relevant optimization method searches the execution plans. The data-volume-based cost estimation module consists mainly of two parts, the cost model and the MetaStore, and carries out the cost estimation. After a user-submitted query has been parsed, the multi-join order optimization method performs the order optimization. During its search of execution plans it calls the cost estimation module to estimate costs, so as to find the best join order under the given cost model.

The steps of the data-volume-based multi-join query cost estimation method proposed by the present invention are as follows:

Before the query join is performed, the metadata server must first be built and the statistical information stored in its tables.

Choose a relational database and design the table schema to build the metadata server.

To implement the data-volume-based cost estimation method efficiently, the metadata server must first be built. The flow is shown in Fig. 3, and the specific steps are as follows:

According to the actual requirements of the enterprise users and the system, choose a suitable relational database (such as a MySQL database or a Derby database) as the metadata server of the real-time big data query system;

To provide the query system with statistical information at three granularities (table level, partition level, and column level), design a table schema that satisfies a suitable normal form while keeping unnecessary storage overhead as low as possible, provided that cost estimation can still be completed;

Create the metadata database and table relations on the corresponding database server according to the designed table schema, for use in subsequent steps.

According to the designed table schema, analyze the relations in each table and store the corresponding statistical information in the metadata server to complete the collection of statistical information;

So that join order optimization can be applied to parsed queries, the statistical information must be collected after the metadata server has been created. The flow is shown in Fig. 4, and the specific steps are as follows:

To reduce the overhead of obtaining statistics for cost estimation during join order optimization, first analyze the tables that frequently participate in join queries, using the corresponding analysis statements or tools;

Collect the relevant statistics for the analyzed tables and store them in the corresponding tables of the metadata server. To better support data-volume-based cost estimation, column-level statistics such as the average field length AVG_COL_LEN must be collected; these are specified during table schema design. The statistical information includes: the column name, the lower bound of the data values in the column, the upper bound of the data values in the column, the number of null values in the column, the number of distinct values in the column, the average data size of the field data in the column, the maximum data size of the field data in the column, and the total number of rows of the table or view.
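
The column-level statistics listed above could be gathered with a single pass over a table. The sketch below is an in-memory illustration only: a real system would use the analysis statements or tools of its database, and measuring field length as the UTF-8 byte length of the value's string form is an assumption, as are the function and key names other than `avg_col_len`.

```python
def collect_column_stats(rows, column):
    """Compute column-level statistics from a list of dict rows.

    Assumes the non-null values of the column are mutually comparable.
    """
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    # Assumed length measure: bytes of the UTF-8 encoded string form.
    lengths = [len(str(v).encode("utf-8")) for v in non_null]
    return {
        "column_name": column,
        "low_value": min(non_null) if non_null else None,
        "high_value": max(non_null) if non_null else None,
        "num_nulls": len(values) - len(non_null),
        "num_distinct": len(set(non_null)),
        "avg_col_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_col_len": max(lengths, default=0),
        "num_rows": len(values),
    }


# Hypothetical sample table used only to exercise the function.
sample_rows = [
    {"city": "Hangzhou"},
    {"city": "Ningbo"},
    {"city": None},
    {"city": "Ningbo"},
]
city_stats = collect_column_stats(sample_rows, "city")
```

The resulting dictionary has one entry per statistic and could be written as one row of the column-level statistics table in the metadata server.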

The creation of the metadata server (that is, the metadata database) and the collection of statistical information are both completed offline; the query is then carried out.

Step 1: obtain the statistics of each table participating in the join by submitting a query request to the metadata server;

This step looks up and retrieves the relevant statistics. The flow is shown in Fig. 5, and the specific steps are as follows:

To obtain the statistics of each table participating in the join of the query, the query optimization module submits a query request to the corresponding metadata server;

The metadata server returns the statistics of each table relation, completing the retrieval of statistical information, which is then used to calculate the relevant parameters in the next stage.

Because the metadata server is built and the statistics are collected offline, this step incurs little run-time overhead, which greatly reduces the latency of cost estimation.
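
A lookup of this kind can be sketched as a single parameterized query that fetches the statistics of all joined tables at once. The example uses Python's built-in `sqlite3` module as a stand-in for the metadata server, with a deliberately reduced, hypothetical `column_stats` table so that the block is self-contained.

```python
import sqlite3


def fetch_column_stats(conn, table_names):
    """Return {table: {column: avg_col_len}} for the tables in the join."""
    placeholders = ",".join("?" for _ in table_names)
    sql = (
        "SELECT table_name, column_name, avg_col_len FROM column_stats "
        f"WHERE table_name IN ({placeholders})"
    )
    stats = {name: {} for name in table_names}
    for table, column, avg_len in conn.execute(sql, list(table_names)):
        stats[table][column] = avg_len
    return stats


# Hypothetical statistics, assumed to have been loaded offline.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE column_stats (table_name TEXT, column_name TEXT, avg_col_len REAL)"
)
conn.executemany(
    "INSERT INTO column_stats VALUES (?, ?, ?)",
    [("orders", "id", 8.0), ("orders", "note", 40.0), ("users", "id", 8.0)],
)
demo_stats = fetch_column_stats(conn, ["orders", "users"])
```

Because the statistics already exist when the query arrives, the optimizer pays only for this lookup rather than for scanning the tables themselves.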

Step 2: estimate, from the obtained statistical information, the data volume of all tables in the current execution plan.

An execution plan here refers to a query carried out with a particular table join order.

Before the cost of an execution plan is estimated, the relevant parameters, namely the selectivity and the data volume, must be estimated. The calculation flow is shown in Fig. 6, and the specific steps are as follows:

Using the statistics obtained in the previous step, first compute the selectivity of each table participating in the join. Step 2-1: compute, from the query conditions in the join query and the statistical information, the proportion of rows satisfying the conditions within the object set being queried.

For any two query conditions contained in a query, the formula depends on the relation between them:

The selectivity selectivity_(A and B) when the query satisfies query condition A and query condition B simultaneously is:

$$\mathrm{selectivity}_{(A \text{ and } B)} = \mathrm{selectivity}_{(A)} \times \mathrm{selectivity}_{(B)} \qquad (1)$$

where selectivity_(A) is the selectivity of the single query condition A and selectivity_(B) is the selectivity of the single query condition B;

The selectivity selectivity_(A or B) when the query satisfies query condition A or query condition B is:

$$\mathrm{selectivity}_{(A \text{ or } B)} = P(A) + P(B) - \mathrm{selectivity}_{(A \text{ and } B)} \qquad (2)$$

where P(A) is the probability of occurrence of query condition A and P(B) is the probability of occurrence of query condition B;

The selectivity selectivity_(not A) when the query excludes query condition A is:

$$\mathrm{selectivity}_{(\text{not } A)} = 1 - \mathrm{selectivity}_{(A)} \qquad (3)$$

The relation between any two query conditions A and B may be that both are satisfied, or that A or B is satisfied; a query condition may also be the exclusion of A. When a query contains multiple conditions with multiple relations among them, the conditions can be combined pairwise according to the above formulas, each combination being computed according to its relation, to obtain the final selectivity.
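
Formulas (1) to (3) and their pairwise combination can be sketched in a few lines of Python. Two assumptions are made here and are not stated in the text: P(A) and P(B) in formula (2) are taken to equal the selectivities of A and B, and the product in formula (1) is exact only when the two conditions are statistically independent. The nested-tuple representation of a condition tree and the function names are hypothetical.

```python
def sel_and(a, b):
    """Formula (1): both conditions hold (assumes independence)."""
    return a * b


def sel_or(a, b):
    """Formula (2): at least one condition holds, with P(X) taken as sel(X)."""
    return a + b - sel_and(a, b)


def sel_not(a):
    """Formula (3): the condition is excluded."""
    return 1.0 - a


def combine(expr):
    """Evaluate a nested condition tree.

    expr is either a number (selectivity of a single condition) or a tuple
    ("and", x, y), ("or", x, y) or ("not", x).
    """
    if isinstance(expr, (int, float)):
        return float(expr)
    op = expr[0]
    if op == "not":
        return sel_not(combine(expr[1]))
    left, right = combine(expr[1]), combine(expr[2])
    if op == "and":
        return sel_and(left, right)
    if op == "or":
        return sel_or(left, right)
    raise ValueError(f"unknown operator: {op}")
```

For instance, `combine(("or", 0.5, ("not", 0.75)))` first turns the excluded condition into 0.25 and then applies formula (2): 0.5 + 0.25 - 0.125 = 0.625.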

Step 2-2: calculate the data volume of each table from the selectivity obtained in step 2-1, using the following formula:

$$\mathrm{size} = \mathrm{selectivity} \times \mathrm{numOfTableLine} \times \sum_{i=1}^{j} \mathrm{avgColSize}_{i} \qquad (4)$$

Here selectivity is the selectivity obtained in step 2-1, numOfTableLine is the total number of rows of the table or view, avgColSize_i is the average data size of the i-th field of the table that needs to be returned, and j is the number of columns of the table.

The data volume of each table obtained from formula (4) is fed into the cost model to estimate the cost of the multi-join query, giving the cost of each execution plan. Compared with the traditional cardinality-based estimation method, this method depends not only on the number of rows in the intermediate query results but also takes the estimated data volume into account, thereby improving the accuracy of cost estimation.
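
A small worked example, with numbers invented purely for illustration, shows why the two estimates can disagree. Suppose tables R and S each have 10^6 rows and a selectivity of 0.1; the query returns two columns of R averaging 8 and 24 bytes, but only one column of S averaging 4 bytes:

```latex
\begin{align*}
\mathrm{size}_R &= 0.1 \times 10^{6} \times (8 + 24) = 3.2 \times 10^{6}\ \text{bytes} \\
\mathrm{size}_S &= 0.1 \times 10^{6} \times 4 = 4.0 \times 10^{5}\ \text{bytes} \\
\mathrm{rows}_R &= \mathrm{rows}_S = 0.1 \times 10^{6} = 10^{5}
\end{align*}
```

A cardinality-based estimate treats R and S as equally expensive (10^5 rows each), while the data-volume estimate shows that R contributes eight times as many bytes as S.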

Step 3: repeat step 1 and step 2 until the search space of execution plans has been traversed, find the table join order with the minimum data volume, and perform the join in that order.

To find the optimal join order, the data-volume-based cost estimation method proposed by the present invention is used during the search of execution plans. The flow is shown in Fig. 7, and the specific steps are as follows:

Search the space of execution plans with the adopted join order optimization method (that is, repeat step 1 and step 2). The search takes the characteristics of the real-time query system into account and adds corresponding pruning techniques to improve the performance of the plan search and reduce the query latency introduced by running the algorithm itself;

Obtain the estimated data volume of each execution plan through step 2, find the execution plan that best satisfies the given cost model, and store it;

Generate the optimal join order from the optimal execution plan found in the above steps. Because the cost estimation method proposed by the present invention is used, the accuracy of cost estimation is effectively improved.
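
The text does not specify which pruning technique is used. One simple possibility, sketched below under stated assumptions, is branch-and-bound: a partial join order is abandoned as soon as its accumulated cost reaches the cost of the best complete plan found so far. The cost model is again an assumption (left-deep plans, cost equal to the total bytes of all intermediate results), and the names `search_join_order`, `tables`, and `join_sel` are hypothetical.

```python
def search_join_order(tables, join_sel):
    """Depth-first search over left-deep join orders with cost-based pruning.

    tables maps a table name to (estimated_rows, row_bytes).
    join_sel maps frozenset({t1, t2}) to a join selectivity (default 1.0).
    """
    best = {"cost": float("inf"), "order": None}

    def extend(order, rows, width, cost):
        if cost >= best["cost"]:
            return  # prune: cost only grows as more joins are added
        if len(order) == len(tables):
            best["cost"], best["order"] = cost, tuple(order)
            return
        for name, (t_rows, t_width) in tables.items():
            if name in order:
                continue
            sel = 1.0
            for prev in order:
                sel *= join_sel.get(frozenset((prev, name)), 1.0)
            new_rows = rows * t_rows * sel
            new_width = width + t_width
            # Assumed cost: bytes of the intermediate result just produced.
            extend(order + [name], new_rows, new_width, cost + new_rows * new_width)

    for name, (t_rows, t_width) in tables.items():
        extend([name], t_rows, t_width, 0.0)
    return best["order"], best["cost"]


# Hypothetical estimates: (rows after filtering, bytes per returned row).
tables = {"a": (1000, 10), "b": (100, 200), "c": (10, 4)}
join_sel = {frozenset(("a", "b")): 0.01, frozenset(("b", "c")): 0.1}
order, cost = search_join_order(tables, join_sel)
```

Pruning is safe under this cost model because every additional join adds a non-negative amount to the cost, so a partial plan that is already no cheaper than the best complete plan cannot improve on it.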

Claims

1. A query join method based on data volume, characterized by comprising:

Step 1: submit a query request to the metadata server to obtain the statistical information of each table participating in the join;

Step 2: estimate, from the obtained statistical information, the data volume of all tables in the current query execution plan;

Step 3: repeat step 1 and step 2 until the search space of query execution plans has been traversed, find the execution plan whose data volumes give the minimum query cost, and join the tables in the join order of that execution plan;

In step 1, the metadata server is built by choosing a relational database and designing a column-level table schema, then creating the metadata database and table relations in that relational database according to the designed schema;

In step 2, the data volume of each table is calculated from the table's selectivity, the average data size of its fields, and its total number of rows; the selectivity is computed from the query conditions in the query and the statistical information, and is the proportion of rows satisfying the query conditions within the object set being queried; the data volume of each table is calculated as follows:

$$\mathrm{size} = \mathrm{selectivity} \times \mathrm{numOfTableLine} \times \sum_{i=1}^{j} \mathrm{avgColSize}_{i}$$

where selectivity is the selectivity of the query, numOfTableLine is the total number of rows of the table or view, avgColSize_i is the average data size of the i-th field of the table that needs to be returned, and j is the number of columns of the table.

2. The query join method based on data volume as claimed in claim 1, characterized in that the statistical information stored in the metadata server is the statistical information of each table, obtained by collecting statistics on the table according to the designed table schema.

3. The query join method based on data volume as claimed in claim 1, characterized in that the relational database is a MySQL database, a Derby database, or an Oracle database.

4. The query join method based on data volume as claimed in claim 1, characterized in that the statistical information includes: the column name, the lower bound of the data values in the column, the upper bound of the data values in the column, the number of null values in the column, the number of distinct values in the column, the average data size of the field data in the column, the maximum data size of the field data in the column, and the total number of rows of the table or view.

5. The query join method based on data volume as claimed in claim 1, characterized in that both the construction of the metadata server and the storage of the statistical information in the metadata server are completed offline.