CN103927346A

CN103927346A - Query connection method on basis of data volumes

Info

Publication number: CN103927346A
Application number: CN201410124531.1A
Authority: CN
Inventors: 陈岭; 周强
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-03-28
Filing date: 2014-03-28
Publication date: 2014-07-16
Anticipated expiration: 2034-03-28
Also published as: CN103927346B

Abstract

The invention discloses a query join method based on data volume. For real-time queries over big data, the method takes into account characteristics such as columnar file reading when estimating cost, so that the optimal join order can be generated. The method mainly comprises: building a metadata server; collecting statistics; querying the metadata server to obtain the statistics of each table participating in the join; estimating the selectivity and related parameters such as data volume from those statistics; and computing the cost of each execution plan to find the optimal join order. The method improves the accuracy of cost estimation, which helps ensure that the optimal execution plan is found and effectively improves overall query efficiency.

Description

Query join method based on data volume

Technical field

The present invention relates to the technical field of real-time query optimization for big data, and in particular to a query join method based on data volume.

Background technology

Real-time big data querying is an important big data technology; existing big data query systems include Google Dremel, Cloudera Impala, Berkeley Shark, and Apache Drill. Such systems generally adopt a distributed computing architecture and, because they weaken support for features such as transactions, scale better than relational database clusters. Because real-time big data querying meets users' real-time query needs well, it has broad application prospects in fields such as the Internet and smart cities.

Multi-join order optimization is an important component of a database management system (DBMS) and is equally indispensable in real-time big data querying. Using a given optimization method, it traverses the search space of execution plans to find the best join order and thereby generate the best execution plan, which improves the performance of the big data query system and meets the real-time requirements of user queries.

Cost estimation is a very important part of multi-join order optimization, and an effective method for estimating result size is the key to effective query optimization. The traditional cost estimation method is based on table cardinality; it solves the traditional cost estimation problem effectively and thus ensures that the optimal execution plan under the cost model is found. In distributed database systems and data warehouses, however, some tables are stored in columnar file formats, which are designed to optimize the I/O performance of low-level reads and to reduce the amount of data transferred. Taking RCFile as an example, the file is first partitioned horizontally by rows and then partitioned vertically by columns, so that only the required columns are read and transferred. When tables stored in a columnar file format participate in a join, the traditional cardinality-based cost estimate can deviate seriously. The execution plan that the join order optimization algorithm selects under the cost model is then not the best one, the join order found is not optimal, and the overall query latency is higher as a result.

Summary of the invention

The technical problem to be solved by the present invention is how to improve the accuracy of cost estimation when a real-time big data query system performs multi-join order optimization, and thereby improve overall query efficiency. To address the problems of traditional cardinality-based cost estimation, the present invention proposes a data-volume-based cost estimation method for multi-join queries. It considers that some of the relations participating in the joins of a user query may be stored as columnar files. By taking characteristics such as columnar file reading into account and adding finer-grained statistics, it uses the average length of each field to estimate the size of the join intermediate results, which effectively ensures the accuracy of cost estimation.

A query join method based on data volume, comprising:

Step 1: submit a query request to the metadata server and obtain the statistics of each table participating in the join;

Step 2: estimate, from the statistics obtained, the data volume of all tables in the current execution plan;

Step 3: repeat steps 1 and 2 until the search space of execution plans has been traversed, find the execution plan whose data volume minimizes the query cost, and join the tables in the join order of that plan.

Here the search space of execution plans is the set of table join orders given by all execution plans.
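The three steps can be sketched as a plain exhaustive loop over join orders. This is a minimal illustration, not the patented implementation: the statistics in `STATS`, the helper names, and in particular `toy_cost` (which simply weights tables joined earlier more heavily) are assumptions made up for the example, since the source does not specify a concrete cost model.

```python
from itertools import permutations

# Hypothetical statistics, standing in for what a metadata server would return.
STATS = {
    "orders":   {"rows": 1_000_000, "avg_col_sizes": [8, 8, 4], "selectivity": 0.10},
    "customer": {"rows": 50_000,    "avg_col_sizes": [8, 24],   "selectivity": 1.00},
    "nation":   {"rows": 25,        "avg_col_sizes": [4, 16],   "selectivity": 0.20},
}

def fetch_stats(table):
    """Step 1: stand-in for a query request to the metadata server."""
    return STATS[table]

def estimate_volume(stats):
    """Step 2: selectivity * total rows * sum of average column sizes."""
    return stats["selectivity"] * stats["rows"] * sum(stats["avg_col_sizes"])

def toy_cost(order, volumes):
    """Assumed cost model: a table joined earlier is carried through more joins."""
    n = len(order)
    return sum(volumes[t] * (n - i) for i, t in enumerate(order))

def choose_join_order(tables, cost_model):
    """Step 3: traverse every join order and keep the cheapest one."""
    best_order, best_cost = None, float("inf")
    for order in permutations(tables):
        volumes = {t: estimate_volume(fetch_stats(t)) for t in order}
        cost = cost_model(order, volumes)
        if cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost

best_order, best_cost = choose_join_order(list(STATS), toy_cost)
print(best_order, best_cost)
```

Under this toy cost model the table with the smallest estimated volume ends up first in the join order; a real optimizer would substitute its own cost model and would normally avoid enumerating every permutation.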

The present invention uses data volume as the query cost to determine the join order in a multi-join query, so that a real-time big data query system estimates cost more accurately during multi-join order optimization and overall query efficiency improves.

The metadata server is built as follows: a relational database is chosen and a column-level table schema is designed; the metadata database and its table relations are then created in that relational database according to the designed schema, yielding the metadata server.

So that statistics at three granularities (table level, partition level, and column level) can be provided to the query system, the table schema should satisfy a suitable normal form and, provided cost estimation can still be completed, keep unnecessary storage overhead to a minimum.
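One possible shape for such a schema is sketched below. It uses Python's built-in `sqlite3` purely as a self-contained stand-in for the relational database; the table and column names (`tbl_stats`, `part_stats`, `col_stats`, and so on) are hypothetical and are not taken from the source.

```python
import sqlite3

# Hypothetical schema with one table per statistics granularity.
SCHEMA = """
CREATE TABLE tbl_stats (
    tbl_name     TEXT PRIMARY KEY,
    num_rows     INTEGER NOT NULL
);
CREATE TABLE part_stats (
    tbl_name     TEXT NOT NULL REFERENCES tbl_stats(tbl_name),
    part_name    TEXT NOT NULL,
    num_rows     INTEGER NOT NULL,
    PRIMARY KEY (tbl_name, part_name)
);
CREATE TABLE col_stats (
    tbl_name     TEXT NOT NULL REFERENCES tbl_stats(tbl_name),
    col_name     TEXT NOT NULL,
    low_value    REAL,
    high_value   REAL,
    num_nulls    INTEGER,
    num_distinct INTEGER,
    avg_col_len  REAL,
    max_col_len  INTEGER,
    PRIMARY KEY (tbl_name, col_name)
);
"""

def create_metastore(path=":memory:"):
    """Create the metadata database and its tables; returns an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = create_metastore()
table_names = {row[0] for row in
               conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
print(sorted(table_names))
```

Keeping the row count in the table-level and partition-level tables and everything column-specific in `col_stats` avoids repeating the row count on every column record, which is one way to limit storage overhead.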

The statistics in the metadata server are the statistics of each table, gathered by analyzing the tables according to the designed table schema.

The granularity of the statistics follows the granularity of the table schema: because the schema is column-level, the statistics include column-level statistics.

The relational database is a MySQL database, a Derby database, or an Oracle database.

A suitable relational database is chosen as the metadata server of the real-time big data query system according to the actual needs of the enterprise users and the system.

The statistics comprise: the column name, the lower bound of the values in the column, the upper bound of the values in the column, the number of null values in the column, the number of distinct values in the column, the average length of the field data in the column, the maximum length of the field data in the column, and the total number of rows of the table or view.

Both the building of the metadata server and the storing of statistics in it are completed offline.

Because the metadata server is built and the statistics are collected offline, returning statistics during an actual query incurs little run-time overhead, which greatly reduces the latency of cost estimation.

In step 2, the data volume of each table is calculated from the table's selectivity, the average field data sizes, and the table's total number of rows.

The selectivity is derived from statistics such as the lower and upper bounds of the values in a column, together with the join-related conditions in the query; it is denoted selectivity.

The selectivity is estimated by performing the corresponding calculation on the query conditions and the statistics, giving the proportion of rows in the table that satisfy the query conditions relative to the object set being queried.

The object set may be a table, a view, or an intermediate result.
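The source does not give the formula that turns column bounds into a selectivity. A common textbook approach, shown here only as an assumption, is to treat values as uniformly distributed between the lower and upper bound for range predicates and to use the number of distinct values for equality predicates. The function and parameter names are hypothetical.

```python
def range_selectivity(low, high, lo_bound=None, hi_bound=None):
    """Fraction of the column range [low, high] covered by
    lo_bound <= col <= hi_bound, assuming uniformly distributed values."""
    if high <= low:
        return 1.0  # degenerate range: no basis for narrowing the estimate
    a = low if lo_bound is None else max(low, lo_bound)
    b = high if hi_bound is None else min(high, hi_bound)
    return max(0.0, b - a) / (high - low)

def equality_selectivity(num_distinct):
    """Assumes every distinct value is equally frequent."""
    return 1.0 / num_distinct if num_distinct > 0 else 0.0

print(range_selectivity(0, 100, 25, 75))   # predicate covers half the range
print(equality_selectivity(4))
```

Both helpers read only the column-level statistics (bounds and distinct count), so they can be evaluated without touching the table data itself.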

The data volume is calculated as follows:

$$size = selectivity \times numsOfTableLine \times \sum_{i=1}^{j} avgColSize_{i}$$

Here $selectivity$ is the selectivity of the query, $numsOfTableLine$ is the total number of rows of the table or view, $avgColSize_{i}$ is the average data size of the i-th column field of the table that needs to be returned, and $j$ is the number of columns of the table.

Compared with the traditional cardinality-based estimate, this method depends not only on the number of rows produced by the query's intermediate results but also takes the estimated data volume into account, which improves the accuracy of cost estimation.
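The formula translates directly into a one-line function. The sketch below also contrasts it with a row-count-only estimate; the two example tables and their column sizes are hypothetical, and sizes are assumed to be in bytes.

```python
def estimate_size(selectivity, nums_of_table_line, avg_col_sizes):
    """size = selectivity * numsOfTableLine * sum of avgColSize_i."""
    return selectivity * nums_of_table_line * sum(avg_col_sizes)

def estimate_rows(selectivity, nums_of_table_line):
    """Cardinality-only estimate, shown for comparison."""
    return selectivity * nums_of_table_line

# Two hypothetical tables with the same row count and selectivity,
# but very different widths for the columns that must be returned.
wide = estimate_size(0.5, 1000, [8, 200])
narrow = estimate_size(0.5, 1000, [8])
rows = estimate_rows(0.5, 1000)
print(rows, wide, narrow)
```

Both tables produce the same cardinality estimate, yet the data-volume estimate differs by a factor of 26, which is the difference a cardinality-based cost model cannot see.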

Advantage of the present invention comprises:

To address the inaccuracy of the traditional cardinality-based cost method, the invention takes characteristics such as columnar file reading into account and adds finer-grained statistics, which effectively improves the accuracy of cost estimation.

Table statistics are stored and maintained by the metadata server, which avoids repeating large amounts of analysis work, reduces run-time overhead, and improves the efficiency of cost estimation.

Brief description of the drawings

Fig. 1 is the overall flowchart of the data-volume-based query join method in an embodiment of the present invention;

Fig. 2 is the query processing architecture diagram used in the present embodiment;

Fig. 3 is the flowchart of building the metadata server in the present embodiment;

Fig. 4 is the flowchart of statistics collection in the present embodiment;

Fig. 5 is the flowchart of querying statistics in the present embodiment;

Fig. 6 is the flowchart of data volume estimation in the present embodiment;

Fig. 7 is the flowchart of join order generation in the present embodiment.

Embodiment

The present invention proposes a query join method based on data volume, in which cost estimation is performed on a multi-join query at query time; the overall flow of the cost estimation method is shown in Fig. 1. First, the metadata server is built; next, statistics are collected; then the statistics of each table participating in the join are obtained by querying the metadata server; after that, the selectivity and related parameters such as data volume are estimated from the statistics; finally, the data-volume-based estimation method is used to compute the cost of each execution plan and find the best join order.

To show more intuitively the role the proposed method plays in query optimization, the query processing architecture is given in Fig. 2, which sets out the relationship between the data-volume-based cost estimation module and the join order generation module. In the join order generation module, the optimizer searches the execution plans; the data-volume-based cost estimation module consists mainly of two parts, the Cost Model and the MetaStore, and performs the cost estimation. A query submitted by a user is parsed and then passed to the multi-join order optimization method for order optimization. While searching the execution plans, that method calls the cost estimation module to estimate costs, so as to ensure that the best join order under the given cost model is found.

The steps of the data-volume-based cost estimation method for multi-join queries proposed by the present invention are as follows.

Before the join query is executed, the metadata server must first be built and the statistics of the tables stored in it.

Select a relational database and design the table schema to build the metadata server.

For the data-volume-based cost estimation method to be implemented efficiently, the metadata server must first be built. The flow is shown in Fig. 3, and the specific steps are as follows:

According to the actual needs of the enterprise users and the system, choose a suitable relational database (such as a MySQL database or a Derby database) as the metadata server of the real-time big data query system;

So that statistics at three granularities (table level, partition level, and column level) can be provided to the query system, design a table schema that satisfies a suitable normal form and, provided cost estimation can still be completed, keeps unnecessary storage overhead to a minimum;

Create the metadata database and its table relations in the corresponding database server according to the designed table schema, for use in subsequent steps.

According to the designed table schema, analyze the relations in each table and store the corresponding statistics in the metadata server to complete the collection of statistics.

Before join order optimization can be performed on a parsed query, statistics must be collected once the metadata server has been created. The flow is shown in Fig. 4, and the specific steps are as follows:

To reduce the overhead of obtaining statistics for cost estimation during join order optimization, first analyze the tables that are frequently joined, using the corresponding analysis statements or tools;

Collect the statistics of the analyzed tables and store them in the corresponding tables of the metadata server. To better support data-volume-based cost estimation, column-level statistics such as the average field length AVG_COL_LEN must be collected; these are provided for during table schema design. The statistics comprise: the column name, the lower bound of the values in the column, the upper bound of the values in the column, the number of null values in the column, the number of distinct values in the column, the average data size of the field data in the column, the maximum data size of the field data in the column, and the total number of rows of the table or view.
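The collection step can be illustrated with a small function that derives the listed statistics for one column from in-memory rows. This is a sketch under stated assumptions: rows are plain dictionaries, field length is measured as the UTF-8 byte length of the value's string form, and the key names are hypothetical. A real system would instead run its own analysis statements or tools against the stored table.

```python
def collect_column_stats(rows, column):
    """Compute the column-level statistics listed above for one column."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    # Assumed size measure: bytes of the UTF-8 encoded string form.
    lengths = [len(str(v).encode("utf-8")) for v in non_null]
    return {
        "col_name": column,
        "low_value": min(non_null) if non_null else None,
        "high_value": max(non_null) if non_null else None,
        "num_nulls": len(values) - len(non_null),
        "num_distinct": len(set(non_null)),
        "avg_col_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_col_len": max(lengths) if lengths else 0,
        "num_rows": len(values),
    }

sample = [{"city": "Hangzhou"}, {"city": "Ningbo"}, {"city": None}, {"city": "Ningbo"}]
city_stats = collect_column_stats(sample, "city")
print(city_stats)
```

The resulting dictionary has one entry per statistic, so it can be written to the metadata server as a single column-level record.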

The creation of the metadata server (that is, the metadata database) and the collection of statistics are both completed offline, after which queries are performed.

Step 1: submit a query request to the metadata server to obtain the statistics of each table participating in the join.

This step mainly queries and retrieves the statistics. The flow is shown in Fig. 5, and the specific steps are as follows:

To obtain the statistics of each table participating in the join query, the query optimization module submits a query request to the corresponding metadata server;

The metadata server returns the statistics of each table relation, completing the retrieval of statistics for use in the parameter calculations of the next stage.
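A lookup of this kind might look as follows. The sketch again uses `sqlite3` as a stand-in for the metadata server; the table names, column names, and the sample rows inserted for the demonstration are all hypothetical.

```python
import sqlite3

def fetch_table_stats(conn, tables):
    """Return {table: {"num_rows": n, "columns": {col_name: avg_col_len}}}."""
    result = {}
    for name in tables:
        row = conn.execute(
            "SELECT num_rows FROM tbl_stats WHERE tbl_name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(f"no statistics collected for table {name!r}")
        cols = conn.execute(
            "SELECT col_name, avg_col_len FROM col_stats WHERE tbl_name = ?", (name,)
        ).fetchall()
        result[name] = {"num_rows": row[0], "columns": dict(cols)}
    return result

# Demonstration with a throwaway in-memory metadata database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_stats (tbl_name TEXT PRIMARY KEY, num_rows INTEGER);
CREATE TABLE col_stats (tbl_name TEXT, col_name TEXT, avg_col_len REAL);
INSERT INTO tbl_stats VALUES ('orders', 1000);
INSERT INTO col_stats VALUES ('orders', 'id', 8.0);
INSERT INTO col_stats VALUES ('orders', 'note', 40.0);
""")
stats = fetch_table_stats(conn, ["orders"])
print(stats)
```

Because the statistics were written offline, the query-time work is limited to a few indexed lookups per table.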

Because the metadata server is built and the statistics are collected offline, this step incurs little run-time overhead, which greatly reduces the latency of cost estimation.

Step 2: estimate, from the statistics obtained, the data volume of all tables in the current execution plan.

Here an execution plan is a query carried out with a particular table join order.

Before the cost of an execution plan is estimated, the related parameters, namely the selectivity and the data volume, must be estimated. The flow is shown in Fig. 6, and the specific steps are as follows:

Step 2-1: using the statistics obtained in the previous step, first calculate the selectivity of each table participating in the join. The corresponding calculation is performed on the query conditions in the join query and the statistics, giving the proportion of rows that satisfy the conditions relative to the object set being queried.

For any two query conditions in a query, the formula depends on the relation that must be satisfied:

When the query must satisfy query condition A and query condition B simultaneously, the selectivity $selectivity_{(A\,and\,B)}$ is:

$$selectivity_{(A\,and\,B)} = selectivity_{(A)} \times selectivity_{(B)} \tag{1}$$

where $selectivity_{(A)}$ is the selectivity of query condition A alone and $selectivity_{(B)}$ is the selectivity of query condition B alone;

When the query must satisfy query condition A or query condition B, the selectivity $selectivity_{(A\,or\,B)}$ is:

$$selectivity_{(A\,or\,B)} = P(A) + P(B) - selectivity_{(A\,and\,B)} \tag{2}$$

where $P(A)$ is the probability of occurrence of query condition A and $P(B)$ is the probability of occurrence of query condition B;

When the query excludes query condition A, the selectivity $selectivity_{(not\,A)}$ is:

$$selectivity_{(not\,A)} = 1 - selectivity_{(A)} \tag{3}$$

Any two query conditions A and B may be related in one of these ways: both are satisfied, or A or B is satisfied; a query condition may also be the exclusion of A. When a query contains multiple conditions with several kinds of relations among them, the conditions are combined pairwise according to the formulas above, each combination is calculated according to the relation it satisfies, and the final selectivity is obtained.
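Formulas (1) to (3) and their pairwise combination can be sketched as three small functions. The sketch assumes that the probability of occurrence P(A) in formula (2) equals the selectivity of condition A; the example predicate and its input selectivities are hypothetical.

```python
def sel_and(sel_a, sel_b):
    """Formula (1): both conditions must hold."""
    return sel_a * sel_b

def sel_or(sel_a, sel_b):
    """Formula (2), taking P(A) and P(B) as the individual selectivities."""
    return sel_a + sel_b - sel_and(sel_a, sel_b)

def sel_not(sel_a):
    """Formula (3): condition A is excluded."""
    return 1.0 - sel_a

# Hypothetical predicate: (A AND B) OR (NOT C)
sel_a, sel_b, sel_c = 0.5, 0.2, 0.9
combined = sel_or(sel_and(sel_a, sel_b), sel_not(sel_c))
print(round(combined, 4))
```

Nesting the calls mirrors the pairwise combination described above: each inner call reduces two conditions to one selectivity, which then takes part in the next combination.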

Step 2-2: calculate the data volume of each table from the selectivity obtained in step 2-1, using the following formula:

$$size = selectivity \times numsOfTableLine \times \sum_{i=1}^{j} avgColSize_{i} \tag{4}$$

Here $selectivity$ is the selectivity calculated in step 2-1, $numsOfTableLine$ is the total number of rows of the table or view, $avgColSize_{i}$ is the average data size of the i-th column field of the table that needs to be returned, and $j$ is the number of columns of the table.

The data volume of each table calculated with formula (4) is fed into the cost model to estimate the cost of the multi-join query, giving the cost of each execution plan. Compared with the traditional cardinality-based estimate, this method depends not only on the number of rows produced by the query's intermediate results but also takes the estimated data volume into account, which improves the accuracy of cost estimation.
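As a worked illustration of formula (4) with purely hypothetical statistics (none of these numbers come from the source), suppose a table has 1,000,000 rows and three columns with average sizes of 8, 8, and 4 bytes, and step 2-1 gives a selectivity of 0.1:

```latex
\begin{aligned}
size &= selectivity \times numsOfTableLine \times \sum_{i=1}^{3} avgColSize_{i} \\
     &= 0.1 \times 1\,000\,000 \times (8 + 8 + 4) \\
     &= 2\,000\,000 \ \text{bytes}
\end{aligned}
```

A cardinality-only estimate would report 100,000 rows for this table whether the three columns total 20 bytes or 2,000 bytes per row, so it could not separate the two cases.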

Step 3: repeat steps 1 and 2 until the search space of execution plans has been traversed, and join the tables in the join order with the smallest data volume.

To find the best join order, the data-volume-based cost estimation method proposed by the present invention is used during the search of execution plans. The flow is shown in Fig. 7, and the specific steps are as follows:

Search the space of execution plans (that is, repeat steps 1 and 2) according to the join order optimization method adopted; by taking the characteristics of the real-time query system into account and adding corresponding pruning techniques, the method optimizes the performance of the execution plan search and reduces the latency incurred by running the algorithm itself;

Obtain the estimated data volume of each execution plan through step 2, find the execution plan that satisfies the given cost model, and store it;

Generate the best join order from the optimal execution plan found in the steps above. Because the cost estimation method proposed by the present invention is used, the accuracy of cost estimation is effectively improved.
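The source does not say which search algorithm or pruning rule is used. As one possible illustration, the sketch below runs a depth-first search over left-deep join orders and prunes any partial plan whose accumulated cost already reaches the best complete plan found so far. The cost model (the sum of the estimated data volumes of all intermediate results), the pairwise join selectivities, and all table figures are assumptions made for the example.

```python
from math import inf

def best_join_order(rows, width, join_sel):
    """Search left-deep join orders with cost-bound pruning.

    rows[t]  -- estimated rows of table t after filtering (selectivity * row count)
    width[t] -- sum of the average sizes of the columns read from t, in bytes
    join_sel -- maps frozenset({a, b}) to an assumed join selectivity;
                pairs without an entry are treated as a cross product
    """
    best_cost, best_order = inf, None

    def extend(order, cur_rows, cur_width, cost):
        nonlocal best_cost, best_order
        if cost >= best_cost:        # prune: cost can only grow with more joins
            return
        if len(order) == len(rows):  # complete plan that beats the current best
            best_cost, best_order = cost, order
            return
        for t in rows:
            if t in order:
                continue
            sel = 1.0
            for s in order:
                sel *= join_sel.get(frozenset((s, t)), 1.0)
            out_rows = cur_rows * rows[t] * sel
            out_width = cur_width + width[t]
            extend(order + (t,), out_rows, out_width, cost + out_rows * out_width)

    for first in rows:
        extend((first,), rows[first], width[first], 0.0)
    return best_order, best_cost

rows = {"orders": 100_000.0, "customer": 50_000.0, "nation": 5.0}
width = {"orders": 20.0, "customer": 32.0, "nation": 20.0}
join_sel = {
    frozenset(("orders", "customer")): 1 / 50_000,
    frozenset(("customer", "nation")): 1 / 25,
}
order, cost = best_join_order(rows, width, join_sel)
print(order, cost)
```

In this example the plans that start by combining `orders` with `nation` form a cross product, exceed the best known cost after a single join, and are cut off without being expanded further.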

Claims

1. A query join method based on data volume, characterized by comprising:

2. The query join method based on data volume according to claim 1, characterized in that the metadata server is built as follows: a relational database is chosen and a column-level table schema is designed, and the metadata database and its table relations are created in that relational database according to the designed schema, thereby building the metadata server.

3. The query join method based on data volume according to claim 1, characterized in that the statistics stored in the metadata server are the statistics of each table, gathered by analyzing the tables according to the designed table schema.

4. The query join method based on data volume according to claim 1, characterized in that the relational database is a MySQL database, a Derby database, or an Oracle database.

5. The query join method based on data volume according to claim 1, characterized in that the statistics comprise: the column name, the lower bound of the values in the column, the upper bound of the values in the column, the number of null values in the column, the number of distinct values in the column, the average data size of the field data in the column, the maximum data size of the field data in the column, and the total number of rows of the table or view.

6. The query join method based on data volume according to claim 1, characterized in that both the building of the metadata server and the storing of statistics in it are completed offline.

7. The query join method based on data volume according to claim 1, characterized in that, in step 2, the data volume of each table is calculated from the table's selectivity, the average field data sizes, and the table's total number of rows.

8. The query join method based on data volume according to claim 7, characterized in that the selectivity is estimated by performing the corresponding calculation on the query conditions and the statistics, giving the proportion of rows in the table that satisfy the query conditions relative to the object set being queried.

9. The query join method based on data volume according to claim 8, characterized in that the data volume of each table is calculated as follows:

$$size = selectivity \times numsOfTableLine \times \sum_{i=1}^{j} avgColSize_{i}$$