CN104834650A

CN104834650A - Method and system for generating effective query tasks

Info

Publication number: CN104834650A
Application number: CN201410049127.2A
Authority: CN
Inventors: 汪东升; 李宝禄; 王占业
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2014-02-12
Filing date: 2014-02-12
Publication date: 2015-08-12

Abstract

The present invention discloses a method and system for generating effective query tasks. The method includes the following steps: sorting structured table data by a key column and then storing it in blocks, to obtain a plurality of data blocks corresponding to the structured table data; obtaining the value range of the key column in each data block to create a data block index; and, when a query task that queries by or includes the key column is received, generating effective query tasks, according to the data block index, for the data blocks containing result information. By creating the data block index from the value range of a specific column in each data block and generating only effective query tasks, the method reduces invalid query tasks, improves data processing speed, and lowers the burden on the data management system.

Description

Method and system for generating effective query tasks

Technical field

The present invention relates to the field of computer data processing, and in particular to a method and system for generating effective query tasks.

Background art

With the rapid development of the Internet and the rapid spread of mobile terminals of all kinds, the data held by enterprises and institutions keeps growing quickly. Internet data in particular is expanding exponentially, and this trend is expected to continue for some time. According to statistics from the well-known consulting firm IDC (International Data Corporation), the total amount of data created and copied worldwide in 2011 was 1.8 ZB (1 ZB = 10^21 bytes), of which 75% came from individuals, mainly documents, pictures, video and music. This far exceeds the 200 PB (1 PB = 10^15 bytes) total of all printed material produced in human history. The US Internet Data Center points out that data on the Internet will grow by 50% every year and double every two years, and that more than 90% of the world's data has been produced in the last few years. Processing massive structured data has therefore become an urgent problem.

In recent years, large IT companies abroad have proposed their own solutions for processing massive structured data, such as Greenplum from EMC, Stinger from Hortonworks, and Impala from Cloudera. The core idea of all these solutions is to store and manage massive structured data in a distributed, parallel manner. Specifically:

Greenplum stores its data in a distributed manner in the open-source database PostgreSQL. Every machine in the cluster has PostgreSQL installed, data is placed by hashing, and each node stores part of the data of a table. When a query is executed, the nodes containing data all perform the same operation, and the Master node finally aggregates the results and returns them;

Stinger parses an SQL statement into a directed acyclic graph (DAG), which represents one query operation over the data blocks. The underlying storage is HDFS. When a query is executed, the DAG operations run on the nodes containing the data, and the final result is written back to HDFS;

Impala also uses HDFS as its underlying storage. When a query is executed, a query plan is generated first and then distributed to the nodes containing the data blocks for execution. The query plan is executed by a distributed database execution engine, that is, each node has a database execution engine that performs the data query.

The existing solutions for processing massive structured data share a distributed, parallel design: data is stored in a distributed manner and queries are executed in parallel. In practice, queries often filter data by certain fields (columns), yet all the data of the table involved is still scanned once. For tables with a large amount of data, many data blocks contain no result information, so executing query tasks on these data blocks produces many invalid query tasks.

Summary of the invention

(1) Technical problem to be solved

The technical problem to be solved by the present invention is the following: existing data queries often scan all the data of the table involved; for tables with a large amount of data, many data blocks contain no result information, and executing query tasks on these data blocks produces many invalid query tasks.

(2) Technical solution

To this end, the present invention proposes a method for generating effective query tasks, comprising the following steps:

sorting structured table data by a key column and then storing it in blocks, to obtain multiple data blocks corresponding to the structured table data;

obtaining the value range of the key column in each data block to create a data block index;

when a query task that queries by or includes the key column is received, generating effective query tasks, according to the data block index, for the data blocks containing result information.

Preferably, the method further comprises:

receiving a query task sent by a client;

judging whether the query task is a query task that queries by or includes the key column.

Preferably, when the query task is not a query task that queries by or includes the key column, query tasks are executed on all data blocks.

Preferably, obtaining the value range of the key column in each data block to create a data block index specifically comprises:

obtaining the value range of the key column in each data block;

recording the value range of the key column in the data block as the data block index of that data block.

Preferably, generating effective query tasks, according to the data block index, for the data blocks containing result information specifically comprises:

extracting the search condition of the current query task;

reading, according to the search condition, the data block index entries that satisfy the search condition;

looking up, according to those data block index entries, the data blocks containing result information;

generating effective query tasks for the data blocks containing result information, and performing the data query.

Preferably, the method further comprises: setting a key column for the structured table data.

In addition, the present invention also provides a system for generating effective query tasks, comprising a data blocking module, an acquisition module and a generation module;

the data blocking module is configured to sort structured table data by a key column and then store it in blocks, to obtain multiple data blocks corresponding to the structured table data;

the acquisition module is configured to obtain the value range of the key column in each data block to create a data block index;

the generation module is configured to generate effective query tasks, according to the data block index, for the data blocks containing result information when a query task that queries by or includes the key column is received.

Preferably, the system further comprises a receiving module and a judging module;

the receiving module is configured to receive a query task sent by a client;

the judging module is configured to judge whether the query task is a query task that queries by or includes the key column.

Preferably, the acquisition module comprises an acquiring unit and a recording unit;

the acquiring unit is configured to obtain the value range of the key column in each data block;

the recording unit is configured to record the value range of the key column in the data block as the data block index of that data block.

Preferably, the generation module comprises an extraction unit, a reading unit, a search unit and a generation unit;

the extraction unit is configured to extract the search condition of the current query task;

the reading unit is configured to read, according to the search condition, the data block index entries that satisfy the search condition;

the search unit is configured to look up, according to those data block index entries, the data blocks containing result information;

the generation unit is configured to generate effective query tasks for the data blocks containing result information and perform the data query.

(3) Beneficial effects

With the method and system for generating effective query tasks disclosed by the present invention, a data block index is created from the value range of a particular column in each data block and only effective query tasks are generated. This reduces invalid query tasks, improves the speed of data processing, and lowers the burden on the data management system. The system is also general and stable, is easy to develop and test, and is easy to implement.

Brief description of the drawings

The features and advantages of the present invention can be understood more clearly with reference to the accompanying drawings. The drawings are schematic and should not be construed as limiting the present invention in any way. In the drawings:

Fig. 1 is a flow chart of the method for generating effective query tasks of the present invention;

Fig. 2 is a module diagram of the system for generating effective query tasks of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

The present invention proposes a method and system for generating effective query tasks. The system adopts a distributed architecture. Structured table data is stored in HDFS, divided into blocks according to a particular column, and the value range of that column in each block is stored as an index. Query operations on a table are carried out by a distributed database execution engine: each node that holds data blocks of a given table starts a database execution engine to perform the data query. The system can generate effective query tasks when the particular column is used as a search condition. System deployment requires that the nodes in the data center be interconnected through switches, so that all nodes can access one another and transfer data.

The system adopts a distributed architecture and runs on the Hadoop Distributed File System (HDFS). Each node in the data center runs a task-executing daemon process, which is responsible for receiving queries and for executing the query commands sent by other daemon processes. The node that receives a query request is called the task scheduling node and is responsible for the current query: once it takes charge of a query operation, it distributes the query tasks and aggregates the returned results. The method can generate effective query tasks for queries that retrieve by the particular column. Here, an effective query task is one whose target data block contains final result information; conversely, an invalid query task is one that is certain to yield no result information.
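The division of labour between the task scheduling node and the per-node daemons can be illustrated with a minimal sketch. It is not the patent's implementation: a thread pool stands in for the daemons running on separate nodes, and `NODE_DATA`, `run_query_task` and `schedule_query` are hypothetical names introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the rows held by each node (node name -> rows).
NODE_DATA = {
    "node-a": [{"id": 1, "city": "Beijing"}, {"id": 2, "city": "Tianjin"}],
    "node-b": [{"id": 3, "city": "Beijing"}],
}

def run_query_task(node, predicate):
    """Simulates the daemon on one node running a query task on its local rows."""
    return [row for row in NODE_DATA[node] if predicate(row)]

def schedule_query(nodes, predicate):
    """Plays the task scheduling node: distribute tasks, then gather the results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # pool.map keeps the results in the same order as the input nodes.
        partials = pool.map(lambda node: run_query_task(node, predicate), nodes)
        return [row for part in partials for row in part]

result = schedule_query(["node-a", "node-b"], lambda row: row["city"] == "Beijing")
```

In this sketch every node receives a task; the method described in the following steps aims to avoid sending tasks to blocks that cannot contribute results.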

An embodiment of the present invention proposes a method for generating effective query tasks which, as shown in Fig. 1, comprises the following steps:

Step 101, the structured table data is sorted by the key column and then stored in blocks, to obtain multiple data blocks corresponding to the structured table data;

Using a suitable tool, a standard database file (for example an Oracle, MySQL or DB2 file) is sorted by a particular column and imported into HDFS. After the import, each data block covers a different value range of that column and the blocks are arranged in order, which yields the multiple data blocks corresponding to the structured table data in the standard database file. The particular column is the key column of the structured table data.
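Step 101 can be sketched in a few lines, assuming the table fits in memory as a list of dictionaries. The function name `split_into_blocks`, the column name `ts` and the fixed `block_size` are assumptions made for illustration; the patent itself relies on an unspecified import tool and on HDFS block storage.

```python
def split_into_blocks(rows, key, block_size):
    """Sort rows by the key column, then cut the sorted rows into fixed-size blocks."""
    ordered = sorted(rows, key=lambda row: row[key])
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]

# Hypothetical table with a single key column named "ts".
rows = [{"ts": t} for t in (17, 3, 42, 8, 25, 11)]
blocks = split_into_blocks(rows, key="ts", block_size=2)
```

Because the rows are sorted before they are cut, consecutive blocks cover ascending ranges of the key column, which is what makes a per-block value range a useful index.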

Step 102, the value range of the key column in each data block is obtained to create a data block index;

When the data blocks are stored, the value range of the key column in each data block is recorded at the same time as the index.
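A minimal sketch of step 102 follows, assuming each block is a non-empty list of rows. The index layout (a list of dictionaries with `block_id`, `min` and `max`) is an assumption chosen for readability; the patent only states that the value range of the key column is recorded as the index.

```python
def build_block_index(blocks, key):
    """Record the minimum and maximum key-column value of every block."""
    index = []
    for block_id, block in enumerate(blocks):
        values = [row[key] for row in block]
        index.append({"block_id": block_id, "min": min(values), "max": max(values)})
    return index

# Hypothetical blocks, already sorted by the key column "ts".
blocks = [[{"ts": 3}, {"ts": 8}], [{"ts": 11}, {"ts": 17}], [{"ts": 25}, {"ts": 42}]]
index = build_block_index(blocks, key="ts")
```

The index holds only two values per block, so it stays small even when the blocks themselves are large.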

Step 103, when a query task that queries by or includes the key column is received, effective query tasks are generated, according to the data block index, for the data blocks containing result information.

When a query task that queries by or includes the key column is received, effective query tasks can be generated for it through the index on the particular column. First, the search condition of the current query task and the value range of the particular column are extracted. The data block index entries that satisfy the search condition are then read, and the data blocks containing result information are located through those entries. Finally, effective query tasks are generated for the data blocks containing result information, and the data query is performed.
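Step 103 amounts to a range-overlap test between the search condition and each index entry. The sketch below assumes the search condition is a closed interval `[low, high]` on the key column and reuses the hypothetical index layout with `block_id`, `min` and `max`; the task representation is likewise an assumption.

```python
def generate_effective_tasks(index, low, high):
    """Create query tasks only for blocks whose [min, max] overlaps [low, high]."""
    return [
        {"block_id": entry["block_id"], "low": low, "high": high}
        for entry in index
        if entry["min"] <= high and entry["max"] >= low
    ]

# Hypothetical index over three blocks sorted by the key column.
index = [
    {"block_id": 0, "min": 3, "max": 8},
    {"block_id": 1, "min": 11, "max": 17},
    {"block_id": 2, "min": 25, "max": 42},
]
tasks = generate_effective_tasks(index, low=10, high=20)
```

Here only block 1 receives a task; blocks 0 and 2 cannot contain matching rows and are skipped. A linear scan of the index is used for clarity; since the blocks are sorted, a binary search over the index would be a natural refinement.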

Preferably, the method further comprises:

Step 201, a query task sent by a client is received;

Step 202, it is judged whether the query task is a query task that queries by or includes the key column.

In the embodiment of the present invention, when a query is executed on a table, if the current query operation queries by or includes the particular column, effective query tasks can be generated for it through the index. That is, query tasks are generated only for the data blocks containing result information, and no query task is generated for data blocks that certainly contain no result information, which maximizes task efficiency.
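The judgment in steps 201 and 202, together with the fallback of running tasks on all data blocks when the key column is not involved, can be sketched as follows. The query representation (a dictionary mapping column names to closed intervals) and the function name `plan_tasks` are assumptions for illustration only.

```python
def plan_tasks(query, key_column, index):
    """Return the ids of the blocks that should receive a query task."""
    condition = query["conditions"].get(key_column)
    if condition is None:
        # The query does not involve the key column: fall back to all blocks.
        return [entry["block_id"] for entry in index]
    low, high = condition
    # The query involves the key column: keep only blocks whose range overlaps it.
    return [e["block_id"] for e in index if e["min"] <= high and e["max"] >= low]

# Hypothetical index and queries.
index = [
    {"block_id": 0, "min": 3, "max": 8},
    {"block_id": 1, "min": 11, "max": 17},
    {"block_id": 2, "min": 25, "max": 42},
]
by_key = plan_tasks({"conditions": {"ts": (30, 50)}}, "ts", index)
by_other = plan_tasks({"conditions": {"city": ("A", "B")}}, "ts", index)
```

A query on `ts` is pruned to block 2, while a query that only filters on another column is sent to every block.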

Step 301, the value range of the key column in each data block is obtained;

Step 302, the value range of the key column in the data block is recorded as the data block index of that data block.

Step 401, the search condition of the current query task is extracted;

Step 402, the data block index entries that satisfy the search condition are read according to the search condition;

Step 403, the data blocks containing result information are looked up according to those data block index entries;

Step 404, effective query tasks are generated for the data blocks containing result information, and the data query is performed.

In addition, an embodiment of the present invention also provides a system for generating effective query tasks. As shown in Fig. 2, the system comprises a data blocking module 1, an acquisition module 2 and a generation module 3;

the data blocking module 1 is configured to sort structured table data by a key column and then store it in blocks, to obtain multiple data blocks corresponding to the structured table data;

the acquisition module 2 is configured to obtain the value range of the key column in each data block to create a data block index;

the generation module 3 is configured to generate effective query tasks, according to the data block index, for the data blocks containing result information when a query task that queries by or includes the key column is received.
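One way to picture how the three modules fit together is a small class per module, wired in sequence. This is only a structural sketch: the class and method names, the in-memory data, and the tuple-based index are assumptions, not details given in the patent.

```python
class DataBlockingModule:
    """Sorts rows by the key column and cuts them into blocks (module 1)."""
    def __init__(self, key, block_size):
        self.key, self.block_size = key, block_size

    def split(self, rows):
        ordered = sorted(rows, key=lambda r: r[self.key])
        return [ordered[i:i + self.block_size]
                for i in range(0, len(ordered), self.block_size)]

class AcquisitionModule:
    """Records the (min, max) key-column range of each block (module 2)."""
    def __init__(self, key):
        self.key = key

    def build_index(self, blocks):
        return [(min(r[self.key] for r in b), max(r[self.key] for r in b))
                for b in blocks]

class GenerationModule:
    """Selects the blocks whose range overlaps the search condition (module 3)."""
    def generate(self, index, low, high):
        return [i for i, (lo, hi) in enumerate(index) if lo <= high and hi >= low]

rows = [{"k": v} for v in (9, 1, 5, 7, 3, 2)]
blocks = DataBlockingModule("k", 2).split(rows)
index = AcquisitionModule("k").build_index(blocks)
task_blocks = GenerationModule().generate(index, low=4, high=6)
```

Keeping the modules separate mirrors Fig. 2: blocking and indexing happen once at import time, while generation runs for every incoming query.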

Preferably, the system further comprises a receiving module and a judging module;

the receiving module is configured to receive a query task sent by a client;

the acquiring unit is configured to obtain the value range of the key column in each data block;

the extraction unit is configured to extract the search condition of the current query task;

From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented in hardware, or in software together with a necessary general-purpose hardware platform. On this understanding, the technical solution of the present invention can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive or a portable hard drive) and includes a number of instructions that cause a computer device (such as a personal computer, a server or a network device) to perform the methods described in the embodiments of the present invention.

Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the modules or flows in the drawings are not necessarily required to implement the present invention.

The foregoing describes only embodiments of the present invention and does not thereby limit the scope of its claims. Any equivalent structure or equivalent flow transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A method for generating effective query tasks, characterized by comprising the following steps:

obtaining the value range of the key column in each data block to create a data block index;

2. The method according to claim 1, characterized in that the method further comprises:

receiving a query task sent by a client;

3. The method according to claim 2, characterized in that, when the query task is not a query task that queries by or includes the key column, query tasks are executed on all data blocks.

4. The method according to claim 1, characterized in that obtaining the value range of the key column in each data block to create a data block index specifically comprises:

obtaining the value range of the key column in each data block;

5. The method according to claim 1, characterized in that generating effective query tasks, according to the data block index, for the data blocks containing result information specifically comprises:

extracting the search condition of the current query task;

6. The method according to any one of claims 1-5, characterized in that the method further comprises: setting a key column for the structured table data.

7. A system for generating effective query tasks, characterized by comprising a data blocking module, an acquisition module and a generation module;

8. The system according to claim 7, characterized in that the system further comprises a receiving module and a judging module;

the receiving module is configured to receive a query task sent by a client;

9. The system according to claim 7, characterized in that the acquisition module comprises an acquiring unit and a recording unit;

the acquiring unit is configured to obtain the value range of the key column in each data block;

10. The system according to claim 7, characterized in that the generation module comprises an extraction unit, a reading unit, a search unit and a generation unit;

the extraction unit is configured to extract the search condition of the current query task;