CN104021169A - Hive join query method based on the SDD-1 algorithm

Info

Publication number: CN104021169A
Application number: CN201410237997.2A
Authority: CN
Inventors: 周莲英; 吴淑跃; 郭远; 郑吉�; 喻志浩
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2014-05-30
Filing date: 2014-05-30
Publication date: 2014-09-03
Anticipated expiration: 2034-05-30
Also published as: CN104021169B

Abstract

The invention discloses a Hive join query method based on the SDD-1 algorithm. The method combines a data preprocessing technique with a double semi-join technique. In the preprocessing stage, projection and other unary operations reduce the data before it is transmitted, and the data are pre-sorted on each node. The double semi-join technique reduces the data by row and by column at the same time. Experimental results indicate that the double semi-join technique greatly reduces the volume of data transmitted between nodes and therefore the consumption of bandwidth resources. In addition, the merge-sort preprocessing improves response speed once the number of tuples reaches a certain scale.

Description

A Hive join query method based on the SDD-1 algorithm

Technical field

The invention belongs to the field of computer information technology applications and specifically relates to a Hive join query method based on the SDD-1 algorithm.

Background technology

The SDD-1 algorithm is a query method widely used in traditional distributed relational databases. Hive is a data warehouse framework built on the Hadoop file system that provides SQL query functionality similar to that of a traditional relational database. When executing a join query, existing Hive uses a sort-merge algorithm whose execution is divided into a Map (data mapping) stage and a Reduce (data processing) stage: the Map stage sorts the tables being joined by the join attribute, and the Reduce stage merge-joins the sorted segments produced by the Map stages and outputs the query result.

This algorithm has two problems: (1) the large volume of intermediate results produced in the Map stage must be transferred over the network to the Reduce side, which consumes a great deal of bandwidth; (2) the Reduce side must perform repeated merge-sort operations, so the execution time is long. Both problems must be solved in practice in order to make better use of Hive for join queries over massive data.
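By way of illustration only, the baseline sort-merge join described above can be sketched in a few lines of Python. This is a single-process toy and not Hive's implementation: the function names `map_phase` and `reduce_phase`, the table tags "R" and "S", and the tuple layout are assumptions made for the example. Note that every mapped record reaches the Reduce side, which is the source of the bandwidth problem noted in (1).

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_phase(table_name, rows, key_index):
    """Tag each row with its join key and source table, then sort by key.

    Hypothetical helper: returns (join_key, table_name, row) records.
    """
    tagged = [(row[key_index], table_name, row) for row in rows]
    return sorted(tagged, key=itemgetter(0, 1))


def reduce_phase(*sorted_segments):
    """Merge the sorted segments and join rows that share a join key."""
    merged = heapq.merge(*sorted_segments, key=itemgetter(0, 1))
    result = []
    for _, group in groupby(merged, key=itemgetter(0)):
        group = list(group)
        left = [row for _, name, row in group if name == "R"]
        right = [row for _, name, row in group if name == "S"]
        # Emit one joined row per matching (R row, S row) pair.
        result.extend(l + r for l in left for r in right)
    return result
```

A call such as `reduce_phase(map_phase("R", r_rows, 0), map_phase("S", s_rows, 0))` joins two lists of tuples on their first field.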

Summary of the invention

The object of the invention is to solve the long execution time and high bandwidth consumption of Hive's original join query algorithm by providing a Hive join query method based on the SDD-1 algorithm, so that the Hive system responds quickly and consumes little bandwidth.

To achieve this object, the technical scheme of the invention is as follows:

A Hive join query method based on the SDD-1 algorithm comprises the following steps:

1) perform projection and similar operations on each distributed node, form an execution strategy set from all executable unary operations and local operations, and reduce the raw data;

2) apply merge-sort preprocessing to the above execution strategy set, sorting on each attribute so that each attribute forms an ordered intermediate data sequence;

3) perform Map processing on the intermediate data sequence in Hadoop;

4) send the results produced in the Map stage to the Reduce side using the row- and column-based double semi-join technique;

5) process the more orderly data from the Map side at the Reduce side;

6) return the result of the query processing to the client.

Further, the merge-sort preprocessing of the execution strategy set in step 2) comprises two stages: the first stage sorts the database relation in segments; the second stage merges the sub-tables of the database relation.

Further, the merge-sort preprocessing requires repeated reading, merging, and sorting of local data.

Further, the specific steps for transmitting data in step 4) using the row- and column-based double semi-join technique are as follows:

A) determine the attribute rows and columns involved in the join query;

B) taking the query application into account, use projection to remove the row attributes and column attributes that are irrelevant to the join query;

C) construct multiple double semi-joins, compute the transmission cost of each, and build a double semi-join set;

D) select from the double semi-join set the double semi-join with the minimum transmission cost, and use it to transmit the data produced in the Map stage to the Reduce side.
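Steps A) to D) can be illustrated with the following minimal Python sketch. The source does not state how the transmission cost is computed, so the cost model here (distinct join-key values shipped one way plus field values shipped back) is purely an assumption, as are all function names and the dictionary-based row layout.

```python
def project(rows, columns):
    """Column reduction: keep only the listed columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]


def double_semi_join(source, target, key, needed_columns):
    """Reduce `target` by rows (semi-join on `key`) and by columns (projection)."""
    source_keys = {row[key] for row in source}
    kept_rows = [row for row in target if row[key] in source_keys]
    return project(kept_rows, needed_columns)


def transfer_cost(source, reduced, key):
    """ASSUMED cost model: key values sent to the target site plus the
    field values of the reduced relation sent back."""
    return len({row[key] for row in source}) + sum(len(row) for row in reduced)


def build_candidates(relations, key, needed):
    """Step C: build the set of candidate double semi-joins with their costs.

    relations: dict mapping a relation name to its rows
    needed:    dict mapping a relation name to the columns the query uses
    """
    candidates = []
    for src in relations:
        for dst in relations:
            if src == dst:
                continue
            reduced = double_semi_join(relations[src], relations[dst], key, needed[dst])
            cost = transfer_cost(relations[src], reduced, key)
            candidates.append({"source": src, "target": dst, "reduced": reduced, "cost": cost})
    return candidates


def select_cheapest(candidates):
    """Step D: pick the candidate with the minimum transmission cost."""
    return min(candidates, key=lambda c: c["cost"])
```

With two relations this yields two candidates (R reducing S, and S reducing R); a real optimizer would weigh byte sizes and network characteristics rather than simple counts.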

Further, the method also comprises a test data verification step: suitable test data are chosen, and the method is compared with Hive's original join query algorithm in terms of CPU cost and response time.

The beneficial effects of the invention are as follows. The invention takes full advantage of the fact that Hive must perform data transmission and merge-sort operations when executing a join query, and adopts the double semi-join technique together with a merge-sort preprocessing technique to speed up query processing. Experimental results show that the double semi-join technique greatly reduces the volume of data transmitted between nodes and thus the consumption of bandwidth resources, while the merge-sort preprocessing speeds up the response once the number of tuples reaches a certain scale.

Brief description of the drawings

Fig. 1 is a flow chart of the execution steps of the invention;

Fig. 2 is a schematic diagram of the CPU cost of Hive's original join query algorithm;

Fig. 3 is a schematic diagram of the CPU cost of the Hive join query method based on the SDD-1 algorithm;

Fig. 4 compares the response time of the invention with that of Hive's original join query algorithm under different data volumes.

Embodiment

The specific embodiments of the invention are further described below with reference to the drawings and specific examples.

As shown in Fig. 1, the invention proposes an improved SDD-1 algorithm based on data preprocessing and double semi-joins. Data preprocessing means that, before data transmission, unary operations such as projection are used to reduce the data, and the data on each node are pre-sorted. A double semi-join reduces the data not only by row but also by column. The scheme comprises the following steps:

Step 1, perform projection and similar operations on each distributed node, form an execution strategy set from all executable unary operations and local operations, and reduce the raw data;

Step 2, apply merge-sort preprocessing to the above execution strategy set, sorting on each attribute so that each attribute forms an ordered intermediate data sequence;

Step 3, perform Map processing on the intermediate data sequence in Hadoop;

Step 4, send the results produced in the Map stage to the Reduce side using the row- and column-based double semi-join technique;

Step 5, process the more orderly data from the Map side at the Reduce side;

Step 6, return the result of the query processing to the client.
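The six steps can be strung together in a small single-process Python sketch, offered only as an illustration under stated assumptions: there is no Hadoop involved, both relations share one join key, rows are dictionaries, and all function names are hypothetical. The semi-join filter here uses a simple key-set exchange and does not include the cost-based selection among candidates.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def preprocess(rows, key, columns):
    """Steps 1-2: project to the needed columns, then pre-sort on the join key."""
    projected = [{c: row[c] for c in columns} for row in rows]
    return sorted(projected, key=itemgetter(key))


def map_stage(name, rows, key):
    """Step 3: emit (join_key, source_table, row) records, preserving order."""
    return [(row[key], name, row) for row in rows]


def semi_join_filter(records, other_keys):
    """Step 4: ship only the records whose key also occurs on the other side."""
    return [rec for rec in records if rec[0] in other_keys]


def reduce_stage(left, right):
    """Step 5: merge two already sorted record streams and join equal keys."""
    joined = []
    stream = merge(left, right, key=itemgetter(0, 1))
    for _, group in groupby(stream, key=itemgetter(0)):
        group = list(group)
        r_rows = [row for _, name, row in group if name == "R"]
        s_rows = [row for _, name, row in group if name == "S"]
        joined.extend({**r, **s} for r in r_rows for s in s_rows)
    return joined


def join_query(r_rows, s_rows, key, r_cols, s_cols):
    """Steps 1-6 chained; the return value stands in for the reply to the client."""
    r = map_stage("R", preprocess(r_rows, key, r_cols), key)
    s = map_stage("S", preprocess(s_rows, key, s_cols), key)
    r_keys = {k for k, _, _ in r}
    s_keys = {k for k, _, _ in s}
    return reduce_stage(semi_join_filter(r, s_keys), semi_join_filter(s, r_keys))
```

Because both inputs arrive at `reduce_stage` already sorted and already filtered, the Reduce side only performs a linear merge over the records that can actually match.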

In step 2 above, the merge-sort preprocessing mainly comprises two stages:

The first stage sorts the database relation in segments. The relation R to be sorted is first divided into sub-tables of M blocks each, where M is the number of memory blocks available for sorting. Each sub-table is then loaded into memory and sorted with a main-memory sorting algorithm such as quicksort, which yields an internally sorted sub-table;

The second stage merges the sub-tables of the database relation. One block is read in order from each sorted sub-table into memory, and a unified merge is performed on the records in these blocks: each time, the minimum (or maximum) record is selected and placed into relation R', and the corresponding record is removed from its sub-table. When the in-memory block of a sub-table is exhausted, the next block is read in order from that sub-table and the merge continues.
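The two stages correspond to a classic external merge sort, which can be sketched as follows. This is a simplified illustration and not the patented implementation: records are plain integers, the run size is counted in records rather than blocks, Python's built-in `sorted` stands in for quicksort, and buffered file reads stand in for block-at-a-time reads. All function names are hypothetical.

```python
import heapq
import os
import tempfile


def sort_runs(records, run_size, directory):
    """Stage 1: cut the input into runs of `run_size` records, sort each
    run in memory, and spill it to its own file (the sorted sub-table)."""
    paths = []
    for start in range(0, len(records), run_size):
        run = sorted(records[start:start + run_size])
        path = os.path.join(directory, f"run_{len(paths)}.txt")
        with open(path, "w") as handle:
            handle.writelines(f"{value}\n" for value in run)
        paths.append(path)
    return paths


def read_run(path):
    """Stream one sorted run back from disk, one record at a time."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def merge_runs(paths):
    """Stage 2: repeatedly take the smallest head record among all runs."""
    return list(heapq.merge(*(read_run(path) for path in paths)))


def external_sort(records, run_size):
    """Run both stages, using a temporary directory for the spilled runs."""
    with tempfile.TemporaryDirectory() as directory:
        return merge_runs(sort_runs(records, run_size, directory))
```

`heapq.merge` keeps only the current head of each run in memory, which mirrors the requirement that only one block per sub-table be resident during the merge.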

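In step 4, the detailed steps for transmitting data using the row- and column-based double semi-join technique are as follows:

A) determine the attribute rows and columns involved in the join query;

B) taking the query application into account, use projection to remove the row attributes and column attributes that are irrelevant to the join query;

C) construct multiple double semi-joins, compute the transmission cost of each, and build a double semi-join set;

D) select from the double semi-join set the double semi-join with the minimum transmission cost, and use it to transmit the data produced in the Map stage to the Reduce side.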

As shown in Figs. 2 and 3, a database of 3000 tuples was tested on a Linux system (using the command vmstat 3, which samples CPU usage every three seconds). The results show that, during data transmission, the improved Hive join query method based on the SDD-1 algorithm trades local processing time for transmission time between sites. Compared with Hive's original join query algorithm, its cs, us, and sy values are larger, which reflects that the improved algorithm requires higher CPU usage and more I/O reads: during data preprocessing it must repeatedly read, merge, and sort local data, which consumes considerable system resources.
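For readers who wish to repeat this kind of measurement, the following Python sketch averages the cs, us, and sy columns of captured `vmstat` output. It assumes the common Linux procps layout, in which a header line names the columns, and the function name `average_vmstat` is hypothetical. Any numbers fed to it in an example are made-up sample values and are not the measurements behind Figs. 2 and 3.

```python
def average_vmstat(text, fields=("cs", "us", "sy")):
    """Average selected columns of captured `vmstat` output.

    Assumes a header line that names the columns (including "cs" and "us")
    followed by whitespace-separated numeric sample lines.
    """
    lines = [line.split() for line in text.strip().splitlines()]
    # Locate the header row by looking for known column names.
    header_index = next(i for i, cols in enumerate(lines) if "cs" in cols and "us" in cols)
    header = lines[header_index]
    # Keep only numeric sample rows; this also skips repeated header lines.
    samples = [
        cols for cols in lines[header_index + 1:]
        if len(cols) == len(header) and cols[0].isdigit()
    ]
    if not samples:
        raise ValueError("no vmstat samples found")
    return {
        field: sum(int(cols[header.index(field)]) for cols in samples) / len(samples)
        for field in fields
    }
```

The text passed in can be captured by redirecting the output of `vmstat 3` to a file while the query runs; comparing the averages for the two algorithms gives the kind of contrast the figures describe.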

As shown in Fig. 4, databases of 1000 to 8000 tuples were tested on the Hadoop cluster that was set up. The tests show that when the number of tuples is small, the time each node spends on merge sort in the Hive join query method based on the SDD-1 algorithm exceeds the time saved by reducing the data and speeding up the Reduce side, so the query response time is actually longer than that of Hive's original join query algorithm. As the number of tuples increases, however, the advantage of the improved algorithm begins to show. Quantitatively, when the total number of tuples reaches 8000, the response time of the improved algorithm becomes shorter than that of Hive's original join query algorithm, and the improvement in response time grows as the number of tuples increases further.

Hive's original join query algorithm performs the Reduce operation directly once the Map operation on all data has completed on the nodes of the cluster, and then returns the query result to the user. The invention improves on this: it greatly reduces the volume of data transmitted between nodes and thus the bandwidth used, and it introduces a merge-sort preprocessing operation so that the Reduce operation receives more orderly attribute columns, which shortens the merge sort at the Reduce side and improves query efficiency.

It should be understood that the above embodiments are intended only to illustrate the invention and not to limit its scope. After reading the invention, modifications by those skilled in the art to various equivalent forms of the invention all fall within the scope defined by the claims of this application.

Claims

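1. A Hive join query method based on the SDD-1 algorithm, characterized in that it comprises the following steps:

1) perform projection and similar operations on each distributed node, form an execution strategy set from all executable unary operations and local operations, and reduce the raw data;

2) apply merge-sort preprocessing to the above execution strategy set, sorting on each attribute so that each attribute forms an ordered intermediate data sequence;

3) perform Map processing on the intermediate data sequence in Hadoop;

4) send the results produced in the Map stage to the Reduce side using the row- and column-based double semi-join technique;

5) process the more orderly data from the Map side at the Reduce side;

6) return the result of the query processing to the client.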

2. The Hive join query method based on the SDD-1 algorithm according to claim 1, characterized in that the merge-sort preprocessing of the execution strategy set in step 2) comprises two stages: the first stage sorts the database relation in segments; the second stage merges the sub-tables of the database relation.

3. The Hive join query method based on the SDD-1 algorithm according to claim 1 or 2, characterized in that the merge-sort preprocessing requires repeated reading, merging, and sorting of local data.

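4. The Hive join query method based on the SDD-1 algorithm according to claim 1, characterized in that the specific steps for transmitting data in step 4) using the row- and column-based double semi-join technique are as follows:

A) determine the attribute rows and columns involved in the join query;

B) taking the query application into account, use projection to remove the row attributes and column attributes that are irrelevant to the join query;

C) construct multiple double semi-joins, compute the transmission cost of each, and build a double semi-join set;

D) select from the double semi-join set the double semi-join with the minimum transmission cost, and use it to transmit the data produced in the Map stage to the Reduce side.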

5. The Hive join query method based on the SDD-1 algorithm according to claim 1, characterized in that it further comprises a test data verification step: suitable test data are chosen, and the method is compared with Hive's original join query algorithm in terms of CPU cost and response time respectively.