CN104021169A - Hive connection inquiry method based on SDD-1 algorithm - Google Patents

Hive connection inquiry method based on SDD-1 algorithm

Info

Publication number
CN104021169A
Authority
CN
China
Prior art keywords
data
hive
sdd
algorithm
carried out
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410237997.2A
Other languages
Chinese (zh)
Other versions
CN104021169B (en)
Inventor
周莲英
吴淑跃
郭远
郑吉�
喻志浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201410237997.2A priority Critical patent/CN104021169B/en
Publication of CN104021169A publication Critical patent/CN104021169A/en
Application granted granted Critical
Publication of CN104021169B publication Critical patent/CN104021169B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations

Abstract

The invention discloses a Hive join query method based on the SDD-1 algorithm. The method combines a data preprocessing technique with a double semi-join technique. In the data preprocessing stage, the data are reduced by projection and other unary operations before transmission, and the data are also presorted on each node. The double semi-join technique reduces the data by rows and, at the same time, by columns. The results indicate that the double semi-join technique greatly reduces the volume of data transmitted between nodes and therefore the consumption of bandwidth resources. In addition, the merge-sort preprocessing of the data increases response speed once the number of tuples reaches a certain value.

Description

A Hive join query method based on the SDD-1 algorithm
Technical field
The invention belongs to the field of computer information technology applications and specifically relates to a Hive join query method based on the SDD-1 algorithm.
Background art
The SDD-1 algorithm is a query processing method widely used in traditional distributed relational databases. Hive is a data warehouse framework built on the Hadoop file system that provides SQL query functionality similar to that of a traditional relational database. For join queries, existing Hive uses a sort-merge algorithm whose execution is divided into a Map (data mapping) stage and a Reduce (data processing) stage: the Map stage sorts the database tables being joined on the join attribute, and the Reduce stage merge-joins the sorted segments produced by each Map task and outputs the query result.
This algorithm has two problems: (1) the large volume of intermediate results produced in the Map stage must be transmitted over the network to the Reduce side, which consumes a large amount of bandwidth; (2) the Reduce side must perform multiple merge-sort operations, so the execution time is long. To make better use of Hive for join queries over massive data, both problems need to be solved in practice.
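The baseline described above can be illustrated with a minimal single-process sketch of a reduce-side sort-merge join. It is not Hive's actual implementation; the function names `map_phase` and `reduce_phase` and the tuple layout are assumptions made for illustration. It shows why every tagged row has to reach the Reduce side before any join output is produced.

```python
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(table_name, rows, key_index):
    """Tag each row with its source table and sort the output by join key."""
    tagged = [(row[key_index], table_name, row) for row in rows]
    return sorted(tagged, key=itemgetter(0))


def reduce_phase(sorted_runs, left_name, right_name):
    """Merge the sorted map outputs and join rows that share a key."""
    # Every tagged row is shipped here, which is the bandwidth cost
    # the background section points out.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    joined = []
    for key, group in groupby(merged, key=itemgetter(0)):
        by_table = defaultdict(list)
        for _, table_name, row in group:
            by_table[table_name].append(row)
        for left in by_table[left_name]:
            for right in by_table[right_name]:
                joined.append((key, left, right))
    return joined
```

For example, joining `orders = [(1, "book"), (2, "pen"), (1, "ink")]` with `customers = [(1, "Ann"), (3, "Bob")]` on the first field ships all five rows to the reducer, although only three of them contribute to the result.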
Summary of the invention
The object of the invention is to address the long execution time and high bandwidth consumption of Hive's original join query algorithm by providing a Hive join query method based on the SDD-1 algorithm, so that the Hive system responds faster and consumes less bandwidth.
To achieve this object, the technical solution of the invention is as follows:
A Hive join query method based on the SDD-1 algorithm, comprising the following steps:
1) perform projection and similar operations on each distributed node, assemble all executable unary operations and partial operations into an execution strategy set, and thereby reduce the raw data;
2) apply merge-sort preprocessing to the above execution strategy set, sorting on each attribute so that each attribute forms an ordered intermediate data sequence;
3) apply Map processing to the intermediate data sequence in Hadoop;
4) send the results produced by the Map stage to the Reduce side using the row- and column-based double semi-join technique;
5) process the more ordered data from the Map side at the Reduce side;
6) return the query result to the client.
Further, the merge-sort preprocessing of the execution strategy set in step 2) comprises two phases: the first phase sorts the database relation in segments; the second phase merges the sorted sublists of the database relation.
Further, the merge-sort preprocessing requires repeated reading, merging, and sorting of local data.
Further, the specific steps for transmitting data with the row- and column-based double semi-join technique in step 4) are as follows:
A) determine the attribute rows and columns involved in the join query;
B) taking the query application into account, use projection operations to remove the row attributes and column attributes that are irrelevant to the join query;
C) construct multiple double semi-joins, calculate the transmission cost of each, and build a double semi-join set;
D) from the double semi-join set, select the candidate with the minimum transmission cost and use it to transmit the data produced by the Map stage to the Reduce side.
Further, the method also comprises a test data verification step, in which suitable test data are chosen and the method is compared with Hive's original join query algorithm in terms of CPU cost and response time.
The beneficial effects of the invention are as follows: the invention exploits the fact that a Hive join query must perform both data transmission and merge-sort operations, and adopts the double semi-join technique together with merge-sort preprocessing of the data to speed up query processing. Experimental results show that the double semi-join technique greatly reduces the volume of data transmitted between nodes and thus the consumption of bandwidth resources, while the merge-sort preprocessing speeds up the response once the number of tuples reaches a certain scale.
Brief description of the drawings
Fig. 1 is a flowchart of the execution steps of the invention;
Fig. 2 is a schematic diagram of the CPU cost of Hive's original join query algorithm;
Fig. 3 is a schematic diagram of the CPU cost of the Hive join query method based on the SDD-1 algorithm;
Fig. 4 compares the response time of the invention with that of Hive's original join query algorithm under different data volumes.
Detailed description of the embodiments
Specific embodiments of the invention are further described below with reference to the drawings.
As shown in Fig. 1, the invention proposes an improved SDD-1 algorithm based on data preprocessing and double semi-joins. Data preprocessing means reducing the data with unary operations such as projection before the data are transmitted, and also presorting the data on each node. A double semi-join reduces the data not only by rows but also by columns. The scheme comprises the following steps:
Step 1: perform projection and similar operations on each distributed node, assemble all executable unary operations and partial operations into an execution strategy set, and thereby reduce the raw data;
Step 2: apply merge-sort preprocessing to the above execution strategy set, sorting on each attribute so that each attribute forms an ordered intermediate data sequence;
Step 3: apply Map processing to the intermediate data sequence in Hadoop;
Step 4: send the results produced by the Map stage to the Reduce side using the row- and column-based double semi-join technique;
Step 5: process the more ordered data from the Map side at the Reduce side;
Step 6: return the query result to the client.
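Steps 1 and 2 can be pictured with a small sketch of what a single node might do before the Map stage. The patent does not specify an interface, so the function names `simplify_locally` and `presort` and the representation of rows as dictionaries are assumptions made for illustration only.

```python
def simplify_locally(rows, keep_columns, predicate=lambda row: True):
    """Step 1 (sketch): apply selection and projection on the node holding the data."""
    return [
        {col: row[col] for col in keep_columns}
        for row in rows
        if predicate(row)
    ]


def presort(rows, join_key):
    """Step 2 (sketch): order the reduced rows by the join attribute."""
    return sorted(rows, key=lambda row: row[join_key])
```

Running the unary operations first means the sort in step 2 and the later transfer only touch the rows and columns the query actually needs.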
In step 2 above, the merge-sort preprocessing comprises two main phases:
The first phase sorts the database relation in segments. The relation R to be sorted is first divided into sublists of M blocks each, where M is the number of memory blocks available for sorting. Each sublist is then loaded into memory and sorted with a main-memory sorting algorithm such as quicksort, which yields an internally sorted sublist.
The second phase merges the sublists of the database relation. One block is read, in order, from each sorted sublist into memory, and the records in these blocks are merged together in memory: each time, the smallest (or largest) record is selected and placed in the output relation R′, and the corresponding record is deleted from its sublist. When the in-memory block of a sublist becomes empty, the next block is read in order from that sublist and the merge continues.
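The two phases described above correspond to a classic external merge sort. The following is a minimal sketch under simplifying assumptions: `run_size` (a hypothetical parameter) counts records rather than blocks, the input relation is passed as an in-memory list for brevity, and sorted runs are spilled to temporary files with `pickle`. None of these choices come from the patent.

```python
import heapq
import pickle
import tempfile


def sort_runs(records, run_size):
    """Phase 1: cut the relation into runs that fit in memory, sort each, spill it."""
    run_files = []
    for start in range(0, len(records), run_size):
        run = sorted(records[start:start + run_size])
        spill = tempfile.TemporaryFile()
        for record in run:
            pickle.dump(record, spill)
        spill.seek(0)
        run_files.append(spill)
    return run_files


def read_run(spill):
    """Stream one sorted run back from disk, one record at a time."""
    while True:
        try:
            yield pickle.load(spill)
        except EOFError:
            return


def merge_runs(run_files):
    """Phase 2: repeatedly emit the smallest head record across all runs."""
    try:
        yield from heapq.merge(*(read_run(spill) for spill in run_files))
    finally:
        for spill in run_files:
            spill.close()
```

A call such as `list(merge_runs(sort_runs(records, 3)))` returns the records in ascending order while holding only one record per run in memory during the merge.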
In step 4, the detailed steps for transmitting data with the row- and column-based double semi-join technique are as follows:
A) determine the attribute rows and columns involved in the join query;
B) taking the query application into account, use projection operations to remove the row attributes and column attributes that are irrelevant to the join query;
C) construct multiple double semi-joins, calculate the transmission cost of each, and build a double semi-join set;
D) from the double semi-join set, select the candidate with the minimum transmission cost and use it to transmit the data produced by the Map stage to the Reduce side.
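Steps A) to D) can be sketched for the two-relation case as follows. The patent does not state its cost formula, so the cost model here (number of join keys shipped plus number of cells shipped back) is an assumption, as are the function names and the representation of rows as dictionaries.

```python
def project(rows, columns):
    """Keep only the listed columns of each row (column reduction)."""
    return [{col: row[col] for col in columns} for row in rows]


def semi_join(rows, key, other_keys):
    """Keep only rows whose join key also occurs on the other side (row reduction)."""
    return [row for row in rows if row[key] in other_keys]


def double_semi_join(target, source, key, needed_columns):
    """Reduce `target` by rows and by columns using the join keys of `source`."""
    source_keys = {row[key] for row in source}
    reduced = project(semi_join(target, key, source_keys), needed_columns)
    # Assumed cost model: keys sent to the target site plus cells sent back.
    cost = len(source_keys) + sum(len(row) for row in reduced)
    return reduced, cost


def cheapest_plan(left, right, key, left_columns, right_columns):
    """Build the candidate set and return the candidate with the lowest cost."""
    candidates = {
        "reduce_left": double_semi_join(left, right, key, left_columns),
        "reduce_right": double_semi_join(right, left, key, right_columns),
    }
    name = min(candidates, key=lambda n: candidates[n][1])
    return name, candidates[name]
```

With more than two relations the candidate set grows accordingly; the sketch only shows the selection rule of step D), which is to pick the candidate whose estimated transfer is smallest.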
As shown in Figs. 2 and 3, a database of 3000 tuples was tested on a Linux system (using the command vmstat 3, which samples CPU usage every three seconds). The results show that, during data transmission, the improved Hive join query method based on the SDD-1 algorithm trades local processing time for transmission time between sites. Compared with Hive's original join query algorithm, the values of the cs, us, and sy parameters are larger, which indicates that the improved algorithm requires higher CPU usage and more I/O reads. This is because the improved algorithm must repeatedly read, merge, and sort local data during data preprocessing, which consumes a large amount of system resources.
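The cs (context switches), us (user CPU time), and sy (system CPU time) columns mentioned above can be pulled out of captured `vmstat` output with a short script. This is a minimal sketch: the parser name is hypothetical, and the numbers in `SAMPLE` are made up for illustration and are not measurements from the patent.

```python
# Made-up sample in the usual Linux `vmstat` layout; not data from the patent.
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  20480 401200    0    0    12     8  150  320 12  4 83  1  0
 2  0      0 810112  20480 401512    0    0     0    40  410  980 35  9 55  1  0
"""


def parse_vmstat(text, fields=("cs", "us", "sy")):
    """Extract the selected columns from captured vmstat output."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    # The column-name row is the one that contains the literal name "cs".
    header = next(cols for cols in lines if "cs" in cols)
    positions = {name: header.index(name) for name in fields}
    samples = []
    for cols in lines:
        if cols == header or not cols[0].isdigit():
            continue  # skip the banner row and the column-name row
        samples.append({name: int(cols[i]) for name, i in positions.items()})
    return samples
```

Looking the columns up by name rather than by fixed position keeps the sketch tolerant of vmstat versions that add or omit trailing columns.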
As shown in Fig. 4, databases with between 1000 and 8000 tuples were tested on a purpose-built Hadoop cluster. The tests show that when the number of tuples is small, the time each node spends on merge sorting in the Hive join query method based on the SDD-1 algorithm exceeds the time saved by reducing the data and speeding up processing on the Reduce side, so the query response time is actually longer than with Hive's original join query algorithm. As the number of tuples increases, however, the advantage of the improved algorithm begins to show. In quantitative terms, once the total number of tuples reaches 8000, the response time of the improved algorithm begins to fall below that of Hive's original join query algorithm, and the improvement in response time grows as the number of tuples increases further.
Hive's original join query algorithm performs the Reduce operation directly after the Map operation on all data has completed on the cluster nodes, and then returns the query result to the user. The invention improves on this by greatly reducing the volume of data transmitted between nodes and thus the use of bandwidth. It also introduces a merge-sort preprocessing step so that the Reduce operation receives more ordered attribute columns, which shortens the merge sort on the Reduce side and improves query efficiency.
It should be understood that the above embodiments are intended only to illustrate the invention and not to limit its scope. After reading the invention, modifications made by those skilled in the art in various equivalent forms all fall within the scope defined by the claims appended to this application.

Claims (5)

1. A Hive join query method based on the SDD-1 algorithm, characterized by comprising the following steps:
1) perform projection and similar operations on each distributed node, assemble all executable unary operations and partial operations into an execution strategy set, and thereby reduce the raw data;
2) apply merge-sort preprocessing to the above execution strategy set, sorting on each attribute so that each attribute forms an ordered intermediate data sequence;
3) apply Map processing to the intermediate data sequence in Hadoop;
4) send the results produced by the Map stage to the Reduce side using the row- and column-based double semi-join technique;
5) process the more ordered data from the Map side at the Reduce side;
6) return the query result to the client.
2. The Hive join query method based on the SDD-1 algorithm according to claim 1, characterized in that the merge-sort preprocessing of the execution strategy set in step 2) comprises two phases: the first phase sorts the database relation in segments; the second phase merges the sublists of the database relation.
3. The Hive join query method based on the SDD-1 algorithm according to claim 1 or 2, characterized in that the merge-sort preprocessing requires repeated reading, merging, and sorting of local data.
4. The Hive join query method based on the SDD-1 algorithm according to claim 1, characterized in that the specific steps for transmitting data with the row- and column-based double semi-join technique in step 4) are as follows:
A) determine the attribute rows and columns involved in the join query;
B) taking the query application into account, use projection operations to remove the row attributes and column attributes that are irrelevant to the join query;
C) construct multiple double semi-joins, calculate the transmission cost of each, and build a double semi-join set;
D) from the double semi-join set, select the candidate with the minimum transmission cost and use it to transmit the data produced by the Map stage to the Reduce side.
5. The Hive join query method based on the SDD-1 algorithm according to claim 1, characterized by further comprising a test data verification step, in which suitable test data are chosen and the method is compared with Hive's original join query algorithm in terms of CPU cost and response time respectively.
CN201410237997.2A 2014-05-30 2014-05-30 Hive join query method based on the SDD-1 algorithm Expired - Fee Related CN104021169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410237997.2A CN104021169B (en) 2014-05-30 2014-05-30 Hive join query method based on the SDD-1 algorithm

Publications (2)

Publication Number Publication Date
CN104021169A true CN104021169A (en) 2014-09-03
CN104021169B CN104021169B (en) 2018-01-16

Family

ID=51437923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410237997.2A Expired - Fee Related CN104021169B (en) 2014-05-30 2014-05-30 Hive join query method based on the SDD-1 algorithm

Country Status (1)

Country Link
CN (1) CN104021169B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1464451A (en) * 2002-06-26 2003-12-31 联想(北京)有限公司 A sorting method of data record
US20100162230A1 (en) * 2008-12-24 2010-06-24 Yahoo! Inc. Distributed computing system for large-scale data handling
CN102110158A (en) * 2011-02-24 2011-06-29 上海大学 Multi-join query optimization method for database based on improved SDD-1 (System for Distributed Database) algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
叶文宸: ""基于Hive的性能优化方法的研究与实践"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
赵彦荣等: ""基于Hadoop的高效连接查询处理算法CHMJ"", 《软件学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802787A (en) * 2017-01-15 2017-06-06 天泽信息产业股份有限公司 MapReduce optimization methods based on GPU sequences
CN106802787B (en) * 2017-01-15 2019-08-02 天泽信息产业股份有限公司 MapReduce optimization method based on GPU sequence
CN107463702A (en) * 2017-08-16 2017-12-12 中科院成都信息技术股份有限公司 A kind of database multi-join query optimization method based on evolution algorithm

Also Published As

Publication number Publication date
CN104021169B (en) 2018-01-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180116

Termination date: 20180530