CN104021169A - Hive connection inquiry method based on SDD-1 algorithm - Google Patents
Hive join query method based on the SDD-1 algorithm
- Publication number
- CN104021169A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
Abstract
The invention discloses a Hive join query method based on the SDD-1 algorithm. The method combines data preprocessing with a double semi-join technique. In the preprocessing stage, projection and other unary operations reduce the data before it is transmitted, and the data are pre-sorted on each node. The double semi-join reduces the data along both rows and columns. Experimental results indicate that the double semi-join greatly reduces the volume of data transmitted between nodes and therefore the bandwidth consumed. In addition, the merge-sort preprocessing shortens response time once the number of tuples reaches a certain scale.
Description
Technical field
The invention belongs to the field of computer information technology applications and specifically relates to a Hive join query method based on the SDD-1 algorithm.
Background
The SDD-1 algorithm is a query method widely used in traditional distributed relational databases. Hive is a data warehouse framework built on the Hadoop file system that provides SQL query functionality similar to that of a traditional relational database. For join queries, existing Hive uses a sort-merge algorithm that executes in a Map (data mapping) stage and a Reduce (data processing) stage: the Map stage sorts the tables being joined on the join attribute, and the Reduce stage merge-joins the sorted segments produced by each Map task and outputs the query result.
This algorithm has two problems: (1) the large amount of intermediate data produced by the Map stage must be sent to the Reduce side over the network, which consumes a great deal of bandwidth; (2) the Reduce side must perform multiple merge-sort operations, so execution time is long. Both problems must be solved in practice before Hive can be used effectively for join queries over massive data.
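The baseline behaviour described above can be illustrated with a small in-memory simulation. This is only a sketch of a generic reduce-side join, not Hive's actual implementation; the function names, the two sample tables, and the use of a row count as a stand-in for network traffic are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

def map_phase(table_name, rows, key_index):
    """Emit (join_key, (table_name, row)) pairs, as a Map task would."""
    for row in rows:
        yield row[key_index], (table_name, row)

def shuffle(pairs):
    """Group by key and sort the keys; every tagged row crosses the network here."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return sorted(groups.items())

def reduce_phase(grouped, left_name):
    """Join the left and right rows that share a key."""
    for _key, tagged_rows in grouped:
        left = [r for name, r in tagged_rows if name == left_name]
        right = [r for name, r in tagged_rows if name != left_name]
        for l, r in product(left, right):
            yield l + r

# Hypothetical sample tables; column 0 is the join attribute.
orders = [(1, "pen"), (2, "ink"), (4, "cap")]
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]

pairs = list(map_phase("orders", orders, 0)) + list(map_phase("customers", customers, 0))
result = list(reduce_phase(shuffle(pairs), "orders"))
shuffled_rows = len(pairs)  # all rows are shipped, including ones that never match
```

In this toy run all six rows are shuffled although only two of them on each side contribute to the result, which is the bandwidth problem named in (1).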
Summary of the invention
The object of the invention is to address the long execution time and high bandwidth consumption of Hive's original join query algorithm by providing a Hive join query method based on the SDD-1 algorithm, so that the Hive system responds quickly and consumes little bandwidth.
To achieve this object, the technical solution of the invention is as follows:
A Hive join query method based on the SDD-1 algorithm, comprising the following steps:
1) on each distributed node, perform projection and similar operations, and gather all executable unary operations and local operations into an execution strategy set, thereby reducing the raw data;
2) apply merge-sort preprocessing to the execution strategy set, sorting on each attribute so that each attribute forms an ordered intermediate data sequence;
3) run Map processing on the intermediate data sequence in Hadoop;
4) send the results produced by the Map stage to the Reduce side using the row- and column-based double semi-join technique;
5) at the Reduce side, process the already largely ordered data from the Map side;
6) return the query result to the client.
Further, the merge-sort preprocessing of the execution strategy set in step 2) comprises two stages: the first stage sorts the database relation in segments; the second stage merges the sub-tables of the database relation.
Further, the merge-sort preprocessing requires repeated reading, merging, and sorting of local data.
Further, the specific steps for transmitting data with the row- and column-based double semi-join technique in step 4) are as follows:
A) determine the attribute rows and columns involved in the join query;
B) based on the needs of the query, use projection operations to remove the row attributes and column attributes that are irrelevant to the join query;
C) construct multiple double semi-joins, calculate the transmission cost of each, and build a double semi-join set;
D) select from the double semi-join set the one with the minimum transmission cost, and use it to send the data produced by the Map stage to the Reduce side.
Further, the method also comprises a test-data verification step: suitable test data are selected, and the method is compared with Hive's original join query algorithm in terms of CPU cost and response time.
The beneficial effects of the invention are as follows. The invention exploits the fact that a Hive join query must both transmit data and perform merge-sort operations, and applies the double semi-join technique and merge-sort preprocessing of the data to speed up query processing. Experimental results show that the double semi-join technique greatly reduces the volume of data transmitted between nodes and therefore the consumption of bandwidth resources, and that the merge-sort preprocessing shortens response time once the number of tuples reaches a certain scale.
Brief description of the drawings
Fig. 1 is a flowchart of the execution steps of the invention;
Fig. 2 is a diagram of the CPU cost of Hive's original join query algorithm;
Fig. 3 is a diagram of the CPU cost of the Hive join query method based on the SDD-1 algorithm;
Fig. 4 compares the response time of the invention with that of Hive's original join query algorithm at different data volumes.
Embodiment
Specific embodiments of the invention are further described below with reference to the drawings.
As shown in Fig. 1, the invention proposes an improved SDD-1 algorithm based on data preprocessing and double semi-joins. Data preprocessing means reducing the data with unary operations such as projection before transmission, and pre-sorting the data on each node. A double semi-join reduces not only the row data but also the column data. The scheme comprises the following steps:
Step 1: on each distributed node, perform projection and similar operations, and gather all executable unary operations and local operations into an execution strategy set, thereby reducing the raw data;
Step 2: apply merge-sort preprocessing to the execution strategy set, sorting on each attribute so that each attribute forms an ordered intermediate data sequence;
Step 3: run Map processing on the intermediate data sequence in Hadoop;
Step 4: send the results produced by the Map stage to the Reduce side using the row- and column-based double semi-join technique;
Step 5: at the Reduce side, process the already largely ordered data from the Map side;
Step 6: return the query result to the client.
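To make the benefit of Steps 1, 2, and 5 concrete, the following sketch projects and pre-sorts two small relations and then joins them in a single pass. It is an illustration under stated assumptions, not the patented implementation: the helper names `project`, `presort`, and `merge_join`, the sample relations, and the choice of column 0 as the join attribute are all hypothetical.

```python
def project(rows, columns):
    """Step 1: keep only the listed column positions of every row."""
    return [tuple(row[c] for c in columns) for row in rows]

def presort(rows, key_index=0):
    """Step 2: sort locally on the join attribute before any transfer."""
    return sorted(rows, key=lambda row: row[key_index])

def merge_join(left, right, key_index=0):
    """Step 5: join two inputs that are already sorted on the join key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key_index], right[j][key_index]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row carrying the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key_index] == lk:
                j_end += 1
            while i < len(left) and left[i][key_index] == lk:
                out.extend(left[i] + right[k] for k in range(j, j_end))
                i += 1
            j = j_end
    return out

# Hypothetical relations; the third column of each is not needed by the query.
r = [(3, "c", "x"), (1, "a", "y"), (2, "b", "z")]
s = [(2, "B", 0.5), (3, "C", 0.7), (2, "BB", 0.9)]

r_small = presort(project(r, [0, 1]))
s_small = presort(project(s, [0, 1]))
joined = merge_join(r_small, s_small)
```

Because both inputs arrive sorted, the join only advances two cursors and never re-sorts, which is the effect the pre-sorting in Step 2 is meant to have on the Reduce side.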
In Step 2 above, the merge-sort preprocessing comprises two main stages:
The first stage sorts the database relation in segments. The relation R to be sorted is first divided into sub-tables of M blocks each, where M is the amount of memory available for sorting, measured in blocks. Each sub-table is then loaded into memory and sorted with an in-memory algorithm such as quicksort, which yields internally sorted sub-tables.
The second stage merges the sub-tables of the database relation. One block is read, in order, from each sorted sub-table into memory, and the records in these blocks are merged together in memory: each time, the minimum (or maximum) record is selected and placed in the output relation R', and the corresponding record is removed from its sub-table. When the in-memory block of a sub-table is exhausted, the next block of that sub-table is read into memory and merging continues.
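The two stages correspond to a classic external merge sort. A minimal sketch follows, assuming the relation fits in a Python list and that `run_size` (a hypothetical parameter counted in records rather than blocks) plays the role of M; a real implementation would read and write blocks on disk.

```python
import heapq

def sort_runs(records, run_size):
    """Stage 1: cut the relation into runs of at most run_size records and
    sort each run in memory (sorted() stands in for quicksort here)."""
    return [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]

def merge_runs(runs):
    """Stage 2: repeatedly take the smallest head record among all runs.
    heapq.merge consumes each run lazily, mirroring block-at-a-time reads."""
    return list(heapq.merge(*runs))

relation = [7, 3, 9, 1, 8, 2, 6, 4, 5]   # hypothetical join-attribute values
runs = sort_runs(relation, run_size=4)
ordered = merge_runs(runs)
```

Here `runs` holds three sorted sub-tables and `ordered` is the fully sorted relation R'. The standard-library `heapq.merge` is used in place of a hand-written selection loop; that substitution is a convenience of the sketch, not something the source specifies.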
In Step 4, the detailed steps for transmitting data with the row- and column-based double semi-join technique are as follows:
A) determine the attribute rows and columns involved in the join query;
B) based on the needs of the query, use projection operations to remove the row attributes and column attributes that are irrelevant to the join query;
C) construct multiple double semi-joins, calculate the transmission cost of each, and build a double semi-join set;
D) select from the double semi-join set the one with the minimum transmission cost, and use it to send the data produced by the Map stage to the Reduce side.
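Steps A) to D) can be sketched as follows. The source does not give a cost formula, so the cost model here (one unit per value sent over the network) is an assumption, as are the function names, the dictionary-based row format, and the sample relations.

```python
def double_semijoin(target, other, key, needed_columns):
    """Reduce `target` in both directions: keep only rows whose key also
    occurs in `other` (row reduction), then keep only the columns the query
    needs (column reduction)."""
    other_keys = {row[key] for row in other}
    reduced = [{c: row[c] for c in needed_columns}
               for row in target if row[key] in other_keys]
    # Assumed cost model: one unit per value sent, i.e. the key list shipped
    # to the target site plus the reduced relation shipped onward.
    cost = len(other_keys) + len(reduced) * len(needed_columns)
    return reduced, cost

def cheapest(candidates):
    """Step D: pick the candidate with the lowest estimated transfer cost."""
    return min(candidates, key=lambda c: c["cost"])

# Hypothetical relations joined on "id"; "note" and "memo" are not needed.
r = [{"id": 1, "name": "a", "note": "x"},
     {"id": 2, "name": "b", "note": "y"},
     {"id": 3, "name": "c", "note": "z"},
     {"id": 4, "name": "d", "note": "w"}]
s = [{"id": 2, "price": 5, "memo": "m"},
     {"id": 9, "price": 7, "memo": "n"}]

# Step C: build the candidate set, one double semi-join per direction.
candidates = []
for label, target, other, cols in [
    ("reduce R by S", r, s, ["id", "name"]),
    ("reduce S by R", s, r, ["id", "price"]),
]:
    reduced, cost = double_semijoin(target, other, "id", cols)
    candidates.append({"label": label, "rows": reduced, "cost": cost})

best = cheapest(candidates)
```

Under this cost model, reducing R by S ships two key values plus one two-column row (cost 4), while the opposite direction ships four key values plus one two-column row (cost 6), so the first candidate is chosen.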
As shown in Figs. 2 and 3, a database of 3000 tuples was tested on a Linux system (using the command vmstat 3, which samples CPU usage every three seconds). The improved Hive join query method based on the SDD-1 algorithm trades local processing time for transmission time between sites. Compared with Hive's original join query algorithm, its cs, us, and sy values (context switches per second, user CPU time, and system CPU time) are larger, which indicates higher CPU usage and more I/O reads: during data preprocessing the improved algorithm must repeatedly read, merge, and sort local data, which consumes considerable system resources.
As shown in Fig. 4, databases of 1000 to 8000 tuples were tested on a purpose-built Hadoop cluster. The tests show that when the number of tuples is small, the time each node spends on merge sorting exceeds the time saved by reducing the data and speeding up the Reduce side, so the query response time is actually longer than with Hive's original join query algorithm. As the number of tuples grows, the advantage of the improved algorithm begins to show. Quantitatively, when the total number of tuples reaches 8000, the response time of the improved algorithm starts to fall below that of Hive's original join query algorithm, and the improvement grows as the number of tuples increases further.
Hive's original join query algorithm performs the Reduce operation directly once the Map operation on all data has completed on the cluster nodes, and then returns the query result to the user. The invention improves on this: it greatly reduces the volume of data transmitted between nodes and thus bandwidth usage, and it introduces merge-sort preprocessing so that the Reduce operation receives a more ordered attribute column, which shortens the merge sort at the Reduce side and improves query efficiency.
It should be understood that the above embodiments are intended only to illustrate the invention and not to limit its scope. After reading this disclosure, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims.
Claims (5)
1. A Hive join query method based on the SDD-1 algorithm, characterized by comprising the following steps:
1) on each distributed node, perform projection and similar operations, and gather all executable unary operations and local operations into an execution strategy set, thereby reducing the raw data;
2) apply merge-sort preprocessing to the execution strategy set, sorting on each attribute so that each attribute forms an ordered intermediate data sequence;
3) run Map processing on the intermediate data sequence in Hadoop;
4) send the results produced by the Map stage to the Reduce side using the row- and column-based double semi-join technique;
5) at the Reduce side, process the already largely ordered data from the Map side;
6) return the query result to the client.
2. The Hive join query method based on the SDD-1 algorithm according to claim 1, characterized in that the merge-sort preprocessing of the execution strategy set in step 2) comprises two stages: the first stage sorts the database relation in segments; the second stage merges the sub-tables of the database relation.
3. The Hive join query method based on the SDD-1 algorithm according to claim 1 or 2, characterized in that the merge-sort preprocessing requires repeated reading, merging, and sorting of local data.
4. The Hive join query method based on the SDD-1 algorithm according to claim 1, characterized in that the specific steps for transmitting data with the row- and column-based double semi-join technique in step 4) are as follows:
A) determine the attribute rows and columns involved in the join query;
B) based on the needs of the query, use projection operations to remove the row attributes and column attributes that are irrelevant to the join query;
C) construct multiple double semi-joins, calculate the transmission cost of each, and build a double semi-join set;
D) select from the double semi-join set the one with the minimum transmission cost, and use it to send the data produced by the Map stage to the Reduce side.
5. The Hive join query method based on the SDD-1 algorithm according to claim 1, characterized by further comprising a test-data verification step: suitable test data are selected, and the method is compared with Hive's original join query algorithm in terms of CPU cost and response time respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410237997.2A CN104021169B (en) | 2014-05-30 | 2014-05-30 | Hive join query method based on the SDD-1 algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410237997.2A CN104021169B (en) | 2014-05-30 | 2014-05-30 | Hive join query method based on the SDD-1 algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104021169A (en) | 2014-09-03 |
CN104021169B (en) | 2018-01-16 |
Family
ID=51437923
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410237997.2A Expired - Fee Related CN104021169B (en) | Hive join query method based on the SDD-1 algorithm | 2014-05-30 | 2014-05-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104021169B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1464451A (en) * | 2002-06-26 | 2003-12-31 | 联想(北京)有限公司 | A sorting method of data record |
US20100162230A1 (en) * | 2008-12-24 | 2010-06-24 | Yahoo! Inc. | Distributed computing system for large-scale data handling |
CN102110158A (en) * | 2011-02-24 | 2011-06-29 | 上海大学 | Multi-join query optimization method for database based on improved SDD-1 (System for Distributed Database) algorithm |
Non-Patent Citations (2)
Title |
---|
叶文宸: ""基于Hive的性能优化方法的研究与实践"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
赵彦荣等: ""基于Hadoop的高效连接查询处理算法CHMJ"", 《软件学报》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106802787A (en) * | 2017-01-15 | 2017-06-06 | 天泽信息产业股份有限公司 | MapReduce optimization methods based on GPU sequences |
CN106802787B (en) * | 2017-01-15 | 2019-08-02 | 天泽信息产业股份有限公司 | MapReduce optimization method based on GPU sequence |
CN107463702A (en) * | 2017-08-16 | 2017-12-12 | 中科院成都信息技术股份有限公司 | A kind of database multi-join query optimization method based on evolution algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN104021169B (en) | 2018-01-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | | Granted publication date: 20180116; Termination date: 20180530 |