CN103324762A

CN103324762A - Hadoop-based index creation method and indexing method thereof

Info

Publication number: CN103324762A
Application number: CN2013103026691A
Authority: CN
Inventors: 陆嘉恒; 程明
Original assignee: Individual
Current assignee: Individual
Priority date: 2013-07-17
Filing date: 2013-07-17
Publication date: 2013-09-25

Abstract

The invention provides a Hadoop-based index creation method and an indexing method thereof. A three-level indexing mechanism is established, comprising a file-based index, a data-block-based index, and a record-based index. When data is read, the input splits are filtered level by level according to the index information, so that the final query skips the useless data and reads directly from the relevant position. The invention thereby prevents Hadoop from reading useless data and improves the efficiency of processing massive data.

Description

Hadoop-based index creation method and indexing method thereof

Technical field

The present invention relates to the field of cloud computing, and in particular to a Hadoop-based index creation method and an indexing method thereof.

Background art

The Hadoop distributed platform provides its services through the MapReduce computing framework: once an inexpensive Hadoop cluster has been built, a user can complete processing and analysis tasks on massive data using only the Map and Reduce computation phases. Because it is open source and easy to use, the Hadoop distributed platform has become a preferred option for managing and processing massive data.

In the prior art, however, Hadoop has no internal index mechanism. When processing queries over massive structured data, each Map Task must therefore scan all the data of its input split sequentially and only then filter out the useless data according to the query range. The input splits are generated by the MapReduce framework by splitting the raw data.

As described above, the prior-art approach reads first and filters afterwards, which wastes much of Hadoop's performance. In particular, when processing range queries over massive data, a large share of Hadoop's performance is spent reading useless data.

Summary of the invention

In view of the deficiencies of the prior art, the invention provides a Hadoop-based index creation method and an indexing method thereof, so as to prevent Hadoop from scanning useless data when processing queries.

To achieve the above object, the invention is realized through the following technical solutions:

The invention provides a Hadoop-based index creation method, comprising the following steps:

Reading an input split line by line, forming a first key-value pair for each line, and generating a first input list from the strings read, line by line;

Obtaining the format information of the first input list and, according to the first key-value pairs, reorganizing each line string of the first input list to generate a second input list;

Extracting the index attribute values of the second input list;

Calculating the split number of each second key-value pair, and grouping the second key-value pairs according to their split numbers;

Sorting the records corresponding to the second key-value pairs in each group by index attribute value and, according to each intermediate key-value pair generated, extracting the index value range of each group of records and creating a file-based index.
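The steps above can be sketched in memory as a map, group, sort, and range-extraction pipeline. This is a minimal illustration rather than the patented implementation: the function name `build_file_index`, the `|` field separator, the integer index attribute, the `part-NNNNN` file names, and the range-partitioning rule used to compute the split number are all assumptions made for the example, since the text does not specify how the split number is calculated. The input is assumed to be non-empty.

```python
from collections import defaultdict

def build_file_index(lines, key_column, num_partitions, sep="|"):
    """Simulate map -> group by split number -> sort -> range extraction."""
    # Map: turn every line into an <indexkey, record> pair.
    pairs = [(int(line.split(sep)[key_column]), line) for line in lines]
    lo = min(key for key, _ in pairs)
    hi = max(key for key, _ in pairs)
    # Assumed rule: range partitioning, so groups cover disjoint key ranges.
    width = max(1, -(-(hi - lo + 1) // num_partitions))  # ceiling division
    groups = defaultdict(list)
    for key, record in pairs:
        groups[(key - lo) // width].append((key, record))
    # Reduce: sort each group and keep its (min, max) index value range.
    index = []
    for part in sorted(groups):
        ordered = sorted(groups[part])
        index.append({
            "file": f"part-{part:05d}",  # hypothetical output file name
            "min": ordered[0][0],
            "max": ordered[-1][0],
            "records": [record for _, record in ordered],
        })
    return index
```

Because the groups are built by key range, the resulting list of `(min, max)` entries is ordered and non-overlapping, which is what makes the later file-level filtering effective.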

Wherein, on the basis of sorting the records corresponding to the second key-value pairs in each group by index attribute value, the method further comprises:

Reading the sorted records line by line, forming a third key-value pair for each line, and generating a third input list from the strings read, line by line;

Obtaining the format information of the third input list and, according to the third key-value pairs, reorganizing each line string of the third input list to generate a fourth input list;

Reading the index attribute values of the fourth input list, extracting their index value range, and creating a data-block-based index.
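A data-block-based index of this kind can be pictured as a table that maps each file path to an ordered list of per-block value ranges, matching the hash-table organization described in Embodiment 1. The sketch below is only illustrative: `build_block_index` is a hypothetical name, and blocks are delimited by a fixed number of records (`records_per_block`) instead of by byte size, which is a simplifying assumption.

```python
def build_block_index(path, sorted_keys, records_per_block):
    """Return {path: [(block_no, min_key, max_key), ...]} for one sorted file."""
    entries = []
    for start in range(0, len(sorted_keys), records_per_block):
        chunk = sorted_keys[start:start + records_per_block]
        # The file is sorted, so the first and last keys bound the block.
        entries.append((start // records_per_block, chunk[0], chunk[-1]))
    return {path: entries}
```

Since the records were sorted in the previous stage, each block's range is simply its first and last key, and the ranges within one file are already in ascending order.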

Wherein, after reading the index attribute values of the fourth input list, the method further comprises:

Storing each index attribute value together with the line offset of its corresponding record, forming a leaf node of a CSS tree;

Building a CSS tree from the leaf nodes, creating a record-based index.
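As a rough illustration of the leaf level, the following sketch pairs each index attribute value with the byte offset of its line and keeps the pairs in two parallel sorted arrays. It deliberately stops short of a real CSS tree: the directory levels that a CSS tree lays out over the leaves are omitted, and a flat sorted array stands in for them. The name `build_leaf_nodes`, the `|` separator, and the integer key are assumptions for the example.

```python
def build_leaf_nodes(data: bytes, key_column: int, sep: bytes = b"|"):
    """Return (keys, offsets): parallel lists sorted by index attribute value."""
    leaves = []
    offset = 0
    for line in data.splitlines(keepends=True):
        fields = line.rstrip(b"\r\n").split(sep)
        key = int(fields[key_column].decode())
        leaves.append((key, offset))  # leaf node: <index value, line offset>
        offset += len(line)           # next line starts after this one
    leaves.sort()
    keys = [key for key, _ in leaves]
    offsets = [off for _, off in leaves]
    return keys, offsets
```

Keeping keys and offsets in separate contiguous arrays mirrors the array-based layout that motivates CSS trees, and lets a lookup binary-search the keys and then read the matching offset by position.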

The present invention also provides a Hadoop-based indexing method, comprising the following steps:

Intersecting the query range for a file to be queried with the index attribute value range of the file-based index; when the intersection is empty, the file to be queried is discarded; otherwise, the file to be queried is kept, forming a queue of files to be processed.

Wherein, the method further comprises: intersecting the query range for the queue of files to be processed with the index attribute value range of the data-block-based index; when the intersection is empty, the corresponding entry of the queue of files to be processed is discarded; otherwise, it is kept, forming a queue of data blocks to be processed.
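The file-level and block-level filtering both reduce to an interval-overlap test. The sketch below shows that test applied at the two levels; the data layout (a dict of file ranges and a dict of per-file block ranges) and the function names are hypothetical, and closed integer intervals are assumed.

```python
def intersects(query, index_range):
    """True if the closed intervals [q_lo, q_hi] and [i_lo, i_hi] overlap."""
    return max(query[0], index_range[0]) <= min(query[1], index_range[1])

def filter_files(query, file_index):
    """Keep only files whose index value range overlaps the query range."""
    return [path for path, rng in file_index.items() if intersects(query, rng)]

def filter_blocks(query, pending_files, block_index):
    """Keep only blocks of the surviving files that overlap the query range."""
    return [
        (path, block_no)
        for path in pending_files
        for block_no, lo, hi in block_index[path]
        if intersects(query, (lo, hi))
    ]
```

For example, with file ranges `{"f1": (1, 5), "f2": (9, 12)}` and a query range of `(4, 6)`, only `f1` survives the first level, and only those blocks of `f1` whose range reaches 4 or above survive the second.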

Wherein, the method further comprises: intersecting the query range for the queue of data blocks to be processed with the index attribute value range of the record-based index; when the intersection is empty, the queue of data blocks to be processed is discarded; otherwise, the minimum index attribute value in the intersection is looked up, and reading begins at the line offset corresponding to that minimum index attribute value.
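Finding the minimum index attribute value inside the intersection is a lower-bound search over the sorted keys. The sketch below uses Python's standard `bisect_left` on a flat sorted array in place of a CSS tree traversal; that substitution, the function name, and the parallel `keys`/`offsets` lists are assumptions for illustration.

```python
from bisect import bisect_left

def first_offset_in_range(keys, offsets, q_lo, q_hi):
    """Return the offset of the smallest key within [q_lo, q_hi], or None."""
    if not keys or q_hi < keys[0] or q_lo > keys[-1]:
        return None  # empty intersection with the block's range: skip it
    pos = bisect_left(keys, q_lo)  # first position with keys[pos] >= q_lo
    if pos == len(keys) or keys[pos] > q_hi:
        return None  # ranges overlap, but no stored key falls inside
    return offsets[pos]
```

A return value of `None` corresponds to discarding the block; any other value is the position from which the reader should start.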

The Hadoop-based index creation method and indexing method provided by the invention use Hadoop to extract index information from the raw data. When data is read, the useless data is skipped directly according to this index information before the read is performed, which prevents Hadoop from reading useless data and improves the efficiency of processing massive data.

Description of drawings

Fig. 1 is a flowchart of the Hadoop-based index creation method in one embodiment of the invention;

Fig. 2 is a schematic diagram for the second embodiment of the invention;

Fig. 3 is another schematic diagram for the second embodiment of the invention.

Detailed description

The specific embodiments of the invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the invention but do not limit its scope.

Embodiment 1:

As shown in Fig. 1, the invention provides a Hadoop-based index creation method, comprising the following steps:

Extracting the index attribute values of the second input list;

The input splits are generated by the MapReduce framework by splitting the raw data according to Hadoop's default splitting mechanism. Under this mechanism, when the MapReduce framework runs a job, it first obtains the data block information of every file under the job's input path; then, taking both the block size and the configured split size into account, it splits data blocks or merges several data blocks to form the input splits; finally, it creates one Map Task to process each input split.
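The interplay between block size and configured split size can be illustrated with the rule commonly used by Hadoop's `FileInputFormat`, where the split size is `max(minSize, min(maxSize, blockSize))`. The sketch below is a simplification under stated assumptions: the exact rule differs between Hadoop API versions, the tolerance Hadoop applies to the final split is ignored, and both function names are hypothetical.

```python
def compute_split_size(block_size, min_size, max_size):
    """Clamp the block size between the configured minimum and maximum."""
    return max(min_size, min(max_size, block_size))

def split_offsets(file_length, split_size):
    """Return (start, length) pairs covering a file of file_length bytes."""
    return [
        (start, min(split_size, file_length - start))
        for start in range(0, file_length, split_size)
    ]
```

With the default limits the split size equals the block size; raising the minimum above the block size merges blocks into one split, and lowering the maximum below it divides a block into several splits.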

The operation of this embodiment is described in detail below with reference to its concrete implementation in a Hadoop cluster:

Step 101: The MapReduce framework splits the raw data into input splits according to Hadoop's default splitting mechanism and creates one Map task for each input split. The following steps are the operations performed for each Map task.

Step 102: The RecordReader of the Map process reads the input split line by line, forming a first key-value pair <offset, record> for each line, and generates the first input list from the strings read, line by line.

Step 103: Using the information configured by the user, such as the index attribute position, the index attribute type, and the data table name, the Map process obtains the format information of the first input list, i.e. the Schema information, from the Schema library.

Step 104: According to the first key-value pairs <offset, record>, each line string of the first input list is reorganized to generate the second input list.

Step 105: The index attribute values of the second input list are extracted, and each resulting second key-value pair <indexkey, record> is handed, under the control of the MapReduce framework, to the Shuffle process.

Step 106: The split number of each second key-value pair is calculated from its indexkey, and the second key-value pairs are grouped according to their split numbers.

Step 107: The records corresponding to the second key-value pairs in each group are sorted by their index attribute value indexkey, and each resulting intermediate key-value pair <indexkey, record-list> is handed to the Reduce process.

Step 108: According to each intermediate key-value pair <indexkey, record-list>, the index value range of each group of records is extracted, and the file-based index is created on HDFS.

This index data is organized as a linked list, and the value ranges of its index entries are ordered.

Step 109: The records sorted in step 107 are taken as the raw data, and steps 101, 102, 103, and 104 are executed on them, generating the corresponding third key-value pairs <offset, record>, the third input list, and the fourth input list.

Step 110: The index attribute values of the fourth input list are read, their index value range is extracted, and the data-block-based index is created on HDFS.

This index data is organized as linked lists, and the value ranges within each linked list are ordered. The index as a whole is organized as a hash table in which the Key is the file path and the Value is the linked list formed by all data block index entries of that file.

Step 111: After the index attribute values of the fourth input list have been read, each index attribute value is stored together with the line offset (offset) of its corresponding record (record), forming a leaf node of the CSS tree.

Step 112: A CSS tree is built from the leaf nodes, and the record-based index is created on HDFS.

At this point, the Hadoop-based index creation is complete. It comprises the file-based index, the data-block-based index, and the record-based index; this three-level index mechanism provides the basis for the subsequent index lookup algorithm.

After the Hadoop-based three-level index has been created, the corresponding index interface is called so that the files to be queried are filtered before they are read. The specific operations are as follows:

Step 201: The query range for a file to be queried is intersected with the index attribute value range of the file-based index. When the intersection is empty, the file to be queried is discarded; otherwise, it is kept, forming a queue of files to be processed.

This step is implemented in the getSplits function. It filters out the files that contain no useful data, reducing the range of data that the lower-level indexes have to cover.

Step 202: The query range for the queue of files to be processed is intersected with the index attribute value range of the data-block-based index. When the intersection is empty, the corresponding entry of the queue of files to be processed is discarded; otherwise, it is kept, forming a queue of data blocks to be processed.

This step is also implemented in the getSplits function. It filters out the data blocks that contain no useful data, further reducing the range of data that the lower-level index has to cover and improving on the filtering achieved by the file index.

Step 203: The queue of data blocks to be processed is read and, according to the index data of the data-block-based index, the CSS Tree index is rebuilt in memory.

This step is implemented in the init function. It rebuilds the index of the split in memory according to the input split information of the Map task.

Step 204: The query range for the queue of data blocks to be processed is intersected with the index attribute value range of the CSS Tree index. When the intersection is empty, the queue of data blocks to be processed is discarded; otherwise, the minimum index attribute value in the intersection is looked up, and the line offset corresponding to that value gives the exact position at which reading should begin.

Finally, the input stream of the input split jumps to this exact position and begins reading data.
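The final skip-and-read step can be sketched with an in-memory byte stream standing in for the HDFS input stream. The example assumes that the records in the block are sorted by the index attribute, that fields are separated by `|`, and that the key is an integer; `read_range` and the demo data are hypothetical.

```python
import io

def read_range(stream, start_offset, key_column, q_lo, q_hi, sep="|"):
    """Seek to start_offset, then yield records until a key exceeds q_hi."""
    stream.seek(start_offset)  # skip everything before the first match
    for raw in stream:
        record = raw.decode().rstrip("\n")
        key = int(record.split(sep)[key_column])
        if key > q_hi:
            break  # records are sorted, so nothing further can match
        if key >= q_lo:
            yield record

# Demo data: four 4-byte lines, so the third record starts at offset 8.
demo = io.BytesIO(b"a|1\nb|3\nc|5\nd|7\n")
matches = list(read_range(demo, 8, 1, 4, 6))
```

Only the bytes from the start offset up to the first record beyond the query range are touched, which is the saving the three-level index is meant to deliver.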

Embodiment 2:

In this embodiment, the effect of the invention is illustrated by concrete experiments:

1. Experimental configuration

The experiments were carried out in a real cluster environment consisting of four nodes. Each node was configured as follows: Intel Core i7-3770 3.40 GHz eight-core CPU, 16 GB of physical memory, 1 TB of physical storage, and Ubuntu 12.04. The Hadoop version run in the experiments was 0.20.2. The experimental data set was the TPC-H data set at a scale of 200 GB. TPC-H is a database performance benchmark proposed by the TPC (Transaction Processing Performance Council). The queries run in the experiments were a point query, range queries, TPC Query 1, and TPC Query 6, all on the LINEITEM table of the TPC-H data set. The range queries had selectivities of 1%, 14%, 42%, and 100% (the selected intervals were 1 month, 1 year, 3 years, and 7 years respectively; the data values span 7 years).
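The stated selectivities follow from the ratio of the selected interval to the 7-year (84-month) value span, if the records are assumed to be spread roughly uniformly over that span. The helper below is a hypothetical check of that arithmetic, not part of the experimental setup.

```python
def selectivity(months_selected, months_total=84):
    """Percentage of a uniformly distributed date span covered by the query."""
    return 100.0 * months_selected / months_total

# Intervals used in the experiment: 1 month, 1 year, 3 years, 7 years.
percentages = [selectivity(m) for m in (1, 12, 36, 84)]
```

The computed values are approximately 1.2%, 14.3%, 42.9%, and 100%, in line with the reported 1%, 14%, 42%, and 100%.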

2. HadoopIndex type experiment

Fig. 2 shows the results of the HadoopIndex type experiment. Non indicates that no index was used, R that the record-based index was used, B that the data-block-based index was used, and F that the file-based index was used. The results show that HadoopIndex supports flexible configuration of the index type, and that its indexes at each level effectively filter out useless data at the file, data block, and record levels, improving processing efficiency.

3. Block Size experiment

The Block Size experiment mainly tests how different Block Sizes affect the record-based index. Two Block Sizes were tested, 64 MB and 128 MB. The results are shown in Fig. 3 (Non denotes the query running time without any index).

The results show that the record-based index becomes more effective as the Block Size increases. The main reason is that, when the local data volume grows while the global data volume stays constant, each local index covers more data. Although a single lookup in a local index takes longer, each lookup covers more data and the total number of index lookups decreases. Hence, the larger the Block Size, the better the record-based index performs.

In summary, the Hadoop-based index creation method and indexing method provided by the invention establish a file-based index, a data-block-based index, and a record-based index. When data is read, the input splits are filtered level by level according to this index information, so that the final query skips the useless data directly before the read is performed. The invention prevents Hadoop from reading useless data and improves the efficiency of processing massive data.

The above embodiments are intended only to illustrate the invention, not to limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the invention; all equivalent technical solutions therefore also fall within the scope of the invention, and the scope of patent protection of the invention shall be defined by the claims.

Claims

1. A Hadoop-based index creation method, characterized by comprising the following steps:

Extracting the index attribute values of the second input list;

2. The method of claim 1, characterized in that, on the basis of sorting the records corresponding to the second key-value pairs in each group by index attribute value, the method further comprises:

3. The method of claim 2, characterized in that, after reading the index attribute values of the fourth input list, the method further comprises:

4. A Hadoop-based indexing method, characterized by comprising the following steps:

5. The method of claim 4, characterized in that the method further comprises:

Intersecting the query range for the queue of files to be processed with the index attribute value range of the data-block-based index; when the intersection is empty, the corresponding entry of the queue of files to be processed is discarded; otherwise, it is kept, forming a queue of data blocks to be processed.

6. The method of claim 5, characterized in that the method further comprises:

Intersecting the query range for the queue of data blocks to be processed with the index attribute value range of the record-based index; when the intersection is empty, the queue of data blocks to be processed is discarded; otherwise, the minimum index attribute value in the intersection is looked up, and reading begins at the line offset corresponding to that minimum index attribute value.