CN103324762A - Hadoop-based index creation method and indexing method thereof - Google Patents

Hadoop-based index creation method and indexing method thereof

Info

Publication number
CN103324762A
Authority
CN
China
Prior art keywords
index
input
list
key
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103026691A
Other languages
Chinese (zh)
Inventor
陆嘉恒
程明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN2013103026691A
Publication of CN103324762A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Hadoop-based index creation method and a corresponding indexing method. A three-level index mechanism is established, consisting of a file-based index, a data-block-based index, and a record-based index. When data is read, input splits are filtered layer by layer according to the index information, so that the final query skips useless data and reads only from the relevant position. The invention prevents Hadoop from reading useless data and improves the efficiency of processing massive data.

Description

Hadoop-based index creation method and indexing method thereof
Technical field
The present invention relates to the field of cloud computing, and in particular to a Hadoop-based index creation method and a corresponding indexing method.
Background art
The Hadoop distributed platform provides its services through the MapReduce computing framework. Once a user has built a low-cost Hadoop cluster, processing and analysis of massive data can be carried out with the Map and Reduce computation phases. Because it is open source and easy to use, the Hadoop distributed platform has become a preferred option for managing and processing massive data.
In the prior art, however, Hadoop has no internal index mechanism. As a result, when Hadoop processes a query over massive structured data, each Map task must scan all the data of its input split in sequence and only then filter out useless data according to the query range. The input splits are generated by the MapReduce framework by splitting the raw data.
As the above description shows, the prior art reads first and filters afterwards, which wastes much of Hadoop's performance. In particular, when range queries over massive data are processed, a large share of that performance is spent reading useless data.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a Hadoop-based index creation method and a corresponding indexing method, so that Hadoop does not scan useless data when processing a query.
To achieve the above purpose, the present invention is realized through the following technical solutions:
The invention provides a Hadoop-based index creation method, comprising the following steps:
Reading an input split line by line, forming a first key-value pair for each line, and generating a first input list from the strings read line by line;
Obtaining the format information of the first input list and, according to the first key-value pairs, reorganizing each line string of the first input list to generate a second input list;
Extracting the index attribute values of the second input list;
Calculating the partition number of each second key-value pair, and grouping the second key-value pairs according to their partition numbers;
Sorting the records corresponding to the second key-value pairs in each group by index attribute value and, according to each intermediate key-value pair thus generated, extracting the index value range of each group of records to create a file-based index.
Wherein, following the sorting of the records corresponding to the second key-value pairs in each group by index attribute value, the method further comprises:
Reading the sorted records line by line, forming a third key-value pair for each line, and generating a third input list from the strings read line by line;
Obtaining the format information of the third input list and, according to the third key-value pairs, reorganizing each line string of the third input list to generate a fourth input list;
Reading the index attribute values of the fourth input list and extracting their index value range to create a data-block-based index.
Wherein, after the index attribute values of the fourth input list are read, the method further comprises:
Storing each index attribute value together with the line offset of its corresponding record to form a leaf node of a CSS tree;
Building a CSS tree from the leaf nodes to create a record-based index.
The present invention also provides a Hadoop-based indexing method, comprising the following steps:
Intersecting the query range for a file to be queried with the index attribute value range of the file-based index; when the intersection is empty, the file to be queried is discarded; otherwise, the file to be queried is retained, forming a pending file queue.
Wherein, the method further comprises: intersecting the query range for the pending file queue with the index attribute value range of the data-block-based index; when the intersection is empty, the pending file queue is discarded; otherwise, the pending file queue is retained, forming a pending data block queue.
Wherein, the method further comprises: intersecting the query range for the pending data block queue with the index attribute value range of the record-based index; when the intersection is empty, the pending data block queue is discarded; otherwise, the minimum index attribute value in the intersection is located, and reading starts from the line offset corresponding to that minimum index attribute value.
The Hadoop-based index creation method and indexing method provided by the invention use Hadoop to extract index information from the raw data. When data is read, useless data is skipped directly according to the index information before the read is performed, so that Hadoop avoids reading useless data and the efficiency of processing massive data is improved.
Description of drawings
Fig. 1 is a flowchart of the Hadoop-based index creation method in one embodiment of the invention;
Fig. 2 is a schematic diagram of the second embodiment of the invention;
Fig. 3 is another schematic diagram of the second embodiment of the invention.
Detailed description of the embodiments
The specific embodiments of the invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the invention but do not limit its scope.
Embodiment 1:
As shown in Fig. 1, the invention provides a Hadoop-based index creation method, comprising the following steps:
Reading an input split line by line, forming a first key-value pair for each line, and generating a first input list from the strings read line by line;
Obtaining the format information of the first input list and, according to the first key-value pairs, reorganizing each line string of the first input list to generate a second input list;
Extracting the index attribute values of the second input list;
Calculating the partition number of each second key-value pair, and grouping the second key-value pairs according to their partition numbers;
Sorting the records corresponding to the second key-value pairs in each group by index attribute value and, according to each intermediate key-value pair thus generated, extracting the index value range of each group of records to create a file-based index.
The input splits are generated by the MapReduce framework from the raw data according to Hadoop's default splitting mechanism. Under this mechanism, when the MapReduce framework runs a job, it first obtains the data block information of every file under the job's input path, then, taking into account the block size and the configured split size, splits data blocks or merges several data blocks to form input splits, and finally creates one Map task to process each input split.
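The split-size arithmetic of the default mechanism can be sketched as follows. The formula and the 10% slack are modeled on FileInputFormat in Hadoop's newer mapreduce API and are an assumption with respect to the cluster version used in this document; the function names are hypothetical, and merging of several blocks into one split is not modeled.

```python
SPLIT_SLOP = 1.1  # assumption: the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size, max_size):
    """Clamp the block size between the configured minimum and maximum split size."""
    return max(min_size, min(max_size, block_size))


def file_splits(file_length, block_size, min_size=1, max_size=2**63 - 1):
    """Return (start_offset, length) pairs covering one file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = []
    remaining = file_length
    # Cut full-size splits while more than SPLIT_SLOP split sizes remain.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_length - remaining, split_size))
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits.append((file_length - remaining, remaining))
    return splits
```

One Map task would then be created per returned pair.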
Wherein, following the sorting of the records corresponding to the second key-value pairs in each group by index attribute value, the method further comprises:
Reading the sorted records line by line, forming a third key-value pair for each line, and generating a third input list from the strings read line by line;
Obtaining the format information of the third input list and, according to the third key-value pairs, reorganizing each line string of the third input list to generate a fourth input list;
Reading the index attribute values of the fourth input list and extracting their index value range to create a data-block-based index.
Wherein, after the index attribute values of the fourth input list are read, the method further comprises:
Storing each index attribute value together with the line offset of its corresponding record to form a leaf node of a CSS tree;
Building a CSS tree from the leaf nodes to create a record-based index.
The present invention also provides a Hadoop-based indexing method, comprising the following steps:
Intersecting the query range for a file to be queried with the index attribute value range of the file-based index; when the intersection is empty, the file to be queried is discarded; otherwise, the file to be queried is retained, forming a pending file queue.
Wherein, the method further comprises: intersecting the query range for the pending file queue with the index attribute value range of the data-block-based index; when the intersection is empty, the pending file queue is discarded; otherwise, the pending file queue is retained, forming a pending data block queue.
Wherein, the method further comprises: intersecting the query range for the pending data block queue with the index attribute value range of the record-based index; when the intersection is empty, the pending data block queue is discarded; otherwise, the minimum index attribute value in the intersection is located, and reading starts from the line offset corresponding to that minimum index attribute value.
The Hadoop-based index creation method and indexing method provided by the invention use Hadoop to extract index information from the raw data. When data is read, useless data is skipped directly according to the index information before the read is performed, so that Hadoop avoids reading useless data and the efficiency of processing massive data is improved.
The operation of this embodiment is described in detail below with reference to its concrete implementation in a Hadoop cluster:
Step 101: The MapReduce framework splits the raw data according to Hadoop's default splitting mechanism to generate input splits, and creates one Map task for each input split. All of the following steps are performed by each Map task;
Step 102: The RecordReader of the Map process reads the input split line by line, forming a first key-value pair <offset, record> for each line, and generates the first input list from the strings read line by line;
Step 103: According to information configured by the user as needed, such as the index attribute position, the index attribute type, and the data format name, the Map process obtains the format information of the first input list, i.e. the Schema information, from the Schema library;
Step 104: According to the first key-value pairs <offset, record>, each line string of the first input list is reorganized to generate the second input list;
Step 105: The index attribute values of the second input list are extracted, and each resulting second key-value pair <indexkey, record> is handed to the Shuffle process under the control of the MapReduce framework;
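Steps 102 to 105 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the "|" field delimiter and the representation of the Schema information as a field position plus a type-conversion function are all assumptions.

```python
def read_records(split_bytes):
    """Yield the first key-value pairs <offset, record> for one input split."""
    offset = 0
    for raw in split_bytes.splitlines(keepends=True):
        # offset is the byte position of the line start within the split
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def to_second_pair(record, index_position, cast, delimiter="|"):
    """Build a second key-value pair <indexkey, record> from one line.

    index_position and cast stand in for the Schema information
    (index attribute position and index attribute type).
    """
    fields = record.split(delimiter)
    return cast(fields[index_position]), record
```

For example, `to_second_pair("2|1996-04-12|y", 0, int)` yields the pair `(2, "2|1996-04-12|y")`, which would then be passed to the shuffle.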
Step 106: The partition number of each second key-value pair is calculated from its indexkey, and the second key-value pairs are grouped by partition number;
Step 107: The records corresponding to the second key-value pairs in each group are sorted by their index attribute value indexkey, and each resulting intermediate key-value pair <indexkey, record-list> is handed to the Reduce process;
Step 108: According to each intermediate key-value pair <indexkey, record-list>, the index value range of each group of records is extracted, and the file-based index is created on HDFS.
This index data is organized as a linked list, and the value ranges of the index entries are ordered.
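Steps 106 to 108 can be sketched in a single process as follows. The source does not say how the partition number is derived from indexkey; the range partitioner over hypothetical `boundaries` used here is an assumption, as are the function names and the `part-NNNNN` output file names.

```python
from bisect import bisect_right


def partition_number(indexkey, boundaries):
    """Assumed range partitioner: keys below boundaries[0] go to partition 0, and so on."""
    return bisect_right(boundaries, indexkey)


def build_file_index(pairs, boundaries):
    """Group <indexkey, record> pairs, sort each group, and record its key range."""
    groups = {}
    for indexkey, record in pairs:
        part = partition_number(indexkey, boundaries)
        groups.setdefault(part, []).append((indexkey, record))

    file_index = []    # ordered entries (min_key, max_key, file_name)
    sorted_files = {}  # file_name -> records sorted by indexkey
    for part in sorted(groups):
        rows = sorted(groups[part], key=lambda kv: kv[0])
        name = f"part-{part:05d}"
        sorted_files[name] = rows
        file_index.append((rows[0][0], rows[-1][0], name))
    return file_index, sorted_files
```

With a range partitioner the per-file ranges do not overlap, so the entries come out ordered by value range, matching the ordering described above.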
Step 109: Taking the records sorted in step 107 as the raw data, steps 101, 102, 103 and 104 are executed again, generating the third key-value pairs <offset, record>, the third input list, and the fourth input list respectively;
Step 110: The index attribute values of the fourth input list are read and their index value range is extracted, and the data-block-based index is created on HDFS;
This index data is organized as linked lists, with the value ranges ordered within each list. The index as a whole is organized as a hash table in which the Key is a file path and the Value is the linked list formed by all data block index entries of that file.
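The hash-table layout described above can be sketched as follows. This is an illustration under stated assumptions: blocks are taken to start at multiples of `block_size`, a record belongs to the block that contains its first byte, and a Python dict of lists stands in for the hash table of linked lists. The function name and input shape are hypothetical.

```python
def build_block_index(files, block_size):
    """Build {file_path: [(block_id, min_key, max_key), ...]}.

    files maps a file path to (offset, indexkey) pairs for its records.
    """
    index = {}
    for path, rows in files.items():
        ranges = {}
        for offset, key in rows:
            block_id = offset // block_size
            lo, hi = ranges.get(block_id, (key, key))
            ranges[block_id] = (min(lo, key), max(hi, key))
        # Entries are kept in block order; since the file is sorted by
        # indexkey, their value ranges are ordered as well.
        index[path] = [(b, lo, hi) for b, (lo, hi) in sorted(ranges.items())]
    return index
```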
Step 111: After the index attribute values of the fourth input list are read, each index attribute value is stored together with the line offset (offset) of its corresponding record to form a leaf node of the CSS tree;
Step 112: A CSS tree is built from the leaf nodes, and the record-based index is created on HDFS.
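The record-level structure can be illustrated with a simplified stand-in. Assuming "CSS tree" refers to a cache-sensitive search tree, a full implementation would pack a multi-level directory into one contiguous array sized to cache lines; the sketch below keeps only a sorted leaf array of (indexkey, offset) entries and a one-level directory holding the first key of each node. The class name and `node_size` are hypothetical.

```python
from bisect import bisect_left


class LeafDirectoryIndex:
    """Sorted leaves of (indexkey, offset) plus a one-level directory."""

    def __init__(self, entries, node_size=4):
        self.leaves = sorted(entries)
        self.node_size = node_size
        # First key of every node of node_size consecutive leaves.
        self.directory = [
            self.leaves[i][0] for i in range(0, len(self.leaves), node_size)
        ]

    def first_at_or_above(self, low):
        """Return the first (indexkey, offset) with indexkey >= low, or None."""
        # The match lies in the node before the first directory key >= low,
        # or at the start of the node after it.
        node = max(bisect_left(self.directory, low) - 1, 0)
        for i in range(node * self.node_size, len(self.leaves)):
            if self.leaves[i][0] >= low:
                return self.leaves[i]
        return None
```

The scan after the directory lookup touches at most one node plus one entry, which is the property the directory is meant to provide.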
At this point, index creation on Hadoop is complete. It comprises the file-based index, the data-block-based index, and the record-based index; this three-level index mechanism provides the basis for the subsequent indexing algorithm.
After the three-level index has been created on Hadoop, the corresponding index interfaces are called so that the files to be queried are filtered before being read. The concrete operations are as follows:
Step 201: The query range for a file to be queried is intersected with the index attribute value range of the file-based index; when the intersection is empty, the file to be queried is discarded; otherwise, the file to be queried is retained, forming a pending file queue.
This step is implemented in the getSplits function. It filters out files that contain no useful data and narrows the data range passed to the lower-level indexes.
Step 202: The query range for the pending file queue is intersected with the index attribute value range of the data-block-based index; when the intersection is empty, the pending file queue is discarded; otherwise, the pending file queue is retained, forming a pending data block queue.
This step is also implemented in the getSplits function. It filters out data blocks that contain no useful data, further narrowing the data range passed to the lower-level index and reinforcing the filtering done by the file index.
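The two filtering layers of steps 201 and 202 can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes closed ranges, file index entries of the form (min_key, max_key, path) and a block index of the form {path: [(block_id, min_key, max_key), ...]}, and it filters per file and per data block, in line with the stated purpose of each step.

```python
def overlaps(query, lo, hi):
    """True when the closed query range (q_lo, q_hi) intersects [lo, hi]."""
    q_lo, q_hi = query
    return max(q_lo, lo) <= min(q_hi, hi)


def filter_files(query, file_index):
    """Step 201: keep only files whose key range intersects the query."""
    return [path for lo, hi, path in file_index if overlaps(query, lo, hi)]


def filter_blocks(query, pending_files, block_index):
    """Step 202: keep only blocks of pending files that intersect the query."""
    pending_blocks = []
    for path in pending_files:
        for block_id, lo, hi in block_index.get(path, []):
            if overlaps(query, lo, hi):
                pending_blocks.append((path, block_id))
    return pending_blocks
```

In a real job this logic would sit inside the input format's getSplits, so that filtered-out files and blocks never produce a Map task.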
Step 203: The pending data block queue is read and, according to the index data of the data-block-based index, the CSS Tree index is rebuilt in memory;
This step is implemented in the init function. It rebuilds the index of the split in memory according to the input split information of the Map task.
Step 204: The query range for the pending data block queue is intersected with the index attribute value range of the CSS Tree index; when the intersection is empty, the pending data block queue is discarded; otherwise, the minimum index attribute value in the intersection is located, and the line offset corresponding to that value gives the exact position at which reading starts.
Finally, the input stream of the input split is moved to this exact position and reading begins there.
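The seek-then-read behaviour of step 204 can be sketched as follows. This is an illustration only: a plain sorted list with binary search stands in for the in-memory CSS Tree index, the records are assumed to be sorted by an integer index attribute in the first "|"-delimited field, and the function name is hypothetical.

```python
import io
from bisect import bisect_left


def read_from_first_match(stream, entries, low, high):
    """Seek to the first record with low <= indexkey <= high and read the matches.

    entries is a sorted list of (indexkey, offset) pairs for the split.
    """
    keys = [key for key, _ in entries]
    i = bisect_left(keys, low)
    if i == len(entries) or keys[i] > high:
        return []  # empty intersection: nothing in this split is read
    stream.seek(entries[i][1])  # jump straight to the exact position
    matches = []
    for raw in stream:
        line = raw.decode("utf-8").rstrip("\n")
        if int(line.split("|")[0]) > high:
            break  # records are sorted, so no later record can match
        matches.append(line)
    return matches
```

For example, with records `1|a`, `4|b`, `7|c`, `9|d` stored one per line in an `io.BytesIO`, a query for the range 3 to 8 seeks to byte 4 and returns only `4|b` and `7|c`.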
Embodiment 2:
This embodiment illustrates the effect of the invention through concrete experiments:
1. Experimental configuration
The experiments in this part were carried out in a real cluster environment consisting of four nodes. Each node was configured as follows: Intel Core i7-3770 CPU at 3.40 GHz with eight cores, 16 GB of physical memory, 1 TB of physical storage, and Ubuntu 12.04. The Hadoop version run in the experiments was 0.20.2. The experimental data set was the TPC-H data set at a scale of 200 GB. TPC-H is a database performance benchmark proposed by the TPC (Transaction Processing Performance Council). The queries run in the experiments were point queries, range queries, TPC-H Query 1 and TPC-H Query 6 over the LINEITEM table of the TPC-H data set. The selectivities of the range queries were 1%, 14%, 42% and 100% (the selected intervals were 1 month, 1 year, 3 years and 7 years respectively; the data spans 7 years).
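As a rough check on the quoted selectivities, assume the indexed date attribute is spread evenly over the 7-year (84-month) span; the source does not state the distribution, so this is an assumption:

```latex
\begin{aligned}
\text{1 month:}\quad & \tfrac{1}{84} \approx 1.2\% \\
\text{1 year:}\quad  & \tfrac{1}{7}  \approx 14.3\% \\
\text{3 years:}\quad & \tfrac{3}{7}  \approx 42.9\% \\
\text{7 years:}\quad & \tfrac{7}{7}  = 100\%
\end{aligned}
```

With the fractional parts dropped, these agree with the quoted 1%, 14%, 42% and 100%.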
2. HadoopIndex type experiment
Fig. 2 shows the results of the HadoopIndex type experiment. Non indicates that no index was used, R that the record-based index was used, B that the data-block-based index was used, and F that the file-based index was used. The results show that HadoopIndex supports flexible configuration of the index type, and that its index levels filter useless data effectively at the file, data block and record levels, improving processing efficiency.
3. Block Size experiment
The Block Size experiment mainly tests the effect of different block sizes on the record-based index. Two block sizes were tested, 64 MB and 128 MB. The results are shown in Fig. 3 (Non denotes the query running time without any index).
The results show that the record-based index becomes more effective as the Block Size increases. The main reason is that, with the global data volume unchanged, a larger local data volume means each local index covers more data. Although a single lookup in a local index takes longer, each index covers more data and the total number of index lookups falls, so the larger the Block Size, the better the record-based index performs.
In summary, the Hadoop-based index creation method and indexing method provided by the invention establish a file-based index, a data-block-based index and a record-based index. When data is read, the input splits are filtered layer by layer according to the index information, so that the final query skips useless data directly before reading. The invention thus prevents Hadoop from reading useless data and improves the efficiency of processing massive data.
The above embodiments only illustrate the invention and do not limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the invention; all equivalent technical solutions therefore also fall within the scope of the invention, and the scope of patent protection of the invention shall be defined by the claims.

Claims (6)

1. A Hadoop-based index creation method, characterized by comprising the following steps:
Reading an input split line by line, forming a first key-value pair for each line, and generating a first input list from the strings read line by line;
Obtaining the format information of the first input list and, according to the first key-value pairs, reorganizing each line string of the first input list to generate a second input list;
Extracting the index attribute values of the second input list;
Calculating the partition number of each second key-value pair, and grouping the second key-value pairs according to their partition numbers;
Sorting the records corresponding to the second key-value pairs in each group by index attribute value and, according to each intermediate key-value pair thus generated, extracting the index value range of each group of records to create a file-based index.
2. The method of claim 1, characterized in that, following the sorting of the records corresponding to the second key-value pairs in each group by index attribute value, the method further comprises:
Reading the sorted records line by line, forming a third key-value pair for each line, and generating a third input list from the strings read line by line;
Obtaining the format information of the third input list and, according to the third key-value pairs, reorganizing each line string of the third input list to generate a fourth input list;
Reading the index attribute values of the fourth input list and extracting their index value range to create a data-block-based index.
3. The method of claim 2, characterized in that, after the index attribute values of the fourth input list are read, the method further comprises:
Storing each index attribute value together with the line offset of its corresponding record to form a leaf node of a CSS tree;
Building a CSS tree from the leaf nodes to create a record-based index.
4. A Hadoop-based indexing method, characterized by comprising the following steps:
Intersecting the query range for a file to be queried with the index attribute value range of the file-based index; when the intersection is empty, the file to be queried is discarded; otherwise, the file to be queried is retained, forming a pending file queue.
5. The method of claim 4, characterized in that the method further comprises:
Intersecting the query range for the pending file queue with the index attribute value range of the data-block-based index; when the intersection is empty, the pending file queue is discarded; otherwise, the pending file queue is retained, forming a pending data block queue.
6. The method of claim 5, characterized in that the method further comprises:
Intersecting the query range for the pending data block queue with the index attribute value range of the record-based index; when the intersection is empty, the pending data block queue is discarded; otherwise, the minimum index attribute value in the intersection is located, and reading starts from the line offset corresponding to that minimum index attribute value.
CN2013103026691A 2013-07-17 2013-07-17 Hadoop-based index creation method and indexing method thereof Pending CN103324762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103026691A CN103324762A (en) 2013-07-17 2013-07-17 Hadoop-based index creation method and indexing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103026691A CN103324762A (en) 2013-07-17 2013-07-17 Hadoop-based index creation method and indexing method thereof

Publications (1)

Publication Number Publication Date
CN103324762A true CN103324762A (en) 2013-09-25

Family

ID=49193505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103026691A Pending CN103324762A (en) 2013-07-17 2013-07-17 Hadoop-based index creation method and indexing method thereof

Country Status (1)

Country Link
CN (1) CN103324762A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577604A (en) * 2013-11-20 2014-02-12 电子科技大学 Image indexing structure for Hadoop distributed type environment
CN108121807A (en) * 2017-12-26 2018-06-05 云南大学 The implementation method of multi-dimensional index structures OBF-Index under Hadoop environment
CN109947703A (en) * 2017-11-09 2019-06-28 北京京东尚科信息技术有限公司 File system, file memory method, storage device and computer-readable medium
CN110019204A (en) * 2017-10-27 2019-07-16 航天信息股份有限公司 Method and apparatus are indexed inside split towards HDFS
CN111178373A (en) * 2018-11-09 2020-05-19 中科寒武纪科技股份有限公司 Operation method, device and related product
CN111221814A (en) * 2018-11-27 2020-06-02 阿里巴巴集团控股有限公司 Secondary index construction method, device and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110162069A1 (en) * 2009-12-31 2011-06-30 International Business Machines Corporation Suspicious node detection and recovery in mapreduce computing
CN102426609A (en) * 2011-12-28 2012-04-25 厦门市美亚柏科信息股份有限公司 Index generation method and index generation device based on MapReduce programming architecture
CN102915365A (en) * 2012-10-24 2013-02-06 苏州两江科技有限公司 Hadoop-based construction method for distributed search engine

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110162069A1 (en) * 2009-12-31 2011-06-30 International Business Machines Corporation Suspicious node detection and recovery in mapreduce computing
CN102426609A (en) * 2011-12-28 2012-04-25 厦门市美亚柏科信息股份有限公司 Index generation method and index generation device based on MapReduce programming architecture
CN102915365A (en) * 2012-10-24 2013-02-06 苏州两江科技有限公司 Hadoop-based construction method for distributed search engine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
蒋明原: "云计算平台在搜索引擎中的关键技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577604A (en) * 2013-11-20 2014-02-12 电子科技大学 Image indexing structure for Hadoop distributed type environment
CN110019204A (en) * 2017-10-27 2019-07-16 航天信息股份有限公司 Method and apparatus are indexed inside split towards HDFS
CN109947703A (en) * 2017-11-09 2019-06-28 北京京东尚科信息技术有限公司 File system, file memory method, storage device and computer-readable medium
CN108121807A (en) * 2017-12-26 2018-06-05 云南大学 The implementation method of multi-dimensional index structures OBF-Index under Hadoop environment
CN111178373A (en) * 2018-11-09 2020-05-19 中科寒武纪科技股份有限公司 Operation method, device and related product
CN111178373B (en) * 2018-11-09 2021-07-09 中科寒武纪科技股份有限公司 Operation method, device and related product
CN111221814A (en) * 2018-11-27 2020-06-02 阿里巴巴集团控股有限公司 Secondary index construction method, device and equipment
CN111221814B (en) * 2018-11-27 2023-06-27 阿里巴巴集团控股有限公司 Method, device and equipment for constructing secondary index

Similar Documents

Publication Publication Date Title
Ikeda et al. Provenance for Generalized Map and Reduce Workflows.
US10789231B2 (en) Spatial indexing for distributed storage using local indexes
US20220164345A1 (en) Managed query execution platform, and methods thereof
CN109255055B (en) Graph data access method and device based on grouping association table
CN103324762A (en) Hadoop-based index creation method and indexing method thereof
CN102915365A (en) Hadoop-based construction method for distributed search engine
CN106471501B (en) Data query method, data object storage method and data system
JP6642651B2 (en) Storage method using user access preference model
CN101996250A (en) Hadoop-based mass stream data storage and query method and system
CN105260464B (en) The conversion method and device of data store organisation
CN104090962A (en) Nested query method oriented to mass distributed-type database
CN106970929A (en) Data lead-in method and device
CN104933143B (en) Obtain the method and device of recommended
CN104408067A (en) Multi-tree structure database design method and device
Jiang et al. Parallel K-Medoids clustering algorithm based on Hadoop
Binnig et al. SQLScript: Efficiently analyzing big enterprise data in SAP HANA
JP6696062B2 (en) How to cache multiple 2MB or smaller files based on Hadoop
CN104408128A (en) Read optimization method for asynchronously updating indexes based on B+ tree
CN103761298B (en) Distributed-architecture-based entity matching method
CN104794237B (en) web information processing method and device
Balaji et al. Distributed graph path queries using spark
KR20180085633A (en) Method and apparatus for processing query
CN104951442A (en) Method and device for determining result vector
Sahane et al. Analysis of Research Data using MapReduce Word Count Algorithm
CN108121807A (en) The implementation method of multi-dimensional index structures OBF-Index under Hadoop environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20170531