CN103646073A

CN103646073A - Condition query optimizing method based on HBase table

Info

Publication number: CN103646073A
Application number: CN201310667847.0A
Authority: CN
Inventors: 郭美思; 吴楠
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2013-12-11
Filing date: 2013-12-11
Publication date: 2014-03-19

Abstract

The invention provides a condition query optimization method based on HBase tables. Drawing on the data analysis and processing capability and the parallel computing characteristics of a distributed computing framework, the condition query of an HBase table is built on the MapReduce computing framework, which improves query efficiency. The optimization is achieved mainly through Region pre-allocation, RowKey design and a MapReduce module. Compared with the prior art, the method improves condition query efficiency and provides an optimization approach for condition queries; its query process is simple and supports parallel computation, so the method is efficient, practical and easy to popularize.

Description

A condition query optimization method based on HBase tables

Technical field

The present invention relates to the technical field of communication and information, and specifically to a condition query optimization method based on HBase tables.

Background art

With the rapid growth of massive data, big data has become a focus of current attention. Among the technologies that big data requires, distributed file systems, distributed databases and the like are all well suited to it. HBase is a distributed, column-oriented open-source database that uses Hadoop HDFS as its file storage system. As HBase has continued to improve in performance and stability, it has gradually become one of the standards in the big data NoSQL field. HBase has been adopted by many major companies, such as Facebook, Twitter, Adobe, Cloudera and IBM. Condition query optimization based on HBase tables is therefore very important.

At present, HBase queries are implemented in two ways: the Get method, which obtains a single record by specifying its RowKey, and the Scan method, which obtains a batch of records that match specified conditions. Condition queries use the Scan method. A scan can be sped up through the setCaching and setBatch methods (trading space for time), and its range can be limited with setStartRow and setEndRow; the smaller the range, the higher the performance. With a carefully designed RowKey, the records fetched in one batch are stored close together (ideally in the same Region), so traversing the results performs well.
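As a rough illustration of why a narrow row range is cheap, the sketch below mimics a range scan over lexicographically sorted row keys in plain Python. It does not use the HBase client: `range_scan` and the sample keys are hypothetical, and the stop key is treated as exclusive.

```python
from bisect import bisect_left


def range_scan(sorted_keys, start_key, stop_key):
    """Return every key k with start_key <= k < stop_key from a sorted list."""
    lo = bisect_left(sorted_keys, start_key)   # first key >= start_key
    hi = bisect_left(sorted_keys, stop_key)    # first key >= stop_key (excluded)
    return sorted_keys[lo:hi]


# Hypothetical row keys: 6-character plate followed by a 17-digit timestamp.
keys = sorted([
    "ABC12320121124080000000",
    "QVW75520121123235959999",
    "QVW75520121124120403222",
    "QVW75520121125000000000",
    "XYZ99920121124120000000",
])

# All records of plate QVW755 on 2012-11-24: only a contiguous slice is touched.
hits = range_scan(keys, "QVW75520121124", "QVW75520121125")
print(hits)
```

Because matching keys are adjacent in sort order, the scan reads one contiguous slice instead of filtering every record.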

In a distributed computing framework, parallel computation greatly improves efficiency, and the distributed file system HDFS can store large amounts of data and scales well. For these reasons, a condition query optimization method based on HBase tables is proposed. The scheme uses Region pre-allocation and a well-designed RowKey to import the source data into an HBase table, uses the MapReduce framework to implement condition queries on that table and thereby improve query efficiency, and tunes the corresponding configuration parameters of the cluster to achieve the optimization effect.

Summary of the invention

The technical task of the present invention is to address the deficiencies of the prior art by providing a condition query optimization method based on HBase tables.

The technical solution of the present invention is realized as follows. The specific implementation process of the condition query optimization method based on HBase tables is:

Determine the number of pre-allocated table partitions (Regions) according to the data volume of the table and the configuration of the cluster: in Region pre-allocation, the number of Regions is determined from the volume of data to be imported into HBase and the scale of the distributed cluster, and the Regions are then designed and allocated in advance according to the data volume and the RowKey rules. As an HBase Region grows, its size eventually reaches a threshold, at which point the Region splits automatically; more Regions help to ensure concurrency performance;

Design a reasonable RowKey according to the application of the condition query: if the RowKey of a record falls within the range between the start key and the end key of a Region, the record is stored on that Region;

Improve query performance according to Region pre-allocation, the RowKey and the distributed programming framework: in the MapReduce computing framework, the Map module queries the HBase table for the records that satisfy the conditions; the optimization scheme of the condition query is designed according to the characteristics of the condition query and the RowKey; the StartKey and EndKey parameters of Scan are used to improve search performance; and the number of Map tasks is configured reasonably according to the parameters of the distributed cluster, thereby achieving the optimization effect.

The MapReduce programming framework is the processing stage that finally produces the condition query result: the Map module queries the HBase table according to the conditions to obtain the matching records, processes those records into the desired output format, and provides a programming interface for traversal.

The detailed steps of the optimization method are:

First, in the pre-allocation of table partitions (Regions), the ZooKeeper properties are set according to the environment and configuration of the cluster, the name of the HBase table to be imported into and its column family name are created, and the HBase table is then created with the Split function that has been written. The Split function determines the number of splits according to the source data format and represents the Regions as a two-dimensional array. Once Region pre-allocation is finished, the source data can be converted into HFile files according to the RowKey design. Finally, the completebulkload command is used to complete the import, at which point the data has been imported into the HBase table in the predetermined format.

Write the MapReduce program: the MapReduce program written by the user is a job. The user configures the job and submits it to the framework, and the framework decomposes it into a series of Map tasks and Reduce tasks and is responsible for distributing and executing them. In this query optimization only the work of the Map stage needs to be done, after which the results that satisfy the user's query are output. The output process is:

According to the query conditions specified by the user program, the input query parameters are first formatted and concatenated into the form of the row key (RowKey) of the HBase table; the start and end positions of the query are then set through the startKey and endKey of the scan, together with the other scan parameters. The job is then submitted to the Hadoop framework through the initTableMapperJob function, and during this process the framework assigns one Map task to each data split. Each Map task iterates over its data records and processes every record with the method specified in the Map function.

In the map function, value.getRow() first obtains the RowKey of a record that satisfies the query, and value.value() obtains the data of the first column. The extracted RowKey and first-column data are then converted into the output format that the user expects; the substring function can be used to cut out the required strings and arrange them in the predetermined format. The qualifying results are output directly to the specified files, and the output path is set with the setOutputPath function.

Compared with the prior art, the beneficial effects of the present invention are:

The condition query optimization method based on HBase tables of the present invention has parallel computing capability. Through a reasonable RowKey design and pre-allocated Regions, and by implementing the Map interface of the condition query, it achieves the goal of optimization: the HDFS of the Hadoop distributed cluster provides ample storage capacity, and Map tasks can read data locally, which reduces the resources consumed by data transfer. In the Hadoop distributed cluster, the number of Map tasks and the corresponding HBase parameters can be set so that operations are processed in parallel, which improves the efficiency of condition queries and provides an optimization approach for them. The condition query process of the method is relatively simple and supports parallel computation, so the method is efficient, practical and easy to popularize.

Brief description of the drawings

Fig. 1 is the flowchart of the HBase table condition query optimization of the present invention.

Fig. 2 is the structural diagram of the HRegionServer.

Fig. 3 is the flowchart of the computing framework without the Reduce stage.

Detailed description of the embodiments

The condition query optimization method based on HBase tables of the present invention is described in detail below with reference to the accompanying drawings.

HBase is a distributed, column-oriented open-source database that uses Hadoop HDFS as its file storage system. As HBase has continued to improve in performance and stability, it has gradually become one of the standards in the big data NoSQL field, so query optimization based on HBase tables is very important. For the condition query optimization of HBase tables, the present invention provides a method that mainly involves three aspects: Region pre-allocation when the table is created, RowKey design, and the design of the query scheme. First, the number of pre-allocated Regions is determined according to the data volume of the table and the configuration of the cluster, and the source data is distributed evenly across these Regions through a reasonable RowKey design (the RowKey is designed around the condition query); this prepares the HBase table for condition queries and provides a suitable query environment. Second, the RowKey is designed according to the application of the condition query. Third, in the Map stage, query performance is improved on the basis of Region pre-allocation, the RowKey and the distributed programming framework: this stage processes the Scan results according to the parameters passed in by the condition query and converts the records that satisfy the query into the format the user expects. Because there is no Reduce stage, the results can be written directly to the output directory, which reduces the limitation imposed by transmission bandwidth.

The specific implementation process of the method is:

Determine the number of pre-allocated table partitions (Regions) according to the data volume of the table and the configuration of the cluster: in Region pre-allocation, the number of Regions is determined from the volume of data to be imported into HBase and the scale of the distributed cluster, and the Regions are then designed and allocated in advance according to the data volume and the RowKey rules. This can significantly reduce the number of Region splits, or even avoid splits altogether. As an HBase Region grows, its size eventually reaches a threshold, at which point the Region splits automatically. More Regions help to ensure concurrency performance. The number of Regions is therefore determined reasonably according to the scale of the distributed cluster environment and created programmatically to suit the application, which improves concurrency performance.

Design a reasonable RowKey according to the application of the condition query: if the RowKey of a record falls within the range between the start key and the end key of a Region, the record is stored on that Region. If, during some period, the RowKeys of a large amount of data fall within one particular RowKey range, the Region corresponding to that range becomes very busy while the other Regions are likely to sit idle, which wastes resources and hurts performance. The RowKey design should therefore take the application of the imported data into account and ensure, as far as possible, that the RowKeys of the data written per unit of time are evenly distributed across the Regions.

Improve query performance according to Region pre-allocation, the RowKey and the distributed programming framework: in the MapReduce computing framework, the Map module queries the HBase table for the records that satisfy the conditions, and the optimization scheme of the condition query is designed according to the characteristics of the condition query and the RowKey. The StartKey and EndKey parameters of Scan are used to improve search performance, and the number of Map tasks is configured reasonably according to the parameters of the distributed cluster, to ensure that the optimization effect is achieved.
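The placement rule described above (a record is stored on the Region whose start key and end key bracket its RowKey) and the resulting hot-spot risk can be illustrated with a small stdlib-only sketch. The split keys, plates and the `region_index` helper are hypothetical, and the start key of each Region is assumed to be inclusive.

```python
from bisect import bisect_right
from collections import Counter


def region_index(split_keys, row_key):
    """Index of the Region holding row_key.

    Region 0 covers keys below split_keys[0]; Region i covers
    [split_keys[i-1], split_keys[i]); the last Region has no upper bound.
    """
    return bisect_right(split_keys, row_key)


split_keys = ["H", "P", "X"]          # three boundaries -> four Regions
plates = ["ABC123", "JKL456", "QVW755", "XYZ999"]

# Plate-first keys spread over the Regions; date-first keys all share a prefix.
plate_first = [p + "20121124120000000" for p in plates]
date_first = ["20121124120000000" + p for p in plates]

spread = Counter(region_index(split_keys, k) for k in plate_first)
hotspot = Counter(region_index(split_keys, k) for k in date_first)
print("plate-first:", dict(spread))
print("date-first: ", dict(hotspot))
```

With these made-up boundaries, keys that start with the date all fall into one Region, while keys that start with the plate are spread over all four, which is the uneven-versus-even distribution the text warns about.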

The MapReduce programming framework, which is intended for parallel operations on large-scale data sets (larger than 1 TB), is the processing stage that finally produces the condition query result. The Map module queries the HBase table according to the conditions to obtain the matching records, and then processes those records into the desired output format. No Reducer is used, which reduces the constraint of bandwidth within the cluster, and the results are written to the specified output directory. Multiple Map tasks can be configured for this process, which improves processing efficiency and greatly improves performance. The MapReduce framework simplifies the programming of parallel programs and provides a programming interface for traversal.

The detailed steps of the optimization method are:

First, in Region pre-allocation, the ZooKeeper properties are set according to the environment and configuration of the cluster, the name of the HBase table to be imported into and its column family name are created, and the HBase table is then created with the Split function that has been written. The Split function determines the number of splits according to the source data format and represents the Regions as a two-dimensional array. Once Region pre-allocation is finished, the source data can be converted into HFile files according to the RowKey design. Because the data has been analyzed and the RowKey designed accordingly, the data in each Region is uniform, and no Region ends up with far more or far less data than the others. Finally, the completebulkload command is used to complete the import, at which point the data has been imported into the HBase table in the predetermined format.
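The text does not give the body of the Split function, so the following is only a sketch of one plausible approach: choose the boundaries at evenly spaced positions of a sorted sample of row keys and return them as a list of byte strings (the Python counterpart of a two-dimensional byte array). The function name, the sampling strategy and the sample keys are all assumptions.

```python
def make_split_keys(sample_keys, num_regions):
    """Pick num_regions - 1 boundary keys from a sample of row keys.

    Boundaries are taken at evenly spaced positions of the sorted,
    de-duplicated sample, so each Region receives a similar share of it.
    """
    ordered = sorted(set(sample_keys))
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    if num_regions > len(ordered):
        raise ValueError("sample has fewer distinct keys than regions")
    step = len(ordered) / num_regions
    # One byte string per boundary: a list of byte arrays.
    return [ordered[int(i * step)].encode("ascii") for i in range(1, num_regions)]


# Hypothetical sample of eight row-key prefixes, split into M = 4 Regions.
sample = [f"P{i:02d}" for i in range(8)]
splits = make_split_keys(sample, 4)
print(splits)
```

A table created with M - 1 boundaries has M Regions from the start, so bulk-loaded data does not have to wait for automatic splits to spread out.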

ZooKeeper, mentioned in the above technical solution, is an official sub-project of Hadoop. It is a reliable coordination system for large-scale distributed systems, and its functions include configuration maintenance, naming services, distributed synchronization and group services. The goal of ZooKeeper is to encapsulate complex, error-prone key services and to offer users an easy-to-use interface and a system that is efficient and functionally stable.

The condition query optimization method for HBase tables is implemented by writing a MapReduce program. The MapReduce program written by the user is a job; the user configures the job and submits it to the framework, and the framework decomposes it into a series of Map tasks and Reduce tasks and is responsible for distributing and executing them. In this query optimization only the work of the Map stage needs to be done, after which the results that satisfy the user's query are output.

The condition query optimization based on HBase tables mainly implements the Map interface. The main processing flow of this module is as follows. According to the query conditions specified by the user program, the input query parameters are first formatted and concatenated into the form of the RowKey of the HBase table; the start and end positions of the query are then set through the startKey and endKey of the scan, together with other scan parameters such as Batch and Caching. The job is then submitted to the Hadoop framework through the initTableMapperJob function, and during this process the framework assigns one Map task to each data split. Each Map task iterates over its data records and processes every record with the method specified in the Map function. In the map function, value.getRow() first obtains the RowKey of a record that satisfies the query, and value.value() obtains the data of the first column. The extracted RowKey and first-column data are then converted into the output format that the user expects; the substring function can be used to cut out the required strings and arrange them in the predetermined format. To preserve performance and avoid the limitation of network bandwidth, no Reduce function is used: the qualifying results are output directly to the specified files, and the output path can be set with the setOutputPath function.

The distributed architecture allows the degree of parallelism to be raised. The number of Map tasks and the HBase parameters related to read operations are then set according to the scale of the Hadoop cluster that has been built: a larger caching value favors reads, and a larger batch value allows more data to be fetched at a time. Setting these parameters reasonably improves performance and achieves the goal of optimization.

The embodiment is illustrated in Fig. 1, Fig. 2 and Fig. 3, and its specific operation process is as follows:

First, the distributed cluster environment is deployed. The hardware of the cluster is 7 servers, each with 96 GB of memory, a 24-core CPU and 12 × 2 TB hard disks; the operating system is CentOS 6.3. The Hadoop components are installed on the servers according to the official documentation, and the HDFS, MapReduce and HBase services are then started in the normal order. In this example, the format of the source data is QVW75520121124120403222,22222,4,3. The first 6 characters of the first column represent the license plate number, the next 8 characters represent the date, and the last 9 characters represent the hour, minute, second and millisecond; the second column represents the checkpoint number. A condition query specifies a license plate number together with a start date-time and an end date-time, and looks up the checkpoints that the vehicle with that plate passed during the period. The source data contains 10 billion records, so improving query efficiency is necessary. Analysis shows that, when the source data is imported into HBase, the RowKey can be designed to suit this query: using the license plate number and the date-time as the RowKey allows the records to be queried to be extracted very quickly. The flowchart of the condition query optimization method based on HBase tables is shown in Fig. 1. First, the number M of Regions is obtained from the Split function, and the Region pre-allocation program calls the Split function to create an HBase table with M Regions; the data is then imported into the HBase table according to the designed RowKey; finally, a MapReduce program is written to perform the condition query and achieve the optimized result.
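To make the record layout above concrete, here is a minimal parsing sketch based on the sample line QVW75520121124120403222,22222,4,3. The function and field names are hypothetical, and the meaning of the third and fourth columns is not stated in the text, so they are ignored.

```python
def parse_record(line):
    """Split one source line into plate, date, time and checkpoint number.

    The first column is 23 characters: 6 for the plate, 8 for the date
    (yyyyMMdd) and 9 for the time (HHmmssSSS). The RowKey is plate + date-time.
    """
    fields = line.strip().split(",")
    first, checkpoint = fields[0], fields[1]
    if len(first) != 23:
        raise ValueError(f"unexpected first column length: {len(first)}")
    plate, date, time_ms = first[:6], first[6:14], first[14:]
    return {
        "plate": plate,
        "date": date,
        "time_ms": time_ms,
        "checkpoint": checkpoint,
        "row_key": plate + date + time_ms,
    }


record = parse_record("QVW75520121124120403222,22222,4,3")
print(record)
```

With this layout the first column can serve as the RowKey unchanged, and the checkpoint number becomes the stored value.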

HRegionServer is the core module of HBase. It is mainly responsible for responding to user I/O requests and for reading and writing data in the HDFS file system. The structure of the HRegionServer is shown in Fig. 2. An HRegionServer manages a series of HRegion objects internally; each HRegion corresponds to one Region of a table and is composed of multiple HStores. Each HStore corresponds to the storage of one Column Family of the table. A Column Family is in effect a unit of centralized storage, so it is most efficient to place columns that share the same I/O characteristics in the same Column Family. Therefore, when importing data into HBase, the data is first distributed evenly over the M Regions through Region pre-allocation, the RowKey is then designed according to the query conditions of the application, and all of the data is imported into the HBase table according to the predefined RowKey.

In the optimized query, the client first communicates with the RegionServer once; this locates the Region on the RegionServer, scans the Region and returns a certain amount of data. That amount is specified by the Batch setting of the scan. The effect of caching is that a single communication, having located the Region, performs the scan caching times. In other words, with these two parameters a single communication can return caching × batch records, which clearly reduces the traffic between the client and the RegionServer.
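A back-of-the-envelope sketch of the effect described above: it follows the text's reading that one communication returns caching × batch records and does not model the real HBase client. The function name and the record counts are made up for illustration.

```python
def round_trips(total_records, caching, batch):
    """Client-to-RegionServer communications needed to fetch total_records,
    assuming each communication returns caching * batch records."""
    if total_records < 0 or caching < 1 or batch < 1:
        raise ValueError("total_records must be >= 0, caching and batch >= 1")
    per_trip = caching * batch
    return (total_records + per_trip - 1) // per_trip  # integer ceiling


# Hypothetical result set of one million matching records.
baseline = round_trips(1_000_000, caching=1, batch=1)
tuned = round_trips(1_000_000, caching=100, batch=10)
print(baseline, tuned)
```

Larger values cut the number of communications but require the client to hold more records in memory at once, which is the space-for-time trade-off mentioned in the background section.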

When writing the MapReduce program, no Reduce function is used, in order to reduce the limitation of bandwidth. The flowchart of the computing framework without the Reduce stage is shown in Fig. 3. While a MapReduce job runs there is one controlling program, called the master, which spawns many worker programs, called workers. The master assigns the M Map tasks to these workers for execution. A worker that has been assigned a Map task reads the relevant input data and parses it into key/value pairs for processing. Because there is no Reduce function, the intermediate key/value pairs produced by the map() function are written directly to the output files.
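The map-only flow described above can be imitated in a few lines of plain Python. This is only a conceptual sketch and not Hadoop code: each input split is handled by one simulated Map task, and its results go straight to that task's own output with no shuffle or Reduce step. The function names, the sample records and the part-file naming are illustrative assumptions.

```python
def run_map_only(splits, map_fn):
    """Run one simulated Map task per split and return (file_name, lines) pairs.

    There is no shuffle and no Reduce: each task's output is final.
    """
    outputs = []
    for task_id, split in enumerate(splits):
        lines = [map_fn(key, value) for key, value in split]
        outputs.append((f"part-m-{task_id:05d}", lines))
    return outputs


def to_line(row_key, checkpoint):
    """Hypothetical map function: emit the row key and checkpoint, tab separated."""
    return f"{row_key}\t{checkpoint}"


# Two hypothetical splits of (row key, checkpoint number) records.
splits = [
    [("QVW75520121124120403222", "22222")],
    [("QVW75520121124130000000", "33333"), ("QVW75520121124140000000", "44444")],
]
results = run_map_only(splits, to_line)
for name, lines in results:
    print(name, lines)
```

Since nothing is exchanged between tasks, the only data movement is reading each split and writing its output, which is the bandwidth saving the text refers to.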

The Map program first parses the start and end date-times: the parse function of the SimpleDateFormat class converts the input parameters into dates, and the format function of the same class then renders the parsed Date in the required form, after which the start date-time and the license plate number can be combined into the form of the RowKey. In this condition query the RowKey is designed as the combination of license plate number + date-time, so the start position of the query is determined by the plate number and the start date-time, and the end position by the plate number and the end date-time; these two values are assigned to the startkey and endkey of the scan respectively. Because the plate number is fixed, the checkpoint records being searched for may all lie in a single Region. Returned data is first cached on the client, and the caching configuration item sets the number of records that the HBase scanner fetches from the server at a time. Setting it to a reasonable value reduces the time spent in next() during the scan, at the cost of client memory to hold the cached rows. This test uses 7 servers with 96 GB of memory each, so caching can be set to a larger value to improve performance.

The task is then started with the initTableMapperJob(sourceTable, scan, Mapper.class, Text.class, Text.class, job) function, where sourceTable is the table that was imported into HBase, that is, the table to be queried; the matching records of that table are retrieved by setting startkey, endkey and the corresponding caching and batch sizes in the scan. Mapper.class mainly contains the implementation of the map function, which processes the qualifying records. The format of an output record is: QVW755 2012-11-24 10:23 22222. The RowKey is obtained with value.getRow() and the checkpoint information with value.value(), and this information is processed into the desired output format. The substring function can be applied to the RowKey to extract the date and time information, which is then converted according to the output format. Because there is no Reduce stage, the results of the Map stage can be written directly to the designated files, and the output directory can be set with setOutputPath.

Finally, the configuration files of the whole cluster are tuned, covering the parameters of the MapReduce tasks, such as the number of Map tasks, and the parameters related to HBase. A read request first looks for the data in the Memstore; if it is not found there, the BlockCache is searched; if it is still not found, the data is read from disk and the result is placed in the BlockCache. A RegionServer has one BlockCache and N Memstores, and the sum of their sizes must not be greater than or equal to heapsize × 0.8, otherwise HBase cannot start. By default the BlockCache is 0.2 and the Memstore is 0.4. For a system that focuses on read response time, the BlockCache can be made larger, for example BlockCache=0.4 and Memstore=0.39, to raise the cache hit rate.
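The RowKey range construction and the output formatting described for the Map program above can be sketched in plain Python. This is not the patent's Java code: `datetime.strptime` and `strftime` stand in for the parse and format functions of SimpleDateFormat, the input format "YYYY-MM-DD HH:MM:SS" is an assumption, and all function names are hypothetical.

```python
from datetime import datetime

PLATE_LEN = 6  # plate length taken from the sample record in the text


def to_key_time(text):
    """Convert 'YYYY-MM-DD HH:MM:SS' into the 17-digit yyyyMMddHHmmssSSS form."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y%m%d%H%M%S") + "000"  # whole seconds: 000 ms


def scan_range(plate, start_text, end_text):
    """Build the (start key, end key) pair as plate + date-time."""
    if len(plate) != PLATE_LEN:
        raise ValueError("plate must be 6 characters")
    return plate + to_key_time(start_text), plate + to_key_time(end_text)


def format_output(row_key, checkpoint):
    """Cut the RowKey into plate, date and HH:MM, then append the checkpoint."""
    plate, stamp = row_key[:PLATE_LEN], row_key[PLATE_LEN:]
    date = f"{stamp[0:4]}-{stamp[4:6]}-{stamp[6:8]}"
    hhmm = f"{stamp[8:10]}:{stamp[10:12]}"
    return f"{plate} {date} {hhmm} {checkpoint}"


start_key, stop_key = scan_range("QVW755", "2012-11-24 10:00:00", "2012-11-24 12:00:00")
line = format_output("QVW75520121124102300000", "22222")
print(start_key, stop_key)
print(line)
```

Because every key for one plate shares the same 6-character prefix, the two keys bound a single contiguous range; whether a record stamped exactly at the end time is returned depends on how the scan treats its end key.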

The above is only an embodiment of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A condition query optimization method based on HBase tables, characterized in that the specific implementation process of the method is:

Determine the number of pre-allocated table partitions (Regions) according to the data volume of the table and the configuration of the cluster: in Region pre-allocation, the number of Regions is determined from the volume of data to be imported into HBase and the scale of the distributed cluster, and the Regions are then designed and allocated in advance according to the data volume and the row key (RowKey) rules; as an HBase Region grows, its size eventually reaches a threshold, at which point the Region splits automatically, and more Regions help to ensure concurrency performance;

2. The condition query optimization method based on HBase tables according to claim 1, characterized in that: the MapReduce programming framework is the processing stage that finally produces the condition query result: the Map module queries the HBase table according to the conditions to obtain the matching records, processes those records into the desired output format, and provides a programming interface for traversal.

3. The condition query optimization method based on HBase tables according to claim 2, characterized in that the detailed steps of the optimization method are:

First, in the pre-allocation of table partitions (Regions), the ZooKeeper properties are set according to the environment and configuration of the cluster, the name of the HBase table to be imported into and its column family name are created, and the HBase table is then created with the Split function that has been written; the Split function determines the number of splits according to the source data format and represents the Regions as a two-dimensional array; once Region pre-allocation is finished, the source data can be converted into HFile files according to the RowKey design; finally, the completebulkload command is used to complete the import, at which point the data has been imported into the HBase table in the predetermined format;

According to the query conditions specified by the user program, the input query parameters are first formatted and concatenated into the form of the row key (RowKey) of the HBase table; the start and end positions of the query are then set through the startKey and endKey of the scan, together with the other scan parameters; the job is then submitted to the Hadoop framework through the initTableMapperJob function, and during this process the framework assigns one Map task to each data split; each Map task iterates over its data records and processes every record with the method specified in the Map function: