CN109165222A

CN109165222A - Coprocessor-based HBase secondary index creation method and system

Info

Publication number: CN109165222A
Application number: CN201810945470.3A
Authority: CN
Inventors: 郭昆; 许玲玲; 郑建宁; 黄长贵; 周健倩
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2018-08-20
Filing date: 2018-08-20
Publication date: 2019-01-08

Abstract

The present invention relates to a coprocessor-based HBase secondary index creation method and system. Index data and main data are logically separated according to a pre-partitioning and random hashing strategy, and are physically separated by storing them in different column families of the same table. The system includes an insertion module which, on data insertion, constructs the secondary index according to an index configuration file, inserts the index data into the secondary index area, and inserts the newly generated main data into the main data area. It further includes a query module which, on data query, constructs the query condition according to the index configuration file, queries the secondary index areas of the Regions in parallel to obtain the index row keys, and then asynchronously fetches the main data on each Region. The proposed method and system allow HBase to be searched by field efficiently and quickly.

Description

A coprocessor-based HBase secondary index creation method and system

Technical field

The present invention relates to the technical field of databases, and in particular to a coprocessor-based HBase secondary index creation method and system.

Background art

The arrival of the big data era has driven rapid development in both the theory and the engineering practice of fields such as data storage and processing. Data volumes continue to grow explosively, and traditional data storage and management methods can no longer meet the efficiency demands of large-scale data management. Non-relational NoSQL databases have developed rapidly as a result. HBase, a representative NoSQL database, is widely used for data storage and management across industries. Compared with a traditional database, however, HBase can only be queried by row key or by a range of row keys, which is inflexible. In most cases it is necessary to query HBase by column value, but HBase builds no index on columns, so such queries must use filters. A filter has to scan the full table, so query efficiency is low; this degrades the performance of column-value queries on HBase tables and wastes machine resources.

At present, the main schemes for building a secondary index on HBase are those based on an independent third-party engine, those based on coprocessors, those based on in-memory interception, and those based on complemental clustering.

Secondary index schemes based on an independent third-party engine include ElasticSearch and Solr. Both are full-text search servers built on Lucene; the fields of the HBase table that are involved in queries are indexed in ElasticSearch or Solr. This approach, however, requires maintaining a separate index cluster, which adds overhead.

The main coprocessor-based secondary index scheme is Hindex, proposed by Huawei. It stores data and index in separate tables: after data is inserted into the main table, a coprocessor writes the indexed columns into a separate index table. However, this scheme requires modifying the HBase source code and is therefore highly invasive. In addition, when a Region splits, the split points of the index Region and the data Region must be kept logically consistent.

The main scheme based on in-memory interception is IHBase, proposed by Yoram Kulbak and Dan Washusen. It builds the index at Region level rather than table level: when the memstore is full and is flushed to disk, the request is intercepted and an index is built for the data in memory, and the index is stored in the table as another column family. However, it requires restructuring HBase, and it has not been updated in recent years.

The main scheme based on complemental clustering is CCIndex, proposed by the Institute of Computing Technology of the Chinese Academy of Sciences. In this scheme the detailed data is also stored in the index table, so there is no need to go back to the original table with the obtained row key to look up the data. However, maintaining the data in the index table is complicated when data is deleted or updated. Moreover, because it disables the replication mechanism of the underlying HDFS, data reliability is reduced.

Summary of the invention

The purpose of the present invention is to provide a coprocessor-based HBase secondary index creation method and system that overcome the defects of the prior art.

To achieve the above purpose, the technical solution of the present invention is a coprocessor-based HBase secondary index creation method, in which index data and main data are logically separated according to a pre-partitioning and random hashing strategy, and are physically separated by means of different column families in the same table.

In an embodiment of the present invention, each Region is logically divided, according to pre-partitioning and random hashing, into a secondary index area and a main data area; the secondary index area stores the index data and the main data area stores the main data.

In an embodiment of the present invention, the table is divided into two column families: one stores the index data and the other stores the main data.

Further, a coprocessor-based HBase secondary index creation system is provided, comprising:

an insertion module, configured to, on data insertion, construct the secondary index according to the index configuration file, insert the index data into the secondary index area, and insert the newly generated main data into the main data area;

a query module, configured to, on data query, construct the query condition according to the index configuration file, query the secondary index areas of the Regions in parallel to obtain the index row keys, and then asynchronously fetch the main data on each Region.

In an embodiment of the present invention, the insertion module performs index insertion according to the following steps:

Step S11: when the table is created, calculate the number of Regions and pre-partition the table into these Regions;

Step S12: after pre-partitioning, obtain the startKey-endKey interval of each Region;

Step S13: override the prePut() hook function of the coprocessor, and use it to obtain the rowkey of the main data being inserted;

Step S14: randomly generate a value within the startKey-endKey interval of a Region, denoted hashKey; prepend hashKey to the rowkey of the main data being inserted, and denote the resulting new row key hashRowkey;

Step S15: in the prePut() hook function, read the index configuration file to obtain the indexed fields, then parse the put object to determine whether an indexed field is being inserted; if so, obtain the startKey of the Region whose interval contains hashKey, and prepend the value of the indexed field to the row key of the inserted main data to form the index row key; otherwise, insert the put object directly into the main data area;

Step S16: obtain the start row key, the index column value, and the length value of the inserted row key, and concatenate them in the following format: start key_index column value_inserted row key; insert the result into a column of the index column family;

Step S17: insert the index data into the index column family and the main data into the data column family; the name of the index column family is fixed: it may not be modified, and no other column family may share its name.
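The key construction in steps S14 to S16 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: keys are modeled as fixed-width decimal strings, the separator between hashKey and the original rowkey is assumed to be the same underscore used in step S16, and the index cell is assumed to store the main-data row key. The names `build_puts`, `region_start_for`, `fmt`, `WIDTH` and `key_space` are hypothetical.

```python
import random
from bisect import bisect_right

SEP = "_"   # separator shown in the step S16 format
WIDTH = 4   # assumed fixed key width so string order matches numeric order


def fmt(n):
    """Format a numeric key as a zero-padded string."""
    return str(n).zfill(WIDTH)


def region_start_for(hash_key, region_starts):
    """Return the startKey of the Region whose interval contains hash_key.

    region_starts must be sorted and begin at 0.
    """
    return region_starts[bisect_right(region_starts, hash_key) - 1]


def build_puts(rowkey, columns, indexed_fields, region_starts, key_space, rng=random):
    """Return (column_family, row_key, cells) tuples for one client put."""
    # Step S14: draw a random hashKey and prepend it to the original rowkey.
    hash_key = rng.randrange(key_space)
    hash_rowkey = fmt(hash_key) + SEP + rowkey
    puts = [("data", hash_rowkey, dict(columns))]

    # Step S15: only indexed fields that are present in the put yield index rows.
    start = fmt(region_start_for(hash_key, region_starts))
    for field in indexed_fields:
        if field in columns:
            # Step S16: startKey_indexColumnValue_insertedRowKey
            index_rowkey = SEP.join([start, str(columns[field]), hash_rowkey])
            # Assumption: the index cell holds the main-data row key.
            puts.append(("index", index_rowkey, {field: hash_rowkey}))
    return puts


if __name__ == "__main__":
    for put in build_puts("user1", {"name": "alice", "age": "30"},
                          ["name"], [0, 250, 500, 750], 1000):
        print(put)
```

Because the index row key begins with the Region's own startKey, the index row sorts into the same Region as the main row it points to, which is what allows the index area and the main data area to share a Region.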

In an embodiment of the present invention, step S11 further includes the following steps:

Step S111: determine the number of pre-partitions of the cluster; the number of pre-partitions of a single node is calculated as follows:

where M is the memory size of the RegionServer; F is the fraction of RegionServer memory allotted to the memstore; S is the size of the memstore, in MB; and A is the number of column families in the table;

Step S112: determine the number of nodes in the cluster; the total number of pre-partitions of the cluster is calculated as follows:

R=P*N

where R is the total number of pre-partitions in the cluster, P is the number of pre-partitions per node, and N is the number of nodes in the cluster.
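The sizing in steps S111 and S112 can be sketched in Python. The per-node formula itself is not reproduced in this text, so the sketch assumes the commonly used HBase sizing rule P = (M × F) / (S × A), which is consistent with the variable definitions above; it should be read as an assumption, not as the patent's exact formula. The function names, the 16 GB memory figure and the three-node cluster are hypothetical; 0.4 and 128 MB are the usual HBase defaults for the memstore fraction and memstore size.

```python
def regions_per_node(m_mb, f, s_mb, a):
    """Assumed rule: P = (M * F) / (S * A), rounded down, never below 1.

    m_mb: RegionServer memory in MB (M)
    f:    fraction of that memory given to the memstore (F)
    s_mb: memstore size in MB (S)
    a:    number of column families in the table (A)
    """
    if s_mb <= 0 or a <= 0:
        raise ValueError("memstore size and column-family count must be positive")
    return max(1, int((m_mb * f) // (s_mb * a)))


def total_regions(p, n):
    """R = P * N, as stated in step S112."""
    return p * n


if __name__ == "__main__":
    # Hypothetical cluster: 3 nodes, 16 GB per RegionServer, two column families.
    p = regions_per_node(m_mb=16 * 1024, f=0.4, s_mb=128, a=2)
    print(p, total_regions(p, n=3))
```

Rounding down keeps the combined memstore demand of all Regions on a node within the memory budget assumed by the rule.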

In an embodiment of the present invention, step S13 further includes the following steps:

Step S131: the client issues a put request;

Step S132: the request is dispatched to the appropriate RegionServer and Region;

Step S133: the coprocessor intercepts the request; the prePut() hook function is then called on each RegionObserver loaded on the table, which intercepts the put request, parses the put object, and obtains the rowkey value;

Step S134: if the put request is not intercepted by the prePut() hook function, it continues on to the Region and is processed there.
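The control flow of steps S131 to S134 can be modeled with a small Python sketch. It imitates only the order of events (hooks run before the default write, and a hook may take over the write); it is not the HBase RegionObserver API, which is a Java interface. The classes `Region`, `PassThroughObserver` and `PrefixingObserver`, and the convention that a hook returns `True` when it has handled the put itself, are assumptions made for illustration.

```python
class Region:
    """Toy Region: a dict of rows plus a list of registered observers."""

    def __init__(self):
        self.rows = {}
        self.observers = []

    def put(self, rowkey, columns):
        # S133: every registered observer sees the put before it is applied.
        for observer in self.observers:
            if observer.pre_put(self, rowkey, columns):
                return  # the hook handled the write itself
        # S134: not intercepted, so the Region processes the put as usual.
        self.rows[rowkey] = dict(columns)


class PassThroughObserver:
    """Reads the rowkey but lets the default write proceed."""

    def __init__(self):
        self.seen = []

    def pre_put(self, region, rowkey, columns):
        self.seen.append(rowkey)
        return False


class PrefixingObserver:
    """Rewrites the row key and performs the write in place of the default."""

    def __init__(self, prefix):
        self.prefix = prefix

    def pre_put(self, region, rowkey, columns):
        region.rows[self.prefix + "_" + rowkey] = dict(columns)
        return True


if __name__ == "__main__":
    region = Region()
    region.observers.append(PrefixingObserver("0042"))
    region.put("user1", {"name": "alice"})  # S131/S132: client put reaches the Region
    print(region.rows)
```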

In an embodiment of the present invention, the query module performs a query according to the following steps:

Step S21: when the client sets a query condition, the query component parses the query condition, reads the index configuration file, and determines whether the queried field is present in the index configuration file;

Step S22: if it is present, the field is indexed, and the index querier constructs the start row key and end row key of the query condition; the start row key is the startKey of each Region concatenated with the query condition, and the end row key is the startKey of each Region concatenated with a value greater than the query condition; go to step S24;

Step S23: if it is not present, perform a full table scan and go to step S27;

Step S24: after the new query condition is constructed, scan the secondary index area of each Region concurrently using multiple threads;

Step S25: find the index row keys that satisfy the condition in the secondary index area, then parse the row key of the record's main data from the index column value;

Step S26: once the index row keys of a Region have been found, asynchronously fetch the main data records on that Region in batches;

Step S27: the results are aggregated at the query component and then returned to the client.
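The query path of steps S22 and S24 to S27 can be sketched in Python. This is a simplified model under several assumptions: each Region is a plain dictionary with a `start` key, an `index` map from index row key to main-data row key, and a `data` map; the query is an equality match; the scan range includes the trailing underscore so that only exact matches fall inside it; and a thread pool stands in for the asynchronous batched fetch. The names `scan_bounds`, `scan_index` and `query` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_bounds(region_start, value):
    """Step S22: start key is startKey + condition; stop key is just above it."""
    start = region_start + "_" + value + "_"
    stop = start[:-1] + chr(ord(start[-1]) + 1)  # smallest string above the prefix range
    return start, stop


def scan_index(region, value):
    """Steps S24/S25: range-scan one Region's index area, return main row keys."""
    start, stop = scan_bounds(region["start"], value)
    return [main_key
            for index_key, main_key in sorted(region["index"].items())
            if start <= index_key < stop]


def query(regions, value, max_workers=4):
    """Steps S24-S27: query all Regions in parallel and merge the results."""
    def one_region(region):
        main_keys = scan_index(region, value)
        # Step S26: fetch the main rows from the same Region.
        return [region["data"][key] for key in main_keys]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_region = list(pool.map(one_region, regions))
    # Step S27: aggregate the per-Region results for the client.
    return [row for rows in per_region for row in rows]


if __name__ == "__main__":
    regions = [
        {"start": "0000",
         "index": {"0000_alice_0042_u1": "0042_u1", "0000_bob_0100_u2": "0100_u2"},
         "data": {"0042_u1": {"name": "alice"}, "0100_u2": {"name": "bob"}}},
        {"start": "0250",
         "index": {"0250_alice_0300_u3": "0300_u3"},
         "data": {"0300_u3": {"name": "alice"}}},
    ]
    print(query(regions, "alice"))
```

Each Region needs only one bounded scan over its own index area, so the cost of the lookup grows with the number of matching index rows rather than with the size of the table.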

Compared with the prior art, the present invention has the following beneficial effects:

1. In data insertion speed, the coprocessor-based secondary index outperforms third-party-based secondary indexes;

2. In data query speed, the coprocessor-based secondary index uses concurrent queries and is faster than third-party-based secondary indexes;

3. In space overhead, the coprocessor-based secondary index is itself stored in HBase, whose underlying storage uses the HFile format; HFiles are compressed and stored on HDFS, which saves disk space relative to third-party-based secondary indexes.

Brief description of the drawings

Fig. 1 is a flowchart of insertion by the insertion module of the present invention.

Fig. 2 is a flowchart of querying by the query module of the present invention.

Detailed description of the embodiments

The technical solution of the present invention is described in detail below with reference to the accompanying drawings.

The present invention proposes a coprocessor-based HBase secondary index creation method, in which index data and main data are logically separated according to a pre-partitioning and random hashing strategy, and are physically separated by means of different column families in the same table.

In this embodiment, each Region is logically divided, according to pre-partitioning and random hashing, into a secondary index area and a main data area; the secondary index area stores the index data and the main data area stores the main data.

In this embodiment, the table is divided into two column families: one stores the index data and the other stores the main data.

Further, in this embodiment, as shown in Fig. 1, the insertion module performs index insertion according to the following steps:

Step S17: insert the index data into the index column family and the main data into the data column family; the name of the index column family is fixed: it may not be modified, and no other column family may share its name. That is, the name cannot be changed, and no other column family may be named index.

Further, in this embodiment, step S11 further includes the following steps:

where M is the memory size of the RegionServer; F is the fraction of RegionServer memory allotted to the memstore, which defaults to 0.4 in HBase; S is the size of the memstore, in MB, which defaults to 128 in HBase; and A is the number of column families in the table. In this embodiment, the table contains at least 2 column families: the column family index, which stores the secondary index, and the column family data, which stores the main data;

R=P*N

Further, in this embodiment, step S13 further includes the following steps:

Step S131: the client issues a put request;

Step S132: the request is dispatched to the appropriate RegionServer and Region;

Further, in this embodiment, as shown in Fig. 2, the query module performs a query according to the following steps:

Step S22: if it is present, the field is indexed, and the index querier constructs the start row key and end row key of the query condition; the start row key is the startKey of each Region concatenated with the query condition, and the end row key is the startKey of each Region concatenated with a value just greater than the query condition; go to step S24;

The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention falls within the scope of protection of the present invention, provided that the resulting function does not go beyond the scope of the technical solution.

Claims

1. A coprocessor-based HBase secondary index creation method, characterized in that index data and main data are logically separated according to a pre-partitioning and random hashing strategy, and are physically separated by means of different column families in the same table.

2. The coprocessor-based HBase secondary index creation method according to claim 1, characterized in that each Region is logically divided, according to pre-partitioning and random hashing, into a secondary index area and a main data area; the secondary index area stores the index data and the main data area stores the main data.

3. The coprocessor-based HBase secondary index creation method according to claim 1, characterized in that the table is divided into two column families: one stores the index data and the other stores the main data.

4. A coprocessor-based HBase secondary index creation system, characterized by comprising:

an insertion module, configured to, on data insertion, construct the secondary index according to the index configuration file, insert the index data into the secondary index area, and insert the newly generated main data into the main data area;

a query module, configured to, on data query, construct the query condition according to the index configuration file, query the secondary index areas of the Regions in parallel to obtain the index row keys, and then asynchronously fetch the main data on each Region.

5. The coprocessor-based HBase secondary index creation system according to claim 4, characterized in that the insertion module performs index insertion according to the following steps:

Step S13: override the prePut() hook function of the coprocessor, and use it to obtain the rowkey of the main data being inserted;

Step S14: randomly generate a value within the startKey-endKey interval of a Region, denoted hashKey; prepend hashKey to the rowkey of the main data being inserted, and denote the resulting new row key hashRowkey;

Step S15: in the prePut() hook function, read the index configuration file to obtain the indexed fields, then parse the put object to determine whether an indexed field is being inserted; if so, obtain the startKey of the Region whose interval contains hashKey, and prepend the value of the indexed field to the row key of the inserted main data to form the index row key; otherwise, insert the put object directly into the main data area;

Step S16: obtain the start row key, the index column value, and the length value of the inserted row key, and concatenate them in the following format: start key_index column value_inserted row key; insert the result into a column of the index column family;

Step S17: insert the index data into the index column family and the main data into the data column family; the name of the index column family is fixed: it may not be modified, and no other column family may share its name.

6. The coprocessor-based HBase secondary index creation system according to claim 5, characterized in that step S11 further includes the following steps:

where M is the memory size of the RegionServer; F is the fraction of RegionServer memory allotted to the memstore; S is the size of the memstore, in MB; and A is the number of column families in the table;

R=P*N

where R is the total number of pre-partitions in the cluster, P is the number of pre-partitions per node, and N is the number of nodes in the cluster.

7. The coprocessor-based HBase secondary index creation system according to claim 5, characterized in that step S13 further includes the following steps:

Step S131: the client issues a put request;

Step S132: the request is dispatched to the appropriate RegionServer and Region;

Step S133: the coprocessor intercepts the request; the prePut() hook function is then called on each RegionObserver loaded on the table, which intercepts the put request, parses the put object, and obtains the rowkey value;

Step S134: if the put request is not intercepted by the prePut() hook function, it continues on to the Region and is processed there.

8. The coprocessor-based HBase secondary index creation system according to claim 4, characterized in that the query module performs a query according to the following steps:

Step S21: when the client sets a query condition, the query component parses the query condition, reads the index configuration file, and determines whether the queried field is present in the index configuration file;

Step S22: if it is present, the field is indexed, and the index querier constructs the start row key and end row key of the query condition; the start row key is the startKey of each Region concatenated with the query condition, and the end row key is the startKey of each Region concatenated with a value greater than the query condition; go to step S24;

Step S24: after the new query condition is constructed, perform one scan of the secondary index area of each Region concurrently using multiple threads;

Step S25: find the index row keys that satisfy the condition in the secondary index area, then parse the row key of the record's main data from the index column value;

Step S26: once the index row keys of a Region have been found, asynchronously fetch the main data records on that Region in batches;