CN107506464A - Method for implementing HBase secondary indexes based on ES

Info

Publication number: CN107506464A
Application number: CN201710763058.5A
Authority: CN
Inventors: 雷万钧; 于起超
Original assignee: Wuhan Fiberhome Digtal Technology Co Ltd
Current assignee: Wuhan Fiberhome Digtal Technology Co Ltd
Priority date: 2017-08-30
Filing date: 2017-08-30
Publication date: 2017-12-22

Abstract

The invention discloses a method for implementing HBase secondary indexes based on ES (ElasticSearch) and relates to the technical field of big data. The method is: 1. according to the query service, build secondary index tables in ES for the relevant data columns, where a basic query service corresponds to one secondary index table and a complex query service corresponds to multiple secondary index tables; 2. when querying, first query the index table to obtain the row keys of the matching data, then query the data table by row key to obtain the data; 3. when a relevant column of the data table is updated, update the corresponding secondary index table at the same time. By introducing the ES distributed search engine, each data operation touches only a small number of Regions. This fundamentally reduces the load on the cluster, eases the network communication burden, lowers the dependence on high-performance servers, and improves working efficiency and stability. The method also scales well and has good practical application value.

Description

A method for implementing HBase secondary indexes based on ES

Technical field

The present invention relates to the technical field of big data, and in particular to a method for implementing HBase secondary indexes based on ES.

Background art

With the arrival of the big data era, the volume of data in public security systems is growing geometrically. Massive data challenges the storage and retrieval performance of traditional database technology, and computing statistics along each dimension becomes correspondingly harder. The traditional approach is to write MapReduce jobs or to use tools such as Hive and Pig. These approaches scan the full table, consume considerable cluster resources and network bandwidth, and are unsuitable for ultra-large data volumes. Upgrading physical hardware or optimizing code alone cannot keep pace with the growth of information or the demand for processing efficiency, so researchers have begun to explore new methods of data statistics. How to solve this problem has become a difficulty.

HBase, which runs on the Hadoop platform, is a highly reliable, high-performance, column-oriented and scalable distributed storage system. HBase is an open-source NoSQL database suited to storing and managing all kinds of unstructured and semi-structured loose data. With HBase, a large-scale storage cluster can be built on low-cost servers, which meets the storage needs of public security big data. However, an HBase-based big data storage scheme does not fully solve the problem of efficient data retrieval. In practice, retrieval is often needed on a specific field or on a combination of several fields. Public security big data in particular involves complex and flexible query requirements, and a single row key cannot always satisfy them. A big data retrieval method that can meet these needs is therefore urgently required.

ES stands for ElasticSearch. It makes it easy to build indexes over data. An index can be divided into multiple index shards (the number of shards can be specified by the user and defaults to 5), and the shards are distributed evenly across all available nodes of the cluster, forming a distributed structure that relieves the load on any single node. In an ElasticSearch cluster, replicas can also be configured for each index shard (the number of replicas can likewise be specified by the user and defaults to 1), so that when an index shard fails, a replica can be used to recover the data promptly. ElasticSearch also provides automatic node discovery and fast data recovery: when a new node joins the cluster, ElasticSearch detects it promptly, rebalances the load automatically, and allocates data to the new node; when a node fails, it likewise reallocates data among the remaining available nodes automatically.
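As an illustration of the shard and replica parameters described above, the following minimal sketch builds the settings body that could be sent when creating an index. The setting names `number_of_shards` and `number_of_replicas` are standard ElasticSearch index settings; the index name and the helper function are hypothetical and do not come from the patent.

```python
import json

# Hypothetical index name, chosen only for this illustration.
INDEX_NAME = "hbase_secondary_index"

# Settings body using the default values quoted in the text above
# (5 shards, 1 replica per shard).
settings_body = {
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
    }
}


def total_shard_copies(body):
    """Count primary shards plus replica copies for one index."""
    s = body["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


print(INDEX_NAME)
print(json.dumps(settings_body, indent=2))
print(total_shard_copies(settings_body))
```

With these values the cluster has to place ten shard copies in total, which is what it spreads across the available nodes.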

Summary of the invention

The purpose of the present invention is to solve the above problems of the prior art by providing a method for implementing HBase secondary indexes based on ES.

The purpose of the present invention is achieved as follows:

Specifically, the method comprises the following steps:

1. According to the query service, build secondary index tables in ES for the relevant data columns; a basic query service corresponds to one secondary index table, and a complex query service corresponds to multiple secondary index tables;

A. Create the secondary index table in ES according to the operation type

For a selection query operation, the M data columns involved in the selection query are stored in M secondary index tables respectively, where M is greater than or equal to 1. The row key R of each secondary index table consists of three parts, in order: QUALIFIER, VALUE and ROWKEY, where QUALIFIER is the identifier of the data column in the data table, VALUE is the value of the data column in the data table, and ROWKEY is the row key of the data table;

B. Generate secondary index entries from the data columns and insert them into the secondary index table

For a join query operation, the N data columns involved in the join query are stored in one secondary index table, where N is greater than or equal to 2. The row key R of the secondary index table consists of three parts, in order: PREFIX, VALUE and QUALIFIER, where PREFIX is generated by a hash function and distinguishes the join query groups, VALUE is the value of the data column in the data table, and QUALIFIER is the identifier of the data column in the data table;

The value of the data column in the secondary index table is the ROWKEY of the corresponding data table; this value together with the row key R of the secondary index table forms one entry of the secondary index table. The secondary index table is created in ES, and the association between each data column and its secondary index table is stored in a metadata table. The row key of the metadata table consists of, in order: the table name, column family name and column name of the data table, the operation type of the secondary index table, and the timestamp. The value corresponding to the row key of the metadata table is: the operation type of the secondary index table and the name of the secondary index table;

The operation types of a secondary index table are: selection query operation and join query operation;
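The three key layouts in step 1 (selection index, join index and metadata table) can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify a separator between key parts or a particular hash function, so the `SEP` byte and the truncated MD5 digest used here are placeholders, and all function names are hypothetical.

```python
import hashlib

SEP = "\x00"  # assumed separator between key parts (not specified in the source)


def selection_index_entry(qualifier, value, rowkey):
    """Selection index: row key = QUALIFIER, VALUE, ROWKEY; value = ROWKEY."""
    return SEP.join((qualifier, value, rowkey)), rowkey


def join_prefix(group_name):
    """PREFIX for a join query group; MD5 is an assumed choice of hash function."""
    return hashlib.md5(group_name.encode("utf-8")).hexdigest()[:8]


def join_index_entry(group_name, value, qualifier, rowkey):
    """Join index: row key = PREFIX, VALUE, QUALIFIER; value = ROWKEY."""
    return SEP.join((join_prefix(group_name), value, qualifier)), rowkey


def metadata_entry(table, family, column, op_type, timestamp, index_table):
    """Metadata table: key = table, column family, column, operation type, timestamp."""
    key = SEP.join((table, family, column, op_type, str(timestamp)))
    return key, (op_type, index_table)


print(selection_index_entry("name", "alice", "r1"))
print(join_index_entry("person_dept", "d7", "dept_id", "r1"))
print(metadata_entry("person", "info", "name", "select", 1504051200, "idx_person_name"))
```

Because the selection key starts with QUALIFIER and VALUE, all entries for one column value sort next to each other; because the join key starts with PREFIX and VALUE, columns of the same join group that share a value sort next to each other.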

2. When querying, first query the index table to obtain the row keys of the matching data, then query the data table by row key to obtain the data;

A. Scan the secondary index table to obtain the row keys of the data to be queried;

For each of the M data columns involved in a selection query service, query the metadata table by operation type to obtain the name of the corresponding secondary index table, then query that secondary index table. The specific query process is: using the condition value in the selection query, locate directly the first entry that satisfies the condition and continue scanning until an entry that does not satisfy the condition is found; the scanned entries that satisfy the condition make up the set of ROWKEYs satisfying the query condition on the current data column. If M equals 1, this ROWKEY set is the ROWKEY set of the data to be queried. If M is greater than 1, apply the corresponding set operations to the ROWKEY sets of the different columns according to the logical relationship among the M data columns in the query service: logical AND corresponds to set intersection and logical OR corresponds to set union; the result of the operation is the ROWKEY set of the data to be queried;
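The selection query described above amounts to a prefix scan over a sorted index followed by set operations. The sketch below models each secondary index table as an in-memory sorted list rather than an ES index, so it shows only the logic and not the patent's implementation. The separator `SEP` and the helper names (`scan_equal`, `combine`, `entry`) are assumptions made for this illustration.

```python
from bisect import bisect_left

SEP = "\x00"  # assumed separator between row-key parts (not specified in the source)


def scan_equal(index, qualifier, value):
    """Return the data-table ROWKEYs whose column equals `value`.

    `index` is a list of (index_row_key, data_rowkey) pairs sorted by key,
    where index_row_key = QUALIFIER + SEP + VALUE + SEP + ROWKEY.
    """
    prefix = qualifier + SEP + value + SEP
    start = bisect_left(index, (prefix,))  # first entry with key >= prefix
    rowkeys = set()
    for key, rowkey in index[start:]:
        if not key.startswith(prefix):
            break  # the first non-matching entry ends the scan
        rowkeys.add(rowkey)
    return rowkeys


def combine(rowkey_sets, logic):
    """Logical AND maps to intersection, logical OR maps to union."""
    if logic == "AND":
        return set.intersection(*rowkey_sets)
    if logic == "OR":
        return set.union(*rowkey_sets)
    raise ValueError(f"unknown logic: {logic}")


def entry(qualifier, value, rowkey):
    return SEP.join((qualifier, value, rowkey)), rowkey


# One index table per queried column (M = 2 in this example).
name_index = sorted(entry("name", v, r) for v, r in
                    [("alice", "r1"), ("bob", "r2"), ("alice", "r3")])
city_index = sorted(entry("city", v, r) for v, r in
                    [("wuhan", "r1"), ("wuhan", "r2"), ("beijing", "r3")])

alice = scan_equal(name_index, "name", "alice")
wuhan = scan_equal(city_index, "city", "wuhan")
print(sorted(combine([alice, wuhan], "AND")))
print(sorted(combine([alice, wuhan], "OR")))
```

The scan touches only the contiguous run of matching entries, which is why this approach avoids a full scan of the data table.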

B. Query the data table using the ROWKEY set of the data to be queried

For the N data columns involved in a join query service, query the metadata table by operation type to obtain the name of the corresponding secondary index table; the N columns correspond to the same secondary index table. Query that secondary index table. The specific query process is: from the row key format of the secondary index table, it follows that the entries of N data columns having the same value are arranged contiguously in the secondary index table. If the number of contiguous index entries with the same data column value is N, the ROWKEYs of these N entries form an N-tuple <R1, R2, ..., RN> that satisfies the query condition. Scanning the whole secondary index table yields the set of all N-tuples <R1, R2, ..., RN> that satisfy the condition, and this set is the ROWKEY set of the data to be queried. The corresponding data values are then obtained from the data table for this ROWKEY set through the Get interface method provided by HBase;
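The join (connection) query scan described above can be sketched as grouping contiguous index entries that share the same PREFIX and VALUE, and keeping only the runs of length N. As before, the index is modeled as an in-memory sorted list, and the separator, the sample prefix `g1` and the function name `join_tuples` are assumptions for illustration. The sketch also assumes that column values do not contain the separator.

```python
from itertools import groupby

SEP = "\x00"  # assumed separator between row-key parts (not specified in the source)


def join_tuples(index, n):
    """Scan the whole join index and return the N-tuples of data ROWKEYs.

    `index` is a list of (index_row_key, data_rowkey) pairs sorted by key,
    where index_row_key = PREFIX + SEP + VALUE + SEP + QUALIFIER.
    """
    def prefix_and_value(item):
        prefix, value, _qualifier = item[0].split(SEP)
        return prefix, value

    tuples = []
    # groupby only merges adjacent items, matching the "contiguous entries" rule.
    for _, run in groupby(index, key=prefix_and_value):
        run = list(run)
        if len(run) == n:
            tuples.append(tuple(rowkey for _, rowkey in run))
    return tuples


# Two columns (N = 2) of one join group share the hypothetical prefix "g1".
join_index = sorted([
    (SEP.join(("g1", "42", "users.id")), "u1"),
    (SEP.join(("g1", "42", "orders.uid")), "o7"),
    (SEP.join(("g1", "43", "users.id")), "u2"),  # no matching order, so no tuple
])

print(join_tuples(join_index, 2))
```

Each resulting tuple lists the row keys that would then be fetched from the data table with HBase Get calls.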

3. When a relevant column of the data table is updated, the corresponding secondary index table must be updated at the same time

Determine whether the data table has been updated; if so, update the secondary index table; if not, do not update the secondary index table;

The method of updating the secondary index table comprises the following steps:

I. Update the data table: through the Put method interface provided by HBase on the Hadoop platform, submit the value of the data column, the row key, the column family and the column identifier to complete the update of the data table;

II. Generate secondary index entries: for the data column currently being updated, query the metadata table to obtain the secondary index tables that need updating and their operation types, select the corresponding secondary index table format according to the operation type, and use the updated data in the data table to generate entries that conform to that secondary index table format;

III. Update the secondary index table: through the interface methods provided by the Coprocessor in HBase on the Hadoop platform, submit the value, row key, column family and column identifier of the secondary index table in the format of the secondary index entries generated in step II to complete the update of the secondary index table.
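Steps I to III can be sketched as a single write path. The code below uses plain dictionaries in place of the HBase data table, the metadata table and the ES index tables, so it illustrates only the control flow; it does not call the HBase Put or Coprocessor interfaces. All table names, the separator and the hash used for the join prefix are assumptions. The sketch also does not remove the index entry of a previous value, because the source does not describe that step.

```python
import hashlib

SEP = "\x00"  # assumed separator between key parts (not specified in the source)

# In-memory stand-ins for the data table, metadata table and index tables.
data_table = {}
metadata = {
    ("person", "info", "name"): [("select", "idx_person_name")],
    ("person", "info", "dept_id"): [("join", "idx_person_dept")],
}
index_tables = {"idx_person_name": {}, "idx_person_dept": {}}


def build_index_entry(op_type, index_name, qualifier, value, rowkey):
    """Step II: pick the entry format that matches the operation type."""
    if op_type == "select":
        return SEP.join((qualifier, value, rowkey)), rowkey
    if op_type == "join":
        # Assumed: the group prefix is derived by hashing the index table name.
        prefix = hashlib.md5(index_name.encode("utf-8")).hexdigest()[:8]
        return SEP.join((prefix, value, qualifier)), rowkey
    raise ValueError(f"unknown operation type: {op_type}")


def put_with_index(table, rowkey, family, qualifier, value):
    """Write one cell and update every index registered for its column."""
    # Step I: update the data table.
    data_table[(table, rowkey, family, qualifier)] = value
    updated = []
    # Step II: look up the indexes to update; Step III: write the entries.
    for op_type, index_name in metadata.get((table, family, qualifier), []):
        key, index_value = build_index_entry(op_type, index_name, qualifier, value, rowkey)
        index_tables[index_name][key] = index_value
        updated.append(index_name)
    return updated
```

Calling `put_with_index("person", "r1", "info", "name", "alice")` writes the cell and adds one entry to `idx_person_name`; a column with no metadata entry updates only the data table.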

This method can perform basic update operations without putting significant pressure on the Hadoop cluster, and can implement join queries and selection queries between data tables fairly efficiently for each specific service. It thereby supports complex business requirements as well as statistics over both daily incremental data and total data.

This method has the following features:

1) secondary index tables are simple to create;

2) index files are written together with data files, which ensures consistency;

3) the time needed for data statistics is greatly reduced.

The present invention has the following advantages and positive effects:

By introducing the ES distributed search engine, each data operation touches only a small number of Regions. This fundamentally reduces the load on the cluster, eases the network communication burden, lowers the dependence on high-performance servers, and improves working efficiency and stability. The method also scales well and has good practical application value.

Brief description of the drawings

Fig. 1 is the overall flow chart of the method;

Fig. 2 is the selection query flow chart of step 2 of the method;

Fig. 3 is the join query flow chart of step 2 of the method.

Explanation of terms:

1. ES: short for ElasticSearch, an open-source, distributed, RESTful search engine built on Lucene. It is designed for the cloud and provides real-time search that is stable, reliable, fast, and easy to install and use. It supports indexing data using JSON over HTTP.

We build a website or an application and want to add search to it, and then it hits us: getting search working is hard. We want our search solution to be fast; we want zero configuration and a completely free search schema; we want to be able to index data simply using JSON over HTTP; we want our search server to be always available; we want to be able to start with one machine and scale to hundreds; we want real-time search; we want simple multi-tenancy; and we want a solution that is built for the cloud. Elasticsearch aims to solve all these problems and more.

2. HBase: an open-source, non-relational distributed database (NoSQL) modeled on Google's BigTable and implemented in Java. It is part of the Apache Software Foundation's Hadoop project, runs on top of the HDFS file system, and provides BigTable-like capabilities for Hadoop. HBase implements the per-column compression, in-memory operation and Bloom filters described in the BigTable paper. HBase tables can serve as the input and output of MapReduce jobs, and data can be accessed through the Java API or through the REST, Avro or Thrift APIs. Although HBase performance has improved markedly, it cannot directly replace an SQL database. It is now used by a number of data-driven websites.

Detailed description of the embodiments

The invention is described in detail below with reference to the accompanying drawings and embodiments:

1. Method (overall)

As shown in Fig. 1, the overall flow is:

First, build the secondary index table on the columns to be indexed; then, on each call, determine whether the operation is a data update or a data query;

If it is an update, update the secondary index table at the same time as the data table;

If it is a query, first query the secondary index, then use the key values retrieved from the secondary index to fetch the relevant data columns from the data table.

2. Step 2

1) Selection query

As shown in Fig. 2, the workflow of a selection query is:

For a compound selection query service, the compound selection condition is first split into single query conditions. The set of entries satisfying each single condition is then obtained through the row key of the index table. Finally, set operations are applied to the entry sets of the single conditions, which yields all secondary index entries that satisfy the compound query condition, and the row keys of all qualifying data table rows are extracted from these entries. When obtaining the set of secondary index entries that satisfy a single condition, the row key of the index table is used to locate the first qualifying entry directly, and the scan continues downward until an entry that does not qualify is found; the scanned entries are then merged into the set of secondary index entries that satisfy that single condition.
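The decomposition of a compound selection condition into single conditions can be sketched as a recursive evaluation over a condition tree. The tuple-based tree format (`"EQ"`, `"AND"`, `"OR"`), the `evaluate` function and the dictionary-backed `lookup` are all hypothetical; in a real system `lookup` would be the index scan for one single condition.

```python
def evaluate(condition, lookup):
    """Evaluate a compound selection condition to a set of data ROWKEYs.

    `condition` is ("EQ", column, value) for a single condition, or
    ("AND" | "OR", sub_condition, ...) for a compound one.
    `lookup(column, value)` returns the ROWKEYs matching one single condition.
    """
    op = condition[0]
    if op == "EQ":
        _, column, value = condition
        return set(lookup(column, value))
    if op in ("AND", "OR"):
        parts = [evaluate(sub, lookup) for sub in condition[1:]]
        # AND maps to set intersection, OR maps to set union.
        return set.intersection(*parts) if op == "AND" else set.union(*parts)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical per-condition results standing in for index scans.
single_results = {
    ("name", "alice"): {"r1", "r3"},
    ("city", "wuhan"): {"r1", "r2"},
    ("age", "30"): {"r4"},
}


def lookup(column, value):
    return single_results.get((column, value), set())


query = ("OR",
         ("AND", ("EQ", "name", "alice"), ("EQ", "city", "wuhan")),
         ("EQ", "age", "30"))
print(sorted(evaluate(query, lookup)))
```

Only the leaves of the tree touch the index; every inner node is a pure set operation on row keys.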

2) Join query

As shown in Fig. 3, the workflow of a join query is:

For a compound join query service, the query can be divided into two join query groups. When the data columns of the same join query group are inserted into the index table, the hash function produces the same PREFIX value for them; the value corresponding to row key R is the row key of that column in the data table. At query time, the secondary index table is scanned in full and the sets of qualifying tuples are recorded; set operations are then applied to these tuple sets to obtain the row key values of the qualifying data. While the qualifying tuple sets are being recorded, a tuple is added to the tuple set only if it is composed of contiguous entries and satisfies the condition of the join query group.

Claims

1. A method for implementing HBase secondary indexes based on ES, characterized in that:

1. According to the query service, build secondary index tables in ES for the relevant data columns; a basic query service corresponds to one secondary index table, and a complex query service corresponds to multiple secondary index tables;

A. Create the secondary index table in ES according to the operation type

For a selection query operation, the M data columns involved in the selection query are stored in M secondary index tables respectively, where M is greater than or equal to 1; the row key R of each secondary index table consists of three parts, in order: QUALIFIER, VALUE and ROWKEY, where QUALIFIER is the identifier of the data column in the data table, VALUE is the value of the data column in the data table, and ROWKEY is the row key of the data table;

B. Generate secondary index entries from the data columns and insert them into the secondary index table

For a join query operation, the N data columns involved in the join query are stored in one secondary index table, where N is greater than or equal to 2; the row key R of the secondary index table consists of three parts, in order: PREFIX, VALUE and QUALIFIER, where PREFIX is generated by a hash function and distinguishes the join query groups, VALUE is the value of the data column in the data table, and QUALIFIER is the identifier of the data column in the data table;

The value of the data column in the secondary index table is the ROWKEY of the corresponding data table; this value together with the row key R of the secondary index table forms one entry of the secondary index table; the secondary index table is created in ES, and the association between each data column and its secondary index table is stored in a metadata table; the row key of the metadata table consists of, in order: the table name, column family name and column name of the data table, the operation type of the secondary index table, and the timestamp; the value corresponding to the row key of the metadata table is: the operation type of the secondary index table and the name of the secondary index table;

The operation types of a secondary index table are: selection query operation and join query operation;

2. When querying, first query the index table to obtain the row keys of the matching data, then query the data table by row key to obtain the data;

A. Scan the secondary index table to obtain the row keys of the data to be queried;

For each of the M data columns involved in a selection query service, query the metadata table by operation type to obtain the name of the corresponding secondary index table, then query that secondary index table; the specific query process is: using the condition value in the selection query, locate directly the first entry that satisfies the condition and continue scanning until an entry that does not satisfy the condition is found; the scanned entries that satisfy the condition make up the set of ROWKEYs satisfying the query condition on the current data column; if M equals 1, this ROWKEY set is the ROWKEY set of the data to be queried; if M is greater than 1, apply the corresponding set operations to the ROWKEY sets of the different columns according to the logical relationship among the M data columns in the query service: logical AND corresponds to set intersection and logical OR corresponds to set union; the result of the operation is the ROWKEY set of the data to be queried;

B. Query the data table using the ROWKEY set of the data to be queried

For the N data columns involved in a join query service, query the metadata table by operation type to obtain the name of the corresponding secondary index table; the N columns correspond to the same secondary index table; query that secondary index table, the specific query process being: from the row key format of the secondary index table, it follows that the entries of N data columns having the same value are arranged contiguously in the secondary index table; if the number of contiguous index entries with the same data column value is N, the ROWKEYs of these N entries form an N-tuple <R1, R2, ..., RN> that satisfies the query condition; scanning the whole secondary index table yields the set of all N-tuples <R1, R2, ..., RN> that satisfy the condition, and this set is the ROWKEY set of the data to be queried; the corresponding data values are then obtained from the data table for this ROWKEY set through the Get interface method provided by HBase;

3. When a relevant column of the data table is updated, the corresponding secondary index table must be updated at the same time

Determine whether the data table has been updated; if so, update the secondary index table; if not, do not update the secondary index table;

The method of updating the secondary index table comprises the following steps:

I. Update the data table: through the Put method interface provided by HBase on the Hadoop platform, submit the value of the data column, the row key, the column family and the column identifier to complete the update of the data table;

II. Generate secondary index entries: for the data column currently being updated, query the metadata table to obtain the secondary index tables that need updating and their operation types, select the corresponding secondary index table format according to the operation type, and use the updated data in the data table to generate entries that conform to that secondary index table format;

III. Update the secondary index table: through the interface methods provided by the Coprocessor in HBase on the Hadoop platform, submit the value, row key, column family and column identifier of the secondary index table in the format of the secondary index entries generated in step II to complete the update of the secondary index table.

2. The method for implementing HBase secondary indexes based on ES according to claim 1, characterized in that the workflow of the selection query in step 2 is:

For a compound selection query service, the compound selection condition is first split into single query conditions; the set of entries satisfying each single condition is then obtained through the row key of the index table; finally, set operations are applied to the entry sets of the single conditions, which yields all secondary index entries that satisfy the compound query condition, and the row keys of all qualifying data table rows are extracted from these entries; wherein, when obtaining the set of secondary index entries that satisfy a single condition, the row key of the index table is used to locate the first qualifying entry directly, and the scan continues downward until an entry that does not qualify is found; the scanned entries are then merged into the set of secondary index entries that satisfy that single condition.

3. The method for implementing HBase secondary indexes based on ES according to claim 1, characterized in that the workflow of the join query in step 2 is:

For a compound join query service, the query can be divided into two join query groups; when the data columns of the same join query group are inserted into the index table, the hash function produces the same PREFIX value for them; the value corresponding to row key R is the row key of that column in the data table; at query time, the secondary index table is scanned in full and the sets of qualifying tuples are recorded; set operations are then applied to these tuple sets to obtain the row key values of the qualifying data; wherein, while the qualifying tuple sets are being recorded, a tuple is added to the tuple set only if it is composed of contiguous entries and satisfies the condition of the join query group.