CN106777343A

CN106777343A - Incremental distributed index system and method

Info

Publication number: CN106777343A
Application number: CN201710028299.5A
Authority: CN
Inventors: 张韶峰; 陈浪仙; 陈贺巍; 邹迎春
Original assignee: Bai Rong (beijing) Financial Information Service Ltd By Share Ltd
Current assignee: Bai Rong (beijing) Financial Information Service Ltd By Share Ltd
Priority date: 2017-01-16
Filing date: 2017-01-16
Publication date: 2017-05-31

Abstract

An embodiment of the invention provides an incremental distributed index system and method. The method includes: obtaining the data stored in an HBase database, together with the primary index value and attribute value of each data item; generating a secondary index Key for each data item; and generating a secondary index table from the secondary index Keys of the data items, so that the secondary index table can be searched region by region at query time, where the regions are divided according to secondary index Key values.

Description

Incremental distributed index system and method

Technical field

The invention belongs to the technical field of computer software, and in particular relates to an incremental distributed index system and method.

Background art

With the development of society, the volume of data is growing geometrically, and a great deal of useful information can be obtained by analyzing massive data. Data mining for massive data has therefore become a hot technology in recent years. To find the data that satisfies given conditions more quickly under big-data conditions, the common approach at present is to preprocess the data, create an index over it, and query through the index.

An index is a logical list that pairs the set of values in one or several columns of a table with pointers to the data pages that physically hold those values. When a value is indexed, a particular value can be found by searching the index and then following the pointer to the row that contains it. Suppose we want to look up the string Adradvark in the data shown in Fig. 1. Without an index, all the data in the DataPage must be scanned. With the index shown in Fig. 1, only the first and second IndexPage need to be scanned to find the row containing Adradvark, which speeds up the query. An index therefore allows fast data access and can effectively improve efficiency for queries that sort or group data. The schemes most commonly used today for storing data and creating indexes are Lucene and HBase:

Lucene is an open source project under the Apache Software Foundation that provides a Java API for full-text indexing and retrieval. Lucene consists of two parts, an index engine and a search engine. For a document (Document) containing multiple fields (Field), Lucene's index engine can tokenize the text content of the document's fields and build a keyword index. Once the index is built, Lucene's search engine can run keyword-based queries against specific fields. Lucene supports many query modes, including fuzzy queries and grouped queries. For query results, Lucene computes the ranking with a ranking algorithm based on the vector space model. Lucene's advantages are that it supports many kinds of queries and queries data quickly. Its drawbacks are that it only supports updating a whole document, not part of a document, and that creating and merging indexes is relatively time-consuming.

HBase is a column-oriented distributed storage system built on HDFS; its computing and storage capacity can be increased by adding inexpensive PCs. A table in HBase can hold billions of rows and millions of columns. HBase supports column (family) oriented storage and access control, and columns (families) can be retrieved independently. The data in each cell can have multiple versions; by default the version number is assigned automatically and is the timestamp at which the cell was inserted. HBase is well suited to storing structured or semi-structured data and to highly concurrent key-value queries. The drawback of HBase is that it only supports queries by row key and by row-key range; it cannot query by column attribute.

Summary of the invention

In view of the evident drawbacks of the prior-art Lucene and HBase schemes, the purpose of the embodiments of the invention is to provide an effective and efficient incremental distributed index system and method.

To solve the above problems, an embodiment of the invention proposes an incremental distributed index method, including:

Step 1: obtain the data stored in the HBase database, together with the primary index value and attribute value of each data item; for each data item, generate a secondary index Key by the following method:

{initial Key value}_{original attribute value}_{original Key value};

where the initial Key value is the starting value of the primary index values of all the data; the original attribute value is the attribute value of the data item; and the original Key value is the primary index value of the data item;

Step 2: generate a secondary index table from the secondary index Keys of the data items, so that the secondary index table can be searched region by region at query time, where the regions are divided according to secondary index Key values.
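As a rough illustration of step 2, the sketch below assigns secondary index Keys to regions by comparing each key against a sorted list of split points, which is the general way a range-partitioned store routes a key to a region. The split points, the sample keys, and the helper name `region_for` are assumptions chosen for illustration; none of them comes from the patent text.

```python
from bisect import bisect_right

# Hypothetical split points: region 0 holds keys < "001_B",
# region 1 holds "001_B" <= key < "001_C", region 2 holds the rest.
SPLIT_POINTS = ["001_B", "001_C"]


def region_for(index_key, split_points=SPLIT_POINTS):
    """Return the number of the region whose key range contains index_key."""
    return bisect_right(split_points, index_key)


# Sample secondary index Keys in {initial Key}_{attribute}_{original Key} form.
index_keys = ["001_A_001", "001_A_004", "001_B_002", "001_C_003"]

# Group the keys by the region they fall into.
regions = {}
for key in index_keys:
    regions.setdefault(region_for(key), []).append(key)
```

Because the attribute value sits near the front of the key, keys with the same attribute value tend to land in the same region under this assumed layout, so a query on one attribute value touches few regions.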

The secondary index Key and the secondary index table are stored in the HBase database.

The method further includes:

Step 3: receive a query request based on the sql language, and generate from it a query condition for the secondary index table of the HBase database; query the secondary index table of the HBase database according to the query condition and return the data that satisfies it, so that the client collates and merges the data returned by each HBase region for output.

Step 3 specifically includes:

Step 31: when a query is received, determine the type of the query; if it is a derived query based on aggregation (Count), jump to step 32; if it is a paging query on a single region (Region), jump to step 33;

Step 32: according to the query condition of the aggregation-based derived query, generate a query condition for each region, so that each region is queried according to its corresponding query condition; then merge the data returned by the regions and return the result, and the procedure ends;

Step 33: for a paging query on a single region, determine the region targeted by this request; query the region that satisfies the query condition and return the query result, and the procedure ends.
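The dispatch in steps 31 to 33 can be sketched roughly as follows. This is a minimal in-memory model, not the patent's implementation: regions are plain Python lists, the query condition is a predicate function, and the names `handle_query`, `query_region`, and the query kinds "count" and "page" are all hypothetical.

```python
def query_region(region_rows, predicate):
    """Return the rows of one region that satisfy the predicate."""
    return [row for row in region_rows if predicate(row)]


def handle_query(kind, regions, predicate, region_id=None):
    """Dispatch a query by type, loosely following steps 31 to 33."""
    if kind == "count":
        # Step 32: run the query on every region, then merge the results.
        merged = []
        for rows in regions.values():
            merged.extend(query_region(rows, predicate))
        return merged
    if kind == "page":
        # Step 33: query only the region that this request targets.
        return query_region(regions[region_id], predicate)
    raise ValueError(f"unknown query kind: {kind}")


def has_attribute_a(index_key):
    """Assumed condition: the attribute part of the index key equals 'A'."""
    return index_key.split("_")[1] == "A"


regions = {
    0: ["001_A_001", "001_A_004"],
    1: ["001_B_002"],
}
all_a = handle_query("count", regions, has_attribute_a)
page_b = handle_query("page", regions, lambda key: True, region_id=1)
```

In a real deployment the per-region work would run on the servers that host each region and only the merge would run on the client; the sketch keeps everything in one process for readability.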

The method of querying a region is specifically:

Step a: regenerate the query condition from the query condition and the id of the last record, and build the query-condition syntax tree;

Step b: determine whether the results obtained so far are fewer than required; if so, return the original data corresponding to all row keys in the query result, and the procedure ends; if not, jump to step c;

Step c: obtain the next row key according to the query-condition syntax tree and determine whether the next row key is empty; if so, add the next row key to the retrieval result and jump to step b.
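The paged scan of steps a to c can be sketched as a resumable loop over sorted keys. The translated conditions in steps b and c are ambiguous, so this sketch assumes the conventional reading: the loop stops once the page is full or once no further matching row key exists. A simple key range stands in for the query-condition syntax tree, and the function name `scan_page` and its parameters are hypothetical.

```python
from bisect import bisect_left, bisect_right


def scan_page(sorted_keys, start, stop, last_key=None, limit=2):
    """Collect up to `limit` keys in [start, stop), resuming after last_key."""
    # Step a: resume just after the last returned key, or at the range start.
    if last_key is None:
        pos = bisect_left(sorted_keys, start)
    else:
        pos = bisect_right(sorted_keys, last_key)

    results = []
    while True:
        # Step b (assumed reading): stop once the page is full.
        if len(results) >= limit:
            return results
        # Step c (assumed reading): fetch the next row key inside the range;
        # None means there is no further matching key.
        in_range = pos < len(sorted_keys) and sorted_keys[pos] < stop
        next_key = sorted_keys[pos] if in_range else None
        if next_key is None:
            return results
        results.append(next_key)
        pos += 1


keys = ["001_A_001", "001_A_004", "001_A_005", "001_B_002"]
first_page = scan_page(keys, "001_A", "001_B")
second_page = scan_page(keys, "001_A", "001_B", last_key=first_page[-1])
```

Passing the last key of one page as `last_key` for the next call plays the role of the "last record id" in step a, so the client does not need to keep a cursor open between pages.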

An embodiment of the invention also proposes an incremental distributed index system, including:

A secondary index Key generation module, configured to obtain the data stored in the HBase database, together with the primary index value and attribute value of each data item, and to generate a secondary index Key for each data item by the following method;

A secondary index table generation module, configured to generate a secondary index table from the secondary index Keys of the data items, so that the secondary index table can be searched region by region at query time, where the regions are divided according to secondary index Key values.

The system further includes:

A query module, configured to receive a query request based on the sql language and to generate from it a query condition for the secondary index table of the HBase database; to query the secondary index table of the HBase database according to the query condition; and to return the data that satisfies it, so that the client collates and merges the data returned by each HBase region for output.

The query module is configured to perform the following operations:

The method of querying a region is specifically:

The above technical solution of the invention has the following beneficial effects: it proposes an incremental distributed index system and method that builds a secondary index on top of the primary index of the HBase database, thereby providing queries by Key and by Key value range and further improving retrieval performance.

Brief description of the drawings

Fig. 1 is an example of an index in the prior art;

Fig. 2a and Fig. 2b are an example of a typical primary index and secondary index;

Fig. 3 is a flow chart of the overall query flow of an embodiment of the invention;

Fig. 4 is a flow chart of the server-side query.

Detailed description

To make the technical problem to be solved, the technical solution, and the advantages of the invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.

For databases with large data volumes, complex query conditions, and incrementally updated data, the existing Lucene and HBase schemes cannot meet the requirements well. The embodiments of the invention therefore use a scheme of HBase + secondary index + sql.

An embodiment of the invention proposes an incremental distributed index method, including:

The method further includes:

Step 3 specifically includes:

The method of querying a region is specifically:

A specific example is described below with reference to Fig. 2a and Fig. 2b, which show a typical primary index and secondary index. Fig. 2a is a typical primary index table (Primary User Table), which includes a row key column (rowkey) and an attribute column (cf1:col). As can be seen from Fig. 2a, the original attribute of row key 001 is A, that of row key 002 is B, that of row key 003 is C, that of row key 004 is A, that of row key 005 is A, that of row key 006 is B, and that of row key 007 is B. Fig. 2b is the secondary index table (Secondary User Table), which is the secondary index generated for the attribute column (cf1:col) of Fig. 2a. As stated above, the rule for generating the secondary index is:

{initial key value}_{original attribute value}_{original key value}

As can be seen from Fig. 2a, the initial key value of every data item is 001, the original attribute value is the item's value in the attribute column (cf1:col), and the original key value is the item's value in the row key column (rowkey) of Fig. 2a. The secondary index generated for the database table of Fig. 2a is therefore the one shown in Fig. 2b.
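The worked example can be reproduced with a few lines of code. The sketch below builds the secondary index keys for the seven rows listed for Fig. 2a using the stated rule. It assumes the initial key value is the smallest row key (which gives 001 here, matching the text), and it sorts the resulting keys because HBase keeps rows ordered by key; the function name `secondary_index_key` is hypothetical.

```python
# Primary table as listed for Fig. 2a: rowkey -> value of column cf1:col.
primary_table = {
    "001": "A", "002": "B", "003": "C", "004": "A",
    "005": "A", "006": "B", "007": "B",
}


def secondary_index_key(initial_key, attribute, rowkey):
    """Build {initial key value}_{original attribute value}_{original key value}."""
    return f"{initial_key}_{attribute}_{rowkey}"


# Assumption: the initial key value is the smallest rowkey, i.e. 001 here.
initial_key = min(primary_table)

# Keys are stored in sorted order, so the index table is sorted as well.
secondary_table = sorted(
    secondary_index_key(initial_key, attribute, rowkey)
    for rowkey, attribute in primary_table.items()
)
```

After sorting, all keys that share an attribute value are adjacent (for example the three keys beginning with 001_A), which is what makes a range scan over the index table sufficient to answer an attribute query.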

As can be seen, the embodiment of the invention is built on an HBase database, and secondary index + sql queries are implemented through custom HBase plug-ins. Because the underlying layer is HBase, the system can easily be scaled horizontally, and data can easily be updated through HBase's PUT and DELETE operations. Creating secondary indexes on HBase column attributes ensures fast response when querying through the secondary index. In addition, sql statements are simple and flexible to use, and readily support complex conditional queries and aggregate statistical queries.

The principle of the secondary index is as follows: an index is built for the attribute of each data item in the HBase table, and the index is itself stored in the HBase database as a key. The rule for generating the index key is {initial key value}_{original attribute value}_{original key value}. Because HBase supports queries by key and by key range, a range query on the index key quickly finds the rows of the corresponding original data, which improves query speed. As shown in Fig. 2a, the Primary User Table is the original data table, and it has the attribute cf1:col1. Suppose we now want to query the data with cf1:col1=A. Before the secondary index is created, all the data in the original table must be scanned. After the secondary index table is created, only the data in the index table between keys 001_A and 001_A~ needs to be scanned to find all the rowkeys that satisfy the condition.

Through sql parsing, user-group profiling can support more complex conditional queries, rather than only the basic key, key-prefix, and key-range queries. The syntax of sql is also widely known, so it is simple to use and easy to learn.

To build the secondary index on top of HBase and query data with sql, custom HBase coprocessor and observer interfaces must be implemented. When a client sends a query request to HBase, HBase loads the custom plug-in implementation, parses the sql, generates a sql syntax tree, queries the secondary index, and returns the data that satisfies the condition; finally the client collates and merges the data returned by each HBase node for output. The overall flow is shown in Fig. 3. After the HBase plug-in receives a sql query, it generates the query condition for the HBase secondary index from the query condition of the sql statement. All the rowkeys that satisfy the condition can be found through the secondary index, and the original data values can then easily be retrieved by rowkey. This flow is shown in Fig. 4.
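The range scan between 001_A and 001_A~ can be illustrated with an in-memory stand-in for the index table. The sketch below is not HBase client code: a sorted Python list plays the role of the secondary index table, `bisect` plays the role of the range scan, and the helper name `rowkeys_for_attribute` is hypothetical. The data is the seven-row example described for Fig. 2a.

```python
from bisect import bisect_left, bisect_right

# Original table: rowkey -> attribute value.
primary_table = {"001": "A", "002": "B", "003": "C", "004": "A",
                 "005": "A", "006": "B", "007": "B"}

# Stand-in for the secondary index table, kept sorted by key.
index_keys = sorted(f"001_{attr}_{rk}" for rk, attr in primary_table.items())


def rowkeys_for_attribute(index_keys, initial_key, attribute):
    """Range-scan the sorted index between '<prefix>' and '<prefix>~'."""
    start = f"{initial_key}_{attribute}"
    stop = start + "~"  # '~' sorts after '_', digits and letters in ASCII
    lo = bisect_left(index_keys, start)
    hi = bisect_right(index_keys, stop)
    # The original rowkey is the last '_'-separated part of each index key.
    return [key.rsplit("_", 1)[1] for key in index_keys[lo:hi]]


# Find the rowkeys with attribute A, then fetch the original rows by rowkey.
matches = rowkeys_for_attribute(index_keys, "001", "A")
rows = {rk: primary_table[rk] for rk in matches}
```

Only the three index entries that begin with 001_A are visited, whereas answering the same query without the index would require examining all seven rows of the original table.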

The above are preferred embodiments of the invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principles of the invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the invention.

Claims

1. An incremental distributed index method, characterized by including:

where the initial Key value is the starting value of the primary index values of all the data; the original attribute value is the attribute value of the data item; and the original Key value is the primary index value of the data item;

Step 2: generate a secondary index table from the secondary index Keys of the data items, so that the secondary index table can be searched region by region at query time, where the regions are divided according to secondary index Key values.

2. The incremental distributed index method according to claim 1, characterized in that the secondary index Key and the secondary index table are stored in the HBase database.

3. The incremental distributed index method according to claim 1, characterized in that the method further includes:

4. The incremental distributed index method according to claim 3, characterized in that step 3 specifically includes:

Step 31: when a query is received, determine the type of the query; if it is a derived query based on aggregation (Count), jump to step 32; if it is a paging query on a single region (Region), jump to step 33;

Step 32: according to the query condition of the aggregation-based derived query, generate a query condition for each region, so that each region is queried according to its corresponding query condition; then merge the data returned by the regions and return the result, and the procedure ends;

Step 33: for a paging query on a single region, determine the region targeted by this request; query the region that satisfies the query condition and return the query result, and the procedure ends.

5. The incremental distributed index method according to claim 4, characterized in that the method of querying a region is specifically:

Step b: determine whether the results obtained so far are fewer than required; if so, return the original data corresponding to all row keys in the query result, and the procedure ends; if not, jump to step c;

Step c: obtain the next row key according to the query-condition syntax tree and determine whether the next row key is empty; if so, add the next row key to the retrieval result and jump to step b.

6. An incremental distributed index system, characterized by including:

A secondary index Key generation module, configured to obtain the data stored in the HBase database, together with the primary index value and attribute value of each data item, and to generate a secondary index Key for each data item by the following method;

A secondary index table generation module, configured to generate a secondary index table from the secondary index Keys of the data items, so that the secondary index table can be searched region by region at query time, where the regions are divided according to secondary index Key values.

7. The incremental distributed index system according to claim 6, characterized in that the secondary index Key and the secondary index table are stored in the HBase database.

8. The incremental distributed index system according to claim 6, characterized in that the system further includes:

A query module, configured to receive a query request based on the sql language and to generate from it a query condition for the secondary index table of the HBase database; to query the secondary index table of the HBase database according to the query condition; and to return the data that satisfies it, so that the client collates and merges the data returned by each HBase region for output.

9. The incremental distributed index system according to claim 8, characterized in that the query module is configured to perform the following operations:

10. The incremental distributed index system according to claim 9, characterized in that the method of querying a region is specifically: