CN106326429A

CN106326429A - HBase seconds-level query scheme based on Solr

Info

Publication number: CN106326429A
Application number: CN201610723701.7A
Authority: CN
Inventors: 童浩; 杨凡
Original assignee: Wuhan Optics Valley Information Technologies Co Ltd
Current assignee: Wuhan Optics Valley Information Technologies Co Ltd
Priority date: 2016-08-25
Filing date: 2016-08-25
Publication date: 2017-01-11

Abstract

The invention discloses an HBase seconds-level query scheme based on Solr. The scheme comprises the following steps: inserting raw data into the HBase column-oriented database; periodically invoking MapReduce to incrementally update the index in Solr, which obtains the raw data and stores it on the Solr server in Solr's own document format; and, at query time, accessing the Solr server, where the index has been built, searching the index first to obtain rowkeys, and then querying the required result data from the HBase main table by rowkey. The scheme offers fast and accurate search. By combining Solr with HBase, massive data can be retrieved within seconds. Solr's built-in pagination returns the rowkeys of one page of data at a time; because each page holds only a small number of records, the subsequent HBase query by that page's rowkeys responds quickly and can be kept at the millisecond level.

Description

An HBase seconds-level query scheme based on Solr

Technical field

The present invention relates to the technical field of HBase, and in particular to an HBase seconds-level query scheme based on Solr.

Background technology

Solr is a complete search service from Apache, built on Lucene. It consists of two core components: an indexing component and a search component. The indexing component builds indexes over the data that needs to be searchable, and the search component queries the index in response to client requests. Solr is a high-performance full-text search server developed in Java 5 and based on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and provides a complete administration interface, making it an excellent full-text search engine. Documents are added to a search collection as XML over HTTP. Querying the collection is likewise done over HTTP, with the response returned as XML or JSON. Its key features include efficient and flexible caching, vertical search, highlighted search results, improved availability through index replication, a powerful data schema for defining fields and types and for configuring text analysis, and a web-based administration interface.

HBase is the Hadoop family's distributed storage solution for massive data. When massive data stored in HBase is queried by rowkey, seconds-level responses can be achieved, giving a reasonably good user experience. In more complex scenarios, however, such as multi-condition queries over the data, the solutions that HBase itself provides are far from ideal.

For multi-condition queries, HBase itself currently offers two mainstream solutions:

1. Manually building index tables through a coprocessor when data is inserted

HBase has two kinds of coprocessor: Observer and Endpoint. An Observer is similar to a trigger in a relational database, and an Endpoint is similar to a stored procedure in a relational database.

An Observer is used when indexing a table with a coprocessor: when data is inserted into an HBase table, an Observer operation is added so that, before each record is inserted, our custom business logic is called to generate the records for the indexed fields in the index table.

A multi-condition query against HBase is then split into two steps: first, the index table is queried according to the query conditions to obtain the rowkeys of the matching results; second, the main table is queried by those rowkeys for the data we need.

This scheme has several significant problems:

(1) Coprocessors are unstable

In the existing version of HBase, our own tests of index generation through a coprocessor showed that if the code throws an exception while the index is being built, the whole Hadoop cluster goes down.

(2) Indexing slows down data insertion

Because inserting data and building the index form one synchronous process, the indexing operation affects insertion speed to a large extent.

(3) The fields to be indexed must be fixed before data is inserted and cannot be changed later

Another problem with indexing at insertion time is that all fields requiring an index must be determined once, up front. If an index is later needed on a new field, data that has already been inserted will not be re-indexed.

(4) One index table per indexed field is inefficient

To keep the indexes flexible to use later, a separate index table is usually created for each individual field, with the field value as the rowkey of the index table and the rowkey of the original table as a field of the index table. Although this makes flexible multi-condition queries convenient, it increases the number of index tables, and when a query has many conditions, several index-table lookups are needed, which noticeably slows the query response.
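The per-field index-table approach can be illustrated with a minimal in-memory sketch. Plain Python dictionaries stand in for HBase tables here; the sample rows and the helper names `build_index_tables` and `multi_condition_query` are hypothetical and are not part of HBase or of the patent text.

```python
from collections import defaultdict

# Stand-in for the HBase main table: rowkey -> row (hypothetical sample data).
main_table = {
    "r1": {"city": "Wuhan", "type": "sensor"},
    "r2": {"city": "Wuhan", "type": "camera"},
    "r3": {"city": "Beijing", "type": "sensor"},
}

def build_index_tables(table, fields):
    """Build one index table per field: field value -> set of main-table rowkeys."""
    index_tables = {f: defaultdict(set) for f in fields}
    for rowkey, row in table.items():
        for f in fields:
            index_tables[f][row[f]].add(rowkey)
    return index_tables

def multi_condition_query(table, index_tables, conditions):
    """First look up one index table per condition, then fetch rows by rowkey."""
    rowkeys = None
    lookups = 0
    for field, value in conditions.items():
        hits = index_tables[field].get(value, set())
        lookups += 1  # each extra condition costs one more index-table lookup
        rowkeys = hits if rowkeys is None else rowkeys & hits
    rows = {rk: table[rk] for rk in sorted(rowkeys or ())}
    return rows, lookups

index_tables = build_index_tables(main_table, ["city", "type"])
rows, lookups = multi_condition_query(
    main_table, index_tables, {"city": "Wuhan", "type": "sensor"}
)
print(rows, lookups)
```

The number of index-table lookups grows with the number of conditions, which is the cost described in problem (4).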

2. Filtering on the server side with HBase's built-in filters

HBase ships with many types of filter, and custom filters can also be defined. When a filter is used in a query, HBase filters the query result data on the server side of the cluster according to the filter's logic.

However, this scheme also has a problem: the filter still has to scan the data, so it is inefficient.

Although the filtering happens on the server side, all data matching the rowkey query condition must still be read and then scanned to discard the rows that do not satisfy the filter condition. When the original query covers a large volume of data, this process consumes a great deal of server memory and the scan takes a long time; the time spent on this process alone already fails the requirement for seconds-level queries.
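The cost described above can be made concrete with a small simulation. This is only a sketch: a dictionary stands in for an HBase table, and `scan_with_filter` is a hypothetical helper rather than the HBase client API. It counts how many rows must be read to return a handful of matches.

```python
def scan_with_filter(table, start_row, stop_row, predicate):
    """Scan rowkeys in [start_row, stop_row) and apply the predicate to every row.

    Returns the matching rowkeys and the number of rows that had to be read.
    """
    matched, rows_read = [], 0
    for rowkey in sorted(table):
        if start_row <= rowkey < stop_row:
            rows_read += 1  # every row in the rowkey range is read
            if predicate(table[rowkey]):
                matched.append(rowkey)
    return matched, rows_read

# Hypothetical data: 1000 rows, of which only every 100th has status "error".
table = {
    f"row{i:04d}": {"status": "error" if i % 100 == 0 else "ok"}
    for i in range(1000)
}
matched, rows_read = scan_with_filter(
    table, "row0000", "row1000", lambda row: row["status"] == "error"
)
print(len(matched), rows_read)
```

Only a small fraction of the rows match, yet every row in the rowkey range is read, so the work grows with the size of the range rather than with the size of the result.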

Because some characteristics of both schemes above cannot meet our needs, we propose an HBase seconds-level query scheme based on Solr.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the prior art by proposing an HBase seconds-level query scheme based on Solr.

An HBase seconds-level query scheme based on Solr comprises the following steps:

Step 1: Insert the raw data into the HBase column-oriented database, keeping HBase's original mode, with no other changes required;

Step 2: Obtain the raw data and store it on the Solr server in Solr's own document format. Once a document is created, Solr analyzes it automatically; after the analysis, Solr builds an inverted index with the segmented words as keys and the documents as values, which forms the index. MapReduce is invoked periodically to incrementally update the index built in Solr;

Step 3: At query time, access the Solr server, where a separate index is built on each field that needs to be queried; search the index, obtain the rowkeys from it, and then query the HBase column-oriented database by those rowkeys to produce the required result data.
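The three steps above can be sketched end to end with in-memory stand-ins. This is an illustration under stated assumptions, not the patented implementation: `hbase_table` and `inverted_index` are plain dictionaries standing in for HBase and for the Solr index, and `tokenize`, `put`, `index_rows` and `query` are hypothetical helpers.

```python
import re
from collections import defaultdict

hbase_table = {}                   # stand-in for the HBase main table: rowkey -> row
inverted_index = defaultdict(set)  # stand-in for the Solr index: (field, term) -> rowkeys

def tokenize(text):
    """Very simple analyzer: lowercase, then split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def put(rowkey, row):
    """Step 1: write the raw data to the main table only; no index work here."""
    hbase_table[rowkey] = row

def index_rows(rowkeys, fields):
    """Step 2: read the stored rows and add their terms to the inverted index."""
    for rk in rowkeys:
        for f in fields:
            for term in tokenize(hbase_table[rk].get(f, "")):
                inverted_index[(f, term)].add(rk)

def query(conditions):
    """Step 3: resolve rowkeys from the index, then fetch the rows by rowkey."""
    rowkeys = None
    for f, term in conditions.items():
        hits = inverted_index.get((f, term.lower()), set())
        rowkeys = hits if rowkeys is None else rowkeys & hits
    return [hbase_table[rk] for rk in sorted(rowkeys or ())]

put("r1", {"title": "HBase rowkey design", "owner": "alice"})
put("r2", {"title": "Solr index design", "owner": "alice"})
put("r3", {"title": "Solr rowkey paging", "owner": "bob"})
index_rows(list(hbase_table), ["title", "owner"])

results = query({"title": "design", "owner": "alice"})
print([row["title"] for row in results])
```

The write path in `put` never touches the index, which mirrors the point of Step 1: insertion into HBase stays unchanged and indexing happens separately.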

Preferably, after Solr builds the index, the index is compressed and stored on the disk of the Solr server, and a Map can be used to cache part of it.

Preferably, the word segmenter used for the Solr index can be optimized, with word segmentation customized for the business scenario so that industry-specific terms are extracted.

Preferably, Solr has a built-in pagination function and can return the rowkeys of one page of data at a time.

Preferably, Solr can be combined with a mature in-memory database, with the index stored directly in the in-memory database.

Preferably, the index-building operation of Solr can also be executed in an HBase coprocessor.
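The domain-specific word segmentation mentioned among the preferred options can be illustrated with a dictionary-driven segmenter. The text does not say which segmentation algorithm is used; the sketch below assumes forward maximum matching, and the function `segment` and both dictionaries are hypothetical.

```python
def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary
    word, falling back to a single character when nothing matches."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical dictionaries: a general one, and one extended with industry terms.
general_dictionary = {"solr", "cloud", "row", "key"}
industry_dictionary = general_dictionary | {"solrcloud", "rowkey"}

text = "solrcloudrowkey"
general_tokens = segment(text, general_dictionary)
industry_tokens = segment(text, industry_dictionary)
print(general_tokens)
print(industry_tokens)
```

With the extended dictionary, industry terms are kept whole as index keys instead of being split into generic words, so a search for such a term matches directly.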

The HBase seconds-level query scheme based on Solr proposed by the present invention offers fast search and high accuracy. By combining Solr with HBase, it achieves seconds-level retrieval of massive data. Solr's built-in pagination returns the rowkeys of one page of data at a time; because the amount of data on each page is very limited, the subsequent HBase query by that page's rowkeys responds very quickly and can be kept at the millisecond level.
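The paging behaviour described above can be sketched as follows. Solr exposes paging through its `start` and `rows` query parameters; the code below only imitates that behaviour on an in-memory list, and `search_page`, `fetch_page` and the sample data are hypothetical.

```python
def search_page(matching_rowkeys, start, rows):
    """Return one page of rowkeys plus the total hit count, imitating the
    effect of Solr's start/rows parameters on an in-memory result list."""
    return matching_rowkeys[start:start + rows], len(matching_rowkeys)

def fetch_page(table, matching_rowkeys, page, page_size):
    """Fetch only the rows of the requested page from the main table by rowkey."""
    page_keys, total = search_page(matching_rowkeys, page * page_size, page_size)
    return [table[rk] for rk in page_keys], total

# Hypothetical data: 10,000 stored rows, of which every 10th matches the query.
table = {f"row{i:05d}": {"id": i} for i in range(10000)}
matching = [f"row{i:05d}" for i in range(0, 10000, 10)]

rows, total = fetch_page(table, matching, page=2, page_size=20)
print(len(rows), total, rows[0]["id"], rows[-1]["id"])
```

However many rows match overall, each request touches only `page_size` rowkeys in the main table, which is why the per-page HBase lookup stays fast.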

Brief description of the drawings

Fig. 1 is a flow chart of data storage;

Fig. 2 is a flow chart of data query.

Detailed description of the invention

The present invention is further explained below with reference to specific embodiments.

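With reference to Figs. 1-2, the HBase seconds-level query scheme based on Solr proposed by the present invention comprises the following steps:

Step 1: Insert the raw data into the HBase column-oriented database, keeping HBase's original mode, with no other changes required;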

Step 2: Periodically invoke MapReduce to incrementally update the index in Solr. The raw data inserted into the HBase column-oriented database is first obtained and stored on the Solr server in Solr's own document format. Once a document is created, Solr analyzes it automatically; this involves segmenting the content of the document into words with a specific word segmentation technique. After segmentation, Solr builds an inverted index with the segmented words as keys and the documents as values;

Step 3: At query time, access the Solr server, where a separate index is built on each field that needs to be queried. After the index is built, Solr compresses it and stores it on the disk of the Solr server, and a Map can be used to cache part of it. The index is searched and the rowkeys are obtained from it; Solr's built-in pagination returns the rowkeys of one page of data at a time. The HBase column-oriented database is then queried by those rowkeys to produce the required result data.
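The periodic incremental update in Step 2 can be sketched as a job that indexes only rows written since its previous run. The text does not specify how new rows are detected; the sketch assumes each row carries a write timestamp `ts` and that the job keeps a watermark. The class `IncrementalIndexer` and the sample rows are hypothetical, and a real deployment would run this logic as a scheduled MapReduce job rather than in one process.

```python
from collections import defaultdict

class IncrementalIndexer:
    """Indexes only rows whose write timestamp is newer than the watermark."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> rowkeys (stand-in for the Solr index)
        self.watermark = 0             # newest write timestamp already indexed

    def run(self, table):
        """Index the rows written after the watermark; return how many were processed."""
        pending = [(rk, row) for rk, row in table.items() if row["ts"] > self.watermark]
        for rk, row in pending:
            for term in row["text"].lower().split():
                self.index[term].add(rk)
        if pending:
            self.watermark = max(row["ts"] for _, row in pending)
        return len(pending)

# Stand-in for the HBase main table (hypothetical rows with write timestamps).
table = {
    "r1": {"text": "HBase rowkey lookup", "ts": 1},
    "r2": {"text": "Solr inverted index", "ts": 2},
}
indexer = IncrementalIndexer()
first_run = indexer.run(table)    # indexes r1 and r2

table["r3"] = {"text": "Solr paging", "ts": 3}
second_run = indexer.run(table)   # indexes only r3
third_run = indexer.run(table)    # nothing new to index
print(first_run, second_run, third_run)
```

Because indexing runs on a schedule and outside the write path, inserts into HBase are not slowed by index maintenance, at the price of newly written rows becoming searchable only after the next run.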

In the present invention, the index-building operation of Solr can also be executed in an HBase coprocessor, and Solr can be combined with a mature in-memory database, with the index stored directly in the in-memory database.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. An HBase seconds-level query scheme based on Solr, characterized in that it comprises the following steps:

Step 1: Insert the raw data into the HBase column-oriented database, keeping HBase's original mode, with no other changes required;

Step 2: Obtain the raw data and store it on the Solr server in Solr's own document format. Once a document is created, Solr analyzes it automatically; after the analysis, Solr builds an inverted index with the segmented words as keys and the documents as values, which forms the index. MapReduce is invoked periodically to incrementally update the index built in Solr;

Step 3: At query time, access the Solr server, where a separate index is built on each field that needs to be queried; search the index, obtain the rowkeys from it, and then query the HBase column-oriented database by those rowkeys to produce the required result data.

2. The HBase seconds-level query scheme based on Solr according to claim 1, characterized in that, after Solr builds the index, the index is compressed and stored on the disk of the Solr server, and a Map can be used to cache part of it.

3. The HBase seconds-level query scheme based on Solr according to claim 1, characterized in that the word segmenter used for the Solr index can be optimized, with word segmentation customized for the business scenario so that industry-specific terms are extracted.

4. The HBase seconds-level query scheme based on Solr according to claim 1, characterized in that Solr has a built-in pagination function and can return the rowkeys of one page of data at a time.

5. The HBase seconds-level query scheme based on Solr according to claim 1, characterized in that Solr can be combined with a mature in-memory database, with the index stored directly in the in-memory database.

6. The HBase seconds-level query scheme based on Solr according to claim 1, characterized in that the index-building operation of Solr can also be executed in an HBase coprocessor.