CN109977113A

CN109977113A - A Bloom filter-based HBase index design method for medical imaging data

Info

Publication number: CN109977113A
Application number: CN201910070748.1A
Authority: CN
Inventors: 王丹; 陈文杰; 赵文兵; 杜金莲; 付利华; 杜晓琳; 苏航
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2019-01-25
Filing date: 2019-01-25
Publication date: 2019-07-05

Abstract

The present invention discloses a Bloom filter-based HBase multi-level index design method for medical imaging data. Each hash function is allocated its own separate bit vector, which lowers the false positive rate of the Bloom filter; the filter serves as the first step, judging whether the data to be retrieved is in the set. An improvement to the existing HBase secondary index is then proposed, with reducing the number of network I/O operations as the main optimization point. A distinctive row key design ensures that the data table and the index table are distributed on the same Region, and a sampling hashing method is devised to solve the Region write hotspot problem, so that load balancing is exploited and retrieval is accelerated to a certain extent.

Description

A Bloom filter-based HBase index design method for medical imaging data

Technical field

The invention belongs to the field of computer software, and in particular relates to a Bloom filter-based multi-level index design method for medical imaging data.

Background technique

With the continuing development of medical informatization, data volumes have increased sharply, and the relational database storage schemes used by PACS (Picture Archiving and Communication Systems) can hardly meet daily storage and retrieval demands. Using big data distributed platforms such as Hadoop together with NoSQL databases to store and retrieve massive medical imaging data has become one of the effective ways to solve this problem.

HBase, a column-oriented database, belongs to the Hadoop ecosystem, is highly compatible with it, and can read from and write to HDFS directly. HBase's scalability also means users do not need to define the table structure in advance but can extend it dynamically as needed, which solves the problem that relational databases must have their table structure established beforehand. However, HBase is still lacking in terms of indexing: it supports only a primary key index, so queries on non-primary-key columns require a full table scan, which is inefficient. Scholars have proposed several design methods for non-primary-key indexes on HBase, for example:

(1) Secondary index construction. This method adopts the idea of an inverted index: the column of the main table to be indexed serves as the primary key of the index table, and the primary key of the main table serves as the value in the index table. A query first looks up the index table to obtain the primary key corresponding to the column value and then queries the corresponding row from the main table. The method is simple, but it needs two queries and sacrifices some performance.

(2) Linearized indexing. This method maps K-dimensional data to a one-dimensional space and then uses HBase's one-dimensional index. It works well for spatial data but poorly for text data; since medical imaging files consist mainly of numeric and text data, it is not suitable here.

(3) Double-layer indexing. This method combines a global index with local indexes to reduce the number of data nodes queried, narrowing the query range from the high-level index to the low-level index. However, both index layers must be maintained on every write, which is costly, and the two layers need different data structures, which makes the implementation rather complicated.

An index is itself a data structure whose purpose is to locate data quickly. Locating can be divided into two steps: first, judge whether the data being retrieved is in the set; second, if it is, find its exact position.

Consider the first step. To judge whether an element is in a set of size n, the most common scheme is to compare the element with each element of the set in turn, as in a sequential list. The time complexity of this algorithm is O(n), which is inefficient. A hash algorithm stores elements in an array with a larger index range: the key of each element is passed through a predefined hash function, the result corresponds to an array index, and the element is stored at that array position. The advantage of hashing is that elements can be located quickly and accurately, with O(1) time complexity. Collisions are possible, that is, the keys of different elements may yield the same function value, which has given rise to many collision resolution methods. Hashing also wastes memory, because with very large data volumes the storage array must be made very large as well.

The Bloom filter algorithm also relies on hashing in principle but has better space efficiency than a hash table; its core is randomizing the mapping functions. First a large bit string is set up. Each key in the set is passed through several hash functions to compute the corresponding hash values, each value is taken modulo the length of the bit string, and, as in a hash table, the corresponding bits are set to 1; these bits are called the characteristic value of the key. In short, each key corresponds to several positions in the bit string. To look up a key quickly, it is passed through the same hash functions and mapped to positions in the bit string: if all those bits are 1, the key matches; if at least one is 0, the lookup fails. The algorithm has a defect: an element that does not belong to the set may be misjudged as belonging to it, a "false positive". This occurs when the bits set by hashing an element outside the set happen to coincide with bits already set by elements in the set. The present invention adopts optimizations to reduce this false positive rate.

Now consider the second step: once the element is judged to be in the set, how is it located? This patent uses an improved secondary index method. The traditional secondary index method is inefficient mainly because it needs two queries: the result of the first query must be returned to the client, which then initiates a second query. This generates extra I/O operations, and network I/O is much slower than in-memory retrieval. Reducing the time spent on this I/O can therefore substantially improve the query speed of the secondary index method.

In summary, the present invention uses an improved Bloom filter algorithm to judge whether the element to be queried is in the set, which gives fast feedback at a very low error rate, and then uses a redesigned secondary index scheme to reduce the inefficiency of the traditional secondary index. The invention therefore proposes a new index scheme combining the two. Using the bit vector method, the single global bit string of the Bloom filter algorithm is replaced by one bit string per hash function, to better reduce the error rate. Meanwhile, by controlling HBase's data sharding, the index table and the data table are physically placed in the same Region, so that both queries can be issued within the same Region, saving two time-consuming I/O operations.

Summary of the invention

The contents of the present invention:

1. A method of building an HBase index based on the Bloom filter algorithm is proposed, for determining whether the element to be queried is in the index table, with the error rate reduced by optimization. A traditional Bloom filter maps the values obtained by hashing the element with multiple hash functions into the same bit string; the optimization of the invention maps those values into different bit strings, which together form a vector group, reducing the error rate at the cost of a moderate increase in memory.

2. An improved design method for the HBase secondary index is proposed. Using the coprocessor built into HBase, the data table and the index table are physically placed together in the same Region, which reduces the two I/O operations needed in the traditional HBase secondary index scheme to one and greatly improves retrieval speed.

3. A sampling hashing method is proposed to solve the write hotspot problem of HBase Regions, in which logically adjacent data is always written to the same or adjacent Regions. The method estimates the total number of HBase Regions in advance and then, by sampling the row keys, hashes different write requests onto different Regions, solving the write hotspot problem.

The present invention is an integrated index design method. In view of the many advantages of the Bloom filter algorithm, the optimized Bloom filter algorithm serves as the first step of the index, and the improved secondary index scheme further increases the retrieval speed of the HBase index.

To achieve the above object, the present invention adopts the following technical scheme that:

Step 1. The traditional Bloom filter algorithm is optimized with the bit vector method. A traditional Bloom filter maps the values obtained by hashing the element to be queried with multiple hash functions into the same bit string; the optimization of the invention maps those values into different bit strings, which together form a vector group, reducing the error rate at the cost of a moderate increase in memory. The optimized filter is applied to the HBase index as the first layer of filtering: if the element to be queried is in the index table, proceed to step 2; otherwise, jump to step 4.

Step 2. Disable HBase's automatic splitting and use the improved secondary index method of the invention: estimate the number of Regions and the split points of the Regions, then hash the primary keys of the data table and divide the primary key range evenly, so that writes are not concentrated on a single hotspot each time. This makes better use of HBase's ability to distribute requests across servers and, most importantly, ensures the consistency of the data table and the index table, reducing the time spent on I/O operations.

Step 3. The index module is built with the Coprocessor that HBase provides. When the element to be queried in step 1 is in the index table, the coprocessor carries out both queries locally on the server, reducing query time.

Step 4. returns to query result.
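
The four steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the HBase implementation: the Bloom filter is stood in for by a plain `set`, the index table and main table are dictionaries, and every name (`query`, `bloom`, `index_table`, `main_table`) is hypothetical.

```python
def query(index_name, value, bloom, index_table, main_table):
    """Illustrative flow: filter first, then both lookups on the 'server' side."""
    # Step 1: first-layer filter. A miss means the value is definitely absent,
    # so the flow jumps straight to step 4 and returns an empty result.
    if (index_name, value) not in bloom:
        return []
    # Steps 2-3: the index lookup and the main-table lookup both run locally,
    # with no round trip to the client in between.
    main_keys = index_table.get((index_name, value), [])
    rows = [main_table[k] for k in main_keys if k in main_table]
    # Step 4: return the query result.
    return rows


# Hypothetical sample data: one indexed column, two rows.
main_table = {"row1": {"patient_name": "Zhang"}, "row2": {"patient_name": "Li"}}
index_table = {("patient_name", "Zhang"): ["row1"], ("patient_name", "Li"): ["row2"]}
bloom = set(index_table)  # stand-in for the Bloom filter

print(query("patient_name", "Zhang", bloom, index_table, main_table))
```

A real Bloom filter could also answer "maybe present" for an absent value; in that case the index lookup simply returns no keys and the result is still empty.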

Brief description of the drawings

Fig. 1: Flow chart of the optimized Bloom filter algorithm

Fig. 2: Flow chart of the optimized secondary index scheme

Fig. 3: Flow chart of the pre-splitting algorithm

Fig. 4: Flow chart of the overall query process

Specific embodiment

The present invention adopts the basic scheme of combining a Bloom filter with a secondary index, improves the algorithms involved, and applies them to the HBase index, with the aim of faster retrieval through the HBase index.

A traditional Bloom filter suffers from "false positives": an element that does not belong to the set is misjudged as belonging to it. We analyze the probability of this misjudgment. Assume the Bloom filter has a bit string of m bits and each element corresponds to k hash functions (information fingerprints). Some of the m bits are 1 and some are 0. First consider the probability that a given bit is zero. When an element is inserted into the Bloom filter, its first hash function sets some bit in the filter to 1; the probability that any particular bit is set to 1 is 1/m, and the probability that it remains 0 is 1 - 1/m. For a specific position in the filter, the probability that none of the k hash functions of this element sets it to 1 is shown in formula 1:

$$\left(1-\frac{1}{m}\right)^{k} \qquad (1)$$

If a second element is inserted into the filter and that specific position is still not set to 1, the probability is shown in formula 2:

$$\left(1-\frac{1}{m}\right)^{2k} \qquad (2)$$

If n elements have been inserted and the position has still not been set to 1, the probability is shown in formula 3:

$$\left(1-\frac{1}{m}\right)^{kn} \qquad (3)$$

Conversely, the probability that a bit has been set to 1 after n elements are inserted is shown in formula 4:

$$1-\left(1-\frac{1}{m}\right)^{kn} \qquad (4)$$

Now suppose these n elements have all been placed in the bit string of the Bloom filter, and consider a new element that is not in the set. Because its hash functions are random, the probability that its first hash function hits a bit whose value is 1 is exactly the probability above. For an element not in the set to be misidentified as being in the set, the bits corresponding to all of its hash functions must be 1; the probability p is shown in formula 5:

$$p = \left[1-\left(1-\frac{1}{m}\right)^{kn}\right]^{k} \qquad (5)$$

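After simplification:

$$p = \left(1 - e^{-\frac{k(n+0.5)}{m-1}}\right)^{k} \qquad (6)$$

If n is large, this can be approximated as:

$$p \approx \left(1 - e^{-\frac{kn}{m}}\right)^{k} \qquad (7)$$

Assuming 16 bits per element (m/n = 16) and k = 8, the false positive probability is about 5 in 10,000.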
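
The figure quoted above can be checked numerically. The following is a minimal sketch, assuming the standard Bloom filter false positive expression (1 - (1 - 1/m)^(kn))^k and its large-n approximation (1 - e^(-kn/m))^k; the element count `n` is an arbitrary value chosen for illustration.

```python
import math


def false_positive_exact(m: int, n: int, k: int) -> float:
    """(1 - (1 - 1/m)^(k*n))^k: all k probed bits are already set to 1."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def false_positive_approx(m: int, n: int, k: int) -> float:
    """Large-n approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


n = 100_000      # assumed number of inserted elements
m = 16 * n       # 16 bits per element
k = 8            # hash functions per element

print(false_positive_exact(m, n, k), false_positive_approx(m, n, k))
```

Both expressions evaluate to roughly 5.7e-4, in line with the "about 5 in 10,000" stated in the text.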

We now discuss the improvement of the algorithm. After a key is mapped by d hash functions, the probability that a given bit in the bit string V of length N is 1 is d/N. Each function h(i), i = 1 to d, is independent and random. Given a set S = {X1, X2, X3, ..., Xy} of size y, when all members of S are mapped by these hash functions into an array of length m, the probability P that a given bit of the array is 1 is shown in formula 8:

If an element K outside the set is mistaken for a member of the data set represented by the hash, then every hash mapping of that element yields h(K) = 1. The error rate p_err is therefore as shown in formula 9:

p_err=p^d (9)

From the false positive rate we can derive the average judgment time T of the Bloom filter, as shown in formula 10:

In general, the range N represented by the bit vector is much larger than the number y of elements in the source set S, because if y > N the error rate of the Bloom filter would be very large: mapping the data Xi into the bit vector by hash functions would inevitably produce many collisions, which can only be reduced through the choice of hash functions. Collisions arising from this inherent characteristic of hash representation are called internal collisions. Their counterpart is external collisions, caused by multiple hash functions mapping into the same bit vector. The root cause of collisions is insufficient mapping addresses. Can some address space be added without sacrificing query performance? Based on this idea, each hash function h(i) uses an independent bit vector for address mapping, and these vectors form a vector group V. Assume a data set A with x an element of A; the vector group V is then expressed as shown in formula 11:

Here V(i, j) denotes the j-th bit of the i-th vector. Assuming all hash functions are randomly distributed, the error rate p_new caused by internal collisions when each function maps addresses in its own bit string is shown in formula 12:

The formula for the improved average judgment time is unchanged, but because the error rate p_new is smaller, the overall average judgment time is also smaller.

The structure of the algorithm can be seen clearly in Fig. 1.
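
The vector-group idea can be illustrated with a small Python sketch in which each hash function h_i owns its own bit vector V[i], so bits set by different hash functions never collide with one another. The class name, the choice of salted SHA-256 as the hash family, and the vector sizes are all assumptions made for illustration; the patent does not specify them.

```python
import hashlib


class VectorGroupBloomFilter:
    """Bloom filter with one independent bit vector per hash function (sketch)."""

    def __init__(self, num_hashes: int, bits_per_vector: int):
        self.d = num_hashes
        self.n_bits = bits_per_vector
        # vectors[i] is the bit vector used only by hash function h_i
        self.vectors = [bytearray((bits_per_vector + 7) // 8)
                        for _ in range(num_hashes)]

    def _position(self, i: int, key: str) -> int:
        # Assumption: h_i is SHA-256 salted with the function index i
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key: str) -> None:
        for i in range(self.d):
            j = self._position(i, key)
            self.vectors[i][j // 8] |= 1 << (j % 8)   # set V(i, j) = 1

    def might_contain(self, key: str) -> bool:
        for i in range(self.d):
            j = self._position(i, key)
            if not (self.vectors[i][j // 8] >> (j % 8)) & 1:
                return False       # some V(i, h_i(key)) is 0: definitely absent
        return True                # all d bits are 1: present or false positive


bf = VectorGroupBloomFilter(num_hashes=4, bits_per_vector=1024)
for study_id in ("CT-0001", "MR-0002", "XR-0003"):
    bf.add(study_id)
print(bf.might_contain("CT-0001"))
```

As with any Bloom filter, an inserted key is never reported absent; the separate vectors only change how often an absent key is reported present, at the cost of d times the memory of a single vector of the same length.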

Now consider the improvement of the secondary index. In HBase, a single table usually holds a very large amount of data, so the data of one table is stored across one or more Regions, which are in turn managed by one or more Region servers. A Region has a start primary key and an end primary key that define its primary key range; on a read or write, if the primary key falls within the range of some Region, that Region is hit and the relevant data is read or written. The problem is that a single Region splits once it grows to a certain size, which is determined by HBase's LSM-tree index structure. It may therefore happen that data-table and index-table data originally on the same Region are split apart into different Regions. In that case, when the client sends a query request, four I/O operations take place in sequence: the first queries the index table with the query data, the second obtains the primary key of the main table, the third queries the main table by that primary key, and the fourth obtains the final query result. Although many excellent I/O frameworks such as Netty can now greatly increase I/O speed, reducing the number of such I/O operations brings a predictable performance improvement. The specific implementation is as follows:

The Coprocessor provided by HBase can run programs directly on the server, so that the program effectively runs local to the data. The remaining problem is to guarantee that the main table and the index table are on the same RegionServer. From the earlier description of Regions, which Region a row is stored in is determined by the range of its primary key, so the primary key of the main table and that of the index table only need to match. Because primary key retrieval follows the leftmost-prefix principle, it is enough for the leading part of the index table's primary key to be identical to the main table's primary key. Primary keys must also be unique. Combining these requirements, we design the primary key of the index table as: start row key of the Region + index name + index value + main table row key. The leading part guarantees that the index table and the main table are in the same Region, and the trailing part guarantees the uniqueness of the index table's primary key.

HBase has three basic Region sharding methods. The first is pre-splitting: before the table is created, the number of partitions and the primary key range of each Region are configured, and data is written afterwards. The second is automatic splitting, HBase's default: there is initially only one Region; as data is written the Region grows, and when it reaches a certain size it splits into two Regions of equal size; data is then written to the newly created Regions, which continue to split, and so on. The third is forced splitting: specific commands are entered in the HBase shell to control how HBase is sharded.

If the default automatic splitting is used and data is written in increasing primary key order, a Region write hotspot may arise. After a Region splits in two, the primary key range is also divided in two; because data is written in increasing primary key order, subsequent writes always go to the Region with the larger start key, while the Region with the smaller start key is hardly ever written again. The load balancing of the distributed database is then poorly used, and HBase's strong write performance suffers.

This scheme uses the first method, pre-splitting, and additionally solves the write hotspot problem through sampling hashing, making good use of load balancing. The method can be implemented through the programming interface that HBase provides, but before the table is created the number of Regions of the data table and the split points of each Region must be known, that is, the primary key range of each partition is fixed. To solve the Region write hotspot problem, this scheme devises a sampling hashing method, described in detail below with reference to Fig. 3:

Step 1: Estimate the number of Regions M.

The number of Regions greatly affects the overall read/write efficiency of HBase: with too many Regions, memory usage becomes excessive; with too few, concurrency cannot be exploited well. It is therefore necessary to adopt reasonable values established through extensive industry experience and, together with the size of our data table, estimate the number of Regions with the following formula:

$$M = \frac{\text{RSXmx} \times \text{hbase.regionserver.global.memstore.size}}{\text{hbase.hregion.memstore.flush.size} \times \text{cf}}$$

where RSXmx is the memory size of one RegionServer, hbase.regionserver.global.memstore.size and hbase.hregion.memstore.flush.size take the optimal values recommended by HBase, which can be obtained from the official HBase documentation, and cf is the number of column families of the data table. This gives the number of Regions M.
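
As a worked example of this estimate, the sketch below computes the Region count as RegionServer heap times the global memstore fraction, divided by the memstore flush size times the number of column families. The function name is hypothetical, and the sample values (16 GiB heap, fraction 0.4, 128 MiB flush size) are assumptions chosen for illustration rather than values taken from the patent.

```python
def estimate_region_count(rs_xmx_bytes: int,
                          global_memstore_fraction: float,
                          memstore_flush_bytes: int,
                          column_families: int) -> int:
    """(RSXmx * global.memstore.size) / (memstore.flush.size * cf), rounded down."""
    usable = rs_xmx_bytes * global_memstore_fraction
    return int(usable / (memstore_flush_bytes * column_families))


GIB = 1024 ** 3
MIB = 1024 ** 2

# Assumed example: 16 GiB heap, 40% for memstores, 128 MiB flush size, 1 column family
m_regions = estimate_region_count(16 * GIB, 0.4, 128 * MIB, 1)
print(m_regions)  # 16384 MiB * 0.4 / 128 MiB = 51.2, rounded down to 51
```

Doubling the number of column families halves the estimate, since each column family has its own memstore per Region.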

Step 2: Hash the row keys to form out-of-order strings.

Because the keys must later be retrieved, a reversible encryption algorithm is preferable; this scheme uses the AES encryption algorithm to hash each primary key into a random string.

Step 3: Randomly sample a certain number of primary keys and put them into a set sorted in ascending order.

Step 4: According to the estimated number of partitions M, divide the whole set evenly to find the split points.
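
Steps 2 to 4 can be sketched in Python as below. One substitution is made deliberately: the text uses AES to scramble row keys, but the Python standard library has no AES, so this sketch swaps in an MD5 hex digest purely to produce out-of-order strings; it is not reversible and is not the method the text describes. Function names, the sample size, and the fixed random seed are likewise illustrative assumptions.

```python
import hashlib
import random


def scramble(row_key: str) -> str:
    # Stand-in for the AES step: MD5 hex digest, used only to break key ordering
    return hashlib.md5(row_key.encode("utf-8")).hexdigest()


def compute_split_points(row_keys, num_regions: int, sample_size: int, seed: int = 0):
    """Hash keys, sample them, sort the sample, and cut it into num_regions parts."""
    hashed = [scramble(k) for k in row_keys]                     # step 2
    rng = random.Random(seed)
    sample = sorted(rng.sample(hashed, min(sample_size, len(hashed))))  # step 3
    step = len(sample) / num_regions
    # Step 4: the M - 1 boundaries between M equal slices of the sorted sample
    return [sample[int(i * step)] for i in range(1, num_regions)]


# Sequential keys that would otherwise all land in the last Region
keys = [f"patient{i:06d}" for i in range(10_000)]
splits = compute_split_points(keys, num_regions=8, sample_size=1_000)
print(len(splits), splits[:2])
```

The resulting split points would then be supplied when the table is created with pre-splitting, so that writes of consecutive original keys spread across all Regions.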

Finally, with reference to Fig. 4, the steps of the whole scheme are summarized:

Step 1: Start the query request.

Step 2: The Coprocessor parses the query.

Step 3: Look up the index table.

Step 4: Filter through the Bloom filter. If the element exists, jump to step 5; if it does not exist, jump to step 6.

Step 5: inquiring main table.

Step 6: returning to final result.

Claims

1. A Bloom filter-based HBase multi-level index design method for medical imaging data, characterized by comprising the following steps:

Step 1. The query request is first sent to the query server, and the HBase Coprocessor parses the query request;

Step 2. The query is first filtered through the Bloom filter;

Step 3. If the query passes the filter, the index table is searched; the row key of the index table is designed as: start row key of the Region + index name + value + main table row key;

Step 4. Regions are pre-split by the sampling hashing method.

2. The Bloom filter-based HBase multi-level index design method for medical imaging data according to claim 1, characterized in that step 4 is specifically as follows:

Step 1: Estimate the number of Regions N;

$$N = \frac{\text{RSXmx} \times \text{hbase.regionserver.global.memstore.size}}{\text{hbase.hregion.memstore.flush.size} \times \text{cf}}$$

where RSXmx is the memory size of one RegionServer;

hbase.regionserver.global.memstore.size and hbase.hregion.memstore.flush.size take the optimal values recommended by the system, obtained from the official HBase documentation; cf is the number of column families of the data table; this yields the number of Regions N;

Step 2: Using an irreversible encryption algorithm, hash the primary keys into random strings;

Step 3: Randomly sample a certain number of primary keys and put them into a set sorted in ascending order;

Step 4: According to the estimated number of partitions N, divide the whole set evenly to find the split points.

3. The Bloom filter-based HBase multi-level index design method for medical imaging data according to claim 1, characterized in that, during Bloom filter filtering, each hash function h_i(x) uses an independent bit vector for address mapping, thereby forming a vector group V; assume a data set A with x an element of A; the vector group V is then expressed as follows:

where V(i, j) denotes the j-th bit of the i-th vector.