CN104504011B - A method for comparing membership-query algorithms - Google Patents
- Publication number
- CN104504011B CN201410758136.9A
- Authority
- CN
- China
- Prior art keywords
- data structure
- data
- algorithm
- candidate data
- space
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/901—Indexing; Data structures therefor; Storage structures
Abstract
The invention discloses a method for comparing membership-query algorithms, comprising the following steps: A, representing the stored set held in the system with a candidate data structure; B, computing the evaluation parameters of the candidate data structure; C, comparing the evaluation indices according to these evaluation parameters to determine the most suitable membership-query algorithm. By representing the stored set with a candidate data structure, the invention gives different membership-query algorithms a unified description, which facilitates their comparison and analysis when they are later selected and evaluated. It also computes the relevant evaluation parameters, namely rejection time, space overhead, representation efficiency, and misjudgment rate, which are used to evaluate the algorithms and to select one suited to a specific application requirement. As a method for comparing membership-query algorithms, the invention can be widely applied in the field of big data processing.
Description
Technical field
The present invention relates to the field of big data processing, and in particular to a method for comparing membership-query algorithms.
Background art
In a big data environment, large volumes of newly generated data are stored in database systems every day. In many cases, when a new data item arrives, the system must query whether that item is already in the system; this is the membership query for the data. As the volume of stored data grows, it takes more time to determine whether a data item exists in the system. In practice, however, the system must respond to such queries quickly. For example, fast-disk storage applications need to determine quickly whether a file exists in order to decide how changes to the file are synchronized. As another example, it was recently reported online that an ATM of the Industrial and Commercial Bank in Chengdu dispensed counterfeit banknotes bearing the same serial number. If one could quickly query whether the serial number of such a note exists in the bank's vault, then when a counterfeit note appears one could quickly determine whether a banknote with the same serial number came from the bank's ATM, and so apportion responsibility between the bank and the customer. This is also an application of quickly querying whether data exists in a system. Quickly answering whether a data item already exists in a storage system that holds massive data is therefore an application requirement of practical significance.
Existing bitmap data structures, Bloom filters, and the data structure algorithms derived from them each have their own advantages. They are used to represent all the data stored in a system so that queries about whether a data item already exists in the system can be answered quickly. Bitmaps are typically used as indexes in practical database applications, whereas a Bloom filter is a data structure that trades accuracy for space efficiency.
To represent the data stored in a system and to answer quickly whether a data item already exists in a storage system holding massive data, a common scheme is for a client to access the database system and query whether the specific data item is in the storage system. Although feasible, this scheme queries the relevant records directly in the database, which incurs unnecessary time overhead. A second scheme implements a data structure in memory that represents all the data in the system. To query whether a data item is already stored in the system, only this data structure needs to be accessed.
Compared with the first scheme, the second does not need to access the database: the query completes in memory and reveals whether the data item is already stored in the system, so its time overhead is lower. Existing data structures, such as bitmaps, Bloom filters, and the data structure algorithms derived from them, each have their own advantages and can be used to represent all the data stored in a system. However, these algorithms do not all perform well in every scenario. Moreover, most related work to date has focused on improving membership-query algorithms; there has been no work on comparing and selecting among them.
Summary of the invention
To solve the above technical problem, the object of the present invention is to provide a method for comparing membership-query algorithms that computes and evaluates existing membership-query algorithms.
The technical solution adopted by the present invention is a method for comparing membership-query algorithms, comprising the following steps:
A, representing the stored set held in the system with a candidate data structure;
B, computing the evaluation parameters of the candidate data structure;
C, comparing the evaluation indices according to the above evaluation parameters to determine the most suitable membership-query algorithm.
Further, the candidate data structure comprises the operation data objects, belonging to different sets, that are involved in the data structure; the mapping relations between these operation data objects; and the set formed by the operation functions of the candidate data structure.
Further, the operation data objects comprise data sets and the space units available for setting.
Further, the evaluation parameters comprise rejection time, space overhead, representation efficiency, and misjudgment rate.
Further, in step C, rejection time and space overhead are used to evaluate the trade-off between the time a membership-query algorithm takes to determine that a data item is not in the system and the space overhead of the membership-query data structure.
Further, in step C, representation efficiency is used to evaluate the space overhead a membership-query algorithm spends on representing each data item, for different data volumes or as the data involved changes.
Further, in step C, the misjudgment rate is used to evaluate how well a membership-query algorithm reflects the data in the system.
The beneficial effects of the invention are as follows. By representing the stored set held in the system with a candidate data structure, the invention gives different membership-query algorithms a unified description, which facilitates their comparison and analysis when they are later selected and evaluated. It also computes the relevant evaluation parameters, namely rejection time, space overhead, representation efficiency, and misjudgment rate, which are used to evaluate the algorithms and to select one suited to a specific application requirement.
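Steps A to C can be sketched as a small selection routine. This is a minimal illustration and not the claimed procedure itself: the `Evaluation` record, its field names, the sample values, and the rule of first filtering by a misjudgment-rate limit and then preferring the shortest rejection time and the smallest space overhead are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Evaluation parameters of one candidate data structure (hypothetical record)."""
    name: str
    rejection_time: float  # time to decide that d is not in the system
    mem_bits: int          # space overhead Mem, in bits
    n: int                 # number of elements in the stored set DS
    fp_rate: float         # false-positive misjudgment rate

    @property
    def efficiency(self) -> float:
        # Representation efficiency E = Mem / n (bits per stored element).
        return self.mem_bits / self.n


def select_candidate(evals, max_fp):
    """Return the candidate with fp_rate <= max_fp that has the shortest
    rejection time (ties broken by smaller Mem), or None if none qualifies."""
    ok = [e for e in evals if e.fp_rate <= max_fp]
    if not ok:
        return None
    return min(ok, key=lambda e: (e.rejection_time, e.mem_bits))


# Illustrative values only; they are not measurements from the source.
candidates = [
    Evaluation("bitmap", 1.0, 1_000_000, 10_000, 0.0),
    Evaluation("bloom", 2.0, 96_000, 10_000, 0.01),
]
best = select_candidate(candidates, max_fp=0.001)
```

A different application might rank by space overhead first; the ranking key is the part most likely to change between use cases.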
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the method of the present invention;
Fig. 2 is a schematic diagram of a candidate data structure determining whether an arbitrary element exists in the system;
Fig. 3 is a schematic diagram of adding elements to and deleting elements from the candidate data structure and the stored set;
Fig. 4 is a schematic diagram of the bitmap data structure;
Fig. 5 is a structural diagram of the standard Bloom filter;
Fig. 6 is a structural diagram of the attribute Bloom filter;
Fig. 7 is a structural diagram of the D-left Bloom filter.
Detailed description of the embodiments
Related definitions
Definition 1, universal set: the whole data range corresponding to the data in the system. Formally, it is the set of all data elements involved in the problem, denoted $U$. The number of elements of $U$ is $|U|$, denoted $n_u$.
Definition 2, stored set: the massive data set stored in the system. Formally, it is the set of data elements stored in the system, denoted $DS$. $DS$ is a subset of the universal set $U$, i.e., $DS \subseteq U$. The number of elements of $DS$ is $|DS|$, denoted $n$.
Definition 3, test set: the set of new data items that arrive at the system and must be tested for whether they already exist in the system, denoted $TS$. The number of elements of $TS$ is $|TS|$, denoted $n_t$. Likewise, it satisfies $TS \subseteq U$.
Definition 4, space unit: one or more consecutive bits in memory, denoted $C$ and measured in bits. The number of bits in a space unit is denoted $b$.
Definition 5, candidate data structure: a data structure algorithm implemented in memory that represents the massive data in the storage system and answers membership queries, that is, a data structure used to represent the stored set $DS$ in the system, denoted $A = (D, R, F)$. Here $D$ is the set of operation data objects, belonging to different sets, that are involved in the data structure; $R$ is the set of mapping relations between the different data objects in $D$; and $F$ is the set formed by the operation functions of the candidate data structure.
Definition 6, setting: the operation of setting a space unit (see Definition 4) to 1 or assigning it a value, denoted Set. Formally, for any space unit $C$ in memory, there is an operation Set such that $\mathrm{Set}(C) = 1$.
Definition 7, setting set: the contiguous space units (see Definition 4) in memory that belong to the candidate data structure (see Definition 5). Formally, it is the set formed by the space units on which setting (see Definition 6) is to be performed, denoted $M$, with element count $n_m = |M|$. Note that $M$ is one of the sets contained in $D$ of the candidate data structure $A = (D, R, F)$.
Definition 8, setting map: the mapping function that maps data in the database system to space units of the candidate data structure, that is, the mapping function from the stored set to the setting set of the candidate data structure. Formally, the mapping function $f$ satisfies $f: U \to M^r$, where $M^r$ denotes selecting $r$ elements from the set $M$ for setting. The setting map has the following properties:
1) The setting map may be one-to-one or one-to-many. Here $r$ is the number of elements selected from the setting set in each mapping, with $r \ge 1$.
2) The domain of the setting map is $\mathrm{dom}(f) = DS$.
3) If $f: U \to M^r$ is invertible, its inverse map is $f^{-1}: M^r \to U$.
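A setting map $f: U \to M^r$ can be sketched with hashing. The source does not say how the $r$ space units are chosen, so the use of salted SHA-256 digests, the function name `setting_map`, and the representation of $M$ as the index range `0 .. n_m - 1` are assumptions made for illustration.

```python
import hashlib


def setting_map(d: str, r: int, n_m: int) -> list[int]:
    """Map element d to r positions in the setting set M = {0, ..., n_m - 1}."""
    positions = []
    for i in range(r):
        # Salt the element with the index i so each of the r choices differs.
        digest = hashlib.sha256(f"{i}:{d}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % n_m)
    return positions
```

The $r$ positions are not guaranteed to be distinct; two salts can land on the same space unit, which is also how a standard Bloom filter behaves.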
Definition 9, represented set: the data set that the candidate data structure (see Definition 5) can represent after data has been inserted and some or all of its space units have been set. Formally: after the setting map of the candidate data structure $A = (D, R, F)$ has been applied, the set that $A = (D, R, F)$ can represent is called the represented set, denoted $S$; it satisfies $S \subseteq U$. The number of elements of $S$ is $|S|$, denoted $n_s$.
Definition 10, representation relation: the mapping relation between the contiguous space units that make up the candidate data structure and the data set they can represent, after some or all of those space units have been set. Formally: after the setting map (see Definition 8) of the candidate data structure $A = (D, R, F)$ has mapped all elements of the stored set $DS$ into the setting set, $t$ elements of the setting set have been set; the mapping relation between the setting set after setting and the represented set is denoted $\langle M^t, S \rangle$. In particular, $t = 0$ means that no element has been inserted, in which case $S = \emptyset$; when $t = |M| = n_m$, $S = U$.
Definition 11, membership query: given a data item, the user accesses the candidate data structure to query whether the item exists in the storage system. Formally: for any $d \in TS$ (see Definition 3), the query response function that returns whether $d$ is contained in $DS$ is denoted query(d).
Definition 12, misjudgment rate: the probability of an inaccurate response when performing a membership query query(d) (see Definition 11). There are two kinds. The false-positive misjudgment rate is the error of judging an element that is not in the data set to be present; the false-negative misjudgment rate is the error of judging an element that is in the data set $DS$ to be absent.
Formally:
1) False-positive misjudgment rate FP: for a given $d \notin DS$, the probability that the membership query query(d) returns true, indicating $d \in DS$.
2) False-negative misjudgment rate FN: for a given $d \in DS$, the probability that the membership query query(d) returns false, indicating $d \notin DS$.
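Both misjudgment rates can be estimated empirically by running a query function over a test set and comparing its answers with the true stored set. The helper below is a sketch under that assumption; the name `false_rates` and its signature are hypothetical and do not come from the source.

```python
def false_rates(query, ds, ts):
    """Estimate (FP, FN) of `query` over test set `ts`, given the true stored set `ds`."""
    ds = set(ds)
    absent = [d for d in ts if d not in ds]   # elements with d not in DS
    present = [d for d in ts if d in ds]      # elements with d in DS
    # FP: fraction of absent elements that the query reports as present.
    fp = sum(1 for d in absent if query(d)) / len(absent) if absent else 0.0
    # FN: fraction of present elements that the query reports as absent.
    fn = sum(1 for d in present if not query(d)) / len(present) if present else 0.0
    return fp, fn


# A deliberately imperfect query that wrongly claims 9 is stored.
fp, fn = false_rates(lambda d: d in {1, 2, 3, 9}, ds={1, 2, 3}, ts=[1, 2, 8, 9])
```

Here one of the two absent test elements (9) is reported as present, so the estimate is FP = 0.5 and FN = 0.0.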
Definition 13, space overhead: the product of the number of space units that the candidate data structure $A = (D, R, F)$ (see Definition 5) needs in memory, i.e., the number of elements of the setting set (see Definition 7), and the number of bits required by each space unit (see Definition 4). It is denoted Mem and measured in bits.
Definition 14, representation efficiency: the ratio of the space overhead Mem (see Definition 13) of the candidate data structure $A = (D, R, F)$ (see Definition 5) to the total number of elements $n = |DS|$ of the set $DS$, denoted $E$. It describes the average number of bits each element requires when the candidate data structure $A = (D, R, F)$ represents the set $DS$.
Definition 15, representation performance: the ability of the candidate data structure $A = (D, R, F)$ to represent an element, measured by representation efficiency. The larger the representation efficiency, the more bits each element of the stored set $DS$ requires, and the poorer the representation performance of the candidate data structure $A = (D, R, F)$.
Definition 16, rejection time: given a query data item, the time the candidate data structure $A = (D, R, F)$ takes to determine that the item is not in the system. Formally: for a given element $d \in TS$, the time required for the membership query query(d) to determine that $d$ does not belong to $DS$, denoted $T$. (The time required to determine that $d$ belongs to $DS$ is not defined here because it amounts to only a few unit times; note that the time discussed here does not include accessing external storage to determine that $d$ belongs to $DS$.)
Definition 17, consistency: a measure of how well the candidate data structure $A = (D, R, F)$ (see Definition 5) reflects the stored set $DS$ in the system and its changes. Consistency is usually measured by the misjudgment rate (see Definition 12).
The embodiments of the present invention are further described below with reference to the accompanying drawings.
Referring to Fig. 1, a method for comparing membership-query algorithms comprises the following steps:
A, representing the stored set held in the system with a candidate data structure;
B, computing the evaluation parameters of the candidate data structure;
C, comparing the evaluation indices according to the above evaluation parameters to determine the most suitable membership-query algorithm.
The steps are described in detail below.
Problem modeling:
Based on the definitions above, the problem is modeled as follows. Consider a large data set, all or part of whose data is stored in the system. During storage, a data structure is implemented in the system's memory to represent the massive data stored in the system; in other words, a data structure occupying a small amount of memory represents data occupying a large amount of space in the system. When a data item is inserted into the system, the data structure inserts the new item accordingly; when the system deletes a data item, the data structure deletes it accordingly, provided the data structure supports deletion. To query whether any element is in the system, only this in-memory data structure needs to be accessed.
Formally, the universal set $U$ is the set of all distinct elements of the large data set, with element count $n_u$. $DS$ is the set of all elements stored in the system, with element count $n$. By definition, $DS \subseteq U$ and $n \le n_u$; clearly, equality holds when $DS$ is the universal set itself. The represented set $S$ of the candidate data structure $A = (D, R, F)$ satisfies $S \subseteq U$. The test set $TS$ consists of arbitrary data items that belong to the universal set and are tested for presence in the system.
Problem modeling is illustrated below with reference to the drawings.
As shown in Fig. 2, for any $d \in TS$, the candidate data structure $A = (D, R, F)$ is accessed to query whether $d$ is in the system, i.e., whether $d$ belongs to $DS$. The candidate data structure $A = (D, R, F)$ has an operation function that returns $d \in DS$ or $d \notin DS$. In a concrete implementation, for any newly arrived data item, the candidate data structure in system memory is accessed to query whether the item is in the system, and the candidate data structure answers the query by returning a Boolean value.
As shown in Fig. 3, when a new data item is inserted into the system, a candidate data structure that supports insertion can synchronously reflect the newly inserted item in the system's data set. Formally, for $d \in U$ with $d \notin DS$, the stored set becomes $DS = DS + \{d\}$; if the candidate data structure supports element insertion, $A = (D, R, F)$ has an operation function such that $S = S + \{d\}$. In a concrete implementation, when the system inserts a new data item, the candidate data structure sets several space units to represent the item.
As shown in Fig. 3, when a data item is deleted from the system, a candidate data structure that supports deletion can synchronously reflect the removal of the old item from the system's data set. For $d \in DS$, when $DS = DS - \{d\}$, if the candidate data structure $A = (D, R, F)$ supports element deletion, it has an operation function such that $S = S - \{d\}$. In a concrete implementation, when the system deletes a data item, if the contiguous space units of the candidate data structure can be restored to their state before setting, the space units corresponding to the deleted item are restored to that state.
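Insertion and deletion as described for Fig. 3 require space units that can be restored to their state before setting. One well-known way to obtain this is to make each space unit a small counter of $b > 1$ bits, as in a counting Bloom filter. The source does not prescribe this structure, so the class below, its name, and its hashing scheme are assumptions for illustration.

```python
import hashlib


class CountingFilter:
    """Candidate data structure whose space units are b-bit counters (a sketch)."""

    def __init__(self, n_m: int, r: int, b: int = 4):
        self.n_m, self.r = n_m, r
        self.max_count = (1 << b) - 1   # largest value a b-bit counter can hold
        self.cells = [0] * n_m          # the setting set M

    def _positions(self, d: str) -> list[int]:
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{d}".encode()).digest()[:8], "big") % self.n_m
            for i in range(self.r)
        ]

    def insert(self, d: str) -> None:
        # S = S + {d}: increment the r counters selected for d.
        for i in self._positions(d):
            if self.cells[i] < self.max_count:
                self.cells[i] += 1

    def delete(self, d: str) -> None:
        # S = S - {d}: only valid for an element that was inserted earlier.
        for i in self._positions(d):
            if self.cells[i] > 0:
                self.cells[i] -= 1

    def query(self, d: str) -> bool:
        return all(self.cells[i] > 0 for i in self._positions(d))


cf = CountingFilter(n_m=64, r=3)
cf.insert("a")
cf.insert("b")
cf.delete("a")
```

If a counter saturates at `max_count`, later deletions can leave it too low and cause false negatives, so $b$ has to be chosen with the expected load in mind.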
Algorithm modeling:
Following the problem modeling above, the universal set $U$ is the set of all distinct elements of the large data set, with element count $n_u$. $DS$ is the set of all elements stored in the system, with element count $n$. By definition, $DS \subseteq U$ and $n \le n_u$; clearly, equality holds when $DS$ is the universal set itself. The candidate data structures are therefore modeled uniformly as follows:
The stored set $DS$ held in the system is represented by the candidate data structure $A = (D, R, F)$, where:
1) $D$ is the set formed by the different data objects involved in the problem, equivalent to the union of the universal set $U$ of the stored set $DS$ and the setting set $M$. The stored set $DS$ corresponds to the data in the system, the universal set $U$ is the range of the data, and the setting set $M$ corresponds to the space occupied by the data structure in system memory. Intuitively, "data objects" here is a general term for the objects the candidate data structure operates on. The set $D$ is therefore the set of data objects that the candidate data structure operates on directly, comprising the data sets and the space units available for setting.
2) $R$ is the set of mapping relations between the different data objects in $D$. The setting-map relation is $R_1 = \langle d, M^r \rangle$, where $d \in DS$ and $M^r$ denotes selecting $r$ elements from the set $M$ for setting (see Definition 6); intuitively, it is the correspondence between a data item and the several space units, chosen from all the contiguous space units of the candidate data structure, that are set for it. The representation relation is $R_2 = \langle M^t, S \rangle$, where $M^t$ denotes that $t$ elements of $M$ have been set and $S$ is the represented set (see Definition 9). In other words, $R_1$ describes how the in-memory data structure selects several space units to set after the system inserts a data item, while $R_2$ describes the data that can be represented once some or all space units of the in-memory data structure have been set.
3) The operation functions of the candidate data structure are $F = \{ f: U \to M^r, \mathrm{query}(d) \}$, where $f: U \to M^r$ is the setting map (see Definition 8) with domain $\mathrm{dom}(f) = DS$, $DS \subseteq U$, and query(d) is the membership query (see Definition 11), with $d \in TS$, where $TS$ is the test set and $TS \subseteq U$.
The space overhead of the candidate data structure $A = (D, R, F)$ is then the product of the number of elements in the setting set and the number of bits in each space unit of the candidate data structure, i.e.,
$$\mathrm{Mem} = n_m \times b \qquad (1)$$
In particular, when the mapping count $r = 1$ and the number of bits per space unit $b = 1$, because $DS$ is an arbitrary subset, the represented set $S$ in the relation $R_2 = \langle M^t, S \rangle$ must be the universal set of $DS$ in order to represent $DS$, so that
$$\mathrm{Mem} = n_u \qquad (2)$$
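The case $r = 1$, $b = 1$ corresponds to a plain bitmap with one bit per element of the universal set, which is why Mem equals $n_u$. The sketch below assumes the elements of $U$ are the integers `0 .. n_u - 1`; that encoding and the class name are assumptions, not something the source specifies.

```python
class Bitmap:
    """r = 1, b = 1: one bit per element of the universal set U = {0, ..., n_u - 1}."""

    def __init__(self, n_u: int):
        self.n_u = n_u
        self.bits = bytearray((n_u + 7) // 8)  # n_u bits, rounded up to whole bytes

    def insert(self, d: int) -> None:
        self.bits[d >> 3] |= 1 << (d & 7)

    def delete(self, d: int) -> None:
        self.bits[d >> 3] &= ~(1 << (d & 7)) & 0xFF

    def query(self, d: int) -> bool:
        return bool(self.bits[d >> 3] & (1 << (d & 7)))

    @property
    def mem(self) -> int:
        # Space overhead Mem = n_m * b = n_u * 1, in bits.
        return self.n_u


bm = Bitmap(100)
bm.insert(5)
bm.insert(99)
```

Because every element has its own bit, a query never confuses two elements, and the answer is available after inspecting a single space unit.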
Each element of the setting set $M$ of the candidate data structure $A = (D, R, F)$ is a space unit in memory (see Definition 4). When the number of bits per space unit is $b > 1$, setting (see Definition 6) any element of $M$ is an assignment operation, and the probability that the values assigned to space units coincide is given by formula (3).
Now suppose the setting map $f: U \to M^r$ selects $r > 1$ elements from the setting set $M$ in each mapping, and each element is a space unit consisting of a single bit. After the setting map has set the space units of $M$ corresponding to all elements of $DS$, $t$ elements of $M$ have been set, written $M^t$ with $t \le |M| = n_m$, which yields the relation $R_2 = \langle M^t, S \rangle$. Hence, after any element of the setting set $M$ of the candidate data structure $A = (D, R, F)$ undergoes setting to 1 (see Definition 6), the probability that a given element of $M$ has still not been set is given by formula (4).
The type I error rate of the membership query query(d) of the candidate data structure $A = (D, R, F)$ is given by formula (5); the algorithm modeling in this embodiment leaves the type II error rate aside.
It follows that the space overhead of the candidate data structure $A = (D, R, F)$ depends on the number of elements in the setting set and the number of bits per space unit, while the error rate is determined by the number of elements in the stored set, the number of elements in the setting set, the mapping count of the setting map, and the number of bits per space unit.
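For the case $b = 1$ and $r > 1$, the probability of a space unit remaining unset and the resulting false-positive rate have a well-known form in the textbook analysis of a standard Bloom filter. The block below states that textbook form as an assumption. It agrees with the $r = 1$ expression $P = (1 - 1/n_m)^n$ that appears in this description, but the source's own formulas may differ, in particular where $b > 1$.

```latex
% Probability that a given space unit of M is still unset after n insertions,
% each of which sets r units chosen uniformly from the n_m units of M:
P = \left(1 - \frac{1}{n_m}\right)^{rn} \approx e^{-rn/n_m}

% False-positive rate: all r units probed for an absent element are already set:
FP = (1 - P)^{r} \approx \left(1 - e^{-rn/n_m}\right)^{r}
```

The approximations hold when $n_m \gg 1$, which matches the assumption $r \ll n_m$, $n_m \gg 1$ made in this description.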
Evaluation indices
1) Evaluation of rejection time and space overhead
This evaluation mainly assesses how the space overhead of a candidate data structure affects the time it takes to determine that an arbitrary data item is not in the system.
The setting set $M$ represents the contiguous space units of the candidate data structure $A = (D, R, F)$ in memory, and Mem denotes the space overhead of $A = (D, R, F)$ (see Definition 13). For an element $d$, the setting function $f(d) = f: U \to M^r$ selects $r$ elements from the setting set $M$ for setting, and each element, i.e., each space unit in the setting set, is set with probability $1/n_m$.
First, consider the case in which the setting map $f: U \to M^r$ selects $r = 1$ element from the setting set $M$ in each mapping. When $r = 1$ and $b = 1$, each element of the setting set $M$ of the candidate data structure $A = (D, R, F)$ is a space unit consisting of one bit. Each application of the setting map $f: U \to M^r$ sets only one space unit, i.e., one bit. In other words, each element of the stored set $DS$ corresponds to one space unit in the setting set. After the setting map has set the space units of $M$ corresponding to all elements of $DS$, $t$ elements of $M$ have been set, written $M^t$, which yields the relation $R_2 = \langle M^t, S \rangle$. Intuitively, this is the correspondence between the candidate data structure after $t$ of its $n_m$ space units have been set to 1 and the data set it represents. The probability that a given element of $M$ has still not been set is therefore $P = (1 - 1/n_m)^n$, and at this point $FP = 1 - P = 1 - (1 - 1/n_m)^n$; from $n_m \gg 1$, one obtains $FP = 0$. Furthermore, because $DS$ is an arbitrary subset, the represented set $S$ in the relation $R_2 = \langle M^t, S \rangle$ must be the universal set of $DS$ in order to represent $DS$, so
$$\mathrm{Mem} = n_m \times b = n_u \times 1 = n_u \qquad (6)$$
From the above, when $r = 1$ and $b = 1$, determining that an element is not in the set takes only one unit of time, $O(1)$; the space Mem and the rejection time $T$ are therefore not interdependent.
Next, when $r = 1$ and $b > 1$, a larger space Mem means more bits in each space unit of the setting set $M$ and a lower chance that two space units hold the same assigned value, so the candidate data structure $A = (D, R, F)$ takes less time to reject an element that is not in the set.
Now consider the case in which the setting map $f: U \to M^r$ selects $r > 1$ elements from the setting set $M$ in each mapping. After the setting map has set the space units of $M$ corresponding to all elements of $DS$, $t$ elements of $M$ have been set, written $M^t$ with $t \le |M| = n_m$, which yields the relation $R_2 = \langle M^t, S \rangle$. The probability that a given element of the setting set $M$ has still not been set is given by formula (7).
Likewise, for $d \in TS$, the false-positive misjudgment rate, i.e., the probability that the membership query query(d) wrongly reports that $d$ is in $DS$, is given by formula (8).
Since $r > 1$, assume $r \ll n_m$ and $n_m \gg 1$. Following the derivation in related document 1 (Zhong M, Lu P, Shen K, et al. Optimizing data popularity conscious bloom filters[C]. Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing. ACM, 2008: 355-364.), in which $r$ is usually much smaller than $n_m$, the misjudgment rate is approximated by formula (9).
Similarly, the same document gives an approximation of the rejection time $T$ as formula (10).
By Definition 13, the space overhead Mem is the product of the number of elements in the setting set and the number of bits required by each space unit. Following the same document and combining with formula (9), one obtains the trade-off among time, space, and error rate expressed by formula (11).
According to formula (11), within a specified error-rate range, a smaller space overhead Mem means that determining that $d$ does not belong to $DS$ takes relatively longer; conversely, a larger Mem means a relatively shorter time. Intuitively, a smaller Mem means fewer elements in the setting set $M$ of the candidate data structure $A = (D, R, F)$. As the domain $\mathrm{dom}(f) = DS$ of the setting map $f(d) = f: U \to M^r$ grows, i.e., as the number of stored elements increases, more elements of $M$ are selected for setting, and the membership query query(d) takes longer to find an element of $M$ that has not been set, which increases the rejection time.
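The intuition above can be explored with a small experiment on a standard Bloom filter ($b = 1$, $r > 1$). The sketch below uses the number of space units examined before an unset one is found as a stand-in for the rejection time $T$; that proxy, the hashing scheme, and all names are assumptions made for illustration and are not taken from the source or from related document 1.

```python
import hashlib


def positions(d: str, r: int, n_m: int) -> list[int]:
    """Select r positions in M = {0, ..., n_m - 1} for element d (salted SHA-256)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{d}".encode()).digest()[:8], "big") % n_m
        for i in range(r)
    ]


class BloomFilter:
    def __init__(self, n_m: int, r: int):
        self.n_m, self.r = n_m, r
        self.bits = [0] * n_m  # b = 1: one bit per space unit

    def insert(self, d: str) -> None:
        for i in positions(d, self.r, self.n_m):
            self.bits[i] = 1

    def probes_to_reject(self, d: str):
        """Units examined until an unset one is found; None if all r are set."""
        for count, i in enumerate(positions(d, self.r, self.n_m), start=1):
            if self.bits[i] == 0:
                return count
        return None


def mean_rejection_probes(n_m, r, stored, absent):
    """Average probes needed to reject the elements of `absent` that are rejected."""
    bf = BloomFilter(n_m, r)
    for d in stored:
        bf.insert(d)
    probes = [bf.probes_to_reject(d) for d in absent]
    rejected = [p for p in probes if p is not None]
    return sum(rejected) / len(rejected) if rejected else None


stored = [f"s{i}" for i in range(200)]
absent = [f"a{i}" for i in range(200)]
small = mean_rejection_probes(512, 4, stored, absent)    # small Mem
large = mean_rejection_probes(8192, 4, stored, absent)   # large Mem
```

Comparing `small` and `large` shows how the average number of probes changes as the setting set grows while the stored set stays the same.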
2) Evaluation of space overhead and representation performance
This evaluation assesses how the space overhead and representation performance of a candidate data structure vary with the number of elements in the stored set and the range of the universal set. Representation performance is measured by representation efficiency, which is the average number of memory bits required per data item in the system. The higher the representation efficiency value, the lower the representation performance of the membership-query algorithm.
First, consider the case in which the setting map $f: U \to M^r$ selects $r = 1$ element from the setting set $M$ in each mapping. When $r = 1$ and $b = 1$, because $DS$ is an arbitrary subset, the represented set $S$ in the relation $R_2 = \langle M^t, S \rangle$ must be the universal set $U$ in order to represent an arbitrary subset $DS$ of $U$, so $\mathrm{Mem} = n_u \times b$ and the representation efficiency is $E = \mathrm{Mem}/n = n_u/n$. Intuitively, when $r = 1$ and $b = 1$, the sparser $DS$ is relative to $U$, the more bits each element of $DS$ requires in the candidate data structure $A = (D, R, F)$, which means a higher representation efficiency value and hence poorer representation performance.
Now consider the case in which the setting map $f: U \to M^r$ selects $r > 1$ elements from the setting set $M$ in each mapping. After the setting map has set the space units of $M$ corresponding to all elements of $DS$, the relation $R_2 = \langle M^t, S \rangle$ arises, with $t \le |M| = n_m$. Because $DS$ is a set formed by any $n$ elements of the universal set $U$, representing such a set requires $DS \subseteq S$. (This differs from the case $r = 1$. When $r > 1$, the relation $R_2 = \langle M^t, S \rangle$ means that $t$ elements of the setting set $M$ have been set, with $t \le |M| = n_m$ and $S \subseteq U$; only when $t = n_m$ does $S = U$.) Following the derivation in related document 2 (Aguilar-Saborit J, Trancoso P, Muntes-Mulero V, et al. Dynamic count filters[J]. ACM SIGMOD Record, 2006, 35(1): 26-32.), let $0 < \epsilon < 1$, so that the represented set $S$ contains at most $n_s = n + \epsilon(n_u - n)$ elements. The set $S$ in the relation $R_2 = \langle M^t, S \rangle$ is then a specific set containing $n_s = n + \epsilon(n_u - n)$ elements, where $t \le |M| = n_m$ and $DS \subseteq S \subseteq U$. The number of $n$-element subsets contained in any one represented set $S$ is $\binom{n_s}{n}$. For $S$ to be able to represent all $\binom{n_u}{n}$ combinations of $n$ elements in the universal set $U$, the condition given in related document 2 must be satisfied.
Since each element of the setting set $M$ is a space unit to be set and each space unit consists of $b$ bits, the corresponding expression for the space overhead follows from the derivation on page 490 of related document 2.
Related document 2 also points out that FP is minimized when $r = \ln 2 \cdot (n_m/n)$; with the error rate not exceeding $\epsilon$, i.e., $FP \le \epsilon$, the corresponding expression for Mem follows.
The representation efficiency of the candidate data structure $A = (D, R, F)$ is then obtained from Mem by Definition 14.
Obviously, it is related to error rate, does not change with the element number for having deposited set and changes.Therefore, in given error rate scope
Interior, the expression performance of candidate data structure A=(D, R, F) is constant.
To sum up, when r=1 and b=1, the space overhead is Mem = n_u × b and the representation efficiency is Mem/n = n_u × b/n. Clearly, when the set to be represented accounts for only a small fraction of the universe, the representation performance of the candidate data structure A=(D, R, F) is poor. Therefore, if the universe of the set to be represented is small and its elements make up more than half of the universe, r=1 is recommended for the set mapping f: U→M^r.
When r>1, the space overhead Mem grows with n × b × log2(1/ε), and the representation efficiency with b × log2(1/ε), where ε is the given bound on the misjudgment rate. For a given misjudgment-rate bound, Mem is linearly related to the product of n, the number of elements of the set DS to be represented, and b, the number of bits per space cell. Usually n >> b and b is fixed, so Mem can be regarded as proportional to n, while the representation efficiency is proportional only to b, the number of bits required by each element (space cell) of the settable set M of A=(D, R, F). Therefore, when DS accounts for only a small fraction of the universe U, that is, when DS is sparse relative to U, r>1 is recommended for the set mapping f: U→M^r.
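As a rough numerical illustration of the two cases above, the following sketch compares the bits needed per stored element. The r=1 figures use Mem = n_u × b from the text. For r>1 the sketch assumes the standard Bloom-filter sizing n_m = n × log2(1/ε) / ln 2 (with r = ln 2 × n_m/n), which is a textbook result and not a formula recovered from this document; the function names are hypothetical.

```python
import math


def bitmap_cost(n_u, n, b=1):
    """r = 1: one b-bit cell per element of the universe (Mem = n_u * b)."""
    mem = n_u * b
    return mem, mem / n  # (space overhead in bits, bits per stored element)


def bloom_cost(n, eps, b=1):
    """r > 1: assumed textbook sizing, n_m = n * log2(1/eps) / ln 2 cells of b bits."""
    n_m = math.ceil(n * math.log2(1 / eps) / math.log(2))
    mem = n_m * b
    return mem, mem / n


if __name__ == "__main__":
    n_u, eps = 1_000_000, 0.01
    for n in (1_000, 600_000):  # a sparse set and a dense set
        _, e_bitmap = bitmap_cost(n_u, n)
        _, e_bloom = bloom_cost(n, eps)
        print(f"n={n}: bitmap {e_bitmap:.2f} bits/element, "
              f"Bloom {e_bloom:.2f} bits/element")
```

Under these assumptions the Bloom-style structure needs far fewer bits per element when the set is sparse, while the bitmap needs fewer when the set covers more than half of the universe, which matches the recommendation above. The bitmap also answers queries without misjudgment, which the per-element figure does not capture.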
3) Consistency evaluation
Consistency evaluation measures how well the represented set S of the candidate data structure A=(D, R, F) reflects changes to the stored set DS in the system. It covers three aspects. First, when an element is added to DS, the two relations R1=<d, M^r> and R2=<M^t, S> of A=(D, R, F) should reflect the change promptly, i.e. the element is added to S as well. Second, when an element is deleted from DS, A=(D, R, F) reflects the change promptly if S deletes the element accordingly. Finally, when a membership query is issued for an arbitrary element of the test set TS, query(d) should report whether the element exists in the system, either exactly or within a certain error range. The formal description is as follows:
1) When an element is added to the candidate data structure A=(D, R, F), the consistency condition is: if DS=DS ∪ {d}, then there is a mapping f: U→M^r such that S=S ∪ {d}.
2) When an element is deleted from the candidate data structure A=(D, R, F), the consistency condition is: if DS=DS−{d} and f: U→M^r is invertible, then there is an inverse f^-1 such that S=S−{d}.
3) When the candidate data structure A=(D, R, F) answers a membership query, the consistency condition is: query(d) returns the truth value of d ∈ DS.
Consistency can be measured by the misjudgment rate, i.e. the probability that query(d) returns true (or false) incorrectly, within a certain error range.
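The measurement described above can be sketched as a small harness. It assumes, hypothetically, that a candidate structure exposes `insert(d)` and `query(d)`; `misjudgment_rate` and the `ExactSet` stand-in are illustrative names and not part of the method as claimed.

```python
def misjudgment_rate(structure, stored, test_set):
    """Fraction of test elements d for which query(d) disagrees with d in DS."""
    ds = set(stored)
    for d in ds:
        structure.insert(d)
    test = list(test_set)
    if not test:
        return 0.0
    wrong = sum(1 for d in test if structure.query(d) != (d in ds))
    return wrong / len(test)


class ExactSet:
    """Hypothetical baseline structure that never misjudges."""

    def __init__(self):
        self._items = set()

    def insert(self, d):
        self._items.add(d)

    def query(self, d):
        return d in self._items
```

Any of the candidate structures modeled below could be passed in place of `ExactSet`, provided it offers the same two operations.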
3.4 Algorithm modeling and evaluation flow
Based on the problem modeling and algorithm modeling above, and as shown in Figure 1, a given candidate data structure is first described with the unified model according to the definitions above. The main results of this step are: the set of data objects processed by the candidate data structure, consisting of the universe of data to be processed and the representation of the settable space of the candidate data structure in memory; the mapping relations between the different data objects to be processed; and the operation functions of the candidate data structure. The operation functions include the mapping function that selects the positions to set in memory when a data item is stored, and the membership-query function for an arbitrary data item. Then, using the notation of this unified mathematical model, the rejection time, space overhead, representation efficiency and misjudgment rate of the candidate data structure are derived. Finally, according to the three evaluation indices given, the following are derived: the trends of the rejection time and the space overhead, the relation between space overhead and representation performance, and whether the consistency between the in-memory representation of the candidate data structure and the system data is within an acceptable range. Note that the acceptable range depends closely on the actual hardware resources (such as memory size), so this model only gives the three evaluation and comparison indices and does not give specific thresholds.
Modeling of five example membership-query algorithms
1) Bitmap data structure
Representing a static data set in the system with a bitmap means using one bit of the bitmap for each possible element value; we assume here that the static data set contains no duplicate elements. As with the simple bitmap index mentioned in the related work, each bit simply represents one distinct value. To insert a new element, the bit corresponding to that element is set to 1. To answer a membership query for an element, the bit corresponding to that element is read directly. For example, to represent the set {3, 7, 8, 10} with a bitmap, as shown in Figure 4, the 3rd, 7th, 8th and 10th bits of the bitmap are set to 1. The simple bitmap scheme is very efficient in both space and query processing time when the range of the set is small. As the range grows, however, the space overhead grows with it, which leads to the sparsity problem.
The bitmap data structure is modeled as follows:
1) D={U, M}, where U is the universe and M is the settable set formed by the contiguous space cells in the bit array of the bitmap data structure BM. Each element of M is a one-bit space cell, and |M|=n_m, where n_m is the length of the bit array of BM.
2) R={R1, R2}, where R1=<d, M^r>, d ∈ DS, DS is the stored set, and M^r denotes r bits selected from the settable set of BM and set, with r=1. R2=<M^t, S> denotes the represented set S corresponding to M^t after t elements of the settable set of BM have been set.
3) F={f: U→M^r, query(d)}, where f: U→M^r with r=1, dom(f)=DS, d ∈ TS, and TS is the test set.
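The bitmap scheme described above can be sketched in a few lines. This is a minimal illustration that assumes the universe is the integers 0 to n_u−1; the class and method names are hypothetical.

```python
class Bitmap:
    """One bit per possible value of the universe {0, ..., n_u - 1} (r = 1, b = 1)."""

    def __init__(self, n_u):
        self.n_u = n_u
        self.bits = bytearray((n_u + 7) // 8)  # n_u bits, rounded up to whole bytes

    def _check(self, d):
        if not 0 <= d < self.n_u:
            raise ValueError(f"{d} is outside the universe")

    def insert(self, d):
        """Set the single bit that corresponds to d."""
        self._check(d)
        self.bits[d >> 3] |= 1 << (d & 7)

    def query(self, d):
        """Read the bit that corresponds to d; the answer is exact."""
        self._check(d)
        return bool(self.bits[d >> 3] & (1 << (d & 7)))


bm = Bitmap(16)
for d in (3, 7, 8, 10):  # the example set from the text
    bm.insert(d)
```

The space used depends only on n_u, not on how many elements are stored, which is the sparsity problem noted above.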
2) Standard Bloom filter
A Bloom filter is a data structure for testing whether an element belongs to a set. It is a bit array of m bits together with k independent hash functions; each hash function maps into the range {1, 2, 3, …, m} and yields a bit position to set.
Initially, all m bits of the bit array are 0. When an element is inserted into the set, the Bloom filter evaluates the k independent hash functions and sets the corresponding positions of the bit array to 1. To test whether an element is in the set, the Bloom filter computes the k hash values of the element, obtains the k corresponding positions of the bit array, and checks whether all k are set to 1. If any position is 0, the element is definitely not in the set; if all are 1, the element may be in the set, or may not be. In other words, even when all k positions are set to 1, the element may still be absent.
As shown in Figure 5, the Bloom filter is initially all 0, and the elements x1 and x2 each set the corresponding positions of the Bloom filter to 1 through the hash functions. To test whether y1 and y2 are in the set represented by the Bloom filter, their positions are computed. One of the positions of y1 is 0, so y1 is definitely not in the set. All positions of y2 are 1, so either y2 is in the set represented by the Bloom filter or this is a false positive. The standard Bloom filter is modeled as follows:
1) D={U, M}, where U is the universe of the stored set DS and M is the settable set formed by the contiguous space cells in the bit array of the standard Bloom filter BF. Each element of M is a one-bit space cell, and |M|=n_m, where n_m is the length of the bit array of BF.
2) R={R1, R2}, where R1=<d, M^r>, d ∈ DS, DS is the stored set, and M^r denotes r bits selected from the settable set of the Bloom filter BF, i.e. its bit array, and set. R2=<M^t, S> denotes the represented set S corresponding to M^t after t bits of the bit array of BF have been set.
3) F={f: U→M^r, query(d)}, where f: U→M^r with r=k, the number of hash functions of the Bloom filter BF, d ∈ TS, and TS is the test set.
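The insert and query operations of the standard Bloom filter described above can be sketched as follows. The text only requires k independent hash functions; deriving the k positions from one SHA-256 digest by double hashing is an assumption made here for brevity, positions are numbered from 0 instead of 1, and the class name is hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item):
        # Assumption: double hashing over a SHA-256 digest stands in for
        # the k independent hash functions.
        digest = hashlib.sha256(repr(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, item):
        """Set the k positions of the item to 1."""
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[p] for p in self._positions(item))
```

A `False` answer is always correct, while a `True` answer may be a false positive, as in the y2 case of Figure 5.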
3) Counting Bloom filter
A counting Bloom filter extends each bit of the Bloom filter's bit array into a counter that records how many times that position has been set, which removes the Bloom filter's limitation that elements cannot be deleted. In this data structure each entry is a small counter associated with the corresponding position of the basic Bloom filter. Initially all counters are 0. When an element a is inserted or deleted, the filter evaluates the k independent hash functions and increments or decrements the counters at the corresponding positions, i.e. (c(h1(a)), c(h2(a)), …, c(hk(a))). As with the standard Bloom filter, a membership query computes the k hash values of the element and reads the counters at the k corresponding positions. If any counter is 0, the element is definitely not in the set represented by the counting Bloom filter. If all the counters are non-zero, then, as with the standard Bloom filter, the element may be in the set, but the answer may also be a false positive (an error of the first kind). As shown in Figure 6, the filter is initially all 0, and the elements x1 and x2 each increment the corresponding positions through the hash functions; at a position where they collide, the counter is incremented once more. The counting Bloom filter CBF=(D, R, F) is modeled as follows:
1) D={U, M}, where U is the universe of the stored set DS and M is the settable set formed by the contiguous space cells of the counter array of the counting Bloom filter CBF. Each element of M is a space cell of several bits (denoted b bits) corresponding to one counter of CBF, and |M|=n_m, where n_m is the number of counters of CBF.
2) R={R1, R2}, where R1=<d, M^r>, d ∈ DS, DS is the stored set, and M^r denotes the r counters selected from the settable set of CBF, i.e. from its m counters, and incremented by 1. R2=<M^t, S> denotes the represented set S corresponding to M^t after t counters of CBF have been incremented.
3) F={f: U→M^r, query(d)}, where f: U→M^r with r=k, the number of hash functions of CBF, d ∈ TS, and TS is the test set.
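The counting Bloom filter described above can be sketched as follows. As assumptions of this sketch, the k positions are derived by double hashing over a SHA-256 digest, and the counters are unbounded Python integers, whereas a b-bit counter would need overflow handling. The class name is hypothetical.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m  # unbounded here; a b-bit counter could overflow

    def _positions(self, item):
        # Assumption: double hashing stands in for k independent hash functions.
        digest = hashlib.sha256(repr(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, item):
        """Increment the counter at each of the item's k positions."""
        for p in self._positions(item):
            self.counters[p] += 1

    def query(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.counters[p] > 0 for p in self._positions(item))

    def delete(self, item):
        """Decrement the item's counters, but only if the item appears present."""
        if not self.query(item):
            return False
        for p in self._positions(item):
            self.counters[p] -= 1
        return True
```

Refusing to delete an item that does not appear present keeps every counter non-negative, although deleting an item that is only a false positive can still corrupt the filter.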
4) Dynamic Bloom filter
A dynamic Bloom filter is a dynamic s × m bit matrix made up of s standard (or counting) Bloom filters, each of size m with k hash functions. At any given time only one Bloom filter in the dynamic Bloom filter is active; the others are inactive. The number of elements inserted into each Bloom filter is tracked. When an element is inserted, the first Bloom filter whose element counter is below a specific threshold is chosen as the active Bloom filter. If no such Bloom filter can be found, a new one is created and made active, and the element is then inserted into the active Bloom filter. For a membership query, the Bloom filters in the set are iterated over, and true is returned if any of them contains the element. If the matrix is made up of counting Bloom filters, elements can be deleted. To delete an element, the sub-filters that claim to contain it must first be found. If exactly one Bloom filter claims to contain the element, the corresponding positions are simply decremented by 1. If more than one does, deleting the element could cause up to k potential false negatives. In that case, to guarantee that there are no false negatives, the element's positions cannot be decremented, so the delete operation fails, which increases the false-positive rate to some extent. The dynamic Bloom filter DY=(D, R, F) is modeled as follows:
1) D={U, M}, where U is the universe of the stored set DS and M is the union of the settable sets formed by the contiguous space cells of the bit arrays of the s standard Bloom filters BF (or counting Bloom filters CBF), written M=M1 ∪ M2 … ∪ Ms. Each element of Mi is a space cell of several bits, and |Mi|=m for 1≤i≤s, where m is the length of the bit array of BF (or CBF); the number of elements of the settable set is therefore |M|=n_m.
2) R={R1, R2}, where R1=<d, Mi^r>, d ∈ DS, DS is the stored set, and Mi^r denotes r elements selected from the settable set of the i-th, active, BF (or CBF) of the matrix and set. R2=<Mi^t, Si> denotes the represented set Si corresponding to Mi^t after t bits of the bit array of the i-th active BF (or CBF) have been set; the represented set of the whole dynamic Bloom filter DY is then obtained from the sets Si.
3) F={f: U→M^r, query(d)}, where f: U→M^r with r=k, the number of hash functions of BF (or CBF), d ∈ TS, and TS is the test set.
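The insertion and query behaviour of the dynamic Bloom filter described above can be sketched as follows, using standard (non-counting) sub-filters, so deletion is not covered. The helper class `_SubFilter`, the double-hashing scheme and the parameter name `threshold` are assumptions of this sketch.

```python
import hashlib


class _SubFilter:
    """Standard Bloom filter that also tracks how many elements it holds."""

    def __init__(self, m, k):
        self.m, self.k, self.count = m, k, 0
        self.bits = bytearray(m)

    def _positions(self, item):
        # Assumption: double hashing stands in for k independent hash functions.
        digest = hashlib.sha256(repr(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1
        self.count += 1

    def query(self, item):
        return all(self.bits[p] for p in self._positions(item))


class DynamicBloomFilter:
    def __init__(self, m, k, threshold):
        self.m, self.k, self.threshold = m, k, threshold
        self.filters = []  # the s rows of the s x m matrix

    def insert(self, item):
        # The first sub-filter below the threshold is the active one.
        active = next((f for f in self.filters if f.count < self.threshold), None)
        if active is None:  # none has room: create a new active sub-filter
            active = _SubFilter(self.m, self.k)
            self.filters.append(active)
        active.insert(item)

    def query(self, item):
        """True if any sub-filter claims to contain the item."""
        return any(f.query(item) for f in self.filters)
```

Because a query probes every sub-filter, the rejection time grows with s, which is one of the trade-offs the evaluation indices are meant to expose.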
5) D-left Bloom filter
This data structure combines D-left hashing with fingerprints. The hash table is divided into d subtables of equal size, each with n/d buckets, where n is the total number of buckets. For every element a fingerprint is produced by a hash function. The fingerprint has two parts: a bucket index and a remainder. The bucket index is the address of the bucket in which the remainder is stored. Each bucket has room for c cells, and each cell is a fixed number of bits used to store a remainder and a counter.
As shown in Figure 7, suppose the hash table is divided into 4 subtables, each subtable contains 8 buckets, and each bucket has 4 cells. To insert an element x, its hash value f_x=H(x)=(b, r) is computed first, which selects a candidate bucket at the same position in each of the d subtables together with a fingerprint. Additional (pseudo-random) permutations are then used to compute the d positions and fingerprints corresponding to f_x, Pi(f_x)=(bi, ri) for 1≤i≤d, so that the fingerprints of a given element stored in different subtables differ. Next, it is checked whether any remainder ri is already present in its bucket bi; if so, the corresponding counter is incremented by 1. Otherwise, the element is inserted into the candidate bucket holding the fewest elements; in case of a tie, it goes into the bucket of the leftmost subtable. A query probes the d subtables in parallel, looking for the remainder and the counter stored with it. On deletion, the counter is decremented by 1. These counters are far fewer than those of a standard counting Bloom filter, mainly because the fingerprint-based D-left construction produces fewer collisions.
Let k denote the number of subtables, m the number of buckets in each subtable, and c the number of cells in each bucket. The D-left Bloom filter DL=(D, R, F) is modeled as follows:
1) D={U, M}, where U is the universe of the stored set DS and M is the settable set formed by all buckets of all subtables of the D-left Bloom filter, i.e. the union of the sets of buckets of the individual subtables, formally M=M1 ∪ M2 … ∪ Mk. Each element of Mi consists of c cells of the D-left Bloom filter, and each cell contains several bits (denoted b bits). |M|=n_m, where n_m is the total number of buckets of DL, so n_m=k×m.
2) R={R1, R2}, where R1=<d, M^r>, d ∈ DS, DS is the stored set, and M^r denotes r cells selected from the settable set of DL to store a fingerprint and a counter. R2=<M^t, S> denotes the represented set S corresponding to M^t after t cells of DL have stored fingerprints.
3) F={f: U→M^r, query(d)}, where the set mapping f: U→M^r has r=1, i.e. each mapping selects only one bucket as the address at which the remainder is stored, d ∈ TS, and TS is the test set.
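The insert, query and delete steps of the D-left Bloom filter described above can be sketched as follows. Several details are assumptions of this sketch and are not taken from the text: the fingerprint is the low bits of a SHA-256 digest, the permutations Pi are multiplications by odd constants modulo a power of two, the bucket count must be a power of two, and each bucket is held as a Python dict from remainder to counter. All names are hypothetical.

```python
import hashlib


class DLeftCountingFilter:
    def __init__(self, d=4, buckets=8, cells=4, r_bits=16):
        if buckets & (buckets - 1):
            raise ValueError("buckets must be a power of two in this sketch")
        self.d, self.cells, self.r_bits = d, cells, r_bits
        self.n_bits = (buckets.bit_length() - 1) + r_bits  # bucket bits + remainder bits
        # Assumption: odd multipliers give permutations of [0, 2**n_bits).
        self.multipliers = [0x9E3779B1 + 2 * i for i in range(d)]
        # tables[i][b] maps remainder -> counter, with at most `cells` entries.
        self.tables = [[{} for _ in range(buckets)] for _ in range(d)]

    def _candidates(self, item):
        """Return one (bucket index, remainder) pair per subtable."""
        mask = (1 << self.n_bits) - 1
        digest = hashlib.sha256(repr(item).encode()).digest()
        f = int.from_bytes(digest[:8], "big") & mask
        out = []
        for a in self.multipliers:
            p = (a * f) & mask  # P_i(f_x)
            out.append((p >> self.r_bits, p & ((1 << self.r_bits) - 1)))
        return out

    def insert(self, item):
        cands = self._candidates(item)
        for i, (b, r) in enumerate(cands):
            if r in self.tables[i][b]:  # remainder already stored: bump counter
                self.tables[i][b][r] += 1
                return True
        # Least-loaded candidate bucket; min() keeps the leftmost on ties.
        i = min(range(self.d), key=lambda j: len(self.tables[j][cands[j][0]]))
        b, r = cands[i]
        if len(self.tables[i][b]) >= self.cells:
            return False  # every candidate bucket is full
        self.tables[i][b][r] = 1
        return True

    def query(self, item):
        return any(r in self.tables[i][b]
                   for i, (b, r) in enumerate(self._candidates(item)))

    def delete(self, item):
        for i, (b, r) in enumerate(self._candidates(item)):
            bucket = self.tables[i][b]
            if r in bucket:
                bucket[r] -= 1
                if bucket[r] == 0:
                    del bucket[r]
                return True
        return False
```

With the default arguments the layout matches the Figure 7 example of 4 subtables, 8 buckets per subtable and 4 cells per bucket.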
The above describes preferred embodiments of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.
Claims (4)
1. A comparison method for membership-query algorithms, characterized in that it comprises the following steps:
A. representing the stored set DS held in the system by a candidate data structure A=(D, R, F), where the candidate data structure is a data-structure algorithm implemented in memory for representing and querying the large volume of data in the storage system; D denotes the operation data objects, belonging to different sets, that are involved in the data structure; R denotes the mapping relations between the different data objects in D; F is the set of operation functions of the candidate data structure; and the stored set DS is the set of data elements stored in the system; the operation data objects comprise the data set and the settable space cells, where setting means setting a space cell to 1 or assigning it a value, and a space cell is represented in memory by one or more contiguous bits;
B. calculating the evaluation parameters of the candidate data structure, the evaluation parameters comprising rejection time, space overhead, representation efficiency and misjudgment rate; the rejection time is the time the candidate data structure A=(D, R, F) takes, given a query data item, to determine that the item is not in the system; the space overhead is the number of space cells the candidate data structure A=(D, R, F) requires in memory; the representation efficiency is the ratio of the space overhead of the candidate data structure A=(D, R, F) to the total number of elements of the stored set; the misjudgment rate is the probability of an inaccurate response to a membership query; a membership query means that, given a data item, the user accesses the candidate data structure to ask whether the item exists in the storage system;
C. deriving from the above evaluation parameters the trends of the rejection time and the space overhead, the relation between space overhead and representation performance, and whether the consistency between the in-memory representation of the candidate data structure and the system data is within an acceptable range, so as to obtain the most suitable membership-query algorithm; the representation performance is measured by the representation efficiency, which is the average number of bits each data item in the system requires in memory; the candidate data structures comprise the bitmap data structure, the standard Bloom filter, the counting Bloom filter, the dynamic Bloom filter and the D-left Bloom filter.
2. The comparison method for membership-query algorithms according to claim 1, characterized in that the rejection time and space overhead in step C are used to evaluate the trade-off between the time the membership-query algorithm takes to determine that a data item is not in the system and the space overhead of the membership-query data structure.
3. The comparison method for membership-query algorithms according to claim 1, characterized in that the representation efficiency in step C is used to evaluate the space overhead the membership-query algorithm uses to represent each data item, for different data volumes or as the data involved changes.
4. The comparison method for membership-query algorithms according to claim 1, characterized in that the misjudgment rate in step C is used to evaluate the ability of the membership-query algorithm, when answering a membership query for an arbitrary element of the test set, to reflect the system data accurately or within a certain error range; the test set is any new data set that arrives at the system and must be tested for whether it already exists in the stored set of the system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410758136.9A CN104504011B (en) | 2014-12-10 | 2014-12-10 | It is a kind of to look into the comparative approach for depositing algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104504011A CN104504011A (en) | 2015-04-08 |
CN104504011B true CN104504011B (en) | 2018-05-15 |
Family
ID=52945409
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410758136.9A Active CN104504011B (en) | 2014-12-10 | 2014-12-10 | It is a kind of to look into the comparative approach for depositing algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104504011B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105228265A (en) * | 2015-08-25 | 2016-01-06 | 深圳市唯传科技有限公司 | A kind of sharing method based on internet of things equipment and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7788242B2 (en) * | 2006-08-23 | 2010-08-31 | Oracle America, Inc. | Method and system for implementing a concurrent set of objects |
CN101923568A (en) * | 2010-06-23 | 2010-12-22 | 北京星网锐捷网络技术有限公司 | Method for increasing and canceling elements of Bloom filter and Bloom filter |
CN103559303A (en) * | 2013-11-15 | 2014-02-05 | 南京大学 | Evaluation and selection method for data mining algorithm |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9665572B2 (en) * | 2012-09-12 | 2017-05-30 | Oracle International Corporation | Optimal data representation and auxiliary structures for in-memory database query processing |
Non-Patent Citations (4)
Title |
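---|
"Dynamic Count Filters"; J. Aguilar-Saborit et al.; SIGMOD Record; 2006-03-31; Vol. 35 (No. 1); full text * |
"Optimizing Data Popularity Conscious Bloom Filters"; Ming Zhong et al.; PODC'08; 2008-08-21; full text * |
"Research on an index linked list for grid resource reservation"; Wu Libing et al.; Journal of Wuhan University of Technology (Information & Management Engineering); 2011-12-31; Vol. 33 (No. 6); full text * |
"An improved algorithm for pathfinding in games"; Dong Gaifang et al.; Computer Engineering and Applications; 2009-12-31; full text * |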
Also Published As
Publication number | Publication date |
---|---|
CN104504011A (en) | 2015-04-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||