CN104391866A - Approximate membership query method based on high-dimension data filter - Google Patents
Approximate membership query method based on high-dimension data filter
- Publication number
- CN104391866A (application number CN201410578880.0A)
- Authority
- CN
- China
- Prior art keywords
- bmlbf
- data
- group
- distance
- namely
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Fuzzy Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an approximate membership query method based on a high-dimensional data filter. A new structure, supported by a new distance-sensitive hash function, represents both the multidimensional data in a target data set and the multidimensional data to be queried. As a result, the filter does not need to be reconstructed, approximate membership queries at multiple filtering distance parameters can be supported, and space cost is greatly reduced. The method uses several function groups, each containing several functions, and applies an AND-OR combination when finally deciding whether a query is an approximate member of the target data set Ω, which reduces the false negative rate of the filter.
Description
Technical Field
The present invention relates to approximate membership query methods, and in particular to an approximate membership query method based on a high-dimensional data filter.
Background
In many applications, the closer a query is to the target data, the more valuable it is. For example, a security officer may want to check whether an unknown substance (with some detectable high-dimensional features) is among the hazardous chemicals on a list; a network administrator may want to know whether a user's behavioral features are harmful; a photography contest judge may want to check whether a submitted photo is similar to any photo in a large database. All of these queries require judging the distance between the query and the data in a (target) data set. For a small, low-dimensional data set this can be solved by linear search, but linear search over a massive high-dimensional data set is very time-consuming and in many cases cannot meet real-time requirements. To speed up processing, a high-dimensional data filter can be built to represent the target data set and filter out most queries by distance; the few remaining queries can then be processed further by conventional methods, which significantly improves overall system performance.
What such a filter performs is an approximate membership query (AMQ), that is, it answers the question "is the query close to some item in the data set?". Existing AMQ filters mainly combine locality-sensitive hashing (LSH) with Bloom filter techniques; the main representatives are DSBF (distance-sensitive Bloom filters) and LSBF (locality-sensitive Bloom filters).
DSBF first combined LSH with a Bloom filter to filter AMQ queries. It returns approximate set-membership results, and the degree of approximation can follow different criteria. It improves the speed and space efficiency of network and database applications by avoiding expensive comparison operations such as a full nearest-neighbor search. LSBF improves on DSBF: it constructs the Bloom filter with LSH functions to filter AMQ queries and uses an additional bit vector to reduce the false positive rate.
However, filtering AMQ queries with DSBF or LSBF has one restriction: each can filter AMQ queries only for one given distance. Choosing a suitable distance is not easy, and a value that is too large or too small may lead to unacceptable query results. Moreover, once the filtering distance parameter of a filter is fixed it cannot be changed; to filter at several different distances at once, the filter must be reconstructed from the raw data for each new filtering distance parameter, yet the raw data are usually not kept, in order to save space. In addition, the false negative rates of DSBF and LSBF are relatively high.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a filter-based approximate membership query method for high-dimensional data that, starting from a filter with a single fixed filtering distance parameter, supports approximate membership queries at multiple filtering distance parameters without reconstructing the filter.
The technical solution adopted by the present invention to solve this problem is a filter-based approximate membership query method for high-dimensional data, in which the target data set is defined as Ω and the distance-sensitive hash function H is defined as
$$H^{\theta}_{t,j}(o) = \left\lfloor \frac{a_{t,j} \cdot o}{2^{\theta} w} \right\rfloor,$$
where t = 1, 2, ..., k and j = 1, 2, ..., L; L is the number of function groups; k is the number of functions in each function group; o is a multidimensional data item; $a_{t,j}$ is a random vector with the same dimension as o; $\cdot$ is the dot product; $2^{\theta} w$ is the filtering distance parameter, with θ = 0, 1, 2, ..., S-1; S is the number of distinct filtering distances; w is the minimum filtering distance parameter, defined as a positive real number; and $\lfloor\ \rfloor$ denotes rounding down. The following steps are then carried out:
(1) Build a bit vector of capacity m with addresses 0 to m-1, denoted BMLBF, and set BMLBF[i] = 0 for i = 0, 1, 2, ..., m-1;
(2) Represent each multidimensional data item $o_y$ in the target data set Ω with the distance-sensitive hash function at θ = 0, namely $H^{0}_{t,j}(o_y) = \lfloor a_{t,j} \cdot o_y / w \rfloor$, where y = 1, 2, ..., n, and set the corresponding k × L bits of BMLBF to 1, namely BMLBF[$H^{0}_{t,j}(o_y)$] = 1;
(3) Denote the multidimensional data item to be queried as q and represent it with the same distance-sensitive hash function, namely $H^{\theta}_{t,j}(q) = \lfloor a_{t,j} \cdot q / (2^{\theta} w) \rfloor$;
(4) Convert each of the k hash values of the j-th group, namely $H^{\theta}_{1,j}(q), H^{\theta}_{2,j}(q), ..., H^{\theta}_{k,j}(q)$, to a binary number and append θ zeros to it, giving k addresses denoted $A_{1,j}, A_{2,j}, ..., A_{k,j}$;
(5) If at least one of BMLBF[$A_{1,j}$], BMLBF[$A_{1,j}+1$], ..., BMLBF[$A_{1,j}+2^{\theta}-1$] is 1, address $A_{1,j}$ is said to pass, and the same test applies to each of the other addresses of the group; if $A_{1,j}, A_{2,j}, ..., A_{k,j}$ all pass, the j-th group is said to pass; if any one of the L groups passes, q is confirmed to be an approximate member of the target data set Ω.
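Steps (1) to (5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the hash has the form $\lfloor a_{t,j} \cdot o / (2^{\theta} w) \rfloor$ suggested by the symbol definitions, draws each coordinate of $a_{t,j}$ from a standard normal distribution as the claim states, and reduces addresses modulo m so that they fall in 0 to m-1 (the source does not say how out-of-range or negative hash values are handled). The class name, method names, and the `seed` parameter are hypothetical.

```python
import numpy as np


class BMLBF:
    """Illustrative sketch of the bit vector described in steps (1)-(5)."""

    def __init__(self, dim, m, k, L, w, seed=0):
        rng = np.random.default_rng(seed)
        self.m, self.k, self.L, self.w = m, k, L, w
        # a[j, t] plays the role of the random vector a_{t,j};
        # each coordinate is drawn from a standard normal distribution.
        self.a = rng.standard_normal((L, k, dim))
        # Step (1): m bits, all initially 0.
        self.bits = np.zeros(m, dtype=bool)

    def _hashes(self, x, theta):
        # Assumed form: floor(a_{t,j} . x / (2^theta * w)); result shape (L, k).
        proj = self.a @ np.asarray(x, dtype=float)
        return np.floor(proj / (2 ** theta * self.w)).astype(np.int64)

    def add(self, o):
        # Step (2): set k * L bits using theta = 0.
        # ASSUMPTION: addresses are wrapped modulo m (not stated in the source).
        self.bits[self._hashes(o, 0) % self.m] = True

    def query(self, q, theta):
        h = self._hashes(q, theta)            # step (3)
        base = h * (2 ** theta)               # step (4): append theta zero bits
        offsets = np.arange(2 ** theta)
        # Step (5): an address passes if any bit in [A, A + 2^theta - 1] is 1.
        addr = (base[..., None] + offsets) % self.m   # shape (L, k, 2^theta)
        address_pass = self.bits[addr].any(axis=2)
        group_pass = address_pass.all(axis=1)  # AND within a group
        return bool(group_pass.any())          # OR across the L groups
```

A filter built once with `add` can then be queried at any level, for example `f.query(q, 0)` for the finest distance w and `f.query(q, 3)` for distance 8w, without touching the stored bits.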
Compared with the prior art, the advantage of the present invention is that a new structure, supported by a newly defined distance-sensitive hash function, represents both the multidimensional data in the target data set and the multidimensional data to be queried. The filter does not need to be reconstructed, approximate membership queries at multiple filtering distance parameters are supported, and space cost is greatly reduced. The invention also uses multiple function groups, each containing multiple functions, and decides whether a query is an approximate member of the target data set Ω with an AND-OR combination, which reduces the false negative rate of the filter.
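The effect of the AND-OR combination on the pass probability can be sketched with a short calculation. As a simplifying assumption that the source does not state, suppose each of the k addresses in a group passes independently with the same probability p for a given query; p, and the numbers in the comments, are illustrative only.

```latex
P_{\text{group}} = p^{k}, \qquad
P_{\text{filter}} = 1 - \bigl(1 - p^{k}\bigr)^{L}
% illustrative values only: p = 0.9, k = 2, L = 3
% P_group = 0.81, P_filter = 1 - 0.19^{3} \approx 0.9931
```

Under this assumption, adding groups (a larger L with k fixed) can only raise $P_{\text{filter}}$, which is the direction that lowers false negatives for queries that are truly close to Ω, while a larger k makes each group stricter.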
Brief Description of the Drawings
Fig. 1 compares the false positive rate and false negative rate of the method of the invention with those of the prior-art LSBF method in the specific embodiment;
Fig. 2 shows the error rates of the virtual filters of the method of the invention in the specific embodiment;
Fig. 3 compares the space cost of the method of the invention with that of the prior-art LSBF method in the specific embodiment.
Detailed Description
The present invention is described in further detail below with reference to the drawings and an embodiment.
We use a real handwritten digit recognition data set to evaluate and compare the method of the invention with the existing LSBF method. The data set contains 5,620 samples; each sample represents a handwritten Arabic numeral, that is, '0', '1', ..., '9', with 64 features. Each feature value is an integer from 0 to 16. To test the false negative rate, the '0' samples are split into two groups: one group of 10 samples serves as the set Ω and the other group serves as the query data q. To test the false positive rate, 10 samples of '1' are taken as the set Ω and the other data serve as the query data q. The reported results are averages over 10,000 random runs.
In the filter-based approximate membership query method for high-dimensional data, the target data set is defined as Ω and the distance-sensitive hash function H is defined as
$$H^{\theta}_{t,j}(o) = \left\lfloor \frac{a_{t,j} \cdot o}{2^{\theta} w} \right\rfloor,$$
where t = 1, 2, ..., k and j = 1, 2, ..., L; L is the number of function groups; k is the number of functions in each function group; o is a multidimensional data item; $a_{t,j}$ is a random vector with the same dimension as o; $\cdot$ is the dot product; $2^{\theta} w$ is the filtering distance parameter, with θ = 0, 1, 2, ..., S-1; S is the number of distinct filtering distances; w is the minimum filtering distance parameter, defined as a positive real number; and $\lfloor\ \rfloor$ denotes rounding down. The following steps are then carried out:
(1) Build a bit vector of capacity m = 2 × 10^5 with addresses 0 to m-1, denoted BMLBF, and set BMLBF[i] = 0 for i = 0, 1, 2, ..., m-1;
(2) Represent each 64-dimensional data item $o_y$ in the target data set Ω with the distance-sensitive hash function at θ = 0, namely $H^{0}_{t,j}(o_y) = \lfloor a_{t,j} \cdot o_y / w \rfloor$, where y = 1, 2, ..., 10, k = 2, L = 3, θ = 0, $a_{t,j}$ is a 64-dimensional random vector, w is the minimum filtering distance parameter, defined as a positive real number, and $\lfloor\ \rfloor$ denotes rounding down; set the corresponding k × L bits of BMLBF to 1, namely BMLBF[$H^{0}_{t,j}(o_y)$] = 1;
(3) Denote the 64-dimensional data item to be queried as q and represent it with the distance-sensitive hash function $H^{\theta}_{t,j}(q) = \lfloor a_{t,j} \cdot q / (2^{\theta} w) \rfloor$, where S is the number of distinct filtering distances and S = 4;
(4) Convert each of the 2 hash values of group 1, namely $H^{\theta}_{1,1}(q)$ and $H^{\theta}_{2,1}(q)$, to a binary number and append θ zeros to it, giving 2 addresses denoted $A_{1,1}$, $A_{2,1}$; convert each of the 2 hash values of group 2, namely $H^{\theta}_{1,2}(q)$ and $H^{\theta}_{2,2}(q)$, to a binary number and append θ zeros to it, giving 2 addresses denoted $A_{1,2}$, $A_{2,2}$; convert each of the 2 hash values of group 3, namely $H^{\theta}_{1,3}(q)$ and $H^{\theta}_{2,3}(q)$, to a binary number and append θ zeros to it, giving 2 addresses denoted $A_{1,3}$, $A_{2,3}$;
(5) If at least one of BMLBF[$A_{1,1}$], BMLBF[$A_{1,1}+1$], ..., BMLBF[$A_{1,1}+2^{\theta}-1$] is 1, address $A_{1,1}$ is said to pass; if at least one of BMLBF[$A_{2,1}$], BMLBF[$A_{2,1}+1$], ..., BMLBF[$A_{2,1}+2^{\theta}-1$] is 1, address $A_{2,1}$ is said to pass; if $A_{1,1}$ and $A_{2,1}$ both pass, group 1 is said to pass. Likewise, if at least one of BMLBF[$A_{1,2}$], BMLBF[$A_{1,2}+1$], ..., BMLBF[$A_{1,2}+2^{\theta}-1$] is 1, address $A_{1,2}$ is said to pass; if at least one of BMLBF[$A_{2,2}$], BMLBF[$A_{2,2}+1$], ..., BMLBF[$A_{2,2}+2^{\theta}-1$] is 1, address $A_{2,2}$ is said to pass; if $A_{1,2}$ and $A_{2,2}$ both pass, group 2 is said to pass. Likewise, if at least one of BMLBF[$A_{1,3}$], BMLBF[$A_{1,3}+1$], ..., BMLBF[$A_{1,3}+2^{\theta}-1$] is 1, address $A_{1,3}$ is said to pass; if at least one of BMLBF[$A_{2,3}$], BMLBF[$A_{2,3}+1$], ..., BMLBF[$A_{2,3}+2^{\theta}-1$] is 1, address $A_{2,3}$ is said to pass; if $A_{1,3}$ and $A_{2,3}$ both pass, group 3 is said to pass. If any one of the 3 groups passes, q is confirmed to be an approximate member of the target data set Ω.
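The address construction in steps (4) and (5) can be illustrated with concrete numbers. The sketch below is only an illustration: the function name `candidate_addresses` is hypothetical, and the example uses a nonnegative hash value for readability. Appending θ zeros to a binary number is the same as shifting it left by θ bits.

```python
def candidate_addresses(hash_value: int, theta: int) -> range:
    """Addresses examined for one hash value at filtering level theta."""
    base = hash_value << theta            # append theta zero bits
    return range(base, base + 2 ** theta)


# Hash value 5 is 101 in binary; with theta = 2 it becomes 10100 = 20,
# so the addresses 20, 21, 22 and 23 are examined.
print(list(candidate_addresses(5, 2)))    # [20, 21, 22, 23]
```

At θ = 0 the range contains the single address equal to the hash value, which matches the way bits are set in step (2).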
The experimental results in Fig. 1 show that the false negative rate of the method of the invention is much lower than that of the prior-art LSBF method, which makes the method better suited to applications that want to lose as few true matches as possible during filtering. The space cost comparison in Fig. 3 shows that the method of the invention needs to build only one physical filter to serve 4 different filtering distances, which is equivalent to using 4 virtual filters, whereas the prior-art LSBF method must build a separate physical filter for each filtering distance. Therefore, as the number of required filtering distances S increases, the space cost of the prior-art LSBF method grows while that of the method of the invention remains unchanged.
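Why one bit vector built at θ = 0 can answer queries at every coarser distance can be sketched with a standard floor identity. The sketch assumes the hash has the form $H^{\theta}_{t,j}(o) = \lfloor a_{t,j} \cdot o / (2^{\theta} w) \rfloor$, which is what the symbol definitions in the text suggest; the shorthand x is introduced here only for the derivation.

```latex
% shorthand: x = a_{t,j} \cdot o / w
\left\lfloor \frac{x}{2^{\theta}} \right\rfloor
  = \left\lfloor \frac{\lfloor x \rfloor}{2^{\theta}} \right\rfloor
\quad\Longrightarrow\quad
2^{\theta} H^{\theta}_{t,j}(o) \;\le\; H^{0}_{t,j}(o) \;\le\; 2^{\theta} H^{\theta}_{t,j}(o) + 2^{\theta} - 1
```

Under that assumption, the bit set for an item at θ = 0 always lies inside the window of $2^{\theta}$ consecutive addresses that the same item would scan at level θ, so the S virtual filters share the same m bits instead of requiring one bit vector per distance.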
Claims (1)
1. A filter-based approximate membership query method for high-dimensional data, characterized in that the target data set is defined as Ω and the distance-sensitive hash function H is defined as
$$H^{\theta}_{t,j}(o) = \left\lfloor \frac{a_{t,j} \cdot o}{2^{\theta} w} \right\rfloor,$$
where t = 1, 2, ..., k and j = 1, 2, ..., L; L is the number of function groups; k is the number of functions in each function group; o is a multidimensional data item; $a_{t,j}$ is a random vector with the same dimension as o, each coordinate of which follows the standard normal distribution; $\cdot$ is the dot product; $2^{\theta} w$ is the filtering distance parameter, with θ = 0, 1, 2, ..., S-1; S is the number of distinct filtering distances; w is the minimum filtering distance parameter, defined as a positive real number; and $\lfloor\ \rfloor$ denotes rounding down; the following steps are then carried out:
(1) build a bit vector of capacity m with addresses 0 to m-1, denoted BMLBF, and set BMLBF[i] = 0 for i = 0, 1, 2, ..., m-1;
(2) represent each multidimensional data item $o_y$ in the target data set Ω with the distance-sensitive hash function at θ = 0, namely $H^{0}_{t,j}(o_y) = \lfloor a_{t,j} \cdot o_y / w \rfloor$, where y = 1, 2, ..., n, and set the corresponding k × L bits of BMLBF to 1, namely BMLBF[$H^{0}_{t,j}(o_y)$] = 1;
(3) denote the multidimensional data item to be queried as q and represent it with the same distance-sensitive hash function, namely $H^{\theta}_{t,j}(q) = \lfloor a_{t,j} \cdot q / (2^{\theta} w) \rfloor$;
(4) convert each of the k hash values of the j-th group, namely $H^{\theta}_{1,j}(q), H^{\theta}_{2,j}(q), ..., H^{\theta}_{k,j}(q)$, to a binary number and append θ zeros to it, giving k addresses denoted $A_{1,j}, A_{2,j}, ..., A_{k,j}$;
(5) if at least one of BMLBF[$A_{1,j}$], BMLBF[$A_{1,j}+1$], ..., BMLBF[$A_{1,j}+2^{\theta}-1$] is 1, address $A_{1,j}$ is said to pass, and the same test applies to each of the other addresses of the group; if $A_{1,j}, A_{2,j}, ..., A_{k,j}$ all pass, the j-th group is said to pass; if any one of the L groups passes, q is confirmed to be an approximate member of the target data set Ω.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410578880.0A CN104391866B (en) | 2014-10-24 | 2014-10-24 | Approximate membership query method based on high-dimensional data filter |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410578880.0A CN104391866B (en) | 2014-10-24 | 2014-10-24 | Approximate membership query method based on high-dimensional data filter |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104391866A (en) | 2015-03-04 |
CN104391866B (en) | 2017-07-28 |
Family
ID=52609770
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410578880.0A Active CN104391866B (en) | 2014-10-24 | 2014-10-24 | Approximate membership query method based on high-dimensional data filter |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104391866B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339413A (en) * | 2016-08-12 | 2017-01-18 | 宁波大学 | Approximate membership query method based on high-dimensional data filter |
CN112214534A (en) * | 2020-10-21 | 2021-01-12 | 湖南大学 | Method, system and storage medium for performing approximate query on missing data |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7849038B2 (en) * | 2006-12-31 | 2010-12-07 | Brooks Roger K | Method for using the second homotopy group in assessing the similarity of sets of data |
US7966327B2 (en) * | 2004-11-08 | 2011-06-21 | The Trustees Of Princeton University | Similarity search system with compact data structures |
CN102609441A (en) * | 2011-12-27 | 2012-07-25 | 中国科学院计算技术研究所 | Local-sensitive hash high-dimensional indexing method based on distribution entropy |
US20130046767A1 (en) * | 2011-08-18 | 2013-02-21 | Ki-Yong Lee | Apparatus and method for managing bucket range of locality sensitive hash |
CN103631928A (en) * | 2013-12-05 | 2014-03-12 | 中国科学院信息工程研究所 | LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system |
CN103744934A (en) * | 2013-12-30 | 2014-04-23 | 南京大学 | Distributed index method based on LSH (Locality Sensitive Hashing) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339413A (en) * | 2016-08-12 | 2017-01-18 | 宁波大学 | Approximate membership query method based on high-dimensional data filter |
CN112214534A (en) * | 2020-10-21 | 2021-01-12 | 湖南大学 | Method, system and storage medium for performing approximate query on missing data |
CN112214534B (en) * | 2020-10-21 | 2022-03-11 | 湖南大学 | Method, system and storage medium for performing approximate query on missing data |
Also Published As
Publication number | Publication date |
---|---|
CN104391866B (en) | 2017-07-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | C10 | Entry into substantive examination | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | TR01 | Transfer of patent right | Effective date of registration: 20191126. Address after: 315000 No.168, Jinghua Road, Ningbo high tech Zone, Ningbo, Zhejiang Province. Patentee after: Ningbo Ningbian Power Sci-Tech Co., Ltd. Address before: 315211 Zhejiang Province, Ningbo Jiangbei District Fenghua Road No. 818. Patentee before: Ningbo University |