CN104391866A - Approximate membership query method based on high-dimension data filter - Google Patents

Info

Publication number
CN104391866A
Authority
CN
China
Prior art keywords
bmlbf
data
group
distance
namely
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410578880.0A
Other languages
Chinese (zh)
Other versions
CN104391866B (en)
Inventor
陈叶芳
钱江波
陈华辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Ningbian Power Sci-Tech Co., Ltd.
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University filed Critical Ningbo University
Priority to CN201410578880.0A priority Critical patent/CN104391866B/en
Publication of CN104391866A publication Critical patent/CN104391866A/en
Application granted granted Critical
Publication of CN104391866B publication Critical patent/CN104391866B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an approximate membership query method based on a high-dimensional data filter. A new structure, supported by a new distance-sensitive hash function, is defined to represent both the multi-dimensional data in the target data set and the multi-dimensional query data. As a result, the filter does not need to be rebuilt, approximate membership queries with multiple filter distance parameters are supported, and the space cost is greatly reduced. The method uses multiple function groups, each containing multiple functions, and applies an AND-OR combination when finally deciding whether a query is an approximate member of the target data set Ω, which reduces the false negative rate of the filter.

Description

An approximate membership query method based on a high-dimensional data filter
Technical field
The present invention relates to an approximate membership query method, and in particular to an approximate membership query method based on a high-dimensional data filter.
Background
In many applications, the closer a query datum is to the target data, the more valuable it is. For example, a security officer wants to check whether an unknown substance (with some detectable high-dimensional features) belongs to a list of hazardous chemicals; a network administrator wants to know whether a user's behavioral features are harmful; a judge of a photography contest wants to check whether a submitted photo is similar to photos in a large database. All of these queries need to evaluate the distance between the query data and the data in a (target) set. For a small, low-dimensional data set this can be solved by linear search, but linear search over a massive high-dimensional data set is very time-consuming and in many cases cannot meet real-time requirements. To speed up processing, a high-dimensional data filter can be set up to represent the target data set and filter out most query data by distance; the small amount of remaining data can then be processed further by conventional methods, which can significantly improve overall system performance.
What such a filter performs is an approximate membership query (AMQ), that is, it answers the question "is the query datum close to some datum in the data set?". Existing AMQ filters mainly combine locality-sensitive hashing (LSH) with Bloom filter techniques; the main representatives are DSBF (distance-sensitive Bloom filters) and LSBF (locality-sensitive Bloom filters).
DSBF was the first method to combine LSH and Bloom filters to filter AMQ queries. It returns approximate set-membership results, where the degree of approximation can follow different criteria. It can improve the speed and space efficiency of network and database applications, avoiding expensive comparison operations such as a full k-NN search. LSBF is an improvement on DSBF: it uses LSH functions to construct the Bloom filter that filters AMQ queries, and it additionally uses an extra bit vector to reduce the false positive rate.
However, filtering AMQ queries with DSBF or LSBF has a limitation: they can only filter AMQ queries for one given distance. Choosing a suitable distance is not easy, and a distance value that is too large or too small may lead to unacceptable query results. Moreover, once the filter distance parameter of a filter is fixed, it cannot be changed; if several different distance values need to be filtered at the same time, the filter has to be rebuilt from the raw data with a changed filter distance parameter. To save space, however, the raw data is generally not kept. In addition, the false negative rate of DSBF and LSBF is relatively high.
Summary of the invention
The technical problem to be solved by the present invention is to provide a filter-based approximate membership query method for high-dimensional data that, starting from a filter with a single fixed filter distance parameter, supports approximate membership queries with multiple filter distance parameters without rebuilding the filter.
The technical solution adopted by the present invention to solve the above technical problem is a filter-based approximate membership query method for high-dimensional data. The target data set is defined as Ω, and the distance-sensitive hash function is defined as $H^{\theta}_{t,j}(o) = \lfloor (a_{t,j} \cdot o) / (2^{\theta} w) \rfloor$, where t = 1, 2, ..., k and j = 1, 2, ..., L; L is the number of function groups; k is the number of functions in each function group; o is a multi-dimensional datum; $a_{t,j}$ is a random vector with the same dimension as o; $\cdot$ is the dot product; $2^{\theta} w$ is the distance filter parameter, with θ = 0, 1, 2, ..., S-1, where S is the number of distinct filter distances; w is the minimum distance filter parameter, defined as a positive real number; and $\lfloor \cdot \rfloor$ is the floor (round-down) operation. The following steps are then carried out:
(1) Build a bit vector of capacity m, with addresses 0 to m-1, denoted BMLBF, and set BMLBF[i] = 0 for i = 0, 1, 2, ..., m-1;
(2) Represent each multi-dimensional datum $o_y$ in the target data set Ω with the distance-sensitive hash function at θ = 0, namely $H^{0}_{t,j}(o_y)$, where y = 1, 2, ..., n, and set the k × L corresponding positions in BMLBF to 1, namely $\mathrm{BMLBF}[H^{0}_{t,j}(o_y)] = 1$;
(3) Define the multi-dimensional datum to be queried as q, and represent it with the above distance-sensitive hash function, namely $H^{\theta}_{t,j}(q)$;
(4) Convert the k hash values of the j-th group, namely $H^{\theta}_{1,j}(q), H^{\theta}_{2,j}(q), \ldots, H^{\theta}_{k,j}(q)$, to binary and append θ zeros to each, giving k addresses denoted $A_{1,j}, A_{2,j}, \ldots, A_{k,j}$;
(5) If at least one of BMLBF[$A_{1,j}$], BMLBF[$A_{1,j}+1$], ..., BMLBF[$A_{1,j}+2^{\theta}-1$] is 1, address $A_{1,j}$ is said to pass, and likewise for $A_{2,j}, \ldots, A_{k,j}$; if $A_{1,j}, A_{2,j}, \ldots, A_{k,j}$ all pass, the j-th group is said to pass; if any one of the L groups passes, q is confirmed to be an approximate member of the target data set Ω.
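Steps (1) to (5) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and method names are hypothetical, the hash is taken to be floor(a·o / (2^θ·w)) as the symbol definitions above suggest, the random vectors are drawn from a standard normal distribution, and hash values are mapped into the range 0 to m-1 with a modulo, which the text does not specify.

```python
import math
import random


class BMLBF:
    """Illustrative sketch of the bit-vector filter in steps (1)-(5)."""

    def __init__(self, m, dim, k, L, w, seed=0):
        self.m, self.k, self.L, self.w = m, k, L, w
        # Step (1): m cells, all 0 (one byte per cell, for readability).
        self.bits = bytearray(m)
        rng = random.Random(seed)
        # a[j][t] is a random vector with the same dimension as the data.
        self.a = [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
                   for _ in range(k)] for _ in range(L)]

    def _hash(self, x, t, j, theta):
        # Assumed form: floor(a_{t,j} . x / (2^theta * w)).
        dot = sum(ai * xi for ai, xi in zip(self.a[j][t], x))
        return math.floor(dot / (2 ** theta * self.w))

    def insert(self, o):
        # Step (2): set k*L positions using the theta = 0 hash values.
        for j in range(self.L):
            for t in range(self.k):
                # "% self.m" is an assumption; the source does not say how
                # hash values are mapped onto addresses 0..m-1.
                self.bits[self._hash(o, t, j, 0) % self.m] = 1

    def query(self, q, theta):
        # Steps (3)-(5): OR over groups of (AND over the k addresses).
        span = 2 ** theta
        for j in range(self.L):
            group_passes = True
            for t in range(self.k):
                # Appending theta zero bits equals a left shift by theta.
                start = self._hash(q, t, j, theta) << theta
                if not any(self.bits[(start + i) % self.m]
                           for i in range(span)):
                    group_passes = False
                    break
            if group_passes:
                return True
        return False
```

In this sketch a datum that was inserted passes a query at every level θ, because its θ = 0 address always lies inside the 2^θ-wide range derived from its level-θ hash value. Only one bit vector is built, and the filter distance is chosen at query time through `theta`.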
Compared with the prior art, the advantage of the invention is that a new structure, supported by a newly defined distance-sensitive hash function, represents both the multi-dimensional data in the target data set and the multi-dimensional query data. The filter does not need to be rebuilt, approximate membership queries with multiple filter distance parameters are supported, and the space cost is greatly reduced. The invention uses multiple function groups, each containing multiple functions, and applies an AND-OR combination when finally deciding whether a query is an approximate member of the target data set Ω, which also reduces the false negative rate of the filter.
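The effect of the AND-OR combination on the false negative rate can be illustrated with the standard amplification formula from the LSH literature. The sketch below is not taken from the patent: it assumes that each single function passes a nearby query independently with some probability p (a hypothetical value), and it ignores bits set by other data in the bit vector.

```python
def pass_probability(p, k, L):
    """Probability that at least one of L groups passes, when each of the
    k functions in a group passes independently with probability p."""
    group = p ** k                    # AND: all k functions in a group pass
    return 1.0 - (1.0 - group) ** L   # OR: at least one of L groups passes


def false_negative_rate(p, k, L):
    """Probability that a nearby query is rejected by all L groups."""
    return 1.0 - pass_probability(p, k, L)
```

Under these assumptions, with p = 0.9 and k = 2, a single group (L = 1) rejects a nearby query with probability 1 - 0.81 = 0.19, while three groups (L = 3) reject it with probability 0.19³ ≈ 0.0069. Adding groups lowers the false negative rate, at the price of a higher false positive rate.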
Brief description of the drawings
Fig. 1 compares the false positive rate and false negative rate of the method of the invention with those of the prior-art LSBF method in the embodiment;
Fig. 2 shows the error rates of the virtual filters of the method of the invention in the embodiment;
Fig. 3 compares the space cost of the method of the invention with that of the prior-art LSBF method in the embodiment.
Embodiment
The present invention is described in further detail below with reference to the drawings and the embodiment.
We use a real handwritten digit recognition data set to evaluate and compare the method of the invention with the existing LSBF method. The data set contains 5,620 data items; each item represents a handwritten Arabic numeral, that is, '0', '1', ..., '9', with 64 features. Feature values are integers from 0 to 16. The data for '0' are divided into two groups: one group of 10 items serves as the set Ω, and the other group serves as the test data q, to test the false negative rate. In addition, 10 items for '1' are taken as the set Ω, with the other data as the test data q, to test the false positive rate. Each experimental result is the mean of 10,000 randomized computations.
In the filter-based approximate membership query method for high-dimensional data, the target data set is defined as Ω, and the distance-sensitive hash function is defined as $H^{\theta}_{t,j}(o) = \lfloor (a_{t,j} \cdot o) / (2^{\theta} w) \rfloor$, where t = 1, 2, ..., k and j = 1, 2, ..., L; L is the number of function groups; k is the number of functions in each function group; o is a multi-dimensional datum; $a_{t,j}$ is a random vector with the same dimension as o; $\cdot$ is the dot product; $2^{\theta} w$ is the distance filter parameter, with θ = 0, 1, 2, ..., S-1, where S is the number of distinct filter distances; w is the minimum distance filter parameter, defined as a positive real number; and $\lfloor \cdot \rfloor$ is the floor (round-down) operation. The following steps are then carried out:
(1) Build a bit vector of capacity $m = 2 \times 10^{5}$, with addresses 0 to m-1, denoted BMLBF, and set BMLBF[i] = 0 for i = 0, 1, 2, ..., m-1;
(2) Represent each 64-dimensional datum $o_y$ in the target data set Ω with the distance-sensitive hash function at θ = 0, namely $H^{0}_{t,j}(o_y) = \lfloor (a_{t,j} \cdot o_y) / w \rfloor$, where y = 1, 2, ..., 10, k = 2, L = 3, θ = 0, $a_{t,j}$ is a 64-dimensional random vector, w is the minimum distance filter parameter, defined as a positive real number, and $\lfloor \cdot \rfloor$ is the floor operation; and set the k × L corresponding positions in BMLBF to 1, namely $\mathrm{BMLBF}[H^{0}_{t,j}(o_y)] = 1$;
(3) Define the 64-dimensional datum to be queried as q, and represent it with the following distance-sensitive hash function, namely $H^{\theta}_{t,j}(q) = \lfloor (a_{t,j} \cdot q) / (2^{\theta} w) \rfloor$, where S is the number of distinct filter distances, here S = 4;
(4) Convert the 2 hash values of group 1, namely $H^{\theta}_{1,1}(q)$ and $H^{\theta}_{2,1}(q)$, to binary and append θ zeros to each, giving 2 addresses denoted $A_{1,1}$, $A_{2,1}$; convert the 2 hash values of group 2, namely $H^{\theta}_{1,2}(q)$ and $H^{\theta}_{2,2}(q)$, to binary and append θ zeros to each, giving 2 addresses denoted $A_{1,2}$, $A_{2,2}$; convert the 2 hash values of group 3, namely $H^{\theta}_{1,3}(q)$ and $H^{\theta}_{2,3}(q)$, to binary and append θ zeros to each, giving 2 addresses denoted $A_{1,3}$, $A_{2,3}$;
(5) For each group j = 1, 2, 3: if at least one of BMLBF[$A_{1,j}$], BMLBF[$A_{1,j}+1$], ..., BMLBF[$A_{1,j}+2^{\theta}-1$] is 1, address $A_{1,j}$ is said to pass; if at least one of BMLBF[$A_{2,j}$], BMLBF[$A_{2,j}+1$], ..., BMLBF[$A_{2,j}+2^{\theta}-1$] is 1, address $A_{2,j}$ is said to pass; if $A_{1,j}$ and $A_{2,j}$ both pass, group j is said to pass. If any one of the 3 groups passes, q is confirmed to be an approximate member of the target data set Ω.
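Steps (4) and (5) can be made concrete with a small worked example. The sketch below is an illustration only: the function name `address_range` and the hash value 5 are hypothetical, and a non-negative hash value is assumed for simplicity.

```python
def address_range(hash_value, theta):
    """Addresses checked in step (5) for one query hash value at level theta."""
    # Step (4): write the hash value in binary and append theta zero bits.
    start = int(format(hash_value, "b") + "0" * theta, 2)
    # Step (5): the 2^theta consecutive addresses starting at that address.
    return list(range(start, start + 2 ** theta))


# Hash value 5 is 101 in binary; with theta = 2 the address is 10100 = 20,
# so the cells 20, 21, 22 and 23 of BMLBF are examined.
print(address_range(5, 2))
```

These four addresses are exactly the θ = 0 hash values v for which floor(v / 2²) = 5. This is why a single bit vector filled at θ = 0 can answer queries at the coarser distances $2^{\theta} w$.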
The experimental results in Fig. 1 show that the false negative rate of the method of the invention is much lower than that of the prior-art LSBF method, which makes it better suited to applications that want to lose as little genuine data as possible after filtering. The space cost comparison in Fig. 3 shows that the method of the invention needs only one physical filter to produce 4 different filter distances, which is equivalent to using 4 virtual filters, whereas the prior-art LSBF method must build a separate physical filter for each filter distance. Therefore, as the number of required filter distances grows, that is, as S increases, the space cost of the prior-art LSBF method increases, while that of the method of the invention remains unchanged.

Claims (1)

1. A filter-based approximate membership query method for high-dimensional data, characterized in that the target data set is defined as Ω and the distance-sensitive hash function is defined as $H^{\theta}_{t,j}(o) = \lfloor (a_{t,j} \cdot o) / (2^{\theta} w) \rfloor$, where t = 1, 2, ..., k and j = 1, 2, ..., L; L is the number of function groups; k is the number of functions in each function group; o is a multi-dimensional datum; $a_{t,j}$ is a random vector with the same dimension as o, each component of which follows the standard normal distribution; $\cdot$ is the dot product; $2^{\theta} w$ is the distance filter parameter, with θ = 0, 1, 2, ..., S-1, where S is the number of distinct filter distances; w is the minimum distance filter parameter, defined as a positive real number; and $\lfloor \cdot \rfloor$ is the floor (round-down) operation; and in that the following steps are then carried out:
(1) Build a bit vector of capacity m, with addresses 0 to m-1, denoted BMLBF, and set BMLBF[i] = 0 for i = 0, 1, 2, ..., m-1;
(2) Represent each multi-dimensional datum $o_y$ in the target data set Ω with the distance-sensitive hash function at θ = 0, namely $H^{0}_{t,j}(o_y)$, where y = 1, 2, ..., n, and set the k × L corresponding positions in BMLBF to 1, namely $\mathrm{BMLBF}[H^{0}_{t,j}(o_y)] = 1$;
(3) Define the multi-dimensional datum to be queried as q, and represent it with the above distance-sensitive hash function, namely $H^{\theta}_{t,j}(q)$;
(4) Convert the k hash values of the j-th group, namely $H^{\theta}_{1,j}(q), H^{\theta}_{2,j}(q), \ldots, H^{\theta}_{k,j}(q)$, to binary and append θ zeros to each, giving k addresses denoted $A_{1,j}, A_{2,j}, \ldots, A_{k,j}$;
(5) If at least one of BMLBF[$A_{1,j}$], BMLBF[$A_{1,j}+1$], ..., BMLBF[$A_{1,j}+2^{\theta}-1$] is 1, address $A_{1,j}$ is said to pass, and likewise for $A_{2,j}, \ldots, A_{k,j}$; if $A_{1,j}, A_{2,j}, \ldots, A_{k,j}$ all pass, the j-th group is said to pass; if any one of the L groups passes, q is confirmed to be an approximate member of the target data set Ω.
CN201410578880.0A 2014-10-24 2014-10-24 A kind of approximate member's querying method based on high dimensional data filter Active CN104391866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410578880.0A CN104391866B (en) 2014-10-24 2014-10-24 A kind of approximate member's querying method based on high dimensional data filter

Publications (2)

Publication Number Publication Date
CN104391866A true CN104391866A (en) 2015-03-04
CN104391866B CN104391866B (en) 2017-07-28

Family

ID=52609770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410578880.0A Active CN104391866B (en) 2014-10-24 2014-10-24 A kind of approximate member's querying method based on high dimensional data filter

Country Status (1)

Country Link
CN (1) CN104391866B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7849038B2 (en) * 2006-12-31 2010-12-07 Brooks Roger K Method for using the second homotopy group in assessing the similarity of sets of data
US7966327B2 (en) * 2004-11-08 2011-06-21 The Trustees Of Princeton University Similarity search system with compact data structures
CN102609441A (en) * 2011-12-27 2012-07-25 中国科学院计算技术研究所 Local-sensitive hash high-dimensional indexing method based on distribution entropy
US20130046767A1 (en) * 2011-08-18 2013-02-21 Ki-Yong Lee Apparatus and method for managing bucket range of locality sensitive hash
CN103631928A (en) * 2013-12-05 2014-03-12 中国科学院信息工程研究所 LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system
CN103744934A (en) * 2013-12-30 2014-04-23 南京大学 Distributed index method based on LSH (Locality Sensitive Hashing)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339413A (en) * 2016-08-12 2017-01-18 宁波大学 Approximate membership query method based on high-dimensional data filter
CN112214534A (en) * 2020-10-21 2021-01-12 湖南大学 Method, system and storage medium for performing approximate query on missing data
CN112214534B (en) * 2020-10-21 2022-03-11 湖南大学 Method, system and storage medium for performing approximate query on missing data

Also Published As

Publication number Publication date
CN104391866B (en) 2017-07-28

Similar Documents

Publication Publication Date Title
US20210182318A1 (en) Data Retrieval Method and Apparatus
CN109117669B (en) Privacy protection method and system for MapReduce similar connection query
Carfì et al. Algorithms for payoff trajectories in C 1 parametric games
CN106254321A (en) A kind of whole network abnormal data stream sorting technique
CN105589864A (en) Data inquiry method and apparatus
Maree et al. Real-valued evolutionary multi-modal optimization driven by hill-valley clustering
CN103338155A (en) High-efficiency filtering method for data packets
CN106021386B (en) Non-equivalent connection method towards magnanimity distributed data
CN105005584A (en) Multi-subspace Skyline query computation method
CN106874788A (en) A kind of method for secret protection in sensitive data issue
CN102915344A (en) SQL (structured query language) statement processing method and device
CN112199722A (en) K-means-based differential privacy protection clustering method
CN105745642A (en) Device and method for inquiring data
CN104699747A (en) AMQ (approximate membership query) method based on high-dimensional data filter
CN104391866A (en) Approximate membership query method based on high-dimension data filter
Wang et al. A density-based clustering structure mining algorithm for data streams
CN107562762A (en) Data directory construction method and device
CN106055674B (en) A kind of top-k under distributed environment based on metric space dominates querying method
CN106339413A (en) Approximate membership query method based on high-dimensional data filter
CN104516946A (en) Approximate member query method based on high-dimensional data filter
CN111147535A (en) Method and device for preventing Internet of things platform from repeatedly creating terminal equipment
Fan et al. DEXIN: A fast content-based multi-attribute event matching algorithm using dynamic exclusive and inclusive methods
CN105095455A (en) Data connection optimization method and data operation system
Qing et al. Device type identification via network traffic and lightweight convolutional neural network for Internet of things
CN104504714B (en) The detection method of the common obvious object of image

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191126

Address after: 315000 No.168, Jinghua Road, Ningbo high tech Zone, Ningbo, Zhejiang Province

Patentee after: Ningbo Ningbian Power Sci-Tech Co., Ltd.

Address before: 315211 Zhejiang Province, Ningbo Jiangbei District Fenghua Road No. 818

Patentee before: Ningbo University
