CN107656989B - Data-distribution-aware nearest-neighbor query method in a cloud storage system - Google Patents


Info

Publication number
CN107656989B
CN107656989B (application CN201710822371.1A)
Authority
CN
China
Prior art keywords
hash
point
hash table
query
high dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710822371.1A
Other languages
Chinese (zh)
Other versions
CN107656989A (en)
Inventor
华宇
孙园园
冯丹
左鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710822371.1A priority Critical patent/CN107656989B/en
Publication of CN107656989A publication Critical patent/CN107656989A/en
Application granted granted Critical
Publication of CN107656989B publication Critical patent/CN107656989B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2264 Multidimensional index structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data-distribution-aware nearest-neighbor query method for cloud storage systems. The method uses the principal components of the data as the projection vectors of locality-sensitive hashing (LSH), and further quantifies the weight of each hash function in the index table and adjusts the partition width of the hash functions in each hash table. This reduces the number of hash tables needed to build the index while preserving nearest-neighbor query accuracy, thereby reducing the space overhead of the hash tables. In addition, the method refines the query result set according to the hash-collision frequency of the candidate results, eliminating a large number of irrelevant elements; this greatly reduces the amount of data involved in distance computation and lowers query latency. The invention fully exploits the characteristics of the data distribution, supports fast lookup, and scales well.

Description

Data-distribution-aware nearest-neighbor query method in a cloud storage system
Technical field
The invention belongs to the field of computer storage, and more particularly relates to a data-distribution-aware nearest-neighbor query method in a cloud storage system.
Background technique
Massive-scale storage systems consume large amounts of system resources, such as computation, storage, and network bandwidth, to support query requests; nevertheless, processing and analyzing massive high-dimensional data in real time remains a huge challenge. Obtaining exact results for a query request is time-consuming and impractical, because users often find it difficult to describe their queries precisely. Consequently, nearest-neighbor query services, being approximate and real-time, have attracted increasing attention in practical applications.
Locality-sensitive hashing (Locality Sensitive Hashing, LSH) is widely used to support nearest-neighbor query services, because its hash computation is simple and it preserves data locality.
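The LSH function referred to above can be illustrated with a minimal sketch. The Gaussian projection vector below is the conventional random choice that the invention later replaces with a principal component; all names and parameter values are illustrative, not taken from the patent.

```python
import math
import random

def make_lsh(dim, omega, seed=0):
    """One p-stable LSH function h(p) = floor((a . p + b) / omega).

    a is a random Gaussian projection vector (illustrative; the patent
    later substitutes a principal component), b is drawn uniformly from
    [0, omega), and omega is the partition width.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, omega)

    def h(p):
        projection = sum(ai * pi for ai, pi in zip(a, p))
        return math.floor((projection + b) / omega)

    return h

h = make_lsh(dim=3, omega=4.0, seed=42)
print(h([1.0, 2.0, 3.0]), h([1.05, 2.05, 3.05]))  # nearby points usually agree
```

Points whose projections fall into the same width-ω slot collide; similar points collide with higher probability than distant ones, which is the locality property the method builds on.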
Existing LSH-based nearest-neighbor query methods have the following problems:
(1) Low accuracy. Methods based on traditional LSH select the projection vectors of the hash functions at random, without considering the data distribution. Random projection vectors map uniformly distributed data into the hash tables with equal probability, so the data in the hash buckets are balanced. In practice, however, data distributions are mostly non-uniform. Mapping unevenly distributed data along randomly chosen projection directions therefore clusters many unrelated points together, reducing the accuracy of nearest-neighbor queries.
(2) Low space efficiency. The projection vectors of traditional LSH functions are independent of the data distribution, so traditional LSH methods rely on a large number of hash tables to guarantee query accuracy. The resulting memory overhead becomes a serious performance bottleneck of methods based on traditional LSH.
(3) High query latency. Because traditional LSH maps points randomly, a query operation retrieves many unrelated points into the query result set. The candidate set, containing a large number of elements, must then be compared with the query point by distance computation, which is very time-consuming; the resulting query latency is too high for users.
Summary of the invention
In view of the above defects or improvement requirements of the prior art, the present invention provides a data-distribution-aware nearest-neighbor query method for cloud storage systems, thereby solving the technical problems of low accuracy, low space efficiency, and high query latency of nearest-neighbor queries in massive-scale storage systems.
To achieve the above object, the present invention provides a data-distribution-aware nearest-neighbor query method in a cloud storage system, comprising:
S1. Randomly sample part of the data from the original high-dimensional data set to form a high-dimensional feature data set;
S2. Represent each element of the high-dimensional feature data set as a multi-dimensional vector, so that the feature data set becomes a matrix composed of such vectors; compute the covariance matrix of this matrix offline by principal component analysis, and then obtain the eigenvectors and eigenvalues of the covariance matrix;
S3. Obtain the number of hash tables required in the index table, the number of hash functions in each hash table, and the conflict threshold;
S4. In descending order of eigenvalue, take the eigenvector corresponding to each eigenvalue, one-to-one, as the projection vector of a hash function; compute the weight of each hash function in each hash table from the corresponding eigenvalue; then adjust the partition width of the hash functions in each hash table; finally, map the original high-dimensional data set into the whole index table through the optimized hash functions, storing the elements that collide in a linked list;
S5. For each query point, compute its hash value in each hash table with the optimized hash functions; use the hash value to locate the colliding position in the hash table, and add every element in the linked list at that position to the result candidate set. Record how many times each element in the candidate set collides with the query point, and remove the elements whose collision count is below the preset conflict threshold to obtain the nearest-neighbor candidate set; then compute the distance between each remaining point and the query point, and output all elements whose distance to the query point is below the preset distance threshold.
Preferably, step S2 specifically comprises the following sub-steps:
S2.1. Regard the n elements of the high-dimensional feature data set X as n vectors of d variables, so that X is expressed as the matrix X = (x1, x2, ..., xn)^T composed of n d-dimensional vectors;
S2.2. Compute the covariance matrix S of X, whose diagonal entries are the variances of the individual variables and whose off-diagonal entries are the covariances between pairs of variables; compute the eigenvector group V and the eigenvalue group N from the covariance matrix S;
S2.3. Take the eigenvectors corresponding to the k × L largest eigenvalues in the eigenvalue group N as the principal component group V' of the high-dimensional feature data set X, and map X to the data set Y through V', i.e., Y = XV', where k denotes the number of hash functions in each hash table and L denotes the number of hash tables required in the index table.
Preferably, step S3 specifically comprises the following sub-steps:
S3.1. Derive the number L of hash tables in the index table and the number k of hash functions in each hash table from the parameters p1, p2, α, δ, and β, where p1 denotes the probability that two points are approximate neighbors and collide, p2 denotes the probability that two points are not approximate neighbors yet still collide, α is the conflict proportion threshold with p2 < α < p1, δ is the required success rate of the nearest-neighbor query, and β is the false-positive rate of locality-sensitive hashing (LSH);
S3.2. Obtain the conflict proportion threshold α at which the hash table number L is minimal;
S3.3. Substitute the value of α to obtain L';
S3.4. Obtain the conflict threshold m from the values of α and L'.
Preferably, step S4 specifically comprises the following sub-steps:
S4.1. Express the LSH function as h(p) = ⌊(a·p + b)/ω⌋, where a is the projection vector, p is any point in the multi-dimensional space of the high-dimensional feature data set X, b is a real number chosen uniformly at random from the range [0, ω), and ω is the partition width of the projection;
S4.2. Take the k × L eigenvectors selected in step S2.3, in order and one-to-one, as the projection vectors of the hash functions; let the eigenvalues corresponding to the k eigenvectors in each hash table, arranged in descending order, be N = [n1, n2, ..., nk]; the weight ai of the i-th hash function (1 ≤ i ≤ k) is determined by its eigenvalue ni, and the hash value of a point p in each hash table is the weighted sum of its k hash function values;
S4.3. Within each hash table, the k hash functions share the same partition width ω, and the width ω of the hash functions in each subsequent hash table is half of the width ω used in the previous hash table;
S4.4. Using the projection vectors and partition widths of the hash functions in each hash table, construct L hash tables, each containing k hash functions; insert every point of the multi-dimensional space in the high-dimensional feature data set X into each hash table of the index table by hash mapping, storing the points that collide in a linked list.
Preferably, step S5 specifically comprises the following sub-steps:
S5.1. For a query point q, compute its hash value gi(q), 1 ≤ i ≤ L, in each hash table; add every element in the linked list of the colliding hash bucket to the query result set C(q), saving each repeated element only once, to obtain the approximate set of the query point q in the high-dimensional feature data set X;
S5.2. Record the number of times each point in the query result set C(q) collides with the query point q across the index table, where the collision count of a point p with q is the number of hash tables i (1 ≤ i ≤ L) in which gi(p) = gi(q); let the conflict threshold be m: when the collision count of a point in C(q) with the query point q exceeds m, the point is regarded as approximate to q and is added to the refined result set C'(q);
S5.3. For every point in the refined result set C'(q), compute its Euclidean distance to the query point q in turn; when the distance between the two points is below the preset distance threshold, take the point as an approximate point of q.
In general, compared with the prior art, the technical scheme conceived above by the present invention achieves the following beneficial effects:
This method uses the principal components of the data as the projection vectors of locality-sensitive hashing, and further quantifies the weight of each hash function in the index table and adjusts the partition width of the hash functions in each hash table, reducing the number of hash tables needed to build the index while guaranteeing nearest-neighbor query accuracy and thereby reducing the space overhead of the hash tables. Furthermore, the method refines the query result set according to the hash-collision frequency of the candidate results, eliminating a large number of irrelevant elements, which greatly reduces the amount of data involved in distance computation and lowers query latency.
Detailed description of the invention
Fig. 1 is a flow diagram of the data-distribution-aware nearest-neighbor query method in a cloud storage system provided by an example of the present invention;
Fig. 2 is a flow diagram of the principal component analysis computation provided by an example of the present invention;
Fig. 3 is a flow diagram of the parameter setting provided by an example of the present invention;
Fig. 4 is a flow diagram of the index table construction provided by an example of the present invention;
Fig. 5 is a flow diagram of the nearest-neighbor query provided by an example of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described here merely illustrate the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments described below can be combined with each other as long as they do not conflict.
The present invention is a data-distribution-aware nearest-neighbor query method in a cloud storage system. Considering the characteristics of the data distribution, it uses principal component analysis to guide the selection of the projection vectors in the LSH algorithm, and further quantifies the weight of each hash function in the index table and adjusts the partition width of the hash functions in each hash table, reducing the number of hash tables needed to build the index while guaranteeing nearest-neighbor query accuracy and thereby reducing the space overhead of the hash tables. Furthermore, the method refines the query result set according to the hash-collision frequency of the candidate results, eliminating a large number of irrelevant elements, which greatly reduces the amount of data involved in distance computation and lowers query latency.
Fig. 1 is a flow diagram of the data-distribution-aware nearest-neighbor query method in a cloud storage system provided by an example of the present invention; the method shown in Fig. 1 comprises the following steps:
S1. Data set sampling: randomly sample part of the data from the original high-dimensional data set to form a high-dimensional feature data set, so as to guarantee time efficiency;
S2. Principal component analysis: perform principal component analysis on the high-dimensional feature data set obtained in step S1 to obtain the projection vectors of the hash functions. Specifically: represent each element of the high-dimensional feature data set as a multi-dimensional vector, so that the feature data set becomes a matrix composed of such vectors; compute the covariance matrix of this matrix offline by principal component analysis, and then obtain the eigenvectors and eigenvalues of the covariance matrix;
In an optional embodiment, as shown in Fig. 2, step S2 specifically comprises the following sub-steps:
S2.1. Regard the n elements of the high-dimensional feature data set X as n vectors of d variables, so that X is expressed as the matrix X = (x1, x2, ..., xn)^T composed of n d-dimensional vectors;
S2.2. Compute the covariance matrix S of X, whose diagonal entries are the variances of the individual variables and whose off-diagonal entries are the covariances between pairs of variables; compute the eigenvector group V and the eigenvalue group N from the covariance matrix S;
S2.3. Take the eigenvectors corresponding to the k × L largest eigenvalues in the eigenvalue group N as the principal component group V' of the high-dimensional feature data set X, and map X to the data set Y through V', i.e., Y = XV', where k denotes the number of hash functions in each hash table and L denotes the number of hash tables required in the index table.
Generally, the data variance captured by projecting onto the eigenvectors of the first few larger eigenvalues represents a larger amount of information and better reflects query performance. Using these leading eigenvectors, which reflect the data distribution, as the projection vectors in LSH to construct the index table therefore reduces space overhead.
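The offline principal component analysis of steps S2.1–S2.3 can be sketched as follows. This is a minimal illustration using numpy: the sizes n, d, k, L and the synthetic anisotropic data are made up, and numpy's eigendecomposition stands in for whatever solver an implementation would actually use.

```python
import numpy as np

# Hypothetical sizes: k hash functions per table, L tables, d dimensions.
k, L, d, n = 2, 3, 10, 500

rng = np.random.default_rng(0)
# Synthetic anisotropic data: later coordinates have larger variance.
X = rng.normal(size=(n, d)) * np.arange(1, d + 1)

# Covariance matrix of the sampled feature data set (rows are points).
S = np.cov(X, rowvar=False)

# Eigendecomposition; eigh returns ascending eigenvalues, so reverse.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the k*L leading eigenvectors as the LSH projection vectors.
V_prime = eigvecs[:, : k * L]
Y = X @ V_prime  # mapped data set Y = X V'
print(Y.shape)  # (500, 6)
```

The columns of `V_prime` play the role of the projection vector group V', sorted so that the first projection direction carries the largest variance.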
S3. Parameter setting: obtain the number of hash tables required in the index table, the number of hash functions in each hash table, and the conflict threshold;
In an optional embodiment, as shown in Fig. 3, step S3 specifically comprises the following sub-steps:
S3.1. Derive the number L of hash tables in the index table and the number k of hash functions in each hash table from the parameters p1, p2, α, δ, and β, where p1 denotes the probability that two points are approximate neighbors and collide, p2 denotes the probability that two points are not approximate neighbors yet still collide, α is the conflict proportion threshold with p2 < α < p1, δ is the required success rate of the nearest-neighbor query, and β is the false-positive rate of locality-sensitive hashing (LSH); preferred values are prescribed for δ and β;
S3.2. Obtain the conflict proportion threshold α at which the hash table number L is minimal;
Specifically, L = max(L1, L2); since p2 < α < p1, L1 increases as α increases while L2 decreases as α increases, so the hash table number L reaches its minimum when L1 = L2, which yields the value of α;
S3.3. Substitute the value of α into L1 to obtain L';
S3.4. Obtain the conflict threshold m from the values of α and L'.
S4. Index table construction: in descending order of eigenvalue, take the eigenvector corresponding to each eigenvalue, one-to-one, as the projection vector of a hash function; compute the weight of each hash function in each hash table from the corresponding eigenvalue; then adjust the partition width of the hash functions in each hash table; finally, map the original high-dimensional data set into the whole index table through the optimized hash functions, storing the elements that collide in a linked list;
In an optional embodiment, as shown in Fig. 4, step S4 specifically comprises the following sub-steps:
S4.1. Weight quantization: express the LSH function as h(p) = ⌊(a·p + b)/ω⌋, where a is the projection vector, p is any point in the multi-dimensional space of the high-dimensional feature data set X, b is a real number chosen uniformly at random from the range [0, ω), and ω is the partition width of the projection;
The hash value of a point in the multi-dimensional space is generally obtained with the k hash functions of each hash table. In traditional LSH based on random mapping, each hash function in a hash table is assigned a random weight, and the weighted sum is taken as the hash value of the point. For any point p in the set X, its hash value in a given hash table is computed as g(p) = a1·h1(p) + a2·h2(p) + ... + ak·hk(p), where the weights ai are random numbers from 0 to 1 obeying a p-stable distribution. A hash function with a good projection vector separates unrelated data that would otherwise condense together: it maps similar points into the same hash bucket with high probability and maps distant points into the same bucket with low probability. Intuitively, hash functions with better projection vectors should therefore be assigned larger weights to obtain better query performance.
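This weighting can be made concrete with a small sketch. The patent's exact weight formula is reproduced only as an image in this text, so the normalized eigenvalue weights below (w_i = n_i / Σ_j n_j) are an assumption consistent with the stated principle that larger eigenvalues deserve larger weights; the two stand-in hash functions are toys, not real LSH functions.

```python
def eigen_weights(eigvals):
    """Weights for the k hash functions of one table, assumed here to be
    proportional to their eigenvalues (larger eigenvalue -> larger weight)."""
    total = sum(eigvals)
    return [v / total for v in eigvals]

def weighted_hash(point, hash_funcs, weights):
    """Table-level hash value g(p) = sum_i w_i * h_i(p)."""
    return sum(w * h(point) for w, h in zip(weights, hash_funcs))

# Toy stand-ins for the per-table LSH functions h_1, h_2.
hs = [lambda p: int(p[0]), lambda p: int(p[1])]
w = eigen_weights([3.0, 1.0])
print(w)                                 # [0.75, 0.25]
print(weighted_hash([4.0, 8.0], hs, w))  # 0.75*4 + 0.25*8 = 5.0
```

The first hash function, whose projection direction carries three times the variance of the second, contributes three times as much to the combined hash value.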
S4.2. Take the k × L eigenvectors selected in step S2.3, in order and one-to-one, as the projection vectors of the hash functions; let the eigenvalues corresponding to the k eigenvectors in each hash table, arranged in descending order, be N = [n1, n2, ..., nk]; the weight ai of the i-th hash function (1 ≤ i ≤ k) is determined by its eigenvalue ni, and the hash value of a point p in each hash table is the weighted sum of its k hash function values;
S4.3. Partition width adjustment: within each hash table, the k hash functions share the same partition width ω, and the width ω of the hash functions in each subsequent hash table is half of the width ω used in the previous hash table;
Specifically, the parameter ω embodies the granularity of hash collisions. A larger width ω increases the probability that similar points collide, but distant points may also be mapped to the same hash bucket, which hurts query accuracy. A smaller width ω reduces the probability that distant points collide, but similar points may then be mapped into different hash buckets, which hurts query recall. In the embodiment of the present invention, the partition width ω is adjusted to improve query performance. Within each hash table, the k hash functions share the same width ω, and the width used in each hash table is half of that used in the previous one; that is, for the i-th hash table, the width of its hash functions is ω_i = ω_0 / 2^(i−1), where ω_0 is the initial width of the hash functions of the first hash table.
S4.4. Construction and insertion: using the projection vectors and partition widths of the hash functions in each hash table, construct L hash tables, each containing k hash functions; insert every point of the multi-dimensional space in the high-dimensional feature data set X into each hash table of the index table by hash mapping, storing the points that collide in a linked list.
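The construction of steps S4.3–S4.4 can be sketched as below. This is a simplified, self-contained illustration: `build_tables`, its bucket layout, and the identity-like projection vectors are hypothetical, and the hash-function weights and random offsets b are omitted for brevity.

```python
import math

def build_tables(data, proj, k, L, omega0):
    """Build L hash tables: table i uses partition width omega0 / 2**i
    (each table halves the previous width) and its own k projection
    vectors; points that collide are chained in a per-bucket list."""
    tables = []
    for i in range(L):
        omega = omega0 / (2 ** i)
        vecs = proj[i * k:(i + 1) * k]
        buckets = {}
        for idx, p in enumerate(data):
            key = tuple(
                math.floor(sum(a * x for a, x in zip(vec, p)) / omega)
                for vec in vecs
            )
            buckets.setdefault(key, []).append(idx)  # linked-list-style chain
        tables.append((omega, vecs, buckets))
    return tables

data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # k*L = 4 vectors
tables = build_tables(data, proj, k=2, L=2, omega0=1.0)
print(tables[0][2])  # points 0 and 1 share bucket (0, 0); point 2 is alone
```

Each table key is the tuple of the k per-function slot indices, so two points land in the same bucket only when all k hash functions of that table agree.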
S5. Nearest-neighbor query: for each query point, compute its hash value in each hash table with the optimized hash functions; use the hash value to locate the colliding position in the hash table, and add every element in the linked list at that position to the result candidate set. Record how many times each element in the candidate set collides with the query point, and remove the elements whose collision count is below the preset conflict threshold to obtain the nearest-neighbor candidate set; then compute the distance between each remaining point and the query point, and output all elements whose distance to the query point is below the preset distance threshold.
In an optional embodiment, as shown in Fig. 5, step S5 specifically comprises the following sub-steps:
S5.1. For a query point q, compute its hash value gi(q), 1 ≤ i ≤ L, in each hash table; add every element in the linked list of the colliding hash bucket to the query result set C(q), saving each repeated element only once, to obtain the approximate set of the query point q in the high-dimensional feature data set X;
S5.2. Record the number of times each point in the query result set C(q) collides with the query point q across the index table, where the collision count of a point p with q is the number of hash tables i (1 ≤ i ≤ L) in which gi(p) = gi(q); let the conflict threshold be m: when the collision count of a point in C(q) with the query point q exceeds m, the point is regarded as approximate to q and is added to the refined result set C'(q);
S5.3. For every point in the refined result set C'(q), compute its Euclidean distance to the query point q in turn; when the distance between the two points is below the preset distance threshold, take the point as an approximate point of q.
The preset distance threshold can be determined according to actual needs.
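Steps S5.1–S5.3 can be sketched end to end as follows. This is a self-contained toy: the two-table index, identity projection vectors, absent weights, helper `table_key`, and the values of m and the radius are all illustrative assumptions, not the patent's implementation.

```python
import math

def table_key(p, vecs, omega):
    """Slot-index tuple of point p under one table's projections."""
    return tuple(
        math.floor(sum(a * x for a, x in zip(vec, p)) / omega)
        for vec in vecs
    )

def query(q, tables, data, m, radius):
    """S5.1-S5.3: gather colliding points across all tables, count
    collisions per point, drop points with count <= m, then keep only
    points within `radius` of q (Euclidean distance)."""
    counts = {}
    for omega, vecs, buckets in tables:
        for idx in buckets.get(table_key(q, vecs, omega), []):
            counts[idx] = counts.get(idx, 0) + 1
    refined = [i for i, c in counts.items() if c > m]
    return [i for i in refined if math.dist(q, data[i]) < radius]

# Tiny two-table index over three 2-D points (identity projections).
data = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]]
vecs = [[1.0, 0.0], [0.0, 1.0]]
tables = []
for omega in (1.0, 0.5):
    buckets = {}
    for idx, p in enumerate(data):
        buckets.setdefault(table_key(p, vecs, omega), []).append(idx)
    tables.append((omega, vecs, buckets))

print(query([0.0, 0.0], tables, data, m=1, radius=1.0))  # [0, 1]
```

The collision-count filter discards points that collided in only a few tables by chance, so the expensive Euclidean distance computation runs over the small refined set rather than the whole candidate set.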
Those skilled in the art will readily appreciate that the foregoing merely describes preferred embodiments of the present invention and does not limit it; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A data-distribution-aware nearest-neighbor query method in a cloud storage system, characterized by comprising:
S1. Randomly sampling part of the data from the original high-dimensional data set to form a high-dimensional feature data set;
S2. Representing each element of the high-dimensional feature data set as a multi-dimensional vector, so that the feature data set becomes a matrix composed of such vectors; computing the covariance matrix of this matrix offline by principal component analysis, and then obtaining the eigenvectors and eigenvalues of the covariance matrix;
S3. Obtaining the number of hash tables required in the index table, the number of hash functions in each hash table, and the conflict threshold;
S4. In descending order of eigenvalue, taking the eigenvector corresponding to each eigenvalue, one-to-one, as the projection vector of a hash function; computing the weight of each hash function in each hash table from the corresponding eigenvalue; then adjusting the partition width of the hash functions in each hash table; finally, mapping the original high-dimensional data set into the whole index table through the optimized hash functions, storing the elements that collide in a linked list;
S5. For each query point, computing its hash value in each hash table with the optimized hash functions; using the hash value to locate the colliding position in the hash table, and adding every element in the linked list at that position to the result candidate set; recording how many times each element in the candidate set collides with the query point, and removing the elements whose collision count is below the preset conflict threshold to obtain the nearest-neighbor candidate set; then computing the distance between each remaining point and the query point, and outputting all elements whose distance to the query point is below the preset distance threshold.
2. The method according to claim 1, characterized in that step S2 specifically comprises the following sub-steps:
S2.1. Regarding the n elements of the high-dimensional feature data set X as n vectors of d variables, so that X is expressed as the matrix X = (x1, x2, ..., xn)^T composed of n d-dimensional vectors;
S2.2. Computing the covariance matrix S of X, whose diagonal entries are the variances of the individual variables and whose off-diagonal entries are the covariances between pairs of variables; computing the eigenvector group V and the eigenvalue group N from the covariance matrix S;
S2.3. Taking the eigenvectors corresponding to the k × L largest eigenvalues in the eigenvalue group N as the principal component group V' of the high-dimensional feature data set X, and mapping X to the data set Y through V', i.e., Y = XV', where k denotes the number of hash functions in each hash table and L denotes the number of hash tables required in the index table.
3. The method according to claim 2, characterized in that step S3 specifically comprises the following sub-steps:
S3.1. Deriving the number L of hash tables in the index table and the number k of hash functions in each hash table from the parameters p1, p2, α, δ, and β, where p1 denotes the probability that two points are approximate neighbors and collide, p2 denotes the probability that two points are not approximate neighbors yet still collide, α is the conflict proportion threshold with p2 < α < p1, δ is the required success rate of the nearest-neighbor query, and β is the false-positive rate of locality-sensitive hashing (LSH);
S3.2. Obtaining the conflict proportion threshold α at which the hash table number L is minimal;
S3.3. Substituting the value of α to obtain L';
S3.4. Obtaining the conflict threshold m from the values of α and L'.
4. The method according to claim 3, characterized in that step S4 specifically comprises the following sub-steps:
S4.1: express the LSH function as h(p) = ⌊(a·p + b)/ω⌋, where a is the projection vector, p is any point of the multi-dimensional space in the high-dimensional feature data set X, b is a real number chosen uniformly at random from the range [0, ω), and ω is the projection cutting interval;
S4.2: take the k×L eigenvectors selected in step S2.3, in order, as the projection vectors of the hash functions; assuming the eigenvalues corresponding to the k eigenvectors in each hash table, arranged in descending order, are N = [n1, n2, ..., nk], assign each hash function a weight wi derived from ni (1 ≤ i ≤ k), so that in each hash table the hash value of a point p is the weighted combination of its k hash-function values;
S4.3: within each hash table the cutting interval ω of the k hash functions is identical, and the interval ω of the hash functions in each hash table is half of the interval ω of the hash functions in the previous hash table;
S4.4: construct the L hash tables from the projection vectors and cutting intervals of their hash functions, each hash table containing k hash functions; every point of the multi-dimensional space in the high-dimensional feature data set X is inserted into each hash table of the index by hash mapping, and points that produce hash collisions are stored in linked lists.
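Table construction under the definitions of steps S4.1–S4.4 can be sketched as follows. This simplified version uses the standard p-stable hash h(p) = ⌊(a·p + b)/ω⌋ and halves ω from one table to the next; it omits the eigenvalue-derived weighting of S4.2, and all names are illustrative:

```python
import numpy as np
from collections import defaultdict

def build_tables(Y, k, L, omega0, seed=0):
    """Steps S4.1-S4.4 sketch: L tables, each with k p-stable hash functions.

    The cutting interval omega is halved for each successive table (S4.3);
    colliding points are chained in per-bucket lists (S4.4).
    """
    rng = np.random.default_rng(seed)
    d = Y.shape[1]
    tables, omega = [], omega0
    for _ in range(L):
        a = rng.normal(size=(k, d))                  # k projection vectors
        b = rng.uniform(0.0, omega, size=k)          # offsets drawn from [0, omega)
        keys = np.floor((Y @ a.T + b) / omega).astype(int)  # (n, k) hash keys
        buckets = defaultdict(list)
        for idx, key in enumerate(keys):
            buckets[tuple(key)].append(idx)          # chain colliding points
        tables.append((a, b, omega, buckets))
        omega /= 2.0                                 # halve interval per table
    return tables

Y = np.random.default_rng(1).normal(size=(50, 6))
tables = build_tables(Y, k=2, L=3, omega0=4.0)
print(len(tables))  # 3 tables, with intervals 4.0, 2.0, 1.0
```

Halving ω per table makes later tables progressively more selective, so the same point set is indexed at several collision granularities.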
5. The method according to claim 4, characterized in that step S5 specifically comprises the following sub-steps:
S5.1: for a query point q, compute its hash value gi(q) (1 ≤ i ≤ L) in each hash table, and save all elements of the linked lists of the hash buckets in which collisions occur into the query result set C(q), each repeated element being saved only once, thereby obtaining the approximate set of the query point q in the high-dimensional feature data set X;
S5.2: record the number of times each point in the query result set C(q) collides with the query point q across the index; given the collision threshold m, when the collision count between a point in C(q) and the query point q is greater than m, the point is considered approximate to q and is stored in the refined result set C'(q);
S5.3: for all points in the refined result set C'(q), compute the Euclidean distance to the query point q in turn; when the distance between the two points is less than a preset distance threshold, take the point as an approximate neighbor of the query point q.
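The query path of step S5 can be sketched as: collect every point that shares a bucket with q in any table into C(q), keep the points whose collision count exceeds m as C'(q), then filter by Euclidean distance. The unweighted per-table collision count and all names here are simplifying assumptions, not the patent's exact weighted formula:

```python
import numpy as np
from collections import defaultdict

def query(q, Y, tables, m, dist_threshold):
    # S5.1/S5.2: count, per candidate point, in how many tables it
    # shares a hash bucket with the query point q.
    counts = {}
    for a, b, omega, buckets in tables:
        key = tuple(np.floor((a @ q + b) / omega).astype(int))
        for idx in buckets.get(key, []):         # walk the bucket's chain
            counts[idx] = counts.get(idx, 0) + 1
    refined = [i for i, c in counts.items() if c > m]        # C'(q)
    # S5.3: final filter by Euclidean distance to q.
    return [i for i in refined if np.linalg.norm(Y[i] - q) < dist_threshold]

# Minimal index matching the layout of claim 4: each table is
# (a, b, omega, buckets), with buckets mapping a hash key to point ids.
rng = np.random.default_rng(2)
Y = rng.normal(size=(40, 4))
tables, omega = [], 4.0
for _ in range(3):
    a = rng.normal(size=(2, 4))
    b = rng.uniform(0.0, omega, size=2)
    buckets = defaultdict(list)
    for idx, key in enumerate(np.floor((Y @ a.T + b) / omega).astype(int)):
        buckets[tuple(key)].append(idx)
    tables.append((a, b, omega, buckets))
    omega /= 2.0

# A database point queried against the index must at least return itself.
hits = query(Y[0], Y, tables, m=0, dist_threshold=1e-6)
print(hits)  # [0]: only the query point itself lies within 1e-6 of q
```

The two-stage filter (collision count, then exact distance) is what keeps the expensive Euclidean computation limited to the small refined set C'(q).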
CN201710822371.1A 2017-09-13 2017-09-13 Nearest Neighbor based on data distribution perception in cloud storage system Active CN107656989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710822371.1A CN107656989B (en) 2017-09-13 2017-09-13 Nearest Neighbor based on data distribution perception in cloud storage system


Publications (2)

Publication Number Publication Date
CN107656989A CN107656989A (en) 2018-02-02
CN107656989B true CN107656989B (en) 2019-09-13

Family

ID=61130009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710822371.1A Active CN107656989B (en) 2017-09-13 2017-09-13 Nearest Neighbor based on data distribution perception in cloud storage system

Country Status (1)

Country Link
CN (1) CN107656989B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10949467B2 (en) * 2018-03-01 2021-03-16 Huawei Technologies Canada Co., Ltd. Random draw forest index structure for searching large scale unstructured data
CN109634952B (en) * 2018-11-02 2021-08-17 宁波大学 Self-adaptive nearest neighbor query method for large-scale data
CN109829320B (en) * 2019-01-14 2020-12-11 珠海天燕科技有限公司 Information processing method and device
CN110795469B (en) * 2019-10-11 2022-02-22 安徽工业大学 Spark-based high-dimensional sequence data similarity query method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609441A (en) * 2011-12-27 2012-07-25 中国科学院计算技术研究所 Local-sensitive hash high-dimensional indexing method based on distribution entropy
WO2012165135A1 (en) * 2011-05-27 2012-12-06 公立大学法人大阪府立大学 Database logging method and logging device relating to approximate nearest neighbor search
CN103631928A (en) * 2013-12-05 2014-03-12 中国科学院信息工程研究所 LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system
CN104035949A (en) * 2013-12-10 2014-09-10 南京信息工程大学 Similarity data retrieval method based on locality sensitive hashing (LASH) improved algorithm
CN105653656A (en) * 2015-12-28 2016-06-08 成都希盟泰克科技发展有限公司 Multi-feature document retrieval method based on improved LSH (Locality-Sensitive Hashing)
CN105808631A (en) * 2015-06-29 2016-07-27 中国人民解放军装甲兵工程学院 Data dependence based multi-index Hash algorithm


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
M2LSH: an LSH-based approximate nearest neighbor search algorithm for high-dimensional data; Li Can et al.; Acta Electronica Sinica (《电子学报》); 2017-06-15 (No. 06); pp. 1431-1442 *
An LSH-based k-nearest-neighbor search algorithm for high-dimensional big data; Wang Zhongwei et al.; Acta Electronica Sinica (《电子学报》); 2016-04-15 (No. 04); pp. 906-912 *

Also Published As

Publication number Publication date
CN107656989A (en) 2018-02-02

Similar Documents

Publication Publication Date Title
CN107656989B (en) Nearest Neighbor based on data distribution perception in cloud storage system
CN105589951B (en) A kind of mass remote sensing image meta-data distribution formula storage method and parallel query method
CN107040422B (en) Network big data visualization method based on materialized cache
CN103455531B (en) A kind of parallel index method supporting high dimensional data to have inquiry partially in real time
US8321873B2 (en) System and method for offline data generation for online system analysis
CN107451302B (en) Modeling method and system based on position top-k keyword query under sliding window
CN106599091B (en) RDF graph structure storage and index method based on key value storage
US10592153B1 (en) Redistributing a data set amongst partitions according to a secondary hashing scheme
CN109726225A (en) A kind of storage of distributed stream data and querying method based on Storm
Shao et al. An efficient load-balancing mechanism for heterogeneous range-queriable cloud storage
CN106599190A (en) Dynamic Skyline query method based on cloud computing
CN109597829B (en) Middleware method for realizing searchable encryption relational database cache
WO2024156238A1 (en) Map data aggregation display method and apparatus, and electronic device
KR20220070482A (en) Image incremental clustering method, apparatus, electronic device, storage medium and program product
Neglia et al. Similarity caching: Theory and algorithms
CN112699187A (en) Associated data processing method, device, equipment, medium and product
Mealha et al. Data replication on the cloud/edge
CN106599189A (en) Dynamic Skyline inquiry device based on cloud computing
Sadineni Comparative study on skyline query processing techniques on big data
Wang et al. Distributed collaborative filtering recommendation algorithm based on DHT
Zou et al. Semantic overlay network for large-scale spatial information indexing
Feng et al. HQ-Tree: A distributed spatial index based on Hadoop
CN110381136A (en) A kind of method for reading data, terminal, server and storage medium
Liu et al. Parallelizing uncertain skyline computation against n‐of‐N data streaming model
CN107609089B (en) A kind of data processing method, apparatus and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant