CN107656989B - Neighbor query method based on data distribution awareness in cloud storage system - Google Patents

Neighbor query method based on data distribution awareness in cloud storage system

Info

Publication number
CN107656989B
CN107656989B (application CN201710822371.1A)
Authority
CN
China
Prior art keywords
hash
point
query
hash table
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710822371.1A
Other languages
Chinese (zh)
Other versions
CN107656989A (en)
Inventor
华宇
孙园园
冯丹
左鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710822371.1A priority Critical patent/CN107656989B/en
Publication of CN107656989A publication Critical patent/CN107656989A/en
Application granted granted Critical
Publication of CN107656989B publication Critical patent/CN107656989B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G06F 16/2264 Multidimensional index structures
    • G06F 16/2255 Hash tables
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2462 Approximate or statistical queries
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a neighbor query method based on data distribution awareness in a cloud storage system. The method uses the principal components of the data as the projection vectors of locality-sensitive hashing, and further quantifies the weight of each hash function in the index table and adjusts the cutting interval of the hash functions in each hash table, so that the accuracy of the neighbor query is guaranteed while the number of hash tables needed to build the index, and hence the space overhead of the index, is reduced. Furthermore, the method refines the query result set according to the hash collision frequency of the candidate results, eliminating a large number of irrelevant elements, which greatly reduces the amount of data involved in distance computation and thus the query latency. The invention makes full use of the characteristics of the data distribution, supports fast queries, and has good scalability.

Description

Neighbor query method based on data distribution awareness in a cloud storage system

Technical Field

The invention belongs to the technical field of computer storage, and more specifically relates to a neighbor query method based on data distribution awareness in a cloud storage system.

Background Art

Massive storage systems consume large amounts of system resources, such as computation, storage and network bandwidth, to support query-related requests; nevertheless, processing and analyzing massive high-dimensional data in real time remains a huge challenge. Obtaining exact query results is time-consuming and often impractical, because users frequently cannot give a precise description of what they are querying for. Nearest neighbor query services have therefore received more and more attention in practical applications thanks to their real-time nature.

Locality Sensitive Hashing (LSH) is widely used to support neighbor query services because its hash computation is simple and it preserves the locality of the data.
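To make the p-stable LSH scheme referred to above concrete, the following Python sketch builds one random-projection hash function of the standard form h(p) = ⌊(a·p + b)/ω⌋; the dimension, interval ω and random seed are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

def make_lsh_function(dim, omega, rng):
    """One p-stable LSH function: h(p) = floor((a . p + b) / omega)."""
    a = rng.standard_normal(dim)        # random projection vector (2-stable, Gaussian)
    b = rng.uniform(0.0, omega)         # random offset drawn from [0, omega)
    return lambda p: int(np.floor((np.dot(a, p) + b) / omega))

rng = np.random.default_rng(0)
h = make_lsh_function(dim=8, omega=4.0, rng=rng)
p = rng.standard_normal(8)
q = p + 0.01 * rng.standard_normal(8)   # a point very close to p
print(h(p), h(q))                       # nearby points collide with high probability
```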

Existing LSH-based neighbor query methods suffer from the following problems:

(1) Low accuracy. Methods based on traditional LSH select the projection vectors of the hash functions at random, without considering the data distribution. Uniformly distributed data is mapped into the hash tables with equal probability by random projection vectors, so the data in the hash buckets is balanced. In practice, however, data distributions are mostly non-uniform. Mapping non-uniformly distributed data through randomly chosen projection directions causes many unrelated items to be grouped together, which lowers the accuracy of the neighbor query.

(2) Low space efficiency. The projection vectors of traditional LSH hash functions are independent of the data distribution, so traditional LSH methods need a large number of hash tables to guarantee query accuracy. The resulting memory overhead becomes the performance bottleneck of traditional LSH-based methods.

(3) High query latency. Because of the random projections of traditional LSH, many irrelevant items are probed during a query and stored into the result set. The result set therefore contains a large number of candidate elements whose distances to the query element must be computed, which is very time-consuming and makes the query latency unacceptable for users.

Summary of the Invention

In view of the above defects or improvement needs of the prior art, the present invention provides a neighbor query method based on data distribution awareness in a cloud storage system, thereby solving the technical problems of low accuracy, low space efficiency and high query latency of neighbor queries in massive storage systems.

To achieve the above objective, the present invention provides a neighbor query method based on data distribution awareness in a cloud storage system, comprising:

S1. Randomly extract part of the data from the original high-dimensional data set to form a high-dimensional feature data set;

S2. Represent each element of the high-dimensional feature data set as a multi-dimensional vector, so that the high-dimensional feature data set is represented as a matrix composed of multiple multi-dimensional vectors; compute the covariance matrix of this matrix offline by principal component analysis, and obtain its eigenvectors and eigenvalues;

S3. Obtain the required number of hash tables in the index table, the number of hash functions in each hash table, and the collision threshold;

S4. In descending order of the eigenvalues, take the eigenvector corresponding to each eigenvalue, one by one, as the projection vector of a hash function; compute the weight of each hash function in each hash table according to the eigenvalue of its eigenvector; then adjust the cutting interval of the hash functions in each hash table; finally, map the original high-dimensional data set into the whole index table through the optimized hash functions, storing elements that produce hash collisions in a linked list;

S5. For each query point, compute its hash value in each hash table with the optimized hash functions, use the hash value to locate the position in the hash table where a hash collision occurs, and store all elements of the linked list at that position into the result candidate set; record how many times each element of the candidate set collides with the query point, and remove the elements whose collision count is below the preset collision threshold to obtain the neighbor query set; finally, compute the distance between each point of the neighbor query set and the query point, and output all elements whose distance to the query point is smaller than the preset distance threshold.

Preferably, step S2 specifically includes the following sub-steps:

S2.1. Treat the n elements of the high-dimensional feature data set X as n vectors of d variables, so that X is represented as a matrix composed of n d-dimensional vectors;

S2.2. Compute the covariance matrix S of X, whose diagonal entries are the variances of the individual dimensions and whose off-diagonal entries are the covariances between pairs of dimensions, and from S compute the eigenvector group V and the eigenvalue group N;

S2.3. Take the eigenvectors corresponding to the k×L largest eigenvalues of the eigenvalue group N as the principal component group V' of the high-dimensional feature data set X, and map X through V' to obtain the data set Y with Y = XV', where k denotes the number of hash functions in each hash table and L denotes the required number of hash tables in the index table.

Preferably, step S3 specifically includes the following sub-steps:

S3.1. Determine the number L of hash tables in the index table and the number k of hash functions in each hash table, where p1 is the probability that two points are near neighbors and collide, p2 is the probability that two points are not near neighbors but still collide, α is the collision-ratio threshold with p2 < α < p1, δ is the required success rate of the neighbor query, and β is the false-positive rate of locality-sensitive hashing (LSH);

S3.2. Obtain the collision-ratio threshold α for which the number of hash tables L is minimal;

S3.3. From the value of α, obtain the corresponding minimal number of hash tables L';

S3.4. From the values of α and L', obtain the collision threshold m.

Preferably, step S4 specifically includes the following sub-steps:

S4.1. Express the LSH function as h_{a,b}(p) = ⌊(a·p + b)/ω⌋, where a is the projection vector, p is any point of the multi-dimensional space of the high-dimensional feature data set X, b is a real number chosen at random from the range [0, ω), and ω is the projection cutting interval;

S4.2. Take the k×L eigenvectors selected in step S2.3, in order and one by one, as the projection vectors of the hash functions; assuming the eigenvalues corresponding to the k eigenvectors of a hash table are, in descending order, N = [n1, n2, ..., nk], assign each hash function a weight derived from its eigenvalue, so that in each hash table the hash value of a point p is the weighted sum of its k individual hash values;

S4.3. Within each hash table, the k hash functions share the same cutting interval ω, and the interval ω used by the hash functions of each hash table is half of the interval used by the previous hash table;

S4.4. Build the L hash tables according to the projection vectors and cutting intervals of their hash functions, each hash table containing k hash functions; insert every point of the multi-dimensional space of the high-dimensional feature data set X into each hash table of the index table through the hash mapping, storing the points that collide in the same bucket in a linked list.

Preferably, step S5 specifically includes the following sub-steps:

S5.1. For a query point q, compute its hash value g_i(q) (1 ≤ i ≤ L) in each hash table, and save all elements of the linked list of every colliding hash bucket into the query result set C(q), storing repeated elements only once; this yields the approximate set of the query point q in the high-dimensional feature data set X;

S5.2. Record, for each point of the query result set C(q), the number of times it collides with the query point q in the index table, i.e. the number of hash tables in which it falls into the same bucket as q; given the collision threshold m, a point of C(q) is considered close to q only when its number of collisions with q is greater than m, and such points are stored into the refined result set C'(q);

S5.3. Compute the Euclidean distance between each point of the refined result set C'(q) and the query point q in turn; when the distance between the two points is smaller than the preset distance threshold, the point is taken as an approximate neighbor of the query point q.

In general, compared with the prior art, the above technical solutions conceived by the present invention achieve the following beneficial effects:

The method uses the principal components of the data as the projection vectors of locality-sensitive hashing, and further quantifies the weight of each hash function in the index table and adjusts the cutting interval of the hash functions in each hash table, guaranteeing the accuracy of the neighbor query while reducing the number of hash tables needed to build the index and hence the space overhead of the index. Furthermore, the method refines the query result set according to the hash collision frequency of the candidate results, eliminating a large number of irrelevant elements, which greatly reduces the amount of data involved in distance computation and the query latency.

Brief Description of the Drawings

Fig. 1 is a schematic flowchart of a neighbor query method based on data distribution awareness in a cloud storage system provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a principal component analysis computation method provided by an embodiment of the present invention;

Fig. 3 is a schematic flowchart of a parameter setting method provided by an embodiment of the present invention;

Fig. 4 is a schematic flowchart of an index table construction method provided by an embodiment of the present invention;

Fig. 5 is a schematic flowchart of a neighbor query method provided by an embodiment of the present invention.

Detailed Description of the Embodiments

In order to make the objective, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and not to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.

The present invention is a neighbor query method based on data distribution awareness in a cloud storage system. Taking the characteristics of the data distribution into account, it uses principal component analysis to guide the selection of the projection vectors of the LSH algorithm, and further quantifies the weight of each hash function in the index table and adjusts the cutting interval of the hash functions in each hash table, so that the accuracy of the neighbor query is guaranteed while the number of hash tables needed to build the index, and hence the space overhead of the index, is reduced. Furthermore, the method refines the query result set according to the hash collision frequency of the candidate results, eliminating a large number of irrelevant elements, which greatly reduces the amount of data involved in distance computation and the query latency.

Fig. 1 is a schematic flowchart of a neighbor query method based on data distribution awareness in a cloud storage system provided by an embodiment of the present invention; the method shown in Fig. 1 includes the following steps:

S1. Data set sampling: randomly extract part of the data from the original high-dimensional data set to form a high-dimensional feature data set, so as to keep the computation time acceptable;

S2. Principal component analysis: perform principal component analysis on the high-dimensional feature data set obtained in step S1 to obtain the projection vectors of the hash functions. Specifically: represent each element of the high-dimensional feature data set as a multi-dimensional vector, so that the data set is represented as a matrix composed of multiple multi-dimensional vectors; compute the covariance matrix of this matrix offline by principal component analysis, and obtain its eigenvectors and eigenvalues;

In an optional implementation, as shown in Fig. 2, step S2 specifically includes the following sub-steps:

S2.1. Treat the n elements of the high-dimensional feature data set X as n vectors of d variables, so that X is represented as a matrix composed of n d-dimensional vectors;

S2.2. Compute the covariance matrix S of X, whose diagonal entries are the variances of the individual dimensions and whose off-diagonal entries are the covariances between pairs of dimensions, and from S compute the eigenvector group V and the eigenvalue group N;

S2.3. Take the eigenvectors corresponding to the k×L largest eigenvalues of the eigenvalue group N as the principal component group V' of the high-dimensional feature data set X, and map X through V' to obtain the data set Y with Y = XV', where k denotes the number of hash functions in each hash table and L denotes the required number of hash tables in the index table.

Generally, the data variance captured by the eigenvectors of the first few, larger eigenvalues accounts for a larger share of the data and therefore reflects query performance better. Using these leading eigenvectors, which reflect the data distribution, as the projection vectors of LSH to build the index table reduces the space overhead.
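A minimal Python sketch of this principal-component step follows: it computes the covariance matrix of a sampled feature set, keeps the k×L eigenvectors with the largest eigenvalues as projection vectors, and forms Y = XV' as in step S2.3. The sample size, dimensionality, and the values of k and L used here are illustrative assumptions.

```python
import numpy as np

def principal_projection_vectors(X, k, L):
    """Return the k*L leading eigenvectors (as columns) and their eigenvalues."""
    Xc = X - X.mean(axis=0)                  # center the sampled feature data set
    S = np.cov(Xc, rowvar=False)             # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigendecomposition of the symmetric matrix
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in descending order
    top = order[:k * L]
    return eigvecs[:, top], eigvals[top]     # principal component group V' and eigenvalue group N

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 32))          # sampled high-dimensional feature data set
Vp, N = principal_projection_vectors(X, k=4, L=3)
Y = X @ Vp                                   # Y = X V' (step S2.3)
print(Vp.shape, N[:3])
```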

S3. Parameter setting: obtain the required number of hash tables in the index table, the number of hash functions in each hash table, and the collision threshold;

In an optional implementation, as shown in Fig. 3, step S3 specifically includes the following sub-steps:

S3.1. Determine the number L of hash tables in the index table and the number k of hash functions in each hash table, where p1 is the probability that two points are near neighbors and collide, p2 is the probability that two points are not near neighbors but still collide, α is the collision-ratio threshold with p2 < α < p1, δ is the required success rate of the neighbor query, and β is the false-positive rate of locality-sensitive hashing (LSH);

Here δ and β take preferred preset values, chosen according to the desired success rate and the tolerable false-positive rate.

S3.2. Obtain the collision-ratio threshold α for which the number of hash tables L is minimal.

Specifically, let L1 and L2 be the numbers of hash tables required to satisfy the success-rate constraint (determined by p1, α and δ) and the false-positive constraint (determined by α, p2 and β), respectively, so that L = max(L1, L2). Since p2 < α < p1, L1 increases as α increases while L2 decreases as α increases. The number of hash tables L therefore reaches its minimum when L1 = L2, which determines the corresponding value of α.

S3.3. Substitute this value of α into L1 to obtain the minimal number of hash tables L'.

S3.4. From the values of α and L', obtain the collision threshold m.
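The closed-form expressions for L, α and the collision threshold are not reproduced in this text (the original formula images are missing), so the sketch below only mirrors the structure of the derivation: L = max(L1, L2), α chosen where the two terms cross, and a collision threshold taken as roughly α·L'. The Chernoff-bound style forms of L1 and L2 are an assumption borrowed from collision-counting LSH schemes, not the patent's own formulas.

```python
import math

def choose_parameters(p1, p2, delta, beta, steps=10000):
    """Illustrative search for alpha minimizing L = max(L1, L2) over (p2, p1).

    ASSUMPTION: L1 and L2 use Chernoff-bound style expressions; the patent's
    exact formulas are not available in this text.
    """
    best_L, best_alpha = float("inf"), None
    for i in range(1, steps):
        alpha = p2 + (p1 - p2) * i / steps
        L1 = math.log(1.0 / delta) / (2.0 * (p1 - alpha) ** 2)  # grows as alpha increases
        L2 = math.log(1.0 / beta) / (2.0 * (alpha - p2) ** 2)   # shrinks as alpha increases
        L = max(L1, L2)
        if L < best_L:
            best_L, best_alpha = L, alpha
    L_prime = math.ceil(best_L)
    m = math.ceil(best_alpha * L_prime)      # collision threshold, assumed ~ alpha * L'
    return L_prime, best_alpha, m

print(choose_parameters(p1=0.9, p2=0.5, delta=0.1, beta=0.01))
```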

S4. Index table construction: in descending order of the eigenvalues, take the eigenvector corresponding to each eigenvalue, one by one, as the projection vector of a hash function; compute the weight of each hash function in each hash table according to the eigenvalue of its eigenvector; then adjust the cutting interval of the hash functions in each hash table; finally, map the original high-dimensional data set into the whole index table through the optimized hash functions, storing elements that produce hash collisions in a linked list;

In an optional implementation, as shown in Fig. 4, step S4 specifically includes the following sub-steps:

S4.1. Weight quantization: express the LSH function as h_{a,b}(p) = ⌊(a·p + b)/ω⌋, where a is the projection vector, p is any point of the multi-dimensional space of the high-dimensional feature data set X, b is a real number chosen at random from the range [0, ω), and ω is the projection cutting interval;

In general, k hash functions are used in each hash table to compute the hash value of a point of the multi-dimensional space. In traditional random-projection LSH, each hash function of a hash table is assigned a random weight, and the weighted sum is used as the hash value of the point. For any point p of the set X, its hash value in a given hash table is computed as g(p) = a1·h1(p) + a2·h2(p) + ... + ak·hk(p), where each weight ai is a random number between 0 and 1 following a p-stable distribution. A hash function with a better projection vector separates data that would otherwise be clustered together by the mapping; that is, it maps nearby points into the same hash bucket with high probability and distant points into the same bucket with low probability. Intuitively, hash functions with better projection vectors should therefore be assigned larger weights to obtain better query performance.

S4.2. Take the k×L eigenvectors selected in step S2.3, in order and one by one, as the projection vectors of the hash functions. Assuming the eigenvalues corresponding to the k eigenvectors of a hash table are, in descending order, N = [n1, n2, ..., nk], assign the i-th hash function (1 ≤ i ≤ k) a weight derived from its eigenvalue ni, so that in each hash table the hash value of a point p is the weighted sum of its k individual hash values;

S4.3. Cutting interval adjustment: within each hash table, the k hash functions share the same cutting interval ω, and the interval ω used by the hash functions of each hash table is half of the interval used by the previous hash table;

Specifically, the parameter ω reflects the granularity of hash collisions. A larger interval ω increases the probability that similar points collide, but it may also map distant points into the same hash bucket, hurting query precision. A smaller interval ω reduces the probability that distant points collide, but it may also map nearby points into different hash buckets, hurting query recall. In this embodiment, the value of the cutting interval ω is adjusted to improve query performance: within each hash table, the k hash functions share the same cutting interval ω, and the interval of the hash functions of each hash table is half of that of the previous hash table, i.e. for the i-th hash table the interval is ω_i = ω_0 / 2^(i-1), where ω_0 is the initial interval of the hash functions of the first hash table.

S4.4. Construction and insertion: build the L hash tables according to the projection vectors and cutting intervals of their hash functions, each hash table containing k hash functions; insert every point of the multi-dimensional space of the high-dimensional feature data set X into each hash table of the index table through the hash mapping, storing the points that collide in the same bucket in a linked list.
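The following self-contained Python sketch ties steps S4.1 to S4.4 together: it builds L hash tables whose k hash functions use the principal-component projection vectors, combines the k per-function values into one bucket key through eigenvalue-based weights, and halves the cutting interval ω from one table to the next. The eigenvalue-normalized weights and the use of a dict of lists in place of the linked-list buckets are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def build_index(X, V, N, L, k, omega0, seed=0):
    """Build L hash tables over the data set X (rows are points).

    V : d x (k*L) matrix of principal-component projection vectors (step S2.3).
    N : the k*L matching eigenvalues in descending order.
    ASSUMPTION: each function's weight is its eigenvalue normalized within the
    table; the patent's exact weight formula is not reproduced in this text.
    """
    rng = np.random.default_rng(seed)
    tables = []
    for i in range(L):
        omega = omega0 / (2 ** i)                    # interval halved for each later table (S4.3)
        A = V[:, i * k:(i + 1) * k]                  # the k projection vectors of this table
        w = N[i * k:(i + 1) * k]
        w = w / w.sum()                              # eigenvalue-based weights (assumed form)
        b = rng.uniform(0.0, omega, size=k)          # random offsets in [0, omega)
        buckets = defaultdict(list)                  # bucket key -> list of point ids ("linked list")
        for idx, p in enumerate(X):
            h = np.floor((A.T @ p + b) / omega)      # k per-function hash values of point p
            key = round(float(w @ h), 6)             # weighted sum g(p) used as the bucket key
            buckets[key].append(idx)
        tables.append((A, b, w, omega, buckets))
    return tables
```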

S5. Neighbor query: for each query point, compute its hash value in each hash table with the optimized hash functions, use the hash value to locate the position in the hash table where a hash collision occurs, and store all elements of the linked list at that position into the result candidate set; record how many times each element of the candidate set collides with the query point, and remove the elements whose collision count is below the preset collision threshold to obtain the neighbor query set; finally, compute the distance between each point of the neighbor query set and the query point, and output all elements whose distance to the query point is smaller than the preset distance threshold.

In an optional implementation, as shown in Fig. 5, step S5 specifically includes the following sub-steps:

S5.1. For a query point q, compute its hash value g_i(q) (1 ≤ i ≤ L) in each hash table, and save all elements of the linked list of every colliding hash bucket into the query result set C(q), storing repeated elements only once; this yields the approximate set of the query point q in the high-dimensional feature data set X;

S5.2. Record, for each point of the query result set C(q), the number of times it collides with the query point q in the index table, i.e. for any point p of C(q), the number of hash tables in which p and q fall into the same bucket; given the collision threshold m, a point of C(q) is considered close to q only when its number of collisions with q is greater than m, and such points are stored into the refined result set C'(q);

S5.3. Compute the Euclidean distance between each point of the refined result set C'(q) and the query point q in turn; when the distance between the two points is smaller than the preset distance threshold, the point is taken as an approximate neighbor of the query point q.

The preset distance threshold can be determined according to actual needs.
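Continuing the assumed index layout of the construction sketch above, the following Python function walks through steps S5.1 to S5.3: it gathers the candidates that share a bucket with the query point in any table, counts per-candidate collisions, keeps only those above the collision threshold m, and finishes with the exact Euclidean-distance check. The names and data layout are carried over from the earlier sketch and remain assumptions.

```python
import numpy as np
from collections import Counter

def query(q, X, tables, m, dist_threshold):
    """Approximate neighbor query for point q over an index built by build_index."""
    collisions = Counter()                            # candidate id -> number of colliding tables
    for A, b, w, omega, buckets in tables:
        h = np.floor((A.T @ q + b) / omega)
        key = round(float(w @ h), 6)                  # same bucket key computation as at build time
        for idx in buckets.get(key, []):              # step S5.1: candidates in the colliding bucket
            collisions[idx] += 1
    refined = [idx for idx, c in collisions.items() if c > m]   # step S5.2: refined set C'(q)
    neighbors = []
    for idx in refined:                               # step S5.3: exact Euclidean distance check
        d = float(np.linalg.norm(X[idx] - q))
        if d < dist_threshold:
            neighbors.append((idx, d))
    return sorted(neighbors, key=lambda t: t[1])
```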

Those skilled in the art will readily understand that the above description is only of preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A neighbor query method based on data distribution awareness in a cloud storage system, characterized by comprising:
S1. randomly extracting part of the data from an original high-dimensional data set to form a high-dimensional feature data set;
S2. representing each element of the high-dimensional feature data set as a multi-dimensional vector, so that the high-dimensional feature data set is represented as a matrix composed of multiple multi-dimensional vectors, computing the covariance matrix of this matrix offline by principal component analysis, and obtaining the eigenvectors and eigenvalues of the matrix;
S3. obtaining the required number of hash tables in the index table, the number of hash functions in each hash table, and the collision threshold;
S4. in descending order of the eigenvalues, taking the eigenvector corresponding to each eigenvalue, one by one, as the projection vector of a hash function, computing the weight of each hash function in each hash table according to the eigenvalue of its eigenvector, then adjusting the cutting interval of the hash functions in each hash table, and finally mapping the original high-dimensional data set into the whole index table through the optimized hash functions, the elements that produce hash collisions being stored in a linked list;
S5. for each query point, computing the corresponding hash value in each hash table with the optimized hash functions, locating, through the hash value, the position in the hash table where a hash collision occurs, storing all elements of the linked list at that position into a result candidate set, recording the number of times each element of the result candidate set collides with the query point, removing the elements whose collision count is below the preset collision threshold to obtain a neighbor query set, computing and comparing the distance between each point of the neighbor query set and the query point, and outputting all elements whose distance to the query point is smaller than a preset distance threshold.
2. The method according to claim 1, characterized in that step S2 specifically comprises the following sub-steps:
S2.1. treating the n elements of the high-dimensional feature data set X as n vectors of d variables, so that X is represented as a matrix composed of n d-dimensional vectors;
S2.2. computing the covariance matrix S of X, whose diagonal entries are the variances and whose off-diagonal entries are the covariances, and computing the eigenvector group V and the eigenvalue group N from the covariance matrix S;
S2.3. taking the eigenvectors corresponding to the k×L largest eigenvalues of the eigenvalue group N as the principal component group V' of the high-dimensional feature data set X, and mapping X through V' so that it is represented as a data set Y with Y = XV', where k denotes the number of hash functions in each hash table and L denotes the required number of hash tables in the index table.
3. The method according to claim 2, characterized in that step S3 specifically comprises the following sub-steps:
S3.1. obtaining the number L of hash tables in the index table and the number k of hash functions in each hash table, where p1 denotes the probability that two points are near neighbors and collide, p2 denotes the probability that two points are not near neighbors but still collide, α is the collision-ratio threshold with p2 < α < p1, δ is the success rate of the neighbor query, and β is the false-positive rate of locality-sensitive hashing (LSH);
S3.2. obtaining the collision-ratio threshold α for which the number of hash tables L is minimal;
S3.3. obtaining the minimal number of hash tables L' from the value of α;
S3.4. obtaining the collision threshold from the values of α and L'.
4. The method according to claim 3, characterized in that step S4 specifically comprises the following sub-steps:
S4.1. expressing the LSH function as h_{a,b}(p) = ⌊(a·p + b)/ω⌋, where a is the projection vector, p is any point of the multi-dimensional space of the high-dimensional feature data set X, b is a real number chosen at random from the range [0, ω), and ω is the projection cutting interval;
S4.2. taking the k×L eigenvectors selected in step S2.3, in order and one by one, as the projection vectors of the hash functions, and, assuming the eigenvalues corresponding to the k eigenvectors of a hash table are, in descending order, N = [n1, n2, ..., nk], assigning the i-th hash function (1 ≤ i ≤ k) a weight derived from its eigenvalue, so that in each hash table the hash value of a point p is the weighted sum of its k individual hash values;
S4.3. the cutting interval ω of the k hash functions being identical within each hash table, and the interval ω of the hash functions of the next hash table being half of the interval ω of the hash functions of the previous hash table;
S4.4. building L hash tables according to the projection vectors and cutting intervals of the hash functions of each hash table, each hash table containing k hash functions, inserting all points of the multi-dimensional space of the high-dimensional feature data set X into each hash table of the index table through the hash mapping, and storing the points where hash collisions occur in a linked list.
5. The method according to claim 4, characterized in that step S5 specifically comprises the following sub-steps:
S5.1. for a query point q, computing its hash value g_i(q), 1 ≤ i ≤ L, in each hash table, and saving all elements of the linked list of every colliding hash bucket into the query result set C(q), repeated elements being saved only once, so as to obtain the approximate set of the query point q in the high-dimensional feature data set X;
S5.2. recording the number of times each point of the query result set C(q) collides with the query point q in the index table, i.e., for any point p of C(q), the number of hash tables in which p and q fall into the same bucket; given the collision threshold m, a point of C(q) is considered approximate to the query point q only when its number of collisions with q is greater than m, and such a point is stored into the refined result set C'(q);
S5.3. computing, in turn, the Euclidean distance between every point of the refined result set C'(q) and the query point q, and, when the distance between the two points is smaller than the preset distance threshold, taking that point as an approximate neighbor of the query point q.
CN201710822371.1A 2017-09-13 2017-09-13 Neighbor query method based on data distribution awareness in cloud storage system Active CN107656989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710822371.1A CN107656989B (en) 2017-09-13 2017-09-13 Neighbor query method based on data distribution awareness in cloud storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710822371.1A CN107656989B (en) 2017-09-13 2017-09-13 Neighbor query method based on data distribution awareness in cloud storage system

Publications (2)

Publication Number Publication Date
CN107656989A CN107656989A (en) 2018-02-02
CN107656989B 2019-09-13

Family

ID=61130009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710822371.1A Active CN107656989B (en) 2017-09-13 2017-09-13 Neighbor query method based on data distribution awareness in cloud storage system

Country Status (1)

Country Link
CN (1) CN107656989B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10949467B2 (en) * 2018-03-01 2021-03-16 Huawei Technologies Canada Co., Ltd. Random draw forest index structure for searching large scale unstructured data
CN109634952B (en) * 2018-11-02 2021-08-17 宁波大学 An adaptive nearest neighbor query method for large-scale data
CN109829320B (en) * 2019-01-14 2020-12-11 珠海天燕科技有限公司 Information processing method and device
CN110795469B (en) * 2019-10-11 2022-02-22 安徽工业大学 Spark-based high-dimensional sequence data similarity query method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012165135A1 (en) * 2011-05-27 2012-12-06 公立大学法人大阪府立大学 Database logging method and logging device relating to approximate nearest neighbor search
CN102609441A (en) * 2011-12-27 2012-07-25 中国科学院计算技术研究所 Local-sensitive hash high-dimensional indexing method based on distribution entropy
CN103631928A (en) * 2013-12-05 2014-03-12 中国科学院信息工程研究所 LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system
CN104035949A (en) * 2013-12-10 2014-09-10 南京信息工程大学 Similarity data retrieval method based on locality sensitive hashing (LASH) improved algorithm
CN105808631A (en) * 2015-06-29 2016-07-27 中国人民解放军装甲兵工程学院 Data dependence based multi-index Hash algorithm
CN105653656A (en) * 2015-12-28 2016-06-08 成都希盟泰克科技发展有限公司 Multi-feature document retrieval method based on improved LSH (Locality-Sensitive Hashing)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
M2LSH: An LSH-based approximate nearest neighbor search algorithm for high-dimensional data; 李灿 et al.; 《电子学报》 (Acta Electronica Sinica); 2017-06-15 (No. 06); pp. 1431-1442 *
An LSH-based k-nearest-neighbor search algorithm for high-dimensional big data; 王忠伟 et al.; 《电子学报》 (Acta Electronica Sinica); 2016-04-15 (No. 04); pp. 906-912 *

Also Published As

Publication number Publication date
CN107656989A (en) 2018-02-02

Similar Documents

Publication Publication Date Title
US8745055B2 (en) Clustering system and method
CN102915347B (en) A kind of distributed traffic clustering method and system
CN107656989B (en) Neighbor query method based on data distribution awareness in cloud storage system
CN103455531B (en) A kind of parallel index method supporting high dimensional data to have inquiry partially in real time
CN103744934A (en) Distributed index method based on LSH (Locality Sensitive Hashing)
CN110334290B (en) MF-Octree-based spatio-temporal data rapid retrieval method
CN106599091B (en) RDF graph structure storage and index method based on key value storage
Khan et al. Query-friendly compression of graph streams
Li et al. Parallel skyline queries over uncertain data streams in cloud computing environments
CN104809210B (en) One kind is based on magnanimity data weighting top k querying methods under distributed computing framework
KR101255639B1 (en) Column-oriented database system and join process method using join index thereof
Amagata et al. Space filling approach for distributed processing of top-k dominating queries
Yin et al. Efficient distributed skyline computation using dependency-based data partitioning
CN106503245B (en) Method and device for selecting a set of support points
CN107169114A (en) A kind of mass data multidimensional ordering searching method
CN108829846B (en) Service recommendation platform data clustering optimization system and method based on user characteristics
CN107239791A (en) A kind of higher-dimension K means cluster centre method for optimizing based on LSH
CN109446293A (en) A kind of parallel higher-dimension nearest Neighbor
CN106202303B (en) A kind of Chord routing table compression method and optimization file search method
CN107943927B (en) The memory module conversion method of multidimensional data in a kind of distributed memory system
Bai et al. An efficient skyline query algorithm in the distributed environment
CN113901278A (en) Data search method and device based on global multi-detection and adaptive termination
US9292555B2 (en) Information processing device
CN108090182B (en) A kind of distributed index method and system of extensive high dimensional data
Zhou et al. Accurate querying of frequent subgraphs in power grid graph data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant