WO2021036070A1 - Hamming space-based approximate query method and storage medium - Google Patents

Hamming space-based approximate query method and storage medium

Info

Publication number
WO2021036070A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
query
hash
segmentation
hamming
Prior art date
Application number
PCT/CN2019/122454
Other languages
French (fr)
Chinese (zh)
Inventor
秦建斌
王尧舒
Original Assignee
深圳计算科学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳计算科学研究院
Publication of WO2021036070A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Definitions

  • the invention relates to the field of database approximate query, in particular to a Hamming space approximate query method and storage medium.
  • the current approximate query of basic data types is a basic problem in the database field, such as the approximate query of strings and sets, and has been studied for many years.
  • the approximate query and approximate semantic query of more complex data types have not achieved good results in the database field.
  • Because binary data is simple and easy to query, the combination of hash mapping functions and Hamming approximate query has played a key role in many applications, such as web search, image search, and scientific and technical databases.
  • Google uses a SimHash hashing technology as a hash mapping function to map each web page to a 64-dimensional binary vector.
  • Hamming approximate query is used to find all approximate matching web pages.
  • the deep neural network model is used as a hash mapping function to map images into high-dimensional binary vectors, and Hamming approximate query can efficiently return images similar to the query image.
  • Hamming approximate query can be used to find similar molecular structures, in which a hash mapping function converts molecules into high-dimensional binary vectors, and molecules that meet the Hamming threshold are returned.
  • neural network models such as autoencoders, attention models, and LSTMs are wrapped as hash mapping functions to map text into high-dimensional binary vectors. Only the database records with a small Hamming distance to the query are returned as results.
  • the purpose of the present invention is to provide a Hamming space approximate query method and storage medium to solve the above-mentioned defects.
  • the present invention adopts the following technical solutions:
  • An approximate query method for Hamming space including the steps:
  • the index structure includes histogram and inverted hash index
  • the method for obtaining the hash database is: for each record and for the query data, detecting the type and structure of the data; according to that type and structure, selecting the corresponding hash mapping function from the set of hash functions; and, through the selected hash mapping function, mapping the input data into a hash binary vector.
  • the method for column reordering the binary data in the hash database is: designing a cost model based on column reordering; performing initial column division on the binary data; and performing approximate division after initializing the column division.
  • the method for initializing the column partitioning of the binary data is: initialize an empty data partition; select a data column and, if that column yields the minimum information entropy for the current data partition, put it into the current data partition; select the next data column and repeat the same process until the size of the current data partition reaches the upper limit, at which point the first data partition has been generated; after that, repeat the partitioning process until all data columns have been allocated to a data partition.
  • the method for performing approximate segmentation is to iteratively exchange the two data columns with the largest difference between the current approximate query effects.
  • the method for assigning corresponding query thresholds for each data segmentation is:
  • a dynamic programming algorithm is used for query threshold allocation.
  • the Hamming space approximate query method further includes: extracting a candidate set according to an index structure, and verifying one by one to obtain a final result.
  • the method of extracting a candidate set according to the index structure and verifying one by one to obtain the final result is:
  • the embodiment of the present invention queries efficiently over data sets with different degrees of skew. For highly skewed data sets, such as biomolecular data sets, most existing methods lose their filtering ability and can only scan and verify the records in the data set one by one to obtain the result.
  • the novel pigeonhole principle proposed by the embodiment of the present invention makes good use of the skew of the data and performs threshold allocation according to that skew, thereby filtering out a large amount of non-result data.
  • the embodiment of the present invention performs threshold allocation based on data skew and uses a dynamic programming algorithm to optimize the candidate set, so as to achieve the best filtering effect.
  • Fig. 1 is a logical block diagram of a Hamming approximate query method provided by an embodiment of the present invention.
  • Fig. 2 is a flowchart of a Hamming approximate query method provided by an embodiment of the present invention.
  • the present invention aims to support approximate queries over multiple data types: given a query input, find all records in the database whose mapped vectors lie within a given Hamming distance threshold of the mapped query vector.
  • the present invention is divided into two major steps: 1. Map the database records and the query into Hamming space with a given mapping function. 2. Run a Hamming approximate query over the mapped data set.
  • the embodiment of the present invention provides a Hamming approximate query method based on the pigeonhole principle. As shown in Figure 1 and Figure 2, the method includes:
  • Step 10 Hash mapping function.
  • the hash mapping function realizes the mapping of any data type into a hash binary vector. This step is divided into two sub-steps: data type detection and hash generation. Since different data types use different hash mapping methods, data type detection aims to detect the type and structure of the input data, and then assign it to a hash generation module suitable for it.
  • the hash generation module is a collection of a series of hash functions, such as SimHash, MinHash, LSTM, convolutional neural network model, autoencoder model and so on. Its purpose is to map input data to vectors in Hamming space.
  • This step 10 specifically includes: step 101, data type detection; step 102, hash generation.
  • the hash mapping module is designed to map all records and query data in the database into a hash binary vector.
  • Step 20 Perform column reordering on the binary data in the hash database.
  • to address data skew and the correlation between dimensions, the existing methods rely on random ordering and on reducing the correlation within column clusters in order to reduce the skew of the data. Their goal is to make the dimensions of each partition as evenly distributed as possible, so that the threshold assigned to each partition does not introduce a large number of candidates.
  • in contrast, the present invention aims to increase the skew of each column partition, so that the Hamming threshold allocation can be more effective.
  • this embodiment designs a cost model for column reordering, and converts this problem into an optimization problem that optimizes the performance of query processing.
  • Step 201 Design a cost model based on column reordering.
  • a query set $Q = \{\langle q_1, \tau_1\rangle, \langle q_2, \tau_2\rangle, \ldots, \langle q_{|Q|}, \tau_{|Q|}\rangle\}$ is designed in advance, and the data set is split into m column partitions P; the cost model of query optimization is as follows:
  • the right side is the sum of the actual approximate query costs for all query and threshold pairs in the query set. The calculation of the query cost is ignored here and discussed in detail in the following steps. With the above formula, the task can be stated as an optimization problem: given a binary data set D and a query set Q, the goal of this embodiment is to find a column partitioning P that achieves the minimum query cost.
  • the column partition optimization problem is an NP-hard problem.
  • Step 202 Initialize column partitioning.
  • because column partitioning is an NP-hard problem, this embodiment discusses it in two steps: initial column partitioning and an approximate partitioning algorithm.
  • the approximate algorithm can only reach a local optimum, so a good initial column partitioning is essential to improve the effect of the approximate algorithm.
  • the correlation between the columns plays a key role.
  • the method of this embodiment has the opposite goal.
  • the threshold assignment method of this embodiment assigns a larger threshold to this partition so that the other partitions receive smaller thresholds. In other words, this embodiment assigns appropriate thresholds to different data partitions. If the data columns are uniformly distributed, all the partitions have the same distribution, which makes it difficult to optimize for highly skewed partitions.
  • this embodiment uses information entropy to measure the correlation between data columns. For a data partition $P_i$, where $P_i$ also denotes the data set projected onto the columns of that partition, the correlation of $P_i$ is measured by the following formula:
  • the information entropy of the entire data segmentation scheme P is the cumulative sum of the information entropy of all data segmentation, which is:
  • the goal of this embodiment is to find an initial segmentation scheme P such that H(P) is minimized.
  • this embodiment uses an equal-size greedy method: at the beginning P is an empty partitioning scheme, and this embodiment selects data columns greedily, that is, the data column that produces the smallest information entropy for the current partition is put into that partition. This process continues until the size of the partition reaches the upper limit, at which point the first data partition has been generated. After that, this embodiment repeats the above process until all the data columns are allocated to a data partition.
  • Step 203 Approximate segmentation algorithm.
  • this embodiment uses a greedy strategy, that is, iteratively exchanges the two data columns with the largest difference between the current approximate query effects.
  • data columns in two data partitions are randomly selected and exchanged. After that, the query set is run as approximate queries on the exchanged data set and the cost $C_{\text{workload}}$ is calculated. The two data columns that give the smallest $C_{\text{workload}}$ are selected for exchange. This process is repeated until the $C_{\text{workload}}$ computed in the current iteration is no smaller than the smallest $C_{\text{workload}}$ of the previous iteration; the algorithm then stops and outputs the final data partitioning scheme.
  • Step 30 Create an index structure for the newly generated data.
  • the index structure consists of two parts: histogram and inverted hash index.
  • Hist(p,t) represents the number of data in the data set that has a Hamming distance of t from the d-bit segmented data p.
  • the inverted hash index uses the values in all data partitions as hash values, and the recorded ID is added to its inverted table.
  • Step 40 Query optimization. To process queries with the new pigeonhole principle, a key issue is how thresholds are assigned to each partition.
  • Step 401 In order to better optimize the query threshold allocation, this embodiment designs an approximate query cost model as follows:
  • $C_{\text{sig\_gen}}(q, T)$, $C_{\text{cand\_gen}}(q, T)$ and $C_{\text{verify}}(q, T)$ respectively represent the cost of signature generation, candidate set generation and verification.
  • Signature generation means that the query data generates all possible hash values to be checked according to the Hamming threshold.
  • Candidate set generation refers to looking up the hash values generated from the query data in the inverted table of the index structure, extracting the corresponding records, and obtaining the candidate set after deduplication.
  • Verification refers to using the Hamming distance function to calculate the Hamming distance value with the query for each record in the candidate set, and output the final result by comparing the given threshold value.
  • the cost of signature generation is usually much less than the cost of candidate set generation and verification, because the time complexity of signature generation is limited by the size and threshold of the query. Therefore, in this embodiment, the influence of signature generation can be ignored in the process of query optimization.
  • $c_{\text{access}}$ is the cost of querying an element in the inverted table
  • $c_{\text{verify}}$ is the cost of verifying whether the Hamming distance of the two vectors is less than or equal to a given threshold. Both of these parameters are preset constants.
  • the threshold allocation can be formally transformed into an optimization problem: given a data set, a query q and a threshold $\tau$, find a threshold vector T that minimizes the approximate query cost.
  • Step 402 Threshold allocation method. Because $c_{\text{access}}$, $c_{\text{verify}}$ and the remaining coefficient are independent of $CN(q_i, \tau_i)$, this embodiment can ignore the constant factor that multiplies the candidate counts in the above formula and obtain the optimal threshold allocation from the candidate counts alone.
  • $CN(q_i, \tau_i)$ is regarded as a black box with a time complexity of O(1), and a threshold allocation algorithm based on dynamic programming is proposed.
  • a dynamic programming algorithm is designed to realize the threshold allocation.
  • first, initialize the cost of the first partition, namely $OPT[1, -1], \ldots, OPT[1, \tau]$.
  • a negative threshold of -1 is also allowed for a partition, so that the threshold budget can be assigned to other partitions.
  • the path to $OPT[m, \tau - m + 1]$ is traced back, and the final threshold allocation vector is obtained.
  • the time complexity of the entire dynamic programming algorithm is $O(m \cdot (\tau + 1)^2)$.
  • Step 50 Extract the candidate set from the inverted list, remove duplicates and verify one by one.
  • Example: a column partition of the query data is 001 with threshold t = 1. Enumerating all possible query hash values gives 001, 101, 011 and 000.
  • After all the hash values are enumerated, each is looked up as a key in the pre-established inverted index, and the corresponding inverted table is extracted. After all the inverted tables are extracted, duplicates are removed and the Hamming distance between each remaining record and the query is calculated one by one. If the calculated value is less than or equal to the given threshold, the record is returned as one of the results.
  • the embodiments of the present invention provide a series of hash mapping functions, such as SimHash and deep neural network models, which map the records in the database and the query data into binary vectors;
  • an efficient online query optimization method based on the general pigeonhole principle is designed to allocate thresholds, so that the allocation scheme is optimal; and an offline data column partitioning method is designed to solve the partition selectivity problem caused by data skew and dimensional correlation.
  • an embodiment of the present invention provides a storage medium in which multiple instructions are stored, and the instructions can be loaded by a processor to execute the steps in the Hamming space approximate query method provided by the embodiment of the present invention.
  • the storage medium may include: read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Hamming space-based approximate query method and a storage medium. The Hamming space-based approximate query method comprises the steps of: mapping all records and query data in an original database into hash binary vectors in a Hamming space to obtain a hash database; performing column reordering on binary data in the hash database; establishing an index structure for data newly generated after column reordering, the index structure comprising a histogram and an inverted hash index; and parsing the query and allocating a corresponding query threshold for each data partition. According to the method, the skew of the data is exploited, and threshold allocation is performed according to the skew, so as to filter out a large amount of non-result data; a histogram index structure and an inverted hash index structure are used, dimension reordering is performed according to the skew of the data, and highly skewed data columns are put together, so as to utilize the skew of the data more effectively, thereby improving approximate query efficiency.

Description

An approximate query method for Hamming space and storage medium
Technical field
The invention relates to the field of approximate database queries, and in particular to an approximate query method for Hamming space and a storage medium.
Background
Approximate query over basic data types, such as strings and sets, is a fundamental problem in the database field and has been studied for many years. However, approximate query and approximate semantic query over more complex data types have not yet achieved good results in the database field.
Because binary data is simple and easy to query, the combination of hash mapping functions and Hamming approximate query has played a key role in many applications, such as web search, image search, and scientific and technical databases.
For near-duplicate detection over hundreds of millions of web pages, Google uses SimHash as the hash mapping function to map each web page to a 64-dimensional binary vector, and a Hamming approximate query then finds all approximately matching pages. In large-scale image search, a deep neural network model serves as the hash mapping function that maps images into high-dimensional binary vectors, and a Hamming approximate query efficiently returns images similar to the query image. In biomedicine, Hamming approximate query can be used to find similar molecular structures: a hash mapping function converts molecules into high-dimensional binary vectors, and the molecules that satisfy the Hamming threshold are returned. In approximate semantic search over natural language, neural network models such as autoencoders, attention models, and LSTMs are wrapped as hash mapping functions that map text into high-dimensional binary vectors, and only the database records with a small Hamming distance are returned as results.
However, current Hamming space approximate query methods have two main shortcomings: 1. The filtering bound they derive for Hamming queries is not tight, which leads to larger thresholds and, in turn, poor running efficiency. 2. They allocate thresholds uniformly across data partitions. The data is assumed to be uniformly distributed, but in practice it is often skewed. We found that much real-world data is skewed to some degree and that complex correlations exist between columns. Uniform threshold allocation therefore performs poorly on many data sets, and ignoring data skew leads to inefficient queries.
Summary of the invention
The purpose of the present invention is to provide an approximate query method for Hamming space and a storage medium that overcome the above shortcomings.
To this end, the present invention adopts the following technical solutions:
An approximate query method for Hamming space, comprising the steps of:
mapping all records in the original database, together with the query data, into hash binary vectors in Hamming space to obtain a hash database;
reordering the columns of the binary data in the hash database;
building an index structure on the data produced by column reordering, the index structure comprising a histogram and an inverted hash index;
parsing the query and assigning a corresponding query threshold to each data partition.
Optionally, the hash database is obtained as follows: for each record and for the query data, detect the type and structure of the data; according to that type and structure, select the corresponding hash mapping function from the set of hash functions; and use the selected hash mapping function to map the input data into a hash binary vector.
Optionally, the column reordering of the binary data in the hash database is performed as follows: design a cost model for column reordering; compute an initial column partitioning of the binary data; and, after the initial column partitioning, perform approximate partitioning.
Optionally, the initial column partitioning of the binary data is computed as follows: initialize an empty data partition; select a data column and, if that column yields the minimum information entropy for the current data partition, put it into the current data partition; select the next data column and repeat the same process until the size of the current data partition reaches the upper limit, which produces the first data partition; then repeat the partitioning process until every data column has been assigned to a data partition.
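As an illustration only, the greedy initialization described above might be sketched in Python as follows. The sketch assumes that the entropy of a partition is the Shannon entropy of the rows projected onto the columns of that partition, and that the size limit is supplied by the caller as `max_size`; the function names and the toy data are hypothetical and are not part of the disclosure.

```python
import math
from collections import Counter


def partition_entropy(rows, cols):
    """Shannon entropy (in bits) of the rows projected onto the given columns."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def greedy_initial_partition(rows, max_size):
    """Greedily build column partitions, each holding at most max_size columns."""
    remaining = list(range(len(rows[0])))
    partitions = []
    while remaining:
        current = []
        while remaining and len(current) < max_size:
            # Pick the column that keeps the entropy of the current partition lowest.
            best = min(remaining, key=lambda c: partition_entropy(rows, current + [c]))
            current.append(best)
            remaining.remove(best)
        partitions.append(current)
    return partitions


# Toy data: columns 0 and 2 are constant (highly skewed), columns 1 and 3 vary.
rows = [(0, 0, 0, 1), (0, 1, 0, 0), (0, 0, 0, 1), (0, 1, 0, 0)]
partitions = greedy_initial_partition(rows, max_size=2)
print(partitions)
```

On this toy input the two constant columns end up in the same partition, which is the kind of grouping of skewed columns that the method aims for.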
Optionally, the approximate partitioning is performed by iteratively swapping the two data columns whose current approximate query performance differs the most.
Optionally, the query threshold is assigned to each data partition as follows:
design a cost model for threshold allocation;
according to the cost model for threshold allocation, use a dynamic programming algorithm to allocate the query thresholds.
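A minimal sketch of such a dynamic programming allocation is given below, under stated assumptions. It assumes that each of the m partitions receives a threshold between -1 and tau, that the thresholds must sum to tau - m + 1, and that the cost of an allocation is the sum of per-partition candidate counts returned by a black-box function `cn(i, t)`. The function name `allocate_thresholds` and the toy cost table are hypothetical; in the method itself the candidate counts would come from the index structure.

```python
def allocate_thresholds(cn, m, tau):
    """Return (thresholds, cost) minimizing sum(cn(i, T[i])) with sum(T) == tau - m + 1."""
    target = tau - m + 1
    inf = float("inf")
    # opt[i][s]: minimum cost for partitions 0..i whose thresholds sum to s.
    opt = [dict() for _ in range(m)]
    choice = [dict() for _ in range(m)]
    for t in range(-1, tau + 1):
        opt[0][t] = cn(0, t)
        choice[0][t] = t
    for i in range(1, m):
        for s_prev, cost_prev in opt[i - 1].items():
            for t in range(-1, tau + 1):
                s = s_prev + t
                if s > tau:  # such a partial sum can never come back down to the target
                    break
                cost = cost_prev + cn(i, t)
                if cost < opt[i].get(s, inf):
                    opt[i][s] = cost
                    choice[i][s] = t
    # Trace the chosen thresholds back from the final state.
    thresholds = [0] * m
    s = target
    for i in range(m - 1, -1, -1):
        thresholds[i] = choice[i][s]
        s -= thresholds[i]
    return thresholds, opt[m - 1][target]


# Hypothetical candidate counts: partition 0 is cheap only for tiny thresholds.
table = [{-1: 0, 0: 1, 1: 10, 2: 50}, {-1: 0, 0: 5, 1: 6, 2: 8}]
thresholds, cost = allocate_thresholds(lambda i, t: table[i][t], m=2, tau=2)
print(thresholds, cost)
```

With these toy counts the allocation gives the smaller threshold to the partition whose candidate count grows fastest, which is the intended effect of allocating thresholds according to skew.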
Optionally, the approximate query method for Hamming space further comprises: extracting a candidate set according to the index structure, and verifying the candidates one by one to obtain the final result.
Optionally, the candidate set is extracted according to the index structure and verified one by one as follows:
for each column partition of the query and its assigned query threshold, enumerate all possible hash values; for each hash value, look up the corresponding key in the pre-built inverted index and extract the corresponding inverted list; after all the inverted lists have been extracted, remove duplicates and compute the Hamming distance between each remaining record and the query one by one. If the computed value is less than or equal to the given threshold, the record is returned as one of the results.
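The candidate extraction and verification steps can be sketched as follows. This is an illustrative Python sketch, not the disclosed implementation: it assumes records are bit strings whose columns have already been reordered so that each partition is a contiguous slice of the given width, it omits the histogram, and all names (`enumerate_signatures`, `build_index`, `hamming_query`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_signatures(value, width, t):
    """All width-bit values within Hamming distance t of value (none if t < 0)."""
    signatures = []
    for k in range(t + 1):
        for positions in combinations(range(width), k):
            v = value
            for p in positions:
                v ^= 1 << p  # flip the chosen bit
            signatures.append(v)
    return signatures


def split(record, widths):
    """Cut a bit string into consecutive partitions and parse each as an integer."""
    parts, pos = [], 0
    for w in widths:
        parts.append(int(record[pos:pos + w], 2))
        pos += w
    return parts


def build_index(records, widths):
    """One inverted table per partition: partition value -> list of record ids."""
    index = [defaultdict(list) for _ in widths]
    for rid, record in enumerate(records):
        for i, part in enumerate(split(record, widths)):
            index[i][part].append(rid)
    return index


def hamming_query(records, index, widths, query, thresholds, tau):
    """Collect candidates per partition, deduplicate, then verify against tau."""
    candidates = set()
    for i, (part, w) in enumerate(zip(split(query, widths), widths)):
        for sig in enumerate_signatures(part, w, thresholds[i]):
            candidates.update(index[i].get(sig, ()))
    return sorted(rid for rid in candidates
                  if sum(a != b for a, b in zip(records[rid], query)) <= tau)


records = ["001110", "101010", "111111", "001011"]
widths = [3, 3]
index = build_index(records, widths)
# Thresholds [0, 0] sum to tau - m + 1 = 0 for tau = 1 and m = 2 (assumed budget).
result = hamming_query(records, index, widths, "001010", thresholds=[0, 0], tau=1)
print(result)
```

Deduplicating with a set before verification means each record is compared against the query at most once, even when it appears in several inverted lists.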
A storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the approximate query method for Hamming space according to any one of the above.
Compared with the prior art, the beneficial effects of the present invention are:
1) Embodiments of the present invention query efficiently over data sets with different degrees of skew. For highly skewed data sets, such as biomolecular data sets, most existing methods lose their filtering ability and can only scan and verify the records in the data set one by one to obtain the result. The new pigeonhole principle proposed in the embodiments makes good use of data skew and allocates thresholds according to it, thereby filtering out a large amount of non-result data.
2) Embodiments of the present invention allocate thresholds according to data skew and use a dynamic programming algorithm to optimize the candidate set, so as to achieve the best filtering effect.
3) The Hamming approximate query is based on the histogram index structure, so that only the data relevant to the query is extracted, which makes the query efficient.
4) Dimensions are reordered according to the skew of the data, and highly skewed data columns are placed together, which uses the skew of the data more effectively and improves the efficiency of approximate queries.
Description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative work.
Fig. 1 is a logical block diagram of a Hamming approximate query method provided by an embodiment of the present invention.
Fig. 2 is a flowchart of a Hamming approximate query method provided by an embodiment of the present invention.
Detailed description
The present invention aims to support approximate queries over multiple data types: given a query input, find all records in the database whose mapped vectors lie within a given Hamming distance threshold of the mapped query vector. To this end, the present invention is divided into two major steps: 1. Map the database records and the query into Hamming space with a given mapping function. 2. Run a Hamming approximate query over the mapped data set.
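For reference, the query semantics stated above can be written as a plain linear scan. This Python sketch is only a baseline that defines what a correct answer is; it is not the indexed method of the invention, and it assumes that each mapped vector is stored as an integer bit pattern. The names `hamming_distance` and `linear_scan` are illustrative.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def linear_scan(database, query, tau):
    """Indices of all records within Hamming distance tau of the query."""
    return [i for i, record in enumerate(database)
            if hamming_distance(record, query) <= tau]


database = [0b0010, 0b0111, 0b1101, 0b0011]
matches = linear_scan(database, 0b0011, tau=1)
print(matches)
```

Any index-based method should return exactly the same set of records as this scan, only with fewer distance computations.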
为使得本发明的发明目的、特征、优点能够更加的明显和易懂,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,下面所描述的实施例仅仅是本发明一部分实施例,而非全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本发明保护的范围。In order to make the objectives, features, and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the following The described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
本发明实施例提供了一种基于鸽巢原理的海明近似查询方法,请结合图1和图2所示,该方法包括:The embodiment of the present invention provides a Hamming approximate query method based on the pigeon nest principle. As shown in Figure 1 and Figure 2, the method includes:
Step 10: Hash mapping function.
The hash mapping function maps data of any type to a binary hash vector. This step has two sub-steps: data type detection and hash generation. Because different data types call for different hash mapping methods, data type detection determines the type and structure of the input data so that it can be dispatched to a suitable hash generation module. The hash generation module is a collection of hash functions, such as SimHash, MinHash, LSTM, convolutional neural network models, autoencoder models, and so on. Its purpose is to map the input data to a vector in Hamming space.
Step 10 specifically includes: Step 101, data type detection; Step 102, hash generation.
The hash mapping module maps all records in the database, as well as the query data, to binary hash vectors.
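Steps 101 and 102 can be illustrated with a small dispatcher. This is a sketch under assumptions: only strings are handled, tokens are whitespace-separated words, and MD5 serves as the per-token hash. The names `simhash` and `to_hash_code` are illustrative; a real system would register further generators (MinHash, learned models) for other data types.

```python
import hashlib


def simhash(tokens: list[str], n_bits: int = 64) -> int:
    """Classic SimHash: each token votes +1/-1 on every bit; the sign gives the code."""
    if not 1 <= n_bits <= 64:
        raise ValueError("this sketch draws at most 64 bits per token")
    votes = [0] * n_bits
    for tok in tokens:
        # First 8 bytes of MD5 give a stable 64-bit token hash (an assumption).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for b in range(n_bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(n_bits) if votes[b] > 0)


def to_hash_code(value: object, n_bits: int = 64) -> int:
    """Step 101: detect the type. Step 102: call the matching hash generator."""
    if isinstance(value, str):
        return simhash(value.split(), n_bits)
    raise TypeError(f"no hash generator registered for {type(value).__name__}")
```

Records and queries must go through the same generator so that their codes are comparable in Hamming space.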
Step 20: Reorder the columns of the binary data in the hash database.
To cope with data skew and with correlation between dimensions, existing methods reduce skew by random shuffling or by reducing the correlation within column clusters. Their goal is to make the dimensions of every partition as uniformly distributed as possible, so that the threshold assigned to each partition does not admit a large number of candidates. In contrast, the present invention seeks to increase the skew of each column partition, so that Hamming threshold allocation becomes more effective.
To this end, this embodiment designs a cost model for column reordering and casts the problem as an optimization problem based on the performance of query processing.
Step 201: Design a cost model based on column reordering.
A query workload Q = {<q_1, τ_1>, <q_2, τ_2>, ..., <q_|Q|, τ_|Q|>} is designed in advance, where each entry pairs a query with its threshold. For a partitioning P that splits the columns of the data set into m partitions, the cost model for query optimization is as follows:
Figure PCTCN2019122454-appb-000001
The right-hand side is the sum of the actual costs of running the approximate query for every query and threshold in the workload. The computation of the query cost is set aside here and detailed in a later step. With this formula, the task can be stated as an optimization problem: given a binary data set D and a query workload Q, find a column partitioning P that minimizes the query cost, that is:
Figure PCTCN2019122454-appb-000002
The column partitioning optimization problem is NP-hard.
Step 202: Initialize the column partitioning.
Because column partitioning is NP-hard, this embodiment treats it in two steps: initializing the column partitioning, and an approximate partitioning algorithm. The approximate algorithm can only reach a local optimum, so a good initialization is essential to its effectiveness.
Correlation between columns plays a key role in the initial column partitioning. Unlike earlier methods, which partition the columns so that the data are as uniformly distributed as possible, the method of this embodiment pursues the opposite goal. We observe that placing highly correlated columns in the same partition usually improves approximate query performance. This is because the Hamming threshold allocation of this embodiment optimizes each query online and works better on highly skewed data. When highly correlated columns are grouped together, more errors are identified in the same partition, so the threshold allocation assigns a larger threshold to that partition and smaller thresholds to the others. In other words, this embodiment assigns a suitable threshold to each partition. If the columns were uniformly distributed, all partitions would share the same distribution, which makes it hard to optimize for highly skewed partitions.
This embodiment uses information entropy to measure the correlation among data columns. For a data partition P_i, let
Figure PCTCN2019122454-appb-000003
denote the set of data columns of P_i. The correlation of P_i is then measured by the following formula:
Figure PCTCN2019122454-appb-000004
According to this formula, a smaller entropy value indicates stronger correlation within the partition. The entropy of an entire partitioning scheme P is the sum of the entropies of all its partitions, that is:
Figure PCTCN2019122454-appb-000005
The goal is to find an initial partitioning scheme P that minimizes H(P). To achieve this, this embodiment uses an equal-size greedy method. P starts as an empty partitioning scheme, and data columns are selected greedily: the column that yields the smallest entropy for the current partition is added to that partition. This continues until the size of the partition reaches its upper limit, namely
Figure PCTCN2019122454-appb-000006
at which point the first data partition is complete. The process is then repeated until every data column has been assigned to a partition.
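The greedy initialization can be sketched as follows. The entropy formula and the size limit are given only as images in this text, so the sketch makes two labeled assumptions: the entropy of a partition is the joint Shannon entropy of the rows projected onto its columns, and the size limit is ceil(n / m) for n columns and m partitions. Function names are illustrative.

```python
import math
from collections import Counter


def partition_entropy(data: list[list[int]], cols: list[int]) -> float:
    """Joint Shannon entropy (in bits) of the rows projected onto `cols` (assumed measure)."""
    n = len(data)
    counts = Counter(tuple(row[c] for c in cols) for row in data)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def greedy_initial_partitioning(data: list[list[int]], m: int) -> list[list[int]]:
    """Fill partitions one at a time, always adding the column that keeps entropy lowest."""
    n_cols = len(data[0])
    cap = math.ceil(n_cols / m)  # assumed upper limit on partition size
    remaining = list(range(n_cols))
    partitions: list[list[int]] = []
    while remaining:
        current: list[int] = []
        while remaining and len(current) < cap:
            # Ties are broken in favor of the lowest column index.
            best = min(remaining, key=lambda c: partition_entropy(data, current + [c]))
            current.append(best)
            remaining.remove(best)
        partitions.append(current)
    return partitions
```

For example, with four columns where column 0 duplicates column 2 and column 1 duplicates column 3, the duplicated (fully correlated) columns end up in the same partition. When m does not divide the number of columns, this sketch may produce fewer than m partitions.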
Step 203: Approximate partitioning algorithm.
Once the initial partitioning has been obtained, it is refined using the query workload. This embodiment uses a greedy strategy: iteratively swap the two data columns that currently show the largest gap in approximate query performance.
In each iteration, data columns from two partitions are selected at random and tentatively swapped. The query workload is then run against the swapped data set to compute the cost C_workload. The swap of the two columns that yields the smallest C_workload is kept. This process repeats until the smallest C_workload computed in the current iteration is larger than that of the previous iteration, at which point the algorithm stops and outputs the final partitioning scheme.
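A minimal sketch of this refinement loop is shown below. It assumes the workload cost is available as a callable `workload_cost` (hypothetical; in the patent it is obtained by running the query workload), samples `n_trials` random cross-partition swaps per iteration, and stops when no sampled swap lowers the cost. Stopping on "not lower" rather than "strictly higher" is a choice made here to guarantee termination.

```python
import random
from typing import Callable

Partitioning = list[list[int]]


def refine_by_swaps(
    partitions: Partitioning,
    workload_cost: Callable[[Partitioning], float],
    n_trials: int = 20,
    seed: int = 0,
) -> tuple[Partitioning, float]:
    """Greedy local search: keep the best sampled column swap while it lowers the cost."""
    rng = random.Random(seed)
    best = [list(p) for p in partitions]  # work on a copy
    best_cost = workload_cost(best)
    while len(best) >= 2:
        trial_best, trial_cost = None, float("inf")
        for _ in range(n_trials):
            a, b = rng.sample(range(len(best)), 2)  # two distinct partitions
            i, j = rng.randrange(len(best[a])), rng.randrange(len(best[b]))
            cand = [list(p) for p in best]
            cand[a][i], cand[b][j] = cand[b][j], cand[a][i]
            cost = workload_cost(cand)
            if cost < trial_cost:
                trial_best, trial_cost = cand, cost
        if trial_best is None or trial_cost >= best_cost:
            break  # no improvement in this iteration: local optimum reached
        best, best_cost = trial_best, trial_cost
    return best, best_cost
```

Because every accepted swap strictly lowers the cost, the loop always terminates, but only at a local optimum, which is why the quality of the initial partitioning matters.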
Step 30: Build an index structure over the newly generated data. The index structure has two parts: a histogram and an inverted hash index.
The histogram collects statistics about the current data. For each partition of width d, it enumerates all binary values, that is, 2^d values, and d+1 thresholds, that is, 0, 1, 2, ..., d. Hist(p, t) denotes the number of records in the data set whose value in that partition is at Hamming distance t from the d-bit value p.
The inverted hash index uses the value of each record in each partition as a hash key and adds the record's ID to the posting list of that key.
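Both index components can be sketched in a few lines. The sketch assumes records are bit strings such as "0011" and a partitioning is a list of column-index lists; `build_index` and its return layout are illustrative, not the patent's data structures. The histogram is built by brute force over all 2^d values, which is only practical for small partition widths.

```python
from collections import defaultdict
from itertools import product


def project(code: str, cols: list[int]) -> str:
    """Value of a record restricted to the columns of one partition."""
    return "".join(code[c] for c in cols)


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def build_index(codes: list[str], partitions: list[list[int]]):
    """Return (inverted, hist): one posting-list dict and one histogram per partition."""
    inverted, hist = [], []
    for cols in partitions:
        postings: dict[str, list[int]] = defaultdict(list)
        for rec_id, code in enumerate(codes):
            postings[project(code, cols)].append(rec_id)
        d = len(cols)
        h = {}
        for bits in product("01", repeat=d):  # all 2^d values of this partition
            p = "".join(bits)
            row = [0] * (d + 1)  # row[t] plays the role of Hist(p, t)
            for key, ids in postings.items():
                row[hamming(p, key)] += len(ids)
            h[p] = row
        inverted.append(dict(postings))
        hist.append(h)
    return inverted, hist
```

Summing `row[0]` through `row[e]` gives the number of posting-list entries a lookup with threshold e would touch, which is the kind of statistic the query optimizer needs.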
Step 40: Query optimization. To process queries with the new pigeonhole principle, a key question is how to assign thresholds to the individual partitions.
Step 401: To better optimize the allocation of query thresholds, this embodiment designs the following cost model for an approximate query:
C_query_proc(q, T) = C_sig_gen(q, T) + C_cand_gen(q, T) + C_verify(q, T)
where C_sig_gen(q, T), C_cand_gen(q, T), and C_verify(q, T) denote the costs of signature generation, candidate generation, and verification, respectively.
Signature generation means that, for the query, all hash values that must be looked up are generated according to the Hamming thresholds. Candidate generation means that these hash values are looked up in the inverted index, the matching posting lists are retrieved, and the candidate set is obtained after deduplication. Verification means that, for each record in the candidate set, the Hamming distance to the query is computed and compared with the given threshold to produce the final result.
In practice, the cost of signature generation is usually far smaller than the costs of candidate generation and verification, because its time complexity is bounded by the size of the query and the threshold. Its effect can therefore be ignored during query optimization.
Let CN(q_i, τ_i) denote the number of candidates produced in the i-th partition for the query and the threshold currently assigned to that partition. Then, for a query and its threshold allocation, the total length of the retrieved posting lists is
Figure PCTCN2019122454-appb-000007
Assume that the total length of the posting lists is proportional to the size of the candidate set after deduplication, that is,
Figure PCTCN2019122454-appb-000008
The cost model of the approximate query is therefore estimated as follows:
Figure PCTCN2019122454-appb-000009
Here c_access is the cost of accessing one element of a posting list, and c_verify is the cost of checking whether the Hamming distance between two vectors is less than or equal to a given threshold. Both are preset constants.
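The formula images in this step are not reproduced in this text. One reading that is consistent with the surrounding definitions, and with the factor (c_access + α·c_verify) that Step 402 refers to, is sketched below. It is a reconstruction under assumptions, not the published formulas: S_cand (the deduplicated candidate set) is a name introduced here, α is taken to be the proportionality constant, and the constraint on T is inferred from the later references to the threshold -1 and to OPT[m, τ-m+1].

```latex
\begin{align}
C_{cand\_gen}(q, T) &= c_{access} \cdot \sum_{i=1}^{m} CN(q_i, \tau_i) \\
|S_{cand}| &= \alpha \cdot \sum_{i=1}^{m} CN(q_i, \tau_i) \\
C_{verify}(q, T) &= c_{verify} \cdot |S_{cand}| \\
C_{query\_proc}(q, T) &\approx (c_{access} + \alpha \cdot c_{verify}) \cdot \sum_{i=1}^{m} CN(q_i, \tau_i) \\
T^{*} &= \arg\min_{T} \sum_{i=1}^{m} CN(q_i, \tau_i)
\quad \text{s.t.} \quad \sum_{i=1}^{m} \tau_i = \tau - m + 1,\ \ \tau_i \in \{-1, 0, \dots, \tau\}
\end{align}
```

Under this reading, the signature generation term is dropped, as the text allows, and minimizing the total cost reduces to minimizing the sum of per-partition candidate counts.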
With this formula, threshold allocation can be stated formally as an optimization problem: given a data set, a query q, and a threshold τ, find a threshold vector T that minimizes the approximate query cost, that is:
Figure PCTCN2019122454-appb-000010
Step 402: Threshold allocation method. Because c_access, c_verify, and α are all independent of CN(q_i, τ_i), the factor (c_access + α·c_verify) in the above formula can be ignored, and the optimal threshold allocation is obtained according to
Figure PCTCN2019122454-appb-000011
Treating CN(q_i, τ_i) as a black box with O(1) time complexity, this embodiment proposes a threshold allocation algorithm based on dynamic programming.
Let OPT[i, t] record the minimum approximate query cost for partitions 1 through i under the current threshold t. The following recurrence holds:
Figure PCTCN2019122454-appb-000012
With this recurrence, a dynamic programming algorithm performs the threshold allocation. First, the costs for the first partition are initialized, that is, OPT[1, -1], ..., OPT[1, τ]. The recurrence is then used to compute the minimum value of each OPT[i, t]. The negative threshold -1 is also considered for assignment to the other partitions. Finally, the path leading to OPT[m, τ-m+1] is traced back to obtain the final threshold allocation vector. The time complexity of the whole dynamic programming algorithm is O(m·(τ+1)^2).
Step 50: Retrieve the candidate set from the posting lists, deduplicate it, and verify the candidates one by one.
Once the Hamming threshold allocation has been determined, for each column partition q_i of the query and its assigned threshold t, all possible hash values are enumerated, that is:
Figure PCTCN2019122454-appb-000013
Example: one column partition of the query is 001 and t = 1. Enumerating all possible query hash values gives 001, 101, 011, and 000.
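The enumeration in this example can be reproduced with a short sketch. It assumes partition values are bit strings and flips every subset of at most t positions; `enumerate_signatures` is an illustrative name.

```python
from itertools import combinations


def enumerate_signatures(part: str, t: int) -> list[str]:
    """All bit strings within Hamming distance t of `part` (none if t is negative)."""
    out = []
    for k in range(min(t, len(part)) + 1):  # k = number of flipped bits
        for positions in combinations(range(len(part)), k):
            bits = list(part)
            for p in positions:
                bits[p] = "1" if bits[p] == "0" else "0"
            out.append("".join(bits))
    return out
```

Calling `enumerate_signatures("001", 1)` yields `["001", "101", "011", "000"]`, the four values listed in the example. A threshold of -1 yields no signatures, so that partition is not probed at all.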
After all hash values have been enumerated, each is looked up as a key in the pre-built inverted index and its posting list is retrieved. Once all posting lists have been retrieved, the records are deduplicated and the Hamming distance between each record and the query is computed. A record whose distance is less than or equal to the given threshold is returned as a result.
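Putting lookup, deduplication, and verification together gives the following sketch. It assumes bit-string records, one posting-list dictionary per partition keyed by the partition value, and a threshold vector that has already been allocated; all names are illustrative.

```python
from itertools import combinations


def signatures(part: str, t: int) -> list[str]:
    """All bit strings within Hamming distance t of `part` (none if t is negative)."""
    out = []
    for k in range(min(t, len(part)) + 1):
        for positions in combinations(range(len(part)), k):
            bits = list(part)
            for p in positions:
                bits[p] = "1" if bits[p] == "0" else "0"
            out.append("".join(bits))
    return out


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def answer_query(codes, partitions, inverted, q, thresholds, tau):
    """Probe each partition with its own threshold, then verify the merged candidates."""
    candidates: set[int] = set()  # a set removes duplicate record IDs
    for cols, postings, t in zip(partitions, inverted, thresholds):
        q_part = "".join(q[c] for c in cols)
        for sig in signatures(q_part, t):
            candidates.update(postings.get(sig, ()))
    # Verification against the full code and the overall threshold tau.
    return sorted(i for i in candidates if hamming(codes[i], q) <= tau)
```

With two partitions and τ = 1, the thresholds must sum to 0, so both `[0, 0]` and `[1, -1]` are valid allocations; they probe different keys but return the same verified result.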
In summary, the embodiments of the present invention provide a family of hash mapping functions, such as SimHash and deep neural network models, that map the database records and the query data to binary vectors; propose a more general pigeonhole principle that yields tighter filtering conditions and a more flexible threshold allocation method; design an efficient online query optimization method, based on the general pigeonhole principle, that allocates thresholds so that the allocation is optimal; and design an offline data column partitioning method that addresses the partition selectivity problem caused by data skew and correlation between dimensions.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by instructions, or by instructions controlling the relevant hardware. The instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
Accordingly, an embodiment of the present invention provides a storage medium storing a plurality of instructions that can be loaded by a processor to perform the steps of the Hamming space approximate query method provided by the embodiments of the present invention.
The storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

  1. A Hamming space approximate query method, characterized by comprising the steps of:
    mapping all records in an original database, together with the query data, to binary hash vectors in Hamming space to obtain a hash database;
    reordering the columns of the binary data in the hash database;
    building an index structure over the data newly generated by the column reordering, the index structure comprising a histogram and an inverted hash index; and
    parsing the query and assigning a corresponding query threshold to each data partition.
  2. The Hamming space approximate query method according to claim 1, characterized in that the hash database is obtained by: detecting, for each record and for the query data, the type and structure of the current data; selecting a corresponding hash mapping function from a set of hash functions according to that type and structure; and mapping the input data to a binary hash vector using the selected hash mapping function.
  3. The Hamming space approximate query method according to claim 1, characterized in that reordering the columns of the binary data in the hash database comprises: designing a cost model based on column reordering; performing an initial column partitioning of the binary data; and, after the initial column partitioning, performing approximate partitioning.
  4. The Hamming space approximate query method according to claim 3, characterized in that the initial column partitioning of the binary data comprises: initializing an empty data partition; selecting a data column and, if that column yields the smallest information entropy for the current data partition, adding it to the current data partition; selecting the next data column and repeating the same processing until the size of the current data partition reaches its upper limit, which produces the first data partition; and then repeating the partitioning process until all data columns have been assigned to data partitions.
  5. The Hamming space approximate query method according to claim 3, characterized in that the approximate partitioning comprises: iteratively swapping the two data columns that currently show the largest gap in approximate query performance.
  6. The Hamming space approximate query method according to claim 1, characterized in that assigning a corresponding query threshold to each data partition comprises:
    designing a cost model based on threshold allocation; and
    allocating the query thresholds with a dynamic programming algorithm according to the cost model based on threshold allocation.
  7. The Hamming space approximate query method according to claim 1, characterized in that the method further comprises: retrieving a candidate set according to the index structure and verifying the candidates one by one to obtain the final result.
  8. The Hamming space approximate query method according to claim 7, characterized in that retrieving the candidate set according to the index structure and verifying the candidates one by one to obtain the final result comprises:
    for each column partition of the query and its assigned query threshold, enumerating all possible hash values; for each hash value, looking up the corresponding key in the pre-built inverted index and retrieving the corresponding posting list; after all posting lists have been retrieved, deduplicating the records and computing the Hamming distance between each record and the query; and returning a record as a result if the computed distance is less than or equal to the given threshold.
  9. A storage medium, characterized in that a computer program is stored on the storage medium and, when executed by a processor, implements the Hamming space approximate query method according to any one of claims 1 to 8.
PCT/CN2019/122454 2019-08-30 2019-12-02 Hamming space-based approximate query method and storage medium WO2021036070A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910813733.XA CN110569244A (en) 2019-08-30 2019-08-30 Hamming space approximate query method and storage medium
CN201910813733.X 2019-08-30

Publications (1)

Publication Number Publication Date
WO2021036070A1 true WO2021036070A1 (en) 2021-03-04

Family

ID=68776965

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122454 WO2021036070A1 (en) 2019-08-30 2019-12-02 Hamming space-based approximate query method and storage medium

Country Status (2)

Country Link
CN (1) CN110569244A (en)
WO (1) WO2021036070A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666279B (en) * 2020-04-14 2022-04-29 阿里巴巴集团控股有限公司 Query data processing method and device, electronic equipment and computer storage medium
CN111538867B (en) * 2020-04-15 2021-06-15 深圳计算科学研究院 Method and system for dividing bounded incremental graph
CN111815403B (en) * 2020-06-19 2024-05-10 北京石油化工学院 Commodity recommendation method and device and terminal equipment
CN112256727B (en) * 2020-10-19 2021-10-15 东北大学 Database query processing and optimizing method based on artificial intelligence technology

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100161614A1 (en) * 2008-12-22 2010-06-24 Electronics And Telecommunications Research Institute Distributed index system and method based on multi-length signature files
CN104731882A (en) * 2015-03-11 2015-06-24 北京航空航天大学 Self-adaptive query method based on Hash code weighting ranking
CN108830333A (en) * 2018-06-22 2018-11-16 河南广播电视大学 A kind of nearest neighbor search method based on three times bit quantization and non symmetrical distance
CN109919084A (en) * 2019-03-06 2019-06-21 南京大学 A kind of pedestrian's recognition methods again more indexing Hash based on depth

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2013129580A1 (en) * 2012-02-28 2015-07-30 公立大学法人大阪府立大学 Approximate nearest neighbor search device, approximate nearest neighbor search method and program thereof
CN104317838B (en) * 2014-10-10 2017-05-17 浙江大学 Cross-media Hash index method based on coupling differential dictionary
CN110046268B (en) * 2016-02-05 2024-04-05 大连大学 High-dimensional space kNN query method based on inverted position sensitive hash index
CN106570166B (en) * 2016-11-07 2019-12-13 北京航空航天大学 Video retrieval method and device based on multiple locality sensitive hash tables
CN109299097B (en) * 2018-09-27 2022-06-21 宁波大学 Online high-dimensional data nearest neighbor query method based on Hash learning
CN109871379B (en) * 2018-12-10 2022-04-01 宁波大学 Online Hash nearest neighbor query method based on data block learning

Also Published As

Publication number Publication date
CN110569244A (en) 2019-12-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19943646

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 090822)

122 Ep: pct application non-entry in european phase

Ref document number: 19943646

Country of ref document: EP

Kind code of ref document: A1