WO2021036070A1

WO2021036070A1 - Hamming space-based approximate query method and storage medium

Info

Publication number: WO2021036070A1
Application number: PCT/CN2019/122454
Authority: WO
Inventors: 秦建斌; 王尧舒
Original assignee: 深圳计算科学研究院
Priority date: 2019-08-30
Filing date: 2019-12-02
Publication date: 2021-03-04
Also published as: CN110569244A

Abstract

A Hamming space-based approximate query method and a storage medium. The Hamming space-based approximate query method comprises the steps of: mapping all records and query data in an original database into hash binary vectors in a Hamming space to obtain a hash database; performing column reordering on binary data in the hash database; establishing an index structure for the data newly generated after column reordering, the index structure comprising a histogram and an inverted hash index; and parsing the query and allocating a corresponding query threshold to each data partition. The method makes good use of the skew of the data and performs threshold allocation according to that skew, so as to filter out a large amount of non-result data. A histogram index structure and an inverted hash index structure are used, dimensions are reordered according to the different skews of the data, and data columns having a large skew are put together, so as to utilize the skew of the data more effectively, thereby improving approximate query efficiency.

Description

An approximate query method and storage medium for Hamming space

Technical field

The invention relates to the field of database approximate query, in particular to a Hamming space approximate query method and storage medium.

Background technique

Approximate query over basic data types, such as strings and sets, is a fundamental problem in the database field and has been studied for many years. However, approximate query and approximate semantic query over more complex data types have not yet achieved good results in the database field.

Due to the simplicity and ease of query of binary data, the combination of hash mapping function and Hamming approximate query has played a key role in many applications, such as web search, image query, and science and technology libraries.

In dealing with the problem of approximate detection in hundreds of millions of web pages, Google uses a SimHash hashing technology as a hash mapping function to map each web page to a 64-dimensional binary vector. Hamming approximate query is used to find all approximate matching web pages. In large-scale image search, the deep neural network model is used as a hash mapping function to map images into high-dimensional binary vectors, and Hamming approximate query can efficiently return images similar to the query image. In the field of biomedicine, Hamming approximate query can be used to find similar molecular structures, in which a hash mapping function converts molecules into high-dimensional binary vectors, and molecules that meet the Hamming threshold are returned. In natural language approximate semantic search, neural network models, such as autoencoders, attention models, LSTMs, etc., are encapsulated into hash mapping functions to map text into high-dimensional binary vectors. Only the records with the smaller Hamming distance in the database are returned as a result.

However, current Hamming space approximate query methods mainly have the following two shortcomings: 1. Existing methods based on Hamming query have a weak lower bound for filtering, which leads to a larger threshold and directly leads to poor operating efficiency. 2. Existing methods distribute thresholds uniformly across data partitions. The data is assumed to be evenly distributed, but in practice much real data is skewed to some degree, and there are complex correlations between columns in the data. Therefore, uniform threshold distribution does not achieve good results on many data sets, and query performance suffers because the skew of the data is ignored.

Summary of the invention

The purpose of the present invention is to provide a Hamming space approximate query method and storage medium to solve the above-mentioned defects.

To achieve this goal, the present invention adopts the following technical solutions:

An approximate query method for Hamming space, including the steps:

Map all records and query data in the original database into a hash binary vector in Hamming space to obtain a hash database;

Perform column reordering on the binary data in the hash database;

Create an index structure for the newly generated data after column reordering, the index structure including a histogram and an inverted hash index;

Analyze the query and assign corresponding query thresholds for each data segmentation.

Optionally, the method for obtaining the hash database is: detecting the type and structure of the current data for each record and query data; according to the type and structure of the current data, selecting the corresponding hash mapping function from the set of hash functions; and, through the selected hash mapping function, mapping the input data into a hash binary vector.

Optionally, the method for column reordering the binary data in the hash database is: designing a cost model based on column reordering; performing initial column division on the binary data; and performing approximate division after initializing the column division.

Optionally, the method for initializing column division of binary data is: initializing an empty data segmentation; selecting a data column, and if the data column produces the minimum information entropy for the current data segmentation, putting it into the current data segmentation; selecting the next data column and repeating the same process until the size of the current data segmentation reaches the upper limit, at which point the first data segmentation is generated; after that, repeating the segmentation process until all data columns are allocated to their corresponding data segmentations.

Optionally, the method for performing approximate segmentation is to iteratively exchange the two data columns with the largest difference between the current approximate query effects.

Optionally, the method for assigning corresponding query thresholds for each data segmentation is:

Design a cost model based on threshold allocation;

According to the cost model based on threshold allocation, a dynamic programming algorithm is used for query threshold allocation.

Optionally, the Hamming space approximate query method further includes: extracting a candidate set according to an index structure, and verifying one by one to obtain a final result.

Optionally, the method of extracting a candidate set according to the index structure and verifying one by one to obtain the final result is:

For each column segmentation of the query and its corresponding assigned query threshold, enumerate all possible hash values; for each hash value, find the corresponding key value in the pre-established inverted index and extract the corresponding inverted table; after all the inverted tables are extracted, remove the duplicates and use the Hamming distance formula to calculate, one by one, the Hamming distance between each remaining record and the query. If the calculated value is less than or equal to the given threshold, the record is returned as one of the results.

A storage medium in which a computer program is stored, and when the computer program is executed by a processor, the Hamming space approximate query method as described in any one of the above is implemented.

Compared with the prior art, the beneficial effects of the present invention are:

1) The embodiment of the present invention can effectively handle data sets with different degrees of skew and provides efficient query capability. For highly skewed data sets in particular, such as biomolecular data sets, most existing methods lose their filtering ability and can only scan and verify the records in the data set one by one to obtain the result. The novel pigeonhole principle proposed by the embodiment of the present invention makes good use of the skew of the data and performs threshold allocation according to the skew, thereby filtering out a large amount of non-result data.

2) The embodiment of the present invention effectively performs threshold allocation based on data skew, and uses a dynamic programming algorithm to optimize the candidate set, so as to achieve the best filtering effect.

3) The Hamming approximate query is based on the histogram index structure, so that only the data related to the query is extracted, which achieves efficient querying.

4) The dimensions are reordered according to the different skews of the data, and the data columns with large skew are put together, so as to make more effective use of the skew of the data and improve the efficiency of approximate query.

Description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only some embodiments of the present invention. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

Fig. 1 is a logical block diagram of a Hamming approximate query method provided by an embodiment of the present invention.

Fig. 2 is a flowchart of a Hamming approximate query method provided by an embodiment of the present invention.

Detailed description

The present invention aims to realize approximate query over multiple data types: given a query input, find all records in the database whose mapped vectors are within a given Hamming distance threshold of the mapped query vector. To this end, the present invention is divided into two major steps: 1. Map the data in the database and the query to the Hamming space with a given mapping function. 2. Perform a Hamming approximate query on the data set in the Hamming space.

In order to make the objectives, features, and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the embodiments described below are only a part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

The embodiment of the present invention provides a Hamming approximate query method based on the pigeonhole principle. As shown in Figure 1 and Figure 2, the method includes:

Step 10: Hash mapping function.

The hash mapping function realizes the mapping of any data type into a hash binary vector. This step is divided into two sub-steps: data type detection and hash generation. Since different data types use different hash mapping methods, data type detection aims to detect the type and structure of the input data, and then assign it to a hash generation module suitable for it. The hash generation module is a collection of a series of hash functions, such as SimHash, MinHash, LSTM, convolutional neural network model, autoencoder model and so on. Its purpose is to map input data to vectors in Hamming space.

This step 10 specifically includes: step 101, data type detection; step 102, hash generation.

The hash mapping module is designed to map all records and query data in the database into a hash binary vector.
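The two sub-steps of step 10 can be sketched as follows. This is only an illustration under stated assumptions: the function names `to_tokens`, `simhash` and `hash_record` are hypothetical, the type dispatch covers only strings and simple collections, and the hash generator is a basic unweighted SimHash built on SHA-256 rather than any specific function from the embodiment.

```python
import hashlib


def to_tokens(record):
    """Data type detection (hypothetical): turn a record into string tokens."""
    if isinstance(record, str):
        return record.lower().split()
    if isinstance(record, (set, frozenset, list, tuple)):
        # Sort so that the same collection always yields the same token order.
        return sorted(str(item) for item in record)
    raise TypeError(f"no hash mapping registered for {type(record).__name__}")


def simhash(tokens, n_bits=64):
    """Hash generation: a basic unweighted SimHash over the tokens.

    n_bits is assumed to be a multiple of 8 and at most 256.
    """
    votes = [0] * n_bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: n_bits // 8], "big")
        for bit in range(n_bits):
            # Each token votes +1 or -1 on every output bit.
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return tuple(1 if v > 0 else 0 for v in votes)


def hash_record(record, n_bits=64):
    """Map one record (or query) to a binary vector in Hamming space."""
    return simhash(to_tokens(record), n_bits)
```

Applying `hash_record` to every database record and to the query would yield the hash database and the hashed query used by the later steps; a learned model such as an autoencoder could be swapped in behind the same interface.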

Step 20: Perform column reordering on the binary data in the hash database.

To deal with data skew and the correlation between dimensions, existing methods rely on random reordering or on column clustering that reduces correlation, in order to reduce the skew of the data. Their goal is to make the dimensions in each partition as evenly distributed as possible, so that the threshold assigned to each partition does not introduce a large number of candidates. In contrast, the present invention is dedicated to increasing the skew of each column partition, so that the Hamming threshold allocation can be more effective.

In order to achieve this goal, this embodiment designs a cost model for column reordering, and converts this problem into an optimization problem that optimizes the performance of query processing.

Step 201: Design a cost model based on column reordering.

Here, a query set Q = {<q^1, τ^1>, <q^2, τ^2>, ..., <q^{|Q|}, τ^{|Q|}>} is prepared in advance, and the columns of the data set are split into m partitions by a partitioning scheme P. The cost model of query optimization is as follows:

The right-hand side is the sum of the actual approximate-query costs of all queries in the query set under their thresholds. The calculation of the query cost is not covered here and is discussed in detail in the following steps. With the above formula, the task can be stated as an optimization problem: given a binary data set D and a query set Q, the goal of this embodiment is to find a column partitioning P that achieves the minimum query cost, which is

The column partition optimization problem is an NP-hard problem.
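The two formulas referred to in step 201 do not appear in this text. A form consistent with the surrounding description, using the names C_workload and C_query_proc that appear later in the description, might read as follows; this is a reconstruction under that assumption, not the original notation.

```latex
% Workload cost: sum of per-query costs over the query set Q under partitioning P
C_{workload}(Q, P) = \sum_{i=1}^{|Q|} C_{query\_proc}\left(q^{i}, \tau^{i}\right)

% Optimization target: the partitioning with the minimum workload cost
P^{*} = \arg\min_{P} \; C_{workload}(Q, P)
```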

Step 202: Initialize column partitioning.

Since column segmentation is an NP-hard problem, this embodiment is divided into two steps for discussion: initializing column segmentation and approximate segmentation algorithm. For approximate segmentation, only a local optimal solution can be obtained, so a good initialization is essential to improve the effect of the approximate algorithm.

In the initial column partitioning, the correlation between the columns plays a key role. Unlike previous methods, which divide the columns so that the data in each partition is as evenly distributed as possible, the method of this embodiment has the opposite goal. We observe that if highly correlated data columns are placed in the same partition, the performance of approximate queries generally improves. This is because the Hamming threshold allocation method of this embodiment optimizes each query online and works better on highly skewed data. When highly correlated data columns are put together, more errors are concentrated in the same data partition. The threshold allocation method of this embodiment therefore assigns a larger threshold to that partition and smaller thresholds to the other partitions. In other words, this embodiment assigns appropriate thresholds to different data partitions. If the data columns were uniformly distributed, all partitions would have the same distribution, and there would be no highly skewed partitions to exploit.

This embodiment uses information entropy to measure the correlation between data columns. For a data partition P_i, consider the data set projected onto the columns of P_i; the correlation of P_i is measured by the entropy of that projection, given by the following formula:

According to the formula, a smaller value of information entropy indicates that the current segmentation has stronger relevance. The information entropy of the entire data segmentation scheme P is the cumulative sum of the information entropy of all data segmentation, which is:

The goal of this embodiment is to find an initial partitioning scheme P such that H(P) is minimized. To achieve this goal, this embodiment uses an equal-size greedy method: at the beginning P is an empty partitioning scheme, and this embodiment greedily selects data columns, that is, the data column that produces the smallest information entropy for the current partition is put into that partition. This process continues until the size of the partition reaches the upper limit, which is

That is, the first data segmentation is generated. After that, this embodiment repeats the above process until all the data columns are allocated to the corresponding data partitions.
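The greedy initialization of step 202 can be sketched as below. Two points are assumptions, since the corresponding formulas are not present in this text: the entropy is taken to be the Shannon entropy of the joint value distribution of the rows projected onto a partition's columns, and the upper limit on a partition's size is taken to be ⌈n/m⌉ for n columns and m partitions. The function names are hypothetical.

```python
from collections import Counter
from math import ceil, log2


def partition_entropy(rows, cols):
    """Shannon entropy (in bits) of the rows projected onto the given columns."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def greedy_initial_partition(rows, m):
    """Fill one partition at a time with the column that keeps its entropy lowest."""
    n_cols = len(rows[0])
    cap = ceil(n_cols / m)  # assumed upper limit on the size of a partition
    remaining = list(range(n_cols))
    partitions = []
    while remaining:
        current = []
        while remaining and len(current) < cap:
            # Pick the column yielding the smallest entropy for the current partition.
            best = min(remaining,
                       key=lambda c: partition_entropy(rows, current + [c]))
            current.append(best)
            remaining.remove(best)
        partitions.append(current)
    return partitions
```

On a toy data set in which columns 0 and 2 are identical and columns 1 and 3 are identical, the sketch groups each identical pair into one partition, which is the behavior the entropy criterion is meant to encourage.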

Step 203: Approximate segmentation algorithm.

After the initial data segmentation is obtained, the query set needs to be used to refine the segmentation plan. Here, this embodiment uses a greedy strategy, that is, iteratively exchanges the two data columns with the largest difference between the current approximate query effects.

In each iteration, data columns in two data partitions are randomly selected and exchanged. The query set is then run as approximate queries on the exchanged data set, and the cost C_workload is calculated. The exchange of two data columns that yields the smallest C_workload is selected. This process is repeated until the smallest C_workload computed in the current iteration is no longer smaller than that of the previous iteration; the algorithm then stops and outputs the final data partitioning scheme.
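The refinement loop of step 203 can be sketched as a simple local search. The workload cost is passed in as a black-box callable, because its computation depends on the later query-processing steps; the name `refine_partition`, the number of sampled exchanges per iteration (`trials`) and the fixed random seed are assumptions made for illustration. Every partition is assumed to be non-empty.

```python
import random


def refine_partition(partitions, workload_cost, trials=20, seed=0):
    """Repeatedly apply the best sampled column exchange until none improves the cost."""
    rng = random.Random(seed)
    best = [list(p) for p in partitions]  # work on a copy of the input
    best_cost = workload_cost(best)
    if len(best) < 2:
        return best, best_cost
    while True:
        round_best, round_cost = None, best_cost
        for _ in range(trials):
            # Randomly pick two partitions and one column in each, then exchange them.
            a, b = rng.sample(range(len(best)), 2)
            i = rng.randrange(len(best[a]))
            j = rng.randrange(len(best[b]))
            cand = [list(p) for p in best]
            cand[a][i], cand[b][j] = cand[b][j], cand[a][i]
            cost = workload_cost(cand)
            if cost < round_cost:
                round_best, round_cost = cand, cost
        if round_best is None:
            # No sampled exchange lowered the cost: stop with the current scheme.
            return best, best_cost
        best, best_cost = round_best, round_cost
```

Because an exchange is accepted only when it strictly lowers the cost, the loop terminates for any deterministic cost function, but, as the description notes, it reaches only a local optimum.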

Step 30: Create an index structure for the newly generated data. The index structure consists of two parts: histogram and inverted hash index.

The role of the histogram is to collect statistics on the current data. For each data partition of width d, enumerate all binary values, that is, 2^d values, and d+1 thresholds, namely 0, 1, 2, ..., d. Hist(p,t) represents the number of records in the data set whose value in this partition has a Hamming distance of t from the d-bit value p.

The inverted hash index uses the value of each record in each data partition as a hash key, and the ID of the record is added to the inverted table of that key.
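A minimal sketch of the two index parts follows, assuming records are tuples of 0/1 values and each partition is a list of column positions. The helper `candidate_count`, which sums histogram entries up to a threshold to estimate how many records a partition would return, is an assumption about how the histogram could feed the later threshold allocation; it is not stated in this form in the description. Enumerating all 2^d values is only practical for small partition widths d.

```python
from collections import defaultdict
from itertools import product


def build_index(rows, partitions):
    """Return (inverted, hist): one inverted table and one histogram per partition."""
    inverted, hist = [], []
    for cols in partitions:
        # Inverted hash index: partition value -> list of record IDs.
        inv = defaultdict(list)
        for rid, row in enumerate(rows):
            inv[tuple(row[c] for c in cols)].append(rid)
        # Histogram: for every d-bit value p, counts[t] = number of records
        # whose value in this partition is at Hamming distance t from p.
        d = len(cols)
        h = {}
        for p in product((0, 1), repeat=d):
            counts = [0] * (d + 1)
            for key, ids in inv.items():
                dist = sum(a != b for a, b in zip(p, key))
                counts[dist] += len(ids)
            h[p] = counts
        inverted.append(dict(inv))
        hist.append(h)
    return inverted, hist


def candidate_count(partition_hist, p, t):
    """Records within Hamming distance t of p in one partition (0 when t is -1)."""
    return sum(partition_hist[p][: t + 1])
```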

Step 40: Query optimization. In order to use the new pigeonhole principle to process queries, the key issue is how thresholds are assigned to each partition.

Step 401: In order to better optimize the query threshold allocation, this embodiment designs an approximate query cost model as follows:

C_{query_proc}(q, T) = C_{sig_gen}(q, T) + C_{cand_gen}(q, T) + C_{verify}(q, T)

Among them, C_{sig_gen}(q, T), C_{cand_gen}(q, T) and C_{verify}(q, T) respectively represent the cost of signature generation, candidate set generation and verification.

Signature generation means that all possible hash values to be checked are generated from the query data according to the Hamming threshold. Candidate set generation refers to looking up the inverted tables in the index structure with the hash values generated from the query data, extracting the corresponding records, and obtaining the candidate set after deduplication. Verification refers to using the Hamming distance function to calculate, for each record in the candidate set, its Hamming distance to the query, and outputting the final result by comparison with the given threshold.

In practical applications, the cost of signature generation is usually much less than the cost of candidate set generation and verification, because the time complexity of signature generation is limited by the size and threshold of the query. Therefore, in this embodiment, the influence of signature generation can be ignored in the process of query optimization.

Let CN(q_i, τ_i) denote the number of candidates generated in the i-th partition for the query under the currently assigned threshold. Then, for a query and its threshold allocation, the sum of the lengths of the extracted inverted tables is

Assume that the sum of the length of the inverted list is proportional to the size of the candidate set after deduplication, which is

Therefore, the cost model of approximate query is estimated as follows:

Where c_access is the cost of accessing an element in the inverted table, and c_verify is the cost of verifying whether the Hamming distance of two vectors is less than or equal to a given threshold. Both of these parameters are preset constants.
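The three formulas in this passage are not present in this text. The following reconstruction is consistent with the surrounding definitions, in particular with the later remark that the factor (c_access + α·c_verify) can be ignored; the symbol S_cand for the deduplicated candidate set is a hypothetical name introduced here.

```latex
% Sum of the lengths of the extracted inverted tables over the m partitions
\sum_{i=1}^{m} CN(q_i, \tau_i)

% Assumed proportionality between that sum and the deduplicated candidate set
|S_{cand}| = \alpha \cdot \sum_{i=1}^{m} CN(q_i, \tau_i)

% Estimated query cost, with signature generation ignored
C_{query\_proc}(q, T) \approx c_{access} \sum_{i=1}^{m} CN(q_i, \tau_i)
    + c_{verify} \cdot \alpha \sum_{i=1}^{m} CN(q_i, \tau_i)
  = \left(c_{access} + \alpha \cdot c_{verify}\right) \sum_{i=1}^{m} CN(q_i, \tau_i)
```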

With the above formula, the threshold allocation can be formally transformed into an optimization problem: given a data set, query q and threshold τ, find a threshold vector T that minimizes the approximate query cost, which is

Step 402: Threshold allocation method. Because c_access, c_verify and α are independent of CN(q_i, τ_i), this embodiment can ignore the factor (c_access + α·c_verify) in the above formula and obtain the optimal threshold allocation by minimizing the sum of CN(q_i, τ_i) over all partitions.

Here CN(q_i, τ_i) is regarded as a black box with a time complexity of O(1), and a threshold allocation algorithm based on dynamic programming is proposed.

Using OPT[i,t] to record the minimum approximate query cost for the data segmentation from 1 to i and the current threshold t, there is the following recursive formula:

With the above recursive formula, a dynamic programming algorithm is designed to realize the threshold allocation. In the initial stage, the costs of the first partition are initialized, namely OPT[1,-1], ..., OPT[1,τ]. The above formula is then used to calculate the minimum value of each OPT[i,t]. A negative threshold of -1 is also allowed here, so that the budget it frees can be assigned to other partitions. Finally, the path to OPT[m,τ-m+1] is traced back to obtain the final threshold allocation vector. The time complexity of the entire dynamic programming algorithm is O(m·(τ+1)²).
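The dynamic program can be sketched as follows. The recursive formula itself is not present in this text, so the sketch assumes the recurrence OPT[i,t] = min over e of (OPT[i-1,t-e] + CN(q_i,e)) with every per-partition threshold e at least -1 and all thresholds summing to τ-m+1, as the traced-back entry OPT[m,τ-m+1] suggests. To keep array indices non-negative, each threshold is shifted by +1, so the shifted values sum to τ+1; this re-indexing and the name `allocate_thresholds` are choices made for illustration. The input `cn[i][e + 1]` holds the black-box value CN(q_i, e) for e = -1, ..., τ.

```python
def allocate_thresholds(cn, tau):
    """Return (thresholds, cost) minimizing the summed candidate counts.

    cn[i][e + 1] is the candidate count of partition i at threshold e,
    for e = -1..tau, so each row of cn has tau + 2 entries.
    """
    m = len(cn)
    budget = tau + 1  # shifted thresholds (e + 1) must sum to tau + 1
    inf = float("inf")
    opt = [[inf] * (budget + 1) for _ in range(m)]
    choice = [[0] * (budget + 1) for _ in range(m)]
    # Initialization: the first partition alone consumes the whole budget b.
    for b in range(budget + 1):
        opt[0][b] = cn[0][b]
        choice[0][b] = b
    # Recurrence: give k shifted units to partition i, the rest to partitions before it.
    for i in range(1, m):
        for b in range(budget + 1):
            for k in range(b + 1):
                cost = opt[i - 1][b - k] + cn[i][k]
                if cost < opt[i][b]:
                    opt[i][b] = cost
                    choice[i][b] = k
    # Trace back the choices to recover the threshold vector.
    thresholds = [0] * m
    b = budget
    for i in range(m - 1, -1, -1):
        k = choice[i][b]
        thresholds[i] = k - 1  # undo the +1 shift
        b -= k
    return thresholds, opt[m - 1][budget]
```

For example, with two partitions, τ = 2, and candidate counts `[[0, 10, 50, 100], [0, 1, 2, 3]]`, the sketch assigns threshold -1 to the first partition and 2 to the second, so the partition that would return many candidates is skipped entirely.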

Step 50: Extract the candidate set from the inverted list, remove duplicates and verify one by one.

Once the Hamming threshold allocation is determined, for each column partition q_i of the query and its assigned threshold t, enumerate all possible hash values, which is

Example: a column partition of the query data is 001 and t=1. Enumerating all possible query hash values gives 001, 101, 011 and 000.

After all the hash values are enumerated, each of them is looked up as a key in the pre-established inverted index, and the corresponding inverted table is extracted. After all the inverted tables are extracted, duplicates are removed and the Hamming distance formula is used to calculate, one by one, the Hamming distance between each remaining record and the query. If the calculated value is less than or equal to the given threshold, the record is returned as one of the results.
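Step 50 can be sketched end to end as follows, assuming records and queries are tuples of 0/1 values, partitions are lists of column positions, and the per-partition thresholds have already been allocated. The function names are hypothetical, and a small inverted-index builder is repeated here only so that the example is self-contained.

```python
from itertools import combinations


def hamming(a, b):
    """Hamming distance between two equal-length bit tuples."""
    return sum(x != y for x, y in zip(a, b))


def signatures(bits, t):
    """All bit tuples within Hamming distance t of bits (none when t is -1)."""
    out = []
    for r in range(min(t, len(bits)) + 1):
        for flips in combinations(range(len(bits)), r):
            s = list(bits)
            for i in flips:
                s[i] ^= 1  # flip the chosen positions
            out.append(tuple(s))
    return out


def build_inverted(rows, partitions):
    """One dict per partition: partition value -> list of record IDs."""
    index = [dict() for _ in partitions]
    for rid, row in enumerate(rows):
        for inv, cols in zip(index, partitions):
            inv.setdefault(tuple(row[c] for c in cols), []).append(rid)
    return index


def query(rows, partitions, index, q, thresholds, tau):
    """Extract candidates per partition, deduplicate, then verify one by one."""
    candidates = set()  # the set removes duplicates across inverted tables
    for inv, cols, t in zip(index, partitions, thresholds):
        q_part = tuple(q[c] for c in cols)
        for sig in signatures(q_part, t):
            candidates.update(inv.get(sig, ()))
    return sorted(rid for rid in candidates if hamming(rows[rid], q) <= tau)
```

Calling `signatures((0, 0, 1), 1)` reproduces the four values of the example above. In `query`, a partition whose threshold is -1 contributes no signatures and is therefore skipped.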

In summary, the embodiments of the present invention provide a series of hash mapping functions, such as SimHash and deep neural network models, to map the data in the database and the query data into binary vectors; propose a more general pigeonhole principle that yields tighter filter conditions and a more flexible threshold allocation method; design an efficient online query optimization method based on the general pigeonhole principle to allocate thresholds, so that the allocation scheme is optimal; and design an offline data column partitioning method to solve the partition selectivity problem caused by data skew and dimension correlation.

Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above-mentioned embodiments can be completed by instructions, or by instructions controlling related hardware, and that the instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, an embodiment of the present invention provides a storage medium in which multiple instructions are stored, and the instructions can be loaded by a processor to execute the steps in the Hamming space approximate query method provided by the embodiment of the present invention.

Wherein, the storage medium may include: read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.

As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An approximate query method for Hamming space, characterized in that it comprises the following steps:

Map all records and query data in the original database into a hash binary vector in Hamming space to obtain a hash database;

Perform column reordering on the binary data in the hash database;

Create an index structure for the newly generated data after column reordering, the index structure including a histogram and an inverted hash index;

Analyze the query and assign corresponding query thresholds for each data segmentation.

2. The Hamming space approximate query method according to claim 1, wherein the method for obtaining the hash database is: detecting the type and structure of the current data for each record and query data; according to the type and structure of the current data, selecting the corresponding hash mapping function from the set of hash functions; and, through the selected hash mapping function, mapping the input data into a hash binary vector.

3. The Hamming space approximate query method according to claim 1, wherein the method for column reordering the binary data in the hash database is: designing a cost model based on column reordering; initializing the column segmentation of the binary data; and, after initializing the column segmentation, performing approximate segmentation.

4. The Hamming space approximate query method according to claim 3, wherein the method for initializing column division of binary data is: initializing an empty data segmentation; selecting a data column, and if the data column produces the smallest information entropy relative to the current data segmentation, putting it into the current data segmentation; selecting the next data column and repeating the same processing until the size of the current data segmentation reaches the upper limit, at which point the first data segmentation is generated; after that, repeating the segmentation process until all the data columns are allocated to their corresponding data segmentations.

5. The Hamming space approximate query method according to claim 3, characterized in that the method of performing approximate segmentation is: iteratively exchanging the two data columns with the largest difference between the current approximate query effects.

6. The Hamming space approximate query method according to claim 1, wherein the method for assigning corresponding query thresholds for each data segmentation is:

Design a cost model based on threshold allocation;

According to the cost model based on threshold allocation, a dynamic programming algorithm is used for query threshold allocation.

7. The Hamming space approximate query method according to claim 1, wherein the Hamming space approximate query method further comprises: extracting a candidate set according to an index structure, and verifying one by one to obtain a final result.

8. The Hamming space approximate query method according to claim 7, wherein the method of extracting candidate sets according to the index structure and verifying one by one to obtain the final result is:

For each column segmentation of the query and its corresponding assigned query threshold, enumerate all possible hash values; for each hash value, find the corresponding key value in the pre-established inverted index and extract the corresponding inverted table; after all the inverted tables are extracted, remove the duplicates and use the Hamming distance formula to calculate, one by one, the Hamming distance between each remaining record and the query; if the calculated value is less than or equal to the given threshold, the record is returned as one of the results.

9. A storage medium, characterized in that a computer program is stored on the storage medium, and when the computer program is executed by a processor, the Hamming space approximate query method according to any one of claims 1-8 is realized.