CN116028500B

CN116028500B - Range query indexing method based on high-dimensional data

Info

Publication number: CN116028500B
Application number: CN202310060522.XA
Authority: CN
Inventors: 黎玲利; 孙文静
Original assignee: Heilongjiang University
Current assignee: Heilongjiang University
Priority date: 2023-01-17
Filing date: 2023-01-17
Publication date: 2023-07-14
Anticipated expiration: 2043-01-17
Also published as: CN116028500A

Abstract

The invention discloses a range query indexing method based on high-dimensional data, relates to the technical field of database similarity search, and aims to solve the problem that prior-art indexing methods have low accuracy when applied to high-dimensional data. The whole query framework is optimized so that efficiency and precision are balanced, and parameters are tuned automatically so that diverse user requirements can be met. The method performs efficient (short query time) and accurate (high accuracy) range queries on high-dimensional data, avoids the curse of dimensionality by preprocessing the data with PCA, and classifies the data according to its characteristics. The most suitable index is built on each block, so that the characteristics of the data and of the indexes are both exploited and the efficiency of the method is maximized.

Description

Range query indexing method based on high-dimensional data

Technical Field

The invention relates to the technical field of database similarity searching, in particular to a range query indexing method based on high-dimensional data.

Background

Data are ubiquitous in today's society and arise in every area of real life. In the era of big data, terabytes of multidimensional data are produced every day. Such huge volumes of high-dimensional data must be analyzed to become valuable, and similarity search is one of the key steps of that analysis. The purpose of similarity search is to find objects that are similar to a given object, and the range query is one of the core operations of the similarity search field.

Given a set of objects D, a query object q, a distance threshold τ and a distance function dist(·,·), the task of a range query is to return all objects in the data set D whose distance to the query q is within the given distance threshold τ. Range queries are important for text search, image search, product recommendation, and similar applications.
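As a reference point for the definition above, a range query can be answered exactly by a linear scan. The sketch below assumes Euclidean distance and a small in-memory data set; the function name and the sample points are illustrative and do not come from the patent.

```python
import math

def range_query(dataset, q, tau):
    """Return every object in `dataset` whose Euclidean distance to `q` is at most `tau`."""
    return [o for o in dataset if math.dist(o, q) <= tau]

# Illustrative two-dimensional data set.
D = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (6.0, 8.0)]
result = range_query(D, q=(0.0, 0.0), tau=5.0)
```

A scan of this kind visits every object, which is the cost that the indexing methods discussed below try to avoid.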

Current methods for the similarity search problem include traditional exact query methods, such as EPT, GNAT, LC and M-tree, and approximate query methods, such as the graph-based HNSW and HVS, the quantization-based VAQ and PQ, the hashing-based OASIS and SAS, and the learned indexes LIMS, ZM-index and LISA.

As the data scale grows, some existing methods (such as LC) have very long index construction times, while others (such as M-tree) have very long query times and high computational cost. As the dimensionality of the data increases, some methods that are very efficient in low dimensions become much less accurate when applied to high-dimensional data. Although learned indexes based on deep learning greatly reduce query time, they lose accuracy, and their accuracy is difficult to guarantee.

Disclosure of Invention

The purpose of the invention is to provide a range query indexing method based on high-dimensional data that addresses the low accuracy of prior-art indexing methods when they are applied to high-dimensional data.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a range query indexing method based on high-dimensional data comprises the following steps:

step one: performing dimension reduction processing on data in a database, wherein the data is high-dimension data;

step two: based on the data after the dimension reduction processing, taking the data reduced to the same dimension as one class, arranging the classes in an ascending order, and then merging and blocking all the ordered classes, wherein the merging and blocking strategy is as follows:

the number of different dimensions contained in each block is the same, and the number of different dimensions in each block is 1-10;

step three: uniformly selecting a plurality of data randomly in each block according to the data distribution condition, respectively inquiring the selected data by utilizing a plurality of indexes, recording the inquiring time, selecting an optimal index from the plurality of indexes as an index of each block, determining the highest dimension in each block, and adjusting the rest dimensions in the block to the highest dimension;

step four: extracting a plurality of data that conform to a uniform distribution from the database, adding noise to the extracted data, and taking the plurality of noisy data as a query workload Q, wherein each datum is a query q in the query workload Q;

step five: for each query q, in the database, using the M-tree to make a first range query of a distance threshold tau, wherein the first range query specifically comprises:

taking the query q as a center point and taking the threshold tau as a radius to obtain a hypersphere, wherein data contained in the hypersphere is a label;

step six: for each query q, making a second range query of a distance threshold τ in a database, wherein the second range query is specifically:

performing dimension reduction on the query q to obtain a query q', determining the block B in which the query q' is located, making the dimension of the query q' identical to the dimension of the block B, taking the query q' as a center and the threshold τ as a radius within the block to obtain a candidate point set, restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a restored query q'', calculating the Euclidean distance between each item of restored data and the query q'', taking an item of restored data as data in the answer set if its distance is not greater than the distance threshold τ, and obtaining the answer set once the distances between all restored data and the query q'' have been calculated;

step seven: comparing the first range query with the second range query to determine whether the data in the answer set correspond one-to-one with the labels; if they do, the query is finished; if they do not, performing a left-right cross-block search based on the block B of the second range query until the data in the answer set correspond to the labels, at which point the query is finished;

the specific steps of performing the left-right cross-block search based on the block B in the second range query are as follows:

step seven-one: taking the block B as a center, selecting one block to the left and one block to the right, making the dimension of the data in the left block identical to the dimension of the query q', and making the dimension of the query q' identical to the dimension of the right block;

step seven-two: if the data in the answer set still do not correspond to the labels, increasing by one the number of blocks selected to the left and to the right in step seven-one, then making the dimension of the data in all left blocks identical to the dimension of the query q', and making the dimension of the query q' identical to the dimension of the rightmost block;

step seven-three: repeating step seven-two until no non-correspondence remains.
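The iterative widening in the sub-steps above can be sketched as a loop that adds one block on each side per round. This is a minimal sketch under stated assumptions: `search_blocks` is a hypothetical callable standing in for the per-block index lookup and verification, blocks are addressed by position, and the loop also stops when every block has been searched, a stopping case the text does not spell out.

```python
def cross_block_search(search_blocks, labels, center, n_blocks, target_recall=1.0):
    """Widen the searched window around block `center`, one block per side
    per round, until the answers reach `target_recall` of `labels`.

    `search_blocks(lo, hi)` is a hypothetical callable returning the answer
    set found in blocks lo..hi (inclusive)."""
    labels = set(labels)
    step = 0
    while True:
        lo, hi = max(center - step, 0), min(center + step, n_blocks - 1)
        answers = set(search_blocks(lo, hi))
        recall = len(answers & labels) / len(labels) if labels else 1.0
        # Stop when the target is met or when no further block can be added.
        if recall >= target_recall or (lo == 0 and hi == n_blocks - 1):
            return answers, recall, step
        step += 1

# Stand-in data: each block holds the ids of the matching points it contains.
block_contents = [{1}, {2}, {3}, {4}, {5}]

def search_blocks(lo, hi):
    return set().union(*block_contents[lo:hi + 1])

answers, recall, step = cross_block_search(search_blocks, {2, 3, 4}, center=2, n_blocks=5)
```

With `target_recall=1.0` the loop corresponds to the exact variant, in which the answer set must match the labels one-to-one; a smaller target gives an approximate variant.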

Further, the sixth and seventh steps are replaced by:

step six: performing dimension reduction on each query q to obtain a query q', determining the block B to be searched for the query q', taking the query q' as a center and the threshold τ as a radius within the block to obtain a candidate point set, restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a restored query q'', calculating the Euclidean distance between each item of restored data and the query q'', taking an item of restored data as data in the answer set if its distance is not greater than the distance threshold τ, and obtaining the answer set once the distances between all restored data and the query q'' have been calculated;

step seven: comparing the data in the answer set with the labels to calculate a recall rate, determining whether the calculated recall rate meets the recall rate R, stopping if it does, and otherwise performing a left-right cross-block search based on the block B of the second range query until the recall rate R is met;

the recall rate R is obtained through the following steps:

step 1: setting a recall rate lower limit value, and then constructing a coordinate system by taking a horizontal axis as the recall rate and a vertical axis as time, wherein the recall rate is the ratio of the number of data in an answer set to the number of labels;

step 2: acquiring a time-recall curve in the constructed coordinate system, and, when an inflection point appears on the curve and the recall rate corresponding to the inflection point is larger than the recall rate lower limit value, taking the recall rate corresponding to the inflection point as the recall rate R.

Further, the data is acron data.

Further, the dimension reduction processing is performed by PCA, namely principal component analysis.

Further, the dimension reduced in the dimension reduction process is the largest dimension within the error epsilon;

the error epsilon is the Euclidean distance difference value between the original data and the reconstructed data which is restored to the original dimension after the dimension is reduced.

Further, the initial value of the number of dimensions contained in each block is 2 or 4.

Further, the plurality of existing indexes includes: GNAT, EPT and M-tree.

Further, in the third step, the number of data selected in each block according to the data distribution condition is 100.

Further, in the fourth step, 1000 data corresponding to the uniform distribution are extracted from the existing database.

Further, the lower limit value of the recall ratio is 95%.

The beneficial effects of the invention are as follows:

the invention realizes fast and efficient range queries on various data sets (large-scale, high-dimensional real data sets and different types of synthetic data sets). The whole query framework is optimized so that efficiency and precision are balanced, and parameters are tuned automatically so that diverse user requirements can be met.

The method performs efficient (short query time) and accurate (high accuracy) range queries on high-dimensional data, avoids the curse of dimensionality by preprocessing the data with PCA, and classifies the data according to its characteristics. The most suitable index is built on each block, so that the characteristics of the data and of the indexes are both exploited and the efficiency of the method is maximized.

Drawings

FIG. 1 is a schematic diagram of an index building portion;

FIG. 2 is a schematic diagram of the query processing portion.

Detailed Description

It should be noted in particular that, without conflict, the various embodiments disclosed herein may be combined with each other.

The first embodiment is as follows: referring to fig. 1, a range query indexing method based on high-dimensional data according to the present embodiment specifically includes:

step one: preprocessing the data in the existing database (the acron dataset, 1369 dimensions) by dimension reduction. Performing range queries on high-dimensional data is very challenging. For example, the accuracy of many methods that are effective in low dimensions decreases significantly as the dimensionality increases, and processing high-dimensional data requires a huge amount of computation and time. It is therefore necessary to preprocess the raw data in the dataset.

The difficulty is that the data are high-dimensional. An intuitive idea is to reduce the dimension of the data, so that the original higher-dimensional data are reduced to a lower dimension while losing as little as possible of the information in the data and of the relationships between the data, and to carry out the subsequent processing on the reduced data. The present application uses PCA, that is, principal component analysis, to reduce the dimension of the data.

The dimension to which each data point in the dataset is reduced is determined by the error ε: it is the largest dimension within the error ε, and the error ε is 36;

the error epsilon refers to the difference value of the distance between the original data and the reconstructed data which is restored to the original dimension after the dimension is reduced. It can be considered that the dimension is reduced for pairs of points that are very close together.
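The per-point choice of dimension can be illustrated as follows. The sketch assumes one reading of the text, namely that a point is reduced to the smallest number of principal components whose reconstruction stays within Euclidean error ε of the original; it also assumes that the mean and the orthonormal principal axes have already been computed by some PCA routine. The function name and sample values are illustrative.

```python
import math

def reduced_dimension(x, mean, axes, eps):
    """Smallest k such that reconstructing `x` from its first k principal
    components leaves a Euclidean error of at most `eps`.

    `axes` are assumed to be orthonormal principal axes ordered by
    decreasing variance; they are an input, not computed here."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    residual_sq = sum(c * c for c in centered)  # squared error with no components
    for k, axis in enumerate(axes, start=1):
        coeff = sum(c * a for c, a in zip(centered, axis))
        residual_sq -= coeff * coeff  # orthonormal axes: squared errors subtract
        if math.sqrt(max(residual_sq, 0.0)) <= eps:
            return k
    return len(axes)

# Toy example in which the principal axes are the coordinate axes.
axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
mean = (0.0, 0.0, 0.0)
k_a = reduced_dimension((5.0, 3.0, 1.0), mean, axes, eps=2.0)
k_b = reduced_dimension((5.0, 0.5, 0.5), mean, axes, eps=2.0)
```

Under this reading different points end up with different reduced dimensions, which is what produces the per-dimension classes used by the blocking step.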

Step two: determining a blocking strategy.

Preprocessing reduces the original data to different dimensions. Each dimension is referred to herein simply as a class, and the size of a class is the number of data points that are reduced to that dimension under the given error. Several classes are thus obtained, ordered by dimension from small to large.

On the basis, the application performs merging and blocking on the classes. The partitioning strategy is that the number of dimensions contained in each block is the same, and the initial value of the number of dimensions contained in each block is 1-10 (the optimal value is 2 or 4);

because the data points falling on different dimensions at this time have different characteristics, the application needs to analyze the data points in a fine granularity, and an index structure which is most suitable for the characteristics of the data points is built on the data points; the method is beneficial to pruning operation, is convenient for screening out data points which are greatly different from the query q, reduces the calculated amount and improves the query efficiency.

When partitioning, the application seeks a blocking strategy that reduces query cost as much as possible while ensuring query precision. Because data in the same dimension have the same characteristics, the application considers that data points in similar dimensions have similar characteristics and are suitable for the same index. Each block therefore contains approximately the same number of dimensions, and the same index is constructed for all data points in the dimensions contained in a block, so as to balance performance. Under this partitioning, fewer block crossings are expected in a specific query, which reduces the time spent calculating distances, while query accuracy is still guaranteed. Thus, the present application optimizes two objectives simultaneously: (1) maximizing query accuracy (recall); (2) minimizing query time (t).
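The merging of classes into blocks can be sketched as below. The sketch assumes that each point is tagged with its reduced dimension and that a block is formed from a fixed number of consecutive distinct dimensions; the function name, dictionary keys and sample ids are illustrative.

```python
from collections import defaultdict

def build_blocks(point_dims, dims_per_block=2):
    """Group point ids by reduced dimension (one class per dimension),
    sort the classes in ascending order, and merge every
    `dims_per_block` consecutive classes into one block."""
    classes = defaultdict(list)
    for point_id, dim in point_dims.items():
        classes[dim].append(point_id)
    ordered = sorted(classes)  # distinct dimensions, ascending
    blocks = []
    for start in range(0, len(ordered), dims_per_block):
        dims = ordered[start:start + dims_per_block]
        blocks.append({
            "dims": dims,
            "block_dim": dims[-1],  # highest dimension in the block
            "points": [p for d in dims for p in classes[d]],
        })
    return blocks

blocks = build_blocks({"a": 3, "b": 5, "c": 3, "d": 8, "e": 12, "f": 5})
```

The `block_dim` field records the highest dimension in each block, to which the remaining points of the block are later adjusted when the index is built. The last block may hold fewer dimensions when the number of classes is not a multiple of `dims_per_block`.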

Step three: constructing the index.

Uniformly and randomly selecting 100 data in each block according to the data distribution condition, respectively inquiring the 100 data by utilizing GNAT, EPT and M-tree, and selecting an optimal index according to the inquiring time; determining the highest dimension in each block, and adjusting the rest dimensions to the highest dimension;

each index works differently, with some indexes being more sensitive to dimensions and some indexes being more sensitive to data size. The method and the device combine the data characteristics of each block in different dimensions and the block scale to select the existing alternative indexes (GNAT, EPT and M-tree) so as to maximize the query efficiency. The data points on each partition are represented by the highest dimension thereon, and the data points within the partition are uniformly reconstructed to the highest dimension of the partition.
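The per-block index selection amounts to timing the same sample queries on each candidate and keeping the fastest. The sketch below assumes each candidate index is exposed as a callable that answers one query; the two stand-in callables are placeholders and are not implementations of GNAT, EPT or M-tree.

```python
import time

def pick_fastest_index(candidates, sample_queries):
    """Run every sample query on each candidate index and return the name
    of the candidate with the smallest total query time, plus all timings.

    `candidates` maps an index name to a callable that answers one query."""
    timings = {}
    for name, run_query in candidates.items():
        start = time.perf_counter()
        for q in sample_queries:
            run_query(q)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get), timings

# Placeholder "indexes": one answers immediately, the other is artificially slow.
def fast_lookup(q):
    return q

def slow_lookup(q):
    time.sleep(0.002)
    return q

best, timings = pick_fastest_index(
    {"slow": slow_lookup, "fast": fast_lookup}, sample_queries=range(5)
)
```

In the embodiment the sample would be the 100 points drawn from the block, and the selection would be repeated once per block.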

Step four: query processing.

Extracting 1000 data that conform to a uniform distribution from the existing database, adding noise to them (in order to distinguish them from the existing data), and taking the 1000 noisy data as a query workload Q, wherein each datum is a query q;

for the query workload of 1000 uniformly distributed data, using M-tree to perform a range query with a distance threshold τ of 54, that is, taking the query q as a center point and the threshold τ as a radius to obtain a hypersphere, wherein the data contained in the hypersphere are the labels;
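The construction of the query workload can be sketched as follows. The text does not state what kind of noise is added, so zero-mean Gaussian noise is assumed here; the function name, the noise scale and the seed are illustrative parameters.

```python
import random

def make_query_workload(dataset, n_queries=1000, noise_std=0.01, seed=0):
    """Draw `n_queries` points uniformly at random (without replacement)
    from `dataset` and perturb every coordinate with Gaussian noise, so
    that the queries differ from the stored points."""
    rng = random.Random(seed)
    picked = rng.sample(dataset, n_queries)
    return [[v + rng.gauss(0.0, noise_std) for v in point] for point in picked]

# Small illustrative data set: 50 three-dimensional points.
dataset = [[float(i), float(i + 1), float(i + 2)] for i in range(50)]
Q = make_query_workload(dataset, n_queries=10, noise_std=0.1, seed=42)
```

The labels for each query in Q would then be obtained by an exact range query over the original data, which the embodiment performs with M-tree.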

the range query can be realized through the following two technical schemes:

P1: taking the query q as a center point and the threshold τ as a radius to obtain a hypersphere, the data contained in the hypersphere being the labels; then performing dimension reduction on the query q to obtain a query q', determining the block B to be searched for the query q', taking the query q' as a center and the threshold τ as a radius within the block to obtain a candidate point set, and restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a restored query q''; then calculating the distance between each item of restored data and the query q'', taking an item of restored data as data in the answer set if its distance is not greater than the distance threshold τ, and obtaining the answer set once all distances have been calculated; and then determining whether the data in the answer set correspond one-to-one with the labels, and, if they do not, performing a left-right cross-block search based on the block B until they do.

P2: taking the query q as a center point and the threshold τ as a radius to obtain a hypersphere, the data contained in the hypersphere being the labels; then performing dimension reduction on each query q to obtain a query q', determining the block B to be searched for the query q', taking the query q' as a center and the threshold τ as a radius within the block to obtain a candidate point set, and restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a restored query q''; then calculating the distance between each item of restored data and the query q'', taking an item of restored data as data in the answer set if its distance is not greater than the distance threshold τ, and obtaining the answer set once all distances have been calculated; then comparing the data in the answer set with the labels to calculate a recall rate, determining whether the calculated recall rate meets the recall rate R, stopping if it does, and otherwise performing a left-right cross-block search based on the block B until the recall rate R is met;

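the recall rate R is obtained through the following steps:

step 1: setting a recall rate lower limit value, and then constructing a coordinate system with the recall rate on the horizontal axis and time on the vertical axis, wherein the recall rate is the ratio of the number of data in the answer set to the number of labels;

step 2: acquiring a time-recall curve in the constructed coordinate system, and, when an inflection point appears on the curve and the recall rate corresponding to the inflection point is larger than the recall rate lower limit value, taking the recall rate corresponding to the inflection point as the recall rate R.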
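The text does not say how the inflection point of the time-recall curve is located. The sketch below assumes a common knee heuristic, taking the interior point farthest from the straight line that joins the first and last points of the curve, and accepts it only if its recall exceeds the lower limit. The function name and the sample curve are illustrative, and the curve is assumed to have at least three points.

```python
def pick_recall_target(curve, recall_floor=0.95):
    """`curve` is a list of (recall, time) points sorted by recall.

    The knee is taken as the interior point farthest from the chord joining
    the first and last points (an assumed heuristic). Returns the knee's
    recall if it exceeds `recall_floor`, otherwise None."""
    (r0, t0), (r1, t1) = curve[0], curve[-1]
    dr, dt = r1 - r0, t1 - t0

    def gap(point):
        r, t = point
        # Proportional to the perpendicular distance from the chord.
        return abs(dt * (r - r0) - dr * (t - t0))

    knee_recall, _ = max(curve[1:-1], key=gap)
    return knee_recall if knee_recall > recall_floor else None

# Illustrative measurements: query time rises sharply beyond 97% recall.
curve = [(0.90, 1.0), (0.94, 1.2), (0.97, 1.5), (0.99, 4.0), (1.00, 9.0)]
R = pick_recall_target(curve, recall_floor=0.95)
```

Because the gap is compared only between points of the same curve, rescaling either axis does not change which point is selected.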

The range query takes the constructed index Forest, the query object q and the threshold τ as input, and returns all data points in the data set D that lie within distance τ of the query q. Briefly, the query is split into two steps: 1) determining the blocks by searching the index, and thereby the candidate point set; 2) determining all data points within the given distance threshold of the query by calculating their distances to the query.

To determine the candidate point set, the application first determines the block in which the target point is located, namely the block intersected by the query range. The query object q is also subjected to dimension reduction: according to the error ε determined in the index construction stage, q is converted into q', whose dimension is tq. From tq the application determines which block the query point q falls into, and the search then proceeds from that block as the center, crossing blocks to the left and to the right. The number of blocks involved after crossing left and right is sum_b = 1 + 2Δb (Δb is the number of blocks crossed to the left or to the right). The application then queries the blocks intersected by the query range to determine the candidate set. The data points in the candidate set are dimension-reduced data points, not original data points.
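Locating the block from the query's reduced dimension and listing the blocks to visit can be sketched as below. It assumes that blocks are stored in ascending order and described by their highest dimension; the function name and sample values are illustrative. The window is clipped at the first and last block, so near the ends it holds fewer than 1 + 2Δb blocks.

```python
import bisect

def blocks_to_search(block_dims, t_q, delta_b):
    """`block_dims` lists the highest dimension of each block in ascending
    order. Returns the indices of the block containing dimension `t_q`
    plus up to `delta_b` neighbours on each side."""
    center = min(bisect.bisect_left(block_dims, t_q), len(block_dims) - 1)
    lo = max(center - delta_b, 0)
    hi = min(center + delta_b, len(block_dims) - 1)
    return list(range(lo, hi + 1))

block_dims = [5, 12, 20, 31, 40]
window = blocks_to_search(block_dims, t_q=14, delta_b=1)       # blocks around dimension 14
edge_window = blocks_to_search(block_dims, t_q=3, delta_b=2)   # clipped at the first block
```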

The application then performs a screening operation on the data points in the candidate set. The distance dist_pca between two dimension-reduced points is not greater than the distance dist between the corresponding original data points. The distance dist_pca between each data point in the candidate set and the query point is calculated; if dist_pca < τ − 2ε, the data point is a query result; otherwise, the dimension-reduced data point is restored, the distance dist between the original data points is calculated, and the data point is a query result if dist < τ.
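The screening rule can be sketched as a two-stage filter. The shortcut is consistent with the triangle inequality under two assumptions: each point (including the query) lies within ε of its reconstruction, and the reduced-space distance equals the distance between the reconstructions. In that case dist ≤ dist_pca + 2ε, so dist_pca < τ − 2ε already implies dist < τ. The sketch also assumes that each candidate keeps its original vector alongside its reduced one; the function name and sample values are illustrative.

```python
import math

def filter_candidates(candidates, q_reduced, q_original, tau, eps):
    """`candidates` is a list of (reduced_point, original_point) pairs.

    A candidate whose reduced-space distance is below tau - 2*eps is
    accepted without further work; any other candidate is verified with
    the distance between the original points."""
    answers = []
    for reduced, original in candidates:
        dist_pca = math.dist(reduced, q_reduced)
        if dist_pca < tau - 2 * eps:
            answers.append(original)  # accepted by the cheap test
        elif math.dist(original, q_original) < tau:
            answers.append(original)  # accepted after exact verification
    return answers

candidates = [
    ((1.0,), (1.0, 0.5)),   # far inside the range
    ((4.0,), (4.0, 1.0)),   # needs exact verification, passes
    ((4.5,), (4.5, 3.0)),   # needs exact verification, fails
]
answers = filter_candidates(candidates, q_reduced=(0.0,), q_original=(0.0, 0.0), tau=5.0, eps=1.0)
```

Only candidates that fail the cheap test pay for a distance computation in the original dimension.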

It should be noted that the detailed description is merely for explaining and describing the technical solution of the present invention, and the scope of protection of the claims should not be limited thereto. All changes which come within the meaning and range of equivalency of the claims and the specification are to be embraced within their scope.

Claims

1. The range query indexing method based on the high-dimensional data is characterized by comprising the following steps of:

2. The method for indexing a range query based on high-dimensional data according to claim 1, wherein said sixth and seventh steps are replaced by:

the recall rate R is obtained through the following steps:

3. A range query indexing method based on high-dimensional data according to claim 2, wherein said data is acron data.

4. A range query indexing method based on high-dimensional data according to claim 3, wherein said dimension reduction process is performed using PCA, principal component analysis.

5. The range query indexing method based on high-dimensional data according to claim 4, wherein the dimension reduced in the dimension reduction process is the largest dimension within error epsilon;

6. The method of claim 5, wherein the initial value of the number of dimensions contained in each block is 2 or 4.

7. The method of claim 6, wherein the plurality of indexes comprises: GNAT, EPT and M-tree.

8. The method of claim 7, wherein the number of data selected in each block is 100.

9. The method according to claim 8, wherein in the fourth step, the number of data conforming to a uniform distribution that are extracted from the database is 1000.

10. The method of claim 9, wherein the recall lower limit is 95%.