CN116028500B - Range query indexing method based on high-dimensional data - Google Patents

Info

Publication number
CN116028500B
CN116028500B CN202310060522.XA CN202310060522A CN116028500B CN 116028500 B CN116028500 B CN 116028500B CN 202310060522 A CN202310060522 A CN 202310060522A CN 116028500 B CN116028500 B CN 116028500B
Authority
CN
China
Prior art keywords
query
data
dimension
block
taking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310060522.XA
Other languages
Chinese (zh)
Other versions
CN116028500A (en)
Inventor
黎玲利
孙文静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Heilongjiang University
Original Assignee
Heilongjiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Heilongjiang University
Priority to CN202310060522.XA
Publication of CN116028500A
Application granted
Publication of CN116028500B
Current legal status: Active
Anticipated expiration

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a range query indexing method based on high-dimensional data, relates to the technical field of database similarity search, and aims to solve the problem that prior-art indexing methods have low accuracy when applied to high-dimensional data. The whole query framework is optimized to balance efficiency and precision, and parameters are tuned automatically, so that diverse customer requirements can be met. The method performs efficient and accurate queries on high-dimensional data, making range queries both faster (short query time) and more accurate (high accuracy). Preprocessing the data with PCA avoids the curse of dimensionality, and the data are classified according to their characteristics. The most suitable index is built on each block, so that the characteristics of the data and of the indexes are both put to good use and the efficiency of the method is maximized.

Description

Range query indexing method based on high-dimensional data
Technical Field
The invention relates to the technical field of database similarity searching, in particular to a range query indexing method based on high-dimensional data.
Background
Data is ubiquitous in today's society and arises in every area of real life. In the era of big data, terabytes of multidimensional data are produced every day. Such huge volumes of high-dimensional data must be analyzed before they become valuable, and similarity search is one of the key steps. The purpose of similarity search is to find objects that are similar to a given object, and the range query is one of the core operations of the similarity search field. Given a dataset D of objects, a query object q, a distance threshold τ and a distance function dist(·,·) (the formula images that define these symbols are missing from this text), the task of a range query is to return all objects in the dataset D whose distance from the query q is within the given distance threshold τ. Range queries are important for text search, image search, product recommendation, and similar applications.
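As a point of reference for the definition above, a range query can always be answered by a linear scan. The following is a minimal sketch, assuming Euclidean distance and points stored as tuples; the names `euclidean` and `range_query` and the toy data are illustrative and do not come from the patent.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(dataset, q, tau, dist=euclidean):
    """Return every object o in dataset with dist(o, q) <= tau (linear scan)."""
    return [o for o in dataset if dist(o, q) <= tau]

# Toy data (illustrative only, not from the patent).
D = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
result = range_query(D, (0.0, 0.0), 5.0)
print(result)
```

The scan costs one distance computation per object, which is the cost that index structures such as the ones discussed here try to avoid.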
Current methods for the similarity search problem include traditional exact query methods, such as EPT, GNAT, LC and M-tree, and approximate query methods, such as the graph-based HNSW and HVS, the quantization-based VAQ and PQ, the hashing-based OASIS and SAS, and the learned indexes LIMS, ZM-index and LISA.
As the data scale grows, some existing methods (such as LC) need a very long index construction time, while others (such as M-tree) have a very long query time and a high computation cost. As the data dimension grows, some methods that are very efficient in low dimensions become much less accurate on high-dimensional data. Learned indexes based on deep learning greatly reduce the query time, but they lose accuracy, and accuracy becomes difficult to guarantee.
Disclosure of Invention
The purpose of the invention is to provide a range query indexing method based on high-dimensional data, aimed at the problem that prior-art indexing methods have low accuracy when applied to high-dimensional data.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A range query indexing method based on high-dimensional data comprises the following steps:
Step one: performing dimension reduction on the data in a database, wherein the data is high-dimensional data;
Step two: based on the dimension-reduced data, taking the data reduced to the same dimension as one class, arranging the classes in ascending order of dimension, and then merging all the ordered classes into blocks, wherein the merging and blocking strategy is as follows:
each block contains the same number of different dimensions, and the number of different dimensions in each block is 1 to 10;
Step three: uniformly and randomly selecting a plurality of data in each block according to the data distribution, querying the selected data with each of a plurality of indexes and recording the query time, selecting the optimal index among the plurality of indexes as the index of each block, determining the highest dimension in each block, and adjusting the remaining dimensions in the block to the highest dimension;
Step four: extracting from the database a plurality of data that conform to a uniform distribution, adding noise to the extracted data, and taking the plurality of noisy data as a query workload Q, wherein each data is a query q in the query workload Q;
Step five: for each query q, using the M-tree to perform a first range query with distance threshold τ in the database, wherein the first range query is specifically:
taking the query q as the center point and the threshold τ as the radius to obtain a hypersphere, wherein the data contained in the hypersphere are the labels;
Step six: for each query q, performing a second range query with distance threshold τ in the database, wherein the second range query is specifically:
performing dimension reduction on the query q to obtain a query q', determining the block B in which the query q' is located, making the dimension of the query q' identical to the dimension of the block B, taking the query q' as the center and the threshold τ as the radius within the block to obtain a candidate point set, restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a query q'', calculating the Euclidean distance between each restored datum and the query q'', taking a restored datum as data in the answer set if its distance is not greater than the distance threshold τ, and obtaining the answer set once the distances between all restored data and the query q'' have been calculated;
Step seven: comparing the first range query with the second range query to determine whether the data in the answer set correspond one-to-one with the labels; if they do, the query is finished; if they do not, performing a left-right cross-block search based on the block B of the second range query until the data in the answer set correspond one-to-one with the labels, at which point the query is finished;
the specific steps of the left-right cross-block search based on the block B of the second range query are as follows:
Step seven-one: taking the block B as the center, selecting one block to the left and one block to the right, making the dimension of the data in the left block identical to the dimension of the query q', and making the dimension of the query q' identical to the dimension of the right block;
Step seven-two: if the data in the answer set and the labels do not correspond, increasing by one the number of blocks selected to the left and to the right in step seven-one, then making the dimension of the data in all the left blocks identical to the dimension of the query q', and making the dimension of the query q' identical to the dimension of the rightmost block;
Step seven-three: repeating step seven-two until there is no non-correspondence.
Further, steps six and seven are replaced by:
Step six: for each query q, performing a second range query with distance threshold τ in the database, wherein the second range query is specifically:
performing dimension reduction on each query q to obtain a query q', determining the block B to be searched for the query q', taking the query q' as the center and the threshold τ as the radius within the block to obtain a candidate point set, restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a query q'', calculating the Euclidean distance between each restored datum and the query q'', taking a restored datum as data in the answer set if its distance is not greater than the distance threshold τ, and obtaining the answer set once the distances between all restored data and the query q'' have been calculated;
Step seven: comparing the data in the answer set with the labels to calculate a recall rate, and determining whether it meets the recall rate R; if it does, stopping; if it does not, performing a left-right cross-block search based on the block B of the second range query until the recall rate R is met;
the recall rate R is obtained through the following steps:
Step 1: setting a lower limit for the recall rate, and then constructing a coordinate system with the recall rate on the horizontal axis and time on the vertical axis, wherein the recall rate is the ratio of the number of data in the answer set to the number of labels;
Step 2: obtaining a time-recall curve in the constructed coordinate system; when an inflection point appears on the curve and the recall rate at the inflection point is greater than the lower limit, the recall rate at the inflection point is taken as the recall rate R;
the specific steps of the left-right cross-block search based on the block B of the second range query are as follows:
Step seven-one: taking the block B as the center, selecting one block to the left and one block to the right, making the dimension of the data in the left block identical to the dimension of the query q', and making the dimension of the query q' identical to the dimension of the right block;
Step seven-two: if the recall rate R is not met, increasing by one the number of blocks selected to the left and to the right in step seven-one, then making the dimension of the data in all the left blocks identical to the dimension of the query q', and making the dimension of the query q' identical to the dimension of the rightmost block;
Step seven-three: repeating step seven-two until the recall rate R is met.
Further, the data is acron data.
Further, the dimension reduction is performed using PCA (principal component analysis).
Further, the dimension to which the data is reduced in the dimension reduction is the largest dimension within the error ε;
the error ε is the Euclidean distance difference between the original data and the reconstructed data obtained by restoring the dimension-reduced data to the original dimension.
Further, the initial value of the number of dimensions contained in each block is 2 or 4.
Further, the plurality of existing indexes includes: GNAT, EPT and M-tree.
Further, in the third step, the number of data selected in each block according to the data distribution condition is 100.
Further, in step four, 1000 data conforming to a uniform distribution are extracted from the existing database.
Further, the lower limit value of the recall ratio is 95%.
The beneficial effects of the invention are as follows:
The invention achieves fast and efficient range queries on a variety of datasets (large-scale, high-dimensional real datasets and different types of synthetic datasets). The whole query framework is optimized to balance efficiency and precision, and parameters are tuned automatically, so that diverse customer requirements can be met.
The method performs efficient and accurate queries on high-dimensional data, making range queries both faster (short query time) and more accurate (high accuracy). Preprocessing the data with PCA avoids the curse of dimensionality, and the data are classified according to their characteristics. The most suitable index is built on each block, so that the characteristics of the data and of the indexes are both put to good use and the efficiency of the method is maximized.
Drawings
FIG. 1 is a schematic diagram of an index building portion;
FIG. 2 is a schematic diagram of the query processing portion.
Detailed Description
It should be noted in particular that, without conflict, the various embodiments disclosed herein may be combined with each other.
Embodiment one: referring to FIG. 1, the range query indexing method based on high-dimensional data according to this embodiment specifically comprises:
Step one: preprocessing the data in the existing database (the acron dataset, 1369 dimensions) by dimension reduction. Performing range queries on high-dimensional data is very challenging. For example, many methods that are effective in low dimensions lose accuracy significantly as the dimension increases, and processing high-dimensional data directly requires a huge amount of computation and time. It is therefore necessary to preprocess the raw data in the dataset.
The difficulty lies in the high dimensionality of the data. An intuitive idea is to reduce the dimension of the data, so that the original higher-dimensional data is reduced to a lower dimension while losing as little as possible of the data information and of the relationships between data, and subsequent processing is carried out on the reduced data. The present application uses PCA (principal component analysis) to reduce the data from its original dimension to a lower dimension (the formula images that give the notation for the original and reduced data are missing from this text).
The dimension to which each data point in the dataset is reduced is determined from the error ε: it is the maximum dimension within the error ε, where ε is 36;
the error ε is the difference in distance between the original data and the reconstructed data obtained by restoring the dimension-reduced data to the original dimension. Dimension reduction can thus be regarded as being applied to pairs of points that are very close together.
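The per-point choice of dimension under the error ε could be sketched as follows. This is only one reading of the text, stated as an assumption: each point keeps the smallest number of leading principal components whose reconstruction error (Euclidean distance between the point and its reconstruction) is at most ε. The principal components are assumed to be precomputed, orthonormal, and ordered by decreasing variance; the function name `reduced_dimension` and the toy basis are hypothetical.

```python
import math

def reduced_dimension(x, mean, components, eps):
    """Smallest k >= 1 such that rebuilding x from its first k principal
    components leaves a Euclidean error of at most eps.

    Assumes `components` is an orthonormal list of direction vectors
    (as PCA would produce); this is an illustrative reading, not the
    patent's exact rule."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    # Squared norm of what is still unexplained; with an orthonormal basis
    # each component removes the square of its coefficient (Pythagoras).
    residual_sq = sum(c * c for c in centered)
    for k, v in enumerate(components, start=1):
        coeff = sum(ci * vi for ci, vi in zip(centered, v))
        residual_sq = max(0.0, residual_sq - coeff * coeff)
        if math.sqrt(residual_sq) <= eps:
            return k
    return len(components)

# Toy basis: the coordinate axes stand in for precomputed PCA directions.
mean = (0.0, 0.0, 0.0)
components = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
points = [(10.0, 1.0, 0.1), (10.0, 0.0, 0.0)]
dims = [reduced_dimension(p, mean, components, eps=0.5) for p in points]
print(dims)
```

Points that need the same number of components end up in the same class, which is what the blocking step operates on.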
Step two: determining a blocking strategy.
Preprocessing reduces the original data to different dimensions. Each dimension is referred to here simply as a class, and the size of a class is the number of data points that are reduced to that dimension under the given error. This yields several classes ordered by dimension from small to large.
On this basis, the application merges the classes into blocks. The blocking strategy is that every block contains the same number of dimensions, and the initial value of the number of dimensions contained in each block is 1 to 10 (the optimal value is 2 or 4);
because data points that fall on different dimensions have different characteristics, the application analyzes the data points at a fine granularity and builds on them the index structure best suited to their characteristics; this favors pruning, makes it easy to screen out data points that differ greatly from the query q, reduces the amount of computation, and improves query efficiency.
When blocking, the application seeks a blocking strategy that reduces the query cost as much as possible while ensuring query precision. Because data in the same dimension have the same characteristics, the application considers that data points in nearby dimensions have similar characteristics and are suitable for the same index. Each block contains approximately the same number of dimensions, and one index is built for all data points in the dimensions contained in a block, so as to balance performance. Under this blocking, a specific query is expected to span fewer blocks, which reduces the time spent computing distances, while query accuracy is still guaranteed. The application therefore optimizes two objectives simultaneously: (1) maximizing the query accuracy (recall); (2) minimizing the query time (t).
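The merging of classes into blocks can be illustrated with a short sketch. It assumes the reduced dimension of every point is already known and groups the distinct dimensions, in ascending order, into blocks of a fixed count; the names `build_blocks` and `dims_per_block` and the toy input are hypothetical, and the last block may hold fewer dimensions when the count does not divide evenly.

```python
from collections import defaultdict

def build_blocks(point_dims, dims_per_block=2):
    """Group points into blocks of `dims_per_block` consecutive distinct
    reduced dimensions.

    point_dims maps a point id to the dimension that point was reduced to.
    Each block records its dimensions, its highest dimension, and its points."""
    classes = defaultdict(list)          # dimension -> ids of points in that class
    for pid, d in point_dims.items():
        classes[d].append(pid)
    ordered = sorted(classes)            # classes in ascending order of dimension
    blocks = []
    for i in range(0, len(ordered), dims_per_block):
        dims = ordered[i:i + dims_per_block]
        points = [pid for d in dims for pid in classes[d]]
        blocks.append({"dims": dims, "max_dim": dims[-1], "points": points})
    return blocks

# Toy input (illustrative only).
point_dims = {"a": 3, "b": 5, "c": 3, "d": 8, "e": 9, "f": 12}
blocks = build_blocks(point_dims, dims_per_block=2)
for block in blocks:
    print(block)
```

The `max_dim` field corresponds to the highest dimension of a block, to which the other points of the block are later adjusted.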
Step three: constructing the index.
Uniformly and randomly selecting 100 data in each block according to the data distribution, querying the 100 data with GNAT, EPT and M-tree respectively, and selecting the optimal index according to the query time; determining the highest dimension in each block, and adjusting the remaining dimensions to the highest dimension;
each index works differently: some indexes are more sensitive to dimension, and others are more sensitive to data size. The application considers both the characteristics of the data in the different dimensions of each block and the size of the block when choosing among the existing candidate indexes (GNAT, EPT and M-tree), so as to maximize query efficiency. The data points in each block are represented in the highest dimension of that block, that is, all data points within a block are uniformly reconstructed to the block's highest dimension.
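Choosing the index of a block by measured query time might look like the sketch below. The candidate searches are passed in as plain callables; `quick_stand_in` and `slow_stand_in` are hypothetical placeholders used only to make the example runnable and are not implementations of GNAT, EPT or M-tree.

```python
import time

def pick_fastest_index(candidates, sample_queries, tau):
    """Run every candidate search on the sample queries and return the name
    of the one with the smallest total wall-clock time, plus all timings.

    candidates maps a name to a callable search(q, tau)."""
    timings = {}
    for name, search in candidates.items():
        start = time.perf_counter()
        for q in sample_queries:
            search(q, tau)
        timings[name] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings

def quick_stand_in(q, tau):
    """Placeholder search that returns immediately."""
    return []

def slow_stand_in(q, tau):
    """Placeholder search that simulates an expensive lookup."""
    time.sleep(0.005)
    return []

sample = [(0.0, 0.0)] * 5   # stands in for the 100 sampled data per block
best, timings = pick_fastest_index(
    {"quick_stand_in": quick_stand_in, "slow_stand_in": slow_stand_in},
    sample,
    tau=54.0,
)
print(best)
```

In practice the measurement would be repeated per block, since the text selects one index for each block rather than one for the whole dataset.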
Step four: query processing.
Extracting 1000 data that conform to a uniform distribution from the existing database, adding noise to them (to distinguish them from the existing data), and taking the 1000 noisy data as a query workload Q, wherein each data is a query q;
for the 1000 uniformly distributed queries of the workload, using the M-tree to perform a range query with the distance threshold τ set to 54: taking the query q as the center point and the threshold τ as the radius gives a hypersphere, and the data contained in the hypersphere are the labels;
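The construction of the query workload and of its labels can be sketched as follows. Two assumptions are made and should be read as such: the noise is taken to be Gaussian with a small standard deviation (the text does not state the noise model), and a linear scan replaces the M-tree when computing the labels, which gives the same exact result on a toy dataset. The names `make_workload` and `ground_truth` are hypothetical.

```python
import math
import random

def make_workload(dataset, n_queries, noise_scale, seed=0):
    """Sample n_queries points uniformly at random (without replacement)
    and perturb every coordinate with Gaussian noise (assumed noise model)."""
    rng = random.Random(seed)
    picked = rng.sample(dataset, n_queries)
    return [tuple(x + rng.gauss(0.0, noise_scale) for x in p) for p in picked]

def ground_truth(dataset, q, tau):
    """Labels of query q: indices of all points inside the hypersphere
    centered at q with radius tau (linear scan in place of the M-tree)."""
    return {i for i, p in enumerate(dataset) if math.dist(p, q) <= tau}

# Toy dataset (illustrative only); the patent uses 1000 queries and tau = 54.
dataset = [(float(i), float(i)) for i in range(20)]
workload = make_workload(dataset, n_queries=5, noise_scale=0.01)
labels = [ground_truth(dataset, q, tau=1.5) for q in workload]
print(labels)
```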
The range query can be realized through either of the following two technical schemes:
P1: taking the query q as the center point and the threshold τ as the radius to obtain a hypersphere, the data contained in the hypersphere being the labels; then performing dimension reduction on the query q to obtain a query q', determining the block B to be searched for the query q', and taking the query q' as the center and the threshold τ as the radius within the block to obtain a candidate point set; restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a query q''; calculating the distance between each restored datum and the query q'', and taking a restored datum as data in the answer set if its distance is not greater than the distance threshold τ, the answer set being obtained once all distances have been calculated; then determining whether the data in the answer set correspond one-to-one with the labels, and if they do not, performing a left-right cross-block search based on the block B until they do.
P2: taking the query q as the center point and the threshold τ as the radius to obtain a hypersphere, the data contained in the hypersphere being the labels; then performing dimension reduction on each query q to obtain a query q', determining the block B to be searched for the query q', and taking the query q' as the center and the threshold τ as the radius within the block to obtain a candidate point set; restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a query q''; calculating the distance between each restored datum and the query q'', and taking a restored datum as data in the answer set if its distance is not greater than the distance threshold τ, the answer set being obtained once all distances have been calculated; then comparing the data in the answer set with the labels to obtain a recall rate and determining whether it meets the recall rate R; if it does, stopping; if it does not, performing a left-right cross-block search based on the block B until the recall rate R is met;
the recall rate R is obtained through the following steps:
step 1: setting a recall rate lower limit value, and then constructing a coordinate system by taking a horizontal axis as the recall rate and a vertical axis as time, wherein the recall rate is the ratio of the number of data in an answer set to the number of labels;
step 2: and acquiring a time-recall curve according to the constructed coordinate system, and when the curve has an inflection point, and the recall corresponding to the inflection point is larger than the recall lower limit value, determining that the recall corresponding to the inflection point is the recall R.
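The recall computation and the choice of R from a time-recall curve could be sketched as follows. The text does not say how the inflection point is detected, so the sketch uses one simple heuristic as an assumption: among measured curve points whose recall exceeds the lower limit, pick the one where the slope of time against recall increases the most. The names `recall` and `choose_recall_target` and the sample curve are hypothetical.

```python
def recall(answer_ids, label_ids):
    """Fraction of the labels that appear in the answer set. Answers are
    verified against the true distance, so this equals the ratio of the
    answer-set size to the number of labels used in the text."""
    if not label_ids:
        return 1.0
    return len(set(answer_ids) & set(label_ids)) / len(label_ids)

def choose_recall_target(curve, lower_limit=0.95):
    """Pick R from a time-recall curve.

    curve is a list of (recall, time) pairs sorted by strictly increasing
    recall. The bend at an interior point is the slope after it minus the
    slope before it; the point with the largest positive bend and a recall
    above lower_limit is returned, or None if there is no such point."""
    best, best_bend = None, 0.0
    for (r0, t0), (r1, t1), (r2, t2) in zip(curve, curve[1:], curve[2:]):
        bend = (t2 - t1) / (r2 - r1) - (t1 - t0) / (r1 - r0)
        if r1 > lower_limit and bend > best_bend:
            best, best_bend = r1, bend
    return best

# Made-up measurements (illustrative only, not results from the patent).
curve = [(0.90, 1.0), (0.95, 1.2), (0.97, 1.4), (0.99, 3.0), (1.00, 6.0)]
R = choose_recall_target(curve, lower_limit=0.95)
print(R, recall([1, 2, 3], [1, 2, 3, 4]))
```

Other knee-detection rules would fit the description equally well; the only constraint taken from the text is that the chosen recall must exceed the lower limit.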
The range query takes the constructed index Forest, the query object q and the threshold τ as input, and returns all data points in the dataset D that lie within distance τ of the query q. Briefly, the query is split into two steps: 1) determining the blocks by searching on the index, and from them the candidate point set; 2) determining all data points within the given distance threshold of the query by calculating their distance from the query.
To determine the candidate point set, the application first determines the block in which the target point is located, that is, the block intersected by the query range. The query object q is also dimension-reduced: q is converted into q', and the dimension of q' under the error ε determined in the index construction stage is tq. From tq the application determines which block the query point q falls into. The search then takes this block as the center and spans blocks to the left and right. The number of blocks involved after spanning left and right is sum_b = 1 + 2*b (where b is the number of blocks spanned to the left or to the right). For the blocks intersected by the query range, the application continues to query them to determine the candidate set. The data points in the candidate set are dimension-reduced data points, not original data points.
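The block lookup and the left-right expansion can be sketched as follows, assuming blocks are stored in ascending order of their highest dimension. All names (`locate_block`, `blocks_to_search`, `search_with_expansion`) and the toy data are hypothetical, and the per-block query is reduced to a set lookup so that the control flow stays visible.

```python
from bisect import bisect_left

def locate_block(block_max_dims, t_q):
    """Index of the first block whose highest dimension is >= t_q
    (clamped to the last block when t_q exceeds every block)."""
    return min(bisect_left(block_max_dims, t_q), len(block_max_dims) - 1)

def blocks_to_search(center, n_blocks, b):
    """Block indices covered when spanning b blocks to each side of center.
    Away from the boundaries this yields 1 + 2*b blocks."""
    lo = max(0, center - b)
    hi = min(n_blocks - 1, center + b)
    return list(range(lo, hi + 1))

def search_with_expansion(center, n_blocks, query_blocks, good_enough):
    """Widen the span one block per side until the answer is good enough
    or every block has been searched. Returns (answer, b)."""
    b = 0
    while True:
        ids = blocks_to_search(center, n_blocks, b)
        answer = query_blocks(ids)
        if good_enough(answer) or len(ids) == n_blocks:
            return answer, b
        b += 1

# Toy setup (illustrative only): four blocks and the labels of one query.
block_max_dims = [5, 9, 12, 20]
block_points = [{1}, {2, 3}, {4}, {5}]
labels = {2, 3, 4}
center = locate_block(block_max_dims, 8)   # a query reduced to dimension 8

def query_blocks(ids):
    """Stand-in for the per-block range query plus exact verification."""
    found = set().union(*(block_points[i] for i in ids))
    return found & labels

answer, b = search_with_expansion(
    center, len(block_points), query_blocks, lambda a: a == labels
)
print(center, answer, b)
```

The stopping test is passed in as a callable, so the same loop serves both schemes: exact correspondence with the labels, or reaching the recall rate R.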
For the data points in the candidate set, the application performs a screening operation. The distance dist_pca between two points after dimension reduction is less than the distance dist between the original data points. The distance dist_pca between each data point in the candidate set and the query point is calculated; if dist_pca < τ - 2ε, the data point is a query result; otherwise, the dimension-reduced data point is restored, the distance dist between the original data points is calculated, and if dist < τ the data point is a query result.
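The screening rule can be sketched as follows. The shortcut is plausible under the assumption that both the data point and the query are reconstructed with an error of at most ε: by the triangle inequality the original distance then exceeds the reduced distance by at most 2ε, so dist_pca < τ - 2ε already implies dist < τ. The function name `verify_candidates` and the `restore` callback, which returns the original-dimension vector of a point, are hypothetical.

```python
import math

def verify_candidates(candidates, q_reduced, q_original, tau, eps, restore):
    """Filter candidate points with the two-stage rule.

    candidates is a list of (point_id, reduced_vector) pairs.
    Stage 1: accept without restoring if dist_pca < tau - 2 * eps.
    Stage 2: otherwise restore the point and accept if dist < tau."""
    answers = []
    for pid, p_reduced in candidates:
        dist_pca = math.dist(p_reduced, q_reduced)
        if dist_pca < tau - 2 * eps:
            answers.append(pid)          # cheap acceptance in reduced space
            continue
        if math.dist(restore(pid), q_original) < tau:
            answers.append(pid)          # exact check in the original space
    return answers

# Toy data (illustrative only): the reduced vector keeps the first coordinate.
originals = {
    0: (1.0, 0.0, 0.1),
    1: (4.0, 0.0, 0.2),
    2: (9.0, 0.0, 0.0),
    3: (4.6, 0.0, 0.1),
}
candidates = [(pid, (vec[0],)) for pid, vec in originals.items()]
answers = verify_candidates(
    candidates,
    q_reduced=(0.0,),
    q_original=(0.0, 0.0, 0.0),
    tau=5.0,
    eps=0.3,
    restore=originals.__getitem__,
)
print(answers)
```

In this toy run points 0 and 1 pass the cheap test, point 3 needs the exact check and passes it, and point 2 is rejected.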
It should be noted that the detailed description is merely for explaining and describing the technical solution of the present invention, and the scope of protection of the claims should not be limited thereto. All changes which come within the meaning and range of equivalency of the claims and the specification are to be embraced within their scope.

Claims (10)

1. A range query indexing method based on high-dimensional data, characterized by comprising the following steps:
Step one: performing dimension reduction on the data in a database, wherein the data is high-dimensional data;
Step two: based on the dimension-reduced data, taking the data reduced to the same dimension as one class, arranging the classes in ascending order of dimension, and then merging all the ordered classes into blocks, wherein the merging and blocking strategy is as follows:
each block contains the same number of different dimensions, and the number of different dimensions in each block is 1 to 10;
Step three: uniformly and randomly selecting a plurality of data in each block according to the data distribution, querying the selected data with each of a plurality of indexes and recording the query time, selecting the optimal index among the plurality of indexes as the index of each block, determining the highest dimension in each block, and adjusting the remaining dimensions in the block to the highest dimension;
Step four: extracting from the database a plurality of data that conform to a uniform distribution, adding noise to the extracted data, and taking the plurality of noisy data as a query workload Q, wherein each data is a query q in the query workload Q;
Step five: for each query q, using the M-tree to perform a first range query with distance threshold τ in the database, wherein the first range query is specifically:
taking the query q as the center point and the threshold τ as the radius to obtain a hypersphere, wherein the data contained in the hypersphere are the labels;
Step six: for each query q, performing a second range query with distance threshold τ in the database, wherein the second range query is specifically:
performing dimension reduction on the query q to obtain a query q', determining the block B in which the query q' is located, making the dimension of the query q' identical to the dimension of the block B, taking the query q' as the center and the threshold τ as the radius within the block to obtain a candidate point set, restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a query q'', calculating the Euclidean distance between each restored datum and the query q'', taking a restored datum as data in the answer set if its distance is not greater than the distance threshold τ, and obtaining the answer set once the distances between all restored data and the query q'' have been calculated;
Step seven: comparing the first range query with the second range query to determine whether the data in the answer set correspond one-to-one with the labels; if they do, the query is finished; if they do not, performing a left-right cross-block search based on the block B of the second range query until the data in the answer set correspond one-to-one with the labels, at which point the query is finished;
the specific steps of the left-right cross-block search based on the block B of the second range query are as follows:
Step seven-one: taking the block B as the center, selecting one block to the left and one block to the right, making the dimension of the data in the left block identical to the dimension of the query q', and making the dimension of the query q' identical to the dimension of the right block;
Step seven-two: if the data in the answer set and the labels do not correspond, increasing by one the number of blocks selected to the left and to the right in step seven-one, then making the dimension of the data in all the left blocks identical to the dimension of the query q', and making the dimension of the query q' identical to the dimension of the rightmost block;
Step seven-three: repeating step seven-two until there is no non-correspondence.
2. The range query indexing method based on high-dimensional data according to claim 1, wherein steps six and seven are replaced by:
Step six: for each query q, performing a second range query with distance threshold τ in the database, wherein the second range query is specifically:
performing dimension reduction on each query q to obtain a query q', determining the block B to be searched for the query q', taking the query q' as the center and the threshold τ as the radius within the block to obtain a candidate point set, restoring the data in the candidate point set and the query q' to the original dimension to obtain restored data and a query q'', calculating the Euclidean distance between each restored datum and the query q'', taking a restored datum as data in the answer set if its distance is not greater than the distance threshold τ, and obtaining the answer set once the distances between all restored data and the query q'' have been calculated;
Step seven: comparing the data in the answer set with the labels to calculate a recall rate, and determining whether it meets the recall rate R; if it does, stopping; if it does not, performing a left-right cross-block search based on the block B of the second range query until the recall rate R is met;
the recall rate R is obtained through the following steps:
step 1: setting a recall rate lower limit value, and then constructing a coordinate system by taking a horizontal axis as the recall rate and a vertical axis as time, wherein the recall rate is the ratio of the number of data in an answer set to the number of labels;
step 2: acquiring a time-recall curve according to the constructed coordinate system, and when an inflection point appears on the curve, and the recall corresponding to the inflection point is larger than the recall lower limit value, determining that the recall corresponding to the inflection point is a recall R;
the specific steps of performing the left-right cross-block search based on the block B in the second range query are as follows:
step seven-one: taking the block B as the center, selecting one block to the left and one block to the right, making the dimension of the data in the left block the same as the dimension of the query q', and making the dimension of the query q' the same as the dimension of the right block;
step seven-two: if the recall rate R is still not met, increasing by one the number of blocks selected to the left and to the right in step seven-one, then making the dimension of the data in all blocks on the left the same as the dimension of the query q', and making the dimension of the query q' the same as the dimension of the rightmost block;
step seven-three: repeating step seven-two iteratively until the recall rate R is met.
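As an illustration only, the range query and the left-right cross-block search described above can be sketched in Python as follows. This is not the claimed implementation. The names search_block and cross_block_range_query, the dictionary layout of a block, and the sample data are hypothetical. Dimension reduction is stood in for by keeping the first k coordinates rather than by PCA, restoring to the original dimension is stood in for by looking up the stored original vector, and the recall rate is computed as the fraction of labels found in the answer set.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_block(block, originals, q, tau):
    """Filter candidates in the block's reduced space, then verify at full dimension."""
    k = block["dim"]
    q_red = q[:k]  # assumption: truncation stands in for projecting q to this block's dimension
    candidates = [pid for pid, vec in block["points"].items()
                  if euclid(vec, q_red) <= tau]
    # assumption: "restoring" is a lookup of the stored original vector
    return {pid for pid in candidates if euclid(originals[pid], q) <= tau}

def cross_block_range_query(blocks, originals, q, tau, home, labels, target_recall):
    """Search the home block, then widen by one block per side until recall is met."""
    answers = search_block(blocks[home], originals, q, tau)
    width = 0
    while True:
        recall = len(answers & labels) / len(labels) if labels else 1.0
        covers_all = home - width <= 0 and home + width >= len(blocks) - 1
        if recall >= target_recall or covers_all:
            return answers, recall, width
        width += 1
        for idx in (home - width, home + width):
            if 0 <= idx < len(blocks):
                answers |= search_block(blocks[idx], originals, q, tau)

# Hypothetical toy data: five 3-dimensional points split over three blocks,
# each block holding its points at its own reduced dimension.
originals = {
    0: (0.0, 0.0, 0.0),
    1: (0.5, 0.0, 0.0),
    2: (0.0, 0.6, 0.0),
    3: (0.3, 0.3, 0.9),
    4: (5.0, 5.0, 5.0),
}

def make_block(ids, dim):
    return {"dim": dim, "points": {pid: originals[pid][:dim] for pid in ids}}

blocks = [make_block([2], 2), make_block([0, 3], 2), make_block([1, 4], 1)]
q, tau = (0.0, 0.0, 0.0), 0.7
labels = {pid for pid, vec in originals.items() if euclid(vec, q) <= tau}

answers, achieved, width = cross_block_range_query(
    blocks, originals, q, tau, home=1, labels=labels, target_recall=0.95)
print(sorted(answers), achieved, width)
```

Because dropping coordinates can only shrink a Euclidean distance, the reduced-space filter in this sketch never discards a true answer inside a block; any recall lost comes from answers that live in blocks not yet searched, which is what the widening loop addresses.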
3. A range query indexing method based on high-dimensional data according to claim 2, wherein said data is acron data.
4. A range query indexing method based on high-dimensional data according to claim 3, wherein said dimension reduction process is performed using principal component analysis (PCA).
5. The range query indexing method based on high-dimensional data according to claim 4, wherein the dimension reduced in the dimension reduction process is the largest dimension within the error epsilon;
the error epsilon is the Euclidean distance difference between the original data and the reconstructed data obtained by restoring the dimension-reduced data to the original dimension.
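One possible reading of claim 5 can be sketched with NumPy as follows. The sketch assumes that the error epsilon is the mean Euclidean distance between each original vector and its PCA reconstruction, and that the dimension chosen is the fewest retained components (that is, the largest reduction) whose error does not exceed epsilon. The function names reconstruction_error and choose_dimension and the synthetic data are hypothetical and do not come from the patent.

```python
import numpy as np

def reconstruction_error(X, k):
    """Mean Euclidean distance between rows of X and their rank-k PCA reconstruction."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                 # shape (d, k)
    Z = Xc @ W                   # reduced data, shape (n, k)
    X_hat = Z @ W.T + mean       # restored to the original dimension
    return float(np.linalg.norm(X - X_hat, axis=1).mean())

def choose_dimension(X, eps):
    """Smallest number of retained components whose error is within eps (assumed reading)."""
    d = X.shape[1]
    for k in range(1, d + 1):
        if reconstruction_error(X, k) <= eps:
            return k
    return d

# Hypothetical data: 5-dimensional points lying close to a 2-dimensional plane.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
mix = np.array([[1.0, 0.5, 0.0, 0.0, 0.0],
                [0.0, 0.5, 1.0, 0.0, 0.0]])
X = base @ mix + 0.01 * rng.normal(size=(200, 5))

k = choose_dimension(X, eps=0.05)
print(k, reconstruction_error(X, k))
```

If the claim instead intends a different error measure, such as the change in pairwise distances before and after reconstruction, only reconstruction_error would need to be replaced.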
6. The method of claim 5, wherein the initial value of the number of dimensions contained in each block is 2 or 4.
7. The method of claim 6, wherein the plurality of indexes comprises: GNAT, EPT and M-tree.
8. The method of claim 7, wherein the number of data in each block is 100.
9. The method according to claim 8, wherein in the fourth step, the number of data conforming to a uniform distribution that are extracted from the database is 1000.
10. The method of claim 9, wherein the recall lower limit is 95%.
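The selection of the recall rate R from a time-recall curve, subject to the lower limit of claim 10, might look like the following sketch. The claims do not say how the inflection point is detected, so the rule used here (the interior point with the largest increase in slope) is an assumption, as are the function names recall_rate and pick_recall_target and the sample measurements.

```python
def recall_rate(answers, labels):
    """Fraction of labels present in the answer set (assumes a non-empty label set)."""
    labels = set(labels)
    return len(set(answers) & labels) / len(labels)

def pick_recall_target(curve, lower_limit):
    """Pick R from (recall, time) points as the sharpest upward bend above lower_limit.

    Assumes all recall values in the curve are distinct; returns None when no
    point above the lower limit bends upward.
    """
    pts = sorted(curve)
    best, best_bend = None, 0.0
    for (r0, t0), (r1, t1), (r2, t2) in zip(pts, pts[1:], pts[2:]):
        # Change in slope of time versus recall around the middle point.
        bend = (t2 - t1) / (r2 - r1) - (t1 - t0) / (r1 - r0)
        if r1 > lower_limit and bend > best_bend:
            best, best_bend = r1, bend
    return best

# Hypothetical measurements: query time grows slowly, then sharply near full recall.
curve = [(0.90, 1.0), (0.95, 1.2), (0.97, 1.3),
         (0.98, 1.4), (0.99, 1.6), (1.00, 9.0)]
R = pick_recall_target(curve, lower_limit=0.95)
print(R)
```

In this toy curve the cost of moving from 0.99 to 1.00 recall dwarfs every earlier step, so the bend at 0.99 is chosen; with real measurements a smoothing step before the slope comparison may be needed.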
CN202310060522.XA 2023-01-17 2023-01-17 Range query indexing method based on high-dimensional data Active CN116028500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310060522.XA CN116028500B (en) 2023-01-17 2023-01-17 Range query indexing method based on high-dimensional data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310060522.XA CN116028500B (en) 2023-01-17 2023-01-17 Range query indexing method based on high-dimensional data

Publications (2)

Publication Number Publication Date
CN116028500A (en) 2023-04-28
CN116028500B (en) 2023-07-14

Family

ID=86074116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310060522.XA Active CN116028500B (en) 2023-01-17 2023-01-17 Range query indexing method based on high-dimensional data

Country Status (1)

Country Link
CN (1) CN116028500B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284338A (en) * 2018-10-25 2019-01-29 南京航空航天大学 A kind of satellite remote sensing big data Optimizing Queries method based on hybrid index
CN113010525A (en) * 2021-04-01 2021-06-22 东北大学 Ocean space-time big data parallel KNN query processing method based on PID

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11954614B2 (en) * 2017-02-08 2024-04-09 10X Genomics, Inc. Systems and methods for visualizing a pattern in a dataset
CN108241745B (en) * 2018-01-08 2020-04-28 阿里巴巴集团控股有限公司 Sample set processing method and device and sample query method and device
CN110147377B (en) * 2019-05-29 2022-12-27 大连大学 General query method based on secondary index under large-scale spatial data environment
CN110175175B (en) * 2019-05-29 2023-05-09 大连大学 SPARK-based distributed space secondary index and range query algorithm
CN110489419A (en) * 2019-08-08 2019-11-22 东北大学 A kind of k nearest neighbor approximation querying method based on multilayer local sensitivity Hash
CN111680033A (en) * 2020-04-30 2020-09-18 广州市城市规划勘测设计研究院 High-performance GIS platform
US11886445B2 (en) * 2021-06-29 2024-01-30 United States Of America As Represented By The Secretary Of The Army Classification engineering using regional locality-sensitive hashing (LSH) searches
CN114329094A (en) * 2021-12-31 2022-04-12 上海交通大学 Spark-based large-scale high-dimensional data approximate neighbor query system and method
CN115438230A (en) * 2022-08-30 2022-12-06 西安电子科技大学 Safe and efficient dynamic encrypted cloud data multidimensional range query method

Also Published As

Publication number Publication date
CN116028500A (en) 2023-04-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant