CN115858629B - KNN query method based on learning index - Google Patents


Info

Publication number
CN115858629B
Authority
CN
China
Prior art keywords
data
partition
training data
points
training
Prior art date
Legal status
Active
Application number
CN202211701214.2A
Other languages
Chinese (zh)
Other versions
CN115858629A (en)
Inventor
黎玲利
韩奥
Current Assignee
Heilongjiang University
Original Assignee
Heilongjiang University
Priority date
Filing date
Publication date
Application filed by Heilongjiang University
Priority to CN202211701214.2A
Publication of CN115858629A
Application granted
Publication of CN115858629B
Legal status: Active (current)
Anticipated expiration


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, DB Structures and FS Structures Therefor (AREA)

Abstract

The KNN query method based on the learning index solves the problem that learned indexes are limited and inflexible when performing KNN queries on computer data. The computer data are divided into a training set and a test set according to a Zipfian distribution; the data space of the data set is divided into y non-overlapping partitions using a traditional index, yielding all training data and their corresponding partitions; a deep learning model is trained on all training data and their corresponding partitions to obtain a trained model; for each training datum, the T partitions with the highest probability are selected and the datum's true K nearest points are obtained, and a weighted edge is established between the training datum and each partition and between each of its points and each partition; the points are first sorted by total weight and then each point's edges are sorted by weight, yielding the refined partitions; performing the same operations on a piece of test data in the test set and finding the K points nearest to it yields the KNN result of that test datum.

Description

KNN query method based on learning index
Technical Field
The invention relates to a KNN query method, and in particular to a KNN query method based on a learning index and deep learning, belonging to the field of computers.
Background
KNN search over huge amounts of data in a high-dimensional space is a classical problem worth studying. Let D be a data set of n points in d-dimensional space; given a query point q in the same d-dimensional space, the KNN problem returns the K points of D closest to q under a given distance metric. KNN algorithms generally fall into two categories: exact queries and approximate queries. Exact queries, as the name suggests, guarantee full query accuracy, and prior work has proposed a number of classical tree-based index structures: the K-D tree, M-tree, R-tree, etc. When d is small (e.g., d < 20), a tree index (e.g., a K-D tree) can be used to query computer data; in practice, however, approximate neighbor search is typically performed on high-dimensional vectors, with dimensions usually ranging from 100 to 1000, and as the dimensionality increases these traditional index structures suffer from the "curse of dimensionality."
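For reference, a brute-force exact KNN under this definition fits in a few lines of Python (Euclidean distance assumed as the metric):

```python
import numpy as np

def exact_knn(D, q, K):
    """Return the K rows of D closest to query q (exact, O(n*d))."""
    dists = np.linalg.norm(D - q, axis=1)   # distance from q to every point
    return D[np.argsort(dists)[:K]]          # the K smallest distances

D = np.random.rand(10000, 128)               # n = 10000 points, d = 128
print(exact_knn(D, np.random.rand(128), K=5).shape)  # -> (5, 128)
```

This linear scan is exactly what the "dimension curse" makes unavoidable for tree indexes at high d, which motivates the learned, partition-based approach below.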
To obtain an acceptable retrieval time while keeping retrieval quality reasonable, scholars have proposed approximate nearest neighbor search methods, which trade some query accuracy for faster queries and alleviate the "curse of dimensionality" to some extent. These methods fall mainly into two categories: one improves the performance of the search structure, mostly building on tree structures; the other mainly processes the data itself, and includes hashing algorithms, vector quantization methods, and the like.
Recently, handling this problem with machine learning has become an emerging research direction. Google's research has shown that machine learning models can replace some traditional index structures and can learn the data distribution to some extent. Machine learning runs faster when processing feature vectors, and a traditional index structure can be regarded as solving a classification problem, which is essentially what a neural network does. However, current learned indexes focus mainly on point and range queries over one specific index structure, which makes them limited and inflexible.
Disclosure of Invention
The invention aims to solve the problem that current learning indexes, which focus mainly on point and range queries over one specific index structure, are limited and inflexible when performing KNN queries on computer data, and further provides a KNN query method based on a learning index.
It comprises the following steps:
S1, acquiring a certain amount of computer data as a data set and dividing it into a training set and a test set according to a Zipfian distribution;
S2, dividing the data space of the data set into y non-overlapping partitions using a traditional index and obtaining the partitions in which each training datum's KNN lie: if a partition contains a KNN result, the corresponding position of that partition's label is set to 1, otherwise to 0, yielding all training data and their corresponding partitions;
S3, building a deep learning model and training it on all training data and their corresponding partitions; the model takes a training datum as input and outputs the probability of the datum belonging to each partition, yielding the trained deep learning model;
S4, selecting the T partitions with the highest probability for each training datum, obtaining the datum's true K nearest neighbor points, and establishing a weighted edge between the training datum and each partition and between each of its points and each partition;
taking the training datum and its k points as a set, computing the weight of each point's edge to each partition and the total weight of each point; sorting the points in the set by total weight in descending order, then sorting the edges connected to each point by edge weight in descending order; assigning each point, in that order, to the partition with the largest edge weight, or, if the data capacity of that partition has reached a given threshold, to the partition with the next-largest weight, until all points are assigned, yielding the refined partitions;
S5, inputting a piece of test data from the test set into the deep learning model to obtain its probability vector; selecting the T largest values in the vector to obtain the index number I of each value; taking the refined partitions with index numbers I as candidate partitions and all data points in the candidate partitions as the KNN candidate point set of the test datum; finding the K points nearest to the test datum in the candidate set gives its KNN result.
Further, the specific process in S3 of building the deep learning model, training it on all training data and their corresponding partitions, inputting training data, and outputting the probability of the training data belonging to each partition to obtain the trained model is as follows:
the deep learning model comprises three modules in sequence: module one uses depthwise separable convolution, takes training datum q as input, and outputs feature vector h1; module two comprises two convolution blocks in sequence, takes q and h1 as input, and outputs feature vector h2; h1 and h2 are multiplied to obtain vector h3; module three comprises a convolution layer, a pooling layer, and a 1×1 convolution layer in sequence, takes h3 as input, and outputs the probability vector h of q over the different partitions.
Further, each convolution block in module two comprises, in sequence, three convolution layers, one pooling layer, two convolution layers, one pooling layer, and one fully connected layer.
A KNN query system based on a learning index comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements any step of the above KNN query method based on a learning index.
A computer-readable storage medium stores a computer program, characterized in that the computer program, when executed by a processor, implements any step of the above KNN query method based on a learning index.
The beneficial effects are as follows:
According to the invention, the data space of the data set is divided into y non-overlapping partitions using a traditional index, and a training set and a test set conforming to the Zipfian distribution are generated. The training set is used to train the deep learning model, which takes training data as input and outputs the probability of the training data belonging to each partition. For each training datum, the T partitions with the highest probability are selected, the datum's true K nearest neighbor points are obtained, and weighted edges are established between the training datum and each partition and between each of its points and each partition. The training datum and its k points are taken as a set; the weight of each point's edge to each partition and the total weight of each point are computed; the points are sorted by total weight, the edges connected to each point are sorted by weight, and each point is assigned to the partition with the largest weight, falling back to the partition with the next-largest weight whenever a partition's data capacity has reached the given threshold, until all points are assigned and the refined partitions are complete. A piece of test data is then input into the deep learning model to obtain its probability vector; the T largest values give the index numbers I, the refined partitions with index numbers I serve as candidate partitions, all data points in the candidate partitions form the KNN candidate set of the test datum, and the K points nearest to the test datum in the candidate set form its KNN result.
The invention combines a traditional index with deep learning technology; it is fast, preserves the precision of the original index, and solves the KNN problem quickly. The invention divides the data space of the data set into y non-overlapping regions and converts the KNN problem into a multi-class classification problem. The data points in the library are rearranged so that all nearest neighbors of a query are placed in the same partition as far as possible, greatly reducing query time. A general heuristic learning-index framework is designed that can accelerate KNN queries for multiple existing index structures, offering high flexibility and solving the KNN problem on massive high-dimensional data with high performance in both time and space.
Detailed Description
Embodiment 1: the KNN query method based on the learning index according to this embodiment comprises the following steps:
S1, acquiring a certain amount of computer data as a data set and dividing the data set into a training set and a test set according to a Zipfian distribution (one possible split procedure is sketched below).
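The text does not spell out the split procedure; one plausible reading, sketched below, draws a Zipf-skewed sample of the data set to serve as training data and keeps the remainder for testing. The Zipf parameter a = 2.0, the number of draws, and the use of NumPy's zipf sampler are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((10000, 128))                  # the computer data set (toy)
ranks = rng.zipf(a=2.0, size=8000) - 1        # Zipf-skewed 0-based ranks
train_idx = np.unique(ranks[ranks < len(D)])  # duplicate draws collapse,
                                              # so the split ratio is approximate
test_idx = np.setdiff1d(np.arange(len(D)), train_idx)
train, test = D[train_idx], D[test_idx]       # skewed training set, rest test
```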
S2, dividing the data space of the data set into y non-overlapping partitions using a traditional index and obtaining the partition in which each training datum's KNN lie: if the current partition contains a KNN result, the corresponding position of the current partition's label is set to 1, otherwise 0, yielding all training data and their corresponding partitions.
Each datum has its own data attributes (information), and all data share the same attribute categories. For example, if the adopted data set is a commodity order data set D, each datum in D contains a attributes: commodity name, commodity number, order amount, order time, and so on; the attribute categories of all data are identical, and only the attribute values differ. The method of the original index structure divides the data space of the data set into y partitions, treats each partition as a category (attribute label), and converts the KNN problem into a multi-class classification problem. The label used by the invention is a 0/1 vector whose length is the number of categories. Following the query procedure of the original index, the partition in which each training datum's KNN lies is recorded; if the current partition contains a KNN result, the corresponding position of the partition label is set to 1, otherwise 0.
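As a small illustration, the following Python sketch builds such a 0/1 label vector; partition_of (the partition id the original index assigns to a point) and true_knn (the exact K nearest neighbors of a training datum) are hypothetical helpers, not names from the patent:

```python
import numpy as np

def partition_label(x, y, partition_of, true_knn):
    # Position j of the label is 1 iff partition j holds at least one of x's
    # true K nearest neighbors, giving a 0/1 vector of length y.
    label = np.zeros(y, dtype=np.int8)
    for neighbor in true_knn(x):
        label[partition_of(neighbor)] = 1
    return label
```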
S3, establishing a deep learning model and training it on all training data and their corresponding partitions; the model takes training data as input and outputs the probability of the training data belonging to each partition, yielding the trained deep learning model.
The deep learning model comprises three modules in sequence. Module one uses depthwise separable convolution, capturing the attribute information specific to each input channel with a single convolution filter; it takes training datum q as input and outputs feature vector h1. Module two comprises two convolution blocks in sequence, each consisting of three convolution layers, a pooling layer, two convolution layers, a pooling layer, and a fully connected layer; to retain more of the original information, it takes both training datum q and feature vector h1 as input and outputs feature vector h2. The feature vectors h1 and h2 are multiplied to obtain vector h3. Module three comprises a convolution layer, a pooling layer, and a 1×1 convolution layer in sequence; it takes h3 as input and outputs the probability vector h of training datum q over the different partitions, fusing and further extracting the features produced by modules one and two to obtain the final result of the deep learning model.
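For concreteness, a minimal PyTorch sketch of one way to realize the three modules follows. The channel count, kernel sizes, treating a query q as a 1-channel 1-D signal of length d, restoring each convolution block's output to a (channels × d) map so that h1 and h2 can be multiplied elementwise, and the final sigmoid over per-partition scores (the labels are 0/1 vectors) are all assumptions the text does not fix:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Three conv layers, a pooling layer, two conv layers, a pooling layer,
    # and a fully connected layer, as described for module two.
    def __init__(self, in_ch, ch, d):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.fc = nn.Linear(ch * (d // 4), ch * d)  # back to a (ch, d) map
        self.ch, self.d = ch, d

    def forward(self, x):
        z = self.conv(x).flatten(1)
        return self.fc(z).view(-1, self.ch, self.d)

class PartitionNet(nn.Module):
    def __init__(self, d, y, ch=8):
        super().__init__()
        # Module one: depthwise separable convolution (depthwise + pointwise),
        # one filter per channel as the text describes.
        self.lift = nn.Conv1d(1, ch, 1)
        self.depthwise = nn.Conv1d(ch, ch, 3, padding=1, groups=ch)
        self.pointwise = nn.Conv1d(ch, ch, 1)
        # Module two: two convolution blocks fed with q and h1 together.
        self.block1 = ConvBlock(ch + 1, ch, d)
        self.block2 = ConvBlock(ch, ch, d)
        # Module three: conv layer, pooling layer, 1x1 conv layer.
        self.head = nn.Sequential(
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(ch, y, 1),
        )

    def forward(self, q):                  # q: (batch, 1, d)
        h1 = self.pointwise(self.depthwise(self.lift(q)))
        h2 = self.block2(self.block1(torch.cat([q, h1], dim=1)))
        h3 = h1 * h2                       # feature fusion by elementwise product
        return self.head(h3).mean(dim=2)   # per-partition logits, shape (batch, y)

d, y = 64, 32                              # assumed dimensionality / partition count
model = PartitionNet(d, y)
probs = torch.sigmoid(model(torch.randn(4, 1, d)))  # (4, 32) partition probabilities
```

Training against the 0/1 partition labels with a multi-label loss such as torch.nn.BCEWithLogitsLoss is one natural choice under these assumptions.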
At this point only the candidate partitions of training datum q have been determined; all data points in the candidate partitions must still be compared against q to finally obtain the KNN result. The invention observes that if the KNN results are very dispersed in the original index, many data partitions must be traversed and query efficiency drops sharply; an algorithm for refining the data partitions is therefore proposed, so that all KNN results are placed in T data partitions as far as possible, the number of target partitions in the invention being T = 2.
S4, selecting the T partitions with the highest probability for each training datum and obtaining the datum's true K nearest neighbor points; establishing a weighted edge between the training datum and each partition and between each of its points and each partition; taking the training datum and its corresponding K points as a set, computing the weight from each point in the set to the edge of each partition and the total weight of each point; sorting the points by total weight, then sorting the weights of the edges connected to each point; assigning each point, in that order, to the partition with the largest weight, or, if the data capacity of that partition has reached a given threshold δ, to the partition with the next-largest weight, until all points are assigned, yielding the refined partitions P′.
Refining the data partitions: for the training set X = {x_1, x_2, …, x_n}, let the true K nearest neighbors of X be N = {N(x_1), N(x_2), …, N(x_n)}, where N(x_1) denotes the set of k points nearest to training datum x_1, and for each training datum in X select the T partitions with the highest probability in the deep learning model's prediction, P = {P(x_1), P(x_2), …, P(x_n)}, where P(x_1) denotes the set of T partitions corresponding to x_1. A weighted edge is established between x_i and each partition in P(x_i), and between each point of N(x_i) and each partition in P(x_i), so that x_i and N(x_i) are assigned to the designated partitions as far as possible; the data-assignment problem is thereby converted into a maximum-weight matching problem. A point set S = {s_1, s_2, s_3, s_4, …, s_m} is established; after the above steps are completed, the weight from each point in S to the edge of each partition is obtained, W_S = {W_s1, W_s2, …, W_sm}, where W_si = {(P_a, a), (P_b, b), …, (P_w, w)}, P_a is the id of a partition and a is the weight value to that partition, and the total weight of each point in the point set S is W = {W_1, W_2, …, W_m}, where W_i = a + b + … + w.
Next, the point set S is sorted according to the total weight W of each point, guaranteeing that the most important points (those with the largest total weight) are assigned first, yielding the sorted point set S_sort. Then the weights W_si of the edges connected to each data point s_i are sorted, so that each data point is preferentially assigned to the partition with the largest connected weight; if the data capacity of that partition has reached the given threshold δ, the point is assigned to the partition with the next-largest weight, and so on until a partition with remaining capacity takes it. After the reassignment of the data points in S is completed, the final refined partitions P′ are obtained.
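As an illustration, the greedy assignment can be sketched as follows. The name refine_partitions, the per-point weight dictionaries, and a uniform capacity for all partitions are assumptions; the two-level ordering (points by total weight, then each point's edges by weight) and the capacity check against the threshold follow the text. A point whose candidate partitions are all full simply stays unassigned in this sketch:

```python
from collections import defaultdict

def refine_partitions(edge_weights, capacity):
    """edge_weights: {point_id: {partition_id: weight}} for the point set S
    (each training datum plus its k true neighbors).
    Returns {partition_id: [point_id, ...]}, the refined partitions P'."""
    # Sort points by total weight, descending: most important points first.
    order = sorted(edge_weights,
                   key=lambda p: sum(edge_weights[p].values()), reverse=True)
    refined = defaultdict(list)
    for p in order:
        # Try this point's partitions from largest to smallest edge weight,
        # skipping any partition that has reached the capacity threshold.
        for part, _w in sorted(edge_weights[p].items(),
                               key=lambda kv: kv[1], reverse=True):
            if len(refined[part]) < capacity:
                refined[part].append(p)
                break
    return dict(refined)

edges = {"x1": {0: 5.0, 3: 2.0}, "n1": {0: 4.0, 3: 3.0}, "n2": {3: 1.0}}
print(refine_partitions(edges, capacity=1))
# -> {0: ['x1'], 3: ['n1']}  (n2's only candidate partition is already full)
```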
S5, a piece of test data q_i from the test set is input into the deep learning model to obtain the probability vector h_i corresponding to q_i. The first T values with the largest magnitude in h_i are selected, giving the index number I of each value; the partitions with index numbers I among the refined partitions P′ are taken as the candidate partitions P′_c, all data points in the candidate partitions are taken as the KNN candidate point set of q_i, and finally the K points nearest to q_i are found in the candidate point set, giving the KNN result K′(q_i) of the test datum. For example: inputting q_i yields h_i = {0.4, 0.9, 0.5, 0.8}; with T = 2, I = {2, 4} and P′_c = {P′_2, P′_4}; the distances from all points in P′_c to q_i are then computed, finally yielding K′(q_i). K′(q_i) is the approximate KNN result of q_i, while K(q_i) is the true KNN result of q_i; the model and algorithm of the invention compute K′(q_i) quickly while keeping K′(q_i) as close as possible to K(q_i).
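Below is a minimal Python sketch of this query step, reusing the h_i = {0.4, 0.9, 0.5, 0.8}, T = 2 example; the 1-based index numbers match I = {2, 4} in the text, while the function name knn_query, the dictionary layouts of refined and points, and the Euclidean metric are illustrative assumptions:

```python
import heapq
import numpy as np

def knn_query(q, probs, refined, points, T=2, K=2):
    # Index numbers I of the T largest probabilities (1-based, as in the text).
    I = [int(j) + 1 for j in np.argsort(probs)[-T:]]
    # P'_c: the refined partitions with index numbers I; their data points
    # form the KNN candidate point set of q.
    candidates = [pid for part in I for pid in refined.get(part, [])]
    # K'(q): the K candidates nearest to q, by brute force over candidates.
    return heapq.nsmallest(K, candidates,
                           key=lambda pid: np.linalg.norm(points[pid] - q))

h_i = np.array([0.4, 0.9, 0.5, 0.8])          # example from the text
print(sorted(int(j) + 1 for j in np.argsort(h_i)[-2:]))  # -> [2, 4]

refined = {2: ["a"], 4: ["b", "c"]}           # toy refined partitions P'
points = {"a": np.array([0.0, 0.0]), "b": np.array([1.0, 1.0]),
          "c": np.array([2.0, 2.0])}
print(knn_query(np.array([0.0, 0.0]), h_i, refined, points))  # -> ['a', 'b']
```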
Embodiment 2: the KNN query system based on the learning index according to this embodiment comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor, when executing the computer program, implements any step of the KNN query method based on the learning index.
Embodiment 3: the computer-readable storage medium according to this embodiment stores a computer program, characterized in that the computer program, when executed by a processor, implements any step of the KNN query method based on the learning index.

Claims (5)

1. A KNN query method based on a learning index, characterized in that it comprises the following steps:
S1, acquiring a certain amount of computer data as a data set and dividing it into a training set and a test set according to a Zipfian distribution;
S2, dividing the data space of the data set into y non-overlapping partitions using a traditional index and obtaining the partitions in which each training datum's KNN lie: if a partition contains a KNN result, the corresponding position of that partition's label is set to 1, otherwise to 0, yielding all training data and their corresponding partitions;
S3, building a deep learning model and training it on all training data and their corresponding partitions; the model takes a training datum as input and outputs the probability of the datum belonging to each partition, yielding the trained deep learning model;
S4, selecting the T partitions with the highest probability for each training datum, obtaining the datum's true K nearest neighbor points, and establishing a weighted edge between the training datum and each partition and between each of its points and each partition;
taking the training datum and its k points as a set, computing the weight of each point's edge to each partition and the total weight of each point; sorting the points in the set by total weight in descending order, then sorting the edges connected to each point by edge weight in descending order; assigning each point, in that order, to the partition with the largest edge weight, or, if the data capacity of that partition has reached a given threshold, to the partition with the next-largest weight, until all points are assigned, yielding the refined partitions;
S5, inputting a piece of test data from the test set into the deep learning model to obtain its probability vector; selecting the T largest values in the vector to obtain the index number I of each value; taking the refined partitions with index numbers I as candidate partitions and all data points in the candidate partitions as the KNN candidate point set of the test datum; finding the K points nearest to the test datum in the candidate set gives its KNN result.
2. The KNN query method based on a learning index according to claim 1, characterized in that the specific process in S3 of building the deep learning model, training it on all training data and their corresponding partitions, inputting training data, and outputting the probability of the training data belonging to each partition to obtain the trained model is as follows:
the deep learning model comprises three modules in sequence: module one uses depthwise separable convolution, takes training datum q as input, and outputs feature vector h1; module two comprises two convolution blocks in sequence, takes q and h1 as input, and outputs feature vector h2; h1 and h2 are multiplied to obtain vector h3; module three comprises a convolution layer, a pooling layer, and a 1×1 convolution layer in sequence, takes h3 as input, and outputs the probability vector h of q over the different partitions.
3. The KNN query method based on a learning index according to claim 1, characterized in that each convolution block in module two comprises, in sequence, three convolution layers, one pooling layer, two convolution layers, one pooling layer, and one fully connected layer.
4. A KNN query system based on a learning index, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 3.
5. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 3.
CN202211701214.2A · Priority 2022-12-28 · Filed 2022-12-28 · KNN query method based on learning index · Active · CN115858629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211701214.2A CN115858629B (en) 2022-12-28 2022-12-28 KNN query method based on learning index


Publications (2)

Publication Number Publication Date
CN115858629A (en) 2023-03-28
CN115858629B (en) 2023-06-23

Family

ID=85655655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211701214.2A Active CN115858629B (en) 2022-12-28 2022-12-28 KNN query method based on learning index

Country Status (1)

Country Link
CN (1) CN115858629B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220374655A1 (en) * 2021-05-17 2022-11-24 Fujitsu Limited Data summarization for training machine learning models

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782804A (en) * 2020-06-09 2020-10-16 中科院成都信息技术股份有限公司 TextCNN-based same-distribution text data selection method, system and storage medium
CN114218292A (en) * 2021-11-08 2022-03-22 中国人民解放军国防科技大学 Multi-element time sequence similarity retrieval method
CN115049894A (en) * 2022-05-31 2022-09-13 北京交通大学 Target re-identification method of global structure information embedded network based on graph learning

Also Published As

Publication number Publication date
CN115858629A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN108920720B (en) Large-scale image retrieval method based on depth hash and GPU acceleration
CN106095893B (en) A kind of cross-media retrieval method
CN104199827B (en) The high dimensional indexing method of large scale multimedia data based on local sensitivity Hash
CN112382352A (en) Method for quickly evaluating structural characteristics of metal organic framework material based on machine learning
CN109033172B (en) Image retrieval method for deep learning and approximate target positioning
CN112232413B (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
CN110188225B (en) Image retrieval method based on sequencing learning and multivariate loss
CN108897791B (en) Image retrieval method based on depth convolution characteristics and semantic similarity measurement
CN104392250A (en) Image classification method based on MapReduce
CN109635140B (en) Image retrieval method based on deep learning and density peak clustering
JP2023523029A (en) Image recognition model generation method, apparatus, computer equipment and storage medium
CN107545276A (en) The various visual angles learning method of joint low-rank representation and sparse regression
CN110929161A (en) Large-scale user-oriented personalized teaching resource recommendation method
CN105320764A (en) 3D model retrieval method and 3D model retrieval apparatus based on slow increment features
CN110046713A (en) Robustness sequence learning method and its application based on multi-objective particle swarm optimization
CN108564116A (en) A kind of ingredient intelligent analysis method of camera scene image
CN114556364A (en) Neural architecture search based on similarity operator ordering
CN113516019A (en) Hyperspectral image unmixing method and device and electronic equipment
CN115858629B (en) KNN query method based on learning index
CN113704565B (en) Learning type space-time index method, device and medium based on global interval error
CN113779287B (en) Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN113610350B (en) Complex working condition fault diagnosis method, equipment, storage medium and device
CN110750672B (en) Image retrieval method based on deep measurement learning and structure distribution learning loss
CN114663770A (en) Hyperspectral image classification method and system based on integrated clustering waveband selection
Devi et al. Similarity measurement in recent biased time series databases using different clustering methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant