CN115858629B - KNN query method based on learning index - Google Patents


Info

Publication number
CN115858629B
Authority
CN
China
Prior art keywords
data
partition
training data
points
training
Prior art date
Legal status
Active
Application number
CN202211701214.2A
Other languages
Chinese (zh)
Other versions
CN115858629A (en)
Inventor
黎玲利
韩奥
Current Assignee
Heilongjiang University
Original Assignee
Heilongjiang University
Priority date
Filing date
Publication date
Application filed by Heilongjiang University
Priority to CN202211701214.2A
Publication of CN115858629A
Application granted
Publication of CN115858629B
Legal status: Active (current)
Anticipated expiration


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, DB Structures and FS Structures Therefor (AREA)

Abstract

The KNN query method based on the learning index solves the problem that learned indexes are limited and inflexible when performing KNN queries on computer data. The computer data are divided into a training set and a test set according to a Zipfian distribution; the data space of the data set is divided into y non-overlapping partitions using a traditional index, yielding all training data and their corresponding partitions; a deep learning model is trained on all training data and their corresponding partitions to obtain a trained model; for each training datum, the T partitions with the highest probability are selected and the datum's true K nearest points are obtained, and a weighted edge is established between the training datum and each partition and between each of its points and each partition; the points are first sorted by total weight and then each point's edges are sorted by weight, yielding the refined partitions; performing the same operations on a piece of test data in the test set and finding the K points nearest to it yields the KNN result of that test datum.

Description

KNN query method based on learning index
Technical Field
The invention relates to a KNN query method, and in particular to a KNN query method based on a learning index and deep learning, belonging to the field of computers.
Background
KNN search over huge amounts of data in a high-dimensional space is a classical problem worth studying. Let D be a data set of n points in d-dimensional space; given a query point q in the same d-dimensional space, the KNN problem returns the K points of D closest to q under a given distance metric. KNN algorithms generally fall into two categories: exact queries and approximate queries. Exact queries, as the name suggests, guarantee full query accuracy, and prior work has proposed a number of classical tree-based index structures: the K-D tree, M-tree, R-tree, etc. When d is small (e.g., d < 20), a tree index (e.g., a K-D tree) can be used to query computer data; in practice, however, approximate neighbor search is typically performed on high-dimensional vectors, with dimensions usually ranging from 100 to 1000, and as the dimensionality increases these traditional index structures suffer from the "curse of dimensionality."
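For reference, a brute-force exact KNN under this definition fits in a few lines of Python (Euclidean distance assumed as the metric):

```python
import numpy as np

def exact_knn(D, q, K):
    """Return the K rows of D closest to query q (exact, O(n*d))."""
    dists = np.linalg.norm(D - q, axis=1)   # distance from q to every point
    return D[np.argsort(dists)[:K]]          # the K smallest distances

D = np.random.rand(10000, 128)               # n = 10000 points, d = 128
print(exact_knn(D, np.random.rand(128), K=5).shape)  # -> (5, 128)
```

This linear scan is exactly what the "dimension curse" makes unavoidable for tree indexes at high d, which motivates the learned, partition-based approach below.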
To obtain an acceptable retrieval time while keeping retrieval quality reasonable, scholars have proposed approximate nearest neighbor search methods, which trade some query accuracy for faster queries and alleviate the "curse of dimensionality" to some extent. These methods fall mainly into two categories: one improves the performance of the search structure, mostly building on tree structures; the other mainly processes the data itself, and includes hashing algorithms, vector quantization methods, and the like.
Recently, handling this problem with machine learning has become an emerging research direction. Google's research has shown that machine learning models can replace some traditional index structures and can learn the data distribution to some extent. Machine learning runs faster when processing feature vectors, and a traditional index structure can be regarded as solving a classification problem, which is essentially what a neural network does. However, current learned indexes focus mainly on point and range queries over one specific index structure, which makes them limited and inflexible.
Disclosure of Invention
The invention aims to solve the problem that current learning indexes, which focus mainly on point and range queries over one specific index structure, are limited and inflexible when performing KNN queries on computer data, and further provides a KNN query method based on a learning index.
It comprises the following steps:
S1, acquiring a certain amount of computer data as a data set and dividing it into a training set and a test set according to a Zipfian distribution;
S2, dividing the data space of the data set into y non-overlapping partitions using a traditional index and obtaining the partitions in which each training datum's KNN lie: if a partition contains a KNN result, the corresponding position of that partition's label is set to 1, otherwise to 0, yielding all training data and their corresponding partitions;
S3, building a deep learning model and training it on all training data and their corresponding partitions; the model takes a training datum as input and outputs the probability of the datum belonging to each partition, yielding the trained deep learning model;
S4, selecting the T partitions with the highest probability for each training datum, obtaining the datum's true K nearest neighbor points, and establishing a weighted edge between the training datum and each partition and between each of its points and each partition;
taking the training datum and its k points as a set, computing the weight of each point's edge to each partition and the total weight of each point; sorting the points in the set by total weight in descending order, then sorting the edges connected to each point by edge weight in descending order; assigning each point, in that order, to the partition with the largest edge weight, or, if the data capacity of that partition has reached a given threshold, to the partition with the next-largest weight, until all points are assigned, yielding the refined partitions;
S5, inputting a piece of test data from the test set into the deep learning model to obtain its probability vector; selecting the T largest values in the vector to obtain the index number I of each value; taking the refined partitions with index numbers I as candidate partitions and all data points in the candidate partitions as the KNN candidate point set of the test datum; finding the K points nearest to the test datum in the candidate set gives its KNN result.
Further, the specific process in S3 of building the deep learning model, training it on all training data and their corresponding partitions, inputting training data, and outputting the probability of the training data belonging to each partition to obtain the trained model is as follows:
the deep learning model comprises three modules in sequence: module one uses depthwise separable convolution, takes training datum q as input, and outputs feature vector h1; module two comprises two convolution blocks in sequence, takes q and h1 as input, and outputs feature vector h2; h1 and h2 are multiplied to obtain vector h3; module three comprises a convolution layer, a pooling layer, and a 1×1 convolution layer in sequence, takes h3 as input, and outputs the probability vector h of q over the different partitions.
Further, each convolution block in module two comprises, in sequence, three convolution layers, one pooling layer, two convolution layers, one pooling layer, and one fully connected layer.
A KNN query system based on a learning index comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements any step of the above KNN query method based on a learning index.
A computer-readable storage medium stores a computer program, characterized in that the computer program, when executed by a processor, implements any step of the above KNN query method based on a learning index.
The beneficial effects are as follows:
According to the invention, the data space of the data set is divided into y non-overlapping partitions using a traditional index, and a training set and a test set conforming to the Zipfian distribution are generated. The training set is used to train the deep learning model, which takes training data as input and outputs the probability of the training data belonging to each partition. For each training datum, the T partitions with the highest probability are selected, the datum's true K nearest neighbor points are obtained, and weighted edges are established between the training datum and each partition and between each of its points and each partition. The training datum and its k points are taken as a set; the weight of each point's edge to each partition and the total weight of each point are computed; the points are sorted by total weight, the edges connected to each point are sorted by weight, and each point is assigned to the partition with the largest weight, falling back to the partition with the next-largest weight whenever a partition's data capacity has reached the given threshold, until all points are assigned and the refined partitions are complete. A piece of test data is then input into the deep learning model to obtain its probability vector; the T largest values give the index numbers I, the refined partitions with index numbers I serve as candidate partitions, all data points in the candidate partitions form the KNN candidate set of the test datum, and the K points nearest to the test datum in the candidate set form its KNN result.
The invention combines a traditional index with deep learning technology; it is fast, preserves the precision of the original index, and solves the KNN problem quickly. The invention divides the data space of the data set into y non-overlapping regions and converts the KNN problem into a multi-class classification problem. The data points in the library are rearranged so that all nearest neighbors of a query are placed in the same partition as far as possible, greatly reducing query time. A general heuristic learning-index framework is designed that can accelerate KNN queries for multiple existing index structures, offering high flexibility and solving the KNN problem on massive high-dimensional data with high performance in both time and space.
Detailed Description
Embodiment 1: the KNN query method based on the learning index according to this embodiment comprises the following steps:
S1, acquiring a certain amount of computer data as a data set and dividing the data set into a training set and a test set according to a Zipfian distribution (one possible split procedure is sketched below).
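The text does not spell out the split procedure; one plausible reading, sketched below, draws a Zipf-skewed sample of the data set to serve as training data and keeps the remainder for testing. The Zipf parameter a = 2.0, the number of draws, and the use of NumPy's zipf sampler are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((10000, 128))                  # the computer data set (toy)
ranks = rng.zipf(a=2.0, size=8000) - 1        # Zipf-skewed 0-based ranks
train_idx = np.unique(ranks[ranks < len(D)])  # duplicate draws collapse,
                                              # so the split ratio is approximate
test_idx = np.setdiff1d(np.arange(len(D)), train_idx)
train, test = D[train_idx], D[test_idx]       # skewed training set, rest test
```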
S2, dividing the data space of the data set into y non-overlapping partitions using a traditional index and obtaining the partition in which each training datum's KNN lie: if the current partition contains a KNN result, the corresponding position of the current partition's label is set to 1, otherwise 0, yielding all training data and their corresponding partitions.
Each datum has its own data attributes (information), and all data share the same attribute categories. For example, if the adopted data set is a commodity order data set D, each datum in D contains a attributes: commodity name, commodity number, order amount, order time, and so on; the attribute categories of all data are identical, and only the attribute values differ. The method of the original index structure divides the data space of the data set into y partitions, treats each partition as a category (attribute label), and converts the KNN problem into a multi-class classification problem. The label used by the invention is a 0/1 vector whose length is the number of categories. Following the query procedure of the original index, the partition in which each training datum's KNN lies is recorded; if the current partition contains a KNN result, the corresponding position of the partition label is set to 1, otherwise 0.
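As a small illustration, the following Python sketch builds such a 0/1 label vector; partition_of (the partition id the original index assigns to a point) and true_knn (the exact K nearest neighbors of a training datum) are hypothetical helpers, not names from the patent:

```python
import numpy as np

def partition_label(x, y, partition_of, true_knn):
    # Position j of the label is 1 iff partition j holds at least one of x's
    # true K nearest neighbors, giving a 0/1 vector of length y.
    label = np.zeros(y, dtype=np.int8)
    for neighbor in true_knn(x):
        label[partition_of(neighbor)] = 1
    return label
```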
S3, establishing a deep learning model and training it on all training data and their corresponding partitions; the model takes training data as input and outputs the probability of the training data belonging to each partition, yielding the trained deep learning model.
The deep learning model comprises three modules in sequence. Module one uses depthwise separable convolution, capturing the attribute information specific to each input channel with a single convolution filter; it takes training datum q as input and outputs feature vector h1. Module two comprises two convolution blocks in sequence, each consisting of three convolution layers, a pooling layer, two convolution layers, a pooling layer, and a fully connected layer; to retain more of the original information, it takes both training datum q and feature vector h1 as input and outputs feature vector h2. The feature vectors h1 and h2 are multiplied to obtain vector h3. Module three comprises a convolution layer, a pooling layer, and a 1×1 convolution layer in sequence; it takes h3 as input and outputs the probability vector h of training datum q over the different partitions, fusing and further extracting the features produced by modules one and two to obtain the final result of the deep learning model.
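For concreteness, a minimal PyTorch sketch of one way to realize the three modules follows. The channel count, kernel sizes, treating a query q as a 1-channel 1-D signal of length d, restoring each convolution block's output to a (channels × d) map so that h1 and h2 can be multiplied elementwise, and the final sigmoid over per-partition scores (the labels are 0/1 vectors) are all assumptions the text does not fix:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Three conv layers, a pooling layer, two conv layers, a pooling layer,
    # and a fully connected layer, as described for module two.
    def __init__(self, in_ch, ch, d):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.fc = nn.Linear(ch * (d // 4), ch * d)  # back to a (ch, d) map
        self.ch, self.d = ch, d

    def forward(self, x):
        z = self.conv(x).flatten(1)
        return self.fc(z).view(-1, self.ch, self.d)

class PartitionNet(nn.Module):
    def __init__(self, d, y, ch=8):
        super().__init__()
        # Module one: depthwise separable convolution (depthwise + pointwise),
        # one filter per channel as the text describes.
        self.lift = nn.Conv1d(1, ch, 1)
        self.depthwise = nn.Conv1d(ch, ch, 3, padding=1, groups=ch)
        self.pointwise = nn.Conv1d(ch, ch, 1)
        # Module two: two convolution blocks fed with q and h1 together.
        self.block1 = ConvBlock(ch + 1, ch, d)
        self.block2 = ConvBlock(ch, ch, d)
        # Module three: conv layer, pooling layer, 1x1 conv layer.
        self.head = nn.Sequential(
            nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(ch, y, 1),
        )

    def forward(self, q):                  # q: (batch, 1, d)
        h1 = self.pointwise(self.depthwise(self.lift(q)))
        h2 = self.block2(self.block1(torch.cat([q, h1], dim=1)))
        h3 = h1 * h2                       # feature fusion by elementwise product
        return self.head(h3).mean(dim=2)   # per-partition logits, shape (batch, y)

d, y = 64, 32                              # assumed dimensionality / partition count
model = PartitionNet(d, y)
probs = torch.sigmoid(model(torch.randn(4, 1, d)))  # (4, 32) partition probabilities
```

Training against the 0/1 partition labels with a multi-label loss such as torch.nn.BCEWithLogitsLoss is one natural choice under these assumptions.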
At this point only the candidate partitions of training datum q have been determined; all data points in the candidate partitions must still be compared against q to finally obtain the KNN result. The invention observes that if the KNN results are very dispersed in the original index, many data partitions must be traversed and query efficiency drops sharply; an algorithm for refining the data partitions is therefore proposed, so that all KNN results are placed in T data partitions as far as possible, the number of target partitions in the invention being T = 2.
S4, selecting the T partitions with the highest probability for each training datum and obtaining the datum's true K nearest neighbor points; establishing a weighted edge between the training datum and each partition and between each of its points and each partition; taking the training datum and its corresponding K points as a set, computing the weight from each point in the set to the edge of each partition and the total weight of each point; sorting the points by total weight, then sorting the weights of the edges connected to each point; assigning each point, in that order, to the partition with the largest weight, or, if the data capacity of that partition has reached a given threshold δ, to the partition with the next-largest weight, until all points are assigned, yielding the refined partitions P′.
Refining the data partitions: for the training set X = {x_1, x_2, …, x_n}, let the true K nearest neighbors of X be N = {N(x_1), N(x_2), …, N(x_n)}, where N(x_1) denotes the set of k points nearest to training datum x_1, and for each training datum in X select the T partitions with the highest probability in the deep learning model's prediction, P = {P(x_1), P(x_2), …, P(x_n)}, where P(x_1) denotes the set of T partitions corresponding to x_1. A weighted edge is established between x_i and each partition in P(x_i), and between each point of N(x_i) and each partition in P(x_i), so that x_i and N(x_i) are assigned to the designated partitions as far as possible; the data-assignment problem is thereby converted into a maximum-weight matching problem. A point set S = {s_1, s_2, s_3, s_4, …, s_m} is established; after the above steps are completed, the weight from each point in S to the edge of each partition is obtained, W_S = {W_s1, W_s2, …, W_sm}, where W_si = {(P_a, a), (P_b, b), …, (P_w, w)}, P_a is the id of a partition and a is the weight value to that partition, and the total weight of each point in the point set S is W = {W_1, W_2, …, W_m}, where W_i = a + b + … + w.
Next, the point set S is sorted according to the total weight W of each point, guaranteeing that the most important points (those with the largest total weight) are assigned first, yielding the sorted point set S_sort. Then the weights W_si of the edges connected to each data point s_i are sorted, so that each data point is preferentially assigned to the partition with the largest connected weight; if the data capacity of that partition has reached the given threshold δ, the point is assigned to the partition with the next-largest weight, and so on until a partition with remaining capacity takes it. After the reassignment of the data points in S is completed, the final refined partitions P′ are obtained.
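As an illustration, the greedy assignment can be sketched as follows. The name refine_partitions, the per-point weight dictionaries, and a uniform capacity for all partitions are assumptions; the two-level ordering (points by total weight, then each point's edges by weight) and the capacity check against the threshold follow the text. A point whose candidate partitions are all full simply stays unassigned in this sketch:

```python
from collections import defaultdict

def refine_partitions(edge_weights, capacity):
    """edge_weights: {point_id: {partition_id: weight}} for the point set S
    (each training datum plus its k true neighbors).
    Returns {partition_id: [point_id, ...]}, the refined partitions P'."""
    # Sort points by total weight, descending: most important points first.
    order = sorted(edge_weights,
                   key=lambda p: sum(edge_weights[p].values()), reverse=True)
    refined = defaultdict(list)
    for p in order:
        # Try this point's partitions from largest to smallest edge weight,
        # skipping any partition that has reached the capacity threshold.
        for part, _w in sorted(edge_weights[p].items(),
                               key=lambda kv: kv[1], reverse=True):
            if len(refined[part]) < capacity:
                refined[part].append(p)
                break
    return dict(refined)

edges = {"x1": {0: 5.0, 3: 2.0}, "n1": {0: 4.0, 3: 3.0}, "n2": {3: 1.0}}
print(refine_partitions(edges, capacity=1))
# -> {0: ['x1'], 3: ['n1']}  (n2's only candidate partition is already full)
```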
S5, a piece of test data q_i from the test set is input into the deep learning model to obtain the probability vector h_i corresponding to q_i. The first T values with the largest magnitude in h_i are selected, giving the index number I of each value; the partitions with index numbers I among the refined partitions P′ are taken as the candidate partitions P′_c, all data points in the candidate partitions are taken as the KNN candidate point set of q_i, and finally the K points nearest to q_i are found in the candidate point set, giving the KNN result K′(q_i) of the test datum. For example: inputting q_i yields h_i = {0.4, 0.9, 0.5, 0.8}; with T = 2, I = {2, 4} and P′_c = {P′_2, P′_4}; the distances from all points in P′_c to q_i are then computed, finally yielding K′(q_i). K′(q_i) is the approximate KNN result of q_i, while K(q_i) is the true KNN result of q_i; the model and algorithm of the invention compute K′(q_i) quickly while keeping K′(q_i) as close as possible to K(q_i).
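Below is a minimal Python sketch of this query step, reusing the h_i = {0.4, 0.9, 0.5, 0.8}, T = 2 example; the 1-based index numbers match I = {2, 4} in the text, while the function name knn_query, the dictionary layouts of refined and points, and the Euclidean metric are illustrative assumptions:

```python
import heapq
import numpy as np

def knn_query(q, probs, refined, points, T=2, K=2):
    # Index numbers I of the T largest probabilities (1-based, as in the text).
    I = [int(j) + 1 for j in np.argsort(probs)[-T:]]
    # P'_c: the refined partitions with index numbers I; their data points
    # form the KNN candidate point set of q.
    candidates = [pid for part in I for pid in refined.get(part, [])]
    # K'(q): the K candidates nearest to q, by brute force over candidates.
    return heapq.nsmallest(K, candidates,
                           key=lambda pid: np.linalg.norm(points[pid] - q))

h_i = np.array([0.4, 0.9, 0.5, 0.8])          # example from the text
print(sorted(int(j) + 1 for j in np.argsort(h_i)[-2:]))  # -> [2, 4]

refined = {2: ["a"], 4: ["b", "c"]}           # toy refined partitions P'
points = {"a": np.array([0.0, 0.0]), "b": np.array([1.0, 1.0]),
          "c": np.array([2.0, 2.0])}
print(knn_query(np.array([0.0, 0.0]), h_i, refined, points))  # -> ['a', 'b']
```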
Embodiment 2: the KNN query system based on the learning index according to this embodiment comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor, when executing the computer program, implements any step of the KNN query method based on the learning index.
Embodiment 3: the computer-readable storage medium according to this embodiment stores a computer program, characterized in that the computer program, when executed by a processor, implements any step of the KNN query method based on the learning index.

Claims (5)

1. A KNN query method based on a learning index, characterized in that it comprises the following steps:
S1, acquiring a certain amount of computer data as a data set and dividing it into a training set and a test set according to a Zipfian distribution;
S2, dividing the data space of the data set into y non-overlapping partitions using a traditional index and obtaining the partitions in which each training datum's KNN lie: if a partition contains a KNN result, the corresponding position of that partition's label is set to 1, otherwise to 0, yielding all training data and their corresponding partitions;
S3, building a deep learning model and training it on all training data and their corresponding partitions; the model takes a training datum as input and outputs the probability of the datum belonging to each partition, yielding the trained deep learning model;
S4, selecting the T partitions with the highest probability for each training datum, obtaining the datum's true K nearest neighbor points, and establishing a weighted edge between the training datum and each partition and between each of its points and each partition;
taking the training datum and its k points as a set, computing the weight of each point's edge to each partition and the total weight of each point; sorting the points in the set by total weight in descending order, then sorting the edges connected to each point by edge weight in descending order; assigning each point, in that order, to the partition with the largest edge weight, or, if the data capacity of that partition has reached a given threshold, to the partition with the next-largest weight, until all points are assigned, yielding the refined partitions;
S5, inputting a piece of test data from the test set into the deep learning model to obtain its probability vector; selecting the T largest values in the vector to obtain the index number I of each value; taking the refined partitions with index numbers I as candidate partitions and all data points in the candidate partitions as the KNN candidate point set of the test datum; finding the K points nearest to the test datum in the candidate set gives its KNN result.
2. The KNN query method based on a learning index according to claim 1, characterized in that the specific process in S3 of building the deep learning model, training it on all training data and their corresponding partitions, inputting training data, and outputting the probability of the training data belonging to each partition to obtain the trained model is as follows:
the deep learning model comprises three modules in sequence: module one uses depthwise separable convolution, takes training datum q as input, and outputs feature vector h1; module two comprises two convolution blocks in sequence, takes q and h1 as input, and outputs feature vector h2; h1 and h2 are multiplied to obtain vector h3; module three comprises a convolution layer, a pooling layer, and a 1×1 convolution layer in sequence, takes h3 as input, and outputs the probability vector h of q over the different partitions.
3. The KNN query method based on a learning index according to claim 1, characterized in that each convolution block in module two comprises, in sequence, three convolution layers, one pooling layer, two convolution layers, one pooling layer, and one fully connected layer.
4. A KNN query system based on a learning index, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 3.
5. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 3.
CN202211701214.2A · Priority 2022-12-28 · Filed 2022-12-28 · KNN query method based on learning index · Active · CN115858629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211701214.2A CN115858629B (en) 2022-12-28 2022-12-28 KNN query method based on learning index


Publications (2)

Publication Number Publication Date
CN115858629A (en) 2023-03-28
CN115858629B (en) 2023-06-23

Family

ID=85655655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211701214.2A Active CN115858629B (en) 2022-12-28 2022-12-28 KNN query method based on learning index

Country Status (1)

Country Link
CN (1) CN115858629B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220374655A1 (en) * 2021-05-17 2022-11-24 Fujitsu Limited Data summarization for training machine learning models

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782804A (en) * 2020-06-09 2020-10-16 中科院成都信息技术股份有限公司 TextCNN-based same-distribution text data selection method, system and storage medium
CN114218292A (en) * 2021-11-08 2022-03-22 中国人民解放军国防科技大学 Multi-element time sequence similarity retrieval method
CN115049894A (en) * 2022-05-31 2022-09-13 北京交通大学 Target re-identification method of global structure information embedded network based on graph learning

Also Published As

Publication number Publication date
CN115858629A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN108920720B (en) Large-scale image retrieval method based on depth hash and GPU acceleration
CN106095893B (en) A kind of cross-media retrieval method
CN104199827B (en) The high dimensional indexing method of large scale multimedia data based on local sensitivity Hash
CN112382352A (en) Method for quickly evaluating structural characteristics of metal organic framework material based on machine learning
CN109033172B (en) Image retrieval method for deep learning and approximate target positioning
CN112232413B (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
CN110188225B (en) Image retrieval method based on sequencing learning and multivariate loss
CN108897791B (en) Image retrieval method based on depth convolution characteristics and semantic similarity measurement
CN104392250A (en) Image classification method based on MapReduce
CN109635140B (en) Image retrieval method based on deep learning and density peak clustering
JP2023523029A (en) Image recognition model generation method, apparatus, computer equipment and storage medium
CN107545276A (en) The various visual angles learning method of joint low-rank representation and sparse regression
CN110929161A (en) Large-scale user-oriented personalized teaching resource recommendation method
CN105320764A (en) 3D model retrieval method and 3D model retrieval apparatus based on slow increment features
CN110046713A (en) Robustness sequence learning method and its application based on multi-objective particle swarm optimization
CN108564116A (en) A kind of ingredient intelligent analysis method of camera scene image
CN114556364A (en) Neural architecture search based on similarity operator ordering
CN113516019A (en) Hyperspectral image unmixing method and device and electronic equipment
CN115858629B (en) KNN query method based on learning index
CN113704565B (en) Learning type space-time index method, device and medium based on global interval error
CN113779287B (en) Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
CN113610350B (en) Complex working condition fault diagnosis method, equipment, storage medium and device
CN110750672B (en) Image retrieval method based on deep measurement learning and structure distribution learning loss
CN114663770A (en) Hyperspectral image classification method and system based on integrated clustering waveband selection
Devi et al. Similarity measurement in recent biased time series databases using different clustering methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant