CN118093633A - High-dimensional vector query method, device, computer equipment and readable storage medium


Info

Publication number
CN118093633A
CN118093633A
Authority
CN
China
Prior art keywords
vector
queried
sub
dimensional
clustering
Prior art date
Legal status
Pending
Application number
CN202410464850.0A
Other languages
Chinese (zh)
Inventor
邓泽
李风韦
王力哲
严坤
Current Assignee
China University of Geosciences
Original Assignee
China University of Geosciences
Application filed by China University of Geosciences
Priority to CN202410464850.0A
Publication of CN118093633A
Status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a high-dimensional vector query method, apparatus, computer device and readable storage medium, relating to the field of vector query; it can strengthen a model's ability to query high-dimensional vector datasets and can significantly improve query precision while maintaining high query efficiency. The method comprises the following steps: responding to a high-dimensional vector query instruction and acquiring the high-dimensional vector to be queried; acquiring a dimension reduction module and using it to perform dimension-reduction encoding on the high-dimensional vector to be queried, obtaining the high-dimensional vector code corresponding to the high-dimensional vector to be queried; acquiring a position prediction module, inputting the high-dimensional vector code to be queried to it, and determining the predicted position of the high-dimensional vector to be queried in an ordered array based on the position prediction module; and determining a target vector from the plurality of high-dimensional vectors to be retrieved according to the predicted position and returning it.

Description

High-dimensional vector query method, device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of vector query, and in particular, to a high-dimensional vector query method, apparatus, computer device, and readable storage medium.
Background
With the development of embedding technology, many query tasks can be effectively cast as vector queries. In general, a query system deploys a deep neural model to generate embedding vectors for the corpus and for the query, while deploying a high-performance ANN (Approximate Nearest Neighbor) module for searching. However, as the number of items in the corpus grows rapidly, the ANN module has to search through more data points, which reduces query efficiency.
In the related art, an RMI (Recursive Model Index) learned index model is used to query data. When the RMI learned index model is used for data query, the similarity between the vector of the queried data and the vectors of the data in the corpus must be defined and calculated, and the target vector is then determined in the corpus according to that similarity.
In carrying out the present application, the applicant has found that the related art has at least the following problems:
vectors in a high-dimensional vector space have many dimensions, so the similarity between high-dimensional vectors is difficult to judge through a simple magnitude comparison; the conventional RMI model therefore cannot directly perform high-dimensional vector query, and both the efficiency and the precision of high-dimensional vector query are low.
Disclosure of Invention
In view of this, the present application provides a high-dimensional vector query method, apparatus, computer device and readable storage medium, mainly aiming to solve the problem that the similarity between high-dimensional vectors is difficult to judge by simple magnitude comparison, so that the current RMI model cannot directly perform high-dimensional vector query, resulting in low query efficiency and low query precision for high-dimensional vectors.
According to a first aspect of the present application, there is provided a high-dimensional vector query method, the method comprising:
responding to a high-dimensional vector query instruction, and acquiring a high-dimensional vector to be queried;
The method comprises the steps of obtaining a dimension reduction module, and carrying out dimension reduction coding on a high-dimension vector to be queried by adopting the dimension reduction module to obtain a high-dimension vector code to be queried corresponding to the high-dimension vector to be queried;
the method comprises the steps of obtaining a position prediction module, inputting the high-dimensional vector code to be queried to the position prediction module, and determining the predicted position of the high-dimensional vector code to be queried in an ordered array based on the position prediction module, wherein a plurality of high-dimensional vectors to be retrieved which are ordered according to similarity are stored in the ordered array;
And determining a target vector from the plurality of high-dimensional vectors to be searched according to the predicted position and returning.
Optionally, the dimension reduction module performs dimension reduction encoding on the to-be-queried high-dimension vector to obtain to-be-queried high-dimension vector encoding corresponding to the to-be-queried high-dimension vector, including:
Inputting the high-dimensional vector to be queried to the dimension reduction module, and dividing the high-dimensional vector to be queried into a plurality of sub-vectors based on the projection-clustering-based residual quantization algorithm (SK-RPCPQ) designed in the dimension reduction module;
Performing projection clustering in subspaces formed by the same dimension subvectors based on the dimension reduction module until the model converges or reaches a preset training frequency threshold value to obtain a first layer codebook, completing one iteration, determining residual vectors based on the first layer codebook, performing projection clustering on the residual vectors again to form a second layer codebook, completing another iteration, repeatedly determining the residual vectors, performing projection clustering until the iteration frequency meets the preset iteration frequency threshold value, and obtaining a multi-layer codebook corresponding to each subspace;
And carrying out vector coding on the sub-vectors in each subspace according to the multi-layer codebook corresponding to each subspace based on the dimension reduction module, and combining the sub-vector codes of the plurality of sub-vectors corresponding to the high-dimension vector to be queried to obtain the high-dimension vector code to be queried.
Optionally, based on the dimension reduction module, performing projection clustering in a subspace formed by the same dimension subvectors until the model converges or reaches a preset training frequency threshold value, to obtain a first layer codebook, including:
For any subspace, randomly selecting a designated number of initial clustering centers in a data set corresponding to the subspace, wherein all the sub-vectors forming the subspace are stored in the data set;
calculating the clustering center to which each sub-vector belongs according to the initial clustering center, and updating the clustering allocation of each sub-vector to obtain a specified number of class clusters;
calculating the maximum right singular vector of each class cluster, and updating the initial cluster center by adopting the maximum right singular vector to obtain a new cluster center, so as to complete primary clustering;
According to the new cluster center, calculating the cluster center to which each sub-vector belongs again, updating the cluster allocation of each sub-vector to obtain an updated designated number of class clusters, completing the next clustering, repeatedly calculating the cluster center to which each sub-vector belongs, updating the cluster allocation of each sub-vector until the cluster center of each class cluster is stable or the iteration number reaches the preset training number threshold;
and aggregating the cluster centers of each class cluster to obtain the first layer codebook.
Optionally, determining a residual vector based on the first layer codebook, and performing projection clustering on the residual vector again to form a second layer codebook, including:
For the clustering center associated with each codeword in the first layer codebook, calculating residual vectors of each sub-vector and the corresponding clustering center, and aggregating all residual vectors to obtain a residual data set;
Performing projection clustering again on residual vectors in the residual data set, calculating a clustering center to which each residual vector belongs, and updating clustering allocation of each residual vector until the clustering center of each class cluster is stable or the iteration number reaches the preset training number threshold;
and aggregating the cluster centers of each class cluster to obtain the second layer codebook.
Optionally, the vector encoding of the sub-vectors in each subspace according to the multi-layer codebook corresponding to each subspace based on the dimension reduction module includes:
for any sub-vector, selecting a first layer codebook from a multi-layer codebook corresponding to the sub-vector, calculating the Euclidean distance between each first layer codeword in the first layer codebook and the sub-vector, and selecting the first layer codeword with the distance value meeting the preset condition as a target codeword;
Combining the target code word with a second layer code book in the multi-layer code book respectively to obtain second layer code words, calculating Euclidean distance between the sub-vector and each second layer code word, selecting the second layer code word with a distance value meeting a preset condition as a new target code word, combining the new target code word with a third layer code book in the multi-layer code book to obtain a third layer code word, and continuously selecting the new target code word until the target code word corresponding to the last layer code book in the multi-layer code book is selected;
In all selected target code words, the target code words with Euclidean distances meeting preset conditions are adopted to encode the sub-vectors, so that sub-vector codes are obtained;
and coding each sub-vector, combining the sub-vector codes corresponding to each sub-vector to obtain the high-dimensional vector code to be queried, and quantifying the difference between vector codes.
Optionally, the inputting the high-dimensional vector code to be queried to the position prediction module, determining, based on the position prediction module, a predicted position of the high-dimensional vector to be queried in an ordered array, includes:
based on the position prediction module, performing a clustering operation on the high-dimensional vector code by adopting the Ball-based OPTICS clustering algorithm, and locating the class cluster corresponding to the high-dimensional vector code based on the top learning index of the ordered array;
and determining a piecewise polynomial learning index model corresponding to the class cluster based on the position prediction module, and locating the predicted position of the high-dimensional vector to be queried in the ordered array based on the piecewise polynomial learning index model and a preset function.
Optionally, determining a target vector from the plurality of high-dimensional vectors to be retrieved according to the predicted position and returning the target vector, including:
according to the predicted positions, a preset number of high-dimensional vectors to be searched are obtained, and the most similar vector is selected from the preset number of high-dimensional vectors to be searched to serve as the target vector;
determining an initiator for initiating the high-dimensional vector query instruction, and returning the target vector to the initiator.
According to a second aspect of the present application, there is provided a high-dimensional vector query apparatus comprising:
the acquisition module is used for responding to the high-dimensional vector query instruction and acquiring a high-dimensional vector to be queried;
the dimension reduction module is used for carrying out dimension reduction coding on the high-dimension vector to be queried by adopting the dimension reduction module to obtain a high-dimension vector code to be queried corresponding to the high-dimension vector to be queried, and inputting the high-dimension vector code to be queried into the position prediction module;
The position prediction module is used for determining the predicted position of the high-dimensional vector to be queried in an ordered array based on the position prediction module, and a plurality of high-dimensional vectors to be retrieved which are ordered according to the similarity are stored in the ordered array;
And the determining module is used for determining a target vector from the plurality of high-dimensional vectors to be searched according to the predicted position and returning the target vector.
Optionally, the apparatus further comprises:
The input module is used for inputting the high-dimensional vector to be queried to the dimension reduction module, and dividing the high-dimensional vector to be queried into a plurality of sub-vectors based on the projection-clustering-based residual quantization algorithm (SK-RPCPQ) designed in the dimension reduction module;
the dimension reduction module is used for performing projection clustering in subspaces formed by the same dimension subvectors until the model converges or reaches a preset training frequency threshold value, obtaining a first layer codebook, completing one iteration, determining residual vectors based on the first layer codebook, performing projection clustering on the residual vectors again to form a second layer codebook, completing another iteration, repeatedly determining the residual vectors, performing projection clustering until the iteration frequency meets the preset iteration frequency threshold value, and obtaining a multi-layer codebook corresponding to each subspace;
The dimension reduction module is used for carrying out vector coding on the sub-vectors in each subspace according to the multi-layer codebook corresponding to each subspace, and combining the sub-vector codes of the plurality of sub-vectors corresponding to the high-dimension vector to be queried to obtain the high-dimension vector code to be queried.
Optionally, the dimension reduction module is configured to randomly select, for any subspace, a specified number of initial cluster centers in a dataset corresponding to the subspace, where all the sub-vectors forming the subspace are stored in the dataset; calculating the clustering center to which each sub-vector belongs according to the initial clustering center, and updating the clustering allocation of each sub-vector to obtain a specified number of class clusters; calculating the maximum right singular vector of each class cluster, and updating the initial cluster center by adopting the maximum right singular vector to obtain a new cluster center, so as to complete primary clustering; according to the new cluster center, calculating the cluster center to which each sub-vector belongs again, updating the cluster allocation of each sub-vector to obtain an updated designated number of class clusters, completing the next clustering, repeatedly calculating the cluster center to which each sub-vector belongs, updating the cluster allocation of each sub-vector until the cluster center of each class cluster is stable or the iteration number reaches the preset training number threshold; and aggregating the cluster centers of each class cluster to obtain the first layer codebook.
Optionally, the dimension reduction module is configured to calculate, for a cluster center associated with each codeword in the first layer codebook, a residual vector of each sub-vector and a corresponding cluster center, and aggregate all residual vectors to obtain a residual data set; performing projection clustering again on residual vectors in the residual data set, calculating a clustering center to which each residual vector belongs, and updating clustering allocation of each residual vector until the clustering center of each class cluster is stable or the iteration number reaches the preset training number threshold; and aggregating the cluster centers of each class cluster to obtain the second layer codebook.
Optionally, the dimension reduction module is configured to select, for any sub-vector, a first layer codebook from a multi-layer codebook corresponding to the sub-vector, calculate the euclidean distance between each first layer codeword in the first layer codebook and the sub-vector, and select, as a target codeword, a first layer codeword whose distance value meets a preset condition; combining the target code word with a second layer code book in the multi-layer code book respectively to obtain second layer code words, calculating Euclidean distance between the sub-vector and each second layer code word, selecting the second layer code word with a distance value meeting a preset condition as a new target code word, combining the new target code word with a third layer code book in the multi-layer code book to obtain a third layer code word, and continuously selecting the new target code word until the target code word corresponding to the last layer code book in the multi-layer code book is selected; in all selected target code words, the target code words with Euclidean distances meeting preset conditions are adopted to encode the sub-vectors, so that sub-vector codes are obtained; and coding each sub-vector, combining the sub-vector codes corresponding to each sub-vector to obtain the high-dimensional vector codes to be queried and quantifying the difference between the vector codes.
Optionally, the position prediction module is configured to perform a clustering operation on the high-dimensional vector code by adopting the Ball-based OPTICS clustering algorithm, and locate the class cluster corresponding to the high-dimensional vector code based on the top learning index of the ordered array; and determine a piecewise polynomial learning index model corresponding to the class cluster, and locate the predicted position of the high-dimensional vector to be queried in the ordered array based on the piecewise polynomial learning index model and a preset function.
Optionally, the determining module is configured to obtain a preset number of high-dimensional vectors to be searched according to the predicted position, and select a most similar vector from the preset number of high-dimensional vectors to be searched as the target vector; determining an initiator for initiating the high-dimensional vector query instruction, and returning the target vector to the initiator.
According to a third aspect of the present application there is provided a computer device comprising a memory storing a computer program and a processor implementing the steps of the method of any of the first aspects described above when the computer program is executed by the processor.
According to a fourth aspect of the present application there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of any of the first aspects described above.
By means of the above technical scheme, the high-dimensional vector query method, apparatus, computer device and readable storage medium provided by the application first respond to a high-dimensional vector query instruction to acquire the high-dimensional vector to be queried. A dimension reduction module is then obtained and used to perform dimension-reduction encoding on the high-dimensional vector to be queried, obtaining the high-dimensional vector code corresponding to it. Next, a position prediction module is obtained, the high-dimensional vector code to be queried is input to it, and the predicted position of the high-dimensional vector to be queried in an ordered array is determined based on the position prediction module, the ordered array storing a plurality of high-dimensional vectors to be retrieved sorted by similarity. Finally, a target vector is determined from the plurality of high-dimensional vectors to be retrieved according to the predicted position and returned. This learned ANN query method improves the existing high-dimensional vector query technology: it adopts a dimension reduction module designed around SK-RPCPQ (a projection-clustering-based residual quantization algorithm) and a position prediction module designed around the CB-LIPP (clustering-based precise learning index) algorithm, strengthens the model's ability to handle large-scale high-dimensional vector datasets, and can significantly improve query precision while maintaining high query efficiency.
The foregoing is merely an overview of the technical solution of the present application. In order that the technical means of the present application may be understood more clearly and implemented in accordance with the content of the specification, and in order to make the above and other objects, features and advantages of the present application more apparent, specific embodiments of the present application are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 shows a flow chart of a high-dimensional vector query method according to an embodiment of the present application;
FIG. 2 is a schematic diagram showing a relationship between a dimension reduction module and a position prediction module according to an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of a CB-LIPP according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a high-dimensional vector query device according to an embodiment of the present application;
Fig. 5 shows a schematic structural diagram of another high-dimensional vector query apparatus according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
The embodiment of the application provides a high-dimensional vector query method, which is shown in fig. 1 and comprises the following steps:
101. And responding to the high-dimensional vector query instruction, and acquiring the high-dimensional vector to be queried.
The application provides a method for constructing a high-dimensional vector learning index on the basis of existing high-dimensional vector learning indexes, with SK-RPCPQ (a projection-clustering-based residual quantization algorithm) and CB-LIPP (a clustering-based precise learning index) as the two-layer core model; the relationship between the two core modules, the dimension reduction module and the position prediction module, is shown in FIG. 2. This improves both the query efficiency and the query precision of high-dimensional vectors. In the embodiment of the application, the device first responds to the high-dimensional vector query instruction to obtain the high-dimensional vector to be queried, and then inputs the high-dimensional vector to be queried into the trained dimension reduction module for dimension-reduction encoding.
102. And the dimension reduction module is used for carrying out dimension reduction coding on the high-dimension vector to be queried, so as to obtain the high-dimension vector code to be queried corresponding to the high-dimension vector to be queried.
The core of the dimension reduction module is SK-RPCPQ, an improvement over PQ (product quantization) aimed primarily at high-dimensional vectors. The high-dimensional vector samples are input to the dimension reduction module, and the projection-clustering-based residual quantization algorithm (SK-RPCPQ) built into the module divides each high-dimensional vector sample into m sub-vectors. The dimension of each sub-vector is d/m, where d is the dimension of the high-dimensional vector; x_i^j denotes the j-th sub-vector of the i-th high-dimensional vector, and the number of sub-vectors obtained after division is m.
Further, the sub-vectors of each position form a subspace, so m subspaces are obtained, each containing one sub-vector per sample. Based on the projection-clustering-based residual quantization algorithm (SK-RPCPQ), projection clustering is performed on the vectors in the subspace formed by the same-dimension sub-vectors until the model converges or a preset training count threshold is reached, producing the first-layer codebook and completing one iteration. Specifically, for any subspace, a specified number k of initial cluster centers c_1, ..., c_k are randomly selected from the dataset corresponding to that subspace, in which all the sub-vectors constituting the subspace are stored. According to Equations 1 and 2 below, the cluster center to which each sub-vector belongs is computed from the initial cluster centers, and the cluster assignment of each sub-vector is updated, giving the specified number of class clusters.

Equation 1 (projection distance from a sub-vector $x$ to the line spanned by center $c_j$):

$d(x, c_j) = \left\| x - \frac{x^{\top} c_j}{c_j^{\top} c_j}\, c_j \right\|_2$

Equation 2 (cluster assignment):

$a(x) = \arg\min_{j \in \{1, \dots, k\}} d(x, c_j)$

Each sub-vector x is thus clustered into class cluster S_{a(x)}. For each class cluster S_j, let Y_j denote the matrix formed from the sub-vectors allocated to cluster center c_j. According to Equation 3, the largest right singular vector v_j of each class cluster is computed, and the initial cluster center is updated with it, i.e. c_j = v_j, obtaining a new cluster center and completing one round of clustering.

Equation 3:

$v_j = \arg\max_{\|v\|_2 = 1} \|Y_j\, v\|_F$

where F denotes the Frobenius norm and Y_j is the matrix of the vectors allocated to the cluster. Further, according to the new cluster centers, the cluster center to which each sub-vector belongs is computed again and the cluster assignment of each sub-vector is updated, giving an updated specified number of class clusters and completing the next round of clustering; computing the cluster centers and updating the assignments is repeated until the cluster center of each class cluster is stable or the iteration count reaches the preset training count threshold. Each vector is thereby mapped to its codeword in the codebook, and the cluster centers of all class clusters are aggregated to obtain the converged first-layer codebook, where each cluster center is called a codeword.
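As a minimal illustrative sketch of the projection clustering above (assuming the Equation 1-3 reading given here, i.e. line-distance assignment with SVD-based center updates; all function and variable names are assumptions, not the patent's code):

```python
import numpy as np

def projection_kmeans(X, k, max_iters=100, seed=0):
    """Projection clustering for one subspace: assign each sub-vector to the
    nearest 1-D line spanned by a center (Equations 1-2), then update each
    center to the largest right singular vector of its cluster (Equation 3)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial cluster centers
    assign = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Equations 1-2: squared distance from each point to each center's line
        units = centers / np.linalg.norm(centers, axis=1, keepdims=True)
        proj = X @ units.T                       # (n, k) projection lengths
        dist2 = (X ** 2).sum(1, keepdims=True) - proj ** 2
        assign = dist2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            Yj = X[assign == j]                  # vectors allocated to cluster j
            if len(Yj) == 0:
                continue
            # Equation 3: largest right singular vector of the cluster matrix
            _, _, vt = np.linalg.svd(Yj, full_matrices=False)
            new_centers[j] = vt[0]
        if np.allclose(new_centers, centers):    # centers stable -> converged
            break
        centers = new_centers
    return centers, assign  # codebook (codewords) and cluster assignment
```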
Next, the residual hierarchy codebooks are constructed. For the cluster center associated with each codeword in the first-layer codebook, the residual vector of each sub-vector with respect to its corresponding cluster center is computed, and all residual vectors are aggregated to obtain a residual dataset. Projection clustering is then performed on the residual vectors in the residual dataset to obtain the second-layer codebook, completing another iteration; determining residual vectors and projection-clustering them is repeated until the iteration count meets the preset iteration threshold, yielding a multi-layer codebook. Specifically, a specified number of initial cluster centers is randomly selected in the residual dataset. The cluster center to which each residual vector belongs is computed from these initial centers, and the cluster assignment of each residual vector is updated to obtain the specified number of class clusters. The maximum right singular vector of each class cluster is computed and used to update the initial cluster centers, giving new cluster centers and completing one round of clustering. According to the new cluster centers, the cluster center to which each residual vector belongs is computed again and the assignments updated, round after round, until the cluster center of each class cluster is stable or the iteration count reaches the preset training count threshold. The cluster centers of all class clusters are aggregated to obtain the second-layer codebook. Repeating this process yields the L-layer residual codebooks.
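Continuing the sketch, the residual layering might be implemented as follows (reusing the hypothetical projection_kmeans above; the residual definition follows the text, i.e. sub-vector minus assigned center):

```python
def build_residual_codebooks(X, k, num_layers):
    """Build an L-layer residual codebook for one subspace: cluster the data,
    subtract each point's assigned codeword, and cluster the residuals again,
    layer by layer."""
    codebooks, residual = [], X
    for _ in range(num_layers):
        centers, assign = projection_kmeans(residual, k)
        codebooks.append(centers)
        # residual dataset: each vector minus its assigned cluster center
        residual = residual - centers[assign]
    return codebooks
```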
Through the above steps, L codebooks are constructed for each subspace, and the vectors in each subspace must in turn be encoded with these L codebooks. In the present invention, the following coding scheme is adopted:

When the high-dimensional vector to be queried is queried, the vector to be queried is input into the trained dimension reduction module for dimension-reduction encoding. For each subspace there are L codebooks, and each sub-vector can be represented by its corresponding codewords in these L codebooks; it can therefore be encoded using the positions of those codewords in the codebooks. The encoding is performed in a manner similar to Beam Search. The specific flow is as follows:

The dimension reduction module divides the high-dimensional vector to be queried into a plurality of sub-vectors and determines the multi-layer codebook corresponding to each sub-vector. For any sub-vector, the first-layer codebook is selected from its multi-layer codebook; the Euclidean distance between each first-layer codeword and the sub-vector is computed, and the b first-layer codewords whose distance values satisfy the preset condition, namely the b codewords nearest to the sub-vector, are taken as target codewords.

The b target codewords are each combined with the second-layer codebook in the multi-layer codebook to obtain second-layer codewords; the Euclidean distance between the sub-vector and each second-layer codeword is computed, and the b/2 second-layer codewords whose distance values satisfy the preset condition, namely those nearest to the sub-vector, are selected as new target codewords, i.e. half of the previous layer's results are kept at each step. The new target codewords are combined with the third-layer codebook in the multi-layer codebook to obtain third-layer codewords, and new target codewords continue to be selected until the target codewords corresponding to the last-layer codebook in the multi-layer codebook have been selected. Finally, among all selected target codewords, the group with the minimum Euclidean distance is adopted to encode the sub-vector, the code being the corresponding positions of the codewords in the codebooks; this yields the sub-vector code. Each sub-vector is encoded in this way, and the sub-vector codes corresponding to the sub-vectors are combined to obtain the high-dimensional vector code to be queried.

Because the number of distinct high-dimensional vectors is much larger than the number of distinct codes, many high-dimensional vectors will share the same code, in which case those vectors may be said to be similar; and the smaller the coding difference between high-dimensional vectors, the greater the similarity between the vectors can be considered. But the similarity and difference between vectors cannot be derived directly from these sub-vector codes alone, so the difference between vector codes must next be quantified. The difference between two vector codes can be defined by Equation 4:

Equation 4:

$D(u, v) = n_{\neq}(u, v) \cdot \bar{d}(u, v) + \varepsilon$

where $n_{\neq}(u, v)$ denotes the number of code positions at which u and v are unequal, $\bar{d}(u, v)$ denotes the average distance of the cluster centers corresponding to the unequal codes, obtained by averaging the cosine distances between those cluster centers, and $\varepsilon$ is a constant. Thus the larger D is between two vectors, the less similar the two vectors are; the high-dimensional vectors should be sorted by D.
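A hedged sketch of the Beam-Search-style encoding and the Equation 4 difference, under the reconstruction above (the beam width b, the per-layer halving, and the multiplicative form of Equation 4 are assumptions; names are illustrative):

```python
import numpy as np

def encode_subvector(x, codebooks, beam=8):
    """Beam-Search-style encoding of one sub-vector against an L-layer
    residual codebook: keep the b best candidates at the first layer, halve
    the beam at each later layer, and return the codeword positions of the
    best accumulated approximation."""
    cands = [(np.zeros_like(x), [])]           # (approximation so far, codeword path)
    width = beam
    for layer in codebooks:                    # layer: array of codewords
        scored = []
        for approx, path in cands:
            for idx, w in enumerate(layer):
                cand = approx + w
                scored.append((np.linalg.norm(x - cand), cand, path + [idx]))
        scored.sort(key=lambda t: t[0])        # nearest candidates first
        cands = [(c, p) for _, c, p in scored[:width]]
        width = max(1, width // 2)             # keep half of the previous layer's results
    return cands[0][1]                         # the sub-vector code

def code_difference(code_u, code_v, centers_u, centers_v, eps=1e-6):
    """Equation 4 as read here: count of unequal code positions times the
    average cosine distance of the corresponding cluster centers, plus eps.
    centers_u/centers_v hold, per position, the codeword vector each code
    refers to (a simplification of the codebook lookup)."""
    diff = [i for i, (a, b) in enumerate(zip(code_u, code_v)) if a != b]
    if not diff:
        return eps
    cos_d = [1 - float(np.dot(centers_u[i], centers_v[i]) /
                       (np.linalg.norm(centers_u[i]) * np.linalg.norm(centers_v[i])))
             for i in diff]
    return len(diff) * float(np.mean(cos_d)) + eps
```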
103. And the position prediction module is obtained, the high-dimensional vector code to be queried is input to the position prediction module, the predicted position of the high-dimensional vector to be queried in the ordered array is determined based on the position prediction module, and a plurality of high-dimensional vectors to be retrieved which are ordered according to the similarity are stored in the ordered array.
In the embodiment of the application, based on the position prediction module, a clustering operation is performed on the high-dimensional vector code by adopting the Ball-based OPTICS clustering algorithm, and the class cluster corresponding to the high-dimensional vector code is located based on the top learning index of the ordered array. A piecewise polynomial learning index model corresponding to the class cluster is then determined based on the position prediction module, and the predicted position of the high-dimensional vector to be queried in the ordered array is located based on the piecewise polynomial learning index model and a preset function.
Specifically, the purpose of the position prediction module is to predict, from the high-dimensional vector code obtained by the dimension reduction module, the position of the corresponding vector in the ordered array. In the present application this is achieved by CB-LIPP. CB-LIPP is based on clustering: existing research has shown that when constructing a learning index, a model trained on data that preserves the key data distribution can serve the whole dataset. The ordered array obtained above is therefore first clustered; the data are divided into a number of clusters, yielding a set of cluster centers, and these cluster centers are taken as a new dataset. Generated by clustering, this dataset preserves the data distribution of the original dataset while its size is far smaller, so the time required to construct the learning index model on it is reduced. However, as the amount of data grows, the amount of data in each cluster increases and the accuracy of that learning index model drops, so further predictions are made within each cluster by constructing a learning index model through LIPP.
In detail, the structure construction process of CB-LIPP is as follows:
First, the dimension-reduced data (the high-dimensional vector codes) obtained from the dimension reduction module are clustered with the improved OPTICS clustering algorithm, i.e. the Ball-based OPTICS clustering algorithm (its detailed steps are given in the algorithm flow below). A dataset is then composed of the cluster centers of each class, and a learning index model is built on it. In general, the number of clusters does not change frequently, and rebuilding the model on this dataset is also relatively fast, so at this step the model can be constructed with the RMI method: the data are sorted by size to form an ordered array, the data serve as input and their positions in the array as output, and the model is trained to predict the position in the array from the data. The model trained at this stage is the top-level learning index.
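A minimal sketch of such an RMI-style top-level index over the cluster-center dataset (the class name and the single linear fit are illustrative assumptions):

```python
import numpy as np

class TopIndex:
    """Fit position ~= a*key + b over the sorted cluster-center keys."""
    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        pos = np.arange(len(self.keys))
        self.a, self.b = np.polyfit(self.keys, pos, 1)   # least-squares line

    def predict(self, key):
        pos = int(round(self.a * key + self.b))
        return max(0, min(len(self.keys) - 1, pos))      # clamp into the array
```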
Then a new model is trained in each cluster by the LIPP algorithm. The model trained by LIPP resembles a tree structure, but each node is different. In LIPP, each node internally maintains a model, a set of entity elements and a bit vector. The model is used to predict the location of data, and its training differs from that of RMI: RMI first sorts the data and then predicts a datum's position in the ordered array from its size, whereas in LIPP the training first generates an array of the same size as the dataset, initialized to null, and then selects a monotonically increasing function M that must map each datum into a valid position of that array; this function can then be used as the node's model, mapping data to positions in the array. The bit vector indicates the type of the corresponding entity element, of which there are three: NULL, meaning the current position is empty and can be used to insert an element (all positions are initialized to NULL); Data, meaning the current position holds one datum; and Node, meaning a conflict occurred when inserting at the current position, which now stores a pointer to a node of the next layer.
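A minimal sketch of such a node (the dataclass layout, the linear form of the monotone model, and all names are illustrative assumptions):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Entry:
    kind: str = "NULL"                  # "NULL" | "Data" | "Node"
    key: Optional[float] = None         # payload when kind == "Data"
    child: Optional["LippNode"] = None  # next-layer node when kind == "Node"

@dataclass
class LippNode:
    a: float                            # slope of a monotone linear model
    b: float                            # intercept: pos = clamp(a*key + b)
    slots: List[Entry] = field(default_factory=list)

    def predict(self, key: float) -> int:
        # monotonically increasing model mapping a key into [0, len(slots)-1]
        pos = int(self.a * key + self.b)
        return max(0, min(len(self.slots) - 1, pos))
```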
The structure of CB-LIPP is shown in FIG. 3: the first layer is the learning index trained on the dataset composed of the cluster centers, and each entry of that index points to the cluster it indexes.
Next, the query and insert operations of the module are described. When querying, the data is first clustered correspondingly to find the cluster center to which it belongs; the pointer at the corresponding position of the top-level index locates the model of that cluster, which predicts the position corresponding to the data. If the type of the queried entity element is NULL, the sought data does not exist; if the type is Data, the data stored at that position is compared with the queried data, and the search succeeds if they are the same, otherwise the sought data does not exist; if the entity element is Node, several data exist at that position, and the search descends into the next-layer model.
The insertion operation proceeds like the query: the data is clustered correspondingly, the corresponding position is found through the top-level index, and the model of that cluster predicts the position into which the data should be inserted. If the type of the entity element at that position is NULL, the data is inserted directly; if the type is Data, a conflict occurs: the entity element's type becomes Node, the data at that position and the newly inserted data become a new node of the next layer, a new model is trained, and the position points to that node; if the type is Node, the data to be inserted is passed to the next-layer model for insertion. However, during repeated insertion, as the amount of data increases the number of tree layers grows, which seriously affects query efficiency, so the model needs a merging operation; the merging operation of the original LIPP algorithm also applies in the present application.
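Under the same assumptions, the query and insert flows might look like this sketch (the child-node sizing and model fitting are simplified, and distinct keys are assumed):

```python
def lookup(node: LippNode, key: float) -> bool:
    """Query: descend through Node entries until NULL (absent) or Data."""
    while True:
        e = node.slots[node.predict(key)]
        if e.kind == "NULL":
            return False                   # sought data does not exist
        if e.kind == "Data":
            return e.key == key            # success only on exact match
        node = e.child                     # Node: search the next layer

def insert(node: LippNode, key: float) -> None:
    """Insert: on a Data conflict, push both keys into a new next-layer node."""
    while True:
        pos = node.predict(key)
        e = node.slots[pos]
        if e.kind == "NULL":
            node.slots[pos] = Entry("Data", key)
            return
        if e.kind == "Data":
            old = e.key                    # conflict: spawn a next-layer node
            lo, hi = min(old, key), max(old, key)
            span = hi - lo if hi > lo else 1.0
            child = LippNode(a=7.0 / span, b=-7.0 * lo / span,
                             slots=[Entry() for _ in range(8)])
            node.slots[pos] = Entry("Node", child=child)
            insert(child, old)             # retrain/seed the new node (simplified)
            insert(child, key)
            return
        node = e.child                     # Node: insert in the next layer
```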
Ball-based OPTICS clustering algorithm: the original OPTICS algorithm was proposed to optimize the DBSCAN algorithm. In the DBSCAN algorithm there are two important parameters: the neighborhood radius eps and the minimum number of neighborhood samples min_samples of a core object; different parameter choices lead to very different final clusterings. To solve this problem, OPTICS optimizes DBSCAN: instead of directly producing a clustering result, the OPTICS algorithm generates an ordered list according to the reachable distances of the points, obtains a decision diagram from that list, and selects different eps parameters through the decision diagram to perform clustering. However, by that algorithm flow, the reachable distances between each point and all other points in the sample must be computed, so the algorithm complexity is high. The present method therefore proposes a Ball-based OPTICS clustering algorithm, in which the sample is first divided into balls and the reachable distances are then computed between a ball and the other balls, which greatly reduces the computation.
In OPTICS, most concepts come from DBSCAN, and two new distance concepts are defined: the core distance and the reachable distance. These concepts are followed in the present application, and three new concepts are introduced: the sphere radius, the core sphere and the non-core sphere. For convenience, only the core distance, reachable distance, sphere radius, core sphere and non-core sphere are briefly described here.
Core distance: for a given core object X, the minimum neighborhood distance r that makes X a core object is the core distance of X. Briefly, a point is a core point if it has at least min_samples neighbors within its neighborhood; the core distance of X is then the distance from X to its min_samples-th nearest neighbor.
Reachable distance: if X is a core object, the reachable distance of object Y with respect to X is the maximum of the Euclidean distance from Y to X and the core distance of X. In OPTICS, the neighborhood radius eps defaults to infinity, so as long as the number of samples is not less than min_samples, all points are core objects.
Sphere radius r: the radius of the spheres used when dividing the sample; it needs to be smaller than the neighborhood radius eps in DBSCAN. In general, the smaller the sphere radius r, the better the clustering effect but the greater the algorithm complexity; conversely, the larger r, the worse the clustering effect but the smaller the complexity, so the sphere radius must be chosen according to the actual situation of the sample. When the sphere radius is set infinitesimally small, the algorithm is the original OPTICS algorithm.
Core sphere: if the number of points within the r-neighborhood of some point is greater than min_samples, that point is called a core point; a sphere whose center is a core point is called a core sphere.
Non-core sphere: if the number of points within the r-neighborhood of some point is less than min_samples, that point is called a non-core point; a sphere whose center is a non-core point is called a non-core sphere.
The detailed steps of the Ball-based OPTICS clustering algorithm are as follows:
Sample division: first, the sphere radius r is determined. A point P is then selected from the sample; if it already belongs to some core sphere, the point has been divided and nothing is done. Otherwise, sphere division is carried out according to whether the point is a core point: the points within the r-radius of P are divided into the same sphere, which is a core sphere if P is a core point and a non-core sphere otherwise, and P is recorded as the center of that sphere. The above operation is repeated until all points are divided. A core sphere obtained by this division has radius r; in actual clustering, its points most likely belong to the same class, so the reachable distances between them need not be calculated when running the OPTICS algorithm.
Constructing the ordered list: the process of constructing the ordered list in the present application is similar to the OPTICS algorithm. The difference is that the OPTICS algorithm calculates the reachable distances between all points, whereas here the following case arises when calculating reachable distances: suppose the current sample point is P; if P belongs to some core sphere, the reachable distance between this point and other points is calculated using the sphere center instead of P; if it lies in a non-core sphere, P itself is still used for the calculation. With this calculation method, the ordered list is created using the OPTICS list-construction method.
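A hedged sketch of the ball division and the center-based reachable-distance shortcut (the names and the brute-force neighbor search are illustrative; a real implementation would use a spatial index):

```python
import numpy as np

def divide_into_balls(X, r, min_samples):
    """Sample division: greedily group points within radius r of an undivided
    point P into one ball; the ball is a core ball when P has more than
    min_samples points in its r-neighborhood."""
    n = len(X)
    ball_id = np.full(n, -1)
    centers, is_core = [], []
    for p in range(n):
        if ball_id[p] != -1 and is_core[ball_id[p]]:
            continue                       # already inside a core ball
        dist = np.linalg.norm(X - X[p], axis=1)
        members = np.where(dist <= r)[0]
        bid = len(centers)
        ball_id[members] = bid
        centers.append(X[p])               # P is recorded as the ball center
        is_core.append(len(members) > min_samples)
    return ball_id, np.array(centers), is_core

def reachable_point(X, p, ball_id, centers, is_core):
    """When computing reachable distances, a point in a core ball is replaced
    by its ball center; otherwise the point itself is used."""
    b = ball_id[p]
    return centers[b] if b != -1 and is_core[b] else X[p]
```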
The above is the basic flow of the Ball-based OPTICS clustering algorithm; a decision diagram can be obtained through the ordered list, and it can support DBSCAN clustering under different eps parameters.
104. And determining a target vector from the plurality of high-dimensional vectors to be retrieved according to the predicted position and returning.
The vector query process is similar to the index construction process: first, the dimension reduction module generates the code, yielding a one-dimensional key; this key serves as the input of the position prediction module, which predicts the position of the vector in memory. Since the ordered array is sorted by the correlation between vectors, a preset number of high-dimensional vectors to be retrieved is obtained according to the predicted position, and the most similar vector among them is selected as the target vector. Finally, the initiator of the high-dimensional vector query instruction is determined, and the target vector is returned to the initiator.
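Putting the two modules together, a minimal end-to-end query sketch under the assumptions above (encode_subvector, the TopIndex-style predictor, the hash-based one-dimensional key, and the window size are all illustrative; d must be divisible by m):

```python
import numpy as np

def query(x, codebooks_per_subspace, top_index, ordered_vectors, window=16):
    """Encode the query vector, predict its position in the ordered array,
    and return the most similar vector within a small window around it."""
    m = len(codebooks_per_subspace)
    subs = np.split(x, m)                        # divide into m sub-vectors
    code = [encode_subvector(s, cbs)             # per-subspace residual codes
            for s, cbs in zip(subs, codebooks_per_subspace)]
    key = hash(tuple(tuple(c) for c in code))    # one-dimensional key (illustrative)
    pos = top_index.predict(key)                 # predicted position in the array
    lo, hi = max(0, pos - window), min(len(ordered_vectors), pos + window)
    cands = ordered_vectors[lo:hi]               # preset number of candidates
    best = min(cands, key=lambda v: np.linalg.norm(v - x))
    return best                                  # target vector
```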
The method provided by the embodiment of the application responds to the high-dimensional vector query instruction to acquire the high-dimensional vector to be queried. A dimension reduction module is then obtained and used to perform dimension-reduction encoding on the high-dimensional vector to be queried, obtaining the high-dimensional vector code corresponding to it. Next, a position prediction module is obtained, the high-dimensional vector code to be queried is input to it, and the predicted position of the high-dimensional vector to be queried in an ordered array is determined based on the position prediction module, the ordered array storing a plurality of high-dimensional vectors to be retrieved sorted by similarity. Finally, a target vector is determined from the plurality of high-dimensional vectors to be retrieved according to the predicted position and returned. This learned ANN query method improves the existing high-dimensional vector query technology: it adopts a dimension reduction module designed around SK-RPCPQ (a projection-clustering-based residual quantization algorithm) and a position prediction module designed around the CB-LIPP (clustering-based precise learning index) algorithm, strengthens the model's ability to handle large-scale high-dimensional vector datasets, and can significantly improve query precision while maintaining high query efficiency.
Further, as a specific implementation of the method shown in fig. 1, an embodiment of the present application provides a high-dimensional vector query apparatus, as shown in fig. 4, where the apparatus includes: an acquisition module 301, a dimension reduction module 302, a position prediction module 303 and a determination module 304.
The acquiring module 301 is configured to respond to a high-dimensional vector query instruction, and acquire a high-dimensional vector to be queried;
The dimension reduction module 302 is configured to perform dimension reduction encoding on a high-dimension vector to be queried by using the dimension reduction module to obtain a high-dimension vector code to be queried corresponding to the high-dimension vector to be queried, and input the high-dimension vector code to be queried to the position prediction module;
the location prediction module 303 is configured to determine, based on the location prediction module, a predicted location of the high-dimensional vector to be queried in an ordered array, where a plurality of high-dimensional vectors to be retrieved are stored in the ordered array, where the high-dimensional vectors to be retrieved are ordered according to similarity;
the determining module 304 is configured to determine a target vector from the plurality of high-dimensional vectors to be retrieved according to the predicted position and return the target vector.
In a specific application scenario, as shown in fig. 5, the apparatus further includes: an input module 305.
The input module 305 is configured to input the high-dimensional vector to be queried to the dimension reduction module, and divide the high-dimensional vector to be queried into a plurality of sub-vectors based on the projection-clustering-based residual quantization algorithm (SK-RPCPQ) designed in the dimension reduction module;
The dimension reduction module 302 is configured to perform projection clustering in a subspace formed by the same dimension subvectors until the model converges or reaches a preset training frequency threshold, obtain a first layer codebook, complete one iteration, determine a residual vector based on the first layer codebook, perform projection clustering on the residual vector again to form a second layer codebook, complete another iteration, repeatedly determine the residual vector, and perform projection clustering until the iteration frequency meets the preset iteration frequency threshold, so as to obtain a multi-layer codebook corresponding to each subspace;
The dimension reduction module 302 is configured to perform vector encoding on the sub-vectors in each subspace according to the multi-layer codebook corresponding to each subspace, and combine the sub-vector encodings of the plurality of sub-vectors corresponding to the high-dimension vector to be queried, to obtain the high-dimension vector encoding to be queried.
In a specific application scenario, the dimension reduction module 302 is configured to randomly select, for any subspace, a specified number of initial clustering centers in a dataset corresponding to the subspace, where all the sub-vectors forming the subspace are stored in the dataset; calculating the clustering center to which each sub-vector belongs according to the initial clustering center, and updating the clustering allocation of each sub-vector to obtain a specified number of class clusters; calculating the maximum right singular vector of each class cluster, and updating the initial cluster center by adopting the maximum right singular vector to obtain a new cluster center, so as to finish primary clustering; according to the new cluster center, calculating the cluster center to which each sub-vector belongs again, updating the cluster allocation of each sub-vector to obtain an updated designated number of class clusters, completing the next clustering, repeatedly calculating the cluster center to which each sub-vector belongs, updating the cluster allocation of each sub-vector until the cluster center of each class cluster is stable or the iteration number reaches the preset training number threshold; and aggregating the cluster centers of each class cluster to obtain the first layer codebook.
In a specific application scenario, the dimension reduction module 302 is configured to calculate, for a cluster center associated with each codeword in the first layer codebook, a residual vector of each sub-vector and a corresponding cluster center, and aggregate all residual vectors to obtain a residual data set; performing projection clustering again on residual vectors in the residual data set, calculating a clustering center to which each residual vector belongs, and updating clustering allocation of each residual vector until the clustering center of each class cluster is stable or the iteration number reaches the preset training number threshold; and aggregating the cluster centers of each class cluster to obtain the second layer codebook.
In a specific application scenario, the dimension reduction module 302 is configured to select, for any sub-vector, a first layer codebook from a multi-layer codebook corresponding to the sub-vector, calculate the euclidean distance between each first layer codeword in the first layer codebook and the sub-vector, and select, as a target codeword, a first layer codeword whose distance value satisfies a preset condition; combining the target code word with a second layer code book in the multi-layer code book respectively to obtain second layer code words, calculating Euclidean distance between the sub-vector and each second layer code word, selecting the second layer code word with a distance value meeting a preset condition as a new target code word, combining the new target code word with a third layer code book in the multi-layer code book to obtain a third layer code word, and continuously selecting the new target code word until the target code word corresponding to the last layer code book in the multi-layer code book is selected; in all selected target code words, the target code words with Euclidean distances meeting preset conditions are adopted to encode the sub-vectors, so that sub-vector codes are obtained; and coding each sub-vector, combining the sub-vector codes corresponding to each sub-vector to obtain the high-dimensional vector codes to be queried and quantifying the difference between the vector codes.
In a specific application scenario, the position prediction module 303 is configured to perform a clustering operation on the high-dimensional vector code by adopting the Ball-based OPTICS clustering algorithm, and locate the class cluster corresponding to the high-dimensional vector code based on the top learning index of the ordered array; and determine a piecewise polynomial learning index model corresponding to the class cluster, and locate the predicted position of the high-dimensional vector to be queried in the ordered array based on the piecewise polynomial learning index model and a preset function.
In a specific application scenario, the determining module 304 is configured to obtain a preset number of high-dimensional vectors to be searched according to the predicted position, and select a most similar vector from the preset number of high-dimensional vectors to be searched as the target vector; determining an initiator for initiating the high-dimensional vector query instruction, and returning the target vector to the initiator.
The apparatus provided by the embodiment of the application responds to the high-dimensional vector query instruction to acquire the high-dimensional vector to be queried. A dimension reduction module is then obtained and used to perform dimension-reduction encoding on the high-dimensional vector to be queried, obtaining the high-dimensional vector code corresponding to it. Next, a position prediction module is obtained, the high-dimensional vector code to be queried is input to it, and the predicted position of the high-dimensional vector to be queried in an ordered array is determined based on the position prediction module, the ordered array storing a plurality of high-dimensional vectors to be retrieved sorted by similarity. Finally, a target vector is determined from the plurality of high-dimensional vectors to be retrieved according to the predicted position and returned. This learned ANN query method improves the existing high-dimensional vector query technology: it adopts a dimension reduction module designed around SK-RPCPQ (a projection-clustering-based residual quantization algorithm) and a position prediction module designed around the CB-LIPP (clustering-based precise learning index) algorithm, strengthens the model's ability to handle large-scale high-dimensional vector datasets, and can significantly improve query precision while maintaining high query efficiency.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application may be implemented in hardware, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application may be embodied as a software product stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk), the product comprising several instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the method described in each implementation scenario of the present application.
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of a preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required for practicing the application.
Those skilled in the art will appreciate that the modules of an apparatus in an implementation scenario may be distributed across the apparatus as described in that implementation scenario, or may, with corresponding changes, be located in one or more apparatuses different from it. The modules of an implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above sequence numbers are for description only and do not represent the merits of the implementation scenarios.
The foregoing disclosure is merely illustrative of some embodiments of the application, and the application is not limited thereto, as modifications may be made by those skilled in the art without departing from the scope of the application.

Claims (10)

1. A method for high-dimensional vector query, comprising:
in response to a high-dimensional vector query instruction, acquiring a high-dimensional vector to be queried;
acquiring a dimension reduction module, and performing dimension reduction encoding on the high-dimensional vector to be queried by using the dimension reduction module to obtain a high-dimensional vector code to be queried corresponding to the high-dimensional vector to be queried;
acquiring a position prediction module, inputting the high-dimensional vector code to be queried into the position prediction module, and determining, based on the position prediction module, a predicted position of the high-dimensional vector to be queried in an ordered array, wherein a plurality of high-dimensional vectors to be retrieved, sorted by similarity, are stored in the ordered array; and
determining a target vector among the plurality of high-dimensional vectors to be retrieved according to the predicted position, and returning the target vector.
2. The method of claim 1, wherein performing dimension reduction encoding on the high-dimensional vector to be queried by using the dimension reduction module to obtain the high-dimensional vector code to be queried corresponding to the high-dimensional vector to be queried comprises:
inputting the high-dimensional vector to be queried into the dimension reduction module, and dividing the high-dimensional vector to be queried into a plurality of sub-vectors based on the SK-RPCPQ algorithm, a projection-clustering-based residual quantization algorithm designed in the dimension reduction module;
performing projection clustering, based on the dimension reduction module, in the subspace formed by sub-vectors of the same dimensions until the model converges or a preset training-count threshold is reached, obtaining a first-layer codebook and completing one iteration; determining residual vectors based on the first-layer codebook, and performing projection clustering on the residual vectors to form a second-layer codebook, completing another iteration; and repeatedly determining residual vectors and performing projection clustering until the number of iterations satisfies a preset iteration-count threshold, obtaining a multi-layer codebook corresponding to each subspace; and
performing, based on the dimension reduction module, vector encoding on the sub-vectors in each subspace according to the multi-layer codebook corresponding to the subspace, and combining the sub-vector codes of the plurality of sub-vectors corresponding to the high-dimensional vector to be queried to obtain the high-dimensional vector code to be queried.
3. The method of claim 2, wherein performing projection clustering in the subspace formed by sub-vectors of the same dimensions based on the dimension reduction module until the model converges or the preset training-count threshold is reached to obtain the first-layer codebook comprises:
for any subspace, randomly selecting a specified number of initial cluster centers in a data set corresponding to the subspace, wherein all sub-vectors forming the subspace are stored in the data set;
calculating the cluster center to which each sub-vector belongs according to the initial cluster centers, and updating the cluster assignment of each sub-vector to obtain the specified number of class clusters;
calculating the maximum right singular vector of each class cluster, and updating the initial cluster centers with the maximum right singular vectors to obtain new cluster centers, thereby completing one round of clustering;
recalculating, according to the new cluster centers, the cluster center to which each sub-vector belongs and updating the cluster assignment of each sub-vector to obtain an updated specified number of class clusters, completing the next round of clustering; and repeating the calculation of the cluster center to which each sub-vector belongs and the updating of the cluster assignments until the cluster center of each class cluster is stable or the number of iterations reaches the preset training-count threshold; and
aggregating the cluster centers of the class clusters to obtain the first-layer codebook.
4. The method of claim 2, wherein determining residual vectors based on the first-layer codebook and performing projection clustering on the residual vectors to form the second-layer codebook comprises:
for the cluster center associated with each codeword in the first-layer codebook, calculating the residual vector of each sub-vector with respect to its corresponding cluster center, and aggregating all residual vectors to obtain a residual data set;
performing projection clustering on the residual vectors in the residual data set, calculating the cluster center to which each residual vector belongs, and updating the cluster assignment of each residual vector until the cluster center of each class cluster is stable or the number of iterations reaches the preset training-count threshold; and
aggregating the cluster centers of the class clusters to obtain the second-layer codebook.
5. The method of claim 2, wherein performing vector encoding on the sub-vectors in each subspace according to the multi-layer codebook corresponding to each subspace based on the dimension reduction module comprises:
for any sub-vector, selecting the first-layer codebook from the multi-layer codebook corresponding to the sub-vector, calculating the Euclidean distance between the sub-vector and each first-layer codeword in the first-layer codebook, and selecting the first-layer codeword whose distance satisfies a preset condition as a target codeword;
combining the target codeword with each codeword of the second-layer codebook in the multi-layer codebook to obtain second-layer codewords, calculating the Euclidean distance between the sub-vector and each second-layer codeword, selecting the second-layer codeword whose distance satisfies the preset condition as a new target codeword, combining the new target codeword with the third-layer codebook in the multi-layer codebook to obtain third-layer codewords, and continuing to select new target codewords in this manner until the target codeword corresponding to the last-layer codebook in the multi-layer codebook has been selected;
encoding the sub-vector with, among all selected target codewords, the target codeword whose Euclidean distance satisfies the preset condition, to obtain a sub-vector code; and
after encoding every sub-vector, combining the sub-vector codes corresponding to the sub-vectors to obtain the high-dimensional vector code to be queried, so that differences between vectors can be quantified by comparing their codes.
6. The method of claim 1, wherein inputting the high-dimensional vector code to be queried into the position prediction module and determining, based on the position prediction module, the predicted position of the high-dimensional vector to be queried in the ordered array comprises:
clustering the high-dimensional vector codes with a sphere-based OPTICS clustering algorithm based on the position prediction module, and locating the class cluster corresponding to the high-dimensional vector code through a top-level learned index of the ordered array; and
determining, based on the position prediction module, the piecewise polynomial learned index model corresponding to the class cluster, and locating the predicted position of the high-dimensional vector to be queried in the ordered array based on the piecewise polynomial learned index model and a preset function.
7. The method of claim 1, wherein determining a target vector among the plurality of high-dimensional vectors to be retrieved according to the predicted position and returning the target vector comprises:
obtaining a preset number of high-dimensional vectors to be retrieved according to the predicted position, and selecting the most similar vector among the preset number of high-dimensional vectors to be retrieved as the target vector; and
determining the initiator that issued the high-dimensional vector query instruction, and returning the target vector to the initiator.
8. A high-dimensional vector query apparatus, comprising:
an acquisition module, configured to acquire a high-dimensional vector to be queried in response to a high-dimensional vector query instruction;
a dimension reduction module, configured to perform dimension reduction encoding on the high-dimensional vector to be queried to obtain a high-dimensional vector code to be queried corresponding to the high-dimensional vector to be queried, and to input the high-dimensional vector code to be queried into the position prediction module;
a position prediction module, configured to determine a predicted position of the high-dimensional vector to be queried in an ordered array, wherein a plurality of high-dimensional vectors to be retrieved, sorted by similarity, are stored in the ordered array; and
a determining module, configured to determine a target vector among the plurality of high-dimensional vectors to be retrieved according to the predicted position and return the target vector.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination