CN116467494A - Vector data indexing method

Info

Publication number: CN116467494A
Application number: CN202310729325.2A
Authority: CN
Inventors: 王鑫炜; 苏鹏; 李剑楠; 李恒; 黄炎; 阎虎青
Original assignee: Shanghai Aikesheng Information Technology Co ltd
Current assignee: Shanghai Aikesheng Information Technology Co ltd
Priority date: 2023-06-20
Filing date: 2023-06-20
Publication date: 2023-07-21
Anticipated expiration: 2043-06-20
Also published as: CN116467494B

Abstract

The invention provides an indexing method of vector data, which comprises the following steps: establishing a vector training model; inputting a query vector into the vector training model, the vector training model outputting a plurality of approximate result vectors; sorting the database vectors in a database by the size of their values in each dimension; selecting, from the sorted database vectors, a plurality of database vectors located around each approximate result vector as first vector data; extracting the ids of all the first vector data, removing repeated ids, and extracting from the database the database vectors corresponding to the remaining ids as second vector data; and calculating the distance between the second vector data and the query vector, and selecting from the second vector data a plurality of vector data with smaller distances. The indexing method optimizes the query process, improves the efficiency of the indexing algorithm, and also improves the accuracy of the indexing result.

Description

Vector data indexing method

Technical Field

The invention relates to the technical field of data processing, in particular to an indexing method of vector data.

Background

With the explosive growth of unstructured data (such as images, video, and audio), unstructured data analysis has become widespread across a rich range of real-world applications. Many database systems have begun to incorporate unstructured data analysis to meet these needs. Such unstructured data is ultimately stored in databases in the form of feature vectors, so how to find the desired data in a huge amount of data has become a current research hotspot.

With a vector data indexing method, the desired data can be found in the database according to an input query vector. Vector indexing algorithms mainly rely on similarity comparisons to obtain the desired data, and these comparisons are mainly made by distance calculation. To improve efficiency, prior-art indexing algorithms almost always rely on data clustering for optimization, as in algorithms such as IVFPQ and IVFFlat. These methods first cluster the original data and then find the cluster centers, which reduces the number of comparisons needed for a query vector.

However, the prior art does not improve the efficiency of the indexing algorithm through query optimization, and its query efficiency still needs improvement.

Disclosure of Invention

The invention aims to provide an indexing method of vector data, which can optimize the query process, so that the efficiency of an indexing algorithm can be improved.

In order to achieve the above object, the present invention provides a method for indexing vector data, comprising:

establishing a vector training model;

inputting a query vector to the vector training model, wherein the vector training model outputs a plurality of approximate result vectors;

sorting the database vectors in a database by the size of their values in each dimension;

selecting a plurality of database vectors which are respectively positioned around each approximate result vector from the ordered database vectors as first vector data;

extracting the ids of all the first vector data, removing repeated ids, and extracting from the database the database vectors corresponding to the remaining ids as second vector data;

and calculating the distance between the second vector data and the query vector, and selecting a plurality of vector data with smaller distance from the second vector data.

Optionally, in the indexing method of vector data, the method for building a vector training model includes:

forming a generator;

forming a discriminator;

extracting a plurality of data samples from database vectors of a database, and training a discriminator by using the data samples and the noise samples;

and extracting a plurality of data samples from the database vectors of the database, and training the generator by using the data samples and noise samples.

Optionally, in the indexing method of vector data, the method for training the discriminator includes:

sampling m data samples from the database vector;

sampling m noise samples, and placing the m noise samples into a generator to generate m vectors;

and obtaining the distance between the m vectors and the distribution of the m data samples, and finishing the training of the discriminator when the distance is at its maximum.

Optionally, in the indexing method of vector data, the method of training the generator includes:

sampling m data samples from the database vector;

and obtaining the distance between the m vectors and the distribution of the m data samples, and finishing the training of the generator when the distance is at its minimum.

Optionally, in the indexing method of vector data, a generator is trained once using a plurality of noise samples.

Optionally, in the indexing method of vector data, the discriminator is trained multiple times using database vector samples.

Optionally, in the indexing method of vector data, a query vector is input to the vector training model, and the vector training model outputs 10 approximate result vectors.

Optionally, in the method for indexing vector data, the database vectors are sorted by dimension in descending order of value, or the database vectors are sorted by dimension in ascending order of value.

Optionally, in the method for indexing vector data, the method for sorting database vectors in the database according to dimension size includes: firstly, a table is formed by sorting according to the size of data in a first dimension, and then another table is formed by sorting according to the size of data in a second dimension.

Optionally, in the method for indexing vector data, selecting, from the sorted database vectors, a plurality of database vectors located around each of the approximate result vectors as the first vector data includes: selecting, from the sorted database vectors, the 10 128-dimensional database vectors located above each approximate result vector and the 10 128-dimensional database vectors located below it.

Optionally, in the method for indexing vector data, 20 database vectors respectively located around each of the approximate result vectors are selected from the ordered database vectors as the first vector data.

Optionally, in the method for indexing vector data, euclidean distance calculation is performed on the second vector data and the query vector, and a plurality of vector data with smaller distance are selected from the second vector data.

The indexing method of vector data provided by the invention comprises the following steps: establishing a vector training model; inputting a query vector to the vector training model, the vector training model outputting a plurality of approximate result vectors; sorting the database vectors in a database by the size of their values in each dimension; selecting, from the sorted database vectors, a plurality of database vectors located around each approximate result vector as first vector data; extracting the ids of all the first vector data, removing repeated ids, and extracting from the database the database vectors corresponding to the remaining ids as second vector data; and calculating the distance between the second vector data and the query vector, and selecting from the second vector data a plurality of vector data with smaller distances. The indexing method optimizes the query process and improves the efficiency of the indexing algorithm.

Drawings

FIG. 1 is a flow chart of a method of indexing vector data according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of database vector storage;

FIG. 3 is a schematic diagram of a query vector;

FIG. 4 is a schematic diagram of ordered database vectors.

Detailed Description

Specific embodiments of the present invention will be described in more detail below with reference to the drawings. The advantages and features of the present invention will become more apparent from the following description. It should be noted that the drawings are in a very simplified form and are all to a non-precise scale, merely for convenience and clarity in aiding in the description of embodiments of the invention.

In the following, the terms "first," "second," and the like are used to distinguish between similar elements and are not necessarily used to describe a particular order or chronological order. It is to be understood that such terms so used are interchangeable under appropriate circumstances. Similarly, if a method described herein comprises a series of steps, the order of the steps presented herein is not necessarily the only order in which the steps may be performed, and some of the described steps may be omitted and/or some other steps not described herein may be added to the method.

Referring to FIG. 1, the present invention provides a vector data indexing method, which includes:

S11: establishing a vector training model;

S12: inputting a query vector into the vector training model, the vector training model outputting a plurality of approximate result vectors;

S13: sorting the database vectors in a database by the size of their values in each dimension;

S14: selecting, from the sorted database vectors, a plurality of database vectors located around each approximate result vector as first vector data;

S15: extracting the ids of all the first vector data, removing repeated ids, and extracting from the database the database vectors corresponding to the remaining ids as second vector data;

S16: calculating the distance between the second vector data and the query vector, and selecting from the second vector data a plurality of vector data with smaller distances.

Preferably, the method for establishing the vector training model comprises the following steps: forming a generator; forming a discriminator; extracting a plurality of data samples from the database vectors of a database, and training the discriminator by using the data samples and noise samples; and extracting a plurality of data samples from the database vectors of the database, and training the generator by using the data samples and noise samples. The generator is trained once using a plurality of noise samples, whereas the discriminator is trained multiple times using database vector samples: because the generator does not always produce vectors that move in a more favorable direction, updating the generator frequently can make the model unstable.

Specifically, the vector training model in the embodiment of the present invention includes two networks, namely a generator G and a discriminator D. The generator and the discriminator are trained with the database vectors in the database to produce the vector training model, which is a generative adversarial network (GAN) model. The generator G is a network that generates vectors: it receives random noise z and generates a vector from it, denoted G(z). The discriminator D is a discrimination network that judges whether a vector is real. During training, the goal of the generator G is to generate vectors realistic enough to deceive the discriminator D, while the goal of the discriminator D is to distinguish the vectors generated by the generator G from real vectors as well as possible. Ideally, the generator G generates vectors good enough to pass for real ones, so that it is difficult for the discriminator D to determine whether a vector generated by the generator G is real.

The discriminator D may be trained with the database vectors in the database. First, m data samples $\{x^1, x^2, \ldots, x^m\}$ are sampled from the database vectors in the database. Then m noise samples $\{z^1, z^2, \ldots, z^m\}$ are sampled (the sampling method is prior art), and the m noise samples are put into the generator G to generate m vectors $\{r^1, r^2, \ldots, r^m\}$, where $r^i = G(z^i)$. Training then updates the parameter $\theta_d$ by gradient descent so that the distance between the m vectors and the distribution of the m data samples is at its maximum, which completes the training of the discriminator. Here $D(x^i)$ represents the probability with which the discriminator D judges the i-th real data sample to be real (because x is real, the closer this value is to 1, the better for the discriminator D), and $D(r^i)$ is the probability with which the discriminator D judges the i-th vector generated by G to be real.

Training the generator G comprises the following steps. First, m data samples $\{x^1, x^2, \ldots, x^m\}$ are sampled from the database vectors in the database. Next, m noise samples $\{z^1, z^2, \ldots, z^m\}$ are sampled and put into the generator G to generate m vectors $\{r^1, r^2, \ldots, r^m\}$, where $r^i = G(z^i)$. Training then updates the parameter $\theta_g$ by gradient descent so that the distance between the m vectors and the distribution of the m data samples is at its minimum, which completes the training of the generator G. Here $D(x^i)$ and $D(r^i)$ have the same meaning as in the training of the discriminator.
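The text defines $D(x^i)$ and $D(r^i)$, but the expression for the "distance" that is maximized for the discriminator and minimized for the generator is not given here. One possible reading, assuming the standard GAN value function (an assumption, not something this text confirms), would be:

```latex
% Assumed standard GAN objective; V, \theta_d and \theta_g follow the
% notation of the description, but the formula itself is an assumption.
V(\theta_d, \theta_g)
  = \frac{1}{m} \sum_{i=1}^{m}
    \left[ \log D\!\left(x^{i}\right) + \log\!\left(1 - D\!\left(r^{i}\right)\right) \right],
  \qquad r^{i} = G\!\left(z^{i}\right)

% Discriminator step: choose \theta_d to maximize V.
\theta_d \leftarrow \arg\max_{\theta_d} V(\theta_d, \theta_g)

% Generator step: only the second term depends on \theta_g.
\theta_g \leftarrow \arg\min_{\theta_g}
  \frac{1}{m} \sum_{i=1}^{m} \log\!\left(1 - D\!\left(G\!\left(z^{i}\right)\right)\right)
```

Under this reading, the discriminator pushes $D(x^i)$ toward 1 and $D(r^i)$ toward 0, while the generator pushes $D(r^i)$ toward 1, which matches the roles described for D and G.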

In the embodiment of the invention, the query vector is input to the vector training model; given one query vector as input, the vector training model of the embodiment can output a plurality of results.

The invention is used to find the desired data among the database vectors of a database according to an input query vector. The desired data all reside in the database, and the database holds a massive amount of data, so some data can be input in order to query data similar to it; similar vectors therefore need to be indexed from a database that has a large number of database vectors. The vector data in the embodiment of the invention may be data vectors converted from data such as text data or picture data. The database vectors may be stored as a plurality of rows of 128-dimensional (128-column) vector data, with the rows numbered by ids. In FIG. 2, for example, each row represents a database vector and the id column gives the id corresponding to each database vector; the numbers shown are for illustration only and are not actual database vector values. A query vector may be one 128-dimensional (128-column) row of vector data, such as that of FIG. 3, where the numbers are likewise schematic and are not actual query vector values. The database vectors are then sorted by dimension, in descending or ascending order of value. For example, one table is first formed by sorting on the values in the first dimension, then another table is formed by sorting on the values in the second dimension, and so on, giving 128 tables. FIG. 4 shows a sorted database vector table.
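The per-dimension sorting can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_sorted_tables`, the `(value, id)` row layout, and the tiny 3-dimensional database are assumptions made for the example (the embodiment uses 128 dimensions).

```python
from typing import Dict, List, Tuple

Vector = List[float]
Table = List[Tuple[float, int]]  # rows of (value in one dimension, id)


def build_sorted_tables(db: Dict[int, Vector]) -> List[Table]:
    """Build one table per dimension, each sorted in ascending order
    of the database vectors' values in that dimension."""
    if not db:
        return []
    dim = len(next(iter(db.values())))
    tables: List[Table] = []
    for d in range(dim):
        # Keep the id next to the value so the full vector can be
        # looked up in the database later.
        tables.append(sorted((vec[d], vid) for vid, vec in db.items()))
    return tables


# Hypothetical toy database: id -> vector.
db = {1: [0.9, 0.1, 0.5], 2: [0.2, 0.8, 0.4], 3: [0.5, 0.3, 0.7]}
tables = build_sorted_tables(db)
ids_by_dimension = [[vid for _, vid in table] for table in tables]
```

Each table holds every database vector exactly once, so a database with 128 dimensions yields 128 tables that differ only in row order.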

In the embodiment of the invention, 20 database vectors located around each approximate result vector are selected from the sorted database vectors and used as first vector data. Since 10 approximate result vectors were previously obtained, the database vectors around each approximate result vector are selected in turn. The database vectors are arranged in rows, so the database vectors around an approximate result are the 10 128-dimensional database vectors above it and the 10 128-dimensional database vectors below it. The embodiment of the invention has 128 database vector tables similar to that of FIG. 4, so in theory 1 × 128 × 20 = 2560 pieces of data are obtained for one approximate result vector, and a total of 2560 × 10 = 25600 pieces of data are obtained for the 10 approximate result vectors. Some data may be acquired repeatedly, so the number of first vector data should be less than or equal to 25600. Deduplication is therefore required: since each piece of vector data corresponds to one id, it suffices to check whether ids are repeated, remove the duplicates, and keep one copy of each id. The specific deduplication method is prior art and is not described in detail here.
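The selection and deduplication steps can be sketched as below. The text does not say how the position of an approximate result vector inside a sorted table is found; this sketch assumes a binary search on the vector's value in that table's dimension. The names `neighbours_in_table`, `candidate_ids`, and `max_candidates` are hypothetical, and a Python set stands in for the unspecified deduplication method.

```python
from bisect import bisect_left
from typing import List, Sequence, Set, Tuple

Table = List[Tuple[float, int]]  # (value in one dimension, id), ascending


def neighbours_in_table(table: Table, value: float, k: int) -> List[int]:
    """Ids of up to k rows above and k rows below the position where
    `value` would be inserted (assumed lookup rule: binary search)."""
    pos = bisect_left(table, (value,))
    return [vid for _, vid in table[max(0, pos - k):pos + k]]


def candidate_ids(tables: Sequence[Table],
                  approx_vectors: Sequence[Sequence[float]],
                  k: int = 10) -> Set[int]:
    """Collect the first vector data ids over all tables and all
    approximate result vectors; the set removes repeated ids."""
    ids: Set[int] = set()
    for approx in approx_vectors:
        for d, table in enumerate(tables):
            ids.update(neighbours_in_table(table, approx[d], k))
    return ids


def max_candidates(num_dims: int, k: int, num_approx: int) -> int:
    """Upper bound before deduplication: 2k rows per table per vector."""
    return num_dims * 2 * k * num_approx


# Hypothetical 2-dimensional toy data with a window of 1 row each side.
db = {1: [0.9, 0.1], 2: [0.2, 0.8], 3: [0.5, 0.3], 4: [0.6, 0.6]}
tables = [sorted((vec[d], vid) for vid, vec in db.items()) for d in range(2)]
found = candidate_ids(tables, [[0.55, 0.2]], k=1)
bound = max_candidates(num_dims=128, k=10, num_approx=10)
```

With the embodiment's settings (128 tables, 10 rows above and 10 below, 10 approximate result vectors), `max_candidates` reproduces the 25600 upper bound stated above; the deduplicated set is usually smaller because the same id tends to appear in several tables.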

In the embodiment of the invention, euclidean distance calculation is carried out on the second vector data and the query vector, and a plurality of vector data with smaller distance are selected from the second vector data. The embodiment of the invention adopts Euclidean distance calculation, and in other embodiments of the invention, other distance calculation methods can be adopted.

In the embodiment of the invention, the distance between the second vector data and the query vector is calculated, and the 10 pieces of vector data with the smallest distances are selected from the second vector data. Because there are multiple pieces of second vector data, the distance from each of them to the query vector must be calculated, which yields multiple distances. These can be sorted in ascending order and the first several pieces of data, those with the smaller distances, selected. In other embodiments of the present invention, a different number of vector data with smaller distances, such as 100, may be selected; the specific number may be set according to the required accuracy.
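The final re-ranking step can be sketched as follows, assuming the second vector data is available as a set of ids into an in-memory dictionary. The function name `top_k_by_euclidean` and the toy data are hypothetical; `math.dist` and `heapq.nsmallest` are standard-library calls (Python 3.8 or later for `math.dist`).

```python
import heapq
import math
from typing import Dict, Iterable, List, Sequence, Tuple


def top_k_by_euclidean(query: Sequence[float],
                       db: Dict[int, Sequence[float]],
                       candidate_ids: Iterable[int],
                       k: int = 10) -> List[Tuple[float, int]]:
    """Return the k (distance, id) pairs with the smallest Euclidean
    distance to the query, in ascending order of distance."""
    scored = ((math.dist(query, db[vid]), vid) for vid in candidate_ids)
    return heapq.nsmallest(k, scored)


# Hypothetical toy data: three candidates, keep the two nearest.
db = {1: [0.0, 0.0], 2: [3.0, 4.0], 3: [1.0, 0.0]}
nearest = top_k_by_euclidean([0.0, 0.0], db, {1, 2, 3}, k=2)
```

Only the deduplicated candidates are scored, so the number of distance computations is bounded by the size of the second vector data rather than by the size of the database. Raising `k` (for example to 100) corresponds to the other embodiments mentioned above.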

When the indexing method of vector data in the embodiment of the invention is applied in practice to query similar results for 10 query vectors in a database with 1,000,000 database vectors, the latency of a single query vector is 4.673 ms and the recall rate is 85.96%. Under the same conditions, for several prior-art methods, the single-query latency of the Flat method is 144.146 ms with a recall rate of 100%; that of the IVFFlat method is 80.821 ms with a recall rate of 68.73%; and that of the IVFPQ method is 7.436 ms with a recall rate of 55.68%. The indexing method of the embodiment therefore has the lowest single-query latency of the four methods, and a higher recall rate than the IVFFlat and IVFPQ methods, so it improves the efficiency of indexing vector data.

In summary, the method for indexing vector data provided in the embodiment of the present invention includes: establishing a vector training model; inputting a query vector into the vector training model, the vector training model outputting a plurality of approximate result vectors; sorting the database vectors in a database by the size of their values in each dimension; selecting, from the sorted database vectors, a plurality of database vectors located around each approximate result vector as first vector data; extracting the ids of all the first vector data, removing repeated ids, and extracting from the database the database vectors corresponding to the remaining ids as second vector data; and calculating the distance between the second vector data and the query vector, and selecting from the second vector data a plurality of vector data with smaller distances. The indexing method optimizes the query process, improves the efficiency of the indexing algorithm, and also improves the accuracy of the indexing result.
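The latency and recall figures reported for this embodiment can be put side by side with a few lines of arithmetic. The sketch below only restates the numbers given in the description; the dictionary layout, the label "embodiment", and the function name `latency_ratios` are assumptions made for illustration.

```python
from typing import Dict, Tuple

# (single-query latency in ms, recall in %) as reported in the description.
results: Dict[str, Tuple[float, float]] = {
    "embodiment": (4.673, 85.96),
    "Flat": (144.146, 100.0),
    "IVFFlat": (80.821, 68.73),
    "IVFPQ": (7.436, 55.68),
}


def latency_ratios(data: Dict[str, Tuple[float, float]],
                   baseline: str) -> Dict[str, float]:
    """Latency of each method divided by the latency of `baseline`."""
    base_latency = data[baseline][0]
    return {name: latency / base_latency
            for name, (latency, _) in data.items()}


ratios = latency_ratios(results, "embodiment")
for name, (latency, recall) in results.items():
    print(f"{name:10s} {latency:8.3f} ms  recall {recall:6.2f}%  "
          f"latency ratio {ratios[name]:5.2f}x")
```

The ratios make the trade-off explicit: the Flat method keeps full recall at a much higher latency, while the two clustering-based methods are both slower and lower in recall than the embodiment on this measurement.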

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Any equivalent substitution or modification that a person skilled in the art makes to the technical solution and technical content disclosed by the invention, without departing from the scope of the technical solution of the invention, remains within the scope of the technical solution of the invention.

Claims

1. A method of indexing vector data, comprising:

establishing a vector training model;

sorting the database vectors in a database by the size of their values in each dimension;

2. The method of indexing vector data according to claim 1, wherein the method of building a vector training model comprises:

forming a generator;

forming a discriminator;

3. The method of indexing vector data according to claim 2, wherein the method of training the discriminator comprises:

sampling m data samples from the database vector;

4. The method of indexing vector data according to claim 2, wherein the method of training the generator comprises:

sampling m data samples from the database vector;

5. The method of indexing vector data of claim 2 wherein the generator is trained once using a number of noise samples.

6. The method of indexing vector data of claim 2 wherein the discriminator is trained multiple times using database vector samples.

7. The method of indexing vector data according to claim 1, wherein a query vector is input to the vector training model, and the vector training model outputs 10 approximate result vectors.

8. The method of indexing vector data according to claim 1, wherein the database vectors are ordered from large to small in dimension or from small to large in dimension.

9. The method of indexing vector data as claimed in claim 1, wherein the method of ordering database vectors in the database by dimension size comprises: firstly, a table is formed by sorting according to the size of data in a first dimension, and then another table is formed by sorting according to the size of data in a second dimension.

10. The method of indexing vector data according to claim 9, wherein selecting, from the sorted database vectors, a plurality of database vectors located around each of the approximate result vectors as the first vector data comprises: selecting, from the sorted database vectors, the 10 128-dimensional database vectors located above each approximate result vector and the 10 128-dimensional database vectors located below it.

11. The method of indexing vector data according to claim 1, wherein 20 database vectors respectively located around each of the approximate result vectors are selected from the sorted database vectors as the first vector data.

12. The indexing method of vector data according to claim 1, wherein the second vector data is subjected to euclidean distance calculation with the query vector, and a plurality of pieces of vector data having smaller distances are selected from the second vector data.