CN116166690A

CN116166690A - Hybrid vector retrieval method and device for high-concurrency scenarios

Info

Publication number: CN116166690A
Application number: CN202310199075.6A
Authority: CN
Inventors: 张明清; 徐小良
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2023-03-03
Filing date: 2023-03-03
Publication date: 2023-05-26

Abstract

The invention discloses a hybrid vector retrieval method and device for high-concurrency scenarios. For the vector data, a graph index and a quantization index are constructed from vector distances using a neighbor graph and quantization coding, and the constructed indexes are persisted on an SSD. At query time, given a large number of highly concurrent query vectors, search candidate points are first obtained from the quantization index. A plurality of queues is then established, and the search candidate points are distributed into the corresponding queues. For each queue, the search candidate point at the head of the queue is dispatched to read the SSD and obtain the persisted neighbor information of the graph index. Finally, a greedy search is performed and the approximate nearest neighbors of each query point are returned. For large-scale, high-concurrency query workloads, the invention schedules and distributes queries efficiently, avoids the latency risk caused by blocked queries, optimizes the SSD read strategy, balances disk reads against vector computation, and improves vector search speed.

Description

Hybrid vector retrieval method and device for high-concurrency scenarios

Technical Field

The invention belongs to the field of approximate nearest neighbor search, and in particular relates to a hybrid vector retrieval method and device for high-concurrency scenarios.

Background

In the era of the digital economy, new-generation technologies such as big data, cloud computing, and the mobile internet have produced massive amounts of unstructured data on the network, such as images, videos, and text. To search this data effectively, vector search methods based on trees, hashing, quantization, and neighbor graphs are commonly used. Among these, index construction based on neighbor graphs is currently the mainstream approach in vector search because of its excellent search performance.

However, current neighbor-graph-based methods depend heavily on memory, and their memory cost rises steeply as the data scale grows. A mainstream solution is a hybrid vector search strategy that combines memory and disk: the memory-hungry neighbor graph index is stored on disk, which reduces memory usage during search.

However, as the number of queries grows rapidly and query workloads become highly concurrent, efficiently scheduling and distributing queries and optimizing disk reads increasingly become the bottleneck of retrieval speed.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a hybrid vector retrieval method for high-concurrency scenarios, comprising the following steps:

(1) Acquiring a vector data set V, and constructing indexes for the data in V according to vector distances using a neighbor graph and product quantization, respectively, to obtain a corresponding graph index and product quantization index; persisting the graph index and the product quantization index on an SSD;

(2) Obtaining a query vector set Q = {q₁, q₂, q₃, ..., qᵢ}, the query vector set Q containing a plurality of query vectors qᵢ;

Performing an approximate nearest neighbor search:

the first step: obtaining a plurality of search candidate points according to the product quantization index;

the second step: establishing a plurality of queues, wherein each queue is a queue data structure;

the third step: for each of the plurality of search candidate points, assigning the current search candidate point to the most idle queue according to how idle the queues are;

the fourth step: for each queue established in the second step that holds search candidate points, dispatching the search candidate point at the head of the queue to read the persisted graph index and obtain the neighbor information in the graph index;

the head of a queue refers to the first data item in the queue;

(3) For each query vector qᵢ in the query vector set Q, performing a greedy search according to the neighbor information read from the graph index, recording the nearest neighbors and the vector distances between the query vector qᵢ and those nearest neighbors, and sorting them from near to far by distance to obtain the search result for the query vector qᵢ.
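The greedy search of step (3) can be illustrated with a minimal sketch. The patent does not specify the exact search procedure, so the best-first traversal below, the function names `l2` and `greedy_search`, and the `k` and `beam` parameters are all assumptions made for illustration. The graph is held in plain dictionaries here, whereas the method reads neighbor information from the SSD.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_search(query, vectors, neighbors, entry_points, k=3, beam=8):
    """Best-first search over a neighbor graph (illustrative only).

    vectors:      dict mapping point id -> vector
    neighbors:    dict mapping point id -> list of neighbor ids (graph index)
    entry_points: starting ids, e.g. candidates from the quantization index
    Returns up to k (distance, id) pairs sorted from near to far.
    """
    visited = set(entry_points)
    frontier = [(l2(query, vectors[p]), p) for p in entry_points]
    heapq.heapify(frontier)
    best = sorted(frontier)[:beam]  # current best results, nearest first
    while frontier:
        dist, node = heapq.heappop(frontier)
        # Stop once the closest unexpanded point cannot improve the results.
        if len(best) >= beam and dist > best[-1][0]:
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(query, vectors[nb])
            if len(best) < beam or d < best[-1][0]:
                heapq.heappush(frontier, (d, nb))
                best.append((d, nb))
                best.sort()
                del best[beam:]
    return best[:k]
```

A larger `beam` explores more of the graph and trades speed for recall; the patent itself does not state such a parameter.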

Preferably, in the second step of step (2), the queues include queues with a linked-list structure and queues with a sequential-list structure, and access to a queue is allowed only at its head and tail.
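As a minimal sketch of a linked-list queue, the class below assumes first-in, first-out behaviour, with items entering at the tail and leaving from the head. The text only states that access is restricted to the head and tail, so this direction, and the names `LinkedQueue`, `enqueue`, `peek`, and `dequeue`, are assumptions.

```python
class _Node:
    """One cell of the linked list."""
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


class LinkedQueue:
    """FIFO queue: items enter at the tail and leave from the head."""

    def __init__(self):
        self._head = None
        self._tail = None
        self._size = 0

    def __len__(self):
        return self._size

    def enqueue(self, value):
        """Append a value at the tail of the queue."""
        node = _Node(value)
        if self._tail is None:
            self._head = node
        else:
            self._tail.next = node
        self._tail = node
        self._size += 1

    def peek(self):
        """Return the value at the head without removing it."""
        if self._head is None:
            raise IndexError("peek from empty queue")
        return self._head.value

    def dequeue(self):
        """Remove and return the value at the head of the queue."""
        if self._head is None:
            raise IndexError("dequeue from empty queue")
        node = self._head
        self._head = node.next
        if self._head is None:
            self._tail = None
        self._size -= 1
        return node.value
```

Python's standard `collections.deque` offers the same head and tail operations and could be used in place of a hand-written class.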

In a preferred embodiment, the specific policy used in the third step of step (2) for assigning the current search candidate point to the most idle queue according to how idle the queues are comprises:

for the query vector qᵢ that currently needs to be assigned, if an empty queue exists, the empty queue is filled first; if no queue is empty, the query vector qᵢ is assigned to the queue with the shortest current length, according to the current queuing state of the queues;

the queues and the query vectors qᵢ are stored by a plurality of hosts.
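The allocation policy above amounts to shortest-queue assignment, since an empty queue is by definition the shortest. The sketch below is a single-process illustration; the function name `assign_to_most_idle` and the tie-breaking rule (lowest queue index wins) are assumptions, and the multi-host storage mentioned in the text is not modelled.

```python
from collections import deque


def assign_to_most_idle(queues, item):
    """Append item to the shortest queue and return that queue's index.

    An empty queue has length 0, so it is always preferred when one exists.
    Ties are broken by the lowest index (an assumption of this sketch).
    """
    idx = min(range(len(queues)), key=lambda i: len(queues[i]))
    queues[idx].append(item)
    return idx


if __name__ == "__main__":
    qs = [deque() for _ in range(3)]
    for name in ("q1", "q2", "q3", "q4"):
        print(name, "->", assign_to_most_idle(qs, name))
```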

Preferably, the specific reading process in the fourth step of step (2) is as follows: for the candidate search points that have currently reached the head of each queue, one CPU thread resource is allocated to each queue holding such a candidate search point; the thread interacts with a read-write thread on the SSD controller, obtains the location of the neighbor information from the graph index, and reads the neighbor information from the corresponding storage area of the SSD according to that location.
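The reading process can be sketched with one thread per queue, each reading the neighbor record of the point at the head of its queue from a file. The on-disk layout used here (fixed-size records located at `node_id * record_size`), the `MAX_DEGREE` constant, and all function names are assumptions; the patent does not describe the file format, and an ordinary file stands in for the SSD storage area.

```python
import os
import struct
import tempfile
import threading
from collections import deque

MAX_DEGREE = 4  # hypothetical upper bound on neighbors per point
RECORD = struct.Struct("<I%dI" % MAX_DEGREE)  # degree, then padded neighbor ids


def write_graph_index(path, adjacency):
    """Persist adjacency lists as fixed-size records, one per point id."""
    with open(path, "wb") as f:
        for nbs in adjacency:
            nbs = list(nbs)[:MAX_DEGREE]
            padded = nbs + [0] * (MAX_DEGREE - len(nbs))
            f.write(RECORD.pack(len(nbs), *padded))


def read_neighbors(f, node):
    """Seek to the record of `node` and return its neighbor ids."""
    f.seek(node * RECORD.size)
    degree, *ids = RECORD.unpack(f.read(RECORD.size))
    return ids[:degree]


def drain_queue(path, queue, results, lock):
    """Worker for one queue: repeatedly serve the point at the queue head."""
    with open(path, "rb") as f:  # each thread uses its own file handle
        while queue:
            node = queue.popleft()
            nbs = read_neighbors(f, node)
            with lock:
                results[node] = nbs


def read_all(path, queues):
    """Start one thread per queue and collect neighbor lists by point id."""
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=drain_queue, args=(path, q, results, lock))
               for q in queues]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        index_path = os.path.join(tmp, "graph.idx")
        write_graph_index(index_path, [[1, 2], [0], [0, 3], [2]])
        print(read_all(index_path, [deque([0, 2]), deque([1, 3])]))
```

Because each queue is drained only by its own thread, the queues need no locking in this sketch; only the shared result dictionary is guarded.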

The invention also provides a hybrid vector retrieval device for high-concurrency scenarios, comprising:

an index construction module for:

acquiring a vector data set V, and constructing indexes for the data in V according to vector distances using a neighbor graph and product quantization, respectively, to obtain a corresponding graph index and product quantization index; persisting the graph index and the product quantization index on an SSD;

a neighbor information reading module for:

obtaining a query vector set Q = {q₁, q₂, q₃, ..., qᵢ}, the query vector set Q containing a plurality of query vectors qᵢ;

Performing an approximate nearest neighbor search:

a search ordering module for:

for each query vector qᵢ in the query vector set Q, performing a greedy search according to the neighbor information read from the graph index, recording the nearest neighbors and the vector distances between the query vector qᵢ and those nearest neighbors, and sorting them from near to far by distance to obtain the search result for the query vector qᵢ.
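To illustrate the product quantization index built by the index construction module and how it can yield search candidate points, the following sketch uses the standard product quantization technique: each vector is split into `m` sub-vectors, each sub-vector is replaced by the id of its nearest k-means centroid, and candidates are ranked by a table-based approximate distance. The parameters `m`, `k`, `iters`, and `seed`, the use of plain k-means, and all function names are assumptions; the patent does not give these details.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Keep the old centroid when a cluster receives no points.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[c]
                   for c, g in enumerate(groups)]
    return centers


def train_pq(vectors, m, k):
    """Train one codebook of k centroids for each of the m sub-spaces."""
    d = len(vectors[0]) // m
    return [kmeans([v[s * d:(s + 1) * d] for v in vectors], k) for s in range(m)]


def encode(v, codebooks):
    """Replace each sub-vector by the id of its nearest centroid."""
    d = len(v) // len(codebooks)
    return [min(range(len(cb)), key=lambda c: sq_dist(v[s * d:(s + 1) * d], cb[c]))
            for s, cb in enumerate(codebooks)]


def pq_candidates(query, codes, codebooks, n):
    """Rank encoded points by approximate distance; return the n nearest ids."""
    d = len(query) // len(codebooks)
    table = [[sq_dist(query[s * d:(s + 1) * d], c) for c in cb]
             for s, cb in enumerate(codebooks)]

    def approx(code):
        return sum(table[s][c] for s, c in enumerate(code))

    return sorted(range(len(codes)), key=lambda i: approx(codes[i]))[:n]
```

The returned ids could serve as the search candidate points that are then distributed into queues; only the compact codes and codebooks need to be consulted at this stage.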

The beneficial effects of the invention are as follows:

For large-scale, high-concurrency query workloads, the invention schedules and distributes queries efficiently, avoids the latency risk caused by blocked queries, optimizes the SSD read strategy, balances disk reads against vector computation, and improves vector search speed.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description of the drawings is provided below, and some specific examples of the present invention will be described in detail below by way of example and not by way of limitation with reference to the accompanying drawings. The same reference numbers will be used throughout the drawings to refer to the same or like parts or portions. It will be appreciated by those skilled in the art that the drawings are not necessarily drawn to scale.

In the accompanying drawings:

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic flow chart of an embodiment of the present invention.

Detailed Description

To make the aspects and features of the present invention clearer, the invention is further described below with reference to FIG. 2 and the accompanying text.

The technical solution of the invention can be applied to any scenario that requires vector search. Each vector is a feature vector extracted from a data item of any type, such as a picture, speech, or text, and these feature vectors together form the vector data set V.

The following embodiment illustrates vector search in an application scenario in which pictures are searched by picture:

(1) Acquiring a picture vector data set V, wherein the picture vector data are obtained by vectorizing picture data using a machine learning technique; constructing indexes for the data in V according to vector distances using a neighbor graph and product quantization, respectively, to obtain a corresponding graph index and product quantization index; persisting the graph index and the product quantization index on an SSD;

(2) For a plurality of pictures to be queried, converting each query picture into a query vector using a machine learning technique, yielding a query vector set Q = {q₁, q₂, q₃, ..., qᵢ}, the query vector set Q containing a plurality of query vectors qᵢ;

Performing an approximate nearest neighbor search:

the head of a queue refers to the first data item in the queue;

In the third step of step (2), the specific policy for assigning the current search candidate point to the most idle queue according to how idle the queues are comprises:

In the fourth step of step (2), the specific reading process is as follows: for the candidate search points that have currently reached the head of each queue, one CPU thread resource is allocated to each queue holding such a candidate search point; the thread interacts with a read-write thread on the SSD controller, obtains the location of the neighbor information from the graph index, and reads the neighbor information from the corresponding storage area of the SSD according to that location.

A hybrid vector retrieval device for high-concurrency scenarios, comprising:

an index construction module for:

acquiring a picture vector data set V, wherein the picture vector data are obtained by vectorizing picture data using a machine learning technique; constructing indexes for the data in the picture vector data set V according to vector distances using a neighbor graph and product quantization, respectively, to obtain a corresponding graph index and product quantization index; persisting the graph index and the product quantization index on an SSD;

a neighbor information reading module for:

obtaining a query vector set Q = {q₁, q₂, q₃, ..., qᵢ}, the query vector set Q containing a plurality of query vectors qᵢ;

Performing an approximate nearest neighbor search:

a search ordering module for:

for each query vector qᵢ in the query vector set Q, performing a greedy search according to the neighbor information read from the graph index, recording the nearest neighbors and the vector distances between the query vector qᵢ and those nearest neighbors, and sorting them from near to far by distance to obtain the search result for the query vector qᵢ.

While the invention has been described with respect to certain preferred embodiments, it will be apparent to those skilled in the art that various changes and substitutions can be made herein without departing from the scope of the invention as defined by the appended claims.

Claims

1. A hybrid vector retrieval method for high-concurrency scenarios, characterized by comprising the following steps:

Performing an approximate nearest neighbor search:

the head of a queue refers to the first data item in the queue;

2. The hybrid vector retrieval method for high-concurrency scenarios according to claim 1, wherein

in the second step of step (2), the queues include queues with a linked-list structure and queues with a sequential-list structure, and access to a queue is allowed only at its head and tail.

3. The hybrid vector retrieval method for high-concurrency scenarios according to claim 1, wherein the specific policy used in the third step of step (2) for assigning the current search candidate point to the most idle queue according to how idle the queues are comprises:

4. The hybrid vector retrieval method for high-concurrency scenarios according to claim 1, wherein

the specific reading process in the fourth step of step (2) is as follows: for the candidate search points that have currently reached the head of each queue, one CPU thread resource is allocated to each queue holding such a candidate search point; the thread interacts with a read-write thread on the SSD controller, obtains the location of the neighbor information from the graph index, and reads the neighbor information from the corresponding storage area of the SSD according to that location.

5. A hybrid vector retrieval device for high-concurrency scenarios, comprising:

an index construction module for:

a neighbor information reading module for:

obtaining a query vector set Q = {q₁, q₂, q₃, ..., qᵢ}, the query vector set Q containing a plurality of query vectors qᵢ;

Performing an approximate nearest neighbor search:

a search ordering module for:

for each query vector qᵢ in the query vector set Q, performing a greedy search according to the neighbor information read from the graph index, recording the nearest neighbors and the vector distances between the query vector qᵢ and those nearest neighbors, and sorting them from near to far by distance to obtain the search result for the query vector qᵢ.