CN116166690A - Mixed vector retrieval method and device for high concurrency scene - Google Patents

Mixed vector retrieval method and device for high concurrency scene

Info

Publication number
CN116166690A
Authority
CN
China
Prior art keywords
vector
queue
search
query
queues
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310199075.6A
Other languages
Chinese (zh)
Inventor
张明清
徐小良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-03-03
Filing date
2023-03-03
Publication date
2023-05-26
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202310199075.6A priority Critical patent/CN116166690A/en
Publication of CN116166690A publication Critical patent/CN116166690A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a mixed vector retrieval method and device for high-concurrency scenarios. For vector data, a graph index and a quantization index are constructed according to vector distance, using a neighbor graph and quantization coding respectively, and the constructed indexes are stored persistently on an SSD. At query time, given a large number of highly concurrent query vectors, search candidate points are first obtained from the quantization index; a plurality of queues is then established and the search candidate points are assigned to the corresponding queues; for each queue, the search candidate point at the head of the queue is dispatched to read the SSD and obtain the persistently stored neighbor information of the graph index; finally, a greedy algorithm is used to search, and the approximate nearest neighbors of the query point are returned. For large-scale, high-concurrency query workloads, the invention schedules queries efficiently, avoids the latency risk caused by blocked queries, optimizes the SSD read strategy, balances disk reads against vector computation, and improves vector search speed.

Description

Mixed vector retrieval method and device for high concurrency scene
Technical Field
The invention belongs to the field of approximate nearest neighbor searching, and particularly relates to a high concurrency scene-oriented hybrid vector retrieval method and device.
Background
In the era of the digital economy, new-generation technologies such as big data, cloud computing and the mobile internet have produced massive amounts of unstructured data on the network, such as images, videos and text. To search this data effectively, tree-based, hash-based, quantization-based and neighbor-graph-based vector search methods are commonly used; among these, algorithms that build the index on a neighbor graph are currently the mainstream in vector search because of their excellent search performance.
However, current neighbor-graph-based methods depend heavily on memory, and the memory cost becomes high as the data scale grows sharply. A mainstream solution is a hybrid vector search strategy that combines memory and hard disk: the neighbor graph index, which occupies a large amount of memory, is stored on the hard disk, which reduces memory usage during search.
However, as the number of queries grows rapidly and queries arrive with high concurrency, efficiently scheduling the queries and optimizing hard disk reads increasingly becomes the bottleneck for retrieval speed.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a mixed vector retrieval method for high-concurrency scenarios, comprising the following steps:
(1) Acquiring a vector data set V, and constructing two indexes over the data in V according to vector distance, one using a neighbor graph and one using product quantization, to obtain a graph index and a product quantization index; persistently storing the graph index and the product quantization index on an SSD;
(2) Obtaining a query vector set Q = {q_1, q_2, q_3, ..., q_i}, the query vector set Q comprising a plurality of query vectors q_i;
and performing an approximate nearest neighbor search:
first step, obtaining a plurality of search candidate points from the product quantization index;
second step, establishing a plurality of queues, a queue being a data structure;
third step, for each of the plurality of search candidate points, assigning the current search candidate point to the most idle queue according to how idle the queues are;
fourth step, for each queue established in the second step that holds search candidate points, dispatching the search candidate point at the head of the queue to read the persistently stored graph index and obtain the neighbor information in the graph index;
the head of the queue being the first item of data in the queue;
(3) For each query vector q_i in the query vector set Q, performing a greedy search according to the neighbor information read from the graph index, recording the nearest neighbors and the vector distance from the query vector q_i to each nearest neighbor, and sorting the nearest neighbors by distance from near to far to obtain the search result for the query vector q_i.
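The first step of step (2), obtaining search candidate points from the product quantization index, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation claimed by the patent: the function names (`train_codebooks`, `encode`, `candidates`), the use of plain k-means for the codebooks, and the table-lookup distance estimate are all choices made for the example, and the index is held in memory here rather than on an SSD.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, p):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], p))

def train_codebooks(vectors, m, k, iters=5, seed=0):
    """Learn one k-centroid codebook per sub-space with a few k-means rounds."""
    rng = random.Random(seed)
    sub = len(vectors[0]) // m
    books = []
    for j in range(m):
        pts = [tuple(v[j * sub:(j + 1) * sub]) for v in vectors]
        cents = rng.sample(pts, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in pts:
                groups[nearest(cents, p)].append(p)
            # Recompute each centroid as the mean of its group; keep it if the group is empty.
            cents = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                     for g, c in zip(groups, cents)]
        books.append(cents)
    return books

def encode(v, books):
    """Product-quantization code of v: one centroid id per sub-space."""
    sub = len(v) // len(books)
    return [nearest(book, v[j * sub:(j + 1) * sub]) for j, book in enumerate(books)]

def candidates(query, codes, books, n):
    """Ids of the n vectors whose codes are closest to the query (approximate)."""
    sub = len(query) // len(books)
    # One lookup table per sub-space: distance from the query slice to every centroid.
    tables = [[sq_dist(query[j * sub:(j + 1) * sub], c) for c in book]
              for j, book in enumerate(books)]
    approx = [sum(tables[j][code[j]] for j in range(len(books))) for code in codes]
    return sorted(range(len(codes)), key=lambda i: approx[i])[:n]
```

In this sketch the approximate distance of a stored vector is the sum of per-sub-space table entries, so ranking all codes costs only table lookups; the returned ids play the role of the search candidate points that the later steps place into queues.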
Preferably, in the second step of step (2), the queues include queues implemented with a linked-list structure and queues implemented with a sequential-list structure, and data may enter and leave a queue only at its head and tail.
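The two queue layouts mentioned above can be sketched as follows. This is an illustrative sketch only: the class names `LinkedQueue` and `SequentialQueue` and the fixed capacity of the sequential variant are assumptions made for the example, and the sketch takes "only head and tail" to mean that items enter at the tail and leave at the head.

```python
class LinkedQueue:
    """FIFO queue over a singly linked list: enqueue at the tail, dequeue at the head."""

    class _Node:
        __slots__ = ("value", "next")

        def __init__(self, value):
            self.value = value
            self.next = None

    def __init__(self):
        self._head = None
        self._tail = None
        self._size = 0

    def push_tail(self, value):
        node = self._Node(value)
        if self._tail is None:
            self._head = node
        else:
            self._tail.next = node
        self._tail = node
        self._size += 1

    def pop_head(self):
        if self._head is None:
            raise IndexError("pop from empty queue")
        node = self._head
        self._head = node.next
        if self._head is None:
            self._tail = None
        self._size -= 1
        return node.value

    def __len__(self):
        return self._size


class SequentialQueue:
    """FIFO queue over a fixed-size array used as a circular buffer."""

    def __init__(self, capacity):
        self._buf = [None] * capacity
        self._start = 0  # position of the current head
        self._size = 0

    def push_tail(self, value):
        if self._size == len(self._buf):
            raise OverflowError("queue is full")
        self._buf[(self._start + self._size) % len(self._buf)] = value
        self._size += 1

    def pop_head(self):
        if self._size == 0:
            raise IndexError("pop from empty queue")
        value = self._buf[self._start]
        self._buf[self._start] = None
        self._start = (self._start + 1) % len(self._buf)
        self._size -= 1
        return value

    def __len__(self):
        return self._size
```

Both classes expose the same three operations, so a scheduler that only needs `push_tail`, `pop_head` and `len` can use either layout interchangeably.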
Preferably, in the third step of step (2), the allocation policy for assigning the current search candidate point to the most idle queue according to how idle the queues are comprises:
for the query vector q_i currently to be assigned, if an empty queue exists, placing the query vector in the empty queue first; if no queue is empty, assigning the query vector q_i, according to the current queuing state of the queues, to the queue with the shortest current queue length;
the queues and the query vectors q_i being stored by a plurality of hosts.
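The allocation policy just described can be sketched in a few lines. This is a hedged illustration rather than the patented scheduler: the function name `assign` and the use of `collections.deque` are assumptions, the tie-breaking rule (lowest queue index wins) is a choice the source does not specify, and the multi-host storage of queues is not modelled.

```python
from collections import deque

def assign(queues, item):
    """Append item to the least loaded queue and return that queue's index.

    An empty queue has length 0, so picking the minimum length covers both
    rules of the policy: empty queues first, otherwise the shortest queue.
    Ties go to the lowest index (an assumption of this sketch).
    """
    target = min(range(len(queues)), key=lambda i: len(queues[i]))
    queues[target].append(item)
    return target

queues = [deque() for _ in range(3)]
placements = [assign(queues, f"q{n}") for n in range(1, 6)]
print(placements)  # [0, 1, 2, 0, 1]
```

The first three items each land in an empty queue; once no queue is empty, later items go to whichever queue is currently shortest.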
Preferably, the reading in the fourth step of step (2) proceeds as follows: for the candidate search points that have currently reached the head of each queue, one CPU thread resource is allocated to each queue holding such a candidate search point; the thread interacts with a read-write thread on the SSD controller, obtains the position of the neighbor information from the graph index, and reads the neighbor information from the corresponding storage area of the SSD according to that position.
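The per-queue read path can be sketched as follows. This is a simplified illustration under several assumptions that are not in the source: the graph index is an ordinary file of fixed-size records (a neighbor count followed by `R` neighbor ids), the position of a node's neighbor information is `node_id * record size`, and ordinary file reads stand in for the interaction with the SSD controller. The names `write_graph`, `read_neighbors`, `drain` and `read_all` are hypothetical.

```python
import struct
import threading
from collections import deque

R = 4                               # assumed maximum out-degree per node
RECORD = struct.Struct(f"<I{R}i")   # neighbor count, then R ids padded with -1

def write_graph(path, adjacency):
    """Persist the graph index as one fixed-size record per node."""
    with open(path, "wb") as f:
        for nbrs in adjacency:
            padded = list(nbrs) + [-1] * (R - len(nbrs))
            f.write(RECORD.pack(len(nbrs), *padded))

def read_neighbors(f, node_id):
    """Seek to the node's record position and read its neighbor ids."""
    f.seek(node_id * RECORD.size)
    count, *ids = RECORD.unpack(f.read(RECORD.size))
    return ids[:count]

def drain(path, queue, out, lock):
    """One worker per queue: repeatedly take the head and read its neighbors."""
    with open(path, "rb") as f:      # each thread uses its own file handle
        while queue:
            node = queue.popleft()   # the head of the queue
            nbrs = read_neighbors(f, node)
            with lock:
                out[node] = nbrs

def read_all(path, queues):
    """Start one thread per queue and collect neighbor lists keyed by node id."""
    out, lock = {}, threading.Lock()
    threads = [threading.Thread(target=drain, args=(path, q, out, lock))
               for q in queues]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Because each queue is drained by exactly one thread, only the shared result dictionary needs a lock; the fixed record size is what lets a node id be turned into a file position without a separate offset table.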
The invention also provides a mixed vector retrieval device for high-concurrency scenarios, comprising:
an index construction module, configured to:
acquire a vector data set V, and construct two indexes over the data in V according to vector distance, one using a neighbor graph and one using product quantization, to obtain a graph index and a product quantization index; and persistently store the graph index and the product quantization index on an SSD;
a neighbor information reading module, configured to:
obtain a query vector set Q = {q_1, q_2, q_3, ..., q_i}, the query vector set Q comprising a plurality of query vectors q_i;
and perform an approximate nearest neighbor search:
first step, obtaining a plurality of search candidate points from the product quantization index;
second step, establishing a plurality of queues, a queue being a data structure;
third step, for each of the plurality of search candidate points, assigning the current search candidate point to the most idle queue according to how idle the queues are;
fourth step, for each queue established in the second step that holds search candidate points, dispatching the search candidate point at the head of the queue to read the persistently stored graph index and obtain the neighbor information in the graph index;
a search ordering module, configured to:
for each query vector q_i in the query vector set Q, perform a greedy search according to the neighbor information read from the graph index, record the nearest neighbors and the vector distance from the query vector q_i to each nearest neighbor, and sort the nearest neighbors by distance from near to far to obtain the search result for the query vector q_i.
The invention has the beneficial effects that:
for large-scale, high-concurrency query workloads, the invention schedules queries efficiently, avoids the latency risk caused by blocked queries, optimizes the SSD read strategy, balances disk reads against vector computation, and improves vector search speed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description of the drawings is provided below, and some specific examples of the present invention will be described in detail below by way of example and not by way of limitation with reference to the accompanying drawings. The same reference numbers will be used throughout the drawings to refer to the same or like parts or portions. It will be appreciated by those skilled in the art that the drawings are not necessarily drawn to scale.
In the accompanying drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic flow chart of an embodiment of the present invention.
Detailed Description
In order to make the aspects and features of the present invention more apparent, the present invention will be further described with reference to fig. 2 and the accompanying text.
The technical solution of the invention can be applied to any scenario that requires vector search. A vector here is the feature vector extracted from an item of data of any type, such as a picture, speech or text, and the feature vectors together form the vector data set V.
The following embodiment illustrates vector search in an image search application scenario:
(1) Acquiring a picture vector data set V, the picture vector data being obtained by vectorizing picture data with a machine learning technique; constructing two indexes over the data in V according to vector distance, one using a neighbor graph and one using product quantization, to obtain a graph index and a product quantization index; persistently storing the graph index and the product quantization index on an SSD;
(2) For a plurality of pictures to be queried, converting each query picture into a query vector with a machine learning technique, thereby obtaining a query vector set Q = {q_1, q_2, q_3, ..., q_i}, the query vector set Q comprising a plurality of query vectors q_i;
and performing an approximate nearest neighbor search:
first step, obtaining a plurality of search candidate points from the product quantization index;
second step, establishing a plurality of queues, a queue being a data structure;
third step, for each of the plurality of search candidate points, assigning the current search candidate point to the most idle queue according to how idle the queues are;
fourth step, for each queue established in the second step that holds search candidate points, dispatching the search candidate point at the head of the queue to read the persistently stored graph index and obtain the neighbor information in the graph index;
the head of the queue being the first item of data in the queue;
(3) For each query vector q_i in the query vector set Q, performing a greedy search according to the neighbor information read from the graph index, recording the nearest neighbors and the vector distance from the query vector q_i to each nearest neighbor, and sorting the nearest neighbors by distance from near to far to obtain the search result for the query vector q_i.
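The greedy search of step (3) can be sketched as a best-first traversal of the neighbor graph. This is a minimal sketch, not the patent's exact procedure: the function name `greedy_search`, the `beam` parameter that bounds the result list, and the callbacks `get_neighbors` and `get_vector` (standing in for neighbor information read from the graph index and for access to the stored vectors) are assumptions of the example.

```python
import heapq

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(query, entry_points, get_neighbors, get_vector, k, beam=8):
    """Return up to k (node, distance) pairs, ordered from near to far."""
    visited = set(entry_points)
    # Min-heap of nodes still to expand, keyed by distance to the query.
    frontier = [(sq_dist(query, get_vector(p)), p) for p in visited]
    heapq.heapify(frontier)
    # Max-heap (negated distances) holding the best `beam` nodes seen so far.
    results = []
    while frontier:
        d, node = heapq.heappop(frontier)
        # Stop once the closest unexpanded node cannot improve the result set.
        if len(results) >= beam and d > -results[0][0]:
            break
        heapq.heappush(results, (-d, node))
        if len(results) > beam:
            heapq.heappop(results)
        for nb in get_neighbors(node):
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (sq_dist(query, get_vector(nb)), nb))
    ranked = sorted((-neg_d, n) for neg_d, n in results)
    return [(n, d) for d, n in ranked[:k]]
```

As a usage example, with six points placed at `[0, 0]` through `[5, 0]`, each linked to its immediate neighbors, a query at `[4.2, 0]` starting from node 0 walks along the chain and returns nodes 4 and 5 as the two nearest neighbors, nearest first.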
In the third step of step (2), the allocation policy for assigning the current search candidate point to the most idle queue according to how idle the queues are comprises:
for the query vector q_i currently to be assigned, if an empty queue exists, placing the query vector in the empty queue first; if no queue is empty, assigning the query vector q_i, according to the current queuing state of the queues, to the queue with the shortest current queue length;
the queues and the query vectors q_i being stored by a plurality of hosts.
In the fourth step of step (2), the reading proceeds as follows: for the candidate search points that have currently reached the head of each queue, one CPU thread resource is allocated to each queue holding such a candidate search point; the thread interacts with a read-write thread on the SSD controller, obtains the position of the neighbor information from the graph index, and reads the neighbor information from the corresponding storage area of the SSD according to that position.
A hybrid vector retrieval apparatus for high-concurrency scenarios comprises:
an index construction module, configured to:
acquire a picture vector data set V, the picture vector data being obtained by vectorizing picture data with a machine learning technique; construct two indexes over the data in the picture vector data set V according to vector distance, one using a neighbor graph and one using product quantization, to obtain a graph index and a product quantization index; and persistently store the graph index and the product quantization index on an SSD;
a neighbor information reading module, configured to:
obtain a query vector set Q = {q_1, q_2, q_3, ..., q_i}, the query vector set Q comprising a plurality of query vectors q_i;
and perform an approximate nearest neighbor search:
first step, obtaining a plurality of search candidate points from the product quantization index;
second step, establishing a plurality of queues, a queue being a data structure;
third step, for each of the plurality of search candidate points, assigning the current search candidate point to the most idle queue according to how idle the queues are;
fourth step, for each queue established in the second step that holds search candidate points, dispatching the search candidate point at the head of the queue to read the persistently stored graph index and obtain the neighbor information in the graph index;
a search ordering module, configured to:
for each query vector q_i in the query vector set Q, perform a greedy search according to the neighbor information read from the graph index, record the nearest neighbors and the vector distance from the query vector q_i to each nearest neighbor, and sort the nearest neighbors by distance from near to far to obtain the search result for the query vector q_i.
For large-scale, high-concurrency query workloads, the invention schedules queries efficiently, avoids the latency risk caused by blocked queries, optimizes the SSD read strategy, balances disk reads against vector computation, and improves vector search speed.
While the invention has been described with respect to certain preferred embodiments, it will be apparent to those skilled in the art that various changes and substitutions can be made herein without departing from the scope of the invention as defined by the appended claims.

Claims (5)

1. A mixed vector retrieval method for a high concurrency scene, characterized by comprising the following steps:
(1) Acquiring a vector data set V, and constructing two indexes over the data in V according to vector distance, one using a neighbor graph and one using product quantization, to obtain a graph index and a product quantization index; persistently storing the graph index and the product quantization index on an SSD;
(2) Obtaining a query vector set Q = {q_1, q_2, q_3, ..., q_i}, the query vector set Q comprising a plurality of query vectors q_i;
and performing an approximate nearest neighbor search:
first step, obtaining a plurality of search candidate points from the product quantization index;
second step, establishing a plurality of queues, a queue being a data structure;
third step, for each of the plurality of search candidate points, assigning the current search candidate point to the most idle queue according to how idle the queues are;
fourth step, for each queue established in the second step that holds search candidate points, dispatching the search candidate point at the head of the queue to read the persistently stored graph index and obtain the neighbor information in the graph index;
the head of the queue being the first item of data in the queue;
(3) For each query vector q_i in the query vector set Q, performing a greedy search according to the neighbor information read from the graph index, recording the nearest neighbors and the vector distance from the query vector q_i to each nearest neighbor, and sorting the nearest neighbors by distance from near to far to obtain the search result for the query vector q_i.
2. The mixed vector retrieval method for a high concurrency scene according to claim 1, wherein,
in the second step of step (2), the queues include queues implemented with a linked-list structure and queues implemented with a sequential-list structure, and data may enter and leave a queue only at its head and tail.
3. The mixed vector retrieval method for a high concurrency scene according to claim 1, wherein in the third step of step (2), the allocation policy for assigning the current search candidate point to the most idle queue according to how idle the queues are comprises:
for the query vector q_i currently to be assigned, if an empty queue exists, placing the query vector in the empty queue first; if no queue is empty, assigning the query vector q_i, according to the current queuing state of the queues, to the queue with the shortest current queue length;
the queues and the query vectors q_i being stored by a plurality of hosts.
4. The mixed vector retrieval method for a high concurrency scene according to claim 1, wherein
the reading in the fourth step of step (2) proceeds as follows: for the candidate search points that have currently reached the head of each queue, one CPU thread resource is allocated to each queue holding such a candidate search point; the thread interacts with a read-write thread on the SSD controller, obtains the position of the neighbor information from the graph index, and reads the neighbor information from the corresponding storage area of the SSD according to that position.
5. A hybrid vector retrieval apparatus for a high concurrency scenario, comprising:
an index construction module, configured to:
acquire a vector data set V, and construct two indexes over the data in V according to vector distance, one using a neighbor graph and one using product quantization, to obtain a graph index and a product quantization index; and persistently store the graph index and the product quantization index on an SSD;
a neighbor information reading module, configured to:
obtain a query vector set Q = {q_1, q_2, q_3, ..., q_i}, the query vector set Q comprising a plurality of query vectors q_i;
and perform an approximate nearest neighbor search:
first step, obtaining a plurality of search candidate points from the product quantization index;
second step, establishing a plurality of queues, a queue being a data structure;
third step, for each of the plurality of search candidate points, assigning the current search candidate point to the most idle queue according to how idle the queues are;
fourth step, for each queue established in the second step that holds search candidate points, dispatching the search candidate point at the head of the queue to read the persistently stored graph index and obtain the neighbor information in the graph index;
a search ordering module, configured to:
for each query vector q_i in the query vector set Q, perform a greedy search according to the neighbor information read from the graph index, record the nearest neighbors and the vector distance from the query vector q_i to each nearest neighbor, and sort the nearest neighbors by distance from near to far to obtain the search result for the query vector q_i.
CN202310199075.6A 2023-03-03 2023-03-03 Mixed vector retrieval method and device for high concurrency scene Pending CN116166690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310199075.6A CN116166690A (en) 2023-03-03 2023-03-03 Mixed vector retrieval method and device for high concurrency scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310199075.6A CN116166690A (en) 2023-03-03 2023-03-03 Mixed vector retrieval method and device for high concurrency scene

Publications (1)

Publication Number Publication Date
CN116166690A (en) 2023-05-26

Family

ID=86416302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310199075.6A Pending CN116166690A (en) 2023-03-03 2023-03-03 Mixed vector retrieval method and device for high concurrency scene

Country Status (1)

Country Link
CN (1) CN116166690A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501828A (en) * 2023-06-27 2023-07-28 北京大学 Non-perception vector query method and system for server based on unstructured data set
CN116541420A (en) * 2023-07-07 2023-08-04 上海爱可生信息技术股份有限公司 Vector data query method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501828A (en) * 2023-06-27 2023-07-28 北京大学 Non-perception vector query method and system for server based on unstructured data set
CN116501828B (en) * 2023-06-27 2023-09-12 北京大学 Non-perception vector query method and system for server based on unstructured data set
CN116541420A (en) * 2023-07-07 2023-08-04 上海爱可生信息技术股份有限公司 Vector data query method
CN116541420B (en) * 2023-07-07 2023-09-15 上海爱可生信息技术股份有限公司 Vector data query method

Similar Documents

Publication Publication Date Title
CN109254733B (en) Method, device and system for storing data
CN116166690A (en) Mixed vector retrieval method and device for high concurrency scene
CN109684333B (en) Data storage and cutting method, equipment and storage medium
CN108710639B (en) Ceph-based access optimization method for mass small files
CN109299113B (en) Range query method with storage-aware mixed index
CN105956183A (en) Method and system for multi-stage optimization storage of a lot of small files in distributed database
CN106980656B (en) A kind of searching method based on two-value code dictionary tree
CN109766318B (en) File reading method and device
KR20130020050A (en) Apparatus and method for managing bucket range of locality sensitivie hash
CN106959928B (en) A kind of stream data real-time processing method and system based on multi-level buffer structure
CN106599091B (en) RDF graph structure storage and index method based on key value storage
CN103326925A (en) Message push method and device
CN113553306B (en) Data processing method and data storage management system
US20120271833A1 (en) Hybrid neighborhood graph search for scalable visual indexing
CN104391947B (en) Magnanimity GIS data real-time processing method and system
CN115878824B (en) Image retrieval system, method and device
CN110008030A (en) A kind of method of metadata access, system and equipment
CN112241396B (en) Spark-based method and system for merging small files of Delta
US10659304B2 (en) Method of allocating processes on node devices, apparatus, and storage medium
CN112035428A (en) Distributed storage system, method, apparatus, electronic device, and storage medium
CN107220003A (en) A kind of method for reading data and system
CN111695685B (en) On-chip storage system and method for graph neural network application
CN112199333B (en) Storage method and device supporting multi-valued index file
CN113779025A (en) Optimization method, system and application of classified data retrieval efficiency in block chain
CN110365342A (en) Waveform decoder method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination