CN117951346B - Vector database oriented hybrid acceleration architecture - Google Patents

Vector database oriented hybrid acceleration architecture

Info

Publication number
CN117951346B
CN117951346B
Authority
CN
China
Prior art keywords
vector
distance
task
calculation
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410349055.7A
Other languages
Chinese (zh)
Other versions
CN117951346A (en)
Inventor
Inventor name withheld on request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shencun Technology Wuxi Co ltd
Original Assignee
Shencun Technology Wuxi Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shencun Technology Wuxi Co ltd filed Critical Shencun Technology Wuxi Co ltd
Priority to CN202410349055.7A priority Critical patent/CN117951346B/en
Publication of CN117951346A publication Critical patent/CN117951346A/en
Application granted granted Critical
Publication of CN117951346B publication Critical patent/CN117951346B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/903 Querying
    • G06F 16/90335 Query processing
    • G06F 16/90339 Query processing by using parallel associative memories or content-addressable memories
    • G06F 16/90348 Query processing by searching ordered data, e.g. alpha-numerically ordered data

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a hybrid acceleration architecture oriented to vector databases, relating to the field of integrated circuits. The architecture comprises a main core and two secondary cores. A task segmentation module in the main core splits the query task, and a task scheduling module schedules the secondary cores to execute the resulting acceleration tasks. The first secondary core executes distance calculation pipelines according to the split tasks: a distance calculation module computes vector distances from the vector library and the vector to be queried, a calculation management module fetches the vector to be queried and the vector library from high-speed memory and caches the vector distances, and the calculation results of the distance calculation pipelines are fed back to the main core. The second secondary core executes sorting pipelines according to the split tasks: a data sorting module sorts vectors by their distances, the sorting results are fed back to the main core, and the main core outputs the vector query result from the sorting results. The scheme realizes multi-pipeline parallel processing in hardware, supports hybrid acceleration under multiple modes and user-defined acceleration functions in the secondary cores, and provides faster processing and higher memory-access bandwidth.

Description

Vector database oriented hybrid acceleration architecture
Technical Field
The embodiment of the application relates to the field of databases, in particular to a vector database-oriented hybrid acceleration architecture.
Background
Unstructured data or knowledge, including text, images, audio and video, can be encoded into vectors through learning and training. The main role of a vector database is to store and process such vector data and to provide efficient vector retrieval. The core idea of a vector database is similarity search: finding the most similar vectors by calculating the distance between one vector and all other vectors.
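To make the similarity-search idea concrete, the following minimal sketch (not part of the patent; the array names, the use of squared Euclidean distance and of NumPy are illustrative assumptions) performs a brute-force nearest-neighbor query over a small vector library:

```python
import numpy as np

def brute_force_search(library: np.ndarray, query: np.ndarray, k: int = 10):
    """Return the IDs and distances of the k vectors in `library` closest to `query`.

    library: (N, D) array, one row per stored vector
    query:   (D,)  array, the vector to be queried
    """
    # Squared Euclidean distance between the query and every library vector.
    diff = library - query
    dist = np.einsum("nd,nd->n", diff, diff)
    # Partial sort: keep only the k smallest distances, then order them.
    idx = np.argpartition(dist, k)[:k]
    idx = idx[np.argsort(dist[idx])]
    return idx, dist[idx]

# Usage: 10,000 vectors of dimension 128, one query.
lib = np.random.rand(10_000, 128).astype(np.float32)
q = np.random.rand(128).astype(np.float32)
ids, dists = brute_force_search(lib, q, k=5)
```

A real engine replaces this exhaustive scan with the index structures and hardware pipelines discussed below.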
In order to improve the performance of vector databases, earlier research tended toward more efficient vector index structures and more reasonable vector query methods, from which a variety of vector search algorithms have been developed. These algorithms share one feature: to improve search performance, only the distances of a subset of the vectors are calculated. This approach, called approximate nearest neighbor (ANN) search, increases speed but sacrifices result quality. Common ANN indexes include Locality Sensitive Hashing (LSH), Hierarchical Navigable Small World (HNSW) graphs, the Inverted File Index (IVF), inverted file with product quantization (IVFPQ), and the like.
Another way to improve vector database performance is to accelerate vector queries with dedicated hardware such as GPUs, NPUs and FPGAs. Databases such as Milvus and Faiss provide GPU-accelerated queries in addition to their CPU versions. Because different acceleration hardware and proprietary chips have different structural designs, how to combine the characteristics of this hardware to perform vector indexing efficiently is still under investigation.
Disclosure of Invention
The embodiment of the application provides a hybrid acceleration architecture oriented to vector databases, which addresses the low vector query speed and the loss of result quality that arise when a database is accelerated by software alone.
The application provides a vector-database-oriented hybrid acceleration architecture comprising a main core and two secondary cores, a first secondary core and a second secondary core, each with multi-pipeline parallel processing. The main core comprises a task scheduling module and a task segmentation module; the task segmentation module splits tasks according to the query task, and the task scheduling module schedules the first secondary core and/or the second secondary core to execute the corresponding pipeline acceleration tasks according to the split tasks. The first secondary core and the second secondary core are each connected to a high-speed memory in which a vector library for accelerated vector queries is cached;
the first secondary core executes distance calculation pipelines according to the received split tasks; each distance calculation pipeline comprises a calculation management module and a distance calculation module, and the distance calculation module calculates vector distances from the vector library and the vector to be queried; the calculation management module extracts the vector to be queried and the vector library from the high-speed memory according to the split task, caches the vector distances, and feeds back the calculation results of the distance calculation pipeline to the main core;
The second secondary core executes sorting pipelines according to the received split tasks; each sorting pipeline comprises a data sorting module, which sorts the vectors according to the vector distances obtained by the first secondary core; the sorting results of the sorting pipelines are fed back to the main core, and the main core outputs the vector query result according to the sorting results.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
1) A more flexible model configuration: the hybrid acceleration architecture introduces FPGA/ASIC/SOC hardware, which can flexibly implement a variety of custom functions and thus offers more possibilities for vector retrieval.
2) Mixed acceleration across multiple hardware types: the hybrid acceleration architecture separates the different operations and accelerates each with the hardware best suited to its characteristics, so that the advantages of each device are fully exploited and data processing capacity is greatly improved.
3) Acceleration of different types of algorithm models: there are many vector library search algorithm models, and the hybrid acceleration architecture offers an acceleration effect for all of them. New computation and sorting schemes, together with pipelining and multi-batch techniques, raise the memory-access bandwidth, which benefits the various algorithm models.
4) Multistage pipeline processing: the hybrid acceleration architecture adopts a pipeline design and raises processing bandwidth through multistage pipeline parallelism. Tasks are split by function into multiple subtasks, and tasks with the same function are processed in batches, realizing multi-task parallel processing.
Drawings
FIG. 1 is a vector-database-oriented hybrid acceleration architecture provided by an embodiment of the present application;
FIG. 2 is a vector-database-oriented hybrid acceleration architecture provided by another embodiment of the present application;
FIG. 3 is a detailed diagram of a vector-database-oriented hybrid acceleration architecture provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a distance calculation pipeline in the first secondary core;
FIG. 5 is a schematic diagram of the structure of a table-lookup pipeline;
FIG. 6 is a schematic diagram of the structure of the data sorting module in a sorting pipeline;
FIG. 7 is a schematic structural diagram of the data rearrangement module in the second secondary core.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
References herein to "a plurality" mean two or more. "And/or" describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
Traditional vector acceleration has been heavily optimized in algorithm software, and such optimization is based on CPU processing. The CPU, however, is limited by its own characteristics: it is suited to executing complex program logic, but for highly concurrent workloads made up of large numbers of simple computing tasks its processing speed drops sharply, so the room left for CPU-only optimization in the field of database acceleration is small.
In view of the above problems, the present application provides a hybrid acceleration architecture for vector databases. As shown in fig. 1, the hybrid architecture includes a main core and two secondary cores, a first secondary core and a second secondary core, each with multi-pipeline parallel processing. The main core is the core execution part of the whole system; it mainly performs data processing and task division according to the acceleration task to be executed and calls the first secondary core and/or the second secondary core, as coprocessors, to execute the specific acceleration tasks. In this embodiment, the hardware planning is designed around the acceleration algorithm principles commonly used in the market: the main core performs overall acceleration task scheduling and task distribution, and controls the first secondary core to perform vector distance calculation and the second secondary core to perform vector distance sorting. Over the whole process, an acceleration task entering the system is distributed to the two secondary cores for execution, while the main core only triggers task distribution, overall scheduling and related work according to system requests.
In one possible embodiment, the main core is a CPU chip capable of handling complex tasks and functions, and the first secondary core is a GPU chip used to compute the split subtasks. The main characteristics of the GPU are high concurrency and high bandwidth. The CPU continuously transmits the split computation subtasks to the GPU in the form of a queue, and a large number of computation subtasks are unfolded in the GPU and run in parallel.
The second secondary core can be an FPGA/ASIC/SOC chip. Because the GPU's computing cores are far from storage and its memory-access latency is high, tasks requiring fast memory access are completed by the FPGA/ASIC/SOC. The FPGA/ASIC/SOC has the advantage that the distance between its computing logic and its storage is short; it can easily implement complex instructions that CPUs and GPUs lack, and its fast operation and low latency allow such tasks to be completed in a shorter time.
The first secondary core and the second secondary core are each connected to high-speed memory, DRAM or HBM (High Bandwidth Memory). The DRAM/HBM provides data storage and exchange; DRAM is characterized by high bandwidth and large capacity. The high bandwidth ensures efficient data transfer between devices, and the large capacity ensures that, before the data reaches a certain scale, it can reside entirely in memory. Compared with DRAM, HBM has a wider bit width and can complete reads and writes faster. In this scheme, the DRAM/HBM mainly stores the vector library used for accelerated vector queries. In addition, with mass storage in mind, an SSD storing large data volumes may be added to the system: when the data scale becomes huge, reaching hundreds of millions or even billions of vectors, the storage requirement grows accordingly and larger storage space must be provided by NVMe SSDs or similar media.
Of course, the above mainly considers simple vector distance calculation and sorting; some higher-precision accelerated vector query algorithms also involve table-lookup calculation, table-lookup distance sorting and the like. In that case, a third secondary core with multi-pipeline concurrent processing can be added to the system.
FIG. 2 is a schematic diagram of a hybrid acceleration architecture for vector databases according to another embodiment. A third secondary core with multi-pipeline concurrent processing is added to the architecture of FIG. 1 and is dedicated to table-lookup tasks.
Fig. 3 is a detailed view of the vector-database-oriented hybrid acceleration architecture provided in an embodiment of the present application, where the main core includes a task scheduling module and a task segmentation module. The task segmentation module splits the main task according to the query task, i.e. the main task becomes a plurality of subtasks, each corresponding to a specific function. The task scheduling module then schedules the first secondary core and/or the second secondary core to execute the corresponding pipeline acceleration tasks according to the split tasks.
After the first secondary core receives the split tasks, it executes distance calculation pipelines accordingly. Because the secondary cores all adopt parallel computation, several pipelines can be processed in parallel; the exact number is determined by the design of the chip. For example, fig. 3 shows n distance calculation pipelines, and when the number of split tasks exceeds n the first secondary core runs at full load. Each distance calculation pipeline comprises a calculation management module and a distance calculation module. The distance calculation module calculates vector distances from the vector library and the vector to be queried. The calculation management module extracts the vector to be queried and the vector library from the high-speed memory according to the split task, caches the vector distances, and feeds back the calculation results of the distance calculation pipeline to the main core. In this process the GPU first obtains the vector to be queried from the DRAM/HBM and extracts the relevant vectors from the vector library; when the relevant vectors are not present in the DRAM/HBM, they must first be fetched from the SSD and then sent to the calculation management module.
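As a purely illustrative host-side sketch (the pipeline count, function names and thread-based dispatch are assumptions, not the patent's implementation), the task segmentation, scheduling and result feedback described above might look as follows in software:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

N_PIPELINES = 4  # assumed number of distance-calculation pipelines in the first secondary core

def split_task(library: np.ndarray, n_parts: int):
    """Task segmentation: cut the vector library into contiguous slices,
    one subtask per distance-calculation pipeline."""
    bounds = np.linspace(0, len(library), n_parts + 1, dtype=int)
    return [(bounds[i], bounds[i + 1]) for i in range(n_parts)]

def distance_pipeline(library, query, lo, hi):
    """Stand-in for one distance-calculation pipeline: computes distances
    for its slice and returns (ID, distance) pairs to the main core."""
    diff = library[lo:hi] - query
    dist = np.einsum("nd,nd->n", diff, diff)
    ids = np.arange(lo, hi)
    return ids, dist

library = np.random.rand(8_000, 128).astype(np.float32)
query = np.random.rand(128).astype(np.float32)

with ThreadPoolExecutor(max_workers=N_PIPELINES) as pool:   # plays the role of the task scheduler
    futures = [pool.submit(distance_pipeline, library, query, lo, hi)
               for lo, hi in split_task(library, N_PIPELINES)]
    results = [f.result() for f in futures]                  # calculation results fed back to the main core
```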
After the second secondary core receives the split tasks, it executes sorting pipelines accordingly; like the first secondary core, it runs multiple pipelines, each containing a data sorting module. The data sorting module sorts the vectors according to the vector distances obtained by the first secondary core, and the sorting results of the sorting pipelines are fed back to the main core. This stage is linked to the distance output of the first secondary core, i.e. the computed distances are fed into the data sorting module for sorting. The main core, as the overall scheduling and processing center, outputs the target vectors, i.e. the query result, according to the final sorting result.
Having the second secondary core inherit the output of the first secondary core's computation is aimed mainly at brute-force search and inverted file index (IVF) model algorithms, because those indexes are accelerated by a direct full search or by a combination of coarse search and fine search. In addition, inverted product quantization (IVFPQ) model algorithms are also in use, which accelerate vector queries by screening cluster spaces and computing a distance matrix; for such algorithms, either certain resource units are allocated in the second secondary core for the distance-matrix computation, or a third secondary core is added for that purpose.
The third secondary core designed in this embodiment is similar to the second secondary core and uses an FPGA/ASIC/SOC chip for multi-pipeline concurrent processing. After the third secondary core receives the split tasks, it executes table-lookup pipelines accordingly; each table-lookup pipeline comprises a table-lookup distance calculation module, which computes table-lookup distances from the compressed vector library and the distance matrix output by the first secondary core. The computed table-lookup distances are stored in the high-speed memory, and the table-lookup results of the table-lookup pipelines are fed back to the main core. A compressed vector library is used in this process: it is a search library formed by compressing the vectors of the vector library for fast calculation, its compressed vectors correspond one-to-one with the vectors in the vector library, and both are stored in the high-speed memory.
In particular, because different acceleration principles involve different steps, when the hybrid architecture executes an acceleration task under algorithm principles such as the inverted file index or the brute-force search model, the split tasks received by the first secondary core contain no table-lookup operation; the first secondary core queries the vectors and outputs vector distances. The main core then schedules the second secondary core to continue with the sorting tasks based on those vector distances.
When the hybrid architecture executes an acceleration task under algorithm principles such as the inverted product quantization model, the first secondary core instead queries the vector matrix, outputs a distance matrix, and caches it in the high-speed memory. The third secondary core then continues to execute the table-lookup pipelines based on the cached distance matrix and outputs table-lookup distances. The second secondary core can then continue to execute the sorting pipelines based on the table-lookup distances, sorting the vectors and feeding back the results. The two execution flows differ in that the former does not require the third secondary core, while the latter does.
Since the three secondary cores process tasks in parallel, the main core must contain a task segmentation module, a queue module and a result module for each secondary core, so that task splitting and scheduling can be carried out purposefully.
For the first secondary core, the computation-task segmentation module splits the computation task and sends the split computation tasks to the computation queue module. The computation queue module builds a computation task ordering table from these tasks and issues them, in order, to the parallel distance calculation pipelines for execution. The computation result module builds a computation result feedback table from the results fed back by the distance calculation pipelines. For the distance calculation pipelines, the split tasks need to queue in order in the computation task ordering table only when all pipelines are running at full load. The computation result feedback table stores the queue of completed computation tasks; subsequent sorting and/or table lookup also executes its tasks in order based on this feedback table.
For the third secondary core, the table-lookup task segmentation module splits the table-lookup task and sends the split table-lookup tasks to the table-lookup queue module. The table-lookup queue module builds a table-lookup task ordering table from these tasks and issues them, in order, to the parallel table-lookup pipelines for execution. The table-lookup result management queue builds a table-lookup result feedback table from the results fed back by the table-lookup pipelines. In particular, for certain algorithms the table-lookup task segmentation module works on table-lookup tasks triggered or generated from the computation result feedback table of the first secondary core, because, from a data-dependency point of view, the table lookup must use the distance matrix output by the first secondary core.
For the second secondary core, the sorting-task segmentation module splits the sorting task and sends the split sorting tasks to the sorting queue module. The sorting queue module builds a sorting task ordering table from these tasks and issues them, in order, to the parallel sorting pipelines for execution. The sorting result module builds a sorting result feedback table from the results fed back by the sorting pipelines. By design, for query tasks without a table-lookup operation, the sorting-task segmentation module creates sorting tasks from the computation results in the computation result feedback table; for query tasks that include a table-lookup operation, it creates sorting tasks from the table-lookup results in the table-lookup result feedback table.
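For illustration only, the per-core queue and result modules described above can be modeled as a task ordering table plus a result feedback table; the class and field names below are assumptions rather than the patent's data structures:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SubTask:
    task_id: int
    kind: str          # "compute", "lookup" or "sort"
    payload: dict = field(default_factory=dict)

class QueueModule:
    """Task ordering table: subtasks wait here and are issued to the
    parallel pipelines in order when a pipeline becomes free."""
    def __init__(self):
        self.ordering_table = deque()
    def push(self, task: SubTask):
        self.ordering_table.append(task)
    def pop(self):
        return self.ordering_table.popleft() if self.ordering_table else None

class ResultModule:
    """Result feedback table: completed subtasks are recorded here; the
    next stage (sorting or table lookup) is triggered from this table."""
    def __init__(self):
        self.feedback_table = {}
    def record(self, task_id: int, result):
        self.feedback_table[task_id] = result

# One queue/result pair per secondary core, as in Fig. 3.
compute_queue, compute_results = QueueModule(), ResultModule()
compute_queue.push(SubTask(0, "compute", {"slice": (0, 1000)}))
compute_queue.push(SubTask(1, "compute", {"slice": (1000, 2000)}))
```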
Of the three secondary cores, the second and third use FPGA/ASIC/SOC while the first uses a GPU. The greatest advantage of the FPGA/ASIC/SOC over the GPU is its low memory-access latency: registers and on-chip memory (BRAM) in the FPGA/ASIC/SOC belong to the control logic itself and need no extra access time, whereas the GPU uses shared memory, whose accesses require arbitration, cache-coherency handling and other operations, so its latency is higher. In addition, every logic unit inside the FPGA is interconnected with the others, allowing flexible control functions to be realized, whereas the structure inside the GPU is fixed and its data-control flexibility is relatively poor. For these two reasons, the sorting task and the table-lookup task are executed not on the GPU but on the FPGA/ASIC/SOC. The FPGA/ASIC/SOC, however, has drawbacks of its own: its internal DSP resources are relatively scarce, so it cannot quickly complete large-scale arithmetic, while the GPU has unique advantages in numerical and parallel operations. Vector distance calculation is therefore performed on the GPU.
FIG. 4 is a schematic diagram of a distance calculation pipeline in the first secondary core, in which the calculation management module comprises a Query cache unit, a data separation unit and an ID distance synchronization unit. The Query cache unit is responsible for reading from memory and caching all vector data needed by the distance calculation pipeline, in particular the data required for one beat of pipeline execution. Because vector queries and calculations must be tracked, the vector data extracted from the vector library consists of the vector itself plus an ID number; the data separation unit separates the ID number from the vector and temporarily stores and manages it. The ID distance synchronization unit obtains the output distance at the end of the calculation, binds the temporarily stored ID number to that output distance, and outputs the vector distance or matrix distance.
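A minimal functional sketch of the data separation and ID distance synchronization steps; the (ID, vector) record layout and the function names are assumptions for illustration:

```python
import numpy as np

def separate_ids(records):
    """Data separation unit: split cached (ID, vector) records into an ID list
    that is held back and a pure vector block that goes to the distance module."""
    ids = np.array([rec[0] for rec in records])
    vectors = np.stack([rec[1] for rec in records])
    return ids, vectors

def bind_ids_to_distances(ids, distances):
    """ID distance synchronization unit: re-attach the temporarily stored IDs
    to the distances produced at the end of the computation."""
    return list(zip(ids.tolist(), distances.tolist()))

records = [(7, np.random.rand(128)), (42, np.random.rand(128))]
query = np.random.rand(128)
ids, vecs = separate_ids(records)
dists = ((vecs - query) ** 2).sum(axis=1)     # output of the distance calculation module
id_dist = bind_ids_to_distances(ids, dists)   # e.g. [(7, 21.3), (42, 19.8)]
```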
The distance calculation module comprises a single-dimensional distance calculation unit and a high-dimensional distance calculation unit. Most data handled by a real database are high-dimensional vectors, e.g. 64-dimensional or 128-dimensional. In a hardware module only basic operation units performing single-dimension calculations can be built, so the single-dimensional unit processes a multi-dimensional vector dimension by dimension, and the subsequent high-dimensional distance calculation unit accumulates, in parallel, the per-dimension distances of the data belonging to a single beat (one beat being one execution cycle of the distance calculation module). Because the high-dimensional unit is a superposition of multiple single-dimensional operations, a 16-dimension beat is typically chosen; when the actual vector data is 32-dimensional, two beats of superposed operation are required. When the amount of cached vector data exceeds the single-beat throughput of the distance calculation module, the module iterates: the high-dimensional unit caches the parallel accumulation result of each iteration and then serially accumulates these partial results to output the final result, i.e. the vector distance or matrix distance.
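The beat-wise accumulation can be illustrated with the following sketch, which models the 16-dimension beat width described above; the function name and the software loop are assumptions standing in for the hardware units:

```python
import numpy as np

BEAT_DIMS = 16  # dimensions processed per beat by the high-dimensional distance unit

def beatwise_squared_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Accumulate the squared Euclidean distance 16 dimensions at a time.

    Each iteration models one beat: the single-dimensional units compute the
    per-dimension squared differences, the high-dimensional unit adds them in
    parallel, and the partial sums of successive beats are accumulated serially.
    """
    assert a.shape == b.shape
    total = 0.0
    for start in range(0, a.size, BEAT_DIMS):
        chunk = slice(start, start + BEAT_DIMS)
        partial = np.sum((a[chunk] - b[chunk]) ** 2)  # parallel accumulation within a beat
        total += partial                              # serial accumulation across beats
    return total

x, y = np.random.rand(32), np.random.rand(32)  # a 32-dimensional vector needs two beats
assert np.isclose(beatwise_squared_distance(x, y), np.sum((x - y) ** 2))
```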
FIG. 5 is a schematic diagram of a table lookup pipeline in which a table lookup distance calculation module includes a DMA transfer unit, a data alignment unit, a queue selection unit, a queue selector, and a distance accumulation unit.
The DMA transfer unit reads the distance matrix into the on-chip RAM for buffering; since the on-chip RAM is read much faster than other storage, this guarantees the table-lookup speed, and the distance matrix must be fetched from the DRAM into the on-chip RAM in advance. The compressed vectors, by contrast, are queried against the matrix in real time as they are read, so they need no buffering. The data alignment unit shapes the incoming compressed vectors and aligns the data to the number of rows and columns of the distance matrix.
Furthermore, since creating a distance matrix and querying it are separate steps, multiple queues can be used to overlap matrix creation with querying. For example, while distance matrix 1 is being queried, distance matrix 2 is created at the same time; when the query on matrix 1 finishes, the query on matrix 2 can start immediately, greatly reducing the running time. To this end, the third secondary core generates a queue selection signal based on the query and creation states of each distance-matrix queue; the queue selection unit selects the target distance-matrix queue according to this signal, and the queue selector outputs the segmented query results of the target queue accordingly. Because the table-lookup process uses the product quantization idea of dividing a vector into several segments, the lookup output is the distance of each segment, so the distance accumulation unit must accumulate the segmented query results in parallel and output the matrix distance.
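A sketch of the product-quantization table lookup described above: the distance matrix holds, for each vector segment, the distance from the query sub-vector to every codeword, and the table-lookup distance of a compressed vector is the accumulated sum of its per-segment entries. The PQ parameters (8 segments, 256 codewords) and all names are illustrative assumptions:

```python
import numpy as np

M, K = 8, 256          # M segments per vector, K codewords per segment (assumed PQ parameters)

def build_distance_matrix(query: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Distance matrix creation (done on the GPU in the text): entry [m, k] is the
    squared distance from the m-th query sub-vector to codeword k of segment m."""
    d_sub = query.size // M
    q = query.reshape(M, d_sub)                       # (M, d_sub)
    diff = codebooks - q[:, None, :]                  # (M, K, d_sub)
    return np.einsum("mkd,mkd->mk", diff, diff)       # (M, K)

def lookup_distance(codes: np.ndarray, dist_matrix: np.ndarray) -> np.ndarray:
    """Table lookup (FPGA/ASIC/SOC side): for each compressed vector, gather its
    per-segment distances from the matrix and accumulate them."""
    # codes: (N, M) array of codeword indices, one row per compressed vector
    per_segment = dist_matrix[np.arange(M), codes]    # (N, M) gathered segment distances
    return per_segment.sum(axis=1)                    # distance accumulation unit

codebooks = np.random.rand(M, K, 16).astype(np.float32)   # 128-dim vectors, 16 dims per segment
query = np.random.rand(128).astype(np.float32)
codes = np.random.randint(0, K, size=(1000, M))
dm = build_distance_matrix(query, codebooks)
dists = lookup_distance(codes, dm)
```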
FIG. 6 is a schematic diagram of the data sorting module in a sorting pipeline; both the table-lookup path and the distance-calculation path are ultimately sorted and output through the sorting pipeline. The data sorting module comprises a sorting unit and a cache unit. The sorting unit comprises a plurality of cascaded sorters: the cascaded sorters sort the incoming table-lookup distances or vector distances in descending order, stage by stage, and the last-stage sorter outputs the target number of table-lookup distances or vector distances in ascending order. Each sorter is internally composed of a comparator, a FIFO buffer and some decision logic; its main function is to take two groups of input data and output the data with the smaller distances first, in order.
In some embodiments, the sorter parameter is written as m-sorter/n, where m is the number of data in each group sorted by the sorter and n is the number of buffers inside the sorter used to assist sorting. Except for the last stage, the amount of data each sorter takes in and puts out is fixed, the only difference being that the output data are in sorted order. Specifically, each downstream sorter receives the output of the upstream sorter and sorts groups twice as large as those of the upstream sorter. The sorting logic therefore doubles the amount of ordered data at every stage until the last group has sorted all the data; the difference is that the last-stage sorter truncates its output to the target number set by the top-k algorithm, i.e. the k outputs of top-k. In the top-128 example of FIG. 6, the successive sorter stages sort groups of 1, 2, 4, 8, 16, 32, 64 and 128 data. The last group of sorters differs from the previous ones in that it only needs to keep the useful data and may discard the rest directly, whereas the earlier sorters may not discard any data. The sorters are used iteratively, and every pass needs a large number of comparisons and data buffering; on an FPGA/ASIC/SOC the storage units are very close to the computing units and the data-access latency is low, so the sorter function can be realized efficiently. The cache unit stores the target number of table-lookup distances or vector distances output in ascending order by the last-stage sorter and feeds the sorting result back to the main core.
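The cascaded-sorter behaviour can be sketched in software as follows; this is a behavioral model only (the hardware uses comparators and FIFOs), with the group-doubling merge and the final top-k truncation following the description above:

```python
import heapq
import random

def sorter_stage(runs):
    """One sorter stage: merge adjacent pairs of sorted runs into runs of twice the length."""
    merged = []
    for i in range(0, len(runs), 2):
        pair = runs[i:i + 2]
        merged.append(list(heapq.merge(*pair)))   # comparator + FIFO behaviour
    return merged

def cascaded_topk(distances, k):
    """Cascade of sorter stages; the final stage truncates to the k smallest distances."""
    runs = [[d] for d in distances]               # stage 0: groups of 1
    while len(runs) > 1:
        runs = sorter_stage(runs)                 # groups of 2, 4, 8, ...
    return runs[0][:k]                            # last stage: keep only top-k

dists = [(random.random(), i) for i in range(128)]   # (distance, ID) pairs
top8 = cascaded_topk(dists, k=8)
assert top8 == sorted(dists)[:8]
```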
In particular, in some embodiments, sorting tasks with a large volume of data are split into several subtasks and fed into multiple sorting pipelines for parallel processing. Each sorting pipeline then holds only the distance ordering of part of the subtasks, so the results of the associated parallel pipelines must be merged; for this purpose a data rearrangement module is also provided in the second secondary core.
Fig. 7 is a schematic structural diagram of the data rearrangement module in the second secondary core. The data rearrangement module comprises comparators arranged in a binary-tree structure: the comparators are connected to the outputs of the sorting pipelines, compare the sorting results level by level, output the smallest distance among the sorting pipelines, and feed back the sorting result. In the example of fig. 7, at the first level the result of sorting pipeline 1 and the result of sorting pipeline 2 are fed to the first comparator MUX1, and the result of pipeline 3 and the result of pipeline 4 are fed to the second comparator MUX2; at the second level, comparison result 1 from MUX1 and comparison result 2 from MUX2 are fed to the third comparator MUX3. By feeding the results upward level by level in this way, comparison result n, i.e. the final top ranking, is output.
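A behavioral sketch of the binary-tree comparison in the data rearrangement module: each pipeline's sorted top-k list is merged pairwise, level by level, so the globally smallest distances emerge at the root. The list contents, k and all names are illustrative assumptions:

```python
import heapq
import random

def tree_merge(pipeline_results, k):
    """Merge per-pipeline sorted top-k lists with a binary tree of comparators
    (MUX1, MUX2, ...), keeping only the k globally smallest distances at each level."""
    level = [sorted(r)[:k] for r in pipeline_results]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            group = level[i:i + 2]
            nxt.append(list(heapq.merge(*group))[:k])   # one comparator node
        level = nxt
    return level[0]

pipelines = [[(random.random(), f"id{p}_{i}") for i in range(128)] for p in range(4)]
global_top16 = tree_merge(pipelines, k=16)
```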
The sorted output delivered to the CPU is the nearest K IDs plus distances. This output arises in two different cases: in the IVF (inverted file index) case, the output is the ID and distance of an IVF cluster space; in vector distance sorting, the output is the ID and distance of a vector, which is the final result. Both the IVF results and the vector sorting results are stored in the CPU's memory, are no longer written to the HBM, and are stored at different addresses. The two kinds of result are processed differently: for IVF results, the CPU allocates and creates the distance matrix and the table-lookup tasks; for vector distance results, the CPU outputs them directly.
Based on this principle, when a sorting task is split onto a single sorting pipeline, the cache unit feeds the sorting result back to the main core directly. When the sorting task is split across several parallel sorting pipelines, the cache units of those pipelines feed their results to the data rearrangement module, which compares them level by level through the binary-tree comparators, outputs the minimum sorting distance, and feeds the sorting result back to the main core accordingly.
When a brute-force search query is run on this hybrid architecture, the query task is divided into two subtasks, distance calculation and sorting: the GPU handles distance calculation, the FPGA handles fast sorting, and the CPU handles task control and cache scheduling; working together they complete vector retrieval more efficiently.
The benefits of accelerating the above model through a hybrid acceleration architecture are as follows:
a. The distance calculation and the distance ordering are separated.
B. Dedicated tasks are handled by dedicated chips: the distance is calculated by the GPU and the sorting is computed by the FPGA/ASIC/SOC, which accelerates each kind of processing.
C. Batch and pipelined operation: multiple vectors to be queried can be input at once through batching, so the multistage pipelines in both the GPU and the FPGA can be kept busy and retrieval efficiency is improved.
When the query process of the inverted file index is run on this hybrid architecture, the whole query is divided into 4 subtasks, coarse calculation, coarse sorting, precise calculation and precise sorting, because the idea of the inverted file index is to speed up retrieval by narrowing the search range through clustering. The coarse and precise calculations are done on the GPU; coarse sorting and precise sorting are completed on the FPGA/ASIC/SOC; the CPU schedules the tasks; and the data exchanged between the subtasks is stored in the cache.
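A compact end-to-end sketch of the four-subtask IVF flow (coarse calculation, coarse sorting, precise calculation, precise sorting); the nprobe parameter, the nearest-centroid assignment used here in place of offline clustering, and all names are illustrative assumptions:

```python
import numpy as np

def ivf_query(library, centroids, assignments, query, nprobe=4, k=10):
    """IVF query split into the four subtasks named in the text."""
    # 1) coarse calculation: distance from the query to every cluster centre (GPU in the text)
    coarse = ((centroids - query) ** 2).sum(axis=1)
    # 2) coarse sorting: pick the nprobe nearest clusters (FPGA/ASIC/SOC in the text)
    probe = np.argsort(coarse)[:nprobe]
    # 3) precise calculation: distances inside the selected clusters only
    mask = np.isin(assignments, probe)
    ids = np.nonzero(mask)[0]
    fine = ((library[ids] - query) ** 2).sum(axis=1)
    # 4) precise sorting: final top-k
    order = np.argsort(fine)[:k]
    return ids[order], fine[order]

rng = np.random.default_rng(0)
library = rng.random((20_000, 64), dtype=np.float32)
centroids = rng.random((256, 64), dtype=np.float32)
# nearest-centroid assignment stands in for the offline clustering step
x2 = (library ** 2).sum(1, keepdims=True)
c2 = (centroids ** 2).sum(1)
assignments = np.argmin(x2 - 2 * library @ centroids.T + c2, axis=1)
ids, dists = ivf_query(library, centroids, assignments, rng.random(64, dtype=np.float32))
```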
The benefits of accelerating the above model through a hybrid acceleration architecture are as follows:
a. The distance calculation and the distance ordering are separated.
B. Dedicated tasks are handled by dedicated chips: the distance is calculated by the GPU and the sorting is computed by the FPGA/ASIC/SOC, which accelerates each kind of processing.
C. Batch and pipelined operation: tasks with the same function are processed together, enabling batch operation, and processing is further accelerated by pipelining.
D. The data access bandwidth is improved through the HBM; the hybrid acceleration architecture can store the clustering center and the clustering space vector in the HBM, so that the data access bandwidth is improved, and the calculation is further accelerated.
When the query process of inverted product quantization is run on this hybrid architecture: IVFPQ adds product quantization (PQ) on top of IVF, greatly compressing the memory used by high-dimensional vectors. The main implementation difference between IVFPQ and IVF is that a distance matrix must be created and then queried during the vector query. Creating the distance matrix requires a large amount of computation and is mainly completed by the GPU; querying the distance matrix requires efficient memory access and is implemented in the FPGA/ASIC/SOC.
The benefits of accelerating the above model through a hybrid acceleration architecture are as follows:
A. The distance calculation, the distance sorting and the table-lookup operation are separated.
B. Dedicated tasks are handled by dedicated chips: the distance is calculated by the GPU, the sorting is computed by the FPGA/ASIC/SOC, and the table-lookup calculation is also performed by the FPGA/ASIC/SOC, which accelerates each kind of processing.
C. Batch and pipelined operation: tasks with the same function are processed together, enabling batch operation, and processing is further accelerated by pipelining.
D. The data access bandwidth is improved through the HBM: the hybrid acceleration architecture can store the cluster centers, the cluster-space vectors and the compressed vectors in the HBM, which raises the data access bandwidth and further accelerates the calculation.
E. The table-lookup function is accelerated through on-chip RAM: during the table-lookup operation, the on-chip RAM serves as a temporary cache that speeds up the lookup process.
In summary, the hybrid architecture provided by the embodiment of the present application may achieve the following technical effects:
1) Providing a more flexible model configuration; the mixed acceleration architecture introduces hardware such as FPGA/ASIC/SOC, and the hardware can flexibly realize various custom functions, thereby providing more possibilities for vector retrieval.
2) Providing multiple hardware hybrid accelerations; the hybrid acceleration architecture separates different operation functions, and accelerates by using different hardware according to the characteristics of respective operations, so that the advantages of the hardware are fully exerted, and the data processing capacity is greatly improved.
3) Accelerating different types of algorithm models; there are many vector library search algorithm models, and the hybrid acceleration architecture offers an acceleration effect for all of them. New computation and sorting schemes, together with pipelining and multi-batch techniques, raise the memory-access bandwidth, which benefits the various algorithm models.
4) Support of massive vector data; the hybrid acceleration architecture provides additional storage space that can handle massive vector data storage.
5) Multistage pipeline processing; the hybrid acceleration architecture adopts a pipeline design and raises processing bandwidth through multistage pipeline parallelism. Tasks are split by function into multiple subtasks, and tasks with the same function are processed in batches, realizing multi-task parallel processing.
6) Accelerating calculation through HBM, on-chip RAM and the like; a great advantage of the hybrid acceleration architecture is that it can accelerate through the HBM, the on-chip RAM and similar media. Traditional vector databases mainly use DRAM and SSD for storage, whereas the hybrid acceleration architecture introduces new hardware such as the FPGA/ASIC/SOC, which can provide storage media such as HBM and on-chip RAM with higher storage bandwidth.
The foregoing describes preferred embodiments of the present invention. It should be understood that the invention is not limited to the specific embodiments described above, and devices and structures not described in detail should be understood as being implemented in the manner common in the art. Any person skilled in the art may make many possible variations, modifications or equivalent substitutions without departing from the technical solution of the present invention, and these do not affect its essential content; any simple modification, equivalent variation or adaptation of the above embodiments made according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (12)

1. The mixed acceleration architecture for the vector database is characterized by comprising a main core, a first auxiliary core and a second auxiliary core which are processed in parallel by a plurality of pipelines, wherein the main core comprises a task scheduling module and a task segmentation module; the task segmentation module performs task segmentation according to the query task, and the task scheduling module schedules the first auxiliary core and/or the second auxiliary core to execute corresponding pipeline acceleration tasks according to the segmentation task; the first auxiliary core and the second auxiliary core are respectively connected with a high-speed memory, and a vector library for vector acceleration query is cached in the high-speed memory;
the first auxiliary core executes a distance calculation pipeline according to the received segmentation task, each distance calculation pipeline comprises a calculation management module and a distance calculation module, and the distance calculation module calculates a vector distance according to a vector library and a vector to be queried; the calculation management module extracts vectors to be queried, a vector library and cache vector distances from the high-speed memory according to the segmentation task, and feeds back calculation results of the distance calculation pipeline to the main core;
The second auxiliary cores execute the sorting pipelines according to the received segmentation tasks, each sorting pipeline comprises a data sorting module, the data sorting modules sort vectors according to vector distances obtained by the first auxiliary cores, sorting results of the sorting pipelines are fed back to the main cores, and the main cores output vector query results according to the sorting results.
2. The vector database oriented hybrid acceleration architecture of claim 1, characterized in that the acceleration architecture further comprises a third sub-core of multi-pipelined concurrent processing; the third auxiliary core executes a table lookup pipeline according to the received segmentation task, each table lookup pipeline comprises a table lookup distance calculation module, and the table lookup distance calculation module calculates the table lookup distance according to the compressed vector library and the distance matrix output by the first auxiliary core;
The third auxiliary core stores the calculated table-lookup distances into the high-speed memory, and feeds back the table-lookup results of the table-lookup pipelines to the main core; the compressed vector library is stored in the high-speed memory, and the compressed vectors of the compressed vector library are in one-to-one correspondence with the vectors in the vector library.
3. The vector database oriented hybrid acceleration architecture of claim 2, wherein when the split task received by the first secondary core does not contain a table lookup operation, the first secondary core queries the vectors and outputs vector distances;
when the received segmentation task contains a table look-up operation, the first secondary core queries the vector matrix, outputs a distance matrix, and caches the distance matrix in the high-speed memory; the third secondary core continues to execute the table lookup pipeline according to the cached distance matrix and outputs the table lookup distance, and the second secondary core continues to execute the sorting pipeline based on the table lookup distance, performing vector sorting and result feedback according to the table lookup distance.
4. The vector database oriented hybrid acceleration architecture of claim 2, wherein the main core is provided with a task segmentation module, a queue module and a result module of all the auxiliary cores;
The calculation task segmentation module segments the calculation task and sends the segmented calculation task to the calculation queue module; the calculation queue module establishes a calculation task ordering table according to calculation tasks, and sends the calculation tasks into a parallel processing distance calculation pipeline for execution according to the calculation task sequence; the calculation result module establishes a calculation result feedback table according to the calculation result fed back by the distance calculation assembly line;
The table-lookup task segmentation module segments the table-lookup task and sends the segmented table-lookup task to the table-lookup queue module; the table-lookup queue module establishes a table-lookup task ordering table according to the table-lookup tasks, and sends the table-lookup tasks into a table-lookup pipeline for parallel processing to be executed according to the table-lookup task sequence; the table-lookup result management queue establishes a table-lookup result feedback table according to the table-lookup result fed back by the table-lookup pipeline;
The sequencing task segmentation module segments the sequencing task and sends the segmented sequencing task to the sequencing queue module; the sequencing queue module establishes a sequencing task sequencing table according to the sequencing tasks, and sends the sequencing tasks into a parallel processing sequencing pipeline for execution according to the sequencing task sequence; the sequencing result module establishes a sequencing result feedback table according to the sequencing result fed back by the sequencing assembly line; for the query task without the table lookup operation, the sequencing task segmentation module establishes a sequencing task according to the calculation result in the calculation result feedback table; for the query task including the table lookup operation, the sorting task segmentation module establishes a sorting task according to the table lookup result in the table lookup result feedback table.
5. The vector database oriented hybrid acceleration architecture of claim 3, wherein the computing management module comprises a Query cache unit, a data separation unit, and an ID distance synchronization unit;
the Query caching unit reads and caches all vector data of the execution distance calculation pipeline from a cache; the vector data is a vector containing a unique ID number;
The data separation unit separates the ID number of the vector data from the vector, and temporarily stores and manages the ID number;
The ID distance synchronization unit acquires the output distance of the calculation management module, synchronously binds the temporarily stored ID number of the data separation unit with the output distance, and outputs a vector distance or a matrix distance.
6. The vector database oriented hybrid acceleration architecture of claim 5, characterized in that the distance computation module comprises a single-dimensional distance computation unit and a high-dimensional distance computation unit; the single-dimensional distance calculation unit calculates the multi-dimensional vectors one by one according to the single-dimensional vectors, and the high-dimensional distance calculation unit calculates the dimension distances of all the dimensions of the single beat data in a parallel accumulation manner;
When the calculated amount of the cache vector data is larger than the single-beat processing amount of the distance calculation module, the distance calculation module performs iterative calculation, the high-dimensional distance calculation unit caches the parallel accumulation results of multiple iterations, performs serial accumulation calculation on the iterative parallel accumulation results, and outputs the results.
7. The vector database oriented hybrid acceleration architecture of claim 6, wherein the table look-up distance calculation module comprises a DMA transfer unit, a data alignment unit, a queue selection unit, a queue selector, and a distance accumulation unit;
the DMA transmission unit reads the distance matrix into an on-chip RAM for caching;
The data alignment unit shapes the input compression vector and aligns the data according to the row number and the column number of the distance matrix;
The queue selecting unit determines to select a target distance matrix queue according to the queue selecting signal, and the queue selector outputs a segmented query result of the target distance matrix queue according to the queue selecting signal; the queue selection signal is determined based on the inquiry and creation states of each distance matrix queue;
the distance accumulation unit carries out parallel accumulation on the segmented query results and outputs matrix distances.
8. The vector database oriented hybrid acceleration architecture of claim 6, wherein the data ordering module comprises an ordering unit and a caching unit;
The sorting unit comprises a plurality of cascaded sorters, the cascaded sorters sort the input table lookup distances or vector distances in descending order step by step, and the last sorter outputs the target number of table lookup distances or vector distances in ascending order;
The buffer memory unit is used for storing the table lookup distance or vector distance of the target number of ascending order arrangement output by the last stage sorter and feeding back the ordering result to the main core.
9. The vector database oriented hybrid acceleration architecture of claim 8, characterized in that the sorter parameter is set to m-sorter/n, where m represents the number of data the sorter sorts for each group and n represents the number of buffers in the sorter for auxiliary sorting; each rear sorter receives the output of the front sorter and sorts twice as many data per group as the front sorter; the final stage sorter truncates its output to the set target quantity.
10. The hybrid acceleration architecture of claim 9, wherein the second secondary core further comprises a data reordering module, the data reordering module comprises a binary tree structured comparator, the binary tree structured comparator is connected to the output of each ordering pipeline, the ordering results are compared step by step, the ordering pipeline with the smallest distance is output, and the ordering results are fed back.
11. The vector database oriented hybrid acceleration architecture of claim 10, wherein when the sorting task is split into single sorting pipeline execution, the caching unit directly feeds back to the master core according to the sorting result;
When the sorting task is split into a plurality of parallel sorting pipelines to be executed, a buffer unit in the corresponding sorting pipeline inputs the sorting result to the data rearrangement module, the sorting result is compared step by step through a comparator with a binary tree structure, the minimum sorting distance is output, and the sorting result is fed back to the main core according to the minimum sorting distance.
12. The vector database oriented hybrid acceleration architecture of claim 2, characterized in that the primary core is a CPU, the first secondary core is a GPU, and the second secondary core and the third secondary core are FPGA/ASIC/SOC chips.
CN202410349055.7A 2024-03-26 2024-03-26 Vector database oriented hybrid acceleration architecture Active CN117951346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410349055.7A CN117951346B (en) 2024-03-26 2024-03-26 Vector database oriented hybrid acceleration architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410349055.7A CN117951346B (en) 2024-03-26 2024-03-26 Vector database oriented hybrid acceleration architecture

Publications (2)

Publication Number Publication Date
CN117951346A CN117951346A (en) 2024-04-30
CN117951346B true CN117951346B (en) 2024-05-28

Family

ID=90805493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410349055.7A Active CN117951346B (en) 2024-03-26 2024-03-26 Vector database oriented hybrid acceleration architecture

Country Status (1)

Country Link
CN (1) CN117951346B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118227535B (en) * 2024-05-22 2024-09-06 深存科技(无锡)有限公司 Accelerator architecture for near IO (input/output) pipeline calculation and AI (advanced technology attachment) acceleration system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109597459A (en) * 2017-09-30 2019-04-09 英特尔公司 Processor and method for the privilege configuration in space array
CN114757565A (en) * 2022-04-29 2022-07-15 深圳供电局有限公司 Optimal dynamic power grid data scheduling system based on assembly line type operation
CN116010669A (en) * 2023-01-18 2023-04-25 深存科技(无锡)有限公司 Triggering method and device for retraining vector library, search server and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626334B2 (en) * 2014-12-24 2017-04-18 Intel Corporation Systems, apparatuses, and methods for K nearest neighbor search

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109597459A (en) * 2017-09-30 2019-04-09 英特尔公司 Processor and method for the privilege configuration in space array
CN114757565A (en) * 2022-04-29 2022-07-15 深圳供电局有限公司 Optimal dynamic power grid data scheduling system based on assembly line type operation
CN116010669A (en) * 2023-01-18 2023-04-25 深存科技(无锡)有限公司 Triggering method and device for retraining vector library, search server and storage medium

Also Published As

Publication number Publication date
CN117951346A (en) 2024-04-30

Similar Documents

Publication Publication Date Title
Johnson et al. Billion-scale similarity search with GPUs
Tomasic et al. Performance of inverted indices in shared-nothing distributed text document information retrieval systems
CN117951346B (en) Vector database oriented hybrid acceleration architecture
Li et al. Hippogriffdb: Balancing i/o and gpu bandwidth in big data analytics
Pan et al. Fast GPU-based locality sensitive hashing for k-nearest neighbor computation
Khorasani et al. Scalable simd-efficient graph processing on gpus
Lieberman et al. A fast similarity join algorithm using graphics processing units
CN116627892B (en) Data near storage computing method, device and storage medium
CN110795469B (en) Spark-based high-dimensional sequence data similarity query method and system
CN114329094B (en) Spark-based large-scale high-dimensional data approximate neighbor query system and method
JP4758429B2 (en) Shared memory multiprocessor system and information processing method thereof
CN110874271A (en) Method and system for rapidly calculating mass building pattern spot characteristics
CN112000845B (en) Hyperspatial hash indexing method based on GPU acceleration
CN111966678A (en) Optimization method for effectively improving B + tree retrieval efficiency on GPU
CN115203383A (en) Method and apparatus for querying similarity vectors in a set of candidate vectors
CN108052535B (en) Visual feature parallel rapid matching method and system based on multiprocessor platform
CN115878824B (en) Image retrieval system, method and device
WO2022087785A1 (en) Retrieval device and retrieval method
Zhang et al. Fast Vector Query Processing for Large Datasets Beyond {GPU} Memory with Reordered Pipelining
Wasif et al. Scalable clustering using multiple GPUs
WO2005041067A1 (en) Distributed memory type information processing system
JP7363145B2 (en) Learning device and learning method
Malik et al. Task scheduling for GPU accelerated hybrid OLAP systems with multi-core support and text-to-integer translation
CN115836346A (en) In-memory computing device and data processing method thereof
KR20230169321A (en) Programmable accelerator for data-dependent and irregular operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant