CN114168588A - Vector database storage and retrieval method - Google Patents


Publication number
CN114168588A
Authority
CN
China
Prior art keywords
data
vector
copy
storage engine
fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111195264.3A
Other languages
Chinese (zh)
Inventor
李明昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202111195264.3A
Publication of CN114168588A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2237 Vectors, bitmaps or matrices
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention relates to the field of computers and artificial intelligence, and in particular to a vector database storage and retrieval method, a vector database storage and retrieval device, electronic equipment, and a storage medium. The method comprises the following steps: obtaining a data table and horizontally splitting it into a plurality of fragments (shards), wherein each fragment comprises a plurality of copies (replicas); acquiring data and performing a first mapping on the key data contained in the data, in which the key data is hashed to obtain an integer and the integer is taken modulo the number of fragments to obtain a fragment number; performing a second mapping from the fragment number to the plurality of copies of that fragment, wherein the fragment number identifies a fragment and each fragment corresponds to exactly one fragment number; and establishing, according to the first mapping and the second mapping, a vector storage engine on the metadata server for the acquired data, wherein the vector storage engine determines how the data is searched. The invention improves retrieval accuracy, reduces retrieval latency, and increases reliability and availability.

Description

Vector database storage and retrieval method
Technical Field
The invention relates to the field of computers and artificial intelligence, in particular to the technical field of distributed storage, and more specifically to a vector database storage and retrieval method.
Background
With the spread of artificial intelligence applications, vector similarity retrieval has become a basic service used by more and more artificial intelligence systems, such as face recognition, voice recognition, and news recommendation. The usual approach is to convert pictures, text, and other inputs into feature vectors with a neural network model and then to choose a distance measure (such as inner product distance or cosine distance). Given any vector, the K vectors closest to it can then be found. This search is referred to as KNN (K Nearest Neighbors) retrieval. Although various open source algorithm libraries for KNN retrieval exist, most of them remain at the level of a code library that engineers must initialize and call from a programming language, which causes problems in practice. For example, it is difficult for multiple people to share the same data index, and system performance degrades when the data volume becomes too large.
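The KNN search described in the background can be illustrated with a minimal sketch, assuming cosine distance and an exhaustive scan over all stored vectors. The function names are illustrative and are not taken from the patent; production systems would normally replace the scan with an approximate index.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Inner product of two vectors of equal length.
float InnerProduct(const std::vector<float>& a, const std::vector<float>& b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
  return sum;
}

// Cosine distance = 1 - cosine similarity; a smaller value means closer.
float CosineDistance(const std::vector<float>& a, const std::vector<float>& b) {
  float norm = std::sqrt(InnerProduct(a, a)) * std::sqrt(InnerProduct(b, b));
  if (norm == 0.0f) return 1.0f;  // assumption: a zero vector is unrelated
  return 1.0f - InnerProduct(a, b) / norm;
}

// Returns (distance, index) pairs for the k stored vectors closest to query,
// ordered from nearest to farthest.
std::vector<std::pair<float, std::size_t>> BruteForceKnn(
    const std::vector<std::vector<float>>& data,
    const std::vector<float>& query, std::size_t k) {
  std::vector<std::pair<float, std::size_t>> scored;
  scored.reserve(data.size());
  for (std::size_t i = 0; i < data.size(); ++i) {
    scored.emplace_back(CosineDistance(data[i], query), i);
  }
  k = std::min(k, scored.size());
  std::partial_sort(scored.begin(), scored.begin() + k, scored.end());
  scored.resize(k);
  return scored;
}
```

Calling `BruteForceKnn(data, query, 2)` returns the two nearest stored vectors by cosine distance. The cost grows linearly with the number of stored vectors, which is one reason the data is split across fragments and indexed.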
Disclosure of Invention
In view of the defects in the prior art, the invention provides a vector database storage and retrieval method that addresses the problems of sharing an index over the same data among multiple people and of poor system performance caused by excessive data volume.
In order to achieve the purpose, the invention provides the following technical scheme:
In a first aspect of the embodiments of the present invention, a vector database storage and retrieval method is provided, the method including: obtaining a data table and horizontally splitting the data table into a plurality of fragments, wherein each fragment comprises a plurality of copies;
acquiring data, performing a first mapping according to key data contained in the data, in which the key data is hashed to obtain an integer and the integer is taken modulo the number of fragments to obtain a fragment number, then performing a second mapping, and establishing a metadata server for the acquired data according to the first mapping and the second mapping; wherein the fragment number identifies a fragment, each fragment corresponds to a plurality of copies, the plurality of copies include a master copy, each copy corresponds to one vector storage engine, and the vector storage engine corresponding to the master copy is obtained according to the fragment number;
writing vector data into the obtained master-copy vector storage engine;
and performing vector similarity retrieval on the obtained master-copy vector storage engine.
In an optional embodiment, the step of obtaining a data table and horizontally dividing the data table into a plurality of segments includes:
the metadata server comprises a plurality of data servers; a main server and secondary servers are elected from the data servers based on the Raft protocol, and the main server and the secondary servers exchange data so that the metadata in the main server and the secondary servers remains consistent.
In an optional embodiment, the step of acquiring data, performing a first mapping according to key data contained in the data, hashing the key data to obtain an integer, taking the integer modulo the number of fragments to obtain a fragment number, then performing a second mapping, establishing a metadata server for the acquired data according to the first mapping and the second mapping, and obtaining the vector storage engine corresponding to the master copy according to the fragment number includes:
assigning a number to each of the plurality of fragments, and matching the obtained integer against these numbers to obtain the fragment number and thereby determine the fragment.
In an optional embodiment, the step of writing vector data to the one primary-copy vector storage engine based on the obtained one primary-copy vector storage engine includes:
the data also includes vector data, and each copy corresponds to a vector storage engine on a data server;
based on the vector data and the key data that the data includes, it is determined that the data corresponds to a vector storage engine on a particular data server.
In an alternative embodiment, the step of writing vector data to a primary-copy vector storage engine based on the retrieved primary-copy vector storage engine comprises:
obtaining an index request for vector data similarity, searching at least one copy of each fragment, selecting from each searched copy a preset number of vector data with the highest similarity, then sorting all selected vector data by similarity, and outputting the preset number of vector data with the highest similarity.
In an optional embodiment, the step of performing vector similarity retrieval on the acquired primary-copy vector storage engine based on the acquired primary-copy vector storage engine includes:
according to the acquired data, the vector data is represented as a floating-point array, the floating-point array is converted into binary data, and the binary data is stored in a key-value storage engine, wherein the key-value storage engine is indexed by the key data;
if a query is made by key data, the binary data is looked up in the key-value storage engine according to the key data and is then converted back into vector data for output.
In an optional embodiment, in the step of obtaining a data table and horizontally splitting the data table into a plurality of fragments, wherein each fragment includes a plurality of copies, the data table includes:
a memory table, which is a table held in memory;
a fixed memory table, which is a table placed on disk; wherein,
after the number of memory tables reaches a first preset threshold, the memory tables are converted into fixed memory tables, a vector storage engine is established for the vector data in the fixed memory tables, and the vector storage engine is stored on disk to generate a first index file;
and when the number of first index files reaches a second preset threshold, the plurality of first index files are merged into a second index file.
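The two-threshold flow above (memory table, fixed table with a first index file, merged second index file) can be sketched roughly as follows. This is only an illustration under stated assumptions: the first threshold is read here as a row count of the memory table, which is one possible interpretation of the text; the class name `TieredStore` and its members are hypothetical; and in-memory vectors stand in for on-disk index files.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Row = std::pair<std::string, std::vector<float>>;  // key data, vector data

class TieredStore {
 public:
  TieredStore(std::size_t flush_rows, std::size_t merge_files)
      : flush_rows_(flush_rows), merge_files_(merge_files) {}

  // Appends a row to the memory table and flushes once the first
  // threshold (assumed to be a row count) is reached.
  void Write(const std::string& key, const std::vector<float>& vec) {
    memtable_.emplace_back(key, vec);
    if (memtable_.size() >= flush_rows_) Flush();
  }

  std::size_t MemtableRows() const { return memtable_.size(); }
  std::size_t FirstIndexFiles() const { return first_files_.size(); }
  std::size_t SecondIndexFiles() const { return second_files_.size(); }

 private:
  // Freezes the memory table and turns it into one first index file.
  void Flush() {
    first_files_.push_back(std::move(memtable_));
    memtable_.clear();
    if (first_files_.size() >= merge_files_) Merge();
  }

  // Merges all first index files into a single second index file.
  void Merge() {
    std::vector<Row> merged;
    for (std::vector<Row>& file : first_files_) {
      for (Row& row : file) merged.push_back(std::move(row));
    }
    first_files_.clear();
    second_files_.push_back(std::move(merged));
  }

  std::size_t flush_rows_;   // first preset threshold
  std::size_t merge_files_;  // second preset threshold
  std::vector<Row> memtable_;
  std::vector<std::vector<Row>> first_files_;   // stand-ins for files on disk
  std::vector<std::vector<Row>> second_files_;  // stand-ins for merged files
};
```

With `TieredStore store(2, 3)`, every second write produces a first index file and every third such file triggers a merge, so seven writes leave one merged file and one row still in memory.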
In a second aspect of the embodiments of the present invention, there is also provided a vector database apparatus, including:
a creating unit, configured to obtain a data table and horizontally split the data table into a plurality of fragments, wherein each fragment comprises a plurality of copies;
a data acquisition unit, configured to acquire data, perform a first mapping according to key data contained in the data, in which the key data is hashed to obtain an integer and the integer is taken modulo the number of fragments to obtain a fragment number, then perform a second mapping, and establish a metadata server for the acquired data according to the first mapping and the second mapping; wherein the fragment number identifies a fragment, each fragment corresponds to a plurality of copies, the plurality of copies include a master copy, each copy corresponds to one vector storage engine, and the vector storage engine corresponding to the master copy is obtained according to the fragment number;
a vector data writing unit, configured to write vector data to the obtained master-copy vector storage engine;
and a vector similarity retrieval unit, configured to perform vector similarity retrieval on the obtained master-copy vector storage engine.
In a third aspect of the embodiments of the present invention, there is further provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used for storing a computer program; and the processor is used for implementing the steps of the method when executing the program stored in the memory.
In a fourth aspect of the embodiments of the present invention, there is also provided a storage medium having stored thereon a computer program that, when executed by a processor, implements the method.
According to the technical scheme provided by the embodiments of the invention, a data table is obtained and horizontally split into a plurality of fragments, the acquired data is mapped twice so that it is stored in the metadata server, and a vector storage engine is established for the data, so that the data can later be searched quickly through the vector storage engine. Metrics such as searchable data volume, retrieval latency, retrieval accuracy, system performance, reliability, and availability are thereby significantly improved.
For a better understanding of the nature and technical aspects of the present invention, reference should be made to the following detailed description of the invention, taken in conjunction with the accompanying drawings, which are provided for purposes of illustration and description and are not intended to limit the invention.
Drawings
The technical solution and other advantages of the present invention will become apparent from the following detailed description of specific embodiments of the present invention, which is to be read in connection with the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation flow of a vector database storage and retrieval method according to the present invention.
FIG. 2 is a schematic diagram of an overall architecture of a vector database storage and retrieval method according to the present invention.
FIG. 3 is a schematic diagram of a table after horizontal splitting in the vector database storage and retrieval method according to the present invention.
FIG. 4 is a schematic diagram of the metadata double mapping of the vector database storage and retrieval method according to the present invention.
FIG. 5 is a schematic diagram of data acquisition for a user index request in the vector database storage and retrieval method according to the present invention.
FIG. 6 is a schematic diagram of the vector database storage and retrieval method according to the present invention before system capacity expansion.
FIG. 7 is a schematic diagram of the vector database storage and retrieval method according to the present invention after system capacity expansion.
FIG. 8 is a schematic diagram of the vector database storage and retrieval method according to the present invention before system capacity reduction.
FIG. 9 is a schematic diagram of the vector database storage and retrieval method according to the present invention after system capacity reduction.
FIG. 10 is a schematic diagram of the vector database storage and retrieval method according to the present invention before master copy load balancing.
FIG. 11 is a schematic diagram of the vector database storage and retrieval method according to the present invention after master copy load balancing.
FIG. 12 is a schematic diagram of the vector database storage and retrieval method according to the present invention before failed data recovery.
FIG. 13 is a schematic diagram of the vector database storage and retrieval method according to the present invention after failed data recovery.
FIG. 14 is a schematic diagram of a vector storage engine working process of the vector database storage and retrieval method of the present invention.
FIG. 15 is a schematic diagram of the operation of the vector storage engine of the vector database storage and retrieval method of the present invention.
FIG. 16 is a diagram illustrating the operation of the index manager of the vector database storage and retrieval method of the present invention.
FIG. 17 is a diagram illustrating a plug-in loaded index for a vector database storage and retrieval method according to the present invention.
FIG. 18 is a diagram illustrating vector data writing in a vector database storage and retrieval method according to the present invention.
FIG. 19 is a schematic diagram of index file fusion in the vector database storage and retrieval method according to the present invention.
FIG. 20 is a schematic diagram of similarity search in a vector database storage and search method according to the present invention.
FIG. 21 is a schematic diagram of a vector database storage and retrieval apparatus according to the present invention.
Fig. 22 is a schematic structural diagram of an electronic device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The English terms used in this scheme are explained as follows:
Raft is a distributed protocol that is widely used in engineering and is decentralized and highly available.
Hash: a hash algorithm converts an input of any length into an output of fixed length.
As shown in fig. 2, the metadata of the system is stored in a metadata server. The metadata server is typically deployed on multiple machines: three metadata servers can form a group, and the Raft protocol ensures the reliability and consistency of metadata within the group. The Raft protocol elects a main server, and the remaining secondary servers interact with the main server. Several interface development libraries are provided on the main server, for example a command console interface development library, a web console interface development library, and an application program interface development library, each of which can be used to input and retrieve data on the metadata server. The metadata is used to manage the topology of the whole cluster. The topology may specifically include: table (Table), fragment (Partition), and copy (Replica). For large clusters, it may also include region (Region) and zone (Zone). For a server that mounts multiple storage disks on a single machine, it may include disk (Disk).
As shown in fig. 1, an embodiment of the present invention provides a method for storing and retrieving a vector database, where the method includes the following steps:
S100: obtaining a data table and horizontally splitting the data table into a plurality of fragments, wherein each fragment comprises a plurality of copies.
Fragmentation (commonly called sharding) is the process of dividing a data set into two or more smaller blocks; each resulting fragment includes multiple copies.
It should be noted that the data table may be formed from data of the same type of service; data of the same service type is organized into one table for the subsequent storage and indexing of metadata.
As shown in fig. 3, each data server is divided into a plurality of regions, and each region is further divided into a plurality of fragments by horizontal splitting (also called horizontal partitioning). Splitting the database horizontally into a plurality of fragments reduces the load on any single machine.
Data is synchronized among the different copies of a fragment through the Raft protocol to keep the copies consistent. Each copy corresponds to one vector storage engine, which facilitates subsequent indexing of the data.
It should be noted that the metadata server includes a plurality of data servers; a main server and secondary servers are elected from the data servers based on the Raft protocol, and the main server and the secondary servers exchange data so that the metadata in the main server and the secondary servers is consistent.
S200: acquiring data, performing a first mapping according to key data contained in the data, in which the key data is hashed to obtain an integer and the integer is taken modulo the number of fragments to obtain a fragment number, then performing a second mapping, and establishing a metadata server for the acquired data according to the first mapping and the second mapping; the fragment number identifies a fragment, each fragment corresponds to multiple copies, the multiple copies include a master copy, each copy corresponds to one vector storage engine, and the vector storage engine corresponding to the master copy is obtained according to the fragment number.
The fragment number identifies a fragment, and each fragment corresponds to exactly one fragment number.
It should be noted that a number is assigned to each of the fragments, and the fragment number is obtained by matching the obtained integer against these numbers, thereby determining the fragment.
As shown in fig. 3, suppose for example that the fragments are numbered 1 to 6. After the data is acquired, the key is hashed to obtain the integer 4, which selects fragment 4 in fig. 3. The second mapping then maps the data to copy 1, copy 2, and copy 3 of that fragment, and the data is indexed according to this mapping relationship.
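The two mappings can be sketched as follows. This is a minimal illustration under stated assumptions: the patent does not name a hash function, so FNV-1a is used here only because it is simple and deterministic; fragment numbers are zero-based in the sketch, whereas the example above numbers fragments from 1; and the `Copy` and `FragmentTable` types are hypothetical stand-ins for the metadata held by the metadata server.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a 64-bit hash (an assumption; the source does not name a hash).
std::uint64_t HashKey(const std::string& key) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
  for (unsigned char c : key) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// First mapping: key data -> fragment number in [0, fragment_count).
// fragment_count must be nonzero.
std::size_t FragmentOf(const std::string& key, std::size_t fragment_count) {
  return static_cast<std::size_t>(HashKey(key) % fragment_count);
}

struct Copy {
  std::string data_server;  // hypothetical data server identifier
  bool is_master;           // true for the master copy of the fragment
};

// Second mapping: fragment number -> copies of that fragment.
using FragmentTable = std::vector<std::vector<Copy>>;

// Returns the master copy of the fragment that owns key, or nullptr
// if the table is empty or the fragment has no master copy.
const Copy* MasterCopyFor(const FragmentTable& table, const std::string& key) {
  if (table.empty()) return nullptr;
  const std::vector<Copy>& copies = table[FragmentOf(key, table.size())];
  for (const Copy& c : copies) {
    if (c.is_master) return &c;
  }
  return nullptr;
}
```

Because the fragment number depends only on the key and the fragment count, every writer and reader that sees the same metadata resolves a key to the same master copy.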
S300: writing vector data to the obtained master-copy vector storage engine.
It should be noted that, for any piece of data (which includes key data and vector data), metadata is obtained through the two mappings, and that metadata corresponds to one vector storage engine, so that any piece of data corresponds to one vector storage engine on a specific data server.
It should also be noted that the data may further include vector data, each copy corresponds to a vector storage engine on one of the data servers, and the data is determined to correspond to a vector storage engine on a particular data server based on the vector data and the key data included in the data.
As shown in fig. 4, each fragment can, based on Raft, divide its three copies into one master copy and two slave copies, and each copy corresponds to one vector storage engine, which is used to keep the copy data consistent and to serve data calls.
After step S300, a search request may then be executed.
S400: performing vector similarity retrieval on the obtained master-copy vector storage engine.
It should be noted that an index request for vector data similarity is obtained, at least one copy of each fragment is searched, a preset number of vector data with the highest similarity is selected from each searched copy, all selected vector data are then sorted by similarity, and the preset number of vector data with the highest similarity is output.
As shown in fig. 5, for example, when a user sends a vector similarity index request to the metadata server, the request may be preset to return the 20 vectors most similar to a given vector. The system selects one copy from each fragment of the table and searches it, and each copy returns its 20 most similar vectors. Finally, the results are aggregated and sorted from high to low similarity, and the 20 most similar vectors overall are returned to the user as the final result, which improves index accuracy.
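The aggregation step can be sketched as a merge of per-fragment result lists. The `Hit` type and the `MergeTopK` function are illustrative names, not from the patent, and the sketch assumes each fragment has already returned its own top results with a similarity score where higher means more similar.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Hit {
  std::string key;   // key data of the matched vector
  float similarity;  // higher means more similar
};

// Merges the result lists returned by one copy of each fragment and
// keeps the k most similar hits, ordered from most to least similar.
std::vector<Hit> MergeTopK(const std::vector<std::vector<Hit>>& per_fragment,
                           std::size_t k) {
  std::vector<Hit> all;
  for (const std::vector<Hit>& hits : per_fragment) {
    all.insert(all.end(), hits.begin(), hits.end());
  }
  std::sort(all.begin(), all.end(), [](const Hit& a, const Hit& b) {
    return a.similarity > b.similarity;
  });
  if (all.size() > k) all.resize(k);
  return all;
}
```

Since every fragment contributes its own top k, the global top k is guaranteed to be among the merged candidates, so the final list is exact with respect to the per-fragment searches.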
As shown in fig. 6 and 7, when a large amount of vector data is inserted during operation, the vector data is distributed evenly across a plurality of different servers, so no server is overloaded and the performance of the metadata server is preserved. The Raft protocol runs among the copies of each fragment to ensure data consistency and reliability. When the cluster data volume becomes too large, capacity can be expanded by adding data servers. The system automatically selects suitable copies and moves them to the newly added data server. While data is moved, the Raft protocol is followed to ensure data consistency. This process may be referred to as system capacity expansion.
As shown in fig. 8 and fig. 9, conversely, when the system data volume is too small, capacity can be reduced so that machines are freed for other purposes and resources are saved. The system scans the machines to be taken offline, finds all copies on them, selects suitable destinations, and moves the copies there. While data is moved, the Raft protocol is followed to ensure data consistency. This process may be referred to as system capacity reduction.
As shown in fig. 10 and 11, because the Raft protocol involves a degree of randomness, the number of master copies may be uneven across data servers. To ensure data consistency, data is written to the master copy. Each data server therefore sees a different write bandwidth, and on servers with many master copies the write bandwidth may become too large or even saturated. In this case, the system can select some master copies on the server with the most master copies, demote them to slave copies, and elect new master copies on other machines. Eventually the number of master copies on each machine is approximately equal, which avoids saturating the write bandwidth of any single machine. While master copies are re-elected, the Raft protocol is followed to ensure data consistency. This process may be referred to as master copy load balancing.
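One way to picture master copy load balancing is a greedy pass over the placement metadata. The sketch below is an assumption-laden illustration, not the patented procedure: it only computes which copy should become master, it uses a simple "at least two fewer masters" rule chosen for this example, and the actual demotion and re-election would go through Raft. All type and function names are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct FragmentCopies {
  std::vector<std::string> servers;  // data servers holding a copy
  std::size_t master;                // index into servers of the master copy
};

// Counts master copies per data server (servers with none are listed as 0).
std::map<std::string, int> CountMasters(const std::vector<FragmentCopies>& frags) {
  std::map<std::string, int> counts;
  for (const FragmentCopies& f : frags) {
    for (const std::string& s : f.servers) counts[s] += 0;  // register server
    counts[f.servers[f.master]] += 1;
  }
  return counts;
}

// Greedy rebalancing: hand mastership to a slave copy whenever its server
// holds at least two fewer masters than the current master's server.
// Returns the number of mastership transfers performed.
int BalanceMasters(std::vector<FragmentCopies>& frags) {
  std::map<std::string, int> counts = CountMasters(frags);
  int transfers = 0;
  bool moved = true;
  while (moved) {
    moved = false;
    for (FragmentCopies& f : frags) {
      const std::string from = f.servers[f.master];
      for (std::size_t i = 0; i < f.servers.size(); ++i) {
        if (counts[from] - counts[f.servers[i]] >= 2) {
          --counts[from];
          ++counts[f.servers[i]];
          f.master = i;
          ++transfers;
          moved = true;
          break;
        }
      }
    }
  }
  return transfers;
}
```

Each transfer moves a master from a more loaded server to a less loaded one, so the loop terminates once no pair of servers sharing a fragment differs by two or more masters.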
As shown in fig. 12 and 13, as the cluster grows, the probability of a server or disk failure also increases. When a server or a disk fails, the system takes the failed hardware offline and at the same time scans the metadata to find all copies on the failed hardware. It then selects suitable locations on other healthy servers and recovers there the copy data that was on the failed hardware. While data is recovered, the Raft protocol is followed to ensure data consistency. This process is referred to as failed data recovery.
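The metadata scan and destination selection can be sketched as below. The placement rule used here (pick the least-loaded healthy server that does not already hold a copy of the same fragment) is an assumption made for illustration, since the text only says a "suitable" location is selected; the `Placement` type and function name are hypothetical, and copying the actual data under Raft is outside the sketch.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical metadata: for each fragment, the servers holding its copies.
using Placement = std::vector<std::vector<std::string>>;

// Re-places every copy found on the failed server onto the least-loaded
// healthy server that does not already hold that fragment.
// Returns the number of copies that were re-placed.
int RecoverFailedServer(Placement& placement, const std::string& failed,
                        const std::vector<std::string>& healthy) {
  // Current number of copies on each healthy server.
  std::map<std::string, int> load;
  for (const std::string& s : healthy) load[s] = 0;
  for (const auto& copies : placement) {
    for (const std::string& s : copies) {
      if (load.count(s)) ++load[s];
    }
  }

  int recovered = 0;
  for (auto& copies : placement) {
    for (std::string& slot : copies) {
      if (slot != failed) continue;
      const std::string* best = nullptr;
      for (const std::string& cand : healthy) {
        bool holds = false;
        for (const std::string& s : copies) holds = holds || (s == cand);
        if (holds) continue;  // keep copies of a fragment on distinct servers
        if (best == nullptr || load[cand] < load[*best]) best = &cand;
      }
      if (best != nullptr) {
        slot = *best;
        ++load[*best];
        ++recovered;
      }
    }
  }
  return recovered;
}
```

After the call, no fragment lists the failed server unless no eligible destination existed, and the new copies are spread toward the servers that held the fewest copies.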
In an optional embodiment, the step S200 may further include:
S201: according to the acquired data, the vector data is represented as a floating-point array, the floating-point array is converted into binary data, and the binary data is stored in a key-value storage engine, wherein the key-value storage engine is indexed by the key data;
if a query is made by key data, the binary data is looked up in the key-value storage engine according to the key data and is then converted back into vector data for output.
As shown in fig. 14, it should be noted that vector data is represented as a floating-point array; the floating-point array is converted into binary data with a conventional serialization tool, and the resulting binary data is stored in the key-value storage engine.
When data is read, the binary data is read from the key-value storage engine and then deserialized into vector data.
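The write and read paths can be illustrated with a minimal round trip. The text refers only to a "conventional serialization tool", so the raw `memcpy` packing below is a stand-in chosen for brevity (it uses the host's float layout and is not portable across architectures), and `KvStore` is a hypothetical in-memory substitute for the key-value storage engine.

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Packs the floats into a byte string using the host's float layout.
std::string SerializeVector(const std::vector<float>& vec) {
  std::string bytes(vec.size() * sizeof(float), '\0');
  if (!vec.empty()) std::memcpy(&bytes[0], vec.data(), bytes.size());
  return bytes;
}

// Inverse of SerializeVector; trailing bytes that do not form a
// complete float are ignored.
std::vector<float> DeserializeVector(const std::string& bytes) {
  std::vector<float> vec(bytes.size() / sizeof(float));
  if (!vec.empty()) {
    std::memcpy(vec.data(), bytes.data(), vec.size() * sizeof(float));
  }
  return vec;
}

// Stand-in for the key-value storage engine: key data -> binary data.
using KvStore = std::map<std::string, std::string>;
```

Storing `SerializeVector(v)` under a key and later passing the stored bytes to `DeserializeVector` returns the original floating-point array.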
For index data, only the key data corresponding to each vector object (a vector object includes vector data and additional fields) needs to be kept in memory; the other data is stored on disk. When needed, the key data is used to make random queries against the fragments stored on disk. This reduces memory usage and avoids the memory shortage that arises in existing indexes when the full data occupies memory.
What is written to the storage engine is a vector object that includes additional fields besides the vector data itself. These fields may hold additional user data. For example, in face recognition, a field may store the URL (uniform resource locator, which specifies the location of a resource on the internet) of the original picture. After a similarity search, the URL of the original image can then be returned together with the vector data.
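The vector object code is as follows:
syntax = "proto3";
// Vector object written to the storage engine.
message VecObj {
  string key = 1;            // key data identifying the vector
  Vec vec = 2;               // the vector data itself
  string attach_value1 = 3;  // additional user fields, e.g. a picture URL
  string attach_value2 = 4;
  string attach_value3 = 5;
}
// Vector data as a floating-point array.
message Vec {
  repeated float data = 1;
}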
As shown in fig. 15, the vector storage engine uses a calculation module to perform computation on vector data. The calculation module may be a CPU.
In fig. 15, the storage module is used for storing the vector index and the calculation module, and the storage module may be a magnetic disk.
As shown in fig. 16, the vector storage engine manages vector indexes using an index manager for a program for managing indexes, which manages a plurality of vector indexes. The vector index can be queried, sorted, screened and the like through the index manager.
In the vector index set, random query can be performed according to index names, and sequencing can be performed according to creating timestamps. The index name is automatically generated internally by the vector storage engine and is globally unique. The user can obtain the vector index according to the index name; the vector index can be obtained by screening the time stamp according to the index type; if no name and filtering conditions are given, the index manager will return the latest vector index.
The index name may be formed by combining the table name, the fragment number, the copy number, the index type, and the current timestamp.
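A minimal sketch of such a naming scheme is given below. The function name `MakeIndexName` and the underscore separator are assumptions; the text only lists the components, not how they are joined.

```cpp
#include <cassert>
#include <ctime>
#include <string>

// Hypothetical helper: joins the five components with '_' (separator is an assumption).
std::string MakeIndexName(const std::string& table, int fragment_no, int copy_no,
                          const std::string& index_type, std::time_t created) {
    return table + "_" + std::to_string(fragment_no) + "_" + std::to_string(copy_no) +
           "_" + index_type + "_" + std::to_string(static_cast<long long>(created));
}
```

Because the timestamp is part of the name, two indexes built for the same copy at different times receive different names, which is consistent with the name being globally unique.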
As shown in fig. 17, the interface development libraries, such as the command console interface development library, the web console interface development library, and the application program interface development library, differ in the computation method and the index object used for retrieval, for example random forest, locality-sensitive hashing, and clustering compression. Each algorithm has libraries implemented for different computing hardware, such as CPU and GPU. Some are open-source libraries, and others are developed by users themselves. A uniform interface is therefore needed: at the programming-language level, all index implementations inherit from the same interface class and differ only internally. Different consoles can load different dynamic library files through the uniform interface, so that indexes are loaded as plug-ins at run time. When a new index algorithm appears in the future, a user can write an index plug-in and load the new index into the vector storage engine.
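The code of the interface is as follows:
#include <ctime>
#include <string>
#include <vector>

class Status;  // result type, assumed to be defined elsewhere in the engine

// One search result: the key of a matched vector object, its distance to the
// query vector, and the additional fields stored with the object.
class VecDt {
 public:
  // Results are ordered by distance so that they can be sorted.
  bool operator<(const VecDt& rhs) const { return distance_ < rhs.distance_; }
  bool operator>(const VecDt& rhs) const { return distance_ > rhs.distance_; }

 private:
  std::string key_;
  float distance_;
  std::string attach_value1_;
  std::string attach_value2_;
  std::string attach_value3_;
};

// Abstract interface that every index plug-in inherits from.
class VIndex {
 public:
  VIndex() = default;
  VIndex(const VIndex&) = delete;             // indexes are not copyable
  VIndex& operator=(const VIndex&) = delete;
  virtual ~VIndex() = default;

  // Find the `limit` nearest neighbours of the stored vector identified by `key`.
  virtual Status GetKNN(const std::string& key, int limit,
                        std::vector<VecDt>& results) = 0;
  // Find the `limit` nearest neighbours of the given query vector.
  virtual Status GetKNN(const std::vector<float>& vec, int limit,
                        std::vector<VecDt>& results) = 0;
  virtual Status Build() = 0;  // build the index
  virtual Status Load() = 0;   // load an existing index

 protected:
  int dim_;                    // vector dimension
  std::string index_type_;
  std::string distance_type_;
  std::string replica_name_;
  time_t timestamp_;           // creation timestamp
  std::string name_;           // index name
  std::string path_;           // location of the index file
};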
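To illustrate how different index implementations can sit behind one interface and be selected at run time, the following sketch uses a deliberately simplified stand-in for the interface above. Everything here is hypothetical: `IndexPlugin`, `Hit`, `BruteForceIndex`, and the registry functions are illustrative names, an in-process registry replaces the dynamic-library loading described in the text, and exact brute-force search with squared Euclidean distance replaces the approximate algorithms the patent mentions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Hit {
    std::string key;
    float distance;
};

// Simplified uniform interface (stand-in, not the patent's class).
class IndexPlugin {
public:
    virtual ~IndexPlugin() = default;
    virtual void Add(const std::string& key, const std::vector<float>& vec) = 0;
    virtual std::vector<Hit> GetKNN(const std::vector<float>& query,
                                    std::size_t limit) const = 0;
};

// One possible plug-in: exact search by squared Euclidean distance.
class BruteForceIndex : public IndexPlugin {
public:
    void Add(const std::string& key, const std::vector<float>& vec) override {
        items_.emplace_back(key, vec);
    }
    std::vector<Hit> GetKNN(const std::vector<float>& query,
                            std::size_t limit) const override {
        std::vector<Hit> hits;
        for (const auto& item : items_) {
            const std::size_t n = std::min(query.size(), item.second.size());
            float d = 0.0f;
            for (std::size_t i = 0; i < n; ++i) {
                const float diff = query[i] - item.second[i];
                d += diff * diff;
            }
            hits.push_back(Hit{item.first, d});
        }
        std::sort(hits.begin(), hits.end(),
                  [](const Hit& a, const Hit& b) { return a.distance < b.distance; });
        if (hits.size() > limit) hits.resize(limit);
        return hits;
    }

private:
    std::vector<std::pair<std::string, std::vector<float>>> items_;
};

// Registry mapping an index-type name to a factory function.
using IndexFactory = std::unique_ptr<IndexPlugin> (*)();

std::map<std::string, IndexFactory>& Registry() {
    static std::map<std::string, IndexFactory> registry;
    return registry;
}

std::unique_ptr<IndexPlugin> MakeBruteForce() {
    return std::unique_ptr<IndexPlugin>(new BruteForceIndex());
}

// Create an index by type name; returns nullptr for an unknown type.
std::unique_ptr<IndexPlugin> CreateIndex(const std::string& type) {
    const auto it = Registry().find(type);
    if (it == Registry().end()) return nullptr;
    return it->second();
}
```

A caller would register a factory once, for example `Registry()["brute_force"] = &MakeBruteForce;`, and afterwards only deal with `IndexPlugin` pointers. With real dynamic libraries the registration step would instead happen when the library file is loaded.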
In an optional embodiment, in the step of obtaining a data table and horizontally splitting the data table into a plurality of fragments, wherein each fragment includes a plurality of copies, the data table includes:
a memory table, which is a table placed in memory;
a fixed memory table, which is a table placed on the magnetic disk; wherein,
after the number of memory tables reaches a first preset threshold, the memory tables are converted into fixed memory tables, a vector storage engine is established for the vector data in the fixed memory tables, and the vector storage engine is stored on the magnetic disk to generate a first index file;
and when the number of first index files reaches a second preset threshold, the plurality of first index files are integrated into a second index file.
It should be noted that vector similarity search is generally used as follows: all vectors are loaded into the database, an index is built, and then searches are performed. The inconvenience is that, once the index has been built, newly added vectors cannot be retrieved through the existing index. Rebuilding the index immediately after every insertion hurts efficiency and repeats computation, and keeping too many indexes consumes a large amount of disk space. The invention therefore creates a transparent index (Transparent Index) based on the idea of the LSM-Tree (Log-Structured Merge-Tree, which consists of two or more structures for storing data). The user does not need to build indexes manually, maintain multiple indexes, or delete expired indexes; the user only needs to keep inserting vectors, and newly inserted vectors can then be retrieved. The method is simple and saves both computation and disk space.
As shown in fig. 18, the transparent index may specifically include 4 modules:
The memory table: after vector data is inserted into the vector storage engine, it is first stored in the memory table, in memory.
The fixed memory table: after the number of memory tables reaches a certain threshold, the memory table is converted into a fixed memory table, which no longer accepts data writes.
The index file: at an appropriate time, the scheduler builds an index over the vector data in the fixed memory table and writes it to disk to generate an index file.
The scheduler: a thread pool that converts fixed memory tables into index files at preset times and merges multiple index files into one index file.
Vector data occupies memory, so the vector data in the memory buffer is written to disk at a preset time and an index file is built for it. Subsequently inserted vector data also enters the memory buffer. The scheduling thread of the background scheduler automatically selects the preset time according to its configuration and merges the vector data in the buffer with the index files on disk.
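The life cycle described above (memory table, fixed memory table, index file, merge) can be sketched as a small state machine. This is an illustration under several assumptions: the class `TransparentIndex` and its methods are hypothetical, the first threshold is interpreted as a count of entries in the memory table, index files are represented by in-memory vectors rather than files on disk, and the scheduler is a method called explicitly instead of a background thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::vector<float>>;  // key + vector data
using Table = std::vector<Entry>;

class TransparentIndex {
public:
    TransparentIndex(std::size_t flush_threshold, std::size_t merge_threshold)
        : flush_threshold_(flush_threshold), merge_threshold_(merge_threshold) {
        assert(flush_threshold_ > 0 && merge_threshold_ > 0);
    }

    // New vectors always go to the memory table; a full memory table is
    // frozen into a fixed memory table that accepts no further writes.
    void Insert(const std::string& key, const std::vector<float>& vec) {
        memtable_.emplace_back(key, vec);
        if (memtable_.size() >= flush_threshold_) {
            fixed_tables_.push_back(std::move(memtable_));
            memtable_.clear();  // reset the moved-from table to a known empty state
        }
    }

    // One scheduler pass: turn fixed tables into index files, then merge the
    // index files into one when there are too many of them.
    void RunScheduler() {
        for (auto& table : fixed_tables_) index_files_.push_back(std::move(table));
        fixed_tables_.clear();
        if (index_files_.size() >= merge_threshold_) {
            Table merged;
            for (const auto& file : index_files_)
                merged.insert(merged.end(), file.begin(), file.end());
            index_files_.clear();
            index_files_.push_back(std::move(merged));
        }
    }

    std::size_t MemtableSize() const { return memtable_.size(); }
    std::size_t FixedTableCount() const { return fixed_tables_.size(); }
    std::size_t IndexFileCount() const { return index_files_.size(); }
    std::size_t IndexedEntries() const {
        std::size_t n = 0;
        for (const auto& file : index_files_) n += file.size();
        return n;
    }

private:
    std::size_t flush_threshold_;
    std::size_t merge_threshold_;
    Table memtable_;
    std::vector<Table> fixed_tables_;
    std::vector<Table> index_files_;  // stand-in for index files on disk
};
```

The point of the design is that inserts never wait for index construction: they only append to the memory table, while the expensive work is deferred to the scheduler.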
As shown in fig. 20, for a query request, for example a query for the top K most similar vectors, the system finds K similar vectors in the memory table and K similar vectors in each of the N index files on disk (K × N in total), then sorts the (N + 1) × K candidates, selects the top K, and returns them to the user. The number of index files, the merge conditions, and the algorithm can be adjusted dynamically according to actual operating conditions.
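The final merge step of such a query can be sketched as follows, assuming that a smaller distance means a more similar vector. `Candidate` and `MergeTopK` are hypothetical names; each inner list represents the top-K result of one source (the memory table or one index file).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Candidate {
    std::string key;
    float distance;  // smaller means more similar (assumption)
};

// Merge the per-source top-K lists (1 memory table + N index files) and keep
// the K best candidates overall.
std::vector<Candidate> MergeTopK(const std::vector<std::vector<Candidate>>& per_source,
                                 std::size_t k) {
    std::vector<Candidate> all;
    for (const auto& list : per_source) all.insert(all.end(), list.begin(), list.end());
    const std::size_t keep = std::min(k, all.size());
    // Only the first `keep` positions need to be in sorted order.
    std::partial_sort(all.begin(), all.begin() + keep, all.end(),
                      [](const Candidate& a, const Candidate& b) {
                          return a.distance < b.distance;
                      });
    all.resize(keep);
    return all;
}
```

Since at most (N + 1) × K candidates are merged, the cost of this step depends on the number of index files, which is one reason for merging index files in the background.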
Corresponding to the foregoing method embodiment, an embodiment of the present invention further provides a vector database storage and retrieval apparatus, as shown in fig. 21, where the apparatus may include: a vector database creation unit 201, a data acquisition unit 202, and a data retrieval creation unit 203.
A vector database creating unit 201, configured to obtain a table composed of metadata and horizontally divide the table into a plurality of fragments, where each fragment includes multiple copies;
a data acquisition unit 202, configured to acquire data, perform a first mapping according to key data included in the data, namely hashing the key data to obtain an integer and taking the integer modulo the number of fragments to obtain a fragment number, and then perform a second mapping, namely obtaining, according to the fragment number, the plurality of copies corresponding to that fragment number; the fragment number is used to identify a fragment, and each fragment corresponds to one fragment number;
and a data retrieval creating unit 203, configured to create, according to the first mapping and the second mapping, a vector storage engine on the data server for the acquired data, where the vector storage engine is used to perform similarity retrieval on a given vector.
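The two mappings can be sketched as two small functions. The hash function is an assumption: the patent does not name one, so 64-bit FNV-1a is used here purely as an example. `Replica`, `ReplicaTable`, and the function names are likewise hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a hash (the choice of hash function is an assumption).
std::uint64_t HashKey(const std::string& key) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 0x100000001b3ULL;  // FNV prime
    }
    return h;
}

// First mapping: key data -> fragment number (hash, then remainder).
std::uint32_t FragmentNumber(const std::string& key, std::uint32_t fragment_count) {
    assert(fragment_count > 0);
    return static_cast<std::uint32_t>(HashKey(key) % fragment_count);
}

struct Replica {
    std::string server;  // data server hosting this copy's vector storage engine
    bool is_primary;     // true for the main copy
};

// Second mapping: fragment number -> copies of that fragment.
using ReplicaTable = std::vector<std::vector<Replica>>;

// Return the main copy of a fragment, or nullptr if there is none.
const Replica* PrimaryOf(const ReplicaTable& table, std::uint32_t fragment_no) {
    if (fragment_no >= table.size()) return nullptr;
    for (const auto& replica : table[fragment_no]) {
        if (replica.is_primary) return &replica;
    }
    return nullptr;
}
```

Because the fragment number depends only on the key and the number of fragments, every client that knows the replica table routes the same key to the same main copy.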
An embodiment of the present invention further provides an electronic device, as shown in fig. 22, including a processor 51, a communication interface 52, a memory 53 and a communication bus 54, where the processor 51, the communication interface 52 and the memory 53 complete mutual communication through the communication bus 54.
A memory 53 for storing a computer program;
the processor 51 is configured to implement the following steps when executing the program stored in the memory 53:
obtaining a data table and horizontally dividing the data table into a plurality of fragments, wherein each fragment comprises a plurality of copies; acquiring data, performing a first mapping according to key data contained in the data by hashing the key data to obtain an integer and taking the remainder of the integer with respect to the number of fragments to obtain a fragment number, then performing a second mapping, and establishing a metadata server for the acquired data according to the first mapping and the second mapping; wherein the fragment number is used to identify a fragment, each fragment corresponds to a plurality of copies, the plurality of copies comprise a main copy, each copy corresponds to a vector storage engine, and the vector storage engine corresponding to the main copy is obtained according to the fragment number; writing vector data into the obtained main-copy vector storage engine; and performing vector similarity retrieval on the obtained main-copy vector storage engine.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM), and may also include a non-volatile Memory (non-volatile Memory), such as a plurality of disk memories. Alternatively, the memory may be a plurality of memory devices located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a storage medium is further provided, where instructions are stored, and when the instructions are executed on a computer, the instructions cause the computer to execute the vector database storage and retrieval method in any one of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a storage medium or transmitted from one storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for storing and retrieving a vector database, comprising:
obtaining a data table and horizontally dividing the data table into a plurality of fragments, wherein each fragment comprises a plurality of copies;
acquiring data, performing a first mapping according to key data contained in the data by hashing the key data to obtain an integer and taking the remainder of the integer with respect to the number of fragments to obtain a fragment number, then performing a second mapping, and establishing a metadata server for the acquired data according to the first mapping and the second mapping; wherein the fragment number is used to identify a fragment, each fragment corresponds to a plurality of copies, the plurality of copies comprise a main copy, each copy corresponds to a vector storage engine, and the vector storage engine corresponding to the main copy is obtained according to the fragment number;
writing vector data to the obtained main-copy vector storage engine;
and performing vector similarity retrieval on the obtained main-copy vector storage engine.
2. The method of claim 1, wherein the step of obtaining the data table and horizontally slicing the data table into segments comprises:
the metadata server comprises a plurality of data servers, a main server and other auxiliary servers are selected from the data servers based on a Raft protocol, and the main server and the auxiliary servers perform data interaction so as to facilitate consistency of the metadata in the main server and the auxiliary servers.
3. The method for storing and retrieving the vector database according to claim 1, wherein the step of acquiring data, performing the first mapping according to key data included in the data, hashing the key data to obtain an integer, taking the remainder of the integer with respect to the number of fragments to obtain the fragment number, then performing the second mapping, establishing a metadata server for the acquired data according to the first mapping and the second mapping, and obtaining the vector storage engine corresponding to the main copy comprises:
and assigning numbers to the fragments, and matching the obtained integers with the numbers to obtain fragment numbers so as to determine the fragments.
4. The vector database storage and retrieval method of claim 3, wherein said step of writing vector data to said one of said primary-copy vector storage engines based on said obtained one of said primary-copy vector storage engines comprises:
the data further comprises the vector data, each copy corresponding to a vector engine on the data server;
and determining that the data corresponds to a vector storage engine on a specific data server based on vector data and key data included in the data.
5. The method of claim 4, wherein, after the step of writing vector data to the obtained primary-copy vector storage engine, the method further comprises:
and acquiring an index request of the similarity of the vector data, indexing at least one copy included in each fragment, selecting a preset number of vector data with the highest similarity from the indexed copies, then performing similarity sorting on the indexed preset number of vector data, and selecting the preset number of vector data with the highest similarity for output.
6. The method of claim 4, wherein the step of performing vector similarity search based on the obtained primary replica vector storage engine comprises:
according to the acquired data, representing the vector data as a floating-point array, converting the floating-point array into binary data, and storing the binary data in a key-value storage engine, wherein the key-value storage engine is used for indexing the key data;
and if the index data exist, querying the binary data in the key-value storage engine according to the key data, and converting the binary data back into vector data for output.
7. The vector database storage and retrieval method of claim 1, wherein the step of obtaining a data table and horizontally slicing the data table into a plurality of slices, wherein each slice comprises a plurality of copies, the data table comprises:
a memory table, a table disposed in the memory;
a fixed memory table, which is a table placed on the magnetic disk; wherein,
after the number of the memory tables reaches a first preset threshold value, converting the memory tables into the fixed memory tables, establishing a vector storage engine for the vector data in the fixed memory tables, and storing the vector storage engine on a magnetic disk to generate a first index file;
and integrating a plurality of first index files into a second index file when the number of the first index files reaches a second preset threshold value.
8. A vector database storage and retrieval apparatus, the apparatus comprising:
a creating unit, configured to acquire a data table and horizontally divide the data table into a plurality of fragments, wherein each fragment comprises a plurality of copies;
a data acquisition unit, configured to acquire data, perform a first mapping according to key data contained in the data by hashing the key data to obtain an integer and taking the remainder of the integer with respect to the number of fragments to obtain a fragment number, then perform a second mapping, and establish a metadata server for the acquired data according to the first mapping and the second mapping; wherein the fragment number is used to identify a fragment, each fragment corresponds to a plurality of copies, the plurality of copies comprise a main copy, each copy corresponds to a vector storage engine, and the vector storage engine corresponding to the main copy is obtained according to the fragment number;
a vector data writing unit, configured to write vector data to the obtained main-copy vector storage engine;
and a vector similarity retrieval unit, configured to perform vector similarity retrieval on the obtained main-copy vector storage engine.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus; the memory is used for storing a computer program; and the processor is used for implementing the method steps of any one of claims 1 to 7 when executing the program stored in the memory.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111195264.3A 2021-10-10 2021-10-10 Vector database storage and retrieval method Pending CN114168588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111195264.3A CN114168588A (en) 2021-10-10 2021-10-10 Vector database storage and retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111195264.3A CN114168588A (en) 2021-10-10 2021-10-10 Vector database storage and retrieval method

Publications (1)

Publication Number Publication Date
CN114168588A true CN114168588A (en) 2022-03-11

Family

ID=80476893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111195264.3A Pending CN114168588A (en) 2021-10-10 2021-10-10 Vector database storage and retrieval method

Country Status (1)

Country Link
CN (1) CN114168588A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168499A (en) * 2022-09-05 2022-10-11 金蝶软件(中国)有限公司 Database table fragmentation method and device, computer equipment and storage medium
CN115168499B (en) * 2022-09-05 2023-01-03 金蝶软件(中国)有限公司 Database table fragmentation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination