CN114610708A - Vector data processing method and device, electronic equipment and storage medium - Google Patents

Vector data processing method and device, electronic equipment and storage medium

Info

Publication number
CN114610708A
Authority
CN
China
Prior art keywords: data, index, fragment, vector data, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011443500.4A
Other languages
Chinese (zh)
Inventor
王保坡
徐硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202011443500.4A
Publication of CN114610708A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2228 - Indexing structures
    • G06F16/2237 - Vectors, bitmaps or matrices
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24552 - Database cache management

Abstract

Embodiments of the present application disclose a vector data processing method and device, an electronic device, and a storage medium. The vector data processing method may include: converting multidimensional vector data to be stored into one-dimensional ordered data; fragmenting the ordered data to obtain fragment data, and determining a sequence number for the fragment data; and writing the fragment data into a cache memory, so that when the storage state of the cache in the cache memory reaches a trigger condition, the fragment data is written into a target file according to the sequence numbers. In this way, the multidimensional vector data is ordered, the random-write process of data storage is converted into sequential writes, and data writing performance is improved.

Description

Vector data processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of databases, and in particular, to a vector data processing method and apparatus, an electronic device, and a storage medium.
Background
A vector database is a database for storing, retrieving, and analyzing vector data, suited to scenarios that process multidimensional vector data, such as speech, image, and video processing, Geographic Information Systems (GIS), and machine learning. Current vector databases are usually non-relational (NoSQL) databases designed around a Key-Value schema; they focus on data analysis and often struggle to satisfy the requirements of reliable storage and efficient retrieval of large-scale vector data.
Disclosure of Invention
In view of this, embodiments of the present invention provide a vector data processing method and apparatus, an electronic device, and a storage medium.
The technical scheme of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a vector data processing method, including:
converting multidimensional vector data to be stored into one-dimensional ordered data;
fragmenting the ordered data to obtain fragment data, and determining a sequence number for the fragment data;
and writing the fragment data into a cache memory, so that when the storage state of the cache in the cache memory reaches a trigger condition, the fragment data is written into a target file according to the sequence number.
In a second aspect, an embodiment of the present invention provides a vector data processing method, including:
receiving a read request for reading target vector data, the target vector data being data that has been converted into one-dimensional ordered data and stored after fragmentation;
searching the fragment data of the target vector data in a cache memory;
splicing the fragment data to obtain target vector data;
and returning the target vector data.
In a third aspect, an embodiment of the present invention provides a vector data processing apparatus, including:
the conversion module is used for converting the multi-dimensional vector data to be stored into one-dimensional ordered data;
the fragmentation module is used for fragmenting the ordered data to obtain fragmentation data and determining the serial number of the fragmentation data;
and the writing module is used for writing the fragment data into the cache memory so as to write the fragment data into the target file according to the sequence number when the storage state of the cache in the cache memory reaches a trigger condition.
In a fourth aspect, an embodiment of the present invention provides a vector data processing apparatus, including:
a receiving module, configured to receive a read request for reading target vector data, the target vector data being data that has been converted into one-dimensional ordered data and stored after fragmentation;
the searching module is used for searching the fragment data of the target vector data in the cache memory;
the splicing module is used for splicing the fragment data to obtain target vector data;
and the return module is used for returning the target vector data.
In a fifth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes: a processor and a memory for storing a computer program capable of running on the processor;
the processor, when running the computer program, performs the steps of the vector data processing methods described in the foregoing aspects.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions; the computer-executable instructions, when executed by a processor, can implement the vector data processing methods described in the foregoing aspects.
The vector data processing method provided by the invention converts multidimensional vector data to be stored into one-dimensional ordered data; fragments the ordered data to obtain fragment data and determines a sequence number for the fragment data; and writes the fragment data into a cache memory, so that when the storage state of the cache reaches a trigger condition, the fragment data is written into a target file according to the sequence numbers. In this way, the multidimensional data is converted into one-dimensional ordered data and written to the target file in sequence-number order, turning the discontinuous random writes of multidimensional data into sequential writes of ordered data and improving storage efficiency. Writing data in batches according to the trigger condition on the cache's storage state improves write efficiency, and because the multidimensional data is stored under fragment management, the efficiency of data query and reading is improved as well.
Drawings
Fig. 1 is a schematic flowchart of a vector data processing method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a vector data processing method according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a vector data processing method according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a vector data processing method according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of a vector data processing method according to an embodiment of the present invention;
fig. 6 is a schematic flowchart of a vector data processing method according to an embodiment of the present invention;
fig. 7 is a flowchart illustrating a vector data processing method according to an embodiment of the present invention;
fig. 8 is a flowchart illustrating a vector data processing method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a vector data processing apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a vector data processing apparatus according to an embodiment of the present invention;
FIG. 11 is a block diagram of a vector data storage engine according to an embodiment of the present invention;
fig. 12 is a flowchart illustrating a vector data processing method according to an embodiment of the present invention;
fig. 13 is a flowchart illustrating a vector data processing method according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a vector data storage WAL file according to an embodiment of the present invention;
FIG. 15 is a structural diagram of a vector data store VSM file according to an embodiment of the present invention;
FIG. 16 is a structural diagram of a vector data store VSM file according to an embodiment of the present invention;
FIG. 17 is a structural diagram of a vector data store VSM file according to an embodiment of the present invention;
fig. 18 is a flowchart illustrating a vector data processing method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention; all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a particular order; where permitted, the specific order or sequence may be interchanged, so that the embodiments of the invention described herein can be practiced in an order other than that shown or described.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
As shown in fig. 1, an embodiment of the present invention provides a vector data processing method, where the method includes:
s110: converting multidimensional vector data to be stored into one-dimensional ordered data;
s120: fragmenting the ordered data to obtain fragmented data, and determining a sequence number of the fragmented data;
s130: and writing the fragment data into a cache memory so that when the storage state of the cache in the cache memory reaches a trigger condition, the fragment data is written into a target file according to the sequence number.
Here, converting the multidimensional vector data into one-dimensional data may be implemented by a preset search algorithm, including but not limited to: Hilbert space-filling curves, Z-order curves, KD-trees, and the like. The data produced by the preset search algorithm is one-dimensional ordered data whose value is determined by a single factor, and the ordered data has a defined arrangement order; for example, it may be ordered by data value, by the lexicographic order of the data sequence number, and so on. The data is fragmented based on a configurable fragmentation strategy: for example, using an algorithm such as hashing or Round Robin, the ordered data is divided into a number of mutually independent fragments according to the number of physical nodes constituting the database, and the fragments are stored on the physical nodes in a balanced manner. The sequence number is the serial number of each piece of fragment data, determined and allocated by the fragmentation engine; it may be a number that orders the fragment data or a serial number randomly allocated to each fragment, and may consist of natural numbers and/or letters. The target file is the storage space into which the data to be stored is finally written, such as a file on a magnetic disk.
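As an illustration of the conversion and fragmentation just described, the sketch below uses a Z-order (Morton) curve, one of the space-filling mappings named above. The function names and the simple modulo placement are illustrative assumptions, not the patent's configurable fragmentation engine.

```python
def z_order_key(coords, bits=16):
    # Interleave the bits of each coordinate (Morton / Z-order encoding),
    # mapping a multidimensional point onto a single ordered integer key.
    key = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return key


def shard_of(key, num_nodes):
    # Balanced placement of the one-dimensional key across physical nodes.
    return key % num_nodes
```

Nearby points in the multidimensional space tend to receive nearby one-dimensional keys, which is what makes the later sequential, sequence-number-ordered writes possible.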
In the embodiment of the invention, a Vector Segments Merge tree (VSM) storage-engine scheme is provided. Based on a configurable fragmentation engine and a search algorithm, multidimensional vector data is converted into a number of independent one-dimensional fragments, and sequence numbers are allocated. The fragment data is stored temporarily in a cache (Cache) in memory. When the storage state of the Cache reaches a trigger condition, the fragment data in the Cache is written (for example, sequentially in batches) into a target file, such as a VSM file on disk, according to the sequence numbers. In this way, fragment data placed by the hash algorithm can be stored on the best-performing storage nodes, reducing the processing pressure on the database, and because fragment data is held in the cache memory, it can be read back quickly from the cache. Converting multidimensional data into one-dimensional data reduces the discontinuity of data writes, making the vector database applicable to both one-dimensional and multidimensional storage scenarios and greatly improving compatibility and generality. Storing by the sequence numbers of the one-dimensional fragments converts the write process from random writes of the original multidimensional data into sequential writes of one-dimensional data, making the persistence of cached data more efficient and orderly and improving the storage performance of the database.
In some embodiments, as shown in fig. 2, the S130 includes:
s131: writing the fragmented data into a cache memory for use when the data amount cached in the cache memory reaches a data amount threshold; or when the time interval from the last time that the cache memory writes the data into the target file reaches a time length threshold, writing the fragmented data into the target file according to the sequence number.
And when the data volume cached in the cache memory reaches a data volume threshold value, or when the time interval from the last time of writing the data into the target file in the cache memory reaches a time length threshold value, sequentially writing the fragmented data into the target file according to the sequence number.
In the embodiment of the invention, a compaction (Compact) component in the database can execute a customized persistence policy and set trigger conditions for the data stored in the Cache, so as to write the data into the target file in batches. When the data accumulated in the Cache reaches a preset data-amount threshold, and/or the time since the Cache last performed a batch write to the target file reaches a preset duration threshold, the fragment data in the Cache is merge-sorted by sequence number and then written sequentially into the target file.
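The two trigger conditions above can be sketched as follows; the class name, thresholds, flush-callback interface, and injectable clock are all illustrative assumptions rather than the patent's API.

```python
import time


class ShardCache:
    # Sketch of the Cache flush policy: fragment records accumulate in
    # memory and are merge-sorted by sequence number, then handed to a
    # flush callback, when either threshold is reached.
    def __init__(self, flush, max_bytes=4 << 20, max_age_s=600.0,
                 clock=time.monotonic):
        self.flush = flush            # persists one merge-sorted batch
        self.max_bytes = max_bytes    # data-amount threshold
        self.max_age_s = max_age_s    # duration threshold
        self.clock = clock
        self.entries, self.size = [], 0
        self.last_flush = clock()

    def write(self, seq, payload):
        self.entries.append((seq, payload))
        self.size += len(payload)
        if (self.size >= self.max_bytes
                or self.clock() - self.last_flush >= self.max_age_s):
            self.entries.sort(key=lambda e: e[0])  # merge-sort by seq number
            self.flush(self.entries)               # sequential batch write
            self.entries, self.size = [], 0
            self.last_flush = self.clock()
```

The batch handed to `flush` is already in sequence-number order, so the caller can append it to the target file with purely sequential I/O.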
In another embodiment, when a write-in instruction is received from the human-computer interaction interface, merging and sorting the fragment data currently stored in the Cache according to the sequence of the sequence numbers, and then sequentially writing into the target file.
In one embodiment, the S130 may further include: when the number of records for a certain key (key) of the fragment data in the Cache reaches a record-count threshold, writing the fragment data sequentially into the target file according to the sequence numbers. The key may be the attribute on which the fragmentation engine fragments the data, or an attribute of the set of all fragment data currently in the Cache, for example a dimension of the multidimensional vector data.
In this way, by setting thresholds on the cached data amount or the caching time, writing data into the target file is controlled automatically, avoiding the low efficiency and slow responses caused by excessive data accumulating in the cache memory. A custom persistence policy allows flexible control over the database's caching and write process and improves write efficiency.
In some embodiments, as shown in fig. 3, the S130 further includes:
s132: and establishing a data index of the fragment data.
In the embodiment of the invention, after the fragment data is written into the target file, the data index of the fragment data is established for reading. For example, the data index may include the content of the sharded data and the storage location to which the sharded data corresponds.
In one embodiment, since the fragmented data is written into the target file continuously according to the sequence of the sequence numbers, the storage order of the fragmented data in the target file and the order of the data indexes are the same as the sequence of the sequence numbers.
In this way, after the fragmented data is written into the target file in batch, a data index convenient for query and reading can be established for each fragmented data stored in the target file. The storage of the fragment data is ensured to be continuous and orderly, the storage position of the fragment data can be quickly positioned according to the data index when the data is read, and the fragment data can also be positioned and read in batches according to the sequence of the data index.
In some embodiments, as shown in fig. 4, the S132 includes:
s132 a: establishing the first index according to the storage location of the fragmented data and the sequence number, wherein the first index at least comprises: the offset of the fragment data in the target file relative to the start address of the target file and the data volume of the fragment data;
s132 b: establishing the second index according to the storage position of the first index, wherein the second index comprises: offset of the start data in the first index in the target file relative to a start address of the target file.
Here, the data index established in S130 may include: a first index and a second index. In one embodiment, the data index may be comprised of a first index and a second index.
In one embodiment, the start address of the target file is the address at which the target file starts on disk. A target file, such as a VSM file, consists of four parts: a Header identifying version information and reserved words, data blocks (Blocks) recording the data content, an Index, and a Footer. Blocks consists of several consecutive storage blocks, each divided into two parts, a check value (for example, a CRC32 value) and Data. The CRC32 value is used to verify the integrity of the Data, and the Data stores the fragment data, namely the fragment's sequence number and data value. The length of the Data is recorded in the Index part that follows. The first index is recorded in the Index part; it characterizes the storage locations of the fragment data and is arranged by sequence number, and its content at least includes: the offset of the fragment data in the VSM file relative to the file's start address, and the data size of the fragment. Optionally, the first index may further include the maximum and minimum sequence numbers of the fragment data in the VSM file.
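The block layout just described, a CRC32 check value followed by a Data part holding the sequence number and data value, can be sketched as below. The field widths (u32 checksum, u64 sequence number, little-endian) are illustrative assumptions, not the patent's on-disk format.

```python
import struct
import zlib


def encode_block(seq, value):
    # One Blocks entry: CRC32 check value, then the Data part
    # (sequence number + fragment payload).
    data = struct.pack("<Q", seq) + value
    return struct.pack("<I", zlib.crc32(data)) + data


def decode_block(raw):
    # Verify the check value, then split Data back into (seq, value).
    crc = struct.unpack("<I", raw[:4])[0]
    data = raw[4:]
    if zlib.crc32(data) != crc:
        raise ValueError("corrupt block: CRC32 mismatch")
    return struct.unpack("<Q", data[:8])[0], data[8:]
```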
In another embodiment, the first index first sorts the recording order of the fragment data according to the lexicographic order of the key, for example, according to the lexicographic order of the first letter. Alternatively, a key may be one attribute of the multidimensional data. And then sorting according to the sequence number of the corresponding recorded fragment data.
In the embodiment of the present invention, the second index is stored in the Footer part and characterizes the Index part: it records the offset, within the target file, of the starting position of the first index, thereby establishing an indirect index. Optionally, the second index is updated each time the data cached in the Cache is batch-written into the target file.
In this way, by establishing the data index, the storage of fragment data in the target file is more orderly and data reads are clearly layered. Because the sequence numbers are ordered, the positional index can be built from nothing more than offsets relative to the start of the target file, greatly reducing the storage space the data index occupies. When the database holds a large amount of data, queries and reads can proceed through the two levels of the index, greatly improving query efficiency.
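A minimal sketch of the Blocks + Index + Footer layout and its offset-based index follows; the Header is omitted for brevity, and all field widths are assumed for illustration.

```python
import struct


def build_vsm_body(blocks):
    # Lay out Blocks, then the first index (one offset + size entry per
    # block, in sequence order), then a Footer holding the offset at
    # which the first index starts (the second, indirect index).
    body, entries, pos = b"", [], 0
    for b in blocks:
        entries.append((pos, len(b)))
        body += b
        pos += len(b)
    index = b"".join(struct.pack("<QI", off, size) for off, size in entries)
    footer = struct.pack("<Q", len(body))   # where the Index part starts
    return body + index + footer


def read_index(buf):
    # Footer -> start of the first index -> (offset, size) entries.
    index_off = struct.unpack("<Q", buf[-8:])[0]
    return [struct.unpack("<QI", buf[i:i + 12])
            for i in range(index_off, len(buf) - 8, 12)]
```

Because every entry is an offset relative to the file's start, locating any block takes one Footer read plus one index-entry read, matching the two-level lookup described above.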
In some embodiments, as shown in fig. 5, the method further comprises:
s140: and writing the fragmented data into a WAL file of a pre-written log according to the sequence number, wherein the WAL file is used for recovering fragmented data which is not written into the target file in the cache memory.
In one embodiment, a Write-Ahead Logging (WAL) file is used to restore fragmented data in the cache that has not yet been written to the target file when the storage system is restarted.
In the embodiment of the invention, after the fragment data is stored in the Cache, it is synchronously stored in the WAL file in sequence-number order, ensuring that the WAL file's contents match the Cache; its function is to persist the data held in the Cache. Within the database, fragment data stored on the same physical node shares one WAL file. If an error occurs while batch-writing the Cache's data into the target file, or the storage system restarts due to some other fault, the data in the Cache that had not yet been written into the target file can still be recovered from the WAL file.
In one embodiment, the database manages the physical nodes through a Shard Daemon (SD), and the Raft protocol is used between SDs to back up the VSM files and WAL files. After the storage system restarts, the second index in memory is initialized by reading the latest VSM file, and the latest WAL file is loaded to initialize the Cache data, thereby recovering the data that had not been written into the VSM file.
Therefore, the data cached in the cache memory is backed up through the setting of the WAL file, the negative influence caused by system restart or accidental interruption of data writing is reduced, and the consistency and disaster tolerance of the data are kept. And because the fragment data is inserted into the WAL file according to the sequence of the sequence numbers, the efficiency of writing the fragment data in batches is greatly improved.
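The WAL append-and-replay cycle described above can be sketched as below; the record framing (u64 sequence number, u32 length, little-endian) is an illustrative assumption.

```python
import os
import struct


def wal_append(path, seq, payload):
    # Append one fragment record in sequence-number order; fsync so the
    # record survives a restart before the write is acknowledged.
    with open(path, "ab") as f:
        f.write(struct.pack("<QI", seq, len(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())


def wal_replay(path):
    # Rebuild the Cache contents that never reached the VSM file.
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(12)
            if len(header) < 12:
                break
            seq, size = struct.unpack("<QI", header)
            records.append((seq, f.read(size)))
    return records
```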
As shown in fig. 6, an embodiment of the present invention provides a data reading method, where the method includes:
s210: receiving a read request for reading target vector data; the target vector data includes: converting the data into one-dimensional ordered data, and storing the data after fragmentation;
s220: searching the fragment data of the target vector data in a cache memory;
s230: splicing the fragment data to obtain the target vector data;
s240: and returning the target vector data.
The data reading method provided by the embodiment of the invention can be used for reading the data stored in the vector data processing method provided by the embodiment.
In the embodiment of the invention, for a received read request, the cache space of the cache memory is first used to quickly search the data stored there. For example, the attribute information and/or sequence numbers of the target data's fragments can be derived from the fragmentation strategy the fragmentation engine used when storing the data and looked up in the cache. If the fragment data of the target vector data exists in memory, the fragments are spliced to obtain the target data to be read, and the target data is returned.
Therefore, the fragment data which is stored in the Cache and is not written into the target file in the disk can be directly inquired and read, and the target data can be acquired more quickly compared with the target file with large storage capacity due to the small storage capacity and high reading and writing speed of the Cache. Unnecessary access to the target file is suppressed, data reading efficiency is improved, and query reading resources can be saved.
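The cache-first read path can be sketched as below; `read_vector`, the dictionary cache, and the `load_from_file` callback are illustrative stand-ins for the Cache lookup and the on-disk index lookup.

```python
def read_vector(shard_ids, cache, load_from_file):
    # Check the Cache first for each fragment; only fall back to the
    # on-disk target file (via its data index) on a miss, then splice
    # the fragments back into the target vector.
    parts = []
    for sid in shard_ids:
        part = cache.get(sid)
        if part is None:
            part = load_from_file(sid)
        parts.append(part)
    return b"".join(parts)
```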
In some embodiments, as shown in fig. 7, the data reading method further includes:
s250: and if the fragment data is not found in the cache memory, searching the fragment data in the target file according to a data index.
In the embodiment of the invention, when the cache does not contain the fragment data of the target data, the target file is searched by reading the data index. Because the data index stores the correspondence between a fragment's storage-location information, sequence number, and data key, the sequence numbers and/or attribute information of the corresponding fragments can be obtained from the fragmentation strategy used when the data was written; the data index is then consulted and the fragment data queried level by level.
In one embodiment, if only part of the fragmented data can be found in the cache, the target file is accessed based on the attribute information and/or the sequence number, and the rest of the fragmented data is searched through the data index.
Therefore, data can be inquired and read in the target file of the large-capacity storage space, the inquiry and the reading of the data position can be carried out step by step based on the two-stage index established during data storage, and the efficiency and the performance of data reading are greatly improved.
In some embodiments, as shown in fig. 8, the S250 includes:
s251: performing binary search according to a second index to obtain a storage range of the fragment data;
s252: and performing binary search according to the first index to obtain the storage position of the fragment data.
In the embodiment of the invention, first, according to the second index, the keyword (Key) corresponding to the first index is located by binary search, and lexicographic comparison locates the storage range of the corresponding Blocks in the target file. Because the blocks storing the fragment data are ordered by sequence number, the corresponding block is then located by binary search on the first index using the sequence number, and the required fragment data can be read quickly from the target file.
In this way, the storage location of fragment data is found by hierarchical lookup over the two-level index, combined with binary search, so the storage information of the required fragments can be obtained quickly even in a large database. Determining the storage range by lexicographic order and the storage position by sequence number avoids the inconvenience of querying storage locations directly in the database and greatly improves the efficiency of querying and reading data.
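The two-stage binary search can be sketched with Python's `bisect` over simplified in-memory stand-ins for the two index levels; the data structures and names are illustrative assumptions.

```python
import bisect


def locate_fragment(footer_keys, first_index, key, seq):
    # footer_keys: second-level index, keys in lexicographic order.
    # first_index: per key, parallel sequence-sorted lists of sequence
    # numbers and block offsets (the first-level index).
    k = bisect.bisect_left(footer_keys, key)   # stage 1: find the key's Blocks range
    if k == len(footer_keys) or footer_keys[k] != key:
        return None
    seqs, offsets = first_index[key]
    i = bisect.bisect_left(seqs, seq)          # stage 2: find the block by sequence number
    if i == len(seqs) or seqs[i] != seq:
        return None
    return offsets[i]
```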
As shown in fig. 9, an embodiment of the present invention provides a vector data processing apparatus, including:
a conversion module 110, configured to convert multidimensional vector data to be stored into one-dimensional ordered data;
the fragmentation module 120 is configured to fragment the ordered data to obtain fragmentation data, and determine a sequence number of the fragmentation data;
a writing module 130, configured to write the fragmented data into a cache memory, so that when a storage state of a cache in the cache memory reaches a trigger condition, the fragmented data is written into a target file according to the sequence number.
In some embodiments, the apparatus further comprises:
and the indexing module 140 is configured to establish a data index of the fragment data.
In some embodiments, the indexing module 140 is specifically configured to:
establishing a first index according to the storage location and sequence number of the fragment data, wherein the first index at least comprises: the offset of the fragment data in the target file relative to the start address of the target file, and the data size of the fragment data;
establishing a second index according to the storage location of the first index, wherein the second index comprises: the offset, relative to the start address of the target file, of the starting entry of the first index in the target file.
In some embodiments, the writing module 130 is further configured to:
and writing the fragmented data into a WAL file of a pre-written log according to the sequence number, wherein the WAL file is used for recovering fragmented data which is not written into the target file in the cache memory.
As shown in fig. 10, an embodiment of the present invention provides a vector data processing apparatus, including: a receiving module 210, configured to receive a read request for reading target vector data, the target vector data being data that has been converted into one-dimensional ordered data and stored after fragmentation;
a lookup module 220, configured to lookup the fragmented data of the target vector data in a cache memory;
a splicing module 230, configured to splice the sliced data to obtain the target vector data;
a returning module 240, configured to return the target vector data.
In some embodiments, the lookup module 220 is further configured to:
if the fragment data is not found in the cache memory, search for the fragment data in the target file according to a data index.
In some embodiments, the search module 220 is specifically configured to:
performing binary search according to a second index to obtain a storage range of the fragment data;
and performing binary search according to the first index to obtain the storage position of the fragment data.
One specific example is provided below in connection with any of the embodiments described above:
the storage engine of the present example is based on the Log-Structured Merge Tree (LSM) concept, and the core of the storage engine is to cache a part of the latest write requests in the memory, and Merge and add memory data to the disk in a Merge and sort manner when a trigger condition is reached. The purpose of this is to convert a large number of random writes into sequential writes, thereby improving the performance of data writing. Conventional LSMs are only suitable for storing data with ordered key values and writing data larger than read data, or read operations are usually data with continuous key values, and thus cannot be directly applied to multidimensional vector data. This example proposes a vsm (vector Segments targeted tree) storage Engine scheme, which introduces a Sharding Engine (shading Engine) for vector data, and a conventional DataBase (DataBase) usually partitions data for single or several attributes in records, and cannot meet the management requirement for multidimensional vector data. The present example is based on Shared-Nothing architecture design, and firstly, a data set is partitioned for multidimensional vector data based on a specific algorithm as a spatial partitioning strategy for the vector data set, so as to establish a data slice.
The overall VSM storage engine (VSM Engine) scheme is shown in FIG. 11. The Sharding Engine comprises a Vector Converter, a Hilbert space-filling Curve Session Protocol (HCSDP), a Z-Curve Session Protocol (ZCSDP), and a KD-Tree (K-Dimensional Tree). Vector data is first partitioned into multiple shards (Shard) by a system-configurable Sharding Strategy. Each Shard has its own Cache, VSM File, and Compactor; the physical node corresponding to a Shard is mapped to a partition area (Region), and the multiple Shards of the same Region share one WAL file. Data in the Cache is recorded through the WAL file, ensuring that data not yet written into a VSM file can be recovered from the WAL file when the system restarts.
The specific sharding engine flow is shown in fig. 12. For a read/write request on vector data, a Vector Search Algorithm is called first; these algorithms include the Hilbert space-filling curve, the Z-curve, the KD-tree, and so on. A Sequence Code is calculated by the algorithm, compressing the multidimensional vector data into one-dimensional ordered data. The multidimensional data space is then divided into multiple independent Shards through a configurable Sharding Strategy, for example Round Robin or another hash algorithm, and a spatially continuous index for the vector data is established based on the Shard data storage location (Shard Location) and the sequence number obtained from the sharding computation. A Write Flow or Read Flow is then established according to the Index Location.
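The Z-curve mentioned above is one way to compute such a sequence code: interleaving the bits of the coordinates maps a multidimensional point onto a one-dimensional ordered value. A minimal two-dimensional sketch (the function name and fixed bit width are assumptions; Hilbert curves and KD-trees are the named alternatives):

```python
def z_curve_sequence(x: int, y: int, bits: int = 16) -> int:
    """Compute a Z-curve (Morton) sequence code for a 2-D point by
    interleaving the bits of x and y, so nearby points tend to receive
    nearby one-dimensional codes."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x supplies the even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y supplies the odd bit positions
    return code
```

Because the resulting codes are ordered, they can serve directly as the sequence numbers by which fragments are sorted and later binary-searched.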
For a data write request, after the index information of a vector is calculated, the specific write flow is shown in fig. 13. Data is first written into the Cache and the WAL file, respectively. The Cache is a simple in-memory map structure in which a key represents an attribute of a data table (table) and each entry is, in effect, an array storing the actual values ordered by the sequence number calculated by the sharding engine. A Compactor component running in the background triggers a persistence operation according to a user-defined Persistence Policy (for example, when the number of records for a key reaches a set upper limit, when the Cache size reaches a set upper limit, or when a configured time interval elapses), merge-sorting (Merge Sort) the data in the Cache and then flushing it to a VSM file.
The contents of the WAL file are the same as those of the Cache; its design format is shown in fig. 14. Its function is to persist the data held in the Cache, and since data is inserted into the WAL file in index order, bulk writes of the data fragments managed by the same Shard are very efficient.
The VSM file is used to store library table data. The format of a single VSM file is shown in fig. 15 and comprises four parts: Header, Blocks, Index, and Footer. The Header part identifies version information and reserved words, and Blocks records the data contents.
As shown in fig. 16, Blocks consists of several consecutive Blocks, each divided into two parts, a CRC32 value and Data, where the CRC32 value is used to check whether the Data content is corrupted. The length of the Data is recorded in the Index section that follows. The Data content contains the sequence number and the actual Value. Because the sequence number is an ordered value, the corresponding Block can be found quickly by recording its offset information as an index, which facilitates data read operations.
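A Block of this shape (CRC32 value followed by Data, where Data is the sequence number plus the actual Value) can be packed and verified as below; the exact field widths and byte order are assumptions, since the patent does not specify them:

```python
import struct
import zlib

def encode_block(seq: int, value: bytes) -> bytes:
    """Pack one Block: CRC32 over Data, then Data = sequence number + Value.
    Field widths (4-byte CRC, 8-byte big-endian sequence number) are assumed."""
    data = struct.pack(">Q", seq) + value
    return struct.pack(">I", zlib.crc32(data)) + data

def decode_block(block: bytes):
    """Verify the CRC32 and split Data back into (sequence number, Value)."""
    crc, = struct.unpack(">I", block[:4])
    data = block[4:]
    if zlib.crc32(data) != crc:
        raise ValueError("Block data corrupted: CRC32 mismatch")
    seq, = struct.unpack(">Q", data[:8])
    return seq, data[8:]
```

The CRC check on read is what lets the engine detect a torn or corrupted Block before trusting its contents.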
The data structure of the Index is shown in fig. 17. To locate a given Block record quickly, two levels of indexing are actually used: index records are sorted first by the lexicographic order of the Key and then by the sequence numbers of the corresponding records. The Key Length (Key Len) field specifies the length of the adjacent Key field; the Key identifies an attribute of a table; the Type field identifies the type of the Data in a Block; and Count identifies the number of index entries the record contains. The following four fields are per-Block index information, repeated Count times and sorted by sequence number: the Minimum Sequence Number (Min Seq) field identifies the smallest sequence number of the data in a Block, the Maximum Sequence Number (Max Seq) field identifies the largest, the Offset field identifies the offset of the corresponding Block within the entire VSM file, and the Size field records the size of the Block, so that the data of a Block can be read quickly from its Offset and Size fields. For more efficient queries, the Footer at the tail of the VSM file stores the offset of the start of the Index part within the VSM file, so that the index information can be conveniently loaded into memory to establish an indirect index.
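The per-Block index entries described above (Min Seq, Max Seq, Offset, Size, sorted by sequence number) support a binary search for the Block covering a given sequence number; a small sketch with assumed names:

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class BlockIndex:
    min_seq: int  # smallest sequence number of the data in the Block (Min Seq)
    max_seq: int  # largest sequence number of the data in the Block (Max Seq)
    offset: int   # Block's offset within the entire VSM file (Offset)
    size: int     # Block size, enabling a single ranged read (Size)

def locate_block(block_indexes, seq):
    """Binary-search entries sorted by Min Seq for the Block covering `seq`;
    returns None if no Block's [min_seq, max_seq] range contains it."""
    mins = [b.min_seq for b in block_indexes]
    i = bisect_right(mins, seq) - 1
    if i >= 0 and block_indexes[i].min_seq <= seq <= block_indexes[i].max_seq:
        return block_indexes[i]
    return None
```

Once an entry is found, its Offset and Size fields give exactly the byte range to read from the VSM file.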
The specific flow for a data read request is shown in fig. 18. At initialization, the indirect index is built by reading the VSM file and loading the generated Index Offsets into memory; these offsets must be updated (Update Offsets) each time the Compactor persists a VSM file. Each read request first searches the Cache; if the requested data exists in memory, it is returned directly, otherwise the lookup goes through the indirect index. Because the Index Offsets record the offset information of each Key in the Index table of the VSM file, a Binary Search first obtains the Key corresponding to the index, and a comparison based on lexicographic order then locates the storage range of the corresponding Blocks in the VSM file.
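The read path just described (Cache first, then a binary search over the key-sorted Index Offsets, then a scan of that key's Block ranges) can be sketched as one function; `load_index` and `load_block` are hypothetical placeholders standing in for the actual VSM file reads:

```python
from bisect import bisect_left

def read(key, seq, cache, index_offsets, load_index, load_block):
    """Read-path sketch.  `cache` maps key -> {sequence_number: value};
    `index_offsets` is a list of (key, offset) pairs sorted lexicographically,
    as loaded from the VSM Footer; `load_index(offset)` yields
    (min_seq, max_seq, block_offset) tuples for that key."""
    entries = cache.get(key, {})
    if seq in entries:
        return entries[seq]                 # hit in memory, return directly
    keys = [k for k, _ in index_offsets]
    i = bisect_left(keys, key)              # binary search on lexicographic key order
    if i == len(keys) or keys[i] != key:
        return None                         # key has no on-disk index entry
    for min_seq, max_seq, block_offset in load_index(index_offsets[i][1]):
        if min_seq <= seq <= max_seq:       # locate the Blocks storage range
            return load_block(block_offset, seq)
    return None
```

A production engine would read and decode real Index and Block bytes where the two loader callbacks appear here.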
The distributed nature of the vector database is implemented through the Raft consensus protocol. First, data shards are evenly hashed onto the physical nodes that form the database using a Round Robin algorithm. The physical node corresponding to a data shard is mapped to a Region, managed by a Shard Daemon (SD). Each SD manages multiple Regions, and the Raft protocol is used among SDs to back up the VSM and WAL files, maintaining data consistency and disaster tolerance. Replicas are managed at the granularity of Regions: multiple Regions on different nodes form a Raft Group and serve as replicas of one another, enabling migration of the Raft Group Leader. When an SD recovers service, it initializes the index offsets in memory by reading the latest VSM file and loads the latest WAL file to complete the initialization of the Cache data.
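The Round Robin placement of shards onto physical nodes amounts to a modulo assignment; a one-line sketch (function and argument names are assumptions):

```python
def assign_shards(num_shards: int, nodes: list) -> dict:
    """Round-robin sketch: shard i lands on node i mod len(nodes), so shards
    (and the Regions they map to) are spread evenly across physical nodes."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}
```

Within each resulting Region, replication and leader election would then be handled by the Raft group, which this sketch does not model.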
An embodiment of the present invention further provides an electronic device, where the electronic device includes: a processor and a memory for storing a computer program capable of running on the processor, the computer program when executed by the processor performing the steps of one or more of the methods described above.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method of one or more of the foregoing technical solutions.
The computer storage media provided by the present embodiments may be non-transitory storage media.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative; for example, the division into units is only a logical functional division, and other divisions are possible in actual implementation, such as: multiple units or components may be combined, or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through certain interfaces, or an indirect coupling or communication connection between devices or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing module, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in hardware, or in a combination of hardware and software functional units.
In some cases, any two of the above technical features may be combined into a new method solution without conflict.
In some cases, any two of the above technical features may be combined into a new device solution without conflict.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable Memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A method of vector data processing, the method comprising:
converting multidimensional vector data to be stored into one-dimensional ordered data;
fragmenting the ordered data to obtain fragmented data, and determining a sequence number of the fragmented data;
and writing the fragment data into a cache memory so that when the storage state of the cache in the cache memory reaches a trigger condition, the fragment data is written into a target file according to the sequence number.
2. The method of claim 1, wherein the storage state of the cache memory reaching a trigger condition comprises:
the amount of data cached in the cache memory reaches a data amount threshold;
or a time interval from the last time the cache written data to the target file reaches a duration threshold.
3. The method of claim 1, wherein after writing the fragment data into the target file according to the sequence number, the method further comprises:
and establishing a data index of the fragment data.
4. The method of claim 3, wherein the data index comprises a first index and a second index;
the establishing of the data index of the fragmented data includes:
establishing the first index according to the storage location of the fragmented data and the sequence number, wherein the first index at least comprises: the offset of the fragment data in the target file relative to the start address of the target file and the data volume of the fragment data;
establishing the second index according to the storage position of the first index, wherein the second index comprises: offset of the start data in the first index in the target file relative to a start address of the target file.
5. The method of claim 1, further comprising:
and writing the fragmented data into a WAL file of a pre-written log according to the sequence number, wherein the WAL file is used for recovering fragmented data which is not written into the target file in the cache memory.
6. A method of vector data processing, the method comprising:
receiving a read request for reading target vector data, wherein the target vector data is data that has been converted into one-dimensional ordered data and stored after fragmentation;
searching the fragment data of the target vector data in a cache memory;
splicing the fragment data to obtain the target vector data;
and returning the target vector data.
7. The method of claim 6, further comprising:
and if the fragment data is not found in the cache memory, searching the fragment data in the target file according to a data index.
8. The method of claim 7, wherein the method comprises:
performing binary search according to a second index to obtain a storage range of the fragment data;
and performing binary search according to the first index to obtain the storage position of the fragment data.
9. A vector data processing apparatus, characterized in that the apparatus comprises:
the conversion module is used for converting the multi-dimensional vector data to be stored into one-dimensional ordered data;
the fragmentation module is used for fragmenting the ordered data to obtain fragmentation data and determining a serial number of the fragmentation data;
and the writing module is used for writing the fragment data into a cache memory so as to write the fragment data into a target file according to the sequence number when the storage state of the cache in the cache memory reaches a trigger condition.
10. A vector data processing apparatus, characterized in that the apparatus comprises:
a receiving module, configured to receive a read request for reading target vector data, wherein the target vector data is data that has been converted into one-dimensional ordered data and stored after fragmentation;
the searching module is used for searching the fragment data of the target vector data in a cache memory;
the splicing module is used for splicing the fragment data to obtain the target vector data;
and the return module is used for returning the target vector data.
11. An electronic device, characterized in that the electronic device comprises: a processor and a memory for storing a computer program capable of running on the processor; wherein the content of the first and second substances,
the processor, when executing the computer program, performs the steps of the vector data processing method of any of claims 1 to 8.
12. A computer-readable storage medium having computer-executable instructions stored thereon; the computer-executable instructions, when executed by a processor, are capable of implementing the vector data processing method of any of claims 1 to 8.
CN202011443500.4A 2020-12-08 2020-12-08 Vector data processing method and device, electronic equipment and storage medium Pending CN114610708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011443500.4A CN114610708A (en) 2020-12-08 2020-12-08 Vector data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011443500.4A CN114610708A (en) 2020-12-08 2020-12-08 Vector data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114610708A true CN114610708A (en) 2022-06-10

Family

ID=81856921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011443500.4A Pending CN114610708A (en) 2020-12-08 2020-12-08 Vector data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114610708A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115576947A (en) * 2022-10-19 2023-01-06 北京力控元通科技有限公司 Data management method and device, combined library, electronic equipment and storage medium
CN117149914A (en) * 2023-10-27 2023-12-01 成都优卡数信信息科技有限公司 Storage method based on ClickHouse
CN117149914B (en) * 2023-10-27 2024-01-26 成都优卡数信信息科技有限公司 Storage method based on ClickHouse
CN117527529A (en) * 2024-01-05 2024-02-06 平湖科谱激光科技有限公司 Ethernet data storage method and device capable of automatically recovering from normal state
CN117527529B (en) * 2024-01-05 2024-03-19 平湖科谱激光科技有限公司 Ethernet data storage method and device capable of automatically recovering from normal state

Similar Documents

Publication Publication Date Title
US8290972B1 (en) System and method for storing and accessing data using a plurality of probabilistic data structures
US8099421B2 (en) File system, and method for storing and searching for file by the same
US7418544B2 (en) Method and system for log structured relational database objects
CN114610708A (en) Vector data processing method and device, electronic equipment and storage medium
US9594674B1 (en) Method and system for garbage collection of data storage systems using live segment records
US9715505B1 (en) Method and system for maintaining persistent live segment records for garbage collection
CN107368527B (en) Multi-attribute index method based on data stream
US9535940B2 (en) Intra-block partitioning for database management
CN106980665B (en) Data dictionary implementation method and device and data dictionary management system
CN108614837B (en) File storage and retrieval method and device
CN106682110B (en) Image file storage and management system and method based on Hash grid index
CN110888837B (en) Object storage small file merging method and device
US11947826B2 (en) Method for accelerating image storing and retrieving differential latency storage devices based on access rates
CN114936188A (en) Data processing method and device, electronic equipment and storage medium
CN114168540A (en) File index information processing method and device, electronic equipment and storage medium
CN114780530A (en) Time sequence data storage method and system based on LSM tree key value separation
CN113901279B (en) Graph database retrieval method and device
KR101806394B1 (en) A data processing method having a structure of the cache index specified to the transaction in a mobile environment dbms
CN116431726A (en) Graph data processing method, device, equipment and computer storage medium
CN112416879B (en) NTFS file system-based block-level data deduplication method
US8156126B2 (en) Method for the allocation of data on physical media by a file system that eliminates duplicate data
CN116450656B (en) Data processing method, device, equipment and storage medium
CN107133334B (en) Data synchronization method based on high-bandwidth storage system
CN109213760A (en) The storage of high load business and search method of non-relation data storage
CN115794861A (en) Offline data query multiplexing method based on feature abstract and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination