CN117667932A - Method and system for customizing and converting storage format of vector database - Google Patents
- Publication number
- CN117667932A (application number CN202311635852.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to the technical field of computer science and discloses a method and a system for customizing and converting the storage format of a vector database. The method comprises the following steps: S1, defining the characteristics of the stored vector data: the vector is defined by means of a document or a data dictionary that describes the meaning and unit of each element, and the metadata comprises the name, unit, and data type of the vector; S2, determining the usage scenario and the expected data access pattern, including the frequency of read and write operations and the size and complexity of the data; S3, selecting a storage format; S4, converting existing data from the original format to a new format; S5, optimizing the conversion process to ensure efficient processing of large-scale data sets. In the invention, batch processing reduces the number of interactions with the database or file system, improving efficiency, while indexing and partitioning accelerate data retrieval and update operations, improving query performance.
Description
Technical Field
The invention relates to the technical field of computer science, in particular to a method and a system for customizing and converting a storage format of a vector database.
Background
A vector database is a database system designed for the efficient storage, retrieval, and processing of vector data. Vector data are ordered collections of numbers, common in machine learning, data mining, and information retrieval; each vector represents the features of an entity such as a document, image, or audio clip. The goal of a vector database is to provide efficient query and analysis operations that support a variety of application scenarios.
With the development of machine learning and deep learning, the demand for storing and processing large-scale vector data is growing steadily. Traditional database systems are designed for structured data rather than high-dimensional vector data, so they cannot process this type of data effectively, and the resulting analysis is inefficient.
Disclosure of Invention
To remedy these shortcomings, the invention provides a method and a system for customizing and converting the storage format of a vector database, aiming to solve the problems that the demand for storing and processing large-scale vector data is growing steadily while traditional database systems cannot process this type of data effectively or analyze it efficiently.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a method for customizing and converting a storage format of a vector database, comprising the steps of:
s1, defining characteristics of stored vector data
Defining a vector by means of a document or a data dictionary, describing the meaning and units of each element, wherein the metadata comprises the name, units and data type information of the vector;
s2, determining a usage scenario and an expected data access mode
The data access patterns are determined, including the frequency of read and write operations, the size and complexity of the data.
S3, selecting a storage format
A storage format is selected according to the structure and characteristics of the data.
S4, converting the existing data from the original format to a new format
Tools such as OpenRefine and Trifacta Wrangler provide an intuitive interface for cleaning and converting data without writing code; for more complex conversion tasks, the data are processed and format-converted in Python or Java, using the Pandas library and the Apache Spark data processing framework;
s5, optimizing the conversion process to ensure efficient processing of large-scale data sets
Large-scale data are processed in parallel on a cluster with the distributed computing framework Apache Spark, making full use of the performance of multi-core processors.
As a further description of the above technical solution:
s101, determining basic information of vectors
Giving a clear name to the vector, reflecting the meaning represented by the name, and determining the number of elements in the vector;
s102, adding metadata for each element
Giving each element a descriptive name indicating its meaning, designating the unit of each element, defining the data type of each element;
s103, creating a data dictionary or a document
Listing each element and its associated metadata using a table, creating a document, describing the metadata of the vector in terms of paragraphs or chapters;
s104, adding information
Defining a reasonable value range for each element to facilitate data verification.
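Steps S101–S104 can be sketched as a small data dictionary in Python. The schema fields, the example element names, and the `validate` helper below are illustrative assumptions, not part of the patent text:

```python
# Illustrative sketch of S101-S104: a data dictionary describing a stored
# vector. All field names and the example vector are assumptions.
vector_schema = {
    "name": "product_embedding",          # S101: clear, meaningful vector name
    "dimension": 3,                       # S101: number of elements
    "elements": [                         # S102: per-element metadata
        {"name": "price_score", "unit": "normalized", "dtype": "float32", "range": (0.0, 1.0)},
        {"name": "click_rate",  "unit": "ratio",      "dtype": "float32", "range": (0.0, 1.0)},
        {"name": "stock_level", "unit": "items",      "dtype": "int32",   "range": (0, 10000)},
    ],
}

def validate(vector, schema):
    """S104: check the element count and per-element value ranges."""
    if len(vector) != schema["dimension"]:
        return False
    return all(lo <= v <= hi
               for v, (lo, hi) in zip(vector, (e["range"] for e in schema["elements"])))
```

In practice the same dictionary can be kept as a table in a design document (S103); the point is that both humans and validation code read one shared definition.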
As a further description of the above technical solution:
s201, determining whether data access is mainly read or write operation, or a combination thereof;
s202, determining whether real-time data access is needed or whether the requirements can be met by batch processing;
s203, analyzing the complexity of the data query, and knowing whether the complex query operation needs to be supported.
As a further description of the above technical solution:
s301, selecting a traditional relational database storage format for data in a table form;
s302, selecting a NoSQL database for semi-structured or unstructured data;
s303, selecting a key-value store when the data model consists of simple key-value pairs;
s304, selecting a column-family database when the data exist as column clusters and high scalability is needed.
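The S301–S304 decision list amounts to a simple dispatch on data characteristics. The boolean flags and return labels in this sketch are assumptions chosen for illustration:

```python
def choose_storage_format(tabular, simple_key_value, columnar):
    """Illustrative mapping of the S301-S304 decision list."""
    if tabular:
        return "relational"        # S301: tabular data -> traditional RDBMS format
    if simple_key_value:
        return "key-value"         # S303: simple key-value data model
    if columnar:
        return "column-family"     # S304: column clusters, high scalability
    return "nosql-document"        # S302: semi-structured or unstructured data
```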
As a further description of the above technical solution:
s401, analyzing the format and structure of the original data, and knowing information, fields and relations contained in the data;
s402, determining a new format to be converted, and defining a data structure and a specification in the new format;
s403, processing missing values, abnormal values and errors in the original data, and cleaning the data to ensure that the data meets the requirements of a target format;
s404, establishing a mapping relation between original data and a target format, and formulating a conversion rule, wherein the conversion rule comprises data type conversion and field renaming;
s405, selecting an ETL tool or writing a script or a program to perform conversion according to a conversion rule;
s406, before data format conversion, ensuring backup of the original data so as to prevent accidents;
s407, executing data format conversion by using the selected tool or script, and monitoring the conversion process to ensure that no error or abnormality occurs;
s408, verifying whether the converted data accords with the specification of the target format, and testing the data, including testing partial data and whole data;
s409, processing abnormal conditions in any conversion, and ensuring the integrity and accuracy of data;
s410, recording data format conversion steps and rules, and creating a document, wherein the document comprises data mapping, conversion rules and test results.
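The core of the S404–S408 workflow — a mapping table of conversion rules, execution, and verification against the target format — can be sketched with the standard library. The field names and rules below are illustrative assumptions; a real ETL tool or Pandas script plays the same role:

```python
import csv
import io

# S404: conversion rules mapping each old field to (new field, type cast).
RULES = {
    "id":    ("vector_id", int),
    "score": ("weight", float),
}

def convert(csv_text):
    """S405/S407: execute the conversion with a simple script."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{new: cast(row[old]) for old, (new, cast) in RULES.items()}
            for row in rows]

def verify(records):
    """S408: check every converted record against the target specification."""
    return all(isinstance(r["vector_id"], int) and isinstance(r["weight"], float)
               for r in records)
```

S406 (backing up the original data before conversion) and S410 (documenting the mapping, rules, and test results) wrap around this core but are omitted from the sketch.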
As a further description of the above technical solution:
s4091, filling missing values of numerical features with statistics such as the mean, median, or mode, and filling missing values of categorical features with the most frequent category; handling missing values in time series data by interpolation; and, when the number of missing values is small, deleting the affected samples or the related features;
s4092, cleaning according to the data specification: removing duplicate records and repairing erroneous values; unifying inconsistent data into a specific format to ensure consistency; and correcting inconsistent data according to logical rules or default values.
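The S4091 filling strategies can be shown with the standard library; in practice Pandas' `fillna()` and `interpolate()` apply the same ideas to whole DataFrames. The example columns used here are assumptions:

```python
from statistics import mean, median, mode

def fill_numeric(values, strategy="mean"):
    """S4091: fill None in a numeric column with the mean, median, or mode."""
    present = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
    return [fill if v is None else v for v in values]

def fill_categorical(values):
    """S4091: fill None in a categorical column with the most frequent category."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def interpolate_single_gaps(series):
    """S4091: fill isolated interior gaps in a time series with the midpoint
    of the neighbouring values (a minimal form of linear interpolation)."""
    out = list(series)
    for i in range(1, len(out) - 1):
        if out[i] is None and out[i - 1] is not None and out[i + 1] is not None:
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out
```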
As a further description of the above technical solution:
s501, dividing a large-scale data set into small blocks by using a parallel processing technology, and processing the small blocks simultaneously, wherein Apache Spark is used for improving the conversion speed;
s502, adopting batch processing to reduce interaction times of a database or a file system and improving efficiency;
s503, through indexing and partitioning, data retrieval and updating operation are accelerated, and query performance is improved;
s504, loading the data into a memory for processing so as to avoid frequent disk access;
s505, using a cache to store intermediate results, avoiding repeated computation;
s506, dividing the conversion process into a plurality of stages, gradually processing data, better controlling the flow, and reducing the processing pressure of each stage;
and S507, using data compression in the transmission and storage stages to reduce the data volume and improve efficiency.
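The chunking-plus-batching idea of S501–S502 can be shown with a simplified single-machine stand-in; in production the same pattern runs distributed on Apache Spark. The class and function names are illustrative, and the interaction counter exists only to make the saving visible:

```python
def chunked(data, size):
    """S501: divide a large dataset into fixed-size blocks."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

class BatchStore:
    """S502: count store interactions to show the saving from batching."""
    def __init__(self):
        self.records = []
        self.interactions = 0

    def write_batch(self, batch):
        self.records.extend(batch)
        self.interactions += 1            # one interaction per batch, not per record

def convert_in_batches(data, transform, batch_size):
    store = BatchStore()
    for block in chunked(data, batch_size):
        store.write_batch([transform(x) for x in block])
    return store
```

Writing 10 records with a batch size of 4 costs three store interactions instead of ten, which is the efficiency gain S502 describes.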
As a further description of the above technical solution:
the system for customizing and converting the storage format of the vector database comprises:
the data operation module is used for allowing a user to define and customize the storage format of the vector database, providing the capability of describing the vector and the structure of the characteristics thereof in a declarative manner, and defining indexes, compression methods and other storage related parameters;
the metadata management module is used for establishing a metadata management system, recording and maintaining metadata information in a database storage format, including table structures, indexes and data partition information, and allowing a user to query and modify metadata so as to adapt to different service requirements;
the format conversion module is used for realizing the format adapter and the converter, processing the conversion between different data formats, providing self-defined conversion logic and enabling a user to define conversion rules according to the needs;
the modularized architecture module is used for enabling a user to easily expand system functions or add new data storage formats, providing a plug-in architecture and allowing a third party developer to develop a custom storage format module for the system;
and the test optimization module is used for optimizing the performance of the conversion engine, realizing high-efficiency data conversion and loading, considering the expansibility of the system, processing large-scale data and keeping good performance.
The abnormality detection module is used for detecting unidentifiable or abnormal data during the data conversion process.
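The format conversion module's adapters and the modular architecture's plug-in hook can be sketched together as a converter registry that third-party code extends. The format names and the sample adapter are illustrative assumptions:

```python
# Registry of (source format, target format) -> converter function.
ADAPTERS = {}

def register_adapter(src, dst):
    """Plug-in hook: register a converter between two storage formats."""
    def wrap(fn):
        ADAPTERS[(src, dst)] = fn
        return fn
    return wrap

def convert_record(record, src, dst):
    fn = ADAPTERS.get((src, dst))
    if fn is None:                        # unknown format pair: report, don't guess
        raise ValueError(f"no adapter registered for {src} -> {dst}")
    return fn(record)

@register_adapter("csv-row", "vector")    # example third-party adapter
def csv_row_to_vector(row):
    return [float(x) for x in row.split(",")]
```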
As a further description of the above technical solution:
the metadata management module includes:
identifying a target module, determining metadata management requirements and targets of an organization, improving data discoverability, improving data quality and supporting data analysis;
the metadata acquisition module is used for formulating a metadata management strategy and defining the range, content and standard of metadata, wherein the metadata management strategy comprises a data vocabulary, a business rule and a data map;
a metadata repository module that determines key metadata, which may include data entities, data attributes, business rules, data quality rules, defining clear standards and formats for each metadata;
metadata quality management module, ensure that metadata is captured, updated and deleted throughout the data lifecycle.
And integrating a metadata management module to integrate metadata management into the whole data management life cycle, including data management, data quality management and data security practices.
As a further description of the above technical solution:
the abnormality detection module includes:
the data consistency detection module is used for checking consistency of data types, ranges and missing values, and ensuring that data in a new storage format accords with expected specifications by using data verification rules and constraints;
the log analysis module records operation, error and warning information in the conversion process, and monitors the running state of the system in real time by using a log monitoring tool to discover problems in time;
the data change monitoring module monitors the change rate of data, detects whether abnormal rapid increase or decrease exists, sets a threshold value, and triggers an alarm when the data change exceeds a set range;
and the user behavior analysis module is used for monitoring the operation behavior of a system user when the storage format is defined and converted.
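The data change monitoring module's threshold-and-alarm behaviour can be sketched as a monitor that tracks the record count between conversion checkpoints. The class name and the default threshold are assumptions for illustration:

```python
class ChangeMonitor:
    """Flag abnormally fast growth or shrinkage of the data between checks."""
    def __init__(self, max_change_ratio=0.5):
        self.max_change_ratio = max_change_ratio   # allowed relative change
        self.last_count = None
        self.alarms = []

    def observe(self, count):
        """Return True (and record an alarm) if the change exceeds the range."""
        abnormal = False
        if self.last_count:                        # skip first sample / zero baseline
            ratio = abs(count - self.last_count) / self.last_count
            if ratio > self.max_change_ratio:
                abnormal = True
                self.alarms.append((self.last_count, count, ratio))
        self.last_count = count
        return abnormal
```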
The invention has the following beneficial effects:
1. In the invention, batch processing reduces the number of interactions with the database or file system, improving efficiency; indexing and partitioning accelerate data retrieval and update operations, improving query performance; data are loaded into memory for processing to avoid frequent disk access; a cache stores intermediate results to avoid repeated computation; the conversion process is divided into multiple stages that process the data step by step, giving better control of the flow and reducing the processing pressure at each stage; and data compression is used in the transmission and storage stages to reduce the data volume and improve efficiency.
2. In the invention, the converted data are verified against the specification of the target format and tested on both partial and whole data; abnormal conditions in the conversion are handled to ensure data integrity and accuracy; and the conversion steps and rules are recorded in a document that includes the data mapping, conversion rules, and test results.
3. In the invention, missing values of numerical features are filled with statistics such as the mean, median, or mode, and missing values of categorical features are filled with the most frequent category; missing values in time series data are handled by interpolation; when the number of missing values is small, the affected samples or related features are deleted; the data are cleaned according to the data specification, removing duplicate records and repairing erroneous values; inconsistent data are unified into a specific format to ensure consistency; and inconsistent data are corrected according to logical rules or default values.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow chart of the method S1 of the present invention;
FIG. 3 is a flow chart of the method S2 of the present invention;
FIG. 4 is a flow chart of the method S3 of the present invention;
FIG. 5 is a flow chart of the method S4 of the present invention;
FIG. 6 is a flow chart of the method of S409 of the present invention;
FIG. 7 is a flow chart of the method S5 of the present invention;
FIG. 8 is a system flow diagram of the present invention for customizing and converting a storage format of a vector database;
FIG. 9 is a flowchart of a metadata management module according to the present invention;
FIG. 10 is a flowchart of an anomaly detection module according to the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Referring to fig. 1-10, one embodiment provided by the present invention is: a method for customizing and converting a storage format of a vector database, comprising the steps of:
s1, defining characteristics of stored vector data
Defining a vector by means of a document or a data dictionary, describing the meaning and units of each element, wherein the metadata comprises the name, units and data type information of the vector;
s2, determining a usage scenario and an expected data access mode
The data access patterns are determined, including the frequency of read and write operations, the size and complexity of the data.
S3, selecting a storage format
A storage format is selected according to the structure and characteristics of the data.
S4, converting the existing data from the original format to a new format
Tools such as OpenRefine and Trifacta Wrangler provide an intuitive interface for cleaning and converting data without writing code; for more complex conversion tasks, the data are processed and format-converted in Python or Java, using the Pandas library and the Apache Spark data processing framework;
s5, optimizing the conversion process to ensure efficient processing of large-scale data sets
Large-scale data are processed in parallel on a cluster with the distributed computing framework Apache Spark, making full use of the performance of multi-core processors.
S101, determining basic information of vectors
Giving a clear name to the vector, reflecting the meaning represented by the name, and determining the number of elements in the vector;
s102, adding metadata for each element
Giving each element a descriptive name indicating its meaning, designating the unit of each element, defining the data type of each element;
s103, creating a data dictionary or a document
Listing each element and its associated metadata using a table, creating a document, describing the metadata of the vector in terms of paragraphs or chapters;
s104, adding information
Defining a reasonable value range for each element to facilitate data verification.
S201, determining whether data access is mainly read or write operation, or a combination thereof;
s202, determining whether real-time data access is needed or whether the requirements can be met by batch processing;
s203, analyzing the complexity of the data query, and knowing whether the complex query operation needs to be supported.
S301, selecting a traditional relational database storage format for data in a table form;
s302, selecting a NoSQL database for semi-structured or unstructured data;
s303, selecting a key-value store when the data model consists of simple key-value pairs;
s304, selecting a column-family database when the data exist as column clusters and high scalability is needed.
S401, analyzing the format and structure of the original data, and knowing information, fields and relations contained in the data;
s402, determining a new format to be converted, and defining a data structure and a specification in the new format;
s403, processing missing values, abnormal values and errors in the original data, and cleaning the data to ensure that the data meets the requirements of a target format;
s404, establishing a mapping relation between original data and a target format, and formulating a conversion rule, wherein the conversion rule comprises data type conversion and field renaming;
s405, selecting an ETL tool or writing a script or a program to perform conversion according to a conversion rule;
s406, before data format conversion, ensuring backup of the original data so as to prevent accidents;
s407, executing data format conversion by using the selected tool or script, and monitoring the conversion process to ensure that no error or abnormality occurs;
s408, verifying whether the converted data accords with the specification of the target format, and testing the data, including testing partial data and whole data;
s409, processing abnormal conditions in any conversion, and ensuring the integrity and accuracy of data;
s410, recording data format conversion steps and rules, and creating a document, wherein the document comprises data mapping, conversion rules and test results.
S4091, filling missing values of numerical features with statistics such as the mean, median, or mode, and filling missing values of categorical features with the most frequent category; handling missing values in time series data by interpolation; and, when the number of missing values is small, deleting the affected samples or the related features;
s4092, cleaning according to the data specification: removing duplicate records and repairing erroneous values; unifying inconsistent data into a specific format to ensure consistency; and correcting inconsistent data according to logical rules or default values.
The step S5 of optimizing the conversion process to ensure efficient processing of large-scale data sets comprises the following steps:
s501, dividing a large-scale data set into small blocks by using a parallel processing technology, and processing the small blocks simultaneously, wherein Apache Spark is used for improving the conversion speed;
s502, adopting batch processing to reduce interaction times of a database or a file system and improving efficiency;
s503, through indexing and partitioning, data retrieval and updating operation are accelerated, and query performance is improved;
s504, loading the data into a memory for processing so as to avoid frequent disk access;
s505, using a cache to store intermediate results, avoiding repeated computation;
s506, dividing the conversion process into a plurality of stages, gradually processing data, better controlling the flow, and reducing the processing pressure of each stage;
and S507, using data compression in the transmission and storage stages to reduce the data volume and improve efficiency.
The system for customizing and converting the storage format of the vector database comprises:
the data operation module is used for allowing a user to define and customize the storage format of the vector database, providing the capability of describing the vector and the structure of the characteristics thereof in a declarative manner, and defining indexes, compression methods and other storage related parameters;
the metadata management module is used for establishing a metadata management system, recording and maintaining metadata information in a database storage format, including table structures, indexes and data partition information, and allowing a user to query and modify metadata so as to adapt to different service requirements;
the format conversion module is used for realizing the format adapter and the converter, processing the conversion between different data formats, providing self-defined conversion logic and enabling a user to define conversion rules according to the needs;
the modularized architecture module is used for enabling a user to easily expand system functions or add new data storage formats, providing a plug-in architecture and allowing a third party developer to develop a custom storage format module for the system;
and the test optimization module is used for optimizing the performance of the conversion engine, realizing high-efficiency data conversion and loading, considering the expansibility of the system, processing large-scale data and keeping good performance.
The abnormality detection module is used for detecting unidentifiable or abnormal data during the data conversion process.
The target identification module determines the metadata management requirements and goals of the organization: improving data discoverability, improving data quality, and supporting data analysis;
the metadata acquisition module formulates the metadata management strategy and defines the scope, content, and standards of the metadata, including the data vocabulary, business rules, and data maps;
the metadata repository module identifies the key metadata, such as data entities, data attributes, business rules, and data quality rules, and defines clear standards and formats for each metadata item;
the metadata quality management module ensures that metadata are captured, updated, and deleted correctly throughout the data life cycle;
and the metadata integration module integrates metadata management into the entire data management life cycle, including data management, data quality management, and data security practices.
The data consistency detection module is used for checking consistency of data types, ranges and missing values, and ensuring that data in a new storage format accords with expected specifications by using data verification rules and constraints;
the log analysis module records operation, error and warning information in the conversion process, and monitors the running state of the system in real time by using a log monitoring tool to discover problems in time;
the data change monitoring module monitors the change rate of data, detects whether abnormal rapid increase or decrease exists, sets a threshold value, and triggers an alarm when the data change exceeds a set range;
and the user behavior analysis module is used for monitoring the operation behavior of a system user when the storage format is defined and converted.
Finally, it should be noted that the foregoing embodiments are intended to illustrate rather than limit the technical solution of the invention. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the described technical solutions may still be modified, or some of their features replaced by equivalents; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention falls within its scope.
Claims (10)
1. A method for customizing and converting a storage format of a vector database, characterized in that the method comprises the following steps:
s1, defining characteristics of stored vector data
Defining a vector by means of a document or a data dictionary, describing the meaning and units of each element, wherein the metadata comprises the name, units and data type information of the vector;
s2, determining a usage scenario and an expected data access mode
Determining a data access pattern, including frequency of read and write operations, size and complexity of data;
s3, selecting a storage format
Selecting a storage format according to the structure and characteristics of the data;
s4, converting the existing data from the original format to a new format
Tools such as OpenRefine and Trifacta Wrangler provide an intuitive interface for cleaning and converting data without writing code; for more complex conversion tasks, the data are processed and format-converted in Python or Java, using the Pandas library and the Apache Spark data processing framework;
s5, optimizing the conversion process to ensure efficient processing of large-scale data sets
Large-scale data are processed in parallel on a cluster with the distributed computing framework Apache Spark, making full use of the performance of multi-core processors.
2. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the defining of the characteristics of the stored vector data in S1 includes the following steps:
s101, determining basic information of vectors
Giving a clear name to the vector, reflecting the meaning represented by the name, and determining the number of elements in the vector;
s102, adding metadata for each element
Giving each element a descriptive name indicating its meaning, designating the unit of each element, defining the data type of each element;
s103, creating a data dictionary or a document
Listing each element and its associated metadata using a table, creating a document, describing the metadata of the vector in terms of paragraphs or chapters;
s104, adding constraint information
Defining a reasonable value range for each element to facilitate data verification.
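Steps S101-S104 amount to building a small data dictionary; the following is a minimal sketch in Python, where the field names and the example element are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ElementMeta:
    name: str    # descriptive name indicating meaning (S102)
    unit: str    # physical unit of the element (S102)
    dtype: type  # data type of the element (S102)
    vmin: float  # lower bound of the reasonable value range (S104)
    vmax: float  # upper bound of the reasonable value range (S104)

    def validate(self, value):
        """Data verification against type and declared range (S104)."""
        return isinstance(value, self.dtype) and self.vmin <= value <= self.vmax

# Hypothetical data-dictionary entry (S103).
speed = ElementMeta("speed", "m/s", float, 0.0, 340.0)
```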
3. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the step S2 of determining a usage scenario and an expected data access mode comprises the following steps:
s201, determining whether data access is mainly read or write operation, or a combination thereof;
s202, determining whether real-time data access is required or whether the requirements can be met through batch processing;
s203, analyzing the complexity of data queries to determine whether complex query operations need to be supported.
4. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the step S3 of selecting a storage format comprises the following steps:
s301, selecting a traditional relational database storage format for data in a table form;
s302, selecting a NoSQL database for semi-structured or unstructured data;
s303, selecting a key-value store for data with a simple key-value data model;
s304, selecting a column-family database when the data is organized in column families and high scalability is required.
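The selection rules of S301-S304 can be expressed as a simple decision function; the category labels and the order in which the rules are checked are illustrative assumptions:

```python
def select_storage_format(tabular, structured, simple_key_value, columnar_scalable):
    """Rule-of-thumb storage-format selector mirroring S301-S304."""
    if tabular:
        return "relational"       # S301: tabular data -> relational database
    if simple_key_value:
        return "key-value"        # S303: simple key-value data model
    if columnar_scalable:
        return "column-family"    # S304: column families + high scalability
    if not structured:
        return "nosql-document"   # S302: semi-structured/unstructured data
    return "relational"           # fallback for remaining structured data

fmt = select_storage_format(tabular=False, structured=False,
                            simple_key_value=False, columnar_scalable=False)
```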
5. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the step S4 of converting the existing data from the original format to the new format comprises the following steps:
s401, analyzing the format and structure of the original data, and knowing information, fields and relations contained in the data;
s402, determining a new format to be converted, and defining a data structure and a specification in the new format;
s403, processing missing values, abnormal values and errors in the original data, and cleaning the data to ensure that the data meets the requirements of a target format;
s404, establishing a mapping relation between original data and a target format, and formulating a conversion rule, wherein the conversion rule comprises data type conversion and field renaming;
s405, selecting an ETL tool or writing a script or a program to perform conversion according to a conversion rule;
s406, before data format conversion, ensuring backup of the original data so as to prevent accidents;
s407, executing data format conversion by using the selected tool or script, and monitoring the conversion process to ensure that no error or abnormality occurs;
s408, verifying whether the converted data accords with the specification of the target format, and testing the data, including testing partial data and whole data;
s409, processing abnormal conditions in any conversion, and ensuring the integrity and accuracy of data;
s410, recording data format conversion steps and rules, and creating a document, wherein the document comprises data mapping, conversion rules and test results.
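Steps S404-S408 (mapping, conversion rules, execution and verification) can be sketched with a rule table; the field names, casts and the verification check below are hypothetical:

```python
# Hypothetical mapping table (S404): old field -> (new field, type cast).
MAPPING = {
    "ID":   ("record_id", int),   # field renaming + data type conversion
    "Name": ("name", str),
}

def apply_rules(record):
    """Execute the conversion rules on one record (S405/S407)."""
    return {new: cast(record[old]) for old, (new, cast) in MAPPING.items()}

def verify(converted):
    """Check the result against the target-format specification (S408)."""
    expected = {new for new, _ in MAPPING.values()}
    return set(converted) == expected and isinstance(converted["record_id"], int)

row = apply_rules({"ID": "42", "Name": "alpha"})
```

In practice this role is often filled by an ETL tool (S405), with the original data backed up beforehand (S406).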
6. The method for customizing and converting a storage format of a vector database as recited in claim 5, wherein: the step S409 processes any abnormal situations in the conversion, and ensures the integrity and accuracy of the data, and includes the following steps:
s4091, filling missing values of numerical features using statistics such as the mean, median or mode, and filling missing values of categorical features using the most frequent category; processing missing values in time-series data by interpolation; when the number of missing values is relatively small, deleting the affected samples or related features;
s4092, cleaning according to the data specification, removing duplicate records and repairing erroneous values; unifying inconsistent data into a specific format to ensure data consistency; and correcting inconsistent data according to logic rules or default values.
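Step S4091 can be illustrated with the Python standard library; the feature values below are made up for the example:

```python
from statistics import mean, mode

def fill_numeric(values, strategy=mean):
    """Fill None entries of a numerical feature with a statistic (S4091)."""
    fill = strategy([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

def fill_categorical(values):
    """Fill None entries of a categorical feature with the most frequent class."""
    fill = mode([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

temps = fill_numeric([20.0, None, 22.0, None])   # mean of observed values = 21.0
kinds = fill_categorical(["a", "b", None, "b"])  # mode of observed values = "b"
```

The `strategy` argument accepts `statistics.median` or `statistics.mode` as well; interpolation for time series would use the neighbouring values instead of a global statistic.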
7. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the S5 optimizes the conversion process to ensure efficient processing of large-scale data sets, comprising the steps of:
s501, dividing a large-scale data set into small blocks by using a parallel processing technology, and processing the small blocks simultaneously, wherein Apache Spark is used for improving the conversion speed;
s502, adopting batch processing to reduce interaction times of a database or a file system and improving efficiency;
s503, through indexing and partitioning, data retrieval and updating operation are accelerated, and query performance is improved;
s504, loading the data into a memory for processing so as to avoid frequent disk access;
s505, using a cache to store intermediate results, avoiding repeated computation;
s506, dividing the conversion process into a plurality of stages, gradually processing data, better controlling the flow, and reducing the processing pressure of each stage;
s507, using data compression in the transmission and storage stages to reduce data volume and improve efficiency.
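Steps S501 (partitioning into blocks) and S507 (compressing before transmission and storage) might look like this minimal sketch; the chunk size and payload are illustrative:

```python
import gzip
import json

def chunks(items, size):
    """Partition a large dataset into fixed-size blocks (S501)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def compress_block(block):
    """Serialize and gzip one block before transmission/storage (S507)."""
    return gzip.compress(json.dumps(block).encode("utf-8"))

data = list(range(10))
blocks = [compress_block(b) for b in chunks(data, 4)]   # 3 blocks: 4 + 4 + 2 items
restored = [json.loads(gzip.decompress(b)) for b in blocks]
```

In a Spark pipeline the chunking is handled by partitioning the RDD or DataFrame, and each partition is processed on a separate executor.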
8. A system for customizing and converting a storage format of a vector database, implementing the method of any one of claims 1-7, characterized in that the system comprises:
the data operation module is used for allowing a user to define and customize the storage format of the vector database, providing the capability of describing the vector and the structure of the characteristics thereof in a declarative manner, and defining indexes, compression methods and other storage related parameters;
the metadata management module is used for establishing a metadata management system, recording and maintaining metadata information in a database storage format, including table structures, indexes and data partition information, and allowing a user to query and modify metadata so as to adapt to different service requirements;
the format conversion module is used for realizing the format adapter and the converter, processing the conversion between different data formats, providing self-defined conversion logic and enabling a user to define conversion rules according to the needs;
the modularized architecture module is used for enabling a user to easily expand system functions or add new data storage formats, providing a plug-in architecture and allowing a third party developer to develop a custom storage format module for the system;
the test optimization module is used for optimizing the performance of the conversion engine, realizing high-efficiency data conversion and loading, considering the expansibility of the system, processing large-scale data and keeping good performance;
the abnormality detection module is used for detecting data information which cannot be identified or abnormal in the data conversion process.
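The format conversion module and the plug-in architecture described above can be sketched as a converter registry; the registration API and the toy CSV-to-JSON adapter are illustrative assumptions, not the claimed implementation:

```python
import json

# Registry of format adapters: third-party modules add converters by
# decorating a function, giving the plug-in extensibility of claim 8.
CONVERTERS = {}

def register(src, dst):
    """Register a custom conversion function for the (src, dst) format pair."""
    def wrap(fn):
        CONVERTERS[(src, dst)] = fn
        return fn
    return wrap

@register("csv", "json")
def csv_to_json(text):
    """A toy adapter: naive CSV (no quoting) to a JSON array of objects."""
    header, *rows = [line.split(",") for line in text.splitlines()]
    return json.dumps([dict(zip(header, row)) for row in rows])

def convert(src, dst, payload):
    """Dispatch to the registered adapter for the requested conversion."""
    return CONVERTERS[(src, dst)](payload)

result = json.loads(convert("csv", "json", "a,b\n1,2"))
```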
9. A system for customizing and converting a storage format of a vector database as recited in claim 8, wherein: the metadata management module includes:
the target identification module, which determines the metadata management requirements and goals of the organization, such as improving data discoverability, improving data quality and supporting data analysis;
the metadata acquisition module, which formulates a metadata management strategy and defines the scope, content and standards of the metadata, including a data vocabulary, business rules and a data map;
the metadata repository module, which determines key metadata, which may include data entities, data attributes, business rules and data quality rules, defining clear standards and formats for each piece of metadata;
the metadata quality management module, which ensures that metadata is captured, updated and deleted throughout the data lifecycle;
and the metadata integration module, which integrates metadata management into the whole data management life cycle, including data management, data quality management and data security practices.
10. A system for customizing and converting a storage format of a vector database as recited in claim 8, wherein: the abnormality detection module includes:
the data consistency detection module is used for checking consistency of data types, ranges and missing values, and ensuring that data in a new storage format accords with expected specifications by using data verification rules and constraints;
the log analysis module records operation, error and warning information in the conversion process, and monitors the running state of the system in real time by using a log monitoring tool to discover problems in time;
the data change monitoring module monitors the change rate of data, detects whether abnormal rapid increase or decrease exists, sets a threshold value, and triggers an alarm when the data change exceeds a set range;
and the user behavior analysis module is used for monitoring the operation behavior of a system user when the storage format is defined and converted.
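The data change monitoring module's threshold alarm can be sketched as follows; the threshold value and the record counts are illustrative:

```python
def change_alarms(counts, threshold=0.5):
    """Return indices where the relative change rate exceeds the threshold,
    i.e. where the data grows or shrinks abnormally fast (claim 10)."""
    alarms = []
    for i in range(1, len(counts)):
        prev, cur = counts[i - 1], counts[i]
        rate = abs(cur - prev) / prev if prev else float("inf")
        if rate > threshold:
            alarms.append(i)
    return alarms

# Steady growth, then an abnormally rapid increase at index 3 triggers an alarm.
alerts = change_alarms([100, 110, 120, 300])
```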
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311635852.3A CN117667932A (en) | 2023-12-01 | 2023-12-01 | Method and system for customizing and converting storage format of vector database |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117667932A true CN117667932A (en) | 2024-03-08 |
Family
ID=90069253
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311635852.3A Pending CN117667932A (en) | 2023-12-01 | 2023-12-01 | Method and system for customizing and converting storage format of vector database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117667932A (en) |
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117971137A * | 2024-04-02 | 2024-05-03 | 山东海润数聚科技有限公司 | Multithreading-based large-scale vector data consistency assessment method and system
CN117971137B * | 2024-04-02 | 2024-06-04 | 山东海润数聚科技有限公司 | Multithreading-based large-scale vector data consistency assessment method and system
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||