CN117667932A - Method and system for customizing and converting storage format of vector database - Google Patents

Publication number
CN117667932A
Authority
CN
China
Prior art keywords
data
format
conversion
metadata
vector
Prior art date
Legal status
Pending
Application number
CN202311635852.3A
Other languages
Chinese (zh)
Inventor
陈海富
杨雨欣
Current Assignee
Beijing Birui Data Technology Co ltd
Original Assignee
Beijing Birui Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Birui Data Technology Co ltd
Priority to CN202311635852.3A
Publication of CN117667932A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of computer science and discloses a method and a system for customizing and converting a storage format of a vector database. The method comprises the following steps: S1, defining the characteristics of the stored vector data, in which a vector is defined by means of a document or a data dictionary that describes the meaning and unit of each element, the metadata comprising the name, unit and data type information of the vector; S2, determining the usage scenario and the expected data access pattern, including the frequency of read and write operations and the size and complexity of the data; S3, selecting a storage format; S4, converting the existing data from the original format to the new format; S5, optimizing the conversion process to ensure efficient processing of large-scale data sets. In the invention, batch processing reduces the number of interactions with the database or file system, which improves efficiency; indexing and partitioning accelerate data retrieval and update operations, which improves query performance.

Description

Method and system for customizing and converting storage format of vector database
Technical Field
The invention relates to the technical field of computer science, in particular to a method and a system for customizing and converting a storage format of a vector database.
Background
A vector database is a database system designed for efficient storage, retrieval and processing of vector data. Vector data refers to ordered collections of numbers (scalars), which are common in the fields of machine learning, data mining and information retrieval. Each vector represents the features of an entity, such as a document, an image or an audio clip. The goal of a vector database is to provide efficient query and analysis operations that support a variety of application scenarios.
With the development of machine learning and deep learning technologies, the demand for storing and processing large-scale vector data keeps growing. Traditional database systems are better suited to structured data than to high-dimensional vector data, so they cannot process this type of data effectively and analysis is inefficient.
Disclosure of Invention
To remedy these deficiencies, the invention provides a method and a system for customizing and converting a storage format of a vector database. They aim to solve the problem that the demand for storing and processing large-scale vector data keeps growing while traditional database systems cannot process this type of data effectively and analysis is inefficient.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a method for customizing and converting a storage format of a vector database, comprising the steps of:
S1, defining the characteristics of the stored vector data
Defining a vector by means of a document or a data dictionary that describes the meaning and unit of each element, wherein the metadata comprises the name, unit and data type information of the vector;
S2, determining the usage scenario and the expected data access pattern
Determining the data access pattern, including the frequency of read and write operations and the size and complexity of the data;
S3, selecting a storage format
Selecting the storage unit according to the structure and characteristics of the data;
S4, converting the existing data from the original format to the new format
Using the OpenRefine and Trifacta Wrangler tools, which provide an intuitive interface, to clean and convert data without writing code; for more complex conversion tasks, processing and format-converting the data using Python or Java together with data processing tools such as the Pandas library and Apache Spark;
S5, optimizing the conversion process to ensure efficient processing of large-scale data sets
Processing large-scale data in parallel on a cluster using the distributed computing framework Apache Spark, making full use of multi-core processor performance.
As a further description of the above technical solution:
S101, determining the basic information of the vector
Giving the vector a clear name that reflects what it represents, and determining the number of elements in the vector;
S102, adding metadata for each element
Giving each element a descriptive name indicating its meaning, specifying the unit of each element, and defining the data type of each element;
S103, creating a data dictionary or a document
Listing each element and its associated metadata in a table, or creating a document that describes the metadata of the vector in paragraphs or chapters;
S104, adding further information
Defining a reasonable value range for each element to facilitate data verification.
As a further description of the above technical solution:
S201, determining whether data access consists mainly of read operations, write operations, or a combination of both;
S202, determining whether real-time data access is needed or whether the requirements can be met through batch processing;
S203, analyzing the complexity of the data queries to determine whether complex query operations need to be supported.
As a further description of the above technical solution:
S301, selecting a traditional relational database storage format for data in tabular form;
S302, selecting a NoSQL database for semi-structured or unstructured data;
S303, selecting a key-value store when the data model consists of simple key-value pairs;
S304, selecting a column-family database when the data is organized in column families and high scalability is needed.
As a further description of the above technical solution:
S401, analyzing the format and structure of the original data to understand the information, fields and relations contained in the data;
S402, determining the new format to convert to, and defining the data structure and specification of the new format;
S403, handling missing values, outliers and errors in the original data, and cleaning the data to ensure that it meets the requirements of the target format;
S404, establishing a mapping between the original data and the target format, and formulating conversion rules, including data type conversion and field renaming;
S405, selecting an ETL tool, or writing a script or program, to perform the conversion according to the conversion rules;
S406, backing up the original data before the format conversion to guard against accidents;
S407, executing the data format conversion using the selected tool or script, and monitoring the conversion process to ensure that no errors or anomalies occur;
S408, verifying whether the converted data conforms to the specification of the target format, and testing the data, including tests on partial data and on the whole data set;
S409, handling any abnormal conditions that arise during conversion to ensure the integrity and accuracy of the data;
S410, recording the data format conversion steps and rules, and creating a document that includes the data mapping, the conversion rules and the test results.
As a further description of the above technical solution:
S4091, filling missing values of numerical features with the mean, median or mode, and filling missing values of categorical features with a category value; handling missing values in time series data by interpolation; when the number of missing values is relatively small, deleting the affected samples or related features;
S4092, cleaning the data according to the data specification, removing duplicate records and repairing erroneous values; unifying inconsistent data into a specific format to ensure data consistency; and correcting inconsistent data according to logic rules or default values.
As a further description of the above technical solution:
S501, using parallel processing to divide a large-scale data set into small blocks that are processed simultaneously, with Apache Spark used to increase the conversion speed;
S502, adopting batch processing to reduce the number of interactions with the database or file system, improving efficiency;
S503, using indexing and partitioning to accelerate data retrieval and update operations, improving query performance;
S504, loading the data into memory for processing to avoid frequent disk access;
S505, using a cache to store intermediate results, avoiding repeated computation;
S506, dividing the conversion process into several stages and processing the data step by step, giving better control of the flow and reducing the processing load of each stage;
S507, using data compression in the transmission and storage stages to reduce the data volume and improve efficiency.
As a further description of the above technical solution:
The system for customizing and converting the storage format of the vector database comprises:
the data operation module, which allows a user to define and customize the storage format of the vector database, provides the ability to describe the structure of vectors and their features declaratively, and defines indexes, compression methods and other storage-related parameters;
the metadata management module, which establishes a metadata management system that records and maintains the metadata of the database storage format, including table structures, indexes and data partition information, and allows a user to query and modify the metadata to adapt to different business requirements;
the format conversion module, which implements format adapters and converters, handles conversion between different data formats, and provides customizable conversion logic so that a user can define conversion rules as needed;
the modular architecture module, which enables a user to easily extend system functions or add new data storage formats, and provides a plug-in architecture that allows third-party developers to develop custom storage format modules for the system;
the test optimization module, which optimizes the performance of the conversion engine to achieve efficient data conversion and loading, and takes the scalability of the system into account so that large-scale data is processed with good performance;
and the abnormality detection module, which detects unrecognizable or abnormal data during the data conversion process.
As a further description of the above technical solution:
The metadata management module includes:
a target identification module, which determines the organization's metadata management requirements and goals, including improving data discoverability, improving data quality and supporting data analysis;
a metadata acquisition module, which formulates a metadata management strategy and defines the scope, content and standards of the metadata, wherein the metadata management strategy comprises a data vocabulary, business rules and a data map;
a metadata repository module, which determines the key metadata, which may include data entities, data attributes, business rules and data quality rules, and defines clear standards and formats for each item of metadata;
a metadata quality management module, which ensures that metadata is captured, updated and deleted throughout the data lifecycle;
and an integrated metadata management module, which integrates metadata management into the whole data management lifecycle, including data management, data quality management and data security practices.
As a further description of the above technical solution:
The abnormality detection module includes:
the data consistency detection module, which checks the consistency of data types, ranges and missing values, and uses data verification rules and constraints to ensure that data in the new storage format conforms to the expected specification;
the log analysis module, which records operation, error and warning information during the conversion process, and uses a log monitoring tool to monitor the running state of the system in real time so that problems are discovered promptly;
the data change monitoring module, which monitors the rate of change of the data, detects abnormally rapid increases or decreases, sets a threshold, and triggers an alarm when the data change exceeds the set range;
and the user behavior analysis module, which monitors the operations of system users when they define and convert storage formats.
The invention has the following beneficial effects:
1. In the invention, batch processing reduces the number of interactions with the database or file system, improving efficiency; indexing and partitioning accelerate data retrieval and update operations, improving query performance; data is loaded into memory for processing to avoid frequent disk access; a cache stores intermediate results to avoid repeated computation; the conversion process is divided into several stages and the data is processed step by step, giving better control of the flow and reducing the processing load of each stage; and data compression is used in the transmission and storage stages to reduce the data volume and improve efficiency.
2. In the invention, the converted data is verified against the specification of the target format and tested, including tests on partial data and on the whole data set; any abnormal conditions that arise during conversion are handled to ensure the integrity and accuracy of the data; and the data format conversion steps and rules are recorded in a document that includes the data mapping, the conversion rules and the test results.
3. In the invention, missing values of numerical features are filled with the mean, median or mode, and missing values of categorical features are filled with a category value; missing values in time series data are handled by interpolation; when the number of missing values is relatively small, the affected samples or related features are deleted; the data is cleaned according to the data specification, removing duplicate records and repairing erroneous values; inconsistent data is unified into a specific format to ensure data consistency; and inconsistent data is corrected according to logic rules or default values.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow chart of step S1 of the method of the present invention;
FIG. 3 is a flow chart of step S2 of the method of the present invention;
FIG. 4 is a flow chart of step S3 of the method of the present invention;
FIG. 5 is a flow chart of step S4 of the method of the present invention;
FIG. 6 is a flow chart of step S409 of the method of the present invention;
FIG. 7 is a flow chart of step S5 of the method of the present invention;
FIG. 8 is a flow diagram of the system of the present invention for customizing and converting a storage format of a vector database;
FIG. 9 is a flow chart of the metadata management module of the present invention;
FIG. 10 is a flow chart of the abnormality detection module of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
Referring to FIGS. 1-10, one embodiment provided by the present invention is a method for customizing and converting a storage format of a vector database, comprising the following steps:
S1, defining the characteristics of the stored vector data
Defining a vector by means of a document or a data dictionary that describes the meaning and unit of each element, wherein the metadata comprises the name, unit and data type information of the vector;
S2, determining the usage scenario and the expected data access pattern
Determining the data access pattern, including the frequency of read and write operations and the size and complexity of the data;
S3, selecting a storage format
Selecting the storage unit according to the structure and characteristics of the data;
S4, converting the existing data from the original format to the new format
Using the OpenRefine and Trifacta Wrangler tools, which provide an intuitive interface, to clean and convert data without writing code; for more complex conversion tasks, processing and format-converting the data using Python or Java together with data processing tools such as the Pandas library and Apache Spark;
S5, optimizing the conversion process to ensure efficient processing of large-scale data sets
Processing large-scale data in parallel on a cluster using the distributed computing framework Apache Spark, making full use of multi-core processor performance.
S101, determining the basic information of the vector
Giving the vector a clear name that reflects what it represents, and determining the number of elements in the vector;
S102, adding metadata for each element
Giving each element a descriptive name indicating its meaning, specifying the unit of each element, and defining the data type of each element;
S103, creating a data dictionary or a document
Listing each element and its associated metadata in a table, or creating a document that describes the metadata of the vector in paragraphs or chapters;
S104, adding further information
Defining a reasonable value range for each element to facilitate data verification.
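Steps S101 to S104 can be illustrated with a small data dictionary. The following is a minimal Python sketch under stated assumptions, not the implementation of the invention: the class names `ElementSpec` and `VectorSpec`, the example vector `body_measurement` and its value ranges are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ElementSpec:
    """Metadata for one vector element (S102, S104). Hypothetical structure."""
    name: str                          # descriptive name of the element
    unit: str                          # unit of measurement
    dtype: type                        # expected Python type of the value
    min_value: Optional[float] = None  # lower bound of the reasonable range
    max_value: Optional[float] = None  # upper bound of the reasonable range


@dataclass(frozen=True)
class VectorSpec:
    """Data dictionary entry for a whole vector (S101, S103)."""
    name: str
    elements: Tuple[ElementSpec, ...]

    @property
    def dimension(self) -> int:
        return len(self.elements)

    def validate(self, values) -> List[str]:
        """Return a list of readable problems; an empty list means valid."""
        if len(values) != self.dimension:
            return [f"expected {self.dimension} elements, got {len(values)}"]
        problems = []
        for spec, value in zip(self.elements, values):
            if not isinstance(value, spec.dtype):
                problems.append(f"{spec.name}: expected {spec.dtype.__name__}")
            elif spec.min_value is not None and value < spec.min_value:
                problems.append(f"{spec.name}: {value} below {spec.min_value} {spec.unit}")
            elif spec.max_value is not None and value > spec.max_value:
                problems.append(f"{spec.name}: {value} above {spec.max_value} {spec.unit}")
        return problems


# Illustrative vector with two elements; names and ranges are assumptions.
body = VectorSpec(
    name="body_measurement",
    elements=(
        ElementSpec("height", "cm", float, 0.0, 300.0),
        ElementSpec("weight", "kg", float, 0.0, 500.0),
    ),
)
print(body.validate([170.0, 65.0]))  # no problems reported
print(body.validate([170.0, -1.0]))  # weight is outside its range
```

Keeping the value range next to the name, unit and data type means the same dictionary entry can serve both as documentation and as the validation rule applied during conversion.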
S201, determining whether data access consists mainly of read operations, write operations, or a combination of both;
S202, determining whether real-time data access is needed or whether the requirements can be met through batch processing;
S203, analyzing the complexity of the data queries to determine whether complex query operations need to be supported.
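One possible way to carry out step S201 is to count the operations in an access log. The sketch below is only an illustration: the function name `classify_workload`, the log format (a list of the strings "read" and "write") and the 0.7 cut-off are assumptions that do not come from the invention.

```python
from collections import Counter


def classify_workload(operations, dominance=0.7):
    """Label an access log as read-heavy, write-heavy or mixed.

    `dominance` is an assumed cut-off: the share of all operations that
    reads (or writes) must reach before the workload is named after them.
    """
    counts = Counter(operations)
    total = counts["read"] + counts["write"]
    if total == 0:
        return "unknown"
    read_share = counts["read"] / total
    if read_share >= dominance:
        return "read-heavy"
    if read_share <= 1 - dominance:
        return "write-heavy"
    return "mixed"


# Hypothetical log: eight reads and two writes.
log = ["read"] * 8 + ["write"] * 2
print(classify_workload(log))
```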
S301, selecting a traditional relational database storage format for data in tabular form;
S302, selecting a NoSQL database for semi-structured or unstructured data;
S303, selecting a key-value store when the data model consists of simple key-value pairs;
S304, selecting a column-family database when the data is organized in column families and high scalability is needed.
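The selection rules S301 to S304 amount to a small decision table. A minimal sketch follows; the enum `DataShape` and the function `select_storage` are hypothetical names, and the returned strings are labels rather than references to any particular product.

```python
from enum import Enum


class DataShape(Enum):
    TABULAR = "tabular"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"
    KEY_VALUE = "key-value"
    COLUMN_FAMILY = "column-family"


def select_storage(shape, needs_high_scalability=False):
    """Map the shape of the data to a storage category (S301 to S304)."""
    if shape is DataShape.TABULAR:
        return "relational database"        # S301
    if shape in (DataShape.SEMI_STRUCTURED, DataShape.UNSTRUCTURED):
        return "NoSQL database"             # S302
    if shape is DataShape.KEY_VALUE:
        return "key-value store"            # S303
    if shape is DataShape.COLUMN_FAMILY and needs_high_scalability:
        return "column-family database"     # S304
    # The steps give no rule for this combination, so the sketch refuses to guess.
    raise ValueError(f"no selection rule covers {shape} with the given requirements")


print(select_storage(DataShape.TABULAR))
print(select_storage(DataShape.COLUMN_FAMILY, needs_high_scalability=True))
```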
S401, analyzing the format and structure of the original data to understand the information, fields and relations contained in the data;
S402, determining the new format to convert to, and defining the data structure and specification of the new format;
S403, handling missing values, outliers and errors in the original data, and cleaning the data to ensure that it meets the requirements of the target format;
S404, establishing a mapping between the original data and the target format, and formulating conversion rules, including data type conversion and field renaming;
S405, selecting an ETL tool, or writing a script or program, to perform the conversion according to the conversion rules;
S406, backing up the original data before the format conversion to guard against accidents;
S407, executing the data format conversion using the selected tool or script, and monitoring the conversion process to ensure that no errors or anomalies occur;
S408, verifying whether the converted data conforms to the specification of the target format, and testing the data, including tests on partial data and on the whole data set;
S409, handling any abnormal conditions that arise during conversion to ensure the integrity and accuracy of the data;
S410, recording the data format conversion steps and rules, and creating a document that includes the data mapping, the conversion rules and the test results.
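The mapping, conversion and verification steps (S404, S407 and S408) might look as follows when written as a script, as S405 allows. This is a sketch under stated assumptions: the field names `id`, `vec`, `doc_id` and `embedding`, and the comma-separated source encoding, are invented for illustration.

```python
# Hypothetical conversion rules: source field -> (target field, type converter).
RULES = {
    "id": ("doc_id", int),
    "vec": ("embedding", lambda text: [float(x) for x in text.split(",")]),
}


def convert_record(record, rules):
    """Apply field renaming and type conversion to one source record (S404, S407)."""
    converted = {}
    for source_field, (target_field, cast) in rules.items():
        if source_field not in record:
            raise KeyError(f"missing source field: {source_field}")
        converted[target_field] = cast(record[source_field])
    return converted


def verify_record(record, dimension):
    """Check a converted record against the target specification (S408)."""
    return (
        isinstance(record.get("doc_id"), int)
        and isinstance(record.get("embedding"), list)
        and len(record["embedding"]) == dimension
    )


source = {"id": "42", "vec": "0.1,0.2,0.3"}
target = convert_record(source, RULES)
print(target)
print(verify_record(target, dimension=3))
```

Because the rules are plain data, the same table can be written into the conversion document required by S410.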
S4091, filling missing values of numerical features with the mean, median or mode, and filling missing values of categorical features with a category value; handling missing values in time series data by interpolation; when the number of missing values is relatively small, deleting the affected samples or related features;
S4092, cleaning the data according to the data specification, removing duplicate records and repairing erroneous values; unifying inconsistent data into a specific format to ensure data consistency; and correcting inconsistent data according to logic rules or default values.
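The cleaning operations of S4091 and S4092 can be sketched with the Python standard library alone. The function names are hypothetical, `None` is assumed to mark a missing value, linear interpolation is assumed as the interpolation method, and filling categorical features with the most frequent category is one assumed reading of filling them with a category value.

```python
import statistics


def fill_numeric(values, strategy="mean"):
    """Fill None entries with the mean, median or mode of the known values."""
    known = [v for v in values if v is not None]
    statistic = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy]
    fill = statistic(known)
    return [fill if v is None else v for v in values]


def fill_categorical(values):
    """Fill None entries with the most frequent category (an assumption)."""
    known = [v for v in values if v is not None]
    most_common = statistics.mode(known)
    return [most_common if v is None else v for v in values]


def interpolate_gaps(series):
    """Linearly interpolate None entries that lie between two known points."""
    result = list(series)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result  # leading and trailing gaps are left untouched


def drop_duplicates(records):
    """Remove repeated records (dicts with hashable values), keeping the first."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


print(fill_numeric([1.0, None, 3.0]))
print(interpolate_gaps([1.0, None, None, 4.0]))
```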
Step S5, optimizing the conversion process to ensure efficient processing of large-scale data sets, comprises the following steps:
S501, using parallel processing to divide a large-scale data set into small blocks that are processed simultaneously, with Apache Spark used to increase the conversion speed;
S502, adopting batch processing to reduce the number of interactions with the database or file system, improving efficiency;
S503, using indexing and partitioning to accelerate data retrieval and update operations, improving query performance;
S504, loading the data into memory for processing to avoid frequent disk access;
S505, using a cache to store intermediate results, avoiding repeated computation;
S506, dividing the conversion process into several stages and processing the data step by step, giving better control of the flow and reducing the processing load of each stage;
S507, using data compression in the transmission and storage stages to reduce the data volume and improve efficiency.
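The combination of block-wise parallel processing (S501), batching (S502) and compression (S507) can be shown in miniature. In the sketch below a local thread pool stands in for an Apache Spark cluster, which is a deliberate simplification; the float32 binary layout, the batch size and all function names are assumptions made for illustration.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` rows (S502)."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def encode_batch(batch):
    """Pack each vector as little-endian float32, then compress the batch (S507)."""
    payload = b"".join(struct.pack(f"<{len(v)}f", *v) for v in batch)
    return zlib.compress(payload)


def decode_batch(blob, dimension):
    """Inverse of encode_batch, used here to verify the round trip."""
    raw = zlib.decompress(blob)
    row_size = 4 * dimension
    return [
        list(struct.unpack_from(f"<{dimension}f", raw, offset))
        for offset in range(0, len(raw), row_size)
    ]


def convert_dataset(rows, batch_size=1000, workers=4):
    """Encode batches concurrently; the thread pool stands in for a cluster (S501)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_batch, batched(rows, batch_size)))


# Ten hypothetical two-dimensional vectors, converted in batches of four.
vectors = [[float(i), float(i + 1)] for i in range(10)]
blobs = convert_dataset(vectors, batch_size=4, workers=2)
print(len(blobs))
```

Each compressed blob is written or transmitted in a single operation, so the number of interactions with the storage layer grows with the number of batches rather than the number of vectors.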
The system for customizing and converting the storage format of the vector database comprises:
the data operation module, which allows a user to define and customize the storage format of the vector database, provides the ability to describe the structure of vectors and their features declaratively, and defines indexes, compression methods and other storage-related parameters;
the metadata management module, which establishes a metadata management system that records and maintains the metadata of the database storage format, including table structures, indexes and data partition information, and allows a user to query and modify the metadata to adapt to different business requirements;
the format conversion module, which implements format adapters and converters, handles conversion between different data formats, and provides customizable conversion logic so that a user can define conversion rules as needed;
the modular architecture module, which enables a user to easily extend system functions or add new data storage formats, and provides a plug-in architecture that allows third-party developers to develop custom storage format modules for the system;
the test optimization module, which optimizes the performance of the conversion engine to achieve efficient data conversion and loading, and takes the scalability of the system into account so that large-scale data is processed with good performance;
and the abnormality detection module, which detects unrecognizable or abnormal data during the data conversion process.
The metadata management module includes:
a target identification module, which determines the organization's metadata management requirements and goals, including improving data discoverability, improving data quality and supporting data analysis;
a metadata acquisition module, which formulates a metadata management strategy and defines the scope, content and standards of the metadata, wherein the metadata management strategy comprises a data vocabulary, business rules and a data map;
a metadata repository module, which determines the key metadata, which may include data entities, data attributes, business rules and data quality rules, and defines clear standards and formats for each item of metadata;
a metadata quality management module, which ensures that metadata is captured, updated and deleted throughout the data lifecycle;
and an integrated metadata management module, which integrates metadata management into the whole data management lifecycle, including data management, data quality management and data security practices.
The abnormality detection module includes:
the data consistency detection module, which checks the consistency of data types, ranges and missing values, and uses data verification rules and constraints to ensure that data in the new storage format conforms to the expected specification;
the log analysis module, which records operation, error and warning information during the conversion process, and uses a log monitoring tool to monitor the running state of the system in real time so that problems are discovered promptly;
the data change monitoring module, which monitors the rate of change of the data, detects abnormally rapid increases or decreases, sets a threshold, and triggers an alarm when the data change exceeds the set range;
and the user behavior analysis module, which monitors the operations of system users when they define and convert storage formats.
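The threshold check performed by the data change monitoring module can be sketched as follows. The function name `abnormal_changes`, the use of consecutive record counts as the monitored quantity and the 50% threshold are assumptions; a real module would raise an alarm where this sketch merely returns the offending positions.

```python
def abnormal_changes(counts, max_relative_change=0.5):
    """Return (index, rate) pairs where the count changed faster than allowed.

    `counts` holds successive record counts; `rate` is the relative change
    from the previous observation, so 1.0 means the count doubled.
    """
    alerts = []
    for index in range(1, len(counts)):
        previous, current = counts[index - 1], counts[index]
        if previous == 0:
            # Relative change is undefined; treat any growth from zero as abnormal.
            if current != 0:
                alerts.append((index, float("inf")))
            continue
        rate = (current - previous) / previous
        if abs(rate) > max_relative_change:
            alerts.append((index, rate))
    return alerts


# Hypothetical record counts observed after four conversion stages.
history = [100, 110, 300, 290]
for index, rate in abnormal_changes(history):
    print(f"alarm: change of {rate:+.0%} at observation {index}")
```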
Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify or substitute some of the features described in those embodiments. Any modification, substitution or improvement made within the spirit and principles of the present invention is intended to be included in the scope of the present invention.

Claims (10)

1. A method for customizing and converting a storage format of a vector database, characterized in that the method comprises the following steps:
S1, defining the characteristics of the stored vector data
Defining a vector by means of a document or a data dictionary that describes the meaning and unit of each element, wherein the metadata comprises the name, unit and data type information of the vector;
S2, determining the usage scenario and the expected data access pattern
Determining the data access pattern, including the frequency of read and write operations and the size and complexity of the data;
S3, selecting a storage format
Selecting the storage unit according to the structure and characteristics of the data;
S4, converting the existing data from the original format to the new format
Using the OpenRefine and Trifacta Wrangler tools, which provide an intuitive interface, to clean and convert data without writing code; for more complex conversion tasks, processing and format-converting the data using Python or Java together with data processing tools such as the Pandas library and Apache Spark;
S5, optimizing the conversion process to ensure efficient processing of large-scale data sets
Processing large-scale data in parallel on a cluster using the distributed computing framework Apache Spark, making full use of multi-core processor performance.
2. The method for customizing and converting a storage format of a vector database as claimed in claim 1, wherein step S1, defining the characteristics of the stored vector data, comprises the following steps:
S101, determining the basic information of the vector
Giving the vector a clear name that reflects what it represents, and determining the number of elements in the vector;
S102, adding metadata for each element
Giving each element a descriptive name indicating its meaning, specifying the unit of each element, and defining the data type of each element;
S103, creating a data dictionary or a document
Listing each element and its associated metadata in a table, or creating a document that describes the metadata of the vector in paragraphs or chapters;
S104, adding further information
Defining a reasonable value range for each element to facilitate data verification.
3. The method for customizing and converting a storage format of a vector database as claimed in claim 1, wherein step S2, determining the usage scenario and the expected data access pattern, comprises the following steps:
S201, determining whether data access consists mainly of read operations, write operations, or a combination of both;
S202, determining whether real-time data access is needed or whether the requirements can be met through batch processing;
S203, analyzing the complexity of the data queries to determine whether complex query operations need to be supported.
4. The method for customizing and converting a storage format of a vector database as claimed in claim 1, wherein step S3, selecting a storage format, comprises the following steps:
S301, selecting a traditional relational database storage format for data in tabular form;
S302, selecting a NoSQL database for semi-structured or unstructured data;
S303, selecting a key-value store when the data model consists of simple key-value pairs;
S304, selecting a column-family database when the data is organized in column families and high scalability is needed.
5. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the step S4 of converting the existing data from the original format to the new format comprises the following steps:
S401, analyzing the format and structure of the original data to identify the information, fields and relationships contained in the data;
S402, determining the new format to convert to, and defining the data structure and specification of the new format;
S403, handling missing values, abnormal values and errors in the original data, and cleaning the data so that it meets the requirements of the target format;
S404, establishing a mapping between the original data and the target format, and formulating conversion rules, wherein the conversion rules comprise data type conversion and field renaming;
S405, selecting an ETL tool, or writing a script or program, to perform the conversion according to the conversion rules;
S406, backing up the original data before the data format conversion to guard against data loss;
S407, executing the data format conversion with the selected tool or script, and monitoring the conversion process to ensure that no errors or abnormalities occur;
S408, verifying that the converted data conforms to the specification of the target format, and testing the data, including tests on partial data and on the whole data set;
S409, handling any abnormal conditions that arise during the conversion, and ensuring the integrity and accuracy of the data;
S410, recording the data format conversion steps and rules, and creating a document, wherein the document comprises the data mapping, the conversion rules and the test results.
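A minimal sketch of how the mapping, backup, conversion and verification steps (S404 and S406 to S409) might fit together in a hand-written script is given below. The field names `id`, `dim`, `vals` and their targets, the rule table and the dimension check are hypothetical examples, not part of the claimed method.

```python
import copy

# Hypothetical conversion rules (S404): source field -> (target field, type caster).
RULES = {
    "id": ("vector_id", str),
    "dim": ("dimension", int),
    "vals": ("values", lambda v: [float(x) for x in v]),
}


def convert_record(record: dict, rules: dict) -> dict:
    """Rename fields and convert their types according to the rules."""
    out = {}
    for src, (dst, cast) in rules.items():
        if src not in record:
            raise KeyError(src)
        out[dst] = cast(record[src])
    return out


def verify(row: dict) -> None:
    """S408: assumed target-format check that the vector length matches its dimension."""
    if len(row["values"]) != row["dimension"]:
        raise ValueError("dimension mismatch")


def convert_all(records: list, rules: dict = RULES):
    backup = copy.deepcopy(records)           # S406: keep an untouched copy
    converted, errors = [], []
    for index, record in enumerate(records):  # S407: convert and monitor
        try:
            row = convert_record(record, rules)
            verify(row)
            converted.append(row)
        except (KeyError, TypeError, ValueError) as exc:
            errors.append((index, str(exc)))  # S409: collect failures for follow-up
    return backup, converted, errors
```

The error list returned by `convert_all` could also feed the documentation step S410, since it records which input rows failed and why.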
6. The method for customizing and converting a storage format of a vector database as recited in claim 5, wherein: the step S409 of handling any abnormal conditions that arise during the conversion and ensuring the integrity and accuracy of the data comprises the following steps:
S4091, filling missing values of numerical features with the mean, median or mode, and filling missing values of categorical features with a category value; handling missing values in time-series data by interpolation; when the number of missing values is relatively small, deleting the affected samples or related features;
S4092, cleaning the data according to the data specification, removing duplicate records and repairing erroneous values; unifying inconsistent data into a specific format to ensure data consistency; and correcting inconsistent data according to logic rules or default values.
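The imputation and cleaning operations named in S4091 and S4092 can be illustrated with the Python standard library. This is a sketch under assumptions: missing values are represented as `None`, categorical gaps are filled with the most frequent category (the claim does not say which category to use), and interpolation is linear.

```python
from statistics import mean, median, mode


def fill_numeric(values, strategy="mean"):
    """Fill None entries with the mean, median or mode of the present values."""
    present = [v for v in values if v is not None]
    stat = {"mean": mean, "median": median, "mode": mode}[strategy](present)
    return [stat if v is None else v for v in values]


def fill_categorical(values):
    """Fill None entries with the most frequent category (an assumed choice)."""
    most_common = mode([v for v in values if v is not None])
    return [most_common if v is None else v for v in values]


def interpolate_series(values):
    """Linearly interpolate interior gaps; leading or trailing gaps stay None."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            out[i] = out[left] + (out[right] - out[left]) * (i - left) / gap
    return out


def drop_duplicates(records):
    """Remove repeated records (dicts with hashable values), keeping first occurrences."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Deleting samples with missing values, the third option in S4091, is simply a filter over the records and is omitted here.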
7. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the step S5 of optimizing the conversion process to ensure efficient processing of large-scale data sets comprises the following steps:
S501, dividing a large-scale data set into small blocks and processing the blocks simultaneously by using a parallel processing technology, wherein Apache Spark is used to improve the conversion speed;
S502, adopting batch processing to reduce the number of interactions with the database or file system and improve efficiency;
S503, accelerating data retrieval and update operations through indexing and partitioning to improve query performance;
S504, loading the data into memory for processing so as to avoid frequent disk access;
S505, using a cache to store intermediate results to avoid repeated computation;
S506, dividing the conversion process into a plurality of stages and processing the data stage by stage, so as to better control the flow and reduce the processing pressure of each stage;
S507, using data compression in the transmission and storage stages to reduce the data volume and improve efficiency.
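The ideas of chunking (S501), batching (S502) and compression (S507) can be sketched on a single machine. Note the substitution: the claim names Apache Spark, whereas this sketch uses a standard-library thread pool purely as a stand-in. The record layout (`id`, `vals`) and all function names are hypothetical.

```python
import json
import zlib
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """S501: split the data set into blocks of at most `size` records."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_batch(batch):
    """S502 + S507: convert one whole batch, then compress it for storage."""
    converted = [
        {"vector_id": str(r["id"]), "values": [float(x) for x in r["vals"]]}
        for r in batch
    ]
    return zlib.compress(json.dumps(converted).encode("utf-8"))


def convert_dataset(records, batch_size=1000, workers=4):
    """Process the batches concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_batch, chunked(records, batch_size)))


def load_batch(blob):
    """Decompress and decode one stored batch."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Threads are used here only to keep the example self-contained; for CPU-bound conversion of genuinely large data sets, a process pool or a distributed engine such as the Spark named in S501 would likely be the more appropriate choice.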
8. A system for customizing and converting a storage format of a vector database, implementing the method of any one of claims 1-7, characterized in that the system comprises:
a data operation module, used for allowing a user to define and customize the storage format of the vector database, providing the capability of declaratively describing the structure of vectors and their features, and defining indexes, compression methods and other storage-related parameters;
a metadata management module, used for establishing a metadata management system, recording and maintaining the metadata of the database storage format, including table structures, indexes and data partition information, and allowing a user to query and modify the metadata so as to adapt to different service requirements;
a format conversion module, used for implementing format adapters and converters, handling conversion between different data formats, and providing custom conversion logic so that a user can define conversion rules as needed;
a modular architecture module, used for enabling a user to easily extend system functions or add new data storage formats, providing a plug-in architecture and allowing third-party developers to develop custom storage format modules for the system;
a test optimization module, used for optimizing the performance of the conversion engine, achieving efficient data conversion and loading, and taking the scalability of the system into account so that large-scale data can be processed while good performance is maintained;
an abnormality detection module, used for detecting data that is unrecognizable or abnormal during the data conversion process.
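One common way to realize the format adapters and the plug-in architecture described above is a registry of writer functions keyed by format name. The sketch below is an assumption about how such a registry could look; the names `register_format`, `convert` and the two example formats are hypothetical and are not taken from the patent.

```python
import json
from typing import Callable, Dict, List

# Registry of format writers; third-party modules would add entries here.
_WRITERS: Dict[str, Callable[[List[list]], bytes]] = {}


def register_format(name: str):
    """Decorator that plugs a writer function into the registry under `name`."""
    def decorator(fn):
        _WRITERS[name] = fn
        return fn
    return decorator


@register_format("json")
def write_json(vectors):
    return json.dumps(vectors).encode("utf-8")


@register_format("csv")
def write_csv(vectors):
    lines = (",".join(str(x) for x in vector) for vector in vectors)
    return "\n".join(lines).encode("utf-8")


def convert(vectors, target: str) -> bytes:
    """Dispatch to the writer registered for the target format."""
    try:
        writer = _WRITERS[target]
    except KeyError:
        raise ValueError(f"no writer registered for {target!r}") from None
    return writer(vectors)
```

With this layout, adding a new storage format only requires defining one function and decorating it, without modifying `convert`.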
9. A system for customizing and converting a storage format of a vector database as recited in claim 8, wherein: the metadata management module includes:
a target identification module, which determines the metadata management requirements and targets of an organization, including improving data discoverability, improving data quality and supporting data analysis;
a metadata acquisition module, used for formulating a metadata management strategy and defining the range, content and standard of metadata, wherein the metadata management strategy comprises a data vocabulary, business rules and a data map;
a metadata repository module, which determines key metadata, which may include data entities, data attributes, business rules and data quality rules, and defines clear standards and formats for each item of metadata;
a metadata quality management module, which ensures that metadata is captured, updated and deleted throughout the data lifecycle;
a metadata integration module, which integrates metadata management into the whole data management life cycle, including data management, data quality management and data security practices.
10. A system for customizing and converting a storage format of a vector database as recited in claim 8, wherein: the abnormality detection module includes:
a data consistency detection module, used for checking the consistency of data types, ranges and missing values, and ensuring, by means of data verification rules and constraints, that data in the new storage format conforms to the expected specification;
a log analysis module, which records operation, error and warning information during the conversion process, and monitors the running state of the system in real time with a log monitoring tool so that problems are discovered promptly;
a data change monitoring module, which monitors the rate of change of the data, detects abnormally rapid increases or decreases, sets a threshold value, and triggers an alarm when the data change exceeds the set range;
a user behavior analysis module, used for monitoring the operation behavior of system users while the storage format is being defined and converted.
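The threshold check performed by the data change monitoring module can be sketched as follows. Measuring "change" as the relative difference in record counts between two observations, and the default threshold of 0.5, are assumptions made for this illustration only.

```python
def change_rate(previous_count: int, current_count: int) -> float:
    """Relative change in record count between two observations."""
    if previous_count == 0:
        # Any growth from an empty data set is treated as unbounded change.
        return float("inf") if current_count else 0.0
    return (current_count - previous_count) / previous_count


def should_alarm(previous_count: int, current_count: int,
                 threshold: float = 0.5) -> bool:
    """Trigger an alarm when the absolute change rate exceeds the threshold."""
    return abs(change_rate(previous_count, current_count)) > threshold
```

Because the absolute value is compared, the same threshold covers both abnormally rapid increases and abnormally rapid decreases.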
CN202311635852.3A 2023-12-01 2023-12-01 Method and system for customizing and converting storage format of vector database Pending CN117667932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311635852.3A CN117667932A (en) 2023-12-01 2023-12-01 Method and system for customizing and converting storage format of vector database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311635852.3A CN117667932A (en) 2023-12-01 2023-12-01 Method and system for customizing and converting storage format of vector database

Publications (1)

Publication Number Publication Date
CN117667932A true CN117667932A (en) 2024-03-08

Family

ID=90069253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311635852.3A Pending CN117667932A (en) 2023-12-01 2023-12-01 Method and system for customizing and converting storage format of vector database

Country Status (1)

Country Link
CN (1) CN117667932A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117971137A (en) * 2024-04-02 2024-05-03 山东海润数聚科技有限公司 Multithreading-based large-scale vector data consistency assessment method and system
CN117971137B (en) * 2024-04-02 2024-06-04 山东海润数聚科技有限公司 Multithreading-based large-scale vector data consistency assessment method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination