CN117667932A - Method and system for customizing and converting storage format of vector database - Google Patents
- Publication number
- CN117667932A (application number CN202311635852.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to the technical field of computer science and discloses a method and a system for customizing and converting the storage format of a vector database. The method comprises the following steps: S1, defining the characteristics of the stored vector data: the vector is defined by means of a document or a data dictionary that describes the meaning and unit of each element, and the metadata comprises the name, unit, and data type of the vector; S2, determining the usage scenario and the expected data access pattern, including the frequency of read and write operations and the size and complexity of the data; S3, selecting a storage format; S4, converting existing data from the original format to a new format; S5, optimizing the conversion process to ensure efficient processing of large-scale data sets. In the invention, batch processing reduces the number of interactions with the database or file system, improving efficiency, while indexing and partitioning accelerate data retrieval and update operations, improving query performance.
Description
Technical Field
The invention relates to the technical field of computer science, in particular to a method and a system for customizing and converting a storage format of a vector database.
Background
A vector database is a database system designed for the efficient storage, retrieval, and processing of vector data. Vector data are ordered collections of numbers, common in machine learning, data mining, and information retrieval; each vector represents the features of an entity such as a document, image, or audio clip. The goal of a vector database is to provide efficient query and analysis operations that support a variety of application scenarios.
With the development of machine learning and deep learning, the demand for storing and processing large-scale vector data is growing steadily. Traditional database systems are designed for structured data rather than high-dimensional vector data, so they cannot process this type of data effectively, and the resulting analysis is inefficient.
Disclosure of Invention
To remedy these shortcomings, the invention provides a method and a system for customizing and converting the storage format of a vector database, aiming to solve the problems that the demand for storing and processing large-scale vector data is growing steadily while traditional database systems cannot process this type of data effectively or analyze it efficiently.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a method for customizing and converting a storage format of a vector database, comprising the steps of:
s1, defining characteristics of stored vector data
Defining a vector by means of a document or a data dictionary, describing the meaning and units of each element, wherein the metadata comprises the name, units and data type information of the vector;
s2, determining a usage scenario and an expected data access mode
The data access patterns are determined, including the frequency of read and write operations, the size and complexity of the data.
S3, selecting a storage format
A storage format is selected according to the structure and characteristics of the data.
S4, converting the existing data from the original format to a new format
Tools such as OpenRefine and Trifacta Wrangler provide an intuitive interface for cleaning and converting data without writing code; for more complex conversion tasks, the data are processed and format-converted in Python or Java, using the Pandas library and the Apache Spark data processing framework;
s5, optimizing the conversion process to ensure efficient processing of large-scale data sets
Large-scale data are processed in parallel on a cluster with the distributed computing framework Apache Spark, making full use of the performance of multi-core processors.
As a further description of the above technical solution:
s101, determining basic information of vectors
Giving a clear name to the vector, reflecting the meaning represented by the name, and determining the number of elements in the vector;
s102, adding metadata for each element
Giving each element a descriptive name indicating its meaning, designating the unit of each element, defining the data type of each element;
s103, creating a data dictionary or a document
Listing each element and its associated metadata using a table, creating a document, describing the metadata of the vector in terms of paragraphs or chapters;
s104, adding information
Defining a reasonable value range for each element to facilitate data verification.
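Steps S101–S104 can be sketched as a small data dictionary in Python. The schema fields, the example element names, and the `validate` helper below are illustrative assumptions, not part of the patent text:

```python
# Illustrative sketch of S101-S104: a data dictionary describing a stored
# vector. All field names and the example vector are assumptions.
vector_schema = {
    "name": "product_embedding",          # S101: clear, meaningful vector name
    "dimension": 3,                       # S101: number of elements
    "elements": [                         # S102: per-element metadata
        {"name": "price_score", "unit": "normalized", "dtype": "float32", "range": (0.0, 1.0)},
        {"name": "click_rate",  "unit": "ratio",      "dtype": "float32", "range": (0.0, 1.0)},
        {"name": "stock_level", "unit": "items",      "dtype": "int32",   "range": (0, 10000)},
    ],
}

def validate(vector, schema):
    """S104: check the element count and per-element value ranges."""
    if len(vector) != schema["dimension"]:
        return False
    return all(lo <= v <= hi
               for v, (lo, hi) in zip(vector, (e["range"] for e in schema["elements"])))
```

In practice the same dictionary can be kept as a table in a design document (S103); the point is that both humans and validation code read one shared definition.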
As a further description of the above technical solution:
s201, determining whether data access is mainly read or write operation, or a combination thereof;
s202, determining whether real-time data access is needed or whether the requirements can be met by batch processing;
s203, analyzing the complexity of the data query, and knowing whether the complex query operation needs to be supported.
As a further description of the above technical solution:
s301, selecting a traditional relational database storage format for data in a table form;
s302, selecting a NoSQL database for semi-structured or unstructured data;
s303, selecting a key-value store when the data model consists of simple key-value pairs;
s304, selecting a column-family database when the data exist as column clusters and high scalability is needed.
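The S301–S304 decision list amounts to a simple dispatch on data characteristics. The boolean flags and return labels in this sketch are assumptions chosen for illustration:

```python
def choose_storage_format(tabular, simple_key_value, columnar):
    """Illustrative mapping of the S301-S304 decision list."""
    if tabular:
        return "relational"        # S301: tabular data -> traditional RDBMS format
    if simple_key_value:
        return "key-value"         # S303: simple key-value data model
    if columnar:
        return "column-family"     # S304: column clusters, high scalability
    return "nosql-document"        # S302: semi-structured or unstructured data
```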
As a further description of the above technical solution:
s401, analyzing the format and structure of the original data, and knowing information, fields and relations contained in the data;
s402, determining a new format to be converted, and defining a data structure and a specification in the new format;
s403, processing missing values, abnormal values and errors in the original data, and cleaning the data to ensure that the data meets the requirements of a target format;
s404, establishing a mapping relation between original data and a target format, and formulating a conversion rule, wherein the conversion rule comprises data type conversion and field renaming;
s405, selecting an ETL tool or writing a script or a program to perform conversion according to a conversion rule;
s406, before data format conversion, ensuring backup of the original data so as to prevent accidents;
s407, executing data format conversion by using the selected tool or script, and monitoring the conversion process to ensure that no error or abnormality occurs;
s408, verifying whether the converted data accords with the specification of the target format, and testing the data, including testing partial data and whole data;
s409, processing abnormal conditions in any conversion, and ensuring the integrity and accuracy of data;
s410, recording data format conversion steps and rules, and creating a document, wherein the document comprises data mapping, conversion rules and test results.
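The core of the S404–S408 workflow — a mapping table of conversion rules, execution, and verification against the target format — can be sketched with the standard library. The field names and rules below are illustrative assumptions; a real ETL tool or Pandas script plays the same role:

```python
import csv
import io

# S404: conversion rules mapping each old field to (new field, type cast).
RULES = {
    "id":    ("vector_id", int),
    "score": ("weight", float),
}

def convert(csv_text):
    """S405/S407: execute the conversion with a simple script."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{new: cast(row[old]) for old, (new, cast) in RULES.items()}
            for row in rows]

def verify(records):
    """S408: check every converted record against the target specification."""
    return all(isinstance(r["vector_id"], int) and isinstance(r["weight"], float)
               for r in records)
```

S406 (backing up the original data before conversion) and S410 (documenting the mapping, rules, and test results) wrap around this core but are omitted from the sketch.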
As a further description of the above technical solution:
s4091, filling missing values of numerical features with statistics such as the mean, median, or mode, and filling missing values of categorical features with the most frequent category; handling missing values in time series data by interpolation; and, when the number of missing values is small, deleting the affected samples or the related features;
s4092, cleaning according to the data specification: removing duplicate records and repairing erroneous values; unifying inconsistent data into a specific format to ensure consistency; and correcting inconsistent data according to logical rules or default values.
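The S4091 filling strategies can be shown with the standard library; in practice Pandas' `fillna()` and `interpolate()` apply the same ideas to whole DataFrames. The example columns used here are assumptions:

```python
from statistics import mean, median, mode

def fill_numeric(values, strategy="mean"):
    """S4091: fill None in a numeric column with the mean, median, or mode."""
    present = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
    return [fill if v is None else v for v in values]

def fill_categorical(values):
    """S4091: fill None in a categorical column with the most frequent category."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def interpolate_single_gaps(series):
    """S4091: fill isolated interior gaps in a time series with the midpoint
    of the neighbouring values (a minimal form of linear interpolation)."""
    out = list(series)
    for i in range(1, len(out) - 1):
        if out[i] is None and out[i - 1] is not None and out[i + 1] is not None:
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out
```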
As a further description of the above technical solution:
s501, dividing a large-scale data set into small blocks by using a parallel processing technology, and processing the small blocks simultaneously, wherein Apache Spark is used for improving the conversion speed;
s502, adopting batch processing to reduce interaction times of a database or a file system and improving efficiency;
s503, through indexing and partitioning, data retrieval and updating operation are accelerated, and query performance is improved;
s504, loading the data into a memory for processing so as to avoid frequent disk access;
s505, using a cache to store intermediate results, avoiding repeated computation;
s506, dividing the conversion process into a plurality of stages, gradually processing data, better controlling the flow, and reducing the processing pressure of each stage;
and S507, using data compression in the transmission and storage stages to reduce the data volume and improve efficiency.
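The chunking-plus-batching idea of S501–S502 can be shown with a simplified single-machine stand-in; in production the same pattern runs distributed on Apache Spark. The class and function names are illustrative, and the interaction counter exists only to make the saving visible:

```python
def chunked(data, size):
    """S501: divide a large dataset into fixed-size blocks."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

class BatchStore:
    """S502: count store interactions to show the saving from batching."""
    def __init__(self):
        self.records = []
        self.interactions = 0

    def write_batch(self, batch):
        self.records.extend(batch)
        self.interactions += 1            # one interaction per batch, not per record

def convert_in_batches(data, transform, batch_size):
    store = BatchStore()
    for block in chunked(data, batch_size):
        store.write_batch([transform(x) for x in block])
    return store
```

Writing 10 records with a batch size of 4 costs three store interactions instead of ten, which is the efficiency gain S502 describes.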
As a further description of the above technical solution:
the system for customizing and converting the storage format of the vector database comprises:
the data operation module is used for allowing a user to define and customize the storage format of the vector database, providing the capability of describing the vector and the structure of the characteristics thereof in a declarative manner, and defining indexes, compression methods and other storage related parameters;
the metadata management module is used for establishing a metadata management system, recording and maintaining metadata information in a database storage format, including table structures, indexes and data partition information, and allowing a user to query and modify metadata so as to adapt to different service requirements;
the format conversion module is used for realizing the format adapter and the converter, processing the conversion between different data formats, providing self-defined conversion logic and enabling a user to define conversion rules according to the needs;
the modularized architecture module is used for enabling a user to easily expand system functions or add new data storage formats, providing a plug-in architecture and allowing a third party developer to develop a custom storage format module for the system;
and the test optimization module is used for optimizing the performance of the conversion engine, realizing high-efficiency data conversion and loading, considering the expansibility of the system, processing large-scale data and keeping good performance.
The abnormality detection module is used for detecting unidentifiable or abnormal data during the data conversion process.
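The format conversion module's adapters and the modular architecture's plug-in hook can be sketched together as a converter registry that third-party code extends. The format names and the sample adapter are illustrative assumptions:

```python
# Registry of (source format, target format) -> converter function.
ADAPTERS = {}

def register_adapter(src, dst):
    """Plug-in hook: register a converter between two storage formats."""
    def wrap(fn):
        ADAPTERS[(src, dst)] = fn
        return fn
    return wrap

def convert_record(record, src, dst):
    fn = ADAPTERS.get((src, dst))
    if fn is None:                        # unknown format pair: report, don't guess
        raise ValueError(f"no adapter registered for {src} -> {dst}")
    return fn(record)

@register_adapter("csv-row", "vector")    # example third-party adapter
def csv_row_to_vector(row):
    return [float(x) for x in row.split(",")]
```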
As a further description of the above technical solution:
the metadata management module includes:
identifying a target module, determining metadata management requirements and targets of an organization, improving data discoverability, improving data quality and supporting data analysis;
the metadata acquisition module is used for formulating a metadata management strategy and defining the range, content and standard of metadata, wherein the metadata management strategy comprises a data vocabulary, a business rule and a data map;
a metadata repository module that determines key metadata, which may include data entities, data attributes, business rules, data quality rules, defining clear standards and formats for each metadata;
metadata quality management module, ensure that metadata is captured, updated and deleted throughout the data lifecycle.
And integrating a metadata management module to integrate metadata management into the whole data management life cycle, including data management, data quality management and data security practices.
As a further description of the above technical solution:
the abnormality detection module includes:
the data consistency detection module is used for checking consistency of data types, ranges and missing values, and ensuring that data in a new storage format accords with expected specifications by using data verification rules and constraints;
the log analysis module records operation, error and warning information in the conversion process, and monitors the running state of the system in real time by using a log monitoring tool to discover problems in time;
the data change monitoring module monitors the change rate of data, detects whether abnormal rapid increase or decrease exists, sets a threshold value, and triggers an alarm when the data change exceeds a set range;
and the user behavior analysis module is used for monitoring the operation behavior of a system user when the storage format is defined and converted.
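The data change monitoring module's threshold-and-alarm behaviour can be sketched as a monitor that tracks the record count between conversion checkpoints. The class name and the default threshold are assumptions for illustration:

```python
class ChangeMonitor:
    """Flag abnormally fast growth or shrinkage of the data between checks."""
    def __init__(self, max_change_ratio=0.5):
        self.max_change_ratio = max_change_ratio   # allowed relative change
        self.last_count = None
        self.alarms = []

    def observe(self, count):
        """Return True (and record an alarm) if the change exceeds the range."""
        abnormal = False
        if self.last_count:                        # skip first sample / zero baseline
            ratio = abs(count - self.last_count) / self.last_count
            if ratio > self.max_change_ratio:
                abnormal = True
                self.alarms.append((self.last_count, count, ratio))
        self.last_count = count
        return abnormal
```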
The invention has the following beneficial effects:
1. In the invention, batch processing reduces the number of interactions with the database or file system, improving efficiency; indexing and partitioning accelerate data retrieval and update operations, improving query performance; data are loaded into memory for processing to avoid frequent disk access; a cache stores intermediate results to avoid repeated computation; the conversion process is divided into multiple stages that process the data step by step, giving better control of the flow and reducing the processing pressure at each stage; and data compression is used in the transmission and storage stages to reduce the data volume and improve efficiency.
2. In the invention, the converted data are verified against the specification of the target format and tested on both partial and whole data; abnormal conditions in the conversion are handled to ensure data integrity and accuracy; and the conversion steps and rules are recorded in a document that includes the data mapping, conversion rules, and test results.
3. In the invention, missing values of numerical features are filled with statistics such as the mean, median, or mode, and missing values of categorical features are filled with the most frequent category; missing values in time series data are handled by interpolation; when the number of missing values is small, the affected samples or related features are deleted; the data are cleaned according to the data specification, removing duplicate records and repairing erroneous values; inconsistent data are unified into a specific format to ensure consistency; and inconsistent data are corrected according to logical rules or default values.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow chart of the method S1 of the present invention;
FIG. 3 is a flow chart of the method S2 of the present invention;
FIG. 4 is a flow chart of the method S3 of the present invention;
FIG. 5 is a flow chart of the method S4 of the present invention;
FIG. 6 is a flow chart of the method of S409 of the present invention;
FIG. 7 is a flow chart of the method S5 of the present invention;
FIG. 8 is a system flow diagram of the present invention for customizing and converting a storage format of a vector database;
FIG. 9 is a flowchart of a metadata management module according to the present invention;
FIG. 10 is a flowchart of an anomaly detection module according to the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Referring to fig. 1-10, one embodiment provided by the present invention is: a method for customizing and converting a storage format of a vector database, comprising the steps of:
s1, defining characteristics of stored vector data
Defining a vector by means of a document or a data dictionary, describing the meaning and units of each element, wherein the metadata comprises the name, units and data type information of the vector;
s2, determining a usage scenario and an expected data access mode
The data access patterns are determined, including the frequency of read and write operations, the size and complexity of the data.
S3, selecting a storage format
A storage format is selected according to the structure and characteristics of the data.
S4, converting the existing data from the original format to a new format
Tools such as OpenRefine and Trifacta Wrangler provide an intuitive interface for cleaning and converting data without writing code; for more complex conversion tasks, the data are processed and format-converted in Python or Java, using the Pandas library and the Apache Spark data processing framework;
s5, optimizing the conversion process to ensure efficient processing of large-scale data sets
Large-scale data are processed in parallel on a cluster with the distributed computing framework Apache Spark, making full use of the performance of multi-core processors.
S101, determining basic information of vectors
Giving a clear name to the vector, reflecting the meaning represented by the name, and determining the number of elements in the vector;
s102, adding metadata for each element
Giving each element a descriptive name indicating its meaning, designating the unit of each element, defining the data type of each element;
s103, creating a data dictionary or a document
Listing each element and its associated metadata using a table, creating a document, describing the metadata of the vector in terms of paragraphs or chapters;
s104, adding information
Defining a reasonable value range for each element to facilitate data verification.
S201, determining whether data access is mainly read or write operation, or a combination thereof;
s202, determining whether real-time data access is needed or whether the requirements can be met by batch processing;
s203, analyzing the complexity of the data query, and knowing whether the complex query operation needs to be supported.
S301, selecting a traditional relational database storage format for data in a table form;
s302, selecting a NoSQL database for semi-structured or unstructured data;
s303, selecting a key-value store when the data model consists of simple key-value pairs;
s304, selecting a column-family database when the data exist as column clusters and high scalability is needed.
S401, analyzing the format and structure of the original data, and knowing information, fields and relations contained in the data;
s402, determining a new format to be converted, and defining a data structure and a specification in the new format;
s403, processing missing values, abnormal values and errors in the original data, and cleaning the data to ensure that the data meets the requirements of a target format;
s404, establishing a mapping relation between original data and a target format, and formulating a conversion rule, wherein the conversion rule comprises data type conversion and field renaming;
s405, selecting an ETL tool or writing a script or a program to perform conversion according to a conversion rule;
s406, before data format conversion, ensuring backup of the original data so as to prevent accidents;
s407, executing data format conversion by using the selected tool or script, and monitoring the conversion process to ensure that no error or abnormality occurs;
s408, verifying whether the converted data accords with the specification of the target format, and testing the data, including testing partial data and whole data;
s409, processing abnormal conditions in any conversion, and ensuring the integrity and accuracy of data;
s410, recording data format conversion steps and rules, and creating a document, wherein the document comprises data mapping, conversion rules and test results.
S4091, filling missing values of numerical features with statistics such as the mean, median, or mode, and filling missing values of categorical features with the most frequent category; handling missing values in time series data by interpolation; and, when the number of missing values is small, deleting the affected samples or the related features;
s4092, cleaning according to the data specification: removing duplicate records and repairing erroneous values; unifying inconsistent data into a specific format to ensure consistency; and correcting inconsistent data according to logical rules or default values.
The step S5 of optimizing the conversion process to ensure efficient processing of large-scale data sets comprises the following steps:
s501, dividing a large-scale data set into small blocks by using a parallel processing technology, and processing the small blocks simultaneously, wherein Apache Spark is used for improving the conversion speed;
s502, adopting batch processing to reduce interaction times of a database or a file system and improving efficiency;
s503, through indexing and partitioning, data retrieval and updating operation are accelerated, and query performance is improved;
s504, loading the data into a memory for processing so as to avoid frequent disk access;
s505, using a cache to store intermediate results, avoiding repeated computation;
s506, dividing the conversion process into a plurality of stages, gradually processing data, better controlling the flow, and reducing the processing pressure of each stage;
and S507, using data compression in the transmission and storage stages to reduce the data volume and improve efficiency.
The system for customizing and converting the storage format of the vector database comprises:
the data operation module is used for allowing a user to define and customize the storage format of the vector database, providing the capability of describing the vector and the structure of the characteristics thereof in a declarative manner, and defining indexes, compression methods and other storage related parameters;
the metadata management module is used for establishing a metadata management system, recording and maintaining metadata information in a database storage format, including table structures, indexes and data partition information, and allowing a user to query and modify metadata so as to adapt to different service requirements;
the format conversion module is used for realizing the format adapter and the converter, processing the conversion between different data formats, providing self-defined conversion logic and enabling a user to define conversion rules according to the needs;
the modularized architecture module is used for enabling a user to easily expand system functions or add new data storage formats, providing a plug-in architecture and allowing a third party developer to develop a custom storage format module for the system;
and the test optimization module is used for optimizing the performance of the conversion engine, realizing high-efficiency data conversion and loading, considering the expansibility of the system, processing large-scale data and keeping good performance.
The abnormality detection module is used for detecting unidentifiable or abnormal data during the data conversion process.
The target identification module determines the metadata management requirements and goals of the organization: improving data discoverability, improving data quality, and supporting data analysis;
the metadata acquisition module formulates the metadata management strategy and defines the scope, content, and standards of the metadata, including the data vocabulary, business rules, and data maps;
the metadata repository module identifies the key metadata, such as data entities, data attributes, business rules, and data quality rules, and defines clear standards and formats for each metadata item;
the metadata quality management module ensures that metadata are captured, updated, and deleted correctly throughout the data life cycle;
and the metadata integration module integrates metadata management into the entire data management life cycle, including data management, data quality management, and data security practices.
The data consistency detection module is used for checking consistency of data types, ranges and missing values, and ensuring that data in a new storage format accords with expected specifications by using data verification rules and constraints;
the log analysis module records operation, error and warning information in the conversion process, and monitors the running state of the system in real time by using a log monitoring tool to discover problems in time;
the data change monitoring module monitors the change rate of data, detects whether abnormal rapid increase or decrease exists, sets a threshold value, and triggers an alarm when the data change exceeds a set range;
and the user behavior analysis module is used for monitoring the operation behavior of a system user when the storage format is defined and converted.
Finally, it should be noted that the foregoing embodiments are intended to illustrate rather than limit the technical solution of the invention. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the described technical solutions may still be modified, or some of their features replaced by equivalents; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention falls within its scope.
Claims (10)
1. A method for customizing and converting a storage format of a vector database, characterized in that the method comprises the following steps:
s1, defining characteristics of stored vector data
Defining a vector by means of a document or a data dictionary, describing the meaning and units of each element, wherein the metadata comprises the name, units and data type information of the vector;
s2, determining a usage scenario and an expected data access mode
Determining a data access pattern, including frequency of read and write operations, size and complexity of data;
s3, selecting a storage format
Selecting a storage format according to the structure and characteristics of the data;
s4, converting the existing data from the original format to a new format
Tools such as OpenRefine and Trifacta Wrangler provide an intuitive interface for cleaning and converting data without writing code; for more complex conversion tasks, the data are processed and format-converted in Python or Java, using the Pandas library and the Apache Spark data processing framework;
s5, optimizing the conversion process to ensure efficient processing of large-scale data sets
Large-scale data are processed in parallel on a cluster with the distributed computing framework Apache Spark, making full use of the performance of multi-core processors.
2. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the defining of the characteristics of the stored vector data in S1 includes the following steps:
s101, determining basic information of vectors
Giving a clear name to the vector, reflecting the meaning represented by the name, and determining the number of elements in the vector;
s102, adding metadata for each element
Giving each element a descriptive name indicating its meaning, designating the unit of each element, defining the data type of each element;
s103, creating a data dictionary or a document
Listing each element and its associated metadata using a table, creating a document, describing the metadata of the vector in terms of paragraphs or chapters;
s104, adding constraint information
Defining a reasonable value range for each element to facilitate data verification.
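Steps S101-S104 amount to building a small data dictionary; the following is a minimal sketch in Python, where the field names and the example element are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ElementMeta:
    name: str    # descriptive name indicating meaning (S102)
    unit: str    # physical unit of the element (S102)
    dtype: type  # data type of the element (S102)
    vmin: float  # lower bound of the reasonable value range (S104)
    vmax: float  # upper bound of the reasonable value range (S104)

    def validate(self, value):
        """Data verification against type and declared range (S104)."""
        return isinstance(value, self.dtype) and self.vmin <= value <= self.vmax

# Hypothetical data-dictionary entry (S103).
speed = ElementMeta("speed", "m/s", float, 0.0, 340.0)
```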
3. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the step S2 of determining a usage scenario and an expected data access mode comprises the following steps:
s201, determining whether data access is mainly read or write operation, or a combination thereof;
s202, determining whether real-time data access is required or whether the requirements can be met through batch processing;
s203, analyzing the complexity of data queries to determine whether complex query operations need to be supported.
4. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the step S3 of selecting a storage format comprises the following steps:
s301, selecting a traditional relational database storage format for data in a table form;
s302, selecting a NoSQL database for semi-structured or unstructured data;
s303, selecting a key-value store for data with a simple key-value data model;
s304, selecting a column-family database when the data is organized in column families and high scalability is required.
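The selection rules of S301-S304 can be expressed as a simple decision function; the category labels and the order in which the rules are checked are illustrative assumptions:

```python
def select_storage_format(tabular, structured, simple_key_value, columnar_scalable):
    """Rule-of-thumb storage-format selector mirroring S301-S304."""
    if tabular:
        return "relational"       # S301: tabular data -> relational database
    if simple_key_value:
        return "key-value"        # S303: simple key-value data model
    if columnar_scalable:
        return "column-family"    # S304: column families + high scalability
    if not structured:
        return "nosql-document"   # S302: semi-structured/unstructured data
    return "relational"           # fallback for remaining structured data

fmt = select_storage_format(tabular=False, structured=False,
                            simple_key_value=False, columnar_scalable=False)
```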
5. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the step S4 of converting the existing data from the original format to the new format comprises the following steps:
s401, analyzing the format and structure of the original data, and knowing information, fields and relations contained in the data;
s402, determining a new format to be converted, and defining a data structure and a specification in the new format;
s403, processing missing values, abnormal values and errors in the original data, and cleaning the data to ensure that the data meets the requirements of a target format;
s404, establishing a mapping relation between original data and a target format, and formulating a conversion rule, wherein the conversion rule comprises data type conversion and field renaming;
s405, selecting an ETL tool or writing a script or a program to perform conversion according to a conversion rule;
s406, before data format conversion, ensuring backup of the original data so as to prevent accidents;
s407, executing data format conversion by using the selected tool or script, and monitoring the conversion process to ensure that no error or abnormality occurs;
s408, verifying whether the converted data accords with the specification of the target format, and testing the data, including testing partial data and whole data;
s409, processing abnormal conditions in any conversion, and ensuring the integrity and accuracy of data;
s410, recording data format conversion steps and rules, and creating a document, wherein the document comprises data mapping, conversion rules and test results.
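Steps S404-S408 (mapping, conversion rules, execution and verification) can be sketched with a rule table; the field names, casts and the verification check below are hypothetical:

```python
# Hypothetical mapping table (S404): old field -> (new field, type cast).
MAPPING = {
    "ID":   ("record_id", int),   # field renaming + data type conversion
    "Name": ("name", str),
}

def apply_rules(record):
    """Execute the conversion rules on one record (S405/S407)."""
    return {new: cast(record[old]) for old, (new, cast) in MAPPING.items()}

def verify(converted):
    """Check the result against the target-format specification (S408)."""
    expected = {new for new, _ in MAPPING.values()}
    return set(converted) == expected and isinstance(converted["record_id"], int)

row = apply_rules({"ID": "42", "Name": "alpha"})
```

In practice this role is often filled by an ETL tool (S405), with the original data backed up beforehand (S406).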
6. The method for customizing and converting a storage format of a vector database as recited in claim 5, wherein: the step S409 processes any abnormal situations in the conversion, and ensures the integrity and accuracy of the data, and includes the following steps:
s4091, filling missing values of numerical features using statistics such as the mean, median or mode, and filling missing values of categorical features using the most frequent category; processing missing values in time-series data by interpolation; when the number of missing values is relatively small, deleting the affected samples or related features;
s4092, cleaning according to the data specification, removing duplicate records and repairing erroneous values; unifying inconsistent data into a specific format to ensure data consistency; and correcting inconsistent data according to logic rules or default values.
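Step S4091 can be illustrated with the Python standard library; the feature values below are made up for the example:

```python
from statistics import mean, mode

def fill_numeric(values, strategy=mean):
    """Fill None entries of a numerical feature with a statistic (S4091)."""
    fill = strategy([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

def fill_categorical(values):
    """Fill None entries of a categorical feature with the most frequent class."""
    fill = mode([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

temps = fill_numeric([20.0, None, 22.0, None])   # mean of observed values = 21.0
kinds = fill_categorical(["a", "b", None, "b"])  # mode of observed values = "b"
```

The `strategy` argument accepts `statistics.median` or `statistics.mode` as well; interpolation for time series would use the neighbouring values instead of a global statistic.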
7. A method of customizing and converting a storage format of a vector database as claimed in claim 1, wherein: the S5 optimizes the conversion process to ensure efficient processing of large-scale data sets, comprising the steps of:
s501, dividing a large-scale data set into small blocks by using a parallel processing technology, and processing the small blocks simultaneously, wherein Apache Spark is used for improving the conversion speed;
s502, adopting batch processing to reduce interaction times of a database or a file system and improving efficiency;
s503, through indexing and partitioning, data retrieval and updating operation are accelerated, and query performance is improved;
s504, loading the data into a memory for processing so as to avoid frequent disk access;
s505, using a cache to store intermediate results, avoiding repeated computation;
s506, dividing the conversion process into a plurality of stages, gradually processing data, better controlling the flow, and reducing the processing pressure of each stage;
s507, using data compression in the transmission and storage stages to reduce data volume and improve efficiency.
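Steps S501 (partitioning into blocks) and S507 (compressing before transmission and storage) might look like this minimal sketch; the chunk size and payload are illustrative:

```python
import gzip
import json

def chunks(items, size):
    """Partition a large dataset into fixed-size blocks (S501)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def compress_block(block):
    """Serialize and gzip one block before transmission/storage (S507)."""
    return gzip.compress(json.dumps(block).encode("utf-8"))

data = list(range(10))
blocks = [compress_block(b) for b in chunks(data, 4)]   # 3 blocks: 4 + 4 + 2 items
restored = [json.loads(gzip.decompress(b)) for b in blocks]
```

In a Spark pipeline the chunking is handled by partitioning the RDD or DataFrame, and each partition is processed on a separate executor.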
8. A system for customizing and converting a storage format of a vector database, implementing the method of any one of claims 1-7, characterized in that the system comprises:
the data operation module is used for allowing a user to define and customize the storage format of the vector database, providing the capability of describing the vector and the structure of the characteristics thereof in a declarative manner, and defining indexes, compression methods and other storage related parameters;
the metadata management module is used for establishing a metadata management system, recording and maintaining metadata information in a database storage format, including table structures, indexes and data partition information, and allowing a user to query and modify metadata so as to adapt to different service requirements;
the format conversion module is used for realizing the format adapter and the converter, processing the conversion between different data formats, providing self-defined conversion logic and enabling a user to define conversion rules according to the needs;
the modularized architecture module is used for enabling a user to easily expand system functions or add new data storage formats, providing a plug-in architecture and allowing a third party developer to develop a custom storage format module for the system;
the test optimization module is used for optimizing the performance of the conversion engine, realizing high-efficiency data conversion and loading, considering the expansibility of the system, processing large-scale data and keeping good performance;
the abnormality detection module is used for detecting data information which cannot be identified or abnormal in the data conversion process.
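The format conversion module and the plug-in architecture described above can be sketched as a converter registry; the registration API and the toy CSV-to-JSON adapter are illustrative assumptions, not the claimed implementation:

```python
import json

# Registry of format adapters: third-party modules add converters by
# decorating a function, giving the plug-in extensibility of claim 8.
CONVERTERS = {}

def register(src, dst):
    """Register a custom conversion function for the (src, dst) format pair."""
    def wrap(fn):
        CONVERTERS[(src, dst)] = fn
        return fn
    return wrap

@register("csv", "json")
def csv_to_json(text):
    """A toy adapter: naive CSV (no quoting) to a JSON array of objects."""
    header, *rows = [line.split(",") for line in text.splitlines()]
    return json.dumps([dict(zip(header, row)) for row in rows])

def convert(src, dst, payload):
    """Dispatch to the registered adapter for the requested conversion."""
    return CONVERTERS[(src, dst)](payload)

result = json.loads(convert("csv", "json", "a,b\n1,2"))
```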
9. A system for customizing and converting a storage format of a vector database as recited in claim 8, wherein: the metadata management module includes:
the target identification module, which determines the metadata management requirements and goals of the organization, such as improving data discoverability, improving data quality and supporting data analysis;
the metadata acquisition module, which formulates a metadata management strategy and defines the scope, content and standards of the metadata, including a data vocabulary, business rules and a data map;
the metadata repository module, which determines key metadata, which may include data entities, data attributes, business rules and data quality rules, defining clear standards and formats for each piece of metadata;
the metadata quality management module, which ensures that metadata is captured, updated and deleted throughout the data lifecycle;
and the metadata integration module, which integrates metadata management into the whole data management life cycle, including data management, data quality management and data security practices.
10. A system for customizing and converting a storage format of a vector database as recited in claim 8, wherein: the abnormality detection module includes:
the data consistency detection module is used for checking consistency of data types, ranges and missing values, and ensuring that data in a new storage format accords with expected specifications by using data verification rules and constraints;
the log analysis module records operation, error and warning information in the conversion process, and monitors the running state of the system in real time by using a log monitoring tool to discover problems in time;
the data change monitoring module monitors the change rate of data, detects whether abnormal rapid increase or decrease exists, sets a threshold value, and triggers an alarm when the data change exceeds a set range;
and the user behavior analysis module is used for monitoring the operation behavior of a system user when the storage format is defined and converted.
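The data change monitoring module's threshold alarm can be sketched as follows; the threshold value and the record counts are illustrative:

```python
def change_alarms(counts, threshold=0.5):
    """Return indices where the relative change rate exceeds the threshold,
    i.e. where the data grows or shrinks abnormally fast (claim 10)."""
    alarms = []
    for i in range(1, len(counts)):
        prev, cur = counts[i - 1], counts[i]
        rate = abs(cur - prev) / prev if prev else float("inf")
        if rate > threshold:
            alarms.append(i)
    return alarms

# Steady growth, then an abnormally rapid increase at index 3 triggers an alarm.
alerts = change_alarms([100, 110, 120, 300])
```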
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311635852.3A CN117667932A (en) | 2023-12-01 | 2023-12-01 | Method and system for customizing and converting storage format of vector database |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117667932A true CN117667932A (en) | 2024-03-08 |
Family
ID=90069253
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311635852.3A Pending CN117667932A (en) | 2023-12-01 | 2023-12-01 | Method and system for customizing and converting storage format of vector database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117667932A (en) |
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117971137A * | 2024-04-02 | 2024-05-03 | 山东海润数聚科技有限公司 | Multithreading-based large-scale vector data consistency assessment method and system
CN117971137B * | 2024-04-02 | 2024-06-04 | 山东海润数聚科技有限公司 | Multithreading-based large-scale vector data consistency assessment method and system
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||