CN117032566A - Data self-classifying heterogeneous distributed storage method and system - Google Patents

Info

Publication number
CN117032566A
Authority
CN
China
Prior art keywords
data
storage
pool
classification model
heterogeneous
Prior art date
Legal status
Pending
Application number
CN202310921074.8A
Other languages
Chinese (zh)
Inventor
王韵清
王腾飞
李超
曹磊
王迎彬
Current Assignee
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202310921074.8A
Publication of CN117032566A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0647 Migration mechanisms

Abstract

The invention relates to the technical field of distributed storage, and in particular to a data self-classifying heterogeneous distributed storage method and system. The method comprises the following steps: building a heterogeneous storage cluster; pre-training a classification model; storing data; and self-correcting the classification model. The beneficial effects are as follows: for heterogeneous distributed storage clusters, the method and system take into account the differences among data access scenarios, server nodes, and storage devices, apply different redundancy techniques to different data, and ultimately place the data on different nodes or devices. Compared with a general hash algorithm, the system can provide faster access to hot data and lower storage cost for cold data. Compared with a general caching algorithm, the system predicts the data access scenario before the data is stored, which can reduce the performance jitter caused by migrating hot and cold data.

Description

Data self-classifying heterogeneous distributed storage method and system
Technical Field
The invention relates to the technical field of distributed storage, and in particular to a data self-classifying heterogeneous distributed storage method and system.
Background
In distributed storage clusters, cost constraints mean that high-performance devices cannot be used throughout, which gives rise to heterogeneous storage clusters.
In the prior art, a single storage cluster may contain both high-performance servers and lower-specification servers, both high-performance solid-state drives and relatively slow mechanical hard disks, both replica-based and erasure-code-based redundancy policies, and both higher-bandwidth and relatively low-bandwidth switches. Under these different conditions and configurations, data read/write performance differs greatly, and cost and performance trade off against each other: lower cost comes with lower performance. In most cases, a tiered caching approach can be used, in which high-performance devices cache hot data and low-cost devices store cold data, providing both fast access to hot data and low-cost storage of cold data.
However, this approach must continuously identify and migrate hot and cold data while the cluster is in use, which significantly affects performance stability.
Disclosure of Invention
The invention aims to provide a data self-classifying heterogeneous distributed storage method and system that solve the problems described in the Background.
To achieve the above purpose, the present invention provides the following technical solution: a data self-classifying heterogeneous distributed storage method, comprising the following steps:
building a heterogeneous storage cluster;
pre-training a classification model;
storing data;
self-correcting the classification model.
Preferably, building the heterogeneous storage cluster specifically includes:
dividing hardware resources according to server performance, hard disk performance, network card performance, and switch performance; constructing a storage cluster and configuring different performance optimization directions for storage devices of different performance, each tuned for throughput, latency, or IOPS; dividing heterogeneous storage pools of different performance according to the hardware and the configured optimization direction, and, under the same or different hardware resource conditions, optionally further dividing storage pool grades by redundancy policy, where the redundancy policies include, but are not limited to, replica mode and erasure code mode; further dividing storage pool grades by optimization direction, into a throughput-optimized pool, a latency-optimized pool, an IOPS-optimized pool, and a balanced pool, or into a high-performance storage pool, a balanced storage pool, and a cold storage pool; defining data heat by setting a formula that computes an access-degree normalization value from the read access frequency and period, and computing the access-degree normalization value corresponding to each storage pool by combining the pool's hardware configuration, optimization direction, and replica policy; and finally obtaining the grading of the storage pools and each pool's performance, optimization tendency, and capacity information.
Preferably, pre-training the classification model specifically includes:
selecting data features according to the data the current heterogeneous storage cluster is expected to store, where the data features are drawn from the file's metadata, including creation time, modification time, file size, file type, and access permissions, and from the file's content, namely partial data of the file, data at a fixed offset in the file, the file's MD5 value, or the entire file; whether file content is used as a feature determines whether subsequently stored data can be streamed directly into the desired storage pool; selecting a classification model, such as a support vector machine or a random forest, according to the data features and the storage pool classification strategy; inputting the data features into the classification model for pre-training to obtain a pre-trained classification model; and setting, according to the grading of the storage pools, the thresholds at which storage pool capacity and pressure correct the classification result.
Preferably, storing data specifically includes:
choosing between two policies, an automatic classification storage policy and a custom storage policy. The automatic classification storage policy comprises the following: a user uploads data to the heterogeneous storage cluster; the cluster extracts the data features required by the data classification model and inputs them into the classification model obtained in the pre-training step to compute the desired storage pool; if the data features selected in the pre-training step are metadata only, the data is streamed into the desired storage pool; if the selected data features include data content, the features are computed after the file has been completely received and are then input into the classification model to select the desired storage pool, and a high-performance pool is used as a cache in the meantime to prevent data loss; after the classification model yields the desired storage pool, the choice is corrected according to storage pool usage and pressure: if the usage of the current pool has reached its threshold, the request is degraded to the next-level pool or an error is returned directly, recursively, until a storage pool is selected or an error is returned, where the pressure is chosen from the read/write pressure of the back-end storage devices, the CPU pressure of the back-end servers, and the network pressure of the back-end heterogeneous storage cluster; finally, the data slices are stored in the desired storage pool or migrated from the high-performance cache pool to the desired storage pool. The custom storage policy stores the data in the storage pool designated by user input; if the usage of that pool has reached its threshold, either the store fails or the data is degraded to the next-level pool.
Preferably, self-correcting the classification model specifically includes:
for data stored under the automatic classification storage policy, collecting the access frequency and period whenever the data is read or written, and computing and storing an access-degree normalization value; determining whether the access-degree normalization value is inconsistent with the level of the storage pool that holds the data, and, if it is inconsistent and falls outside the threshold of the current pool, saving the data's features, its access-degree normalization value, its current storage pool, and its expected storage pool as a data classification error record and inserting the record into a data classification error record table; once the number of records in the table exceeds a set retraining threshold, re-checking each record against the current state to see whether it is still inconsistent, deleting any record that is no longer inconsistent, and then repeating the retraining threshold check; if all records in the table are still inconsistent, inputting the data features and access-degree normalization values stored in the table into the classification model for offline training, and replacing the classification strategy used by the current cluster once offline training is complete; after the replacement, polling the current cluster pressure, and, once the pressure of the heterogeneous cluster falls below a migration threshold, migrating data to the corresponding expected storage pool according to the corrected model and the data classification error record table, and deleting the corresponding data classification error records after migration completes.
A data self-classifying heterogeneous distributed storage system, the system comprising a data classification model and a heterogeneous storage cluster;
the data classification model is used for classifying data;
the heterogeneous storage cluster is composed of storage nodes with different hardware, software, and configurations.
Preferably, the input of the data classification model is information about the data the user wants to store: first, the file's metadata, including but not limited to creation time, modification time, file size, file type, and access permissions; second, the file's data, including but not limited to the entire file content and feature codes; and third, the current pressure of the heterogeneous cluster, including but not limited to network pressure, disk pressure, and CPU pressure.
Preferably, the output of the data classification model is the desired storage pool for the data.
Preferably, the heterogeneous storage cluster includes, but is not limited to, clusters composed of different hard disks, clusters composed of different servers, and clusters composed of different redundancy policies.
Preferably, an administrator divides the heterogeneous storage system into storage pools of different performance, including but not limited to a high-performance storage pool, a balanced storage pool, and a cold storage pool; the administrator trains the data classification model in advance by selecting data features suited to the data the heterogeneous distributed storage cluster is planned to store and inputting them into the data classification model for training, thereby obtaining the trained data classification model; a user stores data into the heterogeneous storage cluster, and the desired storage pool is obtained through the data classification model; the data is stored in the corresponding storage pool; and during operation, the data classification model is corrected according to data heat and storage pool usage and pressure.
Compared with the prior art, the invention has the beneficial effects that:
For heterogeneous distributed storage clusters, the data self-classifying heterogeneous distributed storage method and system provided by the invention take into account the differences among data access scenarios, server nodes, and storage devices, apply different redundancy techniques to different data, and ultimately place the data on different nodes or devices. Compared with a general hash algorithm, the system can provide faster access to hot data and lower storage cost for cold data. Compared with a general caching algorithm, the system predicts the data access scenario before the data is stored, which can reduce the performance jitter caused by migrating hot and cold data.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of the system of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the embodiments of the present invention will be further described in detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are some, but not all, embodiments of the present invention, are intended to be illustrative only and not limiting of the embodiments of the present invention, and that all other embodiments obtained by persons of ordinary skill in the art without making any inventive effort are within the scope of the present invention.
Example 1
Referring to FIG. 1, the present invention provides a technical solution: a data self-classifying heterogeneous distributed storage method comprising a heterogeneous storage cluster building step, a classification model pre-training step, a data storage step, and a classification model self-correction step.
The heterogeneous storage cluster building step specifically comprises the following. Hardware resources are divided according to server performance, hard disk performance, network card performance, and switch performance. A storage cluster is constructed, and different performance optimization directions are configured for storage devices of different performance; a device may be tuned for throughput, latency, or IOPS. Heterogeneous storage pools of different performance are divided according to the hardware and the configured optimization direction. Under the same or different hardware resource conditions, storage pool grades may optionally be further divided by redundancy policy, including but not limited to replica mode and erasure code mode. Storage pool grades are further divided by optimization direction, for example into a throughput-optimized pool, a latency-optimized pool, an IOPS-optimized pool, and a balanced pool, or into a high-performance storage pool, a balanced storage pool, and a cold storage pool. Data heat is defined by setting a formula that computes an access-degree normalization value from quantities such as read access frequency and period, and the access-degree normalization value corresponding to each storage pool is computed by combining the pool's hardware configuration, optimization direction, replica policy, and so on. Finally, the grading of the storage pools and each pool's performance, optimization tendency, and capacity information are obtained.
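The pool grading and the access-degree normalization value described above can be sketched as follows. This is only an illustration: the patent does not disclose the formula, so the linear, saturating mapping in `access_degree`, the `saturation_rate` parameter, the `StoragePool` fields, and the per-pool threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePool:
    """One graded pool; all field names are hypothetical."""
    name: str
    media: str                # e.g. "ssd" or "hdd"
    redundancy: str           # "replica" or "erasure_code"
    optimized_for: str        # "latency", "iops", "throughput" or "capacity"
    min_access_degree: float  # lowest access degree this pool is meant to serve


# Pools ordered from highest to lowest grade (illustrative values only).
POOLS = [
    StoragePool("high_performance", "ssd", "replica", "latency", 0.6),
    StoragePool("balanced", "ssd", "erasure_code", "throughput", 0.2),
    StoragePool("cold", "hdd", "erasure_code", "capacity", 0.0),
]


def access_degree(reads: int, window_seconds: float,
                  saturation_rate: float = 1.0) -> float:
    """Map a read count over a time window to a value in [0, 1].

    Assumed formula: reads per second divided by a saturation rate,
    clipped at 1.0.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    rate = reads / window_seconds
    return min(rate / saturation_rate, 1.0)


def pool_for_access_degree(value: float) -> StoragePool:
    """Return the highest-grade pool whose threshold the value reaches."""
    for pool in POOLS:
        if value >= pool.min_access_degree:
            return pool
    return POOLS[-1]
```

Under these assumed thresholds, an object read 30 times in a 60-second window gets an access degree of 0.5 and maps to the balanced pool.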
The classification model pre-training step specifically comprises the following. Data features are selected according to the data the current heterogeneous storage cluster is expected to store. The features may be drawn from the file's metadata, including creation time, modification time, file size, file type, access permissions, and so on, and from the file's content, such as partial data of the file, data at a fixed offset in the file, the file's MD5 value, or the entire file. Whether file content is used as a feature determines whether subsequently stored data can be streamed directly into the desired storage pool. A classification model is selected according to the data features and the storage pool classification strategy; algorithms such as a support vector machine or a random forest may be chosen. The data features are input into the classification model for pre-training to obtain a pre-trained classification model. Finally, according to the grading of the storage pools, the thresholds at which storage pool capacity and pressure correct the classification result are set.
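A minimal sketch of the feature selection described above, using only the Python standard library. The feature names, the `FILE_TYPE_CODES` mapping, and the function names are hypothetical; the patent does not specify an encoding. The sketch also shows why the choice of features matters for streaming: metadata is available before the upload completes, while content features such as an MD5 value require the whole file.

```python
import hashlib
import os
import stat

# Hypothetical feature names, split by whether they need the file content.
METADATA_FEATURES = {"ctime", "mtime", "size", "type", "mode"}
CONTENT_FEATURES = {"partial_data", "fixed_offset_data", "md5", "full_data"}

# Hypothetical mapping from file extension to a numeric file-type code.
FILE_TYPE_CODES = {".txt": 1, ".log": 2, ".jpg": 3, ".mp4": 4}


def metadata_features(path: str) -> list[float]:
    """Build a numeric feature vector from file metadata only."""
    st = os.stat(path)
    ext = os.path.splitext(path)[1].lower()
    return [
        st.st_ctime,  # stand-in for creation time; on Unix this is the inode change time
        st.st_mtime,  # modification time
        float(st.st_size),
        float(FILE_TYPE_CODES.get(ext, 0)),  # 0 means unknown type
        float(stat.S_IMODE(st.st_mode)),     # permission bits
    ]


def md5_feature(path: str, chunk_size: int = 1 << 20) -> str:
    """Content feature: hex MD5 of the whole file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def can_stream(selected_features: set[str]) -> bool:
    """True when no selected feature depends on the file content."""
    return not (selected_features & CONTENT_FEATURES)
```

Vectors like these could then be passed to whichever classifier is chosen, for example a random forest or support vector machine implementation from a machine learning library.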
The data storage step specifically comprises choosing between two policies: an automatic classification storage policy and a custom storage policy. The automatic classification storage policy comprises the following. A user uploads data to the heterogeneous storage cluster; the cluster extracts the data features required by the data classification model and inputs them into the classification model obtained in the pre-training step to compute the desired storage pool. If the data features selected in the pre-training step are metadata only, the data can be streamed into the desired storage pool. If the selected data features include data content, the features are computed after the file has been completely received and are then input into the classification model to select the desired storage pool; a high-performance pool may be used as a cache in the meantime to prevent data loss. After the classification model yields the desired storage pool, the choice is corrected according to storage pool usage and pressure: if the usage of the current pool has reached its threshold, the request is degraded to the next-level pool or an error is returned directly, recursively, until a storage pool is selected or an error is returned. The pressure may be chosen from the read/write pressure of the back-end storage devices, the CPU pressure of the back-end servers, and the network pressure of the back-end heterogeneous storage cluster. Finally, the data slices are stored in the desired storage pool or migrated from the high-performance cache pool to the desired storage pool.
The custom storage policy stores the data in the storage pool designated by user input; if the usage of that pool has reached its threshold, either the store fails or the data is degraded to the next-level pool.
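The usage and pressure correction shared by both policies can be sketched as a recursive fallback over pools ordered from highest to lowest grade. The `PoolState` fields, the threshold defaults, and the single combined `pressure` number are assumptions for illustration; the patent leaves the concrete thresholds and the way pressures are combined open.

```python
from dataclasses import dataclass


class PoolUnavailableError(Exception):
    """Raised when no pool at or below the desired grade can accept data."""


@dataclass
class PoolState:
    name: str
    used_fraction: float  # capacity in use, 0.0 to 1.0
    pressure: float       # assumed combined disk/CPU/network load, 0.0 to 1.0


def choose_pool(pools: list[PoolState], desired_index: int,
                usage_threshold: float = 0.9,
                pressure_threshold: float = 0.8,
                allow_degrade: bool = True) -> PoolState:
    """Return the desired pool, or degrade recursively to the next grade.

    `pools` is ordered from highest to lowest grade. With
    allow_degrade=False the call fails instead of degrading, which
    models the "return an error directly" branch.
    """
    pool = pools[desired_index]
    if pool.used_fraction < usage_threshold and pool.pressure < pressure_threshold:
        return pool
    if not allow_degrade or desired_index + 1 >= len(pools):
        raise PoolUnavailableError(f"no usable pool from {pool.name} downwards")
    return choose_pool(pools, desired_index + 1, usage_threshold,
                       pressure_threshold, allow_degrade)
```

For the custom storage policy, `desired_index` would simply be the index of the pool the user designated, with `allow_degrade` selecting between failing and falling back.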
The classification model self-correction step comprises the following. For data stored under the automatic classification storage policy, the access frequency and period are collected whenever the data is read or written, and an access-degree normalization value is computed and stored. It is then determined whether the access-degree normalization value is inconsistent with the level of the storage pool that holds the data; if it is inconsistent and falls outside the threshold of the current pool, the data's features, its access-degree normalization value, its current storage pool, and its expected storage pool are saved as a data classification error record and inserted into a data classification error record table. Once the number of records in the table exceeds the set retraining threshold, each record is re-checked against the current state to see whether it is still inconsistent; any record that is no longer inconsistent is deleted, and the retraining threshold check is repeated. If all records in the table are still inconsistent, the data features and access-degree normalization values stored in the table are input into the classification model for offline training, and the classification strategy used by the current cluster is replaced once offline training is complete. After the replacement, the current cluster pressure is polled; once the pressure of the heterogeneous cluster falls below a migration threshold, data is migrated to the corresponding expected storage pool according to the corrected model and the data classification error record table, and the corresponding data classification error records are deleted after migration completes.
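The error record table and the retraining trigger can be sketched as below. The class and method names are hypothetical, the table is held in memory for simplicity, and the caller supplies the re-check as a callback; the actual offline training and migration are outside the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class MisclassificationRecord:
    object_id: str
    features: list
    access_degree: float
    current_pool: str
    expected_pool: str


@dataclass
class ErrorTable:
    """In-memory stand-in for the data classification error record table."""
    retrain_threshold: int
    records: dict = field(default_factory=dict)

    def observe(self, object_id: str, features: list, access_degree: float,
                current_pool: str, expected_pool: str) -> None:
        """Record a mismatch, or clear the record once the pools agree."""
        if current_pool == expected_pool:
            self.records.pop(object_id, None)
            return
        self.records[object_id] = MisclassificationRecord(
            object_id, features, access_degree, current_pool, expected_pool)

    def training_batch(
        self, still_mismatched: Callable[[MisclassificationRecord], bool]
    ) -> Optional[list]:
        """Return records for offline retraining, or None if not yet due.

        Records that no longer mismatch are dropped first, then the
        threshold is checked again, mirroring the re-check in the text.
        """
        if len(self.records) <= self.retrain_threshold:
            return None
        stale = [k for k, r in self.records.items() if not still_mismatched(r)]
        for key in stale:
            del self.records[key]
        if len(self.records) <= self.retrain_threshold:
            return None
        return list(self.records.values())


def ready_to_migrate(cluster_pressure: float, migration_threshold: float) -> bool:
    """Migration waits until cluster pressure drops below the threshold."""
    return cluster_pressure < migration_threshold
```

A polling loop would call `ready_to_migrate` after the model has been replaced, then move each recorded object to its expected pool and delete its record.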
Example 2
Referring to FIG. 2, on the basis of Example 1, a data self-classifying heterogeneous distributed storage system is provided; the system includes a data classification model and a heterogeneous storage cluster;
the data classification model is used for classifying data; its input is information about the data the user wants to store: first, the file's metadata, including but not limited to creation time, modification time, file size, file type, and access permissions; second, the file's data, including but not limited to the entire file content and feature codes; and third, the current pressure of the heterogeneous cluster, including but not limited to network pressure, disk pressure, and CPU pressure; the output of the data classification model is the desired storage pool for the data;
the heterogeneous storage cluster is composed of storage nodes with different hardware, software, and configurations, and includes, but is not limited to, clusters composed of different hard disks, clusters composed of different servers, and clusters composed of different redundancy policies;
an administrator divides the heterogeneous storage system into storage pools of different performance, including but not limited to a high-performance storage pool, a balanced storage pool, and a cold storage pool; the administrator trains the data classification model in advance by selecting data features suited to the data the heterogeneous distributed storage cluster is planned to store and inputting them into the data classification model for training, thereby obtaining the trained data classification model; a user stores data into the heterogeneous storage cluster, and the desired storage pool is obtained through the data classification model; the data is stored in the corresponding storage pool; and during operation, the data classification model is corrected according to data heat and storage pool usage and pressure.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A data self-classifying heterogeneous distributed storage method, characterized in that the method comprises the following steps:
building a heterogeneous storage cluster;
pre-training a classification model;
storing data;
self-correcting the classification model.
2. The data self-classifying heterogeneous distributed storage method according to claim 1, characterized in that building the heterogeneous storage cluster specifically comprises:
dividing hardware resources according to server performance, hard disk performance, network card performance, and switch performance; constructing a storage cluster and configuring different performance optimization directions for storage devices of different performance, each tuned for throughput, latency, or IOPS; dividing heterogeneous storage pools of different performance according to the hardware and the configured optimization direction, and, under the same or different hardware resource conditions, optionally further dividing storage pool grades by redundancy policy, where the redundancy policies include, but are not limited to, replica mode and erasure code mode; further dividing storage pool grades by optimization direction, into a throughput-optimized pool, a latency-optimized pool, an IOPS-optimized pool, and a balanced pool, or into a high-performance storage pool, a balanced storage pool, and a cold storage pool; defining data heat by setting a formula that computes an access-degree normalization value from the read access frequency and period, and computing the access-degree normalization value corresponding to each storage pool by combining the pool's hardware configuration, optimization direction, and replica policy; and finally obtaining the grading of the storage pools and each pool's performance, optimization tendency, and capacity information.
3. The data self-classifying heterogeneous distributed storage method according to claim 1, wherein pre-training the classification model specifically comprises:
selecting data features according to the data that the current heterogeneous storage cluster is expected to store, wherein the data features may be selected from the metadata of a file, including creation time, modification time, file size, file type and access rights, and from the content of the file, including partial data of the file, data at a fixed offset in the file, the MD5 value of the file, or all data of the file; whether file content is selected as a feature determines whether subsequently stored data can be written directly into the desired storage pool in a streaming manner; selecting a classification model according to the data features and the storage pool classification strategy, such as a support vector machine or a random forest algorithm; inputting the data features into the classification model for pre-training to obtain a pre-trained classification model; and setting, according to the grading of the storage pools, correction thresholds by which storage pool capacity and pressure adjust the classification strategy.
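As a minimal sketch of the metadata-only feature selection in claim 3, the Python below turns file metadata into a numeric vector that a support vector machine or random forest could consume. The function name `metadata_features` and the `FILE_TYPE_CODES` mapping are hypothetical; the classifier itself is not shown, and the feature encoding is an assumption rather than anything the claim specifies.

```python
import os
import stat
import tempfile
from pathlib import Path

# Hypothetical mapping from file extension to a numeric file-type code.
FILE_TYPE_CODES = {".log": 1, ".jpg": 2, ".mp4": 3, ".db": 4}


def metadata_features(path: str) -> list:
    """Build a numeric feature vector from file metadata only.

    No file content is read, so the vector is available before the
    upload body has been fully received.
    """
    info = os.stat(path)
    suffix = Path(path).suffix.lower()
    return [
        float(info.st_size),                    # file size in bytes
        float(info.st_mtime),                   # modification time (epoch seconds)
        float(info.st_ctime),                   # creation time on Windows, metadata change time on Unix
        float(FILE_TYPE_CODES.get(suffix, 0)),  # file type code; 0 means unknown
        float(stat.S_IMODE(info.st_mode)),      # access rights as permission bits
    ]


# Usage: extract features for a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "app.log"
    sample.write_bytes(b"0123456789")
    features = metadata_features(str(sample))
```

Because the vector depends only on metadata, a cluster using it could pick the desired pool before the file body arrives, which is the condition the claim gives for streaming storage.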
4. The data self-classifying heterogeneous distributed storage method according to claim 1, wherein storing data specifically comprises:
selecting one of two strategies, namely an automatic classification storage strategy and a custom storage strategy; wherein under the automatic classification storage strategy: the user uploads data to the heterogeneous storage cluster, and the heterogeneous storage cluster acquires the data features required by the data classification model and inputs them into the classification model obtained in the classification model pre-training step, so as to calculate the desired storage pool; if the data features selected in the classification model pre-training step are metadata only, the data may be stored into the desired storage pool in a streaming manner; if the data features selected in the classification model pre-training step include data content, the data features are calculated after the file to be stored has been completely received and are then input into the classification model to select the desired storage pool, and in the meantime the data is cached in a high-performance pool to prevent data loss; after the classification model yields the desired storage pool, the desired storage pool is corrected according to storage pool usage and pressure: if the usage of the current pool reaches a threshold, the selection is degraded to the next-stage pool or an error is returned directly, recursively, until a desired storage pool is selected or an error is returned, wherein the pressure may be the read-write pressure of the back-end storage devices, the CPU pressure of the back-end servers, or the network pressure of the back-end heterogeneous storage cluster; and finally, the data is stored in slices into the desired storage pool, or migrated from the high-performance cache pool to the desired storage pool; and wherein under the custom storage strategy: a designated storage pool is selected for data storage according to user input, and if the usage of the current pool reaches a threshold, either the storage fails or the selection is degraded to the next-stage pool.
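The correction of the desired storage pool in claim 4 can be sketched as follows. This is a minimal illustration, assuming the tiers are ordered from fastest to coldest and that pressure has already been reduced to a single number per pool; the threshold values and the names `PoolState` and `select_pool` are hypothetical. The claim describes the fallback as recursive, while the sketch uses an equivalent loop.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolState:
    name: str
    used_ratio: float  # fraction of capacity in use, 0.0 to 1.0
    pressure: float    # assumed combined disk/CPU/network pressure, 0.0 to 1.0


def select_pool(
    tiers: list[PoolState],
    desired_index: int,
    usage_threshold: float = 0.85,
    pressure_threshold: float = 0.90,
    allow_degrade: bool = True,
) -> Optional[PoolState]:
    """Walk from the desired tier toward colder tiers until one has headroom.

    Returns None when no pool qualifies; a caller would report that as an error.
    With allow_degrade=False only the desired tier is considered.
    """
    for index in range(desired_index, len(tiers)):
        pool = tiers[index]
        if pool.used_ratio < usage_threshold and pool.pressure < pressure_threshold:
            return pool
        if not allow_degrade:
            break
    return None


# Usage: the desired pool is nearly full and the next tier is under pressure.
tiers = [
    PoolState("high-performance", used_ratio=0.90, pressure=0.30),
    PoolState("balanced", used_ratio=0.50, pressure=0.95),
    PoolState("cold", used_ratio=0.40, pressure=0.10),
]
chosen = select_pool(tiers, desired_index=0)
```

The `allow_degrade` flag models the two outcomes the claim permits when a threshold is reached: degrade to the next-stage pool, or fail immediately.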
5. The data self-classifying heterogeneous distributed storage method according to claim 1, wherein self-correcting the classification model specifically comprises:
for data stored under the automatic classification storage strategy, acquiring the data access frequency and period whenever the data is read or written, and calculating and storing an access degree normalization value; judging whether the access degree normalization value is consistent with the storage pool grade, and if it is inconsistent and exceeds the threshold of the current pool, storing the features of the data, the access degree normalization value, the current storage pool and the expected storage pool as a data classification error record, and inserting the record into a data classification error record table; after the data classification error record table exceeds a set retraining correction threshold, checking whether each data classification error record is still inconsistent in the current state, and if a record is now consistent, deleting that record and re-entering retraining threshold detection; if all records in the data classification error record table remain inconsistent, inputting the data features and access degree normalization values stored in the data classification error record table into the classification model for offline training, and replacing the classification strategy used by the current cluster after the offline training is completed; after the replacement is completed, polling the current cluster pressure, and once the heterogeneous cluster pressure falls below a migration threshold, migrating data to the corresponding expected storage pool according to the corrected model and the data classification error record table, and deleting the corresponding data classification error record after migration is completed.
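The bookkeeping behind the self-correction step of claim 5 can be sketched with an in-memory error record table. This is a minimal illustration under stated assumptions: the class names, the `still_mismatched` callback and the threshold value are hypothetical, and offline training and migration are left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ErrorRecord:
    features: list[float]
    access_degree: float
    current_pool: str
    expected_pool: str


@dataclass
class ErrorTable:
    retrain_threshold: int
    records: dict[str, ErrorRecord] = field(default_factory=dict)

    def observe(self, object_id: str, record: ErrorRecord) -> None:
        """Store a record when the pools disagree; drop a stale one when they agree."""
        if record.current_pool == record.expected_pool:
            self.records.pop(object_id, None)
        else:
            self.records[object_id] = record

    def ready_to_retrain(self, still_mismatched: Callable[[ErrorRecord], bool]) -> bool:
        """Once the threshold is exceeded, prune resolved records and re-test it."""
        if len(self.records) <= self.retrain_threshold:
            return False
        for object_id in list(self.records):
            if not still_mismatched(self.records[object_id]):
                del self.records[object_id]
        return len(self.records) > self.retrain_threshold

    def training_rows(self) -> list[tuple[list[float], float]]:
        """Return (features, access degree) pairs for offline retraining."""
        return [(r.features, r.access_degree) for r in self.records.values()]


# Usage: three objects sit in the cold pool although the high-performance pool is expected.
table = ErrorTable(retrain_threshold=2)
for name, degree in [("a", 0.9), ("b", 0.8), ("c", 0.1)]:
    table.observe(name, ErrorRecord([1.0, 2.0], degree, "cold", "high-performance"))
first_check = table.ready_to_retrain(lambda record: True)
second_check = table.ready_to_retrain(lambda record: record.access_degree > 0.5)
```

In the usage above the first check passes because every record still mismatches; the second check prunes the record whose access degree has cooled, which drops the table back to the threshold and defers retraining.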
6. A data self-classifying heterogeneous distributed storage system for implementing the data self-classifying heterogeneous distributed storage method according to any one of claims 1 to 5, wherein: the system comprises a data classification model and a heterogeneous storage cluster;
the data classification model is used for classifying data;
the heterogeneous storage cluster is composed of storage nodes with different hardware, software and configurations.
7. The data self-classifying heterogeneous distributed storage system according to claim 6, wherein: the inputs of the data classification model are information on the data to be saved by the user: first, metadata of the file, including but not limited to the creation time, modification time, file size, file type and access rights of the file; second, the data of the file, including but not limited to the entire content of the file and a feature code; and third, the current pressure of the heterogeneous cluster, including but not limited to network pressure, disk pressure and CPU pressure.
8. The data self-classifying heterogeneous distributed storage system according to claim 6, wherein: the output of the data classification model is the desired storage pool for the data.
9. The data self-classifying heterogeneous distributed storage system according to claim 6, wherein: heterogeneous storage clusters include, but are not limited to, clusters composed of different hard disks, clusters composed of different servers, and clusters composed of different redundancy strategies.
10. The data self-classifying heterogeneous distributed storage system according to claim 6, wherein: an administrator divides the heterogeneous storage system into storage pools of different performance, including but not limited to a high-performance storage pool, a balanced storage pool and a cold storage pool; the administrator trains the data classification model in advance, in that data features are selected for the data that the heterogeneous distributed storage cluster is planned to store and are input into the data classification model for training, so as to obtain the trained data classification model; a user stores data into the heterogeneous storage cluster, the desired storage pool is obtained through the data classification model, and the data is stored into the corresponding storage pool; and during operation, the data classification model is corrected according to the data heat and the usage and pressure of the storage pools.
CN202310921074.8A 2023-07-26 2023-07-26 Data self-classifying heterogeneous distributed storage method and system Pending CN117032566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310921074.8A CN117032566A (en) 2023-07-26 2023-07-26 Data self-classifying heterogeneous distributed storage method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310921074.8A CN117032566A (en) 2023-07-26 2023-07-26 Data self-classifying heterogeneous distributed storage method and system

Publications (1)

Publication Number Publication Date
CN117032566A true CN117032566A (en) 2023-11-10

Family

ID=88640494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310921074.8A Pending CN117032566A (en) 2023-07-26 2023-07-26 Data self-classifying heterogeneous distributed storage method and system

Country Status (1)

Country Link
CN (1) CN117032566A (en)

Similar Documents

Publication Publication Date Title
US20200250145A1 (en) Synchronized data deduplication
CN107807794B (en) Data storage method and device
US8712963B1 (en) Method and apparatus for content-aware resizing of data chunks for replication
US8639669B1 (en) Method and apparatus for determining optimal chunk sizes of a deduplicated storage system
US9727573B1 (en) Out-of core similarity matching
US10303797B1 (en) Clustering files in deduplication systems
US8843447B2 (en) Resilient distributed replicated data storage system
US8554994B2 (en) Distributed storage network utilizing memory stripes
US9933970B2 (en) Deduplicating data for a data storage system using similarity determinations
US20160132523A1 (en) Exploiting node-local deduplication in distributed storage system
US9880762B1 (en) Compressing metadata blocks prior to writing the metadata blocks out to secondary storage
US20080270729A1 (en) Cluster storage using subsegmenting
US20080256143A1 (en) Cluster storage using subsegmenting
GB2518158A (en) Method and system for data access in a storage infrastructure
CN103763383A (en) Integrated cloud storage system and storage method thereof
CN112100293A (en) Data processing method, data access method, data processing device, data access device and computer equipment
CN111400083B (en) Data storage method and system and storage medium
CN107463342B (en) CDN edge node file storage method and device
CN107422989B (en) Server SAN system multi-copy reading method and storage system
CN109840051B (en) Data storage method and device of storage system
CN116560562A (en) Method and device for reading and writing data
US20170220422A1 (en) Moving data chunks
US20230305930A1 (en) Methods and systems for affinity aware container preteching
CN113918378A (en) Data storage method, storage system, storage device and storage medium
CN117032566A (en) Data self-classifying heterogeneous distributed storage method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination