CN117032566A - Data self-classifying heterogeneous distributed storage method and system - Google Patents
Data self-classifying heterogeneous distributed storage method and system
- Publication number
- CN117032566A (application CN202310921074.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- storage
- pool
- classification model
- heterogeneous
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0629—Configuration or reconfiguration of storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0643—Management of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0644—Management of space entities, e.g. partitions, extents, pools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0647—Migration mechanisms
Abstract
The invention relates to the technical field of distributed storage, and in particular to a data self-classifying heterogeneous distributed storage method and system. The method comprises the following steps: building a heterogeneous storage cluster; pre-training a classification model; storing data; and self-correcting the classification model. The beneficial effects are as follows: for heterogeneous distributed storage clusters, the method and system take into account the differences among data access scenarios, server nodes, and storage devices, and apply different redundancy techniques to different data so that the data ultimately lands on different nodes or devices. Compared with a general hash algorithm, the system can provide faster access to hot data and lower storage cost for cold data. Compared with a general caching algorithm, the system predicts the data access scenario before the data is stored, which can reduce the performance jitter caused by migrating hot and cold data.
Description
Technical Field
The invention relates to the technical field of distributed storage, in particular to a data self-classification heterogeneous distributed storage method and system.
Background
In distributed storage clusters, cost constraints mean that high-performance devices cannot be used throughout, which gives rise to heterogeneous storage clusters.
In the prior art, a single storage cluster may contain both high-performance servers and lower-specification servers, both high-performance solid-state drives and relatively slow mechanical hard disks, both replica-based and erasure-code-based redundancy strategies, and both high-bandwidth and relatively low-bandwidth switches. Read and write performance differs greatly across these conditions and configurations, so lower cost comes at the expense of performance. In most cases, a tiered caching approach can be used: high-performance devices cache hot data and low-cost devices store cold data, providing both fast access to hot data and low-cost storage of cold data.
However, this approach must continuously identify and migrate hot and cold data while the cluster is in use, which greatly affects performance stability.
Disclosure of Invention
The invention aims to provide a data self-classifying heterogeneous distributed storage method and system, which are used for solving the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions: a method of self-classifying heterogeneous distributed storage of data, the method comprising the steps of:
building a heterogeneous storage cluster;
pre-training a classification model;
data storage;
self-correcting the classification model.
Preferably, the heterogeneous storage cluster building specifically includes:
dividing hardware resources according to server performance, hard disk performance, network card performance, and switch performance; building a storage cluster and configuring different performance optimization directions for storage devices of different performance, where the optimization may favor throughput, latency, or IOPS; dividing heterogeneous storage pools of different performance according to the hardware and the configured optimization direction; optionally further dividing storage pool levels by redundancy strategy under the same or different hardware resource conditions, wherein the redundancy strategies include, but are not limited to, a replica mode and an erasure code mode; further dividing storage pool levels by optimization direction, for example into a throughput-optimized pool, a latency-optimized pool, an IOPS-optimized pool, and a balanced pool, or into a high-performance storage pool, a balanced storage pool, and a cold storage pool; defining data heat, that is, setting a formula that computes an access degree normalization value from quantities such as read access frequency and period, and calculating the access degree normalization value corresponding to each storage pool by combining the pool's hardware configuration, optimization direction, and replica strategy; and finally obtaining the storage pool levels together with each pool's performance, optimization tendency, and capacity information.
Preferably, the pre-training of the classification model specifically comprises:
selecting data features according to the data the current heterogeneous storage cluster is expected to store, wherein the data features may be file metadata, including creation time, modification time, file size, file type, and access rights, or file content, including partial data of the file, data at a fixed offset in the file, the MD5 value of the file, or all data of the file; whether file content is selected determines whether subsequently stored data can be streamed directly into the desired storage pool; selecting a classification model according to the data features and the storage pool classification strategy, the model being selected from a support vector machine and a random forest algorithm; inputting the data features into the classification model for pre-training to obtain a pre-trained classification model; and setting, according to the storage pool levels, the thresholds at which storage pool capacity and pressure correct the classification result.
Preferably, the data storage specifically includes:
selecting between two strategies, an automatic classification storage strategy and a custom storage strategy, wherein the automatic classification storage strategy comprises: a user uploading data to the heterogeneous storage cluster, and the heterogeneous storage cluster collecting the data features required by the data classification model and inputting them into the classification model obtained in the classification model pre-training step, so as to compute the desired storage pool; if the data features selected in the classification model pre-training step are metadata only, the data is streamed into the desired storage pool; if the data features selected in the classification model pre-training step include data content, the data features are computed after the file has been completely received and are then input into the classification model to select the desired storage pool, and a high-performance pool is selected as a cache in the meantime to prevent data loss; after the classification model yields the desired storage pool, the choice is corrected according to storage pool usage and pressure: if the usage of the current pool reaches a threshold, the selection degrades to the next-level pool or an error is returned directly, recursively, until a storage pool is selected or an error is returned, wherein the pressure is selected from the read-write pressure of the back-end storage devices, the CPU pressure of the back-end servers, and the network pressure of the back-end heterogeneous storage cluster; finally, the data slices are stored in the desired storage pool or migrated from the high-performance cache pool to the desired storage pool; and the custom storage strategy selects a designated storage pool for data storage according to user input, and if the usage of the current pool reaches a threshold, either fails the store or degrades to the next-level pool.
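The choice between streaming and buffered ingestion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation disclosed here: the names `FeatureKind`, `IngestPlan`, and `plan_ingest`, and the pool name `"high-performance"`, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum, auto


class FeatureKind(Enum):
    METADATA = auto()  # creation time, size, type, access rights, ...
    CONTENT = auto()   # partial data, fixed-offset data, MD5, full data


@dataclass(frozen=True)
class IngestPlan:
    streaming: bool            # True: classify up front and stream to the pool
    staging_pool: str | None   # cache pool used while the upload is buffered


def plan_ingest(selected_features: set[FeatureKind],
                cache_pool: str = "high-performance") -> IngestPlan:
    """Decide how an upload is ingested, based on the features the model needs."""
    if FeatureKind.CONTENT in selected_features:
        # Content features need the whole file, so the upload is staged in a
        # cache pool first and classified once it has been fully received.
        return IngestPlan(streaming=False, staging_pool=cache_pool)
    # Metadata is known before the payload arrives, so the data can be
    # streamed straight into the pool the model selects.
    return IngestPlan(streaming=True, staging_pool=None)
```

The sketch makes the trade-off explicit: content-based features may classify more accurately, but they cost one extra hop through the cache pool.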
Preferably, the self-correction of the classification model specifically includes:
for data stored under the automatic classification storage strategy, collecting the access frequency and period each time the data is read or written, and calculating and storing an access degree normalization value; judging whether the access degree normalization value matches the level of the storage pool holding the data, and, if it does not match and exceeds the current pool's threshold, saving the data's features, its access degree normalization value, its current storage pool, and its expected storage pool as a data classification error record and inserting the record into a data classification error record table; once the data classification error record table exceeds a set retraining correction threshold, checking whether each record still mismatches in the current state, deleting any record that no longer mismatches, and re-entering the retraining threshold check; if all records in the table still mismatch, inputting the stored data features and access degree normalization values into the classification model for offline training, and replacing the classification strategy used by the current cluster once offline training completes; after the replacement, polling the current cluster pressure, and once the heterogeneous cluster pressure falls below a migration threshold, migrating the data to the corresponding expected storage pools according to the corrected model and the data classification error record table, and deleting the corresponding data classification error records after migration completes.
A data self-classifying heterogeneous distributed storage system, the system comprising a data classification model and heterogeneous storage clusters;
the data classification model is used for classifying data;
heterogeneous storage clusters are composed of storage nodes with different hardware, software and configurations.
Preferably, the inputs of the data classification model are information about the data the user wants to save: first, file metadata, including but not limited to the file's creation time, modification time, file size, file type, and access rights; second, file data, including but not limited to the full content of the file and its feature code; and third, the current pressure of the heterogeneous cluster, including but not limited to network pressure, disk pressure, and CPU pressure.
Preferably, the output of the data classification model is the desired storage pool for the data.
Preferably, heterogeneous storage clusters include, but are not limited to, clusters of different hard disks, clusters of different servers, clusters of different redundancy policies.
Preferably, an administrator divides storage pools of different performance according to the heterogeneous storage system, including but not limited to a high-performance storage pool, a balanced storage pool, and a cold storage pool; the administrator trains the data classification model in advance by selecting data features suited to the data the heterogeneous distributed storage cluster is planned to store and inputting them into the data classification model for training, thereby obtaining the trained data classification model; a user stores data into the heterogeneous storage cluster, the desired storage pool is obtained through the data classification model, and the data is stored in that pool; and during operation, the data classification model is corrected according to data heat and storage pool usage and pressure.
Compared with the prior art, the invention has the beneficial effects that:
For heterogeneous distributed storage clusters, the data self-classifying heterogeneous distributed storage method and system provided by the invention take into account the differences among data access scenarios, server nodes, and storage devices, and apply different redundancy techniques to different data so that the data ultimately lands on different nodes or devices. Compared with a general hash algorithm, the system can provide faster access to hot data and lower storage cost for cold data. Compared with a general caching algorithm, the system predicts the data access scenario before the data is stored, which can reduce the performance jitter caused by migrating hot and cold data.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of the system of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the embodiments of the present invention will be further described in detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are some, but not all, embodiments of the present invention, are intended to be illustrative only and not limiting of the embodiments of the present invention, and that all other embodiments obtained by persons of ordinary skill in the art without making any inventive effort are within the scope of the present invention.
Example 1
Referring to FIG. 1, the present invention provides a technical solution: a data self-classifying heterogeneous distributed storage method comprising a heterogeneous storage cluster building step, a classification model pre-training step, a data storage step, and a classification model self-correction step.
The heterogeneous storage cluster building step specifically comprises the following. Hardware resources are divided according to server performance, hard disk performance, network card performance, and switch performance. A storage cluster is built, and different performance optimization directions are configured for storage devices of different performance; the optimization may favor throughput, latency, or IOPS. Heterogeneous storage pools of different performance are divided according to the hardware and the configured optimization direction. Under the same or different hardware resource conditions, storage pool levels may optionally be further divided by redundancy strategy, including but not limited to a replica mode and an erasure code mode. Storage pool levels are further divided by optimization direction, for example into a throughput-optimized pool, a latency-optimized pool, an IOPS-optimized pool, and a balanced pool, or into a high-performance storage pool, a balanced storage pool, and a cold storage pool. Data heat is defined by setting a formula that computes an access degree normalization value from quantities such as read access frequency and period, and the access degree normalization value corresponding to each storage pool is calculated by combining the pool's hardware configuration, optimization direction, replica strategy, and the like. Finally, the storage pool levels are obtained together with each pool's performance, optimization tendency, and capacity information.
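As a rough illustration of the pool levels and the access degree normalization value, the Python sketch below maps a read frequency into the range [0, 1) and picks the hottest pool whose threshold it meets. The text does not disclose the formula, so the saturating form f / (f + half_point), the threshold values, and the names `StoragePool`, `POOLS`, `access_degree`, and `pool_for_degree` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePool:
    name: str
    redundancy: str           # e.g. "replica" or "erasure-code"
    optimized_for: str        # e.g. "latency", "throughput", "iops"
    min_access_degree: float  # lowest normalized access degree this tier serves


# Pools ordered from hottest to coldest tier; thresholds are illustrative only.
POOLS = [
    StoragePool("high-performance", "replica", "latency", 0.6),
    StoragePool("balanced", "replica", "throughput", 0.2),
    StoragePool("cold", "erasure-code", "throughput", 0.0),
]


def access_degree(reads: int, period_seconds: float,
                  half_point: float = 1.0 / 60.0) -> float:
    """Map a read frequency (reads per second) into [0, 1).

    Assumed formula: f / (f + half_point), where `half_point` is the
    frequency that maps to 0.5 (one read per minute by default).
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    frequency = reads / period_seconds
    return frequency / (frequency + half_point)


def pool_for_degree(degree: float) -> StoragePool:
    """Return the hottest pool whose threshold the access degree meets."""
    for pool in POOLS:
        if degree >= pool.min_access_degree:
            return pool
    return POOLS[-1]
```

In a real cluster the thresholds would be derived from each pool's hardware configuration, optimization direction, and redundancy strategy, as the paragraph above describes, rather than hard-coded.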
The classification model pre-training step specifically comprises the following. Data features are selected according to the data the current heterogeneous storage cluster is expected to store. The features may be file metadata, including creation time, modification time, file size, file type, access rights, and the like, or file content, such as partial data of the file, data at a fixed offset in the file, the MD5 value of the file, or all data of the file. Whether file content is selected determines whether subsequently stored data can be streamed directly into the desired storage pool. A classification model is selected according to the data features and the storage pool classification strategy; algorithms such as a support vector machine or a random forest may be chosen. The data features are input into the classification model for pre-training to obtain a pre-trained classification model. Finally, according to the storage pool levels, thresholds are set at which storage pool capacity and pressure correct the classification result.
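A metadata-only feature vector of the kind described above might be built as follows. This is a minimal sketch using only the Python standard library; the function name `metadata_features` and the `KNOWN_TYPES` mapping are hypothetical, and creation time is left out because it is not available portably through `os.stat`.

```python
from __future__ import annotations

import os
import stat
from pathlib import Path

# Hypothetical mapping from file extension to a small integer type code.
KNOWN_TYPES = {".log": 1, ".jpg": 2, ".mp4": 3, ".db": 4}


def metadata_features(path: str) -> list[float]:
    """Build a numeric feature vector from file metadata only."""
    info = os.stat(path)
    suffix = Path(path).suffix.lower()
    return [
        float(info.st_size),                # file size in bytes
        float(info.st_mtime),               # modification time (epoch seconds)
        float(stat.S_IMODE(info.st_mode)),  # permission bits (access rights)
        float(KNOWN_TYPES.get(suffix, 0)),  # file type code, 0 if unknown
    ]
```

Vectors like this, paired with pool labels, could then be passed to whichever classifier the cluster adopts, for instance scikit-learn's `RandomForestClassifier` or `SVC`, which correspond to the random forest and support vector machine options named in the text.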
The data storage step specifically comprises choosing between two strategies, an automatic classification storage strategy and a custom storage strategy. Under the automatic classification storage strategy, a user uploads data to the heterogeneous storage cluster, and the cluster collects the data features required by the data classification model and inputs them into the classification model obtained in the classification model pre-training step, so as to compute the desired storage pool. If the data features selected in the pre-training step are metadata only, the data can be streamed into the desired storage pool. If the selected data features include data content, the features are computed after the file has been completely received and are then input into the classification model to select the desired storage pool; in the meantime, a high-performance pool can be used as a cache to prevent data loss. After the classification model yields the desired storage pool, the choice is corrected according to storage pool usage and pressure: if the usage of the current pool reaches a threshold, the selection degrades to the next-level pool or an error is returned directly, recursively, until a storage pool is selected or an error is returned. The pressure may be the read-write pressure of the back-end storage devices, the CPU pressure of the back-end servers, or the network pressure of the back-end heterogeneous storage cluster. Finally, the data slices are stored in the desired storage pool or migrated from the high-performance cache pool to the desired storage pool.
Under the custom storage strategy, a designated storage pool is selected for data storage according to user input; if the usage of the current pool reaches a threshold, the store either fails or degrades to the next-level pool.
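The correction of the desired pool by usage and pressure can be sketched as a walk down the tier order. The text describes the degradation as recursive; the loop below is an equivalent iterative form. The limits, the single combined `pressure` number, and the names `PoolState`, `NoPoolAvailable`, and `select_pool` are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


class NoPoolAvailable(Exception):
    """Raised when no pool in the tier order can accept the data."""


@dataclass
class PoolState:
    name: str
    usage: float     # fraction of capacity in use, 0.0 to 1.0
    pressure: float  # combined disk/CPU/network load, 0.0 to 1.0


def select_pool(desired: str, tiers: list[PoolState],
                usage_limit: float = 0.85, pressure_limit: float = 0.9,
                allow_degrade: bool = True) -> str:
    """Return the desired pool, or the next lower tier that has headroom."""
    names = [p.name for p in tiers]
    start = names.index(desired)  # raises ValueError if the pool is unknown
    for pool in tiers[start:]:
        if pool.usage < usage_limit and pool.pressure < pressure_limit:
            return pool.name
        if not allow_degrade:
            break  # caller asked for an error instead of a lower tier
    raise NoPoolAvailable(f"no pool at or below {desired!r} can accept data")
```

Passing `allow_degrade=False` models the "return an error directly" branch, which applies equally to the automatic and the custom strategy.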
The classification model self-correction step comprises the following. For data stored under the automatic classification storage strategy, the access frequency and period are collected each time the data is read or written, and an access degree normalization value is calculated and stored. The access degree normalization value is then compared with the level of the storage pool holding the data; if it does not match and exceeds the current pool's threshold, the data's features, its access degree normalization value, its current storage pool, and its expected storage pool are saved as a data classification error record and inserted into a data classification error record table. Once the data classification error record table exceeds a set retraining correction threshold, each record is checked to see whether it still mismatches in the current state; any record that no longer mismatches is deleted, and the retraining threshold check is re-entered. If all records in the table still mismatch, the stored data features and access degree normalization values are input into the classification model for offline training, and the classification strategy used by the current cluster is replaced once offline training completes. After the replacement, the current cluster pressure is polled; once the heterogeneous cluster pressure falls below a migration threshold, the data is migrated to the corresponding expected storage pools according to the corrected model and the data classification error record table, and the corresponding data classification error records are deleted after migration completes.
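The error record table and its retraining trigger can be sketched as below. This is one possible reading of the paragraph above, not a definitive implementation: the class names `ErrorRecord` and `ErrorTable`, the method `ready_to_retrain`, and the `still_wrong` callback (which stands in for re-checking a record against the current access degree) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ErrorRecord:
    object_id: str
    features: list[float]
    access_degree: float
    current_pool: str
    expected_pool: str


@dataclass
class ErrorTable:
    retrain_threshold: int
    records: dict[str, ErrorRecord] = field(default_factory=dict)

    def add(self, record: ErrorRecord) -> None:
        """Insert or overwrite the record for one stored object."""
        self.records[record.object_id] = record

    def ready_to_retrain(self, still_wrong: Callable[[ErrorRecord], bool]) -> bool:
        """Re-validate records once the threshold is exceeded.

        Records that no longer mismatch are dropped; retraining is signalled
        only if the table still exceeds the threshold afterwards.
        """
        if len(self.records) <= self.retrain_threshold:
            return False
        self.records = {k: r for k, r in self.records.items() if still_wrong(r)}
        return len(self.records) > self.retrain_threshold
```

When `ready_to_retrain` returns true, the surviving records supply the features and access degree values for offline training, and they later drive the migration once cluster pressure drops below the migration threshold.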
Example 2
Referring to FIG. 2, on the basis of Example 1, a data self-classifying heterogeneous distributed storage system is provided, the system comprising a data classification model and a heterogeneous storage cluster;
the data classification model is used for classifying data; its inputs are information about the data the user wants to save: first, file metadata, including but not limited to the file's creation time, modification time, file size, file type, and access rights; second, file data, including but not limited to the full content of the file and its feature code; and third, the current pressure of the heterogeneous cluster, including but not limited to network pressure, disk pressure, and CPU pressure; the output of the data classification model is the desired storage pool for the data;
the heterogeneous storage cluster is composed of storage nodes with different hardware, software, and configurations, including but not limited to clusters composed of different hard disks, clusters composed of different servers, and clusters using different redundancy strategies;
an administrator divides storage pools of different performance according to the heterogeneous storage system, including but not limited to a high-performance storage pool, a balanced storage pool, and a cold storage pool; the administrator trains the data classification model in advance by selecting data features suited to the data the heterogeneous distributed storage cluster is planned to store and inputting them into the data classification model for training, thereby obtaining the trained data classification model; a user stores data into the heterogeneous storage cluster, the desired storage pool is obtained through the data classification model, and the data is stored in that pool; and during operation, the data classification model is corrected according to data heat and storage pool usage and pressure.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A data self-classifying heterogeneous distributed storage method, characterized in that the method comprises the following steps:
building a heterogeneous storage cluster;
pre-training a classification model;
data storage;
self-correcting the classification model.
2. The method for self-classifying heterogeneous distributed storage of data according to claim 1, wherein: the heterogeneous storage cluster building specifically comprises:
dividing hardware resources according to server performance, hard disk performance, network card performance, and switch performance; building a storage cluster and configuring different performance optimization directions for storage devices of different performance, where the optimization may favor throughput, latency, or IOPS; dividing heterogeneous storage pools of different performance according to the hardware and the configured optimization direction; optionally further dividing storage pool levels by redundancy strategy under the same or different hardware resource conditions, wherein the redundancy strategies include, but are not limited to, a replica mode and an erasure code mode; further dividing storage pool levels by optimization direction, for example into a throughput-optimized pool, a latency-optimized pool, an IOPS-optimized pool, and a balanced pool, or into a high-performance storage pool, a balanced storage pool, and a cold storage pool; defining data heat, that is, setting a formula that computes an access degree normalization value from quantities such as read access frequency and period, and calculating the access degree normalization value corresponding to each storage pool by combining the pool's hardware configuration, optimization direction, and replica strategy; and finally obtaining the storage pool levels together with each pool's performance, optimization tendency, and capacity information.
3. The data self-classifying heterogeneous distributed storage method according to claim 1, wherein pre-training the classification model specifically comprises:
selecting data features according to the data that the current heterogeneous storage cluster is expected to store, wherein the data features are selected from: metadata of a file, including creation time, modification time, file size, file type and access rights; and content of the file, selected from partial data of the file, data at a fixed offset in the file, the MD5 value of the file, and the entire data of the file; wherein the choice of file content as a feature affects whether subsequently stored data can be streamed directly into the desired storage pool; selecting a classification model according to the data features and the storage pool classification strategy, the classification model being a support vector machine or a random forest algorithm; inputting the data features into the classification model for pre-training to obtain a pre-trained classification model; and setting, according to the grading of the storage pools, correction thresholds by which storage pool capacity and pressure adjust the classification strategy.
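The two feature families named in this claim can be sketched with the Python standard library alone. This is an illustration, not the patent's implementation: the function names `metadata_features` and `content_features`, the dictionary keys, and the default `offset` and `length` are assumptions. Note that `st_ctime` is the creation time on Windows but the metadata change time on most Unix systems, so it only approximates "creation time".

```python
import hashlib
import os


def metadata_features(path: str) -> dict:
    """Features available before any file content is read (streaming-friendly)."""
    st = os.stat(path)
    return {
        "size": st.st_size,
        "mtime": st.st_mtime,
        "ctime": st.st_ctime,  # creation time on Windows, change time on most Unix systems
        "mode": st.st_mode & 0o777,  # permission bits as a stand-in for access rights
        "type": os.path.splitext(path)[1].lower(),  # file extension as a stand-in for file type
    }


def content_features(path: str, offset: int = 0, length: int = 4096) -> dict:
    """Features that need the whole file: MD5 digest plus a fixed-offset sample."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Hash the file in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
        f.seek(offset)
        sample = f.read(length)
    return {"md5": md5.hexdigest(), "sample": sample}
```

The split mirrors the trade-off stated in the claim: `metadata_features` can run as soon as an upload begins, while `content_features` can only run once the file has been fully received.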
4. The data self-classifying heterogeneous distributed storage method according to claim 1, wherein storing the data specifically comprises:
selecting one of two strategies, namely an automatic classification storage strategy and a custom storage strategy; wherein, under the automatic classification storage strategy: a user uploads data to the heterogeneous storage cluster, and the heterogeneous storage cluster extracts the data features required by the data classification model and inputs them into the classification model obtained in the classification model pre-training step, thereby calculating a desired storage pool; if the data features selected in the classification model pre-training step are metadata only, the data can be streamed directly into the desired storage pool; if the data features selected in the classification model pre-training step include data content, the data features are calculated after the file to be stored has been completely received and are then input into the classification model to select the desired storage pool, a high-performance pool being used as a cache in the meantime to prevent data loss; after the classification model yields the desired storage pool, the desired storage pool is corrected according to storage pool usage and pressure: if the usage of the current pool reaches a threshold, the selection is degraded to the next-level pool or an error is returned directly, recursively, until a desired storage pool is selected or an error is returned, the pressure being selected from the read-write pressure of the back-end storage devices, the CPU pressure of the back-end servers and the network pressure of the back-end heterogeneous storage cluster; and finally the data slices are stored into the desired storage pool or migrated from the high-performance cache pool to the desired storage pool; and, under the custom storage strategy, a designated storage pool is selected for data storage according to user input, and if the usage of the current pool reaches a threshold, either the storage fails or the selection is degraded to the next-level pool.
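The correction of the desired storage pool by usage and pressure can be sketched as a walk down the pool grades. This is a minimal illustration under stated assumptions: `PoolState`, `correct_pool`, the single combined `pressure` figure and both limit values are hypothetical, and the recursion described in the claim is written as an equivalent loop. Returning `None` stands for "return an error".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolState:
    name: str
    used_ratio: float  # used capacity as a fraction of total capacity
    pressure: float    # assumed combined disk/CPU/network pressure in [0, 1]


def correct_pool(
    pools: List[PoolState],
    desired: int,
    usage_limit: float = 0.9,
    pressure_limit: float = 0.8,
    allow_downgrade: bool = True,
) -> Optional[str]:
    """Start at the pool the classifier chose and degrade level by level.

    pools is ordered from highest to lowest grade; desired is an index into it.
    Returns the selected pool name, or None when no pool can accept the data.
    """
    for level in range(desired, len(pools)):
        pool = pools[level]
        if pool.used_ratio < usage_limit and pool.pressure < pressure_limit:
            return pool.name
        if not allow_downgrade:
            break  # policy says fail instead of degrading
    return None
```

For example, with a nearly full high-performance pool and a heavily loaded balanced pool, a request aimed at level 0 lands in the cold pool when degrading is allowed and fails when it is not.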
5. The data self-classifying heterogeneous distributed storage method according to claim 1, wherein self-correcting the classification model specifically comprises:
for data stored under the automatic classification storage strategy, acquiring the data access frequency and period whenever the data is read or written, and calculating and storing an access degree normalization value; judging whether the access degree normalization value matches the level of the storage pool holding the data, and, if it does not match and exceeds the threshold of the current pool, storing the features of the data, the access degree normalization value, the current storage pool and the desired storage pool as a data classification error record and inserting the record into a data classification error record table; after the data classification error record table exceeds a set retraining correction threshold, checking whether each data classification error record still fails to match in the current state, and, if a record no longer mismatches, deleting that record and re-entering retraining threshold detection; if all records in the data classification error record table still mismatch, inputting the data features and access degree normalization values stored in the data classification error record table into the classification model for offline training, and replacing the classification strategy used by the current cluster after the offline training is completed; and, after the replacement is completed, polling the current cluster pressure, and, once the heterogeneous cluster pressure is lower than a migration threshold, migrating the data to the corresponding desired storage pool according to the corrected model and the data classification error record table, and deleting the corresponding data classification error record after the migration is completed.
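The bookkeeping around the data classification error record table can be sketched as follows. The sketch reads the claim as "prune records that no longer mismatch, then retrain only if the table still exceeds the threshold", which is one interpretation of the translated text. `ErrorRecord`, `ErrorTable` and the `still_mismatched` callback are hypothetical names, and the offline training and migration steps are left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ErrorRecord:
    features: dict
    access_norm: float
    current_pool: str
    desired_pool: str


class ErrorTable:
    def __init__(self, retrain_threshold: int) -> None:
        self.retrain_threshold = retrain_threshold
        self.records: List[ErrorRecord] = []

    def add(self, record: ErrorRecord) -> None:
        self.records.append(record)

    def should_retrain(self, still_mismatched: Callable[[ErrorRecord], bool]) -> bool:
        """Re-validate the table once it exceeds the threshold.

        Records that no longer mismatch are dropped; retraining is signalled
        only if the remaining records still exceed the threshold.
        """
        if len(self.records) <= self.retrain_threshold:
            return False
        self.records = [r for r in self.records if still_mismatched(r)]
        return len(self.records) > self.retrain_threshold
```

A caller would invoke `should_retrain` after each insertion and, on `True`, feed the stored features and normalization values to offline training before scheduling migration during a low-pressure window.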
6. A data self-classifying heterogeneous distributed storage system for implementing the data self-classifying heterogeneous distributed storage method according to any one of claims 1 to 5, wherein the system comprises a data classification model and a heterogeneous storage cluster;
the data classification model is used for classifying data; and
the heterogeneous storage cluster is composed of storage nodes with different hardware, software and configurations.
7. The data self-classifying heterogeneous distributed storage system according to claim 6, wherein the input of the data classification model is information about the data to be saved by a user, selected from: first, metadata of a file, including but not limited to the creation time, modification time, file size, file type and access rights of the file; second, data of the file, including but not limited to the entire content of the file and a feature code; and third, the current pressure of the heterogeneous cluster, including but not limited to network pressure, disk pressure and CPU pressure.
8. The data self-classifying heterogeneous distributed storage system according to claim 6, wherein the output of the data classification model is the desired storage pool for the data.
9. The data self-classifying heterogeneous distributed storage system according to claim 6, wherein the heterogeneous storage clusters include, but are not limited to, clusters composed of different hard disks, clusters composed of different servers, and clusters composed of different redundancy strategies.
10. The data self-classifying heterogeneous distributed storage system according to claim 6, wherein: an administrator divides the heterogeneous storage system into storage pools of different performance, including but not limited to a high-performance storage pool, a balanced storage pool and a cold storage pool; the administrator trains the data classification model in advance by selecting data features for the data that the heterogeneous distributed storage cluster is planned to store and inputting the data features into the data classification model for training, thereby obtaining the data classification model; when a user stores data into the heterogeneous storage cluster, a desired storage pool is obtained through the data classification model and the data is stored to the corresponding storage pool; and, during operation, the data classification model is corrected according to data heat and the usage and pressure of the storage pools.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310921074.8A CN117032566A (en) | 2023-07-26 | 2023-07-26 | Data self-classifying heterogeneous distributed storage method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310921074.8A CN117032566A (en) | 2023-07-26 | 2023-07-26 | Data self-classifying heterogeneous distributed storage method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117032566A (en) | 2023-11-10 |
Family
ID=88640494
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310921074.8A Pending CN117032566A (en) | 2023-07-26 | 2023-07-26 | Data self-classifying heterogeneous distributed storage method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117032566A (en) |
- 2023-07-26 CN CN202310921074.8A patent/CN117032566A/en active Pending
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20200250145A1 (en) | Synchronized data deduplication | |
CN107807794B (en) | Data storage method and device | |
US8712963B1 (en) | Method and apparatus for content-aware resizing of data chunks for replication | |
US8639669B1 (en) | Method and apparatus for determining optimal chunk sizes of a deduplicated storage system | |
US9727573B1 (en) | Out-of core similarity matching | |
US10303797B1 (en) | Clustering files in deduplication systems | |
US8843447B2 (en) | Resilient distributed replicated data storage system | |
US8554994B2 (en) | Distributed storage network utilizing memory stripes | |
US9933970B2 (en) | Deduplicating data for a data storage system using similarity determinations | |
US20160132523A1 (en) | Exploiting node-local deduplication in distributed storage system | |
US9880762B1 (en) | Compressing metadata blocks prior to writing the metadata blocks out to secondary storage | |
US20080270729A1 (en) | Cluster storage using subsegmenting | |
US20080256143A1 (en) | Cluster storage using subsegmenting | |
GB2518158A (en) | Method and system for data access in a storage infrastructure | |
CN103763383A (en) | Integrated cloud storage system and storage method thereof | |
CN112100293A (en) | Data processing method, data access method, data processing device, data access device and computer equipment | |
CN111400083B (en) | Data storage method and system and storage medium | |
CN107463342B (en) | CDN edge node file storage method and device | |
CN107422989B (en) | Server SAN system multi-copy reading method and storage system | |
CN109840051B (en) | Data storage method and device of storage system | |
CN116560562A (en) | Method and device for reading and writing data | |
US20170220422A1 (en) | Moving data chunks | |
US20230305930A1 (en) | Methods and systems for affinity aware container preteching | |
CN113918378A (en) | Data storage method, storage system, storage device and storage medium | |
CN117032566A (en) | Data self-classifying heterogeneous distributed storage method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||