CN117032566A - Data self-classifying heterogeneous distributed storage method and system

Info

Publication number: CN117032566A
Application number: CN202310921074.8A
Authority: CN
Inventors: 王韵清; 王腾飞; 李超; 曹磊; 王迎彬
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2023-07-26
Filing date: 2023-07-26
Publication date: 2023-11-10

Abstract

The invention relates to the technical field of distributed storage, and in particular to a data self-classifying heterogeneous distributed storage method and system. The method comprises the following steps: building a heterogeneous storage cluster; pre-training a classification model; storing data; and self-correcting the classification model. The beneficial effects are as follows: for heterogeneous distributed storage clusters, the method and system take into account the differences in data access scenarios, server nodes and storage devices, apply different redundancy techniques to different data, and ultimately place the data on different nodes or devices. Compared with a general hash algorithm, the system can provide faster access for hot data and lower storage cost for cold data. Compared with a general caching algorithm, the system predicts the data access scenario before the data is stored, and can thereby reduce the performance jitter caused by migrating cold and hot data.

Description

Data self-classifying heterogeneous distributed storage method and system

Technical Field

The invention relates to the technical field of distributed storage, in particular to a data self-classification heterogeneous distributed storage method and system.

Background

In distributed storage clusters, cost constraints mean that high-performance devices cannot be used throughout, which gives rise to heterogeneous storage clusters.

In the prior art, a single storage cluster may contain both high-performance servers and lower-specification servers, both high-performance solid-state disks and relatively slow mechanical hard disks, both replica-based and erasure-code-based redundancy strategies, and both higher-bandwidth and relatively low-bandwidth switches. Under different conditions and configurations, data read-write performance differs greatly, and cost trades off against performance: lower cost generally comes with lower performance. In most cases, a tiered caching approach is used, in which high-performance devices cache hot data and low-cost devices store cold data, thus providing both fast access to hot data and low-cost storage of cold data.

However, this approach must continuously identify and migrate cold and hot data while the cluster is in use, which greatly affects performance stability.

Disclosure of Invention

The invention aims to provide a data self-classifying heterogeneous distributed storage method and system that solve the problems described in the Background above.

In order to achieve the above purpose, the present invention provides the following technical solutions: a method of self-classifying heterogeneous distributed storage of data, the method comprising the steps of:

building a heterogeneous storage cluster;

pre-training a classification model;

data storage;

self-correcting the classification model.

Preferably, the heterogeneous storage cluster building specifically includes:

dividing hardware resources according to server performance, hard disk performance, network card performance and switch performance; constructing a storage cluster and configuring different performance optimization directions for storage devices of different performance, each direction favoring throughput, latency or IOPS; dividing heterogeneous storage pools of different performance according to the hardware and the configured optimization direction; optionally further dividing storage pool grades according to the storage pool redundancy strategy, under the same or different hardware resource conditions, wherein the redundancy strategies include, but are not limited to, a replica mode and an erasure code mode; optionally further dividing storage pool grades according to the storage pool optimization direction, for example into a throughput-optimized pool, a latency-optimized pool, an IOPS-optimized pool and a balanced storage pool, or into a high-performance storage pool, a balanced storage pool and a cold storage pool; defining data heat, that is, setting a formula that computes a normalized access-degree value from read access frequency, period and the like, and calculating the normalized access-degree value corresponding to each storage pool by taking into account the hardware configuration, optimization direction and replica strategy of the storage pool; and finally, obtaining the storage pool grading together with the corresponding performance, optimization tendency and capacity information of each storage pool.
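The grading of storage pools described above can be illustrated with a minimal Python sketch. The `StoragePool` fields, the score tables and the rule that replica pools rank above erasure-coded pools on the same media are all assumptions made for illustration; the patent does not specify a concrete ranking.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoragePool:
    name: str
    media: str          # hypothetical values: "nvme", "ssd", "hdd"
    redundancy: str     # "replica" or "erasure_code"
    optimized_for: str  # "latency", "iops" or "throughput"
    capacity_gb: int

# Assumed ranking tables; a real cluster would derive these from benchmarks.
MEDIA_SCORE = {"nvme": 3, "ssd": 2, "hdd": 1}
REDUNDANCY_SCORE = {"replica": 2, "erasure_code": 1}

def grade_pools(pools):
    """Order pools from the highest to the lowest performance grade."""
    return sorted(
        pools,
        key=lambda p: (MEDIA_SCORE[p.media], REDUNDANCY_SCORE[p.redundancy]),
        reverse=True,
    )

pools = [
    StoragePool("cold", "hdd", "erasure_code", "throughput", 500_000),
    StoragePool("hot", "nvme", "replica", "latency", 20_000),
    StoragePool("balanced", "ssd", "erasure_code", "iops", 100_000),
]
graded = grade_pools(pools)
print([p.name for p in graded])
```

The resulting order is what later steps would treat as the chain of pool levels, from the fastest pool down to the cheapest.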

Preferably, the pre-training of the classification model specifically comprises:

selecting data features according to the data that the current heterogeneous storage cluster is expected to store, wherein the data features may be taken from the metadata of a file, including creation time, modification time, file size, file type and access rights, or from the content of the file, such as partial data of the file, data at a fixed offset in the file, the MD5 value of the file, or all data of the file; whether file content is used as a feature determines whether subsequently stored data can be written directly to the desired storage pool in a streaming manner; selecting a classification model, for example a support vector machine or a random forest, according to the data features and the storage pool grading strategy; inputting the data features into the classification model for pre-training to obtain a pre-trained classification model; and setting, according to the storage pool grading, the thresholds at which storage pool capacity and pressure correct the classification result.
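As a rough illustration of the feature selection described above, the following Python sketch collects a configurable set of metadata and content features for one file. The feature names (`size_bytes`, `head_bytes` and so on) and the `extract_features` helper are hypothetical. Creation time is omitted because `os.stat` does not expose it portably.

```python
import hashlib
import os
import stat
import tempfile

def extract_features(path, names, head_len=16):
    """Collect the requested features for one file (feature names are illustrative)."""
    st = os.stat(path)

    def md5_hex():
        # Content feature: requires reading the whole file.
        digest = hashlib.md5()
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(65536), b""):
                digest.update(block)
        return digest.hexdigest()

    def head_bytes():
        # Content feature: data at a fixed offset (here offset 0).
        with open(path, "rb") as fh:
            return fh.read(head_len)

    providers = {
        "size_bytes": lambda: st.st_size,
        "mtime": lambda: st.st_mtime,
        "extension": lambda: os.path.splitext(path)[1].lower(),
        "mode": lambda: stat.S_IMODE(st.st_mode),
        "md5": md5_hex,
        "head_bytes": head_bytes,
    }
    return {name: providers[name]() for name in names}

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(suffix=".TXT", delete=False) as tmp:
    tmp.write(b"hello world")
features = extract_features(
    tmp.name, ["size_bytes", "extension", "md5", "head_bytes"], head_len=5
)
os.remove(tmp.name)
print(features["size_bytes"], features["extension"], features["head_bytes"])
```

Metadata features are available before the upload finishes, whereas `md5` and `head_bytes` need file content, which is why the choice of features affects whether streaming storage is possible.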

Preferably, the data storage specifically includes:

two strategies are available, namely an automatic classification storage strategy and a custom storage strategy. The automatic classification storage strategy comprises the following steps: a user uploads data to the heterogeneous storage cluster, and the cluster acquires the data features required by the data classification model and inputs them into the classification model obtained in the classification model pre-training step, so as to calculate a desired storage pool; if the data features selected in the pre-training step are metadata only, the data may be stored into the desired storage pool in a streaming manner; if the selected data features include data content, the features are calculated after the file has been completely received and are then input into the classification model to select a desired storage pool, and a high-performance pool may be used as a cache in the meantime to prevent data loss; after the classification model yields the desired storage pool, the desired storage pool is corrected according to storage pool usage and pressure: if the usage of the current pool reaches a threshold, the data is either degraded to the next-level pool or an error is returned directly, and this is repeated recursively until a storage pool is selected or an error is returned, wherein the pressure may be the read-write pressure of the back-end storage devices, the CPU pressure of the back-end servers, or the network pressure of the back-end heterogeneous storage cluster; finally, the data is sliced and stored into the desired storage pool, or is migrated from the high-performance cache pool to the desired storage pool. Under the custom storage strategy, a designated storage pool is selected for data storage according to user input; if the usage of that pool reaches a threshold, the storage either fails or is degraded to the next-level pool.
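The correction of the desired storage pool by usage and pressure can be sketched as follows. This is a minimal Python illustration, written as a loop rather than a recursion. The `PoolState` fields, the limit values and the `allow_degrade` flag are assumptions and do not come from the patent.

```python
from dataclasses import dataclass

@dataclass
class PoolState:
    name: str
    used_ratio: float  # fraction of capacity in use, 0..1
    pressure: float    # combined disk/CPU/network load, 0..1 (assumed scale)

class NoPoolAvailable(Exception):
    """Raised when no pool at or below the desired level can accept the data."""

def correct_pool(tiers, desired, usage_limit=0.85, pressure_limit=0.9,
                 allow_degrade=True):
    """Return the pool to use; `tiers` is ordered from fastest to cheapest."""
    start = [t.name for t in tiers].index(desired)
    for pool in tiers[start:]:
        if pool.used_ratio < usage_limit and pool.pressure < pressure_limit:
            return pool.name
        if not allow_degrade:
            break  # policy: return an error instead of degrading
    raise NoPoolAvailable(f"no pool can accept data desired for {desired!r}")

tiers = [
    PoolState("hot", used_ratio=0.90, pressure=0.2),
    PoolState("balanced", used_ratio=0.50, pressure=0.3),
    PoolState("cold", used_ratio=0.10, pressure=0.1),
]
chosen = correct_pool(tiers, "hot")
print(chosen)
```

With `allow_degrade=False` the same call raises `NoPoolAvailable`, which corresponds to the "directly return an error" branch.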

Preferably, the self-correction of the classification model specifically includes:

for data stored under the automatic classification storage strategy, acquiring the access frequency and period of the data whenever it is read or written, and calculating and storing a normalized access-degree value; judging whether the normalized access-degree value is inconsistent with the level of the storage pool holding the data, and if it is inconsistent and exceeds the threshold of the current pool, storing the features of the data, the normalized access-degree value, the current storage pool and the expected storage pool as a data classification error record and inserting it into a data classification error record table; after the data classification error record table exceeds a set retraining threshold, checking whether each data classification error record is still inconsistent in the current state, and if a record has become consistent, deleting it and re-entering the retraining threshold check; if all records in the data classification error record table are still inconsistent, inputting the stored data features and normalized access-degree values from the table into the classification model for offline training, and replacing the classification strategy used by the current cluster once the offline training is complete; after the replacement, polling the current cluster pressure, and once the pressure of the heterogeneous cluster falls below a migration threshold, migrating data to the corresponding expected storage pools according to the corrected model and the data classification error record table, and deleting each data classification error record after its migration is complete.
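The detection of a misclassified object can be sketched in Python as below. The per-pool value ranges, the tolerance `margin` and the `ErrorRecord` layout are hypothetical. The patent only states that a record holds the data features, the normalized access-degree value, the current pool and the expected pool.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed range of normalized access-degree values that each pool level covers.
POOL_RANGE = {
    "high_performance": (0.6, 1.0),
    "balanced": (0.2, 0.6),
    "cold": (0.0, 0.2),
}

@dataclass
class ErrorRecord:
    object_id: str
    features: dict
    access_degree: float
    current_pool: str
    expected_pool: str

def expected_pool(degree: float) -> str:
    """Map a normalized access-degree value in [0, 1] to a pool level."""
    for name, (low, high) in POOL_RANGE.items():
        if low <= degree <= high:
            return name
    raise ValueError("access degree must lie in [0, 1]")

def check_placement(object_id, features, degree, current_pool,
                    margin=0.05) -> Optional[ErrorRecord]:
    """Return an error record only if the value leaves the pool's range by more than margin."""
    low, high = POOL_RANGE[current_pool]
    if low - margin <= degree <= high + margin:
        return None
    return ErrorRecord(object_id, features, degree, current_pool,
                       expected_pool(degree))

error_table = []
for oid, degree, pool in [("a", 0.90, "cold"), ("b", 0.22, "cold")]:
    record = check_placement(oid, {"size_bytes": 1024}, degree, pool)
    if record is not None:
        error_table.append(record)
print([(r.object_id, r.expected_pool) for r in error_table])
```

The margin keeps objects that sit just over a boundary from being recorded, which is one possible reading of the "exceeds the current pool threshold" condition.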

A data self-classifying heterogeneous distributed storage system, the system comprising a data classification model and heterogeneous storage clusters;

the data classification model is used for classifying data;

heterogeneous storage clusters are composed of storage nodes with different hardware, software and configurations.

Preferably, the input of the data classification model is information about the data to be saved by the user, and may be selected from: first, metadata of the file, including but not limited to creation time, modification time, file size, file type and access rights; second, data of the file, including but not limited to the whole content of the file and a feature code; and third, the current pressure of the heterogeneous cluster, including but not limited to network pressure, disk pressure and CPU pressure.

Preferably, the output of the data classification model is the desired storage pool for the data.

Preferably, heterogeneous storage clusters include, but are not limited to, clusters of different hard disks, clusters of different servers, clusters of different redundancy policies.

Preferably, an administrator divides the heterogeneous storage system into storage pools of different performance, including but not limited to a high-performance storage pool, a balanced storage pool and a cold storage pool; the administrator trains the data classification model in advance by selecting data features for the data that the heterogeneous distributed storage cluster is planned to store and inputting the data features into the data classification model for training, thereby obtaining the trained data classification model; a user stores data into the heterogeneous storage cluster, the desired storage pool is obtained through the data classification model, and the data is stored in the corresponding storage pool; and during operation, the data classification model is corrected according to data heat and according to storage pool usage and pressure.

Compared with the prior art, the invention has the beneficial effects that:

For heterogeneous distributed storage clusters, the data self-classifying heterogeneous distributed storage method and system provided by the invention take into account the differences in data access scenarios, server nodes and storage devices, apply different redundancy techniques to different data, and ultimately place the data on different nodes or devices. Compared with a general hash algorithm, the system can provide faster access for hot data and lower storage cost for cold data. Compared with a general caching algorithm, the system predicts the data access scenario before the data is stored, and can thereby reduce the performance jitter caused by migrating cold and hot data.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a block diagram of the system of the present invention.

Detailed Description

In order to make the objects, technical solutions, and advantages of the present invention more apparent, the embodiments of the present invention will be further described in detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are some, but not all, embodiments of the present invention, are intended to be illustrative only and not limiting of the embodiments of the present invention, and that all other embodiments obtained by persons of ordinary skill in the art without making any inventive effort are within the scope of the present invention.

Example 1

Referring to FIG. 1, the present invention provides a technical solution: a data self-classifying heterogeneous distributed storage method comprising a heterogeneous storage cluster building step, a classification model pre-training step, a data storage step and a classification model self-correction step.

The heterogeneous storage cluster building step specifically comprises the following. Hardware resources are divided according to server performance, hard disk performance, network card performance and switch performance. A storage cluster is constructed, and different performance optimization directions are configured for storage devices of different performance; each direction may favor throughput, latency or IOPS. Heterogeneous storage pools of different performance are then divided according to the hardware and the configured optimization direction. Storage pool grades may optionally be further divided according to the storage pool redundancy strategy, under the same or different hardware resource conditions; the redundancy strategies include, but are not limited to, a replica mode and an erasure code mode. Storage pool grades may also be further divided according to the storage pool optimization direction, for example into a throughput-optimized pool, a latency-optimized pool, an IOPS-optimized pool and a balanced storage pool, or into a high-performance storage pool, a balanced storage pool and a cold storage pool. Data heat is defined, that is, a formula is set that computes a normalized access-degree value from read access frequency, period and the like, and the normalized access-degree value corresponding to each storage pool is calculated by taking into account the hardware configuration, optimization direction, replica strategy and the like of the storage pool. Finally, the storage pool grading is obtained together with the corresponding performance, optimization tendency and capacity information of each storage pool.
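The patent does not disclose the formula for the normalized access-degree value. One possible form, given here purely as an assumption, scales the access rate over the observation period against a saturation rate and clips the result to [0, 1]. The `saturation_per_hour` parameter and the per-pool minimum values are hypothetical.

```python
def access_degree(access_count: int, period_seconds: float,
                  saturation_per_hour: float = 100.0) -> float:
    """Map an access count over a period to a value in [0, 1]."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    per_hour = access_count * 3600.0 / period_seconds
    return min(1.0, per_hour / saturation_per_hour)

# Assumed minimum value for each pool level, ordered from hottest to coldest.
POOL_MIN_DEGREE = [("high_performance", 0.6), ("balanced", 0.2), ("cold", 0.0)]

def pool_for_degree(degree: float) -> str:
    """Pick the hottest pool whose minimum the value reaches."""
    for name, minimum in POOL_MIN_DEGREE:
        if degree >= minimum:
            return name
    return POOL_MIN_DEGREE[-1][0]

degree = access_degree(access_count=50, period_seconds=3600)
print(degree, pool_for_degree(degree))
```

In practice the per-pool values would be derived from the hardware configuration, optimization direction and redundancy strategy of each pool, as the text states, rather than fixed by hand.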

The classification model pre-training step specifically comprises the following. Data features are selected according to the data that the current heterogeneous storage cluster is expected to store. The data features may be taken from the metadata of a file, including creation time, modification time, file size, file type, access rights and the like, or from the content of the file, such as partial data of the file, data at a fixed offset in the file, the MD5 value of the file, or all data of the file; whether file content is used as a feature determines whether subsequently stored data can be written directly to the desired storage pool in a streaming manner. A classification model is selected according to the data features and the storage pool grading strategy, for example a support vector machine or a random forest. The data features are input into the classification model for pre-training to obtain a pre-trained classification model. Finally, the thresholds at which storage pool capacity and pressure correct the classification result are set according to the storage pool grading.
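The pre-training step can be illustrated with a small, dependency-free Python sketch. Note the substitution: the patent names support vector machines and random forests, whereas this example uses a nearest-centroid classifier only to stay within the standard library. The two features (log10 of file size and a media-type flag) and the training samples are invented for illustration.

```python
import math
from collections import defaultdict

def train_centroids(samples):
    """samples: list of (feature_vector, pool_label). Returns {label: centroid}."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def predict(centroids, vec):
    """Return the pool whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))

# Hypothetical training set: [log10(size in bytes), 1.0 if media file else 0.0]
training = [
    ([3.0, 0.0], "high_performance"),
    ([4.0, 0.0], "high_performance"),
    ([6.0, 0.0], "balanced"),
    ([6.5, 1.0], "balanced"),
    ([8.0, 1.0], "cold"),
    ([9.0, 1.0], "cold"),
]
model = train_centroids(training)
print(predict(model, [3.2, 0.0]), predict(model, [8.8, 1.0]))
```

A production system following the text would more likely train a support vector machine or random forest with a machine-learning library; the interface (features in, desired pool out) stays the same.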

The data storage step offers two strategies, namely an automatic classification storage strategy and a custom storage strategy. The automatic classification storage strategy comprises the following steps. A user uploads data to the heterogeneous storage cluster, and the cluster acquires the data features required by the data classification model and inputs them into the classification model obtained in the classification model pre-training step, so as to calculate a desired storage pool. If the data features selected in the pre-training step are metadata only, the data can be stored into the desired storage pool in a streaming manner; if the selected data features include data content, the features are calculated after the file has been completely received and are then input into the classification model to select a desired storage pool, and a high-performance pool may be used as a cache in the meantime to prevent data loss. After the classification model yields the desired storage pool, the desired storage pool is corrected according to storage pool usage and pressure: if the usage of the current pool reaches a threshold, the data is either degraded to the next-level pool or an error is returned directly, and this is repeated recursively until a storage pool is selected or an error is returned. The pressure may be the read-write pressure of the back-end storage devices, the CPU pressure of the back-end servers, or the network pressure of the back-end heterogeneous storage cluster. Finally, the data is sliced and stored into the desired storage pool, or is migrated from the high-performance cache pool to the desired storage pool.
Under the custom storage strategy, a designated storage pool is selected for data storage according to user input; if the usage of that pool reaches a threshold, the storage either fails or is degraded to the next-level pool.
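The choice between streaming storage and staging in a cache pool can be sketched as follows. The feature names in `CONTENT_FEATURES`, the `ingest` function and the `classify` callback are hypothetical, and the in-memory join stands in for a write to the high-performance cache pool.

```python
# Features that can only be computed once file content is available (assumed names).
CONTENT_FEATURES = frozenset({"md5", "partial_data", "fixed_offset_data", "full_data"})

def ingest(chunks, metadata, model_features, classify):
    """Store an upload, streaming it when the model needs metadata only."""
    if CONTENT_FEATURES & set(model_features):
        # Content features: stage the whole upload first, then classify and migrate.
        staged = b"".join(chunks)
        pool = classify(metadata, staged)
        return {"pool": pool, "path": "staged_then_migrated", "bytes": len(staged)}
    # Metadata only: the pool is known before the first chunk is written.
    pool = classify(metadata, None)
    written = sum(len(chunk) for chunk in chunks)
    return {"pool": pool, "path": "streamed", "bytes": written}

def classify(metadata, content):
    # Toy stand-in for the trained model: small files are treated as hot.
    return "high_performance" if metadata["size_bytes"] < 1024 else "cold"

upload = [b"abc", b"defg"]
streamed = ingest(iter(upload), {"size_bytes": 7}, ["size_bytes", "mtime"], classify)
staged = ingest(iter(upload), {"size_bytes": 7}, ["size_bytes", "md5"], classify)
print(streamed["path"], staged["path"])
```

The streaming path avoids the extra write and later migration, which is the cost the text attributes to selecting content-based features.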

The classification model self-correction step comprises the following. For data stored under the automatic classification storage strategy, the access frequency and period of the data are acquired whenever it is read or written, and a normalized access-degree value is calculated and stored. Whether the normalized access-degree value is inconsistent with the level of the storage pool holding the data is then judged; if it is inconsistent and exceeds the threshold of the current pool, the features of the data, the normalized access-degree value, the current storage pool and the expected storage pool are stored as a data classification error record, which is inserted into a data classification error record table. After the data classification error record table exceeds the set retraining threshold, each data classification error record is checked to see whether it is still inconsistent in the current state; if a record has become consistent, it is deleted and the retraining threshold check is re-entered. If all records in the data classification error record table are still inconsistent, the stored data features and normalized access-degree values from the table are input into the classification model for offline training, and the classification strategy used by the current cluster is replaced once the offline training is complete. After the replacement, the current cluster pressure is polled; once the pressure of the heterogeneous cluster falls below a migration threshold, data is migrated to the corresponding expected storage pools according to the corrected model and the data classification error record table, and each data classification error record is deleted after its migration is complete.
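The retraining trigger and the pressure-gated migration can be sketched in Python as below. This is one reading of the step, under the assumption that records which have become consistent are pruned before the threshold is re-checked. The function names, the record layout and the default threshold values are hypothetical.

```python
def prune_and_check(records, still_mismatched, retrain_threshold):
    """Drop records that are consistent again; report whether retraining is due."""
    if len(records) <= retrain_threshold:
        return list(records), False
    remaining = [r for r in records if still_mismatched(r)]
    # Re-enter the threshold check with the pruned table.
    return remaining, len(remaining) > retrain_threshold

def migrate_when_idle(records, read_pressure, migrate,
                      migration_threshold=0.5, max_polls=10):
    """Poll cluster pressure; migrate and delete records once it is low enough."""
    for _ in range(max_polls):
        if read_pressure() < migration_threshold:
            while records:
                migrate(records[0])
                records.pop(0)  # delete the record after its migration
            return True
    return False

table = [
    {"id": 1, "mismatch": True},
    {"id": 2, "mismatch": False},  # has become consistent again
    {"id": 3, "mismatch": True},
]
remaining, retrain = prune_and_check(table, lambda r: r["mismatch"],
                                     retrain_threshold=1)

pressures = iter([0.9, 0.8, 0.3])  # simulated pressure readings
moved = []
done = migrate_when_idle(remaining, lambda: next(pressures), moved.append)
print(retrain, done, [r["id"] for r in moved])
```

A real deployment would wait between polls and run the offline training between the two calls; both are omitted here to keep the sketch short.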

Example 2

Referring to FIG. 2, on the basis of Example 1, a data self-classifying heterogeneous distributed storage system is provided, the system comprising a data classification model and a heterogeneous storage cluster;

the data classification model is used for classifying data; the input of the data classification model is information about the data to be saved by the user, and may be selected from metadata of the file, including but not limited to creation time, modification time, file size, file type and access rights; data of the file, including but not limited to the whole content of the file and a feature code; and the current pressure of the heterogeneous cluster, including but not limited to network pressure, disk pressure and CPU pressure; the output of the data classification model is the desired storage pool for the data;

the heterogeneous storage cluster is composed of storage nodes with different hardware, software and configurations, and includes, but is not limited to, clusters composed of different hard disks, clusters composed of different servers, and clusters composed of different redundancy policies;

an administrator divides the heterogeneous storage system into storage pools of different performance, including but not limited to a high-performance storage pool, a balanced storage pool and a cold storage pool; the administrator trains the data classification model in advance by selecting data features for the data that the heterogeneous distributed storage cluster is planned to store and inputting the data features into the data classification model for training, thereby obtaining the trained data classification model; a user stores data into the heterogeneous storage cluster, the desired storage pool is obtained through the data classification model, and the data is stored in the corresponding storage pool; and during operation, the data classification model is corrected according to data heat and according to storage pool usage and pressure.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A data self-classifying heterogeneous distributed storage method is characterized in that: the method comprises the following steps:

building a heterogeneous storage cluster;

pre-training a classification model;

data storage;

self-correcting the classification model.

2. The method for self-classifying heterogeneous distributed storage of data according to claim 1, wherein: the heterogeneous storage cluster building specifically comprises:

3. The method for self-classifying heterogeneous distributed storage of data according to claim 1, wherein: the classification model pre-training specifically comprises the following steps:

4. The method for self-classifying heterogeneous distributed storage of data according to claim 1, wherein: the data storage specifically comprises the following steps:

5. The method for self-classifying heterogeneous distributed storage of data according to claim 1, wherein: the self-correction of the classification model specifically comprises the following steps:

6. A data self-classifying heterogeneous distributed storage system for a data self-classifying heterogeneous distributed storage method according to any one of claims 1 to 5, wherein: the system comprises a data classification model and a heterogeneous storage cluster;

the data classification model is used for classifying data;

7. A data self-classifying heterogeneous distributed storage system according to claim 6, wherein: the input of the data classification model is information about the data to be saved by the user, and may be selected from: first, metadata of the file, including but not limited to creation time, modification time, file size, file type and access rights; second, data of the file, including but not limited to the whole content of the file and a feature code; and third, the current pressure of the heterogeneous cluster, including but not limited to network pressure, disk pressure and CPU pressure.

8. A data self-classifying heterogeneous distributed storage system according to claim 6, wherein: the output of the data classification model is the desired storage pool for the data.

9. A data self-classifying heterogeneous distributed storage system according to claim 6, wherein: heterogeneous storage clusters include, but are not limited to, clusters composed of different hard disks, clusters composed of different servers, and clusters composed of different redundancy policies.

10. A data self-classifying heterogeneous distributed storage system according to claim 6, wherein: an administrator divides the heterogeneous storage system into storage pools of different performance, including but not limited to a high-performance storage pool, a balanced storage pool and a cold storage pool; the administrator trains the data classification model in advance by selecting data features for the data that the heterogeneous distributed storage cluster is planned to store and inputting the data features into the data classification model for training, thereby obtaining the trained data classification model; a user stores data into the heterogeneous storage cluster, the desired storage pool is obtained through the data classification model, and the data is stored in the corresponding storage pool; and during operation, the data classification model is corrected according to data heat and according to storage pool usage and pressure.