CN116820883A - Intelligent disk monitoring and optimizing system and method based on deep reinforcement learning - Google Patents

Intelligent disk monitoring and optimizing system and method based on deep reinforcement learning

Info

Publication number
CN116820883A
CN116820883A (application CN202310783248.9A)
Authority
CN
China
Prior art keywords: disk, action, strategy, network, representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310783248.9A
Other languages
Chinese (zh)
Inventor
邵杰
苏薄
付骏峰
何鸿才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Higher Research Institute Of University Of Electronic Science And Technology Shenzhen
Original Assignee
Higher Research Institute Of University Of Electronic Science And Technology Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Higher Research Institute Of University Of Electronic Science And Technology Shenzhen filed Critical Higher Research Institute Of University Of Electronic Science And Technology Shenzhen
Priority to CN202310783248.9A priority Critical patent/CN116820883A/en
Publication of CN116820883A publication Critical patent/CN116820883A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447Performance evaluation by modeling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning

Abstract

The invention discloses an intelligent disk monitoring and optimizing system and method based on deep reinforcement learning, wherein the system comprises a health evaluation module, a strategy adjustment module and an optimizer; the strategy adjustment module comprises a strategy network, a target network and an experience playback buffer; the health evaluation module is used for acquiring the overall health level of the disk; the strategy network is used for acquiring corresponding actions and states according to the overall health level of the disk; the target network is used for acquiring a target state and the behavior value corresponding to a target action in the training stage; the experience playback buffer is used for storing data of the training stage; the optimizer is used for acquiring the loss function and updating the parameters of the strategy network based on the loss function. According to the invention, the optimal redundancy strategy and the disk cleaning period can be trained simultaneously by the reinforcement learning method, so that the self-adaptability and reliability of the system are enhanced, data is not easily lost and is easy to manage; the health condition of the disk is evaluated through deep learning, and the system is trained through reinforcement learning, so that the accuracy is improved.

Description

Intelligent disk monitoring and optimizing system and method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of data center management and disk health monitoring, in particular to a disk intelligent monitoring and optimizing system and method based on deep reinforcement learning.
Background
Disk health prediction is an important means of improving disk reliability and avoiding data loss, and many studies use machine learning techniques to predict disk failure from various features extracted from SMART (Self-Monitoring, Analysis and Reporting Technology) data.
Disk health prediction can be used to improve disk reliability by adjusting redundancy settings and disk cleaning, since it reflects the health condition and future trend of a disk. Disk adaptive redundancy is a technique for dynamically adjusting redundancy settings based on disk reliability in a clustered storage system. The current implementation uses a standard window-based change-point detection algorithm to adjust the redundancy settings and, compared with an active prediction method, suffers from poor timeliness and low prediction accuracy. Disk cleaning is the process of periodically reading disks to detect latent sector errors and repair them as far as possible. The current approach of setting a different cleaning rate for each disk, or even for different areas of each disk, can make the storage system harder to manage and can cause data inconsistencies during cleaning, leading to data loss or other problems. Moreover, existing schemes treat disk redundancy and disk cleaning as independent parts rather than as one integral system, so the accuracy and reliability of monitoring the disk health condition are low, and the data is easy to lose and difficult to manage.
Disclosure of Invention
Aiming at the defects in the prior art, the intelligent disk monitoring and optimizing system and method based on deep reinforcement learning provided by the invention solve the problems of low accuracy and reliability of monitoring the health condition of a disk, easy loss of data and difficult management of storage in the prior art.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
the intelligent disk monitoring and optimizing system based on deep reinforcement learning comprises a health evaluation module, a strategy adjustment module and an optimizer; the strategy adjustment module adopts a deep Q network model; the deep Q network model includes a policy network, a target network, and an experience playback buffer;
the health evaluation module is used for acquiring the overall health level of the magnetic discs of different brands;
the policy network is used for acquiring corresponding actions and states according to the overall health level of the magnetic disks of different brands; the actions are used for intelligent monitoring and optimizing of the magnetic disk;
the target network is used for acquiring a target state and a behavior value corresponding to a target action in a training stage;
an experience playback buffer for storing the status, actions, and rewards of the training phase;
and the optimizer is used for acquiring a loss function according to the outputs of the strategy network and the target network and updating parameters of the strategy network based on the loss function.
There is provided a monitoring and optimization method comprising the steps of:
s1, obtaining health scores of magnetic discs of different brands through a health evaluation module;
s2, constructing reinforcement learning intelligent agents and rewarding functions; initializing an environment, a strategy adjustment module, an optimizer, a reward list and the current training times;
s3, setting the episode reward to 0;
s4, obtaining the overall health level of the brand according to the health score of the magnetic disk of the specific brand; simulating a damaged portion of the disk according to the overall health level of the brand; initializing the current step number;
s5, generating a random number and judging whether the random number is smaller than the exploration rate, if so, entering a step S6; otherwise, entering step S7;
s6, randomly selecting an action; step S8 is entered;
s7, selecting the action with the maximum action value in the current state through a strategy network; step S8 is entered;
s8, executing actions in the environment to obtain the next state and rewards; storing the current state, the executed action and the acquired rewards to an experience playback buffer;
s9, judging whether the size of the experience playback buffer area is larger than or equal to a set value, if so, entering a step S10; otherwise, enter step S11;
s10, randomly extracting a batch of experience data in an experience playback buffer area, and respectively obtaining a target value and a predicted value through a target network and a strategy network; calculating a loss function between the target value and the predicted value according to the target value and the predicted value; parameter updating of the strategy network is carried out through the minimized loss function of the optimizer, and an updated strategy network is obtained;
s11, updating the current state and information of the reinforcement learning agent and the episode reward according to the next state and reward obtained in the step S8;
s12, judging whether the current step number reaches the maximum step number, if so, ending a round of training, adding the episode reward to the rewards list and entering step S13; otherwise, adding 1 to the current step number and returning to step S5;
s13, judging whether the current training times reach the maximum training times, if so, obtaining a trained intelligent disk monitoring and optimizing system and entering a step S14; otherwise, resetting the environment and the initial state, adding 1 to the current training times and returning to the step S3;
s14, deploying the trained intelligent disk monitoring and optimizing system to a data center system; obtaining the overall health level of the brand through a trained disk intelligent monitoring and optimizing system; and dynamically adjusting the corresponding disk redundancy strategy and the disk cleaning rate according to the overall health level to finish monitoring and optimizing.
Further, the health evaluation module adopts an LSTM neural network model; the LSTM neural network model comprises two LSTM layers and a full connection layer which are connected in series; each LSTM layer includes 128 LSTM cells; the fully connected layer comprises 4 neurons; the LSTM layer adopts a ReLU function as an activation function; the fully connected layer adopts a softmax function as an activation function;
the strategy network comprises an input layer, a hidden layer and an output layer; the target network comprises an input layer, a hidden layer and an output layer; the deep Q network model is trained by adopting a Q-Learning algorithm, and environment exploration is carried out by adopting an epsilon-greedy strategy.
Further, the specific steps of step S1 are as follows:
s1-1, acquiring original SMART data of a disk to be monitored in a period of time through a monitoring acquisition system; wherein the original SMART data includes negative sample data and positive sample data;
s1-2, processing original SMART data based on feature selection and feature processing to obtain SMART features and building a training set;
s1-3, classifying disk data of different brands and models; training the training set based on deep learning to obtain the health degree of the single disk;
s1-4, according to the formula:
obtaining the health score H of a specific brand and model; wherein n represents the number of hard disks of the brand, i represents the label, w_i represents the weight assigned to label i, and p_i represents the proportion of hard disks with label i in the brand.
Further, the formula of the reward function of step S2 is:
Reward = C_1 * (H / H_a) * MTTDL_diff + C_2 * (H_a / H) * MTTD_diff + C_3 * Space_save + C_4 * Cost_diff
wherein λ represents the disk failure rate, μ represents the disk repair rate, k represents the k redundant blocks obtained by encoding each block of data, n represents the total number of blocks, MTTDL represents the mean time to data loss, n_1 and k_1 respectively represent the number of coded blocks and the number of data blocks of the new redundancy method, n_2 and k_2 respectively represent the number of coded blocks and the number of data blocks of the old redundancy method, Space_save represents the saved storage space, Z_i represents an acceleration factor, N_i represents the number of disks, r represents the normal scrubbing rate, MTTD represents the mean detection time, T represents the time span, Σ(·) represents the summation function, Cost represents the cost, Reward represents the reward function, C_1, C_2, C_3 and C_4 represent hyper-parameters, H represents the health score, H_a represents the health score of a disk at the set alert level, MTTDL_diff represents the change in mean time to data loss, MTTD_diff represents the change in mean detection time, and Cost_diff represents the change in cost.
Further, the initial value of the exploration rate in the step S5 is set to be 1.0, and gradually decreases along with the continuous interaction between the reinforcement learning agent and the environment; the change process of the exploration rate is shown in the following formula:
wherein ε represents the exploration rate, ε_final represents the final exploration rate, ε_start represents the initial exploration rate, ε_decay represents the decay speed of the exploration rate, steps_done represents the number of steps the reinforcement learning agent has performed, and e represents a constant.
Further, in step S5, the generation of the random number adopts an epsilon-greedy strategy, and the specific method is as follows:
according to the formula:
acquiring the probability π_k(a|s) of selecting action a in the current state s, and generating the random number with π_k(a|s) as the probability that the generated random number is smaller than the exploration rate; wherein A represents the action set, a represents the action corresponding to the maximum immediate reward, a' represents an action, s represents a state, k represents the time step number, Q_k(s, a') represents the immediate reward for performing action a' in state s at time step k, and max_{a'∈A} Q_k(s, a') represents the maximum immediate reward value over actions a' in state s at time step k.
Further, in step S7, the calculation formulas of the values of the different actions in the current state are as follows:
Q(s, a) = E[ r + γ max_{a'} Q(s', a') | s, a ]
where s represents the current state, a represents the current action, r represents the reward obtained from the environment, γ represents the discount factor, Q(s, a) represents the expected total reward for taking action a in state s, s' represents the next state, a' represents the next action, max_{a'} Q(s', a') represents the maximum expected total reward, and E[·] represents the expectation in the Bellman optimality equation.
Further, the formula of the target value and the minimum loss function obtained by the optimizer in step S10 is as follows:
wherein L represents the loss function, N represents the number of samples, j represents the current step number, r_j represents the immediate return at step j, γ represents the discount factor, s' represents the next state, a' represents the next action, s_j represents the state at step j, a_j represents the action taken at step j, Q(s_j, a_j) represents the action value of the policy network for state s_j (the predicted value), Q(s', a') represents the action value of the target network for state s', π(a'|s') represents the probability that the policy network selects action a' in state s', and max_{a'} Q(s', a') represents the future reward value expected to be obtained by selecting the optimal action in the next state.
Further, step S1-2 compensates for the original SMART data imbalance using an undersampling method.
The beneficial effects of the invention are as follows: the intelligent disk monitoring and optimizing system brings the redundancy strategy and the cleaning period into one system through the reinforcement learning method, so that the optimal redundancy strategy and the disk cleaning period can be trained simultaneously, the self-adaptability and the reliability of the system are enhanced, and data are not easy to lose and easy to manage; according to the monitoring and optimizing method, the health condition of the magnetic disk is estimated through deep learning detection, and the magnetic disk intelligent monitoring and optimizing system is trained through reinforcement learning, so that the accuracy is improved.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art, but it should be understood that the present invention is not limited to the scope of these embodiments; for those skilled in the art, any invention that makes use of the inventive concept falls within the spirit and scope of the present invention as defined by the appended claims.
The intelligent disk monitoring and optimizing system based on deep reinforcement learning comprises a health evaluation module, a strategy adjustment module and an optimizer; the strategy adjustment module adopts a deep Q network model; the deep Q network model includes a policy network, a target network, and an experience playback buffer;
the health evaluation module is used for acquiring the overall health level of the magnetic discs of different brands;
the policy network is used for acquiring corresponding actions and states according to the overall health level of the magnetic disks of different brands; the actions are used for intelligent monitoring and optimizing of the magnetic disk;
the target network is used for acquiring a target state and a behavior value corresponding to a target action in a training stage;
an experience playback buffer for storing the status, actions, and rewards of the training phase;
and the optimizer is used for acquiring a loss function according to the outputs of the strategy network and the target network and updating parameters of the strategy network based on the loss function.
As shown in fig. 1, a monitoring and optimizing method includes the following steps:
s1, obtaining health scores of magnetic discs of different brands through a health evaluation module;
s2, constructing reinforcement learning intelligent agents and rewarding functions; initializing an environment, a strategy adjustment module, an optimizer, a reward list and the current training times;
s3, setting the episode reward to 0;
s4, obtaining the overall health level of the brand according to the health score of the magnetic disk of the specific brand; simulating a damaged portion of the disk according to the overall health level of the brand; initializing the current step number;
s5, generating a random number and judging whether the random number is smaller than the exploration rate, if so, entering a step S6; otherwise, entering step S7;
s6, randomly selecting an action; step S8 is entered;
s7, selecting the action with the maximum action value in the current state through a strategy network; step S8 is entered;
s8, executing actions in the environment to obtain the next state and rewards; storing the current state, the executed action and the acquired rewards to an experience playback buffer;
s9, judging whether the size of the experience playback buffer area is larger than or equal to a set value, if so, entering a step S10; otherwise, enter step S11;
s10, randomly extracting a batch of experience data in an experience playback buffer area, and respectively obtaining a target value and a predicted value through a target network and a strategy network; calculating a loss function between the target value and the predicted value according to the target value and the predicted value; parameter updating of the strategy network is carried out through the minimized loss function of the optimizer, and an updated strategy network is obtained;
s11, updating the current state and information of the reinforcement learning agent and the episode reward according to the next state and reward obtained in the step S8;
s12, judging whether the current step number reaches the maximum step number, if so, ending a round of training, adding the episode reward to the rewards list and entering step S13; otherwise, adding 1 to the current step number and returning to step S5;
s13, judging whether the current training times reach the maximum training times, if so, obtaining a trained intelligent disk monitoring and optimizing system and entering a step S14; otherwise, resetting the environment and the initial state, adding 1 to the current training times and returning to the step S3;
s14, deploying the trained intelligent disk monitoring and optimizing system to a data center system; obtaining the overall health level of the brand through a trained disk intelligent monitoring and optimizing system; and dynamically adjusting the corresponding disk redundancy strategy and the disk cleaning rate according to the overall health level to finish monitoring and optimizing.
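The procedure of steps S3 to S13 can be illustrated with the following minimal Python sketch. It is a simplified, hypothetical implementation rather than the patented method itself: the environment, the network dimensions (taken from the embodiment described later: an input of 10000, a hidden layer of 32, nine actions), the buffer size, batch size, learning rate and exploration schedule are all illustrative assumptions.

```python
import math
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 10000, 9      # sizes taken from the embodiment (10000 disks, actions 0-8)
GAMMA, BATCH, MAX_EPISODES, MAX_STEPS = 0.99, 64, 10, 50   # illustrative hyper-parameters


class DummyDiskEnv:
    """Stand-in environment: random health states and rewards, for illustration only."""

    def reset(self):
        return torch.rand(STATE_DIM)

    def step(self, action):
        # A real system would compute the reward with the Reward formula of step S2.
        return torch.rand(STATE_DIM), random.uniform(-1.0, 1.0)


def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))


env = DummyDiskEnv()
policy_net, target_net = make_net(), make_net()
target_net.load_state_dict(policy_net.state_dict())       # target starts with the same parameters
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
buffer, reward_list, steps_done = deque(maxlen=10000), [], 0

for episode in range(MAX_EPISODES):                        # S13: outer loop over training rounds
    state, episode_reward = env.reset(), 0.0               # S3/S4: reset and zero the episode reward
    for step in range(MAX_STEPS):                          # S12: inner loop over steps
        epsilon = 0.05 + 0.95 * math.exp(-steps_done / 500)   # decaying exploration rate
        steps_done += 1
        if random.random() < epsilon:                       # S5/S6: explore with a random action
            action = random.randrange(N_ACTIONS)
        else:                                               # S7: greedy action from the policy network
            with torch.no_grad():
                action = int(policy_net(state).argmax())
        next_state, reward = env.step(action)               # S8: act and observe
        buffer.append((state, action, reward, next_state))
        if len(buffer) >= BATCH:                            # S9/S10: sample a batch and update
            batch = random.sample(buffer, BATCH)
            s = torch.stack([b[0] for b in batch])
            a = torch.tensor([b[1] for b in batch])
            r = torch.tensor([b[2] for b in batch])
            s2 = torch.stack([b[3] for b in batch])
            q_pred = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():                            # target value from the target network
                q_target = r + GAMMA * target_net(s2).max(1).values
            loss = nn.functional.mse_loss(q_pred, q_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        state, episode_reward = next_state, episode_reward + reward   # S11: advance the state
    reward_list.append(episode_reward)                      # S12: record the episode reward
```

Note that, consistent with the embodiment described below, the target network in this sketch keeps its initial parameters and is never updated during training; only the policy network is optimized.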
The health evaluation module adopts an LSTM neural network model; the LSTM neural network model comprises two LSTM layers and a full connection layer which are connected in series; each LSTM layer includes 128 LSTM cells; the fully connected layer comprises 4 neurons; the LSTM layer adopts a ReLU function as an activation function; the fully connected layer adopts a softmax function as an activation function;
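A minimal PyTorch sketch of a network matching this description is given below (two stacked LSTM layers of 128 units, a ReLU activation, and a 4-neuron fully connected softmax output); the SMART feature width, sequence length, and class semantics are assumptions used only for illustration.

```python
import torch
import torch.nn as nn


class DiskHealthLSTM(nn.Module):
    """Health evaluation network: 2 LSTM layers x 128 units, 4-class softmax head."""

    def __init__(self, n_smart_features: int = 12):   # input width is an assumption
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_smart_features, hidden_size=128,
                            num_layers=2, batch_first=True)
        self.relu = nn.ReLU()                          # ReLU as the LSTM-layer activation
        self.fc = nn.Linear(128, 4)                    # 4 neurons in the fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time steps, SMART features)
        out, _ = self.lstm(x)
        out = self.relu(out[:, -1, :])                 # use the last time step
        return torch.softmax(self.fc(out), dim=-1)     # softmax over 4 health classes


# Example: score a batch of 8 disks, each with 30 days of SMART features
scores = DiskHealthLSTM()(torch.randn(8, 30, 12))
print(scores.shape)   # torch.Size([8, 4])
```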
the strategy network comprises an input layer, a hidden layer and an output layer; the target network comprises an input layer, a hidden layer and an output layer; the deep Q network model is trained by adopting a Q-Learning algorithm, and environment exploration is carried out by adopting an epsilon-greedy strategy.
The specific steps of the step S1 are as follows:
s1-1, acquiring original SMART data of a disk to be monitored in a period of time through a monitoring acquisition system; wherein the original SMART data includes negative sample data and positive sample data;
s1-2, processing original SMART data based on feature selection and feature processing to obtain SMART features and building a training set;
s1-3, classifying disk data of different brands and models; training the training set based on deep learning to obtain the health degree of the single disk;
s1-4, according to the formula:
obtaining the health score H of a specific brand and model; wherein n represents the number of hard disks of the brand, i represents the label, w_i represents the weight assigned to label i, and p_i represents the proportion of hard disks with label i in the brand.
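The formula image itself is not reproduced in this text. Based on the symbol definitions above, a weighted sum over the labels is one plausible reading; this reconstruction is an assumption, not the text of the patent:

```latex
H = \sum_{i} w_i \, p_i
```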
The formula of the bonus function of step S2 is:
Reward = C_1 * (H / H_a) * MTTDL_diff + C_2 * (H_a / H) * MTTD_diff + C_3 * Space_save + C_4 * Cost_diff
wherein λ represents the disk failure rate, μ represents the disk repair rate, k represents the k redundant blocks obtained by encoding each block of data, n represents the total number of blocks, MTTDL represents the mean time to data loss, n_1 and k_1 respectively represent the number of coded blocks and the number of data blocks of the new redundancy method, n_2 and k_2 respectively represent the number of coded blocks and the number of data blocks of the old redundancy method, Space_save represents the saved storage space, Z_i represents an acceleration factor, N_i represents the number of disks, r represents the normal scrubbing rate, MTTD represents the mean detection time, T represents the time span, Σ(·) represents the summation function, Cost represents the cost, Reward represents the reward function, C_1, C_2, C_3 and C_4 represent hyper-parameters, H represents the health score, H_a represents the health score of a disk at the set alert level, MTTDL_diff represents the change in mean time to data loss, MTTD_diff represents the change in mean detection time, and Cost_diff represents the change in cost.
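As a sketch, the reward of step S2 can be computed directly from the quantities named above; the function name, argument layout and default hyper-parameter values below are assumptions supplied for illustration.

```python
def compute_reward(h: float, h_alert: float,
                   mttdl_diff: float, mttd_diff: float,
                   space_save: float, cost_diff: float,
                   c1: float = 1.0, c2: float = 1.0,
                   c3: float = 1.0, c4: float = 1.0) -> float:
    """Reward = C1*(H/Ha)*MTTDL_diff + C2*(Ha/H)*MTTD_diff + C3*Space_save + C4*Cost_diff."""
    return (c1 * (h / h_alert) * mttdl_diff
            + c2 * (h_alert / h) * mttd_diff
            + c3 * space_save
            + c4 * cost_diff)


# Example with placeholder values
print(compute_reward(h=0.8, h_alert=0.5, mttdl_diff=0.2,
                     mttd_diff=-0.1, space_save=0.05, cost_diff=-0.02))
```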
Setting the initial value of the exploration rate in the step S5 to be 1.0, and gradually reducing along with continuous interaction between the reinforcement learning intelligent agent and the environment; the change process of the exploration rate is shown in the following formula:
wherein ε represents the exploration rate, ε_final represents the final exploration rate, ε_start represents the initial exploration rate, ε_decay represents the decay speed of the exploration rate, steps_done represents the number of steps the reinforcement learning agent has performed, and e represents a constant.
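The decay formula itself did not survive extraction; given the symbols above (ε_start, ε_final, ε_decay, steps_done and the constant e), a standard exponential decay of the following form is the likely intent. This reconstruction is an assumption:

```latex
\varepsilon = \varepsilon_{\mathrm{final}} + (\varepsilon_{\mathrm{start}} - \varepsilon_{\mathrm{final}})
\, e^{-\,\mathrm{steps\_done} / \varepsilon_{\mathrm{decay}}}
```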
In the step S5, the generation of the random number adopts an epsilon-greedy strategy, and the specific method is as follows:
according to the formula:
acquiring the probability π_k(a|s) of selecting action a in the current state s, and generating the random number with π_k(a|s) as the probability that the generated random number is smaller than the exploration rate; wherein A represents the action set, a represents the action corresponding to the maximum immediate reward, a' represents an action, s represents a state, k represents the time step number, Q_k(s, a') represents the immediate reward for performing action a' in state s at time step k, and max_{a'∈A} Q_k(s, a') represents the maximum immediate reward value over actions a' in state s at time step k.
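The formula referenced above is not reproduced in this text; the standard ε-greedy policy that the symbol definitions describe would assign the following probabilities (an assumed reconstruction):

```latex
\pi_k(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|A|}, & a = \arg\max_{a' \in A} Q_k(s, a') \\[4pt]
\dfrac{\varepsilon}{|A|}, & \text{otherwise}
\end{cases}
```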
In step S7, the calculation formulas of the values of the different actions in the current state are as follows:
Q(s, a) = E[ r + γ max_{a'} Q(s', a') | s, a ]
where s represents the current state, a represents the current action, r represents the reward obtained from the environment, γ represents the discount factor, Q(s, a) represents the expected total reward for taking action a in state s, s' represents the next state, a' represents the next action, max_{a'} Q(s', a') represents the maximum expected total reward, and E[·] represents the expectation in the Bellman optimality equation.
The formula of the target value and the minimum loss function obtained by the optimizer in step S10 is as follows:
wherein L represents the loss function, N represents the number of samples, j represents the current step number, r_j represents the immediate return at step j, γ represents the discount factor, s' represents the next state, a' represents the next action, s_j represents the state at step j, a_j represents the action taken at step j, Q(s_j, a_j) represents the action value of the policy network for state s_j (the predicted value), Q(s', a') represents the action value of the target network for state s', π(a'|s') represents the probability that the policy network selects action a' in state s', and max_{a'} Q(s', a') represents the future reward value expected to be obtained by selecting the optimal action in the next state.
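The target-value and loss formulas are not reproduced in this text. The standard deep Q-network forms consistent with the symbols defined above would be as follows; this is an assumed reconstruction, and the patent's exact expression (which also references the policy network's selection probability π(a'|s')) may differ in detail:

```latex
y_j = r_j + \gamma \max_{a'} Q_{\text{target}}(s', a'), \qquad
L = \frac{1}{N} \sum_{j=1}^{N} \bigl( y_j - Q(s_j, a_j) \bigr)^2
```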
Step S1-2 adopts an undersampling method to compensate for the imbalance of the original SMART data.
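Such undersampling can be sketched as follows; the DataFrame layout, label column name and sampling ratio are illustrative assumptions only.

```python
import pandas as pd


def undersample(smart_df: pd.DataFrame, label_col: str = "failure",
                ratio: float = 1.0, seed: int = 0) -> pd.DataFrame:
    """Randomly drop healthy-disk rows so that healthy:failed is at most `ratio`:1."""
    failed = smart_df[smart_df[label_col] == 1]
    healthy = smart_df[smart_df[label_col] == 0]
    n_keep = min(len(healthy), int(len(failed) * ratio))
    healthy_kept = healthy.sample(n=n_keep, random_state=seed)
    # Shuffle the balanced set before returning it as the training data
    return pd.concat([failed, healthy_kept]).sample(frac=1.0, random_state=seed)
```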
In one embodiment of the invention, the target network of the disk intelligent monitoring and optimizing system based on deep reinforcement learning is initialized with the same parameters as the policy network, but the target network's parameters are not updated while the system is trained. The size of the input layer of the policy network is set to 10000, i.e., the set number of disks; the hidden layer size of the policy network is set to 32; the size of the output layer of the policy network is the size of the action space, namely nine actions numbered 0 to 8, where 0 indicates that no operation is performed and 1 to 8 indicate different actions respectively. When the health score is calculated, a normal disk is given a lower weight and a failed disk a higher weight. In addition, there are indexes for judging the accuracy of the disk health score, namely precision (Precision), recall (Recall), the comprehensive evaluation index F-measure and the Matthews correlation coefficient MCC, and the corresponding formulas are as follows:
where TP represents true positives, i.e., the number of positive samples correctly classified by the model as the positive class; TN represents true negatives, i.e., the number of negative samples correctly classified as the negative class; FP represents false positives, i.e., the number of negative samples wrongly classified as the positive class; and FN represents false negatives, i.e., the number of positive samples wrongly classified as the negative class.
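The metric formulas themselves are not reproduced in this text; the standard definitions of the four named indexes in terms of TP, TN, FP and FN are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{F\text{-}measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
\qquad
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
```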
The four indexes of the reward function are the mean time to data loss (MTTDL), the saved storage space (Space_save), the mean detection time (MTTD) and the cost (Cost). The mean time to data loss (MTTDL) refers to the mean time the system can run normally before data loss occurs; it reflects the reliability of the system, and the greater the value, the higher the reliability of the system. The saved storage space (Space_save) is the proportion of the original data space that is saved; it reflects the storage efficiency of the system, and the higher the value, the higher the storage efficiency of the system. The mean detection time (MTTD) refers to the mean time spent by the system before a disk failure is found after the disk cleaning period is modified; it reflects the failure detection capability of the system, and the lower the value, the stronger the failure detection capability of the system. The cost (Cost) refers to the system consumption after modifying the disk cleaning cycle; it reflects the running cost of the system, and the lower the value, the lower the running cost of the system. The system consumption refers to the cost of performing disk cleaning, including disk life, energy consumption, performance loss, and the like.
The rewards list records the total reward earned by the reinforcement learning agent while learning and performing tasks in different environments. The exploration rate refers to the tendency of the agent to try new actions during learning; reducing the exploration rate over time prevents the performance degradation caused by excessive exploration. The intelligent disk monitoring and optimizing system based on deep reinforcement learning adjusts its behavior strategy by tracking the agent's performance in different states and recording the reward and the executed action for each state.
In summary, the redundancy strategy and the cleaning period are integrated into a system through the reinforcement learning method, so that the optimal redundancy strategy and the disk cleaning period can be trained simultaneously, the self-adaptability and the reliability of the system are enhanced, and the data are not easy to lose and are easy to manage; according to the monitoring and optimizing method, the health condition of the magnetic disk is estimated through deep learning detection, and the magnetic disk intelligent monitoring and optimizing system is trained through reinforcement learning, so that the accuracy is improved.

Claims (10)

1. A disk intelligent monitoring and optimizing system based on deep reinforcement learning is characterized in that: the system comprises a health evaluation module, a strategy adjustment module and an optimizer; the strategy adjustment module adopts a deep Q network model; the deep Q network model includes a policy network, a target network, and an experience playback buffer;
the health evaluation module is used for acquiring the overall health level of the magnetic discs of different brands;
the strategy network is used for acquiring corresponding actions and states according to the overall health level of the magnetic disks of different brands; the actions are used for intelligent monitoring and optimizing of the magnetic disk;
the target network is used for acquiring a target state and a behavior value corresponding to a target action in a training stage;
the experience playback buffer area is used for storing the state, action and rewards of the training stage;
and the optimizer is used for acquiring a loss function according to the output of the strategy network and the target network and updating the parameters of the strategy network based on the loss function.
2. The intelligent monitoring and optimization system for a disk based on deep reinforcement learning of claim 1, wherein: the health evaluation module adopts an LSTM neural network model; the LSTM neural network model comprises two LSTM layers and a full connection layer which are connected in series; each LSTM layer comprises 128 LSTM units; the fully-connected layer comprises 4 neurons; the LSTM layer adopts a ReLU function as an activation function; the fully connected layer adopts a softmax function as an activation function;
the strategy network comprises an input layer, a hidden layer and an output layer; the target network comprises an input layer, a hidden layer and an output layer; the deep Q network model is trained by adopting a Q-Learning algorithm, and environment exploration is carried out by adopting an epsilon-greedy strategy.
3. A method for monitoring and optimizing a disk intelligent monitoring and optimizing system based on deep reinforcement learning as set forth in any one of claims 1 or 2, characterized in that: the method comprises the following steps:
s1, obtaining health scores of magnetic discs of different brands through a health evaluation module;
s2, constructing reinforcement learning intelligent agents and rewarding functions; initializing an environment, a strategy adjustment module, an optimizer, a reward list and the current training times;
s3, setting the episode reward to 0;
s4, obtaining the overall health level of the brand according to the health score of the magnetic disk of the specific brand; simulating a damaged portion of the disk according to the overall health level of the brand; initializing the current step number;
s5, generating a random number and judging whether the random number is smaller than the exploration rate, if so, entering a step S6; otherwise, entering step S7;
s6, randomly selecting an action; step S8 is entered;
s7, selecting the action with the maximum action value in the current state through a strategy network; step S8 is entered;
s8, executing actions in the environment to obtain the next state and rewards; storing the current state, the executed action and the acquired rewards to an experience playback buffer;
s9, judging whether the size of the experience playback buffer area is larger than or equal to a set value, if so, entering a step S10; otherwise, enter step S11;
s10, randomly extracting a batch of experience data in an experience playback buffer area, and respectively obtaining a target value and a predicted value through a target network and a strategy network; calculating a loss function between the target value and the predicted value according to the target value and the predicted value; parameter updating of the strategy network is carried out through the minimized loss function of the optimizer, and an updated strategy network is obtained;
s11, updating the current state and information of the reinforcement learning agent and the episode reward according to the next state and reward obtained in the step S8;
s12, judging whether the current step number reaches the maximum step number, if so, ending a round of training, adding the episode reward to the rewards list and entering step S13; otherwise, adding 1 to the current step number and returning to step S5;
s13, judging whether the current training times reach the maximum training times, if so, obtaining a trained intelligent disk monitoring and optimizing system and entering a step S14; otherwise, resetting the environment and the initial state, adding 1 to the current training times and returning to the step S3;
s14, deploying the trained intelligent disk monitoring and optimizing system to a data center system; obtaining the overall health level of the brand through a trained disk intelligent monitoring and optimizing system; and dynamically adjusting the corresponding disk redundancy strategy and the disk cleaning rate according to the overall health level to finish monitoring and optimizing.
4. A method of monitoring and optimizing as claimed in claim 3, wherein: the specific steps of the step S1 are as follows:
s1-1, acquiring original SMART data of a disk to be monitored in a period of time through a monitoring acquisition system; wherein the original SMART data includes negative sample data and positive sample data;
s1-2, processing the original SMART data based on feature selection and feature processing to obtain SMART features and building a training set;
s1-3, classifying disk data of different brands and models; training the training set based on deep learning to obtain the health degree of the single disk;
s1-4, according to the formula:
obtaining the health score H of a specific brand and model; wherein n represents the number of hard disks of the brand, i represents the label, w_i represents the weight assigned to label i, and p_i represents the proportion of hard disks with label i in the brand.
5. A method of monitoring and optimizing as claimed in claim 3, wherein: the formula of the reward function of the step S2 is:
Reward = C_1 * (H / H_a) * MTTDL_diff + C_2 * (H_a / H) * MTTD_diff + C_3 * Space_save + C_4 * Cost_diff
wherein λ represents the disk failure rate, μ represents the disk repair rate, k represents the k redundant blocks obtained by encoding each block of data, n represents the total number of blocks, MTTDL represents the mean time to data loss, n_1 and k_1 respectively represent the number of coded blocks and the number of data blocks of the new redundancy method, n_2 and k_2 respectively represent the number of coded blocks and the number of data blocks of the old redundancy method, Space_save represents the saved storage space, Z_i represents an acceleration factor, N_i represents the number of disks, r represents the normal scrubbing rate, MTTD represents the mean detection time, T represents the time span, Σ(·) represents the summation function, Cost represents the cost, Reward represents the reward function, C_1, C_2, C_3 and C_4 represent hyper-parameters, H represents the health score, H_a represents the health score of a disk at the set alert level, MTTDL_diff represents the change in mean time to data loss, MTTD_diff represents the change in mean detection time, and Cost_diff represents the change in cost.
6. A method of monitoring and optimizing as claimed in claim 3, wherein: the initial value of the exploration rate in the step S5 is set to be 1.0, and gradually decreases along with continuous interaction between the reinforcement learning intelligent agent and the environment; the change process of the exploration rate is shown in the following formula:
wherein ε represents the exploration rate, ε_final represents the final exploration rate, ε_start represents the initial exploration rate, ε_decay represents the decay speed of the exploration rate, steps_done represents the number of steps the reinforcement learning agent has performed, and e represents a constant.
7. A method of monitoring and optimizing as claimed in claim 3, wherein: the random number generation in the step S5 adopts an epsilon-greedy strategy, and the specific method is as follows:
according to the formula:
acquiring the probability π_k(a|s) of selecting action a in the current state s, and generating the random number with π_k(a|s) as the probability that the generated random number is smaller than the exploration rate; wherein A represents the action set, a represents the action corresponding to the maximum immediate reward, a' represents an action, s represents a state, k represents the time step number, Q_k(s, a') represents the immediate reward for performing action a' in state s at time step k, and max_{a'∈A} Q_k(s, a') represents the maximum immediate reward value over actions a' in state s at time step k.
8. A method of monitoring and optimizing as claimed in claim 3, wherein: the calculation formula of the values of the different actions in the current state in the step S7 is as follows:
Q(s, a) = E[ r + γ max_{a'} Q(s', a') | s, a ]
where s represents the current state, a represents the current action, r represents the reward obtained from the environment, γ represents the discount factor, Q(s, a) represents the expected total reward for taking action a in state s, s' represents the next state, a' represents the next action, max_{a'} Q(s', a') represents the maximum expected total reward, and E[·] represents the expectation in the Bellman optimality equation.
9. A method of monitoring and optimizing as claimed in claim 3, wherein: the formula of the target value and the minimum loss function obtained by the optimizer in the step S10 is as follows:
wherein L represents the loss function, N represents the number of samples, j represents the current step number, r_j represents the immediate return at step j, γ represents the discount factor, s' represents the next state, a' represents the next action, s_j represents the state at step j, a_j represents the action taken at step j, Q(s_j, a_j) represents the action value of the policy network for state s_j (the predicted value), Q(s', a') represents the action value of the target network for state s', π(a'|s') represents the probability that the policy network selects action a' in state s', and max_{a'} Q(s', a') represents the future reward value expected to be obtained by selecting the optimal action in the next state.
10. A method of monitoring and optimizing as claimed in claim 3, wherein: the step S1-2 adopts an undersampling method to compensate the unbalance of the original SMART data.
CN202310783248.9A 2023-06-28 2023-06-28 Intelligent disk monitoring and optimizing system and method based on deep reinforcement learning Pending CN116820883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310783248.9A CN116820883A (en) 2023-06-28 2023-06-28 Intelligent disk monitoring and optimizing system and method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310783248.9A CN116820883A (en) 2023-06-28 2023-06-28 Intelligent disk monitoring and optimizing system and method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116820883A true CN116820883A (en) 2023-09-29

Family

ID=88125431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310783248.9A Pending CN116820883A (en) 2023-06-28 2023-06-28 Intelligent disk monitoring and optimizing system and method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116820883A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117575174A (en) * 2024-01-15 2024-02-20 山东环球软件股份有限公司 Intelligent agricultural monitoring and management system
CN117575174B (en) * 2024-01-15 2024-04-02 山东环球软件股份有限公司 Intelligent agricultural monitoring and management system

Similar Documents

Publication Publication Date Title
CN110413227B (en) Method and system for predicting remaining service life of hard disk device on line
JP7105932B2 (en) Anomaly detection using deep learning on time series data related to application information
CN110399238B (en) Disk fault early warning method, device, equipment and readable storage medium
US7493300B2 (en) Model and system for reasoning with N-step lookahead in policy-based system management
CN108647136A (en) Hard disk corruptions prediction technique and device based on SMART information and deep learning
CN111178553A (en) Industrial equipment health trend analysis method and system based on ARIMA and LSTM algorithms
CN116820883A (en) Intelligent disk monitoring and optimizing system and method based on deep reinforcement learning
CN108446734A (en) Disk failure automatic prediction method based on artificial intelligence
Blouw et al. Event-driven signal processing with neuromorphic computing systems
CN112433896B (en) Method, device, equipment and storage medium for predicting server disk faults
CN116683588B (en) Lithium ion battery charge and discharge control method and system
CN112446557B (en) Disk failure prediction evasion method and system based on deep learning
KR20210082349A (en) Method and apparatus for determining storage load of application
CN112988550A (en) Server failure prediction method, device and computer readable medium
Li et al. Prediction of HDD failures by ensemble learning
Wang et al. Evaluation and prediction method of rolling bearing performance degradation based on attention-LSTM
KR102480518B1 (en) Method for credit evaluation model update or replacement and apparatus performing the method
CN115617604A (en) Disk failure prediction method and system based on image pattern matching
KR20080087571A (en) Context prediction system and method thereof
CN114227701A (en) Robot fault prediction method based on production data
CN113268782A (en) Machine account identification and camouflage countermeasure method based on graph neural network
CN112395167A (en) Operation fault prediction method and device and electronic equipment
CN117473445B (en) Extreme learning machine-based equipment abnormality analysis method and device
CN117556221B (en) Data analysis method and system based on intelligent electrical control interaction session
CN116894658A (en) Method for predicting faults of internal and external equipment in warranty period based on attribute characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination