CN118012718B - Real-time monitoring method for distributed storage system - Google Patents

Real-time monitoring method for distributed storage system

Info

Publication number
CN118012718B
Authority
CN
China
Prior art keywords
data
monitoring
module
observable
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410391255.9A
Other languages
Chinese (zh)
Other versions
CN118012718A (en)
Inventor
刘爱贵
逯星飞
阮薛平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dadao Yunxing Technology Co ltd
Original Assignee
Beijing Dadao Yunxing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dadao Yunxing Technology Co ltd
Priority to CN202410391255.9A
Publication of CN118012718A
Application granted
Publication of CN118012718B
Legal status: Active
Anticipated expiration

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides a real-time monitoring method of a distributed storage system, which comprises an intelligent decision module, a data aggregation module and a monitoring extraction module, wherein the monitoring extraction module captures monitoring data from each storage node and performs preprocessing, conversion and storage on the data; the data aggregation module integrates and analyzes the data from the monitoring extraction module, performs data analysis and anomaly detection by using a Bayesian algorithm model and a random forest algorithm, and discovers and processes the anomaly in the system by aggregating and analyzing the monitoring data; the intelligent decision module makes an intelligent decision based on the analysis result of the data aggregation module, and uses a preset algorithm to make a decision scheme to cope with the running state and abnormal situation of the system. The invention has the beneficial effects that: the stability, reliability and performance of the system are effectively improved, powerful support is provided for system management and operation and maintenance work, and the business efficiency and competitiveness of enterprises are improved.

Description

Real-time monitoring method for distributed storage system
Technical Field
The invention belongs to the field of storage monitoring, and particularly relates to a real-time monitoring method of a distributed storage system.
Background
In distributed storage deployments, as the cluster grows, an administrator needs accurate control over how the cluster's resources are used, for example: the traffic patterns of the shares that the cluster exposes to clients; how traffic patterns compare between different shares; the health status of hardware resources in the cluster; and, as soon as a fault occurs, which hardware resource in the cluster is its source, as reported by the monitoring system.
Most existing tools for monitoring storage devices acquire information from block devices; the small amount of information acquired from the VFS layer or the XFS layer is system-wide, so the load of different file systems or directories cannot be distinguished.
Disclosure of Invention
In view of the foregoing, the present invention is directed to a method for real-time monitoring a distributed storage system, so as to solve at least one of the above-mentioned technical problems.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
The invention provides a real-time monitoring method of a distributed storage system, which comprises an intelligent decision module, a data aggregation module and a monitoring extraction module, wherein the monitoring extraction module captures monitoring data from each storage node and performs preprocessing, conversion and storage on the data;
the data aggregation module integrates and analyzes the data from the monitoring extraction module, performs data analysis and anomaly detection by using a Bayesian algorithm model and a random forest algorithm, and discovers and processes the anomaly in the system by aggregating and analyzing the monitoring data;
The intelligent decision module makes an intelligent decision based on the analysis result of the data aggregation module, and uses a preset algorithm to make a decision scheme to cope with the running state and abnormal situation of the system.
Further, the workflow of the monitoring and extracting module includes:
Communicating with each storage node over a network connection or by other means to obtain monitoring data, wherein the monitoring data comprises read-write bandwidth, IOPS, latency, system load, CPU utilization and memory utilization;
preprocessing the captured monitoring data, wherein the preprocessing comprises data cleaning, de-duplication, filling missing values, and converting the original monitoring data into a uniform data format.
Further, the workflow of the data aggregation module includes:
In the Bayesian algorithm model, the sample data is formed by machine self-learning training on the monitoring data captured from the storage nodes;
In the random forest algorithm model, the sample data is formed by sending detection messages through the storage nodes to build detection samples; before the detection samples are formed, the detection device is internally switched to a high-frequency detection mode;
The detection data describes the hardware resources of the distributed storage system and comprises system load, CPU utilization, memory usage and hard disk fragmentation.
Further, the Bayesian algorithm model calculates posterior probability according to the detection data and the sample data, and judges whether the monitoring data is abnormal or not;
abnormal monitoring data is captured and processed again, and a final aggregation result is obtained by weighted calculation;
and adjusting the parameters of the sample data and the Bayesian algorithm model in real time according to the utilization rate of the distributed storage system.
Further, the workflow of the intelligent decision module includes:
Receiving monitoring data from the data aggregation module, wherein the monitoring data comprises the analysis results of the Bayesian model and the random forest algorithm, and preprocessing the received data, wherein the preprocessing comprises data cleaning and denoising operations;
according to the requirements and the design of the system, a preset algorithm is used for making decisions, corresponding operations are executed according to different conditions, and corresponding decision schemes are formulated;
and after the decision is executed, the intelligent decision module monitors the effect and influence of the decision, and feeds back and adjusts the decision model according to the monitoring result.
Further, the process of performing data analysis and anomaly detection by using a Bayesian algorithm model and a random forest algorithm comprises the following steps:
In an initialization stage, collecting attribute range values of a client under different pressure levels as prior probabilities; collecting monitoring data, including read-write request conditions of different storage nodes and corresponding attribute data;
Calculating prior probabilities of normal and abnormal states of each client pressure level according to sample data, and calculating conditional probabilities of each attribute in the normal and abnormal states according to samples in monitoring data for each attribute;
According to the Bayesian theorem, combining the prior probability and the conditional probability, calculating the posterior probability that the storage node is in a normal state and an abnormal state under given monitoring data, and setting a threshold value according to the posterior probability to judge the health state of the storage node;
if the posterior probability of the abnormal state exceeds the threshold value, judging that the storage node is in the abnormal state; otherwise, judging that the state is normal; and identifying the health state of the storage node according to the health state judgment result.
Further, the intelligent decision module makes intelligent decisions based on the analysis result of the data aggregation module to judge the health state of the distributed storage system, specifically deducing the state from the characteristics of the observed data:
if the observable data is stable and high and the system load is heavy, a system busy state is deduced;
if the observable data is consistently high and the system load is low, a sub-healthy state with a slow disk is deduced;
if error request statistics or frequent peaks and troughs occur, hard disk bad blocks are deduced;
if the non-observable data shows a peak accompanied by a peak in system load, a busy state during a data-storm influx is deduced;
if the non-observable data is flat and the observable data has no increment, the storage node is deduced to be in a healthy state;
wherein the observable data is used to identify abnormal states of storage nodes and predict future states, and the non-observable data is used to describe normal operating states.
A second aspect of the present invention proposes an electronic device, comprising a processor and a memory communicatively connected to the processor and configured to store instructions executable by the processor, the processor being configured to perform a method for real-time monitoring of a distributed storage system according to any of the first aspects.
A third aspect of the present invention proposes a server comprising at least one processor and a memory communicatively coupled to the processor, the memory storing instructions executable by the at least one processor to cause the at least one processor to perform a method for real-time monitoring of a distributed storage system according to any of the first aspects.
A fourth aspect of the present invention proposes a computer readable storage medium storing a computer program which, when executed by a processor, implements a method for real-time monitoring of a distributed storage system according to any of the first aspects.
Compared with the prior art, the real-time monitoring method of the distributed storage system has the following beneficial effects:
The distributed monitoring system can monitor various indexes of the system in real time, and real-time early warning of abnormal conditions is realized through the intelligent decision module, so that problems can be found and solved in time, and the stability and reliability of the system are improved.
The intelligent decision-making module can automatically make a decision-making scheme according to the system state and a preset rule, so that the requirement of manual intervention is reduced, the decision-making efficiency and accuracy are improved, and the labor cost of system management is reduced.
The data aggregation module analyzes and detects the abnormality of the monitoring data by adopting a Bayesian model and a random forest algorithm, is beneficial to finding out the performance bottleneck and potential problems of the system, and provides data support for system optimization and performance improvement.
The distributed monitoring system can comprehensively monitor various indexes of the system, realize automatic decision and adjustment through the intelligent decision module, timely cope with abnormal conditions of the system, improve the stability and reliability of the system and reduce the occurrence rate of system faults.
By analyzing and deciding the monitoring data, the system can manage the system resources more effectively, optimize the resource allocation and utilization, and improve the resource utilization rate and performance of the system.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
Fig. 1 is a schematic diagram of an execution function of a real-time monitoring method of a distributed storage system according to an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
Embodiment one: the real-time monitoring method of the distributed storage system comprises an intelligent decision module, a data aggregation module and a monitoring extraction module, wherein the monitoring extraction module captures monitoring data from each storage node and performs preprocessing, conversion and storage on the data;
the data aggregation module integrates and analyzes the data from the monitoring extraction module, performs data analysis and anomaly detection by using a Bayesian algorithm model and a random forest algorithm, and discovers and processes the anomaly in the system by aggregating and analyzing the monitoring data;
The intelligent decision module makes an intelligent decision based on the analysis result of the data aggregation module, and uses a preset algorithm to make a decision scheme to cope with the running state and abnormal situation of the system.
The workflow of the monitoring and extracting module comprises the following steps:
Monitoring data is obtained by communicating with each storage node over a network connection or by other means, and includes read bandwidth, write bandwidth, read IOPS, write IOPS, read average latency and write average latency. Samples are kept for each request type: bandwidth, IOPS, minimum latency, maximum latency, average latency, total read-write latency and total number of reads and writes;
preprocessing the captured monitoring data, wherein the preprocessing comprises data cleaning, de-duplication, filling missing values, and converting the original monitoring data into a uniform data format.
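The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the record fields (`node`, `ts`, `read_bw`, `write_iops`), the rule that negative readings are invalid, and the fill-with-previous-value policy are all assumptions, since the patent does not specify them.

```python
from typing import Optional

# Hypothetical raw samples as they might arrive from storage nodes.
# Field names are illustrative assumptions, not taken from the patent.
RAW = [
    {"node": "n1", "ts": 1, "read_bw": "120.5", "write_iops": 300},
    {"node": "n1", "ts": 1, "read_bw": "120.5", "write_iops": 300},  # duplicate
    {"node": "n1", "ts": 2, "read_bw": None, "write_iops": 310},     # missing value
    {"node": "n1", "ts": 3, "read_bw": "-1", "write_iops": 320},     # invalid reading
]

METRICS = ("read_bw", "write_iops")


def to_float(value) -> Optional[float]:
    """Convert a raw reading to float; unparsable or negative values count as missing."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if number >= 0 else None


def preprocess(samples):
    """Clean, de-duplicate, fill missing values, and emit one uniform record format."""
    seen = set()
    last_good = {}  # (node, metric) -> last valid value, used to fill gaps
    result = []
    for sample in sorted(samples, key=lambda s: (s["node"], s["ts"])):
        key = (sample["node"], sample["ts"])
        if key in seen:  # de-duplicate on (node, timestamp)
            continue
        seen.add(key)
        clean = {"node": sample["node"], "ts": int(sample["ts"])}
        for metric in METRICS:
            value = to_float(sample.get(metric))
            if value is None:  # fill with the previous valid value, else 0.0
                value = last_good.get((sample["node"], metric), 0.0)
            last_good[(sample["node"], metric)] = value
            clean[metric] = value
        result.append(clean)
    return result


CLEANED = preprocess(RAW)
```

Every record in `CLEANED` has the same keys and float-valued metrics, which is one possible reading of "a uniform data format".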
The workflow of the data aggregation module comprises:
In the Bayesian algorithm model, sample data is formed by performing machine self-learning training on monitoring data captured by a storage node, wherein the sample data comprises prior probability and conditional probability;
The prior probability in the monitoring system represents the expected distribution situation of various states or indexes of the system, namely the estimated probability of various states of the system formed according to factors such as historical data, system design and experience before the system operates;
In a distributed monitoring system, a conditional probability is the probability that an event or state occurs given that another event or state is known to have occurred; specifically, in a monitoring system, a conditional probability may represent the probability distribution of the system being in, or entering, a certain state given a particular monitoring indicator or condition;
Example: suppose we want to calculate the probability that the system as a whole is abnormal when a read operation occurs at a certain storage node. This involves a conditional probability: "the system as a whole is abnormal" is the event of interest, and "a read operation occurs at a certain storage node" is the condition under which the probability of a system abnormality is calculated.
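As a worked illustration of this example, let A denote "the system as a whole is abnormal" and R denote "a read operation occurs at the storage node". The counts below are hypothetical and chosen only to make the arithmetic visible: out of 1000 observation intervals, 200 contain a read on the node, and 10 of those 200 also show a system abnormality.

```latex
P(A \mid R) = \frac{P(A \cap R)}{P(R)} = \frac{10/1000}{200/1000} = 0.05
```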
In the random forest algorithm model, the sample data is formed by sending detection messages through the storage nodes to build detection samples; before the detection samples are formed, the detection device is internally switched to a high-frequency detection mode;
The detection data describes the hardware resources of the distributed storage system and comprises system load, CPU utilization, memory usage, hard disk fragmentation, network card packet loss rate and hard disk utilization.
The detection data is taken as input, and the random forest algorithm model calculates hardware resource usage under different traffic patterns: $Y = \sum_{i} w_i f(X_i)$, where $Y$ represents the final output, $w_i$ represents the weight of each decision tree, and $f(X_i)$ represents the output of each decision tree.
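A minimal sketch of this combination step, reading the formula as a weighted sum over the decision trees (that reading is an assumption). The tree outputs and weights are hypothetical values; training of the trees themselves is not shown.

```python
def weighted_forest_output(tree_outputs, weights):
    """Combine per-tree outputs f(X_i) with weights w_i: Y = sum(w_i * f(X_i))."""
    if len(tree_outputs) != len(weights):
        raise ValueError("one weight per decision tree is required")
    return sum(w * f for w, f in zip(weights, tree_outputs))


# Hypothetical outputs of three trees estimating CPU utilization (percent).
TREE_OUTPUTS = [40.0, 50.0, 60.0]
WEIGHTS = [0.5, 0.25, 0.25]  # assumed to sum to 1

Y = weighted_forest_output(TREE_OUTPUTS, WEIGHTS)
```

With equal weights this reduces to the plain average used by a standard random forest regressor.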
If the detection data or the monitoring data is insufficient to complete training of the sample data, it is supplemented from subsequent traffic; both the detection data and the monitoring data show a rising trend as client pressure increases.
The Bayesian algorithm model calculates posterior probability according to the detection data and the sample data, and judges whether the monitoring data is abnormal or not;
abnormal monitoring data is captured and processed again, and a final aggregation result is obtained by weighted calculation;
The parameters of the sample data and the Bayesian algorithm model are adjusted in real time according to the utilization rate of the distributed storage system so as to adapt to the running change of the system; the detection data and the monitoring data grabbed by the storage nodes tend to be normally distributed along with the increase of the storage usage.
The workflow of the intelligent decision module comprises:
Receiving monitoring data from the data aggregation module, wherein the monitoring data comprises the analysis results of the Bayesian model and the random forest algorithm, and preprocessing the received data, wherein the preprocessing comprises data cleaning and denoising operations;
according to the requirements and the design of the system, a preset algorithm is used for making decisions, corresponding operations are executed according to different conditions, and corresponding decision schemes are formulated;
and after the decision is executed, the intelligent decision module monitors the effect and influence of the decision, and feeds back and adjusts the decision model according to the monitoring result.
The process of carrying out data analysis and anomaly detection by using the Bayesian algorithm model and the random forest algorithm comprises the following steps:
In an initialization stage, collecting attribute range values of a client under different pressure levels as prior probabilities; collecting monitoring data, including read-write request conditions of different storage nodes and corresponding attribute data;
Calculating prior probabilities of normal and abnormal states of each client pressure level according to sample data, and calculating conditional probabilities of each attribute in the normal and abnormal states according to samples in monitoring data for each attribute;
According to the Bayesian theorem, combining the prior probability and the conditional probability, calculating the posterior probability that the storage node is in a normal state and an abnormal state under given monitoring data, and setting a threshold value according to the posterior probability to judge the health state of the storage node;
if the posterior probability of the abnormal state exceeds the threshold value, judging that the storage node is in the abnormal state; otherwise, judging that the state is normal; and identifying the health state of the storage node according to the health state judgment result.
The Bayesian algorithm is specifically: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
where $P(A \mid B)$ is the posterior probability, $P(A)$ is the prior probability, $P(B)$ is the probability of the observed attribute value of the monitoring data, and $P(B \mid A)$ is the conditional probability.
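The posterior calculation and threshold test described above can be sketched as follows. Treating the attributes as conditionally independent (a naive Bayes assumption) is not stated in the patent, and the prior, likelihoods and threshold below are hypothetical numbers.

```python
def posterior_abnormal(prior_abnormal, likelihoods_abnormal, likelihoods_normal):
    """Return P(abnormal | observed attributes) using Bayes' theorem.

    likelihoods_abnormal[i] is P(attribute_i value | abnormal) and
    likelihoods_normal[i] is P(attribute_i value | normal). Attributes are
    assumed conditionally independent, which the patent does not state.
    """
    joint_abnormal = prior_abnormal
    joint_normal = 1.0 - prior_abnormal
    for p_abn, p_norm in zip(likelihoods_abnormal, likelihoods_normal):
        joint_abnormal *= p_abn
        joint_normal *= p_norm
    evidence = joint_abnormal + joint_normal  # P(B), by total probability
    if evidence == 0.0:
        raise ValueError("observation has zero probability under both states")
    return joint_abnormal / evidence


THRESHOLD = 0.5  # hypothetical decision threshold

# Two attributes (for example latency and IOPS) observed on one storage node.
POSTERIOR = posterior_abnormal(0.1, [0.8, 0.5], [0.2, 0.5])
IS_ABNORMAL = POSTERIOR > THRESHOLD
```

Here the low prior for the abnormal state outweighs the latency evidence, so the node is still judged normal; a stronger prior or stronger evidence would push the posterior over the threshold.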
Each storage node builds samples of attributes (bandwidth, IOPS, minimum latency, maximum latency, average latency, total READ/WRITE latency, total number of READs/WRITEs) for different types of requests (READ/WRITE);
Samples are calculated for the attribute members of each request type from the captured data in the monitoring data;
the data acquisition module calculates range values of corresponding attributes under different pressures of the client when the sample is initialized:
client pressure 1: sample data (attribute): corresponding range values (5-10);
client pressure 2: sample data (attribute): corresponding range values (10-15);
client pressure n: sample data (attribute): corresponding range values (15-20);
Also during sample initialization, a large period is divided into several small periods, and within each small period the number of samples falling in each range is recorded in detail. For example, if bandwidth is sampled 100 times in the first small period, 80 samples fall in the normal range of 100-200 and the remaining 20 fall in other ranges, that small period is judged normal. For each small period, it is determined whether the samples falling in the normal range of the attribute form the larger share: if they are the majority, the period is normal; if they are the minority, it is abnormal. The prior probabilities of the attribute (normal state and abnormal state) are then calculated from the results of all small periods and converted into conditional probabilities.
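The majority rule over small periods can be sketched as below. Estimating the prior as the fraction of small periods judged normal is an assumption about how "calculated from all small periods" is meant; the sample values are hypothetical apart from the 80-of-100 case taken from the example above.

```python
def period_is_normal(values, low, high):
    """A small period is normal when most of its samples fall inside [low, high]."""
    in_range = sum(1 for v in values if low <= v <= high)
    return in_range * 2 > len(values)


def estimate_priors(periods, low, high):
    """Return (P(normal), P(abnormal)) as the fraction of small periods in each state."""
    if not periods:
        raise ValueError("at least one small period is required")
    normal = sum(1 for p in periods if period_is_normal(p, low, high))
    p_normal = normal / len(periods)
    return p_normal, 1.0 - p_normal


# One large period split into four small periods of 100 bandwidth samples each.
PERIODS = [
    [150] * 80 + [250] * 20,  # 80 of 100 in range: normal (the example above)
    [150] * 30 + [300] * 70,  # only 30 of 100 in range: abnormal
    [120] * 95 + [90] * 5,    # normal
    [180] * 60 + [210] * 40,  # normal
]

P_NORMAL, P_ABNORMAL = estimate_priors(PERIODS, low=100, high=200)
```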
The attribute values of all monitoring data are substituted into the formula to obtain posterior probabilities, which are averaged to obtain a mean probability;
If any captured item is abnormal, it is captured again immediately; after confirmation, the aggregation result output by the data aggregation module is calculated from the probability values of all captured items, each of which carries a certain weight;
If a captured attribute value obtained by the data aggregation module exceeds the range of the sample data for that attribute in the sample model by a factor of one (that is, by 100%), the anomaly weight of the current sample is doubled as an anomaly compensation index, because normal data may also be present in the statistical period in which the anomaly occurs.
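A rough sketch of the weighted aggregation with anomaly compensation. The interpretation that "higher by one time" means the captured value reaches at least twice the upper bound of the sample range is an assumption, as are the field names and numbers.

```python
def aggregate(items):
    """Weighted average of per-item anomaly probabilities.

    Each item is a dict with keys: prob (anomaly probability), weight,
    value (captured attribute value) and range_high (upper bound of the
    sample range). All of these names are hypothetical.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for item in items:
        weight = item["weight"]
        # Anomaly compensation: double the weight when the captured value is
        # at least twice the upper bound of the sample range (assumed reading).
        if item["value"] >= 2 * item["range_high"]:
            weight *= 2
        total_weight += weight
        weighted_sum += weight * item["prob"]
    if total_weight == 0:
        raise ValueError("total weight must be positive")
    return weighted_sum / total_weight


ITEMS = [
    {"prob": 0.75, "weight": 0.5, "value": 45.0, "range_high": 20.0},    # latency, far out of range
    {"prob": 0.25, "weight": 0.5, "value": 150.0, "range_high": 200.0},  # bandwidth, in range
]

RESULT = aggregate(ITEMS)
```

Without the compensation the two items would average to 0.5; doubling the weight of the out-of-range item pulls the aggregate toward its anomaly probability.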
Examples:
step1, if the delay index of the captured data is relatively stable, all are as long as possible;
step2, the captured data is combined with the sample model for deep sample training, which finally yields the normal operating range of the sample when the detection messages are normal;
step3, if the captured data is on the high side, the detection data is extracted; if the detection messages are stable, step2 is performed again for deep training;
step4, if the captured data is on the high side and the extracted detection data is also abnormal, alarm information is summarized and generated;
step5, if the captured data is on the low side, possibly because business pressure is low, step2 is performed again for deep training;
step6, the detection data is checked for abnormal fluctuation; if the detection data is steady, normal or only slightly fluctuating, the change is attributed to an increase in client business;
Whether the current detection data is abnormal is judged according to the storage traffic pattern fed into the random forest model; if the detection data sample is above the normal level, a reminder is issued to watch the captured data closely;
The level of the current captured data within the sample data is repeatedly confirmed, and if it differs in level, alarm information is summarized and generated.
The intelligent decision module makes intelligent decisions based on the analysis result of the data aggregation module to judge the health state of the distributed storage system, specifically deducing the state from the characteristics of the observed data:
if the observable data is stable and high and the system load is heavy, a system busy state is deduced;
if the observable data is consistently high and the system load is low, a sub-healthy state with a slow disk is deduced;
if error request statistics or frequent peaks and troughs occur, hard disk bad blocks are deduced;
if the non-observable data shows a peak accompanied by a peak in system load, a busy state during a data-storm influx is deduced;
if the non-observable data is flat and the observable data has no increment, the storage node is deduced to be in a healthy state;
wherein the observable data is used to identify abnormal states of storage nodes and predict future states, and the non-observable data is used to describe normal operating states.
A specific example is given below to explain the decision process of the intelligent decision module in detail:
The observation data is collected on the back-end storage nodes of the distributed file system, and the flow is designed as follows:
Observation data is labeled after a read or write request completes; the statistics module in the store process identifies the observable data;
In the fop_write flow, after the write request has been sent to the underlying file system managed by the store process and the write result has been returned to the client, identification of the observation data begins;
The parent directory node is located in reverse from the inode information of the file operated on by the read or write request; it is then determined whether that parent directory node is an observation monitoring object configured by the client; if so, the read-write increment of the current file is added to the inode of the observable object;
Because a customer may configure multiple observable objects related as parent and child or as ancestor and descendant, the parent of the parent directory must in turn be located from the parent directory's inode information, and so on up to the top-level root directory, at which point the observation and identification task for the current fop_write ends; the identification procedure for fop_read is the same as for fop_write.
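The upward walk through parent directories can be sketched as follows. The `Inode` class and its fields are hypothetical stand-ins for the file system's inode structure, and the function models only the accounting step that follows a completed read or write, not the request path itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inode:
    """Hypothetical in-memory inode; field names are illustrative only."""
    name: str
    parent: Optional["Inode"] = None
    observable: bool = False  # True if the client configured this directory as monitored
    read_bytes: int = 0
    write_bytes: int = 0


def record_io(file_inode, nbytes, is_write):
    """Add the I/O increment to every observable ancestor, up to the root directory."""
    updated = []
    node = file_inode.parent
    while node is not None:  # stops once the walk has passed the root directory
        if node.observable:
            if is_write:
                node.write_bytes += nbytes
            else:
                node.read_bytes += nbytes
            updated.append(node.name)
        node = node.parent
    return updated


ROOT = Inode("/")
PROJECTS = Inode("projects", parent=ROOT, observable=True)
TEAM_A = Inode("team-a", parent=PROJECTS, observable=True)
REPORT = Inode("report.dat", parent=TEAM_A)

UPDATED = record_io(REPORT, 4096, is_write=True)
```

Because both `team-a` and `projects` are observable, a single write is counted against both, which mirrors the nested observable objects described above.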
Non-observable data: the collected data mainly describe the state of the distributed storage system in the normal operation process;
observable data: the data are mainly used for proving whether the back-end storage node of the current distributed system is in an abnormal state or not;
the designed statistical proportion of non-observable data to observable data is 7:3;
non-observable data accounts for the larger share (70 percent); it is mainly used to abstract, on each back-end storage node of the distributed system, an intelligent non-observable scale model that describes the node's health state under the current traffic conditions; the model is a scale for measuring whether the storage node is healthy; this scale is not fixed and changes in real time with the storage pressure;
Storage bandwidth and IOPS increase with client traffic, so the data volume recorded by the intelligent non-observable scale model also keeps growing; as client traffic changes, the model changes with it;
observable data accounts for the smaller share (30 percent), because it is mainly used to identify whether a back-end storage node of the distributed system is abnormal and to predict whether the storage node can run in a healthy state in the future; only a small proportion of anomaly data is therefore needed to deduce storage anomalies.
The client can configure anomaly indexes for the back-end storage nodes of the distributed system. When data exceeding an anomaly index appears, it is first assigned to the range set of observable data; the abnormal data is then passed to an intelligent anomaly grading algorithm model for risk grading; and an intelligent observable scale model describing the abnormal storage state is abstracted from the graded abnormal data.
The intelligent non-observable scale model and the intelligent observable scale model record the bandwidth, IOPS and request counts of the current storage node, together with latency (min, max, avg), system load, CPU utilization and similar information;
The intelligent decision system for the health state of the distributed back-end storage nodes judges health from the data sets provided by the intelligent observable scale model and the intelligent non-observable scale model: it performs quantitative deduction from the latency statistics of the observable data set, the system load, the pattern of change in CPU utilization, and the non-observable data set;
the observable data is stable and high, the system load is found to be heavy, and the curves abstracted from the two types of observation data lie close together; the anomaly may be caused by resource contention; the current storage node is then deduced to be in a busy state;
compared with the preset anomaly index, the observable data is consistently high; the curves abstracted from the two types of observation data lie close together; the measured system load is relatively low; and there are no error request statistics. The current storage node is then deduced to be in a sub-healthy state with a slow disk;
In AFR and EC, request error statistics appear for a storage node without service being affected, or the observable data set shows frequent peaks and troughs; a sub-healthy state with possible hard disk bad blocks is then deduced;
If the observable data is consistently above the preset anomaly index and error request statistics exist, an unhealthy state requiring urgent manual intervention is deduced;
In the staged statistics of non-observable data, a peak appears against an otherwise nearly flat curve; the system load peaks at the same time; there are no error statistics; and the peak then gradually subsides. A busy state during a data-storm influx is then deduced;
The staged average, maximum and minimum of the non-observable data tend to be flat with occasional slight fluctuation, and the observable data shows no increment; the storage node is then deduced to be in a healthy state;
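The deduction rules above can be sketched as a small rule table. The boolean inputs are assumed to have been derived from the scale models beforehand, the field names and state labels are hypothetical, and the order in which the rules are checked is a design assumption, since the patent lists the rules without giving a priority.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical summary flags derived from the two scale models."""
    observable_high: bool = False      # observable data above the configured anomaly index
    observable_stable: bool = False    # observable data shows little fluctuation
    observable_growing: bool = False   # observable data has an increment
    frequent_peaks: bool = False       # frequent peaks and troughs in observable data
    load_heavy: bool = False           # system load is heavy
    error_requests: bool = False       # error request statistics exist
    non_observable_peak: bool = False  # peak in the staged non-observable statistics


def deduce_state(o):
    """Map observation flags to a health state; rule order is an assumption."""
    if o.observable_high and o.error_requests:
        return "unhealthy: manual intervention required"
    if o.error_requests or o.frequent_peaks:
        return "sub-healthy: possible bad blocks"
    if o.observable_high and o.observable_stable and o.load_heavy:
        return "busy"
    if o.observable_high and not o.load_heavy:
        return "sub-healthy: slow disk"
    if o.non_observable_peak and o.load_heavy:
        return "busy: data-storm influx"
    if not o.observable_growing:
        return "healthy"
    return "undetermined"


SLOW_DISK = deduce_state(Observation(observable_high=True, observable_stable=True))
STORM = deduce_state(Observation(non_observable_peak=True, load_heavy=True))
HEALTHY = deduce_state(Observation())
```

Checking the error-bearing rules first keeps the most severe states from being masked by the busy or slow-disk rules; a different priority would be equally consistent with the text.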
After the health state has been judged, the back-end storage nodes of the distributed system are marked with their health state. Observable data is acquired in the cbk (callback) of the lookup request, since a client necessarily sends a lookup request for a file when accessing it;
When a lookup request is sent for a directory entry and the current directory is configured as an observable monitoring member, the observable flag stored in the file's extended attributes is carried with the lookup.
And synchronously recording the observable mark in the inode of the file in the wakeup_ cbk, and acquiring the observable data when the server side receives the observable data. Judging whether the member is an observable member according to the inode information, and if the member is an observable member according to the inode, returning the observable data recorded in the inode to the client;
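The flag-carrying lookup described above can be illustrated with a toy in-memory model. It is a simplified sketch, not the storage system's actual code: the extended-attribute key `user.observable`, the `Inode` and `Server` classes and the function `client_lookup` are all hypothetical, and the client-side callback and the server-side handling are collapsed into one call.

```python
from dataclasses import dataclass, field

OBSERVABLE_XATTR = "user.observable"  # hypothetical extended-attribute key


@dataclass
class Inode:
    observable: bool = False
    observable_data: dict = field(default_factory=dict)


@dataclass
class Server:
    inodes: dict = field(default_factory=dict)  # path -> Inode

    def lookup(self, path: str, xattrs: dict) -> dict:
        inode = self.inodes.setdefault(path, Inode())
        # Record the flag carried by the lookup request in the inode.
        if xattrs.get(OBSERVABLE_XATTR):
            inode.observable = True
        # Only observable members get their recorded data returned.
        return dict(inode.observable_data) if inode.observable else {}


def client_lookup(server: Server, path: str, directory_is_monitored: bool) -> dict:
    # The flag is attached only when the directory is a monitoring member.
    xattrs = {OBSERVABLE_XATTR: True} if directory_is_monitored else {}
    return server.lookup(path, xattrs)
```

In this model a lookup from an unmonitored directory returns an empty result, while a lookup carrying the flag both marks the inode and returns the data recorded in it.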
The framework abstraction of the observation monitoring data is described as follows:
The basic type of observation data consists of data collected from several back-end storage nodes of the distributed storage system; the final observation data may consist of several sets of observation data;
It is therefore impractical to design a separate observation-data instantiation algorithm for each volume type. Instead, a pyramid-like client model is adopted: the top layer records label information, which specifies the location of the current monitoring data; that location in turn records the monitoring source data of the different back-end storage nodes, so the model is compatible with information instantiation for all volume types;
The structural information of the distributed storage observable data is described in detail below:
Each data set has a type field recording whether the current data set is an observable data set or collected raw data; a length field recording the length of the current data set; and a count field recording the number of observable data sets it contains;
The raw-data set records the base bandwidth, the increase or decrease in IOPS, and the time interval between the previous and the current acquisition; it also records, for each request type, information such as maximum latency, minimum latency, average latency, request increment and number of requests processed;
The raw-data set also records its own length, a type (AFR, EC, DHT, TIER) identifying the current data set, and the number of raw acquired data items;
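The data-set layout described above might be modelled as follows. This is a minimal sketch assuming in-memory Python objects; the class and field names (`RawDataSet`, `ObservableDataSet`, `RequestStats`, `label`, `interval_s` and so on) are hypothetical, and the length field is left out because it depends on a serialization format that the description does not give.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class VolumeType(Enum):
    AFR = "AFR"
    EC = "EC"
    DHT = "DHT"
    TIER = "TIER"


@dataclass
class RequestStats:
    max_latency: float
    min_latency: float
    avg_latency: float
    request_increment: int
    requests_processed: int


@dataclass
class RawDataSet:
    volume_type: VolumeType  # type of the current data set
    bandwidth: float         # base bandwidth
    iops_delta: int          # increase or decrease in IOPS
    interval_s: float        # time between the previous and current acquisition
    per_request: Dict[str, RequestStats] = field(default_factory=dict)

    @property
    def count(self) -> int:
        # Number of raw acquired data items (one per request type here).
        return len(self.per_request)


@dataclass
class ObservableDataSet:
    label: str  # top-layer label locating the monitoring data
    members: List[RawDataSet] = field(default_factory=list)

    @property
    def count(self) -> int:
        # Number of data sets recorded under this label.
        return len(self.members)
```

Keeping the label at the top and the per-node raw data underneath mirrors the pyramid-like model: the same container works whichever volume type the member data sets carry.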
When the monitord aggregation system receives observable data, it first returns the data to the CLI module for display to the client; the aggregation state is also recorded in monitord, and all aggregated observation data is automatically recorded and summarized;
While being summarized, the aggregated data is split and managed uniformly. The monitord system can detect information such as bandwidth, IOPS, latency, load and CPU utilization returned by the storage nodes; once the number of aggregations reaches a certain count, the changes are abstracted into a state analysis model diagram with a recognizable pattern of change;
After the model diagram is constructed, the data aggregation module can compare it against the historical data models in the analysis model; if the difference is large, the module re-queries the back-end storage system of the distributed system to verify the information, which effectively avoids misjudging the state of a storage node and allows the state to be corrected.
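The comparison against historical models can be sketched as a simple deviation check. The description does not define how the difference is measured, so the relative deviation of window means and the `tolerance` value used below are assumptions, as are the function names `summarize` and `needs_recheck`.

```python
from statistics import mean


def summarize(samples):
    """Reduce one aggregation window to (mean, minimum, maximum)."""
    return (mean(samples), min(samples), max(samples))


def needs_recheck(current, history, tolerance=0.5):
    """Return True when the current window's mean deviates from the
    historical mean by more than `tolerance` (relative deviation)."""
    if not history:
        return False  # nothing to compare against yet
    hist_mean = mean(h[0] for h in history)
    if hist_mean == 0:
        return current[0] != 0
    return abs(current[0] - hist_mean) / abs(hist_mean) > tolerance
```

When `needs_recheck` returns True, the caller would query the back-end storage nodes again before committing a state change, which is the correction step the paragraph describes.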
Embodiment two: an electronic device comprising a processor and a memory communicatively coupled to the processor and configured to store instructions executable by the processor, the processor configured to perform a method for real-time monitoring of a distributed storage system as described in the first embodiment.
Embodiment three: a server comprising at least one processor and a memory communicatively coupled to the processor, the memory storing instructions executable by the at least one processor to cause the at least one processor to perform a method of real-time monitoring of a distributed storage system as described in embodiment one.
Embodiment four: a computer readable storage medium storing a computer program which when executed by a processor implements a method for real-time monitoring of a distributed storage system according to embodiment one.
Those of ordinary skill in the art will appreciate that the elements and method steps of each example described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the elements and steps of each example have been described generally in terms of functionality in the foregoing description to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and systems may be implemented in other ways. For example, the above-described division of units is merely a logical function division, and there may be another division manner when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. The units may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (8)

1. A real-time monitoring method for a distributed storage system, comprising an intelligent decision module, a data aggregation module and a monitoring extraction module, characterized in that:
The monitoring extraction module captures monitoring data from each storage node and performs preprocessing, conversion and storage on the data;
the data aggregation module integrates and analyzes the data from the monitoring extraction module, performs data analysis and anomaly detection by using a Bayesian algorithm model and a random forest algorithm, and discovers and processes the anomaly in the system by aggregating and analyzing the monitoring data;
the intelligent decision module carries out intelligent decision based on the analysis result of the data aggregation module, and uses a preset algorithm to make a decision scheme to cope with the running state and abnormal situation of the system;
the process of carrying out data analysis and anomaly detection by using the Bayesian algorithm model and the random forest algorithm comprises the following steps:
In an initialization stage, collecting attribute range values of a client under different pressure levels as prior probabilities; collecting monitoring data, including read-write request conditions of different storage nodes and corresponding attribute data;
Calculating prior probabilities of normal and abnormal states of each client pressure level according to sample data, and calculating conditional probabilities of each attribute in the normal and abnormal states according to samples in monitoring data for each attribute;
According to the Bayesian theorem, combining the prior probability and the conditional probability, calculating the posterior probability that the storage node is in a normal state and an abnormal state under given monitoring data, and setting a threshold value according to the posterior probability to judge the health state of the storage node;
If the posterior probability of the abnormal state exceeds the threshold value, judging that the storage node is in the abnormal state; otherwise, judging that the state is normal; and according to the health status judging result, the health status of the storage node is marked;
the workflow of the data aggregation module comprises:
In the Bayesian algorithm model, sample data is formed by performing machine self-learning training on monitoring data captured by a storage node;
In the random forest algorithm model, sample data sends a detection message through a storage node and forms a detection sample, and before the detection sample is formed, the inside of a detection device is adjusted to a high-frequency detection mode;
The detection data are hardware resources of the distributed storage system, and comprise system load, CPU utilization rate, memory use condition and hard disk fragmentation degree;
And the Bayesian algorithm model calculates posterior probability according to the detection data and the sample data, and judges whether the monitoring data is abnormal or not.
2. The method for real-time monitoring of a distributed storage system according to claim 1, wherein:
the workflow of the monitoring and extracting module comprises the following steps:
Communicating with each storage node over a network connection or by other means to obtain monitoring data, wherein the monitoring data comprises read-write bandwidth, IOPS, latency, system load, CPU utilization and memory utilization;
preprocessing the captured monitoring data, wherein the preprocessing comprises data cleaning, de-duplication, filling missing values, and converting the original monitoring data into a uniform data format.
3. The method for real-time monitoring of a distributed storage system according to claim 1, wherein:
the abnormal monitoring data is captured and processed again, and a final aggregation result is obtained through weight calculation;
and adjusting the parameters of the sample data and the Bayesian algorithm model in real time according to the utilization rate of the distributed storage system.
4. The method for real-time monitoring of a distributed storage system according to claim 1, wherein:
the workflow of the intelligent decision module comprises:
Receiving monitoring data from the data aggregation module, wherein the monitoring data comprises the analysis results of the Bayesian model and the random forest algorithm, and preprocessing the received data, wherein the preprocessing comprises data cleaning and denoising operations;
According to the requirements and the design of the distributed storage system, a preset algorithm is used for making decisions, corresponding operations are executed according to different conditions, and corresponding decision schemes are formulated;
and after the decision is executed, the intelligent decision module monitors the effect and influence of the decision, and feeds back and adjusts the decision model according to the monitoring result.
5. The method for real-time monitoring of a distributed storage system according to claim 1, wherein:
the intelligent decision module makes an intelligent decision based on the analysis result of the data aggregation module to judge the health state of the distributed storage system, specifically deducing the state from the characteristics of the observed data:
if the observable data is stable and high and the system load is heavy, the system is deduced to be in a busy state;
if the observable data is always high while the system load is low, a sub-health state with a slow disk is deduced;
if error-request statistics or frequent peaks and troughs occur, a bad disk block is deduced;
if the non-observable data shows a peak and the system load peaks with it, a busy state in the burst-data influx stage is deduced;
if the non-observable data is flat and the observable data shows no increment, the storage node is deduced to be in a healthy state;
Wherein the observable data is used to identify storage node abnormal states and predict future states, and the non-observable data is used to describe normal operating states.
6. An electronic device comprising a processor and a memory communicatively coupled to the processor for storing processor-executable instructions, characterized in that: the processor is configured to perform a method for real-time monitoring of a distributed storage system as claimed in any one of claims 1-5.
7. A server, characterized by: comprising at least one processor and a memory communicatively coupled to the processor, the memory storing instructions executable by the at least one processor to cause the at least one processor to perform a method of real-time monitoring of a distributed storage system as claimed in any one of claims 1-5.
8. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, implements a method for real-time monitoring of a distributed storage system as claimed in any one of claims 1-5.
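The posterior-probability test recited in claim 1 can be illustrated with a small naive Bayes sketch. It assumes discretized attribute values (for example "low" or "high"), conditional independence between attributes, and Laplace smoothing; none of these choices is stated in the claims, and the function names `train`, `posterior_abnormal` and `is_abnormal` are hypothetical. The training set is assumed to be non-empty.

```python
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (attributes: dict, label), label in {'normal', 'abnormal'}."""
    label_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for attrs, label in samples:
        for name, value in attrs.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen


def posterior_abnormal(model, attrs):
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    joint = {}
    for label in ("normal", "abnormal"):
        p = label_counts[label] / total  # prior probability of the state
        for name, value in attrs.items():
            # Conditional probability with Laplace smoothing (an added assumption).
            k = len(values_seen[name]) or 1
            p *= (value_counts[(label, name)][value] + 1) / (label_counts[label] + k)
        joint[label] = p
    # Bayes' theorem: normalize the joint probabilities into a posterior.
    return joint["abnormal"] / (joint["normal"] + joint["abnormal"])


def is_abnormal(model, attrs, threshold=0.5):
    return posterior_abnormal(model, attrs) > threshold
```

The threshold of 0.5 is only a placeholder; the claims leave the threshold to be set from the posterior probabilities observed in practice.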
CN202410391255.9A 2024-04-02 2024-04-02 Real-time monitoring method for distributed storage system Active CN118012718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410391255.9A CN118012718B (en) 2024-04-02 2024-04-02 Real-time monitoring method for distributed storage system

Publications (2)

Publication Number Publication Date
CN118012718A CN118012718A (en) 2024-05-10
CN118012718B true CN118012718B (en) 2024-07-12

Family

ID=90948833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410391255.9A Active CN118012718B (en) 2024-04-02 2024-04-02 Real-time monitoring method for distributed storage system

Country Status (1)

Country Link
CN (1) CN118012718B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033450A (en) * 2022-05-26 2022-09-09 中电信数智科技有限公司 Bayesian cluster monitoring early warning analysis method based on distribution

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10917419B2 (en) * 2017-05-05 2021-02-09 Servicenow, Inc. Systems and methods for anomaly detection
CN107579858A (en) * 2017-09-28 2018-01-12 厦门集微科技有限公司 The alarm method and device of cloud main frame, communication system
CN109660407A (en) * 2019-01-18 2019-04-19 鑫涌算力信息科技(上海)有限公司 Distributed system monitoring system and method
CN111125005B (en) * 2019-12-03 2022-07-08 苏州浪潮智能科技有限公司 Method, system and equipment for optimizing IO performance of HDFS distributed file system
CN114509283A (en) * 2022-01-05 2022-05-17 中车唐山机车车辆有限公司 System fault monitoring method and device, electronic equipment and storage medium
CN115237724A (en) * 2022-08-03 2022-10-25 中国平安财产保险股份有限公司 Data monitoring method, device, equipment and storage medium based on artificial intelligence
CN115509853A (en) * 2022-09-21 2022-12-23 联想(北京)有限公司 Cluster data anomaly detection method and electronic equipment

Also Published As

Publication number Publication date
CN118012718A (en) 2024-05-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant