CN114546765A - Cluster monitoring method, system, device and medium

Info

Publication number: CN114546765A
Application number: CN202210129765.XA
Authority: CN
Inventors: 张书博
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-02-11
Filing date: 2022-02-11
Publication date: 2022-05-27

Abstract

The invention discloses a cluster monitoring method comprising the following steps: obtaining a first acquisition interval and a first storage interval; collecting a plurality of monitoring data according to the first acquisition interval and storing the plurality of monitoring data according to the first storage interval; predicting the monitoring data of the next period according to the plurality of monitoring data; and, in response to the predicted monitoring data of the next period being greater than a threshold, updating the first acquisition interval to a second acquisition interval that is smaller than the first acquisition interval, so that monitoring data are collected according to the second acquisition interval, and updating the first storage interval to a second storage interval that is smaller than the first storage interval, so that monitoring data are stored according to the second storage interval. The invention also discloses a system, a computer device and a readable storage medium. The scheme provided by the invention adaptively updates the acquisition and storage frequency and thereby helps ensure the integrity, stability and usability of the functions of the artificial intelligence platform.

Description

Cluster monitoring method, system, device and medium

Technical Field

The invention relates to the field of servers, and in particular to a cluster monitoring method, system, device and storage medium.

Background

For an artificial intelligence cloud platform, monitoring information and an alarm mechanism are important. The platform provides users with services such as a basic environment, computing power and management methods for training deep learning models, and for such training the real-time monitoring and reasonable allocation of resources such as the Central Processing Unit (CPU), the Graphics Processing Unit (GPU) and the disk become particularly important. Abnormalities in the utilization rates of the CPU, GPU and memory, in the power consumption and temperature of the GPU, or in the form of a dropped GPU card all affect the progress and quality of a training task and whether the training succeeds. For example, when CPU utilization is too high and CPU data reading cannot keep up, the GPU is not fully utilized and resources are wasted; an excessively high GPU temperature poses a potential threat, and a GPU card that drops for some reason may stop the training outright or cause it to fail, or even damage the hardware.

The monitoring management and alarm management modules currently applied to artificial intelligence cloud platforms work roughly as follows: the monitoring management module maintains the basic information of the acquisition items, such as acquisition indexes, acquisition frequency and storage configuration, synchronously updates the configuration files on the nodes, and performs acquisition and data storage according to the latest indexes; the alarm management module configures the alarm rules of the monitoring items and outputs alarm information according to the alarm period.

Cloud platform cluster monitoring can use the TIGK component combination (a cloud environment monitoring solution combining the four components Telegraf, InfluxDB, Grafana and Kapacitor), which correspondingly implements the resource monitoring steps of acquisition, storage, display and alarm. Acquisition and storage are the two most critical steps and are implemented by Telegraf and InfluxDB respectively. Telegraf requires an acquisition interval to be configured for each monitoring item, and InfluxDB requires a related storage policy to be configured, and it is difficult to balance the two. If the acquisition interval is too large, the monitoring granularity becomes coarse, the delay before the module discovers an abnormality and raises an alarm increases, and few data records exist before and after the alarm, which makes subsequent analysis and inspection difficult. If the acquisition interval is too small, the volume of data collected and stored increases, the read/write pressure on the database increases, too much memory is occupied, and storing a large amount of normal idle-time data is of little value. If the platform monitoring items could give early warning, so that the acquisition interval is sparse under normal conditions and is reduced when prediction finds an alarm risk, abnormal conditions could be responded to quickly and the data changes before and after an alarm could be recorded densely, which would improve the situation considerably. However, the conventional method of calculating the change rate of the monitoring data and setting a threshold copes poorly with slowly changing data and cannot sense an abnormality in advance.

Disclosure of Invention

In view of the above, in order to overcome at least one aspect of the above problems, an embodiment of the present invention provides a cluster monitoring method, including the following steps:

acquiring a first acquisition interval and a first storage interval;

collecting a plurality of monitoring data according to the first acquisition interval and storing the plurality of monitoring data according to the first storage interval;

predicting the monitoring data of the next period according to the plurality of monitoring data;

in response to the monitoring data of the next cycle being greater than a threshold, updating the first acquisition interval to a second acquisition interval that is less than the first acquisition interval to acquire a plurality of monitoring data according to the second acquisition interval, and updating the first storage interval to a second storage interval that is less than the first storage interval to store a plurality of monitoring data according to the second storage interval.

In some embodiments, predicting the monitoring data of the next cycle based on the plurality of monitoring data further comprises:

judging whether the collected monitoring data are larger than corresponding threshold values or not;

in response to being greater than a corresponding threshold, directly updating the first acquisition interval to the second acquisition interval and the first storage interval to a second storage interval;

and acquiring a plurality of monitoring data again according to the second acquisition interval and storing the plurality of monitoring data according to the second storage interval.

In some embodiments, further comprising:

in response to the re-collected monitoring data not being greater than the corresponding threshold, continuing to predict the monitoring data of the next period;

in response to the prediction result also not being greater than the corresponding threshold, updating the second acquisition interval back to the first acquisition interval and updating the second storage interval back to the first storage interval.

In some embodiments, predicting the monitoring data of the next period according to the plurality of monitoring data further comprises:

constructing and training a prediction model;

inputting the predicted monitoring data of the next period and the actually collected monitoring data of the next period into a Kalman filter to obtain an adjusted predicted value;

and adjusting the prediction model by using the adjusted predicted value and the actually collected monitoring data of the next period.

Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a cluster monitoring system, including:

an obtaining module configured to obtain a first acquisition interval and a first storage interval;

an acquisition module configured to acquire a plurality of monitoring data according to the first acquisition interval and store the plurality of monitoring data according to the first storage interval;

a prediction module configured to predict the monitoring data of the next period according to the plurality of monitoring data;

an adjustment module configured to update the first acquisition interval to a second acquisition interval smaller than the first acquisition interval to acquire a plurality of monitoring data according to the second acquisition interval and update the first storage interval to a second storage interval smaller than the first storage interval to store a plurality of monitoring data according to the second storage interval in response to the monitoring data of the next cycle being greater than a threshold.

In some embodiments, the prediction module is further configured to:

in response to the re-collected monitoring data not being greater than the corresponding threshold, continue to predict the monitoring data of the next period;

In some embodiments, the prediction module is further configured to:

constructing and training a prediction model;

Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer apparatus, including:

at least one processor; and

a memory storing a computer program operable on the processor, wherein the processor executes the program to perform any of the steps of the cluster monitoring method as described above.

Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program that, when executed by a processor, performs the steps of any one of the cluster monitoring methods described above.

The invention has one of the following beneficial technical effects: the scheme provided by the invention predicts the real-time monitoring data of the cloud platform monitoring items and judges whether an alarm will be triggered in the next alarm period. When an alarm is predicted, the frequencies of the acquisition module and the storage module are updated. Storing fewer data records during idle time reduces the read/write pressure on the storage module, while collecting and storing more data when there is an alarm risk makes it easier to record and analyze the data trend before and after an alarm occurs, thereby helping to ensure the integrity, stability and usability of the functions of the artificial intelligence platform.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a cluster monitoring method according to an embodiment of the present invention;

FIG. 2 is a flow block diagram of an acquisition interval and storage interval update method according to an embodiment of the present invention;

FIG. 3 is a flowchart of a cluster monitoring method according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a cluster monitoring system according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities that share the same name but are not the same, or two different parameters. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

According to an aspect of the present invention, an embodiment of the present invention provides a cluster monitoring method, as shown in FIG. 1, which may include the steps of:

S1, obtaining a first acquisition interval and a first storage interval;

S2, collecting a plurality of monitoring data according to the first acquisition interval and storing the monitoring data according to the first storage interval;

S3, predicting the monitoring data of the next period according to the monitoring data;

S4, in response to the monitoring data of the next period being greater than a threshold, updating the first acquisition interval to a second acquisition interval smaller than the first acquisition interval so as to collect a plurality of monitoring data according to the second acquisition interval, and updating the first storage interval to a second storage interval smaller than the first storage interval so as to store a plurality of monitoring data according to the second storage interval.
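As a non-limiting illustration, steps S1 to S4 can be sketched in Python as follows. The names `Intervals`, `run_cycle`, `collect_and_store` and `predict_next`, as well as all numeric values, are hypothetical and are not taken from this disclosure; the two lambdas merely stand in for the real acquisition module and prediction model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Intervals:
    acquisition_s: float  # seconds between two collections
    storage_s: float      # seconds between two writes to storage


def run_cycle(
    first: Intervals,
    second: Intervals,
    collect_and_store: Callable[[Intervals], List[float]],
    predict_next: Callable[[Sequence[float]], float],
    threshold: float,
) -> Intervals:
    """Run S2 to S4 once and return the intervals to use afterwards."""
    if not (second.acquisition_s < first.acquisition_s
            and second.storage_s < first.storage_s):
        raise ValueError("second intervals must be smaller than the first")
    samples = collect_and_store(first)   # S2: collect and store
    predicted = predict_next(samples)    # S3: predict the next period
    return second if predicted > threshold else first  # S4


# S1: obtain the first intervals (illustrative values only).
first = Intervals(acquisition_s=60.0, storage_s=60.0)
second = Intervals(acquisition_s=5.0, storage_s=5.0)

chosen = run_cycle(
    first,
    second,
    collect_and_store=lambda iv: [70.0, 75.0, 80.0],     # stand-in samples
    predict_next=lambda xs: xs[-1] + (xs[-1] - xs[-2]),  # stand-in model
    threshold=83.0,
)
print(chosen)
```

In this sketch the stand-in prediction exceeds the threshold, so the smaller second intervals are selected.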

The scheme provided by the invention predicts the real-time monitoring data of the cloud platform monitoring items and judges whether an alarm will be triggered in the next alarm period. When an alarm is predicted, the frequencies of the acquisition module and the storage module are updated. Storing fewer data records during idle time reduces the read/write pressure on the storage module, while collecting and storing more data when there is an alarm risk makes it easier to record and analyze the data trend before and after an alarm occurs, thereby helping to ensure the integrity, stability and usability of the functions of the artificial intelligence platform.

In some embodiments, a monitoring data acquisition module, a monitoring data storage module, an alarm management module and an LSTM model prediction module can be provided. Monitoring and alarm data collected and stored in the past are grouped according to the set alarm detection period to form an original data set, and a prediction model is obtained by training an LSTM network on it. The telegraf acquisition module collects the underlying resource information of the cluster nodes for each monitoring item using the original acquisition interval and the acquisition scripts, and the collected monitoring data are stored in the corresponding influxdb storage module for use by services. Based on the collected data and the trained LSTM model, the monitoring data predicted by the LSTM model are corrected with a Kalman filtering algorithm, and the optimal predicted monitoring data for the next alarm detection period are calculated and compared with the threshold of the alarm management module. When the prediction result is an alarm, the acquisition and storage intervals of the monitoring data are changed to dense intervals; after the alarm recovers, the acquisition module and the storage module return to their original intervals. Adaptive acquisition and storage frequency is thereby achieved, which ensures storage and read/write efficiency as well as a sufficient volume of data for the monitoring items before and after an alarm occurs.

In some embodiments, in step S1, the first acquisition interval and the first storage interval are obtained. Specifically, a telegraf component may be installed on all nodes in the cluster that need to be monitored, the monitoring acquisition items, the acquisition time interval, custom acquisition scripts and the like are configured, and the telegraf service is started. influxdb is then configured, a storage interval is set, and the data are stored in influxdb. The alarm module takes the latest monitoring data of the acquisition module out of system memory according to the alarm period; the trained model is then used, at the alarm interval, to predict whether the monitoring item will trigger an alarm, and the prediction is fed back to the acquisition and storage modules.
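As a non-limiting illustration of how the configured intervals might be written out, the following Python sketch renders a minimal Telegraf `[agent]` section. The `interval` and `flush_interval` settings are standard Telegraf agent options; mapping the storage interval of this disclosure onto `flush_interval` is an assumption made only for this sketch, and the function name `render_telegraf_agent` and the values are hypothetical.

```python
def render_telegraf_agent(acquisition_interval_s: int, storage_interval_s: int) -> str:
    """Render a minimal Telegraf [agent] section as TOML text.

    Assumption: the storage interval is mapped to Telegraf's flush_interval,
    i.e. how often buffered metrics are written to the outputs.
    """
    return (
        "[agent]\n"
        f'  interval = "{acquisition_interval_s}s"\n'
        f'  flush_interval = "{storage_interval_s}s"\n'
    )


# Illustrative values only: collect every 60 s, write out every 300 s.
print(render_telegraf_agent(60, 300))
```

Regenerating this fragment with smaller values and reloading the service is one possible way to apply a denser schedule; the disclosure does not prescribe a particular mechanism.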

In some embodiments, S3, predicting the monitoring data of the next cycle according to the plurality of monitoring data, further includes:

In some embodiments, further comprising:

Specifically, as shown in FIG. 2, the administrator may set the original acquisition interval t₁, the original storage interval t₂ and the dense acquisition and storage interval t_i. The acquisition module collects monitoring item data from the bottom layer at interval t₁ and places the data in memory; the storage module takes the data out of memory at interval t₂ and stores them in the database; and the alarm module takes the latest data out of memory according to the alarm period t_a to perform alarm judgment and alarm prediction. If a real alarm occurs, prediction is stopped and the acquisition module and the storage module are updated to the dense interval t_i; after the alarm condition recovers, the acquisition and storage intervals are restored to the original values t₁ and t₂ and prediction is restarted. If no real alarm occurs, the early warning model predicts the monitoring data of the next alarm period and compares them with the alarm threshold. When the prediction result is an alarm, the acquisition module and the storage module are updated to the dense interval t_i; once the prediction result is no longer an alarm, the acquisition and storage intervals are restored to the original values t₁ and t₂.
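As a non-limiting illustration, the decision made once per alarm period in the flow of FIG. 2 can be sketched in Python as follows. The names `Schedule`, `select_schedule`, `ORIGINAL` and `DENSE` and the numeric values are hypothetical. The predictor is passed as a callable so that, as described above, no prediction is performed while a real alarm is present.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Schedule:
    acquisition_s: float  # acquisition interval in seconds
    storage_s: float      # storage interval in seconds


ORIGINAL = Schedule(acquisition_s=60.0, storage_s=300.0)  # plays the role of t1, t2
DENSE = Schedule(acquisition_s=5.0, storage_s=5.0)        # plays the role of t_i


def select_schedule(
    latest: float,
    predict_next: Callable[[], float],
    threshold: float,
    original: Schedule = ORIGINAL,
    dense: Schedule = DENSE,
) -> Schedule:
    """Choose the schedule for the coming alarm period."""
    if latest > threshold:
        # Real alarm: skip prediction and switch to the dense interval.
        return dense
    if predict_next() > threshold:
        # Predicted alarm: switch to the dense interval ahead of time.
        return dense
    # Neither a real nor a predicted alarm: use (or restore) the original values.
    return original
```

Calling `select_schedule` at every alarm period t_a covers both the switch to the dense interval and the later recovery, because the original values are returned as soon as neither condition holds.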

constructing and training a prediction model;

inputting the monitoring data of the next period obtained by prediction and the monitoring data of the next period obtained by actual acquisition into Kalman filtering to obtain a predicted value after tuning;

Specifically, as shown in FIG. 3, according to the working mode of the monitoring module of the artificial intelligence cloud platform, a large amount of monitoring data for each monitoring item is collected under various conditions and used as a training set. The conditions include single-task training, multi-task distributed training, the depth of the trained model, the number of parameters and so on, so that the scenarios covered are as comprehensive as possible. The monitoring items mainly include indexes such as GPU temperature, GPU utilization, GPU video memory utilization, CPU utilization, memory utilization and disk read/write. The training set of collected monitoring data is trained with an LSTM (Long Short-Term Memory) neural network, per monitoring item and per unit alarm detection period, to obtain a prediction model. The trained model is then put into use and predicts the monitoring data of the upcoming period from the monitoring data of the current alarm detection period, yielding a monitoring predicted value.
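As a non-limiting illustration of grouping the collected data by alarm detection period, the following Python sketch builds (current period, next period) training pairs for one monitoring item. It covers data preparation only and does not train an LSTM network; the function name `make_period_pairs` and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def make_period_pairs(
    series: Sequence[float], samples_per_period: int
) -> List[Tuple[List[float], List[float]]]:
    """Split a series into whole alarm detection periods and pair each
    period (model input) with the period that follows it (training target)."""
    if samples_per_period <= 0:
        raise ValueError("samples_per_period must be positive")
    last_start = len(series) - samples_per_period
    periods = [
        list(series[start:start + samples_per_period])
        for start in range(0, last_start + 1, samples_per_period)
    ]  # a trailing incomplete period is dropped
    return list(zip(periods[:-1], periods[1:]))


# Illustrative GPU temperature samples, two samples per alarm period.
gpu_temperature = [61.0, 62.0, 64.0, 67.0, 71.0, 76.0, 80.0]
pairs = make_period_pairs(gpu_temperature, samples_per_period=2)
print(pairs)
```

Each pair could then be fed to a sequence model as an input and target; one such data set would be built per monitoring item.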

Then, taking the alarm detection period as the axis, the monitoring predicted value of the model and the real monitoring data obtained when the next period arrives are substituted into the Kalman filter to obtain the optimal monitoring predicted value for that period (the Kalman gain can be carried over to other data sets for continued use). Kalman filtering is chosen to combine the predicted value and the real value of the current period according to weights, so as to reduce the error and noise of the prediction result as far as possible and bring the predicted value closer to reality, so that it can better serve as an input parameter passed into the LSTM network for the prediction of the next period.

Kalman filtering can be understood simply as: optimal value = p × observed value + (1 − p) × predicted value, where the observed value is the actual value obtained, the predicted value is the output of the LSTM model, and p is the Kalman gain. p is a parameter that can be continuously adjusted and optimized, so that the final value, derived from both the observed value and the predicted value, is closer to reality.
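As a non-limiting illustration, the weighting described above can be sketched in Python as follows. The `blend` function applies the weighted combination of the observed and predicted values; `kalman_gain` uses the standard scalar Kalman gain formula, with the two variances being assumptions of this sketch, since the disclosure does not specify how p is obtained. All names and numbers are hypothetical.

```python
def kalman_gain(prediction_variance: float, observation_variance: float) -> float:
    """Standard scalar Kalman gain: trust the observation more when the
    prediction is the less certain of the two."""
    return prediction_variance / (prediction_variance + observation_variance)


def blend(observed: float, predicted: float, p: float) -> float:
    """Weighted combination p * observed + (1 - p) * predicted."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return p * observed + (1.0 - p) * predicted


p = kalman_gain(prediction_variance=3.0, observation_variance=1.0)
corrected = blend(observed=80.0, predicted=90.0, p=p)
print(p, corrected)  # 0.75 82.5
```

With p close to 1 the corrected value follows the observation, and with p close to 0 it follows the model output.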

Finally, the optimal predicted value obtained can be passed into the LSTM model as input to predict the next period. By iterating and continually feeding in the corrected predicted value and the real monitoring value, increasingly accurate prediction data are obtained. The predicted monitoring data can then be compared with the alarm threshold, and if the alarm trigger condition is met, the platform collects and stores data for that monitoring item at dense intervals.
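As a non-limiting illustration, the iteration described above can be sketched in Python as follows. For brevity each alarm period is reduced to a single value, and `predict_next` is a trivial stand-in for the trained LSTM model; the function name `run_early_warning`, the fixed gain p and all numbers are hypothetical.

```python
from typing import Callable, Iterable, List


def run_early_warning(
    observations: Iterable[float],
    predict_next: Callable[[float], float],
    p: float,
    threshold: float,
    initial_prediction: float,
) -> List[bool]:
    """For each period, correct the earlier prediction with the real value,
    predict the following period from the corrected value, and report
    whether that prediction would trigger dense acquisition and storage."""
    dense_flags = []
    predicted = initial_prediction
    for observed in observations:
        corrected = p * observed + (1.0 - p) * predicted  # Kalman-style correction
        predicted = predict_next(corrected)               # prediction for the next period
        dense_flags.append(predicted > threshold)
    return dense_flags


flags = run_early_warning(
    observations=[70.0, 80.0, 90.0],
    predict_next=lambda value: value + 2.0,  # stand-in for the trained model
    p=0.5,
    threshold=85.0,
    initial_prediction=70.0,
)
print(flags)  # [False, False, True]
```

In this sketch the third period is the first whose forecast exceeds the threshold, so dense acquisition and storage would begin there.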

Thus, a training set is formed by accumulating monitoring data, an LSTM network is trained on and predicts the monitoring data according to the set alarm period, and the prediction result is optimized by Kalman filtering. This realizes data prediction for the monitoring unit, so that whether the monitoring data will trigger an alarm can be predicted one alarm period in advance. Sparse-interval acquisition and storage continue while the prediction is normal, dense-interval acquisition and storage are used when an alarm is predicted, and the sparse intervals are restored after the alarm recovers. The acquisition and storage frequency thus adapts to the early warning result, which ensures storage and read/write efficiency.

Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a cluster monitoring system 400, as shown in FIG. 4, including:

an obtaining module 401 configured to obtain a first acquisition interval and a first storage interval;

an acquisition module 402 configured to acquire a plurality of monitoring data according to the first acquisition interval and store the plurality of monitoring data according to the first storage interval;

a prediction module 403 configured to predict the monitoring data of the next cycle according to the plurality of monitoring data;

an adjusting module 404 configured to update the first acquisition interval to a second acquisition interval smaller than the first acquisition interval to acquire a plurality of monitoring data according to the second acquisition interval and update the first storage interval to a second storage interval smaller than the first storage interval to store a plurality of monitoring data according to the second storage interval in response to the monitoring data of the next cycle being greater than a threshold.

In some embodiments, the prediction module 403 is further configured to:

constructing and training a prediction model;

Based on the same inventive concept, according to another aspect of the present invention, as shown in FIG. 5, an embodiment of the present invention further provides a computer device 501, comprising:

at least one processor 520; and

a memory 510, the memory 510 storing a computer program 511 executable on the processor, the processor 520 executing the program to perform the steps of any of the cluster monitoring methods as described above.

Based on the same inventive concept, according to another aspect of the present invention, as shown in FIG. 6, an embodiment of the present invention further provides a computer-readable storage medium 601, where the computer-readable storage medium 601 stores computer program instructions 610, and the computer program instructions 610, when executed by a processor, perform the steps of any of the above cluster monitoring methods.

Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above.

Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like made without departing from the spirit or scope of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A cluster monitoring method, comprising the steps of:

acquiring a first acquisition interval and a first storage interval;

in response to the next cycle of monitoring data being greater than a threshold, updating the first acquisition interval to a second acquisition interval that is less than the first acquisition interval to acquire a plurality of monitoring data according to the second acquisition interval, and updating the first storage interval to a second storage interval that is less than the first storage interval to store a plurality of monitoring data according to the second storage interval.

2. The method of claim 1, wherein predicting the monitoring data for the next cycle based on the plurality of monitoring data, further comprising:

3. The method of claim 2, further comprising:

4. The method of claim 1, wherein predicting the monitoring data for the next cycle based on the plurality of monitoring data, further comprising:

constructing and training a prediction model;

5. A cluster monitoring system, comprising:

an adjustment module configured to update the first acquisition interval to a second acquisition interval less than the first acquisition interval to acquire a plurality of monitoring data according to the second acquisition interval and update the first storage interval to a second storage interval less than the first storage interval to store a plurality of monitoring data according to the second storage interval in response to the next period of monitoring data being greater than a threshold.

6. The system of claim 5, wherein the prediction module is further configured to:

7. The system of claim 6, wherein the prediction module is further configured to:

8. The system of claim 5, wherein the prediction module is further configured to:

constructing and training a prediction model;

9. A computer device, comprising:

at least one processor; and

a memory storing a computer program operable on the processor, characterized in that the processor executes the program to perform the steps of the method according to any of claims 1-4.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1-4.