CN112380089A

CN112380089A - Data center monitoring and early warning method and system

Info

Publication number: CN112380089A
Application number: CN202011245407.2A
Authority: CN
Inventors: 邱子良
Original assignee: Shenzhen Power Supply Bureau Co Ltd
Current assignee: Shenzhen Power Supply Bureau Co Ltd
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2021-02-19

Abstract

The invention discloses a data center monitoring and early warning method and a system, wherein the data center monitoring and early warning method comprises the following steps: step S1, setting monitored object and its monitoring index; step S2, acquiring index data of each monitored object; step S3, selecting a corresponding detection model from a detection model database according to the monitored object and the monitoring index, and carrying out risk detection on index data by adopting the detection model; and step S4, sending out early warning prompt according to the risk detection result. According to the invention, risk detection is carried out based on the index data of the monitored object, so that risks can be found in advance and fault early warning is sent out, normal operation of the server and the data center is ensured, the fault rate of the server is effectively reduced, and great loss caused by server fault is reduced.

Description

Data center monitoring and early warning method and system

Technical Field

The invention relates to the technical field of data centers, in particular to a data center monitoring and early warning method and system.

Background

With the continuous development of science and technology, enterprises and public institutions are building more and more data centers, and the number of servers that make up these data centers keeps growing. A server is a high-performance computer with high-speed computing capability and strong external data throughput, and its normal operation is crucial to the data center. Once a server fails, problems arise such as loss of the data it stores and users being unable to access it. At present, management personnel handle a fault only after the server has broken down, so that the server can return to normal operation. However, no matter how quickly or how well the fault is handled, any server failure affects the overall operation of the data center to a greater or lesser extent. In summary, how to give early warning of server failures is a technical problem that those skilled in the art urgently need to solve.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a data center monitoring and early warning method and system to ensure the normal operation of a server and a data center and effectively reduce the failure rate of the server.

In order to solve the technical problem, the invention provides a data center monitoring and early warning method, which comprises the following steps:

step S1, setting monitored object and its monitoring index;

step S2, acquiring index data of each monitored object;

step S3, selecting a corresponding detection model from a detection model database according to the monitored object and the monitoring index, and carrying out risk detection on index data by adopting the detection model;

and step S4, sending out early warning prompt according to the risk detection result.

Further, in step S2, either index data within a preset time period or real-time index data is acquired.

Further, when the index data in the preset time period is acquired in step S2, the trend prediction model is selected in step S3, and the process of performing risk detection on the index data by using the trend prediction model in step S3 specifically includes:

step S31a, arranging the index data in a preset time period according to a time sequence to obtain an index data sequence;

step S32a, inputting the index data sequence into a trend prediction model, and outputting the highest index data value in the next preset time period by the trend prediction model;

step S33a, the highest index data value is compared with a preset threshold, and if the highest index data value is greater than or equal to the threshold, it is determined that there is a risk, and if the highest index data value is less than the threshold, it is determined that there is no risk.

Further, in the step S33a, different risk levels are generated based on a ratio of the highest index data value exceeding a preset threshold, and in the step S4, different warning levels are generated according to the different risk levels.

Further, in step S4, the early warning information is grouped according to the early warning level, and a sending time is set for each group of early warning information, so that the early warning information is sent to the associated users at the preset sending time and the information transmission load on the early warning system is spread out.

Further, when the real-time index data is obtained in step S2, the anomaly detection model is selected in step S3, and the process of performing risk detection on the index data by using the anomaly detection model in step S3 specifically includes:

step S31b, inputting the real-time index data into the anomaly detection model, and outputting corresponding index data characteristics;

step S32b, comparing the index data characteristic with the standard index characteristic, and if the index data characteristic accords with the standard index characteristic, determining that the real-time index data is in a normal state; otherwise, determining that the index data is in an abnormal state.

The invention also provides a data center monitoring and early warning system, which comprises:

the monitoring target setting module is used for setting a monitored object and a monitoring index thereof;

the index data acquisition module is used for acquiring the index data of each monitored object;

the risk prediction module is used for selecting a corresponding detection model from the detection model database according to the monitored object and the monitoring index and carrying out risk detection on the index data by adopting the detection model;

and the early warning module is used for sending out early warning prompt according to the risk detection result.

Further, the index data acquisition module acquires either index data within a preset time period or real-time index data.

Further, when the index data acquisition module acquires the index data within a preset time period, the risk prediction module selects a corresponding trend prediction model from the detection model database, arranges the index data within the preset time period in time order to obtain an index data sequence, inputs the index data sequence into the trend prediction model, which outputs the highest index data value within the next preset time period, and compares the highest index data value with a preset threshold; it determines that a risk exists if the highest index data value is greater than or equal to the threshold, and that no risk exists if the highest index data value is less than the threshold.

Further, when the index data acquisition module acquires real-time index data, the risk prediction module selects a corresponding anomaly detection model from the detection model database, inputs the real-time index data into the anomaly detection model, which outputs corresponding index data characteristics, and compares the index data characteristics with standard index characteristics; if the index data characteristics conform to the standard index characteristics, the real-time index data is determined to be in a normal state; otherwise, the index data is determined to be in an abnormal state.

The embodiment of the invention has the beneficial effects that: by carrying out risk detection based on the index data of the monitored object, risks can be found in advance and fault early warning can be sent out, normal operation of the server and the data center is guaranteed, the failure rate of the server is effectively reduced, and major loss caused by server failure is reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a data center monitoring and early warning method according to an embodiment of the present invention.

Fig. 2 is a detailed flowchart of step S3 shown in fig. 1.

Fig. 3 is another detailed flowchart of step S3 shown in fig. 1.

Fig. 4 is a schematic structural diagram of a module of a data center monitoring and early warning system according to a second embodiment of the present invention.

Detailed Description

The following description of the embodiments refers to the accompanying drawings, which are included to illustrate specific embodiments in which the invention may be practiced.

Referring to fig. 1, a preferred embodiment of the present invention provides a multi-directional monitoring and early warning method for a data center, which includes the following steps:

step S1: setting a monitored object and a monitoring index thereof;

step S2: acquiring index data of each monitored object;

step S3: selecting a corresponding detection model from a detection model database according to the monitored object and the monitoring index, and carrying out risk detection on index data by adopting the detection model;

step S4: sending out an early warning prompt according to the risk detection result.

It is to be understood that, in step S1, the monitored object may be at least one of the servers of the data center, one or more network devices, or at least one virtual cluster; staff may select and set the monitored objects according to actual needs, which is not limited herein. In addition, different monitoring indexes can be set for different monitored objects. For example, the monitoring indexes may include: the CPU usage rate, CPU temperature, memory usage rate, hard disk usage rate, and the like of a target server; usage indexes of related applications that a user has configured on the data center; the operating state and usage of the CPU, memory, voltage, current, energy consumption, temperature, and the like of each physical machine; and the operating state and usage of the CPU, memory, disk I/O, network I/O, and the like of each virtual machine. Generally, staff determine the monitoring indexes by operating a monitoring settings interface.
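
As an illustration of step S1, the assignment of monitoring indexes to monitored objects could be held in a small configuration structure. The sketch below is only one possible shape: the class name `MonitoredObject`, the `kind` labels, and index names such as `cpu_usage` are assumptions chosen for the example and are not terms defined by the invention.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MonitoredObject:
    """One monitored object and the indexes selected for it (step S1)."""
    name: str   # hypothetical identifier, e.g. a host name
    kind: str   # assumed labels: "server", "network_device", "virtual_cluster"
    indexes: tuple = field(default_factory=tuple)

# Hypothetical selection made by a staff member in the monitoring interface.
targets = [
    MonitoredObject("server-01", "server",
                    ("cpu_usage", "cpu_temperature", "memory_usage", "disk_usage")),
    MonitoredObject("vcluster-a", "virtual_cluster",
                    ("cpu_usage", "memory_usage", "disk_io", "net_io")),
]

def index_pairs(objects):
    """Flatten the configuration into (object name, index name) pairs."""
    return [(obj.name, idx) for obj in objects for idx in obj.indexes]

pairs = index_pairs(targets)
```

Each pair in `pairs` identifies one stream of index data that the later steps would acquire and evaluate.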

It is to be understood that, in step S2, either index data within a preset time period or real-time index data is obtained, so that risk detection can be performed on the index data of a time period or on the real-time index data. For example, a target monitoring object of a target server may be monitored by a monitoring plug-in using SNMP (Simple Network Management Protocol), agent technology, IPMI (Intelligent Platform Management Interface), or the like, to obtain a performance value of the target monitoring object; the CPU or memory occupancy rate of the managed service host can be obtained from a resource occupancy monitoring program; and the current temperature of the cloud computing data center can be obtained from installed temperature sensors. Of course, the present invention may also obtain the index data by other existing monitoring means, and the specific monitoring mode is not described further herein.
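
The two kinds of index data in step S2 can be served from the same stream of samples. The following is a minimal sketch under that assumption; the class `IndexBuffer` and its methods are hypothetical, and how the samples are actually collected (SNMP, an agent, IPMI, sensors) is deliberately left out.

```python
from collections import deque

class IndexBuffer:
    """Keeps recent samples of one index so that both a time window
    and the latest real-time value can be served (step S2)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        # Assumes samples arrive in increasing timestamp order.
        self._samples.append((timestamp, value))
        # Drop samples that have fallen out of the preset time period.
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def window(self):
        """Index data within the preset time period."""
        return list(self._samples)

    def latest(self):
        """Real-time index data: the most recent sample, or None."""
        return self._samples[-1] if self._samples else None

buf = IndexBuffer(window_seconds=60)
for t, v in [(0, 41.0), (30, 47.5), (70, 52.0)]:
    buf.add(t, v)
```

After the third sample arrives at `t = 70`, the sample from `t = 0` lies outside the 60-second period and is discarded, while `latest()` returns the newest reading.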

It is to be understood that different detection models are stored in advance in the detection model database for different types of index data. For example, for index data over a period of time, a trend prediction model is stored in the detection model database in advance; the trend prediction model is trained by deep learning on a large amount of historical data and can predict how the index data will change in the next period of time based on the index data of the previous period. For real-time index data, an anomaly detection model is stored in the detection model database in advance; the anomaly detection model is built with an anomaly detection machine learning algorithm through continuous iterative training, and once trained it yields the standard characteristic values of the index data in the data center. In other embodiments of the present invention, the detection model database may also hold separate risk detection models for different monitored objects, or separate risk detection models for different index data of the same monitored object, so as to improve the accuracy of risk detection.
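
One way to picture the model selection described above is a lookup table that prefers a model registered for a specific object and index and falls back to a generic model for the data type. This is a sketch only: the class `DetectionModelDatabase`, the data type labels `"window"` and `"realtime"`, and the use of plain strings in place of trained models are all assumptions made for illustration.

```python
class DetectionModelDatabase:
    """Maps (object kind, index name, data type) to a detection model."""

    def __init__(self):
        self._models = {}

    def register(self, model, data_type, object_kind=None, index=None):
        # None acts as a wildcard, so a generic model can serve as fallback.
        self._models[(object_kind, index, data_type)] = model

    def select(self, object_kind, index, data_type):
        # Try the most specific key first, then progressively generic ones.
        for key in ((object_kind, index, data_type),
                    (object_kind, None, data_type),
                    (None, None, data_type)):
            if key in self._models:
                return self._models[key]
        raise LookupError(
            f"no detection model for {(object_kind, index, data_type)}")

# Strings stand in for trained models in this sketch.
model_db = DetectionModelDatabase()
model_db.register("generic-trend-model", data_type="window")
model_db.register("generic-anomaly-model", data_type="realtime")
model_db.register("server-cpu-trend-model", data_type="window",
                  object_kind="server", index="cpu_usage")
```

With this layout, adding a dedicated model for one index of one monitored object does not disturb the generic models used everywhere else.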

It is understood that, in step S4, the content of the early warning prompt includes, but is not limited to, the warning time, warning identifier, warning reason, warning level, and the like. Preferably, the present invention may further create a fault database in advance to store information such as various faults and their corresponding fault processing modes, which may be collected when managers actually handle faults. After the fault early warning information of the target monitoring object is determined, the fault processing mode corresponding to that information can be queried in the fault database. If a fault processing mode is found, fault preprocessing can be performed automatically according to it. For example, when the risk detection result indicates that the likely fault is an abnormal stop of the server, and querying the fault database shows that the corresponding fault processing mode is a restart operation, a script for restarting the server can be written according to that mode and used to restart the server, so that the abnormal stop is avoided directly. When the fault database stores a large number of faults and corresponding fault processing modes, fault early warnings can be handled automatically; alternatively, after the fault processing mode corresponding to the fault early warning information is found in the fault database, it can be output to give the administrator a basis for handling the fault. If no fault processing mode corresponding to the fault early warning information is found in the fault database, the fault corresponding to that information can be stored in the fault database; then, after the administrator handles the fault based on the fault early warning information, the administrator's feedback is received, the corresponding fault processing mode is obtained, and the fault database is updated accordingly. In this way, when the same fault early warning information appears again, fault preprocessing can be performed automatically according to the corresponding fault processing mode.
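
The query-then-update behaviour of the fault database can be sketched as follows. The names `FaultDatabase`, `handle_warning`, and the fault label `server_abnormal_stop` are hypothetical, and the restart "script" is replaced by a harmless callable so that the example can run anywhere.

```python
class FaultDatabase:
    """Stores known faults and their fault processing modes."""

    def __init__(self):
        self._modes = {}   # fault name -> processing mode, None if unknown

    def lookup(self, fault):
        """Return the stored processing mode, or None if there is none yet."""
        if fault not in self._modes:
            # Record the unseen fault so that feedback can be attached later.
            self._modes[fault] = None
        return self._modes[fault]

    def record_feedback(self, fault, processing_mode):
        """Update the database with the mode an administrator reported."""
        self._modes[fault] = processing_mode

def handle_warning(db, fault, actions):
    """Run the stored processing mode if one exists; otherwise escalate."""
    mode = db.lookup(fault)
    if mode is None:
        return "escalated to administrator"
    actions[mode]()   # hypothetical callable standing in for a script
    return f"auto-handled by {mode}"

executed = []
actions = {"restart": lambda: executed.append("restart")}
fault_db = FaultDatabase()
first = handle_warning(fault_db, "server_abnormal_stop", actions)
fault_db.record_feedback("server_abnormal_stop", "restart")
second = handle_warning(fault_db, "server_abnormal_stop", actions)
```

The first warning finds no processing mode and is escalated; once the administrator's feedback has been recorded, the same warning is handled automatically.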

According to the data center monitoring and early warning method, firstly, a worker can manually set monitored objects and monitoring indexes thereof in a data center, then index data of each monitored object is obtained, a corresponding detection model is selected from a detection model database according to the monitored objects and the monitoring indexes, the detection model is adopted to carry out risk detection on the index data, and finally early warning reminding is sent out based on a risk detection result.

Specifically, as shown in fig. 2, when the index data in the preset time period is acquired in step S2, a trend prediction model is selected in step S3, and the process of performing risk detection on the index data by using the trend prediction model in step S3 specifically includes the following steps:

step S31a: arranging the index data in a preset time period according to a time sequence to obtain an index data sequence;

step S32a: inputting the index data sequence into a trend prediction model, and outputting the highest index data value in the next preset time period by the trend prediction model;

step S33a: comparing the highest index data value with a preset threshold value, judging that the risk exists if the highest index data value is greater than or equal to the threshold value, and judging that the risk does not exist if the highest index data value is less than the threshold value.
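
The three sub-steps can be sketched in a few lines. Note that the trend model here is a stand-in: the description specifies a model trained by deep learning, whereas this example substitutes a least-squares straight line purely so that the flow is runnable. The function names and the `horizon` parameter (the number of future samples to extrapolate) are assumptions.

```python
def predict_peak(sequence, horizon):
    """Stand-in trend model: fit a least-squares line to the sequence and
    return the highest extrapolated value over the next `horizon` steps.
    Requires at least two samples."""
    n = len(sequence)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(sequence) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(xs, sequence)) / var_x
    intercept = mean_y - slope * mean_x
    return max(intercept + slope * x for x in range(n, n + horizon))

def detect_trend_risk(samples, threshold, horizon):
    # S31a: arrange the index data in time order.
    ordered = [value for _, value in sorted(samples)]
    # S32a: obtain the highest predicted value for the next period.
    peak = predict_peak(ordered, horizon)
    # S33a: a peak at or above the threshold counts as a risk.
    return peak, peak >= threshold

# (timestamp, value) samples, deliberately out of order.
samples = [(3, 80.0), (1, 60.0), (2, 70.0), (0, 50.0)]
peak, at_risk = detect_trend_risk(samples, threshold=90.0, horizon=2)
```

For these samples the fitted line rises by 10 per step, so the highest value extrapolated over the next two steps is 100, which is above the threshold of 90 and is therefore reported as a risk.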

In addition, in step S33a, different risk levels are generated based on the ratio of the highest index data value to the preset threshold. For example, a first risk level is generated when the highest index data value exceeds 110% of the preset threshold, a second risk level when it exceeds 150%, and a third risk level when it exceeds 200%; the higher the risk level, the more severe the situation. In step S4, different early warning levels are generated according to the different risk levels: the higher the risk level, the higher the early warning level and the higher the priority with which the early warning is sent.
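
The grading can be written as a simple table lookup. This sketch reads the percentages as multiples of the threshold (110% meaning 1.1 times the threshold), which is an interpretation of the example figures; returning level 0 for values at or below 110% of the threshold is likewise an assumption, since the text does not say how that range is graded.

```python
# (ratio of highest value to threshold that must be exceeded, risk level),
# listed from the highest band down.
RISK_BANDS = ((2.00, 3), (1.50, 2), (1.10, 1))

def risk_level(peak, threshold):
    """Map the predicted highest value to a risk level (0 = ungraded)."""
    ratio = peak / threshold
    for band_ratio, level in RISK_BANDS:
        if ratio > band_ratio:
            return level
    return 0

levels = [risk_level(p, 100.0) for p in (105.0, 120.0, 160.0, 250.0)]
```

In step S4 the early warning level could then simply follow the risk level, with higher levels sent first.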

In addition, in step S4, the early warning information is further grouped according to the early warning level, and a sending time is set for each group of early warning information, so that the early warning information is sent to the associated users at the preset sending time and the information transmission load on the early warning system is spread out.
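
Grouping by level and assigning each group its own sending time might look like the sketch below. The delay values in `SEND_DELAY`, the dictionary layout of a warning, and the function name `schedule_warnings` are illustrative assumptions; the text does not specify concrete sending times.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical delay per early warning level: higher levels are sent sooner.
SEND_DELAY = {3: timedelta(0), 2: timedelta(minutes=1), 1: timedelta(minutes=5)}

def schedule_warnings(warnings, now):
    """Group warnings by level and assign each group a sending time."""
    groups = defaultdict(list)
    for warning in warnings:
        groups[warning["level"]].append(warning)
    # Highest level first, so urgent groups lead the schedule.
    return [(now + SEND_DELAY[level], level, groups[level])
            for level in sorted(groups, reverse=True)]

now = datetime(2020, 11, 10, 9, 0)
schedule = schedule_warnings(
    [{"id": "w1", "level": 1}, {"id": "w2", "level": 3}, {"id": "w3", "level": 1}],
    now,
)
```

Staggering the groups in this way keeps low-level warnings from competing with urgent ones for the same transmission slot.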

Alternatively, as shown in fig. 3, when the real-time index data is obtained in step S2, the abnormality detection model is selected in step S3, and the process of performing risk detection on the index data by using the abnormality detection model in step S3 specifically includes the following steps:

step S31b: inputting the real-time index data into an abnormality detection model, and outputting corresponding index data characteristics;

step S32b: comparing the index data characteristic with the standard index characteristic, and if the index data characteristic accords with the standard index characteristic, determining that the real-time index data is in a normal state; otherwise, determining that the index data is in an abnormal state.
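
The feature-then-compare structure of steps S31b and S32b can be illustrated with a deliberately simple model. The description does not name the anomaly detection algorithm, so this sketch substitutes a z-score: the "index data characteristic" is the number of standard deviations from the historical mean, and the "standard index characteristic" is a maximum allowed absolute z-score. The class name and the default limit of 3.0 are assumptions.

```python
from statistics import mean, stdev

class ZScoreAnomalyModel:
    """Stand-in anomaly detection model based on a z-score."""

    def __init__(self, normal_history, max_abs_z=3.0):
        # Needs at least two historical values that are not all identical.
        self.mean = mean(normal_history)
        self.std = stdev(normal_history)
        self.max_abs_z = max_abs_z   # plays the role of the standard feature

    def feature(self, value):
        # S31b: turn a real-time value into an index data characteristic.
        return (value - self.mean) / self.std

    def is_normal(self, value):
        # S32b: compare the characteristic with the standard characteristic.
        return abs(self.feature(value)) <= self.max_abs_z

anomaly_model = ZScoreAnomalyModel([48.0, 50.0, 52.0, 49.0, 51.0])
normal_state = anomaly_model.is_normal(51.5)
abnormal_state = not anomaly_model.is_normal(95.0)
```

A reading of 51.5 lies well within the historical spread and is classed as normal, whereas 95.0 is many standard deviations away and is classed as abnormal.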

Based on a distributed computing mode, each monitoring index is evaluated by a distributed anomaly detection model, so that abnormal states are monitored and alarmed accurately and efficiently and the impact of service interruptions is avoided in time, thereby reducing the operation and maintenance resources invested in the data center.

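Referring to fig. 4, corresponding to the data center monitoring and early warning method of the first embodiment, a second embodiment of the present invention further provides a data center monitoring and early warning system, including:

the monitoring target setting module is used for setting a monitored object and a monitoring index thereof;

the index data acquisition module is used for acquiring the index data of each monitored object;

the risk prediction module is used for selecting a corresponding detection model from the detection model database according to the monitored object and the monitoring index and carrying out risk detection on the index data by adopting the detection model;

and the early warning module is used for sending out early warning prompt according to the risk detection result.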

For the working principle and the specific working process of the data center monitoring and early warning system in this embodiment, please refer to the description of the first embodiment of the present invention, which is not described herein again.

As can be seen from the above description, the embodiments of the present invention have the following beneficial effects: by carrying out risk detection based on the index data of the monitored object, risks can be found in advance and fault early warning can be sent out, normal operation of the server and the data center is guaranteed, the failure rate of the server is effectively reduced, and major loss caused by server failure is reduced.

The above disclosure only illustrates preferred embodiments of the present invention and does not limit the scope of its claims; equivalent changes made in accordance with the claims of the present invention therefore remain within the scope of the invention.

Claims

1. A data center monitoring and early warning method is characterized by comprising the following steps:

step S1, setting monitored object and its monitoring index;

step S2, acquiring index data of each monitored object;

step S3, selecting a corresponding detection model from a detection model database according to the monitored object and the monitoring index, and carrying out risk detection on index data by adopting the detection model;

and step S4, sending out early warning prompt according to the risk detection result.

2. The data center monitoring and early warning method according to claim 1, wherein either index data within a preset time period or real-time index data is acquired in step S2.

3. The data center monitoring and early warning method according to claim 2, wherein when the index data within the preset time period is acquired in step S2, a trend prediction model is selected in step S3, and the process of performing risk detection on the index data by using the trend prediction model in step S3 specifically includes:

step S31a, arranging the index data in a preset time period according to a time sequence to obtain an index data sequence;

step S32a, inputting the index data sequence into a trend prediction model, and outputting the highest index data value in the next preset time period by the trend prediction model;

step S33a, the highest index data value is compared with a preset threshold, and if the highest index data value is greater than or equal to the threshold, it is determined that there is a risk, and if the highest index data value is less than the threshold, it is determined that there is no risk.

4. The data center monitoring and early warning method according to claim 3, wherein in the step S33a, different risk levels are generated based on a ratio of the highest index data value exceeding a preset threshold, and in the step S4, different early warning levels are generated according to the different risk levels.

5. The data center monitoring and early warning method according to claim 4, wherein in step S4, the early warning information is grouped according to the early warning level, and a sending time is set for each group of early warning information, so that the early warning information is sent to the associated users at the preset sending time and the information transmission load on the early warning system is spread out.

6. The data center monitoring and early warning method according to claim 2, wherein when the real-time index data is obtained in step S2, the anomaly detection model is selected in step S3, and the process of performing risk detection on the index data by using the anomaly detection model in step S3 specifically includes:

step S31b, inputting the real-time index data into the anomaly detection model, and outputting corresponding index data characteristics;

step S32b, comparing the index data characteristic with the standard index characteristic, and if the index data characteristic accords with the standard index characteristic, determining that the real-time index data is in a normal state; otherwise, determining that the index data is in an abnormal state.

7. A data center monitoring and early warning system is characterized by comprising:

the monitoring target setting module is used for setting a monitored object and a monitoring index thereof;

the index data acquisition module is used for acquiring the index data of each monitored object;

the risk prediction module is used for selecting a corresponding detection model from the detection model database according to the monitored object and the monitoring index and carrying out risk detection on the index data by adopting the detection model;

and the early warning module is used for sending out early warning prompt according to the risk detection result.

8. The data center monitoring and early warning system according to claim 7, wherein the index data acquisition module acquires index data or real-time index data within a preset time period.

9. The data center monitoring and early warning system according to claim 8, wherein when the index data acquisition module acquires index data within a preset time period, the risk prediction module selects a corresponding trend prediction model from the detection model database, arranges the index data within the preset time period in time order to obtain an index data sequence, inputs the index data sequence into the trend prediction model, which outputs a highest index data value within a next preset time period, and compares the highest index data value with a preset threshold; it determines that a risk exists if the highest index data value is greater than or equal to the threshold, and that no risk exists if the highest index data value is less than the threshold.

10. The data center monitoring and early warning system according to claim 8, wherein when the index data acquisition module acquires real-time index data, the risk prediction module selects a corresponding anomaly detection model from the detection model database, inputs the real-time index data into the anomaly detection model, which outputs corresponding index data characteristics, and compares the index data characteristics with standard index characteristics; if the index data characteristics conform to the standard index characteristics, the real-time index data is determined to be in a normal state; otherwise, the index data is determined to be in an abnormal state.