CN113220534A - Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium - Google Patents
Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium
- Publication number
- CN113220534A
- Authority
- CN
- China
- Prior art keywords
- cluster
- node
- monitoring
- abnormal
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/54—Indexing scheme relating to G06F9/54
- G06F2209/547—Messaging middleware
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/54—Indexing scheme relating to G06F9/54
- G06F2209/548—Queue
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Quality & Reliability (AREA)
- Debugging And Monitoring (AREA)
Abstract
Embodiments of this specification provide a cluster multi-dimensional anomaly monitoring method, apparatus, device, and storage medium. The method comprises the following steps: acquiring cluster performance information and node state information of each node in a cluster; generating a first monitoring sub-result according to the cluster performance information, and generating a second monitoring sub-result according to the node state information; and determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result. Embodiments of this specification can improve the accuracy of cluster anomaly monitoring.
Description
Technical Field
The present disclosure relates to the field of cluster monitoring technologies, and in particular, to a method, an apparatus, a device, and a storage medium for monitoring a cluster multidimensional abnormality.
Background
The deployment of large-scale clusters greatly helps to support business development, but it also increases the complexity of application systems and poses great challenges for cluster anomaly monitoring. At present, a large number of clusters still use the traditional single-node monitoring method; however, an abnormality of a single node does not necessarily affect the normal service of the cluster. Therefore, a cluster-level anomaly monitoring strategy is needed that comprehensively evaluates the service capability and the alarm strategy of the cluster, so as to improve the accuracy of cluster anomaly monitoring.
Disclosure of Invention
An object of an embodiment of the present specification is to provide a method, an apparatus, a device, and a storage medium for monitoring a cluster multidimensional abnormality, so as to improve accuracy of monitoring the cluster abnormality.
In order to achieve the above object, in one aspect, an embodiment of the present specification provides a cluster multidimensional abnormality monitoring method, including:
acquiring cluster performance information and node state information of each node in a cluster;
generating a first monitoring sub-result according to the cluster performance information, and generating a second monitoring sub-result according to the node state information;
and determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
In an embodiment of this specification, the generating a first monitoring sub-result according to the cluster performance information includes:
processing the cluster performance information according to a preset classification algorithm to obtain the number of abnormal data points in a first specified time;
judging whether the number of the abnormal data points is larger than a set number threshold value or not;
and if the number of the abnormal data points is larger than the number threshold, generating a first monitoring sub-result according to the number of the abnormal data points.
In an embodiment of this specification, the generating a second monitor sub-result according to the node state information includes:
determining the single-node abnormal rate within a first designated time according to the node state information;
inputting the single-node abnormal rate into a formula [formula not reproduced in this text] to obtain the abnormal rate of the cluster nodes;
judging whether the abnormal rate of the cluster nodes is greater than a set abnormal rate threshold value or not;
if the cluster node abnormal rate is larger than the abnormal rate threshold value, generating a second monitoring sub-result according to the abnormal rate threshold value;
wherein $r$ is the abnormal rate of the cluster nodes, $M$ is the total number of abnormal nodes in the first designated time, $a_n$ is the weight coefficient of the $n$th abnormal node in the first designated time, and $b_n$ is the single-node abnormal rate of the $n$th abnormal node in the first designated time.
In an embodiment of this specification, for the weight coefficients, when $n = 1$, $a_1$ is determined according to a formula [formula not reproduced in this text]; when $n \ge 2$, $a_n$ is determined according to the formula $a_n = a_1 \times q^{n-1}$; wherein $a_1$ is the weight coefficient of the 1st abnormal node in the first designated time, $q$ is a set value, and $q > 1$.
In an embodiment of this specification, the determining, according to the node state information, a single-node abnormal rate within a first specified time includes:
reading a node state message of a target node from a message queue; the node state message is sent to the message queue by each node at regular time;
judging whether a node state message of the target node is read from the message queue within a second designated time or not;
when no node state message of the target node is read from the message queue within the second designated time, or a node state message of the target node containing abnormal index data is read within the second designated time, confirming that the target node is an abnormal node;
and determining the single-node abnormal rate in the first appointed time according to the abnormal node.
In an embodiment of this specification, the determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result includes:
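determining an abnormal monitoring value of the cluster according to a formula [formula not reproduced in this text];
determining an abnormal monitoring result of the cluster according to the abnormal monitoring value;
wherein $F$ is the abnormal monitoring value of the cluster, $A_1$ is the first monitoring sub-result, and $A_2$ is the second monitoring sub-result.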
In an embodiment of this specification, the determining an anomaly monitoring result of the cluster according to the anomaly monitoring value includes:
when F is larger than 1, outputting a cluster high-level alarm;
when F is more than 0 and less than 1, outputting a cluster low-level alarm;
when F is 0, the cluster has no alarm output.
In an embodiment of this specification, each message in the message queue includes: node identification, node IP, CPU utilization, memory utilization, process state, IO performance and storage space state.
In an embodiment of the present specification, the classification algorithm comprises a nearest neighbor algorithm.
On the other hand, an embodiment of the present specification further provides a cluster multidimensional abnormality monitoring apparatus, including:
the acquisition module is used for acquiring cluster performance information and node state information of each node in the cluster;
the generating module is used for generating a first monitoring sub-result according to the cluster performance information and generating a second monitoring sub-result according to the node state information;
and the determining module is used for determining the abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
In another aspect, the embodiments of the present specification further provide a computer device, which includes a memory, a processor, and a computer program stored on the memory, and when the computer program is executed by the processor, the computer program executes the instructions of the above method.
In another aspect, the present specification further provides a computer storage medium, on which a computer program is stored, and the computer program is executed by a processor of a computer device to execute the instructions of the method.
As can be seen from the technical solutions provided in the embodiments of the present specification, abnormality monitoring is no longer performed based on only single node information, and cluster performance information is also considered, that is, the embodiments of the present specification integrate node state information and cluster performance information to perform abnormality monitoring, so that accuracy of cluster abnormality monitoring is improved, probability of a cluster generating a large amount of redundant alarm information is reduced, and cluster operation and maintenance pressure and cost are reduced.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort. In the drawings:
FIG. 1 illustrates a flow diagram of a cluster multi-dimensional anomaly monitoring method in some embodiments of the present description;
FIG. 2 is a flow diagram illustrating generation of a first monitoring sub-result according to cluster performance information in an embodiment of the present specification;
FIG. 3 is a flow diagram illustrating the generation of a second monitor sub-result based on node status information in one embodiment of the present description;
FIG. 4 is a block diagram illustrating the structure of a cluster multi-dimensional anomaly monitoring device in some embodiments of the present description;
FIG. 5 shows a block diagram of a computer device in some embodiments of the present description.
[Description of reference numerals]
41. an acquisition module;
42. a generation module;
43. a determination module;
502. a computer device;
504. a processor;
506. a memory;
508. a drive mechanism;
510. an input/output interface;
512. an input device;
514. an output device;
516. a presentation device;
518. a graphical user interface;
520. a network interface;
522. a communication link;
524. a communication bus.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.
Embodiments of the present specification relate to cluster anomaly monitoring techniques, where a cluster refers to a server cluster. A server cluster is a group of servers that together provide the same service and appear to a client as a single server. A cluster can use multiple computers for parallel computation to obtain high computation speed, and can also use multiple computers for backup, so that the whole system still operates normally when any one machine fails.
In the conventional technology, the alarm for the cluster is generally limited to alarm based on the abnormality of a single server (namely a single node), and when a system detects that a certain node in the cluster is abnormal, the system directly reports the abnormal node to an alarm module for alarm. In fact, in many cases, the anomaly of a single node does not necessarily affect the normal service of the cluster. Therefore, the conventional technology is easy to generate a large amount of redundant alarms, thereby causing a large burden to cluster operation and maintenance.
In view of this, in order to improve the accuracy of cluster anomaly monitoring and reduce the cluster operation and maintenance burden, the present specification provides an improved cluster multidimensional anomaly monitoring method that can be applied to any suitable computing device. Referring to FIG. 1, in some embodiments of the present specification, the cluster multidimensional abnormality monitoring method may include the following steps:
S101, obtaining cluster performance information and node state information of each node in the cluster.
S102, generating a first monitoring sub-result according to the cluster performance information, and generating a second monitoring sub-result according to the node state information.
S103, determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
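Steps S101 to S103 can be sketched as a small pipeline. This is a minimal sketch, not the implementation described in this specification: the names `monitor_cluster`, `acquire`, `score_performance`, `score_nodes` and `combine` are hypothetical, and the combining rule is passed in as a parameter because the combining formula is not reproduced in this text.

```python
from typing import Any, Callable, Tuple


def monitor_cluster(
    acquire: Callable[[], Tuple[Any, Any]],
    score_performance: Callable[[Any], float],
    score_nodes: Callable[[Any], float],
    combine: Callable[[float, float], float],
) -> float:
    """Run S101-S103 once and return the cluster anomaly monitoring value."""
    # S101: obtain cluster performance information and per-node state information.
    cluster_perf, node_states = acquire()
    # S102: derive one monitoring sub-result from each information source.
    first_sub_result = score_performance(cluster_perf)
    second_sub_result = score_nodes(node_states)
    # S103: merge both sub-results into a single monitoring value.
    return combine(first_sub_result, second_sub_result)
```

Keeping the three stages as injected callables mirrors the separation between acquiring, generating and determining, and lets each stage be replaced independently.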
In the embodiment of the description, abnormality monitoring is not performed only based on single node information, and cluster performance information is also considered, that is, the embodiment of the description integrates node state information and cluster performance information to perform abnormality monitoring, so that the accuracy of cluster abnormality monitoring is improved, the probability of a cluster generating a large amount of redundant alarm information is reduced, and the cluster operation and maintenance pressure and cost are reduced.
The node state information of each node in the cluster mainly reflects the individual state of the corresponding single node, while the cluster performance information can reflect the overall performance of the cluster to a certain extent. Therefore, in order to monitor the cluster for abnormality more comprehensively and accurately, in the embodiments of the present specification, not only the node state information of each node in the cluster but also the cluster performance information may be acquired.
In an embodiment of this specification, the node state information may include, for example, information such as a node identifier, a node IP, a CPU utilization, a memory utilization, a process state, an IO performance, and a storage space state. Wherein a node identification (e.g., a node number, etc.) may be used to uniquely identify a node in the cluster; the node IP is the IP address of the node; the CPU utilization, the memory utilization, the process state, the IO performance, and the storage space state correspond to a current CPU utilization, a memory utilization, a process state (for example, running state process number, ready state process number, blocking state process number, etc.), an IO performance (for example, disk IO performance, network IO performance, etc.), and a storage space state of the node, respectively.
In an embodiment of the present specification, the cluster performance information may include, for example, throughput, response time, and other indicators. The cluster performance information reflects the overall online performance of the cluster well. The throughput may be, for example, a transaction volume. The response time refers to the response time of the cluster, i.e., the average of the response times of all nodes within the cluster.
Those skilled in the art will appreciate that the node state information and the cluster performance information above are only exemplary; in other embodiments of the present specification, the node state information may include more or fewer items, which is not limited in the present specification and may be selected as needed.
In some embodiments of this specification, each node in the cluster may upload its own node status information to a specified message queue in the form of a message at regular time (e.g., every 5 seconds, every 10 seconds, etc.). Correspondingly, the node state information of each node in the cluster can be obtained by reading and processing the message from the message queue. In an embodiment of this specification, the cluster performance information may be collected by a script program or other tools.
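The timed upload of node state messages can be illustrated as follows. This is a minimal sketch under stated assumptions: an in-process `queue.Queue` stands in for the message queue, JSON is an assumed wire format, and the names `NodeStatus`, `publish_status`, `read_status` and the `timestamp` field are hypothetical. Only the other message fields come from the description above.

```python
import json
import queue
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class NodeStatus:
    node_id: str               # uniquely identifies the node in the cluster
    node_ip: str
    cpu_utilization: float     # fraction in [0, 1]
    memory_utilization: float  # fraction in [0, 1]
    process_state: dict        # e.g. {"running": 3, "ready": 1, "blocked": 0}
    io_performance: dict       # e.g. {"disk_mb_s": 120.0}
    storage_state: dict        # e.g. {"used_gb": 80}
    timestamp: float           # assumption: send time in seconds


def publish_status(mq: "queue.Queue[str]", status: NodeStatus) -> None:
    """Called by a node on a timer (e.g. every 5 seconds) to upload its state."""
    mq.put(json.dumps(asdict(status)))


def read_status(mq: "queue.Queue[str]", timeout: float) -> Optional[NodeStatus]:
    """Read one message; return None if nothing arrives within `timeout` seconds."""
    try:
        raw = mq.get(timeout=timeout)
    except queue.Empty:
        return None
    return NodeStatus(**json.loads(raw))
```

Returning `None` on timeout gives the consumer a direct signal for the "no message read within the designated time" case.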
Referring to FIG. 2, in some embodiments of the present specification, the generating the first monitoring sub-result according to the cluster performance information may include:
S201, processing the cluster performance information according to a preset classification algorithm to obtain the number of abnormal data points in the first designated time.
The advantage of processing the cluster performance information through the classification algorithm is that whether the cluster performance is abnormal can be identified by means of the relatively accurate classification capability of the classification algorithm, thereby being beneficial to improving the accuracy of cluster performance identification.
The preset classification algorithm may be any suitable classification algorithm, which is not limited in this specification and may be selected as needed. For example, in an embodiment of the present specification, the preset classification algorithm may be the k-nearest neighbor (KNN) algorithm. The KNN algorithm in the embodiments of the present specification includes a KNN model trainer and a KNN classifier. The KNN model trainer is trained on historical cluster performance information to generate the KNN classifier (including determining the K value of the KNN algorithm). Cluster performance information acquired within a sampling duration (for example, 30 seconds) is classified by the current KNN classifier to judge whether abnormal data points (for example, an abnormal request quantity or an abnormal response time) exist; if so, the number of abnormal data points within the sampling duration is recorded. In addition, the KNN classifier can be dynamically updated with the daily incremental cluster performance data, further improving its classification accuracy. A sampling duration includes a plurality of sampling points; for example, with one sample taken every 5 seconds, a 30-second sampling duration contains 6 samples.
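The counting of abnormal data points with a KNN classifier can be sketched in plain Python. This is an illustrative sketch only: the two features (throughput, response time), the labels `"normal"` and `"abnormal"`, the value `k = 3` and the function names are assumptions, and a production classifier would normally scale the features before computing distances.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # assumed features: (throughput, response_time_ms)


def knn_classify(train: Sequence[Tuple[Sample, str]], point: Sample, k: int = 3) -> str:
    """Label `point` by majority vote among its k nearest labeled samples."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def count_abnormal(train: Sequence[Tuple[Sample, str]], window: Sequence[Sample], k: int = 3) -> int:
    """Count the points of one sampling duration that are classified as abnormal."""
    return sum(1 for point in window if knn_classify(train, point, k) == "abnormal")


# Hypothetical labeled history and one 30-second window with 6 samples.
history = [
    ((100, 20), "normal"), ((105, 22), "normal"), ((98, 19), "normal"), ((102, 21), "normal"),
    ((20, 300), "abnormal"), ((15, 350), "abnormal"), ((25, 280), "abnormal"),
]
window = [(101, 20), (18, 320), (99, 21), (22, 290), (103, 22), (100, 19)]
abnormal_count = count_abnormal(history, window)
```

Here the "trainer" reduces to keeping labeled history and choosing `k`; updating the classifier with incremental data amounts to appending new labeled samples to `history`.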
In some embodiments of the present specification, a data analysis list ExceptionKnn{} may further be built in (taking the KNN classification algorithm as an example). ExceptionKnn{} may be used to store, for a single sampling duration (e.g., 30 seconds), the cluster performance data points, collected every 1 second, that the KNN classifier identifies as deviating from normal, i.e., the abnormal data points. In addition, data older than the current sampling duration may be cleaned (i.e., removed from ExceptionKnn{}).
S202, judging whether the number of the abnormal data points is larger than a set number threshold value.
In the embodiment of the present specification, the number of abnormal data points may be counted periodically. For example, in an embodiment of the present specification, the number of elements in ExceptionKnn{} (i.e., ExceptionKnn.size()) may be counted every 5 seconds and compared with a set number threshold; if the number is greater than the number threshold, step S203 may be performed; otherwise, the next round of counting and judgment is carried out.
S203, if the number of the abnormal data points is larger than the number threshold, generating a first monitoring sub-result according to the number of the abnormal data points.
For example, in one embodiment of the present specification, the condition is ExceptionKnn.size() > 5 (here 5 is used as the quantity threshold and can be adjusted according to actual production). That is, when more than 5 abnormal data points occur in the cluster within a single sampling duration, it is determined that there is a greater risk of cluster abnormality. Accordingly, generating the first monitoring sub-result may refer to taking the probability of abnormal data points within the corresponding single sampling duration as the first monitoring sub-result. For example, if the number of data points in a single sampling duration is 60 and 5 abnormal data points occur among them, the probability of the corresponding abnormal data points is $5/60 \approx 0.083$. In addition, generation of cluster KNN alarm information is triggered, and the alarm information is labeled AlarmKnn.
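Steps S202 and S203 can be sketched as a sliding window over abnormal data points. This is a hedged sketch: the class name `ExceptionWindow`, its parameters and the use of timestamps are hypothetical, and it assumes abnormal points are added in increasing time order.

```python
from collections import deque
from typing import Deque, Optional


class ExceptionWindow:
    """Holds timestamps of abnormal data points from the last `duration` seconds."""

    def __init__(self, duration: float = 30.0, threshold: int = 5, points_per_window: int = 60):
        self.duration = duration
        self.threshold = threshold
        self.points_per_window = points_per_window
        self._times: Deque[float] = deque()

    def add(self, timestamp: float) -> None:
        """Record one abnormal data point (timestamps must be non-decreasing)."""
        self._times.append(timestamp)

    def check(self, now: float) -> Optional[float]:
        """Return the abnormal-point probability if the threshold is exceeded, else None."""
        # Clean data that has fallen out of the current sampling duration.
        while self._times and self._times[0] <= now - self.duration:
            self._times.popleft()
        count = len(self._times)
        if count > self.threshold:
            return count / self.points_per_window  # first monitoring sub-result
        return None
```

Calling `check` on a timer (for example every 5 seconds) corresponds to the periodic counting; a non-`None` return value is the point at which an AlarmKnn alarm would be raised.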
Thus, by the method shown in fig. 2, a relatively accurate cluster performance monitoring result (i.e., the first monitoring sub-result) may be obtained.
Referring to FIG. 3, in some embodiments of the present specification, the generating the second monitoring sub-result according to the node state information may include:
S301, determining the single-node abnormal rate in the first designated time according to the node state information.
In order to improve the data processing efficiency, in some embodiments of the present specification, one cluster information synchronization process and multiple message processing processes may be started. The cluster information synchronization process is used for recording and updating node identifications and node IP information of all nodes of the current cluster, and the number of the nodes in the cluster, the serial number of each node and the IP of each node can be rapidly known through the cluster information synchronization process; meanwhile, the cluster information synchronization process can also be responsible for summarizing the abnormal information of the nodes in each sampling duration for subsequent processing. The message processing processes can be used for reading the messages in the message queue and processing the messages according to preset processing rules.
For example, in an embodiment of the present specification, the specific processing logic of the message processing process may be: if the message of a certain node is not read from the message queue within the current sampling duration, marking the node as abnormal; if a message of a node can be read from the message queue within 5 seconds, but some index in the message is abnormal (for example, the CPU utilization exceeds a threshold value, etc.), the node may also be marked as abnormal. If the abnormal node in the cluster is marked after the message processing process is processed, the cluster information synchronization process collects the abnormal information and calculates the single-node abnormal rate according to the abnormal information so as to facilitate the subsequent processing. Wherein, the single-node abnormal rate in the first designated time means: the anomaly rate of each node within one sample duration.
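The marking logic of the message processing process can be sketched as follows. This is a minimal sketch under stated assumptions: messages are plain dictionaries, CPU utilization is the only index checked, the 0.9 threshold is hypothetical, and the single-node abnormal rate is assumed to be the fraction of reporting intervals within one sampling duration in which the node was marked abnormal (the text does not define the rate more precisely).

```python
from typing import Dict, Iterable, List, Set

CPU_THRESHOLD = 0.9  # hypothetical index threshold


def abnormal_nodes(known_nodes: Iterable[str], messages: List[dict],
                   cpu_threshold: float = CPU_THRESHOLD) -> Set[str]:
    """Mark nodes whose message is missing or carries an abnormal index."""
    latest = {msg["node_id"]: msg for msg in messages}
    marked = set()
    for node in known_nodes:
        msg = latest.get(node)
        if msg is None:                               # no message read in this interval
            marked.add(node)
        elif msg["cpu_utilization"] > cpu_threshold:  # message read, index abnormal
            marked.add(node)
    return marked


def single_node_abnormal_rates(known_nodes: Iterable[str],
                               intervals: List[List[dict]]) -> Dict[str, float]:
    """Assumed definition: share of intervals in which each node was abnormal."""
    if not intervals:
        raise ValueError("at least one reporting interval is required")
    nodes = list(known_nodes)
    counts = {node: 0 for node in nodes}
    for messages in intervals:
        for node in abnormal_nodes(nodes, messages):
            counts[node] += 1
    return {node: counts[node] / len(intervals) for node in nodes}
```

The list of known nodes plays the role of the cluster information synchronization process, which is what makes a silent node detectable at all.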
In some embodiments of the present specification, a data analysis list ExceptionSingleList{} may also be built in. The list ExceptionSingleList{} is used for storing predefined abnormal node description information (i.e., the definition of an abnormal node), so that the single-node abnormal rate can be judged according to the definition.
S302, inputting the single-node abnormal rate into a formula [formula not reproduced in this text] to obtain the abnormal rate of the cluster nodes.
In the formula of this step, $r$ is the abnormal rate of the cluster nodes, $M$ is the total number of abnormal nodes in the first designated time, $a_n$ is the weight coefficient of the $n$th abnormal node in the first designated time, and $b_n$ is the single-node abnormal rate of the $n$th abnormal node in the first designated time.
In some embodiments of the present specification, for the weight coefficients, when $n = 1$, $a_1$ is determined according to a formula [formula not reproduced in this text]; when $n \ge 2$, $a_n$ is determined according to the formula $a_n = a_1 \times q^{n-1}$; wherein $a_1$ is the weight coefficient of the 1st abnormal node within the first designated time, $q$ is a set value, and $q > 1$. It can be seen that, in this case, the more recent the sampling time, the larger the weight. The advantage of this weighting is that the influence of the earliest acquisition points is reduced (because an abnormality may later recover automatically; for example, occasional abnormal jitter can recover on its own), so that abnormalities that recover automatically can be largely ignored and false alarms are reduced.
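The geometric weighting can be illustrated with a short sketch. Only the relation $a_n = a_1 \times q^{n-1}$ with $q > 1$ comes from the text; the formulas for $a_1$ and for $r$ are not reproduced here, so the sketch makes two explicit assumptions: $a_1 = (q - 1)/(q^M - 1)$, which makes the $M$ weights sum to 1, and $r = \sum_{n=1}^{M} a_n b_n$, a weighted sum of the single-node abnormal rates. The function names are hypothetical.

```python
from typing import List, Sequence


def geometric_weights(m: int, q: float) -> List[float]:
    """Weights a_n = a_1 * q**(n-1) for n = 1..m, growing toward recent samples."""
    if m < 1 or q <= 1:
        raise ValueError("requires m >= 1 and q > 1")
    # ASSUMPTION: a_1 is chosen so that the weights sum to 1 (geometric series).
    a1 = (q - 1) / (q ** m - 1)
    return [a1 * q ** (n - 1) for n in range(1, m + 1)]


def cluster_abnormal_rate(single_rates: Sequence[float], q: float) -> float:
    """ASSUMPTION: r is the weighted sum of single-node abnormal rates b_n."""
    if not single_rates:
        return 0.0  # no abnormal nodes in the designated time
    weights = geometric_weights(len(single_rates), q)
    return sum(a * b for a, b in zip(weights, single_rates))
```

With $q = 2$ and three abnormal nodes, the weights under these assumptions are 1/7, 2/7 and 4/7, so the latest entry contributes four times as much as the earliest one.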
S303, judging whether the abnormal rate of the cluster nodes is larger than a set abnormal rate threshold value. If the cluster node abnormal rate is greater than the abnormal rate threshold, executing step S304; otherwise, the next calculation can be performed.
S304, if the cluster node abnormal rate is larger than the abnormal rate threshold value, generating a second monitoring sub-result.
In this embodiment of the present specification, generating the second monitoring sub-result may refer to taking the cluster node anomaly rate r at this time as the second monitoring sub-result, and may also trigger generating node anomaly alarm information, and mark the generated node anomaly alarm information as AlarmSingle. Thus, by the method shown in fig. 3, a relatively accurate node state monitoring result (i.e., the second monitoring sub-result) can be obtained.
In some embodiments of the present specification, the determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result may include: determining an abnormal monitoring value of the cluster according to a formula [formula not reproduced in this text]; and determining an abnormal monitoring result of the cluster according to the abnormal monitoring value. Wherein $F$ is the abnormal monitoring value of the cluster, $A_1$ is the first monitoring sub-result, and $A_2$ is the second monitoring sub-result. It can be seen from the formula that the abnormal monitoring result of the cluster comprehensively considers the first monitoring sub-result (i.e., the cluster performance monitoring condition) and the second monitoring sub-result (i.e., the single-node state monitoring condition), and that the coefficient of the first monitoring sub-result is greater than that of the second monitoring sub-result; that is, cluster performance is more important than node state in cluster anomaly monitoring and is therefore given a higher coefficient.
In some embodiments of this specification, the determining an anomaly monitoring result of the cluster according to the anomaly monitoring value may include: when F is larger than 1, outputting a cluster high-level alarm; when F is more than 0 and less than 1, outputting a cluster low-level alarm; when F is 0, the cluster has no alarm output. In this way, multi-level anomaly early warning (for example, as shown in Table 1 below) is realized according to the severity of the abnormal condition, improving the accuracy of cluster anomaly monitoring.
TABLE 1
AlarmSingle | AlarmKnn | Alarm rating |
---|---|---|
1 | 0 | Low |
1 | 1 | High |
0 | 1 | High |
0 | 0 | Normal |
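The mapping in Table 1 can be sketched as a simple lookup. This is only an illustration: the function name `alarm_rating` and the lowercase rating strings are assumptions, and AlarmKnn is read here as the 0/1 flag for the cluster performance alarm.

```python
# Table 1 as a lookup: (AlarmSingle, AlarmKnn) -> alarm rating.
# The names and string values are illustrative assumptions.
ALARM_RATING = {
    (1, 0): "low",     # only the single-node alarm fired
    (1, 1): "high",    # both alarms fired
    (0, 1): "high",    # only the cluster performance alarm fired
    (0, 0): "normal",  # no alarm fired
}


def alarm_rating(alarm_single: int, alarm_knn: int) -> str:
    """Return the Table 1 rating for a pair of 0/1 alarm flags."""
    return ALARM_RATING[(alarm_single, alarm_knn)]
```

The table shows that the cluster performance alarm alone is enough for a high rating, while the single-node alarm alone yields only a low rating.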
While the process flows described above include operations that occur in a particular order, it should be appreciated that the processes may include more or fewer operations, which may be performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).
Corresponding to the method, the present specification further provides an embodiment of a cluster multidimensional abnormality monitoring apparatus. Referring to fig. 4, in some embodiments of the present specification, the cluster multidimensional abnormality monitoring apparatus may include:
an obtaining module 41, configured to obtain cluster performance information and node state information of each node in the cluster;
a generating module 42, configured to generate a first monitoring sub-result according to the cluster performance information, and generate a second monitoring sub-result according to the node state information;
the determining module 43 may be configured to determine an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
In some apparatus embodiments of the present description, the generating a first monitoring sub-result according to the cluster performance information includes:
processing the cluster performance information according to a preset classification algorithm to obtain the number of abnormal data points in a first specified time;
judging whether the number of the abnormal data points is larger than a set number threshold value or not;
and if the number of the abnormal data points is larger than the number threshold, generating a first monitoring sub-result according to the number of the abnormal data points.
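The three steps above can be sketched as follows, assuming a plain k-nearest-neighbour majority vote over labelled historical samples. The function names, the "normal"/"abnormal" labels, the value of k and the choice to return the raw count as the sub-result are all assumptions made for illustration; the source does not specify them.

```python
import math
from collections import Counter


def knn_label(point, training, k=3):
    """Classify `point` by majority vote among its k nearest labelled samples.

    `training` is a list of (feature_tuple, label) pairs (hypothetical format).
    """
    nearest = sorted(training, key=lambda s: math.dist(point, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def first_sub_result(window, training, count_threshold, k=3):
    """Count abnormal points in the window for the first specified time.

    Returns the count only when it exceeds the threshold, otherwise None.
    Using the count itself as the sub-result is an assumption.
    """
    abnormal = sum(1 for p in window if knn_label(p, training, k) == "abnormal")
    return abnormal if abnormal > count_threshold else None
```

For example, with three "normal" samples near (0.1, 0.1) and three "abnormal" samples near (0.9, 0.9), a window containing three points near (0.9, 0.9) exceeds a threshold of 2.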
In some device embodiments of this specification, the generating a second monitor sub-result according to the node state information includes:
determining the single-node abnormal rate within a first designated time according to the node state information;
inputting the single-node abnormal rate into a formula [formula not reproduced in the source text] to obtain the cluster node abnormal rate;
judging whether the abnormal rate of the cluster nodes is greater than a set abnormal rate threshold value or not;
if the cluster node abnormal rate is larger than the abnormal rate threshold value, generating a second monitoring sub-result according to the abnormal rate threshold value;
wherein r is the abnormal rate of the cluster nodes, M is the total number of abnormal nodes in the first designated time, $a_n$ is the weight coefficient of the nth abnormal node in the first designated time, and $b_n$ is the single-node abnormal rate of the nth abnormal node in the first designated time.
In some embodiments of the apparatus in this specification, among the weight coefficients, when n is 1, $a_1$ is determined according to the formula [formula not reproduced in the source text]; when n is greater than or equal to 2, $a_n$ is determined according to the formula $a_n = a_1 \times q^{n-1}$; wherein $a_1$ is the weight coefficient of the 1st abnormal node in the first designated time, q is a set value, and q is greater than 1.
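The geometric progression of the weight coefficients can be sketched as below. Because the formula for the first weight is not available in this text, the sketch takes it as an input rather than computing it; the function name `weight_coefficients` is an assumption.

```python
def weight_coefficients(a1: float, q: float, m: int) -> list:
    """Return [a_1, ..., a_M] with a_n = a_1 * q**(n - 1).

    a1 is supplied by the caller because its formula is not reproduced
    in the source text; q must be greater than 1 as stated there.
    """
    if q <= 1:
        raise ValueError("q must be greater than 1")
    if m < 1:
        raise ValueError("m must be at least 1")
    return [a1 * q ** (n - 1) for n in range(1, m + 1)]
```

Since q is greater than 1, the weights grow with n, so later abnormal nodes contribute more heavily than earlier ones.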
In some apparatus embodiments of the present description, the determining an abnormal rate of a single node within a first specified time according to the node status information includes:
reading a node state message of a target node from a message queue; the node state message is sent to the message queue by each node at regular time;
judging whether a node state message of the target node is read from the message queue within a second designated time or not;
when no node state message of the target node is read from the message queue within the second designated time, or a node state message of the target node containing abnormal index data is read within the second designated time, confirming that the target node is an abnormal node;
and determining the single-node abnormal rate in the first appointed time according to the abnormal node.
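The abnormal-node check described above can be sketched as follows, assuming the messages read from the queue are available as (timestamp, node id, payload) tuples. The tuple layout, the function name and the `has_abnormal_metric` predicate are hypothetical; the sketch stops at flagging the node and does not compute the single-node abnormal rate, whose definition is not given in this passage.

```python
def is_abnormal_node(messages, node_id, window_start, window_end, has_abnormal_metric):
    """Decide whether `node_id` is abnormal within the second designated time.

    messages: iterable of (timestamp, node_id, payload) tuples already read
              from the message queue (hypothetical layout).
    has_abnormal_metric: caller-supplied predicate on a payload (assumption).
    """
    seen = False
    for ts, nid, payload in messages:
        if nid != node_id or not (window_start <= ts <= window_end):
            continue
        seen = True
        if has_abnormal_metric(payload):
            return True  # a message arrived but carries abnormal index data
    return not seen  # no message at all within the window means abnormal
```

A node is therefore flagged in two cases: it stayed silent for the whole window, or it reported at least one message with abnormal data.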
In some device embodiments of this specification, the determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result includes:
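determining an abnormal monitoring value of the cluster according to the formula [formula not reproduced in the source text];
determining an abnormal monitoring result of the cluster according to the abnormal monitoring value;
wherein F is the abnormal monitoring result of the cluster, $A_1$ is the first monitoring sub-result, and $A_2$ is the second monitoring sub-result.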
In some device embodiments of this specification, the determining an anomaly monitoring result of the cluster according to the anomaly monitoring value includes:
when F is larger than 1, outputting a cluster high-level alarm;
when F is more than 0 and less than 1, outputting a cluster low-level alarm;
when F is 0, the cluster has no alarm output.
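The three threshold rules can be sketched as a small function that takes an already computed F. The function name and return strings are assumptions. The stated rules do not cover F equal to 1 or negative F, so the sketch raises an error for those values instead of guessing.

```python
def cluster_alarm(f: float) -> str:
    """Map the abnormal monitoring value F to an alarm level."""
    if f > 1:
        return "high"
    if 0 < f < 1:
        return "low"
    if f == 0:
        return "none"
    # F == 1 and F < 0 are not covered by the stated rules.
    raise ValueError(f"no alarm rule is stated for F = {f}")
```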
In some apparatus embodiments of the present description, each message in the message queue includes: node identification, node IP, CPU utilization, memory utilization, process state, IO performance and storage space state.
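A message carrying the fields listed above might be modelled as shown below. The field names, their types and the JSON encoding are assumptions chosen for illustration; the source lists only the kinds of information a message contains.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class NodeStateMessage:
    """One node state message; field names and types are illustrative."""
    node_id: str
    node_ip: str
    cpu_utilization: float
    memory_utilization: float
    process_state: str
    io_performance: float
    storage_space_state: str


def encode(msg: NodeStateMessage) -> str:
    """Serialize a message to JSON before putting it on the queue."""
    return json.dumps(asdict(msg))


def decode(raw: str) -> NodeStateMessage:
    """Rebuild a message from the JSON text read from the queue."""
    return NodeStateMessage(**json.loads(raw))
```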
In some apparatus embodiments herein, the classification algorithm may include a KNN algorithm.
For convenience of description, the above apparatus is described as being divided into various units by function, each described separately. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Embodiments of the present description also provide a computer device. As shown in FIG. 5, in some embodiments of the present description, the computer device 502 may include one or more processors 504, such as one or more Central Processing Units (CPUs) or Graphics Processors (GPUs), each of which may implement one or more hardware threads. The computer device 502 may also include any memory 506 for storing any kind of information such as code, settings, data, etc. In a specific embodiment, a computer program is stored on the memory 506 and is executable on the processor 504; when executed by the processor 504, the computer program performs the instructions of the cluster multi-dimensional anomaly monitoring method according to any of the above embodiments. For example, and without limitation, memory 506 may include any one or more of the following in combination: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may use any technology to store information. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of computer device 502. In one case, when the processor 504 executes the associated instructions, which are stored in any memory or combination of memories, the computer device 502 can perform any of the operations of the associated instructions. The computer device 502 also includes one or more drive mechanisms 508, such as a hard disk drive mechanism, an optical disk drive mechanism, etc., for interacting with any memory.
Communication link 522 may be implemented in any manner, such as through a local area network, a wide area network (e.g., the Internet), a point-to-point connection, etc., or any combination thereof. Communication link 522 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products of some embodiments of the specification. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processor to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processor, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processor to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processor to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computer device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The embodiments of this specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The described embodiments may also be practiced in distributed computing environments where tasks are performed by remote processors that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant details, reference may be made to the corresponding description of the method embodiment. In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the specification. In this specification, such uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, provided they do not contradict one another, those skilled in the art may combine the various embodiments or examples described in this specification and the features of different embodiments or examples.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (12)
1. A cluster multi-dimensional anomaly monitoring method is characterized by comprising the following steps:
acquiring cluster performance information and node state information of each node in a cluster;
generating a first monitoring sub-result according to the cluster performance information, and generating a second monitoring sub-result according to the node state information;
and determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
2. The method for monitoring the multi-dimensional abnormality of the cluster according to claim 1, wherein the generating a first monitoring sub-result according to the cluster performance information includes:
processing the cluster performance information according to a preset classification algorithm to obtain the number of abnormal data points in a first specified time;
judging whether the number of the abnormal data points is larger than a set number threshold value or not;
and if the number of the abnormal data points is larger than the number threshold, generating a first monitoring sub-result according to the number of the abnormal data points.
3. The method for monitoring the multi-dimensional abnormality of the cluster according to claim 1, wherein the generating of the second monitoring sub-result according to the node status information includes:
determining the single-node abnormal rate within a first designated time according to the node state information;
inputting the single-node abnormal rate into a formula [formula not reproduced in the source text] to obtain the cluster node abnormal rate;
judging whether the abnormal rate of the cluster nodes is greater than a set abnormal rate threshold value or not;
if the cluster node abnormal rate is larger than the abnormal rate threshold value, generating a second monitoring sub-result according to the abnormal rate threshold value;
wherein r is the abnormal rate of the cluster nodes, M is the total number of abnormal nodes in the first designated time, $a_n$ is the weight coefficient of the nth abnormal node in the first designated time, and $b_n$ is the single-node abnormal rate of the nth abnormal node in the first designated time.
4. The cluster multidimensional abnormality monitoring method according to claim 3, wherein, among the weight coefficients, when n is 1, $a_1$ is determined according to the formula [formula not reproduced in the source text]; when n is greater than or equal to 2, $a_n$ is determined according to the formula $a_n = a_1 \times q^{n-1}$; wherein $a_1$ is the weight coefficient of the 1st abnormal node in the first designated time, q is a set value, and q is greater than 1.
5. The cluster multidimensional abnormality monitoring method of claim 3, wherein the determining the single-node abnormality rate within a first specified time according to the node state information comprises:
reading a node state message of a target node from a message queue; the node state message is sent to the message queue by each node at regular time;
judging whether a node state message of the target node is read from the message queue within a second designated time or not;
when no node state message of the target node is read from the message queue within the second designated time, or a node state message of the target node containing abnormal index data is read within the second designated time, confirming that the target node is an abnormal node;
and determining the single-node abnormal rate in the first appointed time according to the abnormal node.
6. The method for monitoring the multi-dimensional abnormality of the cluster according to claim 1, wherein the determining the abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result includes:
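determining an abnormal monitoring value of the cluster according to the formula [formula not reproduced in the source text];
determining an abnormal monitoring result of the cluster according to the abnormal monitoring value;
wherein F is the abnormal monitoring result of the cluster, $A_1$ is the first monitoring sub-result, and $A_2$ is the second monitoring sub-result.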
7. The method for monitoring the multi-dimensional abnormality of the cluster according to claim 6, wherein the determining the abnormality monitoring result of the cluster according to the abnormality monitoring value includes:
when F is larger than 1, outputting a cluster high-level alarm;
when F is more than 0 and less than 1, outputting a cluster low-level alarm;
when F is 0, the cluster has no alarm output.
8. The method for monitoring cluster multidimensional abnormality of claim 5, wherein each message in the message queue comprises: node identification, node IP, CPU utilization, memory utilization, process state, IO performance and storage space state.
9. The method for cluster multidimensional anomaly monitoring of claim 2, wherein said classification algorithm comprises a nearest neighbor algorithm.
10. A cluster multidimensional abnormality monitoring device, comprising:
the acquisition module is used for acquiring cluster performance information and node state information of each node in the cluster;
the generating module is used for generating a first monitoring sub-result according to the cluster performance information and generating a second monitoring sub-result according to the node state information;
and the determining module is used for determining the abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
11. A computer device comprising a memory, a processor, and a computer program stored on the memory, wherein the computer program, when executed by the processor, performs the instructions of the method of any one of claims 1-9.
12. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor of a computer device, executes instructions of a method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110591793.9A CN113220534A (en) | 2021-05-28 | 2021-05-28 | Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110591793.9A CN113220534A (en) | 2021-05-28 | 2021-05-28 | Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113220534A (en) | 2021-08-06 |
Family
ID=77099064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110591793.9A Pending CN113220534A (en) | 2021-05-28 | 2021-05-28 | Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113220534A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113641561A (en) * | 2021-10-15 | 2021-11-12 | 杭州朗澈科技有限公司 | Method and system for displaying monitoring data in edge scene |
CN113671322A (en) * | 2021-10-25 | 2021-11-19 | 广东电网有限责任公司东莞供电局 | Microgrid state online monitoring method and device |
CN115412420A (en) * | 2022-08-29 | 2022-11-29 | 苏州浪潮智能科技有限公司 | Management method, device, equipment and medium for inter-frame cluster communication |
CN115499296A (en) * | 2022-07-29 | 2022-12-20 | 天翼云科技有限公司 | Cloud desktop hot standby management method, device and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111708665A (en) * | 2020-05-29 | 2020-09-25 | 苏州浪潮智能科技有限公司 | Method, device, equipment and medium for comprehensively monitoring storage cluster system |
CN112115031A (en) * | 2020-09-29 | 2020-12-22 | 中国银行股份有限公司 | Cluster state monitoring method and device |
WO2021051582A1 (en) * | 2019-09-17 | 2021-03-25 | 平安科技(深圳)有限公司 | Host performance monitoring method and apparatus for server cluster, device, and storage medium |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113641561A (en) * | 2021-10-15 | 2021-11-12 | 杭州朗澈科技有限公司 | Method and system for displaying monitoring data in edge scene |
CN113641561B (en) * | 2021-10-15 | 2022-02-22 | 杭州朗澈科技有限公司 | Method and system for displaying monitoring data in edge scene |
CN113671322A (en) * | 2021-10-25 | 2021-11-19 | 广东电网有限责任公司东莞供电局 | Microgrid state online monitoring method and device |
CN115499296A (en) * | 2022-07-29 | 2022-12-20 | 天翼云科技有限公司 | Cloud desktop hot standby management method, device and system |
CN115499296B (en) * | 2022-07-29 | 2024-03-12 | 天翼云科技有限公司 | Cloud desktop hot standby management method, device and system |
CN115412420A (en) * | 2022-08-29 | 2022-11-29 | 苏州浪潮智能科技有限公司 | Management method, device, equipment and medium for inter-frame cluster communication |
CN115412420B (en) * | 2022-08-29 | 2023-08-18 | 苏州浪潮智能科技有限公司 | Method, device, equipment and medium for managing inter-frame trunking communication |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113220534A (en) | Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium | |
Borghesi et al. | Anomaly detection using autoencoders in high performance computing systems | |
US10318366B2 (en) | System and method for relationship based root cause recommendation | |
CN110351150B (en) | Fault source determination method and device, electronic equipment and readable storage medium | |
US20170068747A1 (en) | System and method for end-to-end application root cause recommendation | |
US20200034730A1 (en) | Machine Discovery of Aberrant Operating States | |
US20180211172A1 (en) | Machine Discovery and Rapid Agglomeration of Similar States | |
JP5933463B2 (en) | Log occurrence abnormality detection device and method | |
US9860109B2 (en) | Automatic alert generation | |
Bhaduri et al. | Detecting abnormal machine characteristics in cloud infrastructures | |
CN112769605B (en) | Heterogeneous multi-cloud operation and maintenance management method and hybrid cloud platform | |
US11736363B2 (en) | Techniques for analyzing a network and increasing network availability | |
CN114595210A (en) | Multi-dimensional data anomaly detection method and device and electronic equipment | |
JP5711675B2 (en) | Network abnormality detection apparatus and network abnormality detection method | |
JP6252309B2 (en) | Monitoring omission identification processing program, monitoring omission identification processing method, and monitoring omission identification processing device | |
CN113708987A (en) | Network anomaly detection method and device | |
CN113055246B (en) | Abnormal service node identification method, device, equipment and storage medium | |
CN111061581B (en) | Fault detection method, device and equipment | |
US10282245B1 (en) | Root cause detection and monitoring for storage systems | |
CN116975938B (en) | Sensor data processing method in product manufacturing process | |
US10432647B2 (en) | Malicious industrial internet of things node activity detection for connected plants | |
US11269706B2 (en) | System and method for alarm correlation and aggregation in IT monitoring | |
CN113342608A (en) | Method and device for monitoring streaming computing engine task | |
JP2019049802A (en) | Failure analysis supporting device, incident managing system, failure analysis supporting method, and program | |
CN116826961A (en) | Intelligent power grid dispatching and operation and maintenance system, method and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||