CN113220534A - Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium - Google Patents

Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium

Info

Publication number
CN113220534A
Authority
CN
China
Prior art keywords
cluster
node
monitoring
abnormal
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110591793.9A
Other languages
Chinese (zh)
Inventor
李飞飞
欧阳南杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110591793.9A
Publication of CN113220534A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/547 Messaging middleware
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Embodiments of this specification provide a cluster multi-dimensional anomaly monitoring method, apparatus, device, and storage medium. The method comprises: acquiring cluster performance information and node state information of each node in a cluster; generating a first monitoring sub-result according to the cluster performance information, and generating a second monitoring sub-result according to the node state information; and determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result. Embodiments of this specification can improve the accuracy of cluster anomaly monitoring.

Description

Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of cluster monitoring technologies, and in particular, to a method, an apparatus, a device, and a storage medium for monitoring a cluster multidimensional abnormality.
Background
Large-scale cluster deployment greatly helps support business growth, but it also increases the complexity of application systems and makes cluster anomaly monitoring considerably more challenging. At present, many clusters still rely on traditional single-node monitoring, yet an abnormality in a single node does not necessarily affect the normal service of the cluster. A cluster-level anomaly monitoring strategy is therefore needed that comprehensively evaluates the service capability of the cluster and its alarm strategy, so as to improve the accuracy of cluster anomaly monitoring.
Disclosure of Invention
An object of an embodiment of the present specification is to provide a method, an apparatus, a device, and a storage medium for monitoring a cluster multidimensional abnormality, so as to improve accuracy of monitoring the cluster abnormality.
In order to achieve the above object, in one aspect, an embodiment of the present specification provides a cluster multidimensional abnormality monitoring method, including:
acquiring cluster performance information and node state information of each node in a cluster;
generating a first monitoring sub-result according to the cluster performance information, and generating a second monitoring sub-result according to the node state information;
and determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
In an embodiment of this specification, the generating a first monitoring sub-result according to the cluster performance information includes:
processing the cluster performance information according to a preset classification algorithm to obtain the number of abnormal data points in a first specified time;
judging whether the number of the abnormal data points is larger than a set number threshold value or not;
and if the number of the abnormal data points is larger than the number threshold, generating a first monitoring sub-result according to the number of the abnormal data points.
In an embodiment of this specification, the generating a second monitor sub-result according to the node state information includes:
determining the single-node abnormal rate within a first designated time according to the node state information;
inputting the single-node abnormal rate into the formula
[Formula image BDA0003089541050000021; not reproduced in this text]
to acquire the cluster node abnormal rate;
judging whether the cluster node abnormal rate is greater than a set abnormal rate threshold value;
if the cluster node abnormal rate is greater than the abnormal rate threshold value, generating a second monitoring sub-result according to the abnormal rate threshold value;
wherein $r$ is the cluster node abnormal rate, $M$ is the total number of abnormal nodes within the first designated time, $a_n$ is the weight coefficient of the $n$-th abnormal node within the first designated time, and $b_n$ is the single-node abnormal rate of the $n$-th abnormal node within the first designated time.
In an embodiment of this specification, among the weight coefficients, when $n = 1$, $a_1$ is determined according to the formula
[Formula image BDA0003089541050000022; not reproduced in this text]
when $n \geq 2$, $a_n$ is determined according to the formula $a_n = a_1 \times q^{n-1}$; wherein $a_1$ is the weight coefficient of the 1st abnormal node within the first designated time, $q$ is a set value, and $q > 1$.
In an embodiment of this specification, the determining, according to the node state information, a single-node abnormal rate within a first specified time includes:
reading a node state message of a target node from a message queue; the node state message is sent to the message queue by each node at regular time;
judging whether a node state message of the target node is read from the message queue within a second designated time or not;
when no node state message of the target node is read from the message queue within the second designated time, or a node state message of the target node containing abnormal index data is read within the second designated time, confirming the target node to be an abnormal node;
and determining the single-node abnormal rate in the first appointed time according to the abnormal node.
In an embodiment of this specification, the determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result includes:
according to the formula
[Formula image BDA0003089541050000023; not reproduced in this text]
determining an abnormal monitoring value of the cluster;
determining an abnormal monitoring result of the cluster according to the abnormal monitoring value;
wherein $F$ is the abnormal monitoring value of the cluster, $A_1$ is the first monitoring sub-result, and $A_2$ is the second monitoring sub-result.
In an embodiment of this specification, the determining an anomaly monitoring result of the cluster according to the anomaly monitoring value includes:
when F is larger than 1, outputting a cluster high-level alarm;
when F is more than 0 and less than 1, outputting a cluster low-level alarm;
when F is 0, the cluster has no alarm output.
In an embodiment of this specification, each message in the message queue includes: node identification, node IP, CPU utilization, memory utilization, process state, IO performance and storage space state.
In an embodiment of the present specification, the classification algorithm comprises a nearest neighbor algorithm.
On the other hand, an embodiment of the present specification further provides a cluster multidimensional abnormality monitoring apparatus, including:
the acquisition module is used for acquiring cluster performance information and node state information of each node in the cluster;
the generating module is used for generating a first monitoring sub-result according to the cluster performance information and generating a second monitoring sub-result according to the node state information;
and the determining module is used for determining the abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
In another aspect, the embodiments of the present specification further provide a computer device, which includes a memory, a processor, and a computer program stored on the memory, and when the computer program is executed by the processor, the computer program executes the instructions of the above method.
In another aspect, the present specification further provides a computer storage medium, on which a computer program is stored, and the computer program is executed by a processor of a computer device to execute the instructions of the method.
As can be seen from the technical solutions provided in the embodiments of the present specification, abnormality monitoring is no longer performed based on only single node information, and cluster performance information is also considered, that is, the embodiments of the present specification integrate node state information and cluster performance information to perform abnormality monitoring, so that accuracy of cluster abnormality monitoring is improved, probability of a cluster generating a large amount of redundant alarm information is reduced, and cluster operation and maintenance pressure and cost are reduced.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort. In the drawings:
FIG. 1 illustrates a flow diagram of a cluster multi-dimensional anomaly monitoring method in some embodiments of the present description;
FIG. 2 is a flow diagram illustrating generation of a first monitoring sub-result according to cluster performance information in an embodiment of the present specification;
FIG. 3 is a flow diagram illustrating the generation of a second monitor sub-result based on node status information in one embodiment of the present description;
FIG. 4 is a block diagram illustrating the structure of a cluster multi-dimensional anomaly monitoring device in some embodiments of the present description;
FIG. 5 shows a block diagram of a computer device in some embodiments of the present description.
[Description of reference numerals]
41. an acquisition module;
42. a generation module;
43. a determination module;
502. a computer device;
504. a processor;
506. a memory;
508. a drive mechanism;
510. an input/output interface;
512. an input device;
514. an output device;
516. a presentation device;
518. a graphical user interface;
520. a network interface;
522. a communication link;
524. a communication bus.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.
Embodiments of the present specification relate to cluster anomaly monitoring techniques, where a cluster refers to a server cluster. A server cluster is a group of servers that together provide the same service and appear to a client as a single server. A cluster can use multiple computers for parallel computation to obtain high computing speed, and can also use multiple computers for backup, so that the whole system still operates normally if any one machine fails.
In the conventional technology, cluster alarms are generally limited to alarms based on the abnormality of a single server (i.e., a single node): when the system detects that a node in the cluster is abnormal, it directly reports the abnormal node to an alarm module, which raises an alarm. In fact, in many cases the abnormality of a single node does not necessarily affect the normal service of the cluster. The conventional technology therefore tends to generate a large number of redundant alarms, which places a heavy burden on cluster operation and maintenance.
In view of this, in order to improve the accuracy of cluster anomaly monitoring and reduce the cluster operation and maintenance burden, the present specification provides an improved cluster multidimensional abnormality monitoring method that can be applied to any suitable computing device. Referring to fig. 1, in some embodiments of the present specification, the cluster multidimensional abnormality monitoring method may include the following steps:
S101, obtaining cluster performance information and node state information of each node in the cluster.
S102, generating a first monitoring sub-result according to the cluster performance information, and generating a second monitoring sub-result according to the node state information.
S103, determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
In the embodiment of the description, abnormality monitoring is not performed only based on single node information, and cluster performance information is also considered, that is, the embodiment of the description integrates node state information and cluster performance information to perform abnormality monitoring, so that the accuracy of cluster abnormality monitoring is improved, the probability of a cluster generating a large amount of redundant alarm information is reduced, and the cluster operation and maintenance pressure and cost are reduced.
The node state information of each node in the cluster mainly reflects the individual state of the corresponding single node, whereas the cluster performance information reflects, to a certain extent, the overall performance of the cluster. Therefore, in order to monitor the cluster for abnormalities more comprehensively and accurately, in the embodiment of the present specification, not only the node state information of each node in the cluster but also the cluster performance information may be acquired.
In an embodiment of this specification, the node state information may include, for example, information such as a node identifier, a node IP, a CPU utilization, a memory utilization, a process state, an IO performance, and a storage space state. Wherein a node identification (e.g., a node number, etc.) may be used to uniquely identify a node in the cluster; the node IP is the IP address of the node; the CPU utilization, the memory utilization, the process state, the IO performance, and the storage space state correspond to a current CPU utilization, a memory utilization, a process state (for example, running state process number, ready state process number, blocking state process number, etc.), an IO performance (for example, disk IO performance, network IO performance, etc.), and a storage space state of the node, respectively.
In an embodiment of the present specification, the cluster performance information may include, for example, throughput, response time, and other indicators. The cluster performance information may well reflect the overall online performance of the cluster. The throughput may be, for example, a transaction amount. The response time refers to the response time of the cluster, i.e., the average of the response times of all nodes within the cluster.
Those skilled in the art will appreciate that the node state information and the cluster performance information are only exemplary, and in other embodiments of the present disclosure, the node state information may further include more or less information, which is not limited in the present disclosure and may be specifically selected according to needs.
In some embodiments of this specification, each node in the cluster may upload its own node status information to a specified message queue in the form of a message at regular time (e.g., every 5 seconds, every 10 seconds, etc.). Correspondingly, the node state information of each node in the cluster can be obtained by reading and processing the message from the message queue. In an embodiment of this specification, the cluster performance information may be collected by a script program or other tools.
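As an illustration of the timed status upload described above, the following is a minimal Python sketch. It uses an in-process `queue.Queue` as a stand-in for the message queue; the class name `NodeStatusMessage`, the helper names, the field names, and the `timestamp` field are assumptions made for this example and are not taken from the specification.

```python
import json
import queue
from dataclasses import dataclass, asdict


@dataclass
class NodeStatusMessage:
    """One node status report; field names are illustrative assumptions."""
    node_id: str
    node_ip: str
    cpu_utilization: float      # fraction in [0, 1]
    memory_utilization: float   # fraction in [0, 1]
    process_state: dict         # e.g. counts of running/ready/blocked processes
    io_performance: dict        # e.g. disk and network IO figures
    storage_space_state: dict   # e.g. used and total space
    timestamp: float            # time the report was produced (assumed field)


def publish_status(mq: queue.Queue, msg: NodeStatusMessage) -> None:
    """Serialize the report and put it on the in-process stand-in queue."""
    mq.put(json.dumps(asdict(msg)))


def read_status(mq: queue.Queue, timeout: float):
    """Read one report, or return None if nothing arrives within `timeout` seconds."""
    try:
        return NodeStatusMessage(**json.loads(mq.get(timeout=timeout)))
    except queue.Empty:
        return None
```

A reader that receives `None` from `read_status` knows that no report arrived within the timeout, which is the condition the later steps use to treat a node as abnormal.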
Referring to fig. 2, in some embodiments of the present specification, the generating the first monitoring sub-result according to the cluster performance information may include:
S201, processing the cluster performance information according to a preset classification algorithm to obtain the number of abnormal data points in first designated time.
The advantage of processing the cluster performance information through the classification algorithm is that whether the cluster performance is abnormal can be identified by means of the relatively accurate classification capability of the classification algorithm, thereby being beneficial to improving the accuracy of cluster performance identification.
The preset classification algorithm may be any suitable classification algorithm, which is not limited in this specification and may be selected as needed. For example, in an embodiment of the present specification, the preset classification algorithm may be the k-nearest neighbor (KNN) algorithm or the like. The KNN algorithm in the embodiments of the present specification includes a KNN model trainer and a KNN classifier. The KNN model trainer may be trained on historical cluster performance information to generate the KNN classifier (including determining the K value of the KNN algorithm). Cluster performance information acquired within a sampling duration (for example, 30 seconds) can be classified by the current KNN classifier to judge whether abnormal data points exist (for example, an abnormal request quantity, an abnormal response time, and the like); if so, the number of abnormal data points within the sampling duration is recorded. In addition, the KNN classifier can be dynamically updated with the daily incremental cluster performance data, which further improves its classification accuracy. A sampling duration contains a plurality of sampling points; for example, at one sample every 5 seconds, a 30-second sampling duration contains 6 samples.
In some embodiments of the present specification, a data analysis list ExceptionKnn { } may also be built in (the KNN classification algorithm is taken as an example here). ExceptionKnn { } may be used to store, for a single sampling duration (e.g., 30 seconds), the cluster performance data points, collected every 1 second, that deviate from the KNN classifier, i.e., the abnormal data points. In addition, data that falls outside the current sampling duration may be cleaned (i.e., removed from ExceptionKnn { }).
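The classification step can be sketched in plain Python as follows. This is a generic k-nearest-neighbor majority vote, not the specification's trainer or classifier; the function names, the two features (throughput and response time), the labels, and the sample history are all hypothetical.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, sample, k=3):
    """Label `sample` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], sample),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def count_abnormal(train_points, train_labels, window, k=3):
    """Count the samples in one sampling window that are classified as abnormal."""
    return sum(
        1 for sample in window
        if knn_classify(train_points, train_labels, sample, k) == "abnormal"
    )


# Hypothetical history: (throughput per second, response time in ms).
history = [(100, 20), (105, 22), (98, 19), (102, 21), (10, 900), (12, 850), (8, 950)]
labels = ["normal", "normal", "normal", "normal", "abnormal", "abnormal", "abnormal"]
```

In practice the features would likely need scaling so that one indicator does not dominate the distance, and K would be chosen during training, as the paragraph above notes.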
S202, judging whether the number of the abnormal data points is larger than a set number threshold value.
In the embodiment of the present specification, the number of abnormal data points may be counted periodically. For example, in an embodiment of the present specification, the number of elements in ExceptionKnn { } may be counted every 5 seconds (i.e., ExceptionKnn.size()) and compared with the set number threshold; if the number is greater than the number threshold, step S203 may be performed; otherwise, the next round of counting and judgment can be carried out.
S203, if the number of the abnormal data points is larger than the number threshold, generating a first monitoring sub-result according to the number of the abnormal data points.
For example, in an embodiment of the present specification, if ExceptionKnn.size() > 5 (5 is used here as the number threshold and can be adjusted according to actual production), that is, more than 5 abnormal data points of the cluster occur within a single sampling duration, it is determined that the cluster is at considerable risk of abnormality. Generating the first monitoring sub-result accordingly may refer to taking the probability of abnormal data points within the corresponding single sampling duration as the first monitoring sub-result. For example, if the number of data points in a single sampling duration is 60 and 5 of them are abnormal, the probability of an abnormal data point is $5/60 \approx 8.3\%$.
In addition, cluster KNN alarm information is triggered and labeled AlarmKnn.
Thus, by the method shown in fig. 2, a relatively accurate cluster performance monitoring result (i.e., the first monitoring sub-result) may be obtained.
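Steps S201 to S203 can be sketched as a sliding window plus a threshold check. The class below is only a stand-in for the ExceptionKnn list; the method names, the default threshold of 5, and the choice to return 0.0 when the threshold is not exceeded are assumptions for illustration.

```python
from collections import deque


class ExceptionKnn:
    """Sliding window of abnormal data points; a stand-in for the ExceptionKnn list."""

    def __init__(self, sampling_duration=30.0):
        self.sampling_duration = sampling_duration
        self._points = deque()  # (timestamp, data point), oldest first

    def add(self, timestamp, point):
        """Record a data point that the classifier flagged as abnormal."""
        self._points.append((timestamp, point))

    def clean(self, now):
        """Drop points that fall outside the current sampling duration."""
        while self._points and now - self._points[0][0] > self.sampling_duration:
            self._points.popleft()

    def size(self):
        return len(self._points)


def first_sub_result(window, now, points_per_window, count_threshold=5):
    """Return (A1, AlarmKnn): the abnormal-point ratio and an alarm flag.

    A1 is 0.0 unless the abnormal count exceeds the threshold (an assumption).
    """
    window.clean(now)
    count = window.size()
    if count > count_threshold:
        return count / points_per_window, 1
    return 0.0, 0
```

Calling `first_sub_result` on a timer (for example every 5 seconds) corresponds to the periodic counting described for S202.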
Referring to fig. 3, in some embodiments of the present specification, the generating the second monitor sub-result according to the node status information may include:
S301, determining the single-node abnormal rate in the first designated time according to the node state information.
In order to improve the data processing efficiency, in some embodiments of the present specification, one cluster information synchronization process and multiple message processing processes may be started. The cluster information synchronization process is used for recording and updating node identifications and node IP information of all nodes of the current cluster, and the number of the nodes in the cluster, the serial number of each node and the IP of each node can be rapidly known through the cluster information synchronization process; meanwhile, the cluster information synchronization process can also be responsible for summarizing the abnormal information of the nodes in each sampling duration for subsequent processing. The message processing processes can be used for reading the messages in the message queue and processing the messages according to preset processing rules.
For example, in an embodiment of the present specification, the specific processing logic of the message processing process may be: if the message of a certain node is not read from the message queue within the current sampling duration, marking the node as abnormal; if a message of a node can be read from the message queue within 5 seconds, but some index in the message is abnormal (for example, the CPU utilization exceeds a threshold value, etc.), the node may also be marked as abnormal. If the abnormal node in the cluster is marked after the message processing process is processed, the cluster information synchronization process collects the abnormal information and calculates the single-node abnormal rate according to the abnormal information so as to facilitate the subsequent processing. Wherein, the single-node abnormal rate in the first designated time means: the anomaly rate of each node within one sample duration.
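The two marking rules described above (no message received, or a message with an out-of-range indicator) can be sketched as follows. The message field names, the thresholds of 0.9, and the definition of the single-node abnormal rate as a fraction of checks are assumptions; the specification does not state how that rate is computed.

```python
def find_abnormal_nodes(known_nodes, messages, cpu_threshold=0.9, memory_threshold=0.9):
    """Return the set of node ids considered abnormal for one sampling window.

    known_nodes: ids of all nodes currently registered in the cluster.
    messages: list of dicts read from the queue during the window.
    """
    seen = set()
    abnormal = set()
    for msg in messages:
        node_id = msg["node_id"]
        seen.add(node_id)
        # Rule 2: a message arrived but one of its indicators is out of range.
        if (msg["cpu_utilization"] > cpu_threshold
                or msg["memory_utilization"] > memory_threshold):
            abnormal.add(node_id)
    # Rule 1: no message at all arrived from the node during the window.
    abnormal |= set(known_nodes) - seen
    return abnormal


def single_node_abnormal_rate(abnormal_checks, total_checks):
    """Assumed definition: fraction of checks in which a node was abnormal."""
    return abnormal_checks / total_checks if total_checks else 0.0
```

The `known_nodes` argument plays the role of the node list kept by the cluster information synchronization process.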
In some embodiments of the present specification, a data analysis list ExceptionSingleList { } may also be built in. ExceptionSingleList { } is used to store predefined abnormal node description information (i.e., it defines what counts as an abnormal node), so that the single-node abnormal rate can be determined according to this definition.
S302, inputting the single-node abnormal rate into the formula
[Formula image BDA0003089541050000081; not reproduced in this text]
to acquire the cluster node abnormal rate.
In the formula of this step, $r$ is the cluster node abnormal rate, $M$ is the total number of abnormal nodes within the first designated time, $a_n$ is the weight coefficient of the $n$-th abnormal node within the first designated time, and $b_n$ is the single-node abnormal rate of the $n$-th abnormal node within the first designated time.
In some embodiments of the present specification, among the weight coefficients, when $n = 1$, $a_1$ is determined according to the formula
[Formula image BDA0003089541050000082; not reproduced in this text]
when $n \geq 2$, $a_n$ is determined according to the formula $a_n = a_1 \times q^{n-1}$; wherein $a_1$ is the weight coefficient of the 1st abnormal node within the first designated time, $q$ is a set value, and $q > 1$. It can be seen that, in this case, the more recent the sampling time, the larger the weight. The advantage of this weighting is that it reduces the influence of the earliest acquisition points (because later automatic recovery is possible; for example, occasional abnormal jitter can recover on its own), so that abnormalities that recover automatically can largely be ignored and false alarms are reduced.
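The geometric weighting can be illustrated with a short Python sketch. Two points here are assumptions, because the corresponding formula images are not reproduced in the text: that the cluster node abnormal rate is the weighted sum of the single-node rates (r = sum of a_n × b_n for n = 1..M), and that a_1 is chosen so the weights sum to 1. Only the relation a_n = a_1 × q^(n-1) with q > 1 comes from the text; the function names are hypothetical.

```python
def geometric_weights(m, q, a1=None):
    """Weights a_n = a_1 * q**(n - 1) for n = 1..m, with q > 1.

    If a1 is not given, it is chosen so that the weights sum to 1. That
    normalization is an assumption; the source's formula for a_1 is not
    reproduced in the text.
    """
    if m < 1 or q <= 1:
        raise ValueError("need m >= 1 and q > 1")
    if a1 is None:
        a1 = (q - 1) / (q**m - 1)
    return [a1 * q ** (n - 1) for n in range(1, m + 1)]


def cluster_abnormal_rate(single_node_rates, q):
    """Assumed form r = sum(a_n * b_n) over the abnormal nodes, oldest first."""
    if not single_node_rates:
        return 0.0
    weights = geometric_weights(len(single_node_rates), q)
    return sum(a * b for a, b in zip(weights, single_node_rates))
```

Because q > 1, later entries receive larger weights, which matches the stated intent of discounting early anomalies that may have recovered on their own.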
S303, judging whether the abnormal rate of the cluster nodes is larger than a set abnormal rate threshold value. If the cluster node abnormal rate is greater than the abnormal rate threshold, executing step S304; otherwise, the next calculation can be performed.
S304, if the cluster node abnormal rate is larger than the abnormal rate threshold value, generating a second monitoring sub-result.
In this embodiment of the present specification, generating the second monitoring sub-result may refer to taking the cluster node anomaly rate r at this time as the second monitoring sub-result, and may also trigger generating node anomaly alarm information, and mark the generated node anomaly alarm information as AlarmSingle. Thus, by the method shown in fig. 3, a relatively accurate node state monitoring result (i.e., the second monitoring sub-result) can be obtained.
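Steps S303 and S304 reduce to a single threshold comparison, sketched below. The function name, the default threshold of 0.3, and the return of 0.0 when the threshold is not exceeded are assumptions for illustration.

```python
def second_sub_result(cluster_rate, rate_threshold=0.3):
    """Return (A2, AlarmSingle) for a cluster node abnormal rate r.

    A2 is r itself when r exceeds the threshold, otherwise 0.0 (assumed).
    """
    if cluster_rate > rate_threshold:
        return cluster_rate, 1
    return 0.0, 0
```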
In some embodiments of the present specification, the determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result may include: determining an abnormal monitoring value of the cluster according to the formula
[Formula image BDA0003089541050000083; not reproduced in this text]
and determining an abnormal monitoring result of the cluster according to the abnormal monitoring value, wherein $F$ is the abnormal monitoring value of the cluster, $A_1$ is the first monitoring sub-result, and $A_2$ is the second monitoring sub-result. From this formula it can be seen that the abnormal monitoring result of the cluster comprehensively considers the first monitoring sub-result (i.e., the cluster performance monitoring condition) and the second monitoring sub-result (i.e., the single-node state monitoring condition), and that the coefficient of the first monitoring sub-result is greater than that of the second monitoring sub-result; that is, cluster performance is more important than node state in cluster anomaly monitoring and is therefore given a higher coefficient.
In some embodiments of this specification, the determining an abnormal monitoring result of the cluster according to the abnormal monitoring value may include: when F is larger than 1, outputting a cluster high-level alarm; when F is more than 0 and less than 1, outputting a cluster low-level alarm; when F is 0, the cluster has no alarm output. Multi-level anomaly early warning (for example, as shown in Table 1 below) is thus realized according to the severity of the abnormal condition, which improves the accuracy of cluster anomaly monitoring.
TABLE 1
AlarmSingle    AlarmKnn    Alarm rating
1              0           Low
1              1           High
0              1           High
0              0           Normal
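The alarm grading can be sketched in two ways: from the abnormal monitoring value F using the stated thresholds, or from the two alarm flags using Table 1, reading the table's ratings as low, high, and normal. The function and constant names are hypothetical. The text does not say what happens when F equals exactly 1, so the sketch raises an error for that case rather than guessing.

```python
def alarm_level_from_value(f):
    """Map the abnormal monitoring value F to an alarm level."""
    if f > 1:
        return "high"
    if 0 < f < 1:
        return "low"
    if f == 0:
        return "normal"
    # F == 1 (or a negative value) is not covered by the stated rules.
    raise ValueError(f"no alarm level defined for F={f}")


# Table 1 as a lookup keyed by (AlarmSingle, AlarmKnn).
ALARM_TABLE = {
    (1, 0): "low",
    (1, 1): "high",
    (0, 1): "high",
    (0, 0): "normal",
}


def alarm_level_from_flags(alarm_single, alarm_knn):
    """Look up the alarm rating for a pair of alarm flags."""
    return ALARM_TABLE[(alarm_single, alarm_knn)]
```

The table makes the asymmetry explicit: a KNN (cluster performance) alarm alone already yields a high rating, while a single-node alarm alone yields only a low one.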
While the process flows described above include operations that occur in a particular order, it should be appreciated that the processes may include more or fewer operations, which may be performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).
Corresponding to the method above, this specification further provides an embodiment of a cluster multi-dimensional anomaly monitoring apparatus. Referring to FIG. 4, in some embodiments of this specification, the cluster multi-dimensional anomaly monitoring apparatus may include:
an obtaining module 41, which may be configured to obtain cluster performance information and node state information of each node in the cluster;
a generating module 42, which may be configured to generate a first monitoring sub-result according to the cluster performance information, and generate a second monitoring sub-result according to the node state information;
a determining module 43, which may be configured to determine an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
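One way to picture how the three modules fit together is the following Python sketch. The class name `ClusterAnomalyMonitor`, the callable signatures, and the stub logic are all assumptions made for illustration; the specification does not prescribe any particular interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class ClusterAnomalyMonitor:
    """Hypothetical wiring of the obtaining, generating, and determining modules."""

    obtain: Callable[[], Tuple[Any, Any]]            # -> (cluster performance info, node state info)
    generate: Callable[[Any, Any], Tuple[Any, Any]]  # -> (first sub-result, second sub-result)
    determine: Callable[[Any, Any], Any]             # -> abnormal monitoring result

    def run_once(self) -> Any:
        performance, node_states = self.obtain()
        first, second = self.generate(performance, node_states)
        return self.determine(first, second)


# Stub modules, purely for illustration; real modules would query the cluster.
monitor = ClusterAnomalyMonitor(
    obtain=lambda: ([0.2, 0.9], {"node-1": "ok"}),
    generate=lambda perf, nodes: (int(max(perf) > 0.8), int("down" in nodes.values())),
    determine=lambda first, second: "alarm" if first or second else "normal",
)
print(monitor.run_once())
```

Keeping each module behind a plain callable mirrors the functional division of the apparatus while leaving the implementation of each step open.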
In some apparatus embodiments of the present description, the generating a first monitoring sub-result according to the cluster performance information includes:
processing the cluster performance information according to a preset classification algorithm to obtain the number of abnormal data points in a first specified time;
judging whether the number of the abnormal data points is larger than a set number threshold value or not;
and if the number of the abnormal data points is larger than the number threshold, generating a first monitoring sub-result according to the number of the abnormal data points.
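The steps above can be sketched with a small k-nearest-neighbor classifier. This is a hedged illustration only: the labeled history, the label strings, the value of k, and the choice to encode the first monitoring sub-result as 1 or 0 are assumptions, since the text states only that the sub-result is generated according to the number of abnormal data points.

```python
from collections import Counter
from math import dist


def knn_label(point, labeled, k=3):
    """Return the majority label among the k labeled samples closest to `point`."""
    nearest = sorted(labeled, key=lambda sample: dist(point, sample[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def first_sub_result(window, labeled, count_threshold, k=3):
    """Count abnormal points in the window and compare with the number threshold.

    Assumption: the sub-result is a binary flag (1 = alarm, 0 = no alarm).
    """
    abnormal = sum(1 for p in window if knn_label(p, labeled, k) == "abnormal")
    return 1 if abnormal > count_threshold else 0


# Hypothetical labeled history: (feature vector, label).
history = [
    ((0.10, 0.10), "normal"), ((0.20, 0.10), "normal"), ((0.15, 0.20), "normal"),
    ((0.90, 0.95), "abnormal"), ((0.95, 0.90), "abnormal"), ((0.85, 0.90), "abnormal"),
]
# Data points observed within the first specified time.
window = [(0.90, 0.90), (0.92, 0.88), (0.10, 0.15)]
print(first_sub_result(window, history, count_threshold=1))
```

With an odd k and two labels the majority vote cannot tie; other settings would need an explicit tie-breaking rule.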
In some device embodiments of this specification, the generating a second monitor sub-result according to the node state information includes:
determining the single-node abnormal rate within a first designated time according to the node state information;
inputting the single-node abnormal rate into the formula
Figure BDA0003089541050000101
to obtain the cluster node abnormal rate;
judging whether the abnormal rate of the cluster nodes is greater than a set abnormal rate threshold value or not;
if the cluster node abnormal rate is larger than the abnormal rate threshold value, generating a second monitoring sub-result according to the abnormal rate threshold value;
wherein r is the cluster node abnormal rate, M is the total number of abnormal nodes within the first designated time, a_n is the weight coefficient of the nth abnormal node within the first designated time, and b_n is the single-node abnormal rate of the nth abnormal node within the first designated time.
In some apparatus embodiments of this specification, for the weight coefficients, when n is 1, a_1 is determined according to the formula
Figure BDA0003089541050000102
and when n is greater than or equal to 2, a_n is determined according to the formula a_n = a_1 × q^(n-1); wherein a_1 is the weight coefficient of the 1st abnormal node within the first designated time, q is a set value, and q is greater than 1.
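The geometric rule a_n = a_1 × q^(n-1) for the weight coefficients can be illustrated with a short Python sketch. The function name `geometric_weights` is hypothetical, and because the formula for a_1 appears only as a figure, a_1 is passed in as a parameter rather than computed.

```python
from typing import List


def geometric_weights(a1: float, q: float, m: int) -> List[float]:
    """Return [a_1, ..., a_M] with a_n = a_1 * q**(n - 1).

    a1 is supplied by the caller (its own formula is not reproduced here);
    q must be greater than 1, as stated in the text.
    """
    if q <= 1:
        raise ValueError("q must be greater than 1")
    if m < 0:
        raise ValueError("m must be non-negative")
    return [a1 * q ** (n - 1) for n in range(1, m + 1)]


print(geometric_weights(1.0, 2.0, 4))
```

Because q is greater than 1, each later abnormal node in the ordering receives a larger weight than the one before it.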
In some apparatus embodiments of the present description, the determining an abnormal rate of a single node within a first specified time according to the node status information includes:
reading a node state message of a target node from a message queue; the node state message is sent to the message queue by each node at regular time;
judging whether a node state message of the target node is read from the message queue within a second designated time or not;
when no node state message of the target node is read from the message queue within the second designated time, or a node state message of the target node containing abnormal indicator data is read within the second designated time, confirming the target node as an abnormal node;
and determining the single-node abnormal rate within the first designated time according to the abnormal node.
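The abnormal-node check described above can be sketched as follows. This is an illustration under stated assumptions: the `ReceivedMessage` type, its field names, and the pre-computed `indicator_abnormal` flag are hypothetical, and the messages are assumed to have already been read from the message queue into a list.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ReceivedMessage:
    node_id: str
    received_at: float        # time the message was read, in seconds
    indicator_abnormal: bool  # True if the message carries abnormal indicator data


def is_abnormal_node(node_id: str, messages: Iterable[ReceivedMessage],
                     now: float, second_designated_time: float) -> bool:
    """A node is abnormal if it sent no message within the window,
    or if any message within the window reports abnormal indicator data."""
    recent = [
        m for m in messages
        if m.node_id == node_id and 0 <= now - m.received_at <= second_designated_time
    ]
    if not recent:
        return True  # heartbeat missing within the second designated time
    return any(m.indicator_abnormal for m in recent)


messages = [
    ReceivedMessage("node-a", received_at=95.0, indicator_abnormal=False),
    ReceivedMessage("node-b", received_at=97.0, indicator_abnormal=True),
    ReceivedMessage("node-c", received_at=10.0, indicator_abnormal=False),
]
for node in ("node-a", "node-b", "node-c"):
    print(node, is_abnormal_node(node, messages, now=100.0, second_designated_time=30.0))
```

Treating a missing message the same way as an abnormal one lets a silent node failure surface without any extra probing.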
In some device embodiments of this specification, the determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result includes:
according to the formula
Figure BDA0003089541050000103
determining an abnormal monitoring value of the cluster;
determining an abnormal monitoring result of the cluster according to the abnormal monitoring value;
wherein F is the abnormal monitoring result of the cluster, A_1 is the first monitoring sub-result, and A_2 is the second monitoring sub-result.
In some device embodiments of this specification, the determining an anomaly monitoring result of the cluster according to the anomaly monitoring value includes:
when F is larger than 1, outputting a cluster high-level alarm;
when F is more than 0 and less than 1, outputting a cluster low-level alarm;
when F is 0, the cluster has no alarm output.
In some apparatus embodiments of the present description, each message in the message queue includes: node identification, node IP, CPU utilization, memory utilization, process state, IO performance and storage space state.
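The message fields listed above could be carried in a structure like the following Python sketch. The field names, types, units, and the JSON encoding are assumptions chosen for illustration; the text names the fields but does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class NodeStatusMessage:
    node_id: str
    node_ip: str
    cpu_utilization: float      # assumed to be a fraction in [0, 1]
    memory_utilization: float   # assumed to be a fraction in [0, 1]
    process_state: str          # hypothetical encoding, e.g. "running"
    io_performance: float       # hypothetical unit, e.g. MB/s
    storage_space_state: str    # hypothetical encoding, e.g. "ok"

    def to_json(self) -> str:
        """Serialize the message for sending to the message queue."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "NodeStatusMessage":
        """Rebuild a message read from the message queue."""
        return cls(**json.loads(payload))


sample = NodeStatusMessage("node-1", "10.0.0.1", 0.42, 0.63, "running", 120.5, "ok")
print(sample.to_json())
```

A flat, self-describing payload keeps the consumer side simple: each node can publish the same structure on a timer, and the monitor can decode it without per-node configuration.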
In some apparatus embodiments herein, the classification algorithm may include a KNN algorithm.
For convenience of description, the above apparatus is described as being divided into various units by function, and the units are described separately. Of course, when implementing this specification, the functions of the various units may be implemented in one or more pieces of software and/or hardware.
Embodiments of this specification also provide a computer device. As shown in FIG. 5, in some embodiments of this specification, the computer device 502 may include one or more processors 504, such as one or more central processing units (CPUs) or graphics processing units (GPUs), each of which may implement one or more hardware threads. The computer device 502 may also include any memory 506 for storing any kind of information, such as code, settings, and data. In a specific embodiment, a computer program is stored on the memory 506 and runnable on the processor 504, and when executed by the processor 504, the computer program may execute the instructions of the cluster multi-dimensional anomaly monitoring method according to any of the above embodiments. For example, and without limitation, the memory 506 may include any one or a combination of the following: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may use any technology to store information. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of the computer device 502. In one case, when the processor 504 executes associated instructions stored in any memory or combination of memories, the computer device 502 can perform any of the operations of the associated instructions. The computer device 502 also includes one or more drive mechanisms 508, such as a hard disk drive mechanism or an optical disk drive mechanism, for interacting with any memory.
Computer device 502 may also include an input/output (I/O) interface 510 for receiving various inputs (via input device 512) and for providing various outputs (via output device 514). One particular output mechanism may include a presentation device 516 and an associated graphical user interface (GUI) 518. In other embodiments, the input/output (I/O) interface 510, input device 512, and output device 514 may be omitted, and the computer device 502 may serve merely as one computer device in a network. Computer device 502 can also include one or more network interfaces 520 for exchanging data with other devices via one or more communication links 522. One or more communication buses 524 couple the above-described components together.
Communication link 522 may be implemented in any manner, such as through a local area network, a wide area network (e.g., the Internet), a point-to-point connection, etc., or any combination thereof. Communication link 522 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products of some embodiments of the specification. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processor to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processor, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processor to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processor to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computer device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The embodiments of this specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The described embodiments may also be practiced in distributed computing environments where tasks are performed by remote processors that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiment is substantially similar to the method embodiment, its description is brief, and for the relevant points reference may be made to the corresponding description of the method embodiment. In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of this specification. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, provided there is no contradiction, those skilled in the art can combine the various embodiments or examples described in this specification and the features of different embodiments or examples.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (12)

1. A cluster multi-dimensional anomaly monitoring method is characterized by comprising the following steps:
acquiring cluster performance information and node state information of each node in a cluster;
generating a first monitoring sub-result according to the cluster performance information, and generating a second monitoring sub-result according to the node state information;
and determining an abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
2. The method for monitoring the multi-dimensional abnormality of the cluster according to claim 1, wherein the generating a first monitoring sub-result according to the cluster performance information includes:
processing the cluster performance information according to a preset classification algorithm to obtain the number of abnormal data points in a first specified time;
judging whether the number of the abnormal data points is larger than a set number threshold value or not;
and if the number of the abnormal data points is larger than the number threshold, generating a first monitoring sub-result according to the number of the abnormal data points.
3. The method for monitoring the multi-dimensional abnormality of the cluster according to claim 1, wherein the generating of the second monitoring sub-result according to the node status information includes:
determining the single-node abnormal rate within a first designated time according to the node state information;
inputting the single-node abnormal rate into the formula
Figure FDA0003089541040000011
to obtain the cluster node abnormal rate;
judging whether the abnormal rate of the cluster nodes is greater than a set abnormal rate threshold value or not;
if the cluster node abnormal rate is larger than the abnormal rate threshold value, generating a second monitoring sub-result according to the abnormal rate threshold value;
wherein r is the cluster node abnormal rate, M is the total number of abnormal nodes within the first designated time, a_n is the weight coefficient of the nth abnormal node within the first designated time, and b_n is the single-node abnormal rate of the nth abnormal node within the first designated time.
4. The cluster multidimensional abnormality monitoring method according to claim 3, wherein, for the weight coefficients, when n is 1, a_1 is determined according to the formula
Figure FDA0003089541040000012
and when n is greater than or equal to 2, a_n is determined according to the formula a_n = a_1 × q^(n-1); wherein a_1 is the weight coefficient of the 1st abnormal node within the first designated time, q is a set value, and q is greater than 1.
5. The cluster multidimensional abnormality monitoring method of claim 3, wherein the determining the single-node abnormality rate within a first specified time according to the node state information comprises:
reading a node state message of a target node from a message queue; the node state message is sent to the message queue by each node at regular time;
judging whether a node state message of the target node is read from the message queue within a second designated time or not;
when no node state message of the target node is read from the message queue within the second designated time, or a node state message of the target node containing abnormal indicator data is read within the second designated time, confirming the target node as an abnormal node;
and determining the single-node abnormal rate within the first designated time according to the abnormal node.
6. The method for monitoring the multi-dimensional abnormality of the cluster according to claim 1, wherein the determining the abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result includes:
according to the formula
Figure FDA0003089541040000021
determining an abnormal monitoring value of the cluster;
determining an abnormal monitoring result of the cluster according to the abnormal monitoring value;
wherein F is the abnormal monitoring result of the cluster, A_1 is the first monitoring sub-result, and A_2 is the second monitoring sub-result.
7. The method for monitoring the multi-dimensional abnormality of the cluster according to claim 6, wherein the determining the abnormality monitoring result of the cluster according to the abnormality monitoring value includes:
when F is larger than 1, outputting a cluster high-level alarm;
when F is more than 0 and less than 1, outputting a cluster low-level alarm;
when F is 0, the cluster has no alarm output.
8. The method for monitoring cluster multidimensional abnormality of claim 5, wherein each message in the message queue comprises: node identification, node IP, CPU utilization, memory utilization, process state, IO performance and storage space state.
9. The method for cluster multidimensional anomaly monitoring of claim 2, wherein said classification algorithm comprises a nearest neighbor algorithm.
10. A cluster multidimensional abnormality monitoring device, comprising:
the acquisition module is used for acquiring cluster performance information and node state information of each node in the cluster;
the generating module is used for generating a first monitoring sub-result according to the cluster performance information and generating a second monitoring sub-result according to the node state information;
and the determining module is used for determining the abnormal monitoring result of the cluster according to the first monitoring sub-result and the second monitoring sub-result.
11. A computer device comprising a memory, a processor, and a computer program stored on the memory, wherein the computer program, when executed by the processor, performs the instructions of the method of any one of claims 1-9.
12. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor of a computer device, executes instructions of a method according to any one of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110591793.9A CN113220534A (en) 2021-05-28 2021-05-28 Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113220534A true CN113220534A (en) 2021-08-06

Family

ID=77099064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110591793.9A Pending CN113220534A (en) 2021-05-28 2021-05-28 Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113220534A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination