CN113438110B - Cluster performance evaluation method, device, equipment and storage medium - Google Patents

Cluster performance evaluation method, device, equipment and storage medium

Info

Publication number
CN113438110B
Authority
CN
China
Prior art keywords
subsystem
cluster
score
abnormal
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110696929.2A
Other languages
Chinese (zh)
Other versions
CN113438110A (en)
Inventor
王雄斌
王家尧
吕灼恒
张晋锋
原帅
郝文静
王建敏
周军
解文龙
苗海峰
吕益行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dawning Information Industry Beijing Co Ltd
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd filed Critical Dawning Information Industry Beijing Co Ltd
Priority to CN202110696929.2A priority Critical patent/CN113438110B/en
Publication of CN113438110A publication Critical patent/CN113438110A/en
Application granted granted Critical
Publication of CN113438110B publication Critical patent/CN113438110B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for evaluating cluster performance. The method comprises: determining each cluster subsystem in a cluster to be evaluated; determining a quality score of the cluster subsystem based on first abnormal information and/or first available information in the cluster subsystem; determining an availability score of the cluster subsystem based on second abnormal information and/or second available information in the cluster subsystem; determining a performance evaluation result of the cluster based on the quality score and the availability score; and performing operation and maintenance on the cluster based on the performance evaluation result. By scoring the cluster to be evaluated on both quality and availability and deriving the final performance evaluation result from the two scores, the method evaluates the performance of the cluster accurately, makes timely alarms on performance anomalies possible, facilitates effective operation and maintenance of the cluster, and can improve operation and maintenance efficiency.

Description

Cluster performance evaluation method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of internet, in particular to a method, a device, equipment and a storage medium for evaluating cluster performance.
Background
With the continuous expansion of the scale of the internet, the scale of a data center bearing a calculation task is continuously increased; the performance of a cluster formed by a plurality of servers in the data center determines the data processing capacity of the data center; the cluster performance is comprehensively evaluated, and the method has important significance for effective operation and maintenance of the data center.
At present, the performance score of a cluster is generally determined from the percentage of time during which the cluster is able to provide service. The cluster comprises a plurality of nodes, each corresponding to one server or host forming the cluster, and the available-time percentage covers both the available-time percentage of batches of nodes and that of single nodes. However, as the cluster scale continues to expand, the influencing factors among the constituent nodes become more complex. If node available time is the only criterion, the performance score of a node drops only after the node has failed, and the key nodes or paths of the cluster cannot be evaluated accurately. As a result, performance anomalies are not reported in time, the cluster cannot be operated and maintained effectively, and operation and maintenance efficiency is low.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for evaluating cluster performance, which can accurately evaluate the performance of the cluster to be evaluated, make it convenient to raise timely alarms on performance anomalies of the cluster, and enable efficient operation and maintenance of the cluster.
In a first aspect, an embodiment of the present invention provides a method for evaluating cluster performance, including:
determining each cluster subsystem in a cluster to be evaluated;
determining a quality score of the cluster subsystem based on first abnormal information and/or first available information in the cluster subsystem;
determining an availability score of the cluster subsystem based on second abnormal information and/or second available information in the cluster subsystem;
determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result.
Optionally, the determining each cluster subsystem in the cluster to be evaluated includes:
determining nodes which accord with the service types of the clusters to be evaluated, forming a node set, and determining cluster subsystems corresponding to the node set;
the cluster subsystem comprises a computing subsystem, a network subsystem, a storage subsystem, a management subsystem, a software service subsystem and/or a login subsystem.
By adopting the technical scheme, the nodes matched with the service types are determined according to the service types supported by the cluster to be evaluated, the nodes matched with the service types form node sets respectively, and then the corresponding cluster subsystem is determined according to the node sets, so that the accurate acquisition of the cluster subsystem corresponding to the cluster to be evaluated is realized.
Optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, determining a quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem includes:
Figure BDA0003128851730000021
determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem, comprising:
Figure BDA0003128851730000031
wherein $A_{\text{compute/management}}$ represents the quality score of the computing or management subsystem; $B_{\text{compute/management}}$ represents the availability score of the computing or management subsystem; $a_i$ denotes the $i$-th abnormal node for which a node health check alarm exists; $t_i$ denotes the abnormal time corresponding to the $i$-th abnormal node, $i = 1, 2, \ldots, N$, where $N$ is the number of abnormal nodes; the term with subscript $j$ denotes the $j$-th unavailable node for which a DOWN event alarm exists; $t_j$ denotes the unavailable time corresponding to the $j$-th unavailable node, $j = 1, 2, \ldots, R$, where $R$ is the number of unavailable nodes; $M$ is the total number of nodes included in the cluster subsystem; and $T$ is a preset monitoring period.
By adopting the technical scheme, if the cluster subsystem is a computing subsystem or a management subsystem, the quality score and the availability score of the current cluster subsystem are respectively obtained based on the abnormal node information and the unavailable node information of the cluster subsystem, so that the quality score and the availability score of the computing subsystem and the management subsystem can be accurately obtained.
Optionally, if the cluster subsystem is a network subsystem, determining a quality score of the cluster subsystem based on first abnormal information and first available information in the cluster subsystem includes:
$$A_{\text{network}} = A_{\text{down}} \cdot A_{\text{opensm}}$$
determining an availability score of the cluster subsystem based on second abnormal information and second available information in the cluster subsystem includes:
$$B_{\text{network}} = A_{\text{down}} \cdot A_{\text{opensm}}$$
wherein,
Figure BDA0003128851730000032
wherein $A_{\text{network}}$ represents the quality score of the network subsystem; $B_{\text{network}}$ represents the availability score of the network subsystem; $A_{\text{down}}$ represents the network link score of the network subsystem; $A_{\text{opensm}}$ represents the subnet management service score of the network subsystem; $n_i$ denotes the $i$-th failed network link in a network down state; $t_i$ denotes the failure time corresponding to the $i$-th failed network link, $i = 1, 2, \ldots, N$, where $N$ is the number of failed network links; $K$ is the total number of network links included in the cluster subsystem; $c_j$ denotes the service score of the $j$-th subnet management service; and $t_j$ denotes the service available time corresponding to the $j$-th subnet management service, $j = 1, 2, \ldots$
By adopting the technical scheme, if the cluster subsystem is the network subsystem, the corresponding network link score is obtained based on the fault network link information in the network subsystem, the corresponding subnet management service score is obtained based on the subnet management service information, the quality score and the availability score of the network subsystem are obtained based on the network link score and the subnet management service score, and the quality score and the availability score of the network subsystem can be accurately obtained.
Optionally, if the cluster subsystem is a storage subsystem, determining a quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem includes:
Figure BDA0003128851730000041
determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem, comprising:
Figure BDA0003128851730000042
wherein $A_{\text{storage}}$ represents the quality score of the storage subsystem; $B_{\text{storage}}$ represents the availability score of the storage subsystem; $a_h$ denotes the $h$-th node whose alarm type is metadata state abnormal, and $t_h$ the abnormal time corresponding to that node, $h = 1, 2, \ldots, H$, where $H$ is the number of nodes whose alarm type is metadata state abnormal; $b_j$ denotes the $j$-th node whose alarm type is data service state abnormal, and $t_j$ the abnormal time corresponding to that node, $j = 1, 2, \ldots, J$, where $J$ is the number of nodes whose alarm type is data service state abnormal; $c_k$ denotes the $k$-th node whose alarm type is node state abnormal, and $t_k$ the abnormal time corresponding to that node, $k = 1, 2, \ldots, K$, where $K$ is the number of nodes whose alarm type is node state abnormal; $d_l$ denotes the $l$-th node whose alarm type is system data state abnormal, and $t_l$ the abnormal time corresponding to that node, $l = 1, 2, \ldots, L$, where $L$ is the number of nodes whose alarm type is system data state abnormal; $e_n$ denotes the $n$-th node whose alarm type is cluster state abnormal, and $t_n$ the abnormal time corresponding to that node, $n = 1, 2, \ldots$
By adopting the technical scheme, if the cluster subsystem is a storage subsystem, the quality score and the availability score of the cluster subsystem are respectively determined based on the abnormal node information of each alarm type in the storage subsystem, so that the accuracy of the obtained quality score and availability score of the storage subsystem can be improved.
Optionally, if the cluster subsystem is a software service subsystem, determining a quality score of the cluster subsystem based on the first available information in the cluster subsystem includes:
Figure BDA0003128851730000051
determining an availability score for the cluster subsystem based on second available information in the cluster subsystem, comprising:
Figure BDA0003128851730000052
wherein $A_{\text{service}}$ represents the quality score of the software service subsystem; $B_{\text{service}}$ represents the availability score of the software service subsystem; $x_r$ denotes the service score of the $r$-th preset service item; and $t_r$ denotes the service available time corresponding to the $r$-th preset service item, $r = 1, 2, \ldots$
By adopting the technical scheme, if the cluster subsystem is the software service subsystem, the quality score and the availability score of the cluster subsystem are respectively determined based on the service available information of each preset service item in the software service subsystem, so that the quality score and the availability score of the software service subsystem can be accurately acquired.
Optionally, if the cluster subsystem is a login subsystem, determining a quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem includes:
Figure BDA0003128851730000061
determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem, comprising:
Figure BDA0003128851730000062
wherein $A_{\text{login}}$ represents the quality score of the login subsystem; $B_{\text{login}}$ represents the availability score of the login subsystem; $a_i$ denotes the $i$-th abnormal node for which an abnormal alarm exists; $t_i$ denotes the abnormal time corresponding to the $i$-th abnormal node, $i = 1, 2, \ldots, N$, where $N$ is the number of abnormal nodes; $M$ is the total number of nodes included in the cluster subsystem; and $T$ is a preset monitoring period.
By adopting the technical scheme, if the cluster subsystem is the login subsystem, the quality score and the availability score of the cluster subsystem are respectively determined based on the abnormal node information in the cluster subsystem, so that the accurate acquisition of the quality score and the availability score of the login subsystem can be realized, and the accuracy of acquiring the quality score and the availability score is improved.
In a second aspect, an embodiment of the present invention provides an apparatus for evaluating cluster performance, including:
the cluster subsystem determining module is used for determining each cluster subsystem in the cluster to be evaluated;
the quality score acquisition module is used for determining the quality score of the cluster subsystem based on the first abnormal information and/or the first available information in the cluster subsystem;
an availability score obtaining module, configured to determine an availability score of the cluster subsystem based on second anomaly information and/or second available information in the cluster subsystem;
and the performance evaluation result determining module is used for determining the performance evaluation result of the cluster based on the quality score and the availability score and carrying out operation and maintenance on the cluster based on the performance evaluation result.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a storage device to store one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the method for evaluating cluster performance according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for evaluating cluster performance according to any embodiment of the present invention.
According to the technical scheme provided by the embodiment of the invention, each cluster subsystem in the cluster to be evaluated is determined; the quality score of the cluster subsystem is determined based on the first abnormal information and/or the first available information in the cluster subsystem; the availability score of the cluster subsystem is determined based on the second abnormal information and/or the second available information in the cluster subsystem; finally, a performance evaluation result of the cluster is determined based on the quality score and the availability score, and the cluster is operated and maintained based on the performance evaluation result. By scoring the cluster to be evaluated in the two dimensions of quality and availability and deriving the final performance evaluation result from them, the scheme evaluates the performance of the cluster accurately, makes timely alarms on performance anomalies possible, facilitates effective operation and maintenance of the cluster, and can improve operation and maintenance efficiency.
Drawings
Fig. 1 is a flowchart of a method for evaluating cluster performance according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for evaluating cluster performance according to an embodiment of the present invention;
fig. 3 is a block diagram of a structure of an apparatus for evaluating cluster performance according to an embodiment of the present invention;
fig. 4 is a block diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1 is a flowchart of a method for evaluating cluster performance according to an embodiment of the present invention. This embodiment is applicable to accurately evaluating cluster performance according to abnormal information and available information in a cluster. The method may be executed by an apparatus for evaluating cluster performance according to an embodiment of the present invention; the apparatus may be implemented by software and/or hardware and integrated on an electronic device, and the electronic device may be a computer device or a server. As shown in fig. 1, the method specifically includes the following steps:
S110, determining each cluster subsystem in the cluster to be evaluated.
A cluster connects a plurality of servers or hosts so that they jointly provide network services, which improves service processing capacity and meets ever-increasing service requirements; each server or host in the cluster is a node of the cluster. A cluster subsystem is a set of nodes bearing a particular service in the cluster; in the embodiment of the present invention, according to the service type undertaken, the cluster subsystems may include a computing subsystem, a network subsystem, a storage subsystem, a management subsystem, a software service subsystem, and/or a login subsystem.
Specifically, the computing subsystem runs the actual services and represents the actual computing capacity of the cluster; the network subsystem manages each key switch forming the network topology and the corresponding network links; the storage subsystem provides shared storage service externally; the management subsystem deploys and manages the service software of the whole cluster; the software service subsystem provides various distributed services deployed across nodes and generally has the characteristic of high availability; the login subsystem is responsible for the cluster login operations of service clients. By dividing the cluster into different subsystems for different services, confusion among the cluster services is avoided and the services are processed in an orderly manner.
In this embodiment of the present invention, optionally, determining each cluster subsystem in the cluster to be evaluated may include: determining the nodes that match the service types of the cluster to be evaluated, forming node sets, and determining the cluster subsystem corresponding to each node set. It should be noted that a cluster node is responsible for only one service type at any given time. Therefore, when determining the cluster subsystems, the matching nodes can be searched for among all nodes according to the service types supported by the cluster to be evaluated; the nodes corresponding to the same service type are added to the same node set, which yields one node set per service type, and each node set is then taken as one cluster subsystem. In this way the cluster subsystems of the cluster to be evaluated are obtained.
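The grouping step described above can be sketched as follows. This is a minimal illustration only: the node records, the service-type labels and the function name `build_subsystems` are hypothetical and do not come from the patent text.

```python
from collections import defaultdict

def build_subsystems(nodes, supported_types):
    """Group nodes into one node set per supported service type.

    `nodes` is an iterable of (node_id, service_type) pairs (a hypothetical
    record layout). Each node carries exactly one service type at a time,
    so it lands in at most one node set.
    """
    subsystems = defaultdict(list)
    for node_id, service_type in nodes:
        # Only nodes whose service type the cluster supports are kept.
        if service_type in supported_types:
            subsystems[service_type].append(node_id)
    return dict(subsystems)

# Hypothetical inventory for illustration.
nodes = [("n1", "compute"), ("n2", "compute"), ("n3", "storage"),
         ("n4", "login"), ("n5", "unknown")]
supported = {"compute", "network", "storage", "management", "service", "login"}
subsystems = build_subsystems(nodes, supported)
print(subsystems)
```

Each key of the resulting dictionary plays the role of one cluster subsystem, and its value is the node set of that subsystem.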
S120, determining a quality score of the cluster subsystem based on the first abnormal information and/or the first available information in the cluster subsystem.
In the embodiment of the invention, the abnormal information is information describing the abnormal state of a node, and the available information is information describing whether the cluster provides a software service. The abnormal information may include an abnormal node identifier, an abnormality type, an abnormality duration, and the like; the available information may include the identifier of an available software service, the identifiers of the nodes undertaking the software service, the available time of the software service, and the like. It should be noted that the abnormal information and the available information of a cluster subsystem reflect the health condition of that subsystem. Therefore, the abnormal information and the available information of each cluster subsystem within a certain time are obtained and then processed according to the preset calculation rule matched with that subsystem, which yields the quality score and the availability score of each cluster subsystem and allows the performance of the cluster subsystems to be evaluated.
In the embodiment of the invention, the quality score describes the degree of loss of nodes, links and systems. For example, the current system includes N nodes; if one node is abnormal, the system quality is (N-1)/N, and if no node is abnormal, the system quality is 1. The availability score describes the degree to which nodes, links and systems can provide software services. For example, the current system includes N nodes; if none of the N nodes can provide software services, the system availability is 0, and if at least one node can provide software services, the system availability is 1. By acquiring the quality score and the availability score of each cluster subsystem and evaluating the performance of the cluster subsystem comprehensively on the basis of both, the accuracy of the performance evaluation of the cluster subsystems can be improved.
In the embodiment of the invention, the health state of each node or link in the cluster can be detected in real time through internal security programs (such as the IBLINK command and the Node Health Check (NHC) command); when a node is detected to be abnormal, an alarm is generated and the corresponding abnormal information is reported. Meanwhile, the processes of the various services provided by the cluster are monitored in real time to obtain the available information of the services. In this way both the abnormal information and the available information can be obtained.
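The two illustrative cases above can be written out directly. The sketch below covers only the simplified examples given in this paragraph, not the per-subsystem formulas; extending the quality example from one abnormal node to several is an assumption, and both function names are hypothetical.

```python
def example_quality(total_nodes, abnormal_nodes):
    """Share of nodes that are not abnormal: (N - k) / N.

    The text only states the cases k = 0 and k = 1; treating k > 1 the same
    way is an assumption made for this sketch.
    """
    return (total_nodes - abnormal_nodes) / total_nodes

def example_availability(total_nodes, unavailable_nodes):
    """1 if at least one node can still provide software services, else 0."""
    return 1.0 if unavailable_nodes < total_nodes else 0.0

print(example_quality(10, 1))        # one abnormal node out of ten
print(example_availability(10, 9))   # one node still serving
print(example_availability(10, 10))  # no node serving
```

The contrast between the two functions shows why the scores are kept separate: a subsystem can lose quality while remaining fully available.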
In the embodiment of the present invention, optionally, before determining each cluster subsystem in the cluster to be evaluated, a matching calculation rule is preset for different types of cluster subsystems; correspondingly, determining a quality score of the cluster subsystem based on the first anomaly information and/or the first available information in the cluster subsystem may include: determining a system type corresponding to a current cluster subsystem, and acquiring a matched target calculation rule from preset calculation rules according to the current system type; determining the quality score of the current cluster subsystem based on the first abnormal information, the first available information and the target calculation rule of the current cluster; by setting matched calculation rules for different types of cluster subsystems, the calculation rules are more consistent with the actual conditions of the cluster subsystems, and the accuracy of obtaining the quality scores of the cluster subsystems can be improved.
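One way to picture the rule lookup described above is a registry keyed by subsystem type. The sketch below is an assumption about structure only: the registry, the decorator, the type labels and the placeholder rule bodies are all hypothetical, and the real rules are the formulas given for each subsystem.

```python
RULES = {}  # hypothetical registry: subsystem type -> calculation rule

def rule(*system_types):
    """Register one calculation rule for one or more subsystem types."""
    def register(fn):
        for system_type in system_types:
            RULES[system_type] = fn
        return fn
    return register

@rule("compute", "management")
def node_based_rule(abnormal_info, available_info):
    # Placeholder body: the real rule uses abnormal-node information.
    return "node-based rule"

@rule("network")
def network_rule(abnormal_info, available_info):
    # Placeholder body: the real rule combines link and subnet service scores.
    return "network rule"

def select_rule(system_type):
    """Return the preset rule matching the subsystem type."""
    try:
        return RULES[system_type]
    except KeyError:
        raise ValueError(f"no calculation rule preset for {system_type!r}")

print(select_rule("compute")({}, {}))
print(select_rule("network")({}, {}))
```

Registering the rules before any subsystem is determined mirrors the ordering in the text, where the matching rules are preset in advance.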
In this embodiment of the present invention, optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, determining the quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem may include: determining the quality score of the current cluster subsystem according to the number of abnormal nodes, the abnormal time corresponding to each abnormal node, the total number of nodes included in the cluster subsystem, and a preset monitoring period. Optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, the quality score of the cluster subsystem is determined based on the following formula:
Figure BDA0003128851730000111
wherein $A_{\text{compute/management}}$ represents the quality score of the computing or management subsystem; $a_i$ denotes the $i$-th abnormal node for which a node health check alarm exists; $t_i$ denotes the abnormal time corresponding to the $i$-th abnormal node, which is the duration of the alarm, $i = 1, 2, \ldots, N$, where $N$ is the number of abnormal nodes; $M$ is the total number of nodes included in the cluster subsystem, which is fixed when the system is built; and $T$ is a preset monitoring period, for example one month, which can be set flexibly according to task requirements.
It should be noted that, if the cluster subsystem is a computing subsystem, whether a node in the current cluster subsystem is an abnormal node may be determined according to the following two rules. Rule one: if the management network or the computing network of a computing node in the computing subsystem is disconnected, the management network or the computing network generates a corresponding abnormal alarm, and the nodes contained in the abnormal alarm are determined to be abnormal nodes; nodes in the computing subsystem that have been taken off the shelf for maintenance or returned to the factory because of an accident are also regarded as abnormal nodes. Rule two: if a Node Health Check (NHC) operation alarm exists, the corresponding node is determined to be an abnormal node. If the cluster subsystem is a management subsystem, the abnormal nodes are determined according to the abnormal alarm information generated by the management network or the computing network of the management nodes in the management subsystem. In this way the abnormal nodes in the computing subsystem and the management subsystem can be detected accurately.
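For the computing subsystem, the two rules above amount to taking the union of three node sets. The sketch below assumes the alarm sources have already been reduced to plain collections of node identifiers; the function and parameter names are hypothetical.

```python
def abnormal_compute_nodes(network_alarm_nodes, off_shelf_nodes, nhc_alarm_nodes):
    """Collect the abnormal nodes of a computing subsystem.

    Rule one: nodes named in management/computing network alarms, plus nodes
    taken off the shelf for maintenance or factory return.
    Rule two: nodes with a Node Health Check (NHC) alarm.
    A node matching several rules is counted once.
    """
    return set(network_alarm_nodes) | set(off_shelf_nodes) | set(nhc_alarm_nodes)

abnormal = abnormal_compute_nodes(
    network_alarm_nodes=["c03", "c07"],
    off_shelf_nodes=["c11"],
    nhc_alarm_nodes=["c07", "c20"],
)
print(sorted(abnormal))
```

Using a set keeps a node that triggers both a network alarm and an NHC alarm from being counted twice.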
In the embodiment of the invention, if the cluster subsystem is a computing subsystem or a management subsystem, a corresponding abnormal node detection method is adopted to obtain the number of abnormal nodes and abnormal time corresponding to the abnormal nodes; and determining the quality score corresponding to the current cluster subsystem according to the number of the abnormal nodes and the abnormal time corresponding to the abnormal nodes, so that the quality scores corresponding to the computing subsystem and the management subsystem can be accurately obtained.
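The formula itself is given only as an image and is not reproduced in this text, so the following is a sketch under an explicit assumption: that the score is one minus the abnormal node-time divided by the total node-time, 1 - (sum of t_i) / (M * T). This form uses the quantities defined above (abnormal times, M, T) and agrees with the earlier example, where one node abnormal for the whole period gives (M-1)/M, but it is not confirmed by the source. The function name is hypothetical.

```python
def node_quality_score(abnormal_times, total_nodes, period):
    """Assumed form: 1 - sum(t_i) / (M * T).

    abnormal_times: abnormal duration t_i of each abnormal node,
                    in the same unit as `period`.
    total_nodes:    M, total number of nodes in the subsystem.
    period:         T, the preset monitoring period.
    """
    if total_nodes <= 0 or period <= 0:
        raise ValueError("total_nodes and period must be positive")
    # Clamp each duration to the period so one node cannot lose more than T.
    lost = sum(min(t, period) for t in abnormal_times)
    return 1.0 - lost / (total_nodes * period)

# One of ten nodes abnormal for a whole 720-hour period.
print(node_quality_score([720.0], total_nodes=10, period=720.0))
# No abnormal nodes.
print(node_quality_score([], total_nodes=10, period=720.0))
```

Under the same assumption, the availability score would be computed analogously from the unavailable times of the nodes with DOWN event alarms.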
In the embodiment of the present invention, optionally, if the cluster subsystem is a network subsystem, determining the quality score of the cluster subsystem based on the first abnormal information and the first available information in the cluster subsystem may include: acquiring the network link score of the network subsystem according to the number of failed network links, the failure time corresponding to each failed network link, the total number of network links included in the cluster subsystem, and a preset monitoring period; acquiring the subnet management service score of the network subsystem according to the service score of each subnet management service, the service available time corresponding to each subnet management service, the total number of subnet management services provided by the cluster subsystem, and the preset monitoring period; and acquiring the quality score of the network subsystem according to the network link score and the subnet management service score. Optionally, if the cluster subsystem is a network subsystem, the quality score of the cluster subsystem is determined based on the following formula:
$$A_{\text{network}} = A_{\text{down}} \cdot A_{\text{opensm}}$$
wherein,
Figure BDA0003128851730000121
wherein A is network Representing the quality score of the network subsystem, A down Representing the network link score of the network subsystem, A opensm Indicating a subnet management service score, n, for a network subsystem i Indicating the ith failed network link in a network down state, t i Indicating the failure time corresponding to the ith failure network link, i =1, 2.. N, N indicating the number of failure network links, K indicating the total number of network links included in the cluster subsystem, c j Service score, t, representing the jth subnet administrative service j The method includes the steps that service available time corresponding to a jth sub-network management service is represented, j =1, 2.
It should be noted that, if the cluster subsystem is a network subsystem, the quality of the network subsystem is affected by both the condition of its network links and the condition of the subnet management services it provides, so the quality score of the network subsystem consists of two parts: a network link score and a subnet management service score. When determining the network link score, whether a network link has failed is judged according to the following two rules. Rule one: network link detection of the IB (InfiniBand) network is performed through the IBLINK command, and if a problem is detected on a network link, that link is taken as a failed network link. Rule two: failure statistics of a network (e.g., a management network and an Ethernet network) are collected, and each network link included in the failure statistics is regarded as a failed network link. In the embodiment of the invention, network link detection and network failure statistics are performed at preset intervals (for example, one hour), so the detected failure start time of a failed network link carries an error of the preset interval; however, the recorded failure start time and the recorded failure end time carry the same error, so the measured failure duration is not affected.
In the embodiment of the invention, when the service score of a subnet management service (for example, OPENSM) is obtained, it must be taken into account that the subnet management service has a high-availability mechanism, that is, a plurality of nodes provide the same subnet management service, and the service quality is reduced when some of these nodes fail. The service score of each subnet management service is therefore determined according to the number of available nodes and the total number of nodes corresponding to that service. For example, when a plurality of nodes provide the same subnet management service, the service score of the current subnet management service is 1 if all of the nodes are available, 0.5 if one node is unavailable, and 0 if all of the nodes are unavailable. Because the service score of a subnet management service reflects its service quality, obtaining the service score of each subnet management service accurately allows the quality score of the network subsystem to be obtained accurately.
In the embodiment of the invention, if the cluster subsystem is a network subsystem, a fault network link is determined through a preset network link detection rule, and a network link score is determined according to the number of the fault network links and corresponding fault time; meanwhile, determining the grade of the subnet management service according to the available state and the available time of each subnet management service; finally, the quality score of the network subsystem is determined according to the network link score and the subnet management service score, so that the quality score can reflect the network link performance and the subnet management service performance, the accurate acquisition of the quality score of the network subsystem can be realized, and the accuracy of the acquired quality score can be improved.
In this embodiment of the present invention, optionally, if the cluster subsystem is a storage subsystem, determining the quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem may include: determining the quality score of the storage subsystem according to the number of nodes corresponding to each preset alarm type, the alarm duration of each node, the total number of nodes included in the storage subsystem and a preset monitoring period; optionally, if the cluster subsystem is a storage subsystem, determining the quality score of the storage subsystem based on the following formula:
Figure BDA0003128851730000141
wherein $A_{storage}$ represents the quality score of the storage subsystem; $a_h$ represents the h-th node whose alarm type is metadata state abnormality, $t_h$ represents the abnormal time corresponding to that node, h = 1, 2, ..., H, and H represents the number of nodes whose alarm type is metadata state abnormality; $b_j$ represents the j-th node whose alarm type is data service state abnormality, $t_j$ represents the abnormal time corresponding to that node, j = 1, 2, ..., J, and J represents the number of nodes whose alarm type is data service state abnormality; $c_k$ represents the k-th node whose alarm type is node state abnormality, $t_k$ represents the abnormal time corresponding to that node, k = 1, 2, ..., K, and K represents the number of nodes whose alarm type is node state abnormality; $d_l$ represents the l-th node whose alarm type is system data state abnormality, $t_l$ represents the abnormal time corresponding to that node, l = 1, 2, ..., L, and L represents the number of nodes whose alarm type is system data state abnormality; $e_n$ represents the n-th node whose alarm type is cluster state abnormality, and $t_n$ represents the abnormal time corresponding to that node, n = 1, 2.
In the embodiment of the invention, the alarm type and the corresponding alarm rule are preset in the storage subsystem; the states of all nodes and all types of data in the storage subsystem are monitored in real time, and when the node states or the data states are determined to accord with alarm rules, corresponding types of alarms are sent out, so that the abnormity of the storage subsystem can be accurately judged; the alarm types can include metadata state abnormity, data service state abnormity, node state abnormity, system data state abnormity and cluster state abnormity; the alarm information may include an alarm type, a corresponding node identifier, and an alarm duration; it should be noted that the alarm type and the alarm rule may be adaptively modified according to the task requirement.
Further, acquiring alarm information in a preset monitoring period, and acquiring node identifiers matched with the alarm types to form a node set corresponding to each alarm type; counting the number of nodes in a node set corresponding to each alarm type, and determining the quality score of the current storage subsystem according to the formula; by comprehensively considering various abnormal types of alarms, accurate acquisition of the corresponding quality scores of the storage subsystems can be realized.
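The grouping step described above can be sketched as follows. This is a minimal illustration only: the record fields (`type`, `node`, `duration`), the alarm type keys, and the choice to sum the durations of repeated alarms on the same node are assumptions made for the example and are not specified in the embodiment. The input is assumed to be already restricted to the preset monitoring period.

```python
from collections import defaultdict

# Hypothetical keys for the five alarm types named in the embodiment.
ALARM_TYPES = (
    "metadata_state",
    "data_service_state",
    "node_state",
    "system_data_state",
    "cluster_state",
)


def group_alarms(alarms):
    """Build one node set per alarm type.

    Returns {alarm_type: {node_id: total_alarm_duration}}.
    Durations of repeated alarms on one node are summed (an assumption).
    """
    node_sets = {alarm_type: defaultdict(float) for alarm_type in ALARM_TYPES}
    for alarm in alarms:
        node_sets[alarm["type"]][alarm["node"]] += alarm["duration"]
    return {alarm_type: dict(nodes) for alarm_type, nodes in node_sets.items()}


# Alarm records collected within one monitoring period (illustrative values).
alarms = [
    {"type": "node_state", "node": "n1", "duration": 120.0},
    {"type": "node_state", "node": "n2", "duration": 30.0},
    {"type": "node_state", "node": "n1", "duration": 60.0},
    {"type": "metadata_state", "node": "m1", "duration": 15.0},
]

node_sets = group_alarms(alarms)
# Number of nodes per alarm type, i.e. the counts H, J, K, L and N.
counts = {alarm_type: len(nodes) for alarm_type, nodes in node_sets.items()}
```

The per-type counts and per-node durations produced here are the inputs that the storage subsystem formula consumes.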
In the embodiment of the invention, if the cluster subsystem is a storage subsystem, the quality score of the current storage subsystem is determined according to the number of abnormal nodes corresponding to each alarm type and the corresponding abnormal time, so that the quality score can reflect the abnormal information of each type, and the accuracy of the obtained quality score of the storage subsystem can be improved.
In this embodiment of the present invention, optionally, if the cluster subsystem is a software service subsystem, determining the quality score of the cluster subsystem based on the first available information in the cluster subsystem may include: determining the service score of each preset service item in the cluster subsystem and the service available time corresponding to each preset service item, and determining the quality score of the cluster subsystem according to the service score of each preset service item, the service available time corresponding to each preset service item, the total number of preset service items provided by the cluster subsystem and a preset monitoring period; optionally, if the cluster subsystem is a software service subsystem, determining the quality score of the cluster subsystem based on the following formula:
Figure BDA0003128851730000161
wherein $A_{service}$ represents the quality score of the software service subsystem, $x_r$ represents the service score of the r-th preset service item, and $t_r$ represents the service available time corresponding to the r-th preset service item, r = 1, 2.
It should be noted that, when the quality score of the software service subsystem is calculated, if a process related to a preset service item is determined to exist, that preset service item is marked as service-available and the corresponding available time is recorded. For a preset service item configured with high availability, the service score is 1 if all corresponding nodes are available, 0.5 if one node is unavailable, and 0 if all corresponding nodes are unavailable. For a preset service item configured with load balancing, the service score is the ratio of the number of available nodes to the total number of nodes; for example, if preset service item A is configured with load balancing across N load balancing nodes and one of them is unavailable, the corresponding service score is (N-1)/N. In this way, the service score of each preset service item can be obtained accurately.
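The two scoring rules above can be illustrated with a short sketch. The function names are hypothetical. The embodiment only states the score 0.5 for the case where one high-availability node is unavailable; treating every partially available state as 0.5 is an assumption of this sketch.

```python
def ha_service_score(available_nodes: int, total_nodes: int) -> float:
    """Quality-dimension score of a service item configured with high availability."""
    if total_nodes <= 0 or not 0 <= available_nodes <= total_nodes:
        raise ValueError("node counts are inconsistent")
    if available_nodes == total_nodes:
        return 1.0  # every node is available
    if available_nodes == 0:
        return 0.0  # every node is unavailable
    return 0.5      # some node is unavailable (assumed for all partial states)


def load_balanced_service_score(available_nodes: int, total_nodes: int) -> float:
    """Quality-dimension score of a service item configured with load balancing."""
    if total_nodes <= 0 or not 0 <= available_nodes <= total_nodes:
        raise ValueError("node counts are inconsistent")
    return available_nodes / total_nodes  # e.g. (N - 1) / N with one node down


# One of two high-availability nodes is down; one of four balanced nodes is down.
ha_example = ha_service_score(1, 2)
lb_example = load_balanced_service_score(3, 4)
```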
In this embodiment of the present invention, the preset service items may include SLURM services (e.g., slurmctld and slurmdbd) and a Lightweight Directory Access Protocol (LDAP) management service. When the cluster subsystem is a software service subsystem, the quality score of the software service subsystem is obtained by determining the service score and the available time corresponding to each preset service item, which improves the accuracy of the obtained quality score.
In this embodiment of the present invention, optionally, if the cluster subsystem is a login subsystem, determining the quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem may include: determining the number of abnormal nodes with abnormal alarms and abnormal time corresponding to each abnormal node, and determining the quality score of the login subsystem according to the number of the abnormal nodes, the abnormal time corresponding to each abnormal node, the total number of nodes included in the cluster subsystem and a preset monitoring period; optionally, if the cluster subsystem is a login subsystem, determining the quality score of the login subsystem based on the following formula:
Figure BDA0003128851730000171
wherein $A_{login}$ represents the quality score of the login subsystem, $a_i$ represents the i-th abnormal node for which an abnormal alarm exists, $t_i$ represents the abnormal time corresponding to the i-th abnormal node, i = 1, 2, ..., N, N represents the number of abnormal nodes, M represents the total number of nodes included in the cluster subsystem, and T represents the preset monitoring period.
It should be noted that, if the cluster subsystem is a login subsystem, abnormal nodes are determined by counting the nodes responsible for the login service and detecting in real time the connectivity of each login node to the management network and the computing network; if the management network or the computing network of a login node is determined to be unreachable, the login node is considered abnormal and an abnormal alarm is issued. In the embodiment of the invention, the quality score of the current login subsystem is determined from the abnormal nodes for which abnormal alarms exist and the abnormal time corresponding to each abnormal node, so that the quality score of the login subsystem can be obtained accurately.
S130, determining an availability score of the cluster subsystem based on the second abnormal information and/or the second available information in the cluster subsystem.
In the embodiment of the invention, after the quality scores of all the cluster subsystems are obtained, second abnormal information and second available information are obtained in the cluster subsystems; and calculating to obtain the availability score of each cluster subsystem based on the second abnormal information, the second available information and the matched preset calculation rule corresponding to each cluster subsystem, so that the evaluation on the availability dimension of the cluster subsystems can be realized, and the accuracy of the performance evaluation of the cluster subsystems can be improved.
In the embodiment of the present invention, optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, determining the availability score of the cluster subsystem based on the second abnormal information in the cluster subsystem may include: determining the availability score of the current cluster subsystem according to the number of unavailable nodes for which a DOWN event alarm exists, the unavailable time corresponding to each unavailable node, the total number of nodes included in the cluster subsystem, and a preset monitoring period. Optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, the availability score of the cluster subsystem is determined based on the following formula:
Figure BDA0003128851730000181
wherein $B_{compute/management}$ represents the availability score of the computing subsystem or the management subsystem, $b_j$ represents the j-th unavailable node for which a DOWN event alarm exists, $t_j$ represents the unavailable time corresponding to the j-th unavailable node, j = 1, 2, ..., R, R represents the number of unavailable nodes, M represents the total number of nodes included in the cluster subsystem, and T represents the preset monitoring period.
It should be noted that, if the cluster subsystem is a computing subsystem or a management subsystem, the state of each node in the cluster subsystem is detected in real time, when a DOWN event alarm is detected in a node, the current node is indicated to be unavailable, and the unavailable time corresponding to the unavailable node is obtained through the alarm, so that the unavailable node in the computing subsystem and the unavailable node in the management subsystem can be accurately detected, and further, the availability score corresponding to the cluster subsystem can be accurately obtained.
In the embodiment of the invention, the DOWN event alarm is detected to determine the unavailable node in the computing subsystem and the management subsystem, and then the availability score of the computing subsystem and the management subsystem is obtained according to the detected unavailable node information, so that the availability score of the computing subsystem and the management subsystem can be accurately obtained.
In the embodiment of the present invention, optionally, if the cluster subsystem is a network subsystem, determining the availability score of the cluster subsystem based on the second abnormal information and the second available information in the cluster subsystem may include: obtaining a network link score of the network subsystem according to the number of failed network links, the failure time corresponding to each failed network link, the total number of network links included in the cluster subsystem, and a preset monitoring period; obtaining a subnet management service score of the network subsystem according to the service score of each subnet management service, the service available time corresponding to each subnet management service, the total number of subnet management services provided by the cluster subsystem, and the preset monitoring period; and obtaining the availability score of the network subsystem according to the network link score and the subnet management service score. Optionally, if the cluster subsystem is a network subsystem, the availability score of the cluster subsystem is determined based on the following formula:
$$B_{network} = A_{down} \cdot A_{opensm}$$
wherein,
Figure BDA0003128851730000191
wherein $B_{network}$ represents the availability score of the network subsystem, $A_{down}$ represents the network link score of the network subsystem, and $A_{opensm}$ represents the subnet management service score of the network subsystem; $n_i$ represents the i-th failed network link in a network-down state, $t_i$ represents the failure time corresponding to the i-th failed network link, i = 1, 2, ..., N, N represents the number of failed network links, and K represents the total number of network links included in the cluster subsystem; $c_j$ represents the service score of the j-th subnet management service, and $t_j$ represents the service available time corresponding to the j-th subnet management service, j = 1, 2.
It should be noted that, if the cluster subsystem is a network subsystem, the network link score and the subnet management service score of the network subsystem are obtained separately, and the availability score of the network subsystem is then obtained from them. When determining the network link score, whether a network link has failed is judged by the same rules as those used for the quality score of the network subsystem, which are not repeated here. When the service score of a subnet management service (for example, OPENSM) is obtained, the subnet management service is available as long as one node is available, so there is no need to distinguish whether the service is configured with high availability: the service score of the corresponding OPENSM is 1 as long as one node is available. For example, when a plurality of nodes provide the same subnet management service, the service score of the current subnet management service is 1 if at least one node is available and 0 if all nodes are unavailable.
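As a minimal sketch of the availability-dimension rule just described, the function below scores a subnet management service from the availability of the nodes that provide it. The function name and the list-of-booleans input are illustrative assumptions.

```python
def subnet_service_availability_score(node_available: list[bool]) -> int:
    """Return 1 if at least one node providing the service is available, else 0.

    Unlike the quality dimension, a partially degraded high-availability
    group is not penalized here.
    """
    return 1 if any(node_available) else 0


# Two OPENSM nodes with one down: still fully available in this dimension.
degraded = subnet_service_availability_score([True, False])
# Both nodes down: the service is unavailable.
failed = subnet_service_availability_score([False, False])
```

The same degraded group would receive a service score of 0.5 in the quality dimension, which is the difference between the two dimensions for this subsystem.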
In the embodiment of the invention, if the cluster subsystem is a network subsystem, then in addition to determining the failed network link information, the service score of each subnet management service is determined by detecting the available information of that service without distinguishing whether it is configured with high availability, so that the service score reflects the availability state of the subnet management service and the accuracy of the obtained subnet management service score is improved. The availability score of the network subsystem is then obtained based on the network link score and the subnet management service score, which improves the accuracy of the obtained availability score.
In this embodiment of the present invention, optionally, if the cluster subsystem is a storage subsystem, determining the availability score of the cluster subsystem based on the second abnormal information in the cluster subsystem may include: determining the availability score of the storage subsystem according to the number of nodes corresponding to each preset alarm type, the duration of each node alarm, the total number of nodes included in the storage subsystem and a preset monitoring period; optionally, if the cluster subsystem is a storage subsystem, determining an availability score of the storage subsystem based on the following formula:
Figure BDA0003128851730000201
wherein $B_{storage}$ represents the availability score of the storage subsystem; $a_h$ represents the h-th node whose alarm type is metadata state abnormality, $t_h$ represents the abnormal time corresponding to that node, h = 1, 2, ..., H, and H represents the number of nodes whose alarm type is metadata state abnormality; $b_j$ represents the j-th node whose alarm type is data service state abnormality, $t_j$ represents the abnormal time corresponding to that node, j = 1, 2, ..., J, and J represents the number of nodes whose alarm type is data service state abnormality; $c_k$ represents the k-th node whose alarm type is node state abnormality, $t_k$ represents the abnormal time corresponding to that node, k = 1, 2, ..., K, and K represents the number of nodes whose alarm type is node state abnormality; $d_l$ represents the l-th node whose alarm type is system data state abnormality, $t_l$ represents the abnormal time corresponding to that node, l = 1, 2, ..., L, and L represents the number of nodes whose alarm type is system data state abnormality; $e_n$ represents the n-th node whose alarm type is cluster state abnormality, and $t_n$ represents the abnormal time corresponding to that node, n = 1, 2.
In the embodiment of the invention, if the cluster subsystem is a storage subsystem, the quality of each node also reflects the availability of each node, so that the availability score corresponding to the storage subsystem can be calculated by adopting the same calculation rule as the quality score of the storage subsystem, and the accurate acquisition of the availability score corresponding to the storage subsystem can be realized.
In this embodiment of the present invention, optionally, if the cluster subsystem is a software service subsystem, determining the availability score of the cluster subsystem based on the second available information in the cluster subsystem may include: determining service scores of all preset service items in the cluster subsystem and service available time corresponding to all the preset service items, and determining the availability scores of the cluster subsystem according to the service scores of all the preset service items, the service available time corresponding to all the preset service items, the total number of the preset service items provided by the cluster subsystem and a preset monitoring period; optionally, if the cluster subsystem is a software service subsystem, determining an availability score of the cluster subsystem based on the following formula:
Figure BDA0003128851730000221
wherein $B_{service}$ represents the availability score of the software service subsystem, $x_r$ represents the service score of the r-th preset service item, and $t_r$ represents the service available time corresponding to the r-th preset service item, r = 1, 2.
It should be noted that, when the availability score of the software service subsystem is calculated, the service score of a preset service item can be determined through the following two rules. Rule one: it is judged whether processes related to the SLURM and LDAP services exist; if a process exists, the corresponding preset service item is identified as service-available and its service score is 1; if the process does not exist, the corresponding preset service item is identified as service-unavailable and its service score is 0. The reduction in service quality caused by abnormal nodes in a high-availability or load balancing configuration is not considered here. Rule two: a service response test is performed in both curl mode and cmd mode; if either mode receives no service response, the preset service item is unavailable and its service score is 0; if both modes receive a service response, the preset service item is available and its service score is 1.
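The two rules can be sketched as small functions that take the outcome of each check as input. How the checks themselves are carried out (process lookup, the curl request, the cmd test) and how the two rules are combined are not specified above, so the function names and boolean inputs here are assumptions for illustration.

```python
def score_by_process(process_exists: bool) -> int:
    """Rule one: score 1 if the service's process exists, otherwise 0.

    Degradation of a high-availability or load-balanced group is ignored.
    """
    return 1 if process_exists else 0


def score_by_response(curl_responded: bool, cmd_responded: bool) -> int:
    """Rule two: score 1 only if both response tests succeed, otherwise 0."""
    return 1 if (curl_responded and cmd_responded) else 0


# Hypothetical check results for three preset service items.
check_results = {
    "slurmctld": {"process": True, "curl": True, "cmd": True},
    "slurmdbd": {"process": True, "curl": True, "cmd": False},
    "ldap": {"process": False, "curl": False, "cmd": False},
}

rule_one = {name: score_by_process(r["process"]) for name, r in check_results.items()}
rule_two = {
    name: score_by_response(r["curl"], r["cmd"]) for name, r in check_results.items()
}
```

In this example `slurmdbd` scores 1 under rule one but 0 under rule two, because its process exists while one of the response tests fails.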
In the embodiment of the invention, the service available state of the preset service item is judged through the preset rule, and the service score of each preset service item is determined according to the service available state of each preset service item, so that the service score of each preset service item can reflect the service availability, the accuracy of the obtained service score can be improved, and the accuracy of the obtained availability score of the software service subsystem can be improved.
In this embodiment of the present invention, optionally, if the cluster subsystem is a login subsystem, determining the availability score of the cluster subsystem based on the second abnormal information in the cluster subsystem may include: determining the number of abnormal nodes for which abnormal alarms exist, and determining the availability score of the login subsystem according to the number of abnormal nodes and the total number of nodes included in the cluster subsystem. Optionally, if the cluster subsystem is a login subsystem, the availability score of the login subsystem is determined based on the following formula:
Figure BDA0003128851730000231
wherein, B login The availability score of the login subsystem is represented, N represents the number of abnormal nodes, and M represents the total number of nodes included in the cluster subsystem.
It should be noted that, if the cluster subsystem is a login subsystem, when determining an abnormal node, first determining whether processes related to the SLURM service and the LDAP service exist, if both processes exist, further determining whether a user can normally log in and whether a test job can be normally submitted, and if it is determined that the user can normally log in and the test job can be normally submitted, determining that the current node is normal; otherwise, determining the node as an abnormal node. In the embodiment of the invention, as long as the number of the normal nodes is more than or equal to 1, the login subsystem is available, and the availability score of the login subsystem is 1; otherwise, the login subsystem is not available, and the availability score of the login subsystem is 0.
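The node check and the resulting availability score described above can be sketched as follows. The `LoginNodeCheck` record and its field names are hypothetical; each field stands for the outcome of one of the checks named in the text.

```python
from dataclasses import dataclass


@dataclass
class LoginNodeCheck:
    """Outcome of the checks run against one login node (hypothetical record)."""
    slurm_process_exists: bool
    ldap_process_exists: bool
    user_can_log_in: bool
    test_job_submits: bool


def node_is_normal(check: LoginNodeCheck) -> bool:
    """A node is normal only if both processes exist and both functional tests pass."""
    if not (check.slurm_process_exists and check.ldap_process_exists):
        return False
    return check.user_can_log_in and check.test_job_submits


def login_availability_score(checks: list[LoginNodeCheck]) -> int:
    """Return 1 if at least one login node is normal, otherwise 0."""
    normal_nodes = sum(1 for check in checks if node_is_normal(check))
    return 1 if normal_nodes >= 1 else 0


healthy = LoginNodeCheck(True, True, True, True)
no_job = LoginNodeCheck(True, True, True, False)

one_normal = login_availability_score([healthy, no_job])
none_normal = login_availability_score([no_job, no_job])
```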
In the embodiment of the invention, if the cluster subsystem is a login subsystem, the number of the available nodes of the login subsystem is determined according to the number of the abnormal nodes and the total number of the nodes included in the login subsystem, and then the availability score of the login subsystem is determined according to the number of the available nodes, so that the accurate acquisition of the corresponding availability score of the login subsystem can be realized.
S140, determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result.
In the embodiment of the invention, after the quality score and the availability score of the cluster subsystem are obtained, the quality score and the availability score of the cluster can be determined according to the quality score and the availability score of the cluster subsystem, and the quality score and the availability score of the cluster are used as the performance evaluation result of the current cluster; or determining the comprehensive performance score of the cluster according to the quality score and the availability score of the cluster subsystem, and taking the comprehensive performance score of the cluster as the performance evaluation result of the current cluster, so that the accurate acquisition of the cluster performance evaluation result can be realized.
Correspondingly, after the performance evaluation result of the cluster is obtained, it is judged whether the current performance evaluation result is abnormal, for example, whether it is less than or equal to a preset performance evaluation threshold. If the performance evaluation result is determined to be abnormal, the cluster subsystem with abnormal performance is determined according to the quality score and the availability score of each cluster subsystem, and the abnormal nodes or services are determined according to the abnormal information and the available information of each cluster subsystem, so that the abnormal nodes and services can be maintained in a targeted manner, for example, by taking an abnormal node off the shelf for maintenance, restarting it, or reconfiguring an abnormal service, which improves cluster operation and maintenance efficiency.
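One possible way to narrow an abnormal cluster-level result down to the responsible subsystems is sketched below. The embodiment does not state how an individual subsystem is judged abnormal; comparing each subsystem score with a single threshold is an assumption of this sketch, as are the function name and the example values.

```python
def abnormal_subsystems(
    quality: dict[str, float],
    availability: dict[str, float],
    threshold: float,
) -> list[str]:
    """List subsystems whose quality or availability score is at or below threshold."""
    return sorted(
        name
        for name in quality
        if quality[name] <= threshold or availability[name] <= threshold
    )


# Illustrative per-subsystem scores for one monitoring period.
quality_scores = {"compute": 0.99, "network": 0.80, "login": 1.0}
availability_scores = {"compute": 1.0, "network": 1.0, "login": 0.0}

suspects = abnormal_subsystems(quality_scores, availability_scores, threshold=0.9)
```

The abnormal information and available information of each listed subsystem would then be inspected to find the specific node or service to maintain.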
In the embodiment of the invention, statistics and analysis can be performed on cluster performance evaluation results in a past period of time (for example, one year) to predict the performance state of a future cluster, so that a preprocessing strategy is determined, and when corresponding performance abnormality occurs in the future, a matched preprocessing strategy can be directly adopted, so that the cluster operation and maintenance efficiency can be further improved.
According to the technical scheme provided by the embodiment of the invention, each cluster subsystem in the cluster to be evaluated is determined, the quality score of the cluster subsystem is determined based on the first abnormal information and/or the first available information in the cluster subsystem, and the availability score of the cluster subsystem is determined based on the second abnormal information and/or the second available information in the cluster subsystem; finally, a performance evaluation result of the cluster is determined based on the quality score and the availability score, the cluster is operated and maintained based on the performance evaluation result, the accurate evaluation of the performance of the cluster to be evaluated is realized by obtaining the scores of the quality dimension and the availability dimension of the cluster to be evaluated and obtaining the final performance evaluation result, the timely warning of the performance abnormity of the cluster to be evaluated can be conveniently realized, the effective operation and maintenance of the cluster to be evaluated are facilitated, and the operation and maintenance efficiency can be improved.
Fig. 2 is a flowchart of an evaluation method for cluster performance according to an embodiment of the present invention, which is embodied on the basis of the foregoing embodiment, and optionally, determining a performance evaluation result of the cluster based on the quality score and the availability score includes: determining the quality score and the availability score of the cluster to be evaluated based on the quality score and the availability score of each cluster subsystem, and determining the performance evaluation result of the cluster based on the quality score and the availability score of the cluster; as shown in fig. 2, the method specifically includes:
S210, determining nodes which accord with the service types of the cluster to be evaluated, forming a node set, and determining a cluster subsystem corresponding to the node set.
Reference may be made to the description of the above embodiment for the description of S210.
S220, determining a quality score of the cluster subsystem based on the first abnormal information and/or the first available information in the cluster subsystem.
S230, determining the availability score of the cluster subsystem based on the second abnormal information and/or the second available information in the cluster subsystem.
S240, determining the quality score and the availability score of the cluster based on the quality score of the cluster subsystem and the availability score of the cluster subsystem, determining the performance evaluation result of the cluster based on the quality score and the availability score of the cluster, and operating and maintaining the cluster based on the performance evaluation result.
In the embodiment of the invention, the quality score of the cluster to be evaluated is determined according to the quality score of each cluster subsystem; determining the availability score of the cluster to be evaluated according to the availability score of each cluster subsystem; optionally, the quality score of the cluster is determined based on the following formula:
$$A = A_{compute} \cdot A_{network} \cdot A_{storage} \cdot A_{management} \cdot A_{service} \cdot A_{login}$$
where A represents the quality score of the cluster.
Optionally, the availability score of the cluster is determined based on the following formula:
$$B = B_{compute} \cdot B_{network} \cdot B_{storage} \cdot B_{management} \cdot B_{service} \cdot B_{login}$$
where B represents the availability score of the cluster.
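Both cluster-level formulas are a product over the six subsystem scores, which can be sketched with one helper. The dictionary keys and the function name are illustrative assumptions; the example values are made up.

```python
from math import prod

# The six subsystems that appear in the cluster-level formulas.
SUBSYSTEMS = ("compute", "network", "storage", "management", "service", "login")


def cluster_score(subsystem_scores: dict[str, float]) -> float:
    """Multiply the six subsystem scores into one cluster-level score."""
    missing = [name for name in SUBSYSTEMS if name not in subsystem_scores]
    if missing:
        raise KeyError(f"missing subsystem scores: {missing}")
    return prod(subsystem_scores[name] for name in SUBSYSTEMS)


quality_scores = {
    "compute": 1.0,
    "network": 0.5,
    "storage": 1.0,
    "management": 1.0,
    "service": 1.0,
    "login": 1.0,
}
availability_scores = dict(quality_scores, network=1.0, login=0.0)

A = cluster_score(quality_scores)       # quality score of the cluster
B = cluster_score(availability_scores)  # availability score of the cluster
```

Because the scores are multiplied, a single subsystem with an availability score of 0 drives the cluster availability score to 0, regardless of the other subsystems.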
In an implementation manner of the embodiment of the present invention, optionally, after the quality scores and the availability scores of the cluster subsystems are obtained, the quality scores and the availability scores of the clusters to be evaluated may be determined according to the quality scores and the availability scores of the cluster subsystems, and preset quality evaluation thresholds and preset availability evaluation thresholds corresponding to the clusters to be evaluated are set; if the quality score of the cluster to be evaluated is detected to be larger than a preset quality evaluation threshold value and the availability score is detected to be larger than a preset availability evaluation threshold value, determining that the performance of the cluster to be evaluated is normal; if the quality score of the cluster to be evaluated is detected to be less than or equal to the preset quality evaluation threshold value or the availability score is detected to be less than or equal to the preset availability evaluation threshold value, the cluster to be evaluated is determined to have performance abnormity, a performance abnormity alarm is sent out, and timely early warning of the cluster performance abnormity can be achieved.
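The threshold comparison described above can be written as a single predicate. The function and parameter names are illustrative, and the threshold values in the example are made up.

```python
def cluster_is_normal(
    quality_score: float,
    availability_score: float,
    quality_threshold: float,
    availability_threshold: float,
) -> bool:
    """Normal only if both scores are strictly greater than their thresholds.

    A score equal to its threshold counts as abnormal, matching the
    "less than or equal to" condition for raising a performance alarm.
    """
    return (
        quality_score > quality_threshold
        and availability_score > availability_threshold
    )


# Quality is fine, but availability sits exactly on its threshold: alarm.
raise_alarm = not cluster_is_normal(0.97, 0.95, 0.90, 0.95)
```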
Correspondingly, when the cluster is operated and maintained based on the performance evaluation result, if a performance abnormality alarm exists, the abnormal cluster subsystem is determined according to the quality score and the availability score of each cluster subsystem, and the abnormal node and the abnormal type are determined according to the abnormal information and the available information corresponding to that cluster subsystem; the abnormal node is then maintained with a fault handling strategy matched to the abnormal type. The fault handling strategy may include restarting the abnormal node and taking the abnormal node off the shelf for maintenance. In other words, the quality score and the availability score of the cluster to be evaluated are determined from those of the cluster subsystems and used to judge whether the current cluster has a performance abnormality; if it does, the abnormal node is determined according to the abnormal information and the available information so that it can be maintained in a targeted manner, which allows the abnormal node to be located quickly and improves operation and maintenance efficiency.
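One possible way to organize this maintenance step is sketched below. The text names two fault handling strategies but does not say which abnormal type maps to which strategy, so the mapping, the anomaly type labels, the per-subsystem threshold, and the `manual_review` fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    node: str          # node that raised the alarm
    subsystem: str     # cluster subsystem the node belongs to
    anomaly_type: str  # hypothetical label, e.g. "health_check" or "down"


# Hypothetical mapping from abnormal type to fault handling strategy.
STRATEGY_BY_TYPE = {
    "health_check": "restart",
    "down": "off_shelf_maintenance",
}


def plan_maintenance(subsystem_scores, alarms, threshold):
    """Return (node, strategy) pairs for alarms raised in abnormal subsystems."""
    # Assumption: a subsystem is abnormal if either score is at or below the threshold.
    abnormal = {name for name, (quality, availability) in subsystem_scores.items()
                if quality <= threshold or availability <= threshold}
    plan = []
    for alarm in alarms:
        if alarm.subsystem in abnormal:
            strategy = STRATEGY_BY_TYPE.get(alarm.anomaly_type, "manual_review")
            plan.append((alarm.node, strategy))
    return plan


scores = {"compute": (0.7, 0.99), "storage": (1.0, 1.0)}
alarms = [Alarm("node01", "compute", "health_check"),
          Alarm("node02", "compute", "down"),
          Alarm("node09", "storage", "down")]
print(plan_maintenance(scores, alarms, threshold=0.9))
```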
In an implementation manner of the embodiment of the present invention, optionally, after the quality score and the availability score of the cluster to be evaluated are obtained, the comprehensive performance score of the cluster may also be obtained according to the quality score and the availability score, and the comprehensive performance score is used as a performance evaluation result of the cluster, so that more accurate performance evaluation of the cluster to be evaluated can be achieved; optionally, the comprehensive performance score is obtained based on the following formula:
Z=αA+βB;
wherein Z represents the overall performance score, α and β represent weighting coefficients, which can be set according to the service requirements, and α + β =1.
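The weighted sum above can be sketched in a few lines of Python. The function name and the sample weights are illustrative assumptions; the only constraint taken from the text is that α + β = 1.

```python
import math


def composite_score(quality: float, availability: float,
                    alpha: float, beta: float) -> float:
    """Compute Z = alpha * A + beta * B, requiring alpha + beta = 1."""
    if not math.isclose(alpha + beta, 1.0):
        raise ValueError("alpha and beta must sum to 1")
    return alpha * quality + beta * availability


# Hypothetical example: equal weights for quality and availability.
Z = composite_score(quality=0.8, availability=0.9, alpha=0.5, beta=0.5)
print(f"Z = {Z:.2f}")
```

A service that cares more about uptime than about degraded nodes could, for example, choose a larger β than α.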
It should be noted that, according to the technical scheme in the embodiment of the present invention, the cluster performance in the preset monitoring period can be evaluated in real time, so that the cluster nodes, paths, and networks can be operated and maintained in a targeted manner according to the comprehensive performance score, and the cluster operation and maintenance efficiency is improved; the operation and maintenance of the cluster based on the comprehensive performance score may include: judging whether the comprehensive performance score is less than or equal to a preset comprehensive performance evaluation threshold value; if the detected comprehensive performance score is less than or equal to a preset comprehensive performance evaluation threshold value, determining that the current cluster has performance abnormity; and determining the abnormal cluster subsystems according to the quality scores and the availability scores corresponding to the cluster subsystems, and further determining the abnormal nodes or services according to the abnormal information and the available information so as to perform targeted maintenance on the abnormal nodes or services, thereby further improving the operation and maintenance efficiency of the clusters.
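The flow described above, from subsystem scores to a list of subsystems worth inspecting, might look like the sketch below. The text does not specify how abnormal subsystems are singled out, so two rules here are assumptions: only subsystems with a score below 1.0 are reported, and they are ranked by their own weighted score, worst first.

```python
from math import prod


def triage(subsystem_scores, alpha, beta, z_threshold):
    """Return subsystems to inspect, worst first, or [] if the cluster is normal."""
    # Cluster-level scores are the products of the subsystem scores.
    a = prod(q for q, _ in subsystem_scores.values())
    b = prod(v for _, v in subsystem_scores.values())
    z = alpha * a + beta * b
    if z > z_threshold:
        return []  # comprehensive score above the threshold: no abnormality
    # Assumed ranking rule: lowest weighted subsystem score first.
    ranked = sorted(subsystem_scores.items(),
                    key=lambda item: alpha * item[1][0] + beta * item[1][1])
    return [name for name, (q, v) in ranked if q < 1.0 or v < 1.0]


# Hypothetical (quality, availability) pairs per subsystem.
scores = {"compute": (0.9, 1.0), "storage": (0.6, 0.8), "login": (1.0, 1.0)}
print(triage(scores, alpha=0.5, beta=0.5, z_threshold=0.9))
```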
Therefore, in this technical solution, each cluster subsystem in the cluster to be evaluated is determined, the quality score of the cluster subsystem is determined based on the first abnormal information and/or the first available information in the cluster subsystem, and the availability score of the cluster subsystem is determined based on the second abnormal information and/or the second available information in the cluster subsystem. The quality score and the availability score of the cluster to be evaluated are then obtained, the comprehensive performance score of the cluster is determined based on them, and the cluster is operated and maintained based on the comprehensive performance score. Obtaining the comprehensive performance score allows the performance of the cluster to be evaluated accurately and gives operation and maintenance personnel a data reference, so that performance abnormalities of the cluster can be found in a more timely manner, operation and maintenance can be targeted, and the operation and maintenance efficiency of the cluster is improved.
Fig. 3 is a block diagram of a structure of an apparatus for evaluating cluster performance according to an embodiment of the present invention, where the apparatus specifically includes: a cluster subsystem determining module 310, a quality score obtaining module 320, an availability score obtaining module 330, and a performance evaluation result determining module 340;
a cluster subsystem determining module 310, configured to determine each cluster subsystem in a cluster to be evaluated;
a quality score obtaining module 320, configured to determine a quality score of the cluster subsystem based on the first anomaly information and/or the first available information in the cluster subsystem;
an availability score obtaining module 330, configured to determine an availability score of the cluster subsystem based on the second anomaly information and/or the second available information in the cluster subsystem;
and a performance evaluation result determining module 340, configured to determine a performance evaluation result of the cluster based on the quality score and the availability score, and perform operation and maintenance on the cluster based on the performance evaluation result.
Optionally, on the basis of the foregoing technical solution, the cluster subsystem determining module 310 is specifically configured to determine a node that conforms to the service type of the cluster to be evaluated, form a node set, and determine a cluster subsystem corresponding to the node set; the cluster subsystem comprises a computing subsystem, a network subsystem, a storage subsystem, a management subsystem, a software service subsystem and/or a login subsystem.
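Forming node sets by service type can be sketched as a simple grouping step. The node names and the input format (a mapping from node name to the service types it carries) are hypothetical, and the sketch assumes that a node carrying several services belongs to several subsystems.

```python
from collections import defaultdict


def group_nodes_by_service(nodes):
    """Map each service type to the set of nodes that carry it."""
    subsystems = defaultdict(set)
    for node_name, service_types in nodes.items():
        for service_type in service_types:
            subsystems[service_type].add(node_name)
    return dict(subsystems)


# Hypothetical inventory: node name -> service types carried by that node.
inventory = {
    "n1": ["compute"],
    "n2": ["compute", "login"],
    "m1": ["management"],
}
print(group_nodes_by_service(inventory))
```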
Optionally, on the basis of the above technical solution, if the cluster subsystem is a computing subsystem and/or a management subsystem, the quality score obtaining module 320 is specifically configured to:
Figure BDA0003128851730000281
the availability score obtaining module 330 is specifically configured to:
Figure BDA0003128851730000282
wherein $A_{\mathrm{compute/management}}$ represents the quality score of the computing or management subsystem, $B_{\mathrm{compute/management}}$ represents the availability score of the computing or management subsystem, $a_i$ represents the i-th abnormal node for which a node health check alarm exists, $t_i$ represents the abnormal time corresponding to the i-th abnormal node, i = 1, 2, ..., N, N represents the number of abnormal nodes, the subscript j denotes the j-th unavailable node for which a DOWN event alarm exists, $t_j$ represents the unavailable time corresponding to the j-th unavailable node, j = 1, 2, ..., R, R represents the number of unavailable nodes, M represents the total number of nodes included in the cluster subsystem, and T represents the preset monitoring period.
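The variable definitions above suggest, but do not by themselves fix, the shape of the score. As a hedged sketch only, the snippet below assumes that the score is one minus the share of node-time lost in the monitoring period, that is, 1 − Σ t / (M · T). This form is an assumption made for illustration and is not confirmed by the text; the function name and sample values are likewise hypothetical.

```python
def time_weighted_score(lost_times, total_nodes, period):
    """Assumed form: 1 - sum(t) / (M * T). Not confirmed by the source text."""
    if total_nodes <= 0 or period <= 0:
        raise ValueError("total_nodes and period must be positive")
    return 1.0 - sum(lost_times) / (total_nodes * period)


# Hypothetical subsystem: M = 100 nodes, T = 24 hours.
# Three nodes raised health check alarms for 2, 4 and 6 hours.
quality = time_weighted_score([2, 4, 6], total_nodes=100, period=24)
# One node had a DOWN event alarm for 6 hours.
availability = time_weighted_score([6], total_nodes=100, period=24)
print(f"quality = {quality:.4f}, availability = {availability:.4f}")
```

Under this assumed form, the same function serves both scores: abnormal times of the N abnormal nodes give the quality score, and unavailable times of the R unavailable nodes give the availability score.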
Optionally, on the basis of the above technical solution, if the cluster subsystem is a network subsystem, the quality score obtaining module 320 is specifically configured to:
$$A_{\mathrm{network}} = A_{\mathrm{down}} \cdot A_{\mathrm{opensm}}$$
the availability score obtaining module 330 is specifically configured to:
$$B_{\mathrm{network}} = A_{\mathrm{down}} \cdot A_{\mathrm{opensm}}$$
wherein,
Figure BDA0003128851730000291
wherein $A_{\mathrm{network}}$ represents the quality score of the network subsystem, $B_{\mathrm{network}}$ represents the availability score of the network subsystem, $A_{\mathrm{down}}$ represents the network link score of the network subsystem, $A_{\mathrm{opensm}}$ represents the subnet management service score of the network subsystem, $n_i$ represents the i-th failed network link in the network down state, $t_i$ represents the failure time corresponding to the i-th failed network link, i = 1, 2, ..., N, N represents the number of failed network links, K represents the total number of network links included in the cluster subsystem, $c_j$ represents the service score of the j-th subnet management service, $t_j$ represents the service available time corresponding to the j-th subnet management service, j = 1, 2, ..., S, S represents the total number of subnet management services provided by the cluster subsystem, and T represents the preset monitoring period.
Optionally, on the basis of the above technical solution, if the cluster subsystem is a storage subsystem, the quality score obtaining module 320 is specifically configured to:
Figure BDA0003128851730000292
the availability score obtaining module 330 is specifically configured to:
Figure BDA0003128851730000293
wherein $A_{\mathrm{storage}}$ represents the quality score of the storage subsystem, $B_{\mathrm{storage}}$ represents the availability score of the storage subsystem, $a_h$ represents the h-th node whose alarm type is metadata state abnormal, $t_h$ represents the abnormal time corresponding to the h-th node whose alarm type is metadata state abnormal, h = 1, 2, ..., H, H represents the number of nodes whose alarm type is metadata state abnormal, $b_j$ represents the j-th node whose alarm type is data service state abnormal, $t_j$ represents the abnormal time corresponding to the j-th node whose alarm type is data service state abnormal, j = 1, 2, ..., J, J represents the number of nodes whose alarm type is data service state abnormal, $c_k$ represents the k-th node whose alarm type is node state abnormal, $t_k$ represents the abnormal time corresponding to the k-th node whose alarm type is node state abnormal, k = 1, 2, ..., K, K represents the number of nodes whose alarm type is node state abnormal, $d_l$ represents the l-th node whose alarm type is system data state abnormal, $t_l$ represents the abnormal time corresponding to the l-th node whose alarm type is system data state abnormal, l = 1, 2, ..., L, L represents the number of nodes whose alarm type is system data state abnormal, $e_n$ represents the n-th node whose alarm type is cluster state abnormal, and $t_n$ represents the abnormal time corresponding to the n-th node whose alarm type is cluster state abnormal, n = 1, 2, ...
Optionally, on the basis of the above technical solution, if the cluster subsystem is a software service subsystem, the quality score obtaining module 320 is specifically configured to:
Figure BDA0003128851730000301
the availability score obtaining module 330 is specifically configured to:
Figure BDA0003128851730000302
wherein $A_{\mathrm{service}}$ represents the quality score of the software service subsystem, $B_{\mathrm{service}}$ represents the availability score of the software service subsystem, $x_r$ represents the service score of the r-th preset service item, $t_r$ represents the service available time corresponding to the r-th preset service item, r = 1, 2, ..., R, R represents the total number of preset service items provided by the cluster subsystem, and T represents the preset monitoring period.
Optionally, on the basis of the foregoing technical solution, if the cluster subsystem is a login subsystem, the quality score obtaining module 320 is specifically configured to:
Figure BDA0003128851730000311
the availability score obtaining module 330 is specifically configured to:
Figure BDA0003128851730000312
wherein $A_{\mathrm{login}}$ represents the quality score of the login subsystem, $B_{\mathrm{login}}$ represents the availability score of the login subsystem, $a_i$ represents the i-th abnormal node for which an abnormal alarm exists, $t_i$ represents the abnormal time corresponding to the i-th abnormal node, i = 1, 2, ..., N, N represents the number of abnormal nodes, M represents the total number of nodes included in the cluster subsystem, and T represents the preset monitoring period.
The device can execute the cluster performance evaluation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details not described in detail in this embodiment, reference may be made to the method provided in any embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 4, the electronic device includes:
one or more processors 410, one processor 410 being illustrated in FIG. 4;
a memory 420;
the apparatus may further include: an input device 430 and an output device 440.
The processor 410, the memory 420, the input device 430, and the output device 440 of the apparatus may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 4.
The memory 420, as a non-transitory computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for evaluating cluster performance in the embodiment of the present invention (for example, the cluster subsystem determining module 310, the quality score obtaining module 320, the availability score obtaining module 330, and the performance evaluation result determining module 340 shown in Fig. 3). The processor 410 executes the various functional applications and data processing of the computer device by running the software programs, instructions, and modules stored in the memory 420, thereby implementing the method for evaluating cluster performance of the above method embodiment:
determining each cluster subsystem in a cluster to be evaluated;
determining a quality score for the cluster subsystem based on first anomaly information and/or first available information in the cluster subsystem;
determining an availability score for the cluster subsystem based on second anomaly information and/or second available information in the cluster subsystem;
determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result.
The memory 420 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 420 may optionally include memory located remotely from processor 410, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 440 may include a display screen or the like.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for evaluating cluster performance according to any embodiment of the present invention; the method comprises the following steps:
determining each cluster subsystem in a cluster to be evaluated;
determining a quality score for the cluster subsystem based on first anomaly information and/or first available information in the cluster subsystem;
determining an availability score for the cluster subsystem based on second anomaly information and/or second available information in the cluster subsystem;
determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing description is only exemplary of the invention and that the principles of the technology may be employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in some detail by the above embodiments, the invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the invention, and the scope of the invention is determined by the scope of the appended claims.

Claims (10)

1. A method for evaluating cluster performance is characterized by comprising the following steps:
determining each cluster subsystem in a cluster to be evaluated; the cluster subsystem is a node set bearing different services in a cluster to be evaluated, and the nodes are servers;
determining a quality score for the cluster subsystem based on first anomaly information and/or first available information in the cluster subsystem; the quality score is used for describing the loss degree of the nodes, the links and the system;
determining an availability score for the cluster subsystem based on second anomaly information and/or second available information in the cluster subsystem;
determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result;
if the cluster subsystem is a computing subsystem and/or a management subsystem, determining a quality score of the cluster subsystem based on first abnormal information in the cluster subsystem, including: determining the quality score of the current cluster subsystem according to the number of the abnormal nodes, the abnormal time corresponding to the abnormal nodes, the total number of the cluster subsystem including the nodes and a preset monitoring period;
determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem, comprising: and determining the availability score of the current cluster subsystem according to the number of the unavailable nodes with the DOWN event alarm, the unavailable time corresponding to the unavailable nodes, the total number of the cluster subsystem including the nodes and a preset monitoring period.
2. The method of claim 1, wherein the determining each cluster subsystem in the cluster to be evaluated comprises:
determining nodes which accord with the service types of the clusters to be evaluated, forming a node set, and determining cluster subsystems corresponding to the node set;
the cluster subsystem comprises a computing subsystem, a network subsystem, a storage subsystem, a management subsystem, a software service subsystem and/or a login subsystem.
3. The method of claim 2, wherein determining the quality score of the cluster subsystem based on the first anomaly information in the cluster subsystem if the cluster subsystem is a computing subsystem and/or a management subsystem comprises:
Figure FDA0003873821560000021
determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem, comprising:
Figure FDA0003873821560000022
wherein $A_{\mathrm{compute/management}}$ represents the quality score of the computing or management subsystem, $B_{\mathrm{compute/management}}$ represents the availability score of the computing or management subsystem, $a_i$ represents the i-th abnormal node for which a node health check alarm exists, $t_i$ represents the abnormal time corresponding to the i-th abnormal node, i = 1, 2, ..., N, N represents the number of abnormal nodes, the subscript j denotes the j-th unavailable node for which a DOWN event alarm exists, $t_j$ represents the unavailable time corresponding to the j-th unavailable node, j = 1, 2, ..., R, R represents the number of unavailable nodes, M represents the total number of nodes included in the cluster subsystem, and T represents the preset monitoring period.
4. The method of claim 2, wherein if the cluster subsystem is a network subsystem; determining a quality score for the clustered subsystem based on first anomaly information and first available information in the clustered subsystem, comprising:
$$A_{\mathrm{network}} = A_{\mathrm{down}} \cdot A_{\mathrm{opensm}}$$
determining an availability score for the cluster subsystem based on second anomaly information and second available information in the cluster subsystem, comprising:
$$B_{\mathrm{network}} = A_{\mathrm{down}} \cdot A_{\mathrm{opensm}}$$
wherein,
Figure FDA0003873821560000031
wherein $A_{\mathrm{network}}$ represents the quality score of the network subsystem, $B_{\mathrm{network}}$ represents the availability score of the network subsystem, $A_{\mathrm{down}}$ represents the network link score of the network subsystem, $A_{\mathrm{opensm}}$ represents the subnet management service score of the network subsystem, $n_i$ represents the i-th failed network link in the network down state, $t_i$ represents the failure time corresponding to the i-th failed network link, i = 1, 2, ..., N, N represents the number of failed network links, K represents the total number of network links included in the cluster subsystem, $c_j$ represents the service score of the j-th subnet management service, $t_j$ represents the service available time corresponding to the j-th subnet management service, j = 1, 2, ..., S, S represents the total number of subnet management services provided by the cluster subsystem, and T represents the preset monitoring period.
5. The method of claim 2, wherein determining the quality score of the cluster subsystem based on the first anomaly information in the cluster subsystem if the cluster subsystem is a storage subsystem comprises:
Figure FDA0003873821560000032
determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem, comprising:
Figure FDA0003873821560000033
wherein $A_{\mathrm{storage}}$ represents the quality score of the storage subsystem, $B_{\mathrm{storage}}$ represents the availability score of the storage subsystem, $a_h$ represents the h-th node whose alarm type is metadata state abnormal, $t_h$ represents the abnormal time corresponding to the h-th node whose alarm type is metadata state abnormal, h = 1, 2, ..., H, H represents the number of nodes whose alarm type is metadata state abnormal, $b_j$ represents the j-th node whose alarm type is data service state abnormal, $t_j$ represents the abnormal time corresponding to the j-th node whose alarm type is data service state abnormal, j = 1, 2, ..., J, J represents the number of nodes whose alarm type is data service state abnormal, $c_k$ represents the k-th node whose alarm type is node state abnormal, $t_k$ represents the abnormal time corresponding to the k-th node whose alarm type is node state abnormal, k = 1, 2, ..., K, K represents the number of nodes whose alarm type is node state abnormal, $d_l$ represents the l-th node whose alarm type is system data state abnormal, $t_l$ represents the abnormal time corresponding to the l-th node whose alarm type is system data state abnormal, l = 1, 2, ..., L, L represents the number of nodes whose alarm type is system data state abnormal, $e_n$ represents the n-th node whose alarm type is cluster state abnormal, and $t_n$ represents the abnormal time corresponding to the n-th node whose alarm type is cluster state abnormal, n = 1, 2, ...
6. The method of claim 2, wherein determining the quality score for the cluster subsystem based on first available information in the cluster subsystem if the cluster subsystem is a software service subsystem comprises:
Figure FDA0003873821560000041
determining an availability score for the cluster subsystem based on second available information in the cluster subsystem, comprising:
Figure FDA0003873821560000051
wherein $A_{\mathrm{service}}$ represents the quality score of the software service subsystem, $B_{\mathrm{service}}$ represents the availability score of the software service subsystem, $x_r$ represents the service score of the r-th preset service item, $t_r$ represents the service available time corresponding to the r-th preset service item, r = 1, 2, ..., R, R represents the total number of preset service items provided by the cluster subsystem, and T represents the preset monitoring period.
7. The method of claim 2, wherein determining the quality score of the cluster subsystem based on the first anomaly information in the cluster subsystem if the cluster subsystem is a login subsystem comprises:
Figure FDA0003873821560000052
determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem, comprising:
Figure FDA0003873821560000053
wherein $A_{\mathrm{login}}$ represents the quality score of the login subsystem, $B_{\mathrm{login}}$ represents the availability score of the login subsystem, $a_i$ represents the i-th abnormal node for which an abnormal alarm exists, $t_i$ represents the abnormal time corresponding to the i-th abnormal node, i = 1, 2, ..., N, N represents the number of abnormal nodes, M represents the total number of nodes included in the cluster subsystem, and T represents the preset monitoring period.
8. An apparatus for evaluating cluster performance, comprising:
the cluster subsystem determining module is used for determining each cluster subsystem in the cluster to be evaluated; the cluster subsystem is a node set bearing different services in a cluster to be evaluated, and the nodes are servers;
the quality score acquisition module is used for determining the quality score of the cluster subsystem based on first abnormal information and/or first available information in the cluster subsystem; the quality score is used for describing the loss degree of the nodes, the links and the system;
an availability score obtaining module, configured to determine an availability score of the cluster subsystem based on second anomaly information and/or second available information in the cluster subsystem;
the performance evaluation result determining module is used for determining a performance evaluation result of the cluster based on the quality score and the availability score and carrying out operation and maintenance on the cluster based on the performance evaluation result;
if the cluster subsystem is a computing subsystem and/or a management subsystem, determining a quality score of the cluster subsystem based on first abnormal information in the cluster subsystem, including: determining the quality score of the current cluster subsystem according to the number of the abnormal nodes, the abnormal time corresponding to the abnormal nodes, the total number of the cluster subsystem including the nodes and a preset monitoring period;
determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem, comprising: and determining the availability score of the current cluster subsystem according to the number of the unavailable nodes with the DOWN event alarm, the unavailable time corresponding to the unavailable nodes, the total number of the cluster subsystem including the nodes and a preset monitoring period.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for evaluating cluster performance according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method for evaluating the performance of a cluster according to any one of claims 1 to 7.
CN202110696929.2A 2021-06-23 2021-06-23 Cluster performance evaluation method, device, equipment and storage medium Active CN113438110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110696929.2A CN113438110B (en) 2021-06-23 2021-06-23 Cluster performance evaluation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113438110A CN113438110A (en) 2021-09-24
CN113438110B true CN113438110B (en) 2023-02-28

Family

ID=77753540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110696929.2A Active CN113438110B (en) 2021-06-23 2021-06-23 Cluster performance evaluation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113438110B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114374707B (en) * 2022-03-22 2022-06-21 联想凌拓科技有限公司 Management method, device, equipment and medium for storage cluster
CN116627356B (en) * 2023-07-21 2023-11-14 江苏华存电子科技有限公司 Distribution control method and system for large-capacity storage data
CN116827826B (en) * 2023-08-29 2023-10-27 腾讯科技(深圳)有限公司 Method and device for evaluating edge node and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105072201A (en) * 2015-08-28 2015-11-18 北京奇艺世纪科技有限公司 Distributed storage system and storage quality control method and device thereof
CN107451039A (en) * 2016-03-31 2017-12-08 阿里巴巴集团控股有限公司 A kind of method and apparatus to performing appraisal of equipment in cluster
CN109034580A (en) * 2018-07-16 2018-12-18 三门核电有限公司 A kind of information system holistic health degree appraisal procedure based on big data analysis
CN111708665A (en) * 2020-05-29 2020-09-25 苏州浪潮智能科技有限公司 Method, device, equipment and medium for comprehensively monitoring storage cluster system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358347A (en) * 2017-07-05 2017-11-17 西安电子科技大学 Equipment cluster health state evaluation method based on industrial big data
US10162678B1 (en) * 2017-08-14 2018-12-25 10X Genomics, Inc. Systems and methods for distributed resource management
US11218391B2 (en) * 2018-12-04 2022-01-04 Netapp, Inc. Methods for monitoring performance of a network fabric and devices thereof

Also Published As

Publication number Publication date
CN113438110A (en) 2021-09-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant