CN113438110B

CN113438110B - Cluster performance evaluation method, device, equipment and storage medium

Info

Publication number: CN113438110B
Application number: CN202110696929.2A
Authority: CN
Inventors: 王雄斌; 王家尧; 吕灼恒; 张晋锋; 原帅; 郝文静; 王建敏; 周军; 解文龙; 苗海峰; 吕益行
Original assignee: Dawning Information Industry Beijing Co Ltd
Current assignee: Dawning Information Industry Beijing Co Ltd
Priority date: 2021-06-23
Filing date: 2021-06-23
Publication date: 2023-02-28
Anticipated expiration: 2041-06-23
Also published as: CN113438110A

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for evaluating cluster performance. The method comprises the following steps: determining each cluster subsystem in a cluster to be evaluated; determining a quality score of the cluster subsystem based on first abnormal information and/or first available information in the cluster subsystem; determining an availability score of the cluster subsystem based on second abnormal information and/or second available information in the cluster subsystem; and determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result. By scoring the cluster to be evaluated on both quality and availability and deriving a final performance evaluation result from the two scores, the method evaluates the cluster's performance accurately, makes it possible to raise timely alarms on performance anomalies, and supports effective operation and maintenance of the cluster with improved operation and maintenance efficiency.

Description

Cluster performance evaluation method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of internet, in particular to a method, a device, equipment and a storage medium for evaluating cluster performance.

Background

As the internet continues to expand, the data centers that carry computing tasks keep growing in scale. The performance of a cluster formed by multiple servers in a data center determines the data processing capacity of that data center, so a comprehensive evaluation of cluster performance is important for effective operation and maintenance of the data center.

At present, the performance of a cluster is generally scored according to the percentage of time during which the cluster is able to provide service. This percentage includes the available-time percentage of batches of nodes and the available-time percentage of single nodes, where each node corresponds to one server or host in the cluster. However, as clusters continue to grow, the factors through which the constituent nodes influence one another become more complex. If node available time is the only criterion, a node's performance score drops only after the node has failed, and the key nodes or paths of the cluster cannot be evaluated accurately. As a result, performance anomalies cannot be flagged in time, the cluster cannot be operated and maintained effectively, and operation and maintenance efficiency is low.

Disclosure of Invention

The embodiment of the invention provides a method, a device, equipment and a storage medium for evaluating cluster performance, which can accurately evaluate the performance of a cluster to be evaluated, make it possible to raise timely alarms when the performance of the cluster is abnormal, and enable efficient operation and maintenance of the cluster.

In a first aspect, an embodiment of the present invention provides a method for evaluating cluster performance, including:

determining each cluster subsystem in a cluster to be evaluated;

determining a quality score of the cluster subsystem based on first abnormal information and/or first available information in the cluster subsystem;

determining an availability score of the cluster subsystem based on second abnormal information and/or second available information in the cluster subsystem;

determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result.

Optionally, the determining each cluster subsystem in the cluster to be evaluated includes:

determining the nodes that match the service types of the cluster to be evaluated, forming node sets from them, and determining the cluster subsystem corresponding to each node set;

the cluster subsystem comprises a computing subsystem, a network subsystem, a storage subsystem, a management subsystem, a software service subsystem and/or a login subsystem.

With this technical scheme, the nodes matching each service type supported by the cluster to be evaluated are determined, the nodes matching the same service type are grouped into a node set, and the corresponding cluster subsystem is then determined from each node set, so that the cluster subsystems of the cluster to be evaluated are obtained accurately.

Optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, determining a quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem includes:

determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem, comprising:

wherein A_{compute/management} denotes the quality score of the computing or management subsystem, and B_{compute/management} denotes the availability score of the computing or management subsystem; a_i denotes the i-th abnormal node with a node health check alarm, and t_i denotes the abnormal time corresponding to the i-th abnormal node, i = 1, 2, ..., N, where N denotes the number of abnormal nodes; for the j-th unavailable node, that is, a node with a DOWN event alarm, t_j denotes the corresponding unavailable time, j = 1, 2, ..., R, where R denotes the number of unavailable nodes; M denotes the total number of nodes in the cluster subsystem, and T denotes the preset monitoring period.

By adopting the technical scheme, if the cluster subsystem is a computing subsystem or a management subsystem, the quality score and the availability score of the current cluster subsystem are respectively obtained based on the abnormal node information and the unavailable node information of the cluster subsystem, so that the quality score and the availability score of the computing subsystem and the management subsystem can be accurately obtained.
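As a hedged illustration of the availability score for a computing or management subsystem: the formula itself is not reproduced in this text, so the sketch below assumes the form B = 1 - (sum of t_j) / (M * T), that is, the share of node-time not lost to DOWN events. The function name and this assumed form are illustrative and are not taken from the source.

```python
def availability_score(unavailable_times, total_nodes, period):
    """Assumed form (not from the source): B = 1 - sum(t_j) / (M * T).

    unavailable_times: t_j for each node with a DOWN event alarm,
                       in the same time unit as `period`
    total_nodes:       M, total number of nodes in the subsystem
    period:            T, the preset monitoring period
    """
    if total_nodes <= 0 or period <= 0:
        raise ValueError("total_nodes and period must be positive")
    # Cap each t_j at T so a single node cannot lose more than one full period.
    lost = sum(min(t, period) for t in unavailable_times)
    return 1.0 - lost / (total_nodes * period)
```

Under this assumed form, one of four nodes being down for a whole 720-hour period yields a score of 0.75.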

Optionally, if the cluster subsystem is a network subsystem, determining a quality score of the cluster subsystem based on the first abnormal information and the first available information in the cluster subsystem includes:

A_network = A_down · A_opensm;

determining an availability score of the cluster subsystem based on the second abnormal information and the second available information in the cluster subsystem includes:

B_network = A_down · A_opensm;

wherein A_network denotes the quality score of the network subsystem, B_network denotes the availability score of the network subsystem, A_down denotes the network link score of the network subsystem, and A_opensm denotes the subnet management service score of the network subsystem; n_i denotes the i-th failed network link in a network down state, and t_i denotes the failure time corresponding to the i-th failed network link, i = 1, 2, ..., N, where N denotes the number of failed network links; K denotes the total number of network links in the cluster subsystem; c_j denotes the service score of the j-th subnet management service, and t_j denotes the service available time corresponding to the j-th subnet management service, j = 1, 2, ...

By adopting the technical scheme, if the cluster subsystem is the network subsystem, the corresponding network link score is obtained based on the fault network link information in the network subsystem, the corresponding subnet management service score is obtained based on the subnet management service information, the quality score and the availability score of the network subsystem are obtained based on the network link score and the subnet management service score, and the quality score and the availability score of the network subsystem can be accurately obtained.

Optionally, if the cluster subsystem is a storage subsystem, determining a quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem includes:

wherein A_storage denotes the quality score of the storage subsystem, and B_storage denotes the availability score of the storage subsystem; a_h denotes the h-th node whose alarm type is metadata state abnormal, and t_h denotes the abnormal time corresponding to that node, h = 1, 2, ..., H, where H denotes the number of nodes whose alarm type is metadata state abnormal; b_j denotes the j-th node whose alarm type is data service state abnormal, and t_j denotes the abnormal time corresponding to that node, j = 1, 2, ..., J, where J denotes the number of nodes whose alarm type is data service state abnormal; c_k denotes the k-th node whose alarm type is node state abnormal, and t_k denotes the abnormal time corresponding to that node, k = 1, 2, ..., K, where K denotes the number of nodes whose alarm type is node state abnormal; d_l denotes the l-th node whose alarm type is system data state abnormal, and t_l denotes the abnormal time corresponding to that node, l = 1, 2, ..., L, where L denotes the number of nodes whose alarm type is system data state abnormal; e_n denotes the n-th node whose alarm type is cluster state abnormal, and t_n denotes the abnormal time corresponding to that node, n = 1, 2, ...

By adopting the technical scheme, if the cluster subsystem is a storage subsystem, the quality score and the availability score of the cluster subsystem are respectively determined based on the abnormal node information of each alarm type in the storage subsystem, so that the accuracy of the obtained quality score and availability score of the storage subsystem can be improved.

Optionally, if the cluster subsystem is a software service subsystem, determining a quality score of the cluster subsystem based on the first available information in the cluster subsystem includes:

determining an availability score for the cluster subsystem based on second available information in the cluster subsystem, comprising:

wherein A_service denotes the quality score of the software service subsystem, and B_service denotes the availability score of the software service subsystem; x_r denotes the service score of the r-th preset service item, and t_r denotes the service available time corresponding to the r-th preset service item, r = 1, 2, ...

By adopting the technical scheme, if the cluster subsystem is the software service subsystem, the quality score and the availability score of the cluster subsystem are respectively determined based on the service available information of each preset service item in the software service subsystem, so that the quality score and the availability score of the software service subsystem can be accurately acquired.

Optionally, if the cluster subsystem is a login subsystem, determining a quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem includes:

wherein A_login denotes the quality score of the login subsystem, and B_login denotes the availability score of the login subsystem; a_i denotes the i-th abnormal node with an abnormal alarm, and t_i denotes the abnormal time corresponding to the i-th abnormal node, i = 1, 2, ..., N, where N denotes the number of abnormal nodes; M denotes the total number of nodes in the cluster subsystem, and T denotes the preset monitoring period.

By adopting the technical scheme, if the cluster subsystem is the login subsystem, the quality score and the availability score of the cluster subsystem are respectively determined based on the abnormal node information in the cluster subsystem, so that the accurate acquisition of the quality score and the availability score of the login subsystem can be realized, and the accuracy of acquiring the quality score and the availability score is improved.

In a second aspect, an embodiment of the present invention provides an apparatus for evaluating cluster performance, including:

the cluster subsystem determining module is used for determining each cluster subsystem in the cluster to be evaluated;

the quality score acquisition module is used for determining the quality score of the cluster subsystem based on the first abnormal information and/or the first available information in the cluster subsystem;

an availability score obtaining module, configured to determine an availability score of the cluster subsystem based on second anomaly information and/or second available information in the cluster subsystem;

and the performance evaluation result determining module is used for determining the performance evaluation result of the cluster based on the quality score and the availability score and carrying out operation and maintenance on the cluster based on the performance evaluation result.

In a third aspect, an embodiment of the present invention provides an electronic device, including:

one or more processors;

a storage device to store one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors implement the method for evaluating cluster performance according to any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for evaluating cluster performance according to any embodiment of the present invention.

According to the technical scheme provided by the embodiment of the invention, each cluster subsystem in the cluster to be evaluated is determined; the quality score of the cluster subsystem is determined based on the first abnormal information and/or the first available information in the cluster subsystem; the availability score of the cluster subsystem is determined based on the second abnormal information and/or the second available information in the cluster subsystem; finally, a performance evaluation result of the cluster is determined based on the quality score and the availability score, and the cluster is operated and maintained based on that result. Because the cluster to be evaluated is scored along two dimensions, quality and availability, before the final performance evaluation result is derived, its performance is evaluated accurately, timely alarms can be raised on performance anomalies, the cluster can be operated and maintained effectively, and operation and maintenance efficiency can be improved.

Drawings

Fig. 1 is a flowchart of a method for evaluating cluster performance according to an embodiment of the present invention;

fig. 2 is a flowchart of a method for evaluating cluster performance according to an embodiment of the present invention;

fig. 3 is a block diagram of a structure of an apparatus for evaluating cluster performance according to an embodiment of the present invention;

fig. 4 is a block diagram of an electronic device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Fig. 1 is a flowchart of a method for evaluating cluster performance according to an embodiment of the present invention. This embodiment is applicable to accurately evaluating cluster performance based on abnormal information and available information in the cluster. The method may be executed by the apparatus for evaluating cluster performance according to an embodiment of the present invention; the apparatus may be implemented in software and/or hardware and integrated into an electronic device, which may be a computer device or a server. As shown in Fig. 1, the method specifically includes the following steps:

S110, determining each cluster subsystem in the cluster to be evaluated.

A cluster connects a plurality of servers or hosts so that they jointly provide network services, which improves service processing capacity and meets ever-increasing service demand; each server or host in the cluster is a node of the cluster. A cluster subsystem is a set of nodes that carries a particular kind of service in the cluster. In the embodiment of the present invention, according to the service type undertaken, the cluster subsystems may include a computing subsystem, a network subsystem, a storage subsystem, a management subsystem, a software service subsystem, and/or a login subsystem.

Specifically, the computing subsystem runs the actual services and represents the actual computing capacity of the cluster; the network subsystem manages the key switches that form the network topology and the corresponding network links; the storage subsystem provides shared storage service externally; the management subsystem deploys and manages the service software of the whole cluster; the software service subsystem provides the various distributed services deployed across nodes and is generally highly available; the login subsystem handles cluster login operations for service clients. Dividing the cluster into different subsystems for different services avoids confusion among cluster services and allows them to be processed in an orderly way.

In this embodiment of the present invention, optionally, determining each cluster subsystem in the cluster to be evaluated may include: determining the nodes that match the service types of the cluster to be evaluated, forming node sets from them, and determining the cluster subsystem corresponding to each node set. It should be noted that a cluster node is responsible for only one service type at any given time. Therefore, when determining the cluster subsystems, the matching nodes can be looked up among all nodes according to the service types supported by the cluster to be evaluated. Nodes corresponding to the same service type are added to the same node set, which yields one node set per service type; each node set is then taken as one cluster subsystem, so that the cluster subsystems of the cluster to be evaluated are obtained.
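The grouping step described above can be sketched as follows. This is a minimal illustration only: the inventory format (pairs of node identifier and service type) and all names are hypothetical and do not come from the source.

```python
from collections import defaultdict


def group_into_subsystems(nodes, supported_types):
    """Group nodes by service type; each resulting node set is one subsystem.

    nodes:           iterable of (node_id, service_type) pairs
                     (hypothetical inventory format)
    supported_types: service types supported by the cluster under evaluation
    """
    subsystems = defaultdict(set)
    for node_id, service_type in nodes:
        # Keep only nodes whose service type the cluster actually supports.
        if service_type in supported_types:
            subsystems[service_type].add(node_id)
    return dict(subsystems)


inventory = [("n1", "compute"), ("n2", "compute"), ("s1", "storage"), ("x1", "other")]
subsystems = group_into_subsystems(inventory, {"compute", "storage", "login"})
```

Because each node carries exactly one service type at a time, a node lands in at most one node set; service types with no matching node simply produce no subsystem.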

S120, determining a quality score of the cluster subsystem based on the first abnormal information and/or the first available information in the cluster subsystem.

In the embodiment of the invention, abnormal information is information describing the abnormal state of a node, and available information is information describing whether the cluster is providing a software service. The abnormal information may include an abnormal node identifier, an abnormality type, an abnormality duration, and the like; the available information may include the identifier of an available software service, the identifiers of the nodes that undertake the software service, the available time of the software service, and the like. It should be noted that the abnormal information and the available information of a cluster subsystem reflect the health of that subsystem. Therefore, the abnormal information and the available information of each cluster subsystem over a certain period are obtained and then processed according to the preset calculation rule matched to that subsystem, which yields the quality score and the availability score of each cluster subsystem and allows the performance of the cluster subsystems to be evaluated.

In the embodiment of the invention, the quality score describes the degree of loss of nodes, links and systems. For example, if the current system includes N nodes and one node is abnormal, the system quality is (N-1)/N; if no node is abnormal, the system quality is 1. The availability score describes the degree to which nodes, links and systems can provide software services. For example, if the current system includes N nodes and none of the N nodes can provide software services, the system availability is 0; if at least one node can provide software services, the system availability is 1. Obtaining both the quality score and the availability score of each cluster subsystem and evaluating the subsystem's performance on the basis of both improves the accuracy of the performance evaluation.
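The two examples above can be written out as a short sketch. The function names are illustrative, and extending the quality example from one abnormal node to an arbitrary count is an assumption rather than something the text states.

```python
def system_quality(total_nodes, abnormal_nodes):
    """Quality as in the example: (N - 1) / N with one abnormal node.

    Generalising to any number of abnormal nodes is an assumption.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return (total_nodes - abnormal_nodes) / total_nodes


def system_availability(node_can_serve):
    """Availability as in the example: 1 if any node can provide service, else 0."""
    return 1 if any(node_can_serve) else 0
```

With four nodes of which one is abnormal but still serving, quality drops to 0.75 while availability stays at 1, which is why the two scores are tracked separately.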

In the embodiment of the invention, the health state of each node or link in the cluster can be checked in real time by internal security programs (such as IBLINK commands and Node Health Check (NHC) commands); when a node is detected to be abnormal, an alarm is generated and the corresponding abnormal information is reported. At the same time, the processes of the various services provided by the cluster are checked in real time to obtain the available information of those services. In this way both the abnormal information and the available information are obtained.

In the embodiment of the present invention, optionally, before each cluster subsystem in the cluster to be evaluated is determined, a matching calculation rule is preset for each type of cluster subsystem. Correspondingly, determining a quality score of the cluster subsystem based on the first abnormal information and/or the first available information in the cluster subsystem may include: determining the system type of the current cluster subsystem, and selecting the matching target calculation rule from the preset calculation rules according to that system type; and determining the quality score of the current cluster subsystem based on its first abnormal information, its first available information and the target calculation rule. Setting a matched calculation rule for each type of cluster subsystem keeps the rule consistent with the actual conditions of that subsystem and improves the accuracy of the resulting quality score.
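One way to preset rules per subsystem type and select the target rule at evaluation time is a simple registry, sketched below. The registry, the decorator, the dictionary keys and the placeholder rule are all hypothetical; the source only states that a matching rule is preset for each type.

```python
from typing import Callable, Dict

# Hypothetical rule registry: system type -> scoring function.
QUALITY_RULES: Dict[str, Callable[[dict], float]] = {}


def register_rule(system_type):
    """Decorator that presets `func` as the rule for `system_type`."""
    def decorator(func):
        QUALITY_RULES[system_type] = func
        return func
    return decorator


@register_rule("compute")
@register_rule("management")
def node_based_rule(info):
    # Placeholder rule (assumed form): fraction of nodes without anomalies.
    return 1.0 - len(info["abnormal_nodes"]) / info["total_nodes"]


def quality_score(system_type, info):
    """Select the target rule for the subsystem type and apply it."""
    try:
        rule = QUALITY_RULES[system_type]
    except KeyError:
        raise ValueError(f"no calculation rule preset for {system_type!r}") from None
    return rule(info)
```

Registering the same function under two types mirrors the text's treatment of computing and management subsystems, which share one rule.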

In this embodiment of the present invention, optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, determining the quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem may include: determining the quality score of the current cluster subsystem according to the number of abnormal nodes, the abnormal time corresponding to each abnormal node, the total number of nodes in the cluster subsystem, and a preset monitoring period. Optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, the quality score of the cluster subsystem is determined based on the following formula:

wherein A_{compute/management} denotes the quality score of the computing or management subsystem; a_i denotes the i-th abnormal node with a node health check alarm; t_i denotes the abnormal time corresponding to the i-th abnormal node, which is the duration of the alarm, i = 1, 2, ..., N, where N denotes the number of abnormal nodes; M denotes the total number of nodes in the cluster subsystem, which is fixed when the system is built; T denotes the preset monitoring period, for example one month, and can be set flexibly according to task requirements.

It should be noted that, if the cluster subsystem is a computing subsystem, whether a node in the current cluster subsystem is abnormal may be determined according to the following two rules. Rule one: if the management network or the computing network of a computing node in the computing subsystem is disconnected, the management network or the computing network generates a corresponding abnormal alarm, and the nodes named in the alarm are determined to be abnormal nodes; nodes in the computing subsystem that have been taken off the shelf for maintenance or returned to the factory because of an accident are also regarded as abnormal nodes. Rule two: if a Node Health Check (NHC) alarm exists for a node, that node is determined to be an abnormal node. If the cluster subsystem is a management subsystem, abnormal nodes are determined from the abnormal alarm information generated by the management network or the computing network of the management nodes in the management subsystem. In this way the abnormal nodes in the computing subsystem and the management subsystem can be detected accurately.
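For a computing subsystem, the two rules amount to taking the union of several node lists. The sketch below assumes the alarm sources have already been reduced to collections of node identifiers; the function and parameter names are hypothetical.

```python
def find_abnormal_compute_nodes(network_alarm_nodes, off_shelf_nodes, nhc_alarm_nodes):
    """Return the set of abnormal nodes in a computing subsystem.

    network_alarm_nodes: nodes named in management/computing network alarms (rule one)
    off_shelf_nodes:     nodes removed for maintenance or factory return (rule one)
    nhc_alarm_nodes:     nodes with a Node Health Check alarm (rule two)
    """
    # A node flagged by more than one rule is counted once.
    return set(network_alarm_nodes) | set(off_shelf_nodes) | set(nhc_alarm_nodes)
```

Using a set means a node that trips both rules contributes a single entry to the abnormal-node count N.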

In the embodiment of the invention, if the cluster subsystem is a computing subsystem or a management subsystem, a corresponding abnormal node detection method is adopted to obtain the number of abnormal nodes and abnormal time corresponding to the abnormal nodes; and determining the quality score corresponding to the current cluster subsystem according to the number of the abnormal nodes and the abnormal time corresponding to the abnormal nodes, so that the quality scores corresponding to the computing subsystem and the management subsystem can be accurately obtained.
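A hedged sketch of the quality score for a computing or management subsystem follows. The formula is not reproduced in this text, so the code assumes the form A = 1 - (sum of t_i) / (M * T), which agrees with the earlier (N-1)/N example when one node is abnormal for the whole period. The alarm record layout (node, start, end) is also an assumption; each t_i is taken as the part of the alarm interval that falls inside the monitoring window.

```python
def quality_score_from_alarms(alarms, total_nodes, window_start, window_end):
    """Assumed form (not from the source): A = 1 - sum(t_i) / (M * T).

    alarms:       list of (node_id, alarm_start, alarm_end), times in hours
    total_nodes:  M, total number of nodes in the subsystem
    window_start, window_end: bounds of the monitoring period, so T = end - start
    """
    period = window_end - window_start
    if total_nodes <= 0 or period <= 0:
        raise ValueError("total_nodes and monitoring period must be positive")
    abnormal_time = 0.0
    for _node, start, end in alarms:
        # Count only the part of the alarm that overlaps the monitoring window.
        overlap = min(end, window_end) - max(start, window_start)
        abnormal_time += max(overlap, 0.0)
    return 1.0 - abnormal_time / (total_nodes * period)
```

Note that this sketch sums alarm durations directly, so overlapping alarms on the same node would be counted twice; a real implementation would need to merge them first.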

In the embodiment of the present invention, optionally, if the cluster subsystem is a network subsystem, determining a quality score of the cluster subsystem based on the first abnormal information and the first available information in the cluster subsystem may include: obtaining the network link score of the network subsystem according to the number of failed network links, the failure time corresponding to each failed network link, the total number of network links in the cluster subsystem, and a preset monitoring period; obtaining the subnet management service score of the network subsystem according to the service score of each subnet management service, the service available time corresponding to each subnet management service, the total number of subnet management services provided by the cluster subsystem, and the preset monitoring period; and obtaining the quality score of the network subsystem according to the network link score and the subnet management service score. Optionally, if the cluster subsystem is a network subsystem, the quality score of the cluster subsystem is determined based on the following formula:

A_network = A_down · A_opensm;

wherein A_network denotes the quality score of the network subsystem, A_down denotes the network link score of the network subsystem, and A_opensm denotes the subnet management service score of the network subsystem; n_i denotes the i-th failed network link in a network down state, and t_i denotes the failure time corresponding to the i-th failed network link, i = 1, 2, ..., N, where N denotes the number of failed network links; K denotes the total number of network links in the cluster subsystem; c_j denotes the service score of the j-th subnet management service, and t_j denotes the service available time corresponding to the j-th subnet management service, j = 1, 2, ...

It should be noted that, if the cluster subsystem is a network subsystem, its quality is affected both by the condition of its network links and by the condition of the subnet management services it provides, so the quality score of the network subsystem consists of two parts: a network link score and a subnet management service score. When determining the network link score, whether a network link has failed is judged according to the following two rules. Rule one: network links of the IB (InfiniBand) network are checked with the IBLINK command, and if a problem is detected on a network link, that link is taken as a failed network link. Rule two: the failure statistics of the networks (for example, the management network and the Ethernet network) are checked, and any network link included in the failure statistics is regarded as a failed network link. In the embodiment of the invention, network link detection and network failure statistics are carried out at preset intervals (for example, every hour), so the recorded failure start time of a detected failed network link carries an error of up to the preset interval; however, because the recorded failure start time and the recorded failure end time carry the same error, the measured failure duration is not affected.

In the embodiment of the invention, the service score of a subnet management service (for example, OPENSM) takes into account that the subnet management service has a high-availability mechanism, that is, a plurality of nodes provide the same subnet management service, and the service quality degrades when some of those nodes fail. Therefore, the service score of each subnet management service needs to be determined according to the number of available nodes for that service and the total number of nodes providing it. For example, where a plurality of nodes provide the same subnet management service: if all of them are in an available state, the service score of the current subnet management service is 1; if one of them is in an unavailable state, the service score is 0.5 (as is the case when two nodes provide the service); if all of them are in an unavailable state, the service score becomes 0. Because the service score of a subnet management service reflects its service quality, obtaining the service score of each subnet management service accurately allows the corresponding quality score of the network subsystem to be obtained accurately.
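The service score example above is consistent with a simple ratio of available nodes to total nodes. The sketch below assumes that proportional form; the text only says the score is determined from the available nodes and the total number of nodes, so the exact mapping is an assumption.

```python
def subnet_service_score(available_nodes, total_nodes):
    """Assumed proportional form: available nodes / nodes providing the service."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= available_nodes <= total_nodes:
        raise ValueError("available_nodes must be between 0 and total_nodes")
    return available_nodes / total_nodes
```

For a service backed by two nodes this gives 1, 0.5 and 0 for two, one and zero available nodes, matching the values in the example.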

In the embodiment of the invention, if the cluster subsystem is a network subsystem, failed network links are determined by the preset network link detection rules, and the network link score is determined according to the number of failed network links and their corresponding failure times; at the same time, the subnet management service score is determined according to the available state and the available time of each subnet management service. Finally, the quality score of the network subsystem is determined from the network link score and the subnet management service score, so that the quality score reflects both network link performance and subnet management service performance, which improves the accuracy of the quality score obtained for the network subsystem.
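The composition A_network = A_down · A_opensm is given in the text, but the formulas for the two factors are not reproduced here. The sketch below therefore assumes A_down = 1 - (sum of t_i) / (K * T) and A_opensm = (sum of c_j * t_j) / (J * T), where J is the number of subnet management services; both forms and all function names are assumptions for illustration.

```python
def link_score(failure_times, total_links, period):
    """Assumed form: A_down = 1 - sum(t_i) / (K * T)."""
    if total_links <= 0 or period <= 0:
        raise ValueError("total_links and period must be positive")
    return 1.0 - sum(failure_times) / (total_links * period)


def opensm_score(services, period):
    """Assumed form: A_opensm = sum(c_j * t_j) / (J * T).

    services: list of (c_j, t_j) pairs, the service score and the
              service available time of each subnet management service
    """
    if not services or period <= 0:
        raise ValueError("need at least one service and a positive period")
    return sum(c * t for c, t in services) / (len(services) * period)


def network_quality(failure_times, total_links, services, period):
    """A_network = A_down * A_opensm (this product is stated in the source)."""
    return link_score(failure_times, total_links, period) * opensm_score(services, period)
```

Because the two factors are multiplied, a fully healthy link layer cannot mask a degraded subnet management service, and vice versa.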

In this embodiment of the present invention, optionally, if the cluster subsystem is a storage subsystem, determining the quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem may include: determining the quality score of the storage subsystem according to the number of nodes corresponding to each preset alarm type, the alarm duration of each node, the total number of nodes included in the storage subsystem and a preset monitoring period; optionally, if the cluster subsystem is a storage subsystem, determining the quality score of the storage subsystem based on the following formula:

wherein A_storage represents the quality score of the storage subsystem; a_h represents the h-th node whose alarm type is metadata state abnormality, t_h represents the abnormal time corresponding to that node, h = 1, 2, ..., H, and H represents the number of nodes whose alarm type is metadata state abnormality; b_j represents the j-th node whose alarm type is data service state abnormality, t_j represents the abnormal time corresponding to that node, j = 1, 2, ..., J, and J represents the number of nodes whose alarm type is data service state abnormality; c_k represents the k-th node whose alarm type is node state abnormality, t_k represents the abnormal time corresponding to that node, k = 1, 2, ..., K, and K represents the number of nodes whose alarm type is node state abnormality; d_l represents the l-th node whose alarm type is system data state abnormality, t_l represents the abnormal time corresponding to that node, l = 1, 2, ..., L, and L represents the number of nodes whose alarm type is system data state abnormality; e_n represents the n-th node whose alarm type is cluster state abnormality, t_n represents the abnormal time corresponding to that node, n = 1, 2, ..., N, and N represents the number of nodes whose alarm type is cluster state abnormality.

In the embodiment of the invention, the alarm type and the corresponding alarm rule are preset in the storage subsystem; the states of all nodes and all types of data in the storage subsystem are monitored in real time, and when the node states or the data states are determined to accord with alarm rules, corresponding types of alarms are sent out, so that the abnormity of the storage subsystem can be accurately judged; the alarm types can include metadata state abnormity, data service state abnormity, node state abnormity, system data state abnormity and cluster state abnormity; the alarm information may include an alarm type, a corresponding node identifier, and an alarm duration; it should be noted that the alarm type and the alarm rule may be adaptively modified according to the task requirement.

Further, acquiring alarm information in a preset monitoring period, and acquiring node identifiers matched with the alarm types to form a node set corresponding to each alarm type; counting the number of nodes in a node set corresponding to each alarm type, and determining the quality score of the current storage subsystem according to the formula; by comprehensively considering various abnormal types of alarms, accurate acquisition of the corresponding quality scores of the storage subsystems can be realized.
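The grouping step described above can be sketched as follows. This is a minimal illustration only: the record layout (the keys "type", "node" and "duration") and the sample node names are assumptions introduced here, and the input list is assumed to contain only alarm information from the preset monitoring period.

```python
from collections import defaultdict


def build_node_sets(alarms):
    """Group node identifiers by alarm type.

    Each alarm is a dict with the hypothetical keys "type", "node" and
    "duration"; a node that raised several alarms of one type is kept once.
    """
    node_sets = defaultdict(set)
    for alarm in alarms:
        node_sets[alarm["type"]].add(alarm["node"])
    return dict(node_sets)


def count_nodes(node_sets):
    """Number of distinct nodes in the node set of each alarm type."""
    return {alarm_type: len(nodes) for alarm_type, nodes in node_sets.items()}


# Hypothetical alarm records for one monitoring period.
alarms = [
    {"type": "metadata", "node": "mds01", "duration": 120},
    {"type": "metadata", "node": "mds01", "duration": 30},
    {"type": "node", "node": "oss03", "duration": 600},
]
node_sets = build_node_sets(alarms)
counts = count_nodes(node_sets)
```

The per-type counts and the recorded durations are then the inputs to the storage subsystem quality score.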

In the embodiment of the invention, if the cluster subsystem is a storage subsystem, the quality score of the current storage subsystem is determined according to the number of abnormal nodes corresponding to each alarm type and the corresponding abnormal time, so that the quality score can reflect the abnormal information of each type, and the accuracy of the obtained quality score of the storage subsystem can be improved.

In this embodiment of the present invention, optionally, if the cluster subsystem is a software service subsystem, determining the quality score of the cluster subsystem based on the first available information in the cluster subsystem may include: determining the service score of each preset service item in the cluster subsystem and the service available time corresponding to each preset service item, and determining the quality score of the cluster subsystem according to the service score of each preset service item, the service available time corresponding to each preset service item, the total number of preset service items provided by the cluster subsystem and a preset monitoring period; optionally, if the cluster subsystem is a software service subsystem, determining the quality score of the cluster subsystem based on the following formula:

wherein A_service represents the quality score of the software service subsystem, x_r represents the service score of the r-th preset service item, t_r represents the service available time corresponding to the r-th preset service item, and r = 1, 2, ...

It should be noted that, when the quality score of the software service subsystem is calculated, if a process related to a preset service item is determined to exist, the preset service item is marked as available and the corresponding available time is recorded. For a preset service item configured for high availability, the service score is 1 if all of the corresponding nodes are in an available state, 0.5 if one node is in an unavailable state, and 0 if all of the corresponding nodes are in an unavailable state. For a preset service item configured for load balancing, the service score is the ratio of the number of available nodes to the total number of nodes; for example, if preset service item A is configured with N load-balancing nodes and one of them is in an unavailable state, the corresponding service score is (N-1)/N. In this way, the service score of each preset service item can be obtained accurately.
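The two scoring rules above can be illustrated with a short sketch. The function names are hypothetical, and one point is an assumption: the text gives 0.5 only for the case of one unavailable node, and the sketch applies 0.5 to every partial failure of a high-availability service.

```python
def ha_service_score(available_nodes, total_nodes):
    """Quality-dimension score of a service configured for high availability."""
    if total_nodes <= 0 or not 0 <= available_nodes <= total_nodes:
        raise ValueError("node counts are inconsistent")
    if available_nodes == total_nodes:
        return 1.0  # every node is available
    if available_nodes == 0:
        return 0.0  # every node is unavailable
    return 0.5      # assumption: any partial failure scores 0.5


def lb_service_score(available_nodes, total_nodes):
    """Quality-dimension score of a service configured for load balancing."""
    if total_nodes <= 0 or not 0 <= available_nodes <= total_nodes:
        raise ValueError("node counts are inconsistent")
    return available_nodes / total_nodes  # e.g. (N-1)/N with one node down
```

For example, a load-balanced service with four nodes and one node down scores 3/4, while a high-availability service in the same situation scores 0.5 under the stated assumption.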

In this embodiment of the present invention, the preset service items may include SLURM services (e.g., slurmctld and slurmdbd) and a Lightweight Directory Access Protocol (LDAP) management service; when the cluster subsystem is a software service subsystem, the quality score of the software service subsystem is obtained by determining the service score and the available time corresponding to each preset service item, which improves the accuracy of the obtained quality score.

In this embodiment of the present invention, optionally, if the cluster subsystem is a login subsystem, determining the quality score of the cluster subsystem based on the first abnormal information in the cluster subsystem may include: determining the number of abnormal nodes with abnormal alarms and abnormal time corresponding to each abnormal node, and determining the quality score of the login subsystem according to the number of the abnormal nodes, the abnormal time corresponding to each abnormal node, the total number of nodes included in the cluster subsystem and a preset monitoring period; optionally, if the cluster subsystem is a login subsystem, determining the quality score of the login subsystem based on the following formula:

wherein A_login represents the quality score of the login subsystem, a_i represents the i-th abnormal node with an abnormal alarm, t_i represents the abnormal time corresponding to the i-th abnormal node, i = 1, 2, ..., N, N represents the number of abnormal nodes, M represents the total number of nodes included in the cluster subsystem, and T represents the preset monitoring period.

It should be noted that, if the cluster subsystem is a login subsystem, when abnormal nodes are determined, the nodes responsible for the login service are counted, and the connectivity of each login node to the management network and to the computing network is detected in real time; if the management network or the computing network of a login node is determined to be unreachable, the login node is considered abnormal and an abnormal alarm is issued. In the embodiment of the invention, the quality score of the current login subsystem is determined from the abnormal nodes with abnormal alarms and the abnormal time corresponding to each abnormal node, so that the quality score of the login subsystem can be obtained accurately.
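As an illustration only, the following sketch computes a login quality score from the quantities named above (the abnormal time of each abnormal node, the total number of nodes M and the monitoring period T). The formula itself does not appear in this text, so the form used here, one minus the abnormal node-time divided by M·T, is an assumption and may differ from the formula of the embodiment; the function name is hypothetical.

```python
def login_quality_score(abnormal_times, total_nodes, period):
    """Assumed form: 1 - sum(t_i) / (M * T).

    abnormal_times: abnormal time of each abnormal node, in the same unit
    as `period`; total_nodes is M and period is T.
    """
    if total_nodes <= 0 or period <= 0:
        raise ValueError("total_nodes and period must be positive")
    # Cap each abnormal time at the period so the score stays within [0, 1]
    # as long as there are no more abnormal nodes than nodes in total.
    lost = sum(min(t, period) for t in abnormal_times)
    return 1.0 - lost / (total_nodes * period)
```

Under this assumed form, a subsystem with no abnormal nodes scores 1, and one whose nodes are all abnormal for the whole period scores 0.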

S130, determining an availability score of the cluster subsystem based on the second abnormal information and/or the second available information in the cluster subsystem.

In the embodiment of the invention, after the quality scores of all the cluster subsystems are obtained, second abnormal information and second available information are obtained in the cluster subsystems; and calculating to obtain the availability score of each cluster subsystem based on the second abnormal information, the second available information and the matched preset calculation rule corresponding to each cluster subsystem, so that the evaluation on the availability dimension of the cluster subsystems can be realized, and the accuracy of the performance evaluation of the cluster subsystems can be improved.

In this embodiment of the present invention, optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, determining the availability score of the cluster subsystem based on the second abnormal information in the cluster subsystem may include: determining the availability score of the current cluster subsystem according to the number of unavailable nodes with DOWN event alarms, the unavailable time corresponding to each unavailable node, the total number of nodes included in the cluster subsystem, and a preset monitoring period; optionally, if the cluster subsystem is a computing subsystem and/or a management subsystem, the availability score of the cluster subsystem is determined based on the following formula:

wherein B_{compute/management} represents the availability score of the computing subsystem or the management subsystem, b_j represents the j-th unavailable node with a DOWN event alarm, t_j represents the unavailable time corresponding to the j-th unavailable node, j = 1, 2, ..., R, R represents the number of unavailable nodes, M represents the total number of nodes included in the cluster subsystem, and T represents the preset monitoring period.

It should be noted that, if the cluster subsystem is a computing subsystem or a management subsystem, the state of each node in the cluster subsystem is detected in real time, when a DOWN event alarm is detected in a node, the current node is indicated to be unavailable, and the unavailable time corresponding to the unavailable node is obtained through the alarm, so that the unavailable node in the computing subsystem and the unavailable node in the management subsystem can be accurately detected, and further, the availability score corresponding to the cluster subsystem can be accurately obtained.

In the embodiment of the invention, the DOWN event alarm is detected to determine the unavailable node in the computing subsystem and the management subsystem, and then the availability score of the computing subsystem and the management subsystem is obtained according to the detected unavailable node information, so that the availability score of the computing subsystem and the management subsystem can be accurately obtained.

In the embodiment of the present invention, optionally, if the cluster subsystem is a network subsystem, determining the availability score of the cluster subsystem based on the second abnormal information and the second available information in the cluster subsystem may include: obtaining the network link score of the network subsystem according to the number of failed network links, the failure time corresponding to each failed network link, the total number of network links included in the cluster subsystem, and a preset monitoring period; obtaining the subnet management service score of the network subsystem according to the service score of each subnet management service, the service available time corresponding to each subnet management service, the total number of subnet management services provided by the cluster subsystem, and the preset monitoring period; and obtaining the availability score of the network subsystem according to the network link score and the subnet management service score; optionally, if the cluster subsystem is a network subsystem, the availability score of the cluster subsystem is determined based on the following formula:

B_network = A_down · A_opensm;


wherein B_network represents the availability score of the network subsystem, A_down represents the network link score of the network subsystem, A_opensm represents the subnet management service score of the network subsystem, n_i represents the i-th failed network link in a network down state, t_i represents the failure time corresponding to the i-th failed network link, i = 1, 2, ..., N, N represents the number of failed network links, K represents the total number of network links included in the cluster subsystem, c_j represents the service score of the j-th subnet management service, t_j represents the service available time corresponding to the j-th subnet management service, and j = 1, 2, ...

It should be noted that, if the cluster subsystem is a network subsystem, the network link score and the subnet management service score of the network subsystem are obtained separately, and the availability score of the network subsystem is then obtained from them. When determining the network link score, whether a network link is faulty is judged by the same rule as in the determination of the network subsystem quality score described above, which is not repeated here. When obtaining the service score of a subnet management service (for example, OPENSM), the subnet management service is available as long as one of its nodes is available, so it is not necessary to distinguish whether the service is configured for high availability: the service score of the corresponding OPENSM is 1 as long as one node is available. For example, when a plurality of nodes provide the same subnet management service, the service score of the current subnet management service is 1 if at least one node is in an available state, and 0 if all of the nodes are in an unavailable state.

In the embodiment of the invention, if the cluster subsystem is a network subsystem, the failed network link information is determined first, and the service score of each subnet management service is then determined by detecting the available information of that service without distinguishing high-availability configurations, so that the service score reflects the availability state of the subnet management service and the accuracy of the obtained subnet management service score is improved; meanwhile, the availability score of the network subsystem is obtained based on the network link score and the subnet management service score, which improves the accuracy of the obtained availability score.
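The availability-dimension rule for a subnet management service and the product B_network = A_down · A_opensm can be sketched as follows. The function names are hypothetical, and A_down and A_opensm are taken as already computed inputs, since their formulas do not appear in this text.

```python
def opensm_availability_score(node_available):
    """Service score of one subnet management service in the availability dimension.

    node_available: iterable of booleans, one per node providing the service.
    The score is 1 as soon as one node is available, otherwise 0.
    """
    return 1.0 if any(node_available) else 0.0


def network_availability_score(a_down, a_opensm):
    """B_network = A_down * A_opensm, with both factors computed elsewhere."""
    return a_down * a_opensm
```

Compared with the quality dimension, a partially failed high-availability service still scores 1 here, because the service as a whole remains reachable.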

In this embodiment of the present invention, optionally, if the cluster subsystem is a storage subsystem, determining the availability score of the cluster subsystem based on the second abnormal information in the cluster subsystem may include: determining the availability score of the storage subsystem according to the number of nodes corresponding to each preset alarm type, the duration of each node alarm, the total number of nodes included in the storage subsystem and a preset monitoring period; optionally, if the cluster subsystem is a storage subsystem, determining an availability score of the storage subsystem based on the following formula:

wherein B_storage represents the availability score of the storage subsystem; a_h represents the h-th node whose alarm type is metadata state abnormality, t_h represents the abnormal time corresponding to that node, h = 1, 2, ..., H, and H represents the number of nodes whose alarm type is metadata state abnormality; b_j represents the j-th node whose alarm type is data service state abnormality, t_j represents the abnormal time corresponding to that node, j = 1, 2, ..., J, and J represents the number of nodes whose alarm type is data service state abnormality; c_k represents the k-th node whose alarm type is node state abnormality, t_k represents the abnormal time corresponding to that node, k = 1, 2, ..., K, and K represents the number of nodes whose alarm type is node state abnormality; d_l represents the l-th node whose alarm type is system data state abnormality, t_l represents the abnormal time corresponding to that node, l = 1, 2, ..., L, and L represents the number of nodes whose alarm type is system data state abnormality; e_n represents the n-th node whose alarm type is cluster state abnormality, t_n represents the abnormal time corresponding to that node, n = 1, 2, ..., N, and N represents the number of nodes whose alarm type is cluster state abnormality.

In the embodiment of the invention, if the cluster subsystem is a storage subsystem, the quality of each node also reflects the availability of each node, so that the availability score corresponding to the storage subsystem can be calculated by adopting the same calculation rule as the quality score of the storage subsystem, and the accurate acquisition of the availability score corresponding to the storage subsystem can be realized.

In this embodiment of the present invention, optionally, if the cluster subsystem is a software service subsystem, determining the availability score of the cluster subsystem based on the second available information in the cluster subsystem may include: determining service scores of all preset service items in the cluster subsystem and service available time corresponding to all the preset service items, and determining the availability scores of the cluster subsystem according to the service scores of all the preset service items, the service available time corresponding to all the preset service items, the total number of the preset service items provided by the cluster subsystem and a preset monitoring period; optionally, if the cluster subsystem is a software service subsystem, determining an availability score of the cluster subsystem based on the following formula:

wherein B_service represents the availability score of the software service subsystem, x_r represents the service score of the r-th preset service item, t_r represents the service available time corresponding to the r-th preset service item, and r = 1, 2, ...

It should be noted that, when calculating the availability score of the software service subsystem, the service score of a preset service item can be determined through the following two rules. Rule one: judge whether the processes related to the SLURM and LDAP services exist; if they exist, the preset service item is identified as available and its service score is 1; if they do not exist, the corresponding preset service item is identified as unavailable and its service score is 0. Under this rule, the service quality degradation caused by abnormal nodes in a high-availability or load-balancing configuration is not considered. Rule two: perform a service response test in both a curl mode and a cmd mode; if either mode gets no service response, the preset service item is unavailable and its service score is 0; if both modes get a service response, the preset service item is available and its service score is 1.

In the embodiment of the invention, the service available state of the preset service item is judged through the preset rule, and the service score of each preset service item is determined according to the service available state of each preset service item, so that the service score of each preset service item can reflect the service availability, the accuracy of the obtained service score can be improved, and the accuracy of the obtained availability score of the software service subsystem can be improved.
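Rule one and rule two can be written down as two small functions. This is a sketch under stated assumptions: the function names are hypothetical, the probe results are passed in as plain values rather than collected by the code, rule one is read as requiring every related process to exist, and the process name "slapd" in the usage comment is only an illustrative LDAP daemon name that the text does not mention.

```python
def rule_one_score(required_processes, running_processes):
    """Score 1 if every process related to the service item exists, else 0."""
    return 1 if set(required_processes) <= set(running_processes) else 0


def rule_two_score(curl_responded, cmd_responded):
    """Score 1 only if both the curl test and the cmd test got a response."""
    return 1 if (curl_responded and cmd_responded) else 0


# Illustrative usage with hypothetical probe results:
# rule_one_score({"slurmctld", "slurmdbd"}, {"slurmctld", "slurmdbd", "slapd"})
# rule_two_score(curl_responded=True, cmd_responded=False)
```

Neither rule looks at how many nodes back a service, which matches the statement that quality degradation from high-availability or load-balancing node failures is not considered in the availability dimension.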

In this embodiment of the present invention, optionally, if the cluster subsystem is a login subsystem, determining the availability score of the cluster subsystem based on the second abnormal information in the cluster subsystem may include: determining the number of abnormal nodes with abnormal alarms, and determining the availability score of the login subsystem according to the number of abnormal nodes and the total number of nodes included in the cluster subsystem; optionally, if the cluster subsystem is a login subsystem, the availability score of the login subsystem is determined based on the following formula:

wherein B_login represents the availability score of the login subsystem, N represents the number of abnormal nodes, and M represents the total number of nodes included in the cluster subsystem.

It should be noted that, if the cluster subsystem is a login subsystem, when determining an abnormal node, first determining whether processes related to the SLURM service and the LDAP service exist, if both processes exist, further determining whether a user can normally log in and whether a test job can be normally submitted, and if it is determined that the user can normally log in and the test job can be normally submitted, determining that the current node is normal; otherwise, determining the node as an abnormal node. In the embodiment of the invention, as long as the number of the normal nodes is more than or equal to 1, the login subsystem is available, and the availability score of the login subsystem is 1; otherwise, the login subsystem is not available, and the availability score of the login subsystem is 0.

In the embodiment of the invention, if the cluster subsystem is a login subsystem, the number of the available nodes of the login subsystem is determined according to the number of the abnormal nodes and the total number of the nodes included in the login subsystem, and then the availability score of the login subsystem is determined according to the number of the available nodes, so that the accurate acquisition of the corresponding availability score of the login subsystem can be realized.
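The node check and the resulting availability score of the login subsystem can be sketched as follows. The class and field names are hypothetical, and the four check results are assumed to have been collected by separate probes.

```python
from dataclasses import dataclass


@dataclass
class LoginNodeCheck:
    """Probe results for one login node (field names are illustrative)."""
    slurm_process: bool   # a SLURM-related process exists
    ldap_process: bool    # an LDAP-related process exists
    user_login_ok: bool   # a user can log in normally
    test_job_ok: bool     # a test job can be submitted normally


def node_is_normal(check):
    """A node is normal only if both processes exist and both tests pass."""
    return (check.slurm_process and check.ldap_process
            and check.user_login_ok and check.test_job_ok)


def login_availability_score(checks):
    """1 if at least one login node is normal, otherwise 0."""
    return 1 if any(node_is_normal(c) for c in checks) else 0
```

With M nodes in total and N abnormal nodes, this is equivalent to returning 1 when M - N is greater than or equal to 1 and 0 otherwise.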

S140, determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result.

In the embodiment of the invention, after the quality score and the availability score of the cluster subsystem are obtained, the quality score and the availability score of the cluster can be determined according to the quality score and the availability score of the cluster subsystem, and the quality score and the availability score of the cluster are used as the performance evaluation result of the current cluster; or determining the comprehensive performance score of the cluster according to the quality score and the availability score of the cluster subsystem, and taking the comprehensive performance score of the cluster as the performance evaluation result of the current cluster, so that the accurate acquisition of the cluster performance evaluation result can be realized.

Correspondingly, after the performance evaluation result of the cluster is obtained, whether the current performance evaluation result is abnormal is judged, for example, whether the current performance evaluation result is less than or equal to a preset performance evaluation threshold value; if the performance evaluation result is determined to be abnormal, the cluster subsystem with the abnormal performance is determined according to the quality score and the availability score of each cluster subsystem, and the abnormal node or service is determined according to the abnormal information and the available information of each cluster subsystem, so that the abnormal node and service are maintained in a targeted manner, for example, off-shelf maintenance or restart is performed on the abnormal node, the abnormal service is reconfigured, and the cluster operation and maintenance efficiency can be improved.

In the embodiment of the invention, statistics and analysis can be performed on cluster performance evaluation results in a past period of time (for example, one year) to predict the performance state of a future cluster, so that a preprocessing strategy is determined, and when corresponding performance abnormality occurs in the future, a matched preprocessing strategy can be directly adopted, so that the cluster operation and maintenance efficiency can be further improved.

According to the technical scheme provided by the embodiment of the invention, each cluster subsystem in the cluster to be evaluated is determined, the quality score of the cluster subsystem is determined based on the first abnormal information and/or the first available information in the cluster subsystem, and the availability score of the cluster subsystem is determined based on the second abnormal information and/or the second available information in the cluster subsystem; finally, a performance evaluation result of the cluster is determined based on the quality score and the availability score, the cluster is operated and maintained based on the performance evaluation result, the accurate evaluation of the performance of the cluster to be evaluated is realized by obtaining the scores of the quality dimension and the availability dimension of the cluster to be evaluated and obtaining the final performance evaluation result, the timely warning of the performance abnormity of the cluster to be evaluated can be conveniently realized, the effective operation and maintenance of the cluster to be evaluated are facilitated, and the operation and maintenance efficiency can be improved.

Fig. 2 is a flowchart of an evaluation method for cluster performance according to an embodiment of the present invention, which is embodied on the basis of the foregoing embodiment, and optionally, determining a performance evaluation result of the cluster based on the quality score and the availability score includes: determining the quality score and the availability score of the cluster to be evaluated based on the quality score and the availability score of each cluster subsystem, and determining the performance evaluation result of the cluster based on the quality score and the availability score of the cluster; as shown in fig. 2, the method specifically includes:

S210, determining nodes that conform to the service types of the cluster to be evaluated, forming a node set, and determining the cluster subsystem corresponding to the node set.

Reference may be made to the description of the above embodiment for the description of S210.

S220, determining a quality score of the cluster subsystem based on the first abnormal information and/or the first available information in the cluster subsystem.

S230, determining the availability score of the cluster subsystem based on the second abnormal information and/or the second available information in the cluster subsystem.

S240, determining the quality score and the availability score of the cluster based on the quality score of the cluster subsystem and the availability score of the cluster subsystem, determining the performance evaluation result of the cluster based on the quality score and the availability score of the cluster, and operating and maintaining the cluster based on the performance evaluation result.

In the embodiment of the invention, the quality score of the cluster to be evaluated is determined according to the quality score of each cluster subsystem; determining the availability score of the cluster to be evaluated according to the availability score of each cluster subsystem; optionally, the quality score of the cluster is determined based on the following formula:

A = A_compute · A_network · A_storage · A_management · A_service · A_login;

where A represents the quality score of the cluster.

Optionally, the availability score of the cluster is determined based on the following formula:

B = B_compute · B_network · B_storage · B_management · B_service · B_login;

where B represents the availability score of the cluster.
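Both cluster-level scores are products over the same six subsystems, so one helper covers them. The sketch below assumes the subsystem scores are supplied in a dictionary keyed by the names used in the subscripts above; the function name and dictionary layout are illustrative.

```python
import math

SUBSYSTEMS = ("compute", "network", "storage", "management", "service", "login")


def cluster_score(subsystem_scores):
    """Product of the six subsystem scores (quality or availability)."""
    missing = [name for name in SUBSYSTEMS if name not in subsystem_scores]
    if missing:
        raise KeyError(f"missing subsystem scores: {missing}")
    return math.prod(subsystem_scores[name] for name in SUBSYSTEMS)
```

Because the scores are multiplied, a single subsystem with a score of 0 drives the cluster score to 0, and every degraded subsystem lowers the cluster score.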

In an implementation manner of the embodiment of the present invention, optionally, after the quality scores and the availability scores of the cluster subsystems are obtained, the quality scores and the availability scores of the clusters to be evaluated may be determined according to the quality scores and the availability scores of the cluster subsystems, and preset quality evaluation thresholds and preset availability evaluation thresholds corresponding to the clusters to be evaluated are set; if the quality score of the cluster to be evaluated is detected to be larger than a preset quality evaluation threshold value and the availability score is detected to be larger than a preset availability evaluation threshold value, determining that the performance of the cluster to be evaluated is normal; if the quality score of the cluster to be evaluated is detected to be less than or equal to the preset quality evaluation threshold value or the availability score is detected to be less than or equal to the preset availability evaluation threshold value, the cluster to be evaluated is determined to have performance abnormity, a performance abnormity alarm is sent out, and timely early warning of the cluster performance abnormity can be achieved.
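The threshold comparison described above can be sketched as a single predicate. The function and parameter names are hypothetical, and the threshold values would be set for the cluster to be evaluated.

```python
def performance_is_normal(quality, availability,
                          quality_threshold, availability_threshold):
    """True only if both scores are strictly greater than their thresholds.

    A score less than or equal to its threshold means a performance
    abnormality, for which an alarm would be issued.
    """
    return quality > quality_threshold and availability > availability_threshold
```

Note that a score exactly equal to its threshold counts as abnormal, which follows the "less than or equal to" wording of the text.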

Correspondingly, when the operation and maintenance of the cluster are carried out based on the performance evaluation result, if the performance abnormality alarm is determined to exist, the cluster subsystem with the abnormality is determined according to the quality score and the availability score of each cluster subsystem, and the abnormal node and the abnormal type are determined according to the abnormal information and the availability information corresponding to the current cluster subsystem; maintaining the abnormal nodes by adopting a fault processing strategy matched with the abnormal type; the fault handling strategy can comprise abnormal node restarting and abnormal node off-shelf maintenance. Determining the quality score and the availability score of the cluster to be evaluated based on the quality score and the availability score of each cluster subsystem, and judging whether the current cluster has performance abnormality according to the quality score and the availability score of the cluster to be evaluated; if the cluster performance is determined to be abnormal, the abnormal node is determined according to the abnormal information and the available information so as to perform targeted maintenance on the abnormal node, the abnormal node can be quickly positioned, and the operation and maintenance efficiency can be improved.

In an implementation manner of the embodiment of the present invention, optionally, after the quality score and the availability score of the cluster to be evaluated are obtained, the comprehensive performance score of the cluster may also be obtained according to the quality score and the availability score, and the comprehensive performance score is used as a performance evaluation result of the cluster, so that more accurate performance evaluation of the cluster to be evaluated can be achieved; optionally, the comprehensive performance score is obtained based on the following formula:

Z = αA + βB;

wherein Z represents the overall performance score, α and β represent weighting coefficients, which can be set according to the service requirements, and α + β =1.
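A minimal sketch of the weighted combination Z = αA + βB follows. The function name is hypothetical; β is derived from α so that α + β = 1 always holds, and restricting α to the range 0 to 1 is an assumption that the text does not state.

```python
def overall_score(quality, availability, alpha):
    """Z = alpha * A + beta * B with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        # Assumption: weights are non-negative, so alpha lies in [0, 1].
        raise ValueError("alpha must be between 0 and 1")
    beta = 1.0 - alpha
    return alpha * quality + beta * availability
```

A larger α emphasizes the quality dimension and a smaller α emphasizes the availability dimension, which is how the weights can follow the service requirements.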

It should be noted that, according to the technical scheme in the embodiment of the present invention, the cluster performance in the preset monitoring period can be evaluated in real time, so that the cluster nodes, paths, and networks can be operated and maintained in a targeted manner according to the comprehensive performance score, and the cluster operation and maintenance efficiency is improved; the operation and maintenance of the cluster based on the comprehensive performance score may include: judging whether the comprehensive performance score is less than or equal to a preset comprehensive performance evaluation threshold value; if the detected comprehensive performance score is less than or equal to a preset comprehensive performance evaluation threshold value, determining that the current cluster has performance abnormity; and determining the abnormal cluster subsystems according to the quality scores and the availability scores corresponding to the cluster subsystems, and further determining the abnormal nodes or services according to the abnormal information and the available information so as to perform targeted maintenance on the abnormal nodes or services, thereby further improving the operation and maintenance efficiency of the clusters.
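Locating the abnormal cluster subsystems once the comprehensive performance score falls to or below its threshold can be sketched as follows. The text does not say how an individual subsystem is judged abnormal, so comparing each subsystem score with a single hypothetical threshold is an assumption, as are the function name and the dictionary layout.

```python
def find_abnormal_subsystems(quality, availability, threshold):
    """Names of subsystems whose quality or availability score is <= threshold.

    quality and availability map the same subsystem names to scores.
    """
    abnormal = []
    for name in quality:
        if quality[name] <= threshold or availability[name] <= threshold:
            abnormal.append(name)
    return abnormal
```

The abnormal information and available information of the returned subsystems would then be inspected to find the abnormal nodes or services that need maintenance.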

Therefore, in the embodiment of the present invention, each cluster subsystem in the cluster to be evaluated is determined; the quality score of the cluster subsystem is determined based on the first abnormal information and/or the first available information in the cluster subsystem; and the availability score of the cluster subsystem is determined based on the second abnormal information and/or the second available information in the cluster subsystem. The quality score and the availability score of the cluster to be evaluated are thereby obtained, the comprehensive performance score of the cluster is determined based on the quality score and the availability score, and the cluster is operated and maintained based on the comprehensive performance score. Obtaining the comprehensive performance score achieves accurate evaluation of the performance of the cluster to be evaluated and gives operation and maintenance personnel a data reference, so that performance abnormalities of the cluster can be found more promptly, targeted operation and maintenance can be carried out, and the operation and maintenance efficiency of the cluster is improved.

Fig. 3 is a block diagram of a structure of an apparatus for evaluating cluster performance according to an embodiment of the present invention, where the apparatus specifically includes: a cluster subsystem determining module 310, a quality score obtaining module 320, an availability score obtaining module 330, and a performance evaluation result determining module 340;

a cluster subsystem determining module 310, configured to determine each cluster subsystem in a cluster to be evaluated;

a quality score obtaining module 320, configured to determine a quality score of the cluster subsystem based on the first anomaly information and/or the first available information in the cluster subsystem;

an availability score obtaining module 330, configured to determine an availability score of the cluster subsystem based on the second anomaly information and/or the second available information in the cluster subsystem;

and a performance evaluation result determining module 340, configured to determine a performance evaluation result of the cluster based on the quality score and the availability score, and perform operation and maintenance on the cluster based on the performance evaluation result.

Optionally, on the basis of the foregoing technical solution, the cluster subsystem determining module 310 is specifically configured to determine the nodes that conform to a service type of the cluster to be evaluated, form a node set from those nodes, and determine the cluster subsystem corresponding to the node set; the cluster subsystem comprises a computing subsystem, a network subsystem, a storage subsystem, a management subsystem, a software service subsystem and/or a login subsystem.
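The grouping of nodes into subsystems by service type can be sketched as follows. The node names, the string keys used for the six subsystem types, and the choice to let one node belong to several subsystems are assumptions made for illustration; the description states only that nodes matching a service type form a node set and that the set corresponds to a cluster subsystem.

```python
from collections import defaultdict

# Keys standing in for the six subsystem types named in the description.
SUBSYSTEM_TYPES = ("computing", "network", "storage",
                   "management", "software_service", "login")


def group_into_subsystems(nodes):
    """Map each service type to the set of node names carrying that service."""
    subsystems = defaultdict(set)
    for name, service_types in nodes.items():
        for service_type in service_types:
            if service_type not in SUBSYSTEM_TYPES:
                raise ValueError(f"unknown service type: {service_type}")
            subsystems[service_type].add(name)
    return dict(subsystems)


# Hypothetical inventory: one server carries both management and login services.
inventory = {
    "node01": ["computing"],
    "node02": ["computing"],
    "mgmt01": ["management", "login"],
}
result = group_into_subsystems(inventory)
print(sorted(result))  # ['computing', 'login', 'management']
```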

Optionally, on the basis of the above technical solution, if the cluster subsystem is a computing subsystem and/or a management subsystem, the quality score obtaining module 320 is specifically configured to:

the availability score obtaining module 330 is specifically configured to:

wherein $A_{\text{compute/management}}$ represents the quality score of the computing or management subsystem, $B_{\text{compute/management}}$ represents the availability score of the computing or management subsystem, $a_i$ represents the $i$-th abnormal node for which a node health check alarm exists, $t_i$ represents the abnormal time corresponding to the $i$-th abnormal node, $i = 1, 2, \ldots$; the $j$-th unavailable node is a node for which a DOWN event alarm exists, $t_j$ represents the unavailable time corresponding to the $j$-th unavailable node, $j = 1, 2, \ldots, R$, $R$ represents the number of unavailable nodes, $M$ represents the total number of nodes included in the cluster subsystem, and $T$ represents the preset monitoring period.

Optionally, on the basis of the above technical solution, if the cluster subsystem is a network subsystem, the quality score obtaining module 320 is specifically configured to:

$$A_{\text{network}} = A_{\text{down}} \cdot A_{\text{opensm}}$$

the availability score obtaining module 330 is specifically configured to:

$$B_{\text{network}} = A_{\text{down}} \cdot A_{\text{opensm}}$$

wherein $A_{\text{network}}$ represents the quality score of the network subsystem, $B_{\text{network}}$ represents the availability score of the network subsystem, $A_{\text{down}}$ represents the network link score of the network subsystem, $A_{\text{opensm}}$ represents the subnet management service score of the network subsystem, $n_i$ represents the $i$-th failed network link in a network down state, $t_i$ represents the failure time corresponding to the $i$-th failed network link, $i = 1, 2, \ldots, N$, $N$ represents the number of failed network links, $K$ represents the total number of network links included in the cluster subsystem, $c_j$ represents the service score of the $j$-th subnet management service, $t_j$ represents the service available time corresponding to the $j$-th subnet management service, $j = 1, 2, \ldots, S$, $S$ represents the total number of subnet management services provided by the cluster subsystem, and $T$ represents the preset monitoring period.
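The product form of the network score can be illustrated with a short Python sketch. It takes the network link score and the subnet management service score as already computed inputs. The function name `network_quality_score`, the sample values, and the restriction of both inputs to the range 0 to 1 are assumptions for illustration and are not stated in the description.

```python
def network_quality_score(link_score: float, opensm_score: float) -> float:
    """Return A_network = A_down * A_opensm."""
    # Assumption: both component scores are normalized to [0, 1].
    for value in (link_score, opensm_score):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
    return link_score * opensm_score


# Hypothetical values: links are mostly healthy, but the subnet management
# service was available for only half of the monitoring period.
a_network = network_quality_score(link_score=0.9, opensm_score=0.5)
print(round(a_network, 2))  # 0.45
```

Because the two component scores are multiplied rather than averaged, a low value in either one pulls the network score down, even when the other component is close to 1.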

Optionally, on the basis of the above technical solution, if the cluster subsystem is a storage subsystem, the quality score obtaining module 320 is specifically configured to:

the availability score obtaining module 330 is specifically configured to:

wherein $A_{\text{storage}}$ represents the quality score of the storage subsystem, $B_{\text{storage}}$ represents the availability score of the storage subsystem, $a_h$ represents the $h$-th node whose alarm type is a metadata state abnormality, $t_h$ represents the abnormal time corresponding to that node, $h = 1, 2, \ldots, H$, $H$ represents the number of nodes whose alarm type is a metadata state abnormality, $b_j$ represents the $j$-th node whose alarm type is a data service state abnormality, $t_j$ represents the abnormal time corresponding to that node, $j = 1, 2, \ldots, J$, $J$ represents the number of nodes whose alarm type is a data service state abnormality, $c_k$ represents the $k$-th node whose alarm type is a node state abnormality, $t_k$ represents the abnormal time corresponding to that node, $k = 1, 2, \ldots, K$, $K$ represents the number of nodes whose alarm type is a node state abnormality, $d_l$ represents the $l$-th node whose alarm type is a system data state abnormality, $t_l$ represents the abnormal time corresponding to that node, $l = 1, 2, \ldots, L$, $L$ represents the number of nodes whose alarm type is a system data state abnormality, $e_n$ represents the $n$-th node whose alarm type is a cluster state abnormality, $t_n$ represents the abnormal time corresponding to that node, and $n = 1, 2, \ldots$

Optionally, on the basis of the above technical solution, if the cluster subsystem is a software service subsystem, the quality score obtaining module 320 is specifically configured to:

the availability score obtaining module 330 is specifically configured to:

wherein $A_{\text{service}}$ represents the quality score of the software service subsystem, $B_{\text{service}}$ represents the availability score of the software service subsystem, $x_r$ represents the service score of the $r$-th preset service item, $t_r$ represents the service available time corresponding to the $r$-th preset service item, $r = 1, 2, \ldots, R$, $R$ represents the total number of preset service items provided by the cluster subsystem, and $T$ represents the preset monitoring period.

Optionally, on the basis of the foregoing technical solution, if the cluster subsystem is a login subsystem, the quality score obtaining module 320 is specifically configured to:

the availability score obtaining module 330 is specifically configured to:

wherein $A_{\text{login}}$ represents the quality score of the login subsystem, $B_{\text{login}}$ represents the availability score of the login subsystem, $a_i$ represents the $i$-th abnormal node for which an abnormality alarm exists, $t_i$ represents the abnormal time corresponding to the $i$-th abnormal node, $i = 1, 2, \ldots, N$, $N$ represents the number of abnormal nodes, $M$ represents the total number of nodes included in the cluster subsystem, and $T$ represents the preset monitoring period.

The apparatus can execute the method for evaluating cluster performance provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described in this embodiment, reference may be made to the method provided in any embodiment of the present invention.

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 4, the electronic device includes:

one or more processors 410, one processor 410 being illustrated in FIG. 4;

a memory 420;

the apparatus may further include: an input device 430 and an output device 440.

The processor 410, the memory 420, the input device 430 and the output device 440 of the apparatus may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 4.

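The memory 420 is a non-transitory computer-readable storage medium, and may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for evaluating cluster performance in the embodiment of the present invention (for example, the cluster subsystem determining module 310, the quality score obtaining module 320, the availability score obtaining module 330, and the performance evaluation result determining module 340 shown in Fig. 3). The processor 410 executes the various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 420, that is, it implements the method for evaluating cluster performance of the above method embodiment:

determining each cluster subsystem in a cluster to be evaluated;

determining a quality score for the cluster subsystem based on first anomaly information and/or first available information in the cluster subsystem;

determining an availability score for the cluster subsystem based on second anomaly information and/or second available information in the cluster subsystem;

determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result.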

The memory 420 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 420 may optionally include memory located remotely from processor 410, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 440 may include a display screen or the like.

Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for evaluating cluster performance according to any embodiment of the present invention; the method comprises the following steps:

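determining each cluster subsystem in a cluster to be evaluated;

determining a quality score for the cluster subsystem based on first anomaly information and/or first available information in the cluster subsystem;

determining an availability score for the cluster subsystem based on second anomaly information and/or second available information in the cluster subsystem;

determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result.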

Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing describes only exemplary embodiments of the invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made by those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from the spirit of the invention; the scope of the invention is determined by the scope of the appended claims.

Claims

1. A method for evaluating cluster performance is characterized by comprising the following steps:

determining each cluster subsystem in a cluster to be evaluated; the cluster subsystem is a node set bearing different services in a cluster to be evaluated, and the nodes are servers;

determining a quality score for the cluster subsystem based on first anomaly information and/or first available information in the cluster subsystem; the quality score is used for describing the loss degree of the nodes, the links and the system;

determining a performance evaluation result of the cluster based on the quality score and the availability score, and performing operation and maintenance on the cluster based on the performance evaluation result;

if the cluster subsystem is a computing subsystem and/or a management subsystem, determining a quality score of the cluster subsystem based on first abnormal information in the cluster subsystem comprises: determining the quality score of the current cluster subsystem according to the number of abnormal nodes, the abnormal time corresponding to the abnormal nodes, the total number of nodes included in the cluster subsystem, and a preset monitoring period;

determining an availability score for the cluster subsystem based on second anomaly information in the cluster subsystem comprises: determining the availability score of the current cluster subsystem according to the number of unavailable nodes for which a DOWN event alarm exists, the unavailable time corresponding to the unavailable nodes, the total number of nodes included in the cluster subsystem, and a preset monitoring period.

2. The method of claim 1, wherein the determining each cluster subsystem in the cluster to be evaluated comprises:

3. The method of claim 2, wherein determining the quality score of the cluster subsystem based on the first anomaly information in the cluster subsystem if the cluster subsystem is a computing subsystem and/or a management subsystem comprises:

wherein $A_{\text{compute/management}}$ represents the quality score of the computing or management subsystem, $B_{\text{compute/management}}$ represents the availability score of the computing or management subsystem, $a_i$ represents the $i$-th abnormal node for which a node health check alarm exists, $t_i$ represents the abnormal time corresponding to the $i$-th abnormal node, $i = 1, 2, \ldots$; the $j$-th unavailable node is a node for which a DOWN event alarm exists, $t_j$ represents the unavailable time corresponding to the $j$-th unavailable node, $j = 1, 2, \ldots, R$, $R$ represents the number of unavailable nodes, $M$ represents the total number of nodes included in the cluster subsystem, and $T$ represents the preset monitoring period.

4. The method of claim 2, wherein, if the cluster subsystem is a network subsystem, determining a quality score for the cluster subsystem based on first anomaly information and first available information in the cluster subsystem comprises:

$$A_{\text{network}} = A_{\text{down}} \cdot A_{\text{opensm}}$$

determining an availability score for the cluster subsystem based on second anomaly information and second available information in the cluster subsystem, comprising:

$$B_{\text{network}} = A_{\text{down}} \cdot A_{\text{opensm}}$$

wherein $A_{\text{network}}$ represents the quality score of the network subsystem, $B_{\text{network}}$ represents the availability score of the network subsystem, $A_{\text{down}}$ represents the network link score of the network subsystem, $A_{\text{opensm}}$ represents the subnet management service score of the network subsystem, $n_i$ represents the $i$-th failed network link in a network down state, $t_i$ represents the failure time corresponding to the $i$-th failed network link, $i = 1, 2, \ldots, N$, $N$ represents the number of failed network links, $K$ represents the total number of network links included in the cluster subsystem, $c_j$ represents the service score of the $j$-th subnet management service, $t_j$ represents the service available time corresponding to the $j$-th subnet management service, and $j = 1, 2, \ldots$

5. The method of claim 2, wherein determining the quality score of the cluster subsystem based on the first anomaly information in the cluster subsystem if the cluster subsystem is a storage subsystem comprises:

wherein $A_{\text{storage}}$ represents the quality score of the storage subsystem, $B_{\text{storage}}$ represents the availability score of the storage subsystem, $a_h$ represents the $h$-th node whose alarm type is a metadata state abnormality, $t_h$ represents the abnormal time corresponding to that node, $h = 1, 2, \ldots, H$, $H$ represents the number of nodes whose alarm type is a metadata state abnormality, $b_j$ represents the $j$-th node whose alarm type is a data service state abnormality, $t_j$ represents the abnormal time corresponding to that node, $j = 1, 2, \ldots, J$, $J$ represents the number of nodes whose alarm type is a data service state abnormality, $c_k$ represents the $k$-th node whose alarm type is a node state abnormality, $t_k$ represents the abnormal time corresponding to that node, $k = 1, 2, \ldots, K$, $K$ represents the number of nodes whose alarm type is a node state abnormality, $d_l$ represents the $l$-th node whose alarm type is a system data state abnormality, $t_l$ represents the abnormal time corresponding to that node, $l = 1, 2, \ldots, L$, $L$ represents the number of nodes whose alarm type is a system data state abnormality, $e_n$ represents the $n$-th node whose alarm type is a cluster state abnormality, $t_n$ represents the abnormal time corresponding to that node, and $n = 1, 2, \ldots$

6. The method of claim 2, wherein determining the quality score for the cluster subsystem based on first available information in the cluster subsystem if the cluster subsystem is a software service subsystem comprises:

wherein $A_{\text{service}}$ represents the quality score of the software service subsystem, $B_{\text{service}}$ represents the availability score of the software service subsystem, $x_r$ represents the service score of the $r$-th preset service item, $t_r$ represents the service available time corresponding to the $r$-th preset service item, $r = 1, 2, \ldots, R$, $R$ represents the total number of preset service items provided by the cluster subsystem, and $T$ represents the preset monitoring period.

7. The method of claim 2, wherein determining the quality score of the cluster subsystem based on the first anomaly information in the cluster subsystem if the cluster subsystem is a login subsystem comprises:

wherein $A_{\text{login}}$ represents the quality score of the login subsystem, $B_{\text{login}}$ represents the availability score of the login subsystem, $a_i$ represents the $i$-th abnormal node for which an abnormality alarm exists, $t_i$ represents the abnormal time corresponding to the $i$-th abnormal node, $i = 1, 2, \ldots, N$, $N$ represents the number of abnormal nodes, $M$ represents the total number of nodes included in the cluster subsystem, and $T$ represents the preset monitoring period.

8. An apparatus for evaluating cluster performance, comprising:

the cluster subsystem determining module is used for determining each cluster subsystem in the cluster to be evaluated; the cluster subsystem is a node set bearing different services in a cluster to be evaluated, and the nodes are servers;

the quality score acquisition module is used for determining the quality score of the cluster subsystem based on first abnormal information and/or first available information in the cluster subsystem; the quality score is used for describing the loss degree of the nodes, the links and the system;

the performance evaluation result determining module is used for determining a performance evaluation result of the cluster based on the quality score and the availability score and carrying out operation and maintenance on the cluster based on the performance evaluation result;

9. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for evaluating cluster performance according to any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method for evaluating the performance of a cluster according to any one of claims 1 to 7.