CN108874640B

CN108874640B - Cluster performance evaluation method and device

Info

Publication number: CN108874640B
Application number: CN201810425538.5A
Authority: CN
Inventors: 吴怡燃
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2018-05-07
Filing date: 2018-05-07
Publication date: 2022-09-30
Anticipated expiration: 2038-05-07
Also published as: CN108874640A

Abstract

The invention discloses a cluster performance evaluation method and device, and relates to the technical field of computers. One embodiment of the method comprises: acquiring service information of a cluster and physical resource usage information of each node in the cluster; calculating, in real time, evaluation index information of the cluster based on the service information of the cluster, the physical resource usage information of each node in the cluster, and a preset evaluation rule, the evaluation rule comprising the evaluation indexes of the cluster and the weight corresponding to each evaluation index; and performing a weighted summation of the evaluation index information of the cluster to determine the performance health degree of the cluster. By analyzing the collected service information of the cluster and the physical resource usage information of the nodes, the method and device determine the performance health degree of the cluster and thereby effectively evaluate the current state of the cluster, which addresses the lack, in the prior art, of an effective way to evaluate the state or score of the current cluster.

Description

Cluster performance evaluation method and device

Technical Field

The invention relates to the technical field of computers, in particular to a cluster performance evaluation method and device.

Background

As traffic and data volume grow, big data processing clusters also become larger and larger. Evaluating the performance of a very large cluster becomes increasingly complex, because that performance is influenced by many factors, such as network, disk IO, CPU, traffic, hot data, and hot nodes. In the prior art, only some conventional indicators are monitored so that cluster performance problems can be located and investigated quickly.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

1. There is no effective way to evaluate the state or score of the current cluster.

2. In a very large cluster, conventional approaches cannot effectively analyze each node's contribution to cluster efficiency or the computing capacity that a compute node provides to the whole cluster, so difference nodes are difficult to locate.

3. Analysis of node service components is lacking, and the master nodes cannot be analyzed effectively with existing approaches.

4. Because the indicators are scattered, the current state of the cluster cannot be evaluated effectively. Staff must check multiple views, analyze many indicators, and judge the current state of the cluster against historical information; different people weigh different indicators and reach different conclusions, so some problems that the cluster could easily avoid are not avoided.

5. There is no automated analysis. The cluster state is still analyzed manually, which introduces information lag.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for evaluating cluster performance, which can determine the performance health degree of a cluster by analyzing collected service information of the cluster and physical resource usage information of its nodes, so as to effectively evaluate the current state of the cluster and address the lack, in the prior art, of an effective way to evaluate the state or score of the current cluster.

To achieve the above object, according to an aspect of the embodiments of the present invention, there is provided a method for evaluating cluster performance, including: acquiring service information of a cluster and physical resource usage information of each node in the cluster; calculating, in real time, evaluation index information of the cluster based on the service information of the cluster, the physical resource usage information of each node in the cluster, and a preset evaluation rule, the evaluation rule including the evaluation indexes of the cluster and the weight corresponding to each evaluation index; and performing a weighted summation of the evaluation index information of the cluster to determine the performance health degree of the cluster.

Optionally, the performance health of the cluster is determined by:

wherein H represents the performance health degree of the cluster, f(i) represents the i-th evaluation index information of the cluster, w represents the weight corresponding to that evaluation index, and n represents the number of evaluation indexes of the cluster.
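Read together with the weighted-summation step, the symbol definitions above are consistent with a formula of the following form. This is a reconstruction from the surrounding text rather than a quotation, and writing the weight as $w_i$ (one weight per evaluation index) is a notational assumption:

$$H = \sum_{i=1}^{n} w_i \, f(i)$$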

Optionally, the method further comprises: determining the principal component ratio of a node according to the physical resource usage information of the node and the service information of the cluster; and if the principal component ratio of the node is not within a preset range, determining that the node is a difference node.

Optionally, the method further comprises: calculating the mean square error of the principal component ratio of each node in the cluster according to the principal component ratio of the node; and if the mean square error exceeds a set threshold value, the cluster is an abnormal cluster.

Optionally, after determining that the node is a difference node, the method further includes: determining the problem affecting the difference node according to the difference node and its principal component ratio; and retrieving, from a preset rule base, the answer or optimization approach corresponding to that problem.

To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a cluster performance evaluation apparatus including an acquisition module, an analysis module, and an evaluation module. The acquisition module is configured to acquire service information of a cluster and physical resource usage information of each node in the cluster. The analysis module is configured to calculate, in real time, evaluation index information of the cluster based on the service information of the cluster, the physical resource usage information of each node in the cluster, and a preset evaluation rule, the evaluation rule comprising the evaluation indexes of the cluster and the weight corresponding to each evaluation index. The evaluation module is configured to perform a weighted summation of the evaluation index information of the cluster to determine the performance health degree of the cluster.

Optionally, the evaluation module is configured to determine the performance health degree of the cluster by the same weighted summation described above for the method.

Optionally, the analysis module is further configured to: determine the principal component ratio of a node according to the physical resource usage information of the node and the service information of the cluster; and if the principal component ratio of the node is not within a preset range, determine that the node is a difference node.

Optionally, the analysis module is further configured to: calculating the mean square error of the principal component ratio of each node in the cluster according to the principal component ratio of the node; and if the mean square error exceeds a set threshold value, the cluster is an abnormal cluster.

Optionally, the analysis module is further configured to: determine the problem affecting a difference node according to the difference node and its principal component ratio; and retrieve, from a preset rule base, the answer or optimization approach corresponding to that problem.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for evaluating cluster performance provided by the embodiments of the invention.

To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, implementing a method for evaluating cluster performance provided by the embodiments of the present invention.

One embodiment of the above invention has the following advantages or benefits: by analyzing the collected service information of the cluster and the physical resource usage information of the nodes, the embodiment determines the performance health degree of the cluster and thereby effectively evaluates the current state of the cluster. This addresses the lack, in the prior art, of an effective way to evaluate the state or score of the current cluster; and because the cluster state is analyzed automatically, it also addresses the problem of information lag.

Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the basic flow of a method for evaluating cluster performance according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the scoring of the ResourceManager and NameNode master nodes of a Hadoop cluster according to an embodiment of the invention;

FIG. 3 is a schematic diagram of the basic modules of an apparatus for evaluating cluster performance according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an optimization module of the Hadoop cluster performance evaluation device according to the embodiment of the invention;

FIG. 5 is a schematic diagram of a real-time index analysis process according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a report generation flow according to an embodiment of the present invention;

FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 8 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the prior art, an OpenFalcon Agent is generally installed on each node to collect physical resource usage information on the node and report it to a monitoring server. An OpenFalcon Agent is a collector (a component of the OpenFalcon monitoring system) deployed on a compute node. It mainly collects the utilization of devices such as CPU, memory, and network. The physical resource information refers to indexes of a compute node, such as: the CPU utilization of a single compute node, its memory utilization, its swap usage, its network utilization, its open ports, its disk utilization, its disk busyness, and other system-level indexes.

The monitoring server periodically collects Hadoop service information from the compute nodes; the services mainly comprise DataNode, NodeManager, NameNode, and ResourceManager. The information collected includes: JVM information, number of operations, number of requests, request time, Job run time, Job success count, Job failure count, and so on. ResourceManager is the resource management service of a Hadoop cluster, from which an application can request resources. DataNode is a component of a Hadoop cluster responsible for storing distributed data; a cluster can have one or more DataNodes. NodeManager is a component of a Hadoop cluster responsible for managing one compute node; it starts the corresponding computation according to the tasks assigned by the master node, and a cluster has one or more NodeManagers. NameNode is a component of the Hadoop distributed system whose main functions include metadata management, directory tree maintenance, and responding to client requests. The service information refers to resource usage information of the Hadoop components, such as: the CPU usage of a single Hadoop process, its memory usage, its thread information, and information describing its operation (such as communication heartbeat information, throughput, average workload, request volume, and request time), as well as index information specific to each component, such as scheduler monitoring, storage monitoring, and data access volume.
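The text does not say which interface the monitoring server uses to gather this service information. One common option is the JSON that Hadoop daemons publish through their `/jmx` HTTP endpoint, where each metrics group appears as a "bean". The sketch below assumes that source and only shows how a few attributes might be picked out of such a response; the helper name `extract_metrics` and the sample values are made up for illustration.

```python
import json
from typing import Any, Dict, Iterable


def extract_metrics(jmx_payload: Dict[str, Any],
                    bean_name: str,
                    attributes: Iterable[str]) -> Dict[str, Any]:
    """Return the requested attributes of one bean from a /jmx-style payload."""
    for bean in jmx_payload.get("beans", []):
        if bean.get("name") == bean_name:
            # Keep only the attributes that are actually present on the bean.
            return {attr: bean[attr] for attr in attributes if attr in bean}
    return {}  # bean not found


# Sample payload with invented numbers, shaped like a NameNode /jmx response.
sample = json.loads("""
{"beans": [{"name": "Hadoop:service=NameNode,name=FSNamesystem",
            "FilesTotal": 1200,
            "CapacityTotal": 1000000,
            "CapacityRemaining": 400000}]}
""")

namenode = extract_metrics(
    sample,
    "Hadoop:service=NameNode,name=FSNamesystem",
    ["FilesTotal", "CapacityRemaining"],
)
```

In a deployment the payload would come from an HTTP request to each daemon; parsing is kept separate here so the sketch runs without a cluster.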

The front-end analysis system displays each index as a time-series chart based on the collected information, and supports display by minute, hour, day, and month.

FIG. 1 is a schematic diagram of the basic flow of a method for evaluating cluster performance according to an embodiment of the present invention. As shown in FIG. 1, an embodiment of the present invention provides a method for evaluating cluster performance, including:

S101, acquiring service information of a cluster and physical resource usage information of each node in the cluster;

S102, calculating, in real time, evaluation index information of the cluster based on the service information of the cluster, the physical resource usage information of each node in the cluster, and a preset evaluation rule, the evaluation rule including the evaluation indexes of the cluster and the weight corresponding to each evaluation index;

S103, performing a weighted summation of the evaluation index information of the cluster to determine the performance health degree of the cluster.

The cluster evaluation indexes of the embodiment of the invention may include, but are not limited to: cluster-wide Container throughput (Containers launched and released per second), job run throughput, job failure rate, cluster-wide file access frequency, cluster-wide file modification frequency, per-job file access frequency, number of files created per job, average Container run time per job, per-node Container throughput, and per-node event throughput. The service information includes health information of the service itself, physical resource usage information of the service, and processing capability information of the service (various service indexes). By analyzing the collected service information of the cluster and the physical resource usage information of the nodes, the embodiment determines the performance health degree of the cluster and thereby effectively evaluates the current state of the cluster. This addresses the lack, in the prior art, of an effective way to evaluate the state or score of the current cluster; and because the cluster state is analyzed automatically, it also addresses the problem of information lag.

In the embodiment of the present invention, the performance health degree of the cluster is determined in the following manner:

wherein H represents the performance health degree of the cluster, f(i) represents the i-th evaluation index information of the cluster, w represents the weight corresponding to that evaluation index, and n represents the number of evaluation indexes of the cluster. Calculating the health degree in this way makes the assessment of the current cluster state more accurate.
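The weighted summation can be illustrated with a short sketch. It assumes each evaluation index has already been turned into a numeric score and that the evaluation rule is a mapping from index name to weight; the index names, the 0 to 100 scale, and the weights below are hypothetical, not values from the embodiment.

```python
from typing import Mapping


def cluster_health(index_scores: Mapping[str, float],
                   weights: Mapping[str, float]) -> float:
    """Weighted sum of evaluation index values: H = sum of w_i * f(i)."""
    missing = set(weights) - set(index_scores)
    if missing:
        raise KeyError(f"no value for evaluation indexes: {sorted(missing)}")
    return sum(weights[name] * index_scores[name] for name in weights)


# Hypothetical rule: three indexes scored from 0 to 100, weights summing to 1.
weights = {"job_failure_rate": 0.5,
           "container_throughput": 0.25,
           "file_access": 0.25}
scores = {"job_failure_rate": 80.0,
          "container_throughput": 60.0,
          "file_access": 100.0}

health = cluster_health(scores, weights)  # 0.5*80 + 0.25*60 + 0.25*100
```

Keeping the weights in the rule rather than in the function means the same code serves any set of evaluation indexes.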

In the embodiment of the present invention, the method further includes: determining the principal component ratio of a node according to the physical resource usage information of the node and the service information of the cluster; and if the principal component ratio of the node is not within a preset range, determining that the node is a difference node. A principal component ratio is outside the preset range when it is higher than the maximum or lower than the minimum of that range. The principal component ratio of a node may refer to the proportion of the cluster's events that the node processes per unit time. For example, suppose the cluster has 10 nodes, processes 10,000 type-A messages in 1 hour, and each node processes 1,000 type-A messages. The type-A message component ratio of node X = the number of type-A messages processed by node X / the total number of type-A messages processed by the cluster, which here is 1,000 / 10,000 = 10%. The preset range can be determined from operation and maintenance experience, with each index quantified. For example, the single-node processing capacity (that is, the principal component ratio of the node) is considered normal only when it falls within the preset range. This makes it possible to analyze each node's contribution to cluster efficiency and the computing capacity that the compute nodes provide to the whole cluster, which addresses the prior-art problems that analysis of node service components is lacking and master nodes cannot be analyzed effectively, and allows difference nodes to be located quickly.
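A minimal sketch of this check follows, assuming the principal component ratio is a node's share of one event type over a time window. The function names and the preset range of 5% to 15% are illustrative assumptions; the text leaves the range to operation and maintenance experience.

```python
from typing import Dict, List


def principal_component_ratios(events_per_node: Dict[str, int]) -> Dict[str, float]:
    """Share of the cluster's events that each node processed in the window."""
    total = sum(events_per_node.values())
    if total == 0:
        return {node: 0.0 for node in events_per_node}
    return {node: count / total for node, count in events_per_node.items()}


def difference_nodes(ratios: Dict[str, float], low: float, high: float) -> List[str]:
    """Nodes whose ratio is below the minimum or above the maximum of the range."""
    return sorted(node for node, ratio in ratios.items()
                  if ratio < low or ratio > high)


# Ten nodes, 10,000 type-A messages in one hour; two nodes deviate from the rest.
events = {f"node{i}": 1000 for i in range(1, 9)}
events["node9"] = 1900
events["node10"] = 100

ratios = principal_component_ratios(events)
outliers = difference_nodes(ratios, low=0.05, high=0.15)  # hypothetical range
```

Here `node9` handles 19% of the messages and `node10` only 1%, so both fall outside the assumed range while the eight nodes at 10% do not.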

In an embodiment of the present invention, the method further includes: calculating the mean square error of the principal component ratios of the nodes in the cluster; and if the mean square error exceeds a set threshold, determining that the cluster is an abnormal cluster. By calculating the mean square error of the nodes' principal component ratios, the embodiment can monitor the working state of the cluster in real time and find clusters that are working abnormally.
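The text does not define "mean square error" further. The sketch below takes one plausible reading, the root of the mean squared deviation of the node ratios from their cluster average (the population standard deviation), and flags the cluster when that value exceeds a threshold. Both this reading and the threshold of 0.02 are assumptions.

```python
from math import sqrt
from typing import Iterable


def ratio_spread(ratios: Iterable[float]) -> float:
    """Root of the mean squared deviation of node ratios from their mean."""
    values = list(ratios)
    if not values:
        raise ValueError("at least one node ratio is required")
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def is_abnormal_cluster(ratios: Iterable[float], threshold: float) -> bool:
    """True when the spread of node ratios exceeds the set threshold."""
    return ratio_spread(ratios) > threshold


balanced = [0.1] * 10                  # every node handles 10% of the events
skewed = [0.1] * 8 + [0.19, 0.01]      # two nodes deviate strongly

balanced_abnormal = is_abnormal_cluster(balanced, threshold=0.02)
skewed_abnormal = is_abnormal_cluster(skewed, threshold=0.02)
```

With these inputs the balanced cluster has a spread of practically zero and the skewed one about 0.04, so only the skewed cluster is flagged.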

In this embodiment of the present invention, after determining that the node is a difference node, the method further includes: determining the problem affecting the difference node according to the difference node and its principal component ratio; and retrieving, from a preset rule base, the answer or optimization approach corresponding to that problem. Once a difference node has been located, the embodiment can thus determine and display its problem, and display the corresponding answer and optimization method according to the rules entered into the rule base in advance (that is, the mapping between problems and answers).
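The rule base can be pictured as a mapping from a problem key to an answer. The sketch below is only an illustration of that lookup: the problem keys, the way a problem is derived from the ratio, and the answer texts are all hypothetical, since the text does not list the rules.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Rule:
    problem: str  # description shown to the operator
    answer: str   # suggested answer or optimization approach


# Hypothetical entries; a real rule base would be entered in advance by operators.
RULE_BASE: Dict[str, Rule] = {
    "ratio_below_range": Rule(
        problem="node handles a smaller share of cluster events than expected",
        answer="inspect the node's disk, network and JVM indexes for a bottleneck",
    ),
    "ratio_above_range": Rule(
        problem="node handles a larger share of cluster events than expected",
        answer="check for hot data on the node and rebalance if needed",
    ),
}


def classify(ratio: float, low: float, high: float) -> Optional[str]:
    """Map a node's principal component ratio to a problem key, if any."""
    if ratio < low:
        return "ratio_below_range"
    if ratio > high:
        return "ratio_above_range"
    return None


def advise(ratio: float, low: float, high: float) -> Optional[Rule]:
    """Look up the rule for a difference node; None for a normal node."""
    key = classify(ratio, low, high)
    return None if key is None else RULE_BASE[key]


advice = advise(0.01, low=0.05, high=0.15)
```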

Taking a Hadoop cluster as an example, FIG. 2 is a schematic diagram of the scoring of the ResourceManager and NameNode master nodes of a Hadoop cluster according to an embodiment of the present invention. As shown in FIG. 2, each pie chart is composed of the main key indexes of a component. The ResourceManager pie chart consists of job-related events, such as Container release, Container failure, and Container allocation. The NameNode pie chart consists mainly of events related to file operations, such as create, delete, modify, and add. The information in FIG. 2 is as follows:

The pie charts represent the key indicators of the cluster. The pie chart on the left shows the proportion of each job event type accumulated over the current day. Its main contents are: the number of Containers that succeeded that day, the number that failed that day, the number cancelled that day, and the number running that day; the whole pie represents the scale of jobs run that day (in real time).

The pie chart on the right shows the proportion of storage resource usage accumulated over the current day. Its main contents are: the number of valid target files newly added by services to the cluster that day, the number of temporary data files newly added by services that day, the number of temporary files newly added by the distributed system that day, and the files deleted from the cluster that day; the whole pie represents the total number of files operated on that day.

The data under "70" describes the computing resources of the cluster, that is, its current resource utilization state: the total number of VCores is the total CPU count of the cluster; the remaining VCores are the CPU resources still available for allocation; the total memory is the total memory resources of the cluster; and the remaining memory is the memory currently still available for allocation.

The data under "80 points" describes what the cluster currently stores: the number of files is how many files the cluster stores in total; the number of folders is how many folders the cluster has in total; the total storage is the total storage capacity of the cluster; and the remaining storage is the storage still available for allocation.

The index values under "component content" are real-time; each refresh shows the latest state of the cluster. Left column: ContainerAllocate is the total number of Containers allocated on the cluster that day; containerFailed is the number of Containers that failed on the cluster that day; the finished-application count is the number of jobs completed on the cluster that day; the running-application count is the number of jobs currently running; the failed-application count is the number of jobs that failed on the cluster that day; KillApplication is the number of jobs cancelled that day; NodeNumber is the total number of compute nodes in the cluster; and DeadNode is the number of compute nodes the cluster has lost. Right column: CreateNewFileNumber is the total number of files created that day; DeleteFileNumber is the total number of files deleted that day; FiledWriteFileNumber is the total number of files that failed to be written that day; ChangeDirNumber is the number of directories modified that day; ChangeFileNumber is the number of files modified that day; NodeNumber is the total number of storage nodes in the cluster; and DeadNode is the number of storage nodes the cluster has lost.

FIG. 3 is a schematic diagram of the basic modules of an apparatus for evaluating cluster performance according to an embodiment of the present invention. As shown in FIG. 3, an embodiment of the present invention provides an apparatus 300 for evaluating cluster performance, including an acquisition module 301, an analysis module 302, and an evaluation module 303. The acquisition module 301 is configured to acquire service information of a cluster and physical resource usage information of each node in the cluster. The analysis module 302 is configured to calculate, in real time, evaluation index information of the cluster based on the service information of the cluster, the physical resource usage information of each node in the cluster, and a preset evaluation rule, the evaluation rule comprising the evaluation indexes of the cluster and the weight corresponding to each evaluation index. The evaluation module 303 is configured to perform a weighted summation of the evaluation index information of the cluster to determine the performance health degree of the cluster. By analyzing the collected service information of the cluster and the physical resource usage information of the nodes, the embodiment determines the performance health degree of the cluster and thereby effectively evaluates the current state of the cluster. This addresses the lack, in the prior art, of an effective way to evaluate the state or score of the current cluster; and because the cluster state is analyzed automatically, it also addresses the problem of information lag.

In this embodiment of the present invention, the evaluation module 303 is configured to determine the performance health degree of the cluster by the same weighted summation described above for the method.

In this embodiment of the present invention, the analysis module 302 is further configured to: determine the principal component ratio of a node according to the physical resource usage information of the node and the service information of the cluster; and if the principal component ratio of the node is not within a preset range, determine that the node is a difference node. The preset range can be determined from operation and maintenance experience, with each index quantified. For example, the single-node processing capacity (that is, the principal component ratio of the node) is considered normal only when it falls within the preset range. This makes it possible to analyze each node's contribution to cluster efficiency and the computing capacity that the compute nodes provide to the whole cluster, which addresses the prior-art problems that analysis of node service components is lacking and master nodes cannot be analyzed effectively, and allows difference nodes to be located quickly.

In this embodiment of the present invention, the analysis module 302 is further configured to: calculate the mean square error of the principal component ratios of the nodes in the cluster; and if the mean square error exceeds a set threshold, determine that the cluster is an abnormal cluster. By calculating the mean square error of the nodes' principal component ratios, the embodiment can monitor the working state of the cluster in real time and find clusters that are working abnormally.

In this embodiment of the present invention, the analysis module 302 is further configured to: determine the problem affecting a difference node according to the difference node and its principal component ratio; and retrieve, from a preset rule base, the answer or optimization approach corresponding to that problem. Once a difference node has been located, the embodiment can thus determine and display its problem, and display the corresponding answer and optimization method according to the rules entered into the rule base in advance (that is, the mapping between problems and answers).

FIG. 4 is a schematic diagram of an optimization module of the apparatus for evaluating Hadoop cluster performance according to an embodiment of the invention. A Hadoop cluster is a cluster consisting of multiple physical servers on which Hadoop services are deployed. Hadoop is a distributed system comprising the distributed storage system HDFS (Hadoop Distributed File System) and the distributed computing system MapReduce. As shown in FIG. 4, the system mainly includes the following modules:

1) MonitorServer, which is responsible for collecting and displaying cluster information; the collected information can be stored in OpenTSDB. Users can use the server to display information by time period and by point in time. The server also calculates the evaluation indexes from this information at regular intervals and displays them on the interface.

2) OpenTSDB, a distributed time-series data storage system that can store large amounts of time-series data.

3) HadoopMaster, the master node of Hadoop, which may refer to the NameNode and ResourceManager nodes.

4) Node, a compute node of Hadoop, which refers to the two node types DataNode and NodeManager.
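The interplay between MonitorServer and OpenTSDB described above can be sketched as one evaluation pass that scores the cluster and shapes the results as time-series data points. The data point layout (`metric`, `timestamp`, `value`, `tags`) follows the JSON accepted by OpenTSDB's HTTP `/api/put` endpoint; the metric names, the `cluster` tag, and the `collect` callback are hypothetical, and the sketch builds the points without sending them.

```python
from typing import Callable, Dict, List

DataPoint = Dict[str, object]


def to_datapoint(metric: str, value: float, timestamp: int,
                 tags: Dict[str, str]) -> DataPoint:
    """One data point in the JSON shape used by OpenTSDB's /api/put."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    return {"metric": metric, "timestamp": timestamp,
            "value": value, "tags": dict(tags)}


def evaluate_once(collect: Callable[[], Dict[str, float]],
                  weights: Dict[str, float],
                  now: int, cluster: str) -> List[DataPoint]:
    """Collect index scores, compute the weighted health, return data points."""
    scores = collect()
    health = sum(weights[name] * scores[name] for name in weights)
    tags = {"cluster": cluster}
    points = [to_datapoint(f"cluster.index.{name}", scores[name], now, tags)
              for name in sorted(weights)]
    points.append(to_datapoint("cluster.health", health, now, tags))
    return points


# A stand-in collector with fixed scores; a real one would query the cluster.
points = evaluate_once(lambda: {"a": 80.0, "b": 60.0},
                       weights={"a": 0.5, "b": 0.5},
                       now=1700000000, cluster="demo")
```

A scheduler would call `evaluate_once` at the configured interval and post the returned list to the storage system.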

The device mainly comprises three processing flows:

1. Collecting all index information of the Hadoop cluster: the MonitorServer periodically collects index information through the interfaces of the Hadoop components and stores the collected information in OpenTSDB.

2. Calculating the cluster evaluation, the node principal component ratios, and the performance analysis graphs in real time from the indexes, and displaying them. The index analysis and evaluation flow relies on predefined evaluation rules and calculates the cluster health degree according to them. Evaluation rules can be added, deleted, modified, and queried. For example, the cluster evaluation indexes included in the evaluation rule may be: the JVM resource usage of a node; the physical resource usage of a node; the standard number of events a node processes per hour; the share of the total score assigned to how the node's processing performance on the current day compares with historical information; and the scoring standard applied when the node's processing performance on the current day is lower or higher than the historical information.

FIG. 5 is a schematic diagram of a real-time index analysis flow according to an embodiment of the present invention. As shown in FIG. 5, every 5 minutes the service information of the cluster and the physical resource usage information of the nodes for the past 5 minutes are collected; the cluster health degree is calculated and stored in OpenTSDB; and all data collected and generated over the past 5 minutes is added to the component analysis statistical table. The component analysis statistical table is a database table, so the calculated data is stored in a database.

3. Allowing the user to manually trigger the evaluation process and generate a detailed evaluation report. The manually triggered report generation flow lets an administrator analyze the cluster in depth from the interface, generate a cluster report, and send the report by mail. FIG. 6 is a schematic diagram of a report generation flow according to an embodiment of the present invention. As shown in FIG. 6, the cluster evaluation index information and the principal component ratios of the nodes are obtained, a form and an information display web page are generated, and whether to send them to the client by email is then decided according to whether the user has subscribed.
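The report generation flow can be sketched as two steps: render the evaluation indexes and node ratios, then build a mail message only when the user has subscribed. The plain-text layout, the subject line, and the use of `None` to mean "not subscribed" are assumptions; the sketch constructs the message with Python's standard `email` package and does not send it.

```python
from email.message import EmailMessage
from typing import Dict, Optional


def render_report(index_scores: Dict[str, float],
                  node_ratios: Dict[str, float]) -> str:
    """Plain-text report listing evaluation indexes and node ratios."""
    lines = ["Cluster evaluation report", "", "Evaluation indexes:"]
    lines += [f"  {name}: {value:.2f}"
              for name, value in sorted(index_scores.items())]
    lines += ["", "Node principal component ratios:"]
    lines += [f"  {node}: {ratio:.1%}"
              for node, ratio in sorted(node_ratios.items())]
    return "\n".join(lines)


def maybe_mail(report: str, subscriber: Optional[str]) -> Optional[EmailMessage]:
    """Build a mail message for a subscribed user; None if not subscribed."""
    if subscriber is None:
        return None
    message = EmailMessage()
    message["Subject"] = "Cluster evaluation report"
    message["To"] = subscriber
    message.set_content(report)
    return message


report = render_report({"job_failure_rate": 80.0},
                       {"node1": 0.1, "node2": 0.19})
mail = maybe_mail(report, "ops@example.com")   # hypothetical subscriber
no_mail = maybe_mail(report, None)             # user has not subscribed
```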

Fig. 7 shows an exemplary system architecture 700 of an evaluation method or an evaluation apparatus for cluster performance, to which an embodiment of the present invention may be applied.

As shown in fig. 7, the system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 serves to provide a medium for communication links between the terminal devices 701, 702, 703 and the server 705. Network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

A user may use the terminal devices 701, 702, 703 to interact with the server 705 over the network 704, to receive or send messages or the like. Various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like, may be installed on the terminal devices 701, 702, and 703.

The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 705 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 701, 702, and 703. The background management server can analyze and process received data, such as a product information query request, and feed back the processing result, such as target push information, to the terminal device.

It should be noted that the method for evaluating cluster performance provided by the embodiment of the present invention is generally executed by the server 705, and accordingly, the apparatus for evaluating cluster performance is generally disposed in the server 705.

It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The invention also provides an electronic device and a readable storage medium according to the embodiment of the invention.

The electronic device of the embodiment of the invention comprises: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for evaluating cluster performance provided by the embodiment of the invention.

The computer readable medium of the embodiment of the present invention stores a computer program thereon, and the program, when executed by a processor, implements the method for evaluating cluster performance provided by the embodiment of the present invention.

Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 8, a computer system 800 includes a Central Processing Unit (CPU)801 which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed in the storage section 808 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 801.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described, for example, as: a processor comprising an acquisition module, an analysis module, and an evaluation module. In some cases the names of these modules do not limit the modules themselves; for example, the "acquisition module" may also be described as a "module for acquiring service information of a cluster and physical resource usage information of each node in the cluster".

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: S101, acquiring service information of a cluster and physical resource usage information of each node in the cluster; S102, calculating in real time the evaluation index information of the cluster based on the service information of the cluster, the physical resource usage information of each node in the cluster, and a preset evaluation rule, the evaluation rule including the evaluation indexes of the cluster and the weights corresponding to the evaluation indexes; and S103, performing weighted summation on the evaluation index information of the cluster to determine the performance health degree of the cluster.
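Step S103 can be illustrated with a minimal sketch of the weighted summation. The function name, the index names, the scores, and the weights below are hypothetical; how each evaluation index is turned into a score in S102 is not shown.

```python
from typing import Dict


def cluster_health(index_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the evaluation index scores (step S103)."""
    missing = set(weights) - set(index_scores)
    if missing:
        raise ValueError(f"no score for indexes: {sorted(missing)}")
    return sum(index_scores[name] * w for name, w in weights.items())


# Hypothetical scores and weights for two evaluation indexes.
scores = {"job_failure_rate": 90.0, "container_throughput": 80.0}
weights = {"job_failure_rate": 0.5, "container_throughput": 0.5}
health = cluster_health(scores, weights)
```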

The embodiment of the invention can determine the performance health degree of the cluster by analyzing the collected service information of the cluster and the physical resource usage information of the nodes, thereby effectively evaluating the state of the current cluster and solving the prior-art problem that no effective way exists to evaluate the state or score of the current cluster.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for evaluating cluster performance, comprising:

acquiring service information of a cluster and physical resource use information of each node in the cluster, wherein the service information comprises: health information of the service itself, physical resource usage information of the service, and processing capability information of the service;

calculating in real time to obtain the evaluation index information of the cluster based on the service information of the cluster, the physical resource use information of each node in the cluster and a preset evaluation rule; the evaluation rule includes: evaluation indexes of the cluster and weights corresponding to the evaluation indexes, the evaluation indexes comprising: whole-cluster-level Container throughput, job running throughput, job failure rate, whole-cluster file access frequency, whole-cluster file modification frequency, single-job file access frequency, single-job created file number, single-job Container average running time, single-node Container throughput and single-node event handling capacity;

carrying out weighted summation on each evaluation index information of the cluster to determine the performance health degree of the cluster;

the method further comprises: determining the principal component ratio of the node according to the physical resource use information of the node and the service information of the cluster; if the principal component ratio of the node is not within a preset range, determining the node as a difference node; the principal component ratio of a node is the proportion of the event amount processed by the node in unit time to the events of the whole cluster;

calculating the mean square error of the principal component ratio of each node in the cluster according to the principal component ratio of the node; and if the mean square error exceeds a set threshold value, the cluster is an abnormal cluster.

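2. The method of claim 1, wherein the performance health of the cluster is determined by:

$$H = \sum_{i=1}^{n} f(i) \cdot w_i$$

wherein H represents the performance health degree of the cluster, f(i) represents the i-th evaluation index information of the cluster, w_i represents the weight corresponding to the i-th evaluation index, and n represents the number of evaluation indexes of the cluster.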

3. The method of claim 1, wherein after determining that the node is a difference node, the method further comprises:

determining the problem of the difference node according to the difference node and its principal component ratio;

and acquiring answers or optimization modes corresponding to the problems from a preset rule base according to the problems.

4. An apparatus for evaluating cluster performance, comprising: an acquisition module, an analysis module and an evaluation module;

the acquisition module is configured to: acquiring service information of a cluster and physical resource use information of each node in the cluster, wherein the service information comprises: health information of the service itself, physical resource usage information of the service, and processing capability information of the service;

the analysis module is configured to: calculating in real time to obtain the evaluation index information of the cluster based on the service information of the cluster, the physical resource use information of each node in the cluster and a preset evaluation rule; the evaluation rule includes: evaluation indexes of the cluster and weights corresponding to the evaluation indexes, the evaluation indexes comprising: whole-cluster-level Container throughput, job running throughput, job failure rate, whole-cluster file access frequency, whole-cluster file modification frequency, single-job file access frequency, single-job created file number, single-job Container average running time, single-node Container throughput and single-node event handling capacity;

the analysis module is further configured to: determining the principal component ratio of the node according to the physical resource use information of the node and the service information of the cluster; if the principal component ratio of the node is not within a preset range, determining the node as a difference node; the principal component ratio of a node is the proportion of the event amount processed by the node in unit time to the events of the whole cluster;

calculating the mean square error of the principal component ratio of each node in the cluster according to the principal component ratio of the node; if the mean square error exceeds a set threshold, the cluster is an abnormal cluster;

the evaluation module is configured to: carrying out weighted summation on the evaluation index information of the cluster to determine the performance health degree of the cluster.

5. The apparatus of claim 4, wherein the evaluation module is configured to determine the performance health of the cluster by:

$$H = \sum_{i=1}^{n} f(i) \cdot w_i$$

wherein H represents the performance health degree of the cluster, f(i) represents the i-th evaluation index information of the cluster, w_i represents the weight corresponding to the i-th evaluation index, and n represents the number of evaluation indexes of the cluster.

6. The apparatus of claim 4, wherein the analysis module is further configured to:

7. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3.

8. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-3.