CN106445754A

CN106445754A - Method and system for inspecting cluster health status and cluster server

Info

Publication number: CN106445754A
Application number: CN201610822574.6A
Authority: CN
Inventors: 马四腾
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2016-09-13
Filing date: 2016-09-13
Publication date: 2017-02-22

Abstract

The invention discloses a method for inspecting cluster health status, comprising the following steps: setting detection indicators for cluster health status, wherein the detection indicators comprise device performance detection indicators and cluster environment status detection indicators; collecting the status information corresponding to the detection indicators; based on the status information, running the detection script corresponding to each cluster environment status detection indicator and determining the health status of the cluster environment from the detection results; and, based on the status information, running tests with a performance detection program and/or an application performance detection program and determining the cluster health status from the test results. By examining aspects such as cluster service status, hardware performance indicators and application compatibility, the method performs a comprehensive health inspection of a cluster and makes it easier for technicians to troubleshoot the cluster system. The invention also discloses a system and a server for inspecting cluster health status, which have the same beneficial effects.

Description

Method and system for checking cluster health status, and cluster server

Technical field

The present invention relates to the field of computer technology, and in particular to a method and a system for checking cluster health status, and to a cluster server.

Background technology

At present, with the development and increasingly wide application of computer technology, more and more application systems that depend on it have entered people's work and daily lives. Although computer technology is advancing rapidly and the performance and reliability of a single computer keep improving, many real-world requirements are beyond the reach of a single computer. Many fields, such as molecular dynamics and fluid dynamics, rely on high-performance computing as their supporting infrastructure. A high-performance computing cluster, as a complete system, is in most cases built by grouping many servers into a cluster. Because it must provide strong computing power, a cluster easily combines hundreds of servers, and as the number of servers grows, the overall failure rate rises as well. Hardware faults are easy to find, but how to check for system-level faults remains a problem.

Summary of the invention

The object of the present invention is to provide a method, a system and a server for checking cluster health status that perform a comprehensive health check of a cluster by examining aspects such as cluster service status, hardware performance indicators and application compatibility, making it easier for technicians to troubleshoot the cluster system.

To solve the above technical problem, the present invention provides a method for checking cluster health status, comprising:

setting detection indicators for cluster health status, wherein the detection indicators include device performance detection indicators and cluster environment status detection indicators;

collecting the status information corresponding to the detection indicators;

based on the status information, running the detection script corresponding to each cluster environment status detection indicator, and determining the health status of the cluster environment from the detection results;

based on the status information, running tests with a performance detection program and/or an application performance detection program, and determining the cluster health status from the test results.

Wherein running tests with a performance detection program and/or an application performance detection program based on the status information, and determining the cluster health status from the test results, comprises:

when the health status of the cluster environment is determined to be healthy, running tests with the performance detection program and/or the application performance detection program based on the status information, and determining the cluster health status from the test results.

Wherein the method further comprises:

saving the status information and/or the detection results and/or the test results to a log file.

Wherein running tests with the performance detection program and determining the cluster health status from the test results comprises:

running a single-node benchmark test with the performance detection program;

when the test result is below the performance detection threshold, the cluster health status is unhealthy;

when the test result is not below the performance detection threshold, the cluster health status is healthy.

Wherein running tests with the application performance detection program and determining the cluster health status from the test results comprises:

creating the running environment of a predetermined application;

in each running environment, running a small test-case computation based on the corresponding status information to obtain a test result;

when the test result is below the application performance detection threshold, the cluster health status is unhealthy;

when the test result is not below the application performance detection threshold, the cluster health status is healthy.

The present invention also provides a system for checking cluster health status, comprising:

a setting module, configured to set detection indicators for cluster health status, wherein the detection indicators include device performance detection indicators and cluster environment status detection indicators;

a collection module, configured to collect the status information corresponding to the detection indicators;

a cluster environment status detection module, configured to run, based on the status information, the detection script corresponding to each cluster environment status detection indicator, and to determine the health status of the cluster environment from the detection results;

a cluster performance detection module, configured to run tests, based on the status information, with a performance detection program and/or an application performance detection program, and to determine the cluster health status from the test results.

Wherein the system further comprises:

a saving module, configured to save the status information and/or the detection results and/or the test results to a log file.

Wherein the cluster performance detection module comprises: a single-node benchmark test unit, configured to run a single-node benchmark test with the performance detection program; when the test result is below the performance detection threshold, the cluster health status is unhealthy; when the test result is not below the performance detection threshold, the cluster health status is healthy.

Wherein the cluster performance detection module comprises: an application performance detection unit, configured to create the running environment of a predetermined application and, in each running environment, to run a small test-case computation based on the corresponding status information to obtain a test result; when the test result is below the application performance detection threshold, the cluster health status is unhealthy; when the test result is not below the application performance detection threshold, the cluster health status is healthy.

The present invention also provides a cluster server, comprising the system for checking cluster health status according to any of the above.

The method for checking cluster health status provided by the present invention comprises: setting detection indicators for cluster health status, wherein the detection indicators include device performance detection indicators and cluster environment status detection indicators; collecting the status information corresponding to the detection indicators; based on the status information, running the detection script corresponding to each cluster environment status detection indicator, and determining the health status of the cluster environment from the detection results; and, based on the status information, running tests with a performance detection program and/or an application performance detection program, and determining the cluster health status from the test results.

It can be seen that the method performs a comprehensive health check of a cluster by examining aspects such as cluster service status, hardware performance indicators and application compatibility, which makes it easier for technicians to troubleshoot the cluster system. The invention also provides a system and a server for checking cluster health status, which have the same beneficial effects and are not described again here.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from the drawings provided without creative effort.

Fig. 1 is a flow chart of the method for checking cluster health status provided by an embodiment of the present invention;

Fig. 2 is a structural block diagram of the system for checking cluster health status provided by an embodiment of the present invention.

Detailed description of the embodiments

The core of the present invention is to provide a method, a system and a server for checking cluster health status that perform a comprehensive health check of a cluster by examining aspects such as cluster service status, hardware performance indicators and application compatibility, making it easier for technicians to troubleshoot the cluster system.

To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, which is a flow chart of the method for checking cluster health status provided by an embodiment of the present invention, the method may include:

S100: setting detection indicators for cluster health status, wherein the detection indicators include device performance detection indicators and cluster environment status detection indicators;

Specifically, the detection indicators are set according to the user's actual needs. Their specific content is not limited here, and the user may adapt them as requirements change, for example by adding, deleting or modifying detection indicators.

The detection indicators are the points in the high-performance cluster under analysis whose health status needs to be checked. Some concern basic services, for example whether NFS mounts are normal, whether the NIS service is normal, and whether the machines are connected to the network. Others concern machine performance, for example CPU performance, memory performance, network performance and application performance.
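The indicator set described above could be held in a small, editable configuration object. The following is a minimal Python sketch, not the patent's implementation; the class name `DetectionConfig`, the group names and the indicator identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionConfig:
    """Hypothetical container for the two indicator groups named in S100."""

    # Cluster environment status indicators (basic services).
    environment: set = field(
        default_factory=lambda: {"nfs_mount", "nis_service", "network_link"}
    )
    # Device performance indicators.
    performance: set = field(
        default_factory=lambda: {"cpu", "memory", "network", "application"}
    )

    def _group(self, group: str) -> set:
        if group not in ("environment", "performance"):
            raise ValueError(f"unknown indicator group: {group}")
        return getattr(self, group)

    def add(self, group: str, name: str) -> None:
        """Add an indicator to one group."""
        self._group(group).add(name)

    def remove(self, group: str, name: str) -> None:
        """Delete an indicator; removing an absent one is a no-op."""
        self._group(group).discard(name)


# The user adapts the defaults to the cluster at hand.
config = DetectionConfig()
config.add("environment", "ssh_passwordless")
config.remove("performance", "application")
```

Keeping the two groups separate mirrors the split between environment checks (S120) and performance tests (S130).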

The user can choose, in any combination, the points whose health status is to be checked and configure them, and can modify this configuration later. To make the health check more effective, a cluster health check can be run at every login. Further, so that technicians can troubleshoot cluster faults from the detection results or learn the state of the cluster system in time, the detection results can be output to the user as a report, and the user can take further action based on that report.

To save technicians time when reading the report, some detection results can be presented in concrete, vivid forms. Outputting only the abnormal information saves the user further time.

Because the system can be checked at every startup, the user can summarize the long-term state of the cluster system from the successive detection results. So that the user can predict or investigate cluster faults in time from historical data, each detection result can be recorded in a log for later reference.
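Recording every run and reporting only abnormal items could look like the sketch below. The JSON-lines log format and the function names `abnormal_only`, `append_to_log` and `read_history` are assumptions made for illustration; the patent only states that results are recorded in a log.

```python
import json
from datetime import datetime, timezone


def abnormal_only(results):
    """Keep only the items whose status is not 'healthy'."""
    return {name: status for name, status in results.items() if status != "healthy"}


def append_to_log(path, results):
    """Append one timestamped record per run so history can be reviewed later."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")


def read_history(path):
    """Load all recorded runs, oldest first."""
    with open(path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

Appending one record per run keeps the file usable as the historical data from which long-term trends can be summarized.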

S110: collecting the status information corresponding to the detection indicators;

Specifically, the status information can be collected by sending instructions to the operating system of each server in the cluster. It includes the node name and the status of indicators such as CPU, memory and network. Optionally, this status information can be saved to a log file.
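As a rough illustration of the collection step, the sketch below gathers a few values on the local node only. Reading `/proc/meminfo` assumes a Linux node, and the function names and dictionary keys are hypothetical; how instructions are dispatched to the other servers in the cluster is not specified in the text and is left out.

```python
import os
import socket


def parse_mem_total_kb(meminfo_text):
    """Extract the MemTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    return None


def collect_local_status(meminfo_path="/proc/meminfo"):
    """Collect node name, CPU count and total memory for the local node."""
    try:
        with open(meminfo_path, encoding="utf-8") as f:
            mem_total_kb = parse_mem_total_kb(f.read())
    except OSError:
        # Not a Linux node, or the file is unreadable: report memory as unknown.
        mem_total_kb = None
    return {
        "node": socket.gethostname(),
        "cpu_count": os.cpu_count(),
        "mem_total_kb": mem_total_kb,
    }
```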

S120: based on the status information, running the detection script corresponding to each cluster environment status detection indicator, and determining the health status of the cluster environment from the detection results;

Specifically, a series of scripts is created to health-check the cluster environment configuration of each node in the cluster. The checks may cover passwordless SSH access between nodes, the NIS service, the NFS service, directory mounts on each node, and the consistency of each node's configuration. Optionally, this information and the corresponding detection results are saved to a log file. The health status of the cluster environment can be determined from these detection results. The specific decision rule can be set according to user needs; the user may take into account the cluster environment setup and the requirements on health status when setting the rule.

Here, the determination result may simply classify the cluster environment status as healthy or unhealthy, or the cluster environment status may be divided into health grades. Optionally, these detection results can be saved to a log file.
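Once each detection script has reported pass or fail, the environment verdict can be derived from the collected results. The sketch below assumes the simplest decision rule (every check must pass), which is only one of the rules the text allows; the check names and the per-node configuration values are hypothetical.

```python
def configs_consistent(node_configs):
    """True when every node reports the same configuration value."""
    return len(set(node_configs.values())) <= 1


def judge_environment(check_results):
    """Assumed rule: the environment is healthy only if every check passed.

    Returns the verdict and the sorted names of the failed checks.
    """
    failed = sorted(name for name, passed in check_results.items() if not passed)
    status = "healthy" if not failed else "unhealthy"
    return status, failed


# Example: two nodes disagree on a configuration fingerprint.
results = {
    "ssh_passwordless": True,
    "nis_service": True,
    "nfs_service": True,
    "config_consistency": configs_consistent({"node01": "v1", "node02": "v2"}),
}
status, failed = judge_environment(results)
```

Returning the failed check names alongside the verdict supports the report of items that did not pass.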

S130: based on the status information, running tests with a performance detection program and/or an application performance detection program, and determining the cluster health status from the test results.

Specifically, the user may choose to run both the performance detection program and the application performance detection program, or only one of them, and may define the actual content of both programs according to actual needs. Optionally, the corresponding test results can be saved to a log file.

Optionally, running tests with the performance detection program and determining the cluster health status from the test results comprises:

running a single-node benchmark test with the performance detection program;

Specifically, single-node benchmark tests include, for example, HPL (High Performance Linpack, a benchmark for measuring CPU floating-point performance) for CPU performance and STREAM (a benchmark for measuring memory bandwidth) for memory performance. From the collected CPU and memory information, the theoretical value for each benchmark can be calculated, and a threshold defined as a percentage of that value serves as the performance detection threshold. It is typically set to 80% (an empirical value), although no specific performance detection threshold is prescribed here. The measured result is compared with the threshold: a result not below the threshold passes, meaning the cluster health status is healthy; a result below it fails, meaning the cluster health status is unhealthy. The detection result can be displayed. The user may also define health status grades for the cluster, with a corresponding threshold for each grade. When the user runs only the performance test, this result is the cluster health status. If the user also runs the application performance test, this result is the cluster's performance-test health status, and the final cluster health status must also take the application performance test result into account.
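The threshold comparison can be sketched as follows. The peak formula used here (cores × clock in GHz × floating-point operations per cycle) is the usual definition of theoretical peak performance and is an assumption about how the theoretical value is obtained; the function names and the example hardware figures are hypothetical. The 0.8 default reflects the 80% empirical value mentioned above.

```python
def theoretical_peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak of one node in GFLOPS (assumed formula, see lead-in)."""
    return cores * clock_ghz * flops_per_cycle


def judge_benchmark(measured, theoretical, ratio=0.8):
    """Compare a measured benchmark result with a percentage of the theoretical value.

    A result below the threshold is unhealthy; otherwise it is healthy.
    """
    threshold = theoretical * ratio
    return "healthy" if measured >= threshold else "unhealthy"


# Hypothetical node: 16 cores at 2.5 GHz, 16 floating-point operations per cycle.
peak = theoretical_peak_gflops(16, 2.5, 16)
verdict = judge_benchmark(measured=560.0, theoretical=peak)
```

The same `judge_benchmark` comparison applies to a STREAM result if the theoretical memory bandwidth is supplied instead of the peak GFLOPS.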

Optionally, running tests with the application performance detection program and determining the cluster health status from the test results comprises:

creating the running environment of a predetermined application;

Specifically, a running environment is created for typical applications according to the application type, small test cases are provided for computation, and an empirical value is set as the threshold. Comparing the result with the threshold determines whether the cluster passes the health check when running the application. When the user runs only the application performance test, this result is the cluster health status. If the user also runs the performance test, this result is the cluster's application-performance health status, and the final cluster health status must also take the performance test result into account.

The test result here may be the result of a single application or a combined result across multiple applications. The user may also define health status grades for the cluster, with a corresponding threshold for each grade.
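The application-level test could be sketched as below. This is an illustration under several assumptions: the running environment is reduced to a scratch directory, the test result is a throughput score where higher is better, and multiple applications are combined by taking the mean. None of these choices, nor the function names, come from the patent.

```python
import tempfile
import time
from statistics import mean


def run_small_case(case, work_units):
    """Run one small test case in a scratch directory and return its throughput.

    `case` is a hypothetical callable that receives the scratch directory path;
    the returned score is work units per second (higher is better).
    """
    with tempfile.TemporaryDirectory() as workdir:
        start = time.perf_counter()
        case(workdir)
        elapsed = time.perf_counter() - start
    return work_units / elapsed if elapsed > 0 else float("inf")


def combined_score(scores):
    """Assumed combination rule: the mean of the per-application scores."""
    return mean(list(scores.values()))


def judge_application(score, threshold):
    """A score below the empirical threshold is unhealthy; otherwise healthy."""
    return "healthy" if score >= threshold else "unhealthy"
```

A real deployment would replace the callable with a launch of the actual application on a small input and pick the threshold from past runs on known-good hardware.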

If the user runs both kinds of test, the cluster health status can be judged healthy only when every test is healthy. Other decision rules may also be defined according to the detection content the user has configured.

Further, to speed up the cluster health check, step S130 can be executed only after the health status of the cluster environment has been determined to be healthy.

The method is applied to a newly built cluster system to check the health of the cluster, and the detection content can be customized through a configuration file. Typically, a comprehensive health check is performed. When the check finishes, the method can write the detection results to a log file and flag the items that did not pass, so that maintenance personnel can find and fix problems in time and the cluster runs normally and stably. However, when S120 has already detected an error, step S130 can be skipped to save time.
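The ordering of S120 and S130 described above can be expressed as a short control flow. In this sketch the callables passed in are hypothetical stand-ins for the real checks, and treating an unhealthy environment as an unhealthy cluster is an assumption, since the text only says that S130 may be skipped.

```python
def check_cluster(environment_check, performance_tests):
    """Run the environment check first; run performance tests only if it is healthy.

    `environment_check` returns "healthy" or "unhealthy"; `performance_tests`
    maps a test name to a callable returning the same kind of verdict.
    """
    environment = environment_check()
    if environment != "healthy":
        # Assumption: skip S130 and report the cluster as unhealthy.
        return {"environment": environment, "tests": {}, "cluster": "unhealthy"}

    tests = {name: test() for name, test in performance_tests.items()}
    # The cluster is healthy only when every executed test is healthy.
    all_healthy = all(verdict == "healthy" for verdict in tests.values())
    return {
        "environment": environment,
        "tests": tests,
        "cluster": "healthy" if all_healthy else "unhealthy",
    }
```

The returned dictionary keeps the per-step verdicts so they can be logged or shown to the user.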

The result obtained at each of steps S110 to S130 can be displayed to the user, who can decide from the displayed result whether to continue the cluster health status check. Displaying the process also helps the user follow the progress of the check.

Based on the above technical solution, the method for checking cluster health status provided by the embodiment of the present invention performs a comprehensive health check of the cluster by examining aspects such as cluster service status, hardware performance indicators and application compatibility, and outputs a detection report, thereby addressing the problem of locating system-level faults in a cluster.

The system for checking cluster health status and the cluster server provided by the embodiments of the present invention are introduced below. The system and cluster server described below and the method described above may be cross-referenced.

Referring to Fig. 2, which is a structural block diagram of the system for checking cluster health status provided by an embodiment of the present invention, the system may include:

a setting module 100, configured to set detection indicators for cluster health status, wherein the detection indicators include device performance detection indicators and cluster environment status detection indicators;

a collection module 200, configured to collect the status information corresponding to the detection indicators;

a cluster environment status detection module 300, configured to run, based on the status information, the detection script corresponding to each cluster environment status detection indicator, and to determine the health status of the cluster environment from the detection results;

a cluster performance detection module 400, configured to run tests, based on the status information, with a performance detection program and/or an application performance detection program, and to determine the cluster health status from the test results.

Based on the above embodiment, the system further includes:

Based on any of the above embodiments, the cluster performance detection module 400 includes: a single-node benchmark test unit, configured to run a single-node benchmark test with the performance detection program; when the test result is below the performance detection threshold, the cluster health status is unhealthy; when the test result is not below the performance detection threshold, the cluster health status is healthy.

Based on any of the above embodiments, the cluster performance detection module 400 includes: an application performance detection unit, configured to create the running environment of a predetermined application and, in each running environment, to run a small test-case computation based on the corresponding status information to obtain a test result; when the test result is below the application performance detection threshold, the cluster health status is unhealthy; when the test result is not below the application performance detection threshold, the cluster health status is healthy.

Based on any of the above embodiments, the system further includes:

a display module, configured to display the status information and/or the detection results and/or the test results, together with the cluster health status.

An embodiment of the present invention also provides a cluster server, comprising the system for checking cluster health status according to any of the above embodiments.

The embodiments in this description are described progressively. Each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for the relevant details, refer to the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to function. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.

The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The method and system for checking cluster health status and the cluster server provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the invention; the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to the invention without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims

1. A method for checking cluster health status, characterized by comprising:

collecting the status information corresponding to said detection indicators;

based on said status information, running the detection script corresponding to each said cluster environment status detection indicator, and determining the health status of the cluster environment from the detection results;

based on said status information, running tests with a performance detection program and/or an application performance detection program, and determining the cluster health status from the test results.

2. The method according to claim 1, characterized in that running tests with a performance detection program and/or an application performance detection program based on said status information, and determining the cluster health status from the test results, comprises:

when the health status of the cluster environment is determined to be healthy, running tests with the performance detection program and/or the application performance detection program based on said status information, and determining the cluster health status from the test results.

3. The method according to claim 2, characterized by further comprising:

4. The method according to any one of claims 1-3, characterized in that running tests with the performance detection program and determining the cluster health status from the test results comprises:

running a single-node benchmark test with the performance detection program;

5. The method according to any one of claims 1-3, characterized in that running tests with the application performance detection program and determining the cluster health status from the test results comprises:

creating the running environment of a predetermined application;

6. A system for checking cluster health status, characterized by comprising:

a setting module, configured to set detection indicators for cluster health status, wherein said detection indicators include device performance detection indicators and cluster environment status detection indicators;

a cluster environment status detection module, configured to run, based on said status information, the detection script corresponding to each said cluster environment status detection indicator, and to determine the health status of the cluster environment from the detection results;

a cluster performance detection module, configured to run tests, based on said status information, with a performance detection program and/or an application performance detection program, and to determine the cluster health status from the test results.

7. The system according to claim 6, characterized by further comprising:

a saving module, configured to save said status information and/or the detection results and/or the test results to a log file.

8. The system according to claim 6 or 7, characterized in that said cluster performance detection module comprises: a single-node benchmark test unit, configured to run a single-node benchmark test with the performance detection program; when the test result is below the performance detection threshold, the cluster health status is unhealthy; when the test result is not below the performance detection threshold, the cluster health status is healthy.

9. The system according to claim 6 or 7, characterized in that said cluster performance detection module comprises: an application performance detection unit, configured to create the running environment of a predetermined application and, in each running environment, to run a small test-case computation based on the corresponding status information to obtain a test result; when the test result is below the application performance detection threshold, the cluster health status is unhealthy; when the test result is not below the application performance detection threshold, the cluster health status is healthy.

10. A cluster server, characterized by comprising: the system for checking cluster health status according to any one of claims 6-9.