CN101373447A - System and method for detecting health degree of computer cluster - Google Patents

Info

Publication number
CN101373447A
Authority
CN
China
Prior art keywords
module
detection
detecting
computer cluster
health degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100419062A
Other languages
Chinese (zh)
Inventor
寇大治
王涛
袁俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI SUPERCOMPUTER CENTER
Original Assignee
SHANGHAI SUPERCOMPUTER CENTER
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI SUPERCOMPUTER CENTER filed Critical SHANGHAI SUPERCOMPUTER CENTER
Priority to CNA2008100419062A priority Critical patent/CN101373447A/en
Publication of CN101373447A publication Critical patent/CN101373447A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a system and a method for detecting the health degree of a computer cluster, which detect and judge the overall health degree of a computer cluster system and then report the detection results. The detection system consists of layered modules, divided into detection modules and peripheral modules. The detection modules are a hardware detection module, a service detection module, a software detection module and a job detection module; the peripheral modules are an input module, a maintenance module, a user module and an output module. The layered module framework of the detection system has good extensibility and openness: detection modules can be added or removed flexibly without changing the overall framework of the detection system, so that all or part of a computer cluster can be detected.

Description

Health degree detection system and method for a computer cluster
Technical field:
The invention belongs to the field of high-performance computing and in particular relates to a health degree detection system and method for a high-performance computer cluster.
Background art:
With the development of computer hardware, software and parallel computing, high-performance computation and simulation have been applied in more and more fields, and a growing number of organizations are purchasing, building and using high-performance computer clusters. Establishing a complete health degree detection system for computer clusters is therefore particularly important.
The structure of an existing computer cluster is as follows. The whole cluster system is made up of a number of nodes. The simplest cluster can consist of a master node and compute nodes; in a large cluster the master node may be further subdivided into login nodes, storage nodes and so on. The compute nodes bear most of the computational work in the system. Each node can act as a workstation in its own right and has a certain independence, and the systems on the compute nodes are mirror images of one another. All nodes are interconnected by a high-speed network (for example Gigabit Ethernet, Myrinet or InfiniBand), and jobs are distributed to the compute nodes by means such as message passing. Compared with other high-performance computing architectures such as SMP and MPP, the advantage of this loosely coupled cluster, which does not present a single system image, is that it is cheap and easy to build; its drawback is that the same loose coupling makes the system as a whole comparatively complex to manage.
Because a computer cluster has many nodes, scale effects arise, and a cluster system differs fundamentally from a single workstation or server. If a cluster system is checked manually, detection efficiency is low and detection quality cannot be guaranteed.
In addition, some manufacturers have developed detection tools, but most of these are closed systems developed mainly around the manufacturer's own hardware; they are often functionally incomplete and not generally applicable.
Summary of the invention:
In view of this, the object of the invention is to establish a health degree detection system and method for computer clusters that has a well-defined layered framework, in which the different layers are implemented as modules, so that the health degree of a computer cluster can be detected robustly and completely.
The health degree detection system of the computer cluster is characterized in that:
The detection system is composed of layered modules, and the modules comprise detection modules and peripheral modules.
The detection modules comprise:
Hardware detection module 201. This module mainly checks the health status of the various types of hardware, including access nodes, storage nodes, compute nodes and network equipment nodes, as well as auxiliary equipment such as power and cooling equipment. For cluster systems that contain a dedicated network, or for which network characteristics are of special concern, the network-related part of this module can be refined into a separate network detection module placed after this module. Overall, this module detects how each sub-element of the cluster system operates on its own. It amounts to a microscopic, part-by-part inspection: apart from external conditions such as power supply and temperature, detection inside the cluster is carried out mainly around the main components of the various nodes.
Service detection module 202. This module mainly checks the system-level services that the cluster system itself should provide. It includes tests of the login status of all nodes (remote login via telnet, remote execution, remote data transfer and so on), important network services (the network information service, the network file system service and so on), and the integrity and accessibility of any system database that may be used. What this module mainly detects is whether the cluster system has the functional completeness needed to organize the sub-elements into a whole and behave as a single cluster.
Software detection module 203. This module mainly checks the applications that run on top of the operating system, including the main system library files and mathematical library files that need to be called, and the resource scheduling system itself. The job scheduling system is the interface and platform through which end users interact with the cluster system; it determines whether jobs can ultimately run completely and healthily and whether interaction with the end user is complete. Therefore, in addition to general tests of the job scheduling system, further tests are provided for the common applications specific to each cluster. The detection performed by this module yields a judgment on whether the cluster system delivers the application functionality it is meant to provide.
Job detection module 204. This module performs an overall detection of the cluster system. It includes the submission of jobs of various types and scales; the choice of scale and type may also involve some marginal applications, so as to test the overall robustness of the cluster system. The submissions include jobs that rely on the resource scheduling system and jobs that do not; testing both with and without the scheduling system reflects the aim of observing and detecting the whole cluster system from different levels. The module also includes the submission of different jobs under different overall loads on the cluster system. This module detects whether the whole cluster, including hardware and software at every level and the coordinated operation of each subsystem, delivers the functionality of the overall application.
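As an illustration of the service detection layer, the following is a minimal Python sketch that probes service ports on each node, in the spirit of the port-based check the embodiment describes for module 202. The service names, port numbers and the function names `port_open` and `check_services` are assumptions made for this example; the patent does not list specific ports or an implementation.

```python
import socket

# Hypothetical service table: the names and ports are illustrative
# assumptions, not values taken from the patent.
SERVICE_PORTS = {"ssh": 22, "telnet": 23, "rpcbind": 111, "nfs": 2049}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def check_services(nodes, services=SERVICE_PORTS, timeout=2.0):
    """Map each node to the names of the services whose port did not answer.

    Nodes on which every service answered are omitted from the result.
    """
    failures = {}
    for node in nodes:
        down = [name for name, port in services.items()
                if not port_open(node, port, timeout)]
        if down:
            failures[node] = down
    return failures
```

A call such as `check_services(["node01", "node02"])` (hypothetical host names) would return only the nodes with unanswered ports, which could then be handed to the maintenance module. A successful TCP connection shows only that something is listening; it does not prove that the service behaves correctly.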
The peripheral modules comprise:
Input module 101. This module generates a module interface for the modules to be detected; the input configuration information of the detection modules is subsequently obtained from this input module interface. The module can also work interactively as required, forming an interactive input module that collects this input information.
Maintenance module 102. This module mainly evaluates the detected faults and corrects those problems that can be fixed through maintenance; for faults that cannot be repaired or cannot be diagnosed, it isolates the detected faulty component.
User notification module 103. This module starts after the detection modules have completed successfully. Its function is to adjust the system state, including the network state and user permission states, so as to guarantee that, under the appropriate authorization, users can reach each access node and the access nodes can reach the compute nodes.
Output module 104. This module collects and outputs the detection information. The output can also be interactive, forming an interactive output module that selectively outputs the detection information of interest as required.
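The repair-or-isolate behaviour of the maintenance module can be sketched as follows. This is only an illustration under stated assumptions: the `Fault` and `MaintenanceReport` types, the fault-kind labels and the callback signatures are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Fault:
    node: str
    kind: str  # illustrative label, e.g. "service_down" or "disk_full"


@dataclass
class MaintenanceReport:
    repaired: List[Fault] = field(default_factory=list)
    isolated: List[Fault] = field(default_factory=list)


def maintain(faults: List[Fault],
             repair_actions: Dict[str, Callable[[Fault], bool]],
             isolate: Callable[[str], None]) -> MaintenanceReport:
    """Try a known repair for each fault; isolate the node otherwise.

    A fault is isolated when no repair action is registered for its kind
    (it cannot be judged) or when the repair action reports failure.
    """
    report = MaintenanceReport()
    for fault in faults:
        action = repair_actions.get(fault.kind)
        if action is not None and action(fault):
            report.repaired.append(fault)
        else:
            isolate(fault.node)
            report.isolated.append(fault)
    return report
```

Keeping the repair actions in a dictionary keyed by fault kind means new repairs can be registered without touching the triage loop, which matches the patent's emphasis on adding modules without changing the framework.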
As can be seen from the above, the invention adopts a layered, modular design in which each module implements a particular level of detection, and the levels progress step by step: from the part to the whole, from hardware to software, and from the lower layers to the upper layers. Such a design covers every aspect of the whole cluster system. Moreover, the layered module architecture of the proposed detection system has good extensibility and openness: detection modules can be added or removed flexibly without changing the overall architecture of the detection system, so that all or part of a computer cluster can be detected.
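One way to picture this extensibility is an ordered list of detection modules that share a single calling convention. The sketch below is an assumption-laden illustration, not the patent's implementation: the class `DetectionPipeline`, its methods and the convention that a module returns a list of fault descriptions are all invented for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A detection module receives the shared configuration and returns a list
# of fault descriptions; an empty list means the layer looks healthy.
DetectionModule = Callable[[dict], List[str]]


class DetectionPipeline:
    """Ordered collection of detection layers, run from low level to high."""

    def __init__(self) -> None:
        self._modules: List[Tuple[str, DetectionModule]] = []

    def add(self, name: str, module: DetectionModule,
            after: Optional[str] = None) -> None:
        """Append a module, or insert it directly after the named one."""
        entry = (name, module)
        if after is None:
            self._modules.append(entry)
            return
        names = [n for n, _ in self._modules]
        self._modules.insert(names.index(after) + 1, entry)

    def remove(self, name: str) -> None:
        """Drop a module; the remaining layers keep their order."""
        self._modules = [(n, m) for n, m in self._modules if n != name]

    def order(self) -> List[str]:
        return [n for n, _ in self._modules]

    def run(self, config: dict) -> Dict[str, List[str]]:
        """Run every layer in order and collect its faults by layer name."""
        return {name: module(config) for name, module in self._modules}
```

With this shape, a network detection module can be slotted in with `pipeline.add("network", check_network, after="hardware")` (where `check_network` is a hypothetical function) and removed again with `pipeline.remove("network")`, while the other layers and the `run` loop stay unchanged.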
Description of drawings:
Fig. 1 is a system flowchart of the present invention.
Embodiment:
To make the technical scheme and advantages of the present invention clearer, the invention is described in further detail below by way of an embodiment and with reference to the accompanying drawing.
A specific embodiment is given in which the health degree detection system and method proposed by the invention are applied to the Dawning 4000A supercomputer deployed at the Shanghai Supercomputer Center. The following table describes the composition and content of the modules of this embodiment; these modules correspond to the modules shown in Fig. 1.
Referring to Fig. 1, the process begins (start 001) with an inspection before the cluster system is started or restarted. This inspection is an overall judgment of whether the basic prerequisites for starting the system are met, and it is also a preparatory step for startup. It includes checking whether the surrounding environment, for example air conditioning and electrical circuits, is normal, and checking related precautions, such as whether the fire-fighting equipment is complete and whether the relevant personnel are in place.
After the pre-start steps are complete, the system is started. Once the computer cluster has finished starting, the embodiment reads from the input module (101) the modules to be detected and the detection configuration information they require; the input module may also work interactively. Next comes the reachability test (also called an alive test), whose result is likewise passed to each detection module as configuration information. In this step the embodiment uses both the Internet Control Message Protocol (ICMP) and the Secure Shell protocol (SSH). Specifically, it uses the ping and scp tools and the Ping Scan option of the nmap tool ("go no further than determining if host is online"). Once the input module (101) has collected the node list obtained from these tests together with the other configuration information, the process formally enters the detection modules.
The first is the hardware detection module (201). This module mainly uses built-in system commands (uptime, df, free and so on) to check the health status of the various types of hardware, including access nodes, storage nodes, compute nodes and network equipment nodes. Overall, this module detects how each sub-element of the cluster system operates on its own. It amounts to a microscopic, part-by-part inspection, carried out mainly around the main components of the various nodes. For cluster systems with special network equipment, or where network-related equipment needs to be tested separately, a network detection module can be formed for the specific network equipment or requirement. Because network equipment plays an important underlying role in the functioning of the cluster system, it is part of the hardware and likewise belongs to the lowest layer to be detected; an added network detection module is therefore placed after the hardware detection module and before the service detection module.
Next is the service detection module (202). This module mainly checks the system-level services that the cluster system itself should provide, and it does so by probing the various service ports. It covers tests of the login status of all nodes (remote login via telnet, remote execution, remote file transfer and so on), important network services (the network information service, the network file system service and so on), and the integrity and accessibility of any system database that may be used. What this module mainly detects is whether the cluster system has the functional completeness needed to organize the sub-elements into a whole and behave as a single cluster.
Next is the software detection module (203). This module performs its detection by checking the existence or the MD5 checksum of the various system and application software, and it also tests the functionality of the practical applications that the cluster system is to provide. This includes detection of the job scheduling system, which is implemented with general-purpose test cases for the LSF job scheduling system. Once this step is complete, the tests relevant to the cluster as a system in practical use have been finished.
Whether the cluster can smoothly carry out the work required of it is then tested by the job detection module (204), which amounts to macroscopic tests of the overall performance of the cluster. In this step the test jobs are divided into serial jobs and parallel jobs. Submitting serial jobs tests the overall performance of each node in the cluster system: the same test case is submitted to all nodes simultaneously by a recursive method, and the results of the test case are compared to reach a judgment. A test case was chosen that reflects overall node performance and produces clear, detailed output. Submitting parallel jobs simulates the actual operation of end users and tests the overall performance of the cluster system. The tests cover both cases, with and without the job scheduling system, use typical real application jobs, and exercise different job scales. This module thus detects the running condition of the cluster as a whole.
Faults found in the detection modules (201-204) are output to the maintenance module (102). The maintenance module is connected to all detection modules: any error arising in any detection step is passed to it. The module covers simple maintenance and repair; for faulty equipment that the system administrator cannot fix, it outputs a recommendation to replace or isolate the equipment.
The output information from the detection modules (201-204) is collected by the output module (104), which also gathers some system information; the collected information lets the system administrator later compare and assess the state of the system. After this step the system administrator can notify users that the cluster system is available, which is the task of the user notification module (103). At this point all testing work is finished and, through the tests above, a healthy and complete system has been obtained. The user notification module grants end users the appropriate access and usage rights and issues the notification. Because the output module (104) may produce a large amount of output, it can also be replaced by an interactive output module, through which the required information is examined interactively; the benefit is that the output of interest can be found more quickly.
Figure A200810041906D00081
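The existence and MD5 checks described for the software detection module (203) can be illustrated with a short Python sketch. The manifest format (path mapped to an expected digest, or `None` for an existence-only check) and the function names are assumptions made for this example; the patent does not specify how the expected checksums are stored.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(manifest):
    """Check each file in the manifest and return a dict of problems.

    manifest maps a file path to its expected MD5 hex digest, or to None
    when only the existence of the file should be checked (an assumed
    convention for this sketch).
    """
    problems = {}
    for path, expected in manifest.items():
        if not os.path.isfile(path):
            problems[path] = "missing"
        elif expected is not None and md5_of(path) != expected.lower():
            problems[path] = "checksum mismatch"
    return problems
```

An empty result means every listed library or application file is present and unmodified relative to the manifest. MD5 is adequate here for detecting accidental corruption or version drift, but it is not a safeguard against deliberate tampering.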

Claims (3)

1. A health degree detection system for a computer cluster, used to detect and judge the overall health degree of the computer cluster and to provide the resulting detection information, characterized in that the detection system is composed of layered modules, the modules comprising detection modules and peripheral modules:
wherein the detection modules are a hardware detection module, a service detection module, a software detection module and a job detection module, used to detect the corresponding parts of the cluster system;
wherein the peripheral modules are an input module, a maintenance module, a user module and an output module, serving as middleware for the implementation of the detection modules.
2. The health degree detection system for a computer cluster as claimed in claim 1, characterized in that a network detection module is added between the hardware detection module and the service detection module among the detection modules, for computer clusters that use a dedicated network as the computing network.
3. The health degree detection system for a computer cluster as claimed in claim 1, characterized in that, among the peripheral modules, the input module is an interactive input module and the output module is an interactive output module, used for interactive customization of the detection information and interactive, selective viewing of the output information.
CNA2008100419062A 2008-08-20 2008-08-20 System and method for detecting health degree of computer cluster Pending CN101373447A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100419062A CN101373447A (en) 2008-08-20 2008-08-20 System and method for detecting health degree of computer cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008100419062A CN101373447A (en) 2008-08-20 2008-08-20 System and method for detecting health degree of computer cluster

Publications (1)

Publication Number Publication Date
CN101373447A true CN101373447A (en) 2009-02-25

Family

ID=40447621

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100419062A Pending CN101373447A (en) 2008-08-20 2008-08-20 System and method for detecting health degree of computer cluster

Country Status (1)

Country Link
CN (1) CN101373447A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102025776A (en) * 2010-11-16 2011-04-20 山东中创软件工程股份有限公司 Disaster tolerant control method, device and system
CN102053865A (en) * 2011-01-05 2011-05-11 南京大学 Environment detection platform system based on software main body and environment detection method thereof
CN102902598A (en) * 2012-09-10 2013-01-30 曙光信息产业(北京)有限公司 Resource detecting and preprocessing method combined with job scheduling system
CN103607297A (en) * 2013-11-07 2014-02-26 上海爱数软件有限公司 Fault processing method of computer cluster system
CN103902442A (en) * 2012-12-25 2014-07-02 中国移动通信集团公司 Method and system for evaluating cloud software health degree
CN105681070A (en) * 2014-11-21 2016-06-15 中芯国际集成电路制造(天津)有限公司 Method and system for automatically collecting and analyzing computer cluster node information
CN106445754A (en) * 2016-09-13 2017-02-22 郑州云海信息技术有限公司 Method and system for inspecting cluster health status and cluster server
CN106656675A (en) * 2017-01-03 2017-05-10 北京奇虎科技有限公司 Method and device for detecting transmission node cluster
CN109885435A (en) * 2019-02-18 2019-06-14 国家计算机网络与信息安全管理中心 The test macro and method of universal container computing cluster
CN110674034A (en) * 2019-09-12 2020-01-10 北京浪潮数据技术有限公司 Health examination method and device, electronic equipment and storage medium

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102025776A (en) * 2010-11-16 2011-04-20 山东中创软件工程股份有限公司 Disaster tolerant control method, device and system
CN102053865A (en) * 2011-01-05 2011-05-11 南京大学 Environment detection platform system based on software main body and environment detection method thereof
CN102053865B (en) * 2011-01-05 2012-11-21 南京大学 Environment detection platform system based on software main body and environment detection method thereof
CN102902598B (en) * 2012-09-10 2015-08-19 曙光信息产业(北京)有限公司 A kind of resources measurement preprocess method combined with job scheduling system
CN102902598A (en) * 2012-09-10 2013-01-30 曙光信息产业(北京)有限公司 Resource detecting and preprocessing method combined with job scheduling system
CN103902442A (en) * 2012-12-25 2014-07-02 中国移动通信集团公司 Method and system for evaluating cloud software health degree
CN103902442B (en) * 2012-12-25 2016-11-23 中国移动通信集团公司 A kind of cloud software health degree evaluating method and system
CN103607297A (en) * 2013-11-07 2014-02-26 上海爱数软件有限公司 Fault processing method of computer cluster system
CN103607297B (en) * 2013-11-07 2017-02-08 上海爱数信息技术股份有限公司 Fault processing method of computer cluster system
CN105681070A (en) * 2014-11-21 2016-06-15 中芯国际集成电路制造(天津)有限公司 Method and system for automatically collecting and analyzing computer cluster node information
CN106445754A (en) * 2016-09-13 2017-02-22 郑州云海信息技术有限公司 Method and system for inspecting cluster health status and cluster server
CN106656675A (en) * 2017-01-03 2017-05-10 北京奇虎科技有限公司 Method and device for detecting transmission node cluster
CN106656675B (en) * 2017-01-03 2020-01-21 北京奇虎科技有限公司 Detection method and device for transmission node cluster
CN109885435A (en) * 2019-02-18 2019-06-14 国家计算机网络与信息安全管理中心 The test macro and method of universal container computing cluster
CN110674034A (en) * 2019-09-12 2020-01-10 北京浪潮数据技术有限公司 Health examination method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN101373447A (en) System and method for detecting health degree of computer cluster
CN106330593B (en) Protocol detection method and device
US7894357B2 (en) Capability-based testing and evaluation of network performance
JP6336606B2 (en) Method and apparatus for visual network operation and maintenance
CN103795749B (en) The method and apparatus operating in the problem of software product in cloud environment for diagnosis
CN109388530A (en) Blade server-oriented automatic test platform and test method
CN108092854B (en) Test method and device for train-level Ethernet equipment based on IEC61375 protocol
CN109196565A (en) Data concentrator and its operating method
US10289522B2 (en) Autonomous information technology diagnostic checks
Pontes et al. Izinto: A pattern-based IoT testing framework
Aguzzi et al. Modron: A scalable and interoperable web of things platform for structural health monitoring
KR20210113155A (en) LED display update configuration method, receiving card, LED display module and LED display
US20080127083A1 (en) Method and system for combining multiple benchmarks
Manvell Utilising the strengths of different sound sensor networks in smart city noise management
Montori et al. An iot toolchain architecture for planning, running and managing a complete condition monitoring scenario
Fadhil et al. A survey on Internet of Things (IoT) testing
CN109189679A (en) Interface test method and system, electronic equipment, storage medium
CN114598680B (en) Domain name management method, device and storage medium
CN110188040A (en) A kind of software platform for software systems fault detection and health state evaluation
Kundu et al. Collaborative and accountable hardware governance using blockchain
CN110048909A (en) Network O&M method and device
CN110971478A (en) Pressure measurement method and device for cloud platform service performance and computing equipment
US20220236079A1 (en) Systems and methods for automated metrology
Kim et al. HDF: Hybrid debugging framework for distributed network environments
Cheng et al. New remote monitoring and control system architectures based on cloud computing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20090225