CN101373447A - System and method for detecting health degree of computer cluster - Google Patents
System and method for detecting health degree of computer cluster
- Publication number
- CN101373447A CNA2008100419062A CN200810041906A
- Authority
- CN
- China
- Prior art keywords
- module
- detection
- detecting
- computer cluster
- health degree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Debugging And Monitoring (AREA)
- Test And Diagnosis Of Digital Computers (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention discloses a system and a method for detecting the health degree of a computer cluster, which can be used to detect and judge the overall health degree of a computer cluster system and then report the detection results. The detection system comprises layered modules, divided into detection modules and peripheral modules. The detection modules are a hardware detection module, a service detection module, a software detection module and a job detection module; the peripheral modules are an input module, a maintenance module, a user module and an output module. The layered, modular framework of the detection system is extensible and open: detection modules can be added or removed flexibly without changing the overall framework, so that all or part of a computer cluster can be detected.
Description
Technical field:
The invention belongs to the field of high-performance computing, and in particular relates to a health degree detection system and method for high-performance computer clusters.
Background technology:
With the development of computer hardware, software and parallel computing, high-performance computing and simulation methods have been applied in more and more fields. A growing number of organizations have begun to purchase, build and use high-performance computer clusters. It is therefore particularly important to establish a complete health degree detection system for computer clusters.
Existing computer clusters are structured as follows. The whole cluster system consists of a number of nodes. The simplest cluster can consist of a master node and compute nodes; in a large cluster the master node may be further subdivided into login nodes, storage nodes and so on. The compute nodes bear most of the computing tasks in such a system. Each node could serve as a standalone workstation and has a degree of independence, and the systems on the compute nodes mirror one another. All nodes are interconnected by a high-speed network (for example Gigabit Ethernet, Myrinet or InfiniBand), and jobs are distributed across the compute nodes by means such as message passing. Compared with other high-performance computing architectures such as SMP and MPP, this loosely coupled cluster, which does not present a single system image, has the advantage of low construction cost and ease of building and implementation; its disadvantage is that the loose coupling of the overall system makes management comparatively complex.
Because a computer cluster has many nodes, a scale effect arises, and a cluster system differs fundamentally from a single workstation or server. If a cluster system is checked manually, detection efficiency is low and detection quality cannot be guaranteed.
In addition, some manufacturers have developed detection tools, but most of these tools are closed systems developed mainly for the manufacturer's own hardware; they are often functionally incomplete and lack generality.
Summary of the invention:
In view of this, the object of the invention is to establish a health degree detection system and method for computer clusters. The system has a well-structured layered framework, and the different layers are implemented as modules, so that the health degree of a computer cluster can be detected robustly and completely.
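The health degree detection system of this computer cluster is characterized in that:
The detection system is composed of layered modules, which comprise detection modules and peripheral modules.
The detection modules comprise: a hardware detection module, a service detection module, a software detection module and a job detection module.
The peripheral modules comprise: an input module, a maintenance module, a user module and an output module.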
As can be seen from the above, the invention adopts a layered, modular design in which different modules implement particular levels of detection. These detections proceed step by step: from part to whole, from hardware to software, and from lower layers to upper layers. Such a design can cover every aspect of the whole cluster system. Moreover, the layer and module architecture proposed by the invention is extensible and open: detection modules can be added or removed flexibly without changing the overall architecture of the detection system, so that all or part of a computer cluster can be detected.
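The layered, modular arrangement described above can be illustrated with a minimal Python sketch. The class `DetectionPipeline`, its methods and the stub checks are hypothetical names introduced for illustration only and are not taken from the patent; the sketch merely shows how detection modules could run in a fixed low-to-high order while being added or removed without touching the surrounding framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A check receives the shared configuration and returns (ok, message).
Config = Dict[str, List[str]]
Check = Callable[[Config], Tuple[bool, str]]


@dataclass
class DetectionPipeline:
    """Ordered list of named detection modules (hypothetical structure)."""
    modules: List[Tuple[str, Check]] = field(default_factory=list)

    def add(self, name: str, check: Check) -> None:
        """Append a module at the top of the current layer order."""
        self.modules.append((name, check))

    def insert_after(self, existing: str, name: str, check: Check) -> None:
        """Insert a module directly after an already registered one."""
        names = [n for n, _ in self.modules]
        self.modules.insert(names.index(existing) + 1, (name, check))

    def remove(self, name: str) -> None:
        """Drop a module; the remaining modules keep their order."""
        self.modules = [(n, c) for n, c in self.modules if n != name]

    def run(self, config: Config):
        """Run modules in order; return (report, faults)."""
        report, faults = [], []
        for name, check in self.modules:
            ok, message = check(config)
            report.append((name, ok, message))   # would go to the output module
            if not ok:
                faults.append((name, message))   # would go to the maintenance module
        return report, faults


# Stub checks standing in for real detection logic (assumptions).
def hardware_check(config): return True, f"{len(config['nodes'])} nodes inspected"
def network_check(config): return False, "switch not responding"
def service_check(config): return True, "login and file services reachable"
def software_check(config): return True, "required software present"
def job_check(config): return True, "test jobs completed"


pipeline = DetectionPipeline()
pipeline.add("hardware", hardware_check)
pipeline.add("service", service_check)
pipeline.add("software", software_check)
pipeline.add("job", job_check)
# An optional network module slots in between hardware and service.
pipeline.insert_after("hardware", "network", network_check)

report, faults = pipeline.run({"nodes": ["node01", "node02"]})
```

Keeping the modules in a plain ordered list is one simple way to express the bottom-up order; a real system would likely also pass results from earlier layers to later ones.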
Description of drawings:
Fig. 1 is a system flowchart of the present invention.
Embodiment:
To make the technical scheme and advantages of the invention clearer, the invention is described in further detail below by way of an embodiment and with reference to the accompanying drawing.
A specific embodiment is given in which the health degree detection system and method proposed by the invention are applied to the Dawning 4000A supercomputer deployed at the Shanghai Supercomputer Center. The following table describes the modules of this embodiment and their contents; these modules correspond to the modules in Fig. 1.
Referring to Fig. 1, the first step is an inspection before the cluster system is started or restarted (start, 001). This inspection is a general judgment of whether the basic prerequisites for starting the system are met, and it also serves as a preparation step for startup. It includes checking the peripheral environment, for example whether the air conditioning and electrical circuits are normal, and related precautions, including whether the fire-fighting equipment is complete and whether the relevant personnel are in place.
After the pre-startup steps are complete, the system startup procedure begins. Once the computer cluster has started, this embodiment reads from the input module (101) the modules to be run and the detection configuration information they require; the input module may also use an interactive input mode. A reachability test (also called an alive test) follows, and its result is passed to each subsequent detection module as configuration information. In this step the embodiment uses both the Internet Control Message Protocol (ICMP) and the Secure Shell protocol (SSH); specifically, it uses the ping and scp tools and the Ping Scan option of the nmap tool ("go no further than determining if host is online"). When the input module (101) has collected the node list obtained from these tests together with the other configuration information, the detection modules proper are entered.
The first is the hardware detection module (201). This module mainly uses built-in system commands (uptime, df, free and so on) to check the health of the various types of hardware, including access nodes, storage nodes, compute nodes and network equipment nodes. Overall, this module detects the state of each sub-element of the cluster system running independently. It amounts to a microscopic, part-by-part detection carried out mainly around the main components of each node. For cluster systems with special network equipment, or where the network equipment needs to be detected separately, a network detection module can be formed for the specific network equipment or requirement. Because network equipment plays an important underlying role in the functioning of the cluster system and is part of the hardware, it likewise belongs to the bottom layer to be detected; an added network detection module therefore needs to be placed after the hardware detection module and before the service detection module.
Next is the service detection module (202). This module detects the system-level services that the cluster system itself should provide, by checking the ports of the various services. It covers: login tests on all nodes, including remote login, remote execution and remote file transfer; important network services, including the network information service and the network file system (NFS) service; and the integrity and accessibility of any system database that may be used. This module mainly detects whether the cluster system has the functional completeness needed to organize its sub-elements into a whole and to behave as a single cluster.
Next is the software detection module (203). This module checks the various system software and application software by existence or by MD5 checksum, and also tests the functionality of the practical applications that the cluster system is to provide. This includes testing the job scheduling system, which in this embodiment is done with generic test cases for the LSF job scheduling system. When this step is finished, the tests of the cluster as a system in actual use are complete.
Whether the cluster can smoothly carry out the required work, however, is tested by the job detection module (204) that follows, which performs macroscopic tests of the cluster's overall performance. In this step the test jobs are divided into serial jobs and parallel jobs. Submitting serial jobs tests the overall performance of each node in the cluster system: the same test case is submitted to all nodes at the same time by a recursive method, and the results of the case are compared in order to reach a judgment. The case is chosen so that it reflects the overall performance of a node and produces friendly, thorough output. Submitting parallel jobs simulates the actual operations of end users and tests the overall performance of the cluster system. The tests cover both the case that relies on the job scheduling system and the case that does not; typical practical application jobs are chosen and different job scales are tested. This module detects the running state of the cluster system as a whole.
Faults found in the detection modules (201-204) are output to the maintenance module (102). The maintenance module is connected to all detection modules: any error that occurs in any detection step is passed to it. This module covers simple maintenance and repair, and for faulty equipment that the system administrator cannot fix it outputs a replacement or isolation decision.
Output information from the detection modules (201-204) is collected in the output module (104). This module also collects some system information, which the system administrator can later use to compare and judge the state of the system. Once this step is complete, the system administrator can notify users that the cluster system is available; this is the task of the user module (103). At this point all testing is complete, and the tests above have yielded a healthy and fully functional system. This module includes granting end users appropriate access and usage rights and issuing the notification.
Because the output module (104) may receive a large amount of output information, it can also be replaced with an interactive output module, in which the required information is viewed interactively. The benefit is that the output of interest can be queried more quickly.
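The check performed by the software detection module (203) in this embodiment, verifying software by existence or by MD5 checksum, could be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names `md5_of_file` and `check_software`, and the manifest format (a path mapped to an expected hex digest, or to `None` for an existence-only check), are assumptions introduced here.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_software(manifest: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Check each path in the manifest (hypothetical format).

    manifest maps a file path to its expected MD5 hex digest, or to None
    when only the existence of the file should be verified.
    Returns a status per path: 'ok', 'missing' or 'checksum mismatch'.
    """
    results: Dict[str, str] = {}
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file():
            results[name] = "missing"
        elif expected is not None and md5_of_file(path) != expected.lower():
            results[name] = "checksum mismatch"
        else:
            results[name] = "ok"
    return results
```

In a cluster setting, a function like this would presumably be run on every node in the reachable node list, with any status other than `ok` passed on to the maintenance module.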
Claims (3)
1. A health degree detection system for a computer cluster, used to detect and judge the overall health degree of the computer cluster and to provide the detection results, characterized in that the detection system is composed of layered modules, the modules comprising detection modules and peripheral modules:
wherein the detection modules are a hardware detection module, a service detection module, a software detection module and a job detection module, used to detect the relevant parts of the cluster system;
wherein the peripheral modules are an input module, a maintenance module, a user module and an output module, serving as middleware for the implementation of the detection modules.
2. The health degree detection system for a computer cluster according to claim 1, characterized in that a network detection module is added between the hardware detection module and the service detection module among the detection modules, for computer clusters that use a special-purpose network as the computing network.
3. The health degree detection system for a computer cluster according to claim 1, characterized in that, among the peripheral modules, the input module is an interactive input module and the output module is an interactive output module, used for interactive customization of the detection information and interactive, selective viewing of the output information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008100419062A CN101373447A (en) | 2008-08-20 | 2008-08-20 | System and method for detecting health degree of computer cluster |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008100419062A CN101373447A (en) | 2008-08-20 | 2008-08-20 | System and method for detecting health degree of computer cluster |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101373447A true CN101373447A (en) | 2009-02-25 |
Family
ID=40447621
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2008100419062A Pending CN101373447A (en) | 2008-08-20 | 2008-08-20 | System and method for detecting health degree of computer cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101373447A (en) |
-
2008
- 2008-08-20 CN CNA2008100419062A patent/CN101373447A/en active Pending
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102025776A (en) * | 2010-11-16 | 2011-04-20 | 山东中创软件工程股份有限公司 | Disaster tolerant control method, device and system |
CN102053865A (en) * | 2011-01-05 | 2011-05-11 | 南京大学 | Environment detection platform system based on software main body and environment detection method thereof |
CN102053865B (en) * | 2011-01-05 | 2012-11-21 | 南京大学 | Environment detection platform system based on software main body and environment detection method thereof |
CN102902598B (en) * | 2012-09-10 | 2015-08-19 | 曙光信息产业(北京)有限公司 | A kind of resources measurement preprocess method combined with job scheduling system |
CN102902598A (en) * | 2012-09-10 | 2013-01-30 | 曙光信息产业(北京)有限公司 | Resource detecting and preprocessing method combined with job scheduling system |
CN103902442A (en) * | 2012-12-25 | 2014-07-02 | 中国移动通信集团公司 | Method and system for evaluating cloud software health degree |
CN103902442B (en) * | 2012-12-25 | 2016-11-23 | 中国移动通信集团公司 | A kind of cloud software health degree evaluating method and system |
CN103607297A (en) * | 2013-11-07 | 2014-02-26 | 上海爱数软件有限公司 | Fault processing method of computer cluster system |
CN103607297B (en) * | 2013-11-07 | 2017-02-08 | 上海爱数信息技术股份有限公司 | Fault processing method of computer cluster system |
CN105681070A (en) * | 2014-11-21 | 2016-06-15 | 中芯国际集成电路制造(天津)有限公司 | Method and system for automatically collecting and analyzing computer cluster node information |
CN106445754A (en) * | 2016-09-13 | 2017-02-22 | 郑州云海信息技术有限公司 | Method and system for inspecting cluster health status and cluster server |
CN106656675A (en) * | 2017-01-03 | 2017-05-10 | 北京奇虎科技有限公司 | Method and device for detecting transmission node cluster |
CN106656675B (en) * | 2017-01-03 | 2020-01-21 | 北京奇虎科技有限公司 | Detection method and device for transmission node cluster |
CN109885435A (en) * | 2019-02-18 | 2019-06-14 | 国家计算机网络与信息安全管理中心 | The test macro and method of universal container computing cluster |
CN110674034A (en) * | 2019-09-12 | 2020-01-10 | 北京浪潮数据技术有限公司 | Health examination method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101373447A (en) | System and method for detecting health degree of computer cluster | |
CN106330593B (en) | Protocol detection method and device | |
US7894357B2 (en) | Capability-based testing and evaluation of network performance | |
JP6336606B2 (en) | Method and apparatus for visual network operation and maintenance | |
CN103795749B (en) | The method and apparatus operating in the problem of software product in cloud environment for diagnosis | |
CN109388530A (en) | Blade server-oriented automatic test platform and test method | |
CN108092854B (en) | Test method and device for train-level Ethernet equipment based on IEC61375 protocol | |
CN109196565A (en) | Data concentrator and its operating method | |
US10289522B2 (en) | Autonomous information technology diagnostic checks | |
Pontes et al. | Izinto: A pattern-based IoT testing framework | |
Aguzzi et al. | Modron: A scalable and interoperable web of things platform for structural health monitoring | |
KR20210113155A (en) | LED display update configuration method, receiving card, LED display module and LED display | |
US20080127083A1 (en) | Method and system for combining multiple benchmarks | |
Manvell | Utilising the strengths of different sound sensor networks in smart city noise management | |
Montori et al. | An iot toolchain architecture for planning, running and managing a complete condition monitoring scenario | |
Fadhil et al. | A survey on Internet of Things (IoT) testing | |
CN109189679A (en) | Interface test method and system, electronic equipment, storage medium | |
CN114598680B (en) | Domain name management method, device and storage medium | |
CN110188040A (en) | A kind of software platform for software systems fault detection and health state evaluation | |
Kundu et al. | Collaborative and accountable hardware governance using blockchain | |
CN110048909A (en) | Network O&M method and device | |
CN110971478A (en) | Pressure measurement method and device for cloud platform service performance and computing equipment | |
US20220236079A1 (en) | Systems and methods for automated metrology | |
Kim et al. | HDF: Hybrid debugging framework for distributed network environments | |
Cheng et al. | New remote monitoring and control system architectures based on cloud computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Open date: 20090225 |