CN108769170A

CN108769170A - Cluster network fault self-checking system and method

Info

Publication number: CN108769170A
Application number: CN201810479418.3A
Authority: CN
Inventors: 李俊
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2018-05-18
Filing date: 2018-05-18
Publication date: 2018-11-06

Abstract

The invention discloses a cluster network fault self-checking system and method, belonging to the field of server high-performance computing. In the system, a detection program module and an IPMI network module are configured on the master node, and a network driver module, a network IP module, and an IPMI network module are installed or configured on each compute node. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network. The master node is connected to each compute node through the network driver module, and network connectivity is checked by pinging the network IP module. The detection program module detects the working state of each compute node and generates a detection result. The system can compile statistics on the types of network faults between nodes and perform simple repairs, while recording and reporting intractable faults and hardware-level faults, and therefore has good practical value.

Description

Cluster network fault self-checking system and method

Technical field

The present invention relates to the field of server high-performance computing, and specifically provides a cluster network fault self-checking system and method.

Background art

A typical computing cluster consists of a large number of nodes, including a master node, compute nodes, and storage units. The compute nodes carry the computing workload and provide a large number of CPU cores, memory, and so on. The storage units provide a large amount of disk space for storing data. The master node is the core of the entire cluster and handles core activities such as submitting tasks, distributing computation, and managing jobs. The network is indispensable to this arrangement, and the network components are critical to the stable operation of a cluster.

The network carries the communication between nodes. A network fault can have serious consequences, such as parallel computation becoming impossible, jobs failing to submit, or the entire cluster becoming unmanageable, so a stable network is a major prerequisite for normal use of the cluster. Common networks include ordinary Gigabit Ethernet, 10 Gigabit Ethernet, 100 Gb Ethernet, InfiniBand high-speed computing networks, FC networks, and OPA computing networks. Each type of network usually has its own role in the cluster, and a fault in one network can disable the functions that network is responsible for, reducing cluster performance or availability. As the cluster ages, this instability becomes more pronounced and the operations and maintenance workload grows accordingly. A faulty node can return to service only after its fault has been handled. If faults are ignored, resources are wasted and the effective computing scale of the cluster shrinks.

Summary of the invention

In view of the above problems, the technical task of the present invention is to provide a cluster network fault self-checking system that can compile statistics on the types of network faults between nodes and perform simple repairs, while recording and reporting intractable faults and hardware-level faults.

A further technical task of the present invention is to provide a cluster network fault self-checking method.

To achieve the above objects, the present invention provides the following technical solutions:

A cluster network fault self-checking system, in which the cluster comprises a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network. A detection program module and an IPMI network module are configured on the master node; a network driver module, a network IP module, and an IPMI network module are installed or configured on each compute node. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network. The master node is connected to each compute node through the network driver module, and network connectivity is checked by pinging the network IP module. The detection program module detects the working state of each compute node and generates a detection result. When an unknown fault occurs on a link and the master node detects abnormal communication, the master node uses the IPMI network module to restart the compute node that needs adjustment, or resets the network driver module on that compute node.

IPMI stands for Intelligent Platform Management Interface.

The IB network is an InfiniBand network.

The OPA network is an Omni-Path network.

Besides pinging the network IP module, network connectivity can also be checked by copying a file or sending a few data packets.
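The ping-based connectivity check can be sketched as follows. This is a minimal illustration, not the patented detection program: the function names are hypothetical, and the `-c` (count) and `-W` (reply timeout) options assume the Linux iputils `ping`.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def build_ping_command(ip: str, count: int = 3, timeout_s: int = 2) -> List[str]:
    """Build a Linux iputils-style ping command (assumed: -c count, -W reply timeout)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), ip]


def is_reachable(ip: str, run: Callable = subprocess.run) -> bool:
    """Return True when ping exits with status 0, i.e. at least one reply arrived."""
    result = run(
        build_ping_command(ip),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_nodes(ips: Iterable[str], probe: Callable[[str], bool] = is_reachable) -> Dict[str, bool]:
    """Map each compute-node IP to its connectivity state."""
    return {ip: probe(ip) for ip in ips}
```

The `run` and `probe` parameters exist only so the check can be exercised without a live network; on the master node the defaults would be used.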

Preferably, if a compute node has a hardware fault or a boot logic fault and cannot be recovered by restarting it through the IPMI network module, a system log entry is recorded and an alarm is raised.

Network connectivity is checked by pinging the network IP module, the state is evaluated, and abnormal states are handled: if the Gigabit Ethernet management network is unreachable, the node can be reset through the IPMI network module; if the 10 Gigabit network, IB network, or OPA network is unreachable, the network card can be reset. If both checks pass after this handling, the network card state is inspected with the status tools that ship with the advanced network component drivers, such as ibstat, ibstatus, and opainfo, to determine the current working state of the network. If the state is normal, the procedure moves on to the next step; if it is abnormal, the network card driver or the node is reset, the check is repeated, and the information is recorded. If the state is still abnormal, the faulty node is restarted over the IPMI network. If the node is normal after the restart, the procedure ends; otherwise a system log entry is recorded and an alarm is raised.
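As an illustration of the network card state check, the sketch below parses ibstat-style text and lists the ports that are not in the "Active" / "LinkUp" state. The sample text is a shortened, illustrative excerpt, and the function names are hypothetical; real ibstat output contains more fields and its layout may vary between versions.

```python
import re
from typing import Dict, List, Tuple

# Illustrative, shortened ibstat-style output (assumed layout, not captured from a real node).
SAMPLE = """\
CA 'mlx5_0'
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
CA 'mlx5_1'
    Number of ports: 1
    Port 1:
        State: Down
        Physical state: Polling
"""


def parse_ibstat(text: str) -> Dict[Tuple[str, int], Dict[str, str]]:
    """Collect the 'key: value' fields of every port, keyed by (adapter name, port number)."""
    ports: Dict[Tuple[str, int], Dict[str, str]] = {}
    ca, port = None, None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'", line)
        if m:
            ca, port = m.group(1), None  # a new adapter section starts
            continue
        m = re.match(r"Port (\d+):", line)
        if m and ca is not None:
            port = int(m.group(1))
            ports[(ca, port)] = {}
            continue
        if port is not None and ":" in line:
            key, value = line.split(":", 1)
            ports[(ca, port)][key.strip()] = value.strip()
    return ports


def unhealthy_ports(text: str) -> List[Tuple[str, int]]:
    """Return ports whose logical or physical state is not the healthy value."""
    return [
        key
        for key, fields in parse_ibstat(text).items()
        if fields.get("State") != "Active" or fields.get("Physical state") != "LinkUp"
    ]
```

A detection script could feed the captured output of the status tool into `unhealthy_ports` and treat a non-empty result as the abnormal branch described above.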

A cluster network fault self-checking method, in which the cluster comprises a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network, and the master node is connected to each compute node through the network driver module. The master node restarts a compute node that needs adjustment through the IPMI network module; if the restart through the IPMI network module cannot recover the node, a system log entry is recorded and an alarm is raised.

Preferably, a detection program module is configured on the master node; the detection program module detects the working state of each compute node and generates a detection result.

Preferably, the method specifically comprises the following steps:

S1: The master node starts automatic execution of the scheduled task;

S2: Check the Gigabit Ethernet management network first and determine its state. If it is abnormal, determine the type of abnormality: if the node is abnormal, restart the node and return to step S1; if the network is abnormal, record a log entry and raise an alarm. If it is normal, go to step S3;

S3: Check network connectivity by pinging the network IP module and determine the network state. If the network is not connected, reset the network card or reinstall the driver; if it is still abnormal, record a log entry and raise an alarm. If it is normal, go to step S4;

S4: Collect the network card connection state with the bundled status program and determine the state. If it is normal, end the task; otherwise record a log entry and raise an alarm.

Preferably, the master node starts automatic execution of the scheduled task by means of the crond service, in which the master node is configured to run the detection and judgment script at a set frequency.
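For illustration, a scheduled-task entry of this kind could be generated as shown below. The text does not specify the frequency, script path, or log path, so all three are hypothetical parameters; the entry uses the per-user crontab format (five time fields followed by the command).

```python
def cron_entry(interval_minutes: int, script_path: str, log_path: str) -> str:
    """Build a per-user crontab line that runs the detection script every N minutes."""
    if not 1 <= interval_minutes <= 59:
        raise ValueError("interval_minutes must be between 1 and 59")
    # "*/N" in the minute field means "every N minutes"; output is appended to the log file.
    return f"*/{interval_minutes} * * * * {script_path} >> {log_path} 2>&1"


# Hypothetical paths, for illustration only.
print(cron_entry(10, "/opt/cluster/netcheck.sh", "/var/log/netcheck.log"))
```

The resulting line would be added to the master node's crontab so that crond runs the script at the chosen interval.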

Preferably, the cluster network fault self-checking method further comprises the following step:

S5: Manually shut down the network service of one compute node so that its communication with the master node fails, then run the detection and judgment script. If the script detects the fault and restores the network, the configuration is correct.

Compared with the prior art, the cluster network fault self-checking method of the present invention has the following notable beneficial effects: the method can compile statistics on the types of network faults between nodes and perform simple repairs, while recording and reporting intractable faults and hardware-level faults. This reduces the maintenance workload, shortens maintenance time, helps keep the cluster running stably, and gives the method good practical value.

Description of the drawings

Fig. 1 is a flowchart of the cluster network fault self-checking method of the present invention.

Detailed description

The cluster network fault self-checking system and method of the present invention are described in further detail below with reference to the drawing and embodiments.

Embodiment 1

In the cluster network fault self-checking system of the present invention, the cluster comprises a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network.

A detection program module and an IPMI network module are configured on the master node, and a network driver module, a network IP module, and an IPMI network module are installed or configured on each compute node. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network. The master node is connected to each compute node through the network driver module, and network connectivity is checked by pinging the network IP module. The detection program module detects the working state of each compute node and generates a detection result. When an unknown fault occurs on a link and the master node detects abnormal communication, the master node uses the IPMI network module to restart the compute node that needs adjustment, or resets the network driver module on that compute node. If a compute node has a hardware fault or a boot logic fault and the restart through the IPMI network module cannot be completed, a system log entry is recorded and an alarm is raised to notify maintenance personnel.
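The restart over the IPMI network can be sketched with the ipmitool command-line client. The patent does not name a specific tool, so the use of ipmitool is an assumption, as are the host address, user name, and password file shown in the usage line; the functions only build the command lines and do not execute them.

```python
from typing import List


def ipmi_command(host: str, user: str, password_file: str, *action: str) -> List[str]:
    """Build an ipmitool command addressed to a node's BMC over the lanplus interface."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host,           # BMC address of the compute node on the IPMI network
        "-U", user,
        "-f", password_file,  # read the password from a file instead of the command line
        *action,
    ]


def power_status_command(host: str, user: str, password_file: str) -> List[str]:
    """Command that queries whether the node is powered on."""
    return ipmi_command(host, user, password_file, "chassis", "power", "status")


def power_reset_command(host: str, user: str, password_file: str) -> List[str]:
    """Command that hard-resets the node, used when it needs to be restarted."""
    return ipmi_command(host, user, password_file, "chassis", "power", "reset")


# Hypothetical BMC address and credentials, for illustration only.
print(" ".join(power_reset_command("10.1.0.5", "admin", "/root/.ipmi_pass")))
```

On the master node the returned list could be passed to `subprocess.run`, with a non-zero exit status treated as the "cannot recover" case that leads to logging and an alarm.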

Embodiment 2

In the cluster network fault self-checking method of the present invention, the cluster comprises a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network. A detection program module is configured on the master node; it detects the working state of each compute node and generates a detection result. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network, and the master node is connected to each compute node through the network driver module. The master node restarts a compute node that needs adjustment through the IPMI network module; if the restart through the IPMI network module cannot recover the node, a system log entry is recorded and an alarm is raised.

As shown in Fig. 1, the cluster network fault self-checking method specifically comprises the following steps:

S1: The master node starts automatic execution of the scheduled task.

The master node starts automatic execution of the scheduled task by means of the crond service, in which the master node is configured to run the detection and judgment script at a set frequency.

S2: Check the Gigabit Ethernet management network first and determine its state. If it is abnormal, determine the type of abnormality: if the node is abnormal, restart the node and return to step S1; if the network is abnormal, record a log entry and raise an alarm. If it is normal, go to step S3.

S3: Check network connectivity by pinging the network IP module and determine the network state. If the network is not connected, reset the network card or reinstall the driver; if it is still abnormal, record a log entry and raise an alarm. If it is normal, go to step S4.

S4: Collect the network card connection state with the bundled status program and determine the state. If it is normal, end the task; otherwise record a log entry and raise an alarm.

S5: Manually shut down the network service of one compute node so that its communication with the master node fails, then run the detection and judgment script. If the script detects the fault and restores the network, the configuration is correct.
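The decision flow of steps S2 to S4 can be sketched as below. This is an illustrative outline, not the patented script: the `NodeActions` callbacks are hypothetical placeholders for the real checks and repairs, and the `max_restarts` cap is an assumption added so that a node that never recovers ends in an alarm instead of being restarted indefinitely.

```python
import logging
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("netcheck")


@dataclass
class NodeActions:
    """Hypothetical hooks for one compute node; each wraps a real check or repair."""
    mgmt_ok: Callable[[], bool]       # S2: Gigabit management network reachable?
    node_fault: Callable[[], bool]    # S2: True if the node (not the network) is at fault
    restart_node: Callable[[], None]  # S2: restart the node over the IPMI network
    fabric_ok: Callable[[], bool]     # S3: ping over the 10 Gigabit / IB / OPA network
    reset_nic: Callable[[], None]     # S3: reset the network card or reinstall the driver
    nic_state_ok: Callable[[], bool]  # S4: state reported by the bundled status tool


def alarm(node: str, stage: str) -> str:
    """Record the fault and return an alarm code for the caller."""
    log.error("node %s: unrecovered fault at stage %s", node, stage)
    return f"alarm:{stage}"


def check_node(node: str, a: NodeActions, max_restarts: int = 1) -> str:
    """Run S2 -> S3 -> S4 for one node and return 'ok' or an alarm code."""
    restarts = 0
    # S2: management network first; a node fault triggers a restart and a re-check.
    while not a.mgmt_ok():
        if a.node_fault() and restarts < max_restarts:
            a.restart_node()
            restarts += 1
            continue
        return alarm(node, "mgmt")
    # S3: high-speed network connectivity, with one repair attempt.
    if not a.fabric_ok():
        a.reset_nic()
        if not a.fabric_ok():
            return alarm(node, "fabric")
    # S4: network card connection state.
    if not a.nic_state_ok():
        return alarm(node, "nic_state")
    return "ok"
```

The verification in step S5 corresponds to calling `check_node` for a node whose network service has been stopped on purpose and confirming that the result is "ok" after the repair step runs.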

The embodiments described above are only preferred specific embodiments of the present invention. Ordinary variations and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A cluster network fault self-checking system, in which the cluster comprises a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network, characterized in that: a detection program module and an IPMI network module are configured on the master node; a network driver module, a network IP module, and an IPMI network module are installed or configured on each compute node; the IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network; the master node is connected to each compute node through the network driver module; network connectivity is checked by pinging the network IP module; the detection program module detects the working state of each compute node and generates a detection result; and when an unknown fault occurs on a link and the master node detects abnormal communication, the master node uses the IPMI network module to restart the compute node that needs adjustment or resets the network driver module on that compute node.

2. The cluster network fault self-checking system according to claim 1, characterized in that: if a compute node has a hardware fault or a boot logic fault and cannot be recovered by restarting it through the IPMI network module, a system log entry is recorded and an alarm is raised.

3. A cluster network fault self-checking method, characterized in that: the cluster comprises a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network; the IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network; the master node is connected to each compute node through the network driver module; the master node restarts a compute node that needs adjustment through the IPMI network module; and if the restart through the IPMI network module cannot recover the node, a system log entry is recorded and an alarm is raised.

4. The cluster network fault self-checking method according to claim 3, characterized in that: a detection program module is configured on the master node, and the detection program module detects the working state of each compute node and generates a detection result.

5. The cluster network fault self-checking method according to claim 3 or 4, characterized in that the method specifically comprises the following steps:

S1: The master node starts automatic execution of the scheduled task;

S2: Check the Gigabit Ethernet management network first and determine its state. If it is abnormal, determine the type of abnormality: if the compute node is abnormal, restart the node and return to step S1; if the network is abnormal, record a log entry and raise an alarm. If it is normal, go to step S3;

6. The cluster network fault self-checking method according to claim 5, characterized in that: the master node starts automatic execution of the scheduled task by means of the crond service, in which the master node is configured to run the detection and judgment script at a set frequency.

7. The cluster network fault self-checking method according to claim 6, characterized in that it further comprises the following steps: