CN108769170A - Cluster network fault self-checking system and method - Google Patents

Cluster network fault self-checking system and method

Info

Publication number
CN108769170A
Authority
CN
China
Prior art keywords
network
node
ipmi
cluster
modules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810479418.3A
Other languages
Chinese (zh)
Inventor
李俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810479418.3A
Publication of CN108769170A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/104 Peer-to-peer [P2P] networks
    • H04L67/1044 Group management mechanisms
    • H04L67/1048 Departure or maintenance mechanisms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a cluster network fault self-checking system and method, belonging to the field of server high-performance computing. In the system, a detection program module and an IPMI network module are configured on the master node; on each compute node a network driver module is installed and a network IP module and an IPMI network module are configured. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network. The master node is connected to each compute node through the network driver module, network connectivity is checked by pinging the network IP module, and the detection program module detects the working state of each compute node and generates a detection result. The system can tally the types of network fault occurring between nodes and perform simple repairs, while recording and reporting intractable faults and hardware-level faults, and therefore has good application value.

Description

Cluster network fault self-checking system and method
Technical field
The present invention relates to the field of server high-performance computing, and specifically provides a cluster network fault self-checking system and method.
Background
A typical computing cluster consists of a large number of nodes, including a master node, compute nodes, and storage units. Compute nodes carry the computing workload and provide large numbers of CPU cores, memory, and so on. Storage units provide large amounts of disk space for storing data. The master node is the core of the whole cluster and handles core activities such as submitting tasks, distributing computation, and managing jobs. What makes this arrangement possible is the network, so the network components are critical to the stable operation of a cluster.
The network carries the communication between nodes. If the network fails, the consequences are serious: parallel computation becomes impossible, jobs cannot be submitted, and the cluster as a whole cannot be managed. Keeping the network stable is therefore a key precondition for normal use of the cluster. Common networks include Gigabit Ethernet, 10 Gigabit Ethernet, 100 Gb Ethernet, InfiniBand high-speed computing networks, FC networks, and OPA computing networks. Different types of network generally serve different purposes in the cluster, so the failure of one network can disable the functions that network is responsible for and reduce cluster performance or availability. As the cluster ages, this instability becomes more pronounced and the operations and maintenance workload grows accordingly. A faulty node can go back online and resume work only after the fault has been handled. If faults are ignored, resources are wasted and the effective computing scale of the cluster shrinks.
Summary of the invention
The technical task of the present invention is to address the above problems by providing a cluster network fault self-checking system that can tally the types of network fault occurring between nodes and perform simple repairs, while recording and reporting intractable faults and hardware-level faults.
A further technical task of the present invention is to provide a cluster network fault self-checking method.
To achieve the above object, the present invention provides the following technical solutions:
A cluster network fault self-checking system. The cluster includes a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network. A detection program module and an IPMI network module are configured on the master node. On each compute node a network driver module is installed and a network IP module and an IPMI network module are configured. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network. The master node is connected to each compute node through the network driver module, and network connectivity is checked by pinging the network IP module. The detection program module detects the working state of each compute node and generates a detection result. When an unknown fault occurs on a link and the master node detects abnormal communication, the master node uses the IPMI network module to restart the compute node that needs adjustment, or resets the network driver module on that compute node.
IPMI stands for Intelligent Platform Management Interface.
The IB network is an InfiniBand network.
The OPA network is an Omni-Path network.
Instead of pinging the network IP module, network connectivity can also be checked by copying a file or by sending a few data packets.
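The ping-based connectivity check described above can be sketched as follows. This is a minimal illustration, not the patent's actual script: the function names and the injectable `runner` parameter are assumptions made for this example, and the `-c` and `-W` flags assume the Linux iputils `ping`.

```python
import subprocess


def is_reachable(host, count=1, timeout_s=2, runner=subprocess.run):
    """Return True if `host` answers ICMP echo requests.

    Assumes Linux iputils ping: -c is the packet count, -W the reply
    timeout in seconds. `runner` is injectable (hypothetical parameter)
    so the check can be exercised without touching the network.
    """
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), host]
    try:
        result = runner(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except OSError:
        # ping binary missing or not executable: treat as unreachable.
        return False
    return result.returncode == 0


def unreachable_nodes(hosts, check=is_reachable):
    """Return the hosts that fail the reachability check, in input order."""
    return [h for h in hosts if not check(h)]
```

On the master node, `unreachable_nodes(["cn01", "cn02"])` would return the compute nodes that did not answer; the host names here are placeholders.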
Preferably, if a compute node has a hardware fault or a boot logic fault and a restart through the IPMI network module cannot restore it, a system log entry is recorded and an alarm is raised.
Network connectivity is checked by pinging the network IP module, the state is evaluated, and abnormal states are handled. If the Gigabit Ethernet management network is unreachable, the node can be reset through the IPMI network module; if the 10 Gigabit network, the IB network, or the OPA network is unreachable, the network card can be reset. If both of these checks pass, the network card status is checked with the status tools bundled with the high-speed network component drivers, such as ibstat, ibstatus, and opainfo, to determine the current working state of the network. If the state is normal, the procedure moves to the next step; if it is abnormal, the network card driver or the node is reset, the check is repeated, and the information is recorded. If the state is still abnormal, the faulty node is restarted over the IPMI network. If the node is normal after the restart, the procedure ends; otherwise a system log entry is recorded and an alarm is raised.
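As an illustration of the network card status check, the sketch below parses text in the layout that `ibstat` typically prints, where each port has a `State:` line and a `Physical state:` line. The function names are hypothetical, the exact output layout may differ between driver versions, and `opainfo` output is not covered.

```python
import re


def port_states(ibstat_output):
    """Extract (state, physical_state) pairs, one per port, from ibstat text."""
    states = re.findall(r"^\s*State:\s*(\S+)", ibstat_output, flags=re.MULTILINE)
    physical = re.findall(
        r"^\s*Physical state:\s*(\S+)", ibstat_output, flags=re.MULTILINE
    )
    return list(zip(states, physical))


def link_is_healthy(ibstat_output):
    """True only if at least one port is listed and every port is Active/LinkUp."""
    ports = port_states(ibstat_output)
    return bool(ports) and all(s == "Active" and p == "LinkUp" for s, p in ports)
```

Treating empty output as unhealthy is a deliberate choice in this sketch: a missing adapter or an unloaded driver then triggers the same recovery path as a link that is down.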
A cluster network fault self-checking method. The cluster includes a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network, and the master node is connected to each compute node through the network driver module. The master node restarts a compute node that needs adjustment through the IPMI network module; if the restart through the IPMI network module cannot restore the node, a system log entry is recorded and an alarm is raised.
Preferably, a detection program module is configured on the master node; it detects the working state of each compute node and generates a detection result.
Preferably, the method specifically includes the following steps:
S1: The master node starts the automatically executed scheduled task;
S2: The Gigabit Ethernet management network is checked first and its state is classified. If the state is abnormal, the type of abnormality is determined: for a node abnormality, the node is restarted and the method returns to step S1; for a network abnormality, a log entry is recorded and an alarm is raised. If the state is normal, step S3 is executed;
S3: Network connectivity is checked by pinging the network IP module and the network state is classified. If the network is not connected, the network card is reset or the driver is reloaded; if the state is still abnormal, a log entry is recorded and an alarm is raised. If the state is normal, step S4 is executed;
S4: The network card connection state is collected with the bundled status tool and classified. If it is normal, the task ends; otherwise a log entry is recorded and an alarm is raised.
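Steps S2 to S4 can be sketched as a single pass over one node. This is a minimal sketch under stated assumptions, not the patented implementation: all callables (`mgmt_status`, `restart_node`, `data_net_reachable`, `reset_nic`, `nic_status_ok`, `alarm`) and the three status strings are hypothetical, and "returns to step S1" is interpreted here as ending the pass so that the next scheduled run re-checks the node.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    RESTARTED = "node restarted; re-check on the next scheduled pass"
    ALARM = "logged and alarmed"


def self_check(node, mgmt_status, restart_node, data_net_reachable,
               reset_nic, nic_status_ok, alarm):
    """Run one S2-S4 pass for `node` and report the outcome."""
    # S2: check the management network first.
    status = mgmt_status(node)  # assumed values: "ok", "node_fault", "network_fault"
    if status == "node_fault":
        restart_node(node)
        return Outcome.RESTARTED
    if status != "ok":
        alarm(node, "management network fault")
        return Outcome.ALARM

    # S3: ping the data network; try one NIC reset before giving up.
    if not data_net_reachable(node):
        reset_nic(node)
        if not data_net_reachable(node):
            alarm(node, "data network still unreachable after NIC reset")
            return Outcome.ALARM

    # S4: ask the bundled status tool for the link state.
    if not nic_status_ok(node):
        alarm(node, "NIC status tool reports an abnormal link")
        return Outcome.ALARM
    return Outcome.OK
```

Passing the checks and repair actions in as callables keeps the decision logic separate from the site-specific commands, so the flow can be tested without real hardware.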
Preferably, the master node starts the automatically executed scheduled task by means of the crond service, in which the master node is configured to run the detection and judgment script at a set frequency.
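A crond schedule of this kind is a one-line crontab entry. The helper below is a hedged sketch that builds such a line using the standard five-field crontab syntax; the function name and the script path in the usage example are hypothetical, since the patent does not name the script or its frequency.

```python
def cron_entry(every_minutes, command):
    """Return a crontab line that runs `command` every `every_minutes` minutes.

    Uses the standard five-field syntax (minute hour day month weekday).
    """
    if not 1 <= every_minutes <= 59:
        raise ValueError("every_minutes must be between 1 and 59")
    minute_field = "*" if every_minutes == 1 else f"*/{every_minutes}"
    return f"{minute_field} * * * * {command}"


# Example with a placeholder script path:
print(cron_entry(10, "/opt/selfcheck/check_network.sh"))
```

Note that a step value that does not divide 60 evenly produces a shorter final interval at the end of each hour, because the minute field restarts at 0.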
Preferably, the cluster network fault self-checking method further includes the following step:
S5: The network service of one compute node is shut down manually so that its communication with the master node fails, and the detection and judgment script is executed. If the script detects the fault and restores the network, the configuration is correct.
Compared with the prior art, the cluster network fault self-checking method of the present invention has the following notable advantages: it can tally the types of network fault occurring between nodes and perform simple repairs, while recording and reporting intractable faults and hardware-level faults. This helps reduce the maintenance workload, shorten maintenance time, and keep the cluster running stably, so the method has good application value.
Description of the drawings
Fig. 1 is a flow chart of the cluster network fault self-checking method of the present invention.
Detailed description
The cluster network fault self-checking system and method of the present invention are described in further detail below with reference to the drawing and the embodiments.
Embodiment 1
In the cluster network fault self-checking system of the present invention, the cluster includes a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network.
A detection program module and an IPMI network module are configured on the master node. On each compute node a network driver module is installed and a network IP module and an IPMI network module are configured. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network. The master node is connected to each compute node through the network driver module, network connectivity is checked by pinging the network IP module, and the detection program module detects the working state of each compute node and generates a detection result. When an unknown fault occurs on a link and the master node detects abnormal communication, the master node uses the IPMI network module to restart the compute node that needs adjustment, or resets the network driver module on that compute node. If a compute node has a hardware fault or a boot logic fault and the restart through the IPMI network module cannot complete, a system log entry is recorded and an alarm is raised and reported to the maintenance personnel.
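The restart-then-alarm behaviour of this embodiment can be sketched as below. The patent does not say which IPMI client is used; this example assumes the widely used `ipmitool` command with its `lanplus` interface and `chassis power reset` subcommand. The function names, the `is_healthy` callback, and the logger name are hypothetical.

```python
import logging
import subprocess

log = logging.getLogger("cluster_selfcheck")  # hypothetical logger name


def ipmi_reset_command(bmc_host, user, password):
    """Build an ipmitool command that hard-resets a node through its BMC."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password, "chassis", "power", "reset"]


def restart_or_alarm(node, bmc_host, user, password, is_healthy,
                     runner=subprocess.run, alarm=log.critical):
    """Restart `node` over IPMI; log and alarm if it does not recover.

    `is_healthy(node)` is a caller-supplied check run after the restart;
    in practice it would need to wait for the node to finish booting.
    Returns True when the node recovered, False when an alarm was raised.
    """
    result = runner(ipmi_reset_command(bmc_host, user, password))
    if result.returncode == 0 and is_healthy(node):
        log.info("node %s recovered after IPMI restart", node)
        return True
    alarm("node %s did not recover after IPMI restart", node)
    return False
```

Passing the password with `-P` exposes it in the process list, so a real deployment would likely prefer a safer credential mechanism; the sketch keeps it inline only for brevity.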
Embodiment 2
In the cluster network fault self-checking method of the present invention, the cluster includes a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network. A detection program module is configured on the master node; it detects the working state of each compute node and generates a detection result. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network, and the master node is connected to each compute node through the network driver module. The master node restarts a compute node that needs adjustment through the IPMI network module; if the restart through the IPMI network module cannot restore the node, a system log entry is recorded and an alarm is raised.
As shown in Fig. 1, the cluster network fault self-checking method specifically includes the following steps:
S1: The master node starts the automatically executed scheduled task.
The master node starts the automatically executed scheduled task by means of the crond service, in which the master node is configured to run the detection and judgment script at a set frequency.
S2: The Gigabit Ethernet management network is checked first and its state is classified. If the state is abnormal, the type of abnormality is determined: for a node abnormality, the node is reset and the method returns to step S1; for a network abnormality, a log entry is recorded and an alarm is raised. If the state is normal, step S3 is executed.
S3: Network connectivity is checked by pinging the network IP module and the network state is classified. If the network is not connected, the network card is reset or the driver is reloaded; if the state is still abnormal, a log entry is recorded and an alarm is raised. If the state is normal, step S4 is executed.
S4: The network card connection state is collected with the bundled status tool and classified. If it is normal, the task ends; otherwise a log entry is recorded and an alarm is raised.
S5: The network service of one compute node is shut down manually so that its communication with the master node fails, and the detection and judgment script is executed. If the script detects the fault and restores the network, the configuration is correct.
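The S5 verification drill can be rehearsed against an in-memory stand-in before it is tried on a real node. The sketch below is purely illustrative: `FakeNode`, `detect_and_repair`, and `run_drill` are hypothetical names, and the repair logic is reduced to a single network card reset.

```python
class FakeNode:
    """In-memory stand-in for a compute node; no real hardware is touched."""

    def __init__(self, name):
        self.name = name
        self.network_up = True
        self.resets = 0

    def stop_network(self):
        # Simulates manually shutting down the node's network service.
        self.network_up = False

    def reset_nic(self):
        self.resets += 1
        self.network_up = True


def detect_and_repair(node):
    """Minimal stand-in for the detection script: repair a down network once."""
    if node.network_up:
        return "healthy"
    node.reset_nic()
    return "recovered" if node.network_up else "alarm"


def run_drill(node):
    """S5: inject a fault, run the script, report whether recovery succeeded."""
    node.stop_network()
    return detect_and_repair(node) == "recovered" and node.network_up
```

A drill that returns True corresponds to the "configuration is correct" outcome of step S5.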
The embodiments described above are only preferred specific implementations of the present invention. Ordinary variations and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall all fall within the scope of protection of the present invention.

Claims (7)

1. A cluster network fault self-checking system, the cluster including a master node and multiple compute nodes and being configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network, characterized in that: a detection program module and an IPMI network module are configured on the master node; on each compute node a network driver module is installed and a network IP module and an IPMI network module are configured; the IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network; the master node is connected to each compute node through the network driver module; network connectivity is checked by pinging the network IP module; the detection program module detects the working state of each compute node and generates a detection result; and when an unknown fault occurs on a link and the master node detects abnormal communication, the master node uses the IPMI network module to restart the compute node that needs adjustment or resets the network driver module on that compute node.
2. The cluster network fault self-checking system according to claim 1, characterized in that: if a compute node has a hardware fault or a boot logic fault and a restart through the IPMI network module cannot restore it, a system log entry is recorded and an alarm is raised.
3. A cluster network fault self-checking method, characterized in that: the cluster includes a master node and multiple compute nodes and is configured with a Gigabit Ethernet management network, a 10 Gigabit network, an IB network, and an OPA network; the IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network; the master node is connected to each compute node through the network driver module; the master node restarts a compute node that needs adjustment through the IPMI network module; and if the restart through the IPMI network module cannot restore the node, a system log entry is recorded and an alarm is raised.
4. The cluster network fault self-checking method according to claim 3, characterized in that: a detection program module is configured on the master node, and the detection program module detects the working state of each compute node and generates a detection result.
5. The cluster network fault self-checking method according to claim 3 or 4, characterized in that the method specifically includes the following steps:
S1: The master node starts the automatically executed scheduled task;
S2: The Gigabit Ethernet management network is checked first and its state is classified. If the state is abnormal, the type of abnormality is determined: for a compute node abnormality, the node is reset and the method returns to step S1; for a network abnormality, a log entry is recorded and an alarm is raised. If the state is normal, step S3 is executed;
S3: Network connectivity is checked by pinging the network IP module and the network state is classified. If the network is not connected, the network card is reset or the driver is reloaded; if the state is still abnormal, a log entry is recorded and an alarm is raised. If the state is normal, step S4 is executed;
S4: The network card connection state is collected with the bundled status tool and classified. If it is normal, the task ends; otherwise a log entry is recorded and an alarm is raised.
6. The cluster network fault self-checking method according to claim 5, characterized in that: the master node starts the automatically executed scheduled task by means of the crond service, in which the master node is configured to run the detection and judgment script at a set frequency.
7. The cluster network fault self-checking method according to claim 6, characterized in that it further includes the following step:
S5: The network service of one compute node is shut down manually so that its communication with the master node fails, and the detection and judgment script is executed. If the script detects the fault and restores the network, the configuration is correct.
CN201810479418.3A 2018-05-18 2018-05-18 Cluster network fault self-checking system and method Pending CN108769170A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810479418.3A CN108769170A (en) 2018-05-18 2018-05-18 Cluster network fault self-checking system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810479418.3A CN108769170A (en) 2018-05-18 2018-05-18 Cluster network fault self-checking system and method

Publications (1)

Publication Number Publication Date
CN108769170A (en) 2018-11-06

Family

ID=64007237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810479418.3A Pending CN108769170A (en) 2018-05-18 2018-05-18 Cluster network fault self-checking system and method

Country Status (1)

Country Link
CN (1) CN108769170A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710442A (en) * 2018-12-20 2019-05-03 麒麟合盛网络技术股份有限公司 A kind of execution method and apparatus of task
CN112491633A (en) * 2020-12-17 2021-03-12 北京浪潮数据技术有限公司 Fault recovery method, system and related components of multi-node cluster
CN112511356A (en) * 2020-12-18 2021-03-16 北京浪潮数据技术有限公司 Fault repairing method, device, equipment and medium for multi-node cluster
CN112737934A (en) * 2020-12-28 2021-04-30 常州森普信息科技有限公司 Cluster type Internet of things edge gateway device and method
CN113345566A (en) * 2021-07-07 2021-09-03 上海蓬海涞讯数据技术有限公司 Hospital operation management data acquisition integrated device and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104067599A (en) * 2013-01-16 2014-09-24 冲电气工业株式会社 Network state monitoring system
CN104993953A (en) * 2015-06-19 2015-10-21 北京奇虎科技有限公司 Method for detecting network service state and device detecting network service state
US20150370657A1 (en) * 2014-06-20 2015-12-24 Vmware, Inc. Protecting virtual machines from network failures
CN106130778A (en) * 2016-07-18 2016-11-16 浪潮电子信息产业股份有限公司 A kind of method processing clustering fault and a kind of management node
CN106571972A (en) * 2015-10-10 2017-04-19 北京国双科技有限公司 Server monitoring method and device
CN106789441A (en) * 2017-01-09 2017-05-31 郑州云海信息技术有限公司 A kind of condition detection method and device of high-end fault-tolerant server administrative unit

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104067599A (en) * 2013-01-16 2014-09-24 冲电气工业株式会社 Network state monitoring system
US20150370657A1 (en) * 2014-06-20 2015-12-24 Vmware, Inc. Protecting virtual machines from network failures
CN104993953A (en) * 2015-06-19 2015-10-21 北京奇虎科技有限公司 Method for detecting network service state and device detecting network service state
CN106571972A (en) * 2015-10-10 2017-04-19 北京国双科技有限公司 Server monitoring method and device
CN106130778A (en) * 2016-07-18 2016-11-16 浪潮电子信息产业股份有限公司 A kind of method processing clustering fault and a kind of management node
CN106789441A (en) * 2017-01-09 2017-05-31 郑州云海信息技术有限公司 A kind of condition detection method and device of high-end fault-tolerant server administrative unit

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710442A (en) * 2018-12-20 2019-05-03 麒麟合盛网络技术股份有限公司 A kind of execution method and apparatus of task
CN112491633A (en) * 2020-12-17 2021-03-12 北京浪潮数据技术有限公司 Fault recovery method, system and related components of multi-node cluster
CN112491633B (en) * 2020-12-17 2023-01-24 北京浪潮数据技术有限公司 Fault recovery method, system and related components of multi-node cluster
CN112511356A (en) * 2020-12-18 2021-03-16 北京浪潮数据技术有限公司 Fault repairing method, device, equipment and medium for multi-node cluster
CN112737934A (en) * 2020-12-28 2021-04-30 常州森普信息科技有限公司 Cluster type Internet of things edge gateway device and method
CN113345566A (en) * 2021-07-07 2021-09-03 上海蓬海涞讯数据技术有限公司 Hospital operation management data acquisition integrated device and system

Similar Documents

Publication Publication Date Title
CN108769170A (en) Cluster network fault self-checking system and method
TWI746512B (en) Physical machine fault classification processing method and device, and virtual machine recovery method and system
US6792456B1 (en) Systems and methods for authoring and executing operational policies that use event rates
CN104798341B (en) Service level is characterized on electric network
CN106775929B (en) A kind of virtual platform safety monitoring method and system
US8996924B2 (en) Monitoring device, monitoring system and monitoring method
US20080080384A1 (en) System and method for implementing an infiniband error log analysis model to facilitate faster problem isolation and repair
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
CN112162907A (en) Health degree evaluation method based on monitoring index data
CN103607297A (en) Fault processing method of computer cluster system
WO2015090098A1 (en) Method and apparatus for realizing fault location
CN110134518A (en) A kind of method and system improving big data cluster multinode high application availability
CN108199901B (en) Hardware repair reporting method, system, device, hardware management server and storage medium
CN113282635A (en) Micro-service system fault root cause positioning method and device
US20160191359A1 (en) Reactive diagnostics in storage area networks
CN114356499A (en) Kubernetes cluster alarm root cause analysis method and device
CN108809729A (en) The fault handling method and device that CTDB is serviced in a kind of distributed system
CN112838944A (en) Diagnosis and management, rule determination and deployment method, distributed device, and medium
TWI591489B (en) Intelligent monitoring and warning device and method for distributed software defined storage system
CN108959025A (en) A kind of server alarm method, device and server
US11544091B2 (en) Determining and implementing recovery actions for containers to recover the containers from failures
CN109510730A (en) Distributed system and its monitoring method, device, electronic equipment and storage medium
CN116737444A (en) Database server fault processing method and system
JP2009252006A (en) Log management system and method in computer system
CN114866606A (en) Micro-service management system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181106
