CN108769170A - Cluster network fault self-checking system and method - Google Patents
Cluster network fault self-checking system and method
- Publication number
- CN108769170A (application number CN201810479418.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- node
- ipmi
- cluster
- modules
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/104—Peer-to-peer [P2P] networks
- H04L67/1044—Group management mechanisms
- H04L67/1048—Departure or maintenance mechanisms
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/069—Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0811—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0817—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Environmental & Geological Engineering (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Theoretical Computer Science (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention discloses a cluster network fault self-checking system and method, belonging to the field of high-performance server computing. In the system, a detection program module and an IPMI network module are configured on the master node; on each compute node, a network driver module is installed and a network IP module and an IPMI network module are configured. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network. The master node is connected to each compute node through the network driver module, network connectivity is checked by pinging the network IP module, and the detection program module detects the working state of each compute node and generates a detection result. The system can tally the types of network faults between nodes and perform simple repairs, while recording and reporting intractable faults and hardware-level faults, and therefore has good application value.
Description
Technical field
The present invention relates to the field of high-performance server computing, and specifically provides a cluster network fault self-checking system and method.
Background
A typical computing cluster consists of a large number of nodes, including a master node, compute nodes, and storage units. The compute nodes carry the computing workload and contain a large number of CPU cores, memory, and so on; the storage units provide a large amount of disk space for storing data; the master node is the core of the whole cluster and handles core activities such as submitting tasks, distributing computation, and managing jobs. The network is indispensable to this arrangement, and the network components are critical to the stable operation of a cluster.
The network carries the communication between nodes. A network fault can have serious consequences: parallel computation may become impossible, jobs may not be submittable, and the whole cluster may become unmanageable. Keeping the network stable is therefore a major prerequisite for normal use of the cluster. Common networks include ordinary Gigabit Ethernet, 10-Gigabit Ethernet, 100Gb Ethernet, InfiniBand high-speed computing networks, FC networks, and OPA computing networks. Different types of network generally play different roles in a cluster, so a fault in one network can disable the functions that network is responsible for and reduce cluster performance or availability. As the cluster ages, this instability becomes more pronounced and the operations workload grows accordingly. A faulty node can go back online and resume work only after the fault has been handled; if faults are ignored, resources are wasted and the effective computing scale of the cluster shrinks.
Summary of the invention
The technical task of the present invention is, in view of the above problems, to provide a cluster network fault self-checking system that can tally the types of network faults between nodes and perform simple repairs, while recording and reporting intractable faults and hardware-level faults.
A further technical task of the present invention is to provide a cluster network fault self-checking method.
To achieve the above object, the present invention provides the following technical solutions:
A cluster network fault self-checking system, in which the cluster comprises a master node and a plurality of compute nodes, and a Gigabit Ethernet management network, a 10-Gigabit network, an IB network, and an OPA network are configured in the cluster. A detection program module and an IPMI network module are configured on the master node; on each compute node, a network driver module is installed and a network IP module and an IPMI network module are configured. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network. The master node is connected to each compute node through the network driver module, and network connectivity is checked by pinging the network IP module. The detection program module detects the working state of each compute node and generates a detection result. When an unknown fault occurs on a link and the master node detects abnormal communication, the master node uses the IPMI network module to restart the compute node that needs adjustment, or resets the network driver module on that compute node.
IPMI stands for Intelligent Platform Management Interface.
The IB network is the InfiniBand network.
The OPA network is the Omni-Path network.
Besides pinging the network IP module, network connectivity can also be checked by copying a file or by sending a few data packets.
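The ping-based connectivity check can be sketched as follows. This is a minimal illustration, not the patent's own script: it assumes a Linux `ping` from iputils (flags `-c` and `-W`), and the function names and default values are hypothetical.

```python
import subprocess
from typing import Callable, List


def build_ping_command(address: str, count: int = 3, timeout_s: int = 2) -> List[str]:
    # -c: number of echo requests; -W: seconds to wait for each reply (Linux iputils ping).
    return ["ping", "-c", str(count), "-W", str(timeout_s), address]


def is_reachable(
    address: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    # `run` is injectable so the logic can be exercised without touching the network.
    result = run(
        build_ping_command(address),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # Exit status 0 means at least one reply was received.
    return result.returncode == 0
```

Calling `is_reachable("10.0.0.5")` (an example address) on the master node would report whether that compute node answered on the network being tested.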
Preferably, if a compute node has a hardware fault or a boot logic fault that a restart through the IPMI network module cannot recover, a system log entry is recorded and an alarm is raised.
Network connectivity is checked by pinging the network IP module, the state is evaluated, and abnormal states are handled: if the Gigabit Ethernet management network is unreachable, the node can be reset through the IPMI network module; if the 10-Gigabit, IB, or OPA network is unreachable, the network card can be reset. If the state is judged normal after these two kinds of handling, the network card status is checked with the status tools shipped with the high-speed network component drivers, such as ibstat, ibstatus, and opainfo, which report the State field, to determine the current working state of the network. If the state is normal, the procedure moves to the next step; if it is abnormal, the network card driver or the node is reset, the check is repeated, and the information is recorded. If the state is still abnormal, the faulty node is restarted over the IPMI network. If the node is normal after the restart, the procedure ends; if not, a system log entry is recorded and an alarm is raised.
A cluster network fault self-checking method, in which the cluster comprises a master node and a plurality of compute nodes, and a Gigabit Ethernet management network, a 10-Gigabit network, an IB network, and an OPA network are configured in the cluster. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network, and the master node is connected to each compute node through the network driver module. The master node uses the IPMI network module to restart a compute node that needs adjustment; if the node cannot be recovered by a restart through the IPMI network module, a system log entry is recorded and an alarm is raised.
Preferably, a detection program module is configured on the master node; it detects the working state of each compute node and generates a detection result.
Preferably, the method specifically comprises the following steps:
S1: After startup, the master node automatically runs the scheduled task;
S2: The Gigabit Ethernet management network is checked first and its state is classified. If the state is abnormal, the type of abnormality is determined: for a node abnormality, the node is restarted and the procedure returns to step S1; for a network abnormality, a log entry is recorded and an alarm is raised. If the state is normal, step S3 is executed;
S3: Network connectivity is checked by pinging the network IP module and the network state is classified. If the network is unreachable, the network card is reset or the driver is reloaded; if the state is still abnormal, a log entry is recorded and an alarm is raised. If the state is normal, step S4 is executed;
S4: The network card connection state is collected with the tools bundled with the driver and the state is classified. If it is normal, the task ends; otherwise a log entry is recorded and an alarm is raised.
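Steps S2 to S4 can be read as a small decision procedure. The sketch below is one possible reading under stated assumptions: every hook in `NodeChecks` is a hypothetical callable supplied by the caller, the return to S1 is modelled as re-running the management check, and `max_restarts` is an added bound that the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NodeChecks:
    """Hypothetical hooks for one compute node."""
    mgmt_ok: Callable[[], bool]        # S2: management network reachable?
    node_fault: Callable[[], bool]     # S2: True for a node fault, False for a network fault
    restart_node: Callable[[], None]   # S2: restart the node
    ping_ok: Callable[[], bool]        # S3: ping of the network IP succeeds?
    reset_nic: Callable[[], None]      # S3: reset the card or reload the driver
    nic_state_ok: Callable[[], bool]   # S4: bundled status tool reports a normal state?


def self_check(checks: NodeChecks, alarms: List[str], max_restarts: int = 1) -> str:
    restarts = 0
    # S2: check the management network first.
    while not checks.mgmt_ok():
        if checks.node_fault() and restarts < max_restarts:
            checks.restart_node()
            restarts += 1
            continue  # re-run the check, standing in for "return to step S1"
        alarms.append("S2: management network still abnormal")
        return "alarm"
    # S3: ping, attempt one repair, then ping again.
    if not checks.ping_ok():
        checks.reset_nic()
        if not checks.ping_ok():
            alarms.append("S3: network unreachable after reset")
            return "alarm"
    # S4: confirm the card-level state with the driver's own tool.
    if not checks.nic_state_ok():
        alarms.append("S4: network card state abnormal")
        return "alarm"
    return "ok"
```

Passing the checks in as callables keeps the control flow separate from the commands that implement each check, so the flow can be verified with stubs before it is pointed at real nodes.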
Preferably, the master node runs the scheduled task automatically by means of the crond service: the crond service on the master node is configured to execute the detection and judgment script at a set frequency.
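As an illustration of the crond configuration, the helper below builds a crontab entry that runs a script every N minutes. The text does not give a frequency or a script location, so the interval and the path used in the example are placeholders.

```python
def crontab_entry(every_minutes: int, script_path: str) -> str:
    # Fields: minute hour day-of-month month day-of-week command.
    # "*/N" in the minute field is the step syntax for "every N minutes".
    if not 1 <= every_minutes <= 59:
        raise ValueError("every_minutes must be between 1 and 59")
    return f"*/{every_minutes} * * * * {script_path}"
```

For example, `crontab_entry(10, "/opt/cluster/netcheck.sh")` yields a line that could be added to the master node's crontab to run the hypothetical script every ten minutes.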
Preferably, the cluster network fault self-checking method further comprises the following step:
S5: The network service of one compute node is shut down manually so that communication with the master node fails, and the detection and judgment script is executed; if the fault is detected and the network is restored successfully, the configuration is correct.
Compared with the prior art, the cluster network fault self-checking method of the present invention has the following notable beneficial effects: the method can tally the types of network faults between nodes and perform simple repairs, while recording and reporting intractable faults and hardware-level faults. This helps reduce the maintenance workload, shorten maintenance time, and keep the cluster running stably, and the method therefore has good application value.
Brief description of the drawings
Fig. 1 is a flow chart of the cluster network fault self-checking method of the present invention.
Detailed description
The cluster network fault self-checking system and method of the present invention are described in further detail below with reference to the drawing and the embodiments.
Embodiment 1
In the cluster network fault self-checking system of the present invention, the cluster comprises a master node and a plurality of compute nodes, and a Gigabit Ethernet management network, a 10-Gigabit network, an IB network, and an OPA network are configured in the cluster.
A detection program module and an IPMI network module are configured on the master node; on each compute node, a network driver module is installed and a network IP module and an IPMI network module are configured. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network. The master node is connected to each compute node through the network driver module, and network connectivity is checked by pinging the network IP module. The detection program module detects the working state of each compute node and generates a detection result. When an unknown fault occurs on a link and the master node detects abnormal communication, the master node uses the IPMI network module to restart the compute node that needs adjustment, or resets the network driver module on that compute node. If a compute node has a hardware fault or a boot logic fault and the restart through the IPMI network module cannot complete, a system log entry is recorded, an alarm is raised, and the fault is reported to maintenance personnel.
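A restart over the IPMI network could be issued from the master node as sketched below. The patent does not name a tool; this sketch assumes the widely used `ipmitool` command-line client with its `lanplus` interface, and the function names, address, and user are hypothetical. The password is passed through the `IPMI_PASSWORD` environment variable (the `-E` option) so that it does not appear in the process list.

```python
import os
import subprocess
from typing import Callable, List


def build_ipmi_reset_command(bmc_address: str, user: str) -> List[str]:
    # "chassis power reset" asks the node's management controller for a hard reset.
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_address,
        "-U", user,
        "-E",  # read the password from the IPMI_PASSWORD environment variable
        "chassis", "power", "reset",
    ]


def restart_node_via_ipmi(
    bmc_address: str,
    user: str,
    password: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    env = dict(os.environ, IPMI_PASSWORD=password)
    result = run(
        build_ipmi_reset_command(bmc_address, user),
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # A zero exit status only means the request was accepted; the caller
    # still has to re-check the node to confirm it actually recovered.
    return result.returncode == 0
```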
Embodiment 2
In the cluster network fault self-checking method of the present invention, the cluster comprises a master node and a plurality of compute nodes, and a Gigabit Ethernet management network, a 10-Gigabit network, an IB network, and an OPA network are configured in the cluster. A detection program module is configured on the master node; it detects the working state of each compute node and generates a detection result. The IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network, and the master node is connected to each compute node through the network driver module. The master node uses the IPMI network module to restart a compute node that needs adjustment; if the node cannot be recovered by a restart through the IPMI network module, a system log entry is recorded and an alarm is raised.
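The final fallback, recording a log entry and raising an alarm, might look like the sketch below. It uses Python's standard `logging` module; the logger name, the message format, and the `notify` callback (for example a mail or chat hook) are assumptions, since the text does not say how the alarm is delivered.

```python
import logging
from typing import Callable, Optional


def record_and_alarm(
    node: str,
    reason: str,
    notify: Callable[[str], None],
    logger: Optional[logging.Logger] = None,
) -> str:
    logger = logger or logging.getLogger("cluster.netcheck")
    message = f"node={node} not recovered after IPMI restart: {reason}"
    logger.error(message)  # the log record
    notify(message)        # the alarm, delivered by the caller's callback
    return message
```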
As shown in Fig. 1, the cluster network fault self-checking method specifically comprises the following steps:
S1: After startup, the master node automatically runs the scheduled task.
The master node runs the scheduled task automatically by means of the crond service; the crond service on the master node is configured to execute the detection and judgment script at a set frequency.
S2: The Gigabit Ethernet management network is checked first and its state is classified. If the state is abnormal, the type of abnormality is determined: for a node abnormality, the node is reset and the procedure returns to step S1; for a network abnormality, a log entry is recorded and an alarm is raised. If the state is normal, step S3 is executed.
S3: Network connectivity is checked by pinging the network IP module and the network state is classified. If the network is unreachable, the network card is reset or the driver is reloaded; if the state is still abnormal, a log entry is recorded and an alarm is raised. If the state is normal, step S4 is executed.
S4: The network card connection state is collected with the tools bundled with the driver and the state is classified. If it is normal, the task ends; otherwise a log entry is recorded and an alarm is raised.
S5: The network service of one compute node is shut down manually so that communication with the master node fails, and the detection and judgment script is executed; if the fault is detected and the network is restored successfully, the configuration is correct.
The embodiments described above are merely preferred specific implementations of the present invention; ordinary changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall all fall within the scope of protection of the present invention.
Claims (7)
1. A cluster network fault self-checking system, the cluster comprising a master node and a plurality of compute nodes, with a Gigabit Ethernet management network, a 10-Gigabit network, an IB network, and an OPA network configured in the cluster, characterized in that: a detection program module and an IPMI network module are configured on the master node; on each compute node, a network driver module is installed and a network IP module and an IPMI network module are configured; the IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network; the master node is connected to each compute node through the network driver module; network connectivity is checked by pinging the network IP module; the detection program module is used to detect the working state of each compute node and generate a detection result; and when an unknown fault occurs on a link and the master node detects abnormal communication, the master node uses the IPMI network module to restart the compute node that needs adjustment, or resets the network driver module on that compute node.
2. The cluster network fault self-checking system according to claim 1, characterized in that: if a compute node has a hardware fault or a boot logic fault that a restart through the IPMI network module cannot recover, a system log entry is recorded and an alarm is raised.
3. A cluster network fault self-checking method, characterized in that: the cluster comprises a master node and a plurality of compute nodes; a Gigabit Ethernet management network, a 10-Gigabit network, an IB network, and an OPA network are configured in the cluster; the IPMI network module of the master node and the IPMI network modules of the compute nodes form an IPMI network; the master node is connected to each compute node through the network driver module; the master node uses the IPMI network module to restart a compute node that needs adjustment; and if the node cannot be recovered by a restart through the IPMI network module, a system log entry is recorded and an alarm is raised.
4. The cluster network fault self-checking method according to claim 3, characterized in that: a detection program module is configured on the master node, and the detection program module is used to detect the working state of each compute node and generate a detection result.
5. The cluster network fault self-checking method according to claim 3 or 4, characterized in that the method specifically comprises the following steps:
S1: After startup, the master node automatically runs the scheduled task;
S2: The Gigabit Ethernet management network is checked first and its state is classified; if the state is abnormal, the type of abnormality is determined: for a compute node abnormality, the node is reset and the procedure returns to step S1; for a network abnormality, a log entry is recorded and an alarm is raised; if the state is normal, step S3 is executed;
S3: Network connectivity is checked by pinging the network IP module and the network state is classified; if the network is unreachable, the network card is reset or the driver is reloaded; if the state is still abnormal, a log entry is recorded and an alarm is raised; if the state is normal, step S4 is executed;
S4: The network card connection state is collected with the tools bundled with the driver and the state is classified; if it is normal, the task ends; otherwise a log entry is recorded and an alarm is raised.
6. The cluster network fault self-checking method according to claim 5, characterized in that: the master node runs the scheduled task automatically by means of the crond service, and the crond service on the master node is configured to execute the detection and judgment script at a set frequency.
7. The cluster network fault self-checking method according to claim 6, characterized in that it further comprises the following step:
S5: The network service of one compute node is shut down manually so that communication with the master node fails, and the detection and judgment script is executed; if the fault is detected and the network is restored successfully, the configuration is correct.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810479418.3A CN108769170A (en) | 2018-05-18 | 2018-05-18 | Cluster network fault self-checking system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810479418.3A CN108769170A (en) | 2018-05-18 | 2018-05-18 | Cluster network fault self-checking system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108769170A true CN108769170A (en) | 2018-11-06 |
Family
ID=64007237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810479418.3A Pending CN108769170A (en) | 2018-05-18 | 2018-05-18 | Cluster network fault self-checking system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108769170A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710442A (en) * | 2018-12-20 | 2019-05-03 | 麒麟合盛网络技术股份有限公司 | A kind of execution method and apparatus of task |
CN112491633A (en) * | 2020-12-17 | 2021-03-12 | 北京浪潮数据技术有限公司 | Fault recovery method, system and related components of multi-node cluster |
CN112511356A (en) * | 2020-12-18 | 2021-03-16 | 北京浪潮数据技术有限公司 | Fault repairing method, device, equipment and medium for multi-node cluster |
CN112737934A (en) * | 2020-12-28 | 2021-04-30 | 常州森普信息科技有限公司 | Cluster type Internet of things edge gateway device and method |
CN113345566A (en) * | 2021-07-07 | 2021-09-03 | 上海蓬海涞讯数据技术有限公司 | Hospital operation management data acquisition integrated device and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104067599A (en) * | 2013-01-16 | 2014-09-24 | 冲电气工业株式会社 | Network state monitoring system |
CN104993953A (en) * | 2015-06-19 | 2015-10-21 | 北京奇虎科技有限公司 | Method for detecting network service state and device detecting network service state |
US20150370657A1 (en) * | 2014-06-20 | 2015-12-24 | Vmware, Inc. | Protecting virtual machines from network failures |
CN106130778A (en) * | 2016-07-18 | 2016-11-16 | 浪潮电子信息产业股份有限公司 | A kind of method processing clustering fault and a kind of management node |
CN106571972A (en) * | 2015-10-10 | 2017-04-19 | 北京国双科技有限公司 | Server monitoring method and device |
CN106789441A (en) * | 2017-01-09 | 2017-05-31 | 郑州云海信息技术有限公司 | A kind of condition detection method and device of high-end fault-tolerant server administrative unit |
-
2018
- 2018-05-18 CN CN201810479418.3A patent/CN108769170A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710442A (en) * | 2018-12-20 | 2019-05-03 | 麒麟合盛网络技术股份有限公司 | A kind of execution method and apparatus of task |
CN112491633A (en) * | 2020-12-17 | 2021-03-12 | 北京浪潮数据技术有限公司 | Fault recovery method, system and related components of multi-node cluster |
CN112491633B (en) * | 2020-12-17 | 2023-01-24 | 北京浪潮数据技术有限公司 | Fault recovery method, system and related components of multi-node cluster |
CN112511356A (en) * | 2020-12-18 | 2021-03-16 | 北京浪潮数据技术有限公司 | Fault repairing method, device, equipment and medium for multi-node cluster |
CN112737934A (en) * | 2020-12-28 | 2021-04-30 | 常州森普信息科技有限公司 | Cluster type Internet of things edge gateway device and method |
CN113345566A (en) * | 2021-07-07 | 2021-09-03 | 上海蓬海涞讯数据技术有限公司 | Hospital operation management data acquisition integrated device and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108769170A (en) | Cluster network fault self-checking system and method | |
TWI746512B (en) | Physical machine fault classification processing method and device, and virtual machine recovery method and system | |
US6792456B1 (en) | Systems and methods for authoring and executing operational policies that use event rates | |
CN104798341B (en) | Service level is characterized on electric network | |
CN106775929B (en) | A kind of virtual platform safety monitoring method and system | |
US8996924B2 (en) | Monitoring device, monitoring system and monitoring method | |
US20080080384A1 (en) | System and method for implementing an infiniband error log analysis model to facilitate faster problem isolation and repair | |
CN106789306B (en) | Method and system for detecting, collecting and recovering software fault of communication equipment | |
CN112162907A (en) | Health degree evaluation method based on monitoring index data | |
CN103607297A (en) | Fault processing method of computer cluster system | |
WO2015090098A1 (en) | Method and apparatus for realizing fault location | |
CN110134518A (en) | A kind of method and system improving big data cluster multinode high application availability | |
CN108199901B (en) | Hardware repair reporting method, system, device, hardware management server and storage medium | |
CN113282635A (en) | Micro-service system fault root cause positioning method and device | |
US20160191359A1 (en) | Reactive diagnostics in storage area networks | |
CN114356499A (en) | Kubernetes cluster alarm root cause analysis method and device | |
CN108809729A (en) | The fault handling method and device that CTDB is serviced in a kind of distributed system | |
CN112838944A (en) | Diagnosis and management, rule determination and deployment method, distributed device, and medium | |
TWI591489B (en) | Intelligent monitoring and warning device and method for distributed software defined storage system | |
CN108959025A (en) | A kind of server alarm method, device and server | |
US11544091B2 (en) | Determining and implementing recovery actions for containers to recover the containers from failures | |
CN109510730A (en) | Distributed system and its monitoring method, device, electronic equipment and storage medium | |
CN116737444A (en) | Database server fault processing method and system | |
JP2009252006A (en) | Log management system and method in computer system | |
CN114866606A (en) | Micro-service management system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181106 |