CN103607297B - Fault processing method of computer cluster system - Google Patents

Fault processing method of computer cluster system

Info

Publication number
CN103607297B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310548737.2A
Other languages
Chinese (zh)
Other versions
CN103607297A (en)
Inventor
陈浩
赵亚萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eisoo Information Technology Co Ltd
Original Assignee
Shanghai Eisoo Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eisoo Information Technology Co Ltd
Priority to CN201310548737.2A
Publication of CN103607297A
Application granted
Publication of CN103607297B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault handling method for a computer cluster system. The method comprises the following steps: (A) at least two nodes in the computer cluster system are selected and set as management nodes that undertake fault handling and management of the computer cluster system, one of the management nodes serving as the main node and the others as standby nodes; (B) the bottom monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred, and if so, notifies the message middleware service module to send a fault message to the management center service module of the main node; and (C) the management center service module of the main node handles the fault according to the fault message. With this technical solution, faults in the computer cluster system can be handled automatically without human intervention.

Description

A fault handling method for a computer cluster system
Technical field
The present application relates to computer technology, in particular to computer cluster systems, and more specifically to a fault handling method for a computer cluster system.
Background art
With the advance of information technology, enterprises and other organizations alike depend increasingly on computer systems. As data volumes expand sharply, a single computer can no longer meet their needs, while using a supercomputer greatly increases cost. Computer cluster technology emerged in response.
A computer cluster system is a group of loosely integrated computers, connected by software or hardware, that cooperate closely to complete computing work. The computers that make up a computer cluster system can logically be regarded as a single computer. Each computer in a computer cluster system is usually called a node. The nodes can be connected through a local area network, and other connection methods are also supported. Computer cluster systems are commonly used to improve on the computing speed of a single computer and to balance data-flow load. Because they offer very high computing speed at a low price, computer cluster systems are widely favored and have spread quickly.
The number of nodes in a computer cluster system ranges from a few to hundreds or even thousands. When one or more nodes fail, the computing speed of the computer cluster system is usually affected, and in some cases none of its nodes can be used normally. For users, therefore, ensuring that the computer cluster system as a whole remains usable, with no loss of computing speed, when any one node fails is key to improving work efficiency and creating value.
The usual way to handle a fault in a computer cluster system is for maintenance personnel to enter the machine room, find the failed machine among the many nodes, determine the cause of the failure, and then carry out repairs. As the number of nodes grows, more maintenance personnel and more work may be needed, so this approach is both costly and inefficient.
Summary of the invention
The present application provides a fault handling method for a computer cluster system that can handle faults in the computer cluster system automatically, without manual intervention.
A fault handling method for a computer cluster system provided by an embodiment of the present application comprises:
A. selecting at least two nodes in the computer cluster system and setting them as management nodes that undertake fault handling and management of the computer cluster system, one of the management nodes serving as the main node and the rest as standby nodes;
B. the bottom monitoring service module of each node in the computer cluster system monitoring the running state and the software and hardware load of that node and judging whether a fault has occurred; if so, the bottom monitoring service module notifying the message middleware service module of that node to send a fault message to the management center service module of the main node;
C. the management center service module of the main node handling the fault according to the fault message.
Preferably, the fault is that the memory, CPU, or system disk utilization of a node exceeds a predefined threshold;
step C is: the management center service module of the main node reports the fault content to maintenance personnel.
Preferably, the fault is a hardware fault;
step C is: the management center service module of the main node notifies the administrator of the identifier of the failed hardware and removes the faulty device from the computer cluster system.
Preferably, the failed node is an ordinary node and the fault is a software fault;
step C is: the management center service module of the main node marks the state of the node with a defined state value and notifies maintenance personnel of the specific fault information.
Preferably, the failed node is the main node and the fault is a software fault;
step C is: a new main node is elected from the standby nodes to take over the work of the former main node.
Preferably, the method further comprises:
the computer cluster system detecting, through a heartbeat mechanism, that a node is offline; if the node is the main node, electing a new main node from the standby nodes to take over the work of the former main node and then placing the former main node into aging; if the node is an ordinary node, placing it directly into aging;
after the aging period, deleting all information about the node from the computer cluster system.
Preferably, each node of the computer cluster system sends its heartbeat messages to the message middleware service module on the main node, and the main node and the standby nodes collect and manage the heartbeat messages; if the interval between the timestamp in the last heartbeat message received from a node and the current time exceeds a preset threshold and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
As can be seen from the above technical solution, the message middleware and the single-node monitoring programs form a monitoring network covering all nodes of the computer cluster system, which monitors the service state and network state of every node in real time. If a node fault is found, the monitoring program on that node reports the fault information to the management center for unified handling. Faults in the computer cluster system are thus handled automatically without manual intervention, the computer cluster system remains usable after a node fails, the workload of maintenance personnel is reduced, and the fault tolerance of the computer cluster system is improved.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a fault handling method for a computer cluster system provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of the deployment process of the fault handling method for a computer cluster system provided by an embodiment of the present application.
Detailed description of the embodiments
To address the problems in the prior art, the present application provides a fault handling method for a computer cluster system. Faults are reported through a message mechanism and handled by designated nodes, so that faults in the computer cluster system are handled automatically without manual intervention, the computer cluster system remains usable after a node fails, the workload of maintenance personnel is reduced, and the fault tolerance of the computer cluster system is improved.
The main design idea of the technical solution of the present application is as follows: the message middleware and the single-node monitoring programs form a monitoring network covering all nodes of the computer cluster system, which monitors the service state and network state of every node in real time. If a node fault is found, the monitoring program on that node reports the fault information to the management center for unified handling. The node monitoring programs and the fault messages are defined in a standardized way, and the handling of each class of fault also follows a unified standard. The aim is to achieve high availability of the computer cluster system while saving cost, manpower, and material resources, and to keep the computer cluster system continuously available as long as no major accident occurs.
To make the technical principles, features, and technical effects of the technical solution of the present application clearer, the technical solution is described in detail below with reference to specific embodiments.
The flow of a fault handling method for a computer cluster system provided by an embodiment of the present application is shown in Fig. 1 and comprises:
Step 101: Select at least two nodes in the computer cluster system and set them as management nodes that undertake fault handling and management of the computer cluster system, one of the management nodes serving as the main node and the rest as standby nodes;
Step 102: The bottom monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom monitoring service module notifies the message middleware service module to send a fault message to the management center service module of the main node;
Step 103: The management center service module of the main node handles the fault according to the fault message.
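The flow of steps 102 and 103 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-process `queue.Queue` stands in for the message middleware, and the names `FaultMessage`, `BottomMonitor`, `ManagementCenter`, and the fault-type labels are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Callable, Dict, List


@dataclass
class FaultMessage:
    node_id: str
    fault_type: str  # hypothetical labels such as "system" or "hardware"
    detail: str


class BottomMonitor:
    """Per-node monitor: hands a fault message to the middleware (step 102)."""

    def __init__(self, node_id: str, middleware: Queue) -> None:
        self.node_id = node_id
        self.middleware = middleware  # stand-in for the message middleware

    def report_if_faulty(self, healthy: bool, fault_type: str, detail: str) -> bool:
        if healthy:
            return False
        self.middleware.put(FaultMessage(self.node_id, fault_type, detail))
        return True


class ManagementCenter:
    """Runs on the main node and handles faults by type (step 103)."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[FaultMessage], str]] = {}

    def register(self, fault_type: str, handler: Callable[[FaultMessage], str]) -> None:
        self.handlers[fault_type] = handler

    def process_pending(self, middleware: Queue) -> List[str]:
        results = []
        while not middleware.empty():
            msg = middleware.get()
            handler = self.handlers.get(msg.fault_type)
            if handler is None:
                results.append(f"no handler for fault type '{msg.fault_type}' on {msg.node_id}")
            else:
                results.append(handler(msg))
        return results
```

A real deployment would replace the queue with whichever message middleware is installed on the nodes; the dispatch-by-type structure is the part that mirrors the description.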
The solution of the embodiment of the present application relies mainly on message middleware. A bottom monitoring program monitors the condition of each node and reports any fault promptly, and designated nodes of the computer cluster system collect and handle the fault messages centrally. The invention requires the message middleware to be installed, together with the single-node monitoring service and the management center service developed for the computer cluster system; the operating system used is Linux. The fault handling system of the embodiment mainly involves four key parts: the message middleware service module, the bottom monitoring service module, the management center service module, and the failover processing module.
The deployment process of the fault handling method for a computer cluster system provided by the embodiment of the present application is shown in Fig. 2 and comprises:
Step 201: Install and start the Linux system.
For each node in the computer cluster system, install the required Linux system correctly, configure it, and then start it.
Step 202: Install and start the message middleware service.
Install the message middleware correctly on every node of the computer cluster system and start it, and make sure that it works properly and delivers messages accurately.
Step 203: Start the other services of the computer cluster system.
Start the management center service module and the bottom monitoring service module correctly on all nodes of the computer cluster system. The bottom monitoring service module is responsible for monitoring the running state and the software and hardware load of each node. The management center service module is responsible for processing messages, analyzing the fault type, and handling each fault according to its type.
Step 204: Configure the main and standby nodes.
Through an application programming interface (API) or the web interface of the operation and maintenance (O&M) software, select 2 or 3 nodes in the computer cluster system and set them as management nodes that undertake fault handling and management of the computer cluster system. This ensures that the computer cluster system works normally and has failover capability. One of the selected management nodes is the main node and the rest are standby nodes. Correspondingly, the nodes in the computer cluster system other than the management nodes are called ordinary nodes.
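The role assignment in step 204 can be illustrated with a small Python sketch. The function name `assign_roles`, the role labels, and the convention that the first selected management node becomes the main node are assumptions made for this example; the patent only states that one management node is the main node and the rest are standby nodes.

```python
from typing import Dict, List


def assign_roles(all_nodes: List[str], management_nodes: List[str]) -> Dict[str, str]:
    """Label every node as 'main', 'standby', or 'ordinary'."""
    if len(management_nodes) < 2:
        raise ValueError("at least two management nodes are required")
    if len(set(management_nodes)) != len(management_nodes):
        raise ValueError("management nodes must be distinct")
    unknown = set(management_nodes) - set(all_nodes)
    if unknown:
        raise ValueError(f"not in the cluster: {sorted(unknown)}")

    roles = {node: "ordinary" for node in all_nodes}
    # Assumption: the first selected management node acts as the main node.
    roles[management_nodes[0]] = "main"
    for node in management_nodes[1:]:
        roles[node] = "standby"
    return roles
```

For example, `assign_roles(["n1", "n2", "n3", "n4"], ["n2", "n3"])` labels `n2` as the main node, `n3` as a standby node, and the other two as ordinary nodes.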
After the above steps, the computer cluster system is in its normal working state. If a fault occurs, the computer cluster system can respond to it quickly and take over the failed node as needed, ensuring high availability of the computer cluster system.
Several common fault types and the corresponding handling methods are given below:
System faults
System faults include but are not limited to excessive memory, CPU, or system disk utilization (the threshold is 70% by default and can be configured according to actual conditions). When the bottom monitoring service module detects such a fault, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the main node. The message contains the failed node's information, the time of the fault, and so on.
Because such faults do not affect the normal operation of the main node, the management center service module of the main node informs maintenance personnel of the fault content by email or other means, or maintenance personnel can check the corresponding system indicators on the web page of the O&M software. The administrator does not need to enter the machine room to inspect machines, which greatly simplifies the administrator's work.
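A minimal sketch of the utilization check follows. The 70% default comes from the description above; the function name `find_system_faults`, the metric keys, and the per-metric override dictionary are assumptions for illustration, and how the utilization figures are sampled is left out.

```python
from typing import Dict, List, Optional

DEFAULT_THRESHOLD = 70.0  # percent, the default stated in the description


def find_system_faults(
    usage: Dict[str, float],
    thresholds: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return one description per metric whose utilization exceeds its threshold."""
    thresholds = thresholds or {}
    faults = []
    for metric in ("memory", "cpu", "system_disk"):  # hypothetical metric keys
        value = usage.get(metric)
        limit = thresholds.get(metric, DEFAULT_THRESHOLD)
        # "Exceeds" is read as strictly greater than the threshold.
        if value is not None and value > limit:
            faults.append(f"{metric} utilization {value:.1f}% exceeds {limit:.1f}%")
    return faults
```

Each returned string could serve as the fault content that the management center forwards to maintenance personnel.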
Device hardware faults
Device hardware faults include but are not limited to disk faults, RAID faults, and network card faults. When the bottom monitoring service module detects such a fault, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the main node. The management center service module handles the fault by notifying the administrator of the identifier of the failed hardware and removing the faulty device.
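The two actions for a hardware fault (notify, then remove the device) might look like the following Python sketch. The inventory structure, the function name `handle_hardware_fault`, and the use of a plain list as the notification channel are all assumptions for this example; the patent does not specify how devices are tracked.

```python
from typing import Dict, List, Set


def handle_hardware_fault(
    cluster_devices: Dict[str, Set[str]],
    node_id: str,
    hardware_id: str,
    notifications: List[str],
) -> bool:
    """Notify the administrator, then remove the faulty device from the inventory.

    Returns True if the device was present and has been removed.
    """
    # The notification is recorded even if the device is already gone.
    notifications.append(f"hardware fault on {node_id}: {hardware_id}")
    devices = cluster_devices.get(node_id, set())
    if hardware_id in devices:
        devices.remove(hardware_id)
        return True
    return False
```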
Ordinary node software faults
Software faults are faults in the software used by the computer cluster system, such as message middleware faults, management center service faults, and bottom monitoring service faults. For an ordinary node, this class of fault mainly refers to the failure of a service that every node in the computer cluster system runs to serve that single node. In this case the node is handled by marking its state with a defined state value and informing maintenance personnel of the specific fault information by email or other means. Such faults require human intervention on the failed node, and the fault is repaired manually.
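Marking a node with a defined state value can be sketched as below. The `NodeState` enumeration and its values are hypothetical, since the patent does not list the actual state values, and `notify` is any callable that delivers the message (email or otherwise).

```python
from enum import Enum
from typing import Callable, Dict


class NodeState(Enum):
    # Hypothetical state values; the patent does not enumerate them.
    NORMAL = 0
    SOFTWARE_FAULT = 1


def mark_software_fault(
    states: Dict[str, NodeState],
    node_id: str,
    service: str,
    notify: Callable[[str], None],
) -> None:
    """Record the fault state of an ordinary node and inform maintenance personnel."""
    states[node_id] = NodeState.SOFTWARE_FAULT
    notify(f"node {node_id}: service '{service}' failed; manual repair required")
```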
Management center software faults
Software faults are faults in the software used by the computer cluster system, such as message middleware faults, management center service faults, and bottom monitoring service faults. When the management center service module of the main node fails, the main node cannot work normally. A new main node must then be elected from the standby nodes according to some principle (for example node load or the smallest-IP principle) to take over the work of the former main node, that is, providing service externally and management internally. Likewise, when a standby node fails or goes offline, it is taken over by another standby node. This process is called automatic management node switching.
An example implementation of automatic management node switching is as follows: a standby node learns through the message mechanism that the main node has failed or gone offline; the standby node starts the election mechanism, learns from the database that it is the node with the smallest IP, and then takes over the work previously performed by the main node, becoming the new main node.
When such a fault occurs, failover is required to ensure high availability of the computer cluster system. The switching process needs no manual intervention and is monitored automatically throughout; the administrator can follow it on the web O&M page. Faults are discovered quickly, and the switching process is brief and does not affect normal use of the computer cluster system.
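The smallest-IP election principle mentioned above can be sketched with Python's standard `ipaddress` module. The function name `elect_new_main` and the idea of passing the known-unavailable nodes explicitly are assumptions for this example; it also assumes all addresses are of the same IP version, since IPv4 and IPv6 addresses cannot be ordered against each other.

```python
import ipaddress
from typing import Iterable, List


def elect_new_main(standby_ips: List[str], unavailable: Iterable[str] = ()) -> str:
    """Pick the available standby node with the numerically smallest IP address."""
    excluded = set(unavailable)
    candidates = [ip for ip in standby_ips if ip not in excluded]
    if not candidates:
        raise RuntimeError("no standby node available to take over")
    # Compare as addresses, not strings: "10.0.0.3" is smaller than "10.0.0.12".
    return min(candidates, key=ipaddress.ip_address)
```

Comparing parsed addresses rather than raw strings matters here: a plain string comparison would rank `10.0.0.12` ahead of `10.0.0.3`.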
Node offline
This class of fault mainly refers to a node losing power, losing its network connection, and similar situations. The computer cluster system detects that the node is offline through the heartbeat mechanism implemented by the message middleware. If the node is the main node, automatic main node switching is performed and the node then enters aging; if it is an ordinary node, it enters aging directly. After the aging period, all information about the node is deleted from the entire computer cluster system; that is, the node is no longer treated as a node of the computer cluster system and no longer undertakes any of its work. The heartbeat mechanism in the embodiment of the present application is as follows: each node of the computer cluster system sends its heartbeat messages to the message middleware module on the main node, and the main node and the standby nodes collect and manage the heartbeat messages. If the interval between the timestamp in the last heartbeat message received from a node and the current time exceeds a preset threshold and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
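The heartbeat timeout and aging logic can be sketched as follows. The class name `HeartbeatTracker`, the periodic `sweep` call, and the two parameters are assumptions for illustration. Two behaviors are also assumed because the patent does not specify them: aging starts at the sweep that first detects the node as offline, and a heartbeat received during aging cancels it. Main node switching before aging is omitted here.

```python
from typing import Dict, List


class HeartbeatTracker:
    def __init__(self, offline_after: float, aging_period: float) -> None:
        self.offline_after = offline_after  # preset heartbeat threshold, seconds
        self.aging_period = aging_period    # time in aging before deletion, seconds
        self.last_seen: Dict[str, float] = {}
        self.aging_since: Dict[str, float] = {}

    def record(self, node_id: str, timestamp: float) -> None:
        """Store the timestamp carried by the latest heartbeat message."""
        self.last_seen[node_id] = timestamp
        # Assumption: a node that reports again during aging is restored.
        self.aging_since.pop(node_id, None)

    def offline_nodes(self) -> List[str]:
        return sorted(self.aging_since)

    def sweep(self, now: float) -> List[str]:
        """Mark timed-out nodes as aging; delete and return nodes whose aging ended."""
        deleted = []
        for node_id, seen in list(self.last_seen.items()):
            if now - seen > self.offline_after:
                started = self.aging_since.setdefault(node_id, now)
                if now - started >= self.aging_period:
                    del self.last_seen[node_id]
                    del self.aging_since[node_id]
                    deleted.append(node_id)
        return deleted
```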
The invention achieves the following effects:
1. Because fault handling in the computer cluster system is implemented with a message mechanism, node faults are reported promptly and accurately and can be handled according to their type. Both hardware faults and software faults receive a rapid response, which considerably reduces the maintenance burden on administrators;
2. The nodes of the computer cluster system are managed in a unified way, and the main node performs load balancing, data distribution, and similar operations centrally, which greatly improves the efficiency of the computer cluster system. The more nodes the computer cluster system has, the more pronounced this advantage becomes;
3. During fault handling, most operations are executed automatically by programs, without manual intervention and without affecting normal operation of the computer cluster system. No complex configuration or extra tools are needed, so the solution is easy to operate and easy to maintain;
4. The invention is applicable not only to server platforms of different brands but also to various virtual machines, and therefore has good hardware platform adaptability. Thanks to the message middleware, message reliability is high, which ensures accurate switching in the computer cluster system; the switching time is brief and does not affect normal use of the computer cluster system; and the Linux system is highly stable, which reduces the impact on customer business while the computer cluster system is being maintained.
The above are only preferred embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the technical solution of the present application shall fall within the scope of protection of the present application.

Claims (7)

1. A fault handling method for a computer cluster system, characterized by comprising:
A. selecting at least two nodes in the computer cluster system and setting them as management nodes that undertake fault handling and management of the computer cluster system, one of the management nodes serving as the main node and the rest as standby nodes;
B. the bottom monitoring service module of each node in the computer cluster system monitoring the running state and the software and hardware load of that node and judging whether a fault has occurred; if so, the bottom monitoring service module notifying the message middleware service module on that node to send a fault message to the management center service module of the main node;
C. the management center service module of the main node handling the fault according to the fault message.
2. The method according to claim 1, characterized in that the fault is that the memory, CPU, or system disk utilization of a node exceeds a predefined threshold;
step C is: the management center service module of the main node reports the fault content to maintenance personnel.
3. The method according to claim 1, characterized in that the fault is a hardware fault;
step C is: the management center service module of the main node notifies the administrator of the identifier of the failed hardware and removes the faulty device from the computer cluster system.
4. The method according to claim 1, characterized in that the failed node is an ordinary node and the fault is a software fault;
step C is: the management center service module of the main node marks the state of the node with a defined state value and notifies maintenance personnel of the specific fault information.
5. The method according to claim 1, characterized in that the failed node is the main node and the fault is a software fault;
step C is: a new main node is elected from the standby nodes to take over the work of the former main node.
6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
the computer cluster system detecting, through a heartbeat mechanism, that a node is offline; if the node is the main node, electing a new main node from the standby nodes to take over the work of the former main node and then placing the former main node into aging; if the node is an ordinary node, placing it directly into aging;
after the aging period, deleting all information about the node from the computer cluster system.
7. The method according to claim 6, characterized in that the heartbeat mechanism is: each node of the computer cluster system sends its heartbeat messages to the message middleware service module on the main node, and the main node and the standby nodes collect and manage the heartbeat messages; if the interval between the timestamp in the last heartbeat message received from a node and the current time exceeds a preset threshold and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
CN201310548737.2A 2013-11-07 2013-11-07 Fault processing method of computer cluster system Expired - Fee Related CN103607297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310548737.2A CN103607297B (en) 2013-11-07 2013-11-07 Fault processing method of computer cluster system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310548737.2A CN103607297B (en) 2013-11-07 2013-11-07 Fault processing method of computer cluster system

Publications (2)

Publication Number Publication Date
CN103607297A CN103607297A (en) 2014-02-26
CN103607297B true CN103607297B (en) 2017-02-08

Family

ID=50125498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310548737.2A Expired - Fee Related CN103607297B (en) 2013-11-07 2013-11-07 Fault processing method of computer cluster system

Country Status (1)

Country Link
CN (1) CN103607297B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106933693A (en) * 2017-03-15 2017-07-07 郑州云海信息技术有限公司 A kind of data-base cluster node failure self-repairing method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101183996A (en) * 2007-12-13 2008-05-21 浪潮电子信息产业股份有限公司 Cluster information monitoring method
CN101373447A (en) * 2008-08-20 2009-02-25 上海超级计算中心 System and method for detecting health degree of computer cluster
CN102231681A (en) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 High availability cluster computer system and fault treatment method thereof

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US6427163B1 (en) * 1998-07-10 2002-07-30 International Business Machines Corporation Highly scalable and highly available cluster system management scheme

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN106933693A (en) * 2017-03-15 2017-07-07 郑州云海信息技术有限公司 A kind of data-base cluster node failure self-repairing method and system

Also Published As

Publication number Publication date
CN103607297A (en) 2014-02-26

Similar Documents

Publication Publication Date Title
CN103607297B (en) Fault processing method of computer cluster system
TWI746512B (en) Physical machine fault classification processing method and device, and virtual machine recovery method and system
CN105187249B (en) A kind of fault recovery method and device
CN107147540A (en) Fault handling method and troubleshooting cluster in highly available system
CN103873279B (en) Server management method and server management device
CN105095001B (en) Virtual machine abnormal restoring method under distributed environment
US10771323B2 (en) Alarm information processing method, related device, and system
US8555189B2 (en) Management system and management system control method
KR20090035152A (en) Autonomous fault processing system in home network environments and operation method thereof
CN104113428B (en) A kind of equipment management device and method
CN106789306A (en) Restoration methods and system are collected in communication equipment software fault detect
CN105323113A (en) A visualization technology-based system fault emergency handling system and a system fault emergency handling method
CN105306272A (en) Method and system for collecting fault scene information of information system
US11886904B2 (en) Virtual network function VNF deployment method and apparatus
JP2013030826A (en) Network monitoring system and network monitoring method
EP2518627A2 (en) Partial fault processing method in computer system
CN109726046A (en) Computer room switching method and switching device
CN110134518A (en) A kind of method and system improving big data cluster multinode high application availability
CN112035319A (en) Monitoring alarm system for multi-path state
CN105893211A (en) Method and system for monitoring
WO2021114971A1 (en) Method for detecting whether application system based on multi-tier architecture operates normally
CN108429656A (en) A method of monitoring physical machine network interface card connection status
CN110119325A (en) Server failure processing method, device, equipment and computer readable storage medium
JP2013130901A (en) Monitoring server and network device recovery system using the same
CN105072386A (en) Video networking system based on multicast technologies and state monitoring method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 201112 Shanghai, Minhang District, No. 1188 Lianhang Road, Building 8, 2nd Floor, Unit A-1

Applicant after: SHANGHAI EISOO INFORMATION TECHNOLOGY CO., LTD.

Address before: 200072 room 3, building 840, No. 101 Middle Luochuan Road, Shanghai, Zhabei District

Applicant before: Shanghai Eisoo Software Co.,Ltd.

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170208

Termination date: 20191107