CN103607297B - Fault processing method of computer cluster system - Google Patents
- Publication number
- CN103607297B (application number CN201310548737.2A)
- Authority
- CN
- China
- Prior art keywords
- node
- fault
- computer cluster
- service module
- message
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a fault processing method for a computer cluster system. The method comprises the following steps: (A) at least two nodes in the computer cluster system are selected and set as management nodes, which undertake fault processing and management of the computer cluster system; one of the management nodes serves as the main node and the others serve as standby nodes; (B) the bottom monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred, and if so, the bottom monitoring service module notifies the message middleware service module to send a fault message to the management center service module of the main node; and (C) the management center service module of the main node carries out fault processing according to the fault message. With this technical scheme, faults of the computer cluster system can be processed automatically without human intervention.
Description
Technical field
The present application relates to computer technology, in particular to computer cluster systems, and more particularly to a fault processing method for a computer cluster system.
Background art
With the advance of information technology, enterprises and other organizations depend more and more on computer systems. As data volumes expand sharply, a single computer can no longer meet their needs, while a supercomputer greatly increases cost. Computer cluster technology emerged in response to this situation.
A computer cluster system is a group of loosely integrated computers whose software or hardware is connected so that they cooperate closely to complete computing work. The computers that make up a computer cluster system can logically be regarded as a single computer. A single computer in the cluster system is usually called a node. The nodes can be connected through a local area network, and other connection methods are also supported. Computer cluster systems are commonly used to improve on the computing speed of a single computer and to balance the data traffic load. Because of their high computing speed and low price, computer cluster systems are widely favored and have spread rapidly.
A computer cluster system may contain anywhere from a few nodes to hundreds or even thousands. When one or more nodes in the system fail, the computing speed of the cluster system is usually affected, and in some cases none of the nodes in the system can be used normally. For users, therefore, ensuring that the cluster system as a whole remains usable when any one node fails, without loss of computing speed, is the key to improving work efficiency and creating value.
The usual way to handle faults in a computer cluster system is for maintenance personnel to enter the machine room, locate the faulty machine among the many nodes, determine the cause of the failure, and then carry out repairs. As the number of nodes grows, more maintenance personnel and more work may be needed, so the cost is high and the efficiency is very low.
Summary of the invention
The present application provides a fault processing method for a computer cluster system that can process faults of the computer cluster system automatically, without manual intervention.
A fault processing method for a computer cluster system provided by an embodiment of the present application includes:
A. selecting at least two nodes in the computer cluster system and setting them as management nodes, which undertake fault processing and management of the computer cluster system, where one of the management nodes serves as the main node and the rest serve as standby nodes;
B. the bottom monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom monitoring service module notifies the message middleware service module of that node to send a fault message to the management center service module of the main node;
C. the management center service module of the main node carries out fault processing according to the fault message.
Preferably, the fault is that the memory, CPU, or system disk utilization of a node exceeds a predefined threshold;
Step C is: the management center service module of the main node reports the fault content to maintenance personnel.
Preferably, the fault is a hardware fault;
Step C is: the management center service module of the main node notifies the administrator of the identifier of the faulty hardware and removes the faulty device from the computer cluster system.
Preferably, the faulty node is an ordinary node and the fault is a software fault;
Step C is: the management center service module of the main node marks the state of the node with a predefined state value and notifies maintenance personnel of the specific fault information.
Preferably, the faulty node is the main node and the fault is a software fault;
Step C is: a new main node is elected from the standby nodes to take over the work of the former main node.
Preferably, the method further includes:
the computer cluster system detects through a heartbeat mechanism that a node is offline; if the node is the main node, a new main node is elected from the standby nodes to take over the work of the former main node, after which the former main node enters aging; if the node is an ordinary node, it enters aging directly;
after the aging period, all information about the node is deleted from the computer cluster system.
Preferably, each node of the computer cluster system sends its heartbeat messages to the message middleware service module on the main node, and the main node and standby nodes collect and manage the heartbeat messages; if the interval between the timestamp in the last heartbeat message received and the current time exceeds a preset threshold and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
As can be seen from the above technical scheme, the message middleware and the single-node monitoring program form a monitoring network that covers every node of the computer cluster system and monitors the service state and network state of each node in real time. If a node fault is found, the monitoring program on that node reports the fault information to the management center for unified processing. Faults of the computer cluster system are thus processed automatically without manual intervention, the cluster system remains usable after a node fails, the workload of maintenance personnel is reduced, and the fault tolerance of the computer cluster system is improved.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a fault processing method for a computer cluster system provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of the deployment process of the fault processing method for a computer cluster system provided by an embodiment of the present application.
Detailed description of the embodiments
To address the problems in the prior art, the present application provides a fault processing method for a computer cluster system. Faults of the cluster system are reported through a message mechanism and processed by designated nodes. Faults are thus processed automatically without manual intervention, the cluster system remains usable after a node fails, the workload of maintenance personnel is reduced, and the fault tolerance of the computer cluster system is improved.
The main design idea of the technical scheme is as follows. The message middleware and the single-node monitoring program form a monitoring network that covers every node of the computer cluster system and monitors the service state and network state of each node in real time. If a node fault is found, the monitoring program on that node reports the fault information to the management center for unified processing. The node monitoring program and the fault messages have standardized definitions, and the processing of each kind of fault follows a unified standard. The aim is to achieve high availability of the computer cluster system while saving cost, manpower, and material resources, and to keep the computer cluster system continuously available provided that no major accident occurs.
To make the technical principles, features, and technical effects of the technical scheme clearer, the scheme is described in detail below with reference to specific embodiments.
The flow of a fault processing method for a computer cluster system provided by an embodiment of the present application is shown in Fig. 1 and includes:
Step 101: Select at least two nodes in the computer cluster system and set them as management nodes, which undertake fault processing and management of the computer cluster system; one of the management nodes serves as the main node and the rest serve as standby nodes;
Step 102: The bottom monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom monitoring service module notifies the message middleware service module to send a fault message to the management center service module of the main node;
Step 103: The management center service module of the main node carries out fault processing according to the fault message.
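Steps 102 and 103 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names, the fields of `FaultMessage`, and the use of an in-process queue in place of real message middleware are all assumptions made for illustration, and the main node is assumed to have been chosen already in step 101.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class FaultMessage:
    # Hypothetical message layout; the source only says that a fault
    # message carries node information and the fault time.
    node_id: str
    fault_type: str
    detail: str
    timestamp: float = field(default_factory=time.time)


class MessageMiddleware:
    """Stand-in for the message middleware service module (an in-process queue)."""

    def __init__(self):
        self._queue = queue.Queue()

    def send(self, message):
        self._queue.put(message)

    def receive_all(self):
        messages = []
        while not self._queue.empty():
            messages.append(self._queue.get())
        return messages


class BottomMonitor:
    """Step 102: runs on one node and reports faults through the middleware."""

    def __init__(self, node_id, middleware):
        self.node_id = node_id
        self.middleware = middleware

    def report(self, fault_type, detail):
        self.middleware.send(FaultMessage(self.node_id, fault_type, detail))


class ManagementCenter:
    """Step 103: runs on the main node and processes the fault messages."""

    def __init__(self, middleware):
        self.middleware = middleware
        self.handled = []

    def process_pending(self):
        for message in self.middleware.receive_all():
            # A real handler would branch on the fault type here.
            self.handled.append((message.node_id, message.fault_type))
        return self.handled


middleware = MessageMiddleware()
center = ManagementCenter(middleware)
BottomMonitor("node-7", middleware).report("hardware", "disk sdb failed")
center.process_pending()
```

In a deployed system the queue would be replaced by the message middleware service, so that monitors on other machines can reach the management center on the main node.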
The scheme of the embodiment mainly uses message middleware: a bottom monitoring program monitors the condition of each node and reports any fault promptly, and designated nodes of the computer cluster system collect and process the fault messages. The invention requires installing the message middleware, together with the single-node monitoring service and the cluster management center service developed for the computer cluster system; the operating system used is Linux. The fault processing system of the embodiment involves four key parts: the message middleware service module, the bottom monitoring service module, the management center service module, and the failover processing module.
The deployment process of the fault processing method for a computer cluster system provided by an embodiment of the present application is shown in Fig. 2 and includes:
Step 201: Install and start the Linux system.
On each node of the computer cluster system, install the required Linux system correctly, configure it, and start it.
Step 202: Install and start the message middleware service.
Install and start the message middleware on each node of the computer cluster system, and make sure that it works normally and delivers messages accurately.
Step 203: Start the other services of the computer cluster system.
Start the management center service module and the bottom monitoring service module on all nodes of the computer cluster system. The bottom monitoring service module monitors the running state and the software and hardware load of each node. The management center service module processes messages, analyzes the fault type, and handles each fault according to its type.
Step 204: Configure the main and standby nodes.
Through an application programming interface (API) or the web interface of the operations and maintenance software, select 2 or 3 nodes in the computer cluster system and set them as management nodes, which undertake fault processing and management of the computer cluster system; this keeps the cluster system working normally and gives it failover capability. One of the selected management nodes is the main node and the rest are standby nodes. Correspondingly, the nodes in the computer cluster system other than the management nodes are called ordinary nodes.
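The role assignment in step 204 could be sketched as follows. The function name `assign_roles`, the role labels, and the convention that the first selected management node becomes the main node are assumptions for illustration; the source does not say how the main node is picked among the management nodes at configuration time.

```python
def assign_roles(all_nodes, management_nodes):
    """Label every node as "main", "standby" or "ordinary".

    Assumption: the first entry of management_nodes becomes the main node.
    """
    if len(management_nodes) < 2:
        raise ValueError("at least two management nodes are required")
    unknown = set(management_nodes) - set(all_nodes)
    if unknown:
        raise ValueError(f"not cluster members: {sorted(unknown)}")

    roles = {node: "ordinary" for node in all_nodes}
    roles[management_nodes[0]] = "main"
    for node in management_nodes[1:]:
        roles[node] = "standby"
    return roles


roles = assign_roles(["n1", "n2", "n3", "n4"], ["n2", "n3"])
```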
After the above process, the computer cluster system is in its normal working state. If a fault occurs, the cluster system can respond and process it quickly, taking over the faulty node as needed, which ensures the high availability of the computer cluster system.
Several common fault types and the corresponding processing methods are given below:
System faults
System faults include, but are not limited to, excessive memory, CPU, or system disk utilization (the default threshold is 70% and can be configured as needed). When the bottom monitoring service module detects such a fault, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the main node. The message contains the faulty node's information, the fault time, and so on.
Because such faults do not affect the normal operation of the main node, the management center service module of the main node informs maintenance personnel of the fault content by mail or other means, or maintenance personnel check the corresponding system indicators on the web page of the operations and maintenance software. The administrator does not need to enter the machine room to inspect the machine, which greatly simplifies the administrator's work.
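The threshold check performed by the bottom monitoring service module might look like the following sketch. Only the 70% default comes from the text; the function name `check_system_load`, the metric names, and the dictionary layout of the fault message are hypothetical, and the utilization values are passed in directly rather than read from the operating system.

```python
import time

DEFAULT_THRESHOLD = 0.70  # default of 70% from the text; configurable


def check_system_load(node_id, usage, threshold=DEFAULT_THRESHOLD, now=None):
    """Return one fault message per metric whose utilization exceeds threshold.

    usage maps a metric name ("memory", "cpu", "system_disk") to a
    fraction between 0 and 1.
    """
    fault_time = time.time() if now is None else now
    return [
        {
            "node": node_id,
            "fault_type": "system",
            "metric": metric,
            "value": value,
            "threshold": threshold,
            "fault_time": fault_time,
        }
        for metric, value in sorted(usage.items())
        if value > threshold
    ]


faults = check_system_load(
    "node-3", {"memory": 0.82, "cpu": 0.40, "system_disk": 0.71}, now=1000.0
)
```

Here `faults` holds two messages, one for memory and one for the system disk; each would be handed to the message middleware service module for delivery to the main node.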
Device hardware faults
Device hardware faults include, but are not limited to, disk faults, RAID faults, and network card faults. When the bottom monitoring service module detects such a fault, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the main node. The management center service module handles the fault by notifying the administrator of the identifier of the faulty hardware and removing the faulty device.
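A minimal sketch of this handling is given below, assuming the management center keeps an inventory that maps hardware identifiers to nodes. The `ClusterInventory` class, the identifier format, and the in-memory notification list are illustrative assumptions rather than details from the source.

```python
class ClusterInventory:
    """Hypothetical record of the hardware devices known to the cluster."""

    def __init__(self, devices):
        # devices: mapping of hardware identifier -> owning node
        self.devices = dict(devices)
        self.notifications = []  # stands in for mail to the administrator

    def handle_hardware_fault(self, hardware_id, description):
        """Notify the administrator, then remove the faulty device."""
        node = self.devices.get(hardware_id)
        if node is None:
            return False  # unknown or already removed device
        self.notifications.append(
            f"Hardware fault on {node}: {hardware_id} ({description})"
        )
        del self.devices[hardware_id]
        return True


inventory = ClusterInventory(
    {"disk-sdb@node-4": "node-4", "nic-eth0@node-4": "node-4"}
)
removed = inventory.handle_hardware_fault("disk-sdb@node-4", "disk failure")
```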
Ordinary node software faults
Software faults are faults in the software used by the computer cluster system, such as a message middleware fault, a management center service fault, or a bottom monitoring service fault. This category mainly refers to faults in the per-node services that every node in the computer cluster system runs. The node is handled by marking its state with a predefined state value and informing maintenance personnel of the specific fault information by mail or other means. Such faults require manual intervention on the faulty node to repair the fault.
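Marking a node with a predefined state value can be sketched with an enumeration. The concrete state names and values below are hypothetical, since the source only states that the values are predefined, and the `outbox` list stands in for mail or another notification channel.

```python
from enum import Enum


class NodeState(Enum):
    # Hypothetical state values; the source only says they are predefined.
    NORMAL = 0
    SOFTWARE_FAULT = 1
    OFFLINE = 2


node_states = {"node-5": NodeState.NORMAL, "node-6": NodeState.NORMAL}
outbox = []  # stands in for mail or another notification channel


def handle_software_fault(node_id, service, detail):
    """Mark the node as faulty and queue a notice for maintenance personnel."""
    node_states[node_id] = NodeState.SOFTWARE_FAULT
    outbox.append(f"{node_id}: {service} failed: {detail}; manual repair required")


handle_software_fault("node-5", "bottom monitoring service", "process exited")
```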
Management center software faults
Software faults are faults in the software used by the computer cluster system, such as a message middleware fault, a management center service fault, or a bottom monitoring service fault. When the management center service module of the main node fails, the main node can no longer work normally, and a new main node must be elected from the standby nodes according to some rule (such as node load or the smallest-IP rule) to take over the work of the former main node, that is, to provide services externally and management internally. Likewise, when a standby node fails or goes offline, it is taken over by another standby node. This process is called automatic switching of management nodes.
An example of automatic management node switching is given below: a standby node learns through the message mechanism that the main node has failed or gone offline and starts the election mechanism; it learns from the database that it is the node with the smallest IP, takes over the work previously performed by the main node, and becomes the new main node.
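The smallest-IP rule in this example can be sketched as follows. The function name `elect_main_node` and its parameters are assumptions; in particular, the sketch takes the list of standby addresses as an argument, whereas the text describes each standby node looking this information up in a database.

```python
import ipaddress


def elect_main_node(standby_ips, failed_ips=()):
    """Pick the new main node: the live standby node with the smallest IP."""
    failed = set(failed_ips)
    candidates = [ip for ip in standby_ips if ip not in failed]
    if not candidates:
        raise RuntimeError("no standby node available for election")
    # Compare as addresses, not strings, so 10.0.0.9 sorts before 10.0.0.12.
    return min(candidates, key=ipaddress.ip_address)


new_main = elect_main_node(["10.0.0.12", "10.0.0.9"])
```

Because every standby node applies the same deterministic rule to the same data, each one reaches the same result without a separate voting round. A load-based rule, which the text also mentions, would replace the key function.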
When the above faults occur, failover is required to ensure the high availability of the computer cluster system. The switching process needs no manual intervention and is monitored automatically throughout; the administrator can follow it on the web operations and maintenance page. Faults are discovered quickly, and the switching process is brief and does not affect normal use of the computer cluster system.
Node offline
This category mainly covers situations such as power loss or network disconnection of a node. The computer cluster system detects through the heartbeat mechanism implemented by the message middleware that the node is offline. If it is the main node, it enters aging after automatic main node switching has been carried out; if it is an ordinary node, it enters aging directly. After the aging period, all information about the node is deleted from the whole computer cluster system; that is, the node is no longer a node of the computer cluster system and no longer undertakes any cluster work. The heartbeat mechanism in the embodiment is as follows: each node of the computer cluster system sends its heartbeat messages to the message middleware module on the main node, and the main node and standby nodes collect and manage the heartbeat messages. If the interval between the timestamp in the last heartbeat message received and the current time exceeds a preset threshold and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
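The heartbeat timeout and the aging period can be sketched together as below. The `HeartbeatTracker` class, the threshold values, and the rule that a fresh heartbeat cancels aging are assumptions not stated in the source; main node switching before aging is omitted for brevity.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat per node and ages out offline nodes."""

    def __init__(self, offline_threshold, aging_period):
        self.offline_threshold = offline_threshold  # seconds without heartbeat
        self.aging_period = aging_period            # seconds kept after going offline
        self.last_seen = {}                         # node -> last heartbeat timestamp
        self.aging_since = {}                       # node -> time it was found offline

    def record(self, node_id, timestamp):
        self.last_seen[node_id] = timestamp
        # Assumption: a fresh heartbeat cancels aging.
        self.aging_since.pop(node_id, None)

    def sweep(self, now):
        """Return (newly offline nodes, nodes deleted after the aging period)."""
        newly_offline, deleted = [], []
        for node_id, seen in list(self.last_seen.items()):
            if node_id in self.aging_since:
                if now - self.aging_since[node_id] > self.aging_period:
                    del self.last_seen[node_id]
                    del self.aging_since[node_id]
                    deleted.append(node_id)
            elif now - seen > self.offline_threshold:
                self.aging_since[node_id] = now
                newly_offline.append(node_id)
        return newly_offline, deleted


tracker = HeartbeatTracker(offline_threshold=30, aging_period=300)
tracker.record("node-8", timestamp=0)
tracker.record("node-9", timestamp=0)
tracker.record("node-9", timestamp=40)
first = tracker.sweep(now=45)    # node-8 silent for 45 s
second = tracker.sweep(now=400)  # node-8 aged for 355 s
```

The first sweep marks only node-8 as offline; the second deletes node-8 after its aging period and marks node-9, which has been silent since its last heartbeat, as offline.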
The invention achieves the following effects:
1. Because fault processing for the computer cluster system is implemented with a message mechanism, node faults in the cluster system are reported promptly and accurately and can be processed according to their type. Hardware and software faults alike receive a rapid response, which considerably reduces the administrator's maintenance burden;
2. The multiple nodes of the computer cluster system are managed in a unified way, and the main node centrally performs operations such as load balancing and data distribution, which greatly improves the efficiency of the computer cluster system. The more nodes the cluster system has, the more obvious this advantage is;
3. Fault processing in the computer cluster system is in most cases executed automatically by programs, without manual intervention and without affecting the normal operation of the cluster system. No complex configuration or extra tools are needed, so the scheme is easy to operate and easy to maintain;
4. The invention applies not only to server platforms of different brands but also to various virtual machines, so it adapts well to different hardware platforms. Thanks to the message middleware, messages are highly reliable, which ensures accurate switching in the computer cluster system; the switching time is short and does not affect normal use of the computer cluster system; and the Linux system is highly stable, which reduces the impact on customer services when the computer cluster system is maintained.
The above are only preferred embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the technical scheme shall fall within the scope of protection of the present application.
Claims (7)
1. A fault processing method for a computer cluster system, characterized by including:
A. selecting at least two nodes in the computer cluster system and setting them as management nodes, which undertake fault processing and management of the computer cluster system, where one of the management nodes serves as the main node and the rest serve as standby nodes;
B. the bottom monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom monitoring service module notifies the message middleware service module on that node to send a fault message to the management center service module of the main node;
C. the management center service module of the main node carries out fault processing according to the fault message.
2. The method according to claim 1, characterized in that the fault is that the memory, CPU, or system disk utilization of a node exceeds a predefined threshold;
Step C is: the management center service module of the main node reports the fault content to maintenance personnel.
3. The method according to claim 1, characterized in that the fault is a hardware fault;
Step C is: the management center service module of the main node notifies the administrator of the identifier of the faulty hardware and removes the faulty device from the computer cluster system.
4. The method according to claim 1, characterized in that the faulty node is an ordinary node and the fault is a software fault;
Step C is: the management center service module of the main node marks the state of the node with a predefined state value and notifies maintenance personnel of the specific fault information.
5. The method according to claim 1, characterized in that the faulty node is the main node and the fault is a software fault;
Step C is: a new main node is elected from the standby nodes to take over the work of the former main node.
6. The method according to any one of claims 1 to 5, characterized in that the method further includes:
the computer cluster system detects through a heartbeat mechanism that a node is offline; if the node is the main node, a new main node is elected from the standby nodes to take over the work of the former main node, after which the former main node enters aging; if the node is an ordinary node, it enters aging directly;
after the aging period, all information about the node is deleted from the computer cluster system.
7. The method according to claim 6, characterized in that the heartbeat mechanism is: each node of the computer cluster system sends its heartbeat messages to the message middleware service module on the main node, and the main node and standby nodes collect and manage the heartbeat messages; if the interval between the timestamp in the last heartbeat message received and the current time exceeds a preset threshold and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310548737.2A CN103607297B (en) | 2013-11-07 | 2013-11-07 | Fault processing method of computer cluster system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310548737.2A CN103607297B (en) | 2013-11-07 | 2013-11-07 | Fault processing method of computer cluster system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103607297A CN103607297A (en) | 2014-02-26 |
CN103607297B true CN103607297B (en) | 2017-02-08 |
Family
ID=50125498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310548737.2A Expired - Fee Related CN103607297B (en) | 2013-11-07 | 2013-11-07 | Fault processing method of computer cluster system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103607297B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106933693A (en) * | 2017-03-15 | 2017-07-07 | 郑州云海信息技术有限公司 | A kind of data-base cluster node failure self-repairing method and system |
Families Citing this family (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104267689B (en) * | 2014-09-22 | 2017-01-18 | 中国科学院寒区旱区环境与工程研究所 | Super computer room outage early warning and automatic power-on management method based on video image differentiation |
CN104270268B (en) * | 2014-09-28 | 2017-12-05 | 曙光信息产业股份有限公司 | A kind of distributed system network performance evaluation and method for diagnosing faults |
CN105681156B (en) * | 2014-11-19 | 2019-06-11 | 阿里巴巴集团控股有限公司 | Message issuance method, apparatus and system |
CN104579791A (en) * | 2015-01-26 | 2015-04-29 | 浪潮电子信息产业股份有限公司 | Method for achieving automatic K-DB main and standby disaster recovery cluster switching |
CN104767794B (en) * | 2015-03-13 | 2018-05-01 | 聚好看科技股份有限公司 | Node electoral machinery and node in a kind of distributed system |
CN104735069A (en) * | 2015-03-26 | 2015-06-24 | 浪潮集团有限公司 | High-availability computer cluster based on safety and reliability |
CN105007193A (en) * | 2015-08-19 | 2015-10-28 | 浪潮(北京)电子信息产业有限公司 | Multi-layer information processing method, system thereof and cluster management node |
CN105162632A (en) * | 2015-09-15 | 2015-12-16 | 浪潮集团有限公司 | Automatic processing system for server cluster failures |
CN107203420A (en) * | 2016-03-18 | 2017-09-26 | 北京京东尚科信息技术有限公司 | The master-slave switching method and device of task scheduling example |
CN106161090A (en) * | 2016-07-12 | 2016-11-23 | 许继集团有限公司 | The monitoring method of a kind of subregion group system and device |
CN106452952B (en) * | 2016-09-29 | 2019-11-22 | 华为技术有限公司 | A kind of method and gateway cluster detecting group system communications status |
CN106878077A (en) * | 2017-02-21 | 2017-06-20 | 深圳实现创新科技有限公司 | The method of controlling security and device of safety monitoring |
CN107247729B (en) * | 2017-05-03 | 2021-04-27 | 中国银联股份有限公司 | File processing method and device |
CN107343032A (en) * | 2017-06-21 | 2017-11-10 | 武汉慧联无限科技有限公司 | The offline detection method and device of terminal node in remote low power consumption network |
CN109218126B (en) * | 2017-06-30 | 2023-10-17 | 中兴通讯股份有限公司 | Method, device and system for monitoring node survival state |
CN107257298A (en) * | 2017-07-27 | 2017-10-17 | 郑州云海信息技术有限公司 | A kind of fault handling method and device |
CN107342905A (en) * | 2017-08-28 | 2017-11-10 | 郑州云海信息技术有限公司 | A kind of node scheduling method and system of cluster storage system failure transfer |
CN107704387B (en) * | 2017-09-26 | 2021-03-16 | 恒生电子股份有限公司 | Method, device, electronic equipment and computer readable medium for system early warning |
CN107831452A (en) * | 2017-10-31 | 2018-03-23 | 国网上海市电力公司 | DC control and protection system hostdown diagnoses and life appraisal equipment |
CN107948260A (en) * | 2017-11-15 | 2018-04-20 | 郑州云海信息技术有限公司 | Main monitoring node selecting method and device in a kind of distributed type assemblies |
CN108134706B (en) * | 2018-01-02 | 2020-08-18 | 中国工商银行股份有限公司 | Block chain multi-activity high-availability system, computer equipment and method |
CN108809729A (en) * | 2018-06-25 | 2018-11-13 | 郑州云海信息技术有限公司 | The fault handling method and device that CTDB is serviced in a kind of distributed system |
CN108847982B (en) * | 2018-06-26 | 2021-11-19 | 郑州云海信息技术有限公司 | Distributed storage cluster and node fault switching method and device thereof |
CN109117317A (en) * | 2018-11-01 | 2019-01-01 | 郑州云海信息技术有限公司 | A kind of clustering fault restoration methods and relevant apparatus |
CN111158962B (en) * | 2018-11-07 | 2023-10-13 | 中移信息技术有限公司 | Remote disaster recovery method, device and system, electronic equipment and storage medium |
CN111258840B (en) * | 2018-11-30 | 2023-10-10 | 杭州海康威视数字技术股份有限公司 | Cluster node management method and device and cluster |
CN109634787B (en) * | 2018-12-17 | 2022-04-26 | 浪潮电子信息产业股份有限公司 | Distributed file system monitor switching method, device, equipment and storage medium |
CN111338647B (en) * | 2018-12-18 | 2023-09-12 | 杭州海康威视数字技术股份有限公司 | Big data cluster management method and device |
CN111355600B (en) * | 2018-12-21 | 2023-05-02 | 杭州海康威视数字技术股份有限公司 | Main node determining method and device |
CN109714202B (en) * | 2018-12-21 | 2021-10-08 | 郑州云海信息技术有限公司 | Client off-line reason distinguishing method and cluster type safety management system |
CN109873719B (en) * | 2019-02-03 | 2019-12-31 | 华为技术有限公司 | Fault detection method and device |
CN111130920B (en) * | 2019-11-26 | 2022-03-11 | 网宿科技股份有限公司 | Hardware information acquisition method, device, server and storage medium |
CN111143027A (en) * | 2019-12-06 | 2020-05-12 | 北京浪潮数据技术有限公司 | Cloud platform management method, system, equipment and computer readable storage medium |
CN111865714B (en) * | 2020-06-24 | 2022-08-02 | 上海上实龙创智能科技股份有限公司 | Cluster management method based on multi-cloud environment |
CN112131077A (en) * | 2020-09-21 | 2020-12-25 | 中国建设银行股份有限公司 | Fault node positioning method and device and database cluster system |
CN112306747B (en) * | 2020-09-29 | 2023-04-11 | 新华三技术有限公司合肥分公司 | RAID card fault processing method and device |
CN112491633B (en) * | 2020-12-17 | 2023-01-24 | 北京浪潮数据技术有限公司 | Fault recovery method, system and related components of multi-node cluster |
CN112631718A (en) * | 2020-12-21 | 2021-04-09 | 常州微亿智造科技有限公司 | Method and system for realizing Controller and Worker service combination under industrial Internet of things |
CN114978875A (en) * | 2021-02-23 | 2022-08-30 | 广州汽车集团股份有限公司 | Vehicle-mounted node management method and device and storage medium |
CN115396295A (en) * | 2021-05-24 | 2022-11-25 | 中兴通讯股份有限公司 | Equipment operation and maintenance method, network equipment and storage medium |
CN113282334A (en) * | 2021-06-07 | 2021-08-20 | 深圳华锐金融技术股份有限公司 | Method and device for recovering software defects, computer equipment and storage medium |
CN114363156A (en) * | 2022-01-25 | 2022-04-15 | 南瑞集团有限公司 | Hydropower station computer monitoring system deployment method based on cluster technology |
CN114826905A (en) * | 2022-03-31 | 2022-07-29 | 西安超越申泰信息科技有限公司 | Method, system, equipment and medium for switching management service of lower node |
CN115242701B (en) * | 2022-07-25 | 2024-04-02 | 中国民用航空总局第二研究所 | Airport data platform cluster consumption processing method, device and storage medium |
CN116614348A (en) * | 2023-07-19 | 2023-08-18 | 联想凌拓科技有限公司 | System for remote copy service and method of operating the same |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101183996A (en) * | 2007-12-13 | 2008-05-21 | 浪潮电子信息产业股份有限公司 | Cluster information monitoring method |
CN101373447A (en) * | 2008-08-20 | 2009-02-25 | 上海超级计算中心 | System and method for detecting health degree of computer cluster |
CN102231681A (en) * | 2011-06-27 | 2011-11-02 | 中国建设银行股份有限公司 | High availability cluster computer system and fault treatment method thereof |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6427163B1 (en) * | 1998-07-10 | 2002-07-30 | International Business Machines Corporation | Highly scalable and highly available cluster system management scheme |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106933693A (en) * | 2017-03-15 | 2017-07-07 | 郑州云海信息技术有限公司 | Database cluster node failure self-repairing method and system |
Also Published As
Publication number | Publication date |
---|---|
CN103607297A (en) | 2014-02-26 |
Similar Documents
Publication | Title |
---|---|
CN103607297B (en) | Fault processing method of computer cluster system |
TWI746512B (en) | Physical machine fault classification processing method and device, and virtual machine recovery method and system |
CN105187249B (en) | Fault recovery method and device |
CN107147540A (en) | Fault handling method and troubleshooting cluster in highly available system |
CN103873279B (en) | Server management method and server management device |
CN105095001B (en) | Virtual machine exception recovery method in distributed environment |
US10771323B2 (en) | Alarm information processing method, related device, and system |
US8555189B2 (en) | Management system and management system control method |
KR20090035152A (en) | Autonomous fault processing system in home network environments and operation method thereof |
CN104113428B (en) | Equipment management device and method |
CN106789306A (en) | Method and system for detecting, collecting and recovering communication equipment software faults |
CN105323113A (en) | A visualization technology-based system fault emergency handling system and a system fault emergency handling method |
CN105306272A (en) | Method and system for collecting fault scene information of information system |
US11886904B2 (en) | Virtual network function VNF deployment method and apparatus |
JP2013030826A (en) | Network monitoring system and network monitoring method |
EP2518627A2 (en) | Partial fault processing method in computer system |
CN109726046A (en) | Computer room switching method and switching device |
CN110134518A (en) | Method and system for improving high availability of multi-node applications in a big data cluster |
CN112035319A (en) | Monitoring alarm system for multi-path state |
CN105893211A (en) | Method and system for monitoring |
WO2021114971A1 (en) | Method for detecting whether application system based on multi-tier architecture operates normally |
CN108429656A (en) | A method of monitoring physical machine network interface card connection status |
CN110119325A (en) | Server failure processing method, device, equipment and computer readable storage medium |
JP2013130901A (en) | Monitoring server and network device recovery system using the same |
CN105072386A (en) | Video networking system based on multicast technologies and state monitoring method |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 201112 Shanghai, Minhang District, United Airlines route 1188, building second layer A-1 unit 8. Applicant after: SHANGHAI EISOO INFORMATION TECHNOLOGY CO., LTD. Address before: 200072 room 3, building 840, No. 101 Middle Luochuan Road, Shanghai, Zhabei District. Applicant before: Shanghai Eisoo Software Co.,Ltd. |
COR | Change of bibliographic data | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20170208. Termination date: 20191107 |