CN107147540A - Fault handling method and fault handling cluster in a high-availability system - Google Patents

Fault handling method and fault handling cluster in a high-availability system

Info

Publication number
CN107147540A
Authority
CN
China
Prior art keywords
node
cluster
failure
fault handling
management object
Prior art date: 2017-07-19
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710589299.2A
Other languages
Chinese (zh)
Inventor
杨勇
亓开元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2017-07-19
Filing date: 2017-07-19
Publication date: 2017-09-08
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201710589299.2A
Publication of CN107147540A
Legal status: Pending (current)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 Arrangements for monitoring or testing data switching networks
    • H04L 43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/0823 Errors, e.g. transmission errors

Abstract

The invention discloses a fault handling method and a fault handling cluster in a high-availability system. Each node in the fault handling cluster includes: an acquisition module for obtaining the management objects of the working cluster in the high-availability system, where a management object is a node of the working cluster; a monitoring module for monitoring the running state of the management objects according to a preset monitoring policy; and a node management module for notifying a failed node to go offline when a node among the management objects can no longer process services because of a fault.

Description

Fault handling method and fault handling cluster in a high-availability system
Technical field
The present invention relates to the field of communication technology, and in particular to a fault handling method and a fault handling cluster in a high-availability system.
Background art
A high-availability (HA) cluster is an effective solution for ensuring business continuity. It typically has two or more nodes, divided into active nodes and standby nodes. The node that is executing services is called the active node, and a node that serves as its backup is called a standby node. When an active node fails and the running services (tasks) can no longer operate normally, a standby node detects this and immediately takes over execution of the services in place of the active node, so that services are not interrupted, or are interrupted only briefly.
However, in an HA system, when the connection between two nodes is broken, an HA system that originally acted as a coordinated whole splits into two independent parts. Having lost contact with each other, each side assumes that the other has failed. The HA software on the two nodes then competes for shared resources and fights over the application services like a "split brain", with serious consequences: the shared resources may be carved up so that the services on neither side can start, or the services on both nodes start and read and write the shared storage concurrently, corrupting data, for example corrupting HDFS file system metadata.
Therefore, in an HA system, when the connection between two nodes is broken, how to manage the nodes in the cluster so that services keep running normally is an urgent problem to be solved.
Summary of the invention
To solve the above technical problem, the invention provides a fault handling method and a fault handling cluster in a high-availability system that can prevent split-brain in a high-availability cluster.
To achieve the object of the invention, the invention provides a fault handling cluster in a high-availability system, in which each node includes:
an acquisition module, for obtaining the management objects of the working cluster in the high-availability system, where a management object is a node of the working cluster;
a monitoring module, for monitoring the running state of the management objects according to a preset monitoring policy;
a node management module, for notifying a failed node to go offline when a node among the management objects can no longer process services because of a fault.
The fault handling cluster has 2N+1 nodes, where one node is the master node, the remaining nodes are slave nodes, and N is a positive integer. The master node further includes:
a sending module, for notifying the slave nodes to select, from the working cluster, a node to continue the work in place of the failed node;
a determining module, for determining, according to the selection results of the slave nodes and the master node, the node that continues the work in place of the failed node.
Each node includes:
an election module, for selecting, from the working cluster, the node that continues the work in place of the failed node, and sending the selection result to the master node.
The node management module includes:
an acquiring unit, for obtaining the IP address of the baseboard management controller (BMC) on the failed node;
a transmitting unit, for sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
Each node may further include:
an alarm module, for outputting fault description information of the failed node.
Each node may further include:
a policy management module, for updating the monitoring policy according to a received monitoring policy update request and sending the updated monitoring policy to the monitoring module.
The invention also provides a fault handling method in a high-availability system, including:
a fault handling cluster obtaining the management objects corresponding to each node of the working cluster in the high-availability system, where a management object is a node of the working cluster;
monitoring the running state of the management objects according to a preset monitoring policy;
when a node among the management objects can no longer process services because of a fault, notifying the failed node to go offline.
After notifying the failed node to go offline, the method further includes:
notifying each node to select, from the working cluster, a node to continue the work in place of the failed node;
receiving the selection result sent by each node;
determining, according to the selection results, the node that continues the work in place of the failed node, where the fault handling cluster has 2N+1 nodes, one node is the master node, the remaining nodes are slave nodes, and N is a positive integer.
Notifying the failed node to go offline includes:
obtaining the IP address of the baseboard management controller (BMC) on the failed node;
sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
After notifying the failed node to go offline, the method further includes:
outputting fault description information of the failed node.
The method further includes:
after receiving a monitoring policy update request, updating the monitoring policy according to the update request and sending the updated monitoring policy to the monitoring module.
In the embodiments provided by the invention, fault diagnosis is performed on the cluster nodes; when the heartbeat of a node in the cluster is lost, a power-off operation is performed on the faulty node, ensuring that the faulty node is completely shut down and preventing split-brain in the high-availability cluster.
Other features and advantages of the invention will be set forth in the following description, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the technical solution of the invention and constitute a part of the description. Together with the embodiments of the application they serve to explain the technical solution of the invention, and do not limit it.
Fig. 1 is a structural diagram of a node in the fault handling cluster of the high-availability system provided by the invention;
Fig. 2 is a structural diagram of the IPMI-based high-availability cluster system provided by an application example of the invention;
Fig. 3 is a flow chart of the fault handling method under the system of Fig. 2;
Fig. 4 is a flow chart of the fault handling method in the high-availability system provided by the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the embodiments of the invention are described in detail below with reference to the accompanying drawings. It should be noted that, where there is no conflict, the embodiments in the application and the features in the embodiments may be combined with one another.
The steps illustrated in the flow charts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.
Fig. 1 is a structural diagram of a node in the fault handling cluster of the high-availability system provided by the invention. Each node in the fault handling cluster shown in Fig. 1 includes:
an acquisition module 101, for obtaining the management objects of the working cluster in the high-availability system, where a management object is a node of the working cluster;
a monitoring module 102, for monitoring the running state of the management objects according to a preset monitoring policy;
a node management module 103, for notifying a failed node to go offline when a node among the management objects can no longer process services because of a fault.
In the fault handling cluster provided by the invention, fault diagnosis is performed on the cluster nodes; when the heartbeat of a node in the cluster is lost, a power-off operation is performed on the faulty node, ensuring that the faulty node is completely shut down and preventing split-brain in the high-availability cluster.
The fault handling cluster provided by the invention is explained below.
After a faulty node is detected in the working cluster, the services that the faulty node was processing still need to be executed, so the fault handling cluster must select a node from the working cluster to work in place of the failed node.
To ensure that the fault handling cluster can select a suitable node as quickly as possible, the number of nodes in the fault handling cluster is 2N+1, where one node is the master node, the remaining nodes are slave nodes, and N is a positive integer:
the master node notifies the slave nodes to select, from the working cluster, a node to continue the work in place of the failed node;
after receiving the notification, each slave node selects from the working cluster a node to continue the work of the failed node, and sends its selection result to the master node;
according to the selection results of the slave nodes and its own, the master node selects the node that continues the work in place of the failed node.
The master node may select the working-cluster node that received the most votes as the node that continues the work of the failed node.
Each node in the working cluster carries a baseboard management controller (BMC) chip and a management network, and the power-off operation of a node is implemented through the Intelligent Platform Management Interface (IPMI) protocol. The vast majority of current server motherboards carry a BMC chip and a BMC network port. The BMC chip works independently of the server's processor, BIOS and operating system; it is an extremely autonomous, agentless management subsystem running separately within the server, and it starts working as soon as the server has power. This autonomy of the BMC overcomes the limitations of operating-system-based management: operations such as power-on/power-off and information retrieval can still be performed when the operating system is unresponsive or not loaded. Because the management network is networked independently, its stability and reachability are generally better than those of the service network, so IPMI operations on a remote server are reliable under non-extreme conditions. This ensures that a node is completely shut down when a failure occurs, preventing split-brain.
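For illustration only, such an IPMI power-off can be issued with the standard ipmitool client from any machine that can reach the management network; the BMC address and credentials below are placeholders:

    # Check and then force off the chassis power of a faulty node via its BMC.
    # -I lanplus selects IPMI over LAN; -H/-U/-P give the BMC address and
    # credentials of the faulty node (placeholder values).
    ipmitool -I lanplus -H 192.168.100.21 -U admin -P secret chassis power status
    ipmitool -I lanplus -H 192.168.100.21 -U admin -P secret chassis power off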
Specifically, the node management module includes:
an acquiring unit, for obtaining the IP address of the BMC on the failed node;
a transmitting unit, for sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
Optionally, each node further includes:
an alarm module, for outputting fault description information of the failed node.
For example, an audible alarm prompt may be emitted, or alarm prompt information may be output on a display screen. The fault description information may describe the identification of the node and the anomaly that occurred on it.
Of course, the monitoring policy can be updated dynamically: users can define their own policies and modify them dynamically according to the needs of the business. Each node therefore further includes:
a policy management module, for updating the monitoring policy according to a received monitoring policy update request and sending the updated monitoring policy to the monitoring module.
In practical applications, node state monitoring scripts and node control scripts with rigorous logic make it possible to automatically shut down and isolate a faulty node after a node of the HA cluster fails, compensating for the problem that native HA software cannot guarantee the isolation of faulty nodes and therefore easily produces split-brain. By continuously accumulating monitoring methods, a comprehensive fault judgment over multiple conditions can be reached: performance information such as CPU utilization and memory utilization, whether the Java processes are normal, whether the application processes on a node are normal, and similar conditions can all be added to the watch list. A forced power-off operation is then performed on the faulty node through the IPMI protocol, avoiding the split-brain that often occurs in HA schemes and achieving flexible customization and highly reliable operation of the HA scheme.
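As a minimal sketch of such a multi-condition judgment (the thresholds and process name below are illustrative assumptions, not values given in the patent), a monitoring script might combine several probes:

    # Illustrative composite health check: probe CPU idle time, available
    # memory and a java process; any failed condition marks the node unhealthy.
    cpu_idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')        # % CPU idle, 2nd sample
    mem_avail=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)  # available memory in kB
    if [ "$cpu_idle" -lt 5 ] || [ "$mem_avail" -lt 102400 ] || ! pgrep -x java >/dev/null; then
        echo "node unhealthy"
        exit 1
    fi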
The invention is illustrated below with a concrete application example.
The invention uses pacemaker as the controller for fault detection and recovery. Pacemaker can manage a cluster by itself: pacemaker OCF resources are created on the cluster, and pacemaker itself guarantees the election, migration and high availability of the OCF resources within this cluster. The OpenStack monitor is therefore implemented as an OCF script and is itself highly available.
Pacemaker resources mainly fall into two classes, LSB (Linux Standard Base) and OCF (Open Cluster Framework). LSB resources are generally the scripts under the /etc/init.d directory, which pacemaker can use to start and stop services. OCF resources extend LSB services with high-availability functions such as failure monitoring and with richer meta-information, and serve here as the concrete means of fault detection. By implementing an OCF resource, pacemaker can provide a very good high-availability guarantee for a service.
The anti-split-brain implementation of the IPMI-based high-availability cluster provided by the invention is as follows.
Fig. 2 is a structural diagram of the IPMI-based high-availability cluster system provided by an application example of the invention. The system is deployed as follows:
select at least 3 nodes of the high-availability system and install the pacemaker cluster software and ipmitool on them, where ipmitool is used for remote node management; 3 nodes are chosen so that the voting during pacemaker resource election can produce a majority;
the nodes on which the pacemaker software has been installed authenticate one another and are configured into a whole, completing the creation of the pacemaker cluster;
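One common way to perform these two steps, assuming the pcs front end that usually accompanies pacemaker (node names, cluster name and password are placeholders):

    # Authenticate the three nodes to one another, assemble them into a single
    # pacemaker cluster and start it (pcs 0.9-style syntax).
    pcs cluster auth node1 node2 node3 -u hacluster -p <password>
    pcs cluster setup --name ha_cluster node1 node2 node3
    pcs cluster start --all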
create the OCF script nodeMonitor for automatically monitoring node state, and upload it to the /usr/lib/ocf/resource.d/openstack/ directory on each node of the pacemaker cluster, where the OCF script can start and stop a node, check its state, monitor resources, and so on; writing the node state monitoring script in a form that conforms to the OCF script specification makes flexible use of pacemaker's own resource high availability, timed scheduling and resource management, as well as of the large number of existing Linux OCF scripts, so that the monitoring program itself is flexible and highly available;
of course, a variety of monitoring conditions can be implemented in the custom monitor method inside the OCF script and customized flexibly to the business, so that once the configured conditions are met, evacuation of the virtual machines is triggered. The methods include, but are not limited to: checking the state of the service network card on the node; checking the state of specified application processes on the node; checking the node's performance data, such as CPU utilization and memory utilization; and checking the state of the storage space on the node.
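The patent names nodeMonitor but does not publish its contents; the skeleton below only sketches the general shape of an OCF-compliant agent with a custom monitor method (the probes, network interface and process name are illustrative assumptions):

    #!/bin/sh
    # nodeMonitor - illustrative OCF agent skeleton; the real probes would go
    # in monitor(). Sources the standard OCF helper functions and return codes.
    . ${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat/ocf-shellfuncs

    monitor() {
        # Example probes: service NIC is up, application process is alive.
        ip link show eth1 | grep -q "state UP" || return $OCF_ERR_GENERIC
        pgrep -f nova-compute >/dev/null       || return $OCF_ERR_GENERIC
        return $OCF_SUCCESS
    }

    case "$1" in
        start|monitor) monitor ;;
        stop)          exit $OCF_SUCCESS ;;
        meta-data)     echo '<resource-agent name="nodeMonitor"/>' ;;  # stub; a real agent emits full XML
        *)             exit $OCF_ERR_UNIMPLEMENTED ;;
    esac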
create the script nodeController, which performs the actual node power-off operation, and upload it to a directory in which the pacemaker user has permission to operate, such as /usr/lib/myScript/; this script takes the node's BMC IP address, account and password as input;
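The nodeController script itself is likewise not disclosed; a sketch matching the described interface (BMC IP address, account and password as inputs) could simply wrap ipmitool:

    #!/bin/sh
    # nodeController - illustrative sketch: force the faulty node off through
    # its BMC, then report the chassis power state for verification.
    BMC_IP="$1"; BMC_USER="$2"; BMC_PASS="$3"

    ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" chassis power off
    ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" chassis power status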
create a pacemaker resource from the OCF script. A pacemaker resource is equivalent to a service instance whose execution and state are guaranteed and monitored by the pacemaker cluster; each resource may itself be started on any node of the pacemaker cluster after an election, and the pacemaker framework performs the corresponding actions according to the logic defined inside the resource, for example the execution interval and timeout of the monitor operation defined in the OCF meta tags.
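Assuming the agent was installed under the openstack provider directory as described above, such a resource could be created with pcs as follows (the interval and timeout values are illustrative):

    # Register nodeMonitor as an OCF resource; pacemaker will then run its
    # monitor action every 30 s with a 60 s timeout, per the op line.
    pcs resource create nodeMonitor ocf:openstack:nodeMonitor \
        op monitor interval=30s timeout=60s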
Fig. 3 is a flow chart of the fault handling method under the system of Fig. 2. The method shown in Fig. 3 includes:
after the system starts, pacemaker executes the monitor method of the script at a set time interval; this method can be customized by business personnel and decides whether a node has failed, for example by pinging an IP address, accessing a particular business interface, making an SSH connection, or testing a database connection;
when the nodeMonitor script judges that a node has failed, ipmitool is called to control the power of the node remotely over the existing BMC management network and power off the remote faulty node, with no need for additional equipment; at the same time, the faulty node is marked invalid through pacemaker's attrd_updater interface, triggering the pacemaker master node election process.
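A hedged sketch of the two actions in this step, reusing the nodeController sketch from above and pacemaker's attrd_updater (the attribute name node_valid and the node name are illustrative assumptions):

    # 1) Fence the faulty node through its BMC.
    /usr/lib/myScript/nodeController "$BMC_IP" "$BMC_USER" "$BMC_PASS"
    # 2) Mark the node invalid in pacemaker's attribute store so the cluster
    #    reacts to the change, e.g. by re-electing where resources run.
    attrd_updater --name node_valid --update 0 --node node2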
As can be seen from the above, the application example of the invention provides an IPMI-based anti-split-brain scheme for high-availability clusters. By using the open-source pacemaker as the fault detection center of the HA cluster, using OCF scripts as the monitoring method, probing node faults by multiple means, and isolating faulty nodes through the BMC management interface provided by the server motherboard itself, faulty nodes can be separated from the HA cluster quickly and reliably, the integrity of the data of running distributed services is well protected, and split-brain is prevented. This makes up for the incomplete isolation of faulty nodes in current HA solutions, monitors the running of the business more completely, and greatly protects the integrity of the business data.
Fig. 4 is a flow chart of the fault handling method in the high-availability system provided by the invention. The method shown in Fig. 4 includes:
step 401: obtain the management objects corresponding to each node of the working cluster in the high-availability system, where a management object is a node of the working cluster;
step 402: monitor the running state of the management objects according to a preset monitoring policy;
step 403: when a node among the management objects can no longer process services because of a fault, notify the failed node to go offline.
After notifying the failed node to go offline, the method further includes:
notifying each node to select, from the working cluster, a node to continue the work in place of the failed node;
receiving the selection result sent by each node;
determining, according to the selection results, the node that continues the work in place of the failed node, where the fault handling cluster has 2N+1 nodes, one node is the master node, the remaining nodes are slave nodes, and N is a positive integer.
Notifying the failed node to go offline includes:
obtaining the IP address of the baseboard management controller (BMC) on the failed node;
sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
After notifying the failed node to go offline, the method further includes:
outputting fault description information of the failed node.
Optionally, the method further includes:
after receiving a monitoring policy update request, updating the monitoring policy according to the update request and sending the updated monitoring policy to the monitoring module.
In the fault handling method provided by the invention, fault diagnosis is performed on the cluster nodes; when the heartbeat of a node in the cluster is lost, a power-off operation is performed on the faulty node, ensuring that the faulty node is completely shut down and preventing split-brain in the high-availability cluster.
Although the embodiments are disclosed above, the content described is only an embodiment adopted to facilitate understanding of the invention and is not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention shall still be subject to the scope defined by the appended claims.

Claims (10)

1. A fault handling cluster in a high-availability system, characterized in that each node in the fault handling cluster comprises:
an acquisition module, for obtaining the management objects of the working cluster in the high-availability system, wherein a management object is a node of the working cluster;
a monitoring module, for monitoring the running state of the management objects according to a preset monitoring policy;
a node management module, for notifying a failed node to go offline when a node among the management objects can no longer process services because of a fault.
2. The fault handling cluster according to claim 1, characterized in that the fault handling cluster has 2N+1 nodes, wherein one node is a master node, the remaining nodes are slave nodes, and N is a positive integer; wherein:
the master node further comprises:
a sending module, for notifying the slave nodes to select, from the working cluster, a node to continue the work in place of the failed node;
a determining module, for determining, according to the selection results of the slave nodes and the master node, the node that continues the work in place of the failed node;
wherein each node comprises:
an election module, for selecting, from the working cluster, the node that continues the work in place of the failed node, and sending the selection result to the master node.
3. The fault handling cluster according to claim 1, characterized in that the node management module comprises:
an acquiring unit, for obtaining the IP address of the baseboard management controller (BMC) on the failed node;
a transmitting unit, for sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
4. The fault handling cluster according to any one of claims 1 to 3, characterized in that each node further comprises:
an alarm module, for outputting fault description information of the failed node.
5. The fault handling cluster according to any one of claims 1 to 3, characterized in that each node further comprises:
a policy management module, for updating the monitoring policy according to a received monitoring policy update request and sending the updated monitoring policy to the monitoring module.
6. A fault handling method in a high-availability system, characterized by comprising:
a fault handling cluster obtaining the management objects corresponding to each node of the working cluster in the high-availability system, wherein a management object is a node of the working cluster;
monitoring the running state of the management objects according to a preset monitoring policy;
when a node among the management objects can no longer process services because of a fault, notifying the failed node to go offline.
7. The method according to claim 6, characterized in that after notifying the failed node to go offline, the method further comprises:
notifying each node to select, from the working cluster, a node to continue the work in place of the failed node;
receiving the selection result sent by each node;
determining, according to the selection results, the node that continues the work in place of the failed node, wherein the fault handling cluster has 2N+1 nodes, one node is a master node, the remaining nodes are slave nodes, and N is a positive integer.
8. The method according to claim 6, characterized in that notifying the failed node to go offline comprises:
obtaining the IP address of the baseboard management controller (BMC) on the failed node;
sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
9. The method according to any one of claims 6 to 8, characterized in that after notifying the failed node to go offline, the method further comprises:
outputting fault description information of the failed node.
10. The method according to any one of claims 6 to 8, characterized in that the method further comprises:
after receiving a monitoring policy update request, updating the monitoring policy according to the update request and sending the updated monitoring policy to the monitoring module.
CN201710589299.2A 2017-07-19 2017-07-19 Fault handling method and fault handling cluster in a high-availability system Pending CN107147540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710589299.2A CN107147540A (en) 2017-07-19 2017-07-19 Fault handling method and fault handling cluster in a high-availability system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710589299.2A CN107147540A (en) 2017-07-19 2017-07-19 Fault handling method and fault handling cluster in a high-availability system

Publications (1)

Publication Number Publication Date
CN107147540A true CN107147540A (en) 2017-09-08

Family

ID=59776469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710589299.2A Pending CN107147540A (en) 2017-07-19 Fault handling method and fault handling cluster in a high-availability system

Country Status (1)

Country Link
CN (1) CN107147540A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541802A (en) * 2010-12-17 2012-07-04 伊姆西公司 Methods and equipment for identifying object based on quorum resource quantity of object
CN103457771A (en) * 2013-08-30 2013-12-18 杭州华三通信技术有限公司 Method and device for HA virtual machine cluster management
CN103905247A (en) * 2014-03-10 2014-07-02 北京交通大学 Two-unit standby method and system based on multi-client judgment
CN104077199A (en) * 2014-06-06 2014-10-01 中标软件有限公司 Shared disk based high availability cluster isolation method and system
CN106656624A (en) * 2017-01-04 2017-05-10 合肥康捷信息科技有限公司 Optimization method based on Gossip communication protocol and Raft election algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张育军, 黄红元: 《上海证券交易所联合研究报告 2011 证券信息前沿技术专集》 [Shanghai Stock Exchange Joint Research Report 2011: Collected Frontier Technologies for Securities Information], 31 December 2012
黄红元: 《上海证券交易所联合研究报告 2013 证券信息前沿技术专集》 [Shanghai Stock Exchange Joint Research Report 2013: Collected Frontier Technologies for Securities Information], 31 December 2014

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107741966A (en) * 2017-09-30 2018-02-27 郑州云海信息技术有限公司 A kind of node administration method and device
CN107612787A (en) * 2017-11-06 2018-01-19 南京易捷思达软件科技有限公司 A kind of cloud hostdown detection method for cloud platform of being increased income based on Openstack
CN107612787B (en) * 2017-11-06 2021-01-12 南京易捷思达软件科技有限公司 Cloud host fault detection method based on Openstack open source cloud platform
CN108011880A (en) * 2017-12-04 2018-05-08 郑州云海信息技术有限公司 The management method and computer-readable recording medium monitored in cloud data system
CN108089911A (en) * 2017-12-14 2018-05-29 郑州云海信息技术有限公司 The control method and device of calculate node in OpenStack environment
CN108449200A (en) * 2018-02-02 2018-08-24 云宏信息科技股份有限公司 A kind of mask information wiring method and device based on control node
CN109286529A (en) * 2018-10-31 2019-01-29 武汉烽火信息集成技术有限公司 A kind of method and system for restoring RabbitMQ network partition
CN109286529B (en) * 2018-10-31 2021-08-10 武汉烽火信息集成技术有限公司 Method and system for recovering RabbitMQ network partition
CN109981204A (en) * 2019-02-21 2019-07-05 福建星云电子股份有限公司 A kind of Multi-Machine Synchronous method of BMS analogue system
CN109981782A (en) * 2019-03-28 2019-07-05 山东浪潮云信息技术有限公司 Remote storage abnormality eliminating method and system for cluster fissure
CN109981782B (en) * 2019-03-28 2022-03-22 浪潮云信息技术股份公司 Remote storage exception handling method and system for cluster split brain
CN110377487A (en) * 2019-07-11 2019-10-25 无锡华云数据技术服务有限公司 A kind of method and device handling high-availability cluster fissure
CN112291288B (en) * 2019-07-24 2022-10-04 北京金山云网络技术有限公司 Container cluster expansion method and device, electronic equipment and readable storage medium
CN112291288A (en) * 2019-07-24 2021-01-29 北京金山云网络技术有限公司 Container cluster expansion method and device, electronic equipment and readable storage medium
CN111124765A (en) * 2019-12-06 2020-05-08 中盈优创资讯科技有限公司 Big data cluster task scheduling method and system based on node labels
CN111371599A (en) * 2020-02-26 2020-07-03 山东汇贸电子口岸有限公司 Cluster disaster recovery management system based on ETCD
CN113760610A (en) * 2020-06-01 2021-12-07 富泰华工业(深圳)有限公司 OpenStack-based bare computer high-availability realization method and device and electronic equipment
CN111475386A (en) * 2020-06-05 2020-07-31 中国银行股份有限公司 Fault early warning method and related device
CN111475386B (en) * 2020-06-05 2024-01-23 中国银行股份有限公司 Fault early warning method and related device
CN112838965A (en) * 2021-02-19 2021-05-25 浪潮云信息技术股份公司 Method for identifying and recovering strong synchronization role fault
CN113162797A (en) * 2021-03-03 2021-07-23 山东英信计算机技术有限公司 Method, system and medium for switching master node fault of distributed cluster
CN113162797B (en) * 2021-03-03 2023-03-21 山东英信计算机技术有限公司 Method, system and medium for switching master node fault of distributed cluster
CN114500327A (en) * 2022-04-13 2022-05-13 统信软件技术有限公司 Detection method and detection device for server cluster and computing equipment

Similar Documents

Publication Publication Date Title
CN107147540A (en) Fault handling method and fault handling cluster in a high-availability system
CN103607297B (en) Fault processing method of computer cluster system
CN107239383A (en) A kind of failure monitoring method and device of OpenStack virtual machines
US20140372805A1 (en) Self-healing managed customer premises equipment
CN105323113A (en) A visualization technology-based system fault emergency handling system and a system fault emergency handling method
CN112181660A (en) High-availability method based on server cluster
CN101996106A (en) Method for monitoring software running state
CN104579791A (en) Method for achieving automatic K-DB main and standby disaster recovery cluster switching
US20080313319A1 (en) System and method for providing multi-protocol access to remote computers
CN110134518A (en) A kind of method and system improving big data cluster multinode high application availability
CN106330523A (en) Cluster server disaster recovery system and method, and server node
CN103490919A (en) Fault management system and fault management method
CN114090184B (en) Method and equipment for realizing high availability of virtualization cluster
CN107947998A (en) A kind of real-time monitoring system based on application system
CN110138611A (en) Automate O&M method and system
JP2013130901A (en) Monitoring server and network device recovery system using the same
CN105849702A (en) Cluster system, server device, cluster system management method, and computer-readable recording medium
CN107071189B (en) Connection method of communication equipment physical interface
CN101854263B (en) Method, system and management server for analysis processing of network topology
CN101442437A (en) Method, system and equipment for implementing high availability
CN110677288A (en) Edge computing system and method generally used for multi-scene deployment
CN110399254A (en) A kind of server CMC dual-locomotive heat activating method, system, terminal and storage medium
CN104346233A (en) Fault recovery method and device for computer system
CN107133130A (en) Computer operational monitoring method and apparatus
TWI698741B (en) Method for remotely clearing abnormal status of racks applied in data center

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170908)