CN107147540A - Fault handling method and troubleshooting cluster in highly available system - Google Patents
- Publication number
- CN107147540A CN107147540A CN201710589299.2A CN201710589299A CN107147540A CN 107147540 A CN107147540 A CN 107147540A CN 201710589299 A CN201710589299 A CN 201710589299A CN 107147540 A CN107147540 A CN 107147540A
- Authority
- CN
- China
- Prior art keywords
- node
- cluster
- failure
- troubleshooting
- management object
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0663—Performing the actions predefined by failover planning, e.g. switching to standby network elements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0823—Errors, e.g. transmission errors
Abstract
The invention discloses a fault handling method in a highly available system and a fault handling cluster. Each node in the fault handling cluster includes: an acquisition module for obtaining the management objects of the working cluster in the highly available system, where a management object is a node of the working cluster; a monitoring module for monitoring the running status of the management objects according to a preset monitoring policy; and a node management module for notifying a failed node to go offline when a management object can no longer process services because of a failure.
Description
Technical field
The present invention relates to the field of communication technology, and in particular to a fault handling method and a fault handling cluster in a highly available system.
Background technology
A high availability (HA) cluster is an effective solution for ensuring business continuity. It typically has two or more nodes, divided into active nodes and standby nodes. A node currently executing services is called an active node, and a node serving as its backup is called a standby node. When an active node fails and the running services (tasks) can no longer execute normally, a standby node detects the failure and immediately takes over the services, so that the services are either not interrupted at all or interrupted only briefly.
However, in an HA system, when the connection between two nodes is broken, a system that originally acted as a coordinated whole splits into two independent parts. Having lost contact, each side assumes the other has failed. The HA software on the two nodes then competes for shared resources and application services, a condition known as "split brain", with serious consequences: for example, shared resources are carved up so that neither side can bring the service up; or both nodes bring the service up and read and write shared storage concurrently, corrupting data, such as errors in the HDFS file system metadata.
Therefore, in an HA system, when the connection between two nodes is broken, how to manage the nodes in the cluster so that services keep running normally is an urgent problem to be solved.
Summary of the invention
To solve the above technical problem, the present invention provides a fault handling method and a fault handling cluster in a highly available system, which can prevent split brain in a high-availability cluster.
To achieve the object of the invention, the present invention provides a fault handling cluster in a highly available system. Each node in the fault handling cluster includes:
an acquisition module for obtaining the management objects of the working cluster in the highly available system, where a management object is a node of the working cluster;
a monitoring module for monitoring the running status of the management objects according to a preset monitoring policy; and
a node management module for notifying a failed node to go offline when a management object can no longer process services because of a failure.
The fault handling cluster has 2N+1 nodes, one of which is the master node and the rest of which are slave nodes, N being a positive integer. The master node includes:
a sending module for notifying the slave nodes to select, from the working cluster, a node to continue working in place of the failed node; and
a determining module for determining, from the selections of the slave nodes and the master node, the node that continues working in place of the failed node.
Each node includes:
an election module for selecting, from the working cluster, a node to continue working in place of the failed node, and sending the selection result to the master node.
The node management module includes:
an acquiring unit for obtaining the IP address of the baseboard management controller (BMC) on the failed node; and
a transmitting unit for sending a power-off instruction to the BMC of the failed node according to that IP address.
Each node further includes an alarm module for outputting failure-description information of the failed node.
Each node further includes a policy management module which, after receiving a request to update the monitoring policy, updates the monitoring policy according to the request and sends the updated policy to the monitoring module.
A fault handling method in a highly available system includes:
a fault handling cluster obtains the management objects corresponding to the nodes of the working cluster in the highly available system, where a management object is a node of the working cluster;
the running status of the management objects is monitored according to a preset monitoring policy; and
when a management object can no longer process services because of a failure, the failed node is notified to go offline.
After the failed node is notified to go offline, the method further includes:
notifying each node to select, from the working cluster, a node to continue working in place of the failed node;
receiving the selection result sent by each node; and
determining, from the selection results, the node that continues working in place of the failed node, where the fault handling cluster has 2N+1 nodes, one of which is the master node and the rest of which are slave nodes, N being a positive integer.
Notifying the failed node to go offline includes:
obtaining the IP address of the baseboard management controller (BMC) on the failed node; and
sending a power-off instruction to the BMC of the failed node according to that IP address.
After the failed node is notified to go offline, the method further includes:
outputting failure-description information of the failed node.
The method further includes:
after receiving a request to update the monitoring policy, updating the monitoring policy according to the request and sending the updated policy to the monitoring module.
In the embodiments provided by the present invention, fault diagnosis is performed on the cluster nodes; when the heartbeat of a node in the cluster is lost, a power-off operation is performed on the failed node, ensuring that it is completely shut down and preventing split brain in the high-availability cluster.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the technical solution of the present invention and constitute a part of the specification. Together with the embodiments of the application they explain the technical solution; they do not limit it.
Fig. 1 is a structural diagram of a node in the fault handling cluster in the highly available system provided by the present invention;
Fig. 2 is a structural diagram of the IPMI-based high-availability cluster system provided by an application example of the present invention;
Fig. 3 is a flow chart of fault handling in the system of Fig. 2;
Fig. 4 is a flow chart of the fault handling method in the highly available system provided by the present invention.
Embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, where there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another.
The steps illustrated in the flow charts of the drawings may be executed in a computer system, such as a set of computer-executable instructions. Although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.
Fig. 1 is a structural diagram of a node in the fault handling cluster in the highly available system provided by the present invention. Each node in the fault handling cluster shown in Fig. 1 includes:
an acquisition module 101 for obtaining the management objects of the working cluster in the highly available system, where a management object is a node of the working cluster;
a monitoring module 102 for monitoring the running status of the management objects according to a preset monitoring policy; and
a node management module 103 for notifying a failed node to go offline when a management object can no longer process services because of a failure.
In the fault handling cluster provided by the present invention, fault diagnosis is performed on the cluster nodes; when the heartbeat of a node in the cluster is lost, a power-off operation is performed on the failed node, ensuring that it is completely shut down and preventing split brain in the high-availability cluster.
The fault handling cluster provided by the present invention is described below.
After a failed node is detected in the working cluster, the services the failed node was processing still need to be executed, so the fault handling cluster must select a node from the working cluster to work in place of the failed node.
To ensure that the fault handling cluster can select a suitable node as early as possible, the number of nodes in the fault handling cluster is 2N+1, one of which is the master node and the rest of which are slave nodes, N being a positive integer.
The master node notifies the slave nodes to select, from the working cluster, a node to continue working in place of the failed node.
After receiving the notification, each slave node selects from the working cluster a node to continue working in place of the failed node, and sends the result to the master node.
The master node then determines the replacement according to the selections of the slave nodes and the master node; for example, it may choose the working-cluster node selected most often as the node to continue working in place of the failed node.
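The master node's tally step can be sketched as follows. This is an illustrative sketch only: the node names and the one-vote-per-line input format are assumptions, not specified by the patent.

```shell
#!/bin/sh
# Tally replacement-node votes (one vote per line on stdin) and print the
# node chosen most often -- a minimal stand-in for the "selected most often"
# rule described above.
pick_replacement() {
    sort | uniq -c | sort -rn | awk 'NR==1 {print $2}'
}

# Example: two votes for node-b, one for node-c.
printf 'node-b\nnode-b\nnode-c\n' | pick_replacement
```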
Each node in the working cluster carries a baseboard management controller (Baseboard Management Controller, BMC) chip and a management network, and the power-off operation is implemented through the Intelligent Platform Management Interface (IPMI) protocol. The vast majority of current server motherboards carry a BMC chip and a BMC network interface. The BMC chip works independently of the server's processor, BIOS and operating system; it is a highly autonomous, agentless management subsystem running on its own inside the server, and starts working as soon as the server has power. This autonomy overcomes the limitations of operating-system-based management: operations such as power-on/off and information retrieval remain possible even when the operating system is unresponsive or not loaded. Because the management network is cabled independently, it is generally more stable and reachable than the service network, so IPMI operations on a remote server are reliable under all but extreme conditions. This ensures that a node is completely shut down when a failure occurs, preventing split brain.
Specifically, the node management module includes:
an acquiring unit for obtaining the IP address of the BMC on the failed node; and
a transmitting unit for sending a power-off instruction to the BMC of the failed node according to that IP address.
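As a hedged illustration of the power-off instruction the transmitting unit sends, an ipmitool invocation over the BMC management network might look like the command built below. The IP address, account and password are placeholders, not values from the patent, and the command is printed rather than executed.

```shell
#!/bin/sh
# Build (but do not run) the IPMI power-off command for a failed node's BMC.
# All three arguments are placeholders supplied by the caller.
build_poweroff_cmd() {
    bmc_ip=$1; bmc_user=$2; bmc_pass=$3
    echo "ipmitool -I lanplus -H $bmc_ip -U $bmc_user -P $bmc_pass chassis power off"
}

build_poweroff_cmd 192.0.2.10 admin secret
```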
Optionally, each node further includes an alarm module for outputting failure-description information of the failed node, for example by sounding an alarm or displaying an alarm message on the screen. The failure-description information may describe the identity of the node and the exception that occurred.
Of course, the monitoring policy can be updated dynamically: users can customize it and modify it dynamically according to the needs of the business. Each node therefore also includes a policy management module which, after receiving a request to update the monitoring policy, updates the policy according to the request and sends the updated policy to the monitoring module.
In practice, node-state monitoring scripts and node-control scripts with strict, well-reasoned logic allow a failed node of the HA cluster to be shut down and isolated automatically, compensating for the fact that native HA software cannot guarantee failed-node isolation and is prone to split brain. By continuously accumulating monitoring methods, faults can be judged comprehensively from multiple conditions: performance information such as CPU utilization and memory utilization, whether the Java processes are normal, whether the application processes on the node are normal, and so on, can all be added to the watch list. The failed node is then forcibly powered off through the IPMI protocol, avoiding the split brain that often occurs in HA schemes and achieving flexible, customizable and highly reliable HA operation.
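A multi-condition fault judgment of the kind just described can be sketched as follows; the thresholds and the choice of CPU and memory as inputs are illustrative assumptions, not values from the patent.

```shell
#!/bin/sh
# Illustrative multi-condition check: flag a node as faulty if CPU or memory
# utilization (percentages) exceeds a threshold. Thresholds are assumptions.
node_faulty() {
    cpu=$1; mem=$2
    [ "$cpu" -gt 90 ] || [ "$mem" -gt 95 ]
}

node_faulty 97 40 && echo "faulty"
```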
A concrete application example is described below.
The present invention uses Pacemaker as the controller for fault discovery and recovery. Pacemaker can manage a cluster on its own; Pacemaker OCF resources are created on the cluster, and the election, migration and high availability of the OCF resources within the cluster are guaranteed by Pacemaker itself. OCF scripts are therefore used to monitor OpenStack, and the monitor itself is highly available.
Pacemaker resources come mainly in two classes: LSB (Linux Standards Base) and OCF (Open Cluster Framework). LSB resources are generally the scripts under the /etc/init.d directory, which Pacemaker can use to start and stop services. OCF resources extend LSB services, adding high-availability functions such as failure monitoring and more meta-information, and serve here as the concrete means of fault discovery. By implementing an OCF resource, Pacemaker can provide a good high-availability guarantee for a service.
The IPMI-based anti-split-brain implementation of the high-availability cluster provided by the present invention is as follows.
Fig. 2 is a structural diagram of the IPMI-based high-availability cluster system provided by an application example of the present invention. The system is deployed as follows:
Select at least 3 nodes in the high-availability system, install the Pacemaker cluster software and install ipmitool, where ipmitool is used for remote node management. Three nodes are chosen so that the nodes' votes can produce a majority when the Pacemaker resource is elected.
The nodes on which the Pacemaker software is installed authenticate one another and are configured into a whole, completing the creation of the Pacemaker cluster.
Create an OCF script nodeMonitor that monitors node state automatically, and upload it to the /usr/lib/ocf/resource.d/openstack/ directory of each node in the Pacemaker cluster. The OCF script can start and stop the node, check its state, monitor the resource, and so on. Writing the node-state monitoring script in a form that conforms to the OCF script specification makes flexible use of Pacemaker's own resource high availability, timed scheduling and resource management, as well as the large number of existing Linux OCF scripts, so that the monitoring program itself is flexible and highly available.
Of course, multiple monitoring conditions can be implemented in the custom monitor method inside the OCF script and customized flexibly according to the business, so that virtual machine evacuation is triggered once the conditions are met. The methods include, but are not limited to: checking the state of the service network card on the node; checking the state of specified application processes on the node; checking the node's performance data, such as CPU utilization and memory utilization; and checking the state of the storage space on the node.
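A minimal sketch of such an OCF-style monitor action is shown below. The specific health checks (a service NIC named eth0, a java process) are illustrative assumptions; a real OCF script would exit with these codes rather than return them, and would also implement the meta-data action required by the OCF specification.

```shell
#!/bin/sh
# Minimal OCF-style action dispatcher in the spirit of the nodeMonitor
# script described above. Checks and names are illustrative only.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

node_is_healthy() {
    # Example conditions: service NIC is up and a named process exists.
    ip link show eth0 2>/dev/null | grep -q 'state UP' || return 1
    pgrep -x java >/dev/null || return 1
    return 0
}

ocf_action() {
    case "$1" in
        monitor)    node_is_healthy ;;
        start|stop) return $OCF_SUCCESS ;;
        *)          return $OCF_ERR_GENERIC ;;
    esac
}

ocf_action start && echo "start: OK"
```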
Create a script nodeController that performs the node power-off operation, and upload it to a directory where the Pacemaker user has permission to run it, such as /usr/lib/myScript/. The script takes the node's BMC IP address, account and password as input.
Create a Pacemaker resource using the OCF script. A Pacemaker resource is equivalent to a service instance whose execution and state are guaranteed and monitored by the Pacemaker cluster. Each resource can itself be started through election on any node of the Pacemaker cluster, and the Pacemaker framework performs the corresponding actions according to the logic defined inside the resource, for example the execution interval and timeout of the operations defined in the OCF meta tags.
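One way to register such a script as a Pacemaker resource is sketched below using the crm shell. The resource name, provider path, interval and timeout are assumptions for illustration (the patent does not give the exact command), and the command is printed rather than executed.

```shell
#!/bin/sh
# Print the crm-shell command that would create a Pacemaker primitive from
# the OCF script, with a monitor operation interval and timeout as in the
# "OCF meta" discussion above. All values are illustrative.
ocf_resource_cmd() {
    echo "crm configure primitive $1 ocf:openstack:$1 op monitor interval=10s timeout=30s"
}

ocf_resource_cmd nodeMonitor
```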
Fig. 3 is a flow chart of fault handling in the system of Fig. 2. The method shown in Fig. 3 includes:
After the system starts, Pacemaker executes the monitor method of the script at the configured interval. This method can be customized by business personnel to decide whether a node has failed, for example by pinging an IP address, accessing a business interface, making an SSH connection, or testing a database connection.
When the nodeMonitor script judges that a node has failed, it calls ipmitool to control the node's power remotely over the existing BMC management network and powers off the remote failed node, without any additional equipment. At the same time it marks the failed node as invalid through the Pacemaker attrd_updater interface, triggering Pacemaker's master-node election flow.
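The final step, marking the failed node invalid so that Pacemaker re-elects, might use attrd_updater roughly as follows. The attribute name and node name are illustrative assumptions, and the command is printed rather than executed.

```shell
#!/bin/sh
# Print the attrd_updater command that would set a node attribute marking
# the failed node invalid, triggering Pacemaker's election flow.
mark_node_invalid() {
    echo "attrd_updater -n node_valid -U false -N $1"
}

mark_node_invalid node-b
```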
As can be seen from the above, the application example of the present invention provides an IPMI-based anti-split-brain scheme for high-availability clusters. Using the open-source Pacemaker as the fault-discovery center of the HA cluster, OCF scripts as the monitoring method, multiple means of probing node faults, and the BMC management interface provided by the server motherboard itself to isolate failed nodes, a failed node can be separated from the HA cluster quickly and reliably. This fully guarantees the integrity of the data while the distributed service is running, prevents split brain, remedies the incomplete isolation of failed nodes in current HA solutions, monitors the running of the service more completely, and greatly protects the integrity of the business data.
Fig. 4 is a flow chart of the fault handling method in the highly available system provided by the present invention. The method shown in Fig. 4 includes:
Step 401: obtain the management objects corresponding to the nodes of the working cluster in the highly available system, where a management object is a node of the working cluster;
Step 402: monitor the running status of the management objects according to a preset monitoring policy;
Step 403: when a management object can no longer process services because of a failure, notify the failed node to go offline.
After the failed node is notified to go offline, the method further includes:
notifying each node to select, from the working cluster, a node to continue working in place of the failed node;
receiving the selection result sent by each node; and
determining, from the selection results, the node that continues working in place of the failed node, where the fault handling cluster has 2N+1 nodes, one of which is the master node and the rest of which are slave nodes, N being a positive integer.
Notifying the failed node to go offline includes:
obtaining the IP address of the baseboard management controller (BMC) on the failed node; and
sending a power-off instruction to the BMC of the failed node according to that IP address.
After the failed node is notified to go offline, the method further includes:
outputting failure-description information of the failed node.
Optionally, the method further includes:
after receiving a request to update the monitoring policy, updating the monitoring policy according to the request and sending the updated policy to the monitoring module.
In the fault handling method provided by the present invention, fault diagnosis is performed on the cluster nodes; when the heartbeat of a node in the cluster is lost, a power-off operation is performed on the failed node, ensuring that it is completely shut down and preventing split brain in the high-availability cluster.
Although embodiments are disclosed above, the content is only an embodiment adopted for ease of understanding the present invention and is not intended to limit it. Any person skilled in the art to which the present invention belongs may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be subject to the scope defined by the appended claims.
Claims (10)
1. A fault handling cluster in a highly available system, characterized in that each node in the fault handling cluster includes:
an acquisition module for obtaining the management objects of the working cluster in the highly available system, where a management object is a node of the working cluster;
a monitoring module for monitoring the running status of the management objects according to a preset monitoring policy; and
a node management module for notifying a failed node to go offline when a management object can no longer process services because of a failure.
2. The fault handling cluster according to claim 1, characterized in that the fault handling cluster has 2N+1 nodes, one of which is the master node and the rest of which are slave nodes, N being a positive integer; wherein the master node further includes:
a sending module for notifying the slave nodes to select, from the working cluster, a node to continue working in place of the failed node; and
a determining module for determining, from the selections of the slave nodes and the master node, the node that continues working in place of the failed node;
and each node includes:
an election module for selecting, from the working cluster, a node to continue working in place of the failed node, and sending the selection result to the master node.
3. The fault handling cluster according to claim 1, characterized in that the node management module includes:
an acquiring unit for obtaining the IP address of the baseboard management controller (BMC) on the failed node; and
a transmitting unit for sending a power-off instruction to the BMC of the failed node according to that IP address.
4. The fault handling cluster according to any one of claims 1 to 3, characterized in that each node further includes:
an alarm module for outputting failure-description information of the failed node.
5. The fault handling cluster according to any one of claims 1 to 3, characterized in that each node further includes:
a policy management module which, after receiving a request to update the monitoring policy, updates the monitoring policy according to the request and sends the updated policy to the monitoring module.
6. A fault handling method in a highly available system, characterized by including:
a fault handling cluster obtains the management objects corresponding to the nodes of the working cluster in the highly available system, where a management object is a node of the working cluster;
the running status of the management objects is monitored according to a preset monitoring policy; and
when a management object can no longer process services because of a failure, the failed node is notified to go offline.
7. The method according to claim 6, characterized in that after the failed node is notified to go offline, the method further includes:
notifying each node to select, from the working cluster, a node to continue working in place of the failed node;
receiving the selection result sent by each node; and
determining, from the selection results, the node that continues working in place of the failed node, wherein the fault handling cluster has 2N+1 nodes, one of which is the master node and the rest of which are slave nodes, N being a positive integer.
8. The method according to claim 6, characterized in that notifying the failed node to go offline includes:
obtaining the IP address of the baseboard management controller (BMC) on the failed node; and
sending a power-off instruction to the BMC of the failed node according to that IP address.
9. The method according to any one of claims 6 to 8, characterized in that after the failed node is notified to go offline, the method further includes:
outputting failure-description information of the failed node.
10. The method according to any one of claims 6 to 8, characterized in that the method further includes:
after receiving a request to update the monitoring policy, updating the monitoring policy according to the request and sending the updated policy to the monitoring module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710589299.2A CN107147540A (en) | 2017-07-19 | 2017-07-19 | Fault handling method and troubleshooting cluster in highly available system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107147540A true CN107147540A (en) | 2017-09-08 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102541802A (en) * | 2010-12-17 | 2012-07-04 | 伊姆西公司 | Methods and equipment for identifying object based on quorum resource quantity of object |
CN103457771A (en) * | 2013-08-30 | 2013-12-18 | 杭州华三通信技术有限公司 | Method and device for HA virtual machine cluster management |
CN103905247A (en) * | 2014-03-10 | 2014-07-02 | 北京交通大学 | Two-unit standby method and system based on multi-client judgment |
CN104077199A (en) * | 2014-06-06 | 2014-10-01 | 中标软件有限公司 | Shared disk based high availability cluster isolation method and system |
CN106656624A (en) * | 2017-01-04 | 2017-05-10 | 合肥康捷信息科技有限公司 | Optimization method based on Gossip communication protocol and Raft election algorithm |
-
2017
- 2017-07-19 CN CN201710589299.2A patent/CN107147540A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102541802A (en) * | 2010-12-17 | 2012-07-04 | 伊姆西公司 | Method and device for identifying an object based on its quorum resource quantity |
CN103457771A (en) * | 2013-08-30 | 2013-12-18 | 杭州华三通信技术有限公司 | Method and device for HA virtual machine cluster management |
CN103905247A (en) * | 2014-03-10 | 2014-07-02 | 北京交通大学 | Dual-machine standby method and system based on multi-client arbitration |
CN104077199A (en) * | 2014-06-06 | 2014-10-01 | 中标软件有限公司 | Shared disk based high availability cluster isolation method and system |
CN106656624A (en) * | 2017-01-04 | 2017-05-10 | 合肥康捷信息科技有限公司 | Optimization method based on Gossip communication protocol and Raft election algorithm |
Non-Patent Citations (2)
Title |
---|
Zhang Yujun, Huang Hongyuan: "Shanghai Stock Exchange Joint Research Report 2011: Special Collection on Frontier Technologies in Securities Information [M]", 31 December 2012 *
Huang Hongyuan: "Shanghai Stock Exchange Joint Research Report 2013: Special Collection on Frontier Technologies in Securities Information", 31 December 2014 *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107741966A (en) * | 2017-09-30 | 2018-02-27 | 郑州云海信息技术有限公司 | Node management method and device |
CN107612787A (en) * | 2017-11-06 | 2018-01-19 | 南京易捷思达软件科技有限公司 | Cloud host fault detection method based on Openstack open source cloud platform |
CN107612787B (en) * | 2017-11-06 | 2021-01-12 | 南京易捷思达软件科技有限公司 | Cloud host fault detection method based on Openstack open source cloud platform |
CN108011880A (en) * | 2017-12-04 | 2018-05-08 | 郑州云海信息技术有限公司 | Monitoring management method in a cloud data system and computer-readable storage medium |
CN108089911A (en) * | 2017-12-14 | 2018-05-29 | 郑州云海信息技术有限公司 | Control method and device for compute nodes in an OpenStack environment |
CN108449200A (en) * | 2018-02-02 | 2018-08-24 | 云宏信息科技股份有限公司 | Control-node-based mask information writing method and device |
CN109286529A (en) * | 2018-10-31 | 2019-01-29 | 武汉烽火信息集成技术有限公司 | Method and system for recovering RabbitMQ network partition |
CN109286529B (en) * | 2018-10-31 | 2021-08-10 | 武汉烽火信息集成技术有限公司 | Method and system for recovering RabbitMQ network partition |
CN109981204A (en) * | 2019-02-21 | 2019-07-05 | 福建星云电子股份有限公司 | Multi-machine synchronization method for a BMS simulation system |
CN109981782A (en) * | 2019-03-28 | 2019-07-05 | 山东浪潮云信息技术有限公司 | Remote storage exception handling method and system for cluster split brain |
CN109981782B (en) * | 2019-03-28 | 2022-03-22 | 浪潮云信息技术股份公司 | Remote storage exception handling method and system for cluster split brain |
CN110377487A (en) * | 2019-07-11 | 2019-10-25 | 无锡华云数据技术服务有限公司 | Method and device for handling high-availability cluster split brain |
CN112291288B (en) * | 2019-07-24 | 2022-10-04 | 北京金山云网络技术有限公司 | Container cluster expansion method and device, electronic equipment and readable storage medium |
CN112291288A (en) * | 2019-07-24 | 2021-01-29 | 北京金山云网络技术有限公司 | Container cluster expansion method and device, electronic equipment and readable storage medium |
CN111124765A (en) * | 2019-12-06 | 2020-05-08 | 中盈优创资讯科技有限公司 | Big data cluster task scheduling method and system based on node labels |
CN111371599A (en) * | 2020-02-26 | 2020-07-03 | 山东汇贸电子口岸有限公司 | Cluster disaster recovery management system based on ETCD |
CN113760610A (en) * | 2020-06-01 | 2021-12-07 | 富泰华工业(深圳)有限公司 | OpenStack-based bare-metal high-availability implementation method, device and electronic equipment |
CN111475386A (en) * | 2020-06-05 | 2020-07-31 | 中国银行股份有限公司 | Fault early warning method and related device |
CN111475386B (en) * | 2020-06-05 | 2024-01-23 | 中国银行股份有限公司 | Fault early warning method and related device |
CN112838965A (en) * | 2021-02-19 | 2021-05-25 | 浪潮云信息技术股份公司 | Method for identifying and recovering from a strong-synchronization role fault |
CN113162797A (en) * | 2021-03-03 | 2021-07-23 | 山东英信计算机技术有限公司 | Method, system and medium for switching master node fault of distributed cluster |
CN113162797B (en) * | 2021-03-03 | 2023-03-21 | 山东英信计算机技术有限公司 | Method, system and medium for switching master node fault of distributed cluster |
CN114500327A (en) * | 2022-04-13 | 2022-05-13 | 统信软件技术有限公司 | Detection method and detection device for server cluster and computing equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107147540A (en) | Fault handling method and troubleshooting cluster in highly available system | |
CN103607297B (en) | Fault processing method of computer cluster system | |
CN107239383A (en) | Failure monitoring method and device for OpenStack virtual machines | |
US20140372805A1 (en) | Self-healing managed customer premises equipment | |
CN105323113A (en) | Visualization-based system and method for emergency handling of system faults | |
CN112181660A (en) | High-availability method based on server cluster | |
CN101996106A (en) | Method for monitoring software running state | |
CN104579791A (en) | Method for automatic active/standby switchover of K-DB disaster recovery clusters | |
US20080313319A1 (en) | System and method for providing multi-protocol access to remote computers | |
CN110134518A (en) | Method and system for improving multi-node application high availability in a big data cluster | |
CN106330523A (en) | Cluster server disaster recovery system and method, and server node | |
CN103490919A (en) | Fault management system and fault management method | |
CN114090184B (en) | Method and equipment for realizing high availability of virtualization cluster | |
CN107947998A (en) | Real-time monitoring system based on an application system | |
CN110138611A (en) | Automated O&M method and system | |
JP2013130901A (en) | Monitoring server and network device recovery system using the same | |
CN105849702A (en) | Cluster system, server device, cluster system management method, and computer-readable recording medium | |
CN107071189B (en) | Connection method of communication equipment physical interface | |
CN101854263B (en) | Method, system and management server for analysis processing of network topology | |
CN101442437A (en) | Method, system and equipment for implementing high availability | |
CN110677288A (en) | General-purpose edge computing system and method for multi-scenario deployment | |
CN110399254A (en) | Server CMC dual-machine hot-start method, system, terminal and storage medium | |
CN104346233A (en) | Fault recovery method and device for computer system | |
CN107133130A (en) | Computer operational monitoring method and apparatus | |
TWI698741B (en) | Method for remotely clearing abnormal status of racks applied in data center |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170908 |