CN110740064A - Distributed cluster node fault processing method, device, equipment and storage medium - Google Patents

Distributed cluster node fault processing method, device, equipment and storage medium

Info

Publication number
CN110740064A
Authority
CN
China
Prior art keywords
node
cluster
fault
distributed
respond
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911025111.7A
Other languages
Chinese (zh)
Inventor
张大帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Inspur Data Technology Co Ltd
Original Assignee
Beijing Inspur Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Inspur Data Technology Co Ltd filed Critical Beijing Inspur Data Technology Co Ltd
Priority to CN201911025111.7A priority Critical patent/CN110740064A/en
Publication of CN110740064A publication Critical patent/CN110740064A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a distributed cluster node fault processing method comprising the following steps: sending multicast requests respectively to the agent services pre-deployed on the nodes in a distributed storage cluster; when it is determined that there are nodes that do not respond to the multicast requests, determining those nodes to be failed nodes; and clearing the relevant authentication information of the failed nodes in the distributed storage cluster.

Description

Distributed cluster node fault processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of distributed storage technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for processing a fault in distributed cluster nodes.
Background
A distributed storage cluster system generally includes a plurality of storage servers that together form a cluster system providing services to the outside; the servers are also referred to as "nodes". Each distributed storage cluster generally has a main monitoring node (the master node for short), which monitors the state of the storage cluster.
In summary, how to effectively solve the problem that data reconstruction caused by a node going down degrades the overall performance of the cluster and affects the normal service operation of clients is an urgent problem for those skilled in the art.
Disclosure of Invention
The invention aims to provide a distributed cluster node fault processing method that avoids data reconstruction caused by node downtime, greatly reduces the impact on the overall performance of the cluster, and ensures the normal service operation of clients. The invention also aims to provide a distributed cluster node fault processing apparatus, a device, and a computer-readable storage medium.
To solve the above technical problem, the invention provides the following technical solutions:
A distributed cluster node fault handling method, comprising:
sending multicast requests respectively to the agent services pre-deployed on each node in a distributed storage cluster;
when it is determined that there is a node that does not respond to the multicast request, determining the node that does not respond to the multicast request to be a failed node;
and clearing the relevant authentication information of the failed node in the distributed storage cluster.
In one embodiment of the present invention, after clearing the relevant authentication information of the failed node in the distributed storage cluster, the method further includes:
adding fault identification information after the serial number (SN) corresponding to the failed node.
In one embodiment of the present invention, after adding the fault identification information after the SN corresponding to the failed node, the method further includes:
when a cluster joining request is received, detecting whether the fault identification information exists after the SN of the node to be joined;
and if so, clearing the original cluster service information in the node to be joined, and adding the node, with its original cluster service information cleared, to the distributed storage cluster.
In one embodiment of the present invention, determining the node that does not respond to the multicast request to be a failed node when it is determined that there is such a node includes:
when it is determined that there is a node that has not responded to a preset number of consecutive multicast requests, determining that node to be a failed node.
A distributed cluster node fault handling apparatus, comprising:
a request sending module, configured to send multicast requests respectively to the agent services pre-deployed on each node in the distributed storage cluster;
a failed node determination module, configured to determine a node that does not respond to the multicast request to be a failed node when it is determined that there is a node that does not respond to the multicast request;
and an authentication information clearing module, configured to clear the relevant authentication information of the failed node in the distributed storage cluster.
In one embodiment of the present invention, the apparatus further comprises:
an identification information adding module, configured to add fault identification information after the serial number (SN) corresponding to the failed node once the relevant authentication information of the failed node in the distributed storage cluster has been cleared.
In one embodiment of the present invention, the apparatus further comprises:
an identification information detection module, configured to detect, when a cluster joining request is received after the fault identification information has been added after the SN corresponding to the failed node, whether the fault identification information exists after the SN of the node to be joined;
and a node adding module, configured to, when it is detected that the fault identification information exists after the SN of the node to be joined, clear the original cluster service information in the node to be joined and add the node, with its original cluster service information cleared, to the distributed storage cluster.
In one specific embodiment of the present invention, the failed node determination module is specifically a module that, when it is determined that there is a node that has not responded to a preset number of consecutive multicast requests, determines that node to be a failed node.
A distributed cluster node fault handling device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the distributed cluster node fault handling method described above when executing the computer program.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the distributed cluster node fault handling method described above.
By applying the method provided by the embodiments of the invention, multicast requests are sent respectively to the agent services pre-deployed on each node in the distributed storage cluster; when it is determined that there is a node that does not respond to the multicast request, that node is determined to be a failed node; and the relevant authentication information of the failed node in the distributed storage cluster is cleared. Because an agent service is pre-deployed on each node in the distributed storage cluster, a failed node can be detected promptly from each node's response to the multicast request received by its agent service, and the relevant authentication information of the failed node can be cleared promptly, so that the failed node is removed from the distributed storage cluster in time. This avoids data reconstruction caused by node downtime, greatly reduces the impact on the overall performance of the cluster, and ensures the normal service operation of clients.
Correspondingly, embodiments of the present invention further provide a distributed cluster node fault processing apparatus, a device, and a computer-readable storage medium corresponding to the above method; these have the same technical effects, which are not described again here.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of one implementation of the distributed cluster node fault handling method in an embodiment of the present invention;
Fig. 2 is a flowchart of another implementation of the distributed cluster node fault handling method in an embodiment of the present invention;
Fig. 3 is a block diagram of a distributed cluster node fault handling apparatus according to an embodiment of the present invention;
Fig. 4 is a block diagram of a distributed cluster node fault handling device according to an embodiment of the present invention.
Detailed Description
For a better understanding of the present invention, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. The described embodiments are only some, rather than all, of the embodiments of the present invention.
Embodiment 1:
Referring to Fig. 1, which is a flowchart of one implementation of the distributed cluster node fault handling method in an embodiment of the present invention, the method may include the following steps:
S101: Send multicast requests respectively to the agent services pre-deployed on each node in the distributed storage cluster.
A detection service (master) may be pre-deployed on the master node of the distributed storage cluster, and an agent service (agent) may be pre-deployed on each node of the distributed storage cluster. The detection service may send multicast requests to the agent services in real time or at a preset time interval.
The multicast request may be a handshake request.
It should be noted that, when the detection service sends the multicast request to each agent service at a preset time interval, the interval may be set and adjusted according to the actual situation. The embodiment of the present invention does not limit it; it may be set to 15 s, for example.
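As an illustrative, non-limiting sketch of step S101, the detection service could send its handshake over UDP multicast using Python's standard socket module. The transport, the group address 239.1.1.1, the port 50000, the payload b"HANDSHAKE", and all function names are assumptions made for illustration; the embodiment specifies only that a multicast handshake request is sent, for example every 15 s.

```python
import socket
import struct
import time

# Hypothetical values; the embodiment fixes none of them except the 15 s example.
MCAST_GROUP = "239.1.1.1"
MCAST_PORT = 50000
HANDSHAKE = b"HANDSHAKE"
INTERVAL_S = 15


def make_sender(ttl=1):
    """Create a UDP socket configured to send multicast datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # A TTL of 1 keeps the datagrams on the local network segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
    return sock


def send_handshake(sock, group=MCAST_GROUP, port=MCAST_PORT):
    """Send one handshake request to the agent group; return the bytes sent."""
    return sock.sendto(HANDSHAKE, (group, port))


def run_detection_loop(sock, rounds, interval_s=INTERVAL_S, sleep=time.sleep):
    """Send `rounds` handshake requests, pausing `interval_s` between them."""
    sent = 0
    for i in range(rounds):
        send_handshake(sock)
        sent += 1
        if i < rounds - 1:
            sleep(interval_s)
    return sent
```

The socket and the sleep function are passed in as parameters so that the loop can be exercised without a network, for example by calling run_detection_loop(make_sender(), rounds=3) on a real host or by substituting a stub object that records its sendto calls.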
S102: When it is determined that there is a node that does not respond to the multicast request, determine that node to be a failed node.
After the multicast request is sent to the agent service pre-deployed on each node in the distributed storage cluster, whether each node responds to the multicast request can be detected. For example, when a node returns an "OK" reply through its agent service, the node is normal; when a node does not respond to the multicast request for an extended period, the node has a problem. When it is determined that there is a node that does not respond to the multicast request, that node may be determined to be a failed node.
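The check in step S102 amounts to comparing the set of nodes that were expected to answer with the set that actually did. A minimal sketch follows, assuming the detection service has already collected the replies received before some timeout into a mapping; the function name and the data shapes are hypothetical.

```python
def find_unresponsive(expected_nodes, replies):
    """Return, in sorted order, the nodes that sent no reply in this round.

    expected_nodes: iterable of node names that belong to the cluster.
    replies: mapping of node name -> reply text (for example "OK") received
             before the timeout; nodes that stayed silent are absent.
    """
    return sorted(node for node in expected_nodes if node not in replies)
```

For example, with nodes node1, node2 and node3 and replies from node1 and node3 only, the function returns a list containing node2.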
S103: Clear the relevant authentication information of the failed node in the distributed storage cluster.
After the failed node is determined, because it is down and cannot be communicated with, the cluster services on it, such as the service that monitors the state of the storage cluster (MON) and the Object Storage Device data storage service (OSD), cannot be cleared on the node itself. The relevant authentication information of the failed node's MON, OSD, and other services in the distributed storage cluster is therefore cleared first. For example, if the failed node is named noden, the relevant authentication information can be cleared through the command cluster auth del. Detecting failed nodes automatically also reduces operation and maintenance cost.
By applying the method provided by the embodiments of the invention, multicast requests are sent respectively to the agent services pre-deployed on each node in the distributed storage cluster; when it is determined that there is a node that does not respond to the multicast request, that node is determined to be a failed node; and the relevant authentication information of the failed node in the distributed storage cluster is cleared. Because an agent service is pre-deployed on each node in the distributed storage cluster, a failed node can be detected promptly from each node's response to the multicast request received by its agent service, and the relevant authentication information of the failed node can be cleared promptly, so that the failed node is removed from the distributed storage cluster in time. This avoids data reconstruction caused by node downtime, greatly reduces the impact on the overall performance of the cluster, and ensures the normal service operation of clients.
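The effect of step S103 can be pictured with an in-memory model of the cluster's authentication table. The sketch below does not invoke the command mentioned above or any real cluster tool; the table layout (entity name mapped to owning node), the entity names such as mon.noden and osd.0, and the function name are all hypothetical.

```python
def clear_auth_entries(auth_table, failed_node):
    """Delete every authentication entry owned by failed_node.

    auth_table: dict mapping an entity name (hypothetical, e.g. "osd.0")
                to the name of the node that hosts that service.
    Returns the sorted list of entity names that were removed.
    """
    removed = sorted(entity for entity, owner in auth_table.items() if owner == failed_node)
    for entity in removed:
        del auth_table[entity]
    return removed
```

Because the entries live in the cluster rather than on the failed node, this clean-up does not require the failed node to be reachable.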
It should be noted that, based on the above embodiment, the embodiments of the present invention further provide a corresponding improved scheme. Steps in the following embodiment that are the same as or correspond to those in the above embodiment may be cross-referenced, as may the corresponding beneficial effects, and are not described again in detail in the improved embodiment below.
Referring to Fig. 2, which is a flowchart of another implementation of the distributed cluster node fault handling method in an embodiment of the present invention, the method may include the following steps:
S201: Send multicast requests respectively to the agent services pre-deployed on each node in the distributed storage cluster.
S202: When it is determined that there is a node that has not responded to a preset number of consecutive multicast requests, determine that node to be a failed node.
The number of consecutive unanswered multicast requests required before a node is determined to be a failed node may be preset. When it is determined that there is a node that has not responded to that preset number of consecutive multicast requests, the node is determined to be a failed node. Verifying multiple times avoids misjudgment caused by network jitter and similar transient conditions.
It should be noted that the preset number may be set and adjusted according to the actual situation. The embodiment of the present invention does not limit it; it may be set to 3, for example.
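The consecutive-miss rule of step S202 can be sketched as a per-node counter that is reset whenever the node answers. The class and method names below are hypothetical, and the default threshold of 3 is the example value given above.

```python
from collections import defaultdict


class MissCounter:
    """Track consecutive unanswered multicast requests per node."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.misses = defaultdict(int)

    def record_round(self, expected_nodes, responders):
        """Update the counters for one request round.

        Returns the nodes whose consecutive-miss count reached the
        threshold in this round, in the order of expected_nodes.
        """
        newly_failed = []
        for node in expected_nodes:
            if node in responders:
                self.misses[node] = 0  # any reply resets the streak
            else:
                self.misses[node] += 1
                if self.misses[node] == self.threshold:
                    newly_failed.append(node)
        return newly_failed
```

A node that misses two rounds because of network jitter and then answers starts again from zero, so it is reported as failed only after three further consecutive misses.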
S203: Clear the relevant authentication information of the failed node in the distributed storage cluster.
S204: Add fault identification information after the serial number (SN) corresponding to the failed node.
After the relevant authentication information of the failed node in the distributed storage cluster is cleared, fault identification information, such as a fault/clear identifier, may be added after the SN (that is, the product serial number) corresponding to the failed node. The identifier indicates that the node was removed from the distributed storage cluster because of a fault, and that the storage services and configuration information on the node, such as MON and OSD, have not yet been completely cleared.
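One simple way to picture step S204 is a suffix appended to the recorded serial number string. The embodiment does not fix the format of the fault identification information, so the ":fault" suffix and the helper names below are assumptions made only for illustration.

```python
FAULT_MARK = ":fault"  # hypothetical marker format; not specified by the embodiment


def add_fault_mark(sn):
    """Append the fault marker to a serial number unless it is already present."""
    return sn if sn.endswith(FAULT_MARK) else sn + FAULT_MARK


def has_fault_mark(sn):
    """Report whether a recorded serial number carries the fault marker."""
    return sn.endswith(FAULT_MARK)


def strip_fault_mark(sn):
    """Return the bare serial number without the fault marker."""
    return sn[: -len(FAULT_MARK)] if has_fault_mark(sn) else sn
```

Making add_fault_mark idempotent means a node that is detected as failed more than once still carries a single marker.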
S205: When a cluster joining request is received, detect whether fault identification information exists after the serial number (SN) of the node to be joined; if so, execute step S206; if not, add the node to be joined directly to the distributed storage cluster.
When a new node, or a node whose fault has been repaired, needs to join the distributed storage cluster, it can send a cluster joining request to the detection service. After receiving the cluster joining request, the detection service can detect whether fault identification information exists after the SN of the node to be joined, and thereby determine whether the node is one that is re-applying to join the distributed storage cluster after fault repair. If fault identification information exists after the SN, the node is re-applying to join after fault repair, and step S206 is executed next. If no fault identification information exists after the SN, the node is a new node applying to join, and it can be added to the distributed storage cluster directly.
S206: Clear the original cluster service information in the node to be joined, and add the node, with its original cluster service information cleared, to the distributed storage cluster.
When it is determined that fault identification information exists after the SN of the node to be joined, the original cluster service information in the node can be cleared, and the node, with its original cluster service information cleared, is added to the distributed storage cluster. Clearing the original cluster service information in the node to be joined prevents isolated services left over from before the last fault from affecting client service access after the node joins the distributed storage cluster.
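Steps S205 and S206 can be sketched as a single join handler. The Node type, the ":fault" marker format, the set of recorded serial numbers, and the function name are hypothetical; the sketch only mirrors the branch described above, in which a marked node has its leftover service information cleared before being admitted.

```python
from dataclasses import dataclass, field

FAULT_MARK = ":fault"  # hypothetical marker format


@dataclass
class Node:
    sn: str
    services: list = field(default_factory=list)  # leftover cluster service info


def handle_join(node, sn_records, cluster):
    """Admit a node to the cluster, cleaning it first if its SN is marked.

    sn_records: set of recorded SN strings; failed nodes carry FAULT_MARK.
    cluster: list of admitted nodes, extended in place.
    Returns True if leftover service information was cleared.
    """
    cleaned = False
    if node.sn + FAULT_MARK in sn_records:
        node.services.clear()  # S206: drop services left from before the fault
        cleaned = True
    cluster.append(node)
    return cleaned
```

A node whose SN carries no marker is treated as new and is appended unchanged.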
Corresponding to the above method embodiment, the embodiments of the present invention further provide a distributed cluster node fault handling apparatus; the apparatus described below and the method described above may be cross-referenced.
Referring to Fig. 3, which is a block diagram of a distributed cluster node fault handling apparatus according to an embodiment of the present invention, the apparatus may include:
a request sending module 31, configured to send multicast requests respectively to the agent services pre-deployed on each node in the distributed storage cluster;
a failed node determination module 32, configured to determine a node that does not respond to the multicast request to be a failed node when it is determined that there is a node that does not respond to the multicast request;
and an authentication information clearing module 33, configured to clear the relevant authentication information of the failed node in the distributed storage cluster.
By applying the apparatus provided by the embodiments of the invention, multicast requests are sent respectively to the agent services pre-deployed on each node in the distributed storage cluster; when it is determined that there is a node that does not respond to the multicast request, that node is determined to be a failed node; and the relevant authentication information of the failed node in the distributed storage cluster is cleared. Because an agent service is pre-deployed on each node in the distributed storage cluster, a failed node can be detected promptly from each node's response to the multicast request received by its agent service, and the relevant authentication information of the failed node can be cleared promptly, so that the failed node is removed from the distributed storage cluster in time. This avoids data reconstruction caused by node downtime, greatly reduces the impact on the overall performance of the cluster, and ensures the normal service operation of clients.
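The division of the apparatus into modules 31, 32 and 33 can be mirrored by a small class that composes three responsibilities. This is a sketch only: the class name and the callable signatures are hypothetical, and it additionally assumes that the request sending step reports which nodes replied.

```python
class FaultHandlingApparatus:
    """Compose the request sending, failed node determination, and clearing steps."""

    def __init__(self, nodes, send_requests, clear_auth):
        self.nodes = list(nodes)
        self.send_requests = send_requests  # role of module 31: returns the responders
        self.clear_auth = clear_auth        # role of module 33: clears one node's auth info

    def determine_failed(self, responders):
        """Role of module 32: nodes that did not respond are failed nodes."""
        return [node for node in self.nodes if node not in responders]

    def run_once(self):
        """Run one detection round and return the failed nodes that were handled."""
        responders = self.send_requests(self.nodes)
        failed = self.determine_failed(responders)
        for node in failed:
            self.clear_auth(node)
        return failed
```

Passing the sending and clearing steps in as callables keeps the failure decision separate from the transport and from the cluster's authentication store.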
In one embodiment of the present invention, the apparatus may further comprise:
an identification information adding module, configured to add fault identification information after the serial number (SN) corresponding to the failed node once the relevant authentication information of the failed node in the distributed storage cluster has been cleared.
In one embodiment of the present invention, the apparatus may further comprise:
an identification information detection module, configured to detect, when a cluster joining request is received after the fault identification information has been added after the SN corresponding to the failed node, whether the fault identification information exists after the SN of the node to be joined;
and a node adding module, configured to, when it is detected that the fault identification information exists after the SN of the node to be joined, clear the original cluster service information in the node to be joined and add the node, with its original cluster service information cleared, to the distributed storage cluster.
In one embodiment of the present invention, the failed node determination module 32 is specifically a module that, when it is determined that there is a node that has not responded to a preset number of consecutive multicast requests, determines that node to be a failed node.
Corresponding to the above method embodiment, referring to Fig. 4, which is a schematic diagram of a distributed cluster node fault handling device provided by the present invention, the device may include:
a memory 41 for storing a computer program;
a processor 42, which, when executing the computer program stored in the memory 41, may implement the following steps:
sending multicast requests respectively to the agent services pre-deployed on each node in the distributed storage cluster; when it is determined that there is a node that does not respond to the multicast request, determining that node to be a failed node; and clearing the relevant authentication information of the failed node in the distributed storage cluster.
For a description of the device provided by the present invention, refer to the above method embodiment; it is not repeated here.
Corresponding to the above method embodiment, the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the following steps:
sending multicast requests respectively to the agent services pre-deployed on each node in the distributed storage cluster; when it is determined that there is a node that does not respond to the multicast request, determining that node to be a failed node; and clearing the relevant authentication information of the failed node in the distributed storage cluster.
The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
For a description of the computer-readable storage medium provided by the present invention, refer to the above method embodiments; it is not repeated here.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the apparatus, the device, and the computer-readable storage medium disclosed in the embodiments correspond to the method disclosed in the embodiments, their description is brief; for relevant details, refer to the description of the method.
Specific examples are used in this application to explain the principle and implementation of the present invention, and the above description of the embodiments is intended only to help in understanding the technical solution and core idea of the present invention. It should be noted that those skilled in the art can make various improvements and modifications to the present invention without departing from its principle, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1, distributed cluster node fault handling method, comprising:
respectively sending multicast requests to agent services pre-deployed by each node in the distributed storage cluster;
determining a node that does not respond to the multicast request as a failed node when it is determined that there is a node that does not respond to the multicast request;
and clearing the relevant authentication information of the fault node in the distributed storage cluster.
2. The method according to claim 1, wherein after clearing the relevant authentication information of the failed node in the distributed storage cluster, the method further comprises:
and adding fault identification information after the sn serial number corresponding to the fault node.
3. The method according to claim 2, wherein after adding the fault identification information after the sn sequence number corresponding to the faulty node, the method further comprises:
when a cluster joining request is received, detecting whether the fault identification information exists after the sn serial number of a node to be joined;
and if so, removing the original cluster service information in the node to be added, and adding the node to be added with the removed original cluster service information to the distributed storage cluster.
4. The distributed cluster node fault handling method of any of claims 1-3, wherein determining a node that is not responding to the multicast request as a faulty node when it is determined that there are nodes that are not responding to the multicast request includes:
and when determining that the nodes which do not respond to the multicast requests for the continuous preset times exist, determining the nodes which do not respond to the multicast requests for the continuous preset times as fault nodes.
5, distributed cluster node fault handling device, comprising:
the request sending module is used for respectively sending multicast requests to the agent services pre-deployed by each node in the distributed storage cluster;
a failed node determination module, configured to determine a node that does not respond to the multicast request as a failed node when it is determined that there is a node that does not respond to the multicast request;
and the authentication information clearing module is used for clearing the relevant authentication information of the fault node in the distributed storage cluster.
6. The distributed cluster node failure handling apparatus of claim 5, further comprising:
and the identification information adding module is used for adding fault identification information after the sn serial number corresponding to the fault node is removed after the relevant authentication information of the fault node in the distributed storage cluster.
7. The distributed cluster node failure handling apparatus of claim 6, further comprising:
an identification information detection module, configured to detect, when a cluster join request is received after the fault identification information has been added after the SN serial number corresponding to the failed node, whether the fault identification information exists after the SN serial number of the node to be added;
and a node adding module, configured to clear the original cluster service information in the node to be added when it is detected that the fault identification information exists after the SN serial number of the node to be added, and to add the node to be added, with its original cluster service information cleared, to the distributed storage cluster.
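The marking and rejoin steps of claims 6 and 7 can be sketched as follows. This is an illustration only: the marker string `-FAULT`, the registry layout, the `cluster_service_info` key, and both function names are assumptions, since the claims do not specify a marker format or data structures. What should happen when the marker is absent, and whether the marker is removed after rejoining, is not stated in the claims, so the sketch takes no action in the first case and leaves the marker in place in the second.

```python
FAULT_MARK = "-FAULT"  # assumed marker format; the claims do not specify one


def mark_faulty(sn_registry: dict, node_id: str) -> None:
    """Append the fault marker after the node's SN serial number."""
    sn = sn_registry[node_id]
    if not sn.endswith(FAULT_MARK):
        sn_registry[node_id] = sn + FAULT_MARK


def handle_join_request(sn_registry: dict, node_id: str,
                        node_state: dict, cluster_members: set) -> bool:
    """If the node's SN carries the fault marker, clear its stale cluster
    service information and add it to the cluster; return True in that case."""
    sn = sn_registry.get(node_id, "")
    if not sn.endswith(FAULT_MARK):
        # The unmarked case is outside what the claims describe.
        return False
    node_state.pop("cluster_service_info", None)
    cluster_members.add(node_id)
    return True


registry = {"node-b": "SN12345"}
mark_faulty(registry, "node-b")
state = {"cluster_service_info": {"role": "storage"}, "hostname": "node-b"}
members = {"node-a"}
rejoined = handle_join_request(registry, "node-b", state, members)
```

Clearing the stale service information before the node rejoins prevents leftover state from the old membership from conflicting with the fresh join.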
8. The distributed cluster node failure handling apparatus of any of claims 5 to 7, wherein the failed node determination module is specifically configured to determine, when it is determined that there is a node that has not responded to a preset number of consecutive multicast requests, the node that has not responded to the preset number of consecutive multicast requests as a failed node.
9. A distributed cluster node fault handling device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the distributed cluster node failure handling method of any of claims 1-4 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, carries out the steps of the distributed cluster node failure handling method according to any of claims 1 to 4.
CN201911025111.7A 2019-10-25 2019-10-25 Distributed cluster node fault processing method, device, equipment and storage medium Pending CN110740064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911025111.7A CN110740064A (en) 2019-10-25 2019-10-25 Distributed cluster node fault processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911025111.7A CN110740064A (en) 2019-10-25 2019-10-25 Distributed cluster node fault processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110740064A true CN110740064A (en) 2020-01-31

Family

ID=69271485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911025111.7A Pending CN110740064A (en) 2019-10-25 2019-10-25 Distributed cluster node fault processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110740064A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111756571A (en) * 2020-05-28 2020-10-09 苏州浪潮智能科技有限公司 Cluster node fault processing method, device, equipment and readable medium
CN113783735A (en) * 2021-09-24 2021-12-10 小红书科技有限公司 Method, device, equipment and medium for identifying fault node in Redis cluster
CN115426247A (en) * 2022-08-22 2022-12-02 中国工商银行股份有限公司 Processing method and device of fault node, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040059805A1 (en) * 2002-09-23 2004-03-25 Darpan Dinker System and method for reforming a distributed data system cluster after temporary node failures or restarts
US20120042030A1 (en) * 2010-08-12 2012-02-16 International Business Machines Corporation High availability management system for stateless components in a distributed master-slave component topology
US20170373926A1 (en) * 2016-06-22 2017-12-28 Vmware, Inc. Dynamic heartbeating mechanism
CN108847982A (en) * 2018-06-26 2018-11-20 郑州云海信息技术有限公司 A kind of distributed storage cluster and its node failure switching method and apparatus
CN109218100A (en) * 2018-09-21 2019-01-15 郑州云海信息技术有限公司 Distributed objects storage cluster and its request responding method, system and storage medium
US10275326B1 (en) * 2014-10-31 2019-04-30 Amazon Technologies, Inc. Distributed computing system failure detection

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040059805A1 (en) * 2002-09-23 2004-03-25 Darpan Dinker System and method for reforming a distributed data system cluster after temporary node failures or restarts
US20120042030A1 (en) * 2010-08-12 2012-02-16 International Business Machines Corporation High availability management system for stateless components in a distributed master-slave component topology
US10275326B1 (en) * 2014-10-31 2019-04-30 Amazon Technologies, Inc. Distributed computing system failure detection
US20170373926A1 (en) * 2016-06-22 2017-12-28 Vmware, Inc. Dynamic heartbeating mechanism
CN108847982A (en) * 2018-06-26 2018-11-20 郑州云海信息技术有限公司 A kind of distributed storage cluster and its node failure switching method and apparatus
CN109218100A (en) * 2018-09-21 2019-01-15 郑州云海信息技术有限公司 Distributed objects storage cluster and its request responding method, system and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李学勇等: "基于广播的分布式系统级故障诊断算法", 《计算机工程》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111756571A (en) * 2020-05-28 2020-10-09 苏州浪潮智能科技有限公司 Cluster node fault processing method, device, equipment and readable medium
CN111756571B (en) * 2020-05-28 2022-02-18 苏州浪潮智能科技有限公司 Cluster node fault processing method, device, equipment and readable medium
US11750437B2 (en) 2020-05-28 2023-09-05 Inspur Suzhou Intelligent Technology Co., Ltd. Cluster node fault processing method and apparatus, and device and readable medium
CN113783735A (en) * 2021-09-24 2021-12-10 小红书科技有限公司 Method, device, equipment and medium for identifying fault node in Redis cluster
CN115426247A (en) * 2022-08-22 2022-12-02 中国工商银行股份有限公司 Processing method and device of fault node, storage medium and electronic equipment
CN115426247B (en) * 2022-08-22 2024-04-26 中国工商银行股份有限公司 Fault node processing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN109274544B (en) Fault detection method and device for distributed storage system
CN106933843B (en) Database heartbeat detection method and device
CN110740064A (en) Distributed cluster node fault processing method, device, equipment and storage medium
CN110830283B (en) Fault detection method, device, equipment and system
CN108737132B (en) Alarm information processing method and device
CN108924202B (en) Distributed cluster data disaster tolerance method and related device
EP3258653A1 (en) Message pushing method and device
CN109921942B (en) Cloud platform switching control method, device and system and electronic equipment
CN105959078B (en) A kind of cluster method for synchronizing time, cluster and clock synchronization system
CN111355600B (en) Main node determining method and device
CN111142801B (en) Distributed storage system network sub-health detection method and device
CN109391691A (en) The restoration methods and relevant apparatus that NAS is serviced under a kind of single node failure
CN109302435B (en) Message publishing method, device, system, server and computer readable storage medium
CN110113187B (en) Configuration updating method and device, configuration server and configuration system
CN111338858A (en) Disaster recovery method and device for double machine rooms
CN108509296B (en) Method and system for processing equipment fault
CN112436962B (en) Block chain consensus network dynamic expansion method, electronic device, system and medium
CN105490837A (en) Network monitoring processing method and device
CN110224872B (en) Communication method, device and storage medium
CN104243473A (en) Data transmission method and device
CN113254245A (en) Fault detection method and system for storage cluster
CN114301763B (en) Distributed cluster fault processing method and system, electronic equipment and storage medium
CN107493308B (en) Method and device for sending message and distributed equipment cluster system
CN113190347A (en) Edge cloud system and task management method
JP7143609B2 (en) COMMUNICATION DEVICE, COMMUNICATION METHOD, AND PROGRAM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200131
