CN110431533A - Fault recovery method, device, and system - Google Patents

Fault recovery method, device, and system

Info

Publication number
CN110431533A
Authority
CN
China
Prior art keywords
node
log
leader
log entry
voting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201680091858.0A
Other languages
English (en)
Other versions
CN110431533B (zh)
Inventor
侯杰
宋跃忠
林程勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN110431533A
Application granted
Publication of CN110431533B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/40 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/30 Decision processes by autonomous network management units using voting and bidding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/142 Reconfiguring to eliminate the error
    • G06F11/1425 Reconfiguring to eliminate the error by reconfiguration of node membership
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1471 Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H04L41/0661 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities by reconfiguring faulty entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0895 Configuration of virtualised networks or elements, e.g. virtualised network function or OpenFlow elements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Hardware Redundancy (AREA)

Abstract

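A fault recovery method is applied to a distributed cluster system in which the number of nodes holding the latest log can allow a node without the latest log to be elected Leader after one of the nodes holding the latest log fails and restarts. The distributed cluster system includes at least a first node, a second node, and a third node; the first node and the second node hold the latest log from before the failure, and the third node does not. The method includes: after the first node fails and restarts, setting its voting status to "cannot vote", where the voting status indicates whether the first node may vote when the distributed cluster system elects a Leader; and, when the first node receives a replicated log entry message from the second node, which is the Leader, setting the first node's voting status to "can vote". The method helps improve the safety of the distributed cluster system.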
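The voting-status rule in the abstract can be illustrated with a minimal Python sketch. This is not the patent's implementation: the `Node` class, its method names, and the `wins_election` helper are hypothetical, the comparison of log lengths stands in for whatever log-freshness check a real election uses, and the sketch assumes the restarted node lost its log in the failure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical cluster node; all names here are illustrative."""
    name: str
    log: List[str] = field(default_factory=list)
    can_vote: bool = True
    is_leader: bool = False

    def restart_after_failure(self) -> None:
        # Assumption: the failure wiped the node's log.
        self.log = []
        self.is_leader = False
        # Rule from the abstract: a restarted node must not vote yet.
        self.can_vote = False

    def handle_vote_request(self, candidate_log_len: int) -> bool:
        # While voting is disabled, refuse every vote request.
        if not self.can_vote:
            return False
        # Simplified freshness check (assumption): grant the vote only if
        # the candidate's log is at least as long as this node's log.
        return candidate_log_len >= len(self.log)

    def handle_replicated_log_entries(self, leader: "Node") -> None:
        # Rule from the abstract: a replicated log entry message from the
        # Leader re-enables voting. Here the whole log is copied at once.
        if not leader.is_leader:
            raise ValueError("replicated entries must come from the Leader")
        self.log = list(leader.log)
        self.can_vote = True


def wins_election(candidate: Node, voters: List[Node]) -> bool:
    """Return True if the candidate gathers a strict majority of votes."""
    votes = 1  # the candidate votes for itself
    votes += sum(v.handle_vote_request(len(candidate.log)) for v in voters)
    return votes > (len(voters) + 1) // 2


if __name__ == "__main__":
    first = Node("first", ["entry-1"])
    second = Node("second", ["entry-1"], is_leader=True)
    third = Node("third")  # never received the latest log

    first.restart_after_failure()
    # The third node cannot win: the first node refuses to vote and the
    # second node holds a longer log.
    print(wins_election(third, [first, second]))

    first.handle_replicated_log_entries(second)
    print(first.can_vote, first.log)
```

Without the `can_vote` guard, the restarted first node (now holding an empty log) would grant its vote to the third node, giving the third node two of three votes and letting a node without the latest log become Leader.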

Description

PCT national-phase application; the description has been published.

Claims (23)

  1. PCT national-phase application; the claims have been published.
CN201680091858.0A 2016-12-30 2016-12-30 Fault recovery method, device, and system Active CN110431533B (zh)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/113848 WO2018120174A1 (zh) 2016-12-30 2016-12-30 Fault recovery method, device, and system

Publications (2)

Publication Number Publication Date
CN110431533A (zh) 2019-11-08
CN110431533B (zh) 2021-09-14

Family

ID=62706721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680091858.0A Active CN110431533B (zh) 2016-12-30 2016-12-30 Fault recovery method, device, and system

Country Status (4)

Country Link
US (1) US11102084B2 (zh)
EP (1) EP3553669B1 (zh)
CN (1) CN110431533B (zh)
WO (1) WO2018120174A1 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
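CN111538763A (zh) * 2020-04-24 2020-08-14 MIGU Culture Technology Co., Ltd. Method for determining a master node in a cluster, electronic device, and storage medium
CN112601216A (zh) * 2020-12-10 2021-04-02 Suzhou Inspur Intelligent Technology Co., Ltd. Zigbee-based trusted platform alarm method and system
CN112865995A (zh) * 2019-11-27 2021-05-28 Shanghai Bilibili Technology Co., Ltd. Distributed master-slave system
CN113014634A (zh) * 2021-02-20 2021-06-22 Chengdu New Hope Finance Information Co., Ltd. Cluster election processing method, apparatus, device, and storage medium
CN113742254A (zh) * 2021-01-19 2021-12-03 Beijing Wodong Tianjun Information Technology Co., Ltd. Memory fragmentation management method, apparatus, and system
CN114299655A (zh) * 2020-09-23 2022-04-08 Chengdu Zhongke Information Technology Co., Ltd. Electronic voting system and working method thereof
CN114518973A (zh) * 2022-02-18 2022-05-20 Chengdu Southwest Information Control Research Institute Co., Ltd. Method for recovering a distributed cluster node after a crash and restart
CN115794478A (zh) * 2023-02-06 2023-03-14 Tianyi Cloud Technology Co., Ltd. System configuration method, apparatus, electronic device, and storage medium
CN116028250A (zh) * 2021-10-26 2023-04-28 Hewlett Packard Enterprise Development LP Disaggregated storage with multiple cluster levels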

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10595363B2 (en) * 2018-05-11 2020-03-17 At&T Intellectual Property I, L.P. Autonomous topology management for wireless radio user equipment
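CN114189421B (zh) * 2022-02-17 2022-05-31 Jiangxi Agricultural University Leader node election method, system, storage medium, and device
CN114448996B (zh) * 2022-03-08 2022-11-11 Nanjing University Consensus method and system based on redundant storage resources under a compute-storage separation framework
CN114406409B (zh) * 2022-03-30 2022-07-12 China Classification Society Method, apparatus, and device for determining the fault state of a welding machine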

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050132154A1 (en) * 2003-10-03 2005-06-16 International Business Machines Corporation Reliable leader election in storage area network
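CN103763155A (zh) * 2014-01-24 2014-04-30 State Grid Corporation of China Multi-service heartbeat monitoring method for a distributed cloud storage system
CN103793517A (zh) * 2014-02-12 2014-05-14 Inspur Electronic Information Industry Co., Ltd. Monitoring-mechanism-based method for dynamic capacity expansion of file system log dumps
CN104994168A (zh) * 2015-07-14 2015-10-21 Suzhou Keda Technology Co., Ltd. Distributed storage method and distributed storage system
CN105512266A (zh) * 2015-12-03 2016-04-20 Dawning Information Industry (Beijing) Co., Ltd. Method and apparatus for achieving operation consistency in a distributed database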
US9507843B1 (en) * 2013-09-20 2016-11-29 Amazon Technologies, Inc. Efficient replication of distributed storage changes for read-only nodes of a distributed database

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7558883B1 (en) * 2002-06-28 2009-07-07 Microsoft Corporation Fast transaction commit
CN103152434A (zh) * 2013-03-27 2013-06-12 Jiangsu Chenyun Information Technology Co., Ltd. Leader node replacement method in a distributed cloud system
US9047246B1 (en) * 2014-07-31 2015-06-02 Splunk Inc. High availability scheduler
CN105511987A (zh) * 2015-12-08 2016-04-20 Shanghai Eisoo Information Technology Co., Ltd. Strongly consistent and highly available distributed task management system
US10503427B2 (en) * 2017-03-10 2019-12-10 Pure Storage, Inc. Synchronously replicating datasets and other managed objects to cloud-based storage systems

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
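CN112865995A (zh) * 2019-11-27 2021-05-28 Shanghai Bilibili Technology Co., Ltd. Distributed master-slave system
CN112865995B (zh) * 2019-11-27 2022-10-14 Shanghai Bilibili Technology Co., Ltd. Distributed master-slave system
CN111538763A (zh) * 2020-04-24 2020-08-14 MIGU Culture Technology Co., Ltd. Method for determining a master node in a cluster, electronic device, and storage medium
CN111538763B (zh) * 2020-04-24 2023-08-15 MIGU Culture Technology Co., Ltd. Method for determining a master node in a cluster, electronic device, and storage medium
CN114299655B (zh) * 2020-09-23 2023-09-05 Chengdu Zhongke Information Technology Co., Ltd. Electronic voting system and working method thereof
CN114299655A (zh) * 2020-09-23 2022-04-08 Chengdu Zhongke Information Technology Co., Ltd. Electronic voting system and working method thereof
CN112601216A (zh) * 2020-12-10 2021-04-02 Suzhou Inspur Intelligent Technology Co., Ltd. Zigbee-based trusted platform alarm method and system
CN112601216B (zh) * 2020-12-10 2022-06-21 Suzhou Inspur Intelligent Technology Co., Ltd. Zigbee-based trusted platform alarm method and system
CN113742254A (zh) * 2021-01-19 2021-12-03 Beijing Wodong Tianjun Information Technology Co., Ltd. Memory fragmentation management method, apparatus, and system
CN113014634A (zh) * 2021-02-20 2021-06-22 Chengdu New Hope Finance Information Co., Ltd. Cluster election processing method, apparatus, device, and storage medium
CN116028250B (zh) * 2021-10-26 2024-06-11 Hewlett Packard Enterprise Development LP Disaggregated storage with multiple cluster levels
CN116028250A (zh) * 2021-10-26 2023-04-28 Hewlett Packard Enterprise Development LP Disaggregated storage with multiple cluster levels
CN114518973A (zh) * 2022-02-18 2022-05-20 Chengdu Southwest Information Control Research Institute Co., Ltd. Method for recovering a distributed cluster node after a crash and restart
CN114518973B (zh) * 2022-02-18 2024-07-30 Chengdu Southwest Information Control Research Institute Co., Ltd. Method for recovering a distributed cluster node after a crash and restart
CN115794478A (zh) * 2023-02-06 2023-03-14 Tianyi Cloud Technology Co., Ltd. System configuration method, apparatus, electronic device, and storage medium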

Also Published As

Publication number Publication date
CN110431533B (zh) 2021-09-14
US11102084B2 (en) 2021-08-24
EP3553669A4 (en) 2019-10-16
WO2018120174A1 (zh) 2018-07-05
EP3553669A1 (en) 2019-10-16
US20190386893A1 (en) 2019-12-19
EP3553669B1 (en) 2024-09-25

Similar Documents

Publication Publication Date Title
CN110431533B (zh) 故障恢复的方法、设备和系统
CN113014634B (zh) 集群选举处理方法、装置、设备及存储介质
EP3928208B1 (en) System and method for self-healing in decentralized model building for machine learning using blockchain
US7249280B2 (en) Cheap paxos
US7856502B2 (en) Cheap paxos
US7711825B2 (en) Simplified Paxos
US9465650B2 (en) Executing distributed globally-ordered transactional workloads in replicated state machines
EP2434729A2 (en) Method for providing access to data items from a distributed storage system
WO2014197963A1 (en) Failover system and method
CN110865907B Method and system for providing service redundancy between a master server and a slave server
EP4191429B1 (en) Techniques to achieve cache coherency across distributed storage clusters
CN114554593A Data processing method and apparatus
US11010086B2 (en) Data synchronization method and out-of-band management device
US11522966B2 (en) Methods, devices and systems for non-disruptive upgrades to a replicated state machine in a distributed computing environment
EP3140735A1 (en) System and method for running application processes
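CN110781039B Sentinel process election method and apparatus
CN100442248C Computer system synchronization unit for avoiding contention
CN113157494B Method and apparatus for data backup in a blockchain system
CN117215833A Distributed data backup method, system, device, and storage medium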
US20240111747A1 (en) Optimizing the operation of a microservice cluster
CN118331644A Interaction control method, apparatus, device, and medium
Zhu Shaft: Serializable, highly available and fault tolerant concurrency control in the cloud
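CN114721869A Account balance processing method and system
CN115344424A Synchronization state recovery method, apparatus, device, and storage medium
CN113568710A Method, apparatus, and device for implementing virtual machine high availability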

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant