WO2019180477A1 - Distributed group membership service - Google Patents
Distributed group membership service
- Publication number
- WO2019180477A1 (PCT/IB2018/051785)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- virtual machine
- processes
- running
- cluster
- different virtual
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0712—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a virtual computing platform, e.g. logically partitioned systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0715—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5077—Logical partitioning of resources; Management or configuration of virtualized resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45575—Starting, stopping, suspending or resuming virtual machine instances
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45591—Monitoring or debugging support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/815—Virtual
Definitions
- local failure detectors: programs which register with the kernel for failure events, e.g. when a process crashes and halts, or when a process is suspended because another process crashed.
- local failure detectors corresponding to each process in the group running on different virtual machine nodes in the cluster can communicate with each other.
- a process communicates with its local failure detector through a special receive-only channel, on which the local failure detector may place a new list of process identifiers, together with the identifiers or IP addresses of the virtual machines on which those processes run; these are the processes not suspected to have crashed. We call this list the adjacency view of the process.
- the local failure detector can share the adjacency view of the process, along with the current state of the process itself (failed or not), with all other failure detectors running on the different virtual machine nodes it can reach. This way all processes in the group running on different virtual machine nodes in the cluster have a consistent view of the up-and-running processes, and all of them will agree, by consensus, to revoke the membership of the failed process.
- Health Check Service: a service which periodically checks the health of each virtual machine node in the cluster; if a virtual machine node does not respond for a fixed number of consecutive cycles, the Health Check Service assumes that the virtual machine is down and notifies the local failure detectors of all other virtual machines in the cluster.
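The adjacency-view mechanism defined above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class and method names (`LocalFailureDetector`, `report_crash`, `drain_latest`) and the use of a `queue.Queue` as the receive-only channel are assumptions made for the example.

```python
import queue


class LocalFailureDetector:
    """Sketch of a per-node failure detector. It tracks (process_id, vm_address)
    pairs for the group and pushes the current adjacency view -- the processes
    NOT suspected to have crashed -- onto a receive-only channel that the
    monitored process reads."""

    def __init__(self, members):
        # members: iterable of (process_id, vm_address) pairs in the group
        self.members = set(members)
        self.suspected = set()           # process ids suspected to have crashed
        self.channel = queue.Queue()     # receive-only channel for the process

    def adjacency_view(self):
        # identifiers plus VM addresses of processes not suspected to have crashed
        return sorted((pid, vm) for pid, vm in self.members
                      if pid not in self.suspected)

    def report_crash(self, process_id):
        # e.g. invoked on a kernel-level crash or suspend notification
        self.suspected.add(process_id)
        self.channel.put(self.adjacency_view())   # publish the new view


def drain_latest(channel):
    """The monitored process keeps only the most recent view on the channel."""
    view = None
    while not channel.empty():
        view = channel.get_nowait()
    return view
```

For example, after `report_crash("p2")` on a three-process group, `drain_latest` returns an adjacency view containing only the two remaining processes with their VM addresses.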
Abstract
Here we have a group of processes, each of which runs on a different virtual machine node in order to complete a specific set of tasks. We consider an asynchronous distributed system in which processes communicate by exchanging messages. Processes running on different virtual machine nodes are identified by their unique identifiers together with the IP address of the virtual machine node on which they run. Every pair of processes is connected by a communication channel. To track failures on the same virtual machine node we use local failure detectors, programs which register with the kernel for failure events of the process. The local failure detectors corresponding to the processes in the group, running on different virtual machine nodes in the cluster, can also communicate with each other, which gives each process a view of the adjacent up-and-running processes and of the processes that have failed.
Description
Distributed Group Membership Service
In this invention we have a group of processes, each of which runs on a different virtual machine node in order to complete a specific set of tasks. We consider an asynchronous distributed system in which processes communicate by exchanging messages. Processes running on different virtual machine nodes are identified by their unique identifiers together with the Internet Protocol (IP) address of the virtual machine node (or its identifier) on which they run. Every pair of processes is connected by a communication channel; that is, every process can send messages to and receive messages from any other. The failure model we assume allows processes to crash, silently halting their execution.

To track such failures on the same virtual machine node we use local failure detectors, programs which register with the kernel for failure events, e.g. when a process crashes and halts, or when a process is suspended because another process crashed. The local failure detectors corresponding to the processes in the group, running on different virtual machine nodes in the cluster, can also communicate with each other. We assume that a process communicates with its local failure detector through a special receive-only channel, on which the local failure detector may place a new list of process identifiers, together with the identifiers or IP addresses of the virtual machines on which those processes run; these are the processes not suspected to have crashed. We call this list the adjacency view of the process. The local failure detector can also share the adjacency view of the process, along with the current state of the process itself (failed or not), with all other failure detectors running on the different virtual machine nodes it can reach. This way all processes in the group running on different virtual machine nodes in the cluster have a consistent view of the up-and-running processes, and all of them will agree, by consensus, to revoke the membership of the failed process and either distribute its pending tasks among themselves or create a new process on a virtual machine to take them over.

To handle virtual machine node failures, we also have a Health Check Service which periodically checks the health of each virtual machine node in the cluster; if a virtual machine node does not respond for a fixed number of consecutive cycles, the Health Check Service assumes that the virtual machine is down and notifies the local failure detectors of all other virtual machines in the cluster.
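The health-check cycle described above can be sketched as a small service that counts consecutive missed probes per node. The names (`HealthCheckService`, `probe`, `notify`) and the callback-based interface are illustrative assumptions; the patent does not specify how the liveness check or the update to the local failure detectors is performed.

```python
class HealthCheckService:
    """Sketch of the periodic health check: after `threshold` consecutive
    cycles without a response, a node is assumed down and the other nodes'
    local failure detectors are notified."""

    def __init__(self, nodes, probe, notify, threshold=3):
        self.nodes = list(nodes)
        self.probe = probe               # probe(node) -> bool: did the node respond?
        self.notify = notify             # notify(node): update the failure detectors
        self.threshold = threshold       # consecutive missed cycles before "down"
        self.missed = {n: 0 for n in self.nodes}
        self.down = set()

    def run_cycle(self):
        for node in self.nodes:
            if node in self.down:
                continue                 # already declared down
            if self.probe(node):
                self.missed[node] = 0    # any response resets the counter
            else:
                self.missed[node] += 1
                if self.missed[node] >= self.threshold:
                    self.down.add(node)  # assume the virtual machine is down
                    self.notify(node)    # tell the other local failure detectors
```

A driver would invoke `run_cycle()` on a timer; requiring several consecutive misses, rather than one, keeps a single dropped probe from being mistaken for a node failure.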
Claims
1. In this invention we have a group of processes, each of which runs on a different virtual machine node in order to complete a specific set of tasks. We consider an asynchronous distributed system in which processes communicate by exchanging messages. Processes running on different virtual machine nodes are identified by their unique identifiers together with the Internet Protocol (IP) address of the virtual machine node (or its identifier) on which they run. Every pair of processes is connected by a communication channel; that is, every process can send messages to and receive messages from any other. The failure model we assume allows processes to crash, silently halting their execution. To track such failures on the same virtual machine node we use local failure detectors, programs which register with the kernel for failure events, e.g. when a process crashes and halts, or when a process is suspended because another process crashed. The local failure detectors corresponding to the processes in the group, running on different virtual machine nodes in the cluster, can also communicate with each other. We assume that a process communicates with its local failure detector through a special receive-only channel, on which the local failure detector may place a new list of process identifiers, together with the identifiers or IP addresses of the virtual machines on which those processes run; these are the processes not suspected to have crashed. We call this list the adjacency view of the process. The local failure detector can also share the adjacency view of the process, along with the current state of the process itself (failed or not), with all other failure detectors running on the different virtual machine nodes it can reach.
This way all processes in the group running on different virtual machine nodes in the cluster have a consistent view of the up-and-running processes, and all of them will agree, by consensus, to revoke the membership of the failed process and either distribute its pending tasks among themselves or create a new process on a virtual machine to take them over. To handle virtual machine node failures, we also have a Health Check Service which periodically checks the health of each virtual machine node in the cluster; if a virtual machine node does not respond for a fixed number of consecutive cycles, the Health Check Service assumes that the virtual machine is down and notifies the local failure detectors of all other virtual machines in the cluster. The above novel technique of providing a membership service for a group of processes running on different virtual machine nodes in the cluster is the claim of this invention.
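The post-consensus step of the claim, revoking the failed process's membership and distributing its pending tasks among the survivors, can be sketched as below. The function name, the dict-based data shapes, and the round-robin policy are assumptions for illustration; the patent leaves the redistribution strategy open.

```python
def revoke_and_redistribute(membership, pending_tasks, failed):
    """Drop the failed process from the group and hand its pending tasks to
    the surviving members round-robin.

    membership:    dict process_id -> vm_address
    pending_tasks: dict process_id -> list of tasks
    failed:        process_id agreed (by consensus) to have crashed
    Returns new (membership, pending_tasks) without mutating the inputs.
    """
    membership = dict(membership)
    pending_tasks = {pid: list(ts) for pid, ts in pending_tasks.items()}

    orphaned = pending_tasks.pop(failed, [])
    membership.pop(failed, None)             # revoke the failed process's membership

    survivors = sorted(membership)
    if orphaned and not survivors:
        raise RuntimeError("no surviving members to take over the tasks")
    for i, task in enumerate(orphaned):      # round-robin redistribution
        pending_tasks[survivors[i % len(survivors)]].append(task)
    return membership, pending_tasks
```

The alternative path in the claim, creating a new process on a virtual machine to take over the orphaned tasks, would replace the round-robin loop with spawning a replacement member and assigning it the `orphaned` list.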
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2018/051785 WO2019180477A1 (en) | 2018-03-17 | 2018-03-17 | Distributed group membership service |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019180477A1 (en) | 2019-09-26 |
Family
ID=67986638
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2018/051785 WO2019180477A1 (en) | 2018-03-17 | 2018-03-17 | Distributed group membership service |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019180477A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8776050B2 (en) * | 2003-08-20 | 2014-07-08 | Oracle International Corporation | Distributed virtual machine monitor for managing multiple virtual resources across multiple physical nodes |
- 2018-03-17: WO PCT/IB2018/051785 patent/WO2019180477A1/en, active Application Filing
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8032780B2 (en) | Virtualization based high availability cluster system and method for managing failure in virtualization based high availability cluster system | |
KR100645733B1 (en) | Automatic configuration of network for monitoring | |
CN106294713A (en) | The method of data synchronization resolved based on Incremental Log and data synchronization unit | |
CN103051470B (en) | The control method of a kind of cluster and magnetic disk heartbeat thereof | |
JP6269250B2 (en) | Data transfer control device, data transfer control method, and program | |
JP2018522471A (en) | Software-defined data center and service cluster placement method there | |
JP6079426B2 (en) | Information processing system, method, apparatus, and program | |
CN103444256A (en) | Self-organization of a satellite grid | |
WO2009079177A3 (en) | Systems and methods of high availability cluster environment failover protection | |
WO2003039071A1 (en) | Method to manage high availability equipments | |
CN105993161A (en) | Scalable address resolution | |
EP1117038A3 (en) | Method and apparatus for providing fault-tolerant addresses for nodes in a clustered system | |
CN104618521A (en) | Node de-duplication in a network monitoring system | |
CN103501355B (en) | Internet protocol address collision detection method, device and gateway device | |
CN103036702A (en) | Network segment crossing N+1 backup method and network segment crossing N+1 backup device | |
US10530634B1 (en) | Two-channel-based high-availability | |
CN110771097A (en) | Connectivity monitoring for data tunneling between network device and application server | |
US10742493B1 (en) | Remote network interface card management | |
WO2019180477A1 (en) | Distributed group membership service | |
US20150039929A1 (en) | Method and Apparatus for Forming Software Fault Containment Units (SWFCUS) in a Distributed Real-Time System | |
CN105991678A (en) | Distributed equipment service processing method, distributed equipment service processing device and distributed equipment | |
CN106603330A (en) | Cloud platform virtual machine connection state checking method | |
Hou et al. | Design and implementation of heartbeat in multi-machine environment | |
KR20180104639A (en) | Database-based redundancy in the network | |
JP5157685B2 (en) | COMMUNICATION SYSTEM, NETWORK DEVICE, COMMUNICATION RECOVERY METHOD USED FOR THEM, AND PROGRAM THEREOF |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18910815 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 18910815 Country of ref document: EP Kind code of ref document: A1 |