WO2019004845A1 - Service replication system and method - Google Patents

Service replication system and method

Info

Publication number
WO2019004845A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
service
state
user
volatile
Application number
PCT/NZ2018/050091
Other languages
French (fr)
Inventor
Abdolhossein SARRAFZADEH
Denis LAVROV
Shaoning PANG
Original Assignee
Sarrafzadeh Abdolhossein
Lavrov Denis
Pang Shaoning
Application filed by Sarrafzadeh Abdolhossein, Lavrov Denis, Pang Shaoning
Publication of WO2019004845A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/1734Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/178Techniques for file synchronisation in file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • G06F16/184Distributed file systems implemented as replicated file system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1865Transactional file systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1095Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/11File system administration, e.g. details of archiving or snapshots
    • G06F16/128Details of file system snapshots on the file-level, e.g. snapshot creation, administration, deletion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45562Creating, deleting, cloning virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45575Starting, stopping, suspending or resuming virtual machine instances

Definitions

  • the invention relates to a method and system for replicating a service hosted by a master node to at least one active node.
  • the invention is particularly suitable for use in virtual machine replication.
  • Virtual machine replication is one solution attempting to be more transparent and able to be generically applied to almost any service without special configuration.
  • One goal of virtual machine replication is to address perceived difficulties involved with the design and implementation of specialised systems.
  • Traditional virtual machine replication solutions are only able to protect against localised disasters such as hardware failure and local network or power outages.
  • Traditional virtual machine replication techniques are not able to satisfactorily address large-scale incidents such as natural disasters, data centre level power outages, software crashes, or user misoperation.
  • Virtual machine replication must absolutely guarantee the consistency of the service recovery, and cannot afford to lose any state. Losing even a single page of memory contents, for example, would render the service unusable. Furthermore, traditional solutions typically rely on network buffering during the replication stage in order to guarantee a consistent view of the system during failure to its clients. Solutions with buffering are severely impacted by the conditions presented by a WAN.
  • a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node comprises: responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state do not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.
  • the user state comprises at least one sequence of system calls associated to at least one data block.
  • the at least one sequence of system calls are associated to respective system calls that have been completed.
  • the user state comprises a plurality of snapshots associated to respective checkpoints.
  • the first node comprises a master node and the second node comprises a slave node, the slave node comprising an active node or a passive node.
  • At least one of the service definition and the user state is maintained in at least one non-volatile storage medium.
  • the service definition includes one or more of service configuration data, service software, service operation immutable data.
  • the user state is associated to an application and includes one or more of transactions, personal records, emails, files.
  • the volatile state is maintained in at least one volatile storage medium
  • the method further comprises; maintaining the service definition in a first file system, the first file system associated to at least one read only layer; and maintaining the user state in a second file system, the second file system associated to at least one read/write layer.
  • the method further comprises presenting to the service a view associated to one or more of the first file system, the second file system.
  • the method further comprises maintaining at least some of the volatile state in at least one non-volatile storage medium.
  • the method further comprises writing at least some of the volatile state maintained in the at least one non-volatile storage medium to a third file system, the third file system associated to at least one read/write layer.
  • the method further comprises presenting to the service a view associated to one or more of the first file system, the second file system, the third file system.
  • the method further comprises one or more of receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as volatile state, receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as user state.
  • the method further comprises discarding at least some of the volatile state.
  • the volatile state includes one or more of lookup tables, indexes, buffers, caches, temporary variables.
  • a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node comprises: responsive to one or more of detecting the creation of a service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: receiving, over a network, service definition associated to both the first node and a service on the first node, receiving, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state do not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, receiving, over a network, the at least one change in the user state.
  • the user state comprises at least one sequence of system calls associated to at least one data block.
  • the at least one sequence of system calls are associated to respective system calls that have been completed.
  • the user state comprises a plurality of snapshots associated to respective checkpoints.
  • the first node comprises a master node and the second node comprises a slave node, the slave node comprising an active node or a passive node.
  • At least one of the service definition and the user state is maintained in at least one non-volatile storage medium.
  • the service definition includes one or more of service configuration data, service software, service operation immutable data.
  • the user state is associated to an application and includes one or more of transactions, personal records, emails, files.
  • the volatile state is maintained in at least one volatile storage medium.
  • the method further comprises recreating at least some of the volatile state.
  • the method further comprises: maintaining the service definition in a first file system, the first file system associated to at least one read only layer; and maintaining the user state in a second file system, the second file system associated to at least one read/write layer.
  • the method further comprises presenting to the service a view associated to one or more of the first file system, the second file system.
  • the method further comprises maintaining at least some of the volatile state in at least one non-volatile storage medium.
  • the method further comprises writing at least some of the volatile state maintained in the at least one non-volatile storage medium to a third file system, the third file system associated to at least one read/write layer.
  • the method further comprises presenting to the service a view associated to one or more of the first file system, the second file system, the third file system.
  • the method further comprises one or more of receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as volatile state, receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as user state.
  • the method further comprises discarding at least some of the volatile state.
  • the volatile state includes one or more of lookup tables, indexes, buffers, caches, temporary variables.
  • a service replication system comprises a service hosted by a first node, the first node being a member of a cluster that includes the first node and a second node; and a processor.
  • the processor is programmed to: responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state do not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.
  • a service replication system comprises a service hosted by a first node, the first node being a member of a cluster that includes the first node and a second node; and a processor.
  • the processor is programmed to: responsive to one or more of detecting the creation of a service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: receiving, over a network, service definition associated to both the first node and a service on the first node, receiving, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state do not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, receiving, over a network, the at least one change in the user state.
  • a computer-readable medium has stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node.
  • the method comprises responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state do not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.
  • a computer-readable medium has stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node.
  • the method comprises responsive to one or more of detecting the creation of a service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: receiving, over a network, service definition associated to both the first node and a service on the first node, receiving, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state do not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, receiving, over a network, the at least one change in the user state.
  • the invention in one aspect comprises several steps.
  • the relation of one or more of such steps with respect to each of the others, the apparatus embodying features of construction, and combinations of elements and arrangement of parts that are adapted to effect such steps, are all exemplified in the following detailed disclosure.
  • '(s)' following a noun means the plural and/or singular forms of the noun.
  • 'and/or' means 'and' or 'or' or both.
  • the term 'computer-readable medium' should be taken to include a single medium or multiple media. Examples of multiple media include a centralised or distributed database and/or associated caches. These multiple media store the one or more sets of computer executable instructions.
  • the term 'computer readable medium' should also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one or more of the methods described above.
  • the computer-readable medium is also capable of storing, encoding or carrying data structures used by or associated with these sets of instructions.
  • the term 'computer-readable medium' includes solid-state memories, optical media and magnetic media.
  • a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a controller and the controller can be a component.
  • One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
  • 'Connected to' as used in this specification in relation to data or signal transfer includes all direct or indirect types of communication, including wired and wireless, via a cellular network, via a data bus, or any other computer structure. It is envisaged that there may be intervening elements between the connected integers. Variants such as 'in communication with', 'joined to', and 'attached to' are to be interpreted in a similar manner. Related terms such as 'connecting' and 'in connection with' are to be interpreted in the same manner.
  • Figure 1 shows an example architecture suitable for the service replication
  • Figure 2 shows an overview of the lifecycle of a service suitable for replication
  • Figure 3 illustrates an example of state replication
  • Figure 4 shows an example of the continuous state synchronisation process from figure 2;
  • Figure 5 shows an example of the failure detection process from figure 2
  • Figure 6 shows an example of the coordination and control process from figure 2;
  • Figure 7 shows an example of dynamic failover from figure 2;
  • Figure 8 shows examples of recovery time per timeout
  • Figure 9 shows an example of a recovery process profile
  • Figure 10 shows results of measuring bandwidth use
  • Figure 11 shows the findings of total replication size and overhead
  • Figure 12 shows an example of write coalescing
  • Figure 13 shows an example of measuring CPU load for different bandwidth links;
  • Figure 14 shows the percentage slowdown compared to replication rate; and
  • Figure 15 shows a computing environment suitable for implementing the invention.
  • Figure 1 shows an example architecture 100 suitable for the service replication techniques described below.
  • the architecture 100 is configured as a geo-distributed cluster of symmetric nodes.
  • the nodes are represented by a physical or virtual machine in the cloud or on the premises of a user.
  • the nodes are configured to connect to each other and host at least one service under protection. This will typically require a globally visible IP address.
  • a logical namespace comprises a master node 102 and a plurality of slave nodes.
  • slave nodes include active nodes indicated at 104 and 106, and passive nodes indicated at 108 and 110.
  • the master node 102 is responsible for running the service, replicating any state changes and performing heartbeat to the other nodes.
  • Active nodes 104 and 106, and passive nodes 108 and 110 receive heartbeats from the master node 102.
  • Active nodes 104 and 106, and passive nodes 108 and 110, must reply to heartbeats received from the master node 102.
  • Examples of data paths for heartbeats are shown as dashed bi-directional lines for example 112, 114, 116, and 118.
  • At least one of the active nodes 104 and 106 receives service definition from the master node 102. Service definition is further described below.
  • Examples of data paths for service definition are shown as uni-directional lines 120 and 122.
  • At least one of the passive nodes 108 and 110 or at least one of the active nodes 104 and 106 receives user state from the master node 102.
  • User state is further described below. Examples of data paths for user state are shown as unidirectional lines 124, 126, 128, and 130.
  • a single node is configured to run more than one service.
  • a single node can therefore function as a master node, active node, and passive node all at the same time. Nodes in the architecture 100 are considered equal and capable of fulfilling any of the roles.
  • Figure 2 shows an overview of the lifecycle 200 of a service suitable for replication.
  • a new service 202 runs on a node, for example a master node 102 from figure 1.
  • the replication process is initiated on detecting a new service on the master node 102. This can arise through the creation of the service on the master node. In an embodiment the replication process is initiated on detecting a new active node joining a cluster of nodes. This can arise through detecting the membership of the active node within the cluster. In an embodiment the replication process is initiated on receiving a user replication request.
  • Service replication 204 involves replicating a service hosted by a master node 102 to at least one active node 104 and/or 106.
  • Continuous state synchronisation 206 is a process by which the master node 102 keeps the service being replicated synchronized between all nodes.
  • Failure detection 208 is a process by which the death of a master node 102 is detected.
  • An active node, for example active node 104 or active node 106, becomes the new logical master node during dynamic failover 212.
  • Coordination and control 210 is a process that links all modules together, provides cluster management, and implements a state machine.
  • Cluster management determines the state of the cluster, and displays it to the human operator while managing the services and nodes.
  • Service change detection 214 may optionally be performed if the user desires to synchronise changes to the service configuration, for example in cases such as software updates or preference changes. Otherwise the updated service may be treated by the system as a new service.
  • the user submits a user replication request to initiate replication.
  • a read-write layer or current layer is monitored.
  • a service definition is marked by the system as changed. Service definition is further described below.
  • the service definition that has been marked as changed would be committed.
  • the term 'committed' includes accepted by the system as current state, and rendered immutable.
  • the changed service definition is replicated to at least one active node. Simultaneously a new mutable layer is created to accept any further changes to the service definition.
  • service replication 204 places restriction on the kind of services it is able to replicate. These services are referred to as Permanent State Services (PSS). PSS differ from other services in the way they treat their state. In a PSS, any state that is associated with a complete service transaction must be stored on at least one non-volatile storage medium. The transaction must not be considered complete unless the storage medium is able to confirm that the data has been successfully stored.
  • Any transaction in progress may be stored in volatile storage, in order to harness the speed benefits of RAM and save on the precious IO of secondary storage.
  • the replication techniques described below assume that most if not all critical state is stored in the secondary storage, and only that state needs to be replicated. State stored in RAM can be considered volatile and is ignored during the replication process.
  • a further assumption is that the service is able to recover successfully upon restart from the state contained in the secondary storage.
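  • As an illustration of the PSS write discipline described above, the following sketch (the commit_transaction helper and journal path are assumptions, not part of the described system) only acknowledges a transaction after the non-volatile medium confirms the write:

```python
import os

def commit_transaction(journal_path: str, payload: bytes) -> None:
    """Persist a completed transaction before acknowledging it (PSS discipline).

    The transaction is only treated as complete once fsync() confirms that the
    data has reached the non-volatile storage medium.
    """
    fd = os.open(journal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # block until the device confirms the write
    finally:
        os.close(fd)
    # Only now may the service report the transaction as complete to its client.
```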
  • replication guarantees the consistency of service definition during service replication 204 and recovery. In this way the application will not have to handle partial writes and can focus solely on its own transaction logic.
  • data associated to a service is referred to as compound state.
  • Compound state includes every piece of information a service requires in order to operate. This compound state is automatically separated into three categories. In an embodiment the three categories include service definition, user state and volatile state.
  • the three categories are treated differently in order to minimise computer resource utilisation and/or speed up replication.
  • Examples of computer resource utilisation includes network bandwidth, CPU and memory.
  • Examples of speeding up replication include improving metrics such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  • Each of the categories possesses unique properties that are used to distinguish between them in order to treat them separately.
  • Service definition includes data that fully describes a specific service, including service configuration data, service software and/or service operation immutable data required for service operation. The service definition does not include any information about the actual content that is of interest to the user.
  • Service definition includes static information that defines the service behaviour given corresponding inputs. Examples include machine code, software libraries, and static datasets.
  • the service definition is permitted to be modified at runtime.
  • any changes will not be replicated automatically and must be requested explicitly by the user.
  • the user does not need to interact with the service beyond its initial configuration. This allows the service to write logs, temporary file system locks or run self-modifying code without restrictions.
  • service definition is immutable for the duration of the service runtime.
  • Service definition may exist independently of both its volatile state and user state.
  • service definition may be shared between multiple services that possess different volatile state and user state.
  • Compound state includes service definition, volatile state and user state. Therefore these actions are assumed to alter volatile state and/or user state included in the compound state.
  • User state refers to the business data that the service handles on behalf of its users. It may include transactions, personal records, emails, files or anything else a specific application may choose to store. This data changes dynamically and has the highest value when it comes to disaster recovery.
  • User state includes information inserted into the system by the user through import of data or through normal operation of the service. This information cannot be derived automatically by a computer system. The information is therefore of greatest value to the end user. In an embodiment user state includes database records, digitally stored documents, files, and/or various types of media.
  • user state is highly mutable where the size of change is highly dependent on the specific service used.
  • changes to this data tend to happen in bursts.
  • For example, a user uploads a large file to the file server.
  • At other times the data changes do not occur at all, for example at night time when a user is not active.
  • the replication strategy accommodates this behaviour and carefully trades off recovery point objective with user experience.
  • user state is the remaining state of the system within a compound state once volatile state and service definition have been accounted for. Prior to service startup user state is defined as the compound state of the system minus the service definition. During the service runtime user state is carefully separated from the volatile state.
  • Volatile state includes computer derived information based on service definition and user state. This information is used to help the service perform its function. However, as it is merely a transformation or generation of information, it may be recreated given service definition, user state and appropriate procedure.
  • volatile state includes lookup tables, indexes, buffers, caches, and/or temporary variables.
  • volatile state of a service has a lifetime limited to that of the service lifetime. Volatile state is discarded once the service is terminated.
  • this property implies that volatile state must be recreated during startup and/or during service runtime.
  • startup of the service is a procedure required to recreate volatile state.
  • service replication 204 occurs at the beginning of a service lifecycle responsive to creation of the service, responsive to detecting the membership of a slave node within the cluster and/or upon explicit request from the user to replicate the service when configuration changes need to be applied.
  • Service replication only needs to copy the service software and its configuration. In most cases it may safely discard any volatile state or dynamic state such as memory contents, buffers, temporary and special file systems (/dev, /proc, /tmp).
  • Volatile state is typically stored in at least one volatile storage medium such as RAM.
  • User state is typically stored in at least one non-volatile or permanent storage medium. Examples of non-volatile storage media include block devices such as hard disks and/or tapes. It is uncommon for user state to be stored in RAM, due to the implications of data loss when power is made unavailable to memory. If user state is stored in RAM, another method to guarantee its availability must be used.
  • Embodiments of the techniques described herein are of limited value to systems in which user state is maintained in at least one volatile storage medium.
  • It is also uncommon for volatile state to be stored in at least one non-volatile storage medium.
  • One reason for such storage is bad system design.
  • it may be unclear whether or not a particular piece of state is volatile state or user state, for example system logs. Embodiments of the techniques described herein are suited to situations where volatile state is maintained in at least one non-volatile storage medium.
  • a user request is received to mark regions of non-volatile storage as volatile state.
  • a user request is received to mark regions of non-volatile storage as user state, and everything else is then determined to be volatile state.
  • what is received is a user request to mark as volatile state at least some of the data maintained in a non-volatile storage medium on which is stored both volatile state and user state. In an embodiment what is received is a user request to mark as user state at least some of the data maintained in a non-volatile storage medium on which is stored both volatile state and user state.
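  • As a sketch of how such user markings might be applied (the path prefixes and helper name are assumptions, not part of the described system), data on the non-volatile medium can be partitioned so that marked regions are skipped by replication:

```python
from pathlib import PurePosixPath

# Hypothetical user-supplied markings: files under these prefixes are treated
# as volatile state even though they live on a non-volatile storage medium.
VOLATILE_PREFIXES = [PurePosixPath(p) for p in ("/var/cache", "/var/tmp", "/var/log")]

def is_volatile(path: str) -> bool:
    """Classify a file on the persistent medium as volatile state or user state."""
    p = PurePosixPath(path)
    return any(p.is_relative_to(prefix) for prefix in VOLATILE_PREFIXES)

# Everything not marked volatile is replicated as user state.
assert is_volatile("/var/cache/index.db")        # ignored by replication
assert not is_volatile("/srv/mail/inbox.mbox")   # replicated as user state
```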
  • service definition is stored as a read-only file system image that contains necessary files for the service to operate. It is replicated once to each node in a cluster at the beginning of the service lifetime, on creation of a new node within the cluster, or on user request.
  • Service definition is stored in a first file system associated to at least one read only layer.
  • user state is stored separately on a second file system different to the first file system on which the service definition is stored. The second file system is associated to at least one read/write layer.
  • the second file system is mounted into the first file system image to form a complete view of the file system from the perspective of the service.
  • User state stored in RAM is not considered.
  • a view is presented to the service that is associated to the first file system and/or the second file system.
  • the second file system that stores user state is replicated with the help of a btrfs file system.
  • This replication may be performed for example via asynchronous atomic incremental snapshots.
  • the user state in this embodiment includes a plurality of snapshots associated to respective checkpoints.
  • the file system image in an embodiment is shadowed by an empty proxy file system that can be read from and written to.
  • This third file system is considered as a scratch space and is used to catch volatile state that is written to a nonvolatile storage medium. At least some of the volatile state maintained in at least one non-volatile medium is written to the third file system.
  • the third file system is associated to at least one read/write layer.
  • a view is presented to the service that is associated to the first file system, the second file system, and/or the third file system. In an embodiment, this file system is simply discarded once the service is stopped, and recreated anew once the service starts.
  • all file system writes that occur during service runtime will either go to user state and be replicated, or to volatile state and be discarded. This procedure permits the replication of only the user data that is necessary to be replicated, as it cannot be recreated otherwise. Volatile state can be ignored entirely.
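  • A minimal sketch of how such a layered view could be assembled on Linux follows (the directory names are illustrative assumptions, and the description does not prescribe overlayfs): the read-only service definition forms the lower layer, a discardable scratch layer catches volatile writes, and the replicated user-state subvolume is exposed inside the merged view.

```python
import subprocess

# Assumed layout: /srv/def (read-only service definition image),
# /srv/scratch (discardable RW layer for volatile writes), /srv/work
# (overlayfs work dir), /srv/user (replicated user-state subvolume),
# /srv/view (combined view presented to the service).
def mount_service_view() -> None:
    subprocess.run(
        ["mount", "-t", "overlay", "overlay",
         "-o", "lowerdir=/srv/def,upperdir=/srv/scratch,workdir=/srv/work",
         "/srv/view"],
        check=True)
    # Expose the replicated user state inside the merged view.
    subprocess.run(["mount", "--bind", "/srv/user", "/srv/view/data"], check=True)
```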
  • a layered file system approach is used to store service definition.
  • Layered file systems permit differential checkpoints and data sharing by localizing certain parts of the file system as its own layers. Additionally layered file systems allow space efficient history to be stored in the form of read only layers while behaving like any POSIX compliant file system allowing them to be freely used for most existing software.
  • Certain layered file systems are well suited for this type of work, particularly those that allow overlaying of different file system images in read only mode and provide write capability through an RW-Layer.
  • the RW-Layer works in copy-on-write fashion: any changes written to existing files in the layered file system will cause the relevant files to be copied to the RW-Layer and modified there accordingly.
  • the RW-Layer effectively represents the difference between the last checkpoint and current state if, for every checkpoint, a new layer is created and added to the layered file system.
  • the current state of RW-Layer can then be archived and replicated to passive nodes and/or active nodes.
  • state consistency of the service can be guaranteed during replication as the checkpoints can be considered as atomic transactions that do not modify remote or local state until the snapshot is finished. Hence even if the failure was to occur during the checkpointing procedure, the checkpoint in progress can be simply discarded and the most recent checkpoint can be used in its stead.
  • checkpoints comprise incremental backups of the service. There are therefore sufficient resources to store the history of checkpoints due to their relatively small size. This allows the administrator to revert to any previous state of the system that was recorded. This has the potential to help mitigate human errors, which tend to be more common than natural disasters. Additionally, incremental checkpoints also achieve efficient bandwidth utilization as only the changes are transmitted.
  • Service replication 204 includes replication of user state in addition to replication of service definition. Once changes are detected in user state, those changes are replicated to other nodes in the cluster.
  • Changes in user state are more frequent than changes in the service definition, since every transaction performed by the service typically involves a state change. Further, changes vary greatly in size, whether it is a web application writing access logs to a database, or a file server accepting file uploads.
  • a preferred solution selected for user state replication must be able to efficiently deal with changes without impacting the user experience and operate transparently to the service.
  • Secondary storage replication can be performed synchronously or asynchronously.
  • Synchronous replication has the potential to be suboptimal for commodity WAN links, as the disk bandwidth and latency will be effectively limited to those of the WAN link.
  • In asynchronous replication the writes are still replicated to both primary and secondary nodes. However, the write is acknowledged as soon as the primary confirms the disk write. In this case some state might be lost during failover; however, the disk access is no longer limited by the specifications of the WAN link.
  • the replication techniques seek to deal with frequent changes in user state and yet provide optimal user experience.
  • One example is a file system that represents its internal structures in a B+ Tree, and provides instantaneous atomic snapshots.
  • An embodiment uses a btrfs-based file system that is configured to provide atomic snapshots. It is a copy-on-write file system that operates on data blocks rather than files. Each modification can be considered a separate transaction that does not touch the original data; instead the changes are written elsewhere on the disk.
  • the user state includes at least one sequence of system calls associated to at least one data block.
  • the at least one sequence of system calls are associated to respective system calls that have been completed.
  • This stream can then be asynchronously exported and further compressed to improve bandwidth utilization when being sent over network to backup nodes, for example active nodes 104 or 106 where it will be replayed.
  • the snapshots are kept on the master node 102 and at least one of the active nodes 104, 106 to provide a user with the ability to recover the system in case of user misoperation.
  • Old snapshots can be garbage collected, as the internal data structures will be automatically altered to represent the changes asynchronously.
  • this checkpoint/replay process is repeated periodically, adjusting for the network bandwidth, to achieve asynchronous disk replication completely decoupled from the user service.
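  • One iteration of this checkpoint/replay cycle could look like the following sketch, built on the standard btrfs user-space tools over SSH (the paths, snapshot naming scheme and remote host alias are assumptions):

```python
import subprocess
import time
from typing import Optional

SUBVOL = "/srv/user"          # replicated user-state subvolume (assumed path)
SNAP_DIR = "/srv/snapshots"   # local snapshot history
REMOTE = "active-node"        # assumed SSH alias of an ACTIVE backup node

def checkpoint_and_replicate(seq: int, parent: Optional[str]) -> str:
    snap = f"{SNAP_DIR}/user@{seq}"
    # 1. Atomic, read-only snapshot of the user-state subvolume.
    subprocess.run(["btrfs", "subvolume", "snapshot", "-r", SUBVOL, snap], check=True)
    # 2. Incremental send relative to the previous checkpoint, replayed remotely.
    send_cmd = ["btrfs", "send"] + (["-p", parent] if parent else []) + [snap]
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "btrfs", "receive", SNAP_DIR],
                   stdin=sender.stdout, check=True)
    sender.wait()
    return snap

parent = None
for seq in range(3):          # in practice this loop runs for the service lifetime
    parent = checkpoint_and_replicate(seq, parent)
    time.sleep(1)             # replication period, tuned to the available bandwidth
```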
  • When a new node joins a cluster such as architecture 100, the node undergoes service replication in which the service binaries and static data are copied from the master node 102 to the newly joined backup node, for example active node 104 and/or active node 106.
  • Container technology is an example of a technique that can be used in service replication to package and describe service dependencies and data in a standardized fashion using container images.
  • Containers are considered to be volatile. Any changes occurring within the container configuration are ignored by default during failover. However they can be committed and replicated manually if the user desires to do so.
  • the service replication process includes the following: any changes within the service RW-Layer are persisted and assigned a hash identifier; the stream of data is compressed and transmitted over a network to the ACTIVE nodes; and the cluster acknowledges the new service configuration and assigns a hash identifier that will be used to recover the service upon failure.
  • Figure 3 illustrates an example of a state replication pipeline 300.
  • State replication includes change detection, atomic snapshots, snapshot transmission and validation.
  • each transaction that involves modification of file system state is represented by a uint64 transID number. This number is incremented for each operation on a specific subvolume associated with the service.
  • This value is polled using user-space btrfs utilities to determine whether there are any changes to the service definition. If changes are detected, a checkpointing procedure is triggered, creating an atomic snapshot of the file system.
  • State replication includes taking read-only snapshots of a specific subvolume at the state of the last file system transaction.
  • An example of a snapshot is shown at 302.
  • a transaction is expressed by a system call related to file access, e.g. write(), chmod().
  • Snapshots are performed asynchronously to the service transactions, meaning that the snapshot procedure will not impact the service quality beyond the disk access bandwidth required to read the data associated with the snapshot. Additionally, because the snapshot only contains data that has been modified recently, it will be maintained in RAM, for example in a page cache. This has the potential to remove the need to access the disk completely, allowing snapshots to be read from RAM, whose bandwidth is unlikely to be exceeded by a typical network application.
  • network bandwidth use is minimised by sending snapshots incrementally.
  • the stream of changes is calculated using the previous snapshot as a parent.
  • full snapshots are not sent; instead, unchanged data is pulled from existing ancestor snapshots on the receiving side.
  • file system changes are streamed using a btrfs send 304 and btrfs receive 306 mechanism that represents the changes as a variable length stream of system calls that were executed. This information is later replayed on the destination node.
  • the system calls occupy only a few bytes and are highly compressible.
  • the system calls are replayed by reading a journal in order to produce a stream thereby avoiding expensive hash based calculation to determine a delta between a remote file and a local file.
  • any existing snapshots associated to the new node are validated by hashing the contents of the subvolume. This is done to ensure that any previous data contained on the node is consistent with that of the master node 102.
  • If validation fails, the master node 102 will forcefully wipe the history associated with the active nodes 104, 106 and begin anew to ensure consistency of the data present.
  • the state replication pipeline includes compression 308 which has the potential to improve replication speed and hence increase the frequency of checkpoints, while maintaining low resource consumption in order not to impact the service quality.
  • the compression algorithm achieves reasonable compression on differential snapshot streams and supports stream compression and decompression.
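  • As an example of stream compression applied to the differential snapshot stream, an LZ4 frame compressor can wrap the send stream chunk by chunk before it crosses the WAN (a sketch using the python lz4 bindings; the chunking is an assumption):

```python
import lz4.frame  # pip install lz4

def compress_stream(chunks):
    """Compress a differential snapshot stream chunk-by-chunk (LZ4 frame format)."""
    with lz4.frame.LZ4FrameCompressor() as comp:
        yield comp.begin()
        for chunk in chunks:
            yield comp.compress(chunk)
        yield comp.flush()
```

    The receiving node would apply the matching frame decompressor before replaying the stream.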
  • Figure 4 shows an example of the continuous state synchronisation process 206 from figure 2.
  • One of the responsibilities of the master node 102 is to keep the service synchronized between all nodes, including the service configuration, containers and metadata.
  • metadata include current checkpoints for user state, current checkpoints for service definition, and a node list.
  • the node list keeps track of nodes as they enter and leave the cluster.
  • the master node 102 is usually the first node in the node list. Other nodes are appended to the node list by the master node 102 as they join the cluster.
  • the position of a node in the node list determines its priority for failover. Nodes at the beginning of the node list are considered to be high priority and the nodes at the end low priority. This makes sure that the most stable nodes, which joined the cluster early and have not left it, are considered first for important roles such as master node and active node. Nodes that often fail or suffer from connectivity issues will naturally remain at the end of the list and will only be considered in case every other node has failed.
  • any node in the cluster can query this list but only the master node 102 can ever change it. This allows for a simple way to achieve consensus between nodes.
  • Figure 5 shows an example of the failure detection process 208 from figure 2.
  • Prior to failover, a cluster includes a master node 502, active nodes 504 and 506, and passive nodes 508 and 510.
  • Failover is caused by the death of master node 502.
  • Active node 506 becomes the new logical master node 512.
  • Passive node 510 becomes a new active node 514 after failover.
  • the decision of who becomes the new logical master can be achieved via quorum, where each node must vote for the new master according to their view of the cluster and only then can the new master be elected.
  • If the number of nodes is insufficient to achieve consensus, the first node from the node list is selected to be the master.
  • any node in the cluster other than the master node 512 that fails the heartbeat is assumed to have left the cluster.
  • the master node 512 is then required to reorder the node list accordingly removing the exited node.
  • the node is then continuously monitored by the master node 512 to determine if it has come back alive, and is then allowed to rejoin the cluster, in which case the master node 512 will append the new node to the end of the node list as a passive node.
  • this structure is repeated for other services running in the cluster, allowing the cluster to run multiple independent services and efficiently utilize the resources.
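  • A sketch of how the master might maintain the priority-ordered node list as nodes fail heartbeats and rejoin (the class and method names are assumptions):

```python
class NodeList:
    """Priority-ordered node list kept by the master (earlier = more stable)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)        # master first, then nodes in join order

    def on_heartbeat_timeout(self, node):
        # A node that fails the heartbeat is assumed to have left the cluster.
        if node in self.nodes:
            self.nodes.remove(node)

    def on_rejoin(self, node):
        # Rejoining nodes are appended to the end as low-priority passive nodes.
        self.nodes.append(node)

    def failover_candidate(self):
        # The highest-priority surviving node is considered first for promotion.
        return self.nodes[0] if self.nodes else None
```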
  • Figure 6 shows an example of the coordination and control process 210 from figure 2. This process links all modules together, provides cluster management, and implements a state machine.
  • Cluster management requires secure and reliable access to the cluster. Management requires a set of protocols. Each node communicates via two channels, one for the coordination and control process 210, and another one for direct command interface and data transfers.
  • a control channel is used by the master node 102 to communicate state changes, and check node availability during a replication lifecycle.
  • the control channel is also used by a cluster management utility to determine the state of the cluster, and to display it to the human operator while managing the services and nodes.
  • every conversation is preferably encrypted and authenticated using TLS client side and server side authentication.
  • the direct command interface is used strictly by the logical master node 102 to manage active nodes 104 and 106 directly.
  • this channel is implemented using the SSH protocol, using SSH side channels to multiplex each individual communication between the master node 102 and the active nodes 104 and 106.
  • In addition to node management, the master node 102 also uses this channel to multiplex additional data transfer connections for service definition and user state replication. This ensures that all communication between the cluster nodes is encrypted and authenticated, so no sensitive information can leak as a result of replication over an untrusted network such as the internet itself.
  • cluster consensus is implemented using the secure foundation described above provided by the two communication channels.
  • Cluster consensus is via a distributed state machine for example state machine 600.
  • Every node in the cluster is initialized in state WAITING 602.
  • the node may be manually promoted to MASTER 604 to form a new cluster or to recover an existing cluster from an ambiguous state. If the cluster already has the MASTER node, then it connects to a newly added node which is in WAITING state. Because of the connection initiated by the MASTER node, the new node enters the state of CONNECTING 606 in which preliminary synchronization and cleanup occurs.
  • the MASTER node evaluates the new node for previous cluster membership and replication history. The MASTER node updates the newly added node's state if necessary.
  • the master node determines whether the node is to be considered PASSIVE 608 or ACTIVE 610 depending on the number of nodes that are currently ACTIVE. In an embodiment the user determines the number of currently active nodes, as it depends on the distribution and bandwidth limitations. A default configuration, for example, is 2 active nodes per service.
  • A node in state PASSIVE 608, ACTIVE 610, CANDIDATE 612 or MASTER 604 may fail at any point in time to DEAD 614 without compromising the service as long as at least one ACTIVE or MASTER remains alive.
  • the current master node simply updates the cluster state to reflect the change. Additionally the master node will continuously poll for the failed node to determine if it has come back online.
  • one of the ACTIVE nodes is selected as described above in relation to the service coordination and control process 210.
  • the selected ACTIVE node is self-promoted to CANDIDATE, in which state it tries to achieve cluster consensus. A majority of remaining ACTIVE and PASSIVE nodes must agree that the previous MASTER is DEAD and successfully report this to the CANDIDATE. If the CANDIDATE fails to achieve the majority consensus due to its own failure, a new CANDIDATE is re-elected. If the CANDIDATE fails to reach the majority of the nodes, or the nodes vote that the MASTER remains alive, the CANDIDATE is left to assume that a network partition has occurred and in such case will require manual intervention by the user.
  • nodes may leave the cluster, in which case they will be considered DEAD 614, and they may rejoin the cluster from the state WAITING 602. Nodes that recently joined the cluster will be placed at the end of the node list and considered low priority for failover, as their behaviour may indicate intermittent issues and thus they should be considered unreliable.
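  • The node states and the main transitions described above can be summarised in the following sketch (the transition table is a simplification of the full protocol):

```python
from enum import Enum, auto

class NodeState(Enum):
    WAITING = auto()
    CONNECTING = auto()
    PASSIVE = auto()
    ACTIVE = auto()
    CANDIDATE = auto()
    MASTER = auto()
    DEAD = auto()

# event -> (allowed source states, target state)
TRANSITIONS = {
    "manual_promote":  ({NodeState.WAITING}, NodeState.MASTER),
    "master_connects": ({NodeState.WAITING}, NodeState.CONNECTING),
    "assign_passive":  ({NodeState.CONNECTING}, NodeState.PASSIVE),
    "assign_active":   ({NodeState.CONNECTING}, NodeState.ACTIVE),
    "master_failed":   ({NodeState.ACTIVE}, NodeState.CANDIDATE),
    "quorum_reached":  ({NodeState.CANDIDATE}, NodeState.MASTER),
    "heartbeat_lost":  (set(NodeState) - {NodeState.DEAD}, NodeState.DEAD),
    "rejoin":          ({NodeState.DEAD}, NodeState.WAITING),
}

def transition(state: NodeState, event: str) -> NodeState:
    sources, target = TRANSITIONS[event]
    if state not in sources:
        raise ValueError(f"illegal transition: {state.name} on {event}")
    return target
```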
  • Figure 7 shows an example of dynamic failover 212 from figure 2.
  • the server may send unsolicited ARP to every node on the network saying that its location has now changed.
  • the clients on the network will use the new MAC address to locate the moved service.
  • DNS provides one option for dynamic failover. DNS allows application layer addresses to be translated to IP addresses.
  • DNS is universally used on the Internet and almost any client accessing a service on the Internet will use its application layer address and query the DNS Server for its IP address before the actual request occurs.
  • DNS does not provide unsolicited updates to the clients.
  • the clients themselves must request the DNS name translation.
  • the clients accessing the services through a DNS record will typically perform a DNS request for every application level request if the DNS record has expired in the client's cache.
  • Dynamic DNS is used with lowered TTL values.
  • A DDNS server is a DNS server whose entries may change dynamically, and can often be requested to change via a specific protocol.
  • Upon the detection of a failure in the system by the methods described above, the DDNS system will be notified of the failure and the new service public IP address will be provided. Any subsequent requests to the DNS server will result in the new address being returned by the DNS server.
  • In order to achieve fast failover, TTL values are set reasonably low.
  • An example range for TTL is 1-60 seconds.
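  • A sketch of the DDNS update that repoints the service record at the new master, using the dnspython dynamic update interface (the zone name, TSIG key and server address are placeholders):

```python
import dns.query
import dns.tsigkeyring
import dns.update

def repoint_service(new_master_ip: str) -> None:
    """Replace the service's A record with the new master's public IP."""
    keyring = dns.tsigkeyring.from_text({"ddns-key.": "bWFkZS11cC1rZXk="})  # placeholder key
    update = dns.update.Update("example.org.", keyring=keyring)
    update.replace("service", 30, "A", new_master_ip)  # 30 s TTL, within the 1-60 s range
    dns.query.tcp(update, "192.0.2.53", timeout=5)     # authoritative DDNS server (placeholder)
```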
  • the evaluation system included three different nodes: one in a private IaaS in Auckland, NZ; a second on physical infrastructure in Auckland, NZ, in a different building about 200 meters apart, connected via 10Gig fibre to the first node; and a third in an Amazon AWS T2.micro instance in Oregon, US.
  • a 100/100 symmetric Internet connection was supplied to the private IaaS in Auckland, NZ; this link was used for communication with the Amazon instance.
  • Recovery time objective was measured as the time required for the system to fully recover and bring up the service after a synthetically injected failure to the master node.
  • the nodes were using a bind9 DNS server deployed in the same private IaaS as node 2.
  • a timeout setting is a variable that is used to determine when a heartbeat times out. After the first heartbeat fails the system has up to N seconds to reconnect to avoid being considered a failure.
  • the timeout setting N can be changed by the user to adjust for the specific network conditions. For example, if the network is known to have high latency spikes due to congestion, a high timeout value may be used. If the network is known to be convergent and very stable, a low timeout value may be used.
  • timeout value may be further decreased as the two remaining nodes will be able to perform a quorum and decide if they both see that one node has failed.
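  • The timeout rule described above can be sketched as follows: after the first missed heartbeat, the peer has up to N seconds to respond before it is declared failed (the TCP connect is a stand-in for the real heartbeat exchange):

```python
import socket
import time

def peer_is_alive(host: str, port: int, timeout_n: float) -> bool:
    """Heartbeat check with a user-tunable timeout setting N (in seconds)."""
    deadline = time.monotonic() + timeout_n
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True          # peer answered within the window
        except OSError:
            time.sleep(0.5)          # retry until the N-second window closes
    return False                     # declared failed; failover may begin
```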
  • Figure 8 shows examples of recovery time per timeout. The effects of different timeout settings on the mean time to recovery are shown. This measurement is produced by
  • the contributors to the failover procedure are the following in order from largest to smallest: * The time required to start the service cold from the state replicated by the
  • the time required to launch a service in this case is mostly contributed by the container runtime itself, as it needs to perform various checks and prepare the environment for the container to run in.
  • This runtime can be further reduced by changing the container runtime environment and changing certain parameters such as disabling security checks.
  • the time used by the state machine to reach consensus with the remaining nodes however cannot be optimized much further as it purely depends on the latency of the connection between the remaining nodes.
  • the CANDIDATE node has to wait for the remaining nodes to acknowledge that the old MASTER is dead which may take multiple network round trips as the heartbeats are not synchronized.
  • minimum bandwidth requirement is an estimate of the link bandwidth between nodes required to sustain uncompressed replication at selected frequency using this specific workload.
  • the figure shows that the bandwidth required to replicate the service that performs kernel compilation is between 3.2 MB/s for the highest replication frequency and 2.75 MB/s for 1/32 replication frequency. This value indicates that, even without compression, the system performs more than 15 times better than virtual machine replication systems performing the same task.
  • Figure I I shows the findings of total replication size and overhead.
  • the decreasing total of snapshot sizes originate for two reasons in the replication procedure.
  • the first source is due to the reduced number of snapshots taken for slower snapshot frequencies.
  • the fixed amount of metadata appended to each snapshot by btrfs is reduced, accounting for a few bytes in size reduction.
  • the difference seen in the measurements is far too large to be accounted for only by the metadata reduction.
  • In FIG 11 is plotted the % overhead caused by multiple snapshots in comparison to a single incremental snapshot taken after the kernel compilation is finished. From this plot it can be seen that even the most frequent snapshots (1/1) perform very efficiently, with only 2% overhead in total transmission size. By decreasing the snapshot frequency the overhead can be reduced even further. It can be seen that snapshot frequencies lower than 1/8 result in a total transmission size significantly smaller than that of a single incremental snapshot. It is believed that this is caused by the btrfs algorithm not being able to optimize very large incremental snapshots for the temporal writes, hence increasing the size of the snapshot.
  • Figure 12 shows an example of write coalescing.
  • Specific files in the file system may be written to at different time points in between the snapshots. These writes may sometimes overlap, in which case only the union of changes introduced by the first write and second write will be carried over to the incremental snapshot.
  • FIG 13 shows an example of measuring CPU load for different bandwidth links.
  • the CPU load required to transmit the snapshots generated by the kernel compilation workload is measured.
  • the measured results exclude the compilation process itself and only focus on activities of the system.
  • LZ4 can sacrifice some compression ratio for a massive performance gain, allowing compression to be used even at 1000 Mbit/s link speeds, albeit at significant CPU load.
  • LZ4's performance really shines at the lower link speeds, as it can perform both compression and decompression under 4% CPU load at 100 Mbit/s.
  • a service impact test was performed to determine the extent to which the snapshot process and replication process impact the end user service.
  • replication of a kernel compilation workload was performed over a LAN between two nodes. Both user space and kernel space execution time were measured.
  • Figure 14 shows the percentage slowdown compared to replication rate. It can be observed in the figure that the most frequent replication considered by the system (1 Hz) only yields a minor 2% increase in user space time compared to that of no replication.
  • the slowdown mainly originates from the fact that btrfs must perform copy-on-write activity after the snapshot occurs. This induces additional latency and disk bandwidth on write access, in which case latency affects the compilation time the most as it requires the multi-threaded application to wait for IO.
  • the checkpointing overhead becomes less significant with the lowest replication frequency of 1/32, resulting in a 0.3% slowdown.
  • Figure 15 shows an embodiment of a suitable computing environment to implement embodiments of one or more of the systems and methods disclosed above.
  • the operating environment of Figure 15 is an example of a suitable operating environment. It is not intended to suggest any limitation as to the scope of use or functionality of the operating environment.
  • Example computing devices include, but are not limited to, personal computers, server computers, hand-held or laptop devices, mobile devices, multiprocessor systems, consumer electronics, mini computers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
  • Examples of mobile devices include mobile phones, tablets, and Personal Digital Assistants (PDAs).
  • computer readable instructions are implemented as program modules.
  • program modules include functions, objects, Application Programming Interfaces (APIs), and data structures that perform particular tasks or implement particular abstract data types.
  • functionality of the computer readable instructions is combined or distributed as desired in various environments.
  • Shown in figure 15 is a system 1500 comprising a computing device 1505 configured to implement one or more embodiments described above.
  • computing device 1505 includes at least one processing unit 1510 and memory 1515.
  • memory 1515 is volatile (such as RAM, for example), non-volatile (such as ROM, flash memory, etc., for example) or some combination of the two.
  • a server 1520 is shown by a dashed line notionally grouping processing unit 1510 and memory 1515 together.
  • computing device 1505 includes additional features and/or functionality.
  • additional storage including, but not limited to, magnetic storage and optical storage.
  • additional storage is illustrated in Figure 15 as storage 1525.
  • computer readable instructions to implement one or more embodiments provided herein are maintained in storage 1525.
  • storage 1525 stores other computer readable instructions to implement an operating system and/or an application program.
  • Computer readable instructions are loaded into memory 1515 for execution by processing unit 1510, for example.
  • the storage 1525 is an example of secondary storage within which service definition and user state are stored.
  • Memory 1515 and storage 1525 are examples of computer storage media.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 1505. Any such computer storage media may be part of device 1505.
  • computing device 1505 includes at least one communication connection 1540 that allows device 1505 to communicate with other devices.
  • the at least one communication connection 1540 includes one or more of a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting computing device 1505 to other computing devices.
  • the communication connection(s) 1540 facilitate a wired connection, a wireless connection, or a combination of wired and wireless connections.
  • Communication connection(s) 1540 transmit and/or receive communication media.
  • Communication media typically embodies computer readable instructions or other data in a "modulated data signal" such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal includes a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • device 1505 includes at least one input device 1545 such as keyboard, mouse, pen, voice input device, touch input device, infrared cameras, video input devices, and/or any other input device.
  • device 1505 includes at least one output device 1550 such as one or more displays, speakers, printers, and/or any other output device.
  • Input device(s) 1545 and output device(s) 1550 are connected to device 1505 via a wired connection, wireless connection, or any combination thereof.
  • an input device or an output device from another computing device is/are used as input device(s) 1545 or output device(s) 1550 for computing device 1505.
  • components of computing device 1505 are connected by various interconnects, such as a bus.
  • interconnects include one or more of a Peripheral Component Interconnect (PCI), such as PCI Express, a Universal Serial Bus (USB), Firewire (IEEE 1394), and an optical bus structure.
  • components of computing device 1505 are interconnected by a network.
  • memory 1515 in an embodiment comprises multiple physical memory units located in different physical locations interconnected by a network.
  • storage devices used to store computer readable instructions may be distributed across a network.
  • a computing device 1555 accessible via a network 1560 stores computer readable instructions to implement one or more embodiments provided herein.
  • Computing device 1505 accesses computing device 1555 in an embodiment and downloads a part or all of the computer readable instructions for execution.
  • computing device 1505 downloads portions of the computer readable instructions, as needed.
  • some instructions are executed at computing device 1505 a nd some at computing device 1555.
  • computing device 1505 and computing device 1555 host respective nodes. Replication of the master node to at least one active node is performed over the network 1560.
  • a client application 1585 enables a user experience and user interface.
  • the client application 1585 is provided as a thin client application configured to run within a web browser.
  • the client application 1585 is shown in figure 15 associated to computing device 1555. It will be appreciated that application 1585 in an embodiment is associated to computing device 1505 or another computing device.
  • the client application 1585 for example receives user requests directing at least one aspect of service replication and/or other aspects of the service replication lifecycle described above.
  • a service associated to a master node is replicated to one or more active nodes. Only data stored on non-volatile media is replicated. Data replicated includes service data and user state data stored in memory associated to the master node.
  • Service data may be stored in layers. There are a plurality of read-only layers created by and associated to respective checkpoints that have completed. There is a current layer for recently modified files using copy-on-write. The current layer is replicated at the beginning of each lifecycle and on user request. A new service on the master node, or a new active node joining the cluster, can initiate the replication process. The read-only layers are replicated only once on initialisation.
  • User state stored in data structure can be based on btrfs or some other product able to construct snapshots.
  • a plurality of snapshots are stored in a B+ tree. Snapshots are implemented as a stream of system calls. Data blocks and operations on the data blocks are copied-on-write on completion of operation. Once replicated, the snapshots are replayed on the active node(s). Replay can occur automatically during periodic garbage collection or at user request.
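  • The replay and clean-up behaviour described in the item above can be pictured with a short sketch. The snapshot directory, retention count, and use of the btrfs command-line utilities are assumptions for illustration, not details taken from this specification.

```python
import os
import subprocess

SNAPSHOT_DIR = "/replica/service-a"   # hypothetical location of received snapshots
KEEP_LAST = 8                         # hypothetical retention policy

def replay_stream(stream_path):
    # Replay a received incremental snapshot stream on an active node.
    with open(stream_path, "rb") as stream:
        subprocess.run(["btrfs", "receive", SNAPSHOT_DIR], stdin=stream, check=True)

def garbage_collect():
    # Drop the oldest read-only snapshots, keeping a short history so the user
    # can still roll back after misoperation.
    snapshots = sorted(os.listdir(SNAPSHOT_DIR))
    for old in snapshots[:-KEEP_LAST]:
        subprocess.run(["btrfs", "subvolume", "delete",
                        os.path.join(SNAPSHOT_DIR, old)], check=True)
```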

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An aspect of the invention provides a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node. The method comprises: responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.

Description

SERVICE REPLICATION SYSTEM AND METHOD
FIELD OF THE INVENTION
The invention relates to a method and system for replicating a service hosted by a master node to at least one active node. The invention is particularly suitable for use in virtual machine replication.
BACKGROUND OF THE INVENTION
Service availability is a critical issue for companies that rely on emerging technologies for the operation of their business. A main portal for their service must be up at all times. Demand for high availability systems is not uncommon.
Distributed database and data storage systems provide good performance and reliability in terms of hardware failure. Some solutions provide large-scale failure solutions. These technologies however mostly focus on data storage and access. The service's logic components therefore remain at risk of being exposed.
Traditional ad-hoc solutions can be implemented involving load balancers, content delivery networks and multiple points of presence. The solutions can be difficult to implement correctly and often require the service to be purpose-built in order to meet availability and scalability requirements. Virtual machine replication is one solution attempting to be more transparent and able to be generically applied to almost any service without special configuration. One goal of virtual machine replication is to address perceived difficulties involved with the design and implementation of specialised systems.
Traditional virtual machine replication solutions are only able to protect against localised disasters such as hardware failure and local network or power outages. Traditional virtual machine replication techniques are not able to satisfactorily address large-scale incidents such as natural disasters, data centre level power outages, software crashes, or user misoperation.
The above risks can be minimised by implementing high availability tier data centres with redundant network and power, security policies, and regular off-site backups.
However, these solutions can be extremely expensive, hard to control, and/or not very scalable. The performance of such virtual machine replication solutions severely degrade when used over low bandwidth and high latency links such as those typically
encountered in long distance replication over WAN and/or internet.
Virtual machine replication must absolutely guarantee the consistency of the service recovery, and cannot afford to lose any state. Losing even a single page of memory contents for example would result in the service being rendered unusable. Furthermore, traditional solutions typically rely on network buffering during the replication stage in order to guarantee a consistent view of the system during failure to its clients. Solutions with buffering are severely impacted by the conditions presented by WAN.
It is an object of at least preferred embodiments of the present invention to address some of the aforementioned disadvantages. An additional or alternative object is to at least provide the public with a useful choice.
SUMMARY OF THE INVENTION
In accordance with an aspect of the invention, a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node, comprises: responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.
The term 'comprising' as used in this specification means 'consisting at least in part of'. When interpreting each statement in this specification that includes the term
'comprising', features other than that or those prefaced by the term may also be present. Related terms such as 'comprise' and 'comprises' are to be interpreted in the same manner.
In an embodiment the user state comprises at least one sequence of system calls associated to at least one data block. In an embodiment the at least one sequence of system calls are associated to respective system calls that have been completed.
In an embodiment the user state comprises a plurality of snapshots associated to respective checkpoints. In an embodiment the first node comprises a master node and the second node comprises a slave node, the slave node comprising an active node or a passive node.
In an embodiment at least one of the service definition and the user state is maintained in at least one non-volatile storage medium.
In an embodiment the service definition includes one or more of service configuration data, service software, service operation immutable data.
In an embodiment the user state is associated to an application and includes one or more of transactions, personal records, emails, files.
In an embodiment the volatile state is maintained in at least one volatile storage medium. In an embodiment the method further comprises: maintaining the service definition in a first file system, the first file system associated to at least one read only layer; and maintaining the user state in a second file system, the second file system associated to at least one read/write layer.
In an embodiment the method further comprises presenting to the service a view associated to one or more of the first file system, the second file system.
In an embodiment the method further comprises maintaining at least some of the volatile state in at least one non-volatile storage medium.
In an embodiment the method further comprises writing at least some of the volatile state maintained in the at least one non-volatile storage medium to a third file system, the third file system associated to at least one read/write layer.
In an embodiment the method further comprises presenting to the service a view associated to one or more of the first file system, the second file system, the third file system.
In an embodiment the method further comprises one or more of receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as volatile state, receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as user state.
In an embodiment the method further comprises discarding at least some of the volatile state. In an embodiment the volatile state includes one or more of lookup tables, indexes, buffers, caches, temporary variables.
In accordance with a further aspect of the invention, a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node, comprises: responsive to one or more of detecting the creation of a service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: receiving, over a network, service definition associated to both the first node and a service on the first node, receiving, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, receiving, over a network, the at least one change in the user state.
In an embodiment the user state comprises at least one sequence of system calls associated to at least one data block. In an embodiment the at least one sequence of system calls are associated to respective system calls that have been completed.
In an embodiment the user state comprises a plurality of snapshots associated to respective checkpoints.
In an embodiment the first node comprises a master node and the second node comprises a slave node, the slave node comprising an active node or a passive node.
In an embodiment at least one of the service definition and the user state is maintained in at least one non-volatile storage medium.
In an embodiment the service definition includes one or more of service configuration data, service software, service operation immutable data. In an embodiment the user state is associated to an application and includes one or more of transactions, personal records, emails, files. In an embodiment the volatile state is maintained in at least one volatile storage medium.
In an embodiment the method further comprises recreating at least some of the volatile state. In an embodiment the method further comprises: maintaining the service definition in a first file system, the first file system associated to at least one read only layer; and maintaining the user state in a second file system, the second file system associated to at least one read/write layer.
In an embodiment the method further comprises presenting to the service a view associated to one or more of the first file system, the second file system.
In an embodiment the method further comprises maintaining at least some of the volatile state in at least one non-volatile storage medium.
In an embodiment the method further comprises writing at least some of the volatile state maintained in the at least one non-volatile storage medium to a third file system, the third file system associated to at least one read/write layer.
In an embodiment the method further comprises presenting to the service a view associated to one or more of the first file system, the second file system, the third file system.
In an embodiment the method further comprises one or more of receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as volatile state, receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as user state.
In an embodiment the method further comprises discarding at least some of the volatile state. In an embodiment the volatile state includes one or more of lookup tables, indexes, buffers, caches, temporary variables.
In accordance with a further aspect of the invention, a service replication system comprises a service hosted by a first node, the first node being a member of a cluster that includes the first node and a second node; and a processor. The processor is programmed to: responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.
In accordance with a further aspect of the invention a service replication system comprises a service hosted by a first node, the first node being a member of a cluster that includes the first node and a second node; and a processor. The processor is programmed to: responsive to one or more of detecting the creation of a service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: receiving, over a network, service definition associated to both the first node and a service on the first node, receiving, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, receiving, over a network, the at least one change in the user state.
In accordance with a further aspect of the invention, a computer-readable medium has stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node. The method comprises responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.
In accordance with a further aspect of the invention, a computer-readable medium has stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node. The method comprises responsive to one or more of detecting the creation of a service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: receiving, over a network, service definition associated to both the first node and a service on the first node, receiving, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, receiving, over a network, the at least one change in the user state.
The invention in one aspect comprises several steps. The relation of one or more of such steps with respect to each of the others, the apparatus embodying features of construction, and combinations of elements and arrangement of parts that are adapted to effect such steps, are all exemplified in the following detailed disclosure.
To those skilled in the art to which the invention relates, many changes in construction and widely differing embodiments and applications of the invention will suggest themselves without departing from the scope of the invention as defined in the appended claims. The disclosures and the descriptions herein are purely illustrative and are not intended to be in any sense limiting. Where specific integers are mentioned herein which have known equivalents in the art to which this invention relates, such known equivalents are deemed to be incorporated herein as if individually set forth. In addition, where features or aspects of the invention are described in terms of Markush groups, those persons skilled in the art will appreciate that the invention is also thereby described in terms of any individual member or subgroup of members of the Markush group.
As used herein, '(s)' following a noun means the plural and/or singular forms of the noun.
As used herein, the term 'and/or' means 'and' or 'or' or both.
It is intended that reference to a range of numbers disclosed herein (for example, 1 to 10) also incorporates reference to all rational numbers within that range (for example, 1, 1.1, 2, 3, 3.9, 4, 5, 6, 6.5, 7, 8, 9, and 10) and also any range of rational numbers within that range (for example, 2 to 8, 1.5 to 5.5, and 3.1 to 4.7) and, therefore, all sub-ranges of all ranges expressly disclosed herein are hereby expressly disclosed.
These are only examples of what is specifically intended and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this application in a similar manner. In this specification where reference has been made to patent specifications, other external documents, or other sources of information, this is generally for the purpose of providing a context for discussing the features of the invention. Unless specifically stated otherwise, reference to such external documents or such sources of information is not to be construed as an admission that such documents or such sources of information, in any jurisdiction, are prior art or form part of the common general knowledge in the art.
In the description in this specification reference may be made to subject matter which is not within the scope of the appended claims. That subject matter should be readily identifiable by a person skilled in the art and may assist in putting into practice the invention as defined in the presently appended claims.
Although the present invention is broadly as defined above, those persons skilled in the art will appreciate that the invention is not limited thereto and that the invention also includes embodiments of which the following description gives examples.
The term 'computer-readable medium' should be taken to include a single medium or multiple media. Examples of multiple media include a centralised or distributed database and/or associated caches. These multiple media store the one or more sets of computer executable instructions. The term 'computer-readable medium' should also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one or more of the methods described above. The computer-readable medium is also capable of storing, encoding or carrying data structures used by or associated with these sets of
instructions. The term 'computer-readable medium' includes solid-state memories, optical media and magnetic media.
The terms 'component', 'module', 'system', 'interface', and/or the like as used in this specification in relation to a processor are generally intended to refer to a computer- related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
The term 'connected to' as used in this specification in relation to data or signal transfer includes all direct or indirect types of communication, including wired and wireless, via a cellular network, via a data bus, or any other computer structure. It is envisaged that there may be intervening elements between the connected integers. Variants such as 'in communication with', 'joined to', and 'attached to' are to be interpreted in a similar manner. Related terms such as 'connecting' and 'in connection with' are to be interpreted in the same manner.
BRIEF DESCRIPTION OF THE DRAWINGS
Preferred forms of the method and system for replicating a service will now be described by way of example only with reference to the accompanying figures in which:
Figure 1 shows an example architecture suitable for the service replication; Figure 2 shows an overview of the lifecycle of a service suitable for replication;
Figure 3 illustrates an example of state replication;
Figure 4 shows an example of the continuous state synchronisation process from figure 2;
Figure 5 shows an example of the failure detection process from figure 2; Figure 6 shows an example of the coordination and control process from figure 2;
Figure 7 shows an example of dynamic failover from figure 2;
Figure 8 shows examples of recovery time per timeout;
Figure 9 shows an example of a recovery process profile;
Figure 10 shows results of measuring bandwidth use; Figure 11 shows the findings of total replication size and overhead;
Figure 12 shows an example of write coalescing;
Figure 13 shows an example of measuring CPU load for different bandwidth links; Figure 14 shows the percentage slowdown compared to replication rate; and Figure 15 shows a computing environment suitable for implementing the invention. Figure 1 shows an example architecture 100 suitable for the service replication techniques described below. In an embodiment the architecture 100 is configured as a geo-distributed cluster of symmetric nodes. The nodes are represented by a physical or virtual machine in the cloud or on the premises of a user. In an embodiment the nodes are configured to connect to each other and host at least one service under protection. This will typically require a globally visible IP address.
Each service is considered to be a separate logical namespace in the cluster. In an embodiment a logical namespace comprises a master node 102 and a plurality of slave nodes. Examples of slave nodes include active nodes indicated at 104 and 106, and passive nodes indicated at 108 and 110.
The master node 102 is responsible for running the service, replicating any state changes and performing heartbeat to the other nodes. Active nodes 104 and 106, and passive nodes 108 and 110, receive heartbeats from the master node 102. Active nodes 104 and 106, and passive nodes 108 and 110, must reply to heartbeats received from the master node 102.
Examples of data paths for heartbeats are shown as dashed bi-directional lines for example 112, 114, 116, and 118.
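As a rough illustration of the heartbeat exchange described above, the following sketch shows a master periodically pinging its slave nodes and treating a missed reply as a suspected failure. The transport, message format, interval, and timeout values are assumptions for illustration only, not details from this specification.

```python
import socket
import time

HEARTBEAT_INTERVAL = 1.0   # assumed seconds between heartbeats
TIMEOUT_N = 5.0            # assumed seconds of silence before a node is presumed failed

def master_heartbeat_loop(slave_addrs):
    """Master node: ping every slave and expect an echo reply within the timeout.
    Each slave runs a matching loop (not shown) that answers b"ALIVE" to every ping."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(TIMEOUT_N)
    while True:
        for addr in slave_addrs:
            try:
                sock.sendto(b"HEARTBEAT", addr)
                data, _ = sock.recvfrom(64)
                if data != b"ALIVE":
                    print(f"unexpected reply from {addr}")
            except socket.timeout:
                print(f"node {addr} missed a heartbeat; failure suspected")
        time.sleep(HEARTBEAT_INTERVAL)
```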
In an embodiment at least one of the active nodes 104 and 106 receives service definition from the master node 102. Service definition is further described below.
Examples of data paths for service definition are shown as uni-directional lines 120 and 122.
In an embodiment at least one of the passive nodes 108 and 110 or at least one of the active nodes 104 and 106 receives user state from the master node 102. User state is further described below. Examples of data paths for user state are shown as unidirectional lines 124, 126, 128, and 130.
In an embodiment, a single node is configured to run more than one service. A single node can therefore function as a master node, active node, and passive node all at the same time. Nodes in the architecture 100 are considered equal and capable of fulfilling any of the roles.
Figure 2 shows an overview of the lifecycle 200 of a service suitable for replication. A new service 202 runs on a node, for example a master node 102 from figure 1. In an embodiment the replication process is initiated on detecting a new service on the master node 102. This can arise through the creation of the service on the master node. In an embodiment the replication process is initiated on detecting a new active node joining a cluster of nodes. This can arise through detecting the membership of the active node within the cluster. In an embodiment the replication process is initiated on receiving a user replication request.
Service replication 204 involves replicating a service hosted by a master node 102 to at least one active node 104 and/or 106. Continuous state synchronisation 206 is a process by which the master node 102 keeps the service being replicated synchronized between all nodes. Failure detection 208 is a process by which the death of a master node 102 is detected. An active node, for example active node 104 or active node 106, becomes the new logical master node during dynamic failover 212.
Coordination and control 210 is a process that links all modules together, provides cluster management, and implements a state machine. The master node 102 communicates state changes, and checks node availability during a replication lifecycle. Cluster management determines the state of the cluster, and displays it to the human operator while managing the services and nodes.
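The coordination state machine is not specified in detail at this point; the following minimal sketch only illustrates the kind of role transition involved when a master failure is detected, using the MASTER/ACTIVE/PASSIVE/CANDIDATE terminology that appears elsewhere in this description. The function and its quorum flag are hypothetical.

```python
from enum import Enum

class Role(Enum):
    MASTER = "master"
    ACTIVE = "active"       # holds service definition and user state, can take over
    PASSIVE = "passive"     # holds user state only
    CANDIDATE = "candidate"

def on_master_failure(role, peers_agree_master_dead):
    # An ACTIVE node becomes CANDIDATE when the master's heartbeats time out,
    # and is promoted only once the remaining nodes agree the old MASTER is dead.
    if role is Role.ACTIVE:
        role = Role.CANDIDATE
    if role is Role.CANDIDATE and peers_agree_master_dead:
        role = Role.MASTER
    return role
```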
Service change detection 214 may optionally be performed if the user desires to synchronise changes to the service configuration, for example in cases such as software updates or preference changes. Otherwise the updated service may be treated by the system as a new service. The user submits a user replication request to initiate replication.
In order to perform service change detection a read-write layer or current layer is monitored. Upon receipt of a change notification a service definition is marked by the system as changed. Service definition is further described below.
Optionally, on a periodic basis or on user request, the service definition that has been marked as changed is committed. The term 'committed' includes accepted by the system as current state, and rendered immutable. The changed service definition is replicated to at least one active node. Simultaneously a new mutable layer is created to accept any further changes to the service definition.
The components shown in figure 2 are further described below.
Service replication
In an embodiment, service replication 204 places restrictions on the kind of services it is able to replicate. These services are referred to as Permanent State Services (PSS). PSS differ from other services in the way they treat their state. In PSS any state that is associated with a complete service transaction must be stored on at least one non-volatile storage medium. The transaction must not be considered complete unless the storage medium is able to confirm that the data has been successfully stored.
Any transaction in progress may be stored in volatile storage, in order to harness the speed benefits of RAM and save on the precious IO of secondary storage. The replication techniques described below assume that most if not all critical state is stored in the secondary storage, and only that state needs to be replicated. State stored in RAM can be considered volatile and is ignored during the replication process.
A further assumption is that the service is able to recover successfully upon restart from the state contained in the secondary storage. In turn replication guarantees the consistency of service definition during service replication 204 and recovery. In this way the application will not have to handle partial writes and can focus solely on its own transaction logic.
A majority of applications can be replicated using the techniques described below without any modification. As an example the most common method of storing application data is through a database. Most traditional database systems already follow this pattern. They also employ consistency and cache coherency mechanisms to ensure that the data is written to disk before acknowledging any database operations.
In an embodiment, data associated to a service is referred to as compound state.
Compound state includes every piece of information a service requires in order to operate. This compound state is automatically separated into three categories. In an embodiment the three categories include service definition, user state and volatile state.
The three categories are treated differently in order to minimise computer resource utilisation and/or speed up replication. Examples of computer resource utilisation include network bandwidth, CPU and memory. Examples of speeding up replication include improving metrics such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Each of the categories possesses unique properties that are used to distinguish between them in order to treat them separately. Service definition includes data that fully describes a specific service, including service configuration data, service software and/or service operation immutable data required for service operation. The service definition does not include any information about the actual content that is of interest to the user. Service definition includes static information that defines the service behaviour given corresponding inputs. Examples include machine code, software libraries, and static datasets.
In an embodiment the service definition is permitted to be modified at runtime.
However, any changes will not be replicated automatically and must be requested explicitly by the user. Typically the user does not need to interact with the service beyond its initial configuration. This allows the service to write logs, temporary file system locks or run self-modifying code without restrictions.
This behaviour has the potential to yield the data forming the service definition itself stateless, allowing it to be replicated only once and during start-up. The remaining contents of the secondary storage is then considered to be user state. In an embodiment, service definition is immutable for the duration of the service runtime. Service definition may exist independently of both its volatile state and user state. Furthermore, service definition may be shared between multiple services that possess different volatile state and user state.
Once a service has been started it may perform actions that would alter the associated compound state. These actions do not alter service definition. Compound state includes service definition, volatile state and user state. Therefore these actions are assumed to alter volatile state and/or user state included in the compound state.
User state refers to the business data that the service handles on behalf of its users. It may include transactions, personal records, emails, files or anything else a specific application may choose to store. This data changes dynamically and has the highest value when it comes to disaster recovery.
User state includes information inserted into the system by the user through import of data or through normal operation of the service. This information cannot be derived automatically by a computer system. The information is therefore of greatest value to the end user. In an embodiment user state includes database records, digitally stored documents, files, and/or various types of media.
Unlike the service data, user state is highly mutable, and the size of change is highly dependent on the specific service used. However, due to the nature of this data, changes tend to happen in bursts. One example is where a user uploads a large file to the file server. Sometimes the data changes do not occur at all, for example at night time when a user is not active.
In an embodiment the replication strategy accommodates this behaviour and carefully trades off recovery point objective against user experience. In an embodiment, user state is the remaining state of the system within a compound state once volatile state and service definition have been accounted for. Prior to service startup user state is defined as the compound state of the system minus the service definition. During the service runtime user state is carefully separated from the volatile state.
Volatile state includes computer derived information based on service definition and user state. This information is used to help the service perform its function. However, as it is merely a transformation or generation of information, it may be recreated given service definition, user state and appropriate procedure. In an embodiment volatile state includes lookup tables, indexes, buffers, caches, and/or temporary variables.
In an embodiment, volatile state of a service has a lifetime limited to that of the service lifetime. Volatile state is discarded once the service is terminated. For a service that is capable of being stopped and started, this property implies that volatile state must be recreated during startup and/or during service runtime. As such, for a service that starts automatically, without human intervention, it may be considered that startup of the service is a procedure required to recreate volatile state. In an embodiment service replication 204 occurs at the beginning of a service lifecycle responsive to creation of the service, responsive to detecting the membership of a slave node within the cluster and/or upon explicit request from the user to replicate the service when configuration changes need to be applied.
Service replication only needs to copy the service software and its configuration. In most cases it may safely discard any volatile state or dynamic state such as memory contents, buffers, temporary and special file systems (/dev, /proc, /tmp).
In an embodiment, a distinction is made between the service definition, volatile state and user state in secondary storage by mapping the relevant locations in the file system to different volumes which can be treated as independent data units for the replication. Volatile state is typically stored in at least one volatile storage medium such as RAM. User state is typically stored in at least one non-volatile or permanent storage medium. Examples of non-volatile storage mediums include block devices such as hard disks and/or tapes. It is uncommon for user state to be stored in RAM, due to the implications of data loss when power is made unavailable to memory. If user state is stored in RAM, another method to guarantee its availability must be used. Embodiments of the techniques described herein are of limited value to systems in which user state is maintained in at least one volatile storage medium.
It is also uncommon for volatile state to be stored in at least one non-volatile storage medium. One reason for such storage is bad system design. In other cases it may be unclear whether or not a particular piece of state is volatile state or user state, for example system logs. Embodiments of the techniques described herein are suited to situations where volatile state is maintained in at least one non-volatile storage medium. In an embodiment a user request is received to mark regions of non-volatile storage as volatile state. In an embodiment a user request is received to mark regions of non-volatile storage as user state, and everything else is then determined to be volatile state. In an embodiment, what is received is a user request to mark as volatile state at least some of the data maintained in a non-volatile storage medium on which is stored both volatile state and user state. In an embodiment what is received is a user request to mark as user state at least some of the data maintained in a non-volatile storage medium on which is stored both volatile state and user state. In an embodiment, service definition is stored as a read-only file system image that contains necessary files for the service to operate. It is replicated once to each node in a cluster at the beginning of the service lifetime, on creation of a new node within the cluster, or on user request. Service definition is stored in a first file system associated to at least one read only layer. In an embodiment, user state is stored separately on a second file system different to the first file system on which the service definition is stored. The second file system is associated to at least one read/write layer.
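One plausible way a user could express the marking requests described in the preceding paragraph is a simple path-prefix table, as in the sketch below. The file paths and the convention that unmarked data defaults to user state are assumptions chosen for illustration, not details of the system itself.

```python
# Hypothetical declaration of which regions of non-volatile storage hold user
# state and which hold volatile state; paths are examples only.
STATE_MARKING = {
    "user_state":     ["/srv/app/data", "/srv/app/uploads"],
    "volatile_state": ["/srv/app/cache", "/var/log/app"],
}

def classify(path):
    # Return the replication category for a path; anything not explicitly marked
    # as volatile is treated as user state (one of the conventions described above).
    for category, prefixes in STATE_MARKING.items():
        if any(path.startswith(p) for p in prefixes):
            return category
    return "user_state"
```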
The second file system is mounted into the first file system image to form a complete view of the file system from the perspective of the service. User state stored in RAM is not considered. A view is presented to the service that is associated to the first file system and/or the second file system.
The second file system that stores user state is replicated with the help of a btrfs file system. This replication may be performed for example via asynchronous atomic incremental snapshots. The user state in this embodiment includes a plurality of snapshots associated to respective checkpoints.
Once a service has been started the file system image in an embodiment is shadowed by an empty proxy file system that can be read from and written to. This third file system is considered as a scratch space and is used to catch volatile state that is written to a non-volatile storage medium. At least some of the volatile state maintained in at least one non-volatile medium is written to the third file system.
The third file system is associated to at least one read/write layer. A view is presented to the service that is associated to the first file system, the second file system, and/or the third file system. In an embodiment, this file system is simply discarded once the service is stopped, and recreated anew once the service starts.
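One plausible way to assemble the three-layer view described above is with overlayfs plus a bind mount, as in the following sketch. The directory layout, the mount point for user state, and the choice of overlayfs are assumptions made for illustration rather than the specific mechanism used by the system.

```python
import subprocess

def assemble_service_view(service_def_dir, user_state_subvol, scratch_dir, work_dir, merged_dir):
    # Read-only service definition as the lower layer; the discardable scratch
    # space as the writable upper layer that catches volatile writes.
    subprocess.run([
        "mount", "-t", "overlay", "overlay",
        "-o", f"lowerdir={service_def_dir},upperdir={scratch_dir},workdir={work_dir}",
        merged_dir,
    ], check=True)
    # The replicated btrfs user-state subvolume is bind-mounted into the merged
    # view at the path where the service expects its persistent data (assumed /data).
    subprocess.run(["mount", "--bind", user_state_subvol, f"{merged_dir}/data"], check=True)
```

When the service stops, the scratch upper layer can simply be deleted and recreated, mirroring the discard-and-recreate behaviour described above.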
In an embodiment, all file system writes that occur during service runtime will either go to user state and be replicated, or to volatile state and be discarded. This procedure permits the replication of only the user data that is necessary to be replicated as it cannot be recreated otherwise. Volatile state can be ignored entirely.
When coupled with compression and incremental replication there is the potential to significantly reduce replication bandwidth. Furthermore, user state changes slowly compared to volatile state. Therefore there is the potential to require fewer resources to perform the replication procedure. Service definition is a special case whose changes, while permitted, should not be considered permanent unless explicitly persisted. Additionally, multiple services may use the same dynamic libraries, operating system utilities and other information. This redundancy can be exploited to further reduce replication requirements in large deployments such as cloud providers or datacenters where many services are replicated simultaneously.
In an embodiment, a layered file system approach is used to store service definition. Layered file systems permit differential checkpoints and data sharing by localizing certain parts of the file system as its own layers. Additionally layered file systems allow space efficient history to be stored in the form of read only layers while behaving like any POSIX compliant file system allowing them to be freely used for most existing software.
Certain layered file systems are well suited for this type of work, particularly those that allow overlaying of different file system images in read only mode and provide write capability through an RW-Layer. The RW-Layer works in copy-on-write fashion: any changes written to existing files in the layered file system will cause the relevant files to be copied to the RW-Layer and modified there accordingly.
The RW-Layer effectively represents the difference between the last checkpoint and current state if, for every checkpoint, a new layer is created and added to the layered file system. The current state of RW-Layer can then be archived and replicated to passive nodes and/or active nodes.
Using this technique to replicate the service definition has the potential to provide advantages over existing techniques.
For example, state consistency of the service can be guaranteed during replication as the checkpoints can be considered as atomic transactions that do not modify remote or local state until the snapshot is finished. Hence even if the failure was to occur during the checkpointing procedure, the checkpoint in progress can be simply discarded and the most recent checkpoint can be used in its stead.
A further advantage arises from the fact that the checkpoints comprise incremental backups of the service. There are therefore sufficient resources to store the history of checkpoints due to their relatively small size. This allows the administrator to revert to any previous state of the system that was recorded. This has the potential to help mitigate human errors, which tend to be more common than natural disasters. Additionally incremental checkpoints also achieve efficient bandwidth utilization as only the changes are transmitted.
Service replication 204 includes replication of user state in addition to replication of service definition. Once changes are detected in user state, those changes are replicated to other nodes in the cluster.
Changes in user state are more frequent than in static service definition, since every transaction performed by the service typically involves a state change. Further, changes greatly vary in size, whether it is a web application writing access logs to a database, or a file server accepting file uploads.
Frequent changes in user state result in a large number of layers that can slow down filesystem access. This performance drop is caused by the method that layered filesystems use to represent changes, stacking multiple layers on top of one another. This requires each write operation to traverse the stack to figure out at which layer the original data is located, which results in higher latency disk access. Additionally, layered filesystems copy the entire file to the RW-Layer. While this is reasonable for small text files such as a configuration file, copying a multi-gigabyte binary database for the sake of changing a few bytes is extremely wasteful.
Therefore layered filesystems are not suitable for the access patterns seen in the user state. A preferred solution selected for user state replication must be able to efficiently deal with changes without impacting the user experience and operate transparently to the service.
Secondary storage replication can be performed synchronously or asynchronously.
In synchronous replication every write is sent to both primary and backup nodes. Both nodes must acknowledge the write in order for the system to return control back to the application. This behaviour ensures that 100% state consistency is maintained during failure but sacrifices availability.
Synchronous replication has the potential to be suboptimal for commodity WAN links, as disk bandwidth and latency will be effectively limited to those of the WAN link. In asynchronous replication, the writes are still replicated to both primary and secondary nodes. However, the write is acknowledged as soon as the primary confirms the disk write. In this case some state might be lost during failover, however the disk access is no longer limited by the specifications of the WAN link.
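The difference between the two acknowledgement disciplines can be summarised in a few lines of illustrative Python; the Disk class is a stand-in for a node's storage and the whole sketch is purely schematic, not part of the specification.

```python
import queue

class Disk:
    """Stand-in for a node's non-volatile storage (illustrative only)."""
    def __init__(self):
        self.blocks = []
    def write(self, block):
        self.blocks.append(block)

def write_sync(block, primary: Disk, backup: Disk):
    # Synchronous: both nodes must persist the block before the application
    # regains control, so effective disk latency includes the WAN round trip.
    primary.write(block)
    backup.write(block)
    return "ack"

def write_async(block, primary: Disk, backlog: queue.Queue):
    # Asynchronous: acknowledge once the primary persists the block; the backlog
    # is drained to the backup later, so some state may be lost on failover.
    primary.write(block)
    backlog.put(block)
    return "ack"
```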
In an embodiment, the replication techniques seek to deal with frequent changes in user state and yet provide optimal user experience. One example is a filesystem that represents its internal structures in a B+ Tree, and provides instantaneous atomic snapshots.
One example is a btrfs-based filesystem that is configured to provide atomic snapshots. It is a copy-on-write file system that operates on data blocks rather than files. Each modification can be considered a separate transaction that does not touch the original data; instead the changes are written elsewhere on the disk.
Due to the fact that btrfs snapshots are taken atomically, they can be performed live while the file system is mounted and being accessed. Any file system transaction that has been accepted by the filesystem will be in the snapshot and any current transaction will not be included in the snapshot. This property has the potential to ensure that software that specifically relies on the filesystem to report the state of its writes being recorded to the non-volatile media can function correctly.
Additionally, btrfs-based systems have the potential to provide an efficient mechanism to track transactions and represent the difference between two snapshots as a stream of system calls that produced the changes. In an embodiment the user state includes at least one sequence of system calls associated to at least one data block. In this way the at least one sequence of system calls are associated to respective system calls that have been completed. This stream can then be asynchronously exported and further compressed to improve bandwidth utilization when being sent over the network to backup nodes, for example active nodes 104 or 106 where it will be replayed.
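A minimal sketch of the export, compress and replay pipeline described above, assuming the standard btrfs send/receive utilities, LZ4 as the compressor (in keeping with the measurements reported earlier), and SSH as the transport; the remote host and paths are placeholders rather than details from the specification.

```python
import subprocess

def replicate_increment(parent_snap, new_snap, remote_host, remote_dir):
    # Ship the difference between two read-only btrfs snapshots to a backup node,
    # compressed with LZ4, where it is replayed with btrfs receive.
    send = subprocess.Popen(["btrfs", "send", "-p", parent_snap, new_snap],
                            stdout=subprocess.PIPE)
    compress = subprocess.Popen(["lz4", "-c"], stdin=send.stdout,
                                stdout=subprocess.PIPE)
    send.stdout.close()       # allow btrfs send to see a broken pipe if lz4 exits
    replay = subprocess.Popen(
        ["ssh", remote_host, f"lz4 -d -c | btrfs receive {remote_dir}"],
        stdin=compress.stdout)
    compress.stdout.close()
    replay.wait()
```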
In an embodiment the snapshots are kept on the master node 102 and at least one of the active nodes 104, 106 to provide a user with the ability to recover the system in case of user misoperation. Old snapshots can be garbage collected, as the internal data structures will be automatically altered to represent the changes asynchronously.
In an embodiment this checkpoint/replay process is repeated periodically, adjusting for the network bandwidth, to achieve asynchronous disk replication completely decoupled from the user service. Upon joining a cluster, such as architecture 100, a node undergoes service replication in which the service binaries and static data are copied from the master node 102 to the newly joined backup node, for example active node 104 and/or active node 106.
Container technology is an example of a technique that can be used in service replication to package and describe service dependencies and data in a standardized fashion using container images.
Containers are considered to be volatile. Any changes occurring within the container configuration are ignored by default during failover. However, they can be committed and replicated manually if the user desires to do so.
In an embodiment, the service replication process includes the following (a sketch of these steps appears after this list):
* Any changes within the service RW-Layer are persisted and assigned a hash identifier;
* A new RW-Layer is created for the service to continue operation;
* The committed RW-Layer is read from the file system and piped into the replication system;
* The stream of data is compressed and transmitted over a network to the ACTIVE nodes;
* The cluster acknowledges the new service configuration and assigns a hash identifier that will be used to recover the service upon failure; and
* Any outdated service images are garbage collected as per user configuration.
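A minimal sketch of these steps is given below in Go, shelling out to the standard docker commit, docker save and docker load CLI verbs; the container name, image tag, host name and the use of gzip over ssh are illustrative assumptions rather than details of the actual implementation.

// servicereplicate.go - sketch: persist the container RW-Layer as a new
// image layer and stream it, compressed, to an ACTIVE node.
package main

import (
    "log"
    "os/exec"
)

func replicateService(container, tag, node string) error {
    // Commit the RW-Layer; the container engine assigns the committed layer a
    // content hash that the cluster can later use to recover the service.
    if err := exec.Command("docker", "commit", container, tag).Run(); err != nil {
        return err
    }
    // Read the committed image, compress the stream and load it on the
    // ACTIVE node over an encrypted channel.
    pipeline := "docker save " + tag + " | gzip -c | ssh " + node + " 'gzip -d | docker load'"
    return exec.Command("sh", "-c", pipeline).Run()
}

func main() {
    if err := replicateService("web-service", "web-service:checkpoint-42", "active-node-1"); err != nil {
        log.Fatal(err)
    }
}

Garbage collection of outdated images would then remove tags older than the user-configured retention window.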
Figure 3 illustrates an example of a state replication pipeline 300. State replication includes change detection, atomic snapshots, snapshot transmission and validation.
In a btrfs-based filesystem, each transaction that involves modification of file system state is represented by a uint64 transID number. This number is incremented for each operation on a specific subvolume associated with the service.
This value is polled using user-space btrfs utilities to determine whether there are any changes to the service definition. If changes are detected then a checkpointing procedure is triggered, creating an atomic snapshot of the file system.
If changes are not detected no action is performed, which is often the case for write-once-read-many services such as file servers or websites.
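A minimal sketch of this polling step is given below in Go. The use of the user-space command btrfs subvolume find-new, and the parsing of its trailing "transid marker was N" line, are assumptions about one possible tooling choice rather than a description of the actual implementation; the subvolume path and polling interval are likewise illustrative.

// changedetect.go - sketch: poll the btrfs generation (transID) of a
// service subvolume and trigger a checkpoint when it advances.
package main

import (
    "fmt"
    "os/exec"
    "strconv"
    "strings"
    "time"
)

// currentGeneration returns the latest transID recorded for the subvolume by
// parsing the "transid marker was N" line printed by find-new (an assumption).
func currentGeneration(subvol string) (uint64, error) {
    out, err := exec.Command("btrfs", "subvolume", "find-new", subvol, "99999999").Output()
    if err != nil {
        return 0, err
    }
    fields := strings.Fields(strings.TrimSpace(string(out)))
    if len(fields) == 0 {
        return 0, fmt.Errorf("unexpected find-new output")
    }
    return strconv.ParseUint(fields[len(fields)-1], 10, 64)
}

func main() {
    subvol := "/srv/service-state" // hypothetical subvolume path
    last, _ := currentGeneration(subvol)
    for range time.Tick(5 * time.Second) {
        gen, err := currentGeneration(subvol)
        if err != nil {
            continue // transient tooling error; retry on the next tick
        }
        if gen > last {
            fmt.Printf("changes detected (transID %d -> %d), triggering checkpoint\n", last, gen)
            last = gen
            // the atomic snapshot procedure would be invoked here
        }
    }
}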
State replication includes taking read-only snapshots of a specific subvolume at the state of the last file system transaction. An example of a snapshot is shown at 302. A transaction is expressed by a system call related to file access, i.e. write(), chmod().
Snapshots are performed asynchronously to the service transactions, meaning that the snapshot procedure will not impact the service quality beyond the disk access bandwidth required to read the data associated with the snapshot. Additionally, because the snapshot only contains data that has been modified recently, it will be maintained in RAM, for example in a page cache. This has the potential to remove the need to access the disk completely, allowing snapshots to be read from RAM, whose bandwidth is unlikely to be exceeded by a typical network application.
In an embodiment, network bandwidth use is minimised by sending snapshots incrementally. The stream of changes is calculated using the previous snapshot as a reference. Additionally, any data blocks that are present in any of the ancestor snapshots are not sent and instead are pulled from existing ancestor snapshots on the receiving side.
In an embodiment, file system changes are streamed using a btrfs send 304 and btrfs receive 306 mechanism that represents the changes as a variable-length stream of the system calls that were executed. This information is later replayed on the destination node. In an embodiment the system calls occupy only a few bytes and are highly compressible. Furthermore, the system calls are replayed by reading a journal in order to produce a stream, thereby avoiding expensive hash-based calculation to determine a delta between a remote file and a local file.
This has the potential to avoid additional CPU cycles and reduce required network bandwidth.
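The following Go sketch illustrates one way the incremental stream could be produced, compressed and replayed on a backup node, equivalent to the shell pipeline btrfs send -p <parent> <snapshot> | lz4 | ssh <node> 'lz4 -d | btrfs receive <dir>'. The snapshot paths, host name, and the choice of lz4 and ssh as the compression and transport mechanisms are illustrative assumptions, not a statement of the actual implementation.

// replicate.go - sketch of the incremental state replication pipeline.
package main

import (
    "log"
    "os/exec"
)

func replicateIncremental(parent, snap, host, destDir string) error {
    send := exec.Command("btrfs", "send", "-p", parent, snap)
    comp := exec.Command("lz4", "-c")
    recv := exec.Command("ssh", host, "lz4 -dc | btrfs receive "+destDir)

    var err error
    if comp.Stdin, err = send.StdoutPipe(); err != nil {
        return err
    }
    if recv.Stdin, err = comp.StdoutPipe(); err != nil {
        return err
    }
    for _, c := range []*exec.Cmd{send, comp, recv} {
        if err := c.Start(); err != nil {
            return err
        }
    }
    // Wait on downstream consumers first so each stdout pipe is fully drained
    // before Wait closes it on the producing side.
    for _, c := range []*exec.Cmd{recv, comp, send} {
        if err := c.Wait(); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    // Hypothetical snapshot paths on the master node's btrfs volume.
    if err := replicateIncremental("/srv/.snapshots/state@41",
        "/srv/.snapshots/state@42", "active-node-1", "/srv/.snapshots"); err != nil {
        log.Fatal(err)
    }
}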
When a node joins the cluster, for example architecture 100, any existing snapshots associated to the new node are validated by hashing the contents of the subvolume. This is done to ensure that any previous data contained in the node is consistent with that of the master node 102.
If the validation detects any inconsistencies between the snapshots located on the slave nodes (for example active nodes 104, 106) and those of the master node 102, the master node 102 will forcefully wipe the history associated to the active nodes 104, 106 and begin anew to ensure consistency of the data present.
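A sketch of the validation step is shown below; the walk-and-hash approach over the snapshot's files, and the comparison against a digest obtained from the master, are illustrative assumptions about how consistency could be checked rather than the mechanism actually used.

// validate.go - sketch: hash the contents of a local snapshot so it can be
// compared against the master's digest when a node (re)joins the cluster.
package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "io/fs"
    "os"
    "path/filepath"
    "sort"
)

// hashSubvolume returns a digest over file paths and contents beneath root.
func hashSubvolume(root string) (string, error) {
    var files []string
    err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
        if err == nil && !d.IsDir() {
            files = append(files, p)
        }
        return err
    })
    if err != nil {
        return "", err
    }
    sort.Strings(files) // deterministic order for a stable digest
    h := sha256.New()
    for _, p := range files {
        f, err := os.Open(p)
        if err != nil {
            return "", err
        }
        h.Write([]byte(p))
        if _, err := io.Copy(h, f); err != nil {
            f.Close()
            return "", err
        }
        f.Close()
    }
    return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
    digest, err := hashSubvolume("/srv/.snapshots/state@42") // hypothetical path
    if err != nil {
        fmt.Println("validation failed:", err)
        return
    }
    fmt.Println("local snapshot digest:", digest)
    // If this digest differs from the master's, the local history is wiped
    // and replication begins anew, as described above.
}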
In an embodiment the state replication pipeline includes compression 308 which has the potential to improve replication speed and hence increase the frequency of checkpoints, while maintaining low resource consumption in order not to impact the service quality.
In an embodiment, the compression algorithm achieves reasonable compression on differential snapshot streams and supports stream compression and decompression.
Figure 4 shows an example of the continuous state synchronisation process 206 from figure 2. One of the responsibilities of the master node 102 is to keep the service configuration synchronized between all nodes; the service configuration includes containers and metadata. Examples of metadata include current checkpoints for user state, current checkpoints for service definition, and a node list.
In an embodiment the node list keeps track of nodes as they enter and leave the cluster. The master node 102 is usually the first node in the node list. Other nodes are appended to the node list by the master node 102 as they join the cluster.
The position of a node in the node list determines its priority for failover. Nodes at the beginning of the node list are considered to be high priority and the nodes at the end low priority. This makes sure that the most stable nodes, which have joined the cluster early and have not left it, are considered first for important roles such as master node and active node. Nodes that often fail or suffer from connectivity issues will naturally remain at the end of the list and will only be considered in case every other node has failed.
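An illustrative sketch of the node list and its priority ordering follows; the type and method names are hypothetical and do not come from the actual implementation.

// nodelist.go - sketch of the master's node list, ordered by failover
// priority: the earliest, most stable members first, recent joiners last.
package main

import "fmt"

type Node struct {
    Addr string
    Role string // "MASTER", "ACTIVE" or "PASSIVE"
}

// NodeList is ordered by priority: index 0 is the most trusted node.
type NodeList []Node

// Join appends a newly joined (or rejoined) node at the lowest priority.
func (l NodeList) Join(addr string) NodeList {
    return append(l, Node{Addr: addr, Role: "PASSIVE"})
}

// Leave removes a node that failed the heartbeat and left the cluster.
func (l NodeList) Leave(addr string) NodeList {
    out := make(NodeList, 0, len(l))
    for _, n := range l {
        if n.Addr != addr {
            out = append(out, n)
        }
    }
    return out
}

// NextActive returns the highest-priority PASSIVE node to promote when an
// ACTIVE node fails.
func (l NodeList) NextActive() (Node, bool) {
    for _, n := range l {
        if n.Role == "PASSIVE" {
            return n, true
        }
    }
    return Node{}, false
}

func main() {
    list := NodeList{{"node-a", "MASTER"}, {"node-b", "ACTIVE"}, {"node-c", "PASSIVE"}}
    list = list.Leave("node-b")
    if n, ok := list.NextActive(); ok {
        fmt.Println("promote", n.Addr, "to ACTIVE")
    }
    list = list.Join("node-b") // a recovered node rejoins at the end of the list
    fmt.Println(list)
}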
In an embodiment, any node in the cluster can query this list but only the master node 102 can ever change it. This allows for a simple way to achieve consensus between nodes.

Figure 5 shows an example of the failure detection process 208 from figure 2. Prior to failover a cluster includes a master node 502, active nodes 504 and 506, and passive nodes 508 and 510.
Failover is caused by the death of master node 502. Active node 506 becomes the new logical master node 512. Passive node 510 becomes a new active node 514 after failover.
In order to achieve network partition tolerance, the decision of who becomes the new logical master can be achieved via quorum, where each node must vote for the new master according to their view of the cluster, and only then can the new master be elected. In case the number of nodes is insufficient to achieve consensus, the first node from the node list is selected to be the master.
After failover, any node in the cluster other than the master node 512 that fails the heartbeat is assumed to have left the cluster. The master node 512 is then required to reorder the node list accordingly, removing the exited node. The node is then continuously monitored by the master node 512 to determine if it has come back alive, and is then allowed to rejoin the cluster, in which case the master node 512 will append the new node to the end of the node list as a passive node.
In an embodiment, this structure is repeated for other services running in the cluster, allowing the cluster to run multiple independent services and efficiently utilize the resources. For N equal nodes in the cluster designed for up to k failures, the effective resource available is r=N-k. If the number of failed nodes exceeds k, services could be affected by resource starvation, however the cluster will continue to be operational even if only a single node remains.

Figure 6 shows an example of the coordination and control process 210 from figure 2. This process links all modules together, and provides cluster management and a state machine.
Cluster management requires secure and reliable access to the cluster. Management requires a set of protocols. Each node communicates via two channels, one for the coordination and control process 210, and another one for direct command interface and data transfers.
A control channel is used by the master node 102 to communicate state changes, and check node availability during a replication lifecycle. The control channel is also used by a cluster management utility to determine the state of the cluster, and to display it to the human operator while managing the services and nodes.
In order to prevent rogue nodes intercepting or listening on cluster communication, every conversation is preferably encrypted and authenticated using TLS client-side and server-side authentication. The direct command interface is used strictly by the logical master node 102 to manage active nodes 104 and 106 directly. In an embodiment this channel is implemented using an SSH protocol, using SSH side channels to multiplex each individual communication between the master node 102 and the active nodes 104 and 106.
In addition to node management, the master node 102 also uses this channel to multiplex additional data transfer connections for service definition and user state replication. This ensures that all communication between the cluster nodes is encrypted and authenticated, so no sensitive information can leak as a result of replication over an untrusted network such as the Internet itself.
In an embodiment, cluster consensus is implemented using the secure foundation described above, provided by the two communication channels. Cluster consensus is via a distributed state machine, for example state machine 600.
Every node in the cluster is initialized in state WAITING 602. The node may be manually promoted to MASTER 604 to form a new cluster or to recover an existing cluster from an ambiguous state. If the cluster already has the MASTER node, then it connects to a newly added node which is in WAITING state. Because of the connection initiated by the MASTER node, the new node enters the state of CONNECTING 606 in which preliminary synchronization and cleanup occurs. The MASTER node evaluates the new node for previous cluster membership and replication history. The MASTER node updates the newly added node's state if necessary. Additionally, the master node determines whether the node is to be considered PASSIVE 608 or ACTIVE 610 depending on the number of nodes that are currently ACTIVE. In an embodiment the user determines the number of currently active nodes as it depends on the distribution and bandwidth limitations. A default configuration, for example, is 2 active nodes per service.
Depending on the master node's decision the node will either become PASSIVE or ACTIVE. This marks the node initialization complete. Nodes in any state of PASSIVE 608, ACTIVE 610, CANDIDATE 612 or MASTER 604 may fail at any point in time to DEAD 614 without compromising the service as long as at least one ACTIVE or MASTER remains alive.
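A simplified sketch of the state machine and its permitted transitions is given below; the transition table is an interpretation of the description above, not an exhaustive specification of the implementation.

// statemachine.go - sketch of the per-node replication state machine.
package main

import "fmt"

type State string

const (
    Waiting    State = "WAITING"
    Connecting State = "CONNECTING"
    Passive    State = "PASSIVE"
    Active     State = "ACTIVE"
    Candidate  State = "CANDIDATE"
    Master     State = "MASTER"
    Dead       State = "DEAD"
)

// transitions lists which states a node may legally move to.
var transitions = map[State][]State{
    Waiting:    {Master, Connecting},      // manual promotion, or contacted by the MASTER
    Connecting: {Passive, Active, Dead},   // the MASTER decides the role
    Passive:    {Active, Dead},            // promoted when an ACTIVE node fails
    Active:     {Candidate, Master, Dead}, // may stand for election on MASTER failure
    Candidate:  {Master, Active, Dead},    // wins or loses the quorum vote
    Master:     {Dead},                    // a failed master leaves the cluster
    Dead:       {Waiting},                 // a recovered node rejoins at low priority
}

func canTransition(from, to State) bool {
    for _, s := range transitions[from] {
        if s == to {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(canTransition(Active, Candidate)) // true: master failure triggers an election
    fmt.Println(canTransition(Passive, Master))   // false: passive nodes are never elected directly
}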
During the failure of a PASSIVE node the current master node simply updates the cluster state to reflect the change. Additionally the master node will continuously poll for the failed node to determine if it has come back online.
In the case of ACTIVE node failure the procedure is similar to that of PASSIVE node failure. One difference is that the master node will attempt to promote one of the PASSIVE nodes to become the new ACTIVE node.
During the failure of the MASTER node, one of the ACTIVE nodes is selected as described above in relation to the service coordination and control process 210. The selected ACTIVE node is self-promoted to CANDIDATE, in which state it tries to achieve cluster consensus. A majority of the remaining ACTIVE and PASSIVE nodes must agree that the previous MASTER is DEAD and successfully report this to the CANDIDATE. If the CANDIDATE fails to achieve the majority consensus due to its own failure, a new CANDIDATE is re-elected. If the CANDIDATE fails to reach the majority of the nodes, or the nodes vote that the MASTER remains alive, the CANDIDATE is left to assume that a network partition has occurred, in which case manual intervention by the user will be required.
At any time nodes may leave the cluster, in which case they will be considered DEAD 614, and they may rejoin the cluster from the state WAITING 602. Nodes that recently joined the cluster will be placed at the end of the node list and considered low priority for failover, as their behaviour may indicate intermittent issues and thus they should be considered unreliable.
Figure 7 shows an example of dynamic failover 212 from figure 2.
When using WAN technologies a particular administrative domain does not have any control over the behaviour of the entire network. Specifically, in the case of the Internet one server cannot control which path will be taken by the client to access it.
In LAN technologies, however, this problem is much simpler. The server may send an unsolicited ARP to every node on the network announcing that its location has now changed. The clients on the network will use the new MAC address to locate the moved service.
DNS provides one option for dynamic failover. DNS allows application layer addresses to be translated to IP addresses. DNS is universally used on the Internet and almost any client accessing a service on the Internet will use its application layer address and query the DNS server for its IP address before the actual request occurs.
This property of DNS can be exploited and used in a similar way to how ARP is used in a LAN to provide failover, with one significant difference: DNS does not provide unsolicited updates to the clients. The clients themselves must request the DNS name translation. Clients accessing the services through a DNS record will typically perform a DNS request for every application level request if the DNS record has expired in the client's cache.
Furthermore, if the request to an IP address returned from a DNS response fails, a client will often try the next address or re-query the DNS server. This behaviour however is highly implementation specific and cannot be relied upon.

In an embodiment a Dynamic DNS system is used with lowered TTL values. Dynamic DNS is a DNS server whose entries may change dynamically, and can often be requested to change via a specific protocol.
Upon the detection of a failure in the system by the methods described above, the DDNS system will be notified of the failure and the new service public IP address will be provided. Any subsequent requests to the DNS server will result in the new address being returned by the DNS server.
In an embodiment, in order to achieve fast failover, TTL values are set reasonably low. An example range for the TTL is 1-60 seconds.
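A sketch of the DDNS update step follows. nsupdate is a standard BIND utility, but the name server, zone, record name, TSIG key path and TTL value shown here are illustrative assumptions.

// ddnsupdate.go - sketch: on failover, rewrite the service's A record with a
// low TTL so clients resolve the surviving node's public IP address.
package main

import (
    "fmt"
    "log"
    "os/exec"
    "strings"
)

func updateRecord(server, zone, name, newIP string, ttlSeconds int) error {
    // Build an nsupdate script: delete the old A record and add the new one.
    script := strings.Join([]string{
        "server " + server,
        "zone " + zone,
        "update delete " + name + " A",
        fmt.Sprintf("update add %s %d A %s", name, ttlSeconds, newIP),
        "send",
    }, "\n")
    cmd := exec.Command("nsupdate", "-k", "/etc/ddns.key") // TSIG key path is an assumption
    cmd.Stdin = strings.NewReader(script)
    return cmd.Run()
}

func main() {
    // Point the service name at the new master's public IP with a TTL in the
    // 1-60 second range described above.
    if err := updateRecord("ns1.example.org", "example.org.",
        "service.example.org.", "203.0.113.7", 30); err != nil {
        log.Fatal(err)
    }
}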
EXPERIMENTAL RESULTS
The replication techniques described above were evaluated on their ability to achieve the Recovery Time Objective, their resource consumption and their service quality impact. A high availability solution must be able to recover the service promptly so that its users will be able to use it in disaster scenarios. Additionally, the proposed solution must not consume too many resources for its operation, as otherwise it would have a severe negative impact on the service. The resource consumption evaluation was based on CPU and network bandwidth consumption.
The evaluation system included three different nodes: one in a private IaaS in Auckland, NZ; a second on physical infrastructure in Auckland, NZ in a different building about 200 meters apart, connected via 10Gig fibre to the first node; and a third in an Amazon AWS T2.micro instance in Oregon, US. A 100/100 symmetric Internet connection was supplied to the private IaaS in Auckland, NZ; this link was used for communication with the Amazon instance. All nodes ran Ubuntu 16.04 with Linux kernel version 4.4-stable, Docker version 1.12, and btrfs-progs version 4.4, and the binaries for the system were compiled with the golang compiler 1.7 for the linux x86_64 target.

Mean time to recovery
Recovery time objective was measured as the time required for the system to fully recover and bring up the service after a synthetically injected failure to the master node. The nodes were using a bind9 DNS server deployed in the same private IaaS as node 2.
The time required for failover was profiled using different timeout settings. A timeout setting is a variable that is used to determine when a heartbeat times out. After the first heartbeat fails the system has up to N seconds to reconnect to avoid being considered a failure. The timeout setting N can be changed by the user to adjust for the specific network conditions. For example, if the network is known to have high latency spikes due to congestion, a high timeout value may be used; if the network is known to be convergent and very stable, a low timeout value may be used.
Additionally, by using 3 or more nodes the timeout value may be further decreased as the two remaining nodes will be able to perform a quorum and decide if they both see that one node has failed.
However, in the case of two nodes the system cannot differentiate a temporary network partition from node failure, hence a higher value should be used to prevent false positive failovers.
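A minimal sketch of the timeout logic described above follows; the TCP dial used as the heartbeat probe and the polling intervals are illustrative assumptions rather than the actual protocol.

// heartbeat.go - sketch: after the first missed heartbeat a node has up to
// the timeout window to reconnect before it is declared failed.
package main

import (
    "fmt"
    "net"
    "time"
)

// alive is an illustrative probe: a TCP dial on the control channel port.
func alive(addr string) bool {
    c, err := net.DialTimeout("tcp", addr, time.Second)
    if err != nil {
        return false
    }
    c.Close()
    return true
}

// monitor declares the node failed only if it stays unreachable for the
// whole timeout window; a single missed heartbeat never triggers failover.
func monitor(addr string, timeout time.Duration) {
    for {
        if alive(addr) {
            time.Sleep(time.Second)
            continue
        }
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            if alive(addr) {
                break
            }
            time.Sleep(500 * time.Millisecond)
        }
        if !alive(addr) {
            fmt.Println(addr, "declared failed, initiating failover")
            return
        }
    }
}

func main() {
    // 2s for LAN, 5s for low-latency WAN, 10s for high-latency WAN (see figure 8).
    monitor("10.0.0.2:7000", 10*time.Second)
}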
Figure 8 shows examples of recovery time per timeout. The effects of different timeout settings on the mean time to recovery are shown. This measurement is produced by measuring the time from the point the failure has been injected, via a piece of code that immediately halts the master's CPU, to the point when the service is fully recovered and is able to serve requests.
The timeout values of 2, 5 and 10 seconds are used, which correspond to typical settings used for LAN, low latency WAN, and high latency WAN respectively. It was found that the Amazon instance would often freeze for a few seconds, which would result in failing heartbeats, hence a timeout of 10 seconds was used there in order to eliminate false positive failovers. This issue is common for T2.micro instances when the host machines become highly congested. It can be seen in figure 8 that the timeout setting has the most direct and prominent effect on the recovery time, hence it is desirable to keep the timeout as low as the underlying network could possibly allow. The remaining constituents of the failover time remain consistent over different network connection types and are further explored below.

Figure 9 shows an example of a recovery process profile. The figure shows a breakdown of recovery activity during a hypothetical timeout of 0 seconds. The values shown in the figure are in seconds, and the entire recovery procedure is performed within a second.
The contributors to the failover procedure are the following, in order from largest to smallest:
* The time required to start the service cold from the state replicated by the system;
* The time required by the state machine to reach consensus with the remaining network nodes;
* The time for the DNS update query using nsupdate; and
* The time to recover the volume from the replicated snapshots.
The time required to launch a service in this case is mostly contributed by the container runtime itself, as it needs to perform various checks and prepare the environment for the container to run in. This runtime can be further reduced by changing the container runtime environment and changing certain parameters such as disabling security checks. The time used by the state machine to reach consensus with the remaining nodes, however, cannot be optimized much further as it purely depends on the latency of the connection between the remaining nodes. The CANDIDATE node has to wait for the remaining nodes to acknowledge that the old MASTER is dead, which may take multiple network round trips as the heartbeats are not synchronized.
Resource Consumption
Two metrics were considered for resource consumption: bandwidth and CPU load. The system uses a parallel pipeline architecture to perform replication, therefore memory is only used for buffering; hence RAM is not measured, as the system only uses a few megabytes and this value is constant across all replication intervals.
Furthermore, it was noticed that replication does not require additional disk IO for snapshot transmission, as the snapshot tends to be read from page cache. In the case the system is starving for memory, disk IO would be less than or equal to the network bandwidth. It is assumed that the system is healthy and the resources properly allocated. The workload used to perform these measurements was compilation of the Linux kernel 4.8.4 bzImage target in its default configuration with 8 worker threads. This task has the potential to illustrate a well-balanced moderate workload involving CPU and RAM as well as providing a reasonable random read/write load on the disk that can be expected in real life scenarios. It has been found that the sustained load caused by the kernel compilation exceeds that of the normal operation of low volume internet services, such as blogs, websites and file servers. As this category of services is the target of the system's low bandwidth configuration, kernel compilation should provide a conservative estimate of the system's performance in real life.

Figure 10 shows results of measuring bandwidth use.
One test evaluated the bandwidth requirements of different snapshot frequencies on Linux kernel compilation workload. The tests were repeated for various snapshot rates from 1/1 to 1/32 in intervals of powers of two. The size of each snapshot was recorded. An additional snapshot was taken before and after the kernel compilation as a reference delta image.
The anticipated inverse proportionality of mean snapshot size to snapshot frequency can be observed in figure 10. The increase in snapshot size was simply due to the data compounding between the snapshot epochs, hence larger snapshots were required to accommodate the state change.
Additionally, for each snapshot interval the minimum bandwidth requirement was plotted. The minimum bandwidth requirement is an estimate of the link bandwidth between nodes required to sustain uncompressed replication at the selected frequency using this specific workload.
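As an illustration (this formula is an interpretation of the plotted estimate rather than a quotation of the measurement procedure), the minimum bandwidth requirement for a given interval can be approximated as:

    minimum bandwidth ≈ mean snapshot size / snapshot interval

so that a mean snapshot of roughly 3.2 MB taken every second would imply a sustained link requirement of roughly 3.2 MB/s, while a 32-second interval spreads its (larger, but partially coalesced) snapshots over 32 seconds and therefore needs considerably less sustained bandwidth.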
The figure shows that the bandwidth required to replicate the service performing kernel compilation is between 3.2 MB/s for the highest replication frequency and 2.75 MB/s for the 1/32 replication frequency. This value indicates that, even without compression, the system performs more than 15 times better than Virtual Machine replication systems performing the same task.
By enabling LZ4 compression the minimum bandwidth requirement can be reduced even further, making sustained replication for this workload feasible over commodity networks. Less write-intensive workloads will consume significantly less bandwidth for replication, and as many web services tend to follow a write-once-read-many philosophy, the bandwidth required to replicate them would be significantly lower.
If the available link is slower than the measured bandwidth requirement, replication will still occur, however the frequency will be automatically reduced by the system. In this way the system can support large bursts of state changes, being limited only by the disk capacity. With lower replication frequencies the state changes become increasingly more likely to overwrite previous ones. This behaviour can be illustrated by the decline of the minimum bandwidth requirement with the decrease of replication frequency. The reduction in minimum bandwidth requirement is additionally contributed to by the lowered metadata overhead associated with the replication. In order to further study the effects of different replication intervals on bandwidth, the total sizes of the uncompressed state changes were compared at different replication intervals.
Figure 11 shows the findings for total replication size and overhead. The decreasing total of snapshot sizes originates from two causes in the replication procedure. The first source is the reduced number of snapshots taken at slower snapshot frequencies: the fixed amount of metadata appended to each snapshot by btrfs is reduced, accounting for a few bytes of size reduction. However, the difference seen in the measurements is far too large to be accounted for by the metadata reduction alone.
The second reason for reduced total snapshot size is due to write coalescing naturally occurring within larger snapshot periods. Coalescing is described in more detail below.
Figure 11 additionally plots the percentage overhead caused by multiple snapshots in comparison to a single incremental snapshot taken after the kernel compilation is finished. From this plot it can be seen that even the most frequent snapshots (1/1) perform very efficiently, with only 2% overhead in total transmission size. By decreasing the snapshot frequency the overhead can be reduced even further. It can also be seen that snapshot frequencies lower than 1/8 result in a total transmission size significantly smaller than that of a single incremental snapshot. It is believed that this is caused by the btrfs algorithm not being able to optimize very large incremental snapshots for the temporal writes, hence increasing the size of the snapshot.
Figure 12 shows an example of write coalescing. Specific files in the file system may be written to at different time points in between the snapshots. These writes may sometimes overlap, in which case only the union of the changes introduced by the first write and the second write will be carried over to the incremental snapshot.
Furthermore the two temporal writes can be combined into one continuous write to improve replay performance and reduce metadata required for transmission.
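An illustrative sketch of the coalescing rule follows: overlapping or adjacent write extents recorded between two snapshots are carried over as a single continuous extent. The extent representation is hypothetical and used only to demonstrate the effect.

// coalesce.go - sketch: merge overlapping temporal writes into their union.
package main

import (
    "fmt"
    "sort"
)

type extent struct{ off, length int64 }

// coalesce merges overlapping or adjacent extents into single extents.
func coalesce(writes []extent) []extent {
    sort.Slice(writes, func(i, j int) bool { return writes[i].off < writes[j].off })
    var out []extent
    for _, w := range writes {
        if n := len(out); n > 0 && w.off <= out[n-1].off+out[n-1].length {
            // Overlap or adjacency: extend the previous extent to the union.
            if end := w.off + w.length; end > out[n-1].off+out[n-1].length {
                out[n-1].length = end - out[n-1].off
            }
            continue
        }
        out = append(out, w)
    }
    return out
}

func main() {
    // Two temporal writes to the same file region become one continuous write.
    fmt.Println(coalesce([]extent{{0, 4096}, {2048, 4096}})) // [{0 6144}]
}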
Figure 13 shows an example of measuring CPU load for different bandwidth links. The CPU load required to transmit the snapshots generated by the kernel compilation workload is measured. The measured results exclude the compilation process itself and only focus on activities of the system.
An evaluation is performed of the two fastest compression algorithms used in the implementation, GZIP-1 and LZ4, across a simulated link of variable bandwidth. These tests were carried out using the physical node in the cluster, where the % CPU was defined as CPU time / wall clock time.
The results of these measurements can be seen in figure 13. It should be noted that these figures illustrate the CPU utilization during the transmission time for a snapshot. They do not show the transmission time itself for a particular snapshot, as it highly depends on the workload and snapshot size. Furthermore, the workload type is only relevant for compression algorithms that perform worse on incompressible data. If this is the case then it is recommended to turn compression off completely, as it will result in wasted CPU for an insignificant reduction in transmission size.

In the figure a linear increase in CPU utilization is observed, caused by the btrfs log and replay. This becomes significant (~22% TX, ~27% RX) when a higher bandwidth link is used. As the link becomes capable of carrying more data, btrfs functions will be invoked more frequently to fill the network channel and utilize it at its maximum speed. However, for lower bandwidth links such as 10 and 100 Mbit/s the CPU consumption required by btrfs log and replay is minimal (< 1% @ 10 Mbit/s, < 5% @ 100 Mbit/s), and should not impact the service on a multi-core system.
As for the compression algorithms, it was determined that GZIP was completely unable to keep up with the network bandwidth at 1000 Mbit/s speeds in single-threaded mode. Hence the datapoint for GZIP at that speed is unavailable, and GZIP should not be utilized for high bandwidth replication even if a parallel implementation for stream compression is used, as it would utilize far too much CPU.
However, at link speeds such as 10 and 100 Mbit/s, GZIP-1 could be used at a significant cost of CPU load (~20% compression @ 100 Mbit/s, ~12% decompression @ 100 Mbit/s).
However, such a compression scheme is undesirable if the service under protection would be impacted by the additional CPU load, or the backup node runs in metered conditions where the user is billed for the CPU consumption (i.e. public cloud infrastructure such as the one used by the AWS node). LZ4, on the other hand, can sacrifice some compression ratio for a massive performance gain, allowing the compression to be used even at 1000 Mbit/s link speeds, albeit at significant CPU load.
Furthermore, LZ4's performance really shines at the lower link speeds, as it can perform both compression and decompression under 4% CPU load @ 100 Mbit/s.
Service Impact
A service impact test was performed to determine the extent to which the snapshot process and replication process impact the end user service. In order to conduct this test, state replication of the kernel compilation workload was performed over a LAN between two nodes. Both user space and kernel space execution time were measured.
The measurement was performed over different replication rates ranging from 0 to 1/32. In this case only the relative slowdown was considered. Hence the different replication rates were compared by their percentage slowdown relative to no replication.
Figure 14 shows the percentage slowdown compared to the replication rate. It can be observed in the figure that the most frequent replication considered by the system (1 Hz) only yields a minor 2% increase in user space time compared to that of no replication. The slowdown mainly originates from the fact that btrfs must perform copy-on-write activity after the snapshot occurs. This induces additional latency and disk bandwidth on write access, in which case latency affects the compilation time the most as it requires the multi-threaded application to wait for IO. As the replication frequency is decreased, the checkpointing overhead becomes less significant, with the lowest replication frequency of 1/32 resulting in a 0.3% slowdown. Even so, the conservative 1/32 replication frequency is still acceptable for workloads whose recovery point objective is not critical and where network bandwidth should be prioritized. Overall it can be seen that the overhead is minor and is unlikely to be noticed by the application itself, in comparison to other high availability solutions such as Remus, whose overhead in the kernel compilation workload ranges between 30%-100% depending on the replication frequency. The replication frequency for Remus has to be much higher in order to satisfy service quality requirements as it blocks the network traffic for the period of replication.
With longer replication intervals such as those available to the system, the latency of Remus will time out TCP connections and result in a large amount of jitter, causing extremely slow transfer speeds due to the TCP window scaling algorithm.
Such a large difference is due to the way the system performs atomic COW snapshots of the secondary storage, which does not require the container to be paused. Additionally, Remus replicates the contents of RAM, which is completely irrelevant for permanent state services as the workload can be resumed after the application is restarted.
Figure 15 shows an embodiment of a suitable computing environment to implement embodiments of one or more of the systems and methods disclosed above. The operating environment of Figure 15 is an example of a suitable operating environment. It is not intended to suggest any limitation as to the scope of use or functionality of the operating environment. Example computing devices include, but are not limited to, personal computers, server computers, hand-held or laptop devices, mobile devices, multiprocessor systems, consumer electronics, mini computers, mainframe computers, and distributed computing environments that include any of the above systems or devices. Examples of mobile devices include mobile phones, tablets, and Personal Digital Assistants (PDAs).
Although not required, embodiments are described in the general context of 'computer readable instructions' being executed by one or more computing devices. In an embodiment, computer readable instructions are distributed via tangible computer readable media.
In an embodiment, computer readable instructions are implemented as program modules. Examples of program modules include functions, objects, Application Programming Interfaces (APIs), and data structures that perform particular tasks or implement particular abstract data types. Typically, the functionality of the computer readable instructions is combined or distributed as desired in various environments.
Shown in figure 15 is a system 1500 comprising a computing device 1505 configured to implement one or more embodiments described above. In an embodiment, computing device 1505 includes at least one processing unit 1510 and memory 1515. Depending on the exact configuration and type of computing device, memory 1515 is volatile (such as RAM, for example), non-volatile (such as ROM, flash memory, etc., for example) or some combination of the two.
A server 1520 is shown by a dashed line notionally grouping processing unit 1510 and memory 1515 together.
In an embodiment, computing device 1505 includes additional features and/or functionality.
One example is removable and/or non-removable additional storage including, but not limited to, magnetic storage and optical storage. Such additional storage is illustrated in Figure 15 as storage 1525. In an embodiment, computer readable instructions to implement one or more embodiments provided herein are maintained in storage 1525. In an embodiment, storage 1525 stores other computer readable instructions to implement an operating system and/or an application program. Computer readable instructions are loaded into memory 1515 for execution by processing unit 1510, for example.
Where the computing device 1505 is configured to host a master node 102, active node 104 or 106, and/or a passive node 108 or 110, the storage 1525 is an example of secondary storage within which the service definition and user state are stored.
Memory 1515 and storage 1525 are examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 1505. Any such computer storage media may be part of device 1505.
In an embodiment, computing device 1505 includes at least one communication connection 1540 that allows device 1505 to communicate with other devices. The at least one communication connection 1540 includes one or more of a modem, a Network Interface Card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interfaces for connecting computing device 1505 to other computing devices.
In an embodiment, the communication connection(s) 1540 facilitate a wired connection, a wireless connection, or a combination of wired and wireless connections.
Communication connection(s) 1540 transmit and/or receive communication media.
Communication media typically embodies computer readable instructions or other data in a "modulated data signal" such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" includes a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
In an embodiment, device 1505 includes at least one input device 1545 such as a keyboard, mouse, pen, voice input device, touch input device, infrared cameras, video input devices, and/or any other input device. In an embodiment, device 1505 includes at least one output device 1550 such as one or more displays, speakers, printers, and/or any other output device.
Input device(s) 1545 and output device(s) 1550 are connected to device 1505 via a wired connection, wireless connection, or any combination thereof. In an embodiment, an input device or an output device from another computing device is/are used as input device(s) 1545 or output device(s) 1550 for computing device 1505.
In an embodiment, components of computing device 1505 are connected by various interconnects, such as a bus. Such interconnects include one or more of a Peripheral Component Interconnect (PCI), such as PCI Express, a Universal Serial Bus (USB), Firewire (IEEE 1394), and an optical bus structure. In an embodiment, components of computing device 1505 are interconnected by a network. For example, memory 1515 in an embodiment comprises multiple physical memory units located in different physical locations interconnected by a network.
It will be appreciated that storage devices used to store computer readable instructions may be distributed across a network. For example, in an embodiment, a computing device 1555 accessible via a network 1560 stores computer readable instructions to implement one or more embodiments provided herein. Computing device 1505 accesses computing device 1555 in an embodiment and downloads a part or all of the computer readable instructions for execution. Alternatively, computing device 1505 downloads portions of the computer readable instructions as needed. In an embodiment, some instructions are executed at computing device 1505 and some at computing device 1555.
In an embodiment computing device 1505 and computing device 1555 host respective nodes. Replication of the master node to at least one active node is performed over the network 1560.
A client application 1585 enables a user experience and user interface. In an embodiment, the client application 1585 is provided as a thin client application configured to run within a web browser. The client application 1585 is shown in figure 15 associated to computing device 1555. It will be appreciated that application 1585 in an embodiment is associated to computing device 1505 or another computing device.
The client application 1585 for example receives user requests directing at least one aspect of service replication and/or other aspects of the service replication lifecycle described above.
As described above, a service associated to a master node is replicated to one or more active nodes. Only data stored on non-volatile media is replicated. Data replicated includes service data and user state data stored in memory associated to the master node.
Service data may be stored in layers. There are a plurality of read-only layers created by and associated to respective checkpoints that have completed. There is a current layer for recently modified files using copy-on-write. The current layer is replicated at the beginning of each lifecycle and on user request. A new service on the master node, or a new active node joining the cluster, can initiate the replication process. The read-only layers are replicated only once on initialisation.
User state stored in a data structure can be based on btrfs or some other product able to construct snapshots. A plurality of snapshots are stored in a B+ tree. Snapshots are implemented as a stream of system calls. Data blocks and operations on the data blocks are copied-on-write on completion of an operation. Once replicated, the snapshots are replayed on the active node(s). Replay can occur automatically during periodic garbage collection or at user request.

The foregoing description of the invention includes preferred forms thereof. Modifications may be made thereto without departing from the scope of the invention.

Claims

1. A method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node, the method comprising: responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.

2. The method of claim 1 wherein the user state comprises at least one sequence of system calls associated to at least one data block.

3. The method of claim 2 wherein the at least one sequence of system calls are associated to respective system calls that have been completed.
4. The method of any one of the preceding claims wherein the user state comprises a plurality of snapshots associated to respective checkpoints.
5. The method of any one of the preceding claims wherein the first node comprises a master node and the second node comprises a slave node, the slave node comprising an active node or a passive node.

6. The method of any one of the preceding claims wherein at least one of the service definition and the user state is maintained in at least one non-volatile storage medium.
7. The method of any one of the preceding claims wherein the service definition includes one or more of service configuration data, service software, service operation immutable data.
8. The method of any one of the preceding claims wherein the user state is associated to an application and includes one or more of transactions, personal records, emails, files.
9. The method of any one of the preceding claims wherein the volatile state is maintained in at least one volatile storage medium.
10. The method of any one of the preceding claims further comprising : maintaining the service definition in a first file system, the first file system associated to at least one read only layer; and maintaining the user state in a second file system, the second file system associated to at least one read/write layer.
11. The method of claim 10 further comprising presenting to the service a view associated to one or more of the first file system, the second file system.
12. The method of claim 10 or claim 11 further comprising maintaining at least some of the volatile state in at least one non-volatile storage medium.

13. The method of claim 12 further comprising writing at least some of the volatile state maintained in the at least one non-volatile storage medium to a third file system, the third file system associated to at least one read/write layer.

14. The method of claim 13 further comprising presenting to the service a view associated to one or more of the first file system, the second file system, the third file system.

15. The method of any one of claims 12 to 14 further comprising one or more of receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as volatile state, receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as user state.

16. The method of any one of the preceding claims further comprising discarding at least some of the volatile state.

17. The method of any one of the preceding claims wherein the volatile state includes one or more of lookup tables, indexes, buffers, caches, temporary variables.

18. A method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node, the method comprising: responsive to one or more of detecting the creation of a service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: receiving, over a network, service definition associated to both the first node and a service on the first node, receiving, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, receiving, over a network, the at least one change in the user state.

19. The method of claim 18 wherein the user state comprises at least one sequence of system calls associated to at least one data block.

20. The method of claim 19 wherein the at least one sequence of system calls are associated to respective system calls that have been completed.

21. The method of any one of claims 18 to 20 wherein the user state comprises a plurality of snapshots associated to respective checkpoints.

22. The method of any one of claims 18 to 21 wherein the first node comprises a master node and the second node comprises a slave node, the slave node comprising an active node or a passive node.

23. The method of any one of claims 18 to 22 wherein at least one of the service definition and the user state is maintained in at least one non-volatile storage medium.

24. The method of any one of claims 18 to 23 wherein the service definition includes one or more of service configuration data, service software, service operation immutable data.

25. The method of any one of claims 18 to 24 wherein the user state is associated to an application and includes one or more of transactions, personal records, emails, files.
26. The method of any one of claims 18 to 25 wherein the volatile state is maintained in at least one volatile storage medium.
27. The method of any one of claims 18 to 26 further comprising recreating at least some of the volatile state.
28. The method of any one of claims 18 to 27 further comprising : maintaining the service definition in a first file system, the first file system associated to at least one read only layer; and maintaining the user state in a second file system, the second file system associated to at least one read/write layer.
29. The method of claim 28 further comprising presenting to the service a view associated to one or more of the first file system, the second file system.
30. The method of claim 28 or claim 29 further comprising maintaining at least some of the volatile state in at least one non-volatile storage medium.

31. The method of claim 30 further comprising writing at least some of the volatile state maintained in the at least one non-volatile storage medium to a third file system, the third file system associated to at least one read/write layer.

32. The method of claim 31 further comprising presenting to the service a view associated to one or more of the first file system, the second file system, the third file system.
33. The method of any one of claims 30 to 32 further comprising one or more of receiving a user request to mark at least some data maintained in the at least one nonvolatile storage medium as volatile state, receiving a user request to mark at least some data maintained in the at least one non-volatile storage medium as user state.
34. The method of any one of the preceding claims further comprising discarding at least some of the volatile state.
35. The method of any one of claims 13 to 23 wherein the volatile state includes one or more of lookup tables, indexes, buffers, caches, temporary variables.
36. A service replication system comprising : a service hosted by a first node, the first node being a member of a cluster that includes the first node and a second node; and a processor programmed to: responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.
37. A service replication system comprising: a service hosted by a first node, the first node being a member of a cluster that includes the first node and a second node; and a processor programmed to: responsive to one or more of detecting the creation of a service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: receiving, over a network, service definition associated to both the first node and a service on the first node, receiving, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, receiving, over a network, the at least one change in the user state.

38. A computer-readable medium having stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node, the method comprising: responsive to one or more of detecting the creation of the service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: transmitting, over a network, service definition associated to the service and the first node, transmitting, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, transmitting, over a network, the at least one change in the user state.

39. A computer-readable medium having stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform a method for replicating a service hosted by a first node to a second node, within a cluster of nodes that includes the first node and the second node, the method comprising: responsive to one or more of detecting the creation of a service on the first node, detecting the membership of the second node within the cluster, and receiving a user replication request: receiving, over a network, service definition associated to both the first node and a service on the first node, receiving, over a network, user state associated to the service and the first node, and wherein volatile state is associated to the service and the first node, wherein the service definition and the user state does not include the volatile state; and responsive to detecting at least one change in the user state associated to the service and the first node, receiving, over a network, the at least one change in the user state.
PCT/NZ2018/050091 2017-06-27 2018-06-27 Service replication system and method WO2019004845A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NZ73327417 2017-06-27
NZ733274 2017-06-27

Publications (1)

Publication Number Publication Date
WO2019004845A1 true WO2019004845A1 (en) 2019-01-03

Family

ID=64743075

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NZ2018/050091 WO2019004845A1 (en) 2017-06-27 2018-06-27 Service replication system and method

Country Status (1)

Country Link
WO (1) WO2019004845A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110219019A1 (en) * 2010-03-05 2011-09-08 Computer Associates Think, Inc. System And Method For Providing Network-Based Services To Users With High Availability
US20170177613A1 (en) * 2015-12-22 2017-06-22 Egnyte, Inc. Event-Based User State Synchronization in a Cloud Storage System

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11226984B2 (en) 2019-08-13 2022-01-18 Capital One Services, Llc Preventing data loss in event driven continuous availability systems
US11921745B2 (en) 2019-08-13 2024-03-05 Capital One Services, Llc Preventing data loss in event driven continuous availability systems
WO2023037141A1 (en) * 2021-09-10 2023-03-16 Telefonaktiebolaget Lm Ericsson (Publ) Active node selection for high availability clusters

Similar Documents

Publication Publication Date Title
US11716385B2 (en) Utilizing cloud-based storage systems to support synchronous replication of a dataset
US11086555B1 (en) Synchronously replicating datasets
US11675520B2 (en) Application replication among storage systems synchronously replicating a dataset
US11941279B2 (en) Data path virtualization
US11349917B2 (en) Replication handling among distinct networks
US20220263897A1 (en) Replicating Multiple Storage Systems Utilizing Coordinated Snapshots
US11789638B2 (en) Continuing replication during storage system transportation
US12045252B2 (en) Providing quality of service (QoS) for replicating datasets
US8429362B1 (en) Journal based replication with a virtual service layer
US11625185B2 (en) Transitioning between replication sources for data replication operations
US20200068010A1 (en) Managing a cloud-based distributed computing environment using a distributed database
US11567837B2 (en) Journaling data received in a cloud-based distributed computing environment
US20220147490A1 (en) Replica transitions for file storage
US20210303527A1 (en) Mapping equivalent hosts at distinct replication endpoints
US11893263B2 (en) Coordinated checkpoints among storage systems implementing checkpoint-based replication
WO2019004845A1 (en) Service replication system and method
US20230385154A1 (en) High Availability And Disaster Recovery For Replicated Object Stores
US20230353635A1 (en) Replication Utilizing Cloud-Based Storage Systems
US20230283666A1 (en) Establishing A Guarantee For Maintaining A Replication Relationship Between Object Stores During A Communications Outage
US20230394065A1 (en) Providing Application-Side Infrastructure To Control Cross-Region Replicated Object Stores
WO2022159157A1 (en) Application replication among storage systems synchronously replicating a dataset

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18824930

Country of ref document: EP

Kind code of ref document: A1