WO2006075332A2

WO2006075332A2 - Resuming application operation over a data network

Info

Publication number: WO2006075332A2
Application number: PCT/IL2006/000057
Authority: WO
Inventors: Offer Markovich; Constantine Gavrilov; Victor Umansky; Eran Cinamon; Moshe Israel Bar
Original assignee: Qlusters Software Israel Ltd.
Priority date: 2005-01-13
Filing date: 2006-01-12
Publication date: 2006-07-20
Also published as: WO2006075332A3

Abstract

A system for allowing an application to resume its operation on a new node of a data network after an original application and/or an original node fail to operate over the data network, that comprises: an active node comprising an active application and an active replication software component for replicating changes of the active application; stand-by nodes, where each node comprises a replication of the application and a stand-by replication software component, for replicating changes of the active application that are received from the active replication software component to its corresponding stand-by node; and a high availability software component for instructing each of all nodes to become active or stand-by; for monitoring the availability of the active node and the one or more stand-by nodes and the availability of the active application and the replication(s) of application; and for controlling and communicating with the active and the one or more stand-by nodes.

Description

METHOD AND SYSTEM FOR RESUMING APPLICATION OPERATION OVER A DATA NETWORK

Field of the Invention

The present invention relates to failover sj^steins. More particularly, the invention relates to a method and system for allowing an application to resume its operation on a new node in a fast manner after the original application and/or original node fail to operate.

Definitions, Acronyms and Abbreviations

Throughout this specification, the following definitions are employed:

API: An Application Programming Interface that a computer system or application provides in order to allow requests for service to be made of it by other computer programs, and/or to allow data to be exchanged between them. For instance, a computer program can (and often must) use its operating system API to allocate memory and access files. Many types of systems and applications implement APIs, such as graphics systems, databases, networks, web services and computer games.

FAILOVER: is a backup operation that automatically switches to a standby application, server or network, if the primary system fails or is temporarily shut down. Failover is an important fault tolerance function of mission-critical systems that rely on continuous accessibility. Failover automatically and transparently to the user redirects requests from the failed or down system to the backup system that mimics the operations of the primary system.

KERNEL: is the core of an operating system. It is a portion of software, responsible for providing secure access to the machine hardware and to various computer processes (a process is a computer program in a state of execution). Since there can be many processes running at the same time, and hardware access is limited, the kernel also decides when and how long a program should be able to make use of a portion of hardware. Accessing the hardware directly can be very complex, since there are many different hardware designs for the same type of component. Kernels usually implement some hardware abstraction (a set of instructions universal to all devices of a certain type) to hide the underlying complexity from the operating system and provide a clean and uniform interface to the hardware, which helps application programmers to develop programs that work with all devices of that type. The Hardware Abstraction Layer (HAL) then relies upon a software driver that provides the instructions specific to that device manufacturing specifications.

RPO: is a short for a Recovery Point Objective, which is the point in time that the restarted application will reflect. Essentially, this is the roll-back that may be experienced as a result of the recovery. It also can be explained as a measure of the amount of time for which work may be lost in the event of a unplanned outage at the primary site.

RTO: is a short for a Recovery Time Objective that is determined, based on the acceptable downtime in case of a disruption of operations. It indicates the earliest point in time at which the business operations must resume after disaster.

Background of the Invention

The speed of technology development nowadays requires system administrators to be vigilant about application availability. In a marketplace where the term "downtime" is an unacceptable concept, Information Technology (IT) is concentrated on the creation of redundant systems for ensuring minimal risk to business-critical applications. As these applications accumulate more data and becorne the core of day-to-day business practices, system administrators are expected to create foolproof system architectures.

The financial services sector contains main examples of business-critical applications. Financial enterprises require constant and reliable interaction with both customers and global financial exchanges. Financial IT systems require constant and reliable interaction with fast-moving global markets while sustaining immense volumes of traffic. The timing issue in the financial market is usually more critical than in any other market: every second, billions of dollars at numerous transactions depend solely on reliability of the enterprise infrastructure and its capability to effectively manage said transactions. As a result, downtime costs are immediately quantifiable. In addition, every financial market opportunity may vanish by the time it takes to recover from a system failure.

During the system or application downtime, a company is affected in the following areas:

Financial loss: the system failure during critical business hours, for example during trading hours, it leads to the conrpanjr direct financial damage;

Customer experience: when the customers' network connection is interrupted or halted, the company is at risk of losing its orders and customers;

Executive experience: when internal knowledge and/or support systems are not operating, a financial enterprise can miss critical opportunity windows for taking business decisions involving outstanding orders and positions. In the best-case scenario, it results in a loss of opportunity, and in the worst case, it results in poor business decisions based on incorrect market data; - Business experience: when a financial enterprise is disconnected from a global exchange network, critical orders may not be completed in a timely manner. Due to the inherent volatility of the market, each trading time window can be measured in seconds. If a system failure occurs during said time window, both an institution and customer immediately experience a direct financial impact.

For providing accurate financial service data, corporate applications are required to contain up-to-the-second information regarding orders statuses and global market conditions. The substantial amount of "state information" requires accuracy and reliability — errors and/or losses of information are not acceptable.

The prior art solutions implement one of the three following approaches: Providing continuous availability via the redundant hardware and lock-step operation. By this way, a recovery point objective (RPO) and a recovery time objective (RTO) can be zero and there is no downtime for single failures. However, this approach is expensive and still susceptible to multiple hardware failures.

- Several software-based solutions "sacrifice" both RPO and RTO, and restart a failed application. These solutions do not require expensive specialized hardware; however, they have a number of problems: (a) RPO is poor, since the application state is not preserved; and RTO is also poor, since the full application restart is required.

Some applications are designed and written with failover "in mind". However, this approach is not general and therefore, each application has to be programmed accordingly. There are also a number of academic approaches, which involve various checkpoint/restart mechanisms; however, none of them provides the fast failover with a continuous application state replication.

The prior art failover solutions require expensive hardware for providing their continuous operation without a loss of state, or they require long recovery times (usually tens of seconds), leading to a reduced application availability and the loss of state of the application.

Therefore, there is a continuous need to overcome the prior art drawbacks by providing a solution that is implemented only in software and that provides a short failover recovery time in a range of seconds instead of minutes. Furthermore, the solution must retain the state of the application. In other words, a solution must provide a small recovery point objective (RPO) and a low recoveiy time objective (RTO).

It is an object of the present invention to provide a method and system for allowing an application to resume its operation on a new node in a fast manner after the original application and/or original node fails to operate due to hardware and/or software problems.

It is another object of the present invention to provide a method and system for resuming the operation of the failed application with a small recovery point objective (RPO) and a low recovery time objective (RTO).

It is still another object of the present invention to provide a method and system for resuming the operation of the application by providing a software-based solution. It is still another object of the present invention to provide a method and system allowing system administrators to focus on critical applications management rather than on hardware redundancy.

It is a further object of the present invention to provide a method and system, which is inexpensive.

It is still a further object of the present invention to provide a method and system, which is user friendly.

Other objects and advantages of the invention will become apparent as the description proceeds.

Summary of the Invention

The present invention relates to a method and system for allowing an application to resume its operation on a new node in a fast manner after the original application and/or original node fail to operate.

The system for allowing an application to resume its operation on a new node of a data network after an original application and/or an original node fail to operate over said data network comprises: (a) an active node comprising an active application and an active replication software component for replicating changes of said active application; (b) one or more stand-by nodes, each of which comprising a replication of said application and a stand-by replication software component, for replicating changes of said active application to its corresponding stand-by node, said changes received from said active replication software component; and (c) an HA software component for: (c.l.) instructing each of all nodes to become active or stand-by; (c.2.) monitoring the availability of said active node and said one or more stand-by nodes and the availability of said active application and said replication(s) of application; and (c.3.) controlling and communicating with said active and said one or more stand-by nodes.

Preferably, the active and stand-by replication software components run in the kernel space of the operating system within each node.

The method for allowing an application to resume its operation on a new node of a data network after the original application and/or original node fail to operate over said data network comprises: (a) independently starting two or more instances of an application on two or more corresponding nodes; (b) registering said instances of the application with an HA software component; (c) selecting, by means of said HA software component, one node to be an active node and the remaining nodes to be stand-by nodes; (d) providing a network address of said active node to each replication software component of said one or more stand-by nodes; (e) transferring a list of the one or more stand-by nodes to a replication unit of said active node and taking the instance of the application on said active node out of a registration call and causing said instance to become active; and (f) monitoring the status of said active node and one or more stand-by nodes by means of said HA software component.

Preferably, the method further comprises detecting a failure of the active node or a failure of the instance of an application on said active node by means of the HA software component.

Preferably, the method further comprises selecting a new active node by means of the HA software component.

Preferably, the method further comprises transferring a new list of the one or more stand-by nodes to the replication software component of the new - S - active node and taking the instance of an application on said new active node out of a registration call.

Preferably, the method further comprises providing a network address of the new active node to each replication software component of the one or more stand-by nodes.

Preferably, the method further comprises adding a new node after taking the active instance of an application out of the registration call and after resuming a flow of said active instance of an application, by: (a) assigning a new node as a stand-by node by means of the HA software component; (b) providing the network address of the active node to a replication software component of said node; and (c) monitoring the status of the new stand-by node by means of the HA software component.

Preferably, the method further comprises starting a process of data synchronization of the new stand-by node with the rest of nodes.

Preferably, the method further comprises leaving the application in a standby state in the registration call state until a number of stand-by nodes reaches a predetermined quorum requirement.

Brief Description of the Drawings

In the drawings:

Fig. 1 is a schematic illustration of a system for allowing an application to resume its operation on a new node in a fast manner after the original application and/or original node fail to operate over a data network, according to a preferred embodiment of the present invention; Fig. 2A is a flowchart of initiating the system operation, according to a preferred embodiment of the present invention;

Fig. 2B is a flowchart of a recovery flow after the failure of the active node or active application running on said node, according to a preferred embodiment of the present invention; and

- Fig. 3 is a flowchart of adding a new node to the system after taking the active application out of a registration call and after the flow of said application is resumed, according to a preferred embodiment of the present invention.

Detailed Description of the Preferred Embodiments

Hereinafter, when the term "node" is used, it should be noted that it refers to a computer, server, and the like. In addition, it should be noted, when the term "call" is mentioned it refers to the terms "command", "message" and the like, which are used interchangeably.

Fig. 1 is a schematic illustration of a sj_^stem 100 for allowing an application to resume its operation on a new node in a fast manner after the original application and/or original node fail to operate over a data network, according to a preferred embodiment of the present invention. System 100 comprises a High Availability (HA) software component 105, an active node 110 and one or more stand-by nodes, such as stand-by nodes 115, 120 and 125. All nodes are connected by network links that can transmit a socket network protocol (TCP/IP (Transmission Control Protocol/Internet Protocol), SDP (Socket Direct Protocol), or the like). Each node can have one ore more network links. HA software component 105 and all nodes communicate between them by means of software installed within said HA software component 105 and said nodes, using network connections to send messages when necessary. An active node 110 is a node where an application runs in an active state. Active node 110 is a primary node of system 100. Each of stand-by nodes 115, 120 and 125 runs a copy of the same application (replication of the application 117, 122 and 127) in a suspended (stand-by) state. The same memory size is allocated for each replication of the application on each corresponding stand-by node, as for active application 112 on active node 110. Furthermore, each replication of the application 117, 122 and 127 (each stand-by instance of the application) has the same file handlers (the same handling rules for displaying a plurality of files related to each application), as active application 112 on active node 110. When the state of active application 112 is changed, said active application replicates said change to all stand-by nodes 115, 120 and 125 by means of its replication software components 111 and replication software components 116, 121, 126 of said all stand-bjr nodes, respectively. HA software component 105 is a main controller (and actually functions as the "brain" of system 100), and its operation comprises monitoring the availability of all nodes and availability of their application(s) and replications of application, respectively; instructing each of said nodes to become active or stand-by; and controlling and communicating with said nodes. Active application 112 specifies an address in the memory or a position in its corresponding file wherein the data change has happened. Active application 112 does not proceed with its execution until the replication of said data change on each stand-by node 115, 120 and 125 is completed. As a result, the state of each replication of the application is alwa_3^s valid: it is either the previous application state (the last known application state) or the current application state. Each stand-by node is ready to substitute failed active node 110 upon the need, allowing the application to quickly resume its operation on a new node after active application 112 and/or active node 110 fail to operate. When active node 110 or application 112 running on this node fails, HA software component 105 selects a stand-b}' node within one or more standby nodes 115, 120 and 125, as a new active node. The selected stand-by node changes its state from the stand-by (suspended) state to the active state upon receiving one or more API commands. Then the application on said selected node becomes active. The failover period, starting from the failure of the originally active node or application to the moment when the stand-b}^ application on one of the stand-by nodes becomes active is very short (can be, for example, less than a second), since the data stored in the nieinory (for example, the virtual memory) of the application (that becomes active on the selected stand-by node) and its files are synchronized with the same of the previously active application.

System 100 can be, for example, the XHA (Extreme High Availability) system, which is developed by Qlusters Inc. U.S.A (http://www.qlusters.com/index.php?option=^:com content&task=blogcategory &id- 18&Itemid=^:47) . Each replication software component can be, for example, the "MRS" (Memory Replication System) component, which is also developed by Qlusters Inc. and is a part of the above XHA system.

It should be noted that the data network can be any network, such as the Ethernet, Internet, LAN (Local Area Network), wireless network, etc.

Fig. 2A is a flowchart 200 of initiating the system 100 (Fig. 1) operation, according to a preferred embodiment of the present invention. At step 205, replication software components 111, 116, 121 and 126 (Fig. 1) are activated at each node 110, 115, 120 and 125 (Fig. 1), respectively, prior to starting an application (an instance of an application) on said each node. Then at step 210, an application is independently started (initialized and/or its one or more secondary flows are started) at each node, registering itself with HA software component 105 (Fig. 1). The application registers itself with said HA software component 105 via a software library (for example, one or more ".dll" files) by sending a registration call (activating a corresponding function within said library). Said library provides two types of calls (commands): (a) registration calls; and (b) replication calls. The library is a part of software installed on each node. Each registration call informs HA software component 105 that the application has completed its initialization and is waiting to become active. The application can register itself multiple times whenever needed. The replication calls are used bj_^ active instance of the application only (such as application 112 (Fig. I)). By receiving said replication calls, replication software components 116, 121 and 126 replicate changes of application 112 to all stand-by nodes. For enabling the application to communicate with HA software component 105 by registration calls and with replication software components by replication calls, the program code of said application is modified prior to its start-up. After modifying said application, it begins to "understand" the API commands and can send and/or receive the registration and synchronization calls with HA software component 105. HA software component 105 selects one node to be an active node. Then at step 215, HA software component 105 instructs each node (except for the active node) to become stand-by. At step 220, HA software component 105 provides a network address of active node 110 to replication software components 116, 121 and 126 of each corresponding stand-by node 115, 120 and 125. Then at step 225, HA software component 105 transfers a list of all stand-by nodes to replication software component 111 of active node 110, and then takes application 112 (Fig. 1) out from the registration call, making said application active and readjf to be executed. Then at step 230, HA software component 105 starts monitoring a status and availability of each node and of application 112 and replications of application 117, 122 and 127, respectively. On each stand^ node 115, 120 and 125, the execution of replications of the application 117, 122 and 127, respectively, is blocked (by entering a wait state) in the corresponding API registration call. Then, active application 112 on active node 110 resumes its flow. If one or more stand-b}^ nodes fail after that, HA software component 105 informs replication software component 111 of active node 110 about this event, and said replication software component 111 removes one or more failed replications of the application 117, 122 and 127 from its list.

The active application 112 state is kept in memory or in a file within active node 110. When this state is changed, application 112 calls replication component 111 by sending to it an appropriate API command and requests a replication of the corresponding memory data or file to all stand-by nodes 115, 120 and 125. If application 112 creates a new file, said application 112 calls replication software component 111 and requests to create said file on each stand-by node 115, 120 and 125. Since replication software component 111 has the list of all stand-by nodes within replication system 110, it instructs all replication software components within each stand-by node to replicate said new file.

It should be noted that the term "execution" in the context, for example, the "application execution" may mean that application activates its main flow, repeats its program code block and the like.

Fig. 2B is a flowchart 250 of a recovery flow after the failure of active node 110 (Fig. 1) or active application 112 (Fig. 1) running on said node, according to a preferred embodiment of the present invention. At step 255, HA software component 105 (Fig. 1) detects that active node 110 or active application 112 (Fig. 1) fail to operate. Then, at step 260, HA software component 105 informs each replication software component 116, 121 and 126 (Fig. 1) of each stand-by node that active node 110 failed to operate. Then at step 265, HA software component 105 selects a new active node from one or more stand-by nodes, such as stand-by nodes 115, 120 and 125 (Fig. 1). Suppose that node 115 is selected as the new active node. After that at step 270, HA software component 105 sends one or more API commands to replication software components 121 and 126, instructing them to become stand-by (to assume the "stand-by role") for the new system configuration wherein node 115 is the new active node. At step 275, HA software component 105 sends to said replication software components a network address of the new active node. After that at step 280, HA software component 105 transfers to the replication software component of the new active node a list of all stand-by nodes and takes the corresponding replication of the application (of said new active node) out of a registration call (takes the corresponding replication of the application out of a wait state), making it active and ready to be executed. Then, the new active application on the new active node resumes its flow. Since all stand-by nodes were synchronized with the previous active node before the failure of active node 110 or active application 112, no additional S3αichronization is required. Furthermore, since it is assumed that application 112 has a number of states and can be recovered from airy state by examining the data stored in memory or stored in files, the new active application resumes its execution from the last state replicated by the replication software component of the previous active node.

Fig. 3 is a flowchart 300 of adding a new node after taking active application 112 (Fig. 1) out of a registration call and after the flow of said application is resumed, according to a preferred embodiment of the present invention. At step 305, HA software component 105 (Fig. 1) assigns the new node as a stand-by node and provides to the replication software component of said node a network address of active node 110 (Fig. 1). Then at step 310, HA software component 105 adds the new stand-by node to the list of stand-by nodes, said list stored within replication software component 111 (Fig. 1) of active node 110. After that at step 315, HA software component 105 starts monitoring a status and availability of the new stand-by node and its replication of application. Then at step 320, replication software component 111 of active node 110 starts a process of data synchronization of said active node with the new stand-by node. The S3αichronization process takes place in the background and can be processed in parallel with processing various replication API calls (commands, messages and the like) from active application 112 to its replication software component 111. During the synchronization process, the replication software component on the new stand-by node creates the same files and the same memorj^ areas that active application 112 has, and then it copies the contents of said files and the same memory areas. After the synchronization process is completed, the new stand-by node becomes an operational stand-by node like other standby nodes 115, 120 and 125 (Fig. 1), and it can be selected when active node 110 fails.

According to a preferred embodiment of the present invention, for allowing easy zero-copying networking and fast, efficient and easy manipulation (transfer, copy, etc.) of application memory areas, each replication software component 111, 116, 121 and 126 (Fig. 1) is running in the kernel space of the operating system of each node. The zero-copy networking allows bypassing copying of data from the kernel of the operating system installed on the corresponding node to the application running on said node or from said application to said kernel. Assuming that application A runs on one node and wants to send data to application B running on another node, then the following situation happens: a) application A calls the kernel of the operating system of its node and requests a send() call; b) said kernel allocates memory and copies the corresponding sendO call data to said memory; c) said kernel queues said sendO call data in its IP (Internet Protocol) subsystem and returns the control back to application A (at this point the sendQ call is completed). Then, the data is sent by said IP subsystem of said kernel shortly after this moment. By using the zero- copying networking the above steps (a) and (b) can be skipped and the data can be transferred directly to the IP kernel subsystem from the application A to the application B.

The implementation for Linux operating systems exists in the form of a device driver. When the driver is loaded on a given node, one or more applications running on said node can call the API provided by the driver by opening the driver device node and by performing ioctlO system calls of the file descriptor (a key to a kernel data structure containing the details of all open files) of said driver device node. It should be noted that ioctlO is a system call found on Unix-like systems allowing an application to control or communicate with a device driver (device driver is a computer program that enables another program, typically an operating system (OS), to interact with a hardware device) outside the usual read/write of data. Since the ioctlO system call runs the ioctlO handler (the function defined by the driver), the appropriate driver code runs for each ioctlO command. It should be noted that in most operating systems, device driver API can be accessed by applications by means of accessing a device node (a device node is a special file associated with the said device driver) by means of system calls, like "openO", "readO", "writeO", "closeO", "ioctlO", etc.

It should be noted, that when replication software component 111 of active node 110 sends one or more replication messages (calls, commands) to stand-by nodes 115, 120 and 125, a pluggable implementation of transport layer (the fourth layer of the seven layers OSI (Open Systems Interconnection Reference) model - the layer that provides data transfer) is used for enabling efficient memory transfers from the kernel of the operating system installed on the active node to the corresponding kernels of the operating systems installed on stand-by nodes. The pluggable transport layer allows to use various interconnect technologies without changing the code of the driver. The transport layer defines a convenient and familiar socket abstraction API. Socket abstraction assumes that connection channels are presented by abstractions called "sockets". Multiple connection channels are usually available and are usually established by a connectO call. Individual connections are distinguished by different handles (sockets). The socket API accepts a connection socket for various API calls, such as sendO, receiveQ, etc.

It should be noted that the term "pluggable" in the context of the "pluggable transport layer" means that the transport layer API is defined and is static. The replication program code uses this static API, being independent to the network technology to be used. Implementations of this API for various network interconnects may exist and will be called "plug-ins". If such implementations exist, each of them can be "plugged" (loaded without a need to change the rest of the system) into the system, and each replication component will be able to use it without "understanding" how it works and without a need to change its program code. Usually, the plug-ins for the transport layer are implemented as kernel modules (a kernel module is a library code that is inserted into the kernel of the operating system and then other drivers are able to use). According to a preferred embodiment of the present invention, there are several plug-in implementations of the transport layer: for the TCP/IP (Transmission Control Protocol/Internet Protocol), for the SDP (Socket Direct Protocol) on the Infmiband (a highspeed serial computer bus), for the SDP on Myrinet, which is a high-speed local area networking system, and for the Dolphin SCI (Serial Communication Interface) interconnect (http://www.dolphinics.com), which is a low latency high speed computer bus with the RDMA (Remote Direct Memor}^ Access) capability.

According to a preferred embodiment of the present invention, each replication software component 111, 116, 121 and 126 comprises a driver for enabling application 112, replications of the application 117, 122, 127 and HA software component 105 to communicate with said each replication software component. The driver also enables replication software components to communicate with each other by means of sending messages over the transport layer API. Both application 112 and HA software component 105 communicate with said driver by means of "ioctlO" sj^stem calls. When active instance of the application (application 112) communicates with the driver, it requests the data of the corresponding replication of the application.

According to a preferred embodiment of the present invention, application 112 communicates with replication component 111 of its active node 110. Replications of the application 117, 122 and 127 communicate with replication software component of their corresponding stands nodes 115, 120 and 125. Replication software components 111, 116, 121 and 126 can communicate with each other by means of the transport API.

When HA software component 105 communicates with the driver, it can send one ore more following commands (each command can be represented by a separate ioctlO call): a request to the corresponding replication software component to become active. a request to the corresponding replication software component to become stand-by. When requesting from said corresponding replication software component to become stand-by, the network address of the active node is provided. an active replication software component (a replication software component within the active node) may receive a command to add a stand-by node to its list of present stand-by nodes. an active replication software component may receive a command to delete a stand-by node from its list of stand-by nodes. a stand-by replication software component may receive a command informing it that the active software component or active application failed to operate.

According to a preferred embodiment of the present invention, system 100 (Fig. 1) can comprise different active applications or different replications of the application running within each node. For one or more applications, this node can be the active node and for one or more replications of the application, this node can be the stand-bjr node.

When the new stand-by node is added, the driver of replication software component 111 of active node 110 establishes a network connection with the driver of the replication software component of said new stand-by node and adds said new stand-by node to the list of stand-b}^ nodes stored within said replication software component 111. If said new stand-by node is added after taking active application 112 out of a registration call and resuming its flow, the driver of replication software component 111 of active node 110 initiates a kernel thread (a special process that runs entirely in the kernel space of the operating system) that synchronizes said active node 110 with the new stand-by node. The synchronization process instructs the replication software component to send a list and contents of all registered files and memory areas to the new stand-by node.

When active application 112 sends one or more replication API calls to the active replication software component, it transfers with said calls the size and address of the replicated files and memory area (the address is a position in a specified file or memory area). Since each replication software coinponent runs in the kernel space, the file or memory area contents, which have to be replicated, are easily accessed. The replicated information is sent to all stand-by nodes. If a stand-by node fails, HA software component 105 informs the driver of replication software component 111 by means of another "ioctl" call about this event. After that, said driver of said replication software component 111 removes the failed stand-bjr node form its stand-by nodes list.

When communicating with drivers of replication software components of stand-by nodes, HA software component 105 provides to each of them a network address of active node 110. Then, each driver initiates a kernel thread that waits for connection on a network socket from active node 110. After establishing the connection, each driver starts receiving messages (using transport API calls) from active replication software component 110 and then processes them. Since the communication is provided in the kernel space, files and memory of the stand-by nodes can be easily accessed and altered. If active node 110 fails, the HA software component 105 informs about this event by means of another "ioctl" call the driver of each stand-by replication software component. Then, the kernel thread is stopped. Said replication software components leave their stand-by state and wait for further commands from the HA software component 105. When the HA software component selects a new active node, it sends a "become active" ioctlO call to the corresponding replication component of said selected new active node. Each of the replication components of the remaining nodes receives the "become stand-by" call from the HA software component. In addition, said replication components receive the network address of the new active node from said HA software component.

API calls, which can be used by HA software component 105 and active application to communicate with all replication software components are specified below. MRS_REGISTER - this call accepts a file name (in the form of a data string) and adds said file name in a list of registered files (in other words, it registers said file name), according to said data string. By using the MRS_REGISTER call, is created a file on all stand-by nodes, said file having the same size, as the oiiginal file on the active node. This call is used by the active application to call the active replication software component. During the call, active replication software component communicates with all standby software components registered in its list by means of transport API. It should be noted that according to a preferred embodiment of the present invention, when a stand-by node is added by the HA software component to the list of stand-by nodes of the replication component, it means that the network address of the said node is provided to said replication component. Thus, the replication component of the active node always has a list of all stand-by nodes with their network addresses.

MRSJJNREGISTER - this call accepts a file name (in the form of a data string) and removes the file, whose name is provided in said data string, from the list of registered files of each stand-by replication software component (the list is provided within each stand-by replication software component). This call is used by the active application to call the active replication software component. During the call, the active replication software component communicates with stand-by software components registered in the list of stand-by software components, said list provided within the active node, by means of the transport API.

MRS_FILE_RESIZED - this call accepts a file name (in the form of a data string) and informs all stand-by nodes that the file, whose name is provided in said data string, changed its size. Then, the new file size is allocated on all stand-by nodes. This call is used by the active application to call the active replication software component. During the call, the active replication software component communicates with stand-by software components, which are registered in the list of stand-by software components, said list provided within the active node, by means of the transport API.

MRS_REPL - this call accepts a list of memory areas. Each memory area is represented with the corresponding memory address and size of the area. Each memory area of the process specified by the list is replicated to all stand-by nodes. This call is used by the active application to call the active replication software component. During the call, the active replication software component communicates with all stand-by software components registered in the list of stand-by software components, said list provided within the active node, by means of the transport API. This call returns the number of stand-by nodes that have replicated the given changes. The behavior of this call can be optionally configured to block the application (to force it to enter the wait state in this call) until a number of stand-by nodes reaches a predetermined "quorum" requirement. Thus, if a number of functional stand-by nodes is below the specified "quorum" number for the calling application, the application enters the wait state in this call until a required number of stand-by nodes is obtained. When the "quorum" requirement is fulfilled, the call operation is released form the wait state and the call is continued.

MRS_REPL_FILE - this call accepts a list of file areas (a file area is a specified piece of a specified file) of the application. Each file area is presented bjr a file descriptor, an offset within the file and size of the file area. Each file area specified by the list will be replicated to all stand-by nodes. This call is used by the active application to call the active replication software component. During the call, the active replication software component communicates with all stand-by software components registered in its list by means of the transport API. The behavior of this call can be optionally configured to block the application (to force it to enter the wait state in this call) until a number of stand-by nodes reaches a "quorum" requirement. Thus, if a number of functional stand-by nodes is below the specified "quorum" number for the calling application, the application would enter the wait state in this call until a required number of stand-by nodes is obtained. When the "quorum" requirement is fulfilled, the call operation is released form the wait state and the call is continued.

MRS_ACTIVE_UP - this call instructs a replication software component to become active (to assume the "active role"). This call is used by the HA software component to call a replication software component that is neither in the active nor in stand-by state. This call is used by the HA software component after said HA software component selected which instance of the application should become active (which replication of the application should become active).

MRS_PASSIVE_UP - this call accepts a node network address. This call is used by the HA software component to call a replication software component that is either in the active or initialized state. (Initialized state is one that is neither the active nor stand-by state). When calling the active replication software component, it adds to the list of all stand-by nodes a new stand-by node with a given address. When calling a replication software component being in the initialized state, it informs said replication software component of the corresponding node to assume the "stand-by" role, transferring to it the address of the active replication software component. MRS_ACTIVE_DOWN - this call accepts a node network address and is used by the HA software component to call a stand-by replication software component. This call is used when the HA software component detects a failure of the active node or the active application. Then the stand-by replication software component stops the replication process, remaining synchronized with the rest of stand-by nodes, enters the "initialized" state and waits for a command(s) from the HA software component.

MRS_PASSIVE_DOWN - this call accepts a node network address and is used by the HA software component to call either an active or stand-by replication software component. When this call is used on the active node, it informs the replication component of said active node that a stand-by node (specified by the provided address) failed. When this call is used on the stand-by node, it instructs the replication component of said stand-by node to stop the replication process. The stand-by node looses its synchronization with the rest of stand-by nodes, enters the "initialized" state and waits for a command(s) from the HA software component.

MRS_GET_STAT - this call can be activated on each stand-by node, returning the progress value (in percent software components) of the synchronization process of said each stand-by node with the rest of the nodes. This call can be used by the HA software component to select the "most synchronized" stand-by node among all nodes, if there was no completely synchronized node at the time of active node and/or active application failure.

MRS_CHECK_NODE - this call accepts the node address and the time interval, said time interval determined by the timeout value from the corresponding node. This call is used by the HA software component to call either the active or stand-by replication software component. It returns "true" if the called replication software component received one or more messages from the node specified by the given address during the given time interval. Ify determining whether the replication softwai'e components of two given nodes were able to communicate, the HA software component may perform one or more additional checks for detecting one or more failed nodes. Further, MRS_CHECK_NODE call allows the HA software component to select the "most synchronized" node among all nodes when an active node and/or active application failed to operate.

According to a preferred embodiment of the present invention, application 112 calls HA software component 105 by a single call - HA_REGISTERO, providing to said call an application name. This call blocks the application execution until the calling instance of the application (replication of the application) is selected to be active. Upon receiving this call, the HA software component starts monitoring the application and is able to detect the application failure. It should be noted that stand-by applications that are blocked are also monitored. Thus, if one or more stand-by applications fail to operate, then HA software component would detect their failure.

An Example of the API Calls flow a) Application instance 110 is started on a single active node. At it starts, said application 110 registers itself with HA software component 105 by sending to said HA software component 105 the HA_REGISTERO call. Since the calling instance is a single application 110 instance, the HA software component 105 selects this instance to be active. Then, HA software component 105 sends the MRS_ACTIVE_UP command to replication software component 111, and as a result, replication software component 111 assumes the "active role". The list of stand-by nodes in the replication software component is initialized to be empty. After that, HA software component 105 takes application 112 out from a registration call. Upon completion of the HAJREGISTERO call, application instance 112 becomes an active instance of the application and resumes its flow. b) Application 110 starts its main flow. It opens a file ("the first file") on active node 110 that has, for example, a list of transactions to complete. The file is registered with replication software component 111 by means of the MRS_REGISTER command. c) Application 110 opens (or creates if the said file is absent) another file ("the second file") stored on active node 110, wherein said application 110 will write later a list of processed transactions. This file is also registered with replication software component 111 by means of the MRS_REGISTER command. Since the second file is empty, the application "realizes" that it has to start the transaction processing from the beginning of the first file stored on active node 110. d) To achieve an easier data manipulation (data transfer, data replication, etc.), application 110 maps both opened files to the memory (it allocates for each file a memory address). From now on, when reading or changing files contents, the application works with the memory addresses of said files. e) Each time the application completes a transaction (it reads what to do from the first file), it writes some data to the second file (indicating that said transaction is processed). After finishing writing the data to the second file, the application calls the active replication software component by means of the MRSJREPL call, transferring to said replication software component the address and the size of said data change in the second file. However, if the active replication software component has an empty list of stand-by nodes, this call does not have any effect. f) A second instance of the application, which has to become a replication of the application, is started on a stand-by node. Suppose, for example, that replication of the application 117 starts on stand-by node 115. The application calls HA software component 105 by the HA_RE GI STERO call. Since an active instance of the application 112 is already running on active node 110, the execution of replication of the application 117 is blocked within this call by HA software component 105 (replication of the application 117 enters a wait state). Replication of the application 117 is remaining inside a wait state within the HA_REGISTERO call until it receives an appropriate command to come out (to return) from said call. HA software component 105 also calls replication software component 116 of stand-by node 115 by the MRS_PASSIVE_UP call. As a result, replication software component 116 of stand-by node 115 assumes the "stand-by role" and becomes ready for receiving messages from replication software component 111 of active node 110. g) HA software component 105 also calls replication software component 111 of active node 110 by the MRS_PAS SIVE-UP call. Due to this call, the network address of stand-by node 115 is added to the list of stand-by nodes stored within the active replication software component and HA software component. h) Replication software component 111 of active node 110 starts in the background a process of data synchronization of the above two files (the first and second files) that were registered by the application. During this synchronization process, both files are created on stand-by node 115 and the contents of these files are copied to said stand-by node 115. The progress of said data synchronization process is recorded by replication software components 111 and 116 of both active node 110 and stand-by node 115. When the data synchronization process is completed, the replication software components on both active node 110 and stand-by node 115 are marked to be in a synchronized state. i) HA software component 105 may query replication software components 111 and 116 for the progress of the data synchronization process by sending the MRS_GETJSTAT call. j) After replication software component 111 receives the MRS_PASSIVE_UP call, whenever application completes a transaction and calls replication software component 111 by the MRS_REPL call, replication software component 111 sends the change of the modified second file to replication software component 116 of stand-by node 115 and waits for the acknowledgment from said replication software component 116 that said change was propagated. k) Suppose that some time after completing the synchronization process, active application 112 fails. Then, HA software component 105 detects this event and sends the MRS-ACTIVE JDO WN call to replication software component 116 of stand-by node 115.

1) HA software component 105 starts selecting a new active node. Since that there is a single stand-by instance of the application (for this example, there is only one replication of the application - 117), HA software component 105 selects stand-by node 115 and sends the MRS_ACTIVE_UP call to replication component 116 of said stand-by node 115. As a result, replication component 116 assumes the "active role". m) Upon completion of the MRS_ACTIVE_UP call, HA software component 105 forces replication of the application 117 to come out (to return) from the waiting cycle loop of the HAJREGISTERO call. n) New active application 117 starts its execution by opening the above two files (the first and second files). The contents of both file on new active node 115 is identical to content of these files on node 110, as it was at the time of node 110 failure. Since the second file is not empty, the new active application 117 "realizes" that it has to resume the processing of the list of transactions from the point indicated in the second file. Thus, the new active application 117 is processed (executed) from the same point of the program code that the previous active application 112 could have been executed, if it had not failed.

While some embodiments of the invention have been described by way of illustration, it will be apparent that the invention can be put into practice with many modifications, variations and adaptations, and with the use of numerous equivalents or alternative solutions that are within the scope of persons skilled in the art, without departing from the spirit of the invention or exceeding the scope of the claims.

Claims

1. A system for allowing an application to resume its operation on a new node of a data network after an original application and/or an original node fail to operate over said data network, comprising: a. an active node comprising an active application and an active replication software component for replicating changes of said active application; b. one or more stand-by nodes, each of which comprising a replication of said application and a stand-by replication software component, for replicating changes of said active application to its corresponding stand-by node, said changes received from said active replication software component; and c. an HA software component for: c.l. instructing each of all nodes to become active or stand-by; c.2. monitoring the availability of said active node and said one or more stand-by nodes and the availability of said active application and said replication(s) of application; and c.3. controlling and communicating with said active and said one or more stand-by nodes.

2. System according to claim 1, wherein the active and stand-by replication software components run in the kernel space of the operating system within each node.

3. A method for allowing an application to resume its operation on a new node of a data network after the original application and/or original node fail to operate over said data network, comprising: a. independently starting two or more instances of an application on two or more corresponding nodes; b. registering said instances of the application with an HA software component; c. selecting, by means of said HA software component, one node to be an active node and the remaining nodes to be stand-by nodes; d. providing a network address of said active node to each replication software component of said one or more stand-by nodes; e. transferring a list of the one or more stand-by nodes to a replication unit of said active node and taking the instance of the application on said active node out of a registration call and causing said instance to become active; and f. monitoring the status of said active node and one or more stand-by nodes by means of said HA software component.

4. Method according to claim 3, further comprising detecting a failure of the active node or a failure of the instance of an application on said active node by means of the HA software component.

5. Method according to claim 3, further comprising selecting a new active node by means of the HA software component.

6. Method according to claim 5, further comprising transferring a new list of the one or more stand-by nodes to the replication software component of the new active node and taking the instance of an application on said new active node out of a registration call.

7. Method according to claim 5, further comprising providing a network address of the new active node to each replication software component of the one or more stand-by nodes.

8. Method according to claim 3, further comprising adding a new node after taking the active instance of an application out of the registration call and after resuming a flow of said active instance of an application, by: a. assigning a new node as a stand-by node by means of the HA software component; b. providing the network address of the active node to a replication software component of said node; and c. monitoring the status of the new stand-by node bj^ means of the HA software component.

9. Method according to claim 8, further comprising starting a process of data synchronization of the new stand-by node with the rest of nodes.

10. Method according to claim 3, further comprising leaving the application in a stand-by state in the registration call state until a number of stand-by nodes reaches a predetermined quorum requirement.