WO2023198276A1 - Handling a failure of an application instance - Google Patents

Handling a failure of an application instance

Info

Publication number
WO2023198276A1
WO2023198276A1, PCT/EP2022/059747, EP2022059747W
Authority
WO
WIPO (PCT)
Prior art keywords
application
entity
application instance
restoration
stages
Prior art date
Application number
PCT/EP2022/059747
Other languages
English (en)
Inventor
Dániel GÉHBERGER
Péter MÁTRAY
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2022/059747 priority Critical patent/WO2023198276A1/fr
Publication of WO2023198276A1 publication Critical patent/WO2023198276A1/fr


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1479Generic software techniques for error detection or fault masking
    • G06F11/1482Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions

Definitions

  • the disclosure relates to methods for handling a failure of an application instance and entities configured to operate in accordance with those methods.
  • failure handling can be costly. For example, failure handling can mean more complex and more expensive virtual machines (VMs), increased network usage, and increased input/output (I/O).
  • Many cloud-based services are designed to run in multiple application instances that, in turn, form clusters. The clustering of application instances is necessary to distribute workload among the application instances and to provide resiliency against failures. Typically, the failure of a subset of application instances does not impact the operation of the service in general.
  • Failures in a cloud system can occur due to various reasons ranging from programming errors to infrastructure failure.
  • the failures can be permanent or temporary from the perspective of the service.
  • Some failures, such as a short outage in the networking infrastructure (e.g. due to an overloaded switch), may be considered temporary for most applications since all states are kept intact in the application instances and it is likely that only a short synchronisation will be necessary after the networking issue is resolved.
  • Some other failures, such as a programming error causing an automated restart of an application instance can be temporary for applications with state information persistently stored (e.g. in a memory, such as on a solid state drive (SSD) or a disk) or easily recoverable by other means (e.g. via recalculation or short synchronization with other application instances).
  • the same application instance restart can be considered to be a permanent failure from the perspective of an application instance with a significant amount of lost state information (e.g. an in-memory database).
  • recovery is still necessary to restore the cluster of application instances.
  • existing techniques for handling application instance failures are generally inflexible and there is no account taken of any specific aspects associated with the application instance failure. This can have a negative impact on the efficiency of handling application instance failures and can waste resources.
  • a first method for handling a failure of a first application instance of a first application is performed by a first entity.
  • the first method comprises initiating transmission of a first message towards a second entity in response to identifying the failure of the first application instance.
  • the first message comprises information indicative of a plurality of restoration stages to execute to restore the failed first application instance for the second entity to generate a plan for restoring the failed first application instance.
  • the plan comprises the plurality of restoration stages and a timing for executing the plurality of restoration stages.
  • a second method for handling a failure of a first application instance of a first application is performed by a second entity.
  • the second method comprises generating a plan for restoring the failed first application instance in response to receiving a first message from a first entity.
  • the first message comprises information indicative of a plurality of restoration stages to execute to restore the failed first application instance.
  • the plan comprises the plurality of restoration stages and a timing for executing the plurality of restoration stages.
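The first message and the resulting plan described above can be sketched as simple data structures. The following Python sketch is illustrative only: the names (`RestorationStage`, `FirstMessage`, `RestorationPlan`, `generate_plan`) and the fixed spacing between stages are assumptions, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RestorationStage:
    # one stage of the restoration (e.g. a restart or a resynchronisation)
    name: str

@dataclass
class FirstMessage:
    # sent by the first entity on identifying the failure; carries the
    # restoration stages to execute to restore the failed instance
    application_instance_id: str
    stages: List[RestorationStage]

@dataclass
class RestorationPlan:
    # generated by the second entity: the stages plus a timing for each
    stages: List[RestorationStage]
    timings: List[float]  # seconds of delay before each stage is executed

def generate_plan(msg: FirstMessage, base_delay: float = 5.0) -> RestorationPlan:
    """Toy planner: schedule the stages in order, spaced by a fixed delay."""
    timings = [base_delay * i for i in range(len(msg.stages))]
    return RestorationPlan(stages=list(msg.stages), timings=timings)
```

A real planner would derive the timing from the parameters carried in the first message rather than a fixed delay; the fixed spacing here only illustrates that the plan pairs each stage with a timing.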
  • a method performed by a system, the method comprising the first method and the second method.
  • a first entity comprising processing circuitry configured to operate in accordance with the first method.
  • the first entity may comprise at least one memory for storing instructions which, when executed by the processing circuitry, cause the first entity to operate in accordance with the first method.
  • a second entity comprising processing circuitry configured to operate in accordance with the second method.
  • the second entity may comprise at least one memory for storing instructions which, when executed by the processing circuitry, cause the second entity to operate in accordance with the second method.
  • a system comprising at least one first entity as described earlier and at least one second entity as described earlier.
  • a computer program comprising instructions which, when executed by processing circuitry, cause the processing circuitry to perform the first method and/or the second method.
  • a computer program product embodied on a non-transitory machine-readable medium, comprising instructions which are executable by processing circuitry to cause the processing circuitry to perform the first method and/or the second method.
  • Figure 1 is a block diagram illustrating a first entity according to an embodiment
  • Figure 2 is a block diagram illustrating a method performed by a first entity according to an embodiment
  • Figure 3 is a block diagram illustrating a second entity according to an embodiment
  • Figure 4 is a block diagram illustrating a method performed by a second entity according to an embodiment
  • Figure 5 is a schematic illustration of a system according to an embodiment
  • Figure 6 is a signalling diagram illustrating an exchange of signals in a system according to an embodiment
  • Figures 7 to 11 are block diagrams illustrating methods performed according to some embodiments.
  • Figures 12A-12E are schematic illustrations of example scenarios.
  • the techniques can be performed by a first entity and a second entity.
  • the first entity and the second entity described herein may communicate with each other, e.g. over a communication channel, to implement the techniques described herein.
  • the first entity and the second entity may communicate over the cloud.
  • the techniques described herein can be implemented in the cloud according to some embodiments.
  • the techniques described herein can be computer-implemented.
  • Figure 1 illustrates a first entity 10a, 10b in accordance with an embodiment.
  • the first entity 10a, 10b is for handling a failure of a first application instance of a first application.
  • the first entity 10a, 10b referred to herein may be an entity that is responsible for handling the failure of application instances of the first application only.
  • the first entity 10a, 10b referred to herein may be an entity that is responsible for handling the failure of application instances of the first application and one or more other applications.
  • the first entity 10a, 10b referred to herein can refer to equipment capable, configured, arranged and/or operable to communicate directly or indirectly with the second entity referred to herein, and/or with other entities or equipment to enable and/or to perform the functionality described herein.
  • the first entity 10a, 10b referred to herein may be a physical entity (e.g. a physical machine) or a virtual entity (e.g. a virtual machine, VM).
  • the first entity 10a, 10b comprises processing circuitry (or logic) 12.
  • the processing circuitry 12 controls the operation of the first entity 10a, 10b and can implement the method described herein in respect of the first entity 10a, 10b.
  • the processing circuitry 12 can be configured or programmed to control the first entity 10a, 10b in the manner described herein.
  • the processing circuitry 12 can comprise one or more hardware components, such as one or more processors, one or more processing units, one or more multi-core processors and/or one or more modules.
  • each of the one or more hardware components can be configured to perform, or is for performing, individual or multiple steps of the method described herein in respect of the first entity 10a, 10b.
  • the processing circuitry 12 can be configured to run software to perform the method described herein in respect of the first entity 10a, 10b.
  • the software may be containerised according to some embodiments.
  • the processing circuitry 12 may be configured to run a container to perform the method described herein in respect of the first entity 10a, 10b.
  • the processing circuitry 12 of the first entity 10a, 10b is configured to initiate transmission of a first message towards a second entity in response to identifying the failure of the first application instance.
  • the first message comprises information indicative of a plurality of restoration stages to execute to restore the failed first application instance for the second entity to generate a plan for restoring the failed first application instance.
  • the plan comprises the plurality of restoration stages and a timing for executing the plurality of restoration stages.
  • the first entity 10a, 10b may optionally comprise a memory 14.
  • the memory 14 of the first entity 10a, 10b can comprise a volatile memory or a non-volatile memory.
  • the memory 14 of the first entity 10a, 10b may comprise a non-transitory medium. Examples of the memory 14 of the first entity 10a, 10b include, but are not limited to, a random access memory (RAM), a read only memory (ROM), a mass storage medium such as a hard disk, a removable storage medium such as a compact disk (CD) or a digital video disk (DVD), and/or any other memory.
  • the processing circuitry 12 of the first entity 10a, 10b can be communicatively coupled (e.g. connected) to the memory 14 of the first entity 10a, 10b.
  • the memory 14 of the first entity 10a, 10b may be for storing program code or instructions which, when executed by the processing circuitry 12 of the first entity 10a, 10b, cause the first entity 10a, 10b to operate in the manner described herein in respect of the first entity 10a, 10b.
  • the memory 14 of the first entity 10a, 10b may be configured to store program code or instructions that can be executed by the processing circuitry 12 of the first entity 10a, 10b to cause the first entity 10a, 10b to operate in accordance with the method described herein in respect of the first entity 10a, 10b.
  • the memory 14 of the first entity 10a, 10b can be configured to store any information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • the processing circuitry 12 of the first entity 10a, 10b may be configured to control the memory 14 of the first entity 10a, 10b to store information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • the first entity 10a, 10b may optionally comprise a communications interface 16.
  • the communications interface 16 of the first entity 10a, 10b can be communicatively coupled (e.g. connected) to the processing circuitry 12 of the first entity 10a, 10b and/or the memory 14 of the first entity 10a, 10b.
  • the communications interface 16 of the first entity 10a, 10b may be operable to allow the processing circuitry 12 of the first entity 10a, 10b to communicate with the memory 14 of the first entity 10a, 10b and/or vice versa.
  • the communications interface 16 of the first entity 10a, 10b may be operable to allow the processing circuitry 12 of the first entity 10a, 10b to communicate with any one or more of the other entities (e.g. the second entity 20) referred to herein.
  • the communications interface 16 of the first entity 10a, 10b can be configured to transmit and/or receive information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • the processing circuitry 12 of the first entity 10a, 10b may be configured to control the communications interface 16 of the first entity 10a, 10b to transmit and/or receive information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • Although the first entity 10a, 10b is illustrated in Figure 1 as comprising a single memory 14, it will be appreciated that the first entity 10a, 10b may comprise at least one memory (i.e. a single memory or a plurality of memories) 14 that operates in the manner described herein.
  • Although the first entity 10a, 10b is illustrated in Figure 1 as comprising a single communications interface 16, it will be appreciated that the first entity 10a, 10b may comprise at least one communications interface (i.e. a single communications interface or a plurality of communications interfaces) 16 that operates in the manner described herein.
  • Figure 1 only shows the components required to illustrate an embodiment of the first entity 10a, 10b and, in practical implementations, the first entity 10a, 10b may comprise additional or alternative components to those shown.
  • Figure 2 illustrates a first method performed by the first entity 10a, 10b in accordance with an embodiment.
  • the first method can be computer-implemented.
  • the first method is for handling a failure of a first application instance of a first application.
  • the first entity 10a, 10b described earlier with reference to Figure 1 can be configured to operate in accordance with the first method of Figure 2.
  • the first method can be performed by or under the control of the processing circuitry 12 of the first entity 10a, 10b according to some embodiments.
  • transmission of a first message is initiated towards a second entity in response to identifying the failure of the first application instance.
  • More specifically, the first entity 10a, 10b (e.g. the processing circuitry 12 of the first entity 10a, 10b) can be configured to initiate transmission of the first message (e.g. via the communications interface 16 of the first entity 10a, 10b).
  • the term “initiate” can mean, for example, cause or establish.
  • the first message comprises information indicative of a plurality of restoration stages to execute to restore the failed first application instance for the second entity to generate a plan for restoring the failed first application instance.
  • the plan comprises the plurality of restoration stages and a timing for executing the plurality of restoration stages.
  • the plan may also be referred to herein as a restoration plan.
  • the terms “restoration”, “restore”, “restoring”, and “restored” can be used interchangeably with the terms “recovery”, “recover”, “recovering”, and “recovered” respectively.
  • the failed first application instance may be considered to be restored in various ways. For example, in the case of the first application disappearing (e.g. going offline) due to its failure, the failed first application instance may be considered to be restored when it reappears (e.g. comes back online). Similarly, for example, in the case of the first application becoming inoperative due to its failure, the failed first application instance may be considered to be restored when it becomes operative again.
  • the first method may comprise executing at least a first restoration stage of the plurality of restoration stages according to the plan in response to receiving a second message from the second entity.
  • More specifically, the first entity 10a, 10b (e.g. the processing circuitry 12 of the first entity 10a, 10b) can be configured to execute at least the first restoration stage of the plurality of restoration stages according to the plan in response to receiving the second message.
  • the second message may comprise an instruction for the first entity 10a, 10b to execute at least the first restoration stage of the plurality of restoration stages according to the plan.
  • the first method may comprise receiving the second message from the second entity. More specifically, in some embodiments, the first entity 10a, 10b (e.g. the processing circuitry 12 of the first entity 10a, 10b) can be configured to receive the second message (e.g. via the communications interface 16 of the first entity 10a, 10b).
  • the first method may comprise initiating transmission of a third message towards the second entity.
  • More specifically, the first entity 10a, 10b (e.g. the processing circuitry 12 of the first entity 10a, 10b) can be configured to initiate transmission of (e.g. itself transmit, such as via the communications interface 16 of the first entity 10a, 10b, or cause another entity to transmit) the third message according to some embodiments.
  • the third message may comprise information indicative that at least the first restoration stage is complete.
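The exchange of the first, second, and third messages described above can be sketched as follows. This is a minimal illustration under assumed message formats (plain dictionaries) and entity classes; it is not the disclosed protocol itself.

```python
class SecondEntity:
    """Generates a restoration plan on receipt of a first message."""
    def on_first_message(self, first_message):
        stages = first_message["stages"]
        # the plan comprises the restoration stages and a timing for each
        self.plan = {"stages": stages,
                     "timing": [5.0 * i for i in range(len(stages))]}
        # the second message instructs the first entity to execute at least
        # the first restoration stage according to the plan
        return {"type": "second", "plan": self.plan}

class FirstEntity:
    """Identifies the failure and executes restoration stages."""
    def on_failure(self, instance_id, stages):
        # first message: the stages to execute to restore the failed instance
        return {"type": "first", "instance": instance_id, "stages": stages}

    def on_second_message(self, second_message):
        executed = second_message["plan"]["stages"][0]
        # third message: report that at least the first stage is complete
        return {"type": "third", "complete": [executed]}

first, second = FirstEntity(), SecondEntity()
m1 = first.on_failure("app-1", ["restart", "resync"])
m2 = second.on_first_message(m1)
m3 = first.on_second_message(m2)
```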
  • the first method may comprise, in response to receiving a notification that the first application instance is restored independently of the execution of the plurality of restoration stages, initiating cancellation of the execution of the plurality of restoration stages if the notification is received before execution of the plurality of restoration stages, or allowing the execution of the plurality of restoration stages to continue or initiating termination of the execution of the plurality of restoration stages if the notification is received during execution of the plurality of restoration stages, or initiating removal of the independently restored first application instance if the notification is received after completion of the execution of the plurality of restoration stages.
  • More specifically, the first entity 10a, 10b (e.g. the processing circuitry 12 of the first entity 10a, 10b) can be configured to initiate cancellation of (e.g. itself cancel or cause another entity to cancel) the execution of the plurality of restoration stages if the notification is received before execution of the plurality of restoration stages, and/or allow the execution of the plurality of restoration stages to continue or initiate termination of (e.g. itself terminate or cause another entity to terminate) the execution of the plurality of restoration stages if the notification is received during execution of the plurality of restoration stages, and/or initiate removal of (e.g. itself remove or cause another entity to remove) the independently restored first application instance if the notification is received after completion of the execution of the plurality of restoration stages.
  • removal of the independently restored first application instance may be initiated after recovery of any data from the independently restored first application instance that is not otherwise available.
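The three cases above (notification received before, during, or after execution of the restoration stages) can be sketched as a small decision function. The phase labels and action names are illustrative assumptions; in particular, for the "during" case the disclosure allows either continuing or terminating, and terminating is used here only as one reasonable default.

```python
def handle_independent_restoration(phase, recover_data=None):
    """React to a notification that the instance was restored on its own.

    phase: 'before', 'during', or 'after' execution of the restoration stages.
    """
    if phase == "before":
        # nothing has run yet, so the planned stages can simply be cancelled
        return "cancel"
    if phase == "during":
        # continuing is also permitted; terminating is the default chosen here
        return "terminate"
    if phase == "after":
        # recover any data only present in the independently restored
        # instance, then remove the now-duplicate instance
        if recover_data is not None:
            recover_data()
        return "remove_duplicate"
    raise ValueError(f"unknown phase: {phase!r}")
```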
  • the first method may comprise receiving the notification that the first application instance is restored independently of the execution of the plurality of restoration stages. More specifically, in some embodiments, the first entity 10a, 10b (e.g. the processing circuitry 12 of the first entity 10a, 10b) can be configured to receive this notification (e.g. via the communications interface 16 of the first entity 10a, 10b).
  • the first method may comprise initiating transmission of a notification towards the second entity.
  • More specifically, the first entity 10a, 10b (e.g. the processing circuitry 12 of the first entity 10a, 10b) can be configured to initiate transmission of (e.g. itself transmit, such as via the communications interface 16 of the first entity 10a, 10b, or cause another entity to transmit) the notification according to some embodiments.
  • the notification may be that the first application instance is restored independently of the execution of the plurality of restoration stages.
  • the first application instance may be restored independently of the execution of the plurality of restoration stages when it is restored naturally. That is, the first application instance may be restored without intervention. For example, in the case of the first application disappearing (e.g. going offline) due to its failure, the failed first application instance may be considered to be restored naturally when it reappears (e.g. comes back online).
  • Figure 3 illustrates a second entity 20 in accordance with an embodiment.
  • the second entity 20 is for handling a failure of a first application instance of a first application.
  • the second entity 20 referred to herein can refer to equipment capable, configured, arranged and/or operable to communicate directly or indirectly with the first entity 10a, 10b referred to herein, and/or with other entities or equipment to enable and/or to perform the functionality described herein.
  • the second entity 20 referred to herein may be a physical entity (e.g. a physical machine) or a virtual entity (e.g. a virtual machine, VM).
  • the second entity 20 comprises processing circuitry (or logic) 22.
  • the processing circuitry 22 controls the operation of the second entity 20 and can implement the method described herein in respect of the second entity 20.
  • the processing circuitry 22 can be configured or programmed to control the second entity 20 in the manner described herein.
  • the processing circuitry 22 can comprise one or more hardware components, such as one or more processors, one or more processing units, one or more multi-core processors and/or one or more modules.
  • each of the one or more hardware components can be configured to perform, or is for performing, individual or multiple steps of the method described herein in respect of the second entity 20.
  • the processing circuitry 22 can be configured to run software to perform the method described herein in respect of the second entity 20.
  • the software may be containerised according to some embodiments.
  • the processing circuitry 22 may be configured to run a container to perform the method described herein in respect of the second entity 20.
  • the processing circuitry 22 of the second entity 20 is configured to generate a plan for restoring the failed first application instance in response to receiving a first message from a first entity 10a, 10b.
  • the first message comprises information indicative of a plurality of restoration stages to execute to restore the failed first application instance.
  • the plan comprises the plurality of restoration stages and a timing for executing the plurality of restoration stages.
  • the second entity 20 may optionally comprise a memory 24.
  • the memory 24 of the second entity 20 can comprise a volatile memory or a non-volatile memory.
  • the memory 24 of the second entity 20 may comprise a non-transitory medium. Examples of the memory 24 of the second entity 20 include, but are not limited to, a random access memory (RAM), a read only memory (ROM), a mass storage medium such as a hard disk, a removable storage medium such as a compact disk (CD) or a digital video disk (DVD), and/or any other memory.
  • the processing circuitry 22 of the second entity 20 can be communicatively coupled (e.g. connected) to the memory 24 of the second entity 20.
  • the memory 24 of the second entity 20 may be for storing program code or instructions which, when executed by the processing circuitry 22 of the second entity 20, cause the second entity 20 to operate in the manner described herein in respect of the second entity 20.
  • the memory 24 of the second entity 20 may be configured to store program code or instructions that can be executed by the processing circuitry 22 of the second entity 20 to cause the second entity 20 to operate in accordance with the method described herein in respect of the second entity 20.
  • the memory 24 of the second entity 20 can be configured to store any information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • the processing circuitry 22 of the second entity 20 may be configured to control the memory 24 of the second entity 20 to store information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • the second entity 20 may optionally comprise a communications interface 26.
  • the communications interface 26 of the second entity 20 can be communicatively coupled (e.g. connected) to the processing circuitry 22 of the second entity 20 and/or the memory 24 of the second entity 20.
  • the communications interface 26 of the second entity 20 may be operable to allow the processing circuitry 22 of the second entity 20 to communicate with the memory 24 of the second entity 20 and/or vice versa.
  • the communications interface 26 of the second entity 20 may be operable to allow the processing circuitry 22 of the second entity 20 to communicate with any one or more of the other entities (e.g. the first entity 10a, 10b) referred to herein.
  • the communications interface 26 of the second entity 20 can be configured to transmit and/or receive information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • the processing circuitry 22 of the second entity 20 may be configured to control the communications interface 26 of the second entity 20 to transmit and/or receive information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • Although the second entity 20 is illustrated in Figure 3 as comprising a single memory 24, it will be appreciated that the second entity 20 may comprise at least one memory (i.e. a single memory or a plurality of memories) 24 that operates in the manner described herein.
  • Although the second entity 20 is illustrated in Figure 3 as comprising a single communications interface 26, it will be appreciated that the second entity 20 may comprise at least one communications interface (i.e. a single communications interface or a plurality of communications interfaces) 26 that operates in the manner described herein.
  • Figure 3 only shows the components required to illustrate an embodiment of the second entity 20 and, in practical implementations, the second entity 20 may comprise additional or alternative components to those shown.
  • Figure 4 illustrates a second method performed by the second entity 20 in accordance with an embodiment.
  • the second method can be computer-implemented.
  • the second method is for handling a failure of a first application instance of a first application.
  • the second entity 20 described earlier with reference to Figure 3 can be configured to operate in accordance with the second method of Figure 4.
  • the second method can be performed by or under the control of the processing circuitry 22 of the second entity 20 according to some embodiments.
  • a plan for restoring the failed first application instance is generated in response to receiving a first message from a first entity 10a, 10b.
  • More specifically, the second entity 20 (e.g. the processing circuitry 22 of the second entity 20) can be configured to generate the plan in response to receiving the first message.
  • the first message comprises information indicative of a plurality of restoration stages to execute to restore the failed first application instance.
  • the plan comprises the plurality of restoration stages and a timing for executing the plurality of restoration stages.
  • the second method may comprise receiving the first message from the first entity 10a, 10b.
  • More specifically, the second entity 20 (e.g. the processing circuitry 22 of the second entity 20) can be configured to receive the first message (e.g. via the communications interface 26 of the second entity 20).
  • the second method may comprise initiating execution of at least a first restoration stage of the plurality of restoration stages according to the plan.
  • More specifically, the second entity 20 (e.g. the processing circuitry 22 of the second entity 20) can be configured to initiate execution of (e.g. itself execute or cause another entity to execute) at least the first restoration stage of the plurality of restoration stages according to some embodiments.
  • initiating execution of at least the first restoration stage of the plurality of restoration stages according to the plan may comprise initiating transmission of a second message towards the first entity 10a, 10b.
  • More specifically, the second entity 20 (e.g. the processing circuitry 22 of the second entity 20) can be configured to initiate transmission of the second message (e.g. via the communications interface 26 of the second entity 20).
  • the second message may comprise an instruction for the first entity 10a, 10b to execute at least the first restoration stage of the plurality of restoration stages according to the plan.
  • the second method may comprise receiving a third message from the first entity 10a, 10b.
  • More specifically, the second entity 20 (e.g. the processing circuitry 22 of the second entity 20) can be configured to receive the third message (e.g. via the communications interface 26 of the second entity 20).
  • the third message may comprise information indicative that at least the first restoration stage is complete.
  • the second method may comprise receiving a notification from the first entity 10a, 10b.
  • More specifically, the second entity 20 (e.g. the processing circuitry 22 of the second entity 20) can be configured to receive the notification (e.g. via the communications interface 26 of the second entity 20).
  • the notification can be a notification that the first application instance is restored independently of the execution of the plurality of restoration stages.
  • the plan referred to herein may comprise any one or more of information indicative of a tolerance of the first application to the plurality of restoration stages, information indicative of one or more resources required (i.e. a resource requirement) to execute the plurality of restoration stages, information indicative of a cost associated with executing one or more of the restoration stages, information indicative of a priority with which one or more of the restoration stages are to be executed, and information indicative of a priority with which the first application instance is to be restored relative to a second application instance of a second application.
  • the tolerance, resource requirement, and priorities mentioned can be referred to herein as parameters.
  • One or more of these parameters are associated with the plurality of restoration stages or, more specifically, with each restoration stage of the plurality of restoration stages.
  • one or more parameters associated with one restoration stage of the plurality of restoration stages may be different to one or more parameters associated with another restoration stage of the plurality of restoration stages.
  • a first stage of the plurality of restoration stages may require more resources than a second stage, etc.
  • the information referred to herein that is indicative of the tolerance of the first application to the plurality of restoration stages can be information indicative of whether the first application can tolerate the plurality of restoration stages. For example, if the information is indicative that the first application cannot tolerate one or more restoration stages of the plurality of restoration stages, then these one or more restoration stages can be skipped.
  • the failure of the first application referred to herein can be due to a software failure and/or a hardware failure.
  • the information referred to herein that is indicative of the tolerance of the first application to the plurality of restoration stages can comprise information indicative of the tolerance of the first application to one or more restoration stages of the plurality of restoration stages that are aimed at resolving a software failure that may have contributed to the failure of the first application instance and/or the tolerance of the first application to one or more restoration stages of the plurality of restoration stages that are aimed at resolving a hardware failure that may have contributed to the failure of the first application instance.
  • the one or more restoration stages aimed at resolving a software failure can, for example, comprise one or more software restarts.
  • the one or more restoration stages aimed at resolving a hardware failure can, for example, comprise one or more hardware restarts.
  • the priority with which the first application instance is to be restored may be based on a first measure of a criticality of the first application instance to the first application relative to a second measure of a criticality of the second application instance to the second application. In other embodiments, the priority with which the first application instance is to be restored may be based on a total number of application instances that exist for the first application relative to a total number of application instances that exist for the second application. In some embodiments, the priority with which the first application instance is to be restored may be higher than a priority with which the second application instance is to be restored if the first measure and the second measure are indicative that the first application instance is more critical to the first application than the second application instance is to the second application.
  • the priority with which the first application instance is to be restored may be lower than the priority with which the second application instance is to be restored if the first measure and the second measure are indicative that the first application instance is less critical to the first application than the second application instance is to the second application.
  • the priority with which the first application instance is to be restored may be higher than a priority with which the second application instance is to be restored if the total number of application instances that exist for the first application is less than the total number of application instances that exist for the second application. In other embodiments, the priority with which the first application instance is to be restored may be lower than a priority with which the second application instance is to be restored if the total number of application instances that exist for the first application is more than the total number of application instances that exist for the second application.
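A minimal sketch of these two prioritisation rules, assuming hypothetical numeric criticality measures and instance counts:

```python
def restore_first(crit_a, crit_b, instances_a, instances_b, use_criticality=True):
    """Decide which of two failed application instances to restore first.

    Under the criticality rule, the instance that is more critical to its
    application goes first; under the instance-count rule, the application
    with fewer remaining instances goes first. Returns "A" or "B".
    Hypothetical helper, illustrating the priority rules described above.
    """
    if use_criticality:
        return "A" if crit_a >= crit_b else "B"
    return "A" if instances_a <= instances_b else "B"
```

For example, an instance holding the only copy of critical data (high criticality) would be restored before an instance of an application that still has many healthy replicas.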
  • the timing referred to herein for execution of the plurality of restoration stages may be set to start on expiry of a predefined time period.
  • the predefined time period can be a time period during which the first application instance is expected to be restored independently of execution of the plurality of restoration stages.
  • the timing may comprise a time period assigned to each restoration stage of the plurality of restoration stages and the time period assigned to each restoration stage can be indicative of a time period on expiry of which the restoration stage is to start.
  • the timing may be set based on any one or more of historical data on a past failure of the first application instance and historical data on a past failure of a second application instance of a second application.
  • the past failure of the second application instance may be identified as similar to the failure of the first application instance.
  • the past failure of the second application instance may be identified as similar to the failure of the first application instance based on a comparison of monitored information about the first application instance and monitored information about the second application instance.
  • the historical data on the past failure of the first application instance may comprise an elapsed time from detection of the past failure of the first application instance to restoration of the first application instance from the past failure.
  • the historical data on the past failure of the second application instance may comprise an elapsed time from detection of the past failure of the second application instance to restoration of the second application instance from the past failure.
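One possible way to set the timing from such historical data is sketched below; the median-based estimate, margin factor, and default wait are illustrative assumptions, not prescribed values.

```python
def estimate_wait_before_restoration(history_s, default_wait_s=30.0, margin=1.2):
    """Set the predefined time period on expiry of which stage execution starts.

    history_s: elapsed times (in seconds) from detection to independent
    restoration observed for similar past failures. If similar failures
    tended to recover on their own, wait slightly longer than the typical
    recovery time before spending resources on restoration stages.
    """
    if not history_s:
        return default_wait_s          # no history: fall back to a default
    history_sorted = sorted(history_s)
    median = history_sorted[len(history_sorted) // 2]
    return median * margin             # allow some headroom beyond the median
```

This reflects the idea above that the predefined time period can be a period during which the instance is expected to be restored independently of the restoration stages.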
  • the system can comprise at least one first entity 10a, 10b as described earlier with reference to Figure 1 and at least one second entity 20 as described earlier with reference to Figure 3.
  • a method performed by the system can comprise the first method described earlier with reference to Figure 2 and the second method described earlier with reference to Figure 4.
  • Figure 5 illustrates a system in accordance with an embodiment.
  • the system comprises two first entities 10a, 10b and a second entity 20.
  • the system may comprise any other number of first entities (e.g. one or more first entities) and any other number of second entities (e.g. one or more second entities) according to other embodiments.
  • the first entities 10a, 10b illustrated in Figure 5 can be as described earlier with reference to Figures 1 and 2.
  • the second entity 20 illustrated in Figure 5 can be as described earlier with reference to Figures 3 and 4.
  • the first entities 10a, 10b can provide application recovery services or, more specifically, a first application (“Application A”) recovery service and a second application (“Application B”) recovery service.
  • the second entity 20 can be referred to as a recovery manager (module).
  • the first entities 10a, 10b can be communicatively coupled (e.g. connected) to the second entity 20.
  • the system in the embodiment illustrated in Figure 5 comprises three nodes (“Node A”, “Node B”, “Node C”) 304, 306, 308.
  • the system may comprise any other number of nodes (e.g. one or more nodes) according to other embodiments.
  • Any one or more of the nodes 304, 306, 308 can, for example, be a network node such as a (e.g. physical) server. That is, any one or more of the nodes 304, 306, 308 can be a node of a network 324.
  • any reference to a node can be understood to include a node of a network (or network node), such as a server.
  • the network 324 referred to herein can be any type of network.
  • the network 324 referred to herein can be a telecommunications network.
  • the network 324 referred to herein can be a mobile network, such as a fourth generation (4G) mobile network, a fifth generation (5G) mobile network, a sixth generation (6G) mobile network, or any other generation mobile network.
  • the network 324 referred to herein can be a radio access network (RAN).
  • the network 324 referred to herein can be a local network, such as a local area network (LAN).
  • the network 324 referred to herein may be a content delivery network (CDN).
  • the network 324 referred to herein may be a software defined network (SDN).
  • the network 324 referred to herein can be a fog computing environment or an edge computing environment.
  • the network 324 referred to herein can be a virtual network or an at least partially virtual network.
  • a first node (“Node A”) 304 and a second node (“Node B”) 306 are each executing a first instance of a first application (“Application A”) 310, 312.
  • the second node (“Node B”) 306 and a third node (“Node C”) 308 are each executing a first instance of a second application (“Application B”) 314, 316.
  • any one or more of the nodes 304, 306, 308 may use a virtualisation technology (e.g. one or more virtual machines, one or more containers, and/or any other virtualisation technology) to execute the applications.
  • any one or more of the nodes 304, 306, 308 can comprise a health monitor 318, 320, 322.
  • the applications are connected to application specific first entities 10a, 10b, which can provide application specific recovery services.
  • the recovery service can be a distributed solution.
  • a single recovery service can be responsible for multiple applications.
  • the recovery service can, for example, be application specific in the sense that it can monitor a specific application and/or that it can control the recovery procedures described herein.
  • the second entity 20 may be application independent. In some embodiments, the second entity 20 can collect deployment, infrastructure monitoring, and/or historical information.
  • Figure 5 shows the interface (or communications interface) 16a, 16b, 26 between the first entities 10a, 10b and the second entity 20. This interface 16a, 16b, 26 can enable a standard way for new applications to be added to the system.
  • the interface 16a, 16b, 26 can enable the first entities 10a, 10b to send to the second entity 20 any one or more of a problem report, a status update, and an instance restored (e.g. reappeared) notification.
  • the problem report is referred to herein as the first message.
  • the problem report can, for example, comprise an identifier that identifies the application (“application ID”), an identifier that identifies the instance of the application (“instance ID”), a list of identifiers that identify the plurality of restoration stages (“stage ID”), one or more required resources, one or more priorities, one or more failure tolerance parameters, and/or any other relevant information.
  • the status update is referred to herein as the third message.
  • the status update can be sent after restoration stage execution.
  • the status update can, for example, comprise application ID, an instance ID, an identifier that identifies a restoration stage (“stage ID”), and/or any other relevant information.
  • the instance restored notification is referred to herein as the notification.
  • the instance restored notification can, for example, comprise an application ID, an instance ID, and/or any other relevant information.
  • the interface 16a, 16b, 26 can enable the second entity 20 to send to the first entities 10a, 10b a restoration (or recovery) stage execution command.
  • the restoration stage execution command is referred to herein as the second message.
  • the restoration stage execution command can comprise a stage ID and/or any other relevant information.
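Purely as an illustrative sketch, the four interface messages described above could be carried as JSON-style structures such as the following; all keys are hypothetical and chosen only to mirror the fields listed above.

```python
problem_report = {                       # first message (first entity -> second entity)
    "type": "problem_report",
    "application_id": "A",
    "instance_id": "A-1",
    "stages": [
        {"stage_id": "stage-1", "resources": {"network_mb": 10}, "priority": 2},
        {"stage_id": "stage-2", "resources": {"network_mb": 500}, "priority": 1},
    ],
    "tolerates_instance_restart": True,  # failure tolerance parameters
    "tolerates_server_restart": False,
}

stage_execution_command = {              # second message (second entity -> first entity)
    "type": "execute_stage",
    "stage_id": "stage-1",
}

status_update = {                        # third message, sent after stage execution
    "type": "status_update",
    "application_id": "A",
    "instance_id": "A-1",
    "stage_id": "stage-1",
}

instance_restored = {                    # instance restored (reappeared) notification
    "type": "instance_restored",
    "application_id": "A",
    "instance_id": "A-1",
}
```

A fixed message shape of this kind is one way in which the interface can enable a standard way for new applications to be added to the system.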
  • the second entity 20 can orchestrate application restoration (or recovery) procedures via the first entities 10a, 10b in a way that the overhead and cost are minimised.
  • the system may comprise a third entity 302.
  • the third entity 302 can also be referred to as an infrastructure monitoring service entity.
  • the third entity 302 can collect (e.g. node and/or networking related) monitoring data and expose it towards the second entity 20.
  • the third entity 302 may execute specific tests, such as validating the availability of a node (e.g. via one or multiple interfaces such as one or more data and/or management interfaces) or connecting to one or more nodes such as to a baseboard management controller (BMC) interface of one or more nodes.
  • the system may comprise a fourth entity 300.
  • the fourth entity 300 can also be referred to as a cloud deployment service entity.
  • the fourth entity 300 can store information about the current application deployments in the system (e.g. that Application A has instances on Nodes A and B) and make that information available for the second entity 20.
  • Figure 6 is a signalling diagram illustrating an exchange of signals in a system according to an embodiment.
  • the system comprises the first entity 10a and the second entity 20 mentioned earlier.
  • the system also comprises a third entity 302 and a fourth entity 300 mentioned earlier.
  • the signalling diagram of Figure 6 illustrates an example communication pattern between the different entities described earlier.
  • the first entity 10a transmits a first message (which is also referred to herein as the problem report) to the second entity 20 in response to identifying the failure of an application instance and the second entity 20 receives this first message.
  • the second entity 20 may acquire (e.g. collect) deployment and/or monitoring information.
  • the second entity 20 may transmit a request for deployment information to the fourth entity 300 and the fourth entity 300 can receive this request.
  • the fourth entity 300 may transmit a response to this request to the second entity 20 and the second entity 20 can receive this response.
  • the response can comprise the requested deployment information.
  • the second entity 20 may transmit a request for monitoring information to the third entity 302 and the third entity 302 can receive this request.
  • the third entity 302 may transmit a response to this request to the second entity 20 and the second entity 20 can receive this response.
  • the response can comprise the requested monitoring information.
  • the second entity 20 generates (or creates) a restoration plan comprising a plurality of stages for restoring the failed application instance.
  • the plan comprises two stages, namely a first restoration stage (“Stage 1”) and a second restoration stage (“Stage 2”).
  • the plan can comprise any other number of stages, e.g. a single stage or more than two stages.
  • On expiry of a first time period (T1), execution of the first restoration stage is triggered. Specifically, the execution of the first restoration stage is triggered with a second message (which is also referred to herein as the restoration stage execution command).
  • the second entity 20 transmits the second message to the first entity 10a and the first entity 10a receives the second message.
  • the second message comprises an instruction for the first entity 10a to execute the first restoration stage.
  • the first entity 10a may transmit a third message (which is also referred to herein as the status update) to the second entity 20 in response to the second message and the second entity 20 can receive this third message.
  • the third message can comprise information indicative (or an acknowledgement) that the first restoration stage is complete.
  • the second entity 20 may query the monitoring information to validate that the restoration plan is still valid. For example, as illustrated by arrow 414 of Figure 6, the second entity 20 may transmit a request for monitoring information to the third entity 302 and the third entity 302 can receive this request. As illustrated by arrow 416 of Figure 6, the third entity 302 may transmit a response to this request to the second entity 20 and the second entity 20 can receive this response. The response can comprise the requested monitoring information.
  • once the second entity 20 has validated that the restoration plan is still valid (or updated the restoration plan if it is not still valid), execution of the second restoration stage is triggered.
  • the execution of the second restoration stage is triggered with a fourth message (which is also referred to herein as the restoration stage execution command).
  • the second entity 20 transmits the fourth message to the first entity 10a and the first entity 10a receives the fourth message.
  • the fourth message comprises an instruction for the first entity 10a to execute the second restoration stage.
  • the first entity 10a may transmit a fifth message (which is also referred to herein as the status update) to the second entity 20 in response to the fourth message and the second entity 20 can receive this fifth message.
  • the fifth message can comprise information indicative (or an acknowledgement) that the second restoration stage is complete.
  • the first entity 10a may transmit an instance restored notification to the second entity 20 and the second entity 20 may receive this notification.
  • the notification can be that the application instance has been restored independently of execution of the plurality of restoration stages.
  • Figure 7 illustrates a method performed by the first entity 10a, 10b in accordance with an embodiment.
  • the method can be computer-implemented.
  • the first entity 10a, 10b described earlier with reference to Figure 1 can be configured to operate in accordance with the method of Figure 7.
  • the method can be performed by or under the control of the processing circuitry 12 of the first entity 10a, 10b according to some embodiments.
  • the procedure illustrated in Figure 7 can be triggered once an application instance problem is detected. More specifically, as illustrated at block 500 of Figure 7, the method may begin with the first entity 10a, 10b receiving a notification that there is an application instance problem.
  • mechanisms for detecting an application instance problem include heartbeat mechanisms and/or gossip protocols.
  • the application instance problem can be any problem that can result in failure of the application instance, such as the unavailability of the application instance, an issue with the I/O, out of space errors, poor quality networking, and/or any other problems.
  • the procedure illustrated in Figure 7 can be performed in response to identifying a failure of the application instance.
  • the first entity 10a, 10b collects information about one or more components that need to be restored (or recovered).
  • the one or more components can, for example, comprise one or more software components and/or one or more hardware components.
  • the first entity 10a, 10b generates (or creates) a plurality of restoration stages to execute to restore the failed application instance.
  • the plurality of restoration stages can be application dependent. For example, certain applications may require certain components to be restored with a higher priority, e.g. regardless of wasting resources, such as in cases where a temporary outage is resolved.
  • the plurality of restoration stages may be characterised with a resource requirement.
  • the resource requirement can be indicative of one or more resources (e.g. networking, compute instance, memory, and/or any other resources) required for a given restoration stage to be completed.
  • the resource requirement may later be converted into a cost (e.g. a dollar amount) by the second entity 20, such as by using resource pricing information in the given environment.
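A minimal sketch of this conversion, assuming hypothetical resource names and per-unit prices:

```python
def stage_cost(resources, pricing):
    """Convert a stage's resource requirement into a cost figure using
    per-unit pricing information for the current environment.

    resources: e.g. {"network_mb": 100, "cpu_cores": 2}
    pricing:   e.g. {"network_mb": 0.01, "cpu_cores": 0.5}
    Resources without a known price contribute zero; the unit names and
    price values are illustrative assumptions.
    """
    return sum(amount * pricing.get(resource, 0.0)
               for resource, amount in resources.items())
```

The second entity 20 could apply this per restoration stage and use the resulting costs when generating the plan.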
  • the first entity 10a, 10b transmits a first message towards the second entity 20.
  • the first message comprises information indicative of the plurality of restoration stages to execute to restore the failed application instance in order for the second entity 20 to generate a plan for the restoration.
  • the first message is also referred to herein as the problem report.
  • the first message may comprise additional information, such as any one or more of the parameters mentioned earlier.
  • the one or more parameters can comprise an overall priority of the given application, an indication of whether the application can tolerate the plurality of restoration stages (e.g. an application instance restart or a physical server restart), and/or any other parameters.
  • the one or more parameters can be used by the second entity 20 to prioritise and manage the restoration.
  • if the application tolerates server restarts (e.g. it persists its states to durable storage), the second entity 20 may delay the restoration if it concludes from monitoring that such a restart is the issue.
  • if the application loses all of its states when the server restarts (e.g. everything is held in dynamic random access memory (DRAM)), the second entity 20 may initiate a full restoration if such an issue is detected from monitoring.
  • by way of a non-limiting example, consider a distributed key-value store application in which each key can have a data length and a number of remaining replicas.
  • each key may have a priority (e.g. gold keys for charging information and silver keys for monitoring data).
  • the first entity 10a, 10b can iterate over the stored keys after an application instance failure. During this iteration, keys can be assigned to restoration stages, for example based on their priority and the number of remaining replicas.
  • the resource requirement is calculated for networking (e.g. network usage to replicate data).
  • restoration stages 1 and 2 are determined to be executable without starting a new application instance, since the amount of data to be replicated fits on the remaining healthy application instances.
  • for restoration stage 3, a new application instance is also required that can store the amount of data that the failed node stored.
  • the later restoration stages can include a wider restoration, and it is possible, for example, to issue a full recovery only and skip restoration stages 1 and 2. For example, if it is indicated that the key-value store cannot tolerate server restarts and the monitoring data shows this event, the full recovery may be issued immediately.
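A non-limiting sketch of such an assignment of keys to restoration stages, assuming hypothetical key fields (name, priority, remaining replicas):

```python
def assign_keys_to_stages(keys):
    """Assign the keys of a failed key-value store instance to restoration
    stages: stage 1 re-replicates high-priority ("gold") keys, stage 2
    re-replicates the remaining keys that still have replicas, and stage 3
    (full recovery on a new application instance) covers keys whose only
    replica was lost with the failed node. The field names and the exact
    assignment policy are illustrative assumptions.
    """
    stages = {1: [], 2: [], 3: []}
    for key in keys:
        if key["remaining_replicas"] == 0:
            stages[3].append(key["name"])   # only replica lost: full recovery
        elif key["priority"] == "gold":
            stages[1].append(key["name"])   # critical data first
        else:
            stages[2].append(key["name"])   # remaining data later
    return stages
```

The per-stage resource requirement (e.g. network usage to replicate the assigned keys) could then be derived from the data lengths of the keys in each stage.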
  • Figure 8 illustrates a method performed by the first entity 10a, 10b in accordance with an embodiment.
  • the method can be computer-implemented.
  • the first entity 10a, 10b described earlier with reference to Figure 1 can be configured to operate in accordance with the method of Figure 8.
  • the method can be performed by or under the control of the processing circuitry 12 of the first entity 10a, 10b according to some embodiments.
  • the first entity 10a, 10b receives a second message from the second entity 20.
  • the second message comprises an instruction for the first entity 10a, 10b to execute one or more restoration stages (i.e. at least a first restoration stage) of the plurality of restoration stages.
  • the instruction can thus also be referred to as a restoration stage execution command.
  • the first entity 10a, 10b executes the one or more restoration stages.
  • the first entity 10a, 10b transmits a third message to the second entity 20.
  • the third message comprises information indicative that the one or more restoration stages are complete.
  • the third message can thus also be referred to as a status update.
  • the procedure illustrated in Figure 8 can be performed multiple times. That is, the procedure illustrated in Figure 8 can be performed for each restoration stage that is to be executed.
  • the second entity 20 may first command the execution of the high priority restoration stage (stage 1) and later in time command the execution of one or more of the other restoration stages (stage 2 and/or stage 3).
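The first entity's handling of a single restoration stage execution command (the Figure 8 flow of receiving the second message, executing the stage, and returning the status update) can be sketched as follows; `execute_stage` stands in for the application-specific stage logic and is an assumption.

```python
def handle_stage_command(command, execute_stage):
    """Handle one restoration stage execution command at the first entity.

    command:       the second message, e.g. {"stage_id": "stage-1"}
    execute_stage: application-specific callable that performs the stage

    Executes the named stage, then returns the status update (third
    message) reporting completion. Message keys are hypothetical.
    """
    execute_stage(command["stage_id"])
    return {
        "type": "status_update",
        "stage_id": command["stage_id"],
        "complete": True,
    }
```

As described above, this handler would be invoked once per commanded stage, since the procedure of Figure 8 can be performed multiple times.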
  • Figure 9 illustrates a method performed by the first entity 10a, 10b in accordance with an embodiment.
  • the method can be computer-implemented.
  • the first entity 10a, 10b described earlier with reference to Figure 1 can be configured to operate in accordance with the method of Figure 9.
  • the method can be performed by or under the control of the processing circuitry 12 of the first entity 10a, 10b according to some embodiments.
  • Figure 9 illustrates a procedure that can be executed if a failed application instance is restored (or recovers) from such a temporary failure. The steps to be taken can differ depending on the progress of the restoration.
  • the first entity 10a, 10b may receive a notification that a failed application instance is restored independently of the execution of the plurality of restoration stages.
  • the failed application instance may have reappeared.
  • the failed application instance that is restored independently of the execution of the plurality of restoration stages will be referred to as the independently restored application instance.
  • the first entity 10a, 10b can determine whether the restoration of the failed application instance via the plurality of restoration stages is already complete. If it is determined at block 702 of Figure 9 that the restoration via the plurality of restoration stages is already complete, the process moves to block 714 of Figure 9. As illustrated at block 714 of Figure 9, the first entity 10a, 10b may check whether any data can be recovered from the independently restored application instance. In the earlier example of the key-value store, this data can comprise a data element that had its only replica on the node at which the application instance failed. As illustrated at block 718 of Figure 9, the first entity 10a, 10b may remove the independently restored application instance.
  • the process moves to block 704 of Figure 9.
  • the first entity 10a, 10b can determine whether the restoration via the plurality of restoration stages is in progress. For example, it may be determined that the restoration via the plurality of restoration stages is in progress if the execution of at least one restoration stage has been started. If it is determined at block 704 of Figure 9 that the restoration via the plurality of restoration stages is in progress, the process moves to block 706 of Figure 9. As illustrated at block 706 of Figure 9, the first entity 10a, 10b may gather information on the progress of the restoration via the plurality of restoration stages. As illustrated at block 708 of Figure 9, the first entity 10a, 10b may calculate a resource requirement for cancelling the restoration via the plurality of restoration stages.
  • the first entity 10a, 10b may determine whether to stop the restoration. This decision can depend on the progress of the restoration. If it is determined at block 710 of Figure 9 that the restoration is to be stopped, the process moves to block 712 of Figure 9. As illustrated at block 712 of Figure 9, the first entity 10a, 10b can cancel the execution of the plurality of restoration stages. As illustrated at block 716 of Figure 9, the first entity 10a, 10b may restore the independently restored application instance into the cluster. For example, if only critical data elements have been restored, then it likely makes sense to cancel the execution of the plurality of restoration stages.
  • the process moves to block 716 of Figure 9.
  • the first entity 10a, 10b may restore the independently restored application instance into the cluster.
  • the first entity 10a, 10b can transmit to the second entity 20 a notification about the event (namely, that the failed application instance is restored independently of the execution of the plurality of restoration stages) and optionally also the decisions taken.
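The decision flow of Figure 9 can be sketched as follows; the argument names and returned actions are hypothetical simplifications of the blocks described above.

```python
def handle_independent_restoration(restoration_complete, stages_started,
                                   only_critical_restored):
    """Decide what to do when a failed application instance reappears
    (is restored independently of the restoration stages).

    Returns one of:
      "recover_data_and_remove" - staged restoration already complete:
                                  recover any remaining data, then remove
                                  the reappeared instance,
      "cancel_and_reuse"        - restoration in progress but little done:
                                  cancel the stages, reuse the instance,
      "continue_restoration"    - restoration too far along to cancel,
      "reuse_instance"          - restoration not started: restore the
                                  reappeared instance into the cluster.
    """
    if restoration_complete:
        return "recover_data_and_remove"
    if stages_started:
        # cancelling likely makes sense if only critical data has been
        # restored so far (low sunk cost), as noted above
        if only_critical_restored:
            return "cancel_and_reuse"
        return "continue_restoration"
    return "reuse_instance"
```

In each case, the first entity 10a, 10b would also notify the second entity 20 of the event and, optionally, of the decision taken.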
  • Figure 10 illustrates a method performed by the second entity 20 in accordance with an embodiment.
  • the method can be computer-implemented.
  • the second entity 20 described earlier with reference to Figure 3 can be configured to operate in accordance with the method of Figure 10.
  • the method can be performed by or under the control of the processing circuitry 22 of the second entity 20 according to some embodiments.
  • the second entity 20 receives the first message from the first entity 10a, 10b.
  • the output of the method illustrated in Figure 7 is an input of the method illustrated in Figure 10.
  • the first message comprises information indicative of the plurality of restoration stages to execute to restore the failed application instance in order for the second entity 20 to generate a plan for the restoration.
  • the first message can comprise a problem report and/or additional information, such as any one or more of the parameters mentioned earlier.
  • although the method illustrated in Figure 10 is described with reference to a single first entity 10a, 10b for simplicity, it will be understood that the method illustrated in Figure 10 can be performed in respect of multiple first entities.
  • the second entity 20 can receive the first message from one or more first entities 10a, 10b.
  • the second entity 20 can acquire (e.g. collect) information, such as monitoring information (e.g. from the third entity 302) and/or deployment information (e.g. from the fourth entity 300).
  • the deployment information can detail the current physical deployment of the application, such as the infrastructure (e.g. node) on which the application is executing.
  • the purpose of the monitoring information can be to identify the details of the problem as precisely as possible.
  • one or more health checks may be performed to acquire the monitoring information.
  • the one or more health checks can be performed by one or more health monitors 318, 320, 322.
  • the one or more health checks can, for example, comprise any one or more of the following:
  • the second entity 20 may compare the acquired information to historical data. For example, based on the acquired information, the second entity 20 may determine whether similar failures happened in the past. A decision on the similarity of failures can be taken using rule-based approaches and/or machine learning based solutions (such as clustering).
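As a non-limiting example of a rule-based similarity decision, a past failure could be considered similar when enough monitored attributes match; the attribute names and threshold below are assumptions (a machine learning based solution, such as clustering, could be used instead).

```python
def similar_failure(current, past, required_matches=2):
    """Rule-based sketch: count matching monitored attributes between the
    current failure and a past failure, and treat the past failure as
    similar when at least `required_matches` attributes agree."""
    fields = ("failure_type", "node_state", "network_state")
    matches = sum(1 for f in fields if current.get(f) == past.get(f))
    return matches >= required_matches
```

Past failures deemed similar could then contribute their historical elapsed recovery times to the timing of the plan, as described earlier.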
  • the second entity 20 generates a plan for restoring the failed application instance.
  • restoration stage costs may be calculated, such as based on the indicated resource requirement of each stage and/or pricing information in the current environment.
  • the plan can be generated taking into account the restoration stage costs.
  • the plan can essentially list one or more restoration stages, e.g. with associated timing. If available, the historical data may be used during this process, as will be described later.
  • a plan may be generated for each received problem.
  • the second entity 20 may transmit the plan to the first entity 10a, 10b in stages to be executed. As illustrated at block 818 of Figure 10, the second entity 20 may determine whether execution of the next (e.g. first or subsequent) restoration stage is required. For example, it may be that the plan requires immediate action. If it is determined at block 818 of Figure 10 that execution of the next restoration stage is required, the process can proceed to block 820 of Figure 10. As illustrated at block 820 of Figure 10, the second entity 20 can transmit a second message to the first entity 10a, 10b. The second message comprises an instruction for the first entity 10a, 10b to execute the next restoration stage of the plurality of restoration stages. As illustrated by the letter “D” in Figure 8, the output of the method illustrated at block 820 of Figure 10 is the input of the method illustrated in Figure 8.
  • the process can proceed to block 824 of Figure 10.
  • the second entity 20 can set the timing for execution of the next (e.g. first or subsequent) restoration stage to start. For example, a timer can be used to schedule the next restoration stage.
  • the second entity 20 can identify that the timer has expired. As illustrated at block 808 of Figure 10, in response to expiry of the timer, the second entity 20 may acquire (e.g. gather) updated information (e.g. updated monitoring information and/or deployment information). As illustrated at block 814 of Figure 10, the second entity 20 may generate an updated (e.g. revised) plan. That is, the second entity 20 may update (e.g. revise) the previously generated plan. The updated plan can be generated based on the updated information. For example, it may become certain from the updated information that a (e.g. physical) node has failed as it failed to reboot. The method may then proceed to block 818, which is as described earlier.
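The timer-driven staging described in the preceding bullets can be sketched as a minimal single-threaded scheduler. The class and method names, and the explicit simulated clock, are illustrative assumptions; a real implementation would use the platform's timer facilities and re-evaluate the plan on each expiry.

```python
import heapq

class StageScheduler:
    """Minimal sketch: schedule restoration stages to fire at given times,
    and allow cancelling everything if the failed instance reappears."""

    def __init__(self):
        self._timers = []  # min-heap of (fire_time, stage_name)

    def schedule(self, fire_time: float, stage: str) -> None:
        heapq.heappush(self._timers, (fire_time, stage))

    def cancel_all(self) -> None:
        # e.g. the failed instance was restored independently of the plan
        self._timers.clear()

    def due(self, now: float) -> list:
        """Pop every stage whose timer has expired at time `now`."""
        fired = []
        while self._timers and self._timers[0][0] <= now:
            fired.append(heapq.heappop(self._timers)[1])
        return fired
```

On each expiry the second entity would gather updated information, possibly revise the plan, and instruct the first entity to execute the due stage.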
  • the second entity 20 may check whether a full restoration is requested (or required). If it is determined at block 822 of Figure 10 that a full restoration is not requested, then the process can proceed to block 824 of Figure 10, which is as described earlier. On the other hand, if it is determined at block 822 of Figure 10 that a full restoration is requested, then the process can proceed to block 826 of Figure 10. As illustrated at block 826 of Figure 10, the second entity 20 may record the event (as a historical case), e.g. for future restoration planning. For example, once the second entity 20 has transmitted all of the restoration stages to the first entity 10a, 10b, the event may be recorded
  • the second entity 20 may receive a notification about the event from the first entity 10a, 10b.
  • the event is that the failed application instance is restored independently of the execution of the plurality of restoration stages. For example, the failed application instance may reappear.
  • the output of the method illustrated in Figure 9 is an input of the method illustrated in Figure 10.
  • the second entity 20 can cancel the associated timer. The process can then proceed to block 826 of Figure 10, which is as described earlier.
  • the status update transmitted to the second entity 20, as illustrated at block 604 of Figure 8, may only be recorded by the second entity 20, with no action taken as a result. Thus, the status update is not illustrated in Figure 10.
  • Figure 11 illustrates a method performed by the second entity 20 in accordance with an embodiment.
  • the method can be computer-implemented.
  • the second entity 20 described earlier with reference to Figure 3 can be configured to operate in accordance with the method of Figure 11.
  • the method can be performed by or under the control of the processing circuitry 22 of the second entity 20 according to some embodiments.
  • Figure 11 illustrates a method for generating (or creating) the plan referred to herein.
  • the plan can be generated by the second entity 20 for reported problems that have resulted in application instance failures.
  • the plan comprises a plurality of restoration stages.
  • the plan can assign time periods (T1, T2, ...) to the available restoration stages.
  • the second entity 20 can receive a request for a plan to be generated and may calculate costs based on resource requirements and/or one or more local cost parameters.
  • the second entity 20 may acquire (e.g. collect) and optionally also parse monitoring information.
  • the second entity 20 may check whether a node on which the failed application instance is executing has restarted. If it is determined at block 904 of Figure 11 that the node has restarted, the process can move to block 906 of Figure 11. As illustrated at block 906 of Figure 11, the second entity 20 may check whether the application can tolerate the restart. If it is determined at block 906 that the application cannot tolerate the restart, the process can move to block 910 of Figure 11. As illustrated at block 910 of Figure 11, the second entity 20 may initiate a full restoration of the failed application instance.
  • the process can move to block 908 of Figure 11.
  • the process can move to block 908 of Figure 11.
  • the second entity 20 may create a default plan, e.g. based on the base time constant (T_base), priorities, costs, and/or any other parameters.
  • the base time constant can represent an expected (e.g. typical) restoration time.
  • the base time constant can, for example, be used for cases where there is no historical data available.
  • the second entity 20 may check whether a similar case has been handled before. If it is determined at block 912 of Figure 11 that a similar case has not been handled before, the process can move to block 916 of Figure 11. As illustrated at block 916 of Figure 11, the second entity 20 can initiate the default plan. More specifically, as described earlier with reference to Figure 10, the second entity 20 can send the restoration stages of the default plan to the first entity 10a, 10b for execution. On the other hand, if it is determined at block 912 of Figure 11 that a similar case has been handled before, the process can move to block 914 of Figure 11. As illustrated at block 914 of Figure 11, the second entity 20 may adjust the default plan to generate an adjusted plan. For example, the default plan may be adjusted based on the observed timing in the previous case.
  • the second entity 20 can initiate the adjusted plan. More specifically, as described earlier with reference to Figure 10, the second entity 20 can send the restoration stages of the adjusted plan to the first entity 10a, 10b for execution.
  • the generation of the plan referred to herein may take into account any one or more of the following factors:
  • collected monitoring information such as application health, node operating system health, node health, networking status, and/or any other collected monitoring information
  • the technique described herein enables staged recovery planning, e.g. to ensure prioritisation and/or cost saving.
  • the technique can, for example, be executed for clustered applications.
  • An application instance failure can be identified by a (e.g. application specific) first entity 10a, 10b that generates possible restoration stages, optionally along with associated costs.
  • the restoration stages are then compiled into a plan by a (e.g. central) second entity 20, such as by using deployment information, monitoring information, and/or historical restoration cases.
  • the restoration stages of the plan can then be applied to the application, e.g. with a determined timing.
  • the technique described herein can enable the coordination of application instance restoration in stages based on multiple parameters.
  • the technique described herein can, according to some embodiments, react to changes in monitoring information and/or can also handle the possible natural restoration of a temporary failed application instance. For example, in some cases, it may be ensured that only minimal restoration is executed if an application instance failure is temporary and the failed instance is expected to be restored independently of the execution of the plurality of restoration stages. In some cases, the entire restoration process or stages of the restoration process may be delayed if an application instance is expected to be restored independently of the execution of the plurality of restoration stages, e.g. based on infrastructure monitoring information and/or historical cases. Thus, the technique described herein can result in infrastructure cost savings associated with bringing up new application instances and/or transferring data over the network.
  • the technique described herein can enable multiple levels of restoration prioritisation. For example, entire applications can be prioritised, which makes it possible to restore critical applications first. This can be particularly beneficial in a case when multiple applications fail at the same time.
  • certain restoration stages can be prioritised, which makes it possible to restore critical components of applications but delay the restoration of possibly costly components. This can be particularly beneficial in a case when it is determined that the failed application instance is likely to be restored (e.g. reappear) independently of the execution of the plurality of restoration stages.
  • a database handling user transactions can be prioritised as high on an application level. On a restoration stages level, high priority can be given to data that is only replicated to two instances or that is critical for the application (e.g. charging data).
  • multiple possible restoration stages are created for an application instance failure and the restoration stages are reported by a (e.g. application specific) first entity 10a, 10b to a second entity 20.
  • the second entity 20 creates a restoration plan (e.g. based on multiple inputs) and coordinates the restoration of the failed application instance.
  • the technique described herein enables the timing of multi-staged application restoration, e.g. based on observed conditions and/or historical events. If the failed application instance is expected to be restored (e.g. reappear) independently of the execution of the plurality of restoration stages at a certain time, then the restoration or stages of the restoration can be delayed until that time.
  • the technique described herein can react to the possible restoration (e.g. reappearance) of application instances independently of the execution of the plurality of restoration stages, such as by either continuing the restoration process or cancelling the restoration process.
  • Figures 12A-12E illustrate example scenarios involving multiple restoration stages. More specifically, each of Figures 12A-12E represents a plan generated once the first message referred to herein has been received by the second entity 20 from a given first entity 10a, 10b. In the example scenarios, there are two restoration stages. The restoration stages comprise a first (critical) restoration stage and a second (full) restoration stage.
  • Figure 12A illustrates a scenario where the monitoring information acquired by the second entity 20 is indicative that there has been an operating system restart in relation to a failed application instance.
  • the second entity 20 calculates a timing (T1 and T2) for the two restoration stages without taking into account historical information, since there is no information about a relevant case.
  • the timing can be calculated using the base time constant (T_base), which may be scaled using received priority (P1, P2) and cost (C1, C2) parameters.
  • the timing may be adjusted based on application-level information such as application priority.
  • the timing can be set such that execution of the first (critical) restoration stage starts on expiry of a first predefined time period (T1) and the second (full) restoration stage starts on expiry of a second predefined time period (T2).
  • the dotted lines in Figure 12A illustrate the time at which the respective restoration stages are completed.
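One plausible timing rule consistent with the Figure 12A description is to scale the base time constant (T_base) by each stage's priority and cost, so that high-priority stages start sooner and costly stages are deferred. The exact formula below is an assumption; the description only states that T_base may be scaled using the received priority and cost parameters.

```python
def stage_start_time(t_base: float, priority: float, cost: float) -> float:
    """Illustrative scaling: higher priority shortens the delay before a
    stage starts, higher cost lengthens it."""
    return t_base * (cost / priority)

def plan_timings(t_base: float, stages) -> dict:
    """stages: iterable of (name, priority, cost) -> {name: start_time}."""
    return {name: stage_start_time(t_base, p, c) for name, p, c in stages}
```

With T_base = 60, a critical stage (priority 2, cost 1) would start well before a full stage (priority 1, cost 3), matching the T1 < T2 ordering in Figure 12A.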
  • Figure 12B illustrates a scenario where the monitoring information acquired by the second entity 20 is indicative that a baseboard management controller (BMC) of a node executing a failed application instance is unreachable.
  • the application preferences also indicate that the application instance is not capable of recovering from a node error.
  • the second entity 20 initiates execution of the second (full) restoration immediately.
  • the dotted line in Figure 12B illustrates the time at which the restoration is completed.
  • Figure 12C illustrates a scenario where the monitoring information acquired by the second entity 20 is indicative that there has been an application restart in relation to a failed application instance.
  • the plan generation in the scenario illustrated in Figure 12C is similar to the scenario illustrated in Figure 12A, as no similar case has been seen before.
  • the failed application instance is independently restored while the full restoration is in progress. It is determined by the first entity 10a, 10b that cancelling the full restoration is a more cost effective option.
  • the first dotted line in Figure 12C illustrates the time at which the first (critical) restoration stage is completed.
  • the second dotted line in Figure 12C illustrates the time at which the second (full) restoration stage is cancelled.
  • the case may be recorded as a historical case, e.g. the time at which the failed application instance was independently restored may be recorded.
  • Figure 12D illustrates another scenario where the monitoring information acquired by the second entity 20 is indicative that there has been an application restart in relation to a failed application instance.
  • the scenario illustrated in Figure 12D is identified as being similar to the historical scenario illustrated in Figure 12C.
  • the second predefined time period (T2) is adjusted to the restoration time recorded in the scenario illustrated in Figure 12C. Since the failed application instance does not reappear until T2, the second (full) restoration stage is started. Finally, the failed application instance is independently restored before completion of the second (full) restoration stage. However, it is decided that finishing the restoration is more beneficial in the scenario illustrated in Figure 12D.
  • the case may be recorded as a historical case, e.g. the time at which the failed application instance was independently restored may be recorded.
  • Figure 12E illustrates another scenario where the monitoring information acquired by the second entity 20 is indicative that there has been an application restart in relation to a failed application instance.
  • the scenario illustrated in Figure 12E is identified as being similar to the historical scenario illustrated in Figure 12C and the historical scenario illustrated in Figure 12D.
  • the second predefined time period (T2) is adjusted to the longer restoration time recorded in the scenario illustrated in Figure 12D.
  • the failed application instance is independently restored before expiry of the second predefined time period (T2) and thus the second (full) restoration stage is not triggered.
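The history-driven adjustment of the second predefined time period (T2) across Figures 12C-12E can be sketched as follows. Taking the maximum of the observed independent-restoration times mirrors the Figure 12E choice of the longer recorded restoration time; the max() policy itself is an assumption for illustration.

```python
def adjusted_t2(default_t2: float, observed_restore_times) -> float:
    """Delay the full-restoration timer to the longest independent-
    restoration time seen in similar historical cases, if any."""
    observed = list(observed_restore_times)
    if not observed:
        return default_t2  # no history: fall back to the default plan timing
    return max(observed)
```

Each handled case appends its observed restoration time, so the timer converges towards waiting out failures that have historically proven temporary.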
  • the technique described herein advantageously breaks the restoration (e.g. repair or healing) of a failed application instance into a plurality of restoration stages, such as based on priorities and/or costs.
  • the handling of application instance failures is more flexible and can be adapted to the particular scenario.
  • the plurality of restoration stages make it possible for the restoration of critical components to be prioritised and the restoration of less important components to be delayed.
  • the restoration of data that has only one replica remaining after an application instance failure may be considered critical, while the restoration of data with more remaining replicas may be delayed (e.g. if the application instance failure is considered temporary with some probability).
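The replica-count example above can be sketched as a trivial classification rule. The threshold of a single remaining replica is taken from the example; the function and label names are illustrative assumptions.

```python
def classify_restoration(remaining_replicas: int) -> str:
    """Classify data restoration urgency after an instance failure."""
    if remaining_replicas <= 1:
        return "critical"   # only one copy left: restore immediately
    return "deferrable"     # still redundant: may wait out a temporary failure
```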
  • historical information can be used to determine if it is beneficial to delay one or more of the restoration stages.
  • the failure of application instances cannot be easily categorised (e.g. as temporary or permanent). For example, using existing techniques such as heartbeats, a network failure cannot be distinguished from an application instance failure, an operating system failure, or a storage or node failure.
  • the technique described herein enables the use of information gathered from different sources and/or historical data to categorise application instance failures. With this approach, it is possible to assign a probability that a failure is temporary.
  • the impact of an application instance failure may be categorised on an application level and/or an application component level.
  • the restoration of certain applications may be prioritised (e.g. for dependency reasons).
  • certain components or data elements may require higher priority restoration actions than others.
  • a computer program comprising instructions which, when executed by processing circuitry (such as the processing circuitry 12 of the first entity 10a, 10b described earlier and/or the processing circuitry 22 of the second entity 20 described earlier), cause the processing circuitry to perform at least part of the method described herein.
  • a computer program product embodied on a non-transitory machine-readable medium, comprising instructions which are executable by processing circuitry (such as the processing circuitry 12 of the first entity 10a, 10b described earlier and/or the processing circuitry 22 of the second entity 20 described earlier) to cause the processing circuitry to perform at least part of the method described herein.
  • a computer program product comprising a carrier containing instructions for causing processing circuitry (such as the processing circuitry 12 of the first entity 10a, 10b described earlier and/or the processing circuitry 22 of the second entity 20 described earlier) to perform at least part of the method described herein.
  • the carrier can be any one of an electronic signal, an optical signal, an electromagnetic signal, an electrical signal, a radio signal, a microwave signal, or a computer-readable storage medium.
  • the entity functionality described herein can be performed by hardware.
  • the first entity 10a, 10b and/or the second entity 20 described herein can be hardware.
  • at least part or all of the entity functionality described herein can be virtualized.
  • the functions performed by the first entity 10a, 10b and/or the second entity 20 described herein can be implemented in software running on generic hardware that is configured to orchestrate them.
  • the first entity 10a, 10b and/or the second entity 20 described herein can be virtual.
  • at least part or all of the entity functionality described herein may be performed in a network enabled cloud.
  • the method described herein can be realised as a cloud implementation according to some embodiments.
  • the entity functionality described herein may all be at the same location or at least some of the functionality may be distributed.

Abstract

The present invention relates to a method for handling a failure of a first application instance of a first application. The method is performed by a first entity. The method comprises initiating transmission (102) of a first message towards a second entity in response to identifying the failure of the first application instance. The first message comprises information indicative of a plurality of restoration stages to be executed to restore the failed first application instance, for the second entity to generate a plan for restoring the failed first application instance. The plan comprises the plurality of restoration stages and a timing for executing the plurality of restoration stages.
PCT/EP2022/059747 2022-04-12 2022-04-12 Gestion de défaillance d'une instance d'application WO2023198276A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/059747 WO2023198276A1 (fr) 2022-04-12 2022-04-12 Gestion de défaillance d'une instance d'application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/059747 WO2023198276A1 (fr) 2022-04-12 2022-04-12 Gestion de défaillance d'une instance d'application

Publications (1)

Publication Number Publication Date
WO2023198276A1 true WO2023198276A1 (fr) 2023-10-19

Family

ID=81598013

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/059747 WO2023198276A1 (fr) 2022-04-12 2022-04-12 Gestion de défaillance d'une instance d'application

Country Status (1)

Country Link
WO (1) WO2023198276A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130305083A1 (en) * 2011-09-08 2013-11-14 Nec Corporation Cloud service recovery time prediction system, method and program
US20160342450A1 (en) * 2013-03-14 2016-11-24 Microsoft Technology Licensing, Llc Coordinating fault recovery in a distributed system
US20190073276A1 (en) * 2017-09-06 2019-03-07 Royal Bank Of Canada System and method for datacenter recovery
US20200412624A1 (en) * 2019-06-26 2020-12-31 International Business Machines Corporation Prioritization of service restoration in microservices architecture


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22722481

Country of ref document: EP

Kind code of ref document: A1