WO2022033672A1

WO2022033672A1 - Apparatus and method for injecting a fault into a distributed system

Info

Publication number: WO2022033672A1
Application number: PCT/EP2020/072602
Authority: WO
Inventors: Ilya SHAKHAT; Jorge Cardoso
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2020-08-12
Filing date: 2020-08-12
Publication date: 2022-02-17
Also published as: CN116097226A; CN116097226A8

Abstract

An apparatus (200) for injecting a fault into a distributed system (230) is disclosed. The apparatus (200) is configured to collect trace information on an execution of an operation by a plurality of processing nodes of the distributed system (230), and generate, based on the trace information, a model of a state machine of the distributed system (230). Moreover, the apparatus (200) is configured to generate an execution plan based on the model of the state machine, wherein the execution plan identifies a type of fault to be injected in at least one state, and to inject at least one fault having the identified type into the distributed system (230). The invention advantageously allows combining the benefits of white box fault injection approaches (i.e., high precision and predictability) and the benefits of runtime fault injection approaches (i.e., lower implementation costs), while not exhibiting their respective drawbacks.

Description

Apparatus and method for injecting a fault into a distributed system

TECHNICAL FIELD

The present disclosure relates to software testing and verification of distributed systems. More specifically, the present disclosure relates to an apparatus and method for injecting a fault into a distributed system as well as such a distributed system.

BACKGROUND

Deliberately injecting faults into a distributed system is an important tool for testing whether the distributed system is able to handle and recover from known or unknown failures. In other words, fault injection is a reliability testing approach for ensuring that a distributed system keeps working under failing conditions. A distributed system is a system consisting of multiple components, i.e. processing nodes running on the same or different electronic machines or devices and communicating over a network. A failure may occur in any component or in the communication channels connecting these components.

There are two main types of fault injection techniques, namely white box fault injection approaches (also known as compile time fault injection approaches) and runtime fault injection approaches. In white box fault injection approaches, fault injection primitives are usually injected directly into the software code. Runtime fault injection approaches usually identify the state of a running distributed system and rely on triggers to dynamically inject faults.

Both techniques have limitations with respect to implementation costs and fault injection precision. The white box approach has a high cost of implementation since the original source code of a distributed system needs to be understood, altered, and possibly recompiled. On the other hand, since fault injection primitives are directly added into the software code, the precision is high since it is possible to accurately inject faults at the instruction level. This makes experimentation reproducible. Runtime fault injection approaches on the other hand typically have a lower cost of implementation since the distributed system under analysis does not need to be instrumented. An external system is responsible for monitoring the state of the distributed system under study. Nonetheless, this benefit comes with the drawback that the precision is lower when compared to the white box approach since - using existing tools -detecting the state of a distributed system is complex and imprecise. State monitoring is often implemented by inspecting and detecting when a certain action is triggered, such as when a file is created by the distributed system, a certain port is opened, a process starts its execution, or a record is added to a log file. The lag between state detection and fault injection can be high which implies that the fault will not be injected always in the same state making it hard to reproduce scenarios.

SUMMARY

It is an objective of the present disclosure to provide an improved apparatus for injecting a fault into a distributed system as well as a corresponding method.

The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description, and the figures.

A model-based approach for the reliability testing of a distributed system using software- implemented fault injections is disclosed. Distributed tracing infrastructures may be used to decide where to inject faults and fault types may be chosen based on interrelations between the components of the distributed system, e.g. a part of a distributed system, such as a service in a micro-service application. A high level of precision may be achieved when injecting faults by using distributed tracing to identify the important states of a distributed system and by temporally blocking transitions until faults are injected. A distributed system is viewed as a state machine containing multiple processing nodes, i.e. components communicating with each other. Components of the distributed system can be represented as services in a micro-service architecture. Communication channels between the components can be implemented using, e.g., RPC (remote procedure call), HTTP (hypertext transfer protocol), and IPC (inter-process communication) technologies.

More specifically, according to a first aspect, an apparatus for injecting a fault into a distributed system is provided. The apparatus comprises a trace observer configured to collect trace information on an execution of an operation by a plurality of processing nodes of the distributed system, such as a plurality of network nodes processing a service request. The apparatus further comprises a model builder configured to generate, based on the trace information, a model of a state machine of the distributed system (herein also referred to as state machine model of the distributed system), wherein the model of the state machine of the distributed system defines a plurality of different states of the distributed system. The apparatus further comprises a plan engine configured to generate an execution plan based on the model of the state machine of the distributed system, wherein, for at least one state of the plurality of different states of the distributed system, the execution plan identifies a type of fault to be injected in the at least one state. The apparatus further comprises a fault injector configured to inject at least one fault having the type identified in the execution plan into the distributed system, for instance, during a further execution of the operation.

Thus, advantageously, the benefits of white box fault injection approaches (i.e., high precision and predictability) and the benefits of runtime fault injection approaches (i.e., lower implementation costs) are combined, while not exhibiting their respective drawbacks. As a result, the apparatus according to the first aspect provides, in particular, the following main benefits:

(i) higher precision: faults can be precisely injected into well-defined states of a distributed system and evaluations are reproducible;

(ii) reduced costs: it has a low cost of implementation, since no changes to the source code are needed (it follows a black box paradigm); and

(iii) unsupervised operation: automated decisions may be taken on where faults are to be injected.

In a further possible implementation form of the first aspect, the trace information comprises information on an order of the execution of the operation by the plurality of processing nodes.

In a further possible implementation form of the first aspect, the apparatus further comprises a breakpoint manager configured to set a breakpoint at the at least one state of the plurality of different states of the distributed system, wherein the fault injector is further configured to inject the at least one fault into the distributed system when, during the execution of the operation by the plurality of processing nodes, the breakpoint is reached.

In a further possible implementation form of the first aspect, the breakpoint manager is further configured to block any further state transition, while the at least one fault is being injected into the distributed system. In other words, during the fault injection the system is "put on pause".

In a further possible implementation form of the first aspect, the execution plan identifies for more than one state of the plurality of different states of the distributed system a respective type of fault to be injected in the respective state, wherein the fault injector is further configured to inject the plurality of faults identified by the execution plan in sequential order or in an order based on a state probability. Thus, advantageously the apparatus can implement different execution strategies, such as a "sequential", "frequent first" or "long tail" execution strategy. In a further possible implementation form of the first aspect, the apparatus further comprises a fault library or database comprising a plurality of different fault types, wherein the fault injector is further configured to select the at least one fault from the fault library based on the execution plan.

In a further possible implementation form of the first aspect, the trace observer comprises one or more communication libraries for collecting the trace information on the execution of the operation by the plurality of processing nodes.

In a further possible implementation form of the first aspect, the model of the state machine of the distributed system further defines a plurality of transition probabilities between the plurality of different states. The apparatus may be configured to estimate these transition probabilities by counting the occurrences of the transitions between the states of the plurality of different states of the distributed system.

In a further possible implementation form of the first aspect, the model builder is further configured to generate the model of the state machine of the distributed system as a directed graph or a weighted Markov chain.

In a further possible implementation form of the first aspect, the model builder is further configured to update the model of the state machine of the distributed system when a new state is detected.

In a further possible implementation form of the first aspect, the fault injector is further configured to inject a first fault and a second fault into the distributed system based on the execution plan, wherein the first fault causes the distributed system to be set into a different state and wherein the second fault is injected in the different state.

In a further possible implementation form of the first aspect, the fault injector is further configured to: inject a plurality of faults into the distributed system based on the execution plan; and terminate the execution of the execution plan, once a specified coverage level has been reached.

As used herein, a specified coverage level may be defined as a ratio (or percentage level) between the number of states already covered by an operation and the total number of states of the model of the state machine. In a further possible implementation form of the first aspect, the type of fault to be injected in the at least one state depends only on a current state of the distributed system, the current state and one or more previous states of the distributed system or the current state and one or more predicted future states of the distributed system.

According to a second aspect, a method for injecting a fault into a distributed system is provided. The method comprises the steps of: collecting trace information on an execution of an operation by a plurality of processing nodes of the distributed system; generating, based on the trace information, a model of a state machine of the distributed system, wherein the model of the state machine of the distributed system defines a plurality of different states of the distributed system; generating an execution plan based on the model of the state machine of the distributed system, wherein, for at least one state of the plurality of different states of the distributed system, the execution plan identifies a type of fault to be injected in the at least one state; and injecting at least one fault having the type identified in the execution plan into the distributed system, for instance, during a further execution of the operation.

The fault injection method according to the second aspect of the present disclosure can be performed by the fault injection apparatus according to the first aspect of the present disclosure. Thus, further features of the fault injection method according to the second aspect of the present disclosure result directly from the functionality of the fault injection apparatus according to the first aspect of the present disclosure and its different implementation forms described above and below.

According to a third aspect, a computer program product comprising a non-transitory computer-readable storage medium for storing program code which causes a computer or a processor to perform the method according to the second aspect, when the program code is executed by the computer or the processor, is provided.

Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which: Fig. 1 is a diagram illustrating three conventional fault injection approaches as well as a fault injection process implemented by a fault injection apparatus according to an embodiment;

Fig. 2 is a schematic diagram illustrating a fault injection apparatus for injecting a fault into a distributed system according to an embodiment;

Fig. 3 is a flow diagram illustrating different steps of a method for injecting a fault into a distributed system according to an embodiment;

Fig. 4 is a signaling diagram illustrating the interaction between the components of a fault injection apparatus according to an embodiment with a distributed system during a learning phase;

Fig. 5 is a signaling diagram illustrating the interaction between the components of a fault injection apparatus according to an embodiment with a distributed system during an execution phase;

Fig. 6 is a diagram illustrating an exemplary operation executed by two processing nodes of a distributed system;

Fig. 7 is a diagram illustrating an aspect of a fault injection apparatus for injecting a fault into a distributed system according to an embodiment;

Fig. 8 shows diagrams illustrating an aspect of a model builder component of a fault injection apparatus according to an embodiment;

Fig. 9 shows diagrams illustrating an aspect of a model builder component of a fault injection apparatus according to an embodiment;

Fig. 10 is a diagram illustrating an aspect of an apparatus for injecting a fault into a distributed system according to an embodiment; and

Fig. 11 is a flow diagram illustrating a method for injecting a fault into a distributed system according to an embodiment.

In the following, identical reference signs refer to identical or at least functionally equivalent features. DETAILED DESCRIPTION OF THE EMBODIMENTS

In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.

For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.

In the following, some background material will be described in the context of an exemplary fault injection scenario for introducing detailed embodiments. In this exemplary fault injection scenario, the resiliency of the following exemplary code against communication failures occurring between services A and B is evaluated:

# Service A def A.getObjByName(name): try:

Objld = B.search(name) except lOException: return null Obj = B.getObject(Objld) return Obj

Service A has a function A.getObjByName(name) which queries objects by name. Service B provides function B.search(name), which allows to search for an object id by name, and function B.getObject(Objld) which queries objects by id. A reliability test can emulate a failure by blocking the connectivity between services A and B. For example, by disabling a network switch between the components providing the services A and B. In this particular example, the fault injected into the distributed system corresponds to disabling the switch. Faults can introduce many forms of disruption into a distributed system such as kill of process, process slow down, or packet loss in network cards. For this example, there are two scenarios (code paths) which should be tested. If the switch is disabled at an early stage of the execution, the call to B.search(Objld) might fail and an exception will be generated and properly handled by returning the null value. If the switch is disabled after B.search(Objld) is executed and before B.getObject(Objld) starts its execution, the function A.getObjByName(name) may not handle the error which may lead the whole application to crash.

To test the reliability of this exemplary fault injection scenario, one of the three fault injection approaches described in the following may be used.

In a conventional white-box approach, the source code of the distributed system has to be changed to add hooks to trigger the switch to fail. A new function could be placed before B.search(name) and before B.getObject(Objld) to mark the precise state where a fault is to be injected. Such a conventional white-box approach has two main limitations: 1) it requires changes in the source code, which is often costly, 2) the states to mark where faults should be injected need to be known in advance based on knowledge about the distributed system, which is not always readily available.

A second conventional fault injection approach is known as chaotic engineering. It consists in injecting faults into a random component of a distributed system at a random moment in time. Such an approach is easy to apply to any distributed system as a black-box, but has a low precision of injection since the state of the distributed system is generally unknown. Furthermore, it also has a low reproducibility because every time a fault is injected, the system may be in a different state.

In a third conventional fault injection approach known as time-based fault injection, the state of the distributed system is observed and, when the state preceding the predefined state where a fault needs to be injected is reached, a time model is used to calculate when the fault should be injected. While this approach is easy to implement since it treats a distributed system as a black box, the use of a probabilistic time model may require many experiments until a fault is injected in the desired state.

Figure 1 illustrates these three conventional fault injection approaches described above, i.e. the white-box fault injection approach having high precision, high reproducibility and high costs in the upper left-hand corner, the chaotic fault injection approach having low precision, low reproducibility and low costs in the upper right-hand corner, and the time-based fault injection approach having medium precision, medium reproducibility and low costs in the lower left-hand corner of figure 1. A further fault injection approach, which is referred to herein as model-based fault injection approach, is implemented by embodiments of this disclosure and is illustrated in the lower right-hand corner of figure 1. It has the following main characteristics: high precision, high reproducibility, and low implementation costs. As will be described in more detail below, embodiments of this disclosure employ distributed tracing infrastructures and trace instrumentation points which are re-used to identify important states of a distributed system. The use of distributed tracing enables to automatically create a runtime model of a distributed system by observing its states. Fault types may be automatically selected based on the current state of the distributed system. The distributed system may be paused during fault injection to ensure it remains in the desired state.

Embodiments of this disclosure may be used for analysing the reliability of an external operation made available by a distributed system (e.g., for Infrastructure as a Service (laaS), it can be, for instance, the operation to start a new virtual machine). A typical distributed system consists of multiple components, i.e. processing nodes such as services. In an embodiment, communication libraries may be used to implement distributed tracing technologies. This allows to catch calls between components of the distributed system, e.g., RPC calls or SQL queries to databases. Usually, the more instrumentation points exist, the more precise the injection of faults can be. Instrumentation points enable to track the current state of execution of an operation. As used herein, instrumentation is a process when a program code is extended to capture state information. Distributed tracing instrumentation is adding a call to a tracing library, where the library is responsible for representing the state (i.e. it can capture stack state and request variables). An instrumentation point is a specific place in the code where the tracing library is called. The communication library implementing tracing technology (herein also referred to as tracing library) may be configured to make external synchronous calls from instrumentation points. The call may include attributes that unambiguously represent the execution within the distributed system. For example, attributes may contain an address of the machine where the code is running, the name of the component of the distributed system, the name of the remote procedure call or HTTP endpoint. In general, the set of attributes may depend on the architecture of distributed system, e.g., in a container-based environment it may also refer to a container or pod.

Figure 2 shows a schematic diagram illustrating the architecture of an apparatus 200 according to an embodiment for injecting a fault into a distributed system 230. The distributed system 230 comprises a plurality of processing nodes, i.e. components for executing an operation. As illustrated in figure 2, the plurality of processing nodes define an infrastructure 233 of the distributed system 230, which moreover may comprise an application programming interface (API) 231 as well as a distributed tracing framework 235, in particular a distributed tracing library 235 for performing tracing operations within the infrastructure 233 of the distributed system 230. As will be described in more detail below, solid lines in figure 2 illustrate interactions between the modules of the fault injection apparatus 200 during both a learning and execution phase, while the dashed lines illustrate interactions during the execution phase only.

The fault injection apparatus 200 comprises a trace observer and notifier 215 configured to collect trace information about the execution of the operation by the plurality of processing nodes 233 of the distributed system 230, for instance, by obtaining information about trace events from the distributed tracing framework 235 of the distributed system 230. In an embodiment, the trace information may comprise information about an order of the execution of the operation by the plurality of processing nodes of the distributed system 230. In an embodiment, the trace observer and notifier 215 may comprise one or more communication libraries for collecting the trace information about the execution of the operation by the plurality of processing nodes of the distributed system 230.

Moreover, the fault injection apparatus 200 comprises a model builder 207 configured to generate based on the trace information collected by the trace observer and notifier 215 a model of a state machine (also referred to as "execution model") of the distributed system 230, i.e. the plurality of processing nodes 233 of the distributed system 230, as will be described in more detail further below. The model of the state machine of the distributed system 230 defines a plurality of different states of the distributed system 230. In an embodiment, the state machine model of the distributed system 230 may further define a plurality of transition probabilities between the plurality of different states of the distributed system. As will be described in more detail below, in an embodiment, the model builder 207 may be configured to generate the state machine model of the distributed system as a directed graph or a weighted Markov chain. Moreover, the model builder 207 may be configured to update an existing model of a state machine of the distributed system 230 based on further trace information, e.g. when a new state is detected.

The apparatus 200 further comprises a plan engine 205 configured to generate an execution plan on the basis of the model of the state machine of the distributed system 230 generated by the model builder 207. As will be described in more detail further below, for at least one state of the plurality of different states of the distributed system defined by the model of the state machine of the distributed system 230 the execution plan identifies a fault type to be injected in that state. The fault type may depend on a current state of the distributed system 230, on the current state and one or more previous states of the distributed system 230 or on the current state and one or more future predicted states of the distributed system 230. The plan engine 205 may be configured to generate and manage a plurality of different execution plans.

Moreover, as illustrated in figure 2, the apparatus 200 comprises a fault injector 211 configured to inject at least one fault into the distributed system 230 based on the execution plan generated by the plan engine 205. In an embodiment, the fault injector 211 is configured to inject a plurality of faults identified by the execution plan in sequential order or in an order based on a state probability. In an embodiment, the fault injector 211 is configured to inject a first and a second fault into the distributed system 230 based on the execution plan, wherein the injection of the first fault turns the distributed system 230 into a different state and the second fault is injected in the different state. In an embodiment, the fault injector 211 is configured to inject a plurality of faults into the distributed system 230 based on the execution plan, wherein the fault injector 211 is configured to terminate execution of the execution plan, once a specified coverage level has been reached. As used herein, a specified coverage level may be defined as a ratio (or percentage level) between the number of states already covered by an operation and the total number of states of the model of the state machine of the distributed system.

In the embodiment shown in figure 2, the fault injection apparatus 200 further comprises a user interface 201 for enabling a user 220 to specify the operation of the distributed system 230 to be analyzed by the fault injection(s) as well as its parameters. Moreover, the user interface 201 may be configured to display or provide a final evaluation report about the result of the fault injection(s).

The fault injection apparatus 200 may further comprise an operation adapter 213 configured to create a trace context and make calls to the external API 231 of the distributed system 230, as will be described in more detail below.

As illustrated in figure 2, the fault injection apparatus 200 may further comprise a breakpoint manager 209 for handling breakpoints and calling the fault injector 211 when breakpoints are triggered. More specifically, the breakpoint manager 209 may be configured to set a breakpoint at the at least one state of the plurality of different states of the distributed system 230 identified by the execution plan, wherein the fault injector 211 is configured to inject the at least one fault into the distributed system 230 when during the execution of the operation by the plurality of processing nodes 233 the breakpoint is reached. As will be described in more detail below, in an embodiment, the breakpoint manager 209 may be further configured to block any further state transitions, while the at least one fault is being injected into the distributed system 230 by the fault injector.

In an embodiment, the fault injection apparatus 200 may further comprise a fault library 203 comprising a plurality of different fault types, i.e. fault actions or operations, wherein the fault injector 211 is configured to select the at least one fault to be injected into the distributed system 230 from the fault library 203 based on the execution plan generated by the plan engine 205.

Figure 3 is a flow diagram illustrating from a high-level perspective the main steps 300 of a process performed by the different components of the apparatus 200 for injecting a fault into a distributed system 230 according to an embodiment:

Step 301 - Model Bootstrap: Based on the input from the user 220 (e.g. the user 220 may specify via the user interface 201 the operation for which the reliability of the distributed system 230 is to be evaluated, i.e. tested) a model bootstrap is performed, i.e. the (initial) model of the state machine of the distributed system 230 is bootstrapped and the (initial) execution plan is generated, i.e. the operation is executed "Bootstrap iterations" times. The model is build based on observed traces. The execution plan is derived from the model and it may be composed of statements. Step 303 - Execute Plan: The execution plan is processed by executing the statements defined by the execution plan, e.g. by selecting one after the other statement from the execution plan. For each statement, the state may be retrieved and a breakpoint may be set for the state (e.g. set breakpoint in state Sb).

Step 305 - Execute Operation: The operation of the distributed system 230 under analysis (as specified by the user 220) is executed by the plurality of processing nodes 233 of the distributed system 230.

Step 307 - Observe State: The traces generated by the distributed system 230 are observed, i.e. collected by the trace observer and notifier 215. Whenever a new state is observed, the model and/or the execution plan may be updated.

Step 309: Trace observations are collected until the distributed system 230 reaches the state Sb defined by the breakpoint in step 303.

Step 311 - Inject Fault: When the state defined by a breakpoint is reached, the fault Fj specified by the statement in the execution plan is injected into the distributed system 230. Observations are made until the operation completes (step 313).

Step 315 - Collect Results: The results of the operation execution are collected. The execution of the execution plan is continued until it is completed or until a requested coverage target is reached (step 317), as will be described in more detail further below.

Step 319 - Generate Report: A report consisting of all results of executing the operation of the distributed system 230 under different faults is created and may be provided via the user interface 201 to the user 220.

Figures 4 and 5 show in more detail the steps described above in the context of figure 3 as performed by the different components or modules of the fault injection apparatus 200, namely in a "learning phase" or "model building phase" (figure 4) and an "execution phase" (figure 5).

More specifically, in the "learning phase" or "model building phase" illustrated in figure 4, the different components of the apparatus 200 are configured to perform the following steps, which have, at least partially, already been described in the context of figures 2 and 3 above: Step 401 : The user interface 201 forwards information about the operation to be analyzed, i.e. tested to the plan engine 205.

Step 403: The plan engine 205 generates the execution plan for the operation to be tested and provides it to the operation adapter 213.

Step 405: The operation adapter 213 defines a trace identifier ("Trace id") and provides the trace identifier to the trace observer and notifier 215 for collecting the trace information using this trace identifier.

Step 407: The operation adapter 213 provides the operation and the associated trace identifier to the distributed system 230.

Step 408: The operation is executed by the distributed system 230.

Steps 409: During the execution of the operation the distributed system 230 using the distributed tracing library 235 produces trace events. These trace events are collected by the trace observer and notifier 215 and stored in memory as a trace object.

Step 411 : The result of the operation is recorded by the operation adapter 213.

Step 413: The trace object collected by the trace observer and notifier 215 is forwarded to the model builder 207, where it is used for generating the state machine model of the distributed system.

Step 415: The model is made available to the plan engine 205 which uses it to generate the execution plan.

Step 417: The report with results of the learning phase is created for the user interface 201.

In the "execution phase" illustrated in figure 5 (following the "learning phase" illustrated in figure 4), the different components of the apparatus 200 are configured to perform the following steps, which have, at least partially, already been described in the context of figures 2 and 3 above:

Step 501: The user interface 201 forwards information about the operation to be analyzed, i.e. tested to the plan engine 205. Step 503: The plan engine 205 picks up the next step of the execution plan and instructs the breakpoint manager 209 to set a breakpoint. The breakpoint can be specified as the target state the distributed system 230 must be in.

Step 505: The plan engine 205 provides the operation to be executed to the operation adapter 213.

Step 507: The operation adapter 213 defines a trace identifier ("Trace id") and provides the trace identifier to the trace observer and notifier 215 for collecting the trace information using this trace identifier.

Step 509: The operation adapter 213 provides the operation and the associated trace identifier to the distributed system 230.

Step 510: The operation is executed by the distributed system 230.

Steps 511 : During the execution of the operation the distributed system 230 using the distributed tracing library 235 produces trace events. They are collected by the trace observer and notifier 215 and stored in memory as a trace object.

Steps 513: The trace observer and notifier 215 determines a state based on the information from the trace event (as already described). The state is forwarded to the breakpoint manager 209 for evaluation.

Step 514: A breakpoint is detected by the breakpoint manager 209.

Step 515: In response to the detection of a breakpoint in step 514, the breakpoint manager 209 provides fault parameters of the fault to be injected into the distributed system 230 to the fault injector 211.

Step 517: Based on the fault parameters provided by the breakpoint manager 209 in step 515, the fault injector 211 injects a fault into the distributed system 230.

Step 519: Similar to step 511 with the difference that the trace observer and notifier 215 keeps information that the event is observed after the injected fault, thus it could be caused by the fault. Step 521 : The same as 513 (this step is needed when there is chained fault injection and there is one additional breakpoint set by the breakpoint manager 209).

Step 523: The operation result is recorded by the operation adapter 213.

Step 525: The trace is forwarded to the model builder 207, which may lead to a model update (e.g. a new state could be observed or the weight of a state could be changed).

Step 527: The plan engine 205 may update the execution plan based on the updated model from the model builder 207.

Step 529: The output of the operation is recorded and associated with the current step from the execution plan.

Step 531: The execution plan and the operation results are combined in a report that can be presented to the user via the user interface 201.

As will be appreciated, embodiments of the fault injection apparatus 200 implement a modelbased approach for the reliability testing of the distributed system 230, which is applicable to a wide range of distributed applications including modern micro-service architectures. In the following, some further embodiments of the fault injection apparatus 200 will be described in more detail.

In an embodiment, the operation (Op) of the distributed system 230 to be tested by the fault injection apparatus 200 may be an HTTP request with a set of parameters, e.g., “GET /v2/ servers” , or it can be a code snippet using SDK functions, e.g. openstacksdk. For simplicity and in a non-limiting fashion, in the following examples, idempotent functions are considered that do not require setup/tear down phases to be executed before and after calls. For non-idempotent functions, additional code that is executed before/after issuing the operation may be required.

In an exemplary embodiment, the fault injection apparatus 200 may be configured to check the reliability of the OpenStack VM creation operation. As already described above, the set of faults to be injected may be pre-configured in advance in a library of actions that can be performed against the distributed system 230, such as a restart of a certain service or an interrupted network connection. In an embodiment, fault parameters, such as bootstrap iterations, chained faults levels, and/or coverage targets may be specified. Bootstrap iterations indicate how many times the operation will be repeated to create the initial state machine model of the distributed system 230. Higher values result in a higher precision of estimation of state probabilities. Defining a chained faults level indicates how many levels of chained faults are allowed. Chains allow to process cascaded fault injections such as that the first fault injection turns the distributed system 230 into a new state, while this new state, in turn, may also be subject to another fault injection. Defining a coverage target enables to run the execution plan until the specified coverage target is reached. This option allows to skip less probable states to reduce overall execution time. The following table 1 gives an example for these type of fault parameters (which may be defined via the user interface 201 of the fault injection apparatus 200).

Table 1: Exemplary inputs via the user interface of the fault injection apparatus

The following table 2 shows the output report generated by the fault injection apparatus 200 according to an embodiment after the execution of the steps illustrated in Figure 3. Table 2 lists the operation executed, states where faults were injected, the estimated probability of a state occurring, a list of injected faults, and whether they affected the correct execution of the operation by the distributed system 230.

Table 2: Exemplary report generated by the fault injection apparatus The selected operation (i.e. the operation of the distributed system 230 to be tested) is provided by the user interface 201 to the plan engine 205. As already described above, the plan engine 205 is responsible for orchestrating a fault injection cycle. It executes a sequence of actions, defined by the execution plan, that are performed with the distributed system 230.

Initially, the plan engine 205 might not have any knowledge about the distributed system 230. In an embodiment, to obtain this knowledge the plan engine 205 executes the operation to be tested and waits for the model builder 207 to create a model of the distributed system 230. As already described above, the model generated by the model builder 207 represents the observations of the distributed system’s state machine with estimated probabilities of transitions between states. The fault injection apparatus 200 may be configured to repeat execution of the operation to be tested multiple times. In an embodiment, this may be controlled by means of the parameter "Bootstrap iterations".

Once the model of the state machine of the distributed system 230 has been generated by the model builder 207, the plan engine 205 generates the execution plan on the basis thereof. For instance, once the bootstrapping phase is finished, the plan engine 205 may be configured to transform the model of the state machine of the distributed system 230 into the execution plan with a statement like the following one:

<line> EXEC OPERATION <Op> ON STATE <Si> INJECT FAULT <Fj>, wherein line denotes a unique statement number, Op denotes the operation to be tested, Si denotes a state of the distributed system 230 and Fj denotes the fault to be injected. In an embodiment, a state Si of the distributed system 230 represents a specific point in a distributed trace. It may include the name of a processing node (i.e. component) of the distributed system 230, a function within a processing node and the like. In an embodiment, each state belongs to exactly one processing node of the distributed system 230, but each processing node may have many states. In an embodiment, the state does not depend on runtime properties such as process id (PID) or host name. The fault parameter Fj may be a reference to a function defined in the fault library 203 of the fault injection apparatus 200. In an embodiment, the fault injection apparatus 200 may be configured to implement the fault Fj using a script with a remote call to the distributed system 230. Figure 6 illustrates the states Si and traces of two processing nodes Ci and C2 of an exemplary distributed system 230. In an embodiment, the plan engine 205 of the fault injection apparatus 200 may be configured to generate an execution plan comprising one or more of the following exemplary statements for the states shown in figure 6:

When an execution plan contains multiple statements, the order of execution of the statements may be controlled by using one of the following execution strategies: (i) a sequential execution, where the statements of the execution plan are executed one by one; (ii) a frequent first execution, where the statements of the execution plan corresponding to the most frequent states of the distributed system 230 are executed first (in an embodiment, the process may stop when a certain coverage target is achieved); and (iii) a long tail execution, where the statements of the execution plan corresponding to non-frequent states are executed first, thus, allowing testing less probable states of the distributed system 230.

In an embodiment, the execution plan generated by the plan engine 205 may be implemented as a priority queue, with the ordering specified by the execution strategy, as illustrated in figure 7.

In an embodiment, the execution of a single statement of the execution plan by the fault injection apparatus 200 may involve one or more of the following steps: (i) the breakpoint manager 209 configures a new breakpoint for state Si with fault handler Fj; (ii) the operation Op is sent to the operation adapter 213, which will wait until the operation is finished; (iii) the breakpoint is cleared by the breakpoint manager 209; and (iv) the results are forwarded to the user interface 201.

In an embodiment, the result(s) of processing an execution plan may be presented in a tabular form, as in the following table 3, where each row has the following attributes: Operation - Operation Op analyzed; State - State Si where the fault was injected; Estimated state probability (%) - Percentage of traces where the state was observed; Fault injected - Fault injected; and Operation result - Indication if the operation was successful or not.

Table 3: Exemplary report generated by the fault injection apparatus

In an embodiment, a statement may associate a fault with one and only one state of the distributed system 230. However, a fault may also be associated with several states. For example, if the distributed system 230 is in state A and the next state is B (with observed probability PB), it is often desirable to inject a fault into the processing node associated with state B while the distributed system 230 is still in state A. The following table 4 shows the various types of statements implemented by the fault injection apparatus 200 according to an embodiment.

Table 4: Different types of statements implemented by the fault injection apparatus

In an embodiment, the size of an execution plan may be proportional to the number of states of the distributed system 230 multiplied by the number of faults stored in the fault library 203. Since very large execution plans may take a long time to process, the fault injection apparatus 200 may be configured to stop processing an execution plan, when a certain coverage level is achieved. The following table 5 shows the coverage functions implemented by the fault injection apparatus 200 according to an embodiment (where "# breakpoint states" denotes the number of states used as breakpoints and "# states" denotes the total number of states of the model of the distributed system 230).

Table 5: Different coverage functions implemented by the fault injection apparatus

As already described above, the operation adapter 213 of the fault injection apparatus 200 is mainly responsible for communicating with the distributed system 230, in particular for error handling and for collecting execution results. Each time the operation Op to be tested is executed by the distributed system 230, the operation adapter 213 may be configured to associate it with a unique trace id (TID), such as a random number, for example, a hexadecimal string representing 64 or 128 bit integers. The trace id allows the fault injection apparatus 200 to associate trace events with the execution of the operation by the distributed system 230. This may be advantageous, because the distributed system 230 may have multiple operations running in parallel, wherein each operation can produce different trace events, but will have a unique trace identifier.

Operations to be tested can be synchronous or asynchronous. In the latter case, the distributed system 230 may signal that an operation was accepted, but its processing will continue in the background. For example, an OpenStack VM creation operation is asynchronous and the user only receives a reply once a database object is created, but the spawning and network plumbing of the VM may take several seconds more. To handle asynchronous operations, the fault injection apparatus 200 is configured to handle an operation to be tested as finished when no more new events are received within a predefined time interval. In an embodiment, operations to be tested and the operation adapter 213 may operate synchronously. When the operation Op completes, the operation adapter 213 may report the result(s) to the user interface 201.

During the execution of operation Op, the instrumentation points of the distributed system 230 emit trace events along the execution path. Each event has a set of attributes describing the location of the instrumentation point and execution parameters. Attributes may be split into one of the following four categories: (i) static - identifies code location (e.g., component/function name, RPC or DB SQL statement); runtime - captures runtime information (e.g., PID, container identifier or host name); (iii) structural - describes the position of trace events within a trace (e.g., reference to caller); and (iv) timing - indicates when the trace event was generated (e.g., timestamp).

An exemplary trace event may look as follows: {

“tracel d” : “abcdef01234567890”, “eventld”: “01234567890abcdef”, “parentld”: “0987654321 fedcba”, “service”: “serviceA”, “functonName”: “getObject”, “host”: “10.1.1.2”, “pid”: 12345, “timestamp”: 1234567890 } In this example, the static attributes are “service” and “functionName”, the runtime attributes are “host” and “pid”, the structural attributes are “traceld”, “eventld” and “parentld”, and the timing attribute is “timestamp”. Trace events corresponding to the same operation may be grouped into a trace object. In an embodiment, from a data structure perspective a trace may be represented as a tree with parent-child relationships set by the attributes parentld and eventld. Based on the static attributes, the trace observer and notifier 215 may be configured to generate a State ID. For example, by using concatenation and a hash function in the following way: def get_state(event): return hash(event[‘component’] + + event[‘functionName’])

The trace observer and notifier 215 may be configured to persist a mapping from event to State ID. The mapping may be used to find out the state corresponding to an event’s parent.

Figure 8 shows an example of how the trace observer and notifier 215 may be configured for mapping events to states for the same code snippet, such as the following one:

Def A.getXY(idX, idY):

X = B.getObject(idX)

Y = B.getObject(idY)

Return X, Y

Figure 8 illustrates two cases, namely when the code is executed sequentially (on the lefthand side of figure 8) and when the code is executed in parallel (in the middle of figure 8). In this example, four pairs of events are mapped to four states, because the same function is called twice.

For each trace event observed, the trace observer and notifier 215 may be configured to carry out one or more of the following actions: (i) generate a State ID and update the event to State ID mappings; (ii) send the State ID and event’s runtime attributes to the breakpoint manager 209; and (iii) send the State ID and reference to the previous State ID to the model builder 207.

As already described above, the model builder 207 of the fault injection apparatus 200 is configured to represent the distributed system 230 by a model of its state machine consisting of a set of states Sj. The occurrence of each state is probabilistic, and the model builder 207 is configured to estimate the probability Py of moving from state Si to Sj. From a mathematical perspective, such a system can be represented as a discrete-time Markov Chain, where the probability of moving to the next state solely depends on the present state and does not dependent on previous states, i.e.: Pr(Sn+1 — X | Si — X1 , S2 ^— X2, ... Sn ^— Xn) — Pr(Sn+1 ^— X | Sn ^— Xn)

A Markov Chain can also be represented as a directed graph, where there is an edge from vertex i to vertex j when there is a non-zero probability of moving from state Si to Sj. The probability value Py can be used as weight for edge i j (the sum of the weights of outgoing edges is equal to 1). Thus, according to embodiments the model builder 207 is configured to implement a Markov Chain and/or a graph for evaluating the probability to reach state Si of a distributed system 230. The model builder 207 may be configured to determine the probability recursively in the following way:

In an embodiment, the probability of a state may be used to rank the statements of an execution plan.

In an embodiment, the model builder 207 of the fault injection apparatus 200 may initially start with an empty model of the state machine of the distributed system 230. During execution of the operation to be tested, the trace observer and notifier 215 notifies the model builder 207 about the respective observed state Si and provides a reference to the parent state S_p (as S_p occurs before Si, it is already present in the model generated by the model builder 207). If the state Si is new, the model builder 207 may add it into a graph G. If the edge Ey is not in the graph G, a new edge may be added with weight P = 1. If the edge was already in the graph, its weight may be updated according to the observations.

Figure 9 shows a further example for traces and a corresponding model generated by the fault injection apparatus 200 according to an embodiment. In the example shown in figure 9, the underlying distributed system 230 contains three processing nodes, i.e. components C1, C2 and C3, wherein C1 always sends requests to C2 and C2 queries data from C3, but stores it in in-memory cache for some period of time. There are instrumentation points at boundaries between the processing nodes, so it is possible to know when a processing node is about to send a request and when a request is received. Since the data from C3 is cached inside C2, requests from C2 to C3 do not always occur. The left-most side of Figure 9 shows the effect of caching (C2 returns data from memory and does not query C3). The case when data is retrieved is shown in the middle of figure 9 (C2 queries result from C3 and stores in memory). According to an embodiment, the communication between the processing nodes C1 , C2 and C3 may be modelled by the model builder 207 using a Markov Chain illustrated as a directed weighted graph (as illustrated on the right side of figure 9, i.e. the directed weighted graph corresponding to the Markov Chain modelling of the communication between the components C1 , C2, and C3). Assuming that caching contains the requested data in 90% of the cases, the edge S2 -> S7 has a weight of 0.9 and edge S2 -> S3 has a weight of 0.1 .

When a statement of the execution plan has been completed, new states discovered may update the model generated by the model builder 207 or older states may have their probability updated. When a new state Si is observed, the execution plan may be extended in a similar way to what is done during bootstrapping by adding new statements to the execution plan. Updates of the state machine model of the distributed system 230 may also introduce changes in probabilities which are applied to the execution plan.

As already described above, for each statement, the plan engine 205 may instruct the breakpoint manager 209 to set an injection breakpoint at state Si with fault Fj. During the execution of the operation, the trace observer and notifier 215 may notify the breakpoint manager 209 about the current state of the distributed system 230 and its runtime properties. When the current state matches a state configured as a breakpoint, the fault injector 211 may inject a fault by means of, for instance, the following statement:

INJECT FAULT <Fi> WITH ATTRIBUTES <PI D>, <HOST>, <other runtime attributes>

Thus, the fault injector 211 is generally responsible for the execution of fault actions. In an embodiment, the fault injector 211 may be configured to query the fault library 203 to find specific actions (functions) which can inject the desired faults. The implementation of actions may depend on the technology used by the distributed system 230, varying from plain Bash script and SSH, to agent-based playbooks. Each function may accept a set of runtime properties describing where to inject the fault. For example, it may be a PI D when a fault is to be injected into a process; a container id when the service is executed in containerized environments; or a network card or port for network-based faults.

The following example illustrates the implementation of a fault which blocks a firewall using the host id and PID: def interrupt_connection(host, pid): firewall_block(host, find_connection(host, pid))

According to an embodiment, the fault injection apparatus 200 may be configured to use, i.e. inject one or more of the following faults and fault types affecting the processing nodes of the distributed system 230: (i) execution failure(s), such as process abnormal termination, restart of a process, hang of a process, and the like; (ii) system resource failure(s), such as disk space issues, high utilization, and the like; (iii) network failure(s), such as packet loss, connection loss, message transport issues, and the like;

Examples for functions related to the previous fault types may include: (i) low_disk(host) - provokes low disk conditions on a specific host, i.e. processing node; (ii) pause_process(host, pid) - emulates hanging process with PI D on a specific host, i.e. processing node; and (iii) reject_connection(host1 , pid1 , host2, [pid2, portl , port2]) - triggers a connectivity failure between processes running on different hosts, i.e. processing nodes of the distributed system 230.

Once the fault is injected, the fault injector 211 transfers the processing to the breakpoint manager 209. The fault should be injected as fast as possible, since the service is in a suspended state. In case of an agent-based injector, the delay may be in the range of milliseconds.

Initially, an execution plan may contain one or more statements with only one fault. However, the injection of faults may reveal the existence of new states of the distributed system 230. For instance, in a distributed system 230 with two processing nodes, i.e. components Ci and C2, the processing node Ci may send a token to the processing node C2, wherein the processing node Ci acts optimistically, i.e. it sends a token always and renews it only in case of failure. This can be described as follows:

# Component C1

10 error = query_C2(token) 20 if error != 0 then 30 token = renew_token() 40 goto 10

Injecting a fault F1 into the function query_C2 exposes a new code path, namely a call to the function renew_token. To evaluate the reliability of this new function, two faults need to be injected, namely F1 followed by F2. This is illustrated in figure 10, which shows a directed graph that corresponds to a Markov Chain modelling communication between the processing nodes C1 and C2. The dashed edges in figure 10 correspond to state transitions caused by the first injected fault, while the dotted edge in figure 10 corresponds to a transition caused by chained faults. As can be taken from figure 10, states S7 and S8 are revealed only when the fault is injected in state S2.

For handling such scenarios, the model builder 207 may be configured to store states into the model of the distributed system 230 with a link to the causing state. Moreover, the model builder 207 may be configured to update the execution plan with a new statement containing two faults. In an embodiment, the apparatus 200 may be configured to limit the level of chaining of faults, since a full coverage of every next fault level may have an exponential cost.

Figure 11 is a flow diagram of a method 1100 for injecting a fault into a distributed system 230. The method 1100 comprises the steps of: collecting, at 1101 , trace information on an execution of an operation by a plurality of processing nodes of the distributed system 230; generating, at 1103, based on the trace information, a model of a state machine of the distributed system 230, wherein the model of the state machine of the distributed system 230 defines a plurality of different states of the distributed system 230; generating, at 1105, an execution plan based on the model of the state machine of the distributed system 230, wherein, for at least one state of the plurality of different states of the distributed system 230, the execution plan identifies a type of fault to be injected in the at least one state; and injecting, at 1107, at least one fault having the type identified in the execution plan into the distributed system 230.

The person skilled in the art will understand that the "blocks" ("units") of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step).

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.

Claims

1. An apparatus (200) for injecting a fault into a distributed system (230), wherein the apparatus (200) comprises: a trace observer (215) configured to collect trace information on an execution of an operation by a plurality of processing nodes of the distributed system (230); a model builder (207) configured to generate, based on the trace information, a model of a state machine of the distributed system (230), wherein the model of the state machine of the distributed system (230) defines a plurality of different states of the distributed system (230); a plan engine (205) configured to generate an execution plan based on the model of the state machine of the distributed system (230), wherein, for at least one state of the plurality of different states of the distributed system (230), the execution plan identifies a type of a fault to be injected in the at least one state; and a fault injector (211) configured to inject at least one fault having the type identified in the execution plan into the distributed system (230).

2. The apparatus (200) of claim 1, wherein the trace information comprises information on an order of the execution of the operation by the plurality of processing nodes of the distributed system (230).

3. The apparatus (200) of claim 1 or 2, further comprising a breakpoint manager (209) configured to set a breakpoint at the at least one state of the plurality of different states of the distributed system (230), wherein the fault injector (211) is further configured to inject the at least one fault into the distributed system (230) when, during the execution of the operation by the plurality of processing nodes, the breakpoint is reached.

4. The apparatus (200) of claim 3, wherein the breakpoint manager (209) is further configured to block any further state transition, while the at least one fault is being injected into the distributed system (230).

5. The apparatus (200) of any one of the preceding claims, wherein the execution plan identifies, for more than one state of the plurality of different states of the distributed system (230), a respective type of a fault to be injected in the respective state, and wherein the fault

29 injector (211) is further configured to inject the plurality of faults identified by the execution plan in sequential order or in an order based on a state probability.

6. The apparatus (200) of any one of the preceding claims, further comprising a fault library (203) comprising a plurality of different fault types, and wherein the fault injector (211) is further configured to select the at least one fault from the fault library (203) based on the execution plan.

7. The apparatus (200) of any one of the preceding claims, wherein the trace observer (215) comprises one or more communication libraries for collecting the trace information on the execution of the operation by the plurality of processing nodes of the distributed system (230).

8. The apparatus (200) of any one of the preceding claims, wherein the model of the state machine of the distributed system (230) further defines a plurality of transition probabilities between the states of the plurality of different states.

9. The apparatus (200) of any one of the preceding claims, wherein the model builder (207) is further configured to generate the model of the state machine of the distributed system (230) as a directed graph or a weighted Markov chain.

10. The apparatus (200) of any one of the preceding claims, wherein the model builder (207) is further configured to update the model of the state machine of the distributed system (230) when a new state is detected.

11. The apparatus (200) of any one of the preceding claims, wherein the fault injector (211) is further configured to inject a first fault and a second fault into the distributed system (230) based on the execution plan, wherein the first fault causes the distributed system (230) to be set into a different state, and wherein the second fault is injected in the different state.

12. The apparatus (200) of any one of the preceding claims, wherein the fault injector (211) is further configured to: inject a plurality of faults into the distributed system (230) based on the execution plan; and terminate the execution of the execution plan, once a specified coverage level has been reached.

30

13. The apparatus (200) of any one of the preceding claims, wherein the type of the fault to be injected in the at least one state depends on a current state of the distributed system (230), the current state and one or more previous states of the distributed system (230), or the current state and one or more future states of the distributed system (230).

14. A method (1100) for injecting a fault into a distributed system (230), wherein the method (1100) comprises: collecting (1101) trace information on an execution of an operation by a plurality of processing nodes of the distributed system (230); generating (1103), based on the trace information, a model of a state machine of the distributed system (230), wherein the model of the state machine of the distributed system (230) defines a plurality of different states of the distributed system (230); generating (1105) an execution plan based on the model of the state machine of the distributed system (230), wherein, for at least one state of the plurality of different states of the distributed system (230), the execution plan identifies a type of a fault to be injected in the at least one state; and injecting (1107) at least one fault having the type identified in the execution plan into the distributed system (230).

15. A computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method (1100) of claim 14, when the program code is executed by the computer or the processor.