WO2022085014A1 - Application fault analysis using machine learning - Google Patents

Application fault analysis using machine learning

Info

Publication number
WO2022085014A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
log
computing device
machine learning
execution log
Prior art date
Application number
PCT/IN2020/050901
Other languages
French (fr)
Inventor
Ashutosh Bisht
Raghotham Sripadraj
Rahul Jain
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/IN2020/050901 priority Critical patent/WO2022085014A1/en
Publication of WO2022085014A1 publication Critical patent/WO2022085014A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3476Data logging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447Performance evaluation by modeling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the present disclosure relates to fault detection and, in particular, to apparatuses and methods for application fault analysis using machine learning.
  • Cloud infrastructure provides a rich set of management tasks that operate computing, storage, and networking resources in the cloud. Monitoring the executions of these tasks is crucial for cloud providers to promptly identify and understand problems that compromise cloud availability. However, such monitoring is challenging because there are multiple distributed service components involved in the execution.
  • Cloud Seer enables workflow monitoring by taking a lightweight non-intrusive approach that purely works on interleaved logs widely existing in cloud infrastructures.
  • Cloud Seer first builds an automaton for the workflow of each management task based on normal executions, and then it checks log messages against a set of automata for workflow divergences in a streaming manner. Divergences found during the checking process indicate potential execution problems, which may or may not be accompanied by error log messages.
  • For each potential problem, Cloud Seer outputs context information, including the affected task automaton and related log messages hinting where the problem occurs, to assist with further diagnosis.
  • Cloud Seer uses a process for error detecting that is based on logs that represent correct task execution and then determines whether live logs deviate from correct task execution.
  • Some embodiments advantageously provide apparatuses and methods for application fault analysis using machine learning.
  • a method implemented in a computing device to detect a fault in a cloud infrastructure includes receiving at least one execution log resulting from running at least one task on the cloud infrastructure; using a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure; and for at least one log, generating a log template in which at least one value in the at least one log is converted into at least one parameter variable.
  • the method includes when the at least one execution log is determined to match the at least one fault, indicating an error to be addressed.
  • the machine learning model is trained by the plurality of faults and collecting the plurality of logs associated with the faults.
  • the method further includes segregating interleaved log templates into at least two separate sets of log templates, each set being associated with a respective parallel transaction.
  • using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises using the machine learning model to determine whether the at least one execution log is similar to the at least one fault, the similarity being based at least in part on a similarity score.
  • the similarity score represents a probability that the at least one execution log matches the at least one fault.
  • using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises using the machine learning model to determine a similarity score, the similarity score representing a level of similarity of the at least one execution log to a set of log templates associated with the at least one fault; and when the similarity score at least meets a similarity threshold, determining that the at least one execution log matches the at least one fault and indicating that the at least one execution log corresponds to an error in the cloud infrastructure.
  • the machine learning model is a graph-based model comprising a graph representing the at least one fault, a node of the graph representing a log template and an edge of the graph representing a subsequent log template in a faulty transaction.
  • the graph comprises a plurality of sub-graphs, each sub-graph representing a corresponding fault.
  • the similarity score is based at least in part on an incoming sequence of the at least one execution log as compared to an incoming sequence for the set of log templates associated with the at least one fault.
  • the similarity score is based at least in part on an incoming sequence, a total number of nodes, a total number of nodes that match a sequence position, a number of nodes that do not match, a number of dissimilar nodes and a position of a last matched node, in the at least one execution log as compared to the set of log templates associated with the at least one fault.
  • the machine learning model is a language-based model in which each log template is mapped to a corresponding log template vector.
  • the language-based model outputs at least one fault index representing a probability that the at least one execution log matches the at least one fault corresponding to the at least one fault index.
  • the language-based model outputs at least one fault index vector.
  • using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises when the at least one fault index vector is similar to a predetermined fault index vector, determining that the at least one execution log matches the at least one fault, the similarity of the at least one fault index vector to the predetermined fault index vector representing the probability that the at least one execution log matches the at least one fault.
  • the method further includes when the at least one execution log is determined to match the at least one fault, identifying at least one fault remedy associated with the at least one fault.
  • a computing device to detect a fault in a cloud infrastructure includes processing circuitry configured to cause the computing device to receive at least one execution log resulting from running at least one task on the cloud infrastructure; use a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure; and for at least one log, generate a log template in which at least one value in the at least one log is converted into at least one parameter variable.
  • the processing circuitry configured to cause the computing device to when the at least one execution log is determined to match the at least one fault, indicate an error to be addressed.
  • the machine learning model is trained by the plurality of faults and collecting the plurality of logs associated with the faults.
  • the processing circuitry is configured to cause the computing device to segregate interleaved log templates into at least two separate sets of log templates, each set being associated with a respective parallel transaction.
  • the processing circuitry is configured to cause the computing device to use the machine learning model to determine whether the at least one execution log matches the at least one fault by being configured to cause the computing device to use the machine learning model to determine whether the at least one execution log is similar to the at least one fault, the similarity being based at least in part on a similarity score.
  • the similarity score represents a probability that the at least one execution log matches the at least one fault.
  • the processing circuitry is configured to cause the computing device to use the machine learning model to determine whether the at least one execution log matches the at least one fault by being configured to cause the computing device to use the machine learning model to determine a similarity score, the similarity score representing a level of similarity of the at least one execution log to a set of log templates associated with the at least one fault; and when the similarity score at least meets a similarity threshold, determine that the at least one execution log matches the at least one fault and indicate that the at least one execution log corresponds to an error in the cloud infrastructure.
  • the machine learning model is a graph-based model comprising a graph representing the at least one fault, a node of the graph representing a log template and an edge of the graph representing a subsequent log template in a faulty transaction.
  • the graph comprises a plurality of sub-graphs, each sub-graph representing a corresponding fault.
  • the similarity score is based at least in part on an incoming sequence of the at least one execution log as compared to an incoming sequence for the set of log templates associated with the at least one fault.
  • the similarity score is based at least in part on an incoming sequence, a total number of nodes, a total number of nodes that match a sequence position, a number of nodes that do not match, a number of dissimilar nodes and a position of a last matched node, in the at least one execution log as compared to the set of log templates associated with the at least one fault.
  • the machine learning model is a language-based model in which each log template is mapped to a corresponding log template vector.
  • the language-based model outputs at least one fault index representing a probability that the at least one execution log matches the at least one fault corresponding to the at least one fault index.
  • the language-based model outputs at least one fault index vector.
  • the processing circuitry is configured to cause the computing device to use the machine learning model to determine whether the at least one execution log matches the at least one fault by being configured to cause the computing device to when the at least one fault index vector is similar to a predetermined fault index vector, determine that the at least one execution log matches the at least one fault, the similarity of the at least one fault index vector to the predetermined fault index vector representing the probability that the at least one execution log matches the at least one fault.
  • the processing circuitry is configured to cause the computing device to when the at least one execution log is determined to match the at least one fault, identify at least one fault remedy associated with the at least one fault.
  • a computing device for detecting a fault in a cloud infrastructure includes processing circuitry.
  • the processing circuitry is configured to cause the computing device to perform any of the methods described herein.
  • an apparatus includes computer instructions executable by at least one processor to perform any of the methods described herein.
  • FIG. 1 illustrates an example system architecture according to some embodiments of the present disclosure
  • FIG. 2 is a flowchart of an example method of detecting a fault according to some embodiments of the present disclosure
  • FIG. 3 is a schematic diagram illustrating yet another example system architecture for a training phase according to some embodiments of the present disclosure
  • FIG. 4 is a schematic diagram illustrating yet another example system architecture for an inference phase according to some embodiments of the present disclosure
  • FIG. 5 is a schematic diagram illustrating an example transaction segregation according to some embodiments of the present disclosure
  • FIG. 6 is a schematic diagram illustrating an example merging of graphs according to some embodiments of the present disclosure
  • FIG. 7 is a schematic diagram illustrating an example word-to-vectorization process according to some embodiments of the present disclosure
  • FIG. 8 is a schematic diagram illustrating an example of vectors (each vector being on a per template basis) for two transactions according to some embodiments of the present disclosure
  • FIG. 9 is a schematic diagram illustrating an example classification model with log vectors according to some embodiments of the present disclosure.
  • FIG. 10 is a schematic diagram illustrating an example paragraph vector according to some embodiments of the present disclosure.
  • FIG. 11 is a schematic diagram illustrating an example regression model with log vectors according to some embodiments of the present disclosure.
  • existing fault detection techniques flag conditions as faults that may not be real faults. For example, this could occur when there are other workflows used in live operation that were not used or envisioned during the training stage. In some other cases, ad-hoc temporary procedures/workflows are used. Such ad-hoc workflows again show up as fault conditions. These spurious fault conditions indicated by the existing solutions result in additional work for operational staff, because they must spend time analyzing all the logs that are being flagged as a fault.
  • training logs are collected for model training.
  • some embodiments may include one or more of:
  • 1. Identifying multiple fault conditions. These could be, for example:
    a. Operational fault conditions such as ‘quota for a tenant is exceeded’;
    b. Transient environment fault conditions such as ‘network connectivity to destination is down’; and/or
    c. Transient application fault conditions such as ‘MySQL server down’.
  • 2. Next, in a correctly working cloud environment, one fault condition is induced. Logs may then be collected from the various cloud infrastructure components (such as, for example, Nova, Neutron, etc. in the case of an Openstack environment).
  • the above steps may be performed for every other fault, as well.
  • a catalogue may then be created of such application logs and the associated fault condition. This catalogue (of logs and associated condition) may then be used for development and training of a machine learning (ML) model.
  • a templatized documentation may also be created for resolving the fault with the assistance of subject matter experts.
  • the template variables in the documentation may be filled based on the particular context from logs.
  • a live stream of logs coming from the infrastructure components may be passed to the ML Model.
  • One or more of the following steps may then be taken:
  • the ML model continuously predicts if a collection of live logs and metrics matches against the known set of faults. This match however may not be an exact match.
  • the model may use ML techniques to determine a similarity score of live logs against the known set of fault logs.
  • the associated documentation template (to resolve the fault) is selected.
  • Data from the historical logs (such as the last 5 minutes (mins) of the live stream of logs) is extracted to identify the context of the fault. This context may then be used to fill the variables in the documentation template.
  • the updated documentation may then be provided to a user as a sample/way to resolve the problem.
  • Some embodiments of the present disclosure may include at least two alternative techniques for building the ML model: one is based on a graph-based model and the other on language-based models.
  • An ML model is trained to identify fault conditions using logs from correct and faulty task executions. This identification of log patterns is not performed manually with the help of a human expert. Instead, it is learned by the ML model based on training data.
  • One of the reasons is that many such systems are asynchronous in nature. As such, it is not easy to write rules for these conditions.
  • the ML model provides a similarity score between 0 and 1 that provides an indication of how similar the current set of log sequences is to a known set of faults. A score is provided against each known fault condition.
  • Graph based models may be used to identify the similarity between current logs and the set of known faults.
  • Alternative solutions include using a language-based ML model to identify the similarity between current logs and the set of known faults.
  • Some embodiments of the present disclosure provide a set of steps that can be used to mitigate the fault that is identified by the ML model (e.g., when the similarity score is above a predetermined threshold score).
  • Some embodiments provide a solution that may not require access to source code for development of the prediction model.
  • Some embodiments provide a solution in which there is no need for any instrumentation at the application source code level.
  • Some embodiments provide a solution in which predictions are available in a near real time basis.
  • Some embodiments may advantageously provide one or more of the following: 1. Some embodiments may miss some fault conditions (which were not in the training set); however, the logs that are flagged as errors will have high precision (faults identified are really fault conditions). This may ensure that there is no wasted work (or at least a reduction as compared to existing arrangements) by operational staff looking into the logs flagged as faulty.
  • the logs flagged as faults provide an explanation as to why certain logs are identified as a fault (e.g., by providing an explanation such as: “This collection of logs is similar (with a similarity score of 90%) to the log sequence of known fault XYZ.”).
  • some embodiments further provide a set of tentative corrective actions to be applied to correct or mitigate the identified fault.
  • relational terms such as “first” and “second,” “top” and “bottom,” and the like, may be used solely to distinguish one entity or element from another entity or element without necessarily requiring or implying any physical or logical relationship or order between such entities or elements.
  • the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the concepts described herein.
  • the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise.
  • the joining term, “in communication with” and the like may be used to indicate electrical or data communication, which may be accomplished by physical contact, induction, electromagnetic radiation, radio signaling, infrared signaling or optical signaling, for example.
  • electrical or data communication which may be accomplished by physical contact, induction, electromagnetic radiation, radio signaling, infrared signaling or optical signaling, for example.
  • Coupled may be used herein to indicate a connection, although not necessarily directly, and may include wired and/or wireless connections.
  • the non-limiting term computing device is used herein and can be any type of computing device capable of implementing one or more of the techniques disclosed herein.
  • the computing device may be a device in or in communication with a cloud system.
  • a computing device may include physical components, such as processors, allocated processing elements, or other computing hardware, computer memory, communication interfaces, and other supporting computing hardware.
  • the computing device may use dedicated physical components, or the computing device may be implemented as one or more allocated physical components in a cloud environment, such as one or more resources of a datacenter.
  • a computing device may be associated with multiple physical components that may be located either in one location, or may be distributed across multiple locations.
  • execution log may be used to indicate a log record obtained by running cloud infrastructure applications and/or components e.g., whose faults are being detected by some embodiments of the present disclosure.
  • log template may be used to indicate a template log in which one or more context values in an actual log may be replaced by one or more parameters (template variables).
  • a log template may be considered a templatized form of a log line.
  • FIG. 1 is a schematic diagram of the communication system 10, according to one embodiment, constructed in accordance with the principles of the present disclosure.
  • the communication system 10 in FIG. 1 is a non-limiting example and other embodiments of the present disclosure may be implemented by one or more other systems and/or networks.
  • FIG. 1 presents an overview of the different components in one embodiment of the present disclosure using data from log records in e.g., a cloud infrastructure.
  • the system 10 includes a computing device 12 and a cloud infrastructure 14.
  • the computing device 12 is shown including a fault pattern detector 16, a template parser 18 and a transaction segregation 20.
  • While the example system 10 shown in FIG. 1 includes a single computing device 12 including the fault pattern detector 16, the template parser 18 and the transaction segregation 20, it should be understood that, in some embodiments, one or more of the fault pattern detector 16, the template parser 18 and the transaction segregation 20 may be included in separate computing devices 12.
  • the computing device 12 such as via fault pattern detector 16, the template parser 18 and the transaction segregation 20, is configured to receive at least one execution log resulting from running at least one task on the cloud infrastructure; use a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure; and for at least one log, generate a log template in which at least one value in the at least one log is converted into at least one parameter variable.
  • system 10 may include numerous instances of the devices shown in FIG. 1, as well as additional devices not shown in FIG. 1.
  • system 10 may include many more connections/interfaces than those shown in FIG. 1.
  • Example implementations, in accordance with some embodiments, of computing device 12 are described with reference to FIG. 1.
  • the computing device 12 includes a communication interface 22, processing circuitry 24, and memory 26.
  • the communication interface 22 may be configured to communicate with any of the elements of the system 10 according to some embodiments of the present disclosure.
  • the communication interface 22 may be formed as or may include, for example, one or more radio frequency (RF) transmitters, one or more RF receivers, and/or one or more RF transceivers, and/or may be considered a radio interface.
  • the communication interface 22 may also include a wired interface.
  • the processing circuitry 24 may include one or more processors 28 and memory, such as, the memory 26.
  • the processing circuitry 24 may comprise integrated circuitry for processing and/or control, e.g., one or more processors and/or processor cores and/or FPGAs (Field Programmable Gate Array) and/or ASICs (Application Specific Integrated Circuitry) adapted to execute instructions.
  • the processor 28 may be configured to access (e.g., write to and/or read from) the memory 26, which may comprise any kind of volatile and/or nonvolatile memory, e.g., cache and/or buffer memory and/or RAM (Random Access Memory) and/or ROM (Read-Only Memory) and/or optical memory and/or EPROM (Erasable Programmable Read-Only Memory).
  • the computing device 12 may further include software stored internally in, for example, memory 26, or stored in external memory (e.g., database) accessible by the computing device 12 via an external connection.
  • the software may be executable by the processing circuitry 24.
  • the processing circuitry 24 may be configured to control any of the methods and/or processes described herein and/or to cause such methods, and/or processes to be performed by, e.g., computing device 12, the fault pattern detector 16, the template parser 18 and the transaction segregation 20.
  • the memory 26 is configured to store data, programmatic software code and/or other information described herein.
  • the software may include instructions stored in memory 26 that, when executed by the processor 28, the fault pattern detector 16, the template parser 18 and/or the transaction segregation 20, cause the processing circuitry 24 and/or configure the computing device 12 to perform the processes described herein with respect to the computing device 12 (e.g., the processes described with reference to FIG. 2 and/or any of the other figures).
  • the connection between the computing device 12 and the cloud infrastructure 14 is shown without explicit reference to any intermediary devices or connections. However, it should be understood that intermediary devices and/or connections may exist between these devices, although not explicitly shown.
  • Although FIG. 1 shows the fault pattern detector 16, the template parser 18 and the transaction segregation 20 as being within a respective processor, it is contemplated that these elements may be implemented such that a portion of the elements is stored in a corresponding memory within the processing circuitry. In other words, the elements may be implemented in hardware or in a combination of hardware and software within the processing circuitry. The elements may likewise be distributed among multiple computing devices.
  • FIG. 2 is a flowchart of an example process in a computing device 12 for e.g., detecting a fault according to some embodiments of the present disclosure.
  • One or more Blocks and/or functions and/or methods performed by the computing device 12 may be performed by one or more elements of computing device 12 such as by the fault pattern detector 16, the template parser 18, the transaction segregation 20 in processing circuitry 24, memory 26, processor 28, communication interface 22, etc. according to the example process/method.
  • the example method includes receiving (Block S100), such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, at least one execution log resulting from running at least one task on the cloud infrastructure.
  • the method includes using (Block S102), such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure.
  • the method includes for at least one log, generating (Block S104), such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, a log template in which at least one value in the at least one log is converted into at least one parameter variable.
  • the method includes when the at least one execution log is determined to match the at least one fault, indicating, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, an error to be addressed.
  • the machine learning model is trained by the plurality of faults and collecting the plurality of logs associated with the faults.
  • the method further includes segregating, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, interleaved log templates into at least two separate sets of log templates, each set being associated with a respective parallel transaction.
  • using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises using, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, the machine learning model to determine whether the at least one execution log is similar to the at least one fault, the similarity being based at least in part on a similarity score.
  • the similarity score represents a probability that the at least one execution log matches the at least one fault.
  • using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises using, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, the machine learning model to determine a similarity score, the similarity score representing a level of similarity of the at least one execution log to a set of log templates associated with the at least one fault.
  • when the similarity score at least meets a similarity threshold, determining, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, that the at least one execution log matches the at least one fault and indicating that the at least one execution log corresponds to an error in the cloud infrastructure.
  • the machine learning model is a graph-based model comprising a graph representing the at least one fault, a node of the graph representing a log template and an edge of the graph representing a subsequent log template in a faulty transaction.
  • the graph comprises a plurality of sub-graphs, each sub-graph representing a corresponding fault.
  • the similarity score is based at least in part on an incoming sequence of the at least one execution log as compared to an incoming sequence for the set of log templates associated with the at least one fault.
  • the similarity score is based at least in part on an incoming sequence, a total number of nodes, a total number of nodes that match a sequence position, a number of nodes that do not match, a number of dissimilar nodes and a position of a last matched node, in the at least one execution log as compared to the set of log templates associated with the at least one fault.
  • the machine learning model is a language-based model in which each log template is mapped to a corresponding log template vector.
  • the language-based model outputs at least one fault index representing a probability that the at least one execution log matches the at least one fault corresponding to the at least one fault index.
  • the language-based model outputs at least one fault index vector.
  • using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises when the at least one fault index vector is similar to a predetermined fault index vector, determining, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, that the at least one execution log matches the at least one fault, the similarity of the at least one fault index vector to the predetermined fault index vector representing the probability that the at least one execution log matches the at least one fault.
  • the method further includes when the at least one execution log is determined to match the at least one fault, identifying, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, at least one fault remedy associated with the at least one fault.
  • Having generally described arrangements for application fault analysis using machine learning, a more detailed description of some of the embodiments is provided below with reference to FIGS. 3-11, which may be implemented by any one or more of computing device 12, fault pattern detector 16, template parser 18 and transaction segregation 20.
  • FIG. 3 illustrates an example system 10 including fault pattern detector 16, template parser 18 and transaction segregation 20, which may be used during an ML training phase.
  • a database of success logs 30 and failed logs 32 may be input into the template parser 18.
  • the template parser 18 may convert the logs into log templates, as discussed in more detail below.
  • the transaction segregator 20 may segregate interleaved log templates for multiple parallel transactions, as discussed in more detail below.
  • the ML model may be trained at fault pattern detector 16.
  • the fault pattern detector 16 may provide detector output, which is then compared against the expected output in order to train the detector 16.
  • Arrows in FIG. 3 indicate the flow of data; however, the actual sequence of operations may be carried out independently and/or in parallel, in some embodiments. For example, while the transaction segregator 20 is segregating transactions, more logs can be given as input to the template parser 18.
  • FIG. 4 illustrates an example system 10 including fault pattern detector 16, template parser 18 and transaction segregation 20, which may be used during an inference phase.
  • Live logs 34 are provided to the template parser 18, which converts the live logs into corresponding log templates.
  • the transaction segregator 20 then segregates interleaved log templates for parallel transactions.
  • the fault pattern detector 16 detects and/or predicts whether the live logs indicate a known fault. In some embodiments, the fault pattern detector 16 outputs similarity scores indicating a probability that the live logs correspond to one or more known faults.
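  • As an illustration, the inference flow above might be composed as in the following Python sketch. The component interfaces shown are hypothetical stand-ins for elements 18, 20 and 16; this document does not define them:

      # Hypothetical wiring of template parser 18, transaction segregator 20
      # and fault pattern detector 16 for live logs, per FIG. 4.
      def process_live_logs(live_logs, parser, segregator, detector,
                            threshold=0.8):
          templates = [parser.to_template(line) for line in live_logs]
          transactions = segregator.group_by_transaction(templates)
          alerts = []
          for tx_id, sequence in transactions.items():
              scores = detector.score(sequence)      # one score per known fault
              fault, best = max(scores.items(), key=lambda kv: kv[1])
              if best >= threshold:                  # flag only confident matches
                  alerts.append((tx_id, fault, best))
          return alerts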
  • Openstack is a free, open-standard cloud computing platform. It is deployed as Infrastructure-as-a-Service, where virtual servers are made available to users. It should be understood that Openstack is merely exemplary and some embodiments of the present disclosure may be implemented on other types of cloud computing platforms.
  • NFVi: Ericsson Network Function Virtualization Infrastructure
  • VNF: virtual network function
  • vEPG: virtual Evolved Packet Gateway
  • Openstack (and Ericsson NFVi) solutions include multiple loosely coupled web services. These include:
  • Neutron - provides connectivity-as-a-service between interfaces and devices (such as virtual network interface cards (NICs) that are attached to virtual machines);
  • Cinder - provides block storage services to Nova virtual machines; and
  • Glance - provides services for management of VM images.
  • Data Directory for Successful Scenarios
  • Some embodiments include a data directory including application logs for successful scenarios, which may be stored in a database of success logs 30. As an example, this directory may include logs from the Nova, Neutron, Cinder, Glance and Keystone applications when a VM is created successfully.
  • Some embodiments include a data directory including log files for fault scenarios, which may be stored in a database of failed logs 32. For each fault, a new directory may be created.
  • a fault directory may be created, which may be called fault-1, for example.
  • This fault directory includes logs from all services, such as Nova, Neutron, Cinder, Glance and Keystone.
  • a fault directory fault-2 may be created.
  • This directory may include logs from all services such as Nova, Neutron, Cinder, Glance, Keystone (including logs from the failed service, e.g., cinder service).
  • the overall directory structure may be:
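  • The listing itself is not reproduced in this text; based on the description above, an illustrative layout (directory and file names assumed) might be:

      success/
          nova.log  neutron.log  cinder.log  glance.log  keystone.log
      fault-1/
          nova.log  neutron.log  cinder.log  glance.log  keystone.log
      fault-2/
          nova.log  neutron.log  cinder.log  glance.log  keystone.log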
  • Log Template Parser
  • Some embodiments include a log template parser 18 configured to parse the lines in log files into log templates and context. This can be performed using, for example, the technique noted in a DeepLog solution such as described in the following reference (https://www.cs.utah.edu/~lifeifei/papers/deeplog.pdf).
  • the log template T for a log entry such as:
  • the context is the value <10> and the asterisk in the corresponding log template T represents a parameter.
  • Parameter(s) are abstracted as asterisk(s) in a log template.
  • the parameter values reflect the underlying system state, particular context and performance status. Although the example shows one parameter, a log entry may have multiple parameters.
  • Values of certain parameters may serve as identifiers for an execution sequence, such as instance_id in an OpenStack logs.
  • the log template parser 18 may be configured to receive input from the data directories discussed above for successful and fault scenarios.
  • the log template parser 18 converts those files including log entries into files including log templates.
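  • As an illustration, a minimal template parser along these lines might look like the following Python sketch. This simple regex is a stand-in for the DeepLog technique, and the example log line and value-matching rule are assumptions, not taken from this document:

      import re

      # Replace numeric and hexadecimal values with '*' and keep them as
      # context. DeepLog uses a more sophisticated learned parser; this
      # regex-based rule is only a stand-in.
      PARAM = re.compile(r"0x[0-9a-fA-F]+|\d+")

      def to_template(log_line):
          context = PARAM.findall(log_line)    # e.g., ["10"]
          template = PARAM.sub("*", log_line)  # parameters abstracted as asterisks
          return template, context

      # Illustrative log entry (assumed): "Took 10 seconds to build instance."
      template, context = to_template("Took 10 seconds to build instance.")
      # template -> "Took * seconds to build instance."
      # context  -> ["10"]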
  • FIG. 5 shows the way the transaction segregator may segregate interleaved templates for two transactions.
  • The FIG. 5 notation uses T1, T2, etc. to represent the various log templates, and id1, id2 to represent the unique identifiers for ongoing transactions. T1-id1 then represents template T1 for the transaction with identifier id1.
  • the left hand side shows an example of interleaved log templates for two transactions, id1 and id2.
  • the right hand side shows the log templates being separated for each transaction by transaction segregator 20.
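  • A minimal sketch of this segregation in Python follows; the (template, transaction id) input form is an assumption, and the T1/T2 and id1/id2 names reuse the FIG. 5 notation:

      from collections import defaultdict

      # Group an interleaved stream of (template, transaction_id) pairs into
      # one ordered template sequence per transaction.
      def segregate(interleaved):
          per_transaction = defaultdict(list)
          for template, tx_id in interleaved:
              per_transaction[tx_id].append(template)
          return dict(per_transaction)

      mixed = [("T1", "id1"), ("T1", "id2"), ("T2", "id1"),
               ("T2", "id2"), ("T3", "id1"), ("T3", "id2")]
      print(segregate(mixed))
      # {'id1': ['T1', 'T2', 'T3'], 'id2': ['T1', 'T2', 'T3']}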
  • Fault Pattern Detector
  • Some embodiments include the fault pattern detector 16, which matches input log sequences against known log fault sequences. At least three embodiments are described below.
  • Embodiment-1: Graph-Based Models
  • the model at fault pattern detector 16 learns/builds an individual graph for each fault condition. It uses the log templates and their sequence in the success scenarios and fault scenarios to create such a graph.
  • the nodes of the graph represent a log template.
  • the edges in the graph represent the next log template in the faulty transaction.
  • graphs for fault conditions can be combined.
  • the graph uses a special marker in the node to identify if part of a subgraph in the combined graph corresponds to a sequence for a fault scenario.
  • FIG. 6 illustrates an example of this merging scenario and use of a special marker (as a star).
  • the example graph merging depicted in FIG. 6 allows for merging log template sequences that have two patterns:
  • Partial subset, such as Fault-1 is T1 -> T2 -> T3 -> T4 and Fault-3 is T1 -> T20.
  • Incoming transactions are matched against these graphs.
  • when a match is found, the fault pattern detector 16 flags the transaction as a known fault scenario.
  • the sequence of input may differ slightly from the log templates identified for a fault sequence. For example, consider a fault sequence represented as:
  • the incoming log template sequence (from the live logs) for a transaction can be:
  • the fault pattern detector 16 may be configured to use a similarity score (e.g., between 0 and 1) to identify how similar the incoming log template sequence is as compared to a fault sequence.
  • the similarity score may be considered to represent a probability that the incoming log template sequence is a fault sequence (instead of a binary yes or no to an exact match). The score may be based on one or more of: the incoming sequence, the total number of nodes, the total number of nodes that match a sequence position, the number of nodes that do not match, the number of dissimilar nodes and the position of the last matched node, in the incoming log template sequence as compared to the set of log templates associated with the fault.
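  • A minimal Python sketch of such a score follows; the document specifies no formula, so the weighting and the particular combination of the factors listed above are assumptions:

      def similarity(incoming, fault_sequence):
          """Score in [0, 1] comparing an incoming template sequence
          against a known fault sequence, position by position."""
          total = max(len(incoming), len(fault_sequence))
          pairs = list(zip(incoming, fault_sequence))
          matched = sum(1 for a, b in pairs if a == b)
          # Position of the last node that matched its sequence position.
          last_match = max((i + 1 for i, (a, b) in enumerate(pairs) if a == b),
                           default=0)
          # Assumed weighting of two of the factors listed above.
          return 0.7 * (matched / total) + 0.3 * (last_match / total)

      print(similarity(["T1", "T2", "T4", "T3"],
                       ["T1", "T2", "T3", "T4"]))  # 0.5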
  • Embodiment-2: Vectorizing the Log Templates
  • a vector (or embedding) is created for each log template. These vectors are created using a word-to-vector technique in which each word is mapped to a unique vector. The vectors are initialized with random values. A task is set up to predict a word given the other words in a context. The concatenation or sum of the vectors is then used as features for prediction of the next word in a sentence.
  • FIG. 7 illustrates an example of this embodiment for an example phrase - ‘the cat sat on’.
  • the context of three words (‘the’, ‘cat’, ‘sat’) is used to predict the fourth word (‘on’).
  • the neural network-based word vectors are trained using stochastic gradient descent where the gradient is obtained via backpropagation. These types of models are commonly known as neural language models.
  • Some embodiments of the present disclosure may use the technique to create vectors for log templates.
  • the fault pattern detector 16 is configured to predict the next log template given the sequence of log templates in a transaction (e.g., cloud infrastructure application/service transaction).
  • a transaction e.g., cloud infrastructure application/service transaction.
  • each log template may be mapped to a vector.
  • FIG. 8 shows example vectors for two transactions. There is a unique vector mapped to each log template.
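  • A minimal sketch of this template-to-vector mapping using the gensim Word2Vec implementation follows; the library choice, hyperparameter values and example data are assumptions. Each transaction's template sequence is treated as a "sentence" and each template ID as a "word":

      from gensim.models import Word2Vec

      # Template-ID sequences, one per transaction (illustrative data).
      transactions = [["T1", "T2", "T3", "T4"],
                      ["T1", "T2", "T5", "T6"]]

      # Train word vectors; the context window predicts surrounding templates.
      model = Word2Vec(sentences=transactions, vector_size=32, window=3,
                       min_count=1, epochs=50)

      vector_t1 = model.wv["T1"]  # the unique vector mapped to template T1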
  • an ML classification model is built.
  • the input for this ML model may be the vectors corresponding to log templates for a transaction.
  • the ML model takes a fixed number N of vectors. If the number of log templates is less than N, then the input is padded with null vectors.
  • the output for the ML model may be a multi-class output where each output corresponds to a fault index. For example, when the fault pattern detector 16 identifies 100 faults, the model will have 100 outputs. Each output has a value between 0 and 1. The output value corresponds to the probability that the input vectors (i.e., the sequence of log templates) correspond to a given fault.
  • FIG. 9 shows an example of the ML model at fault pattern detector 16, where the input takes 4 log vectors as inputs and provides 3 outputs - each output corresponding to the probability of a fault (e.g., fault 1, fault 2 and fault 3).
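  • A minimal Keras sketch of such a classifier follows; the framework, layer sizes and training details are assumptions, while the N=4 inputs and 3 fault outputs follow FIG. 9:

      import numpy as np
      from tensorflow import keras

      N, D, NUM_FAULTS = 4, 32, 3  # templates per input, vector size, faults

      model = keras.Sequential([
          keras.Input(shape=(N, D)),
          keras.layers.Flatten(),
          keras.layers.Dense(64, activation="relu"),
          # One sigmoid output per fault: a 0-1 probability for each fault.
          keras.layers.Dense(NUM_FAULTS, activation="sigmoid"),
      ])
      model.compile(optimizer="adam", loss="binary_crossentropy")

      # A transaction with fewer than N templates is padded with null vectors.
      seq = np.random.rand(2, D)                      # 2 template vectors
      padded = np.vstack([seq, np.zeros((N - 2, D))])
      probs = model.predict(padded[np.newaxis, ...])  # shape (1, NUM_FAULTS)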
  • Embodiment-3: Vectorizing the Log Templates and the Fault Index
  • in addition to vectorizing the log templates, the fault index may also be vectorized.
  • the vectors for the fault index may be created in a manner similar to a technique for generating a paragraph vector.
  • a paragraph vector may be used to predict a next word given many contexts sampled from the paragraph.
  • the contexts are fixed-length and sampled from a sliding window over the paragraph.
  • the paragraph token can be thought of as another word. It may act as a memory that remembers what is missing from the current context - or the topic of the paragraph.
  • the paragraph vector may be shared across all contexts generated from the same paragraph but not across paragraphs.
  • the word vector matrix W may be shared across paragraphs i.e., the vector for “powerful” is the same for all paragraphs.
  • the paragraph vectors and word vectors are trained using stochastic gradient descent and the gradient is obtained via backpropagation.
  • at every step of stochastic gradient descent, a fixed-length context can be sampled from a random paragraph, the error gradient computed from the network, and the gradient used to update the parameters in the model, as shown in FIG. 10, for example.
  • the same or similar technique may be used to create vectors for log templates and fault IDs. For example:
  • the task is to predict the next log template given the sequence of log templates in a transaction.
  • Each fault index is mapped to a vector.
  • an ML regression model may be built (unlike the previous embodiment, where a classification model was built).
  • the regression model is also different from the typical scenario, where a vector for a new paragraph is learned using a gradient descent method while keeping the word vectors and softmax weights fixed.
  • for the ML regression model of the present disclosure:
  • the input includes vectors corresponding to log templates for a transaction.
  • the ML model takes a fixed number N of vectors. If the number of log templates is less than N, then the input is padded with null vectors.
  • the output is a single vector.
  • the output matches (e.g., exact match, or probability or similarity score at least meeting a threshold) the vector for the fault index.
  • FIG. 11 shows an example of the ML regression model which takes 4 log template vectors as input and provides one vector as an output.
  • Well-known vector similarity metrics such as cosine similarity may be used to identify the fault vector closest to the output vector.
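  • A minimal sketch of that nearest-vector lookup follows, using NumPy cosine similarity; the fault vectors shown are illustrative, not learned values from the document:

      import numpy as np

      def cosine(a, b):
          return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

      def closest_fault(output_vector, fault_vectors):
          """Return the fault index whose vector is most similar to the
          regression model's output vector."""
          return max(fault_vectors,
                     key=lambda f: cosine(output_vector, fault_vectors[f]))

      fault_vectors = {"fault-1": np.array([1.0, 0.0]),
                       "fault-2": np.array([0.0, 1.0])}
      print(closest_fault(np.array([0.9, 0.1]), fault_vectors))  # fault-1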
  • Some embodiments of the present disclosure may include at least 3 main operations (which may be performed by computing device 12 including one or more of fault pattern detector 16, template parser 18 and transaction segregation 20): gathering data for success and fault scenarios, extracting log template sequences for a fault condition and matching input log templates for detection of the fault condition. Some embodiments may further include remedial steps for the fault condition.
  • logs are gathered for successful and fault scenarios such as successful VM-create operations.
  • the logs are gathered from all Openstack services such as Nova, Neutron, Cinder, Glance and Keystone.
  • fault scenarios could be related to one or more of, for example:
  • the application logs for each fault condition may be stored separately (e.g., separate directories).
  • a unique identifier is assigned for each fault condition.
  • This operation is run at a low frequency. In some cases, this is performed when a new major release of the software is made available.
  • 1. Log template parser 18 creates log templates from the log lines in various application log files. Logs for both successful and fault scenarios are used to create these log templates. Once the log templates are created, all the log lines in the log files are converted into their templatized form, i.e., for every log file, a new log file is created that contains the template logs along with the context data.
  • 2. Transaction segregator 20 then reads these new log files (with log templates) and groups them based on the transaction identifier (available in the context information). As a result, multiple transactions are available with the sequence of log templates. The sequences are available for successful scenarios and fault scenarios.
  • 3. Fault pattern detector 16 uses the log templates of success and fault scenarios to extract the sequences of log templates that indicate a fault scenario. As a result, an object/model/graph is created that includes log templates for fault scenarios along with the fault identifier.
  • This operation may be run whenever new data (success and fault scenario) is made available. In most cases, this will be run at a same frequency as the data gathering operation.
  • this operation may be continuously performed in a production environment (e.g., live execution).
  • one or more of the following may be performed by computing device 12, template parser 18, transaction segregator 20 and/or fault pattern detector 16:
  • Incoming logs are converted into templatized versions.
  • Log templates for a transaction are matched against the log templates for a fault condition using the fault pattern detector 16.
  • the log template similarity may be given as a similarity score. If, for example, the similarity score is above a predetermined threshold, the transaction is flagged as a fault condition.
  • the associated fault index may also be provided.
  • this operation is performed only for those cases where a fault condition is flagged for a transaction. In this operation, one or more of the following may be performed:
  • Fault Remedy template associated with the fault index is fetched.
  • the context parameters from the log templates are used to create the list of instructions to remedy the fault.
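  • A minimal sketch of filling such a remedy template from log context follows; the template text, parameter names and fault index are all illustrative assumptions:

      # Remedy templates keyed by fault index; '{...}' slots are template
      # variables to be filled from the log context.
      remedy_templates = {
          "fault-1": ("Quota exceeded for tenant {tenant}: increase the quota "
                      "or delete unused instances in project {project}."),
      }

      context = {"tenant": "tenant-42", "project": "demo"}  # from log parameters
      instructions = remedy_templates["fault-1"].format(**context)
      print(instructions)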
  • some embodiments provide for fault detection as well as providing a remedial plan to remedy the fault.
  • Openstack services are deployed as a collection of container services.
  • the whole set of Openstack services may be deployed every couple of months.
  • for Tier-1 operators, such deployments can occur over hundreds of sites for every release.
  • Use of some embodiments of the proposed solution may reduce the deployment time (and effort) for such scenarios by reducing the troubleshooting time.
  • Some embodiments may be implemented in a distributed manner.
  • the transaction identifiers are scoped within each service (such as Nova, Neutron, etc.).
  • one or more of the following components may be run in parallel for each service: template parser 18; and transaction segregator 20.
  • the receiver of the anomaly indication (for example, the operational engineer) then must spend considerable time and effort to understand why a certain sequence of logs is flagged as an anomaly by the ML solution.
  • the analysis may reveal that the anomalous sequence of logs is not a fault condition.
  • the logs may have been generated due to some ad-hoc operational procedure.
  • the additional effort performed by the receiver of the anomalies thus brings no benefit.
  • Some embodiments of the present disclosure may provide for other trade-offs. Some embodiments can be considered to be optimized for high-precision. Some embodiments attempt to ensure that faults flagged are indeed faults. Some embodiments however come at the cost of low recall (e.g., sometimes genuine faults may not be identified as faults).
  • the concepts described herein may be embodied as a method, data processing system, and/or computer program product. Accordingly, the concepts described herein may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects all generally referred to herein as a “circuit” or “module.” Furthermore, the disclosure may take the form of a computer program product on a tangible computer usable storage medium having computer program code embodied in the medium that can be executed by a computer. Any suitable tangible computer readable medium may be utilized including hard disks, CD-ROMs, electronic storage devices, optical storage devices, or magnetic storage devices.
  • These computer program instructions may also be stored in a computer readable memory or storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • Computer program code for carrying out operations of the concepts described herein may be written in an object-oriented programming language such as Java® or C++.
  • the computer program code for carrying out operations of the disclosure may also be written in conventional procedural programming languages, such as the "C" programming language.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer.
  • the remote computer may be connected to the user's computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • LAN local area network
  • WAN wide area network
  • Internet Service Provider for example, AT&T, MCI, Sprint, EarthLink, MSN, GTE, etc.

Abstract

Apparatuses and methods for application fault analysis using machine learning are disclosed. In one embodiment, a method implemented in a computing device to detect a fault in a cloud infrastructure includes receiving at least one execution log resulting from running at least one task on the cloud infrastructure; using a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure; and for at least one log, generating a log template in which at least one value in the at least one log is converted into at least one parameter variable.

Description

APPLICATION FAULT ANALYSIS USING MACHINE LEARNING
TECHNICAL FIELD
The present disclosure relates to fault detection and in particular, apparatuses and methods for application fault analysis using machine learning.
BACKGROUND
Cloud infrastructure provides a rich set of management tasks that operate computing, storage, and networking resources in the cloud. Monitoring the executions of these tasks is crucial for cloud providers to promptly identify and understand problems that compromise cloud availability. However, such monitoring is challenging because there are multiple distributed service components involved in the execution.
Existing solutions, such as Cloud Seer, attempt to address this. Cloud Seer enables workflow monitoring by taking a lightweight, non-intrusive approach that works purely on the interleaved logs widely existing in cloud infrastructures.
Cloud Seer first builds an automaton for the workflow of each management task based on normal executions, and then it checks log messages against a set of automata for workflow divergences in a streaming manner. Divergences found during the checking process indicate potential execution problems, which may or may not be accompanied by error log messages.
For each potential problem, Cloud Seer outputs context information including the affected task automaton and related log messages hinting where the problem occurs to assist with further diagnosis. Cloud Seer uses a process for error detecting that is based on logs that represent correct task execution and then determines whether live logs deviate from correct task execution.
However, such a deviation may not be a real fault. For example, this could occur when there are other workflows that are used in live operation that were not used or envisioned during the training stage. These spurious fault conditions result in additional work for operational staff, including much time spent analyzing all the logs that are flagged as faults. Therefore, existing approaches for fault detection using logs are inefficient and lacking.
SUMMARY
Some embodiments advantageously provide apparatuses and methods for application fault analysis using machine learning.
According to one aspect of the present disclosure, a method implemented in a computing device to detect a fault in a cloud infrastructure is provided. The method includes receiving at least one execution log resulting from running at least one task on the cloud infrastructure; using a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure; and for at least one log, generating a log template in which at least one value in the at least one log is converted into at least one parameter variable.
In some embodiments of this aspect, the method includes when the at least one execution log is determined to match the at least one fault, indicating an error to be addressed. In some embodiments of this aspect, the machine learning model is trained by inducing the plurality of faults and collecting the plurality of logs associated with the faults. In some embodiments of this aspect, the method further includes segregating interleaved log templates into at least two separate sets of log templates, each set being associated with a respective parallel transaction. In some embodiments of this aspect, using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises using the machine learning model to determine whether the at least one execution log is similar to the at least one fault, the similarity being based at least in part on a similarity score.
In some embodiments of this aspect, the similarity score represents a probability that the at least one execution log matches the at least one fault. In some embodiments of this aspect, using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises using the machine learning model to determine a similarity score, the similarity score representing a level of similarity of the at least one execution log to a set of log templates associated with the at least one fault; and when the similarity score at least meets a similarity threshold, determining that the at least one execution log matches the at least one fault and indicating that the at least one execution log corresponds to an error in the cloud infrastructure.
In some embodiments of this aspect, the machine learning model is a graph-based model comprising a graph representing the at least one fault, a node of the graph representing a log template and an edge of the graph representing a subsequent log template in a faulty transaction. In some embodiments of this aspect, the graph comprises a plurality of sub-graphs, each sub-graph representing a corresponding fault. In some embodiments of this aspect, the similarity score is based at least in part on an incoming sequence of the at least one execution log as compared to an incoming sequence for the set of log templates associated with the at least one fault. In some embodiments of this aspect, the similarity score is based at least in part on an incoming sequence, a total number of nodes, a total number of nodes that match a sequence position, a number of nodes that do not match, a number of dissimilar nodes and a position of a last matched node, in the at least one execution log as compared to the set of log templates associated with the at least one fault.
In some embodiments of this aspect, the machine learning model is a language-based model in which each log template is mapped to a corresponding log template vector. In some embodiments of this aspect, for the at least one execution log, the language-based model outputs at least one fault index representing a probability that the at least one execution log matches the at least one fault corresponding to the at least one fault index. In some embodiments of this aspect, for the at least one execution log, the language-based model outputs at least one fault index vector. In some embodiments of this aspect, using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises when the at least one fault index vector is similar to a predetermined fault index vector, determining that the at least one execution log matches the at least one fault, the similarity of the at least one fault index vector to the predetermined fault index vector representing the probability that the at least one execution log matches the at least one fault. In some embodiments of this aspect, the method further includes when the at least one execution log is determined to match the at least one fault, identifying at least one fault remedy associated with the at least one fault.
According to another aspect of the present disclosure, a computing device to detect a fault in a cloud infrastructure is provided. The computing device includes processing circuitry configured to cause the computing device to receive at least one execution log resulting from running at least one task on the cloud infrastructure; use a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure; and for at least one log, generate a log template in which at least one value in the at least one log is converted into at least one parameter variable.
In some embodiments of this aspect, the processing circuitry is configured to cause the computing device to when the at least one execution log is determined to match the at least one fault, indicate an error to be addressed. In some embodiments of this aspect, the machine learning model is trained by inducing the plurality of faults and collecting the plurality of logs associated with the faults. In some embodiments of this aspect, the processing circuitry is configured to cause the computing device to segregate interleaved log templates into at least two separate sets of log templates, each set being associated with a respective parallel transaction. In some embodiments of this aspect, the processing circuitry is configured to cause the computing device to use the machine learning model to determine whether the at least one execution log matches the at least one fault by being configured to cause the computing device to use the machine learning model to determine whether the at least one execution log is similar to the at least one fault, the similarity being based at least in part on a similarity score.
In some embodiments of this aspect, the similarity score represents a probability that the at least one execution log matches the at least one fault. In some embodiments of this aspect, the processing circuitry is configured to cause the computing device to use the machine learning model to determine whether the at least one execution log matches the at least one fault by being configured to cause the computing device to use the machine learning model to determine a similarity score, the similarity score representing a level of similarity of the at least one execution log to a set of log templates associated with the at least one fault; and when the similarity score at least meets a similarity threshold, determine that the at least one execution log matches the at least one fault and indicate that the at least one execution log corresponds to an error in the cloud infrastructure.
In some embodiments of this aspect, the machine learning model is a graph-based model comprising a graph representing the at least one fault, a node of the graph representing a log template and an edge of the graph representing a subsequent log template in a faulty transaction. In some embodiments of this aspect, the graph comprises a plurality of sub-graphs, each sub-graph representing a corresponding fault. In some embodiments of this aspect, the similarity score is based at least in part on an incoming sequence of the at least one execution log as compared to an incoming sequence for the set of log templates associated with the at least one fault. In some embodiments of this aspect, the similarity score is based at least in part on an incoming sequence, a total number of nodes, a total number of nodes that match a sequence position, a number of nodes that do not match, a number of dissimilar nodes and a position of a last matched node, in the at least one execution log as compared to the set of log templates associated with the at least one fault.
In some embodiments of this aspect, the machine learning model is a language-based model in which each log template is mapped to a corresponding log template vector. In some embodiments of this aspect, for the at least one execution log, the language-based model outputs at least one fault index representing a probability that the at least one execution log matches the at least one fault corresponding to the at least one fault index. In some embodiments of this aspect, for the at least one execution log, the language-based model outputs at least one fault index vector. In some embodiments of this aspect, the processing circuitry is configured to cause the computing device to use the machine learning model to determine whether the at least one execution log matches the at least one fault by being configured to cause the computing device to when the at least one fault index vector is similar to a predetermined fault index vector, determine that the at least one execution log matches the at least one fault, the similarity of the at least one fault index vector to the predetermined fault index vector representing the probability that the at least one execution log matches the at least one fault. In some embodiments of this aspect, the processing circuitry is configured to cause the computing device to when the at least one execution log is determined to match the at least one fault, identify at least one fault remedy associated with the at least one fault.
According to another aspect of the present disclosure, a computing device for detecting a fault in a cloud infrastructure is provided. The computing device includes processing circuitry. The processing circuitry is configured to cause the computing device to perform any of the methods described herein.
According to yet another aspect of the present disclosure, an apparatus is provided. The apparatus includes computer instructions executable by at least one processor to perform any of the methods described herein.
BRIEF DESCRIPTION OF THE DRAWINGS
A more complete understanding of the present embodiments, and the attendant advantages and features thereof, will be more readily understood by reference to the following detailed description when considered in conjunction with the accompanying drawings wherein:
FIG. 1 illustrates an example system architecture according to some embodiments of the present disclosure;
FIG. 2 is a flowchart of an example method of detecting a fault according to some embodiments of the present disclosure;
FIG. 3 is a schematic diagram illustrating yet another example system architecture for a training phase according to some embodiments of the present disclosure;
FIG. 4 is a schematic diagram illustrating yet another example system architecture for an inference phase according to some embodiments of the present disclosure;
FIG. 5 is a schematic diagram illustrating an example transaction segregation according to some embodiments of the present disclosure;
FIG. 6 is a schematic diagram illustrating an example merging of graphs according to some embodiments of the present disclosure;
FIG. 7 is a schematic diagram illustrating an example word to vectorization process according to some embodiments of the present disclosure;
FIG. 8 is a schematic diagram illustrating an example of vectors (each vector being on a per template basis) for two transactions according to some embodiments of the present disclosure;
FIG. 9 is a schematic diagram illustrating an example classification model with log vectors according to some embodiments of the present disclosure;
FIG. 10 is a schematic diagram illustrating an example paragraph vector according to some embodiments of the present disclosure; and
FIG. 11 is a schematic diagram illustrating an example regression model with log vectors according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
As discussed above, existing fault detection techniques flag conditions as faults that may not be real faults. For example, this could occur when there are other workflows that are used in live operation that were not used or envisioned during the training stage. In some other cases, ad-hoc temporary procedures/workflows are used. Such ad-hoc workflows again show up as fault conditions. These spurious fault conditions indicated by the existing solutions result in additional work for operational staff, because staff must spend time analyzing all the logs that are being flagged as a fault.
Although existing solutions may indicate execution error logs, they do not provide any indication by which the fault can be corrected.
In some embodiments, training logs are collected for model training. However, unlike existing solutions where training logs are collected for only correct task execution, some embodiments may include one or more of:
1. Identifying multiple fault conditions. These could be, for example:
a. Operational fault conditions such as ‘quota for a tenant is exceeded’;
b. Transient environment fault conditions such as ‘network connectivity to destination is down’; and/or
c. Transient application fault conditions such as ‘MySQL server down’.
2. Next, in a correctly working cloud environment, one fault condition is induced. Logs may then be collected from the various cloud infrastructure components (such as, for example, Nova, Neutron, etc. in the case of an Openstack environment).
3. The above steps may be performed for every other fault, as well. A catalogue may then be created of such application logs and the associated fault condition. This catalogue (of logs and associated condition) may then be used for development and training of a machine learning (ML) model.
4. With each fault, a templatized documentation may also be created for resolving the fault with the assistance of subject matter experts. The template variables in the documentation may be filled based on the particular context from logs.
Once the ML model is developed, a live stream of logs coming from the infrastructure components may be passed to the ML Model. One or more of the following steps may then be taken:
1. The ML model continuously predicts if a collection of live logs and metrics matches against the known set of faults. This match however may not be an exact match. The model may use ML techniques to determine a similarity score of live logs against the known set of fault logs.
2. If the similarity score exceeds a (configured) level of threshold, it is flagged as indicating a fault.
In addition to flagging the fault, one or more of the following may also be performed:
1. Based on the fault, the associated documentation template (to resolve the fault) is selected.
2. Data from the historical logs (such as the last 5 minutes (mins) of the live stream of logs) is extracted to identify the context of the fault. This context may then be used to fill the variables in the documentation template.
3. The updated documentation may then be provided to a user as a sample/way to resolve the problem.
Some embodiments of the present disclosure may include at least two alternative techniques for building the ML model; one is based on a graph-based model and the other on language-based models.
Some embodiments of the present disclosure may include one or more of the following steps:
1. An ML model is trained to identify fault conditions using logs from correct and faulty task executions. This identification of log patterns is not performed manually with the help of a human expert. Instead, it is learned by the ML model based on training data. One reason is that many such systems are asynchronous in nature. As such, it is not easy to write rules for these conditions.
2. The ML model provides a similarity score between 0 and 1 that provides an indication of how similar the current set of log sequences is to a known set of faults. A score is provided against each known fault condition.
3. Graph based models may be used to identify the similarity between current logs and the set of known faults.
4. Alternative solutions include using a language-based ML model to identify the similarity between current logs and the set of known faults.
Some embodiments of the present disclosure provide a set of steps that can be used to mitigate the fault that is identified by the ML model (e.g., when the similarity score is above a predetermined threshold score).
Some embodiments may advantageously provide one or more of the following:
1. Some embodiments provide a solution that may not require access to source code for development of the prediction model.
2. Some embodiments provide a solution in which there is no need for any instrumentation at the application source code level.
3. Some embodiments provide a solution in which predictions are available in a near real time basis.
Some embodiments may advantageously provide one or more of the following:
1. Some embodiments may miss some fault conditions (those which were not in the training set); however, the logs flagged as errors will have high precision (faults identified really are fault conditions). This may ensure that there is no wasted work (or at least a reduction as compared to existing arrangements) by operational staff looking into the logs flagged as faulty.
2. In some embodiments, the logs flagged as faults are accompanied by an explanation as to why certain logs are identified as a fault (e.g., an explanation such as: “This collection of logs is similar (with similarity score of 90%) to the log sequence of known fault XYZ.”).
In addition to identification of the fault (and providing an explanation on why it is considered as fault), some embodiments further provide a set of tentative corrective actions to be applied to correct or mitigate the identified fault.
Before describing in detail exemplary embodiments, it is noted that the embodiments reside primarily in combinations of apparatus components and processing steps related to apparatuses and methods for application fault analysis using machine learning. Accordingly, components have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
As used herein, relational terms, such as “first” and “second,” “top” and “bottom,” and the like, may be used solely to distinguish one entity or element from another entity or element without necessarily requiring or implying any physical or logical relationship or order between such entities or elements. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the concepts described herein. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes” and/or “including” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. In embodiments described herein, the joining term, “in communication with” and the like, may be used to indicate electrical or data communication, which may be accomplished by physical contact, induction, electromagnetic radiation, radio signaling, infrared signaling or optical signaling, for example. One having ordinary skill in the art will appreciate that multiple components may interoperate and modifications and variations are possible of achieving the electrical and data communication.
In some embodiments described herein, the term “coupled,” “connected,” and the like, may be used herein to indicate a connection, although not necessarily directly, and may include wired and/or wireless connections.
In some embodiments, the non-limiting term computing device is used herein and can be any type of computing device capable of implementing one or more of the techniques disclosed herein. For example, the computing device may be a device in or in communication with a cloud system.
A computing device may include physical components, such as processors, allocated processing elements, or other computing hardware, computer memory, communication interfaces, and other supporting computing hardware. The computing device may use dedicated physical components, or the computing device may be implemented as one or more allocated physical components in a cloud environment, such as one or more resources of a datacenter. A computing device may be associated with multiple physical components that may be located either in one location, or may be distributed across multiple locations.
In some embodiments, the term “execution log” may be used to indicate a log record obtained by running cloud infrastructure applications and/or components, e.g., those whose faults are being detected by some embodiments of the present disclosure.
In some embodiments, the term “log template” may be used to indicate a template log in which one or more context values in an actual log may be replaced by one or more parameters (template variables). A log template may be considered a templatized form of a log line.
Note that although some embodiments of the example system disclosed herein may be used to detect anomalies and/or faults in cloud platforms, other systems may also benefit from exploiting the ideas covered within this disclosure. Note further that functions described herein as being performed by a computing device described herein are not limited to performance by a single physical device and, in fact, can be distributed among several physical devices.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring now to the drawing figures, in which like elements are referred to by like reference numerals, there is shown in FIG. 1 a schematic diagram of the communication system 10, according to one embodiment, constructed in accordance with the principles of the present disclosure. The communication system 10 in FIG. 1 is a non-limiting example and other embodiments of the present disclosure may be implemented by one or more other systems and/or networks. FIG. 1 presents an overview of the different components in one embodiment of the present disclosure using data from log records in e.g., a cloud infrastructure. The system 10 includes a computing device 12 and a cloud infrastructure 14. The computing device 12 is shown including a fault pattern detector 16, a template parser 18 and a transaction segregation 20.
Although the example system 10 shown in FIG. 1 includes a single computing device 12 including the fault pattern detector 16, the template parser 18 and the transaction segregation 20, it should be understood that, in some embodiments, one or more of the fault pattern detector 16, the template parser 18 and the transaction segregation 20 may be included in separate computing devices 12.
The computing device 12, such as via fault pattern detector 16, the template parser 18 and the transaction segregation 20, is configured to receive at least one execution log resulting from running at least one task on the cloud infrastructure; use a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure; and for at least one log, generate a log template in which at least one value in the at least one log is converted into at least one parameter variable.
It should be understood that the system 10 may include numerous devices of those shown in FIG. 1, as well as additional devices not shown in FIG. 1. In addition, the system 10 may include many more connections/interfaces than those shown in FIG. 1.
Example implementations, in accordance with some embodiments, of computing device 12 are described with reference to FIG. 1.
The computing device 12 includes a communication interface 22, processing circuitry 24, and memory 26. The communication interface 22 may be configured to communicate with any of the elements of the system 10 according to some embodiments of the present disclosure. In some embodiments, the communication interface 22 may be formed as or may include, for example, one or more radio frequency (RF) transmitters, one or more RF receivers, and/or one or more RF transceivers, and/or may be considered a radio interface. In some embodiments, the communication interface 22 may also include a wired interface.
The processing circuitry 24 may include one or more processors 28 and memory, such as, the memory 26. In particular, in addition to a traditional processor and memory, the processing circuitry 24 may comprise integrated circuitry for processing and/or control, e.g., one or more processors and/or processor cores and/or FPGAs (Field Programmable Gate Array) and/or ASICs (Application Specific Integrated Circuitry) adapted to execute instructions. The processor 28 may be configured to access (e.g., write to and/or read from) the memory 26, which may comprise any kind of volatile and/or nonvolatile memory, e.g., cache and/or buffer memory and/or RAM (Random Access Memory) and/or ROM (Read-Only Memory) and/or optical memory and/or EPROM (Erasable Programmable Read-Only Memory).
Thus, the computing device 12 may further include software stored internally in, for example, memory 26, or stored in external memory (e.g., database) accessible by the computing device 12 via an external connection. The software may be executable by the processing circuitry 24. The processing circuitry 24 may be configured to control any of the methods and/or processes described herein and/or to cause such methods, and/or processes to be performed by, e.g., computing device 12, the fault pattern detector 16, the template parser 18 and the transaction segregation 20. The memory 26 is configured to store data, programmatic software code and/or other information described herein. In some embodiments, the software may include instructions stored in memory 26 that, when executed by the processor 28, the fault pattern detector 16, the template parser 18 and the transaction segregation 20 causes the processing circuitry 24 and/or configures the computing device 12 to perform the processes described herein with respect to the computing device 12 (e.g., processes described with reference to FIG. 2 and/or any of the other figures).
In FIG. 1, the connection between the computing device 12 and cloud infrastructure 14 are shown without explicit reference to any intermediary devices or connections. However, it should be understood that intermediary devices and/or connections may exist between these devices, although not explicitly shown.
Although FIG. 1 shows the fault pattern detector 16, the template parser 18 and the transaction segregation 20, as being within a respective processor, it is contemplated that these elements may be implemented such that a portion of the elements is stored in a corresponding memory within the processing circuitry. In other words, the elements may be implemented in hardware or in a combination of hardware and software within the processing circuitry. The elements may likewise be distributed among multiple computing devices.
FIG. 2 is a flowchart of an example process in a computing device 12 for e.g., detecting a fault according to some embodiments of the present disclosure. One or more Blocks and/or functions and/or methods performed by the computing device 12 may be performed by one or more elements of computing device 12 such as by the fault pattern detector 16, the template parser 18, the transaction segregation 20 in processing circuitry 24, memory 26, processor 28, communication interface 22, etc. according to the example process/method. The example method includes receiving (Block S100), such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, at least one execution log resulting from running at least one task on the cloud infrastructure. The method includes using (Block S102), such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure. The method includes for at least one log, generating (Block S104), such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, a log template in which at least one value in the at least one log is converted into at least one parameter variable.
In some embodiments, the method includes when the at least one execution log is determined to match the at least one fault, indicating, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, an error to be addressed. In some embodiments, the machine learning model is trained by inducing the plurality of faults and collecting the plurality of logs associated with the faults. In some embodiments, the method further includes segregating, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, interleaved log templates into at least two separate sets of log templates, each set being associated with a respective parallel transaction. In some embodiments, using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises using, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, the machine learning model to determine whether the at least one execution log is similar to the at least one fault, the similarity being based at least in part on a similarity score.
In some embodiments, the similarity score represents a probability that the at least one execution log matches the at least one fault. In some embodiments, using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises using, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, the machine learning model to determine a similarity score, the similarity score representing a level of similarity of the at least one execution log to a set of log templates associated with the at least one fault. In some embodiments, when the similarity score at least meets a similarity threshold, determining, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, that the at least one execution log matches the at least one fault and indicating that the at least one execution log corresponds to an error in the cloud infrastructure.
In some embodiments, the machine learning model is a graph-based model comprising a graph representing the at least one fault, a node of the graph representing a log template and an edge of the graph representing a subsequent log template in a faulty transaction. In some embodiments, the graph comprises a plurality of sub-graphs, each sub-graph representing a corresponding fault. In some embodiments, the similarity score is based at least in part on an incoming sequence of the at least one execution log as compared to an incoming sequence for the set of log templates associated with the at least one fault. In some embodiments, the similarity score is based at least in part on an incoming sequence, a total number of nodes, a total number of nodes that match a sequence position, a number of nodes that do not match, a number of dissimilar nodes and a position of a last matched node, in the at least one execution log as compared to the set of log templates associated with the at least one fault.
In some embodiments, the machine learning model is a language-based model in which each log template is mapped to a corresponding log template vector. In some embodiments, for the at least one execution log, the language-based model outputs at least one fault index representing a probability that the at least one execution log matches the at least one fault corresponding to the at least one fault index. In some embodiments, for the at least one execution log, the language-based model outputs at least one fault index vector. In some embodiments, using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises when the at least one fault index vector is similar to a predetermined fault index vector, determining, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, that the at least one execution log matches the at least one fault, the similarity of the at least one fault index vector to the predetermined fault index vector representing the probability that the at least one execution log matches the at least one fault.
In some embodiments, the method further includes when the at least one execution log is determined to match the at least one fault, identifying, such as via fault pattern detector 16, template parser 18, transaction segregation 20 in processing circuitry 24, memory 26, processor 28 and/or communication interface 22, at least one fault remedy associated with the at least one fault.
Having generally described arrangements for application fault analysis using machine learning, a more detailed description of some of the embodiments is provided as follows with reference to FIGS. 3-11, which may be implemented by any one or more of computing device 12, the fault pattern detector 16, the template parser 18 and the transaction segregation 20.
FIG. 3 illustrates an example system 10 including fault pattern detector 16, template parser 18 and transaction segregation 20, which may be used during an ML training phase. A database of success logs 30 and failed logs 32 may be input into the template parser 18. The template parser 18 may convert the logs into log templates, as discussed in more detail below. The transaction segregator 20 may segregate interleaved log templates for multiple parallel transactions, as discussed in more detail below. The ML model may be trained at fault pattern detector 16. The fault pattern detector 16 may provide detector output, which is then compared against the expected output in order to train the detector 16. Arrows in FIG. 3 indicate the flow of data; however, the actual sequence of operations may be carried out independently and/or in parallel, in some embodiments. For example, while the transaction segregator 20 is segregating transactions, more logs can be given as input to the template parser 18.
FIG. 4 illustrates an example system 10 including fault pattern detector 16, template parser 18 and transaction segregation 20, which may be used during an inference phase. Live logs 34 are provided to the template parser 18, which converts the live logs into corresponding log templates. The transaction segregator 20 then segregates interleaved log templates for parallel transactions. The fault pattern detector 16 then detects and/or predicts whether the live logs indicate a known fault. In some embodiments, the fault pattern detector 16 outputs similarity scores indicating a probability that the live logs correspond to one or more known faults.
Some embodiments of the present disclosure may be described below in the context of Openstack, a free, open-standard cloud computing platform. It is deployed as Infrastructure-as-a-Service, where virtual servers are made available to users. It should be understood that Openstack is merely exemplary and some embodiments of the present disclosure may be implemented on other types of cloud computing platforms.
Some embodiments of the present disclosure may be described below in the context of Ericsson Network Function Virtualization Infrastructure (NFVi), which may be based on Openstack. Ericsson NFVi provides virtualization-related infrastructure to host VNFs (virtual network functions, such as a virtual Evolved Packet Gateway or vEPG).
Openstack (and Ericsson NFVi) solutions include multiple loosely coupled web services. These include:
• Nova - provides a way to provision compute instances (i.e., virtual servers/machines);
• Neutron - provides connectivity-as-a-service between interfaces and devices (such as virtual network interface cards (NICs) that are attached to virtual machines);
• Cinder - provides block storage services to nova virtual machines;
• Glance - provides services for management of VM images; and
• Keystone - provides client authentication and distributed tenant authorization.
One of the services by Ericsson NFVi is the creation of virtual machines (based on user instructions). Some embodiments of the present disclosure are explained below in terms of example user scenarios.
Solution Modules/Units
Some embodiments of the present disclosure may include one or more of the following modules:
Data Directory for Successful Scenarios
Some embodiments include a data directory including application logs for successful scenarios, which may be stored in a database of success logs 30. As an example, this directory may include logs from the Nova, Neutron, Cinder, Glance and Keystone applications when a VM is created successfully.
Data Directory for Failed Scenarios
Some embodiments include a data directory including log files for fault scenarios, which may be stored in a database of failed logs 32. For each fault, a new directory may be created.
For example, when a VM creation fails due to ‘MySQL service being down’, a fault directory may be created, which may be called fault-1, for example. This fault directory includes logs from all services, such as Nova, Neutron, Cinder, Glance and Keystone.
There are multiple runs performed for the same fault condition. This is to ensure that temporal relations between events are adequately captured. A separate directory may be created for each run.
Similarly, when a VM creation fails due to ‘cinder service being unavailable’, a fault directory fault-2 may be created. This directory may include logs from all services such as Nova, Neutron, Cinder, Glance, Keystone (including logs from the failed service, e.g., cinder service).
In the example scenario, the overall directory structure may be:
/home/test/fault-1/run-1/nova.log
/home/test/fault-1/run-1/neutron.log
/home/test/fault-1/run-2/nova.log
/home/test/fault-1/run-2/neutron.log
/home/test/fault-2/run-1/nova.log
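By way of a non-limiting illustration, the following Python sketch shows one way the above directory layout might be loaded into a training catalogue; the root path, helper name and use of pathlib are assumptions of this sketch rather than part of the disclosed embodiments.

from pathlib import Path
from collections import defaultdict

def load_fault_catalogue(root="/home/test"):
    # Group raw log files by fault identifier and run identifier,
    # following the fault-*/run-*/<service>.log layout shown above.
    catalogue = defaultdict(lambda: defaultdict(list))
    for log_file in Path(root).glob("fault-*/run-*/*.log"):
        fault_id = log_file.parts[-3]  # e.g., "fault-1"
        run_id = log_file.parts[-2]    # e.g., "run-1"
        catalogue[fault_id][run_id].append(log_file)
    return catalogue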
Log Template Parser
Some embodiments include a log template parser 18 configured to parse the lines in log files into log templates and context. This can be performed using, for example, the technique noted in a DeepLog solution such as described in the following reference (https://www.cs.utah.edu/~lifeifei/papers/deeplog.pdf).
For example, the log template T for a log entry such as:
“Took 10 seconds to build VM instance.” is
“Took * seconds to build VM instance.”
For this example log entry line, the context is the value <10> and the asterisk in the corresponding log template T represents a parameter. Parameters are abstracted as asterisks in a log template. The parameter values reflect the underlying system state, particular context and performance status. Although the example shows one parameter, a log entry may have multiple parameters.
Values of certain parameters may serve as identifiers for an execution sequence, such as instance_id in an OpenStack logs.
The log template parser 18 may be configured to receive input from the data directories discussed above for successful and fault scenarios. The log template parser 18 converts those files including log entries into files including log templates.
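By way of a non-limiting illustration (and not the DeepLog technique itself, which learns templates from the corpus), a simple regex-based templatization might look as follows; the pattern and function name are assumptions of this sketch.

import re

# Replace numeric values with "*"; a learned parser such as DeepLog would
# derive templates from the log corpus rather than from a fixed pattern.
VALUE_PATTERN = re.compile(r"\b\d+\b")

def to_template(log_line):
    # Returns (template, context): the templatized line and the values
    # abstracted out of it.
    context = VALUE_PATTERN.findall(log_line)
    template = VALUE_PATTERN.sub("*", log_line)
    return template, context

# to_template("Took 10 seconds to build VM instance.")
# -> ("Took * seconds to build VM instance.", ["10"])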
Transaction Segregator
Openstack services such as Nova can process multiple transactions in parallel. As a result, the Nova logs from multiple transactions may be interleaved. Each log line may include an identifier for the transaction. The transaction segregator may use such identifiers to create a logical sequence of log templates for each of multiple ongoing transactions. FIG. 5 shows the way the transaction segregator may segregate interleaved templates for two transactions.
The notation in FIG. 5 uses T1, T2, etc. to represent the various log templates and Id1, Id2 to represent the unique identifiers for ongoing transactions. T1-Id1 then represents template T1 for the transaction with identifier Id1. The left-hand side shows an example of interleaved log templates for two transactions, Id1 and Id2. The right-hand side shows the log templates being separated for each transaction by the transaction segregator 20.
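A minimal segregation step, assuming each parsed line has already been reduced to a (transaction identifier, template) pair, might look as follows; the data shapes and names are assumptions of this sketch.

from collections import defaultdict

def segregate(parsed_lines):
    # Group an interleaved stream of (transaction id, template) pairs into
    # per-transaction template sequences, preserving arrival order.
    sequences = defaultdict(list)
    for txn_id, template in parsed_lines:
        sequences[txn_id].append(template)
    return sequences

# Interleaved input as in FIG. 5:
# segregate([("Id1", "T1"), ("Id2", "T1"), ("Id1", "T2"), ("Id2", "T2")])
# -> {"Id1": ["T1", "T2"], "Id2": ["T1", "T2"]}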
Fault Pattern Detector
Some embodiments include the fault pattern detector 16, which matches input log sequences against known log fault sequences. At least three embodiments are described below.
Embodiment-1: Graph-Based Models
In this embodiment, the model at fault pattern detector 16 learns/builds an individual graph for each fault condition. It uses the log templates and their sequence in the success scenarios and fault scenarios to create such a graph.
The nodes of the graph represent a log template. The edges in the graph represent the next log template in the faulty transaction.
In order to reduce computation overhead, graphs for fault conditions can be combined. The graph uses a special marker in the node to identify if part of a sub-graph in the combined graph corresponds to a sequence for a fault scenario. FIG. 6 illustrates an example of this merging scenario and the use of a special marker (shown as a star). The example graph merging depicted in FIG. 6 allows for merging log template sequences that have two patterns:
• Proper subset, such as Fault-1 being T1 -> T2 -> T3 -> T4 and Fault-2 being T1 -> T2 -> T3; and
• Partial subset, such as Fault-1 being T1 -> T2 -> T3 -> T4 and Fault-3 being T1 -> T20.
Incoming transactions are matched against these graphs. In the case of an incoming log template sequence matching the known fault sequence, the fault pattern detector 16 flags it as a known fault scenario.
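One possible encoding of such a combined fault graph, with the special marker realized as a per-node fault label, is sketched below; the class and method names are assumptions of this sketch rather than a definitive implementation.

class FaultGraph:
    def __init__(self):
        self.edges = {}       # template -> set of possible next templates
        self.fault_ends = {}  # template -> fault id whose sequence ends
                              # there (the starred nodes in FIG. 6)

    def add_fault(self, fault_id, sequence):
        for cur, nxt in zip(sequence, sequence[1:]):
            self.edges.setdefault(cur, set()).add(nxt)
        self.fault_ends[sequence[-1]] = fault_id

    def match(self, sequence):
        # Exact-match walk; similarity matching is described below.
        for cur, nxt in zip(sequence, sequence[1:]):
            if nxt not in self.edges.get(cur, set()):
                return None
        return self.fault_ends.get(sequence[-1])

graph = FaultGraph()
graph.add_fault("fault-1", ["T1", "T2", "T3", "T4"])
graph.add_fault("fault-2", ["T1", "T2", "T3"])  # proper subset of fault-1
graph.add_fault("fault-3", ["T1", "T20"])       # partial subset
# graph.match(["T1", "T2", "T3"]) -> "fault-2"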
Similarity Match
In some cases, the sequence of input may differ slightly from the log templates identified for a fault sequence. For example, consider a fault sequence represented as:
T1 -> T2 -> T3 -> T4 -> T5.
The incoming log template sequence (from the live logs) for a transaction can be:
T1 -> T2 -> T4 -> T3 -> T5.
As can be seen, the order of T3 and T4 is flipped. In order to accommodate such cases, the fault pattern detector 16 may be configured to use a similarity score (e.g., between 0 and 1) to identify how similar the incoming log template sequence is to a fault sequence. In some embodiments, the similarity score may be considered to represent a probability that the incoming log template sequence is a fault sequence (instead of a binary yes or no to an exact match). The score may be based on one or more of the following (a scoring sketch follows this list):
• a total number of nodes in the fault sequence (e.g., 5 in the above scenario);
• a total number of nodes that exactly match their sequence position (e.g., 3 in the above scenario);
• a count of dissimilar nodes (e.g., 0 above, since all of T1-T5 are present in the fault sequence as well as in the live log template sequence); and
• a position of the last matched node (counting in reverse from the end). In the context of the above example, this would be 0, since the last node, T5, is the same in the execution log and the fault log pattern.
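The sketch below combines the factors above into a single 0-to-1 score; the particular penalty weights are assumptions of this sketch, as no fixed formula is mandated.

def similarity_score(live_seq, fault_seq):
    # Fraction of positions that match exactly, penalized by dissimilar
    # nodes and by a mismatching last node; clamped to [0, 1].
    total = len(fault_seq)
    positional = sum(1 for a, b in zip(live_seq, fault_seq) if a == b)
    dissimilar = len(set(live_seq) ^ set(fault_seq))
    tail_mismatch = 0 if live_seq[-1:] == fault_seq[-1:] else 1
    score = positional / total - 0.1 * dissimilar - 0.1 * tail_mismatch
    return max(0.0, min(1.0, score))

# similarity_score(["T1", "T2", "T4", "T3", "T5"],
#                  ["T1", "T2", "T3", "T4", "T5"]) -> 0.6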
Embodiment-2: Vectorizing the Log Templates
In this embodiment, a vector (or embedding) is created for each log template. These vectors are created using a word-to-vector technique in which each word is mapped to a unique vector. These vectors are initialized with random values. A task is set up to predict a word given the other words in a context. The concatenation or sum of the vectors is then used as features for prediction of the next word in a sentence.
FIG. 7 illustrates an example of this embodiment for an example phrase, ‘the cat sat on’. In the example, the context of three words (‘the’, ‘cat’, ‘sat’) is used to predict the fourth word (‘on’).
The neural network-based word vectors are trained using stochastic gradient descent, where the gradient is obtained via backpropagation. These types of models are commonly known as neural language models.
Some embodiments of the present disclosure may use the technique to create vectors for log templates. In this case however, the fault pattern detector 16 is configured to predict the next log template given the sequence of log templates in a transaction (e.g., cloud infrastructure application/service transaction). At the end of training the ML model at fault pattern detector 16, each log template may be mapped to a vector. FIG. 8 shows example vectors for two transactions. There is a unique vector mapped to each log template. In some embodiments, once vectors are created for these log templates, an ML classification model is built.
• The input for this ML model may be the vectors corresponding to log templates for a transaction. The ML model takes a fixed number N of vectors. If the number of log templates is less than N, then the input is padded with null vectors.
• The output for the ML model may be a multi-class output where each output corresponds to a fault index. For example, when the fault pattern detector 16 identifies 100 faults, the model will have 100 outputs. Each output has a value between 0 and 1. The output value corresponds to the probability that the input vectors (i.e., the sequence of log templates) correspond to a given fault.
FIG. 9 shows an example of the ML model at fault pattern detector 16, where the input takes 4 log vectors as inputs and provides 3 outputs, each output corresponding to the probability of a fault (e.g., fault 1, fault 2 and fault 3).
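By way of a non-limiting illustration, template vectors may be learned with an off-the-shelf word-to-vector implementation by treating each transaction's template sequence as a sentence; the use of the gensim library (4.x API), the dimensions and the padding scheme are assumptions of this sketch, and any multi-class classifier could consume the resulting input.

import numpy as np
from gensim.models import Word2Vec

# Each transaction's template sequence plays the role of a sentence.
transactions = [["T1", "T2", "T3", "T4"], ["T1", "T2", "T3"], ["T1", "T20"]]
w2v = Word2Vec(transactions, vector_size=32, window=3, min_count=1)

N = 4  # fixed number of template vectors accepted by the classifier

def to_model_input(templates):
    # Stack the template vectors and pad with null (zero) vectors up to N.
    vectors = [w2v.wv[t] for t in templates[:N]]
    vectors += [np.zeros(w2v.vector_size)] * (N - len(vectors))
    return np.stack(vectors)  # shape (N, 32); the classifier then emits
                              # one probability per known fault index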
Embodiment-3: Vectorizing the Log Templates and the Fault Index
In some embodiments, in addition to vectorizing the log templates, the fault index may also be vectorized. The vectors for the fault index may be created in a manner similar to a technique for generating a paragraph vector.
As shown in FIG. 10, a paragraph vector may be used to predict a next word given many contexts sampled from the paragraph. The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph token can be thought of as another word. It may act as a memory that remembers what is missing from the current context - or the topic of the paragraph.
The paragraph vector may be shared across all contexts generated from the same paragraph but not across paragraphs. The word vector matrix W, however, may be shared across paragraphs, i.e., the vector for "powerful" is the same for all paragraphs.
The paragraph vectors and word vectors are trained using stochastic gradient descent and the gradient is obtained via backpropagation. At every step of a stochastic gradient descent, a fixed-length context can be sampled from a random paragraph, the error gradient can be computed from the network and used to update the parameters in the model, as shown in FIG. 10 for example. The same or similar technique may be used to create vectors for log templates and fault IDs. For example:
• paragraph is replaced by a fault index;
• (as in the previous embodiment) words are replaced by log templates;
• (as in the previous alternative), the task is to predict the next log template given the sequence of log templates in a transaction.
In some embodiments, at the end of training:
• Each log template is mapped to a vector; and
• Each fault index is mapped to a vector.
Once the vectors are created for these log templates, an ML regression model may be built (unlike the previous embodiment, where a classification model was built).
Use of a regression model is also different from the typical scenario, where a vector for a new paragraph is learned using a gradient descent method while keeping the word vectors and softmax weights fixed. For the ML regression model of the present disclosure:
• The input includes vectors corresponding to log templates for a transaction. The ML model takes a fixed number N of vectors. If the number of log templates is less than N, then the input is padded with null vectors.
• The output is a single vector. When the input log templates match the training data, the output matches (e.g., exact match, or probability or similarity score at least meeting a threshold) the vector for the fault index.
FIG. 11 shows an example of the ML regression model which takes 4 log template vectors as input and provides one vector as an output. Well-known vector similarity metrics such as cosine similarity may be used to identify the fault vector closest to the output vector.
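A minimal matching step for this regression variant, assuming a dictionary of learned fault index vectors, might be:

import numpy as np

def closest_fault(output_vec, fault_vectors):
    # Return (fault id, cosine similarity) for the learned fault index
    # vector closest to the regression model's output vector.
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b)))
    best_id = max(fault_vectors,
                  key=lambda fid: cosine(output_vec, fault_vectors[fid]))
    return best_id, cosine(output_vec, fault_vectors[best_id])

# fault_vectors: {"fault-1": vec1, "fault-2": vec2, ...} learned in training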
Solution Operations
Some embodiments of the present disclosure may include at least 3 main operations (which may be performed by computing device 12 including one or more of fault pattern detector 16, template parser 18 and transaction segregation 20): gathering data for success and fault scenarios, extracting log template sequences for a fault condition and matching input log templates for detection of the fault condition. Some embodiments may further include remedial steps for the fault condition.
Gathering Data for Success and Fault Scenarios
In this operation, logs are gathered for successful and fault scenarios such as successful VM-create operations. The logs are gathered from all Openstack services such as Nova, Neutron, Cinder, Glance and Keystone.
Even for a single operation such as VM-create, there can be multiple ways in which the VM can be created. For each such operation, logs are gathered and stored in a directory for successful scenarios.
In addition to successful scenarios, logs are also generated for fault scenarios. These fault scenarios could be related to one or more of, for example:
• Environment conditions such as ‘network connectivity down’;
• Application fault condition such as ‘MySQL database service is down’; and
• Operational conditions such as ‘CPU quota for tenant has been exceeded’.
The application logs for each fault condition may be stored separately (e.g., separate directories). A unique identifier is assigned for each fault condition.
This operation is run at a low frequency. In some cases, it is performed when a new major release of the software becomes available.
Extracting Log Template Sequence for Fault Condition
In this operation, the following steps may be performed by e.g., computing device 12, template parser 18:
1. Log template parser 18 creates log templates from the log lines in various application log files. Logs for both successful and fault scenarios are used to create these log templates. Once the log templates are created, all the log lines in the log files are converted into their templatized form, i.e., for every log file, a new log file is created that contains the template logs along with the context data.
2. Transaction segregator 20 then reads these new log files (with log templates) and groups them based on the transaction identifier (available in the context information). As a result, multiple transactions are available with the sequence of log templates. The sequences are available for successful scenarios and fault scenarios. 3. Fault pattern detector 16 uses the log templates of success and fault scenarios to extract the sequence of log template that indicate a fault scenario. As a result, an object/model/ graph is created that includes log templates for fault scenarios along with the fault identifier.
This operation may be run whenever new data (success and fault scenarios) is made available. In most cases, it will be run at the same frequency as the data gathering operation.
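The following Python sketch illustrates steps 1 and 2 under simplifying assumptions: simple regular expressions stand in for template parser 18 (real template miners are more sophisticated), values such as UUIDs and IP addresses become parameter variables, and a `req-` style identifier is assumed to carry the transaction id in the context data.

```python
import re
from collections import defaultdict

TXN_RE = re.compile(r"req-[0-9a-f-]+")  # assumed transaction-id format
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
IP_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def templatize(line):
    """Step 1: convert a log line into its templatized form by replacing
    concrete values with parameter variables."""
    line = UUID_RE.sub("<uuid>", line)
    return IP_RE.sub("<ip>", line)


def group_by_transaction(lines):
    """Step 2: group templatized log lines by transaction identifier."""
    transactions = defaultdict(list)
    for line in lines:
        match = TXN_RE.search(line)
        if match:
            transactions[match.group()].append(templatize(line))
    return transactions
```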
Matching Input Log Templates for Detection of Fault Condition
In some embodiments, this operation may be continuously performed in a production environment (e.g., live execution). In this operation, one or more of the following may be performed by computing device 12, template parser 18, transaction segregator 20 and/or fault pattern detector 16:
1. Incoming logs are converted into templatized versions.
2. Log templates related to a transaction are grouped together.
3. Log templates for a transaction are matched against the log templates for a fault condition using fault pattern detector 16. The log template similarity may be given as a similarity score. If, for example, the similarity score is above a predetermined threshold, the transaction is flagged as a fault condition. The associated fault index may also be provided.
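A rough sketch of this matching step, using a generic sequence-similarity ratio as a stand-in for the scoring performed by fault pattern detector 16 (the threshold value and data structures are assumptions):

```python
from difflib import SequenceMatcher

FAULT_THRESHOLD = 0.8  # assumed predetermined threshold


def match_fault(txn_templates, fault_patterns):
    """Compare a transaction's log template sequence against each known
    fault sequence; flag the best-scoring fault if its similarity score
    is above the threshold, otherwise report no fault."""
    best_index, best_score = None, 0.0
    for fault_index, fault_sequence in fault_patterns.items():
        score = SequenceMatcher(None, txn_templates, fault_sequence).ratio()
        if score > best_score:
            best_index, best_score = fault_index, score
    if best_score >= FAULT_THRESHOLD:
        return best_index, best_score  # transaction flagged as fault condition
    return None, best_score
```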
Remedial Steps for Fault Condition
In some embodiments, this operation is performed only for those cases where a fault condition is flagged for a transaction. In this operation, one or more of the following may be performed:
1. The fault remedy template associated with the fault index is fetched.
2. The context parameters from the log templates are used to create the list of instructions to remedy the fault.
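For illustration only, a fault remedy template may be a parameterized list of instructions keyed by fault index; the indices, commands, and context keys below are hypothetical:

```python
# Hypothetical remedy templates; placeholders are filled from the context
# parameters extracted from the transaction's log templates.
REMEDY_TEMPLATES = {
    "FAULT_DB_DOWN": [
        "ssh {db_host}",
        "systemctl restart mysql",
    ],
    "FAULT_QUOTA_EXCEEDED": [
        "openstack quota set --cores {new_quota} {tenant_id}",
    ],
}


def remedy_plan(fault_index, context):
    """Fetch the remedy template for the fault index and fill it with
    context parameters to produce the list of remedial instructions."""
    return [step.format(**context) for step in REMEDY_TEMPLATES[fault_index]]
```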
Thus, advantageously some embodiments provide for fault detection as well as providing a remedial plan to remedy the fault.
Alternative Use Cases
Although some embodiments of the present disclosure are explained in terms of execution of Openstack services (such as VM creation, VM migration, etc.), the embodiments of the present disclosure can be used in the deployment of Openstack Cloud as well.
In many cases, such as Airship, Openstack services are deployed as a collection of container services. In such containerized cases, the whole set of Openstack services may be redeployed every couple of months. For large Tier-1 operators, such deployments can occur over hundreds of sites for every release.
Use of some embodiments of the proposed solution may reduce the deployment time (and effort) for such scenarios by reducing the troubleshooting time.
Apart from deployment of Openstack as containers, virtual network functions (VNFs) may also be deployed as containers. Thus, some embodiments of the proposed solution can be used in such scenarios as well to reduce the deployment time (by reducing the troubleshooting time).
Some embodiments may be implemented in a distributed manner.
For example, for the cloud operations use case, the transaction identifiers are scoped within each service (such as Nova, Neutron, etc.). As a result, one or more of the following components may be run in parallel for each service: template parser 18 and transaction segregator 20.
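For illustration, the sketch below runs the templatize-and-segregate stage independently per service, reusing the hypothetical group_by_transaction helper from the earlier sketch; the log file paths are likewise assumptions:

```python
from concurrent.futures import ProcessPoolExecutor

SERVICES = ["nova", "neutron", "cinder", "glance", "keystone"]


def process_service(service):
    """Templatize and segregate one service's log. Transaction identifiers
    are scoped per service, so each service can be handled independently."""
    with open(f"/var/log/{service}/{service}.log") as fh:  # assumed path
        return service, group_by_transaction(fh.read().splitlines())


if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        per_service_transactions = dict(pool.map(process_service, SERVICES))
```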
Currently, there are multiple ML solutions available that perform anomaly detection based on application logs. In all such cases, the ML solution creates a baseline for normal operations. Any deviation from normal is then considered an anomaly.
In all such cases, the receiver of the anomaly indication (for example, the operational engineer) must then spend considerable time and effort to understand why a certain sequence of logs was flagged as an anomaly by the ML solution. In many cases, the analysis may reveal that the anomalous sequence of logs is not a fault condition. For example, the logs may have been generated by some ad-hoc operational procedure. In all such cases, the additional effort performed by the receiver of the anomalies brings no benefit.
Also, in cases where the anomaly does indeed turn out to be a fault, the operational engineer spends considerable time attempting to identify the corrective set of actions. Existing solutions can in some sense be considered optimized for high recall: they aim to ensure that no fault condition is missed. This however comes at the cost of low precision (faults identified by such solutions may not be actual faults).
Some embodiments of the present disclosure may provide for other trade-offs. Some embodiments can be considered optimized for high precision: they attempt to ensure that faults flagged are indeed faults. This however comes at the cost of lower recall (e.g., sometimes genuine faults may not be identified as faults).
This change in trade-off however comes with a large benefit to the operational engineers. With some embodiments of the proposed solution, there is an overall reduction in effort for troubleshooting, as compared to existing/current solutions.
As will be appreciated by one of skill in the art, the concepts described herein may be embodied as a method, data processing system, and/or computer program product. Accordingly, the concepts described herein may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects all generally referred to herein as a “circuit” or “module.” Furthermore, the disclosure may take the form of a computer program product on a tangible computer usable storage medium having computer program code embodied in the medium that can be executed by a computer. Any suitable tangible computer readable medium may be utilized including hard disks, CD-ROMs, electronic storage devices, optical storage devices, or magnetic storage devices.
Some embodiments are described herein with reference to flowchart illustrations and/or block diagrams of methods, systems and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable memory or storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
It is to be understood that the functions/acts noted in the blocks may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.
Computer program code for carrying out operations of the concepts described herein may be written in an object oriented programming language such as Java® or C++. However, the computer program code for carrying out operations of the disclosure may also be written in conventional procedural programming languages, such as the "C" programming language. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer. In the latter scenario, the remote computer may be connected to the user's computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Many different embodiments have been disclosed herein, in connection with the above description and the drawings. It will be understood that it would be unduly repetitious and obfuscating to literally describe and illustrate every combination and subcombination of these embodiments. Accordingly, all embodiments can be combined in any way and/or combination, and the present specification, including the drawings, shall be construed to constitute a complete written description of all combinations and subcombinations of the embodiments described herein, and of the manner and process of making and using them, and shall support claims to any such combination or subcombination.
It will be appreciated by persons skilled in the art that the embodiments described herein are not limited to what has been particularly shown and described herein above. In addition, unless mention was made above to the contrary, it should be noted that all of the accompanying drawings are not to scale. A variety of modifications and variations are possible in light of the above teachings without departing from the scope of the following claims.

Claims

1. A method implemented in a computing device (12) to detect a fault in a cloud infrastructure, the method comprising: receiving (S100) at least one execution log resulting from running at least one task on the cloud infrastructure; using (S102) a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure; and for at least one log, generating (S104) a log template in which at least one value in the at least one log is converted into at least one parameter variable.
2. The method of Claim 1, further comprising: when the at least one execution log is determined to match the at least one fault, indicating an error to be addressed.
3. The method of any one of Claims 1 and 2, wherein the machine learning model is trained by the plurality of faults and collecting the plurality of logs associated with the faults.
4. The method of any one of Claims 1-3, further comprising: segregating interleaved log templates into at least two separate sets of log templates, each set being associated with a respective parallel transaction.
5. The method of any one of Claims 1-4, wherein using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises: using the machine learning model to determine whether the at least one execution log is similar to the at least one fault, the similarity being based at least in part on a similarity score.
6. The method of any one of Claims 1-5, wherein the similarity score represents a probability that the at least one execution log matches the at least one fault.
7. The method of any one of Claims 1-6, wherein using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises: using the machine learning model to determine a similarity score, the similarity score representing a level of similarity of the at least one execution log to a set of log templates associated with the at least one fault; and when the similarity score at least meets a similarity threshold, determining that the at least one execution log matches the at least one fault and indicating that the at least one execution log corresponds to an error in the cloud infrastructure.
8. The method of any one of Claims 1-7, wherein the machine learning model is a graph-based model comprising a graph representing the at least one fault, a node of the graph representing a log template and an edge of the graph representing a subsequent log template in a faulty transaction.
9. The method of Claim 8, wherein the graph comprises a plurality of sub-graphs, each sub-graph representing a corresponding fault.
10. The method of any one of Claims 6-9, wherein the similarity score is based at least in part on an incoming sequence of the at least one execution log as compared to an incoming sequence for the set of log templates associated with the at least one fault.
11. The method of any one of Claims 6-10, wherein the similarity score is based at least in part on an incoming sequence, a total number of nodes, a total number of nodes that match a sequence position, a number of nodes that do not match, a number of dissimilar nodes and a position of a last matched node, in the at least one execution log as compared to the set of log templates associated with the at least one fault.
12. The method of any one of Claims 1-7, wherein the machine learning model is a language-based model in which each log template is mapped to a corresponding log template vector.
13. The method of Claim 12, wherein: for the at least one execution log, the language-based model outputs at least one fault index representing a probability that the at least one execution log matches the at least one fault corresponding to the at least one fault index.
14. The method of Claim 12, wherein for the at least one execution log, the language-based model outputs at least one fault index vector; and using the machine learning model to determine whether the at least one execution log matches the at least one fault further comprises when the at least one fault index vector is similar to a predetermined fault index vector, determining that the at least one execution log matches the at least one fault, the similarity of the at least one fault index vector to the predetermined fault index vector representing the probability that the at least one execution log matches the at least one fault.
15. The method of any one of Claims 1-14, further comprising: when the at least one execution log is determined to match the at least one fault, identifying at least one fault remedy associated with the at least one fault.
16. A computing device (12) for detecting a fault in a cloud infrastructure, the computing device (12) comprising processing circuitry (24), the processing circuitry (24) configured to cause the computing device (12) to: receive at least one execution log resulting from running at least one task on the cloud infrastructure; use a machine learning model to determine whether the at least one execution log matches at least one fault, the machine learning model being trained by a plurality of logs associated with a plurality of faults in the cloud infrastructure and a plurality of logs associated with normal execution in the cloud infrastructure; and for at least one log, generate a log template in which at least one value in the at least one log is converted into at least one parameter variable.
17. The computing device (12) of Claim 16, where the processing circuitry (24) is further configured to cause the computing device (12) to: when the at least one execution log is determined to match the at least one fault, indicate an error to be addressed.
18. The computing device (12) of any one of Claims 16 and 17, wherein the machine learning model is trained by the plurality of faults and collecting the plurality of logs associated with the faults.
19. The computing device (12) of any one of Claims 16-18, wherein the processing circuitry (24) is further configured to cause the computing device (12) to: segregate interleaved log templates into at least two separate sets of log templates, each set being associated with a respective parallel transaction.
20. The computing device (12) of any one of Claims 16-19, wherein the processing circuitry (24) is configured to cause the computing device (12) to use the machine learning model to determine whether the at least one execution log matches the at least one fault by being configured to cause the computing device (12) to: use the machine learning model to determine whether the at least one execution log is similar to the at least one fault, the similarity being based at least in part on a similarity score.
21. The computing device (12) of any one of Claims 16-20, wherein the similarity score represents a probability that the at least one execution log matches the at least one fault.
22. The computing device (12) of any one of Claims 16-21, wherein the processing circuitry (24) is configured to cause the computing device (12) to use the machine learning model to determine whether the at least one execution log matches the at least one fault by being configured to cause the computing device (12) to: use the machine learning model to determine a similarity score, the similarity score representing a level of similarity of the at least one execution log to a set of log templates associated with the at least one fault; and when the similarity score at least meets a similarity threshold, determine that the at least one execution log matches the at least one fault and indicate that the at least one execution log corresponds to an error in the cloud infrastructure.
23. The computing device (12) of any one of Claims 16-22, wherein the machine learning model is a graph-based model comprising a graph representing the at least one fault, a node of the graph representing a log template and an edge of the graph representing a subsequent log template in a faulty transaction.
24. The computing device (12) of Claim 23, wherein the graph comprises a plurality of sub-graphs, each sub-graph representing a corresponding fault.
25. The computing device (12) of any one of Claims 21-24, wherein the similarity score is based at least in part on an incoming sequence of the at least one execution log as compared to an incoming sequence for the set of log templates associated with the at least one fault.
26. The computing device (12) of any one of Claims 21-25, wherein the similarity score is based at least in part on an incoming sequence, a total number of nodes, a total number of nodes that match a sequence position, a number of nodes that do not match, a number of dissimilar nodes and a position of a last matched node, in the at least one execution log as compared to the set of log templates associated with the at least one fault.
27. The computing device (12) of any one of Claims 16-22, wherein the machine learning model is a language-based model in which each log template is mapped to a corresponding log template vector.
28. The computing device (12) of Claim 27, wherein: for the at least one execution log, the language-based model outputs at least one fault index representing a probability that the at least one execution log matches the at least one fault corresponding to the at least one fault index.
29. The computing device (12) of Claim 27, wherein: for the at least one execution log, the language-based model outputs at least one fault index vector; and the processing circuitry (24) is configured to cause the computing device (12) to use the machine learning model to determine whether the at least one execution log matches the at least one fault by being configured to cause the computing device (12) to: when the at least one fault index vector is similar to a predetermined fault index vector, determine that the at least one execution log matches the at least one fault, the similarity of the at least one fault index vector to the predetermined fault index vector representing the probability that the at least one execution log matches the at least one fault.
30. The computing device (12) of any one of Claims 16-29, wherein the processing circuitry (24) is further configured to cause the computing device (12) to: when the at least one execution log is determined to match the at least one fault, identify at least one fault remedy associated with the at least one fault.
31. A computing device (12) for detecting a fault in a cloud infrastructure, the computing device (12) comprising processing circuitry (24), the processing circuitry (24) configured to cause the computing device (12) to perform any of the methods of Claims 1-15.
PCT/IN2020/050901 2020-10-23 2020-10-23 Application fault analysis using machine learning WO2022085014A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IN2020/050901 WO2022085014A1 (en) 2020-10-23 2020-10-23 Application fault analysis using machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IN2020/050901 WO2022085014A1 (en) 2020-10-23 2020-10-23 Application fault analysis using machine learning

Publications (1)

Publication Number Publication Date
WO2022085014A1 true WO2022085014A1 (en) 2022-04-28

Family

ID=81290312

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2020/050901 WO2022085014A1 (en) 2020-10-23 2020-10-23 Application fault analysis using machine learning

Country Status (1)

Country Link
WO (1) WO2022085014A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750455A (en) * 2019-10-18 2020-02-04 北京大学 Intelligent online self-updating fault diagnosis method and system based on system log analysis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881737A (en) * 2023-09-06 2023-10-13 四川川锅环保工程有限公司 System analysis method in industrial intelligent monitoring system
CN116881737B (en) * 2023-09-06 2023-12-15 四川川锅环保工程有限公司 System analysis method in industrial intelligent monitoring system

Similar Documents

Publication Publication Date Title
US11783033B2 (en) Methods and apparatus for analyzing sequences of application programming interface traffic to identify potential malicious actions
US11216502B2 (en) Clustering of log messages
US10769228B2 (en) Systems and methods for web analytics testing and web development
US9542255B2 (en) Troubleshooting based on log similarity
US9576037B2 (en) Self-analyzing data processing job to determine data quality issues
US10452407B2 (en) Adapter configuration
US20090271351A1 (en) Rules engine test harness
CN106021093A (en) Test case reuse method and system
CN106897072A (en) Traffic engineered call method, device and electronic equipment
CN110764980A (en) Log processing method and device
CN107579944B (en) Artificial intelligence and MapReduce-based security attack prediction method
WO2022085014A1 (en) Application fault analysis using machine learning
CN111506305B (en) Tool pack generation method, device, computer equipment and readable storage medium
Xu et al. Hue: A user-adaptive parser for hybrid logs
CN110399485A (en) The data source tracing method and system of word-based vector sum machine learning
WO2019032502A1 (en) Knowledge transfer system for accelerating invariant network learning
US20220277176A1 (en) Log classification using machine learning
CN115268847A (en) Block chain intelligent contract generation method and device and electronic equipment
CN115310087A (en) Website backdoor detection method and system based on abstract syntax tree
US20220308984A1 (en) Source code fault detection
US20230011129A1 (en) Log analyzer for fault detection
Chernousov et al. Deep learning based automatic software defects detection framework
Schwenk et al. Classification of structured validation data using stateless and stateful features
US8983984B2 (en) Methods and systems for simplifying object mapping for external interfaces
Gupta et al. Learning Representations on Logs for AIOps

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 20958595
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 20958595
Country of ref document: EP
Kind code of ref document: A1