WO2019060327A1

WO2019060327A1 - Online detection of anomalies within a log using machine learning

Info

Publication number: WO2019060327A1
Application number: PCT/US2018/051601
Authority: WO
Inventors: Feifei Li
Original assignee: University Of Utah Research Foundation
Priority date: 2017-09-20
Filing date: 2018-09-18
Publication date: 2019-03-28

Abstract

Improvements in how anomalies are detected within a tracked execution path of an application are disclosed. Log entries in a log are parsed into respective structured data sequences that include a log key and a parameter set for each entry. The combination of these structured data sequences represents an execution path for the application. A vector is then generated, where the vector includes the parameter sets and a set of time values indicating how much time elapsed between each adjacent log entry in the log. A machine learning sequential (MLS) model is then trained using the vectors and the log keys. When the MLS model is applied to a new log entry, the MLS model generates a probability indicating an extent to which the new log entry is normal or abnormal. The MLS model may be applied in a streaming manner to detect anomalies in a quick and efficient manner.

Description

ONLINE DETECTION OF ANOMALIES WITHIN A LOG USING MACHINE

LEARNING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of and priority to United States Provisional Patent Application Serial No. 62/561, 126 filed on September 20, 2017 and entitled "DEEPLOG: ANOMALY DETECTION AND DIAGNOSIS FROM SYSTEM LOGS THROUGH DEEP LEARNING," which application is expressly incorporated herein by reference in its entirety.

BACKGROUND

[0002] Computers have impacted nearly every aspect of modern-day living. For instance, computers are generally involved in work, recreation, healthcare, transportation, and so forth. The increasing complexity of modern computer systems has become a significant limiting factor in deploying and managing them, especially when a computer system operates in a seemingly obscure or unpredicted manner.

[0003] Anomaly detection is highly beneficial when building a secure and trustworthy computer system. As systems and applications get increasingly more complex, they are often subject to more bugs and vulnerabilities, which an adversary may exploit to launch attacks. Such attacks are also becoming increasingly more sophisticated and difficult to not only resolve but also to detect. As a result, anomaly detection has become more challenging, and many traditional anomaly detection methodologies are proving to be quite deficient. Therefore, there has arisen a substantial need to improve how anomalies are detected and diagnosed in order to provide a safer and more reliable computer system.

[0004] The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.

BRIEF SUMMARY

[0005] Embodiments disclosed herein relate to computer systems, methods, and hardware storage devices that operate within a computing architecture that improves how anomalies are detected within a system log and subsequently diagnosed.

[0006] In some embodiments, each log entry included within a log is parsed into a corresponding structured data sequence. Each structured data sequence is formatted to include a log key and a parameter set for the corresponding log entry. A combination of these structured data sequences represents an execution path of an application that is being tracked by the log. A vector is also generated. The vector includes (1) the corresponding parameter set for each of the log entries and (2) a set of time values indicating how much time elapsed between each of the adjacent log entries. A machine learning sequential (MLS) model is then trained using the vector and the log keys from each of at least some of the log entries. This MLS model is specially designed to generate a conditional probability distribution that, when applied to at least a portion of the execution path after that portion is modified by a newly arrived log entry, generates a probability indicating an extent to which the newly arrived log entry is normal or abnormal.

[0007] Subsequently, after a particular portion of the execution path actually is modified to include a new log entry, the MLS model is applied to at least that portion. It will be appreciated that after the new log entry is received (and prior to applying the MLS model), the new log entry is prepared by parsing it to generate a corresponding log key and a new vector. Then, the process of applying the MLS model causes a probability to be generated, where the probability indicates an extent to which the new log entry is normal or abnormal. Furthermore, the process of applying the MLS model to the particular portion of the execution path includes applying the MLS model to either one of the new log entry' s new log key or the new log entry' s new vector. In this manner, the disclosed embodiments are able to facilitate anomaly detection and diagnosis through the use of system logs and deep machine learning.

[0008] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

[0009] Additional features and advantages will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the teachings herein. Features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. Features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter. BRIEF DESCRIPTION OF THE DRAWINGS

[0010] In order to describe the manner in which the above-recited and other advantages and features can be obtained, a more particular description of the subject matter briefly described above will be rendered by reference to specific embodiments which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments and are not therefore to be considered to be limiting in scope, embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

[0011] Figure 1 illustrates a flowchart of an example method for detecting anomalies within a system event log.

[0012] Figure 2 illustrates an example table of structured/parsed log entries that are extracted from the system event log.

[0013] Figure 3 illustrates an example computing architecture that is configured to perform the method for detecting anomalies.

[0014] Figure 4 illustrates how machine learning sequential (MLS) model is initially trained using a corpus of training data, such as, for example a set of log keys extracted from log entries.

[0015] Figures 5A, 5B, and 5C illustrate how the deep neural network of the MLS model is build using successive combinations of different Long Short-Term Memory ("LS1M') blocks.

[0016] Figure 6 illustrates an example of a divergence point identified as part of a concurrency detection in which multiple threads are adding log entries to the log.

[0017] Figure 7 illustrates an example of a diverge point identified as a part of a new task detection in which multiple threads are adding log entries to the log.

[0018] Figure 8 illustrates an example of a loop detection where certain instructions

(and corresponding log keys) are repeatedly added to the log in a looping manner.

[0019] Figure 9 illustrates an example execution flow where an anomaly occurred.

[0020] Figure 10 illustrates an example computer system configured to perform any of the disclosed operations.

DETAILED DESCRIPTION

[0021] Embodiments disclosed herein relate to computer systems, methods, and hardware storage devices that operate within a computing architecture that improves how anomalies are detected within a system log and are diagnosed. [0022] In some embodiments, log entries in a log (e.g., a system event log) are parsed into structured data sequences that each include a log key and a parameter set. Together, these structured data sequences represent an application's execution path. A vector is also generated, where the vector includes (1) the parameter set for each log entry and (2) a set of time values indicating how much time elapsed between each adjacent log entry. A machine learning sequential (MLS) model is then trained using the vector and the log keys. This MLS model generates a conditional probability distribution that, when applied to at least a portion of the execution path after that portion is modified by a newly arrived log entry, generates a probability indicating an extent to which the newly arrived log entry is normal or abnormal. After a particular portion of the execution path actually is modified to include a new log entry, the MLS model is applied and generates a probability indicating an extent to which that new log entry is normal or abnormal.

Technical Improvements

[0023] In this manner, the disclosed embodiments are able to facilitate anomaly detection and diagnosis through the use of system logs and deep neural network machine learning. Additionally, the disclosed embodiments (1) provide significant benefits over the current technology, (2) provide technical solutions to problems that are currently troubling the technology, and (3) provide optimizations to thereby improve the operations of the underlying computer system.

[0024] One difficultly in the technical field relates to detecting anomalies in a quick manner. If anomalies are not detected and resolved quickly, then an entire attack can occur against the computer or application without an administer even becoming aware of the attack until well after it is complete. Another challenge with anomaly detection relates to the ability to detect any type of anomaly (including unknown types of anomalies) as opposed to simply detecting specific types of anomalies. Yet another challenge in the technical field relates to concurrency issues where multiple threads/processors are adding to the log, thus making it more difficult to detect proper execution workflow. Existing anomaly detection systems fail to provide a comprehensive solution addressing many and/or all of these issues.

[0025] For instance, existing approaches that leverage system log data for anomaly detection fail to provide an effective universal anomaly detection method that is able to guard against different attacks in an online fashion. Instead, these systems perform their analysis in a slow offline manner, thereby further persisting the problem of not identifying/catching an attack quickly enough. For clarification, an "offline" (i.e. non- streaming) method requires several passes over the entire log data in order to perform an analysis of that data. Many conventional methodologies are also deficient because they focus only on detecting specific types of anomalies when training a binary classifier for anomaly detection. Again, this persists the problems described above because those systems are not easily scalable and are restricted to a very rigid framework. Often, the order of log messages in a log provides useful information for diagnosis and analysis of an application (e.g., the log assists in identifying the execution path). However, in many system logs, log messages are produced by several different concurrent threads or concurrently running tasks. Conventional systems are inadequate when faced with such concurrent execution because they typically rely on a workflow model that is dependent on only a single task.

[0026] As outlined above, there are many issues facing the current technical field. That being said, however, the disclosed embodiments provide significant improvements and solutions to each of these problems as well as many others. For instance, the disclosed embodiments are highly efficient because they operate in an online streaming/dynamic manner when identifying anomalies (i.e. they perform only a single pass over the data as opposed to multiple passes). In this regard, administrators (either human or an autonomous computer system) can be alerted in a timely manner to intervene in an ongoing attack and/or to respond to a system performance issue. The disclosed embodiments are also system, type, and even format agnostic with regard to their ability to detect anomalies. As a result, the embodiments are highly scalable and can be used to detect any type of anomaly, even if it was previously unknown (i.e. a new log key extracted from a particular new log entry may be identified as being new and either normal or abnormal, even if it was not included in the initial training corpus used to train the MLS model). Furthermore, the disclosed embodiments are significantly more robust that prior solutions because they are able to handle simultaneous/concurrent processor execution. Even further, the disclosed embodiments improve the operations of the computer system itself because, by performing these synergistic operations to detect anomalies, the computer system will be less exposed to prolonged attacks and will be provided enhanced protections.

Example Method(s)

[0027] As an initial matter, it is noted that system event logs are a universal resource that exists practically in every system. In some implementations described herein, a log message or a log record/entry refers to at least one line in the log file, which can be produced by a log printing statement in an application (e.g., a user program' s source code or a kernel program's code). It will be appreciated that the application can be executing locally or remotely. Regardless of where the application is executing, the log entries can be received within a stream of data. By analyzing these log entries online in real-time, the disclosed embodiments are able to perform a highly efficient and quick anomaly detection analysis and diagnosis.

[0028] With that understanding, attention will now be directed to Figure 1 which refers to a number of method acts that may be performed. Although the method acts may be discussed in a certain order or illustrated in a flowchart as occurring in a particular order, no particular ordering is required unless specifically stated, or required because an act is dependent on another act being completed prior to the act being performed. This method is used to introduce the disclosed embodiments at a high level. Subsequent portions of this disclosure will delve more fully into specifics and features of the various different embodiments.

[0029] Figure 1 shows a flowchart of an example method 100 for detecting anomalies with a system event log. The disclosed embodiments relate to a deep neural network model (referred to herein as a machine learning sequential ("MLS") model) utilizing Long Short-Term Memory ("LS7M") to model a system log as a natural language sequence. This allows the embodiments to automatically learn log patterns from normal execution and to detect anomalies when log patterns deviate from the model, which was trained from log data under normal execution. In this regard, the embodiments utilize a data-driven approach for detecting anomalies in a manner that leverages the large volume of available system logs and that uses natural language processing.

[0030] Initially, method 100 includes an act 105 where each log entry included within a log is parsed into a corresponding structured data sequence. Each structured data sequence is configured to include a log key and a parameter set extracted from its corresponding log entry. Together, the combination of these structured data sequences represents an execution path of an application that is being tracked by the log.

[0031] It is beneficial to view log entries as a sequence of elements following certain patterns and grammar rules. A system log is produced by a program that follows a rigorous set of logic and control flows and is very much like a natural language (though more structured). In raw form, however, log data (i.e. each log entry) is unstructured free- text and its format and semantics can vary significantly from system to system. Therefore, significant efficiencies can be achieved by initially preparing this data through the parsing process in order to better detect, or rather analyze, the patterns that are included within the data.

[0032] To parse the log entries, each piece of alphanumeric data within a log entry is separated into a number/sequence of tokens using a defined set of delimiters (e.g., spaces, equal signs, colons, semicolons, etc.). Once this tokenization is complete, then those tokens are classified as belonging to a "log key" (also known as "message type"), or, alternatively as belonging to a "parameter set" of the corresponding log entry.

[0033] At this point, an example will be helpful. Consider the following print statement and resulting log entry that is included within a log:

Print Statement: printf("Took %f seconds to build instance. ", t) Resulting Log Entry ("e"): Took 10 seconds to build instance

[0034] The extracted log key for this print statement's resulting log entry "e" is shown below. It will be appreciated that the log key ' ' refers to the string constant from the print statement in the source code which printed "e" during the execution of that code.

Log Key ("£"): Took * seconds to build instance

[0035] The parameter values have been removed from the log key ' ' and replaced with a placeholder (e.g., the "*"). These parameter values reflect the underlying system's state and performance status (e.g., the value "10" in the log entry "e" is a parameter value). In some embodiments, as will be discussed in further detail later, values of certain parameters and/or log keys may serve as an introductory identifier for the commencement of a particular series or sequence of application executions (e.g., "block id' in a log may indicate that a certain series of related actions will occur and "instance id' may indicate that a different series of related actions will occur). Use of these identifiers enables log entries to be grouped together or even to untangle a group of log entries (e.g., groups produced by concurrent processes) to separate, single-thread sequential sequences. Turning briefly to Figure 2, table 200 includes rows of log entries that have been parsed in the manner described above. Specifically, in the first column, the underlined text correlates to log keys (e.g., a set of word tokens that are group to form a log key) while the non-underlined text corresponds to parameter values.

[0036] Returning to Figure 1, in act 110, a vector is generated. It will be appreciated that a single vector may be generated to include a large compilation of data (which data is described below), or, alternatively, multiple discrete vectors may be generated. If multiple vectors are initially created, then they may be subsequently merged to form a single vector. In any event, the vector includes (1) the parameter sets for each of the log entries and (2) a set of time values indicating how much time elapsed between each adjacent log entry (i.e. both parameter values as well as timing information). With reference to Figure 2, each row in table 200 is representative of a vector. The information in the first column is the log key and parameter set information while the information in columns two and three correspond to metadata that provides additional information about the log key and parameter set information. As shown, the vector includes timing differences (e.g., "ti - to") between each adjacent/successive log entry. In this manner, the disclosed embodiments are able to store parameter values for each log entry "e", as well as the time that elapsed between "e" and its predecessor log entry in a vector "v_e".

[0037] Returning to Figure 1, in act 1 15, a machine learning sequential (MLS) model (aka a deep learning model) is trained using both the vector information and the log keys from each of at least some (and potentially all) of the log entries. This MLS model, after being trained/tuned using the vector and log keys, generates a probability distribution. When this probability distribution is applied/compared to the execution path (or even a sub-portion of the execution path) after that execution path has been modified by a newly arrived log entry, then the MLS model generates a probability indicating an extent to which the newly arrived log entry is "normal" or "abnormal." In this manner, the MLS model is able to perform anomaly detection at a "per log entry level."

[0038] That is, the MLS model generates a prediction describing a set of predicted log entries. These predicted log entries are log entries that the MLS model anticipates will likely occur (i.e. show up) next in the sequence of log entries obtained as a part of the application' s execution. If the new log entry is included in the list of predicted log entries and if its corresponding prediction probability is sufficiently high (i.e. it satisfies a threshold prediction level), then the new log entry is considered to be normal, or at least within the normal range. In contrast, if the newly arrived log entry is not included in the list or if its associated prediction probability is not sufficiently high (i.e. it does not satisfy the threshold prediction level), then the new log entry is considered to be an anomaly and is marked as being "abnormal."

[0039] Regarding the MLS model, this model is a type of a deep neural network that models the sequence of log entries using a Long Short-Term Memory (LSTM). This modeling ability allows for the automatic learning on different log patterns from normal execution. Additionally, this modeling ability allows the system to flag deviations from normal system execution as anomalies. [0040] In some embodiments the deep neural network is a Recurrent Neural Network ("RAW"). A RNN is an artificial neural network that uses a loop to forward the output of last state to current input, thus keeping track of history for making predictions. Long Short-Term Memory (LSTM) networks are an instance of RNNs that have the ability to remember long-term dependencies over sequences. These features will be discussed in more detail later.

[0041] Because entries in a system log are a sequence of events produced by the execution of structured source code (and hence can be viewed as a structured language), an LSTM is optimal and can be used for online anomaly detection over system logs. By not only using the log keys but also the parameter values for anomaly detection, the disclosed embodiments are able to capture different types of anomalies and are not limited to detecting only a single type. With this type of neural network, additional advantages can be achieved because this type of neural network is capable of depending on only a small training data set that consists of a sequence of "normal log entries". After being trained, the MLS model can recognize normal log sequences and can be used for online anomaly detection over incoming log entries in a streaming fashion, as described further later on.

[0042] The MLS model is able to implicitly capture the potentially nonlinear and high dimensional dependencies among log entries from the training data, which corresponds to normal system execution paths. To help administrators diagnose a problem once an anomaly is identified, the MLS model is also able to build workflow models from log entries during its training phase. In this regard, the MLS model is able to separate log entries produced by concurrent tasks or threads into different sequences so that a workflow model can be constructed for each separate task. Once an anomaly is detected, administrators can diagnose the detected anomaly and perform root cause analysis effectively through use of the workflow model.

[0043] Since the MLS model's neural network uses a learning-driven approach, it is possible to incrementally update the MLS model (e.g., from feedback provided by a human or computer administrator) so that it can adapt to new log patterns that emerge over time. To do so, the MLS model incrementally updates its probability distribution weights during the detection phase (e.g., perhaps in response to live user feedback indicating a normal log entry was incorrectly classified as an anomaly). This feedback may be incorporated immediately in a dynamic online manner to adapt to emerging new system execution patterns (i.e. new logs and/or new log data). In this regard, the MLS model can initially be trained using one corpus of log data and then later tuned/refined using an entirely different corpus of log data or user feedback.

[0044] Returning to method 100, there is an act 120 where, after modifying a particular portion of the execution path to include a new log entry, the MLS model is applied at least to the particular portion. This application process generates a probability indicating an extent to which the new log entry is normal or abnormal. Using this probability, the disclosed embodiments are then able to determine whether the new log entry is an abnormality/anomaly. If it is, then the disclosed embodiments can execute a diagnostic program in an attempt to more fully understand how/why this abnormality occurred.

[0045] In some embodiments, applying the MLS model to the particular portion is performed by sending a selected number (later referred to as a history "h") of log entries to the MLS model. In this case, all of the selected number of log entries appear in the execution path prior to (i.e. before) the appearance of the new log entry in the log. Then, from the MLS model, an output probability distribution is received, where the distribution describes probabilities for a set of predicted log keys that are predicted to appear as a next log key in the execution path. Some embodiments then flag the new log key, which was extracted from the new log entry, as "normal" if the new log key is among a set of top candidates (later referred to as "g") selected from the set of predicted log keys. Alternatively, some embodiments flag the new log key as "abnormal" if the new log key is not among the set of top candidates.

Example Architecture

[0046] Figure 3 shows an example architecture 300 for performing the method 100 of Figure 1. As shown, the MLS model 305, which is an example implementation of the MLS model described in method 100 of Figure 1, includes three main components: the log key anomaly detection model 310, the workflow model 315, and the parameter value anomaly detection model 320. Together, these sub-models of the MLS model 305 may be used to detect and even diagnose anomalies, as generally described earlier in method 100.

[0047] The training data for the log key anomaly detection model 310, the workflow model 315, and the parameter value anomaly detection model 320 are the log entries included within the normal execution log file(s) 325. Each log entry (e.g., "ti : log entry Γ shown in the normal execution log file(s) 325) is parsed and classified as either belonging to either a log key ("£") or a parameter value vector, in the manner described earlier. [0048] The combined collection of log keys, constituting the log key sequence 330, is provided to the log key anomaly detection model 310 and is used to train that model. Additionally, the log key sequence 330 is provided to the system execution workflow model 315 in order to train that model for diagnosis purposes. The combined collection of parameter values, constituting the parameter value vector(s) 335, is fed to the parameter value anomaly detection model 320 and is used to train that model.

[0049] When a new log entry arrives (e.g., new log entry 340), that new log entry 340 is parsed by the log parser 345 into a log key and a parameter value vector in the manner described earlier. The log key anomaly detection model 310 then checks (see label 350) whether the incoming log key is normal or abnormal by comparing the incoming log key to the model's probability distribution. To be considered normal, in some embodiments, the incoming log key should be included in the log key anomaly detection model 310' s generated list of predicted log keys. Additionally, to be considered normal, the probability associated with the incoming log key (as identified within the probability) should satisfy a sufficiently high threshold probability level (e.g., 50%, 55%, 60%), 65%o, 75%), 90% and so on). If the incoming log key is not included in the generated list of predicted log keys or if the incoming log key's associated probability fails to satisfy the threshold probability level, then the incoming log key is considered to be abnormal.

[0050] Accordingly, if the result of the comparison indicates the incoming log key appears to be abnormal (i.e. line 355), then an administrator (e.g., a human or a computer system) is provided a notification/alert. Other remedial measures may also be performed (e.g., stopping the application, turning off communication channels, etc.).

[0051] If the result of the comparison indicates the incoming log key appears to be normal (i.e. line 360), then the associated parameter value vector (i.e. the parameter value vector corresponding to the new log entry 340) is checked (i.e. line 365) by the parameter value anomaly detection model 320 using its own probability distribution data. If the parameter vector anomaly detection model 320 indicates that the associated parameter value vector is abnormal (i.e. line 370) as a result of the check, then the administrator is notified (i.e. line 370) and/or other remedial actions are performed. In this manner, the new log entry 340 will be labeled as an anomaly if either its log key or its parameter value vector are identified as being abnormal. If new log entry 340 is labeled as abnormal, the workflow model 315 provides (e.g., see label 375) semantic information to the administrator to diagnose the anomaly. If, however, the new log entry 340 is not identified as being abnormal (i.e. it is normal), then the system can refrain from providing an alert and/or performing any remedial actions (e.g., see label 380 indicating a normal status).

[0052] It will be appreciated that in some instances, execution patterns may change over time and in other instances, certain log keys or parameter values may not have been included in the original corpus of training data. To address these situations (so false positives are not repeatedly provided by the MLS model 305), an option to collect user feedback regarding an acceptance or rejection of a normal/abnormal classification (and/or probability indication) made by the MLS model is provided. If an administrator reports a detected anomaly as a false positive, this anomaly is used as a labeled record to incrementally update the MLS model 305 to incorporate and adapt to the new pattern.

[0053] In this manner, the MLS model 305 is able to learn the comprehensive and intricate correlations and patterns embedded in a sequence of log entries produced by normal system execution paths (e.g., those included within the normal execution log file(s) 325 as well as from new log entries and from user feedback). Often, it is acceptable to assume that system logs themselves are secure and protected, and an adversary cannot attack the integrity of a log itself. It is also often acceptable to assume that an adversary cannot modify the system source code to change its logging behavior and patterns. As such, there are two primary types of attacks that are particularly worthwhile to detect and guard against.

[0054] Attack Type #1 - These are attacks that lead to system execution misbehavior and hence anomalous patterns in real system logs (i.e. not necessarily the normal execution log file(s) 325). Examples of these types of attacks include, but are not limited to, Denial of Service (DoS) attacks which may cause slow execution and hence performance anomalies reflected in the log timestamp differences from the parameter value vector sequence; attacks causing repeated server restarts such as Blind Return Oriented Programming (BROP) attack shown as too many server restart log keys; and any attack that may cause task abortion such that the corresponding log sequence ends early and/or exception log entries will appear.

[0055] Attack Type #2 - These are attacks that could leave a trace in system logs due to the logging activities of system monitoring services. An example is suspicious activities logged by an Intrusion Detection System (IDS).

[0056] Detecting execution path anomalies using a log key sequence is first described. Since the total number of distinct print statements (that cause log entries to be printed) in a body of source code is a constant, so too is the total number of distinct log keys (e.g., each individual "#"). For instance, let "K = {ki, k2, . . . , k_n}" be the set of distinct log keys from a log-producing body of source code.

[0057] Once log entries are parsed into log keys, the resulting log key sequence reflects an execution path describing the execution order of the log print statements. Let "m " denote the value of the key at position in a log key sequence. Here, "m " may take one of the "«" possible keys from It will be appreciated that "m " is most strongly dependent on the most recent log keys appearing prior to "m " (i.e. distant log keys provide a relatively smaller influence over "m " while closer log keys provide a relatively larger influence over "m ").

[0058] It is possible to perform anomaly detection using a multiclass classification model, where each distinct log key defines a class. Furthermore, it is possible to train a multi-class classifier over the recent history context of the log key in question (i.e. those log keys that are relatively close to the log key in question). Here, the input is a history of recent log keys, and the output is a probability distribution over the "«" log keys from " ", representing the probability that the next log key in the sequence is a key "k £K .

[0059] Figure 4 summarizes this classification setup. In Figure 4, there is a MLS model 400 (which is an example implementation of the MLS model 305 from Figure 3), a set of inputs 405, and a set of outputs 410. In this scenario, Suppose "f is the sequence id of the next log key to appear. The inputs 405 for classification include a window "w" of the "h" most recent log keys. That is, "w = {ntt-h ; . . · ,' ntt-2 ; m_t-if where each "m " is in and is the log key from the log entry "e ". Note that the same log key value may appear several times in "w".

[0060] By feeding the inputs 405 into the MLS model 400, the outputs 410 of the model 400 (constituting the training phase) is the conditional probability distribution "Prfmt = kj \wj" for each "k EK(i = 1, . . . , n This conditional probability distribution is the same as that which was referred to earlier in connection with method 100 of Figure 1. The detection phase subsequently uses this probability distribution to make a prediction by comparing a set of predicted output against the observed log key value that actually appears in a new log entry.

[0061] In this manner, the process of training the MLS model using the vector and log keys may be performed by first identifying each of at least some distinct log keys within the log entries. Then, a respective class is defined for each distinct log key to thereby form a set of classes. The MLS model is then trained as a multi-class classifier using the set of classes. Because, at least some of these log entries constitute a history of the execution path, the training thereby produces a particular probability distribution over that history of the execution path.

[0062] It will be appreciated that the above-recited training stage can be performed while relying on only a small fraction of log entries produced during the normal execution of the system or application (i.e. only a relatively small number of log entries from the normal execution log file(s) 325 shown in Figure 3 need be used). To achieve this efficiency (i.e. using only a small sample of log entries), the following is performed. Specifically, for each log sequence of length "h" in the training data (i.e. inputs 405), an update of the MLS model 400 may be performed for the probability distribution (i.e. outputs 410) of having "k, £K as the next log key value. For example, suppose a small log file resulting from normal execution is parsed into a sequence of log keys: {k22, k5, kll, k9, kll, k26f Given a window size "h = 3", the input sequence and the output label pairs for training will be:

{k22, k5, kll→k9j

{k5, kll, k9→kllj

{kll, k9, kll→k26j

[0063] It will be appreciated that the disclosed embodiments can accommodate any size "h". Consequently, the MLS model is highly scalable and flexible.

Example Language Models

[0064] In some embodiments, the problem of ascribing probabilities to sequences of words drawn from a fixed vocabulary can be solved through use of an N-Gram analysis. In this case, each log key can be viewed as a word taken from the vocabulary " ". Some of the disclosed embodiments use the N-Gram model to assign probabilities to arbitrarily long sequences. The intuition is that a particular word in a sequence is influenced only by its recent predecessors rather than the entire history (i.e. a shorter "h" as described above). With that in mind, this approximation is equivalent to setting Pr(m_t = k, \mi ; . . . ; ntt-i) = Pr(m_t = k, \m_t-N mt-if where "N" denotes the length of the recent history to be considered.

[0065] To train using a limited number of log entries, it is beneficial to calculate probabilities using relative frequency counts from a large corpus to give maximum likelihood estimates. Given a long sequence of keys (e.g., "{mi ; rti2 ; . . . m_tj), it is possible to estimate the probability of observing the I^th key "k " using the relative frequency counts of \ ; . . . ; m_t-i ; m_t = k,} with respect to the sequence \ ι», \ ; ntt-i). In other words, "Pr(m_t = k,^■ \mi ; . . . ; ntt-i) = count(m_t-N ntt-i ; m_t = k,) / count(m_t-N ;■■ ■ ; mt-i)". Note that these frequencies can be counted using a sliding window of size "N" over the entire key sequence. To apply the N-gram model, the embodiments simply set "N = h" as depicted in Figure 4. Thus, this process can be used as a baseline method.

[0066] In other embodiments, the MLS model uses a LSTM neural network for anomaly detection from a log key sequence. For example, given a sequence of log keys, a LSTM network is trained to maximize the probability of having "k E K" as the next log key value as reflected by the training data sequence. In this manner, the LSTM network learns a probability distribution Pr(m_t = k, in, /, . . . m_t-2 ; mt-if that maximizes the probability of the training log key sequence. Figures 5A, 5B, and 5C illustrate a design using an LSTM network.

[0067] Turning first to Figure 5A, there is shown a single LSTM block 500 that reflects the recurrent nature of LSTM. LSTM block 500 remembers a state for its input as a vector of a fixed dimension. The single LSTM block 500 in Figure 5 A is collectively represented as "Block A. " Label 505 is representative of an operation in which the output of LSTM block 500' s last state is fed back into LSTM block 500 via a feedback loop. In this regard, the state of LSTM block 500 from a previous time step is also fed into its next input (as shown by label 505), together with its (external) data input 510 ("mt-Γ in this particular example), to compute a new state and output. This is how historical information is passed to and maintained in the single LSTM block 500.

[0068] Figure 5B further expands on that which was shown in Figure 5A. Specifically, Block A, which is representative of Block A from Figure 5A, is first illustrated. Additionally, a series of LSTM blocks (e.g., LSTM Blocks 515A, 515B, and 515C) (collectively grouped as "Block B") form an unrolled version of the recurrent model in one layer. Each LSTM block maintains a hidden vector "Hi ," and a cell state vector "Ct-i". Both the hidden vector and the cell state vector are passed to the next block (e.g., from LSTM block 515A to LSTM block 515B and from LSTM block 515B to LSTM block 515C) to initialize the next/sub sequent LSTM block' s state (e.g., "Ht-h" and "G-A" from LSTM block 515 are being fed into LSTM block 515B to initiate LSTM block 515B's state). In some embodiments, one LSTM block is used for each log key from an input sequence "w" (and for a window of "h" log keys). Therefore, in some embodiments, a single layer consists of "A" unrolled LSTM blocks, as shown generally in Figure 5B. Although only three unrolled LSTM blocks are shown in Figure 5B, it will be appreciated that any number of LSTM blocks may be used (e.g., to correspond to the number of log keys that are selected for use).

[0069] Within a single LSTM block, the input (e.g., Input 520A, 520B, or 520C) (e.g., "nit- ") and the previous output ("Ht-i-i") are used to decide (1) how much of the previous cell state "Ct-i-i" to retain in state "G-,", (2) how to use the current input and the previous output to influence the state, and (3) how to construct the output H_t- ". This may be accomplished using a set of gating functions to determine state dynamics by controlling the amount of information to keep from the input and the previous output and the information flow going to the next step. Each gating function is parameterized by a set of weights to be learned. The expressive capacity of an LSTM block is determined by the number of memory units (i.e. the dimensionality of the hidden state vector "H").

[0070] In some embodiments, the training step entails finding proper assignments to the weights so that the final output of the sequence of LSTMs produces the desired label (output) that comes with inputs in the training data set. During the training process, each input/output pair incrementally updates these weights through loss minimization via gradient descent. Because an input consists of a window "w" of "h" log keys and an output is the log key value that comes right after "w", it is beneficial to use categorical cross-entropy loss for training.

[0071] After training is done, it is possible to predict the output for an input (e.g., "w = {ntt-h ;■■■ ; mt-i}") using a layer of "A" LSTM blocks. Each log key in "w" feeds into a corresponding LSTM block in this layer.

[0072] Figure 5C shows how a deep LSTM neural network (e.g., MLS model 525) may be generated. Specifically, if multiple layers are stacked (e.g., layer 530A and layer 530B) and the hidden state of the previous layer is used as the input of each corresponding LSTM block in the next layer, it becomes a deep LSTM neural network, as shown by MLS model 525. For simplicity, Figure 5C omits an input layer and an output layer constructed by standard encoding-decoding schemes. The input layer encodes the "«" possible log keys from as one-hot vectors. That is, a sparse ^-dimensional vector "w," is constructed for the log key "ki E K such that "ufi] = and '¾ // = 0" for all other "j ≠ i". The output layer translates the final hidden state into a probability distribution function using a standard multinomial logistic function to represent Pr[m_t = k, \wj" for each "k E K The example shown in Figure 5C shows only two rows of hidden layers (e.g., layer 530A and layer 530B), but any number of layers may be used. Parameter Value And Performance Anomaly Detection

[0073] Up to this point, most of the discussion has focused on the use of log keys to detect anomalies. While the log key sequence (e.g., log key sequence 330 from Figure 3) is useful for detecting execution path anomalies, some anomalies are not shown as a deviation from a normal execution path, but rather as an irregular parameter value. These parameter value vectors (for the same log key) form a parameter value vector sequence (e.g., parameter value vector(s) 335 of Figure 3), and these sequences from different log keys form a multi-dimensional feature space that is beneficial in performance monitoring and anomaly detection. To perform anomaly detection using the parameter value vectors, several different approaches may be followed.

[0074] One approach is to store all parameter value vector sequences into a matrix, where each column is a parameter value sequence from a log key ' ' (note that it is possible to have multiple columns for depending on the size of its parameter value vector). Row "z" in this matrix represents a time instance "t". For example, consider the log entries in Table 200 of Figure 2. There are 3 distinct log key ("£") values in this example, as shown below:

ki = Deletion of * complete

k₂ = Took * seconds to deallocate network ...

k₃ = VM stopped (Lifecycle Event)

[0075] The sizes of their parameter value vectors are 2, 2, and 1 respectively. To illustrate, ki's parameter vector value consists of "tz-to" and "file lie?', thereby forming 2 values in the parameter value vector. The first log entry (i.e. Row 1 which begins with "ti") in table 200 represents time instance "tz" with values [ti - to, file lid, null, null, null]. Similarly, row 2 and row 3 are [null, null, t2 - U, 0.61, null] and [null, null, null, null, ts - ti respectively. Notice, there are five values in each vector to correspond with the total number of parameter values.

[0076] In some implementations, each row can be configured to represent a range of time instances so that each row corresponds to multiple log messages within that time range and thus the matrix/table may become less sparse (i.e. more populated). That said, table 200 (i.e. the matrix) will (beneficially) still be very sparse even when there are many log key values and/or some large parameter value vectors.

[0077] While the above approach can be used, additional efficiencies may be realized by a separate approach in which a parameter value anomaly detection model is trained by viewing each parameter value vector sequence (for a log key) as a separate time series. [0078] Consider again the example in Table 200 of Figure 2. The time series for the parameter value vector sequence of "fo"is: { [¾ - ti, 0.61], [t'2—t'i, /] } . This improved approach allows the problem to be reduced to anomaly detection from a multi-variate time series data. Furthermore, it is possible to apply an LSTM-based approach again. Therefore, in this approach, some of the embodiments use a similar LSTM network as shown in Figure 5C to model a multi-variate time series data, with the following adjustments. Note a separate LSTM network is built for the parameter value vector sequence of each distinct log key value. Therefore, there is an LSTM network for the parameter value vector sequence and a separate LSTM network for the log key sequence.

[0079] The input at each time step is simply the parameter value vector from that timestamp. The values in each vector are normalized by the average and the standard deviation of all values from the same parameter position from the training data. The output is a probability density function for predicting the next parameter value vector, based on a sequence of parameter value vectors from recent history.

[0080] For the multi-variate time series data, the training process tries to adjust the weights of its LSTM model in order to minimize the error between a prediction and an observed parameter value vector. Thus, mean square loss is used to minimize the error during the training process. The difference between a prediction and an observed parameter value vector is measured by the mean square error (MSE).

[0081] Instead of setting an arbitrary error threshold for anomaly detection purposes in an ad-hoc fashion, it is beneficial to partition the training data into two subsets, namely: the model training set and the validation set. For each vector "v" in the validation set, the model produced by the training set is applied to calculate the MSE between the prediction (using the vector sequence from before "v" in the validation set) and "v". At every time step, the errors between the predicted vectors and the actual ones in the validation group are modeled as a Gaussian distribution.

[0082] At deployment, if the error between a prediction and an observed value vector is within a specified threshold level (e.g., a relatively high-level such as for example 75%, 80%, 85%), etc.) of confidence interval of the above Gaussian distribution, the parameter value vector of the incoming log entry is considered normal. If the threshold is not satisfied, then the parameter value vector is considered abnormal.

[0083] Since parameter values in a log message often record relevant system state metrics, this approach/method is able to detect various types of performance anomalies. For example, a performance anomaly may be reflected as a "slow down". Recall that some embodiments store in each parameter value vector the time elapsed between consecutive log entries. The above LSTM model, by modeling parameter value vector as a multi-variate time series, is able to detect unusual patterns in one or more dimensions in this time series (the elapsed time value is just one such dimension).

Adjusting Probability Distribution Weights From Feedback

[0084] It will be appreciated that training data may not cover all possible normal execution patterns. For instance, system behavior may change over time, particularly as a result of dependencies on workload and data characteristics. To address these changes, the disclosed embodiments are able to incrementally update the MLS model's probability distribution weights (e.g., those in the LSTM models) to incorporate and adapt to new log patterns.

[0085] To do so, the embodiments provide a mechanism for an administrator to enter feedback input. This allows the MLS model to use false positive information to adjust its weights. For example, suppose "h = 3" and the recent history sequence is {ki, k2, fe}, and the MLS model predicted the next log key to be "k " with probability "7.00", but the next log key value is actually "fo" (which will be labeled as an anomaly by the MLS model). If the administrator reports "fo" as a false positive (i.e. the MLS model made a wrong prediction), the embodiments are able to use the following input-output pair {ki, k2, ks→ ki to update the weights of its model to learn this new pattern. Consequently, the next time the history sequence {ki, k2, fe} is identified, the embodiments will output both "k " and "fo" with updated probabilities. The same update procedure works for the parameter value anomaly detection model 320 illustrated in Figure 3.

[0086] Note that the MLS model does not need to be re-trained from scratch. After the initial training process, the various sub-models in the overall MLS model exist as several multi-dimensional weight vectors. The update process feeds in small snippets of new training data, and adjusts the weights to minimize the error between model output and actual observed values from the false positive cases, to thereby "tune" the MLS model to new data as it emerges.

Workflow Construction From Multi-Task Execution

[0087] As discussed, each log key may be representative of the execution of a log printing statement in the source code. In some instances, a task operation (e.g., VM creation) will produce a sequence of multiple related log entries, where each of these related log entries are linked to one another as a result of their being a part of the task operation. The order of log entries produced by a task may, in some instances, represent an execution order of each function for accomplishing that task. As a result, it is often beneficial to build a workflow model as a finite state automaton (FSA) to capture the execution path of any task. This workflow model can also be used to detect execution path anomalies. Additionally, the workflow model is very useful towards enabling administrators to diagnose what had gone wrong in the execution of a task when an anomaly has been detected. A number of case scenarios will now be discussed regarding situations where it is beneficial to perform workflow analysis.

[0088] One case is when multiple programs concurrently write to the same log. Often, each log entry contains the name of the program that created it, so the workflow model can use this information when organizing the log entries.

[0089] Another case is when the process or task id is included in a log entry. Here, focus is placed on the case where a user program is executed repeatedly to perform different, but logically related tasks within that program. In this situation, while tasks may not overlap in time, the same log key may appear in more than one task, and concurrency is possible within each task (e.g., multiple threads in one task).

[0090] One object of the disclosed embodiments is to separate log entries for different tasks in a log file and to then build a workflow model for each task based on its log key sequence. Hence, different groups of log entries may be classified as belonging to difference tasks that are executing on behalf of the application. In this regard, the input into the system is the log key sequence parsed from a raw log file, and the output is a set of workflow models, one for each identified task.

[0091] In some of the embodiments that perform anomaly detection from log keys, the input is a sequence of log keys of length "h" from recent history, and the output is a probability distribution of some (or all) possible log key values. One observation regarding this data is that the output actually encodes the underlying workflow execution path.

[0092] For example, given a log key sequence, the MLS model is configured to predict what will happen next based on the execution patterns the MLS model has observed during the training stage. If a sequence "w" is never followed by a particular key value ' ' in the training stage, then Pr[m_t = k\w] = 0". Correspondingly, if a sequence "w" is always followed by ' ', then Pr[m_t = k\w] = Γ . For example, suppose on a sequence "25→5 ", the output prediction is "{57:1.00}" . Consequently, it can be determined that ^ίί25→54→57' is from one task. [0093] A more complicated case is when a sequence "w" is to be followed by a log key value from a group of different keys; here, the probabilities of these keys to appear after "w" sum to 1. To handle this case, the disclosed embodiments are configured to operate in the manner shown in Figure 6. By way of introduction, Figure 6 shows a concurrency detection scheme 600 involving a number of example log key values (e.g., 25, 18, 54, 57, 56, and 31). The legend 605 provides example descriptions regarding what each of the log key values may represent. It will be appreciated that these are examples only and are meant simply for the purpose of illustration. As such, the embodiments should not be limited simply to that shown in the figures.

[0094] Consider the log sequence "54→57' shown in Figure 6 and suppose the predicted probability distribution for the next log key is " {18: 0.8, 56: 0.2}", which means that the next step could be either "18" or "56", as shown. This ambiguity could be caused by using an insufficient history sequence length (e.g., "h" as discussed earlier). For example, if two tasks share the same workflow segment "54→5T the first task has a pattern " 18→54→57→18" that is executed 80% of the time, and the second task has a pattern ^ίί31→54→57→56 that is executed 20% of the time. This will lead to a model that predicts "{18: 0.8, 56: 0.2}" given the sequence "54→57".

[0095] This issue is addressed by training models with dynamically adjustable history sequence lengths (e.g., using "h = 3" instead of "h = 2"). During workflow construction, it is possible to dynamically adjust/optimize a log sequence length that leads to a more certain prediction (i.e. the MLS model is scalable to be applied to different history sequence lengths formed from different numbers of log entries). To illustrate, in the above example, the sequence " ^' 18→54→57" will lead to the prediction "{18: 1.00}" and the sequence "31→54→57' will lead to the prediction "{56: 1.00}" . In this manner, if the MLS model determines that its predictions are not producing accurate results, the MLS model can dynamically and automatically change the length of its history sequences in an attempt to improve its predictions.

[0096] If a small sequence is ruled out as being a shared segment from different tasks (i.e., increasing the sequence length for training and prediction does not lead to a more certain prediction), the challenge now is to find out whether the multi-key prediction output is caused by either concurrency in the same task or the start of a different task. Positions in the sequence where a concurrency in the same task occurs or where the start of a different task occurs are referred to herein as a "divergence point" (e.g., divergence point 610 in Figure 6) [0097] As more particularly shown in Figure 6, if the divergence point 610 is caused by a concurrency in the same task. Here, a common pattern is that keys with the highest probabilities in the prediction output will appear one after another, and the certainty (measured by higher probabilities for less number of keys) for the following predictions will increase, especially because keys for some of the concurrent threads had already appeared. The prediction will eventually become certain after all keys from concurrent threads are included in the history sequence.

[0098] On the other hand, in the new task detection 700 of Figure 7, if the divergence point (e.g., divergence point (new task occurrence) 705) is caused by the start of a new task occurrence, the predicted log key candidates ("24" and "26" in Figure 7) will not appear one after another. If each such log key is incorporated into the history sequence, the next prediction is a deterministic prediction of a new log key (e.g., "2¥→60", ^ίί26→37'). If this is the case, it is acceptable to stop growing the workflow model of the current task (stop at log key "57" in Figure 7), and start constructing workflow models for new tasks (e.g., as shown by the workflow model 710 in Figure 7). Note that the two "new tasks" in Figure 7 (i.e. those beginning with "2 " and "26", respectively) could also be an "if-else" branch (e.g., "57→ if (24→ 60→. . . ") else "(26→ 37→ . . .)")· To handle such situations, it is beneficial to apply a simple heuristic: if the "new task" has a sufficiently low number of log keys (e.g., 3) and always appears after a particular task(e.g., "T_p"), the embodiments treat it as part of an "if-else" branch of "T_p", otherwise as a new task.

[0099] Once divergence points are distinguished (e.g., those caused by concurrency (i.e. multiple threads) in the same task and those caused by new tasks), the embodiments construct workflow models (e.g., workflow model 615 and workflow model 710, as illustrated in Figures 6 and 7, respectively).

[00100] Figure 8 represents a loop detection 800. It will be appreciated that a loop is typically shown in the initial workflow model as an unrolled chain (e.g., ^ίί26→37→39→40→39→4Ο ), as shown in Figure 8. While this workflow chain is initially "26→37→39→40→39→40^, it is beneficial to identify the repeated fragments as a loop execution ("39→ 0" repeating, as shown by workflow model 805).

[00101] Another approach to generating a workflow model is to use a density-based clustering technique. The basis for this technique is that log keys in the same task normally appear together/near each other, but log keys from different tasks may not always appear together as the ordering of tasks is not fixed during multiple executions of different tasks. This allows the embodiments to cluster log keys based on co-occurrence patterns and to separate keys into different tasks when co-occurrence rate is sufficiently low. For example, in a log key sequence, the distance "if' between any two log keys is defined as the number of log keys between them plus 1, as illustrated by the sequence {ki, li2, ki , d(ki, ki) = [1, 2], d(A¾ ki) = 1 (note that there are two distance values between the pair (ki, ki). Based on this, it is possible to build a co-occurrence matrix as shown below in Table 1 - Co-occurrence Matrix of Log Keys (ki, kj) Within Distance "d".

Table 1 - Co-occurrence Matrix of Log Keys (ki, kj) Within Distance "if'

Here, each element pd(i, j) represents the probability of two log keys "k" and "k" having appeared within distance "if' in the input sequence. Specifically, let f(ki) be the frequency of "k" in the input sequence, and fd(ki, kj) be the frequency of pair (ki, kj) appearing together within distance "if' in the input sequence. The equation below shows the relevance of kj to ki.

[00102] For example, when:

d = 1 p'^{(lj) =} -f¾T ^{= 1}

[00103] This means that for every occurrence of ki, there is a kj next to it. Note, that in this definition, f(ki) in the denominator is scaled by "if' because while counting cooccurrence frequencies within "if', a key "k" is counted by "if' times. Scaling f(ki) by a factor of "if' ensures the following relationship. Additionally, it is possible to build multiple co-occurrence matrices for different distance values of "if'. In other words, the following relationship is satisfied:

= l for any "/" [00104] With a co-occurrence matrix for each distance value "if' that was built, it is possible to output a set of tasks TASK = (Ti , T₂ , . . .)· The clustering procedure works as follows.

[00105] First, for "d = 7", a check is performed to determine if any pi(i, j) is greater than a threshold τ (e.g., τ = 0.9). When it does, "k " and "k " are connected together to form Ti = [ki , kj]. Next, a recursive check is performed to determine if Ti could be extended from either its head or tail. For example, if there exists "k_x EK" such that "pi(k, , k_x) > a further check is performed to determine if , k_x) > (i.e., if kj and k_x have a large co-occurrence probability within distance 2). If yes, Ti = [k_x , ki , kj], otherwise add T₂ = [ki , k_x] is added to TASK.

[00106] This procedure continues until no task "V in TASK could be further extended. In the general case when a task "V to be extended has more than 2 log keys, when checking if "A_*" could be included as the new head or tail, it is beneficial to check if "kx" has a co-occurrence probability greater than τ with each log key in "7" up to distance d where d' is the smaller of: (1) the length of "7" and (2) the maximum value of "i ' for which a co-occurrence matrix was built. For example, to check if "T = [ki, k2, ksf should connect "kj at its tail, a check is performed to determine

"min(pi(ki, k4), P2(k2, k4), pi(ki, k4)) > τ" . In this manner, some of the embodiments are able to determine that certain tokens classified as belonging to either the log entry's parameter set or log key constitute identifiers for a particular execution sequence of the application. These identifiers may then be used to group the associated log entries within the log together or, alternatively to untangle those log entries from one another.

[00107] The above process connects sequential log keys for each task. When a task "7i = [ki , kj]" cannot be extended to include any single key, a check is performed to determine if "77' can be extended by a selected number of log keys (e.g., 2) (i.e. if there exists '% , k_y £K such that "pi(k, , k_x) + pi(k, , k_y) > r", or "pi(k_j , k_x) + pi(k_j , k_y ) >

[00108] Suppose the latter case is true, the next thing to check is whether "k_x" and "k_y" are log keys produced by concurrent threads in task "77'. If they are, "pd(kj , k_x)" increases with larger "if' values (i.e. "p2(kj , k_x) > pi(kj , k_x)"), which occurs because the appearance ordering of keys from concurrent threads is not certain. Otherwise "k_x" and "ky" do not belong to "77', thus "T₂ = [kj , k_xJ" and "T₃ = [kj , k_yJ" are added into TASK instead. Finally, for each task "V in TASK, it is beneficial to eliminate "V if its sequence is included as a sub-sequence in another task. Once a log key sequence is separated out and identified for each task, the workflow model construction for a task follows the same discussion presented in connection with Figures 6 through 8.

[00109] Generally speaking, larger "h" values will increase the prediction accuracy because more history information is utilized in LSTM until it reaches a point where keys that are far back in history do not contribute to the prediction of keys to appear. At this point, continuing to increase "h" does not hurt the prediction accuracy of LSTM, because LSTM is able to learn that only the recent history in a long sequence matters and will therefore ignore the long tail. However, a large "h" value may impact performance. More computations (and layers) are required for both training and prediction, which likely slows down performance. The value of "g", therefore, beneficially regulates the tradeoff between true positives (anomaly detection rate) and false positives (false alarm rate).

[00110] The workflow model provides a guidance to optimally set a proper value for both "h" and "g". Here, "h" is set to be just large enough to incorporate relevant dependencies for making a good prediction, so "h" can be set as the length of the shortest workflow. The number of possible execution paths represents a good value for "g"; hence, in some embodiments, "g" is set as the maximum number of branches at all divergence points from the workflows of all tasks.

[00111] Attention will now be directed to Figure 9 which shows an anomaly detection 900 process. Whenever an anomaly is detected, the workflow model can be used to help diagnose this anomaly and understand how and why it has happened. Figure 9 shows such an example.

[00112] In this example, using a history sequence [26, 37, 38], the top prediction from the MLS model is log key "39" (suppose "g = /"), as shown. However, by following the actual execution 905 line, it is shown that the actual log key that appeared is "(57", which is an anomaly 910. Using a workflow model's data for this task, it is possible to identify the current execution point in the corresponding workflow and to further discover that this error happened right after log key 38 and before the log key 39.

[00113] Accordingly, the disclosed embodiments are configured to operate a computing architecture that is specially designed to improve how anomalies are detected within a system event log. As discussed, the embodiments are able to perform these processes in a scalable, flexible, and highly efficient manner. Furthermore, the embodiments are able to perform these processes very rapidly in an online, streaming manner. Even further, the MLS model may be applied directly to the log keys in accordance with this online manner. These and the other disclosed features constitute significant improvements over existing systems (e.g., especially offline systems because those systems require multiple passes over the data and also require that a number of appearances for each distinct log key be counted during those multiple passes).

Example Computer Systems

[00114] Attention will now be directed to Figure 10 which illustrates an example computer system 1000 that may be used to facilitate the operations described herein. It will be appreciated that previous references to a "computing architecture" may refer to computer system 1000 by itself or to computer system 1000 along with any number of other computer systems. As such, computer system 1000 may take various different forms. For example, in Figure 10, computer system 1000 may be embodied as a tablet 1000A or a desktop 1000B. The ellipsis lOOOC demonstrates that computer system 1000 may be embodied in any other form. For example, computer system 1000 may also be a distributed system that includes one or more connected computing components/devices that are in communication with computer system 1000, a laptop computer, a mobile phone, a server, a data center, and/or any other computer system.

[00115] In a basic configuration, computer system 1000 includes various different components. For example, Figure 10 shows that computer system 1000 includes at least one processor 1005 (aka a "hardware processing unit"), an anomaly engine 1010 (an example implementation of the MLS models described earlier), and storage 1015. Processor 1005 and/or anomaly engine 1010 may be configured to perform any of the operations discussed herein. That is, the anomaly engine 1010 may be a dedicated, specialized, or even general processor. Storage 1015 is shown as including executable code/instructions 1020 as well as an MLS model data 1025. When executed, the executable code/instructions 1020 causes the computer system 1000 to perform the disclosed operations. It will be appreciated that the MLS model data 1025 may be stored remotely as opposed to being stored locally.

[00116] Storage 1015 may be physical system memory, which may be volatile, nonvolatile, or some combination of the two. The term "memory" may also be used herein to refer to non-volatile mass storage such as physical storage media. If computer system 1000 is distributed, the processing, memory, and/or storage capability may be distributed as well. As used herein, the term "executable module," "executable component," "module," or even "component" can refer to software objects, routines, or methods that may be executed on computer system 1000. The different components, modules, engines, and services described herein may be implemented as objects or processors that execute on computer system 1000 (e.g. as separate threads).

[00117] The disclosed embodiments may comprise or utilize a special-purpose or general-purpose computer including computer hardware, such as, for example, one or more processors (such as processor 1005) and system memory (such as storage 1015), as discussed in greater detail below. Embodiments also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general-purpose or special-purpose computer system. Computer-readable media that store computer-executable instructions in the form of data are physical computer storage media. Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example and not limitation, the current embodiments can comprise at least two distinctly different kinds of computer- readable media: computer storage media and transmission media.

[00118] Computer storage media are hardware storage devices, such as RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) that are based on RAM, Flash memory, phase-change memory (PCM), or other types of memory, or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code means in the form of computer-executable instructions, data, or data structures and that can be accessed by a general-purpose or special-purpose computer.

[00119] Computer system 1000 may also be connected (via a wired or wireless connection) to external sensors (e.g., one or more remote cameras, accelerometers, gyroscopes, acoustic sensors, magnetometers, etc.) or computer systems. Further, computer system 1000 may also be connected through one or more wired or wireless networks 1030 to remote systems(s) that are configured to perform any of the processing described with regard to computer system 1000. As such, computer system 1000 is able to collect streamed log data from those other external devices as well. A graphics rendering engine may also be configured, with processor 1005, to render one or more user interfaces for the user to view and interact with.

[00120] A "network," like the network 1030 shown in Figure 10, is defined as one or more data links and/or data switches that enable the transport of electronic data between computer systems, modules, and/or other electronic devices. When information is transferred, or provided, over a network (either hardwired, wireless, or a combination of hardwired and wireless) to a computer, the computer properly views the connection as a transmission medium. Computer system 1000 will include one or more communication channels that are used to communicate with the network 1030. Transmissions media include a network that can be used to carry data or desired program code means in the form of computer-executable instructions or in the form of data structures. Further, these computer-executable instructions can be accessed by a general-purpose or special- purpose computer. Combinations of the above should also be included within the scope of computer-readable media.

[00121] Upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to computer storage media (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a network interface card or "NIC") and then eventually transferred to computer system RAM and/or to less volatile computer storage media at a computer system. Thus, it should be understood that computer storage media can be included in computer system components that also (or even primarily) utilize transmission media.

[00122] Computer-executable (or computer-interpretable) instructions comprise, for example, instructions that cause a general-purpose computer, special-purpose computer, or special-purpose processing device to perform a certain function or group of functions. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

[00123] Those skilled in the art will appreciate that the embodiments may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like. The embodiments may also be practiced in distributed system environments where local and remote computer systems that are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network each perform tasks (e.g. cloud computing, cloud services and the like). In a distributed system environment, program modules may be located in both local and remote memory storage devices.

[00124] Additionally, or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components (e.g., the processor 1005). For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Program-Specific or Application-Specific Integrated Circuits (ASICs), Program-Specific Standard Products (ASSPs), System-On-A-Chip Systems (SOCs), Complex Programmable Logic Devices (CPLDs), Central Processing Units (CPUs), and other types of programmable hardware.

[00125] The present invention may be embodied in other specific forms without departing from its spirit or characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims

CLAIMS What is claimed is:

1. A computer system comprising:

one or more processors; and

one or more computer-readable hardware storage devices having stored thereon computer-executable instructions that are executable by the one or more processors to cause the computer system to:

for a log that includes a plurality of log entries, parse each log entry in the plurality of log entries into a corresponding structured data sequence comprising a corresponding log key and a corresponding parameter set, wherein a combination of these structured data sequences represents an execution path of an application that is being tracked by the log;

generate a vector that includes (1) the corresponding parameter set for each of the log entries and (2) a set of time values indicating how much time elapsed between each adjacent log entry in the plurality of log entries;

train a machine learning sequential (MLS) model using the generated vector and log keys from each of at least some of the log entries, wherein the MLS model, after being trained, generates a probability distribution that, when applied to at least a portion of the execution path after that portion is modified by a newly arrived log entry, generates a probability indicating an extent to which the newly arrived log entry is normal or abnormal; and

after modifying a particular portion of the execution path to include a new log entry, apply the trained MLS model to at least the particular portion to generate a particular probability indicating a particular extent to which the new log entry is normal or abnormal.

2. The computer system of claim 1, wherein each log entry in the plurality of log entries is an unstructured, free-text log entry.

3. The computer system of claim 1, wherein parsing any one of the log entries in the plurality of log entries includes:

separating the one log entry into a sequence of tokens; and classifying each token in the sequence of tokens as belonging to the one log entry's corresponding log key or, alternatively as belonging to the one log entry's corresponding parameter set.

4. The computer system of claim 3, wherein execution of the computer- executable instructions further causes the computer system to:

determine that one or more tokens classified as belonging to the one log entry's parameter set constitute identifiers for a particular execution sequence of the application, and

use the identifiers to group selected log entries within the log together or, alternatively, to untangle the selected log entries within the log from one another.

5. The computer system of claim 1, wherein execution of the computer- executable instructions further causes the computer system to:

after receiving the new log entry, parse the new log entry to generate a new log key and a new vector.

6. The computer system of claim 6, wherein applying the MLS model to at least the particular portion of the execution path includes applying the MLS model to either one of the new log entry's new log key or the new log entry's new vector.

7. The computer system of claim 1, wherein training the MLS model using the vector and the log keys for each of the at least some of the log entries includes:

identifying each distinct log key from among the at least some of the log entries;

defining a respective class for each distinct log key to thereby form a set of classes; and

training the MLS model as a multi-class classifier using the set of classes, wherein the at least some of the log entries constitute a history of the execution path, and wherein the training produces a particular probability distribution over the history of the execution path.

8. The computer system of claim 1, wherein applying the MLS model to at least the particular portion includes:

sending a selected number of log entries to the MLS model, wherein all of the selected number of log entries appear in the execution path before the new log entry;

from the MLS model, receiving an output probability distribution that describes probabilities for a set of predicted log keys that are predicted to appear as a next log key in the execution path; and

flagging a new log key extracted from the new log entry as normal if the new log key is among a set of top candidates selected from the set of predicted log keys or, alternatively, flagging the new log key as abnormal if the new log key is not among the set of top candidates.

9. The computer system of claim 1, wherein the particular portion of the execution path includes a selected number of log entries constituting a history sequence length of the execution path, and wherein the MLS model is scalable to be applied to different history sequence lengths formed from different numbers of log entries.

10. The computer system of claim 1, wherein the MLS model is applied directly to log keys while refraining from counting a number of appearances for each distinct log key.

11. A method for operating a computing architecture that improves how anomalies are detected within a tracked execution path of an application, the method comprising:

for a log that includes a plurality of log entries, parsing each log entry in the plurality of log entries into a corresponding structured data sequence comprising a corresponding log key and a corresponding parameter set, wherein a combination of these structured data sequences represents an execution path of an application that is being tracked by the log;

generating a vector that includes (1) the corresponding parameter set for each of the log entries and (2) a set of time values indicating how much time elapsed between each adjacent log entry in the plurality of log entries;

training a machine learning sequential (MLS) model using the generated vector and log keys from each of at least some of the log entries, wherein the MLS model, after being trained, generates a probability distribution that, when applied to at least a portion of the execution path after that portion is modified by a newly arrived log entry, generates a probability indicating an extent to which the newly arrived log entry is normal or abnormal; and

after modifying a particular portion of the execution path to include a new log entry, applying the trained MLS model to at least the particular portion to generate a particular probability indicating a particular extent to which the new log entry is normal or abnormal.

12. The method of claim 11, wherein the log includes log entries obtained from multiple different threads.

13. The method of claim 11, wherein the new log entry is identified as being abnormal, and wherein a new log key extracted from the new log entry is identified as being new such that it was not included in a training corpus used to train the MLS model.

14. The method of claim 11, wherein the MLS model performs anomaly detection at a per log entry level.

15. The method of claim 11, wherein the method further includes classifying groups of log entries as belonging to different tasks executed on behalf of the application.

16. The method of claim 11, wherein the method further includes receiving user feedback indicating an acceptance of the particular probability and/or indication regarding whether the new log entry is normal or abnormal.

17. The method of claim 11, wherein the method further includes:

separating log entries that are included in the log and that are produced by concurrent tasks or threads into different sequences; and

constructing a workflow model for each of the different sequences.

18. One or more hardware storage devices having stored thereon computer- executable instructions that are executable by one or more processors of a computer system to cause the computer system to:

after modifying a particular portion of the execution path to include a new log entry, apply the trained MLS model to at least the particular portion to generate a particular probability indicating a particular extent to which the new log entry is normal or abnormal

19. The one or more hardware storage devices of claim 18, wherein generating the particular probability for the new log entry is performed in an online streaming manner.

20. The one or more hardware storage devices of claim 18, wherein the new log entry is indicated as being abnormal, and wherein semantic information of the new log entry is generated for diagnosis of the abnormal log entry.