WO2016093839A1

WO2016093839A1 - Structuring of semi-structured log messages

Info

Publication number: WO2016093839A1
Application number: PCT/US2014/069766
Authority: WO
Inventors: Igor Nor; Doron Shaked; Ron Maurer
Original assignee: Hewlett Packard Enterprise Development Lp
Priority date: 2014-12-11
Filing date: 2014-12-11
Publication date: 2016-06-16

Abstract

Automated structuring of semi-structured log messages is disclosed. One example is a system including a formatting engine to identify shared file formats for a plurality of semi-structured log messages. A representative message identifier identifies representative messages of the plurality of log messages based on the shared file formats. A message segmenter segments the representative messages where each segment corresponds to a message fragment that repeats in a sub-plurality of the log messages. A message similarity evaluator determines a similarity metric for a pair of representative messages, the similarity metric based on a weighted edit distance between segments of messages in the pair of representative messages. A structured message builder converts each representative message to a structured message comprising a string of tokens, the converting based on the similarity metric. A data analytics portal provides the structured messages for operations analytics.

Description

STRUCTURING OF SEMI-STRUCTURED LOG MESSAGES

Background

[0001] Operations analytics are routinely performed on operations data. Operations analytics includes management of complex systems, infrastructure and devices. Complex and distributed data systems are monitored at regular intervals to maximize their performance, and detected anomalies are utilized to quickly resolve problems. In operations related to information technology, data analytics are used to understand log messages, and search for patterns and trends in telemetry signals that may have semantic operational meanings.

Brief Description of the Drawings

[0002] Figure 1A is a functional block diagram illustrating an example of a system for automated structuring of semi-structured log messages.

[0003] Figure 1B is another functional block diagram illustrating an example of a system for automated structuring of semi-structured log messages.

[0004] Figure 2 is a flow diagram illustrating an example of a method for determining a cluster of representative messages based on an agglomerative hierarchical clustering.

[0005] Figure 3 illustrates examples of message types.

[0006] Figures 4A-4C illustrate an example of determining regular expressions and typed structured log messages.

[0007] Figure 5 is a block diagram illustrating an example of a processing system for implementing the system for automated structuring of semi- structured log messages.

[0008] Figure 6 is a block diagram illustrating an example of a computer readable medium for automated structuring of semi-structured log messages.

[0009] Figure 7 is a flow diagram illustrating an example of a method for automated structuring of semi-structured log messages.

Detailed Description

[0010] Operational analytics relates to analysis of operations data, related to, for example, events, logs, and so forth. Various performance metrics may be generated by the operational analytics, and operations management may be performed based on such performance metrics. Operations analytics is vastly important and spans management of complex systems, infrastructure and devices. It is also interesting because relevant analytics are generally limited to anomaly detection and pattern detection. The anomalies are generally related to operations insight, and patterns are indicative of underlying semantic processes that may serve as potential sources of significant semantic anomalies. Generally, analytics is primarily used in IT operations ("ITO") for understanding unstructured or semi-structured log messages and for detecting patterns and trends in telemetry signals that may have semantic operational meanings. Many ITO analytic platforms focus on data collection and transformation, and on analytic execution.

[0011] In a big data scenario, the sheer volume of data often negatively impacts processing of query-based analytics. One of the biggest problems in big data analysis is that of formulating the right query. However, before appropriate analyses may be performed, whether it be query-based analytics or non-query-based automated analytics, it is often preferable to transform raw data into an appropriate format. Accordingly, there is an overwhelming need for tools that help transform or pre-process data automatically so that data analytics may be performed more robustly. Therefore, in the context of operational data, it is important to provide for automated structuring of semi-structured data, such as, for example, log messages.

[0012] As disclosed herein, log messages may be analyzed for latent structure, and transformed into a concise set of structured log message types and parameters. The result is a normalized structured dataset ready for statistical analysis. Log entries may be considered to be signals from an operational system and could, theoretically, be processed as signals that are emitted by the logging software. Log signals may also be processed based on log files where the signals may have been persisted. Such processing may be performed in two different flows: offline analysis and online processing. Most of the work is in the offline analysis where a format of the log entries may be determined, and then, for each distinct format, similar log messages may be clustered together.

[0013] As described in various examples herein, automated structuring of semi-structured log messages is disclosed. One example is a system including a formatting engine, a representative message identifier, a message segmenter, a message similarity evaluator, a structured message builder, and a data analytics portal. The formatting engine identifies shared file formats for a plurality of semi-structured log messages. The representative message identifier identifies representative messages of the plurality of log messages based on the shared file formats. The message segmenter segments the representative messages, where each segment corresponds to a message fragment that repeats in a sub-plurality of the log messages. The message similarity evaluator determines a similarity metric for a pair of representative messages, the similarity metric based on a weighted edit distance between segments of messages in the pair of representative messages. The structured message builder converts each representative message to a structured message comprising a string of tokens, the converting based on the similarity metric. The data analytics portal provides the structured messages for operations analytics.

[0014] In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific examples in which the disclosure may be practiced. It is to be understood that other examples may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims. It is to be understood that features of the various examples described herein may be combined, in part or whole, with each other, unless specifically noted otherwise.

[0015] Figure 1A is a functional block diagram illustrating an example of a system for automated structuring of semi-structured log messages. The examples described herein provide an overview of system 100. A more detailed description is provided in Figure 1B. System 100 is shown to include a formatting engine 104, a representative message identifier 106, a message segmenter 108, a message similarity evaluator 110, a structured message builder 112, and a data analytics portal 120.

[0016] The term "system" may be used to refer to a single computing device or multiple computing devices that communicate with each other (e.g. via a network) and operate together to provide a unified service. In some examples, the components of system 100 may communicate with one another over a network. As described herein, the network may be any wired or wireless network, and may include any number of hubs, routers, switches, cell towers, and so forth. Such a network may be, for example, part of a cellular network, part of the internet, part of an intranet, and/or any other type of network.

[0017] The components of system 100 may be computing resources, each including a suitable combination of a physical computing device, a virtual computing device, a network, software, a cloud infrastructure, a hybrid cloud infrastructure that includes a first cloud infrastructure and a second cloud infrastructure that is different from the first cloud infrastructure, and so forth. The components of system 100 may be a combination of hardware and programming for performing a designated function. In some instances, each component may include a processor and a memory, while programming code is stored on that memory and executable by a processor to perform a designated function.

[0018] The computing device may be, for example, a web-based server, a local area network server, a cloud-based server, a notebook computer, a desktop computer, an all-in-one system, a tablet computing device, a mobile phone, an electronic book reader, or any other electronic device suitable for provisioning a computing resource to perform an automated structuring of semi-structured log messages. The computing device may include a processor and a computer-readable storage medium.

[0019] The system 100 identifies file formats for a plurality of semi-structured log messages. The system 100 identifies shared file formats for a plurality of semi-structured log messages. The system 100 identifies representative messages of the plurality of log messages based on the shared file formats. The system 100 segments the representative messages where each segment corresponds to a message fragment that repeats in a sub-plurality of the log messages. The system 100 determines a similarity metric for a pair of representative messages, the similarity metric based on a weighted edit distance between segments of messages in the pair of representative messages. The system 100 converts each representative message to a structured message comprising a string of tokens, the converting based on the similarity metric. The system 100 provides the structured messages for operations analytics.

[0020] In some examples, a formatting engine 104 receives a plurality of semi-structured log messages 102 related to a series of events, and identifies shared file formats 104A for the plurality of semi-structured log messages. In some examples, the formatting engine 104 formats the raw stream of log data into a stream of individual formatted log-messages. In some examples, the formatting engine 104 provides the shared file formats 104A to the representative message identifier 106.

[0021] In some examples, the representative message identifier 106 receives the shared file formats 104A from the formatting engine 104, and identifies representative messages 106A of the plurality of log messages based on the shared file formats 104A. In some examples, the representative message identifier 106 provides the representative messages 106A to the message segmenter 108. In some examples, the representative message identifier 106 provides the representative messages 106A to the structured message builder 112.

[0022] In some examples, the message segmenter 108 receives the representative messages 106A from the representative message identifier 106, and segments the representative messages 106A, where each segment corresponds to a message fragment that repeats in a sub-plurality of the log messages. In some examples, the message segmenter 108 generates segmented representative messages 108A based on the representative messages 106A. In some examples, the message segmenter 108 provides the segmented representative messages 108A to the message similarity evaluator 110. In some examples, the message segmenter 108 provides the representative messages 106A to the message similarity evaluator 110.

[0023] In some examples, the message similarity evaluator 110 receives the segmented representative messages 108A from the message segmenter 108, and determines a similarity metric 110A for a pair of representative messages 106A, the similarity metric 110A based on a weighted edit distance between segments of messages in the pair of representative messages 106A. As described herein, the similarity metric 110A defined for the segmented representative messages 108A may be utilized to define a similarity metric 110A defined for the representative messages 106A.

[0024] In some examples, the structured message builder 112 receives the similarity metric 110A from the message similarity evaluator 110, and receives the representative messages 106A from the representative message identifier 106. The structured message builder 112 converts each representative message 106A to a structured message 118A comprising a string of tokens, the converting based on the similarity metric 110A. In some examples, the structured message builder 112 provides the structured messages 118A to the data analytics portal 120, which in turn provides the structured messages 118A for operations analytics. In some examples, the data analytics portal 120 may receive feedback data 120A, and provide the feedback data 120A to the structured message builder 112. These and other examples will be described herein in more detail.

[0025] Figure 1B is a functional block diagram illustrating another example of a system for automated structuring of semi-structured log messages. For clarity of exposition, the components in Figure 1B may share numbering with respective components in Figure 1A. Generally, the components in Figures 1A and 1B may be different. Also, where applicable, references made to Figure 1A may apply to Figure 1B, and vice versa.

[0026] In some examples, the formatting engine 104 receives a plurality of semi-structured log messages 102 related to a series of events. In some examples, the formatting engine 104 formats the raw stream of log data into a stream of individual formatted log-messages. The plurality of semi-structured log messages 102 may be normalized in several ways. For example, a log analysis, and/or a signal analysis may be performed on the input data. Log messages may be analyzed for latent structure and transformed into a concise set of structured log message types and parameters. In some examples, each source of log messages may be pre-tagged for file format, where the file format may be identified in an offline stage. In some examples, system 100B includes a format type analyzer 105 to determine file format grouping. In some examples, the format type analyzer 105 may be a standalone component. In some examples, the formatting engine 104 may determine the file format grouping. In some examples, the format type analyzer 105 may be included in the formatting engine 104 as illustrated in Figure 1B.

[0027] Shared file formats 104A are groups of log messages that share common formats. The shared file formats 104A create a natural division of the input, mapping log files to their respective types so that the content analyzer does not need to check every message against all types. In some examples, file formats 104A may be combined, since short files often result in overly detailed formats identifying tokens that may be merely as frequent as format tokens. In some examples, the formatting engine 104 may group the log messages into groups that share a common file format. In some examples, the format type analyzer 105 may split the collection of log files into groups of common formats, or shared file formats 104A.

[0028] In some examples, the shared file formats 104A may be a corresponding stream of event types according to matching regular expressions. Log messages that do not match may define new regular expressions. Optionally, segments identified as part of the regular expression matching may be parsed as event parameters, or as separate events. These, and other features, are described herein.

[0029] In some examples, representative message identifier 106 identifies representative messages 106A based on shared file formats 104A. In some examples, the representative message identifier 106 groups messages that differ at most by values of variables. Such variables are represented by abstract typed tokens such as "NUMBER", "FILE_NAME", etc. Such a grouping may be realized efficiently by utilizing, for example, a compact trie data structure. In some examples, the representative message identifier 106 may select a representative message from each group of similar messages. By performing this analysis, representative message identifier 106 may reduce the total number of messages to be processed by system 100B from millions to thousands, which may be crucial from the performance point of view.
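The grouping of messages that differ only by variable values may be sketched as follows. This is a minimal illustration and not the disclosed implementation: the token patterns, the uncompressed trie, and the names `abstract_tokens`, `group_messages`, and `representatives` are assumptions introduced here, and the first message of each group is arbitrarily taken as its representative.

```python
import re

# Hypothetical token patterns; the description names only NUMBER and
# FILE_NAME as examples of abstract typed tokens.
TYPED_TOKENS = [
    ("NUMBER", re.compile(r"^[+-]?\d+(\.\d+)?$")),
    ("FILE_NAME", re.compile(r"^[\w./\\-]+\.\w{1,5}$")),
]


def abstract_tokens(message):
    """Replace variable-looking tokens with abstract typed tokens."""
    out = []
    for tok in message.split():
        for name, pattern in TYPED_TOKENS:
            if pattern.match(tok):
                out.append(name)
                break
        else:
            out.append(tok)
    return tuple(out)


class TrieNode:
    def __init__(self):
        self.children = {}
        self.messages = []  # raw messages whose abstracted form ends here


def group_messages(messages):
    """Insert each abstracted message into a trie keyed by token."""
    root = TrieNode()
    for msg in messages:
        node = root
        for tok in abstract_tokens(msg):
            node = node.children.setdefault(tok, TrieNode())
        node.messages.append(msg)
    return root


def representatives(root):
    """Yield (representative message, group size) for each group."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.messages:
            yield node.messages[0], len(node.messages)
        stack.extend(node.children.values())
```

In this sketch, "Read 12 bytes from app.log" and "Read 7 bytes from db.log" both abstract to "Read NUMBER bytes from FILE_NAME" and therefore fall into one group.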

[0030] In some examples, message segmenter 108 segments the representative messages 106A, based on a segmentation algorithm, to provide segmented representative messages 108A. A segment is a statistically meaningful message fragment repeating in multiple log messages. In some examples, message segmenter 108 may apply the segmentation algorithm in an incremental, on-line fashion. For example, for each new message, the segmentation algorithm may progress left-to-right, building segments one after the other. Each segment may be initialized to the first unsegmented token, and segments may be constructed by repeated steps of conditional addition of a next token, based on a number of statistical criteria related to the relative frequency of the segment candidates before and after extension. Once a segment-extension step is rejected, the message segmenter 108 may mark the segment, and the residual of the log-message may be processed in like manner.

[0031] In some examples, as the segmentation algorithm proceeds and segments are identified, the message segmenter 108 may compile and/or update a dictionary of previously identified segments. Accordingly, at any point of the processing, the message segmenter 108 may access the compiled dictionary of previously discovered segments, and each message may be represented as a list of segments taken from the compiled dictionary. In some examples, the segment identification step may be made computationally efficient via various data structures, such as, for example, generalized suffix-tree data structures.
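The left-to-right segment construction may be sketched as follows, under stated assumptions. The description does not specify the statistical criteria, so this sketch assumes two hypothetical ones: a minimum absolute count (`min_count`) and a minimum ratio between the frequency of the extended candidate and that of the current segment (`min_ratio`). A plain n-gram counter stands in for the suffix-tree structures mentioned above.

```python
from collections import Counter


def ngram_counts(messages, max_len=8):
    """Count every contiguous token run of up to max_len tokens."""
    counts = Counter()
    for tokens in messages:
        for i in range(len(tokens)):
            for j in range(i + 1, min(len(tokens), i + max_len) + 1):
                counts[tuple(tokens[i:j])] += 1
    return counts


def segment(tokens, counts, min_ratio=0.9, min_count=2):
    """Greedily split one tokenized message into segments, left to right."""
    segments = []
    i = 0
    while i < len(tokens):
        current = (tokens[i],)  # start at the first unsegmented token
        j = i + 1
        while j < len(tokens):
            candidate = current + (tokens[j],)
            c_cur, c_new = counts[current], counts[candidate]
            # Assumed criteria: the extension must be frequent enough and
            # must retain most of the occurrences of the current segment.
            if not (c_new >= min_count and c_cur > 0
                    and c_new / c_cur >= min_ratio):
                break  # extension rejected: the segment is marked here
            current = candidate
            j += 1
        segments.append(current)
        i = j  # continue with the residual of the message
    return segments


def segment_dictionary(messages, counts):
    """Compile a dictionary (segment -> occurrences) over all messages."""
    dictionary = Counter()
    for tokens in messages:
        dictionary.update(segment(tokens, counts))
    return dictionary
```

With these assumed thresholds, a message beginning "Starting monitor operation against" is kept as one segment whenever that run recurs intact across the corpus, while a trailing token that varies between messages becomes its own segment.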

[0032] System 100B includes a message similarity evaluator 110 to determine a similarity metric 110A for a pair of segmented representative messages 108A, the similarity metric based on a weighted edit distance between segments of the two messages in the pair of segmented representative messages 108A. In some examples, the message similarity evaluator 110 may augment segments with an associated weight of a segment relative to the message:

W (Segment | Message) =

where length(Segment) is a length of the segment, length(Message) is a length of the message in which the segment appears, frequency(Message) represents a number of times the Message occurs in the log messages, frequency(Segment) represents a number of times the Segment occurs in the log messages, and α, β, and γ are appropriately chosen parameters. The formula in Eqn. 1 prioritizes segments that are, for example, longer, at the front of the message and/or that are relatively unique for a given message.

[0033] Using the formula in Eqn. 1, message similarity evaluator 110 may determine a similarity metric 110A based on the segmented representative messages 108A. The similarity metric 110A measures an edit distance between two segmented representative messages 108A by a number of edit operations (for example, add, delete, substitute) needed to transform one message into the other, where each edit operation may be associated with a weight according to the weighted segments. Determining similarity metric 110A may be a complicated task from a complexity point of view. Various techniques may be utilized to accelerate determination of the metric, such as applying a diagonal method. Taking into consideration that the size of the message is defined by a number of segments rather than a number of tokens, this method provides relatively fast results.
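A weighted edit distance over segment sequences may be sketched with the standard dynamic-programming recurrence. This is an illustrative sketch only: the segment weights are supplied by the caller rather than computed from Eqn. 1, the choice of the larger of the two weights as the substitution cost is an assumption, and the diagonal acceleration mentioned above is not implemented.

```python
def weighted_edit_distance(a, b, weight):
    """Distance between segment sequences a and b.

    weight: callable mapping a segment to a non-negative cost.
    Add and delete cost the weight of the affected segment; substitute
    costs the larger of the two weights (an assumption of this sketch).
    """
    # prev[j] = cost of turning the processed prefix of a into b[:j]
    prev = [0.0]
    for seg_b in b:
        prev.append(prev[-1] + weight(seg_b))  # add every segment of b
    for seg_a in a:
        cur = [prev[0] + weight(seg_a)]  # delete every segment of a so far
        for j, seg_b in enumerate(b, start=1):
            if seg_a == seg_b:
                sub = prev[j - 1]  # identical segments cost nothing
            else:
                sub = prev[j - 1] + max(weight(seg_a), weight(seg_b))
            cur.append(min(
                sub,
                prev[j] + weight(seg_a),     # delete seg_a
                cur[j - 1] + weight(seg_b),  # add seg_b
            ))
        prev = cur
    return prev[-1]
```

Because the sequences consist of segments rather than individual tokens, the table stays small even for long log messages.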

[0034] System 100B includes a structured message builder 112 to convert each segmented representative message 108A to a structured message 118A, the structured message 118A comprising a string of tokens, and the converting based on the similarity metric 110A.

[0035] In some examples, structured message builder 112 may include components to perform various tasks in the conversion process. In some examples, the structured message builder 112 may include at least one of a message type analyzer 114, a regular expression builder 116, and a message type classifier 118. In some examples, the structured message builder 112 may perform the functions of the message type analyzer 114, the regular expression builder 116, and the message type classifier 118.

[0036] In some examples, the message type analyzer 114 determines clusters of representative messages 106A, the clusters based on the message similarity metric 110A. Generally, clusters are indicative of semantic similarity of the representative messages 106A. Clustering based on a given metric may be performed in various ways. In some examples, the message type analyzer 114 may determine the cluster of representative messages 106A based on an agglomerative hierarchical clustering that groups similar messages.

[0037] Figure 2 is a flow diagram illustrating an example of a method for determining a cluster of representative messages 106A based on an agglomerative hierarchical clustering. At 200, each representative message is initialized as a cluster. At 202, clusters having identical representative messages are merged. At 204, D-closest pair of clusters are iteratively merged using similarity metric D. At 206, it is determined if the D-closest pair is over a threshold. If it is, then the process ends at 208. If not, the process reverts to 204.
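The flow of Figure 2 may be sketched as follows. The description does not state how the distance between two multi-message clusters is derived from the metric D, so this sketch assumes single linkage (the smallest pairwise distance); the function name `cluster_messages` and its parameters are likewise hypothetical.

```python
from itertools import combinations


def cluster_messages(messages, distance, threshold):
    """Agglomerative clustering of hashable messages under metric `distance`."""
    # 200 and 202: one cluster per message, with identical messages merged.
    clusters = [[m] for m in dict.fromkeys(messages)]

    def linkage(c1, c2):
        # Assumed single linkage: the closest pair of members.
        return min(distance(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:
        # 204: find the D-closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        # 206 and 208: stop once even the closest pair is over the threshold.
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i].extend(clusters.pop(j))  # i < j, so index i stays valid
    return clusters
```

Here `distance` may be any callable on two messages, for example a weighted edit distance over their segment lists. The pairwise search is quadratic per merge, which is tolerable only because clustering operates on representative messages rather than on the raw log.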

[0038] Referring again to Figure 1B, in some examples, message type analyzer 114 identifies message types 114A, each message type 114A associated with a cluster of representative messages. Each merge operation, performed by the message type analyzer 114 to determine clusters, represents an edit distance operation.

[0039] Figure 3 illustrates examples of message types. A first message type 302 is illustrated as a cluster of two representative messages. Each representative message has a file format "Date Time [URL] ERROR General". A second message type 304 is illustrated as a cluster of two representative messages. Each representative message has a file format "Date Time [URL] Severity General". A third message type 306 is illustrated as a cluster of two representative messages. The first representative message in the third message type 306 has a file format "Date Time Segment₁₀₁ Segment₄₃ Segment Segment₃₀ Segment₁₇". The second representative message in the third message type 306 has a file format "Date Time Segment₁₀₁ Segment₄₃ Segment Segment₃₀". As illustrated, the two representative messages may be determined to be close in the similarity metric because they share identical segments, such as, Segment₁₀₁, Segment₄₃, and Segment₃₀, and those identical segments appear in similar order.

[0040] Accordingly, it is easy to generate a regular expression representation for each cluster. Referring to Figure 1B, eventual regular expressions represent message types 114A. In some examples, regular expression builder 116 generates regular expressions 116A based on results from the message type analyzer 114. Each such regular expression may be generated from the clustering process described herein, and may represent a message type.

[0041] In some examples, the structured message builder 112 may further include a message type classifier 118 to tag the representative messages based on the regular expressions. In some examples, the message type classifier 118 attributes to each log-message a message-type tag and transforms the original log-message stream into a stream of normalized data, in the form of structured messages 118A.

[0042] Figures 4A-4C illustrate an example of determining regular expressions and structured messages. Figure 4A illustrates a log file including log messages. Figure 4B illustrates regular expressions corresponding to some of the log messages in the log file of Figure 4A. Figure 4C illustrates a stream of typed structured log messages 118A generated from the log messages in the log file of Figure 4A by the message type classifier 118 based on the regular expressions 116A.

[0043] For example, a first message 400A in Figure 4A has a file format "Date Time [Number] HP.BI INFO - Starting monitor operation against data 'EDW Seaquest Production Database (EMR)'". A regular expression corresponding to this is illustrated as first expression 400B in Figure 4B: "<L> <D> <T> [<#>] <H> <S> - Starting monitor operation against data 'EDW Production Database ()'", where <L> denotes a newline, <D> denotes date, <T> denotes time, [<#>] denotes the number, <H> denotes "HP.BI", <S> denotes "INFO", denotes "Seaquest", and () denotes "(EMR)". A typed log message or normalized data corresponding to the first message 400A and the first expression 400B is illustrated as first normalized data 400C: "2013-07-16 04:54:55 <2>", where <2> is the class tag of the corresponding message "<Starting monitor operation against data 'EDW Production Database ()'>."

[0044] As another example, a second message 404A in Figure 4A has a file format "Date Time [Number] HP.BI INFO - Starting monitor operation against data 'EDW NeoView Production Database (PLATINUM)'". A regular expression corresponding to this is illustrated as first expression 400B in Figure 4B: "<L> <D> <T> [<#>] <H> <S> - Starting monitor operation against data 'EDW Production Database ()'", where <L> denotes a newline, <D> denotes date, <T> denotes time, [<#>] denotes the number, <H> denotes "HP.BI", <S> denotes "INFO", denotes "NeoView", and () denotes "(PLATINUM)". A structured log message or normalized data corresponding to the second message 404A and the first expression 400B is illustrated as second normalized data 404C: "2013-07-16 04:58:55 <2>", where <2> is the normalized data representing "<Starting monitor operation against data 'EDW Production Database ()'>."

[0045] Also, for example, a third message 402A in Figure 4A has a file format "Date Time [Number] HP.BI INFO - Monitor operation against data 'EDW Seaquest Production Database (EMR)' completed". A regular expression corresponding to this is illustrated as second expression 402B in Figure 4B: "<L> <D> <T> [<#>] <H> <S> - Monitor operation against data '<*P> ()' completed", where <L> denotes a newline, <D> denotes date, <T> denotes time, [<#>] denotes the number, <H> denotes "HP.BI", <S> denotes "INFO", <*P> denotes "EDW Seaquest Production Database", and () denotes "(EMR)". A structured log message or normalized data corresponding to the third message 402A and the second expression 402B is illustrated as third normalized data 402C: "2013-07-16 04:55:53 <1>", where <1> is the normalized data representing "<Monitor operation against data '<*P> ()' completed>."
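The tagging of log messages against per-type regular expressions, as in Figures 4A-4C, may be sketched in Python's `re` syntax. The patterns below are hand-written approximations of expressions 400B and 402B rather than output of the regular expression builder 116. The group names, the `classify` helper, and the bracketed number used in any sample line are assumptions, since the description does not give them.

```python
import re

# Shared prefix: date, time, bracketed number, host, severity.
PREFIX = (
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<num>\d+)\] (?P<host>\S+) (?P<sev>[A-Z]+) - "
)

# Message-type tag -> compiled pattern (tags follow the <1>/<2> example).
MESSAGE_TYPES = {
    1: re.compile(
        PREFIX + r"Monitor operation against data "
        r"'(?P<p>.+) \((?P<q>[^)]*)\)' completed$"
    ),
    2: re.compile(
        PREFIX + r"Starting monitor operation against data "
        r"'EDW (?P<p>\S+) Production Database \((?P<q>[^)]*)\)'$"
    ),
}


def classify(line):
    """Return (typed structured message, parameters), or (None, {})."""
    for tag, pattern in MESSAGE_TYPES.items():
        m = pattern.match(line)
        if m:
            return f"{m['date']} {m['time']} <{tag}>", m.groupdict()
    return None, {}
```

A line that matches no pattern returns `None`, which corresponds to the case in which a non-matching log message may define a new regular expression. The captured groups correspond to segments that may optionally be parsed as event parameters.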

[0046] Data analytics portal 120 provides the structured messages for operations analytics. In some examples, the structured messages 118A may be utilized to detect system anomalies and/or event patterns. In the context of operational data, it may be important to provide an interface that may be utilized by operational investigations to easily formulate and solve operational issues. For example, the data analytics portal 120 may be communicatively linked to an anomaly processor (not shown in the figures). The anomaly processor may detect presence or absence of a system anomaly in the plurality of semi-structured log messages, the system anomalies indicative of rare and unexpected events, and the detecting based on a distribution of data elements of the input data. Whereas a system anomaly is generally related to insight into operational data, event patterns indicate underlying semantic processes that may serve as potential sources of significant semantic anomalies.

[0047] In some examples, the data analytics portal 120 may be communicatively linked to a pattern processor (not shown in the figures). The pattern processor may detect presence or absence of an event pattern in the plurality of semi-structured log messages. Generally, the pattern processor identifies non-coincidental situations, usually events occurring simultaneously. Patterns may be characterized by their unlikely random reappearance. For example, a single co-occurrence in 100 may be somewhat likely, but 90 co-occurrences in 100 is much less likely.
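The likelihood argument above may be made concrete with a binomial tail probability. The model is an assumption of this sketch and is not stated in the description: two unrelated events are taken to coincide independently in each of n observation windows with a hypothetical chance probability p.

```python
from math import comb


def chance_of_at_least(k, n, p):
    """Probability of k or more chance co-occurrences in n windows."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical chance probability of 1% per window, with n = 100 windows.
p_single = chance_of_at_least(1, 100, 0.01)   # at least one co-occurrence
p_ninety = chance_of_at_least(90, 100, 0.01)  # ninety or more co-occurrences
```

Under these assumed numbers, at least one co-occurrence arises by chance in well over half of all cases, whereas ninety or more is vanishingly improbable, so such a reappearance is better explained by an underlying pattern.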

[0048] In some examples, the data analytics portal 120 may receive feedback data 120A from, for example, an interactive graphical user interface. For example, the output may be a corresponding stream of event types according to matching regular expressions as determined herein. In some examples, the data analytics portal 120 may identify, based on feedback data 120A, a log message that does not match a regular expression representation. In some examples, the data analytics portal 120 may provide feedback data 120A to the structured message builder 112. Based on such feedback data 120A, the structured message builder 112 may modify the structured messages 118A. For example, the regular expression builder 116 may identify the non-matching log message as a new regular expression representation. Also, for example, the data analytics portal 120 may provide feedback data 120A that segments identified as part of the regular expression matching may be parsed as event parameters. Based on such feedback data 120A, the structured message builder 112 may parse the segments identified as part of the regular expression matching as event parameters.

[0049] Feedback data 120A may include feedback related to domain relevance, received via an interactive graphical user interface and processed by the interaction processor. The feedback data 120A may be indicative of selection or non-selection of a portion of the interactive graphical user interface.

[0050] In some examples, the data analytics portal 120 may be communicatively linked to a word cloud generator (not shown in the figures) that generates a word cloud based on the structured messages 118A. A word cloud is a visual representation of a plurality of words highlighting words based on a relevance of the word in a given context. For example, a word cloud may comprise words that appear in log messages associated with the selected system anomaly. In some examples, the interactive graphical user interface may display the word cloud. A word cloud highlights words that appear in anomalous messages more than in the rest of the messages. In some examples, relevance of a word may be illustrated by its relative font size in the word cloud. For example, key terms may appear in log messages associated with the system anomaly more frequently than in the rest of the log messages. Accordingly, such key terms may be highlighted in the word cloud. Highlighting may be achieved via a distinctive font, font size, color, and so forth. In some examples, term scores may be determined for key terms, the term scores based on a modified inverse domain frequency. In some examples, the modified inverse domain frequency may be based on an information gain or a Kullback-Leibler Divergence.
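One way to score terms in the spirit of a Kullback-Leibler divergence may be sketched as follows. The description does not define the modified inverse domain frequency, so this is not that formula: the sketch assumes each term is scored by its pointwise contribution p·log(p/q), where p and q are add-one-smoothed relative frequencies of the term in the anomalous messages and in the remaining messages. The function name `term_scores` is hypothetical.

```python
import math
from collections import Counter


def term_scores(anomalous, rest):
    """Score terms by how over-represented they are in anomalous messages."""
    a = Counter(w for msg in anomalous for w in msg.split())
    b = Counter(w for msg in rest for w in msg.split())
    vocab = set(a) | set(b)
    # Add-one smoothing keeps both probabilities strictly positive.
    total_a = sum(a.values()) + len(vocab)
    total_b = sum(b.values()) + len(vocab)
    scores = {}
    for term in vocab:
        p = (a[term] + 1) / total_a
        q = (b[term] + 1) / total_b
        scores[term] = p * math.log(p / q)
    return scores
```

Terms with the highest scores appear disproportionately in the anomalous messages and could be rendered with a larger font size in the word cloud, while terms that are more common elsewhere receive negative scores.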

[0051] As described herein, aspects of Figure 1B may include optional components and/or functionalities. In some examples, the components and functionalities illustrated with dashed lines may be optional. For example, the format type analyzer 105 may be an optional component of system 100B. Also, for example, the message type analyzer 114, the regular expression builder 116, and/or the message type classifier 118 may be optional components of system 100B.

[0052] Figure 5 is a block diagram illustrating an example of a processing system 500 for implementing the system 100 and/or system 100B for automated structuring of semi-structured log messages. Processing system 500 includes a processor 502, a memory 504, input devices 518, and output devices 520. Processor 502, memory 504, input devices 518, and output devices 520 are coupled to each other through a communication link (e.g., a bus).

[0032] Processor 502 includes a Central Processing Unit (CPU) or another suitable processor. In some examples, memory 504 stores machine readable instructions executed by processor 502 for operating processing system 500. Memory 504 includes any suitable combination of volatile and/or non-volatile memory, such as combinations of Random Access Memory (RAM), Read-Only Memory (ROM), flash memory, and/or other suitable memory.

[0033] The semi-structured log messages 522 may be messages related to a series of events. Memory 504 also stores instructions to be executed by processor 502 including instructions for a formatting engine 506, a representative message identifier 508, a message segmenter 510, a message similarity evaluator 512, a structured message builder 514, and a data analytics portal 516. In some examples, formatting engine 506, representative message identifier 508, message segmenter 510, message similarity evaluator 512, structured message builder 514, and data analytics portal 516 include formatting engine 104, representative message identifier 106, message segmenter 108, message similarity evaluator 110, structured message builder 112, and data analytics portal 120, respectively, as previously described and illustrated with reference to Figure 1B.

[0034] Processor 502 executes instructions of formatting engine 506 to identify shared file formats for a plurality of semi-structured log messages. In some examples, processor 502 executes instructions of formatting engine 506 to identify file format groupings. In some examples, processor 502 executes instructions of formatting engine 506 to group the log messages into groups that share a common file format. In some examples, processor 502 executes instructions of formatting engine 506 to split the collection of log files into groups of common formats.

[0035] Processor 502 executes instructions of representative message identifier 508 to identify representative messages of the plurality of log messages based on the shared file formats. In some examples, processor 502 executes instructions of representative message identifier 508 to group messages that differ at most by values of variables. Such a grouping may be realized efficiently by utilizing, for example, a compact trie data structure. In some examples, processor 502 executes instructions of representative message identifier 508 to select a representative message from each group of similar messages.
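
The grouping of messages that differ at most by values of variables may be sketched, for example, with a token-level trie. This is a simplified illustration and not the compact trie of the disclosure: the placeholder `VARIABLE`, the rule that any token containing a digit is a variable value, and the function names are all assumptions.

```python
import re

VARIABLE = "<*>"   # hypothetical placeholder for a variable value
END = object()     # sentinel key marking the end of a message in the trie

def mask(token):
    # Assumption: any token containing a digit is a variable value.
    return VARIABLE if re.search(r"\d", token) else token

def group_by_template(messages):
    """Insert masked token sequences into a nested-dict trie.
    Messages that end at the same node form one group."""
    trie = {}
    for msg in messages:
        node = trie
        for token in msg.split():
            node = node.setdefault(mask(token), {})
        node.setdefault(END, []).append(msg)
    return trie

def representatives(trie):
    """Yield the first message stored in each group."""
    stack = [trie]
    while stack:
        node = stack.pop()
        for key, child in node.items():
            if key is END:
                yield child[0]
            else:
                stack.append(child)

# Hypothetical messages, for illustration only.
logs = ["Connection from 10.0.0.1 closed",
        "Connection from 10.0.0.2 closed",
        "Service started"]
reps = sorted(representatives(group_by_template(logs)))
```

Here the two connection messages share one path through the trie, so only the first of them is kept as a representative message.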

[0036] Processor 502 executes instructions of message segmenter 510 to segment the representative messages wherein each segment corresponds to a message fragment that repeats in a sub-plurality of the log messages. In some examples, processor 502 executes instructions of message segmenter 510 to run a segmentation algorithm. In some examples, processor 502 executes instructions of message segmenter 510 to apply the segmentation algorithm in an incremental, on-line fashion. In some examples, processor 502 executes instructions of message segmenter 510 to compile and/or update a dictionary of previously identified segments.

[0037] Processor 502 executes instructions of a message similarity evaluator 512 to determine a similarity metric for a pair of representative messages, the similarity metric based on a weighted edit distance between segments of messages in the pair of representative messages. In some examples, processor 502 executes instructions of message similarity evaluator 512 to augment segments with an associated weight of a segment relative to the message, for example, utilizing Eqn. 1.
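
A weighted edit distance between segmented messages may be sketched as below. The segment weights of Eqn. 1 are taken as given inputs, and the cost model is an assumption made for this sketch: inserting or deleting a segment costs its weight, and substituting one segment for a different one costs the larger of the two weights. The normalization in `similarity` is likewise hypothetical.

```python
def weighted_edit_distance(a, b):
    """a and b are lists of (segment, weight) pairs."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):            # delete all segments of a
        d[i][0] = d[i - 1][0] + a[i - 1][1]
    for j in range(1, cols):            # insert all segments of b
        d[0][j] = d[0][j - 1] + b[j - 1][1]
    for i in range(1, rows):
        for j in range(1, cols):
            (seg_a, w_a), (seg_b, w_b) = a[i - 1], b[j - 1]
            substitute = 0.0 if seg_a == seg_b else max(w_a, w_b)
            d[i][j] = min(d[i - 1][j] + w_a,            # deletion
                          d[i][j - 1] + w_b,            # insertion
                          d[i - 1][j - 1] + substitute) # match or substitution
    return d[-1][-1]

def similarity(a, b):
    """Map the distance into [0, 1]; 1.0 means identical segment strings."""
    total = sum(w for _, w in a) + sum(w for _, w in b)
    return 1.0 if total == 0 else 1.0 - weighted_edit_distance(a, b) / total

# Hypothetical segmented messages with illustrative weights.
m1 = [("User", 0.4), ("alice", 0.1), ("logged in", 0.5)]
m2 = [("User", 0.4), ("bob", 0.1), ("logged in", 0.5)]
m3 = [("Disk", 0.5), ("full", 0.5)]
```

With these inputs, `m1` and `m2` differ only in a low-weight segment and are therefore scored as more similar than `m1` and `m3`.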

[0038] Processor 502 executes instructions of a structured message builder 514 to convert each representative message to a structured message comprising a string of tokens, the converting based on the similarity metric. In some examples, processor 502 executes instructions of structured message builder 514 to determine clusters of representative messages, the clusters based on the message similarity metric. In some examples, processor 502 executes instructions of structured message builder 514 to determine clusters based on an agglomerative hierarchical clustering that groups similar messages. In some examples, processor 502 executes instructions of structured message builder 514 to identify message types, each message type associated with a cluster of representative messages.
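
An agglomerative hierarchical clustering of representative messages may be sketched as follows. The average-linkage criterion, the stopping threshold, and the stand-in distance `token_distance` are assumptions chosen for illustration; any distance derived from the similarity metric could be passed in instead.

```python
def agglomerative_clusters(items, distance, threshold):
    """Repeatedly merge the two closest clusters (average linkage)
    until the closest pair is farther apart than the threshold."""
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        total = sum(distance(items[i], items[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [[items[i] for i in c] for c in clusters]

def token_distance(a, b):
    # Stand-in distance (assumption): fraction of aligned tokens that differ.
    ta, tb = a.split(), b.split()
    same = sum(x == y for x, y in zip(ta, tb))
    return 1.0 - same / max(len(ta), len(tb), 1)

# Hypothetical representative messages, for illustration only.
msgs = ["User alice logged in", "User bob logged in",
        "Disk sda1 is full", "Disk sdb2 is full"]
message_types = agglomerative_clusters(msgs, token_distance, threshold=0.5)
```

Each resulting cluster corresponds to one message type; in this example the login messages and the disk messages form two separate clusters.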

[0039] Processor 502 executes instructions of structured message builder 514 to structure the representative messages based on the message types. In some examples, processor 502 executes instructions of structured message builder 514 to build regular expressions based on the message types. In some examples, processor 502 executes instructions of the structured message builder 514 to determine if a log message does not match a regular expression representation, and if it does not match, then the log message is identified as a new regular expression representation. In some examples, processor 502 executes instructions of the structured message builder 514 to tag the representative messages based on the regular expressions.
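
Building a regular expression from a message type may be sketched as below. The sketch assumes, as a simplification not stated in the disclosure, that all messages in a cluster have the same number of whitespace-separated tokens; token positions whose values vary across the cluster become capture groups. The function name `build_regex` is hypothetical.

```python
import re

def build_regex(cluster):
    """Derive one regular expression from a cluster of similar messages."""
    token_rows = [msg.split() for msg in cluster]
    if len({len(row) for row in token_rows}) != 1:
        raise ValueError("sketch assumes a non-empty cluster with equal token counts")
    parts = []
    for column in zip(*token_rows):
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))  # constant token
        else:
            parts.append(r"(\S+)")              # variable token
    return re.compile(r"\s+".join(parts))

# Hypothetical cluster, for illustration only.
pattern = build_regex(["User alice logged in", "User bob logged in"])
match = pattern.fullmatch("User carol logged in")
```

The captured group of a successful match holds the variable value, and a message for which `fullmatch` returns `None` is a candidate for a new regular expression representation.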

[0040] Processor 502 executes instructions of a data analytics portal 516 to provide the structured messages for operations analytics. In some examples, processor 502 executes instructions of the data analytics portal 516 to communicatively link the data analytics portal 516 to at least one of an anomaly processor, a pattern processor, and a word cloud generator. In some examples, processor 502 executes instructions of the data analytics portal 516 to communicatively link the data analytics portal 516 to an anomaly processor that detects presence or absence of a system anomaly in the plurality of semi-structured log messages, the system anomaly indicative of a rare event that is distant from a norm of a distribution based on the series of events. In some examples, processor 502 executes instructions of the data analytics portal 516 to communicatively link the data analytics portal 516 to a pattern processor that detects presence or absence of an event pattern in the plurality of semi-structured log messages. In some examples, processor 502 executes instructions of the data analytics portal 516 to communicatively link the data analytics portal 516 to a word cloud generator that generates a word cloud based on the structured messages.

[0053] Input devices 518 include a keyboard, mouse, data ports, and/or other suitable devices for inputting information into processing system 500. In some examples, input devices 518 are used by the data analytics portal 516 to interact with an interactive graphical user interface. Output devices 520 include a monitor, speakers, data ports, and/or other suitable devices for outputting information from processing system 500. In some examples, output devices 520 are used to provide an interactive visual representation of the system anomalies, event patterns, and the word cloud.

[0054] Figure 6 is a block diagram illustrating an example of a computer readable medium for automated structuring of semi-structured log messages. Processing system 600 includes a processor 602, a computer readable medium 614, a formatting engine 604, a representative message identifier 606, a message segmenter 608, a message similarity evaluator 610, and a structured message builder 612. Processor 602, computer readable medium 614, formatting engine 604, representative message identifier 606, message segmenter 608, message similarity evaluator 610, and structured message builder 612 are coupled to each other through a communication link (e.g., a bus).

[0055] Processor 602 executes instructions included in the computer readable medium 614. Computer readable medium 614 includes shared file format identification instructions 616 of a formatting engine 604 to identify shared file formats for a plurality of semi-structured log messages 628. Computer readable medium 614 includes representative message identification instructions 618 of a representative message identifier 606 to identify representative messages of the plurality of log messages based on the shared file formats. Computer readable medium 614 includes representative message segmenting instructions 620 of a message segmenter 608 to segment the representative messages, each segment corresponding to a message fragment that repeats in a sub-plurality of the log messages.

[0056] Computer readable medium 614 includes similarity metric determination instructions 622 of a message similarity evaluator 610 to determine a similarity metric for a pair of representative messages, the similarity metric based on a weighted edit distance between segments of messages in the pair of representative messages.

[0057] Computer readable medium 614 includes clustering instructions 624 of a structured message builder 612 to cluster the representative messages based on the similarity metric, the clustering indicative of relative semantic similarity of the segmented messages. In some examples, computer readable medium 614 includes clustering instructions 624 of a structured message builder 612 to perform an agglomerative hierarchical clustering that groups similar messages.

[0058] Computer readable medium 614 includes conversion instructions 626 of a structured message builder 612 to convert each representative message to a structured message comprising a string of tokens, the converting based on the similarity metric.

[0059] In some examples, computer readable medium 614 includes providing instructions of a data analytics portal to provide the structured messages for operations analytics.

[0060] As used herein, a "computer readable medium" may be any electronic, magnetic, optical, or other physical storage apparatus to contain or store information such as executable instructions, data, and the like. For example, any computer readable storage medium described herein may be any of Random Access Memory (RAM), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, and the like, or a combination thereof. For example, the computer readable medium 614 can include one of or multiple different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices.

[0061] As described herein, various components of the processing system 600 are identified and refer to a combination of hardware and programming configured to perform a designated function. As illustrated in Figure 6, the programming may be processor executable instructions stored on tangible computer readable medium 614, and the hardware may include processor 602 for executing those instructions. Thus, computer readable medium 614 may store program instructions that, when executed by processor 602, implement the various components of the processing system 600.

[0062] Such computer readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.

[0063] Computer readable medium 614 may be any of a number of memory components capable of storing instructions that can be executed by processor 602. Computer readable medium 614 may be non-transitory in the sense that it does not encompass a transitory signal but instead is made up of one or more memory components configured to store the relevant instructions. Computer readable medium 614 may be implemented in a single device or distributed across devices. Likewise, processor 602 represents any number of processors capable of executing instructions stored by computer readable medium 614. Processor 602 may be integrated in a single device or distributed across devices. Further, computer readable medium 614 may be fully or partially integrated in the same device as processor 602 (as illustrated), or it may be separate but accessible to that device and processor 602. In some examples, computer readable medium 614 may be a machine-readable storage medium.

[0064] Figure 7 is a flow diagram illustrating an example of a method for automated structuring of semi-structured log messages. At 700, representative messages of the plurality of log messages may be identified based on shared file formats. At 702, segmented messages may be identified based on the shared file formats, each segment corresponding to a message fragment that repeats in a sub-plurality of the log messages. At 704, a similarity metric may be determined for a pair of representative messages, the similarity metric based on a weighted edit distance between segments of messages in the pair of representative messages. At 706, the representative messages may be clustered based on the similarity metric, the clustering indicative of relative semantic similarity of the representative messages. At 708, each representative message may be converted into a structured message based on the clustering. At 710, the structured messages may be provided to a data analytics portal for operations analytics.

[0065] In some examples, identifying the representative messages may include identifying the shared file formats.

[0066] In some examples, converting each segmented message may include converting each representative message to a structured message comprising a string of tokens.

[0067] In some examples, clustering the segmented messages may include determining message types based on an agglomerative hierarchical clustering that groups similar messages.

[0068] In some examples, the data analytics portal may be communicatively linked to at least one of an anomaly processor, a pattern processor, and a word cloud generator.

[0069] Examples of the disclosure provide a generalized system for automated structuring of semi-structured log messages. Although specific examples have been illustrated and described herein, the examples illustrate applications to structured data. Accordingly, there may be a variety of alternate and/or equivalent implementations that may be substituted for the specific examples shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific examples discussed herein. Therefore, it is intended that this disclosure be limited only by the claims and the equivalents thereof.

Claims

1. A system comprising:

a formatting engine to identify shared file formats for a plurality of semi-structured log messages;

a representative message identifier to identify representative messages of the plurality of log messages based on the shared file formats;

a message segmenter to segment the representative messages, wherein each segment corresponds to a message fragment that repeats in a sub-plurality of the log messages;

a message similarity evaluator to determine a similarity metric for a pair of representative messages, the similarity metric based on a weighted edit distance between segments of messages in the pair of representative messages;

a structured message builder to convert each representative message to a structured message comprising a string of tokens, the converting based on the similarity metric; and

a data analytics portal to provide the structured messages for operations analytics.

2. The system of claim 1, wherein the formatting engine further includes a format type analyzer to identify file format groupings.

3. The system of claim 1, wherein the structured message builder further includes a message type analyzer to identify message types based on clusters of similar messages.

4. The system of claim 3, wherein the clusters of similar messages are identified based on an agglomerative hierarchical clustering.

5. The system of claim 3, wherein the structured message builder further includes a regular expression builder to build regular expressions based on the message types.

6. The system of claim 5, wherein the structured message builder further includes a message type classifier to tag the representative messages based on the regular expressions.

7. The system of claim 1, wherein the data analytics portal is communicatively linked to at least one of an anomaly processor, a pattern processor, and a word cloud generator.

8. A method to perform operations analytics based on a plurality of semi-structured log messages, the method comprising:

identifying representative messages of the plurality of log messages based on shared file formats;

identifying segmented messages based on the shared file formats, each segment corresponding to a message fragment that repeats in a sub-plurality of the log messages;

determining a similarity metric for a pair of representative messages, the similarity metric based on a weighted edit distance between segments of messages in the pair of representative messages;

clustering the representative messages based on the similarity metric, the clustering indicative of relative semantic similarity of the representative messages;

converting each representative message into a structured message based on the clustering; and

providing the structured messages to a data analytics portal for operations analytics.

9. The method of claim 8, wherein identifying the representative messages includes identifying the shared file formats.

10. The method of claim 8, wherein converting each segmented message includes converting each representative message to a structured message comprising a string of tokens.

11. The method of claim 8, wherein clustering the segmented messages includes determining message types based on an agglomerative hierarchical clustering that groups similar messages.

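12. The method of claim 8, wherein the data analytics portal is communicatively linked to at least one of an anomaly processor, a pattern processor, and a word cloud generator.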

13. A non-transitory computer readable medium comprising executable instructions to:

identify shared file formats for a plurality of semi-structured log messages;

identify representative messages based on the shared file formats;

segment the representative messages, each segment corresponding to a message fragment that repeats in a sub-plurality of the log messages;

determine a similarity metric for a pair of representative messages, the similarity metric based on a weighted edit distance between segments of messages in the pair of representative messages;

cluster the representative messages based on the similarity metric, the clustering indicative of relative semantic similarity of the segmented messages; and

convert each representative message to a structured message comprising a string of tokens, the converting based on the similarity metric.

14. The non-transitory computer readable medium of claim 13, further including instructions to provide the structured messages for operations analytics.

15. The non-transitory computer readable medium of claim 13, wherein the instructions to cluster the segmented messages further include instructions to perform an agglomerative hierarchical clustering that groups similar messages.