CN116341513A - Multi-source heterogeneous log data analysis method based on semantic enhancement

Info

Publication number: CN116341513A
Application number: CN202310271716.4A
Authority: CN
Inventors: 周娜; 刘晓光; 王刚
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2023-03-20
Filing date: 2023-03-20
Publication date: 2023-06-27

Abstract

The invention discloses a multi-source heterogeneous log data analysis method based on semantic enhancement. First, heterogeneous log data is preprocessed by regular-expression matching: preset regular expressions match common variables, and each variable is replaced by a word that corresponds one-to-one to its semantics, so that the important data in each log statement is retained in a uniform way. Next, a template tree structure is defined and a template tree is constructed; fixing the height of the template tree saves construction and search time, and having the nodes of each layer carry corresponding information reduces the time required for template matching. Finally, templates are split and merged, which further improves the accuracy of the log analysis result.

Description

Multi-source heterogeneous log data analysis method based on semantic enhancement

Technical Field

The invention belongs to the technical field of data analysis, and particularly relates to a multi-source heterogeneous log data analysis method based on semantic enhancement.

Background

Multi-source heterogeneous log data is unstructured and comes in many varieties. It records the operation of a multi-source heterogeneous system in detail, and can therefore help operation and maintenance personnel monitor the system state and detect system anomalies. As computer systems are upgraded, approaches that rely on traditional matching rules and manual inspection are no longer applicable. Multi-source heterogeneous log data analysis is an essential link in multi-source heterogeneous log anomaly detection, and the accuracy of log analysis directly affects the accuracy of anomaly detection. There is therefore a need for a multi-source heterogeneous log data parsing method that converts unstructured multi-source heterogeneous log data into a structured form and prepares the data for the subsequent anomaly detection steps.

Multi-source heterogeneous log data parsing is the process of transforming log data from unstructured to structured form while obtaining log template information. Existing multi-source heterogeneous log data analysis methods mainly obtain log templates by similarity measurement, but the variable information in the log is often ignored when templates are merged on the basis of similarity, which reduces the accuracy of log analysis.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art and provides a multi-source heterogeneous log data analysis method based on semantic enhancement.

The invention is realized by the following technical scheme:

a multi-source heterogeneous log data analysis method based on semantic enhancement comprises the following steps:

step 1, preprocessing heterogeneous log data by regular-expression matching, wherein regular expressions are preset to match common variables, and each variable is replaced by a word that corresponds one-to-one to its semantics;

step 2, defining a template tree structure and constructing a template tree;

step 2.1, defining a template tree structure, wherein a first layer of the template tree only stores one root node without data information; the second layer is a length node, and the data stored by the node is the number of words in the log statement after regular matching; the third layer is a prefix node, the data stored by the node is a prefix expression of the log statement after regular matching, the prefix expression consists of the first n/2 words of the log statement, and n is the total number of words contained in the log statement; the fourth layer is a leaf node, the data stored by the node is log cluster information, and the log cluster comprises m log templates;

step 2.2, constructing a template tree according to the template tree structure defined in the step 2.1, and comprising the following steps:

step 2.21: searching or creating a second-layer length node of the template tree according to the log statement length after regular matching;

step 2.22: searching or creating a third layer prefix node of the template tree according to the prefix expression of the log statement after regular matching;

step 2.23: judging whether the log statement is matched with the log template information according to the first three layers of node information, and if the log statement is successfully matched, adding the log statement into a log template set of a log cluster; if the matching fails, creating a log cluster containing log template information based on the target log statement, and adding the log cluster to the leaf node;

step 3, splitting and merging templates; for template merging, log templates of the same log cluster are merged by wildcard substitution; for template splitting, word2vec is used to represent the word vectors of log statements, the similarity of log statements within the same template is then calculated using the Pearson linear correlation coefficient, and if the similarity is smaller than 0, the original template is split.

In the above technical solution, in step 1, the variables to be replaced include: IP variables, numeric variables, and time variables.

In the above technical solution, step 1 further includes: locating all special characters in the original log data using a regular expression, replacing the special characters with a single space; the number of consecutive spaces is reduced to one and the number of consecutive identical replacement words is reduced to one.

In the above technical solution, in step 2.21, the second-layer nodes of the template tree are traversed according to the number of words in the target log statement; if the search succeeds, the length-node matching operation is complete; if the search fails, a new length node is created according to the number of words.

In the above technical solution, in step 2.22, the third-layer nodes of the template tree are traversed according to the prefix expression of the target log statement, and a similarity calculation determines whether a node matches; if a match is found, the search succeeds; if not, a new prefix node is created according to the prefix expression.

In the above technical solution, in step 2, the similarity between the prefix node and the log template is calculated using the edit distance.

In the above technical solution, in step 3, the Pearson linear correlation coefficient formula is as follows:

$$\mathrm{Pearson} = \frac{\sum_{j}(X_j - \bar{X})(Y_j - \bar{Y})}{\sqrt{\sum_{j}(X_j - \bar{X})^2}\,\sqrt{\sum_{j}(Y_j - \bar{Y})^2}}$$

wherein $X_j$ represents the word vector of the log statement, $Y_j$ represents the word vector of the statement to be matched, $\bar{X}$ and $\bar{Y}$ are the corresponding mean values, and the range of the calculation result is Pearson ∈ [-1, 1]. If the correlation is smaller than 0, the template needs to be split; if the correlation is larger than 0, the correlation is positive.

The present invention also provides a computer readable storage medium storing a computer program which when executed implements the steps of the above method.

The invention has the advantages and beneficial effects that:

Compared with existing log analysis methods, the method minimizes human intervention, parses quickly, and yields accurate results. Because multi-source heterogeneous log data varies in type and structure, traditional methods need corresponding matching rules to be set for each kind of log. The method provided by the invention first performs preliminary screening through regular-expression matching, which uniformly retains the important data in log statements; second, it saves the time for constructing and searching the template tree by fixing the tree height; third, it has the nodes of each layer of the template tree carry corresponding information so as to reduce the time required for template matching; and finally, it merges and splits log templates on the basis of semantic vectors, further improving the accuracy of the log analysis results.

The beneficial effects of the invention are mainly as follows: on the one hand, the time and labor cost of upgrading the system architecture and locating faults is reduced, and the time efficiency and accuracy of processes such as log analysis and processing are improved; on the other hand, a valuable data set is provided for intelligent operation and maintenance systems in the machine learning field, which promotes engineering applications such as automatic analysis and detection.

Drawings

FIG. 1 is a flow chart of steps of a multi-source heterogeneous log data parsing method based on semantic enhancement of the present invention.

FIG. 2 is a data diagram of the results of applying regular-expression matching to the multi-source heterogeneous log data Trace1 and Trace2.

FIG. 3 is a schematic diagram of the structure of the template tree of the present invention.

FIG. 4 is a flow chart of building a template tree.

Fig. 5 is a template merge flow chart.

Fig. 6 is a flow chart of template splitting.

FIG. 7 is a template merge and split example.

Other relevant drawings may be made by those of ordinary skill in the art from the above figures without undue burden.

Detailed Description

In order to make the person skilled in the art better understand the solution of the present invention, the following describes the solution of the present invention with reference to specific embodiments.

A multi-source heterogeneous log data analysis method based on semantic enhancement, referring to fig. 1, comprises the following steps:

Step 1, regular-expression matching

Original log data generally consists of a constant part and a variable part. The constant part is a fixed structure; the variable part comprises parameters such as numbers and times, together with special symbols such as "-" and ">". The usual approach to log analysis is to remove the variable part of the log data, keep only the constant part, and then perform operations such as integration and encoding on the constant part. Handling the variable part properly usually requires expert knowledge of the actual application scenario, and doing so effectively improves the accuracy of log template mining. The variable part is mainly handled either by direct removal or by indirect replacement; the invention preprocesses heterogeneous log data by regular-expression matching, specifically as follows:

Step 1.1: first, all special characters in the original log data are located using regular expressions and replaced with a single space.

Step 1.2: next, regular expressions are preset to match common variables (such as IP addresses, numbers, and times), and each variable is replaced by a word that corresponds one-to-one to its semantics (for example, the word "NUMBER" replaces the number "23"), because these variables still carry semantic information when the whole log entry is analyzed. If the wildcard "< >" were used to replace all variables uniformly, the relationships among the variables would be obscured, potentially affecting the accuracy of the vectorized representation of the statement.

Step 1.3: finally, runs of consecutive spaces are reduced to one space and runs of consecutive identical replacement words are reduced to one word, so that the processed log statement still retains the original semantic information.
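Steps 1.1 to 1.3 can be illustrated with a minimal Python sketch. The patterns, the replacement words "IP", "TIME" and "NUMBER", and the function name `preprocess` are illustrative assumptions; the text does not list the actual regular expressions. The sketch also applies the variable substitution before stripping special characters, an ordering chosen here so that dotted IP addresses and colon-separated times are still intact when they are matched.

```python
import re

# Hypothetical patterns: the text names IP, number and time as examples only.
# Order matters: IP and TIME must be tried before the generic NUMBER pattern.
VARIABLE_PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "IP"),
    (re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"), "TIME"),
    (re.compile(r"\b\d+\b"), "NUMBER"),
]
# Assumption: every character that is not alphanumeric or whitespace is "special".
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9\s]")
# A run of the same replacement word separated by single spaces.
REPEATED_WORD = re.compile(r"\b(IP|TIME|NUMBER)(?: \1\b)+")


def preprocess(line: str) -> str:
    """Replace variables with semantic words, then normalise the statement."""
    # Step 1.2: replace each variable with the word for its type.
    for pattern, word in VARIABLE_PATTERNS:
        line = pattern.sub(f" {word} ", line)
    # Step 1.1: replace every special character with a single space.
    line = SPECIAL_CHARS.sub(" ", line)
    # Step 1.3: collapse runs of spaces, then runs of identical replacement words.
    line = re.sub(r"\s+", " ", line).strip()
    return REPEATED_WORD.sub(r"\1", line)


print(preprocess("Connection from 10.0.0.1:8080 closed at 12:30:45 -> code=23"))
# Connection from IP NUMBER closed at TIME code NUMBER
```

Keeping distinct words per variable type, rather than one shared wildcard, is what lets the later vectorization step tell an address apart from a counter or a timestamp.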

FIG. 2 shows the results of applying regular-expression matching to the multi-source heterogeneous log data Trace1 and Trace2. As FIG. 2 shows, the matching replaces the variables in Trace1 and Trace2 well and largely preserves the structure and semantics of the original data. The existing Drain log parsing method is largely constrained by the length of log statements: two statements with the same semantics and structure may be assigned to two different log templates because their lengths differ. In addition, the existing Spell log parsing algorithm mostly uses wildcards to replace the variables in the variable part uniformly, so the resulting statements may lose key information, leaving the semantic information incomplete or wrong. In its regular-expression matching flow, the invention replaces variables with type-specific words that carry semantic information and merges runs of consecutive special symbols and keywords, which alleviates both problems to some extent.

Step 2, constructing a template tree

After regular-expression matching, the variable part of the original log data has been matched, and log templates need to be generated. The core of log template generation is the construction of a template tree; the template tree structure and the template tree construction flow are described in detail below.

Step 2.1, defining a template tree structure

Before the log analysis flow starts, the template tree is an empty tree containing only a root node. As log data is input, new nodes update the node information of each layer of the tree, and the template tree is finally formed.

The structure of the template tree constructed by the invention is shown in FIG. 3. The height of the whole template tree is fixed at h = 4, and the time complexity of traversing the whole template tree is O(n log n); that is, the parsing efficiency of the log analysis process is determined by the tree height h. In the template tree, the first layer stores only one root node, which carries no data information; the second layer consists of length nodes, each storing the number of words in the log statement after regular-expression matching; the third layer consists of prefix nodes, each storing the prefix expression of the log statement after regular-expression matching, where the prefix expression consists of the first n/2 words of the log statement and n is the total number of words in the log statement; the fourth layer consists of leaf nodes, each storing log cluster information (denoted (LTi, LCi)), where a log cluster contains m (m ≥ 1) log templates (denoted LTj).

The concepts of log clusters and log templates are as follows:

The log sequence is expressed as LS = {L1, L2, L3, ···, Ln}, where LS is the log sequence output in chronological order, n is the length of the log sequence, and Li represents one log in the log sequence, i ∈ [1, n]. The set of log templates corresponding to the log sequence is denoted LT = {LT1, LT2, LT3, ···, LTm}, where m is the total number of log templates and LTj represents one generated log template, j ∈ [1, m], m ∈ [1, n].

The concept of a log cluster is as follows: log texts in the log sequence that are clustered into the same log cluster are similar to one another. The higher the similarity between two logs, the higher the probability that they are assigned to the same log cluster; conversely, the lower the similarity, the higher the probability that they are assigned to different log clusters. By the definition of a log template, log statements assigned to the same log cluster share the same log template, so the log template LTj is used as the identifier of the log cluster, and the set of log clusters is defined as follows:

SetLC = {(LT1, LC1), (LT2, LC2), ···, (LTm, LCm)};

LCi = {L1, L2, L3, ···, Ln};

where the set of log clusters consists of a plurality of log clusters (LTi, LCi), LTi represents a log template, LCi represents the sequence of logs assigned to that log cluster, and Lj represents one log in that sequence, i ∈ [1, m], j ∈ [1, n].
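One possible in-memory representation of the four-layer structure and of the (LTi, LCi) pairs is sketched below in Python. All class and field names are hypothetical, and taking the floor of n/2 for odd n is an assumption, since the text does not say how odd lengths are rounded.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LogCluster:
    """Layer 4 payload: one (LTi, LCi) pair."""
    template: List[str]                                   # LTi, stored as words
    logs: List[List[str]] = field(default_factory=list)   # LCi, member statements


@dataclass
class PrefixNode:
    """Layer 3: holds a prefix expression and the clusters beneath it."""
    prefix: List[str]
    clusters: List[LogCluster] = field(default_factory=list)


@dataclass
class LengthNode:
    """Layer 2: holds the word count shared by all statements beneath it."""
    length: int
    prefixes: List[PrefixNode] = field(default_factory=list)


@dataclass
class TemplateTree:
    """Layer 1: the root carries no data, only its length-node children."""
    lengths: Dict[int, LengthNode] = field(default_factory=dict)


def prefix_expression(words: List[str]) -> List[str]:
    """First n/2 words of a statement (floor division is an assumption)."""
    return words[: len(words) // 2]


tree = TemplateTree()
words = "open file NUMBER ok".split()
node = tree.lengths.setdefault(len(words), LengthNode(len(words)))
node.prefixes.append(PrefixNode(prefix_expression(words)))
```

Because the depth is fixed, a lookup always visits exactly one length node and one prefix node before reaching the candidate clusters.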

Step 2.2, building a template tree

The flow of constructing the template tree according to the template tree structure defined in step 2.1 is shown in fig. 4, and includes the following steps.

Step 2.21: search for or create a second-layer length node of the template tree according to the length of the log statement after regular-expression matching. Specifically, the second-layer nodes of the template tree are traversed according to the number of words in the target log statement; if the search succeeds, the length-node matching operation is complete; if the search fails, a new length node is created according to the number of words.

Step 2.22: search for or create a third-layer prefix node of the template tree according to the prefix expression of the log statement after regular-expression matching. Specifically, the third-layer nodes of the template tree are traversed according to the prefix expression of the target log statement, and a similarity calculation determines whether a node matches; if a match is found, the search succeeds; if not, a new prefix node is created according to the prefix expression.

Step 2.23: judge whether the log statement matches the log template information according to the node information of the first three layers. If the match succeeds, the log statement is added to the log template set of the log cluster; if the match fails, a log cluster containing log template information is created from the target log statement and added to the leaf node.
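The three construction steps can be sketched as a single insertion routine in Python. This is an illustration under several assumptions: the tree is held in plain dictionaries and lists, the threshold of 0.5 is hypothetical (the text gives no threshold), the statement is compared directly with each stored template, and the similarity function is passed in as a parameter. The example call uses a simple position-wise match ratio as a stand-in, not the edit-distance measure that the text describes.

```python
def match_ratio(a, b):
    """Stand-in similarity: share of positions holding the same word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def insert(tree, words, similarity=match_ratio, threshold=0.5):
    """Insert one preprocessed statement (a list of words) into the tree.

    tree maps word count -> list of prefix nodes; each prefix node is a dict
    with a "prefix" word list and a "clusters" list. Returns the cluster used.
    """
    # Step 2.21: find or create the length node for this word count.
    prefix_nodes = tree.setdefault(len(words), [])

    # Step 2.22: find a sufficiently similar prefix node, or create one.
    prefix = words[: len(words) // 2]
    for node in prefix_nodes:
        if similarity(prefix, node["prefix"]) >= threshold:
            break
    else:
        node = {"prefix": prefix, "clusters": []}
        prefix_nodes.append(node)

    # Step 2.23: join a matching cluster, or create a new one from this statement.
    for cluster in node["clusters"]:
        if similarity(words, cluster["template"]) >= threshold:
            cluster["logs"].append(words)
            return cluster
    cluster = {"template": list(words), "logs": [words]}
    node["clusters"].append(cluster)
    return cluster


tree = {}
first = insert(tree, "open file NUMBER ok".split())
second = insert(tree, "open file NUMBER failed".split())
third = insert(tree, "disk full".split())
```

In this example the first two statements share a length node, a prefix node, and a cluster, while the third statement creates its own branch.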

In the process of constructing the template tree, the step 2.22 and the step 2.23 both involve matching processes, namely prefix expression matching of the step 2.22 and log template matching of the step 2.23.

The prefix expression and the log template are both abbreviated forms of the original log statement that represent the structural information of the entire statement. The edit distance (Levenshtein distance) is therefore used to calculate the similarity Sim between the prefix node and the log template, and the formula is defined as follows:

When the formula is used to calculate prefix-node similarity, fi represents the prefix expression of the target log statement (the first half of the target log statement is taken as the prefix expression), si is the prefix node to be matched, Leven is the edit-distance similarity calculation function, and Len(f) and Len(s) represent the number of words in the target statement and in the prefix node to be matched, respectively.

When the formula is used to calculate log-template similarity, fi represents the information of the prefix node under which the target log statement falls (namely, the prefix expression stored in the prefix node that the target log statement successfully matched or created), s is a child node of that prefix node (namely, the log template information stored in the leaf node to be matched), Leven is the edit-distance similarity calculation function, and Len(f) and Len(s) represent the number of words in the prefix node and in the log template of the leaf node to be matched, respectively.

The closer the value of Sim is to 1, the more similar the two are; the closer the value of Sim is to 0, the less similar they are, to the point of being dissimilar.
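A word-level edit-distance similarity with the stated behaviour (values near 1 for similar inputs, near 0 for dissimilar ones) might look like the Python sketch below. The normalization Sim = 1 - Leven(f, s) / max(Len(f), Len(s)) is an assumption made for illustration; it is one common choice and is not confirmed by the text. The function names are likewise hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (here: lists of words)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # delete x
                current[j - 1] + 1,      # insert y
                previous[j - 1] + cost,  # substitute x with y, or keep it
            ))
        previous = current
    return previous[-1]


def sim(f, s):
    """Assumed normalization: 1 - Leven(f, s) / max(Len(f), Len(s))."""
    longest = max(len(f), len(s))
    if longest == 0:
        return 1.0  # two empty expressions are treated as identical
    return 1.0 - levenshtein(f, s) / longest


print(sim(["open", "file"], ["open", "file"]))    # 1.0
print(sim(["open", "file"], ["close", "socket"]))  # 0.0
```

Working on whole words rather than characters keeps the measure aligned with Len(f) and Len(s), which the text defines as word counts.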

Step 3, template splitting and merging

The log templates obtained from template tree construction in the preceding steps may differ in semantic expression from the original log data, so the log templates need to be split and merged semantically. Although the variable part of the log data is changed in the regular-expression matching stage, the invention replaces each variable with a semantically corresponding word, so the original structure is preserved and the semantic information is enhanced.

For template merging, the invention merges log templates of the same log cluster by wildcard substitution; the template merging flow is shown in FIG. 5.
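Wildcard substitution can be illustrated with a short Python sketch that merges templates position by position. The wildcard token "<*>" and the function name are assumptions, and the sketch relies on the templates in one cluster having equal length, which holds here because they sit under the same length node.

```python
WILDCARD = "<*>"  # hypothetical wildcard token


def merge_templates(templates):
    """Merge equal-length templates, replacing differing positions by a wildcard."""
    if not templates:
        raise ValueError("at least one template is required")
    merged = list(templates[0])
    for template in templates[1:]:
        if len(template) != len(merged):
            raise ValueError("templates in one cluster must have equal length")
        # Keep a word only where every template seen so far agrees on it.
        merged = [a if a == b else WILDCARD for a, b in zip(merged, template)]
    return merged


print(merge_templates([
    "open file NUMBER ok".split(),
    "open file NUMBER failed".split(),
]))
# ['open', 'file', 'NUMBER', '<*>']
```

Once a position has become a wildcard it stays one, so the result does not depend on the order in which the templates are merged.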

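For template splitting, word2vec is used to represent the word vectors of log statements, the similarity of log statements within the same template is calculated using the Pearson linear correlation coefficient, and if the similarity is smaller than 0, the original template is split. The Pearson linear correlation coefficient formula is as follows:

$$\mathrm{Pearson} = \frac{\sum_{j}(X_j - \bar{X})(Y_j - \bar{Y})}{\sqrt{\sum_{j}(X_j - \bar{X})^2}\,\sqrt{\sum_{j}(Y_j - \bar{Y})^2}}$$

where $X_j$ and $Y_j$ are the word vectors of the two log statements being compared, and $\bar{X}$ and $\bar{Y}$ are the corresponding mean values.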

Fig. 6 shows a flow of template splitting, which aims to split log sentences with opposite semantics into different templates by comparing log sentences in the same log template.

Through the above steps, similar templates are merged, and log messages with opposite semantics are re-assigned to different log templates. As shown in FIG. 7, Log1 to Log3 would in principle be assigned to the same log template by the template tree construction flow, but the semantic information of Log1 and Log3 is quite different. The algorithm therefore re-divides them, generating two sub-templates, Template1-1 and Template1-2, which completes the splitting of the log template.
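The splitting rule can be sketched in Python as follows. The sketch assumes that each log statement already has a fixed-length numeric vector (for example, derived from word2vec embeddings by a separate step that is not shown), that the first statement in the template serves as the reference for comparison, and that a coefficient of exactly 0 does not trigger a split. All of these choices, and the function names, are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant vector has no defined correlation; treated as 0 here
    return covariance / (spread_x * spread_y)


def split_template(vectors):
    """Partition the logs of one template by the sign of their correlation.

    vectors maps a log identifier to its statement vector. Logs whose
    correlation with the reference (first) log is below 0 are split off.
    """
    identifiers = list(vectors)
    reference = vectors[identifiers[0]]
    kept, split_off = [identifiers[0]], []
    for identifier in identifiers[1:]:
        if pearson(reference, vectors[identifier]) < 0:
            split_off.append(identifier)
        else:
            kept.append(identifier)
    return kept, split_off


# Toy vectors chosen for illustration only; they are not real embeddings.
kept, split_off = split_template({
    "Log1": [1.0, 2.0, 3.0],
    "Log2": [1.0, 2.0, 4.0],
    "Log3": [3.0, 2.0, 1.0],
})
```

With these toy vectors, Log3 correlates negatively with Log1 and is split off, which mirrors the Template1-1 and Template1-2 outcome described for FIG. 7.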

The foregoing has described exemplary embodiments of the invention, it being understood that any simple variations, modifications, or other equivalent arrangements which would not unduly obscure the invention may be made by those skilled in the art without departing from the spirit of the invention.

Claims

1. The multi-source heterogeneous log data analysis method based on semantic enhancement is characterized by comprising the following steps of:

step 2, defining a template tree structure and constructing a template tree;

2. The semantic enhancement-based multi-source heterogeneous log data parsing method according to claim 1, wherein in step 1, the variables to be replaced include: IP variables, numeric variables, and time variables.

3. The semantic enhancement-based multi-source heterogeneous log data parsing method according to claim 1, wherein step 1 further comprises: locating all special characters in the original log data using a regular expression and replacing the special characters with a single space; reducing runs of consecutive spaces to one space and runs of consecutive identical replacement words to one word.

4. The semantic enhancement-based multi-source heterogeneous log data parsing method according to claim 1, wherein in step 2.21, the second-layer nodes of the template tree are traversed according to the number of words in the target log statement; if the search succeeds, the length-node matching operation is complete; if the search fails, a new length node is created according to the number of words.

5. The semantic enhancement-based multi-source heterogeneous log data parsing method according to claim 1, wherein in step 2.22, the third-layer nodes of the template tree are traversed according to the prefix expression of the target log statement, and a similarity calculation determines whether a node matches; if a match is found, the search succeeds; if not, a new prefix node is created according to the prefix expression.

6. The semantic enhancement-based multi-source heterogeneous log data parsing method according to claim 1, wherein in step 2, the edit distance is used to calculate the similarity between the prefix node and the log template.

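7. The semantic enhancement-based multi-source heterogeneous log data parsing method according to claim 1, wherein in step 3, the Pearson linear correlation coefficient formula is as follows:

$$\mathrm{Pearson} = \frac{\sum_{j}(X_j - \bar{X})(Y_j - \bar{Y})}{\sqrt{\sum_{j}(X_j - \bar{X})^2}\,\sqrt{\sum_{j}(Y_j - \bar{Y})^2}}$$

wherein $X_j$ represents the word vector of the log statement, $Y_j$ represents the word vector of the statement to be matched, and $\bar{X}$ and $\bar{Y}$ are the corresponding mean values.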

8. A computer readable storage medium, characterized in that a computer program is stored, which computer program, when executed, implements the steps of the method according to any of claims 1 to 7.