CN110795606A

CN110795606A - Method for generating log analysis rule

Info

Publication number: CN110795606A
Application number: CN201910822081.6A
Authority: CN
Inventors: 王平; 陈宏伟
Original assignee: Jiepu Network Science & Technology Co Ltd Xi'an Jiaoda
Current assignee: Jiepu Network Science & Technology Co Ltd Xi'an Jiaoda
Priority date: 2019-09-02
Filing date: 2019-09-02
Publication date: 2020-02-14

Abstract

The invention discloses a method for generating a log parsing rule. By separating the static part from the log content of the original log text, only the dynamic part needs to be processed: a regular expression is generated for it and a description is added to it, yielding a structured log parsing rule. The method is highly adaptable and consumes few resources, and the generated parsing rules parse with high precision.

Description

Method for generating log analysis rule

Technical Field

The invention belongs to the field of computers and information security, and particularly relates to a method for generating analysis rules for a log analysis product.

Background

The network security log comprises a system log generated by an operating system, an alarm log generated by network security equipment and the like, mainly records various security events occurring in the system and network environment, and provides important clues for network anomaly diagnosis and network attack threat discovery. In the analysis of the network security log, the log analysis is a crucial step.

Current log analysis products face a number of practical problems in log normalization when the rules are entered manually. First, the many log types and formats cannot be parsed with one uniform parsing rule, and a new parsing method has to be developed specifically for each new log type, so development and maintenance costs are very high. Second, although logs of the same kind generally follow a certain pattern, that pattern is often obscure and difficult to obtain. Third, a regular expression is usually designed for the content to be extracted and then used to extract the specific content from the log; however, writing regular expressions has a technical threshold, and the expressions need continuous updating, which increases the maintenance burden on operation and maintenance personnel. Rule generation by unsupervised machine learning also exists, but it has poor adaptability, is better suited to formatted and structured logs, and has lower parsing precision and higher resource consumption. In addition, the prior art has a further problem: since most logs consist of English characters and numbers, and some character strings are English abbreviations, the logs cannot be read directly even after they are structured.

Disclosure of Invention

In view of the above background, a scheme is proposed that separates the static part from the log content of the original log text, so that only the dynamic part needs to be processed: a regular expression is generated for the dynamic part and a description is added to it, yielding a structured log parsing rule.

The adopted specific technical scheme is a generation method of a log analysis rule, which comprises the following steps:

prefetching original log data, and splitting log contents into a set with character strings as units through separators;

identifying whether each character string is static or dynamic, and if the character string is static, clearing it;

if the character string is dynamic, determining its actual meaning and then labeling each dynamic character string with a corresponding Chinese description; adapting a regular expression to each dynamic character string; creating a mapping structure of the Chinese descriptions and the regular expressions;

replacing the dynamic character string in the original log with a corresponding regular expression to obtain a structured log regular expression;

and storing the mapping structure and the log regular expression as an analysis rule.

The splitting of the log content includes: comparing the prefetched log contents, determining a separator and recording the position of the separator; and splitting the log content into a set consisting of independent character strings according to the positions of the separators.

Preferably, a sample log is selected and the other logs are compared with it element by element; when the same symbol appears at the same position in at least two logs, the symbol is a common symbol and its position is recorded; each element comprises one character, consecutive letters and/or numbers being regarded as one element, and symbols are characters other than letters and numbers that serve to connect or delimit;

when only one kind of common symbol exists, the kind of common symbol is the separator;

when at least two common symbols exist, determining a separator according to the relevance size among elements divided by the common symbols; the relevance determination includes determining whether the common symbol is used with other characters as a whole.

Further, if the character string split by the separator remains unchanged in each log, the character string is recognized as a static character string; otherwise, identifying the character string as a dynamic character string; and recording the positions of the static and dynamic character strings in the log.

And determining the Chinese description of the dynamic character string at the position by comparing the meanings of the dynamic character string contents at the same position of different logs.

Preferably, a regular database is preset, and a corresponding regular expression is selected from the preset regular database according to the type of the dynamic character string at each position; the character string types include all symbols, all letters, all numbers, and combinations of symbols, and/or letters, and/or numbers.

The mapping structure is a data table with the Chinese description of the dynamic character string as a field name and the corresponding regular expression as content.

And taking original log data, replacing the dynamic character string with a corresponding regular expression according to the recorded position of the dynamic character string, and keeping the static character string unchanged to obtain the log regular expression.

Preferably, each selected regular expression is matched with the corresponding dynamic character string, and if the matching is successful, the regular expression is stored; and if the matching fails, re-selecting the regular expression from the regular database until the matching with the character string is successful so as to verify the accuracy of the regular expression in matching the content of the character string.

For a static character string, the corresponding regular expression is matched according to the character string type to which the static character string belongs, and the corresponding content is cleared.

The process of analyzing the log by using the analysis rule obtained by the generation method comprises the following steps: and matching the log data with a log regular expression of the analysis rule according to a preset analysis strategy, and acquiring Chinese description corresponding to the regular expression as a comment of an analysis result after the dynamic character string of the log data is successfully matched with the regular expression.

Compared with the prior art, the technical scheme has the following advantages. Using several prefetched original logs, the separator is determined and used to split the log content into independent character strings, and static and dynamic character strings are distinguished. For each dynamic character string, a corresponding regular expression is generated and a Chinese description is labeled. Replacing the dynamic character strings in the original log with these regular expressions yields the log regular expression. When a log is parsed with the log regular expression, each dynamic character string in the result carries a meaningful Chinese description, which helps administrators read and analyze it. The parsing rule generation method is highly adaptable and consumes few resources, and the generated parsing rules parse with high precision.

Drawings

FIG. 1 is a flowchart illustrating an embodiment of a method for generating a log parsing rule;

FIG. 2 is a schematic flowchart of log parsing according to the parsing rule generated in FIG. 1.

Detailed Description

The technical solution is described in detail below with reference to the accompanying drawings and examples.

As shown in fig. 1, a method for generating a log parsing rule includes:

first, the original log data is prefetched, and the log contents are split into sets in units of character strings by delimiters.

When the log is analyzed by using the existing analysis rule and a complete and accurate analysis result cannot be obtained, the log is considered as a new log and the analysis rule needs to be updated. Therefore, at least 2 pieces/line of original log data need to be prefetched as a generation basis for the parsing rule, where the "pieces" may be logs for streaming and the "lines" may be logs for file transmission, where there is no essential difference between the two.

After the original log data is prefetched, firstly, the separator of the log is determined, namely, the symbols for dividing different field contents in the log are determined;

the separators themselves usually have no specific actual meaning and only separate or delimit; field contents with actual meaning (such as character strings) lie between two adjacent separators. Common delimiters include ",", "/", and spaces. For example, the delimiter of the log "AAA BBB CCC" is the space character, and the delimiter of the log "DDD_EE_F" is "_".

The determination of the delimiters includes: selecting a sample log and comparing the other logs with it element by element; when the same symbol appears at the same position in at least two logs, the symbol is a common symbol and its position is recorded; each element comprises one character, consecutive letters and/or numbers being regarded as one element, and symbols are characters other than letters and numbers that serve to connect or delimit;

when at least two common symbols exist, determining a separator according to the relevance size among elements divided by the common symbols; the association includes the combination of elements that can express the complete meaning.

Secondly, identifying that the character strings in the set are static or dynamic, and identifying the character strings as static character strings if the character strings split by the separators remain unchanged in each log; otherwise, identifying the character string as a dynamic character string; and recording the positions of the static and dynamic character strings in the log.

For static character strings in the set, the corresponding regular expressions are matched according to the character string types to which they belong, and the static character strings are screened out and removed. The character string types include all symbols, all letters, all numbers, and combinations of symbols and/or letters and/or numbers; for example, the character string "Name" is all letters, "8080" is all numbers, and "user01" is a combination of letters and numbers.

For the dynamic character string, determining the Chinese description of the dynamic character string at the position by comparing the actual meanings of the dynamic character string contents at the same position of different logs;

presetting a regular database, and selecting a corresponding regular expression from the preset regular database according to the type of the dynamic character string at each position; matching each selected regular expression with a corresponding dynamic character string, and if the matching is successful, storing the regular expressions; if the matching fails, selecting the regular expression from the regular database again until the matching succeeds, so as to verify the accuracy of the regular expression in matching the content of the character string;

creating a mapping structure of the Chinese description and the regular expression, wherein the mapping structure is a data table which takes the Chinese description of the dynamic character string as a field name and takes the corresponding regular expression as content;

And finally, storing the mapping structure and the log regular expression as an analysis rule.
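The selection-and-verification loop and the mapping structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the filing: the contents of `REGEX_LIBRARY`, the type names, and the helpers `string_type` and `adapt_regex` are all hypothetical, and the Chinese field names are illustrative labels.

```python
import re

# Hypothetical preset library: string type -> candidate patterns, tried in order.
REGEX_LIBRARY = {
    "digits": [r"\d+"],
    "letters": [r"[A-Za-z]+"],
    "symbols": [r"[^A-Za-z0-9]+"],
    "mixed": [
        r"\d{4}-\d{1,2}-\d{1,2}",  # date-like
        r"\d{1,2}:\d{2}:\d{2}",    # time-like
        r"[A-Za-z0-9]+",           # letters and digits
        r"\S+",                    # fallback: any run without whitespace
    ],
}


def string_type(s):
    """Classify a string as all digits, all letters, all symbols, or mixed."""
    if s.isdigit():
        return "digits"
    if s.isalpha():
        return "letters"
    if not any(c.isalnum() for c in s):
        return "symbols"
    return "mixed"


def adapt_regex(samples):
    """Pick the first library pattern that fully matches every sample string."""
    types = {string_type(s) for s in samples}
    kind = types.pop() if len(types) == 1 else "mixed"
    for pattern in REGEX_LIBRARY[kind]:
        # Verification step: a failed match moves on to the next candidate.
        if all(re.fullmatch(pattern, s) for s in samples):
            return pattern
    raise LookupError(f"no pattern in the library matches {samples!r}")


# Mapping structure: Chinese description as field name, regular expression as content.
mapping = {
    "日期": adapt_regex(["2018-11-12", "2018-11-14"]),  # date
    "时间": adapt_regex(["11:20:33", "21:16:45"]),      # time
    "操作": adapt_regex(["login", "logoff"]),           # operation
    "用户名": adapt_regex(["admin", "user1"]),          # username
}
for description, pattern in mapping.items():
    print(description, pattern)
```

Trying candidates in a fixed order keeps the more specific patterns (date, time) ahead of the generic fallbacks, so a generic pattern is chosen only when nothing more specific passes verification.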

As shown in fig. 2, the process of analyzing the log by using the analysis rule obtained by the generation method includes:

obtaining the log data to be analyzed, wherein the same log can be transmitted in a streaming mode or a file mode.

According to the configured analysis strategy, the analysis rule is executed, including the regular expression configured in the execution strategy, and the configuration of the strategy can be realized by selecting the corresponding regular expression through Chinese description.

The specific analysis is to match the log data through a regular expression included in the analysis rule of the strategy configuration, and after the matching is successful, the corresponding Chinese description is used as a comment for the analysis result, so that more visual reading is realized.

The following examples are given to illustrate the details.

Take the following two pre-fetched logs as an example:

“2018-11-12 11:20:33 192.168.19.1 login admin.”

“2018-11-14 21:16:45 192.168.19.1 logoff user1.”

step 1, comparing the log contents, and determining a separator:

comparing the two logs element by element, i.e. comparing the first element "2018" with "2018", the second element "-" with "-", and so on, wherein consecutive letters or numbers, such as "2018", "192", "login" and "user1", are each regarded as one element. At the same positions of the two logs, the identical symbols are "-", the space, ":" and "." (only symbols that are identical and located at the same position in different logs can be separators).

The "." falls into two cases. The "." at the beginning or end of the log is excluded, because symbols at the beginning and end of a log generally do not act as separators. The "." in the middle of the log appears three times in succession, and the elements it divides are all numbers; this group of numbers is the familiar IP address, its four elements are used as a whole, and the "." acts as a connector rather than as a separator between character strings.

"-" appears twice in succession and then no more, and ":" likewise appears twice in succession and then no more; the elements they divide are all numbers that are used as a whole, so both symbols act as connectors rather than as separators between character strings.

The spaces appear evenly, four times in each log (that is, a character string with actual meaning lies between each pair of adjacent spaces), so only the space meets the characteristic requirements of a separator. The separator of the logs is therefore determined to be the space, and the separator positions are recorded so that the logs can be split in the next step.
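Step 1 can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the method of the filing: the "relevance" judgment is replaced by a simple hypothetical heuristic (a common symbol that only ever joins two all-digit elements is treated as a connector), and the names `elements`, `common_symbols` and `separator_candidates` are invented for this example.

```python
import re

# An element is a run of ASCII letters/digits, or one single other character.
ELEMENT = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]")


def elements(log):
    """Split a log line into elements (alphanumeric runs and single symbols)."""
    return ELEMENT.findall(log)


def common_symbols(sample, others):
    """Map element position -> symbol shared by the sample and another log."""
    sample_elems = elements(sample)
    common = {}
    for other in others:
        for pos, (a, b) in enumerate(zip(sample_elems, elements(other))):
            if a == b and not a.isalnum():
                common[pos] = a
    return common


def separator_candidates(sample, others):
    """Return the common symbols that behave as separators (hypothetical heuristic)."""
    elems = elements(sample)
    common = common_symbols(sample, others)
    if len(set(common.values())) == 1:
        # Only one kind of common symbol exists: it is the separator.
        return sorted(set(common.values()))
    only_joins_digits = {}
    for pos, sym in common.items():
        if pos == 0 or pos == len(elems) - 1:
            continue  # symbols at the start or end of the log are excluded
        joins = elems[pos - 1].isdigit() and elems[pos + 1].isdigit()
        only_joins_digits[sym] = only_joins_digits.get(sym, True) and joins
    # A symbol that sometimes sits next to non-numeric elements is a separator.
    return sorted(s for s, only in only_joins_digits.items() if not only)


log1 = "2018-11-12 11:20:33 192.168.19.1 login admin."
log2 = "2018-11-14 21:16:45 192.168.19.1 logoff user1."
print(separator_candidates(log1, [log2]))
```

For the two example logs, "-", ":" and the inner "." only ever join numbers, so the space is the single remaining candidate. Real logs would need a richer relevance test than this digit-only rule.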

Step 2, splitting the log data into character strings by using separators:

for example, split the first log into a set of strings:

{2018-11-12

11:20:33

192.168.19.1

login

admin};

the second log is split into a set of strings:

{2018-11-14

21:16:45

192.168.19.1

logoff

user1}.

step 3, distinguishing whether the character string is static or dynamic:

Comparing the character strings of the two sets from Step 2 in order shows that the first, second, fourth and fifth character strings change and are identified as dynamic character strings; the third character string remains unchanged and is identified as a static character string. In practice, comparing only two logs cannot determine with complete accuracy whether a character string is static or dynamic. For example, "192.168.19.1" in this embodiment may be static if it is the IP of a device, but dynamic if it is the IP of a user. As more logs accumulate, the accuracy gradually improves until identification is complete and accurate.
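Steps 2 and 3 amount to a split followed by a column-wise comparison. A minimal sketch, assuming the space separator found in Step 1 and assuming that the trailing "." is dropped as in the example sets (the helper names `split_log` and `classify` are hypothetical):

```python
def split_log(log, sep=" "):
    """Split one log line into its character strings."""
    # Assumption: the trailing "." is dropped, matching the example sets.
    return log.rstrip(".").split(sep)


def classify(logs, sep=" "):
    """Return (position, kind, sample value) for each string position."""
    rows = [split_log(log, sep) for log in logs]
    labelled = []
    for pos, column in enumerate(zip(*rows)):
        # A position whose value never changes across the logs is static.
        kind = "static" if len(set(column)) == 1 else "dynamic"
        labelled.append((pos, kind, column[0]))
    return labelled


logs = [
    "2018-11-12 11:20:33 192.168.19.1 login admin.",
    "2018-11-14 21:16:45 192.168.19.1 logoff user1.",
]
for pos, kind, value in classify(logs):
    print(pos, kind, value)
```

Passing more logs to `classify` tightens the result in the way the text describes: a position is reported as static only while every log seen so far agrees on its value.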

Step 4, for the static character string, removing according to the character string type, in this embodiment, the character string "192.168.19.1" represents an IP address, and its regular expression may be:

the regular expression can be used to clear the static string, for example, the first log can be processed as:

{2018-11-12

11:20:33

login

admin};
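The clearing step can be illustrated with a simple IPv4 pattern. Both the pattern and the names `IPV4` and `clear_static` are assumptions made for this sketch and are not the expression given in the original filing:

```python
import re

# Hypothetical IPv4 pattern: four groups of 1-3 digits joined by dots.
# It does not check that each group is at most 255.
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def clear_static(strings, static_positions, pattern):
    """Drop strings at static positions when they fully match the pattern."""
    return [
        s
        for pos, s in enumerate(strings)
        if not (pos in static_positions and pattern.fullmatch(s))
    ]


first_log = ["2018-11-12", "11:20:33", "192.168.19.1", "login", "admin"]
print(clear_static(first_log, {2}, IPV4))
```

Checking the position as well as the pattern means a dynamic string that happens to look like an IP address is left in place.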

the rest are dynamic character strings, and for the dynamic character strings, two kinds of processing are carried out, firstly, the meanings of the dynamic character strings in the two logs are combined, and Chinese description is marked:

{2018-11-12 date

11:20:33 time

login operation

admin username};

Secondly, a regular expression is adapted to each dynamic character string: the type of the character string is determined from its composition, and the corresponding regular expression is then matched from the preset regular library. In this embodiment, the types of the dynamic character strings are a combination of numbers and symbols (more precisely, a date format), a combination of numbers and symbols (more precisely, a time format), a combination of letters, and a combination of letters and numbers; the date and time also often appear adjacent to each other in the log. The regular expression for the date may be: \d{4}(-|/|\.)\d{1,2}\1\d{1,2}; the regular expression for the time may be:

The regular expression for the operation (an English word) may be: \b[a-zA-Z]+\b; the regular expression for the username may be: [A-Za-z0-9\u4e00-\u9fa5]+. A mapping structure of the Chinese descriptions and the regular expressions is thereby obtained, as shown in the following table:

Step 5, replacing the dynamic character strings with regular expressions, that is, replacing the dynamic character strings in the original log with the corresponding regular expressions to obtain the following log regular expression:
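One way this replacement could be carried out is sketched below. The function name `build_log_regex` and the four patterns are hypothetical stand-ins chosen for the example; static strings and the separator are escaped so that they match literally.

```python
import re


def build_log_regex(strings, dynamic, sep=" "):
    """Replace dynamic strings (position -> pattern) and keep static ones literal."""
    parts = []
    for pos, s in enumerate(strings):
        if pos in dynamic:
            parts.append(f"({dynamic[pos]})")  # capture each dynamic field
        else:
            parts.append(re.escape(s))         # static text is matched as-is
    return re.escape(sep).join(parts)


strings = ["2018-11-12", "11:20:33", "192.168.19.1", "login", "admin"]
# Hypothetical patterns for the recorded dynamic positions 0, 1, 3 and 4.
dynamic = {
    0: r"\d{4}-\d{1,2}-\d{1,2}",
    1: r"\d{1,2}:\d{2}:\d{2}",
    3: r"[A-Za-z]+",
    4: r"[A-Za-z0-9]+",
}
log_regex = build_log_regex(strings, dynamic)
print(log_regex)
print(bool(re.fullmatch(log_regex, "2018-11-14 21:16:45 192.168.19.1 logoff user1")))
```

Because the static IP address stays literal, a log of the same shape from a different address does not match this rule, which is consistent with treating the string as static until more logs show otherwise.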

Step 6, storing the mapping structure of the Chinese descriptions and the log regular expression as a parsing rule. When a log of the same kind is parsed, the rule is executed, and in the result each dynamic character string automatically carries its Chinese description, which makes direct reading easier:

for example, the second log is analyzed, and the output result may be in the following form:

date 2018-11-14/time 21:16:45/192.168.19.1/operation logoff/username user1

The symbol "/" here merely separates the contents of the respective fields; it has no practical meaning and can be replaced by other symbols.
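The annotated output above can be reproduced with a short sketch. It assumes a stored rule in the form of an ordered field list, uses English descriptions in place of the Chinese ones, and relies on hypothetical patterns; `FIELDS`, `LOG_REGEX` and `parse` are names invented for the example.

```python
import re

# Hypothetical stored rule: (description, pattern) per field, in log order.
# A description of None marks a static string that carries no annotation.
FIELDS = [
    ("date", r"\d{4}-\d{1,2}-\d{1,2}"),
    ("time", r"\d{1,2}:\d{2}:\d{2}"),
    (None, r"192\.168\.19\.1"),
    ("operation", r"[A-Za-z]+"),
    ("username", r"[A-Za-z0-9]+"),
]
# Fields are joined by the space separator; a trailing "." is optional.
LOG_REGEX = re.compile(" ".join(f"({p})" for _, p in FIELDS) + r"\.?")


def parse(log):
    """Match a log against the rule and annotate each field with its description."""
    match = LOG_REGEX.fullmatch(log)
    if match is None:
        return None  # the log does not belong to this rule
    annotated = []
    for (description, _), value in zip(FIELDS, match.groups()):
        annotated.append(f"{description} {value}" if description else value)
    return "/".join(annotated)


print(parse("2018-11-14 21:16:45 192.168.19.1 logoff user1."))
```

A log that fails to match returns `None`, which corresponds to the case described earlier in which an incomplete parse signals a new log type and the rule needs to be updated.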

For the static portion 192.168.19.1, the type may be determined from its source. For example, if sufficient logs confirm that it is a user's IP, "user IP" may be added in front of it; likewise, if it is identified as a device IP, the Chinese description "device IP" may be added for ease of understanding.

Claims

1. A method for generating a log parsing rule is characterized by comprising the following steps:

if the character string is dynamic, determining its actual meaning and labeling each dynamic character string with a corresponding Chinese description; adapting a regular expression to each dynamic character string; creating a mapping structure of the Chinese descriptions and the regular expressions;

replacing the dynamic character string in the original log with a corresponding regular expression to obtain the regular expression of the log;

2. The method of claim 1, wherein the splitting of the log comprises:

comparing the prefetched log contents, determining a separator and recording the position of the separator; and splitting the log content into a set consisting of independent character strings according to the positions of the separators.

3. The method of claim 2, wherein the determining of the separator comprises:

selecting a sample log and comparing the other logs with it element by element; when the same symbol appears at the same position in at least two logs, the symbol is a common symbol and its position is recorded; each element comprises one character, consecutive letters and/or numbers being regarded as one element, and symbols are characters other than letters and numbers that serve to connect or delimit;

4. The method of claim 3, wherein if the character string split by the separator remains unchanged in the log, the character string is recognized as a static character string; otherwise, identifying the character string as a dynamic character string;

and recording the positions of the static and dynamic character strings in the log.

5. The method of claim 4, wherein the Chinese description of the dynamic string at the location is determined by comparing the meanings of the contents of the dynamic strings at the same location in different logs.

6. The method according to claim 5, characterized in that a regular library is preset, and a corresponding regular expression is selected from the preset regular library according to the type of the dynamic character string at each position; the character string types include all symbols, all letters, all numbers, and combinations of symbols, and/or letters, and/or numbers.

7. The method of claim 6, wherein the mapping structure is a data table with Chinese descriptions of dynamic strings as field names and regular expressions of corresponding strings as contents.

8. The method according to claim 6, characterized in that the original log data is taken, each dynamic character string is replaced by the corresponding regular expression according to the recorded position of the dynamic character string, and the static character strings are kept unchanged, to obtain the regular expression of the log.

9. The method of claim 6, wherein each selected regular expression is matched with a corresponding dynamic string, and if matching is successful, the regular expression is saved; and if the matching fails, re-selecting the regular expression from the regular database until the matching with the character string is successful.

10. The method according to claim 6, characterized in that the corresponding regular expression is matched and the corresponding content is cleared through the character string type to which the static character string belongs.