CN110795606A - Method for generating log analysis rule - Google Patents

Publication number
CN110795606A
Authority
CN
China
Prior art keywords
log
character string
dynamic
regular expression
static
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910822081.6A
Other languages
Chinese (zh)
Inventor
王平
陈宏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiepu Network Science & Technology Co Ltd Xi'an Jiaoda
Original Assignee
Jiepu Network Science & Technology Co Ltd Xi'an Jiaoda
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-09-02
Filing date
2019-09-02
Publication date
2020-02-14
Application filed by Jiepu Network Science & Technology Co Ltd Xi'an Jiaoda
Priority to CN201910822081.6A
Publication of CN110795606A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for generating a log parsing rule. By separating the static part from the log content of the original log text, the method needs to process only the dynamic part: it generates a regular expression for the dynamic part and adds a description to it, thereby obtaining a structured log parsing rule. The method is highly adaptable and consumes few resources, and the generated parsing rule has high parsing accuracy.

Description

Method for generating log analysis rule
Technical Field
The invention belongs to the field of computers and information security, and particularly relates to a method for generating analysis rules for a log analysis product.
Background
Network security logs include system logs generated by operating systems, alarm logs generated by network security equipment, and the like. They mainly record the security events that occur in the system and network environment, and they provide important clues for diagnosing network anomalies and discovering network attack threats. In the analysis of network security logs, log parsing is a crucial step.
Current log analysis products normalize logs with manually entered rules and face a number of practical problems. First, log types and formats are so varied that they cannot be parsed by one uniform parsing rule; a new parsing method has to be developed specifically for every new log type, so development and maintenance costs are very high. Second, although logs of the same kind generally follow a certain pattern, this pattern is often obscure and difficult to obtain. Third, a regular expression is usually designed for the content to be extracted, and the specific content is then extracted from the log with that regular expression; however, writing regular expressions requires a certain level of technical skill, and the expressions must be updated continuously, which increases the maintenance burden on operation and maintenance personnel. Rule generation by unsupervised machine learning, in turn, has poor adaptability, is better suited to formatted and structured logs, and has lower parsing accuracy and higher resource consumption. The prior art has a further problem: since most logs consist of English characters and numbers, and some character strings are English abbreviations, the logs cannot be read directly even after they are structured.
Disclosure of Invention
In view of the above background, a scheme is proposed that separates the static part from the log content of the original log text, so that only the dynamic part needs to be processed: a regular expression is generated for the dynamic part and a description is added to it, yielding a structured log parsing rule.
The adopted specific technical scheme is a generation method of a log analysis rule, which comprises the following steps:
prefetching original log data, and splitting log contents into a set with character strings as units through separators;
identifying whether each character string is static or dynamic, and removing the character string if it is static;
if the character string is dynamic, determining its actual meaning and labeling each dynamic character string with a corresponding Chinese description; adapting a regular expression to each dynamic character string; creating a mapping structure of the Chinese descriptions and the regular expressions;
replacing the dynamic character string in the original log with a corresponding regular expression to obtain a structured log regular expression;
and storing the mapping structure and the log regular expression as an analysis rule.
The splitting of the log content includes: comparing the prefetched log contents, determining a separator and recording the position of the separator; and splitting the log content into a set consisting of independent character strings according to the positions of the separators.
Preferably, a sample log is selected and the other logs are compared with it element by element; when the same symbol appears at the same position in at least two logs, the symbol is a common symbol and its position is recorded; each element comprises at least one character, consecutive letters and/or digits are regarded as one element, and symbols are the characters other than letters and digits that serve to connect or delimit;
when only one kind of common symbol exists, that common symbol is the separator;
when at least two kinds of common symbols exist, the separator is determined according to the degree of association among the elements divided by each common symbol; determining the association includes determining whether the common symbol is used together with other characters as a whole.
Further, if the character string split by the separator remains unchanged in each log, the character string is recognized as a static character string; otherwise, identifying the character string as a dynamic character string; and recording the positions of the static and dynamic character strings in the log.
The Chinese description of the dynamic character string at a given position is determined by comparing the meanings of the dynamic character string contents at the same position in different logs.
Preferably, a regular expression library is preset, and a corresponding regular expression is selected from it according to the type of the dynamic character string at each position; the character string types include all symbols, all letters, all digits, and combinations of symbols, letters, and/or digits.
The mapping structure is a data table with the Chinese description of the dynamic character string as a field name and the corresponding regular expression as content.
The original log data is taken, each dynamic character string is replaced with its corresponding regular expression according to the recorded position of the dynamic character string, and the static character strings are kept unchanged, giving the log regular expression.
Preferably, each selected regular expression is matched against the corresponding dynamic character string to verify that it matches the content of the character string accurately: if the match succeeds, the regular expression is saved; if the match fails, another regular expression is selected from the regular expression library until the character string is matched successfully.
For a static character string, a corresponding regular expression is matched according to its character string type, and the matched content is removed.
The process of parsing a log with the parsing rule obtained by the generation method comprises: matching the log data against the log regular expression of the parsing rule according to a preset parsing strategy and, after a dynamic character string of the log data has been matched successfully by its regular expression, taking the Chinese description corresponding to that regular expression as the annotation of the parsing result.
Compared with the prior art, the technical scheme has the following advantages. Several prefetched original logs are used to determine the separators, and the separators are used to split the log content into independent character strings. Static and dynamic character strings are distinguished, and for each dynamic character string a corresponding regular expression is generated and a Chinese description is labeled. The dynamic character strings in the original log are replaced with their regular expressions to obtain the log regular expression. When a log is parsed with the log regular expression, each dynamic character string in the result carries a meaningful Chinese description, which helps administrators read and analyze the result. The generation method is highly adaptable and consumes few resources, and the generated parsing rule has high parsing accuracy.
Drawings
FIG. 1 is a flowchart illustrating an embodiment of a method for generating a log parsing rule;
FIG. 2 is a schematic flowchart of log parsing according to the parsing rule generated in FIG. 1.
Detailed Description
The technical solution is described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, a method for generating a log parsing rule includes:
first, the original log data is prefetched, and the log contents are split into sets in units of character strings by delimiters.
When a log is parsed with the existing parsing rules and a complete and accurate result cannot be obtained, the log is considered a new log and the parsing rules need to be updated. At least two entries or lines of original log data are therefore prefetched as the basis for generating the parsing rule, where an "entry" refers to a log received as a stream and a "line" refers to a log received as a file; there is no essential difference between the two.
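The trigger described above might be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `is_new_log` and `collect_samples`, the stored pattern, and the sample stream are all hypothetical, and "cannot be parsed" is approximated as "no stored log regular expression fully matches".

```python
import re

def is_new_log(log, known_patterns):
    """Return True when no stored log regular expression fully matches the log."""
    return not any(re.fullmatch(pattern, log) for pattern in known_patterns)

def collect_samples(stream, known_patterns, minimum=2):
    """Gather logs that no rule can parse until `minimum` samples are available."""
    samples = []
    for log in stream:
        if is_new_log(log, known_patterns):
            samples.append(log)
            if len(samples) >= minimum:
                break
    return samples

# Hypothetical data: one existing rule and a mixed stream of logs.
known = [r"\d+ OK"]
stream = ["200 OK", "2018-11-12 login admin", "404 OK", "2018-11-14 logoff user1"]
samples = collect_samples(stream, known)
```

The sketch treats every unmatched log as belonging to the same new log type; a real system would also need to group unmatched logs by kind before generating a rule.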
After the original log data is prefetched, firstly, the separator of the log is determined, namely, the symbols for dividing different field contents in the log are determined;
the separators themselves usually have no specific meaning and serve only to separate or delimit; the field contents with actual meaning (such as character strings) lie between two adjacent separators. Common separators include ",", "/", and the space. For example, the separator of the log "AAA BBB CCC" is the space character, and the separator of the log "DDD_EE_F" is "_".
Determining the separators includes: selecting a sample log and comparing the other logs with it element by element; when the same symbol appears at the same position in at least two logs, the symbol is a common symbol and its position is recorded; each element comprises at least one character, consecutive letters and/or digits are regarded as one element, and symbols are the characters other than letters and digits that serve to connect or delimit;
when only one kind of common symbol exists, that common symbol is the separator;
when at least two kinds of common symbols exist, the separator is determined according to the degree of association among the elements divided by each common symbol; the association includes whether the elements combine to express a complete meaning.
Secondly, each character string in the set is identified as static or dynamic: if a character string split off by the separators remains unchanged in every log, it is identified as a static character string; otherwise it is identified as a dynamic character string. The positions of the static and dynamic character strings in the log are recorded.
For the static character strings in the set, a corresponding regular expression is matched according to the character string type, and the static character strings are filtered out and removed. The character string types include all symbols, all letters, all digits, and combinations of symbols, letters, and/or digits; for example, the character string "Name" is all letters, the character string "8080" is all digits, and the character string "user01" is a combination of letters and digits.
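The character string types listed above could be determined with a small classifier such as the one below. It is only a sketch: the function name `string_type` and the type labels are assumptions chosen for illustration, and any character that is not an ASCII letter or digit is counted as a symbol.

```python
def string_type(s):
    """Describe a string by the kinds of characters it contains.

    Returns "letters", "digits", "symbols", or a "+"-joined combination
    such as "letters+digits". Non-ASCII characters count as symbols here.
    """
    kinds = []
    if any(c.isascii() and c.isalpha() for c in s):
        kinds.append("letters")
    if any(c.isascii() and c.isdigit() for c in s):
        kinds.append("digits")
    if any(not (c.isascii() and c.isalnum()) for c in s):
        kinds.append("symbols")
    return "+".join(kinds)

print(string_type("Name"), string_type("8080"), string_type("user01"))
```

The resulting label can then serve as the lookup key into a preset regular expression library.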
For the dynamic character string, determining the Chinese description of the dynamic character string at the position by comparing the actual meanings of the dynamic character string contents at the same position of different logs;
presetting a regular expression library, and selecting a corresponding regular expression from it according to the type of the dynamic character string at each position; matching each selected regular expression against the corresponding dynamic character string to verify that it matches the content of the character string accurately: if the match succeeds, the regular expression is saved; if the match fails, another regular expression is selected from the library until the match succeeds;
creating a mapping structure of the Chinese description and the regular expression, wherein the mapping structure is a data table which takes the Chinese description of the dynamic character string as a field name and takes the corresponding regular expression as content;
and taking original log data, replacing the dynamic character string with a corresponding regular expression according to the recorded position of the dynamic character string, and keeping the static character string unchanged to obtain the log regular expression.
And finally, storing the mapping structure and the log regular expression as an analysis rule.
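The select-and-verify loop for regular expressions described above might look like the following sketch. The library contents, the type labels used as keys, and the name `select_regex` are hypothetical; the patent does not disclose what its preset library contains.

```python
import re

# Hypothetical preset library: character-string type -> candidate expressions.
REGEX_LIBRARY = {
    "digits": [r"\d+"],
    "letters": [r"[A-Za-z]+"],
    "letters+digits": [r"[A-Za-z0-9]+"],
    "digits+symbols": [
        r"\d{4}-\d{1,2}-\d{1,2}",      # date such as 2018-11-12
        r"\d{1,2}:\d{2}:\d{2}",        # time such as 11:20:33
        r"\d{1,3}(?:\.\d{1,3}){3}",    # dotted IPv4-style address
    ],
}

def select_regex(kind, samples):
    """Return the first candidate that fully matches every sample, else None."""
    for candidate in REGEX_LIBRARY.get(kind, []):
        if all(re.fullmatch(candidate, sample) for sample in samples):
            return candidate
    return None

print(select_regex("digits+symbols", ["08:15:00", "23:59:59"]))
```

Verifying each candidate against all observed values at a position is one way to realize the "match, and reselect on failure" check; returning `None` signals that the library needs a new entry.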
As shown in fig. 2, the process of analyzing the log by using the analysis rule obtained by the generation method includes:
obtaining the log data to be analyzed, wherein the same log can be transmitted in a streaming mode or a file mode.
The parsing rule is executed according to the configured parsing strategy, which includes executing the regular expressions configured in the strategy; the strategy can be configured by selecting the corresponding regular expressions through their Chinese descriptions.
Specifically, the log data is matched against the regular expressions contained in the parsing rule configured by the strategy; after a successful match, the corresponding Chinese description is attached to the parsing result as an annotation, which makes the result more intuitive to read.
The following examples are given to illustrate the details.
Take the following two pre-fetched logs as an example:
“2018-11-12 11:20:33 192.168.19.1 login admin.”
“2018-11-14 21:16:45 192.168.19.1 logoff user1.”
step 1, comparing the log contents, and determining a separator:
comparing the two logs element by element, i.e. comparing the first element with the first element, the second element with the second element, and so on, where consecutive letters or digits, such as "2018", "192", "login", and "user1", are each regarded as one element; at the same positions of the two logs, the identical symbols are "-", the space, ":", and "." (that is, only symbols that are identical and appear at the same position in different logs can be separators).
For ".", there are two cases: the "." at the beginning or end of the log is excluded, because symbols at the beginning and end of a log generally do not act as separators; the "." in the middle of the log appears three times in succession, and the elements it divides are all numbers. This group of numbers represents a familiar IP address: the four elements are used as a whole, and "." acts as a connector rather than as a separator between character strings.
"-" appears twice in succession and then no more, and ":" likewise appears twice in succession and then no more; the elements they divide are all numbers and are used as a whole, so these symbols act as connectors rather than as separators between character strings.
The spaces are distributed evenly through the log and appear four times (that is, a character string with actual meaning lies between every two adjacent spaces). Only the space therefore meets the requirements of a separator, so the separator of these logs is determined to be the space, and the positions of the separators are recorded so that the logs can be split in the next step.
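Step 1 can be sketched in code as follows. The "association" test is described only loosely in the text, so the heuristic used here is an assumption: a common symbol is treated as a connector when every interior occurrence sits between two all-digit elements, and symbols at the very start or end of a log are ignored. The names `tokenize` and `find_separators` are hypothetical.

```python
import re

def tokenize(log):
    """Split a log into elements: runs of letters/digits, or single symbols."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", log)

def find_separators(logs):
    """Return the set of common symbols that behave as separators."""
    token_lists = [tokenize(log) for log in logs]
    sample = token_lists[0]
    length = min(len(tokens) for tokens in token_lists)
    # Common symbols: the same symbol at the same element position in every log.
    common = {}
    for i in range(length):
        token = sample[i]
        is_symbol = re.fullmatch(r"[^A-Za-z0-9]", token) is not None
        if is_symbol and all(tokens[i] == token for tokens in token_lists):
            common.setdefault(token, []).append(i)
    if len(common) == 1:
        return set(common)
    separators = set()
    for symbol, positions in common.items():
        # Ignore occurrences at the beginning or end of the log.
        inner = [i for i in positions if 0 < i < length - 1]
        joins_numbers = all(
            tokens[i - 1].isdigit() and tokens[i + 1].isdigit()
            for i in inner
            for tokens in token_lists
        )
        if inner and not joins_numbers:
            separators.add(symbol)
    return separators

logs = [
    "2018-11-12 11:20:33 192.168.19.1 login admin.",
    "2018-11-14 21:16:45 192.168.19.1 logoff user1.",
]
print(find_separators(logs))
```

For the two example logs, "-", ":", and the interior "." only ever join numbers, while the space also sits next to "login" and "logoff", so only the space is returned.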
Step 2, splitting the log data into character strings by using separators:
for example, split the first log into a set of strings:
{2018-11-12
11:20:33
192.168.19.1
login
admin};
the second log is split into a set of strings:
{2018-11-14
21:16:45
192.168.19.1
logoff
user1}.
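A minimal sketch of this splitting step is shown below. It assumes the separator has already been determined and, as a simplification, uses `str.split` instead of recorded separator positions; the helper name `split_log` is hypothetical.

```python
import re

def split_log(log, separator=" "):
    """Trim symbols at the start and end of the log, then split on the separator."""
    trimmed = re.sub(r"^[^A-Za-z0-9]+|[^A-Za-z0-9]+$", "", log)
    return trimmed.split(separator)

first = split_log("2018-11-12 11:20:33 192.168.19.1 login admin.")
second = split_log("2018-11-14 21:16:45 192.168.19.1 logoff user1.")
```

Trimming the leading and trailing symbols is what drops the final "." so that the last element is "admin" rather than "admin.".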
step 3, distinguishing whether the character string is static or dynamic:
comparing the character strings of the two sets from step 2 position by position shows that the first, second, fourth, and fifth character strings change and are identified as dynamic character strings; the third character string remains unchanged and is identified as a static character string. In practice, comparing only two logs cannot determine with complete accuracy whether a character string is static or dynamic. For example, "192.168.19.1" in this embodiment may be static if it is the IP address of a device but dynamic if it is the IP address of a user. As more logs accumulate, the accuracy gradually improves until the identification becomes complete and accurate.
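The position-by-position comparison of step 3 can be sketched as follows; the function name `classify_fields` and the label strings are illustrative assumptions.

```python
def classify_fields(field_lists):
    """Label each position 'static' if its value is identical in every log."""
    labels = []
    for values in zip(*field_lists):
        labels.append("static" if len(set(values)) == 1 else "dynamic")
    return labels

first = ["2018-11-12", "11:20:33", "192.168.19.1", "login", "admin"]
second = ["2018-11-14", "21:16:45", "192.168.19.1", "logoff", "user1"]
labels = classify_fields([first, second])
print(labels)
```

Because the function accepts any number of split logs, passing in more samples is how the accuracy improvement mentioned above would be obtained: a position that looks static across two logs may turn dynamic once a third log differs.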
Step 4, static character strings are removed according to their character string type. In this embodiment, the character string "192.168.19.1" represents an IP address, and its regular expression may be:
[Figure 3: regular expression for an IP address]
this regular expression can be used to remove the static character string; for example, the first log can be processed as:
{2018-11-12
11:20:33
login
admin};
the rest are dynamic character strings, and for the dynamic character strings, two kinds of processing are carried out, firstly, the meanings of the dynamic character strings in the two logs are combined, and Chinese description is marked:
{2018-11-12 date
11:20:33 time
login operation
admin username };
secondly, a regular expression is adapted to each dynamic character string: the type of the character string is determined from its composition, and a corresponding regular expression is then matched from the preset regular expression library. In this embodiment, the types of the dynamic character strings are a combination of digits and symbols (more precisely, a date format), a combination of digits and symbols (more precisely, a time format), all letters, and a combination of letters and digits. The date and the time often appear next to each other in logs. The regular expression for the date may be: \d{4}(-|/|.)\d{1,2}\1\d{1,2}; the regular expression for the time may be:
[Figure 4: regular expression for the time]
the regular expression for the operation (an English word) may be: \b[a-zA-Z]+\b; the regular expression for the user name may be: [A-Za-z0-9\u4e00-\u9fa5]+; a mapping structure of the Chinese descriptions and the regular expressions is thereby obtained, as shown in the following table:
[Figure 2: table mapping each Chinese description to its regular expression]
Step 5, replacing the dynamic character strings with regular expressions: each dynamic character string in the original log is replaced with its corresponding regular expression, giving the following log regular expression:
[Figure 5: log regular expression]
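Since the resulting expression is available only as an image, the sketch below shows one way step 5 could be carried out; it is not the expression from the figure. The function name `build_rule`, the rule layout, the English labels standing in for the Chinese descriptions, and the individual regular expressions are all assumptions. Every field is wrapped in a capture group so that positions are preserved, and static fields are kept as escaped literal text.

```python
import re

def build_rule(sample_fields, separator, dynamic):
    """Build a parsing rule from one split sample log.

    `dynamic` maps a field position to (description, regex). All other
    positions are static and kept unchanged as literal text. The supplied
    expressions are assumed to contain no capture groups of their own.
    """
    parts, descriptions, mapping = [], [], {}
    for position, field in enumerate(sample_fields):
        if position in dynamic:
            description, regex = dynamic[position]
            mapping[description] = regex
        else:
            description, regex = None, re.escape(field)
        parts.append("(" + regex + ")")
        descriptions.append(description)
    return {
        "log_regex": re.escape(separator).join(parts),
        "mapping": mapping,
        "descriptions": descriptions,
    }

fields = ["2018-11-12", "11:20:33", "192.168.19.1", "login", "admin"]
dynamic = {
    0: ("date", r"\d{4}-\d{1,2}-\d{1,2}"),
    1: ("time", r"\d{1,2}:\d{2}:\d{2}"),
    3: ("operation", r"[A-Za-z]+"),
    4: ("user name", r"[A-Za-z0-9]+"),
}
rule = build_rule(fields, " ", dynamic)
```

The `mapping` entry corresponds to the description-to-expression table, and `log_regex` to the log regular expression; together they form the parsing rule that is stored.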
Step 6, the mapping structure of the Chinese descriptions and the log regular expression are stored as a parsing rule. When logs of the same kind are parsed, the rule is executed, and in the result each dynamic character string is automatically shown with its Chinese description, which makes the result easy to read directly.
For example, when the second log is parsed, the output may take the following form:
date 2018-11-14/time 21:16:45/192.168.19.1/operation logoff/user name user1
The symbol "/" here merely separates the contents of the fields; it has no actual meaning and can be replaced by other symbols.
For the static part 192.168.19.1, the type may be determined from its source. For example, if enough logs confirm that it is a user's IP address, the Chinese description "user IP" may be added in front of it; likewise, if it is identified as a device IP address, the Chinese description "device IP" may be added for ease of understanding.
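Storing the rule and using it to annotate a log might be sketched as follows. The JSON layout, the helper names `parse_log` and `render`, the English labels standing in for the Chinese descriptions, and the regular expressions themselves are assumptions made for illustration; a static field carries no description (`None`) and is printed as-is.

```python
import json
import re

# Hypothetical stored rule, serialized as JSON.
STORED_RULE = json.dumps({
    "log_regex": (
        r"(\d{4}-\d{1,2}-\d{1,2}) (\d{1,2}:\d{2}:\d{2}) "
        r"(192\.168\.19\.1) ([A-Za-z]+) ([A-Za-z0-9]+)\.?"
    ),
    "descriptions": ["date", "time", None, "operation", "user name"],
})

def parse_log(rule, log):
    """Return (description, value) pairs, or None when the log does not match."""
    match = re.fullmatch(rule["log_regex"], log)
    if match is None:
        return None
    return list(zip(rule["descriptions"], match.groups()))

def render(pairs, joiner="/"):
    """Prefix each value with its description; static values stay bare."""
    return joiner.join(f"{d} {v}" if d else v for d, v in pairs)

rule = json.loads(STORED_RULE)
pairs = parse_log(rule, "2018-11-14 21:16:45 192.168.19.1 logoff user1.")
print(render(pairs))
```

Adding a description such as "user IP" to the static field would only require replacing the `None` entry in `descriptions`.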

Claims (10)

1. A method for generating a log parsing rule is characterized by comprising the following steps:
prefetching original log data, and splitting log contents into a set with character strings as units through separators;
identifying whether each character string is static or dynamic, and removing the character string if it is static;
if the character string is dynamic, determining its actual meaning and labeling each dynamic character string with a corresponding Chinese description; adapting a regular expression to each dynamic character string; creating a mapping structure of the Chinese descriptions and the regular expressions;
replacing the dynamic character string in the original log with a corresponding regular expression to obtain the regular expression of the log;
and storing the mapping structure and the log regular expression as an analysis rule.
2. The method of claim 1, wherein the splitting of the log comprises:
comparing the prefetched log contents, determining a separator and recording the position of the separator; and splitting the log content into a set consisting of independent character strings according to the positions of the separators.
3. The method of claim 2, wherein the determining of the separator comprises:
selecting a sample log and comparing the other logs with it element by element; when the same symbol appears at the same position in at least two logs, the symbol is a common symbol and its position is recorded; each element comprises at least one character, consecutive letters and/or digits are regarded as one element, and symbols are the characters other than letters and digits that serve to connect or delimit;
when only one kind of common symbol exists, that common symbol is the separator;
when at least two kinds of common symbols exist, the separator is determined according to the degree of association among the elements divided by each common symbol; determining the association includes determining whether the common symbol is used together with other characters as a whole.
4. The method of claim 3, wherein if the character string split by the separator remains unchanged in the log, the character string is recognized as a static character string; otherwise, identifying the character string as a dynamic character string;
and recording the positions of the static and dynamic character strings in the log.
5. The method of claim 4, wherein the Chinese description of the dynamic string at the location is determined by comparing the meanings of the contents of the dynamic strings at the same location in different logs.
6. The method according to claim 5, characterized in that a regular expression library is preset, and a corresponding regular expression is selected from it according to the type of the dynamic character string at each position; the character string types include all symbols, all letters, all digits, and combinations of symbols, letters, and/or digits.
7. The method of claim 6, wherein the mapping structure is a data table with Chinese descriptions of dynamic strings as field names and regular expressions of corresponding strings as contents.
8. The method according to claim 6, characterized in that the original log data is taken, each dynamic character string is replaced with its corresponding regular expression according to the recorded position of the dynamic character string, and the static character strings are kept unchanged, to obtain the log regular expression.
9. The method of claim 6, wherein each selected regular expression is matched against the corresponding dynamic character string, and if the match succeeds, the regular expression is saved; if the match fails, another regular expression is selected from the regular expression library until the character string is matched successfully.
10. The method according to claim 6, characterized in that a corresponding regular expression is matched according to the character string type of the static character string, and the matched content is removed.
CN201910822081.6A 2019-09-02 2019-09-02 Method for generating log analysis rule Pending CN110795606A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910822081.6A CN110795606A (en) 2019-09-02 2019-09-02 Method for generating log analysis rule

Publications (1)

Publication Number Publication Date
CN110795606A true CN110795606A (en) 2020-02-14

Family

ID=69427145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910822081.6A Pending CN110795606A (en) 2019-09-02 2019-09-02 Method for generating log analysis rule

Country Status (1)

Country Link
CN (1) CN110795606A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113672483A (en) * 2021-08-09 2021-11-19 深圳市猿人创新科技有限公司 Equipment log storage method and device, electronic equipment and medium
CN113672482A (en) * 2021-08-09 2021-11-19 深圳市猿人创新科技有限公司 Log message transmission method, device, equipment and medium of terminal equipment
CN113672483B (en) * 2021-08-09 2024-05-31 深圳市猿人创新科技有限公司 Device log storage method and device, electronic device and medium
CN114860673A (en) * 2022-07-06 2022-08-05 南京聚铭网络科技有限公司 Log feature identification method and device based on dynamic and static combination
CN114860673B (en) * 2022-07-06 2022-09-30 南京聚铭网络科技有限公司 Log feature identification method and device based on dynamic and static combination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination