CN116861879A - Log template extraction method, device, equipment and storage medium

Info

Publication number: CN116861879A
Application number: CN202310957859.0A
Authority: CN
Inventors: 李帅
Original assignee: China Merchants Bank Co Ltd
Current assignee: China Merchants Bank Co Ltd
Priority date: 2023-07-31
Filing date: 2023-07-31
Publication date: 2023-10-10

Abstract

The application discloses a log template extraction method, device, equipment and storage medium, and belongs to the field of computers. The method preprocesses a log file to obtain a preprocessed log file. If the log file is a Chinese-English mixed log, the Chinese and English of the preprocessed log file are separated to obtain a Chinese log and an English log; word segmentation is performed on the Chinese log and the English log respectively to obtain a Chinese word sequence and an English word sequence; a sequence of sequential words is obtained based on the Chinese word sequence and the English word sequence; and a log template is extracted based on the sequence of sequential words. This avoids the complexity caused by language mixing and improves the accuracy of log template extraction.

Description

Log template extraction method, device, equipment and storage medium

Technical Field

The present application relates to the field of computers, and in particular, to a method, an apparatus, a device, and a storage medium for extracting a log template.

Background

Event logs are an important source of information in system and network management. As event logs grow in volume and complexity, analyzing them manually has become cumbersome, so recent research has focused on automated analysis of log files. A leading technique at present is IPLoM, which mainly comprises the following steps: partitioning the log file by the length of the log messages; comparing word variation between logs of the same length and further partitioning the logs by the position with the least word variation; partitioning clusters by the mapping relation between the sets of unique tokens at two token positions; and generating a log template from each cluster.

When partitioning logs, the IPLoM technique often ignores some common words that carry little meaning, called stop words, in order to reduce computational complexity. However, the stop word lists of Chinese and English differ, so different stop word lists may be needed for the two languages when processing a Chinese-English mixed log. As a result, the IPLoM technique obtains good template extraction results on standard logs such as pure English logs, but has difficulty with non-standard logs such as Chinese-English mixed logs. A new solution is therefore needed to process non-standard log files more accurately.

Disclosure of Invention

The main purpose of the application is to provide a log template extraction method, device, equipment and storage medium, so as to solve the technical problem that log templates are extracted poorly from non-standard logs.

In a first aspect, in order to achieve the above object, the present application provides a log template extraction method, including:

preprocessing the log file to obtain a preprocessed log file;

if the log file is a Chinese-English mixed log, separating Chinese and English of the preprocessed log file to obtain a Chinese log and an English log;

word segmentation processing is respectively carried out on the Chinese log and the English log to obtain a Chinese word sequence and an English word sequence;

based on the Chinese word sequence and the English word sequence, obtaining a sequence of sequential words;

and extracting the log template based on the sequence of the sequential words.

Optionally, extracting the log template based on the sequence of sequential words includes:

comparing the sequence of the sequential words with each stock log template in a preset log template set;

if none of the stock log templates has a similarity to the sequence of sequential words greater than a preset threshold, extracting the log template based on the sequence of sequential words.

Optionally, after comparing the sequence of sequential words with each stock log template in the preset log template set, the method further includes:

if one or more of the stock log templates have a similarity greater than the preset threshold, classifying the log file into the similar stock log template corresponding to the maximum value of the similarity.

Optionally, the classifying the log file into a similar stock log template corresponding to the maximum value of the similarity includes:

updating the similar stock log template based on the sequence of sequential words.

Optionally, comparing the sequence of sequential words with each stock log template in the preset log template set includes:

acquiring the same word segmentation number between a constant part in the sequence of sequential words and a constant part in the stock log template; the constant part is a part which is not converted into a regular expression;

taking the ratio of the same word segmentation number to the length value of the constant part of the stock log template as the similarity of the sequence of sequential words and the stock log template.

Optionally, the comparing the sequence of sequential words with each stock log template in the preset log template set includes:

and if the preset template set is empty, extracting a log template based on the sequence of the sequential words.

Optionally, preprocessing the log file to obtain a preprocessed log file includes:

converting the replacement characteristic text in the log file into a regular expression to obtain an intermediate log file;

and cutting out the effective information of the intermediate log file to obtain the preprocessed log file.

In a second aspect, to achieve the above object, the present application further provides a log template extraction device, where the log template extraction device includes:

the preprocessing module is used for preprocessing the log file to obtain a preprocessed log file;

the Chinese-English separating module is used for separating the Chinese and English of the preprocessed log file if the log file is a Chinese-English mixed log, so as to obtain a Chinese log and an English log;

the word segmentation module is used for respectively carrying out word segmentation processing on the Chinese log and the English log to obtain a Chinese word sequence and an English word sequence;

the ordering module is used for obtaining an order word sequence based on the Chinese word sequence and the English word sequence;

and the extraction module is used for extracting the log template based on the sequence of the sequential words.

In a third aspect, to achieve the above object, the present application further provides a log template extraction apparatus, which includes:

a processor, a memory and a log template extraction program stored in the memory, wherein the log template extraction program realizes the steps of the log template extraction method when being run by the processor.

In a fourth aspect, to achieve the above object, the present application further provides a computer-readable storage medium, on which a log template extraction program is stored, the log template extraction program implementing a log template extraction method when executed by a processor.

According to the log template extraction method, device, equipment and storage medium, the Chinese and English contents of a Chinese-English mixed log are separated, and word segmentation is performed on the separated Chinese log and English log respectively, so that the complexity caused by language mixing is avoided and the accuracy of log template extraction is improved.

Drawings

FIG. 1 is a schematic diagram of a log template extraction apparatus according to the present application;

FIG. 2 is a flowchart of a log template extraction method according to a first embodiment of the present application;

FIG. 3 is a flowchart of a second embodiment of a log template extraction method according to the present application;

FIG. 4 is a schematic diagram of a functional module of a third embodiment of the log template extraction method of the present application.

The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

The existing IPLoM technology usually ignores some common words that carry little meaning, called stop words, when performing word segmentation, so as to reduce computational complexity. However, the stop word lists of Chinese logs and English logs are different, so different stop word lists may be needed for the two languages when a Chinese-English mixed log is processed, which increases the complexity of processing. As a result, the IPLoM technology obtains good template extraction results on standard logs, but has difficulty with non-standard logs whose content mixes Chinese and English. A new solution is therefore needed to process such non-standard log files more accurately.

The application provides a solution, which avoids the complexity caused by language mixing by separating Chinese and English contents of Chinese and English mixed logs and respectively performing word segmentation processing on the Chinese logs and the English logs obtained after separation. The application improves the accuracy of extracting the log template through more efficient and accurate processing of the Chinese and English mixed log.

The log template extraction device used to implement the technology of the application is described first.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a log template extraction device of a hardware running environment according to an embodiment of the present application.

As shown in fig. 1, the log template extraction apparatus may include: a processor 1001, such as a CPU, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may comprise a voice pick-up module, such as a microphone array; optionally, the user interface 1003 may also comprise a display screen (Display) or an input unit such as a keyboard (Keyboard). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. Optionally, the memory 1005 may also be a storage device separate from the processor 1001.

It is to be appreciated that the log template extraction device can also include a network interface 1004, and the network interface 1004 can optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). Optionally, the log template extraction device may further include an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi module, and the like.

It will be appreciated by those skilled in the art that the log template extraction device structure shown in fig. 1 does not constitute a limitation of the log template extraction device, and may include more or fewer components than shown, or may combine certain components, or may be a different arrangement of components.

Based on the hardware structure of the log template extraction device, but not limited to the hardware structure, the present application provides a first embodiment of a log template extraction method. Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the log template extraction method of the present application.

It should be noted that although a logical order is depicted in the flowchart, in some cases the steps depicted or described may be performed in a different order than presented herein.

In this embodiment, the log template extraction method includes:

Step S100, preprocessing the log file to obtain a preprocessed log file.

In this embodiment, the preprocessing may convert the original data into a unified format and structure, ensuring consistency and comparability of the data. This helps to better understand and interpret the data during subsequent analysis, and at the same time, noise, outliers or invalid data in the data can be removed or processed through preprocessing, thereby improving the quality and accuracy of the data.

Step S200, if the log file is a Chinese-English mixed log, separating Chinese and English of the preprocessed log file to obtain a Chinese log and an English log.

Specifically, Chinese characters in the Chinese-English mixed log are identified and separated to obtain the Chinese log, and English characters are identified and separated to obtain the English log. Chinese and English characters can be identified based on character codes or based on a neural network model. Character-code separation relies on the clear difference between Chinese and English characters with respect to ASCII coding: ASCII contains English characters and some special symbols but no Chinese characters, so in a mixed log a Chinese character typically corresponds to a non-ASCII character. Each character in the log text is traversed, and it is determined whether its code value falls within the range of English characters (typically between 0x0041 and 0x007A). If so, the character is considered to belong to the English part; if not, it is considered to belong to the Chinese part.
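The character-code separation described above can be sketched as follows. This is a minimal illustration rather than the application's implementation: the function names, the handling of whitespace as a separator, and the recording of each run's starting offset are assumptions made for the example. Note that, applied literally, the cited range 0x41 to 0x7A excludes digits and includes a few punctuation characters.

```python
def is_english_char(ch: str) -> bool:
    """True if the code point lies in the range cited in the text (0x41 to 0x7A)."""
    return 0x41 <= ord(ch) <= 0x7A


def separate(line: str):
    """Split a mixed log line into (start_offset, text) runs for each language.

    Whitespace ends the current run and is dropped (an assumption).
    """
    chinese, english = [], []
    run_start, run_kind = None, None
    for i, ch in enumerate(line):
        if ch.isspace():
            kind = None
        elif is_english_char(ch):
            kind = "en"
        else:
            kind = "zh"
        if kind != run_kind:
            # Close the run that just ended, if there was one.
            if run_kind is not None:
                target = english if run_kind == "en" else chinese
                target.append((run_start, line[run_start:i]))
            run_start, run_kind = i, kind
    if run_kind is not None:
        target = english if run_kind == "en" else chinese
        target.append((run_start, line[run_start:]))
    return chinese, english


chinese_log, english_log = separate("用户admin从server登录")
print(chinese_log)  # [(0, '用户'), (7, '从'), (14, '登录')]
print(english_log)  # [(2, 'admin'), (8, 'server')]
```

Keeping the starting offset of each run is one simple way to make the original order recoverable later; the application does not state how order is tracked.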

The purpose of the separation is to allow log texts in different languages to be handled differently in subsequent processing. For example, the Chinese log and the English log can be segmented with different word segmentation tools and methods. Once the Chinese text and the English text are separated, different analysis, mining, translation and other operations can be performed on each according to actual requirements. The content of the mixed log can thus be processed and understood more accurately, which improves the utilization value of the log information and the efficiency of its analysis.

Step S300, word segmentation processing is carried out on the Chinese log and the English log respectively, and a Chinese word sequence and an English word sequence are obtained.

English words are typically separated by spaces or punctuation, whereas Chinese text is a continuous sequence of characters, so Chinese word segmentation is relatively complex and requires more semantic and contextual information to be considered. Therefore, in this embodiment, different word segmenters are used for the Chinese log and the English log, so as to obtain a Chinese word sequence and an English word sequence.

The word segmentation is an important step of extracting the log template, can improve the text processing efficiency, helps to extract meaningful features and information, and provides a good data basis for subsequent text analysis, so that the quality and effect of extracting the log template are improved.

Step S400, based on the Chinese word sequence and the English word sequence, obtaining the sequence of sequential words.

It can be understood that segmenting the Chinese log and the English log separately breaks up the original log structure: the Chinese word sequence and the English word sequence on their own no longer reflect how the words were interleaved. In this step, based on the character order in the log file, the Chinese word sequence and the English word sequence are recombined into the sequence of sequential words, so that the original order of the Chinese words and English words in the log file is preserved. This keeps the meaning of the original log and facilitates the subsequent template extraction task.
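One way to picture the recombination is the sketch below. It assumes that every word in the Chinese word sequence and the English word sequence carries the character offset at which it started in the original log line; the application does not say how the order is tracked, so this representation and the function name `merge_in_order` are hypothetical.

```python
def merge_in_order(chinese_words, english_words):
    """Merge (offset, word) pairs from both languages into one ordered word list."""
    combined = sorted(chinese_words + english_words, key=lambda pair: pair[0])
    return [word for _, word in combined]


chinese_words = [(0, "用户"), (7, "从"), (14, "登录")]
english_words = [(2, "admin"), (8, "server")]
print(merge_in_order(chinese_words, english_words))
# ['用户', 'admin', '从', 'server', '登录']
```

Sorting by offset restores the interleaving of the two languages without needing to look at the original text again.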

Step S500, extracting a log template based on the sequence of the sequential words.

In the embodiment, the Chinese and English contents of the Chinese and English mixed logs are separated, and the separated Chinese logs and English logs are subjected to word segmentation respectively, so that the complexity caused by language mixing is avoided, and the accuracy of extracting the log template is improved through more efficient and accurate processing of the Chinese and English mixed logs.

Further, referring to fig. 3, the present application provides a second embodiment of the log template extraction method. Based on the embodiment shown in fig. 2, step S500 includes:

Step S510, comparing the sequence of sequential words with each stock log template in the preset log template set.

Step S520, if none of the stock log templates has a similarity to the sequence of sequential words greater than the preset threshold, the log template is extracted based on the sequence of sequential words.

Step S530, if one or more of the stock log templates have a similarity greater than the preset threshold, the log file is classified into the similar stock log template corresponding to the maximum value of the similarity.

In this embodiment, a preset template set is maintained to store the existing log templates. Because the number of log patterns is finite, the size of the preset template set is bounded.

The preset threshold is compared against the similarity between the sequence of sequential words and each stock log template in the preset log template set, so as to determine whether the log file being judged belongs to the category of that stock log template. When the similarity is greater than the preset threshold, the sequence of sequential words is considered highly similar to the corresponding stock log template, and the log file corresponding to the sequence of sequential words can be classified into that stock log template. When the similarity is less than or equal to the preset threshold, the similarity is considered low, and the log file cannot be classified into that stock log template.

If no stock log template has a similarity to the sequence of sequential words greater than the preset threshold, the log file is considered not to be of the same type as any existing stock log template. A log template is therefore extracted based on the sequence of sequential words; that is, a new log template is generated and stored in the preset log template set. To ensure that templates are added in real time, the sequence of sequential words is generally used directly as the new log template. For example, if the sequence of sequential words obtained from step S400 is ["user", "from", "login"], the new log template is "user from login".

If one or more stock log templates with the similarity to the sequence of the sequential words larger than a preset threshold exist in all the stock log templates, the stock log template corresponding to the maximum value of the similarity is used as a similar stock log template, and then the log files corresponding to the sequence of the sequential words are classified into the similar stock log templates.

That is, in this embodiment, the log file is compared with each stock log template stored in the preset log template set to determine whether any stock log template has a similarity meeting the requirement. If so, the log file is classified into that template; if not, a new log template is generated.

Of course, the preset threshold may be customized based on the actual requirements of the category judgment.

As a specific embodiment, the similar stock log template may also be updated based on the sequence of sequential words when step S530 is performed.

In this embodiment, if every part of the sequence of sequential words is the same as the similar stock log template, the similar stock log template does not need to be changed, and the log file is directly classified into it.

If not, the similar stock log template is updated accordingly: the parts that differ between the sequence of sequential words and the similar stock log template are converted into regular expressions. In this embodiment, the differing parts are typically all converted into the same regular expression.
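The update rule can be illustrated with the sketch below. It makes two assumptions that the application does not state: the template and the sequence of sequential words have the same length and are compared position by position, and the shared regular expression is represented by the placeholder "*". The names `WILDCARD` and `update_template` are hypothetical.

```python
WILDCARD = "*"  # assumed stand-in for the single shared regular expression


def update_template(template, words):
    """Replace every position where the template and the new log differ with WILDCARD."""
    if len(template) != len(words):
        # Alignment of different lengths is not described in the text.
        raise ValueError("this sketch only handles sequences of equal length")
    return [t if t == w else WILDCARD for t, w in zip(template, words)]


old_template = ["user", "alice", "from", "login"]
new_log = ["user", "bob", "from", "login"]
print(update_template(old_template, new_log))  # ['user', '*', 'from', 'login']
```

If the two sequences are identical, the function returns the template unchanged, which matches the case where no update is needed.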

In addition, in an embodiment, the method further includes step S550, if the preset template set is empty, extracting the log template based on the sequence of sequential words.

In this embodiment, the preset template set initially contains no stock log template, so the sequence of sequential words generated in step S400 has no stock log template to be compared with. In this case, the sequence of sequential words is directly used as the first stock log template and maintained in the preset template set.

By comparing the sequence of sequential words of a new log with the stock log templates, new logs can be clustered and managed in real time: a log template is added when no matching template exists, and the log is classified when a matching template exists. This improves the accuracy, degree of automation and adaptability of real-time log processing and analysis, so that problems can be found and solved in time and events can be responded to quickly.
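The decision flow of steps S510 to S530, together with the empty-set case, might look like the following sketch. The function names, the default threshold of 0.5 and the stand-in scoring function `shared_ratio` are assumptions chosen for illustration; any similarity measure can be passed in.

```python
def classify(words, templates, similarity, threshold=0.5):
    """Return the index of the template the log is assigned to.

    `templates` is a list of word lists and is extended in place when a
    new template has to be added.
    """
    best_index, best_score = None, threshold
    for index, template in enumerate(templates):
        score = similarity(words, template)
        if score > best_score:  # must be strictly greater than the threshold
            best_index, best_score = index, score
    if best_index is None:
        # Covers both an empty template set and "nothing similar enough".
        templates.append(list(words))
        return len(templates) - 1
    return best_index


def shared_ratio(words, template):
    """Stand-in score: share of template words that also occur in the log."""
    return sum(1 for t in template if t in words) / len(template)


templates = []
print(classify(["user", "from", "login"], templates, shared_ratio))    # 0, first template
print(classify(["access", "IP", "address"], templates, shared_ratio))  # 1, new template
print(classify(["user", "from", "login"], templates, shared_ratio))    # 0, classified
```

Starting `best_score` at the threshold lets one loop handle both the threshold test and the search for the maximum similarity.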

Further, referring to fig. 4, the present application provides a third embodiment of the log template extraction method. Based on the embodiment shown in fig. 2, step S510 includes:

Step S511, the same word segmentation number between the constant part in the sequence of sequential words and the constant part in the stock log template is obtained.

Wherein the constant portion is a portion that is not converted to a regular expression.

Step S512, the ratio of the same word segmentation number to the length value of the constant part of the stock log template is used as the similarity of the sequence of the sequential words and the stock log template.

In this embodiment, the similarity comparison is mainly performed using the constant part. For example, suppose the preset log template set includes the following stock log templates:

Stock log template 1: user from login. The constant part is: ["user", "from", "login"].

Stock log template 2: visited IP address is. The constant part is: [ "access", "IP", "address" ].

When executing step S400, the log template extraction apparatus obtains a new sequence of sequential words whose constant part is: ["user", "from", "login"].

The same word segmentation number (the number of word segments that occur in both) between the sequence of sequential words and stock log template 1 is therefore 3 ("user", "from" and "login" occur in both), and the length of the constant part of stock log template 1 is 3 (it contains 3 words), so the similarity of the sequence of sequential words and stock log template 1 is s = 3/3 = 1. The similarity with stock log template 2 is calculated in the same way and is not described in detail here. In this embodiment, the similarity can be calculated by this method even if the constant parts of the sequence of sequential words and the stock log template differ in length.
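The calculation in this example can be reproduced with a short sketch. The function name `similarity` is hypothetical, and counting shared words by simple membership (ignoring repeated words) is an assumption, since the text does not say how duplicates are handled.

```python
def similarity(sequence_constant, template_constant):
    """Ratio of shared word segments to the length of the template's constant part."""
    if not template_constant:
        return 0.0
    sequence_words = set(sequence_constant)
    shared = sum(1 for word in template_constant if word in sequence_words)
    return shared / len(template_constant)


new_sequence = ["user", "from", "login"]
template_1 = ["user", "from", "login"]
template_2 = ["access", "IP", "address"]

print(similarity(new_sequence, template_1))  # 1.0
print(similarity(new_sequence, template_2))  # 0.0
# The constant parts may differ in length:
print(similarity(["user", "from", "login", "failed"], template_1))  # 1.0
```

Because the denominator is the length of the template's constant part, extra words in the new log do not lower the score, while words missing from the log do.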

By counting the identical word segments between the constant part in the sequence of sequential words and the constant part in the stock log template, the similarity can be calculated dynamically from the specific log content. This flexibility enables the present embodiment to adapt to log templates of different types and lengths and to cope with changes in log content. The embodiment only involves word segmentation and comparison of the sequence of sequential words with the log template, so the calculation is fast and the method can be used in real-time or near real-time scenarios to extract and process log templates.

Further, the present application provides a fourth embodiment of the log template extraction method. Based on the embodiment shown in fig. 2, step S100 includes:

Step S110, converting the replacement characteristic text in the log file into a regular expression to obtain an intermediate log file.

Step S120, intercepting the effective information of the intermediate log file to obtain the preprocessed log file.

Log files typically contain text with distinctive features, such as IP addresses, phone numbers, email addresses and bank card numbers. These feature texts need to be replaced in subsequent log processing and analysis. Converting these feature texts into regular expressions enables more flexible and efficient log processing.

In this embodiment, to simplify the processing logic and reduce the risk of errors, different text features are typically converted to the same regular expression. For example, information such as an IP address and a bank card number is converted into "×".

In general, the key information of a log is at the beginning of the sentence, and in practice an overlong log contains a large amount of useless information that greatly degrades classification performance. This embodiment therefore removes the invalid information in the log and keeps only the effective information of the log file.
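Steps S110 and S120 can be sketched as follows. The three patterns, the placeholder "*", the character limit of 200 and the choice to truncate by character count are all assumptions made for this illustration; the application names the kinds of feature text but does not give the patterns or the cut-off rule.

```python
import re

PLACEHOLDER = "*"  # assumed stand-in for the single shared regular expression
MAX_CHARS = 200    # assumed cut-off for the "effective" head of the log

# Illustrative patterns only. Explicit digit lookarounds are used instead of \b
# because Chinese characters count as word characters in Python's re module.
FEATURE_PATTERNS = [
    re.compile(r"(?<![0-9])[0-9]{1,3}(?:\.[0-9]{1,3}){3}(?![0-9])"),  # IPv4-like
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),  # email-like
    re.compile(r"(?<![0-9])[0-9]{16,19}(?![0-9])"),  # bank-card-like digit run
]


def preprocess(line, max_chars=MAX_CHARS):
    """S110: replace feature text with one placeholder; S120: keep the head of the log."""
    for pattern in FEATURE_PATTERNS:
        line = pattern.sub(PLACEHOLDER, line)
    return line[:max_chars]


print(preprocess("user alice@example.com login from 10.0.0.1 card 6222020000000000"))
# user * login from * card *
print(preprocess("来自192.168.1.1的访问"))
# 来自*的访问
```

Replacement runs before truncation so that a feature text straddling the cut-off is not left half-replaced.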

It will of course be readily appreciated that, although a logical order is illustrated in the present embodiment, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein. For example, step S400 may also be performed after step S300.

Based on the same inventive concept, the application also provides a log template extraction device, which comprises:

the Chinese-English separating module is used for separating Chinese and English of the preprocessed log file if the log file is a Chinese-English mixed log, so that a Chinese log and an English log are obtained;

In addition, the embodiment of the application also provides a computer storage medium, and the storage medium stores a log template extraction program, and the log template extraction program realizes the steps of the log template extraction method when being executed by a processor. Therefore, a detailed description will not be given here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application. As an example, the program instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a computer-readable storage medium and which, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

It should be further noted that the above-described apparatus embodiments are merely illustrative, where elements described as separate elements may or may not be physically separate, and elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the application, the connection relation between the modules represents that the modules have communication connection, and can be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art will understand and implement the present application without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general purpose hardware, or of course by means of special purpose hardware including application specific integrated circuits, special purpose CPUs, special purpose memories, special purpose components, etc. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and the specific hardware structures for implementing the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. However, a software program implementation is the preferred embodiment of the present application in most cases. Based on such understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk of a computer, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method of the embodiments of the present application.

The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims

1. A method for extracting a log template, the method comprising:

preprocessing the log file to obtain a preprocessed log file;

and extracting a log template based on the sequence of the sequential words.

2. The method for extracting a log template according to claim 1, wherein the extracting a log template based on the sequence of sequential words comprises:

and if all the stock log templates do not have the stock log templates with the similarity with the sequence of the sequential words being larger than a preset threshold value, extracting the log templates based on the sequence of the sequential words.

3. The method for extracting log templates according to claim 2, wherein after comparing the sequence of sequential words with each stock log template in the preset log template set, the method further comprises:

and if all the stock log templates have the stock log templates with the similarity larger than a preset threshold value, classifying the log files into similar stock log templates corresponding to the maximum value of the similarity.

4. The method of claim 3, wherein classifying the log file into a similar stock log template corresponding to a maximum value of the similarity, comprises:

and updating the similar stock log template based on the sequence of sequential words.

5. The method for extracting log templates according to claim 2, wherein the comparing the sequence of sequential words with each stock log template in the preset log template set comprises:

acquiring the same word segmentation number between a constant part in the sequence of the sequential words and a constant part in the stock log template; the constant part is a part which is not converted into a regular expression;

and taking the ratio of the same word segmentation number to the length value of the constant part in the stock log template as the similarity of the sequence of the sequential words and the stock log template.

6. The method for extracting log templates according to claim 2, wherein the comparing the sequence of sequential words with each stock log template in the set of preset log templates comprises:

7. The method for extracting a log template according to any one of claims 1 to 6, wherein the preprocessing is performed on the log file to obtain a preprocessed log file, and the method comprises:

8. A log template extraction device, characterized in that the log template extraction device comprises:

the word segmentation module is used for carrying out word segmentation processing on the Chinese log and the English log respectively to obtain a Chinese word sequence and an English word sequence;

the sequencing module is used for obtaining a sequence of sequential words based on the Chinese word sequence and the English word sequence;

9. A log template extraction apparatus, characterized by comprising: a processor, a memory and a log template extraction program stored in the memory, which log template extraction program, when run by the processor, implements the steps of the log template extraction method according to any one of claims 1-7.

10. A computer-readable storage medium, wherein a log template extraction program is stored on the computer-readable storage medium, which when executed by a processor, implements the log template extraction method according to any one of claims 1 to 7.