CN111368534A - Application log noise reduction method and device

Info

Publication number: CN111368534A
Application number: CN201811587244.9A
Authority: CN
Inventors: 蒋通通; 叶晓龙; 孟震; 任赣; 竺士杰; 乔柏林; 胡林熙; 张琪
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2020-07-03

Abstract

An embodiment of the invention provides an application log noise reduction method and device. The method comprises: collecting an application log; performing word segmentation on the application log according to a pre-obtained word segmentation rule to obtain a feature vector; and, if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance from a topic model, removing the application log.

Description

Application log noise reduction method and device

Technical Field

The embodiment of the invention relates to the technical field of computer software, in particular to an application log noise reduction method and device.

Background

The application log is currently regarded as one of the important operation and maintenance windows for diagnosing and locating system faults; most faults can be located in real time by extracting and aggregating the behavioral features of log events. Application logs are also widely used in operational analysis. For example, deep mining and correlation analysis of user access logs can establish behavior portraits of different user groups, which supports multi-level marketing activities. However, as system scale and complexity grow, log-based fault diagnosis and operational analysis are affected by environmental factors, code quality and the like. Examples include large numbers of injected logs unrelated to faults or requirements, and disordered logs caused by inaccurate log level settings during development. Such logs interfere heavily with subsequent log analysis and are regarded as "noise data". To build an effective fault signature model and aggregate accurate operational metrics, these noise logs must be filtered out before analysis.

The current technical solutions for application log noise reduction are as follows.

Scheme one: log noise filtering based on manual, experience-based labeling. System owners such as operation and maintenance personnel periodically organize and analyze the log data output by the application, classify and screen the logs according to their long-term work experience, label the logs judged to be useless noise, and forcibly filter them when the logs are collected or stored. Most such filtering relies on keyword matching, regular-expression templates and the like. The method suits small application systems with a low log volume, where its effect is relatively thorough.

Scheme two: noise filtering based on the application log level. This method borrows the log level management conventions of current programming languages, such as the five main log levels in Java Log4j (debug, info, warn, error and fatal), which respectively grade the printing of fine-grained application debugging logs, operation logs, potential errors, operation errors and serious event logs. Following such a log level specification, developers assign the output of potential noise logs to a designated level, such as info or above, or a custom level. Filtering useless noise logs in subsequent log analysis then requires only overall control of the log level.

Scheme three: log filtering based on a noise template skip list, which extracts and distinguishes logs mainly on the basis of time-series similarity features. The target log time series is compared with a noise template, and their similarity determines whether the target is a noise log. Experiments on a real cloud computing platform show that the method improves the effectiveness of fault features.

The prior art schemes mainly have the following problems.

(1) Scheme one, manual experience-based labeling: as the cluster scale of systems keeps expanding, purely manual labeling becomes a difficult task, all the more so as agile development increases the rate of code change and causes a sudden increase in log volume and log types. Log noise that grows linearly cannot be identified and filtered quickly and accurately, and the Matthew effect becomes increasingly severe. In short, this approach is too costly for medium and large projects.

(2) Scheme two, noise filtering by application log level: the method depends on an accurate and comprehensive log level specification. Existing and anticipated log types must be classified accurately, and the log level settings must be understood and applied correctly. However, as the system goes through complex iterations, the boundary between useful new logs and noise blurs relative to the original specification, and developers set levels less and less accurately; that is, noise log data continuously spills over from its original level into other levels and eventually can no longer be distinguished. The method therefore has defects in practical application.

(3) Scheme three, log filtering based on a noise template skip list: the method first applies a wavelet transform to the original log sequence, then calculates the similarity difference from the noise template, and finally judges noise against a set threshold. Its biggest problems are extracting the noise template and setting the similarity threshold.

In summary, the prior art solutions are complex, time-consuming and costly.

Disclosure of Invention

The embodiment of the invention provides a method and a device for denoising an application log, which are used for solving the problems that the prior art is too complex and time-consuming, and consumes a large amount of cost.

In a first aspect, an embodiment of the present invention provides an application log denoising method, including:

collecting an application log;

performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;

and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.

In a second aspect, an embodiment of the present invention provides an application log noise reduction apparatus, including:

the log corpus module is used for collecting application logs;

the word segmentation module is used for carrying out word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;

and the noise identification module is used for removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.

In a third aspect, an embodiment of the present invention further provides an electronic device, including:

a processor, a memory, a communication interface, and a communication bus; wherein

the processor, the memory and the communication interface complete mutual communication through the communication bus;

the communication interface is used for information transmission between communication devices of the electronic equipment;

the memory stores computer program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:

collecting an application log;

In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following method:

collecting an application log;

According to the application log noise reduction method and device provided by the embodiments of the invention, the collected application log is segmented according to a predetermined word segmentation rule, and the resulting feature vector is evaluated against a pre-obtained noise identification rule, so that noise can be identified simply and accurately across many kinds of application logs.

Drawings

FIG. 1 is a flowchart of a method for denoising an application log according to an embodiment of the present invention;

FIG. 2 is a flowchart of another application log denoising method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an application log noise reduction device according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of another application log noise reduction device according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the physical structure of an electronic device.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of an application log denoising method according to an embodiment of the present invention, and as shown in fig. 1, the method includes:

Step S01: collect application logs.

Application logs are collected from the network in real time and stored in a log corpus.

Step S02: perform word segmentation on the application log according to a pre-obtained word segmentation rule to obtain a feature vector.

Because application logs come from many sources, they follow many different formats and specifications, and they contain irregular, fragmented text as well as redundant or inaccurate log information. In the prior art, noise identification has to be adjusted for each format and specification. The embodiment of the invention instead segments real-time application logs with a dictionary-free method based on word frequency statistics, so no hard constraints on log format or specification need to be followed.

Each application log is treated as a short document and segmented according to the pre-obtained word segmentation rule. Specifically, the application log is split into words, a feature value is obtained for each word, and the main words and their feature values are extracted to build the feature vector corresponding to the application log.
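The vectorization in step S02 can be sketched as follows. This is only an illustration under stated assumptions: the word segmentation rule is reduced to a hypothetical `vocabulary` list of retained words, tokens are extracted with a simple regular expression, and the feature value of each word is its count in the log line. The source does not prescribe any of these choices.

```python
import re
from collections import Counter

def segment(log_line, vocabulary):
    """Split a log line into lowercase alphabetic tokens and keep vocabulary words."""
    tokens = re.findall(r"[A-Za-z_]+", log_line.lower())
    return [t for t in tokens if t in vocabulary]

def feature_vector(log_line, vocabulary):
    """Return one value per vocabulary word: its count in this log line."""
    counts = Counter(segment(log_line, vocabulary))
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary, assumed to come from an earlier training stage.
vocabulary = ["heartbeat", "ok", "timeout", "connection", "refused"]
vec = feature_vector("2018-12-25 10:00:01 INFO heartbeat OK heartbeat ok", vocabulary)
print(vec)
```

Timestamps and words outside the vocabulary are dropped, which is one way to avoid depending on a fixed log format.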

Step S03: if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance from a topic model, remove the application log.

In the embodiment of the invention, a noise topic is established in the predetermined topic model from word frequencies, document features and the like, so that the noise identification rule can be derived from the topic model.

The feature vector of the application log is evaluated against the noise identification rule. If the feature vector matches the rule, the application log is judged to be noise and a noise mark is added to it; otherwise, either nothing is done or a non-noise mark is added.

Whether an application log is noise can then be determined from its mark. An application log judged to be noise must be removed before subsequent data mining and correlation analysis are performed.
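The marking and removal described in step S03 might look like the sketch below. The record layout (a dict with `text` and `noise` keys) and the keyword-based `toy_rule` are hypothetical stand-ins; in the described method the rule would be derived from the topic model.

```python
def mark_noise(logs, is_noise):
    """Attach a boolean 'noise' mark to each log using the supplied rule."""
    return [{"text": text, "noise": bool(is_noise(text))} for text in logs]

def drop_noise(marked_logs):
    """Keep only the logs that were not marked as noise."""
    return [rec["text"] for rec in marked_logs if not rec["noise"]]

# Hypothetical stand-in rule, used here only to exercise the two functions.
def toy_rule(text):
    return "heartbeat" in text.lower()

logs = ["INFO heartbeat ok", "ERROR connection refused", "DEBUG Heartbeat ok"]
marked = mark_noise(logs, toy_rule)
clean = drop_noise(marked)
print(clean)
```

Keeping the mark separate from the removal lets downstream analysis decide when to filter, which matches the two-stage description above.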

In this embodiment, the collected application logs are segmented according to the predetermined word segmentation rule, and the resulting feature vectors are evaluated against the pre-obtained noise identification rule, so that noise can be identified simply and accurately across many kinds of application logs.

Fig. 2 is a flowchart of another log denoising method according to an embodiment of the present invention, and as shown in fig. 2, the method further includes:

Step S10: obtain the word segmentation rule with a statistical word segmentation model from the historical application logs stored in the training set of the corpus.

To obtain the word segmentation rule in advance, historical data from various application logs must be collected and stored in a corpus, and part of the historical application logs forms a training set.

The word segmentation rule is then obtained by applying a preset statistical word segmentation model to all historical application logs in the training set.

Step S11: perform word segmentation on each of the historical application logs according to the word segmentation rule to obtain the corresponding feature vectors.

Each historical application log is segmented according to the obtained word segmentation rule, then screened and vectorized according to word frequency features to obtain its feature vector.

Further, the statistical word segmentation model is an N-gram language model.

Many statistical word segmentation models exist; the embodiments of the invention give only one, the N-gram language model. The model reflects the confidence of a segmented word through the probability or frequency with which adjacent words occur together in the training set. N may be set according to actual requirements; N = 3 is used here as an example. When a historical application log is processed, 3-grams (sequences of three consecutive English words) are extracted in order with a 3-word sliding window and their occurrences are counted. The corresponding probabilities are calculated with the Bayes formula, and the maximum likelihood method is used to maximize the probability of the training samples. Finally, the high-frequency grams are extracted as the segmented words according to the calculation result. A feature vector for each historical application log is then constructed with each word as a dimension and its frequency or probability in the training set as the value.
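The sliding-window counting described above can be sketched as follows. This is a simplified illustration, not the patented procedure: it uses whitespace tokenization, estimates each 3-gram's probability as its relative frequency among all 3-grams (the maximum-likelihood estimate under a multinomial model), and keeps grams whose count reaches a hypothetical `min_count` threshold. The conditional-probability (Bayes) step mentioned in the text is omitted.

```python
from collections import Counter

def extract_ngrams(tokens, n=3):
    """Slide an n-token window over the tokens and return each window as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_ngram_rule(training_logs, n=3, min_count=2):
    """Count n-grams over the training logs and keep the frequent ones.

    Returns a dict mapping each retained n-gram to its relative frequency
    among all n-grams seen in training.
    """
    counts = Counter()
    for line in training_logs:
        counts.update(extract_ngrams(line.lower().split(), n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items() if c >= min_count}

def vectorize(log_line, rule, n=3):
    """One dimension per retained n-gram; the value is its count in this log line."""
    seen = Counter(extract_ngrams(log_line.lower().split(), n))
    return [seen[g] for g in rule]

training_logs = [
    "heartbeat check passed ok",
    "heartbeat check passed ok",
    "connection to db refused",
]
rule = train_ngram_rule(training_logs)
vec = vectorize("heartbeat check passed ok", rule)
print(rule, vec)
```

Raising `min_count` shrinks the feature space at the cost of ignoring rarer phrases; the right value depends on the log volume.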

Step S12: obtain a topic model by training on the feature vectors of the historical application logs, wherein the topic model comprises at least a noise topic.

Topic modeling is performed on the feature vectors of all historical application logs. Borrowing the idea of topic classification from natural language processing, the topics in the whole training set are designated as noise and non-noise, which converts noise identification into a probabilistic text classification problem.

Further, the topic model is a Latent Dirichlet Allocation (LDA) model.

Topic models can be applied in many ways; the LDA model is used here only as an example. After the topic types are fixed as a noise topic and a non-noise topic, the LDA model uses the feature vectors obtained in the above embodiments to obtain, for each topic, a multinomial distribution over all words in the training set, with a Dirichlet prior. Next, for each historical application log, its distribution over all topics is obtained, likewise with a Dirichlet prior. The topic corresponding to each feature vector is then obtained by iterative optimization under preset hyper-parameters, which establishes the topic model.
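For illustration, a two-topic LDA model can be fitted with a small collapsed Gibbs sampler, as sketched below. The sampler, the symmetric hyper-parameters `alpha` and `beta`, the iteration count and the toy documents are all assumptions; the source only says that the topics are obtained by iterative optimization under preset hyper-parameters. Note that LDA itself is unsupervised, so deciding which fitted topic is the noise topic is a separate step.

```python
import random

def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Tiny collapsed Gibbs sampler for LDA.

    docs: list of token lists. Returns (theta, phi, vocab): theta[d][k] is the
    topic mixture of document d, phi[k][w] is the word distribution of topic k,
    with w indexing into vocab.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # times word w is in topic k
    topic_total = [0] * num_topics                      # tokens assigned to topic k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = assign[d][i], index[w]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + V * beta) for w in range(V)]
        for k in range(num_topics)
    ]
    return theta, phi, vocab

docs = [
    "heartbeat ok heartbeat ok".split(),
    "heartbeat ok ping ok".split(),
    "error timeout connection refused".split(),
    "error connection timeout retry".split(),
]
theta, phi, vocab = lda_gibbs(docs)
```

In practice a maintained library implementation would normally be preferred over a hand-written sampler; the point here is only to show the two families of distributions (per-topic over words, per-log over topics) that the paragraph describes.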

Step S13: obtain the noise identification rule from the topic model.

The noise identification rule is obtained by analyzing the established topic model.
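The source does not say how the rule is derived from the topic model. One plausible sketch, under assumptions, is to judge a log as noise when the noise topic explains its words better than the other topics do. Here `phi` holds hypothetical per-topic word probabilities, `noise_topic` is the index designated as noise, and `threshold` and `floor` (the probability given to unseen words) are illustrative parameters.

```python
import math

def make_noise_rule(phi, noise_topic, threshold=0.5, floor=1e-6):
    """Build a predicate from per-topic word probabilities.

    phi: list of dicts, phi[k][word] = P(word | topic k).
    A log is judged noise when the normalized likelihood of the noise topic
    (with a uniform prior over topics) exceeds the threshold.
    """
    def is_noise(tokens):
        log_scores = [
            sum(math.log(topic.get(w, floor)) for w in tokens) for topic in phi
        ]
        peak = max(log_scores)  # subtract the peak for numerical stability
        weights = [math.exp(s - peak) for s in log_scores]
        return weights[noise_topic] / sum(weights) > threshold
    return is_noise

# Hypothetical topic-word probabilities; topic 0 is designated the noise topic.
phi = [
    {"heartbeat": 0.5, "ok": 0.4, "error": 0.1},
    {"error": 0.5, "timeout": 0.4, "ok": 0.1},
]
is_noise = make_noise_rule(phi, noise_topic=0)
print(is_noise(["heartbeat", "ok"]), is_noise(["error", "timeout"]))
```

Working in log space avoids underflow on long log lines, and the threshold gives a single knob for trading missed noise against wrongly removed logs.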

In this embodiment, the word segmentation rule, the topic model and the noise identification rule are obtained by training on the historical application logs in the training set with the statistical word segmentation model, so that noise can be identified simply and accurately across many kinds of application logs.

Based on the above embodiment, further, the corpus further includes a test set, where the test set includes at least one test application log; accordingly, the method further comprises:

performing word segmentation processing on each test application log in the test set according to the word segmentation rule to obtain a corresponding feature vector;

and performing noise identification according to the feature vector and the noise identification rule, comparing the identification result with a preset standard, and, if a deviation exists, optimizing the topic model according to the deviation.

In addition to the training set, the corpus includes a test set made up of other historical application logs, which serve as test application logs.

The statistical word segmentation model yields the word segmentation rule from the training set, and the topic model yields the noise identification rule. The test application logs in the test set are fed to the statistical word segmentation model, their feature vectors are obtained with the word segmentation rule, and the topic of each test application log is obtained from the established topic model and the noise identification rule.

The results are compared with a preset standard. If a deviation exists, the topic model is optimized again according to the deviation to obtain a more accurate noise identification rule.
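The comparison against a preset standard could be sketched as below, assuming the standard is a list of reference noise labels for the test logs. The metrics chosen (precision, recall, mismatch fraction) and the `0.1` tolerance are assumptions; the source does not define how deviation is measured.

```python
def evaluate(predictions, labels):
    """Compare predicted noise marks with reference labels.

    Returns precision, recall and the deviation (fraction of mismatches).
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    deviation = (fp + fn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "deviation": deviation}

predictions = [True, True, False, False]   # output of the noise identification rule
labels = [True, False, False, True]        # hypothetical reference labels
report = evaluate(predictions, labels)
needs_retraining = report["deviation"] > 0.1  # hypothetical tolerance
print(report, needs_retraining)
```

When `needs_retraining` is true, the topic model would be trained again, for example with adjusted hyper-parameters or an enlarged training set.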

In this embodiment, the word segmentation rule and the noise identification rule are tested on the test set, and the topic model is optimized again if the results deviate, so that noise can be identified simply and accurately across many kinds of application logs.

Based on the above embodiment, further, the method further includes:

periodically storing all collected application logs in the corpus for further optimizing the word segmentation rule and the noise identification rule.

In practice, new applications may appear and produce new application logs, or new habitual wording may emerge. So that the word segmentation rule and the noise identification rule can adapt to such changes at any time, all collected application logs are periodically stored in the corpus as historical application logs and assigned to either the training set or the test set. The word segmentation rule is optimized with the updated training set, the topic model is optimized at the same time, and an optimized noise identification rule is thereby obtained.
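A minimal sketch of the periodic corpus update follows. The source does not say how logs are divided between the training set and the test set, so the hash-based split and the `test_fraction` parameter are assumptions chosen to keep the assignment deterministic across runs.

```python
import hashlib

def assign_split(log_line, test_fraction=0.2):
    """Deterministically place a log line in 'train' or 'test' from its hash."""
    digest = hashlib.sha256(log_line.encode("utf-8")).digest()
    bucket = digest[0] / 256  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def update_corpus(corpus, new_logs, test_fraction=0.2):
    """Append newly collected logs to the corpus's training or test set."""
    for line in new_logs:
        corpus[assign_split(line, test_fraction)].append(line)
    return corpus

corpus = {"train": [], "test": []}
update_corpus(corpus, ["INFO heartbeat ok", "ERROR connection refused"])
print(len(corpus["train"]), len(corpus["test"]))
```

After each update, the word segmentation rule and the topic model would be retrained on `corpus["train"]` and checked against `corpus["test"]`.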

In this embodiment, historical application logs are added to the corpus periodically, so that the word segmentation rule and the noise identification rule are continuously optimized and noise can be identified simply and accurately across many kinds of application logs.

Fig. 3 is a schematic structural diagram of an application log noise reduction device according to an embodiment of the present invention. As shown in fig. 3, the device includes a log corpus module 10, a word segmentation module 11 and a noise identification module 12. The log corpus module 10 is used for collecting application logs; the word segmentation module 11 is configured to perform word segmentation on the application log according to a pre-obtained word segmentation rule to obtain a feature vector; the noise identification module 12 is configured to remove the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance from a topic model. Specifically:

The log corpus module 10 collects application logs from the network in real time and stores them in the log corpus. At the same time, the log corpus module 10 sends the collected application logs to the word segmentation module 11 in real time.

The word segmentation module 11 treats each application log as a short document and segments it according to the pre-obtained word segmentation rule. Specifically, it splits the application log into words, obtains a feature value for each word, extracts the main words and their feature values to form the feature vector corresponding to the application log, and sends the feature vector to the noise identification module 12.

The noise identification module 12 evaluates the feature vector of the application log against the noise identification rule. If the feature vector matches the rule, the application log is judged to be noise and a noise mark is added to it; otherwise, either nothing is done or a non-noise mark is added.

Whether an application log is noise can then be determined from its mark. An application log judged to be noise must be removed by a filtering module before subsequent data mining and correlation analysis are performed.

The apparatus provided in the embodiment of the present invention is configured to execute the above method. For its functions, refer to the method embodiments; the detailed method flows are not repeated here.
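The data flow between modules 10, 11 and 12 can be sketched with three small classes. The class names, method names and the lambda rule are hypothetical, and the segmentation is reduced to counting vocabulary words; the sketch shows only how a collected log passes from the corpus module through segmentation to noise identification.

```python
class WordSegmentationModule:
    """Module 11: turns a log line into a feature vector over a fixed vocabulary."""
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def process(self, log_line):
        tokens = log_line.lower().split()
        return [tokens.count(word) for word in self.vocabulary]

class NoiseIdentificationModule:
    """Module 12: applies a noise identification rule to a feature vector."""
    def __init__(self, rule):
        self.rule = rule

    def is_noise(self, vector):
        return bool(self.rule(vector))

class LogCorpusModule:
    """Module 10: stores collected logs and forwards each one downstream."""
    def __init__(self, segmenter, identifier):
        self.corpus = []
        self.segmenter = segmenter
        self.identifier = identifier

    def collect(self, log_line):
        self.corpus.append(log_line)
        vector = self.segmenter.process(log_line)
        return {"text": log_line, "noise": self.identifier.is_noise(vector)}

segmenter = WordSegmentationModule(["heartbeat", "error"])
# Hypothetical rule: noise if "heartbeat" appears and "error" does not.
identifier = NoiseIdentificationModule(lambda v: v[0] > 0 and v[1] == 0)
corpus_module = LogCorpusModule(segmenter, identifier)
result = corpus_module.collect("INFO heartbeat ok")
print(result)
```

Passing the rule and vocabulary in through constructors keeps the modules replaceable, so retrained rules can be swapped in without changing the collection path.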

Fig. 4 is a schematic structural diagram of another application log noise reduction apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus further includes a training word segmentation module 20, a modeling module 21 and a classification construction module 22, wherein

the training word segmentation module 20 is configured to obtain the word segmentation rule with a statistical word segmentation model from the historical application logs stored in the training set of the corpus; the training word segmentation module 20 is further configured to perform word segmentation on each of the historical application logs according to the word segmentation rule to obtain the corresponding feature vectors; the modeling module 21 is configured to obtain a topic model by training on the feature vectors of the historical application logs, wherein the topic model comprises at least a noise topic; the classification construction module 22 is configured to obtain the noise identification rule from the topic model. Specifically:

The training word segmentation module 20 obtains the word segmentation rule by applying a preset statistical word segmentation model to all historical application logs in the training set, and sends the word segmentation rule to the word segmentation module 11.

The training word segmentation module 20 also segments each historical application log according to the obtained word segmentation rule, then screens and vectorizes it according to word frequency features to obtain the corresponding feature vector.

Further, the statistical word segmentation model is an N-gram language model.

The modeling module 21 performs topic modeling on the feature vectors of all historical application logs obtained by the training word segmentation module 20. Borrowing the idea of topic classification from natural language processing, it designates the topics in the whole training set as noise and non-noise, which converts noise identification into a probabilistic text classification problem.

Further, the topic model is a Latent Dirichlet Allocation (LDA) model.

The classification construction module 22 obtains the noise identification rule by analyzing the established topic model, and sends the noise identification rule to the noise identification module 12.

Fig. 5 is a schematic diagram of the physical structure of an electronic device. As shown in fig. 5, the electronic device may include a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method: collecting an application log; performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.

Further, embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example: collecting an application log; performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.

Further, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions, which cause the computer to perform the method provided by the above method embodiments, for example, including: collecting an application log; performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.

In addition, as those of ordinary skill in the art will understand, the logic instructions in the memory 830 may be implemented as software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

The above-described embodiments of the electronic device and the like are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for denoising an application log, comprising:

collecting an application log;

2. The method of claim 1, further comprising:

obtaining the word segmentation rule by adopting a statistical word segmentation model according to a historical application log stored in a training set in a corpus;

performing word segmentation processing on each historical application log in the historical application logs according to the word segmentation rule to obtain corresponding feature vectors;

obtaining a topic model through training according to the feature vector of the historical application log; wherein the topic model comprises at least a noise topic;

and obtaining the noise identification rule according to the topic model.

3. The method according to claim 2, wherein the corpus further includes a test set, the test set including at least one test application log; accordingly, the method further comprises:

4. The method of claim 2, further comprising:

5. The method of claim 2, wherein the statistical word segmentation model is an N-gram language model.

6. The method of claim 2, wherein the topic model is a Latent Dirichlet Allocation (LDA) model.

7. An application log noise reduction apparatus, comprising:

the log corpus module is used for collecting application logs;

8. The apparatus of claim 7, further comprising:

the training word segmentation module is used for obtaining the word segmentation rule by adopting a statistical word segmentation model according to a historical application log stored in a training set in a corpus;

the training word segmentation module is further used for performing word segmentation processing on each historical application log in the historical application logs according to the word segmentation rule to obtain corresponding feature vectors;

the modeling module is used for obtaining a topic model through training according to the feature vector of the historical application log; wherein the topic model comprises at least a noise topic;

and the classification construction module is used for obtaining the noise identification rule according to the topic model.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the application log noise reduction method according to any one of claims 1 to 6 when executing the program.

10. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the application log noise reduction method according to any one of claims 1 to 6.