CN111368534A - Application log noise reduction method and device - Google Patents

Publication number
CN111368534A
Authority
CN
China
Prior art keywords
word segmentation
log
noise
application log
application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811587244.9A
Other languages
Chinese (zh)
Inventor
蒋通通
叶晓龙
孟震
任赣
竺士杰
乔柏林
胡林熙
张琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Zhejiang Co Ltd
Priority to CN201811587244.9A
Publication of CN111368534A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

Embodiments of the invention provide an application log noise reduction method and device. The method comprises: collecting an application log; performing word segmentation on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and, if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance from a topic model, removing the application log.

Description

Application log noise reduction method and device
Technical Field
Embodiments of the invention relate to the technical field of computer software, and in particular to an application log noise reduction method and device.
Background
Application logs are currently regarded as one of the most important operation and maintenance windows for diagnosing and locating system faults; most faults can be located in real time by extracting and aggregating the behavioral features of log events. Application logs are also widely used in operational analysis. For example, deep mining and correlation analysis of user access logs can build behavior profiles of different user groups, which supports multi-level marketing activities. However, as systems grow in scale and complexity, log-based fault diagnosis and operational analysis are affected by environmental factors, code quality, and similar issues. Examples include large volumes of injected logs that are unrelated to any fault or requirement, and disordered logs caused by inaccurate log level settings during system development. Such logs interfere heavily with subsequent log analysis and are regarded as "noise data". To build an effective fault signature model and aggregate accurate operational metrics, these noise logs must be filtered out before analysis.
The current technical solutions for application log noise reduction are as follows.
Scheme one: log noise filtering based on manual, experience-based labeling. Personnel responsible for the system, such as operation and maintenance staff, periodically collate and analyze the log data output by applications, classify and screen the logs according to their long-term work experience, label the logs judged to be useless noise, and forcibly filter them when logs are collected or stored. Most such filtering takes the form of keyword matching, regular-expression templates, and the like. This method suits small application systems with low log volume, where its effect is relatively thorough.
Scheme two: noise filtering based on application log level. This method borrows the log level management conventions of current programming languages, such as the five main log levels in Java Log4j (debug, info, warn, error, and fatal), which print fine-grained debugging logs, operation logs, potential errors, runtime errors, and severe event logs at separate levels. Following a similar log level specification, developers bound the output of potential noise logs by defining info, higher, or custom levels. Filtering useless noise logs in subsequent log analysis then only requires overall control of the log level.
Scheme three: log filtering based on a noise template skip list. This method extracts and distinguishes logs mainly by the similarity features of log time series. The target log time series is compared with a noise template for similarity to determine whether it is a noise log. Experiments on a real cloud computing platform show that this method effectively improves the validity of fault features.
The prior art schemes mainly have the following problems.
(1) Scheme one, manual experience-based labeling: as the cluster scale of systems keeps expanding, purely manual labeling becomes a difficult task, all the more so because the adoption of agile development increases the frequency of application code changes and causes a surge in log volume and log types. Linearly growing log noise cannot be identified and filtered out quickly and accurately, and the Matthew effect becomes increasingly severe. In short, this approach is too costly for medium and large projects.
(2) Scheme two, noise filtering by application log level: this method depends on the accuracy and completeness of the log level specification. Existing and anticipated log types must be classified accurately, and the log level settings must be understood and applied accurately. However, as the system goes through complex iterations, the boundary between valid logs and noise among newly added logs gradually blurs relative to the original specification, and developers no longer set levels accurately. Noise log data keeps spilling from its original level into other levels until it can no longer be distinguished. The method therefore has defects in practical application.
(3) Scheme three, log filtering based on a noise template skip list: this method first applies a wavelet transform to the original log sequence, then calculates the similarity difference with the noise template, and finally makes the noise judgment against a set threshold. Its biggest problems are extracting the noise template and setting the similarity threshold.
In summary, the prior art solutions are complex, time-consuming, and costly.
Disclosure of Invention
The embodiment of the invention provides a method and a device for denoising an application log, which are used for solving the problems that the prior art is too complex and time-consuming, and consumes a large amount of cost.
In a first aspect, an embodiment of the present invention provides an application log denoising method, including:
collecting an application log;
performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
In a second aspect, an embodiment of the present invention provides an application log noise reduction apparatus, including:
the log corpus module is used for collecting application logs;
the word segmentation module is used for performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and the noise identification module is used for removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
a processor, a memory, a communication interface, and a communication bus; wherein,
the processor, the memory and the communication interface complete mutual communication through the communication bus;
the communication interface is used for information transmission between communication devices of the electronic equipment;
the memory stores computer program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:
collecting an application log;
performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following method:
collecting an application log;
performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
According to the application log noise reduction method and device provided by the embodiments of the invention, the collected application log is segmented with a predetermined word segmentation rule, and the feature vector obtained after segmentation is evaluated with a noise identification rule obtained in advance, so that noise can be identified simply, conveniently, and accurately for a wide variety of application logs.
Drawings
FIG. 1 is a flowchart of an application log noise reduction method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another application log noise reduction method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an application log noise reduction device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another application log noise reduction device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the physical structure of an electronic device.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an application log noise reduction method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step S01: collect application logs.
Application logs are collected from the network in real time and stored in a log corpus.
Step S02: perform word segmentation on the application log according to a word segmentation rule obtained in advance to obtain a feature vector.
Because application logs come from many sources, they have a variety of formats and specifications and contain irregular, fragmented text data as well as overly redundant and inaccurate log information. In the prior art, noise identification has to be adjusted for each format and specification. The embodiment of the invention segments real-time application logs with a dictionary-free word segmentation method based on word frequency statistics, so no hard constraints on log format or specification need to be followed.
Each application log is regarded as a short document and segmented according to the word segmentation rule obtained in advance. Specifically, the application log is split into words, a feature value is obtained for each word, and the main words and their feature values are extracted to build the feature vector corresponding to the application log.
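As a minimal sketch of step S02, the following Python fragment turns one log line into a word-count feature vector. The whitespace split stands in for the learned word segmentation rule, and the names `build_feature_vector` and `vocabulary` are hypothetical; the patent does not specify either.

```python
from collections import Counter


def build_feature_vector(log_line, vocabulary):
    """Map one log line to a word-count vector over a fixed vocabulary.

    Assumption: whitespace splitting stands in for the learned
    word segmentation rule, and raw counts serve as feature values.
    """
    tokens = log_line.lower().split()
    counts = Counter(tokens)
    # One dimension per vocabulary word; unseen words count as 0.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary of "main words" kept after screening.
vocabulary = ["heartbeat", "ok", "timeout", "error"]
vector = build_feature_vector("Heartbeat ok heartbeat ok", vocabulary)
```

Here `vector` holds one count per vocabulary word, so every log maps to a vector of the same length regardless of its original format.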
Step S03: if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance from a topic model, remove the application log.
In the embodiment of the invention, a noise topic is established in the previously determined topic model from word frequency, document features, and the like, so the noise identification rule can be obtained from the topic model.
The feature vector of the application log is evaluated with the noise identification rule. If the feature vector matches the rule, the application log is judged to be noise and a noise mark is added to it; otherwise, no operation is performed or a non-noise mark is added.
Whether an application log is noise can then be determined from these marks. An application log judged to be noise must be removed before subsequent data mining and correlation analysis are carried out.
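The marking and removal in step S03 can be illustrated roughly as follows. This sketch assumes that the noise identification rule reduces to a threshold on each log's noise-topic probability; the threshold value, the probabilities, and the function names are all hypothetical and not taken from the patent.

```python
def mark_logs(logs, noise_probs, threshold=0.5):
    """Attach a noise mark to each log.

    Assumption: the noise identification rule is a simple threshold
    on the probability that a log belongs to the noise topic.
    """
    return [
        {"text": text, "noise": prob >= threshold}
        for text, prob in zip(logs, noise_probs)
    ]


def remove_noise(marked_logs):
    """Keep only the logs that carry a non-noise mark."""
    return [entry["text"] for entry in marked_logs if not entry["noise"]]


logs = ["heartbeat ok", "payment timeout", "heartbeat ok"]
noise_probs = [0.93, 0.08, 0.91]  # hypothetical topic-model output
marked = mark_logs(logs, noise_probs)
clean = remove_noise(marked)
```

Keeping the mark separate from the removal mirrors the description above: marking happens at identification time, and filtering happens later, before data mining and correlation analysis.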
The embodiment of the invention segments the collected application logs with the predetermined word segmentation rule and then evaluates the resulting feature vectors with the noise identification rule obtained in advance, so that noise can be identified simply, conveniently, and accurately for a wide variety of application logs.
Fig. 2 is a flowchart of another application log noise reduction method according to an embodiment of the present invention. As shown in Fig. 2, the method further includes:
Step S10: obtain the word segmentation rule with a statistical word segmentation model from the historical application logs stored in the training set of the corpus.
To obtain the word segmentation rule in advance, historical data from various application logs must be collected and stored in a corpus, and part of the historical application logs forms a training set.
The word segmentation rule is obtained by applying a preset statistical word segmentation model to all historical application logs in the training set.
Step S11: perform word segmentation on each of the historical application logs according to the word segmentation rule to obtain the corresponding feature vectors.
Each historical application log is segmented according to the obtained word segmentation rule, then screened and vectorized by word frequency features to obtain the corresponding feature vector.
Further, the statistical word segmentation model is an N-gram language model.
There are many statistical word segmentation models; the embodiment of the invention gives only one of them, the N-gram language model. The model reflects the confidence of a segmented word through the probability or frequency with which adjacent words occur together in the training set. N can be set according to actual requirements; N = 3 is used here only as an example. When a historical application log is processed, a three-word sliding window extracts successive groups of 3 grams (each gram being one English word) and their occurrences are counted. The corresponding probabilities are calculated with Bayes' formula, and the maximum likelihood method is used to maximize the probability of the training samples. Finally, the high-frequency grams are extracted as segmented words according to the calculation results. A feature vector for the historical application log data is then constructed by taking each word as a dimension and the frequency or probability of that word in the training set as the value.
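The sliding-window counting described for N = 3 can be sketched as below. This is a simplified illustration, not the patent's exact procedure: relative frequency is used as the probability estimate (the maximum-likelihood estimate under a simple categorical model), and the cutoff `min_prob` and the sample logs are hypothetical.

```python
from collections import Counter


def count_ngrams(logs, n=3):
    """Count every run of n consecutive words with a sliding window."""
    counts = Counter()
    for line in logs:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequent_ngrams(counts, min_prob):
    """Keep grams whose relative frequency reaches min_prob.

    Assumption: relative frequency is the probability estimate, and
    min_prob is an illustrative cutoff for "high-frequency" grams.
    """
    total = sum(counts.values())
    return {
        gram: c / total for gram, c in counts.items() if c / total >= min_prob
    }


# Hypothetical training-set logs.
logs = [
    "connection pool heartbeat check ok",
    "connection pool heartbeat check ok",
    "payment service timeout after 30s",
]
counts = count_ngrams(logs)
rule = frequent_ngrams(counts, min_prob=0.2)
```

With these three lines there are nine window positions in total; the three grams from the repeated heartbeat line each occur twice (2/9) and pass the cutoff, while the grams from the payment line occur once (1/9) and are dropped.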
Step S12: train a topic model from the feature vectors of the historical application logs, wherein the topic model comprises at least a noise topic.
Topic modeling is performed on the feature vectors of all historical application logs. Using the idea of topic classification from natural language processing, the topics in the whole training set are designated as noise and non-noise, which turns the noise identification problem into a probability problem in genre classification.
Further, the topic model is a Latent Dirichlet Allocation (LDA) model.
Topic models can be applied in many ways; the LDA model is used here only as an example. After the topic types are fixed as noise topics and non-noise topics, the LDA model uses the feature vectors obtained in the above embodiments to derive, for each topic, a multinomial distribution over all words in the training set, drawn from a Dirichlet prior. Next, for each historical application log, a distribution over all topics is obtained, likewise drawn from a Dirichlet prior. The topic corresponding to each feature vector is then obtained by iterative optimization under preset hyperparameters, which establishes the topic model.
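To make the iterative step concrete, here is a compact collapsed Gibbs sampler for a two-topic LDA model. Gibbs sampling is one common inference method chosen for this sketch; the patent only says "iterative optimization" and does not name a method. The hyperparameter values, the function name `lda_gibbs`, and the toy documents are assumptions.

```python
import random


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids.

    Returns the per-document topic distribution (theta).
    alpha and beta are the preset Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    vocab_size = max(w for doc in docs for w in doc) + 1
    doc_topic = [[0] * num_topics for _ in docs]  # words in doc d on topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic proportions for each document.
    return [
        [
            (doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
            for t in range(num_topics)
        ]
        for d, doc in enumerate(docs)
    ]


# Hypothetical word-id documents: two heartbeat-like, two error-like logs.
docs = [[0, 1, 0, 1], [0, 1, 1, 0], [2, 3, 2, 3], [3, 2, 2, 3]]
theta = lda_gibbs(docs)
```

Each row of `theta` is one log's distribution over the two topics. The sampler itself does not know which topic index is "noise"; deciding that would require inspecting the learned topics or a few labeled samples, which is an assumption beyond what the text specifies.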
Step S13: obtain the noise identification rule from the topic model.
The noise identification rule can be obtained by analyzing the established topic model.
The embodiment of the invention obtains the word segmentation rule, the topic model, and the noise identification rule by training on the historical application logs in the training set with the statistical word segmentation model, so that noise can be identified simply, conveniently, and accurately for a wide variety of application logs.
Based on the above embodiment, further, the corpus also includes a test set containing at least one test application log. Accordingly, the method further includes:
performing word segmentation on each test application log in the test set according to the word segmentation rule to obtain the corresponding feature vector;
performing noise identification according to the feature vector and the noise identification rule, comparing the result with a preset standard, and, if a deviation exists, optimizing the topic model according to the deviation.
Besides the training set, the corpus also includes a test set composed of other historical application logs, which serve as test application logs.
The statistical word segmentation model yields the word segmentation rule from the training set, and the noise identification rule is obtained from the topic model. The test application logs in the test set are fed to the statistical word segmentation model, the corresponding feature vectors are obtained with the word segmentation rule, and the topic of each test application log is obtained from the established topic model and the noise identification rule.
The result is compared with a preset standard. If a deviation exists, the topic model is optimized again according to the deviation to obtain a more accurate noise identification rule.
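The comparison against a preset standard might look like the following sketch. It assumes the standard is a list of reference noise labels for the test set and that deviation is measured as the fraction of mismatched labels; the tolerance value and both function names are hypothetical.

```python
def deviation(predicted, expected):
    """Fraction of test logs whose noise flag differs from the reference.

    Assumption: the "preset standard" is a list of reference labels.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    mismatches = sum(p != e for p, e in zip(predicted, expected))
    return mismatches / len(expected)


def needs_retraining(predicted, expected, tolerance=0.05):
    """Signal that the topic model should be optimized again."""
    return deviation(predicted, expected) > tolerance


predicted = [True, False, True, True]   # noise flags from the current rule
expected = [True, False, False, True]   # hypothetical reference labels
dev = deviation(predicted, expected)
retrain = needs_retraining(predicted, expected)
```

In this example one of four test logs is misclassified, so the deviation exceeds the illustrative 5% tolerance and the topic model would be optimized again.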
The embodiment of the invention tests the word segmentation rule and the noise identification rule with the test set and optimizes the topic model again if the result deviates, so that noise can be identified simply, conveniently, and accurately for a wide variety of application logs.
Based on the above embodiment, further, the method further includes:
periodically storing all the collected application logs in the corpus for further optimizing the word segmentation rule and the noise identification rule.
In practice, new applications may appear and produce new application logs, or new habitual wording may emerge. So that the word segmentation rule and the noise identification rule can adapt to such changes in current application logs at any time, all collected application logs are periodically stored in the corpus as historical application logs and assigned to either the training set or the test set. The word segmentation rule is optimized with the updated training set, the topic model is optimized at the same time, and an optimized noise identification rule is then obtained.
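A rough sketch of the periodic corpus update follows. The text does not say how new logs are divided between the training set and the test set, so the random split, the `test_fraction` value, and the dictionary layout of the corpus are all assumptions made for illustration.

```python
import random


def update_corpus(corpus, new_logs, test_fraction=0.2, seed=0):
    """Append newly collected logs to the training or test set.

    Assumption: logs are assigned at random, with roughly
    test_fraction of them going to the test set.
    """
    rng = random.Random(seed)
    for line in new_logs:
        bucket = "test" if rng.random() < test_fraction else "train"
        corpus[bucket].append(line)
    return corpus


corpus = {"train": [], "test": []}
new_logs = [f"log line {i}" for i in range(10)]
update_corpus(corpus, new_logs)
```

After each update, the word segmentation rule and the topic model would be retrained on `corpus["train"]` and checked against `corpus["test"]`.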
The embodiment of the invention periodically adds historical application logs to the corpus, so that the word segmentation rule and the noise identification rule are continuously optimized and noise can be identified simply, conveniently, and accurately for a wide variety of application logs.
Fig. 3 is a schematic structural diagram of an application log noise reduction device according to an embodiment of the present invention. As shown in Fig. 3, the device includes a log corpus module 10, a word segmentation module 11, and a noise identification module 12. The log corpus module 10 is configured to collect application logs; the word segmentation module 11 is configured to perform word segmentation on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; the noise identification module 12 is configured to remove the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance from a topic model. Specifically:
The log corpus module 10 collects application logs from the network in real time and stores them in the log corpus. At the same time, the log corpus module 10 sends the collected application logs to the word segmentation module 11 in real time.
The word segmentation module 11 regards each application log as a short document and segments it according to the word segmentation rule obtained in advance. Specifically, it splits the application log into words, obtains a feature value for each word, extracts the main words and their feature values to form the feature vector corresponding to the application log, and sends the feature vector to the noise identification module 12.
In the embodiment of the invention, a noise topic is established in the previously determined topic model from word frequency, document features, and the like, so the noise identification rule can be obtained from the topic model.
The noise identification module 12 evaluates the feature vector of the application log with the noise identification rule. If the feature vector matches the rule, the application log is judged to be noise and a noise mark is added to it; otherwise, no operation is performed or a non-noise mark is added.
Whether an application log is noise can then be determined from these marks. An application log judged to be noise must be removed by a filtering module before subsequent data mining and correlation analysis are carried out.
The device provided in the embodiment of the present invention is configured to execute the method described above. For its functions, refer to the method embodiment; the detailed method flow is not repeated here.
The embodiment of the invention segments the collected application logs with the predetermined word segmentation rule and then evaluates the resulting feature vectors with the noise identification rule obtained in advance, so that noise can be identified simply, conveniently, and accurately for a wide variety of application logs.
Fig. 4 is a schematic structural diagram of another application log noise reduction device according to an embodiment of the present invention. As shown in Fig. 4, the device further includes a training word segmentation module 20, a modeling module 21, and a classification construction module 22, wherein:
The training word segmentation module 20 is configured to obtain the word segmentation rule with a statistical word segmentation model from the historical application logs stored in the training set of the corpus; the training word segmentation module 20 is further configured to perform word segmentation on each of the historical application logs according to the word segmentation rule to obtain the corresponding feature vector; the modeling module 21 is configured to train a topic model from the feature vectors of the historical application logs, wherein the topic model comprises at least a noise topic; the classification construction module 22 is configured to obtain the noise identification rule from the topic model. Specifically:
To obtain the word segmentation rule in advance, historical data from various application logs must be collected and stored in a corpus, and part of the historical application logs forms a training set.
The training word segmentation module 20 obtains the word segmentation rule by applying a preset statistical word segmentation model to all historical application logs in the training set, and sends the word segmentation rule to the word segmentation module 11.
The training word segmentation module 20 may segment each historical application log according to the obtained word segmentation rule, then screen and vectorize by word frequency features to obtain the corresponding feature vector.
Further, the statistical word segmentation model is an N-gram language model.
There are many statistical word segmentation models; the embodiment of the invention gives only one of them, the N-gram language model. The model reflects the confidence of a segmented word through the probability or frequency with which adjacent words occur together in the training set. N can be set according to actual requirements; N = 3 is used here only as an example. When a historical application log is processed, a three-word sliding window extracts successive groups of 3 grams (each gram being one English word) and their occurrences are counted. The corresponding probabilities are calculated with Bayes' formula, and the maximum likelihood method is used to maximize the probability of the training samples. Finally, the high-frequency grams are extracted as segmented words according to the calculation results. A feature vector for the historical application log data is then constructed by taking each word as a dimension and the frequency or probability of that word in the training set as the value.
The modeling module 21 performs topic modeling on the feature vectors of all historical application logs obtained by the training word segmentation module 20. Using the idea of topic classification from natural language processing, it designates the topics in the whole training set as noise and non-noise, which turns the noise identification problem into a probability problem in genre classification.
Further, the topic model is a Latent Dirichlet Allocation (LDA) model.
Topic models can be applied in many ways; the LDA model is used here only as an example. After the topic types are fixed as noise topics and non-noise topics, the LDA model uses the feature vectors obtained in the above embodiments to derive, for each topic, a multinomial distribution over all words in the training set, drawn from a Dirichlet prior. Next, for each historical application log, a distribution over all topics is obtained, likewise drawn from a Dirichlet prior. The topic corresponding to each feature vector is then obtained by iterative optimization under preset hyperparameters, which establishes the topic model.
The classification construction module 22 may obtain the noise identification rule by analyzing the established topic model, and sends the noise identification rule to the noise identification module 12.
The device provided in the embodiment of the present invention is configured to execute the method described above. For its functions, refer to the method embodiment; the detailed method flow is not repeated here.
The embodiment of the invention obtains the word segmentation rule, the topic model, and the noise identification rule by training on the historical application logs in the training set with the statistical word segmentation model, so that noise can be identified simply, conveniently, and accurately for a wide variety of application logs.
Fig. 5 illustrates the physical structure of an electronic device. As shown in Fig. 5, the electronic device may include: a processor 810, a communication interface 820, a memory 830, and a communication bus 840, wherein the processor 810, the communication interface 820, and the memory 830 communicate with each other via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method: collecting an application log; performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
Further, an embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example: collecting an application log; performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
Further, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example: collecting an application log; performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
In addition, the logic instructions in the memory 830 may be implemented as software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above-described embodiments of the electronic device and the like are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for denoising an application log, comprising:
collecting an application log;
performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
2. The method of claim 1, further comprising:
obtaining the word segmentation rule by adopting a statistical word segmentation model according to a historical application log stored in a training set in a corpus;
performing word segmentation processing on each historical application log in the historical application logs according to the word segmentation rule to obtain corresponding feature vectors;
obtaining a topic model through training according to the feature vectors of the historical application logs; wherein the topic model comprises at least a noise topic;
and obtaining the noise identification rule according to the topic model.
3. The method according to claim 2, wherein the corpus further includes a test set, the test set including at least one test application log; accordingly, the method further comprises:
performing word segmentation processing on each test application log in the test set according to the word segmentation rule to obtain a corresponding feature vector;
and performing noise identification according to the feature vector and the noise identification rule, comparing the noise identification with a preset standard, and optimizing the topic model according to the deviation if the deviation exists.
4. The method of claim 2, further comprising:
and periodically storing all the collected application logs into the corpus for further optimizing the word segmentation rule and the noise identification rule.
5. The method of claim 2, wherein the statistical word segmentation model is an N-gram language model.
6. The method of claim 2, wherein the topic model is a Latent Dirichlet Allocation (LDA) model.
7. An application log noise reduction apparatus, comprising:
the log corpus library module is used for collecting application logs;
the word segmentation module is used for carrying out word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and the noise identification module is used for removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
8. The apparatus of claim 7, further comprising:
the training word segmentation module is used for obtaining the word segmentation rule by adopting a statistical word segmentation model according to a historical application log stored in a training set in a corpus;
the training word segmentation module is further used for performing word segmentation processing on each historical application log in the historical application logs according to the word segmentation rule to obtain corresponding feature vectors;
the modeling module is used for obtaining a topic model through training according to the feature vectors of the historical application logs; wherein the topic model comprises at least a noise topic;
and the classification construction module is used for obtaining the noise identification rule according to the topic model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the application log noise reduction method according to any one of claims 1 to 6.
10. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the application log noise reduction method according to any one of claims 1 to 6.
CN201811587244.9A 2018-12-25 2018-12-25 Application log noise reduction method and device Pending CN111368534A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811587244.9A CN111368534A (en) 2018-12-25 2018-12-25 Application log noise reduction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811587244.9A CN111368534A (en) 2018-12-25 2018-12-25 Application log noise reduction method and device

Publications (1)

Publication Number Publication Date
CN111368534A true CN111368534A (en) 2020-07-03

Family

ID=71205896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811587244.9A Pending CN111368534A (en) 2018-12-25 2018-12-25 Application log noise reduction method and device

Country Status (1)

Country Link
CN (1) CN111368534A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114564473A (en) * 2022-04-28 2022-05-31 江苏益柏锐信息科技有限公司 Data processing method, equipment and medium based on ERP enterprise management system
CN114896236A (en) * 2022-05-25 2022-08-12 中卫市昊科电子技术有限公司 Big data denoising optimization method and big data system applying artificial intelligence analysis
CN115757068A (en) * 2022-11-17 2023-03-07 中电云数智科技有限公司 Process log acquisition and automatic noise reduction method and system based on eBPF
CN116578534A (en) * 2023-04-11 2023-08-11 华能信息技术有限公司 Log message data format identification method and system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120101965A1 (en) * 2010-10-26 2012-04-26 Microsoft Corporation Topic models
CN102902752A (en) * 2012-09-20 2013-01-30 新浪网技术(中国)有限公司 Method and system for monitoring log
CN103678282A (en) * 2014-01-07 2014-03-26 苏州思必驰信息科技有限公司 Word segmentation method and device
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机系统有限公司 Text topic classification method and system
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107092650A (en) * 2017-03-13 2017-08-25 网宿科技股份有限公司 A kind of Web Log Analysis method and device
CN107943791A (en) * 2017-11-24 2018-04-20 北京奇虎科技有限公司 A kind of recognition methods of refuse messages, device and mobile terminal
CN108170818A (en) * 2017-12-29 2018-06-15 深圳市金立通信设备有限公司 A kind of file classification method, server and computer-readable medium
CN108268560A (en) * 2017-01-03 2018-07-10 中国移动通信有限公司研究院 A kind of file classification method and device
CN108509793A (en) * 2018-04-08 2018-09-07 北京明朝万达科技股份有限公司 A kind of user's anomaly detection method and device based on User action log data
CN108667678A (en) * 2017-03-29 2018-10-16 中国移动通信集团设计院有限公司 A kind of O&M Log security detection method and device based on big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANTON A.CHUVAKIN ET AL.: "日志管理与分析权威指南", vol. 1, 机械工业出版社, pages: 96 - 98 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114564473A (en) * 2022-04-28 2022-05-31 江苏益柏锐信息科技有限公司 Data processing method, equipment and medium based on ERP enterprise management system
CN114564473B (en) * 2022-04-28 2022-07-12 江苏益柏锐信息科技有限公司 Data processing method, equipment and medium based on ERP enterprise management system
CN114896236A (en) * 2022-05-25 2022-08-12 中卫市昊科电子技术有限公司 Big data denoising optimization method and big data system applying artificial intelligence analysis
CN115757068A (en) * 2022-11-17 2023-03-07 中电云数智科技有限公司 Process log acquisition and automatic noise reduction method and system based on eBPF
CN115757068B (en) * 2022-11-17 2024-03-05 中电云计算技术有限公司 Process log acquisition and automatic noise reduction method and system based on eBPF
CN116578534A (en) * 2023-04-11 2023-08-11 华能信息技术有限公司 Log message data format identification method and system

Similar Documents

Publication Publication Date Title
CN108737406B (en) Method and system for detecting abnormal flow data
CN111368534A (en) Application log noise reduction method and device
JP7090936B2 (en) ESG-based corporate evaluation execution device and its operation method
CN110020422B (en) Feature word determining method and device and server
CN111798312B (en) Financial transaction system anomaly identification method based on isolated forest algorithm
CN109271520B (en) Data extraction method, data extraction device, storage medium, and electronic apparatus
CN106445915B (en) New word discovery method and device
CN110175851B (en) Cheating behavior detection method and device
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN115544240B (en) Text sensitive information identification method and device, electronic equipment and storage medium
CN108052509A (en) A kind of Text similarity computing method, apparatus and server
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN108021595B (en) Method and device for checking knowledge base triples
CN111754352A (en) Method, device, equipment and storage medium for judging correctness of viewpoint statement
CN114969334B (en) Abnormal log detection method and device, electronic equipment and readable storage medium
CN116578700A (en) Log classification method, log classification device, equipment and medium
CN112685374A (en) Log classification method and device and electronic equipment
CN115470034A (en) Log analysis method, device and storage medium
CN113535458B (en) Abnormal false alarm processing method and device, storage medium and terminal
CN116155541A (en) Automatic machine learning platform and method for network security application
CN115758183A (en) Training method and device for log anomaly detection model
CN115495587A (en) Alarm analysis method and device based on knowledge graph
CN114610576A (en) Log generation monitoring method and device
CN110569498B (en) Compound word recognition method and related device
CN112818972A (en) Method and device for detecting interest point image, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200703