CN111368534A - Application log noise reduction method and device - Google Patents
Application log noise reduction method and device
- Publication number
- CN111368534A (application CN201811587244.9A)
- Authority
- CN
- China
- Prior art keywords
- word segmentation
- log
- noise
- application log
- application
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
Embodiments of the invention provide an application log noise reduction method and device. The method comprises: collecting application logs; performing word segmentation on each application log according to a word segmentation rule obtained in advance, to obtain a feature vector; and removing the application log if it is judged to be noise according to the feature vector and a noise identification rule obtained in advance from a topic model.
Description
Technical Field
Embodiments of the invention relate to the technical field of computer software, and in particular to an application log noise reduction method and device.
Background
Application logs are currently regarded as one of the most important operation and maintenance windows for diagnosing and locating system faults: most faults can be located in real time by extracting and aggregating the behavioral characteristics of log events. Application logs are also widely used in operational analysis. For example, deep mining and correlation analysis of user access logs can build behavior profiles of different user groups, which in turn support multi-level marketing campaigns. However, as systems grow in scale and complexity, log-based fault diagnosis and operational analysis are affected by environmental factors, code quality and the like. Examples include large volumes of logs unrelated to faults or business requirements, and disordered logs caused by inaccurate log-level settings during system development. Such logs interfere heavily with subsequent log analysis and are regarded as "noise data". To build an effective fault signature model and aggregate accurate operational metrics, these noise logs must be filtered out before analysis.
Current technical solutions for application log noise reduction are as follows.
Scheme one: log noise filtering based on manual, experience-based labeling. System owners such as operation and maintenance personnel regularly collate and analyze the log data output by applications, classify and screen the logs according to their long-term work experience, label the logs judged to be useless noise, and forcibly filter them when logs are collected or stored. Filtering mostly takes the form of keyword matching, regular-expression templates and the like. This method suits small application systems with a small log volume, where its effect is relatively thorough.
Scheme two: noise filtering based on the application log level. This method borrows the log-level management conventions of current programming languages, such as the five main log levels in Java Log4j (debug, info, warn, error and fatal), which are used to print fine-grained debugging logs, operation logs, potential errors, runtime errors and serious event logs respectively. Following such a log-level specification, developers separate out noise by emitting low-value logs at the lowest levels and reserving info, higher levels or custom levels for everything else. Filtering useless noise logs in subsequent analysis then only requires overall control of the log level.
Scheme three: log filtering based on a noise template skip list. This method extracts and distinguishes noise based on the time-series similarity features of the logs: the target log time series is compared with a noise template, and the similarity determines whether it is a noise log. Experiments on a real cloud computing platform indicate that the method can improve the effectiveness of fault features.
The prior-art schemes mainly have the following problems.
(1) Problems with scheme one, manual experience-based labeling: as the cluster scale of systems keeps expanding, purely manual labeling becomes a difficult task, all the more so given the sudden increase in log volume and log types caused by the increasingly frequent code changes that follow the adoption of agile development. Log noise that grows linearly cannot be identified and filtered out quickly and accurately, and the Matthew effect becomes more and more serious. To be precise, this approach is too costly for medium and large projects.
(2) Problems with scheme two, noise filtering by application log level: the method depends on an accurate and comprehensive log-level specification. Existing and anticipated log types must be classified accurately, and the log-level settings must be understood and applied correctly. However, as the system goes through complex iterations, the distinction between useful new logs and noise gradually blurs at the boundaries of the original specification, and developers no longer set levels accurately. That is, noise log data continuously spills over from its original level into other levels and eventually can no longer be distinguished. This method therefore has defects in practical application.
(3) Problems with scheme three, log filtering based on a noise template skip list: this method first performs a wavelet transform on the original log sequence, then calculates the similarity difference from the noise template, and finally judges noise against a set threshold. Its biggest problems are extracting the noise template and setting the similarity threshold.
In summary, the prior-art solutions are complex, time-consuming and costly.
Disclosure of Invention
Embodiments of the invention provide an application log noise reduction method and device, to solve the problem that the prior art is overly complex, time-consuming and costly.
In a first aspect, an embodiment of the present invention provides an application log denoising method, including:
collecting an application log;
performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
In a second aspect, an embodiment of the present invention provides an apparatus for applying log noise reduction, including:
the log corpus library module is used for collecting application logs;
the word segmentation module is used for carrying out word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and the noise identification module is used for removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
a processor, a memory, a communication interface, and a communication bus; wherein,
the processor, the memory and the communication interface complete mutual communication through the communication bus;
the communication interface is used for information transmission between communication devices of the electronic equipment;
the memory stores computer program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:
collecting an application log;
performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following method:
collecting an application log;
performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
According to the application log noise reduction method and device provided by the embodiments of the invention, the collected application log is segmented with the predetermined word segmentation rule, and the feature vector obtained after segmentation is evaluated with the pre-obtained noise identification rule, so that noise can be identified simply and accurately across various application logs.
Drawings
FIG. 1 is a flowchart of a method for denoising an application log according to an embodiment of the present invention;
FIG. 2 is a flowchart of another log denoising method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a device for applying log noise reduction according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another apparatus for applying log noise reduction according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the physical structure of an electronic device.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an application log denoising method according to an embodiment of the present invention, and as shown in fig. 1, the method includes:
Step S01: collect application logs.
Application logs are collected from the network in real time and stored in a log corpus.
Step S02: perform word segmentation on the application log according to a word segmentation rule obtained in advance, to obtain a feature vector.
Because application logs come from many sources, they follow many different formats and specifications, and they contain irregular, fragmented text data as well as overly redundant or inaccurate log information. In the prior art, noise identification has to be adjusted for each format and specification. The embodiment of the invention instead applies dictionary-free word segmentation based on word-frequency statistics to the real-time application logs, so no hard constraints on log format or specification need to be followed.
Each application log is treated as a short document and segmented according to the word segmentation rule obtained in advance. Specifically, the log is split into words, a feature value is obtained for each word, and the main words and their feature values are extracted to build the feature vector corresponding to the application log.
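As a minimal sketch of this step, the following Python turns one log line into a feature vector. It assumes that the word segmentation rule has already been reduced to a fixed vocabulary of retained words and that the feature value of a word is its count in the log; the names `VOCABULARY` and `to_feature_vector` are hypothetical and do not come from the patent.

```python
from collections import Counter

# Hypothetical segmentation rule: the words retained after training,
# each mapped to its dimension in the feature vector.
VOCABULARY = {"connection": 0, "timeout": 1, "heartbeat": 2, "ok": 3}


def to_feature_vector(log_line, vocabulary=VOCABULARY):
    """Treat one log line as a short document and count its retained words."""
    tokens = log_line.lower().split()
    counts = Counter(token for token in tokens if token in vocabulary)
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        vector[vocabulary[word]] = count
    return vector


print(to_feature_vector("heartbeat ok heartbeat ok"))  # [0, 0, 2, 2]
```

Words outside the vocabulary are simply dropped here, which is one possible reading of "extracting the main words".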
Step S03: if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance from a topic model, remove the application log.
In the embodiment of the invention, a noise topic is established in the previously determined topic model from word frequencies, document features and the like, so that the noise identification rule can be derived from the topic model.
The noise identification rule is applied to the feature vector of the application log. If the feature vector matches the rule, the application log is judged to be noise and a noise mark is added to it; otherwise, no operation is performed or a non-noise mark is added.
Whether an application log is noise can then be determined from its mark. Logs judged to be noise must be removed before subsequent data mining and correlation analysis are performed on the application logs.
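The marking and removal described above can be sketched as two small functions. This is only an illustration: the predicate `is_noise` stands in for the noise identification rule, and `toy_rule` is a made-up placeholder rather than a rule derived from a topic model.

```python
def mark_logs(logs, is_noise):
    """Attach a noise mark to each log; `is_noise` stands in for the rule."""
    return [{"text": text, "noise": is_noise(text)} for text in logs]


def remove_noise(marked_logs):
    """Keep only the logs that were not marked as noise."""
    return [entry["text"] for entry in marked_logs if not entry["noise"]]


def toy_rule(text):
    # Placeholder rule for illustration only.
    return "heartbeat" in text


marked = mark_logs(["heartbeat ok", "payment failed: timeout"], toy_rule)
print(remove_noise(marked))  # ['payment failed: timeout']
```

Keeping the mark separate from the removal lets downstream analysis decide when to drop the marked logs.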
The embodiment of the invention segments the collected application logs with the predetermined word segmentation rule and then evaluates the resulting feature vectors with the pre-obtained noise identification rule, so that noise can be identified simply and accurately across various application logs.
Fig. 2 is a flowchart of another log denoising method according to an embodiment of the present invention, and as shown in fig. 2, the method further includes:
Step S10: obtain the word segmentation rule by applying a statistical word segmentation model to the historical application logs stored in the training set of the corpus.
To obtain the word segmentation rule in advance, historical data from various application logs must be collected and stored in a corpus, and part of the historical application logs forms a training set.
The word segmentation rule is obtained by applying a preset statistical word segmentation model to all historical application logs in the training set.
Step S11: perform word segmentation on each of the historical application logs according to the word segmentation rule, to obtain the corresponding feature vectors.
Each historical application log is segmented according to the obtained word segmentation rule, and the resulting words are screened and vectorized according to their word-frequency features to obtain the corresponding feature vector.
Further, the statistical word segmentation model is an N-gram language model.
There are many statistical word segmentation models; the embodiments of the invention describe only one of them, the N-gram language model. The model estimates the confidence of a segmented word from the probability or frequency with which adjacent words occur together in the training set. N may be set according to actual requirements; N = 3 is used here only as an example. When a historical application log is processed, 3-grams, whose units are English words, are extracted in order with a three-word sliding window and their occurrences are counted. The corresponding probabilities are calculated with the Bayes formula, and the maximum likelihood method is used to maximize the probability of the training samples. Finally, the high-frequency grams are extracted as the segmented words according to the calculation result. A feature vector representing the historical application log is then constructed by taking each word as a dimension and the occurrence frequency or probability of that word in the training set as its value.
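The following sketch illustrates the sliding-window counting with N = 3. It makes several assumptions that the text does not spell out: tokens are separated by whitespace, the probability is the maximum likelihood estimate P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2), and "high-frequency" means a count of at least `min_count`. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Slide an n-word window over the tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_trigrams(logs, min_count=2):
    """Count 3-grams, estimate their probabilities, keep the frequent ones."""
    trigram_counts = Counter()
    bigram_counts = Counter()
    for line in logs:
        tokens = line.split()
        trigram_counts.update(ngrams(tokens, 3))
        bigram_counts.update(ngrams(tokens, 2))
    # Maximum likelihood estimate of P(w3 | w1, w2).
    probabilities = {
        gram: count / bigram_counts[gram[:2]]
        for gram, count in trigram_counts.items()
    }
    vocabulary = sorted(g for g, c in trigram_counts.items() if c >= min_count)
    return vocabulary, probabilities


def vectorize(line, vocabulary):
    """One dimension per retained gram; the value is its count in the log."""
    counts = Counter(ngrams(line.split(), 3))
    return [counts[gram] for gram in vocabulary]


logs = [
    "cache heartbeat check ok",
    "cache heartbeat check ok",
    "payment request failed timeout",
]
vocabulary, probabilities = train_trigrams(logs)
print(vocabulary)
print(vectorize("cache heartbeat check ok", vocabulary))  # [1, 1]
```

The vector could equally use the estimated probabilities as values, since the text allows either frequency or probability information.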
Step S12: obtain a topic model by training on the feature vectors of the historical application logs, wherein the topic model comprises at least a noise topic.
Topic modeling is performed on the feature vectors of all historical application logs. Borrowing the idea of topic classification from natural language processing, the topics in the whole training set are designated as noise and non-noise, which converts the noise identification problem into a probability problem in text classification.
Further, the topic model is a Latent Dirichlet Allocation (LDA) model.
Topic models can be applied in many ways; the LDA model is used here only as an example. After the topic types are fixed as noise topics and non-noise topics, the LDA model uses the feature vectors obtained in the above embodiments to obtain, for each topic, a multinomial distribution over all words in the training set, governed by a Dirichlet prior. Next, for each historical application log, the Dirichlet-governed distribution of that log over all topics is obtained. The topic corresponding to each feature vector is then found by iterative optimization under preset hyperparameters, which establishes the topic model.
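To make the iterative optimization concrete, here is a small pure-Python sketch of LDA fitted by collapsed Gibbs sampling. Gibbs sampling is one common inference procedure for LDA; the patent does not say which procedure it uses, so this choice is an assumption, as are the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, and the toy documents. Each document is a list of word ids.

```python
import random


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling.

    Returns (doc_topic, topic_word): per-document topic distributions and
    per-topic word distributions.
    """
    rng = random.Random(seed)
    vocab_size = max(word for doc in docs for word in doc) + 1
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics

    # Random initial topic assignment for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic_counts[d][topic] += 1
            topic_word_counts[topic][word] += 1
            topic_totals[topic] += 1
        assignments.append(doc_assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment, then resample it.
                topic = assignments[d][i]
                doc_topic_counts[d][topic] -= 1
                topic_word_counts[topic][word] -= 1
                topic_totals[topic] -= 1
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][word] + beta)
                    / (topic_totals[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                topic = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = topic
                doc_topic_counts[d][topic] += 1
                topic_word_counts[topic][word] += 1
                topic_totals[topic] += 1

    doc_topic = [
        [(doc_topic_counts[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(topic_word_counts[t][w] + beta) / (topic_totals[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return doc_topic, topic_word


# Toy corpus: word ids 0-1 and 2-3 never co-occur in the same log.
docs = [[0, 1, 0, 1], [0, 1, 1, 0], [2, 3, 2, 3], [3, 2, 2, 3]]
doc_topic, topic_word = lda_gibbs(docs)
for row in doc_topic:
    print([round(p, 3) for p in row])
```

With two topics, one of them would then be designated the noise topic; which index that is depends on the data and is not fixed by the sampler.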
Step S13: obtain the noise identification rule from the topic model.
The noise identification rule is obtained by analyzing the established topic model.
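The text does not say how the rule is read off the topic model. One possible sketch, under stated assumptions, is shown below: a few logs already known to be noise are used to decide which topic is the noise topic, and a log is then judged to be noise when its probability for that topic reaches a threshold. The known-noise examples, the threshold value and both function names are hypothetical.

```python
def pick_noise_topic(doc_topic, known_noise_indices):
    """Choose the topic that best explains logs already known to be noise."""
    num_topics = len(doc_topic[0])
    totals = [
        sum(doc_topic[i][t] for i in known_noise_indices)
        for t in range(num_topics)
    ]
    return max(range(num_topics), key=totals.__getitem__)


def make_noise_rule(noise_topic, threshold=0.5):
    """Build a rule that flags a log whose noise-topic probability is high."""
    def rule(topic_distribution):
        return topic_distribution[noise_topic] >= threshold
    return rule


# Per-log topic distributions, e.g. as produced by a trained topic model.
doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
noise_topic = pick_noise_topic(doc_topic, known_noise_indices=[0, 1])
rule = make_noise_rule(noise_topic)
print(noise_topic, rule([0.9, 0.1]), rule([0.1, 0.9]))  # 0 True False
```

Note that this rule operates on a log's topic distribution, so a new log's feature vector would first have to be mapped to topic probabilities by the model.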
According to the embodiment of the invention, the word segmentation rule, the topic model and the noise identification rule are obtained from the historical application logs in the training set by means of the statistical word segmentation model, so that noise can be identified simply and accurately across various application logs.
Based on the above embodiment, further, the corpus further includes a test set, where the test set includes at least one test application log; accordingly, the method further comprises:
performing word segmentation processing on each test application log in the test set according to the word segmentation rule to obtain a corresponding feature vector;
and performing noise identification according to the feature vector and the noise identification rule, comparing the identification result with a preset standard, and, if a deviation exists, optimizing the topic model according to the deviation.
In addition to the training set, the corpus also contains a test set made up of other historical application logs, which serve as test application logs.
The statistical word segmentation model yields the word segmentation rule from the training set, and the topic model yields the noise identification rule. The test application logs in the test set are fed to the statistical word segmentation model, their feature vectors are obtained according to the word segmentation rule, and the topic of each test application log is determined from the established topic model and the noise identification rule.
The result is compared with the preset standard. If a deviation exists, the topic model is optimized again according to the deviation to obtain a more accurate noise identification rule.
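The comparison against a preset standard can be sketched as follows, assuming that the standard is a list of expected noise flags for the test logs and that the deviation is measured as the fraction of mismatches. Both assumptions, the `tolerance` parameter and the function names are illustrative only.

```python
def deviation(predicted, expected):
    """Fraction of test logs whose noise flag differs from the standard."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    mismatches = sum(p != e for p, e in zip(predicted, expected))
    return mismatches / len(expected)


def needs_reoptimization(predicted, expected, tolerance=0.0):
    """Signal that the topic model should be optimized again."""
    return deviation(predicted, expected) > tolerance


predicted = [True, False, True, True]
expected = [True, False, False, True]
print(deviation(predicted, expected))             # 0.25
print(needs_reoptimization(predicted, expected))  # True
```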
The embodiment of the invention tests the word segmentation rule and the noise identification rule through the test set, and if the result has deviation, the topic model is optimized again, so that the noise identification can be simply, conveniently and accurately carried out on various application logs.
Based on the above embodiment, further, the method further includes:
and periodically storing all the collected application logs into the corpus for further optimizing the word segmentation rule and the noise identification rule.
In actual use, new applications may appear and produce new application logs, or new habitual wording may emerge. So that the word segmentation rule and the noise identification rule can keep up with such changes, all collected application logs are periodically stored in the corpus as historical application logs and assigned to either the training set or the test set. The word segmentation rule is optimized with the updated training set, the topic model is optimized at the same time, and an optimized noise identification rule is then obtained.
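A periodic corpus update might look like the sketch below. The text does not specify how logs are divided between the two sets, so the random split, the `test_fraction` value and the dictionary layout of the corpus are all assumptions made for illustration.

```python
import random


def add_to_corpus(corpus, new_logs, test_fraction=0.2, seed=0):
    """Append newly collected logs to the training set or the test set."""
    rng = random.Random(seed)
    for log in new_logs:
        target = "test" if rng.random() < test_fraction else "train"
        corpus[target].append(log)
    return corpus


corpus = {"train": ["old log line"], "test": []}
add_to_corpus(corpus, ["new log a", "new log b", "new log c"])
print(len(corpus["train"]) + len(corpus["test"]))  # 4
```

After each update, the word segmentation rule and the topic model would be retrained on the enlarged training set.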
According to the embodiment of the invention, historical application logs are supplemented to the corpus regularly, so that the word segmentation rule and the noise identification rule are continuously optimized, and noise identification can be simply, conveniently and accurately carried out on various application logs.
Fig. 3 is a schematic structural diagram of a device for applying log noise reduction according to an embodiment of the present invention, and as shown in fig. 3, the device includes: the system comprises a log corpus module 10, a word segmentation module 11 and a noise identification module 12, wherein the log corpus module 10 is used for collecting application logs; the word segmentation module 11 is configured to perform word segmentation processing on the application log according to a pre-obtained word segmentation rule to obtain a feature vector; the noise identification module 12 is configured to remove the application log if the application log is determined to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model. Specifically, the method comprises the following steps:
the log corpus module 10 collects application logs from the network in real time and stores the application logs into the log corpus. Meanwhile, the log corpus module 10 sends the collected application logs to the word segmentation module 11 in real time.
The word segmentation module 11 regards each application log as a short document, performs word segmentation processing on the application log according to a word segmentation rule obtained in advance, specifically performs word segmentation on the application log, obtains a feature value of each word, extracts main words and feature values to form a feature vector corresponding to the application log, and sends the feature vector to the noise identification module 12.
In the embodiment of the invention, a noise topic is established in the previously determined topic model from word frequencies, document features and the like, so that the noise identification rule can be derived from the topic model.
The noise identification module 12 determines the feature vector of the application log by using the noise identification rule, and if the feature vector conforms to the noise identification rule, it determines that the application log is noise and needs to add a noise mark in the application log, otherwise, no operation is performed or a non-noise mark is added.
Whether the application logs are noise or not can be judged through the identification of the marks, and if the corresponding application logs are judged to be noise, the application logs need to be removed through a filtering module before subsequent data mining and correlation analysis are carried out on the application logs.
The apparatus provided in the embodiment of the present invention is configured to execute the method described above. For its functions, refer to the method embodiment; the detailed method flow is not repeated here.
The embodiment of the invention segments the collected application logs with the predetermined word segmentation rule and then evaluates the resulting feature vectors with the pre-obtained noise identification rule, so that noise can be identified simply and accurately across various application logs.
Fig. 4 is a schematic structural diagram of another apparatus for applying log noise reduction according to an embodiment of the present invention, and as shown in fig. 4, the apparatus further includes: a training segmentation module 20, a modeling module 21, and a classification construction module 22, wherein,
the training word segmentation module 20 is configured to obtain the word segmentation rule by using a statistical word segmentation model according to a historical application log stored in a training set in a corpus; the training word segmentation module 20 is further configured to perform word segmentation processing on each historical application log in the historical application logs according to the word segmentation rule to obtain a corresponding feature vector; the modeling module 21 is configured to obtain a topic model through training according to the feature vector of the historical application log; wherein the topic model comprises at least a noise topic; the classification construction module 22 is configured to obtain the noise identification rule according to the topic model. Specifically, the method comprises the following steps:
To obtain the word segmentation rule in advance, historical data from various application logs must be collected and stored in a corpus, and part of the historical application logs forms a training set.
The training word segmentation module 20 obtains the word segmentation rule by using a preset statistical word segmentation model according to all historical application logs in a training set. The training word segmentation module 20 sends word segmentation rules to the word segmentation module 11.
The training word segmentation module 20 may perform word segmentation on each historical application log according to the obtained word segmentation rule, and perform screening and vectorization according to the word frequency feature to obtain a corresponding feature vector.
Further, the statistical word segmentation model is an N-gram language model.
There are many statistical word segmentation models; the embodiments of the invention describe only one of them, the N-gram language model. The model estimates the confidence of a segmented word from the probability or frequency with which adjacent words occur together in the training set. N may be set according to actual requirements; N = 3 is used here only as an example. When a historical application log is processed, 3-grams, whose units are English words, are extracted in order with a three-word sliding window and their occurrences are counted. The corresponding probabilities are calculated with the Bayes formula, and the maximum likelihood method is used to maximize the probability of the training samples. Finally, the high-frequency grams are extracted as the segmented words according to the calculation result. A feature vector representing the historical application log is then constructed by taking each word as a dimension and the occurrence frequency or probability of that word in the training set as its value.
The modeling module 21 performs topic modeling on the feature vectors of all the historical application logs obtained by the training word segmentation module 20. Borrowing the idea of topic classification from natural language processing, it designates the topics in the whole training set as noise and non-noise, which converts the noise identification problem into a probability problem in text classification.
Further, the topic model is a Latent Dirichlet Allocation (LDA) model.
Topic models can be applied in many ways; the LDA model is used here only as an example. After the topic types are fixed as noise topics and non-noise topics, the LDA model uses the feature vectors obtained in the above embodiments to obtain, for each topic, a multinomial distribution over all words in the training set, governed by a Dirichlet prior. Next, for each historical application log, the Dirichlet-governed distribution of that log over all topics is obtained. The topic corresponding to each feature vector is then found by iterative optimization under preset hyperparameters, which establishes the topic model.
The classification construction module 22 may obtain the noise identification rule through analysis of the established topic model, and send the noise identification rule to the noise identification module 12.
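The text does not spell out the form of the noise identification rule. One plausible form, shown here purely as an assumption, is a probability threshold: a log is treated as noise when the posterior probability of the noise topic exceeds a preset value. The sketch simplifies inference by treating each log as drawn from a single topic; `phi`, `topic_prior`, `threshold` and `noise_topic` are hypothetical names and values.

```python
import math

def noise_probability(vector, phi, topic_prior, noise_topic=0):
    """Posterior probability of the noise topic for one feature vector.

    Simplification: the whole log is treated as drawn from a single topic, so
    p(topic | log) is proportional to prior * product of phi[topic][w] ** count.
    """
    log_scores = []
    for t, word_dist in enumerate(phi):
        score = math.log(topic_prior[t])
        for w, count in enumerate(vector):
            if count:
                score += count * math.log(word_dist[w])
        log_scores.append(score)
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_scores)
    exp_scores = [math.exp(s - top) for s in log_scores]
    return exp_scores[noise_topic] / sum(exp_scores)

def is_noise(vector, phi, topic_prior, threshold=0.5, noise_topic=0):
    """Threshold rule: noise if the noise-topic posterior exceeds the threshold."""
    return noise_probability(vector, phi, topic_prior, noise_topic) > threshold

# Hypothetical topic-word distributions: topic 0 is taken to be the noise topic.
phi = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
prior = [0.5, 0.5]
noisy = is_noise([2, 1, 0, 0], phi, prior)
clean = is_noise([0, 0, 1, 2], phi, prior)
```

Raising `threshold` makes the rule more conservative, so fewer logs are discarded at the cost of letting more noise through.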
The apparatus provided in this embodiment of the present invention is configured to execute the above method. For its functions, refer to the method embodiments; the detailed method flow is not described here again.
According to the embodiments of the present invention, the word segmentation rule, the topic model and the noise identification rule are obtained by training on the historical application logs in the training set with a statistical word segmentation model, so that noise can be identified simply, conveniently and accurately in various application logs.
Fig. 5 illustrates the physical structure of an electronic device. As shown in Fig. 5, the electronic device may include a processor 810, a communication interface 820, a memory 830 and a communication bus 840, where the processor 810, the communication interface 820 and the memory 830 communicate with one another via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method: collecting an application log; performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
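The three steps executed by the processor (collect, segment, remove if noise) can be sketched as a simple filter. The function `denoise` and the toy stand-ins `toy_segment` and `toy_is_noise` are hypothetical; in practice they would be replaced by the word segmentation rule and the noise identification rule obtained in training.

```python
from typing import Callable, Iterable, List, Sequence

def denoise(
    logs: Iterable[str],
    segment: Callable[[str], Sequence[float]],
    is_noise: Callable[[Sequence[float]], bool],
) -> List[str]:
    """Keep only the collected logs that the noise rule does not flag."""
    kept = []
    for line in logs:
        vector = segment(line)   # word segmentation -> feature vector
        if is_noise(vector):     # noise identification rule
            continue             # remove the noisy log
        kept.append(line)
    return kept

# Toy stand-ins for the trained rules (assumptions for illustration only).
KEYWORDS = ["heartbeat", "error"]

def toy_segment(line: str) -> List[int]:
    words = line.lower().split()
    return [words.count(k) for k in KEYWORDS]

def toy_is_noise(vector: Sequence[float]) -> bool:
    # Treat a log as noise if it mentions a heartbeat and no error.
    return vector[0] > 0 and vector[1] == 0

collected = ["heartbeat ok", "error disk full", "heartbeat error timeout"]
kept = denoise(collected, toy_segment, toy_is_noise)
```

Passing the two rules in as callables keeps the filtering loop independent of how the rules were trained, which mirrors the separation between the training modules and the online modules in the apparatus.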
Further, an embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium. The computer program comprises program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example: collecting an application log; performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
Further, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions, which cause the computer to perform the method provided by the above method embodiments, for example, including: collecting an application log; performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector; and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above-described embodiments of the electronic device and the like are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for denoising an application log, comprising:
collecting an application log;
performing word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
2. The method of claim 1, further comprising:
obtaining the word segmentation rule by adopting a statistical word segmentation model according to a historical application log stored in a training set in a corpus;
performing word segmentation processing on each historical application log in the historical application logs according to the word segmentation rule to obtain corresponding feature vectors;
obtaining a topic model through training according to the feature vectors of the historical application logs; wherein the topic model comprises at least a noise topic;
and obtaining the noise identification rule according to the topic model.
3. The method according to claim 2, wherein the corpus further includes a test set, the test set including at least one test application log; accordingly, the method further comprises:
performing word segmentation processing on each test application log in the test set according to the word segmentation rule to obtain a corresponding feature vector;
and performing noise identification according to the feature vector and the noise identification rule, comparing the noise identification with a preset standard, and optimizing the topic model according to the deviation if the deviation exists.
4. The method of claim 2, further comprising:
and periodically storing all the collected application logs into the corpus for further optimizing the word segmentation rule and the noise identification rule.
5. The method of claim 2, wherein the statistical word segmentation model is an N-gram language model.
6. The method of claim 2, wherein the topic model is a Latent Dirichlet Allocation (LDA) model.
7. An application log noise reduction apparatus, comprising:
the log corpus library module is used for collecting application logs;
the word segmentation module is used for carrying out word segmentation processing on the application log according to a word segmentation rule obtained in advance to obtain a feature vector;
and the noise identification module is used for removing the application log if the application log is judged to be noise according to the feature vector and a noise identification rule obtained in advance according to a topic model.
8. The apparatus of claim 7, further comprising:
the training word segmentation module is used for obtaining the word segmentation rule by adopting a statistical word segmentation model according to a historical application log stored in a training set in a corpus;
the training word segmentation module is further used for performing word segmentation processing on each historical application log in the historical application logs according to the word segmentation rule to obtain corresponding feature vectors;
the modeling module is used for obtaining a topic model through training according to the feature vectors of the historical application logs; wherein the topic model comprises at least a noise topic;
and the classification construction module is used for obtaining the noise identification rule according to the topic model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the application log noise reduction method according to any one of claims 1 to 6.
10. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the application log noise reduction method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811587244.9A CN111368534A (en) | 2018-12-25 | 2018-12-25 | Application log noise reduction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811587244.9A CN111368534A (en) | 2018-12-25 | 2018-12-25 | Application log noise reduction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111368534A (en) | 2020-07-03 |
Family
ID=71205896
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811587244.9A Pending CN111368534A (en) | 2018-12-25 | 2018-12-25 | Application log noise reduction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111368534A (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120101965A1 (en) * | 2010-10-26 | 2012-04-26 | Microsoft Corporation | Topic models |
CN102902752A (en) * | 2012-09-20 | 2013-01-30 | 新浪网技术(中国)有限公司 | Method and system for monitoring log |
CN103678282A (en) * | 2014-01-07 | 2014-03-26 | 苏州思必驰信息科技有限公司 | Word segmentation method and device |
CN105045812A (en) * | 2015-06-18 | 2015-11-11 | 上海高欣计算机系统有限公司 | Text topic classification method and system |
CN106844424A (en) * | 2016-12-09 | 2017-06-13 | 宁波大学 | A kind of file classification method based on LDA |
CN108268560A (en) * | 2017-01-03 | 2018-07-10 | 中国移动通信有限公司研究院 | A kind of file classification method and device |
CN107092650A (en) * | 2017-03-13 | 2017-08-25 | 网宿科技股份有限公司 | A kind of Web Log Analysis method and device |
CN108667678A (en) * | 2017-03-29 | 2018-10-16 | 中国移动通信集团设计院有限公司 | A kind of O&M Log security detection method and device based on big data |
CN107943791A (en) * | 2017-11-24 | 2018-04-20 | 北京奇虎科技有限公司 | A kind of recognition methods of refuse messages, device and mobile terminal |
CN108170818A (en) * | 2017-12-29 | 2018-06-15 | 深圳市金立通信设备有限公司 | A kind of file classification method, server and computer-readable medium |
CN108509793A (en) * | 2018-04-08 | 2018-09-07 | 北京明朝万达科技股份有限公司 | A kind of user's anomaly detection method and device based on User action log data |
Non-Patent Citations (1)
Title |
---|
ANTON A.CHUVAKIN ET AL.: "日志管理与分析权威指南", vol. 1, 机械工业出版社, pages: 96 - 98 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114564473A (en) * | 2022-04-28 | 2022-05-31 | 江苏益柏锐信息科技有限公司 | Data processing method, equipment and medium based on ERP enterprise management system |
CN114564473B (en) * | 2022-04-28 | 2022-07-12 | 江苏益柏锐信息科技有限公司 | Data processing method, equipment and medium based on ERP enterprise management system |
CN114896236A (en) * | 2022-05-25 | 2022-08-12 | 中卫市昊科电子技术有限公司 | Big data denoising optimization method and big data system applying artificial intelligence analysis |
CN115757068A (en) * | 2022-11-17 | 2023-03-07 | 中电云数智科技有限公司 | Process log acquisition and automatic noise reduction method and system based on eBPF |
CN115757068B (en) * | 2022-11-17 | 2024-03-05 | 中电云计算技术有限公司 | Process log acquisition and automatic noise reduction method and system based on eBPF |
CN116578534A (en) * | 2023-04-11 | 2023-08-11 | 华能信息技术有限公司 | Log message data format identification method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108737406B (en) | Method and system for detecting abnormal flow data | |
CN111368534A (en) | Application log noise reduction method and device | |
JP7090936B2 (en) | ESG-based corporate evaluation execution device and its operation method | |
CN110020422B (en) | Feature word determining method and device and server | |
CN111798312B (en) | Financial transaction system anomaly identification method based on isolated forest algorithm | |
CN109271520B (en) | Data extraction method, data extraction device, storage medium, and electronic apparatus | |
CN106445915B (en) | New word discovery method and device | |
CN110175851B (en) | Cheating behavior detection method and device | |
CN113590764B (en) | Training sample construction method and device, electronic equipment and storage medium | |
CN115544240B (en) | Text sensitive information identification method and device, electronic equipment and storage medium | |
CN108052509A (en) | A kind of Text similarity computing method, apparatus and server | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN108021595B (en) | Method and device for checking knowledge base triples | |
CN111754352A (en) | Method, device, equipment and storage medium for judging correctness of viewpoint statement | |
CN114969334B (en) | Abnormal log detection method and device, electronic equipment and readable storage medium | |
CN116578700A (en) | Log classification method, log classification device, equipment and medium | |
CN112685374A (en) | Log classification method and device and electronic equipment | |
CN115470034A (en) | Log analysis method, device and storage medium | |
CN113535458B (en) | Abnormal false alarm processing method and device, storage medium and terminal | |
CN116155541A (en) | Automatic machine learning platform and method for network security application | |
CN115758183A (en) | Training method and device for log anomaly detection model | |
CN115495587A (en) | Alarm analysis method and device based on knowledge graph | |
CN114610576A (en) | Log generation monitoring method and device | |
CN110569498B (en) | Compound word recognition method and related device | |
CN112818972A (en) | Method and device for detecting interest point image, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200703 |