CN113239684A - Method and device for automatically identifying abnormal log based on partial mark

Info

Publication number
CN113239684A
Authority
CN
China
Prior art keywords
log
words
module
real
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110626278.XA
Other languages
Chinese (zh)
Inventor
孟伟彬
刘莹
裴丹
菲德利阁·扎特·特里尼达
何林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110626278.XA
Publication of CN113239684A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a method for automatically identifying abnormal logs based on partial marks, and relates to the technical field of abnormal log identification. The method comprises the following steps: preprocessing a real-time log to obtain a preprocessed log, wherein the preprocessing comprises filtering out variable words in the real-time log; performing feature extraction on the preprocessed log by using a bag-of-words model, and converting the preprocessed log into a feature vector by using a word frequency-inverse position frequency weighting method; and performing anomaly detection on the feature vector by using a pre-trained PU (positive-unlabeled) learning anomaly detector to obtain an anomaly detection result. This scheme addresses the poor flexibility, heavy workload, inability to identify unknown anomaly types, and low accuracy of existing methods. It identifies abnormal logs online with a model trained on partially marked logs, so that the causes related to an anomaly can be found and errors can be corrected and losses stopped in time.

Description

Method and device for automatically identifying abnormal log based on partial mark
Technical Field
The application relates to the technical field of abnormal log identification, in particular to an abnormal log automatic identification method and device based on partial marks.
Background
Logs are one of the most valuable data sources in infrastructure operation and maintenance, and the information they contain provides a good view for analyzing problems. Logs describe a large number of events, which greatly aids the detection and localization of anomalies. Mining log messages with a data-driven approach can help improve the stability and availability of the infrastructure. Operation and maintenance engineers use keywords or regular expressions to identify abnormal logs. As the size and complexity of data center infrastructure keep increasing, the number of logs has grown explosively. Because the rules are maintained manually by engineers, rule-based methods lack flexibility, are time-consuming and labor-intensive, and are not suitable for large-scale use. Meanwhile, abnormal log identification faces the problem of partial marking. Operation and maintenance engineers define a large number of rules to identify abnormal logs, but the rules still cannot cover all abnormal logs, so a large number of abnormal logs are not marked. Furthermore, because normal logs are numerous and trivial, engineers typically do not mark them. More importantly, historical rules cannot cover all functions and models of infrastructure devices. Each device model requires a dedicated rule base for detection, and when a new device comes online the rules are not updated in time. Therefore, supervised methods, which require fully labeled samples as input, cannot handle partially marked scenarios.
Automated log detection for known error conditions is a common practice. Typically, operation and maintenance engineers write rules for the abnormal log messages that need attention so that such logs can be detected automatically in the future. Rule-based methods are the common industry approach to abnormal log identification. The simplest rule is keyword matching, for example on a keyword such as "loss". Another common rule is a regular expression that is manually configured according to domain knowledge.
LogGAN is a generative adversarial network for anomaly detection on system logs and is an unsupervised model. LogGAN detects log-level anomalies based on features such as the text of the log. The generative adversarial network model reduces the impact of the imbalance between normal and abnormal instances, thereby improving its ability to capture anomalies. It learns the patterns in normal logs, and a new type of log that violates these normal patterns is regarded as abnormal.
Rules such as regular expressions match text logs too strictly. When log formats are not completely consistent, even a difference of a single letter, space, or symbol prevents a strict rule from matching similar abnormal logs. Furthermore, abnormal logs that belong to the same anomaly category but come from different infrastructure hardware and software are likely to differ subtly. For example, two different types of devices may produce abnormal logs that are semantically very similar but syntactically different. Similar anomalies from different devices share similar patterns even though the syntax of their logs differs, but rules cannot capture these patterns. Thus, rule-based approaches are not flexible enough for detecting abnormal logs.
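The strictness problem described above can be illustrated with a minimal Python sketch. The regular expression and the three log lines are hypothetical examples invented for illustration; they do not come from the application. The sketch shows a rule missing two logs that share all of the rule's words.

```python
import re

# Hypothetical rule written for one device's log format (not from the source).
RULE = re.compile(r"^Interface (\S+), changed state to down$")

# Hypothetical log lines describing the same kind of event.
LOGS = [
    "Interface ae3, changed state to down",                   # format the rule targets
    "Interface ae3 changed state to down",                    # same event, comma missing
    "Line protocol on Interface ae3, changed state to down",  # wording of another device
]


def words(line: str) -> set:
    """Split a log line into a set of word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", line))


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two log lines (1.0 means identical word sets)."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)


# Which logs the strict rule matches, and how similar each log is to the first one.
rule_hits = [bool(RULE.match(line)) for line in LOGS]
overlaps = [jaccard(LOGS[0], line) for line in LOGS]

print(rule_hits)
print(overlaps)
```

Only the first line satisfies the rule, although the second line has exactly the same word set and the third shares most of its words. A word-based representation keeps this similarity, which is the motivation for treating the task as text classification.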
All rules are manually configured and updated by operation and maintenance engineers. Because a large number of new types of abnormal logs (thousands per day) are generated, engineers must configure monitoring rules for these logs, and configuring the rules manually takes a huge amount of work. Furthermore, in many applications, 20%-45% of log statements change over the lifecycle. Although some log parsing methods can assist engineers in configuring monitoring rules, the configured monitoring rules must still be manually marked (abnormal or normal) by the engineers. Considering the huge number of new log types, this approach is still time-consuming and labor-intensive.
In rule-based approaches, operation and maintenance engineers formulate rules based on previously encountered anomalies, so the rules generally cannot match and identify unknown errors.
If the problem is solved in a purely unsupervised manner, the known abnormal log labels are wasted, and the accuracy of unsupervised approaches is not ideal.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a method for automatically identifying abnormal logs based on partial marks. The method addresses the poor flexibility, heavy workload, inability to identify unknown anomaly types, and low accuracy of existing methods, and solves the problem of online anomaly identification for massive system logs. It identifies abnormal logs online with a model trained on partially marked logs, so that the causes related to an anomaly can be found and errors can be corrected and losses stopped in time.
A second object of the present application is to provide an automatic identification apparatus for an anomaly log based on partial marking.
A third object of the present application is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a method for automatically identifying abnormal logs based on partial marks, including: preprocessing a real-time log to obtain a preprocessed log, wherein the preprocessing comprises filtering out variable words in the real-time log; performing feature extraction on the preprocessed log by using a bag-of-words model, and converting the preprocessed log into a feature vector by using a word frequency-inverse position frequency weighting method; and performing anomaly detection on the feature vector by using a pre-trained PU learning anomaly detector to obtain an anomaly detection result.
Optionally, in an embodiment of the present application, the preprocessing the real-time log includes the following steps:
dividing words of the real-time log into text words and variable words, wherein the text words are words describing events occurring in the log, and the variable words are variables occurring in the real-time log;
automatically extracting template words in the real-time log by using an automatic log analysis method;
using the template words as the text words and filtering out the variable words.
Optionally, in an embodiment of the present application, the word frequency-inverse position frequency weighting method specifically includes defining an inverse position frequency of a word as:
[Equation images BDA0003102190850000031 and BDA0003102190850000032: definition of the inverse position frequency; not reproduced in this text]
wherein ILF is the inverse position frequency of the word w, the position l_k ∈ L of the word w is defined as the ordinal position at which the word occurs, k denotes the k-th word, and L is the longest length of all logs.
Optionally, in an embodiment of the present application, training the PU learning anomaly detector includes the following steps:
acquiring a history log, and preprocessing the history log;
performing feature extraction on the preprocessed logs to generate feature vectors;
selecting a preset number of logs from the historical logs for marking, and generating an abnormal mark;
and training the PU learning anomaly detector according to the anomaly marks and the feature vectors.
In order to achieve the above object, a second aspect of the present invention provides an apparatus for automatically identifying an abnormal log based on partial marks, including:
the preprocessing module is used for preprocessing the real-time log to obtain a preprocessed log, wherein the preprocessing comprises filtering variable words in the real-time log;
the extraction module is used for extracting the characteristics of the preprocessed logs by using a word bag model and converting the preprocessed logs into characteristic vectors by using a word frequency-inverse position frequency weighting method;
and the detection module is used for carrying out anomaly detection on the feature vector by using a pre-trained PU learning anomaly detector to obtain an anomaly detection result.
Optionally, in an embodiment of the present application, the preprocessing module includes a classification module, an extraction module, and a filtering module, wherein:
the classification module is used for dividing words of the real-time log into text words and variable words, wherein the text words are words describing events occurring in the log, and the variable words are variables appearing in the real-time log;
the extraction module is used for automatically extracting template words in the real-time log by using an automatic log analysis method;
and the filtering module is used for filtering variable words by using the template words as text words.
Optionally, in an embodiment of the present application, the system further includes a pre-training module, where the pre-training module includes a calling module, a marking module, and a training module, where:
the calling module is used for calling the preprocessing module to preprocess the historical log to generate training data, and calling the extraction module to perform feature extraction on the training data to generate feature vectors;
the marking module is used for selecting a preset number of logs from the historical logs to mark so as to generate abnormal marks;
and the training module is used for training the PU learning anomaly detector according to the anomaly marks and the feature vectors.
In order to achieve the above object, an embodiment of a third aspect of the present invention provides a non-transitory computer-readable storage medium; when instructions in the storage medium are executed by a processor, the method for automatically identifying abnormal logs based on partial marks can be performed.
The method for automatically identifying the abnormal logs based on the partial marks, the device for automatically identifying the abnormal logs based on the partial marks and the non-transitory computer-readable storage medium solve the technical problems of poor flexibility, large workload, incapability of identifying unknown abnormal types and low accuracy of the existing method, solve the problem of online abnormal identification of massive system logs, and achieve the purposes of online identifying the abnormal logs based on the log training model of the partial marks, finding out reasons related to the abnormal logs, and timely correcting and stopping loss according to the abnormal logs.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of an automatic identification method for an exception log based on partial tags according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a partial mark log of an automatic identification method for an exception log based on partial mark according to an embodiment of the present application;
fig. 3 is a diagram of an overall design of an automatic identification method for an exception log based on a partial mark according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The method and the device for automatically identifying the abnormal log based on the partial marks according to the embodiment of the application are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of an automatic identification method for an exception log based on a partial marker according to an embodiment of the present application.
As shown in fig. 1, the method for automatically identifying an abnormal log based on partial marks comprises the following steps:
Step 101, preprocessing a real-time log to obtain a preprocessed log, wherein the preprocessing comprises filtering out variable words in the real-time log;
Step 102, performing feature extraction on the preprocessed log by using a bag-of-words model, and converting the preprocessed log into a feature vector by using a word frequency-inverse position frequency weighting method;
Step 103, performing anomaly detection on the feature vector by using a pre-trained PU learning anomaly detector to obtain an anomaly detection result.
In the method for automatically identifying abnormal logs based on partial marks according to the embodiment of the application, a real-time log is preprocessed to obtain a preprocessed log, wherein the preprocessing comprises filtering out variable words in the real-time log; feature extraction is performed on the preprocessed log by using a bag-of-words model, and the preprocessed log is converted into a feature vector by using a word frequency-inverse position frequency weighting method; and anomaly detection is performed on the feature vector by using a pre-trained PU learning anomaly detector to obtain an anomaly detection result. The method therefore addresses the poor flexibility, heavy workload, inability to identify unknown anomaly types, and low accuracy of existing methods, and solves the problem of online anomaly identification for massive system logs. It identifies abnormal logs online with a model trained on partially marked logs, so that the causes related to an anomaly can be found and errors can be corrected and losses stopped in time.
Further, in the embodiment of the present application, the preprocessing the real-time log includes the following steps:
dividing words of the real-time log into text words and variable words, wherein the text words are words describing events occurring in the log, and the variable words are variables occurring in the real-time log;
automatically extracting template words in the real-time log by using an automatic log analysis method;
using the template words as the text words and filtering out the variable words.
The log is preprocessed before features are extracted from it. Removing variables (such as IP addresses) from the log improves the performance of abnormal log detection and classification. To achieve better results, operation and maintenance engineers can adjust the log preprocessing rules in time according to their own domain knowledge. Words that are irrelevant to text classification may also be present among the template words, e.g., a "constant" special symbol in some templates.
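The preprocessing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the application extracts template words with an automatic log parsing method, which is not specified here, so the sketch accepts an optional, externally supplied set of template words and otherwise falls back to a few regular expressions. The patterns, the function names `is_variable` and `preprocess`, and the sample log are all hypothetical.

```python
import re
from typing import List, Optional, Set

# Patterns treated as variable words in this sketch (assumed, not from the source).
VARIABLE_PATTERNS = [
    re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$"),  # IPv4 address with optional port
    re.compile(r"^0x[0-9a-fA-F]+$"),                # hexadecimal value
    re.compile(r"^\d+$"),                           # plain integer
]


def is_variable(token: str) -> bool:
    """Return True if the token looks like a variable rather than event text."""
    return any(p.match(token) for p in VARIABLE_PATTERNS)


def preprocess(log_line: str, template_words: Optional[Set[str]] = None) -> List[str]:
    """Keep the text words of a log line and drop its variable words.

    If template_words is given (for example, produced by a log parser), only
    those words are kept. Otherwise the regex heuristics above are used.
    """
    tokens = log_line.split()
    if template_words is not None:
        return [t for t in tokens if t in template_words]
    return [t for t in tokens if not is_variable(t)]


# Hypothetical log line used only to exercise the function.
sample = "Connection from 10.0.0.5:8080 closed after 30 retries"
filtered = preprocess(sample)
templated = preprocess(sample, template_words={"Connection", "closed"})
print(filtered)
print(templated)
```

Passing the template words in as a parameter keeps the filter independent of the parsing method, so an engineer can swap the parser or adjust the heuristics according to domain knowledge.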
Further, in this embodiment of the present application, the word frequency-inverse position frequency weighting method specifically includes defining an inverse position frequency of a word as:
[Equation images BDA0003102190850000051 and BDA0003102190850000052: definition of the inverse position frequency; not reproduced in this text]
wherein ILF is the inverse position frequency of the word w, the position l_k ∈ L of the word w is defined as the ordinal position at which the word occurs, k denotes the k-th word, and L is the longest length of all logs.
A log is unstructured text and cannot be fed directly to a machine learning algorithm, so the text log is converted into a feature vector using a bag-of-words model. The bag-of-words model represents the text as a vector in which the value of each element is an estimate of the importance (weight) of the corresponding word in the log. Because the classic weighting method TF-IDF is not suitable for log analysis, a new log word weighting method based on domain knowledge, word frequency-inverse position frequency, is proposed. Word frequency is an important index of the importance of a word: the more times a word appears in a log, the more important the word is to that log. The inverse position frequency measures the importance of a word by how many different positions the word appears at in the logs.
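The weighting can be sketched in Python as follows. The equation images of the application are not available in this text, so the exact formula is an assumption: the sketch takes ILF(w) = log(L / LF(w)), where LF(w) is the number of distinct ordinal positions at which w occurs across all logs and L is the length of the longest log, and it takes the word frequency as the word count divided by the log length. The function names and the sample logs are hypothetical, and logs are assumed to be non-empty token lists.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def inverse_position_frequency(logs: List[List[str]]) -> Dict[str, float]:
    """Assumed ILF: log(longest log length / number of distinct positions of w)."""
    longest = max(len(log) for log in logs)
    positions = defaultdict(set)
    for log in logs:
        for index, word in enumerate(log):
            positions[word].add(index)
    return {w: math.log(longest / len(p)) for w, p in positions.items()}


def tf_ilf_vectors(logs: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Bag-of-words vectors weighted by word frequency times the assumed ILF."""
    ilf = inverse_position_frequency(logs)
    vocabulary = sorted(ilf)
    vectors = []
    for log in logs:
        counts = Counter(log)
        vectors.append([counts[w] / len(log) * ilf[w] for w in vocabulary])
    return vocabulary, vectors


# Hypothetical preprocessed logs (variable words already filtered out).
sample_logs = [
    ["link", "down", "on", "port"],
    ["port", "link", "flap"],
    ["link", "down"],
]
ilf = inverse_position_frequency(sample_logs)
vocabulary, vectors = tf_ilf_vectors(sample_logs)
print(vocabulary)
print(vectors[2])
```

Under this assumed formula, a word that occurs at many different positions receives a low weight, while a word that always occurs at the same position receives a high one.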
Further, in the embodiment of the present application, the method for training the PU learning anomaly detector includes the following steps:
acquiring a history log, and preprocessing the history log;
performing feature extraction on the preprocessed logs to generate feature vectors;
selecting a preset number of logs from the historical logs for marking, and generating an abnormal mark;
and training the PU learning anomaly detector according to the anomaly marks and the feature vectors.
In the off-line learning part, the log is preprocessed, parameters are filtered, then a feature vector is constructed, and finally an anomaly detector based on PU learning is trained.
Fig. 2 is a schematic diagram of a partial mark log of an automatic identification method for an exception log based on a partial mark according to an embodiment of the present application.
As shown in fig. 2, in the method for automatically identifying abnormal logs based on partial marks, only some of the abnormal logs are marked; the normal logs and a large number of abnormal logs are not marked. In other words, in the training data only some of the positive samples are marked, and no negative samples are marked. Abnormal log identification is therefore implemented with PU (positive-unlabeled) learning, because supervised methods and traditional semi-supervised methods cannot handle this scenario.
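The partially marked setting can be sketched in Python as follows. The application does not state which PU learning algorithm the detector uses, so this sketch swaps in one well-known option, the Elkan-Noto correction on top of a plain logistic regression, purely as an assumption. The class `PUAnomalyDetector`, the hyperparameters, and the two-dimensional toy vectors are hypothetical. For brevity the constant c is estimated on the labeled training samples rather than on a held-out set.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated probability that x is *labeled* (not that it is anomalous)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(X: List[List[float]], s: List[int],
                   epochs: int = 2000, lr: float = 0.5) -> Tuple[List[float], float]:
    """Full-batch gradient descent for labeled (1) versus unlabeled (0)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, s):
            err = score(w, b, x) - label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


class PUAnomalyDetector:
    """Elkan-Noto style PU classifier (an assumed choice, not from the source)."""

    def fit(self, X: List[List[float]], s: List[int]) -> "PUAnomalyDetector":
        self.w, self.b = train_logistic(X, s)
        labeled = [score(self.w, self.b, x) for x, flag in zip(X, s) if flag == 1]
        # c estimates P(labeled | anomalous); here computed on training data.
        self.c = sum(labeled) / len(labeled)
        return self

    def anomaly_probability(self, x: Sequence[float]) -> float:
        return min(1.0, score(self.w, self.b, x) / self.c)

    def is_anomalous(self, x: Sequence[float]) -> bool:
        return self.anomaly_probability(x) > 0.5


# Toy data: logs of one anomalous template map to [1, 0], normal ones to [0, 1].
# Only three of the six anomalous logs are marked; everything else is unlabeled.
X = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 6
s = [1, 1, 1, 0, 0, 0] + [0] * 6
detector = PUAnomalyDetector().fit(X, s)
print(detector.is_anomalous([1.0, 0.0]), detector.is_anomalous([0.0, 1.0]))
```

Because the unmarked anomalous logs look the same as the marked ones, the labeled-versus-unlabeled score for that template settles near one half. Dividing by c rescales it to a probability of being anomalous, so the unmarked anomalous logs are still detected.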
Fig. 3 is a diagram of an overall design of an automatic identification method for an exception log based on a partial mark according to an embodiment of the present application.
As shown in fig. 3, the method for automatically identifying abnormal logs based on partial marks converts the abnormal log identification problem into a text classification problem and mainly includes two parts: offline learning and online detection. The offline learning part preprocesses the logs and filters out parameters, then constructs feature vectors, and finally trains an anomaly detector based on PU learning. The online detection part preprocesses the real-time log and extracts features, and then uses the trained binary classifier to judge whether the log is abnormal.
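The split between offline learning and online detection can be sketched in Python as follows. This is a structural illustration only: every stage is a deliberately crude stand-in (a digit filter for preprocessing, raw counts for feature extraction, and a word-list check in place of the PU learning detector), and all names and sample logs are hypothetical. The point is that the offline stage produces the state that the online stage applies to each real-time log.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[str]
Vector = List[float]


@dataclass
class LogPipeline:
    """Online detection: preprocess, extract features, then classify."""
    preprocess: Callable[[str], Tokens]
    vectorize: Callable[[Tokens], Vector]
    classify: Callable[[Vector], bool]

    def detect(self, raw_log: str) -> bool:
        return self.classify(self.vectorize(self.preprocess(raw_log)))


def toy_preprocess(line: str) -> Tokens:
    """Stand-in variable filter: drop every token that contains a digit."""
    return [t for t in line.lower().split() if not any(ch.isdigit() for ch in t)]


def learn_offline(history: List[str], marked_anomalous: List[int]) -> LogPipeline:
    """Offline learning: derive vocabulary and classifier state from history."""
    tokenized = [toy_preprocess(line) for line in history]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    marked = set(marked_anomalous)
    # Placeholder for the PU learning detector: words seen only in marked logs.
    anomalous_words = {w for i in marked for w in tokenized[i]}
    other_words = {w for i, tokens in enumerate(tokenized)
                   if i not in marked for w in tokens}
    suspicious = anomalous_words - other_words
    flags = [1.0 if w in suspicious else 0.0 for w in vocabulary]

    def vectorize(tokens: Tokens) -> Vector:
        return [float(tokens.count(w)) for w in vocabulary]

    def classify(vector: Vector) -> bool:
        return sum(f * v for f, v in zip(flags, vector)) > 0

    return LogPipeline(toy_preprocess, vectorize, classify)


# Hypothetical history in which only the first log has been marked as abnormal.
history = ["Fan 3 failure detected", "User 42 logged in",
           "User 7 logged out", "Link 9 up"]
pipeline = learn_offline(history, marked_anomalous=[0])
print(pipeline.detect("Fan 12 failure detected"))
print(pipeline.detect("User 99 logged in"))
```

Keeping the three stages as interchangeable callables means the same online path serves any preprocessing rule, weighting method, or trained detector that the offline part supplies.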
The second embodiment of the present application provides an apparatus for automatically identifying an abnormal log based on a partial mark, including:
the preprocessing module is used for preprocessing the real-time log to obtain a preprocessed log, wherein the preprocessing comprises filtering variable words in the real-time log;
the extraction module is used for extracting the characteristics of the preprocessed logs by using a word bag model and converting the preprocessed logs into characteristic vectors by using a word frequency-inverse position frequency weighting method;
and the detection module is used for carrying out anomaly detection on the feature vector by using a pre-trained PU learning anomaly detector to obtain an anomaly detection result.
Further, in this embodiment of the present application, the preprocessing module includes a classification module, an extraction module, and a filtering module, wherein:
the classification module is used for dividing words of the real-time log into text words and variable words, wherein the text words are words describing events occurring in the log, and the variable words are variables appearing in the real-time log;
the extraction module is used for automatically extracting template words in the real-time log by using an automatic log analysis method;
and the filtering module is used for filtering variable words by using the template words as text words.
Further, in this embodiment of the present application, the apparatus further includes a pre-training module, where the pre-training module includes a calling module, a marking module, and a training module, where:
the calling module is used for calling the preprocessing module to preprocess the historical log to generate training data, and calling the extraction module to perform feature extraction on the training data to generate feature vectors;
the marking module is used for selecting a preset number of logs from the historical logs to mark so as to generate abnormal marks;
and the training module is used for training the PU learning anomaly detector according to the anomaly marks and the feature vectors.
In the device for automatically identifying abnormal logs based on partial marks according to the embodiment of the application, the preprocessing module preprocesses the real-time log to obtain a preprocessed log, wherein the preprocessing comprises filtering out variable words in the real-time log; the extraction module performs feature extraction on the preprocessed log by using a bag-of-words model and converts the preprocessed log into a feature vector by using a word frequency-inverse position frequency weighting method; and the detection module performs anomaly detection on the feature vector by using a pre-trained PU learning anomaly detector to obtain an anomaly detection result. The device therefore addresses the poor flexibility, heavy workload, inability to identify unknown anomaly types, and low accuracy of existing methods, and solves the problem of online anomaly identification for massive system logs. It identifies abnormal logs online with a model trained on partially marked logs, so that the causes related to an anomaly can be found and errors can be corrected and losses stopped in time.
In order to implement the above embodiments, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for automatically identifying abnormal logs based on partial marks of the above embodiments.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (8)

1. A method for automatically identifying abnormal logs based on partial marking, characterized by comprising the following steps:
preprocessing a real-time log to obtain a preprocessed log, wherein the preprocessing comprises filtering out variable words in the real-time log;
performing feature extraction on the preprocessed log using a bag-of-words model, and converting the preprocessed log into a feature vector using a word frequency-inverse position frequency weighting method;
performing anomaly detection on the feature vector using a pre-trained PU learning anomaly detector to obtain an anomaly detection result.

2. The method of claim 1, wherein preprocessing the real-time log comprises the following steps:
dividing the words of the real-time log into text words and variable words, wherein the text words are words describing the event recorded by the log, and the variable words are variables appearing in the real-time log;
automatically extracting template words from the real-time log using an automated log parsing method;
using the template words as the text words and filtering out the variable words.

3. The method of claim 1, wherein the word frequency-inverse position frequency weighting method specifically comprises defining the inverse position frequency of a word as:
[Formula images FDA0003102190840000011 and FDA0003102190840000012 not reproduced]
where ILF is the inverse position frequency of word w; the position l_k ∈ L of word w is defined as the ordinal position at which the word appears; k denotes the k-th word; and L is the length of the longest of all logs.

4. The method of claim 1, wherein training the PU learning anomaly detector comprises the following steps:
obtaining historical logs and performing the preprocessing on the historical logs;
performing the feature extraction on the preprocessed logs to generate feature vectors;
selecting a preset number of logs from the historical logs for marking to generate abnormal marks;
training the PU learning anomaly detector according to the abnormal marks and the feature vectors.

5. A device for automatically identifying abnormal logs based on partial marking, characterized by comprising:
a preprocessing module, configured to preprocess a real-time log to obtain a preprocessed log, wherein the preprocessing comprises filtering out variable words in the real-time log;
an extraction module, configured to perform feature extraction on the preprocessed log using a bag-of-words model, and to convert the preprocessed log into a feature vector using a word frequency-inverse position frequency weighting method;
a detection module, configured to perform anomaly detection on the feature vector using a pre-trained PU learning anomaly detector to obtain an anomaly detection result.

6. The device of claim 5, wherein the preprocessing module comprises a classification module, an extraction module, and a filtering module, wherein:
the classification module is configured to divide the words of the real-time log into text words and variable words, wherein the text words are words describing the event recorded by the log, and the variable words are variables appearing in the real-time log;
the extraction module is configured to automatically extract template words from the real-time log using an automated log parsing method;
the filtering module is configured to use the template words as the text words and filter out the variable words.

7. The device of claim 5, further comprising a pre-training module, the pre-training module comprising a calling module, a marking module, and a training module, wherein:
the calling module is configured to call the preprocessing module to preprocess historical logs and generate training data, to call the extraction module to perform feature extraction on the training data and generate feature vectors, and to perform the feature extraction on the preprocessed logs to generate feature vectors;
the marking module is configured to select a preset number of logs from the historical logs for marking to generate abnormal marks;
the training module is configured to train the PU learning anomaly detector according to the abnormal marks and the feature vectors.

8. A non-transitory computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 4.
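
The preprocessing and feature-extraction steps recited in claims 1 to 3 can be illustrated with a minimal sketch. It is not the applicant's implementation. Because the claimed formula appears only as an image in this text, the sketch assumes ILF(w) = log(L / L_w), where L is the length of the longest log and L_w is the number of distinct ordinal positions at which word w appears, and it weights each word by its term frequency multiplied by ILF. The function names, the digit-based variable filter (a crude stand-in for an automated log parser), and the sample logs are all hypothetical.

```python
import math
from collections import Counter


def filter_variables(log_line):
    """Drop tokens that contain digits.

    Assumption: digit-bearing tokens are variable words. A real system
    would use an automated log parser to extract template words instead.
    """
    tokens = log_line.lower().split()
    return [tok for tok in tokens if not any(ch.isdigit() for ch in tok)]


def inverse_position_frequency(token_lists):
    """Assumed form: ILF(w) = log(L / L_w).

    L is the length of the longest log; L_w is the number of distinct
    ordinal positions at which w appears across all logs.
    """
    longest = max(len(tokens) for tokens in token_lists)
    positions = {}
    for tokens in token_lists:
        for k, word in enumerate(tokens):
            positions.setdefault(word, set()).add(k)
    return {w: math.log(longest / len(p)) for w, p in positions.items()}


def tf_ilf_vector(tokens, ilf, vocabulary):
    """Bag-of-words vector weighted by term frequency times ILF."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty logs
    return [counts[w] / total * ilf.get(w, 0.0) for w in vocabulary]


# Hypothetical sample logs, for illustration only.
logs = [
    "Interface eth0 changed state to down",
    "Link down on interface eth1",
    "Interface eth2 changed state to up",
]
token_lists = [filter_variables(line) for line in logs]
ilf = inverse_position_frequency(token_lists)
vocabulary = sorted(ilf)
vectors = [tf_ilf_vector(tokens, ilf, vocabulary) for tokens in token_lists]
```

Under this assumed formula, a word that always occupies the same position (such as "changed" in the sample) receives a higher weight than one that appears at several positions (such as "interface"). The resulting vectors would then be passed to the PU learning anomaly detector, which is not sketched here.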
CN202110626278.XA 2021-06-04 2021-06-04 Method and device for automatically identifying abnormal log based on partial mark Pending CN113239684A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110626278.XA CN113239684A (en) 2021-06-04 2021-06-04 Method and device for automatically identifying abnormal log based on partial mark

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110626278.XA CN113239684A (en) 2021-06-04 2021-06-04 Method and device for automatically identifying abnormal log based on partial mark

Publications (1)

Publication Number Publication Date
CN113239684A (en) 2021-08-10

Family

ID=77136839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110626278.XA Pending CN113239684A (en) 2021-06-04 2021-06-04 Method and device for automatically identifying abnormal log based on partial mark

Country Status (1)

Country Link
CN (1) CN113239684A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160230A1 (en) * 2018-11-19 2020-05-21 International Business Machines Corporation Tool-specific alerting rules based on abnormal and normal patterns obtained from history logs
CN111611218A (en) * 2020-04-24 2020-09-01 武汉大学 An automatic identification method of distributed abnormal log based on deep learning
CN112463957A (en) * 2020-12-14 2021-03-09 清华大学 Abstract extraction method and device for unstructured text log stream

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEIBIN MENG et al.: "LogClass: Anomalous Log Identification and Classification With Partial Labels", HTTPS://IEEEXPLORE.IEEE.ORG/DOCUMENT/9339940 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210810)