CN118093325A - Log template acquisition method, electronic equipment and storage medium - Google Patents

Log template acquisition method, electronic equipment and storage medium

Info

Publication number: CN118093325A
Authority: CN (China)
Prior art keywords: log, current, regular expression, logs, max
Legal status: Granted, active (the legal status is an assumption and is not a legal conclusion)
Application number: CN202410518808.2A
Other languages: Chinese (zh)
Other versions: CN118093325B (en)
Inventors: 顾兆军, 张智凯, 刘春波, 岳文龙
Current Assignee: Civil Aviation University of China (the listed assignees may be inaccurate)
Original Assignee: Civil Aviation University of China
Application filed by: Civil Aviation University of China
Priority to: CN202410518808.2A
Granted publication: CN118093325B

Abstract

The invention relates to the field of computer technology applications, and in particular to a log template acquisition method, an electronic device and a storage medium. The method comprises the following steps: obtaining a plurality of initial key logs from an original log data set; preliminarily identifying the word types in the data set; generating, based on the initial key logs, a first regular expression representing non-digital variables and a second regular expression representing constants that contain numbers; correcting the preliminarily identified word types using the generated expressions; obtaining a corresponding log template based on the corrected word types; and judging whether the current log template meets a preset condition. If the condition is met, the current log template is taken as the target log template; otherwise, the current expressions are adjusted, the word types in the current data set are adjusted using the new expressions to obtain a new log template, and the judging step is repeated until the preset condition is met. The invention can improve the efficiency and accuracy of log template generation.

Description

Log template acquisition method, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technology application, and in particular, to a log template obtaining method, an electronic device, and a storage medium.
Background
In the development and maintenance of modern software, logs provide critical information about system and network activity. They help developers and operation and maintenance engineers understand system behavior, trace the source of system problems, detect and respond to security events, and carry out troubleshooting and vulnerability analysis. In practice, operation and maintenance engineers typically use rule-based log parsing methods, such as the Grok filter used by Logstash, in which regular expressions matching entire log templates are written by hand. Grok matches a log line against a regular expression, maps specific parts of the line to dedicated fields, and performs operations based on this mapping. This type of approach has two problems. First, each Grok filter rule corresponds to one class of log event, so the Grok rule base is difficult to maintain and extend for modern software systems, which contain a large number of heterogeneous log event types and are continually updated. Second, each newly added Grok rule incurs an additional regular-expression match against the entire log line. The logs of a modern software system may contain hundreds of thousands of log templates; the cost of manually writing regular expressions that match all of them is unacceptable, and such rules cannot adapt to newly appearing log templates.
Another type of log parsing method is based on predefined heuristic rules: researchers identify features inherent in log data, and algorithms use these features to acquire templates. For example, SLCT (Simple Logfile Clustering Tool), which is based on frequent-word statistics, treats words that occur frequently in log files as constants. This approach may be effective for preliminary parsing, but it has difficulty identifying rare log templates that occur at low frequency.
On this basis, LFA (Log File Analyzer) takes the position of words into account in its statistics, and Logram uses n-grams as the statistical index so that the context of each word is taken into consideration. IPLoM (Iterative Partitioning Log Mining) proposes an iterative partitioning approach that repeatedly partitions the logs into small clusters based on log length and word position. Drain, a log parsing algorithm widely used in recent years, is essentially a tree representation of an iterative partitioning algorithm, built on a prefix parse tree of the logs. There are two problems with this type of strategy:
a) The similarity-threshold hyperparameter has a great influence on algorithm performance and is difficult to tune to an optimal value;
b) Logs are merged whenever the similarity within a set of logs is high. This strategy stems from the algorithm designers' observation of the properties of most logs and is not appropriate for all logs.
A further type of log parsing method uses deep learning to learn the constant/variable features of logs from a large number of annotated log data sets. For example, UniParser uses an LSTM network with a contrastive learning strategy to encode log context, and LogPPT uses a RoBERTa network to acquire log sequence features. However, neural network methods impose additional GPU hardware requirements on the system, and their running efficiency can hardly keep up with the huge log generation rate.
In summary, the existing log template acquisition method either depends on a large number of rules or needs to adopt a complex deep learning algorithm, so that the acquisition cost is too high, and the real-time processing requirement of a large-scale log data set is difficult to adapt.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
According to a first aspect of the present invention, there is provided a log template acquisition method, the method comprising the steps of:
S100, acquiring an original log data set D = {D_1, D_2, ……, D_i, ……, D_n}, where D_i is the i-th log in D, i ranges from 1 to n, and n is the number of logs in D.
S200, sampling the D to obtain k initial key logs.
S300, acquiring a word segmentation symbol set, an initial first regular expression and an initial second regular expression based on the k initial key logs; the first regular expression is a regular expression corresponding to a non-digital variable, and the second regular expression is a regular expression corresponding to a constant containing a number.
S400, performing word segmentation processing on D by using the word segmentation symbol set to obtain a corresponding word segmentation result set W = {W_1, W_2, ……, W_i, ……, W_n}, where W_i is the word segmentation result corresponding to D_i, W_i = {W_i1, W_i2, ……, W_ij, ……, W_if(i)}, W_ij is the j-th word in W_i, j ranges from 1 to f(i), and f(i) is the number of words in W_i.
S500, for W_ij in W_i, if W_ij contains a number, the type identifier corresponding to W_ij is marked as a variable identifier, and if W_ij does not contain a number, the type identifier corresponding to W_ij is marked as a constant identifier.
S600, based on the current first regular expression and the current second regular expression, adjusting the type identifier corresponding to W_ij in W_i, and taking D with the adjusted type identifiers as the current log data set.
S700, acquiring a corresponding current log template set based on the current log data set, taking the current log template set as a target log template corresponding to the D if the current log template set is the same as the last log template set, and exiting the current control program, otherwise, executing S800.
S800, adjusting the current first regular expression, taking the adjusted first regular expression as the current first regular expression, adjusting the current second regular expression, taking the adjusted second regular expression as the current second regular expression, and executing S600.
According to a second aspect of the present invention, there is provided an electronic device comprising a processor and a memory; the processor is configured to execute the steps of the method according to the first aspect of the present invention by calling a program or instructions stored in the memory.
According to a third aspect of the present invention there is provided a non-transitory computer readable storage medium storing a program or instructions which cause a computer to perform the steps of the method of the first aspect of the present invention.
The invention has at least the following beneficial effects:
According to the technical scheme provided by the embodiment of the invention, k initial key logs are firstly obtained from an original log data set, then, the word types in the data set are primarily marked, a first regular expression representing a non-digital variable and a second regular expression containing a digital constant are generated based on the k initial key logs, the generated expression is used for correcting the primarily marked word types, the corrected word types are obtained, a corresponding log template is obtained based on the corrected word types, then, whether the current log template meets preset conditions is judged, if yes, the current log template is used as a target log template, otherwise, the current first regular expression and the current second regular expression are adjusted, the word types in the current data set are adjusted by using the new regular expression, a new log template is obtained, and the previous judging step is repeated until the preset conditions are met. The method and the device can improve the generation efficiency and accuracy of the log template.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a log template obtaining method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It should be noted that some exemplary embodiments are described as a process or a method depicted as a flowchart. Although a flowchart depicts steps as a sequential process, many of the steps may be implemented in parallel, concurrently, or with other steps. Furthermore, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
An embodiment of the present invention provides a log template obtaining method, as shown in fig. 1, where the method may include the following steps:
S100, acquiring an original log data set D = {D_1, D_2, ……, D_i, ……, D_n}, where D_i is the i-th log in D, i ranges from 1 to n, and n is the number of logs in D.
In an embodiment of the present invention, the original log data set may be a log data set input by a user. The log data set may include log data generated by distributed systems (HDFS, Hadoop, Zookeeper, OpenStack, Spark), supercomputers (BGL, HPC), client applications (Proxifier, Thunderbird), server applications (Apache, OpenSSH), mobile applications (HealthApp), operating systems (Windows, Linux, Mac, Android), and the like.
S200, sampling the D to obtain k initial key logs.
In the embodiment of the invention, k can be set based on actual needs, for example, 32, 64 or 128.
Further, in the embodiment of the present invention, S200 may specifically include:
S201, for D_i, performing word segmentation on D_i by using a universal delimiter to obtain a corresponding word segmentation result W0_i, and acquiring a special symbol set M_i corresponding to D_i; where W0_i = {W0_i1, W0_i2, ……, W0_ir, ……, W0_ig(i)}, W0_ir is the r-th word in W0_i, r ranges from 1 to g(i), and g(i) is the number of words in W0_i; M_i = {M_i1, M_i2, ……, M_is, ……, M_ih(i)}, M_is is the s-th special symbol in M_i, s ranges from 1 to h(i), and h(i) is the number of special symbols in M_i.
In an embodiment of the present invention, the universal delimiter may be an existing universal delimiter, such as a space. Special symbols may follow an existing definition, such as symbols other than numerals and letters.
S202, based on the log length and the special symbol set, performing preliminary classification on all logs in the D to obtain m initial log clusters.
In the embodiment of the invention, the length of a log is the number of words obtained by segmenting the log. All logs in a log cluster have the same log length and the same special symbol set.
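As a non-limiting illustration, S201 and S202 may be sketched in Python as follows. The function names, the use of whitespace as the universal delimiter, and the treatment of every non-alphanumeric, non-whitespace character as a special symbol are assumptions made for this sketch.

```python
from collections import defaultdict

def special_symbols(log: str) -> frozenset:
    """Special symbol set of a log: characters other than letters, digits, and whitespace (assumed definition)."""
    return frozenset(ch for ch in log if not ch.isalnum() and not ch.isspace())

def initial_clusters(logs):
    """Group logs that share the same log length (word count) and the same special symbol set."""
    clusters = defaultdict(list)
    for log in logs:
        key = (len(log.split()), special_symbols(log))
        clusters[key].append(log)
    return list(clusters.values())

logs = [
    "Invalid user test from 52.80.34.196",
    "Invalid user admin from 10.0.0.7",
    "Connection closed by 10.0.0.7 [preauth]",
]
clusters = initial_clusters(logs)
```

In this sketch the first two logs fall into one cluster, while the third forms its own cluster because its special symbol set also contains the square brackets.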
S203, a counter t=1 is set.
S204, if t is less than or equal to k, executing S205, otherwise, executing S208.
S205, randomly selecting a log cluster from the initial log clusters that have not yet been selected as the t-th sample log cluster C_t, and randomly extracting d logs from C_t as the t-th log sample candidate set PC_t = {PC_t1, PC_t2, ……, PC_ta, ……, PC_td}, where PC_ta is the a-th log sample in PC_t and a ranges from 1 to d.
In the embodiment of the invention, d can be set based on actual needs, for example, 32, 64 or 128. Those skilled in the art will appreciate that if the number of logs in C_t is less than d, all logs in C_t are taken as log samples.
S206, taking the log sample corresponding to min(S_t1, S_t2, ……, S_ta, ……, S_td) as the t-th key log and adding it to the current key log set; where S_ta is the maximum similarity between PC_ta and the logs in the current key log set, S_ta = max(S^1_ta, S^2_ta, ……, S^u_ta, ……, S^p_ta), S^u_ta is the similarity between PC_ta and the u-th log in the current key log set, u ranges from 1 to p, and p is the number of logs in the current key log set; the initial value of the key log set is an empty set; min() takes the minimum value and max() takes the maximum value.
In an embodiment of the present invention, S^u_ta is determined based on the longest common subsequence between PC_ta and the u-th log in the current key log set. Specifically, S^u_ta satisfies the following condition: S^u_ta = LP^max_ta-u / L^max_ta-u, where LP^max_ta-u is the length of the longest common subsequence between PC_ta and the u-th log in the current key log set, L^max_ta-u = max(L_ta, L_u), L_ta is the length of PC_ta, and L_u is the length of the u-th log in the current key log set; that is, L^max_ta-u is the larger of the length of PC_ta and the length of the u-th log in the current key log set.
S207, t=t+1 is set, and S204 is executed.
S208, taking k logs in the current key log set as the k initial key logs.
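As a non-limiting illustration, S203 to S208 may be sketched in Python as follows. The function names, the fixed random seed, the value 0 used for S_ta while the key log set is still empty, and the handling of the case where fewer than k clusters exist are assumptions made for this sketch.

```python
import random

def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(log_a, log_b):
    """LCS length divided by the larger of the two log lengths (counted in words)."""
    a, b = log_a.split(), log_b.split()
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0

def sample_key_logs(clusters, k, d, seed=0):
    """Pick one key log from each of up to k randomly chosen clusters."""
    rng = random.Random(seed)
    remaining = list(clusters)
    rng.shuffle(remaining)
    key_logs = []
    for cluster in remaining[:k]:
        # Take d random candidates, or the whole cluster if it is smaller than d.
        candidates = rng.sample(cluster, min(d, len(cluster)))

        def max_similarity(candidate):
            # S_ta: highest similarity to any key log chosen so far (0 if none yet; assumption).
            return max((similarity(candidate, kl) for kl in key_logs), default=0.0)

        # Keep the candidate that is least similar to the existing key logs.
        key_logs.append(min(candidates, key=max_similarity))
    return key_logs
```

Choosing the candidate with the smallest maximum similarity favors key logs that differ from those already selected, so a small sample may cover more distinct log formats.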
S300, acquiring a word segmentation symbol set, an initial first regular expression and an initial second regular expression based on the k initial key logs; the first regular expression is a regular expression corresponding to a non-digital variable, and the second regular expression is a regular expression corresponding to a constant containing a number.
In the embodiment of the invention, the word segmentation symbol set may be a union of word segmentation symbols corresponding to k initial key logs.
In an embodiment of the present invention, the first regular expression and the second regular expression may be obtained manually or by a trained neural network model. Specifically, the words in each key log that are non-digital variables and the words that are constants containing numbers may be identified manually or by a trained neural network model. For example, in the log "Invalid user test from 52.80.34.196", "test" is a non-digital variable; in the log "…… available network connection on network <…> VIA INTERFACE ALT0", "ALT0" is a constant containing a number. It will be appreciated by those skilled in the art that any method of identifying, manually or by a trained neural network model, the words in each key log that are non-digital variables or constants containing numbers falls within the scope of the present invention.
After all words belonging to non-digital variables and constants containing numbers in the key logs have been identified, words with similar context are represented by the same regular expression. For example, the log "Invalid user test from 52.80.34.196" corresponds to the regular expression "(…", and "…> VIA INTERFACE ALT0" corresponds to the regular expression "(.…". Finally, a preset number of regular expressions is obtained based on actual requirements. Those skilled in the art will recognize that any method for obtaining the regular expression corresponding to a word falls within the scope of the present invention.
S400, performing word segmentation processing on D by using the word segmentation symbol set to obtain a corresponding word segmentation result set W = {W_1, W_2, ……, W_i, ……, W_n}, where W_i is the word segmentation result corresponding to D_i, W_i = {W_i1, W_i2, ……, W_ij, ……, W_if(i)}, W_ij is the j-th word in W_i, j ranges from 1 to f(i), and f(i) is the number of words in W_i.
Although word segmentation is a key element of log parsing, it has received little attention. The inventors of the present invention found through research that word segmentation has an important effect on log parsing results. In some cases, log templates that cannot be correctly identified under one word segmentation can be accurately identified under another. Therefore, the optimization of word segmentation needs to be explored more deeply to improve the accuracy and efficiency of log parsing. Take the logs "StackScrollAlgorithm: overlapAmount:220.0" and "StackScrollAlgorithm:state. Cliptopamount204" as examples. If a colon is not used as a word segmentation symbol, each log will be separated into three words. However, since the latter two words contain numbers, the algorithm cannot take them into account when classifying, and the two types of logs will both be mapped to the same set ("StackScrollAlgorithm") and cannot be distinguished. In contrast, if a colon is used as a word segmentation symbol, the two types of logs will be mapped to ("StackScrollAlgorithm", "overlapAmount") and ("StackScrollAlgorithm", "state.…") and can therefore be distinguished. Even neural network models trained on many different log data sets can hardly be guaranteed to work on a new data set without fine-tuning. Therefore, in the invention, the union of the word segmentation symbols corresponding to the k initial key logs is used as the word segmentation symbol set, so that the original logs can be converted into as regular a format as possible and the word segmentation accuracy can be improved.
S500, for W_ij in W_i, if W_ij contains a number, marking the type identifier corresponding to W_ij as a variable identifier, i.e., marking W_ij as a variable; if W_ij does not contain a number, marking the type identifier corresponding to W_ij as a constant identifier, i.e., marking W_ij as a constant.
In embodiments of the present invention, existing constant/variable discriminators may be used to obtain the type identification for each word.
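As a non-limiting illustration, S400 and S500 may be sketched in Python as follows. The function names and the labels "VAR" and "CONST" are assumptions made for this sketch; the delimiter set is assumed to be non-empty.

```python
import re

def tokenize(log, delimiters):
    """Split a log on any character of the word segmentation symbol set, dropping empty words."""
    pattern = "[" + re.escape("".join(delimiters)) + "]+"
    return [tok for tok in re.split(pattern, log) if tok]

def initial_types(tokens):
    """S500: a word containing a digit is marked as a variable, otherwise as a constant."""
    return ["VAR" if any(ch.isdigit() for ch in tok) else "CONST" for tok in tokens]

log = "StackScrollAlgorithm: overlapAmount:220.0"
without_colon = tokenize(log, " ")   # only the space is a delimiter
with_colon = tokenize(log, " :")     # the colon is added to the symbol set
```

With the colon in the symbol set, "overlapAmount" becomes a separate word without digits and is therefore marked as a constant, whereas without the colon it stays attached to the number and is marked as a variable.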
S600, based on the current first regular expression and the current second regular expression, adjusting the type identifier corresponding to W_ij in W_i, and taking D with the adjusted type identifiers as the current log data set.
In the embodiment of the invention, the initial value of the current first regular expression is an initial first regular expression, and the initial value of the current second regular expression is an initial second regular expression.
Further, in S600, each log in the original log data set may be matched against the obtained first regular expression and second regular expression. If a word is matched, the type identifier of that word is adjusted to the type identifier associated with the matching regular expression, that is, a word matched by the first regular expression is marked as a variable and a word matched by the second regular expression is marked as a constant. For example, matching is performed with a regular expression "(.…".
Those skilled in the art will appreciate that the process of matching using regular expressions may be implemented based on existing regular matching algorithms.
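As a non-limiting illustration, the correction in S600 may be sketched in Python as follows. The two rules, the second sample log, the function names, and the convention that capture group 1 marks the word to be retyped are hypothetical and were written only for this sketch.

```python
import re

def token_spans(log):
    """Whitespace-separated words together with their character spans."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", log)]

def adjust_types(log, variable_rules, constant_rules):
    """Apply the digit heuristic of S500, then override it with the regular expressions (S600)."""
    spans = token_spans(log)
    types = ["VAR" if any(c.isdigit() for c in tok) else "CONST" for tok, _, _ in spans]
    for rules, label in ((variable_rules, "VAR"), (constant_rules, "CONST")):
        for rule in rules:
            for m in re.finditer(rule, log):
                # Retype every word that lies inside capture group 1 of the match.
                for i, (_, start, end) in enumerate(spans):
                    if m.start(1) <= start and end <= m.end(1):
                        types[i] = label
    return [tok for tok, _, _ in spans], types

VARIABLE_RULES = [r"Invalid user (\S+) from"]     # hypothetical first regular expression
CONSTANT_RULES = [r"(?i)via interface (alt0)\b"]  # hypothetical second regular expression

tokens, types = adjust_types("Invalid user test from 52.80.34.196", VARIABLE_RULES, CONSTANT_RULES)
```

Here "test" contains no digit and would initially be marked as a constant; the hypothetical first rule retypes it as a variable, while the IP address remains a variable.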
The ideal goal of log template extraction is to extract the constants in the logs comprehensively and correctly. In practical applications, achieving high-accuracy log parsing with a constant/variable discriminator alone is very difficult. Consider the following: assuming that a constant/variable discriminator has a per-word accuracy of 99%, that each log contains 10 words on average, and that errors on different words are independent, the probability that the discriminator correctly classifies every word in a log drops to 0.99 to the 10th power, i.e., about 0.904; if the accuracy of the discriminator is slightly lower, at 95%, its accuracy on a whole log of length 10 drops rapidly to about 0.599. Minor deviations in the discriminator are thus amplified in log template discrimination. In addition, in practical applications the form of logs is complex and changeable, and the continuous growth of log volume in large-scale systems limits the application of deep learning methods. These factors together make it difficult for a constant/variable discriminator to achieve both high accuracy and high efficiency. In view of this, in the embodiment of the present invention, the first regular expression and the second regular expression are obtained based on the key logs and used to correct the discrimination result of the constant/variable discriminator, so that the generated log template is as accurate as possible.
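The arithmetic in the preceding paragraph can be checked with a few lines of Python. The function name is an assumption made for this sketch, and errors on different words are assumed to be independent.

```python
def whole_log_accuracy(per_word_accuracy, words_per_log):
    """Probability that every word of a log is classified correctly, assuming independent errors."""
    return per_word_accuracy ** words_per_log

print(round(whole_log_accuracy(0.99, 10), 3))  # 0.904
print(round(whole_log_accuracy(0.95, 10), 3))  # 0.599
```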
S700, acquiring a corresponding current log template set based on the current log data set, taking the current log template set as a target log template corresponding to the D if the current log template set is the same as the last log template set, namely, the log template is not changed, and exiting the current control program, otherwise, executing S800.
In the embodiment of the invention, the generation of the log template can adopt the prior art. For example, for each log in the current log dataset, a constant of the log may be used as a static portion of the log template, a variable in the log may be used as a variable portion of the log template, and then the variable portion may be replaced with a wild card to obtain a corresponding log template. And finally, merging the logs with the same corresponding log template to obtain a current log template set.
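As a non-limiting illustration, the template generation described above may be sketched in Python as follows. The wildcard string "<*>", the function names, and the "VAR"/"CONST" labels are assumptions made for this sketch.

```python
from collections import defaultdict

def to_template(tokens, types, wildcard="<*>"):
    """Keep constants as the static part and replace each variable with a wildcard."""
    return " ".join(wildcard if t == "VAR" else tok for tok, t in zip(tokens, types))

def template_set(typed_logs):
    """Merge logs that yield the same template; returns a mapping template -> logs."""
    groups = defaultdict(list)
    for tokens, types in typed_logs:
        groups[to_template(tokens, types)].append(" ".join(tokens))
    return dict(groups)

typed_logs = [
    (["Invalid", "user", "test", "from", "52.80.34.196"], ["CONST", "CONST", "VAR", "CONST", "VAR"]),
    (["Invalid", "user", "admin", "from", "10.0.0.7"], ["CONST", "CONST", "VAR", "CONST", "VAR"]),
]
templates = template_set(typed_logs)
```

Both sample logs collapse into the single template "Invalid user <*> from <*>".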
S800, adjusting the current first regular expression, taking the adjusted first regular expression as the current first regular expression, adjusting the current second regular expression, taking the adjusted second regular expression as the current second regular expression, and executing S600.
The technical effect of S800 is to compensate, by finding potentially misclassified samples, for the accuracy that the log sampling module may sacrifice in order to reduce manual effort. The loop iteration continuously optimizes the rule set so that it adapts better to changes in the system and to the evolution of log data, thereby improving the efficiency and accuracy of operation and maintenance personnel in log analysis and troubleshooting.
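As a non-limiting illustration, the loop formed by S600, S700 and S800 may be sketched in Python as follows. The two callables, the representation of the rule set, and the upper bound max_rounds are assumptions made for this sketch; the method itself only requires that iteration stops when the template set no longer changes.

```python
def iterate_templates(build_templates, refine_rules, rules, max_rounds=20):
    """Repeat S600/S700 and S800 until two successive template sets are identical."""
    previous = None
    for _ in range(max_rounds):
        current = build_templates(rules)      # S600 and S700: retype words, derive templates
        if current == previous:               # template set unchanged: target templates found
            return current, rules
        previous = current
        rules = refine_rules(current, rules)  # S800: adjust both regular expressions
    return previous, rules

# Toy stand-ins: the "rule set" is an integer and the templates stop changing once it reaches 2.
result, final_rules = iterate_templates(
    build_templates=lambda r: {"t%d" % min(r, 2)},
    refine_rules=lambda templates, r: r + 1,
    rules=0,
)
```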
Further, in embodiments of the present invention, potential variables in misclassified samples may be identified based on the longest common subsequence, and potential constants in misclassified samples may be identified based on word frequency statistics.
Further, in an exemplary embodiment of the present invention, in S800, adjusting the current first regular expression, and taking the adjusted first regular expression as the current first regular expression may specifically include:
S810, obtaining the similarity SL_bv between any two log templates LT_b and LT_v in the current log template set; b and v range from 1 to Q, Q is the number of log templates in the current log template set, and b ≠ v.
In an embodiment of the invention, SL_bv is determined based on the ratio of the length of the longest common subsequence between LT_b and LT_v to the length of the longer of the two. Specifically, SL_bv satisfies the following condition: SL_bv = LP^max_bv / L^max_bv, where LP^max_bv is the length of the longest common subsequence between LT_b and LT_v, L^max_bv = max(L_b, L_v), L_b is the length of LT_b, L_v is the length of LT_v, and max() takes the maximum value; that is, L^max_bv is the larger of the length of LT_b and the length of LT_v.
S811, if SL_bv > ST0, executing S812; ST0 is a preset similarity threshold.
If a variable belonging to a particular log template is mistakenly classified as a constant, several highly similar log templates may be generated; therefore, if SL_bv > ST0, there may be a misclassified sample.
In an embodiment of the present invention, ST0 may be an empirical value and may be determined based on the corresponding log type. In one exemplary embodiment, 0.5 ≤ ST0 ≤ 0.7. The inventors of the present invention found through research that if ST0 < 0.5 or ST0 > 0.7, potentially misclassified samples cannot be identified effectively.
S812, if LT_b does not belong to LTC, adding LT_b to LTC; if LT_v does not belong to LTC, adding LT_v to LTC; LTC is the current candidate log template set, whose initial value is an empty set.
S813, extracting q logs from each log cluster corresponding to the current candidate log template set as new key logs.
In an embodiment of the present invention, q may be set based on actual needs, and in a non-limiting exemplary embodiment, q may be equal to 2 or 3.
S814, generating a new first regular expression based on the new key logs.
In the embodiment of the present invention, based on the new key logs, the method for generating the new first regular expression may be the same as the method for generating the first regular expression based on the k key logs.
S815, generating a corrected first regular expression based on the current first regular expression and the new first regular expression.
Specifically, the current first regular expression and the new first regular expression are merged to obtain the corrected first regular expression. The set of regular expressions thus becomes richer, that is, new recognition rules for non-digital variables can be generated, and potentially misclassified samples can be found in a new round of iteration.
S816, taking the corrected first regular expression as the current first regular expression, and executing S600.
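As a non-limiting illustration, S810 to S812 may be sketched in Python as follows. The function names, the sample templates, the threshold value 0.6 (chosen from the exemplary range 0.5 to 0.7), and the assumption that every template contains at least one word are made only for this sketch.

```python
from itertools import combinations

def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def candidate_templates(templates, st0=0.6):
    """Collect templates that are highly similar to at least one other template."""
    candidates = []
    for lt_b, lt_v in combinations(templates, 2):
        a, b = lt_b.split(), lt_v.split()
        sl_bv = lcs_length(a, b) / max(len(a), len(b))
        if sl_bv > st0:                      # S811
            for lt in (lt_b, lt_v):          # S812: add each template at most once
                if lt not in candidates:
                    candidates.append(lt)
    return candidates

templates = [
    "Invalid user test from <*>",
    "Invalid user admin from <*>",
    "Connection closed by <*>",
]
ltc = candidate_templates(templates)
```

The first two templates differ only in a word that was presumably left as a constant by mistake, so their similarity is 4/5 and both enter the candidate set; the third template is not similar enough to either.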
Further, in S800, adjusting the current second regular expression, and taking the adjusted second regular expression as the current second regular expression may specifically include:
S801, acquiring a word frequency set WF and a constant set CV corresponding to the current log data set; where WF = {WF_1, WF_2, ……, WF_e, ……, WF_x}, WF_e is the word frequency of the e-th word W_e in the word set WP corresponding to the current log data set, e ranges from 1 to x, and x is the number of words in WP; WP = W^c_1 ∪ W^c_2 ∪ …… ∪ W^c_i ∪ …… ∪ W^c_n, where W^c_i is the word segmentation result corresponding to the i-th log in the current log data set; CV = CV_1 ∪ CV_2 ∪ …… ∪ CV_i ∪ …… ∪ CV_n, where CV_i is the constant set corresponding to the i-th log in the current log data set.
S802, if WF_e > WF0 and W_e does not belong to CV, extracting q logs from the logs corresponding to the log template corresponding to W_e as new key logs; WF0 is a preset word frequency threshold.
If WF_e > WF0 and W_e does not belong to CV, that is, the word occurs very frequently but is not in the constant set, the word is likely to be a potential constant and may be a misclassified sample.
In an embodiment of the present invention, q may be set based on actual needs, and in a non-limiting exemplary embodiment, q may be equal to 2 or 3.
In an embodiment of the present invention, WF0 may be an empirical value. In one exemplary embodiment, WF0 satisfies the following condition: WF0 = f × WF_max, where f is a preset coefficient, 0 < f < 1, and WF_max is the largest value in WF. Preferably, f = 0.2 or 0.3.
Those skilled in the art will appreciate that if W_e corresponds to multiple log templates, q logs are extracted from the logs corresponding to each of those log templates.
S803, generating a new second regular expression based on the new key log.
In the embodiment of the present invention, based on the new key logs, a manner of generating the new second regular expression may be the same as the foregoing method of generating the second regular expression based on the k key logs.
S804, generating a corrected second regular expression based on the current second regular expression and the new second regular expression.
Specifically, the current second regular expression and the new second regular expression are merged to obtain the corrected second regular expression. The set of regular expressions thus becomes richer, that is, new recognition rules for constants containing numbers can be generated, and potentially misclassified samples can be found in a new round of iteration.
S805, taking the corrected second regular expression as the current second regular expression, and executing S600.
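The loop formed by S600, S700 and S800 can be sketched as a fixed-point iteration. The sketch below is a simplification under stated assumptions: words are already segmented, a variable is rendered as `<*>`, the second regular expression takes precedence over the first, the adjustment of S800 is passed in as a hypothetical callable `refine`, and `max_rounds` is a safety limit that is not part of the described method.

```python
import re

def build_templates(tokenized_logs, first_re, second_re):
    """Type every word (S500, S600) and return the resulting template set."""
    templates = set()
    for words in tokenized_logs:
        out = []
        for w in words:
            # S500: a word containing a digit is initially a variable.
            is_variable = any(ch.isdigit() for ch in w)
            # S600: the first expression marks non-digital variables.
            if first_re is not None and first_re.fullmatch(w):
                is_variable = True
            # S600: the second expression marks digit-containing constants.
            if second_re is not None and second_re.fullmatch(w):
                is_variable = False
            out.append("<*>" if is_variable else w)
        templates.add(" ".join(out))
    return templates

def acquire_templates(tokenized_logs, first_re, second_re, refine, max_rounds=10):
    """Repeat S600 to S800 until the template set stops changing (S700)."""
    previous = None
    for _ in range(max_rounds):
        current = build_templates(tokenized_logs, first_re, second_re)
        if current == previous:
            return current
        previous = current
        first_re, second_re = refine(current, first_re, second_re)
    return previous

def refine(templates, first_re, second_re):
    # Hypothetical stand-in for S800: always proposes "ipv4" as a constant.
    return first_re, re.compile(r"ipv4")

logs = [
    ["Using", "ipv4", "address", "10.0.0.1"],
    ["Using", "ipv4", "address", "10.0.0.2"],
]
result = acquire_templates(logs, None, None, refine)
print(result)
```

With this input the first round types `ipv4` as a variable, the second round corrects it to a constant, and the third round detects that the template set is unchanged and stops.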
In the embodiment of the invention, experiments on the log template acquisition method provided by the embodiment of the invention were carried out on 16 log data sets provided by the LogPAI team. These 16 data sets cover distributed systems (HDFS, Hadoop, Zookeeper, OpenStack, Spark), supercomputers (BGL, HPC), client applications (Proxifier, Thunderbird), server applications (Apache, OpenSSH), a mobile application (HealthApp), and operating systems (Windows, Linux, Mac, Android). For each data set, the LogPAI team sampled 2000 representative logs and labeled them manually.
Experiments show that the log template acquisition method provided by the embodiment of the invention achieves a parsing accuracy similar to that of log parsing methods based on modern neural networks. This result strongly supports the effectiveness of the strategy of improving log parsing performance by manually specifying rules for a small number of word patterns.
The embodiment of the invention also provides electronic equipment, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform the methods of embodiments of the present invention.
The embodiment of the invention also provides a non-transitory computer readable storage medium, which stores computer executable instructions for executing the method according to the embodiment of the invention.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution disclosed in the present invention can be achieved, which is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A log template acquisition method, the method comprising the steps of:
S100, acquiring an original log data set D = {D_1, D_2, …, D_i, …, D_n} to be processed, wherein D_i is the i-th log in D, i ranges from 1 to n, and n is the number of logs in D;
S200, sampling the D to obtain k initial key logs;
S300, acquiring a word segmentation symbol set, an initial first regular expression and an initial second regular expression based on the k initial key logs; the first regular expression is a regular expression corresponding to a non-digital variable, and the second regular expression is a regular expression corresponding to a constant containing a number;
S400, performing word segmentation on D by using the word segmentation symbol set to obtain a corresponding word segmentation result set W = {W_1, W_2, …, W_i, …, W_n}; W_i is the word segmentation result corresponding to D_i, wherein W_i = {W_i1, W_i2, …, W_ij, …, W_if(i)}, W_ij is the j-th word in W_i, j ranges from 1 to f(i), and f(i) is the number of words in W_i;
S500, for W_ij in W_i, if W_ij contains a number, marking the type identifier corresponding to W_ij as a variable identifier, and if W_ij does not contain a number, marking the type identifier corresponding to W_ij as a constant identifier;
S600, based on the current first regular expression and the current second regular expression, adjusting the type identifier corresponding to W_ij in W_i to obtain D with the type identifiers adjusted, and using it as the current log data set;
S700, acquiring a corresponding current log template set based on the current log data set, taking the current log template set as a target log template corresponding to the D if the current log template set is the same as the last log template set, and exiting the current control program, otherwise, executing S800;
S800, adjusting the current first regular expression, taking the adjusted first regular expression as the current first regular expression, adjusting the current second regular expression, taking the adjusted second regular expression as the current second regular expression, and executing S600.
2. The method according to claim 1, wherein S200 specifically comprises:
S201, for D_i, performing word segmentation on D_i by using general delimiters to obtain a corresponding word segmentation result W0_i, and acquiring a special symbol set M_i corresponding to D_i; wherein W0_i = {W0_i1, W0_i2, …, W0_ir, …, W0_ig(i)}; W0_ir is the r-th word in W0_i, r ranges from 1 to g(i), and g(i) is the number of words in W0_i; M_i = {M_i1, M_i2, …, M_is, …, M_ih(i)}, M_is is the s-th special symbol in M_i, s ranges from 1 to h(i), and h(i) is the number of special symbols in M_i;
S202, performing preliminary classification on all logs in D based on log lengths and special symbol sets to obtain m initial log clusters;
S203, setting a counter t = 1;
S204, if t ≤ k, executing S205; otherwise, executing S208;
S205, randomly extracting a log cluster from the initial log clusters that have not yet been extracted as the t-th sample log cluster C_t, and randomly extracting d logs from C_t as the t-th log sample candidate set PC_t = {PC_t1, PC_t2, …, PC_ta, …, PC_td}; PC_ta is the a-th log sample in PC_t, and a ranges from 1 to d;
S206, acquiring the log sample corresponding to min(S_t1, S_t2, …, S_ta, …, S_td) as the t-th key log and adding it to the current key log set; wherein S_ta is the maximum similarity between PC_ta and the logs in the current key log set, S_ta = max(S^1_ta, S^2_ta, …, S^u_ta, …, S^p_ta), S^u_ta is the similarity between PC_ta and the u-th log in the current key log set, u ranges from 1 to p, and p is the number of logs in the current key log set; the initial value of the key log set is an empty set; min() represents taking the minimum value, and max() represents taking the maximum value;
S207, setting t = t + 1, and executing S204;
S208, taking k logs in the current key log set as the k initial key logs.
3. The method according to claim 2, wherein S^u_ta satisfies the following condition: S^u_ta = LP^max_{ta-u} / L^max_{ta-u}, where LP^max_{ta-u} is the length of the longest common subsequence between PC_ta and the u-th log in the current key log set, L^max_{ta-u} = max(L_ta, L_u), L_ta is the length of PC_ta, L_u is the length of the u-th log in the current key log set, and max() represents taking the maximum value.
4. The method of claim 1, wherein in S800, the adjusting the current first regular expression and using the adjusted first regular expression as the current first regular expression specifically includes:
S810, obtaining the similarity SL_bv between any two log templates LT_b and LT_v in the current log template set; b and v range from 1 to Q, wherein Q is the number of log templates in the current log template set, and b ≠ v;
S811, if SL_bv > ST0, executing S812; ST0 is a preset similarity threshold;
S812, if LT_b does not belong to LTC, adding LT_b to LTC; otherwise, not adding LT_b to LTC; if LT_v does not belong to LTC, adding LT_v to LTC; otherwise, not adding LT_v to LTC; LTC is the current candidate log template set, and its initial value is an empty set;
S813, extracting q logs from each log cluster corresponding to the current candidate log template set as new key logs;
S814, generating a new first regular expression based on the new key logs;
S815, generating a corrected first regular expression based on the current first regular expression and the new first regular expression;
S816, taking the corrected first regular expression as the current first regular expression, and executing S600.
5. The method of claim 4, wherein SL_bv satisfies the following condition: SL_bv = LP^max_bv / L^max_bv, wherein LP^max_bv is the length of the longest common subsequence between LT_b and LT_v, L^max_bv = max(L_b, L_v), L_b is the length of LT_b, L_v is the length of LT_v, and max() denotes taking the maximum value.
6. The method of claim 1, wherein in S800, the adjusting the current second regular expression and using the adjusted second regular expression as the current second regular expression specifically includes:
S801, acquiring a word frequency set WF and a constant set CV corresponding to the current log data set; wherein WF = {WF_1, WF_2, …, WF_e, …, WF_x}; WF_e is the word frequency of the e-th word W_e in the word set WP corresponding to the current log data set, e ranges from 1 to x, and x is the number of words in WP; WP = W^c_1 ∪ W^c_2 ∪ … ∪ W^c_i ∪ … ∪ W^c_n, where W^c_i is the word segmentation result corresponding to the i-th log in the current log data set; CV = CV_1 ∪ CV_2 ∪ … ∪ CV_i ∪ … ∪ CV_n, where CV_i is the constant set corresponding to the i-th log in the current log data set;
S802, if WF_e > WF0 and W_e does not belong to CV, extracting q logs from the logs corresponding to the log template corresponding to W_e as new key logs; WF0 is a preset word frequency threshold;
S803, generating a new second regular expression based on the new key logs;
S804, generating a corrected second regular expression based on the current second regular expression and the new second regular expression;
S805, taking the corrected second regular expression as the current second regular expression, and executing S600.
7. The method of claim 6, wherein WF0 satisfies the following condition: WF0 = f × WF_max, where f is a preset coefficient, 0 < f < 1, and WF_max is the largest value in WF.
8. An electronic device comprising a processor and a memory;
the processor is adapted to perform the steps of the method according to any of claims 1 to 7 by invoking a program or instruction stored in the memory.
9. A non-transitory computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of the method of any one of claims 1 to 7.
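The similarity of claims 3 and 5 and the key log selection of S206 can be illustrated with a minimal sketch. It assumes that a log or template is represented as a list of words and that lengths are counted in words, which the claims do not specify; the function names and the sample logs are hypothetical. When the key log set is still empty, the sketch treats the maximum similarity as 0, which is also an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """LCS length divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0

def pick_key_log(candidates, key_logs):
    """S206: pick the candidate whose maximum similarity to the key logs is smallest."""
    def max_similarity(candidate):
        return max((similarity(candidate, k) for k in key_logs), default=0.0)
    return min(candidates, key=max_similarity)

a = "Connection from node1 closed".split()
b = "Connection from node2 closed".split()
c = "Session 7 expired".split()
print(similarity(a, b))
print(pick_key_log([b, c], [a]))
```

Choosing the candidate that is least similar to the logs already selected tends to spread the key logs across different log formats.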
CN202410518808.2A 2024-04-28 2024-04-28 Log template acquisition method, electronic equipment and storage medium Active CN118093325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410518808.2A CN118093325B (en) 2024-04-28 2024-04-28 Log template acquisition method, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410518808.2A CN118093325B (en) 2024-04-28 2024-04-28 Log template acquisition method, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN118093325A true CN118093325A (en) 2024-05-28
CN118093325B CN118093325B (en) 2024-06-21

Family

ID=91157854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410518808.2A Active CN118093325B (en) 2024-04-28 2024-04-28 Log template acquisition method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118093325B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569214A (en) * 2019-08-02 2019-12-13 杭州云纪网络科技有限公司 Index construction method and device for log file and electronic equipment
US20200364251A1 (en) * 2018-11-08 2020-11-19 Ho Chi Minh City University Of Technology (Hutech) Cluster computing system and method for automatically generating extraction patterns from operational logs
CN112579707A (en) * 2020-12-08 2021-03-30 西安邮电大学 Log data knowledge graph construction method
CN114201328A (en) * 2021-12-17 2022-03-18 中国平安财产保险股份有限公司 Fault processing method and device based on artificial intelligence, electronic equipment and medium
CN114610515A (en) * 2022-03-10 2022-06-10 电子科技大学 Multi-feature log anomaly detection method and system based on log full semantics
CN114816962A (en) * 2022-06-27 2022-07-29 南京争锋信息科技有限公司 ATTENTION-LSTM-based network fault prediction method
CN115221013A (en) * 2022-09-21 2022-10-21 云智慧(北京)科技有限公司 Method, device and equipment for determining log mode
CN115437877A (en) * 2022-08-18 2022-12-06 华南理工大学 Online analysis method and system for multi-source log, electronic equipment and storage medium
CN115545019A (en) * 2022-10-31 2022-12-30 北京火山引擎科技有限公司 Log template extraction method, apparatus, storage medium, and program product
CN115562645A (en) * 2022-09-29 2023-01-03 中国人民解放军国防科技大学 Configuration fault prediction method based on program semantics
CN115617953A (en) * 2022-11-15 2023-01-17 成都九洲电子信息系统股份有限公司 Intelligent diagnosis method and system for network service link fault
CN115859932A (en) * 2022-11-30 2023-03-28 北京火山引擎科技有限公司 Log template extraction method and device, electronic equipment and storage medium
CN116841779A (en) * 2023-05-09 2023-10-03 广州亚信技术有限公司 Abnormality log detection method, abnormality log detection device, electronic device and readable storage medium
CN117170724A (en) * 2023-08-16 2023-12-05 绿盟科技集团股份有限公司 Automatic updating method, device and equipment for AI model for detecting business abnormality
CN117407242A (en) * 2023-10-10 2024-01-16 浙江大学 Low-cost zero-sample online log analysis method based on large language model
CN117827784A (en) * 2024-01-04 2024-04-05 先进新星技术(新加坡)控股有限公司 Noise log filtering method and system

Also Published As

Publication number Publication date
CN118093325B (en) 2024-06-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant