CN118093325A - Log template acquisition method, electronic equipment and storage medium - Google Patents
- Publication number
- CN118093325A (application CN202410518808.2A)
- Authority
- CN
- China
- Prior art keywords
- log
- current
- regular expression
- logs
- max
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
The invention relates to the field of computer technology applications, and in particular to a log template acquisition method, an electronic device, and a storage medium. The method comprises: obtaining a plurality of initial key logs from an original log data set; preliminarily identifying the type of each word in the data set; generating, based on the initial key logs, a first regular expression that represents non-digital variables and a second regular expression that represents constants containing numbers; correcting the preliminarily identified word types using the generated expressions; obtaining the corresponding log templates based on the corrected word types; and judging whether the current log templates satisfy a preset condition. If they do, the current log templates are taken as the target log templates; otherwise, the current expressions are adjusted, the word types in the current data set are adjusted using the new expressions to obtain new log templates, and the judging step is repeated until the preset condition is satisfied. The invention can improve the efficiency and accuracy of log template generation.
Description
Technical Field
The present invention relates to the field of computer technology application, and in particular, to a log template obtaining method, an electronic device, and a storage medium.
Background
In the development and maintenance of modern software, logs provide critical information about system and network activity. They help developers and operation and maintenance engineers understand system behavior, trace problems back to their source, detect and respond to security events, and carry out troubleshooting and vulnerability analysis. In practice, operation and maintenance engineers typically use rule-based log parsing methods, such as the Grok filter used by Logstash, in which regular expressions that match entire log templates are written by hand. Grok matches a log line against a regular expression, maps specific portions of the line to dedicated fields, and performs operations based on this mapping. This type of approach has two problems. First, each Grok filter rule corresponds to one class of log events, so the Grok rule base is difficult to maintain and extend for modern software systems that contain a large number of heterogeneous log event types and are continually updated. Second, each newly added Grok rule incurs an additional regular-expression match against the entire log line. The logs of a modern software system may contain hundreds of thousands of log templates; the cost of manually writing regular expressions that match all of them is unacceptable, and such rules cannot adapt to newly emerging log templates. Another type of log parsing method is based on predefined heuristic rules: researchers identify features inherent in log data, and algorithms use these features to acquire templates. For example, SLCT (Simple Logfile Clustering Tool), which is based on frequent-word statistics, treats words that occur frequently in log files as constants. This approach may be effective for preliminary parsing, but it has difficulty identifying rare log templates that occur at low frequency.
Building on this, LFA (Log File Analyzer) takes word position into account in its statistics, and Logram uses n-grams as its statistical measure, so that the context of each word is taken into consideration. IPLoM (Iterative Partitioning Log Mining) proposes an iterative partitioning approach that repeatedly partitions the logs into small clusters based on log length and word position. Drain, a log parsing algorithm widely used in recent years, is essentially a tree representation of an iterative partitioning algorithm, built on a prefix parse tree of the logs. This type of strategy has two problems:
a) The similarity threshold is a hyperparameter that strongly affects algorithm performance and is difficult to tune to an optimal value;
b) Logs are merged whenever the similarity within a set of logs is high; this strategy stems from the algorithm designer's observations of the properties of most logs and is not appropriate for all logs.
Another type of log parsing method uses deep learning to learn the constant/variable features of logs from a large number of annotated log data sets. Among these, UniParser encodes log context with an LSTM network trained using a contrastive learning strategy, and LogPPT uses a RoBERTa network to extract log sequence features. However, neural network methods impose additional GPU hardware requirements on the system, and their running efficiency struggles to keep up with the enormous rate at which logs are generated.
In summary, existing log template acquisition methods either depend on a large number of rules or require complex deep learning algorithms, so their acquisition cost is too high and they are difficult to adapt to the real-time processing requirements of large-scale log data sets.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
According to a first aspect of the present invention, there is provided a log template acquisition method, the method comprising the steps of:
S100, acquiring an original log data set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_n\}$, where $D_i$ is the i-th log in D, i ranges from 1 to n, and n is the number of logs in D.
S200, sampling the D to obtain k initial key logs.
S300, acquiring a word segmentation symbol set, an initial first regular expression and an initial second regular expression based on the k initial key logs; the first regular expression is a regular expression corresponding to a non-digital variable, and the second regular expression is a regular expression corresponding to a constant containing a number.
S400, performing word segmentation on D using the word segmentation symbol set to obtain a corresponding word segmentation result set $W = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$, where $W_i$ is the word segmentation result corresponding to $D_i$, $W_i = \{W_{i1}, W_{i2}, \ldots, W_{ij}, \ldots, W_{if(i)}\}$, $W_{ij}$ is the j-th word in $W_i$, j ranges from 1 to f(i), and f(i) is the number of words in $W_i$.
S500, for each $W_{ij}$ in $W_i$, if $W_{ij}$ contains a number, marking the type identifier corresponding to $W_{ij}$ as a variable identifier; if $W_{ij}$ does not contain a number, marking the type identifier corresponding to $W_{ij}$ as a constant identifier.
S600, based on the current first regular expression and the current second regular expression, adjusting the type identifier corresponding to $W_{ij}$ in $W_i$, and taking D with the adjusted type identifiers as the current log data set.
S700, acquiring a corresponding current log template set based on the current log data set, taking the current log template set as a target log template corresponding to the D if the current log template set is the same as the last log template set, and exiting the current control program, otherwise, executing S800.
S800, adjusting the current first regular expression, taking the adjusted first regular expression as the current first regular expression, adjusting the current second regular expression, taking the adjusted second regular expression as the current second regular expression, and executing S600.
According to a second aspect of the present invention, there is provided an electronic device comprising a processor and a memory; the processor is configured to execute the steps of the method according to the first aspect of the present invention by calling a program or instructions stored in the memory.
According to a third aspect of the present invention there is provided a non-transitory computer readable storage medium storing a program or instructions which cause a computer to perform the steps of the method of the first aspect of the present invention.
The invention has at least the following beneficial effects:
According to the technical scheme provided by the embodiment of the invention, k initial key logs are first obtained from an original log data set, and the type of each word in the data set is preliminarily marked. A first regular expression representing non-digital variables and a second regular expression representing constants containing numbers are generated based on the k initial key logs, and the generated expressions are used to correct the preliminarily marked word types. The corresponding log templates are then obtained based on the corrected word types, and it is judged whether the current log templates satisfy a preset condition. If they do, the current log templates are taken as the target log templates; otherwise, the current first regular expression and the current second regular expression are adjusted, the word types in the current data set are adjusted using the new regular expressions to obtain new log templates, and the judging step is repeated until the preset condition is satisfied. The invention can improve the efficiency and accuracy of log template generation.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a log template obtaining method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It should be noted that some exemplary embodiments are described as a process or a method depicted as a flowchart. Although a flowchart depicts steps as a sequential process, many of the steps may be implemented in parallel, concurrently, or with other steps. Furthermore, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
An embodiment of the present invention provides a log template obtaining method, as shown in fig. 1, where the method may include the following steps:
S100, acquiring an original log data set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_n\}$, where $D_i$ is the i-th log in D, i ranges from 1 to n, and n is the number of logs in D.
In an embodiment of the present invention, the original log data set may be a log data set input by a user. The log data set may include log data generated by distributed systems (HDFS, Hadoop, ZooKeeper, OpenStack, Spark), supercomputers (BGL, HPC), client applications (Proxifier, Thunderbird), server applications (Apache, OpenSSH), mobile applications (HealthApp), operating systems (Windows, Linux, Mac, Android), and the like.
S200, sampling the D to obtain k initial key logs.
In the embodiment of the invention, k can be set based on actual needs, for example, 32, 64 or 128.
Further, in the embodiment of the present invention, S200 may specifically include:
S201, for $D_i$, performing word segmentation on $D_i$ using a universal delimiter to obtain a corresponding word segmentation result $W0_i$, and acquiring a special symbol set $M_i$ corresponding to $D_i$; wherein $W0_i = \{W0_{i1}, W0_{i2}, \ldots, W0_{ir}, \ldots, W0_{ig(i)}\}$, $W0_{ir}$ is the r-th word in $W0_i$, r ranges from 1 to g(i), and g(i) is the number of words in $W0_i$; $M_i = \{M_{i1}, M_{i2}, \ldots, M_{is}, \ldots, M_{ih(i)}\}$, $M_{is}$ is the s-th special symbol in $M_i$, s ranges from 1 to h(i), and h(i) is the number of special symbols in $M_i$.
In an embodiment of the present invention, the universal delimiter may be an existing universal delimiter, such as a space, or the like. Special symbols may be existing definitions such as symbols other than numerals and letters.
S202, based on the log length and the special symbol set, performing preliminary classification on all logs in the D to obtain m initial log clusters.
In the embodiment of the invention, the length of a log is the number of words obtained by segmenting that log. All logs in a given log cluster have the same log length and the same special symbol set.
S203, a counter t=1 is set.
S204, if t is less than or equal to k, executing S205, otherwise, executing S208.
S205, randomly extracting one log cluster from the initial log clusters that have not yet been extracted as the t-th sample log cluster $C_t$, and randomly extracting d logs from $C_t$ as the t-th log sample candidate set $PC_t = \{PC_{t1}, PC_{t2}, \ldots, PC_{ta}, \ldots, PC_{td}\}$, where $PC_{ta}$ is the a-th log sample in $PC_t$ and a ranges from 1 to d.
In the embodiment of the invention, d can be set based on actual needs, for example, 32, 64 or 128. Those skilled in the art will appreciate that if the number of logs in $C_t$ is less than d, all logs in $C_t$ are taken as log samples.
S206, acquiring the log sample corresponding to $\min(S_{t1}, S_{t2}, \ldots, S_{ta}, \ldots, S_{td})$ as the t-th key log and adding it to the current key log set; wherein $S_{ta}$ is the maximum similarity between $PC_{ta}$ and the logs in the current key log set, $S_{ta} = \max(S^1_{ta}, S^2_{ta}, \ldots, S^u_{ta}, \ldots, S^p_{ta})$, $S^u_{ta}$ is the similarity between $PC_{ta}$ and the u-th log in the current key log set, u ranges from 1 to p, and p is the number of logs currently in the key log set; the initial value of the key log set is the empty set; min() denotes taking the minimum value and max() denotes taking the maximum value.
In an embodiment of the present invention, $S^u_{ta}$ is determined based on the longest common subsequence between $PC_{ta}$ and the u-th log in the current key log set. Specifically, $S^u_{ta} = LP^{\max}_{ta\text{-}u} / L^{\max}_{ta\text{-}u}$, where $LP^{\max}_{ta\text{-}u}$ is the length of the longest common subsequence between $PC_{ta}$ and the u-th log in the current key log set, $L^{\max}_{ta\text{-}u} = \max(L_{ta}, L_u)$, $L_{ta}$ is the length of $PC_{ta}$, and $L_u$ is the length of the u-th log in the current key log set; that is, $L^{\max}_{ta\text{-}u}$ is the larger of the two lengths.
S207, t=t+1 is set, and S204 is executed.
S208, taking k logs in the current key log set as the k initial key logs.
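The sampling steps S201 to S208 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: whitespace is assumed as the universal delimiter, any character that is not a letter, digit or whitespace is treated as a special symbol, and the function names (`lcs_length`, `similarity`, `sample_key_logs`) and the `seed` parameter are hypothetical. When there are fewer than k clusters, the sketch simply returns one key log per cluster, a case the text does not specify.

```python
import random
import re
from collections import defaultdict


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """LCS length divided by the length of the longer token list."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def sample_key_logs(logs, k, d, seed=0):
    rng = random.Random(seed)
    # S201-S202: cluster logs by (token count, set of special symbols).
    clusters = defaultdict(list)
    for line in logs:
        tokens = line.split()
        specials = frozenset(re.findall(r"[^0-9A-Za-z\s]", line))
        clusters[(len(tokens), specials)].append((line, tokens))
    remaining = list(clusters.values())
    rng.shuffle(remaining)
    key_logs = []
    # S203-S207: draw one cluster per round, at most k rounds.
    for cluster in remaining[:k]:
        candidates = rng.sample(cluster, min(d, len(cluster)))
        # S206: keep the candidate whose maximum similarity to the
        # key logs chosen so far is the smallest.
        best = min(
            candidates,
            key=lambda c: max(
                (similarity(c[1], kl[1]) for kl in key_logs), default=0.0
            ),
        )
        key_logs.append(best)
    # S208: return the selected key logs as raw lines.
    return [line for line, _ in key_logs]
```

Choosing the candidate with the smallest maximum similarity favors logs that differ most from the key logs already selected, which is what makes a small sample cover many distinct log shapes.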
S300, acquiring a word segmentation symbol set, an initial first regular expression and an initial second regular expression based on the k initial key logs; the first regular expression is a regular expression corresponding to a non-digital variable, and the second regular expression is a regular expression corresponding to a constant containing a number.
In the embodiment of the invention, the word segmentation symbol set may be a union of word segmentation symbols corresponding to k initial key logs.
In an embodiment of the present invention, the first regular expression and the second regular expression may be obtained manually or by means of a trained neural network model. Specifically, the words in each key log that are non-digital variables or constants containing numbers may be identified manually or by a trained neural network model. For example, in the log "Invalid user test from 52.80.34.196", the word "test" is a non-digital variable; in the log "…… available network connection on network <…> via interface Alt0", the word "Alt0" is a constant containing a number. It will be appreciated by those skilled in the art that any method of identifying, manually or with a trained neural network model, the words in each key log that are non-digital variables or constants containing numbers falls within the scope of the present invention.
After all words in the key logs that are non-digital variables or constants containing numbers have been identified, words with similar contexts are represented by the same regular expression. For example, the log "Invalid user test from 52.80.34.196" corresponds to a regular expression beginning "(…", and the log "…… via interface Alt0" corresponds to a regular expression beginning "(.…". Finally, a preset number of regular expressions is obtained based on actual requirements. Those skilled in the art will recognize that any method for obtaining a regular expression corresponding to a word falls within the scope of the present invention.
S400, performing word segmentation on D using the word segmentation symbol set to obtain a corresponding word segmentation result set $W = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$, where $W_i$ is the word segmentation result corresponding to $D_i$, $W_i = \{W_{i1}, W_{i2}, \ldots, W_{ij}, \ldots, W_{if(i)}\}$, $W_{ij}$ is the j-th word in $W_i$, j ranges from 1 to f(i), and f(i) is the number of words in $W_i$.
Although word segmentation is a key element of log parsing, it has received little attention. The inventors of the present invention found through research that word segmentation has an important effect on log parsing results: in some cases, log templates that cannot be identified correctly under one segmentation can be identified accurately under another. Optimization of word segmentation therefore deserves deeper exploration in order to improve the accuracy and efficiency of log parsing. Take the logs "StackScrollAlgorithm: overlapAmount:220.0" and "StackScrollAlgorithm:state. Cliptopamount204" as examples. If the colon is not used as a word segmentation symbol, each log is separated into three words; because the latter two words contain numbers, the algorithm cannot take them into account during classification, so both types of logs are mapped to the same set ("StackScrollAlgorithm") and cannot be distinguished. In contrast, if the colon is used as a word segmentation symbol, the two types of logs map to ("StackScrollAlgorithm", "overlapAmount") and ("StackScrollAlgorithm", "state.…") respectively and can be distinguished. Even neural network models trained on a large number of log data sets can hardly be guaranteed to work on a new data set without fine-tuning. Therefore, in the invention, the union of the word segmentation symbols corresponding to the k initial key logs is used as the word segmentation symbol set, so that the original logs are converted into as regular a form as possible and word segmentation accuracy is improved.
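The effect of the delimiter set can be illustrated with a short Python sketch. The helper names `tokenize` and `constant_key` are hypothetical, and the two log lines are simplified stand-ins (the second one is invented for illustration); `constant_key` keeps only digit-free tokens, mirroring the digit-based marking used in S500.

```python
import re


def tokenize(line, delimiters):
    """Split a log line on any of the given single-character delimiters."""
    pattern = "[" + re.escape("".join(delimiters)) + "]+"
    return [tok for tok in re.split(pattern, line) if tok]


def constant_key(tokens):
    """Keep only the tokens that contain no digit."""
    return tuple(tok for tok in tokens if not any(ch.isdigit() for ch in tok))


log_a = "StackScrollAlgorithm: overlapAmount:220.0"
log_b = "StackScrollAlgorithm: clipAmount:204"  # hypothetical stand-in line

# Whitespace only: the name and the number stay fused in one token,
# so both lines collapse to the same digit-free key.
same = constant_key(tokenize(log_a, " ")) == constant_key(tokenize(log_b, " "))

# Whitespace plus colon: the field names become separate digit-free tokens,
# so the two lines get different keys.
key_a = constant_key(tokenize(log_a, " :"))
key_b = constant_key(tokenize(log_b, " :"))
```

With the colon included, `key_a` and `key_b` differ in their second element, which is what allows the two log types to be separated.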
S500, for each $W_{ij}$ in $W_i$, if $W_{ij}$ contains a number, marking the type identifier corresponding to $W_{ij}$ as a variable identifier, i.e. marking $W_{ij}$ as a variable; if $W_{ij}$ does not contain a number, marking the type identifier corresponding to $W_{ij}$ as a constant identifier, i.e. marking $W_{ij}$ as a constant.
In embodiments of the present invention, existing constant/variable discriminators may be used to obtain the type identification for each word.
S600, based on the current first regular expression and the current second regular expression, adjusting the type identifier corresponding to $W_{ij}$ in $W_i$, and taking D with the adjusted type identifiers as the current log data set.
In the embodiment of the invention, the initial value of the current first regular expression is an initial first regular expression, and the initial value of the current second regular expression is an initial second regular expression.
Further, in S600, each log in the original log data set may be matched against the obtained first regular expression and second regular expression; if a word is matched, the type identifier of that word is adjusted to the type identifier associated with the matching regular expression, for example when matching with a regular expression "(.…".
Those skilled in the art will appreciate that the process of matching using regular expressions may be implemented based on existing regular matching algorithms.
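Steps S500 and S600 can be sketched together in Python. The sketch assumes that each regular expression has exactly one capture group that isolates the word whose type it fixes; the two patterns used in the usage note are illustrative inventions, not the expressions of the embodiment, and the names `initial_types` and `adjust_types` are hypothetical. Matching the captured text back to tokens by string equality is a simplification.

```python
import re

VARIABLE, CONSTANT = "VAR", "CONST"


def initial_types(tokens):
    """S500: a token that contains a digit is a variable, otherwise a constant."""
    return [VARIABLE if any(c.isdigit() for c in t) else CONSTANT for t in tokens]


def adjust_types(line, tokens, types, variable_patterns, constant_patterns):
    """S600: override the initial type of every token captured by a pattern.

    variable_patterns play the role of the first regular expression
    (non-digital variables); constant_patterns play the role of the
    second regular expression (constants containing numbers).
    """
    adjusted = list(types)
    groups = ((variable_patterns, VARIABLE), (constant_patterns, CONSTANT))
    for patterns, label in groups:
        for pattern in patterns:
            for match in re.finditer(pattern, line):
                for idx, tok in enumerate(tokens):
                    if tok == match.group(1):
                        adjusted[idx] = label
    return adjusted
```

For example, with the hypothetical pattern `r"Invalid user (\S+) from"` as a variable pattern, the word "test" in "Invalid user test from 52.80.34.196" is relabeled from constant to variable; with `r"via interface (\S+)"` as a constant pattern, "Alt0" is relabeled from variable to constant.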
The ideal goal of log template extraction is to extract the constants in each log completely and correctly. In practice, achieving high-accuracy log parsing with a constant/variable discriminator alone is very difficult. Consider the following: if a constant/variable discriminator has a per-word accuracy of 99% and each log contains 10 words on average, the probability that it correctly classifies every word in one log drops to 0.99 to the 10th power, i.e., about 0.904; if the per-word accuracy is slightly lower, at 95%, the accuracy on an entire log of length 10 drops rapidly to about 0.599. Minor deviations in the discriminator are thus amplified in log template discrimination. In addition, logs in practice are complex and highly variable in form, and the continuous growth of log volume in large-scale systems limits the application of deep learning methods. Together, these factors make it difficult for a constant/variable discriminator to achieve both high accuracy and high efficiency. In view of this, in the embodiment of the present invention, the first regular expression and the second regular expression are obtained from the key logs and used to correct the output of the constant/variable discriminator, so that the generated log templates are as accurate as possible.
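The arithmetic in this paragraph can be checked with a one-line Python function. The sketch assumes that per-word errors are independent, which the 0.99 to the 10th power calculation implies; the function name is hypothetical.

```python
def whole_log_accuracy(per_word_accuracy, words_per_log):
    """Probability that every word in a log is classified correctly,
    assuming the per-word errors are independent."""
    return per_word_accuracy ** words_per_log


acc_99 = round(whole_log_accuracy(0.99, 10), 3)  # 0.904
acc_95 = round(whole_log_accuracy(0.95, 10), 3)  # 0.599
```

A four-point drop in per-word accuracy therefore costs roughly thirty points of whole-log accuracy at length 10.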
S700, acquiring a corresponding current log template set based on the current log data set, taking the current log template set as a target log template corresponding to the D if the current log template set is the same as the last log template set, namely, the log template is not changed, and exiting the current control program, otherwise, executing S800.
In the embodiment of the invention, the log templates can be generated using the prior art. For example, for each log in the current log data set, the constants of the log may be used as the static portion of the log template and the variables of the log as the variable portion, and the variable portion is then replaced with a wildcard to obtain the corresponding log template. Finally, logs that share the same log template are merged to obtain the current log template set.
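Template generation as described here can be sketched in a few lines of Python. The wildcard string `<*>` and the `"VAR"` label are assumptions chosen for illustration, and `build_templates` is a hypothetical name.

```python
from collections import defaultdict

WILDCARD = "<*>"  # assumed wildcard symbol


def build_templates(tokenized_logs, type_labels):
    """Map each template (a tuple of tokens) to the indices of the logs it covers."""
    templates = defaultdict(list)
    for idx, (tokens, types) in enumerate(zip(tokenized_logs, type_labels)):
        # Constants are kept as the static part; variables become wildcards.
        template = tuple(
            WILDCARD if ty == "VAR" else tok for tok, ty in zip(tokens, types)
        )
        # Logs that yield the same template are merged under one key.
        templates[template].append(idx)
    return dict(templates)
```

The keys of the returned dictionary form the current log template set, and the index lists identify the log cluster behind each template.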
S800, adjusting the current first regular expression, taking the adjusted first regular expression as the current first regular expression, adjusting the current second regular expression, taking the adjusted second regular expression as the current second regular expression, and executing S600.
The technical effect of S800 is to compensate, by finding potentially misclassified samples, for the accuracy that the log sampling module may sacrifice in order to reduce manual effort. The loop iteration continuously optimizes the rule set so that it adapts better to system changes and to the evolution of log data, improving the efficiency and accuracy of operation and maintenance personnel in log analysis and fault investigation.
Further, in embodiments of the present invention, potential variables in the misclassified samples may be identified based on the longest common subsequence, and potential constants in the misclassified samples may be identified based on word frequency statistics.
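The S600 to S800 loop is a fixed-point iteration: it stops once two consecutive rounds produce the same template set. A minimal Python sketch follows, in which the three callables and the `max_rounds` safety cap are assumptions introduced for illustration (the text itself specifies no iteration limit).

```python
def iterate_templates(apply_rules, build_template_set, refine_rules, rules, max_rounds=50):
    """Repeat S600 -> S700 -> S800 until the template set stops changing.

    apply_rules(rules)            -> current log data set         (S600)
    build_template_set(data_set)  -> comparable template set      (S700)
    refine_rules(rules, current)  -> adjusted regular expressions (S800)
    """
    previous = None
    for _ in range(max_rounds):
        current = build_template_set(apply_rules(rules))
        if current == previous:
            # S700: the templates did not change, so they are the target templates.
            return current, rules
        previous = current
        rules = refine_rules(rules, current)
    # Safety cap reached; return the most recent template set.
    return previous, rules
```

Passing the steps in as callables keeps the convergence logic separate from how the rules are applied or refined.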
Further, in an exemplary embodiment of the present invention, in S800, adjusting the current first regular expression, and taking the adjusted first regular expression as the current first regular expression may specifically include:
S810, obtaining the similarity $SL_{bv}$ between any two log templates $LT_b$ and $LT_v$ in the current log template set; b and v range from 1 to Q, Q is the number of log templates in the current log template set, and b is not equal to v.
In an embodiment of the invention, $SL_{bv}$ is determined based on the ratio of the length of the longest common subsequence between $LT_b$ and $LT_v$ to the length of the longer of the two. Specifically, $SL_{bv} = LP^{\max}_{bv} / L^{\max}_{bv}$, where $LP^{\max}_{bv}$ is the length of the longest common subsequence between $LT_b$ and $LT_v$, $L^{\max}_{bv} = \max(L_b, L_v)$, $L_b$ is the length of $LT_b$, $L_v$ is the length of $LT_v$, and max() denotes taking the maximum value; that is, $L^{\max}_{bv}$ is the larger of the length of $LT_b$ and the length of $LT_v$.
S811, if $SL_{bv} > ST0$, executing S812; ST0 is a preset similarity threshold.
If a variable belonging to a particular log template is mistakenly classified as a constant, several highly similar log templates may be generated; therefore, if $SL_{bv} > ST0$, a misclassified sample may exist.
In an embodiment of the present invention, ST0 may be an empirical value and may be determined based on the corresponding log type. In one exemplary embodiment, $0.5 \le ST0 \le 0.7$. The inventors of the present invention found through research that if ST0 < 0.5 or ST0 > 0.7, potentially misclassified samples cannot be identified effectively.
S812, if $LT_b$ does not belong to LTC, adding $LT_b$ to LTC, otherwise not adding it; if $LT_v$ does not belong to LTC, adding $LT_v$ to LTC, otherwise not adding it; LTC is the current candidate log template set, and its initial value is the empty set.
S813, extracting q logs from each log cluster corresponding to the current candidate log template set as new key logs.
In an embodiment of the present invention, q may be set based on actual needs, and in a non-limiting exemplary embodiment, q may be equal to 2 or 3.
S814, generating a new first regular expression based on the new key log.
In the embodiment of the present invention, based on the new key logs, the method for generating the new first regular expression may be the same as the method for generating the first regular expression based on the k key logs.
S815, generating a corrected first regular expression based on the current first regular expression and the new first regular expression.
Specifically, the current first regular expression and the new first regular expression are merged to obtain the corrected first regular expression. The regular expressions thereby become richer, that is, new recognition rules for non-digital variables are generated, and further potentially misclassified samples can be found in the next round of iteration.
S816, taking the corrected first regular expression as the current first regular expression, and executing S600.
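Steps S810 to S812 can be sketched in Python as a pairwise comparison of templates. The default threshold 0.6 is simply a value inside the 0.5 to 0.7 range given for ST0, templates are assumed to be tuples of tokens, and the function names are hypothetical.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def candidate_templates(templates, st0=0.6):
    """Collect every template whose similarity to some other template exceeds st0."""
    candidates = []  # plays the role of LTC, initially empty
    for lt_b, lt_v in combinations(templates, 2):
        longest = max(len(lt_b), len(lt_v))
        # S810-S811: LCS length over the longer length, compared with ST0.
        if longest and lcs_length(lt_b, lt_v) / longest > st0:
            # S812: add each template only if it is not already a candidate.
            for lt in (lt_b, lt_v):
                if lt not in candidates:
                    candidates.append(lt)
    return candidates
```

Two templates that differ in a single position out of five score 0.8 and are flagged, which matches the situation where a non-digital variable such as a user name was left as a constant and split one template into several.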
Further, in S800, adjusting the current second regular expression, and taking the adjusted second regular expression as the current second regular expression may specifically include:
S801, acquiring a word frequency set WF and a constant set CV corresponding to the current log data set; wherein $WF = \{WF_1, WF_2, \ldots, WF_e, \ldots, WF_x\}$, $WF_e$ is the word frequency of the e-th word $W_e$ in the word set WP corresponding to the current log data set, e ranges from 1 to x, and x is the number of words in WP; $WP = W^c_1 \cup W^c_2 \cup \ldots \cup W^c_i \cup \ldots \cup W^c_n$, where $W^c_i$ is the word segmentation result corresponding to the i-th log in the current log data set; $CV = CV_1 \cup CV_2 \cup \ldots \cup CV_i \cup \ldots \cup CV_n$, where $CV_i$ is the constant set corresponding to the i-th log in the current log data set.
S802, if $WF_e > WF0$ and $W_e$ does not belong to CV, extracting q logs from the logs corresponding to the log template that corresponds to $W_e$ as new key logs; WF0 is a preset word frequency threshold.
If $WF_e > WF0$ and $W_e$ does not belong to CV, i.e., the word occurs very frequently but is not in the constant set, the word is likely to be a potential constant and the corresponding logs may be misclassified samples.
In an embodiment of the present invention, q may be set based on actual needs, and in a non-limiting exemplary embodiment, q may be equal to 2 or 3.
In an embodiment of the present invention, WF0 may be an empirical value. In one exemplary embodiment, WF0 satisfies the following condition: WF0 = f × WF_max, where f is a preset coefficient, 0 < f < 1, and WF_max is the largest value in WF. Preferably, f = 0.2 or 0.3.
Those skilled in the art will appreciate that if W_e corresponds to multiple log templates, q logs are extracted from the logs corresponding to each of those log templates.
S803, generating a new second regular expression based on the new key log.
In the embodiment of the present invention, based on the new key logs, a manner of generating the new second regular expression may be the same as the foregoing method of generating the second regular expression based on the k key logs.
S804, generating a corrected second regular expression based on the current second regular expression and the new second regular expression.
Specifically, the current second regular expression and the new second regular expression are merged to obtain the corrected second regular expression. The regular expression thus becomes richer: a new recognition rule for constants containing numbers is generated, so that potentially under-classified samples can be found in the next round of iteration.
S805, taking the corrected second regular expression as the current second regular expression, and executing S600.
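The check in S801 and S802 can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: word frequency is assumed to be counted per occurrence over all logs, and the function name `candidate_words` and the sample tokens are hypothetical.

```python
from collections import Counter

def candidate_words(token_lists, constants, f=0.2):
    """Return words W_e with WF_e > WF0 = f * WF_max that are not in CV."""
    # WF: frequency of every word over all word segmentation results (assumed per occurrence).
    freq = Counter(tok for tokens in token_lists for tok in tokens)
    if not freq:
        return []
    wf0 = f * max(freq.values())  # WF0 = f x WF_max
    return sorted(w for w, n in freq.items() if n > wf0 and w not in constants)

# Hypothetical word segmentation results of four logs.
token_lists = [
    ["Receiving", "block", "blk_1", "src"],
    ["Receiving", "block", "blk_2", "src"],
    ["Receiving", "block", "blk_3", "v2"],
    ["Deleting", "block", "blk_4", "v2"],
]
# Hypothetical constant set CV; "v2" contains a digit, so S500 marked it as a variable.
constants = {"Receiving", "block", "src", "Deleting"}
suspects = candidate_words(token_lists, constants, f=0.3)
```

Here WF_max = 4 (for "block"), so WF0 = 1.2, and "v2" is the only frequent word outside CV. Logs containing it would then be sampled as new key logs in S802.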
In the embodiment of the present invention, experiments on the log template acquisition method were carried out using 16 log data sets provided by the LogPai team. These 16 data sets cover distributed systems (HDFS, Hadoop, Zookeeper, OpenStack, Spark), supercomputers (BGL, HPC), client applications (Proxifier, Thunderbird), server applications (Apache, OpenSSH), mobile applications (HealthApp), and operating systems (Windows, Linux, Mac, Android). For each type of data, the LogPai team sampled 2000 representative logs and manually annotated them.
Experiments show that the log template acquisition method provided by the embodiment of the present invention achieves parsing accuracy similar to that of log parsing methods based on modern neural networks. This result strongly supports the effectiveness of the strategy of improving log parsing performance by manually specifying rules for a small number of word patterns.
The embodiment of the invention also provides electronic equipment, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform the methods of embodiments of the present invention.
The embodiment of the invention also provides a non-transitory computer readable storage medium, which stores computer executable instructions for executing the method according to the embodiment of the invention.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution disclosed in the present invention can be achieved, which is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (9)
1. A log template acquisition method, the method comprising the steps of:
S100, acquiring an original log data set D = {D_1, D_2, …, D_i, …, D_n} to be processed, wherein D_i is the i-th log in D, i ranges from 1 to n, and n is the number of logs in D;
S200, sampling the D to obtain k initial key logs;
S300, acquiring a word segmentation symbol set, an initial first regular expression and an initial second regular expression based on the k initial key logs; the first regular expression is a regular expression corresponding to a non-digital variable, and the second regular expression is a regular expression corresponding to a constant containing a number;
S400, performing word segmentation processing on D by using the word segmentation symbol set to obtain a corresponding word segmentation result set W = {W_1, W_2, …, W_i, …, W_n}; W_i is the word segmentation result corresponding to D_i, wherein W_i = {W_{i1}, W_{i2}, …, W_{ij}, …, W_{if(i)}}, W_{ij} is the j-th word in W_i, j ranges from 1 to f(i), and f(i) is the number of words in W_i;
S500, for W_{ij} in W_i, if W_{ij} contains numbers, marking the type identifier corresponding to W_{ij} as a variable identifier, and if W_{ij} does not contain numbers, marking the type identifier corresponding to W_{ij} as a constant identifier;
S600, based on the current first regular expression and the current second regular expression, adjusting the type identifier corresponding to W_{ij} in W_i to obtain D with adjusted type identifiers, and using it as the current log data set;
S700, acquiring a corresponding current log template set based on the current log data set; if the current log template set is the same as the previous log template set, taking the current log template set as the target log templates corresponding to D and exiting the current control program; otherwise, executing S800;
S800, adjusting the current first regular expression, taking the adjusted first regular expression as the current first regular expression, adjusting the current second regular expression, taking the adjusted second regular expression as the current second regular expression, and executing S600.
2. The method according to claim 1, wherein S200 specifically comprises:
S201, for D_i, performing word segmentation on D_i by using a general segmentation character to obtain a corresponding word segmentation result W0_i, and acquiring a special symbol set M_i corresponding to D_i; wherein W0_i = {W0_{i1}, W0_{i2}, …, W0_{ir}, …, W0_{ig(i)}}; W0_{ir} is the r-th word in W0_i, r ranges from 1 to g(i), and g(i) is the number of words in W0_i; M_i = {M_{i1}, M_{i2}, …, M_{is}, …, M_{ih(i)}}, M_{is} is the s-th special symbol in M_i, s ranges from 1 to h(i), and h(i) is the number of special symbols in M_i;
S202, performing preliminary classification on all logs in D based on log lengths and special symbol sets to obtain m initial log clusters;
S203, setting a counter t=1;
S204, executing S205 if t ≤ k, otherwise executing S208;
S205, randomly extracting a log cluster from the initial log clusters that have not yet been extracted as the t-th sample log cluster C_t, and randomly extracting d logs from C_t as the t-th log sample candidate set PC_t = {PC_{t1}, PC_{t2}, …, PC_{ta}, …, PC_{td}}; PC_{ta} is the a-th log sample in PC_t, and a ranges from 1 to d;
S206, acquiring the log sample corresponding to min(S_{t1}, S_{t2}, …, S_{ta}, …, S_{td}) as the t-th key log and adding it to the current key log set; wherein S_{ta} is the maximum similarity between PC_{ta} and the logs in the current key log set, S_{ta} = max(S^1_{ta}, S^2_{ta}, …, S^u_{ta}, …, S^p_{ta}), S^u_{ta} is the similarity between PC_{ta} and the u-th log in the current key log set, u ranges from 1 to p, and p is the number of logs in the current key log set; the initial value of the key log set is an empty set; min() represents taking the minimum value, and max() represents taking the maximum value;
S207, setting t = t + 1, and executing S204;
S208, taking k logs in the current key log set as the k initial key logs.
3. The method according to claim 2, wherein S^u_{ta} satisfies the following condition: S^u_{ta} = LP^{max}_{ta-u} / L^{max}_{ta-u}, where LP^{max}_{ta-u} is the length of the longest common subsequence between PC_{ta} and the u-th log in the current key log set, L^{max}_{ta-u} = max(L_{ta}, L_u), L_{ta} is the length of PC_{ta}, L_u is the length of the u-th log in the current key log set, and max() represents taking the maximum value.
4. The method of claim 1, wherein in S800, the adjusting the current first regular expression and using the adjusted first regular expression as the current first regular expression specifically includes:
S810, obtaining the similarity SL_{bv} between any two log templates LT_b and LT_v in the current log template set; b and v range from 1 to Q, wherein Q is the number of log templates in the current log template set, and b ≠ v;
S811, if SL_{bv} > ST0, executing S812; ST0 is a preset similarity threshold;
S812, adding LT_b to LTC if LT_b does not belong to LTC, otherwise not adding LT_b to LTC; adding LT_v to LTC if LT_v does not belong to LTC, otherwise not adding LT_v to LTC; LTC is the current candidate log template set, and its initial value is a null value;
S813, extracting q logs from each log cluster corresponding to the current candidate log template set as new key logs;
S814, generating a new first regular expression based on the new key log;
S815, generating a corrected first regular expression based on the current first regular expression and the new first regular expression;
S816, taking the corrected first regular expression as the current first regular expression, and executing S600.
5. The method of claim 4, wherein SL_{bv} satisfies the following condition: SL_{bv} = LP^{max}_{bv} / L^{max}_{bv}, wherein LP^{max}_{bv} is the length of the longest common subsequence between LT_b and LT_v, L^{max}_{bv} = max(L_b, L_v), L_b is the length of LT_b, L_v is the length of LT_v, and max() represents taking the maximum value.
6. The method of claim 1, wherein in S800, the adjusting the current second regular expression and using the adjusted second regular expression as the current second regular expression specifically includes:
S801, acquiring a word frequency set WF and a constant set CV corresponding to the current log data set; wherein WF = {WF_1, WF_2, …, WF_e, …, WF_x}; WF_e is the word frequency of the e-th word W_e in the word set WP corresponding to the current log data set, e ranges from 1 to x, and x is the number of words in WP; WP = W^c_1 ∪ W^c_2 ∪ … ∪ W^c_i ∪ … ∪ W^c_n, where W^c_i is the word segmentation result corresponding to the i-th log in the current log data set; CV = CV_1 ∪ CV_2 ∪ … ∪ CV_i ∪ … ∪ CV_n, where CV_i is the constant set corresponding to the i-th log in the current log data set;
S802, if WF_e > WF0 and W_e does not belong to CV, extracting q logs from the logs corresponding to the log templates corresponding to W_e as new key logs; WF0 is a preset word frequency threshold;
S803, generating a new second regular expression based on the new key log;
S804, generating a corrected second regular expression based on the current second regular expression and the new second regular expression;
S805, taking the corrected second regular expression as the current second regular expression, and executing S600.
7. The method of claim 6, wherein WF0 satisfies the following condition: WF0 = f × WF_max, where f is a preset coefficient, 0 < f < 1, and WF_max is the largest value in WF.
8. An electronic device comprising a processor and a memory;
the processor is adapted to perform the steps of the method according to any of claims 1 to 7 by invoking a program or instruction stored in the memory.
9. A non-transitory computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of the method of any one of claims 1 to 7.
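The similarity measure of claims 3 and 5 and the key log selection loop of claim 2 (S203 to S208) can be illustrated with the following sketch. It is not part of the claims and rests on assumptions: log length is taken as the number of words (the text does not say whether length is counted in words or characters), the similarity of two empty sequences is defined as 1.0, and the function names are hypothetical.

```python
import random

def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """LCS length divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0  # empty case is an assumption

def select_key_logs(clusters, k, d, rng):
    """Pick one key log from each of up to k randomly chosen clusters."""
    key_logs = []
    remaining = list(range(len(clusters)))
    for _ in range(min(k, len(remaining))):
        # S205: draw a cluster not extracted before, then d candidate logs from it.
        cluster = clusters[remaining.pop(rng.randrange(len(remaining)))]
        candidates = rng.sample(cluster, min(d, len(cluster)))

        # S_ta: maximum similarity between a candidate and the key logs chosen so far.
        def max_sim(log):
            return max((similarity(log, key) for key in key_logs), default=0.0)

        # S206: keep the candidate that is least similar to the current key log set.
        key_logs.append(min(candidates, key=max_sim))
    return key_logs
```

Choosing the candidate with the smallest maximum similarity tends to make the k key logs mutually dissimilar, which is presumably why S206 combines min() and max() in this way.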
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410518808.2A CN118093325B (en) | 2024-04-28 | 2024-04-28 | Log template acquisition method, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118093325A (en) | 2024-05-28 |
CN118093325B (en) | 2024-06-21 |
Family
ID=91157854
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410518808.2A Active CN118093325B (en) | 2024-04-28 | 2024-04-28 | Log template acquisition method, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118093325B (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569214A (en) * | 2019-08-02 | 2019-12-13 | 杭州云纪网络科技有限公司 | Index construction method and device for log file and electronic equipment |
US20200364251A1 (en) * | 2018-11-08 | 2020-11-19 | Ho Chi Minh City University Of Technology (Hutech) | Cluster computing system and method for automatically generating extraction patterns from operational logs |
CN112579707A (en) * | 2020-12-08 | 2021-03-30 | 西安邮电大学 | Log data knowledge graph construction method |
CN114201328A (en) * | 2021-12-17 | 2022-03-18 | 中国平安财产保险股份有限公司 | Fault processing method and device based on artificial intelligence, electronic equipment and medium |
CN114610515A (en) * | 2022-03-10 | 2022-06-10 | 电子科技大学 | Multi-feature log anomaly detection method and system based on log full semantics |
CN114816962A (en) * | 2022-06-27 | 2022-07-29 | 南京争锋信息科技有限公司 | ATTENTION-LSTM-based network fault prediction method |
CN115221013A (en) * | 2022-09-21 | 2022-10-21 | 云智慧(北京)科技有限公司 | Method, device and equipment for determining log mode |
CN115437877A (en) * | 2022-08-18 | 2022-12-06 | 华南理工大学 | Online analysis method and system for multi-source log, electronic equipment and storage medium |
CN115545019A (en) * | 2022-10-31 | 2022-12-30 | 北京火山引擎科技有限公司 | Log template extraction method, apparatus, storage medium, and program product |
CN115562645A (en) * | 2022-09-29 | 2023-01-03 | 中国人民解放军国防科技大学 | Configuration fault prediction method based on program semantics |
CN115617953A (en) * | 2022-11-15 | 2023-01-17 | 成都九洲电子信息系统股份有限公司 | Intelligent diagnosis method and system for network service link fault |
CN115859932A (en) * | 2022-11-30 | 2023-03-28 | 北京火山引擎科技有限公司 | Log template extraction method and device, electronic equipment and storage medium |
CN116841779A (en) * | 2023-05-09 | 2023-10-03 | 广州亚信技术有限公司 | Abnormality log detection method, abnormality log detection device, electronic device and readable storage medium |
CN117170724A (en) * | 2023-08-16 | 2023-12-05 | 绿盟科技集团股份有限公司 | Automatic updating method, device and equipment for AI model for detecting business abnormality |
CN117407242A (en) * | 2023-10-10 | 2024-01-16 | 浙江大学 | Low-cost zero-sample online log analysis method based on large language model |
CN117827784A (en) * | 2024-01-04 | 2024-04-05 | 先进新星技术(新加坡)控股有限公司 | Noise log filtering method and system |
Also Published As
Publication number | Publication date |
---|---|
CN118093325B (en) | 2024-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |