CN112560407A - Method for extracting computer software log template on line - Google Patents

Info

Publication number
CN112560407A
CN112560407A CN202011505125.1A CN202011505125A CN112560407A CN 112560407 A CN112560407 A CN 112560407A CN 202011505125 A CN202011505125 A CN 202011505125A CN 112560407 A CN112560407 A CN 112560407A
Authority
CN
China
Prior art keywords
log
template
word
sequence
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011505125.1A
Other languages
Chinese (zh)
Inventor
徐云川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zork Data Technology Co ltd
Original Assignee
Shanghai Zork Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zork Data Technology Co ltd
Priority to CN202011505125.1A
Publication of CN112560407A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for extracting computer software log templates online, which reads data as a log stream. After receiving a log, the method finds the log template corresponding to that log and, if necessary, creates a new log template or updates an existing one. The method comprises three main stages: preprocessing, obtaining candidate log templates, and obtaining the log template. First, the text of the input log is preprocessed, and candidate log templates are then searched for among the existing log templates according to the preprocessing result. The algorithm then picks a matching log template from the candidate log templates. If there is no candidate log template or no matching log template, a new log template is generated, and this new template may absorb some of the candidate log templates. The method offers high robustness, high running speed and good dynamic adaptability.

Description

Method for extracting computer software log template on line
Technical Field
The invention relates to a computer software log processing method, in particular to a method for extracting a computer software log template on line.
Background
Logs play an important role in the development and maintenance of software systems. The information they contain helps development and operations personnel discover and locate problems. However, as software systems grow in scale and complexity, log volume also grows rapidly, and manual monitoring of logs becomes increasingly impractical. To automate log analysis, raw logs must first be parsed into structured data. To adapt automatically to logs of various formats, one feasible solution is to extract log templates online from the logs already seen. Each log template is a character string containing placeholders that mark the positions occupied by parameter variables. Ideally, the structure of a template resembles the structure of the input to the print statement that generated the log.
In the prior art, Spell (Streaming Parsing of System Event Logs) reads data as a log stream and extracts log templates based on the Longest Common Subsequence (LCS). Spell defines a data structure, LCSObject. Each LCSObject corresponds to one log template and stores the data of that template. Each LCSObject contains an LCSseq, which stores the word sequence of the log template. Spell works as follows:
After receiving an input log, Spell first splits it on separators to obtain a word sequence s. It then computes the LCS between s and the LCSseq of each existing LCSObject, and finds the longest LCS together with the one or more LCSObjects that produce it. If the length of that LCS is less than half the length of s, no log template matches the current input log, so a new LCSObject is created with its LCSseq set to s. If the length is greater than or equal to half the length of s, the LCSObject with the shortest LCSseq among the corresponding LCSObjects is taken as the matched template, and a new LCSseq is generated for it by backtracking (the step of the longest-common-subsequence algorithm that reconstructs the common subsequence): the parts of the original LCSseq and s that are not common are replaced by placeholders, and consecutive non-common parts are merged into a single placeholder.
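The Spell update described above can be sketched as follows. This is a minimal illustration, not Spell's actual code: the names `lcs_table` and `merge_by_backtracking` and the placeholder symbol `*` are assumptions, and only the comparison against a single LCSseq is shown.

```python
from typing import List, Optional

PLACEHOLDER = "*"  # assumed placeholder symbol, not specified in the text


def lcs_table(a: List[str], b: List[str]) -> List[List[int]]:
    """Dynamic-programming table of LCS lengths for all prefixes of a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def merge_by_backtracking(lcsseq: List[str], s: List[str]) -> Optional[List[str]]:
    """Return the updated LCSseq, or None when the LCS is shorter than half of s."""
    table = lcs_table(lcsseq, s)
    i, j = len(lcsseq), len(s)
    if 2 * table[i][j] < len(s):
        return None  # caller would create a new LCSObject with LCSseq = s
    merged: List[str] = []

    def mark_gap() -> None:
        # Consecutive non-common words collapse into one placeholder.
        if not merged or merged[-1] != PLACEHOLDER:
            merged.append(PLACEHOLDER)

    while i > 0 and j > 0:
        if lcsseq[i - 1] == s[j - 1]:
            merged.append(lcsseq[i - 1])
            i, j = i - 1, j - 1
        else:
            mark_gap()
            if table[i - 1][j] >= table[i][j - 1]:
                i -= 1
            else:
                j -= 1
    if i > 0 or j > 0:
        mark_gap()
    return merged[::-1]
```

For example, merging the template `Temperature 41 exceeds warning threshold` with the log `Temperature 43 exceeds warning threshold` keeps the four common words and puts one placeholder where the numbers differ.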
In this mode of operation, Spell must compute the LCS between s and the LCSseq of every LCSObject, which is computationally expensive. To reduce the computation, Spell proposes two optimizations: 1. simple traversal; 2. a prefix tree. In practice, Spell first uses the prefix tree to obtain an LCSObject and its common word sequence; if the length of that common word sequence is less than half the length of s, it falls back to simple traversal to obtain an LCSObject and its common word sequence. If the common word sequence obtained in this way is still shorter than half the length of s, a new LCSObject is created. Otherwise, the LCS of s and that LCSObject's LCSseq is computed, and a new LCSseq is generated by backtracking to replace the LCSObject's LCSseq.
Finally, the input log belongs to the log template of the LCSObject that was just created or updated, and the character string of that log template can be generated from its LCSseq. The set of existing LCSObjects represents all log templates parsed so far.
The disadvantages of Spell are as follows: for logs with a high proportion of variables, or logs whose variables appear near the beginning, the extracted log templates have a high error rate, and the number of log templates tends to explode.
Therefore, there is a need to provide a new method for extracting a computer software log template online to overcome the defects of the prior art.
Disclosure of Invention
The invention aims to solve the problem in the prior art that log analysis cannot be automated because log analysis software cannot adapt online to a variety of log formats, and provides a method for extracting a computer software log template online.
The technical scheme of the invention is as follows:
A method for extracting a computer software log template online comprises the following steps:
S1, inputting a log and preprocessing the text of the input log; the preprocessing comprises: a. replacing special variables: replacing the character strings corresponding to special variables in the text of the input log with special variable placeholders; b. word segmentation: cutting the text of the input log to generate a word sequence; c. generating a word object sequence: converting each word in the word sequence into a word object; d. generating a skeleton sequence: filtering the word sequence and retaining only certain types of separators in it as a brief feature of the log;
S2, acquiring candidate log templates: acquiring candidate log templates from the existing log templates according to the preprocessing result; the skeleton sequences of the logs already processed form a skeleton sequence network, and the skeleton sequence network is updated according to the skeleton sequence of the input log; skeleton sequences similar to the skeleton sequence of the input log are searched for in the skeleton sequence network; candidate log templates are screened according to the association between skeleton sequences and log templates;
S3, obtaining a log template: acquiring a matching log template from the candidate log templates; if there is no candidate log template or no matching log template, generating a new log template;
S4, establishing an association between the skeleton sequence and the matching log template, and outputting results as required.
As a preferred technical solution, in step S1, the special variable includes a URL and a timestamp.
As a preferred technical solution, in step S1, when the text of the input log is cut, the types of words generated are, in descending order of priority: a special variable placeholder; a "-" immediately preceded by a letter or digit; a mixed string of English letters and digits beginning with an English letter; a numeric value containing a decimal point; a mixed string of English letters and digits beginning with a digit; consecutive whitespace; and others.
As a preferred technical solution, the attributes of the word object in step S1 include a word value, a word length, and a word style; the word value is the special variable placeholder or an original character string; the word length is the character string length of the word value; the word styles comprise special variable placeholders, common variable placeholders and original character strings.
As a further preferred technical solution, the common variable placeholders include a blank variable placeholder, a numeric variable placeholder, and a number variable placeholder; the blank variable placeholder represents consecutive whitespace; the numeric variable placeholder represents a numeric value with a decimal point; the number variable placeholder represents consecutive digits.
As a preferred technical solution, in step S2, the skeleton sequence network represents the distance relationships between skeleton sequences; each node in the skeleton sequence network corresponds to an existing skeleton sequence, and the edge between two connected nodes stores the distance and the similarity between the two skeleton sequences; the distance satisfies the triangle inequality, and the similarity is obtained from the distance by the following formula: $\mathrm{similarity}(a, b) = 1 - \mathrm{distance}(a, b) / \max(\mathrm{length}(a), \mathrm{length}(b))$.
As a preferred technical solution, the steps of updating the skeleton sequence network in step S2 are:
S21, defining the skeleton sequence of the new log as x;
S22, if the skeleton sequence network already contains x, ending; otherwise, proceeding to the next step;
S23, adding x to the skeleton sequence network;
S24, establishing a set V containing only x;
S25, if all skeleton sequences in the skeleton sequence network are already in V, ending; otherwise, taking out a skeleton sequence y that is not in V and proceeding to the next step;
S26, calculating the distance d_xy and the similarity s_xy between x and y;
S27, if there is a skeleton sequence that is connected to y and is not in V, taking out one such skeleton sequence z and proceeding to the next step; otherwise, jumping to step S213;
S28, obtaining the distance d_yz between y and z from the edge between y and z;
S29, calculating the minimum possible distance between x and z using the triangle inequality: d_xz_min = |d_xy - d_yz|;
S210, calculating from d_xz_min, according to the similarity formula, the maximum possible similarity s_xz_min between x and z;
S211, judging from d_xz_min and s_xz_min whether x and z differ enough; if so, adding z to V;
S212, jumping to step S27;
S213, connecting x and y, and storing the distance d_xy and the similarity s_xy on the edge;
S214, adding y to V;
S215, jumping to step S25.
As a preferred technical solution, the log template corresponds to a word template object sequence, and the attributes of each word template object are a word template value, a word template style and a word template length; the attributes of the log template further comprise a length and a minimum matching length; after the matching log template is obtained from the candidate log templates in step S3, the method further comprises updating the matching log template, which comprises: updating the word template value and the word template length of each word template object in the word template object sequence; the word template length is updated as follows: before the current log is processed, the total number of logs corresponding to the matching log template is N and the total length of the character strings is L, and in the matching result the word template object corresponds to the word objects [O_i, O_{i+1}, …, O_{i+k}] in the word object sequence of the input log; the updated word template length is then:
Figure BDA0002844728580000061
where len(O_i) is the word length of the word object O_i.
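The formula itself is an image that is not reproduced in this text. Because the description defines the word template length of a null-valued word template as an average string length over all corresponding logs, one plausible reading is a running average. The sketch below is an assumption on that basis, not the patent's formula: it treats L as the total string length accumulated at this word template's position over the N earlier logs, and the function name is hypothetical.

```python
from typing import Sequence


def updated_word_template_length(
    total_logs: int,                       # N: logs already assigned to the template
    total_length: float,                   # L: assumed total string length at this position
    matched_word_lengths: Sequence[int],   # len(O_i) ... len(O_{i+k}) for the current log
) -> float:
    """Assumed running average: add the current log's contribution, divide by N + 1."""
    return (total_length + sum(matched_word_lengths)) / (total_logs + 1)
```

Under this reading, a position that held 12 characters in total over 3 logs and now matches two word objects of lengths 2 and 3 would get an updated length of 17 / 4.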
As a preferred technical solution, in step S3, the method for generating a new log template comprises: first creating a log template from the current log, and then merging it with the candidate log templates in turn; when merging, the similarity between the word template object sequences of the two log templates is calculated, and the two log templates are merged if the similarity is high enough; the similarity formula for word template object sequences is as follows:
Figure BDA0002844728580000062
where M1 and M2 are word template object sequences, |M1| and |M2| are the lengths of M1 and M2, and LCS'(M1, M2) is the LCS' length between M1 and M2; the LCS' length is calculated as follows: S31, assuming that the lengths of the word template object sequence X and the word template object sequence Y are m and n respectively, generating a matrix M of size (m+1) × (n+1); the element in row i and column j of M represents the LCS' length between the subsequence formed by the first i objects of X and the subsequence formed by the first j objects of Y, i.e. LCS'(X_i, Y_j), calculated by the following formula:
Figure BDA0002844728580000063
where sim_word is the similarity between word template objects, given by:
$\mathrm{sim\_word}(x_i, y_j) = w(x_i, y_j) \cdot \mathrm{sim\_length}(|x_i|, |y_j|)$
where x_i and y_j are the i-th word template object of X and the j-th word template object of Y respectively; w(x_i, y_j) is a preset weight function that returns a weight in the range [0, 1] according to the combination of the word template values and word template styles of x_i and y_j; |x_i| and |y_j| are the word template lengths of x_i and y_j; and sim_length is the length similarity, given by the following formula:
Figure BDA0002844728580000071
S32, backtracking through the matrix M and reconstructing the word template value, the word template style and the word template length of each word template object to obtain the merged log template.
As a preferable technical solution, in step S4, the content to be output as required is the template information corresponding to the input log and/or the information of all templates.
The invention discloses a method for extracting computer software log templates online, which reads data as a log stream. After receiving a log, the method finds the log template corresponding to that log and, if necessary, creates a new log template or updates an existing one. The method comprises three main stages: preprocessing, obtaining candidate log templates, and obtaining the log template. First, the text of the input log is preprocessed, and candidate log templates are then searched for among the existing log templates according to the preprocessing result. The algorithm then picks a matching log template from the candidate log templates. If there is no candidate log template or no matching log template, a new log template is generated, and this new template may absorb some of the candidate log templates.
By constructing skeleton sequences and using the triangle inequality to find similar skeleton sequences quickly, the method narrows the search down to a small number of candidate log templates, which reduces the amount of computation and increases running speed. In addition, the method merges similar log templates while processing the log stream, which has the following advantages: it adapts dynamically to pattern changes in the log data; the number of log templates is unlikely to explode; and numbers and values in the log are not indiscriminately turned into variables, since a position becomes a variable placeholder only after its content has been observed to change.
Drawings
FIG. 1 is a flow chart of a method for extracting a computer software log template on line according to the present invention;
FIG. 2 is a flowchart of a network for updating a skeleton sequence in a method for extracting a computer software log template on line according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the embodiments of the invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and "a plurality" generally includes at least two but does not exclude the case of at least one, unless the context clearly dictates otherwise.
It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The word "if", as used herein, may be interpreted as "when" or "upon" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that an article or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such an article or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the article or system that includes the element.
Fig. 1 shows a method for extracting a computer software log template on line according to the present invention, which comprises the following steps:
S1, inputting a log and preprocessing the text of the input log; the preprocessing comprises: a. replacing special variables: replacing the character strings corresponding to special variables in the text of the input log with special variable placeholders; b. word segmentation: cutting the text of the input log to generate a word sequence; c. generating a word object sequence: converting each word in the word sequence into a word object; d. generating a skeleton sequence: filtering the word sequence and retaining only certain types of separators in it as a brief feature of the log;
S2, acquiring candidate log templates: acquiring candidate log templates from the existing log templates according to the preprocessing result; the skeleton sequences of the logs already processed form a skeleton sequence network, and the skeleton sequence network is updated according to the skeleton sequence of the input log; skeleton sequences similar to the skeleton sequence of the input log are searched for in the skeleton sequence network; candidate log templates are screened according to the association between skeleton sequences and log templates;
S3, obtaining a log template: acquiring a matching log template from the candidate log templates; if there is no candidate log template or no matching log template, generating a new log template;
S4, establishing an association between the skeleton sequence and the matching log template, and outputting results as required.
In the preprocessing:
a. Replacing special variables. The method defines several special variables; in this embodiment they are URLs and timestamps, and other special variables can be defined as needed in practice. The character strings corresponding to special variables have distinctive features: once one appears in a log, its variable type can be determined directly from its characters, without concern that some other string merely happens to take that form. This step replaces the strings in the log text that correspond to these variables with special variable placeholders.
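A minimal sketch of this replacement step, assuming regular expressions are used to recognise the special variables. The placeholder spellings `<URL>` and `<TIMESTAMP>` and both patterns are illustrative assumptions; the text does not specify them, and real logs use many other timestamp layouts.

```python
import re

# Hypothetical placeholders and patterns, chosen only for illustration.
SPECIAL_PATTERNS = [
    ("<URL>", re.compile(r"https?://[^\s\"'<>]+")),
    ("<TIMESTAMP>", re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?")),
]


def replace_special_variables(text: str) -> str:
    """Replace every substring recognised as a special variable with its placeholder."""
    for placeholder, pattern in SPECIAL_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Further special variables could be supported by appending entries to `SPECIAL_PATTERNS`, which mirrors the remark that other special variables can be defined as needed.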
b. Word segmentation. The method cuts the log text using a series of regular-expression rules to generate a word sequence. The types of words generated during cutting, in descending order of priority, are as follows:
1. a special variable placeholder;
2. a "-" immediately preceded by a letter or digit;
3. a mixed string of English letters and digits beginning with an English letter;
4. a numeric value containing a decimal point;
5. a mixed string of English letters and digits beginning with a digit;
6. consecutive whitespace;
7. others.
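The priority order above maps naturally onto an ordered regular-expression alternation, because Python tries alternatives from left to right at each position. The sketch below is one possible reading, with several assumptions: the placeholder syntax `<NAME>`, the group names, the interpretation of rule 2 as a lookbehind, and treating "others" as any single remaining character.

```python
import re
from typing import List, Tuple

TOKEN_RE = re.compile(
    r"(?P<special><[A-Z_]+>)"              # 1. special variable placeholder (assumed syntax)
    r"|(?P<hyphen>(?<=[A-Za-z0-9])-)"      # 2. "-" immediately preceded by a letter or digit
    r"|(?P<alpha>[A-Za-z][A-Za-z0-9]*)"    # 3. letters/digits beginning with a letter
    r"|(?P<decimal>\d+\.\d+)"              # 4. numeric value containing a decimal point
    r"|(?P<digit>\d[A-Za-z0-9]*)"          # 5. letters/digits beginning with a digit
    r"|(?P<blank>\s+)"                     # 6. consecutive whitespace
    r"|(?P<other>.)"                       # 7. anything else, one character at a time
)


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Cut text into (type, word) pairs following the priority order of the rules."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

With these rules, `job-7` is cut into `job`, `-`, `7`, and `3.5s` into `3.5` and `s`.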
c. Generating a word object sequence. Each word in the word sequence is converted into a word object. A word object has the following attributes:
1. Word value: a special variable placeholder or the original string (priority from high to low).
2. Word length: the string length of the word value.
3. Word style: a special variable placeholder, a common variable placeholder, or the original string (priority from high to low). The common variable placeholders are the blank variable placeholder, the numeric variable placeholder and the number variable placeholder: the blank variable placeholder represents consecutive whitespace; the numeric variable placeholder represents a numeric value with a decimal point; the number variable placeholder represents consecutive digits.
d. Generating a skeleton sequence. The skeleton sequence is obtained by filtering the word sequence, retaining only certain types of separators in it as a brief feature of the log format.
After preprocessing, the log has the following attributes:
1. Word object sequence
2. Skeleton sequence
3. Length: the length of the word object sequence.
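The attributes produced by preprocessing can be sketched as a small data structure. Everything named here is an assumption for illustration: the placeholder spellings, the class and function names, and in particular `SKELETON_SEPARATORS`, since the text only says that "certain types of separators" are kept.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

SKELETON_SEPARATORS = set("[](){}=:,;|")  # hypothetical choice of retained separators


@dataclass(frozen=True)
class WordObject:
    value: str   # special variable placeholder or original string
    length: int  # string length of the word value
    style: str   # special placeholder, common placeholder, or original string


def word_style(word: str) -> str:
    """Pick the style with the highest applicable priority."""
    if re.fullmatch(r"<[A-Z_]+>", word):
        return word                       # special variable placeholder
    if word.isspace():
        return "<BLANK>"                  # consecutive whitespace
    if re.fullmatch(r"\d+\.\d+", word):
        return "<NUMERIC>"                # value with a decimal point
    if re.fullmatch(r"\d+", word):
        return "<NUMBER>"                 # consecutive digits
    return word                           # original string


def preprocess(words: List[str]) -> Tuple[List[WordObject], str]:
    """Build the word object sequence and the skeleton sequence from a word sequence."""
    objects = [WordObject(w, len(w), word_style(w)) for w in words]
    skeleton = "".join(w for w in words if w in SKELETON_SEPARATORS)
    return objects, skeleton
```

Note that a number keeps its original text as its value while its style is a common variable placeholder, so the literal is still available if the position later turns out to be constant.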
Next, candidate log templates are obtained. In theory, every known log template is a candidate for the current log. However, matching the log against every log template one by one would incur a large computational overhead. To reduce the computation, a batch of skeleton sequences similar to the current skeleton sequence is first found among all skeleton sequences seen so far, and the log templates associated with those similar skeleton sequences are then selected from all log templates as the candidate log templates.
The distance relationships between skeleton sequences are maintained in the skeleton sequence network. Each node of the network corresponds to a known skeleton sequence, and the edge between two connected nodes stores the distance and the similarity between the two skeleton sequences. The distance metric is the Levenshtein distance, which satisfies the triangle inequality. The similarity is obtained from the calculated distance by the following formula: $\mathrm{similarity}(a, b) = 1 - \mathrm{distance}(a, b) / \max(\mathrm{length}(a), \mathrm{length}(b))$.
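The distance and similarity can be computed as below. This is a standard row-by-row Levenshtein implementation; representing a skeleton sequence as a string and returning 1.0 for two empty sequences are assumptions, since the text does not cover either point.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """similarity(a, b) = 1 - distance(a, b) / max(length(a), length(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty skeleton sequences are identical
    return 1 - levenshtein(a, b) / longest
```

For the skeletons `[]=:` and `[]=`, the distance is 1 and the similarity is 1 - 1/4 = 0.75.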
As shown in fig. 2, each time a log is input, the network updates according to the skeleton sequence of the log. The update process is as follows:
S21, defining the skeleton sequence of the new log as x;
S22, if the skeleton sequence network already contains x, ending; otherwise, proceeding to the next step;
S23, adding x to the skeleton sequence network;
S24, establishing a set V containing only x;
S25, if all skeleton sequences in the skeleton sequence network are already in V, ending; otherwise, taking out a skeleton sequence y that is not in V and proceeding to the next step;
S26, calculating the distance d_xy and the similarity s_xy between x and y;
S27, if there is a skeleton sequence that is connected to y and is not in V, taking out one such skeleton sequence z and proceeding to the next step; otherwise, jumping to step S213;
S28, obtaining the distance d_yz between y and z from the edge between y and z;
S29, calculating the minimum possible distance between x and z using the triangle inequality: d_xz_min = |d_xy - d_yz|;
S210, calculating from d_xz_min, according to the similarity formula, the maximum possible similarity s_xz_min between x and z;
S211, judging from d_xz_min and s_xz_min whether x and z differ enough; if so, adding z to V;
S212, jumping to step S27;
S213, connecting x and y, and storing the distance d_xy and the similarity s_xy on the edge;
S214, adding y to V;
S215, jumping to step S25.
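Steps S21 to S215 can be sketched as a single function over an adjacency dictionary. The representation (`Network`), the function names, and especially the pruning test in S211 with its thresholds `max_distance` and `min_similarity` are assumptions: the text only says that x and z must "differ enough" without giving a criterion.

```python
from typing import Dict, Tuple

Edge = Tuple[int, float]               # (distance, similarity) stored on an edge
Network = Dict[str, Dict[str, Edge]]   # skeleton -> {connected skeleton: edge}


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity_from(distance: int, a: str, b: str) -> float:
    # a and b are always distinct here, so max(len) is never zero.
    return 1 - distance / max(len(a), len(b))


def update_network(network: Network, x: str,
                   max_distance: int = 2, min_similarity: float = 0.5) -> None:
    if x in network:                                     # S22
        return
    network[x] = {}                                      # S23
    visited = {x}                                        # S24: the set V
    for y in list(network):                              # S25
        if y in visited:
            continue
        d_xy = levenshtein(x, y)                         # S26
        s_xy = similarity_from(d_xy, x, y)
        for z, (d_yz, _) in network[y].items():          # S27, S28
            if z in visited:
                continue
            d_xz_min = abs(d_xy - d_yz)                  # S29: triangle-inequality bound
            s_xz_min = similarity_from(d_xz_min, x, z)   # S210: best possible similarity
            if d_xz_min > max_distance and s_xz_min < min_similarity:  # S211 (assumed test)
                visited.add(z)                           # z is skipped without computing d_xz
        network[x][y] = network[y][x] = (d_xy, s_xy)     # S213
        visited.add(y)                                   # S214
```

The saving comes from S211: whenever the lower bound already shows that z is far from x, the exact Levenshtein distance between x and z is never computed and no edge is created.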
With such a skeleton sequence network maintained, the skeleton sequences similar to a given skeleton sequence are obtained by fetching all skeleton sequences connected to it in the network and then discarding those that are not similar enough, according to the distance and similarity stored on each edge.
After the skeleton sequence network is updated, the log templates are screened and candidate log templates are selected. The invention stores the association between skeleton sequences and log templates: once the log template of a log is determined, that log template is associated with the skeleton sequence of the log. Because this association is stored, all associated log templates can be found from a skeleton sequence, and all associated skeleton sequences can be found from a log template.
After the similar skeleton sequences are obtained, the log templates associated with them are gathered into a list and sorted in descending order of the sequence number of the last log each template contains, so that the most recently active log templates come first.
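The candidate selection described in these paragraphs might look like the following sketch. The data layout is assumed: `network` maps a skeleton to its edges, `templates_by_skeleton` holds the stored association, `last_seen` gives the sequence number of the last log of each template, and `min_similarity` is a hypothetical threshold. Including the query skeleton's own templates is also an assumption.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, float]  # (distance, similarity)


def candidate_templates(
    skeleton: str,
    network: Dict[str, Dict[str, Edge]],
    templates_by_skeleton: Dict[str, Set[str]],
    last_seen: Dict[str, int],
    min_similarity: float = 0.5,
) -> List[str]:
    """Collect templates of similar skeletons, most recently active first."""
    similar = [skeleton]  # assumption: a skeleton counts as similar to itself
    for neighbour, (_, sim) in network.get(skeleton, {}).items():
        if sim >= min_similarity:          # drop neighbours that are not similar enough
            similar.append(neighbour)
    found: Set[str] = set()
    for sk in similar:
        found.update(templates_by_skeleton.get(sk, ()))
    return sorted(found, key=lambda t: last_seen[t], reverse=True)
```

Sorting by recency means that a template that matched the previous log is tried first, which is often the right one in bursty log streams.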
Similar to the log, each log template has a sequence of word template objects, each word template object having the following attributes:
1. Word template value: a special variable placeholder, the original string, or a null object (priority from high to low). A null object indicates that the word is one of the three common variables or, more generally, a type-free variable. A type-free variable can stand for any string, including the empty string.
2. Word template style: a special variable placeholder, a common variable placeholder, the original string, or a type-free variable placeholder (priority from high to low).
3. Word template length: the string length of the word template value. If the word template value is a null object, the word template length is the average string length, over all logs corresponding to the log template, of the text at this word template's position.
In addition to the word template object sequence, the log template has the following properties:
1. Length: the length of the word template object sequence.
2. Minimum matching length: the length of the word template object sequence after removing the word template objects whose word template style is the type-free variable placeholder.
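A possible in-memory shape for a log template and its derived attributes is sketched below. The class names, the use of `None` for the null object, and the spelling `<*>` for the type-free variable placeholder are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

TYPELESS = "<*>"  # hypothetical spelling of the type-free variable placeholder


@dataclass
class WordTemplate:
    value: Optional[str]  # placeholder, original string, or None for the null object
    style: str            # special/common placeholder, original string, or TYPELESS
    length: float         # string length, or an average when value is None


@dataclass
class LogTemplate:
    words: List[WordTemplate] = field(default_factory=list)

    @property
    def length(self) -> int:
        """Length of the word template object sequence."""
        return len(self.words)

    @property
    def min_matching_length(self) -> int:
        """Length after removing type-free variable placeholders."""
        return sum(1 for w in self.words if w.style != TYPELESS)
```

A template such as `user`, `<BLANK>`, `<*>` would then have length 3 and minimum matching length 2.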
After the candidate log templates are obtained, the matching stage comes first: an attempt is made to obtain a matching log template from the candidates. If no matching log template is found, the fusion stage follows: a new log template is created and an attempt is made to merge it with the candidate log templates. Finally, the association between the log template and the skeleton sequence is stored.
In the matching stage, the candidate log templates are matched against the current log one by one. At the level of word objects and word template objects, a word object and a word template object with the same style can be matched to each other, and a word template object whose style is the type-free variable placeholder can match any number of arbitrary word objects. Matching must determine not only whether a match is possible but also how it is made, i.e. how each word template object in the word template object sequence corresponds to one or more word objects in the word object sequence.
The matching process follows this logic:
1. If the length of the log is smaller than the minimum matching length of the log template, no further calculation is needed and the match fails immediately.
2. Matching starts from the first positions of the word object sequence and the word template object sequence and compares word objects with word template objects one by one. If the word style of the word object equals the word template style of the word template object, a correspondence is established; otherwise the match fails.
3. When a word template object whose word template style is the type-free variable placeholder is encountered, its position is skipped, and the word object sequence is searched forward for a word object whose word style equals the word template style of the word template object now at the current position. If such a word object is found, a correspondence is established and added to the checkpoint list; if not, the match fails.
4. When matching fails in step 2 or step 3 and the checkpoint list is not empty, the most recent checkpoint is removed from the list (without replacement) and matching returns to it. The word object recorded at that checkpoint is skipped, and the search continues forward in the word object sequence for another word object whose word style equals the word template style of the word template object at the current position. If one is found, a correspondence is established and added to the checkpoint list; if not, the match fails.
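The four rules above amount to wildcard matching with backtracking. The following is a minimal sketch under simplifying assumptions: styles are plain strings, `"*"` stands for the type-free variable placeholder, and recursion takes the place of the explicit checkpoint list. The function name `match_styles` and the range-based return value are hypothetical.

```python
from typing import List, Optional, Tuple

WILDCARD = "*"  # stands in for the type-free variable placeholder style


def match_styles(template: List[str], words: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Return, for each template position, the half-open range of word
    positions it covers, or None if the styles cannot be aligned."""
    min_len = sum(1 for s in template if s != WILDCARD)
    if len(words) < min_len:          # rule 1: early rejection
        return None

    def solve(t: int, w: int) -> Optional[List[Tuple[int, int]]]:
        if t == len(template):
            return [] if w == len(words) else None
        if template[t] != WILDCARD:   # rule 2: styles must agree one to one
            if w < len(words) and words[w] == template[t]:
                rest = solve(t + 1, w + 1)
                return None if rest is None else [(w, w + 1)] + rest
            return None
        # rules 3 and 4: let the wildcard absorb words[w:end]; every `end`
        # tried acts as a checkpoint, and a later failure resumes with the next
        for end in range(w, len(words) + 1):
            rest = solve(t + 1, end)
            if rest is not None:
                return [(w, end)] + rest
        return None

    return solve(0, 0)
```

With `template = ["RAW", "*", "NUM"]` and `words = ["RAW", "RAW", "NUM", "NUM"]`, the first attempt to bind the final `NUM` to the third word fails because one word is left over; the matcher then backtracks so that the wildcard covers the second and third words.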
After several matching log templates have been obtained from the candidate log templates, the one with the largest minimum matching length is selected as the optimal matching log template. The rationale is that when a log conforms to the styles of several log templates, the log template carrying the most word information should be chosen.
After the optimal matching log template has been selected, the word template value and the word template length of each of its word template objects are updated using the matching mode between the log template and the current log. The update of these two attributes is explained below:
Updating the word template value: when the word template style of a word template object is a common variable placeholder, its word template value is either a raw string or a null object. A raw string means the log template has not yet observed any variation at this position: every log belonging to the log template has had that same raw string there. A null object means variation has been observed at this position, so it is genuinely a common variable. Because the search for the optimal matching log template compares only word styles with word template styles and does not check word values against word template values, the word template value of a word template object may be one raw string while the word value of the corresponding word object in the log is a different raw string. In that case, the word template value of that word template object must be updated to a null object.
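The value update reduces to a three-way decision. A minimal sketch follows, with `None` modelling the null object; the function name `updated_value` is hypothetical.

```python
from typing import Optional


def updated_value(template_value: Optional[str], word_value: str) -> Optional[str]:
    """Return the new word template value after seeing one more log."""
    if template_value is None:
        return None                 # already a confirmed variable position
    if template_value == word_value:
        return template_value       # still constant across all logs seen so far
    return None                     # two different raw strings: the position varies
```

Once a position has become `None` it never reverts to a raw string, which matches the idea that a position turns into a variable only after a change has actually been observed.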
Updating the word template length: the word template length of a word template object is obtained by recording the total length of the raw strings corresponding to the word template object and dividing it by the total number of logs corresponding to the log template. Suppose that, before the current log is processed, the total number of logs corresponding to the matching log template is N and the total string length is L, and that in the matching result the word template object corresponds to the word objects [O_i, O_{i+1}, …, O_{i+k}] in the word object sequence of the input log. The updated word template length is then:
$$\frac{L + \sum_{j=i}^{i+k} \mathrm{len}(O_j)}{N + 1}$$
where len(O_j) is the word length of the word object O_j.
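One reading of this update, assuming the new length is the accumulated string length (old total plus the lengths of the newly matched words) divided by the new log count N + 1, can be sketched as follows. The function name and the tuple it returns are illustrative assumptions.

```python
from typing import Sequence, Tuple


def update_length(total_len: int, n_logs: int,
                  matched_words: Sequence[str]) -> Tuple[int, int, float]:
    """Return (new total length L, new log count N, new word template length)."""
    new_total = total_len + sum(len(w) for w in matched_words)
    new_count = n_logs + 1
    return new_total, new_count, new_total / new_count
```

Keeping the running total and count, rather than only the average, lets the average be updated incrementally without revisiting earlier logs.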
If no matching log template can be found, a new log template must be generated. This process is called merging. In the merging stage, a log template is first created from the current log and is then merged with the candidate log templates in turn. During merging, the similarity between the word template object sequences of the two log templates is calculated, and the two log templates are merged if the similarity is high enough. The similarity formula for word template object sequences is:
[Formula image BDA0002844728580000181: similarity of M1 and M2, defined from LCS'(M1, M2), |M1| and |M2|]
where M1 and M2 are word template object sequences, |M1| and |M2| are the lengths of M1 and M2, and LCS'(M1, M2) is the LCS' length between M1 and M2.
The LCS' algorithm designed by the invention is an adaptation of the existing LCS (longest common subsequence) algorithm. The LCS algorithm finds the longest common subsequence in two steps: first it constructs a matrix M, and then it backtracks through M to obtain the longest common subsequence.
First step of LCS:
Assume that sequences X and Y have lengths m and n respectively; the size of M is then (m+1) × (n+1). The entry in row i and column j of M is the LCS length of the subsequence formed by the first i objects of X and the subsequence formed by the first j objects of Y, i.e. LCS(X_i, Y_j), calculated by the following formula:
$$M[i][j] = \begin{cases} 0, & i = 0 \text{ or } j = 0 \\ M[i-1][j-1] + 1, & x_i = y_j \\ \max(M[i-1][j],\ M[i][j-1]), & x_i \neq y_j \end{cases}$$
In this calculation, objects from the two sequences are compared only for string equality; if they are equal, the LCS length is incremented by one.
The second step of LCS:
Backtracking through the matrix M generates the LCS.
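The two steps of the classic LCS algorithm, table construction followed by backtracking, can be sketched as below. This is the textbook dynamic-programming formulation rather than code from the patent, and the example log lines in the note are made up.

```python
from typing import List, Sequence


def lcs(x: Sequence[str], y: Sequence[str]) -> List[str]:
    m, n = len(x), len(y)
    # Step 1: build the (m+1) x (n+1) table of LCS lengths.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    # Step 2: backtrack from the bottom-right corner to recover one LCS.
    out: List[str] = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif M[i - 1][j] >= M[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]
```

For the word sequences of "open file a.txt failed" and "open file b.txt failed now", the result is `["open", "file", "failed"]`: the differing file name drops out, which is how an LCS-based parser derives a template from strings alone.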
The first step of the improved LCS' of the invention:
The matrix M constructed by LCS' has the same size as that of LCS. The entry in row i and column j of M is the LCS' length of the subsequence of the first i objects of X and the subsequence of the first j objects of Y, i.e. LCS'(X_i, Y_j), calculated by the following formula:
[Formula image BDA0002844728580000191: recurrence for LCS'(X_i, Y_j), defined using sim_word]
where sim_word is the similarity between word template objects, given by:
$$\mathrm{sim}_{word}(x_i, y_j) = w(x_i, y_j) \cdot \mathrm{sim}_{length}(|x_i|, |y_j|)$$
where x_i and y_j are the i-th word template object of X and the j-th word template object of Y respectively; w(x_i, y_j) is a preset weight function that returns a weight in the range [0, 1] according to the combination of the word template values and word template styles of x_i and y_j; |x_i| and |y_j| are the word template lengths of x_i and y_j; and sim_length is the length similarity, given by:
[Formula image BDA0002844728580000192: definition of sim_length(|x_i|, |y_j|)]
In this calculation, elements of the two sequences are compared not only by string but also by style, and different comparison combinations receive different weights. In addition, the weight is corrected by the similarity of the two lengths.
Comparing the M matrices of LCS and LCS':
The values in the M matrix of LCS are integers, whereas the values in the M matrix of LCS' are floating-point numbers.
The values in the M matrix of LCS only consider whether the strings are identical, while the values in M of LCS' consider string, style and length information collectively.
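The first step of LCS' can be sketched as a weighted variant of the LCS table. Several parts are assumptions, because the corresponding formula images are not available in this text: the recurrence takes the maximum of the three neighbouring cells with sim_word added on the diagonal, sim_length is taken as the ratio of the shorter to the longer length, and the weight table in `weight` is invented for illustration (the source only states that w returns a value in [0, 1]). The tuple layout of `Word` is likewise hypothetical.

```python
from typing import List, Optional, Tuple

Word = Tuple[Optional[str], str, float]  # (value, style, length); hypothetical layout


def weight(a: Word, b: Word) -> float:
    """Assumed weight table; the source only says w returns a value in [0, 1]."""
    if a[1] != b[1]:
        return 0.0                      # different styles
    if a[0] is not None and a[0] == b[0]:
        return 1.0                      # same style and same raw string
    return 0.5                          # same style, different or null values


def sim_length(la: float, lb: float) -> float:
    """Assumed form: ratio of the shorter to the longer length."""
    longest = max(la, lb)
    return 1.0 if longest == 0 else min(la, lb) / longest


def sim_word(a: Word, b: Word) -> float:
    return weight(a, b) * sim_length(a[2], b[2])


def lcs_prime_length(x: List[Word], y: List[Word]) -> float:
    m, n = len(x), len(y)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]  # floating-point table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j],
                          M[i][j - 1],
                          M[i - 1][j - 1] + sim_word(x[i - 1], y[j - 1]))
    return M[m][n]
```

Under these assumptions, two sequences that share styles but differ in most values still accumulate a non-zero score, whereas a string-only LCS would score them near zero. That is the robustness property the text attributes to LCS'.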
Second step of LCS' of the invention:
Backtracking is performed through the matrix M; while the LCS' is generated, the word template value, word template style and word template length of each word template object are reconstructed, yielding the merged log template.
In practice, the content output on demand in step S4 is the log template information corresponding to the input log and/or the information of all log templates.
When searching for a log template (LCSObject), the existing Spell method reduces the number of LCS computations by using a prefix tree and simple traversal as optimizations, but in doing so it gives up the dynamic-matching advantage of LCS, which lowers matching accuracy. Moreover, Spell generates log templates with LCS and considers only the strings themselves, so when a log has a high proportion of parameters its LCS value is very low and the log template cannot be identified correctly. Logs belonging to the same log template are then easily assigned to different log templates, and the number of log templates explodes.
In contrast, the method of the invention for extracting computer software log templates online uses the improved LCS' algorithm, which jointly considers string, style and length information during matching and is therefore more robust when identifying log templates with a high proportion of parameters.
The method also changes the optimization strategy: by constructing skeleton sequences and applying the triangle inequality, it quickly finds similar skeleton sequences and thereby narrows the search to a small number of candidate log templates, which reduces the amount of computation and increases speed.
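The triangle-inequality pruning can be illustrated with a simplified version of the network update recited in steps S21 to S215 of claim 7. The sketch assumes Levenshtein edit distance as the metric (the text only requires that the distance satisfy the triangle inequality), a single similarity threshold `min_similarity` as the "large enough difference" test, and the similarity formula of claim 6. The class and attribute names are hypothetical.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a metric that satisfies the triangle inequality."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(dist: int, a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - dist / longest


class SkeletonNetwork:
    def __init__(self, min_similarity: float) -> None:
        self.min_similarity = min_similarity          # assumed pruning threshold
        self.edges: Dict[str, Dict[str, int]] = {}    # node -> {neighbour: distance}
        self.computed = 0                             # exact distances computed

    def add(self, x: str) -> None:
        if x in self.edges:                           # S22: already known
            return
        others = list(self.edges)
        self.edges[x] = {}                            # S23
        visited = {x}                                 # S24
        for y in others:                              # S25
            if y in visited:
                continue
            d_xy = edit_distance(x, y)                # S26
            self.computed += 1
            for z, d_yz in self.edges[y].items():     # S27, S28
                if z in visited:
                    continue
                d_xz_min = abs(d_xy - d_yz)           # S29: lower bound on d(x, z)
                s_xz_max = similarity(d_xz_min, x, z) # S210: upper bound on similarity
                if s_xz_max < self.min_similarity:    # S211: z cannot be similar to x
                    visited.add(z)
            self.edges[x][y] = d_xy                   # S213
            self.edges[y][x] = d_xy
            visited.add(y)                            # S214
```

With a threshold of 0.5, adding `"::"`, then `"::="`, then `"<<<<<<<<<<"` requires only two exact distance computations instead of three: once the third sequence is found to be far from `"::"`, the stored distance between `"::"` and `"::="` proves that it must also be far from `"::="`.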
Finally, the method can merge similar log templates while processing the log stream, which brings three advantages:
1. It adapts dynamically to pattern changes in the log data.
2. The number of log templates is unlikely to explode.
3. Numbers and values in the log are not indiscriminately turned into variables; a position becomes a variable placeholder only after its value has been observed to change.
The invention discloses a method for extracting computer software log templates online, which reads data as a log stream. After receiving a log, the method finds the log template corresponding to the log and, if necessary, creates a new log template or updates an existing one. The method comprises three main stages: preprocessing, obtaining candidate log templates, and obtaining the log template. First, the text of the input log is preprocessed, and candidate log templates are then searched for among the existing log templates according to the preprocessing result. Next, a matching log template is picked from the candidate log templates. If there is no candidate log template or no matching log template, a new log template is generated, which may absorb some of the candidate log templates.
By constructing skeleton sequences and using the triangle inequality to find similar skeleton sequences quickly, the method narrows the search to a small number of candidate log templates, which reduces the amount of computation and increases speed. In addition, the method merges similar log templates while processing the log stream, with the following advantages: it adapts dynamically to pattern changes in the log data; the number of log templates is unlikely to explode; and numbers and values in the log are not indiscriminately turned into variables, since a position becomes a variable placeholder only after its value has been observed to change.
In summary, the embodiments of the present invention are merely exemplary and should not be construed as limiting the scope of the invention. All equivalent changes and modifications made according to the content of the claims of the present invention should fall within the technical scope of the present invention.
The invention discloses a method for extracting computer software log templates online, which reads data as a log stream. After receiving a log, the method finds the log template corresponding to the log and, if necessary, creates a new log template or updates an existing one. The method comprises three main stages: preprocessing, obtaining candidate log templates, and obtaining the log template. First, the text of the input log is preprocessed, and candidate log templates are then searched for among the existing log templates according to the preprocessing result. Next, a matching log template is picked from the candidate log templates. If there is no candidate log template or no matching log template, a new log template is generated, which may absorb some of the candidate log templates. The method offers high robustness, high speed and good dynamic adaptability.

Claims (10)

1. A method for extracting a computer software log template online, characterized by comprising the following steps:
S1, inputting a log and preprocessing the text of the input log; the preprocessing comprises: a. replacing special variables: replacing the character strings corresponding to special variables in the text of the input log with special variable placeholders; b. word segmentation: cutting the text of the input log to generate a word sequence; c. generating a word object sequence: converting each word in the word sequence into a word object; d. generating a skeleton sequence: filtering the word sequence and retaining separators of certain types, in their original order, as a brief feature of the log;
S2, obtaining candidate log templates: obtaining candidate log templates from the existing log templates according to the preprocessing result; the skeleton sequences of processed logs form a skeleton sequence network, and the skeleton sequence network is updated according to the skeleton sequence of the input log; searching the skeleton sequence network for skeleton sequences similar to the skeleton sequence of the input log; screening candidate log templates according to the association between skeleton sequences and log templates;
S3, obtaining a log template: obtaining a matching log template from the candidate log templates; if no candidate log template or no matching log template exists, generating a new log template;
S4, establishing an association between the skeleton sequence and the matching log template, and producing output as required.
2. The method for on-line extracting the computer software log template as claimed in claim 1, wherein: the special variables in step S1 include URL and timestamp.
3. The method for on-line extracting the computer software log template as claimed in claim 1, wherein: in step S1, when the text of the input log is cut, the types of the generated words, in descending order of priority, are: a special variable placeholder; a "-" immediately preceded by a letter or digit; a mixed string of English letters and digits beginning with an English letter; a numeric value containing a decimal point; a mixed string of English letters and digits beginning with a digit; a run of consecutive whitespace; and others.
4. The method for on-line extracting the computer software log template as claimed in claim 1, wherein: in step S1, the attributes of the word object include word value, word length and word style; the word value is the special variable placeholder or an original character string; the word length is the character string length of the word value; the word styles comprise special variable placeholders, common variable placeholders and original character strings.
5. The method for on-line extracting the computer software log template as claimed in claim 4, wherein: the common variable placeholders include blank variable placeholders, numeric-value variable placeholders and digit variable placeholders; a blank variable placeholder represents consecutive whitespace; a numeric-value variable placeholder represents a numeric value with a decimal point; a digit variable placeholder represents consecutive digits.
6. The method for on-line extracting the computer software log template as claimed in claim 1, wherein: the skeleton sequence network in step S2 represents the distance relationships between skeleton sequences; each point in the skeleton sequence network corresponds to an existing skeleton sequence, and the distance and the similarity between two skeleton sequences are stored on the edge between the two connected points; the distance satisfies the triangle inequality, and the similarity is obtained from the distance by the following formula: similarity(a, b) = 1 - distance(a, b) / max(length(a), length(b)).
7. The method for on-line extracting the computer software log template as claimed in claim 1, wherein: the step of updating the skeleton sequence network in step S2 is:
S21, defining the skeleton sequence of the new log as x;
S22, if the skeleton sequence network contains x, ending; otherwise, entering the next step;
S23, adding x to the skeleton sequence network;
S24, establishing a set V containing only x;
S25, if all the skeleton sequences in the skeleton sequence network are already in V, ending; otherwise, taking out a skeleton sequence y not in V and entering the next step;
S26, calculating the distance d_xy and the similarity S_xy between x and y;
S27, if there exists a skeleton sequence that is connected to y and is not in V, taking out one such skeleton sequence z and entering the next step; otherwise, jumping to step S213;
S28, obtaining the distance d_yz between y and z from the edge between y and z;
S29, calculating the minimum distance between x and z using the triangle inequality: d_xz_min = |d_xy - d_yz|;
S210, calculating the maximum similarity S_xz_min between x and z according to the similarity formula;
S211, judging from d_xz_min and S_xz_min whether the difference between x and z is large enough, and if so, adding z to V;
S212, jumping to step S27;
S213, connecting x and y, and storing the distance d_xy and the similarity S_xy on the edge;
S214, adding y to V;
S215, jumping to step S25.
8. The method for on-line extracting the computer software log template as claimed in claim 1, wherein: the log template corresponds to a word template object sequence, and the attributes of a word template object are a word template value, a word template style and a word template length; the attributes of the log template further comprise a length and a minimum matching length; after the matching log template is obtained from the candidate log templates in step S3, the method further includes updating the matching log template, which includes: updating the word template value and the word template length of each word template object in the word template object sequence; the word template length is updated as follows: before the current log is processed, the total number of logs corresponding to the matching log template is N and the total string length is L, and in the matching result the word template object corresponds to the word objects [O_i, O_{i+1}, …, O_{i+k}] in the word object sequence of the input log, so that the updated word template length is:
$$\frac{L + \sum_{j=i}^{i+k} \mathrm{len}(O_j)}{N + 1}$$
where len(O_j) is the word length of the word object O_j.
9. The method for on-line extracting the computer software log template as claimed in claim 1, wherein: in step S3, the method for generating a new log template includes: firstly, creating a log template according to the current log, and then combining the log template with the candidate log template in sequence; calculating the similarity between the word template object sequences of the two log templates during merging, and merging the two log templates if the similarity is high; the similarity formula of the word template object sequence is as follows:
[Formula image FDA0002844728570000042: similarity of M1 and M2, defined from LCS'(M1, M2), |M1| and |M2|]
where M1 and M2 are word template object sequences, |M1| and |M2| are the lengths of M1 and M2, and LCS'(M1, M2) is the LCS' length between M1 and M2; the LCS' length is calculated as follows:
S31, assuming that the lengths of the word template object sequence X and the word template object sequence Y are m and n respectively, generating a matrix M of size (m+1) × (n+1); the entry in row i and column j of the matrix M represents the LCS' length between the subsequence of the first i objects of X and the subsequence of the first j objects of Y, i.e. LCS'(X_i, Y_j), calculated by the following formula:
[Formula image FDA0002844728570000051: recurrence for LCS'(X_i, Y_j), defined using sim_word]
where sim_word is the similarity between word template objects, given by:
$$\mathrm{sim}_{word}(x_i, y_j) = w(x_i, y_j) \cdot \mathrm{sim}_{length}(|x_i|, |y_j|)$$
where x_i and y_j are the i-th word template object of X and the j-th word template object of Y respectively; w(x_i, y_j) is a preset weight function that returns a weight in the range [0, 1] according to the combination of the word template values and word template styles of x_i and y_j; |x_i| and |y_j| are the word template lengths of x_i and y_j; and sim_length is the length similarity, given by:
[Formula image FDA0002844728570000052: definition of sim_length(|x_i|, |y_j|)]
S32, backtracking through the matrix M and reconstructing the word template value, the word template style and the word template length of each word template object, to obtain the merged log template.
10. The method for on-line extracting the computer software log template as claimed in claim 1, wherein: in step S4, the content to be output as needed is log template information corresponding to the input log and/or information of all log templates.
CN202011505125.1A 2020-12-18 2020-12-18 Method for extracting computer software log template on line Pending CN112560407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011505125.1A CN112560407A (en) 2020-12-18 2020-12-18 Method for extracting computer software log template on line

Publications (1)

Publication Number Publication Date
CN112560407A true CN112560407A (en) 2021-03-26

Family

ID=75063594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011505125.1A Pending CN112560407A (en) 2020-12-18 2020-12-18 Method for extracting computer software log template on line

Country Status (1)

Country Link
CN (1) CN112560407A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553309A (en) * 2021-07-28 2021-10-26 恒安嘉新(北京)科技股份公司 Log template determination method and device, electronic equipment and storage medium
CN113590421A (en) * 2021-07-27 2021-11-02 招商银行股份有限公司 Log template extraction method, program product, and storage medium
CN115545122A (en) * 2022-11-28 2022-12-30 中国银联股份有限公司 Object matching method, device, equipment, system, medium and program product
CN115544975A (en) * 2022-12-05 2022-12-30 济南丽阳神州智能科技有限公司 Log format conversion method and device
CN115630626A (en) * 2022-11-17 2023-01-20 国网湖北省电力有限公司信息通信公司 Online extraction method for log template of data center equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590421A (en) * 2021-07-27 2021-11-02 招商银行股份有限公司 Log template extraction method, program product, and storage medium
CN113590421B (en) * 2021-07-27 2024-04-26 招商银行股份有限公司 Log template extraction method, program product and storage medium
CN113553309A (en) * 2021-07-28 2021-10-26 恒安嘉新(北京)科技股份公司 Log template determination method and device, electronic equipment and storage medium
CN115630626A (en) * 2022-11-17 2023-01-20 国网湖北省电力有限公司信息通信公司 Online extraction method for log template of data center equipment
CN115630626B (en) * 2022-11-17 2023-02-28 国网湖北省电力有限公司信息通信公司 Online extraction method for log template of data center equipment
CN115545122A (en) * 2022-11-28 2022-12-30 中国银联股份有限公司 Object matching method, device, equipment, system, medium and program product
CN115544975A (en) * 2022-12-05 2022-12-30 济南丽阳神州智能科技有限公司 Log format conversion method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination