CN116050380A - Log parsing method and device

Log parsing method and device

Info

Publication number
CN116050380A
CN116050380A (application CN202211685297.0A)
Authority
CN
China
Prior art keywords
log
training
preset
word
vector
Prior art date
Legal status
Pending
Application number
CN202211685297.0A
Other languages
Chinese (zh)
Inventor
韦屹
李喆
潘剑
陈智斌
农英雄
黄聪
杨振宇
陆瑛
Current Assignee
China Tobacco Guangxi Industrial Co Ltd
Original Assignee
China Tobacco Guangxi Industrial Co Ltd
Priority date
Filing date
Publication date
Application filed by China Tobacco Guangxi Industrial Co Ltd
Priority to CN202211685297.0A
Publication of CN116050380A
Legal status: Pending

Abstract

The application provides a log parsing method and device, wherein the method includes: acquiring a log to be parsed; processing the log to be parsed according to a first preset method to obtain a segmented log; inputting the segmented log into a pre-trained model to obtain an intermediate score for each word in the log to be parsed; calculating a target score for each word according to the segmented log and the intermediate scores; and obtaining a target template of the log to be parsed according to the log to be parsed, the target scores, the words, and a preset algorithm, wherein the preset algorithm is used to generate the target template. The method and device solve the problems of low accuracy and poor generalization capability in the related art.

Description

Log parsing method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a log parsing method and apparatus.
Background
In the field of computer technology, detecting anomalies in software or systems through logs is a common security protection measure. From simple small programs to large complex software systems, as well as distributed file systems and high-performance cloud computing management platforms, vulnerabilities inevitably exist and can cause the system to run abnormally. When a software system fails, engineers need to inspect the system operation logs to diagnose and mitigate the failure in time. However, software systems typically generate a huge volume of logs, and analyzing them manually is time-consuming and impractical. Therefore, the prior art performs automated log analysis. Log parsing is the key first step of automated log analysis: it parses semi-structured logs into structured log templates for subsequent analysis.
In the traditional log parsing method, a developer manually writes regular expressions according to the source code to generate log templates. However, on the one hand, the source code of most software systems is not available today; on the other hand, the large number of log templates makes manually writing regular expressions very time-consuming. To overcome these shortcomings, the prior art uses heuristics, clustering, frequent-item mining, neural networks, and other techniques to build log parsers. However, existing log parsers are still unsatisfactory in accuracy and generalization capability.
Therefore, the prior art has the problems of low accuracy and poor generalization capability.
Disclosure of Invention
The application provides a log parsing method and a log parsing apparatus, which at least solve the problems of low accuracy and poor generalization capability in the related art.
According to one aspect of the embodiments of the present application, there is provided a log parsing method, including:
acquiring a log to be parsed;
processing the log to be parsed according to a first preset method to obtain a segmented log;
inputting the segmented log into a pre-trained model to obtain an intermediate score for each word in the log to be parsed;
calculating a target score for each of the words according to the segmented log and the intermediate scores;
and obtaining a target template of the log to be parsed according to the log to be parsed, the target scores, the words, and a preset algorithm, wherein the preset algorithm is used to generate the target template.
According to another aspect of the embodiments of the present application, there is also provided a log parsing apparatus, including:
a first acquisition module, configured to acquire a log to be parsed;
a first processing module, configured to process the log to be parsed according to a first preset method to obtain a segmented log;
a first input module, configured to input the segmented log into a pre-trained model to obtain an intermediate score for each word in the log to be parsed;
a calculation module, configured to calculate a target score for each word according to the segmented log and the intermediate scores;
and a first obtaining module, configured to obtain a target template of the log to be parsed according to the log to be parsed, the target scores, the words, and a preset algorithm, wherein the preset algorithm is used to generate the target template.
According to yet another aspect of the embodiments of the present application, there is also provided an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete communication with each other through the communication bus; wherein the memory is used for storing a computer program; a processor for performing the method steps of any of the embodiments described above by running the computer program stored on the memory.
According to a further aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the method steps of any of the embodiments described above when run.
In the embodiment of the application, a log to be parsed is acquired; the log to be parsed is processed according to a first preset method to obtain a segmented log; the segmented log is input into a pre-trained model to obtain an intermediate score for each word in the log to be parsed; a target score is calculated for each word according to the segmented log and the intermediate scores; and a target template of the log to be parsed is obtained according to the log to be parsed, the target scores, the words, and a preset algorithm, wherein the preset algorithm is used to generate the target template. With this method, the log to be parsed is processed by the first preset method to obtain the segmented log, the intermediate scores of the segmented log are obtained by the pre-trained model, the target score of each word is calculated from the intermediate scores, and finally the target template of the log to be parsed is generated by the preset algorithm according to the target scores. This improves the precision and accuracy of log parsing, achieves high universality and robustness, and solves the problems of low accuracy and poor generalization capability in the related art.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow diagram of an alternative log parsing method according to an embodiment of the present application;
FIG. 2 is an overall framework diagram of an alternative self-supervised log parsing method using semantic contribution differences, according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the self-attention model of a self-supervised log parsing method using semantic contribution differences, according to an embodiment of the present application;
FIG. 4 is an example diagram of the self-attention mechanism of a self-supervised log parsing method using semantic contribution differences, according to an embodiment of the present application;
FIG. 5 is a flow diagram of the TESC algorithm of a self-supervised log parsing method using semantic contribution differences, according to an embodiment of the present application;
FIG. 6 is a diagram of a first-round grouping scenario comparing an alternative self-supervised log parsing method using semantic contribution differences with an existing log parser, according to an embodiment of the present application;
FIG. 7 is a block diagram of an alternative log parsing apparatus according to an embodiment of the present application;
FIG. 8 is a block diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the solution of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
At present, almost all existing log parsers have poor generalization capability and are only suitable for specific systems, and some parsers trained on only part of the data perform poorly and cannot handle unseen words; these problems may lead to incorrect log parsing results. Considering that logs are presented in a semi-structured natural-language form, the present application treats log parsing as a natural language processing task and extracts the template according to the difference in the contribution that constant words and variable words in a log make to the log's semantics.
Based on the foregoing, according to an aspect of the embodiments of the present application, there is provided a log parsing method, as shown in fig. 1, a flow of the method may include the following steps:
step S101, obtaining a log to be analyzed.
Optionally, the method includes an offline stage and an online stage, wherein the offline stage is used for training and obtaining a pre-training model, and the online stage is used for extracting a target template of a log to be analyzed based on the pre-training model. In the online stage, firstly, a log to be analyzed which needs to be analyzed is obtained, for example: packetR response provider 1for block blk 38865049064139660terminating.
Step S102, processing the log to be parsed according to a first preset method to obtain a segmented log.
Optionally, the log to be parsed is processed using the first preset method: it is first divided into a plurality of words, for example: [PacketResponder], [for], etc., and these words are then represented by sub-characters, finally obtaining the segmented log. The sub-characters are used here so that the log to be parsed can subsequently be encoded into numeric vectors.
Step S103, inputting the segmented log into a pre-trained model to obtain the intermediate score of each word in the log to be parsed.
Optionally, the segmented log is input into the pre-trained model, which generates a query vector, a key vector, and a value vector for each word in the segmented log and calculates an intermediate score for each word, e.g., an attention score, from the query, key, and value vectors using a softmax function.
Step S104, calculating the target score of each word according to the segmented log and the intermediate scores.
Optionally, the target score of a word is defined as the sum of the intermediate scores contributed by all words, for example: if the target score is set as the semantic contribution score and the intermediate score as the attention score, then the semantic contribution score of a word is the sum of the attention scores of all words. Thus, the target score of each word is obtained from the segmented log by summing the corresponding intermediate scores.
Step S105, obtaining a target template of the log to be parsed according to the log to be parsed, the target scores, the words, and a preset algorithm, wherein the preset algorithm is used to generate the target template.
Optionally, k words are selected according to the target score (for example, the semantic contribution score) of each word, where k is set as required. The logs to be parsed are grouped according to the selected words, and the template of the log to be parsed, i.e., the target template, is extracted by the preset algorithm, for example the template extraction algorithm TESC.
In the embodiment of the application, the log to be parsed is processed by the first preset method to obtain the segmented log; the intermediate scores are obtained from the segmented log by the pre-trained model; the target score of each word is calculated from the intermediate scores; and the target template of the log to be parsed is finally generated by the preset algorithm according to the target scores. This improves the precision and accuracy of log parsing, achieves high universality and robustness, and solves the problems of low accuracy and poor generalization capability in the related art.
As an alternative embodiment, before inputting the segmented log into the pre-trained model, the method further includes:
acquiring a training log;
processing the training log according to a second preset method to obtain a segmented training log and a replaced segmented training log;
inputting the replaced segmented training log into an initial model to obtain a prediction result;
obtaining an objective function according to the prediction result, the segmented training log, and a first preset formula;
and adjusting the initial model until the value of the objective function stays within a preset range, to obtain the pre-trained model.
Alternatively, this embodiment is the offline stage, which uses the masked language model (Masked Language Model, MLM) approach to pre-train an initial model (e.g., a self-attention based initial model) so that the model can predict masked words. It should be noted that the term "training" is only used to distinguish the offline stage from the online stage, so a training log is not essentially different from a log, and a segmented training log is not essentially different from a segmented log; other terms containing "training" have the same meaning and are not repeated here.
First, training logs for training the initial model are acquired; there are a plurality of them, for example N training logs. Then, each training log is split by the second preset method into a segmented training log represented by sub-characters, and some of the sub-characters in the segmented training log are randomly replaced and masked to obtain the replaced segmented training log. The replaced segmented training log is then input into the initial model, which predicts the replaced sub-characters to obtain a prediction result P, where P is a probability distribution over the whole vocabulary.
When the loss is calculated, only the randomly masked or replaced characters are taken into account, and the rest of the output is discarded. The cross-entropy loss is adopted as the objective function, and the objective function $Loss_{mlm}$ is obtained according to the prediction result, the segmented training log, and the first preset formula (1):

$$Loss_{mlm}=-\frac{1}{N}\sum_{j=1}^{N}\sum_{i=1}^{M} y_i^{j}\log \hat{y}_i^{j} \qquad (1)$$

where N is the total number of logs in a batch of input logs, M is the total number of randomly masked characters in the j-th log, $y_i^{j}$ is the true character of the i-th masked character in the j-th log and is obtained from the segmented training log, and $\hat{y}_i^{j}$ is the prediction in P for the i-th masked character of the j-th log.
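For illustration only, the following is a minimal PyTorch sketch of the cross-entropy loss of formula (1); it is not part of the claimed embodiments. It assumes predictions of shape (N, L, vocab_size) and labels in which every position that was not masked or replaced is set to -100, so that only the masked characters contribute to the loss and the rest of the output is discarded:

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over masked positions only, per formula (1).

    logits: (N, L, vocab_size) raw scores for a batch of N logs of length L.
    labels: (N, L) true sub-character ids; positions that were NOT masked
            or replaced are set to -100 and are therefore ignored.
    """
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten batch and sequence dims
        labels.view(-1),
        ignore_index=-100,                 # only masked/replaced chars count
    )
```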
The initial model is adjusted, and when the value of the objective function stabilizes and no longer drops, training is complete and the self-attention based model, i.e., the pre-trained model, is obtained. In other words, when the value of the objective function stays within a small preset range it has stabilized and no longer drops; the preset range can be set as required.
Another training method is as follows: set a threshold for the objective function; if the value of the objective function falls below the threshold after the initial model is adjusted, stop training to obtain the pre-trained model.
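A minimal sketch of this alternative training loop, reusing the mlm_loss sketch above; the model, data loader, optimizer, and threshold value are all illustrative assumptions:

```python
def pretrain(model, loader, optimizer, threshold: float = 0.1):
    """Stop training once Loss_mlm falls below the preset threshold
    (the alternative criterion above); the threshold value is illustrative."""
    model.train()
    for ids, labels in loader:               # ids: replaced logs, labels: true ids
        loss = mlm_loss(model(ids), labels)  # mlm_loss sketched above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < threshold:
            break
```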
In the embodiment of the application, the log is first split into sub-characters, and the self-attention based model is then pre-trained using the masked language model (MLM) method, so that the pre-trained model can predict masked words. This solves the technical problems that existing log parsers have poor generalization capability and cannot handle unseen words.
As an alternative embodiment, processing the training log according to the second preset method to obtain the segmented training log and the replaced segmented training log includes:
dividing the training log according to preset delimiters to obtain a first preset number of training words;
tokenizing the training words to obtain training sub-characters, wherein the training sub-characters are used to represent the training words;
obtaining the segmented training log according to the training sub-characters;
and replacing a preset proportion of the training sub-characters in the segmented training log using a preset strategy to obtain the replaced segmented training log.
Alternatively, this embodiment is described with reference to FIG. 2. As shown in FIG. 2: offline training, input: an original log; preprocessing obtains the content of the original log: PacketResponder 1 for block blk38865049064139660 terminating (i.e., the training log); for ease of illustration, the offline and online stages of FIG. 2 are illustrated with this log. According to the preset delimiters [' ', ',', '!', '?', '='], the training log is divided into a first preset number of training words: [PacketResponder], [1], [for], [block], [blk38865049064139660], [terminating], where the first preset number depends on the log; for this log the first preset number is 6. The words are tokenized using WordPiece to obtain the training sub-characters: Packet, ##res, ##pon, ##der, 1, for, block, b, ##lk, ##38, ##86, ..., terminating. The training sub-characters together make up the segmented training log: [Packet, ##res, ##pon, ##der], [1], [for], [block], [b, ##lk, ##38, ##86, ...], [terminating]. It should be noted that most existing log parsers directly use special symbols (e.g., comma, semicolon, dash) as separators, which easily leads to vocabulary explosion. With the above method, almost all unique words can be represented using approximately 30,000 sub-characters, avoiding vocabulary explosion.
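For illustration, a sketch of the splitting and WordPiece tokenization described above; the BertTokenizer vocabulary (about 30,000 sub-characters) is an assumed stand-in for whatever WordPiece vocabulary the method actually uses, and the delimiter set follows the reconstruction above:

```python
import re
from transformers import BertTokenizer  # assumed stand-in WordPiece vocabulary

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def split_log(log: str) -> list[list[str]]:
    # Split at the preset delimiters (space, comma, '!', '?', '='),
    # then break each word into WordPiece sub-characters.
    words = [w for w in re.split(r"[ ,!?=]+", log) if w]
    return [tokenizer.tokenize(w) for w in words]

print(split_log("PacketResponder 1 for block blk38865049064139660 terminating"))
# e.g. [['packet', '##res', '##pon', '##der'], ['1'], ['for'], ['block'],
#       ['b', '##lk', '##38', ...], ['terminating']]
```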
Because the offline stage trains the initial model with the self-supervised MLM method, this embodiment uses a preset strategy to randomly replace a preset proportion, e.g., 20%, of the sub-characters in each segmented training log before the initial model is pre-trained. The preset strategy can be set as required, for example: select the 20% of sub-characters to be replaced; replace 80% of them with '[MASK]', replace 10% with an arbitrary other character, and leave 10% unchanged, thereby obtaining the replaced segmented training log. These steps are shown in FIG. 2; random masking yields: [Packet, ##res, ##pon, ##der], [1], [for], [MASK], [b, ##lk, ##38, ##86, ...], [terminating].
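A minimal sketch of the described replacement strategy (20% of sub-characters selected; of those, 80% become '[MASK]', 10% become a random other character, and 10% stay unchanged); all names are illustrative:

```python
import random

def mask_subchars(subchars: list[str], vocab: list[str],
                  select_ratio: float = 0.20) -> tuple[list[str], list[int]]:
    """Return the replaced sequence and the indices that were selected."""
    out = list(subchars)
    picked = random.sample(range(len(subchars)),
                           max(1, int(len(subchars) * select_ratio)))
    for i in picked:
        r = random.random()
        if r < 0.8:                      # 80%: hide with [MASK]
            out[i] = "[MASK]"
        elif r < 0.9:                    # 10%: replace with a random character
            out[i] = random.choice(vocab)
        # else 10%: keep the original sub-character unchanged
    return out, picked
```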
In the embodiment of the application, words are represented by sub-characters, which avoids vocabulary explosion and solves the problem that vocabulary explosion easily occurs in the prior art. In addition, representing words, and thus logs, by sub-characters provides the basis for subsequently encoding the logs.
As an alternative embodiment, inputting the replaced segmented training log into the initial model to obtain the prediction result includes:
acquiring preset conditions;
encoding the training sub-characters in the replaced segmented training log based on the preset conditions to obtain training character embedding vectors;
obtaining training position embedding vectors according to the replaced segmented training log and the training sub-characters;
combining the training character embedding vectors and the training position embedding vectors to obtain training embedding vectors;
generating a training query vector, a training key vector, and a training value vector for each training word according to the training embedding vectors and preset matrices;
obtaining intermediate output data according to the training query vector, the training key vector, the training value vector, a preset parameter, and a second preset formula;
obtaining a residual value according to the training embedding vectors, the intermediate output data, a preset weight, and a preset residual connection layer function;
and obtaining the prediction result according to the residual value and a third preset formula.
Alternatively, since the present application has only one downstream task (i.e., log parsing), only the semantic contribution differences between constant and variable words in the log need to be enlarged.
In order to enlarge the semantic contribution difference between variable words and constant words while encoding words with their context features, the following preset conditions are set: 1. abandon the multi-head attention mechanism of the BERT model and use only one attention head to enlarge the difference; 2. use only one encoder layer (self-attention); 3. use a different residual connection setting.
Based on the preset conditions, the replaced segmented training log is taken as input, and character embedding encodes each training sub-character into a vector (i.e., the training character embedding vector). The character embedding (embedding layer) is a linear dimension conversion layer (vocab_size × d_model), where d_model is the internal network dimension of the model. Further, position embedding is added to the character embedding to better exploit the log sequence, which is determined by the replaced segmented training log and the training sub-characters. Thus, the training embedding vector is:
training embedding vector = training character embedding vector + training position embedding vector (2)
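For illustration, a PyTorch sketch of the embedding layer of formula (2); learned position embeddings and the maximum length of 512 are assumptions, not claimed details:

```python
import torch
import torch.nn as nn

class LogEmbedding(nn.Module):
    """Formula (2): embedding = character embedding + position embedding."""
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.char = nn.Embedding(vocab_size, d_model)  # vocab_size x d_model
        self.pos = nn.Embedding(max_len, d_model)      # learned positions (assumption)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:  # ids: (N, L)
        positions = torch.arange(ids.size(1), device=ids.device)
        return self.char(ids) + self.pos(positions)
```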
The training query vector Q, training key vector K, and training value vector V are generated from the training embedding vector and the preset matrices $W_Q$, $W_K$, $W_V$. Intermediate output data Attention(Q, K, V) is then obtained from Q, K, V, the preset parameter $\sqrt{d_k}$, and the second preset formula (3), where (Q, K, V) denotes that Q, K, V are the self-attention input and $d_k$ is a hyper-parameter, the dimension of the vectors:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)\cdot V \qquad (3)$$

where $\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$ is the attention score matrix, T denotes the matrix transpose, and "$\cdot$" denotes the matrix dot product.
The residual connection layer is designed as formula (4); the residual value y is obtained from the training embedding vector, the intermediate output data, and the preset residual connection layer function, i.e., formula (4), where the intermediate output data Attention(Q, K, V) is written attention(x):
y = Layernorm(weight * x + attention(x)) (4)
where x is the original input embedding, attention(x) is the output with x as the self-attention input (i.e., the intermediate output data), Layernorm() is a normalization function, and weight is used to enhance the influence of the contextual features in pre-training; weight is uniformly set to 0.01.
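The single-attention-head, single-encoder-layer design with the residual connection of formula (4) could be sketched as follows in PyTorch; the module layout is an assumption, while the single head, formula (3), formula (4), and the fixed weight of 0.01 follow the description above:

```python
import math
import torch
import torch.nn as nn

class SingleHeadEncoder(nn.Module):
    """One self-attention head plus the residual layer y = Layernorm(w*x + attn(x))."""
    def __init__(self, d_model: int, weight: float = 0.01):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)  # W_Q
        self.wk = nn.Linear(d_model, d_model, bias=False)  # W_K
        self.wv = nn.Linear(d_model, d_model, bias=False)  # W_V
        self.norm = nn.LayerNorm(d_model)
        self.weight = weight  # damps x so contextual features dominate

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, L, d_model)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))  # formula (3)
        attn = torch.softmax(scores, dim=-1) @ v
        return self.norm(self.weight * x + attn)           # formula (4)
```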
After the residual connection, a classifier obtains the prediction result P from the residual value y and the third preset formula, i.e., formula (5):

$$P=\mathrm{softmax}(W\cdot y+b) \qquad (5)$$

where W and b are both trainable parameters, softmax is the activation function, and the prediction result P is a probability distribution over the whole vocabulary.

Part of the above process is shown in FIG. 2: the replaced segmented training log [Packet, ##res, ##pon, ##der], [1], [for], [MASK], [b, ##lk, ##38, ##86, ...], [terminating] is input, and the embedding layer of the self-attention based model encodes it into training embedding vectors: [1.1, 0.2, ...], [0.3, ...], [1.2, ...], [0.23, ...], [0.24, ...], [2.1, ...]. In the attention block, the query vector Q, training key vector K, and training value vector V are generated from the training embedding vectors and the preset matrices $W_Q$, $W_K$, $W_V$, which in turn generate the intermediate output data: [1.1, 0.2, ...], [0.3, ...], [1.2, ...], [0.23, ...], [0.24, ...], [2.1, ...]. Finally, the residual connection layer normalizes the prediction of the masked characters from the training embedding vectors and the intermediate output data, finally yielding the prediction result P.

In addition, part of the above process is also shown in FIG. 3: the replaced segmented training log is taken as input; character embedding and position embedding are applied to it and combined to obtain the training embedding vector, and the attention mechanism is started. The query vector Q, training key vector K, and training value vector V are generated from the training embedding vector and the preset matrices $W_Q$, $W_K$, $W_V$, and the intermediate output data are generated by the self-attention head. The training embedding vector, the weight (i.e., the preset weight), and the intermediate output data are input into the residual connection layer; the classifier obtains the prediction result P from the output of the residual connection layer (i.e., the residual value y), and the MLM loss (i.e., $Loss_{mlm}$) is then calculated.
In the embodiment of the application, the sub-characters are encoded into character embedding vectors, and position embedding is introduced to obtain the embedding vectors. The output of the self-attention head is calculated from the query, key, and value vectors, and the final prediction result is obtained through the residual connection layer. With this method the pre-trained model can predict unseen words, and thus masked words, which solves the technical problems that existing log parsers have poor generalization capability and cannot handle unseen words.
As an optional embodiment, processing the log to be parsed according to the first preset method to obtain the segmented log includes:
dividing the log to be parsed according to preset delimiters to obtain a second preset number of words;
tokenizing the words to obtain sub-characters, wherein the sub-characters are used to represent the words;
and obtaining the segmented log according to the sub-characters.
Alternatively, this embodiment is described with reference to FIG. 2. As shown in FIG. 2: online stage, input: an original log; preprocessing obtains the content of the original log: PacketResponder 1 for block blk38865049064139660 terminating (i.e., the log to be parsed). According to the preset delimiters [' ', ',', '!', '?', '='], the log to be parsed is divided into a second preset number of words: [PacketResponder], [1], [for], [block], [blk38865049064139660], [terminating], where the second preset number depends on the log to be parsed; for this log the second preset number is 6. The words are tokenized using WordPiece to obtain the sub-characters: Packet, ##res, ##pon, ##der, 1, for, block, b, ##lk, ##38, ##86, ..., terminating. These sub-characters together make up the segmented log: [Packet, ##res, ##pon, ##der], [1], [for], [block], [b, ##lk, ##38, ##86, ...], [terminating].
In the embodiment of the application, words are represented by sub-characters, which avoids vocabulary explosion and solves the problem that vocabulary explosion easily occurs in the prior art. In addition, representing words, and thus logs, by sub-characters provides the basis for subsequently encoding the logs.
As an alternative embodiment, inputting the segmented log into the pre-trained model to obtain the intermediate score of each word in the log to be parsed includes:
encoding the sub-characters to obtain character embedding vectors;
obtaining position embedding vectors according to the segmented log and the sub-characters;
combining the character embedding vectors and the position embedding vectors to obtain embedding vectors;
generating a query vector, a key vector, and a value vector for each word according to the embedding vectors and preset matrices;
and obtaining the intermediate score according to the query vector, the key vector, the value vector, a preset parameter, and a fourth preset formula.
Optionally, with the segmented log as input, character embedding encodes each sub-character into a vector (i.e., the character embedding vector), where the character embedding (embedding layer) is a linear dimension conversion layer (vocab_size × d_model) and d_model is the internal network dimension of the model. In addition, position embedding (i.e., the position embedding vector) is added to the character embedding to better exploit the log sequence; the position embedding vector is generated according to the position of each sub-character in the segmented log. The character embedding vector and the position embedding vector are combined to obtain the embedding vector.
The query vector $q_i$, key vector $k_i$, and value vector $v_i$ of each word are generated from the embedding vector and the preset matrices $W_Q$, $W_K$, $W_V$ (with the same meaning as the foregoing $W_Q$, $W_K$, $W_V$), as shown in formula (6):

$$q_i=x_iW_Q,\quad k_i=x_iW_K,\quad v_i=x_iW_V \qquad (6)$$

where $W_Q$, $W_K$, and $W_V$ are three trainable matrices and $x_i$ is the embedding vector of the i-th word.
This embodiment, and in particular the last step, is described with reference to FIG. 4. A log to be parsed containing two words $W_1$, $W_2$ is taken as input, and the embedding vectors $x_1$ and $x_2$ of the two words are obtained by the above method. A query vector $q_1$, $q_2$, a key vector $k_1$, $k_2$, and a value vector $v_1$, $v_2$ are then generated for each word according to formula (6).

When encoding the first word $W_1$, the dot products of $q_1$ with all keys (i.e., $k_1$, $k_2$) are calculated: $q_1\cdot k_1=146$, $q_1\cdot k_2=54$. Each is divided by $\sqrt{d_k}$, and the weights are calculated using a softmax function, obtaining 0.73 and 0.27. The weights are applied to $(v_1, v_2)$ and a weighted sum is taken to generate an output vector (i.e., softmax × Values, also the intermediate output data of the foregoing) that integrates the context information. These weights are the attention scores corresponding to the word $W_1$; the steps for calculating the attention scores of the second word $W_2$ are the same and are not repeated here. In practice, the attention function is computed for a group of matrices at the same time: the query vectors are packed into a matrix Q, the key vectors and value vectors are packed into matrices K and V respectively, and the attention score matrix is then calculated to obtain the intermediate scores, as shown in formula (7):

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)\cdot V \qquad (7)$$

where $\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$ is the attention score matrix, T denotes the matrix transpose, and "$\cdot$" denotes the matrix dot product.

It should be noted that $d_k$ is a hyper-parameter, the dimension of the vectors; dividing by $\sqrt{d_k}$ achieves a scaling effect, avoiding large values that would cause the subsequent softmax to assign too much of the distribution to the corresponding token and make parameter updating difficult.
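In the online stage only the attention score matrix itself is needed as the intermediate scores; a sketch under the same assumptions as the encoder sketch above:

```python
import math
import torch

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Return softmax(QK^T / sqrt(d_k)), the attention score matrix of
    formula (7); entry [m][n] is the score produced for key n while
    encoding query m."""
    d_k = q.size(-1)
    return torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
```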
Part of the above process is shown in FIG. 2: using the attention block, the query vector Q, key vector K, and value vector V are generated from the embedding vectors [1.1, 0.2, ...], [0.3, ...], [1.2, ...], [0.23, ...], [0.24, ...], [2.1, ...] and the preset matrices $W_Q$, $W_K$, $W_V$, which in turn generate the intermediate output data: [1.1, 0.2, ...], [0.3, ...], [1.2, ...], [0.23, ...], [0.24, ...], [2.1, ...]; the attention score matrix is generated in this process.

In addition, part of the above process is also shown in FIG. 3: the query vector Q, key vector K, and value vector V are generated from the embedding vector and the preset matrices $W_Q$, $W_K$, $W_V$; the intermediate output data are generated by the self-attention head, and the attention scores are generated in this process.
In the embodiment of the application, the sub-characters are first encoded into character embedding vectors, and position embedding is introduced to obtain the embedding vectors. The output of the self-attention head is then calculated from the query, key, and value vectors to obtain the attention score of each word, which provides the basis for calculating the semantic contribution score of each word from the attention scores of all words.
As an alternative embodiment, calculating the target score of each word according to the segmented log and the intermediate scores includes:
acquiring the number of sub-characters contained in the segmented log as a first number;
acquiring the number of sub-characters contained in each word as a second number;
and obtaining the target score according to a first sequence number of the query vector, a second sequence number of the key vector, the first number, the second number, the intermediate scores, and a fifth preset formula, wherein the first sequence number and the second sequence number are used to determine the intermediate score.
Alternatively, the semantic contribution score (i.e., the target score) of a word is defined as the sum of the attention scores it receives from all words. The semantic contribution score of word $W_i$ is calculated as shown in formula (8), i.e., the fifth preset formula:

$$\mathrm{Score}(W_i)=\sum_{m=0}^{len(\mathrm{WordPiece}(l))}\ \sum_{n:\ subtoken_n\in \mathrm{WordPiece}(W_i)} score[m][n] \qquad (8)$$

where l is the log to which the word $W_i$ belongs (i.e., the segmented log), $W_i$ is the word with position index i in l, $len(\mathrm{WordPiece}(l))$ is the number of sub-characters of l (i.e., the first number), $len(\mathrm{WordPiece}(W_i))$ is the number of sub-characters of $W_i$ (i.e., the second number), and $score[m][n]$ is the attention score generated for a sub-character when it is encoded (i.e., the dot product of $q_m$ and $k_n$), for example: $score[1][1]$ is the attention score corresponding to $q_1\cdot k_1$. Here m is the sequence number of the query vector (i.e., the first sequence number), n is the sequence number of the key vector (i.e., the second sequence number), and $\mathrm{WordPiece}(l)$ denotes splitting l into sub-characters, as shown in formula (9); similarly, $\mathrm{WordPiece}(W_i)$ denotes splitting $W_i$ into sub-characters:

$$\mathrm{WordPiece}(l)=\{\,subtoken_q \mid q\in[0,\,len(\mathrm{WordPiece}(l))]\cap Z\,\} \qquad (9)$$

where subtoken denotes a sub-character obtained by subword decomposition and Z is the set of integers.
The above is shown in FIG. 2: the semantic contribution scores are calculated from the attention score matrix. It is also shown in FIG. 3: after the attention scores are generated, the semantic contribution scores are calculated from them.
In the embodiment of the application, the semantic contribution score of each word is calculated from the attention scores of all words, which provides the basis for grouping the logs step by step using the k words with the highest semantic contribution scores and generating the template of the log to be parsed. This solves the problems of low accuracy and poor generalization capability in the prior art.
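A sketch of formula (8): the semantic contribution score of a word sums, over all query positions, the attention scores of the key positions occupied by that word's sub-characters. The word_spans argument, mapping each word to its sub-character index range, is an assumed helper:

```python
import torch

def semantic_contribution(scores: torch.Tensor,
                          word_spans: list[tuple[int, int]]) -> list[float]:
    """scores: (L, L) attention score matrix over L sub-characters.
    word_spans: [(start, end), ...] sub-character index range of each word.
    Returns one semantic contribution score per word, per formula (8)."""
    return [scores[:, s:e].sum().item() for s, e in word_spans]
```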
As an optional embodiment, obtaining the target template of the log to be parsed according to the log to be parsed, the target scores, the words, and the preset algorithm includes:
sorting the words according to a preset order and the target scores to obtain a word sequence;
selecting the top third-preset number of words in the word sequence as target words;
matching the target words with preset log groups to obtain a matching result, wherein a preset log group includes a template, the matching result includes candidate log groups, and a candidate log group includes a candidate template;
if no candidate template exists in the matching result, creating a new log group and using the log to be parsed as a new template of the new log group, the new template being the target template;
if no candidate template containing all target words exists in the matching result, creating a new log group and using the log to be parsed as a new template of the new log group, the new template being the target template;
if exactly one candidate template containing all target words exists in the matching result, adding the log to be parsed to the candidate log group to obtain an updated log group, and updating the candidate template according to a third preset method and the updated log group to obtain the target template;
and if more than one candidate template containing all target words exists in the matching result, re-selecting the target words and matching them with the candidate templates until the number of candidate templates is zero or one; when the number of candidate templates is zero, performing the operation of creating a new log group, using the log to be parsed as a new template of the new log group, the new template being the target template; and when the number of candidate templates is one, adding the log to be parsed to the candidate log group to obtain an updated log group, and performing the operation of updating the candidate template according to the third preset method and the updated log group to obtain the target template.
Alternatively, since the number of constant words differs in each type of log template, it cannot be known in advance. Moreover, although the semantic contribution score of a constant word is higher than that of a variable word, the order in which constant words appear in the log may be uncertain. This embodiment therefore groups the logs step by step using the k words with the highest semantic contribution scores instead of the words at the first k positions as in the traditional approach.
In a preset algorithm, such as the template extraction algorithm TESC, the log to be parsed and the semantic contribution score (i.e., the target score) of each word in it are input, and the log template to which the log to be parsed belongs is output.
First, the words in the log to be parsed are sorted by their semantic contribution scores in a preset order (from high to low). The top third-preset number of words are extracted as target words (for example, the word with the highest semantic contribution score, the word with the second highest, and so on); the third preset number is not limited to a specific value. The target words are compared with the templates of the existing log groups (i.e., the preset log groups), and the number of different words in the same column of each log group and the input log is analyzed to obtain the matching result.
If the number of successfully matched log groups in the matching result is 0, a new log group is created with the input log to be parsed as its representative, and the log to be parsed is used as the template of the new log group.
If exactly one log group matches, the log to be parsed is assigned to that group, and the group's log template is updated to obtain the target template.
If the number of matched groups is greater than 1, the target words are re-selected as follows: sort the words in descending order of semantic contribution score, delete the word with the highest score from the word sequence to obtain a new word sequence, and recursively call TESC with the new word sequence and the first-round matched groups as input; re-select the target words from the new word sequence and match them against the first-round matched groups (i.e., the candidate templates) until the number of candidate templates is zero or one. When the number of candidate templates is zero, a new log group is created with the input log to be parsed as its representative and the log to be parsed as its template; when the number of candidate templates is one, the log to be parsed is assigned to that group and the group's log template is updated to obtain the target template.
This embodiment is described with reference to FIG. 5. As shown in FIG. 5, a threshold of 2 indicates that when the count of a word in the same column is less than 2, the word at that position will be identified as a variable word and the position will be replaced with "<*>" in the template.
As can be seen from the direction of the log stream: when the first log input to TESC is Invalid user support from 103.207.39.165, no template exists yet and the number of successfully matched log groups is 0, so a new log group1 is created with this log as its representative, and the log is used as the template of log group1: Invalid user support from 103.207.39.165, which is also the target template of this log. The second log input to TESC is Invalid user test from 52.80.34.196; the word with the highest semantic contribution in this log is "test", and no template containing this word exists in the matching result, so a new log group2 is created with this log as its representative, and the log is used as the template of log group2: Invalid user test from 52.80.34.196, which is also its target template. The third log input to TESC is Invalid user inspur from 175.102.13.6; the word with the highest semantic contribution is "user", and two different templates (i.e., the templates of log group1 and log group2) exist in the matching result, so the word with the highest score is deleted from the initial word sequence, the target word "inspur" (i.e., the word with the second highest semantic contribution score in the initial word sequence) is re-selected according to the new word sequence and re-matched against the log groups successfully matched in the previous round (i.e., log group1 and log group2). Since the templates of log group1 and log group2 do not contain "inspur" (i.e., the number of candidate templates is zero), a new log group3 is created with this log as its representative, and the log is used as the template of log group3: Invalid user inspur from 175.102.13.6, which is also its target template. The fourth log input to TESC is Invalid user support from 195.154.37.122; the word with the highest semantic contribution is "support", and only one template in the matching result contains this word (i.e., the template of log group1), so the log to be parsed is added to log group1 (i.e., the candidate log group) to obtain an updated log group; the variable words in the template are selected according to the third preset method and the updated log group, and the candidate template is updated according to the variable words to obtain the target template: Invalid user support from <*>.
In the embodiment of the application, the template extraction algorithm TESC is adopted: the k words with the highest semantic contribution scores are used instead of the words at the first k positions as in the traditional method, and the logs are grouped step by step, which greatly improves the accuracy of template generation.
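The TESC grouping loop described above could be sketched as follows; the LogGroup class, the simplified matching rule (a candidate's template must contain all target words), and the tie-break fallback are assumptions rather than the patent's exact implementation, and update_template is sketched after the FIG. 6 discussion below:

```python
class LogGroup:
    def __init__(self, log_words: list[str]):
        self.logs = [log_words]
        self.template = list(log_words)  # a new group's template is the log itself

def tesc(log_words: list[str], scores: list[float],
         groups: list[LogGroup], k: int = 1) -> LogGroup:
    # words ranked by semantic contribution score, highest first
    ranked = [w for w, _ in sorted(zip(log_words, scores), key=lambda p: -p[1])]
    cands = groups
    while ranked:
        targets = ranked[:k]
        cands = [g for g in cands if all(t in g.template for t in targets)]
        if not cands:                    # no match: open a new log group
            group = LogGroup(log_words)
            groups.append(group)
            return group
        if len(cands) == 1:              # unique match: join it, update template
            cands[0].logs.append(log_words)
            update_template(cands[0])    # sketched after the FIG. 6 discussion
            return cands[0]
        ranked = ranked[1:]              # several matches: re-select target word
    cands[0].logs.append(log_words)      # tie-break fallback (assumption)
    update_template(cands[0])
    return cands[0]
```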
As an optional embodiment, updating the candidate template according to the third preset method and the updated log group to obtain the target template includes:
acquiring all logs in the updated log group;
acquiring the number of words at the same position in the logs;
when the number of the words is smaller than a preset threshold, treating the word at that position as a variable word;
and replacing the variable words in the candidate template to obtain the target template.
Optionally, to update the log template representing each group, the present application uses the number of words in the same column to determine whether the word at a position is a variable: the count is compared with a preset threshold, e.g., 3, and if the count is smaller than the preset threshold, the word at that position is identified as a variable word. If a word at a position is identified as a variable word, that position in the log template is replaced with "<*>".
This embodiment is described with reference to FIG. 6. As shown in FIG. 6, there are 5 logs in the log sample: 0: multi-web #45 main build action completed: SUCCESS; 1: multi-web #46 main build action completed: FAILURE; 2: security #46 main build action completed: FAILURE; 3: multi-back #18 main build action completed: SUCCESS; 4: multi-back #19 main build action completed: SUCCESS. The number of words at the same position in the logs is obtained, for example: the first word of each log occupies the same position, and from the 5 logs it is obtained that multi-web appears as the first word in logs 0 and 1, so its count is 2; similarly, the count of security is 1 and the count of multi-back is 2. None of the three counts reaches the threshold 3, so the position of the first word in the generated template is replaced with "<*>". The positions of the other words are handled similarly and are not repeated here; finally the template is obtained: <*> main build action completed.
Each time a log to be parsed is added to a log group, the logs in the updated log group increase, and the variable words change. After the log group is updated, the above steps are re-executed, the variable words are re-identified, and the not-yet-replaced variable words in the group's previous template (i.e., the candidate template) are replaced, thereby obtaining the target template.
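A sketch of the template-update rule just described: a column in which every distinct word occurs fewer than threshold times is treated as a variable position and replaced with "<*>"; the names and the default threshold of 3 follow the FIG. 6 example:

```python
from collections import Counter

def update_template(group: "LogGroup", threshold: int = 3) -> None:
    for col in range(len(group.template)):
        words = [log[col] for log in group.logs if len(log) > col]
        counts = Counter(words)
        # every distinct word at this column is rarer than the threshold:
        # the column is a variable position (e.g. multi-web/security/multi-back)
        if counts and all(c < threshold for c in counts.values()):
            group.template[col] = "<*>"
```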
Further, as can be seen from FIG. 6: the known template of the 5 logs is: <*> main build action completed. The traditional method Drain groups words by the first position, i.e., groups by multi-web, security, and multi-back, and generates the templates: Template1: multi-web main build action completed; Template2: security main build action completed: FAILURE; Template3: multi-back main build action completed: SUCCESS. Comparing with the known template, Templates 1-3 are all wrong templates. Semlog, the method of the present application, groups by the word with the highest semantic contribution score, i.e., groups by build, and generates the template: Template4: <*> main build action completed. Comparing with the known template, Template4 is the correct template. To sum up, the example of FIG. 6 illustrates that logs generated by the same template should be grouped not by the k words at the first k positions but by the k words with the highest semantic contribution scores. The pre-trained model of the present invention outputs the semantic contribution score of each word, quantifying whether it is a constant word; TESC then groups the logs according to the output of the pre-trained model. Since the word with the highest semantic contribution score in each log of the sample is "build", grouping is completed in the first round and all logs are correctly grouped.
In the embodiment of the application, the logs are grouped according to the word with the highest semantic contribution score, so that when the template is generated all logs are correctly placed in one group; whether the word at a position is a variable is determined according to the number of words in the same column, the variable positions in the log template are replaced with "<*>", and the template is generated, thereby improving the accuracy of log grouping and of template generation.
According to another aspect of the embodiments of the present application, a log parsing apparatus for implementing the above log parsing method is also provided. FIG. 7 is a block diagram of an alternative log parsing apparatus according to an embodiment of the present application; as shown in FIG. 7, the apparatus may include:
a first acquisition module 701, configured to acquire a log to be parsed;
a first processing module 702, configured to process the log to be parsed according to a first preset method to obtain a segmented log;
a first input module 703, configured to input the segmented log into a pre-trained model to obtain an intermediate score for each word in the log to be parsed;
a calculation module 704, configured to calculate a target score for each word according to the segmented log and the intermediate scores;
a first obtaining module 705, configured to obtain a target template of the log to be parsed according to the log to be parsed, the target scores, the words, and a preset algorithm, wherein the preset algorithm is used to generate the target template.
With this apparatus, the log to be parsed is processed by the first preset method to obtain the segmented log, the intermediate scores of the segmented log are obtained by the pre-trained model, the target score of each word is calculated from the intermediate scores, and finally the target template of the log to be parsed is generated by the preset algorithm according to the target scores. This improves the precision and accuracy of log parsing, achieves high universality and robustness, and solves the problems of low accuracy and poor generalization capability in the related art.
As an alternative embodiment, the apparatus further comprises: the second acquisition module is used for acquiring a training log; the second processing module is used for processing the training logs according to a second preset method to obtain segmentation training logs and replacement segmentation training logs; the second input module is used for inputting the replacement segmentation training log into the initial model to obtain a prediction result; the second obtaining module is used for obtaining an objective function according to the prediction result, the segmentation training log and the first preset formula; and the adjusting module is used for adjusting the initial model until the numerical value of the objective function is always in a preset range to obtain the pre-training model.
As an alternative embodiment, the second processing module includes: the first segmentation unit is used for carrying out segmentation processing on the log to be analyzed according to a preset delimiter to obtain a first preset number of training words; the first marking unit is used for marking the training word to obtain training sub-characters, wherein the training sub-characters are used for representing the training word; the first obtaining unit is used for obtaining a segmentation training log according to the training sub-characters; and the replacing unit is used for replacing training sub-characters with preset proportions in the segmentation training log by using a preset strategy to obtain a replacement segmentation training log.
As an alternative embodiment, the second input module includes: the first acquisition unit is used for acquiring preset conditions; the first coding unit is used for coding the training sub-characters in the replacement segmentation training log based on a preset condition to obtain training character embedded vectors; the second obtaining unit is used for obtaining a training position embedded vector according to the replacement segmentation training log and the training sub-characters; the first combining unit is used for combining the training character embedded vector and the training position embedded vector to obtain a training embedded vector; the first generation unit is used for generating a training query vector, a training key vector and a training value vector of each training word according to the training embedded vector and the preset matrix; the third obtaining unit is used for obtaining intermediate output data according to the training query vector, the training key vector, the training value vector, the preset parameters and the second preset formula; a fourth obtaining unit, configured to obtain a residual value according to the training embedded vector, the intermediate output data, the preset weight, and the preset residual connection layer function; and a fifth obtaining unit, configured to obtain a prediction result according to the residual value and the third preset formula.
As an alternative embodiment, the first processing module includes: a second segmentation unit, used for segmenting the log to be analyzed according to the preset delimiter to obtain a second preset number of words; a second marking unit, used for marking the words to obtain sub-characters, wherein the sub-characters are used to represent the words; and a sixth obtaining unit, used for obtaining the segmentation log according to the sub-characters.
As an alternative embodiment, the first input module includes: a second coding unit, used for coding the sub-characters to obtain character embedded vectors; a seventh obtaining unit, used for obtaining position embedded vectors according to the segmentation log and the sub-characters; a second combining unit, used for combining the character embedded vectors and the position embedded vectors to obtain embedded vectors; a second generating unit, used for generating a query vector, a key vector and a value vector for each word according to the embedded vectors and the preset matrix; and an eighth obtaining unit, used for obtaining the intermediate scores according to the query vectors, the key vectors, the value vectors, the preset parameters and a fourth preset formula.
As an alternative embodiment, the calculation module includes: a second acquisition unit, used for acquiring the number of sub-characters contained in the segmentation log as a first number; a third acquisition unit, used for acquiring the number of sub-characters contained in each word as a second number; and a ninth obtaining unit, used for obtaining the target score according to a first sequence number of the query vector, a second sequence number of the key vector, the first number, the second number, the intermediate score and a fifth preset formula, wherein the first sequence number and the second sequence number are used to locate the intermediate score.
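The fifth preset formula is not given above; one plausible reading, sketched below under that assumption, averages the attention that all sub-characters (indexed by the query sequence number) pay to the sub-characters of each word (indexed by the key sequence number), normalised by the first and second numbers:

    import numpy as np

    # Sub-characters per word (toy example): the counts are "second numbers".
    words = {"session": 2, "opened": 2, "root": 1}
    counts = list(words.values())
    first_number = sum(counts)  # total sub-characters in the segmentation log

    rng = np.random.default_rng(1)
    attn = rng.random((first_number, first_number))
    attn /= attn.sum(axis=-1, keepdims=True)  # stand-in intermediate scores

    # Assumed fifth preset formula: mean attention received per sub-character.
    starts = np.cumsum([0] + counts[:-1])
    target = {
        w: attn[:, s:s + c].sum() / (first_number * c)
        for (w, c), s in zip(words.items(), starts)
    }
    print({w: round(v, 3) for w, v in target.items()})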
As an alternative embodiment, the first obtaining module includes: a sorting unit, used for sorting the words according to a preset order and the target scores to obtain a word sequence; a selecting unit, used for selecting the first third-preset-number of words in the word sequence as target words; a matching unit, used for matching the target words against preset log groups to obtain a matching result, wherein each preset log group contains a template, the matching result contains candidate log groups, and each candidate log group contains a candidate template; a first creating unit, used for creating a new log group if no candidate template exists in the matching result, and taking the log to be analyzed as the new template of the new log group, the new template being the target template; a second creating unit, used for creating a new log group if no candidate template containing all the target words exists in the matching result, and taking the log to be analyzed as the new template of the new log group, the new template being the target template; a first updating unit, used for adding the log to be analyzed to the candidate log group to obtain an updated log group if exactly one candidate template containing all the target words exists in the matching result, and updating the candidate template according to a third preset method and the updated log group to obtain the target template; and a second updating unit, used for re-selecting the target words and re-matching them against the candidate templates if more than one candidate template containing all the target words exists in the matching result, until the number of candidate templates is zero or one; when the number is zero, performing the operation of creating a new log group and taking the log to be analyzed as the new template of the new log group, the new template being the target template; and when the number is one, adding the log to be analyzed to the candidate log group to obtain an updated log group, and performing the operation of updating the candidate template according to the third preset method and the updated log group to obtain the target template.
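A toy stand-in for the matching flow described above; the wildcard notation "<*>" and the tie-breaking simplification in the more-than-one branch are assumptions of the sketch, not details disclosed here:

    # Log groups keyed by their template; each value lists the member logs.
    groups = {"Connection from <*> closed": ["Connection from 10.0.0.1 closed"]}

    def match(log, target_words):
        # Candidate templates must contain every target word.
        candidates = [t for t in groups
                      if all(w in t.split() for w in target_words)]
        if not candidates:
            # Zero candidates: open a new group; the log is its new template.
            groups[log] = [log]
            return log
        if len(candidates) == 1:
            # One candidate: join its group; the template is then updated
            # according to the third preset method (see the next sketch).
            groups[candidates[0]].append(log)
            return candidates[0]
        # More than one: the method re-selects target words and retries;
        # this toy simply keeps the first candidate for brevity.
        groups[candidates[0]].append(log)
        return candidates[0]

    print(match("Connection from 10.0.0.3 closed", ["Connection", "closed"]))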
As an alternative embodiment, the first updating unit comprises: a first acquisition sub-module, used for acquiring all logs in the updated log group; a second acquisition sub-module, used for acquiring the number of words at the same position across these logs; a determining sub-module, used for taking the word at a position as a variable word when the number of words at that position is smaller than a preset threshold; and a replacing sub-module, used for replacing the variable words in the candidate template to obtain the target template.
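A sketch of one plausible reading of this third preset method: compare the logs of the updated group position by position and mask low-count positions as variable words; the threshold value and the "<*>" placeholder are assumptions of the sketch:

    from collections import Counter

    def update_template(group_logs, threshold=2):
        template = []
        # Compare the words at each position across all logs in the group.
        for column in zip(*(log.split() for log in group_logs)):
            counts = Counter(column)
            if min(counts.values()) < threshold:
                template.append("<*>")      # variable word at this position
            else:
                template.append(column[0])  # constant word at this position
        return " ".join(template)

    group = [
        "open file a.txt ok",
        "open file b.txt ok",
        "open file a.txt ok",
    ]
    print(update_template(group))  # -> open file <*> ok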
It should be noted that the above modules correspond to the steps of the method embodiments and share the same examples and application scenarios, but they are not limited to the content disclosed in the above embodiments.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device for implementing the log parsing method, where the electronic device may be a server, a terminal, or a combination thereof.
Fig. 8 is a block diagram of an alternative electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device includes a processor 801, a communication interface 802, a memory 803 and a communication bus 804, wherein the processor 801, the communication interface 802 and the memory 803 communicate with each other via the communication bus 804;
the memory 803 is configured to store a computer program;
the processor 801 is configured to implement the steps of the log parsing method of the above embodiments when executing the computer program stored in the memory 803.
Optionally, in this embodiment, the above communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include RAM, and may also include non-volatile memory, such as at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
As an example, as shown in fig. 8, the memory 803 may include, but is not limited to, the first acquisition module 701, the first processing module 702, the first input module 703, the calculation module 704 and the first obtaining module 705 of the log parsing apparatus. In addition, other module units of the log parsing apparatus may also be included, but are not limited thereto, and are not described in detail in this example.
The processor may be a general-purpose processor, which may include, but is not limited to, a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments, and details are not repeated here.
It will be understood by those skilled in the art that the structure shown in fig. 8 is only schematic; the device implementing the log parsing method may be a terminal device, and the terminal device may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, or the like. Fig. 8 does not limit the structure of the above electronic device. For example, the terminal device may also include more or fewer components (such as a network interface or a display device) than shown in fig. 8, or have a different configuration from that shown in fig. 8.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing hardware related to a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a ROM, a RAM, a magnetic disk, an optical disk, and the like.
According to yet another aspect of the embodiments of the present application, there is also provided a storage medium. Optionally, in this embodiment, the above storage medium may be used to store program code for executing the log parsing method.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in the network shown in the above embodiments.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the steps of the log parsing method of the above embodiments.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments, and details are not repeated here.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.
In the description of this specification, reference to the terms "present embodiment," "one embodiment," "some embodiments," "example," "specific example," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features thereof, provided they do not contradict each other. In the description of the present disclosure, "a plurality" means at least two, such as two or three, unless explicitly specified otherwise.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to exhaustively list all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A method of log parsing, the method comprising:
acquiring a log to be analyzed;
processing the log to be analyzed according to a first preset method to obtain a segmented log;
inputting the segmentation log into a pre-training model to obtain an intermediate score of each word in the log to be analyzed;
calculating a target score for each of the words based on the segmentation log and the intermediate score;
and obtaining a target template of the log to be analyzed according to the log to be analyzed, the target score, the word and a preset algorithm, wherein the preset algorithm is used for generating the target template.
2. The method of claim 1, wherein prior to said entering the segmentation log into a pre-training model, the method further comprises:
acquiring a training log;
processing the training log according to a second preset method to obtain a segmentation training log and a replacement segmentation training log;
inputting the replacement segmentation training log into an initial model to obtain a prediction result;
obtaining an objective function according to the prediction result, the segmentation training log and a first preset formula;
and adjusting the initial model until the value of the objective function remains within a preset range, to obtain the pre-training model.
3. The method according to claim 2, wherein the processing the training log according to the second preset method to obtain a segmentation training log and a replacement segmentation training log includes:
segmenting the training log according to a preset delimiter to obtain a first preset number of training words;
marking the training word to obtain training sub-characters, wherein the training sub-characters are used for representing the training word;
obtaining the segmentation training log according to the training sub-characters;
and replacing training sub-characters with preset proportions in the segmentation training log by using a preset strategy to obtain the replacement segmentation training log.
4. A method according to claim 3, wherein said inputting the replacement segmentation training log into an initial model to obtain a predicted outcome comprises:
acquiring preset conditions;
coding the training sub-characters in the replacement segmentation training log based on the preset condition to obtain training character embedded vectors;
obtaining a training position embedded vector according to the replacement segmentation training log and the training sub-characters;
combining the training character embedded vector and the training position embedded vector to obtain a training embedded vector;
generating a training query vector, a training key vector and a training value vector of each training word according to the training embedded vector and a preset matrix;
obtaining intermediate output data according to the training query vector, the training key vector, the training value vector, preset parameters and a second preset formula;
obtaining a residual value according to the training embedded vector, the intermediate output data, a preset weight and a preset residual connection layer function;
and obtaining the prediction result according to the residual value and a third preset formula.
5. The method of claim 1, wherein the processing the log to be parsed according to the first preset method to obtain the split log comprises:
dividing the log to be analyzed according to a preset delimiter to obtain a second preset number of words;
marking the word to obtain a sub-character, wherein the sub-character is used for representing the word;
and obtaining the segmentation log according to the sub-characters.
6. The method of claim 5, wherein the inputting the segmentation log into a pre-training model to obtain the intermediate score for each word in the log to be parsed comprises:
coding the sub-characters to obtain character embedded vectors;
obtaining a position embedding vector according to the segmentation log and the sub-characters;
combining the character embedding vector and the position embedding vector to obtain an embedding vector;
generating a query vector, a key vector and a value vector of each word according to the embedded vector and a preset matrix;
and obtaining the intermediate score according to the query vector, the key vector, the value vector, preset parameters and a fourth preset formula.
7. The method of claim 6, wherein said calculating a target score for each of said words based on said segmentation log and said intermediate score comprises:
acquiring the number of sub-characters contained in the segmentation log as a first number;
acquiring the number of sub-characters contained in each word as a second number;
and obtaining the target score according to a first sequence number of the query vector, a second sequence number of the key vector, the first number, the second number, the intermediate score and a fifth preset formula, wherein the first sequence number and the second sequence number are used for determining the intermediate score.
8. The method of claim 7, wherein the obtaining the target template of the log to be parsed according to the log to be parsed, the target score, the word and a preset algorithm comprises:
sequencing the words according to a preset sequence and the target score to obtain a word sequence;
selecting the first third-preset-number of words in the word sequence as target words;
matching the target word with a preset log group to obtain a matching result, wherein the preset log group comprises a template, the matching result comprises a candidate log group, and the candidate log group comprises a candidate template;
if no candidate template exists in the matching result, creating a new log group and taking the log to be analyzed as a new template in the new log group, the new template being the target template;
if no candidate template containing all the target words exists in the matching result, creating a new log group and taking the log to be analyzed as a new template in the new log group, the new template being the target template;
if exactly one candidate template containing all the target words exists in the matching result, adding the log to be analyzed to the candidate log group to obtain an updated log group, and updating the candidate template according to a third preset method and the updated log group to obtain the target template;
and if more than one candidate template containing all the target words exists in the matching result, re-selecting the target words and matching them against the candidate templates until the number of candidate templates is zero or one; when the number is zero, performing the operation of creating a new log group and taking the log to be analyzed as a new template in the new log group; and when the number is one, adding the log to be analyzed to the candidate log group to obtain an updated log group, and performing the operation of updating the candidate template according to the third preset method and the updated log group to obtain the target template.
9. The method of claim 8, wherein updating the candidate templates according to a third predetermined method and the updated log group to obtain the target template comprises:
acquiring all logs in the updated log group;
acquiring the number of words at the same position across the logs;
taking the word at a position as a variable word when the number of words at that position is smaller than a preset threshold;
and replacing the variable words in the candidate templates to obtain the target template.
10. A log parsing apparatus, comprising:
the first acquisition module is used for acquiring a log to be analyzed;
the first processing module is used for processing the log to be analyzed according to a first preset method to obtain a segmented log;
the first input module is used for inputting the segmentation log into a pre-training model to obtain an intermediate score of each word in the log to be analyzed;
the calculation module is used for calculating the target score of each word according to the segmentation log and the intermediate score;
the first obtaining module is used for obtaining a target template of the log to be analyzed according to the log to be analyzed, the target score, the word and a preset algorithm, wherein the preset algorithm is used for generating the target template.
CN202211685297.0A 2022-12-27 2022-12-27 Log analysis method and device Pending CN116050380A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211685297.0A CN116050380A (en) 2022-12-27 2022-12-27 Log analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211685297.0A CN116050380A (en) 2022-12-27 2022-12-27 Log analysis method and device

Publications (1)

Publication Number Publication Date
CN116050380A true CN116050380A (en) 2023-05-02

Family

ID=86115700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211685297.0A Pending CN116050380A (en) 2022-12-27 2022-12-27 Log analysis method and device

Country Status (1)

Country Link
CN (1) CN116050380A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination