CN116467141A - Log recognition model training, log clustering method, related system and equipment - Google Patents
- Publication number
- CN116467141A CN116467141A CN202310366197.XA CN202310366197A CN116467141A CN 116467141 A CN116467141 A CN 116467141A CN 202310366197 A CN202310366197 A CN 202310366197A CN 116467141 A CN116467141 A CN 116467141A
- Authority
- CN
- China
- Prior art keywords
- log
- training
- logs
- clustering
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
- G06F11/3082—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting the data filtering being achieved by aggregating or compressing the monitored data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a log recognition model training method, a log clustering method, and a related system and device. The training method comprises the following steps: inputting a training data set into a log recognition model and generating, from the embedded representations of the logs produced by the model, a clustering center for the logs under each log template; for the training data of each training batch, updating the model parameters according to a preset loss function and then training on the next batch, the loss function being a weighted combination of a contrastive learning loss and a central clustering loss; after all training batches have been trained, returning to the step of inputting the training data set into the log recognition model, and obtaining the trained log recognition model once the preset training requirement is met. The test set is then input into the model, the similarity between logs is determined from the output feature vectors of the logs, and the logs are clustered according to that similarity. The method clusters logs on the basis of semantics, is applicable to a variety of systems, can be used across systems, offers transferability, and achieves a good clustering effect.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a log recognition model training and log clustering method, and a related system and device.
Background
A system generates a large amount of log data while it runs. These logs are used during system debugging and fault diagnosis: by analyzing them, the running condition of the system can be assessed and problems arising during operation can be located and resolved. Current methods for analyzing logs include the following:
algorithms based on rule matching, such as Drain, Spell, FT-tree and MoLFI. These methods use prior knowledge to perform regular-expression matching on the raw logs and strip out some simple variables such as IP addresses. The logs are then grouped according to predefined classification rules, typically by building a parse tree, and the grouped logs are finally compared word by word to generate the corresponding log templates, so that the logs can be parsed on the basis of those templates.
Neural network-based algorithms, such as Uniparser and NuLog. These methods treat template extraction as a supervised binary classification problem over each word in a log, distinguishing constants from variables. A strongly supervised data set labeling whether each word in a log is a variable is first constructed for model training; during training the model learns semantics by capturing token-level and context information and optimizes its performance through contrastive learning; the trained model is then used to extract templates, and the logs are parsed on the basis of those templates.
Disclosure of Invention
The inventors of the present application have found that, as modern systems grow, the number and variety of logs increase exponentially, and log analysis methods, including the two kinds of algorithms above, can no longer meet the demands of log analysis. The reasons are as follows: the two existing kinds of algorithms generally generate relatively fixed log templates for the logs and match logs against those templates with regular expressions in order to identify and classify the different types of logs; in this process only the surface arrangement of a log is considered and its semantic information is ignored, so logs with the same semantics cannot be clustered together. In addition, the existing methods rely on too much prior knowledge, and effective key information can only be extracted from the logs after analyzing a large amount of data, so these methods cannot be applied widely to various systems, perform poorly in cross-system use and log clustering, lack transferability, and have poor practicality.
The present invention has been made in view of the above problems and aims to provide a log recognition model training method, a log clustering method, and a related system and device that overcome, or at least partially solve, the above problems.
The embodiment of the invention provides a log recognition model training method, which comprises the following steps:
inputting a training data set into a log recognition model, and generating a clustering center of the logs under each log template according to embedded characterization of the logs generated by the model; the log pairs included in the training data of each training batch of the training data set respectively belong to different types of log templates;
for the training data of each training batch, updating the model parameters according to a preset loss function and then training on the training data of the next training batch, wherein the loss function is a weighted combination of a contrastive learning loss and a central clustering loss, the contrastive learning loss is determined in a contrastive learning manner based on the log pairs included in the training data, and the central clustering loss is determined according to the logs under each log template and the clustering center of that log template;
after the training of all training batches is completed, if the preset training requirement is not met, returning to the step of inputting the training data set into the log recognition model, and obtaining the trained log recognition model after the preset training requirement is met.
In some optional embodiments, the inputting the training data set into the log recognition model, generating a cluster center of the log under each log template according to the embedded representation of the log generated by the model, includes:
Inputting the training data set into a log recognition model, and performing word segmentation processing on the log by the model to generate token embedded characterization of a plurality of word segmentation; generating an embedded representation of each log according to the token embedded representation of the word segmentation included in each log;
traversing the logs included under the log templates aiming at each log template, and generating a clustering center of the logs under the log templates according to the embedded characterization of the traversed logs.
In some optional embodiments, the updating the model parameters according to the preset loss function includes:
taking one of the log pairs included in the training data of a training batch as a positive example pair and the remaining log pairs as negative example pairs, and determining the contrastive learning loss in a contrastive learning manner, wherein the contrastive learning loss aims to maximize the feature-space similarity of logs under the same log template and to minimize the feature-space similarity of logs under different templates;
determining the central clustering loss according to the logs under each log template and the clustering center of the log template, wherein the central clustering loss aims to maximize the feature-space similarity between the logs and the clustering center;
and weighting the contrastive learning loss and the central clustering loss to obtain the preset loss function, and adjusting the model parameters according to the preset loss function.
In some optional embodiments, the obtaining of a trained log recognition model after the preset training requirement is met includes:
obtaining the trained log recognition model after the number of iterations reaches a preset number, after the loss determined according to the preset loss function falls below a preset threshold, or after the distance between the clustering center obtained in the current round and the clustering center obtained in the previous round falls below a preset threshold.
In some alternative embodiments, before the training data set is input into the log recognition model, the method further includes:
constructing a log pair according to the log under the log template aiming at each type of log template;
sampling from log pairs corresponding to different types of log templates, and constructing a training data set comprising a plurality of training batch training data.
In some alternative embodiments, sampling from log pairs corresponding to different types of log templates, constructing a training data set comprising a plurality of training batches of training data includes:
extracting log templates with the number of training batch sizes from the log template set for each training batch;
and extracting a log pair from the log pairs included in each extracted log template to obtain training data of the training batch.
The embodiment of the invention provides a log clustering method, which comprises the following steps:
inputting the logs in the test set into a trained log identification model, and outputting the feature vectors of the logs; the log recognition model is obtained by training the log recognition model training method;
and determining the similarity of the two logs according to the semantic similarity of the feature vectors of the two logs, and clustering the logs according to the similarity.
In some optional embodiments, the determining of the similarity of two logs according to the semantic similarity of their feature vectors, and the clustering of the logs according to the similarity, specifically include:
determining the semantic similarity of the feature vectors of the two logs by adopting a selected similarity algorithm to determine the similarity of the two logs; the similarity algorithm comprises at least one of a cosine similarity algorithm, an L1 similarity algorithm and an L2 similarity algorithm;
clustering the logs in the test set by adopting a selected hierarchical clustering algorithm according to the determined similarity; the hierarchical clustering algorithm comprises at least one of an unsupervised hierarchical clustering algorithm, a K-means algorithm and a DBSCAN algorithm.
The embodiment of the invention provides a log clustering system, which comprises the following steps: a server and a client;
the server is configured with a device for implementing the above log recognition model training method and/or a device for implementing the above log clustering method;
the client is used for collecting logs and providing the logs to the server, and receiving log clustering results transmitted by the server.
The embodiment of the invention provides a computer storage medium, wherein computer executable instructions are stored in the computer storage medium, and the computer executable instructions realize the log identification model training method and/or realize the log clustering method when being executed by a processor.
The embodiment of the invention provides clustering equipment, which comprises the following components: the system comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the log identification model training method and/or the log clustering method when executing the program.
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects:
according to the log recognition model training method, semantic information present in the log text is extracted and used through pre-training over a constructed training data set of log pairs, so that the natural semantics contained in log data are effectively exploited and the semantics in the logs are fully mined for extracting clustering features; based on the log pairs, the model is trained with a loss function built from contrastive learning and center clustering, and this loss function optimizes the intra-class distance and the inter-class distance at the same time, so the log features output by the model have a good clustering distribution in semantic space, which improves the feasibility of clustering in the subsequent step. Because similarity information in the text is extracted with the knowledge the pre-trained model has learned about natural language semantics, the method can be applied to various systems and used across systems, its transferability and practicality under the logs of different systems are ensured, and a good clustering effect is obtained.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flowchart of a training method of a log recognition model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a log clustering method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a log recognition model training and log clustering method according to a third embodiment of the present invention;
FIG. 4 is a schematic diagram of a training device for a log recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a log clustering device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a log clustering system according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
With the rapid growth in the number of logs, large volumes of semi-structured logs can no longer be analyzed simply by constructing regular expressions, and the conventional approaches of regular-expression matching or source-code analysis cannot identify the log types. Existing log analysis methods, such as the rule-matching algorithms and neural-network algorithms above, therefore cannot meet the demands of log analysis, and an efficient log analysis method is urgently needed to classify massive logs of different categories.
In order to improve the transferability and practicality of the log clustering method and model, to meet the analysis requirements of massive logs, and to satisfy the need in certain scenarios to cluster logs that express the same meaning, this application proposes a semantics-based method for clustering logs. First, an advanced pre-trained model, BERT, is fine-tuned so as to make full use of the semantic information in natural corpora; then the model is optimized from the perspective of log clustering by combining contrastive learning and center clustering. With this method the log features output by the model are separable, a simple hierarchical clustering algorithm is already enough to reach high classification accuracy, and the method shows strong robustness in cross-system settings. The method can effectively use the semantic information in the logs for cluster analysis and greatly improves the efficiency and performance of log clustering.
Example 1
The first embodiment of the invention provides a training method for a log recognition model, the flow of which is shown in fig. 1, comprising the following steps:
step S11: inputting a training data set into a log recognition model, and generating a clustering center of the logs under each log template according to embedded characterization of the logs generated by the model; the log pairs included in the training data of each training batch of the training data set respectively belong to different types of log templates.
A training data set can be constructed in advance, and for each type of log template, a log pair is constructed according to the log under the log template; sampling from log pairs corresponding to different types of log templates, and constructing a training data set comprising a plurality of training batch training data. Optionally, extracting log templates with the number of training batch sizes from the log template set for each training batch; and extracting a log pair from the log pairs included in each extracted log template to obtain training data of the training batch. That is, preferably, there is only one log pair under each log template in a training batch. Of course, alternatively, there may be more than one, but at least one.
Inputting the training data set into a log recognition model, and performing word segmentation processing on the log by the model to generate token embedded characterization of a plurality of word segmentation; generating an embedded representation of each log according to the token embedded representation of the word segmentation included in each log; traversing the logs included under the log templates aiming at each log template, and generating a clustering center of the logs under the log templates according to the embedded characterization of the traversed logs. For each log in the training set data of the input log recognition model, the model can process the log to obtain the embedded representation of the word segmentation level, and the embedded representation of one log is calculated in an average or weighted average mode according to the embedded representation of the word segmentation level. And for each log template, calculating a clustering center of the logs included under the log template in an average or weighted average mode according to embedded characterization of all the logs under the log template.
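As an illustration of this step, the minimal sketch below (Python; the function name compute_cluster_centers and the data layout are illustrative assumptions, not part of the embodiment) averages the per-log embeddings under each template to obtain its clustering center, assuming the embeddings have already been produced by the model.

```python
import numpy as np
from collections import defaultdict

def compute_cluster_centers(log_embeddings, template_ids):
    """Average the embeddings of all logs under each log template.

    log_embeddings: iterable of per-log embedding vectors (e.g. numpy arrays).
    template_ids:   the template id each log belongs to, same order and length.
    Returns a dict mapping template id -> clustering center vector.
    """
    grouped = defaultdict(list)
    for emb, tid in zip(log_embeddings, template_ids):
        grouped[tid].append(np.asarray(emb))
    # A plain average is used here; a weighted average works the same way.
    return {tid: np.mean(np.stack(embs), axis=0) for tid, embs in grouped.items()}
```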
Step S12: for the training data of each training batch, updating the model parameters according to a preset loss function and then training on the training data of the next training batch, wherein the loss function is a weighted combination of a contrastive learning loss and a central clustering loss, the contrastive learning loss is determined in a contrastive learning manner based on the log pairs included in the training data, and the central clustering loss is determined according to the logs under each log template and the clustering center of that log template.
A loss function can be constructed in advance for training the model. Preferably, the contrastive learning loss and the central clustering loss are constructed first, and the preset loss function is then obtained as their weighted sum, i.e. the model is trained by considering both kinds of loss together. For the training data of each training batch, the loss is computed once and the model parameters are updated accordingly; the updated model continues training on the training data of the next training batch, and after multiple iterations a model meeting the requirements is obtained.
The contrastive learning loss function is constructed as follows: one of the log pairs included in the training data of a training batch is taken as the positive example pair and the remaining log pairs as negative example pairs, and the contrastive learning loss is determined in a contrastive learning manner; it aims to maximize the feature-space similarity of logs under the same log template and to minimize the feature-space similarity of logs under different templates.
The central clustering loss function is constructed as follows: the central clustering loss is determined according to the logs under each log template and the clustering center of the log template; it aims to maximize the feature-space similarity between the logs and their clustering center.
Then, the contrastive learning loss and the central clustering loss are weighted to obtain the preset loss function, and the model parameters are adjusted according to the preset loss function.
Step S13: after the training of all training batches is completed, if the preset training requirement is not met, returning to the step of inputting the training data set into the log recognition model, and obtaining the trained log recognition model after the preset training requirement is met.
Training on the training data of all the training batches in the training set completes one round of training. If the preset training requirement is not met, the process returns to step S11 for the next round, and a log recognition model meeting the requirement is obtained after several rounds of training. The training requirement may be a requirement on the number of training rounds, on the size of the loss, on the distance between clustering centers, or the like. For example, the trained log recognition model is obtained after the number of iterations reaches a preset number, after the loss determined according to the preset loss function falls below a preset threshold, or after the distance between the clustering center obtained in the current round and the clustering center obtained in the previous round falls below a preset threshold.
The clustering center can be updated after each round of training, and in each round, model parameters can be updated according to the loss function after each batch of training.
In the method of this embodiment, semantic information present in the log text is extracted and used through pre-training over the constructed training data set of log pairs, so that the natural semantics contained in log data are effectively exploited and the semantics in the logs are fully mined for extracting clustering features. Based on the log pairs, the model is trained with a loss function built from contrastive learning and center clustering, which directly optimizes the model's output in the clustering space; the loss function optimizes the intra-class distance and the inter-class distance at the same time, so the log features output by the model have a good clustering distribution in semantic space, the effect of semantic log clustering is improved, and clustering in the subsequent step becomes more feasible. Because similarity information in the text is extracted with the knowledge the pre-trained model has learned about natural language semantics, the method can be applied to various systems and used across systems, its transferability and practicality under the logs of different systems are ensured, and a good clustering effect is obtained. The model generates a meaningful embedding for each log as its feature vector, which can be provided directly to downstream tasks after simple fine-tuning. The downstream tasks may be various log analysis tasks; for example, the log features can be used for clustering the logs, or for classification, such as determining whether a log is abnormal.
Example two
The second embodiment of the present invention provides a log clustering method, the flow of which is shown in fig. 2, comprising the following steps:
step S21: inputting the logs in the test set into a trained log recognition model, and outputting the feature vectors of the logs; the log recognition model is obtained by training with the log recognition model training method provided in Embodiment One.
The test set comprises a plurality of logs to be tested and analyzed. The model performs word segmentation on each log, obtains the embedded representation of each token, and derives the embedded representation of the log from the token representations, thereby obtaining its feature vector.
Step S22: and determining the similarity of the two logs according to the semantic similarity of the feature vectors of the two logs, and clustering the logs according to the similarity.
Determining the semantic similarity of the feature vectors of the two logs by adopting a selected similarity algorithm to determine the similarity of the two logs; and clustering the logs in the test set by adopting a selected hierarchical clustering algorithm according to the determined similarity. By means of clustering through semantic similarity, semantic features can be considered in the clustering process, the clustering effect is improved, the log clustering method is applicable to various systems and fields, and the mobility is improved.
The similarity algorithm comprises at least one of a cosine similarity algorithm, an L1 similarity algorithm and an L2 similarity algorithm;
the hierarchical clustering algorithm comprises at least one of an unsupervised hierarchical clustering algorithm, a K-means clustering (K-means) algorithm, and a density-based clustering algorithm (Density-Based Spatial Clustering of Applications with Noise, DBSCAN).
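A minimal sketch of this clustering step is given below; it assumes the feature vectors have already been produced by the model, and the use of scikit-learn's AgglomerativeClustering with a cosine distance threshold is only one possible choice, not prescribed by the embodiment.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import AgglomerativeClustering

def cluster_logs(feature_vectors, distance_threshold=0.05):
    """Cluster log feature vectors by cosine similarity.

    feature_vectors:    array of shape (num_logs, dim) output by the model.
    distance_threshold: maximum cosine distance (1 - cosine similarity)
                        allowed when merging clusters (illustrative value).
    """
    x = normalize(np.asarray(feature_vectors))        # unit vectors -> cosine geometry
    clustering = AgglomerativeClustering(
        n_clusters=None,                               # let the threshold decide
        metric="cosine",                               # called 'affinity' in older scikit-learn
        linkage="average",
        distance_threshold=distance_threshold,
    )
    return clustering.fit_predict(x)                   # one cluster label per log
```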
In this embodiment, feature recognition is performed on the logs in the test set based on the log recognition model obtained by training, clustering is performed based on the semantic similarity of the recognized feature vectors, and knowledge learned in natural semantics by the pre-training model is utilized to extract similarity information in the text, so that the semantic information of the logs is fully considered during clustering, and a good clustering effect can be obtained.
Example III
The third embodiment of the present invention provides a specific implementation process of the log recognition model training and log clustering method, where the flow is shown in fig. 3, and the method includes the following steps:
step S31: a training dataset is pre-constructed. The process of constructing the training dataset includes:
s311: for the logs under each different type of log template, multiple log pairs are generated; e.g., all logs under a log template are paired with each other, giving at most n(n-1)/2 log pairs, where n is the number of logs under that log template.
A log template is the structured text produced by the same output statement (such as a print() call), and is usually obtained by regular matching against the raw logs. Assuming there are m log templates, all log templates are traversed and log pairs are constructed from all of the different logs under each template. The logs under each log template are combined pairwise, two logs being taken at a time and regarded as a positive example pair, so that a total of Σ_i n_i(n_i-1)/2 log pairs can be obtained, where the sum runs over the m log templates and n_i is the number of logs under the i-th log template.
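A minimal sketch of this pairing step (Python; the helper name build_log_pairs and the dictionary layout are illustrative assumptions):

```python
from itertools import combinations

def build_log_pairs(logs_by_template):
    """Build positive log pairs from the logs under each log template.

    logs_by_template: dict mapping template id -> list of distinct logs.
    Returns a dict mapping template id -> list of (log_a, log_b) pairs,
    at most n*(n-1)/2 pairs for a template with n logs.
    """
    return {tid: list(combinations(logs, 2)) for tid, logs in logs_by_template.items()}
```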
S312: sampling is performed from the log pairs of each log template for constructing a training dataset. In the constructed dataset, log pairs under the same template in one training batch (batch) only appear once.
1) When constructing the data set, a batch-size number of different log templates are extracted from the log template set; that is, for each batch the number of extracted log templates equals the batch size.
2) From the log pairs under each extracted log template, one log pair is extracted, giving a total of batch-size log pairs that form one batch of the data set.
The batch size refers to the number of samples used in one batch of model training; it is the unit over which the optimizer performs gradient descent.
3) Steps 1) and 2) are repeated and multiple batches are collected until the data set reaches the required size; for example, if a data set of 100,000 samples is to be constructed with 10,000 samples per batch, 10 batches are collected.
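A minimal sketch of this batch-construction procedure (the function name and the use of Python's random module are illustrative assumptions):

```python
import random

def build_training_batches(pairs_by_template, batch_size, num_batches, seed=0):
    """Sample batches: batch_size distinct templates per batch, one log pair per template."""
    rng = random.Random(seed)
    template_ids = list(pairs_by_template)
    batches = []
    for _ in range(num_batches):
        chosen = rng.sample(template_ids, batch_size)                     # distinct templates
        batches.append([rng.choice(pairs_by_template[tid]) for tid in chosen])  # one pair each
    return batches
```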
S313: a small number of regular expressions are used to perform variable replacement on the logs, substituting some variables with a special identifier such as "[var]". The special identifier is a symbol the model can recognize; once the replacement has been made, the model can identify the variable position. For example, if 1, 2 and 3 are variables, then 1, 2 and 3 are all replaced by the identifier "[var]".
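A minimal sketch of such variable masking; the regular expressions shown are illustrative examples only and would be adapted to the logs of a concrete system:

```python
import re

# Illustrative patterns only; real deployments tune these to their own logs.
VARIABLE_PATTERNS = [
    r"\b\d{1,3}(?:\.\d{1,3}){3}\b",   # IPv4 addresses
    r"0x[0-9a-fA-F]+",                # hexadecimal values
    r"\b\d+\b",                       # plain numbers
]

def mask_variables(log_line, placeholder="[VAR]"):
    """Replace obvious variable tokens in a raw log line with a special identifier."""
    for pattern in VARIABLE_PATTERNS:
        log_line = re.sub(pattern, placeholder, log_line)
    return log_line

# mask_variables("connect from 10.0.0.3 port 51234")
# -> "connect from [VAR] port [VAR]"
```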
When the positive and negative example pairs of the data set are constructed, the data set can also be generated in the manner of the MoCo algorithm, i.e. a large queue of positive examples is maintained and the model parameters are updated with a momentum mechanism. Positive pairs can also be generated in other ways, for example by applying a random text transformation to an original log: some words of an original log produced by the system are deleted at random to obtain a perturbed log, and the perturbed log and the original log form a log pair. This approach is more likely to be used when labels are scarce.
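A minimal sketch of the random-deletion way of forming a positive pair described above (the deletion probability and helper name are illustrative assumptions):

```python
import random

def random_deletion_pair(log_line, drop_prob=0.1, seed=None):
    """Form a positive pair by randomly deleting a few words from an original log."""
    rng = random.Random(seed)
    words = log_line.split()
    kept = [w for w in words if rng.random() > drop_prob]
    if not kept:                      # keep at least one word
        kept = [rng.choice(words)]
    return log_line, " ".join(kept)   # (original, perturbed) positive pair
```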
Step S32: and inputting the training data set into a log recognition model, and generating a clustering center of the log under each log template according to the embedded characterization of the log generated by the model.
For each log template, the process of determining the clustering center of the log template comprises the following steps:
s321: for an input log, the log recognition model first performs word segmentation (tokenization) on the log; for example, a BERT-based model generates token embeddings for the resulting sub-word tokens. All token embeddings are averaged, and the result is taken as the output of the model, i.e. the embedded representation (embedding) of the corresponding log. A token embedding is the feature vector of the corresponding sub-word after the text has been segmented.
BERT is a large Transformer-based pre-trained language model that is commonly used for text processing. The pre-trained model may use the BERT architecture, or an algorithmic model based on another architecture, such as a model based on a CNN, an RNN or a Transformer, or another model built on the Transformer architecture.
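A minimal sketch of this mean-pooling step using the Hugging Face transformers library; the checkpoint name bert-base-uncased is only an example, and batching and GPU handling are omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # example checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed_logs(log_lines):
    """Return one embedding per log by averaging its token embeddings."""
    enc = tokenizer(log_lines, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = encoder(**enc).last_hidden_state          # (batch, tokens, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()                # ignore padding tokens
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)     # (batch, dim)
```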
S322: all log templates are traversed, all logs under each log template are traversed, and the embedding corresponding to each log is generated. For each log template, the mean of the embeddings of all logs under that template is computed and used as the clustering center of the log template.
In different rounds, the clustering centers obtained change as the model parameters change; through iteration the clustering centers become more accurate and better reflect the clustering characteristics of the logs.
Step S33: and updating model parameters according to a preset loss function aiming at training data of each training batch.
In each round, model parameters may be updated once per training batch according to the loss function.
For example, two different loss functions may be used to train the model, optimizing the clustering effect of the model:
s331: the contrast learning Loss (InfoNCE Loss) function was constructed as follows:
L_contrast = -(1/N) Σ_i log [ exp(sim(s_i, p_i)/τ) / Σ_j exp(sim(s_i, p_j)/τ) ]
wherein L_contrast represents the contrastive learning loss and there are N log pairs (s_i, p_i) in a batch. In the contrastive learning manner, (s_i, p_i) is taken as the positive example pair and the remaining pairs (s_i, p_j), j ≠ i, are taken as negative example pairs in the InfoNCE loss; τ represents the temperature coefficient, i indexes the i-th sample in a batch, and j indexes the samples other than i. The loss makes logs under the same template more similar in the feature space and logs under different templates less similar. sim() represents a similarity function; cosine similarity is used for the measurement during model training, but the L1 or L2 similarity can also be used, and the similarity measure can be replaced as required. Contrastive learning is a machine learning approach that optimizes the distance between the positive and negative examples of an input sample.
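A minimal PyTorch sketch of an in-batch InfoNCE loss of the form above; the temperature value 0.07 is an illustrative default, not a value prescribed by the embodiment:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors, positives, temperature=0.07):
    """In-batch InfoNCE: (s_i, p_i) is the positive pair, (s_i, p_j), j != i, are negatives."""
    a = F.normalize(anchors, dim=-1)       # cosine similarity via normalized dot products
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature       # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```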
S332: a central clustering loss (center similarity loss) function is constructed:
wherein L_center represents the central clustering loss. Each log template has a corresponding clustering center, and each log is associated with the clustering center of its template. The loss maximizes the similarity between each log and its corresponding clustering center, so that different logs under the same template become more compact in the feature space.
The central cluster loss may be replaced with other loss of optimal cluster homogeneity and integrity, e.g., central clustering may be implemented using a t-distribution loss function.
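Since the exact formula of the central clustering loss is not reproduced above, the sketch below assumes one natural form, the negative mean cosine similarity between each log embedding and the clustering center of its template; this functional form is an assumption, not necessarily the formula used in the embodiment:

```python
import torch
import torch.nn.functional as F

def center_similarity_loss(embeddings, centers, template_ids):
    """Assumed form: negative mean cosine similarity to the clustering center.

    embeddings:   (N, dim) log embeddings of the current batch.
    centers:      (num_templates, dim) clustering centers, one per log template.
    template_ids: (N,) LongTensor, index of the template each log belongs to.
    """
    e = F.normalize(embeddings, dim=-1)
    c = F.normalize(centers[template_ids], dim=-1)   # center of each log's template
    return -(e * c).sum(dim=-1).mean()               # maximize similarity -> minimize loss
```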
S333: constructing a preset loss function, namely a training objective function L:
L = L_contrast + λ · L_center
the training goal of the final model is to optimize both of the above-mentioned losses simultaneously, where λ is the scaling factor of the center similarity loss, for balancing the two different losses.
In the model training process, the model is optimized from the clustering perspective. The loss function optimizes both the intra-class distance and the inter-class distance, so that the log features output by the model have a good clustering distribution in semantic space, which improves the feasibility of clustering in the subsequent step. For example, the InfoNCE loss optimizes both the intra-class and the inter-class distance, while the center similarity loss further tightens the intra-class distance.
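A minimal sketch of one batch update with the weighted objective L = L_contrast + λ · L_center, reusing the two loss sketches above; the model interface and the value of λ are illustrative assumptions:

```python
def training_step(model, anchor_batch, positive_batch, centers, template_ids, optimizer, lam=0.1):
    """One parameter update per training batch with the weighted objective.

    Reuses info_nce_loss and center_similarity_loss from the sketches above;
    model(batch) is assumed to return one embedding per input log.
    """
    anchors = model(anchor_batch)        # embeddings of the first log of each pair
    positives = model(positive_batch)    # embeddings of the paired log
    loss = info_nce_loss(anchors, positives) \
        + lam * center_similarity_loss(anchors, centers, template_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # model parameters updated once per batch
    return loss.item()
```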
Step S34: judging whether all batches of training are completed.
If yes, step S35 is executed, and if no, step S33 is executed, training data of the next training batch is trained.
Step S35: and judging whether the preset training requirement is met, if so, executing the step S36, otherwise, returning to the step S32.
After the training of all training batches is completed, if the preset training requirement is not met, returning to the step of inputting the training data set into the log recognition model, and continuing the training of the next round until the preset training requirement is met, and obtaining the trained log recognition model.
Through training over a plurality of rounds, the clustering centers of the log templates and the model parameters are updated multiple times, until one or more of the following conditions are met: the training rounds are finished, the loss is small enough, or the clustering centers no longer change (or their change distance meets the requirement); training then ends.
Since all logs are written based on a certain readability, they contain a lot of semantic information. The method is based on the latest pre-training model for training, and can effectively utilize semantic information in the text for similarity evaluation, so that similar system logs are clustered. Thus, unlike previous algorithms that generate simple matching templates for logs only, the model generates feature vectors with semantic information for different logs.
Step S36: inputting the logs in the test set into a trained log recognition model, and outputting the feature vectors of the logs.
The method uses the model to generate log features: the trained model is loaded, each log in the test set is used as input to the model, and a feature vector of fixed dimension for the corresponding log is obtained as output.
Step S37: and determining the similarity of the two logs according to the semantic similarity of the feature vectors of the two logs, and clustering the logs according to the similarity.
The method can adopt a hierarchical clustering mode when clustering the logs based on semantic similarity, and the hierarchical clustering algorithm can adopt an unsupervised hierarchical clustering algorithm or a K-means, DBSCAN algorithm. Hierarchical clustering is an algorithm that clusters data points in top-down or bottom-up fashion.
After the feature vectors of the corresponding logs are generated by using the log recognition model, the logs of the same type have a certain similarity in the semantic space of the corresponding feature vectors. The cosine similarity between the feature vectors of different logs can be directly calculated to calculate the similarity of the different logs. By adopting a simple clustering algorithm and cosine similarity as distance measurement, the system logs can be clustered directly, so that logs of the same type are combined and used for downstream tasks to analyze and process.
For example: after the cosine similarity is calculated, the feature vectors can be clustered by an unsupervised hierarchical clustering algorithm. The cosine similarity of two feature vectors is computed; if they are more than 95% similar the two logs are regarded as similar, otherwise as dissimilar, and in this way a criterion of similarity and dissimilarity can be set. The similarity algorithm can also be an L1 or L2 similarity algorithm, and the similarity measure can be replaced as required, for example using the L1 or L2 distance as the similarity measure.
The algorithm model may be deployed on a cloud or a backend server; it is generally deployed on the cloud, and logs are collected from the clients. Cluster analysis is performed on the cloud or backend server, and the clustering results are output to the user.
Based on the same inventive concept, the embodiment of the present invention further provides a log recognition model training device, where the device may be disposed in a clustering device, and the structure of the device is shown in fig. 4, and includes:
the cluster center generating module 41 is used for inputting the training data set into the log recognition model, and generating a cluster center of the log under each log template according to the embedded characterization of the log generated by the model; the log pairs included in the training data of each training batch of the training data set respectively belong to different types of log templates.
The parameter updating module 42 is configured to, for the training data of each training batch, update the model parameters according to a preset loss function and then train on the training data of the next training batch, wherein the loss function is a weighted combination of a contrastive learning loss and a central clustering loss; the contrastive learning loss is determined in a contrastive learning manner based on the log pairs included in the training data, and the central clustering loss is determined according to the logs under each log template and the clustering center of the log template.
The training control module 43 is configured to return to the step of inputting the training data set into the log recognition model if the training data set does not meet the preset training requirement after training of all training batches is completed, and obtain a trained log recognition model after the training data set meets the preset training requirement.
Based on the same inventive concept, the embodiment of the present invention further provides a log clustering device, where the device may be disposed in a clustering apparatus, and the structure of the device is shown in fig. 5, and includes:
the feature output module 51 is configured to input the logs in the test set into a trained log recognition model, and output feature vectors of the logs; the log recognition model is obtained by training the log recognition model training method according to any one of claims 1-6;
And the clustering module 52 is configured to determine the similarity of the two logs according to the semantic similarity of the feature vectors of the two logs, and cluster the logs according to the similarity.
The embodiment of the invention also provides a log clustering system, the structure of which is shown in figure 6, comprising a server 1 and a client 2;
the server 1 is configured to provide means for implementing the above-described log recognition model training method and/or means for implementing the above-described log clustering method;
the client 2 is configured to collect logs and provide the logs to the server 1, and receive log clustering results transmitted by the server 1.
The embodiment of the invention also provides a computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions realize the log recognition model training method and/or realize the log clustering method when being executed by a processor.
The embodiment of the invention also provides clustering equipment, which comprises: the system comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the log identification model training method and/or the log clustering method when executing the program.
The specific manner in which the various modules perform the operations in relation to the systems, apparatus and associated devices of the embodiments described above have been described in detail in relation to the embodiments of the method and will not be described in detail.
Compared with the prior art, the method and the system provided by the embodiment of the invention have the following beneficial effects:
in terms of data set preparation, the model does not require strong labeling information of the kind Uniparser needs, and it can make effective use of the semantic information in the logs; the model can be trained effectively with only a division of the logs into types, without word-level labeling, which greatly reduces the time cost of manual labeling.
Domain-adaptation fine-tuning is performed on a well-performing pre-trained model, and the model parameters are updated according to the loss function during the training of each batch; the knowledge the pre-trained model has learned about natural language semantics can thus be used effectively to extract similarity information from the text, which ensures the transferability and practicality of the algorithm under different system logs.
The previous methods of Drain and Uniparser et al generated a relatively fixed log template for the logs and used the template to regularly match the logs to identify different types of logs. The method of the invention generates different feature vectors for different logs from the aspect of semantic clustering. During clustering, the logs of the same template can be considered to be clustered, and the logs expressing the same semantics under different templates can be considered to be clustered, so that the clustering flexibility and the usability are expanded.
Previously, each log was matched to a specific template, and when a downstream task needed to analyze the logs, a feature matrix of the logs had to be constructed according to the log templates. Log cluster matching and the downstream task were thus split apart, and errors could accumulate during feature construction. The model of the invention can directly generate features usable by downstream tasks for each log and optimize the downstream task end to end, which reduces the complexity of the algorithm and improves usability.
Unless specifically stated otherwise, terms such as processing, computing, calculating, determining, displaying, or the like, may refer to an action and/or process of one or more processing or computing systems, or similar devices, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the processing system's registers or memories into other data similarly represented as physical quantities within the processing system's memories, registers or other such information storage, transmission or display devices. Information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. The processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. These software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "comprising" is intended to be inclusive in a manner similar to the term "comprising," as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean "non-exclusive or".
Claims (11)
1. A log recognition model training method, comprising:
inputting a training data set into a log recognition model, and generating a clustering center for the logs under each log template according to the embedded representations of the logs generated by the model, wherein the log pairs included in the training data of each training batch of the training data set belong to different types of log templates;
for the training data of each training batch, updating the model parameters according to a preset loss function before training on the training data of the next training batch, wherein the preset loss function is a weighted combination of a contrastive learning loss and a center clustering loss, the contrastive learning loss is determined in a contrastive learning manner based on the log pairs included in the training data, and the center clustering loss is determined according to the logs under each log template and the clustering center of that log template;
after the training on all training batches is completed, if a preset training requirement is not met, returning to the step of inputting the training data set into the log recognition model, and obtaining the trained log recognition model once the preset training requirement is met.
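For illustration only, the following is a minimal sketch of the batch-wise training loop in claim 1, assuming a PyTorch-style encoder; the toy encoder, the helper names (`cluster_centers`, `train`), and the stopping tolerance are hypothetical and not taken from the patent. The `combined_loss` argument is supplied by the caller (one possible form is sketched after claim 3).

```python
# Hypothetical sketch of the claim-1 training loop; not the patented implementation.
import torch
from torch import nn

class LogEncoder(nn.Module):
    """Toy stand-in for the log recognition model: mean of token embeddings."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):                 # token_ids: (num_logs, seq_len)
        return self.emb(token_ids).mean(dim=1)    # (num_logs, dim) log embeddings

def cluster_centers(model, logs_by_template):
    """Clustering center per template = mean embedding of the logs under it."""
    with torch.no_grad():
        return {tpl: model(logs).mean(dim=0) for tpl, logs in logs_by_template.items()}

def train(model, batches, logs_by_template, combined_loss, max_passes=10, tol=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev_centers = None
    for _ in range(max_passes):
        centers = cluster_centers(model, logs_by_template)   # recompute each pass
        for batch in batches:         # each batch holds pairs from distinct templates
            opt.zero_grad()
            loss = combined_loss(model, batch, centers)
            loss.backward()
            opt.step()                # update parameters, then take the next batch
        # one possible training requirement: stop when the centers barely move
        if prev_centers is not None:
            shift = max((centers[t] - prev_centers[t]).norm().item() for t in centers)
            if shift < tol:
                break
        prev_centers = centers
    return model
```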
2. The method of claim 1, wherein the inputting a training data set into a log recognition model and generating a clustering center for the logs under each log template according to the embedded representations of the logs generated by the model comprises:
inputting the training data set into the log recognition model, the model performing word segmentation on each log to generate token embeddings for a plurality of tokens, and generating an embedded representation of each log according to the token embeddings of the tokens included in that log;
for each log template, traversing the logs included under that log template, and generating a clustering center for the logs under the template according to the embedded representations of the traversed logs.
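A minimal sketch of the embedding and cluster-center steps in claim 2, using a toy fixed random token table in place of a learned embedding layer; the regular expression used for word segmentation and the example log lines are illustrative assumptions.

```python
# Hypothetical illustration of claim 2: tokenize each log, embed the tokens,
# mean-pool to a per-log embedding, then average per template to get its center.
import re
import numpy as np

DIM = 64
_rng = np.random.default_rng(0)
_token_table = {}

def token_embedding(token):
    # Toy lookup: a fixed random vector per token (a real model would learn these).
    if token not in _token_table:
        _token_table[token] = _rng.normal(size=DIM)
    return _token_table[token]

def log_embedding(log_line):
    tokens = re.findall(r"[A-Za-z0-9_.]+", log_line)   # simple word segmentation
    return np.mean([token_embedding(t) for t in tokens], axis=0)

def template_centers(logs_by_template):
    # Traverse the logs under each template and average their embedded representations.
    return {tpl: np.mean([log_embedding(line) for line in logs], axis=0)
            for tpl, logs in logs_by_template.items()}

centers = template_centers({
    "conn_failed": ["connection to 10.0.0.1 failed", "connection to 10.0.0.9 failed"],
    "user_login":  ["user alice logged in", "user bob logged in"],
})
```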
3. The method of claim 1, wherein updating the model parameters according to the preset loss function comprises:
taking one log pair among the log pairs included in the training data of a training batch as a positive pair and the remaining log pairs as negative pairs, and determining the contrastive learning loss in a contrastive learning manner, wherein the contrastive learning loss aims to maximize the feature-space similarity of logs under the same log template and minimize the feature-space similarity of logs under different log templates;
determining the center clustering loss according to the logs under each log template and the clustering center of that log template, wherein the center clustering loss aims to maximize the feature-space similarity between each log and its clustering center;
and weighting the contrastive learning loss and the center clustering loss to obtain the preset loss function, and adjusting the model parameters according to the preset loss function.
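The following sketch shows one way to realize the weighted loss of claim 3, assuming an InfoNCE-style contrastive term over the in-batch pairs and a cosine-based center term; the weight `alpha` and temperature `tau` are hypothetical hyperparameters, not values from the patent.

```python
# Hypothetical sketch of claim 3: contrastive learning loss + center clustering loss,
# combined by a weight alpha. Assumes fixed-length token-id tensors per log.
import torch
import torch.nn.functional as F

def combined_loss(model, batch, centers, alpha=0.5, tau=0.1):
    # batch: list of (log_a_token_ids, log_b_token_ids, template_id), one pair per template
    a = F.normalize(model(torch.stack([p[0] for p in batch])), dim=1)   # (B, dim)
    b = F.normalize(model(torch.stack([p[1] for p in batch])), dim=1)   # (B, dim)

    # Contrastive term: the i-th pair is the positive, every other pair in the batch is a
    # negative, pulling same-template logs together and pushing different templates apart.
    sim = a @ b.t() / tau                              # (B, B) pairwise similarities
    targets = torch.arange(len(batch))
    contrastive = F.cross_entropy(sim, targets)

    # Center clustering term: maximize cosine similarity between each log and the
    # clustering center of its template (written as minimizing 1 - cosine similarity).
    c = F.normalize(torch.stack([centers[p[2]] for p in batch]), dim=1)
    center = (1 - (a * c).sum(dim=1)).mean() + (1 - (b * c).sum(dim=1)).mean()

    return alpha * contrastive + (1 - alpha) * center
```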
4. The method of claim 1, wherein obtaining the trained log recognition model after the preset training requirement is met comprises:
obtaining the trained log recognition model after the number of iterations reaches a preset number, the loss determined according to the preset loss function is smaller than a preset threshold, or the distance between the clustering center obtained in the current iteration and the clustering center obtained in the previous iteration is smaller than a preset threshold.
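As a rough illustration of claim 4, the check below treats the three conditions as alternative stopping criteria; the thresholds and names are assumptions made for this sketch only, and the cluster centers are assumed to be tensors as in the earlier sketches.

```python
# Hypothetical stopping check for claim 4; one plausible reading of the conditions.
def training_finished(iteration, loss, centers, prev_centers,
                      max_iters=100, loss_eps=1e-3, center_eps=1e-3):
    if iteration >= max_iters:
        return True                      # iteration count reached the preset number
    if loss < loss_eps:
        return True                      # loss fell below the preset threshold
    if prev_centers is not None:
        shift = max((centers[t] - prev_centers[t]).norm().item() for t in centers)
        if shift < center_eps:
            return True                  # cluster centers barely moved since last pass
    return False
```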
5. The method of any one of claims 1-4, further comprising, before inputting the training data set into the log recognition model:
for each type of log template, constructing log pairs from the logs under that log template;
sampling from the log pairs corresponding to the different types of log templates to construct a training data set comprising training data for a plurality of training batches.
6. The method of claim 5, wherein the sampling from the log pairs corresponding to the different types of log templates to construct a training data set comprising training data for a plurality of training batches comprises:
for each training batch, extracting, from the set of log templates, a number of log templates equal to the training batch size;
and extracting one log pair from the log pairs included under each extracted log template to obtain the training data of the training batch.
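A minimal sketch of the data preparation in claims 5 and 6: build log pairs per template, then form each training batch by drawing a batch-size number of distinct templates and one pair from each. The function names and the random seed are illustrative assumptions.

```python
# Hypothetical sketch of claims 5-6: construct log pairs per template and sample batches
# so that the pairs in one batch come from different templates.
import itertools
import random

def build_pairs(logs_by_template):
    return {tpl: list(itertools.combinations(logs, 2))
            for tpl, logs in logs_by_template.items() if len(logs) >= 2}

def sample_batches(pairs_by_template, batch_size, num_batches, seed=0):
    rng = random.Random(seed)
    templates = list(pairs_by_template)
    batches = []
    for _ in range(num_batches):
        chosen = rng.sample(templates, k=batch_size)   # batch_size distinct templates
        batch = []
        for tpl in chosen:
            a, b = rng.choice(pairs_by_template[tpl])  # one log pair from this template
            batch.append((a, b, tpl))
        batches.append(batch)
    return batches
```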
7. A method of clustering logs, comprising:
inputting the logs in a test set into a trained log recognition model, and outputting feature vectors of the logs, wherein the log recognition model is trained by the log recognition model training method of any one of claims 1-6;
and determining the similarity of two logs according to the semantic similarity of their feature vectors, and clustering the logs according to the similarity.
8. The method of claim 7, wherein the determining the similarity of two logs according to the semantic similarity of their feature vectors, and clustering the logs according to the similarity, comprises:
determining the semantic similarity of the feature vectors of the two logs by a selected similarity algorithm to determine the similarity of the two logs, wherein the similarity algorithm comprises at least one of a cosine similarity algorithm, an L1 similarity algorithm, and an L2 similarity algorithm;
clustering the logs in the test set according to the determined similarity by a selected clustering algorithm, wherein the clustering algorithm comprises at least one of an unsupervised hierarchical clustering algorithm, a K-means algorithm, and a DBSCAN algorithm.
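A minimal sketch of the clustering stage in claims 7-8, assuming the trained model has already produced one feature vector per test log and a recent scikit-learn is available; the cosine/agglomerative combination is just one of the options the claim allows, and the distance threshold is an illustrative value.

```python
# Hypothetical sketch of claims 7-8: cosine similarity between log feature vectors,
# then unsupervised hierarchical (agglomerative) clustering on the derived distances.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

def cluster_logs(feature_vectors, distance_threshold=0.3):
    # feature_vectors: (num_logs, dim) array output by the trained log recognition model
    similarity = cosine_similarity(feature_vectors)     # semantic similarity per log pair
    distance = np.clip(1.0 - similarity, 0.0, None)     # convert similarity to distance
    clustering = AgglomerativeClustering(
        n_clusters=None,                                 # let the threshold decide
        metric="precomputed",
        linkage="average",
        distance_threshold=distance_threshold,
    )
    return clustering.fit_predict(distance)

labels = cluster_logs(np.random.default_rng(0).normal(size=(10, 64)))   # dummy vectors
```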
9. A log clustering system, comprising: a server and a client;
the server is configured to provide means for implementing the log recognition model training method of any one of claims 1-6 and/or means for implementing the log clustering method of any one of claims 7-8;
the client is configured to collect logs, provide the logs to the server, and receive the log clustering results returned by the server.
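For claim 9, the sketch below shows one simple way a client could hand collected logs to the server and receive clustering results back over HTTP; Flask, the /cluster endpoint, the payload shape, and `run_log_clustering` are all assumptions made for the illustration, not part of the claimed system.

```python
# Hypothetical client/server exchange for claim 9; endpoint and payload are illustrative.
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_log_clustering(logs):
    # Placeholder: encode the logs with the trained log recognition model and cluster them.
    return [0 for _ in logs]

@app.route("/cluster", methods=["POST"])
def cluster_endpoint():
    logs = request.get_json()["logs"]          # logs collected and sent by the client
    return jsonify({"labels": run_log_clustering(logs)})

# Client side (separate process), e.g.:
#   import requests
#   resp = requests.post("http://server:5000/cluster", json={"logs": ["user alice logged in"]})
#   print(resp.json()["labels"])
```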
10. A computer storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the log recognition model training method of any one of claims 1-6 and/or the log clustering method of any one of claims 7-8.
11. A clustering device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the log recognition model training method of any one of claims 1-6 and/or the log clustering method of any one of claims 7-8 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310366197.XA CN116467141A (en) | 2023-03-31 | 2023-03-31 | Log recognition model training, log clustering method, related system and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310366197.XA CN116467141A (en) | 2023-03-31 | 2023-03-31 | Log recognition model training, log clustering method, related system and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116467141A true CN116467141A (en) | 2023-07-21 |
Family
ID=87184990
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310366197.XA Pending CN116467141A (en) | 2023-03-31 | 2023-03-31 | Log recognition model training, log clustering method, related system and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116467141A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117421595A (en) * | 2023-10-25 | 2024-01-19 | 广东技术师范大学 | System log anomaly detection method and system based on deep learning technology |
CN117633564A (en) * | 2023-11-29 | 2024-03-01 | 中国电子投资控股有限公司 | Command clustering method, device, medium and equipment based on incomplete subtree core |
Similar Documents
Publication | Title |
---|---|
CN111914090B (en) | Method and device for enterprise industry classification identification and characteristic pollutant identification | |
CN110069709B (en) | Intention recognition method, device, computer readable medium and electronic equipment | |
CN112989035B (en) | Method, device and storage medium for identifying user intention based on text classification | |
CN116467141A (en) | Log recognition model training, log clustering method, related system and equipment | |
CN109993236A (en) | Few sample language of the Manchus matching process based on one-shot Siamese convolutional neural networks | |
CN116049412B (en) | Text classification method, model training method, device and electronic equipment | |
CN112767106B (en) | Automatic auditing method, system, computer readable storage medium and auditing equipment | |
CN108287848B (en) | Method and system for semantic parsing | |
CN114386421A (en) | Similar news detection method and device, computer equipment and storage medium | |
CN110348516A (en) | Data processing method, device, storage medium and electronic equipment | |
CN112667979A (en) | Password generation method and device, password identification method and device, and electronic device | |
CN112966072A (en) | Case prediction method and device, electronic device and storage medium | |
CN115168590A (en) | Text feature extraction method, model training method, device, equipment and medium | |
CN114897085A (en) | Clustering method based on closed subgraph link prediction and computer equipment | |
CN113535928A (en) | Service discovery method and system of long-term and short-term memory network based on attention mechanism | |
CN117235137A (en) | Professional information query method and device based on vector database | |
CN113762005A (en) | Method, device, equipment and medium for training feature selection model and classifying objects | |
CN116955534A (en) | Intelligent complaint work order processing method, intelligent complaint work order processing device, intelligent complaint work order processing equipment and storage medium | |
Lim et al. | More powerful selective kernel tests for feature selection | |
CN115759085A (en) | Information prediction method and device based on prompt model, electronic equipment and medium | |
CN114722941A (en) | Credit default identification method, apparatus, device and medium | |
CN115186096A (en) | Recognition method, device, medium and electronic equipment for specific type word segmentation | |
CN111737469A (en) | Data mining method and device, terminal equipment and readable storage medium | |
CN112463964A (en) | Text classification and model training method, device, equipment and storage medium | |
Huang et al. | Subgraph generation applied in GraphSAGE deal with imbalanced node classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||