WO2022227388A1

WO2022227388A1 - Log anomaly detection model training method, apparatus and device

Info

Publication number: WO2022227388A1
Application number: PCT/CN2021/120446
Authority: WO
Inventors: 吴凡; 阿克尔·亚历山大; 维特科普·托尔斯滕·菲利普; 高·奥德伊
Original assignee: 华为技术有限公司
Priority date: 2021-04-29
Filing date: 2021-09-24
Publication date: 2022-11-03
Also published as: CN115269304A

Abstract

A log anomaly detection model training method, apparatus and device, applied to the field of artificial intelligence. The method comprises: obtaining a first log sample set, and pre-training an initial log anomaly detection model by means of the fist log sample set to obtain a pre-trained log anomaly detection model; then obtaining second log samples; and finely adjusting the pre-trained log anomaly detection model by means of a second sample set, such that a trained log anomaly detection model can be obtained, wherein the first log sample set is obtained by processing log data of a target object, the second log sample set is obtained by processing log data of a target sub-object, and the target sub-object belongs to the target object. The method can solve the problems existing in the prior art that a lot of manpower and time costs would be consumed and the efficiency is low due to the fact that due to the low generalization capability of the log anomaly detection model obtained by training, a user needs to train to obtain a new model when implementing anomaly detection of a log generated by a similar object.

Description

Log anomaly detection model training method, device and equipment

technical field

The present application relates to the field of artificial intelligence, and in particular, to a log anomaly detection model training method, device, and device.

Background technique

Hard disk drives (HDD), network devices (such as routers, switches, etc.) and processors will generate various logs during operation to record their own status and important events. Information can be represented through logs, so logs can be used for anomaly detection and troubleshooting.

At present, as shown in Figure 1, if a user wants to perform anomaly detection on the logs generated by a specific object (such as a certain type of memory produced by a certain manufacturer), he mainly obtains the historical logs generated by the specific object as training Samples are used to train the initial log anomaly detection model to obtain a trained log anomaly detection model with better detection effect for the logs of the specific object, and then use the trained model to implement anomaly detection for the logs generated by the specific object.

However, the log anomaly detection model trained by the above method has the problem of low generalization ability, which causes the user to perform anomaly detection on the logs generated by similar specific objects (such as the memory of another model produced by another manufacturer). The log anomaly detection model that has been trained cannot be used, and the historical logs generated by similar specific objects can only be re-acquired as training samples to train the initial log anomaly detection model, and log anomalies with better detection effect can be obtained for the logs generated for similar specific objects. Detecting models, for different specific objects, the process of retraining to obtain new models usually consumes a lot of manpower and time costs, and the efficiency is low.

SUMMARY OF THE INVENTION

The present application provides a log anomaly detection model training method, device and equipment, which can solve the problem that the log anomaly detection model obtained by training in the prior art has low generalization ability, which causes users to perform anomaly detection on logs generated by similar specific objects. , it is necessary to train a new log anomaly detection model for the specific object, which will cost a lot of manpower and time, and has low efficiency.

A first aspect provides a method for training a log anomaly detection model, the method comprising:

obtaining a first log sample set, wherein the first log sample set is obtained by processing log data of the target object;

Pre-training an initial log anomaly detection model by using the first log sample set to obtain a pre-trained log anomaly detection model;

acquiring a second log sample set, wherein the second log sample set is obtained by processing log data of a target sub-object, and the target sub-object belongs to the target object;

Fine-tune the pre-trained log anomaly detection model by using the second log sample set to obtain a trained log anomaly detection model.

In the above solution, by pre-training the initial log anomaly detection model with the first log sample set from the target object, a pre-trained log anomaly detection model with high-quality model parameters and strong generalization ability can be obtained. When the trained log anomaly detection model for the target sub-object is used, the second log sample set from the target sub-object is used to fine-tune the pre-trained log anomaly detection model with high-quality model parameters and strong generalization ability, i.e. Yes, compared with the prior art, the model training method provided by the present application can solve the problem that the log anomaly detection model obtained by training in the prior art has a low generalization ability, which causes the user to perform anomaly detection on logs generated by similar specific objects. When a new log anomaly detection model needs to be retrained for the specific object, it will cost a lot of manpower and time, and the efficiency will be low.

In a possible implementation manner, the target object includes at least one of the following sub-objects: hard disk, memory, flash memory, network device and processor, and the target sub-object is any one of the target objects Subobjects of the type.

In a possible implementation manner, the first log sample set includes m log samples, where m is a natural number greater than 1, and the initial log anomaly detection model is pre-trained by using the first log sample set to obtain Pre-trained log anomaly detection models, including:

Perform word segmentation on the m log samples respectively to obtain m word sequences corresponding to the m log samples;

Through the m word sequences, the initial log anomaly detection model is pre-trained to obtain a pre-trained log anomaly detection model.

In a possible implementation manner, the initial log anomaly detection model is pre-trained through the m word sequences to obtain a pre-trained log anomaly detection model, including:

Perform mask processing on words with preset proportions in the m word sequences, respectively, to obtain m word sequences after mask processing;

The initial log anomaly detection model is pre-trained through the masked m word sequences to obtain a pre-trained log anomaly detection model.

In the above scheme, the initial log anomaly detection model is pre-trained through the masked m word sequences to obtain a pre-trained log anomaly detection model, which can better learn the context information of the masked words. Thus, the pre-trained log anomaly detection model learns the semantic information of each word sequence, so that the trained log anomaly detection model obtained subsequently can detect whether the to-be-detected log is abnormal according to the semantic information of the to-be-detected log.

In a possible implementation, the initial log anomaly detection model is pre-trained with the m word sequences processed by the mask to obtain a pre-trained log anomaly detection model, including:

Obtain the word embedding vector and the position embedding vector corresponding to each word in the m word sequences after the mask processing, wherein, the word embedding vector corresponding to each word is used to represent each word. A multi-dimensional vector, the position embedding vector corresponding to each word represents the position of each word in the word sequence to which it belongs;

According to the word embedding vector and the position embedding vector corresponding to each word in the m word sequences after the mask processing, respectively, obtain m first row vectors corresponding to the m word sequences after the mask processing;

Using the m first row vectors, pre-train the initial log anomaly detection model to obtain a pre-trained log anomaly detection model.

In a possible implementation manner, using the m first row vectors to pre-train an initial log anomaly detection model to obtain the pre-trained log anomaly detection model, including:

The m first row vectors are respectively input into the initial log anomaly detection model for training to obtain m second row vectors;

Obtain the loss of the m second row vectors to the initial cluster center;

According to the loss of the m second row vectors to the initial cluster center, an initial log anomaly detection model is trained to obtain the pre-trained log anomaly detection model and a target cluster center.

In a possible implementation, the method further includes:

Obtain the percentile corresponding to the loss of the m second row vectors to the target cluster center;

A classification threshold is determined according to the percentile corresponding to the loss of the m second row vectors to the target cluster center, wherein the classification threshold is used for the trained log anomaly detection model to detect the to-be-detected Anomaly detection is performed on the log, and the detection result is obtained.

In a possible implementation manner, the formula for obtaining the loss from the m second row vectors to the initial cluster center is:

Wherein, V _i represents the ith second row vector among the m second row vectors, c represents the initial cluster center, and loss(c,V _i ) represents the ith second row vector to The loss of the initial cluster center, i is a natural number.

In a second aspect, a log anomaly detection model training device is provided, the device comprising:

an acquisition module, configured to acquire a first log sample set, wherein the first log sample set is obtained by processing log data of the target object;

a training module, which pre-trains the initial log anomaly detection model through the first log sample set to obtain a pre-trained log anomaly detection model;

The obtaining module is further configured to obtain a second log sample set, wherein the second log sample set is obtained by processing log data of a target sub-object, and the target sub-object belongs to the target object;

The training module is further configured to fine-tune the pre-trained log anomaly detection model through the second log sample set to obtain a trained log anomaly detection model.

In a possible implementation manner, the first log sample set includes m log samples, where m is a natural number greater than 1, and the training module is specifically used for:

In a possible implementation manner, the training module is specifically used for:

Obtain the word embedding vector and the position embedding vector corresponding to each word in the m word sequences after the mask processing, wherein the word embedding vector corresponding to each word is a multi-dimensional representation of each word. vector, the position embedding vector corresponding to each word represents the position of each word in the word sequence to which it belongs;

The m first row vectors are respectively input into the initial log anomaly detection model for training, and m second row vectors are obtained, wherein the m second row vectors and the m word sequences processed by the mask There is a one-to-one correspondence, and each second row vector in the m second row vectors includes the semantic information of the word sequence after its corresponding mask processing;

Obtain the loss of the m second row vectors to the initial cluster center;

According to the loss of the m second row vectors to the initial cluster center, an initial log anomaly detection model is trained to obtain a pre-trained log anomaly detection model and a target cluster center.

In a possible implementation manner, the training module is further used for:

In a third aspect, a non-transitory computer-readable storage medium is provided, and the non-transitory computer-readable storage medium stores instructions for implementing the first aspect or any possible implementation of the first aspect. method provided.

In a fourth aspect, a computing device is provided, the computing device includes a processor and a memory; the processor is configured to execute instructions stored in the memory, so that the computing device implements the first aspect or the first aspect above Methods provided by any possible implementation.

In a fifth aspect, a computer program product is provided, including a computer program, which, when the computer program is read and executed by a computing device, causes the computing device to perform the above-mentioned first aspect or any possible implementation of the first aspect method provided.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments.

Fig. 1 is the schematic diagram of a kind of prior art involved in this application;

2 is a schematic diagram of masking a word in an input sequence by a masked language model (MLM) method involved in the present application;

3 is a schematic diagram of the principle of a log anomaly detection model training method provided by the present application;

4 is a schematic flowchart of a method for training a log anomaly detection model provided by the present application;

5 is a schematic diagram of obtaining a word sequence corresponding to the i-th first log sample provided by the present application;

6 is a schematic diagram of a first row vector and a second row vector corresponding to a word sequence obtained after mask processing provided by the present application;

7 is a schematic diagram of an exemplary word embedding vector and position embedding vector provided by the present application;

8 is a schematic diagram of an exemplary word vector provided by the present application;

FIG. 9 is a schematic flowchart of a pre-trained log anomaly detection model provided by the present application;

Fig. 10 is the schematic diagram of using m first row vectors involved in the present application to pre-train the initial log anomaly detection model to obtain m second row vectors and target cluster centers;

11 is a schematic flowchart of a log anomaly detection method involved in the present application;

12 is a schematic diagram of anomaly detection obtained by performing anomaly detection on the first row vector x corresponding to the sequence to be detected involved in the present application;

13 is a schematic structural diagram of a log anomaly detection model training device provided by the present application;

FIG. 14 is a schematic structural diagram of a computing device provided by the present application.

Detailed ways

The embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

The terms "first" and "second" in the embodiments of the present application are only used for the purpose of description, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features. Thus, a feature defined as "first" or "second" may expressly or implicitly include one or more of that feature.

In the embodiments of the present application, "at least one" refers to one or more, and "multiple" refers to two or more. "And/or", which describes the association relationship of the associated objects, indicates that there can be three kinds of relationships, for example, A and/or B, which can indicate: the existence of A alone, the existence of A and B at the same time, and the existence of B alone, where A, B can be singular or plural. The character "/" generally indicates that the associated objects are an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of a single item(s) or a plurality of items(s). For example, at least one (a) of a, b or c may represent: a, b, c, a-b, a-c, b-c or a-b-c, wherein a, b, c may be single or multiple.

In order to facilitate the understanding of the embodiments of the present application by those skilled in the art, related concepts or terms and the like involved in the present application are first introduced.

(1) Logs are records generated by objects such as hard disks, network devices, processors, etc., which are used to indicate the status of hard disks, network devices, processors, etc. and what events have occurred. For example, hard drives generate logs when a failure occurs or when it is thought to fail.

The log is generally stored in the device in the form of a log file, and the log file may be a directly readable text file or a machine-readable binary file, or a file existing in other forms, which is not specifically limited in this application. Each log file consists of line-by-line log records, one or several consecutive records describe an independent event, and a log record describing an independent event can be called a log entry. A log file contains multiple log entries. A log entry usually contains the event time, event content, event type, event level, and so on.

Usually, the formats of log entries generated by different objects are different. For example, the format of log entries generated by device A is: event occurrence time, identification of the device accessing device A, and event content. A log entry generated by device A contains 20 characters. , the format of the log entry generated by device B is: the identifier of the device accessing device B, the event occurrence time and the event content, and a log generated by device B contains 50 characters.

(2) Neural network, which can be composed of neural units (also called neurons). The neural unit can refer to the operation unit that takes the variable x _s and the intercept b as input, and the output of the operation unit can be:

Among them, s=1, 2, ... n, n is a natural number greater than 1, W _s is the weight of x _s , and b is the bias of the neural unit. f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function and other functions, which are not limited here. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.

(3) Loss function, in the process of training the model, because it is hoped that the output of the model is as close as possible to the value you really want to predict, you can compare the predicted value of the current model with the target value you really want, and then according to the two to update the weight vector of each layer of the neural network in the model (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer in the model), for example, if the model's When the predicted value is high, adjust the weight vector to make the prediction lower, and keep adjusting until the model can predict the real desired target value or a value that is very close to the real desired target value. Therefore, it is necessary to pre-define "how to compare the difference between the predicted value and the target value", which is the loss function (loss function) or objective function (objective function), which are used to measure the difference between the predicted value and the target value. important function. Among them, taking the loss function as an example, the higher the output value of the loss function (loss), the greater the difference, then the training process of the model becomes the process of reducing the loss as much as possible.

(4) Pre-training (pre-training) refers to a process in which the initial model is trained by using a large data set, so that the initial model can learn to identify common features in the large data set, and the model is obtained by pre-training (hereinafter referred to as pre-training). Model) has a strong generalization ability, which can provide high-quality model parameters for subsequent model training on a specific data set, and can adapt to a variety of specific data sets.

(5) Fine-tuning refers to a process of further training a pre-trained model using a specific data set to obtain a trained model applied to the specific data set. Typically, the amount of data for a specific dataset used in the fine-tuning stage is smaller than the amount of data for a larger dataset used in the pre-training stage, e.g. The ratio of the amount is 1:100, 1:500, 1:1000, etc., which is not specifically limited here; the number of times that the pre-training model is trained using a specific data set in the fine-tuning stage is less than that using a large data set in the pre-training stage. The number of times of training, for example, in order to get a pre-trained model, set the number of times to train the initial model using a large dataset to 100 times in the pre-training phase. The number of times the model is trained is 20 times.

(6) MLM method, a pre-training method of the model, the model trained by this method can learn the semantic information of the input sequence, and the semantic information is stored in the vector output by the model corresponding to the word "CLS" in the input sequence. . This is because the strategy of masking the input sequence by MLM is different from that of traditional pre-training methods (such as masked sequence to sequence (MASS)). Words will be randomly selected 15% of the words, replace 80% of the randomly selected 15% words with the special symbol MASK, replace 10% of the randomly selected 15% of the words with random words, and the remaining 10% No replacement operation is performed.

For example, as shown in Figure 2, the input sequence is <CLS><raise><head><look><ming><month><low><head><think><gu><xiang>, the input sequence Including 11 words, 15% of the words randomly selected by MLM from the input sequence are the third word "head" and the seventh word "low" in the input sequence, MLM replaces the randomly selected input with the special symbol MASK The third word in the sequence, "head", replaces the seventh word "low" in the randomly chosen input sequence with the random word "person".

When pre-training the model, the advantage of masking the sequences input to the model for training is that it can improve the fault tolerance and inference accuracy of the pre-trained model.

For example, when pre-training the model, the first input sequence is <CLS><raise><head><look><ming><month><low><head><think><so>< Township>, that is, the sequence without mask processing, use the model to learn the first sequence, because each word in the first sequence is not masked, that is, each word in the first sequence are known, therefore, the model only needs to learn the words in the first sequence, without learning the context of the words in the first sequence, it can learn that the semantic information of the first sequence is "Looking up at the bright moon" Looking down and thinking about my hometown”, therefore, the resulting pre-trained model usually does not have the ability to reason about the semantic information of the sequence based on the context of the words in the sequence. If the pre-trained model is used to infer the semantic information of the second sequence <CLS><ju><wang><ming><month><head><thinking><xiang>, it can be seen that the second sequence is similar to Compared with the first sequence, the three words <head><lower><so> are missing. Since the pre-trained model does not have the ability to infer the semantic information of the sequence according to the context of the words in the sequence, the model inference obtains the second sequence It is less likely that the semantic information of the second sequence is "looking up at the bright moon and thinking about the hometown", and it is more likely that the semantic information of the second sequence is "homesick" or "looking at the bright moon and homesick", the fault tolerance of the model and the accuracy rate is lower.

If the model is pre-trained, the first input sequence is <CLS><raise><MASK><look><MASK><month><low><head><think><so><MASK>, that is After masking the sequence, use the model to learn the first sequence, because some words in the first sequence (ie <head><ming><township>) are masked, that is to say, part of the first sequence The word is unknown, and another part of the word is known. Therefore, the model not only needs to learn the known words in the first sequence, but also learn the masked words according to the context of the masked words in the first sequence. After learning and learning the masked words, the semantic information of the first sequence can be learned as "looking up at the bright moon and bowing your head and thinking about hometown". Therefore, the obtained pre-trained model usually has the ability to reason about the context of the words in the sequence. The ability to sequence semantic information. If the pre-trained model is used to infer the semantic information of the second sequence <CLS><ju><wang><ming><month><head><thinking><xiang>, it can be seen that the second sequence is similar to Compared with the first sequence, the two words <low> and <hence> are missing, and the two words <ming> and <township> are more, because the pre-trained model has the ability to infer the semantic information of the sequence according to the context of the words in the sequence , therefore, it is more likely that the semantic information of the second sequence of model inference is "looking up at the bright moon and bowing your head to think of hometown", and the semantic information of the second sequence obtained by its inference is the possibility of "homesickness" or "homesickness while looking at the bright moon" The smaller the model, the higher the fault tolerance and accuracy of the model.

(7) Log anomaly detection, which refers to detecting the event information included in the log entry, to determine whether the event information included in the log entry is the abnormal information of the device that generated the log entry, and determining whether the event information included in the log entry is the device In the case of abnormal information, it is determined that the device is abnormal.

(8) The initial log anomaly detection model refers to a model (also called an algorithm) that is not trained using log training samples. detected model.

With the continuous increase of the system scale, it is difficult to ensure the stability and reliability of large-scale systems. In addition, the network environment is becoming increasingly complex, and various new types of attacks are emerging. Anomaly detection is one of the supporting technologies to ensure system security. During the operation of the system, the system's hard disk, processor, and network devices that provide network services for the system will generate various log files to record the system operation status and events. The log contains rich information, and a large amount of log data contains The huge amount of information provides a way for system anomaly detection, making log anomaly detection a research hotspot in the field of anomaly detection. Among them, using log training samples to train a trained log anomaly detection model to perform log anomaly detection is a relatively popular log anomaly detection method at present.

However, the log anomaly detection model trained by the existing log anomaly detection model training method has the problem of low generalization ability, so that when the user implements anomaly detection on logs generated by similar specific objects, the user cannot use the already trained log anomaly detection model. The log anomaly detection model can only re-acquire the historical logs generated by similar specific objects as training samples to retrain the initial log anomaly detection model, and obtain a log anomaly detection model with better detection effect for logs generated by similar specific objects. The process of obtaining a new model usually consumes a lot of manpower and time, and is inefficient.

For example, if the user has already trained a trained log anomaly detection model A that has better detection effect on logs generated by the hard disk, if the user wants to perform anomaly detection on the logs generated by the memory, although the model A is generated for the hard disk The log generated by the memory has a good detection effect, but the log generated by the memory usually has a poor detection effect. Even if the user uses the historical log generated by the memory as a training sample to continue to train the model A, due to the low generalization ability of the model A, it is usually A log anomaly detection model with better detection effect for logs generated in memory cannot be obtained. Users can only re-acquire historical logs generated in memory as training samples, and use the re-acquired training samples to train the initial log anomaly detection model, and obtain a log anomaly detection model for memory-generated logs. The log has a well-trained log anomaly detection model with better detection effect.

For another example, in the case where the user has trained a trained log anomaly detection model B (hereinafter referred to as model B) that has a better detection effect on the logs generated by the hard disk of type B produced by manufacturer B (hereinafter referred to as the B hard disk) Next, if the user wants to perform anomaly detection on the logs generated by the hard disk of model C (hereinafter referred to as the hard disk C) produced by the manufacturer C, although the model B has a good detection effect for the logs generated by the hard disk B, the The detection effect of the logs generated by the hard disk is usually poor. Even if the user continues to train the model B using the historical logs generated by the C hard disk as the training samples, due to the low generalization ability of the model B, it is usually impossible to obtain the logs generated for the C hard disk. For the log anomaly detection model with detection effect, the user can only re-acquire the historical logs generated by the C hard disk as training samples, and use the re-acquired training samples to train the initial log anomaly detection model, and obtain the training with better detection effect for the logs generated by the C hard disk. Good log anomaly detection model.

In view of the problems existing in the above existing log anomaly detection model training methods, the present application provides a log anomaly detection model training method. As shown in FIG. 3 , the model training method provided by the present application includes two stages: pre-training and fine-tuning. In the training phase, the first log sample set from the target object (the target object includes multiple target sub-objects) can be used to pre-train the initial log anomaly detection model to obtain a model with high-quality model parameters and strong generalization ability. In the fine-tuning stage, the second log sample set from the target sub-object (the target sub-object belongs to the target object) is used to fine-tune the pre-trained log anomaly detection model to obtain a trained log anomaly detection model for the target sub-object. In view of the prior art, the model training method provided by the present application can solve the problem of low generalization ability of the log anomaly detection model obtained by training in the prior art, and achieve the purpose of improving the efficiency of model training.

Wherein, the target object includes but is not limited to at least one of the following sub-objects: hard disk, memory, flash memory, network device and processor; the target sub-object is any type of sub-object in the target object. It is easy to understand that the above-mentioned target object and target sub-object are merely exemplary examples, which are not specifically limited in this application.

Taking the target object including only the hard disk sub-object as an example, in the case that the target object includes a hard disk sub-object, the target sub-object may be hard disks of different models.

Taking the target object including two sub-objects of hard disk and memory as an example, when the target object includes two sub-objects of hard disk and memory, the target sub-object may be hard disk or memory.

It can be understood that when training a model, the more training samples are used, the better the parameters of the trained model, the wider the source of training samples used, and the stronger the generalization ability of the trained model. Therefore, In this embodiment, the target object may include as many sub-objects as possible, and the data volume of the first log sample set obtained from the target object may also be as large as possible, so that the parameters of the pre-trained log anomaly detection model obtained in this way are of better quality. , the generalization ability of the model is stronger.

In order to facilitate a clearer understanding of the log anomaly detection model training method provided by the present application, the method provided by the present application will be described in detail below with reference to the schematic flowchart shown in FIG. 4 . As shown in FIG. 4 , the method includes the following steps:

S401. The computing device acquires a first log sample set including m first log samples, where the first log sample set is obtained by processing log data of a target object.

From the above description of log entries, it can be seen that log entries include event occurrence time, event content, event type, event level, etc., among which the event content can reflect the event that occurred. By analyzing the event content, you can know the event that occurred. Whether the event is abnormal or not, the event time, event type, and event level have little effect on judging whether the event is abnormal.

Therefore, in a possible implementation manner, the computing device may obtain m first event contents from a large number of log entries included in the log data of a large number of target objects as m first log samples, and delete the event contents except for the event contents in the large number of log entries. other parts other than the event occurrence time, event type, event level, etc. The log data of a large number of target objects may be obtained by a computing device from crawlers on the Internet, or collected manually from the target objects, which is not limited herein.

For example, suppose the log data of the target object includes log entry A: 2021/06/03Thu 18:18:33PD_Vendor Done Check done,0xd ms.Flag 8ALL, where 2021/06/03Thu 18:18:33 is the event occurrence time, PD_Vendor Done Check done, 0xd ms.Flag 8 is the event content, ALL is the event level, then the first log sample obtained by the computing device from log entry A is PD_Vendor Done Check done, 0xd ms.Flag 8.

It can be understood that m first event contents are obtained from the log data of a large number of target objects as m first log samples, and the event occurrence time, event type, event level, etc. in the log data of a large number of target objects except the event contents are deleted. For other parts, the difference between the formats of the m first log samples can be shielded, so as to achieve the purpose of increasing the number of first log samples in the first log sample set.

It can be understood that the larger the value of m, the better the parameters of the pre-trained log anomaly detection model and the stronger the generalization ability of the model. Therefore, in the specific implementation, the computing device can obtain as many first log samples as possible and add them to the log anomaly detection model. The first log sample set. Among them, m is a natural number greater than 1.

It should be noted that the above-mentioned acquisition of m first event contents from log data of a large number of target objects as m first log samples is only an example, and should not be regarded as a specific limitation. In a specific implementation, the computing device may also select m log entries with the same format from the log data of a large number of target objects directly as m first log samples, or obtain m preset contents from the log data of a large number of target objects As the m first log samples, the preset content includes, in addition to the event content in the log entry, other content other than the event content in the log entry, such as event level and/or event type.

S402. The computing device performs word segmentation on the m first log samples, respectively, to obtain m word sequences corresponding to the m log samples.

Taking the computing device performing word segmentation on the i-th first log sample among the m first log samples as an example, the process for the computing device to obtain the word sequence corresponding to the i-th first log sample includes:

S4021. Perform word segmentation on the i-th first log sample to obtain a first word sequence.

Continue to take the i-th first log sample as PD_Vendor Done Check done, 0xd ms.Flag 8 as an example, the computing device performs word segmentation on the log sample, and the obtained first word sequence is:

<PD_VendorDone><Check><done><0xd><ms><Flag><8>, as shown in Figure 5.

S4022, in the case that the first word sequence includes a mixed word composed of numbers and characters, replace the mixed word composed of numbers and characters in the first word sequence with the word number to obtain a second word sequence, where the first word sequence does not In the case of including mixed words, no replacement operation is required, and the first word sequence is directly determined as the second word sequence.

Continuing to take the example in S4021 as an example, it can be seen that if the first word sequence <PD_VendorDone><Check><done><0xd><ms><Flag><8> includes the mixed word 0xd, the computing device will The second word sequence obtained after replacing the mixed word 0xd in the word sequence <PD_VendorDone><Check><done><0xd><ms><Flag><8> is:

<PD_VendorDone><Check><done><number><ms><Flag><8>, as shown in Figure 5.

It should be noted that the above-mentioned replacement of the compound word with number is only an example, and in a specific implementation, the compound word may also be replaced with other words such as num or sep, which is not specifically limited here.

S4023 , after obtaining the second word sequence, add a special classification mark, ie, a CLS mark, to the beginning of the sentence of the second word sequence to obtain a third word sequence.

Among them, the obtained CLS mark at the beginning of the third sequence sentence marks the beginning of the third word sequence.

Continuing to take the example in S4022 as an example, the computing device adds a CLS tag to the beginning of the sentence of the second word sequence <PD_VendorDone><Check><done><number><ms><Flag><8>, and the obtained third The word sequence is:

<CLS><PD_VendorDone><Check><done><number><ms><Flag><8>, as shown in Figure 5. S4024. After the third word sequence is obtained, if the number of words included in the third word sequence is less than the preset threshold, add a pad mark to the end of the sentence of the third word sequence to obtain a fourth word sequence, so that the fourth word sequence The number of included words is equal to the preset threshold, and when the number of words included in the third word sequence is equal to the preset threshold, the third word sequence is directly determined as the fourth word sequence, and the third word sequence is determined to be the fourth word sequence. When the number is greater than the preset threshold, the words at the end of the third word sequence are truncated to obtain a fourth word sequence, so that the number of words included in the fourth word sequence is equal to the preset threshold.

Continuing to take the example in S4023 as an example, assuming that the preset threshold is 10, it can be seen that the third word sequence <CLS><PD_VendorDone><Check><done><number><ms><Flag><8> includes The number of words in 8 is 8, which is less than the preset threshold of 10, then the computing device adds two words at the end of the sentence of the third word sequence <CLS><PD_VendorDone><Check><done><number><ms><Flag><8> pad mark, the obtained fourth word sequence is:

<CLS><PD_VendorDone><Check><done><number><ms><Flag><8><pad><pad>, as shown in Figure 5.

It should be noted that the above-mentioned replacement of mixed words with pad is only an example. In specific implementation, other words such as PAD or pa can also be used to replace mixed words, which are not specifically limited here; the above preset threshold of 10 is only As an example, in a specific implementation, it may also be 20, 50, etc., which is not specifically limited here.

S4025. After obtaining the fourth word sequence, use a preset dictionary to convert each word in the fourth word sequence to obtain a fifth word sequence, that is, the word sequence corresponding to the i-th first log sample.

Wherein, the preset dictionary includes a large number of words and the correspondence between the large number of words and their corresponding token (identification, ID for short), for example, the word "Check", the token ID "6" and the word " Correspondence between Check" and token ID "6".

Continue to take the example in S4024 as an example, assuming that the token ID corresponding to CLS included in the preset dictionary is 1, the token ID corresponding to pad is 0, the token ID corresponding to PD_VendorDone is 5, and the token ID corresponding to Check is 5. is 6, the token ID corresponding to done is 7, the token ID corresponding to number is 4, the token ID corresponding to ms is 8, the token ID corresponding to Flag is 9, the token ID corresponding to 8 is 10, and the token ID corresponding to pad is 10. The token ID is 0, then the computing device uses the preset dictionary to perform the fourth word sequence <CLS><PD_VendorDone><Check><done><number><ms><Flag><8><pad><pad> Conversion, the fifth word sequence obtained is:

<1><5><6><7><4><8><9><10><0><0>, as shown in Figure 5.

In a specific embodiment of the present application, when using a preset dictionary to convert words in the fourth word sequence, if there is no certain word in the fourth word sequence and a token corresponding to the word in the preset dictionary In the case of ID, the word and the token ID corresponding to the word can be added to the preset dictionary.

For example, if the maximum token ID included in the default dictionary is "100000", and the token ID of the word "identification" is not included in the default dictionary, the word "identification" and the word "identification" can be added to the default dictionary. The token ID corresponding to the word is "100001" or "100008", etc.

It should be noted that the above process of obtaining the word sequence corresponding to the i-th first log sample is only an example. In a specific implementation, S4023 can be executed before S4022, or S4024 can be executed before S4022, There is no specific limitation here.

It can be understood that when the computing device performs steps S4021 to S4025 for each of the m first log samples, the computing device can obtain m word sequences, and the words included in the m word sequences are equal in number.

S403. The computing device pre-trains the initial log anomaly detection model through m word sequences, to obtain a pre-trained log anomaly detection model.

In a specific embodiment of the present application, the method for pre-training the initial log anomaly detection model by the computing device through m word sequences may be the MLM method or the MASS method. Here, no specific limitation is made.

Taking the pre-training method adopted as the MLM method as an example, the computing device pre-trains the initial log anomaly detection model, and the process of obtaining the pre-trained log anomaly detection model may specifically include the following steps:

S4031. The computing device respectively performs mask processing on words in a preset proportion in the m word sequences, and obtains m word sequences after mask processing.

The preset ratio may be 10%, 15%, 20%, and the like.

For the operation that the computing device respectively performs mask processing on the words in the m word sequences with preset proportions, reference may be made to the relevant description of the MLM method above.

S4032: The computing device pre-trains the initial log anomaly detection model through the masked m word sequences, to obtain a pre-trained log anomaly detection model.

In a specific embodiment of the present application, the computing device pre-trains the initial log anomaly detection model through the masked m word sequences, and the specific process of obtaining the pre-trained log anomaly detection model may include the following steps:

S1. The computing device separately obtains a word embedding vector and a position embedding vector corresponding to each word in the m word sequences after mask processing.

The word embedding vector corresponding to each word above is a multi-dimensional vector used to represent each word. Among them, word embedding vector is a general term for a set of language modeling and feature learning technologies in the field of natural language processing. Convert to a multidimensional vector. In the specific implementation, the word embedding vector corresponding to each word can be obtained by one-hot encoding, or it can be obtained by word-to-vector (Word2Vec) model or glove (Glove) model. The dimension of the word embedding vector may be 256 dimensions or 512 dimensions, and may also be more or less dimensions, which are not specifically limited here.

Take the i-th word sequence after mask processing as <1><MASK><6><7><4><9><9><10><0><0> as an example, as shown in Figure 6 , assuming that the method of obtaining the word embedding vector corresponding to each word in the i-th word sequence after mask processing is the Word2Vec model, and the dimension of the word embedding vector is 5 dimensions, then the obtained mask-processing i-th word The word "1" in the sequence can be a 5-dimensional vector Vi _,1 (-0.065, -0.035, 0.019, -0.026, 0.085), and the word "MASK" can be a 5-dimensional vector Vi _,2 (0.000, 0.000 , 0.000, 0.000, 0.000), ..., the word "0" at the end of the sentence can be a 5-dimensional vector Vi _,10 (-0.027, -0.013, 0.006, 0.023, 0.014), as shown in Figure 7. Among them, the t in the subscript of the 5-dimensional vector V _i, t represents the position of each word in the i-th word sequence.

It should be noted that the above-exemplified value in the word embedding vector includes three decimal places only as an example, and in a specific implementation, it may include fewer or more decimal places, which is not specifically limited herein.

The position embedding vector corresponding to each word above is used to represent the position of each word in the word sequence, and its dimension is the same as that of the word embedding vector.

In a possible implementation, the position embedding vector of a word can be obtained by the following formula:

PE(pos,2j)=sin(pos/10000 ^(2j/dmodel) )

PE(pos,2j+1)=cos(pos/10000 ^(2j/dmodel) )

Among them, PE() represents the position embedding vector, pos represents the position of the word in the word sequence, and its value range is [0, the number of words included in the word sequence), d _model represents the dimension of the position embedding vector, and 2j represents the position The even-numbered dimension index of the embedding vector, 2j+1 represents the odd-numbered dimension index of the position embedding vector, and the dimension d _model of the position embedding vector is 5 as an example. When the dimension d _model of the position embedding vector is 5, j takes 0 respectively. , 1, 2, when j is 0, the calculated value of PE(pos, 2j) (ie PE(pos, 0)) is the value of the zeroth dimension of the embedded vector at this position, PE(pos, 2j+ 1) (ie PE(pos, 1)) is the value of the first dimension of the embedded vector at this position. When j is 1, the calculated PE(pos, 2j) (ie PE(pos, 2)) The value of is the value of the second dimension of the embedding vector at this position, and the calculated value of PE(pos,2j+1) (ie PE(pos,3)) is the value of the third dimension of the embedding vector at this position, When j is 2, the calculated value of PE(pos, 2j) (that is, PE(pos, 4)) is the value of the fourth dimension of the embedded vector at the position.

Continue to use the masked i-th word sequence shown in Figure 6 as <1><MASK><6><7><4><9><9><10><0><0>, For example, the dimension d _model of the obtained position embedding vector is 5. The position embedding vector V _i,1 ' corresponding to the word "1" in the i-th word sequence after mask processing obtained by the above formula is:

((PE(0,2*0)=sin(0/10000 ^(2*0/5) ), (PE(0,2*0+1)=cos(0/10000 ^(2*0/5) ) , (PE(0,2*1)=sin(0/10000 ^(2*1/5) ), (PE(0,2*1+1)=cos(0/10000 ^(2*1/5) ) , (PE(0,2*2)=sin(0/10000 ^(2*2/5) )), namely (0.000, 1.000, 0.000, 1.000, 0.000), as shown in Figure 7;

The position embedding vector V _i,2 ' corresponding to the word "MASK" in the i-th word sequence after mask processing obtained by the above formula is:

((PE(1,2*0)=sin(1/10000 ^(2*0/5) ), (PE(1,2*0+1)=cos(1/10000 ^(2*0/5) ) , (PE(1,2*1)=sin(1/10000 ^(2*1/5) ), (PE(1,2*1+1)=cos(1/10000 ^(2*1/5) ) , (PE(1,2*2)=sin(1/10000 ^(2*2/5) )), that is (0.842, 0.540, 0.025, 1.000, 0.001), as shown in Figure 7; ...;

The corresponding position embedding vector V _i,10 ' of the "0" at the end of the sentence is:

((PE(9,2*0)=sin(9/10000 ^(2*0/5) ), (PE(9,2*0+1)=cos(9/10000 ^(2*0/5) ) , (PE(9,2*1)=sin(9/10000 ^(2*1/5) ), (PE(9,2*1+1)=cos(9/10000 ^(2*1/5) ) , (PE(9,2*2)=sin(9/10000 ^(2*2/5) )), namely (0.412, -0.911, 0.224, 0.975, 0.006), as shown in FIG. 7 .

It should be noted that the numerical value in the position embedding vector exemplified above includes three decimal places only as an example, and in a specific implementation, it may include fewer or more decimal places, which is not specifically limited herein.

In another possible implementation, the position embedding vector of a word can be obtained by the following formula:

PE(pos,2j)=sin(pos/10000 ^(2j/dmodel) )

PE(pos,2j+1)=cos(pos/10000 ^{(2j+1/dmodel)} )

S2. The computing device obtains m first row vectors corresponding to the masked m word sequences according to the word embedding vector and the position embedding vector corresponding to each word in the masked m word sequences, respectively.

Specifically, the computing device can obtain each word in each masked word sequence by superimposing the word embedding vector and the position embedding vector corresponding to each word in each masked word sequence The corresponding word vector, so as to obtain the first row vector corresponding to each masked word sequence, and other ways to obtain the first row vector through the word embedding vector and the position embedding vector are also within the scope of protection of this application. No specific restrictions are imposed.

Continue to take the word embedding vector and position embedding vector corresponding to each word in the i-th word sequence after mask processing shown in Figure 6 as an example, then the word vector V _i,1 " corresponding to the word "1" is :

As shown in Figure 8;

The word vector V _i,2 " corresponding to the word "MASK" is:

As shown in Figure 8;...;

The word vector V _i,10 " corresponding to the word "0" at the end of the sentence is:

As shown in Figure 8.

After obtaining the word vector corresponding to each word in the i-th word sequence after mask processing, the combination of word vectors corresponding to all words in the i-th word sequence after mask processing is the result of mask processing. The first row vector V _i ' corresponding to the i-th word sequence.

In this embodiment, the process of obtaining the first row vector corresponding to each word sequence in the m word sequences after mask processing is the same as obtaining the first row vector V _i corresponding to the ith word sequence after mask processing. ' The process is similar, for details, please refer to the above related description, which will not be repeated here.

S3. The computing device uses m first row vectors to train an initial log anomaly detection model to obtain a pre-trained log anomaly detection model.

In a specific embodiment of the present application, the computing device uses m first row vectors to train an initial log anomaly detection model, and the specific process of obtaining a pre-trained log anomaly detection model is shown in Figure 9, which may include the following steps:

A1. Input the m first row vectors into the initial log anomaly detection model for training, and obtain m second row vectors corresponding to the m word sequences after mask processing.

The second row vector corresponding to the masked word sequence represents a vector including semantic information of the masked word sequence, and the second row vector corresponding to each word sequence is the CLS mark corresponding to each word sequence. The output of the initial log anomaly detection model.

Continue to use the ith word sequence <1><MASK><6><7><4><9><9><10><0><0> after the mask processing shown in Figure 6. One row vector V _i ' is taken as an example, if the first row vector V _i ' is input into the initial log anomaly detection model for training, the second row vector V _i can be obtained, as shown in Figure 6, the second row vector V _i includes the mask. The semantic information of the i-th word sequence after code processing is "The supplier information has been checked".

A2. Randomly obtain any second row vector of the m second row vectors as the initial cluster center c of the normal log class.

A3. Calculate the loss from each of the m second row vectors to the initial cluster center c, respectively.

In a specific embodiment of the present application, the loss from the ith second row vector V _i to the initial cluster center c can be obtained through the following loss function:

in,

Represents the square root of the value of each dimension of the second row vector V _i .

A4. According to the loss from the ith second row vector to the initial cluster center c, determine whether the ith second row vector can be assigned to the normal log class, and then determine whether the ith second row vector can belong to the normal log class In the case of , assign the ith second row vector to the normal log class.

Specifically, it can be determined that the ith second row vector can be assigned to the normal log class when the loss (c, V _i ) from the ith second row vector to the initial cluster center c is less than the first classification threshold , otherwise, it is determined that the i-th second row vector cannot be assigned to the normal log class. The first classification threshold may be set by the user according to the actual situation.

A5. After assigning the second row vector of the m second row vectors that can be assigned to the normal log class to the normal log class, recalculate the centroid of the normal log class, and determine the calculated centroid as the new normal log class. Cluster center c ₁ .

A6. Repeat steps A3 to A5 until a termination condition is reached, and a pre-trained log anomaly detection model is obtained.

The termination condition may be the maximum number of iterations, the minimum square error, the rate of change of the cluster center point, etc., which are not specifically limited here.

When the termination condition is reached, the normal log class and the centroid of the normal log class no longer change. Here, the centroid of the normal log class that no longer changes is represented by C, that is, the target cluster center described below.

As shown in Figure 10, the m first row vectors are input into the initial log anomaly detection model for training, and the m second row vectors obtained are finally divided into the second row vectors belonging to the normal log category (refer to Figure 10). The vector inside the circle) and the second row vector that does not belong to the normal log class (referring to the vector outside the circle shown in Figure 10), the centroid of the normal log class is C.

Although the pre-trained log anomaly detection model already has high-quality model parameters, since the pre-trained log anomaly detection model is trained using the first log sample set from the target object, if the model is directly used to generate the target sub-object If the parameters of the model are not accurate enough, the accuracy of the obtained detection results will be low. Therefore, the training sample set from the target sub-object (that is, the first set of n second log samples described below) can be further used. Second log sample set), fine-tune the pre-trained log anomaly detection model to obtain more accurate model parameters, and when the parameters of the model are more accurate, perform anomaly detection on the logs generated by the target sub-object, and the obtained detection results are also would be more accurate.

It can be understood that since the pre-trained log anomaly detection model already has high-quality model parameters, when using the second training sample set from the target sub-object to fine-tune the pre-trained log anomaly detection model, the second training sample set only More accurate model parameters can be obtained by including a small number of second training samples, and the process of fine-tuning to obtain more accurate model parameters only requires a small amount of labor and time cost, and the model training efficiency is high.

S404. The computing device acquires a second log sample set including n second log samples, where the second log sample set is obtained by processing log data of the target sub-object.

The log data of a large number of target sub-objects may be historical logs generated by the target sub-objects.

It can be understood that the larger the value of n, the more accurate the parameters of the trained log anomaly detection model. Therefore, in a specific implementation, the computing device can obtain as many second log samples as possible and add them to the second log sample set. Among them, n is a natural number greater than 1, and n is usually less than m.

S405. The computing device performs word segmentation on the n second log samples, respectively, to obtain n word sequences in the second log.

S406. The computing device fine-tunes the pre-trained log anomaly detection model through n word sequences to obtain a trained log anomaly detection model.

It can be understood that the model parameters of the trained log anomaly detection model obtained by fine-tuning are more accurate than those of the pre-trained log anomaly detection model. Subsequently, the trained log anomaly detection model is used to perform anomaly detection on the logs generated by the target sub-object. The test results obtained will also be more accurate.

In this embodiment, the process for the computing device to acquire the second log sample set including n second log samples is similar to the process for the computing device to acquire the first log sample set including m first log samples in S401. Reference can be made to the relevant description in S401; the computing device performs word segmentation on the n second log samples respectively, and the process of obtaining n word sequences corresponding to the n second log samples is the same as the computing device in S402. The process of obtaining m word sequences corresponding to m first log samples is similar. For details, please refer to the relevant description in S402; the computing device fine-tunes the pre-trained log anomaly detection model through n word sequences to obtain the training The process of a good log anomaly detection model is similar to the process in S403 that the computing device pre-trains the initial log anomaly detection model through m word sequences, and obtains a pre-trained log anomaly detection model. For details, please refer to the relevant information in S403. describe.

In a specific embodiment of the present application, when the computing device obtains the target cluster center C, it can also calculate the log anomaly detection model used for training according to the target cluster center C and m second row vectors. The second classification threshold for anomaly detection of logs to be detected.

Further, according to the target cluster center C and m second row vectors, the process of calculating the second classification threshold by the computing device may include:

B1. The computing device obtains the loss of m second row vectors to the target cluster center C.

The process of obtaining the loss from m second row vectors to the target cluster center C by the computing device is similar to the above-mentioned process of obtaining the loss from m second row vectors to the initial cluster center c. For details, please refer to the above related description.

B2. The computing device obtains the percentile corresponding to the loss of the m second row vectors to the target cluster center C.

Among them, percentile is a statistical term. If a set of data is sorted from small to large and the corresponding cumulative percentile is calculated, the value of the data corresponding to a percentile is called the percentile percentile. For example, a value at the 80th percentile is called the 80th percentile.

Therefore, the computing device obtains the percentiles corresponding to the losses of the m second row vectors to the target cluster center C, that is, the computing device sorts the losses from the m second row vectors to the target cluster center C from small to large, and Calculate the corresponding cumulative percentile.

B3. The computing device determines the target percentile according to the percentile corresponding to the loss of the m second row vectors to the target cluster center C.

B4. The computing device determines the second classification threshold according to the target percentile.

In a specific embodiment of the present application, the second classification threshold T can be determined by the following formula:

T=P·β

Among them, P represents the value at the target percentile, β is used to expand the distance around the target cluster center C, P and β can be based on the number of normal samples and the number of abnormal samples in the m first log samples, For example, when the number of normal samples in the m first log samples is much larger than the number of abnormal samples (for example, the number of normal samples and the number of abnormal samples are 10000:1 or 5000:1, etc.), the target percentile The value of the number can be as large as possible, such as 90%, 95%, etc., and the value of β can be 1.8, 2.0, 2.5, etc. In the m first log samples, the number of normal samples is close to the number of abnormal samples (such as normal samples). When the number of β and the number of abnormal samples is 500:1 or 100:1), the value of the target percentile can be close to 50%, such as 45%, 51%, etc., and the value of β can be 1.2, 1.5 etc.; the second classification threshold T is used for the process of using the trained log anomaly detection model to perform anomaly detection on the log to be detected, please refer to the relevant description in FIG. 11 .

It can be seen that the present application calculates the second classification threshold T according to the target cluster center C and m second row vectors, and when calculating the second classification threshold T, considers the m first log samples The value of the target percentile and the value of β are taken on the basis of the number of normal samples and the number of abnormal samples, instead of manually setting the classification threshold based on experience as in the prior art, and the manually set classification threshold is too large Or if it is too small, it will have a great impact on the accuracy of the trained log anomaly detection model. For example, when the manually set classification threshold is too large, this will make the trained log anomaly detection model have a high chance of mistaking anomalies. Logs are classified as normal logs. When the manually set classification threshold is too small, the trained log anomaly detection model has a high probability of misclassifying normal logs as abnormal logs. Therefore, by determining the second classification threshold T by the method provided in this application, the accuracy of anomaly detection performed by the trained log anomaly detection model can be improved.

In a specific implementation, after using the log anomaly detection model training method provided in this application to train a trained log anomaly detection model, the trained log anomaly detection model can be deployed to the target sub-object, and the trained log anomaly detection model can be used to detect The model performs anomaly detection on the logs to be detected generated by the target sub-object.

Please refer to FIG. 11 . FIG. 11 is a schematic flowchart of an exemplary process of using a trained log anomaly detection model to perform anomaly detection on a log to be detected of a target sub-object. As shown in FIG. 11 , the detection process includes:

S111. Obtain the log entry to be detected generated by the target sub-object.

It can be understood that the log entry to be detected here is the log to be detected as described above.

S112. Acquire the content of the event to be detected from the log entry to be detected.

S113. Perform word segmentation on the content of the event to be detected to obtain a sequence of words to be detected corresponding to the content of the event to be detected.

S114: Obtain the first row vector corresponding to the word sequence to be detected.

S115: Input the first row vector corresponding to the word sequence to be detected into the trained log anomaly detection model to perform anomaly detection, and obtain a detection result.

In a specific embodiment of the present application, the first row vector corresponding to the word sequence to be detected is input into the trained log anomaly detection model for anomaly detection, and the specific process of obtaining the detection result may include the following steps:

S1151. Input the first row vector corresponding to the word sequence to be detected into the trained log anomaly detection model to obtain the second row vector corresponding to the word sequence to be detected.

S1152: Obtain the loss from the second row vector corresponding to the word sequence to be detected to the cluster center C'.

Among them, the cluster center C' represents the centroid of the normal log class that no longer changes when the trained log anomaly detection model is obtained in the fine-tuning stage, and the n second row vectors corresponding to the n second log samples are clustered.

The loss (C',X) of the second row vector corresponding to the word sequence to be detected to the cluster center C' can be obtained by the following formula:

Wherein, X represents the second row vector corresponding to the word sequence to be detected.

S1153. Determine whether the loss (C', X) is less than the second classification threshold T, and if the loss (C', X) is determined to be less than the second classification threshold T, execute S1154, and after determining the loss (C', When X) is greater than or equal to the second classification threshold T, execute S1155.

In a specific implementation, it is also possible to determine whether the loss (C', X) is less than or equal to the second classification threshold T, and in the case where it is determined that the loss (C', X) is less than or equal to the second classification threshold T, execute S1154 , if it is determined that loss(C', X) is greater than the second classification threshold T, execute S1155.

S1154. Determine that there is no device abnormality information in the log entry to be detected. The device refers to a device that generates the log entry to be detected.

S1155. Determine that the log entry to be detected contains information about a device abnormality.

As shown in Figure 12, input the first row vector x corresponding to the word sequence to be detected into the trained log anomaly detection model for anomaly detection, and the obtained anomaly detection result includes the relative aggregation of the second row vector x corresponding to the word sequence to be detected. The loss (C', X) of the class center C', assuming that the loss (C', X) is 5, the second classification threshold T is 8, and the loss (C', X) is less than the second classification threshold T, then The trained log anomaly detection model will attribute the second row of vector X to the normal log category, and output the detection result that there is no device anomaly information in the log entry to be detected.

It should be noted that, in the above example, the trained log anomaly detection model outputs the detection result that there is no device anomaly information in the log entry to be detected only as an example. In a specific implementation, the output detection result can also be "The device is normal", etc., and no specific restrictions are made here.

In this embodiment, the definitions of the word sequence to be detected, the first row vector corresponding to the word sequence to be detected, etc. are the same as the definitions of the word sequence, the first row vector, etc. in the embodiment of FIG. 4 . For details, please refer to FIG. 4 The relevant content in the illustrated embodiment will not be described again here. In this embodiment, the process of performing word segmentation on the content of the event to be detected to obtain the sequence of words to be detected corresponding to the content of the event to be detected is the same as the computing device in S402 performing word segmentation on m first log samples, and obtaining m first log samples corresponding to The process of the m word sequences is similar, and for details, please refer to the relevant description in S402; the process of obtaining the first row vector corresponding to the word sequence to be detected is the same as the process of obtaining the m word sequences after mask processing by the computing device in S403. The process of the m first row vectors is similar, and for details, please refer to the relevant description in S403.

It should be noted that when obtaining the first row vector corresponding to the word sequence to be detected, it is not necessary to perform mask processing on the word sequence to be detected, and the word embedding vector and position embedding vector corresponding to each word in the word sequence to be detected can be directly obtained. , so as to obtain the first row vector corresponding to the word sequence to be detected.

It should be noted that although the above description of the log anomaly detection model training method provided by this application takes computing devices as the execution subject, in specific implementation, the log anomaly detection model training method provided by this application is the execution subject. It can also be a computing device cluster including at least two computing devices, and at least two computing devices in the computing device cluster can cooperate to implement the log anomaly detection model training method provided by this application. In the case of computing device B, step S401 is performed by computing device A, steps S402 to S406 are performed by computing device B, or steps S401 to S403 are performed by computing device A, and steps S404 and S406 are performed by computing device A and computing device B jointly execute.

A log anomaly detection model training method provided by the present application is described in detail above. Based on the same inventive concept, the log anomaly detection model training device provided by the present application will be described below.

Referring to FIG. 13, FIG. 13 is a schematic structural diagram of a log anomaly detection model training device 100 provided by the present application. The device 100 includes: an acquisition module 110 and a training module 120, wherein,

an obtaining module 110, configured to obtain a first log sample set, wherein the first log sample set is obtained by processing log data of the target object;

The training module 120 pre-trains the initial log anomaly detection model by using the first log sample set to obtain a pre-trained log anomaly detection model;

The obtaining module 110 is further configured to obtain a second log sample set, wherein the second log sample set is obtained by processing log data of a target sub-object, and the target sub-object belongs to the target object;

The training module 120 is further configured to fine-tune the pre-trained log anomaly detection model by using the second log sample set to obtain a trained log anomaly detection model.

In a possible implementation manner, the first log sample set includes m log samples, where m is a natural number greater than 1, and the training module 120 is specifically used for:

In a possible implementation manner, the training module 120 is specifically used for:

Obtain the loss of the m second row vectors to the initial cluster center;

In a possible implementation manner, the training module 120 is further configured to:

Specifically, for the specific implementation of various operations performed by the above log abnormality detection model training apparatus 100, reference may be made to the description in the relevant content in the above embodiment of the log abnormality detection model training method.

It should be understood that the log anomaly detection model training apparatus 100 is only an example provided by the embodiment of the present application, and the log anomaly detection model training apparatus 100 may have more or less components than those shown in FIG. 13 , and may combine two one or more components, or may be implemented with different configurations of components.

The log anomaly detection model training apparatus 100 provided in this application can be applied to various computing devices such as cloud servers, personal computers, and terminal devices, and can also be applied to a computing device cluster including at least two computing devices. The following applies to one computing device Described as an example.

Referring to FIG. 14 , FIG. 14 is a schematic structural diagram of a computing device 200 provided by the present application. The computing device 200 includes: a processor 210 , a memory 220 and a communication interface 230 , wherein one of the processor 210 , the memory 220 and the communication interface 230 is They can be connected to each other through the bus 240 . in,

The processor 210 can read the program code (including instructions) stored in the memory 220, and execute the program code stored in the memory 220, so that the computing device 200 executes the steps in the log anomaly detection model training method provided by the above method embodiments, or makes The computing device 200 deploys the log anomaly detection model training apparatus 100 .

The processor 210 may have various specific implementation forms, for example, the processor 210 may be a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), etc., and the processor 210 may also be a single-core processor or multi-core processors. The processor 210 may be a combination of a CPU and a hardware chip. The above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof. The above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (generic array logic, GAL) or any combination thereof. The processor 210 can also be independently implemented by a logic device with built-in processing logic, such as an FPGA or a DSP.

The memory 220 may store program codes and program data. The program code includes: the code of the acquisition module 110 and the code of the training module 120, etc., and the program data includes: the first log sample set, the second log sample set, the word sequence before mask processing and the word sequence after mask processing and many more.

In practical applications, the memory 220 may be a non-volatile memory, such as a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (erasable). PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or flash memory. The memory 220 may also be volatile memory, which may be random access memory (RAM), which acts as an external cache.

Communication interface 230 may be a wired interface (eg, an Ethernet interface) or a wireless interface (eg, a cellular network interface or using a wireless local area network interface) for communicating with other computing nodes or devices. When the communication interface 230 is a wired interface, the communication interface 230 may use a protocol family above transmission control protocol/internet protocol (TCP/IP), for example, remote function call (RFC) protocol, simple object access protocol (SOAP) protocol, simple network management protocol (SNMP) protocol, common object request broker architecture (CORBA) protocol, and distributed protocols and many more.

The bus 240 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA for short) bus or the like. The bus 240 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is shown in FIG. 14, but it does not mean that there is only one bus or one type of bus.

The above computing device 200 is configured to execute the method described in the above embodiment of the log anomaly detection model training method, which belongs to the same concept as the above method embodiment, and the specific implementation process is detailed in the above method embodiment, which will not be repeated here.

The computing device 200 deploys the functional modules of the log anomaly detection model training apparatus 100, see the apparatus embodiment shown in FIG. 13 .

It should be understood that the computing device 200 is only an example provided by the embodiments of the present application, and the computing device 200 may have more or less components than those shown in FIG. 14 , two or more components may be combined, or Different configurations of components are possible.

The present application also provides a non-transitory computer-readable storage medium, where instructions are stored in the non-transitory computer-readable storage medium, and when the instructions are run, part of the log anomaly detection model training method described in the above embodiment or all steps.

The present application also provides a computer program product, when the computer program product is read and executed by a computer, part or all of the steps of the log anomaly detection model training method described in the above method embodiments can be implemented.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

In the above-mentioned embodiments, it may be implemented in whole or in part by software, hardware or any combination thereof. When implemented in software, it can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general purpose computer, special purpose computer, computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be downloaded from a website site, computer, server or data center Transmission to another website site, computer, server, or data center by wire (eg, coaxial cable, optical fiber, digital subscriber line) or wireless (eg, infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes an integration of one or more available media. The usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media, or semiconductor media, and the like.

The steps in the method of the embodiment of the present application may be sequentially adjusted, combined or deleted according to actual needs; the units in the device of the embodiment of the present application may be divided, combined or deleted according to actual needs.

The embodiments of the present application have been introduced in detail above, and specific examples are used to illustrate the principles and implementations of the present application. The descriptions of the above embodiments are only used to help understand the methods and core ideas of the present application; at the same time, for Persons of ordinary skill in the art, according to the idea of the present application, will have changes in the specific implementation manner and application scope. In conclusion, the contents of this specification should not be construed as a limitation on the present application.

Claims

A method for training a log anomaly detection model, wherein the method comprises:

obtaining a first log sample set, wherein the first log sample set is obtained by processing log data of the target object;

Pre-training an initial log anomaly detection model by using the first log sample set to obtain a pre-trained log anomaly detection model;

acquiring a second log sample set, wherein the second log sample set is obtained by processing log data of a target sub-object, and the target sub-object belongs to the target object;

Fine-tune the pre-trained log anomaly detection model by using the second log sample set to obtain a trained log anomaly detection model.
The method of claim 1, wherein:

The target object includes at least one of the following sub-objects: hard disk, memory, flash memory, network device and processor, and the target sub-object is any type of sub-object in the target object.
The method according to claim 1 or 2, wherein the first log sample set includes m log samples, where m is a natural number greater than 1, and the first log sample set is used to detect anomalies in the initial log. The model is pre-trained to obtain a pre-trained log anomaly detection model, including:

Perform word segmentation on the m log samples respectively to obtain m word sequences corresponding to the m log samples;

Through the m word sequences, the initial log anomaly detection model is pre-trained to obtain a pre-trained log anomaly detection model.
The method according to claim 3, wherein the initial log anomaly detection model is pre-trained through the m word sequences to obtain a pre-trained log anomaly detection model, comprising:

Perform mask processing on words with preset proportions in the m word sequences, respectively, to obtain m word sequences after mask processing;

The initial log anomaly detection model is pre-trained through the masked m word sequences to obtain a pre-trained log anomaly detection model.
The method according to claim 4, wherein the initial log anomaly detection model is pre-trained through the m word sequences processed by the mask to obtain a pre-trained log anomaly detection model, comprising:

Obtain the word embedding vector and the position embedding vector corresponding to each word in the m word sequences after the mask processing, wherein, the word embedding vector corresponding to each word is used to represent each word. A multi-dimensional vector, the position embedding vector corresponding to each word represents the position of each word in the word sequence to which it belongs;

According to the word embedding vector and the position embedding vector corresponding to each word in the m word sequences after the mask processing, respectively, obtain m first row vectors corresponding to the m word sequences after the mask processing;

Using the m first row vectors, pre-train the initial log anomaly detection model to obtain a pre-trained log anomaly detection model.
The method according to claim 5, wherein the use of the m first row vectors to pre-train an initial log anomaly detection model to obtain a pre-trained log anomaly detection model, comprising:

The m first row vectors are respectively input into the initial log anomaly detection model for training, and m second row vectors are obtained, wherein the m second row vectors and the m word sequences processed by the mask There is a one-to-one correspondence, and each second row vector in the m second row vectors includes the semantic information of the word sequence after its corresponding mask processing;

Obtain the loss of the m second row vectors to the initial cluster center;

According to the loss of the m second row vectors to the initial cluster center, an initial log anomaly detection model is trained to obtain a pre-trained log anomaly detection model and a target cluster center.
The method according to claim 6, wherein the method further comprises:

Obtain the percentile corresponding to the loss of the m second row vectors to the target cluster center;

A classification threshold is determined according to the percentile corresponding to the loss of the m second row vectors to the target cluster center, wherein the classification threshold is used for the trained log anomaly detection model to detect the to-be-detected Anomaly detection is performed on the log, and the detection result is obtained.
The method according to claim 6 or 7, wherein the formula for obtaining the loss from the m second row vectors to the initial cluster center is:

Wherein, V i represents the ith second row vector among the m second row vectors, c represents the initial cluster center, and loss(c,V i ) represents the ith second row vector to The loss of the initial cluster center, i is a natural number.
A log anomaly detection model training device, characterized in that the device comprises:

an acquisition module, configured to acquire a first log sample set, wherein the first log sample set is obtained by processing log data of the target object;

a training module, which pre-trains the initial log anomaly detection model through the first log sample set to obtain a pre-trained log anomaly detection model;

The obtaining module is further configured to obtain a second log sample set, wherein the second log sample set is obtained by processing log data of a target sub-object, and the target sub-object belongs to the target object;

The training module is further configured to fine-tune the pre-trained log anomaly detection model through the second log sample set to obtain a trained log anomaly detection model.
The device of claim 9, wherein:

The target object includes at least one of the following sub-objects: hard disk, memory, flash memory, network device and processor, and the target sub-object is any type of sub-object in the target object.
The device according to claim 8 or 9, wherein the first log sample set includes m log samples, where m is a natural number greater than 1, and the training module is specifically used for:

Perform word segmentation on the m log samples respectively to obtain m word sequences corresponding to the m log samples;

Through the m word sequences, the initial log anomaly detection model is pre-trained to obtain a pre-trained log anomaly detection model.
The device according to claim 11, wherein the training module is specifically used for:

Perform mask processing on words with preset proportions in the m word sequences, respectively, to obtain m word sequences after mask processing;

The initial log anomaly detection model is pre-trained through the masked m word sequences to obtain a pre-trained log anomaly detection model.
The device according to claim 12, wherein the training module is specifically used for:

Obtain the word embedding vector and the position embedding vector corresponding to each word in the m word sequences after the mask processing, wherein the word embedding vector corresponding to each word is a multi-dimensional representation of each word. vector, the position embedding vector corresponding to each word represents the position of each word in the word sequence to which it belongs;

According to the word embedding vector and the position embedding vector corresponding to each word in the m word sequences after the mask processing, respectively, obtain m first row vectors corresponding to the m word sequences after the mask processing;

Using the m first row vectors, pre-train the initial log anomaly detection model to obtain a pre-trained log anomaly detection model.
The device according to claim 13, wherein the training module is specifically used for:

The m first row vectors are respectively input into the initial log anomaly detection model for training, and m second row vectors are obtained, wherein the m second row vectors and the m word sequences processed by the mask There is a one-to-one correspondence, and each second row vector in the m second row vectors includes the semantic information of the word sequence after its corresponding mask processing;

Obtain the loss of the m second row vectors to the initial cluster center;

According to the loss of the m second row vectors to the initial cluster center, an initial log anomaly detection model is trained to obtain a pre-trained log anomaly detection model and a target cluster center.
The device according to claim 14, wherein the training module is further used for:

Obtain the percentile corresponding to the loss of the m second row vectors to the target cluster center;

The classification threshold is determined according to the percentile corresponding to the loss of the m second row vectors to the target cluster center, wherein the classification threshold is used for the trained log anomaly detection model to detect the to-be-detected log. Anomaly detection is performed on the log, and the detection result is obtained.
The device according to claim 14 or 15, wherein the formula for obtaining the loss from the m second row vectors to the initial cluster center is:

Wherein, V i represents the ith second row vector among the m second row vectors, c represents the initial cluster center, and loss(c,V i ) represents the ith second row vector to The loss of the initial cluster center, i is a natural number.
A non-transitory computer-readable storage medium, characterized in that, the non-transitory computer-readable medium stores instructions, and the instructions are used to implement the method of any one of claims 1 to 8.
A computing device, characterized in that the computing device includes a processor and a memory; the processor is configured to execute instructions stored in the memory, so that the computing device implements the method of any one of claims 1 to 8 .