CN114936615A

CN114936615A - Small sample log information anomaly detection method based on characterization consistency correction

Info

Publication number: CN114936615A
Application number: CN202210876386.7A
Authority: CN
Inventors: 许扬汶; 刘天鹏; 韩冬; 孙腾中; 刘灵娟
Original assignee: Nanjing Big Data Group Co ltd
Current assignee: Nanjing Big Data Group Co ltd
Priority date: 2022-07-25
Filing date: 2022-07-25
Publication date: 2022-08-23
Anticipated expiration: 2042-07-25
Also published as: CN114936615B

Abstract

The invention discloses a small sample log information abnormity detection method based on characterization consistency proofreading, which comprises the following steps of: data preprocessing, extracting event characteristics and carrying out serialization processing; the self-learning characteristic representation network is used for iterative training, an original special-shaped computing network is used for learning a characteristic extractor from a small sample classification task, a characteristic consistency proofreading module is constructed, the original special-shaped computing network and the self-learning characteristic representation network are respectively trained through the characteristic consistency proofreading functions of the original special-shaped computing network and the self-learning characteristic representation network, and the trained self-learning characteristic representation network is used as an embedded network; inputting test set data to obtain a classification result; and performing corresponding processing according to the output result. The method utilizes the supervision information of the original special-shaped computing network to guide the self-learning characteristics to represent network training, is more suitable for model training under the condition of small samples, and simultaneously improves the classification effect of the abnormal detection model under the condition of small samples.

Description

Small sample log information anomaly detection method based on characterization consistency correction

Technical Field

The invention relates to a classification detection method, in particular to a small sample log information abnormity detection method based on characterization consistency proofreading.

Background

With the development of new technologies such as internet, big data, cloud computing and the like, more and more industries and scenes start digital operation. The services enable the life of common users to be convenient and efficient, but also bring a profit channel for the black and gray industry, so that a series of new network security problems are derived. Aiming at the endless network security problem, the traditional detection method can not meet the current requirements on network security defense. The anomaly detection technology based on artificial intelligence technologies such as a neural network and the like has self-learning capability and dynamic monitoring capability, so that the network security technology is qualitatively improved. A large number of samples are needed as training data for a traditional neural network detection model, however, in practical application, user data acquisition difficulty is large, time consumption is long, and labeling cost is high, so that effective samples are scarce, and a high-efficiency detection model is difficult to train in the face of a small sample task.

The proofreading learning is an important learning paradigm, and mainly utilizes auxiliary tasks to mine own supervision information from large-scale unsupervised data, and trains a network through the constructed supervision information, so that valuable characteristics of downstream tasks can be learned, and the method comprises main methods based on context, time sequence and the like. However, conventional collation learning methods tend to rely on a large number of training samples. In a small sample scene, due to the lack of enough samples, the obtained supervision information is mainly concentrated on the difference of base class samples, and valuable semantic information of a new class is ignored. The direct application of the learning task in a small sample scene may learn some inappropriate "shortcuts" instead of the key semantic information, i.e. learn a biased representation, thereby leading to misdirection of the main task, causing a performance degradation of the small sample learning.

Disclosure of Invention

The purpose of the invention is as follows: the invention aims to provide a small sample log information abnormity detection method based on characterization consistency proofreading, which can improve the small sample learning performance, and can efficiently distinguish abnormal behaviors aiming at mobile phone application.

The technical scheme is as follows: the invention relates to a small sample log information abnormity detection method based on characterization consistency proofreading, which comprises the following steps of:

(1) data preprocessing, namely analyzing log information to obtain structured log data, extracting and classifying event characteristics, performing serialization processing, and converting the event characteristics into numerical vector data;

(2) the method comprises the steps that a model feature extractor is subjected to iterative training, preprocessed data are divided into a training set and a testing set, based on training set data, a small sample classification task is executed on each epsilon by using a task-based epicode training strategy, an original special-shaped computing network is used for learning a feature extractor from the small sample classification task, then a characterization consistency checking module is constructed, characterization consistency checking functions of the original special-shaped computing network and a self-learning feature representation network are computed, the characterization consistency checking functions are used for respectively training the original special-shaped computing network and the self-learning feature representation network, parameters of the original special-shaped computing network and the self-learning feature representation network are continuously updated in the iterative process, and finally the trained self-learning feature representation network is used as an embedded network;

(3) inputting the preprocessed test set data as a model, using the trained self-learning characteristic representation network, calculating the similarity of the test sample and each category, and obtaining the category with the highest similarity as a classification result;

(4) and performing corresponding processing according to an output result of the prediction stage, and if abnormal behaviors are found, giving out a warning prompt to remind system management personnel to pay attention to the abnormal behaviors so as to guarantee the system safety.

Preferably, the event characteristics in step (1) are an event behavior description string _ id and a security label, where the event behavior description string _ id includes three categories, namely, File operation File, Process operation Process, and Registry operation registration, and the three categories of events include 16 event operation behaviors, and are divided as shown in the following table:

preferably, the 16 event operation behaviors are sequentially stored in a reference vector with the vector size of 16, and a vector matrix corresponding to the reference vector is initialized with the matrix size of

Represented by a 16-bit binary number, each bit representing an event operation behavior attribute flag value executed by the program, 0 indicating the absence of such an event type, and 1 indicating the presence of such an event type.

Preferably, the vector matrix is spliced with the security label to form an event behavior vector, and the vector size is

The first 16 bits represent attribute marking values of 16 event operation behaviors, the last 1-bit security label represents a normal event behavior or an abnormal event behavior marking value, when the security label is 0, the security label represents no abnormal behavior, when the security label is 1, the security label represents that a file operation abnormal behavior exists, when the security label is 2, the security label represents that a process operation abnormal behavior exists, and when the security label is 3, the security label represents that a registry operation abnormal behavior exists.

Preferably, the step (2) comprises the following steps:

(2.1) using the training set as input to the model, computing the network using the original idiotypes

Computing a class prototype in which

Are learnable network parameters. In particular, for a small sample task

，

In order to support the set of data,

computing a category for a set of queries

Is prototyped as

Wherein the content of the first and second substances,

representing a sample label in feature space of

The class prototype of (a) is,

presentation support set

Wherein the label is

Is determined by the data set of (a),

representing a data set

The size of (a) is (b),

a feature vector representing the sample is then generated,

a label representing the corresponding sample;

for a query from a set of queries

New sample of

Each category is obtained by distance discrimination

Normalized classification score of (a):

wherein

Represents the softmax function; specifying classification penalty functions

Comprises the following steps:

wherein, the first and the second end of the pipe are connected with each other,x _q a feature vector representing the sample is determined,y _q a label representing the corresponding sample;

(2.2) building a self-learning feature representation network for a query from a set of queries

Input data of

Generating transformations using a method of random enhancement

Forming pairs of training samples, calculating the objective function of the self-learning feature representation network

Comprises the following steps:

(2.3) constructing a characteristic consistency proofreading function

Computing the original profile into a network

And self-learning feature representation network

And (4) performing proofreading:

wherein

And

are all learnable network parameters;

(2.4) computing network for primitive specialties

And fusing a classification loss function and a characterization consistency correction function, and calculating a final original special type calculation network training function as follows:

wherein the content of the first and second substances,

is a weight variable;

(2.5) representing networks for self-learning features

And fusing a target function and a characteristic consistency correction function of the self-learning characteristic representation network, and calculating a final self-learning characteristic representation network training function as follows:

wherein, the first and the second end of the pipe are connected with each other,

is a weight variable;

(2.6) in model training, designing and using an original special-shaped computing network and a self-learning feature representation network interactive iterative updating method, and training

And

using the final

As an embedded network.

Preferably, the interactive iterative updating method includes: first, the original special-type computing network is initialized respectively

And self-learning characterizing networks

(ii) a Fixed self-learning feature representation of parameters in a network

To obtain a characteristic consistency check function

Further using the primitive prototype to compute a network training function

For parameter

Performing a one-step optimization, followed by updated parameters

To obtain a new characterization consistency check function

Representing network training functions using self-learning features

Performing one-step optimization to obtain updated parameters

And repeating the interactive iteration updating step until the training function is converged.

Preferably, the step (3) includes inputting the preprocessed test set data as a model, and using the trained self-learning feature representation network

Extracting the characteristics of the sample, calculating the average value of the support set samples corresponding to each class as a prototype of the class, then calculating the similarity between the test sample and each class prototype through a small sample log information abnormity judgment function, and finally obtaining the class with the highest similarity as a final detection result.

Preferably, the small sample log information anomaly determination function is:

。

the invention also provides a computer readable storage medium, a computer program is stored in the computer readable storage medium, and when the computer program is executed by a processor, the method for detecting the small sample log information abnormity based on the characterization consistency proofreading is realized.

Has the beneficial effects that: compared with the prior art, the invention has the following remarkable advantages: the method has the advantages that the self-supervision learning of the characteristic consistency proofreading is provided, the supervision information of the original special type computing network is used for guiding the self-learning characteristic representation network training, so that the two modules are matched, the characteristic consistency proofreading learning utilizes the inherent supervision information in the marked data, the learning characteristic manifold is improved, the representation deviation is reduced, more effective semantic information is mined, the information is integrated to form uniform distribution, and the original characterization method is further enriched and expanded; the interactive iteration updating method can further converge the target function, is more suitable for model training under the condition of small samples, improves the classification effect of the abnormal detection model under the condition of small samples, effectively detects abnormal behaviors in log files and ensures the application safety of the mobile phone.

Drawings

FIG. 1 is a flow chart of the operation of the present invention;

FIG. 2 is a schematic flow chart of the model iterative training phase of the present invention;

FIG. 3 is a comparison of classification discrimination accuracy between the method of the present invention and the prior art.

Detailed Description

The technical scheme of the invention is further explained by combining the attached drawings.

For the task of judging whether the system has abnormal behavior or not by the given log record file, a high-efficiency abnormal detection model can be trained by using a training set, and then the user record is monitored in real time and early warned by using the model. As shown in fig. 1, the method for detecting the abnormality of the log information of the small sample based on the token consistency proofreading includes the following steps: the method comprises a data preprocessing stage, a model iterative training stage, a prediction stage and a response stage.

(1) A data preprocessing stage:

analyzing the log information to obtain structured log data, extracting event features, classifying and sorting the extracted features, serializing the features, and converting the features into numerical vector data. The method specifically comprises the following steps:

the log data set D is composed of a plurality of records, each record is composed of log data content and a label, and in the data preprocessing stage, key fields including event behavior description string _ id and a security label are extracted from the log data. In this embodiment, the following information can be extracted for one log record:

{

"start_time":"2020-08-16T20:55:00",

"end_time":"2020-08-16T20:57:00",

"size":2741,

"Processes":{

"pid":3500,

"name":[python]\\python.exe",

"events":{

"time":"2020-08-16T20:55:00",

"event_id":2233,

"ignored":false,

"string-id":"File:Permissions:|temp|\\000c34576f5c",

"action":"Permissions",

"target":"[temp]\\000c34576f5c",

"abstraction":""

}

"label": 1

}

the event behavior description string _ id is 'File: Permissions | temp | \ \000c34576f5 c', the event type can be obtained as 'File', the specific event operation behavior is 'Permissions', and the illegal change permission operation for the File is realized. The security label is 1, which indicates that the log record has abnormal behavior for the file system.

In this embodiment, the event vector matrix is 0010000000000000, where bit 3 is 1, indicating that there is a "Permissions" event type. By splicing the vector matrix with the security label, the vector size is formed

00100000000000001. All data records are preprocessed as described above, each obtained data is a feature vector with the size of 1 × 17, and the data is randomly divided into a training set and a test set.

(2) And (3) in the model iterative training stage, optimizing the self-learning characteristic representation network, as shown in figure 2.

I.e. prototype network, computing class prototypes, in which

Are learnable network parameters. The method comprises the following specific steps: randomly extracting, for each epsilon, from a training set using a task-based epsilon training strategyNEach class is extracted respectivelyKThe samples form a support setSAnd then from thisNThe remaining samples in the individual class extract a portion of the data as a query setQThe resulting classification problem is called the N-way K-shot small sample task. Performing a small sample classification task for each epicode

，

In order to support the set of data,

computing a category for a set of queries

Is prototyped as

representing a sample label in feature space of

The prototype of the category of (1),

presentation support set

Wherein the label is

Is determined by the data set of (a),

representing a data set

The size of (a) is smaller than (b),

a feature vector representing the sample is then generated,

a label representing the corresponding sample;

for a query from a set of queries

New sample of

Using the following distance judgmentGet each classification

Normalized classification score of (a):

wherein

Represents the softmax function; specifying classification loss functions

Comprises the following steps:

wherein the content of the first and second substances,x _q a feature vector representing the sample is determined,y _q a label representing the corresponding sample;

Input data of

Generating transformations using a method of random enhancement

Comprises the following steps:

(2.3) construction ofCharacterizing consistency check functions

Computing the original profile into a network

And self-learning feature representation network

And (3) performing proofreading:

wherein

And

are all learnable network parameters;

(2.4) computing network for primitive specialties

wherein the content of the first and second substances,

is a weight variable;

(2.5) representing networks for self-learning features

Fusing target function and characterization consistency correction function of the self-learning characteristic representation network, and calculating final self-learning characteristic representation network trainingThe function is:

wherein the content of the first and second substances,

is a weight variable;

And

the method specifically comprises the following steps:

first, the original special-type computing network is initialized respectively

And self-learning feature representation network

(ii) a Fixed self-learning feature representation of parameters in a network

To obtain a characteristic consistency check function

Further using the primitive prototype to compute a network training function

For parameter

Performing a one-step optimization, followed by updated parameters

To obtain a new characterization consistency check function

Representing network training functions using self-learning features

Performing one-step optimization to obtain updated parameters

Repeating the interactive iterative updating steps until the training function is converged, and using the finally trained self-learning characteristic to represent the network

As an embedded network.

(3) A prediction stage: the preprocessed test set data is used as model input, and a self-learning characteristic representation network is used

Extracting the characteristics of the sample, calculating the average value of the support set samples corresponding to each class as the prototype of the class, then calculating the similarity between the test sample and each class prototype through a small sample log information abnormity decision function, and finally obtaining the class with the highest similarity as the final detection result, wherein the small sample log information abnormity decision function is

And aiming at the vector matrix 0010000000000000, finally obtaining the distance which is the closest to the class prototype distance of label =1, namely considering that the prediction label of the segment record is 1, and the abnormal behavior aiming at file operation exists.

(4) A response phase: and performing corresponding processing according to the prediction result, and the system finds that the event has abnormal behavior aiming at the file system, timely sends out a warning prompt and gives record information of the abnormal behavior so as to facilitate the manager to further troubleshoot errors.

The small sample log information anomaly detection method based on the characterization consistency proofreading is verified through a simulation experiment, a model training method and a model testing method are realized by using python, the method is compared with small sample learning methods such as an original special type calculation network, a matching network and a relation network, and the comparison result under a 5way 5shot task is shown in FIG. 3. ProtoNet represents an original special type calculation network, MatchingNet represents a matching network, relationship Net represents a relational network, MAML represents a model independent algorithm, and RAS represents the small sample learning method based on the characterization consistency proofreading. All procedures were performed on a standard server equipped with Intel Core i7-8700 CPU, 3.20GHz, 32 GBRAM and NVIDIA TITAN RTX, using a neural network whose activation function was the ReLu function, an Adam optimizer, using 0.01 as the initial learning rate and stepping it down during training. As can be seen from FIG. 3, the classification and identification accuracy of the small sample log information anomaly detection method based on the characterization consistency proofreading is higher than that of other methods by more than 5%, so that the method has the advantages of being more suitable for the special task of small sample learning, and meanwhile, the anomaly detection model classification effect under the condition of small samples is improved.

Claims

1. A small sample log information abnormity detection method based on characterization consistency proofreading is characterized by comprising the following steps:

(1) preprocessing data, analyzing log information, extracting event characteristics and classifying;

(2) the method comprises the steps of iterative training of a self-learning characteristic representation network, dividing preprocessed data into a training set and a testing set, based on training set data, firstly using an epicode training strategy based on tasks, executing a small sample classification task for each epicode, using an original special-type computing network to learn a characteristic extractor from the small sample classification task, then constructing a characterization consistency proofreading module, computing characterization consistency proofreading functions of the original special-type computing network and the self-learning characteristic representation network, respectively training the original special-type computing network and the self-learning characteristic representation network by using the characterization consistency proofreading functions, continuously updating parameters of the original special-type computing network and the self-learning characteristic representation network by using an interactive iterative updating method, and finally using the trained self-learning characteristic representation network as an embedded network;

(3) and (3) using the trained self-learning characteristic representation network to calculate the similarity between the test set data and each category, and obtaining the category with the highest similarity as a final detection result.

2. The method for detecting the small sample log information abnormality based on the characterization consistency proofreading according to claim 1, wherein the event characteristics in the step (1) include an event behavior description string _ id and a security label, wherein the event behavior description string _ id includes a File operation File, a Process operation Process and a Registry operation registration.

3. The small-sample log information anomaly detection method based on the characterization consistency proofreading according to claim 2, characterized in that event behavior description string _ id is represented by binary numbers, each binary bit represents an event operation behavior attribute flag value executed by a program, 0 represents that no such event type exists, and 1 represents that such event type exists, so as to form a vector matrix; the security label represents a normal event behavior or an abnormal event behavior marking value, when the security label is 0, the security label indicates that no abnormal behavior exists, when the security label is 1, the security label indicates that a file operation abnormal behavior exists, when the security label is 2, the security label indicates that a process operation abnormal behavior exists, and when the security label is 3, the security label indicates that a registry operation abnormal behavior exists; and splicing the vector matrix with the security label to form an event behavior vector.

4. The method for detecting the anomaly of the log information of the small sample based on the characterization consistency check according to claim 1, wherein the step (2) comprises the following steps:

(2.1) using the training set as input to the model, using the original profile computational network

Computing a class prototype in which

Is a learnable network parameter; for a small sample task

，

In order to support the set of data,

computing a category for a set of queries

The prototype of (a) is:

wherein the content of the first and second substances,

representing a sample label in feature space of

The class prototype of (a) is,

presentation support set

Wherein the label is

Is determined by the data set of (a),

representing a data set

The size of (a) is (b),

a feature vector representing the sample is then generated,

a label representing the corresponding sample;

for a query from a set of queries

New sample of

Each category is obtained by distance discrimination

Normalized classification score of (a):

wherein

Represents the softmax function; specifying classification loss functions

Comprises the following steps:

wherein the content of the first and second substances,

a feature vector representing the sample is then generated,

a label representing the corresponding sample;

Feature vector of the sample

Generating transformations using a method of random enhancement

Comprises the following steps:

；

(2.3) constructing a characteristic consistency proofreading function

Computing the original profile into a network

And self-learning feature representation network

And (4) performing proofreading:

wherein

And

are all learnable network parameters;

(2.4) computing network for primitive specialties

wherein the content of the first and second substances,

is a weight variable;

(2.5) representing the network for self-learning characteristics

is a weight variable;

(2.6) Using the characterization consistency check functionTraining a line model, namely training a final self-learning characteristic representation network by using an original special-shaped calculation and self-learning characteristic representation interactive iteration updating method

As an embedded network.

5. The method for detecting the anomaly of the log information of the small samples based on the characterization consistency proofreading according to claim 4, wherein the interactive iterative updating method in the step (2.6) comprises the following steps:

first, the original special-type computing network is initialized respectively

And self-learning feature representation network

After which the fixed self-learning features represent parameters in the network

Calculating a characteristic consistency check function

Further using the primitive profile to compute a network training function

To the parameter

Performing one-step optimization and then updating by optimization

Recalculating the characterization consistency check function

Using self-learning characterizing functions

To the parameter

Performing one-step optimization to obtain optimized and updated parameters

And repeating the iteration updating step until the training function is converged.

6. The method for detecting the abnormality of the log information of the small samples based on the characterization consistency check as claimed in claim 1, wherein the step (3) comprises using the preprocessed test set data as the model input and using the trained self-learning feature representation network

Extracting the characteristics of the samples, calculating the average value of the support set samples corresponding to each class as a prototype of the class, and then calculating the similarity between the test sample and each class prototype through a small sample log information abnormity judgment function to obtain the class with the highest similarity as a classification result.

7. The method for detecting the small sample log information abnormality based on the characterization consistency proofreading according to claim 6, wherein the small sample log information abnormality determination function is:

。

8. the method for detecting the small sample log information abnormality based on the characterization consistency proofreading as claimed in claim 1, further comprising the step (4) of performing early warning and response processing according to a detection result.

9. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-8.