CN115719070A

CN115719070A - Multi-step attack detection model pre-training method based on alarm semantics

Info

Publication number: CN115719070A
Application number: CN202211492686.1A
Authority: CN
Inventors: 张旭; 于洋; 王浩铭; 吴铤; 齐永兴
Original assignee: Hangzhou Innovation Research Institute of Beihang University
Current assignee: Hangzhou Innovation Research Institute of Beihang University
Priority date: 2022-11-25
Filing date: 2022-11-25
Publication date: 2023-02-28

Abstract

The invention relates to the field of multi-step attack detection model pre-training, and in particular to a pre-training method for a multi-step attack detection model based on alarm semantics. The method comprises: obtaining alarm description embedding vectors from an offline alarm sequence; and pre-training a multi-step attack detection model using the alarm description embedding vectors. Based on the idea that alarms generated in the same attack stage have higher semantic similarity, the method uses semantic clustering to aggregate the alarms belonging to the same attack stage, and then converts the membership degree of each alarm vector in each attack stage into the probability of generating that alarm in each attack stage, thereby avoiding the problem of the model falling into a local optimum.

Description

Multi-step attack detection model pre-training method based on alarm semantics

Technical Field

The invention relates to the field of multi-step attack detection model pre-training, in particular to a multi-step attack detection model pre-training method based on alarm semantics.

Background

Ourston et al. first applied hidden Markov models (HMMs) to multi-step attack detection in 2003, using an HMM to label alarm sequences. Xue et al. proposed a multi-step attack detection and prediction method to address the difficulty of determining the observed values of a hidden Markov model in multi-step attack detection. That work updates an existing hidden Markov model with the Baum-Welch algorithm, then identifies the alarms belonging to an attack scenario with the Forward algorithm, and finally labels the alarms with the Viterbi algorithm and predicts the next likely alarm. Ghafir et al. were the first to propose a novel intrusion detection system for APT attack detection and prediction. Their work comprises two parts. The first part reconstructs the attack scenario by detecting the traffic characteristics of each attack stage in the Cyber Kill Chain. The second part, attack decoding, uses hidden Markov models to determine the most likely sequence of APT stages and predicts the attacker's next step from that sequence. Tu et al. proposed a probabilistic model based on hidden Markov models and probabilistic reasoning to detect attack intentions at an early stage, addressing the problem that hidden Markov models cannot predict multiple attack intentions. The model uses online parameter update rules to better adapt to dynamic network environments. Shawly et al. analyzed the detection accuracy and prediction accuracy of HMM-based detection algorithms, covering the EM, spectral, Baum-Welch, differential evolution, K-means, and segmental K-means algorithms.

Although the importance of unsupervised HMMs in multi-step attack (MSA) detection is widely recognized in the field, the Baum-Welch algorithm remains very sensitive to its initial values. The current Baum-Welch algorithm initializes HMMs with an average initialization method. However, this initialization easily causes the multi-step attack detection model to fall into a local optimum, which reduces detection effectiveness. Alarm descriptions generated by network interactions in the same attack stage have higher semantic similarity and can be used to distinguish the attack stages. However, current HMM-based MSA detection methods encode the alarm description attribute with categorical encoding, which discards the rich semantic information in the alarm descriptions.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a multi-step attack detection model pre-training method based on alarm semantics, which adopts semantic clustering to aggregate alarms belonging to the same attack stage, thereby avoiding the problem that the model falls into the local optimal solution.

In order to achieve the purpose, the invention provides a multi-step attack detection model pre-training method based on alarm semantics, which comprises the following steps:

obtaining an alarm description embedding vector by utilizing an off-line alarm sequence;

and utilizing the alarm description embedded vector to pre-train a multi-step attack detection model.

Preferably, the obtaining of the alarm description embedding vector by using the offline alarm sequence includes:

acquiring an alarm text corresponding to an alarm rule;

dividing the alarm text based on basic words to obtain an alarm word text;

utilizing the alarm word text to perform stop word removal processing based on a stop word table to obtain an alarm word basic text;

obtaining an alarm description embedded model by utilizing the alarm word basic text;

inputting the alarm word basic text into the alarm description embedding model to obtain an alarm description embedding vector;

wherein the stop word table is a stop word table of modal particles.

Further, obtaining an alarm description embedding model by using the alarm word basic text comprises:

utilizing the alarm word basic text as a training set;

and training an alarm description embedding model based on the PV-DBOW version of Doc2Vec, with the training set as input and the alarm description embedding vectors corresponding to the alarm word basic texts in the training set as output.

Preferably, the pre-training process of the multi-step attack detection model by using the alarm description embedding vector comprises:

establishing an alarm embedding vector membership matrix of the current multi-step attack stage according to the current multi-step attack stage by using the alarm description embedding vector;

utilizing the alarm embedding vector membership matrix of the current multi-step attack stage to calculate the cluster center of the current multi-step attack stage;

utilizing the cluster center of the current multi-step attack stage to iteratively update and calculate an alarm embedded vector membership matrix;

acquiring the position of a corresponding multi-step attack stage by utilizing the alarm embedded vector membership matrix;

calculating an HMM emission probability matrix by using the alarm embedded vector membership matrix;

and obtaining a pre-training result by utilizing the HMM emission probability matrix.

Further, the calculation formula for calculating the cluster center of the current multi-step attack stage by using the embedded vector membership matrix of the current multi-step attack stage is as follows:

where C_j is the cluster center of the j-th multi-step attack stage, N is the number of multi-step attack stages, u_ij is the membership value, in the alarm embedding vector membership matrix, of the i-th alarm vector belonging to the j-th attack stage, x_i is the i-th alarm vector, and m is the number of clusters.
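For reference, the standard fuzzy C-means centroid update uses exactly the symbols defined above. Treating it as the formula intended at this step is an assumption, as is the symbol M for the total number of alarm vectors (borrowed from the index range u_11 to u_MN used for the emission matrix):

```latex
C_j = \frac{\sum_{i=1}^{M} u_{ij}^{\,m}\, x_i}{\sum_{i=1}^{M} u_{ij}^{\,m}}, \qquad j = 1, \dots, N
```

In the standard algorithm, m enters as a weighting exponent greater than 1 on the membership values, so each center is a membership-weighted mean of the alarm vectors.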

Further, the calculation formula for iteratively updating and calculating the alarm embedding vector membership matrix by using the cluster center of the current multi-step attack stage is as follows:

where U_(k+1) is the iteratively updated alarm embedding vector membership matrix, u_ij' is the membership value, in the embedding vector membership matrix, of the i-th alarm vector belonging to the j-th attack stage, C_j is the cluster center of the j-th multi-step attack stage, C_k is the cluster center of the k-th multi-step attack stage, N is the number of multi-step attack stages, x_i is the i-th alarm vector, and m is the number of clusters.
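The standard fuzzy C-means membership update is consistent with the symbols listed above (C_j, C_k, N, x_i, m). It is given here as a hedged reconstruction, assuming the method follows the textbook algorithm and uses the Euclidean norm:

```latex
u_{ij}' = \frac{1}{\displaystyle\sum_{k=1}^{N} \left( \frac{\lVert x_i - C_j \rVert}{\lVert x_i - C_k \rVert} \right)^{\frac{2}{m-1}}},
\qquad U_{k+1} = \left[\, u_{ij}' \,\right]
```

Inside the sum, k ranges over attack stages; it is unrelated to the iteration counter in U_(k+1). Under this update, each row of the membership matrix sums to 1 across the N stages.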

Further, acquiring the corresponding multi-step attack stage position by using the alarm embedding vector membership matrix comprises:

determining whether the difference between U_(k+1) and U_k is less than ε; if so, outputting U_(k+1); otherwise, iteratively updating the alarm embedding vector membership matrix again using the alarm description embedding vectors and the cluster centers of the multi-step attack stages;

using the iteratively updated alarm embedding vector membership matrix to obtain the multi-step attack stage of each attack cluster according to the earliest alarm in that attack cluster;

taking the multi-step attack stage of the attack cluster as the corresponding multi-step attack stage position;

where ε is an empirical constant.

Further, the calculation formula for calculating the HMM emission probability matrix by using the alarm embedded vector membership matrix is as follows:

where B is the HMM emission probability matrix, and u_11 to u_MN are the membership values of the alarm description embedding vectors at each multi-step attack stage position.

Further, obtaining a pre-training result by using the HMM emission probability matrix includes:

obtaining the probability of the alarm description embedding vector corresponding to the multi-step attack stage by utilizing the HMM emission probability matrix;

and utilizing the probability of the alarm description embedded vector as a pre-training result.

Compared with the closest prior art, the invention has the following beneficial effects:

The method solves the problem that the current Baum-Welch initialization method easily causes the model to fall into a local optimum. Based on the idea that alarms generated in the same attack stage have higher semantic similarity, the method uses semantic clustering to aggregate the alarms belonging to the same attack stage, and then converts the membership of each alarm vector in each attack stage into the probability of generating that alarm in each attack stage. Finally, the initial value of Baum-Welch is replaced with the initial value optimized by alarm semantic knowledge, avoiding the problem of the model falling into a local optimum.
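One way to see why initialization matters is to run a single Baum-Welch re-estimation step from two starting points. The sketch below is a minimal, unscaled implementation for a discrete HMM. It assumes that "average initialization" means fully uniform parameters, and the observation sequence and the "semantic" starting matrix are made-up illustrative values, not data from the invention.

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch re-estimation step for a discrete HMM.

    Unscaled forward/backward passes, so only suitable for short sequences.
    obs: 1-D int array of alarm-type indices; pi: (N,); A: (N, N); B: (N, K).
    """
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood            # state posteriors per time step
    # xi[t, i, j] = P(state_t = i, state_{t+1} = j | obs)
    emit_next = B[:, obs[1:]].T * beta[1:]       # shape (T-1, N)
    xi = alpha[:-1, :, None] * A[None, :, :] * emit_next[:, None, :] / likelihood
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack(
        [gamma[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1
    )
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

obs = np.array([0, 0, 1, 2, 2])                  # toy alarm-type sequence
N, K = 2, 3                                      # 2 attack stages, 3 alarm types
pi = np.full(N, 1 / N)
A = np.full((N, N), 1 / N)

B_uniform = np.full((N, K), 1 / K)               # assumed "average" initialization
B_semantic = np.array([[0.7, 0.2, 0.1],          # hypothetical semantics-informed start
                       [0.1, 0.2, 0.7]])

_, _, B1 = baum_welch_step(obs, pi, A, B_uniform)
_, _, B2 = baum_welch_step(obs, pi, A, B_semantic)
uniform_rows_identical = np.allclose(B1[0], B1[1])
semantic_rows_identical = np.allclose(B2[0], B2[1])
```

From the uniform start, every hidden state receives the same re-estimated emission row, so the states stay indistinguishable no matter how many iterations follow. A start that already separates the stages keeps distinct rows.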

Drawings

FIG. 1 is a flow chart of a multi-step attack detection model pre-training method based on alarm semantics provided by the present invention;

FIG. 2 is a flow chart of a practical application of the multi-step attack detection model pre-training method based on alarm semantics.

Detailed Description

The following provides a more detailed description of embodiments of the present invention, with reference to the accompanying drawings.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1:

The invention provides a multi-step attack detection model pre-training method based on alarm semantics, which comprises the following steps:

S1, obtaining an alarm description embedding vector by using an offline alarm sequence;

S2, pre-training the multi-step attack detection model by using the alarm description embedding vector.

S1 specifically comprises the following steps:

S1-1, acquiring an alarm text corresponding to an alarm rule;

S1-2, dividing the alarm text based on basic words to obtain an alarm word text;

S1-3, performing stop word removal processing on the alarm word text based on a stop word table to obtain an alarm word basic text;

S1-4, obtaining an alarm description embedding model by using the alarm word basic text;

S1-5, inputting the alarm word basic text into the alarm description embedding model to obtain an alarm description embedding vector;

wherein the stop word table is a stop word table of modal particles.

In this embodiment, the basic words comprise content words and function words.
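Steps S1-2 and S1-3 can be sketched as a small preprocessing routine. The tokenizer rule, the stop word entries, and the sample alarm texts below are all hypothetical, since the stop word table and the alarm rule set are not enumerated here.

```python
import re

# Hypothetical stop word table; the actual table contents are not listed
# in the text, so these entries are placeholders.
STOP_WORDS = {"the", "a", "an", "of", "to"}

def tokenize(alarm_text):
    """S1-2: split an alarm description into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", alarm_text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """S1-3: drop every token that appears in the stop word table."""
    return [t for t in tokens if t not in stop_words]

def preprocess(alarm_texts):
    """Turn raw alarm texts into alarm word basic texts (token lists)."""
    return [remove_stop_words(tokenize(text)) for text in alarm_texts]

# Made-up alarm descriptions for illustration only.
alarm_texts = [
    "Attempted login to the FTP server",
    "Scan of an open SSH port",
]
basic_texts = preprocess(alarm_texts)
```

Each resulting token list plays the role of one alarm word basic text and can serve directly as a training document for the embedding model in S1-4.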

S1-4 specifically includes:

S1-4-1, using the alarm word basic text as a training set;

S1-4-2, training an alarm description embedding model based on the PV-DBOW version of Doc2Vec, with the training set as input and the alarm description embedding vectors corresponding to the alarm word basic texts in the training set as output.

S2 specifically comprises the following steps:

S2-1, establishing an alarm embedding vector membership matrix of the current multi-step attack stage according to the current multi-step attack stage by using the alarm description embedding vector;

S2-2, calculating the cluster center of the current multi-step attack stage by using the alarm embedding vector membership matrix of the current multi-step attack stage;

S2-3, iteratively updating the alarm embedding vector membership matrix by using the cluster center of the current multi-step attack stage;

S2-4, acquiring the corresponding multi-step attack stage position by using the alarm embedding vector membership matrix;

S2-5, calculating an HMM emission probability matrix by using the alarm embedding vector membership matrix;

S2-6, obtaining a pre-training result by using the HMM emission probability matrix.

The calculation formula of S2-2 is as follows:

The calculation formula of S2-3 is as follows:

where U_(k+1) is the iteratively updated alarm embedding vector membership matrix, u_ij' is the membership value, in the embedding vector membership matrix, of the i-th alarm vector belonging to the j-th attack stage, C_j is the cluster center of the j-th multi-step attack stage, C_k is the cluster center of the k-th multi-step attack stage, N is the number of multi-step attack stages, x_i is the i-th alarm vector, and m is the number of clusters.

S2-4 specifically comprises:

S2-4-1, determining whether the difference between U_(k+1) and U_k is less than ε; if so, outputting U_(k+1); otherwise, iteratively updating the alarm embedding vector membership matrix again using the alarm description embedding vectors and the cluster centers of the multi-step attack stages;

S2-4-2, using the iteratively updated alarm embedding vector membership matrix to obtain the multi-step attack stage of each attack cluster according to the earliest alarm in that attack cluster;

where ε is an empirical constant.

In this embodiment, iteratively updating the alarm embedding vector membership matrix means returning to S2-1 for reprocessing.
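Step S2-4-2 can be illustrated with a short sketch. It assumes that each alarm is hard-assigned to the cluster with its highest membership and that clusters are ranked chronologically by their earliest alarm. Both choices, along with the function name and the sample data, are assumptions made for illustration.

```python
import numpy as np

def order_clusters_by_earliest_alarm(U, timestamps):
    """Map each cluster index to an attack-stage position.

    U: (M, N) membership matrix; timestamps: (M,) alarm times.
    Each alarm is hard-assigned to its highest-membership cluster, and
    clusters are then ranked by the time of their earliest alarm.
    """
    U = np.asarray(U, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    labels = U.argmax(axis=1)
    n_clusters = U.shape[1]
    earliest = np.full(n_clusters, np.inf)       # empty clusters sort last
    for c in range(n_clusters):
        if np.any(labels == c):
            earliest[c] = timestamps[labels == c].min()
    order = np.argsort(earliest, kind="stable")  # cluster indices in stage order
    stage_of_cluster = np.empty(n_clusters, dtype=int)
    stage_of_cluster[order] = np.arange(n_clusters)
    return stage_of_cluster

# Made-up memberships and alarm times: cluster 1 fires first.
U = np.array([[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]])
timestamps = [10.0, 12.0, 30.0, 31.0]
stage_of_cluster = order_clusters_by_earliest_alarm(U, timestamps)
```

Here cluster 1 contains the earliest alarm, so it is mapped to stage position 0 and cluster 0 to stage position 1.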

The calculation formula of S2-5 is as follows:

where B is the HMM emission probability matrix, and u_11 to u_MN are the membership values of the alarm description embedding vectors at each multi-step attack stage position.
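The exact conversion from memberships to emission probabilities is not spelled out in the text. One plausible reading, assumed in the sketch below, is to transpose the membership matrix and normalize each stage's row so that the probabilities of all alarms in that stage sum to 1, as an HMM emission distribution requires. The function name and sample values are illustrative.

```python
import numpy as np

def emission_from_membership(U):
    """Turn an (M, N) membership matrix into an (N, M) emission matrix.

    Row j of the result is P(alarm i | stage j), obtained by normalizing
    the memberships of stage j over all M alarms (assumed normalization).
    """
    U = np.asarray(U, dtype=float)
    B = U.T.copy()
    B /= B.sum(axis=1, keepdims=True)
    return B

# Made-up memberships for 3 alarms and 2 attack stages.
U = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.3, 0.7]])
B = emission_from_membership(U)
```

The normalization matters because fuzzy C-means memberships sum to 1 across stages for each alarm, whereas emission probabilities must sum to 1 across alarms for each stage.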

S2-6 specifically comprises:

S2-6-1, obtaining the probability of the alarm description embedding vector corresponding to the multi-step attack stage by using the HMM emission probability matrix;

S2-6-2, using the probability of the alarm description embedding vector as the pre-training result.

Example 2:

A practical application of the multi-step attack detection model pre-training method based on alarm semantics, as shown in FIG. 2, is as follows:

Because alarm descriptions generated in the same attack stage have higher similarity, such alarms lie closer together in the semantic vector space, and alarms belonging to the same attack stage can be gathered together by a fuzzy C-means clustering algorithm. This distance relation is then mapped to the initial parameters of the model, yielding initial parameters closer to the actual attack scenario; unsupervised training then starts from these initial parameters, avoiding the problem of the model falling into a local optimum. Thus, the alarm embedding and parameter pre-training steps extract alarm semantic information and optimize the initial parameters of the multi-step attack detection model.

1. Alarm embedding based on Doc2vec:

To address the problem that One-Hot encoding cannot capture the semantic relations between alarm descriptions, the alarm embedding module uses a Doc2vec model to convert each alarm description into low-dimensional continuous values, mapping semantically similar alarm descriptions to nearby positions in the vector space and thereby extracting the semantic knowledge in the alarm descriptions.

Doc2vec currently offers two training modes, PV-DM and PV-DBOW. Because the word vectors obtained by PV-DBOW are more accurate when alarms contain little descriptive text, PV-DBOW is selected here to train the alarm embedding model. The specific training process of the alarm description embedding model is as follows:

1. Acquire all alarm texts in the alarm rule set;

2. Segment the alarm texts to obtain all words they contain;

3. Remove the stop words in the alarm texts using the stop word table;

4. Train the Doc2vec model with the alarm texts from which stop words have been removed.
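To make the PV-DBOW idea concrete, the toy sketch below trains one vector per alarm description so that the vector predicts the words of that description through a full softmax. This is a simplified stand-in written for illustration, not the Doc2vec implementation used by the invention. The function name, hyperparameters, and sample documents are assumptions. In practice a library would normally be used, for example gensim, whose `Doc2Vec` class selects PV-DBOW with `dm=0`.

```python
import numpy as np

def train_pv_dbow(docs, dim=8, epochs=100, lr=0.05, seed=0):
    """Toy PV-DBOW: each document vector is trained to predict its own words.

    docs: list of token lists (alarm texts with stop words removed).
    Returns an array of shape (len(docs), dim) with one vector per document.
    """
    vocab = sorted({word for doc in docs for word in doc})
    word_index = {word: i for i, word in enumerate(vocab)}
    rng = np.random.default_rng(seed)
    doc_vecs = rng.normal(0.0, 0.1, size=(len(docs), dim))
    out_weights = np.zeros((len(vocab), dim))    # softmax output layer
    for _ in range(epochs):
        for d, doc in enumerate(docs):
            for word in doc:
                scores = out_weights @ doc_vecs[d]
                probs = np.exp(scores - scores.max())
                probs /= probs.sum()
                probs[word_index[word]] -= 1.0   # gradient of cross-entropy w.r.t. scores
                grad_doc = out_weights.T @ probs
                out_weights -= lr * np.outer(probs, doc_vecs[d])
                doc_vecs[d] -= lr * grad_doc
    return doc_vecs

# Made-up preprocessed alarm descriptions.
docs = [
    ["attempted", "login", "ftp", "server"],
    ["failed", "login", "ftp", "server"],
    ["scan", "open", "ssh", "port"],
]
vectors = train_pv_dbow(docs)
```

The returned rows correspond to the alarm description embedding vectors that the clustering step consumes. Production implementations replace the full softmax with negative sampling or hierarchical softmax for speed.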

2. Model parameter pre-training based on fuzzy C-means:

The fuzzy C-means algorithm is a membership-based clustering algorithm whose idea is to maximize the similarity between vectors assigned to the same cluster and minimize the similarity between vectors in different clusters. In the parameter pre-training step, the alarm embedding vectors belonging to the same attack stage are aggregated using the fuzzy C-means algorithm. After the clustering algorithm converges, the membership of each alarm embedding vector in every class (attack stage) is obtained. Then, according to the clustering result, the membership matrix between the alarm vectors and the classes (attack stages) is converted into a probability matrix for generating an alarm in each attack stage, yielding an initial alarm description transition probability matrix closer to the actual attack scenario. The main steps are as follows:

1. Determine the number of classes N, where N represents the N attack stages of the multi-step attack.

2. Randomly initialize the membership matrix U^(0), where matrix element u_ij represents the membership value of the i-th alarm vector belonging to the j-th attack stage.

3. Update the cluster centers. Use U^(k) to update the cluster centers C^(k) = [c_j], where c_j represents the cluster center of the j-th attack stage, x_i represents the i-th alarm vector, and m is the number of clusters:

4. Update each element in the membership matrix:

5. If ‖U^(k+1) − U^(k)‖ < ε, stop updating and return the membership matrix U^(k+1); otherwise, return to step 2.

6. Determine the attack stage. After step 5, each attack cluster contains a certain number of alarms. The attack stage of each cluster is determined according to the earliest alarm in each attack cluster.

7. Calculate the HMM emission probability matrix. The membership of each alarm embedding vector in each attack stage is converted into the probability of generating the alarm description in that attack stage (the HMM emission probability matrix), completing the pre-training of the hidden Markov model parameters:
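Steps 1 to 5 above can be sketched with the textbook fuzzy C-means updates. This is a minimal sketch under stated assumptions: m is treated as the usual weighting exponent (set to 2.0 here), distances are Euclidean, and the loop resumes from the center update after each convergence check, which is the common convention. The function name, default values, and toy data are illustrative.

```python
import numpy as np

def fuzzy_c_means(X, n_stages, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Cluster alarm vectors X (M, d) into n_stages fuzzy clusters.

    Returns the membership matrix U (M, n_stages) and the centers C
    (n_stages, d) computed in the final iteration.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_stages))
    U /= U.sum(axis=1, keepdims=True)            # step 2: random memberships, rows sum to 1
    for _ in range(max_iter):
        Um = U ** m
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]             # step 3: weighted means
        dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)           # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)          # step 4: membership update
        converged = np.linalg.norm(U_new - U) < eps           # step 5: stopping test
        U = U_new
        if converged:
            break
    return U, C

# Two well-separated toy groups standing in for alarm embedding vectors.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
U, C = fuzzy_c_means(X, n_stages=2)
labels = U.argmax(axis=1)
```

The resulting membership matrix U is the input to steps 6 and 7, where clusters are tied to attack stages and memberships are converted into emission probabilities.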

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims

1. A multi-step attack detection model pre-training method based on alarm semantics is characterized by comprising the following steps:

and pre-training the multi-step attack detection model by using the alarm description embedding vector.

2. The method for pre-training the multi-step attack detection model based on the alarm semantics as claimed in claim 1, wherein the obtaining of the alarm description embedding vector by using the offline alarm sequence comprises:

acquiring an alarm text corresponding to an alarm rule;

dividing the alarm text based on basic words to obtain an alarm word text;

inputting the alarm word basic text into an alarm description embedding model to obtain an alarm description embedding vector;

3. The method of claim 2, wherein the obtaining of the alarm description embedding model by using the alarm word base text comprises:

utilizing the alarm word basic text as a training set;

4. The method for pre-training the multi-step attack detection model based on alarm semantics of claim 1, wherein pre-training the multi-step attack detection model using the alarm description embedding vector comprises:

calculating the cluster center of the current multi-step attack stage by using the alarm embedded vector membership matrix of the current multi-step attack stage;

5. The method for pre-training the multi-step attack detection model based on the alarm semantics as claimed in claim 4, wherein the calculation formula for calculating the cluster center of the current multi-step attack stage by using the alarm embedding vector membership matrix of the current multi-step attack stage is as follows:

6. The method for pre-training the multi-step attack detection model based on the alarm semantics as claimed in claim 4, wherein the calculation formula for calculating the alarm embedding vector membership matrix by using the cluster center iteration update of the current multi-step attack stage is as follows:

where U_(k+1) is the iteratively updated alarm embedding vector membership matrix, u_ij' is the membership value, in the embedding vector membership matrix, of the i-th alarm vector belonging to the j-th attack stage, C_j is the cluster center of the j-th multi-step attack stage, C_k is the cluster center of the k-th multi-step attack stage, N is the number of multi-step attack stages, x_i is the i-th alarm vector, and m is the number of clusters.

7. The method for pre-training the multi-step attack detection model based on the alarm semantics as claimed in claim 4, wherein the alarm embedding vector membership matrix is used to obtain the corresponding multi-step attack stage position:

determining whether the difference between U_(k+1) and U_k is less than ε; if so, outputting U_(k+1); otherwise, iteratively updating the alarm embedding vector membership matrix again using the alarm description embedding vectors and the cluster centers of the multi-step attack stages;

where ε is an empirical constant.

8. The method as claimed in claim 4, wherein the computation formula for computing the HMM emission probability matrix by using the alarm embedding vector membership matrix is as follows:

9. The method as claimed in claim 4, wherein the pre-training method for the multi-step attack detection model based on alarm semantics comprises obtaining a pre-training result by using the HMM emission probability matrix: