CN112019497A

CN112019497A - Word embedding-based multi-stage network attack detection method

Info

Publication number: CN112019497A
Application number: CN202010660792.0A
Authority: CN
Inventors: 周鹏; 周公延
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2020-07-10
Filing date: 2020-07-10
Publication date: 2020-12-01
Anticipated expiration: 2040-07-10
Also published as: CN112019497B

Abstract

The invention provides a word embedding-based multi-stage network attack detection method, which comprises the following steps: 1) carrying out feature selection on a data set formed by network traffic features after attack occurs; 2) vectorizing network traffic data using a word embedding method; 3) respectively constructing a current vector and a historical vector, and constructing a training sample by using a negative sampling method; 4) establishing a multi-stage attack detection model based on word embedding, calculating an association vector, calculating association probability by using a supervised learning classification algorithm, and judging the possibility that the current data belongs to multi-stage attack. The method has the advantages that the intrusion detection system can automatically associate the attack stage from the data packet level without defining association rules, and simultaneously, the problem that no alarm is generated in part of the attack stage when multi-stage attack detection is carried out from the alarm level is avoided.

Description

Word embedding-based multi-stage network attack detection method

Technical Field

The invention relates to a word embedding-based multi-stage network attack detection method, suitable for detecting intrusions in which an attacker deliberately carries out a multi-stage network attack in an industrial internet boundary protection scenario.

Background

Industrial internet boundary protection generally comprises five aspects: identification, protection, detection, response and recovery. Intrusion detection is an important link in industrial internet boundary protection: by continuously monitoring the network traffic of the industrial internet and analyzing the traffic characteristics after an attack, it locates the attack, identifies the occurrence of a security event, and provides information for the security response and security recovery mechanisms.

As industrial internet boundary protection technology continues to develop, it has become increasingly difficult for an attacker to penetrate a network by exploiting isolated vulnerabilities and security flaws (such as SQL injection attacks, denial of service attacks, etc.). To intrude successfully, an attacker therefore often needs to combine a series of attack means, such as network probing, vulnerability discovery and flaw exploitation, and penetrate step by step, so that one intrusion consists of several stages and forms a multi-stage attack. Moreover, to conceal the attack, the attacker often disguises some attack stages as normal network behavior, yet these disguised stages remain associated with the other behaviors of the attack.

Traditional machine learning-based intrusion detection generally models network traffic, either recognizing attacks from the network characteristics of known attacks or detecting anomalies in network packets. It largely ignores the sequential correlation of network data and therefore cannot detect multi-stage attacks, so the detection of multi-stage attacks faces new challenges. Existing multi-stage attack detection methods, on the other hand, fall mainly into rule-based methods and methods based on statistical learning algorithms. Rule-based methods require rules to be written manually and are generally used to extract multi-stage attacks from attacked data and perform association analysis. Methods based on statistical learning algorithms mainly use a hidden Markov model, whose parameters are obtained by statistical analysis of a large number of attack samples; but the hidden Markov model relies on an independence assumption, namely that the current state depends only on the previous state, so it cannot learn deeper multi-stage attack characteristics.

Disclosure of Invention

Starting from the observation that the network packets of the different stages of a multi-stage attack are latently correlated, the invention aims to provide a multi-stage network attack detection method based on word embedding. Unlike existing methods, the invention targets the sequential characteristics of network traffic data and the planned, multi-stage behavior of an attacker.

In order to achieve the above purpose, the invention is realized by the following technical scheme:

A word embedding-based multi-stage network attack detection method comprises the following steps:

1) carrying out feature selection on a data set formed by network traffic features after attack occurs;

2) vectorizing network traffic data using a word embedding method;

3) respectively constructing a current vector and a historical vector, and constructing a training sample by using a negative sampling method;

4) establishing a multi-stage attack detection model based on word embedding, calculating an association vector, calculating association probability by using a supervised learning classification algorithm, and judging the possibility that the current data belongs to multi-stage attack.

The feature selection in step 1) comprises the following steps:

step 1.1, randomly dividing a data set consisting of a large number of attack samples into a training set, a validation set and a test set; let $X = (x_1, x_2, x_3, \ldots, x_i)$ denote a data sequence and $x_i = (x_i^{(1)}, x_i^{(2)}, x_i^{(3)}, \ldots, x_i^{(j)})$ denote a single data packet, where $x_i^{(j)}$ denotes the $j$-th feature of data packet $x_i$;

step 1.2, according to the composition of network data packets and the network transmission protocol, analyzing attack data, performing preliminary feature selection and feature construction, and selecting n features;
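The random division in step 1.1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_dataset`, the fixed seed and the toy data are assumptions, and the 80%:10%:10% default simply mirrors the ratio given for one embodiment.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly split samples into training, validation and test sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder goes to the test set
    return train, val, test

# Hypothetical toy data: 100 sequences, each a list of packets (feature tuples).
sequences = [[("TCP", 80, 60 + i)] for i in range(100)]
train, val, test = split_dataset(sequences)
```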

the vectorization of the network traffic data in step 2) comprises the following steps:

step 2.1, keeping the n network traffic features obtained by feature selection and deleting the unselected network traffic features from the original data set, where network traffic refers to the amount of information passing through a network device or transmission medium per unit time;

step 2.2, dividing the data set into per-feature sequences, i.e. $S = (s^{(1)}, s^{(2)}, s^{(3)}, \ldots, s^{(j)})$, where a single feature sequence is denoted $s^{(j)} = (x_1^{(j)}, x_2^{(j)}, x_3^{(j)}, \ldots, x_i^{(j)})$, so that the number of sequences obtained does not exceed the number of features;

step 2.3, applying the skip-gram word embedding method of word2vec to the feature sequences obtained: a single feature sequence is used as the corpus, the values within a window are selected as one sample each time, the center word of the window is the input and the remaining words are the output, a neural network is constructed that predicts the remaining words from the center word, and the hidden-layer weights of the neural network are used as the vector of the input word. Word embedding is carried out on the other feature sequences in the same way, giving a word embedding vector for each feature, denoted $v_i^{(j)} \in \mathbb{R}^k$;

step 2.4, concatenating the vectors corresponding to the features to obtain the vector representation of the network traffic data, i.e. $v_i = (v_i^{(1)}, v_i^{(2)}, v_i^{(3)}, \ldots, v_i^{(j)})$;
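The data handling in steps 2.2 to 2.4 can be sketched as below. This is only an illustration under stated assumptions: the function names, the window size and the tiny embedding tables are hypothetical, and the skip-gram network itself is not trained here (the tables stand in for its learned hidden-layer weights; a library such as gensim's `Word2Vec` with `sg=1` could supply them).

```python
def feature_sequences(packets):
    """Step 2.2: turn a list of packets (feature tuples) into one sequence per feature."""
    return [list(column) for column in zip(*packets)]

def skipgram_pairs(sequence, window=2):
    """Step 2.3, data side only: (center, context) pairs inside a sliding window."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

def embed_packet(packet, tables):
    """Step 2.4: concatenate the per-feature embedding vectors of one packet."""
    vector = []
    for value, table in zip(packet, tables):
        vector.extend(table[value])
    return vector

# Hypothetical packets with two features: protocol and port.
packets = [("TCP", 80), ("TCP", 443), ("UDP", 53)]
seqs = feature_sequences(packets)
pairs = skipgram_pairs(seqs[0], window=1)

# Hypothetical 2-dimensional embeddings standing in for trained skip-gram weights.
tables = [
    {"TCP": [0.1, 0.2], "UDP": [0.3, 0.4]},
    {80: [0.5, 0.6], 443: [0.7, 0.8], 53: [0.9, 1.0]},
]
v0 = embed_packet(packets[0], tables)
```

Each feature keeps its own lookup table, so a value is embedded from the context of its own feature sequence, and the packet vector is the concatenation of those lookups.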

The construction of the current vector, the history vector and the training samples in step 3) comprises the following steps:

step 3.1, creating a history vector $H \in \mathbb{R}^{m \times k}$ of length $m$ to store the history information of the multi-stage attack; the data at any time $t$ is denoted $D \in \mathbb{R}^k$ and carries the information of the current data; let $A_i$ denote the corresponding attack stage;

step 3.2, initializing $H$ with the data of the first attack stage; taking data $D^+ \in A_2$ from the second attack stage as a positive example, with $[H, D^+]$ as a positive sample labeled 1; taking data $D^-$ from other moments as a negative example, with $[H, D^-]$ as a negative sample labeled 0; the ratio of positive to negative samples is $1:g$, where $g > 1$ is an algorithm parameter set manually for sampling during training;

step 3.3, updating $H$ according to the attack stage so that the new vector contains the information of the attack stages that have appeared so far, and repeating the sample construction process of step 3.2 until $M$ samples are constructed;
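The negative-sampling construction of step 3.2 can be sketched as follows. This is a hedged illustration: `build_samples`, the dictionary layout of `stage_data` and the toy vectors are hypothetical, and treating every packet outside the next stage as a candidate negative is an assumption about what "data from other moments" covers.

```python
import random

def build_samples(history, stage_data, next_stage, g=3, seed=0):
    """Build one positive and g negative samples per packet of the next stage.

    history    -- list of packet vectors already seen (H)
    stage_data -- dict mapping stage id -> list of packet vectors
    next_stage -- id of the stage whose packets are positive examples
    """
    rng = random.Random(seed)
    # Assumption: any packet that is not in the next stage may serve as a negative.
    negatives_pool = [d for stage, data in stage_data.items()
                      if stage != next_stage for d in data]
    samples = []
    for d_pos in stage_data[next_stage]:
        samples.append((history, d_pos, 1))                       # [H, D+], label 1
        for _ in range(g):
            samples.append((history, rng.choice(negatives_pool), 0))  # [H, D-], label 0
    return samples

# Hypothetical 2-dimensional packet vectors.
stage_data = {
    0: [[0.0, 0.1], [0.1, 0.0]],   # normal traffic
    1: [[0.9, 0.8]],               # attack stage A1
    2: [[0.7, 0.9], [0.8, 0.7]],   # attack stage A2
}
H = list(stage_data[1])            # H initialized with the first attack stage
samples = build_samples(H, stage_data, next_stage=2, g=3)
```

For step 3.3, the same function would be called again after extending `H` with the stage just processed and moving `next_stage` forward.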

the establishment of the word embedding-based multi-stage attack detection model in step 4) comprises the following steps:

step 4.1, converting the sequence modeling problem into a classification problem based on the training samples constructed in step 3), and calculating an association vector from the current data vector $D$ and the history vector $H$, denoted $R(H, D)$, as the input of the classifier;

the association vector $R(H, D)$ is calculated as

$$R(H, D) = D \odot [h_1, h_2, \ldots, h_m]$$

where $h_m$ denotes the $m$-th vector in $H$; the formula takes the Hadamard product of $D$ with each vector $h_m$ in $H$;

the optimization objective is

where $S$ denotes the size of the training set, $H$ and $D$ are the inputs of the model, $y$ is the label and represents the true output of the model, and $p(D, H)$ denotes the association probability between the current data $D$ and the historical data $H$ of the attack that has occurred;
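The symbols defined for the objective suggest a per-sample classification loss averaged over the training set. A minimal sketch, assuming the standard binary cross-entropy loss (an assumption; the text names only the symbols $S$, $y$ and $p(D, H)$), with a hypothetical index $s$ running over the training samples:

```latex
\mathcal{L} = -\frac{1}{S} \sum_{s=1}^{S} \Big[ y_s \log p(D_s, H_s) + (1 - y_s) \log\big(1 - p(D_s, H_s)\big) \Big]
```

Under this assumption, a positive sample ($y_s = 1$) is penalized when the predicted association probability is low, and a negative sample ($y_s = 0$) when it is high.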

step 4.2, initializing a history vector H;

step 4.3, reading real-time network traffic data, selecting from it the network traffic features chosen in step 1.2, and vectorizing them according to step 2) to obtain the real-time data vector $D$;

step 4.4, using the model established in step 4.1 to judge the relation between $D$ and the attack stages that have occurred, and outputting the association probability $P_a = p(D|H)$;

step 4.5, defining a threshold; if $P_a$ is larger than the given threshold, adding $D$ to a cache, and updating $H$ when the cache reaches the specified size;

step 4.6, repeating steps 4.3 to 4.5.
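The association vector and the detection loop of steps 4.3 to 4.6 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the names are hypothetical, `predict_proba` stands for any trained classifier that returns a probability, `toy_proba` is a stand-in rather than a trained model, flattening $R(H, D)$ into one list is a design choice, and replacing $H$ with the cached vectors is only one possible reading of "updating H".

```python
import math

def association_vector(H, D):
    """R(H, D): Hadamard product of D with every history vector, flattened."""
    return [d * h for h_vec in H for d, h in zip(D, h_vec)]

def detect_stream(stream, H, predict_proba, threshold=0.5, cache_size=2):
    """Steps 4.3-4.6: score each vector, cache hits, refresh H when the cache is full."""
    cache, probs = [], []
    for D in stream:
        p_a = predict_proba(association_vector(H, D))
        probs.append(p_a)
        if p_a > threshold:
            cache.append(D)
            if len(cache) >= cache_size:
                H = cache      # assumed update rule: cached data becomes the new H
                cache = []
    return probs, H

def toy_proba(r):
    """Hypothetical stand-in classifier: sigmoid of the summed association vector."""
    return 1.0 / (1.0 + math.exp(-sum(r)))

H0 = [[1.0, 0.0], [0.0, 1.0]]                     # m = 2 history vectors
stream = [[2.0, 2.0], [-3.0, -3.0], [1.0, 1.0]]   # hypothetical real-time vectors
probs, H_final = detect_stream(stream, H0, toy_proba, threshold=0.5, cache_size=2)
```

Keeping the cache size equal to the length of `H` keeps the classifier input dimension unchanged after an update.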

Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable advantages:

the intrusion detection system can automatically associate the attack stage from the data packet level without defining association rules, and simultaneously avoids the problem that no alarm is generated in part of the attack stage when multi-stage attack detection is carried out from the alarm level.

Drawings

FIG. 1 is a general flow chart of the process of the present invention.

FIG. 2(a) is a diagram of a multi-stage attack detection model in the training phase of the present invention.

FIG. 2(b) is a diagram of a multi-stage attack detection model at the testing stage of the present invention.

Fig. 3(a) is a diagram of a network traffic data vectorization model according to the present invention.

Fig. 3(b) is an exemplary diagram of vectorization of network traffic data according to the present invention.

FIG. 4 shows the receiver operating characteristic (ROC) curves of the multi-stage attack detection model.

FIG. 5 is a confusion matrix corresponding to the multi-stage attack detection results in the test set.

FIG. 6(a) is a diagram showing the real-time detection result of multi-stage attack in the test set according to the present invention.

FIG. 6(b) is a diagram of the original attack stage distribution of network traffic in the test set according to the present invention.

Fig. 6(c) is a diagram of the multi-stage attack real-time detection result in other network scenarios according to the present invention.

Fig. 6(d) is the original attack stage distribution diagram under other network scenarios.

Detailed Description

The invention is described in detail below with reference to the drawings and preferred embodiments.

The first embodiment is as follows:

as shown in fig. 1, a multi-stage network attack detection method based on word embedding includes the following steps:

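1) carrying out feature selection on a data set formed by network traffic features after attack occurs;

2) vectorizing network traffic data using a word embedding method;

3) respectively constructing a current vector and a historical vector, and constructing a training sample by using a negative sampling method;

4) establishing a multi-stage attack detection model based on word embedding, calculating an association vector, calculating association probability by using a supervised learning classification algorithm, and judging the possibility that the current data belongs to multi-stage attack.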

Example two:

this embodiment is substantially the same as the first embodiment, and is characterized in that:

in this embodiment, the feature selection in step 1) of the word embedding-based multi-stage network attack detection method comprises the following steps:

step 1.1, randomly dividing a data set consisting of a large number of attack samples into a training set, a validation set and a test set, the division ratio in this embodiment being 80%:10%:10%; let $X = (x_1, x_2, x_3, \ldots, x_i)$ denote a data sequence and $x_i = (x_i^{(1)}, x_i^{(2)}, x_i^{(3)}, \ldots, x_i^{(j)})$ denote a single data packet, where $x_i^{(j)}$ denotes the $j$-th feature of data packet $x_i$;

step 1.2, according to the composition of network data packets and the network transmission protocol, analyzing attack data, performing preliminary feature selection and feature construction, and selecting n features; in this embodiment the IP address, network protocol, port number, data length and time difference are selected as original features, and port numbers are mapped to commonly exploited vulnerability ports;

step 2.2, dividing the data set into per-feature sequences, i.e. $S = (s^{(1)}, s^{(2)}, s^{(3)}, \ldots, s^{(j)})$, where a single feature sequence is denoted $s^{(j)} = (x_1^{(j)}, x_2^{(j)}, x_3^{(j)}, \ldots, x_i^{(j)})$, so that the number of sequences obtained does not exceed the number of features; in this embodiment only the network protocol, port number and IP address are finally selected for vectorization; the source port and the destination port are of the same nature, so only one sequence is constructed to learn the port embedding vectors, and the same applies to the IP addresses;

step 2.3, applying the skip-gram word embedding method of word2vec to the feature sequences obtained: a single feature sequence is used as the corpus, the values within a window are selected as one sample each time, the center word of the window is the input and the remaining words are the output, a neural network is constructed that predicts the remaining words from the center word, and the hidden-layer weights of the neural network are used as the vector of the input word. Word embedding is carried out on the other feature sequences in the same way, giving a word embedding vector for each feature, denoted $v_i^{(j)} \in \mathbb{R}^k$; in this embodiment the vector dimension $k$ is set to 8;

The training sample construction in step 3) comprises the following steps:

step 3.1, maintaining a history vector $H \in \mathbb{R}^{m \times j \times k}$ of length $m$ to store the history information of the multi-stage attack; the data at any time $t$ is denoted $D \in \mathbb{R}^k$ and carries the information of the current data; let $A_i$ denote the corresponding attack stage; the length of $H$ in this embodiment is 50;

step 3.2, initializing $H$ with the data of the first attack stage; taking data $D^+ \in A_2$ from the second attack stage as a positive example, with $[H, D^+]$ as a positive sample labeled 1; arbitrarily taking data $D^-$ from other moments as a negative example, with $[H, D^-]$ as a negative sample labeled 0; the ratio of positive to negative samples is $1:g$, and to enhance the generalization capability of the model this ratio is set to 1:50 in this embodiment;

step 3.3, updating $H$ according to the attack stage so that the new vector contains the information of the attack stages that have appeared so far, and repeating the sample construction process of step 3.2 until $M$ samples are constructed, where $M$ is 20400 in this embodiment;

step 4.1, converting the sequence modeling problem into a classification problem based on the training samples constructed in step 3), calculating an association vector from the current data vector $D$ and the history vector $H$, denoted $R(H, D)$, and training the word embedding-based multi-stage attack detection model;

the association vector is calculated as

$$R(H, D) = D \odot [h_1, h_2, \ldots, h_m]$$

where $h_m$ denotes the $m$-th vector in $H$; the formula takes the Hadamard product of $D$ with each vector $h_m$ in $H$.

The optimization objective is

where $S$ denotes the size of the training set, $H$ and $D$ are the inputs of the model, $y$ is the label and represents the true output of the model, and $p(D, H)$ denotes the association probability between the current data $D$ and the historical data $H$ of the attack that has occurred.

Step 4.2, initializing the history vector $H$ and reading real-time network traffic data, where $H$ is initialized with the first stage of the multi-stage attack;

step 4.3, selecting from the real-time network traffic data the network traffic features chosen in step 1.2, and vectorizing them according to step 2) to obtain the real-time data vector $D$;

step 4.4, using the model established in step 4.1 to judge the relation between $D$ and the attack stages that have occurred, and outputting the association probability $P_a = p(D|H)$;

step 4.5, defining a threshold; if $P_a$ is larger than the given threshold, adding $D$ to the cache, and updating $H$ when the cache reaches the specified size; in this embodiment the threshold is selected according to the recall rate and accuracy rate of the model, and the cache size is set to 50;

step 4.6, repeating steps 4.3 to 4.5.
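Selecting the threshold of step 4.5 from model metrics can be sketched as follows. The text does not say how the two metrics are combined, so this sketch makes two loudly stated assumptions: "accuracy rate" is read as precision, and the candidate threshold with the highest F1 score is chosen. The function names, labels and probabilities are hypothetical.

```python
def precision_recall(labels, probs, threshold):
    """Precision and recall when samples with probability > threshold are flagged."""
    tp = sum(1 for y, p in zip(labels, probs) if p > threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p > threshold and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p <= threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def select_threshold(labels, probs, candidates):
    """Pick the candidate threshold with the highest F1 (assumed selection criterion)."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        p, r = precision_recall(labels, probs, t)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Hypothetical validation-set labels and association probabilities.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
threshold, f1 = select_threshold(labels, probs, [0.3, 0.5, 0.7])
```

A lower threshold favors recall (fewer missed attack stages), a higher one favors precision (fewer unrelated packets pulled into the cache).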

Example three:

this embodiment is basically the same as the second embodiment, and is characterized in that:

in this embodiment, when the word embedding-based multi-stage attack detection model is established, several classification algorithms are used to output the association probability; they are compared in terms of accuracy, recall, F1, AUC and inference time, with the results shown in the following table:

the network traffic characteristics selected in this embodiment are as follows:

(1) protocol, the network transport protocol type, such as TCP, UDP, ICMP, etc.;

(2) data length, the length of a single data packet;

(3) delta time, the time difference between the current packet and the previous packet;

(4) source port, source host port number;

(5) destination port, destination host port number;

(6) source IP, source host IP address;

(7) destination IP, destination host IP address;
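The seven features listed above can be held in a simple record, with delta time derived from packet timestamps. This is a minimal sketch: the class name, the field names, the raw-packet dictionary keys and the sample addresses are hypothetical, and giving the first packet a delta time of 0.0 is an assumption.

```python
from dataclasses import dataclass

@dataclass
class PacketFeatures:
    protocol: str
    data_length: int
    delta_time: float
    source_port: int
    destination_port: int
    source_ip: str
    destination_ip: str

def to_features(raw_packets):
    """Convert raw packets (dicts with a 'timestamp' key) into PacketFeatures."""
    features, previous_ts = [], None
    for pkt in raw_packets:
        # Assumption: the first packet has no predecessor, so its delta time is 0.0.
        delta = 0.0 if previous_ts is None else pkt["timestamp"] - previous_ts
        previous_ts = pkt["timestamp"]
        features.append(PacketFeatures(
            protocol=pkt["protocol"],
            data_length=pkt["length"],
            delta_time=delta,
            source_port=pkt["sport"],
            destination_port=pkt["dport"],
            source_ip=pkt["src"],
            destination_ip=pkt["dst"],
        ))
    return features

# Hypothetical raw packets; addresses come from the documentation range 192.0.2.0/24.
raw = [
    {"timestamp": 10.0, "protocol": "TCP", "length": 60, "sport": 51000,
     "dport": 80, "src": "192.0.2.1", "dst": "192.0.2.2"},
    {"timestamp": 10.5, "protocol": "UDP", "length": 120, "sport": 53000,
     "dport": 53, "src": "192.0.2.1", "dst": "192.0.2.3"},
]
feats = to_features(raw)
```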

in this embodiment, fig. 2(a) shows the word embedding-based multi-stage attack detection model in the training stage. The inputs of the model are the historical attack-stage data and the current data; after word embedding vectorization, a history vector and a current vector are obtained respectively, the association vector is calculated by the method of step 4.1, and a classifier finally judges whether the current data is associated with the attack stages that have occurred. Fig. 2(b) shows the model in the testing stage: the model acquires network traffic in real time, initializes the history vector by means of an intrusion detection system, outputs the association probability between the current data and the history vector, adds the current data to a cache when the probability is greater than the given threshold, and updates the history vector when the cache reaches the specified size.

In this embodiment, the multivariate word embedding model for vectorizing network data in step 2) is shown in fig. 3(a), and fig. 3(b) is an example of multivariate word embedding vectorization. Note that the example in fig. 3(b) contains three features but only two sequences after serialization: port numbers mean the same thing on the source host and the target host, so the source and target port numbers are treated as one sequence, and the same port number has the same word vector on both hosts.

In this embodiment, fig. 4 shows the receiver operating characteristic curves of the proposed multi-stage attack detection model with different classifiers. The ensemble learning classifier achieves the best result. Although vectors obtained by word embedding are generally high-dimensional, the word embedding dimension in this embodiment is not high, so conventional machine learning classification algorithms already give good results.

Fig. 5 shows the confusion matrix corresponding to the multi-stage attack detection results in the test set, and it can be seen that each attack stage is correctly detected with an accuracy of over 90%.
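A per-stage detection rate of this kind can be read off a confusion matrix as the diagonal entry divided by its row total. The sketch below assumes rows are true stages and columns are predicted stages; the counts are hypothetical and are NOT the values of Fig. 5.

```python
def per_class_recall(confusion):
    """Fraction of each true class (row) that was predicted correctly (diagonal)."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]

# Hypothetical counts for three attack stages, not taken from the patent figures.
confusion = [
    [95, 3, 2],
    [4, 92, 4],
    [1, 6, 93],
]
rates = per_class_recall(confusion)
```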

Fig. 6(a) shows, for the network traffic of the test set, the association probability between the data at the current time and the attack stages that have already occurred, as output by the multi-stage attack detection model; fig. 6(b) shows where the different attack stages occur in the original test set data. Comparing fig. 6(a) with fig. 6(b) shows that the model correctly detects the data associated with the preceding attack stages. To test the generalization performance of the model, the multi-stage attack detection model of the invention is further evaluated in a different network scenario; the corresponding detection results and original attack stage distribution are shown in fig. 6(c) and fig. 6(d), from which it can be seen that attack stage 2 to attack stage 4 are correctly associated in that network scenario.

The present invention is not limited to the above embodiments. Those skilled in the art can implement the invention in various other embodiments according to its disclosure, and changes or modifications based on the design and concept of the invention remain within its scope.

Claims

1. A multi-stage network attack detection method based on word embedding is characterized by comprising the following steps:

2) vectorizing network traffic data using a word embedding method;

2. The word-embedding-based multi-stage network attack detection method according to claim 1, wherein the feature selection in the step 1) comprises the following steps:

Step 2.4, concatenating the vectors corresponding to the features to obtain the vector representation of the network traffic data, i.e. $v_i = (v_i^{(1)}, v_i^{(2)}, v_i^{(3)}, \ldots, v_i^{(j)})$.

3. The word-embedding-based multi-stage network attack detection method according to claim 1, wherein the constructing of the training samples in the step 3) comprises the following steps:

and 3.3, updating H according to the attack stage, wherein the new vector contains the current attack stage information, and repeating the sample construction process in the step 3.2 to construct M samples.

4. The word-embedding-based multi-stage network attack detection method according to claim 1, wherein the establishing of the word-embedding-based multi-stage attack detection model in the step 4) comprises the following steps:

the association vector $R(H, D)$ is calculated as

$$R(H, D) = D \odot [h_1, h_2, \ldots, h_m]$$

where $h_m$ denotes the $m$-th vector in $H$; the formula takes the Hadamard product of $D$ with each vector $h_m$ in $H$;

the optimization objective is

Step 4.5, defining a threshold; if $P_a$ is larger than the given threshold, adding $D$ to the cache, and updating $H$ when the cache reaches the specified size, wherein the threshold is selected according to the recall rate and accuracy rate of the model, and the cache size is set to 50;

and 4.6, repeating the steps 4.3 to 4.5.