CN112398779A

CN112398779A - Network traffic data analysis method and system

Info

Publication number: CN112398779A
Application number: CN201910739001.0A
Authority: CN
Inventors: 方少峰; 孙鹏科; 闫振中; 郑岩; 马福利; 佟继周
Original assignee: National Space Science Center of CAS
Current assignee: National Space Science Center of CAS
Priority date: 2019-08-12
Filing date: 2019-08-12
Publication date: 2021-02-23
Anticipated expiration: 2039-08-12
Also published as: CN112398779B

Abstract

The invention belongs to the technical field of network traffic data analysis and particularly relates to an anomaly detection method for network traffic data, comprising: processing raw network traffic data captured in real time to obtain network flow data; if the network flow data is abnormal, outputting the anomaly, inputting the abnormal data into a pre-trained first anomaly classifier, determining that its attack type is a known attack type, and outputting that attack type; if the network flow data is not abnormal, applying an unsupervised anomaly detection method to further check whether it is abnormal; if it is then found to be abnormal, inputting the abnormal data into a pre-trained second anomaly classifier, determining that its type is an unknown attack type, and labeling it as an unknown attack type; otherwise, outputting that the data is normal.

Description

Network traffic data analysis method and system

Technical Field

The invention belongs to the technical field of machine-learning-based and big-data-based anomaly detection and network traffic data analysis, and particularly relates to a network traffic data analysis method and system based on a sparse autoencoder and an extreme random tree.

Background

In recent decades, with the rapid development of the Internet, from the consumer Internet and the industrial Internet to the Internet of Everything, people's ways of communicating and consuming, and the economic structure of the whole country, have been reshaped. Network security problems are becoming increasingly difficult: network attacks emerge endlessly, and traditional defenses are often helpless against new attack modes. From the basic data link layer through the network and transport layers up to the presentation and application layers, attack methods are complex, constantly evolving and hard to guard against. One example is the distributed denial-of-service (DDoS) attack, whose scale keeps growing. It includes both the traditional network-layer DDoS attack, which exploits characteristics of the TCP/IP protocol, and the application-layer DDoS attack developed from it; at the application layer it can be further divided into DNS-Flood attacks, slow-connection attacks and CC attacks.

Various network security technologies have been developed to ensure network security. Among them, network traffic analysis and intrusion detection play a very important role by detecting anomalies in network traffic and then providing early warning. Current network traffic analysis and intrusion detection methods mainly include: traditional methods based on a feature library, statistical methods based on probability and rules, supervised learning methods based on classification, unsupervised learning methods based on clustering, and derived techniques based on outlier detection. On this basis, researchers have built many network anomaly detection systems that perform well in experiments, but many problems often appear when they are put into practical use. In short, in the field of anomaly detection, anomalies can be classified into point anomalies, condition anomalies and mode anomalies. Different detection systems are effective for most point anomalies, but for condition anomalies and mode anomalies a trade-off between accuracy and false alarm rate is often required.

For example, an anomaly detection system based on a conventional feature library needs frequent updates of the feature library, cannot detect unknown anomalies or even simply encrypted data streams at all, and is very costly to maintain and update. When network traffic becomes complex, statistical methods based on probability and rules often judge abnormal network flow data to be normal, which attackers can easily exploit.

Unsupervised detection systems based on clustering and outliers often suffer from extremely slow training, difficult parameter selection and a high false alarm rate when the data scale and the feature dimension are large.

The interconnection of networks and the advent of the big data era have made the scale of network data grow exponentially, and a large amount of traffic data is generated every day. Although most of this data is normal traffic, the volume and variety of abnormal traffic are also increasing continuously, which brings new opportunities and challenges to the network security field. In the big data era, as the features of network flow data become richer, existing anomaly detection systems built on the traditional KDD-CUP99 or NSL-KDD data sets suffer from poor reliability, low accuracy, a high false alarm rate and similar problems. Whether for network intrusion analysis or for enterprise internal threat analysis, existing methods face various problems. Detection and defense systems depend heavily on efficient analysis of network flow data, so establishing effective feature preprocessing, feature selection and feature extraction tools for the new data features is a very important problem.

Disclosure of Invention

The invention aims to solve the problems of poor generalization, low detection rate and high false alarm rate of existing network traffic data analysis methods in network security anomaly detection, and provides a network traffic data analysis method based on a sparse autoencoder and an extreme random tree. The method selects a suitable automatic encoding for each type of feature in the network flow data, which both reduces the feature dimensionality and makes it possible to compute distances between features such as IP addresses and protocols, thereby providing a foundation for conventional distance-based or density-based anomaly detection techniques. For the processed numerical features, a feature selection method based on the extreme random tree is then applied, which reduces the dimensionality while keeping the selected features practically meaningful, making subsequent analysis possible. The extracted feature set can be combined with supervised classification techniques and with unsupervised outlier detection techniques. Experiments show that accuracy is effectively improved and the false alarm rate greatly reduced, and that computation is much faster because the data are re-encoded and processed by feature engineering.

In order to achieve the above object, the present invention provides an anomaly detection method for network traffic data, including:

processing the original network flow data captured in real time to obtain network flow data;

if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;

if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;

if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data to be the unknown attack type;

if the network flow data is not abnormal data, the output is normal.
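
The two-stage decision flow described above can be sketched as follows. This is only an illustration of the control flow: `known_attack_type` and `is_outlier` are hypothetical stand-ins for the first (supervised) classifier and for the unsupervised detector with the second classifier, and the toy rules at the bottom are invented for the example.

```python
from typing import Callable, Optional, Sequence

Flow = Sequence[float]


def analyze_flow(
    flow: Flow,
    known_attack_type: Callable[[Flow], Optional[str]],
    is_outlier: Callable[[Flow], bool],
) -> str:
    """Apply the supervised stage first, then the unsupervised stage.

    known_attack_type: hypothetical stand-in for the first anomaly classifier;
        returns an attack label, or None when the flow looks normal.
    is_outlier: hypothetical stand-in for the unsupervised detection stage.
    """
    label = known_attack_type(flow)
    if label is not None:
        return f"abnormal: {label}"
    if is_outlier(flow):
        return "abnormal: unknown attack type"
    return "normal"


# Toy rules, invented purely for illustration.
def toy_known(flow: Flow) -> Optional[str]:
    return "dos" if flow[0] > 0.9 else None


def toy_outlier(flow: Flow) -> bool:
    return sum(flow) > 1.5


print(analyze_flow([0.95, 0.1], toy_known, toy_outlier))
print(analyze_flow([0.8, 0.8], toy_known, toy_outlier))
print(analyze_flow([0.1, 0.1], toy_known, toy_outlier))
```

Running the supervised stage first means that only flows it accepts as normal reach the slower unsupervised check.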

As one improvement of the above technical solution, processing the original network traffic data captured in real time to obtain network flow data specifically comprises the following steps:

capturing original network flow data in real time;

extracting available data characteristics from the obtained original network traffic data to obtain network traffic characteristic data;

performing data cleaning and attribute splitting on the acquired network flow characteristic data, and splitting the acquired network flow characteristic data into numerical data and non-numerical data;

inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data;

inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;

and carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data.
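
A minimal sketch of the final normalization step is given below. The text does not say which normalization is used, so min-max scaling to [0, 1] is an assumption here, and the `encoded` and `screened` values are hypothetical.

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every column to [0, 1]; a constant column is mapped to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical values: two flows, each with two encoded non-numerical
# dimensions and two screened numerical features.
encoded = [[0.2, 0.7], [0.4, 0.1]]
screened = [[132.0, 2.0], [528.0, 2.0]]

# Concatenate both parts per flow, then normalize column by column.
merged = [e + s for e, s in zip(encoded, screened)]
flows = min_max_normalize(merged)
print(flows)
```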

As one improvement of the above technical solution, inputting the non-numerical data into the pre-trained sparse autoencoder for re-encoding to obtain the encoded non-numerical data specifically comprises the following steps:

dividing according to the attribute label set, performing attribute splitting on the non-numerical data, and acquiring a non-numerical feature set from the non-numerical data;

performing one-hot encoding on the non-numerical feature set to obtain a one-hot-encoded non-numerical feature set, inputting it into the pre-trained sparse autoencoder, and obtaining the encoder extracted from the sparse autoencoder;

and adopting a TCPIP2Vec algorithm based on a sparse self-encoder to re-encode the one-hot code of the non-numerical characteristic set to obtain encoded non-numerical data.
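
The one-hot step that precedes the re-encoding can be sketched as below. The helper names are hypothetical, and the three records are illustrative; only the general technique (one slot per distinct category, concatenated across fields) is taken from the text.

```python
from typing import Dict, List, Sequence


def build_vocab(values: Sequence[str]) -> Dict[str, int]:
    """Give each distinct category an index, in first-seen order."""
    vocab: Dict[str, int] = {}
    for v in values:
        vocab.setdefault(v, len(vocab))
    return vocab


def one_hot(value: str, vocab: Dict[str, int]) -> List[int]:
    """Return a vector with a single 1 at the index of `value`."""
    vec = [0] * len(vocab)
    vec[vocab[value]] = 1
    return vec


def encode_record(record: Dict[str, str],
                  vocabs: Dict[str, Dict[str, int]]) -> List[int]:
    """Concatenate the one-hot vectors of all non-numerical fields."""
    out: List[int] = []
    for field, vocab in vocabs.items():
        out.extend(one_hot(record[field], vocab))
    return out


# Illustrative records with three non-numerical fields.
records = [
    {"proto": "udp", "state": "CON", "service": "dns"},
    {"proto": "tcp", "state": "FIN", "service": "http"},
    {"proto": "tcp", "state": "CON", "service": "dns"},
]
fields = ("proto", "state", "service")
vocabs = {f: build_vocab([r[f] for r in records]) for f in fields}
encoded = [encode_record(r, vocabs) for r in records]
print(encoded)
```

The vector length grows with the number of distinct values per field, which is why a lower-dimensional re-encoding is applied afterwards.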

As one of the improvements of the above technical solution, the establishing and training of the sparse autoencoder specifically includes:

establishing a sparse autoencoder and, following the sparse-autoencoder-based TCPIP2Vec algorithm, training it with a cross-entropy loss function $J_S(W, b)$ to which a KL-divergence sparsity penalty term is added:

$$J_S(W, b) = J_{CE}(W, b) + \beta \sum_{j} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right)$$

wherein $J_{CE}(W, b)$ denotes the cross-entropy term; $\beta \sum_{j} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ is the KL divergence penalty applied to the coding function; $\rho$ is the sparsity parameter and $\beta$ is the regularization parameter;

$\hat{\rho}_j$ is the average activation value of the j-th hidden unit of the coding layer, calculated as follows:

$$\hat{\rho}_j = \frac{1}{n} \sum_{i=1}^{n} f_j\left(x^{i}\right)$$

wherein n is the number of samples in a batch set during training; $f(x^{i})$ is the coding function and $x^{i}$ is the i-th sample;

KL is the divergence; the KL divergence is a measure for comparing the similarity between two probability distributions, and is calculated as follows:

$$\mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) = \sum_{t} \rho(t) \log \frac{\rho(t)}{\hat{\rho}_j(t)}$$

wherein $\rho(t)$ is the sparse parameter function;

the input of the sparse self-encoder is a non-numerical characteristic set subjected to one-hot encoding; the output of the sparse self-encoder is a recoded non-numerical characteristic set, namely coded non-numerical data.
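
A sketch of the KL-divergence sparsity penalty is given below. It assumes the two-valued (active or inactive) form of the KL divergence that is commonly used for sparse autoencoders, hidden activations strictly between 0 and 1 (for example sigmoid outputs), and default values for ρ and β that are chosen only for illustration.

```python
import math
from typing import List


def mean_activations(codes: List[List[float]]) -> List[float]:
    """Average activation of each hidden unit over the n samples of a batch."""
    n = len(codes)
    return [sum(column) / n for column in zip(*codes)]


def kl_bernoulli(rho: float, rho_hat: float) -> float:
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat).

    Assumes 0 < rho < 1 and 0 < rho_hat < 1.
    """
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def sparsity_penalty(codes: List[List[float]], rho: float, beta: float) -> float:
    """beta times the sum over hidden units of KL(rho || rho_hat_j)."""
    return beta * sum(kl_bernoulli(rho, r) for r in mean_activations(codes))


def sparse_loss(cross_entropy: float, codes: List[List[float]],
                rho: float = 0.05, beta: float = 1.0) -> float:
    """Cross-entropy term plus sparsity penalty (rho and beta are assumed values)."""
    return cross_entropy + sparsity_penalty(codes, rho, beta)


# Hypothetical batch of hidden codes: 2 samples, 2 hidden units.
dense_codes = [[0.6, 0.4], [0.4, 0.6]]
print(sparse_loss(0.30, dense_codes))
```

The penalty is zero when every hidden unit's average activation equals ρ and grows as the averages move away from it, which pushes most units toward being inactive.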

As one improvement of the above technical solution, inputting the numerical data into the pre-established extreme random tree model and sorting and screening the numerical data by importance in descending order to obtain the screened numerical data specifically comprises the following steps:

dividing according to the attribute numbers, and performing attribute splitting on the numerical data to obtain a split numerical characteristic set;

inputting the split numerical characteristic set into a pre-established extreme random tree model, and arranging each numerical characteristic in the split numerical characteristic set in a descending order from large to small according to importance to obtain a sorted numerical characteristic set;

and screening the sorted numerical feature set according to a preset threshold value to obtain the importance factors of each numerical feature in the sorted numerical feature set larger than the preset threshold value, and recording the importance factors as screened numerical data.
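
The sorting and threshold screening can be sketched in a few lines. The importance values and the threshold below are hypothetical; in the method they would come from the extreme random tree model and from a preset value.

```python
from typing import Dict, List


def screen_features(importances: Dict[str, float], threshold: float) -> List[str]:
    """Sort features by importance (descending) and keep those above the threshold."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


# Hypothetical importance factors for four numerical features.
importances = {"sbytes": 0.31, "dur": 0.05, "sttl": 0.42, "Sload": 0.22}
selected = screen_features(importances, threshold=0.1)
print(selected)
```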

As an improvement of the above technical solution, the establishing process of the extreme random tree-based feature selection model specifically includes:

randomly selecting numerical characteristics in the split numerical characteristic set to construct a plurality of decision trees;

wherein, the construction process of each decision tree is as follows:

the importance factor of each numerical feature is obtained according to the following calculation formula:

$$G(D, A) = \frac{H(D) - H(D \mid A)}{H_A(D)}$$

wherein $G(D, A)$ is the importance factor of the numerical feature A relative to the data set D to be divided, namely the information gain ratio; D is the data set to be divided; A is the currently selected numerical feature; $H_A(D)$ is the information entropy obtained by taking the currently selected numerical feature A as the random variable; $H(D)$ is the information entropy of the set D with the data class as the random variable; $H(D \mid A)$ is the conditional information entropy of the subsets obtained after the set D is divided using the feature A;

wherein, when each decision tree is constructed, k numerical features are randomly selected from the K numerical features, where K is the total dimension of the numerical features and k is the feature dimension set for constructing each decision tree; k is set to be less than K (the source gives the usual setting of k in a formula that is not reproduced here);

when each decision tree is constructed, the numerical feature with the largest information gain ratio $G(D, A)$ is selected from the k selected numerical features, and a node is then constructed and split;

when the nodes of the decision tree are split, randomly selecting an arbitrary number between the maximum value and the minimum value of the numerical characteristic, and recording the arbitrary number as a comparison value; when the numerical characteristic of the sample is greater than the comparison value, taking the sample as a left branch; when the numerical characteristic of the sample is smaller than the comparison value, the sample is taken as a right branch, and then the bifurcation value of the numerical characteristic of the sample is calculated; wherein, the sample is a split numerical characteristic set;

traversing the selected k numerical characteristics to construct a decision tree;

repeating the process of constructing the basic decision tree for N times to construct N decision trees; wherein, the number of the decision tree is determined by using a cross validation and grid search method;

judging each numerical characteristic in the split numerical characteristic set by utilizing a plurality of decision trees, specifically judging whether the original network flow data corresponding to the numerical characteristic is normal data or abnormal data by each decision tree of the plurality of decision trees, summarizing the judgment result of each decision tree by a voting method, and taking the result of the majority of the judgment result as the final judgment result; wherein, the judgment result is that the original network flow data corresponding to the numerical characteristic is normal data or abnormal data;

obtaining importance factors of each numerical characteristic in the split numerical characteristic set according to a finally obtained judgment result and the formula, sorting the importance factors of each numerical characteristic in the split numerical characteristic set according to a descending order of importance, screening each numerical characteristic in the split numerical characteristic set according to a set threshold value, obtaining the importance factors of each numerical characteristic in the sorted numerical characteristic set which is larger than a preset threshold value, and recording the importance factors as screened numerical data;

the input of the extreme random tree model is a split numerical characteristic set, and the output of the extreme random tree model is screened numerical data.
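
The node-level computation described above (random comparison value, information gain ratio, choice of the best of k random features) can be sketched as follows. This is a simplified illustration for a single split, not a full tree builder; the function names are hypothetical, and sending samples equal to the comparison value to the right branch is an assumption, since the text only covers the strictly greater and strictly smaller cases.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def entropy(labels: Sequence) -> float:
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def random_split(values: Sequence[float], rng: random.Random) -> float:
    """Pick the comparison value uniformly between the feature's min and max."""
    return rng.uniform(min(values), max(values))


def gain_ratio(values: Sequence[float], labels: Sequence[int], cut: float) -> float:
    """G(D, A) = (H(D) - H(D|A)) / H_A(D) for the binary split at `cut`."""
    left = [y for v, y in zip(values, labels) if v > cut]    # greater: left branch
    right = [y for v, y in zip(values, labels) if v <= cut]  # otherwise: right branch
    if not left or not right:
        return 0.0  # degenerate split carries no information
    n = len(labels)
    h_cond = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return (entropy(labels) - h_cond) / split_info


def best_feature(columns: Dict[str, List[float]], labels: Sequence[int],
                 k: int, rng: random.Random) -> Tuple[str, float, float]:
    """Among k randomly chosen features, return (name, cut, gain ratio) of the best."""
    scored = []
    for name in rng.sample(sorted(columns), k):
        cut = random_split(columns[name], rng)
        scored.append((gain_ratio(columns[name], labels, cut), name, cut))
    g, name, cut = max(scored)
    return name, cut, g


# Hypothetical data: feature "a" separates the classes, feature "b" is constant.
columns = {"a": [1.0, 2.0, 8.0, 9.0], "b": [5.0, 5.0, 5.0, 5.0]}
labels = [0, 0, 1, 1]
print(best_feature(columns, labels, k=2, rng=random.Random(0)))
```

Because the comparison value is drawn at random rather than optimized, each tree is cheap to build, and the variation between trees is averaged out by the majority vote.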

As one improvement of the above technical solution, extracting available data features from the acquired original network traffic data to obtain the network traffic feature data specifically comprises the following steps:

extracting a first feature from the acquired raw network traffic data by using an Argus tool, wherein the first feature comprises: a source IP address, a source port number, a destination IP address, a destination port number, and a transport protocol type;

extracting second features from the acquired raw network traffic data using a Bro-IDS tool, the second features comprising: the source-IP-to-target-IP packet count, the target-IP-to-source-IP packet count, the application layer protocol type, the bits per second transmitted by the source IP and the bits per second transmitted by the target IP;

the available data features comprise the first features, the second features, and other features extracted from the protocol header, the other features including: the source TCP advertised window size, the target TCP advertised window size, the source TCP sequence number, the target TCP sequence number, the mean size of the stream packets transmitted by the source, the mean size of the stream packets transmitted by the target, the pipeline depth of the HTTP request/response connection, the actual uncompressed size of the data transmitted from the server's HTTP service, the source jitter time (milliseconds), the target jitter time (milliseconds), the source packet inter-arrival time, the target packet inter-arrival time, the number of "syn" and "ack" in the TCP connection, the interval between the SYN and SYN_ACK packets in the TCP connection, and the interval between the SYN_ACK and ACK packets in the TCP connection.

As an improvement of the above technical solution, the establishing and training of the first anomaly classifier specifically includes:

adopting the AdaBoost supervised ensemble classification algorithm to construct an anomaly detection function:

$$G(x) = \operatorname{sign}\left(\sum_{m=1}^{30} \alpha_m G_m(x)\right)$$

wherein $G(x)$ is the first anomaly classifier; $G_m(x)$ is a decision-tree weak classifier, with m = 1, 2, …, 30; $\alpha_m$ is the classifier weight coefficient corresponding to $G_m(x)$;

training the first anomaly classifier according to the anomaly detection function, wherein the input of the first anomaly classifier is $\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)\}$, in which $x_i$ is each piece of processed network flow data, $x_i \in R^n$; $y_i$ is the corresponding label, $y_i \in \{0, 1\}$, where 0 represents normal and 1 represents abnormal; and the output is the judgment of the trained first anomaly classifier on each piece of network traffic feature data.
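
The weighted combination of weak classifiers can be sketched as below. Only the prediction step is shown; the weak classifiers and their weights are hypothetical (the text uses 30 decision-tree weak classifiers), and mapping the 0/1 votes to -1/+1 before taking the sign is an assumption made so that the sign rule works with the 0/1 labels.

```python
from typing import Callable, List, Sequence

WeakClassifier = Callable[[Sequence[float]], int]  # returns 0 (normal) or 1 (abnormal)


def adaboost_predict(x: Sequence[float],
                     weak: List[WeakClassifier],
                     alphas: List[float]) -> int:
    """Weighted vote: map each 0/1 vote to -1/+1, weight by alpha_m, take the sign."""
    score = sum(a * (2 * g(x) - 1) for g, a in zip(weak, alphas))
    return 1 if score > 0 else 0


# Three hypothetical threshold rules standing in for trained weak classifiers.
stumps: List[WeakClassifier] = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.2),
]
alphas = [0.9, 0.4, 0.3]  # hypothetical classifier weight coefficients

print(adaboost_predict([0.9, 0.1], stumps, alphas))
print(adaboost_predict([0.4, 0.9], stumps, alphas))
```

In the second call two of the three rules vote "abnormal", yet the result is "normal" because the dissenting rule carries more weight than the other two combined.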

As an improvement of the above technical solution, the establishing and training of the second anomaly classifier specifically includes:

an unsupervised anomaly detection function is constructed by reconstructing an error function using an auto-encoder:

$$L(x_p, x_r) = \|x_p - x_r\|^2 = \|x_p - F(G(x_p))\|^2$$

wherein $L(x_p, x_r)$ is the reconstruction error function; F and G denote the decoding and encoding functions of the autoencoder, respectively; $x_p$ is the raw network traffic data to be detected; and $x_r$ is the data reconstructed from $x_p$ by the autoencoder;

training a second anomaly classifier according to the unsupervised anomaly detection function, and acquiring a reconstruction error threshold according to a reconstruction error function;

during detection, if the reconstruction error of one piece of network flow data is greater than the reconstruction error threshold value, the piece of network flow data is judged to be abnormal data;

if the reconstruction error of one piece of network flow data is less than or equal to the reconstruction error threshold value, judging that the data is normal data;

the input of the second anomaly classifier is the network flow data $x_p$ that the first anomaly classifier judged to be normal; its output is the judgment of the trained second anomaly classifier on $x_p$.
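
The reconstruction-error test can be sketched as follows. The text only states that the threshold is obtained from the reconstruction error function, so taking a high quantile of the errors on training flows is an assumption, and `toy_reconstruct` is a hypothetical stand-in for the trained autoencoder F(G(x)).

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def reconstruction_error(x_p: Vector, x_r: Vector) -> float:
    """L(x_p, x_r) = ||x_p - x_r||^2."""
    return sum((a - b) ** 2 for a, b in zip(x_p, x_r))


def fit_threshold(train: List[Vector],
                  reconstruct: Callable[[Vector], Vector],
                  quantile: float = 0.95) -> float:
    """Assumed rule: use a high quantile of the training errors as the threshold."""
    errors = sorted(reconstruction_error(x, reconstruct(x)) for x in train)
    index = min(len(errors) - 1, int(quantile * len(errors)))
    return errors[index]


def is_abnormal(x: Vector, reconstruct: Callable[[Vector], Vector],
                threshold: float) -> bool:
    """Abnormal when the reconstruction error is greater than the threshold."""
    return reconstruction_error(x, reconstruct(x)) > threshold


def toy_reconstruct(x: Vector) -> List[float]:
    """Hypothetical stand-in for F(G(x)): reproduces values inside [0, 1] only."""
    return [min(1.0, max(0.0, v)) for v in x]


train = [[0.1, 0.2], [0.5, 1.1], [0.9, 0.3]]
threshold = fit_threshold(train, toy_reconstruct)
print(is_abnormal([0.5, 2.0], toy_reconstruct, threshold))
print(is_abnormal([0.2, 0.3], toy_reconstruct, threshold))
```

An autoencoder trained mostly on normal traffic reconstructs normal flows well, so a large error suggests a flow unlike anything seen in training.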

Based on the network traffic data analysis method, the invention also provides a network traffic data analysis system, which comprises: an original data acquisition module, a data preprocessing module, a data feature extraction module, a data anomaly detection module and a data result display module; wherein:

the original data acquisition module is used for capturing original network flow data in real time;

the data preprocessing module is used for extracting available data characteristics from the acquired original network traffic data and acquiring network traffic characteristic data;

the data characteristic extraction module is used for carrying out data cleaning and attribute splitting on the acquired network flow characteristic data and splitting the acquired network flow characteristic data into numerical data and non-numerical data; inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data; inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;

the data anomaly detection module is used for carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data; detecting whether the network flow data is abnormal data;

if the network flow data is not abnormal data, the output is normal.

And the data result display module is used for displaying the detection result of detecting whether the network flow data is abnormal data.

Compared with the prior art, the invention has the beneficial effects that:

1. The method of the invention greatly reduces the false alarm rate for abnormal data and therefore the amount of manual processing. After processing by this technical solution, the false alarm rate of most classification-based anomaly recognizers falls below 10 percent. Because the data feature extraction module performs effective IP re-encoding, feature extraction and feature selection on the data, the model is more robust to the choice of classifier and to the choice of classifier parameters;

2. The method greatly improves the recall rate for abnormal data: after processing by the sparse autoencoder and the extreme random tree, abnormal data and normal data are more clearly separated, and the recall rate rises above 90 percent after model tuning;

3. The method of the invention improves the detection rate of abnormal data. Because the data feature extraction module is used during training, traditional anomaly detection algorithms, such as a decision tree model, a Bayesian classifier model, an autoencoder model or a random forest model, can obtain anomaly-related information directly from the effective features during detection, and the feature dimension of each piece of network flow data is reduced by one order of magnitude, which greatly improves detection efficiency.

Drawings

FIG. 1 is a schematic structural diagram of a network traffic data analysis system based on a sparse self-encoder and an extreme random tree according to the present invention;

FIG. 2 is a detailed flowchart of step 2) of the method for detecting anomaly of network traffic data according to the present invention;

FIG. 3 is a detailed flowchart of step 3) of the method for detecting anomaly of network traffic data according to the present invention;

FIG. 4 is a detailed flowchart of step 4) of the method for detecting anomaly of network traffic data according to the present invention.

Detailed Description

The invention will now be further described with reference to the accompanying drawings.

The invention provides an anomaly detection method of network traffic data, which comprises the following steps:

step 1), capturing original network flow data in real time;

specifically, an open source tool is adopted to capture original network flow data from a network environment in real time and store the original network flow data into a file in a PCAP format; in this embodiment, a TCPDUMP tool is mainly used to capture original network traffic data from a network environment in real time;

step 2) extracting available data characteristics from the obtained original network traffic data to obtain network traffic characteristic data;

specifically, as shown in fig. 2, an Argus tool is used to extract a first feature from the acquired raw network traffic data, where the first feature includes: a source IP address, a source port number, a destination IP address, a destination port number, and a transport protocol type;

and other features extracted from the protocol header, such as the source TCP advertised window size, the target TCP advertised window size, the source TCP sequence number, the target TCP sequence number, the mean size of the stream packets transmitted by the source, the mean size of the stream packets transmitted by the target, the pipeline depth of the HTTP request/response connection, the actual uncompressed size of the data transmitted from the server's HTTP service, the source jitter time (milliseconds), the target jitter time (milliseconds), the source packet inter-arrival time, the target packet inter-arrival time, the number of "syn" (synchronization sequence number) and "ack" (acknowledgement character) in the TCP connection, the interval between the SYN and SYN_ACK packets in the TCP connection, and the interval between the SYN_ACK and ACK packets in the TCP connection.

The available data features comprise the first features, the second features, and the other features extracted from the protocol header, namely: source IP address, source port number, destination IP address, destination port number, transport protocol type, source-IP-to-destination-IP packet count, destination-IP-to-source-IP packet count, application layer protocol type, bits per second transmitted by the source IP, bits per second transmitted by the destination IP, source TCP advertised window size, destination TCP advertised window size, source TCP sequence number, destination TCP sequence number, mean size of the stream packets transmitted by the source, mean size of the stream packets transmitted by the destination, pipeline depth of the HTTP request/response connection, actual uncompressed size of the data transmitted from the server's HTTP service, source jitter time (milliseconds), destination jitter time (milliseconds), source packet inter-arrival time, destination packet inter-arrival time, number of "syn" and "ack" in the TCP connection, interval between the SYN and SYN_ACK packets in the TCP connection, and interval between the SYN_ACK and ACK packets in the TCP connection;

taking the UNSW-NB15 data set as an example, an obtained piece of raw network traffic data is as follows: "59.166.0.0,1390,149.171.126.6, 53, udp, CON,0.001055,132,164,31,29,0,0, dns,500473.9375,621800.9375,2,2,0,0,0,0,66,82,0,0,0,0, 0,1421927414, 0.017,0.013,0,0,0,0,0,0,0,0, 0,3,7,1,3,1,1,1, 0";

this piece of raw network traffic data contains 47 features in total, which are, in order:

“srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,service,Sload,Dload,Spkts,Dpkts,swin,dwin,stcpb,dtcpb,smeansz,dmeansz,trans_depth,res_bdy_len,Sjit,Djit,Stime,Ltime,Sintpkt,Dintpkt,tcprtt,synack,ackdat,is_sm_ips_ports,ct_state_ttl,ct_flw_http_mthd,is_ftp_login,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm”；

wherein this piece of raw network traffic data contains the available data features of one network flow, comprising: source IP address, source port number, destination IP address, destination port number, transport protocol type, source-IP-to-destination-IP packet count, destination-IP-to-source-IP packet count, application layer protocol type, bits per second transmitted by the source IP, bits per second transmitted by the destination IP, and other features extracted from the protocol header, such as the source TCP advertised window size, destination TCP advertised window size, source TCP sequence number, destination TCP sequence number, mean size of the stream packets transmitted by the source, mean size of the stream packets transmitted by the destination, pipeline depth of the HTTP request/response connection, actual uncompressed size of the data transmitted from the server's HTTP service, source jitter time (milliseconds), destination jitter time (milliseconds), source packet inter-arrival time, destination packet inter-arrival time, number of "syn" and "ack" in the TCP connection, interval between the SYN and SYN_ACK packets in the TCP connection, and interval between the SYN_ACK and ACK packets in the TCP connection;

step 3) carrying out data cleaning on the acquired network flow characteristic data, carrying out attribute splitting on the cleaned network flow characteristic data, and splitting the network flow characteristic data into numerical data and non-numerical data;

specifically, as shown in fig. 3, step 3) specifically includes:

step 3-1) data cleaning is performed on the network traffic feature data, and records without labels are removed; the data cleaning process comprises: record normalization, missing value processing and NaN value processing;

step 3-2) attribute splitting is carried out on the cleaned network flow characteristic data, and the network flow characteristic data are split into numerical data and non-numerical data according to different characteristics of each characteristic attribute;
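
Steps 3-1) and 3-2) can be sketched as below, using the 47 column names of the UNSW-NB15 example and the five non-numerical attributes named in this embodiment (source IP, destination IP, transport protocol, protocol state, network service). Dropping incomplete records is an assumption, since the text does not say whether missing values are removed or filled in; the helper names and the placeholder values are hypothetical.

```python
from typing import Dict, Tuple

COLUMNS = (
    "srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,"
    "dloss,service,Sload,Dload,Spkts,Dpkts,swin,dwin,stcpb,dtcpb,smeansz,"
    "dmeansz,trans_depth,res_bdy_len,Sjit,Djit,Stime,Ltime,Sintpkt,Dintpkt,"
    "tcprtt,synack,ackdat,is_sm_ips_ports,ct_state_ttl,ct_flw_http_mthd,"
    "is_ftp_login,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ltm,"
    "ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm"
).split(",")

NON_NUMERIC = ("srcip", "dstip", "proto", "state", "service")


def is_complete(record: Dict[str, str]) -> bool:
    """True when every expected column is present and neither empty nor NaN."""
    for name in COLUMNS:
        value = record.get(name)
        if value is None or str(value).strip().lower() in ("", "nan"):
            return False
    return True


def split_attributes(record: Dict[str, str]) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Split one cleaned record into numerical and non-numerical parts."""
    non_numeric = {k: record[k] for k in NON_NUMERIC}
    numeric = {k: float(record[k]) for k in COLUMNS if k not in NON_NUMERIC}
    return numeric, non_numeric


# Example record: the first fields follow the sample above, the rest are placeholders.
record = {name: "0" for name in COLUMNS}
record.update({"srcip": "59.166.0.0", "sport": "1390", "dstip": "149.171.126.6",
               "dsport": "53", "proto": "udp", "state": "CON",
               "dur": "0.001055", "sbytes": "132", "dbytes": "164",
               "service": "dns"})

if is_complete(record):
    numeric, non_numeric = split_attributes(record)
    print(len(numeric), non_numeric)
```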

step 4) as shown in fig. 3, inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding, and acquiring encoded non-numerical data;

specifically, the non-numerical data to be encoded comprises 5 features: the source IP address, the destination IP address, the transport protocol type, the protocol state type and the network service type;

specifically, non-numerical data is subjected to one-hot encoding, the characteristic dimensionality of the data subjected to one-hot encoding is 294, a sparse self-encoder is constructed, and the network structure of the self-encoder is shown in the following table:

the dimension of the coding layer is set to 20; the constructed sparse autoencoder is trained with the one-hot-encoded non-numerical data; a cross-entropy loss function with an added KL-divergence sparsity penalty term is used as the loss function; the Adam optimization algorithm is used as the training optimizer; training converges after about 20 rounds, and the encoder is then saved; the input is the one-hot-encoded non-numerical feature set; the output is the encoding part of the sparse autoencoder, namely:

the step 4) specifically comprises the following steps:

step 4-1) dividing according to the attribute label set, and extracting a non-numerical characteristic set from non-numerical data;

step 4-2) carrying out one-hot coding on the non-numerical characteristic set extracted in the step 4-1);

step 4-3) constructing and training a sparse self-encoder, taking the one-hot-encoded non-numerical feature set as the input of the sparse self-encoder, the output being the encoder extracted from the sparse self-encoder;

step 4-4) drawing on the Word2Vec algorithm in natural language processing, a TCPIP2Vec algorithm based on the sparse self-encoder is adopted: the encoder extracted in step 4-3) re-encodes the one-hot codes of the non-numerical feature set to obtain the encoded non-numerical data, namely the re-encoded non-numerical feature set, which supports numerical similarity calculation, such as calculating the similarity of IP addresses, while the feature dimension of the data attributes is greatly reduced compared with the one-hot codes.
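By way of non-limiting illustration, the one-hot encoding of step 4-2) can be sketched as follows. The field names (`srcip`, `dstip`, `proto`, `state`, `service`) and the helper functions are hypothetical and chosen only for this example; the sparse self-encoder that subsequently compresses these vectors is not shown.

```python
def build_vocabulary(records, fields):
    """Assign one column index to every (field, value) pair seen in the data."""
    vocab = {}
    for field in fields:
        for value in sorted({rec[field] for rec in records}):
            vocab[(field, value)] = len(vocab)
    return vocab


def one_hot(record, fields, vocab):
    """Return a 0/1 vector with exactly one active column per field."""
    vec = [0] * len(vocab)
    for field in fields:
        vec[vocab[(field, record[field])]] = 1
    return vec


# Hypothetical names for the 5 non-numerical features.
FIELDS = ["srcip", "dstip", "proto", "state", "service"]
records = [
    {"srcip": "10.0.0.1", "dstip": "10.0.0.9", "proto": "tcp",
     "state": "FIN", "service": "http"},
    {"srcip": "10.0.0.2", "dstip": "10.0.0.9", "proto": "udp",
     "state": "CON", "service": "dns"},
]
vocab = build_vocabulary(records, FIELDS)
encoded = [one_hot(rec, FIELDS, vocab) for rec in records]
```

The vector length equals the number of distinct (field, value) pairs, which is why the one-hot dimension grows quickly with the number of distinct IP addresses and motivates the subsequent re-encoding to a lower dimension.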

Wherein, in the step 4-3), constructing and training the sparse self-encoder specifically comprises:

firstly, a symmetrical neural network H_W,b with parameters W, b is constructed as the sparse autoencoder:

H_W,b＝g(f(X))

wherein f (X) is a coding function; g (X) is a decoding function; the two functions are approximated by constructing a neural network, wherein all weight parameters of the neural network are W, and all deviation parameters are b;

to achieve the effect of sparse coding, the neural network H_W,b is trained with a cross entropy loss function; wherein the cross entropy loss function is:

wherein J_S(W, b) is the cross entropy loss function, which is selected because the input is the one-hot-encoded non-numerical feature set;

ρ̂_j is the average activation value of the j-th hidden unit of the encoding layer of the sparse self-encoder; wherein ρ̂_j is calculated as follows:

ρ̂_j＝(1/n)·Σ_{i=1..n} f_j(xⁱ)

wherein n is the number of samples in a training batch; f(xⁱ) is the coding function and f_j(xⁱ) is its j-th component; xⁱ is the i-th network traffic data sample;

KL is the KL divergence, a measure of the similarity between two probability distributions, and its calculation formula is as follows:

where ρ(t) is the sparse parameter function, and ρ̂_j is the average activation value of the j-th hidden unit of the encoding layer of the sparse self-encoder;

the sparse self-encoder is trained with the cross entropy loss function; the input of the sparse self-encoder is the one-hot-encoded non-numerical feature set, recorded as:

X＝[x¹,…xⁿ]

wherein the feature dimension is n; X is the one-hot-encoded non-numerical feature set; xⁿ is the n-th component of the one-hot-encoded non-numerical feature set;

the output of the sparse autoencoder is an encoder extracted from the sparse autoencoder:

in addition to the KL divergence penalty, other sparse penalty modes exist, such as an absolute value penalty or an L₁ norm penalty. The sparse self-encoder is trained in the same way as an existing neural network: the gradient is calculated and back-propagated, specifically using the common Adam optimization algorithm or classical stochastic gradient descent.
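By way of non-limiting illustration, the KL divergence sparse penalty can be sketched as follows. The sketch assumes the commonly used form of the penalty, in which the target sparsity and each average activation are treated as Bernoulli parameters; the formula of the present text is not reproduced here, and the names `rho`, `beta` and the default values are assumptions introduced for this example.

```python
import math


def average_activations(activations):
    """activations: one row per sample, each row holding the hidden-unit
    outputs of the encoding layer (values strictly between 0 and 1)."""
    n = len(activations)
    width = len(activations[0])
    return [sum(row[j] for row in activations) / n for j in range(width)]


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def sparsity_penalty(activations, rho=0.05, beta=1.0):
    """Weighted sum of the KL terms over all hidden units (assumed form)."""
    return beta * sum(kl_bernoulli(rho, r)
                      for r in average_activations(activations))


# Two samples, two hidden units: unit 0 is already sparse, unit 1 is not.
batch = [[0.05, 0.5], [0.05, 0.5]]
penalty = sparsity_penalty(batch)
```

In such a scheme the penalty would be added to the cross entropy reconstruction loss, so that hidden units whose average activation departs from the target sparsity are penalized during training.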

Step 5) inputting the numerical data into a pre-established extreme random tree model, performing descending order arrangement and screening on the importance of the numerical data, and acquiring the screened numerical data as shown in fig. 3;

wherein the numerical data has 42 numerical features in total, and the numerical data includes: a source port number, a destination port number, a total duration of the record, a number of bytes processed from source IP to destination IP, a number of bytes processed from destination IP to source IP, a lifetime from source IP to destination IP, a lifetime from destination IP to source IP, a number of source packets retransmitted or dropped, a number of destination packets retransmitted or dropped, a number of bits transmitted per second by the source IP, a number of bits transmitted per second by the destination IP, a number of packets transmitted from source IP to destination IP, a number of packets transmitted from destination IP to source IP, a value of the source TCP advertisement window size, a value of the destination TCP advertisement window size, a sequence number of the source TCP, a sequence number of the destination TCP, an average of the stream packet sizes transmitted by the source, an average of the stream packet sizes transmitted by the destination, a pipeline depth of http request/response connections, an uncompressed actual size of the data transmitted from a server's http service, a source jitter time (milliseconds), a target jitter time (milliseconds), a time of start of recording, a time of last recording, a source packet interval arrival time, a target packet interval arrival time, a number of "syn" and "ack" in a TCP connection, an interval time between syn and syn_ack in a TCP connection, an interval time between syn_ack and ack packets, a number of streams with GET and POST methods in the http service type, and a number of streams with commands in an ftp session.

Specifically, apart from the non-numerical data, the remaining numerical data contains a large amount of redundant information. Based on the extreme random tree embedding method, a decision tree algorithm is adopted: the numerical data comprising 42 numerical features is cross-validated, the information gain ratio is set as the numerical feature selection criterion, and the number of estimators found most appropriate is 50; after the extreme random tree model is established, the screened numerical data is output.

After the numerical features are sorted in descending order of importance, the five top-ranked attribute features are selected in combination with the subsequent anomaly detection process, namely: the lifetime from source IP to destination IP, the lifetime from destination IP to source IP, the specific range of the source IP/destination IP lifetime for each state, the number of bits transmitted per second by the destination, and the value of the destination TCP advertisement window size.

The step 5) specifically comprises the following steps:

step 5-1) dividing according to the attribute numbers, performing attribute splitting on the numerical data, and acquiring a split numerical characteristic set;

step 5-2) inputting the split numerical characteristic set into a pre-established extreme random tree model, and arranging each numerical characteristic in the split numerical characteristic set in a descending order from large to small according to importance to obtain a sorted numerical characteristic set;

step 5-3) screening the sorted numerical feature set according to a preset threshold value, retaining each numerical feature whose importance factor is larger than the preset threshold value, and recording the retained features as the screened numerical data.
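By way of non-limiting illustration, the sorting of step 5-2) and the threshold screening of step 5-3) can be sketched as follows. The importance scores, feature names and threshold value below are invented for the example; in practice such scores might be taken from a fitted tree ensemble, for instance the `feature_importances_` attribute of scikit-learn's `ExtraTreesClassifier`.

```python
def rank_and_screen(importances, threshold):
    """Sort features by importance in descending order, then keep those
    whose importance is strictly larger than the threshold."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


# Hypothetical importance scores for four numerical features.
scores = {"sttl": 0.31, "dttl": 0.22, "dload": 0.08, "sjit": 0.01}
selected = rank_and_screen(scores, threshold=0.05)
```

Features falling at or below the threshold (here `sjit`) are dropped, and the remaining features keep their descending order of importance.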

The step 5) further comprises the following steps: adopting a recursive feature elimination (RFE) algorithm, deleting one numerical feature at a time from the screened numerical data, normalizing the remaining screened numerical data together with the re-encoded non-numerical data, and performing anomaly detection to obtain a detection result; and comparing this detection result with the previous detection result obtained without deleting the numerical feature, so as to check whether the two detection results are consistent and thereby verify whether the previous detection result is correct.

The step 5) further comprises the following steps: carrying out category splitting on the cleaned network flow characteristic data, and recording a corresponding category number characteristic set; wherein,

if the split network flow characteristic data is provided with a known data label, classifying according to the known class label, recording the data number of the corresponding class, and recording as a known numerical value characteristic set;

if the split network flow characteristic data is provided with an Unknown data label, classifying the split network flow characteristic data into Unknown, recording the data number of the corresponding category as well, and recording as an Unknown numerical value characteristic set;

specifically, in the UNSW_NB15 data, in addition to normal data, there are 9 common attack types, which are respectively:

Fuzzers, attack behavior that attempts to halt a program or network by feeding it randomly generated data

Analysis, including different port scanning attacks, spam and html file penetration

Backdoors, a technique for silently bypassing system security mechanisms to access computers and their data

DoS, an attack that makes network resources unavailable to users by temporarily disrupting or suspending the services of hosts connected to the internet

Exploits, in which an attacker knows of a security problem in an operating system or software and uses the vulnerability to attack

Generic, an attack technique that works against block ciphers (with a given block and key size) regardless of the structure of the block cipher

Reconnaissance, containing all attacks that can simulate the collection of information

Shellcode, a small segment of code used as the payload in exploiting a software vulnerability

Worms, in which the attacker replicates itself to propagate to other computers, using the computer network for self-propagation and relying on security failures on the target computer to gain access

And recording a data number set of a corresponding category, and establishing a set for the unknown attack type set so as to store an output result of unsupervised anomaly detection.

As shown in fig. 3, the data after class splitting, the encoded non-numerical data, and the sorted numerical data are normalized and archived.

In the step 5-2), the process of establishing the extreme random tree-based feature selection model specifically includes:

wherein, the construction process of each decision tree is as follows:

when each decision tree is constructed, k numerical features are randomly selected from the K numerical features, wherein K is the total dimension of the numerical features, and k is the feature dimension set for constructing each decision tree; the value of k is set to be less than K, and is generally set as

judging each numerical feature in the split numerical feature set by using a plurality of decision trees; specifically, each decision tree of the plurality of decision trees judges whether the original network traffic data corresponding to the numerical feature is normal data or abnormal data, the judgment results of the decision trees are summarized by a voting method, and the majority result is taken as the final judgment result; for example, if more decision trees judge that the original network traffic data corresponding to the numerical feature is normal data than judge that it is abnormal data, the final judgment result is that the original network traffic data is normal data;
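By way of non-limiting illustration, the random selection of a feature subset for each decision tree and the voting over the judgments of the trees can be sketched as follows. The subset size of 6 and the seed are arbitrary choices for the example, and the decision trees themselves are replaced by a fixed list of votes.

```python
import random
from collections import Counter


def sample_feature_subset(total_features, subset_size, rng):
    """Randomly pick subset_size distinct feature indices out of total_features."""
    return sorted(rng.sample(range(total_features), subset_size))


def majority_vote(votes):
    """Return the most frequent judgment; on a tie, the judgment that
    appeared first in the list wins."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the example is repeatable
subset = sample_feature_subset(total_features=42, subset_size=6, rng=rng)

# Stand-in for the judgments of three decision trees on one record.
final = majority_vote(["normal", "abnormal", "normal"])
```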

Step 6) recoding the split non-numerical data by using the encoder extracted in the step 4-3), wherein the feature dimension of the recoded data is 20; screening the feature attributes of the sorted numerical feature set obtained in the step 5-2), wherein the screened feature dimension is 17; merging the processed non-numerical and numerical data, recorded as X₀; then carrying out data normalization processing on X₀, and recording the processed data as X₁; wherein each row of X₀ and X₁ represents a piece of network flow data, and each column represents one feature of the processed network flow data; inputting X₁ into the first anomaly classifier or the second anomaly classifier, detecting whether each piece of network flow data is anomalous data, and specifically executing the following steps:

if the network flow data is not abnormal data, the output is normal.

As shown in fig. 4, the step 6) specifically includes:

step 6-1) carrying out normalization processing on the integrated data X₀, and recording the processed data as X₁;

Specifically, each column X₀[i] of X₀ is transformed by subtracting the mean value and dividing by the variance; the transformation function is as follows:

X₁[i]＝(X₀[i]-μ)/σ

wherein X₀[i] is the i-th column of the encoded network flow data; X₁[i] is the i-th column of the network flow data after normalization of X₀; μ is the average value of the column vector X₀[i]; σ is the variance of the column vector X₀[i];

in other embodiments, the data may be normalized to [0,1] by performing the following linear transformation on the re-encoded data:

X₁[i]＝(X₀[i]-min)/(max-min)

wherein X₀[i] is the i-th column of the encoded network flow data; X₁[i] is the i-th column of the network flow data after normalization of X₀; min is the minimum value of the column vector X₀[i]; max is the maximum value of the column vector X₀[i];

in other embodiments, X₀[i] is normalized by subtracting the median of the column vector X₀[i] and dividing by the interquartile range of the column vector X₀[i], thereby obtaining the processed network flow data X₁[i]; the principle is the same as that of the first two methods, but the method is more robust because some outliers are directly excluded from the calculation.
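By way of non-limiting illustration, the three column-wise normalization variants can be sketched as follows. The sketch assumes the conventional forms: the first variant divides by the population standard deviation, and the third uses the median and the interquartile range; the function names are introduced only for this example, and a constant column (zero spread) is not handled.

```python
import statistics


def z_score(column):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]


def min_max(column):
    """Linearly rescale the column to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def robust_scale(column):
    """Subtract the median and divide by the interquartile range, so that
    a few extreme values have little influence on the scaling."""
    q1, _, q3 = statistics.quantiles(column, n=4)
    med = statistics.median(column)
    return [(v - med) / (q3 - q1) for v in column]


column = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value is an outlier
scaled = {
    "z": z_score(column),
    "minmax": min_max(column),
    "robust": robust_scale(column),
}
```

With min-max scaling the single outlier compresses the other four values close to 0, which illustrates why a quartile-based variant may be preferred when outliers are present.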

Step 6-2) inputting network flow data into a pre-trained first abnormal classifier, and detecting whether the input network flow data is abnormal data; outputting a detection result;

specifically, firstly, a classification-based supervised anomaly detection method is adopted to detect whether the input network flow data is anomalous data;

if the network flow data is abnormal data, the abnormality is output, the abnormal data is input into the pre-trained first anomaly classifier, the attack type of the abnormal data is judged to be a known attack type, and the attack type of the abnormal data is output; the abnormal data is network flow data of a known attack type;

if the network flow data is judged not to be abnormal data by the supervised anomaly detection method, the network flow data is input into the pre-trained second anomaly classifier, and an unsupervised anomaly detection method is adopted to further detect whether the network flow data is an unknown anomaly;

if the network flow data is judged to be abnormal data by the unsupervised anomaly detection algorithm, the abnormal data is marked as an unknown attack type; the abnormal data is suspicious network flow data of an unknown attack type;

if the network flow data is not abnormal data, the output is normal.
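By way of non-limiting illustration, the two-stage decision flow of step 6-2) can be sketched as follows. The two classifiers are passed in as plain callables; the toy stand-ins below and their threshold values are hypothetical and serve only to exercise the three possible outcomes.

```python
def detect(record, supervised_classifier, unsupervised_detector):
    """Two-stage detection: known attacks first, then unknown anomalies.

    supervised_classifier(record) returns an attack-type name or None.
    unsupervised_detector(record) returns True if the record looks anomalous.
    """
    attack_type = supervised_classifier(record)
    if attack_type is not None:
        return ("known_attack", attack_type)
    if unsupervised_detector(record):
        return ("unknown_attack", None)
    return ("normal", None)


# Toy stand-ins for the first and second anomaly classifiers.
def toy_supervised(record):
    return "DoS" if record["pkts_per_s"] > 1000 else None


def toy_unsupervised(record):
    return record["recon_error"] > 0.5


results = [
    detect({"pkts_per_s": 5000, "recon_error": 0.1}, toy_supervised, toy_unsupervised),
    detect({"pkts_per_s": 10, "recon_error": 0.9}, toy_supervised, toy_unsupervised),
    detect({"pkts_per_s": 10, "recon_error": 0.1}, toy_supervised, toy_unsupervised),
]
```

The unsupervised detector is only consulted for records that the supervised stage has not flagged, which matches the order of the steps described above.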

Wherein the establishing and training of the first anomaly classifier comprises:

adopting an AdaBoost supervision and classification algorithm to construct an anomaly detection function:

training a first anomaly classifier according to an anomaly detection function;

the training process is as follows:

initializing the weight vector of the input data: W₁＝(w₁₁,w₁₂,…w_1n), w_1i＝1/n, wherein m＝1 denotes the weak classifier currently being trained; on the basis of the weight vector W_m, training the weak classifier G_m(x) with the objective of minimizing the classification error rate, and calculating the classifier weight coefficient according to the classification error rate; then judging whether m is smaller than M; if m<M, updating the weight vector of the next step according to the training result of the previous step, training the anomaly classifier function G_m+1(x) of the (m+1)-th step with the objective of minimizing the classification error rate, and then updating m to m+1; otherwise, constructing the anomaly detection function from the trained weak classifiers;

wherein the weight vector update formula is as follows:

wherein m represents the current step and m+1 represents the next step;

the calculation formula of the classification error rate is as follows:

wherein I(t) is an indicator function;

the classifier weight coefficient calculation formula is as follows:

wherein Z_m is a normalization coefficient, and the specific calculation formula is as follows:

training the first anomaly classifier according to the anomaly detection function, wherein the input of the first anomaly classifier is {(x₁,y₁),(x₂,y₂),(x₃,y₃),…(x_n,y_n)}, in which x_i is each piece of processed network flow data, namely the i-th row of the network flow data X₁ normalized in the step 6-1); n is the number of rows of the matrix X₁ and represents the total number of pieces of network flow data; x_i∈Rⁿ; y_i is the corresponding label; y_i∈{0,1}, wherein 0 represents normal and 1 represents abnormal; the output is the judgment result of the trained first anomaly classifier for each piece of network flow characteristic data.

Wherein G_m(x) is a decision tree weak classifier obtained by step-by-step learning, and the specific learning process is as follows:

step 6-2-1) initializing the weight vector of the input data according to formula (1):

W₁＝(w₁₁,w₁₂,…w_1n), w_1i＝1/n, i＝1,2,…n (1)

wherein m is initialized to 1; W_m is the weight vector, and each component w_mi is the weight of the network flow data represented by the corresponding row of the network flow training data X₁;

step 6-2-2) on the basis of the weight vector W_m, training G_m(x) on the training data with the objective of minimizing the classification error rate; wherein the classification error rate function is shown in formula (2):

e_m＝Σ_{i=1..n} w_mi·I(G_m(x_i)≠y_i) (2)

wherein e_m is the classification error rate; w_mi is each component of the weight vector; I(t) is an indicator function; G_m(x_i) is the discrimination result of the m-th classifier on the i-th network flow data; y_i is the label of the i-th network flow data;

then, according to formula (3), the classifier weight coefficient is calculated:

α_m＝(1/2)·ln((1-e_m)/e_m) (3)

wherein α_m is the classifier weight coefficient;

step 6-2-3) then judging whether m is smaller than M; if m<M, updating the weight vector of the next step according to the training result of the previous step, training the anomaly classifier function G_m+1(x) of the (m+1)-th step with the objective of minimizing the classification error rate, and then updating m to m+1; otherwise, constructing the anomaly detection function from the trained weak classifiers; wherein, in particular,

according to the classifier weight coefficient α_m, updating the weight vector according to formula (4):

w_m+1,i＝(w_mi/Z_m)·exp(-α_m·y_i·G_m(x_i)) (4)

wherein w_m+1,i is each component of the weight vector W_m+1 of the (m+1)-th step; Z_m is the normalization coefficient; w_mi is each component of the weight vector W_m of the m-th step; y_i is the label of the i-th network flow data x_i; G_m(x) is the discrimination function of the m-th classifier;

wherein Z_m is calculated according to formula (5):

Z_m＝Σ_{i=1..n} w_mi·exp(-α_m·y_i·G_m(x_i)) (5)

Step 6-2-4) linearly combining the plurality of decision tree weak classifiers G_m(x) according to the classifier weight coefficients into the first anomaly classifier, namely a strong learner:

wherein G_m(x) is a decision tree weak classifier; α_m is the classifier weight coefficient corresponding to G_m(x);
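By way of non-limiting illustration, steps 6-2-1) to 6-2-4) can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the weak classifiers are one-level decision trees (threshold stumps), the labels 0/1 are mapped internally to -1/+1 as in the textbook AdaBoost formulation, the error rate is clamped away from 0 and 1 to keep the logarithm finite, and the combined output is thresholded at zero. All function names are introduced only for this example.

```python
import math


def train_stump(X, y, w):
    """Find the (feature, threshold, polarity) stump with the lowest
    weighted classification error. Labels y are in {-1, +1}."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[f] >= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] >= thr else -pol


def adaboost_fit(X, y01, M=5):
    y = [1 if v == 1 else -1 for v in y01]   # assumed 0/1 -> -1/+1 mapping
    n = len(X)
    w = [1.0 / n] * n                        # uniform initial weights
    ensemble = []
    for _ in range(M):
        stump = train_stump(X, y, w)
        e = min(max(stump[0], 1e-10), 1 - 1e-10)   # clamp the error rate
        alpha = 0.5 * math.log((1 - e) / e)        # classifier weight
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        z = sum(w)                                 # normalization coefficient
        w = [wi / z for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score > 0 else 0             # 1 = abnormal, 0 = normal


# Toy data: the first feature separates the classes, the second is noise.
X = [[0.1, 5], [0.2, 1], [0.3, 4], [0.8, 2], [0.9, 6], [1.0, 3]]
y = [0, 0, 0, 1, 1, 1]
model = adaboost_fit(X, y, M=3)
predictions = [adaboost_predict(model, row) for row in X]
```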

wherein the establishing and training of the second anomaly classifier comprises:

an unsupervised anomaly detection function is constructed by reconstructing an error function by using an auto-encoder:

L(x_p,x_r)＝||x_p-x_r||²＝||x_p-F(G(x_p))||²

wherein L(x_p,x_r) is the reconstruction error function, F and G denote the decoding and encoding functions of the self-encoder, respectively, x_p is the original network traffic data to be detected, and x_r is the data reconstructed from x_p by the self-encoder. Specifically,

step 6-3-1) firstly, extracting the network flow characteristic data marked as normal from the network flow characteristic data calibrated in the previous data division stage, recorded as X_normal;

Step 6-3-2) taking X_normal as the self-encoder training data, and approximating the encoding function G and the decoding function F with multilayer perceptrons to construct the self-encoder;

step 6-3-3) taking the reconstruction error function as the objective function, training with the Adam optimization algorithm, and after training calculating the error threshold error by the following formula:

error＝(1/n')·Σ_{x∈X_normal} L(x,F(G(x)))

wherein X_normal is the network traffic characteristic data marked as normal; n' is the number of pieces of network traffic characteristic data used for self-encoder training, i.e., the number of rows of X_normal; x is a row of X_normal; F and G are the decoding function and the encoding function of the trained self-encoder, respectively; L is the calculated reconstruction error corresponding to x;

step 6-3-4) after the self-encoder is trained and the error threshold error is calculated, the data calibrated as normal by supervised anomaly detection is recorded as x_p and input into the self-encoder to obtain the reconstructed data x_r, and the reconstruction error L(x_p,x_r) is calculated;

Step 6-3-5) comparing the reconstruction error L(x_p,x_r) with error: if the reconstruction error is larger than 3 times error, the piece of data x_p is marked as abnormal and an unknown-type anomaly is output; if the reconstruction error is within 3 times error, the data is judged to be normal and normal is output.
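By way of non-limiting illustration, steps 6-3-3) to 6-3-5) can be sketched as follows. The sketch assumes that the error threshold is the mean reconstruction error over the normal training data; the trained self-encoder is replaced by a hypothetical stand-in function `toy_reconstruct`, which is not part of the described method.

```python
def reconstruction_error(x, x_rec):
    """Squared Euclidean distance between a record and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))


def error_threshold(normal_rows, reconstruct):
    """Mean reconstruction error over the normal data (assumed form)."""
    errors = [reconstruction_error(x, reconstruct(x)) for x in normal_rows]
    return sum(errors) / len(errors)


def is_unknown_anomaly(x, reconstruct, threshold, factor=3.0):
    """Flag the record if its reconstruction error exceeds factor * threshold."""
    return reconstruction_error(x, reconstruct(x)) > factor * threshold


def toy_reconstruct(x):
    """Hypothetical stand-in for F(G(x)): it can only reproduce values in
    the range seen during training, and does so slightly imperfectly."""
    return [0.9 * min(max(v, 0.0), 1.0) for v in x]


x_normal = [[0.5, 0.5], [1.0, 0.0]]
threshold = error_threshold(x_normal, toy_reconstruct)

flag_far = is_unknown_anomaly([5.0, 5.0], toy_reconstruct, threshold)
flag_near = is_unknown_anomaly([0.5, 0.5], toy_reconstruct, threshold)
```

A record far outside the range of the normal data is reconstructed poorly and is therefore flagged, while a record resembling the training data stays within 3 times the threshold.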

Step 7) displaying the detection result of whether the network flow data is abnormal data.

Specifically, if the detection result is that the network flow data is network flow data of a known attack type, an alarm is issued;

if the detection result is that the network flow data is network flow data of an unknown attack type, the data is stored in the database as a suspicious attack, and professional personnel are informed to analyze it manually;

if the suspicious attack data is determined to be a new abnormal data type after being analyzed by the professional, adding the suspicious attack data into a training set of a first abnormal classifier; if the analyzed data is normal data, adding the normal data into a training set of a second abnormal classifier;

counting detection results periodically, wherein the detection results comprise: counting known attack types and unknown attack types of abnormal data appearing in the whole network environment, detecting the accuracy rate, the recall rate and the false alarm rate of the abnormal data, and selecting whether to retrain and update the algorithm in the model; the accuracy, recall rate and false alarm rate of the detection result are calculated according to the test result of the test data with the attack type label, namely:

wherein TP, TN, FP, FN represent statistical counts over the network flow characteristic data obtained with the established network flow data analysis method and system; specifically,

TP represents the number of the network flow characteristic data with normal detection result and normal system prediction;

TN represents the number of the network flow characteristic data with abnormal detection result and abnormal system prediction;

FP represents the number of the network flow characteristic data with abnormal detection results and normal system prediction;

FN represents the number of the network flow characteristic data with normal detection result and abnormal system prediction;

and determining whether to retrain and update the system according to the statistical result and the number of the suspicious attacks.
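By way of non-limiting illustration, the three statistics can be computed from the four counts as follows. The sketch assumes the conventional definitions of accuracy, recall and false alarm rate in terms of TP, TN, FP and FN; the counts used in the example are invented.

```python
def detection_metrics(tp, tn, fp, fn):
    """Conventional accuracy, recall and false alarm rate from four counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "false_alarm_rate": fp / (fp + tn),
    }


# Invented counts for a test set of 200 labelled records.
metrics = detection_metrics(tp=90, tn=80, fp=20, fn=10)
```

Tracking these values over successive statistical periods gives one possible basis for deciding when the classifiers should be retrained.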

As shown in fig. 1, the present invention further provides a network traffic data analysis system based on a sparse autoencoder and an extreme random tree, the system comprising: an original data acquisition module, a data preprocessing module, a data feature extraction module, a data anomaly detection module and a data result display module; wherein,

if the network flow data is not abnormal data, the output is normal.

The method of the invention addresses network flow features such as IP addresses, ports and TCP/IP protocols; drawing on the Word2Vec algorithm from natural language processing, it provides a TCP/IP numeralization algorithm based on a sparse self-encoder that maps features such as IP addresses and TCP/IP protocols to an n-dimensional real number space, thereby supporting the various distance-based or density-based supervised or unsupervised algorithms in the subsequent data anomaly detection module. Compared with the traditional one-hot coding, the accuracy of the system is greatly improved and the false alarm rate of the system is reduced.

In the data feature extraction module, various feature selection means are integrated; for the UNSW_NB15 data, besides the TCP/IP numeralization algorithm based on the sparse self-encoder, a feature processing means based on the extreme random tree is also used, with selectable criteria including information entropy, information gain ratio, Gini index and the like, so that the system dimension is greatly reduced and the system operation efficiency is improved while the important information of the original data is retained.

In the data anomaly detection module, supervised learning and unsupervised learning means are integrated, and the basic idea is as follows: firstly, a traffic data portrait model is built for the normal behavior of network traffic; then, for known attacks or abnormal network flow behaviors, anomaly detection classifiers for the different attack types are constructed by supervised learning; for the remaining unknown data, unsupervised learning is used to continuously expand and refine the normal behavior model of network traffic and the classifiers of abnormal behavior. Through the double detection of supervised anomaly detection and unsupervised anomaly detection, the accuracy and the false alarm rate are guaranteed, and the capability of finding unknown attacks is retained.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for detecting the abnormity of network flow data is characterized by comprising the following steps:

if the network flow data is not abnormal data, the output is normal.

2. The method according to claim 1, wherein the capturing of the original network traffic data in real time is processed to obtain network flow data; the method specifically comprises the following steps:

capturing original network flow data in real time;

3. The method according to claim 2, wherein the non-numerical data is input to a pre-trained sparse self-encoder for re-encoding, and encoded non-numerical data is obtained; the method specifically comprises the following steps:

4. The method according to claim 2, wherein the building and training of the sparse self-encoder specifically comprises:

establishing a sparse autoencoder, and, based on the TCPIP2Vec algorithm of the sparse autoencoder, training the sparse autoencoder with a cross entropy loss function J_S(W, b) to which a KL divergence sparse penalty term is added;

wherein,

the calculation formula is as follows:

wherein ρ(t) is a sparse parameter function;

5. The method according to claim 2, wherein the numerical data is input into a pre-established extreme random tree model, and the importance of the numerical data is sorted and screened in a descending order to obtain screened numerical data; the method specifically comprises the following steps:

6. The method according to claim 5, wherein the process of establishing the extreme stochastic tree-based feature selection model specifically comprises:

wherein, the construction process of each decision tree is as follows:

7. The method of claim 2, wherein the available data features are extracted from the acquired raw network traffic data to obtain network traffic feature data; the method specifically comprises the following steps:

8. The method of claim 1, wherein the building and training of the first anomaly classifier specifically comprises:

training a first anomaly classifier according to the anomaly detection function, wherein the input of the first anomaly classifier is {(x₁,y₁),(x₂,y₂),(x₃,y₃),…(x_n,y_n)}, in which x_i is each piece of processed network flow data; x_i∈Rⁿ; y_i is the corresponding label; y_i∈{0,1}, with 0 indicating normal and 1 indicating abnormal; the output is the judgment result of the trained first anomaly classifier for each piece of network flow characteristic data.

9. The method according to claim 1, wherein the building and training of the second anomaly classifier specifically comprises:

L(x_p,x_r)＝||x_p-x_r||²＝||x_p-F(G(x_p))||²

10. A network traffic data analysis system, the system comprising: an original data acquisition module, a data preprocessing module, a data feature extraction module, a data anomaly detection module and a data result display module; wherein,

if the network flow data is not abnormal data, outputting the data normally;