CN112398779B - Network traffic data analysis method and system - Google Patents

Network traffic data analysis method and system

Info

Publication number
CN112398779B
Authority
CN
China
Prior art keywords: data, numerical, abnormal, network flow, characteristic
Prior art date
Legal status
Active
Application number
CN201910739001.0A
Other languages
Chinese (zh)
Other versions
CN112398779A
Inventor
方少峰
孙鹏科
闫振中
郑岩
马福利
佟继周
Current Assignee
National Space Science Center of CAS
Original Assignee
National Space Science Center of CAS
Priority date
Filing date
Publication date
Application filed by National Space Science Center of CAS
Priority to CN201910739001.0A
Publication of CN112398779A
Application granted
Publication of CN112398779B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network security for detecting or protecting against malicious traffic
    • H04L63/1408: Detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network

Abstract

The invention belongs to the technical field of network traffic data analysis, and particularly relates to an anomaly detection method of network traffic data, which comprises the following steps: processing the original network flow data captured in real time to obtain network flow data; if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data; if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal; if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data to be the unknown attack type; if the network flow data is not abnormal data, the output is normal.

Description

Network traffic data analysis method and system
Technical Field
The invention belongs to the technical field of anomaly detection technology and network traffic data analysis based on machine learning and big data, and particularly relates to a network traffic data analysis method and system, in particular to a network traffic data analysis method and system based on a sparse self-encoder and an extreme random tree.
Background
In recent decades, with the rapid development of the Internet, from consumer interconnection and industrial interconnection to the interconnection of everything, the ways in which people communicate and consume have been reshaped again and again. Network security problems are becoming more and more troublesome: network attacks emerge endlessly, and traditional defense means are often helpless when facing new attack modes. From the basic data link layer to the network and transport layers, and up to the higher presentation and application layers, attack techniques are complex and constantly evolving, and a single countermeasure is often not enough. Take for example distributed denial of service (DDoS) attacks, which keep growing in scale: they include both traditional network-layer DDoS attacks that exploit characteristics of the TCP/IP protocol and application-layer DDoS attacks built on top of them, which at the application layer can be further classified into DNS-Flood attacks, slow connection attacks and CC attacks.
Various network security technologies have been developed to ensure network security, among which network traffic analysis and intrusion detection play a very important role: they detect anomalies in network traffic and then raise early warnings. There are many current methods for network flow data analysis and intrusion detection, mainly including traditional methods based on feature (signature) libraries, statistical methods based on probabilities and rules, supervised learning methods based on classification, unsupervised learning based on clustering, and derived techniques based on outlier detection. On these bases researchers have built many network anomaly detection systems with good experimental results, but when such systems are put into practical use, many problems are often found. In short, in the field of anomaly detection, anomalies can be classified into point anomalies, contextual (condition) anomalies and collective (pattern) anomalies; different detection systems are effective for most point anomalies, but for contextual and collective anomalies a trade-off between accuracy and false alarm rate often has to be made.
For example, an anomaly detection system based on a conventional feature library needs to frequently update the feature library, and is completely unable to detect unknown anomalies or simply encrypted data streams, and the cost for maintaining and updating the feature library is very high; when the network flow becomes complex, the abnormal network flow data is often judged to be normal by a statistical method based on probability and rules, and is easy to be utilized by attackers;
unsupervised detection systems based on clustering and outliers often face the problems of extremely low algorithm training speed, difficult parameter selection and high false alarm rate when the data scale and feature dimensions are high.
The interconnection of networks and the arrival of the big data era have made the scale of network data grow exponentially, with a huge amount of flow data generated every day. Although most of this data is normal traffic, the scale and variety of abnormal traffic is also continuously increasing, which brings new opportunities and challenges to the network security field. In the big data era, as the features of network flow data become richer and richer, existing anomaly detection systems built on the traditional KDD-CUP99 or NSL-KDD data sets suffer from poor reliability, low accuracy and high false alarm rates. Whether for network intrusion analysis or for analysis of threats inside an enterprise, existing methods face various problems; a detection and defense system depends heavily on efficient analysis of network flow data, so establishing effective feature preprocessing, feature selection and feature extraction tools for the new data characteristics is a very important problem.
Disclosure of Invention
The invention aims to solve the problems of poor generalization performance, low detection rate and high false alarm rate of the conventional network traffic data analysis method in the network security anomaly detection technology, and provides a network traffic data analysis method based on a sparse self-encoder and an extreme random tree, which can select corresponding automatic encoding means for different types of features from network flow data, not only can play a role in reducing feature dimensionality, but also can effectively calculate the distance of features such as IP addresses, protocols and the like, thereby providing a foundation for the conventional anomaly detection technology based on distance or density; then, for the processed numerical characteristics, a characteristic selection method based on an extreme random tree is adopted, so that the dimension can be reduced, and the selected characteristics can still have practical significance, thereby providing possibility for subsequent analysis; the extracted feature set can be combined with a supervised classification technology and an unsupervised outlier anomaly detection technology, experiments show that the accuracy is effectively improved, the false alarm rate is greatly reduced, and the calculation speed is much faster due to the fact that data are recoded and feature engineering processing is carried out.
In order to achieve the above object, the present invention provides an anomaly detection method for network traffic data, including:
processing the original network flow data captured in real time to obtain network flow data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data as the unknown attack type;
if the network flow data is not abnormal data, the output is normal.
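For readability, the two-stage detection flow described above can be sketched in Python as follows; every function and variable name here (analyze_flow, preprocess, supervised_detector, first_classifier, unsupervised_detector) is an illustrative assumption rather than an identifier from the patent.

```python
# Minimal sketch of the two-stage detection flow; all names below are
# hypothetical placeholders, not identifiers defined by the patent.

def analyze_flow(raw_record, preprocess, supervised_detector,
                 first_classifier, unsupervised_detector):
    """Return 'normal', a known attack type, or 'unknown attack' for one flow."""
    flow = preprocess(raw_record)            # feature extraction, encoding, selection

    if supervised_detector(flow):            # stage 1: supervised anomaly check
        return first_classifier(flow)        # known attack type
    if unsupervised_detector(flow):          # stage 2: unsupervised anomaly check
        return "unknown attack"              # suspicious flow of an unseen type
    return "normal"
```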
As one improvement of the above technical solution, the processing of the original network traffic data captured in real time to obtain network flow data specifically comprises the following steps:
capturing original network flow data in real time;
extracting available data characteristics from the obtained original network flow data to obtain network flow characteristic data;
performing data cleaning and attribute splitting on the acquired network flow characteristic data, and splitting the acquired network flow characteristic data into numerical data and non-numerical data;
inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data;
inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
and carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data.
As one improvement of the above technical solution, the non-numerical data is input to a pre-trained sparse self-encoder for re-encoding, and encoded non-numerical data is obtained; the method specifically comprises the following steps:
dividing according to the attribute label set, performing attribute splitting on the non-numerical data, and acquiring a non-numerical feature set from the non-numerical data;
carrying out one-hot coding on the non-numerical characteristic set to obtain a one-hot coded non-numerical characteristic set, inputting it to a pre-trained sparse self-encoder, and obtaining the encoder extracted from the sparse self-encoder;
and adopting a TCPIP2Vec algorithm based on a sparse self-encoder to re-encode the one-hot code of the non-numerical characteristic set to obtain encoded non-numerical data.
As one of the improvements of the above technical solution, the establishing and training of the sparse autoencoder specifically includes:
establishing a sparse autoencoder and, based on the TCPIP2Vec algorithm of the sparse autoencoder, training it with a cross entropy loss function J_S(W, b) to which a KL divergence sparsity penalty term is added:

J_S(W, b) = J(W, b) + β·Σ_j KL(ρ ‖ ρ̂_j)

wherein J(W, b) is the cross entropy reconstruction loss and Σ_j KL(ρ ‖ ρ̂_j) is the KL divergence penalty applied to the coding function; ρ is the sparsity parameter and β is the regularization parameter; ρ̂_j is the average activation value of the j-th hidden unit of the coding layer, calculated as follows:

ρ̂_j = (1/n)·Σ_{i=1}^{n} f_j(x_i)

wherein n is the number of block samples set during training, f(x_i) is the coding function and x_i is the i-th sample;

KL is the KL divergence, a measure for comparing the similarity between two probability distributions, calculated as follows:

KL(ρ ‖ ρ̂_j) = ρ·log(ρ/ρ̂_j) + (1 − ρ)·log((1 − ρ)/(1 − ρ̂_j))

wherein ρ(t) denotes the sparse parameter function;
the input of the sparse self-encoder is a non-numerical characteristic set subjected to one-hot encoding; the output of the sparse self-encoder is a recoded non-numerical characteristic set, namely coded non-numerical data.
As one improvement of the above technical solution, the numerical data is input to a pre-established extreme random tree model, and the importance of the numerical data is sorted and screened in a descending order to obtain the screened numerical data; the method specifically comprises the following steps:
dividing according to the attribute numbers, and performing attribute splitting on the numerical data to obtain a split numerical characteristic set;
inputting the split numerical characteristic set into a pre-established extreme random tree model, and arranging each numerical characteristic in the split numerical characteristic set in a descending order from large to small according to importance to obtain a sorted numerical characteristic set;
and screening the sorted numerical feature set according to a preset threshold value, retaining the numerical features whose importance factors are greater than the preset threshold value, and recording them as the screened numerical data.
As an improvement of the above technical solution, the establishing process of the extreme random tree-based feature selection model specifically includes:
randomly selecting numerical characteristics in the split numerical characteristic set to construct a plurality of decision trees;
wherein, the construction process of each decision tree is as follows:
the importance factor of each numerical feature is obtained according to the following calculation formula:

G(D, A) = (H(D) − H(D | A)) / H_A(D)

wherein G(D, A) is the importance factor of the numerical characteristic A relative to the numerical characteristic set D to be divided, namely the information gain ratio; D is the data set to be divided; A is the currently selected numerical characteristic; H_A(D) is the information entropy obtained by taking the currently selected numerical characteristic A as a random variable; H(D) is the information entropy of set D with the data class as a random variable; H(D | A) is the conditional information entropy of the subsets obtained after the set D is divided using the feature A;
when each decision tree is constructed, k numerical features are randomly selected from the K numerical features, wherein K is the total dimension of the numerical features and k is the feature dimension set for constructing each decision tree; the value of k is set to be less than K, a common choice being k = √K;
When each decision tree is constructed, selecting the numerical characteristic with the largest information gain ratio G (D, A) from the k selected numerical characteristics, then constructing nodes and splitting;
when the nodes of the decision tree are split, randomly selecting an arbitrary number between the maximum value and the minimum value of the numerical characteristic, and recording the arbitrary number as a comparison value; when the numerical characteristic of the sample is greater than the comparison value, taking the sample as a left branch; when the numerical characteristic of the sample is smaller than the comparison value, the sample is taken as a right branch, and then the bifurcation value of the numerical characteristic of the sample is calculated; wherein, the sample is a split numerical characteristic set;
traversing the selected k numerical characteristics to construct a decision tree;
repeating the process of constructing the basic decision tree for N times to construct N decision trees; wherein, the number of the decision tree is determined by using a cross validation and grid search method;
judging each numerical characteristic in the split numerical characteristic set by utilizing a plurality of decision trees, specifically judging whether the original network flow data corresponding to the numerical characteristic is normal data or abnormal data by each decision tree of the plurality of decision trees, summarizing the judgment result of each decision tree by a voting method, and taking the result of the majority of the judgment result as the final judgment result; wherein, the judgment result is that the original network flow data corresponding to the numerical characteristic is normal data or abnormal data;
obtaining the importance factor of each numerical characteristic in the split numerical characteristic set according to the finally obtained judgment results and the above formula, sorting the numerical characteristics in descending order of importance, screening them according to a set threshold value, retaining the numerical characteristics whose importance factors are greater than the preset threshold value, and recording them as the screened numerical data;
the input of the extreme random tree model is a split numerical characteristic set, and the output of the extreme random tree model is screened numerical data.
As one improvement of the above technical solution, the method extracts available data features from the acquired original network traffic data to acquire network traffic feature data; the method specifically comprises the following steps:
extracting a first feature from the acquired raw network traffic data by using an Argus tool, wherein the first feature comprises: a source IP address, a source port number, a destination IP address, a destination port number, and a transport protocol type;
extracting second features from the acquired raw network traffic data using a Bro-IDS tool, the second features comprising: counting from a source IP to a target IP packet, counting from the target IP to the source IP packet, an application layer protocol type, a source IP transmission digit per second and a target IP transmission digit per second;
among the available data features are: the first feature, the second feature, and other features extracted from the protocol header file include: a value of a source TCP advertisement window size, a value of a target TCP advertisement window size, a sequence number of the source TCP, a sequence number of the target TCP, an average of a size of a stream packet transmitted by the source, an average of a size of a stream packet transmitted by the target, a pipe depth of an http request/response connection, an actual size of data transmitted from an http service of the server without compression, a source jitter time (millisecond), a target jitter time (millisecond), a source packet interval arrival time, a target packet interval arrival time, a number of "syn" and "ack" in the TCP connection, an interval time of syn and syn _ ack in the TCP connection, an interval time of syn _ ack packet and ack packet in the TCP connection.
As an improvement of the above technical solution, the establishing and training of the first anomaly classifier specifically includes:
adopting an AdaBoost supervised classification ensemble algorithm to construct an anomaly detection function:

G(x) = sign( Σ_{m=1}^{M} α_m·G_m(x) )

wherein G(x) is the first anomaly classifier; G_m(x) is a decision tree weak classifier, with m = 1, 2, …, 30; α_m is the classifier weight coefficient corresponding to G_m(x);

training the first anomaly classifier according to the anomaly detection function, wherein the input of the first anomaly classifier is {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_n, y_n)}, in which x_i is each piece of processed network flow data, x_i ∈ R^n; y_i is the corresponding label, y_i ∈ {0, 1}, where 0 denotes normal and 1 denotes abnormal; and the output is the discrimination result of the trained first anomaly classifier on each network flow feature.
As an improvement of the above technical solution, the establishing and training of the second anomaly classifier specifically includes:
an unsupervised anomaly detection function is constructed by reconstructing an error function using an auto-encoder:
L(x_p, x_r) = ||x_p − x_r||^2 = ||x_p − F(G(x_p))||^2

wherein L(x_p, x_r) is the reconstruction error function, F and G denote the decoding and encoding functions of the self-encoder respectively, x_p is the raw network traffic data to be detected, and x_r is the data reconstructed from x_p by the self-encoder;
training a second anomaly classifier according to the unsupervised anomaly detection function, and acquiring a reconstruction error threshold according to a reconstruction error function;
during detection, if the reconstruction error of one piece of network flow data is greater than the reconstruction error threshold value, the piece of network flow data is judged to be abnormal data;
if the reconstruction error of one piece of network flow data is less than or equal to the reconstruction error threshold value, judging that the data is normal data;
the input of the second anomaly classifier is the network flow data x_p that the first anomaly classifier has judged to be normal data; its output is the discrimination result of the trained second anomaly classifier on x_p.
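As a concrete illustration of the reconstruction-error criterion used by the second anomaly classifier, here is a minimal NumPy sketch; the encode/decode callables and the percentile-based threshold choice are assumptions for illustration, not details taken from the patent.

```python
import numpy as np

def reconstruction_errors(X, encode, decode):
    """L(x_p, x_r) = ||x_p - x_r||^2 per row; encode/decode stand in for the
    trained autoencoder's G and F."""
    X_rec = decode(encode(X))
    return np.sum((X - X_rec) ** 2, axis=1)

def detect_unknown_anomalies(X, encode, decode, threshold):
    """Mark rows whose reconstruction error exceeds the threshold as unknown attacks."""
    return reconstruction_errors(X, encode, decode) > threshold

# One plausible way to pick the threshold (an assumption, not the patent's rule):
# a high percentile of the errors measured on traffic judged normal so far, e.g.
# threshold = np.percentile(reconstruction_errors(X_normal, encode, decode), 99)
```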
Based on the network traffic data analysis method, the invention also provides a network traffic data analysis system, which comprises: an original data acquisition module, a data preprocessing module, a data feature extraction module, a data anomaly detection module and a data result display module; wherein:
the original data acquisition module is used for capturing original network flow data in real time;
the data preprocessing module is used for extracting available data characteristics from the acquired original network traffic data and acquiring network traffic characteristic data;
the data characteristic extraction module is used for carrying out data cleaning and attribute splitting on the acquired network flow characteristic data and splitting the acquired network flow characteristic data into numerical data and non-numerical data; inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data; inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
the data anomaly detection module is used for carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data; detecting whether the network flow data is abnormal data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data to be the unknown attack type;
if the network flow data is not abnormal data, the output is normal.
And the data result display module is used for displaying the detection result of detecting whether the network flow data is abnormal data.
Compared with the prior art, the invention has the beneficial effects that:
1. the method of the invention greatly reduces the false alarm rate of abnormal data, greatly reduces manual processing, after the processing of the technical scheme, the false alarm rate of most abnormal recognizers based on classification is reduced to below 10 percent, and because the data feature extraction module carries out effective IP recoding, feature extraction and feature selection on the data, the model has better robustness on the selection of the classifier and better robustness on the selection of the parameters of the classifier;
2. the method greatly improves the recall rate of the abnormal data, the abnormal data and the normal data are more obviously distinguished after being processed by a sparse self-encoder and an extreme random tree, and the recall rate is improved to more than 90 percent after model tuning;
3. the method of the invention improves the detection rate of abnormal data, and because the data feature extraction module is used in the training process, the traditional abnormal detection algorithm, such as a decision tree model, a Bayesian classifier model, a self-encoder model, a random forest model and the like, can directly obtain information related to the abnormality from effective features during detection, and the feature dimension of each piece of network flow data is reduced by one order of magnitude, thus greatly improving the detection efficiency.
Drawings
FIG. 1 is a schematic structural diagram of a network traffic data analysis system based on a sparse self-encoder and an extreme random tree according to the present invention;
fig. 2 is a detailed flowchart of step 2) of the method for detecting an anomaly of network traffic data according to the present invention;
FIG. 3 is a detailed flowchart of step 3) of the method for detecting anomaly of network traffic data according to the present invention;
fig. 4 is a specific flowchart of step 4) of the method for detecting the anomaly of the network traffic data according to the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
The invention provides an anomaly detection method of network traffic data, which comprises the following steps:
step 1), capturing original network flow data in real time;
specifically, an open source tool is adopted to capture original network flow data from a network environment in real time and store the original network flow data into a file in a PCAP format; in this embodiment, a TCPDUMP tool is mainly used to capture original network traffic data from a network environment in real time;
step 2) extracting available data characteristics from the obtained original network traffic data to obtain network traffic characteristic data;
specifically, as shown in fig. 2, an Argus tool is used to extract a first feature from the acquired raw network traffic data, where the first feature includes: a source IP address, a source port number, a destination IP address, a destination port number, and a transport protocol type;
extracting second features from the acquired raw network traffic data using a Bro-IDS tool, the second features comprising: counting from a source IP to a target IP packet, counting from the target IP to the source IP packet, an application layer protocol type, a source IP transmission digit per second and a target IP transmission digit per second;
and other characteristics extracted from the protocol header file, such as the value of the source TCP advertised window size, the value of the target TCP advertised window size, the sequence number of the source TCP, the sequence number of the target TCP, the average size of the stream packets transmitted by the source, the average size of the stream packets transmitted by the target, the pipe depth of the http request/response connection, the actual uncompressed size of the data transmitted from the server's http service, the source jitter time (milliseconds), the target jitter time (milliseconds), the source packet inter-arrival time, the target packet inter-arrival time, the number of "syn" (synchronization sequence numbers) and "ack" (acknowledgement characters) in the TCP connection, the interval time between syn and syn_ack in the TCP connection, and the interval time between syn_ack and ack packets in the TCP connection.

Among the available data features are: the first and second features, and other features extracted from the protocol header file, namely the source IP address, source port number, destination IP address, destination port number, transport protocol type, source IP to destination IP packet count, destination IP to source IP packet count, application layer protocol type, source IP transmission bits per second, destination IP transmission bits per second, and the value of the source TCP advertised window size, the value of the destination TCP advertised window size, the sequence number of the source TCP, the sequence number of the destination TCP, the mean size of the stream packets transmitted by the source, the mean size of the stream packets transmitted by the destination, the pipe depth of the http request/response connection, the actual uncompressed size of data transmitted from the http service of the server, the source jitter time (milliseconds), the destination jitter time (milliseconds), the source packet inter-arrival time, the destination packet inter-arrival time, the number of "syn" and "ack" in the TCP connection, the interval time between syn and syn_ack in the TCP connection, and the interval time between syn_ack and ack packets in the TCP connection;

Taking the UNSW_NB15 data set as an example, an original piece of network traffic data obtained in this way is as follows: "59.166.0.0,1390,149.171.126.6,53,udp,CON,0.001055,132,164,31,29,0,0,dns,500473.9375,621800.9375,2,2,0,0,0,0,66,82,0,0,0,0,1421927414,1421927414,0.017,0.013,0,0,0,0,0,0,0,0,3,7,1,3,1,1,1, 0";
This original network flow data record comprises 47 features in total, which are, in order:
“srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,service,Sload,Dload,Spkts,Dpkts,swin,dwin,stcpb,dtcpb,smeansz,dmeansz,trans_depth,res_bdy_len,Sjit,Djit,Stime,Ltime,Sintpkt,Dintpkt,tcprtt,synack,ackdat,is_sm_ips_ports,ct_state_ttl,ct_flw_http_mthd,is_ftp_login,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm”;
wherein the piece of raw network traffic data includes the available data characteristics of one network flow, comprising: the source IP address, source port number, destination IP address, destination port number, transport protocol type, source IP to destination IP packet count, destination IP to source IP packet count, application layer protocol type, source IP transmission bits per second, destination IP transmission bits per second, and other characteristics extracted from the protocol header file, such as the value of the source TCP advertised window size, the value of the destination TCP advertised window size, the sequence number of the source TCP, the sequence number of the destination TCP, the average size of the stream packets transmitted by the source, the average size of the stream packets transmitted by the destination, the pipe depth of the http request/response connection, the actual uncompressed size of data transmitted from the http service of the server, the source jitter time (milliseconds), the destination jitter time (milliseconds), the source packet inter-arrival time, the destination packet inter-arrival time, the number of "SYN" and "ACK" in the TCP connection, the interval time between SYN and SYN_ACK in the TCP connection, and the interval time between SYN_ACK and ACK packets in the TCP connection;
step 3) carrying out data cleaning on the acquired network flow characteristic data, carrying out attribute splitting on the cleaned network flow characteristic data, and splitting the network flow characteristic data into numerical data and non-numerical data;
specifically, as shown in fig. 3, step 3) specifically includes:
step 3-1) data cleaning is carried out on the network flow characteristic data, data without recording labels are removed, and the specific data cleaning process comprises the following steps: recording normalization, missing value processing and NAN value processing;
step 3-2) attribute splitting is carried out on the cleaned network flow characteristic data, and the network flow characteristic data are split into numerical data and non-numerical data according to different characteristics of each characteristic attribute;
step 4) as shown in fig. 3, inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding, and acquiring encoded non-numerical data;
specifically, the encoded non-numerical data comprises 5 features: the source IP address, the target IP address, the transmission protocol type, the protocol state type and the network service type;
specifically, the non-numerical data is one-hot encoded, giving a feature dimensionality of 294, and a sparse self-encoder is constructed (the layer-by-layer network structure of the self-encoder is given in a table that is not reproduced here); the dimension of the coding layer is set to 20. The constructed sparse self-encoder is trained with the one-hot encoded non-numerical data, a cross entropy loss function with an added KL divergence sparsity penalty term is selected as the loss function, the training optimizer uses the Adam optimization algorithm, training converges after about 20 rounds, and the encoder is saved. The input is the one-hot encoded non-numerical characteristic set; the output is the encoding part of the sparse self-encoder, i.e. the encoder extracted from the trained network, which maps the 294-dimensional one-hot input to the 20-dimensional code.
the step 4) specifically comprises the following steps:
step 4-1) dividing according to the attribute label set, and extracting a non-numerical characteristic set from non-numerical data;
step 4-2) carrying out one-hot coding on the non-numerical characteristic set extracted in the step 4-1);
step 4-3), constructing and training a sparse self-encoder, taking the one-hot encoded non-numerical characteristic set as the input of the sparse self-encoder; its output is the encoder extracted from the sparse self-encoder;
step 4-4) drawing on the Word2Vec algorithm from natural language processing, the TCPIP2Vec algorithm based on the sparse self-encoder uses the encoder extracted in step 4-3) to re-encode the one-hot codes of the non-numerical characteristic set, obtaining encoded non-numerical data, namely the re-encoded non-numerical characteristic set; this is used for numerical similarity calculation, for example calculating the similarity of IP addresses, while the feature dimension of the data attributes is greatly reduced compared with the one-hot code.
Wherein, in the step 4-3), constructing and training the sparse self-encoder specifically comprises:
first, a symmetric neural network H_{W,b} with weight parameters W and bias parameters b is constructed as the sparse self-encoder:

H_{W,b}(X) = g(f(X))

wherein f(X) is the coding function and g(X) is the decoding function; the two functions are approximated by constructing a neural network whose weight parameters are collectively W and whose bias parameters are collectively b;

to achieve sparse coding, a cross entropy loss function is adopted to train the neural network H_{W,b}; the cross entropy loss function with the sparsity penalty is:

J_S(W, b) = J(W, b) + β·Σ_j KL(ρ ‖ ρ̂_j)

wherein J_S(W, b) is the cross entropy loss function, chosen because the input is the one-hot encoded non-numerical characteristic set; Σ_j KL(ρ ‖ ρ̂_j) is the KL divergence penalty applied to the coding function; ρ is the sparsity parameter and β is the regularization parameter; ρ̂_j is the average activation value of the j-th hidden unit of the coding layer of the sparse self-encoder, calculated as follows:

ρ̂_j = (1/n)·Σ_{i=1}^{n} f_j(x_i)

wherein n is the number of block samples set during training, f(x_i) is the coding function and x_i is the i-th network traffic data sample;

KL is the KL divergence, a measure for comparing the similarity between two probability distributions, calculated as follows:

KL(ρ ‖ ρ̂_j) = ρ·log(ρ/ρ̂_j) + (1 − ρ)·log((1 − ρ)/(1 − ρ̂_j))

wherein ρ(t) is the sparse parameter function and ρ̂_j is the average activation value of the j-th hidden unit of the coding layer of the sparse self-encoder;

the sparse self-encoder is trained with this cross entropy loss function; its input is the one-hot encoded non-numerical characteristic set, recorded as:

X = [x_1, …, x_n]

wherein the feature dimension is n, X is the one-hot encoded non-numerical characteristic set and x_n is its n-th component;

the output of the sparse self-encoder is the encoder extracted from the sparse self-encoder.

Besides imposing a KL divergence penalty, other sparsity penalties can be used, such as an absolute value penalty or an L1 norm penalty. The training process of the sparse self-encoder is the same as that of an ordinary neural network: gradients are computed and backpropagation is used, specifically with the common Adam optimization algorithm or classical stochastic gradient descent.
Step 5) inputting the numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain the screened numerical data as shown in FIG. 3;
wherein the numerical data comprises 42 numerical features in total, including: the source port number, target port number, total duration of the record, number of bytes from the source IP to the target IP, number of bytes from the target IP to the source IP, time to live from the source IP to the target IP, time to live from the target IP to the source IP, number of retransmitted or dropped source packets, number of retransmitted or dropped target packets, number of bits transmitted per second by the source IP, number of bits transmitted per second by the target IP, number of packets from the source IP to the target IP, number of packets from the target IP to the source IP, value of the source TCP advertised window size, value of the target TCP advertised window size, sequence number of the source TCP, sequence number of the target TCP, average size of the stream packets transmitted by the source, average size of the stream packets transmitted by the target, pipe depth of the http request/response connection, actual uncompressed size of data transmitted from the http service of the server, source jitter time (milliseconds), target jitter time (milliseconds), record start time, record last time, source packet inter-arrival time, target packet inter-arrival time, the TCP connection setup round-trip time and the interval times of syn, syn_ack and ack packets in the TCP connection, and count features such as the number of flows containing http methods like GET and POST, the number of ftp login sessions and ftp commands, and counts of connections sharing the same source or destination address, port or service.
Specifically, the remaining numerical data includes a plurality of redundant information in addition to non-numerical data, a decision tree algorithm is adopted based on an extreme random tree embedding method, numerical data including 42 numerical features are subjected to cross validation, an information gain ratio is set as a numerical feature selection method, the number of estimators is set to be 50 most appropriate, and after the extreme random tree model is established, screened numerical data are output.
After the numerical characteristics are sorted in the descending order of importance, five attribute characteristics with the top importance ranking are selected by combining the subsequent abnormal detection process, namely the survival time from the source IP to the target IP, the survival time from the target IP to the source IP, the specific range of the survival time of the source IP/the target IP for each state, the number of bits transmitted by the target per second and the value of the size of the target TCP notification window.
The step 5) specifically comprises the following steps:
step 5-1) dividing according to the attribute numbers, splitting the attributes of the numerical data, and acquiring a split numerical characteristic set;
step 5-2) inputting the split numerical characteristic set into a pre-established extreme random tree model, and arranging each numerical characteristic in the split numerical characteristic set in a descending order from large to small according to importance to obtain a sorted numerical characteristic set;
and 5-3) screening the sorted numerical feature set according to a preset threshold value to obtain importance factors of each numerical feature in the sorted numerical feature set larger than the preset threshold value, and recording the importance factors as screened numerical data.
The step 5) further comprises the following steps: adopting a recursive feature elimination algorithm, namely an RFE algorithm, sequentially deleting a numerical feature from the screened numerical data, normalizing the remaining screened numerical data and the recoded non-numerical data, and performing anomaly detection to obtain a detection result; and comparing the detection result with the previous detection result without deleting the numerical characteristic, and detecting whether the detection results in the two cases are consistent or not for verifying whether the previous detection result is correct or not.
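A hedged sketch of this recursive feature elimination check using scikit-learn's RFE is shown below; the estimator choice and the names X_num, y and n_keep are assumptions for illustration.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

def rfe_check(X_num, y, n_keep):
    """Recursively drop one feature at a time and return the surviving subset,
    so the detection result can be compared against the run without deletion."""
    selector = RFE(estimator=ExtraTreesClassifier(n_estimators=50, random_state=0),
                   n_features_to_select=n_keep, step=1)
    selector.fit(X_num, y)
    return selector.support_    # boolean mask over the screened numerical features
```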
The step 5) further comprises the following steps: carrying out category splitting on the cleaned network flow characteristic data, and recording a corresponding category number characteristic set; wherein the content of the first and second substances,
if the split network flow characteristic data is provided with a known data label, classifying according to the known class label, recording the data number of the corresponding class, and recording as a known numerical value characteristic set;
if the split network flow characteristic data is provided with an Unknown data label, classifying the split network flow characteristic data into Unknown, recording the data number of the corresponding category as well, and recording as an Unknown numerical value characteristic set;
specifically, in the UNSW _ NB15 data, in addition to normal data, there are 9 common attack types, which are:
fuzzers, attack behavior that halts a program or network by randomly generated data
Analysis, including different port scan attacks, spam and html file penetration
Backdoors, a technique for silently bypassing system security mechanisms and accessing computers and their data
Dos, making network resources unavailable to users by temporarily disrupting or suspending services to hosts connected to the internet
Exploits, the attacker knows the security problem of an operating system or software and uses the vulnerability to attack
Generic, an attack technique that works against block ciphers (given the block and key sizes) regardless of the structure of the block cipher
Reconnaissance, containing all attacks that simulate the gathering of information
Shellcode, a small segment of code for software vulnerability payloads
Worms, where the attacker replicates itself to spread to other computers, self-propagating through the computer network and relying on security failures on the target computer to gain access
And recording a data number set of a corresponding category, and establishing a set for the unknown attack type set so as to store an output result of unsupervised anomaly detection.
As shown in fig. 3, the data after class splitting, the encoded non-numerical data, and the sorted numerical data are normalized and archived.
In the step 5-2), the process of establishing the extreme random tree-based feature selection model specifically includes:
randomly selecting numerical characteristics in the split numerical characteristic set to construct a plurality of decision trees;
wherein, the construction process of each decision tree is as follows:
the importance factor of each numerical characteristic is obtained according to the following calculation formula:

G(D, A) = (H(D) − H(D | A)) / H_A(D)

wherein G(D, A) is the importance factor of the numerical characteristic A relative to the numerical characteristic set D to be divided, namely the information gain ratio; D is the data set to be divided; A is the currently selected numerical characteristic; H_A(D) is the information entropy obtained by taking the currently selected numerical characteristic A as a random variable; H(D) is the information entropy of set D with the data class as a random variable; H(D | A) is the conditional information entropy of the subsets obtained after the set D is divided using the feature A;

when each decision tree is constructed, k numerical features are randomly selected from the K numerical features, wherein K is the total dimension of the numerical features and k is the feature dimension set for constructing each decision tree; the value of k is set to be less than K, a common choice being k = √K;
When each decision tree is constructed, selecting the numerical characteristic with the largest information gain ratio G (D, A) from the k selected numerical characteristics, then constructing nodes and splitting;
when the nodes of the decision tree are split, randomly selecting an arbitrary number between the maximum value and the minimum value of the numerical characteristic, and recording the arbitrary number as a comparison value; when the numerical characteristic of the sample is greater than the comparison value, taking the sample as a left branch; when the numerical characteristic of the sample is smaller than the comparison value, the sample is taken as a right branch, and then the bifurcation value of the numerical characteristic of the sample is calculated; wherein, the sample is a split numerical characteristic set;
traversing the selected k numerical characteristics to construct a decision tree;
repeating the process of constructing the basic decision tree for N times to construct N decision trees; wherein, the number of the decision tree is determined by using a cross validation and grid search method;
judging each numerical characteristic in the split numerical characteristic set by using the plurality of decision trees: specifically, each decision tree judges whether the original network traffic data corresponding to the numerical characteristic is normal data or abnormal data, the judgment results of all decision trees are summarized by a voting method, and the majority result is taken as the final judgment result; for example, if more trees judge the original network traffic data corresponding to the numerical characteristic to be normal data than judge it to be abnormal data, the judgment that the data is normal is taken as the final judgment result; wherein each judgment result states whether the original network flow data corresponding to the numerical characteristic is normal data or abnormal data;
obtaining the importance factor of each numerical characteristic in the split numerical characteristic set according to the finally obtained judgment results and the above formula, sorting the numerical characteristics in descending order of importance, screening them according to a set threshold value, retaining the numerical characteristics whose importance factors are greater than the preset threshold value, and recording them as the screened numerical data;
the input of the extreme random tree model is a split numerical characteristic set, and the output of the extreme random tree model is screened numerical data.
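For illustration, the ranking-and-thresholding step can be sketched with scikit-learn's extremely randomized trees as below; note that scikit-learn's feature_importances_ is an impurity-based importance used here as a stand-in for the information gain ratio G(D, A) defined above, and the 50-tree setting follows the embodiment.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def select_numeric_features(X_num, y, threshold, n_trees=50):
    """Rank numerical features by importance and keep those above the threshold."""
    forest = ExtraTreesClassifier(n_estimators=n_trees, criterion="entropy",
                                  random_state=0)
    forest.fit(X_num, y)                       # y: normal / abnormal labels
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]      # descending order of importance
    keep = [i for i in order if importances[i] > threshold]
    return keep, importances
```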
Step 6) recoding the split non-numerical data using the encoder extracted in step 4-3), so that the re-encoded data has a feature dimensionality of 20; screening the feature attributes of the sorted numerical characteristic set obtained in step 5-2), so that the screened feature dimensionality is 17; merging the processed non-numerical and numerical data and recording the result as X_0; then performing data normalization on X_0 and recording the processed data as X_1; wherein each row of X_0 and X_1 represents one piece of network flow data and each column represents one of the configured features of the processed network flow data; inputting X_1 into the first anomaly classifier or the second anomaly classifier and detecting whether each piece of network flow data is abnormal, specifically executing the following steps:
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data to be the unknown attack type;
if the network flow data is not abnormal data, the output is normal.
As shown in fig. 4, the step 6) specifically includes:
step 6-1) normalizing the merged data X_0 and recording the processed data as X_1;

specifically, each column X_0[i] of X_0 is transformed by subtracting the mean and dividing by σ according to the following transformation function:

X_1[i] = (X_0[i] − μ) / σ

wherein X_0[i] is the i-th column of the encoded network flow data; X_1[i] is the i-th column of the network flow data after normalizing X_0; μ is the mean value of the column vector X_0[i]; σ is the variance of the column vector X_0[i];
in other embodiments, the data may be normalized to the interval [0, 1] by applying the following linear transformation to the re-encoded data:

X_1[i] = (X_0[i] − min) / (max − min)

wherein X_0[i] is the i-th column of the encoded network flow data; X_1[i] is the i-th column of the network flow data after normalizing X_0; min is the minimum value of the column vector X_0[i]; max is the maximum value of the column vector X_0[i];
in other embodiments, X_0[i] can be normalized by subtracting the median of the column vector X_0[i] and dividing by its interquartile range to obtain the processed network flow data X_1[i]; the principle is the same as in the first two methods, but this variant is more robust because outliers are effectively discarded in the calculation.
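The three normalization variants can be written as the following NumPy sketch; the robust variant (median and interquartile range) is an assumption consistent with the description above, since the exact statistics of that embodiment are not spelled out here.

```python
import numpy as np

def zscore(col):
    """X1[i] = (X0[i] - mu) / sigma for one column."""
    return (col - col.mean()) / col.std()

def minmax(col):
    """Scale one column into [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

def robust_scale(col):
    """Assumed robust variant: subtract the median, divide by the interquartile
    range, so isolated outliers barely influence the scaling."""
    q1, q3 = np.percentile(col, [25, 75])
    return (col - np.median(col)) / (q3 - q1)
```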
Step 6-2) inputting network flow data into a pre-trained first abnormal classifier, and detecting whether the input network flow data is abnormal data; outputting a detection result;
specifically, firstly, a supervision anomaly detection method based on classification is adopted to detect whether input network flow data is anomalous data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormality classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data; the abnormal data is network flow data of a known attack type;
if the network flow data is judged not to be abnormal data by the supervised anomaly detection method, inputting the network flow data into a pre-trained second anomaly classifier, and further detecting whether the network flow data is an unknown anomaly by adopting an unsupervised anomaly detection method;
if the network flow data is judged to be abnormal data through an unsupervised anomaly detection algorithm, the abnormal data is marked as an unknown attack type; the abnormal data is suspicious network flow data of unknown attack types;
if the network flow data is not abnormal data, the output is normal.
Wherein the establishing and training of the first anomaly classifier comprises:
adopting the AdaBoost supervised classification algorithm to construct an anomaly detection function:

G(x) = sign( Σ_{m=1}^{M} α_m·G_m(x) )

wherein G(x) is the first anomaly classifier; G_m(x) is a decision tree weak classifier, with m = 1, 2, …, 30; α_m is the classifier weight coefficient corresponding to G_m(x);
training a first anomaly classifier according to an anomaly detection function;
the training process is as follows:
initializing the weight vector of the input data: w is a group ofm=(w11,w12,…w1n),w1i=1/n, m =1 representing the weak classifier currently trained; in the weight vector WmTraining weak classifier G with the objective of minimizing classification error ratem(x) Calculating a classifier weight coefficient according to the classification error rate; then judging whether M is smaller than M, if M is smaller than M<M, updating the weight vector of the next step according to the training result of the previous step, and training the abnormal classifier function G of the (M + 1) th step by taking the classification error rate as the minimum targetm+1(x) Then, updating m to be m +1, otherwise, constructing an abnormal detection function according to the trained weak classifier;
wherein, the update weight vector update formula is as follows:
Figure GDA0003797781890000181
wherein m represents the current step and m +1 represents the next step;
the calculation formula of the classification error rate is as follows:
e_m = Σ_{i=1}^{n} w_{m,i} · I( G_m(x_i) ≠ y_i )

where I(·) is the indicator function;
the classifier weight coefficient calculation formula is as follows:
α_m = (1/2) · ln( (1 − e_m) / e_m )

where Z_m is a normalization coefficient, and its specific calculation formula is as follows:

Z_m = Σ_{i=1}^{n} w_{m,i} · exp( −α_m · y_i · G_m(x_i) )
training the first anomaly classifier according to the anomaly detection function, wherein the input of the first anomaly classifier is {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_n, y_n)}, where x_i is each piece of processed network flow data, i.e., the i-th row of the network flow data X_1 normalized in step 6-1); n is the number of rows of the matrix X_1 and represents the total number of pieces of network flow data; x_i ∈ R^n; y_i is the corresponding label, y_i ∈ {0, 1}, with 0 denoting normal and 1 denoting abnormal; the output is the discrimination result of the trained first anomaly classifier for each network traffic feature item.
The specific process by which the decision-tree weak classifiers G_m(x) are obtained by step-by-step learning is as follows:
step 6-2-1) initializing a weight vector of input data:
W_m = ( w_{m,1}, w_{m,2}, …, w_{m,n} ),  w_{1,i} = 1/n,  i = 1, 2, …, n

where m is initialized to 1; W_m is the weight vector, and each component w_{m,i} corresponds to one row of the network flow training data X_1 and represents the weight of that piece of network flow data;
step 6-2-2) on the basis of the weight vector W_m, the classification error rate function is adopted and G_m(x) is trained with the classification error rate minimized as the objective; the classification error rate function is given by formula (2):

e_m = Σ_{i=1}^{n} w_{m,i} · I( G_m(x_i) ≠ y_i )    (2)

where e_m is the classification error rate; w_{m,i} are the components of the weight vector; I(·) is the indicator function; G_m(x_i) is the discrimination result of the m-th classifier on the i-th piece of network flow data; y_i is the label of the i-th piece of network flow data;
then, according to formula (3), the classifier weight coefficient is calculated:

α_m = (1/2) · ln( (1 − e_m) / e_m )    (3)

where α_m is the classifier weight coefficient;
step 6-2-3) it is then judged whether m is smaller than M; if m < M, the weight vector of the next step is updated according to the training result of the previous step, the anomaly classifier function G_{m+1}(x) of step m+1 is trained with the classification error rate minimized as the objective, and m is updated to m + 1; otherwise, the anomaly detection function is constructed from the trained weak classifiers; specifically,
the weight vector is updated from the classifier weight coefficient α_m according to formula (4):

w_{m+1,i} = ( w_{m,i} / Z_m ) · exp( −α_m · y_i · G_m(x_i) ),  i = 1, 2, …, n    (4)

where w_{m+1,i} are the components of the weight vector W_{m+1} of step m+1; Z_m is the normalization coefficient; w_{m,i} are the components of the step-m weight vector W_m; y_i is the label of the i-th piece of network flow data x_i; G_m(x) is the discrimination function of the m-th classifier;
Z_m is calculated according to formula (5):

Z_m = Σ_{i=1}^{n} w_{m,i} · exp( −α_m · y_i · G_m(x_i) )    (5)
step 6-2-4) the decision-tree weak classifiers G_m(x) are linearly combined with their classifier weight coefficients into the first anomaly classifier, i.e., a strong learner:

G(x) = sign( Σ_{m=1}^{M} α_m · G_m(x) )    (6)

where G_m(x) is a decision-tree weak classifier and α_m is the classifier weight coefficient corresponding to G_m(x).
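A minimal scikit-learn/NumPy sketch of this AdaBoost training loop follows formulas (2)–(6); the {0,1} labels are mapped to {−1,+1} so the exponential weight update applies, and the tree depth, M = 30 rounds and the random toy data are illustrative choices rather than the patented configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_adaboost(X, y, M=30, max_depth=1):
    """AdaBoost training loop: weighted weak learners, error rate e_m, weight alpha_m,
    and the exponential sample-weight update of formulas (2)-(5)."""
    y_pm = np.where(y == 1, 1, -1)                      # map {0,1} labels to {-1,+1}
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                             # step 6-2-1: initial weight vector
    learners, alphas = [], []
    for _ in range(M):
        g = DecisionTreeClassifier(max_depth=max_depth)
        g.fit(X, y_pm, sample_weight=w)                 # step 6-2-2: minimize weighted error
        pred = g.predict(X)
        e_m = np.clip(np.sum(w * (pred != y_pm)), 1e-10, 1 - 1e-10)   # formula (2)
        alpha_m = 0.5 * np.log((1 - e_m) / e_m)         # formula (3)
        w = w * np.exp(-alpha_m * y_pm * pred)          # formula (4)
        w = w / w.sum()                                 # normalization Z_m, formula (5)
        learners.append(g)
        alphas.append(alpha_m)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Strong classifier G(x) = sign(sum_m alpha_m G_m(x)), formula (6); returns {0,1} labels."""
    score = sum(a * g.predict(X) for g, a in zip(learners, alphas))
    return (np.sign(score) > 0).astype(int)

# illustrative usage on random data
X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)
learners, alphas = train_adaboost(X, y)
print(adaboost_predict(learners, alphas, X[:5]))
```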
the establishing and training of the second anomaly classifier comprises the following steps:
an unsupervised anomaly detection function is constructed using the reconstruction error function of an autoencoder:

L(x_p, x_r) = ‖x_p − x_r‖² = ‖x_p − F(G(x_p))‖²

where L(x_p, x_r) is the reconstruction error function; F and G denote the decoding function and the encoding function of the autoencoder, respectively; x_p is the raw network traffic data to be detected; x_r is the data reconstructed from x_p by the autoencoder. Specifically,
step 6-3-1) firstly, extracting the network flow feature data marked as normal from the network flow feature data calibrated in the earlier data division stage, and denoting it X_normal;
step 6-3-2) with X_normal as the autoencoder training data, simulating the encoding function G and the decoding function F with a multilayer perceptron to construct the autoencoder;
step 6-3-3) taking the reconstruction error function as the objective function and training with the Adam optimization algorithm; after training, the error threshold error is calculated with the following formula:

error = (1/n′) · Σ_{x ∈ X_normal} L( x, F(G(x)) )

where X_normal is the network traffic feature data marked as normal; n′ is the number of network traffic feature items used for training the autoencoder, i.e., the number of rows of X_normal; x is a row of X_normal; F and G are the decoding function and the encoding function of the trained autoencoder, respectively; L is the reconstruction error calculated for x;
step 6-3-4) after the autoencoder has been trained and the error threshold error has been calculated, the data that the supervised anomaly detection judged to be normal (i.e., was not marked as abnormal) is taken as x_p and input into the autoencoder to obtain the reconstructed data x_r, and the reconstruction error L(x_p, x_r) is calculated;
step 6-3-5) the reconstruction error L(x_p, x_r) is compared with error: if the reconstruction error is larger than 3 times error, the piece of data is marked as abnormal and an unknown-type anomaly is output; if the reconstruction error is within 3 times error, the data is judged to be normal data and normal is output.
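A small scikit-learn sketch of steps 6-3-1) to 6-3-5) is shown below; MLPRegressor (trained with the Adam solver) stands in for the multilayer-perceptron autoencoder, the error threshold is taken as the mean training reconstruction error as in the formula above, and the layer sizes, iteration count and random toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_autoencoder(X_normal, code_dim=8):
    """Train an MLP autoencoder on traffic marked normal and return the error threshold."""
    ae = MLPRegressor(hidden_layer_sizes=(32, code_dim, 32),
                      activation='relu', solver='adam', max_iter=500)
    ae.fit(X_normal, X_normal)                        # reconstruct the input
    recon = ae.predict(X_normal)
    errors = np.sum((X_normal - recon) ** 2, axis=1)  # L(x, F(G(x))) per sample
    threshold = errors.mean()                         # error threshold over the training set
    return ae, threshold

def detect_unknown(ae, threshold, X_p, factor=3.0):
    """Flag a sample as an unknown-type anomaly when its reconstruction error
    exceeds `factor` times the training error threshold (step 6-3-5)."""
    recon = ae.predict(X_p)
    errors = np.sum((X_p - recon) ** 2, axis=1)
    return errors > factor * threshold                # True = unknown attack type

# illustrative usage on random data
X_normal = np.random.rand(300, 20)
ae, thr = train_autoencoder(X_normal)
X_p = np.random.rand(10, 20)
print(detect_unknown(ae, thr, X_p))
```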
Step 7) displaying the detection result of detecting whether the network flow data is abnormal data.
Specifically, if the detection result is that the network flow data is network flow data of a known attack type, an alarm is issued;
if the detection result is that the network flow data is network flow data of an unknown attack type, the suspected attack data is returned to the database and professional personnel are notified to perform manual analysis;
if the suspicious attack data is determined to be a new abnormal data type after being analyzed by the professional, adding the suspicious attack data into a training set of a first abnormal classifier; if the analyzed data is normal data, adding the normal data into a training set of a second abnormal classifier;
detection results are counted periodically, including: the known attack types and unknown attack types of abnormal data appearing in the whole network environment, and the accuracy, recall rate and false alarm rate of abnormal-data detection; whether to retrain and update the algorithms in the model is then decided accordingly; the accuracy, recall rate and false alarm rate of the detection results are calculated from the test results on test data carrying attack-type labels, namely:
Accuracy = ( TP + TN ) / ( TP + TN + FP + FN )

Recall = TP / ( TP + FN )

False alarm rate = FP / ( FP + TN )
where TP, TN, FP and FN denote statistical counts over the network flow feature data produced by the established network traffic data analysis method and system; specifically,
TP denotes the number of network flow feature data items whose detection result is normal and whose system prediction is normal;
TN denotes the number of network flow feature data items whose detection result is abnormal and whose system prediction is abnormal;
FP denotes the number of network flow feature data items whose detection result is abnormal and whose system prediction is normal;
FN denotes the number of network flow feature data items whose detection result is normal and whose system prediction is abnormal;
and determining whether to retrain and update the system according to the statistical result and the number of the suspicious attacks.
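As a quick illustration of the three formulas above, the following snippet computes accuracy, recall and false alarm rate from the four counts; the numbers passed in are made-up values, not measured results.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, recall and false-alarm rate from the TP/TN/FP/FN counts defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return accuracy, recall, false_alarm_rate

# illustrative counts only
print(detection_metrics(tp=900, tn=85, fp=5, fn=10))
```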
As shown in fig. 1, the present invention further provides a network traffic data analysis system based on a sparse autoencoder and extreme random trees, the system comprising: an original data acquisition module, a data preprocessing module, a data feature extraction module, a data anomaly detection module and a data result display module; wherein,
the original data acquisition module is used for capturing original network flow data in real time;
the data preprocessing module is used for extracting available data characteristics from the acquired original network traffic data and acquiring network traffic characteristic data;
the data characteristic extraction module is used for carrying out data cleaning and attribute splitting on the acquired network flow characteristic data and splitting the acquired network flow characteristic data into numerical data and non-numerical data; inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data; inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
the data anomaly detection module is used for carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data; detecting whether the network flow data is abnormal data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data as the unknown attack type;
if the network flow data is not abnormal data, the output is normal.
And the data result display module is used for displaying the detection result of detecting whether the network flow data is abnormal data.
For network flow features such as IP addresses, ports and TCP/IP protocol fields, the method of the invention draws on the Word2Vec algorithm from natural language processing and proposes a TCP/IP numericalization algorithm based on a sparse autoencoder, which maps features such as IP addresses and TCP/IP protocol fields into an n-dimensional real-valued space and provides good support for the various distance- or density-based supervised and unsupervised algorithms in the subsequent data anomaly detection module. Compared with traditional one-hot encoding, the accuracy of the system is greatly improved and its false alarm rate is markedly reduced.
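The following PyTorch sketch illustrates this kind of sparse autoencoder re-encoding of one-hot TCP/IP features, i.e., a binary cross-entropy reconstruction loss plus a KL-divergence sparsity penalty on the mean hidden activation; the layer sizes, ρ, β, learning rate and the toy one-hot input are illustrative assumptions, not the patented parameters.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-hidden-layer sparse autoencoder for one-hot encoded protocol/port features."""
    def __init__(self, in_dim, code_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(code_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

def sparse_loss(x, x_hat, code, rho=0.05, beta=1e-3):
    """Cross-entropy reconstruction loss plus KL(rho || rho_hat_j) summed over hidden units."""
    bce = nn.functional.binary_cross_entropy(x_hat, x)
    rho_hat = code.mean(dim=0).clamp(1e-6, 1 - 1e-6)   # average activation per hidden unit
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return bce + beta * kl

# illustrative training loop on stand-in one-hot rows
x = torch.eye(10)
model = SparseAutoencoder(in_dim=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    x_hat, code = model(x)
    loss = sparse_loss(x, x_hat, code)
    opt.zero_grad()
    loss.backward()
    opt.step()
with torch.no_grad():
    embedding = model.encoder(x)   # re-encoded (numerical) non-numeric features
```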
In the data feature extraction module, multiple feature selection means are integrated. For the UNSW_NB15 data, in addition to the sparse-autoencoder-based TCP/IP numericalization algorithm, a feature processing method based on extreme random trees is used, with selectable criteria including information entropy, information gain ratio and the Gini index, so that the feature dimension is greatly reduced and the operating efficiency of the system is improved while the important information of the original data is preserved.
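A compact scikit-learn sketch of this kind of extreme-random-tree feature screening is shown below; the threshold, tree count and random toy data are assumptions for illustration, and sklearn's impurity-based feature importance stands in for the gain-ratio factor described in the patent.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def select_numeric_features(X_num, y, threshold=0.01, n_trees=100):
    """Rank numeric traffic features with an extremely randomized tree ensemble and
    keep the features whose importance exceeds `threshold`, in descending order."""
    forest = ExtraTreesClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_num, y)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]             # descending order of importance
    kept = [i for i in order if importances[i] > threshold]
    return kept, importances

# illustrative usage: X_num is the split numeric feature set, y the normal/abnormal label
X_num = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)
kept, imp = select_numeric_features(X_num, y)
X_selected = X_num[:, kept]                           # screened numerical data
```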
In the data anomaly detection module, supervised and unsupervised learning means are integrated. The basic idea is as follows: first, a traffic data profile model is built for the normal behavior of network traffic; then, for known attacks or abnormal network flow behaviors, anomaly detection classifiers for the different attack types are constructed by supervised learning; and for the remaining unknown data, unsupervised learning is used to continuously extend and improve the normal-behavior model of network traffic and the classifier of abnormal behavior. Through the double detection of supervised anomaly detection and unsupervised anomaly detection, accuracy and the false alarm rate are guaranteed while the ability to discover unknown attacks is retained.
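The two-stage decision flow just described can be sketched as follows; `supervised_clf`, `attack_clf`, `autoencoder` and `error_threshold` are assumed to have been trained and computed as in the earlier sketches, and the function name and return format are hypothetical.

```python
import numpy as np

def analyze_flow(x, supervised_clf, attack_clf, autoencoder, error_threshold):
    """Stage 1: supervised anomaly detection labels known anomalies with an attack type.
    Stage 2: autoencoder reconstruction error flags remaining traffic as unknown attacks."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    if supervised_clf.predict(x)[0] == 1:              # abnormal -> known attack type
        return "abnormal", attack_clf.predict(x)[0]
    recon = autoencoder.predict(x)
    err = float(np.sum((x - recon) ** 2))
    if err > 3 * error_threshold:                      # unknown anomaly
        return "abnormal", "unknown attack type"
    return "normal", None
```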
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A method for detecting the abnormity of network flow data is characterized by comprising the following steps:
step 1) processing original network flow data captured in real time to obtain network flow data;
step 2) judging the network flow data obtained in step 1); if the judgment result is abnormal data, outputting an anomaly, inputting the abnormal data into a pre-trained first anomaly classifier, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
step 3) judging the network flow data obtained in step 1); if the judgment result is not abnormal data, adopting an unsupervised anomaly detection method to further detect whether the network flow data is abnormal;
step 4) judging according to the further detection in the step 3), if the judgment result is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data as the unknown attack type;
step 5) judging according to the further detection of the step 3), and outputting normal data if the judgment result is not abnormal data;
the method comprises the steps of processing original network flow data captured in real time to obtain network flow data; the method specifically comprises the following steps:
capturing original network flow data in real time;
extracting available data characteristics from the obtained original network traffic data to obtain network traffic characteristic data;
performing data cleaning and attribute splitting on the acquired network flow characteristic data, and splitting the acquired network flow characteristic data into numerical data and non-numerical data;
inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data;
inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data;
the establishing and training of the sparse self-encoder specifically comprises the following steps:
establishing a sparse autoencoder, and, based on the TCPIP2Vec algorithm of the sparse autoencoder, adopting a cross-entropy loss function J_S(W, b) to which a KL-divergence sparsity penalty term is added to train the sparse autoencoder:

J_S(W, b) = J(W, b) + β · Σ_{j=1}^{s} KL( ρ ‖ ρ̂_j )

where J(W, b) is the cross-entropy reconstruction loss; Σ_{j=1}^{s} KL( ρ ‖ ρ̂_j ) is the KL-divergence penalty applied to the coding function; ρ is the sparsity parameter, β is the regularization parameter, and s is the number of hidden units of the coding layer;
ρ̂_j is the average activation value of the j-th hidden unit of the coding layer, calculated as follows:

ρ̂_j = (1/n) · Σ_{i=1}^{n} f_j(x_i)

where n is the number of samples in a training batch; f(x_i) is the coding function applied to the i-th sample x_i, and f_j(·) is its j-th component;
KL is the KL divergence, a measure of the similarity between two probability distributions, calculated as follows:

KL( ρ ‖ ρ̂_j ) = Σ_t ρ(t) · log( ρ(t) / ρ̂_j(t) )

where ρ(t) is the sparse parameter distribution, i.e., the Bernoulli distribution with parameter ρ, and ρ̂_j(t) is the corresponding distribution with parameter ρ̂_j;
the input of the sparse self-encoder is a non-numerical characteristic set subjected to single-hot encoding; the output of the sparse self-encoder is a recoded non-numerical characteristic set, namely coded non-numerical data.
2. The method according to claim 1, wherein the non-numerical data is input to a pre-trained sparse self-encoder for re-encoding, and encoded non-numerical data is obtained; the method specifically comprises the following steps:
dividing according to the attribute label set, performing attribute splitting on the non-numerical data, and acquiring a non-numerical feature set from the non-numerical data;
carrying out single-hot coding on the non-numerical characteristic set to obtain a non-numerical characteristic set subjected to single-hot coding, inputting the non-numerical characteristic set to a pre-trained sparse self-encoder, and obtaining an encoder extracted from the sparse self-encoder;
by using a Word2Vec algorithm in natural language processing for reference, the TCPIP2Vec algorithm based on a sparse autoencoder is adopted to re-encode the single hot code of the non-numerical feature set, and encoded non-numerical data is obtained.
3. The method according to claim 1, wherein the numerical data is input into a pre-established extreme random tree model, and the importance of the numerical data is sorted and screened in a descending order to obtain screened numerical data; the method specifically comprises the following steps:
dividing according to the attribute numbers, and performing attribute splitting on the numerical data to obtain a split numerical characteristic set;
inputting the split numerical characteristic set into a pre-established extreme random tree model, and arranging each numerical characteristic in the split numerical characteristic set in a descending order from large to small according to importance to obtain a sorted numerical characteristic set;
and screening the sorted numerical feature set according to a preset threshold value, obtaining the numerical features in the sorted numerical feature set whose importance factors are larger than the preset threshold value, and recording them as the screened numerical data.
4. The method according to claim 3, wherein the process of establishing the extreme stochastic tree-based feature selection model specifically comprises:
randomly selecting numerical features in the split numerical feature set to construct a plurality of decision trees;
wherein, the construction process of each decision tree is as follows:
the importance factor of each numerical feature is obtained according to the following calculation formula:

G(D, A) = ( H(D) − H(D|A) ) / H_A(D)

where G(D, A) is the importance factor, i.e., the information gain ratio, of the numerical feature A relative to the numerical feature set D to be divided; D is the data set to be divided; A is the currently selected numerical feature; H_A(D) is the information entropy obtained with the currently selected numerical feature A as the random variable; H(D) is the information entropy of the set D with the data class as the random variable; H(D|A) is the conditional information entropy of the subsets obtained after the set D is divided using the feature A;
when each decision tree is constructed, k numerical features are randomly selected from the K numerical features, where K is the total dimension of the numerical features and k is the feature dimension set for constructing each decision tree; the value of k is set to be smaller than K, generally

k = √K ;
When each decision tree is constructed, selecting the numerical characteristic with the largest information gain ratio G (D, A) from the k selected numerical characteristics, then constructing nodes and splitting;
when the nodes of the decision tree are split, randomly selecting an arbitrary number between the maximum value and the minimum value of the numerical characteristic, and recording the arbitrary number as a comparison value; when the numerical characteristic of the sample is greater than the comparison value, taking the sample as a left branch; when the numerical characteristic of the sample is smaller than the comparison value, the sample is taken as a right branch, and then the bifurcation value of the numerical characteristic of the sample is calculated; wherein, the sample is a split numerical characteristic set;
traversing the selected k numerical characteristics to construct a decision tree;
repeating the process of constructing the basic decision tree for N times to construct N decision trees; wherein, the number of the decision tree is determined by using a cross validation and grid search method;
judging each numerical characteristic in the split numerical characteristic set by utilizing a plurality of decision trees, specifically judging whether the original network traffic data corresponding to the numerical characteristic is normal data or abnormal data through each decision tree in the plurality of decision trees, summarizing the judgment result of each decision tree by a voting method, and taking the result of the majority of the judgment result as a final judgment result; wherein, the judgment result is that the original network flow data corresponding to the numerical characteristic is normal data or abnormal data;
obtaining the importance factor of each numerical feature in the split numerical feature set according to the finally obtained judgment results and the above formula, sorting the importance factors of the numerical features in the split numerical feature set in descending order of importance, screening each numerical feature in the split numerical feature set according to a set threshold value, obtaining the numerical features in the sorted numerical feature set whose importance factors are larger than the preset threshold value, and recording them as the screened numerical data;
the input of the extreme random tree model is a split numerical characteristic set, and the output of the extreme random tree model is screened numerical data.
5. The method of claim 1, wherein the available data features are extracted from the acquired raw network traffic data to obtain network traffic feature data; the method specifically comprises the following steps:
extracting a first feature from the acquired raw network traffic data by using an Argus tool, wherein the first feature comprises: a source IP address, a source port number, a destination IP address, a destination port number, and a transport protocol type;
extracting second features from the acquired raw network traffic data using a Bro-IDS tool, the second features comprising: counting from a source IP to a target IP packet, counting from a target IP to a source IP packet, an application layer protocol type, transmission bits per second of the source IP and transmission bits per second of the target IP;
among the available data features are: the first feature, the second feature, and other features extracted from the protocol header file, including: a value of a source TCP advertisement window size, a value of a target TCP advertisement window size, a sequence number of the source TCP, a sequence number of the target TCP, an average of a size of a stream packet transmitted by the source, an average of a size of a stream packet transmitted by the target, a pipe depth of an http request/response connection, an actual size of data transmitted from an http service of the server without compression, a source jitter time (millisecond), a target jitter time (millisecond), a source packet interval arrival time, a target packet interval arrival time, a number of "syn" and "ack" in the TCP connection, an interval time of syn and syn _ ack in the TCP connection, an interval time of syn _ ack packet and ack packet in the TCP connection.
6. The method of claim 1, wherein the building and training of the first anomaly classifier specifically comprises:
adopting an AdaBoost supervised classification ensemble algorithm to construct the anomaly detection function:

G(x) = sign( Σ_{m=1}^{M} α_m · G_m(x) )

where G(x) is the first anomaly classifier; G_m(x) is a decision-tree weak classifier, with m = 1, 2, …, M and M = 30; α_m is the classifier weight coefficient corresponding to G_m(x);
training the first anomaly classifier according to the anomaly detection function, wherein the input of the first anomaly classifier is {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_n, y_n)}, where x_i is each piece of processed network flow data; x_i ∈ R^n; y_i is the corresponding label, y_i ∈ {0, 1}, with 0 denoting normal and 1 denoting abnormal; the output is the discrimination result of the trained first anomaly classifier for each network traffic feature item.
7. The method according to claim 1, wherein the building and training of the second anomaly classifier specifically comprises:
an unsupervised anomaly detection function is constructed by reconstructing an error function using an auto-encoder:
L(x_p, x_r) = ‖x_p − x_r‖² = ‖x_p − F(G(x_p))‖²

where L(x_p, x_r) is the reconstruction error function; F and G denote the decoding function and the encoding function of the autoencoder, respectively; x_p is the raw network traffic data to be detected; x_r is the data reconstructed from x_p by the autoencoder;
training a second anomaly classifier according to the unsupervised anomaly detection function, and acquiring a reconstruction error threshold according to a reconstruction error function;
during detection, if the reconstruction error of one piece of network flow data is greater than the reconstruction error threshold value, the piece of network flow data is judged to be abnormal data;
if the reconstruction error of one piece of network flow data is less than or equal to the reconstruction error threshold value, judging that the data is normal data;
the input of the second anomaly classifier is the network flow data x_p judged to be normal data by the first anomaly classifier; its output is the discrimination result of the trained second anomaly classifier for x_p.
8. A network traffic data analysis system, the system comprising: an original data acquisition module, a data preprocessing module, a data feature extraction module, a data anomaly detection module and a data result display module; wherein,
the original data acquisition module is used for capturing original network flow data in real time;
the data preprocessing module is used for extracting available data characteristics from the acquired original network traffic data and acquiring network traffic characteristic data;
the data characteristic extraction module is used for carrying out data cleaning and attribute splitting on the acquired network flow characteristic data and splitting the acquired network flow characteristic data into numerical data and non-numerical data; inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data; inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
the data anomaly detection module is used for carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data; detecting whether the network flow data is abnormal data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
judging according to further detection, if the judgment result is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data as the unknown attack type; if the judgment result is not abnormal data, outputting the data to be normal;
the data result display module is used for displaying a detection result for detecting whether the network flow data is abnormal data;
the processing process of the data preprocessing module specifically comprises the following steps:
capturing original network flow data in real time;
extracting available data characteristics from the obtained original network traffic data to obtain network traffic characteristic data;
performing data cleaning and attribute splitting on the acquired network flow characteristic data, and splitting the acquired network flow characteristic data into numerical data and non-numerical data;
inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data;
inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data;
the establishment and training of the sparse self-encoder specifically comprise:
establishing a sparse autoencoder, and, based on the TCPIP2Vec algorithm of the sparse autoencoder, adopting a cross-entropy loss function J_S(W, b) to which a KL-divergence sparsity penalty term is added to train the sparse autoencoder:

J_S(W, b) = J(W, b) + β · Σ_{j=1}^{s} KL( ρ ‖ ρ̂_j )

where J(W, b) is the cross-entropy reconstruction loss; Σ_{j=1}^{s} KL( ρ ‖ ρ̂_j ) is the KL-divergence penalty applied to the coding function; ρ is the sparsity parameter, β is the regularization parameter, and s is the number of hidden units of the coding layer;
ρ̂_j is the average activation value of the j-th hidden unit of the coding layer, calculated as follows:

ρ̂_j = (1/n) · Σ_{i=1}^{n} f_j(x_i)

where n is the number of samples in a training batch; f(x_i) is the coding function applied to the i-th sample x_i, and f_j(·) is its j-th component;
KL is the KL divergence, a measure of the similarity between two probability distributions, calculated as follows:

KL( ρ ‖ ρ̂_j ) = Σ_t ρ(t) · log( ρ(t) / ρ̂_j(t) )

where ρ(t) is the sparse parameter distribution, i.e., the Bernoulli distribution with parameter ρ, and ρ̂_j(t) is the corresponding distribution with parameter ρ̂_j;
the input of the sparse self-encoder is a non-numerical characteristic set subjected to single-hot encoding; the output of the sparse self-encoder is a recoded non-numerical characteristic set, namely coded non-numerical data.
CN201910739001.0A 2019-08-12 2019-08-12 Network traffic data analysis method and system Active CN112398779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910739001.0A CN112398779B (en) 2019-08-12 2019-08-12 Network traffic data analysis method and system


Publications (2)

Publication Number Publication Date
CN112398779A CN112398779A (en) 2021-02-23
CN112398779B true CN112398779B (en) 2022-11-01

Family

ID=74602164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910739001.0A Active CN112398779B (en) 2019-08-12 2019-08-12 Network traffic data analysis method and system

Country Status (1)

Country Link
CN (1) CN112398779B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11843623B2 (en) * 2021-03-16 2023-12-12 Mitsubishi Electric Research Laboratories, Inc. Apparatus and method for anomaly detection
CN113158174B (en) * 2021-04-06 2022-06-21 上海交通大学 Automatic search system of grouping cipher actual key information based on graph theory
CN113067754B (en) * 2021-04-13 2022-04-26 南京航空航天大学 Semi-supervised time series anomaly detection method and system
CN113179264B (en) * 2021-04-26 2022-04-12 哈尔滨工业大学 Attack detection method for data transmission in networked control system
CN113392412B (en) * 2021-05-11 2022-05-24 杭州趣链科技有限公司 Data receiving method, data sending method and electronic equipment
CN113364752B (en) * 2021-05-27 2023-04-18 鹏城实验室 Flow abnormity detection method, detection equipment and computer readable storage medium
CN113469247B (en) * 2021-06-30 2022-04-01 广州天懋信息系统股份有限公司 Network asset abnormity detection method
CN113409092B (en) * 2021-07-12 2024-03-26 上海明略人工智能(集团)有限公司 Abnormal feature information extraction method, system, electronic equipment and medium
CN113569944A (en) * 2021-07-26 2021-10-29 北京奇艺世纪科技有限公司 Abnormal user identification method and device, electronic equipment and storage medium
CN113452581B (en) * 2021-08-30 2021-12-14 上海观安信息技术股份有限公司 Method and device for extracting characteristics of streaming data, storage medium and computer equipment
CN114079579B (en) * 2021-10-21 2024-03-15 北京天融信网络安全技术有限公司 Malicious encryption traffic detection method and device
CN113965384B (en) * 2021-10-22 2023-11-03 上海观安信息技术股份有限公司 Network security anomaly detection method, device and computer storage medium
CN113992419B (en) * 2021-10-29 2023-09-01 上海交通大学 System and method for detecting and processing abnormal behaviors of user
CN114189353A (en) * 2021-11-05 2022-03-15 西安理工大学 Network security risk prediction method based on railway dispatching set system
CN114039781B (en) * 2021-11-10 2023-02-03 湖南大学 Slow denial of service attack detection method based on reconstruction abnormity
CN114257517B (en) * 2021-11-22 2022-11-29 中国科学院计算技术研究所 Method for generating training set for detecting state of network node
CN114301629A (en) * 2021-11-26 2022-04-08 北京六方云信息技术有限公司 IP detection method, device, terminal equipment and storage medium
CN114900835A (en) * 2022-04-20 2022-08-12 广州爱浦路网络技术有限公司 Malicious traffic intelligent detection method and device and storage medium
CN114513374B (en) * 2022-04-21 2022-07-12 浙江御安信息技术有限公司 Network security threat identification method and system based on artificial intelligence
CN114826764B (en) * 2022-05-17 2023-07-18 广西科技大学 Edge computing network attack recognition method and system based on ensemble learning
CN114785617B (en) * 2022-06-15 2022-11-15 北京金汇创企业管理有限公司 5G network application layer anomaly detection method and system
CN114785623A (en) * 2022-06-21 2022-07-22 南京信息工程大学 Network intrusion detection method and device based on discretization characteristic energy system
WO2024065185A1 (en) * 2022-09-27 2024-04-04 西门子股份公司 Device classification method and apparatus, electronic device, and computer-readable storage medium
CN115396235B (en) * 2022-10-25 2023-01-13 北京天云海数技术有限公司 Network attacker identification method and system based on hacker portrait
CN115720177B (en) * 2023-01-10 2023-04-14 北京金睛云华科技有限公司 Network intrusion detection method, device and equipment
CN116319005A (en) * 2023-03-21 2023-06-23 上海安博通信息科技有限公司 Attack detection method, device and processing system combined with natural language processing model
CN116561689B (en) * 2023-05-10 2023-11-14 盐城工学院 High-dimensional data anomaly detection method
CN116633543B (en) * 2023-07-21 2023-09-15 沈阳航盛科技有限责任公司 1553B communication protocol data encryption method
CN116805926B (en) * 2023-08-21 2023-11-17 上海飞旗网络技术股份有限公司 Network service type identification model training method and network service type identification method
CN116996869B (en) * 2023-09-26 2023-12-29 济南正大科技发展有限公司 Network abnormal data processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106060043A (en) * 2016-05-31 2016-10-26 北京邮电大学 Abnormal flow detection method and device
CN108093406A (en) * 2017-11-29 2018-05-29 重庆邮电大学 A kind of wireless sense network intrusion detection method based on integrated study
CN108540451A (en) * 2018-03-13 2018-09-14 北京理工大学 A method of classification and Detection being carried out to attack with machine learning techniques
CN108632279A (en) * 2018-05-08 2018-10-09 北京理工大学 A kind of multilayer method for detecting abnormality based on network flow
CN109977151A (en) * 2019-03-28 2019-07-05 北京九章云极科技有限公司 A kind of data analysing method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160095856A (en) * 2015-02-04 2016-08-12 한국전자통신연구원 System and method for detecting intrusion intelligently based on automatic detection of new attack type and update of attack type


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于集成分类器的恶意网络流量检测;汪洁等;《通信学报》;20181025(第10期);全文 *

Also Published As

Publication number Publication date
CN112398779A (en) 2021-02-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant