CN112398779B - Network traffic data analysis method and system - Google Patents

Network traffic data analysis method and system

Info

Publication number
CN112398779B
Authority
CN
China
Prior art keywords: data, numerical, abnormal, network flow, characteristic
Prior art date
Legal status
Active
Application number
CN201910739001.0A
Other languages
Chinese (zh)
Other versions
CN112398779A
Inventor
方少峰
孙鹏科
闫振中
郑岩
马福利
佟继周
Current Assignee
National Space Science Center of CAS
Original Assignee
National Space Science Center of CAS
Priority date
Filing date
Publication date
Application filed by National Space Science Center of CAS
Priority to CN201910739001.0A
Publication of CN112398779A
Application granted
Publication of CN112398779B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network security for detecting or protecting against malicious traffic
    • H04L63/1408: Detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network

Abstract

The invention belongs to the technical field of network traffic data analysis, and particularly relates to an anomaly detection method of network traffic data, which comprises the following steps: processing the original network flow data captured in real time to obtain network flow data; if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data; if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal; if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data to be the unknown attack type; if the network flow data is not abnormal data, the output is normal.

Description

Network traffic data analysis method and system
Technical Field
The invention belongs to the technical field of anomaly detection technology and network traffic data analysis based on machine learning and big data, and particularly relates to a network traffic data analysis method and system, in particular to a network traffic data analysis method and system based on a sparse self-encoder and an extreme random tree.
Background
In recent decades, with the rapid development of the Internet, from consumer interconnection and industrial interconnection to the interconnection of everything, the ways in which people communicate and consume have been reshaped again and again. Network security problems are becoming more and more troublesome: network attacks emerge endlessly, and traditional defense means are often helpless when facing new attack modes. From the basic data link layer to the network and transport layers, and up to the higher presentation and application layers, attack techniques are complex and constantly evolving, and a single countermeasure is often not enough. Take for example distributed denial of service (DDoS) attacks, which keep growing in scale: they include both traditional network-layer DDoS attacks that exploit characteristics of the TCP/IP protocol and application-layer DDoS attacks built on top of them, which at the application layer can be further classified into DNS-Flood attacks, slow connection attacks and CC attacks.
Various network security technologies have been developed to ensure network security, among which network traffic analysis and intrusion detection play a very important role: they detect anomalies in network traffic and then raise early warnings. There are many current methods for network flow data analysis and intrusion detection, mainly including traditional methods based on feature (signature) libraries, statistical methods based on probabilities and rules, supervised learning methods based on classification, unsupervised learning based on clustering, and derived techniques based on outlier detection. On these bases researchers have built many network anomaly detection systems with good experimental results, but when such systems are put into practical use, many problems are often found. In short, in the field of anomaly detection, anomalies can be classified into point anomalies, contextual (condition) anomalies and collective (pattern) anomalies; different detection systems are effective for most point anomalies, but for contextual and collective anomalies a trade-off between accuracy and false alarm rate often has to be made.
For example, an anomaly detection system based on a conventional feature library needs to frequently update the feature library, and is completely unable to detect unknown anomalies or simply encrypted data streams, and the cost for maintaining and updating the feature library is very high; when the network flow becomes complex, the abnormal network flow data is often judged to be normal by a statistical method based on probability and rules, and is easy to be utilized by attackers;
unsupervised detection systems based on clustering and outliers often face the problems of extremely low algorithm training speed, difficult parameter selection and high false alarm rate when the data scale and feature dimensions are high.
The interconnection of networks and the arrival of the big data era have made the scale of network data grow exponentially, with a huge amount of flow data generated every day. Although most of this data is normal traffic, the scale and variety of abnormal traffic is also continuously increasing, which brings new opportunities and challenges to the network security field. In the big data era, as the features of network flow data become richer and richer, existing anomaly detection systems built on the traditional KDD-CUP99 or NSL-KDD data sets suffer from poor reliability, low accuracy and high false alarm rates. Whether for network intrusion analysis or for analysis of threats inside an enterprise, existing methods face various problems; a detection and defense system depends heavily on efficient analysis of network flow data, so establishing effective feature preprocessing, feature selection and feature extraction tools for the new data characteristics is a very important problem.
Disclosure of Invention
The invention aims to solve the problems of poor generalization performance, low detection rate and high false alarm rate of the conventional network traffic data analysis method in the network security anomaly detection technology, and provides a network traffic data analysis method based on a sparse self-encoder and an extreme random tree, which can select corresponding automatic encoding means for different types of features from network flow data, not only can play a role in reducing feature dimensionality, but also can effectively calculate the distance of features such as IP addresses, protocols and the like, thereby providing a foundation for the conventional anomaly detection technology based on distance or density; then, for the processed numerical characteristics, a characteristic selection method based on an extreme random tree is adopted, so that the dimension can be reduced, and the selected characteristics can still have practical significance, thereby providing possibility for subsequent analysis; the extracted feature set can be combined with a supervised classification technology and an unsupervised outlier anomaly detection technology, experiments show that the accuracy is effectively improved, the false alarm rate is greatly reduced, and the calculation speed is much faster due to the fact that data are recoded and feature engineering processing is carried out.
In order to achieve the above object, the present invention provides an anomaly detection method for network traffic data, including:
processing the original network flow data captured in real time to obtain network flow data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data as the unknown attack type;
if the network flow data is not abnormal data, the output is normal.
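For readability, the two-stage detection flow described above can be sketched in Python as follows; every function and variable name here (analyze_flow, preprocess, supervised_detector, first_classifier, unsupervised_detector) is an illustrative assumption rather than an identifier from the patent.

```python
# Minimal sketch of the two-stage detection flow; all names below are
# hypothetical placeholders, not identifiers defined by the patent.

def analyze_flow(raw_record, preprocess, supervised_detector,
                 first_classifier, unsupervised_detector):
    """Return 'normal', a known attack type, or 'unknown attack' for one flow."""
    flow = preprocess(raw_record)            # feature extraction, encoding, selection

    if supervised_detector(flow):            # stage 1: supervised anomaly check
        return first_classifier(flow)        # known attack type
    if unsupervised_detector(flow):          # stage 2: unsupervised anomaly check
        return "unknown attack"              # suspicious flow of an unseen type
    return "normal"
```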
As one improvement of the above technical solution, the processing of the original network traffic data captured in real time to obtain network flow data specifically comprises the following steps:
capturing original network flow data in real time;
extracting available data characteristics from the obtained original network flow data to obtain network flow characteristic data;
performing data cleaning and attribute splitting on the acquired network flow characteristic data, and splitting the acquired network flow characteristic data into numerical data and non-numerical data;
inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data;
inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
and carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data.
As one improvement of the above technical solution, the non-numerical data is input to a pre-trained sparse self-encoder for re-encoding, and encoded non-numerical data is obtained; the method specifically comprises the following steps:
dividing according to the attribute label set, performing attribute splitting on the non-numerical data, and acquiring a non-numerical feature set from the non-numerical data;
carrying out one-hot coding on the non-numerical characteristic set to obtain a one-hot coded non-numerical characteristic set, inputting it to a pre-trained sparse self-encoder, and obtaining the encoder extracted from the sparse self-encoder;
and adopting a TCPIP2Vec algorithm based on a sparse self-encoder to re-encode the one-hot code of the non-numerical characteristic set to obtain encoded non-numerical data.
As one of the improvements of the above technical solution, the establishing and training of the sparse autoencoder specifically includes:
establishing a sparse autoencoder and, based on the TCPIP2Vec algorithm of the sparse autoencoder, training it with a cross entropy loss function J_S(W, b) to which a KL divergence sparsity penalty term is added:

J_S(W, b) = J(W, b) + β·Σ_j KL(ρ ‖ ρ̂_j)

wherein J(W, b) is the cross entropy reconstruction loss and Σ_j KL(ρ ‖ ρ̂_j) is the KL divergence penalty applied to the coding function; ρ is the sparsity parameter and β is the regularization parameter; ρ̂_j is the average activation value of the j-th hidden unit of the coding layer, calculated as follows:

ρ̂_j = (1/n)·Σ_{i=1}^{n} f_j(x_i)

wherein n is the number of block samples set during training, f(x_i) is the coding function and x_i is the i-th sample;

KL is the KL divergence, a measure for comparing the similarity between two probability distributions, calculated as follows:

KL(ρ ‖ ρ̂_j) = ρ·log(ρ/ρ̂_j) + (1 − ρ)·log((1 − ρ)/(1 − ρ̂_j))

wherein ρ(t) denotes the sparse parameter function;
the input of the sparse self-encoder is a non-numerical characteristic set subjected to one-hot encoding; the output of the sparse self-encoder is a recoded non-numerical characteristic set, namely coded non-numerical data.
As one improvement of the above technical solution, the numerical data is input to a pre-established extreme random tree model, and the importance of the numerical data is sorted and screened in a descending order to obtain the screened numerical data; the method specifically comprises the following steps:
dividing according to the attribute numbers, and performing attribute splitting on the numerical data to obtain a split numerical characteristic set;
inputting the split numerical characteristic set into a pre-established extreme random tree model, and arranging each numerical characteristic in the split numerical characteristic set in a descending order from large to small according to importance to obtain a sorted numerical characteristic set;
and screening the sorted numerical feature set according to a preset threshold value, retaining the numerical features whose importance factors are greater than the preset threshold value, and recording them as the screened numerical data.
As an improvement of the above technical solution, the establishing process of the extreme random tree-based feature selection model specifically includes:
randomly selecting numerical characteristics in the split numerical characteristic set to construct a plurality of decision trees;
wherein, the construction process of each decision tree is as follows:
the importance factor of each numerical feature is obtained according to the following calculation formula:

G(D, A) = (H(D) − H(D | A)) / H_A(D)

wherein G(D, A) is the importance factor of the numerical characteristic A relative to the numerical characteristic set D to be divided, namely the information gain ratio; D is the data set to be divided; A is the currently selected numerical characteristic; H_A(D) is the information entropy obtained by taking the currently selected numerical characteristic A as a random variable; H(D) is the information entropy of set D with the data class as a random variable; H(D | A) is the conditional information entropy of the subsets obtained after the set D is divided using the feature A;
when each decision tree is constructed, k numerical features are randomly selected from the K numerical features, wherein K is the total dimension of the numerical features and k is the feature dimension set for constructing each decision tree; the value of k is set to be less than K, a common choice being k = √K;
When each decision tree is constructed, selecting the numerical characteristic with the largest information gain ratio G (D, A) from the k selected numerical characteristics, then constructing nodes and splitting;
when the nodes of the decision tree are split, randomly selecting an arbitrary number between the maximum value and the minimum value of the numerical characteristic, and recording the arbitrary number as a comparison value; when the numerical characteristic of the sample is greater than the comparison value, taking the sample as a left branch; when the numerical characteristic of the sample is smaller than the comparison value, the sample is taken as a right branch, and then the bifurcation value of the numerical characteristic of the sample is calculated; wherein, the sample is a split numerical characteristic set;
traversing the selected k numerical characteristics to construct a decision tree;
repeating the process of constructing the basic decision tree for N times to construct N decision trees; wherein, the number of the decision tree is determined by using a cross validation and grid search method;
judging each numerical characteristic in the split numerical characteristic set by utilizing a plurality of decision trees, specifically judging whether the original network flow data corresponding to the numerical characteristic is normal data or abnormal data by each decision tree of the plurality of decision trees, summarizing the judgment result of each decision tree by a voting method, and taking the result of the majority of the judgment result as the final judgment result; wherein, the judgment result is that the original network flow data corresponding to the numerical characteristic is normal data or abnormal data;
obtaining the importance factor of each numerical characteristic in the split numerical characteristic set according to the finally obtained judgment results and the above formula, sorting the numerical characteristics in descending order of importance, screening them according to a set threshold value, retaining the numerical characteristics whose importance factors are greater than the preset threshold value, and recording them as the screened numerical data;
the input of the extreme random tree model is a split numerical characteristic set, and the output of the extreme random tree model is screened numerical data.
As one improvement of the above technical solution, the method extracts available data features from the acquired original network traffic data to acquire network traffic feature data; the method specifically comprises the following steps:
extracting a first feature from the acquired raw network traffic data by using an Argus tool, wherein the first feature comprises: a source IP address, a source port number, a destination IP address, a destination port number, and a transport protocol type;
extracting second features from the acquired raw network traffic data using a Bro-IDS tool, the second features comprising: counting from a source IP to a target IP packet, counting from the target IP to the source IP packet, an application layer protocol type, a source IP transmission digit per second and a target IP transmission digit per second;
among the available data features are: the first feature, the second feature, and other features extracted from the protocol header file include: a value of a source TCP advertisement window size, a value of a target TCP advertisement window size, a sequence number of the source TCP, a sequence number of the target TCP, an average of a size of a stream packet transmitted by the source, an average of a size of a stream packet transmitted by the target, a pipe depth of an http request/response connection, an actual size of data transmitted from an http service of the server without compression, a source jitter time (millisecond), a target jitter time (millisecond), a source packet interval arrival time, a target packet interval arrival time, a number of "syn" and "ack" in the TCP connection, an interval time of syn and syn _ ack in the TCP connection, an interval time of syn _ ack packet and ack packet in the TCP connection.
As an improvement of the above technical solution, the establishing and training of the first anomaly classifier specifically includes:
adopting an AdaBoost supervised classification ensemble algorithm to construct an anomaly detection function:

G(x) = sign( Σ_{m=1}^{M} α_m·G_m(x) )

wherein G(x) is the first anomaly classifier; G_m(x) is a decision tree weak classifier, with m = 1, 2, …, 30; α_m is the classifier weight coefficient corresponding to G_m(x);

training the first anomaly classifier according to the anomaly detection function, wherein the input of the first anomaly classifier is {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_n, y_n)}, in which x_i is each piece of processed network flow data, x_i ∈ R^n; y_i is the corresponding label, y_i ∈ {0, 1}, where 0 denotes normal and 1 denotes abnormal; and the output is the discrimination result of the trained first anomaly classifier on each network flow feature.
As an improvement of the above technical solution, the establishing and training of the second anomaly classifier specifically includes:
an unsupervised anomaly detection function is constructed by reconstructing an error function using an auto-encoder:
L(x_p, x_r) = ||x_p − x_r||^2 = ||x_p − F(G(x_p))||^2

wherein L(x_p, x_r) is the reconstruction error function, F and G denote the decoding and encoding functions of the self-encoder respectively, x_p is the raw network traffic data to be detected, and x_r is the data reconstructed from x_p by the self-encoder;
training a second anomaly classifier according to the unsupervised anomaly detection function, and acquiring a reconstruction error threshold according to a reconstruction error function;
during detection, if the reconstruction error of one piece of network flow data is greater than the reconstruction error threshold value, the piece of network flow data is judged to be abnormal data;
if the reconstruction error of one piece of network flow data is less than or equal to the reconstruction error threshold value, judging that the data is normal data;
the input of the second anomaly classifier is the network flow data x_p that the first anomaly classifier has judged to be normal data; its output is the discrimination result of the trained second anomaly classifier on x_p.
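As a concrete illustration of the reconstruction-error criterion used by the second anomaly classifier, here is a minimal NumPy sketch; the encode/decode callables and the percentile-based threshold choice are assumptions for illustration, not details taken from the patent.

```python
import numpy as np

def reconstruction_errors(X, encode, decode):
    """L(x_p, x_r) = ||x_p - x_r||^2 per row; encode/decode stand in for the
    trained autoencoder's G and F."""
    X_rec = decode(encode(X))
    return np.sum((X - X_rec) ** 2, axis=1)

def detect_unknown_anomalies(X, encode, decode, threshold):
    """Mark rows whose reconstruction error exceeds the threshold as unknown attacks."""
    return reconstruction_errors(X, encode, decode) > threshold

# One plausible way to pick the threshold (an assumption, not the patent's rule):
# a high percentile of the errors measured on traffic judged normal so far, e.g.
# threshold = np.percentile(reconstruction_errors(X_normal, encode, decode), 99)
```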
Based on the network traffic data analysis method, the invention also provides a network traffic data analysis system, which comprises: an original data acquisition module, a data preprocessing module, a data feature extraction module, a data anomaly detection module and a data result display module; wherein:
the original data acquisition module is used for capturing original network flow data in real time;
the data preprocessing module is used for extracting available data characteristics from the acquired original network traffic data and acquiring network traffic characteristic data;
the data characteristic extraction module is used for carrying out data cleaning and attribute splitting on the acquired network flow characteristic data and splitting the acquired network flow characteristic data into numerical data and non-numerical data; inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data; inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
the data anomaly detection module is used for carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data; detecting whether the network flow data is abnormal data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data to be the unknown attack type;
if the network flow data is not abnormal data, the output is normal.
And the data result display module is used for displaying the detection result of detecting whether the network flow data is abnormal data.
Compared with the prior art, the invention has the beneficial effects that:
1. the method of the invention greatly reduces the false alarm rate of abnormal data, greatly reduces manual processing, after the processing of the technical scheme, the false alarm rate of most abnormal recognizers based on classification is reduced to below 10 percent, and because the data feature extraction module carries out effective IP recoding, feature extraction and feature selection on the data, the model has better robustness on the selection of the classifier and better robustness on the selection of the parameters of the classifier;
2. the method greatly improves the recall rate of the abnormal data, the abnormal data and the normal data are more obviously distinguished after being processed by a sparse self-encoder and an extreme random tree, and the recall rate is improved to more than 90 percent after model tuning;
3. the method of the invention improves the detection rate of abnormal data, and because the data feature extraction module is used in the training process, the traditional abnormal detection algorithm, such as a decision tree model, a Bayesian classifier model, a self-encoder model, a random forest model and the like, can directly obtain information related to the abnormality from effective features during detection, and the feature dimension of each piece of network flow data is reduced by one order of magnitude, thus greatly improving the detection efficiency.
Drawings
FIG. 1 is a schematic structural diagram of a network traffic data analysis system based on a sparse self-encoder and an extreme random tree according to the present invention;
fig. 2 is a detailed flowchart of step 2) of the method for detecting an anomaly of network traffic data according to the present invention;
FIG. 3 is a detailed flowchart of step 3) of the method for detecting anomaly of network traffic data according to the present invention;
fig. 4 is a specific flowchart of step 4) of the method for detecting the anomaly of the network traffic data according to the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
The invention provides an anomaly detection method of network traffic data, which comprises the following steps:
step 1), capturing original network flow data in real time;
specifically, an open source tool is adopted to capture original network flow data from a network environment in real time and store the original network flow data into a file in a PCAP format; in this embodiment, a TCPDUMP tool is mainly used to capture original network traffic data from a network environment in real time;
step 2) extracting available data characteristics from the obtained original network traffic data to obtain network traffic characteristic data;
specifically, as shown in fig. 2, an Argus tool is used to extract a first feature from the acquired raw network traffic data, where the first feature includes: a source IP address, a source port number, a destination IP address, a destination port number, and a transport protocol type;
extracting second features from the acquired raw network traffic data using a Bro-IDS tool, the second features comprising: counting from a source IP to a target IP packet, counting from the target IP to the source IP packet, an application layer protocol type, a source IP transmission digit per second and a target IP transmission digit per second;
and other characteristics extracted from the protocol header file, such as the value of the source TCP advertised window size, the value of the target TCP advertised window size, the sequence number of the source TCP, the sequence number of the target TCP, the average size of the stream packets transmitted by the source, the average size of the stream packets transmitted by the target, the pipe depth of the http request/response connection, the actual uncompressed size of the data transmitted from the server's http service, the source jitter time (milliseconds), the target jitter time (milliseconds), the source packet inter-arrival time, the target packet inter-arrival time, the number of "syn" (synchronization sequence numbers) and "ack" (acknowledgement characters) in the TCP connection, the interval time between syn and syn_ack in the TCP connection, and the interval time between syn_ack and ack packets in the TCP connection.

Among the available data features are: the first and second features, and other features extracted from the protocol header file, namely the source IP address, source port number, destination IP address, destination port number, transport protocol type, source IP to destination IP packet count, destination IP to source IP packet count, application layer protocol type, source IP transmission bits per second, destination IP transmission bits per second, and the value of the source TCP advertised window size, the value of the destination TCP advertised window size, the sequence number of the source TCP, the sequence number of the destination TCP, the mean size of the stream packets transmitted by the source, the mean size of the stream packets transmitted by the destination, the pipe depth of the http request/response connection, the actual uncompressed size of data transmitted from the http service of the server, the source jitter time (milliseconds), the destination jitter time (milliseconds), the source packet inter-arrival time, the destination packet inter-arrival time, the number of "syn" and "ack" in the TCP connection, the interval time between syn and syn_ack in the TCP connection, and the interval time between syn_ack and ack packets in the TCP connection;

Taking the UNSW_NB15 data set as an example, an original piece of network traffic data obtained in this way is as follows: "59.166.0.0,1390,149.171.126.6,53,udp,CON,0.001055,132,164,31,29,0,0,dns,500473.9375,621800.9375,2,2,0,0,0,0,66,82,0,0,0,0,1421927414,1421927414,0.017,0.013,0,0,0,0,0,0,0,0,3,7,1,3,1,1,1, 0";
This original network flow data record comprises 47 features in total, which are, in order:
“srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,service,Sload,Dload,Spkts,Dpkts,swin,dwin,stcpb,dtcpb,smeansz,dmeansz,trans_depth,res_bdy_len,Sjit,Djit,Stime,Ltime,Sintpkt,Dintpkt,tcprtt,synack,ackdat,is_sm_ips_ports,ct_state_ttl,ct_flw_http_mthd,is_ftp_login,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm”;
wherein the piece of raw network traffic data includes the available data characteristics of one network flow, comprising: the source IP address, source port number, destination IP address, destination port number, transport protocol type, source IP to destination IP packet count, destination IP to source IP packet count, application layer protocol type, source IP transmission bits per second, destination IP transmission bits per second, and other characteristics extracted from the protocol header file, such as the value of the source TCP advertised window size, the value of the destination TCP advertised window size, the sequence number of the source TCP, the sequence number of the destination TCP, the average size of the stream packets transmitted by the source, the average size of the stream packets transmitted by the destination, the pipe depth of the http request/response connection, the actual uncompressed size of data transmitted from the http service of the server, the source jitter time (milliseconds), the destination jitter time (milliseconds), the source packet inter-arrival time, the destination packet inter-arrival time, the number of "SYN" and "ACK" in the TCP connection, the interval time between SYN and SYN_ACK in the TCP connection, and the interval time between SYN_ACK and ACK packets in the TCP connection;
step 3) carrying out data cleaning on the acquired network flow characteristic data, carrying out attribute splitting on the cleaned network flow characteristic data, and splitting the network flow characteristic data into numerical data and non-numerical data;
specifically, as shown in fig. 3, step 3) specifically includes:
step 3-1) data cleaning is carried out on the network flow characteristic data, data without recording labels are removed, and the specific data cleaning process comprises the following steps: recording normalization, missing value processing and NAN value processing;
step 3-2) attribute splitting is carried out on the cleaned network flow characteristic data, and the network flow characteristic data are split into numerical data and non-numerical data according to different characteristics of each characteristic attribute;
step 4) as shown in fig. 3, inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding, and acquiring encoded non-numerical data;
specifically, the encoded non-numerical data comprises 5 features: the source IP address, the target IP address, the transmission protocol type, the protocol state type and the network service type;
specifically, the non-numerical data is one-hot encoded, giving a feature dimensionality of 294, and a sparse self-encoder is constructed (the layer-by-layer network structure of the self-encoder is given in a table that is not reproduced here); the dimension of the coding layer is set to 20. The constructed sparse self-encoder is trained with the one-hot encoded non-numerical data, a cross entropy loss function with an added KL divergence sparsity penalty term is selected as the loss function, the training optimizer uses the Adam optimization algorithm, training converges after about 20 rounds, and the encoder is saved. The input is the one-hot encoded non-numerical characteristic set; the output is the encoding part of the sparse self-encoder, i.e. the encoder extracted from the trained network, which maps the 294-dimensional one-hot input to the 20-dimensional code.
the step 4) specifically comprises the following steps:
step 4-1) dividing according to the attribute label set, and extracting a non-numerical characteristic set from non-numerical data;
step 4-2) carrying out one-hot coding on the non-numerical characteristic set extracted in the step 4-1);
step 4-3), constructing and training a sparse self-encoder, taking the one-hot encoded non-numerical characteristic set as the input of the sparse self-encoder; its output is the encoder extracted from the sparse self-encoder;
step 4-4) drawing on the Word2Vec algorithm from natural language processing, the TCPIP2Vec algorithm based on the sparse self-encoder uses the encoder extracted in step 4-3) to re-encode the one-hot codes of the non-numerical characteristic set, obtaining encoded non-numerical data, namely the re-encoded non-numerical characteristic set; this is used for numerical similarity calculation, for example calculating the similarity of IP addresses, while the feature dimension of the data attributes is greatly reduced compared with the one-hot code.
Wherein, in the step 4-3), constructing and training the sparse self-encoder specifically comprises:
first, a symmetric neural network H_{W,b} with weight parameters W and bias parameters b is constructed as the sparse self-encoder:

H_{W,b}(X) = g(f(X))

wherein f(X) is the coding function and g(X) is the decoding function; the two functions are approximated by constructing a neural network whose weight parameters are collectively W and whose bias parameters are collectively b;

to achieve sparse coding, a cross entropy loss function is adopted to train the neural network H_{W,b}; the cross entropy loss function with the sparsity penalty is:

J_S(W, b) = J(W, b) + β·Σ_j KL(ρ ‖ ρ̂_j)

wherein J_S(W, b) is the cross entropy loss function, chosen because the input is the one-hot encoded non-numerical characteristic set; Σ_j KL(ρ ‖ ρ̂_j) is the KL divergence penalty applied to the coding function; ρ is the sparsity parameter and β is the regularization parameter; ρ̂_j is the average activation value of the j-th hidden unit of the coding layer of the sparse self-encoder, calculated as follows:

ρ̂_j = (1/n)·Σ_{i=1}^{n} f_j(x_i)

wherein n is the number of block samples set during training, f(x_i) is the coding function and x_i is the i-th network traffic data sample;

KL is the KL divergence, a measure for comparing the similarity between two probability distributions, calculated as follows:

KL(ρ ‖ ρ̂_j) = ρ·log(ρ/ρ̂_j) + (1 − ρ)·log((1 − ρ)/(1 − ρ̂_j))

wherein ρ(t) is the sparse parameter function and ρ̂_j is the average activation value of the j-th hidden unit of the coding layer of the sparse self-encoder;

the sparse self-encoder is trained with this cross entropy loss function; its input is the one-hot encoded non-numerical characteristic set, recorded as:

X = [x_1, …, x_n]

wherein the feature dimension is n, X is the one-hot encoded non-numerical characteristic set and x_n is its n-th component;

the output of the sparse self-encoder is the encoder extracted from the sparse self-encoder.

Besides imposing a KL divergence penalty, other sparsity penalties can be used, such as an absolute value penalty or an L1 norm penalty. The training process of the sparse self-encoder is the same as that of an ordinary neural network: gradients are computed and backpropagation is used, specifically with the common Adam optimization algorithm or classical stochastic gradient descent.
Step 5) inputting the numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain the screened numerical data as shown in FIG. 3;
wherein the numerical data comprises 42 numerical features in total, including: the source port number, target port number, total duration of the record, number of bytes from the source IP to the target IP, number of bytes from the target IP to the source IP, time to live from the source IP to the target IP, time to live from the target IP to the source IP, number of retransmitted or dropped source packets, number of retransmitted or dropped target packets, number of bits transmitted per second by the source IP, number of bits transmitted per second by the target IP, number of packets from the source IP to the target IP, number of packets from the target IP to the source IP, value of the source TCP advertised window size, value of the target TCP advertised window size, sequence number of the source TCP, sequence number of the target TCP, average size of the stream packets transmitted by the source, average size of the stream packets transmitted by the target, pipe depth of the http request/response connection, actual uncompressed size of data transmitted from the http service of the server, source jitter time (milliseconds), target jitter time (milliseconds), record start time, record last time, source packet inter-arrival time, target packet inter-arrival time, the TCP connection setup round-trip time and the interval times of syn, syn_ack and ack packets in the TCP connection, and count features such as the number of flows containing http methods like GET and POST, the number of ftp login sessions and ftp commands, and counts of connections sharing the same source or destination address, port or service.
Specifically, the remaining numerical data includes a plurality of redundant information in addition to non-numerical data, a decision tree algorithm is adopted based on an extreme random tree embedding method, numerical data including 42 numerical features are subjected to cross validation, an information gain ratio is set as a numerical feature selection method, the number of estimators is set to be 50 most appropriate, and after the extreme random tree model is established, screened numerical data are output.
After the numerical characteristics are sorted in the descending order of importance, five attribute characteristics with the top importance ranking are selected by combining the subsequent abnormal detection process, namely the survival time from the source IP to the target IP, the survival time from the target IP to the source IP, the specific range of the survival time of the source IP/the target IP for each state, the number of bits transmitted by the target per second and the value of the size of the target TCP notification window.
The step 5) specifically comprises the following steps:
step 5-1) dividing according to the attribute numbers, splitting the attributes of the numerical data, and acquiring a split numerical characteristic set;
step 5-2) inputting the split numerical characteristic set into a pre-established extreme random tree model, and arranging each numerical characteristic in the split numerical characteristic set in a descending order from large to small according to importance to obtain a sorted numerical characteristic set;
and 5-3) screening the sorted numerical feature set according to a preset threshold value to obtain importance factors of each numerical feature in the sorted numerical feature set larger than the preset threshold value, and recording the importance factors as screened numerical data.
The step 5) further comprises the following steps: adopting a recursive feature elimination algorithm, namely an RFE algorithm, sequentially deleting a numerical feature from the screened numerical data, normalizing the remaining screened numerical data and the recoded non-numerical data, and performing anomaly detection to obtain a detection result; and comparing the detection result with the previous detection result without deleting the numerical characteristic, and detecting whether the detection results in the two cases are consistent or not for verifying whether the previous detection result is correct or not.
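A hedged sketch of this recursive feature elimination check using scikit-learn's RFE is shown below; the estimator choice and the names X_num, y and n_keep are assumptions for illustration.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

def rfe_check(X_num, y, n_keep):
    """Recursively drop one feature at a time and return the surviving subset,
    so the detection result can be compared against the run without deletion."""
    selector = RFE(estimator=ExtraTreesClassifier(n_estimators=50, random_state=0),
                   n_features_to_select=n_keep, step=1)
    selector.fit(X_num, y)
    return selector.support_    # boolean mask over the screened numerical features
```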
The step 5) further comprises the following steps: carrying out category splitting on the cleaned network flow characteristic data, and recording a corresponding category number characteristic set; wherein the content of the first and second substances,
if the split network flow characteristic data is provided with a known data label, classifying according to the known class label, recording the data number of the corresponding class, and recording as a known numerical value characteristic set;
if the split network flow characteristic data is provided with an Unknown data label, classifying the split network flow characteristic data into Unknown, recording the data number of the corresponding category as well, and recording as an Unknown numerical value characteristic set;
specifically, in the UNSW _ NB15 data, in addition to normal data, there are 9 common attack types, which are:
fuzzers, attack behavior that halts a program or network by randomly generated data
Analysis, including different port scan attacks, spam and html file penetration
Backdoors, a technique for silently bypassing system security mechanisms and accessing computers and their data
Dos, making network resources unavailable to users by temporarily disrupting or suspending services to hosts connected to the internet
Exploits, the attacker knows the security problem of an operating system or software and uses the vulnerability to attack
Generic, an attack technique that works against block ciphers (given the block and key sizes) regardless of the structure of the block cipher
Reconnaissance, containing all attacks that simulate the gathering of information
Shellcode, a small segment of code for software vulnerability payloads
Worms, where the attacker replicates itself to spread to other computers, self-propagating through the computer network and relying on security failures on the target computer to gain access
And recording a data number set of a corresponding category, and establishing a set for the unknown attack type set so as to store an output result of unsupervised anomaly detection.
As shown in fig. 3, the data after class splitting, the encoded non-numerical data, and the sorted numerical data are normalized and archived.
In the step 5-2), the process of establishing the extreme random tree-based feature selection model specifically includes:
randomly selecting numerical characteristics in the split numerical characteristic set to construct a plurality of decision trees;
wherein, the construction process of each decision tree is as follows:
the importance factor of each numerical characteristic is obtained according to the following calculation formula:

G(D, A) = (H(D) − H(D | A)) / H_A(D)

wherein G(D, A) is the importance factor of the numerical characteristic A relative to the numerical characteristic set D to be divided, namely the information gain ratio; D is the data set to be divided; A is the currently selected numerical characteristic; H_A(D) is the information entropy obtained by taking the currently selected numerical characteristic A as a random variable; H(D) is the information entropy of set D with the data class as a random variable; H(D | A) is the conditional information entropy of the subsets obtained after the set D is divided using the feature A;

when each decision tree is constructed, k numerical features are randomly selected from the K numerical features, wherein K is the total dimension of the numerical features and k is the feature dimension set for constructing each decision tree; the value of k is set to be less than K, a common choice being k = √K;
When each decision tree is constructed, selecting the numerical characteristic with the largest information gain ratio G (D, A) from the k selected numerical characteristics, then constructing nodes and splitting;
when the nodes of the decision tree are split, randomly selecting an arbitrary number between the maximum value and the minimum value of the numerical characteristic, and recording the arbitrary number as a comparison value; when the numerical characteristic of the sample is greater than the comparison value, taking the sample as a left branch; when the numerical characteristic of the sample is smaller than the comparison value, the sample is taken as a right branch, and then the bifurcation value of the numerical characteristic of the sample is calculated; wherein, the sample is a split numerical characteristic set;
traversing the selected k numerical characteristics to construct a decision tree;
repeating the process of constructing the basic decision tree for N times to construct N decision trees; wherein, the number of the decision tree is determined by using a cross validation and grid search method;
judging each numerical characteristic in the split numerical characteristic set by using the plurality of decision trees: specifically, each decision tree judges whether the original network traffic data corresponding to the numerical characteristic is normal data or abnormal data, the judgment results of all decision trees are summarized by a voting method, and the majority result is taken as the final judgment result; for example, if more trees judge the original network traffic data corresponding to the numerical characteristic to be normal data than judge it to be abnormal data, the judgment that the data is normal is taken as the final judgment result; wherein each judgment result states whether the original network flow data corresponding to the numerical characteristic is normal data or abnormal data;
obtaining the importance factor of each numerical characteristic in the split numerical characteristic set according to the finally obtained judgment results and the above formula, sorting the numerical characteristics in descending order of importance, screening them according to a set threshold value, retaining the numerical characteristics whose importance factors are greater than the preset threshold value, and recording them as the screened numerical data;
the input of the extreme random tree model is a split numerical characteristic set, and the output of the extreme random tree model is screened numerical data.
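For illustration, the ranking-and-thresholding step can be sketched with scikit-learn's extremely randomized trees as below; note that scikit-learn's feature_importances_ is an impurity-based importance used here as a stand-in for the information gain ratio G(D, A) defined above, and the 50-tree setting follows the embodiment.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def select_numeric_features(X_num, y, threshold, n_trees=50):
    """Rank numerical features by importance and keep those above the threshold."""
    forest = ExtraTreesClassifier(n_estimators=n_trees, criterion="entropy",
                                  random_state=0)
    forest.fit(X_num, y)                       # y: normal / abnormal labels
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]      # descending order of importance
    keep = [i for i in order if importances[i] > threshold]
    return keep, importances
```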
Step 6) recoding the split non-numerical data using the encoder extracted in step 4-3), so that the re-encoded data has a feature dimensionality of 20; screening the feature attributes of the sorted numerical characteristic set obtained in step 5-2), so that the screened feature dimensionality is 17; merging the processed non-numerical and numerical data and recording the result as X_0; then performing data normalization on X_0 and recording the processed data as X_1; wherein each row of X_0 and X_1 represents one piece of network flow data and each column represents one of the configured features of the processed network flow data; inputting X_1 into the first anomaly classifier or the second anomaly classifier and detecting whether each piece of network flow data is abnormal, specifically executing the following steps:
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data to be the unknown attack type;
if the network flow data is not abnormal data, the output is normal.
As shown in fig. 4, the step 6) specifically includes:
step 6-1) normalizing the merged data X_0 and recording the processed data as X_1;

specifically, each column X_0[i] of X_0 is transformed by subtracting the mean and dividing by σ according to the following transformation function:

X_1[i] = (X_0[i] − μ) / σ

wherein X_0[i] is the i-th column of the encoded network flow data; X_1[i] is the i-th column of the network flow data after normalizing X_0; μ is the mean value of the column vector X_0[i]; σ is the variance of the column vector X_0[i];
in other embodiments, the data may be normalized to the interval [0, 1] by applying the following linear transformation to the re-encoded data:

X_1[i] = (X_0[i] − min) / (max − min)

wherein X_0[i] is the i-th column of the encoded network flow data; X_1[i] is the i-th column of the network flow data after normalizing X_0; min is the minimum value of the column vector X_0[i]; max is the maximum value of the column vector X_0[i];
in other embodiments, X_0[i] can be normalized by subtracting the median of the column vector X_0[i] and dividing by its interquartile range to obtain the processed network flow data X_1[i]; the principle is the same as in the first two methods, but this variant is more robust because outliers are effectively discarded in the calculation.
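The three normalization variants can be written as the following NumPy sketch; the robust variant (median and interquartile range) is an assumption consistent with the description above, since the exact statistics of that embodiment are not spelled out here.

```python
import numpy as np

def zscore(col):
    """X1[i] = (X0[i] - mu) / sigma for one column."""
    return (col - col.mean()) / col.std()

def minmax(col):
    """Scale one column into [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

def robust_scale(col):
    """Assumed robust variant: subtract the median, divide by the interquartile
    range, so isolated outliers barely influence the scaling."""
    q1, q3 = np.percentile(col, [25, 75])
    return (col - np.median(col)) / (q3 - q1)
```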
Step 6-2) inputting network flow data into a pre-trained first abnormal classifier, and detecting whether the input network flow data is abnormal data; outputting a detection result;
specifically, firstly, a supervision anomaly detection method based on classification is adopted to detect whether input network flow data is anomalous data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormality classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data; the abnormal data is network flow data of a known attack type;
if the network flow data is judged not to be abnormal data by the supervised anomaly detection method, inputting the network flow data into a pre-trained second anomaly classifier, and further detecting whether the network flow data is an unknown anomaly by adopting an unsupervised anomaly detection method;
if the network flow data is judged to be abnormal data through an unsupervised anomaly detection algorithm, the abnormal data is marked as an unknown attack type; the abnormal data is suspicious network flow data of unknown attack types;
if the network flow data is not abnormal data, the output is normal.
Wherein the establishing and training of the first anomaly classifier comprises:
adopting the AdaBoost supervised classification algorithm to construct an anomaly detection function:

G(x) = sign( Σ_{m=1}^{M} α_m·G_m(x) )

wherein G(x) is the first anomaly classifier; G_m(x) is a decision tree weak classifier, with m = 1, 2, …, 30; α_m is the classifier weight coefficient corresponding to G_m(x);
training a first anomaly classifier according to an anomaly detection function;
the training process is as follows:
initializing the weight vector of the input data: w is a group ofm=(w11,w12,…w1n),w1i=1/n, m =1 representing the weak classifier currently trained; in the weight vector WmTraining weak classifier G with the objective of minimizing classification error ratem(x) Calculating a classifier weight coefficient according to the classification error rate; then judging whether M is smaller than M, if M is smaller than M<M, updating the weight vector of the next step according to the training result of the previous step, and training the abnormal classifier function G of the (M + 1) th step by taking the classification error rate as the minimum targetm+1(x) Then, updating m to be m +1, otherwise, constructing an abnormal detection function according to the trained weak classifier;
wherein, the update weight vector update formula is as follows:
Figure GDA0003797781890000181
wherein m represents the current step and m +1 represents the next step;
the calculation formula of the classification error rate is as follows:
e_m = Σ_{i=1}^{n} w_{m,i} · I( G_m(x_i) ≠ y_i )

where I(·) is the indicator function;
the classifier weight coefficient calculation formula is as follows:
α_m = (1/2) · ln( (1 − e_m) / e_m )

where Z_m is a normalization coefficient, and its specific calculation formula is as follows:

Z_m = Σ_{i=1}^{n} w_{m,i} · exp( −α_m · y_i · G_m(x_i) )
training the first anomaly classifier according to the anomaly detection function, wherein the input of the first anomaly classifier is {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_n, y_n)}, where x_i is each piece of processed network flow data, i.e., the i-th row of the network flow data X_1 normalized in step 6-1); n is the number of rows of the matrix X_1 and represents the total number of pieces of network flow data; x_i ∈ R^n; y_i is the corresponding label, y_i ∈ {0, 1}, with 0 denoting normal and 1 denoting abnormal; the output is the discrimination result of the trained first anomaly classifier for each network traffic feature item.
The specific process by which the decision-tree weak classifiers G_m(x) are obtained by step-by-step learning is as follows:
step 6-2-1) initializing a weight vector of input data:
W_m = ( w_{m,1}, w_{m,2}, …, w_{m,n} ),  w_{1,i} = 1/n,  i = 1, 2, …, n

where m is initialized to 1; W_m is the weight vector, and each component w_{m,i} corresponds to one row of the network flow training data X_1 and represents the weight of that piece of network flow data;
step 6-2-2) on the basis of the weight vector W_m, the classification error rate function is adopted and G_m(x) is trained with the classification error rate minimized as the objective; the classification error rate function is given by formula (2):

e_m = Σ_{i=1}^{n} w_{m,i} · I( G_m(x_i) ≠ y_i )    (2)

where e_m is the classification error rate; w_{m,i} are the components of the weight vector; I(·) is the indicator function; G_m(x_i) is the discrimination result of the m-th classifier on the i-th piece of network flow data; y_i is the label of the i-th piece of network flow data;
then, according to formula (3), the classifier weight coefficient is calculated:

α_m = (1/2) · ln( (1 − e_m) / e_m )    (3)

where α_m is the classifier weight coefficient;
step 6-2-3) it is then judged whether m is smaller than M; if m < M, the weight vector of the next step is updated according to the training result of the previous step, the anomaly classifier function G_{m+1}(x) of step m+1 is trained with the classification error rate minimized as the objective, and m is updated to m + 1; otherwise, the anomaly detection function is constructed from the trained weak classifiers; specifically,
the weight vector is updated from the classifier weight coefficient α_m according to formula (4):

w_{m+1,i} = ( w_{m,i} / Z_m ) · exp( −α_m · y_i · G_m(x_i) ),  i = 1, 2, …, n    (4)

where w_{m+1,i} are the components of the weight vector W_{m+1} of step m+1; Z_m is the normalization coefficient; w_{m,i} are the components of the step-m weight vector W_m; y_i is the label of the i-th piece of network flow data x_i; G_m(x) is the discrimination function of the m-th classifier;
Z_m is calculated according to formula (5):

Z_m = Σ_{i=1}^{n} w_{m,i} · exp( −α_m · y_i · G_m(x_i) )    (5)
step 6-2-4) the decision-tree weak classifiers G_m(x) are linearly combined with their classifier weight coefficients into the first anomaly classifier, i.e., a strong learner:

G(x) = sign( Σ_{m=1}^{M} α_m · G_m(x) )    (6)

where G_m(x) is a decision-tree weak classifier and α_m is the classifier weight coefficient corresponding to G_m(x).
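A minimal scikit-learn/NumPy sketch of this AdaBoost training loop follows formulas (2)–(6); the {0,1} labels are mapped to {−1,+1} so the exponential weight update applies, and the tree depth, M = 30 rounds and the random toy data are illustrative choices rather than the patented configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_adaboost(X, y, M=30, max_depth=1):
    """AdaBoost training loop: weighted weak learners, error rate e_m, weight alpha_m,
    and the exponential sample-weight update of formulas (2)-(5)."""
    y_pm = np.where(y == 1, 1, -1)                      # map {0,1} labels to {-1,+1}
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                             # step 6-2-1: initial weight vector
    learners, alphas = [], []
    for _ in range(M):
        g = DecisionTreeClassifier(max_depth=max_depth)
        g.fit(X, y_pm, sample_weight=w)                 # step 6-2-2: minimize weighted error
        pred = g.predict(X)
        e_m = np.clip(np.sum(w * (pred != y_pm)), 1e-10, 1 - 1e-10)   # formula (2)
        alpha_m = 0.5 * np.log((1 - e_m) / e_m)         # formula (3)
        w = w * np.exp(-alpha_m * y_pm * pred)          # formula (4)
        w = w / w.sum()                                 # normalization Z_m, formula (5)
        learners.append(g)
        alphas.append(alpha_m)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Strong classifier G(x) = sign(sum_m alpha_m G_m(x)), formula (6); returns {0,1} labels."""
    score = sum(a * g.predict(X) for g, a in zip(learners, alphas))
    return (np.sign(score) > 0).astype(int)

# illustrative usage on random data
X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)
learners, alphas = train_adaboost(X, y)
print(adaboost_predict(learners, alphas, X[:5]))
```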
the establishing and training of the second anomaly classifier comprises the following steps:
an unsupervised anomaly detection function is constructed using the reconstruction error function of an autoencoder:

L(x_p, x_r) = ‖x_p − x_r‖² = ‖x_p − F(G(x_p))‖²

where L(x_p, x_r) is the reconstruction error function; F and G denote the decoding function and the encoding function of the autoencoder, respectively; x_p is the raw network traffic data to be detected; x_r is the data reconstructed from x_p by the autoencoder. Specifically,
step 6-3-1) firstly, extracting the network flow feature data marked as normal from the network flow feature data calibrated in the earlier data division stage, and denoting it X_normal;
step 6-3-2) with X_normal as the autoencoder training data, simulating the encoding function G and the decoding function F with a multilayer perceptron to construct the autoencoder;
step 6-3-3) taking the reconstruction error function as the objective function and training with the Adam optimization algorithm; after training, the error threshold error is calculated with the following formula:

error = (1/n′) · Σ_{x ∈ X_normal} L( x, F(G(x)) )

where X_normal is the network traffic feature data marked as normal; n′ is the number of network traffic feature items used for training the autoencoder, i.e., the number of rows of X_normal; x is a row of X_normal; F and G are the decoding function and the encoding function of the trained autoencoder, respectively; L is the reconstruction error calculated for x;
step 6-3-4) after the autoencoder has been trained and the error threshold error has been calculated, the data that the supervised anomaly detection judged to be normal (i.e., was not marked as abnormal) is taken as x_p and input into the autoencoder to obtain the reconstructed data x_r, and the reconstruction error L(x_p, x_r) is calculated;
step 6-3-5) the reconstruction error L(x_p, x_r) is compared with error: if the reconstruction error is larger than 3 times error, the piece of data is marked as abnormal and an unknown-type anomaly is output; if the reconstruction error is within 3 times error, the data is judged to be normal data and normal is output.
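A small scikit-learn sketch of steps 6-3-1) to 6-3-5) is shown below; MLPRegressor (trained with the Adam solver) stands in for the multilayer-perceptron autoencoder, the error threshold is taken as the mean training reconstruction error as in the formula above, and the layer sizes, iteration count and random toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_autoencoder(X_normal, code_dim=8):
    """Train an MLP autoencoder on traffic marked normal and return the error threshold."""
    ae = MLPRegressor(hidden_layer_sizes=(32, code_dim, 32),
                      activation='relu', solver='adam', max_iter=500)
    ae.fit(X_normal, X_normal)                        # reconstruct the input
    recon = ae.predict(X_normal)
    errors = np.sum((X_normal - recon) ** 2, axis=1)  # L(x, F(G(x))) per sample
    threshold = errors.mean()                         # error threshold over the training set
    return ae, threshold

def detect_unknown(ae, threshold, X_p, factor=3.0):
    """Flag a sample as an unknown-type anomaly when its reconstruction error
    exceeds `factor` times the training error threshold (step 6-3-5)."""
    recon = ae.predict(X_p)
    errors = np.sum((X_p - recon) ** 2, axis=1)
    return errors > factor * threshold                # True = unknown attack type

# illustrative usage on random data
X_normal = np.random.rand(300, 20)
ae, thr = train_autoencoder(X_normal)
X_p = np.random.rand(10, 20)
print(detect_unknown(ae, thr, X_p))
```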
Step 7) displaying the detection result of detecting whether the network flow data is abnormal data.
Specifically, if the detection result is that the network flow data is network flow data of a known attack type, an alarm is issued;
if the detection result is that the network flow data is network flow data of an unknown attack type, the suspected attack data is returned to the database and professional personnel are notified to perform manual analysis;
if the suspicious attack data is determined to be a new abnormal data type after being analyzed by the professional, adding the suspicious attack data into a training set of a first abnormal classifier; if the analyzed data is normal data, adding the normal data into a training set of a second abnormal classifier;
detection results are counted periodically, including: the known attack types and unknown attack types of abnormal data appearing in the whole network environment, and the accuracy, recall rate and false alarm rate of abnormal-data detection; whether to retrain and update the algorithms in the model is then decided accordingly; the accuracy, recall rate and false alarm rate of the detection results are calculated from the test results on test data carrying attack-type labels, namely:
Accuracy = ( TP + TN ) / ( TP + TN + FP + FN )

Recall = TP / ( TP + FN )

False alarm rate = FP / ( FP + TN )
where TP, TN, FP and FN denote statistical counts over the network flow feature data produced by the established network traffic data analysis method and system; specifically,
TP denotes the number of network flow feature data items whose detection result is normal and whose system prediction is normal;
TN denotes the number of network flow feature data items whose detection result is abnormal and whose system prediction is abnormal;
FP denotes the number of network flow feature data items whose detection result is abnormal and whose system prediction is normal;
FN denotes the number of network flow feature data items whose detection result is normal and whose system prediction is abnormal;
and determining whether to retrain and update the system according to the statistical result and the number of the suspicious attacks.
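As a quick illustration of the three formulas above, the following snippet computes accuracy, recall and false alarm rate from the four counts; the numbers passed in are made-up values, not measured results.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, recall and false-alarm rate from the TP/TN/FP/FN counts defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return accuracy, recall, false_alarm_rate

# illustrative counts only
print(detection_metrics(tp=900, tn=85, fp=5, fn=10))
```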
As shown in fig. 1, the present invention further provides a network traffic data analysis system based on a sparse autoencoder and extreme random trees, the system comprising: an original data acquisition module, a data preprocessing module, a data feature extraction module, a data anomaly detection module and a data result display module; wherein,
the original data acquisition module is used for capturing original network flow data in real time;
the data preprocessing module is used for extracting available data characteristics from the acquired original network traffic data and acquiring network traffic characteristic data;
the data characteristic extraction module is used for carrying out data cleaning and attribute splitting on the acquired network flow characteristic data and splitting the acquired network flow characteristic data into numerical data and non-numerical data; inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data; inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
the data anomaly detection module is used for carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data; detecting whether the network flow data is abnormal data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
if the network flow data is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data as the unknown attack type;
if the network flow data is not abnormal data, the output is normal.
And the data result display module is used for displaying the detection result of detecting whether the network flow data is abnormal data.
For network flow features such as IP addresses, ports and TCP/IP protocol fields, the method of the invention draws on the Word2Vec algorithm from natural language processing and proposes a TCP/IP numericalization algorithm based on a sparse autoencoder, which maps features such as IP addresses and TCP/IP protocol fields into an n-dimensional real-valued space and provides good support for the various distance- or density-based supervised and unsupervised algorithms in the subsequent data anomaly detection module. Compared with traditional one-hot encoding, the accuracy of the system is greatly improved and its false alarm rate is markedly reduced.
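The following PyTorch sketch illustrates this kind of sparse autoencoder re-encoding of one-hot TCP/IP features, i.e., a binary cross-entropy reconstruction loss plus a KL-divergence sparsity penalty on the mean hidden activation; the layer sizes, ρ, β, learning rate and the toy one-hot input are illustrative assumptions, not the patented parameters.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-hidden-layer sparse autoencoder for one-hot encoded protocol/port features."""
    def __init__(self, in_dim, code_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(code_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

def sparse_loss(x, x_hat, code, rho=0.05, beta=1e-3):
    """Cross-entropy reconstruction loss plus KL(rho || rho_hat_j) summed over hidden units."""
    bce = nn.functional.binary_cross_entropy(x_hat, x)
    rho_hat = code.mean(dim=0).clamp(1e-6, 1 - 1e-6)   # average activation per hidden unit
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return bce + beta * kl

# illustrative training loop on stand-in one-hot rows
x = torch.eye(10)
model = SparseAutoencoder(in_dim=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    x_hat, code = model(x)
    loss = sparse_loss(x, x_hat, code)
    opt.zero_grad()
    loss.backward()
    opt.step()
with torch.no_grad():
    embedding = model.encoder(x)   # re-encoded (numerical) non-numeric features
```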
In the data feature extraction module, multiple feature selection means are integrated. For the UNSW_NB15 data, in addition to the sparse-autoencoder-based TCP/IP numericalization algorithm, a feature processing method based on extreme random trees is used, with selectable criteria including information entropy, information gain ratio and the Gini index, so that the feature dimension is greatly reduced and the operating efficiency of the system is improved while the important information of the original data is preserved.
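A compact scikit-learn sketch of this kind of extreme-random-tree feature screening is shown below; the threshold, tree count and random toy data are assumptions for illustration, and sklearn's impurity-based feature importance stands in for the gain-ratio factor described in the patent.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def select_numeric_features(X_num, y, threshold=0.01, n_trees=100):
    """Rank numeric traffic features with an extremely randomized tree ensemble and
    keep the features whose importance exceeds `threshold`, in descending order."""
    forest = ExtraTreesClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_num, y)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]             # descending order of importance
    kept = [i for i in order if importances[i] > threshold]
    return kept, importances

# illustrative usage: X_num is the split numeric feature set, y the normal/abnormal label
X_num = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)
kept, imp = select_numeric_features(X_num, y)
X_selected = X_num[:, kept]                           # screened numerical data
```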
In the data anomaly detection module, supervised and unsupervised learning means are integrated. The basic idea is as follows: first, a traffic data profile model is built for the normal behavior of network traffic; then, for known attacks or abnormal network flow behaviors, anomaly detection classifiers for the different attack types are constructed by supervised learning; and for the remaining unknown data, unsupervised learning is used to continuously extend and improve the normal-behavior model of network traffic and the classifier of abnormal behavior. Through the double detection of supervised anomaly detection and unsupervised anomaly detection, accuracy and the false alarm rate are guaranteed while the ability to discover unknown attacks is retained.
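The two-stage decision flow just described can be sketched as follows; `supervised_clf`, `attack_clf`, `autoencoder` and `error_threshold` are assumed to have been trained and computed as in the earlier sketches, and the function name and return format are hypothetical.

```python
import numpy as np

def analyze_flow(x, supervised_clf, attack_clf, autoencoder, error_threshold):
    """Stage 1: supervised anomaly detection labels known anomalies with an attack type.
    Stage 2: autoencoder reconstruction error flags remaining traffic as unknown attacks."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    if supervised_clf.predict(x)[0] == 1:              # abnormal -> known attack type
        return "abnormal", attack_clf.predict(x)[0]
    recon = autoencoder.predict(x)
    err = float(np.sum((x - recon) ** 2))
    if err > 3 * error_threshold:                      # unknown anomaly
        return "abnormal", "unknown attack type"
    return "normal", None
```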
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A method for detecting the abnormity of network flow data is characterized by comprising the following steps:
step 1) processing original network flow data captured in real time to obtain network flow data;
step 2) judging the network flow data obtained in step 1); if the judgment result is abnormal data, outputting an anomaly, inputting the abnormal data into a pre-trained first anomaly classifier, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
step 3) judging the network flow data obtained in step 1); if the judgment result is not abnormal data, adopting an unsupervised anomaly detection method to further detect whether the network flow data is abnormal;
step 4) judging according to the further detection in the step 3), if the judgment result is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data as the unknown attack type;
step 5) judging according to the further detection of the step 3), and outputting normal data if the judgment result is not abnormal data;
the method comprises the steps of processing original network flow data captured in real time to obtain network flow data; the method specifically comprises the following steps:
capturing original network flow data in real time;
extracting available data characteristics from the obtained original network traffic data to obtain network traffic characteristic data;
performing data cleaning and attribute splitting on the acquired network flow characteristic data, and splitting the acquired network flow characteristic data into numerical data and non-numerical data;
inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data;
inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data;
the establishing and training of the sparse self-encoder specifically comprises the following steps:
establishing a sparse autoencoder, and, based on the TCPIP2Vec algorithm of the sparse autoencoder, adopting a cross-entropy loss function J_S(W, b) to which a KL-divergence sparsity penalty term is added to train the sparse autoencoder:

J_S(W, b) = J(W, b) + β · Σ_{j=1}^{s} KL( ρ ‖ ρ̂_j )

where J(W, b) is the cross-entropy reconstruction loss; Σ_{j=1}^{s} KL( ρ ‖ ρ̂_j ) is the KL-divergence penalty applied to the coding function; ρ is the sparsity parameter, β is the regularization parameter, and s is the number of hidden units of the coding layer;
ρ̂_j is the average activation value of the j-th hidden unit of the coding layer, calculated as follows:

ρ̂_j = (1/n) · Σ_{i=1}^{n} f_j(x_i)

where n is the number of samples in a training batch; f(x_i) is the coding function applied to the i-th sample x_i, and f_j(·) is its j-th component;
KL is the KL divergence, a measure of the similarity between two probability distributions, calculated as follows:

KL( ρ ‖ ρ̂_j ) = Σ_t ρ(t) · log( ρ(t) / ρ̂_j(t) )

where ρ(t) is the sparse parameter distribution, i.e., the Bernoulli distribution with parameter ρ, and ρ̂_j(t) is the corresponding distribution with parameter ρ̂_j;
the input of the sparse self-encoder is a non-numerical characteristic set subjected to single-hot encoding; the output of the sparse self-encoder is a recoded non-numerical characteristic set, namely coded non-numerical data.
2. The method according to claim 1, wherein the non-numerical data is input to a pre-trained sparse self-encoder for re-encoding, and encoded non-numerical data is obtained; the method specifically comprises the following steps:
dividing according to the attribute label set, performing attribute splitting on the non-numerical data, and acquiring a non-numerical feature set from the non-numerical data;
carrying out single-hot coding on the non-numerical characteristic set to obtain a non-numerical characteristic set subjected to single-hot coding, inputting the non-numerical characteristic set to a pre-trained sparse self-encoder, and obtaining an encoder extracted from the sparse self-encoder;
by using a Word2Vec algorithm in natural language processing for reference, the TCPIP2Vec algorithm based on a sparse autoencoder is adopted to re-encode the single hot code of the non-numerical feature set, and encoded non-numerical data is obtained.
3. The method according to claim 1, wherein the numerical data is input into a pre-established extreme random tree model, and the importance of the numerical data is sorted and screened in a descending order to obtain screened numerical data; the method specifically comprises the following steps:
dividing according to the attribute numbers, and performing attribute splitting on the numerical data to obtain a split numerical characteristic set;
inputting the split numerical characteristic set into a pre-established extreme random tree model, and arranging each numerical characteristic in the split numerical characteristic set in a descending order from large to small according to importance to obtain a sorted numerical characteristic set;
and screening the sorted numerical feature set according to a preset threshold value, obtaining the numerical features in the sorted numerical feature set whose importance factors are larger than the preset threshold value, and recording them as the screened numerical data.
4. The method according to claim 3, wherein the process of establishing the extreme stochastic tree-based feature selection model specifically comprises:
randomly selecting numerical features in the split numerical feature set to construct a plurality of decision trees;
wherein, the construction process of each decision tree is as follows:
the importance factor of each numerical feature is obtained according to the following calculation formula:

G(D, A) = ( H(D) − H(D|A) ) / H_A(D)

where G(D, A) is the importance factor, i.e., the information gain ratio, of the numerical feature A relative to the numerical feature set D to be divided; D is the data set to be divided; A is the currently selected numerical feature; H_A(D) is the information entropy obtained with the currently selected numerical feature A as the random variable; H(D) is the information entropy of the set D with the data class as the random variable; H(D|A) is the conditional information entropy of the subsets obtained after the set D is divided using the feature A;
when each decision tree is constructed, k numerical features are randomly selected from the K numerical features, where K is the total dimension of the numerical features and k is the feature dimension set for constructing each decision tree; the value of k is set to be smaller than K, generally

k = √K ;
When each decision tree is constructed, selecting the numerical characteristic with the largest information gain ratio G (D, A) from the k selected numerical characteristics, then constructing nodes and splitting;
when the nodes of the decision tree are split, randomly selecting an arbitrary number between the maximum value and the minimum value of the numerical characteristic, and recording the arbitrary number as a comparison value; when the numerical characteristic of the sample is greater than the comparison value, taking the sample as a left branch; when the numerical characteristic of the sample is smaller than the comparison value, the sample is taken as a right branch, and then the bifurcation value of the numerical characteristic of the sample is calculated; wherein, the sample is a split numerical characteristic set;
traversing the selected k numerical characteristics to construct a decision tree;
repeating the process of constructing the basic decision tree for N times to construct N decision trees; wherein, the number of the decision tree is determined by using a cross validation and grid search method;
judging each numerical characteristic in the split numerical characteristic set by utilizing a plurality of decision trees, specifically judging whether the original network traffic data corresponding to the numerical characteristic is normal data or abnormal data through each decision tree in the plurality of decision trees, summarizing the judgment result of each decision tree by a voting method, and taking the result of the majority of the judgment result as a final judgment result; wherein, the judgment result is that the original network flow data corresponding to the numerical characteristic is normal data or abnormal data;
obtaining the importance factor of each numerical feature in the split numerical feature set according to the finally obtained judgment results and the above formula, sorting the importance factors of the numerical features in the split numerical feature set in descending order of importance, screening each numerical feature in the split numerical feature set according to a set threshold value, obtaining the numerical features in the sorted numerical feature set whose importance factors are larger than the preset threshold value, and recording them as the screened numerical data;
the input of the extreme random tree model is a split numerical characteristic set, and the output of the extreme random tree model is screened numerical data.
5. The method of claim 1, wherein the available data features are extracted from the acquired raw network traffic data to obtain network traffic feature data; the method specifically comprises the following steps:
extracting a first feature from the acquired raw network traffic data by using an Argus tool, wherein the first feature comprises: a source IP address, a source port number, a destination IP address, a destination port number, and a transport protocol type;
extracting second features from the acquired raw network traffic data using a Bro-IDS tool, the second features comprising: counting from a source IP to a target IP packet, counting from a target IP to a source IP packet, an application layer protocol type, transmission bits per second of the source IP and transmission bits per second of the target IP;
among the available data features are: the first feature, the second feature, and other features extracted from the protocol header file, including: a value of a source TCP advertisement window size, a value of a target TCP advertisement window size, a sequence number of the source TCP, a sequence number of the target TCP, an average of a size of a stream packet transmitted by the source, an average of a size of a stream packet transmitted by the target, a pipe depth of an http request/response connection, an actual size of data transmitted from an http service of the server without compression, a source jitter time (millisecond), a target jitter time (millisecond), a source packet interval arrival time, a target packet interval arrival time, a number of "syn" and "ack" in the TCP connection, an interval time of syn and syn _ ack in the TCP connection, an interval time of syn _ ack packet and ack packet in the TCP connection.
6. The method of claim 1, wherein the building and training of the first anomaly classifier specifically comprises:
adopting an AdaBoost supervised classification ensemble algorithm to construct the anomaly detection function:

G(x) = sign( Σ_{m=1}^{M} α_m · G_m(x) )

where G(x) is the first anomaly classifier; G_m(x) is a decision-tree weak classifier, with m = 1, 2, …, M and M = 30; α_m is the classifier weight coefficient corresponding to G_m(x);
training the first anomaly classifier according to the anomaly detection function, wherein the input of the first anomaly classifier is {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_n, y_n)}, where x_i is each piece of processed network flow data; x_i ∈ R^n; y_i is the corresponding label, y_i ∈ {0, 1}, with 0 denoting normal and 1 denoting abnormal; the output is the discrimination result of the trained first anomaly classifier for each network traffic feature item.
7. The method according to claim 1, wherein the building and training of the second anomaly classifier specifically comprises:
an unsupervised anomaly detection function is constructed by reconstructing an error function using an auto-encoder:
L(x_p, x_r) = ‖x_p − x_r‖² = ‖x_p − F(G(x_p))‖²

where L(x_p, x_r) is the reconstruction error function; F and G denote the decoding function and the encoding function of the autoencoder, respectively; x_p is the raw network traffic data to be detected; x_r is the data reconstructed from x_p by the autoencoder;
training a second anomaly classifier according to the unsupervised anomaly detection function, and acquiring a reconstruction error threshold according to a reconstruction error function;
during detection, if the reconstruction error of one piece of network flow data is greater than the reconstruction error threshold value, the piece of network flow data is judged to be abnormal data;
if the reconstruction error of one piece of network flow data is less than or equal to the reconstruction error threshold value, judging that the data is normal data;
the input of the second anomaly classifier is the network flow data x_p judged to be normal data by the first anomaly classifier; its output is the discrimination result of the trained second anomaly classifier for x_p.
8. A network traffic data analysis system, the system comprising: an original data acquisition module, a data preprocessing module, a data feature extraction module, a data anomaly detection module and a data result display module; wherein,
the original data acquisition module is used for capturing original network flow data in real time;
the data preprocessing module is used for extracting available data characteristics from the acquired original network traffic data and acquiring network traffic characteristic data;
the data characteristic extraction module is used for carrying out data cleaning and attribute splitting on the acquired network flow characteristic data and splitting the acquired network flow characteristic data into numerical data and non-numerical data; inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data; inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
the data anomaly detection module is used for carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data; detecting whether the network flow data is abnormal data;
if the network flow data is abnormal data, outputting the abnormality, inputting the abnormal data into a first abnormal classifier trained in advance, judging the attack type of the abnormal data to be a known attack type, and outputting the attack type of the abnormal data;
if the network flow data is not abnormal data, adopting an unsupervised abnormal detection method to further detect whether the network flow data is abnormal;
judging according to further detection, if the judgment result is abnormal data, inputting the abnormal data into a pre-trained second abnormal classifier, judging the type of the abnormal data to be an unknown attack type, and marking the abnormal data as the unknown attack type; if the judgment result is not abnormal data, outputting the data to be normal;
the data result display module is used for displaying a detection result for detecting whether the network flow data is abnormal data;
the processing process of the data preprocessing module specifically comprises the following steps:
capturing original network flow data in real time;
extracting available data characteristics from the obtained original network traffic data to obtain network traffic characteristic data;
performing data cleaning and attribute splitting on the acquired network flow characteristic data, and splitting the acquired network flow characteristic data into numerical data and non-numerical data;
inputting the non-numerical data into a pre-trained sparse self-encoder for re-encoding to obtain encoded non-numerical data;
inputting numerical data into a pre-established extreme random tree model, and performing descending order arrangement and screening on the importance of the numerical data to obtain screened numerical data;
carrying out normalization processing on the encoded non-numerical data and the screened numerical data to obtain network flow data;
the establishment and training of the sparse self-encoder specifically comprise:
establishing a sparse autoencoder, and, based on the TCPIP2Vec algorithm of the sparse autoencoder, adopting a cross-entropy loss function J_S(W, b) to which a KL-divergence sparsity penalty term is added to train the sparse autoencoder:

J_S(W, b) = J(W, b) + β · Σ_{j=1}^{s} KL( ρ ‖ ρ̂_j )

where J(W, b) is the cross-entropy reconstruction loss; Σ_{j=1}^{s} KL( ρ ‖ ρ̂_j ) is the KL-divergence penalty applied to the coding function; ρ is the sparsity parameter, β is the regularization parameter, and s is the number of hidden units of the coding layer;
ρ̂_j is the average activation value of the j-th hidden unit of the coding layer, calculated as follows:

ρ̂_j = (1/n) · Σ_{i=1}^{n} f_j(x_i)

where n is the number of samples in a training batch; f(x_i) is the coding function applied to the i-th sample x_i, and f_j(·) is its j-th component;
KL is the KL divergence, a measure of the similarity between two probability distributions, calculated as follows:

KL( ρ ‖ ρ̂_j ) = Σ_t ρ(t) · log( ρ(t) / ρ̂_j(t) )

where ρ(t) is the sparse parameter distribution, i.e., the Bernoulli distribution with parameter ρ, and ρ̂_j(t) is the corresponding distribution with parameter ρ̂_j;
the input of the sparse self-encoder is a non-numerical characteristic set subjected to single-hot encoding; the output of the sparse self-encoder is a recoded non-numerical characteristic set, namely coded non-numerical data.
CN201910739001.0A 2019-08-12 2019-08-12 Network traffic data analysis method and system Active CN112398779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910739001.0A CN112398779B (en) 2019-08-12 2019-08-12 Network traffic data analysis method and system


Publications (2)

Publication Number Publication Date
CN112398779A CN112398779A (en) 2021-02-23
CN112398779B true CN112398779B (en) 2022-11-01

Family

ID=74602164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910739001.0A Active CN112398779B (en) 2019-08-12 2019-08-12 Network traffic data analysis method and system

Country Status (1)

Country Link
CN (1) CN112398779B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11843623B2 (en) * 2021-03-16 2023-12-12 Mitsubishi Electric Research Laboratories, Inc. Apparatus and method for anomaly detection
CN113158174B (en) * 2021-04-06 2022-06-21 上海交通大学 Automatic search system of grouping cipher actual key information based on graph theory
CN113067754B (en) * 2021-04-13 2022-04-26 南京航空航天大学 Semi-supervised time series anomaly detection method and system
CN113179264B (en) * 2021-04-26 2022-04-12 哈尔滨工业大学 Attack detection method for data transmission in networked control system
CN113392412B (en) * 2021-05-11 2022-05-24 杭州趣链科技有限公司 Data receiving method, data sending method and electronic equipment
CN113364752B (en) * 2021-05-27 2023-04-18 鹏城实验室 Flow abnormity detection method, detection equipment and computer readable storage medium
CN113469247B (en) * 2021-06-30 2022-04-01 广州天懋信息系统股份有限公司 Network asset abnormity detection method
CN113409092B (en) * 2021-07-12 2024-03-26 上海明略人工智能(集团)有限公司 Abnormal feature information extraction method, system, electronic equipment and medium
CN113569944A (en) * 2021-07-26 2021-10-29 北京奇艺世纪科技有限公司 Abnormal user identification method and device, electronic equipment and storage medium
CN113452581B (en) * 2021-08-30 2021-12-14 上海观安信息技术股份有限公司 Method and device for extracting characteristics of streaming data, storage medium and computer equipment
CN114079579B (en) * 2021-10-21 2024-03-15 北京天融信网络安全技术有限公司 Malicious encryption traffic detection method and device
CN113965384B (en) * 2021-10-22 2023-11-03 上海观安信息技术股份有限公司 Network security anomaly detection method, device and computer storage medium
CN113992419B (en) * 2021-10-29 2023-09-01 上海交通大学 System and method for detecting and processing abnormal behaviors of user
CN114189353A (en) * 2021-11-05 2022-03-15 西安理工大学 Network security risk prediction method based on railway dispatching set system
CN114039781B (en) * 2021-11-10 2023-02-03 湖南大学 Slow denial of service attack detection method based on reconstruction abnormity
CN114257517B (en) * 2021-11-22 2022-11-29 中国科学院计算技术研究所 Method for generating training set for detecting state of network node
CN114301629A (en) * 2021-11-26 2022-04-08 北京六方云信息技术有限公司 IP detection method, device, terminal equipment and storage medium
CN114900835A (en) * 2022-04-20 2022-08-12 广州爱浦路网络技术有限公司 Malicious traffic intelligent detection method and device and storage medium
CN114513374B (en) * 2022-04-21 2022-07-12 浙江御安信息技术有限公司 Network security threat identification method and system based on artificial intelligence
CN114826764B (en) * 2022-05-17 2023-07-18 广西科技大学 Edge computing network attack recognition method and system based on ensemble learning
CN114785617B (en) * 2022-06-15 2022-11-15 北京金汇创企业管理有限公司 5G network application layer anomaly detection method and system
CN114785623A (en) * 2022-06-21 2022-07-22 南京信息工程大学 Network intrusion detection method and device based on discretization characteristic energy system
WO2024065185A1 (en) * 2022-09-27 2024-04-04 西门子股份公司 Device classification method and apparatus, electronic device, and computer-readable storage medium
CN115396235B (en) * 2022-10-25 2023-01-13 北京天云海数技术有限公司 Network attacker identification method and system based on hacker portrait
CN115720177B (en) * 2023-01-10 2023-04-14 北京金睛云华科技有限公司 Network intrusion detection method, device and equipment
CN116319005A (en) * 2023-03-21 2023-06-23 上海安博通信息科技有限公司 Attack detection method, device and processing system combined with natural language processing model
CN116561689B (en) * 2023-05-10 2023-11-14 盐城工学院 High-dimensional data anomaly detection method
CN116633543B (en) * 2023-07-21 2023-09-15 沈阳航盛科技有限责任公司 1553B communication protocol data encryption method
CN116805926B (en) * 2023-08-21 2023-11-17 上海飞旗网络技术股份有限公司 Network service type identification model training method and network service type identification method
CN116996869B (en) * 2023-09-26 2023-12-29 济南正大科技发展有限公司 Network abnormal data processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106060043A (en) * 2016-05-31 2016-10-26 北京邮电大学 Abnormal flow detection method and device
CN108093406A (en) * 2017-11-29 2018-05-29 重庆邮电大学 A kind of wireless sense network intrusion detection method based on integrated study
CN108540451A (en) * 2018-03-13 2018-09-14 北京理工大学 A method of classification and Detection being carried out to attack with machine learning techniques
CN108632279A (en) * 2018-05-08 2018-10-09 北京理工大学 A kind of multilayer method for detecting abnormality based on network flow
CN109977151A (en) * 2019-03-28 2019-07-05 北京九章云极科技有限公司 A kind of data analysing method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160095856A (en) * 2015-02-04 2016-08-12 한국전자통신연구원 System and method for detecting intrusion intelligently based on automatic detection of new attack type and update of attack type


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于集成分类器的恶意网络流量检测;汪洁等;《通信学报》;20181025(第10期);全文 *

Also Published As

Publication number Publication date
CN112398779A (en) 2021-02-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant