CN110311870B - SSL VPN flow identification method based on density data description - Google Patents

Publication number
CN110311870B
Authority
CN
China
Prior art keywords
data
density
ssl vpn
ssl
hypersphere
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910498412.5A
Other languages
Chinese (zh)
Other versions
CN110311870A (en)
Inventor
刘扬
吕思才
黄俊恒
孙云霄
王佰玲
王超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hit Weihai Innovation Pioneer Park Co ltd
Harbin Institute of Technology Weihai
Original Assignee
Hit Weihai Innovation Pioneer Park Co ltd
Harbin Institute of Technology Weihai

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/46 Interconnection of networks
    • H04L12/4633 Interconnection of networks using encapsulation techniques, e.g. tunneling
    • H04L12/4641 Virtual LANs, VLANs, e.g. virtual private networks [VPN]
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows

Abstract

The invention belongs to the technical field of network data processing and relates to a method for identifying SSL VPN traffic. The SSL VPN traffic identification method based on density data description comprises the following steps: capturing, from network traffic, the traffic that is transmitted securely using the SSL protocol; performing feature extraction on the captured SSL traffic; and classifying the extracted feature vectors using the density-based data description domain of the SSL VPN as the criterion, wherein a feature vector that falls inside the data description domain is judged to be SSL VPN traffic, and otherwise it is judged to be ordinary SSL traffic. The method converts SSL VPN traffic into feature vectors and uses density-based data description to find the data description domain, thereby classifying SSL VPN traffic.

Description

SSL VPN flow identification method based on density data description
Technical Field
The invention belongs to the technical field of network data processing, and relates to an SSL VPN flow identification method.
Background
With the advent of the big data era, acquiring data has become increasingly convenient, and extracting the required information from the acquired data has become the new challenge. Classification is an important class of machine learning problems: a model is built from existing data and is then used to judge the class of newly obtained data whose class is unknown. However, the labeled training data that classification requires is not easy to obtain. For example, VPN traffic can be collected by various methods, but non-VPN traffic is so varied that a counter-example set collected in the usual way can hardly cover every type of non-VPN traffic. In this case only one class of labeled data is available for training.
For the case in which only a single class of data is available, researchers have proposed corresponding classification models, of which One-Class SVM and SVDD are the most widely used. The core idea of One-Class SVM is to treat the coordinate origin as the only abnormal point and to find a hyperplane such that the single-class data set and the origin fall on opposite sides of it, with the distance from the origin to the hyperplane as large as possible. During prediction, data that falls on the same side as the training samples is regarded as belonging to the target class, and other data is not. SVDD maps the original samples to a high-dimensional space through a kernel function and searches that space for a hypersphere that contains most of the data while having as small a volume as possible. During prediction, data that falls inside the hypersphere is regarded as belonging to the target class, and other data is not. The two algorithms have been shown to perform similarly on the one-class classification problem.
Other researchers treat one-class classification as an anomaly detection problem and regard data of non-target classes as anomalies. Isolation Forest is a widely used method of this kind. To construct it, a feature is selected at random and a value is selected at random from the value range of that feature to split the data set. This is repeated until each leaf node contains only one data point, which yields an isolation tree (Isolation Tree); constructing multiple isolation trees yields the isolation forest. Because anomalies are outliers, they are split off into leaf nodes quickly, whereas non-outliers usually need more splits before they reach a leaf node. Whether a point is an outlier can therefore be determined from the path length between its leaf node and the root node.
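The path-length idea behind Isolation Forest can be illustrated with a short sketch. This is a simplified illustration written for this text, not the reference algorithm: it follows only the branch that contains the query point, uses the whole data set for every tree (real implementations subsample), and reports a raw mean depth rather than a normalized anomaly score. The function names `isolation_depth` and `mean_isolation_depth` are hypothetical.

```python
import random


def isolation_depth(point, data, rng):
    """Count the random splits needed to isolate `point` within `data` (one tree path)."""
    subset = list(data)
    depth = 0
    while len(subset) > 1:
        # Only features that still vary inside the subset can be split on.
        dims = [j for j in range(len(point))
                if min(p[j] for p in subset) < max(p[j] for p in subset)]
        if not dims:
            break  # all remaining points are identical
        j = rng.choice(dims)
        low = min(p[j] for p in subset)
        high = max(p[j] for p in subset)
        split = rng.uniform(low, high)
        # Keep only the side of the split that contains the query point.
        if point[j] < split:
            subset = [p for p in subset if p[j] < split]
        else:
            subset = [p for p in subset if p[j] >= split]
        depth += 1
    return depth


def mean_isolation_depth(point, data, trees=100, seed=0):
    """Average the isolation depth of `point` over several random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees


# A dense 8 x 8 grid of inliers plus one distant outlier.
data = [(0.1 * i, 0.1 * j) for i in range(8) for j in range(8)] + [(10.0, 10.0)]
print(mean_isolation_depth((10.0, 10.0), data))  # short path: isolated quickly
print(mean_isolation_depth(data[27], data))      # longer path: needs more splits
```

A point with a short average path is a likely outlier; a point that needs many splits lies in a dense region.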
A VPN is a private network established over a public network, and it encrypts the communication it carries. VPN traffic therefore necessarily covers many types of traffic, and a collected VPN data set usually contains multiple clusters. It is thus necessary to find a way to classify and identify VPN traffic data and obtain the required information.
Disclosure of Invention
In order to solve the problem of VPN traffic identification, the invention provides an SSL VPN traffic identification method based on density data description. The method starts from the internal distribution of VPN traffic data and delimits a data description domain, thereby realizing VPN traffic identification.
The technical scheme adopted by the invention to solve the technical problem is as follows: an SSL VPN traffic identification method based on density data description comprises the following steps:
capturing, from network traffic, the traffic that is transmitted securely using the SSL protocol;
performing feature extraction on the captured SSL traffic;
and classifying the extracted feature vectors using the density-based data description domain of the SSL VPN as the criterion, wherein a feature vector that falls inside the data description domain is judged to be SSL VPN traffic, and otherwise it is judged to be ordinary SSL traffic.
Further, the method for acquiring the density-based data description domain of the SSL VPN includes:
constructing an SSL VPN traffic data density description model function;
carrying out preliminary training on the model function to obtain the hypersphere with the maximum density;
further training the model function, and assigning the data points scattered outside the maximum-density hypersphere to data description domains in turn, to obtain a plurality of hyperspheres;
and integrating the data inside all the hyperspheres to form the data description domain of the SSL VPN.
Further, the SSL VPN traffic data density description model function is:
[Equation image BDA0002089380390000021]
where ρ(R, a) represents the density, R represents the radius of the hypersphere, a represents the center of the hypersphere, n is the number of samples in the data set, and C_n R^n is the volume of the n-dimensional hypersphere.
The preliminary training method comprises the following steps:
using gradient ascent, first calculate the partial derivatives with respect to R and a:
[Equation image BDA0002089380390000031]
given initial values R_0 and a_0 and a learning rate η, the iterative calculation formula is as follows:
[Equation image BDA0002089380390000032]
the learning rate is adjusted adaptively using the Adadelta method according to the gradient calculated at each iteration, and the adjustment formula is as follows:
[Equation image BDA0002089380390000033]
The further training steps are as follows:
1. initialize the upper error bound γ, and set the minimum density α or the minimum number of sample points min_sample inside a hypersphere;
2. find the hypersphere with the maximum density in the current data set by gradient ascent;
3. count the points of the data set that lie inside the hypersphere; if the number of points is less than min_sample or the density is less than α, jump to step 5; otherwise remove these points and take the remaining points as the new data set;
4. count the points in the new data set; if their proportion is less than γ, jump to step 5; otherwise jump to step 2;
5. output all the hyperspheres obtained.
The SSL VPN traffic identification method based on density data description converts SSL VPN traffic into feature vectors and uses density-based data description to find the data description domain, thereby classifying SSL VPN traffic. The method has the following beneficial effects:
(1) VPN traffic identification can help administrators or operators supervise network traffic;
(2) the method is insensitive to the distribution of the data: it can delimit a data description domain for data of any shape, which suits the fact that many kinds of traffic are carried inside a VPN tunnel;
(3) the method is insensitive to outliers in the data set, which do not affect how the data description domain is delimited.
Drawings
FIG. 1 is a flow chart of the SSL VPN traffic identification method based on density data description of the present invention;
FIG. 2 shows the indicator function used for class discrimination in the model;
FIG. 3 shows the function that replaces the indicator function in the model, which has the advantage of being continuous and differentiable;
FIG. 4 illustrates a particular case encountered after solving a model function;
FIG. 5 is a flow chart of model training and solving.
Detailed description of the preferred embodiments
The SSL VPN traffic identification method based on density data description according to the present invention is explained in detail below with reference to the accompanying drawings and embodiments.
The flow of the SSL VPN traffic identification method based on density data description according to the invention is shown in FIG. 1. The specific steps are as follows:
First, capture, from network traffic, the traffic that is transmitted securely using the SSL protocol.
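The text does not say how SSL traffic is recognized during capture. One common heuristic, shown here only as an assumed sketch, is to check whether the first bytes of a TCP payload form an SSL/TLS handshake record that carries a Client Hello: the record layer starts with a content type byte (0x16 for handshake), a two-byte version (major version 3), and a two-byte length, followed by the handshake message type (0x01 for Client Hello). The function name `looks_like_client_hello` is hypothetical.

```python
def looks_like_client_hello(payload: bytes) -> bool:
    """Heuristically test whether a TCP payload starts with an SSL/TLS Client Hello."""
    if len(payload) < 6:
        return False
    content_type = payload[0]               # 0x16 = handshake record
    major, minor = payload[1], payload[2]   # record-layer version, 3.x for SSL 3.0 and TLS
    handshake_type = payload[5]             # first byte after the 5-byte record header
    return (content_type == 0x16 and major == 0x03
            and minor <= 0x04 and handshake_type == 0x01)


# Example payload prefix: handshake record, version 3.1, length 5, Client Hello.
print(looks_like_client_hello(bytes([0x16, 0x03, 0x01, 0x00, 0x05, 0x01, 0, 0, 1, 0])))
print(looks_like_client_hello(b"GET / HTTP/1.1\r\n"))
```

A flow whose first client payload passes this check can then be handed to feature extraction; a production capture would also need TCP reassembly, which is omitted here.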
Second, perform feature extraction on the captured SSL traffic.
SSL VPN traffic classification focuses on the information that the traffic carries in the handshake protocol. The first phase of the handshake consists of the Client Hello and the Server Hello. From the Client Hello, the cipher suites (the supported encryption protocols) and the length of each part of the extensions can be recorded. The packet carrying the Server Hello also directly contains the second-phase content, such as the certificate, together with the Server Hello Done, so the length of the certificate, the length of the certificate status, and the length of the Server Key Exchange can be recorded. All SSL VPN traffic features are shown in Table 1.
TABLE 1 SSL VPN traffic features
Feature | Description
Length of each part of Extension | Lengths of the extension fields
Lengths of respective parts in Server Hello | Lengths of the Certificate, Server Key Exchange, etc.
Forward packet arrival time statistics | Mean, variance, maximum, and minimum of forward packet arrival times, etc.
Reverse packet arrival time statistics | Mean, variance, maximum, and minimum of reverse packet arrival times, etc.
Forward packet length statistics | Mean, variance, maximum, and minimum of forward packet lengths
Reverse packet length statistics | Mean, variance, maximum, and minimum of reverse packet lengths
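The statistical rows of Table 1 can be computed along the following lines. This is a minimal sketch under stated assumptions: packets are represented as hypothetical `(timestamp, length, direction)` tuples, "arrival time" statistics are interpreted as statistics of the gaps between consecutive packets in the same direction, population variance is used, and a direction with too few packets contributes zeros. None of these choices is specified in the text.

```python
from statistics import mean, pvariance


def summarize(values):
    """Return (mean, variance, maximum, minimum) of a non-empty sequence."""
    return (mean(values), pvariance(values), max(values), min(values))


def flow_statistics(packets):
    """Build the statistical part of a feature vector for one flow.

    packets: list of (timestamp_seconds, length_bytes, direction) tuples,
    where direction is "fwd" or "rev" (an assumed representation).
    """
    features = []
    for direction in ("fwd", "rev"):
        times = [t for t, _, d in packets if d == direction]
        lengths = [n for _, n, d in packets if d == direction]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        # Directions with no usable data contribute zeros (an assumption).
        features.extend(summarize(gaps) if gaps else (0.0, 0.0, 0.0, 0.0))
        features.extend(summarize(lengths) if lengths else (0.0, 0.0, 0.0, 0.0))
    return features


packets = [(0.0, 100, "fwd"), (0.5, 50, "rev"), (1.0, 200, "fwd"), (3.0, 300, "fwd")]
print(flow_statistics(packets))
```

The handshake-length features from the first two rows of the table would be appended to this vector; parsing them is not shown.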
Third, construct the SSL VPN traffic data density description model function.
In general, data of the same class are similar to one another, so they concentrate in a certain region; data that does not fall in this region is, with high confidence, not data of that class (that is, it is an outlier). This region is called the data description domain.
Based on this property, the core idea of Density-based Data Description (DBDD) is to find the hypersphere with the highest density in a given data set, which gives the following model function:
[Equation image BDA0002089380390000051]
where ρ(R, a) represents the density, R represents the radius of the hypersphere, a represents the center of the hypersphere, n is the number of samples in the data set, and C_n R^n is the volume of the n-dimensional hypersphere. The objective function maximizes the ratio of the number of sample points that fall inside the hypersphere with center a and radius R to the volume of that hypersphere, that is, the density.
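The model function itself survives only as an image reference. One form that is consistent with the symbol definitions above can be sketched as follows. The argument of f, the choice of a unit step for f, and the symbol d for the feature dimension are assumptions made here for illustration (the text uses n both for the number of samples and for the dimension), so this should not be read as the exact formula of the patent. The expression for the unit-ball constant is the standard one.

```latex
\max_{R,\,a}\ \rho(R, a)
  = \frac{\sum_{i=1}^{n} f\bigl(R - \lVert x_i - a \rVert\bigr)}{C_d\, R^{d}},
\qquad
f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases},
\qquad
C_d = \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)}
```

Here x_i denotes the i-th sample, so the numerator counts the samples inside the hypersphere and the denominator is its volume.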
The numerator of ρ(R, a) is not differentiable, as shown in FIG. 2, which makes it difficult to find the values of R and a that maximize the function. The function f(x) in the numerator can therefore be replaced with the sigmoid function, which is shown in FIG. 3.
Replacing f(x) with the sigmoid function has two advantages: 1. the objective function becomes differentiable and is therefore easier to solve; 2. points near the boundary are handled better. f(x) assigns a boundary point directly to 0 or 1, whereas the sigmoid function assigns it a value that varies with the distance from the point to the boundary, similar to the soft margin in SVM, which prevents overfitting to some extent.
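The effect of the substitution can be seen by computing the density both ways. The sketch below assumes that the function in the numerator takes the signed distance to the boundary, R minus the distance from the point to the center, as its argument, and it introduces a hypothetical `sharpness` factor that controls how closely the sigmoid follows the step; neither detail is given in the text.

```python
import math


def distance(p, q):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def ball_volume(radius, dim):
    """Volume of a dim-dimensional ball: pi^(d/2) / Gamma(d/2 + 1) * R^d."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_density(points, center, radius):
    """Density with the step function: points inside the sphere / sphere volume."""
    inside = sum(1 for p in points if distance(p, center) <= radius)
    return inside / ball_volume(radius, len(center))


def soft_density(points, center, radius, sharpness=10.0):
    """Density with the step replaced by a sigmoid of the distance to the boundary."""
    total = sum(sigmoid(sharpness * (radius - distance(p, center))) for p in points)
    return total / ball_volume(radius, len(center))


points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0)]
print(hard_density(points, (0.0, 0.0), 1.0))
print(soft_density(points, (0.0, 0.0), 1.0))
```

Points well inside or well outside the sphere contribute almost exactly 1 or 0 in both versions; only points near the boundary contribute fractional values in the smoothed version.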
Fourth, perform preliminary training on the constructed model function.
1. Using gradient ascent, first calculate the partial derivatives with respect to R and a:
[Equation image BDA0002089380390000052]
2. Given initial values R_0 and a_0 and a learning rate η, the iterative calculation formula is as follows:
[Equation image BDA0002089380390000061]
However, it is difficult to specify a suitable learning rate directly: too large a learning rate may prevent the iteration from converging, and too small a learning rate makes each step small and convergence slow.
3. The learning rate is therefore adjusted adaptively using the Adadelta method, according to the gradient calculated at each iteration. The adjustment formula is as follows:
[Equation image BDA0002089380390000062]
4. The model trained in this preliminary way finds one hypersphere with the highest density, as shown in FIG. 4.
However, the data contained in this hypersphere is only a small part of all the data, so the hypersphere obtained by preliminary training cannot represent the region over which the data is distributed, and further training is needed.
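The preliminary training can be sketched as gradient ascent on the smoothed density with Adadelta step sizes. The formulas of the patent are not reproduced in this text, so everything below is an assumption: the smoothed objective uses a sigmoid of `k * (R - distance)`, the gradients are derived from that assumed objective, the Adadelta recurrences are the standard ones (decaying averages of squared gradients and squared steps, with `decay` and `eps` as conventional hyperparameters), and a hypothetical `min_radius` floor keeps the radius positive.

```python
import math


def distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def ball_volume(radius, dim):
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_density(points, center, radius, k=10.0):
    total = sum(sigmoid(k * (radius - distance(p, center))) for p in points)
    return total / ball_volume(radius, len(center))


def density_gradient(points, center, radius, k=10.0):
    """Return (d rho / d R, d rho / d a) for the assumed smoothed density."""
    dim = len(center)
    volume = ball_volume(radius, dim)
    total, d_total_dr = 0.0, 0.0
    d_total_da = [0.0] * dim
    for p in points:
        dist = distance(p, center)
        s = sigmoid(k * (radius - dist))
        slope = k * s * (1.0 - s)          # derivative of the sigmoid term
        total += s
        d_total_dr += slope
        if dist > 0.0:                     # direction is undefined at the center
            for j in range(dim):
                d_total_da[j] += slope * (p[j] - center[j]) / dist
    grad_r = d_total_dr / volume - dim * total / (radius * volume)
    grad_a = [g / volume for g in d_total_da]
    return grad_r, grad_a


def adadelta_ascent(points, center, radius, steps=50, decay=0.95,
                    eps=1e-6, k=10.0, min_radius=1e-3):
    """Run gradient ascent on (R, a) with Adadelta-scaled steps."""
    params = [radius] + list(center)
    avg_sq_grad = [0.0] * len(params)
    avg_sq_step = [0.0] * len(params)
    for _ in range(steps):
        grad_r, grad_a = density_gradient(points, params[1:], params[0], k)
        for j, g in enumerate([grad_r] + grad_a):
            avg_sq_grad[j] = decay * avg_sq_grad[j] + (1 - decay) * g * g
            step = math.sqrt(avg_sq_step[j] + eps) / math.sqrt(avg_sq_grad[j] + eps) * g
            avg_sq_step[j] = decay * avg_sq_step[j] + (1 - decay) * step * step
            params[j] += step              # ascent: move along the gradient
        params[0] = max(params[0], min_radius)
    return params[1:], params[0]


points = [(5 + 0.1 * i, 5 + 0.1 * j) for i in range(-2, 3) for j in range(-2, 3)]
center, radius = adadelta_ascent(points, [4.5, 4.5], 1.0)
print(center, radius)
```

Note that a pure density objective can keep growing as the sphere shrinks around a few points, which is one reason a guard such as the minimum sample count used in the further training is useful.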
Fifth, further train the model function as shown in FIG. 5. The specific steps are as follows:
Step 1: initialize the upper error bound γ, and set the minimum density α or the minimum number of sample points min_sample inside a hypersphere;
Step 2: find the hypersphere with the maximum density in the current data set by gradient ascent;
Step 3: count the points of the data set that lie inside the hypersphere; if the number of points is less than min_sample or the density is less than α, jump to Step 5; otherwise remove these points and take the remaining points as the new data set;
Step 4: count the points in the new data set; if their proportion is less than γ, jump to Step 5; otherwise jump to Step 2;
Step 5: output all the hyperspheres obtained.
Through these training steps, the data points scattered outside the maximum-density hypersphere are in turn assigned to data description domains. The conditions for leaving the iteration prevent data description domains from being drawn over sparse data and, because a hypersphere that contains too few points is rejected, prevent data description domains from being drawn around outliers.
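The loop in Steps 1 to 5 can be sketched as follows. To keep the example short and deterministic, the gradient ascent of Step 2 is replaced by a plainly different stand-in, `densest_fixed_radius_sphere`, which tries every data point as a center with a fixed radius and keeps the one that covers the most points. The proportion in Step 4 is taken relative to the size of the original data set, which is an assumption, and the parameter names mirror the text (`min_sample`, `alpha`, `gamma`).

```python
import math


def distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def ball_volume(radius, dim):
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


def densest_fixed_radius_sphere(points, radius=1.0):
    """Stand-in for Step 2: pick the data point whose fixed-radius ball covers most points."""
    center = max(points, key=lambda c: sum(1 for p in points if distance(p, c) <= radius))
    return center, radius


def describe_domain(points, find_sphere=densest_fixed_radius_sphere,
                    min_sample=3, alpha=0.0, gamma=0.1):
    """Peel off dense hyperspheres until the stopping conditions of Steps 3 and 4 hold."""
    total = len(points)
    remaining = list(points)
    spheres = []
    while remaining:
        center, radius = find_sphere(remaining)                      # Step 2
        inside = [p for p in remaining if distance(p, center) <= radius]
        density = len(inside) / ball_volume(radius, len(center))
        if len(inside) < min_sample or density < alpha:              # Step 3
            break
        spheres.append((center, radius))
        remaining = [p for p in remaining if distance(p, center) > radius]
        if len(remaining) / total < gamma:                           # Step 4
            break
    return spheres                                                   # Step 5


cluster_a = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(2)]
cluster_b = [(10 + 0.1 * i, 10.0) for i in range(8)]
outlier = [(50.0, 50.0)]
for center, radius in describe_domain(cluster_a + cluster_b + outlier):
    print(center, radius)
```

With two well-separated clusters and one distant point, the loop returns one sphere per cluster and leaves the isolated point outside every sphere.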
Sixth, combine the data inside all the hyperspheres to obtain the density-based data description domain of the SSL VPN.
Seventh, using the density-based data description domain of the SSL VPN as the criterion, classify the feature vectors extracted from SSL traffic: a feature vector that falls inside the data description domain is regarded as SSL VPN traffic, and otherwise it is regarded as ordinary SSL traffic.
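The final decision reduces to a membership test against the union of hyperspheres. A minimal sketch, assuming each hypersphere is stored as a hypothetical `(center, radius)` pair in the original feature space:

```python
import math


def distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def in_description_domain(x, spheres):
    """True if feature vector x lies inside at least one hypersphere."""
    return any(distance(x, center) <= radius for center, radius in spheres)


def classify(x, spheres):
    """Label a feature vector according to the data description domain."""
    return "SSL VPN" if in_description_domain(x, spheres) else "ordinary SSL"


spheres = [((0.0, 0.0), 1.0), ((10.0, 10.0), 0.5)]
print(classify((0.5, 0.5), spheres))
print(classify((5.0, 5.0), spheres))
```

In practice the feature vectors would usually be scaled before training and before this test, since the features in Table 1 have very different units; the text does not discuss scaling.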

Claims (1)

1. An SSL VPN traffic identification method based on density data description, characterized in that the method comprises the following steps:
capturing, from network traffic, the traffic that is transmitted securely using the SSL protocol;
performing feature extraction on the captured SSL traffic;
classifying the extracted feature vectors using a density-based data description domain of the SSL VPN as the criterion, wherein a feature vector that falls inside the data description domain is judged to be SSL VPN traffic, and otherwise it is judged to be ordinary SSL traffic;
the method for acquiring the data description domain comprises the following steps:
constructing an SSL VPN traffic data density description model function;
carrying out preliminary training on the model function to obtain the hypersphere with the maximum density;
further training the model function, and assigning the data points scattered outside the maximum-density hypersphere to data description domains in turn, to obtain a plurality of hyperspheres;
integrating the data inside all the hyperspheres to form the data description domain of the SSL VPN;
the SSL VPN traffic data density description model function is as follows:
[Equation image DEST_PATH_IMAGE002]
wherein ρ(R, a) represents the density, R represents the radius of the hypersphere, a represents the center of the hypersphere, n is the number of samples in the data set, and C_n R^n is the volume of the n-dimensional hypersphere;
the preliminary training method comprises the following steps:
using gradient ascent, first calculating the partial derivatives with respect to R and a:
[Equation image DEST_PATH_IMAGE008]
given initial values R_0 and a_0 and a learning rate η, the iterative calculation is as follows:
[Equation image DEST_PATH_IMAGE016]
the learning rate is adjusted adaptively using the Adadelta method according to the gradient calculated at each iteration, and the adjustment formula is as follows:
[Equation image DEST_PATH_IMAGE018]
the further training steps are as follows:
1. initializing the upper error bound γ, and setting the minimum density α or the minimum number of sample points min_sample inside a hypersphere;
2. finding the hypersphere with the maximum density in the current data set by gradient ascent;
3. counting the points of the data set that lie inside the hypersphere, and jumping to step 5 if the number of points is less than min_sample or the density is less than α; otherwise removing these points and taking the remaining points as the new data set;
4. counting the points in the new data set, and jumping to step 5 if their proportion is less than γ; otherwise jumping to step 2;
5. outputting all the hyperspheres obtained.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910498412.5A CN110311870B (en) 2019-06-10 2019-06-10 SSL VPN flow identification method based on density data description

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910498412.5A CN110311870B (en) 2019-06-10 2019-06-10 SSL VPN flow identification method based on density data description

Publications (2)

Publication Number | Publication Date
CN110311870A | 2019-10-08
CN110311870B | 2022-08-02

Family

ID=68077099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910498412.5A Active CN110311870B (en) 2019-06-10 2019-06-10 SSL VPN flow identification method based on density data description

Country Status (1)

Country Link
CN (1) CN110311870B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112019500B (en) * 2020-07-15 2021-11-23 中国科学院信息工程研究所 Encrypted traffic identification method based on deep learning and electronic device
CN113364703B (en) * 2021-06-03 2023-08-08 天翼云科技有限公司 Processing method and device of network application traffic, electronic equipment and readable medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101296228A (en) * 2008-06-19 2008-10-29 上海交通大学 SSL VPN protocol detection method based on flow analysis
CN108921123A (en) * 2018-07-17 2018-11-30 重庆科技学院 A kind of face identification method based on double data enhancing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9497277B2 (en) * 2012-12-21 2016-11-15 Highspot, Inc. Interest graph-powered search
US11482307B2 (en) * 2017-03-02 2022-10-25 Drexel University Multi-temporal information object incremental learning software system
US10838420B2 (en) * 2017-07-07 2020-11-17 Toyota Jidosha Kabushiki Kaisha Vehicular PSM-based estimation of pedestrian density data

Also Published As

Publication number Publication date
CN110311870A (en) 2019-10-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant