CN114006870A

CN114006870A - Network flow identification method based on self-supervision convolution subspace clustering network

Info

Publication number: CN114006870A
Application number: CN202111270837.4A
Authority: CN
Inventors: 王艺杰; 杨东; 吕珍珍; 王文庆; 崔逸群; 邓楠轶; 朱博迪; 介银娟; 董夏昕; 朱召鹏; 崔鑫
Original assignee: Xian Thermal Power Research Institute Co Ltd; Huaneng Group Technology Innovation Center Co Ltd
Current assignee: Xian Thermal Power Research Institute Co Ltd; Huaneng Group Technology Innovation Center Co Ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-02-01

Abstract

The invention discloses a network traffic identification method based on a self-supervised convolutional subspace clustering network, comprising the following steps: preprocessing the original network traffic data; initializing and pre-training an autoencoder; training a convolutional subspace clustering network to learn a sparse representation matrix of the data; adding a clustering module to the convolutional subspace clustering network, which uses cosine similarity to measure the distance between two vectors when constructing the similarity matrix and which generates pseudo labels; realizing self-supervised learning by classifying the data with a classification module and using the pseudo labels generated by the clustering module to calculate the error between the classification result and the expected label, so that the self-supervision takes effect through back-propagation in the neural network; and finally identifying the network traffic type by maximum likelihood estimation. The method relies on the statistical characteristics of the traffic data rather than on the payload carried in data frames, and therefore identifies encrypted traffic and similar traffic well.

Description

Network traffic identification method based on a self-supervised convolutional subspace clustering network

Technical Field

The invention belongs to the technical fields of deep learning, cyberspace security and traffic identification, and particularly relates to a network traffic identification method based on a self-supervised convolutional subspace clustering network.

Background

As network applications multiply and network technologies continue to develop, a large amount of network traffic is generated at every moment, and this traffic is an important carrier of the information transmitted over networks. The sheer volume of traffic poses great challenges to network security management and traffic supervision. Accurate identification of network traffic is an important prerequisite for both: it can improve the quality of network transmission and help keep the network operating securely. Existing network traffic identification methods are mainly port-based identification, identification based on behavior feature matching, and deep packet inspection. Port-based identification is accurate only for protocol traffic that uses well-known and registered ports; identification based on behavior feature matching has high time and space complexity; and deep packet inspection has poor intelligent identification capability. These conventional methods cannot accomplish the task of network traffic identification efficiently.

Disclosure of Invention

To overcome these technical problems, the invention provides a network traffic identification method based on a self-supervised convolutional subspace clustering network. By adding a clustering module and a classification module to a deep neural network, it constructs a self-supervised target from the intrinsic characteristics of the data and thereby realizes self-supervision of the network. The clustering module generates pseudo labels, and the classification module uses the pseudo labels and a classification network to supervise the learning process. Once self-supervision is introduced, representation learning and clustering are fused and trained within a unified network framework, so that representations beneficial to the clustering task are learned and the clustering accuracy improves. After the optimal clustering result is obtained, each cluster is matched to a specific network application type by a likelihood estimation method, which accomplishes the task of network traffic identification.

To achieve this purpose, the invention adopts the following technical scheme:

A network traffic identification method based on a self-supervised convolutional subspace clustering network comprises the following steps:

1) data preprocessing:

filtering the acquired network traffic data set through a set strategy, converting the original network traffic data of various different formats into a uniform data format, and avoiding loss of key data items during conversion;

2) initializing and pre-training the autoencoder:

initializing an autoencoder network, then inputting the data from step 1) into the autoencoder and pre-training it;

3) training a convolutional subspace clustering network and learning a sparse representation matrix of the data:

initializing the autoencoder part of the convolutional subspace clustering network with the autoencoder parameters learned in step 2), inputting the original data into the network, and training the network;

4) constructing a pseudo label:

constructing a similarity matrix from the sparse representation matrix learned by the convolutional subspace clustering network in step 3), and then applying spectral clustering to the similarity matrix to obtain a cluster partition of the data samples; the cluster partition obtained by spectral clustering can be used as pseudo labels for the data set; although these labels are not correct for every sample, they still contain useful information provided the network has been sufficiently pre-trained, and the pseudo labels generated by clustering are therefore used to supervise the network's feature extraction and sparse matrix learning;

5) self-supervision learning:

the self-supervision structure is realized mainly by adding a classification network from the field of supervised learning; because the convolutional subspace clustering network can reconstruct the original data well, the extracted data features, namely the sparse representation layer, contain enough information to predict the labels of the data sample points; a classification network is therefore added after the sparse representation layer, and the pseudo labels generated by the clustering module in the previous step are used as the expected classification results, so that the learning of the feature extraction network and of the subspace clustering network is supervised;

6) identifying the type of network traffic:

determining the mapping between the clusters obtained in step 5) and specific network application types by maximum likelihood estimation, and thereby identifying the network traffic type.

The data set in step 1) is the UNB ISCX network traffic data set, which was collected from 13 applications belonging to the major categories of mail, instant messaging, streaming media, file transfer, VoIP and P2P. The specific application types in the data set are FileZilla, Hangouts, Skype, AIM, Facebook Chat, Gmail Chat, Mail, Torrent, Vimeo, YouTube, ICQ, Hangouts Audio and Skype Audio.

In step 1), the preprocessing applies flow filtering and flow cleaning to the UNB ISCX network traffic data set and maps the feature attributes of each flow record to the same number of pixels, thereby converting raw data that is noisy, incomplete and inconsistent into suitable input data.

The autoencoder used in step 2) has an overall shape that is wide at both ends and narrow in the middle. It consists of an encoder and a decoder: the encoder maps the original data space to a latent space, and the decoder reconstructs the original data space from the latent space. The invention adopts a convolutional autoencoder, that is, the layers stacked in the encoder are convolutional layers and the layers stacked in the decoder are deconvolution layers. After the network parameters of the autoencoder are randomly initialized, the data to be analyzed is input into the network, which is pre-trained layer by layer.

Before the convolutional subspace clustering network is trained in step 3), its autoencoder part is initialized with the autoencoder parameters learned in the previous step, and the whole network is then trained further until it converges.

The pseudo labels constructed in step 4) are produced by adding a clustering module to the convolutional subspace clustering network. When the clustering module constructs the similarity matrix, it uses cosine similarity to measure the distance between two vectors, and it then performs clustering with a spectral clustering algorithm, in which the data is converted into a graph.

The self-supervision in step 5) is realized by adding a classification module after the sparse representation of the data learned in the convolutional subspace clustering network. The classification module uses a classification network from the field of traditional supervised learning to classify the data, and the pseudo labels generated by the clustering module are used to calculate the error between the classification result and the expected label, so that the self-supervision takes effect through back-propagation in the neural network.

In step 6), the network traffic type is identified by determining the mapping between the clusters and specific network application types by maximum likelihood estimation. Let $B = \{b_1, b_2, \ldots, b_n\}$ be the set of clusters obtained by clustering the data set, where $n$ is the number of clusters, and let $D = \{d_1, d_2, \ldots, d_m\}$ be the set of network traffic types to be identified, where $m$ is the number of application types and $m \le n$. The mapping $f$ is established by maximum likelihood estimation using the probability formula $P(d_j \mid b_i) = h_{ji}/h_i$, where $1 \le j \le m$ and $1 \le i \le n$. Here $h_{ji}$ is the number of data flows in cluster $b_i$ that have been labeled as network application type $d_j$, $h_i$ is the total number of data objects in cluster $b_i$, and $P(d_j \mid b_i)$ is the probability of mapping cluster $b_i$ to the specific network application type $d_j$. The probability matching function between data traffic and network application types is defined as follows.

Let the lower probability limit of the maximum likelihood estimation be $x$. Using the above formula, when the proportion of known-type sample objects labeled as the specific network application type $d_j$ among all data objects in cluster $b_i$ is the largest, and this value exceeds the lower limit $x$, the data flows of the cluster are identified as data of network application type $d_j$. If the value does not reach the lower limit $x$, the network traffic type corresponding to the cluster is marked as an unknown network application type, that is, the identified traffic is unknown traffic. This completes the network traffic identification.
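The matching rule described in this step can be written compactly as below. This is a reconstruction from the surrounding definitions, not a formula quoted from the original; the symbol $j^*$ is introduced here for illustration, and treating a value exactly equal to $x$ as "unknown" is an assumption.

```latex
P(d_j \mid b_i) = \frac{h_{ji}}{h_i}, \qquad 1 \le j \le m,\quad 1 \le i \le n

j^* = \arg\max_{1 \le j \le m} P(d_j \mid b_i)

f(b_i) =
\begin{cases}
d_{j^*}, & \text{if } P(d_{j^*} \mid b_i) > x, \\
\text{unknown}, & \text{otherwise.}
\end{cases}
```

As an illustrative calculation with made-up numbers: if a cluster holds $h_i = 50$ flows, of which 40 are labeled Skype and 5 are labeled ICQ, then $P(\text{Skype} \mid b_i) = 0.8$ and $P(\text{ICQ} \mid b_i) = 0.1$; with $x = 0.6$ the cluster is mapped to Skype.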

The invention has the beneficial effects that:

The invention uses a convolutional subspace clustering network to learn a representation of network traffic data and to reduce the dimensionality of the original data. Deep-learning-based clustering algorithms suffer from two problems: representation learning and clustering are carried out as separate processes, and the intrinsic information of the samples is not used effectively during training. To address them, the invention introduces a self-supervised method. The learning of the sparse representation and the clustering process are thereby combined, and the effect of network traffic identification is improved.

Drawings

FIG. 1 is a general flow diagram of the present invention.

FIG. 2 is a block diagram of traffic identification according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples.

FIG. 1 shows the six steps of network traffic identification based on the self-supervised convolutional subspace clustering network: data preprocessing; initializing and pre-training the autoencoder; training the convolutional subspace clustering network and learning a sparse representation matrix of the data; constructing pseudo labels; self-supervised learning; and identifying the network traffic type.

FIG. 2 shows the framework in which the invention identifies network traffic types through self-supervised learning: a self-supervised method is added to the convolutional subspace clustering network, and the constructed classification module and clustering module realize the self-supervision of the network and optimize the overall performance of the method.

The invention provides a network traffic identification method based on a self-supervised convolutional subspace clustering network, which comprises the following steps:

Step one: data preprocessing. The data set selected by the invention is the UNB ISCX network traffic data set, which was collected from 13 applications belonging to the major categories of mail, instant messaging, streaming media, file transfer, VoIP and P2P. The specific application types in the data set are FileZilla, Hangouts, Skype, AIM, Facebook Chat, Gmail Chat, Mail, Torrent, Vimeo, YouTube, ICQ, Hangouts Audio and Skype Audio. The method applies flow filtering and flow cleaning to the UNB ISCX network traffic data set and maps the feature attributes of each flow record to the same number of pixels, thereby converting raw data that is noisy, incomplete and inconsistent into input data suitable for the model of the invention.
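The mapping of each flow record to a fixed number of pixels can be sketched as follows. The text does not specify the image size, the padding rule or the scaling, so the function name `flow_to_image`, the `side` parameter, zero-padding and min-max scaling are all assumptions made for illustration.

```python
import numpy as np

def flow_to_image(features, side=8):
    """Map one flow record's feature vector to a side x side grayscale image.

    Hypothetical helper: the fixed size, zero-padding and min-max scaling
    are illustrative choices, not details given in the text.
    """
    x = np.asarray(features, dtype=np.float64).ravel()
    n_pixels = side * side
    # Truncate long records and zero-pad short ones so every flow has the same length.
    x = x[:n_pixels]
    x = np.pad(x, (0, n_pixels - x.size))
    # Min-max scale the padded vector to [0, 1]; a constant vector maps to all zeros.
    lo, hi = x.min(), x.max()
    if hi > lo:
        x = (x - lo) / (hi - lo)
    else:
        x = np.zeros_like(x)
    return x.reshape(side, side)
```

In practice the scaling would more likely be fitted per feature over the whole data set; the per-record version above only keeps the sketch self-contained.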

Step two: initializing and pre-training the autoencoder. The autoencoder has an overall shape that is wide at both ends and narrow in the middle. It consists of an encoder and a decoder: the encoder maps the original data space to a latent space, and the decoder reconstructs the original data space from the latent space. The invention adopts a convolutional autoencoder, that is, the layers stacked in the encoder are convolutional layers and the layers stacked in the decoder are deconvolution layers. After the network parameters of the autoencoder are randomly initialized, the data to be analyzed is input into the network, which is pre-trained layer by layer.
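The "wide at both ends, narrow in the middle" shape can be illustrated by tracing the spatial size of the data through a stack of strided convolutions followed by mirrored deconvolutions. The sketch below uses the standard output-size formulas for convolution and transposed convolution; the layer count, kernel size, stride and padding are assumed values, since the text gives none.

```python
def conv_out(n, kernel, stride, padding=0):
    """Spatial size after a convolution layer."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel, stride, padding=0, output_padding=0):
    """Spatial size after a transposed-convolution (deconvolution) layer."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

def autoencoder_sizes(n, n_layers, kernel=3, stride=2, padding=1, output_padding=1):
    """Trace the spatial size through the encoder and then the mirrored decoder."""
    sizes = [n]
    for _ in range(n_layers):          # encoder: each layer halves the size
        sizes.append(conv_out(sizes[-1], kernel, stride, padding))
    for _ in range(n_layers):          # decoder: each layer doubles it again
        sizes.append(deconv_out(sizes[-1], kernel, stride, padding, output_padding))
    return sizes

print(autoencoder_sizes(32, 3))  # [32, 16, 8, 4, 8, 16, 32]
```

With these assumed hyperparameters the decoder restores exactly the input size, which is what allows a reconstruction loss to be computed between input and output.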

Step three: training the convolutional subspace clustering network and learning a sparse representation matrix of the data. Before the convolutional subspace clustering network is trained, its autoencoder part is initialized with the autoencoder parameters learned in the previous step. The original data is then input into the network, and the whole network is trained further until it converges.
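The text does not state the objective used to learn the sparse representation matrix. A minimal sketch, assuming the self-expressive formulation commonly used in subspace clustering (each latent sample is reconstructed as a sparse combination of the other samples), is shown below. The names `Z`, `C`, `lam` and the proximal-gradient solver are assumptions; in the described network the matrix would be trained jointly with the autoencoder rather than solved separately on fixed features.

```python
import numpy as np

def self_expressive_loss(Z, C, lam):
    """0.5 * ||Z - C Z||_F^2 + lam * ||C||_1 for latent features Z (one row per sample)."""
    residual = Z - C @ Z
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(C))

def learn_sparse_representation(Z, lam=0.1, n_iter=200):
    """Proximal-gradient (ISTA) estimate of a sparse coefficient matrix C with zero diagonal."""
    n = Z.shape[0]
    G = Z @ Z.T                                    # Gram matrix of the latent features
    step = 1.0 / (np.linalg.norm(G, 2) + 1e-12)    # 1 / Lipschitz constant of the smooth term
    C = np.zeros((n, n))
    for _ in range(n_iter):
        grad = (C @ Z - Z) @ Z.T                   # gradient of the reconstruction term
        C = C - step * grad
        # Soft-thresholding is the proximal step for the L1 penalty.
        C = np.sign(C) * np.maximum(np.abs(C) - step * lam, 0.0)
        np.fill_diagonal(C, 0.0)                   # a sample must not represent itself
    return C
```

For samples drawn from independent subspaces, the learned `C` tends to be block-structured: coefficients linking samples from different subspaces stay at or near zero, which is what makes `C` useful for building a similarity matrix in the next step.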

Step four: constructing pseudo labels. A similarity matrix is constructed from the sparse representation matrix learned by the convolutional subspace clustering network, and spectral clustering is then applied to the similarity matrix to obtain a cluster partition of the data samples. The cluster partition obtained by spectral clustering can be used as pseudo labels for the data set. Although these labels are not correct for every sample, they still contain useful information provided the network has been sufficiently pre-trained, so the pseudo labels generated by clustering can be used to supervise the network's feature extraction and sparse matrix learning. The pseudo labels are produced by adding a clustering module to the convolutional subspace clustering network. When the clustering module constructs the similarity matrix, it uses cosine similarity to measure the distance between two vectors, and it then performs clustering with a spectral clustering algorithm, in which the data is converted into a graph.
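The pseudo-label construction can be sketched with NumPy alone: cosine similarity between rows of the sparse representation, a k-nearest-neighbour graph (the KNN graph is the construction named in claim 6), a normalized graph Laplacian, and k-means on its leading eigenvectors. The function names, the choice of the symmetric normalized Laplacian, the clamping of negative similarities and the deterministic k-means initialisation are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(C):
    """Pairwise cosine similarity between the rows of C."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    unit = C / np.maximum(norms, 1e-12)
    return unit @ unit.T

def knn_graph(S, k):
    """Symmetric k-nearest-neighbour graph built from a similarity matrix (requires k < n)."""
    n = S.shape[0]
    S = S.copy()
    np.fill_diagonal(S, -np.inf)              # exclude self-matches
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(S[i])[-k:]          # indices of the k most similar samples
        W[i, nbrs] = np.maximum(S[i, nbrs], 0.0)
    return np.maximum(W, W.T)                 # keep an edge if either endpoint selected it

def kmeans(X, n_clusters, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def spectral_pseudo_labels(C, n_clusters, k=3):
    """Cluster the rows of C and return one pseudo label per sample."""
    W = knn_graph(cosine_similarity_matrix(C), k)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    U = vecs[:, :n_clusters]                  # eigenvectors of the smallest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return kmeans(U, n_clusters)
```

The returned cluster indices are arbitrary identifiers; they become meaningful only once clusters are matched to application types in step six.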

Step five: self-supervised learning. The self-supervision structure is realized mainly by adding a classification network from the field of supervised learning. Because the convolutional subspace clustering network can reconstruct the original data well, the extracted data features, namely the sparse representation layer, contain enough information to predict the labels of the data sample points. A classification network is therefore added after the sparse representation layer. The classification module uses a classification network from the field of traditional supervised learning to classify the data, and the pseudo labels generated by the clustering module in the previous step serve as the expected classification results. The error between the classification result and the expected label can then be calculated, so that the self-supervision takes effect through back-propagation in the neural network and continuously supervises the learning of the feature extraction network and of the subspace clustering network.
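The error between the classification result and the pseudo labels can be illustrated with a softmax classifier trained by cross-entropy. The sketch below is deliberately reduced: it updates only a linear classifier on fixed features, whereas the described method back-propagates the error through the whole network. The cross-entropy loss, the function names and the learning rate are assumptions, since the text does not name a specific loss.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pseudo_label_loss_and_grad(features, W, b, pseudo_labels):
    """Mean cross-entropy of a linear softmax classifier against pseudo labels, with gradients."""
    n = features.shape[0]
    probs = softmax(features @ W + b)
    loss = -np.mean(np.log(probs[np.arange(n), pseudo_labels] + 1e-12))
    delta = probs.copy()
    delta[np.arange(n), pseudo_labels] -= 1.0        # softmax output minus one-hot target
    delta /= n
    return loss, features.T @ delta, delta.sum(axis=0)

def train_classifier(features, pseudo_labels, n_classes, lr=0.5, n_steps=100):
    """Gradient descent on the pseudo-label loss; returns weights, bias and loss history."""
    W = np.zeros((features.shape[1], n_classes))
    b = np.zeros(n_classes)
    losses = []
    for _ in range(n_steps):
        loss, grad_W, grad_b = pseudo_label_loss_and_grad(features, W, b, pseudo_labels)
        losses.append(loss)
        W -= lr * grad_W
        b -= lr * grad_b
    return W, b, losses
```

In the full model the same gradient would continue past the classifier into the sparse representation layer and the encoder, which is how the pseudo labels end up shaping the learned features.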

Step six: identifying the network traffic type. The mapping between the clusters and specific network application types is determined by maximum likelihood estimation, and the network traffic type is identified accordingly. Let $B = \{b_1, b_2, \ldots, b_n\}$ be the set of clusters obtained by clustering the data set, where $n$ is the number of clusters, and let $D = \{d_1, d_2, \ldots, d_m\}$ be the set of network traffic types to be identified, where $m$ is the number of application types and $m \le n$. A mapping $f: B \to D$ exists in the classification of the data set, and $f$ is established by maximum likelihood estimation. The probability formula used is $P(d_j \mid b_i) = h_{ji}/h_i$, where $1 \le j \le m$ and $1 \le i \le n$. Here $h_{ji}$ is the number of data flows in cluster $b_i$ that have been labeled as network application type $d_j$, $h_i$ is the total number of data objects in cluster $b_i$, and $P(d_j \mid b_i)$ is the probability of mapping cluster $b_i$ to the specific network application type $d_j$. The probability matching function between data traffic and network application types is defined as follows.

Let the lower probability limit of the maximum likelihood estimation be $x$. Using the above formula, when the proportion of known-type sample objects labeled as the specific network application type $d_j$ among all data objects in cluster $b_i$ is the largest, and this value exceeds the lower limit $x$, the data flows of the cluster are identified as data of network application type $d_j$. If the value does not reach the lower limit $x$, the network traffic type corresponding to the cluster is marked as an unknown network application type, that is, the identified traffic is unknown traffic. This completes the network traffic identification.
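The cluster-to-type matching can be sketched directly from the definitions of $h_{ji}$, $h_i$ and $x$. The function name, the use of `None` for flows without a known type, and the string `"unknown"` are illustrative assumptions; treating a value exactly equal to $x$ as unknown is also an assumption, since the text only says the value must exceed the limit.

```python
from collections import Counter

def map_clusters_to_types(cluster_ids, known_types, x):
    """Map each cluster b_i to the type d_j maximising h_ji / h_i, or to 'unknown'.

    cluster_ids: cluster index of every flow.
    known_types: application type of every flow, or None if the flow is unlabeled.
    x: lower probability limit.
    """
    sizes = Counter(cluster_ids)                      # h_i: all flows in each cluster
    counts = {}                                       # h_ji: labeled flows per cluster and type
    for cid, app in zip(cluster_ids, known_types):
        if app is not None:
            counts.setdefault(cid, Counter())[app] += 1
    mapping = {}
    for cid, h_i in sizes.items():
        best_type, best_p = "unknown", 0.0
        for app, h_ji in counts.get(cid, {}).items():
            p = h_ji / h_i                            # P(d_j | b_i)
            if p > best_p:
                best_type, best_p = app, p
        # Accept the best type only if its probability exceeds the lower limit.
        mapping[cid] = best_type if best_p > x else "unknown"
    return mapping
```

For example, with five flows in cluster 0 of which four are labeled Skype, and four flows in cluster 1 of which one is labeled ICQ, a limit of `x = 0.5` maps cluster 0 to Skype (0.8) and cluster 1 to unknown (0.25).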

The identification performance of the method is evaluated mainly by the single-application identification accuracy and the overall identification accuracy. The single-application identification accuracy is the ratio of the number of flows correctly identified as a given application type to the number of flows judged to be that application type. The overall identification accuracy is the ratio of the number of flows correctly identified as their corresponding application type to the total number of flows in the target data set. The higher these two indicators, the better the network traffic identification performance of the method.
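Both indicators follow directly from these definitions and can be computed as in the sketch below. The function name and the list-based inputs are illustrative assumptions.

```python
def identification_accuracy(true_types, predicted_types):
    """Return (single-application accuracy per type, overall accuracy).

    Single-application accuracy: flows correctly identified as a type divided by
    all flows judged to be that type. Overall accuracy: correctly identified
    flows divided by all flows.
    """
    pairs = list(zip(true_types, predicted_types))
    per_app = {}
    for app in set(predicted_types):
        judged = [t for t, p in pairs if p == app]    # flows judged to be this type
        per_app[app] = sum(t == app for t in judged) / len(judged)
    overall = sum(t == p for t, p in pairs) / len(pairs)
    return per_app, overall
```

With true types `["Skype", "Skype", "ICQ", "ICQ"]` and predictions `["Skype", "ICQ", "ICQ", "ICQ"]`, the single-application accuracy is 1.0 for Skype and 2/3 for ICQ, and the overall accuracy is 0.75.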

The invention has the following characteristics:

1. The method does not depend on the payload carried in data frames and identifies encrypted traffic and similar traffic well;

2. The introduced self-supervised learning method allows a proportion of pseudo-labeled data samples to guide the mapping of the whole model effectively, which improves network traffic identification.

The convolutional subspace clustering network not only reduces the dimensionality of the input data effectively, but also learns the implicit characteristics of the analyzed data, for example by adjusting the number of layers of the neural network and optimizing the training process, and recovers the data through the reconstruction process. In self-supervised learning the labels come from the data itself: the neural network is optimized by constructing a self-supervised task and target, which supervises the learning process, improves the quality of the learned representation and improves the result of the subsequent task.

For network traffic identification, the invention designs a method based on a self-supervised convolutional subspace clustering network. By combining the convolutional subspace clustering network with self-supervised learning, it addresses the dependence on the payload of data frames and the difficulty of identifying encrypted traffic, and completes the traffic identification task effectively.

Claims

1. A network traffic identification method based on a self-supervised convolutional subspace clustering network, characterized by comprising the following steps:

1) data preprocessing:

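2) initializing and pre-training the autoencoder:

3) training a convolutional subspace clustering network and learning a sparse representation matrix of the data:

4) constructing a pseudo label: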

5) self-supervision learning:

6) identifying the type of network traffic:

2. The network traffic identification method based on a self-supervised convolutional subspace clustering network according to claim 1, wherein the data set in step 1) is the UNB ISCX network traffic data set, which was collected from 13 applications belonging to the major categories of mail, instant messaging, streaming media, file transfer, VoIP and P2P, and the specific application types in the data set are FileZilla, Hangouts, Skype, AIM, Facebook Chat, Gmail Chat, Mail, Torrent, Vimeo, YouTube, ICQ, Hangouts Audio and Skype Audio.

3. The network traffic identification method based on a self-supervised convolutional subspace clustering network according to claim 1, wherein the preprocessing in step 1) applies flow filtering and flow cleaning to the UNB ISCX network traffic data set and maps the feature attributes of each flow record to the same number of pixels, thereby converting raw data that is noisy, incomplete and inconsistent into suitable input data.

4. The network traffic identification method based on a self-supervised convolutional subspace clustering network according to claim 1, wherein the autoencoder used in step 2) has an overall shape that is wide at both ends and narrow in the middle and consists of an encoder and a decoder, that is, a network that maps the original data space to a latent space and then reconstructs the original data space from the latent space.

5. The network traffic identification method based on a self-supervised convolutional subspace clustering network according to claim 1, wherein before the convolutional subspace clustering network is trained in step 3), its autoencoder part is initialized with the autoencoder parameters learned in the previous step, and the whole network is then trained further until it converges.

6. The network traffic identification method based on a self-supervised convolutional subspace clustering network according to claim 1, wherein the pseudo labels constructed in step 4) are produced by adding a clustering module to the convolutional subspace clustering network; the clustering module uses cosine similarity to measure the distance between two vectors when constructing the similarity matrix and then performs clustering with a spectral clustering algorithm, in which the data is converted into a graph, a graph data structure is constructed by the KNN method, and spectral clustering is performed on that graph data structure; and in the convolutional subspace clustering network, the clustering module generates the pseudo labels from the sparse representation of the data obtained in the training stage.

7. The network traffic identification method based on a self-supervised convolutional subspace clustering network according to claim 1, wherein the self-supervision in step 5) is realized by adding a classification module after the sparse representation of the data learned in the convolutional subspace clustering network; the classification module uses a classification network from the field of traditional supervised learning to classify the data, and the pseudo labels generated by the clustering module are used to calculate the error between the classification result and the expected label, so that the self-supervision takes effect through back-propagation in the neural network.

8. The network traffic identification method based on a self-supervised convolutional subspace clustering network according to claim 1, wherein in step 6) the network traffic type is identified by determining the mapping between the clusters and specific network application types by maximum likelihood estimation; $B = \{b_1, b_2, \ldots, b_n\}$ is the set of clusters obtained by clustering the data set, where $n$ is the number of clusters; $D = \{d_1, d_2, \ldots, d_m\}$ is the set of network traffic types to be identified, where $m$ is the number of application types and $m \le n$; the mapping $f$ is established by maximum likelihood estimation using the probability formula $P(d_j \mid b_i) = h_{ji}/h_i$, where $1 \le j \le m$ and $1 \le i \le n$, $h_{ji}$ is the number of data flows in cluster $b_i$ that have been labeled as network application type $d_j$, $h_i$ is the total number of data objects in cluster $b_i$, and $P(d_j \mid b_i)$ is the probability of mapping cluster $b_i$ to the specific network application type $d_j$; and the probability matching function between data traffic and network application types is defined as follows:

the lower probability limit of the maximum likelihood estimation is set to $x$; using the above formula, when the proportion of known-type sample objects labeled as the specific network application type $d_j$ among all data objects in cluster $b_i$ is the largest, and this value exceeds the lower limit $x$, the data flows of the cluster are identified as data of network application type $d_j$; if the value does not reach the lower limit $x$, the network traffic type corresponding to the cluster is marked as an unknown network application type, that is, the identified traffic is unknown traffic, which completes the network traffic identification.