CN114510615A - A fine-grained encrypted website fingerprint classification method and device based on graph attention pooling network - Google Patents
A fine-grained encrypted website fingerprint classification method and device based on graph attention pooling network
- Publication number: CN114510615A (application CN202111191717.5A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/906 — Clustering; Classification (information retrieval; details of database functions independent of the retrieved data types)
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415 — Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045 — Neural networks; combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
Abstract
Description
Technical Field
The present invention relates to a fine-grained encrypted website fingerprint classification method and device based on a graph attention pooling network, belonging to the field of computer software technology.
Background
As public awareness of network security grows, encryption protocols such as HTTPS are widely deployed on websites of all kinds. While protecting data privacy, these protocols pose great challenges to network management tasks such as QoS and the tracing of malicious behavior. In recent years, with the revival and development of artificial intelligence, website fingerprinting, which uses machine learning algorithms to identify specific web pages within encrypted traffic, has become a hot research topic in network security.
Early research mined effective statistical features from encrypted traffic from multiple angles, such as packet size and packet inter-arrival time, and used traditional machine learning algorithms such as K-nearest neighbors, support vector machines, and random forests as classifiers, achieving good performance. In recent years, driven by the rapid development of deep learning, some studies have used models such as convolutional and recurrent neural networks to automatically extract effective features from encrypted traffic and achieve high-accuracy website fingerprint classification. These deep learning methods perform better and no longer require complex manual feature selection, and have therefore become the mainstream website fingerprinting approach.
However, existing research mainly targets the scenario of classifying website homepages. In practice, users do not visit only homepages: many websites contain numerous sub-pages, and these pages represent different services and network behaviors. Fine-grained webpage classification is therefore equally important for network management tasks such as QoS. But because different pages within the same website often share similar layouts and content, the traffic features extracted manually or automatically by traditional methods are no longer clearly distinguishable, which degrades the performance of existing methods.
A small number of studies have proposed improving fine-grained webpage classification by combining global and local traffic features: global features such as the total byte and packet counts of a network flow, time-slice features such as the byte and packet counts of each time slice, and local features such as the packet-length sequences of the preceding and following n packets. Compared with the global features used in the traditional homepage fingerprinting scenario, these local features capture well the small differences between fine-grained webpage traffic. However, these methods rely on machine learning algorithms that require complex manual feature selection and extraction. Deep-learning-based website fingerprinting methods, while able to learn latent traffic patterns automatically, struggle to discover fine-grained local feature differences, and their performance also drops sharply in fine-grained scenarios. Additional knowledge is therefore needed to help deep learning models learn the small feature differences between similar samples.
Summary of the Invention
The present invention aims to provide a method and device for effectively solving the fine-grained encrypted website fingerprinting problem. The invention is a deep learning algorithm that requires no complex manual feature selection and can learn the global and local features of network traffic simultaneously.
The invention proposes a fine-grained encrypted website fingerprint classification method based on a graph attention pooling network. A graph structure describes website access traffic: the set of nodes represents global traffic information, edges represent the local contextual information of network flows, and a graph neural network algorithm automatically learns an effective graph representation. In webpage access traffic, different flows correspond to different types of resource requests; the method can automatically learn the importance of each flow for the final classification, making it interpretable. The method also has the advantage of better classification performance when training samples are scarce.
Specifically, the technical scheme adopted by the present invention is as follows:
A fine-grained encrypted website fingerprint classification method based on a graph attention pooling network, comprising the following steps:
building a traffic trace graph that describes network traffic patterns, where nodes in the graph represent network flows and edges represent the contextual relationships between flows;
using a graph neural network model to automatically learn the intra-flow and inter-flow features in the traffic trace graph, obtaining an effective embedded representation of the graph;
performing website fingerprint classification using the effective embedded representation of the traffic trace graph.
Further, in the traffic trace graph, for two flows generated by the same client, whether the two corresponding nodes are connected by an edge is determined by whether the interval between the start times of the two flows is smaller than an empirical threshold.
Further, using the graph neural network model to automatically learn the intra-flow and inter-flow features of the traffic trace graph and obtain its effective embedded representation comprises:
using a multi-head graph attention layer to learn node attention weights, so that the model focuses more on the important nodes in the traffic trace graph and the negative influence of inter-class similar nodes and noise nodes is reduced;
using a self-attention pooling layer to further select important nodes while reducing the number of model parameters.
Further, in the multi-head graph attention layer, the traffic trace graph first passes through a single-layer fully connected network to extract a shallow abstract representation; a K-head graph attention network then learns node attention weights, producing K node representations, which are summed and fed into the self-attention pooling layer.
Further, the self-attention pooling layer uses a graph convolutional network to compute the importance of each node and retains the top-K nodes, so as to further select important nodes while reducing the number of model parameters.
Further, global max pooling and global mean pooling are applied to the top-K node graph, and the two pooling results are concatenated to obtain a global embedded representation of the graph, which serves as the output of one convolution block; finally, the outputs of the two convolution blocks are concatenated to obtain the final effective embedding.
Further, performing website fingerprint classification using the effective embedded representation of the traffic trace graph comprises: using a single-layer fully connected network combined with a Log Softmax function as the classifier to obtain the webpage classification result, with Dropout used to prevent overfitting during training and NLLLoss used as the loss function.
A fine-grained encrypted website fingerprint classification device based on a graph attention pooling network, comprising:
a graph construction module for building a traffic trace graph that describes network traffic patterns, where nodes represent network flows and edges represent the contextual relationships between flows;
a graph attention hierarchical pooling module for using a graph neural network model to automatically learn the intra-flow and inter-flow features in the traffic trace graph and obtain its effective embedded representation;
an output module for performing website fingerprint classification using the effective embedded representation of the traffic trace graph.
The key points of the present invention are:
1. For the fine-grained website fingerprinting problem, an encrypted website fingerprint classification method based on a graph attention pooling network is proposed. The method uses a traffic trace graph to represent the contextual relationships between flows in webpage access traffic, capturing the global and local characteristics of the traffic simultaneously, and uses a graph neural network model to automatically learn the intra-flow and inter-flow features of the trace graph, finally obtaining an effective embedded representation of the graph.
2. A multi-head graph attention mechanism learns node attention weights, making the model focus more on important nodes in the traffic trace graph and reducing the negative influence of inter-class similar nodes and noise nodes. A self-attention pooling module further selects important nodes while reducing the number of model parameters.
3. The method adapts automatically to website fingerprinting scenarios of multiple granularities and achieves the best classification performance on datasets of various granularities. At the same time, it attains good classification performance with relatively few training samples.
The solution of the present invention to the fine-grained website fingerprinting problem has the following characteristics and beneficial effects:
1. A traffic trace graph that reasonably describes network traffic patterns is proposed: nodes represent network flows, and edge information represents the contextual relationships between flows.
2. Being based on a graph neural network algorithm, the method requires no complex manual feature selection and can effectively learn the global and local features of network traffic simultaneously.
3. The method automatically learns to focus on important flow nodes and reduces the negative influence of inter-class similar flow nodes and noise flow nodes.
4. It suits website fingerprinting scenarios of multiple granularities, with better performance and fewer required training samples.
Description of Drawings
Fig. 1 is a diagram of the basic framework of the method of the present invention.
Detailed Description
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below through specific embodiments and the accompanying drawing.
Fig. 1 shows the basic framework of the method, which comprises three stages: the left side shows the graph construction stage, the middle the training stage, and the right side the classification stage. The training stage is the graph attention hierarchical pooling module, which contains two convolution blocks, each with three sub-layers. The graph construction module and the graph attention hierarchical pooling module are the key technologies of the invention. The scheme of the invention comprises the following technical steps:
1. Graph construction stage:
(1) Data preparation: collect SSL/TLS-encrypted network traffic packets in the current task scenario (SSL/TLS indicating that the website uses the HTTPS encryption protocol), label them, and divide the dataset in a fixed proportion, for example training set : validation set : test set = 6 : 1 : 3. The training set is used to iteratively optimize the model parameters during training; the validation set helps monitor whether the model is overfitting or has met the expected requirements, so as to decide when to stop training; and the test set is used to evaluate the classification performance of the model on real network traffic.
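The 6 : 1 : 3 split described above can be sketched as follows (a minimal illustration; the sample representation is arbitrary and not prescribed by the patent):

```python
import random

def split_dataset(samples, ratios=(6, 1, 3), seed=0):
    """Shuffle labeled samples and split them into train/validation/test
    sets according to the given ratio (6:1:3 in the embodiment)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    total = sum(ratios)
    n = len(samples)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
```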
(2) Flow generation: split the packets from (1) into flows by 5-tuple, and extract each flow's start time, packet sizes (packet lengths), and packet inter-arrival times, to be used in (3) for building node and edge information. The 5-tuple is the source IP address, destination IP address, source port number, destination port number, and protocol type.
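The flow-generation step can be sketched as follows; the packet dictionary layout (`src`, `dst`, `sport`, `dport`, `proto`, `time`, `size`) is an assumption for illustration, not the patent's data format:

```python
from collections import defaultdict

def group_into_flows(packets):
    """Group packets into flows by the 5-tuple (src IP, dst IP, src port,
    dst port, protocol) and extract, per flow: the start time, the
    packet-length sequence, and the inter-arrival-time sequence."""
    flows = defaultdict(list)
    for p in packets:
        key = (p['src'], p['dst'], p['sport'], p['dport'], p['proto'])
        flows[key].append(p)
    result = {}
    for key, pkts in flows.items():
        pkts.sort(key=lambda p: p['time'])  # order packets within the flow
        times = [p['time'] for p in pkts]
        result[key] = {
            'start_time': times[0],
            'lengths': [p['size'] for p in pkts],
            'iats': [t2 - t1 for t1, t2 in zip(times, times[1:])],
        }
    return result
```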
(3) Traffic trace graph construction: each flow from (2) becomes a node in the graph, with the flow's packet-length sequence and packet inter-arrival-time sequence as node features. For flows generated by the same client, whether two nodes are connected by an edge depends on whether the interval between the start times of the two flows is smaller than an empirical threshold, which the invention sets to 2 s. As shown in Fig. 1, ClientIP_1 == ClientIP_2 means the two flows come from the same client; if their start-time interval |t_1 - t_2| is no greater than the empirical threshold t_threshold, an edge is added between the two nodes v_1 and v_2. In Fig. 1, A denotes the adjacency matrix of the graph and X the feature matrix.
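The edge rule above (same client, start-time gap at most the 2 s threshold) can be sketched as follows; the flow dictionary layout is an assumption carried over from the flow-generation step:

```python
def build_trace_graph(flows, t_threshold=2.0):
    """Build the traffic trace graph: one node per flow; an edge joins two
    flows of the same client whose start times differ by at most
    t_threshold seconds (2 s in the embodiment). Returns the node feature
    matrix X (packet-length plus inter-arrival sequences, a simplified
    feature layout) and the edge list."""
    edges = []
    for i, fi in enumerate(flows):
        for j in range(i + 1, len(flows)):
            fj = flows[j]
            if (fi['client'] == fj['client']
                    and abs(fi['start_time'] - fj['start_time']) <= t_threshold):
                edges.append((i, j))
    X = [f['lengths'] + f['iats'] for f in flows]
    return X, edges
```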
2. Training stage, consisting mainly of the graph attention hierarchical pooling module, which contains three sub-layers:
(1) Multi-head graph attention layer: the traffic trace graph first passes through a single-layer fully connected network that extracts a shallow abstract representation; a K-head graph attention network (GAT) then learns node attention weights, producing K node representations, which are summed and fed into (2). In Fig. 1, FC denotes a fully connected network and Elu the exponential linear unit activation function; the learned attention weights are also shown. K means that K mutually independent GAT operations are performed on the same traffic trace graph, yielding K different node representations. The purpose of the K computations is to learn more diverse weights and thus improve the learning capacity of the model.
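The attend-then-sum structure of the multi-head layer can be illustrated with a dependency-free sketch. The `score` function stands in for the learned GAT attention function a(h_i, h_j), and the uniform placeholder used in the usage example is an assumption; a trained model would learn these weights:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_aggregate(X, neighbors, score):
    """One attention head (simplified): for each node, softmax-normalize a
    pairwise score over its neighborhood (including itself) and output the
    attention-weighted sum of neighbor features."""
    out = []
    for i, xi in enumerate(X):
        nbrs = [i] + neighbors[i]
        alphas = softmax([score(X[i], X[j]) for j in nbrs])
        out.append([sum(a * X[j][d] for a, j in zip(alphas, nbrs))
                    for d in range(len(xi))])
    return out

def multi_head(X, neighbors, scores):
    """K heads run independently; their outputs are summed, matching the
    accumulation of the K node representations in the embodiment."""
    heads = [attention_aggregate(X, neighbors, s) for s in scores]
    return [[sum(h[i][d] for h in heads) for d in range(len(X[0]))]
            for i in range(len(X))]

X = [[1.0], [3.0]]
neighbors = [[1], [0]]              # one undirected edge: node 0 - node 1
uniform = lambda hi, hj: 0.0        # placeholder score: uniform attention
H = multi_head(X, neighbors, [uniform, uniform])  # K = 2 heads, summed
```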
(2) Self-attention pooling layer: this layer computes the importance of each node and retains the top-K nodes (the K nodes with the highest importance scores), further selecting important nodes while reducing the number of model parameters. In Fig. 1, Relu denotes the rectified linear unit activation function, GCN a graph convolutional network, and score_{v_i} the learned importance score of node v_i. Node importance is computed by a graph convolutional network (GCN); the graph convolution operation attends both to each node's own feature information and makes full use of the structural information of the graph.
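The top-K selection can be sketched as follows. The GCN-computed importance scores are supplied directly here, and scaling the kept features by their scores follows common SAGPool-style self-attention pooling as an assumption, not a detail stated in the patent:

```python
def topk_pool(X, scores, k):
    """Keep the k nodes with the highest importance scores (the top-K
    retention described in the embodiment) and scale each kept node's
    features by its score. Returns the kept node indices (in original
    order) and the pooled feature matrix."""
    order = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])
    pooled = [[scores[i] * v for v in X[i]] for i in keep]
    return keep, pooled
```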
(3) Readout layer: this layer applies global max pooling (MaxPool) and global mean pooling (MeanPool) to the top-K node graph from (2) and concatenates the two pooling results to obtain this layer's global embedded representation of the traffic trace graph, which serves as one convolution-block output (Readout) of the graph attention hierarchical pooling module. Finally, the outputs of the module's two convolution blocks are concatenated to obtain the final effective embedding of the sample. In Fig. 1, Jumping Knowledge denotes the jumping connection, i.e., the concatenation of the Readout outputs.
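The readout and jumping-connection concatenation above can be sketched as:

```python
def readout(X):
    """Concatenate global max pooling and global mean pooling over the
    node dimension to produce one graph-level embedding."""
    dim = len(X[0])
    maxp = [max(x[d] for x in X) for d in range(dim)]
    meanp = [sum(x[d] for x in X) / len(X) for d in range(dim)]
    return maxp + meanp

def jumping_knowledge(readouts):
    """Concatenate the readouts of the convolution blocks (the jumping
    connection in Fig. 1) into the final embedding."""
    return [v for r in readouts for v in r]
```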
3. Classification stage:
This stage is the output module: a single-layer fully connected network (FC) combined with a Log Softmax function serves as the classifier and produces the webpage classification result. Dropout is used to prevent overfitting during training, and NLLLoss is the loss function.
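The Log Softmax output and NLLLoss pairing can be illustrated numerically; this is a sketch of the standard definitions, not the patent's implementation, and the logits are made up:

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed stably via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the true class: with log-softmax inputs
    this is equivalent to cross-entropy."""
    return -log_probs[target]

lp = log_softmax([2.0, 1.0, 0.1])  # hypothetical classifier logits
loss = nll_loss(lp, 0)             # true class is index 0
```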
Examples of the present invention:
Example 1: traditional website fingerprinting, homepage classification scenario
In October 2020, the raw access traffic of the HTTPS websites in Alexa's China Top 100 was actively collected. Each webpage was visited 100 times, giving 10,000 traffic samples in total. After sample labeling and feature extraction, the dataset was divided in the proportion training set : validation set : test set = 6 : 1 : 3. After graph construction and training, the proposed algorithm achieved a high F1 score of 99.85% on the test set, an improvement of more than 1% over existing state-of-the-art work, showing that the invention is the best method in the traditional website homepage classification scenario.
Example 2: fine-grained webpage fingerprinting within a single website
In November 2020, the raw access traffic of 90 popular webpages from two representative HTTPS websites was collected: 60 webpages from website A and 30 from website B, each visited 100 times, for 9,000 samples in total. After labeling, feature extraction, and dataset division, the traffic samples of A and B were separately fed into the proposed graph attention pooling network model for training. The test-set results show that the invention achieves high F1 scores of 96.72% and 91.45% on the two datasets respectively, an improvement of 3% to 23% over existing state-of-the-art methods.
Example 3: fine-grained webpage fingerprinting across multiple websites
In December 2020, the raw access traffic of 100 popular webpages from nine representative HTTPS websites was actively collected, each webpage visited 100 times, for 10,000 traffic samples in total. The experimental results show that the invention achieves an F1 score of 93.37%, more than 14% higher than existing state-of-the-art methods.
Overall, the proposed graph attention pooling network model performs best in website fingerprinting scenarios of multiple granularities. The experiments also show that the invention needs only a small number of training samples to reach high accuracy: the homepage classification scenario reaches 99% accuracy with only 25 samples, and the single-website fine-grained webpage classification scenario reaches 90% accuracy with only 15 samples.
基于同一发明构思,本发明的另一个实施例提供一种基于图注意力池化网络的细粒度加密网站指纹分类装置,其包括:Based on the same inventive concept, another embodiment of the present invention provides a fine-grained encrypted website fingerprint classification device based on graph attention pooling network, which includes:
构图模块,用于建立用于描述网络流量模式的流量踪迹图,流量踪迹图中的节点表示网络流,边表示网络流的上下文关系;The composition module is used to establish a traffic trace graph for describing the network traffic pattern, the nodes in the traffic trace graph represent the network flow, and the edges represent the context relationship of the network flow;
图注意力层次池化模块,用于利用图神经网络模型自动学习流量踪迹图中的流内特征和流间特征,得到流量踪迹图的有效嵌入表示;The graph attention level pooling module is used to automatically learn the intra-flow features and inter-flow features in the traffic trace graph by using the graph neural network model, and obtain an effective embedded representation of the traffic trace graph;
输出模块,用于利用流量踪迹图的有效嵌入表示进行网站指纹分类。An output module for fingerprinting a website using an efficient embedded representation of the traffic trace graph.
其中各模块的具体实施过程参见前文对本发明方法的描述。For the specific implementation process of each module, refer to the foregoing description of the method of the present invention.
基于同一发明构思,本发明的另一实施例提供一种电子装置(计算机、服务器、智能手机等),其包括存储器和处理器,所述存储器存储计算机程序,所述计算机程序被配置为由所述处理器执行,所述计算机程序包括用于执行本发明方法中各步骤的指令。Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.), which includes a memory and a processor, the memory stores a computer program, and the computer program is configured to be The processor is executed, and the computer program includes instructions for performing the steps in the method of the present invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disc) storing a computer program which, when executed by a computer, implements the steps of the method of the present invention.
Other embodiments of the present invention:
1. Multi-head attention layer: in specific implementations, the method of the present invention may replace the multi-head attention layer with a graph convolutional network (GCN) or the like to reduce computational complexity while still achieving high accuracy.
2. Readout layer: in specific implementations, the method of the present invention may replace the concatenation operation in the readout layer with averaging, summation, or similar operations.
3. Output module: in specific implementations, the method of the present invention may customize the output layer structure according to the classification task, for example replacing the fully connected layer with a global average pooling layer.
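The readout variants mentioned in item 2 can be contrasted in a short sketch: concatenation preserves per-node detail but grows with graph size, while element-wise mean/sum keep a fixed-size graph embedding. The 3-dimensional node embeddings below are illustrative values only.

```python
# Readout layer variants: turn per-node embeddings into one
# graph-level vector. "concat" flattens all node embeddings;
# "sum"/"mean" aggregate element-wise to a fixed size.

def readout(node_embs, mode="concat"):
    if mode == "concat":
        return [x for emb in node_embs for x in emb]
    if mode == "sum":
        return [sum(col) for col in zip(*node_embs)]
    if mode == "mean":
        return [sum(col) / len(node_embs) for col in zip(*node_embs)]
    raise ValueError("unknown readout mode: %s" % mode)

embs = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(readout(embs, "concat"))  # → [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]
print(readout(embs, "mean"))    # → [2.0, 3.0, 4.0]
```

Mean/sum readout is what makes the graph embedding independent of the number of flows in a trace, which is one reason it can substitute for concatenation here.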
The specific embodiments of the present invention disclosed above are intended to help understand the content of the present invention and to implement it accordingly. Those of ordinary skill in the art will understand that various substitutions, changes, and modifications are possible without departing from the spirit and scope of the present invention. The present invention should not be limited to what is disclosed in the embodiments of this specification; the protection scope of the present invention is defined by the claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111191717.5A CN114510615A (en) | 2021-10-13 | 2021-10-13 | A fine-grained encrypted website fingerprint classification method and device based on graph attention pooling network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111191717.5A CN114510615A (en) | 2021-10-13 | 2021-10-13 | A fine-grained encrypted website fingerprint classification method and device based on graph attention pooling network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114510615A true CN114510615A (en) | 2022-05-17 |
Family
ID=81547910
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111191717.5A Pending CN114510615A (en) | 2021-10-13 | 2021-10-13 | A fine-grained encrypted website fingerprint classification method and device based on graph attention pooling network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114510615A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115580547A (en) * | 2022-11-21 | 2023-01-06 | 中国科学技术大学 | Website fingerprint identification method and system based on time-space correlation between network data streams |
CN116232681A (en) * | 2023-01-04 | 2023-06-06 | 东南大学 | Knowledge-based atlas and dynamic embedded learning novel blind network flow classification method |
CN117648623A (en) * | 2023-11-24 | 2024-03-05 | 成都理工大学 | Network classification algorithm based on pooling comparison learning |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111382780A (en) * | 2020-02-13 | 2020-07-07 | 中国科学院信息工程研究所 | Encryption website fine-grained classification method and device based on HTTP different versions |
- 2021-10-13: Application CN202111191717.5A filed in China (CN); published as CN114510615A; status: Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111382780A (en) * | 2020-02-13 | 2020-07-07 | 中国科学院信息工程研究所 | Encryption website fine-grained classification method and device based on HTTP different versions |
Non-Patent Citations (2)
Title |
---|
JIE LU et al.: "GAP-WF: Graph Attention Pooling Network for Fine-grained SSL/TLS Website Fingerprinting", 2021 International Joint Conference on Neural Networks (IJCNN), 20 September 2021 (2021-09-20), pages 1 * |
ZHANG Daowei; DUAN Haixin: "Website fingerprinting technology based on image texture", Computer Applications (计算机应用), no. 06, 22 January 2020 (2020-01-22) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112818257B (en) | Account detection method, device and equipment based on graph neural network | |
CN114510615A (en) | A fine-grained encrypted website fingerprint classification method and device based on graph attention pooling network | |
CN111462183A (en) | Behavior identification method and system based on attention mechanism double-current network | |
CN113989583A (en) | Method and system for detecting malicious traffic on the Internet | |
CN110572362A (en) | Network attack detection method and device for multi-type unbalanced abnormal traffic | |
CN110995652B (en) | Big data platform unknown threat detection method based on deep migration learning | |
CN113938290B (en) | Website de-anonymization method and system for user side flow data analysis | |
US12223709B2 (en) | Methods for more effectively moderating one or more images and devices thereof | |
Jiang et al. | FA-Net: More accurate encrypted network traffic classification based on burst with self-attention | |
CN115037543A (en) | An abnormal network traffic detection method based on bidirectional temporal convolutional neural network | |
CN116451138A (en) | Encryption traffic classification method, device and storage medium based on multi-modal learning | |
CN115567269A (en) | Internet of Things anomaly detection method and system based on federated learning and deep learning | |
Yujie et al. | End-to-end android malware classification based on pure traffic images | |
CN114884704A (en) | Network traffic abnormal behavior detection method and system based on involution and voting | |
CN108494620A (en) | Network service flow feature selecting and sorting technique based on multiple target Adaptive evolvement arithmetic | |
TWI591982B (en) | Network flow recognization method and recognization system | |
CN118094188A (en) | A small sample encrypted traffic classification model training method based on meta-learning | |
CN113255730B (en) | Distributed deep neural network structure conversion method based on split-fusion strategy | |
CN112733689B (en) | An HTTPS terminal type classification method and device | |
CN112817587B (en) | A mobile application behavior recognition method based on attention mechanism | |
CN114666391B (en) | Access trajectory determination methods, devices, equipment and storage media | |
Gu et al. | A practical multi-tab website fingerprinting attack | |
CN115361231A (en) | Access baseline-based host abnormal traffic detection method, system and equipment | |
CN111382780B (en) | Encryption website fine granularity classification method and device based on HTTP (hyper text transport protocol) different versions | |
CN115021986A (en) | A method and apparatus for constructing a deployable model for identification of IoT devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||