CN114510615A

CN114510615A - A fine-grained encrypted website fingerprint classification method and device based on graph attention pooling network

Info

Publication number: CN114510615A
Application number: CN202111191717.5A
Authority: CN
Inventors: 管洋洋; 苟高鹏; 陆杰; 刘畅; 李镇
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2021-10-13
Filing date: 2021-10-13
Publication date: 2022-05-17

Abstract

The invention relates to a fine-grained encrypted website fingerprint classification method and device based on a graph attention pooling network. The method comprises the steps of establishing a flow trace graph for describing a network flow mode, wherein nodes in the flow trace graph represent network flows, and edges represent context relations of the network flows; automatically learning the intra-flow characteristics and the inter-flow characteristics in the flow trace diagram by using the graph neural network model to obtain effective embedded representation of the flow trace diagram; the effective embedded representation of the traffic trace graph is utilized for website fingerprint classification. The invention provides a flow trace graph capable of reasonably describing a network flow pattern, the method is based on a graph neural network algorithm, complex artificial feature selection is not needed, the global feature and the local feature of the network flow can be effectively learned at the same time, important flow nodes can be automatically learned and paid more attention to, and the negative effects of similar flow nodes and noise flow nodes among classes are reduced. The method is suitable for various granularity website fingerprint scenes, the performance is better, and the number of required training samples is less.

Description

A fine-grained encrypted website fingerprint classification method and device based on graph attention pooling network

技术领域technical field

本发明涉及一种基于图注意力池化网络的细粒度加密网站指纹分类方法和装置，属于计算机软件技术领域。The invention relates to a fine-grained encrypted website fingerprint classification method and device based on a graph attention pooling network, and belongs to the technical field of computer software.

背景技术Background technique

随着人们的网络安全意识逐渐增强，HTTPS等加密协议被广泛应用于各类网站。这些加密协议在保护数据隐私的同时也给网络管理(如QoS和恶意行为追踪等)带来了巨大的挑战。近年来，伴随着人工智能的复兴和发展，使用机器学习算法在加密流量中识别特定网页的网站指纹技术成为了网络安全领域的热门研究点。With the increasing awareness of network security, encryption protocols such as HTTPS are widely used in various websites. These encryption protocols bring great challenges to network management (such as QoS and malicious behavior tracking, etc.) while protecting data privacy. In recent years, with the revival and development of artificial intelligence, website fingerprinting technology that uses machine learning algorithms to identify specific web pages in encrypted traffic has become a hot research point in the field of network security.

早期研究从数据包大小、包到达时间间隔等多种角度在加密流量中挖掘有效的统计特征，并采用传统机器学习算法，如K近邻、支持向量机、随机森林等算法模型作为分类器，取得了较好的性能。近年来，由于深度学习技术的快速发展，一些研究采用卷积神经网络、循环神经网络等模型，自动从加密流量中提取有效特征并实现高精度网站指纹分类。这些深度学习方法具有更佳的性能，且不再需要复杂的人工特征选择，因此成为了主流的网站指纹方法。Early research mines effective statistical features in encrypted traffic from various perspectives such as data packet size and packet arrival time interval, and uses traditional machine learning algorithms, such as K-nearest neighbors, support vector machines, random forests and other algorithm models as classifiers. better performance. In recent years, due to the rapid development of deep learning technology, some studies have used models such as convolutional neural networks and recurrent neural networks to automatically extract effective features from encrypted traffic and achieve high-precision website fingerprint classification. These deep learning methods have better performance and no longer require complex manual feature selection, thus becoming mainstream website fingerprinting methods.

然而，已有研究主要针对网站首页分类场景。事实上，人们不会仅局限于访问首页。通常很多网站内部包含若干子网页，而这些网页代表着不同的服务和网络行为。因此，细粒度网页分类问题对于QoS等网络管理同样具有重要意义。但是，由于同一网站内的不同网页之间往往具有相似的布局和内容，传统方法手动或自动提取的流量特征不再具有明显区分性，进而导致已有方法性能下降。However, existing research mainly focuses on the classification scenario of website homepage. In fact, people won't be limited to just visiting the homepage. Usually many websites contain several sub-pages, and these pages represent different services and network behaviors. Therefore, the problem of fine-grained webpage classification is also of great significance to network management such as QoS. However, since different web pages within the same website often have similar layout and content, the traffic features extracted manually or automatically by traditional methods are no longer distinguishable, which in turn leads to performance degradation of existing methods.

少量研究提出了一些通过结合流量的全局特征和局部特征来提升细粒度网页分类性能的方法。如统计网络流的总字节数、总包数等全局特征，各时间片字节数、包数等时间片特征以及前后n个数据包长序列等局部特征。相比传统首页指纹场景下的全局特征，这些局部特征可以很好地表示细粒度网页流量间的微小差异。但是，这些方法采用机器学习算法，需要复杂的人工特征选择和提取过程。而基于深度学习的网站指纹方法虽然可以自动学习到潜在的流量模式，但很难发现细粒度局部特征差异，在细粒度场景下性能同样大幅下降。因此，需要借助额外的知识帮助深度学习模型学习到相似样本之间微小的特征差异。A few studies have proposed some methods to improve the performance of fine-grained webpage classification by combining global and local features of traffic. For example, global characteristics such as the total number of bytes and total packets of the network flow, time slice characteristics such as the number of bytes and packets in each time slice, and local characteristics such as the long sequence of n data packets before and after are counted. Compared with the global features in the traditional homepage fingerprint scenario, these local features can well represent the small differences between fine-grained web traffic. However, these methods employ machine learning algorithms and require complex manual feature selection and extraction processes. Although the deep learning-based website fingerprinting method can automatically learn potential traffic patterns, it is difficult to find the differences in fine-grained local features, and the performance also drops significantly in fine-grained scenarios. Therefore, additional knowledge is needed to help deep learning models learn small feature differences between similar samples.

发明内容SUMMARY OF THE INVENTION

本发明旨在提供一种用于有效解决细粒度加密网站指纹的方法和装置。本发明是一种深度学习算法，无需复杂的人工特征选择，并且可以同时学习网络流量的全局特征和局部特征。The present invention aims to provide a method and device for effectively solving fine-grained encrypted website fingerprints. The present invention is a deep learning algorithm without complicated manual feature selection, and can simultaneously learn global and local features of network traffic.

本发明提出一种基于图注意力池化网络的细粒度加密网站指纹分类方法，采用图结构描述网站访问流量，所有节点表示流量全局信息，边表示网络流的局部上下文信息，并采用图神经网络算法自动学习有效的图表示。在网页访问流量中，不同流代表不同类型的资源请求，本发明中的方法可以自动学习不同流对于最终分类的重要性，具有可解释性。本发明还具有在较少训练样本情况下，分类性能更佳的优势。The invention proposes a fine-grained encrypted website fingerprint classification method based on graph attention pooling network. The graph structure is used to describe website access traffic, all nodes represent global traffic information, edges represent local context information of network flow, and a graph neural network is used. Algorithms automatically learn efficient graph representations. In webpage access traffic, different flows represent different types of resource requests, and the method in the present invention can automatically learn the importance of different flows for final classification, and has interpretability. The present invention also has the advantage of better classification performance in the case of fewer training samples.

具体地，本发明采用的技术方案如下：Specifically, the technical scheme adopted in the present invention is as follows:

一种基于图注意力池化网络的细粒度加密网站指纹分类方法，包括以下步骤：A fine-grained encrypted website fingerprint classification method based on graph attention pooling network, including the following steps:

建立用于描述网络流量模式的流量踪迹图，流量踪迹图中的节点表示网络流，边表示网络流的上下文关系；Build a traffic trace graph to describe the network traffic pattern. The nodes in the traffic trace graph represent the network flow, and the edges represent the context of the network flow;

利用图神经网络模型自动学习流量踪迹图中的流内特征和流间特征，得到流量踪迹图的有效嵌入表示；Using the graph neural network model to automatically learn the intra-flow features and inter-flow features in the traffic trace graph, and obtain the effective embedding representation of the traffic trace graph;

利用流量踪迹图的有效嵌入表示进行网站指纹分类。Website Fingerprinting Using Efficient Embedding Representations of Traffic Trace Graphs.

进一步地，所述流量踪迹图中，对于同一客户端产生的两条流，根据两条流的起始时间间隔是否小于经验阈值决定两节点是否有边。Further, in the traffic trace graph, for two flows generated by the same client, whether the two nodes have an edge is determined according to whether the start time interval of the two flows is less than an empirical threshold.

进一步地，所述利用图神经网络模型自动学习流量踪迹图中的流内特征和流间特征，得到流量踪迹图的有效嵌入表示，包括：Further, the graph neural network model is used to automatically learn the intra-flow features and the inter-flow features in the traffic trace graph to obtain an effective embedded representation of the traffic trace graph, including:

采用多头图注意力层学习节点的注意力权重，使得模型更加关注流量踪迹图中的重要节点，并减少类间相似节点和噪声节点的负面影响；The multi-head graph attention layer is used to learn the attention weight of nodes, so that the model pays more attention to important nodes in the traffic trace graph, and reduces the negative effects of similar nodes and noisy nodes between classes;

采用自注意力池化层进一步筛选重要节点，同时减少模型参数量。The self-attention pooling layer is used to further filter important nodes, while reducing the amount of model parameters.

进一步地，所述多头图注意力层中，流量踪迹图首先经过单层全连接网络提取浅层抽象表示，然后经过K头图注意力网络学习节点注意力权重，得到K种节点表示，再将K种节点表示累加并送入自注意力池化层。Further, in the multi-head graph attention layer, the traffic trace graph first extracts the shallow abstract representation through the single-layer fully connected network, and then learns the node attention weight through the K-head graph attention network to obtain K kinds of node representations, and then uses the K-head graph attention network to learn the node attention weight. K kinds of node representations are accumulated and sent to the self-attention pooling layer.

进一步地，所述自注意力池化层采用图卷积网络计算各节点重要性并保留topK节点，以进一步筛选重要节点，同时减少模型参数量。Further, the self-attention pooling layer uses a graph convolutional network to calculate the importance of each node and retains topK nodes, so as to further screen important nodes and reduce the amount of model parameters.

进一步地，对topK节点图进行全局最大池化和全局平均池化操作，并将两种池化结果拼接得到图的全局嵌入表示，并作为一个卷积块的输出，最后将两个卷积块输出的结果进行拼接得到最终有效嵌入。Further, global max pooling and global average pooling are performed on the topK node graph, and the two pooling results are spliced to obtain the global embedding representation of the graph, which is used as the output of a convolution block. Finally, the two convolution blocks are combined. The output results are spliced to obtain the final effective embedding.

进一步地，所述利用流量踪迹图的有效嵌入表示进行网站指纹分类，包括：使用单层全连接网络结合Log Softmax函数作为分类器，得到网页分类结果，其中利用Dropout防止训练过拟合，采用NLLLoss作为损失函数。Further, using the effective embedded representation of the traffic trace graph to perform website fingerprint classification includes: using a single-layer fully connected network in conjunction with a Log Softmax function as a classifier to obtain a webpage classification result, wherein using Dropout to prevent training overfitting, using NLLLoss as a loss function.

一种基于图注意力池化网络的细粒度加密网站指纹分类装置，其包括：A fine-grained encrypted website fingerprint classification device based on graph attention pooling network, comprising:

构图模块，用于建立用于描述网络流量模式的流量踪迹图，流量踪迹图中的节点表示网络流，边表示网络流的上下文关系；The composition module is used to establish a traffic trace graph for describing the network traffic pattern, the nodes in the traffic trace graph represent the network flow, and the edges represent the context relationship of the network flow;

图注意力层次池化模块，用于利用图神经网络模型自动学习流量踪迹图中的流内特征和流间特征，得到流量踪迹图的有效嵌入表示；The graph attention level pooling module is used to automatically learn the intra-flow features and inter-flow features in the traffic trace graph by using the graph neural network model, and obtain an effective embedded representation of the traffic trace graph;

输出模块，用于利用流量踪迹图的有效嵌入表示进行网站指纹分类。An output module for fingerprinting a website using an efficient embedded representation of the traffic trace graph.

本发明的关键点在于：The key points of the present invention are:

1、针对细粒度网站指纹问题，提出了基于图注意力池化网络的加密网站指纹分类方法。该方法使用流量踪迹图以表示网页访问流量中流的上下文关系，可以同时表示网络流量的全局特征和局部特征。并利用图神经网络模型自动学习踪迹图中流内特征和流间特征，最终得到踪迹图的有效嵌入表示。1. Aiming at the problem of fine-grained website fingerprints, a classification method of encrypted website fingerprints based on graph attention pooling network is proposed. The method uses the traffic trace graph to represent the context of the flow in the web page access traffic, which can simultaneously represent the global and local characteristics of the network traffic. And use the graph neural network model to automatically learn the intra-stream features and inter-stream features of the trace graph, and finally obtain the effective embedding representation of the trace graph.

2、采用多头图注意力机制学习节点的注意力权重，使得模型更加关注流量踪迹图中的重要节点，并减少类间相似节点和噪声节点的负面影响。并采用自注意力池化模块进一步筛选重要节点，同时减少模型参数量。2. The multi-head graph attention mechanism is used to learn the attention weight of nodes, which makes the model pay more attention to the important nodes in the traffic trace graph, and reduces the negative effects of similar nodes and noisy nodes between classes. And use the self-attention pooling module to further filter important nodes, while reducing the amount of model parameters.

3、该方法可以自动适应多种粒度的网站指纹场景。在多种粒度的数据集下均可达到目标指标的最佳分类器。同时，该方法可以在较少的训练样本情况下也获得较佳分类性能。3. The method can automatically adapt to website fingerprinting scenarios of various granularities. The best classifier that can achieve the target metric under various granularity datasets. At the same time, this method can obtain better classification performance with fewer training samples.

本发明对细粒度网站指纹问题的解决有如下特点和有益效果：The present invention has the following characteristics and beneficial effects to the solution of the fine-grained website fingerprint problem:

1、提出可以合理描述网络流量模式的流量踪迹图，使用节点表示网络流，并利用边信息表示网络流的上下文关系。1. A traffic trace graph that can reasonably describe the network traffic pattern is proposed. Nodes are used to represent network flows, and edge information is used to represent the context of network flows.

2、基于图神经网络算法，无需复杂的人工特征选择，并且可以同时有效学习网络流量的全局特征和局部特征。2. Based on the graph neural network algorithm, there is no need for complex manual feature selection, and it can effectively learn the global and local features of network traffic at the same time.

3、可以自动学习并更加关注重要流节点，并减少类间相似流节点和噪声流节点的负面影响。3. It can automatically learn and pay more attention to important flow nodes, and reduce the negative effects of similar flow nodes and noisy flow nodes between classes.

4、适合多种粒度网站指纹场景，性能更优，且所需训练样本数量更少。4. It is suitable for a variety of granular website fingerprinting scenarios, with better performance and fewer training samples.

附图说明Description of drawings

图1是本发明方法的基本框架图。Fig. 1 is a basic frame diagram of the method of the present invention.

具体实施方式Detailed ways

为使本发明的上述目的、特征和优点能够更加明显易懂，下面通过具体实施例和附图，对本发明做进一步详细说明。In order to make the above objects, features and advantages of the present invention more clearly understood, the present invention will be further described in detail below through specific embodiments and accompanying drawings.

图1是本发明方法的基本框架图，包含3个阶段，左侧表示构图阶段，中间表示训练阶段，右侧表示分类阶段。其中训练阶段为图注意力层次池化模块，包含2个卷积块，每块包含3个子层。构图模块和图注意力层次池化模块为本发明的最关键技术所在。本发明的方案包括以下几个技术步骤：1 is a basic frame diagram of the method of the present invention, which includes three stages, the left side represents the composition phase, the middle represents the training phase, and the right side represents the classification phase. The training phase is the graph attention layer pooling module, which contains 2 convolutional blocks, each of which contains 3 sub-layers. The composition module and the graph attention level pooling module are the most critical technologies of the present invention. The scheme of the present invention includes the following technical steps:

1.构图阶段：1. Composition stage:

(1)数据准备：采集当前任务场景下的SSL/TLS加密网络流量数据包(SSL/TLS表示网站采用HTTPS加密协议)，在对其进行标注后，按照一定的比例划分数据集，比如训练集:验证集:测试集＝6:1:3。训练集的作用是在训练阶段不断优化调整模型参数，而验证集的作用是辅助观察模型的训练程度是否过拟合或达到预期要求，以判断何时停止训练。测试集用于测试模型在实际网络流量中的分类性能。(1) Data preparation: Collect the SSL/TLS encrypted network traffic data packets in the current task scenario (SSL/TLS means that the website adopts the HTTPS encryption protocol), and after labeling it, divide the data set according to a certain proportion, such as the training set :validation set:test set = 6:1:3. The role of the training set is to continuously optimize and adjust the model parameters during the training phase, while the role of the validation set is to help observe whether the training degree of the model is overfitting or meet the expected requirements, so as to judge when to stop training. The test set is used to test the classification performance of the model in real network traffic.

(2)流生成：将(1)中的数据包按照五元组进行分流，并提取每条流的开始时间、数据包大小(包长度)以及数据包到达时间间隔，用于(3)中构建节点和边信息。其中五元组是指源IP地址、目的IP地址、源端口号、目的端口号和协议类型。(2) Stream generation: Divide the data packets in (1) according to quintuple, and extract the start time, data packet size (packet length) and data packet arrival time interval of each flow for use in (3) Build node and edge information. The quintuple refers to the source IP address, the destination IP address, the source port number, the destination port number, and the protocol type.

(3)构建流量踪迹图：将(2)中的每条流作为图中的节点，流的包长序列和包到达时间间隔序列作为节点的特征。对于同一客户端产生的流，根据两条流的开始时间间隔是否小于经验阈值决定两节点是否有边，本发明设定该阈值为2s。如图1所示，其中ClientIP₁＝＝ClientIP₂表示两条流是同一客户端产生的流，两条流的起始时间间隔|t₁-t₂|小于等于经验阈值t_threshold，在两个节点v₁、v₂之间增加边。图1中A表示图的邻接矩阵，X表示特征矩阵。(3) Constructing a traffic trace graph: take each flow in (2) as a node in the graph, and the packet length sequence and packet arrival time interval sequence of the flow as the characteristics of the node. For streams generated by the same client, whether the two nodes have an edge is determined according to whether the start time interval of the two streams is less than an empirical threshold, and the present invention sets the threshold to 2s. As shown in Figure 1, where ClientIP ₁ == ClientIP ₂ indicates that the two streams are generated by the same client, and the start time interval |t ₁ -t ₂ | of the two streams is less than or equal to the empirical threshold t _threshold , and the An edge is added between nodes v ₁ and v ₂ . In Fig. 1, A represents the adjacency matrix of the graph, and X represents the feature matrix.

2.训练阶段，主要为图注意力层次池化模块，包含3个子层：2. In the training phase, it is mainly a graph attention layer pooling module, which includes 3 sub-layers:

(1)多头图注意力层：流量踪迹图首先经过单层全连接网络提取浅层抽象表示，然后经过K头图注意力网络(GAT)学习节点注意力权重，得到K种节点表示，再将K种节点表示累加并送入(2)，如图1所示，其中FC为全连接网络，Elu表示指数线性单元激活函数，

表示学习到的注意力权重。其中K是指对同一个流量踪迹图进行K次相互独立的GAT操作，得到K种不同的节点表示。K次计算目的是学习到更加多样性的权重，从而提升模型的学习能力。(1) Multi-head graph attention layer: The traffic trace graph first extracts a shallow abstract representation through a single-layer fully connected network, and then learns the node attention weight through a K-head graph attention network (GAT) to obtain K kinds of node representations, and then K kinds of nodes are accumulated and sent to (2), as shown in Figure 1, where FC is a fully connected network, Elu is an exponential linear unit activation function,

represents the learned attention weights. Among them, K refers to performing K independent GAT operations on the same traffic trace graph to obtain K different node representations. The purpose of K calculations is to learn more diverse weights, thereby improving the learning ability of the model.

(2)自注意力池化层：该层通过计算各节点重要性并保留topK节点(保留重要性分数排名前K的节点)，以进一步筛选重要节点，同时减少模型参数量。如图1所示，其中Relu表示整流线性单元激活函数，GCN表示图卷积网络，score_vi表示学习到的节点重要性分数。其中，节点重要性通过图卷积网络(GCN)计算得到，图卷积操作既可以关注节点自身的特征信息，也可以充分利用图的结构信息。(2) Self-attention pooling layer: This layer calculates the importance of each node and retains the topK nodes (retaining the top K nodes in the importance score) to further filter important nodes and reduce the amount of model parameters. As shown in Figure 1, where Relu represents the rectified linear unit activation function, GCN represents the graph convolutional network, and score _vi represents the learned node importance score. Among them, the node importance is calculated by the graph convolution network (GCN). The graph convolution operation can not only pay attention to the feature information of the node itself, but also make full use of the structural information of the graph.

(3)读出层：本层分别对(2)中topK节点图进行全局最大池化(MaxPool)和全局平均池化(MeanPool)操作，并将两种池化结果拼接得到该层的流量踪迹图全局嵌入表示，并作为图注意力层次池化模块的一个卷积块输出(Readout)。最后将模块的两个卷积块结果进行拼接即可得到该样本的最终有效嵌入。如图1所示，其中Jumping Knowledge表示跳跃连接，即Readout的拼接操作。(3) Readout layer: This layer performs global maximum pooling (MaxPool) and global average pooling (MeanPool) operations on the topK node graph in (2), and splices the two pooling results to obtain the traffic trace of this layer The graph global embedding representation is used as a convolutional block output (Readout) of the graph attention hierarchical pooling module. Finally, the final effective embedding of the sample can be obtained by splicing the results of the two convolution blocks of the module. As shown in Figure 1, Jumping Knowledge represents jumping connection, that is, the splicing operation of Readout.

3.分类阶段：3. Classification stage:

此阶段即输出模块，使用单层全连接网络(FC)结合Log Softmax函数作为分类器，得到网页分类结果。其中利用Dropout防止训练过拟合，NLLLoss作为损失函数。This stage is the output module, which uses a single-layer fully connected network (FC) combined with the Log Softmax function as a classifier to obtain webpage classification results. Among them, Dropout is used to prevent training overfitting, and NLLLoss is used as the loss function.

本发明的实例：Example of the present invention:

实例1传统网站指纹首页分类场景Example 1 Traditional website fingerprint homepage classification scenario

2020年10月，主动收集Alexa中国排名Top100的HTTPS网站原始访问流量。其中每个网页访问100次，共1w个流量样本。对其进行样本标注和特征提取后进行数据集划分，划分比例为训练集:验证集:测试集＝6:1:3。经过构图并训练后，本发明提出的算法在测试集获得了99.85％的高F1分值，相比已有的最先进研究工作提升超过1％。说明本发明在传统网站首页分类场景上是最优方法。In October 2020, actively collect the raw traffic of HTTPS websites ranked Top 100 in Alexa China. Each webpage is visited 100 times, with a total of 1w traffic samples. After sample labeling and feature extraction, the data set is divided, and the division ratio is training set: validation set: test set = 6:1:3. After composition and training, the algorithm proposed by the present invention obtains a high F1 score of 99.85% in the test set, which is more than 1% higher than the existing state-of-the-art research work. It is illustrated that the present invention is the optimal method in the traditional website homepage classification scenario.

实例2单网站下细粒度网页指纹场景Example 2 Fine-grained web page fingerprinting scenario under a single website

2020年11月，收集2个代表性HTTPS网站下共计90个热门网页的原始访问流量。其中选取A网站60个网页，B网站30个网页，每个网页访问100次，共计9k个样本。分别对A和B的流量样本进行标注、特征提取和划分数据集后，送入本发明提出的图注意力池化网络模型进行训练。测试集结果表明，本发明在两个数据集上分别取得了96.72％和91.45％的高F1值，相比已有的最先进方法，提升了3％至23％。In November 2020, the raw access traffic of a total of 90 popular web pages under 2 representative HTTPS websites was collected. Among them, 60 webpages of website A and 30 webpages of website B are selected, and each webpage is accessed 100 times, for a total of 9k samples. After the traffic samples of A and B are marked, feature extracted and divided into data sets, they are sent to the graph attention pooling network model proposed by the present invention for training. The test set results show that the present invention achieves high F1 values of 96.72% and 91.45% on the two datasets, respectively, which is 3% to 23% higher than the existing state-of-the-art methods.

实例3多网站下细粒度网页指纹场景Example 3 Fine-grained web page fingerprinting scenario under multiple websites

2020年12月，主动收集了9个代表性HTTPS网站下共计100个热门网页的原始访问流量，每个网页访问100次，共计1w个流量样本。实验测试结果表明，本发明取得了93.37％的F1值，比已有的最先进方法提升了14％以上。In December 2020, we actively collected the raw access traffic of a total of 100 popular web pages under 9 representative HTTPS websites, each web page was accessed 100 times, and a total of 1w traffic samples were collected. The experimental test results show that the present invention achieves an F1 value of 93.37%, which is more than 14% higher than the existing state-of-the-art methods.

整体上，本发明提出的图注意力池化网络模型在多种粒度的网站指纹场景下均表现最佳。同时实验发现，本发明仅需要很少的训练样本数量即可达到较高准确率，如首页分类场景仅需25个样本即可达到99％准确率，而单网站细粒度网页分类场景仅需要15个样本即可达到90％准确率。On the whole, the graph attention pooling network model proposed by the present invention performs the best in various granularity website fingerprinting scenarios. At the same time, experiments have found that the present invention only needs a small number of training samples to achieve a high accuracy rate. For example, the home page classification scenario only needs 25 samples to achieve 99% accuracy, while the single website fine-grained webpage classification scenario only needs 15 90% accuracy can be achieved with only one sample.

基于同一发明构思，本发明的另一个实施例提供一种基于图注意力池化网络的细粒度加密网站指纹分类装置，其包括：Based on the same inventive concept, another embodiment of the present invention provides a fine-grained encrypted website fingerprint classification device based on graph attention pooling network, which includes:

其中各模块的具体实施过程参见前文对本发明方法的描述。For the specific implementation process of each module, refer to the foregoing description of the method of the present invention.

基于同一发明构思，本发明的另一实施例提供一种电子装置(计算机、服务器、智能手机等)，其包括存储器和处理器，所述存储器存储计算机程序，所述计算机程序被配置为由所述处理器执行，所述计算机程序包括用于执行本发明方法中各步骤的指令。Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.), which includes a memory and a processor, the memory stores a computer program, and the computer program is configured to be The processor is executed, and the computer program includes instructions for performing the steps in the method of the present invention.

基于同一发明构思，本发明的另一实施例提供一种计算机可读存储介质(如ROM/RAM、磁盘、光盘)，所述计算机可读存储介质存储计算机程序，所述计算机程序被计算机执行时，实现本发明方法的各个步骤。Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (eg, ROM/RAM, magnetic disk, optical disk), where the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer , realize each step of the method of the present invention.

本发明的其它实施方式：Other embodiments of the present invention:

1.多头注意力层：本发明所述方法在具体实施时可采用图卷积网络(GCN)等替换多头注意力层以减少计算复杂度，也可达到较高准确率。1. Multi-head attention layer: The method of the present invention can use a graph convolutional network (GCN) or the like to replace the multi-head attention layer in order to reduce computational complexity and achieve higher accuracy.

2.读出层：本发明所述方法在具体实施时可采用Readout求平均/求和等方式替代拼接操作。2. Readout layer: The method of the present invention may use Readout averaging/summation and other methods to replace the splicing operation during specific implementation.

3.输出模块：本发明所述方法在具体实施时可根据分类任务自定义输出层结构，如采用全局平均池化层替代全连接层等。3. Output module: The method of the present invention can customize the output layer structure according to the classification task during specific implementation, such as using a global average pooling layer instead of a fully connected layer.

以上公开的本发明的具体实施例，其目的在于帮助理解本发明的内容并据以实施，本领域的普通技术人员可以理解，在不脱离本发明的精神和范围内，各种替换、变化和修改都是可能的。本发明不应局限于本说明书的实施例所公开的内容，本发明的保护范围以权利要求书界定的范围为准。The specific embodiments of the present invention disclosed above are intended to help understand the content of the present invention and implement them accordingly. Those skilled in the art can understand that various substitutions, changes and modifications can be made without departing from the spirit and scope of the present invention. Modifications are possible. The present invention should not be limited to the contents disclosed in the embodiments of this specification, and the protection scope of the present invention shall be subject to the scope defined by the claims.

Claims

1. a fine-grained encryption website fingerprint classification method based on graph attention pooling network, is characterized in that, comprises the following steps:

Build a traffic trace graph to describe the network traffic pattern. The nodes in the traffic trace graph represent the network flow, and the edges represent the context of the network flow;

Using the graph neural network model to automatically learn the intra-flow features and inter-flow features in the traffic trace graph, and obtain the effective embedding representation of the traffic trace graph;

Website Fingerprinting Using Efficient Embedding Representations of Traffic Trace Graphs.

2. The method according to claim 1, wherein, in the traffic trace diagram, for two flows generated by the same client, it is determined whether the two nodes have the same flow according to whether the initial time interval of the two flows is less than an empirical threshold. side.

3. method according to claim 1, is characterized in that, described utilizing graph neural network model to automatically learn the intra-flow feature and inter-flow feature in the traffic trace graph, obtain the effective embedded representation of the traffic trace graph, comprising:

The multi-head graph attention layer is used to learn the attention weight of nodes, so that the model pays more attention to important nodes in the traffic trace graph, and reduces the negative effects of similar nodes and noisy nodes between classes;

The self-attention pooling layer is used to further filter important nodes, while reducing the amount of model parameters.

4. The method according to claim 3, wherein in the multi-head graph attention layer, the traffic trace graph first extracts a shallow abstract representation through a single-layer fully connected network, and then learns nodes through a K-head graph attention network. The attention weight is obtained to obtain K node representations, and then the K node representations are accumulated and sent to the self-attention pooling layer.

5 . The method according to claim 3 , wherein the self-attention pooling layer uses a graph convolutional network to calculate the importance of each node and retains topK nodes, so as to further screen important nodes and reduce the amount of model parameters. 6 .

6. The method according to claim 5, wherein the topK node graph is subjected to global maximum pooling and global average pooling operations, and the two pooling results are spliced to obtain a global embedding representation of the graph, which is used as a volume The output of the accumulation block, and finally the results of the output of the two convolution blocks are spliced to obtain the final effective embedding.

7. method according to claim 1, is characterized in that, described utilizing the effective embedded representation of traffic trace graph to carry out website fingerprint classification, comprising: use single-layer fully connected network in conjunction with Log Softmax function as classifier, obtain webpage classification result , in which Dropout is used to prevent training overfitting, and NLLLoss is used as the loss function.

8. A fine-grained encrypted website fingerprint classification device based on graph attention pooling network, characterized in that, comprising:

The composition module is used to establish a traffic trace graph for describing the network traffic pattern, the nodes in the traffic trace graph represent the network flow, and the edges represent the context relationship of the network flow;

The graph attention level pooling module is used to automatically learn the intra-flow features and inter-flow features in the traffic trace graph by using the graph neural network model, and obtain an effective embedded representation of the traffic trace graph;

An output module for fingerprinting a website using an efficient embedded representation of the traffic trace graph.

9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, the computer program is configured to be executed by the processor, and the computer program includes a program for executing claims 1- 7. Instructions for the method of claim 7.

10 . A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the method according to any one of claims 1 to 7 is implemented. 11 .