CN115580547A

CN115580547A - Website fingerprint identification method and system based on time-space correlation between network data streams

Info

Publication number: CN115580547A
Application number: CN202211452131.4A
Authority: CN
Inventors: 谭小彬; 彭闯; 杜玫; 杨坚; 施钱宝
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2022-11-21
Filing date: 2022-11-21
Publication date: 2023-01-06

Abstract

The invention discloses a website fingerprint identification method and system based on the spatio-temporal correlation between network data streams; the system scheme corresponds one-to-one to the method scheme. In the method, the network data streams generated while a website loads are constructed into a spatio-temporal correlation graph, which models the behavior characteristics of the network data streams and the spatio-temporal correlation among them from a global spatio-temporal perspective. This overcomes the problem that Euclidean structure data cannot fully express the behavior pattern of traffic data, so that different websites can be identified. The spatio-temporal correlation graph is then processed by a graph neural network, so that a comprehensive representation of multiple network data streams can be extracted and different website fingerprints can be accurately identified.

Description

Website fingerprint identification method and system based on time-space correlation between network data streams

Technical Field

The invention relates to the technical field of computer networks, in particular to a website fingerprint identification method and a website fingerprint identification system based on time-space correlation among network data streams.

Background

Network traffic identification helps network administrators understand the types and numbers of applications, devices, and protocols in a network, and thus implement network management policies accurately. Website fingerprint identification aims to identify the web page that corresponds to encrypted traffic. It is an important part of network management and has become a hot topic in the field of traffic identification. A website fingerprint is a comprehensive characterization (or feature) of a website extracted from its traffic, from which the website's traffic can be identified.

In recent years, with the rapid development of network technologies and the growing emphasis on private data, encryption has been widely adopted in network communications. The Google Transparency Report shows that in 2019, 100% of the top 100 non-Google websites on the internet supported HTTPS (Hypertext Transfer Protocol Secure) and 97% used HTTPS by default, and that by May 2022, 99% of web pages loaded in Chrome (Google's browser) were loaded over HTTPS. The wide application of network encryption and the rapid growth of encrypted traffic improve network security and protect privacy, but they also make website fingerprint identification more difficult.

With the rise of artificial intelligence, many website fingerprint identification methods based on traditional machine learning have appeared. However, network data streams are numerous, complex, and dynamically changing, so a generic method based on traditional machine learning rarely achieves good identification results; such a method also needs prior knowledge to design a feature set that accurately reflects traffic characteristics, which is time-consuming and laborious. Methods based on traditional machine learning mainly consider the statistical characteristics of the traffic. Because of the layered structure of the network, the application-layer data of a website is segmented by the transport layer and then transmitted through the IP (Internet Protocol) layer, so the statistical characteristics of the resulting data can hardly reflect the original characteristics of the traffic. In addition, the actual network environment also affects the statistical characteristics of the traffic; for example, the TCP (Transmission Control Protocol) retransmission mechanism introduces fluctuations into them. Traffic identification based on traditional machine learning is therefore severely limited.

In recent years, following the successful application of deep learning in many fields, research has appeared that applies deep learning to website fingerprint identification. Because deep learning can automatically extract and learn latent features from raw data, it overcomes the excessive dependence of traditional machine learning on manual feature extraction. A large body of related research now uses deep learning methods (such as convolutional neural networks and long short-term memory networks) to identify network traffic, with some success.

However, both the methods based on traditional machine learning and the current deep learning methods mostly observe the pattern of traffic traces at the data packet level to obtain the website fingerprint, and usually organize the characteristics of website traffic into Euclidean structure data. In a traffic environment that is continuously updated and changing, this processing weakens the effectiveness of the traffic traces and results in poor identification accuracy.

Disclosure of Invention

The invention aims to provide a website fingerprint identification method and a website fingerprint identification system based on the time-space correlation among network data streams, which can capture effective information from the characteristics of the network data streams and the time-space correlation among the network data streams, thereby improving the identification accuracy.

The purpose of the invention is realized by the following technical scheme:

a website fingerprint identification method based on time-space correlation among network data streams comprises the following steps:

acquiring network data streams generated by single access of a website, and extracting the characteristics of each network data stream;

taking each network data stream as a node, dividing the continuous network data streams with the same remote destination into a group, numbering each group by combining the time of the network data streams, adding edges between nodes corresponding to the network data streams in the group and between nodes corresponding to adjacent groups of network data streams, and constructing a time-space correlation graph among the network data streams, wherein the characteristics of the nodes in the time-space correlation graph among the network data streams are the characteristics of the corresponding network data streams, and the characteristics of the edges are calculated through the time and space relations among the related network data streams;

processing the spatio-temporal correlation diagram among the network data streams by using a graph neural network to obtain the global representation of the spatio-temporal correlation diagram among the network data streams;

and classifying by using the global representation of the spatio-temporal correlation diagram among the network data streams to obtain a website fingerprint identification result.

The characteristics of each network data flow include: sequence features and statistical features;

each network data stream is divided into an uplink data stream and a downlink data stream; the sequence feature is a fixed-length sequence of the data packet sizes of the network data stream, in which the sizes of data packets in the uplink data stream take a positive sign and the sizes of data packets in the downlink data stream take a negative sign; for the statistical features, M statistical indices are set and calculated separately for the uplink data stream, the downlink data stream, and the complete network data stream formed by the two, giving 3M statistics, the 3M statistics are ranked by contribution, and the top-ranked statistics are selected as the statistical features; wherein M is a positive integer.

The calculation of the characteristics of the edges through the temporal and spatial relations between the related network data streams comprises:

the edge is characterized by a two-dimensional vector which respectively represents the time correlation and the space correlation between two network data streams connected by the corresponding edge; wherein:

the larger the value of the time correlation is, the stronger the time correlation is, and the time correlation is calculated by using the starting time interval of the two network data streams;

the larger the value of the spatial correlation, the stronger the spatial correlation; the spatial correlation is determined according to whether the remote IPs of the two network data streams belong to the same network destination, or alternatively the IP home locations are obtained from the remote IPs of the two network data streams, the network destinations are defined according to the IP home locations, and the spatial correlation is then obtained by quantizing the physical distance between the network destinations.

The processing the spatio-temporal correlation graph among the network data streams by using the graph neural network to obtain the global representation of the spatio-temporal correlation graph among the network data streams comprises the following steps:

the graph neural network includes: a plurality of graph attention layers, a graph pooling layer, and a readout layer; each graph attention layer uses an attention mechanism and the features of the edges to update the features of the nodes; for each node, the node features updated by all the graph attention layers are spliced to form the final features of the node; the graph pooling layer calculates a score for each node from its final features and removes the nodes whose scores are smaller than a threshold, together with their edges, to obtain a pooled spatio-temporal correlation graph between network data streams; and the readout layer then performs global pooling to obtain the global representation of the spatio-temporal correlation graph between network data streams.

Each graph attention layer using an attention mechanism and the features of the edges to update the features of the nodes, and, for each node, splicing the node features updated by all the graph attention layers as the final features of the node, comprises:

the set $h$ of the features of all nodes in the spatio-temporal correlation graph between network data streams is input into each graph attention layer, and the set $h$ is expressed as:

$$h = \{h_1, h_2, \dots, h_N\}$$

wherein $N$ represents the number of nodes and $h_i$ represents the features of node $i$, $i = 1, 2, \dots, N$;

in each graph attention layer, for node $i$, the attention weight of each neighbor node is calculated using the features of each neighbor node and the features of the edges, the features of the neighbor nodes are linearly accumulated in combination with their attention weights, and the updated features of node $i$ are calculated; wherein a neighbor node of node $i$ is a node connected to node $i$ by an edge;

for node $i$, the features of the node updated by all the graph attention layers are spliced to obtain the final features of node $i$; the set of final features of all nodes is expressed as:

$$h' = \{h'_1, h'_2, \dots, h'_N\}$$

wherein $h'_i$ represents the final features of node $i$.

In each graph attention layer, calculating, for node $i$, the attention weight of each neighbor node using the features of each neighbor node and the features of the edges comprises:

the set formed by all neighbor nodes of node $i$ is denoted $N_i$, and the attention weight is calculated by the following formula:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W h_i \,\|\, W h_j \,\|\, W_e e_{ij}\right]\right)\right)}{\sum_{l \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W h_i \,\|\, W h_l \,\|\, W_e e_{il}\right]\right)\right)}$$

wherein $\alpha_{ij}$ represents the importance of node $j$ to node $i$ as calculated by the graph attention layer, i.e., the attention weight of node $j$; $\mathrm{LeakyReLU}(\cdot)$ represents the LeakyReLU activation function; $h_j$ represents the features of neighbor node $j$ of node $i$; $e_{ij}$ represents the features of the edge between node $i$ and its neighbor node $j$; $h_l$ represents the features of neighbor node $l$ of node $i$; $e_{il}$ represents the features of the edge between node $i$ and its neighbor node $l$; the symbol $\|$ denotes the splicing operation; and $W$, $W_e$ and $a$ are three trainable parameters in the graph attention layer.
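The attention weight computation described above can be illustrated with a minimal pure-Python sketch. It assumes a GAT-style score in which the projected features of node i, of the neighbor, and of the connecting edge are spliced, multiplied by a trainable vector, passed through LeakyReLU, and normalized by softmax over the neighbors of node i. The function names, the LeakyReLU slope of 0.2, and the toy parameter values are illustrative assumptions, not values taken from the source.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def leaky_relu(x, slope=0.2):
    # slope=0.2 is an assumed default, not specified in the source
    return x if x > 0 else slope * x

def attention_weights(h_i, neighbors, W, W_e, a):
    """neighbors: list of (h_j, e_ij) pairs for node i.
    Returns one attention weight per neighbor; the weights sum to 1."""
    Wh_i = matvec(W, h_i)
    scores = []
    for h_j, e_ij in neighbors:
        # list concatenation plays the role of the splicing operation
        z = Wh_i + matvec(W, h_j) + matvec(W_e, e_ij)
        scores.append(leaky_relu(sum(p * q for p, q in zip(a, z))))
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Toy example: 2-D node features, 2-D edge features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
h_i = [1.0, 2.0]
neighbors = [([1.0, 0.0], [0.5, 1.0]), ([0.0, 0.0], [0.5, 0.0])]
a = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]         # only looks at the neighbor's first feature
weights = attention_weights(h_i, neighbors, identity, identity, a)
uniform = attention_weights(h_i, neighbors, identity, identity, [0.0] * 6)
```

With the toy vector `a`, the first neighbor scores higher and therefore receives the larger weight; with an all-zero `a`, every neighbor scores the same and the weights are uniform.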

The final features of node $i$, obtained by splicing the features of the node updated by all the graph attention layers, are expressed as:

$$h'_i = \mathop{\Big\Vert}_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{k} W^{k} h_j\Big)$$

wherein $K$ represents the number of graph attention layers; $\alpha_{ij}^{k}$ represents the importance of node $j$ to node $i$ as calculated by the $k$-th graph attention layer; $W^{k}$ represents the weight matrix of the $k$-th graph attention layer; $h_j$ represents the features of neighbor node $j$ of node $i$; $N_i$ represents the set of all neighbor nodes of node $i$; the symbol $\|$ represents the splicing operation; $\sigma$ represents the activation function; and $h'_i$ represents the final features of node $i$.
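As a rough illustration of the splicing step, the sketch below updates a node's features in each of K graph attention layers as an activated, attention-weighted sum of projected neighbor features, then concatenates the K results. The attention weights are taken as given inputs here. The choice of ELU as the activation function and all function names are assumptions made for the example; the source only says that an activation function is applied.

```python
import math

def updated_feature(alphas, neighbor_feats, W):
    """One attention layer: activation(sum_j alpha_ij * W h_j).
    ELU is used as the activation here (an assumption)."""
    acc = [0.0] * len(W)
    for alpha, h_j in zip(alphas, neighbor_feats):
        Wh = [sum(w * x for w, x in zip(row, h_j)) for row in W]
        acc = [s + alpha * v for s, v in zip(acc, Wh)]
    return [v if v > 0 else math.exp(v) - 1.0 for v in acc]

def final_feature(per_layer_alphas, neighbor_feats, per_layer_W):
    """Splice (concatenate) the updated features from all K layers."""
    out = []
    for alphas, W in zip(per_layer_alphas, per_layer_W):
        out.extend(updated_feature(alphas, neighbor_feats, W))
    return out

# Toy example with K = 2 layers and two neighbors.
neighbor_feats = [[1.0, 0.0], [0.0, 1.0]]
per_layer_alphas = [[0.5, 0.5], [1.0, 0.0]]
per_layer_W = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
h_final = final_feature(per_layer_alphas, neighbor_feats, per_layer_W)
```

The length of the final feature vector is the sum of the output dimensions of the K layers, which is why the number of layers directly affects the size of the input to the graph pooling layer.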

A system for fingerprinting web sites based on spatiotemporal correlations between network data streams, comprising:

the network data flow feature extraction module is used for extracting the features of each network data flow;

the graph generation module is used for dividing continuous network data streams with the same remote destination into one group by taking each network data stream as a node, numbering each group by combining the time of the network data streams, adding edges between nodes corresponding to the network data streams in the group and between nodes corresponding to adjacent groups of network data streams, and constructing a time-space correlation graph among the network data streams, wherein the characteristics of the nodes in the time-space correlation graph among the network data streams are the characteristics of the corresponding network data streams, and the characteristics of the edges are calculated through the time and space relations among the related network data streams;

the graph neural network is used for processing the spatiotemporal correlation graph among the network data streams to obtain the global representation of the spatiotemporal correlation graph among the network data streams;

and the classifier is used for classifying the global representation of the spatio-temporal correlation diagram among the network data streams to obtain a website fingerprint identification result.

According to the technical scheme provided by the invention, the network data streams generated while a website loads are constructed into a spatio-temporal correlation graph between network data streams, which models the behavior characteristics of the network data streams and the spatio-temporal correlation among them from a global spatio-temporal perspective. This overcomes the problem that Euclidean structure data cannot fully express the behavior pattern of traffic data, so that different website fingerprints can be identified. The spatio-temporal correlation graph is then processed by the graph neural network, so that a comprehensive representation of multiple network data streams can be extracted and different website fingerprints can be accurately identified.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for fingerprinting websites based on spatiotemporal correlation between network data streams according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a method for constructing a spatiotemporal correlation diagram between network data streams according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a processing manner of an attention layer according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a processing manner of a graph pooling layer according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an overall framework of a website fingerprint identification method based on spatiotemporal correlation between network data streams according to an embodiment of the present invention;

FIG. 6 is a diagram of a system for fingerprinting web sites based on spatiotemporal correlation between network data streams according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The terms that may be used herein are first described as follows:

The term "and/or" means that either or both can be achieved; for example, "X and/or Y" covers three cases: "X" alone, "Y" alone, and "X and Y".

The terms "comprising," "including," "containing," "having," and other terms of similar meaning should be construed as non-exclusive inclusions. For example, a statement that something includes a specifically recited feature (e.g., a material, component, ingredient, carrier, formulation, dimension, part, mechanism, device, step, process, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article) should be interpreted as including not only that feature but also other features that are not specifically recited and are known in the art.

The website fingerprinting scheme based on the spatio-temporal correlation between the network data streams provided by the present invention is described in detail below. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art. The examples of the present invention, in which specific conditions are not specified, were carried out according to the conventional conditions in the art or conditions suggested by the manufacturer.

Example one

In the embodiment of the invention, effective information is captured from the characteristics of the network data streams and the spatio-temporal correlation among them, so that different websites can be identified. The invention defines a spatio-temporal correlation graph between network data streams to reflect the states of, and the relations among, the network data streams obtained from website traffic. Here, website traffic is the collected raw data consisting of a series of network data packets, and the network data streams are obtained by segmenting those packets according to the five-tuple. When the spatio-temporal characteristics of multiple streams are considered in traffic identification, simply splicing and arranging the data (i.e., organizing it into Euclidean structure data) is not suitable; compared with Euclidean structure data, non-Euclidean structure data such as graph data can describe the network data streams and their mutual relations more accurately. Graph Neural Networks (GNNs) extend deep learning to graph data and address the inability of conventional deep learning networks to represent relational data such as graphs. Traditional website fingerprint identification methods also neglect information at the data-stream level, which leaves the model with insufficient generalization capability; converting the traffic data into graph-structured data and using a graph neural network for training and detection effectively solves this problem. A graph neural network can therefore be applied to website fingerprint identification to improve identification accuracy.

Therefore, the embodiment of the present invention provides a website fingerprint identification method based on spatio-temporal correlation between network data streams. The network data streams are used as the nodes of the graph data, and the parameters of the nodes are represented by the features of the network data streams; the relationships between the network data streams are used as the edges of the graph data, and the parameters of the edges are created from the spatio-temporal relationships (start time, IP address relationship, etc.) between the network data streams. A graph neural network is then used to extract a comprehensive characterization of the multiple network data streams from the spatio-temporal correlation graph, and different website fingerprints are identified through classification. As shown in FIG. 1, the method mainly comprises the following steps:

Step 1: the network data streams generated by a single visit to a website are acquired, and the features of each network data stream are extracted.

In the embodiment of the invention, features are extracted from the network data streams of the traffic generated while each website loads. In practice, when a browser accesses a website, a packet capture tool can capture the network data packets generated by the visit; the packets are segmented by five-tuple (source IP address, source port, destination IP address, destination port, and transport layer protocol) into a series of network data streams, and the spatio-temporal correlation graph between network data streams is constructed from the network data streams generated by a single visit to the website.

In the embodiment of the present invention, the network data flow refers to a session flow, or may be referred to as a bidirectional flow. Generally, a unidirectional flow is uniquely determined by a five-tuple (source IP address, source port, destination IP address, destination port, transport layer protocol), and a bidirectional flow is composed of two unidirectional flows whose source IP address, source port, destination IP address, and destination port are in a symmetric relationship. Node information plays an important role in the recognition result of the graph neural network, so that proper characteristics need to be selected to characterize the nodes.
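The segmentation of captured packets into bidirectional flows can be sketched as follows. The sketch assumes each packet has already been parsed into a dictionary; the field names (`ts`, `src_ip`, `src_port`, `dst_ip`, `dst_port`, `proto`, `size`) and the sample addresses are hypothetical. The key idea is that the two endpoints are ordered before forming the key, so both directions of a session map to the same flow.

```python
from collections import defaultdict

def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent key: the two unidirectional five-tuples of one
    session (with source and destination swapped) yield the same key."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)

def split_into_flows(packets):
    """Group parsed packets into bidirectional flows, each sorted by timestamp."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt["src_ip"], pkt["src_port"],
                       pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    for pkts in flows.values():
        pkts.sort(key=lambda p: p["ts"])
    return dict(flows)

# Hypothetical capture: two packets of one session plus one packet of another.
packets = [
    {"ts": 0.00, "src_ip": "10.0.0.2", "src_port": 50000,
     "dst_ip": "93.184.216.34", "dst_port": 443, "proto": "TCP", "size": 517},
    {"ts": 0.03, "src_ip": "93.184.216.34", "src_port": 443,
     "dst_ip": "10.0.0.2", "dst_port": 50000, "proto": "TCP", "size": 1460},
    {"ts": 0.05, "src_ip": "10.0.0.2", "src_port": 50001,
     "dst_ip": "151.101.1.69", "dst_port": 443, "proto": "TCP", "size": 517},
]
flows = split_into_flows(packets)
```

Each resulting flow then becomes one node of the graph, with its start time taken from its earliest packet.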

In the embodiment of the present invention, the characteristics of the network data stream include: sequence features and statistical features; and a vector formed by splicing the sequence features and the statistical features is used as the features of the network data stream.

Data packets in a network data stream may be divided into upstream data packets and downstream data packets depending on whether the source IP is a local host, and a data stream may be divided into upstream data streams and downstream data streams. The sequence feature is a fixed-length packet size sequence of the network data stream, wherein the sign of the size of the packet in the upstream data stream is a positive sign, and the sign of the size of the packet in the downstream data stream is a negative sign.
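A minimal sketch of the sequence feature is given below. The fixed length of 32 and the zero padding for short flows are assumptions chosen for illustration; the source only states that the sequence has a fixed length. The packet field names are likewise hypothetical.

```python
def packet_size_sequence(flow_packets, local_ip, length=32):
    """Signed packet-size sequence of fixed length.
    Uplink packets (source is the local host) are positive,
    downlink packets are negative; short flows are zero-padded (assumption)."""
    seq = []
    for pkt in flow_packets[:length]:
        sign = 1 if pkt["src_ip"] == local_ip else -1
        seq.append(sign * pkt["size"])
    seq.extend([0] * (length - len(seq)))
    return seq

example_flow = [
    {"src_ip": "10.0.0.2", "size": 517},        # uplink
    {"src_ip": "93.184.216.34", "size": 1460},  # downlink
]
sequence = packet_size_sequence(example_flow, local_ip="10.0.0.2", length=4)
```

Flows longer than the chosen length are truncated, so the sequence captures only the beginning of each flow.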

For the statistical features, M statistical indices are set and calculated separately for the uplink data stream, the downlink data stream, and the complete network data stream formed by the two, giving 3M statistics; the 3M statistics are ranked by contribution, and the top-ranked ones are selected as the statistical features. Here M is a positive integer, and the specific types and number of statistical indices (i.e., the value of M) can be set according to actual conditions or experience. For example, 18 statistical indices can be selected: the minimum, maximum, mean, median absolute deviation, standard deviation, variance, skewness, kurtosis, percentiles (from 10% to 90%), and number of data packets. Calculating them for the complete, uplink, and downlink data streams gives 54 statistics. The statistics are then ranked by contribution using a random forest method (for example, according to the contribution of each statistic in a classification task), and the several statistics with the highest contribution (the specific number is set according to actual conditions or experience) are used as the statistical features. The number of selected statistical features is a trade-off to be decided by the application: more features give higher accuracy but higher computational complexity, whereas fewer features give lower accuracy but relatively low computational complexity.
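The candidate statistics can be computed with a short standard-library sketch such as the one below, which evaluates 18 indices on the packet sizes of the complete, uplink, and downlink streams to give 54 values. The use of population (rather than sample) moments, linear-interpolation percentiles, and zeros for an empty stream are assumptions. The contribution ranking is not shown, because it requires labeled traffic and a trained random forest.

```python
import math
import statistics

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def flow_statistics(sizes):
    """18 indices: min, max, mean, median absolute deviation, std, variance,
    skewness, kurtosis, 10th..90th percentiles, packet count."""
    if not sizes:
        return [0.0] * 18                      # assumption: empty stream -> zeros
    vals = sorted(sizes)
    n = len(vals)
    mean = statistics.fmean(vals)
    var = statistics.pvariance(vals, mu=mean)  # population variance (assumption)
    std = math.sqrt(var)
    med = statistics.median(vals)
    mad = float(statistics.median([abs(v - med) for v in vals]))
    skew = sum((v - mean) ** 3 for v in vals) / (n * std ** 3) if std > 0 else 0.0
    kurt = sum((v - mean) ** 4 for v in vals) / (n * std ** 4) if std > 0 else 0.0
    pcts = [float(percentile(vals, q)) for q in range(10, 100, 10)]
    return [float(vals[0]), float(vals[-1]), mean, mad, std, var, skew, kurt,
            *pcts, float(n)]

def statistical_candidates(up_sizes, down_sizes):
    """3M = 54 candidates: complete stream, uplink stream, downlink stream."""
    return (flow_statistics(up_sizes + down_sizes)
            + flow_statistics(up_sizes)
            + flow_statistics(down_sizes))

candidates = statistical_candidates([100, 200, 300], [1000, 1500])
uplink_stats = flow_statistics([100, 200, 300])
```

In a full pipeline, the 54 candidates of many labeled flows would be fed to a random forest, and only the columns with the highest importance would be kept as the statistical features.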

Step 2: each network data stream is taken as a node, consecutive network data streams with the same remote destination are divided into a group, the groups are numbered according to the time of their network data streams, and edges are added between the nodes of the streams within a group and between the nodes of streams in adjacent groups, thereby constructing the spatio-temporal correlation graph between network data streams. The features of a node in the graph are the features of the corresponding network data stream, and the features of an edge are calculated from the temporal and spatial relations between the two network data streams it connects.

Loading a website is in effect an interaction between the local host and remote servers. To guarantee quality of service and minimize service interruptions, a given service of a website may be provided by several redundant servers at the same time, or a load balancing mechanism may be used, so there is often more than one remote server. For a website, its servers, i.e., the remote network destinations, are located at different physical addresses, and their IP addresses are therefore spatially correlated.

Different websites differ inherently in their web page element types (e.g., text, images, and other media content); different content is stored on, and sent to the local host by, servers located at different network destinations, and the element loading order is fixed. Thus, during the loading of different websites, the interaction of the local IP with remote servers at different destinations exhibits the Markov property in time order.

Thus, there is a certain correlation between network data streams in time and space.

In the embodiment of the invention, a space-time correlation graph among network data streams is constructed based on space-time correlation among the network data streams, the node characteristics are the characteristics of the corresponding network data streams, the characteristics of edges among the nodes are used for representing the space-time relation among the network data streams, and the characteristics of the edges are represented by a two-dimensional vector and respectively represent the time correlation and the space correlation between two network data streams connected by the corresponding edges.

1) Temporal correlation.

In the embodiment of the present invention, a larger numerical value of the time correlation indicates a stronger time correlation, and the time correlation may be calculated by using the start time interval of the two network data streams.

Illustratively, the time correlation may take a value in the interval [0,1], from weak to strong. When the time correlation is calculated from the start time interval of two network data streams, its value is calculated as $e^{-|t_1 - t_2|}$, where $t_1$ and $t_2$ are the start times of the two network data streams and $e$ is the natural constant, so that the smaller the time interval, the larger the value, and the value lies in the interval from 0 to 1.

2) Spatial correlation.

For example, the spatial correlation takes a value in the interval [0,1], from weak to strong. A website often has more than one remote server, and because of mechanisms such as load balancing, the IP addresses of several remote servers may belong to the same IP address pool; different remote IPs belonging to the same IP address pool can be regarded as the same network destination. Whether the remote IPs of two different network data streams belong to the same network destination can be determined from the IP address and the TLS (Transport Layer Security) certificate, if any: if either the IP address or the TLS certificate is the same, the two are regarded as the same network destination. If the remote IPs of the two network data streams belong to the same network destination, the second dimension of the edge feature (i.e., the spatial correlation) is set to 1; otherwise, it is set to 0.
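The two-dimensional edge feature can be sketched as below. For the temporal component the sketch assumes `exp(-|t1 - t2|)`, one simple mapping that decreases with the start-time interval and stays in (0, 1]; the exact mapping, the time unit, and the flow field names (`start`, `remote_ip`, `tls_cert`) are assumptions. The spatial component follows the binary rule described above: 1 if either the remote IP or the TLS certificate matches, otherwise 0.

```python
import math

def temporal_correlation(t1, t2):
    """Assumed mapping: smaller start-time interval -> value closer to 1."""
    return math.exp(-abs(t1 - t2))

def spatial_correlation(flow_a, flow_b):
    """1.0 if both flows reach the same network destination, judged by an
    identical remote IP or an identical TLS certificate; otherwise 0.0."""
    same_ip = flow_a["remote_ip"] == flow_b["remote_ip"]
    cert_a, cert_b = flow_a.get("tls_cert"), flow_b.get("tls_cert")
    same_cert = cert_a is not None and cert_a == cert_b
    return 1.0 if (same_ip or same_cert) else 0.0

def edge_feature(flow_a, flow_b):
    """Two-dimensional edge feature: [temporal, spatial]."""
    return [temporal_correlation(flow_a["start"], flow_b["start"]),
            spatial_correlation(flow_a, flow_b)]

# Hypothetical flows: a and b share a certificate, c has none.
flow_a = {"start": 0.0, "remote_ip": "198.51.100.1", "tls_cert": "cert-A"}
flow_b = {"start": 0.0, "remote_ip": "198.51.100.2", "tls_cert": "cert-A"}
flow_c = {"start": 1.0, "remote_ip": "203.0.113.7"}
feat_ab = edge_feature(flow_a, flow_b)
feat_ac = edge_feature(flow_a, flow_c)
```

A distance-based spatial correlation, as mentioned for the more elaborate variant, would replace `spatial_correlation` with a function that maps the physical distance between IP home locations into the same [0, 1] range.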

FIG. 2 shows an example of a spatio-temporal correlation graph between network data streams. In FIG. 2, 1 to 8 denote the sequence numbers of 8 network data streams, t₁ to t₈ denote their start times, IP₁ to IP₈ denote their remote IPs, dest₁ to dest₄ denote 4 network destination addresses, and the 4 small rectangular boxes attached to each network data stream represent its features. Consecutive network data streams with the same remote destination are treated as a group; because the time order of the streams within a group is often ambiguous, an edge is placed between the nodes of every two streams in the group. The time order between groups is relatively clear, so the groups are numbered in the time order of their network data streams, and an edge is placed between nodes belonging to adjacent groups to represent the time order. In FIG. 2, the edge features are calculated in the manner described above. However, when the amount of website traffic is sufficient and conditions permit (the traffic is sufficiently diverse and is collected at multiple locations), a more elaborate definition of spatial correlation can be used: for example, the IP home location is obtained from the remote IP address, the network destination is defined according to the IP home location, and the physical distance between network destinations (such as the distance between cities) is quantized to obtain the spatial correlation between network data streams.
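The grouping and edge rules can be sketched as follows, taking as input the network destination of each flow in start-time order. The sketch assumes that every node of one group is connected to every node of the next group; the text only says that an edge is placed between nodes of adjacent groups, so this full connection is an interpretation. The destination labels in the example are hypothetical and are not meant to reproduce FIG. 2.

```python
from itertools import combinations

def group_flows(destinations):
    """Split time-ordered flows into runs of consecutive flows that share
    the same network destination. Returns lists of flow indices."""
    groups = []
    for idx, dest in enumerate(destinations):
        if groups and destinations[groups[-1][-1]] == dest:
            groups[-1].append(idx)
        else:
            groups.append([idx])
    return groups

def build_edges(groups):
    """Edges between every pair inside a group, plus edges between every
    node of one group and every node of the next group (assumption)."""
    edges = []
    for g in groups:
        edges.extend(combinations(g, 2))
    for g1, g2 in zip(groups, groups[1:]):
        edges.extend((u, v) for u in g1 for v in g2)
    return edges

# Hypothetical destinations of 8 flows, ordered by start time.
destinations = ["dest1", "dest1", "dest2", "dest3",
                "dest3", "dest3", "dest1", "dest4"]
groups = group_flows(destinations)
edges = build_edges(groups)
```

Note that the seventh flow returns to `dest1` but forms its own group, because grouping only merges consecutive flows with the same destination.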

Step 3, processing the spatio-temporal correlation graph between the network data streams with a graph neural network to obtain a global representation of the spatio-temporal correlation graph between the network data streams.

The application scenario of this embodiment is website fingerprint identification based on multi-stream spatio-temporal correlation analysis. The constructed inter-stream spatio-temporal correlation graph is non-Euclidean structured data and therefore needs to be processed with a graph neural network.

The graph neural network includes a plurality of graph attention layers, a graph pooling layer, and a readout layer. Each graph attention layer applies its own attention mechanism and updates the node features in combination with the edge features; for each node, the features updated by all graph attention layers are concatenated to form the final feature of the node. The graph pooling layer computes a score for each node from its final feature and removes the nodes whose scores are below a threshold, together with their edges, to obtain a pooled spatio-temporal correlation graph between network data streams. The readout layer then performs global pooling to obtain the global representation of the spatio-temporal correlation graph between network data streams.

In embodiments of the present invention, all the graph attention layers together form a Graph Attention Network (GAT). GAT introduces an attention mechanism to graph-structured data; one advantage of the attention mechanism is that decisions are made by focusing on the most relevant parts. In practice, the number of graph attention layers can be adjusted according to the actual situation. Each graph attention layer has an independent attention mechanism.

The final feature of each node is calculated by the plurality of graph attention layers as follows. The set $h$ of the features of all nodes in the spatio-temporal correlation graph between network data streams is input into each graph attention layer, where $h$ is expressed as:

$$h = \{h_1, h_2, \ldots, h_N\}, \quad h_i \in \mathbb{R}^{F}$$

where $N$ is the number of nodes, $h_i$ is the feature of node $i$, $i = 1, 2, \ldots, N$, $\mathbb{R}$ is the symbol for the set of real numbers, and $F$ is the dimension of the node features input to the graph attention layer. In each graph attention layer, for node $i$, the attention weight of each neighbor node is calculated from the features of the neighbor nodes and the features of the edges; the features of the neighbor nodes are then linearly accumulated using these attention weights to obtain the updated feature of node $i$. Here, the neighbor nodes of node $i$ are the nodes connected to node $i$ by an edge (including node $i$ itself). For node $i$, the features updated by all graph attention layers are concatenated to obtain the final feature of node $i$. The set of final features of all nodes is expressed as:

$$h' = \{h'_1, h'_2, \ldots, h'_N\}, \quad h'_i \in \mathbb{R}^{F'}$$

where $h'_i$ is the final feature of node $i$ and $F'$ is the dimension of the final feature of a node.

The graph attention layer uses the attention weight to define the importance of neighbor node $j$ to node $i$, which can be learned through the attention mechanism. Taking the structure of the graph into account, node $j$ is called a neighbor node of node $i$, and an attention weight is calculated for it, if and only if an edge exists between node $i$ and node $j$. Without edge features, the attention weight is calculated as:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}\left[W h_i \,\|\, W h_j\right]\right)\right)}{\sum_{l \in N_i} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}\left[W h_i \,\|\, W h_l\right]\right)\right)}$$

where $\alpha_{ij}$ is the importance of node $j$ to node $i$, i.e., the attention weight of node $j$ without edge features; $\mathrm{LeakyReLU}(\cdot)$ is the LeakyReLU activation function; $N_i$ is the set of all neighbor nodes of node $i$ (including node $i$ itself); $h_j$ is the feature of neighbor node $j$ of node $i$; $h_l$ is the feature of neighbor node $l$ of node $i$; $h_j$ and $h_l$ are both taken from the node feature set $h$; the symbol $\|$ denotes concatenation; and $W$ (a matrix) and $\mathbf{a}$ (a vector) are trainable parameters of the graph attention layer.

Because the edges of the spatio-temporal correlation graph between network data streams carry multidimensional features in the embodiment of the invention, each graph attention layer calculates, for node $i$, the attention weight of each of its neighbor nodes with the following formula:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}\left[W h_i \,\|\, W h_j \,\|\, W_e e_{ij}\right]\right)\right)}{\sum_{l \in N_i} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}\left[W h_i \,\|\, W h_l \,\|\, W_e e_{il}\right]\right)\right)}$$

where $\alpha_{ij}$ is the importance of node $j$ to node $i$ computed by the graph attention layer, i.e., the attention weight of node $j$ with edge features; $e_{ij}$ is the feature of the edge between node $i$ and its neighbor node $j$; $e_{il}$ is the feature of the edge between node $i$ and its neighbor node $l$; and $W$, $W_e$ (matrices) and $\mathbf{a}$ (a vector) are trainable parameters of the graph attention layer.
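As an illustration of the edge-aware attention weight, the following sketch computes a softmax over LeakyReLU scores built from the concatenation of the transformed features of node i, of a neighbor j, and of the edge between them. It is a plain-Python sketch, not the patented implementation: the LeakyReLU slope of 0.2, the requirement that a self-loop edge feature `e[(i, i)]` be supplied, and all function names are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def attention_weights(
    i: int,
    neighbors: Sequence[int],
    h: Sequence[Vector],
    e: Dict[Tuple[int, int], Vector],
    W: Matrix,
    W_e: Matrix,
    a: Vector,
) -> Dict[int, float]:
    """Softmax-normalised attention of node i over its neighbors (i included)."""
    wh_i = matvec(W, h[i])
    scores = {}
    for j in neighbors:
        # Concatenate [W h_i || W h_j || W_e e_ij] and project it with vector a.
        concat = wh_i + matvec(W, h[j]) + matvec(W_e, e[(i, j)])
        scores[j] = leaky_relu(sum(p * q for p, q in zip(a, concat)))
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: v / total for j, v in exps.items()}
```

The returned weights sum to 1 over the neighbor set, so they can be used directly as coefficients when accumulating neighbor features.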

Practice has shown that extending the attention mechanism to a multi-head attention mechanism is more effective, so the present invention also uses multi-head attention. Denoting the number of graph attention layers as $K$, the node features updated by all graph attention layers are concatenated to obtain the final feature of node $i$, expressed as:

$$h'_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{(k)} W^{(k)} h_j\Big)$$

where $\sigma\big(\sum_{j \in N_i} \alpha_{ij}^{(k)} W^{(k)} h_j\big)$ is the updated feature of node $i$ computed by the $k$-th graph attention layer; $W^{(k)}$ is the weight matrix of the $k$-th graph attention layer; the symbol $\Vert$ denotes concatenation; $\sigma$ is the activation function; $h'_i$ is the final feature of node $i$; and $\alpha_{ij}^{(k)}$ is the importance of node $j$ to node $i$ computed by the $k$-th graph attention layer. All graph attention layers calculate $\alpha_{ij}$ with the foregoing formula; the only difference is that the weight matrices and vectors differ between layers, i.e., the two weight matrices and the vector of the $k$-th graph attention layer are denoted $W^{(k)}$, $W_e^{(k)}$ and $\mathbf{a}^{(k)}$.
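The concatenation over K graph attention layers can be sketched as follows, assuming the attention weights of node i have already been computed for each layer. The text does not name the activation function, so ELU is used here purely as an assumption; `head_update` and `final_feature` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def elu(x: float) -> float:
    """Assumed activation function; the source only says 'activation function'."""
    return x if x > 0 else math.exp(x) - 1.0


def head_update(alpha_i: Dict[int, float], h: Sequence[Vector], W: Matrix) -> Vector:
    """Compute sigma(sum_j alpha_ij * W h_j) for one graph attention layer."""
    out = [0.0] * len(W)
    for j, weight in alpha_i.items():
        wh_j = [sum(w * x for w, x in zip(row, h[j])) for row in W]
        out = [acc + weight * val for acc, val in zip(out, wh_j)]
    return [elu(v) for v in out]


def final_feature(
    alphas_i: Sequence[Dict[int, float]],
    h: Sequence[Vector],
    Ws: Sequence[Matrix],
) -> Vector:
    """Concatenate the outputs of all K layers to obtain the final feature of node i."""
    result: Vector = []
    for alpha_i, W in zip(alphas_i, Ws):
        result.extend(head_update(alpha_i, h, W))
    return result
```

With K layers that each output a vector of the same length, the final feature is K times that length, which is why the final feature dimension differs from the input dimension.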

FIG. 3 illustrates the processing performed by the graph attention layers. The left side of FIG. 3 shows how a single graph attention layer calculates the attention weight; the right side shows how the node features are updated under the multi-head attention mechanism, taking node 1 (which has two neighbor nodes) as an example and calculating its final feature with two independent graph attention layers.

To focus on the important nodes, reduce the influence of noise nodes on the identification result, reduce the number of model parameters, and improve training and identification efficiency, a graph pooling layer is used (the model here is the overall model formed by the method). In the embodiment of the present invention, the graph pooling layer pools the output of the graph attention layers using the Self-Attention Graph Pooling (SAGPool) method. SAGPool uses a graph attention method to learn a score for each node in the graph (that is, the feature dimension of each node output by this graph attention layer is 1) and discards the low-scoring nodes based on the ranking of the scores. This pooling approach borrows the idea of max pooling in a CNN (convolutional neural network) to select the more important information, as shown in FIG. 4. Because SAGPool learns node importance from the structure of the graph, the resulting global representation contains both the structural information of the graph and the attribute information of each node, which makes SAGPool an effective graph pooling method.
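The node-dropping step of the graph pooling layer can be illustrated as follows. This is a minimal sketch rather than SAGPool itself: it assumes the per-node scores have already been computed, applies the threshold rule mentioned for the graph pooling layer, and omits the learned scoring layer. `pool_graph` and its parameters are hypothetical names.

```python
from typing import Dict, List, Sequence, Set, Tuple


def pool_graph(
    features: Sequence[List[float]],
    edges: Set[Tuple[int, int]],
    scores: Sequence[float],
    threshold: float,
) -> Tuple[List[List[float]], Set[Tuple[int, int]], List[int]]:
    """Drop nodes whose score is below `threshold` together with their edges."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    # Surviving nodes are re-indexed from 0 in their original order.
    new_index: Dict[int, int] = {old: new for new, old in enumerate(kept)}
    pooled_features = [list(features[i]) for i in kept]
    pooled_edges = {
        (new_index[u], new_index[v])
        for u, v in edges
        if u in new_index and v in new_index
    }
    return pooled_features, pooled_edges, kept
```

A ranking-based variant would instead keep a fixed fraction of the highest-scoring nodes; the edge filtering and re-indexing would stay the same.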

Then, the readout layer performs global pooling on the pooled spatio-temporal correlation graph between network data streams output by the graph pooling layer, to obtain the global representation of the spatio-temporal correlation graph between network data streams.
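The text does not specify which global pooling operation the readout layer uses. As one possible choice, the sketch below concatenates the per-dimension mean and maximum over all remaining nodes; this choice and the function name `readout` are assumptions made only for illustration.

```python
from typing import List, Sequence


def readout(node_features: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the per-dimension mean and maximum over all nodes."""
    if not node_features:
        raise ValueError("the pooled graph must contain at least one node")
    columns = list(zip(*node_features))  # one tuple per feature dimension
    mean = [sum(col) / len(col) for col in columns]
    peak = [max(col) for col in columns]
    return mean + peak
```

Whatever operation is chosen, the result is a fixed-length vector that does not depend on the number of nodes, so graphs of different sizes can be fed to the same classifier.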

Step 4, classifying with the global representation of the spatio-temporal correlation graph between the network data streams to obtain a website fingerprint identification result.

In the embodiment of the invention, a fully connected layer can be used as the classifier to perform the classification task and obtain the website fingerprint identification result.
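A fully connected classifier over the global representation can be sketched as a single linear layer followed by a softmax. The softmax, the weight layout (one row per website class) and the name `classify` are assumptions for this example; the text only states that a fully connected layer can serve as the classifier.

```python
import math
from typing import List, Sequence, Tuple


def classify(
    g: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Tuple[int, List[float]]:
    """Return (predicted class index, class probabilities) for global representation g."""
    # One logit per website class: w_c . g + b_c
    logits = [sum(w * x for w, x in zip(row, g)) + b for row, b in zip(weights, bias)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [v / total for v in exps]
    return probs.index(max(probs)), probs
```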

The scheme provided by the embodiments of the present invention is described below as a whole. FIG. 5 shows the overall framework of the website fingerprint identification method, which mainly includes feature extraction, graph generation, a graph neural network, and a classifier. The feature extraction part extracts the features of each network data stream as in step 1; the graph generation part generates the spatio-temporal correlation graph between network data streams as in step 2; the graph neural network obtains the global representation of the spatio-temporal correlation graph between network data streams as in step 3; and the classifier obtains the website fingerprint identification result as in step 4. The three graph attention layers shown in FIG. 5 are only an example.

Note that during training the parameters of the graph neural network are updated using the classification result (training result) of step 4, whereas in the testing stage the parameters are not updated. Training stage: first, the adjacency matrix and the feature matrix of the spatio-temporal correlation graph between network data streams are obtained. The adjacency matrix is constructed from the edges between nodes; its numbers of rows and columns are both N (i.e., equal to the number of nodes), and the value at each position indicates whether an edge exists between the corresponding nodes: for example, if an edge exists between the m-th node and the n-th node, the entries at row m, column n and at row n, column m are 1; otherwise, they are 0. Each row of the feature matrix is the feature of one node. Then, the node information is updated by a plurality of graph attention layers with independent attention mechanisms, and the updated node features output by the graph attention layers are concatenated, so that the feature of each node contains information from the node itself and from its neighbor nodes. The concatenated final node features are pooled by the graph pooling layer, and the readout layer performs global pooling to obtain the global representation of the spatio-temporal correlation graph between network data streams. Finally, the classifier produces the classification result for training. By learning from the spatio-temporal correlation graphs between network data streams end to end, the parameters of the graph neural network are adjusted according to the training result.
In the testing stage, the trained graph neural network is used for processing the space-time correlation graph among the network data streams generated by the network data streams in the testing set, and the classifier is used for identifying the website fingerprints.
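The adjacency matrix used in the training stage can be built from an undirected edge list as sketched below. Nodes are indexed from 0 here, and self-loops are not added because the text does not say whether the matrix contains them; `adjacency_matrix` is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def adjacency_matrix(num_nodes: int, edges: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Return the symmetric N x N 0/1 adjacency matrix of an undirected graph."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for m, n in edges:
        # An edge between nodes m and n sets both (m, n) and (n, m) to 1.
        adj[m][n] = 1
        adj[n][m] = 1
    return adj
```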

The following examples are provided to demonstrate that the above scheme of the present invention can improve the recognition accuracy.

Example 1: a data set is collected that contains the raw access traffic of 21 common websites, each of which is accessed 200 times. The data set is split into a training set, a validation set and a test set in a ratio of 3:1:1. Features are extracted and spatio-temporal correlation graphs between network data streams are generated (for the construction of the inter-stream spatio-temporal correlation graph, refer to FIG. 2). The graphs are fed into the graph neural network for training, and after validation on the validation set, the accuracy on the test set is 98.13%.
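For 21 websites with 200 visits each (4200 samples), a 3:1:1 split yields 2520 training, 840 validation and 840 test samples. The sketch below performs such a split separately for each website; whether the original experiment split per website or over the whole data set is not stated, so the per-website split, the random seed and the name `split_per_site` are assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, int]


def split_per_site(
    samples_by_site: Dict[str, Sequence[int]],
    ratios: Tuple[int, int, int] = (3, 1, 1),
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Shuffle each site's samples and split them into train/validation/test parts."""
    rng = random.Random(seed)
    train: List[Sample] = []
    val: List[Sample] = []
    test: List[Sample] = []
    for site, samples in samples_by_site.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_train = len(shuffled) * ratios[0] // sum(ratios)
        n_val = len(shuffled) * ratios[1] // sum(ratios)
        train += [(site, s) for s in shuffled[:n_train]]
        val += [(site, s) for s in shuffled[n_train:n_train + n_val]]
        test += [(site, s) for s in shuffled[n_train + n_val:]]  # remainder goes to the test set
    return train, val, test
```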

Example 2: in the training phase, the network data streams to be identified are input. The feature extraction part divides the network traffic into bidirectional flows according to the five-tuple, extracts the features required for constructing the spatio-temporal correlation graph between network data streams, and sends them to the graph generation part. The graph generation part takes the network data streams as nodes and their features as node features, establishes edges between nodes according to the temporal and spatial correlation between the network data streams, and assigns the edge features, thereby constructing the spatio-temporal correlation graph between network data streams. The graph is fed into the graph neural network for training, and the graph neural network module learns the global representation of the spatio-temporal correlation graph between network data streams. To keep the traffic behavior patterns up to date, training needs to be repeated at intervals.
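Dividing traffic into bidirectional flows by five-tuple can be sketched as follows. The idea is to build a direction-independent key so that packets travelling in either direction of a connection land in the same flow. The `Packet` record, its field names and the helper functions are hypothetical and only serve this illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    timestamp: float
    size: int


FlowKey = Tuple[Tuple[str, int], Tuple[str, int], str]


def flow_key(p: Packet) -> FlowKey:
    """Direction-independent key: both directions of a connection map to one flow."""
    a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
    return (a, b, p.protocol) if a <= b else (b, a, p.protocol)


def group_bidirectional(packets: List[Packet]) -> Dict[FlowKey, List[Packet]]:
    """Group packets into bidirectional flows, each ordered by timestamp."""
    flows: Dict[FlowKey, List[Packet]] = defaultdict(list)
    for p in sorted(packets, key=lambda pkt: pkt.timestamp):
        flows[flow_key(p)].append(p)
    return dict(flows)
```

Each resulting flow would then become one node of the graph, with its sequence and statistical features computed from the grouped packets.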

In the testing phase, the overall framework of the present invention is deployed on a network node or loaded onto a computing device in a conventional manner. The website traffic at the network node is used to generate the spatio-temporal correlation graph between network data streams, the trained graph neural network produces the global representation of that graph, and the classifier matches the global representation against the behavior patterns learned in the training stage, thereby identifying the website fingerprint.

As those skilled in the art will understand, the traffic of each website is unique, the website fingerprint can be identified from this uniqueness, and the uniqueness can be regarded as a behavior pattern.

The above scheme provided by the embodiment of the invention mainly obtains the following beneficial effects: the network data flow does not need to be filtered in advance, the characteristics of the network data flow and the time-space correlation between the network data flows are considered, and the identification effect and the generalization capability of the website fingerprint identification technology are improved. The characteristics of the adopted network data flow are representative, wherein the sequence characteristics represent the more important local characteristics of the network data flow, and the statistical characteristics can represent the overall characteristics of the network data flow. The behavior pattern of the network data stream is modeled from the global perspective by considering the time-space correlation between the network data streams, and the website fingerprint is identified by combining the graph neural network and the classifier, so that the identification accuracy of the website fingerprint identification technology can be effectively improved.

Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.

Example two

The present invention further provides a website fingerprint identification system based on temporal-spatial correlation between network data streams, which is implemented mainly based on the method provided in the foregoing embodiment, as shown in fig. 6, the system mainly includes:

the graph generation module is used for dividing continuous network data streams with the same remote destination into a group by taking each network data stream as a node, numbering each group by combining the time of the network data streams, adding edges between nodes corresponding to the network data streams in the group and between nodes corresponding to adjacent groups of network data streams, and constructing a time-space correlation graph among the network data streams, wherein the characteristics of the nodes in the time-space correlation graph among the network data streams are the characteristics of the corresponding network data streams, and the characteristics of the edges are calculated through the time and space relation among the related network data streams;

It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A website fingerprint identification method based on time-space correlation among network data streams is characterized by comprising the following steps:

processing the spatiotemporal correlation diagram among the network data streams by using a graph neural network to obtain the global representation of the spatiotemporal correlation diagram among the network data streams;

2. The method of claim 1, wherein the characteristics of each network data stream comprise: sequence features and statistical features;

3. The method of claim 1, wherein the computing of the edge characteristics through the temporal and spatial relationship between related network data streams comprises:

4. The website fingerprinting method based on the spatio-temporal correlation between the network data streams according to claim 1, wherein the processing the spatio-temporal correlation map between the network data streams by using the graph neural network to obtain the global characterization of the spatio-temporal correlation map between the network data streams comprises:

5. The method as claimed in claim 4, wherein each graph attention layer updates the node features by using an attention mechanism and combining the edge features, and for each node, concatenating the updated node features of all the graph attention layers as the final feature of each node includes:

the set $h$ of the features of all nodes in the spatio-temporal correlation graph between network data streams is input to each graph attention layer, the set $h$ being expressed as:

$$h = \{h_1, h_2, \ldots, h_N\}$$

where $N$ represents the number of nodes, and $h_i$ represents the feature of node $i$, $i = 1, 2, \ldots, N$;

in each graph attention layer, for node $i$, calculating the attention weight of each neighbor node by using the features of each neighbor node and the features of the edges, linearly accumulating the features of the neighbor nodes in combination with their attention weights, and obtaining the updated feature of node $i$; wherein the neighbor nodes of node $i$ are the nodes connected to node $i$ by an edge;

for node $i$, concatenating the features updated by all graph attention layers to obtain the final feature of node $i$, the set of final features of all nodes being expressed as:

$$h' = \{h'_1, h'_2, \ldots, h'_N\}$$

where $h'_i$ represents the final feature of node $i$.

6. The method as claimed in claim 5, wherein, in each graph attention layer, calculating for node $i$ the attention weight of each neighbor node by using the features of each neighbor node and the features of the edges comprises:

denoting the set formed by node $i$ and all its neighbor nodes as $N_i$, the attention weight is calculated by the following formula:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}\left[W h_i \,\|\, W h_j \,\|\, W_e e_{ij}\right]\right)\right)}{\sum_{l \in N_i} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{T}\left[W h_i \,\|\, W h_l \,\|\, W_e e_{il}\right]\right)\right)}$$

where $\alpha_{ij}$ represents the importance of node $j$ to node $i$ calculated by the graph attention layer, i.e., the attention weight of node $j$; $\mathrm{LeakyReLU}(\cdot)$ represents the LeakyReLU activation function; $h_j$ and $h_l$ represent the features of neighbor nodes $j$ and $l$ of node $i$; $e_{ij}$ and $e_{il}$ represent the features of the edges between node $i$ and its neighbor nodes $j$ and $l$; the symbol $\|$ represents the concatenation operation; and $W$, $W_e$ and $\mathbf{a}$ are three trainable parameters of the graph attention layer.

7. The website fingerprinting method based on the spatio-temporal correlation between the network data streams according to claim 5 or 6, wherein the node features updated by all graph attention layers are concatenated to obtain the final feature of node $i$, expressed as:

$$h'_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{(k)} W^{(k)} h_j\Big)$$

where $K$ represents the number of graph attention layers; $\alpha_{ij}^{(k)}$ represents the importance of node $j$ to node $i$ calculated by the $k$-th graph attention layer; $W^{(k)}$ represents the weight matrix of the $k$-th graph attention layer; $h_j$ represents the feature of neighbor node $j$ of node $i$; $N_i$ represents the set formed by node $i$ and all its neighbor nodes; the symbol $\Vert$ represents the concatenation operation; $\sigma$ represents the activation function; and $h'_i$ represents the final feature of node $i$.

8. A website fingerprinting system based on the spatio-temporal correlation between network data streams, which is implemented based on the method of any one of claims 1 to 7, and comprises:

and the classifier is used for classifying the global representation of the space-time correlation diagram among the network data streams to obtain a website fingerprint identification result.