CN114205151A

CN114205151A - HTTP/2 page access flow identification method based on multi-feature fusion learning

Info

Publication number: CN114205151A
Application number: CN202111513183.3A
Authority: CN
Inventors: 权迎雪; 刘伟伟; 朱伟佳; 陈浩
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2021-12-12
Filing date: 2021-12-12
Publication date: 2022-03-18

Abstract

The invention discloses an HTTP/2 page access traffic identification method based on multi-feature fusion learning. The method first collects the homepage access traffic and the resource response traffic generated by a target HTTP/2 site during typical user interaction, and preprocesses the traffic data to obtain complete TCP flows. An autoencoder network then captures the content distribution pattern features of the homepage access traffic, while a recurrent neural network identifies the main resource category carried by the resource response traffic. The content distribution pattern features and the main resource category features are concatenated and input into a convolutional neural network model to obtain the site page identification result. The invention uses multiple data flows as the basic unit for fingerprint extraction, extracts features from the different types of flows with deep learning, and combines these features to characterize the target site more fully, thereby improving the identification accuracy of HTTP/2 page access traffic.

Description

HTTP/2 page access flow identification method based on multi-feature fusion learning

Technical Field

The invention belongs to the field of network security, and particularly relates to an HTTP/2 page access flow identification method based on multi-feature fusion learning.

Background

With the continuous development of Internet technology, more and more websites encrypt their network traffic in transit, which protects user privacy to a certain extent but also poses challenges for traffic supervision. HTTP is one of the most widely used application-layer protocols on the Internet; it serves as the bridge between server and client and is the basis of current Web applications. Many scholars have conducted extensive research on traffic analysis methods for encrypted HTTP/1.1. However, with the rapid development of Web services, the drawbacks of HTTP/1.1, such as network latency and wasted bandwidth, have become increasingly apparent, and the HTTP/2 protocol was introduced to address them.

HTTP/2 is the successor to HTTP/1.1. It introduces new features such as binary framing, multiplexing, and server push while remaining fully compatible with HTTP/1.1 semantics. The multiplexing mechanism allows a single connection to carry multiple requests and responses, which reduces the connection load on the server and greatly improves transmission efficiency. However, this mechanism also reorganizes how web page objects are transmitted: while reducing network latency, it blurs the boundaries between web page objects, which poses new challenges for existing page traffic identification and analysis methods.

Disclosure of Invention

The invention aims to provide an HTTP/2 page access flow identification method based on multi-feature fusion learning.

The technical solution for realizing the purpose of the invention is as follows: an HTTP/2 page access flow identification method based on multi-feature fusion learning comprises the following steps:

step 1, acquiring the homepage access traffic and the resource response traffic generated by a target HTTP/2 site during typical user interaction;

step 2, preprocessing the traffic data to obtain complete TCP flows;

step 3, capturing the content distribution pattern features of the homepage access traffic using an autoencoder network;

step 4, identifying the main resource category carried by the resource response traffic using a recurrent neural network;

step 5, concatenating the content distribution feature vector and the main resource feature vector, and inputting the concatenated vector into a convolutional neural network to obtain the HTTP/2 page traffic identification result.

Further, in step 1, an automation script tool is used to launch a browser and simulate typical user interaction at the target site to obtain traffic samples; the traffic samples are generated by accessing the target HTTP/2 sites with two browsers at random time points, and the number of HTTP/2 sites is not less than 10; the homepage access traffic and the resource response traffic generated by the target site are saved offline using a packet capture tool.

Further, in step 2, the traffic samples are filtered and noise data generated during capture is discarded; the traffic data is then aggregated by four-tuple to obtain complete TCP flows. The four-tuple comprises the source IP, source port, destination IP, and destination port.

Further, in step 3, a directed packet length sequence and a packet inter-arrival time sequence are extracted from the homepage access traffic generated by the target site; the sequence information is input into the autoencoder network to capture the content distribution feature vector of the homepage access traffic.

Further, in step 4, a directed packet length sequence and a packet inter-arrival time sequence are extracted from the resource response traffic generated by the target site; the sequence information is input into the recurrent neural network to identify the main resource type carried by the resource response traffic, including, but not limited to, video, audio, text, and pictures.

Further, when the directed packet length sequence and the packet inter-arrival time sequence are extracted, they are truncated or zero-padded to obtain sequences of fixed length; in the directed packet length sequence, positive and negative values identify uplink and downlink packets, respectively.

Further, in step 5, the content distribution feature vector and the main resource feature vector extracted from the same HTTP/2 site are concatenated, and the concatenated vector is input into a convolutional neural network to obtain the HTTP/2 page traffic identification result.

Compared with the prior art, the invention has the following beneficial effects: multiple TCP flows generated by an HTTP/2 site are used as the basic unit for fingerprint extraction, which improves the robustness of the web page fingerprint; meanwhile, for TCP flows with different behaviors, dedicated deep learning networks are designed to model their latent information patterns, so that the overall characteristics of the target site are represented more accurately and the identification accuracy of HTTP/2 page access traffic is further improved.

Drawings

FIG. 1 is an overall flow chart of the method of the present invention.

FIG. 2 is a flow chart of automatic page access and traffic collection.

FIG. 3 is a framework diagram of page traffic identification based on multi-feature fusion learning.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

This embodiment provides an HTTP/2 page access traffic identification method based on multi-feature fusion learning which, as shown in FIG. 1, specifically includes the following steps:

Step 1, acquiring the homepage access traffic and the resource response traffic generated by a target HTTP/2 site during typical user interaction (see FIG. 2), comprising:

An open-source tool (Selenium) is used to automatically drive a browser to perform web page visits, simulating the interaction between a real user and the browser to generate the target traffic; in this embodiment, the number of target HTTP/2 sites is 10, traffic sample collection lasts 5 days, and the two browsers, Chrome and Edge, are used to perform 40 visits at random time points each day.

Page traffic is captured with the Wireshark tool and saved as PCAP files; the traffic is decrypted with a preset TLS session key so that the homepage access traffic and the resource response traffic can be separated and the main resource type carried by each subsequent resource response flow can be labeled.
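The collection plan of this embodiment can be sketched as a randomized visit schedule. The following is a minimal illustration only: the function name `build_visit_schedule`, the start date, and the reading of "40 times each day" as 40 visits per browser per site per day are all assumptions, since the text does not fix them. Each generated entry would then be handed to a Selenium-driven browser while a packet capture runs.

```python
import random
from datetime import datetime, timedelta


def build_visit_schedule(sites, start, browsers=("chrome", "edge"),
                         days=5, visits_per_day=40, seed=0):
    """Return a time-sorted list of (timestamp, browser, site) visits.

    Assumption: `visits_per_day` applies per browser and per site;
    the source text leaves this ambiguous.
    """
    rng = random.Random(seed)  # seeded so the plan is reproducible
    schedule = []
    for day in range(days):
        day_start = start + timedelta(days=day)
        for site in sites:
            for browser in browsers:
                for _ in range(visits_per_day):
                    # Random second within the current day.
                    offset = rng.randrange(24 * 60 * 60)
                    visit_time = day_start + timedelta(seconds=offset)
                    schedule.append((visit_time, browser, site))
    schedule.sort()
    return schedule
```

Under these assumptions, 10 sites, 2 browsers, 5 days, and 40 visits give 4000 scheduled visits in total.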

Step 2, preprocessing the traffic data to obtain complete TCP flows, comprising: using the Scapy open-source library to aggregate and split the PCAP files into flows by four-tuple, filtering the traffic samples, and discarding noise data generated during capture, including but not limited to control packets, retransmitted packets, and incomplete sessions, to obtain relatively clean TCP flows.
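The four-tuple grouping and noise filtering can be illustrated with a small pure-Python sketch. The `Packet` record and its fields are hypothetical stand-ins for values that would in practice be read from the PCAP file (for example with Scapy); treating zero-payload packets as control packets is a simplifying assumption, and the removal of incomplete sessions is omitted here.

```python
from collections import defaultdict, namedtuple

# Hypothetical packet record; real fields would come from the PCAP file.
Packet = namedtuple(
    "Packet",
    "ts src_ip src_port dst_ip dst_port payload_len is_retransmission",
)


def flow_key(pkt):
    """Direction-independent four-tuple, so both directions share one flow."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (a, b) if a <= b else (b, a)


def group_flows(packets):
    """Group packets into TCP flows and drop simple noise."""
    flows = defaultdict(list)
    for pkt in packets:
        if pkt.payload_len == 0:
            # Assumption: zero-payload packets are pure control packets
            # (SYN, FIN, bare ACK) and carry no content information.
            continue
        if pkt.is_retransmission:
            continue
        flows[flow_key(pkt)].append(pkt)
    for pkts in flows.values():
        pkts.sort(key=lambda p: p.ts)  # restore time order within each flow
    return dict(flows)
```

Using a direction-independent key is a design choice of this sketch: it keeps uplink and downlink packets of one connection together, which the later directed-length features require.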

Step 3, capturing the content distribution pattern features of the homepage access traffic using an autoencoder network, comprising:

A directed packet length sequence and a packet inter-arrival time sequence are extracted from the homepage access traffic generated by the target site, and are truncated or zero-padded to obtain fixed-length sequence information; in the directed packet length sequence, positive and negative values identify uplink and downlink packets, respectively.
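The feature extraction described above can be sketched as follows. This is an illustrative sketch, not the patented implementation: packets are represented as hypothetical `(timestamp, source_ip, length)` tuples, the target length `n` is a free parameter that the text does not specify, and setting the first inter-arrival time to 0 is an assumption.

```python
def directed_lengths(packets, client_ip):
    """Signed lengths: positive for uplink (sent by the client), negative for downlink."""
    return [length if src == client_ip else -length
            for (_, src, length) in packets]


def inter_arrival_times(packets):
    """Time gap between consecutive packets; the first gap is taken as 0 (assumption)."""
    ts = [t for (t, _, _) in packets]
    if not ts:
        return []
    return [0.0] + [b - a for a, b in zip(ts, ts[1:])]


def fix_length(seq, n, pad=0):
    """Truncate to n items, or right-pad with zeros up to n items."""
    return list(seq[:n]) + [pad] * max(0, n - len(seq))
```

For example, three packets with lengths 100 (uplink), 1400, and 600 (both downlink) give the directed sequence `[100, -1400, -600]`, which becomes `[100, -1400, -600, 0, 0]` after padding to length 5.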

In this embodiment, the autoencoder network comprises an input layer, an encoder network, and a decoder network connected in sequence. The encoder network consists of a convolutional layer and a max-pooling layer: the convolutional layer extracts features from the input data, and the max-pooling layer then compresses these features to reduce the feature dimension. The decoder network consists of a convolutional layer and a deconvolution layer: the convolutional layer decodes the features extracted by the encoder network, and the deconvolution layer then maps them back to a higher dimension to reconstruct the input signal.
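To make the dimension reduction in the encoder concrete, the toy sketch below applies a single 1-D convolution followed by max pooling in pure Python. It covers the encoder half only; the kernel values and pool size are arbitrary placeholders rather than trained or disclosed parameters, and a practical implementation would use a deep learning framework and include the decoder.

```python
def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation form, as in most DL frameworks)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def max_pool1d(x, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def encode(x, kernel, pool_size):
    """Encoder sketch: convolution extracts features, pooling compresses them."""
    return max_pool1d(conv1d(x, kernel), pool_size)
```

With an input of 8 values, a kernel of width 2, and a pool size of 2, the convolution yields 7 features and pooling reduces them to 3, which illustrates the compression step described above.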

Step 4, identifying the main resource category carried by the resource response traffic using a recurrent neural network, comprising:

A directed packet length sequence and a packet inter-arrival time sequence are extracted from the resource response traffic generated by the target site; the processing is the same as in step 3 and is not repeated here.

In this embodiment, the recurrent neural network comprises an input layer, a two-layer GRU, and a fully connected layer connected in sequence; the GRU extracts sequence correlation features, and the fully connected layer identifies the main resource type carried by the resource response traffic, yielding the resource type label vector:

$$C = (c_1, c_2, \ldots, c_i, \ldots, c_n)$$

where $c_i \in [0, 1]$ indicates the identification relevance of the $i$-th resource type; the resource types include, but are not limited to, video, audio, text, and pictures.
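For reference, one common formulation of a single GRU step, together with an output layer that produces scores in $[0, 1]$, is given below. The text does not state the GRU variant, the hidden size, or the output activation; the weight symbols $W$, $U$, $b$ and the sigmoid output $\sigma$ are assumptions made for illustration. Here $x_t$ is the input at position $t$ (directed length and inter-arrival time), $h_t$ is the hidden state, and $h_T$ is the final hidden state of the last GRU layer.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
C &= \sigma(W_o h_T + b_o), \qquad c_i \in [0, 1]
\end{aligned}
```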

Step 5, concatenating the content distribution feature vector and the main resource feature vector and inputting the result into a convolutional neural network to obtain the HTTP/2 page traffic identification result, which includes concatenating the feature vectors in order into a vector of fixed length; see FIG. 3. In this embodiment, the convolutional neural network comprises an input layer, several convolutional layers, a max-pooling layer, and a fully connected layer connected in sequence. After the convolutional layers have learned the fused multi-dimensional features, these are passed to the fully connected layer, and a Softmax classifier identifies the HTTP/2 site.
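The two ends of step 5, feature concatenation and Softmax classification, can be sketched in pure Python as follows. The convolutional layers in between are omitted, so the `logits` passed to `predict_site` are placeholder values standing in for the output of the fully connected layer; all function names are illustrative assumptions.

```python
import math


def fuse(content_vec, resource_vec):
    """Concatenate the two feature vectors in a fixed order."""
    return list(content_vec) + list(resource_vec)


def softmax(logits):
    """Numerically stable Softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_site(logits, site_names):
    """Return the most probable site and its Softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return site_names[best], probs[best]
```

Because both input vectors have fixed lengths, the fused vector also has a fixed length, which is what allows it to be fed to a convolutional network with a fixed input layer.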

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An HTTP/2 page access flow identification method based on multi-feature fusion learning is characterized by comprising the following steps:

2. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein step 1 comprises: launching a browser with an automation script tool and simulating typical user interaction at the target site to generate traffic samples; and saving the homepage access traffic and the resource response traffic generated by the target site offline using a packet capture tool.

3. The method according to claim 2, wherein the traffic samples are generated by accessing the target HTTP/2 sites with two browsers at random time points, the number of sites being not less than 10.

4. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein step 2 comprises: filtering the traffic samples and discarding noise data generated during capture; and aggregating the traffic data by four-tuple to obtain complete TCP flows, the four-tuple comprising the source IP, source port, destination IP, and destination port.

5. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein step 3 comprises: extracting a directed packet length sequence and a packet inter-arrival time sequence from the homepage access traffic generated by the target site; and inputting the sequence information into the autoencoder network to capture the content distribution feature vector of the homepage access traffic.

6. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein step 4 comprises: extracting a directed packet length sequence and a packet inter-arrival time sequence from the resource response traffic generated by the target site; and inputting the sequence information into the recurrent neural network to identify the main resource type carried by the resource response traffic.

7. The HTTP/2 page access traffic identification method based on multi-feature fusion learning according to claim 6, wherein the main resource types include video, audio, text, and pictures.

8. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 5 or 6, wherein extracting the directed packet length sequence and the packet inter-arrival time sequence comprises: truncating the packet information sequences or padding them with zeros to obtain a directed packet length sequence and a packet inter-arrival time sequence of fixed length.

9. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 5 or 6, wherein the directed packet length sequence encodes the transmission direction: positive and negative values identify uplink and downlink packets, respectively.

10. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein step 5 comprises: concatenating the content distribution feature vector and the main resource feature vector extracted from the same HTTP/2 site, and inputting the concatenated vector into a convolutional neural network to obtain the HTTP/2 page traffic identification result.