CN114205151A - HTTP/2 page access flow identification method based on multi-feature fusion learning - Google Patents


Info

Publication number
CN114205151A
Authority
CN
China
Prior art keywords
flow
http
data packet
resource
feature fusion
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111513183.3A
Other languages
Chinese (zh)
Inventor
权迎雪
刘伟伟
朱伟佳
陈浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-12-12
Filing date
2021-12-12
Publication date
2022-03-18
Application filed by Nanjing University of Science and Technology
Priority to CN202111513183.3A
Publication of CN114205151A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Abstract

The invention discloses an HTTP/2 page access traffic identification method based on multi-feature fusion learning. The method first collects the homepage access traffic and the resource response traffic generated by a target HTTP/2 site during typical user interaction, then preprocesses the traffic data to obtain complete TCP flows. An autoencoder (self-coding network) captures the content distribution features of the homepage access traffic, while a recurrent neural network identifies the main resource category carried by the resource response traffic. The content distribution features and the main resource category features are then concatenated and input into a convolutional neural network model to obtain the site page identification result. By using multiple data streams as the basic unit of fingerprint extraction, extracting features from the different types of streams with deep learning, and combining those features to characterize the target site, the invention improves the identification accuracy of HTTP/2 page access traffic.

Description

HTTP/2 page access flow identification method based on multi-feature fusion learning
Technical Field
The invention belongs to the field of network security, and particularly relates to an HTTP/2 page access flow identification method based on multi-feature fusion learning.
Background
With the continuous development of internet technology, more and more websites encrypt their network traffic. This protects user privacy to some extent but also poses challenges for traffic supervision. HTTP is one of the most widely used application-layer protocols on the internet; it serves as the bridge between server and client and is the basis of current Web applications. Many researchers have studied traffic analysis methods for encrypted HTTP/1.1. However, with the rapid development of Web services, the drawbacks of HTTP/1.1, such as network latency and wasted bandwidth, have become increasingly apparent, and the HTTP/2 protocol was created to address them.
HTTP/2 is the successor to HTTP/1.1. It remains fully compatible with HTTP/1.1 semantics while adding new features such as binary framing, multiplexing, and server push. Multiplexing allows a single connection to carry multiple requests and responses, which reduces the connection load on the server and greatly improves transmission efficiency. However, the mechanism also reorganizes how web page objects are transmitted: while reducing network latency, it blurs the boundaries between web page objects, which poses new challenges for existing page traffic identification and analysis methods.
Disclosure of Invention
The invention aims to provide an HTTP/2 page access flow identification method based on multi-feature fusion learning.
The technical solution for achieving this aim is an HTTP/2 page access traffic identification method based on multi-feature fusion learning, comprising the following steps:
step 1, collecting the homepage access traffic and the resource response traffic generated by a target HTTP/2 site during typical user interaction;
step 2, preprocessing the traffic data to obtain complete TCP flows;
step 3, capturing the content distribution features of the homepage access traffic with an autoencoder (self-coding network);
step 4, identifying the main resource category carried by the resource response traffic with a recurrent neural network;
step 5, concatenating the content distribution feature vector and the main resource feature vector and inputting the result into a convolutional neural network to obtain the HTTP/2 page traffic identification result.
Further, in step 1, an automation script tool launches a browser and simulates typical user interaction on the target site to obtain traffic samples. The traffic samples are generated by accessing the target HTTP/2 sites with two browsers at random time points, and the number of HTTP/2 sites is not less than 10. A packet capture tool stores the homepage access traffic and the resource response traffic generated by the target site offline.
Further, in step 2, the traffic samples are filtered and noise data generated during capture is discarded; the traffic data is then grouped by four-tuple to obtain complete TCP flows. The four-tuple consists of the source IP, source port, destination IP, and destination port.
Further, in step 3, a packet directed length sequence and a packet inter-arrival time sequence are extracted from the homepage access traffic generated by the target site; the sequence information is input into the autoencoder to capture the content distribution feature vector of the homepage access traffic.
Further, in step 4, a packet directed length sequence and a packet inter-arrival time sequence are extracted from the resource response traffic generated by the target site; the sequence information is input into the recurrent neural network to identify the main resource type carried by the resource response traffic, including but not limited to video, audio, text, and pictures.
Further, when the packet directed length sequence and the packet inter-arrival time sequence are extracted, they are truncated or padded with zeros to obtain sequences of fixed length; in the packet directed length sequence, positive and negative values distinguish uplink packets from downlink packets.
Further, in step 5, the content distribution feature vector and the main resource feature vector extracted from the same HTTP/2 site are concatenated, and the concatenated vector is input into a convolutional neural network to obtain the HTTP/2 page traffic identification result.
Compared with the prior art, the invention has the following beneficial effects: multiple TCP flows generated by an HTTP/2 site are used as the basic unit of fingerprint extraction, which improves the robustness of the web page fingerprint; in addition, a dedicated deep learning network is designed for each kind of TCP flow behavior to model its latent information patterns, so that the overall characteristics of the target site are represented more accurately and the identification accuracy of HTTP/2 page access traffic is further improved.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention.
FIG. 2 is a flow chart of automatic page access and traffic collection.
FIG. 3 is a frame diagram of page traffic recognition based on multi-feature fusion learning.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment provides an HTTP/2 page access traffic identification method based on multi-feature fusion learning which, as shown in FIG. 1, includes the following steps:
Step 1: collect the homepage access traffic and the resource response traffic generated by the target HTTP/2 sites during typical user interaction. Referring to FIG. 2, this includes:
The open-source tool Selenium automatically drives a browser to perform web page access and simulates the interaction between a real user and the browser to generate the target traffic. In this embodiment, the number of target HTTP/2 sites is 10, traffic sample collection lasts 5 days, and two browsers, Chrome and Edge, are used to perform 40 accesses at random time points each day.
Page traffic is captured with Wireshark and stored as PCAP files. The traffic is decrypted with a preset TLS session key so that the homepage access traffic and the resource response traffic can be separated and the main resource type carried by each resource response flow can be labeled.
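The collection schedule of this embodiment can be sketched as follows. This is only a planning sketch: it does not drive Selenium or Wireshark. The site URLs, the `Visit` record, the `build_schedule` function, and the reading that each browser visits each site 40 times per day are assumptions made for illustration, since the text does not state how the 40 accesses are divided.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Visit:
    day: int            # 0-based collection day
    second_of_day: int  # random time point within that day
    browser: str
    site: str


def build_schedule(sites, browsers=("Chrome", "Edge"), days=5,
                   visits_per_day=40, seed=0):
    """Plan page visits at random time points (assumed: per site and browser)."""
    rng = random.Random(seed)  # fixed seed so the plan is reproducible
    schedule = []
    for day in range(days):
        for site in sites:
            for browser in browsers:
                for _ in range(visits_per_day):
                    schedule.append(
                        Visit(day, rng.randrange(86_400), browser, site))
    # Order the visits chronologically so a driver script can replay them.
    schedule.sort(key=lambda v: (v.day, v.second_of_day))
    return schedule


# Hypothetical placeholder URLs for the 10 target sites.
sites = [f"https://site{i}.example" for i in range(10)]
plan = build_schedule(sites)
```

A driver script would walk through `plan` in order, launching the named browser at each time point while the packet capture runs.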
Step 2: preprocess the traffic data to obtain complete TCP flows. The PCAP files are split into flows by four-tuple using the open-source Scapy library; the traffic samples are filtered, and noise data generated during capture, including but not limited to control packets, retransmitted packets, and incomplete sessions, is discarded to obtain relatively clean TCP flows.
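The grouping and filtering in step 2 can be illustrated with a minimal sketch. It does not use Scapy; packets are represented by a hypothetical `Packet` record, and the noise rules (an empty payload marks a control packet, a repeated sender and sequence number marks a retransmission) are simplifying assumptions. Removal of incomplete sessions is not shown.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Simplified stand-in for a parsed TCP segment (hypothetical fields)."""
    ts: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    seq: int
    payload_len: int


def flow_key(p):
    """Direction-independent four-tuple key, so both directions share one flow."""
    a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
    return (a, b) if a <= b else (b, a)


def assemble_flows(packets):
    """Group packets into TCP flows and drop simple noise."""
    flows = defaultdict(list)
    seen = set()
    for p in sorted(packets, key=lambda q: q.ts):
        if p.payload_len == 0:
            continue  # empty segment (e.g. bare SYN/ACK/FIN): treated as control
        ident = (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.seq)
        if ident in seen:
            continue  # same sender and sequence number: assumed retransmission
        seen.add(ident)
        flows[flow_key(p)].append(p)
    return dict(flows)


packets = [
    Packet(0.00, "10.0.0.2", 50000, "93.184.216.34", 443, 1, 0),     # control
    Packet(0.01, "10.0.0.2", 50000, "93.184.216.34", 443, 1, 517),
    Packet(0.05, "93.184.216.34", 443, "10.0.0.2", 50000, 1, 1400),
    Packet(0.06, "93.184.216.34", 443, "10.0.0.2", 50000, 1, 1400),  # duplicate
    Packet(0.07, "10.0.0.2", 50001, "93.184.216.34", 443, 1, 300),
]
flows = assemble_flows(packets)
```

Using a direction-independent key keeps uplink and downlink packets of one connection in the same flow, which the directed length sequence of step 3 relies on.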
Step 3: capture the content distribution features of the homepage access traffic with an autoencoder (self-coding network), including:
A packet directed length sequence and a packet inter-arrival time sequence are extracted from the homepage access traffic generated by the target site and are truncated or padded with zeros to obtain fixed-length sequence information. In the packet directed length sequence, positive and negative values distinguish uplink packets from downlink packets.
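The sequence extraction described above can be sketched in a few lines. The packet representation `(timestamp, source IP, length)`, the function names, and the convention that uplink (client to server) packets are positive are assumptions; the text only says that the sign encodes direction.

```python
def directed_lengths(packets, client_ip):
    """Signed packet lengths: positive for uplink, negative for downlink (assumed)."""
    return [length if src == client_ip else -length
            for _, src, length in packets]


def inter_arrival_times(packets):
    """Time gap between consecutive packets; the first packet gets 0.0."""
    ts = [t for t, _, _ in packets]
    if not ts:
        return []
    return [0.0] + [b - a for a, b in zip(ts, ts[1:])]


def fix_length(seq, n):
    """Truncate to n items, or pad the tail with zeros up to n items."""
    return list(seq[:n]) + [0] * max(0, n - len(seq))


# Each packet: (timestamp in seconds, source IP, length in bytes).
flow = [(0.0, "10.0.0.2", 100), (0.5, "93.184.216.34", 1400), (0.75, "10.0.0.2", 60)]
lengths = fix_length(directed_lengths(flow, client_ip="10.0.0.2"), 5)
gaps = fix_length(inter_arrival_times(flow), 5)
```

Here `lengths` is `[100, -1400, 60, 0, 0]`: three real packets followed by two zero-padded positions.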
In this embodiment, the autoencoder comprises an input layer, an encoding network, and a decoding network connected in sequence. The encoding network consists of a convolutional layer and a max-pooling layer: the convolutional layer extracts features from the input data, and the max-pooling layer compresses them to reduce the feature dimension. The decoding network consists of a convolutional layer and a deconvolution layer: the convolutional layer decodes the features extracted by the encoding network, and the deconvolution layer expands the feature dimension to reconstruct the input signal.
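To make the dimension changes concrete, the following sketch traces one sequence through a single-channel encoder and decoder. It is not a trained model: the kernel weights are fixed and hypothetical, there is no activation function or training loop, and nearest-neighbour upsampling stands in for the deconvolution layer.

```python
def conv1d_same(x, kernel):
    """1-D convolution with zero padding so the output keeps the input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(padded[i + j] * w for j, w in enumerate(kernel))
            for i in range(len(x))]


def max_pool1d(x, size=2):
    """Keep the maximum of each non-overlapping window (dimension reduction)."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def upsample1d(x, factor=2):
    """Repeat each value; a simple stand-in for the deconvolution layer."""
    return [v for v in x for _ in range(factor)]


KERNEL = [0.25, 0.5, 0.25]  # hypothetical fixed weights, not learned


def encode(x):
    return max_pool1d(conv1d_same(x, KERNEL))


def decode(code):
    return upsample1d(conv1d_same(code, KERNEL))


signal = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
code = encode(signal)            # compressed feature vector, half the length
reconstruction = decode(code)    # expanded back to the input length
```

In a real autoencoder the kernels would be learned by minimizing the reconstruction error, and the output of `encode` would serve as the content distribution feature vector.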
Step 4: identify the main resource category carried by the resource response traffic with a recurrent neural network, including:
The packet directed length sequence and the packet inter-arrival time sequence are extracted from the resource response traffic generated by the target site and processed in the same way as in step 3, which is not repeated here.
In this embodiment, the recurrent neural network comprises an input layer, a two-layer GRU, and a fully connected layer connected in sequence. The GRU extracts sequence correlation features, and the fully connected layer identifies the main resource type carried by the resource response traffic, yielding the resource type label vector:
$C = (c_1, c_2, \ldots, c_i, \ldots, c_n)$
where $c_i \in [0, 1]$ indicates the relevance of the $i$-th resource type, and the resource types include but are not limited to video, audio, text, and pictures.
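One way to obtain a label vector whose entries lie in [0, 1] is to pass the fully connected layer's outputs through a sigmoid. The sketch below shows only that final mapping; the GRU itself is omitted and its final hidden state is given as input. The sigmoid, the weights, and the hidden state values are assumptions, since the text does not say how the relevance scores are normalized.

```python
import math

RESOURCE_TYPES = ("video", "audio", "text", "picture")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def label_vector(hidden, weights, bias):
    """Fully connected layer followed by a sigmoid: one score in [0, 1] per type."""
    return [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(weights, bias)]


# Hypothetical final GRU hidden state and untrained layer parameters.
hidden = [1.0, -1.0]
weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0, 0.0]

C = label_vector(hidden, weights, bias)
main_type = RESOURCE_TYPES[C.index(max(C))]
```

With these illustrative numbers the largest entry of `C` belongs to `video`, so that would be reported as the main resource type.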
Step 5: concatenate the content distribution feature vector and the main resource feature vector and input the result into a convolutional neural network to obtain the HTTP/2 page traffic identification result; this includes concatenating the feature vectors in sequence into a vector of fixed length. See FIG. 3. In this embodiment, the convolutional neural network comprises an input layer, several convolutional layers, a max-pooling layer, and a fully connected layer connected in sequence. After the convolutional layers learn the fused multi-dimensional features, the result is passed to the fully connected layer, and a Softmax classifier identifies the HTTP/2 site.
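The fusion and the final classification can be sketched as follows. The convolutional and fully connected layers are omitted; the per-site scores fed to Softmax, the site names, and the zero padding used to reach the fixed length are assumptions made for illustration.

```python
import math


def fuse(content_vec, label_vec, fixed_len):
    """Concatenate the two feature vectors, then truncate or zero-pad."""
    fused = list(content_vec) + list(label_vec)
    return fused[:fixed_len] + [0.0] * max(0, fixed_len - len(fused))


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


content_vec = [2.0, 3.75]   # e.g. output of the autoencoder's encoder
label_vec = [0.9, 0.1]      # e.g. resource type label vector C
fused = fuse(content_vec, label_vec, fixed_len=6)

# Hypothetical scores that the CNN's fully connected layer might emit.
site_names = ["site-a", "site-b", "site-c"]
probs = softmax([1.0, 3.0, 0.5])
predicted = site_names[probs.index(max(probs))]
```

Fixing the fused length lets every site sample enter the convolutional network with the same input shape, regardless of how many resource types were scored.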
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A HTTP/2 page access flow identification method based on multi-feature fusion learning is characterized by comprising the following steps:
step 1, acquiring a homepage access flow and a resource response flow generated by a target HTTP/2 site in a typical user interaction process;
step 2, carrying out preprocessing operation on the flow data to obtain a complete TCP flow;
step 3, capturing the content distribution rule characteristics of homepage access flow by using a self-coding network;
step 4, identifying the main body resource category of the resource response flow transmission by using a recurrent neural network;
step 5, fusing and splicing the content distribution characteristic vector and the main resource characteristic vector, and inputting the fused and spliced vector into a convolutional neural network to obtain an HTTP/2 page traffic identification result.
2. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein the step 1 comprises: starting a browser by using an automatic script tool, and simulating typical user interaction at a target site to generate a flow sample; and storing the homepage access flow and the resource response flow generated by the target site off line by using a data packet capturing tool.
3. The method according to claim 2, wherein the traffic sample is generated by accessing the target HTTP/2 sites at random time points using two browsers, and the number of the sites is not less than 10.
4. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein the step 2 comprises: filtering the flow sample, and discarding noise data generated in the capturing process; integrating the flow data by taking the quadruple as a unit to obtain a complete TCP flow; the quadruplet comprises a source IP, a source port, a destination IP and a destination port.
5. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein the step 3 comprises: extracting a data packet directed length sequence and a data packet interval time sequence from the homepage access traffic generated by a target site; and inputting the sequence information into the self-coding network to capture the content distribution feature vector of the homepage access traffic.
6. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein the step 4 comprises: extracting a data packet directed length sequence and a data packet interval time sequence from the resource response traffic generated by a target site; and inputting the sequence information into the recurrent neural network to identify the main resource type carried by the resource response traffic.
7. The HTTP/2 page access traffic identification method based on multi-feature fusion learning according to claim 6, wherein the subject resource types include video, audio, text, and pictures.
8. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 5 or 6, wherein the extracting the packet directed length sequence and the packet interval time sequence comprises: truncating the data packet information sequence or padding it with the number 0 to obtain a data packet directed length sequence and a data packet interval time sequence with fixed lengths.
9. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 5 or 6, wherein the data packet directed length sequence includes a transmission direction: a positive value and a negative value are used to respectively identify uplink data packets and downlink data packets.
10. The method for identifying HTTP/2 page access traffic based on multi-feature fusion learning according to claim 1, wherein the step 5 comprises: performing fusion splicing on the content distribution characteristic vector and the main resource characteristic vector extracted from the same HTTP/2 site, and inputting the spliced vector into a convolutional neural network to obtain an HTTP/2 page traffic identification result.
CN202111513183.3A 2021-12-12 2021-12-12 HTTP/2 page access flow identification method based on multi-feature fusion learning Pending CN114205151A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111513183.3A CN114205151A (en) 2021-12-12 2021-12-12 HTTP/2 page access flow identification method based on multi-feature fusion learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111513183.3A CN114205151A (en) 2021-12-12 2021-12-12 HTTP/2 page access flow identification method based on multi-feature fusion learning

Publications (1)

Publication Number Publication Date
CN114205151A true CN114205151A (en) 2022-03-18

Family

ID=80652731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111513183.3A Pending CN114205151A (en) 2021-12-12 2021-12-12 HTTP/2 page access flow identification method based on multi-feature fusion learning

Country Status (1)

Country Link
CN (1) CN114205151A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115150297A (en) * 2022-08-15 2022-10-04 北京百润洪科技有限公司 Data filtering and content evaluation method and system based on mobile internet
CN115150297B (en) * 2022-08-15 2023-05-19 雁展科技(深圳)有限公司 Data filtering and content evaluating method and system based on mobile internet

Similar Documents

Publication Publication Date Title
CN112163594B (en) Network encryption traffic identification method and device
CN111144470B (en) Unknown network flow identification method and system based on deep self-encoder
CN109600317B (en) Method and device for automatically identifying traffic and extracting application rules
CN112104570B (en) Traffic classification method, traffic classification device, computer equipment and storage medium
CN111611280A (en) Encrypted traffic identification method based on CNN and SAE
CN113935426A (en) Method and device for detecting abnormal data traffic of power internet of things
CN111882367A (en) Method for monitoring and tracking online advertisements through user internet behavior analysis
CN112019500B (en) Encrypted traffic identification method based on deep learning and electronic device
CN112491894A (en) Internet of things network attack flow monitoring system based on space-time feature learning
CN114629718A (en) Hidden malicious behavior detection method based on multi-model fusion
CN112887291A (en) I2P traffic identification method and system based on deep learning
CN114422211B (en) HTTP malicious traffic detection method and device based on graph attention network
CN114205151A (en) HTTP/2 page access flow identification method based on multi-feature fusion learning
CN110365659B (en) Construction method of network intrusion detection data set in small sample scene
CN117130870B (en) Transparent request tracking and sampling method and device for Java architecture micro-service system
CN113938290A (en) Website de-anonymization method and system for user side traffic data analysis
CN116828087B (en) Information security system based on block chain connection
Zhou et al. Encrypted network traffic identification based on 2d-cnn model
CN117318980A (en) Small sample scene-oriented self-supervision learning malicious traffic detection method
CN116401479A (en) Website content behavior identification method and system based on encrypted traffic bidirectional burst sequence
CN116232696A (en) Encryption traffic classification method based on deep neural network
CN111835720B (en) VPN flow WEB fingerprint identification method based on feature enhancement
CN113794687A (en) Malicious encrypted flow detection method and device based on deep learning
CN114553579A (en) Novel malicious flow detection method based on image
CN114970680A (en) CNN + LSTM-based flow terminal real-time identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination