CN116401479A - Website content behavior identification method and system based on encrypted traffic bidirectional burst sequence

Info

Publication number: CN116401479A
Application number: CN202310269520.1A
Authority: CN
Inventors: 鲁睿; 宋嘉莹; 时磊; 王炳旭; 段荣昌; 秦颖超; 王红兵; 夏耀华; 佟玲玲; 王东安; 马宏远
Original assignee: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Current assignee: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Priority date: 2022-11-02
Filing date: 2023-03-20
Publication date: 2023-07-07

Abstract

The invention relates to a website content behavior identification method and system based on a bidirectional burst sequence of encrypted traffic. The method comprises the following steps: acquiring behavior traffic data of an encrypted website; preprocessing the behavior traffic data into a bidirectional burst sequence; establishing a website content behavior recognition model and training it with the bidirectional burst sequence as input; and recognizing the website content behavior of the encrypted website using the trained model. The invention selects the bidirectional burst sequence as input, which better captures the differences between website content behaviors. A convolutional neural network is used to construct the traffic representation model, so that traffic representation and feature extraction are automatic and manual feature extraction and selection are avoided, which finally achieves accurate identification of encrypted website content behavior traffic.

Description

Website content behavior identification method and system based on encrypted traffic bidirectional burst sequence

Technical Field

The invention belongs to the field of network measurement and behavior analysis, and particularly relates to a website content behavior identification method based on an encrypted traffic bidirectional burst sequence.

Background

Website content behavior refers to the specific content involved in a user's behavior on a website. It comprises behavior that mainly browses text, behavior that mainly browses pictures and behavior that mainly browses videos, namely text behavior, picture behavior and video behavior. Website content behavior recognition infers, from the traffic generated by a user's content behavior on a website, what specific content that behavior involves.

In recent years, because privacy protection and secure data transmission have become vital, the HTTPS protocol has gradually replaced the original HTTP protocol. HTTPS prevents illegal monitoring of and tampering with data in transit and thereby guarantees the security of data transmission. More and more websites use HTTPS for encrypted transmission.

With the growing density of websites and the popularization of TLS 1.3, traditional encrypted website identification methods based on SNI or certificate matching fail, and more complex website fingerprints are needed to support website identification. Existing website identification methods use traffic information such as timing, packet direction and packet length, together with machine learning and deep learning algorithms that extract deep features, to identify websites. A website fingerprint is the set of features of the traffic that a user generates by sending and receiving data when accessing a website. It is unique to the website and can be used to identify websites and web pages and to analyze website behavior.

Encrypted websites are diverse: website types are rich and varied, including social, video and news websites. Encrypted website behaviors show both similarity and difference. The similarity is that different websites exhibit similar network behaviors; for example, picture behavior can be generated on different websites. In addition, the traffic generated by behaviors within the same website is similar to some extent because it shares the same sender and receiver. The difference is that the same website may generate different behaviors, such as text behavior and picture behavior, and different website behaviors differ in their traffic; for example, text behavior and picture behavior produce data packets of different lengths.

Encryption also brings security issues. Because traffic is encrypted, network supervision and detection face greater challenges. Encrypted website traffic provides a breeding ground for malicious network behaviors, and attackers use encrypted traffic as cover for malicious behavior that threatens network security. Malicious network behaviors such as phishing spread across the network. Analyzing website behaviors can effectively curb such malicious behaviors. Therefore, to safeguard the security of encrypted websites and of the network as a whole, and to discover malicious website behaviors, accurate identification of user behaviors in encrypted websites provides necessary information support for network supervision; it is also a prerequisite for detecting malicious network behavior and a basis for maintaining network security.

Existing research on network behavior in encrypted traffic focuses mainly on behavior recognition for encrypted applications, and its recognition methods can be divided into recognition of behaviors inside an application and recognition of application-behavior pairs. Research on in-application behavior recognition focuses on the internal behaviors of one or more encrypted applications and applies relatively simple classification methods, such as machine learning, to statistical features of the traffic, packet lengths and similar information. These methods rely on manual feature extraction and selection. In addition, the traffic statistics of behaviors within the same website are similar, so the accuracy of behavior identification within the same website is low. Research on application-behavior recognition is scarcer and focuses mainly on identifying specific behaviors of instant messaging applications.

The traffic generated by website content behavior reflects the composition of the content a user accesses on a website, which helps in monitoring harmful content on the website. The elements of a web page are closely related to traffic bursts (a burst meaning that a large amount of traffic is generated in a short time). Therefore, a method is needed that predicts website content behavior from the burst sequence of bidirectional traffic: once the bidirectional traffic is obtained, the burst sequence of the traffic is extracted and the corresponding website content behavior is identified.

Disclosure of Invention

The invention provides a website content behavior identification method and system based on an encrypted traffic bidirectional burst sequence. The invention extracts the bidirectional burst sequence of encrypted traffic for traffic analysis, and solves the problems of low recognition accuracy and high algorithm complexity that arise because the prior art cannot effectively capture the traffic characteristics corresponding to website content when recognizing website content behavior.

The technical scheme adopted by the invention is as follows:

a website content behavior identification method based on encrypted traffic bidirectional burst sequence comprises the following steps:

acquiring behavior flow data of an encrypted website;

preprocessing behavior traffic data into a bidirectional burst sequence;

establishing a website content behavior recognition model, and training the website content behavior recognition model by taking a bidirectional burst sequence as input;

and carrying out website content behavior recognition of the encrypted website by using the trained website content behavior recognition model.

Further, the behavior traffic data of the encrypted website is obtained by capturing traffic online, or previously collected offline data is used; the traffic data is stored in files with the pcap file extension.

Further, the obtaining the behavioral traffic data of the encrypted website includes:

reading a URL list of the target encrypted website and corresponding behavior operation of the corresponding website, and reading a target URL address from the URL list;

starting a WebDriver program, automatically opening a browser, and entering the URL address that was read;

reading a behavior from a behavior list corresponding to a website, calling a script of automatic simulation operation of the corresponding behavior, starting a tcpdump data packet capturing program, executing the behavior automation operation script, and simulating website behavior operation;

after the behavior operation is finished, the data packet capturing process is finished, next behavior operation is executed, when the behavior operation in the behavior list in one website is finished, the browser is closed, the next website in the website list is read, and the operation is repeated until the website list is read.

Further, the preprocessing the behavioral traffic data into a bi-directional burst sequence includes:

filtering irrelevant flow;

extracting flows from the data packets using a five-tuple-based network session packet segmentation method, and classifying the packets by five-tuple content, wherein data packets with the same five-tuple belong to the same unidirectional data flow in the uplink or downlink direction; storing the direction information of the data packets, and marking uplink flows with +1 and downlink flows with -1; discarding flows that are too short, for example because connection establishment failed, and finally obtaining a set of data flows that meets the requirements;

the uplink traffic and the downlink traffic are processed respectively, and uplink/downlink burst is defined as a unidirectional data packet sequence corresponding to each HTTP message, and bidirectional burst sequence is defined as a sequence of unidirectional burst lengths in all uplink/downlink links.

Further, the website content behavior recognition model comprises a basic module and a full connection module; the basic module comprises a one-dimensional convolutional neural network layer, a batch standardization layer and a maximum pooling layer; the fully connected module includes a fully connected layer.

Further, the output of the fully-connected module is input into a softmax classifier for classification, a cross entropy loss function is used for calculating loss between a predicted value and a real label, and the website content behavior recognition model is trained.

A bi-directional burst sequence based website content behavior recognition system, comprising:

the flow acquisition module is used for acquiring behavior flow data of the encrypted website;

the traffic preprocessing and bidirectional burst sequence extracting module is used for preprocessing behavior traffic data into a bidirectional burst sequence;

the model building module is used for building a website content behavior recognition model;

the training module is used for training the website content behavior recognition model by taking the bidirectional burst sequence as input;

and the evaluation index calculation module is used for identifying website content behaviors of the encrypted website by using the trained website content behavior identification model, calculating the overall accuracy, the recall rate of the appointed type and the precision of the appointed type, and carrying out accurate quantification.

Compared with the prior art, the invention has the beneficial effects that:

the bidirectional burst sequence is selected as input, compared with the original data packet length sequence, the interaction of HTTP message requests and responses is better reflected, and the burst sequence has large variation due to the difference of website elements, so that the difference between website content behaviors can be better captured;

a convolutional neural network (CNN) is adopted to construct the traffic representation model, so that traffic representation and feature extraction are automatic and manual feature extraction and selection are avoided, which finally achieves accurate identification of encrypted website content behavior traffic.

Drawings

Fig. 1 is a flow chart of encrypted website content behavior traffic data collection.

Fig. 2 is a schematic structural diagram of a website content behavior recognition model.

Fig. 3 is a schematic diagram of a module composition of a website content behavior recognition system based on an encrypted traffic bidirectional burst sequence according to an embodiment of the present invention.

Detailed Description

The present invention will be further described in detail with reference to the following examples and drawings, so that the above objects, features and advantages of the present invention can be more clearly understood.

The invention discloses a website content behavior identification method based on an encrypted traffic bidirectional burst sequence, which comprises the following steps:

S1: acquiring encrypted website behavior traffic data;

S2: preprocessing the traffic into a bidirectional burst sequence;

S3: establishing a website content behavior recognition model;

S4: identifying encrypted website content behavior traffic.

Each step is described in detail below.

The encrypted website behavior traffic data in step S1 can be obtained by capturing traffic online, or previously collected offline data can be used; the traffic data is stored in files with the pcap file extension. Offline data is collected by writing automated behavior operation scripts and capturing the resulting traffic with an auxiliary capture tool such as tcpdump.

The specific acquisition process is shown in fig. 1, and comprises the following steps:

S1-1, reading a URL list of the target encrypted websites and the behavior operations corresponding to each website, and reading a target URL address from the URL list. To eliminate the influence of the browser on website traffic, the cache and Cookie records in the browser are cleared before any operation, and the target website is accessed in the Incognito mode of the Chrome browser. Incognito mode (private browsing) lets the user browse web pages without leaving traces of the visited websites on the computer, including cached files, cookies, history and download records, which protects the user's privacy and security.

S1-2, starting a WebDriver program, automatically opening a browser, and entering the URL address read in S1-1.

S1-3, reading a behavior from the behavior list corresponding to the website opened in S1-2, and calling the script that automatically simulates the corresponding behavior. At the same time, the tcpdump packet capture program is started and the behavior automation script is executed to simulate the website behavior operation.

S1-4, after the behavior operation is finished, ending the packet capture process and then executing the next behavior operation. When all behavior operations in the behavior list of one website are finished, the browser is closed, the next website in the website list is read, and the above operations are repeated until the whole website list has been read.
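The loop in S1-1 to S1-4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the browser automation and the capture process are passed in as callables so that the control flow can be read (and exercised) without a real browser or tcpdump. The names `collect_traffic`, `tcpdump_command` and the pcap file naming scheme are assumptions made for this sketch; only the tcpdump options `-i` (interface) and `-w` (write to file) are standard.

```python
from typing import Callable, Dict, List, Tuple


def tcpdump_command(interface: str, pcap_path: str) -> List[str]:
    """Build a tcpdump invocation that writes captured packets to pcap_path."""
    return ["tcpdump", "-i", interface, "-w", pcap_path]


def collect_traffic(
    site_behaviors: Dict[str, List[str]],
    open_browser: Callable[[str], object],
    close_browser: Callable[[object], None],
    start_capture: Callable[[str], object],
    stop_capture: Callable[[object], None],
    run_behavior: Callable[[object, str], None],
) -> List[Tuple[str, str, str]]:
    """Visit each site, run each behavior, and capture one pcap per behavior."""
    records = []
    for site_index, (url, behaviors) in enumerate(site_behaviors.items()):
        browser = open_browser(url)  # S1-2: open the browser on the target URL
        try:
            for behavior in behaviors:  # S1-3: one capture per behavior
                # Hypothetical naming scheme for the capture files.
                pcap_path = f"site{site_index}_{behavior}.pcap"
                capture = start_capture(pcap_path)
                try:
                    run_behavior(browser, behavior)
                finally:
                    stop_capture(capture)  # S1-4: end capture after the behavior
                records.append((url, behavior, pcap_path))
        finally:
            close_browser(browser)  # S1-4: close the browser before the next site
    return records
```

In a real deployment `start_capture` would typically launch the command returned by `tcpdump_command` as a subprocess and `open_browser` would wrap a WebDriver session; both are left to the caller here.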

Step S2 preprocesses the captured traffic data.

S2-1: filtering irrelevant traffic. Data packets without an actual payload, such as acknowledgement packets, are filtered out, as are retransmission-related packets caused by network congestion, such as TCP Retransmission and Dup ACK packets.
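A minimal sketch of the S2-1 filter is given below, assuming packets have already been parsed into simple records. The `Packet` fields and the retransmission rule (a payload-carrying packet whose direction and TCP sequence number were already seen) are simplifying assumptions for illustration; the text does not specify how retransmissions are detected. Duplicate ACKs carry no payload, so the empty-payload rule already removes them.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Packet:
    # Hypothetical parsed-packet record used only for this sketch.
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    seq: int
    payload_len: int
    timestamp: float


def filter_irrelevant(packets: Iterable[Packet]) -> List[Packet]:
    """Drop packets without payload and packets that repeat a sequence number."""
    seen: Set[Tuple[str, str, int, int, str, int]] = set()
    kept: List[Packet] = []
    for pkt in packets:
        if pkt.payload_len == 0:
            continue  # pure ACK, Dup ACK, or other packet without actual load
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port,
               pkt.protocol, pkt.seq)
        if key in seen:
            continue  # same direction and sequence number: treat as retransmission
        seen.add(key)
        kept.append(pkt)
    return kept
```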

S2-2: aggregating by five-tuple. Flows are extracted from the data packets using a network session packet segmentation method based on the five-tuple (source IP address, destination IP address, source port, destination port, transport layer protocol type). The packets are classified by five-tuple content: data packets with the same five-tuple belong to the same unidirectional data flow in the uplink or downlink direction. The direction information of the data packets is stored, with uplink flows marked +1 and downlink flows marked -1. In addition, flows that are too short, for example because connection establishment failed, are discarded, and the set of data flows that meets the requirements is finally obtained.
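The five-tuple aggregation in S2-2 can be illustrated as follows. This is a sketch under stated assumptions: the `Packet` record is hypothetical, the uplink direction is decided by comparing the source address with a known client address (`client_ip`), and the minimum flow length `min_packets` is an illustrative threshold, since the text does not say how short is "too short".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    # Hypothetical parsed-packet record used only for this sketch.
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    timestamp: float
    payload_len: int


FiveTuple = Tuple[str, str, int, int, str]


def split_flows(
    packets: Iterable[Packet], client_ip: str, min_packets: int = 3
) -> Dict[FiveTuple, Tuple[int, List[Packet]]]:
    """Group packets into unidirectional flows keyed by five-tuple.

    Returns {five_tuple: (direction, packets)} with direction +1 for uplink
    (client to server) and -1 for downlink. Short flows are discarded.
    """
    flows: Dict[FiveTuple, List[Packet]] = defaultdict(list)
    for pkt in packets:
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.protocol)
        flows[key].append(pkt)

    result: Dict[FiveTuple, Tuple[int, List[Packet]]] = {}
    for key, pkts in flows.items():
        if len(pkts) < min_packets:
            continue  # e.g. a connection that failed to establish
        direction = 1 if key[0] == client_ip else -1
        result[key] = (direction, pkts)
    return result
```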

S2-3, extracting a bidirectional burst sequence. And respectively processing the uplink flow and the downlink flow obtained in the step S2-2. The upstream/downstream burst sequence is defined as a unidirectional packet sequence corresponding to each HTTP message. A bi-directional burst sequence is defined as a sequence consisting of an upstream burst length sequence and a downstream burst length sequence.

Since TCP is a byte-stream protocol, it can split messages from the layer above it (e.g., TLS) in any way for transmission. For larger record sizes, a TLS record that exceeds the MSS limit is transported in multiple TCP payloads, and only one of these TCP payloads contains the TLS header. The unidirectional burst length is the value of the TLS header length field. For smaller record sizes, the entire TLS record fits in a single TCP payload. Because such a TLS record is typically smaller than the MSS, TCP takes bytes from the next TLS record to fill the current payload up to the MSS. Thus, a TCP payload may contain no TLS header, one TLS header, or several TLS headers. The uplink/downlink burst length sequence is calculated as follows:

First, the TCP payloads in the unidirectional flow are traversed together with any TLS headers they contain, and the value of each TLS header length field is added to the current length. The length of the current TCP payload is then subtracted from the summed value to obtain the length of the TLS record that remains in subsequent TCP payloads. If the remainder equals 0, the current unidirectional burst ends and its length is appended to the unidirectional burst sequence. Repeating this yields the unidirectional burst sequence. The uplink flow and the downlink flow are each processed in this way, and the bursts are then reordered by timestamp to obtain the bidirectional burst sequence.
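One possible reading of this calculation is sketched below. It assumes that each TLS record starts with the standard 5-byte record header (1 byte content type, 2 bytes version, 2 bytes length), that a burst ends when a TCP payload finishes exactly on a TLS record boundary, and that the burst length is the sum of the TLS length fields seen since the previous burst ended. The function name `unidirectional_bursts` is hypothetical, and a TLS header split across two TCP payloads is not handled in this sketch.

```python
import struct
from typing import Iterable, List

TLS_HEADER_LEN = 5  # content type (1) + protocol version (2) + length (2)


def unidirectional_bursts(payloads: Iterable[bytes]) -> List[int]:
    """Compute burst lengths from the TCP payloads of one unidirectional flow."""
    bursts: List[int] = []
    burst_len = 0   # sum of TLS record lengths in the current burst
    remaining = 0   # bytes of the current TLS record still to be received
    for payload in payloads:
        offset = 0
        while offset < len(payload):
            if remaining == 0:
                if len(payload) - offset < TLS_HEADER_LEN:
                    break  # header split across payloads: not handled here
                # The length field occupies bytes 3..4 of the record header.
                (record_len,) = struct.unpack_from("!H", payload, offset + 3)
                burst_len += record_len
                remaining = record_len
                offset += TLS_HEADER_LEN
            consumed = min(remaining, len(payload) - offset)
            remaining -= consumed
            offset += consumed
        if remaining == 0 and burst_len > 0:
            # The payload ended on a record boundary: close the burst.
            bursts.append(burst_len)
            burst_len = 0
    return bursts
```

For example, a 3000-byte record carried in three TCP payloads yields a single burst of 3000, while two small records packed into one payload yield a single burst equal to the sum of their lengths.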

Step S3 establishes the website content behavior recognition model. The input of the model is the bidirectional burst sequence output by step S2. The structure of the model is shown in Fig. 2 and consists mainly of two parts, a base module (basic CNNs module) and a fully connected module; a deep network is implemented by repeating these parts, for example the two base modules in Fig. 2.

The first part is the base module. There are two base modules, each consisting mainly of a one-dimensional convolutional neural network layer (convolutional layer), a batch normalization layer and a max pooling layer.

One-dimensional convolutional neural network layer: the convolutional layer is mainly responsible for feature extraction. It consists of a group of filters that perform a convolution over the input and pass the result to the next layer. Each base module in the model applies 3 one-dimensional convolutions whose kernel sizes are set to 5, 3 and 3 and whose input channels are set to 1, 32 and 32 respectively. The stride of the convolution kernels is 1, padding is set to SAME mode, the biases are initialized to 0, and the ReLU activation function is used for nonlinear processing, which avoids the gradient explosion and gradient vanishing problems during training.

Batch normalization layer: also known as a batch standardization layer. The purpose of batch normalization is to overcome the training difficulty caused by increasing the number of layers of the neural network. In a neural network, the input of each layer no longer follows the distribution of the original input data after being transformed by the preceding layers, and the shifts introduced by earlier layers are accumulated and amplified by later layers. Batch normalization corrects the training samples in time by normalizing the input of each layer so that its mean and variance are fixed.

Max pooling layer: the convolution is followed by a pooling operation. Pooling is essentially down-sampling: it reduces the dimension of the input feature map, which cuts the number of parameters and the amount of computation while retaining the main features, helps prevent overfitting and speeds up computation. Max pooling is selected, with the pooling size set to 2, the stride set to 2 and padding set to SAME.
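The effect of these settings on the sequence length can be checked with a short calculation. The sketch below assumes the common convention that SAME padding gives an output length of ceil(input length / stride); it also assumes that every convolution outputs 32 channels, which the text implies for the first two convolutions (their successors take 32 input channels) but does not state for the last one. The input length of 100 in the usage note is purely illustrative.

```python
import math


def same_padding_length(length: int, stride: int) -> int:
    """Output length of a 1-D conv or pool layer with SAME padding."""
    return math.ceil(length / stride)


def flattened_features(
    input_length: int,
    num_modules: int = 2,
    convs_per_module: int = 3,
    channels: int = 32,  # assumed number of output channels per convolution
) -> int:
    """Number of values fed to the first fully connected layer."""
    length = input_length
    for _ in range(num_modules):
        for _ in range(convs_per_module):
            # Stride-1 convolution with SAME padding keeps the length.
            length = same_padding_length(length, stride=1)
        # Max pooling with size 2, stride 2, SAME padding halves it (rounded up).
        length = same_padding_length(length, stride=2)
    return channels * length
```

With a burst sequence of length 100, the two base modules reduce the length to 50 and then 25, so `flattened_features(100)` gives 32 × 25 = 800 inputs to the fully connected module under these assumptions.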

The second part is the fully connected module. After the two identical base modules have performed their convolution operations on the input, the result enters two fully connected layers. The first fully connected layer has 512 hidden nodes, and the size of the second fully connected layer is adjusted according to the number of sample categories. After the first fully connected layer, batch normalization is applied so that the input of each layer keeps the same distribution. A ReLU activation function then performs nonlinear processing, and the Dropout algorithm randomly drops some hidden neurons in the network, which provides a degree of regularization. The Dropout ratio is set to 0.5.

The output of the fully connected module is input to a softmax classifier for classification. And calculating the loss between the predicted value and the real label by using the cross entropy loss function, so as to train the website content behavior recognition model.
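For reference, the softmax classifier and cross-entropy loss mentioned here can be written in a few lines of plain Python. This is a generic sketch of the standard formulas, not code from the patent; the function names are chosen for this example, and a deep learning framework would normally provide both operations.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Loss between the predicted distribution and the true class index."""
    return -math.log(softmax(logits)[label])
```

The loss is small when the score of the true class dominates and grows as probability mass moves to the wrong classes, which is what drives training of the recognition model.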

Step S4 is encrypted website content behavior identification. The website content behavior recognition model constructed and trained in step S3 is used to recognize encrypted website content behavior on the test set.

In one embodiment of the present invention, a method for identifying website content behavior based on encrypted traffic bidirectional burst sequence is provided, the method comprising the following steps:

and collecting the traffic according to the step S1 by taking the traffic generated by the encrypted website content behavior as an object.

The traffic is preprocessed according to step S2: irrelevant traffic is filtered out, and flows are extracted from the filtered traffic according to the five-tuple. Next, the traffic burst sequences in the uplink and the downlink are extracted separately, and the extracted unidirectional burst sequences are ordered by timestamp to obtain the bidirectional burst sequence. In addition, the extracted bidirectional burst sequences are divided into a training set, a validation set and a test set according to a cross-validation method.
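The timestamp-based merge of the two unidirectional burst sequences can be sketched as follows. The representation is an assumption made for illustration: each burst is a `(timestamp, length)` pair, and the merged sequence keeps the direction by sign, reusing the +1/-1 convention from S2-2 (positive for uplink, negative for downlink). The text does not state how direction is encoded in the final sequence.

```python
from typing import List, Sequence, Tuple

Burst = Tuple[float, int]  # (timestamp of the burst, burst length in bytes)


def merge_bursts(uplink: Sequence[Burst], downlink: Sequence[Burst]) -> List[int]:
    """Interleave uplink and downlink bursts in time order.

    Uplink lengths are kept positive and downlink lengths are negated
    (an assumed encoding of direction).
    """
    tagged = [(t, +n) for t, n in uplink] + [(t, -n) for t, n in downlink]
    # sorted() is stable, so an uplink burst precedes a downlink burst
    # that carries the same timestamp.
    return [n for _, n in sorted(tagged, key=lambda item: item[0])]
```

A request followed by two response bursts and a second request, for instance, becomes a sequence such as `[500, -3000, -1500, 200]`.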

The encrypted website content behavior recognition model is constructed according to step S3. The convolutional network is built using a deep learning framework (e.g., PyTorch). The training set generated in step S2 is input directly into the recognition model for training.

According to step S4, the model built in step S3 is tested on the test set generated in step S2.
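The evaluation indexes named for the system (overall accuracy, and recall and precision of a specified category) follow their standard definitions and can be computed on the test set as sketched below. The function names and the string labels in the usage note are illustrative only.

```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of test samples whose predicted behavior matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) if y_true else 0.0


def precision_recall(
    y_true: Sequence[str], y_pred: Sequence[str], label: str
) -> Tuple[float, float]:
    """Precision and recall of one specified behavior category."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

For true labels `["text", "text", "picture", "video", "video"]` and predictions `["text", "picture", "picture", "video", "text"]`, the overall accuracy is 0.6, and the "picture" category has precision 0.5 and recall 1.0.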

Another embodiment of the present invention provides a system for identifying website content behavior based on a bidirectional burst sequence, as shown in fig. 3, including:

the traffic acquisition module is used for capturing network traffic online or reading it from offline data;

the flow preprocessing and two-way burst sequence extracting module is used for carrying out no-load flow filtering, quintuple aggregation and two-way burst sequence extraction on the obtained original flow;

the model building module is used for building a behavior recognition model by using a website content behavior recognition method based on a bidirectional burst sequence;

the training module is used for training the website behavior recognition model;

Through the technical scheme, the invention provides an effective method and system for identifying the content behavior in the encrypted website.

Another embodiment of the invention provides a computer device (computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.

Another embodiment of the invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, performs the steps of the method of the invention.

The above-disclosed embodiments of the present invention are intended to aid in understanding the contents of the present invention and to enable the same to be carried into practice, and it will be understood by those of ordinary skill in the art that various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention. The invention should not be limited to what has been disclosed in the examples of the specification, but rather by the scope of the invention as defined in the claims.

Claims

1. A website content behavior identification method based on an encrypted traffic bidirectional burst sequence is characterized by comprising the following steps:

acquiring behavior flow data of an encrypted website;

preprocessing behavior traffic data into a bidirectional burst sequence;

2. The method of claim 1, wherein the behavior traffic data of the encrypted website is obtained by capturing traffic online or by using collected offline data, and the traffic data is saved in files with the pcap file extension.

3. The method of claim 1, wherein the obtaining behavioral traffic data of the encrypted website comprises:

4. The method of claim 1, wherein the preprocessing of behavioral traffic data into a bi-directional burst sequence comprises:

filtering irrelevant flow;

5. The method of claim 1, wherein the website content behavior recognition model comprises a base module and a fully connected module; the basic module comprises a one-dimensional convolutional neural network layer, a batch standardization layer and a maximum pooling layer; the fully connected module includes a fully connected layer.

6. The method of claim 5, wherein the one-dimensional convolutional neural network layer comprises 3 one-dimensional convolutions with convolution kernel sizes set to 5, 3 and 3, input channels set to 1, 32 and 32, the stride of the convolution kernels set to 1, padding set to SAME mode and biases initialized to 0, and uses a ReLU activation function for nonlinear processing to avoid gradient explosion and gradient vanishing problems during training; the batch normalization layer is used for correcting training samples in time and normalizing the input of each layer to fix its mean and variance; the pooling size of the max pooling layer is set to 2, the stride is set to 2, and padding is set to SAME.

7. The method of claim 5, wherein the website content behavior recognition model is trained by inputting the output of the fully connected module to a softmax classifier for classification, calculating losses between predicted values and real tags using a cross entropy loss function.

8. A system for identifying website content behavior based on a bi-directional burst sequence, comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.

10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-7.