CN116401479A - Website content behavior identification method and system based on encrypted traffic bidirectional burst sequence - Google Patents

Publication number
CN116401479A
Authority
CN
China
Prior art keywords
website
behavior
encrypted
website content
flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310269520.1A
Other languages
Chinese (zh)
Inventor
鲁睿
宋嘉莹
时磊
王炳旭
段荣昌
秦颖超
王红兵
夏耀华
佟玲玲
王东安
马宏远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, National Computer Network and Information Security Management Center
Publication of CN116401479A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to a website content behavior identification method and system based on an encrypted traffic bidirectional burst sequence. The method comprises the following steps: acquiring behavior traffic data of an encrypted website; preprocessing the behavior traffic data into a bidirectional burst sequence; establishing a website content behavior recognition model and training it with the bidirectional burst sequence as input; and recognizing the website content behavior of the encrypted website with the trained model. The invention selects the bidirectional burst sequence as input, which better captures the differences between website content behaviors. A convolutional neural network is used to construct the traffic representation model, so that traffic representation and feature extraction are automatic and manual feature extraction and selection are avoided, with the aim of accurately identifying the content behavior traffic of encrypted websites.

Description

Website content behavior identification method and system based on encrypted traffic bidirectional burst sequence
Technical Field
The invention belongs to the field of network measurement and behavior analysis, and particularly relates to a website content behavior identification method based on an encrypted traffic bidirectional burst sequence.
Background
Website content behavior refers to the specific content involved in a user's behavior on a website. It includes behavior that mainly consists of browsing text, behavior that mainly involves pictures, and behavior that mainly involves videos, referred to as text behavior, picture behavior, and video behavior respectively. Website content behavior recognition infers, from the traffic generated by a user's behavior on a website, the specific content that the behavior involves.
In recent years, because privacy protection and secure data transmission are vital, the HTTPS protocol has gradually replaced the original HTTP protocol, preventing illegal monitoring of and tampering with data in transit and ensuring transmission security. More and more websites use the HTTPS protocol for encrypted transmission.
With the densification of websites and the spread of TLS 1.3, traditional encrypted website identification methods based on SNI or certificate matching fail, and more complex website fingerprints need to be built to support website identification. Existing website identification methods use information such as the timing, packet direction, and packet length of traffic, together with machine learning and deep learning algorithms that extract deep features, to identify websites. A website fingerprint refers to the characteristics of the traffic that a user generates when sending and receiving data while accessing a website. Such fingerprints are unique, and can be used to identify websites and web pages and to analyze website behavior.
Encrypted websites are diverse. They cover many types, including social, video, and news websites. Encrypted website behaviors show both similarity and difference. Similarity means that different websites have similar network behaviors; for example, picture behavior can be generated on different websites. In addition, the traffic generated by behaviors within the same website is similar to some degree, because it has the same receiver and sender. Difference means that the same website may generate different behaviors, such as text behavior and picture behavior, and that different website behaviors produce traffic that differs to some degree; for example, text behavior and picture behavior generate data packets of different lengths.
Encryption is accompanied by some security issues. Network supervision and detection face greater challenges because traffic is encrypted. Encrypted website traffic provides a breeding ground for malicious network behavior, and attackers use encrypted traffic as cover for malicious behavior that threatens network security. Malicious network behaviors such as phishing spread on the network. Analyzing website behavior can effectively prevent such malicious behaviors from spreading. Therefore, to ensure the security of encrypted websites and of the whole network, and to discover malicious website behaviors, accurate identification of user behavior in encrypted websites provides necessary information support for network supervision, and is both a premise of malicious network behavior detection and a basis for maintaining network security.
Existing research on network behavior in encrypted traffic focuses mainly on behavior recognition for encrypted applications. Its recognition methods can be divided into recognition of behaviors inside an application and application-behavior recognition. Research on in-application behavior recognition focuses on the internal behaviors of one or more encrypted applications, and applies relatively simple classification methods such as machine learning to statistical features of the traffic, packet lengths, and similar information. Such methods rely on manual feature extraction and selection. In addition, the traffic statistics of the same website are similar, so the accuracy of behavior identification within the same website is low. There is less research on application-behavior recognition, and it focuses mainly on recognizing specific behaviors of instant messaging applications.
The traffic generated by website content behavior reflects the composition of the content a user accesses on a website, which helps in monitoring bad content in the website. The elements of a web page are closely related to bursts of traffic (a burst means that a large amount of traffic is generated in a short time). Therefore, a method is needed that predicts website content behavior from the burst sequence of bidirectional traffic: once the bidirectional traffic is obtained, its burst sequence is extracted and the corresponding website content behavior is identified.
Disclosure of Invention
The invention provides a website content behavior identification method and system based on an encrypted traffic bidirectional burst sequence. The invention extracts the encrypted traffic bidirectional burst sequence for traffic analysis, and solves the problems of low recognition accuracy and high algorithm complexity caused by the fact that the prior art cannot effectively capture the characteristics of the corresponding traffic of the web content when recognizing the behavior of the web content.
The technical scheme adopted by the invention is as follows:
a website content behavior identification method based on encrypted traffic bidirectional burst sequence comprises the following steps:
acquiring behavior flow data of an encrypted website;
preprocessing behavior traffic data into a bidirectional burst sequence;
establishing a website content behavior recognition model, and training the website content behavior recognition model by taking a bidirectional burst sequence as input;
and carrying out website content behavior recognition of the encrypted website by using the trained website content behavior recognition model.
Further, the behavior flow data of the encrypted website is obtained in an online flow capturing mode or acquired offline data is used, and the flow data is stored by using the pcap as a file extension.
Further, the obtaining the behavioral traffic data of the encrypted website includes:
reading a URL list of the target encrypted website and corresponding behavior operation of the corresponding website, and reading a target URL address from the URL list;
starting a WebDriver program, automatically opening a browser, and inputting the read URL address;
reading a behavior from a behavior list corresponding to a website, calling a script of automatic simulation operation of the corresponding behavior, starting a tcpdump data packet capturing program, executing the behavior automation operation script, and simulating website behavior operation;
after the behavior operation is finished, the data packet capturing process is finished, next behavior operation is executed, when the behavior operation in the behavior list in one website is finished, the browser is closed, the next website in the website list is read, and the operation is repeated until the website list is read.
Further, the preprocessing the behavioral traffic data into a bi-directional burst sequence includes:
filtering irrelevant flow;
extracting flows from the data packets by using a network session data packet segmentation method that takes the five-tuple as a unit, and classifying according to the five-tuple content, wherein data packets with the same five-tuple belong to the same unidirectional data flow in the uplink or downlink direction; storing the direction information of the data packets, marking uplink flows with +1 and downlink flows with -1; discarding flows that are too short because of connection establishment failure and the like, and finally obtaining a data flow set meeting the requirements;
the uplink traffic and the downlink traffic are processed respectively, and uplink/downlink burst is defined as a unidirectional data packet sequence corresponding to each HTTP message, and bidirectional burst sequence is defined as a sequence of unidirectional burst lengths in all uplink/downlink links.
Further, the website content behavior recognition model comprises a basic module and a full connection module; the basic module comprises a one-dimensional convolutional neural network layer, a batch standardization layer and a maximum pooling layer; the fully connected module includes a fully connected layer.
Further, the output of the fully-connected module is input into a softmax classifier for classification, a cross entropy loss function is used for calculating loss between a predicted value and a real label, and the website content behavior recognition model is trained.
A bi-directional burst sequence based website content behavior recognition system, comprising:
the flow acquisition module is used for acquiring behavior flow data of the encrypted website;
the traffic preprocessing and bidirectional burst sequence extracting module is used for preprocessing behavior traffic data into a bidirectional burst sequence;
the model building module is used for building a website content behavior recognition model;
the training module is used for training the website content behavior recognition model by taking the bidirectional burst sequence as input;
and the evaluation index calculation module is used for identifying website content behaviors of the encrypted website by using the trained website content behavior identification model, calculating the overall accuracy, the recall rate of the appointed type and the precision of the appointed type, and carrying out accurate quantification.
Compared with the prior art, the invention has the beneficial effects that:
the bidirectional burst sequence is selected as input, compared with the original data packet length sequence, the interaction of HTTP message requests and responses is better reflected, and the burst sequence has large variation due to the difference of website elements, so that the difference between website content behaviors can be better captured;
a convolutional neural network (CNN) is adopted to construct the traffic representation model, so that traffic representation and feature extraction are automatic, manual feature extraction and selection are avoided, and the content behavior traffic of encrypted websites can be identified accurately.
Drawings
FIG. 1 is a flow chart of encrypted website content behavior traffic data collection.
Fig. 2 is a schematic structural diagram of a website content behavior recognition model.
Fig. 3 is a schematic diagram of a module composition of a website content behavior recognition system based on an encrypted traffic bidirectional burst sequence according to an embodiment of the present invention.
Detailed Description
The present invention will be further described in detail with reference to the following examples and drawings, so that the above objects, features and advantages of the present invention can be more clearly understood.
The invention discloses a website content behavior identification method based on an encrypted traffic bidirectional burst sequence, which comprises the following steps:
S1: acquiring encrypted website behavior traffic data;
S2: preprocessing the traffic into a bidirectional burst sequence;
S3: establishing a website content behavior recognition model;
S4: encrypted website content behavior traffic identification.
Each step is described in detail below.
The encrypted website behavior traffic data in step S1 can be obtained by capturing traffic online, or previously collected offline data can be used; the traffic data is stored in files with the pcap extension. Offline data is collected by writing automated behavior operation scripts and capturing the resulting traffic with a capture tool such as tcpdump.
The specific acquisition process is shown in fig. 1, and comprises the following steps:
S1-1, reading a URL list of the target encrypted websites and the behavior operations corresponding to each website, and reading a target URL address from the URL list. To eliminate the influence of the browser on the website traffic, the cache and Cookie records in the browser are cleared before all operations, and the target website is accessed in the Incognito mode of the Chrome browser. Incognito mode (private browsing) lets the user browse web pages without leaving traces of the visited websites on the computer, including cached files, cookies, history, and download records, to protect the user's privacy and security.
S1-2, starting a WebDriver program, automatically opening a browser, and inputting the URL address read in S1-1.
S1-3, reading a behavior from the behavior list corresponding to the website in S1-2, and calling a script of an automatic simulation operation of the corresponding behavior. And at the same time, starting a tcpdump data packet capturing program, executing a behavior automation operation script, and simulating website behavior operation.
S1-4, after the behavior operation is finished, ending the data packet capturing process, and then executing the next behavior operation. When all behavior operations in the behavior list of one website are finished, closing the browser, reading the next website in the website list, and repeating the operations until the whole website list has been read.
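The collection loop in S1-1 to S1-4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's actual script: it assumes Selenium's Chrome WebDriver and a local tcpdump binary, and the names `tcpdump_command`, `pcap_name`, `collect`, the `eth0` interface, and the file naming scheme are all hypothetical. Each behavior is passed in as a callable that drives the browser.

```python
import subprocess
import time
from pathlib import Path


def tcpdump_command(interface, pcap_path):
    """Build a tcpdump invocation that writes raw packets to a pcap file."""
    return ["tcpdump", "-i", interface, "-w", pcap_path]


def pcap_name(out_dir, site, behavior, run):
    """Hypothetical naming scheme: one pcap per (site, behavior, run)."""
    return str(Path(out_dir) / f"{site}_{behavior}_{run}.pcap")


def collect(sites, behaviors, interface="eth0", out_dir="pcaps", run=0):
    """sites: {site_name: url}; behaviors: {site_name: {behavior_name: callable(driver)}}."""
    from selenium import webdriver  # third-party, imported lazily (assumed installed)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for site, url in sites.items():
        options = webdriver.ChromeOptions()
        options.add_argument("--incognito")         # S1-1: no cache or cookies carried over
        driver = webdriver.Chrome(options=options)  # S1-2: open the browser
        try:
            driver.get(url)
            for name, action in behaviors[site].items():
                # S1-3: start the capture, then run the simulated behavior
                proc = subprocess.Popen(
                    tcpdump_command(interface, pcap_name(out_dir, site, name, run)))
                time.sleep(1)  # give tcpdump a moment to start (assumed delay)
                try:
                    action(driver)
                finally:
                    proc.terminate()  # S1-4: stop the capture after the behavior
                    proc.wait()
        finally:
            driver.quit()  # close the browser before moving to the next website
```

Running `collect` typically requires privileges to capture on the chosen interface. The capture is started per behavior, as in S1-3, so each pcap file holds the traffic of a single labeled behavior.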
Step S2: preprocessing the captured traffic data.
S2-1: filtering irrelevant traffic. Data packets without actual payload, such as acknowledgement packets, are filtered out, as are packets such as TCP Retransmission and Dup ACK that are generated by network congestion.
S2-2: aggregating by five-tuple. Flows are extracted from the data packets using a network session packet segmentation method that takes the five-tuple (source IP address, destination IP address, source port, destination port, transport layer protocol type) as a unit. Packets are classified according to the five-tuple content: data packets with the same five-tuple belong to the same unidirectional data flow in the uplink or downlink direction. The direction information of the data packets is stored, with uplink flows marked +1 and downlink flows marked -1. In addition, flows that are too short because of connection establishment failure and the like are discarded, finally giving a data flow set that meets the requirements.
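Steps S2-1 and S2-2 can be illustrated with a short Python sketch. It is only an illustration under assumptions: packets are assumed to be already parsed into simple records, the `Packet` fields, the `split_flows` name, the `client_ip` test for direction, and the `min_packets` threshold are hypothetical, and retransmission and duplicate-ACK filtering is omitted.

```python
from collections import defaultdict, namedtuple

# Hypothetical parsed-packet record; a real pipeline would fill it from a pcap parser.
Packet = namedtuple("Packet", "ts src_ip dst_ip src_port dst_port proto payload_len")


def split_flows(packets, client_ip, min_packets=3):
    """Group packets into unidirectional flows keyed by five-tuple.

    Returns {five_tuple: (direction, [(ts, payload_len), ...])}, where
    direction is +1 for uplink (sent by the client) and -1 for downlink.
    """
    flows = defaultdict(list)
    for p in packets:
        if p.payload_len == 0:  # S2-1: drop packets without payload (e.g. pure ACKs)
            continue
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        flows[key].append((p.ts, p.payload_len))

    result = {}
    for key, pkts in flows.items():
        if len(pkts) < min_packets:  # discard flows that are too short (assumed threshold)
            continue
        direction = +1 if key[0] == client_ip else -1
        result[key] = (direction, sorted(pkts))
    return result
```

Because the source and destination are part of the key, the two directions of one TCP connection end up as two separate unidirectional flows, which is what the later burst extraction step expects.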
S2-3, extracting a bidirectional burst sequence. And respectively processing the uplink flow and the downlink flow obtained in the step S2-2. The upstream/downstream burst sequence is defined as a unidirectional packet sequence corresponding to each HTTP message. A bi-directional burst sequence is defined as a sequence consisting of an upstream burst length sequence and a downstream burst length sequence.
Since TCP is a byte stream protocol, it can split messages from the layer above it (e.g., TLS) in any way for transmission. For larger record sizes, a TLS record that exceeds the MSS limit is carried in multiple TCP payloads, and only one of these TCP payloads contains the TLS header. The unidirectional burst length is taken from the value of the TLS header length field. For smaller record sizes, the entire TLS record fits in a single TCP payload. Because such a TLS record is typically shorter than the MSS, TCP takes part of the next TLS record to fill the current payload up to the MSS. Thus, a TCP payload may contain no TLS header, one TLS header, or several TLS headers. The uplink/downlink burst length sequence is calculated as follows:
First, the packets of the unidirectional stream are traversed in order, and for each TLS header found in the current TCP payload the value of its length field is added to a running total. The length of the current TCP payload is then subtracted from this total to obtain the length of the TLS record data that remains in subsequent TCP payloads. If the remainder equals 0, the current unidirectional burst ends and its length is appended to the unidirectional burst sequence. Repeating this over the whole stream gives the unidirectional burst sequence. The uplink stream and the downlink stream are each processed in this way, and the resulting bursts are then ordered by timestamp to obtain the bidirectional burst sequence.
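The calculation above can be sketched in Python as follows. This is one possible reading of the procedure, not a definitive implementation: it assumes each direction is given as an ordered list of `(timestamp, payload_bytes)` pairs, that the 5-byte TLS record header is never split across two TCP payloads, and that the uplink/downlink direction is encoded as the sign of the burst length. The function names and the `tls_record` helper are hypothetical.

```python
TLS_HEADER_LEN = 5  # content type (1 byte) + version (2 bytes) + length (2 bytes)


def tls_record(n):
    """Hypothetical helper: a dummy application-data TLS record with n payload bytes."""
    return b"\x17\x03\x03" + n.to_bytes(2, "big") + bytes(n)


def unidirectional_bursts(payloads):
    """payloads: ordered list of (timestamp, bytes) TCP payloads of one direction.

    Returns a list of (burst_start_timestamp, burst_length).
    """
    bursts = []
    remaining = 0       # bytes of the current TLS record still expected in later payloads
    burst_len = 0       # sum of the TLS length fields seen in the current burst
    burst_start = None
    for ts, data in payloads:
        if burst_start is None:
            burst_start = ts
        offset = min(remaining, len(data))  # bytes continuing an earlier record
        remaining -= offset
        while offset + TLS_HEADER_LEN <= len(data):
            record_len = int.from_bytes(data[offset + 3:offset + 5], "big")
            burst_len += record_len
            offset += TLS_HEADER_LEN
            taken = min(record_len, len(data) - offset)
            remaining = record_len - taken
            offset += taken
        if remaining == 0:  # records end exactly at this payload: the burst is complete
            if burst_len > 0:
                bursts.append((burst_start, burst_len))
            burst_len, burst_start = 0, None
    return bursts


def bidirectional_bursts(up, down):
    """Merge uplink (+) and downlink (-) bursts in timestamp order."""
    signed = [(ts, +n) for ts, n in up] + [(ts, -n) for ts, n in down]
    return [n for _, n in sorted(signed)]
```

For example, a 3000-byte record carried in three TCP payloads counts as one burst of length 3000, while two small records packed into a single payload count as one burst whose length is the sum of the two length fields.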
Step S3: establishing the website content behavior recognition model. The input of the model is the bidirectional burst sequence output by step S2. The structure of the model is shown in fig. 2. It consists mainly of two parts, a base module (a basic CNN module) and a fully connected module, and a deep network is obtained by repeating the base module, for example the two base modules in fig. 2.
The first part is the base module. The number of the basic modules is two, and each basic module mainly comprises a one-dimensional convolutional neural network layer (convolutional layer), a batch normalization layer and a maximum pooling layer.
One-dimensional convolutional neural network layer: the convolutional layer is mainly responsible for feature extraction. It consists of a group of filters that perform the convolution operation on the input and pass the result to the next layer. Each base module in the model applies 3 one-dimensional convolutions, with convolution kernel sizes of 5, 3, and 3 and input channels of 1, 32, and 32 respectively. The stride of the convolution kernels is 1, the padding is set to SAME mode, and the biases are initialized to 0. The ReLU activation function is used for nonlinear processing, which helps avoid gradient explosion and vanishing gradients during training.
Batch normalization layer: also known as the batch standardization layer. The purpose of batch normalization is to overcome the difficulty of training that arises as the number of layers in the neural network grows. In a neural network, the input of each layer, after being computed and transformed by the preceding layers, has a distribution different from that of the original input data, and the shifts introduced by earlier layers are accumulated and amplified by later layers. Batch normalization corrects this in time during training by normalizing the input of each layer so that its mean and variance are fixed.
Maximum pooling layer: the convolution is followed by a pooling operation. Pooling is essentially sampling: it reduces the dimension of the input feature map, reducing the number of parameters and the amount of computation while retaining the main features, which helps prevent overfitting and speeds up computation. Max pooling is selected, with the pooling size set to 2, the stride set to 2, and the padding set to SAME.
The second part is the fully connected module. After the input has passed through the convolution operations of the two identical base modules, it enters two fully connected layers. The first fully connected layer has 512 hidden nodes, and the size of the second fully connected layer is set according to the number of sample categories. After the first fully connected layer, batch normalization is applied so that the inputs of each layer keep the same distribution. Then the ReLU activation function is used for nonlinear processing, and Dropout randomly drops some hidden neurons in the network, which provides a degree of regularization. The Dropout ratio is set to 0.5.
The output of the fully connected module is input to a softmax classifier for classification. And calculating the loss between the predicted value and the real label by using the cross entropy loss function, so as to train the website content behavior recognition model.
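The model described above could be built in PyTorch roughly as follows. This is a sketch under assumptions rather than the patent's exact network: the output channel count of 32, the fixed input length `seq_len`, the placement of ReLU after each convolution, and the names `base_module` and `BurstCNN` are not given in the text. Since the text states input channels of 1, 32, and 32, the second base module is assumed to take 32 input channels. PyTorch has no SAME mode for pooling, so `ceil_mode=True` is used to get the same output length.

```python
import math

import torch
from torch import nn


def base_module(in_channels, channels=32):
    """Three 1-D convolutions (kernels 5, 3, 3), batch normalization, max pooling."""
    layers = []
    for c_in, k in zip((in_channels, channels, channels), (5, 3, 3)):
        conv = nn.Conv1d(c_in, channels, kernel_size=k, stride=1, padding="same")
        nn.init.zeros_(conv.bias)  # biases initialized to 0
        layers += [conv, nn.ReLU()]
    return nn.Sequential(
        *layers,
        nn.BatchNorm1d(channels),
        nn.MaxPool1d(kernel_size=2, stride=2, ceil_mode=True),  # output length ceil(L / 2)
    )


class BurstCNN(nn.Module):
    def __init__(self, seq_len, num_classes, channels=32):
        super().__init__()
        self.features = nn.Sequential(base_module(1, channels),
                                      base_module(channels, channels))
        pooled_len = math.ceil(math.ceil(seq_len / 2) / 2)  # length after two poolings
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * pooled_len, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):  # x: (batch, 1, seq_len) burst-length sequences
        return self.classifier(self.features(x))  # raw class scores (logits)


# nn.CrossEntropyLoss applies log-softmax internally, so the model returns logits.
loss_fn = nn.CrossEntropyLoss()
```

A training step would compute `loss_fn(model(batch), labels)` and back-propagate as usual. Burst sequences of varying length would need to be padded or truncated to `seq_len` first, which the text does not specify.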
Step S4 is encrypted website content behavior identification. The website content behavior recognition model constructed and trained in step S3 is used to recognize encrypted website content behavior on the test set.
In one embodiment of the present invention, a method for identifying website content behavior based on encrypted traffic bidirectional burst sequence is provided, the method comprising the following steps:
and collecting the traffic according to the step S1 by taking the traffic generated by the encrypted website content behavior as an object.
Preprocessing the flow according to the step S2, filtering the irrelevant flow, and extracting the filtered flow according to the quintuple. Secondly, respectively extracting traffic burst sequences in an uplink and a downlink, and sequencing the extracted unidirectional burst sequences according to time stamps to obtain a bidirectional burst sequence. In addition, the extracted bidirectional burst sequence is divided into a training set, a verification set and a test set according to a cross verification method.
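The text does not say how the cross-validation split is configured. One common arrangement, shown here purely as an assumption, holds out a fixed test set and then rotates the validation fold over the remaining samples. The function name `split_dataset`, the 5 folds, and the 20% test fraction are hypothetical.

```python
import random


def split_dataset(n_samples, k=5, test_fraction=0.2, seed=0):
    """Return (splits, test), where splits is a list of k (train, validation) index lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # reproducible shuffle
    n_test = int(n_samples * test_fraction)
    test, rest = idx[:n_test], idx[n_test:]    # held-out test set
    folds = [rest[i::k] for i in range(k)]     # k disjoint folds of the remainder
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, val))
    return splits, test
```

Each index refers to one bidirectional burst sequence and its behavior label.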
Constructing the encrypted website content behavior recognition model according to step S3. The convolutional network is built using a deep learning framework (e.g., PyTorch). The training set generated in step S2 is input directly into the recognition model for training.
And according to the step S4, testing the model proposed in the step S3 on the test set generated in the step S2.
Another embodiment of the present invention provides a system for identifying website content behavior based on a bidirectional burst sequence, as shown in fig. 3, including:
the flow acquisition module is used for acquiring (reading) network flow online (offline);
the flow preprocessing and two-way burst sequence extracting module is used for carrying out no-load flow filtering, quintuple aggregation and two-way burst sequence extraction on the obtained original flow;
the model building module is used for building a behavior recognition model by using a website content behavior recognition method based on a bidirectional burst sequence;
the training module is used for training the website behavior recognition model;
and the evaluation index calculation module is used for identifying website content behaviors of the encrypted website by using the trained website content behavior identification model, calculating the overall accuracy, the recall rate of the appointed type and the precision of the appointed type, and carrying out accurate quantification.
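The three indicators named for the evaluation index calculation module follow their standard definitions and can be computed as in this short sketch. The function name `evaluate` and the label strings are hypothetical.

```python
def evaluate(y_true, y_pred, label):
    """Return (overall accuracy, precision for label, recall for label)."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

Precision for a type is the fraction of samples predicted as that type that truly belong to it, and recall is the fraction of samples of that type that the model finds.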
Through the technical scheme, the invention provides an effective method and system for identifying the content behavior in the encrypted website.
Another embodiment of the invention provides a computer device (computer, server, smart phone, etc.) comprising a memory storing a computer program configured to be executed by the processor and a processor, the computer program comprising instructions for performing the steps of the method of the invention.
Another embodiment of the invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, performs the steps of the method of the invention.
The above-disclosed embodiments of the present invention are intended to aid in understanding the contents of the present invention and to enable the same to be carried into practice, and it will be understood by those of ordinary skill in the art that various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention. The invention should not be limited to what has been disclosed in the examples of the specification, but rather by the scope of the invention as defined in the claims.

Claims (10)

1. A website content behavior identification method based on an encrypted traffic bidirectional burst sequence is characterized by comprising the following steps:
acquiring behavior traffic data of an encrypted website;
preprocessing behavior traffic data into a bidirectional burst sequence;
establishing a website content behavior recognition model, and training the website content behavior recognition model by taking a bidirectional burst sequence as input;
and carrying out website content behavior recognition of the encrypted website by using the trained website content behavior recognition model.
2. The method of claim 1, wherein the behavior traffic data of the encrypted website is obtained by capturing traffic online or by using previously collected offline data, and the traffic data is saved in files with the .pcap extension.
3. The method of claim 1, wherein obtaining the behavior traffic data of the encrypted website comprises:
reading a URL list of target encrypted websites and the behavior operations corresponding to each website, and reading a target URL from the URL list;
starting a WebDriver program, automatically opening a browser, and entering the URL that was read;
reading a behavior from the behavior list corresponding to the website, invoking the automated simulation script for that behavior, starting the tcpdump packet capture program, and executing the automation script to simulate the website behavior operation;
after the behavior operation finishes, ending the packet capture process and executing the next behavior operation; when all behavior operations in the behavior list of a website have been completed, closing the browser and reading the next website from the website list; and repeating the above operations until the whole website list has been processed.
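The collection loop of claim 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: `collect`, `open_browser`, the `sites` mapping, and the pcap file naming are hypothetical, and the browser object is assumed to expose a `close()` method. The tcpdump invocation uses only the standard `-i` (interface) and `-w` (write to file) options; the interface name `any` assumes a Linux host, and packet capture normally needs elevated privileges.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, List, Mapping


@contextmanager
def tcpdump_capture(pcap_path: str, interface: str = "any"):
    """Run tcpdump for the duration of the with-block."""
    proc = subprocess.Popen(["tcpdump", "-i", interface, "-w", pcap_path])
    try:
        yield
    finally:
        proc.terminate()  # end the capture once the behavior has finished
        proc.wait()


def collect(sites: Mapping[str, Mapping[str, Callable[[object], None]]],
            open_browser: Callable[[str], object],
            capture=tcpdump_capture) -> List[str]:
    """sites maps URL -> {behavior name: script(browser)}; returns pcap paths."""
    written = []
    for site_index, (url, behaviors) in enumerate(sites.items()):
        # Hypothetical factory, e.g. one that wraps a WebDriver session
        browser = open_browser(url)
        try:
            for name, script in behaviors.items():
                path = f"site{site_index}_{name}.pcap"
                with capture(path):   # start capturing before the behavior runs
                    script(browser)   # simulate the behavior operation
                written.append(path)
        finally:
            browser.close()           # close the browser after the last behavior
    return written
```

Passing the browser factory and the capture context manager as parameters keeps the loop independent of any particular automation tool and lets it be exercised with stand-ins when no network capture is available.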
4. The method of claim 1, wherein preprocessing the behavior traffic data into a bidirectional burst sequence comprises:
filtering out irrelevant traffic;
extracting flows from the data packets with a five-tuple-based network session packet segmentation method, and classifying the packets by five-tuple content, wherein packets with an identical five-tuple belong to the same unidirectional data flow in either the uplink or the downlink direction; storing the direction information of each packet, marking uplink traffic with +1 and downlink traffic with -1; and discarding flows that are too short, for example because connection establishment failed, to obtain the final set of data flows that meet the requirements;
processing the uplink traffic and the downlink traffic separately, wherein an uplink/downlink burst is defined as the unidirectional packet sequence corresponding to one HTTP message, and the bidirectional burst sequence is defined as the sequence of the lengths of all unidirectional bursts on the uplink and the downlink.
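The preprocessing of claim 4 can be sketched in Python as below. Several points are assumptions made for illustration and are not stated in the patent: the `Packet` record and the `burst_sequences` function are hypothetical, a run of consecutive same-direction packets is treated as one burst (an approximation of "the packets of one HTTP message"), burst length is measured in payload bytes, and the two unidirectional five-tuples of a connection are merged under one session key so that uplink and downlink bursts can be interleaved in order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    payload_len: int


def burst_sequences(packets: Iterable[Packet], client_ip: str,
                    min_packets: int = 2) -> Dict[Tuple, List[int]]:
    """Return one signed burst-length sequence per session (+ uplink, - downlink)."""
    sessions = defaultdict(list)
    for p in packets:
        if p.payload_len == 0:      # drop empty-payload packets such as bare ACKs
            continue
        direction = 1 if p.src_ip == client_ip else -1
        # Order-independent key: both directions of a connection share it
        a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
        sessions[(min(a, b), max(a, b), p.proto)].append((direction, p.payload_len))

    result = {}
    for key, pkts in sessions.items():
        if len(pkts) < min_packets:  # discard flows that are too short
            continue
        bursts: List[int] = []
        for direction, size in pkts:
            if bursts and (bursts[-1] > 0) == (direction > 0):
                bursts[-1] += direction * size   # same direction: extend the burst
            else:
                bursts.append(direction * size)  # direction changed: new burst
        result[key] = bursts
    return result
```

With this sign convention, a request of 300 bytes followed by a 2000-byte response and a further 50-byte request yields the sequence `[300, -2000, 50]`.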
5. The method of claim 1, wherein the website content behavior recognition model comprises a base module and a fully connected module; the base module comprises a one-dimensional convolutional neural network layer, a batch normalization layer, and a max pooling layer; and the fully connected module comprises a fully connected layer.
6. The method of claim 5, wherein the one-dimensional convolutional neural network layer comprises three one-dimensional convolutions whose kernel sizes are set to 5, 3, and 3 and whose input channels are set to 1, 32, and 32, respectively, with the convolution stride set to 1, the padding set to SAME mode, and the biases initialized to 0, and applies a ReLU activation function for non-linear processing to avoid gradient explosion and gradient vanishing during training; the batch normalization layer is used for correcting training samples in time by normalizing the input of each layer so that its mean and variance are fixed; and the max pooling layer has its pooling size set to 2, its stride set to 2, and its padding set to SAME.
7. The method of claim 5, wherein the website content behavior recognition model is trained by feeding the output of the fully connected module into a softmax classifier for classification and calculating the loss between the predicted values and the true labels with a cross-entropy loss function.
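The softmax classification and cross-entropy loss named in claim 7 can be written out for a single sample as a short, dependency-free Python sketch. The function names and the list-based representation of the fully connected output are assumptions for illustration; a real training pipeline would use a deep learning framework's batched, differentiable equivalents.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn fully connected outputs into class probabilities."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Loss between the predicted distribution and the true class index."""
    return -math.log(softmax(logits)[label])
```

The loss is small when the probability assigned to the true behavior class is close to 1 and grows as that probability shrinks, which is what drives the model toward correct classifications during training.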
8. A system for identifying website content behavior based on a bi-directional burst sequence, comprising:
the traffic acquisition module is used for acquiring behavior traffic data of the encrypted website;
the traffic preprocessing and bidirectional burst sequence extracting module is used for preprocessing behavior traffic data into a bidirectional burst sequence;
the model building module is used for building a website content behavior recognition model;
the training module is used for training the website content behavior recognition model by taking the bidirectional burst sequence as input;
and the evaluation metric calculation module is used for identifying the website content behaviors of the encrypted website with the trained website content behavior recognition model, and for calculating the overall accuracy, the recall of a specified class, and the precision of a specified class, so as to quantify the recognition performance.
9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-7.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211361850 2022-11-02
CN2022113618505 2022-11-02

Publications (1)

Publication Number Publication Date
CN116401479A (en) 2023-07-07

Family

ID=87008310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310269520.1A Pending CN116401479A (en) 2022-11-02 2023-03-20 Website content behavior identification method and system based on encrypted traffic bidirectional burst sequence

Country Status (1)

Country Link
CN (1) CN116401479A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117097628A (en) * 2023-10-19 2023-11-21 中国电子科技集团公司第五十四研究所 Networking communication behavior identification method based on signal physical characteristic parameters
CN117097628B (en) * 2023-10-19 2023-12-22 中国电子科技集团公司第五十四研究所 Networking communication behavior identification method based on signal physical characteristic parameters

Similar Documents

Publication Publication Date Title
WO2022041394A1 (en) Method and apparatus for identifying network encrypted traffic
CN109063745B (en) Network equipment type identification method and system based on decision tree
CN107483488A (en) A kind of malice Http detection methods and system
Yang et al. TLS/SSL encrypted traffic classification with autoencoder and convolutional neural network
Najafabadi et al. User behavior anomaly detection for application layer ddos attacks
CN112165484B (en) Network encryption traffic identification method and device based on deep learning and side channel analysis
CN105103496A (en) System and method for extracting and preserving metadata for analyzing network communications
CN114422211B (en) HTTP malicious traffic detection method and device based on graph attention network
CN113407886A (en) Network crime platform identification method, system, device and computer storage medium
CN109275045B (en) DFI-based mobile terminal encrypted video advertisement traffic identification method
CN108319672A (en) Mobile terminal malicious information filtering method and system based on cloud computing
CN113364787A (en) Botnet flow detection method based on parallel neural network
CN116401479A (en) Website content behavior identification method and system based on encrypted traffic bidirectional burst sequence
Shen et al. Efficient fine-grained website fingerprinting via encrypted traffic analysis with deep learning
Wang et al. 2ch-TCN: a website fingerprinting attack over tor using 2-channel temporal convolutional networks
CN112163493A (en) Video false face detection method and electronic device
CN113938290B (en) Website de-anonymization method and system for user side flow data analysis
CN114650229A (en) Network encryption traffic classification method and system based on three-layer model SFTF-L
CN113726561A (en) Business type recognition method for training convolutional neural network by using federal learning
Zhou et al. Encrypted network traffic identification based on 2d-cnn model
CN113128626A (en) Multimedia stream fine classification method based on one-dimensional convolutional neural network model
CN116232696A (en) Encryption traffic classification method based on deep neural network
CN113949653B (en) Encryption protocol identification method and system based on deep learning
CN111310796A (en) Web user click identification method facing encrypted network flow
CN116260736A (en) Deep learning-based decentralization application flow identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination