CN116208506B - Encryption traffic website identification method based on space-time correlation website fingerprint - Google Patents

Encryption traffic website identification method based on space-time correlation website fingerprint

Info

Publication number
CN116208506B
Authority
CN
China
Prior art keywords
website
fingerprints
tcp
space
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310049743.7A
Other languages
Chinese (zh)
Other versions
CN116208506A (en)
Inventor
余翔湛
龚家兴
石开宇
刘立坤
羿天阳
孔德文
刘奉哲
李竑杰
张森
程明明
王钲浩
郭一澄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202310049743.7A
Publication of CN116208506A
Application granted
Publication of CN116208506B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides an encrypted traffic website identification method based on space-time associated website fingerprints, and belongs to the technical field of traffic identification. The method simulates a user visiting websites one by one, multiple times, through an encryption proxy channel, captures the resulting traffic, and generates space-time associated website fingerprints; it then uses these fingerprints to perform website identification on the encrypted traffic in the encrypted channel built by the encryption proxy. The invention introduces the spatial information of the traffic and, by defining the website indicator (WIF) of a stream and the importance Score of a sequence fingerprint, combines the timing information and the spatial information of the traffic to generate space-time associated website fingerprints. Using these fingerprints greatly improves the accuracy of traffic website identification, which solves the technical problem of low website identification accuracy of website fingerprints in the prior art.

Description

Encryption traffic website identification method based on space-time correlation website fingerprint
Technical Field
The application relates to a traffic identification method, in particular to an encrypted traffic website identification method based on space-time correlation website fingerprints, and belongs to the technical field of traffic identification.
Background
Website fingerprinting is one of the important techniques for analyzing encrypted traffic. By generating website fingerprints, it can perform website identification on encrypted traffic, that is, determine whether the encrypted traffic contains traffic generated by accessing a specific website. The attack model of website fingerprinting assumes that the attacker sits at a traffic relay point or is able to monitor the communication channel, and can passively capture all traffic exchanged between the user and the server but cannot interfere with that communication. Because the captured traffic is encrypted, its content cannot be identified directly; however, information such as the size, timing and density of the packets generated by accessing a website often follows a regular pattern, and website fingerprinting uses this information to generate the fingerprint of a website and identify it.
Draper-Gil et al. proposed a website fingerprint generation scheme based on the timing information of encrypted traffic, and used the resulting website fingerprints to identify unknown encrypted traffic. Because the scheme uses only the timing information of the traffic and not its spatial information, namely the distances between flows in the feature space, its accuracy is limited. In addition, because the content of dynamic web pages changes frequently, the timing information fluctuates considerably, which further reduces the accuracy of the method.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
In view of this, to solve the technical problem of low website fingerprint identification accuracy in the prior art, the invention provides an encrypted traffic website identification method based on space-time associated website fingerprints. Depending on the network structure, encryption proxies can be divided into multi-hop and single-hop encryption proxies; the invention targets single-hop encryption proxies and provides a website identification scheme for them.
Scheme one: an encrypted traffic website identification method based on space-time associated website fingerprints, comprising: simulating a user visiting websites one by one, multiple times, through an encryption proxy channel, capturing the traffic, and generating space-time associated website fingerprints; and performing website identification on the encrypted traffic in the encrypted channel built by the encryption proxy, based on the space-time associated website fingerprints.
Preferably, generating the space-time associated website fingerprints comprises the following steps:
S11, simulating a user accessing a target website with an encryption proxy tool to acquire TCP streams;
S12, classifying and labeling the TCP streams to generate a double label for each TCP stream, and mapping each distinct double label to an integer;
S13, extracting a statistical feature vector from each TCP stream, using the statistical feature vectors as sample data and the integers corresponding to the double labels as sample class labels, and training a random forest;
S14, generating a fingerprint for each TCP stream: for each TCP stream, the statistical feature vector is input into the trained random forest, and the integers (each corresponding to a double label) output by all decision trees in the random forest form an N-dimensional vector that serves as the stream fingerprint, where N is the number of decision trees;
S15, calculating the website indicator of each TCP stream;
S16, generating website sequence fingerprints for the different double-labeled data;
S17, generating the space-time associated website fingerprint: for each website, the 10 sequence fingerprints with the highest Score are selected to form a set, and this set of sequence fingerprints is used as the space-time associated website fingerprint.
Preferably, classifying and labeling the TCP streams to generate their double labels comprises the following steps:
S121, extracting features of the TCP streams, the features comprising: the number of data packets, the total size of the data packets sent, and the total size of all data packets;
S122, clustering the features of the TCP streams, where the clustering result is used as the second label of each TCP stream, and identical or similar TCP streams are denoted $f_{i,j}$;
S123, assigning a double label (i, j) to each TCP stream, where i denotes the website accessed by the TCP stream and j is the second label, namely the class number produced by clustering the TCP streams.
Preferably, the website indicator of a TCP stream is calculated as follows:
[Formula image not available in this text.]
where τ(p, q) and τ(i, j) denote the streams with double labels (p, q) and (i, j), respectively; M denotes the set of monitored websites and |M| the number of all monitored categories, and the +1 adds one extra category that is treated as a single class; #flow(p) denotes the number of distinct double labels among the streams of website p; the nearest-neighbor term denotes the K nearest neighbors of stream f in the stream set X, where K is the number of neighbors examined by the invention; |τ(i, j)| denotes the number of all streams with double label (i, j); and the remaining term is a normalization factor.
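Preferably, the website sequence fingerprints are generated as follows: for the sequences of any two TCP streams, $S_a$ and $S_b$, generated by two visits to the same website, the longest common subsequence is generated:
$$
L(p,q)=\begin{cases}
\varnothing, & p=0 \text{ or } q=0\\
L(p-1,q-1)\oplus f^{a}_{p}, & \mathrm{bi\_table}(f^{a}_{p})=\mathrm{bi\_table}(f^{b}_{q})\\
\operatorname{longest}\bigl(L(p-1,q),\,L(p,q-1)\bigr), & \text{otherwise}
\end{cases}
$$
where $L(p,q)$ denotes the longest common subsequence of $S_a^{(p)}$ and $S_b^{(q)}$, the prefixes formed by the first p streams of $S_a$ and the first q streams of $S_b$; $L(p-1,q-1)\oplus f^{a}_{p}$ denotes appending the stream $f^{a}_{p}$ after the subsequence $L(p-1,q-1)$; and bi_table(x) denotes taking the double label of stream x.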
Preferably, the importance score of a sequence fingerprint is calculated as follows:
[Formula image not available in this text.]
where #OCur(F) denotes the number of occurrences of the sequence fingerprint F.
Preferably, performing website identification on the encrypted traffic in the encrypted channel built by the encryption proxy, based on the space-time associated website fingerprints, comprises the following steps:
S21, acquiring the TCP stream sequence of the access traffic: a single visit to a website generates multiple TCP streams; a double label is generated for each TCP stream, and the double labels of all TCP streams are ordered by the generation time of the streams to obtain the sequence fingerprint of the access traffic;
S22, matching the sequence fingerprint of the access traffic against the space-time associated website fingerprint: determining whether the subsequences of the access traffic's TCP stream sequence include any sequence fingerprint of the website fingerprint, and, for each sequence fingerprint that is included, accumulating its importance score; then determining whether the accumulated importance score exceeds a threshold. If the threshold is not exceeded, the access traffic does not contain traffic generated by accessing the website; if the threshold is exceeded, the access traffic contains traffic generated by accessing the website.
Scheme two: an electronic device comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the encrypted traffic website identification method based on space-time associated website fingerprints of scheme one.
Scheme three: a computer-readable storage medium storing a computer program which, when executed by a processor, implements the encrypted traffic website identification method based on space-time associated website fingerprints of scheme one.
The beneficial effects of the invention are as follows: the invention introduces the spatial information of the traffic and, by defining the website indicator (WIF) of a stream and the importance Score of a sequence fingerprint, combines the timing information and the spatial information of the traffic to generate space-time associated website fingerprints. Using these fingerprints greatly improves the accuracy of traffic source identification, which solves the technical problem of low source identification accuracy of website fingerprints in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of the encrypted traffic website identification method based on space-time associated website fingerprints;
FIG. 2 is a schematic diagram of the process for generating space-time associated website fingerprints;
FIG. 3 is a schematic flow chart of identifying encrypted traffic based on space-time associated website fingerprints.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of exemplary embodiments of the present application is given with reference to the accompanying drawings, and it is apparent that the described embodiments are only some of the embodiments of the present application and not exhaustive of all the embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
Embodiment 1: this embodiment is described with reference to FIGS. 1-3. It is an encrypted traffic website identification method based on space-time associated website fingerprints: a user visiting websites one by one, multiple times, is simulated in a clean network environment, and the traffic is captured to generate space-time associated website fingerprints; the encrypted traffic is then identified based on these fingerprints, and the source that generated the encrypted traffic is given.
In this embodiment, an encryption proxy tool is used to simulate a user visiting the websites one by one, multiple times, so as to obtain sufficient traffic, and the traffic is classified and labeled with the DBSCAN clustering algorithm. Feature vectors (total number of packets, number of packets in each direction, time distribution of the packets, and so on) are extracted for the differently labeled streams, a random forest is trained on these feature vectors, and the fingerprint of each stream is obtained from the random forest. Based on the stream fingerprints, a metric is defined: the website indicator of a stream (Website Index of Flow, WIF), which quantifies how likely it is that a stream was generated by accessing a given website. The WIF essentially already contains the spatial information between streams; the timing information of the streams must then be used to generate the final website fingerprint. Computing the longest common subsequence between any two stream sequences generated by accessing the same website takes the timing information into account, and the resulting sequence is called a sequence fingerprint of the website. A score is then generated for each sequence fingerprint from the website indicators of its streams and the number of times the sequence occurs, and the top 10 sequence fingerprints are selected as the website fingerprint. FIG. 2 is a flow chart of the method for generating space-time associated website fingerprints.
Specifically, S1, generating the space-time associated website fingerprint. Because accessing different websites may produce identical or similar TCP streams, each TCP stream must be distinguished in finer detail, that is, each TCP stream is assigned a two-dimensional label (i, j), where i denotes which website the TCP stream was accessing and j is produced by clustering all TCP streams. The method specifically comprises the following steps:
S11, simulating a user accessing a target website with an encryption proxy tool to acquire TCP streams;
S12, classifying and labeling the TCP streams to generate a double label for each TCP stream, and mapping each distinct double label to an integer;
The double labels are mapped to integers for the subsequent random forest training: the output of the random forest and the class label y can only be integers, so a mapping is required, and all subsequent random forest results are the integers corresponding to double labels.
S121, extracting features of the TCP streams, the features comprising: the number of data packets, the total size of the data packets sent, the total size of all data packets, and the like;
S122, clustering the features of the TCP streams, where the clustering result is used as the second label of each TCP stream, and identical or similar TCP streams are denoted $f_{i,j}$;
S123, assigning a double label (i, j) to each TCP stream, where i denotes the website accessed by the TCP stream and j is the second label, namely the class number produced by clustering the TCP streams.
Specifically, because it is already known at collection time which website each TCP stream was generated by accessing, the first label i is obtained directly.
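The labeling in S12 and S121 to S123 can be sketched as follows. This is a minimal illustration, not the patented implementation: the embodiment names DBSCAN but gives no parameters, so the hand-written `dbscan` helper, the `eps` and `min_pts` values, the function names and the toy feature values are all assumptions made for the example.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one cluster id per point (-1 means noise)."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(idx):
        return [k for k, p in enumerate(points) if dist(points[idx], p) <= eps]

    cluster = -1
    for idx in range(len(points)):
        if labels[idx] is not UNVISITED:
            continue
        seeds = neighbors(idx)
        if len(seeds) < min_pts:
            labels[idx] = NOISE
            continue
        cluster += 1
        labels[idx] = cluster
        queue = [k for k in seeds if k != idx]
        while queue:
            k = queue.pop()
            if labels[k] == NOISE:
                labels[k] = cluster          # noise becomes a border point
            if labels[k] is not UNVISITED:
                continue
            labels[k] = cluster
            k_neighbors = neighbors(k)
            if len(k_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(k_neighbors)
    return labels

def double_labels(sites, features, eps, min_pts):
    """Pair each stream's website i with its cluster id j and map each (i, j) to an integer."""
    clusters = dbscan(features, eps, min_pts)
    pairs = list(zip(sites, clusters))
    to_int = {pair: n for n, pair in enumerate(sorted(set(pairs)))}
    return pairs, to_int

# Toy data (assumed): (packet count, size sent, total size) per stream.
sites = ["a", "a", "a", "b", "b", "b"]
features = [(10, 1.0, 5.0), (11, 1.1, 5.2), (50, 9.0, 40.0),
            (10, 1.0, 5.1), (51, 9.1, 41.0), (49, 8.9, 39.5)]
pairs, to_int = double_labels(sites, features, eps=3.0, min_pts=2)
class_labels = [to_int[pair] for pair in pairs]  # integer labels for training
```

Streams from websites "a" and "b" that fall into the same cluster still receive different double labels, because the website index i differs; that is the point of labeling in two dimensions.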
S13, extracting a statistical feature vector from each TCP stream, using the statistical feature vectors as sample data and the integers corresponding to the double labels as sample class labels, and training a random forest;
After the double labels of the TCP streams have been generated, statistical features are designed to characterize each TCP stream fully and support the subsequent stream fingerprinting. For each TCP stream, this embodiment analyzes its packet sequence in both directions and constructs an 814-dimensional feature vector from the size and timing information of the packet sequence. The embodiment does not consider the payload of the packets, because the collected traffic is encrypted and not enough information can be obtained from the payload. The specific meaning of each dimension of the feature vector is given in Table 1.
Table 1. Specific meanings of the feature vector dimensions
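As a rough illustration of the kind of size and timing statistics described above, the sketch below computes eight values for one stream. It does not reproduce the 814 dimensions of Table 1, which are not listed in this text; the packet representation (timestamp plus signed size, positive for outgoing) and the chosen statistics are assumptions.

```python
from statistics import mean

def flow_features(packets):
    """packets: list of (timestamp_seconds, signed_size); size > 0 is outgoing, < 0 incoming."""
    out_sizes = [size for _, size in packets if size > 0]
    in_sizes = [-size for _, size in packets if size < 0]
    times = [t for t, _ in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return [
        len(packets),                           # total number of packets
        len(out_sizes),                         # packets sent
        len(in_sizes),                          # packets received
        sum(out_sizes),                         # bytes sent
        sum(in_sizes),                          # bytes received
        mean(gaps) if gaps else 0.0,            # mean inter-arrival time
        max(gaps) if gaps else 0.0,             # largest inter-arrival time
        times[-1] - times[0] if times else 0.0  # stream duration
    ]

# Toy stream (assumed values): a request, two large responses, one ACK-sized packet.
example = flow_features([(0.0, 100), (0.5, -1400), (1.0, -1400), (2.0, 60)])
```

Only sizes, directions and timestamps are used, which matches the constraint that the payload is encrypted.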
S14, generating a fingerprint for each TCP stream: for each TCP stream, the statistical feature vector is input into the trained random forest, and the integers (each corresponding to a double label) output by all decision trees in the random forest form an N-dimensional vector that serves as the stream fingerprint, where N is the number of decision trees;
With the statistical feature vectors of the TCP streams available, a random forest (RF) is used to generate one fingerprint for each TCP stream, so as to reflect the similarity between the streams of different websites from different angles (i.e., feature subspaces). For each TCP stream, an N-dimensional vector is constructed as its stream fingerprint from the intermediate decisions made by all decision trees (rather than the final decision made by the RF through voting), where N is the number of decision trees in the random forest. For example, for a stream $f_{i,j}$ with double label (i, j), the stream fingerprint can be written as $[T_1(i,j), T_2(i,j), \ldots, T_N(i,j)]$, where $T_k$ (k = 1, 2, ..., N) denotes the classification result of the k-th decision tree for stream $f_{i,j}$; this result is an integer that corresponds one-to-one to a double label. To avoid overfitting, each decision tree is built by bagging over the TCP stream feature space, which means that the decisions made by different decision trees about the class of a given TCP stream come from different observation angles (i.e., feature subspaces).
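The difference between the per-tree fingerprint and the forest's voted output can be sketched as follows. The three "trees" are hypothetical stand-ins written as plain functions, not a trained random forest, and the thresholds and feature layout are assumptions.

```python
from collections import Counter

def stream_fingerprint(trees, features):
    """One entry per decision tree: the integer class each tree predicts on its own."""
    return [tree(features) for tree in trees]

def majority_vote(fingerprint):
    """The forest's final answer, shown only for contrast with the fingerprint."""
    return Counter(fingerprint).most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees (N = 3).
toy_trees = [
    lambda x: 0 if x[0] < 20 else 1,
    lambda x: 0 if x[1] < 500 else 2,
    lambda x: 1 if x[0] >= 5 else 0,
]

fingerprint = stream_fingerprint(toy_trees, (30, 800))  # [1, 2, 1]
voted = majority_vote(fingerprint)                      # 1
```

Two streams whose voted class is identical can still have different fingerprints, which is what lets distances between fingerprints carry spatial information.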
S15, calculating the website indicator of each TCP stream. Because the fingerprints of TCP streams can measure how similar two streams are, a metric can be defined on them: the website indicator of a stream (Website Index of Flow, WIF), which expresses how likely it is that a TCP stream was produced by accessing a given website. It is calculated as follows: for one TCP stream labeled (i, j), its K nearest TCP stream instances are found among all TCP streams of all websites (distances are computed on the stream fingerprints), and the proportion of those neighbors that carry the same double label (i, j) is computed. The same is done for every TCP stream with double label (i, j), and the average of these proportions is taken. The result indicates how likely it is that an observed TCP stream with double label (i, j) was generated by accessing website i.
The specific formula is as follows:
[Formula image not available in this text.]
where τ(p, q) and τ(i, j) denote the streams with double labels (p, q) and (i, j), respectively; M denotes the set of monitored websites and |M| the number of all monitored categories, and the +1 adds one extra category that is treated as a single class; #flow(p) denotes the number of distinct double labels among the streams of website p; the nearest-neighbor term denotes the K nearest neighbors of stream f in the stream set X, where K is the number of neighbors examined by the invention; |τ(i, j)| denotes the number of all streams with double label (i, j); and the remaining term is a normalization factor.
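The averaging procedure described in words above can be sketched as follows. This covers only the neighbor-proportion average; the normalization factor is omitted because its exact form is not recoverable from this text. Using Hamming distance between fingerprints and excluding a stream from its own neighbor list are assumptions, as are the function names and toy data.

```python
def hamming(a, b):
    """Number of decision trees on which two stream fingerprints disagree."""
    return sum(x != y for x, y in zip(a, b))

def wif(target, fingerprints, labels, k):
    """Average, over streams labelled `target`, of the share of their k nearest
    neighbours (among all other streams) that carry the same label."""
    members = [n for n, label in enumerate(labels) if label == target]
    ratios = []
    for n in members:
        others = [m for m in range(len(labels)) if m != n]
        others.sort(key=lambda m: hamming(fingerprints[n], fingerprints[m]))
        nearest = others[:k]
        ratios.append(sum(labels[m] == target for m in nearest) / k)
    return sum(ratios) / len(ratios)

# Toy fingerprints from a 3-tree forest; labels are integer double labels.
fingerprints = [[0, 0, 0], [0, 0, 1], [0, 0, 0],
                [2, 2, 2], [2, 2, 1], [0, 2, 2]]
labels = [0, 0, 0, 2, 2, 2]
wif_0 = wif(0, fingerprints, labels, k=2)
wif_2 = wif(2, fingerprints, labels, k=2)
```

In this toy data the streams labeled 0 sit in a tight group, so their value is high, while the streams labeled 2 have neighbors from the other label and score lower.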
S16, generating website sequence fingerprints for the different double-labeled data;
A website sequence fingerprint is essentially a sequence of TCP streams: it is the longest common subsequence of two TCP stream sequences. For example, if the first TCP stream sequence is [3,2,4,5,6] and the second is [3,4,6,8], then their longest common subsequence (i.e., one sequence fingerprint of the website) is [3,4,6], where each integer corresponds to a double label.
To find the sequential pattern of the TCP streams produced by accessing a website, the longest common subsequence (LCS) of the different TCP stream sequences generated by accessing that website must be found; this LCS is called a sequence fingerprint of the website. One visit to a website generates one sequence of TCP streams, and for any two such sequences generated by accessing the same website, their LCS is generated as a sequence fingerprint of the website. The stream sequence generated by the a-th visit to a website w is written $S_a = [f^a_1, f^a_2, \ldots, f^a_{n_a}]$, where $f^a_k$ denotes the k-th stream in the sequence and $n_a$ indicates that the a-th visit produced $n_a$ streams in total; similarly, the stream sequence generated by the b-th visit to the same website w is written $S_b = [f^b_1, f^b_2, \ldots, f^b_{n_b}]$. For the stream sequences $S_a$ and $S_b$ generated by any two visits to website w, the invention generates their longest common subsequence. The subsequence consisting of the first p streams of $S_a$ is written $S_a^{(p)}$ (likewise $S_b^{(q)}$), and $L(n_a, n_b)$ denotes the longest common subsequence of $S_a$ and $S_b$.
The longest common subsequence of any two TCP stream sequences is generated as follows:
$$
L(p,q)=\begin{cases}
\varnothing, & p=0 \text{ or } q=0\\
L(p-1,q-1)\oplus f^{a}_{p}, & \mathrm{bi\_table}(f^{a}_{p})=\mathrm{bi\_table}(f^{b}_{q})\\
\operatorname{longest}\bigl(L(p-1,q),\,L(p,q-1)\bigr), & \text{otherwise}
\end{cases}
$$
where $L(p,q)$ denotes the longest common subsequence of $S_a^{(p)}$ and $S_b^{(q)}$; $L(p-1,q-1)\oplus f^{a}_{p}$ denotes appending the stream $f^{a}_{p}$ after the subsequence $L(p-1,q-1)$; and bi_table(x) denotes taking the double label of stream x.
S17, calculating the importance scores of the sequence fingerprints. A set of sequence fingerprints is generated for each website. These sequence fingerprints may repeat, meaning that a given sequence fingerprint may occur a different number of times across all stream sequences; in addition, the website indicator (WIF) value differs for each stream in a sequence fingerprint. Both facts show that the sequence fingerprints differ in importance. Combining these two pieces of information, a metric Score is therefore defined for each sequence fingerprint F to represent its importance:
[Formula image not available in this text.]
where #OCur(F) denotes the number of occurrences of the sequence fingerprint F.
S18, generating the space-time associated website fingerprint: for each website, the 10 sequence fingerprints with the highest Score are selected to form a set, and this set of sequence fingerprints is used as the space-time associated website fingerprint.
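The ranking and top-10 selection can be sketched as below. The exact Score formula is not available in this text, so the scoring rule used here (occurrence count multiplied by the sum of the WIF values of the labels in the fingerprint) is purely a placeholder assumption chosen to combine the two named ingredients; the function name and toy values are also illustrative.

```python
from collections import Counter

def website_fingerprint(sequence_fingerprints, wif_by_label, top_n=10):
    """Rank distinct sequence fingerprints and keep the top_n as (fingerprint, score) pairs.

    ASSUMED score: occurrences * sum of WIF values of the fingerprint's labels.
    The patent's actual Score formula may differ.
    """
    occurrences = Counter(tuple(f) for f in sequence_fingerprints)
    scored = {
        f: count * sum(wif_by_label[label] for label in f)
        for f, count in occurrences.items()
    }
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Toy inputs (assumed): four pairwise LCS results and a WIF value per integer label.
toy_fingerprints = [[3, 4, 6], [3, 4], [3, 4, 6], [6]]
toy_wif = {3: 0.9, 4: 0.5, 6: 0.2}
top_two = website_fingerprint(toy_fingerprints, toy_wif, top_n=2)
```

Keeping the score next to each fingerprint is convenient for the identification stage, where the scores of matched fingerprints are accumulated.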
The space-time associated website fingerprints are then used to identify new encrypted traffic, that is, to determine which websites were accessed to generate it. FIG. 3 is a flow chart of the method for identifying encrypted traffic based on space-time associated website fingerprints.
The method for identifying encrypted traffic based on space-time associated website fingerprints comprises the following steps:
S2, identifying the encrypted traffic using the website fingerprints;
S21, acquiring the TCP stream sequence of the access traffic: a single visit to a website generates multiple TCP streams; a double label is generated for each TCP stream, and the double labels of all TCP streams are ordered by the generation time of the streams to obtain the sequence fingerprint of the access traffic;
Each TCP stream in the access traffic is input into the trained random forest to obtain the integer corresponding to its double label (here the intermediate per-tree results of the random forest are not used; the final result of the random forest, an integer corresponding to a double label, is used directly), and the integers of all TCP streams are ordered by stream generation time to obtain the TCP stream sequence of the access traffic.
The access traffic contains multiple TCP streams, so it is essentially a sequence of TCP streams, represented by the sequence of integers corresponding to their double labels. For example, suppose the TCP streams contained in the access traffic form the sequence [a, c, d, e, a], where each letter denotes a different TCP stream. Each stream is input into the random forest to obtain the integer corresponding to its double label, and the sequence becomes [1,4,2,3,1], where 1 is assumed to be the integer output for stream a, and similarly for the other letters.
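Step S21 amounts to sorting the streams by start time and replacing each with its predicted integer label, as in the sketch below. The `predict` callable stands in for the trained random forest's final (voted) output; here it is a simple lookup table keyed by the letters of the example, which is an assumption made so the snippet runs on its own.

```python
def access_sequence(flows, predict):
    """flows: list of (start_time, features); returns the labels in time order."""
    ordered = sorted(flows, key=lambda flow: flow[0])
    return [predict(features) for _, features in ordered]

# Stand-in for the random forest: each letter plays the role of a stream's features.
label_of = {"a": 1, "c": 4, "d": 2, "e": 3}

captured = [(0.30, "d"), (0.00, "a"), (0.12, "c"), (0.45, "e"), (0.80, "a")]
sequence = access_sequence(captured, label_of.get)  # [1, 4, 2, 3, 1]
```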
S22, matching the sequence fingerprint of the access traffic against the space-time associated website fingerprint: determining whether the subsequences of the access traffic's TCP stream sequence include any sequence fingerprint of the website fingerprint, and, for each sequence fingerprint that is included, accumulating its importance score; then determining whether the accumulated importance score exceeds a threshold. If the threshold is not exceeded, the access traffic does not contain traffic generated by accessing the website; if the threshold is exceeded, the access traffic contains traffic generated by accessing the website, from which it can be inferred that the user is accessing the website through the encryption proxy.
A sequence fingerprint is essentially a TCP stream sequence. If the subsequences of the TCP stream sequence of the access traffic include a sequence fingerprint of the website fingerprint, the importance score of that sequence fingerprint is accumulated. For example, suppose one sequence fingerprint of a website fingerprint is [3,4] and the TCP stream sequence of the access traffic is [3,2,4]. The subsequences of [3,2,4] are [3], [2], [4], [3,2], [3,4], [2,4] and [3,2,4]; they include [3,4], so the subsequences of the TCP stream sequence of the access traffic include a sequence fingerprint of the website fingerprint.
If a piece of unknown traffic contains a sequence fingerprint from the website fingerprint of website w, the importance score of that sequence fingerprint is accumulated. Several sequence fingerprints of the same website may appear in the unknown traffic, and their importance scores are all accumulated. A threshold can therefore be set: when the accumulated importance score exceeds the threshold, the unknown traffic is considered to contain traffic generated by accessing website w, and the user is considered to be accessing website w through the encryption proxy. The underlying logic is that the more sequence fingerprints are hit, the more likely it is that the website was visited.
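The matching and thresholding of S22 can be sketched as follows. The subsequence test checks that the fingerprint's labels appear in order, not necessarily adjacently, which matches the [3,4] in [3,2,4] example above. The scores and the threshold value are made-up illustration values, and the function names are assumptions.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)

def identify(sequence, website_fingerprint, threshold):
    """Sum the scores of matched sequence fingerprints and compare with the threshold."""
    total = sum(score for pattern, score in website_fingerprint
                if is_subsequence(pattern, sequence))
    return total > threshold, total

# Toy website fingerprint (assumed scores): (sequence fingerprint, importance score).
toy_fingerprint = [((3, 4), 1.4), ((3, 4, 6), 3.2), ((2,), 0.3)]
hit, total = identify([3, 2, 4], toy_fingerprint, threshold=1.5)
```

Here [3,4] and [2] match while [3,4,6] does not, so the accumulated score is about 1.7 and the traffic is attributed to the website under the assumed threshold of 1.5.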
Embodiment 3: the computer device of the invention may be a device comprising a processor and a memory, for example a single-chip microcomputer containing a central processing unit. The processor implements the steps of the encrypted traffic website identification method based on space-time associated website fingerprints when executing the computer program stored in the memory.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created through use of the device (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, flash card, at least one disk storage device, flash memory device, or other volatile solid-state storage device.
Embodiment 4: computer-readable storage medium embodiment
The computer-readable storage medium of the present invention may be any form of storage medium that can be read by a processor of a computer device, including but not limited to non-volatile memory, volatile memory, ferroelectric memory, etc., on which a computer program is stored. When the processor of the computer device reads and executes the computer program stored in the memory, the steps of the encrypted traffic website identification method based on space-time correlation website fingerprints described above may be implemented.
The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the above description, will appreciate that other embodiments are contemplated within the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is defined by the appended claims.

Claims (8)

1. A method for identifying encrypted traffic websites based on space-time correlation website fingerprints, characterized in that the method comprises: simulating a user to access websites one by one, multiple times, through an encrypted proxy channel, obtaining the resulting traffic, and generating space-time associated website fingerprints; and, based on the space-time associated website fingerprints, performing website identification on encrypted traffic in an encrypted channel constructed by an encryption agent;
generating the space-time associated website fingerprints comprises the following steps:
S11, simulating a user to access a target website by using an encryption agent tool, and acquiring the TCP streams;
S12, classifying and marking the TCP streams, generating a double mark for each TCP stream, and mapping each distinct double mark to an integer;
S13, extracting a statistical feature vector of each TCP stream, taking the statistical feature vectors of the TCP streams as sample data and the integers corresponding to their double marks as sample class labels, and training a random forest;
S14, generating a fingerprint for each TCP stream: for each TCP stream, inputting its statistical feature vector into the trained random forest, and constructing an N-dimensional vector from the integers corresponding to the double marks output by all decision trees in the random forest as the stream fingerprint, wherein N is the number of decision trees;
S15, calculating the website indication quantity of each TCP stream;
S16, generating website sequence fingerprints for data with different double marks;
S17, generating the space-time associated website fingerprints: for each website, the 10 sequence fingerprints with the highest Score are selected to form a set, and this set of sequence fingerprints is used as the space-time associated website fingerprint.
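Step S14 can be illustrated with a minimal sketch in which each decision tree is modelled as a plain callable that maps a statistical feature vector to the integer of a double mark. The stub trees, the feature layout, and the function name `stream_fingerprint` are hypothetical; a trained random forest would supply the real per-tree predictions.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for one trained decision tree:
# it maps a feature vector to the integer of the predicted double mark.
Tree = Callable[[Sequence[float]], int]


def stream_fingerprint(features: Sequence[float], trees: List[Tree]) -> List[int]:
    """Collect the output of every tree instead of taking a majority vote.

    The result is an N-dimensional vector, N being the number of trees.
    """
    return [tree(features) for tree in trees]


# Three stub trees over a feature vector (packet count, total bytes).
stub_trees: List[Tree] = [
    lambda x: 0 if x[0] < 10 else 1,
    lambda x: 1 if x[1] > 500 else 2,
    lambda x: 0,
]

fingerprint = stream_fingerprint([12, 300], stub_trees)
```

Keeping every tree's output, rather than only the forest's final class, preserves how strongly the trees agree, which is what makes the vector usable as a fingerprint.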
2. The method for identifying the encrypted traffic website based on the space-time correlation website fingerprint according to claim 1, wherein the method for classifying and marking the TCP stream to generate the double marks of the TCP stream comprises the following steps:
S121, extracting features of the TCP streams, wherein the features comprise: the number of data messages, the total size of the data messages to be sent, and the total size of the data messages;
S122, clustering the features of the TCP streams, wherein the clustering result is used as the second mark of each TCP stream, and identical or similar TCP streams are marked as f_{i,j};
S123, assigning a double mark (i, j) to each TCP stream, wherein i denotes the website accessed by the TCP stream and j is the second mark, namely the class number generated by clustering the TCP streams.
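As a rough sketch of how double marks might be formed and then mapped to integers (step S12 of claim 1), assuming the website index and the cluster number of each stream are already known. The clustering itself is not shown, and the function names are illustrative only.

```python
from typing import Dict, List, Sequence, Tuple

DoubleMark = Tuple[int, int]  # (website index i, cluster number j)


def assign_double_marks(site_ids: Sequence[int],
                        cluster_ids: Sequence[int]) -> List[DoubleMark]:
    """Pair each stream's website index with its cluster number."""
    return list(zip(site_ids, cluster_ids))


def index_double_marks(marks: Sequence[DoubleMark]) -> Dict[DoubleMark, int]:
    """Map each distinct double mark to an integer, in order of first appearance."""
    mapping: Dict[DoubleMark, int] = {}
    for mark in marks:
        # len(mapping) is evaluated before insertion, giving 0, 1, 2, ...
        mapping.setdefault(mark, len(mapping))
    return mapping


marks = assign_double_marks([0, 0, 0, 1], [1, 2, 1, 1])
labels = index_double_marks(marks)
```

The resulting integers can serve as the class labels used to train the random forest.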
3. The method for identifying the website of the encrypted traffic based on the space-time correlation website fingerprint according to claim 2, wherein the method for calculating the website indication quantity of the TCP stream is as follows:
where τ(p, q) and τ(i, j) denote the streams double-marked (p, q) and (i, j), respectively; M denotes the set of monitored websites; |M| denotes the number of all monitored categories, and the +1 term counts the category to be monitored as one category; #flow(p) denotes the number of streams with different double marks in website p; the nearest-neighbor set contains the K nearest neighbors of stream f in the stream set X, where K is the number of neighbors of a stream examined by the invention; |τ(i, j)| denotes the number of all streams double-marked (i, j); and the remaining coefficient is a normalization factor.
4. The method for identifying the website of the encrypted traffic based on the space-time correlation website fingerprints according to claim 3, wherein the method for generating the website sequence fingerprints is as follows: generating the longest common subsequence for sequences of any two TCP flows:
where L(p, q) denotes the longest common subsequence of the two TCP stream sequences up to positions p and q, respectively; the append operation adds the matching stream after the subsequence L(p-1, q-1); and bi_table(x) returns the double mark of stream x.
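The longest common subsequence of two TCP stream sequences can be computed with the standard dynamic-programming recurrence. The sketch below is an assumption-laden illustration: each stream is represented directly by its double mark (so the comparison stands in for comparing bi_table values), and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

DoubleMark = Tuple[int, int]


def longest_common_subsequence(a: Sequence[DoubleMark],
                               b: Sequence[DoubleMark]) -> List[DoubleMark]:
    """Return one longest common subsequence of two double-mark sequences."""
    rows, cols = len(a), len(b)
    # length[p][q] = LCS length of the first p marks of a and first q marks of b.
    length = [[0] * (cols + 1) for _ in range(rows + 1)]
    for p in range(1, rows + 1):
        for q in range(1, cols + 1):
            if a[p - 1] == b[q - 1]:
                # Matching marks extend the subsequence L(p-1, q-1).
                length[p][q] = length[p - 1][q - 1] + 1
            else:
                length[p][q] = max(length[p - 1][q], length[p][q - 1])

    # Walk back through the table to recover the subsequence itself.
    result: List[DoubleMark] = []
    p, q = rows, cols
    while p > 0 and q > 0:
        if a[p - 1] == b[q - 1]:
            result.append(a[p - 1])
            p -= 1
            q -= 1
        elif length[p - 1][q] >= length[p][q - 1]:
            p -= 1
        else:
            q -= 1
    result.reverse()
    return result
```

Applied to the stream sequences of two visits to the same website, the returned subsequence is a candidate sequence fingerprint.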
5. The method for identifying encrypted traffic web sites based on space-time correlation web site fingerprints according to claim 4, wherein the method for calculating the importance scores of the sequence fingerprints is as follows:
where #OCur (F) represents the number of occurrences of the sequence fingerprint F.
6. The method for identifying encrypted traffic web sites based on space-time correlation web site fingerprints as recited in claim 5, wherein the method for identifying encrypted traffic in an encrypted channel constructed by an encryption agent based on space-time correlation web site fingerprints comprises the steps of:
S21, acquiring the TCP stream sequence of the access traffic: a single visit to a website generates multiple TCP streams, and a double mark is generated for each TCP stream; the double marks of all TCP streams are ordered by the generation time of the TCP streams to obtain the sequence fingerprint of the access traffic;
S22, matching the sequence fingerprint of the access traffic against the space-time associated website fingerprint: determining whether the subsequences of the TCP stream sequence of the access traffic contain a sequence fingerprint of the website fingerprint, and if so, accumulating the importance score of that sequence fingerprint; determining whether the accumulated importance score of the sequence fingerprints exceeds a threshold; if the threshold is not exceeded, the access traffic does not contain traffic generated by accessing the website; if the threshold is exceeded, the access traffic contains traffic generated by accessing the website.
7. An electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of a method for encrypted traffic web site identification based on spatio-temporal correlation web site fingerprints as claimed in any one of claims 1 to 6 when the computer program is executed.
8. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements a method for encrypted traffic web site identification based on spatio-temporal correlation web site fingerprints as claimed in any one of claims 1 to 6.
CN202310049743.7A 2023-02-01 2023-02-01 Encryption traffic website identification method based on space-time correlation website fingerprint Active CN116208506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310049743.7A CN116208506B (en) 2023-02-01 2023-02-01 Encryption traffic website identification method based on space-time correlation website fingerprint

Publications (2)

Publication Number Publication Date
CN116208506A CN116208506A (en) 2023-06-02
CN116208506B true CN116208506B (en) 2023-07-21

Family

ID=86508821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310049743.7A Active CN116208506B (en) 2023-02-01 2023-02-01 Encryption traffic website identification method based on space-time correlation website fingerprint

Country Status (1)

Country Link
CN (1) CN116208506B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022011977A1 (en) * 2020-07-15 2022-01-20 中国科学院深圳先进技术研究院 Network anomaly detection method and system, terminal and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190294642A1 (en) * 2017-08-24 2019-09-26 Bombora, Inc. Website fingerprinting
CA3100237A1 (en) * 2019-11-22 2021-05-22 Royal Bank Of Canada System and method for digitally finderprinting phishing actors
CN115580547A (en) * 2022-11-21 2023-01-06 中国科学技术大学 Website fingerprint identification method and system based on time-space correlation between network data streams

Also Published As

Publication number Publication date
CN116208506A (en) 2023-06-02

Similar Documents

Publication Publication Date Title
Cormode et al. Small summaries for big data
CN111612039A (en) Abnormal user identification method and device, storage medium and electronic equipment
TW202207154A (en) Video matching method and infringement evidence storage method and device based on block chain
CN113656547B (en) Text matching method, device, equipment and storage medium
CN109033148A (en) One kind is towards polytypic unbalanced data preprocess method, device and equipment
CN113656699B (en) User feature vector determining method, related equipment and medium
CN115222443A (en) Client group division method, device, equipment and storage medium
CN115062642A (en) Signal radiation source identification method, device, equipment and storage medium
CN111144546A (en) Scoring method and device, electronic equipment and storage medium
CN116208506B (en) Encryption traffic website identification method based on space-time correlation website fingerprint
CN116805039A (en) Feature screening method, device, computer equipment and data disturbance method
CN116611092A (en) Multi-factor-based data desensitization method and device, and tracing method and device
CN116502261A (en) Data desensitization method and device for retaining data characteristics
CN111461191A (en) Method and device for determining image sample set for model training and electronic equipment
CN110851828A (en) Malicious URL monitoring method and device based on multi-dimensional features and electronic equipment
KR100735308B1 (en) Recording medium for recording automatic word spacing program
CN112949305B (en) Negative feedback information acquisition method, device, equipment and storage medium
CN115292008A (en) Transaction processing method, device, equipment and medium for distributed system
CN116776932A (en) E-commerce behavior recognition method and device for user
CN103793448B (en) Article information providing method and system
CN115134095A (en) Botnet control terminal detection method and device, storage medium and electronic equipment
CN113535951B (en) Method, device, terminal equipment and storage medium for information classification
Phulre et al. Approach on Machine Learning Techniques for Anomaly-Based Web Intrusion Detection Systems: Using CICIDS2017 Dataset
CN118133939B (en) Differential privacy federation learning method, system and equipment based on multi-mode data
CN117792708A (en) Method and device for detecting network space asset and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant