CN113141364B - Encrypted traffic classification method, system, equipment and readable storage medium - Google Patents

Info

Publication number
CN113141364B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110438554.XA
Other languages
Chinese (zh)
Other versions
CN113141364A (en)
Inventor
马小博
安冰玉
瞿建
潘鹏宇
李森
王鑫
卞华峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention discloses an encrypted traffic classification method, system, device and readable storage medium. The indices of the leaf nodes at which each decision tree of a flow-based classification model places a flow (stream) are assembled into a k-dimensional vector that also carries the double label of that flow. These vectors are used as input to train a K-nearest-neighbor classification algorithm, from which the L2I values of the flows and of the original encrypted traffic samples are calculated. To classify any given encrypted traffic sample, the meta-feature vectors of all its flows are extracted and input into the flow-based classification model, which predicts their encrypted traffic type labels. The L2I values corresponding to the predicted labels are summed and compared with the recorded L2I value of the corresponding original encrypted traffic sample. Classification is thus based on a flow double-label mechanism: it can classify complete website visits, avoids the problem of traffic crossing during a visit, and is suitable for both web-page-oriented and flow-oriented network behaviors.

Description

Encrypted traffic classification method, system, equipment and readable storage medium
Technical Field
The invention belongs to the field of network security and user privacy, and particularly relates to an encrypted traffic classification method, system, equipment and readable storage medium.
Background
In recent years, with the rapid development of the Internet, networks have become tightly integrated into production and daily life, and network security has become a problem that cannot be ignored. Public awareness of network security is gradually improving, and more and more users and enterprises pay attention to protecting information and transmitting it securely. Network behavior identification based on encrypted traffic can be used for security supervision of the network, in particular the supervision of illegal services and harmful information. Encrypted traffic analysis infers a user's current Internet access behavior from characteristics of the traffic itself rather than from the content of data packets. The most important current application of encrypted traffic analysis is website fingerprinting, a technique that classifies user behavior by extracting features from network traffic and combining them with a supervised classification model, and that can accurately determine which website the user is visiting. For website fingerprinting, the key problem is how to classify websites accurately and how to apply the technique in a real network environment.
Most current encrypted traffic analysis techniques remain at the academic research stage, and their application in real network environments has not been studied. The reason is that existing website fingerprinting techniques train their classification models on a basic recognition unit that is the complete traffic generated by one visit to a website, and this complete traffic cannot be delimited in a real network environment. A real network may sit behind NAT, or in a similar scenario, in which multiple websites are accessed simultaneously and their traffic crosses. Once traffic crossing occurs, the traffic belonging to a visit to one particular website can no longer be accurately separated.
In summary, the basic recognition unit used in encrypted traffic analysis, in China and abroad, is the complete traffic generated by visiting one website. That traffic must be generated and collected in a clean network environment, and the visiting time of each website must be strictly controlled so that the traffic is not cross-contaminated. This approach is suitable for research and study, but because traffic in a real network environment is crossed, the traffic of a complete website visit cannot be separated out. The approach therefore cannot yet be applied in real network environments, and the application of encrypted traffic analysis in such environments has so far not been studied.
Disclosure of Invention
The invention aims to provide a method, a system, equipment and a readable storage medium for classifying encrypted traffic, so as to overcome the defects of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
An encrypted traffic classification method, comprising the following steps:
S1, generating a flow-based training set of encrypted traffic samples;
S2, generating a flow-based classification model for encrypted traffic identification by training a random forest classification model on the flow-based training set;
S3, forming, from the index of the leaf node at which each decision tree in the flow-based classification model reaches its decision, a k-dimensional vector that also carries the double label of the corresponding flow; training a K-nearest-neighbor classification algorithm with the k-dimensional vectors as input; and calculating the L2I values of the flows and of the original encrypted traffic samples;
S4, splitting the encrypted traffic sample to be detected into flows that share the same port according to the port information contained in the data packets, and extracting their meta-feature vectors; inputting the meta-feature vectors of all flows into the flow-based classification model and predicting their encrypted traffic type labels to obtain prediction labels; grouping the prediction labels that share the same first-dimension label and calculating the sum of the L2I values corresponding to the labels in each group; and comparing each sum with the L2I value of the corresponding original encrypted traffic sample. If the ratio of the two is greater than the threshold preset by the user, the encrypted traffic label with the largest ratio is output as the classification result; if the ratio is smaller than the threshold, no classification result is output. This completes the encrypted traffic classification.
Further, a set of user encrypted traffic samples is collected; each encrypted traffic sample in the set is an original traffic file containing data packets and has a unique encrypted traffic type label. Each complete encrypted traffic sample is split into several flow samples according to the port information contained in the data packets; each flow is then labeled according to the file in the encrypted traffic sample set that contains the flow log information. The flows in each encrypted traffic sample are vectorized according to the meta-feature vector of the encrypted traffic sample set. After all flows in the set have been vectorized with the meta-feature vector, the encrypted traffic type label and the flow label of each encrypted traffic sample are retained, yielding the flow-based training set of encrypted traffic samples.
Further, d-dimensional sequence features are extracted for each encrypted traffic sample in the encrypted traffic sample set and denoted $[f_1, f_2, \ldots, f_d]$. Suppose there are p classes of encrypted traffic samples, and the encrypted traffic type of the i-th class is labeled $\mathrm{label}_i$. After an encrypted traffic sample is split into flows by port, each flow is labeled $\mathrm{label}_{i\text{-}j}$ according to the log file, where the range of j depends on the number of flows in the encrypted traffic samples of each class. The encrypted traffic sample training set is denoted T:
$$T = \{(\mathrm{label}_1, \mathrm{label}_{1\text{-}1}): [f_1, f_2, \ldots, f_d],\ (\mathrm{label}_1, \mathrm{label}_{1\text{-}2}): [f_1, f_2, \ldots, f_d],\ \ldots,\ (\mathrm{label}_p, \mathrm{label}_{p\text{-}j}): [f_1, f_2, \ldots, f_d]\}$$
where $\mathrm{label}_p$ is the first-dimension label, an encrypted-traffic-sample-level label corresponding to the network address of each monitored website, and $\mathrm{label}_{p\text{-}j}$ is the second-dimension label, a flow-level label.
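The double-label structure of T can be illustrated with a minimal Python sketch. The record layout, the site names and the feature values below are hypothetical; the patent leaves the concrete meta-features to the user.

```python
from typing import List, Tuple

# Hypothetical flow record: (site label, flow index from the log file, meta-feature vector).
Flow = Tuple[str, str, List[float]]

def build_training_set(flows: List[Flow]):
    """Return parallel lists of double labels and d-dimensional meta-feature vectors."""
    labels, features = [], []
    d = len(flows[0][2])
    for site, flow_index, vec in flows:
        if len(vec) != d:
            raise ValueError("all meta-feature vectors must share the same dimension d")
        # Double label: (label_i, label_{i-j})
        labels.append((site, f"{site}-{flow_index}"))
        features.append(list(vec))
    return labels, features

# Illustrative data only: two flows of site1 and one flow of site2, with d = 3.
flows = [
    ("site1", "1", [0.2, 10.0, 3.0]),
    ("site1", "2", [0.9, 4.0, 1.0]),
    ("site2", "1", [0.1, 7.0, 2.0]),
]
labels, features = build_training_set(flows)
```

Keeping the site label inside the flow label means a flow prediction can always be mapped back to its first-dimension label.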
Further, the flow-based training set obtained in step S1 is used as input to train a random forest classification model composed of k decision trees; the index of the leaf node at which each decision tree reaches its decision is taken to form a k-dimensional vector that also carries the double label of the corresponding flow.
Further, each flow sample in the flow-based training set T is used as input to the flow-based classification model C. The index of the leaf node at which the v-th decision tree in C reaches its decision is recorded as one new feature $F_v$ of that sample; over all k trees this gives a k-dimensional composite feature vector, denoted $[F_1, F_2, \ldots, F_k]$. Generating these k-dimensional new features for every sample in the flow-based training set T yields the fingerprint set, denoted P:
$$P = \{(\mathrm{label}_1, \mathrm{label}_{1\text{-}1}): [F_1, F_2, \ldots, F_k],\ (\mathrm{label}_1, \mathrm{label}_{1\text{-}2}): [F_1, F_2, \ldots, F_k],\ \ldots,\ (\mathrm{label}_p, \mathrm{label}_{p\text{-}j}): [F_1, F_2, \ldots, F_k]\}$$
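How a leaf index becomes a fingerprint coordinate can be sketched with a hand-written toy forest. The tree encoding below (an `int` for a leaf index, a tuple `(feature, threshold, left, right)` for a split) is an assumption made for illustration and is not the patent's data structure. With a trained scikit-learn forest, `RandomForestClassifier.apply` returns the same kind of per-tree leaf indices.

```python
def leaf_index(tree, x):
    """Walk one decision tree and return the index of the leaf that x reaches."""
    node = tree
    while not isinstance(node, int):
        feature, threshold, left, right = node
        node = left if x[feature] <= threshold else right
    return node

def fingerprint(forest, x):
    """k-dimensional fingerprint [F_1, ..., F_k]: one leaf index per tree."""
    return [leaf_index(tree, x) for tree in forest]

# Toy forest with k = 3 hand-written trees (illustrative thresholds).
forest = [
    (0, 0.5, 0, 1),
    (1, 5.0, 0, (0, 0.8, 1, 2)),
    (2, 2.5, (1, 8.0, 0, 1), 2),
]

fp_a = fingerprint(forest, [0.2, 10.0, 3.0])
fp_b = fingerprint(forest, [0.9, 4.0, 1.0])
```

Two flows that land in the same leaves of most trees get similar fingerprints, which is what the later nearest-neighbor step relies on.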
Further, if the number of encrypted traffic samples of each class is n, then K in the K-nearest-neighbor algorithm is n-1. Suppose that, among the K samples nearest to a fingerprint sample labeled $(\mathrm{label}_p, \mathrm{label}_{p\text{-}j})$, the number of samples carrying the same label is $\mathrm{Num}_{p\text{-}j}$. The L2I value of this class of flow is then:
$$\mathrm{L2I}_{p\text{-}j} = \mathrm{Num}_{p\text{-}j} / K$$
The L2I value of the encrypted traffic sample whose first-dimension label is $\mathrm{label}_p$ is the sum of the L2I values of all flows whose first-dimension label is $\mathrm{label}_p$.
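The L2I calculation can be sketched in plain Python as follows. The Hamming distance between fingerprints is an assumption (the patent does not name a distance metric), ties are broken by original order, and the per-label averaging follows the detailed description. All fingerprints and labels are made-up illustrative data.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of trees in which two fingerprints fall into different leaves."""
    return sum(x != y for x, y in zip(a, b))

def compute_l2i(labels, fingerprints, K):
    """Return (flow-level L2I per double label, sample-level L2I per first-dimension label)."""
    ratios = defaultdict(list)
    for i, (label, fp) in enumerate(zip(labels, fingerprints)):
        others = [j for j in range(len(labels)) if j != i]
        others.sort(key=lambda j: hamming(fp, fingerprints[j]))  # stable sort
        same = sum(labels[j] == label for j in others[:K])       # Num_{p-j}
        ratios[label].append(same / K)
    flow_l2i = {label: sum(v) / len(v) for label, v in ratios.items()}
    site_l2i = defaultdict(float)
    for (site, _), value in flow_l2i.items():
        site_l2i[site] += value  # sum over flows sharing the first-dimension label
    return flow_l2i, dict(site_l2i)

# n = 3 visits per site, so K = n - 1 = 2.
labels = [("A", "A-1")] * 3 + [("A", "A-2")] * 3 + [("B", "B-1")] * 3
fingerprints = [
    [0, 1, 2], [0, 1, 2], [0, 1, 3],   # A-1
    [1, 0, 0], [1, 0, 0], [1, 0, 1],   # A-2
    [2, 2, 2], [2, 2, 2], [0, 1, 2],   # B-1 (the last one collides with A-1)
]
flow_l2i, site_l2i = compute_l2i(labels, fingerprints, K=2)
```

In this toy data the flow class A-2 is perfectly separated and gets an L2I of 1.0, while A-1 and B-1 overlap and score lower, so they contribute less evidence during classification.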
Furthermore, any given encrypted traffic sample is split into flows by port, the flows are vectorized with the meta-feature vector, and the vectors are input into the flow-based classification model C to obtain the sample labels of all flows.
An encrypted traffic classification system, comprising:
an input module, configured to split the encrypted traffic sample to be detected into flows that share the same port according to the port information contained in the data packets, extract their meta-feature vectors, input the meta-feature vectors of all flows into the flow-based classification model and predict their encrypted traffic type labels to obtain prediction labels, group the prediction labels that share the same first-dimension label, calculate the sum of the L2I values corresponding to the labels in each group, and pass the sums to the classification comparison module;
a classification comparison module, configured to form, from the index of the leaf node at which each decision tree in the flow-based classification model reaches its decision, a k-dimensional vector that also carries the double label of the corresponding flow, train a K-nearest-neighbor classification algorithm with the k-dimensional vectors as input, and calculate the L2I values of the flows and of the original encrypted traffic samples; and to compare the sum of the L2I values of each group with the L2I value of the corresponding original encrypted traffic sample. If the ratio of the two is greater than the threshold preset by the user, the encrypted traffic label with the largest ratio is output as the classification result; if the ratio is smaller than the threshold, no classification result is output.
A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-mentioned encrypted traffic classification method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned encrypted traffic classification method.
Compared with the prior art, the invention has the following beneficial technical effects:
In the encrypted traffic classification method of the invention, the index of the leaf node at which each decision tree in the flow-based classification model reaches its decision is used to form a k-dimensional vector that also carries the double label of the corresponding flow, and a K-nearest-neighbor classification algorithm is trained on these vectors to calculate the L2I values of the flows and of the original encrypted traffic samples. When any given encrypted traffic sample is classified, it is split into flows that share the same port according to the port information contained in the data packets, and the meta-feature vectors are extracted. The meta-feature vectors of all flows are input into the flow-based classification model, which predicts their encrypted traffic type labels. The prediction labels that share the same first-dimension label are grouped, the L2I values corresponding to the labels in each group are summed, and the sum is compared with the L2I value of the corresponding original encrypted traffic sample. If the ratio of the two is greater than the threshold set by the user, the encrypted traffic label with the largest ratio is output as the classification result; if the ratio is smaller than the threshold, no classification result is output. The method is suitable for web-page-oriented and flow-oriented network behaviors, classifies encrypted traffic quickly, and makes it possible to separate out the traffic of a complete website visit in a real network environment.
The encrypted traffic classification system classifies encrypted traffic quickly, separates out the traffic of a complete website visit in a real network environment, supports network security, and can accurately determine which website the current user is visiting in a real network.
Drawings
FIG. 1 is a flowchart illustrating an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
As shown in FIG. 1, an encrypted traffic classification method includes the following steps:
S1: generating a flow-based training set of encrypted traffic samples.
A set of user encrypted traffic samples is acquired; each encrypted traffic sample in the set is an original traffic file containing data packets and has a unique encrypted traffic type label. Each complete encrypted traffic sample is split into several flow samples according to the port information contained in the data packets; each flow is then labeled according to the file in the encrypted traffic sample set that contains the flow log information. The d-dimensional feature vector of the encrypted traffic sample set is recorded as the meta-feature vector, and the flows in each encrypted traffic sample are vectorized according to it. After all flows in the set have been vectorized with the meta-feature vector, the encrypted traffic type label and the flow label of each encrypted traffic sample are retained and combined into a double label (encrypted traffic label, flow label) for each flow, yielding the flow-based training set of encrypted traffic samples.
Specifically, d-dimensional sequence features are extracted for each encrypted traffic sample in the encrypted traffic sample set; the meta-feature vector thus contains d-dimensional sequence features, denoted $[f_1, f_2, \ldots, f_d]$. Suppose there are p classes of encrypted traffic samples, and the encrypted traffic type of the i-th class is labeled $\mathrm{label}_i$. After an encrypted traffic sample is split into flows by port, each flow is labeled $\mathrm{label}_{i\text{-}j}$ according to the log file, where the range of j depends on the number of flows in the encrypted traffic samples of each class. The encrypted traffic sample training set T is as follows:
$$T = \{(\mathrm{label}_1, \mathrm{label}_{1\text{-}1}): [f_1, f_2, \ldots, f_d],\ (\mathrm{label}_1, \mathrm{label}_{1\text{-}2}): [f_1, f_2, \ldots, f_d],\ \ldots,\ (\mathrm{label}_p, \mathrm{label}_{p\text{-}j}): [f_1, f_2, \ldots, f_d]\}$$
where $\mathrm{label}_p$ is the first-dimension label, an encrypted-traffic-sample-level label corresponding to the network address of each monitored website, and $\mathrm{label}_{p\text{-}j}$ is the second-dimension label, a flow-level label corresponding to the network address to which the flow connects within the website. The resulting T serves as the flow-based training set.
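Splitting a capture into flows by port, as in step S1, might look like the sketch below. The packet tuple layout, the use of the unordered port pair as the flow key, and the three meta-features (packet count, total bytes, mean packet length) are all assumptions chosen for illustration; the patent states only that splitting uses port information and that the meta-feature vector is supplied by the user.

```python
from collections import defaultdict

def split_into_flows(packets):
    """Group packets into flows keyed by the unordered (port, port) pair."""
    flows = defaultdict(list)
    for timestamp, src_port, dst_port, length in packets:
        key = tuple(sorted((src_port, dst_port)))  # both directions share one key
        flows[key].append((timestamp, length))
    return dict(flows)

def meta_features(flow):
    """Hypothetical 3-dimensional meta-feature vector for one flow."""
    sizes = [length for _, length in flow]
    return [len(sizes), sum(sizes), sum(sizes) / len(sizes)]

# Hypothetical packets: (timestamp, source port, destination port, length in bytes).
packets = [
    (0.00, 50001, 443, 517),
    (0.01, 443, 50001, 1400),
    (0.02, 50002, 443, 300),
    (0.03, 443, 50001, 1400),
]
flows = split_into_flows(packets)
vectors = {key: meta_features(flow) for key, flow in flows.items()}
```

Each resulting vector would then be paired with its double label taken from the flow log file to form one entry of T.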
S2: generating a flow-based classification model for encrypted traffic identification by training a random forest classification model on the flow-based training set.
Specifically, the flow-based training set obtained in step S1 is used as input to train a random forest classification model composed of k decision trees, each of which produces an independent decision. The model combines the independent decisions of all decision trees and outputs an integrated decision. In addition, the index of the leaf node at which each decision tree reaches its decision is taken to form a k-dimensional vector, called a fingerprint, which also carries the double label of the corresponding flow.
S3: calculating and recording the flow-oriented label indication index for encrypted traffic identification (Label-Indication Index, hereinafter the L2I value). The fingerprints carrying flow double labels obtained in step S2 are used as input to train a K-nearest-neighbor (KNN) classification algorithm, with K set to the number of encrypted traffic samples of each class minus 1. After the model is trained, for each fingerprint the proportion of its K nearest neighbors that carry the same label as the fingerprint is counted, and the average of these proportions over all fingerprints with the same label is taken as the L2I value of that class of flow. Meanwhile, using the two-dimensional label (encrypted traffic label, flow label) of the flows, the L2I values of the flows belonging to the same class of encrypted traffic are summed to give the L2I value of that class of encrypted traffic sample.
Specifically, each flow sample in the flow-based training set T is used as input to the random forest classification model C. The index of the leaf node at which the v-th decision tree in C reaches its decision is recorded as one new feature $F_v$ of that sample; over all k trees this gives a k-dimensional composite feature vector, denoted $[F_1, F_2, \ldots, F_k]$. Generating these k-dimensional new features for every sample in the initial flow-based training set T yields the set of fingerprints, denoted P:
$$P = \{(\mathrm{label}_1, \mathrm{label}_{1\text{-}1}): [F_1, F_2, \ldots, F_k],\ (\mathrm{label}_1, \mathrm{label}_{1\text{-}2}): [F_1, F_2, \ldots, F_k],\ \ldots,\ (\mathrm{label}_p, \mathrm{label}_{p\text{-}j}): [F_1, F_2, \ldots, F_k]\}$$
K-nearest-neighbor model training is then carried out to calculate the L2I values of the flows and of the original encrypted traffic samples. If the number of encrypted traffic samples of each class is n, then K is n-1. Suppose that, among the K samples nearest to a fingerprint sample labeled $(\mathrm{label}_p, \mathrm{label}_{p\text{-}j})$, the number of samples carrying the same label is $\mathrm{Num}_{p\text{-}j}$. The L2I value of this class of flow is then:
$$\mathrm{L2I}_{p\text{-}j} = \mathrm{Num}_{p\text{-}j} / K$$
The L2I value of the encrypted traffic sample whose first-dimension label is $\mathrm{label}_p$ is the sum of the L2I values of all flows whose first-dimension label is $\mathrm{label}_p$. The calculated L2I values of the flows and of the original encrypted traffic samples are recorded for later use.
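As a worked illustration of the two L2I levels, the block below combines the per-fingerprint ratio with the averaging described in S3. The set notation $P_{p\text{-}j}$ (the fingerprints labeled $(\mathrm{label}_p, \mathrm{label}_{p\text{-}j})$) and all numbers are hypothetical and not taken from the patent.

```latex
% Flow-level L2I: average of the same-label neighbor ratio over the fingerprints of one flow class
\mathrm{L2I}_{p\text{-}j} = \frac{1}{|P_{p\text{-}j}|} \sum_{x \in P_{p\text{-}j}} \frac{\mathrm{Num}(x)}{K},
\qquad K = n - 1

% Sample-level L2I: sum over all flow classes of the same website
\mathrm{L2I}_{p} = \sum_{j} \mathrm{L2I}_{p\text{-}j}

% Hypothetical numbers: n = 5, so K = 4; same-label neighbor counts 4, 4, 3, 4, 2
\mathrm{L2I}_{p\text{-}1} = \frac{1}{5}\left(\frac{4}{4} + \frac{4}{4} + \frac{3}{4} + \frac{4}{4} + \frac{2}{4}\right) = 0.85

% If the site's only other flow class has L2I 0.60:
\mathrm{L2I}_{p} = 0.85 + 0.60 = 1.45
```

A flow class whose fingerprints cluster tightly contributes close to 1 to its website's L2I, while a flow class that is easily confused with others contributes little.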
S4: integrating the flow labels back into the original encrypted traffic label to classify encrypted traffic. Any encrypted traffic sample to be detected is given; after flow splitting and extraction of the meta-feature vectors, the meta-feature vectors of all its flows are input into the flow-based classification model C, which predicts the encrypted traffic type labels of the flows, denoted $(\mathrm{label}_x, \mathrm{label}_{x\text{-}1}), (\mathrm{label}_x, \mathrm{label}_{x\text{-}2}), \ldots, (\mathrm{label}_y, \mathrm{label}_{y\text{-}j})$.
Specifically, any encrypted traffic sample is given, the whole sample is split into flows that share the same port according to the port information contained in the data packets, and each flow is input into the flow-based classification model obtained in step S2, whose decision is a double label. Next, with reference to the flow L2I values recorded in step S3, the sum of the L2I values of all flows in the result that share the same first-dimension label (that is, the encrypted traffic sample label) is calculated and compared with the L2I value of that class of encrypted traffic obtained in step S3. If the ratio of the two is greater than the threshold t preset by the user, the encrypted traffic label with the largest ratio is output as the classification result; if the ratio is not greater than t, no classification result is output.
As in S1, suppose there are p classes of encrypted traffic samples, and the encrypted traffic type of the i-th class is labeled $\mathrm{label}_i$. After an encrypted traffic sample is split into flows by port, each flow is labeled $\mathrm{label}_{i\text{-}j}$ according to the log file, where the range of j depends on the number of flows in the encrypted traffic samples of each class. The meta-feature vector contains d-dimensional features, denoted $[f_1, f_2, \ldots, f_d]$. The first-stage training set of encrypted traffic samples is denoted T and expressed as follows:
$$T = \{(\mathrm{label}_1, \mathrm{label}_{1\text{-}1}): [f_1, f_2, \ldots, f_d],\ (\mathrm{label}_1, \mathrm{label}_{1\text{-}2}): [f_1, f_2, \ldots, f_d],\ \ldots,\ (\mathrm{label}_p, \mathrm{label}_{p\text{-}j}): [f_1, f_2, \ldots, f_d]\}$$
where $\mathrm{label}_p$ is the first-dimension label, an encrypted-traffic-sample-level label, and $\mathrm{label}_{p\text{-}j}$ is the second-dimension label, a flow-level label.
The fingerprint vector contains k-dimensional features, denoted $[F_1, F_2, \ldots, F_k]$. With the same labeling, the j-th flow of the i-th class of encrypted traffic sample is labeled $\mathrm{label}_{i\text{-}j}$, and the set of fingerprints generated from the leaf indices of the decision trees is denoted P and expressed as follows:
$$P = \{(\mathrm{label}_1, \mathrm{label}_{1\text{-}1}): [F_1, F_2, \ldots, F_k],\ (\mathrm{label}_1, \mathrm{label}_{1\text{-}2}): [F_1, F_2, \ldots, F_k],\ \ldots,\ (\mathrm{label}_p, \mathrm{label}_{p\text{-}j}): [F_1, F_2, \ldots, F_k]\}$$
Specifically, any encrypted traffic sample is given, split into flows by port, vectorized with the meta-feature vector, and input into the flow-based classification model C to obtain the sample labels of all flows, denoted for example $(\mathrm{label}_x, \mathrm{label}_{x\text{-}m})$ and $(\mathrm{label}_y, \mathrm{label}_{y\text{-}s})$. The sums $\mathrm{sum}_x$ and $\mathrm{sum}_y$ of the L2I values of all flows sharing the same first-dimension label are then calculated, followed by their ratios $e_x = \mathrm{sum}_x / \mathrm{L2I}_x$ and $e_y = \mathrm{sum}_y / \mathrm{L2I}_y$ to the L2I values $\mathrm{L2I}_x$ and $\mathrm{L2I}_y$ of the encrypted traffic samples of the corresponding classes obtained in step S3. If the ratios are all greater than the threshold t set by the user, the class with the larger ratio is output; if they are all smaller than t, nothing is output and the sample is marked as invalid. These steps are repeated until all encrypted traffic samples have been classified on the basis of the flow double labels.
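The integration step of S4 can be sketched as below. The function assumes that flow labels have already been predicted and that the L2I tables from S3 are available as dictionaries. It reads the ratio as the summed flow L2I divided by the website's recorded L2I, and outputs the best website only if its ratio exceeds t. Both readings are interpretations of the text, and all names and numbers are hypothetical.

```python
from collections import defaultdict

def classify_sample(predicted_labels, flow_l2i, site_l2i, t):
    """Return the first-dimension label with the largest ratio above t, else None."""
    sums = defaultdict(float)
    for site, flow in predicted_labels:
        # Group by first-dimension label and add up the flow-level L2I values.
        sums[site] += flow_l2i.get((site, flow), 0.0)
    ratios = {
        site: total / site_l2i[site]
        for site, total in sums.items()
        if site_l2i.get(site, 0.0) > 0.0
    }
    if not ratios:
        return None
    best = max(ratios, key=ratios.get)
    return best if ratios[best] > t else None  # None marks an invalid sample

# Hypothetical L2I tables as they might be recorded in S3.
flow_l2i = {("A", "A-1"): 0.5, ("A", "A-2"): 1.0, ("B", "B-1"): 0.25, ("B", "B-2"): 0.75}
site_l2i = {"A": 1.5, "B": 1.0}

# Flows of a crossed capture: a full visit to A plus one stray flow resembling B-1.
predicted = [("A", "A-1"), ("A", "A-2"), ("B", "B-1")]
result = classify_sample(predicted, flow_l2i, site_l2i, t=0.7)
```

Here website A reaches a ratio of 1.0 and is output, while the stray B-1 flow alone yields a ratio of 0.25 for B and would be rejected at t = 0.7.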
In one embodiment of the present invention, a terminal device is provided that includes a processor and a memory; the memory stores a computer program comprising program instructions, and the processor executes the program instructions stored in the memory. The processor may be a Central Processing Unit (CPU) or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. It is the computing core and control core of the terminal and is adapted to load and execute one or more instructions to implement a corresponding method flow or function. The processor described in this embodiment may be used to carry out the encrypted traffic classification method.
An encrypted traffic classification system can be used to implement the encrypted traffic classification method of the above embodiment, and specifically comprises an input module and a classification comparison module.
The input module is configured to split the encrypted traffic sample to be detected into flows that share the same port according to the port information contained in the data packets, extract their meta-feature vectors, input the meta-feature vectors of all flows into the flow-based classification model and predict their encrypted traffic type labels to obtain prediction labels, group the prediction labels that share the same first-dimension label, calculate the sum of the L2I values corresponding to the labels in each group, and pass the sums to the classification comparison module.
The classification comparison module is configured to form, from the index of the leaf node at which each decision tree in the flow-based classification model reaches its decision, a k-dimensional vector that also carries the double label of the corresponding flow, train a K-nearest-neighbor classification algorithm with the k-dimensional vectors as input, and calculate the L2I values of the flows and of the original encrypted traffic samples; and to compare the sum of the L2I values of each group with the L2I value of the corresponding original encrypted traffic sample. If the ratio of the two is greater than the threshold preset by the user, the encrypted traffic label with the largest ratio is output as the classification result; if the ratio is smaller than the threshold, no classification result is output.
In still another embodiment, the present invention further provides a storage medium, specifically a computer-readable storage medium (memory). The computer-readable storage medium is a memory device in a terminal device and is used for storing programs and data. It includes the built-in storage medium of the terminal device, which provides storage space and stores the operating system of the terminal, and may also include an extended storage medium supported by the terminal device. One or more instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The one or more instructions stored in the computer-readable storage medium may be loaded and executed by a processor to perform the corresponding steps of the encrypted traffic classification method in the above embodiments.
The encrypted traffic sample set and the meta-feature vector are provided by the user: the user provides the original data file of each encrypted traffic sample together with its encrypted traffic type label. The number k of decision trees in the random forest algorithm and the threshold t required by the integration stage are also set by the user. Encrypted traffic classification based on the flow dual-label mechanism can classify a complete website visit and prevents flows from being mixed up with one another during the visit; the method is suitable for both web-oriented and flow-oriented network behaviors and can be used with different kinds of encrypted traffic, including the HTTPS protocol, the Tor network and the ShadowSocks network.

Claims (10)

1. An encrypted traffic classification method, characterized by comprising the following steps:
S1, collecting a user encrypted traffic sample set, and splitting each complete encrypted traffic sample into a plurality of flow samples according to the port information contained in the data packets; then labeling each flow according to the file containing flow log information in the encrypted traffic sample set; recording the d-dimensional feature vector of the encrypted traffic sample set as the meta-feature vector; representing the flows in each encrypted traffic sample as vectors according to the meta-feature vector; after all flows in the encrypted traffic sample set have been vectorized with the meta-feature vector, retaining the encrypted traffic type label and the flow label of each encrypted traffic sample, combining them into a dual label for each flow, and obtaining a flow-based training set of the encrypted traffic samples;
S2, generating a flow-based classification model for encrypted traffic identification by training a random forest classification model on the flow-based training set;
S3, forming, from the leaf-node index of the decision result of each decision tree in the flow-based classification model, a k-dimensional vector that carries both the encrypted traffic type label of each encrypted traffic sample and the flow label, and training a K-nearest neighbor classification algorithm with the k-dimensional vectors as input to obtain the L2I values of the original encrypted traffic samples and of the flows;
training the K-nearest neighbor (KNN) classification algorithm with the obtained fingerprints carrying the dual labels of the flows as input; taking the average of the proportion values obtained for the fingerprints of the same label type as the L2I value of that flow;
S4, splitting the encrypted traffic sample to be tested into flows that share the same port according to the port information contained in the data packets and extracting meta-feature vectors; inputting the extracted meta-feature vectors of all flows into the flow-based classification model to predict the encrypted traffic type labels of the flows, obtaining prediction labels; grouping the prediction labels that share the same first-dimension label, calculating the sum of the L2I values corresponding to the labels in each group, and comparing this sum with the L2I value of the corresponding original encrypted traffic sample; if the ratio of the two is greater than the threshold preset by the user, outputting the encrypted traffic label with the largest ratio as the classification result; if the ratio is smaller than the threshold, outputting no classification result; encrypted traffic classification is then complete;
specifically, given any encrypted traffic sample to be tested that has undergone flow splitting and meta-feature vector extraction, the meta-feature vectors of all flows extracted from the sample are input into the flow-based classification model C, and the encrypted traffic type label of the sample is predicted.
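The integration stage of step S4 can be sketched as follows. This is one possible reading of the claim rather than a definitive implementation: the function name `integrate`, the label strings and the L2I numbers are hypothetical, the ratio is assumed to be the group sum divided by the L2I value of the sample with the same first-dimension label, and each distinct predicted dual label is assumed to be counted once per group.

```python
from collections import defaultdict


def integrate(predicted, flow_l2i, site_l2i, t):
    """Return the first-dimension label with the largest ratio, or None.

    predicted: list of (first_label, flow_label) predictions, one per flow.
    flow_l2i:  L2I value of each dual label, learned in step S3.
    site_l2i:  L2I value of each first-dimension label (sum over its flows).
    t:         user-defined threshold.
    """
    sums = defaultdict(float)
    # Assumption: a dual label predicted for several flows is counted once.
    for site, flow in sorted(set(predicted)):
        sums[site] += flow_l2i.get((site, flow), 0.0)
    ratios = {s: v / site_l2i[s] for s, v in sums.items() if site_l2i.get(s)}
    if not ratios:
        return None
    best = max(ratios, key=ratios.get)
    return best if ratios[best] > t else None


# Made-up L2I values for two monitored sites with two flow types each.
flow_l2i = {
    ("siteA", "siteA-1"): 0.9, ("siteA", "siteA-2"): 0.6,
    ("siteB", "siteB-1"): 0.8, ("siteB", "siteB-2"): 0.8,
}
site_l2i = {"siteA": 1.5, "siteB": 1.6}

predicted = [("siteA", "siteA-1"), ("siteA", "siteA-2"), ("siteB", "siteB-1")]
result = integrate(predicted, flow_l2i, site_l2i, t=0.7)
rejected = integrate([("siteB", "siteB-1")], flow_l2i, site_l2i, t=0.7)
```

In this toy case all flow types of siteA are observed, so its ratio is close to 1 and the sample is labeled siteA; the second call sees only half of siteB's L2I mass, falls below the threshold, and yields no result.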
2. The method according to claim 1, wherein a user encrypted traffic sample set is collected, each encrypted traffic sample in the set being an original traffic file that contains data packets and has a unique encrypted traffic type label; each complete encrypted traffic sample is split into a plurality of flow samples according to the port information contained in the data packets; each flow is then labeled according to the file containing flow log information in the encrypted traffic sample set; the flows in each encrypted traffic sample are represented as vectors according to the meta-feature vector of the encrypted traffic sample set; and after all flows in the encrypted traffic sample set have been vectorized with the meta-feature vector, the encrypted traffic type label and the flow label of each encrypted traffic sample are retained to obtain a flow-based training set of the encrypted traffic samples.
3. The encrypted traffic classification method according to claim 2, characterized in that d-dimensional sequence features are extracted from each encrypted traffic sample in the encrypted traffic sample set and are recorded as $[f_1, f_2, \dots, f_d]$; suppose there are p classes of encrypted traffic samples in total, and the encrypted traffic type label of the i-th class of encrypted traffic samples is $\mathrm{label}_i$; after an encrypted traffic sample is split into flows by port, the flows are labeled $\mathrm{label}_{i\text{-}j}$ according to the log file, where the value of j is determined by the number of flows in the encrypted traffic samples of each class; the training set of encrypted traffic samples is denoted T:
$$T = \{(\mathrm{label}_1, \mathrm{label}_{1\text{-}1}): [f_1, f_2, \dots, f_d],\ (\mathrm{label}_1, \mathrm{label}_{1\text{-}2}): [f_1, f_2, \dots, f_d],\ \dots,\ (\mathrm{label}_p, \mathrm{label}_{p\text{-}j}): [f_1, f_2, \dots, f_d]\}$$
where $\mathrm{label}_p$ is the first-dimension label, a label at the encrypted traffic sample level that corresponds to the network address of each monitored website; $\mathrm{label}_{p\text{-}j}$ is the second-dimension label, a label at the flow level.
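As an illustration of the dual-label training set T, the sketch below stores each flow as a pair of a dual label and a d-dimensional feature vector. The helper name `build_training_set`, the site names and the feature values are hypothetical and only show the shape of the data.

```python
def build_training_set(samples):
    """Flatten labeled samples into T = [((site_label, flow_label), features), ...].

    samples: list of (site_label, {flow_label: feature_vector}) pairs, where
    the flow labels come from the flow log file of each traffic sample.
    """
    T = []
    for site_label, flows in samples:
        for flow_label, features in flows.items():
            T.append(((site_label, flow_label), list(features)))
    return T


# Two toy traffic samples with d = 3 features per flow (values are made up).
samples = [
    ("label1", {"label1-1": [4, 520, 0.8], "label1-2": [9, 8100, 1.3]}),
    ("label2", {"label2-1": [2, 140, 0.1]}),
]
T = build_training_set(samples)
```

Keeping both labels in one key lets the later stages group flows by the first-dimension label without losing the flow-level label.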
4. The encrypted traffic classification method according to claim 1, characterized in that a random forest classification model composed of k decision trees is trained with the flow-based training set obtained in step S1 as input; and the leaf-node index of the decision result of each decision tree is taken to form a k-dimensional vector that carries the dual label of the corresponding flow.
5. The encrypted traffic classification method according to claim 4, characterized in that a flow sample in the flow-based training set T is used as the input of the flow-based classification model C, and the index value of the leaf node in which the decision result of the v-th decision tree of the flow-based classification model C falls is recorded, generating a new one-dimensional feature $F_v$ of the encrypted traffic sample; the k features together form a k-dimensional composite feature vector, denoted $[F_1, F_2, \dots, F_k]$; finally, k-dimensional new features are generated for every encrypted traffic sample in the flow-based training set T to obtain a fingerprint set, denoted P:
$$P = \{(\mathrm{label}_1, \mathrm{label}_{1\text{-}1}): [F_1, F_2, \dots, F_k],\ (\mathrm{label}_1, \mathrm{label}_{1\text{-}2}): [F_1, F_2, \dots, F_k],\ \dots,\ (\mathrm{label}_p, \mathrm{label}_{p\text{-}j}): [F_1, F_2, \dots, F_k]\}.$$
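The fingerprint construction can be illustrated with a toy forest. The sketch below is an assumption-laden stand-in: the two trees are written by hand as nested tuples instead of being trained by a random forest algorithm, and the names `leaf_index` and `fingerprint` are hypothetical. It only shows how one leaf index per tree becomes a k-dimensional vector.

```python
def leaf_index(tree, x):
    """Walk one decision tree and return the index of the leaf that x reaches.

    A tree is either ("leaf", index) or
    ("split", feature_position, threshold, left_subtree, right_subtree).
    """
    while tree[0] == "split":
        _, feat, thr, left, right = tree
        tree = left if x[feat] <= thr else right
    return tree[1]


def fingerprint(forest, x):
    """k-dimensional fingerprint [F_1, ..., F_k]: one leaf index per tree."""
    return [leaf_index(tree, x) for tree in forest]


# Hand-built forest with k = 2 trees (thresholds are made up).
forest = [
    ("split", 0, 5.0, ("leaf", 0), ("leaf", 1)),
    ("split", 1, 100.0, ("leaf", 0),
     ("split", 0, 8.0, ("leaf", 1), ("leaf", 2))),
]

fp_a = fingerprint(forest, [3.0, 250.0])
fp_b = fingerprint(forest, [9.0, 50.0])
```

With a trained model, the same per-tree leaf indices can be read from the library in use; for example, scikit-learn's `RandomForestClassifier.apply` returns them for each input sample.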
6. The method according to claim 5, wherein if the number of encrypted traffic samples in each class is n, then K in the K-nearest neighbor algorithm is n-1; supposing that, among the K samples surrounding a fingerprint sample labeled $(\mathrm{label}_p, \mathrm{label}_{p\text{-}j})$, the number of samples carrying the same label is $Num_{p\text{-}j}$, the L2I value of this type of flow is:
$$L2I_{p\text{-}j} = Num_{p\text{-}j} / K;$$
the L2I value of the encrypted traffic sample whose first-dimension label is $\mathrm{label}_p$ is the sum of the L2I values of all flows whose first-dimension label is $\mathrm{label}_p$;
the set of fingerprints generated from the leaf indices of the decision trees is denoted P and is represented as follows:
$$P = \{(\mathrm{label}_1, \mathrm{label}_{1\text{-}1}): [F_1, F_2, \dots, F_k],\ (\mathrm{label}_1, \mathrm{label}_{1\text{-}2}): [F_1, F_2, \dots, F_k],\ \dots,\ (\mathrm{label}_p, \mathrm{label}_{p\text{-}j}): [F_1, F_2, \dots, F_k]\}.$$
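A minimal sketch of the L2I computation follows. Several details are assumptions, because the claims do not fix them: fingerprints are compared with the Hamming distance over leaf indices, ties are broken by position in P, each fingerprint is excluded from its own neighborhood, and the per-fingerprint proportions $Num_{p\text{-}j}/K$ are averaged over fingerprints with the same dual label, as described in step S3. The names `hamming` and `l2i_values` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions in which two fingerprints differ (assumed metric)."""
    return sum(u != v for u, v in zip(a, b))


def l2i_values(P, K):
    """Compute flow-level and sample-level L2I values.

    P: list of ((first_label, flow_label), fingerprint) pairs.
    Returns (flow_l2i, site_l2i).
    """
    ratios = defaultdict(list)
    for i, (label, fp) in enumerate(P):
        others = [(hamming(fp, q), lab) for j, (lab, q) in enumerate(P) if j != i]
        others.sort(key=lambda item: item[0])  # stable sort: ties keep P order
        same = sum(1 for _, lab in others[:K] if lab == label)
        ratios[label].append(same / K)         # Num / K for this fingerprint
    flow_l2i = {lab: sum(r) / len(r) for lab, r in ratios.items()}
    site_l2i = defaultdict(float)
    for (site, _), value in flow_l2i.items():
        site_l2i[site] += value                # sum over flows of the same site
    return flow_l2i, dict(site_l2i)


# n = 3 fingerprints per flow type, so K = n - 1 = 2 (fingerprints are made up).
P = [
    (("siteA", "siteA-1"), [0, 0, 0]),
    (("siteA", "siteA-1"), [0, 0, 1]),
    (("siteA", "siteA-1"), [0, 1, 0]),
    (("siteA", "siteA-2"), [5, 5, 5]),
    (("siteA", "siteA-2"), [5, 5, 6]),
    (("siteA", "siteA-2"), [5, 6, 5]),
]
flow_l2i, site_l2i = l2i_values(P, K=2)
```

In this toy set the two flow types are well separated, so every neighborhood is pure and each flow type reaches the maximum L2I value; flow types whose fingerprints overlap with others would score lower and contribute less to the sample-level sum.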
7. The method according to claim 5, wherein any given encrypted traffic sample is split into flows by port, vectorized with the meta-feature vector, and then input into the flow-based classification model C to obtain the type labels of all flows.
8. An encrypted traffic classification system for implementing the encrypted traffic classification method according to claim 1, comprising:
the input module, used for splitting the encrypted traffic sample to be tested into flows that share the same port, according to the port information contained in the data packets, and extracting meta-feature vectors; inputting the extracted meta-feature vectors of all flows into the flow-based classification model to predict the encrypted traffic type label of each flow, obtaining prediction labels; grouping the prediction labels that share the same first-dimension label; calculating the sum of the L2I values corresponding to the labels in each group; and passing the sums to the classification comparison module;
the classification comparison module, used for forming, from the leaf-node index of the decision result of each decision tree in the flow-based classification model, a k-dimensional vector that carries both the encrypted traffic type label of each encrypted traffic sample and the flow label; training a K-nearest neighbor classification algorithm with the k-dimensional vectors as input to obtain the L2I values of the original encrypted traffic samples and of the flows; and comparing the sum of the L2I values of the labels in each group with the L2I value of the corresponding original encrypted traffic sample; if the ratio of the two is greater than the threshold preset by the user, the encrypted traffic label with the largest ratio is output as the classification result, and if the ratio is smaller than the threshold, no classification result is output.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110438554.XA 2021-04-22 2021-04-22 Encrypted traffic classification method, system, equipment and readable storage medium Active CN113141364B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110438554.XA CN113141364B (en) 2021-04-22 2021-04-22 Encrypted traffic classification method, system, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110438554.XA CN113141364B (en) 2021-04-22 2021-04-22 Encrypted traffic classification method, system, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113141364A CN113141364A (en) 2021-07-20
CN113141364B true CN113141364B (en) 2022-07-12

Family

ID=76813625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110438554.XA Active CN113141364B (en) 2021-04-22 2021-04-22 Encrypted traffic classification method, system, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113141364B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209959A (en) * 2020-01-05 2020-05-29 西安电子科技大学 Encrypted webpage flow division point identification method based on data packet time sequence

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110138849A (en) * 2019-05-05 2019-08-16 哈尔滨英赛克信息技术有限公司 Agreement encryption algorithm type recognition methods based on random forest
CN110414594B (en) * 2019-07-24 2021-09-07 西安交通大学 Encrypted flow classification method based on double-stage judgment
CN111030941A (en) * 2019-10-29 2020-04-17 武汉瑞盈通网络技术有限公司 Decision tree-based HTTPS encrypted flow classification method
CN112163594B (en) * 2020-08-28 2022-07-26 南京邮电大学 Network encryption traffic identification method and device

Also Published As

Publication number Publication date
CN113141364A (en) 2021-07-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant