CN114091087A - Encrypted flow identification method based on artificial intelligence algorithm - Google Patents

Publication number
CN114091087A
Authority
CN
China
Prior art keywords
encrypted
flow
training
connection
packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210047506.2A
Other languages
Chinese (zh)
Other versions
CN114091087B (en
Inventor
肖梅
陈柯杉
姚胜利
齐凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haohan Data Technology Co ltd
Original Assignee
Haohan Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haohan Data Technology Co ltd filed Critical Haohan Data Technology Co ltd
Priority to CN202210047506.2A priority Critical patent/CN114091087B/en
Publication of CN114091087A publication Critical patent/CN114091087A/en
Application granted granted Critical
Publication of CN114091087B publication Critical patent/CN114091087B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an encrypted traffic identification method based on an artificial intelligence algorithm, comprising the following steps: S1, preparing a data set and dividing it into a training set and a verification set, used respectively for training the model and verifying the training result, where the data set contains a large number of service messages and each message records a quintuple, an application name, and a traffic type; S2, calculating the associated HTTP connections of all encrypted connections in the training set and forming an associated HTTP connection set L; S3, based on the associated HTTP connection set L obtained in step S2, training a single-packet structure model and a flow model of each encrypted connection by using a machine learning classification algorithm; and S4, verifying the training result by identifying the traffic in the verification set with the trained model. The invention combines multi-flow association identification, single-packet identification, and single-flow multi-packet identification, so that the method can be applied to any encrypted traffic and the correct identification rate of encrypted traffic is effectively improved.

Description

Encrypted flow identification method based on artificial intelligence algorithm
Technical Field
The invention relates to the technical field of encrypted flow identification, in particular to an encrypted flow identification method based on an artificial intelligence algorithm.
Background
With the development of Internet technology, more and more applications and scenarios use encrypted traffic, while plaintext HTTP, which can be identified from plaintext features, is used in fewer and fewer scenarios. By rough statistics, about 70% of network traffic is encrypted. Encrypted traffic may be based on general-purpose encryption protocols such as HTTPS, QUIC, and DTLS, or on private encryption protocols; HTTPS accounts for most of it. At present there are two main identification methods for HTTPS. One is identification based on the certificate chain, but HTTPS session reuse is widespread, and when a session is reused the certificate chain method fails. The other is identification based on flow statistical features, but these features are strongly affected by network quality and may not differ clearly between applications, so the false identification rate is high. Other encrypted traffic is likewise identified in two ways. One is identification based on single-packet features, but the specific meaning of such a feature cannot be known, the extracted information changes frequently, and false identification may occur. The other is classification by a machine learning algorithm based on flow statistical features, which has the same problems as in HTTPS identification. Existing identification methods based on flow statistical features cannot identify traffic in real time: a connection can be identified only after it has ended or after its first N messages have passed through the device. In addition, these methods can identify traffic only at the granularity of the application, which cannot meet the demand for increasingly fine-grained identification.
Disclosure of Invention
The invention aims to provide an encrypted traffic identification method based on an artificial intelligence algorithm which, by combining multi-flow association identification, single-packet identification, and single-flow multi-packet identification, is applicable to any encrypted traffic and effectively improves the correct identification rate of encrypted traffic.
To achieve this purpose, the invention provides the following technical solution: an encrypted traffic identification method based on an artificial intelligence algorithm, comprising the following steps:
S1, preparing a data set and dividing it into a training set and a verification set, used respectively for training the model and verifying the training result;
S2, calculating the associated HTTP connections of all encrypted connections in the training set and forming an associated HTTP connection set L;
S3, based on the associated HTTP connection set L obtained in step S2, training a single-packet structure model and a flow model of each encrypted connection by using a machine learning classification algorithm;
S4, verifying the training result by identifying the traffic in the verification set with the trained model, and if the correct identification rate is less than or equal to the minimum identification rate requirement value P, adjusting the corresponding training parameters and then repeating step S3.
Preferably, step S1 further includes extracting the start time T and quintuple information of each encrypted connection, screening all HTTP connections generated by the same source IP within the time window [T - Δt, T], and recording their related information.
Preferably, step S2 further includes an associated HTTP calculation policy configured to calculate a Jaccard coefficient for each HTTP connection, where A represents the number of occurrences of service X-traffic type Y in the training set, B represents the number of occurrences of the HTTP connection across all service-traffic types in the training set, and $|A \cap B|$ represents the number of occurrences of the HTTP connection in service X-traffic type Y. The formula for the Jaccard coefficient is as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
The HTTP connection with the largest Jaccard coefficient value is selected as the associated HTTP connection of the encrypted connection.
Preferably, the step S3 further includes generating a training result R according to the single packet structure model and the stream model, where the training result R includes a service name, a traffic type, an associated HTTP, a single packet structure feature, and a stream statistical feature.
Preferably, the single packet structure characteristics include, but are not limited to, packet length and fixed fingerprint value, and the stream statistical characteristics include, but are not limited to, packet time interval distribution, packet length distribution, and distribution of packet transmission rate.
Preferably, in step S4, the adjusting the corresponding training parameters includes adjusting parameters of a single-packet structure model, or adjusting flow statistical characteristics, or changing a machine learning classification algorithm, or adjusting Δ t time, or optimizing a training set; the step S4 includes adjusting the corresponding training parameters and then proceeding to step S3 again until the correct recognition rate is greater than the minimum recognition rate requirement value P.
Preferably, step S1 includes randomly extracting 80% of the traffic in the messages of each application as the training set and leaving the remaining 20% as the verification set, where each message records quintuple information, an application name, and a traffic type.
Preferably, step S4 includes, if the correct identification rate is greater than the minimum identification rate requirement value P, entering an application step, where the application step includes:
S5, screening an HTTP connection dL in the new traffic and recording its start time T1;
S6, identifying the HTTP connection dL as service A based on payload features, and searching for the HTTP connection dL in the associated HTTP connection set L; if it is not found, analyzing the next HTTP connection dL, and if it is found, executing S7;
S7, monitoring the real-time traffic for an encrypted connection eL from the same source IP within the time window [T1, T1 + Δt]; if none occurs, analyzing the next HTTP connection dL, and if one occurs, executing S8 on the first packet of the encrypted connection eL;
S8, judging whether the current packet matches the single-packet structure features and flow statistical features of traffic type X in the training result R; if not, marking the encrypted connection eL as unidentified traffic, and if so, marking the encrypted connection eL as service A-traffic type X with the state suspicious, and executing S9;
S9, judging the number of packets sent on the encrypted connection eL: if it is less than N, executing S8 on the next packet; if it is equal to N, taking the identification result of the Nth packet as the identification result of eL and setting the connection identification state to determined.
Preferably, step S7 includes, according to the features of the associated HTTP connection dL, analyzing the single-packet structure features and flow statistical features of the first packet of the encrypted connection eL, and judging whether the first packet matches the single-packet structure features and flow statistical features of traffic type X in the training result R.
Preferably, the machine learning classification algorithm is a Bayesian classifier, a decision tree, or an SVM.
Compared with the prior art, the invention has the following beneficial effects:
The encrypted traffic identification method of the invention combines multi-flow association identification, single-packet identification, and single-flow multi-packet identification, so that it can be applied to any encrypted traffic and effectively improves the correct identification rate of encrypted traffic. It also supports real-time identification: encrypted traffic can be identified starting from the first packet, and the identification result can be as fine-grained as the traffic type of a service.
Drawings
FIG. 1 is a flow chart of an encrypted flow identification method based on an artificial intelligence algorithm according to the present invention;
FIG. 2 is a flow chart of the application steps in an encrypted flow identification method based on an artificial intelligence algorithm according to the present invention;
FIG. 3 is a complete flow chart of the encrypted flow identification method based on an artificial intelligence algorithm according to the present invention, covering the preparation, training, and application stages.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides an encrypted traffic identification method based on an artificial intelligence algorithm, where the encrypted traffic identification method includes the following steps:
S1, preparing a data set and dividing it into a training set and a verification set, used respectively for training the model and verifying the training result; the data set contains a large number of service messages, and each message records a quintuple, an application name, and a traffic type;
S2, calculating the associated HTTP connections of all encrypted connections in the training set and forming an associated HTTP connection set L;
S3, based on the associated HTTP connection set L obtained in step S2, training a single-packet structure model and a flow model of each encrypted connection by using a machine learning classification algorithm;
S4, verifying the training result by identifying the traffic in the verification set with the trained model, and if the correct identification rate is less than or equal to the minimum identification rate requirement value P, adjusting the corresponding training parameters and then repeating step S3.
In this embodiment, the encrypted traffic identification method combines multi-flow association identification, single-packet identification, and single-flow multi-packet identification, so that it is applicable to any encrypted traffic and effectively improves the correct identification rate of encrypted traffic. It also supports real-time identification: encrypted traffic can be identified starting from the first packet, and the identification result can be as fine-grained as the traffic type of a service, such as browsing, video streams, download streams, game battle streams, and voice and video call streams. The encrypted traffic may be based on general-purpose encryption protocols such as HTTPS, QUIC, DTLS, and RTMFP, or may be privately encrypted traffic, such as that of arcade video, kuku video, League of Legends, and Xunlei.
Preferably, step S1 further includes extracting the start time T and quintuple information of each encrypted connection, screening all HTTP connections generated by the same source IP within the time window [T - Δt, T], and recording their related information. The quintuple consists of source IP, source port, destination IP, destination port, and protocol. The data set is typically collected by automatic dial testing, with each message storing the traffic of one operation, for example 1800 s of non-HTTP traffic from playing a TV series on demand in usu video. Each message needs to record a quintuple, an application name, and a traffic type. In this embodiment, 80% of the traffic in the messages of each application is randomly extracted as the training set, and the remaining 20% is used as the verification set. The training result is verified by random sub-sampling validation.
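The screening of HTTP connections in the window [T - Δt, T] can be illustrated with a minimal Python sketch. The `Connection` record, its field names, the `http_candidates` helper, and the sample addresses and host names are assumptions made for illustration; the text only specifies the quintuple, the start time, and the window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    # Quintuple as listed in the description.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    # Extra fields assumed for this sketch.
    start_time: float  # seconds
    is_http: bool
    host: str = ""     # domain name recorded for HTTP connections


def http_candidates(encrypted, connections, delta_t):
    """Return HTTP connections from the same source IP that started in [T - delta_t, T]."""
    t = encrypted.start_time
    return [
        c for c in connections
        if c.is_http
        and c.src_ip == encrypted.src_ip
        and t - delta_t <= c.start_time <= t
    ]


conns = [
    Connection("10.0.0.1", 50000, "93.184.216.34", 80, "TCP", 9.5, True, "api.example.com"),
    Connection("10.0.0.1", 50001, "93.184.216.35", 80, "TCP", 7.0, True, "old.example.com"),
    Connection("10.0.0.2", 50002, "93.184.216.34", 80, "TCP", 9.8, True, "api.example.com"),
]
enc = Connection("10.0.0.1", 50003, "93.184.216.40", 443, "TCP", 10.0, False)
hosts = [c.host for c in http_candidates(enc, conns, delta_t=1.0)]
print(hosts)
```

With Δt = 1 s, only the HTTP connection from the same source IP that started 0.5 s before the encrypted connection is kept; the older connection and the one from a different source IP are filtered out.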
Step S3 further includes generating a training result R from the single-packet structure model and the flow model, where the training result R includes a service name, a traffic type, the associated HTTP connection, single-packet structure features, and flow statistical features. Traffic types are divided according to the scenario or behavior that generates the traffic, for example browsing, video streams, download streams, voice and video call streams, and game battle streams. The single-packet structure model includes the positions and lengths, within a single packet, of fields such as SessionID, PacketNumber, operating system, version number, and encryption algorithm. The flow statistical features include the packet time interval distribution, the packet length distribution, and the packet sending rate distribution.
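As a rough illustration of the flow statistical features named above, the following Python sketch summarizes the packet time intervals, packet lengths, and sending rate of one connection. The function name `flow_features`, the choice of mean and standard deviation as summaries of each distribution, and the input format are assumptions; the text does not say how the distributions are encoded.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Summarize a non-empty list of (timestamp_seconds, length_bytes) in arrival order."""
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-packet intervals
    duration = times[-1] - times[0]
    return {
        "mean_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
        "mean_len": mean(lengths),
        "std_len": pstdev(lengths),
        "pkts_per_s": len(packets) / duration if duration > 0 else 0.0,
    }


feats = flow_features([(0.0, 100), (0.5, 300), (1.0, 200), (2.0, 200)])
print(feats)
```

The resulting dictionary can be flattened into a vector and combined with single-packet structure features before training.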
Preferably, step S2 further includes an associated HTTP calculation policy configured to calculate a Jaccard coefficient for each HTTP connection. The Jaccard coefficient, also known as the Jaccard similarity coefficient, is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the higher the similarity of the samples.
Given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of the union of A and B:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
When both sets A and B are empty, J(A, B) is defined as 1. Here A represents the number of occurrences of service X-traffic type Y in the training set, B represents the number of occurrences of the HTTP connection across all service-traffic types in the training set, and $|A \cap B|$ represents the number of occurrences of the HTTP connection in service X-traffic type Y. The associated HTTP calculation policy selects the HTTP connection with the largest Jaccard coefficient value as the associated HTTP connection of the encrypted connection.
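The selection rule can be sketched in a few lines of Python. The function names, the dictionary layout, and the host names and counts below are hypothetical; only the coefficient itself, the empty-set convention, and the "largest value wins" rule come from the text.

```python
def jaccard(a_count, b_count, both_count):
    """Jaccard coefficient from occurrence counts: |A∩B| / (|A| + |B| - |A∩B|)."""
    union = a_count + b_count - both_count
    return 1.0 if union == 0 else both_count / union  # J = 1 when both sets are empty


def associated_http(a_count, candidates):
    """candidates maps an HTTP key to (b_count, both_count); return the key with the largest J."""
    return max(candidates, key=lambda k: jaccard(a_count, *candidates[k]))


# Service X-traffic type Y occurs 10 times in the training set (made-up numbers).
a = 10
candidates = {
    "api.video.example": (12, 9),   # seen 12 times overall, 9 of them with X-Y
    "cdn.ads.example": (200, 10),   # seen everywhere, so it is a weak indicator
}
scores = {k: jaccard(a, *v) for k, v in candidates.items()}
best = associated_http(a, candidates)
print(scores, best)
```

In this example the second host accompanies every occurrence of X-Y but also appears with many other service-traffic types, so its coefficient is 10/200 = 0.05, while the more specific host scores 9/13 and is selected.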
Preferably, in step S4, adjusting the corresponding training parameters includes adjusting the parameters of the single-packet structure model, adjusting the flow statistical features, changing the machine learning classification algorithm, adjusting the time Δt, or optimizing the training set; after the adjustment, step S3 is performed again, and this is repeated until the correct identification rate is greater than the minimum identification rate requirement value P. The single-packet structure features and flow statistical features may be adjusted by changing which features are used, for example removing the packet length and packet sending rate features and using a fixed fingerprint value and the time interval instead. Which features to use and which to adjust is determined by the specific traffic identification requirements and by the implementer. Changing the machine learning algorithm means, for example, switching from SVM to XGBoost. Adjusting Δt means, for example, changing Δt from 1 s to 0.1 s. Optimizing the training set means, for example, making every connection in the training set the same duration, which makes traffic classification more accurate.
As shown in FIG. 2, step S4 includes, if the correct identification rate is greater than the minimum identification rate requirement value P, entering an application step, where the application step includes:
S5, screening an HTTP connection dL in the new traffic and recording its start time T1;
S6, identifying the HTTP connection dL as service A based on payload features, and searching for the HTTP connection dL in the associated HTTP connection set L; if it is not found, analyzing the next HTTP connection dL, and if it is found, executing S7;
S7, monitoring the real-time traffic for an encrypted connection eL from the same source IP within the time window [T1, T1 + Δt]; if none occurs, analyzing the next HTTP connection dL, and if one occurs, executing S8 on the first packet of the encrypted connection eL;
S8, judging whether the current packet matches the single-packet structure features and flow statistical features of traffic type X in the training result R; if not, marking the encrypted connection eL as unidentified traffic, and if so, marking the encrypted connection eL as service A-traffic type X with the state suspicious, and executing S9;
S9, judging the number of packets sent on the encrypted connection eL: if it is less than N, executing S8 on the next packet; if it is equal to N, taking the identification result of the Nth packet as the identification result of eL and setting the connection identification state to determined.
Preferably, step S7 includes, according to the features of the associated HTTP connection dL, analyzing the single-packet structure features and flow statistical features of the first packet of the encrypted connection eL, and judging whether the first packet matches the single-packet structure features and flow statistical features of traffic type X in the training result R.
Specifically, the machine learning classification algorithm is a Bayesian classifier, a decision tree, or an SVM. The training set consists of vectors converted from the single-packet structure features and flow statistical features of each connection, and the machine learning algorithm learns multiple models from these vectors, each model corresponding to one traffic type. When a new connection enters the identification system, it is converted into a vector and matched against the trained models to determine which traffic type it belongs to.
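To make the vector-based classification concrete, here is a small pure-Python sketch of a Gaussian naive Bayes classifier, chosen as one possible instance of the "Bayes" option. The two-dimensional feature vectors (mean packet length, mean packet interval), the labels, the variance floor, and the function names are illustrative assumptions, not the patent's actual model.

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples):
    """samples: list of (feature_vector, label). Returns {label: (prior, means, variances)}."""
    by_label = defaultdict(list)
    for vec, label in samples:
        by_label[label].append(vec)
    model = {}
    for label, vecs in by_label.items():
        n = len(vecs)
        columns = list(zip(*vecs))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-6  # small floor avoids division by zero
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(samples), means, variances)
    return model


def predict(model, vec):
    """Return the label whose Gaussian model gives the highest log posterior."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(vec, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


# Made-up vectors: [mean packet length in bytes, mean packet interval in seconds].
training = [
    ([1200.0, 0.01], "video"), ([1300.0, 0.02], "video"), ([1250.0, 0.015], "video"),
    ([120.0, 0.50], "browsing"), ([150.0, 0.40], "browsing"), ([100.0, 0.60], "browsing"),
]
model = train_gaussian_nb(training)
label = predict(model, [1280.0, 0.012])
print(label)
```

Each label gets its own set of per-feature means and variances, which loosely mirrors the idea of one learned model per traffic type; a decision tree or SVM could be substituted without changing the vector representation.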
FIG. 3 shows the complete flow of the invention, which comprises a preparation stage, a training stage, and an application stage.
Preparation stage: the data set is typically collected by automatic dial testing, with each message storing the traffic of one operation, for example 1800 s of non-HTTP traffic from playing a TV series on demand in usu video. Each message needs to record a quintuple, an application name, and a traffic type.
Then 80% of the traffic in the messages of each application is randomly extracted as the training set, and the remaining 20% is used as the verification set. The training result is verified by random sub-sampling validation.
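The 80/20 random split can be sketched as follows. Grouping by (application, traffic type) before sampling, the dictionary keys, and the fixed seed are assumptions made so that the example is reproducible; the text only states that 80% of each application's traffic is drawn at random.

```python
import random
from collections import defaultdict


def split_by_label(messages, train_fraction=0.8, seed=0):
    """Randomly split messages into training and verification sets within each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for msg in messages:
        groups[(msg["app"], msg["traffic_type"])].append(msg)
    train, validation = [], []
    for group in groups.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        validation.extend(shuffled[cut:])
    return train, validation


messages = [{"app": "app_a", "traffic_type": "video", "id": i} for i in range(10)]
messages += [{"app": "app_b", "traffic_type": "download", "id": i} for i in range(5)]
train, validation = split_by_label(messages)
print(len(train), len(validation))
```

Splitting inside each group keeps every application represented in both sets, so a per-class identification rate can be computed for all of them.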
Training stage: for each encrypted connection:
1. Extract the start time T and the quintuple information. Then screen all HTTP connections generated by the same source IP within the time window [T - Δt, T] and record their related information, such as domain name, User-Agent, and URI.
2. Calculate the Jaccard coefficient of each HTTP connection, where A represents the number of occurrences of service X-traffic type Y in the training set, B represents the number of occurrences of the HTTP connection across all service-traffic types in the training set, and $|A \cap B|$ represents the number of occurrences of the HTTP connection in service X-traffic type Y.
3. Select the HTTP connection with the largest Jaccard coefficient value as the associated HTTP connection of this encrypted connection.
The following operations are performed on the messages of the same application-traffic type:
a. Repeat steps 1 to 3 above for all encrypted connections to obtain the associated HTTP connection set L.
b. Learn the models with a machine learning classification algorithm: the single-packet structure model and the flow model.
c. Save the training result R: service name, traffic type, associated HTTP connection, single-packet structure features, and flow statistical features.
Operations a to c are performed on all messages in the training set to form the complete training result R.
The messages in the verification set are then identified using the training result R, and for each service-traffic type C the following is calculated:
correct identification rate = number of connections identified as C / total number of connections in the class C messages of the verification set.
When the correct identification rate > P (0 <= P <= 1, where P is the minimum identification rate requirement determined by the actual scenario), the training result can be used in the application stage. When the correct identification rate <= P, it is necessary to adjust the parameters of the single-packet structure model or the flow statistical features, change the machine learning classification algorithm, adjust the time Δt, or optimize the training set, and then repeat the training stage until the correct identification rate meets the requirement.
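The verification check can be written as a short Python sketch. It assumes that the numerator counts class C connections that were identified as C and that the threshold is applied to every class separately; the function names and sample labels are hypothetical.

```python
from collections import Counter


def identification_rates(truth, predicted):
    """truth, predicted: parallel lists of service-traffic type labels, one per connection."""
    totals = Counter(truth)
    correct = Counter(t for t, p in zip(truth, predicted) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


def needs_retraining(rates, p_min):
    """True when any class fails the requirement 'rate > P'."""
    return any(rate <= p_min for rate in rates.values())


truth = ["A-video", "A-video", "A-video", "A-video", "B-download", "B-download"]
predicted = ["A-video", "A-video", "A-video", "unidentified", "B-download", "B-download"]
rates = identification_rates(truth, predicted)
print(rates, needs_retraining(rates, p_min=0.9))
```

Here three of the four "A-video" connections are identified correctly, giving a rate of 0.75, so with P = 0.9 the training stage would be repeated after adjusting the parameters.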
Application stage: when traffic enters the device, HTTP connections are screened first, and the following operations are performed on each HTTP connection:
1. Record the HTTP connection, denoted dL, its start time, denoted T1, and its quintuple.
2. Identify the connection dL as service A based on payload features, which are application-layer fingerprint features.
3. Search the associated HTTP connection set (denoted L) for the connection dL. If it is not found, this HTTP connection is not associated with any encrypted connection in the training set, and the next HTTP connection can be analyzed. If it is found, perform step 4.
4. Let X be the service-traffic type associated with dL. Check whether encrypted traffic (denoted eL) from the same source IP occurs in the real-time traffic within the time window [T1, T1 + Δt]. If not, the information of the connection dL is no longer kept for identifying encrypted traffic. If so, perform step 5 on the first packet of the encrypted connection eL.
5. Judge whether the current packet matches the single-packet structure model S and the flow statistical features F of X in the training result R. If not, the encrypted connection eL can only be marked as unidentified traffic, and matching of further packets stops.
6. If it matches, mark the encrypted connection eL as service-traffic type X with the state suspicious.
While the number of packets sent <= N (N is the maximum number of packets to inspect, determined by the actual application scenario), steps 5 and 6 are performed on the current packet.
If the number of packets sent is less than N and the current packet does not match the single-packet structure features and flow statistical features of X, the identification result of the encrypted connection eL is marked as unidentified, that is, changed to unknown.
When the number of packets sent = N, the identification result of the encrypted connection is that of the Nth packet, and the connection identification state is determined.
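The per-packet loop of the application stage can be sketched as a small state machine in Python. The `matches` predicate stands in for the comparison against the single-packet structure model and flow statistical features of X, and it, the function name, the boolean `determined` flag, and the packet-length example are all hypothetical.

```python
def identify_connection(packets, matches, label, n_max):
    """Inspect up to n_max packets of an encrypted connection.

    Returns (result, determined). The result is `label` while every packet seen so far
    matches, and "unidentified" as soon as one does not. `determined` becomes True once
    n_max packets have matched; a mismatch stops matching, which this sketch also
    treats as final (an assumption).
    """
    result, determined = None, False
    for count in range(1, min(len(packets), n_max) + 1):
        if not matches(packets[:count]):
            return "unidentified", True
        result, determined = label, count == n_max  # suspicious until the Nth packet
    return result, determined


# Hypothetical matcher: every packet length seen so far lies in the learned range.
def in_learned_range(prefix):
    return all(1000 <= length <= 1500 for length in prefix)


full = identify_connection([1200, 1300, 1250], in_learned_range, "A-video", n_max=3)
partial = identify_connection([1200, 1300], in_learned_range, "A-video", n_max=3)
broken = identify_connection([1200, 40, 1300], in_learned_range, "A-video", n_max=3)
print(full, partial, broken)
```

The connection is labeled from its first packet onward, remains suspicious while fewer than N packets have been seen, and is only fixed once the Nth packet agrees.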
Working principle: the encrypted traffic identification method of the invention combines multi-flow association identification, single-packet identification, and single-flow multi-packet identification, so that it can be applied to any encrypted traffic and effectively improves the correct identification rate of encrypted traffic. It also supports real-time identification: encrypted traffic can be identified starting from the first packet, and the identification result can be as fine-grained as the traffic type of a service.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. An encrypted flow identification method based on an artificial intelligence algorithm, characterized by comprising the following steps:
S1, preparing a data set and dividing it into a training set and a verification set, used respectively for training the model and for verifying the training result;
S2, calculating all associated HTTP connections of the encrypted connections in the training set and forming an associated HTTP connection set L;
S3, based on the associated HTTP connection set L obtained in step S2, training a single-packet structure model and a flow model for each encrypted connection by using a machine learning classification algorithm;
S4, verifying the training result by identifying the flows in the verification set with the trained model, and, if the correct identification rate is less than or equal to the minimum identification rate requirement value P, adjusting the corresponding training parameters and then repeating step S3.
2. The method for identifying encrypted traffic based on an artificial intelligence algorithm according to claim 1, wherein step S1 further comprises extracting the start time T and the quintuple information of each encrypted connection, selecting all HTTP connections generated by the same source IP within the time window [T-Δt, T], and recording the related information.
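The time-window selection described in claim 2 can be illustrated with a minimal Python sketch. This is not the patentee's implementation: the record type `Conn`, its field names, the `kind` labels, and the function `http_candidates` are all hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Conn:
    """One connection record; the field names are illustrative assumptions."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    transport: str   # e.g. "tcp"
    kind: str        # "http" or "encrypted" (hypothetical labels)
    start: float     # connection start time, in seconds


def http_candidates(enc: Conn, conns: List[Conn], delta_t: float) -> List[Conn]:
    """Select HTTP connections from enc's source IP that started in [T - delta_t, T]."""
    t = enc.start
    return [
        c for c in conns
        if c.kind == "http"
        and c.src_ip == enc.src_ip
        and t - delta_t <= c.start <= t
    ]
```

The returned list corresponds to the candidate HTTP connections that would later be scored for association with the encrypted connection.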
3. The artificial intelligence algorithm-based encrypted traffic identification method according to claim 2, wherein step S2 further comprises an associated HTTP calculation strategy configured to calculate the Jaccard coefficient of each HTTP connection, wherein A represents the number of occurrences of service X-traffic type Y in the training set, B represents the number of occurrences of the HTTP connection across all service-traffic type combinations in the training set, and $A \cap B$ represents the number of occurrences of the HTTP connection in service X-traffic type Y; the formula for the Jaccard coefficient is as follows:
$$J = \frac{A \cap B}{A \cup B} = \frac{A \cap B}{A + B - A \cap B}$$
The HTTP connection with the largest Jaccard coefficient value is selected as the associated HTTP connection of the encrypted connection.
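As a hedged illustration of the selection in claim 3, the sketch below computes the Jaccard coefficient from the three counts and picks the HTTP connection with the largest value. The sample layout (a label plus a set of candidate HTTP identifiers per encrypted connection), the interpretation of the counts, and the names `jaccard` and `best_associated_http` are assumptions made for this example only.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

Label = Tuple[str, str]  # (service, traffic type)


def jaccard(a: int, b: int, a_and_b: int) -> float:
    """Jaccard coefficient from counts: |A and B| / (|A| + |B| - |A and B|)."""
    union = a + b - a_and_b
    return a_and_b / union if union > 0 else 0.0


def best_associated_http(
    samples: List[Tuple[Label, Set[str]]], target: Label
) -> Optional[Tuple[str, float]]:
    """Return (http_id, score) with the largest Jaccard value for the target label."""
    # A: how often the target service/traffic type occurs in the training set.
    a = sum(1 for label, _ in samples if label == target)
    # B: how often each HTTP connection occurs across all labels.
    b = Counter(h for _, https in samples for h in https)
    # A and B: how often each HTTP connection occurs together with the target label.
    a_and_b = Counter(
        h for label, https in samples if label == target for h in https
    )
    if not a_and_b:
        return None
    scores = {h: jaccard(a, b[h], a_and_b[h]) for h in a_and_b}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

An HTTP connection that appears with every sample of the target label and with no other label scores 1.0, while one that is shared across many labels is penalized by the larger union.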
4. The method for identifying encrypted traffic based on artificial intelligence algorithm according to claim 3, wherein said step S3 further comprises generating a training result R according to the single packet structure model and the stream model, wherein the training result R comprises the service name, the traffic type, the associated HTTP, the single packet structure feature and the stream statistical feature.
5. The method according to claim 4, wherein the single packet structure characteristics include, but are not limited to, packet length and fixed fingerprint value, and the flow statistical characteristics include, but are not limited to, packet time interval distribution, packet length distribution, and packet sending rate distribution.
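The features named in claim 5 could be extracted along the following lines. This is only a sketch under stated assumptions: the claim does not define the "fixed fingerprint value", so the first four payload bytes are used here as a hypothetical stand-in, and each distribution is summarized by its mean and standard deviation rather than modeled in full. The function names are illustrative.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def single_packet_features(payload: bytes) -> Dict[str, object]:
    """Per-packet structure features: length plus a hypothetical prefix fingerprint."""
    return {"length": len(payload), "fingerprint": payload[:4].hex()}


def flow_features(packets: List[Tuple[float, int]]) -> Dict[str, float]:
    """Summarize a flow given (timestamp_seconds, length_bytes) pairs in arrival order."""
    if not packets:
        raise ValueError("at least one packet is required")
    times = [t for t, _ in packets]
    sizes = [n for _, n in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    duration = times[-1] - times[0]
    return {
        "len_mean": float(mean(sizes)),
        "len_std": float(pstdev(sizes)),
        "gap_mean": float(mean(gaps)) if gaps else 0.0,
        "gap_std": float(pstdev(gaps)) if gaps else 0.0,
        # Packet sending rate over the observed span; 0.0 when the span is empty.
        "pkts_per_s": len(packets) / duration if duration > 0 else 0.0,
    }
```

The resulting dictionaries can be flattened into a numeric vector for whichever classification algorithm is chosen.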
6. The method for identifying encrypted traffic based on an artificial intelligence algorithm according to claim 5, wherein in step S4, adjusting the corresponding training parameters comprises adjusting the parameters of the single-packet structure model, adjusting the flow statistical features, changing the machine learning classification algorithm, adjusting the time Δt, or optimizing the training set; and step S4 comprises adjusting the corresponding training parameters and then returning to step S3, repeatedly, until the correct identification rate is greater than the minimum identification rate requirement value P.
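The adjust-and-retrain cycle of steps S3 and S4 might be organized as in the following sketch. It assumes, purely for illustration, that the candidate adjustments are supplied as a finite list of parameter dictionaries and that training is delegated to a caller-provided function; `train_fn`, `param_candidates`, and `train_until` are hypothetical names, not part of the patent.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], str]          # (feature vector, label)
Predictor = Callable[[Sequence[float]], str]  # maps features to a label


def accuracy(predict: Predictor, val: List[Sample]) -> float:
    """Correct identification rate of a predictor on the verification set."""
    if not val:
        return 0.0
    hits = sum(1 for features, label in val if predict(features) == label)
    return hits / len(val)


def train_until(
    train_fn: Callable[[List[Sample], Dict], Predictor],
    train: List[Sample],
    val: List[Sample],
    param_candidates: List[Dict],
    p_min: float,
) -> Optional[Tuple[Dict, Predictor, float]]:
    """Retrain with each candidate adjustment until the rate exceeds p_min."""
    for params in param_candidates:
        predict = train_fn(train, params)   # step S3 with adjusted parameters
        rate = accuracy(predict, val)       # step S4 verification
        if rate > p_min:
            return params, predict, rate
    return None  # no candidate met the requirement
```

Returning `None` when the candidates are exhausted is a design choice of this sketch; the claim itself describes repeating until the requirement is met.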
7. The method for identifying encrypted traffic based on an artificial intelligence algorithm according to claim 6, wherein step S1 comprises randomly extracting 80% of the traffic messages of each application as the training set and using the remaining 20% as the verification set, wherein the quintuple information, the application name, and the traffic type are recorded for each message.
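A per-application 80/20 split of the kind described in claim 7 can be sketched as follows. The record layout (a dictionary with an `"app"` key), the fixed random seed, and the function name `split_per_app` are assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_per_app(
    records: List[Dict], train_frac: float = 0.8, seed: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """Randomly split each application's records into training and verification parts."""
    rng = random.Random(seed)  # seeded so that the split is reproducible
    by_app: Dict[str, List[Dict]] = defaultdict(list)
    for record in records:
        by_app[record["app"]].append(record)

    train: List[Dict] = []
    val: List[Dict] = []
    for app in sorted(by_app):
        items = list(by_app[app])
        rng.shuffle(items)
        cut = round(len(items) * train_frac)
        train.extend(items[:cut])
        val.extend(items[cut:])
    return train, val
```

Splitting within each application, rather than over the pooled data, keeps every application represented in both sets in roughly the same proportion.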
8. The method for identifying encrypted traffic based on artificial intelligence algorithm according to any of claims 4-7, wherein said step S4 includes entering into an application step if the correct identification rate is greater than the minimum identification rate requirement value P, said application step includes:
S5, screening an HTTP connection dL in the new flow and recording its start time T1;
S6, identifying the HTTP connection dL as a service A based on its payload characteristics, and searching whether the HTTP connection dL exists in the associated HTTP connection set L; if not, analyzing the next HTTP connection dL, and if so, executing S7;
S7, monitoring, in the real-time flow, whether an encrypted connection eL from the source IP occurs within the time window [T1, T1+Δt]; if not, analyzing the next HTTP connection dL, and if so, executing S8 on the first packet of the encrypted connection eL;
S8, judging whether the current packet data matches the single-packet structure characteristics and the flow statistical characteristics of X in the training result R; if not, marking the encrypted connection eL as unidentified flow, and if so, marking the encrypted connection eL as service A-flow type X in a suspected state and executing S9;
S9, judging whether the number of packets sent is less than N; if it is less than N, executing S8 again on the next packet, and if it is equal to N, taking the identification result of the Nth packet as the identification result of eL and setting the connection identification state to determined.
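One possible reading of steps S8 and S9 is a small per-connection state machine: each of the first N packets is checked against the trained characteristics, a mismatch ends the process as unidentified, and the Nth consecutive match fixes the result. The sketch below follows that reading; the `State` names, the `matches` callback, and the function `identify_connection` are hypothetical, and N is assumed to be at least 1.

```python
from enum import Enum
from typing import Callable, Iterable, Optional, Tuple, TypeVar

Packet = TypeVar("Packet")
Label = Tuple[str, str]  # (service, flow type)


class State(Enum):
    UNIDENTIFIED = "unidentified"
    SUSPECTED = "suspected"
    DETERMINED = "determined"


def identify_connection(
    packets: Iterable[Packet],
    matches: Callable[[Packet], bool],
    label: Label,
    n: int,
) -> Tuple[State, Optional[Label]]:
    """Check up to n packets of an encrypted connection against one trained label."""
    state = State.UNIDENTIFIED
    for count, packet in enumerate(packets, start=1):
        if not matches(packet):            # S8: mismatch, mark as unidentified
            return State.UNIDENTIFIED, None
        state = State.SUSPECTED            # S8: match, result is tentative
        if count == n:                     # S9: the Nth packet fixes the result
            return State.DETERMINED, label
    # Fewer than n packets seen so far: result stays tentative (or unidentified if none).
    return state, (label if state is State.SUSPECTED else None)
```

Because the function consumes packets as they arrive, a tentative label is available from the first packet onward, which matches the real-time behavior described for the method.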
9. The method for identifying encrypted traffic based on artificial intelligence algorithm according to claim 8, wherein said step S7 includes analyzing the first packet structure characteristic and the first packet flow statistical characteristic of the encrypted connection eL according to the characteristic associated with the HTTP connection dL, and determining whether the first packet data matches the single packet structure characteristic and the flow statistical characteristic of X in the training result R.
10. The encrypted flow identification method based on an artificial intelligence algorithm according to claim 9, characterized in that the machine learning classification algorithm is a Bayes classifier, a decision tree, or an SVM.
CN202210047506.2A 2022-01-17 2022-01-17 Encrypted flow identification method based on artificial intelligence algorithm Active CN114091087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210047506.2A CN114091087B (en) 2022-01-17 2022-01-17 Encrypted flow identification method based on artificial intelligence algorithm

Publications (2)

Publication Number Publication Date
CN114091087A (en) 2022-02-25
CN114091087B (en) 2022-04-15

Family

ID=80308798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210047506.2A Active CN114091087B (en) 2022-01-17 2022-01-17 Encrypted flow identification method based on artificial intelligence algorithm

Country Status (1)

Country Link
CN (1) CN114091087B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277063A (en) * 2022-06-13 2022-11-01 深圳铸泰科技有限公司 Terminal identification device under IPV4 and IPV6 hybrid network environment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109286576A (en) * 2018-10-10 2019-01-29 北京理工大学 A kind of network agent encryption traffic characteristic extracting method of data packet frequency analysis
CN109361617A (en) * 2018-09-26 2019-02-19 中国科学院计算机网络信息中心 A kind of convolutional neural networks traffic classification method and system based on network payload package
CN109861957A (en) * 2018-11-06 2019-06-07 中国科学院信息工程研究所 A kind of the user behavior fining classification method and system of the privately owned cryptographic protocol of mobile application
CN110011931A (en) * 2019-01-25 2019-07-12 中国科学院信息工程研究所 A kind of encryption traffic classes detection method and system
US20190245866A1 (en) * 2018-02-06 2019-08-08 Cisco Technology, Inc. Leveraging point inferences on http transactions for https malware detection
CN110493142A (en) * 2019-07-05 2019-11-22 南京邮电大学 Mobile applications Activity recognition method based on spectral clustering and random forests algorithm
CN112583738A (en) * 2020-12-29 2021-03-30 北京浩瀚深度信息技术股份有限公司 Method, equipment and storage medium for analyzing and classifying network flow
CN113259313A (en) * 2021-03-30 2021-08-13 浙江工业大学 Malicious HTTPS flow intelligent analysis method based on online training algorithm
CN113469366A (en) * 2020-03-31 2021-10-01 北京观成科技有限公司 Encrypted flow identification method, device and equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
INSUP LEE et al.: "Poster Abstract: Encrypted Malware Traffic Detection Using Incremental Learning", 《IEEE INFOCOM 2020 - IEEE CONFERENCE ON COMPUTER COMMUNICATIONS WORKSHOPS (INFOCOM WKSHPS)》 *
YAN MENG et al.: "Classification of Unknown Mobile Web Traffic Based on Correlation Coefficient Measurement", 《2014 INTERNATIONAL SYMPOSIUM ON WIRELESS PERSONAL MULTIMEDIA COMMUNICATIONS (WPMC)》 *
He Xiaomin: "Research and Implementation of Website Fingerprinting Attack and Defense Techniques Based on Deep Learning", 《China Excellent Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277063A (en) * 2022-06-13 2022-11-01 深圳铸泰科技有限公司 Terminal identification device under IPV4 and IPV6 hybrid network environment
CN115277063B (en) * 2022-06-13 2023-07-25 深圳铸泰科技有限公司 Terminal identification device under IPV4 and IPV6 mixed network environment

Also Published As

Publication number Publication date
CN114091087B (en) 2022-04-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 218, 2nd Floor, Building A, No. 119 West Fourth Ring North Road, Haidian District, Beijing, 100000

Patentee after: HAOHAN DATA TECHNOLOGY CO.,LTD.

Address before: 102, building 14, 45 Beiwa Road, Haidian District, Beijing

Patentee before: HAOHAN DATA TECHNOLOGY CO.,LTD.