CN114091087A

CN114091087A - Encrypted flow identification method based on artificial intelligence algorithm

Info

Publication number: CN114091087A
Application number: CN202210047506.2A
Authority: CN
Inventors: 肖梅; 陈柯杉; 姚胜利; 齐凯
Original assignee: Haohan Data Technology Co ltd
Current assignee: Haohan Data Technology Co ltd
Priority date: 2022-01-17
Filing date: 2022-01-17
Publication date: 2022-02-25
Anticipated expiration: 2042-01-17
Also published as: CN114091087B

Abstract

The invention discloses an encrypted traffic identification method based on an artificial intelligence algorithm, comprising the following steps: S1, preparing a data set and dividing it into a training set and a verification set, used respectively for training a model and for verifying the training result, where the training set includes a large number of service messages and each message records a quintuple, an application name, and a traffic type; S2, calculating the associated HTTP connections of all encrypted connections in the training set to form an associated HTTP connection set L; S3, based on the associated HTTP connection set L obtained in step S2, training a single-packet structure model and a flow model for each encrypted connection by using a machine learning classification algorithm; and S4, verifying the training result by identifying the traffic in the verification set with the trained model. By combining multi-flow association identification, single-packet identification, and single-flow multi-packet identification, the method can be applied to any encrypted traffic and effectively improves the correct identification rate of encrypted traffic.

Description

Encrypted flow identification method based on artificial intelligence algorithm

Technical Field

The invention relates to the technical field of encrypted flow identification, in particular to an encrypted flow identification method based on an artificial intelligence algorithm.

Background

With the development of Internet technology, more and more applications and scenarios use encrypted traffic, while plaintext HTTP is used in fewer and fewer scenarios. By rough estimates, about 70% of network traffic is encrypted. It may be based on general-purpose encryption protocols such as HTTPS, QUIC, and DTLS, or on private encryption protocols, with HTTPS accounting for most encrypted traffic. At present there are two main methods for identifying HTTPS traffic. The first identifies traffic from the certificate chain, but HTTPS session reuse is widespread, and when a session is reused the certificate chain can no longer be used. The second identifies traffic from flow statistical features, but these features are strongly affected by network quality and may not differ clearly between applications, so the false identification rate is high. Other encrypted traffic is likewise identified by two methods. The first is based on single-packet features, but the specific meaning of such features cannot be known, the extracted information changes frequently, and false identification may occur. The second classifies traffic with a machine learning algorithm based on flow statistical features and has the same problems as the statistical identification of HTTPS. Existing methods based on flow statistical features cannot identify traffic in real time: a connection can be identified only after it has ended or after its first N messages have passed through the device. These methods can also identify traffic only at the granularity of the application, which does not meet the growing demand for finer-grained identification.

Disclosure of Invention

The invention aims to provide an encrypted flow identification method based on an artificial intelligence algorithm, which is suitable for any encrypted flow by combining a multi-flow association identification method, a single-packet identification method and a single-flow multi-packet identification method, and effectively improves the correct identification rate of the encrypted flow.

In order to achieve the purpose, the invention provides the following technical scheme: an encrypted flow identification method based on an artificial intelligence algorithm comprises the following steps:

S1, preparing a data set and dividing it into a training set and a verification set, which are used respectively for training a model and for verifying the training result;

S2, calculating the associated HTTP connections of all encrypted connections in the training set to form an associated HTTP connection set L;

S3, based on the associated HTTP connection set L obtained in step S2, training a single-packet structure model and a flow model for each encrypted connection by using a machine learning classification algorithm;

S4, verifying the training result by identifying the traffic in the verification set with the trained model, and if the correct identification rate is less than or equal to the minimum identification rate requirement value P, adjusting the corresponding training parameters and then repeating step S3.

Preferably, step S1 further includes extracting the start time T and the quintuple information of each encrypted connection, screening all HTTP connections generated by the same source IP within the time window [T - Δt, T], and recording the related information.

Preferably, step S2 further includes an associated HTTP calculation policy configured to calculate a Jaccard coefficient for each HTTP connection, where A represents the number of occurrences of service X-traffic type Y in the training set, B represents the number of occurrences of the HTTP connection across all service-traffic types in the training set, and A ∩ B represents the number of occurrences of the HTTP connection within service X-traffic type Y. The formula for the Jaccard coefficient is as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

The HTTP connection with the largest Jaccard coefficient value is selected as the associated HTTP connection of the encrypted connection.
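As an illustrative calculation with made-up counts (they are not taken from the patent), suppose service X-traffic type Y occurs 10 times in the training set, a first candidate HTTP connection occurs 12 times across all service-traffic types, and 8 of those occurrences fall within service X-traffic type Y. A second candidate occurs 40 times overall, 9 of them within service X-traffic type Y. Writing the coefficient as the intersection count divided by the union count gives:

$$J_1 = \frac{8}{10 + 12 - 8} = \frac{8}{14} \approx 0.571, \qquad J_2 = \frac{9}{10 + 40 - 9} = \frac{9}{41} \approx 0.220$$

Under these assumed counts the first candidate has the larger coefficient and would be selected, even though the second co-occurs with the encrypted traffic slightly more often in absolute terms, because the second also appears frequently alongside other service-traffic types.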

Preferably, the step S3 further includes generating a training result R according to the single packet structure model and the stream model, where the training result R includes a service name, a traffic type, an associated HTTP, a single packet structure feature, and a stream statistical feature.

Preferably, the single packet structure characteristics include, but are not limited to, packet length and fixed fingerprint value, and the stream statistical characteristics include, but are not limited to, packet time interval distribution, packet length distribution, and distribution of packet transmission rate.

Preferably, in step S4, adjusting the corresponding training parameters includes adjusting the parameters of the single-packet structure model, adjusting the flow statistical features, changing the machine learning classification algorithm, adjusting the time Δt, or optimizing the training set; step S4 includes returning to step S3 after the corresponding training parameters are adjusted, until the correct identification rate is greater than the minimum identification rate requirement value P.

Preferably, step S1 includes randomly extracting 80% of the traffic in the messages of each application as the training set and using the remaining 20% as the verification set, where each message records quintuple information, an application name, and a traffic type.

Preferably, the step S4 includes, if the correct recognition rate is greater than the minimum recognition rate requirement value P, entering an application step, where the application step includes:

S5, screening an HTTP connection dL in the new traffic and recording its start time T1;

S6, identifying the HTTP connection dL as service A based on its payload features, and searching the associated HTTP connection set L for the HTTP connection dL; if it is not found, analyzing the next HTTP connection dL, and if it is found, executing S7;

S7, monitoring the real-time traffic to determine whether an encrypted connection eL from the same source IP occurs within the time [T1, T1 - Δt]; if not, analyzing the next HTTP connection dL, and if so, executing S8 on the first packet of the encrypted connection eL;

S8, judging whether the current packet data matches the single-packet structure features and the flow statistical features of X in the training result R; if not, marking the encrypted connection eL as unidentified traffic, and if so, marking the encrypted connection eL as service A-traffic type X with the state "suspected", and executing S9;

S9, judging whether the number of packets sent is less than or equal to N; while it is less than or equal to N, repeating S8 for the current packet, and when it equals N, taking the identification result of the Nth packet as the identification result of eL, whereupon the identification state of the connection is confirmed.

Preferably, step S7 includes analyzing the single-packet structure features and the flow statistical features of the first packet of the encrypted connection eL according to the features associated with the HTTP connection dL, and determining whether the first-packet data matches the single-packet structure features and the flow statistical features of X in the training result R.

Preferably, the machine learning classification algorithm is a Bayesian classifier, a decision tree, or an SVM.

Compared with the prior art, the invention has the beneficial effects that:

The encrypted traffic identification method of the invention combines multi-flow association identification, single-packet identification, and single-flow multi-packet identification, so that it can be applied to any encrypted traffic and effectively improves the correct identification rate of encrypted traffic. It also supports real-time identification: encrypted traffic can be identified from the first packet, and the granularity of the identification result can be as fine as the traffic type of a service.

Drawings

FIG. 1 is a flowchart of an encrypted traffic identification method based on an artificial intelligence algorithm according to the present invention;

FIG. 2 is a flowchart of the application steps of the encrypted traffic identification method based on an artificial intelligence algorithm according to the present invention;

FIG. 3 is a complete flowchart of the encrypted traffic identification method based on an artificial intelligence algorithm according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, an embodiment of the present invention provides an encrypted traffic identification method based on an artificial intelligence algorithm, where the encrypted traffic identification method includes the following steps:

S1, preparing a data set and dividing it into a training set and a verification set, which are used respectively for training a model and for verifying the training result; the training set includes a large number of service messages, and each message records a quintuple, an application name, and a traffic type.

In this embodiment, the encrypted traffic identification method combines multi-flow association identification, single-packet identification, and single-flow multi-packet identification, so that it is applicable to any encrypted traffic and effectively improves the correct identification rate of encrypted traffic. It also supports real-time identification: encrypted traffic can be identified from the first packet, and the granularity of the identification result can be as fine as the traffic type of a service, such as browsing, video streaming, downloading, in-game battle traffic, or voice and video calls. The encrypted traffic may be based on a general-purpose encryption protocol such as HTTPS, QUIC, DTLS, or RTMFP, or may be privately encrypted traffic, such as that of arcade video, kuku video, League of Legends, or Xunlei.

Preferably, step S1 further includes extracting the start time T and the quintuple information of each encrypted connection, screening all HTTP connections generated by the same source IP within the time window [T - Δt, T], and recording the related information. The quintuple information is: source IP, source port, destination IP, destination port, and protocol. Each message in the training set typically stores the traffic of one operation, collected by automated dial testing, for example the 1800 s of non-HTTP traffic generated by playing a TV series on demand in the usu video application. Each message needs to record a quintuple, an application name, and a traffic type. In this embodiment, 80% of the traffic in the messages of each application is randomly extracted as the training set, and the remaining 20% is used as the verification set. The training result is verified by random sub-sampling validation.
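The screening of HTTP connections in the window [T - Δt, T] can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Connection` record, its `is_http` flag, and the function name `http_candidates` are hypothetical names introduced here, and start times are assumed to be plain seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    # Quintuple fields as listed in the text.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    # Assumed extras for this sketch: start time in seconds and a plaintext-HTTP flag.
    start_time: float
    is_http: bool


def http_candidates(encrypted: Connection,
                    connections: list[Connection],
                    delta_t: float) -> list[Connection]:
    """Return HTTP connections from the same source IP that started in [T - delta_t, T]."""
    t = encrypted.start_time
    return [
        c for c in connections
        if c.is_http
        and c.src_ip == encrypted.src_ip
        and t - delta_t <= c.start_time <= t
    ]
```

A caller would run this once per encrypted connection in the training set and keep the returned candidates, together with details such as domain name or URI, for the association step.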

Step S3 further includes generating a training result R according to the single-packet structure model and the flow model, where the training result R includes a service name, a traffic type, an associated HTTP connection, single-packet structure features, and flow statistical features. The traffic type is assigned according to the scenario or behavior that generates the traffic and may be browsing, video streaming, downloading, voice and video calls, in-game battle traffic, and so on. The single-packet structure model includes the positions and lengths within a single packet of fields such as the SessionID, PacketNumber, operating system, version number, and encryption algorithm, and the flow statistical features include the packet time interval distribution, the packet length distribution, the packet sending rate distribution, and the like.
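One possible in-memory shape for the training result R, together with a simple way of summarizing flow statistical features, is sketched below. The field names and the choice to summarize each distribution by its mean and standard deviation are assumptions made for illustration; the text does not specify how the distributions are represented.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class TrainingResult:
    # The five components of R named in the text; the field names are hypothetical.
    service_name: str
    traffic_type: str
    associated_http: str
    single_packet_features: dict = field(default_factory=dict)
    flow_stat_features: dict = field(default_factory=dict)


def flow_statistics(packets):
    """Summarize a flow given as [(timestamp_seconds, length_bytes), ...].

    Expects at least two packets in arrival order. Distributions are reduced
    to mean and population standard deviation (an assumption of this sketch).
    """
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    duration = times[-1] - times[0]
    return {
        "interval_mean": mean(intervals),
        "interval_std": pstdev(intervals),
        "length_mean": mean(lengths),
        "length_std": pstdev(lengths),
        "packets_per_second": len(packets) / duration if duration > 0 else 0.0,
    }
```

The dictionary returned by `flow_statistics` could be stored in `flow_stat_features`, while field offsets and lengths such as those of the SessionID would go into `single_packet_features`.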

Preferably, step S2 further includes an associated HTTP calculation policy configured to calculate a Jaccard coefficient for each HTTP connection. The Jaccard coefficient, also known as the Jaccard similarity coefficient, is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient value, the higher the sample similarity.

Given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of the union of A and B, as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

When both sets A and B are empty, J(A, B) is defined as 1. Here A represents the number of occurrences of service X-traffic type Y in the training set, B represents the number of occurrences of the HTTP connection across all service-traffic types in the training set, and A ∩ B represents the number of occurrences of the HTTP connection within service X-traffic type Y. The associated HTTP calculation policy includes selecting the HTTP connection with the largest Jaccard coefficient value as the associated HTTP connection of the encrypted connection.
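The association policy can be sketched in a few lines. The sketch works directly on occurrence counts, as the text does, and assumes the union count is obtained as the count for A plus the count for B minus the count for their intersection; the function names and the `candidates` mapping are hypothetical.

```python
def jaccard(count_a: int, count_b: int, count_both: int) -> float:
    """Jaccard coefficient from occurrence counts: intersection / union."""
    union = count_a + count_b - count_both
    if union == 0:
        return 1.0  # both sets empty: defined as 1 in the text
    return count_both / union


def pick_associated_http(count_a: int, candidates: dict) -> str:
    """Return the candidate HTTP key with the largest Jaccard coefficient.

    candidates maps an HTTP identifier (for example a domain name) to a
    (count_b, count_both) pair. Raises ValueError if candidates is empty.
    """
    return max(candidates, key=lambda key: jaccard(count_a, *candidates[key]))
```

For example, with `count_a = 10` and candidates `{"a.example": (12, 8), "b.example": (40, 9)}`, the first key is chosen because 8/14 exceeds 9/41.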

Preferably, in step S4, adjusting the corresponding training parameters includes adjusting the parameters of the single-packet structure model, adjusting the flow statistical features, changing the machine learning classification algorithm, adjusting the time Δt, or optimizing the training set; step S4 includes returning to step S3 after the corresponding training parameters are adjusted, until the correct identification rate is greater than the minimum identification rate requirement value P. The training parameters may be adjusted by changing the single-packet structure features or the flow statistical features, for example by removing the packet length and packet sending rate features and using a fixed fingerprint value and the packet time interval instead. Which features are used and which are adjusted is determined by the specific traffic identification requirements and by the implementer. Changing the machine learning algorithm means, for example, switching from SVM to XGBoost. Adjusting Δt means, for example, changing Δt from 1 s to 0.1 s. Optimizing the training set means, for example, giving every connection in the training set the same duration, which makes traffic classification more accurate.
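The adjust-and-retrain loop between steps S3 and S4 might look like the sketch below. It assumes the implementer supplies an ordered list of candidate configurations along with `train` and `evaluate` callables; all of these names are hypothetical, and the text does not prescribe how the next adjustment is chosen.

```python
def train_until_acceptable(configs, train, evaluate, p_min):
    """Try candidate configurations in order until the rate exceeds p_min.

    configs  : iterable of configuration dicts (for example different delta_t
               values, feature sets, or algorithm names)
    train    : callable(config) -> model, standing in for step S3
    evaluate : callable(model) -> correct identification rate in [0, 1],
               standing in for step S4
    Returns (config, model, rate) for the first acceptable configuration,
    or None if no candidate exceeds p_min.
    """
    for config in configs:
        model = train(config)
        rate = evaluate(model)
        if rate > p_min:  # strictly greater than P, as the text requires
            return config, model, rate
    return None
```

A caller could, for instance, pass `[{"delta_t": 1.0}, {"delta_t": 0.1}]` to mirror the adjustment of Δt from 1 s to 0.1 s mentioned above.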

As shown in FIG. 2, step S4 includes entering an application step if the correct identification rate is greater than the minimum identification rate requirement value P; the application step comprises steps S5 to S9 described above.

Specifically, the machine learning classification algorithm is a Bayesian classifier, a decision tree, or an SVM. The training set consists of vectors converted from the single-packet structure features and the flow statistical features of each connection, and the machine learning algorithm learns a set of models from these vectors, each model corresponding to one traffic type. When a new connection enters the identification system, it is converted into a vector and matched against the trained models to determine which traffic type it belongs to.
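To make the vector-to-traffic-type matching concrete, here is a small Gaussian naive Bayes classifier written in plain Python. It is only one possible reading of "Bayes" in the text and stands in for whatever library implementation an implementer would actually use; the class name, the variance floor, and the two-feature vectors in the usage note are all assumptions.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Per-class Gaussian model over numeric feature vectors."""

    def fit(self, vectors, labels):
        grouped = defaultdict(list)
        for vector, label in zip(vectors, labels):
            grouped[label].append(vector)
        total = len(vectors)
        self.params = {}
        for label, rows in grouped.items():
            columns = list(zip(*rows))
            means = [sum(col) / len(col) for col in columns]
            # Floor the variance so constant features do not divide by zero.
            variances = [
                max(sum((x - m) ** 2 for x in col) / len(col), 1e-9)
                for col, m in zip(columns, means)
            ]
            self.params[label] = (math.log(len(rows) / total), means, variances)
        return self

    def predict(self, vector):
        def log_posterior(label):
            log_prior, means, variances = self.params[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for x, m, var in zip(vector, means, variances)
            )
        # The traffic type whose model best explains the vector wins.
        return max(self.params, key=log_posterior)
```

With vectors of the assumed form `[mean packet length, mean inter-packet interval]`, a model fitted on a few "video" and "browsing" connections can then label a new connection by calling `predict` on its vector.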

FIG. 3 shows the complete flowchart of the present invention, which comprises a preparation stage, a training stage, and an application stage.

Preparation stage: each message in the training set typically stores the traffic of one operation, collected by automated dial testing, for example the 1800 s of non-HTTP traffic generated by playing a TV series on demand in the usu video application. Each message needs to record a quintuple, an application name, and a traffic type.

Then, 80% of the traffic in the messages of each application is randomly extracted as the training set, and the remaining 20% is used as the verification set. The training result is verified by random sub-sampling validation.
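A per-application 80/20 random split can be sketched as follows. The sketch assumes each record is a dict with an `application` key and that the split is taken over records within each application; both the record layout and the fixed seed are choices made here for illustration.

```python
import random


def split_per_application(messages, train_fraction=0.8, seed=0):
    """Randomly split records into training and verification sets per application."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    by_app = {}
    for message in messages:
        by_app.setdefault(message["application"], []).append(message)

    train, validation = [], []
    for items in by_app.values():
        shuffled = items[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        validation.extend(shuffled[cut:])
    return train, validation
```

Random sub-sampling validation would repeat this split with several different seeds and average the resulting identification rates.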

Training stage: for each encrypted connection,

1. Extract the start time T and the quintuple information. Then screen all HTTP connections generated by the same source IP within the time window [T - Δt, T] and record the related information, such as the domain name, User-Agent, and URI.

2. Calculate the Jaccard coefficient of each HTTP connection, where A represents the number of occurrences of service X-traffic type Y in the training set, B represents the number of occurrences of the HTTP connection across all service-traffic types in the training set, and A ∩ B represents the number of occurrences of the HTTP connection within service X-traffic type Y.

3. Select the HTTP connection with the largest Jaccard coefficient value as the associated HTTP connection of this encrypted connection.

The following operations are performed on the messages of the same application-traffic type:

a. Repeat steps 1 to 3 above for all encrypted connections to obtain the associated HTTP connection set L.

b. Learn the traffic models with a machine learning classification algorithm: the single-packet structure model and the flow model.

c. Save the training result R: the service name, traffic type, associated HTTP connection, single-packet structure features, and flow statistical features.

Operations a to c are performed on all messages in the training set to form the complete training result R.

The messages in the verification set are then identified by using the training result R, and the following is calculated for each service-traffic type C:

correct identification rate = number of connections identified as C / total number of connections in the class C messages of the verification set.

When the correct identification rate > P (0 ≤ P ≤ 1, where P is the minimum identification rate requirement determined by the actual scenario), the training result can be used in the application stage. When the correct identification rate ≤ P, the parameters of the single-packet structure model or the flow statistical features need to be adjusted, the machine learning classification algorithm changed, the time Δt adjusted, or the training set optimized, and the operations of the training stage repeated until the correct identification rate meets the requirement.
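The per-class correct identification rate and the comparison with P can be computed as in the sketch below. It assumes the numerator counts only class C connections that were identified as C, which is one reading of the formula above, and the function names are hypothetical.

```python
from collections import Counter


def identification_rates(true_labels, predicted_labels):
    """Correct identification rate per service-traffic type in the verification set."""
    totals = Counter(true_labels)
    correct = Counter(
        truth for truth, predicted in zip(true_labels, predicted_labels)
        if truth == predicted
    )
    # Counter returns 0 for classes that were never identified correctly.
    return {label: correct[label] / totals[label] for label in totals}


def meets_requirement(rates, p_min):
    """True only if every class rate is strictly greater than p_min."""
    return all(rate > p_min for rate in rates.values())
```

Requiring every class to exceed P is a conservative choice made for this sketch; an implementer could equally apply the threshold to an aggregate rate.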

Application stage: when traffic enters the device, the HTTP connections are screened first, and the following operations are performed on each HTTP connection:

1. Record the HTTP connection, denoted dL, its start time, denoted T1, and its quintuple.

2. Identify the connection dL as service A based on its payload features, which are application-layer fingerprint features.

3. Search the associated HTTP connection set (denoted L) for the connection dL. If it is not found, this HTTP connection is not associated with any encrypted connection in the training set, and the next HTTP connection can be analyzed. If it is found, perform step 4.

4. The service-traffic type associated with dL is X. Check whether an encrypted connection (denoted eL) from the same source IP occurs in the real-time traffic within the time [T1, T1 - Δt]. If not, the information of connection dL is no longer kept for identifying encrypted traffic. If so, perform step 5 on the first packet of the encrypted connection eL.

5. Judge whether the current packet data matches the single-packet structure model S and the flow statistical model features F of X in the training result R. If not, the encrypted connection eL can only be marked as unidentified traffic, and matching of further packet data stops.

6. If it matches, mark the encrypted connection eL as service-traffic type X, with the state "suspected".

While the number of packets sent ≤ N (N is the maximum number of packets examined, determined by the actual application scenario), operations 5 and 6 are performed on the current packet.

If the number of packets sent < N and the single-packet structure features and flow statistical features of X are not matched, the identification result of the encrypted connection eL is marked as unidentified or changed to unknown.

If the number of packets sent = N, the identification result of the encrypted connection is that of the Nth packet, and the identification state of the connection is confirmed.
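The per-packet part of the application stage (operations 5 and 6 and the N-packet rule) can be viewed as a small state machine, sketched below. The `matches` callback is a hypothetical stand-in for the comparison against the single-packet structure model S and the flow statistical features F; the lookup in L and the time-window check are assumed to have succeeded before this function is called.

```python
from enum import Enum


class State(Enum):
    UNIDENTIFIED = "unidentified"
    SUSPECTED = "suspected"
    CONFIRMED = "confirmed"


def identify_connection(packets, matches, label, n_max):
    """Classify an encrypted connection eL from its first n_max packets.

    packets : packets of eL seen so far, in arrival order
    matches : callable(packets_so_far) -> bool (hypothetical matcher)
    label   : the service-traffic type X associated with the HTTP connection dL
    Returns (state, result), where result is label or None.
    """
    state, result = State.UNIDENTIFIED, None
    for count in range(1, min(len(packets), n_max) + 1):
        if not matches(packets[:count]):
            # A mismatch at any packet stops matching and drops the label.
            return State.UNIDENTIFIED, None
        state, result = State.SUSPECTED, label
        if count == n_max:
            state = State.CONFIRMED  # the Nth packet fixes the result
    return state, result
```

In a streaming deployment the same logic would be driven packet by packet, so a connection carries the "suspected" label from its first matching packet and is confirmed once the Nth packet also matches.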

Working principle: the encrypted traffic identification method of the invention combines multi-flow association identification, single-packet identification, and single-flow multi-packet identification, so that it can be applied to any encrypted traffic and effectively improves the correct identification rate of encrypted traffic. It also supports real-time identification: encrypted traffic can be identified from the first packet, and the granularity of the identification result can be as fine as the traffic type of a service.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. An encrypted flow identification method based on an artificial intelligence algorithm is characterized by comprising the following steps:

2. The encrypted traffic identification method based on an artificial intelligence algorithm according to claim 1, wherein step S1 further comprises extracting the start time T and the quintuple information of each encrypted connection, screening all HTTP connections generated by the same source IP within the time window [T - Δt, T], and recording the related information.

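3. The encrypted traffic identification method based on an artificial intelligence algorithm according to claim 2, wherein step S2 further comprises an associated HTTP calculation policy configured to calculate a Jaccard coefficient for each HTTP connection, wherein A represents the number of occurrences of service X-traffic type Y in the training set, B represents the number of occurrences of the HTTP connection across all service-traffic types in the training set, and A ∩ B represents the number of occurrences of the HTTP connection within service X-traffic type Y, the Jaccard coefficient being calculated as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$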

4. The method for identifying encrypted traffic based on artificial intelligence algorithm according to claim 3, wherein said step S3 further comprises generating a training result R according to the single packet structure model and the stream model, wherein the training result R comprises the service name, the traffic type, the associated HTTP, the single packet structure feature and the stream statistical feature.

5. The method according to claim 4, wherein the single packet structure characteristics include, but are not limited to, packet length and fixed fingerprint value, and the flow statistical characteristics include, but are not limited to, packet time interval distribution, packet length distribution, and packet sending rate distribution.

6. The encrypted traffic identification method based on an artificial intelligence algorithm according to claim 5, wherein in step S4, adjusting the corresponding training parameters includes adjusting the parameters of the single-packet structure model, adjusting the flow statistical features, changing the machine learning classification algorithm, adjusting the time Δt, or optimizing the training set; and step S4 includes returning to step S3 after the corresponding training parameters are adjusted, until the correct identification rate is greater than the minimum identification rate requirement value P.

7. The encrypted traffic identification method based on an artificial intelligence algorithm according to claim 6, wherein step S1 includes randomly extracting 80% of the traffic in the messages of each application as the training set and using the remaining 20% as the verification set, wherein each message records quintuple information, an application name, and a traffic type.

8. The method for identifying encrypted traffic based on artificial intelligence algorithm according to any of claims 4-7, wherein said step S4 includes entering into an application step if the correct identification rate is greater than the minimum identification rate requirement value P, said application step includes:

9. The method for identifying encrypted traffic based on artificial intelligence algorithm according to claim 8, wherein said step S7 includes analyzing the first packet structure characteristic and the first packet flow statistical characteristic of the encrypted connection eL according to the characteristic associated with the HTTP connection dL, and determining whether the first packet data matches the single packet structure characteristic and the flow statistical characteristic of X in the training result R.

10. The encrypted traffic identification method based on an artificial intelligence algorithm according to claim 9, wherein the machine learning classification algorithm is a Bayesian classifier, a decision tree, or an SVM.