CN111245860A - Encrypted malicious flow detection method and system based on two-dimensional characteristics - Google Patents


Publication number
CN111245860A
Authority
CN
China
Prior art keywords
flow
packet
malicious traffic
quintuple
bin
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010066830.XA
Other languages
Chinese (zh)
Inventor
周志洪
姚立红
胡斌
银鹰
李建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010066830.XA
Publication of CN111245860A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/16 Implementing security features at a particular protocol layer
    • H04L63/166 Implementing security features at a particular protocol layer at the transport layer

Abstract

The invention discloses a method and a system for detecting encrypted malicious traffic based on two-dimensional features. The method comprises the following steps: step 1, merging data packets with the same quintuple into bidirectional session flows; step 2, masking the quintuple features; step 3, extracting the message load features of the session flow; step 4, extracting the flow fingerprint features of the session flow; step 5, integrating and normalizing the features extracted in steps 3 and 4; and step 6, classifying malicious traffic using a logistic regression machine learning model. The method addresses the drop in detection accuracy for encrypted malicious traffic that is caused by unstable quintuple features under large-scale complex network conditions. Experimental results show that, without relying on the quintuple, the method achieves a detection accuracy of 97.86 percent for SSL/TLS-encrypted malicious traffic in a complex network environment, which is 34.45 percent higher than that of the traditional method.

Description

Encrypted malicious flow detection method and system based on two-dimensional characteristics
Technical Field
The invention relates to the intersection of network security and machine learning, and in particular to a method and a system for detecting encrypted malicious traffic based on two-dimensional features.
Background
To protect secure communications between users and enterprises, website traffic encryption has become a mainstream measure, and the application of SSL/TLS (secure socket layer/transport layer security) protocol is the main means for encrypting such traffic. Encrypted traffic can protect the confidentiality and integrity of private information to some extent, but also provides shelter to malicious activities on the network.
At present, encrypted malicious traffic is mainly detected with supervised learning methods. However, existing methods often use only one kind of feature and cannot detect malicious traffic whose domain names are updated at a very high frequency. In a complex network environment with complex quintuple information, using quintuple information that changes frequently along with the malicious traffic as an important feature degrades the identification accuracy of the model. If the quintuple features of the traffic are removed and the same methods are applied again, the recognition rate for encrypted malicious traffic drops sharply.
The invention therefore provides an encrypted malicious traffic identification method that, during data preprocessing, divides the features of encrypted traffic into two dimensions: message load and flow fingerprint. With the quintuple information avoided, the position of each flow is described by its message load and flow fingerprint, and training and prediction are carried out with a logistic regression machine learning model.
Disclosure of Invention
The invention addresses the following problem: in a complex network environment with many sources of malicious traffic, the network-layer features of the traffic are diverse, the quintuple features no longer follow any regular pattern, and the detection rate of traditional methods drops. The invention therefore provides an SSL/TLS encrypted malicious traffic detection method that does not depend on the quintuple features of the traffic. It groups the many features of the traffic into a combination of message load features and flow fingerprint features, so that the traffic has more distinguishing features in a complex network environment. Each flow is described from these two dimensions and classified with a logistic regression model, which realizes the detection of SSL/TLS-encrypted malicious traffic in a complex network environment.
The invention provides an encrypted malicious flow detection method based on two-dimensional characteristics, which extracts message load characteristics and flow fingerprint characteristics of monitored encrypted flow and identifies malicious flow on the basis of the message load characteristics and the flow fingerprint characteristics.
Further, the detection method comprises the following steps:
step 1, merging data packets of the same quintuple into bidirectional conversation flow;
step 2, shielding the quintuple characteristics;
step 3, extracting the message load characteristics of the session flow;
step 4, extracting the stream fingerprint characteristics of the session flow;
step 5, integrating and normalizing the features extracted in steps 3 and 4;
step 6, classifying malicious traffic using a logistic regression machine learning model.
Further, the step 1 comprises:
step 1.1, merging traffic with the same quintuple into a session, wherein the quintuple refers to the source IP address, destination IP address, source port, destination port, and protocol;
step 1.2, merging into one bidirectional flow the sessions in which the source IP address of the incoming flow is the same as the destination IP address of the outgoing flow, the destination IP address of the incoming flow is the same as the source IP address of the outgoing flow, the source port of the incoming flow is the same as the destination port of the outgoing flow, the destination port of the incoming flow is the same as the source port of the outgoing flow, and the protocol of the incoming flow is the same as the protocol of the outgoing flow.
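Steps 1.1 and 1.2 can be illustrated with a minimal sketch. The `Packet` record, its field names, and the functions `session_key` and `merge_bidirectional` are hypothetical and are not part of the original disclosure. The sketch assumes the packets have already been parsed, and it gives both directions of a connection the same key by ordering the two (IP address, port) endpoints.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical parsed packet record (field names are assumptions)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    timestamp: float
    payload: bytes


def session_key(pkt: Packet) -> Tuple:
    """Return a direction-independent key for the packet's quintuple."""
    endpoint_a = (pkt.src_ip, pkt.src_port)
    endpoint_b = (pkt.dst_ip, pkt.dst_port)
    # Sorting the endpoints makes A->B and B->A map to the same key.
    first, second = sorted((endpoint_a, endpoint_b))
    return (first, second, pkt.protocol)


def merge_bidirectional(packets: List[Packet]) -> Dict[Tuple, List[Packet]]:
    """Group packets into bidirectional sessions, ordered by timestamp."""
    sessions: Dict[Tuple, List[Packet]] = defaultdict(list)
    for pkt in packets:
        sessions[session_key(pkt)].append(pkt)
    for flow in sessions.values():
        flow.sort(key=lambda p: p.timestamp)
    return dict(sessions)
```

Ordering the endpoints avoids a separate pass that matches each incoming flow with its outgoing counterpart; other matching strategies would serve equally well.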
Further, in step 2, the data in the IP address, port, and protocol fields of the session traffic is replaced with all zeros.
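The masking in step 2 might look as follows at the byte level. This is only a sketch under stated assumptions: the function name `mask_quintuple` is hypothetical, the input is assumed to be a raw IPv4 packet without a link-layer header, and the transport header is assumed to be TCP or UDP, so that the two ports occupy its first four bytes. Checksums are not recomputed.

```python
def mask_quintuple(ip_packet: bytes) -> bytes:
    """Zero the protocol, IP address, and port fields of a raw IPv4 packet."""
    buf = bytearray(ip_packet)
    if len(buf) < 20:
        raise ValueError("packet shorter than a minimal IPv4 header")
    header_len = (buf[0] & 0x0F) * 4   # IHL field, in bytes
    buf[9] = 0                         # protocol field
    buf[12:20] = bytes(8)              # source and destination IP addresses
    if len(buf) >= header_len + 4:
        # Source and destination ports of the TCP/UDP header.
        buf[header_len:header_len + 4] = bytes(4)
    return bytes(buf)
```

All other bytes, including the TLS records carried in the payload, are left unchanged, which matches the requirement that only the quintuple fields are filled with zeros.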
Further, in step 3, five elements of the ClientHello and ServerHello messages are selected as the message load features: TLSVersion (protocol version), Ciphers (supported cipher suites), Extensions (extension fields), EllipticCurves (elliptic curve cipher), and EllipticCurvePointFormat (elliptic curve cipher format). The combined data of the five elements forms a dedicated fingerprint array $X = [x_1, x_2, x_3, x_4, x_5]$, where
$x_1$: the code of the protocol version,
$x_2$: the codes of all supported cipher suites,
$x_3$: the codes of all extension fields,
$x_4$: the code of the elliptic curve cipher type,
$x_5$: the code of the elliptic curve cipher format.
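One possible way to assemble the five-element array is sketched below. The text does not state how each "code" is represented, so the representation here is an assumption: each element is the list of numeric identifiers taken from the Hello message, joined into a string. `HelloFields`, `join_codes`, and `fingerprint_array` are hypothetical names, and parsing the Hello message itself is outside the sketch.

```python
from typing import List, NamedTuple


class HelloFields(NamedTuple):
    """Hypothetical container for values already parsed from a Hello message."""
    tls_version: int
    ciphers: List[int]
    extensions: List[int]
    elliptic_curves: List[int]
    ec_point_formats: List[int]


def join_codes(codes: List[int]) -> str:
    """Join numeric identifiers into one dash-separated code string."""
    return "-".join(str(code) for code in codes)


def fingerprint_array(hello: HelloFields) -> List[str]:
    """Build X = [x1, x2, x3, x4, x5] from the five selected elements."""
    return [
        str(hello.tls_version),              # x1: protocol version
        join_codes(hello.ciphers),           # x2: supported cipher suites
        join_codes(hello.extensions),        # x3: extension fields
        join_codes(hello.elliptic_curves),   # x4: elliptic curves
        join_codes(hello.ec_point_formats),  # x5: point formats
    ]
```

Before classification, each string would still have to be mapped to a numeric value, for example through a lookup table built from the training set; that mapping is likewise an assumption.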
Further, in step 4, the packet lengths, the packet inter-arrival times, and the byte distribution data are extracted as the flow fingerprint features.
Further, regarding the packet length, the lengths of all packets in each session are discretized into equal-size windows of N bytes: packets with a length in [0, N) bytes are placed into the first bin, packets with a length in [N, 2N) bytes are placed into the second bin, and so on. A matrix A is then constructed, in which each element A[i, j] counts the number of times a packet in the i-th bin is followed by a packet in the j-th bin. Finally, each row of A is normalized so that the rows describe a Markov chain, and the result is used as the packet length feature of the session.
Further, regarding the packet inter-arrival time, the inter-arrival times of all packets in each session are discretized into equal-size windows of T milliseconds: packets with an inter-arrival time in [0, T) milliseconds are placed into the first bin, packets with an inter-arrival time in [T, 2T) milliseconds are placed into the second bin, and so on. A matrix B is then constructed, in which each element B[i, j] counts the number of times a packet in the i-th bin is followed by a packet in the j-th bin. Finally, each row of B is normalized so that the rows describe a Markov chain, and the result is used as the packet inter-arrival time feature of the session.
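The construction of the matrices A and B can be sketched as follows. The function names and the `num_bins` parameter are assumptions: the text does not state how many bins are kept, so the sketch places every value beyond the last window into the final bin.

```python
from typing import List, Sequence


def inter_arrival_ms(timestamps: Sequence[float]) -> List[float]:
    """Convert packet timestamps (seconds) into inter-arrival times (ms)."""
    return [(later - earlier) * 1000.0
            for earlier, later in zip(timestamps, timestamps[1:])]


def transition_matrix(values: Sequence[float], bin_width: float,
                      num_bins: int) -> List[List[float]]:
    """Count bin-to-bin transitions and normalize each row to sum to 1."""
    # Bin index of each value; overflow values go into the last bin.
    bins = [min(int(v // bin_width), num_bins - 1) for v in values]
    matrix = [[0.0] * num_bins for _ in range(num_bins)]
    for i, j in zip(bins, bins[1:]):
        matrix[i][j] += 1.0
    for row in matrix:
        total = sum(row)
        if total > 0:
            row[:] = [count / total for count in row]
    return matrix
```

For example, `transition_matrix(lengths, 150, 10)` would yield a 10 x 10 packet length matrix and `transition_matrix(inter_arrival_ms(times), 50, 10)` the corresponding inter-arrival matrix. Rows for bins that never occur are left as zeros.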
Further, the byte distribution is an array of length 256 that counts the occurrences of each byte value in the payloads of the packets in the flow. Dividing each count by the total number of bytes found in the packet payloads gives the probability of each byte value occurring. The byte distributions of different applications provide a large amount of information about how the application data is encoded. In addition, the byte distribution can indicate the ratio of the SSL/TLS handshake packet load to the entire flow, the byte composition of the handshake information, and whether any improperly implemented padding has been added.
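The byte distribution described above can be computed with a short routine. This is a sketch only; the function name `byte_distribution` is hypothetical, and the payloads are assumed to be available as `bytes` objects.

```python
from typing import Iterable, List


def byte_distribution(payloads: Iterable[bytes]) -> List[float]:
    """Return the relative frequency of each byte value (0..255) in a flow."""
    counts = [0] * 256
    total = 0
    for payload in payloads:
        for value in payload:
            counts[value] += 1
        total += len(payload)
    if total == 0:
        # A flow without payload bytes yields an all-zero distribution.
        return [0.0] * 256
    return [count / total for count in counts]
```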
Further, the packet length data, the packet inter-arrival time data, and the byte distribution data are combined to form the flow fingerprint feature of each flow.
The invention further provides an encrypted malicious traffic detection system based on two-dimensional features, comprising:
an SSL/TLS traffic extraction module, which captures traffic data from the network;
a bidirectional streaming module, which executes step 1;
a quintuple feature fuzzification processing module, which executes step 2;
a message load feature extraction module, which executes step 3;
a flow fingerprint feature extraction module, which executes step 4;
a logistic regression analysis module, which executes steps 5 and 6;
and a classification result output module, which outputs the classification result for the traffic.
The invention has the beneficial effect of addressing the drop in detection accuracy for encrypted malicious traffic that is caused by unstable quintuple features under large-scale complex network conditions. Experimental results show that, without relying on the quintuple, the method achieves a detection accuracy of 97.86 percent for SSL/TLS-encrypted malicious traffic in a complex network environment, which is 34.45 percent higher than that of the traditional method.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is a diagram of the steps of a detection method according to one embodiment of the present application.
FIG. 2 is a block diagram of a detection system according to an embodiment of the present application.
FIG. 3 is a flowchart of an SSL/TLS protocol referenced when extracting a message payload according to an embodiment of the present application.
FIG. 4 shows the values of the evaluation indices obtained in a single network environment and in a complex network environment by testing on a known data set in an embodiment of the present application.
FIG. 5 is a two-dimensional graph of different session flows in a single network environment, tested with known data sets, in one embodiment of the present application.
FIG. 6 is a two-dimensional graph of different session flows in a complex network environment, tested with a known data set, according to an embodiment of the present application.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
In a complex network environment with complex quintuple information, using quintuple information that changes frequently along with the malicious traffic as an important feature degrades the identification accuracy of the model. If the quintuple features of the traffic are removed and the same methods are applied again, the recognition rate for encrypted malicious traffic drops sharply.
Therefore, during data preprocessing, the features of the encrypted traffic are divided into two dimensions: message load and flow fingerprint. With the quintuple information masked, the position of each flow is described by its message load and flow fingerprint, and training and detection are carried out with a logistic regression machine learning model.
Example one:
As shown in FIG. 1, in an embodiment of the detection method proposed by the present invention, a logistic regression model is trained on a known offline data set, and the trained model is then tested on known offline data with different content to measure the detection accuracy.
The specific steps of the training process are as follows:
Step 1: offline PCAP files that have been identified as malicious traffic are used as the malicious part of the training data set. The malicious traffic is further classified as encrypted communication of malicious behaviors such as scanning and probing, brute-force cracking, and C&C communication (for example Neris, Rbot, Virut, Menti, Soguo, Murlo, and NSIS.ay). Offline PCAP files that have been identified as benign traffic are used as the benign part of the training data set. Both data sets consist of encrypted SSL/TLS traffic;
Step 2: data preprocessing is carried out on each of the two data sets;
Step 2 further comprises the following substeps:
Step 2.1: splitting the traffic according to the quintuple of source IP address, destination IP address, source port, destination port, and protocol;
Step 2.2: merging traffic with the same quintuple into a session;
Step 2.3: merging into one bidirectional flow the sessions in which the source IP address of the incoming flow is the same as the destination IP address of the outgoing flow, the destination IP address of the incoming flow is the same as the source IP address of the outgoing flow, the source port of the incoming flow is the same as the destination port of the outgoing flow, the destination port of the incoming flow is the same as the source port of the outgoing flow, and the protocol of the incoming flow is the same as the protocol of the outgoing flow;
Step 2.4: masking the quintuple information of all sessions, that is, replacing the source IP address, destination IP address, source port, destination port, and protocol with hexadecimal 00 while keeping the rest of the session information unchanged;
Step 3: extracting the message load features of the session traffic. The content comprises: TLSVersion (protocol version), Ciphers (supported cipher suites), Extensions (extension fields), EllipticCurves (elliptic curve cipher), and EllipticCurvePointFormat (elliptic curve cipher format). The data of the five elements is combined into a dedicated fingerprint array:
$X = [x_1, x_2, x_3, x_4, x_5]$
where $x_1$ is the code of the protocol version; $x_2$ is the codes of all supported cipher suites; $x_3$ is the codes of all extension fields; $x_4$ is the code of the elliptic curve cipher type; and $x_5$ is the code of the elliptic curve cipher format.
Step 4: extracting the flow fingerprint features of the session traffic, comprising the packet length, the packet inter-arrival time, and the byte distribution data, which provides information about how the application data is encoded;
Packet length and packet inter-arrival time are modeled as Markov chains.
For packet length, the lengths of all packets in each session are discretized into equal-size windows of 150 bytes: packets with a length in [0, 150) bytes are placed into the first bin, packets with a length in [150, 300) bytes are placed into the second bin, and so on;
a matrix A is then constructed, in which each element A[i, j] counts the number of times a packet in the i-th bin is followed by a packet in the j-th bin;
finally, each row of A is normalized so that the rows describe a Markov chain, and the result is used as the packet length feature of the session, denoted $A_{\text{packet length}}$.
For packet inter-arrival time, the inter-arrival times of all packets in each session are discretized into equal-size windows of 50 milliseconds: packets with an inter-arrival time in [0, 50) milliseconds are placed into the first bin, packets with an inter-arrival time in [50, 100) milliseconds are placed into the second bin, and so on;
a matrix B is then constructed, in which each element B[i, j] counts the number of times a packet in the i-th bin is followed by a packet in the j-th bin;
finally, each row of B is normalized so that the rows describe a Markov chain, and the result is used as the packet inter-arrival time feature of the session, denoted $B_{\text{inter-arrival time}}$.
Byte distribution: the byte distribution is an array of length 256 that counts the occurrences of each byte value in the payloads of the packets in the flow. Dividing each count by the total number of bytes found in the packet payloads gives the probability of each byte value occurring. The byte distributions of different applications provide a large amount of information about how the application data is encoded. This feature is denoted $C_{\text{byte distribution}}$.
The combined data of the above items forms the flow fingerprint feature of the session:
$Y_{\text{flow}} = [A_{\text{packet length}}, B_{\text{inter-arrival time}}, C_{\text{byte distribution}}]$;
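How the three components are laid out as one vector, and how the features are normalized before classification, is not spelled out in the text. The sketch below assumes that the two matrices are flattened row by row and concatenated with the byte distribution, and that each feature column is min-max scaled to the range 0 to 1 across the sessions of a data set. Both function names are hypothetical.

```python
from typing import List, Sequence


def flow_fingerprint(a: Sequence[Sequence[float]],
                     b: Sequence[Sequence[float]],
                     c: Sequence[float]) -> List[float]:
    """Flatten matrices A and B row by row and append distribution C."""
    y: List[float] = []
    for matrix in (a, b):
        for row in matrix:
            y.extend(row)
    y.extend(c)
    return y


def min_max_scale(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every feature column to [0, 1]; constant columns become 0."""
    if not samples:
        return []
    columns = list(zip(*samples))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(sample, lows, highs)]
        for sample in samples
    ]
```

With 10 bins per matrix, the assumed layout would give a vector of 100 + 100 + 256 = 456 values per session; the bin count is an assumption, not a figure from the text.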
Step 5: after the message load features and the flow fingerprint features are normalized, the session flows are labeled and input into a logistic regression machine learning model. The logistic regression machine learning model is configured as follows:
the regularization type is 'l1', the tolerance for the iteration stopping criterion is '1e-4', the inverse of the regularization strength is '1.0', the weights of all classes are '1', the algorithm is 'liblinear', the number of iterations is '100', and the loss function uses 'Sigmoid'
Figure BDA0002376220280000051
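The configuration values listed above correspond closely to keyword arguments of scikit-learn's `LogisticRegression`. The text does not name the library, so this mapping is an assumption, as are the names `LOGREG_PARAMS` and `build_classifier`. A sketch of the configuration:

```python
# Keyword arguments mirroring the configuration described in the text.
# The mapping to scikit-learn parameter names is an assumption.
LOGREG_PARAMS = {
    "penalty": "l1",         # regularization type
    "tol": 1e-4,             # tolerance for the stopping criterion
    "C": 1.0,                # inverse of the regularization strength
    "class_weight": None,    # None gives every class a weight of 1
    "solver": "liblinear",   # optimization algorithm
    "max_iter": 100,         # maximum number of iterations
}


def build_classifier():
    """Create the classifier; requires scikit-learn to be installed."""
    from sklearn.linear_model import LogisticRegression
    return LogisticRegression(**LOGREG_PARAMS)
```

Under this assumption, training and prediction would follow the usual pattern of `clf = build_classifier()`, `clf.fit(X_train, y_train)`, and `clf.predict(X_test)`.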
Step 6: the training process is completed and the model is saved.
Next, testing the model stored after training, specifically including the following steps:
Step 1: offline PCAP files that have been identified as malicious traffic are used as the malicious part of the test data set. The malicious traffic is further classified as encrypted communication of malicious behaviors such as scanning and probing, brute-force cracking, and C&C communication (for example Neris, Rbot, Virut, Menti, Soguo, Murlo, and NSIS.ay). Offline PCAP files that have been identified as benign traffic are used as the benign part of the test data set. Both data sets consist of encrypted SSL/TLS traffic;
Steps 2 to 4 are the same as steps 2 to 4 of the training process;
Step 5: after the message load features and the flow fingerprint features are normalized, they are input into the trained logistic regression model to obtain its output classification result, which is compared with the actual classification to measure the accuracy of the model.
Example two:
As shown in FIG. 2, an embodiment of the detection system provided by the present invention includes the following modules: an SSL/TLS traffic extraction module, a bidirectional streaming module, a quintuple feature fuzzification processing module, a message load feature extraction module, a flow fingerprint feature extraction module, a logistic regression classifier module, and a classification result output module. The detection system is trained and tested, and the modules operate as follows:
SSL/TLS traffic extraction module: the test traffic is replayed on the server network card using tcprep, and packets are captured on the server network card using tshark, with only SSL/TLS traffic being captured.
Bidirectional streaming module: the traffic is split according to the quintuple of source IP address, destination IP address, source port, destination port, and protocol. Traffic with the same quintuple is then merged into a session.
Quintuple feature fuzzification processing module: the quintuple information of all sessions is masked, that is, the source IP address, destination IP address, source port, destination port, and protocol are replaced with hexadecimal 00 while the rest of the session information is kept unchanged.
Message load feature extraction module: the structure of the SSL/TLS protocol is shown in FIG. 3, and its principle is as follows:
After a TLS session is initiated, the client sends a ClientHello packet to the server; how this packet is generated depends on the software package and method used to build the client application. If the connection is accepted, the server creates a ServerHello packet in response, using the server library, its configuration, and the details in the ClientHello message. The server then sends Certificate, ServerKeyExchange, and ServerHelloDone to complete its part of the exchange. After receiving these messages, the client uses the public key in the Certificate to exchange the session key via ClientKeyExchange, then sends ChangeCipherSpec to indicate that all subsequent messages are encrypted, and ends with Finished. After receiving these messages, the server sends messages of the same kind as confirmation. Application data is then sent and received according to the previously negotiated SSL protocol parameters;
The message content of the handshake negotiation stage is plaintext, whereas the content of the application data transmission stage is ciphertext. The details in the Hello packets can therefore be used to fingerprint the client application at the message content level, and the message load features of the session flow are extracted from the following fields: Version, Ciphers (supported cipher suites), Extensions (extension fields), EllipticCurves (elliptic curve cipher), and EllipticCurvePointFormat;
The combined data of the five elements forms a dedicated fingerprint array:
$X = [x_1, x_2, x_3, x_4, x_5]$
where $x_1$ is the code of the protocol version; $x_2$ is the codes of all supported cipher suites; $x_3$ is the codes of all extension fields; $x_4$ is the code of the elliptic curve cipher type; and $x_5$ is the code of the elliptic curve cipher format.
Flow fingerprint feature extraction module: extracts the flow fingerprint features of the session traffic, comprising:
Packet length and packet inter-arrival time: modeled as Markov chains. The values are discretized into equal-size windows. For packet length, a window of 150 bytes is used: packets with a length in [0, 150) bytes are placed into the first bin, packets with a length in [150, 300) bytes are placed into the second bin, and so on. A matrix A is then constructed, from which the transition probability between the i-th bin and the j-th bin is calculated as element A[i, j]. Finally, A is normalized to ensure that a proper Markov chain is obtained, and A is used as the packet length feature, denoted $A_{\text{packet length}}$.
For packet inter-arrival time, the inter-arrival times of all packets in each session are discretized into equal-size windows of 50 milliseconds: packets with an inter-arrival time in [0, 50) milliseconds are placed into the first bin, packets with an inter-arrival time in [50, 100) milliseconds are placed into the second bin, and so on;
a matrix B is then constructed, in which each element B[i, j] counts the number of times a packet in the i-th bin is followed by a packet in the j-th bin;
finally, each row of B is normalized so that the rows describe a Markov chain, and the result is used as the packet inter-arrival time feature of the session, denoted $B_{\text{inter-arrival time}}$.
Byte distribution: the byte distribution is an array of length 256 that counts the occurrences of each byte value in the payloads of the packets in the flow. Dividing each count by the total number of bytes found in the packet payloads gives the probability of each byte value occurring. The byte distributions of different applications provide a large amount of information about how the application data is encoded. This feature is denoted $C_{\text{byte distribution}}$.
The combined data of the above items forms the flow fingerprint feature of the session:
$Y_{\text{flow}} = [A_{\text{packet length}}, B_{\text{inter-arrival time}}, C_{\text{byte distribution}}]$.
Logistic regression analysis module: after the message load features and the flow fingerprint features are normalized, the session flows are labeled and input into a logistic regression machine learning model. The logistic regression machine learning model is configured as follows:
the regularization type is 'l1', the tolerance for the iteration stopping criterion is '1e-4', the inverse of the regularization strength is '1.0', the weights of all classes are '1', the algorithm is 'liblinear', the number of iterations is '100', and the loss function uses 'Sigmoid'
Figure BDA0002376220280000071
Classification result output module: while the model is being tested, its performance needs to be evaluated. The classification outcomes used for evaluation in the present invention fall into four categories: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN);
Figure BDA0002376220280000072
Accuracy (denoted A) is defined as the ratio of the number of correctly classified samples to the total number of samples, that is,
$A = \dfrac{TP + TN}{TP + TN + FP + FN}$
In addition, the invention also uses the precision rate and the recall rate as evaluation indexes; precision and recall represent the ability of the classifier to work on each category; the accuracy reflects the overall performance of the classifier. F1-measure (denoted as F1) is an evaluation index combining precision and recall.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · (Precision · Recall) / (Precision + Recall)
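The four evaluation indices can be computed directly from the TP, FP, TN, and FN counts. The following sketch uses a hypothetical function name, `evaluate`, and returns 0.0 for any ratio whose denominator is zero, which is a convention chosen here rather than one stated in the text.

```python
from typing import Dict


def evaluate(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```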
The Czech Technical University (CTU) data set is used to test the method and the system. The SSL traffic generated by C&C communication in the CTU13 data set is extracted, giving a total of 0.698 GB of SSL traffic, and the normal traffic data set is 0.76 GB in size. The sizes of the positive and negative data sets therefore keep the training data balanced.
Comparative experiments were performed on the data sets in both a single network environment and a complex network environment, giving the results shown in the following table and in FIGS. 4 to 6:
Figure BDA0002376220280000082
In this embodiment, all features of the traffic are grouped into message load features and flow fingerprint features, so that the traffic is characterized from two dimensions, and the classifier model is trained on the features of both dimensions. The resulting identification accuracy exceeds 97% in both the single network environment and the complex network environment, and the F1-measure also exceeds 97%.
Example three:
An embodiment of the detection system provided by the invention comprises the following modules: an SSL/TLS traffic extraction module, a bidirectional streaming module, a quintuple feature fuzzification processing module, a message load feature extraction module, a flow fingerprint feature extraction module, a logistic regression classifier module, and a classification result output module. The detection system is used for real-time traffic detection, and the modules operate as follows:
SSL/TLS traffic extraction module: packets are captured on the server network card using tshark to obtain real-time SSL/TLS traffic.
Bidirectional streaming module: same as the module of the same name in Example two;
Quintuple feature fuzzification processing module: same as the module of the same name in Example two;
Message load feature extraction module: same as the module of the same name in Example two;
Logistic regression classifier module: same as the module of the same name in Example two;
a classification result output module: outputting the judgment of the flow, and finally forming a flow classification result, wherein the flow classification result comprises the following steps: scanning for encrypted communication of malicious behaviors such as detection, brute force cracking, C & C communication and the like (such as Neris, Rbot, Virut, Menti, Soguo, Murlo, NSIS. ay and the like).
In normal traffic under the complex network environment, the SSL/TLS communication comes from many different websites with different SSL certificates, so TLSVersion, Ciphers, Extensions, EllipticCurves and EllipticCurvePointFormat vary widely and their normalized values are spread across the range 0 to 1. Malicious traffic, by contrast, cannot obtain legitimate SSL certificates through regular channels and can only use older SSL/TLS protocol versions, with few supported cipher suites and extensions, so its normalized values fall within a limited region.
The embodiment of the invention preprocesses the encrypted traffic, masks the IP and port information, and groups the selected features into message load features and flow fingerprint features, describing each flow from two dimensions to meet the training requirements of the model. The results of the embodiment show that in traditional encrypted traffic research, quintuple features, which gradually lose effectiveness as malicious behaviors evolve, carry a large share of the total classification weight, and this reduces the model's identification accuracy on traffic from a complex network environment. The proposed identification method based on message load and flow fingerprint features solves this problem, and because the model only needs to be trained on traffic from a single network environment, it is more widely applicable.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A method for detecting encrypted malicious traffic based on two-dimensional features is characterized in that message load features and flow fingerprint features of monitored encrypted traffic are extracted, and malicious traffic is identified on the basis of the message load features and the flow fingerprint features.
2. The encrypted malicious traffic detection method based on the two-dimensional features as claimed in claim 1, characterized by comprising the following steps:
step 1, merging data packets of the same quintuple into a bidirectional session flow;
step 2, masking the quintuple features;
step 3, extracting the message load features of the session flow;
step 4, extracting the flow fingerprint features of the session flow;
step 5, integrating and standardizing the flow features extracted in steps 3 and 4;
step 6, classifying malicious traffic using a logistic regression machine learning model.
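Steps 5 and 6 can be illustrated with a minimal sketch that uses only the Python standard library. Everything below is an assumption made for illustration: the function names (`standardize`, `train_logreg`, `predict`), the z-score form of standardization, the batch gradient-descent training, the hyperparameters, and the toy two-column feature rows are not taken from the patent, which only states that the features are standardized and fed to a logistic regression model.

```python
import math

def standardize(rows):
    """Z-score every column; a constant column maps to 0.0 (assumed scheme)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (malicious) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical rows: [one message load value, one flow fingerprint value].
features = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.85], [0.1, 0.2], [0.2, 0.1], [0.15, 0.25]]
labels = [0, 0, 0, 1, 1, 1]  # 0 = normal, 1 = malicious (toy labels)

X = standardize(features)
w, b = train_logreg(X, labels)
predictions = [predict(w, b, x) for x in X]
```

In practice the feature rows would be the concatenation of the message load features and the flow fingerprint features of each session, and a library implementation of logistic regression could replace the hand-written training loop.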
3. The encrypted malicious traffic detection method based on the two-dimensional features as claimed in claim 2, wherein the step 1 comprises:
step 1.1, merging packets having the same five-tuple into a session, wherein the five-tuple consists of a source IP address, a destination IP address, a source port, a destination port and a protocol;
step 1.2, merging the sessions, in which the source IP address of the inflow flow is the same as the destination IP address of the outflow flow, the destination IP address of the inflow flow is the same as the source IP address of the outflow flow, the source port of the inflow flow is the same as the destination port of the outflow flow, the destination port of the inflow flow is the same as the source port of the outflow flow, and the protocol of the inflow flow is the same as the protocol of the outflow flow, into a bidirectional flow.
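The merging rule of steps 1.1 and 1.2 can be sketched with a direction-independent key. The `Packet` type, its field names, the helper names `flow_key` and `merge_bidirectional`, and the sample addresses are hypothetical and chosen only for illustration; sorting the two endpoints is one possible way to make both directions of a session map to the same key.

```python
from collections import defaultdict
from typing import NamedTuple

class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    length: int

def flow_key(pkt):
    """Build a key that is identical for a packet and for its reply.

    The two (IP, port) endpoints are sorted, so swapping source and
    destination does not change the key.
    """
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (min(a, b), max(a, b), pkt.protocol)

def merge_bidirectional(packets):
    """Group packets into bidirectional flows, preserving arrival order."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)

packets = [
    Packet("192.0.2.10", "198.51.100.5", 50000, 443, "TCP", 517),   # client to server
    Packet("198.51.100.5", "192.0.2.10", 443, 50000, "TCP", 1400),  # reply, same flow
    Packet("192.0.2.11", "198.51.100.5", 50001, 443, "TCP", 517),   # different client
]
flows = merge_bidirectional(packets)
```

Here the first two packets end up in one bidirectional flow and the third packet forms a second flow.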
4. The method according to claim 2, wherein in step 2, the data in the IP, port and protocol fields of the session traffic is replaced with all zeros.
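A minimal sketch of this masking step is given below. The record layout (a dictionary with the keys `src_ip`, `dst_ip`, `src_port`, `dst_port`, `protocol`) and the function name `mask_five_tuple` are assumptions for illustration, as is the choice to represent a zeroed IP address as `"0.0.0.0"` and a zeroed port or protocol as the integer 0.

```python
def mask_five_tuple(record):
    """Return a copy of a flow record with its five-tuple fields zeroed.

    All other fields (for example payload-derived features) are kept,
    so the classifier cannot rely on addresses, ports or protocol.
    """
    masked = dict(record)
    for field in ("src_ip", "dst_ip"):
        masked[field] = "0.0.0.0"
    for field in ("src_port", "dst_port", "protocol"):
        masked[field] = 0
    return masked

record = {
    "src_ip": "192.0.2.10",
    "dst_ip": "198.51.100.5",
    "src_port": 50000,
    "dst_port": 443,
    "protocol": 6,          # hypothetical numeric protocol code
    "packet_lengths": [517, 1400],
}
masked = mask_five_tuple(record)
```

The original record is left unchanged, which keeps the raw capture available for auditing while the masked copy is passed on to feature extraction.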
5. The method according to claim 2, wherein in the step 3, five elements of the ClientHello and ServerHello messages are selected as the message load features, namely: TLSVersion (protocol version), Ciphers (supported cipher suites), Extensions (extension fields), EllipticCurves (elliptic curve cipher), EllipticCurvePointFormat (elliptic curve cipher format); the combined data of the five elements is assembled into a dedicated fingerprint array X = [x1, x2, x3, x4, x5], wherein
x1: the code of the protocol version,
x2: the code of all supported cipher suites,
x3: the code of all extension fields,
x4: the code of the elliptic curve cipher type,
x5: the code of the elliptic curve cipher format.
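The claim does not spell out how each element is turned into a code, so the following is only one possible encoding, stated as an assumption: the protocol version is taken as its numeric wire value, each list-valued element is reduced to its number of entries, and every value is min-max scaled into [0, 1]. The dictionary keys, the `bounds` table and the function names are hypothetical.

```python
def minmax(value, lo, hi):
    """Scale value into [0, 1] given assumed lower and upper bounds."""
    if hi == lo:
        return 0.0
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))

def fingerprint_array(hello, bounds):
    """Build X = [x1, x2, x3, x4, x5] from parsed handshake fields.

    Assumed encoding (not specified by the claim): the version uses its
    numeric code; list-valued fields use their entry counts.
    """
    raw = [
        hello["tls_version"],
        len(hello["ciphers"]),
        len(hello["extensions"]),
        len(hello["elliptic_curves"]),
        len(hello["ec_point_formats"]),
    ]
    return [minmax(v, lo, hi) for v, (lo, hi) in zip(raw, bounds)]

# Hypothetical parsed ClientHello; 0x0303 is the wire code of TLS 1.2.
hello = {
    "tls_version": 0x0303,
    "ciphers": [0xC02F, 0xC030, 0x009C],
    "extensions": [0, 10, 11, 13],
    "elliptic_curves": [23, 24],
    "ec_point_formats": [0],
}
# Assumed (min, max) range per element, e.g. learned from training data.
bounds = [(0x0300, 0x0304), (0, 30), (0, 20), (0, 10), (0, 4)]
X = fingerprint_array(hello, bounds)
```

With these assumed bounds the example yields values of roughly 0.75, 0.1, 0.2, 0.2 and 0.25; a different encoding, such as hashing the ordered list of suites, would fit the claim wording equally well.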
6. The encrypted malicious traffic detection method based on the two-dimensional features as claimed in claim 2, wherein in the step 4, packet length, packet inter-arrival time, and byte distribution data are extracted as the flow fingerprint features.
7. The encrypted malicious traffic detection method based on the two-dimensional features of claim 6, wherein for the packet length, the lengths of all packets in each session are discretized into bins of equal size N bytes: packets with a length in [0, N) bytes are placed into the 1st bin, packets with a length in [N, 2N) bytes are placed into the 2nd bin, and so on; a matrix A is then constructed, wherein each element A[i, j] is the number of transitions from a packet in the i-th bin to a packet in the j-th bin; finally, each row of A is normalized, wherein each row is a Markov chain, and the result is used as the packet length feature of the session.
8. The encrypted malicious traffic detection method based on the two-dimensional features of claim 6, wherein for the packet inter-arrival time, the inter-arrival times of all packets in each session are discretized into bins of equal size T milliseconds: packets with an inter-arrival time in [0, T) milliseconds are placed into the 1st bin, packets with an inter-arrival time in [T, 2T) milliseconds are placed into the 2nd bin, and so on; a matrix B is then constructed, wherein each element B[i, j] is the number of transitions from a packet in the i-th bin to a packet in the j-th bin; finally, each row of B is normalized, wherein each row is a Markov chain, and the result is used as the packet inter-arrival time feature of the session.
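The binning, transition counting and row normalization described for the packet length and the packet inter-arrival time can be sketched with one shared helper. The function name `transition_matrix`, the fixed number of bins, the clamping of oversized values into the last bin, and the sample values are assumptions for illustration; note also that the code uses 0-based bin indices, whereas the claims count bins from 1.

```python
def transition_matrix(values, bin_width, n_bins):
    """Row-normalized matrix of transitions between consecutive binned values."""
    # Map each value to a bin index; values beyond the last bin are
    # clamped into it (an assumption, the claims give no upper bound).
    bins = [min(int(v // bin_width), n_bins - 1) for v in values]
    counts = [[0] * n_bins for _ in range(n_bins)]
    for i, j in zip(bins, bins[1:]):
        counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observed transitions stay all zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Packet length feature: N = 500 bytes, 3 bins (hypothetical values).
packet_lengths = [100, 1400, 1400, 60, 1400]
A = transition_matrix(packet_lengths, bin_width=500, n_bins=3)

# Inter-arrival time feature: T = 50 ms, 10 bins (hypothetical values).
timestamps_ms = [0, 10, 25, 300, 310]
gaps = [later - earlier for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]
B = transition_matrix(gaps, bin_width=50, n_bins=10)
```

For the sample lengths, the bin sequence is 0, 2, 2, 0, 2, so row 0 of `A` puts all its weight on bin 2 and row 2 splits evenly between bins 0 and 2. Flattening `A` and `B` row by row gives fixed-length vectors that can be appended to the feature row of the session.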
9. The method of claim 6, wherein the byte distribution is an array of length 256 that counts the occurrences of each byte value in the payloads of the packets in the flow; each count is divided by the total number of bytes in the packet payloads to obtain the probability of occurrence of each byte value.
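The byte distribution can be sketched in a few lines. The function name `byte_distribution` and the sample payloads are hypothetical; returning an all-zero array for a flow without payload bytes is an assumption, since the claim does not cover that case.

```python
def byte_distribution(payloads):
    """Return a 256-entry list with the relative frequency of each byte value."""
    counts = [0] * 256
    for payload in payloads:
        for byte in payload:      # iterating over bytes yields integers 0..255
            counts[byte] += 1
    total = sum(counts)
    # An empty flow yields all zeros instead of dividing by zero.
    return [c / total if total else 0.0 for c in counts]

# Two hypothetical packet payloads belonging to one flow.
payloads = [b"\x16\x03\x03", b"\x17\x03\x03\x00"]
dist = byte_distribution(payloads)
```

In this example the flow contains 7 payload bytes in total, 4 of which have the value 0x03, so `dist[0x03]` equals 4/7 and the entries of `dist` sum to 1.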
10. An encrypted malicious traffic detection system based on two-dimensional features, comprising:
an SSL/TLS traffic extraction module for capturing traffic data from a network;
a bidirectional flow module for performing step 1 as recited in any one of claims 2-9;
a quintuple feature fuzzification processing module for performing step 2 as recited in any one of claims 2-9;
a message load feature extraction module for performing step 3 as recited in any one of claims 2-9;
a flow fingerprint feature extraction module for performing step 4 as recited in any one of claims 2-9;
a logistic regression analysis module for performing steps 5-6 as recited in any one of claims 2-9;
and a classification result output module for outputting a classification result of the traffic.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010066830.XA CN111245860A (en) 2020-01-20 2020-01-20 Encrypted malicious flow detection method and system based on two-dimensional characteristics

Publications (1)

Publication Number Publication Date
CN111245860A true CN111245860A (en) 2020-06-05

Family

ID=70866475

Country Status (1)

Country Link
CN (1) CN111245860A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109104441A (en) * 2018-10-24 2018-12-28 上海交通大学 A kind of detection system and method for the encryption malicious traffic stream based on deep learning
CN110011931A (en) * 2019-01-25 2019-07-12 中国科学院信息工程研究所 A kind of encryption traffic classes detection method and system
CN110149245A (en) * 2019-05-24 2019-08-20 广州大学 The compressed sensing based high-speed network flow method of sampling and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
S. ZANDER, T. NGUYEN AND G. ARMITAGE: "Automated traffic classification and application identification using machine learning", 《THE IEEE CONFERENCE ON LOCAL COMPUTER NETWORKS 30TH ANNIVERSARY (LCN'05), SYDNEY, NSW, 2005》 *
T. T. T. NGUYEN AND G. ARMITAGE: "A survey of techniques for internet traffic classification using machine learning", 《IEEE COMMUNICATIONS SURVEYS & TUTORIALS》 *
胡斌, 周志洪, 姚立红, 李建华: "Malicious traffic detection combining message load and flow fingerprint features" (结合报文负载与流指纹特征的恶意流量检测), 《计算机工程》 (Computer Engineering) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112104570A (en) * 2020-09-11 2020-12-18 南方电网科学研究院有限责任公司 Traffic classification method and device, computer equipment and storage medium
CN112104570B (en) * 2020-09-11 2023-09-05 南方电网科学研究院有限责任公司 Traffic classification method, traffic classification device, computer equipment and storage medium
CN112800142A (en) * 2020-12-15 2021-05-14 赛尔网络有限公司 MR (magnetic resonance) job processing method and device, electronic equipment and storage medium
CN112800142B (en) * 2020-12-15 2023-08-08 赛尔网络有限公司 MR job processing method, device, electronic equipment and storage medium
CN112738039A (en) * 2020-12-18 2021-04-30 北京中科研究院 Malicious encrypted flow detection method, system and equipment based on flow behavior
CN115086242A (en) * 2021-03-12 2022-09-20 天翼云科技有限公司 Encrypted data packet identification method and device and electronic equipment
CN113472751A (en) * 2021-06-04 2021-10-01 中国科学院信息工程研究所 Encrypted flow identification method and device based on data packet header
CN113726615B (en) * 2021-11-02 2022-02-15 北京广通优云科技股份有限公司 Encryption service stability judgment method based on network behaviors in IT intelligent operation and maintenance system
CN113726615A (en) * 2021-11-02 2021-11-30 北京广通优云科技股份有限公司 Encryption service stability judgment method based on network behaviors in IT intelligent operation and maintenance system
CN114143037A (en) * 2021-11-05 2022-03-04 山东省计算中心(国家超级计算济南中心) Malicious encrypted channel detection method based on process behavior analysis
CN114124551A (en) * 2021-11-29 2022-03-01 中国电子科技集团公司第三十研究所 Malicious encrypted flow identification method based on multi-granularity feature extraction under WireGuard protocol
CN114124551B (en) * 2021-11-29 2023-05-23 中国电子科技集团公司第三十研究所 Malicious encryption traffic identification method based on multi-granularity feature extraction under WireGuard protocol
CN115314240A (en) * 2022-06-22 2022-11-08 国家计算机网络与信息安全管理中心 Data processing method for encryption abnormal flow identification
CN115051874A (en) * 2022-08-01 2022-09-13 杭州默安科技有限公司 Multi-feature CS malicious encrypted traffic detection method and system
CN115865534A (en) * 2023-02-27 2023-03-28 深圳大学 Traffic detection method, system, device and medium based on malicious encryption

Similar Documents

Publication Publication Date Title
CN111245860A (en) Encrypted malicious flow detection method and system based on two-dimensional characteristics
US20210273949A1 (en) Treating Data Flows Differently Based on Level of Interest
TW476207B (en) Information security analysis system
US9813310B1 (en) System and method for discriminating nature of communication traffic transmitted through network based on envelope characteristics
CN107733851A (en) DNS tunnels Trojan detecting method based on communication behavior analysis
Miller et al. Multilayer perceptron neural network for detection of encrypted VPN network traffic
CN111147394B (en) Multi-stage classification detection method for remote desktop protocol traffic behavior
CN112769633B (en) Proxy traffic detection method and device, electronic equipment and readable storage medium
Liu et al. Maldetect: A structure of encrypted malware traffic detection
Yan et al. Identifying wechat red packets and fund transfers via analyzing encrypted network traffic
CN110611640A (en) DNS protocol hidden channel detection method based on random forest
CN113743542B (en) Network asset identification method and system based on encrypted flow
WO2023173790A1 (en) Data packet-based encrypted traffic classification system
CN113676348A (en) Network channel cracking method, device, server and storage medium
CN112217763A (en) Hidden TLS communication flow detection method based on machine learning
CN113923026A (en) Encrypted malicious flow detection model based on TextCNN and construction method thereof
Sheikh et al. Procedures, criteria, and machine learning techniques for network traffic classification: a survey
Wang et al. An unknown protocol syntax analysis method based on convolutional neural network
Ding et al. Multi-granular aggregation of network flows for security analysis
CN115051874B (en) Multi-feature CS malicious encrypted traffic detection method and system
CN114172715B (en) Industrial control intrusion detection system and method based on secure multiparty calculation
CN114117429A (en) Network flow detection method and device
CN113141375A (en) Network security monitoring method and device, storage medium and server
CN116668085B (en) Flow multi-process intrusion detection method and system based on lightGBM
Zheng et al. Identification of Malicious Encrypted Traffic Through Feature Fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200605